SAS Log Analysis and Debugging in Clinical Trials
Summary
Root cause analysis when debugging SAS logs is manual, time-consuming, and error-prone - even more so when SAS macros are involved. Yet it is a crucial and required QC process for clinical trial analyses. In this article, we describe how Verisian’s Validator drastically improves the speed and quality of this process by extracting relevant log messages and displaying them alongside the associated code and resolved macros. It also provides full traceability across your entire study, allowing for fast and accurate root cause analysis across files and the full dataset lineage. You can try it out for free here.
Debugging Study SAS Logs - Syntax, Data, and Analysis
There are several levels of debugging SAS logs. The first stage often involves looking for syntax errors or other execution errors that indicate obvious issues with a program. But as soon as we arrive at Warnings and especially Notes, things become more difficult. Warnings and Notes are often concerned with the data and its transformations rather than the SAS language itself. One can write valid SAS code and still get an erroneous result, for example when a data set wasn’t filtered as expected and contains records that should have been removed.
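For illustration, here is a minimal sketch of valid code producing a wrong result (the dataset names ADSL and VS_LAST and the intent are assumptions, not from a real study):

data final;
  merge adsl(in=ina) vs_last;
  by usubjid;
  /* Intended: keep only subjects present in ADSL.             */
  /* The subsetting line "if ina;" was forgotten, so           */
  /* unmatched VS_LAST records remain - no ERROR, no WARNING.  */
run;

The step runs cleanly, yet the output contains records that should have been removed.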
Data- and Analysis-level Debugging
Consider this example log message:
NOTE: MERGE statement has more than one data set with repeats of BY values.
This Note indicates that more than one of the datasets being merged contains repeated values of the BY variable. While this may be the desired outcome, let’s assume we expected the BY variable to be a unique key in each dataset. The code executed without error, but there is clearly a problem at the analysis level.
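A minimal example that reproduces this Note (dataset and variable names are illustrative):

data visits;
  input subjid $ visit $;
  datalines;
101 V1
101 V2
;
run;

data labs;
  input subjid $ lbtest $;
  datalines;
101 ALT
101 AST
;
run;

data merged;
  /* SUBJID 101 repeats in both datasets: a many-to-many merge */
  merge visits labs;
  by subjid;
run;

Because SUBJID repeats in both datasets, SAS matches observations one-to-one within the BY group instead of forming all combinations - rarely what was intended, hence the Note.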
There are many more examples, such as the message that a resulting dataset contains 0 observations. In some cases, this might be the desired result. It could, however, also indicate that a merge between two datasets did not proceed as expected and simply “failed” at the data level.
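A hedged sketch of how such a merge can fail silently (the names and the key mismatch are assumptions): suppose USUBJID is stored as “101-001” in ADSL but as “101001” in LB.

data matched;
  merge adsl(in=ina) lb(in=inb);
  by usubjid;
  if ina and inb;  /* the keys never match, so no record survives */
run;

The log then reports, roughly:

NOTE: The data set WORK.MATCHED has 0 observations and 14 variables.

No Error or Warning is raised; only this Note hints that something went wrong.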
In other words, a Note can be benign and safely ignored, or it may require in-depth debugging.
To tell the difference, the programmer needs to understand the context of the current operation, which might be a single operation among hundreds of interconnected computations across many datasets, files, and programmers. This is not trivial. Using the context to find the problem means running a root cause analysis.
Root Cause Analysis - the Problem of Traceability
How do we understand context?
Staying with the example above, we first look at the MERGE statement that caused the log message, try to understand the step it was used in, and then work backwards through the previous steps of the analysis until we find the root of the problem. In simple cases, this might be quick - in reality, it can span hundreds of lines of code or more, across several files and programmers. The process is manual, time-consuming, and extremely error-prone.
Establishing this understanding by manual code tracing is made more difficult by the use of macros. Macros generate SAS code that depends on the execution context. Non-trivial macros that are nested and have many parameters are difficult to understand and debug, because the generated code and resulting problems are distributed across MPRINT lines in the SAS log.
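For example, enabling the MPRINT system option makes the generated code visible in the log. The macro below is a hypothetical sketch:

options mprint;

%macro filter_pop(inds=, outds=, flag=);
  data &outds;
    set &inds;
    where &flag = 'Y';
  run;
%mend filter_pop;

%filter_pop(inds=adsl, outds=adsl_saf, flag=saffl);

With MPRINT on, the log shows the resolved statements prefixed with the macro name, roughly:

MPRINT(FILTER_POP):   data adsl_saf;
MPRINT(FILTER_POP):   set adsl;
MPRINT(FILTER_POP):   where saffl = 'Y';
MPRINT(FILTER_POP):   run;

For nested macros with many parameters, these MPRINT lines interleave with output from other steps, scattering the picture across the log. Macros, however, are not the only source of confusion.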
Root Cause Analysis - the Problem of Clarity
SAS logs show code, macro invocations, potentially resolved macro code (given that the MPRINT option is enabled), and log messages interspersed with line breaks, custom log messages, and other diagnostic information. Looking at logs alone, programmers drown in partially redundant and often irrelevant information. Statistical analysis is difficult enough - trying to discern non-trivial implementation details from a SAS log, or worse, a combination of the SAS log and SAS code, is time-consuming, error-prone, and exhausting.
Given a log message of interest, programmers:
- have to find the problematic code piece in the code files
- read the log for resolved macro code, as macro invocations are “black boxes” in the original code
- run the root cause analysis across log and code files, while keeping track of the traceability manually
Validator: Full Traceability, Full Clarity
We built the Verisian Validator, which requires only the study’s SAS log files, to vastly improve the speed and quality of root-cause analyses and thereby overall study analysis quality and integrity.
First, the Validator builds an easy-to-understand, navigable interface that puts full traceability throughout an entire study into the programmer’s hands. In addition, it extracts the log messages that matter, setting them into context with the code that caused them, and resolving macro invocations wherever MPRINT is enabled (i.e. it extracts and displays macro-generated code at the point it is executed, within the “normal” code).
Full Clarity
The Validator extracts all Errors, Warnings, and important Note log messages and keeps track of the code that caused them. Macro invocations are resolved and become part of the full traceability graph (see details below). It displays the log messages, the original log, and the extracted code together, so that all required information is available at a glance (note: all screenshots are from our beta release).
Full Traceability
The logs, the contained code, and all datasets, variables, formats, codelists, steps, and statements are turned into a queryable graph that establishes full traceability throughout your entire study, spanning all files. Below is a visualization of datasets and their relationships in a demo study, showcasing how data flows and is transformed from raw input data to final outputs.
Instead of browsing through files and hundreds of lines of code, the dataset traceability graph and corresponding code allow the programmer to explore the full provenance of any dataset in a study. Once the root cause has been identified, the programmer knows exactly in which file(s) and at which line(s) the solution(s) should be introduced.
Next Steps
Managing the Review and Resolution Process
Reviewing logs for a single file or even a small project can be done by a single person, but for full studies with many programmers contributing, it is crucial to have a central point for quality control and a repeatable process to ensure important log messages are considered.
The Validator will soon contain a process for the systematic review of log messages. During log review, relevant messages can be marked as “resolved” (meaning no further work is required) or “requiring resolution”, creating a list of log messages the programming team needs to address to ensure the correctness of results.
Upcoming Validator Features
The Validator will
- scan for code quality issues concerning robustness (ability to handle unexpected data inputs) and transparency (ease of understanding, maintainability, and extension)
- make use of the full traceability to group log messages into “causal chains”: log messages that share a common cause are grouped, so fixing one problem resolves all grouped messages at once
- use a graph-powered inference and AI engine to propose fixes for problematic code
To try the Validator, explore the free demo or upload your own logs in our free cloud trial (coming soon). You can find a set of tutorials for the Verisian Validator here (coming soon). You can also watch this webinar.