Chad Cook, PT, PhD, MBA, OCS, FAAOMPT
Joshua Cleland, PT, DPT, PhD, OCS, FAAOMPT
Peter Huijbregts, PT, DPT, OCS, FAAOMPT, FCAMT
Abstract: Clinical special tests are a mainstay of orthopaedic diagnosis. Within the context of the evidence-based practice paradigm, data on the diagnostic accuracy of these special tests are frequently used in the decision-making process when determining the diagnosis, prognosis, and selection of appropriate intervention strategies. However, the reported diagnostic utility of these tests is significantly affected by the methodology of the diagnostic accuracy studies in which it was established. Methodological shortcomings can influence the outcome of such studies, and this in turn will affect the clinician’s interpretation of diagnostic findings. The methodological issues associated with studies investigating the diagnostic utility of clinical tests have mandated the development of the STARD (Standards for Reporting of Diagnostic Accuracy) and QUADAS (Quality Assessment of Diagnostic Accuracy Studies) criterion lists. The purpose of this paper is to outline the STARD and QUADAS criterion lists and to discuss how these methodological quality assessment tools can assist the clinician in ascertaining clinically useful information from a diagnostic accuracy study.
Key Words: Special Tests, Diagnostic Accuracy, Methodological Quality Assessment Tools, STARD, QUADAS
The clinician’s armamentarium for screening, diagnosis, and prognosis of selected conditions has expanded with the creation of numerous clinical special tests. Diagnostic tests are decidedly dynamic, as new tests are developed concurrently with improvements in technology1. These clinical tests remain extremely popular among orthopaedic diagnosticians, and information gained from these tests is frequently considered during decision-making with regard to patient diagnosis and prognosis and the selection of appropriate interventions. Historically, textbooks describing these tests have ignored mention of the tests’ true ability to identify the presence of a particular disorder as based on studies into diagnostic utility of these tests; rather, they have concentrated solely on test description and scoring. Relying solely on a pathophysiologic and/or pathobiomechanical rationale for the interpretation of clinical tests without considering the research data on diagnostic accuracy of said tests can potentially result in the selection of tests that provide little worthwhile diagnostic or prognostic information. In addition, it can lead clinicians to make incorrect treatment decisions1. With the number of clinical special tests and measures continuing to multiply, it is essential to thoroughly evaluate a test’s diagnostic utility prior to incorporating it into clinical practice2,3.
Clinical special tests exhibit the measurable diagnostic properties of sensitivity and specificity. The sensitivity of a test is its ability to identify a positive finding when the targeted diagnosis is actually present (i.e., true positive)4. Specificity is the discriminatory ability of a test to identify that the disease or condition is absent when it truly is absent (i.e., true negative)4. Sensitivity and specificity values can then be used to calculate positive and negative likelihood ratios (LR). Although high sensitivity and high specificity are useful for confirming the presence or absence of a specific disorder, the general consensus seems to be that likelihood ratios are the optimal statistics for determining a shift in the pretest probability that a patient has a specific disorder. Table 1 provides information on statistics relevant to diagnostic utility.
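As a worked illustration, the definitions above reduce to simple arithmetic on a 2×2 contingency table. The counts below are hypothetical and chosen only to make the calculations transparent:

```python
# Hypothetical 2x2 table for a clinical special test compared
# against a reference standard (illustrative counts only).
tp, fp = 45, 10   # test positive: disorder present / absent
fn, tn = 5, 90    # test negative: disorder present / absent

# Sensitivity: proportion of those WITH the disorder who test positive
sensitivity = tp / (tp + fn)
# Specificity: proportion of those WITHOUT the disorder who test negative
specificity = tn / (tn + fp)

# Likelihood ratios are derived directly from sensitivity and specificity
lr_positive = sensitivity / (1 - specificity)
lr_negative = (1 - sensitivity) / specificity

print(f"Sensitivity: {sensitivity:.2f}")  # 0.90
print(f"Specificity: {specificity:.2f}")  # 0.90
print(f"LR+: {lr_positive:.1f}")          # 9.0
print(f"LR-: {lr_negative:.2f}")          # 0.11
```

With these illustrative counts, a positive result multiplies the pretest odds of the disorder ninefold, while a negative result reduces them to roughly one ninth.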
Clinical special tests that demonstrate strong sensitivity are considered clinically useful screening tools5 in that they can be used for ruling out selected diagnoses or impairments6. When a test demonstrates high sensitivity, the likelihood of a false negative finding (i.e., incorrectly identifying the patient as not having the disorder when she actually does) is low, since the test has a substantive ability to identify those who truly have the disease or impairment; a negative result therefore helps to “rule out” the condition. Conversely, tests that demonstrate high specificity are appropriate for “ruling in” a finding, indicating that a positive value is more telling than a negative value. The likelihood of a false positive is low because the test demonstrates the ability to correctly identify those who truly do not have the disease or impairment. This ability of highly sensitive and highly specific tests to rule out or rule in a condition, respectively, is captured in the mnemonics below:
• SnNOUT: With highly Sensitive tests, a Negative result will rule a disorder OUT
• SpPIN: With highly Specific tests, a Positive result will rule a disorder IN
Likelihood ratios can be either positive or negative. A positive likelihood ratio (LR+) indicates a shift in probability favoring the existence of a disorder if the test is found to be positive. A value of 1 indicates equivocal diagnostic power; higher values indicate greater strength. Conversely, a negative likelihood ratio (LR–) indicates a shift in probability favoring the absence of a disorder if the test is found to be negative. The lower the value, the better the ability of the test to determine the post-test odds that the disease is actually absent in the event the finding is negative. A value closer to 1 indicates that a negative test is equally likely to occur in individuals with or without the disease. Table 2 presents the shifts in the probability that a patient does or does not have a particular disorder associated with specific ranges of positive and negative likelihood ratios7.
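The shift from pretest to posttest probability described above runs through odds: pretest odds equal p/(1−p), posttest odds equal pretest odds multiplied by the likelihood ratio. A minimal sketch, with illustrative numbers:

```python
def posttest_probability(pretest_prob, likelihood_ratio):
    """Convert a pretest probability into a posttest probability.

    Uses the standard odds form: posttest odds = pretest odds * LR,
    then converts the resulting odds back into a probability.
    """
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)

# Illustrative values: 40% pretest probability of the disorder.
# A positive test with LR+ of 9 raises the probability substantially:
print(round(posttest_probability(0.40, 9.0), 2))   # 0.86
# A negative test with LR- of 0.11 lowers it substantially:
print(round(posttest_probability(0.40, 0.11), 2))  # 0.07
```

The same function works for any likelihood ratio; values near 1 leave the pretest probability essentially unchanged, which is why a ratio of 1 is described above as equivocal.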
With the intent of providing a comprehensive overview of all statistical measures relevant to diagnostic utility, Table 1 also provides definitions for three additional statistics. The accuracy of a diagnostic test provides a quantitative measure of its overall value, but because it does not differentiate between the diagnostic value of positive and negative test results, its value with regard to diagnostic decisions is minimal. At first sight, positive and negative predictive values seem to have greater diagnostic value. However, predictive values can justifiably serve as a basis for diagnostic decisions only when the prevalence in the clinical population being examined is identical to the prevalence in the study population from which they were derived, so their usefulness is again limited.
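The prevalence dependence of predictive values can be made concrete with a short calculation. The sensitivity, specificity, and prevalence figures below are hypothetical, chosen to show how the same test yields very different predictive values in different populations:

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV for a test applied in a population with a given prevalence."""
    # Expected proportions in each cell of the 2x2 table
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    ppv = tp / (tp + fp)   # probability of disorder given a positive test
    npv = tn / (tn + fn)   # probability of no disorder given a negative test
    return ppv, npv

# The same test (Sn 0.90, Sp 0.90) at two different prevalences
for prev in (0.50, 0.05):
    ppv, npv = predictive_values(0.90, 0.90, prev)
    print(f"prevalence {prev:.0%}: PPV {ppv:.2f}, NPV {npv:.2f}")
```

Moving from a 50% to a 5% prevalence drops the PPV from 0.90 to roughly 0.32 even though the test itself is unchanged, which is exactly why predictive values derived in one study population cannot be transferred to a clinical population with a different prevalence.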
Many orthopaedic clinical tests are products of traditional examination methods and principles; i.e., the tests were based solely on a pathophysiologic and/or pathobiomechanical rationale. For example, Spurling and Scoville introduced the Spurling’s sign in 1944 as a diagnostic test for cervical radiculopathy8. Over 125 different articles advocating the merit of this test as a diagnostic tool have since cited this 1944 study. In the original article, Spurling and Scoville8 reported a sensitivity of 100% for identifying patients with cervical radiculopathy. Consequently, Spurling’s maneuver has been frequently used as a tool for screening for or, in some cases, diagnosing cervical radiculopathy and cervical herniated disks9-13. Table 3 outlines the findings of a number of studies that have investigated the diagnostic utility of the Spurling’s test. The findings in this table can be used to illustrate an important point: Despite the claims by Spurling and Scoville of perfect sensitivity for the Spurling’s test with regard to identifying the presence of cervical radiculopathy, subsequent studies that have investigated the diagnostic value of Spurling’s maneuver have found dramatically different results from those initially reported. For example, Uchihara et al9 reported that the Spurling test exhibited a sensitivity of 11% and a specificity of 100%, while Tong et al14 reported a sensitivity of 30% and a specificity of 93%. Additional researchers11-12 have found sensitivities similar to those reported in these two studies9,14; however, none have reported values near 100%. Since the numbers among studies are dramatically different, clinicians are left with the following question: Is the test more appropriately used as a screening tool as advocated by Spurling and Scoville8 or as a measure of fair to moderate diagnostic utility as suggested by a number of other authors?
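One way to interpret these conflicting reports is to convert the published sensitivity and specificity values into likelihood ratios. The sketch below uses only the figures quoted above from Uchihara et al9 and Tong et al14:

```python
def likelihood_ratios(sensitivity, specificity):
    """LR+ and LR- from sensitivity and specificity.

    When specificity is 1.0 the positive likelihood ratio is
    undefined (division by zero), so infinity is returned instead.
    """
    lr_pos = (sensitivity / (1 - specificity)
              if specificity < 1 else float("inf"))
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg

# Sensitivity/specificity values as quoted in the text
for study, sn, sp in [("Uchihara et al", 0.11, 1.00),
                      ("Tong et al", 0.30, 0.93)]:
    lr_pos, lr_neg = likelihood_ratios(sn, sp)
    print(f"{study}: LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

The negative likelihood ratios (0.89 and about 0.75) sit close to 1, meaning a negative Spurling’s test shifts the probability of cervical radiculopathy very little, which argues against its use as a screening (rule-out) tool despite the original 100% sensitivity claim.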
The answer lies in the methodological rigor in the study design and the applicability of these findings to the diagnostic environment of the practicing clinician. Methodological issues can influence the outcome of diagnostic utility studies, and this in turn should affect the clinician’s interpretation of diagnostic findings. The methodological issues associated with studies investigating the diagnostic utility of clinical tests have mandated the development of criterion lists to systematically determine the methodological quality of diagnostic utility studies, i.e., the STARD (Standards for Reporting of Diagnostic Accuracy) and QUADAS (Quality Assessment of Diagnostic Accuracy Studies) criteria. The purpose of this paper is to outline the STARD and QUADAS criteria and to discuss how these criteria can assist the clinician in ascertaining clinically useful information from a diagnostic accuracy study.