Date: August 18th, 2020
Guest Skeptic: Dr. Corey Heitz is an emergency physician in Roanoke, Virginia. He is also the CME editor for Academic Emergency Medicine.
Reference: Carpenter et al. Diagnosing COVID-19 in the Emergency Department: A Scoping Review of Clinical Exam, Labs, Imaging Accuracy and Biases. AEM August 2020
Case: You are working in the emergency department during the COVID-19 outbreak, and you see a patient with oxygen saturations of 75% on room air, a fever, and a cough. Upon review of systems, you learn that she lost her sense of taste about two days ago. Your hospital performs COVID reverse transcriptase polymerase chain reaction (rt-PCR) nasal swabs on suspected patients, so you order this test and await the results.
Background: In early 2020, a pandemic broke out with origins thought to be in the Wuhan region of China. A novel coronavirus, SARS-CoV-2, which causes the disease commonly called COVID-19, rapidly spread around the world, overwhelming hospitals and medical systems and causing significant morbidity and mortality.
The speed with which the outbreak occurred made identification of cases difficult, as the disease exhibited a variety of symptoms and testing lagged behind the spread. The US Food and Drug Administration (FDA) allowed for emergency development and use of rt-PCR assays, and dozens of companies released assay kits.
I consciously have tried to avoid contributing to the COVID-19 information overload. However, I did do a CAEP Town Hall on therapeutics (SGEM Xtra: Be Skeptical) with Dr. Sean Moore and a friendly debate on mandatory universal masking in public with Dr. Joe Vipond (SGEM Xtra: Masks4All).
This review discusses the diagnostic accuracy of rt-PCR for COVID-19, as well as signs, symptoms, imaging, and other laboratory tests.
Clinical Question: What is the diagnostic accuracy of history, clinical examination, routine labs, rt-PCR, immunology tests and imaging tests for the emergency department diagnosis of COVID-19?
- Population: Original research studies describing the frequency of history, physical findings, or diagnostic accuracy of history/physical findings, lab test, or imaging tests for COVID-19
- Intervention: None
- Comparison: None
- Outcome: Diagnostic accuracy (sensitivity, specificity, and likelihood ratios)
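For listeners who want a refresher on the outcome measures, here is a minimal Python sketch of how sensitivity, specificity and likelihood ratios fall out of a 2×2 table. The class name `TwoByTwo` and the counts are hypothetical illustrations, not data from the review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts of index test results against the criterion standard."""
    tp: int  # index test positive, criterion standard positive
    fp: int  # index test positive, criterion standard negative
    fn: int  # index test negative, criterion standard positive
    tn: int  # index test negative, criterion standard negative

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def lr_positive(self) -> float:
        # How much a positive result raises the odds of disease.
        return self.sensitivity() / (1 - self.specificity())

    def lr_negative(self) -> float:
        # How much a negative result lowers the odds of disease.
        return (1 - self.sensitivity()) / self.specificity()


# Hypothetical counts chosen only to make the arithmetic easy to follow.
table = TwoByTwo(tp=70, fp=10, fn=30, tn=90)
print(f"Sensitivity {table.sensitivity():.2f}, specificity {table.specificity():.2f}")
print(f"LR+ {table.lr_positive():.1f}, LR- {table.lr_negative():.2f}")
```

The sketch does not guard against empty cells (for example, a specificity of 100% makes LR+ undefined), which a real analysis would need to handle.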
This is an SGEMHOP episode, which means we have the lead author on the show. Dr. Chris Carpenter is Professor of Emergency Medicine at Washington University in St. Louis and a member of their Emergency Medicine Research Core. He is a member of the SAEM Board of Directors and the former Chair of the SAEM EBM Interest Group and ACEP Geriatric Section. He is Deputy Editor-in-Chief of Academic Emergency Medicine where he is leading the development of the “Guidelines for Reasonable and Appropriate Emergency Care” (GRACE) project. He is also Associate Editor of Annals of Internal Medicine’s ACP Journal Club and the Journal of the American Geriatrics Society, and he serves on the American College of Emergency Physicians’ (ACEP) Clinical Policy Committee. Dr. Carpenter also wrote the book on diagnostic testing and clinical decision rules.
Authors’ Conclusions: “With the exception of fever and disorders of smell/taste, history and physical exam findings are unhelpful to distinguish COVID-19 from other infectious conditions that mimic SARS-CoV-2 like influenza. Routine labs are also non-diagnostic, although lymphopenia is a common finding and other abnormalities may predict severe disease. Although rRT-PCR is the current criterion standard, more inclusive consensus-based criteria will likely emerge because of the high false-negative rate of polymerase chain reaction tests. The role of serology and CT in ED assessments remains undefined.”
Quality Checklist for Systematic Review Diagnostic Studies:
- The diagnostic question is clinically relevant with an established criterion standard. Yes/No
- The search for studies was detailed and exhaustive. No
- The methodological quality of primary studies was assessed for common forms of diagnostic research bias. Yes
- The assessment of studies was reproducible. Yes
- There was low heterogeneity for estimates of sensitivity or specificity. No
- The summary diagnostic accuracy is sufficiently precise to improve upon existing clinical decision-making models. No
Key Results: The authors screened 1,907 citations and 87 were included in the review. None adhered to the Standards for Reporting of Diagnostic Accuracy (STARD) or the updated reporting framework for history and physical examination. Rt-PCR was used as the criterion standard for many of the studies, but none explored the possibility of false negatives.
“PRISMA provides a reproducible reporting framework for systematic review and meta-analysis authors. Multiple PRISMA extensions exist (acupuncture, harms, health equity, network meta-analysis) and in 2018 PRISMA published “scoping review” reporting methods. A scoping review differs from a systematic review in that formal quality assessment of individual diagnostic studies with QUADAS-2 is not performed. PRISMA-ScR still requires a reproducible search strategy and synthesis of research findings. We selected a scoping review rather than a systematic review because we had limited time to find and synthesize the studies amidst our own institution’s COVID-19 chaos, yet we wanted to draw a line in the sand for diagnostic accuracy quality reporting because we were seeing the same research biases occurring repeatedly.”
2) Search: Why did you decide to exclude non-English language studies? Would there not be a benefit to the experience out of other countries (especially China), even if not published in English-language journals?
“This was simply for expediency because we lacked time to find/fund a translator. You will see from the articles that the majority of the studies were from China. This was because it was early May and there was little experience or research published from Europe or US at that time. As described in Figure 2, we did not exclude any studies for the purpose of language. This probably reflects a bias of our search engines (PubMed and EMBASE) for Asian language journals, as well as the fact that English is increasingly the universal language for scientific reporting.”
3) STARD: Can you tell us more about the Standards for Reporting of Diagnostic Accuracy (STARD) guidelines? None of the included studies adhered to the STARD guidelines. Why are these guidelines so important to follow?
“Over two decades ago, journal editors and publishers convened to create mutually agreeable reporting standards that would transcend specialities beginning with the CONSORT criteria for randomized controlled trials. These reporting standards continue to multiply (nearly 400 now!) and are warehoused at the EQUATOR Network. Like PRISMA for systematic reviews, STARD is the EQUATOR Network reporting standard for diagnostic studies. Unfortunately, as demonstrated in our COVID-19 scoping review, uptake of these reporting standards has been slow in emergency medicine. In 2017, Gallo et al reported on behalf of the Best Evidence in Emergency Medicine (BEEM) team that ~80% of a randomly selected portion of diagnostic studies from eight EM journals report about half of STARD criteria (Gallo et al 2017). Some elements of STARD that were commonly omitted included reporting the time interval between the index test and the criterion standard, the reproducibility of the index test, harms associated with the test, 2×2 contingency tables, and test performance variability across clinicians, labs, or test interpreters. EQUATOR Network reporting standards like STARD are imperfect, but provide a minimal basement quality standard to ensure that diagnostic investigators evaluate essential features of their research design and that journal reviewers/editors analyze those elements of the study (Carpenter and Meisel AEM 2017).”
4) Diagnostic Biases: A core paper that residents and clinicians should be familiar with is the one on various diagnostic biases (Kohn et al AEM 2013). Let’s go through some of the common diagnostic biases and how they can impact results, and specifically COVID-19 testing.
Spectrum Bias (Effect): Sensitivity depends on the spectrum of disease, while specificity depends on the spectrum of non-disease. So, you can falsely raise sensitivity if the clinical practice has lots of very sick people. Specificity can look great if you have no sick patients in the cohort (the worried well). How could spectrum bias impact COVID19 testing?
“This is difficult to ascertain using the data provided in the research reporting of the early COVID-19 era. Investigators rarely reported distribution of disease severity (% ICU admissions, APACHE-2 scores) or baseline risk profile (frailty score, comorbid illness score) in COVID-19 positive patients nor the distribution of alternative diagnoses in COVID-19 negative patients. Washington University is participating in a study that includes fifty emergency departments across the United States to derive a PERC-like rule that identifies patients at low-risk of COVID-19 when testing is delayed or unavailable. With the variability in COVID-19 prevalence compounded by fluctuating availability of criterion standard testing resources, we have noted a skew towards testing very low risk or no-risk patients, which will skew specificity upwards and leave sensitivity relatively unaffected. Future COVID-19 diagnostic investigators (whether evaluating history, physical exam, labs, imaging, or decision-aids) need to report sufficient detail to permit stratification of accuracy estimates by disease severity in order to understand the impact of spectrum bias.”
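To make the spectrum effect concrete, here is a minimal sketch assuming a test whose sensitivity differs between mild and severe disease. The function name `pooled_sensitivity`, the two severity strata, and every number are hypothetical; none come from the studies in the review.

```python
def pooled_sensitivity(strata):
    """Overall sensitivity across severity strata.

    strata: list of (number of diseased patients, sensitivity in that stratum).
    The pooled value is the case-weighted average of stratum sensitivities.
    """
    total_cases = sum(n for n, _ in strata)
    return sum(n * se for n, se in strata) / total_cases


# Assumed (hypothetical) sensitivities of the same test in mild vs severe disease.
SE_MILD, SE_SEVERE = 0.60, 0.95

# Same test, two cohorts that differ only in their mix of disease severity.
mostly_mild_cohort = [(80, SE_MILD), (20, SE_SEVERE)]
mostly_severe_cohort = [(20, SE_MILD), (80, SE_SEVERE)]

print(f"Mostly mild cohort:   {pooled_sensitivity(mostly_mild_cohort):.2f}")
print(f"Mostly severe cohort: {pooled_sensitivity(mostly_severe_cohort):.2f}")
```

The test itself never changes, yet the reported sensitivity moves with the case mix, which is why stratified reporting by disease severity matters.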
Incorporation Bias: This occurs when results of the test under study are actually used to make the final diagnosis. This makes the test appear more powerful by falsely raising the sensitivity and specificity. Incorporation bias is particularly prevalent when the index test is part of the composite group of findings that determine whether the disease was present or absent.
“In the case of COVID-19, viral cultures were not commonly evaluated (or ever reported as a comparative criterion standard in the research we synthesized). In fact, we did not find any recommendations for a more preferable criterion standard by authors, commentators, or governmental websites like the CDC – so we proposed one that includes a downstream evaluation of exposure history, symptoms at the time of testing, laboratory tests including rRT-PCR, imaging, serology, and viral cultures as an optimal criterion standard for COVID-19. Of course, our recommended criterion standard would also be at risk for incorporation bias when evaluating history and physical exam, labs, or imaging but seems to have more face validity than using PCR as the criterion standard for PCR in which case PCR can never be wrong!”
Differential Verification Bias (Double Gold Standard): This occurs when the test results influence the choice of the reference standard. So, a positive index test gets an immediate/gold standard test whereas the patients with a negative index test get clinical follow-up for disease. This can raise or lower sensitivity/specificity.
“This is likely to occur in COVID-19 when the results of one test (CT demonstrating typical viral pneumonia findings of COVID-19) prompt clinicians or researchers to obtain additional COVID-19 testing such as repeat rRT-PCR or bronchoalveolar lavage specimens for COVID-19 testing. Differential verification bias is associated with increased specificity (and to a lesser extent sensitivity) for diseases that resolve spontaneously. On the other hand, for diseases that only become detectable during follow-up (like repeat rRT-PCR or serology testing) observed specificity and sensitivity are decreased.”
Imperfect Criterion Standard (Copper Standard Bias): This is what can happen if the “gold” standard is not that good of a test. False positives and false negatives can really mess up results.
“If errors on the index and criterion standard are correlated (i.e. usually incorrect at the same time or correct at the same time), observed sensitivity and specificity are falsely increased compared with what we would observe in the real world. On the other hand, if errors on the index and criterion standard do not correlate (are independent), observed sensitivity/specificity are lower than real world settings. Since a “gold standard” for COVID-19 does not yet exist, we proposed one as a starting point (see Table 2 above).”
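The independent-errors case described above can be sketched numerically. This is an illustration under stated assumptions, not the authors' analysis: the function name `observed_accuracy` and all input values are hypothetical, and the errors of the index test and the reference are assumed to be conditionally independent given true disease status.

```python
def observed_accuracy(prev, se_index, sp_index, se_ref, sp_ref):
    """Apparent (sensitivity, specificity) of an index test when it is judged
    against an imperfect reference standard.

    Assumes index and reference errors are independent given true disease status.
    """
    # Probability that index and reference are both positive / both negative.
    both_pos = prev * se_index * se_ref + (1 - prev) * (1 - sp_index) * (1 - sp_ref)
    both_neg = prev * (1 - se_index) * (1 - se_ref) + (1 - prev) * sp_index * sp_ref
    # Probability that the reference alone is positive / negative.
    ref_pos = prev * se_ref + (1 - prev) * (1 - sp_ref)
    ref_neg = prev * (1 - se_ref) + (1 - prev) * sp_ref
    return both_pos / ref_pos, both_neg / ref_neg


# Hypothetical scenario: a 90%/90% index test judged against a reference
# that misses 30% of true cases.
obs_se, obs_sp = observed_accuracy(
    prev=0.20, se_index=0.90, sp_index=0.90, se_ref=0.70, sp_ref=0.98
)
print(f"True sensitivity 0.90 -> observed {obs_se:.2f}")
print(f"True specificity 0.90 -> observed {obs_sp:.2f}")
```

With these assumed inputs both observed values fall below the true 0.90. Modelling correlated errors would need an extra covariance term that this sketch leaves out.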
5a) False-Negatives: What are the implications of false negatives?
“Patient perspective: I don’t have COVID-19! No need for face mask or social isolation for me! Time to party like it’s 1999!”
“Hospital perspective: This individual does not have COVID-19, so we can put them in a hospital room with another patient who does not have COVID-19. Also, nurses/physicians do not need personal protective equipment with this patient.”
In Figure 3 (see below), we also demonstrated the association between baseline COVID-19 prevalence and false positive/false negative results for three antibody tests.
One approach to reduce false negative rates due to imperfect (or unavailable) rRT-PCR testing was to evaluate every patient with PCR + CT. However, CT is also an imperfect COVID-19 diagnostic test and has additional negative consequences (Raptis et al 2020). The first unwanted consequences of CT are the cost to patient/society and the medical radiation exposure to the patient. The second consequence is potential contamination of CT technicians or subsequent patients in the scanner. Recommendations to deep clean the scanner for an hour after every COVID-19 patient exist, but this delays access to the CT scanner for every patient in the ED. Consequently, the British Society of Thoracic Imaging issued guidelines for which suspected COVID-19 patients would benefit from CT evaluation (Nair et al 2020).
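The link between baseline prevalence and the number of false results mentioned above can be sketched with simple arithmetic. The function name `expected_errors` and the test characteristics (90% sensitivity, 95% specificity) are hypothetical assumptions, not figures for any of the three antibody tests in the paper.

```python
def expected_errors(prevalence, sensitivity, specificity, n_tested=1000):
    """Expected (false positives, false negatives) among n_tested people."""
    diseased = prevalence * n_tested
    healthy = n_tested - diseased
    false_neg = diseased * (1 - sensitivity)  # diseased patients the test misses
    false_pos = healthy * (1 - specificity)   # healthy patients wrongly flagged
    return false_pos, false_neg


# Hypothetical test characteristics, used only to show the trend.
SENSITIVITY, SPECIFICITY = 0.90, 0.95

for prevalence in (0.01, 0.10, 0.50):
    fp, fn = expected_errors(prevalence, SENSITIVITY, SPECIFICITY)
    print(f"Prevalence {prevalence:>4.0%}: about {fp:.1f} false positives "
          f"and {fn:.1f} false negatives per 1000 tested")
```

As prevalence rises, false negatives grow and false positives shrink, even though the test's sensitivity and specificity stay fixed.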
5b) False-Positives: What are the Implications of False-Positives?
“False positives for rRT-PCR are likely uncommon if labs follow CDC testing recommendations. On the other hand, false positives for antibody testing are largely unknown and rarely contemplated. We provided an algebraic manipulation of Bayes Theorem that provides a threshold COVID-19 prevalence at which the likelihood of a true positive is equal to a false positive:
Using this equation and the results reported from one serology study (Bendavid et al 2020), we estimate that threshold to be 0.62% prevalence, but using the results from yet another study (Whitman et al 2020), that threshold is ~10%. In other words, if regional prevalence is <10%, then a positive test is more likely to be a false positive than a true positive.
The implications of a false positive test could include unnecessary isolation of individuals (including restriction from work and lost income further increasing health disparities) and the expense of additional diagnostic testing.
We also provide clinicians with a resource to help patients and families to understand the imperfections of rRT-PCR in Figure 4. These Cates plots can be adapted as diagnostic investigators better understand the sensitivity/specificity of rRT-PCR (or antigen/serology tests).”
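The threshold described in the answer above can be sketched from its stated definition: the prevalence at which a positive result is equally likely to be a true positive or a false positive. Setting p × Se = (1 − p) × (1 − Sp) and solving for p gives p = (1 − Sp) / (Se + 1 − Sp). This is a reconstruction from that definition and may not match the authors' notation. The function names and the sensitivity and specificity values are illustrative assumptions, not figures quoted from the cited serology studies.

```python
def threshold_prevalence(sensitivity, specificity):
    """Prevalence at which true positives and false positives are equally likely.

    Derived from p * Se = (1 - p) * (1 - Sp)  =>  p = (1 - Sp) / (Se + 1 - Sp).
    """
    false_positive_rate = 1 - specificity
    return false_positive_rate / (sensitivity + false_positive_rate)


def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that a positive result is a true positive (Bayes' theorem)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


# Assumed (hypothetical) antibody test characteristics.
SENSITIVITY, SPECIFICITY = 0.80, 0.995

p_star = threshold_prevalence(SENSITIVITY, SPECIFICITY)
print(f"Threshold prevalence: {p_star:.2%}")
print(f"PPV at threshold: {positive_predictive_value(p_star, SENSITIVITY, SPECIFICITY):.2f}")
```

Below the threshold the positive predictive value drops under 50%, so a positive result is more likely false than true. Small changes in specificity move the threshold a lot, which is one reason two serology studies can give such different answers.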
What do you think the implications are for future research in the diagnostic accuracy of COVID19?
- Diagnostic investigators must adhere to STARD reporting standards or clinicians/policy-makers risk devolving into a confusing Tower of Babel with rampant miscommunication and preventable repetition of research error.
- Journal editors and reviewers should hold researchers to STARD standards by seeking additional data or clarifications when elements of diagnostic testing (such as accuracy among patient subsets, explicit 2×2 contingency table reporting, and inter-rater test reproducibility) are missing.
- Contemplate the harms of testing, including quantification of false-negatives and false-positives and the associated adverse consequences for patients, hospitals, and communities.
- Consider reporting interval likelihood ratios for continuous data.
- Beyond the elementary Cates plots we propose, develop formal shared decision-making resources for patients/families to aid meaningful discussions around the interpretation of signs/symptoms, imaging, labs, and rRT-PCR (Hess et al AEM 2015).
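For the interval likelihood ratio recommendation in the list above, here is a minimal sketch of the calculation. The function name `interval_likelihood_ratios`, the three strata of an unnamed continuous lab marker, and the counts are all hypothetical.

```python
def interval_likelihood_ratios(counts):
    """Interval likelihood ratio for each stratum of a continuous test.

    counts: dict mapping interval label -> (n with disease, n without disease).
    Interval LR = (share of diseased patients in the interval)
                / (share of non-diseased patients in the interval).
    Assumes every interval contains at least one non-diseased patient.
    """
    total_diseased = sum(d for d, _ in counts.values())
    total_healthy = sum(h for _, h in counts.values())
    return {
        label: (d / total_diseased) / (h / total_healthy)
        for label, (d, h) in counts.items()
    }


# Hypothetical counts for three intervals of a continuous lab value.
marker_counts = {
    "low": (10, 60),
    "intermediate": (30, 30),
    "high": (60, 10),
}

for label, lr in interval_likelihood_ratios(marker_counts).items():
    print(f"{label:>12}: interval LR {lr:.2f}")
```

Reporting one ratio per interval keeps the information a single cut-off would throw away: with these made-up counts, a "high" result shifts probability far more than an "intermediate" one.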
Comment on Authors’ Conclusion Compared to SGEM Conclusion: We agree with the authors’ conclusions.
SGEM Bottom Line: The limitations of diagnostic testing for COVID-19 must be understood. Current PCR tests have a fairly high false negative rate, so serial testing should be performed. There may be a role for imaging in suspected patients, but there are no pathognomonic findings for COVID-19.
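To see why a single negative PCR should not end the workup in a high-suspicion patient, here is a rough sketch of post-test probability after one and two negative results. The pretest probability, sensitivity and specificity are hypothetical assumptions, and treating the two swabs as independent is a simplification that probably overstates the benefit of repeat testing.

```python
def post_test_probability(pretest_probability, likelihood_ratio):
    """Convert pretest probability to post-test probability via odds."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_odds = pretest_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Hypothetical values: high clinical suspicion and a test that misses 30% of cases.
PRETEST = 0.80
SENSITIVITY, SPECIFICITY = 0.70, 0.99

lr_negative = (1 - SENSITIVITY) / SPECIFICITY

after_one_negative = post_test_probability(PRETEST, lr_negative)
# Assumes the second swab's result is independent of the first.
after_two_negatives = post_test_probability(after_one_negative, lr_negative)

print(f"After one negative test:  {after_one_negative:.0%}")
print(f"After two negative tests: {after_two_negatives:.0%}")
```

Under these assumptions the probability of disease is still above 50% after one negative swab and remains substantial after two, which fits the advice to keep using PPE when suspicion is high.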
Case Resolution: Your patient tests negative for the virus. Despite this, your suspicion is high, so you continue to use appropriate personal protection equipment (PPE) when entering the room.
Clinical Application: Be aware of high false negative rates for rt-PCR testing and maintain a high level of suspicion in high-risk patients.
What Do I Tell the Patient? Your initial COVID test came back negative. However, given the suspicion we have, we are going to continue to protect ourselves, staff and other patients, and are going to care for you as if you have the virus.
Keener Kontest: Last week’s winner was Claudia Martin, a respiratory therapist, long-time listener and multiple Keener Kontest winner. She knew egophony comes from the Greek word meaning “bleating of a goat”.
SGEMHOP: Now it is your turn SGEMers. What do you think of this episode on the diagnostic accuracy of history, physical examination, laboratory testing and imaging (CXR and chest CT scan) for COVID19? Tweet your comments using #SGEMHOP. What questions do you have for Chris and his team? Ask them on the SGEM blog. The best social media feedback will be published in AEM.
- Go to the Wiley Health Learning website
- Register and create a log in
- Search for Academic Emergency Medicine – “August”
- Complete the five questions and submit your answers
- Please email Corey (email@example.com) with any questions
Remember to be skeptical of anything you learn, even if you heard it on the Skeptics’ Guide to Emergency Medicine.