Abstract
BACKGROUND AND PURPOSE: The Brain Tumor Reporting and Data System (BT-RADS) is a structured radiology reporting algorithm that was introduced to provide uniformity in posttreatment primary brain tumor follow-up and reporting, but its interrater reliability (IRR) assessment has not been widely studied. Our goal is to evaluate the IRR among neuroradiologists and radiology residents in the use of BT-RADS.
MATERIALS AND METHODS: This retrospective study reviewed 103 consecutive MR studies in 98 adult patients previously diagnosed with and treated for primary brain tumor (January 2019 to February 2019). Six readers with varied experience (4 neuroradiologists and 2 radiology residents) independently evaluated each case and assigned a BT-RADS score. Readers were blinded to the original score reports and the reports from other readers. Cases in which at least 1 neuroradiologist scored differently were subjected to consensus scoring. After the study, a post hoc reference score was also assigned by 2 readers by using future imaging and clinical information previously unavailable to readers. The interrater reliabilities were assessed by using the Gwet AC2 index with ordinal weights and percent agreement.
RESULTS: Of the 98 patients evaluated (median age, 53 years; interquartile range, 41–66 years), 53% were men. The most common tumor type was astrocytoma (77%) of which 56% were grade 4 glioblastoma. Gwet index for interrater reliability among all 6 readers was 0.83 (95% CI: 0.78–0.87). The Gwet index for the neuroradiologists’ group (0.84 [95% CI: 0.79–0.89]) was not statistically different from that for the residents’ group (0.79 [95% CI: 0.72–0.86]) (χ2 = 0.85; P = .36). All 4 neuroradiologists agreed on the same BT-RADS score in 57 of the 103 studies, 3 neuroradiologists agreed in 21 of the 103 studies, and 2 neuroradiologists agreed in 21 of the 103 studies. Percent agreement between neuroradiologist blinded scores and post hoc reference scores ranged from 41%–52%.
CONCLUSIONS: A very good interrater agreement was found when tumor reports were interpreted by independent blinded readers by using BT-RADS criteria. Further study is needed to determine if this high overall agreement can translate into greater consistency in clinical care.
ABBREVIATIONS:
- BI-RADS = Breast Imaging Reporting and Data System
- BT-RADS = Brain Tumor Reporting and Data System
- IDH = isocitrate dehydrogenase
- IQR = interquartile range
- IRR = interrater reliability
- WHO = World Health Organization
The most common adult primary malignant brain tumors are gliomas.1 They are classified according to the evolving World Health Organization (WHO) Classification of Tumors of the Central Nervous System, currently in its fifth edition (year 2021), and include astrocytoma, isocitrate dehydrogenase (IDH)-mutant; oligodendroglioma, IDH-mutant and 1p/19q-codeleted; and glioblastoma, IDH–wild-type.2 Glioblastoma, IDH–wild-type, poses a major clinical challenge because of its high rate of recurrence despite optimized management strategies.3 It has a 5-year survival rate as low as 5%–7%.4–6 With high rates of recurrence, long-term monitoring of patients with brain tumors is critical for identifying disease progression and guiding clinical management. MR imaging is an essential tool for diagnosing, monitoring, and managing patients with malignant brain tumors.7 A previous study surveyed a group of neuro-oncology specialists who reported heavy reliance on MR imaging reports as part of their management of patients with brain tumor.8 However, there is considerable overlap between MR imaging findings of tumor progression and treatment-related changes,7,9 resulting in subjective and inconsistent interpretations. As a result, there is a need for improved standardization of brain tumor reporting that can serve as a guide for clinical management.
The Brain Tumor Reporting and Data System (BT-RADS) is a structured radiology reporting algorithm designed to decrease subjectivity among radiologists interpreting MR imaging examinations of treatment response in patients with brain tumors.8,10 BT-RADS was developed by a multidisciplinary team of neuroradiologists, neuro-oncologists, neurosurgeons, and neuropathologists. The scoring template consists of a numeric assessment (score 0–4) of MR imaging findings while considering the patient’s previous treatment course, including radiation completion date and use of antiangiogenic agents (bevacizumab) and steroid therapies.8 In summary, BT-RADS 0 = baseline study; BT-RADS 1a = imaging improvement; BT-RADS 1b = imaging improvement as a result of drug effect; BT-RADS 2 = unchanged imaging features; BT-RADS 3a = worsening features within 12 weeks of completing radiotherapy; BT-RADS 3b = worsening imaging features greater than 12 weeks after completing radiotherapy; BT-RADS 3c = unspecified new enhancing lesions within radiation therapy treatment area; and BT-RADS 4 = specified new lesions outside radiation therapy treatment area.8
A critical part of developing a standardized radiology reporting system is assessing interrater reliability (IRR). The Breast Imaging Reporting and Data System (BI-RADS) is a structured radiology reporting system that has been adopted as a standard component of breast cancer diagnosis and management. BI-RADS has been demonstrated by several previous studies to have acceptable to good levels of interrater reliability when applied to examinations of patients with breast cancer.11–13 Unfortunately, interrater reliability of BT-RADS in posttreatment primary brain tumor is understudied, with little information on the transparency and consistency of this scoring scale among its users.
The purpose of this study is to determine the interrater reliability of multiple neuroradiologists by using BT-RADS to score the same cohort of posttreatment primary brain tumor MR imaging examinations. In addition, the authors seek to evaluate the interrater reliability among radiology residents utilizing BT-RADS as a part of their neuroradiology training.
MATERIALS AND METHODS
Patient Selection
This was a Health Insurance Portability and Accountability Act-compliant retrospective study approved by the institutional review board of our institution with waiver of patient informed consent. Adult patients (>18 years old) with a primary parenchymal brain tumor undergoing brain MR imaging between January 1, 2019, and February 28, 2019, at a single center were included. The exclusion criteria were patients with secondary or benign brain tumors, meningiomas, and patients younger than 18 years. Each patient’s imaging date, diagnosis, tumor grade, BT-RADS score from the radiology report, surgery date, radiation therapy completion date, bevacizumab therapy commencement date (if applicable), and steroid medication use (documented as a yes if used at all) were recorded. Because the study was performed on images obtained before recent revision of the WHO criteria, tumors were classified based on the 2016 fourth edition of the WHO Classification of Tumors of the Central Nervous System.14
Image Acquisition
Patients underwent imaging with a standardized brain tumor imaging protocol without and with IV contrast agent. These studies were performed on a mix of scanners with field strength of 1.5T or 3T. Imaging sequences included the following: axial DWI 5 mm slices, sagittal 3D FLAIR 1 mm slices reformatted into axial and coronal images, axial gradient-echo T2WI 5 mm slices, axial fast spin-echo T1WI 5 mm slices, sagittal 3D T1WI precontrast MPRAGE 1 mm slices reformatted into axial and coronal images, axial T2WI fast spin-echo fat-suppressed 5 mm slices, and sagittal 3D T1WI postcontrast MPRAGE 1 mm slices reformatted into axial and coronal images.
Single bolus DSC perfusion imaging was performed on each patient and relative cerebral blood volume maps were calculated by using Dynasuite (Philips Healthcare).
Reader Selection
A total of 6 readers with varied levels of experience were selected for this study: 4 board-certified radiologists with certificates of added qualification in neuroradiology (Faculty_1, 3 years; Faculty_2, 1 year; Faculty_3, 3 years; Faculty_4, 3 years), and 2 radiology residents (Resident_1, R-3; Resident_2, R-4). BT-RADS had been actively used in our department’s standard practice for reading primary brain tumor MRIs for approximately 2 years before the start of this study.
Reader Workflow
Each reader reviewed and scored all 103 MRIs by using the BT-RADS system. Readers were blinded to the MR imaging report for that study, BT-RADS score assigned by the radiologist who initially read the patient’s imaging, and scores from other readers. The patient’s diagnosis, tumor grade, surgery date, radiation therapy completion date, bevacizumab therapy commencement date (if applicable), and steroid use (if applicable) were provided to the readers. Readers were given access to previous MRIs and corresponding reports for prior MR imaging examinations used as comparison. However, readers did not view MRIs and corresponding reports that would have been done after the selected MRIs for the study. If an image revealed multiple lesions, readers were instructed to assign the patient’s BT-RADS score based on the worst or highest scoring lesion. All readers had access to reference materials with instructions for BT-RADS scoring, including a flow chart, interactive scoring tool, and reference tables available online (www.btrads.com). Readers also assigned their level of confidence ranging from 1 (least confident) to 5 (most confident), recorded how much DSC perfusion contributed to their score from 1 (very little) to 5 (great deal), and recorded the oldest prior comparison study considered in assigning a score.
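The multiple-lesion rule above can be illustrated with a minimal Python sketch. It assumes that the BT-RADS categories are ranked in the order in which they are listed in the introduction (0, 1a, 1b, 2, 3a, 3b, 3c, 4); this ranking and the names `BT_RADS_ORDER` and `study_score` are illustrative assumptions, not part of the BT-RADS reference materials.

```python
# Assumed ordinal ranking of BT-RADS categories, from least to most concerning.
BT_RADS_ORDER = ["0", "1a", "1b", "2", "3a", "3b", "3c", "4"]


def study_score(lesion_scores):
    """Return the worst (highest-ranked) per-lesion score as the study score."""
    if not lesion_scores:
        raise ValueError("at least one lesion score is required")
    # list.index raises ValueError for labels outside the assumed scale.
    return max(lesion_scores, key=BT_RADS_ORDER.index)


# Hypothetical study with three lesions scored separately.
print(study_score(["2", "3b", "1a"]))
```

Under this assumed ranking, a study with lesions scored 2, 3b, and 1a would be reported as BT-RADS 3b.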
Consensus and Post Hoc Reference Scoring
When 1 or more of the 4 neuroradiologists assigned a score different from the rest of the neuroradiologists, such MR imaging examinations were subjected to consensus scoring. When 3 of the 4 neuroradiologists assigned the same score, the majority score was assigned as the consensus score. When 2 or fewer neuroradiologists agreed on a score for a particular case, 2 of the faculty neuroradiologists with the most experience reviewed the imaging in a consensus review session and assigned a consensus score for all such cases, considering both imaging and faculty scores.
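The consensus rule described above can be sketched in Python as a simple triage step. This is only an illustration of the stated rule for 4 faculty readers; the function name `triage_case` and the returned labels are hypothetical, and the review session itself is a human step that is not modeled.

```python
from collections import Counter


def triage_case(faculty_scores):
    """Classify one study from the 4 blinded faculty scores.

    Returns a (status, score) pair:
      ("unanimous", score) - all 4 agree; no consensus scoring needed
      ("majority", score)  - 3 of 4 agree; the majority score is the consensus
      ("review", None)     - 2 or fewer agree; consensus review session needed
    """
    if len(faculty_scores) != 4:
        raise ValueError("expected exactly 4 faculty scores")
    score, votes = Counter(faculty_scores).most_common(1)[0]
    if votes == 4:
        return ("unanimous", score)
    if votes == 3:
        return ("majority", score)
    return ("review", None)


# Hypothetical score sets for three studies.
for scores in (["2"] * 4, ["3a", "3a", "3a", "4"], ["3a", "3a", "4", "4"]):
    print(scores, "->", triage_case(scores))
```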
A post hoc reference score was also assigned by the same 2 faculty neuroradiologists by using subsequent follow-up information to the cases that were subjected to consensus scoring. This score used follow-up information obtained after the original scored study, including any available follow-up imaging, clinical worsening, or subsequent pathology results from biopsy or re-resection. This post hoc reference score was considered the criterion standard score for assessment of tumor worsening.
Statistical Analysis
Interrater reliability was calculated by using percent agreement and Gwet AC2 index, applying ordinal weights.15 While the Gwet index corrects for agreement due to chance, percent agreement does not. This leads to overestimation of IRR for percent agreement. IRR was assessed separately for all 6 readers, the neuroradiologists group, and the residents group. Gwet index was interpreted by using the benchmark scale as described by Altman17: < 0.20 = poor; 0.21–0.40 = fair; 0.41–0.60 = moderate; 0.61–0.80 = good; 0.81–1.00 = very good. Gwet agreement coefficients between neuroradiologists and residents were compared by using the χ2 test of independence. The value of DSC perfusion and readers’ level of confidence in BT-RADS scoring were assessed separately by violin graphs that allow for visualization of data distribution and its probability density. This allows for distribution comparison among multiple groups on the same graphic presentation. An α level of .05 was set as level of significance. All analyses were performed in R Statistical Software (v4.2.2, R Core Team 2022), by using package irrCAC.18
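To make the agreement measures concrete, the following Python sketch computes the weighted percent agreement and the Gwet AC2 coefficient for multiple raters, using the ordinal weights named in the abstract. It is a from-scratch illustration and not the authors' analysis, which was run in R with irrCAC. The function names, the coding of BT-RADS categories as an ordered list, and the toy ratings are assumptions, and confidence intervals are not computed.

```python
from math import comb


def ordinal_weights(q):
    """Ordinal agreement weights for q >= 2 ordered categories (1 on the diagonal)."""
    if q < 2:
        raise ValueError("need at least 2 categories")
    largest = comb(q, 2)  # penalty for the two most distant categories
    return [[1 - comb(abs(k - l) + 1, 2) / largest for l in range(q)]
            for k in range(q)]


def identity_weights(q):
    """Unweighted case: only exact matches count as agreement."""
    return [[1.0 if k == l else 0.0 for l in range(q)] for k in range(q)]


def gwet_ac2(ratings, categories, weights=None):
    """Return (weighted percent agreement, AC2).

    ratings: one sequence per study with one score per rater (None = missing).
    categories: category labels in their assumed ordinal order.
    """
    q = len(categories)
    col = {c: k for k, c in enumerate(categories)}
    w = ordinal_weights(q) if weights is None else weights
    counts = []  # counts[i][k] = number of raters placing study i in category k
    for row in ratings:
        c = [0] * q
        for score in row:
            if score is not None:
                c[col[score]] += 1
        if sum(c) > 0:
            counts.append(c)
    multi = [c for c in counts if sum(c) >= 2]
    if not multi:
        raise ValueError("need at least one study rated by 2 or more raters")
    pa = 0.0  # observed weighted agreement
    for c in multi:
        r_i = sum(c)
        for k in range(q):
            r_star = sum(w[k][l] * c[l] for l in range(q))
            pa += c[k] * (r_star - 1) / (r_i * (r_i - 1))
    pa /= len(multi)
    # Chance agreement from the average category propensities.
    pi = [sum(c[k] / sum(c) for c in counts) / len(counts) for k in range(q)]
    t_w = sum(map(sum, w))
    pe = t_w / (q * (q - 1)) * sum(p * (1 - p) for p in pi)
    return pa, (pa - pe) / (1 - pe)


# Toy data (hypothetical): 4 studies scored by 3 raters.
order = ["0", "1a", "1b", "2", "3a", "3b", "3c", "4"]
toy = [("2", "2", "2"), ("3a", "3b", "3a"), ("4", "4", "3c"), ("1a", "1a", "1a")]
pa, ac2 = gwet_ac2(toy, order)
print(f"percent agreement = {pa:.2f}, AC2 = {ac2:.2f}")
```

With ordinal weights, near-miss disagreements (for example, 3a versus 3b) receive partial credit, whereas `identity_weights` reduces the coefficient to the unweighted AC1.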
RESULTS
A total of 103 consecutive MR imaging studies of primary brain tumors from 98 patients were evaluated in this study. Five of the patients had 2 MR imaging examinations each included in the study because they were scanned twice during the 2-month imaging search period for this study. The median patient age was 53 years (interquartile range [IQR], 41–66 years). Fifty-three percent of the patient population (52/98) were men, with 18% and 20% of the sample population taking steroids and bevacizumab, respectively. Seventy-seven percent of the MR studies (79/103) were astrocytomas, with grade 4 astrocytomas being the most common tumor in the study (43%, 44/103). Of the grade 4 glioblastoma, 20% (9/44) were of IDH-mutant type. Forty-five percent of all astrocytoma tumor type (36/79) were of IDH-mutant type (Table 1). Analysis of the BT-RADS scores from the original radiology reports assigned by the initial reading radiologist showed that 52% of the sampled MR images (54/103) had a score of 2 (Fig 1).
Fig 1. Bar graph showing relative frequency of BT-RADS scores (from original reports) among selected MR imaging examinations (n = 103).
Table 1: Patient and tumor characteristics
The overall Gwet AC2 value of interrater agreement among the 6 readers (4 neuroradiologists and 2 resident readers) was calculated to be 0.83 (95% CI: 0.78–0.87) with a percent agreement of 91%. Gwet AC2 value of interrater agreement among the 4 neuroradiologists was calculated to be 0.84 (95% CI: 0.79–0.89) with a percent agreement of 91%. Gwet AC2 value of interrater agreement between the 2 residents was calculated to be 0.79 (95% CI: 0.72–0.86) with a percent agreement of 90%. Although there was a slight difference between neuroradiologists and radiology residents in interrater agreement, this difference was not statistically significant (χ2 = 0.85; P = .36). For cases showing improvement or no change (BT-RADS 0–2), the Gwet index among the neuroradiologists was 0.94 (95% CI: 0.88–0.96) with a percent agreement of 94%; that for the residents was 0.83 (95% CI: 0.75–0.91) with a percent agreement of 90%. For cases with worsening imaging (BT-RADS 3a–4), the Gwet index among the neuroradiologists was 0.66 (95% CI: 0.57–0.76) with a percent agreement of 86%; that for the residents was 0.58 (95% CI: 0.34–0.76) with a percent agreement of 81%.
All 4 neuroradiologists agreed on the same BT-RADS score in 57 of the 103 studies (55%); the remaining 46 studies were subjected to consensus scoring and were given post hoc reference scores as well. The percent agreement between consensus and post hoc reference scoring was 74%. The percent agreement between neuroradiologist blinded scores and consensus scores ranged from 46%–65%; that between blinded scores and post hoc reference scores ranged from 41%–52%. Faculty members with 3 years of clinical experience had higher agreement rates (Table 2). For cases showing disagreement between blinded scores and post hoc reference scores, neuroradiologists generally underestimated BT-RADS scores (Table 3).
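The comparison of blinded scores against the post hoc reference scores can be sketched as follows. This minimal Python example tallies exact agreement, underestimation, and overestimation for one reader. It assumes the BT-RADS categories are ranked in the order 0, 1a, 1b, 2, 3a, 3b, 3c, 4; the function name and the example scores are hypothetical and are not study data.

```python
# Assumed ordinal ranking of BT-RADS categories, from least to most concerning.
ORDER = ["0", "1a", "1b", "2", "3a", "3b", "3c", "4"]
RANK = {score: i for i, score in enumerate(ORDER)}


def compare_to_reference(blinded, reference):
    """Tally agreement, under-, and overestimation of blinded vs reference scores."""
    if not blinded or len(blinded) != len(reference):
        raise ValueError("score lists must be non-empty and of equal length")
    tally = {"agree": 0, "under": 0, "over": 0}
    for b, r in zip(blinded, reference):
        if RANK[b] == RANK[r]:
            tally["agree"] += 1
        elif RANK[b] < RANK[r]:
            tally["under"] += 1  # blinded score less concerning than reference
        else:
            tally["over"] += 1
    tally["percent_agreement"] = 100 * tally["agree"] / len(blinded)
    return tally


# Hypothetical scores for 4 studies from one reader.
print(compare_to_reference(["2", "3a", "4", "3b"], ["2", "4", "4", "3a"]))
```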
Table 2: Percent agreement between neuroradiologist blinded score and consensus score; blinded score and post hoc reference score (n = 46)
Table 3: Under- and overestimation of disagreement of neuroradiologist blinded score compared with post hoc reference score
Across all studies, the median oldest comparison used was 4 months (IQR, 3–8 months). Raters used older MR imaging studies when assigning a BT-RADS score of 2 (median, 6 months; IQR, 3–12 months) compared with all other score categories combined (median, 3 months; IQR, 2–5 months). Based on the Mann-Whitney U test, the 2 compared median values are statistically different (P < .001). The month range for the oldest MR comparison used for BT-RADS score 3b appeared modestly increased compared with baseline (Fig 2).
Fig 2. Boxplot graph showing time (in months) from the oldest MRI to the prior MRI used by readers to assign a BT-RADS score. The maximum and minimum values are represented at either end of the whiskers. The box represents the interquartile range (25th percentile to the 75th percentile), with the median value represented by the thick horizontal black line within the box. Outliers are shown away from the whiskers and box by the symbol (○).
For cases showing improvement or no change (BT-RADS scores 0–2), perfusion provided little contribution in the scoring process (mean, 1.2; range, 1.0–4.0). For cases with imaging worsening (BT-RADS scores 3a–4), there was a combined relative increase in the contribution from perfusion (mean, 2.1; range, 1.0–5.0). Contribution from perfusion was highest for score 3a (mean, 2.5; range, 1.0–5.0) and score 4 (mean, 2.6; range, 1.0–5.0) as shown in Fig 3. The mean perfusion contribution for the BT-RADS score 3a–4 group is larger than that for the BT-RADS score 0–2 group (2-sample t-statistic, 8.10; P < .001).
Fig 3. Violin graph showing boxplot inside a density plot. The width of the density plot at any region corresponds to the frequency of the data points at that region. For the boxplot, the maximum and minimum values are represented at either end of the whiskers. The box represents the interquartile range (25th percentile to the 75th percentile), with the median value represented by a thick horizontal black line either within the box or at either end of the box. The mean value is represented by the black triangle (▴). Outliers are shown away from the whiskers and box by small black dots (●). On a scale of 1–5 (with 1 = very little, and 5 = great deal), all 6 readers rated how much perfusion contributed to their interpretation of scan and assignment of BT-RADS score.
Overall, the readers’ level of confidence in assigning a score was moderate, regardless of level of training or years of clinical practice (Fig 4). The mean confidence for faculty scoring (mean, 4.0; range, 1–5) and that for resident scoring (mean, 3.9; range, 1–5) were not statistically different (paired t-statistic, 0.90; P = .32).
Fig 4. Violin graph showing boxplot inside a density plot. The width of the density plot at any region corresponds to the frequency of the data points at that region. For the boxplot, the maximum and minimum values are represented at either end of the whiskers. The box represents the interquartile range (25th percentile to the 75th percentile), with the median value represented by a thick horizontal black line either within the box or at either end of the box. The mean value is represented by the black triangle (▴). Outliers are shown away from the whiskers and box by small black dots (●). On a scale of 1–5 (with 1 = not sure at all, and 5 = absolutely sure), all 6 readers rated their level of confidence for the BT-RADS score they assigned to each case. Neuroradiologists (Faculty_1, Faculty_2, Faculty_3, and Faculty_4); Radiology residents (Resident_1 and Resident_2).
Representative examples of scored studies are shown in Figs 5 and 6. Fig 5 shows a case of IDH–wild-type glioblastoma in which all readers agreed. Enhancing treated tumor was noted 18 months postsurgery, and 2 months later, the abnormal enhancement had increased more than 25%. This study was scored with a BT-RADS score of 4 and met Response Assessment in Neuro-Oncology criteria for progression. Fig 6 shows a case of IDH–wild-type glioblastoma in which there was disagreement among the readers. Abnormal FLAIR was noted 3 months postsurgery and 1 month postradiation therapy, with increasing enhancement 3 months postradiation. This study was an early posttreatment study in which there was disagreement about whether imaging worsening constituted pseudoprogression or progressive tumor. The post hoc reference score was BT-RADS 4.
Fig 5. Imaging from a patient with agreement among all readers. A 48-year-old man with IDH–wild-type glioblastoma. FLAIR (A) and T1 postcontrast (B) imaging 18 months after surgery showing abnormal FLAIR and enhancing treated tumor in the left temporal lobe. FLAIR (C) and T1 postcontrast (D) 2 months later showing marked increase in the size of abnormal masslike FLAIR and enhancement (white arrows) with >25% increase in cross-sectional area. All readers gave the study a score of BT-RADS 4.
Fig 6. Imaging from a patient with disagreement among readers. A 45-year-old woman with IDH–wild-type glioblastoma. FLAIR (A) and T1 postcontrast (B) imaging 3 months after surgery and 1 month after completing radiation showing multifocal abnormal FLAIR in the bilateral frontal lobes with minimal enhancement in the left frontal lobe (white arrow). FLAIR (C) and T1 postcontrast (D) 2 months later (3 months after completing radiation) showing marked increase in left frontal edema (white arrows) and enhancement (white arrow). One-half of the readers (2 neuroradiologists and 1 resident) gave the study a score of BT-RADS 3a (pseudoprogression) and one-half of the readers gave the study a score of BT-RADS 4 (progression). The post hoc reference score was BT-RADS 4.
DISCUSSION
Changes associated with brain tumor treatment may mimic tumor progression, making longitudinal assessment of brain tumor MR imaging difficult. BT-RADS was developed to improve reporting consistency, similar to other standardized reporting systems such as BI-RADS, which has shown acceptable to good IRR. IRR of BT-RADS has not been widely studied, and the aim of our study was to assess variation between 6 readers with differing levels of clinical training in a department in which BT-RADS has been implemented and readers had 1–2 years of experience before this study. A previous smaller study evaluated the IRR of BT-RADS by using MR images from 23 patients,19 but to the best of our knowledge this is the largest study on IRR of BT-RADS to date.
We found very good overall interrater agreement (Gwet AC2 index, 0.83) among 6 readers, suggesting high BT-RADS consistency in routine clinical practice. Interrater agreement (Gwet AC2 index, 0.83) was higher compared with that estimated by Parillo et al19 (Fleiss κ, 0.70). Compared with Fleiss κ, the Gwet index is more robust and resistant to the κ paradox and less impacted by marginal probability and prevalence,20,21 which may result in the Gwet index being a better measure for this purpose.16 Additionally, we used a larger patient sample and assessed the added value of perfusion MR imaging. Our study produced similar interrater agreement results compared with other neuroradiology standardized reporting systems. For the Neck Imaging Reporting and Data System, Elsholtz et al22 reported Kendall coefficient of concordance of 0.74 and 0.80 for the primary site and nodes, respectively, and Hsu et al23 calculated percent agreement of 84% and 93% for primary sites and nodes, respectively. The percent agreement rates for the Thyroid Imaging Reporting and Data System ranged from 70%–87%.24 All of these prior reports are comparable to this study (Gwet AC2 index, 0.83, percent agreement 91%). In general, percent agreement values tend to be higher than Gwet index or other κ values because percent agreement does not account for agreement due to chance.
Our study demonstrated high agreement for the neuroradiologist group (Gwet AC2 index, 0.84) and the resident group (Gwet AC2 index, 0.79). This finding suggests that BT-RADS is applicable to all levels of experience, including trainees. Similarity in agreement between the 2 groups is likely related to the simplicity in the structure of BT-RADS in its reference scoring flow chart and detailed category guide.8,25 Even though the difference in agreement between the 2 groups was not statistically significant, there may be small differences below the level of detection for this study. Generally, agreement within the neuroradiologist and resident groups was lower among cases with worsening imaging (BT-RADS 3a–4) compared with cases showing improvement or no change (BT-RADS 0–2). For worsening cases, the values were lower in the resident group (Gwet index, 0.58) than in the neuroradiologist group (Gwet index, 0.66). This is likely explained by the complexity of such cases, which are challenging even for experienced neuroradiologists. This variation likely also exists in free-text reports; structured reporting does not create the variation but allows us to quantify it.
The overall percent agreement between consensus score and post hoc reference score was good, compared with a moderate agreement between blinded neuroradiologists scores and post hoc reference scores. During consensus scoring, the 2 scoring neuroradiologists utilized available information that individual readers were blinded to during the blinded scores, including imaging scores and scores from all 4 neuroradiologists. This provided more information that aided in the consensus scoring and resulted in higher percent agreement when compared with the post hoc reference score. When compared with post hoc scores, BT-RADS scores did slightly underestimate progression. This is partially due to system design, which favors undercalling progression in the early posttreatment period to leave treatment decisions in the hands of the oncology treatment team, which may prefer to wait for subsequent follow-up before changing treatment or removing a patient from a clinical trial. DSC perfusion was found to be useful in scoring more complex cases (BT-RADS scores 3a–4) but provided little information for lower scoring cases (BT-RADS scores 0–2). This is consistent with previous reports that perfusion is a useful adjunct or troubleshooting tool for indeterminate studies,26 but may not contribute to many reports on primary brain tumor cases. Although high confidence in a reporting system does not necessarily equate to assigning the correct score, it is a secondary measure of ease with use of the system. We recorded a moderate level of confidence in using BT-RADS that did not differ between the neuroradiologist and resident groups.
Our study had some limitations. This was a retrospective study conducted in a single facility where BT-RADS is used as part of standard practice in reading primary brain tumor MRIs. Differences in practice patterns, such as follow-up at shorter or longer clinical intervals, and range of experience of neuroradiologists interpreting brain tumor imaging may affect outcomes. In addition, we classified the tumors according to the 2016 WHO classification and they were not reclassified. For this study, we recruited only 2 radiology residents compared with 4 neuroradiologists. Last, we did not perform intrarater agreement analysis to assess internal consistency among readers as it was beyond the scope of this study, but it is certainly a topic of future interest. A multicenter prospective study that evaluates MR images with a varied distribution of more complex cases, includes readers without prior institutional use of BT-RADS, includes more radiology trainees, and assesses intrarater reliability as a secondary analysis is warranted to further support the use of BT-RADS for standardization of brain tumor MR imaging reporting.
CONCLUSIONS
Our study shows good to very good agreement for posttreatment follow-up among experienced neuroradiologists and radiology residents by using BT-RADS. BT-RADS can thus be used to produce consistent and transparent reports even in less experienced hands. Our study also demonstrated similar interrater agreement results as compared with other RADS systems used in neuroradiology. BT-RADS has been adopted at multiple institutions within the United States as well as internationally, and this report suggests that implementing BT-RADS can help provide consistent reports across readers. With a moderate agreement between blinded scores and post hoc scores, future studies to better understand the predictiveness of BT-RADS scores on future outcomes will be a great addition to the literature.
Footnotes
Michael Essien and Maxwell Cooper are co-first authors.
The project was supported by RSNA Research and Education Foundation, through grant number RSCH0235. The content is solely the responsibility of the authors and does not necessarily represent the official views of the RSNA R&E Foundation.
Disclosure forms provided by the authors are available with the full text and PDF of this article at www.ajnr.org.
- Received March 5, 2024.
- Accepted after revision April 19, 2024.
- © 2024 by American Journal of Neuroradiology