Abstract
BACKGROUND AND PURPOSE: Artificial intelligence models in radiology are frequently developed and validated using data sets from a single institution and are rarely tested on independent, external data sets, raising questions about their generalizability and applicability in clinical practice. The American Society of Functional Neuroradiology (ASFNR) organized a multicenter artificial intelligence competition to evaluate the proficiency of developed models in identifying various pathologies on NCCT, assessing age-based normality, and estimating medical urgency.
MATERIALS AND METHODS: In total, 1201 anonymized, full-head NCCT clinical scans from 5 institutions were pooled to form the data set. The data set encompassed studies with normal findings as well as those with pathologies, including acute ischemic stroke, intracranial hemorrhage, traumatic brain injury, and mass effect (task 1: detection of these pathologies). NCCTs were also assessed to determine whether findings were consistent with expected brain changes for the patient’s age (task 2: age-based normality assessment) and to identify any abnormalities requiring immediate medical attention (task 3: evaluation of findings for urgent intervention). Five neuroradiologists labeled each NCCT, with consensus interpretations serving as the ground truth. The competition was announced online, inviting academic institutions and companies. Independent central analysis assessed the performance of each model. Accuracy, sensitivity, specificity, positive and negative predictive values, and receiver operating characteristic (ROC) curves were generated for each artificial intelligence model, along with the area under the ROC curve.
RESULTS: Nineteen teams from various academic institutions registered for the competition. Of these, 4 teams submitted their final results; no commercial entities participated. The 4 teams processed 1177 studies. The median age of patients was 62 years, with an interquartile range of 33 years. For task 1, areas under the ROC curve ranged from 0.49 to 0.59. For task 2, two teams completed the task, with area under the ROC curve values of 0.57 and 0.52. For task 3, teams had little-to-no agreement with the ground truth.
CONCLUSIONS: To assess the performance of artificial intelligence models in real-world clinical scenarios, we analyzed the results of the ASFNR Artificial Intelligence Competition. The first ASFNR Competition underscored the gap between expectation and reality: the models largely fell short in their assessments. As the integration of artificial intelligence tools into clinical workflows increases, neuroradiologists must carefully recognize the capabilities, constraints, and consistency of these technologies. Before institutions adopt these algorithms, thorough validation is essential to ensure acceptable levels of performance in clinical settings.
ABBREVIATIONS:
- AI = artificial intelligence
- ASFNR = American Society of Functional Neuroradiology
- AUROC = area under the receiver operating characteristic curve
- GEE = generalized estimating equation
- IQR = interquartile range
- NPV = negative predictive value
- PPV = positive predictive value
- ROC = receiver operating characteristic
- TBI = traumatic brain injury
SUMMARY
PREVIOUS LITERATURE:
Artificial intelligence (AI) in neuroradiology has shown promise, yet independent validation is scarce. The literature reveals a lack of external validation and diminished performance on diverse, external data sets, highlighting the need for realistic multi-institutional assessments.
KEY FINDINGS:
In this multicenter neuroradiology AI competition, AI models performed poorly at identifying pathologies on NCCT head scans when measured against expert consensus ground truth.
KNOWLEDGE ADVANCEMENT:
This study emphasizes the complexity of clinical AI application, urging extensive validation before integrating AI models into practice.
Driven by advanced techniques and wider access to large imaging data sets, artificial intelligence (AI) holds the promise of revolutionizing neuroradiology.1 Although the number of potentially useful models developed with AI has rapidly increased, independent validation of the software implementations of these models lags. AI models are often developed and validated using a single institutional data set and are not tested using separate, external data sets.2 Due to the lack of external validation, these models might not be adequately evaluated in contexts representative of daily practice; performance that is biased and/or inaccurate may not be detected. A review of 516 publications on AI tools revealed that only 31 studies (6%) underwent external validation.2 Moreover, a recent analysis revealed that most deep learning algorithms used in radiologic diagnosis exhibited diminished performance on external data sets, underscoring the critical importance of external validation before the wide-scale clinical adoption of these tools.3
Some of the reasons that AI-related studies are typically conducted using in-house data from a single institution include patient privacy concerns, the complexities of data storage and transfer associated with the large volume of imaging data, and the hurdles in obtaining high-quality labeling. Unlike this “develop behind closed doors” approach, AI competitions offer opportunities to train and test algorithms in a more realistic scenario, using massive, de-identified data sets from multiple institutions with the ground truth provided by qualified radiologists.4
The success of AI models must be evaluated in a variety of clinical scenarios to establish their validity in real-world settings. In the 2019 American Society of Functional Neuroradiology (ASFNR) AI Competition, a multicenter initiative involving 1201 anonymized NCCT scans from 5 distinct institutions, we assessed participants’ AI models for their proficiency in identifying pathologies on NCCT of the head, specifically acute ischemic stroke, intracranial hemorrhage, traumatic brain injury (TBI), and mass effect. Additionally, we assessed the capability of these AI models to determine whether scan findings were normal for age and the degree of urgent intervention needed. The purpose of this article was to report the rationale, methods, and results of the competition orchestrated by the ASFNR Committee and to share the insights gained. The findings of this study shed light on the performance of AI models in neuroradiology by mimicking day-to-day operations and highlight areas for future research to better integrate these tools into clinical practice.
MATERIALS AND METHODS
Data Set
In total, 1201 anonymized, full-head NCCT scans across 5 institutions (Stanford University, University of California Irvine, University of Lausanne, University of Maryland, and Wake Forest University) were pooled to form the 2019 ASFNR AI Competition data set. NCCTs were collected at each institution from consecutive studies conducted at the emergency department using the following inclusion criteria: 1) patients older than 18 years of age, 2) absence of severe motion artifacts, and 3) continuous brain parenchyma coverage from the base to the vertex. Consecutive studies were used to simulate daily radiology practice. If any follow-up scans were encountered during consecutive studies, they were also permitted for inclusion. The competition data set included an unrestricted variety of CT scanner vendors, scanner models, and section thicknesses to evaluate AI models in a cohort representative of clinical practice. Figure 1 presents a flow chart showing the contributions from each institution and the cases excluded along with their respective reasons. The competition was held in 2019. All 5 institutions received approval from their respective institutional review boards.
FIG 1. Institutional contributions to the collaborative database.
Outcomes of Interest
The data set included a wide range of clinically relevant pathologies. Accuracy in identifying the presence of acute ischemic stroke, intracranial hemorrhage, TBI, and mass effect was used as an outcome measure. The categories were not mutually exclusive. For example, a scan revealing a large traumatic epidural hematoma would have been categorized under both intracranial hemorrhage and TBI. TBI was defined using established common data elements.5 In addition, the NCCTs were assessed as having normal or abnormal findings for the patient’s age. Normal NCCT findings for age were defined as a routine examination without any pathologies or with minor incidental findings deemed appropriate for age, including microvascular ischemic changes, brain volume loss, incidental calcification, sinus mucosal thickening, temporomandibular joint degenerative changes, and upper cervical spine degenerative changes. Finally, NCCTs were rated for their need for urgent intervention as follows: 1) emergent finding: intervention immediately needed (including scans showing large acute ischemic stroke, large intracranial hemorrhage, severe ventriculomegaly, or severe mass effect); 2) urgent finding: call the referring clinician within an hour (including scans with subacute stroke, minor stroke, small hemorrhage, TBI, or mild mass effect); 3) actionable finding: important but not urgent finding (including scans with brain tumors or mild ventriculomegaly); 4) incidental finding (scans with minor abnormalities); and 5) scans with normal findings.
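To make the labeling scheme concrete, the sketch below shows one hypothetical way a per-scan ground truth record could be represented for downstream analysis. The field names and the `UrgencyLevel` enumeration simply mirror the definitions above; this structure is illustrative and is not the format actually used in the competition.

```python
from dataclasses import dataclass
from enum import IntEnum


class UrgencyLevel(IntEnum):
    """Five-level urgency rating used for task 3."""
    EMERGENT = 1    # intervention immediately needed
    URGENT = 2      # call the referring clinician within an hour
    ACTIONABLE = 3  # important but not urgent finding
    INCIDENTAL = 4  # minor abnormalities only
    NORMAL = 5      # normal findings


@dataclass
class ScanLabel:
    """Per-scan labels; the task 1 pathology categories are not mutually exclusive."""
    scan_id: str
    patient_age: int
    acute_ischemic_stroke: bool    # task 1
    intracranial_hemorrhage: bool  # task 1
    traumatic_brain_injury: bool   # task 1
    mass_effect: bool              # task 1
    normal_for_age: bool           # task 2
    urgency: UrgencyLevel          # task 3


# Example: a large traumatic epidural hematoma is labeled under both
# intracranial hemorrhage and TBI, as described in the text.
example = ScanLabel(
    scan_id="scan_0001",           # hypothetical identifier
    patient_age=63,
    acute_ischemic_stroke=False,
    intracranial_hemorrhage=True,
    traumatic_brain_injury=True,
    mass_effect=False,
    normal_for_age=False,
    urgency=UrgencyLevel.EMERGENT,
)
```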
Ground Truth
Each NCCT in the competition data set was labeled by 5 board-certified neuroradiologists who had completed a 2-year neuroradiology fellowship training program. The 5 neuroradiologists had 4, 7, 7, 7, and 9 years of experience in neuroradiology. To reduce interobserver variability, the labeling neuroradiologists first interpreted 50 NCCTs and then received feedback from senior faculty focused on their interpretative approach. This feedback included specific guidelines aimed at standardizing the approach to evaluation, ensuring consistency across all observers. The neuroradiologists were blinded to each other’s interpretations. The only clinical information provided was the patient’s age, which was required to determine whether imaging findings were age-related. The consensus interpretation was accepted as the ground truth when at least 4 neuroradiologists agreed. If this was not the case, the ground truth was determined by a sixth board-certified neuroradiologist (M.W., with 25 years of experience in neuroradiology), who reviewed the individual interpretations of the other 5 neuroradiologists and the follow-up imaging studies obtained for these patients, including MR imaging when available.
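A minimal sketch of the consensus rule described above, assuming 5 binary reads per finding on each scan. The adjudication step is represented only as a placeholder callback, because the sixth reader’s review of the individual interpretations and follow-up imaging cannot be reduced to code.

```python
from collections import Counter
from typing import Callable, Sequence


def consensus_label(
    reads: Sequence[bool],
    adjudicate: Callable[[Sequence[bool]], bool],
    required_agreement: int = 4,
) -> bool:
    """Return the ground truth for one finding on one scan.

    `reads` holds the 5 independent neuroradiologist interpretations.
    If at least `required_agreement` readers agree, that call becomes the
    ground truth; otherwise the finding is sent to a sixth reader.
    """
    label, n_agree = Counter(reads).most_common(1)[0]
    if n_agree >= required_agreement:
        return label
    # Tie-break by the senior reader, who also reviews follow-up imaging.
    return adjudicate(reads)


# 4 of 5 readers called the finding present, so no adjudication is needed.
print(consensus_label([True, True, True, True, False], adjudicate=lambda r: True))  # True
```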
Candidate AI Models
A call for candidate AI models was posted on the Internet and social media. Academic institutions and vendors were expressly invited to participate. The competition did not include any data sets for development because it was designed for external validation, assuming that model development had already occurred at institutions using their data sets. The participating teams were provided 100 NCCT scans, including patient age and annotations, to familiarize themselves with the data and its formats. The teams were permitted to engage with these 100 labeled scans, encompassing a range of pathologies as well as scans with normal findings, for familiarization purposes only; however, they were explicitly prohibited from incorporating these scans into their training data sets. The participants were then given the remaining scans and the patient’s age for each scan, but no annotations. Participants were asked to perform 3 tasks based on the predetermined outcomes: detection of acute ischemic stroke, intracranial hemorrhage, acute and chronic TBI, and mass effect (task 1); evaluation for normal or abnormal scan findings for age (task 2); and determination of urgency (task 3). However, participants were permitted to compete in one, several, or all of the tasks. After receiving the data set, they were given a 2-week window to submit the results of their algorithms. Participants were not permitted to view the results of other participants.
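The tasks imply that each team returned, per scan, a binary call for every finding it competed on and, where available, a predicted probability (used later for the ROC analysis), plus an urgency category for task 3. The CSV layout below is a hypothetical illustration of such a submission; the column names are assumptions, not the actual competition format.

```python
import csv

# Hypothetical per-scan submission columns covering all 3 tasks; a team
# competing in only some tasks would leave the other columns blank.
fieldnames = [
    "scan_id",
    # task 1: binary call and predicted probability per finding
    "stroke_pred", "stroke_prob",
    "hemorrhage_pred", "hemorrhage_prob",
    "tbi_pred", "tbi_prob",
    "mass_effect_pred", "mass_effect_prob",
    # task 2: normal-for-age call and probability
    "normal_for_age_pred", "normal_for_age_prob",
    # task 3: one of the 5 urgency levels (1 = emergent ... 5 = normal)
    "urgency_level",
]

with open("team_submission.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({
        "scan_id": "scan_0001",
        "hemorrhage_pred": 1, "hemorrhage_prob": 0.83,
        "tbi_pred": 1, "tbi_prob": 0.61,
        "urgency_level": 1,
    })
```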
Statistical Analysis
The performance of each algorithm was analyzed centrally, independent of the participants, and compared with that of the others. The urgency of NCCT findings was binarized, with “emergent finding” marked yes for immediate intervention required and all other assessments labeled no. The consensus interpretation was accepted as the ground truth when at least 4 neuroradiologists agreed. To compare the results from the participant teams, reviewers, and ground truth while accounting for the repeated-measures nature of the data, we used a generalized estimating equation (GEE) logistic regression model. This model is designed to evaluate binary outcomes from correlated response data, enabling us to robustly account for the intragroup correlation of repeat measurements across the study participants. To compare each individual AI result with the ground truth, we used the Cohen κ test. The suggested interpretation of the Cohen κ values is as follows: values ≤0 indicate no agreement; 0.01–0.20, none-to-slight; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, substantial; and 0.81–1.00, almost perfect agreement.6 Accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and receiver operating characteristic (ROC) curves were generated for each AI model as summary measures of its performance, and the area under the ROC curve (AUROC) was calculated. All statistical analyses were performed using Matlab (Version r2020b; MathWorks), except for the GEE models, which were built in SPSS (Version 26; IBM). Statistical significance was set at P < .05.
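The summary statistics and models named above map onto standard library routines. The sketch below uses scikit-learn and statsmodels rather than the Matlab/SPSS tooling actually used, so the data frame layout and column names (`scan_id`, `team`, `agrees_with_gt`) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import cohen_kappa_score, confusion_matrix, roc_auc_score


def summarize_model(y_true, y_pred, y_prob=None):
    """Accuracy, sensitivity, specificity, PPV, NPV, Cohen kappa, and AUROC
    (the last only if predicted probabilities were reported) for one AI model."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    summary = {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),
        "npv": tn / (tn + fn) if (tn + fn) else float("nan"),
        "kappa": cohen_kappa_score(y_true, y_pred),
    }
    if y_prob is not None:
        summary["auroc"] = roc_auc_score(y_true, y_prob)
    return summary


def compare_teams(long_df: pd.DataFrame):
    """GEE logistic regression of agreement with the ground truth, with an
    exchangeable correlation structure over the repeat measurements per scan.
    `long_df` holds one row per (scan, team) with a binary `agrees_with_gt`."""
    model = smf.gee(
        "agrees_with_gt ~ C(team)",
        groups="scan_id",
        data=long_df,
        family=sm.families.Binomial(),
        cov_struct=sm.cov_struct.Exchangeable(),
    )
    result = model.fit()
    odds_ratios = np.exp(result.params)  # e.g., OR 1.34 -> 34% higher odds
    return result, odds_ratios
```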
RESULTS
Of the initial 1201 NCCT examinations, 11 scans were excluded because of severe artifacts. Additionally, 5 cases could not be successfully uploaded to the DICOM viewer, as reported by the participants, and 8 examinations were flagged as incomplete studies. Therefore, participants processed 1177 studies, which were then analyzed in this study. Scanners from 3 different manufacturers were used, and the distribution of these scanners across each institution is shown in Fig 2. Scanner information was unavailable for 6 scans. The scans were acquired at section thicknesses of 2.5 and 5 mm, with 255 scans using the 2.5-mm thickness and 916 scans using the 5-mm thickness. The median age of patients was 63 years (IQR, 47–79 years).
FIG 2. Distribution of scanners from 3 manufacturers across participating institutions. UL indicates University of Lausanne; WFU, Wake Forest University; SU, Stanford University; UCI, University of California Irvine; UM, University of Maryland.
Nineteen teams from academic institutions registered for the competition. Four of the 19 enrollees were from institutions that had also contributed DICOM imaging data to the competition. No commercial entity signed up for the competition. Four of the 19 academic teams submitted their final results: Keck School of Medicine of University of Southern California, Wake Forest University, University of California Irvine, and the University of Texas Southwestern Medical Center were the 4 participating institutions. The teams from Wake Forest University and the University of California Irvine were the 2 that submitted results and also contributed data to the competition. We required data set providers to ensure that the data had never been used for training their participating algorithms. The specific identifying number assigned to each team will not be disclosed. Of the participating teams, 2 completed all 3 tasks, 1 completed 2 tasks, and 1 completed only 1 task. The performance of the AI models for each task is presented in the Online Supplemental Data. The results of the GEE model used to compare the performances of the 4 teams are also shown in the Online Supplemental Data.
Task 1: Lesion Detection
In ground truth evaluation, acute ischemic stroke was diagnosed in 48 patients (4%) with a 95% interobserver agreement. The median age of patients with acute ischemic stroke was 60 years (IQR, 52–73 years). This section of task 1 was completed by teams 1 and 2. Teams 1 and 2 demonstrated similar performance, with sensitivities of 0.10 and 0.08, specificities of 0.97 and 0.95, PPVs of 0.14 and 0.07, NPVs of 0.96 and 0.96, and accuracies of 0.94 and 0.92, respectively. Both teams exhibited little-to-no agreement with the ground truth, with a Cohen κ value of 0.09 for team 1 and 0.03 for team 2. The ROC curves for this task are shown in Fig 3A; the AUROC value for both teams was 0.53. GEE analysis showed that team 1 had 34% higher odds (95% CI, 9%–65%) of agreeing with the ground truth compared with team 2, with a P value of .01.
FIG 3. ROC curves for model performance in detecting acute ischemic stroke (A), intracranial hemorrhage (B), traumatic brain injury (C), and mass effect (D), and for age-based normality assessment (E).
In ground truth evaluation, intracranial hemorrhage was diagnosed in 208 patients (18%) with a 96% interobserver agreement. The median age of patients with intracranial hemorrhage was 69 years (IQR, 51–83 years). All 4 teams completed this section of task 1. Teams 1, 2, 3, and 4 had sensitivities of 0.30, 0.01, 0.19, and 0.22; specificities of 0.86, 0.99, 0.83, and 0.90; PPVs of 0.31, 0.25, 0.19, and 0.32; NPVs of 0.85, 0.82, 0.83, and 0.84; and accuracies of 0.76, 0.82, 0.72, and 0.78, respectively. All teams demonstrated little-to-no agreement with the ground truth, with Cohen κ values of 0.16, 0.01, 0.02, and 0.14 for teams 1–4, respectively. The ROC curves for this task are shown in Fig 3B. Team 1 had an AUROC value of 0.57, team 3 had a value of 0.51, and team 4 had a value of 0.59. Team 2 did not report their predicted probabilities and ROC curve. GEE analysis showed that team 1 had 11% lower odds (95% CI, 3%–18%; P value = .01), team 2 had 29% higher odds (95% CI, 13%–46%; P value = .01), and team 3 had 29% lower odds (95% CI, 17%–39%; P value = .01) of agreeing with the ground truth compared with team 4.
In ground truth evaluation, TBI (acute and chronic) was diagnosed in 240 patients (20%) with an 85% interobserver agreement. The median age of patients with TBI was 65.5 years (IQR, 48–82 years). This section of task 1 was completed by teams 1 and 2. Teams 1 and 2 demonstrated comparable performance, with sensitivities of 0.07 and 0.00, specificities of 0.97 and 1.00, PPVs of 0.35 and 0.33, NPVs of 0.80 and 0.80, and accuracies of 0.79 and 0.80, respectively. With Cohen κ values of 0.05 for team 1 and 0.003 for team 2, both teams showed little-to-no agreement with the ground truth. The ROC curves for this task are shown in Fig 3C; the AUROC value for both teams was 0.49. GEE analysis showed that team 1 had 6% lower odds (95% CI, 0%–13%) of agreeing with the ground truth compared with team 2, with a P value of .06.
In ground truth evaluation, mass effect was diagnosed in 84 patients (7%) with a 96% interobserver agreement. The median age of patients with mass effect was 70.5 years (IQR, 54–85 years). Teams 1, 2, and 4 completed this section of task 1. Teams 1, 2, and 4 had sensitivities of 0.11, 0.01, and 0.18; specificities of 0.95, 1.00, and 0.95; PPVs of 0.14, 0.50, and 0.22; NPVs of 0.93, 0.93, and 0.94; and accuracies of 0.89, 0.93, and 0.90, respectively. All teams exhibited little-to-no agreement with the ground truth, with Cohen κ values of 0.06, 0.02, and 0.14 for teams 1, 2, and 4, respectively. The ROC curves for this task are shown in Fig 3D. Teams 1, 2, and 4 had AUROC values of 0.50, 0.54, and 0.55, respectively. GEE analysis showed that team 1 had 6% higher odds of agreeing with the ground truth compared with team 4, though this difference was not statistically significant (95% CI, −6% to 21%; P value = .34). Team 2 had 34% lower odds (95% CI, 22%–45%) of agreeing with the ground truth compared with team 4, with a P value of .01.
Task 2: Age-Based Normality Assessment
In ground truth evaluation, 753 NCCT scans (64%) were evaluated as having abnormal findings for age, with an 84% interobserver agreement. Task 2 was completed by teams 1, 2, and 3. Teams 1, 2, and 3 had sensitivities of 0.56, 0.34, and 0.66; specificities of 0.56, 0.67, and 0.33; PPVs of 0.69, 0.65, and 0.63; NPVs of 0.42, 0.37, and 0.36; and accuracies of 0.56, 0.46, and 0.54, respectively. Teams 1 and 2 had little-to-no agreement with the ground truth, with Cohen κ values of 0.116 and 0.011, respectively. Team 3 had a negative Cohen κ value of −0.01, indicating that the concordance between their model’s predictions and the ground truth was slightly below chance. The ROC curves for this task are shown in Fig 3E. Teams 1 and 2 had AUROC values of 0.57 and 0.52, respectively. Team 3 did not report their predicted probabilities and ROC curve. GEE analysis indicated that team 1 had 9% higher odds of agreeing with the ground truth compared with team 3; however, this difference was not statistically significant (95% CI, −6% to 27%; P value = .26). Team 2 had 27% lower odds (95% CI, 15%–38%) of agreeing with the ground truth compared with team 3, with a P value of .01.
Task 3: Urgency Categorization
In ground truth evaluation, 63 NCCT scans (5%) were evaluated as having an emergent finding, 182 scans (15%) were assessed as having an urgent finding, 232 scans (20%) were evaluated as having an actionable finding, 458 scans (39%) were assessed as having an incidental finding, and 242 scans (21%) were considered as having normal findings, with a 71% interobserver agreement. Task 3 was completed by teams 1 and 2. The urgency of NCCT findings was binarized, with emergent finding marked yes for immediate intervention required and all other assessments labeled no. Teams 1 and 2 demonstrated similar performance, with sensitivities of 0.32 and 0.29, specificities of 0.77 and 0.78, PPVs of 0.26 and 0.25, NPVs of 0.81 and 0.81, and accuracies of 0.68 and 0.68, respectively. Both teams had little-to-no agreement with the ground truth, with a Cohen κ value of 0.08 for team 1 and 0.06 for team 2. GEE analysis showed that team 1 had 24% lower odds (95% CI, 9%–36%) of agreeing with the ground truth compared with team 2, with a P value of .01.
DISCUSSION
In this article, we describe the ASFNR AI Competition, which included 1177 NCCT head scans from 5 different institutions. This study aimed to assess the performance of models in environments distinct from those of their original development, using external data from 5 different institutions. The data providers did not use the data they contributed for training their participating algorithms. The ASFNR AI Competition data set had 2 important features. First, the data set was derived from clinical examinations including multiple pathologic entities, closely mirroring real-world scenarios: participants were tasked with detecting pathologies in consecutive patients admitted to emergency departments, similar to the clinical workflow of neuroradiologists. This design contrasts with most other challenge and competition data sets. Second, the data set, a multicenter collection with independent origins, was annotated with a high level of precision by 5 expert neuroradiologists. Given the specialized expertise required in neuroimaging, annotations generated by these experts added to the accuracy and reliability of the data set, making it a robust benchmark for evaluating AI models in neuroradiology.7 In particular, because the study design involved comparing AI performance with the consensus of board-certified neuroradiologists rather than an average radiologist, it set a high bar for AI models. We acknowledge both the difficulty of the task and the elevated standard imposed on AI performance. Notably, despite numerous invitations, no commercial entities competed, thus precluding the evaluation of their performance.
The overall performance of the algorithms was not particularly promising across all categories and subcategory tasks in our study. This outcome may be partly attributed to factors such as the limited timeframe of the challenge, the absence of training data, and the prohibition against teams training with the provided data, all intended to simulate a real clinical practice setting. These same factors may also explain why only 4 of the 19 enrolled teams submitted results. In our multifaceted study, diagnostic evaluations encompassing acute ischemic stroke, intracranial hemorrhage, TBI, and mass effect (task 1) demonstrated variable performance outcomes, with AUROCs ranging from as low as 0.49 to as high as 0.59. In terms of acute ischemic stroke detection, the AUROC value for both teams that completed the subtask was 0.53. There is limited research on the use of deep learning to detect acute ischemic stroke on NCCT images. The primary focus in the field has been on the identification of large-vessel occlusion on CTA, the detection of mismatch on CTP, the identification of intracranial hemorrhage, the determination of ASPECTS, and patient triaging. Chin et al8 used a deep learning model for stroke detection on CT images, achieving an accuracy of 0.977 during the training stage and 0.930 during the testing stage; no external validation was performed.8
In our study, AUROC values for intracranial hemorrhage detection ranged from 0.51 to 0.59. The literature on intracranial hemorrhage detection is more extensive compared with acute ischemic stroke detection. In their recent study, Agarwal et al9 demonstrated a pooled sensitivity of 0.901, a pooled specificity of 0.903, and a summary AUROC of 0.948 in detecting intracranial hemorrhage using CT on the basis of 10 studies that applied convolutional neural networks. In another meta-analysis by Daugaard Jørgensen et al,10 convolutional neural networks had a summary AUROC of 0.980 in 6 retrospective studies. Additionally, in our task on TBI, both teams that completed the task achieved an AUROC value of 0.49, indicating that the models performed worse than random classification.
For the task of detecting mass effect, the AUROC values for the teams ranged between 0.50 and 0.55 in our study. In their clinical validation data set, Chilamkurthy et al11 achieved an AUROC of 0.922 for detecting mass effect from CT images using deep learning. In our study, models were evaluated using data independently collected from various institutions, and their performance was only slightly better than chance. However, the low performance observed in our study could potentially be attributed to characteristics inherent in the external data set itself. Training the models on this specific data set could provide valuable insight: should such a retrained model fail to exceed the performance of the original participants' models, it would imply fundamental challenges with the data set itself, such as complexity or heterogeneity that hinders model learning. Our study did not encompass this aspect of model retraining. The low performance observed in our study for detecting these urgent pathologies may be attributed to several factors, including the diversity of the data set, training data limitations, and the architecture and training methods of the AI models. First, the diversity of the data set, which included images from various institutions using different imaging protocols, may have introduced variability that tested the abilities of the AI models. Second, the training data used by each team may not have adequately represented the full range of pathologies, limiting the ability of the models to detect these critical conditions. Last, the choice of model architecture and the specific approaches to training, including how the models were optimized and regularized, could significantly impact their ability to identify urgent pathologies accurately.
An algorithm capable of detecting any abnormality on NCCT scans holds potential as a screening adjunct, which could assist in prioritizing the interpretation and review of imaging studies. In our study, we evaluated the performance of the models in assessing NCCTs as having normal or abnormal findings relative to the patient’s age. The AUROC values for the teams reporting predicted probabilities were 0.52 and 0.57, once again emphasizing poor performance when assessed using an independent external data set. The CT changes related to normal aging in the human brain are well-documented.12 In their quantitative CT study of the brain, Cauley et al13 demonstrated that the brain parenchymal fraction has a strong correlation with age, reflecting an 11% total decline in brain volume throughout adulthood. They also observed that while the total radiodensity of the brain parenchyma diminishes with age, the rate of this decline is not statistically significant. Furthermore, in their study, they reported that the age-associated decrease in brain radiomass was approximately 20%.13 Given these findings, it becomes evident that radiologists, and by extension AI models intended for clinical application, must be well-acquainted with these age-related changes to ensure precise assessment of patients. In the literature, research has primarily focused on brain age estimation using AI, with most studies using MR imaging, though CT-based studies are also available.14–19 While the utility of AI as a screening tool in radiology shows promise, our study uncovered a concerningly low sensitivity in the evaluated models for screening abnormalities based on age. This low sensitivity presents serious challenges for deploying these models as reliable screening tools.
Given the rising demand for imaging evaluations, any delay in communicating crucial findings to the ordering physician might impede timely clinical interventions and potentially affect treatment outcomes, prompting research into AI applications for imaging evaluation in emergency settings.20 In our study, 63 scans were identified with emergent findings; 182, with urgent findings; 232, with actionable findings; 458, with incidental findings; and 242 were deemed to have normal findings. The participating teams’ algorithms showed little-to-no agreement with the ground truth. Considering the stakes in emergency evaluations, these results indicate that AI models require further refinement for reliable performance in emergency settings. Furthermore, considerations extend beyond emergency care to outpatient care, in which delays in communicating critical results from nonurgent scans can have serious consequences. A critical finding on an outpatient scan that goes untreated for an extended period may worsen the patient’s condition, emphasizing the importance of rapid AI-assisted triage in such settings. Similarly, the importance of improving AI capabilities is highlighted in rural health care settings or underdeveloped countries, where the scarcity of expert radiologists makes AI an invaluable tool.21 In these situations, AI could be beneficial in determining the need to refer patients to better-equipped tertiary care centers. As a result, these AI models need to be not only accurate but also reliable and timely across various health care settings.
The Radiological Society of North America 2019 Brain CT Hemorrhage Challenge, held concurrently with our contest, achieved remarkable success.22 A vast data set from multiple institutions and countries was assembled, with >1000 teams from all over the world participating. The 874,035-image brain hemorrhage CT data set, a collaborative creation of the Radiological Society of North America and the American Society of Neuroradiology, was annotated by a large cohort of volunteer neuroradiologists. It is freely provided to the machine learning and AI research community for noncommercial purposes, with the aim of fostering the development of high-quality machine learning and AI algorithms for diagnosing intracranial hemorrhage. In comparison, the ASFNR AI Competition was designed to assess the application of AI models in settings that mirror clinical practice, allowing researchers to gauge how these models perform in routine medical workflows. This practical assessment is critical because it may provide insights into the readiness of AI applications for integration into clinical decision-making processes, reflecting an implementation-oriented approach.
Our study is not without limitations. First, our study was conducted in 2019, and given the rapid advancement of the field, these results may need updating. Nonetheless, it serves as a valuable baseline for future research. Second, we did not have access to the training and internal validation results of the models, making it impossible to determine whether the models were initially flawed or merely underperformed during external validation. Another limitation was that we did not analyze the architectures of the submitted models; therefore, it is unclear how they were designed and adapted to the specified tasks of this challenge. In addition, of the 19 teams enrolled, only 4 completed the competition, and of these, some completed it only partially. The reasons for this incomplete participation are unclear; teams may have enrolled solely to obtain data, or there may have been a lack of desire to share results.
Another notable limitation was the limited time allotted to the contestants for preparation. Given the time constraints, it was challenging for participants to adequately prepare for the competition unless they had previously developed models capable of performing the tasks on their own data sets. Class imbalance was a further limitation, necessitating careful interpretation of the accuracy statistics; however, this study aimed to mirror clinical implementation, which inherently features class imbalance. The data set also lacked sex and race information. Finally, the use of a data set annotated by highly specialized neuroradiologists may not reflect the variable expertise found in typical clinical settings.
CONCLUSIONS
To assess the practicality of AI models in clinical scenarios, we analyzed their performance in the ASFNR AI Competition, a multicenter initiative comprising over 1100 anonymized NCCT scans from 5 distinct institutions. We evaluated the participating AI models for their capability to detect pathologies on NCCT scans of the head, focusing on acute ischemic stroke, intracranial hemorrhage, TBI, and mass effect, as well as for their ability to assess whether scan findings were normal for age and to determine the urgency of intervention required. The models performed poorly in these evaluations. The first ASFNR AI Competition highlighted the gap between expectation and reality; the models largely fell short in their assessments. As the integration of AI tools into clinical workflows increases, neuroradiologists must diligently recognize the capabilities, constraints, and consistency of these technologies. Rather than debating the merits and drawbacks of the algorithms, the competition organizers have sounded an alert to the community: generalizing AI in medical settings is complex. Before institutions and neuroradiologists adopt these algorithms, thorough validation and comparison with radiologists are essential to ensure acceptable performance in actual clinical settings.
Footnotes
Disclosure forms provided by the authors are available with the full text and PDF of this article at www.ajnr.org.
- Received February 20, 2024.
- Accepted after revision April 22, 2024.
- © 2024 by American Journal of Neuroradiology