That never happened, but not for lack of effort. Research teams around the world stepped up to help. The AI community in particular rushed to develop software that many believed would allow hospitals to diagnose or triage patients faster, bringing much-needed support to the front lines, at least in theory.
In the end, hundreds of predictive tools were developed. None of them made a real difference, and some were potentially harmful.
That is the damning conclusion of several studies published in recent months. In June, the Turing Institute, the UK’s national center for data science and AI, published a report summarizing the discussions at a series of workshops it held in late 2020. The clear consensus was that AI tools had made little, if any, impact in the fight against covid.
Not suitable for clinical use
This echoes the results of two large studies that evaluated hundreds of predictive tools developed last year. Wynants is the lead author of one of them, a review in the British Medical Journal that is still being updated as new tools are released and existing ones are tested. She and her colleagues studied 232 algorithms for diagnosing patients or predicting how sick those with the disease might get. They found that none were suitable for clinical use. Only two have been singled out as promising enough for future testing.
“It’s shocking,” Wynants says. “I went into it with some worries, but this exceeded my fears.”
Wynants’s study is corroborated by another major review, conducted by Derek Driggs, a machine learning researcher at the University of Cambridge, and his colleagues, and published in Nature Machine Intelligence. This team zoomed in on deep learning models for diagnosing covid and predicting patient risk from medical images, such as chest X-rays and chest computed tomography (CT) scans. They reviewed 415 published tools and, like Wynants and her colleagues, concluded that none were suitable for clinical use.
“This pandemic was a big test for artificial intelligence and medicine,” says Driggs, who is himself working on a machine learning tool to help doctors during the pandemic. “It would have gone a long way to getting the public on our side,” he says. “But I don’t think we passed that test.”
Both teams found that researchers repeated the same basic mistakes in the way they trained or tested their tools. Incorrect assumptions about the data often meant that the trained models did not work as claimed.
Wynants and Driggs still believe AI can help. But they worry that these tools could be harmful if built the wrong way, because they could miss diagnoses or underestimate the risk to vulnerable patients. “There’s a lot of hype about machine learning models and what they can do today,” says Driggs.
Unrealistic expectations encourage the use of these tools before they are ready. Both Wynants and Driggs say some of the algorithms they looked at have already been used in hospitals, and some of them are sold by private developers. “I’m afraid they may have harmed patients,” Wynants says.
So what went wrong? And how do we bridge that gap? If there is an upside, it is that the pandemic has made it clear to many researchers that the way AI tools are built needs to change. “The pandemic has put problems in the spotlight that we’ve been dragging along for some time,” Wynants says.
What went wrong
Many of the problems uncovered are linked to the poor quality of the data that researchers used to develop their tools. Information about covid patients, including medical scans, was collected and shared in the midst of a global pandemic, often by the doctors struggling to treat those patients. Researchers wanted to help quickly, and these were the only public data sets available. But that meant that many tools were built using mislabeled data or data from unknown sources.
Driggs highlights the problem of what he calls Frankenstein data sets, which are spliced together from multiple sources and can contain duplicates. This means that some tools end up being tested on the same data they were trained on, which makes them look more accurate than they are.
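The duplicate problem described above can be checked for mechanically. What follows is a minimal sketch, not a method taken from either review: it assumes each scan is available as raw bytes and uses hypothetical file names and toy byte strings in place of real images. Hashing the bytes catches only exact copies; a re-encoded or resized duplicate would slip through.

```python
import hashlib


def content_hash(image_bytes: bytes) -> str:
    """Fingerprint a file by its raw bytes (catches exact duplicates only)."""
    return hashlib.sha256(image_bytes).hexdigest()


def find_leaks(train, test):
    """Return names of test items whose bytes also appear in the training set."""
    train_hashes = {content_hash(data) for _, data in train}
    return [name for name, data in test if content_hash(data) in train_hashes]


def deduplicate(records):
    """Keep the first record seen for each distinct content hash."""
    seen = set()
    unique = []
    for name, data in records:
        digest = content_hash(data)
        if digest not in seen:
            seen.add(digest)
            unique.append((name, data))
    return unique


# Toy stand-ins for scans pooled from two hypothetical sources.
# The same scan appears in both under different file names.
source_a = [("a/001.png", b"scan-1"), ("a/002.png", b"scan-2")]
source_b = [("b/xray_17.png", b"scan-2"), ("b/xray_18.png", b"scan-3")]

# Naive split: train on one source, test on the other.
leaks = find_leaks(train=source_a, test=source_b)
print(leaks)  # ['b/xray_17.png']

# Safer: pool, remove duplicates, and only then split.
pooled = deduplicate(source_a + source_b)
print(len(pooled))  # 3
```

Removing duplicates before the split, rather than after, is the point of the sketch: once a copy sits on both sides, the test score no longer measures performance on unseen data.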
It also obscures the origin of certain data sets. This can mean that researchers miss important features that skew the training of their models. Many unwittingly used a data set that contained chest scans of children who did not have covid as their examples of what non-covid cases looked like. As a result, the AIs learned to identify children, not covid.
Driggs’s group trained its own model using a data set that contained a mixture of scans taken when patients were lying down and when they were standing up. Because patients scanned while lying down were more likely to be seriously ill, the AI wrongly learned to predict serious covid risk from a person’s position.
In other cases, some AIs were found to be picking up on the text font that certain hospitals used to label the scans. As a result, fonts from hospitals with more serious caseloads became predictors of covid risk.
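One way to look for shortcuts like these is to ask how well the label can be guessed from metadata alone, before any model sees an image. The sketch below is illustrative only: the field names (`position`, `hospital`, `label`) and the eight records are invented, and the accuracy is measured on the same records used to form the guess, so it is an optimistic estimate.

```python
from collections import Counter, defaultdict


def majority_accuracy(records):
    """Accuracy of always guessing the most common label."""
    counts = Counter(rec["label"] for rec in records)
    return counts.most_common(1)[0][1] / len(records)


def shortcut_accuracy(records, field):
    """Accuracy of guessing the label from a single metadata field alone.

    For each value of the field, guess the label most often seen with it.
    """
    by_value = defaultdict(Counter)
    for rec in records:
        by_value[rec[field]][rec["label"]] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(records)


# Invented toy records; no real patient data.
records = [
    {"position": "lying", "hospital": "H1", "label": "severe"},
    {"position": "lying", "hospital": "H1", "label": "severe"},
    {"position": "lying", "hospital": "H2", "label": "severe"},
    {"position": "lying", "hospital": "H2", "label": "mild"},
    {"position": "standing", "hospital": "H1", "label": "mild"},
    {"position": "standing", "hospital": "H2", "label": "mild"},
    {"position": "standing", "hospital": "H1", "label": "mild"},
    {"position": "standing", "hospital": "H2", "label": "severe"},
]

baseline = majority_accuracy(records)
by_position = shortcut_accuracy(records, "position")
by_hospital = shortcut_accuracy(records, "hospital")
print(baseline, by_position, by_hospital)  # 0.5 0.75 0.5
```

In this toy data, position alone beats the baseline while hospital does not. A field that predicts the label this well without any image content is a candidate shortcut, and a model's reported accuracy means little until it clearly exceeds what that field gives away.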
Errors like these seem obvious in hindsight. They can also be fixed by adjusting the models, if researchers are aware of them. It is possible to acknowledge the shortcomings and release a less accurate but less misleading model. But many tools were developed either by AI researchers who lacked the medical expertise to spot flaws in the data or by medical researchers who lacked the mathematical skills to compensate for those flaws.
A more subtle problem that Driggs highlights is incorporation bias, or bias introduced at the point a data set is labeled. For example, many medical scans were labeled according to whether the radiologists who created them said they showed covid. But that embeds, or incorporates, any biases of that particular doctor into the ground truth of the data set. It would be much better to label a medical scan with the result of a PCR test rather than one doctor’s opinion, says Driggs. But in busy hospitals there is not always time for statistical niceties.
This has not stopped some of these tools from being rushed into clinical practice. Wynants says it is not clear which ones are being used or how. Hospitals will sometimes say that they are using a tool only for research purposes, which makes it difficult to assess how much doctors rely on it. “There’s a lot of secrecy,” she says.
Wynants asked one company that was selling deep learning algorithms to share information about its approach, but she did not hear back. She later found several published models from researchers tied to this company, all of them with a high risk of bias. “We don’t really know what the company implemented,” she says.
According to Wynants, some hospitals are even signing non-disclosure agreements with medical AI vendors. When she asked doctors what algorithms or software they were using, they sometimes told her they were not allowed to say.
How to fix it
What’s the fix? Better data would help, but in times of crisis that is a big ask. It is more important to make the most of the data sets we have. The simplest move would be for AI teams to work more closely with clinicians, Driggs says. Researchers also need to share their models and disclose how they were trained so that others can test them and build on them. “Those are two things we could do today,” he says. “And they would solve maybe 50% of the problems we identified.”
Getting hold of data would also be easier if formats were standardized, says Bilal Mateen, a physician who leads clinical technology research at the Wellcome Trust, a London-based global health research charity.
Another problem that Wynants, Driggs, and Mateen all identify is that most researchers rushed to develop their own models, rather than working together or improving existing ones. The result was that the collective effort of researchers around the world produced hundreds of mediocre tools, rather than a handful of properly trained and tested ones.