
What works here might not work there - dealing with AI and diversity in Healthcare data


[Illustration: a hospital building overlaid with data points]

I recently spoke with several Healthcare AI startup founders about data challenges. Almost all of them faced the same issue: having to adapt and continuously improve their models once they began deploying at customer sites, mostly due to differences in the datasets they encountered there.


Some of them admitted they never realized how different the data could be. The reality of Healthcare and Healthcare data is that there is huge variability, and that variability presents a number of challenges. This article covers the types of biases you should expect to encounter and some ways to deal with them.


Be mindful of the data biases for healthcare AI

It is simply not practical to train your models while accounting for every potential bias. But being aware of the various potential biases can help you design models that are more robust to such changes, or at the very least anticipate the gaps you should expect and plan for them.


Below are the most significant types of bias you can encounter in healthcare datasets. Depending on your model and application, some may be much more relevant than others. Consider their impact to assess your risk level.


[Image: a dense crowd of people, in black and white]

Population and Cohort related biases

  • Demographic differences - Your dataset might represent a specific slice of the population, which may or may not be representative of your target population. Think broadly about these differences: beyond ethnic diversity, geographical differences, dietary habits, social determinants of health and even environmental influences can introduce their own biases. Some medical conditions are especially sensitive to these factors.

  • Prevalence - The dataset you use inherently has a specific prevalence of the disease/condition/event you are attempting to predict, detect or assess. This can lead to over- or under-representation that hurts your model's performance in a target cohort. Also beware of "survivorship bias" (or its inverse), where only positive (or negative) cases are included, which is problematic in its own way. The sketch below shows one way to check and correct for a prevalence mismatch.
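
To make this concrete, here is a minimal sketch, assuming a binary label array and a known (or estimated) prevalence at the target site; the function name and inputs are hypothetical:

```python
# A minimal sketch: compare training-set prevalence to the prevalence you
# expect at a target site, and reweight samples to compensate.
# `labels` and `target_prevalence` are hypothetical inputs.
import numpy as np

def prevalence_weights(labels: np.ndarray, target_prevalence: float) -> np.ndarray:
    """Per-sample weights that shift the positive-class prevalence
    of a binary-labeled dataset toward `target_prevalence`."""
    dataset_prevalence = labels.mean()
    print(f"dataset prevalence: {dataset_prevalence:.3f}, "
          f"target: {target_prevalence:.3f}")
    # Upweight the class that is under-represented relative to the target.
    w_pos = target_prevalence / dataset_prevalence
    w_neg = (1 - target_prevalence) / (1 - dataset_prevalence)
    return np.where(labels == 1, w_pos, w_neg)

# Example: a curated dataset with ~40% positives, deployed where ~5% is realistic.
labels = np.random.binomial(1, 0.4, size=10_000)
weights = prevalence_weights(labels, target_prevalence=0.05)
# Pass `weights` as sample_weight to most scikit-learn estimators' .fit().
```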

[Image: a stethoscope]

Practice related biases

  • Medical practice - Medical practices are not standardized across markets, locations, institutions or even individual practitioners. Your dataset might be representative of certain practices but not others, which may have a significant impact on your model. Even where clinical guidelines exist, they differ in content and adoption across institutions.

  • Documentation - Documentation practices can greatly influence the data your model trains on. This starts with technical aspects such as templates, language, abbreviations and terminology. The more difficult differences occur at the semantic level: for example, the difference between historical symptoms and current symptoms is enormous from a clinical perspective, yet how the two are differentiated in documentation varies widely.

  • Medical Devices - Differences in medical devices also lead to differences in their output data. One key example is medical imaging, where vendors can differ significantly in the images they produce. Beyond vendors, even device models and software versions can affect the outputs. If your dataset is skewed toward a specific mix of vendors or equipment, you might discover your model doesn't perform well on data from vendors that are under-represented in your training set.

  • Data acquisition protocols - Beyond the devices themselves, how they are used can also change significantly across institutions, departments and even operators. This too can bias the training dataset relative to the new data you will encounter, degrading performance. A per-source performance breakdown, sketched below, is a simple way to surface such gaps.
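
One practical safeguard is to slice evaluation metrics by acquisition source. Here is a hedged sketch assuming a pandas DataFrame with hypothetical `vendor`, `y_true` and `y_score` columns:

```python
# A sketch: break model performance down by acquisition source (vendor,
# site, scanner model) to surface under-represented or under-served groups.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def auc_by_group(df: pd.DataFrame, group_col: str = "vendor") -> pd.DataFrame:
    rows = []
    for group, g in df.groupby(group_col):
        if g["y_true"].nunique() < 2:
            continue  # AUC is undefined without both classes present
        rows.append({
            group_col: group,
            "n": len(g),
            "prevalence": g["y_true"].mean(),
            "auc": roc_auc_score(g["y_true"], g["y_score"]),
        })
    return pd.DataFrame(rows).sort_values("auc")

# Illustrative data: vendor B is barely represented.
df = pd.DataFrame({
    "vendor": ["A"] * 300 + ["B"] * 40,
    "y_true": np.random.binomial(1, 0.2, 340),
    "y_score": np.random.rand(340),
})
print(auc_by_group(df))
# A large gap between the best and worst group, or a tiny `n` for one
# vendor, flags a dataset bias worth addressing before deployment.
```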


Labeling and Curation related biases

  • Public dataset accuracy - If you are using public datasets, be aware that many of them are known to include incorrect labels, introducing label noise. Consider the quality of the annotations and ways to validate their accuracy before blindly treating them as ground truth.

  • Labels vs ground truth - When extracting labels from the datasets themselves, be aware that those labels might not represent ground truth. For example, a diagnosis mentioned in a radiology report might not ultimately prove accurate for that patient. The opposite is also true, where labels might be missing: a patient without a diagnosis code for type 2 diabetes might still have the disease (their bloodwork might be a better source). Such biases can lead to training a model that mimics the inherent inaccuracies of medical practice.

  • Overlaps and repetition - When curating datasets from various sources, it is critical to avoid overlapping or repeated records that inadvertently over-represent some patients and introduce bias (see the sketch after this list).

  • Labeling and annotation - Labeling is a huge challenge in healthcare: different professionals might label the same data differently, and even the same professional might label differently on different occasions. Account for these inherent limitations and flaws in your datasets and model development.
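
Here is a minimal sketch of two curation safeguards, assuming a pandas DataFrame with a hypothetical `patient_id` column: exact-duplicate removal, and a patient-level train/test split so the same patient never appears on both sides:

```python
# A minimal sketch of two curation safeguards. Column names
# (`patient_id`, feature columns) are assumptions for illustration.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def dedupe_and_split(df: pd.DataFrame, feature_cols: list[str]):
    # 1. Remove exact repeats that would over-represent some patients.
    df = df.drop_duplicates(subset=["patient_id", *feature_cols])
    # 2. Group-aware split: all records of a patient stay on one side,
    #    preventing leakage that inflates test metrics.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    train_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))
    return df.iloc[train_idx], df.iloc[test_idx]
```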


Avoid the black box for robustness & explainability

[Image: a black box labeled "AI" with sparks flying out - caption: "The magic black box of AI"]

Present-day model architectures, algorithms and foundation models allow you to train high-performing models without any explicit feature engineering or domain knowledge applied. This can be very tempting; however, it presents two significant challenges for adoption at scale:

  1. Robustness - While the model can perform very well on the datasets you collected, it might be very sensitive to small changes in the data and behave unpredictably when encountering new real-world datasets. This leads to fragile models that require many adaptations. Left unchecked, models can also base themselves on correlations rather than causation.

  2. Explainability - Lacking explainability (the ability to explain how the model reached its conclusion) can hinder you in different ways. For one, it can make your regulatory approval process more complex. More importantly, it can become a real challenge when trying to drive adoption by physicians and healthcare professionals, who want to understand why the model gave a particular output. Thinking through explainability early can be a significant advantage; the sketch after this list shows one cheap check.
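
As one lightweight check on both fronts, permutation importance can reveal what a model actually leans on. This is a sketch on synthetic data, not a prescription for any particular model; if a clinically meaningless feature (say, a site identifier) dominates, the model is likely exploiting a spurious correlation:

```python
# Permutation importance: a cheap, model-agnostic sanity check.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)

# Rank features by how much shuffling them degrades held-out performance.
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```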


For a great read on the pitfalls of AI models in healthcare, see the following article from MIT Technology Review on models that attempted to tackle COVID. It includes some spectacular failures, such as this one:

"some AIs were found to be picking up on the text font that certain hospitals used to label the scans. As a result, fonts from hospitals with more serious caseloads became predictors of covid risk."

The following article is another good read (medical imaging focused and more technical).


A hybrid approach is often best

Combining explicit feature engineering with unsupervised or semi-supervised learning can be the winning formula. The explicit features create a more robust scaffold for your model, one that is less sensitive to materially insignificant changes in the data, while also improving potential explainability.
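
One way to realize this idea, sketched below with synthetic data and hypothetical column positions, is to route explicit, clinically grounded features and raw high-dimensional signal through separate branches of a single pipeline:

```python
# A minimal sketch of the hybrid approach: concatenate engineered clinical
# features with representations learned from raw inputs, then train one
# classifier on both. Column assignments here are illustrative assumptions.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Columns 0-2: engineered features (e.g. lab values, vitals-derived scores);
# columns 3+: raw high-dimensional signal the model learns from.
engineered = [0, 1, 2]
raw = list(range(3, 103))

features = ColumnTransformer([
    ("engineered", StandardScaler(), engineered),  # keep as-is, just scale
    ("learned", Pipeline([("scale", StandardScaler()),
                          ("pca", PCA(n_components=10))]), raw),
])

model = Pipeline([("features", features),
                  ("clf", LogisticRegression(max_iter=1000))])

X = np.random.randn(500, 103)
y = np.random.binomial(1, 0.3, size=500)
model.fit(X, y)
# The engineered branch stays interpretable: its coefficients map directly
# to named clinical features, which helps with explainability.
```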


Involve clinicians in the process

Involving clinicians in your development is a great way to evaluate the different potential biases and understand how significant they can be for your application. They are also a great resource for checking whether a model prediction makes sense: Is there a medical reason to believe there's signal in the data? If the model can demonstrate what influences the prediction, does it pass a "medical sense" check or is it simply a correlation? Can you rule it out if in doubt?


Prepare for ongoing improvements and monitor for drift

Even if you evaluate all potential sources of bias, plan for the unexpected. Build in mechanisms that let you monitor data characteristics in production and identify drift, both at initial deployment and over time.
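
As a starting point, a per-feature two-sample Kolmogorov-Smirnov test comparing a production batch against a training reference can flag distribution shifts. The feature names and threshold below are illustrative assumptions:

```python
# A hedged sketch of per-feature drift monitoring with scipy's KS test.
import numpy as np
from scipy.stats import ks_2samp

def drift_report(reference: np.ndarray, production: np.ndarray,
                 names: list[str], alpha: float = 0.01):
    for i, name in enumerate(names):
        stat, p_value = ks_2samp(reference[:, i], production[:, i])
        flag = "DRIFT?" if p_value < alpha else "ok"
        print(f"{name:<15} KS={stat:.3f} p={p_value:.4f} {flag}")

# Example: the production "age" distribution has shifted upward vs training.
rng = np.random.default_rng(0)
ref = rng.normal(loc=[55, 1.2], scale=[10, 0.3], size=(5000, 2))
prod = rng.normal(loc=[63, 1.2], scale=[10, 0.3], size=(800, 2))
drift_report(ref, prod, names=["age", "creatinine"])
```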


Expect additional adoption challenges

There is no doubt AI has the ability to transform many aspects of Healthcare and Healthcare Delivery. Introducing it in practice is not easy, though. Beyond the data and bias challenges discussed above, expect to face challenges with trust (explainability helps, but is not a panacea), workflow integration and adaptation, and business models, to name a few.


If you have found a significant Healthcare problem or need you can address with AI, pursue it with grit and perseverance, and reduce your risks by staying mindful of these pitfalls to improve your chances of making a big impact.
