Ensuring the fairness of algorithms that predict patient disease risk
“To treat or not to treat?” is the question that clinicians continually ask themselves. To help them in their decision-making, some are turning to disease risk prediction models. These models predict which patients are more or less likely to develop a disease and therefore might benefit from treatment, based on demographic factors and medical data.
With the growth of these tools in the medical field, and particularly in this area of clinical guidance, researchers at Stanford and elsewhere are asking how to ensure the fairness of the algorithms underlying the models. Bias becomes a significant problem when models are not developed using data that reflect diverse populations.
In a new study, Stanford researchers reviewed important clinical guidelines for cardiovascular health that advise the use of a risk calculator to guide prescribing decisions for black women, white women, black men, and white men. The researchers looked at two ways that have been proposed to improve the fairness of calculator algorithms. One approach, known as group recalibration, readjusts the risk model for each subgroup of patients to better match the frequency of observed outcomes. The second approach, called equalized odds, seeks to ensure that error rates are similar for all groups. The researchers found that the recalibration approach overall produced the best match with guideline recommendations.
The findings underscore the importance of creating algorithms that take into account the full context relevant to the populations they serve.
“While machine learning holds great promise in medical and other social contexts, these technologies may exacerbate existing health inequities,” says Agata Foryciarz, a Stanford doctoral student in computer science and lead author of the study published in BMJ Health & Care Informatics. “Our results suggest that assessing the fairness of disease risk prediction models can make their use more responsible.”
In addition to Foryciarz, the researchers include senior author Nigam Shah, Chief Data Scientist for Stanford Health Care and a Stanford HAI faculty member; Google Researcher Stephen Pfohl; and Google Health Clinical Specialist Birju Patel.
The clinical guidelines evaluated in the study relate to the primary prevention of atherosclerotic cardiovascular disease. This condition is caused by fats, cholesterol, and other substances that build up as plaques on the walls of the arteries. The sticky plaques block blood flow and can lead to adverse effects, including strokes and kidney failure.
The guidelines, published by the American College of Cardiology and the American Heart Association, provide recommendations on when patients should start statins – drugs that lower levels of certain types of cholesterol that contribute to arterial buildup.
The atherosclerotic cardiovascular disease guidelines take into account medical measures such as blood pressure, cholesterol levels, diabetes diagnoses, smoking status, and treatment for hypertension, as well as demographics such as gender, age, and race. Based on this data, the guidelines suggest using a calculator that estimates a patient’s overall risk of developing cardiovascular disease within 10 years. Patients identified as being at intermediate or high risk of disease are advised to start statin therapy. For patients at borderline or low risk of disease, statin therapy might be unnecessary or undesirable given the potential side effects of the drugs.
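To make the calculator-to-decision step concrete, here is a minimal Python sketch of how a 10-year risk estimate might be mapped to the four risk categories named above. The cutoffs (5%, 7.5%, and 20%) are not stated in this article; they are an assumption based on the ACC/AHA primary prevention guideline, and the function names are hypothetical. It illustrates the bucketing only, not the risk calculator itself.

```python
def risk_category(ten_year_risk):
    """Map a 10-year risk estimate (a fraction between 0 and 1) to a category.

    The cutoffs below are assumed for illustration; they are not given in the article.
    """
    if not 0.0 <= ten_year_risk <= 1.0:
        raise ValueError("risk must be a fraction between 0 and 1")
    if ten_year_risk < 0.05:
        return "low"
    if ten_year_risk < 0.075:
        return "borderline"
    if ten_year_risk < 0.20:
        return "intermediate"
    return "high"


def statin_suggested(ten_year_risk):
    """Statin therapy is advised for intermediate- or high-risk patients."""
    return risk_category(ten_year_risk) in ("intermediate", "high")
```

Because the decision depends on which side of a cutoff an estimate falls, even a small miscalibration for one subgroup can change the recommendation for patients whose risk sits near a threshold.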
“If you as a patient are perceived to be at higher risk than you really are, you may be put on a statin that you don’t need,” Foryciarz says. “Then, on the other hand, if you’re predicted to be at low risk but really should take a statin, doctors might not put in place preventative measures that might have prevented heart disease later on.”
Clinical practice guidelines increasingly recommend that physicians use clinical risk prediction models for various conditions and patient populations. The proliferation of medical decision support calculators – for example on phones and other electronic devices used in clinical settings – means that these applications are often at clinicians’ fingertips.
“Clinicians are likely to encounter and use these algorithm-based decision support tools more and more, so it is important that designers try to ensure that the tools are as accurate and precise as possible,” says Foryciarz.
Refine risk assessment
For their study, Foryciarz and her colleagues used a cohort of more than 25,000 patients between the ages of 40 and 79, drawn from several large datasets. The researchers compared the actual incidence of atherosclerotic cardiovascular disease in patients with the predictions made by risk models. In these experiments, the researchers built models using the two approaches, group recalibration and equalized odds, then compared the estimates from the resulting calculators with those from a simple calculator with no fairness adjustment.
Recalibrating separately for each of the four subgroups involved running the model on a subset of each subgroup, comparing the resulting risk scores with the actual percentage of patients who developed disease, and then adjusting the underlying model for the larger subgroup. This approach succeeded in improving the model’s compatibility with the guidelines for low-risk patients. However, differences in error rates between subgroups emerged, especially at the high-risk end.
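The idea of recalibrating per subgroup can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the study's implementation: it adjusts only an intercept on the log-odds scale so that each group's average predicted risk matches its observed event rate. The function names, group labels, and toy data are all hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_intercept_shift(scores, outcomes, lo=-10.0, hi=10.0, iters=60):
    """Bisection search for a log-odds shift that makes the mean adjusted
    risk equal to the observed event rate. Scores must lie strictly in (0, 1)."""
    target = sum(outcomes) / len(outcomes)
    logits = [logit(s) for s in scores]
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        mean_pred = sum(sigmoid(z + mid) for z in logits) / len(logits)
        if mean_pred < target:
            lo = mid  # predictions still too low: shift upward
        else:
            hi = mid  # predictions too high: shift downward
    return (lo + hi) / 2.0


def recalibrate_by_group(records):
    """records: list of (group, original_score, outcome) with outcome 0 or 1.
    Returns a dict mapping each group to its fitted shift."""
    by_group = {}
    for group, score, outcome in records:
        scores, outcomes = by_group.setdefault(group, ([], []))
        scores.append(score)
        outcomes.append(outcome)
    return {g: fit_intercept_shift(s, o) for g, (s, o) in by_group.items()}


def apply_shift(score, shift):
    return sigmoid(logit(score) + shift)


# Toy data: the original model says 10% for group A, but 20% developed disease;
# it says 20% for group B, but only 10% developed disease.
toy_records = (
    [("A", 0.10, 1)] * 2 + [("A", 0.10, 0)] * 8
    + [("B", 0.20, 1)] * 1 + [("B", 0.20, 0)] * 9
)
shifts = recalibrate_by_group(toy_records)
adjusted_a = apply_shift(0.10, shifts["A"])  # roughly 0.20
adjusted_b = apply_shift(0.20, shifts["B"])  # roughly 0.10
```

In this toy case the adjustment moves group A's estimate from 10% up to about 20% and group B's from 20% down to about 10%, which is the sense in which recalibration brings predicted risk in line with observed outcomes for each group.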
The equalized odds approach, by contrast, required the construction of a new predictive model that was constrained to produce equalized error rates across populations. In practice, this approach achieves similar false positive and false negative rates across all populations. A false positive refers to a patient who was identified as high risk and would be put on a statin but did not develop atherosclerotic cardiovascular disease, while a false negative refers to a patient who was identified as low risk but developed atherosclerotic cardiovascular disease and probably would have benefited from taking a statin.
This equalized odds approach ultimately skewed the decision threshold levels for the different subgroups. Compared to the group recalibration approach, using the calculator constructed with equalized odds in mind would have led to more under- and over-prescription of statins and would have failed to prevent some adverse outcomes.
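The error rates that equalized odds tries to balance can be measured with a short Python sketch. This shows only the evaluation side (per-group false positive and false negative rates at a decision threshold), not the constrained model training used in the study. The 7.5% threshold, the function names, and the toy data are assumptions made for illustration.

```python
def error_rates(records, threshold=0.075):
    """records: iterable of (group, predicted_risk, developed_disease).
    Returns per-group false positive and false negative rates."""
    counts = {}
    for group, risk, outcome in records:
        c = counts.setdefault(group, {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
        flagged = risk >= threshold  # flagged patients would be offered a statin
        if flagged and outcome:
            c["tp"] += 1
        elif flagged and not outcome:
            c["fp"] += 1  # treated, but did not develop disease
        elif not flagged and outcome:
            c["fn"] += 1  # not treated, but developed disease
        else:
            c["tn"] += 1
    rates = {}
    for group, c in counts.items():
        negatives = c["fp"] + c["tn"]
        positives = c["fn"] + c["tp"]
        rates[group] = {
            "fpr": c["fp"] / negatives if negatives else float("nan"),
            "fnr": c["fn"] / positives if positives else float("nan"),
        }
    return rates


def max_gap(rates, key):
    """Largest between-group difference for one error rate ('fpr' or 'fnr')."""
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)


# Hypothetical toy data for two groups.
toy_records = [
    ("A", 0.10, 0), ("A", 0.10, 1), ("A", 0.05, 0), ("A", 0.05, 1),
    ("B", 0.10, 1), ("B", 0.10, 1), ("B", 0.02, 0),
    ("B", 0.03, 0), ("B", 0.09, 0), ("B", 0.01, 0),
]
rates = error_rates(toy_records)
fpr_gap = max_gap(rates, "fpr")
fnr_gap = max_gap(rates, "fnr")
```

A model satisfying equalized odds would drive both gaps toward zero. The study's finding is that forcing this equality can distort the decision thresholds for individual subgroups.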
The gain in accuracy with group recalibration requires additional time and effort to adjust the original model rather than leaving it as is, although this is a small price to pay for better clinical results. An additional caveat is that dividing a population into subgroups increases the chances of creating a sample that is too small to assess risk reliably within a subgroup, while also reducing the ability to extend the model’s predictions to other subgroups.
Overall, algorithm designers and clinicians should keep in mind which measures of fairness to use for evaluation and which, if any, to use for model fitting. They should also understand how a model or calculator will be used in practice and how incorrect predictions could lead to clinical decisions with adverse health effects. Raising awareness of potential biases and further developing fairness approaches for algorithms can improve outcomes for everyone, Foryciarz notes.
“While it’s not always easy to identify which of many subgroups to focus on, it’s better to consider some subgroups than to consider none at all,” Foryciarz says. “Developing algorithms to serve a diverse population means that the algorithms themselves must be developed with this diversity in mind.”
This is part of a series on AI in healthcare.
Stanford HAI’s mission is to advance AI research, education, policy, and practice to improve the human condition.