Data science has the power to provide evidence for, and numerical confirmation of, social justice issues. Conscientious application of data science techniques and mindful analysis can reveal meaningful patterns and structures in data, particularly in the context of health care and environmental issues. Understanding and developing new mathematical techniques that allow the unbiased and privacy-preserving analysis of such data sets can be a crucial step toward solving structural problems.
The projects in this theme analyze and improve the robustness and fairness of data science techniques and demonstrate their applicability to health care and environmental data sets. Our goal is to advance the mathematical underpinnings of data science and provide proof-of-concept results on real-world challenges. Our outreach activities aim to communicate our results to the general public. Participants will investigate how to address the shortcomings of current algorithms and existing data, and will learn about the sociopolitical implications of their results.
To facilitate interactions across teams and between participants and faculty, our site emphasizes cohort activities.
We involve participants in planning these activities and adapt them based on their background knowledge and project needs.
Mentor: Elizabeth Newman
Deep neural networks (DNNs) have become the workhorse for the classification of complex data; in particular, they achieve state-of-the-art performance on imaging tasks. Despite their success and extensive use, DNNs are known to exhibit significant bias, which can lead to life-altering outcomes (e.g., demographic bias in facial recognition software used to arrest a suspect). This project will explore adversarial training techniques to develop fairer DNNs that mitigate this inherent bias. Because adversarial training, often posed as a minimax problem, can be time-consuming, we will focus on accelerating training using second-order information.
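To make the minimax structure concrete, here is a minimal sketch of first-order adversarial training in PyTorch: an inner projected gradient ascent searches for a worst-case perturbation, and an outer descent step updates the network on the perturbed inputs. This is an illustration only; `model`, `loader`, and the attack hyperparameters (`eps`, `alpha`, `steps`) are placeholders, and the second-order methods the project targets would accelerate or replace these first-order steps.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.03, alpha=0.01, steps=10):
    """Inner maximization: find a worst-case perturbation within an
    L-infinity ball of radius eps around x via projected gradient ascent."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # ascent step on the loss
            delta.clamp_(-eps, eps)             # project back onto the eps-ball
        delta.grad.zero_()
    return delta.detach()

def adversarial_training_epoch(model, loader, optimizer):
    """Outer minimization: one pass of the minimax problem, updating the
    network on adversarially perturbed examples."""
    model.train()
    for x, y in loader:
        delta = pgd_attack(model, x, y)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        optimizer.step()
```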
Mentor: Yuanzhe Xi
In recent years, machine learning techniques have been increasingly deployed in real-world applications, especially in medicine and healthcare. They have been praised for the great promise they offer but have also been at the center of heated controversy. Recent reports identify the main clinical, social, and ethical risks posed by AI in healthcare: potential errors and patient harm; risk of bias and increased health inequalities; lack of transparency and trust; and vulnerability to hacking and data privacy breaches.
This summer, we aim to review the social consequences of AI models in healthcare and develop methods to minimize these risks and maximize the benefits of medical AI, such as increased transparency, explainability, and interpretability.
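As one concrete entry point to explainability, gradient-based saliency maps measure how sensitive a model's class score is to each input feature. The sketch below is a minimal, hedged example for a PyTorch classifier; `model`, `x`, and `target_class` are placeholders.

```python
import torch

def saliency_map(model, x, target_class):
    """Input-gradient saliency: the magnitude of the gradient of one
    class score with respect to each input feature (e.g., pixel)."""
    model.eval()
    x = x.detach().clone().requires_grad_(True)  # track gradients w.r.t. the input
    score = model(x)[0, target_class]            # scalar logit for a single example
    score.backward()
    return x.grad.abs().squeeze(0)               # larger values = more influential
```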
Mentor: Bree Ettinger
Low-level ozone is a harmful air pollutant that can cause adverse health effects like coughing and chest pain and can worsen preexisting conditions like asthma. Many low-income communities are located near sources of air pollution, so these communities are disproportionately impacted. Low-level ozone is difficult to model because it is not emitted directly into the air but is created by chemical reactions between oxides of nitrogen (NOx) and volatile organic compounds (VOCs) in the presence of sunlight. Rising temperatures due to climate change are also increasing low-level ozone levels.
Predicting low-level ozone can inform air quality forecasts and help determine which communities are most at risk. Functional linear regression models can be used to predict low-level ozone. In this project, we will study functional models based on bivariate splines over triangulations to approximate spatially distributed ozone measurements with a surface. We will explore methods not only for computing forecasts, but also for quantifying the uncertainty associated with our predictions and for identifying the spatial distribution of low-level ozone.
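To illustrate the surface-fitting idea, the sketch below triangulates a set of station locations and builds a piecewise-linear interpolant over the triangulation with SciPy; this is the degree-1 case of the bivariate splines we will study, which add higher polynomial degree and smoothness constraints. The station coordinates and ozone values here are synthetic placeholders.

```python
import numpy as np
from scipy.spatial import Delaunay
from scipy.interpolate import LinearNDInterpolator

rng = np.random.default_rng(0)

# Synthetic monitoring stations: (x, y) locations and ozone readings.
pts = rng.uniform(0.0, 1.0, size=(50, 2))
ozone = np.sin(3 * pts[:, 0]) + np.cos(3 * pts[:, 1]) + 0.05 * rng.standard_normal(50)

# Triangulate the stations and fit a piecewise-linear surface over the
# triangulation (the simplest bivariate spline).
tri = Delaunay(pts)
surface = LinearNDInterpolator(tri, ozone)

# Evaluate the fitted surface on a regular grid for mapping; points
# outside the convex hull of the stations return NaN.
gx, gy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
ozone_grid = surface(gx, gy)
```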
Mentors: Julianne Chung, Matthias Chung, Elizabeth Newman
Environmental factors such as poor air and water quality are highly correlated with disease and adverse health outcomes. These determinants of health and well-being are often directly related to social inequalities such as socioeconomic status, household composition and disability, minority status and language, and housing type and transportation. For instance, nitrogen dioxide (NO2) is a combustion byproduct that has been associated with multiple adverse health outcomes. Various methods have been proposed to obtain high-resolution NO2 models covering the entire contiguous US, thereby enabling predictions even in unmonitored areas. However, it remains unclear how these models of environmental triggers correlate with socioeconomic status.
In this project, we address the following questions: Based on existing data, can we predict social vulnerability maps? Where should monitoring stations be placed to assess NO2 levels with high accuracy, especially in locations identified as high-risk and vulnerable? How can these vulnerability maps be used to enable predictions with high certainty? We will use computational tools from mathematics, statistics, computer science, and data science to address these questions and more. Students will gain hands-on experience with data science, mathematical and atmospheric modeling, inverse problems, and uncertainty quantification.
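One way to make the station-placement question concrete is to model the NO2 field as a Gaussian process and greedily select sites that most reduce the posterior variance, weighted by a vulnerability score at each candidate location. The sketch below is an illustration under strong assumptions (a stationary squared-exponential kernel with a fixed length-scale); `candidates` and `weights` are placeholders, e.g., grid locations and social vulnerability index values.

```python
import numpy as np

def rbf_kernel(A, B, length=0.2):
    """Squared-exponential covariance between two sets of locations."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

def greedy_placement(candidates, weights, n_sensors=5, noise=1e-2):
    """Greedily pick sensor sites that minimize the vulnerability-weighted
    posterior variance of a Gaussian-process NO2 field."""
    K = rbf_kernel(candidates, candidates)
    chosen = []
    for _ in range(n_sensors):
        best, best_score = None, -np.inf
        for j in range(len(candidates)):
            if j in chosen:
                continue
            S = chosen + [j]
            Kss = K[np.ix_(S, S)] + noise * np.eye(len(S))
            Ks = K[:, S]
            # Posterior variance at every candidate, given sensors at S.
            var = np.diag(K) - np.einsum('ij,jk,ik->i', Ks, np.linalg.inv(Kss), Ks)
            score = -(weights * var).sum()  # higher = less weighted uncertainty
            if score > best_score:
                best, best_score = j, score
        chosen.append(best)
    return chosen
```

Weighting the variance by vulnerability directs monitoring effort toward high-risk communities, directly linking the placement criterion to the social vulnerability maps discussed above.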
Mentor: Nicole Yang
A standard assumption in classification theory is that the input distributions of the training and test sets are identical. In reality, it is common for a deployed model to encounter data that differs from the training data distribution ('in-distribution' data); such inputs are called 'out-of-distribution' data. Building a trustworthy machine learning system is therefore especially important in social justice problems, where the data distribution for under-represented groups may differ from the distribution used to train the model, which can result in overconfident yet wrong decisions or predictions.
This project focuses on developing neural networks that can differentiate data drawn from different distributions. We are interested in the following question: when no information is available at training time to determine what is out-of-distribution, how can we design a system that detects anomalous inputs? We will approach this problem using the Outlier Exposure method, in which an auxiliary data set, distinct from both the in- and out-of-distribution data, is introduced to improve the generalization ability of the neural network. We will further investigate how to mitigate the overconfidence commonly observed in out-of-distribution studies. To do this, we adopt the idea of the Joint Energy-based Model, in which an energy-based classifier is used. We will apply these techniques to problems in criminal justice, where facial recognition systems used for decision making are highly consequential. For example, facial data for under-represented groups may differ from the training data, which can give rise to unreliable detection and decision making in criminal investigations.
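For intuition, the energy score at the heart of the Joint Energy-based Model approach can be computed directly from a classifier's logits: in-distribution inputs tend to receive low energy, and anomalous inputs high energy. The sketch below is a minimal example; `model`, `x`, and the decision `threshold` (which would be calibrated on held-out in-distribution data) are placeholders.

```python
import torch

def energy_score(logits, temperature=1.0):
    """Energy of an input, computed from classifier logits: lower energy
    suggests in-distribution, higher energy suggests out-of-distribution."""
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

def flag_ood(model, x, threshold):
    """Flag inputs whose energy exceeds a threshold calibrated on
    held-out in-distribution data."""
    with torch.no_grad():
        return energy_score(model(x)) > threshold
```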