Data science has the power to provide evidence for, and numerical confirmation of, social justice issues. Conscientious application of data science techniques and mindful analysis can reveal meaningful patterns and structures in data, particularly in the context of health care and environmental issues. Understanding and developing new mathematical techniques that allow the unbiased and privacy-preserving analysis of such data sets can be a crucial step toward solving structural problems.
The projects in this theme analyze and improve the robustness and fairness of data science techniques and demonstrate their applicability to health care and environmental data sets. Our goal is to advance the mathematical underpinnings of data science and provide proof-of-concept results on real-world challenges. Our outreach activities aim to communicate our results to the general public. Participants will investigate how to address the shortcomings of current algorithms and existing data, and will learn about the sociopolitical implications of their results.
To facilitate interactions across teams and between participants and faculty, our site emphasizes cohort activities.
We involve participants in planning these activities and adapt them based on their background knowledge and project needs.
Mentor: Elizabeth Newman
Deep neural networks (DNNs) have become the workhorse for the classification of complex data; in particular, they achieve state-of-the-art performance on imaging tasks. Despite their success and extensive use, DNNs are known to exhibit significant bias, which can lead to life-altering outcomes (e.g., demographic bias in facial recognition software used to arrest a suspect). This project will explore adversarial training techniques to develop fairer DNNs that mitigate this inherent bias. Because adversarial training, often posed as a minimax problem, can be time-consuming, we will focus on accelerating training using second-order information.
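To make the minimax structure concrete, here is a minimal sketch of first-order adversarial training in PyTorch: an inner projected gradient ascent searches for a worst-case perturbation, and an outer descent step updates the network on the perturbed inputs. This is an illustration only; `model`, `loader`, and the attack hyperparameters (`eps`, `alpha`, `steps`) are placeholders, and the second-order methods the project targets would accelerate or replace these first-order steps.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.03, alpha=0.01, steps=10):
    """Inner maximization: find a worst-case perturbation within an
    L-infinity ball of radius eps around x via projected gradient ascent."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # ascent step on the loss
            delta.clamp_(-eps, eps)             # project back onto the eps-ball
        delta.grad.zero_()
    return delta.detach()

def adversarial_training_epoch(model, loader, optimizer):
    """Outer minimization: one pass of the minimax problem, updating the
    network on adversarially perturbed examples."""
    model.train()
    for x, y in loader:
        delta = pgd_attack(model, x, y)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        optimizer.step()
```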
Mentor: Yuanzhe Xi
In recent years, machine learning techniques have been increasingly deployed in real-world applications, especially in medicine and healthcare. They have been praised for the great promise they offer but have also been at the center of heated controversy. Recent reports identify the main clinical, social, and ethical risks posed by AI in healthcare: potential errors and patient harm; risk of bias and increased health inequalities; lack of transparency and trust; and vulnerability to hacking and data privacy breaches.
This summer, we aim to review the social consequences of AI models in healthcare and develop methods to minimize these risks and maximize the benefits of medical AI, such as increased transparency, explainability, and interpretability.
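As one concrete entry point to explainability, gradient-based saliency maps measure how sensitive a model's class score is to each input feature. The sketch below is a minimal, hedged example for a PyTorch classifier; `model`, `x`, and `target_class` are placeholders.

```python
import torch

def saliency_map(model, x, target_class):
    """Input-gradient saliency: the magnitude of the gradient of one
    class score with respect to each input feature (e.g., pixel)."""
    model.eval()
    x = x.detach().clone().requires_grad_(True)  # track gradients w.r.t. the input
    score = model(x)[0, target_class]            # scalar logit for a single example
    score.backward()
    return x.grad.abs().squeeze(0)               # larger values = more influential
```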
Mentor: Bree Ettinger
Low-level ozone is a harmful air pollutant that can cause adverse health effects like coughing and chest pain and can worsen preexisting conditions like asthma. Many low-income communities are located near sources of air pollution, so these communities are disproportionately impacted. Low-level ozone is difficult to model because it is not emitted directly into the air but is created by chemical reactions between oxides of nitrogen (NOx) and volatile organic compounds (VOCs) in the presence of sunlight. Rising temperatures due to climate change are also increasing low-level ozone levels.
Predicting low-level ozone can inform air quality forecasts and help determine which communities are most at risk. Functional linear regression models can be used to predict low-level ozone. In this project, we will study functional models based on bivariate splines over triangulations to approximate spatially distributed ozone measurements with a surface. We will explore methods not only for computing forecasts, but also for quantifying the uncertainty associated with our predictions and for identifying the spatial distribution of low-level ozone.
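To illustrate the surface-fitting idea, the sketch below triangulates a set of station locations and builds a piecewise-linear interpolant over the triangulation with SciPy; this is the degree-1 case of the bivariate splines we will study, which add higher polynomial degree and smoothness constraints. The station coordinates and ozone values here are synthetic placeholders.

```python
import numpy as np
from scipy.spatial import Delaunay
from scipy.interpolate import LinearNDInterpolator

rng = np.random.default_rng(0)

# Synthetic monitoring stations: (x, y) locations and ozone readings.
pts = rng.uniform(0.0, 1.0, size=(50, 2))
ozone = np.sin(3 * pts[:, 0]) + np.cos(3 * pts[:, 1]) + 0.05 * rng.standard_normal(50)

# Triangulate the stations and fit a piecewise-linear surface over the
# triangulation (the simplest bivariate spline).
tri = Delaunay(pts)
surface = LinearNDInterpolator(tri, ozone)

# Evaluate the fitted surface on a regular grid for mapping; points
# outside the convex hull of the stations return NaN.
gx, gy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
ozone_grid = surface(gx, gy)
```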
Mentors: Julianne Chung, Matthias Chung, Elizabeth Newman
Environmental factors such as poor air and water quality are highly correlated with disease and adverse health outcomes. These determinants of health and well-being are often directly related to social inequalities such as socioeconomic status, household composition and disability, minority status and language, and housing type and transportation. For instance, nitrogen dioxide (NO2) is a combustion byproduct that has been associated with multiple adverse health outcomes. Various methods have been proposed to obtain high-resolution NO2 models covering the entire contiguous US, thereby enabling predictions even in unmonitored areas. However, it remains unclear how these models of environmental triggers correlate with socioeconomic status.
In this project, we address the following questions: Based on existing data, can we predict social vulnerability maps? Where should monitoring stations be placed to assess NO2 levels with high accuracy, especially in locations identified as high-risk and vulnerable? How can these vulnerability maps be used to enable predictions with high certainty? We will use computational tools from mathematics, statistics, computer science, and data science to address these questions and more. Students will gain hands-on experience with data science, mathematical and atmospheric modeling, inverse problems, and uncertainty quantification.
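One way to make the station-placement question concrete is to model the NO2 field as a Gaussian process and greedily select sites that most reduce the posterior variance, weighted by a vulnerability score at each candidate location. The sketch below is an illustration under strong assumptions (a stationary squared-exponential kernel with a fixed length-scale); `candidates` and `weights` are placeholders, e.g., grid locations and social vulnerability index values.

```python
import numpy as np

def rbf_kernel(A, B, length=0.2):
    """Squared-exponential covariance between two sets of locations."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

def greedy_placement(candidates, weights, n_sensors=5, noise=1e-2):
    """Greedily pick sensor sites that minimize the vulnerability-weighted
    posterior variance of a Gaussian-process NO2 field."""
    K = rbf_kernel(candidates, candidates)
    chosen = []
    for _ in range(n_sensors):
        best, best_score = None, -np.inf
        for j in range(len(candidates)):
            if j in chosen:
                continue
            S = chosen + [j]
            Kss = K[np.ix_(S, S)] + noise * np.eye(len(S))
            Ks = K[:, S]
            # Posterior variance at every candidate, given sensors at S.
            var = np.diag(K) - np.einsum('ij,jk,ik->i', Ks, np.linalg.inv(Kss), Ks)
            score = -(weights * var).sum()  # higher = less weighted uncertainty
            if score > best_score:
                best, best_score = j, score
        chosen.append(best)
    return chosen
```

Weighting the variance by vulnerability directs monitoring effort toward high-risk communities, directly linking the placement criterion to the social vulnerability maps discussed above.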
Mentor: Nicole Yang
A standard assumption in classification theory is that the input distributions of the training and test sets are identical. In reality, it is common for a deployed model to encounter data that differs from the training data distribution ('in-distribution' data); such inputs are called 'out-of-distribution' data. Building a trustworthy machine learning system is therefore especially important in social justice problems, where the data distribution for under-represented groups may differ from the distribution used to train the model, which can result in overconfident yet wrong decisions or predictions.
This project focuses on developing neural networks that can differentiate data drawn from different distributions. We are interested in the following question: when no information is available at training time to determine what is out-of-distribution, how can we design a system that detects anomalous inputs? We will approach this problem using the Outlier Exposure method, in which an auxiliary data set, distinct from both the in- and out-of-distribution data, is introduced to improve the generalization ability of the neural network. We will further investigate how to mitigate the overconfidence commonly observed in out-of-distribution studies. To do this, we adopt the idea of the Joint Energy-based Model, in which an energy-based classifier is used. We will apply these techniques to problems in criminal justice, where facial recognition systems used for decision making are highly consequential. For example, facial data for under-represented groups may differ from the training data, which can give rise to unreliable detection and decision making in criminal investigations.
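For intuition, the energy score at the heart of the Joint Energy-based Model approach can be computed directly from a classifier's logits: in-distribution inputs tend to receive low energy, and anomalous inputs high energy. The sketch below is a minimal example; `model`, `x`, and the decision `threshold` (which would be calibrated on held-out in-distribution data) are placeholders.

```python
import torch

def energy_score(logits, temperature=1.0):
    """Energy of an input, computed from classifier logits: lower energy
    suggests in-distribution, higher energy suggests out-of-distribution."""
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

def flag_ood(model, x, threshold):
    """Flag inputs whose energy exceeds a threshold calibrated on
    held-out in-distribution data."""
    with torch.no_grad():
        return energy_score(model(x)) > threshold
```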