Annual Ethical AI Symposium

April 8, 2024 8:45 AM - 5:00 PM

Michigan League
Ballroom (2nd Floor),
911 N. University Ave.,
Ann Arbor, MI 48109

Overview

With each passing day, AI technologies play a more and more prominent role in our lives. AI is transforming fields including healthcare, business, science, art, technology, transportation, and more. It is reaching into nearly every aspect of our broader society. Ensuring that we are developing, deploying, and evaluating AI applications ethically and responsibly is imperative — as is providing equal access to these tools.

Amid the current AI boom, researchers, industry leaders, and policymakers need to work together to identify the best practices for studying, developing, and regulating AI in a responsible and ethical manner. MIDAS is excited to be partnering with Rocket Companies to host experts from academia, the private sector, and government, as well as U-M researchers and Future Leaders Summit attendees to showcase their research, discuss important research opportunities and identify gaps, and foster collaborations that will help inform a more responsible, ethical, and accessible future for AI in our society.

Please note that event photography will be in use. Feel free to reach out to [email protected] with any questions or comments.

Schedule

8:45 AM – 9:00 AM

Opening Remarks

Dr. Jing Liu, Executive Director, Michigan Institute for Data and AI in Society, University of Michigan

9:00 AM – 10:30 AM

Keynote Speaker – Min Kyung Lee, Assistant Professor, School of Information, University of Texas, Austin

Opportunities and Challenges in Participatory AI

10:30 AM – 11:30 AM

Break

11:30 AM – 12:15 PM

A Conversation on AI Policy and Regulation

Bill de Blasio, Harry A. and Margaret D. Towsley Policymaker in Residence, Gerald R. Ford School of Public Policy, University of Michigan; former mayor, New York City

Moderator: Merve Hickok, Lecturer, School of Information, University of Michigan; Responsible Data and AI Advisor, MIDAS; Founder, AIEthicist.org; President and Research Director, Center for AI & Digital Policy.

12:15 PM – 2:00 PM

Lunch & Poster Session 1

📍Vandenberg Room, Michigan League

Explore research projects that showcase the vast, multidisciplinary applications for ethical AI and data science, from engineering to healthcare, social sciences, and beyond. Discover the cutting-edge research that is laying the groundwork for an ethical, AI-driven future and meet the scientists at the forefront.

This poster session will feature research from the 2024 Cohort from the Future Leaders Summit, comprised of ~40 outstanding data science and AI researchers, postdocs, and PhDs from over 30 different institutions across the U.S. and Canada.

2:00 PM – 3:00 PM

Keynote Speaker – Michael Tjalve, Chief AI Architect, Tech for Social Impact at Microsoft Philanthropies; Assistant Professor, Linguistics, University of Washington

A Practical Approach to Ethical AI

3:00 PM – 3:30 PM

Eclipsing Unethical AI

Join us on Ingalls Mall (outside the League) to view the solar eclipse! We will be handing out special glasses for guests.

See also: “Totally Awesome: Your Guide to the Great American Solar Eclipse of 2024” – David Gerdes, Chair & Professor of Physics, University of Michigan

3:30 PM – 5:00 PM

Poster Session 2

📍Vandenberg Room, Michigan League

5:00 PM

Networking Reception

Light refreshments provided.

Speakers

Min Kyung Lee

Assistant Professor, School of Information, University of Texas, Austin

Bio

Min Kyung Lee is an assistant professor in the School of Information at the University of Texas at Austin. She has been a director of a Human-AI Interaction Lab since 2016. She is affiliated with the UT Austin Machine Learning Lab (one of the first NSF-funded national AI research institutes), Good Systems (a UT Austin 8-year Grand Challenge to design responsible AI technologies), and Texas Robotics. Previously, she was a research scientist in the Machine Learning Department at Carnegie Mellon University.

Dr. Lee has conducted some of the first studies that empirically examine the social implications of algorithms’ emerging roles in management and governance in society. She has extensive expertise in developing theories, methods and tools for human-centered AI and deploying them in practice through collaboration with real-world stakeholders and organizations. She developed a participatory framework that empowers community members to design matching algorithms that govern their own communities.

Her current research is inspired by and complements her previous work on social robots for long-term interaction, seamless human-robot handovers, and telepresence robots.

Opportunities and Challenges in Participatory AI

Abstract

As artificial intelligence (AI) continues to impact every aspect of our lives, it is crucial to ensure that its development aligns with the priorities and values of diverse communities and users. In this talk, I will present several case studies in which our research team explores different methods to i) understand the priorities and preferences of stakeholders such as gig workers, community members, and policymakers and ii) incorporate these insights into the design of AI systems, ranging from labor platforms to urban air mobility infrastructure. Drawing from these case studies, I will share reflections on the opportunities and challenges of participatory AI.

Bill de Blasio

Harry A. and Margaret D. Towsley Policymaker in Residence, Gerald R. Ford School of Public Policy, University of Michigan; former mayor, New York City

Bio

Bill de Blasio is an American political leader who served as the 109th mayor of New York City from 2014 to 2021. A member of the Democratic Party, he held the office of New York City Public Advocate from 2010 to 2013. De Blasio started his career as an elected official on the New York City Council, representing the 39th district in Brooklyn from 2002 to 2009.

As mayor, de Blasio led NYC through the Covid-19 pandemic, turning what was once a global epicenter into the safest city in the country.

In 2014, de Blasio created a groundbreaking initiative which ensured that early childhood education became a universal right in the five boroughs. The universal Pre-K and 3-K programs in NYC have become a national model.

During his tenure, NYC financed the preservation and construction of over 200,000 affordable homes, the most created by any administration in the City’s history. In 2019, de Blasio launched a first-in-the-nation, 6-point action plan to end long-term homelessness. “The Journey Home” initiative was designed to increase access to housing and health care in combination with rapid-response outreach efforts for homeless individuals living in the streets.

In fulfilling his campaign promise to end a “tale of two cities,” de Blasio implemented policies which successfully reduced income inequality among New Yorkers and fought alongside them to secure a $15 minimum wage for all workers.

In response to the growing climate crisis, de Blasio and the NYC Council passed the Climate Mobilization Act (or the NYC Green New Deal) to make NYC net-carbon-neutral by 2050, as well as groundbreaking legislation to reduce building emissions and to end fossil fuel use in new buildings.

Prior to being an elected official, de Blasio served as the campaign manager for Hillary Rodham Clinton’s successful senatorial campaign of 2000 and got his start in NYC government working for Mayor David Dinkins.

De Blasio graduated from New York University with a B.A. in Metropolitan Studies and from Columbia University with an M.A. in International Affairs.

Michael Tjalve

Chief AI Architect, Tech for Social Impact at Microsoft Philanthropies; Assistant Professor, Linguistics, University of Washington

Bio

Michael Tjalve is Chief AI Architect on the Tech for Social Impact team in Microsoft Philanthropies, where he works with nonprofits and humanitarian organizations around the world on building technology solutions that help them amplify their impact and address some of today’s biggest societal challenges. He is an Assistant Professor at the University of Washington, where he teaches AI in the humanitarian sector and ethical innovation, and he serves as tech advisor to Spreeha Foundation and World Humanitarian Forum.

A Practical Approach to Ethical AI

Abstract

Modern AI capabilities are playing an increasingly important role across both our personal and our professional lives. As we collectively explore novel ways that AI can be used, it’s more important than ever to pause… and think about how we’re using these technologies.

Rooted in real-world examples from the humanitarian sector, we’ll take a practical approach to balancing the potential for positive impact with being clear-eyed about the risks by proactively implementing mitigation strategies and robust AI policies.

Posters

Session 1

Responsible AI Team, Rocket Mortgage

Individual fairness methods depart from the idea that similar observations should be treated similarly by a machine learning model, circumventing some of the shortcomings of group fairness tools. Nevertheless, many existing individual fairness approaches are tailored to specific models and/or rely on a series of ad-hoc decisions to assess model bias. In this paper, we propose an individual fairness-inspired, inference-based bias detection pipeline. Our method is model-agnostic, suited for all data types, avoids commonly used ad-hoc thresholds and decisions in bias evaluation, and provides an intuitive scale to indicate how biased the assessed model is. We propose a model ensemble approach for our bias detection tool, consisting of: (i) building a proximity matrix with random forests based on features and output; (ii) inputting it into a Bayesian network method to cluster similar observations; (iii) performing within-cluster inference to test the hypothesis that the model is treating similar observations similarly; and (iv) aggregating the cluster tests with multiple hypothesis test correction. In addition to providing a single statistical p-value for the null hypothesis that the model is unbiased based on individual fairness, we further create a scale that measures the amount of bias against minorities carried by the model of interest, making the overall p-value more interpretable to decision-makers.
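
As an illustration of step (iv) only, the following minimal Python sketch combines per-cluster p-values with a multiple hypothesis test correction. It is not the authors' implementation: the abstract does not name the correction used, so the Holm procedure, the function names, and the example p-values are all assumptions made for illustration.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order (assumed correction)."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is scaled by (m - k), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def overall_p_value(p_values):
    """Smallest adjusted p-value: evidence against 'no bias in any cluster'."""
    return min(holm_adjust(p_values))


# Hypothetical within-cluster p-values, one per cluster of similar observations.
cluster_p = [0.01, 0.04, 0.03, 0.50]
print(holm_adjust(cluster_p))
print(overall_p_value(cluster_p))
```

A small overall value would suggest that at least one cluster of similar observations is being treated dissimilarly; other corrections (for example Bonferroni or Benjamini-Hochberg) could be swapped in under the same interface.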

Maryam Berijanian, Michigan State University

In the rapidly evolving field of digital pathology, the ethical implications of AI technologies are of high concern. This study introduces an innovative unsupervised many-to-many stain translation framework for histopathology images, leveraging an enhanced GAN model with an edge detector to preserve tissue structure while generating synthetic images. Our method addresses two critical ethical challenges in AI: privacy and the reliance on low-cost labor for image annotation. First, by utilizing artificially generated images, our approach circumvents the privacy issues inherent in using real patient data, thereby safeguarding individual confidentiality—an essential consideration in medical research. Second, the reliance on extensive annotated datasets for deep learning applications often implicates ethical concerns regarding the exploitation of low-cost labor for manual image annotation. Our framework mitigates this issue by generating high-quality, realistic synthetic images, reducing the dependency on manually annotated datasets. Empirical results underscore the effectiveness of our approach; incorporating generated images into the training datasets of breast cancer classifiers resulted in performance improvements, demonstrating the technical feasibility and ethical advantages of our method. This research not only contributes to the advancement of digital pathology through AI but also emphasizes the importance of ethical considerations in the development and application of AI technologies.

Isabela Bertolini Coelho, University of Maryland

Privacy is central to discussions surrounding data protection and ethical considerations in both survey methodology and AI. Understanding stakeholders’ attitudes, perceptions, and participation levels toward privacy is crucial to identifying the barriers to adopting formal privacy models in sample survey data, especially for official statistics. Large language models (LLMs) have emerged as powerful tools in various domains, including survey research. In this study, we present a comparative analysis between LLM-generated codifications and human-coded responses to open-ended questions regarding privacy. Based on results from a qualitative study conducted with experts on data privacy, our investigation delves into the similarities and disparities between codifications generated by LLMs and those crafted by human coders. Additionally, we examine the extent to which LLMs capture the contextual intricacies of privacy discussions, especially regarding the differentiation between what privacy means in the context of their work and as they experienced it in their personal lives. Furthermore, this study sheds light on the efficacy of LLMs in survey research, particularly in codifying complex concepts such as privacy. It contributes to ongoing discussions surrounding the role of AI in survey methodology.

Brooks Butler, Purdue University

The safe coordination of multi-agent systems presents a complex and dynamic research frontier, encompassing various objectives such as ensuring group coherence while navigating obstacles and avoiding collisions between agents. Expanding upon our prior work in distributed collaborative control for networked dynamic systems, we introduce an algorithm tailored for the formation control of multi-agent systems, considering individual agent dynamics, induced formation dynamics, and local neighborhood information within a predefined sensing radius for each agent. Our approach prioritizes individual agent safety through iterative communication rounds among neighbors, enforcing safety conditions derived from high-order control barrier functions (CBFs) to mitigate potentially hazardous control actions within the cooperative framework. Emphasizing explainable AI principles, our method provides transparent insights into decision-making processes via model-based methods and intentional design of individual agent safety constraints, enhancing the interpretability and trustworthiness of multi-agent system behavior.
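
To make the idea of a barrier-function safety condition concrete, here is a minimal Python sketch for a single agent in one dimension. It is a deliberate simplification and not the algorithm from the poster: it uses a first-order (not high-order) control barrier function, no neighbor communication, and hypothetical names and parameters (`safe_control`, `x_max`, `alpha`).

```python
def safe_control(x, u_nominal, x_max=1.0, alpha=2.0):
    """Filter u_nominal for dynamics dx/dt = u so that h(x) = x_max - x stays >= 0.

    The barrier condition dh/dt + alpha * h >= 0 becomes
    -u + alpha * (x_max - x) >= 0, i.e. u <= alpha * (x_max - x).
    The safe control closest to u_nominal is therefore a simple clamp.
    """
    u_bound = alpha * (x_max - x)
    return min(u_nominal, u_bound)


def simulate(x0=0.0, u_nominal=1.0, dt=0.01, steps=500):
    """Forward-Euler simulation of the agent under the safety filter."""
    x = x0
    for _ in range(steps):
        x += dt * safe_control(x, u_nominal)
    return x


# The nominal command keeps pushing toward the boundary x_max = 1.0,
# but the filtered trajectory approaches it without crossing.
print(simulate())
```

Far from the boundary the nominal command passes through unchanged; near the boundary the filter overrides it, which is what makes the safety intervention easy to inspect and explain.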

Lucius Bynum, PhD Candidate, Data Science, New York University

Counterfactuals and counterfactual reasoning underpin numerous techniques for auditing and understanding artificial intelligence (AI) systems. The traditional paradigm for counterfactual reasoning in this literature is the interventional counterfactual, where hypothetical interventions are imagined and simulated. For this reason, the starting point for causal reasoning about legal protections and demographic data in AI is an imagined intervention on a legally-protected characteristic, such as ethnicity, race, gender, disability, age, etc. We ask, for example, what would have happened had your race been different? An inherent limitation of this paradigm is that some demographic interventions, like interventions on race, may not translate into the formalisms of interventional counterfactuals. In this work, we explore a new paradigm based instead on the backtracking counterfactual, where rather than imagine hypothetical interventions on legally-protected characteristics, we imagine alternate initial conditions while holding these characteristics fixed. We ask instead, what would explain a counterfactual outcome for you as you actually are or could be? This alternate framework allows us to address many of the same social concerns, but to do so while asking fundamentally different questions that do not rely on demographic interventions.

César Claros, University of Delaware

This work presents a novel method for interpreting 3D convolutional neural networks (CNNs) that estimate clinically relevant attributes from 3D brain maps, aiming to address the challenge of interpretability in deep learning within healthcare. Unlike common image classification interpretability methods, such as GradCAM, which rely on per-instance explanations due to spatial variation, this approach leverages the consistent spatial registration of brain maps to compute dataset-level explanations. By organizing the network’s internal activations into a tensor and applying constrained tensor decomposition, the method identifies key spatial patterns and brain regions focused on during prediction. The technique uses reconstruction error to determine the tensor decomposition rank and employs linear models to link activation decompositions to target attributes. Applied to networks estimating chronological age from brain volume and stiffness maps obtained via MRI and T1-weighted MRE scans, the decomposition highlights brain areas known to change with age. This approach offers a means to interpret CNNs in brain mapping and insights into age-related brain structural changes, enhancing the understanding and trustworthiness of deep learning models in healthcare.

Anja Conev, Rice University

Peptide-HLA (pHLA) binding prediction is essential in screening peptide candidates for personalized peptide vaccines. Machine learning (ML) pHLA binding prediction tools are trained on vast amounts of data and are effective in screening peptide candidates. Most ML models report the ability to generalize to HLA alleles unseen during training (“pan-allele” models). However, the use of datasets with imbalanced allele content raises concerns about biased model performance. First, we examine the data bias of two ML-based pan-allele pHLA binding predictors. We find that the pHLA datasets overrepresent alleles from geographic populations of high-income countries. Second, we show that the identified data bias is perpetuated within ML models, leading to algorithmic bias and subpar performance for alleles expressed in low-income geographic populations. We draw attention to the potential therapeutic consequences of this bias, and we challenge the use of the term “pan-allele” to describe models trained with currently available public datasets.

Diamond Joelle Cunningham, MPH, Tulane University

Black/African American women disproportionately suffer from systemic lupus erythematosus (SLE), with higher prevalence, severity, and poorer outcomes compared to White counterparts. Appointment non-adherence contributes to racial disparities in health outcomes, with factors such as racial discrimination potentially leading to missed appointments among Black/African Americans. This study sought to examine whether racial discrimination in medical settings is associated with missed appointments among Black/African American women living with SLE. Data from the BeWELL Study (2015-2017) involved 438 Black women diagnosed with SLE in Atlanta. Appointment adherence was gauged by asking about missed appointments with their lupus doctor. Participants reported experiences of racial discrimination in medical care, with multivariable logistic regression used to analyze missed appointments in relation to discrimination. Controlling for SLE duration, disease severity (organ damage and disease activity), and other demographic, socioeconomic, and health-related characteristics, racial discrimination was significantly associated with missed appointments (Odds Ratio: 1.33, 95% Confidence Interval: 1.03-1.73). Results from this study suggest that racial discrimination in medical care may result in missed medical appointments among Black women living with SLE. Antiracist interventions at multiple points of engagement within medical systems, from scheduling to the clinical encounter, may enhance appointment adherence among Black/African American women living with SLE.
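
For readers less familiar with odds ratios, the minimal Python sketch below shows how an unadjusted odds ratio and its 95% Wald confidence interval are computed from a 2x2 table. It is a simplification: the study reports an adjusted estimate from multivariable logistic regression, whereas this sketch has no covariates, and the counts and the function name `odds_ratio_ci` are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    a: exposed, outcome present      b: exposed, outcome absent
    c: unexposed, outcome present    d: unexposed, outcome absent
    z: normal quantile (1.96 for a 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(odds ratio) for a 2x2 table.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, not data from the BeWELL Study.
estimate, lower, upper = odds_ratio_ci(40, 60, 30, 70)
print(f"OR = {estimate:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

An interval that excludes 1, as the reported 1.03-1.73 does, indicates an association that is statistically significant at the 5% level.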

Matthew R. DeVerna, PhD Candidate, Informatics, Indiana University, Bloomington

Fact checking can be an effective strategy against misinformation, but its implementation at scale is impeded by the overwhelming volume of information online. Recent artificial intelligence (AI) language models have shown impressive ability in fact-checking tasks, but how humans interact with fact-checking information provided by these models is unclear. Here, we investigate the impact of fact-checking information generated by a popular large language model (LLM) on belief in, and sharing intent of, political news in a preregistered randomized control experiment. Although the LLM performs reasonably well in debunking false headlines, we find that it does not significantly affect participants’ ability to discern headline accuracy or share accurate news. Subsequent analysis reveals that the AI fact-checker is harmful in specific cases: it decreases beliefs in true headlines that it mislabels as false and increases beliefs in false headlines that it is unsure about. On the positive side, the AI fact-checking information increases sharing intents for correctly labeled true headlines. When participants are given the option to view LLM fact checks and choose to do so, they are significantly more likely to share both true and false news but only more likely to believe false news. Our findings highlight an important source of potential harm stemming from AI applications and underscore the critical need for policies to prevent or mitigate such unintended consequences.

Majid Farhadloo, Department of Computer Science and Engineering, University of Minnesota, Twin Cities

High-risk applications of Geo-AI must show that their models are safe, transparent, and spatially lucid (i.e., explainable using spatial concepts) to end users. The goal of a spatially lucid artificial intelligence (AI) classification approach is to build a classifier to distinguish two classes (e.g., responder, non-responder) based on their spatial arrangements (e.g., spatial interactions between different point categories) given multi-category point data from two classes. This problem is societally important for many applications, such as generating clinical hypotheses for designing new immune therapies for cancer treatment. This problem is challenging due to an exponential number of category subsets which may vary in the strength of their spatial interactions. Most prior efforts using human-selected spatial association measures may not be sufficient for capturing the relevant spatial interactions (e.g., surrounded by) which may be of biological significance. In addition, the related deep neural networks are limited to category pairs and do not explore larger subsets of point categories. To overcome these limitations, we propose a Spatial-interaction Aware Multi-Category deep neural Network (SAMCNet) architecture and contribute novel local reference frame characterization and point pair prioritization layers for spatially explainable classification. Experimental results on multiple cancer datasets (e.g., MxIF) show that the proposed architecture provides higher prediction accuracy than baseline methods. A real-world case study demonstrates that the proposed work discovers patterns that are missed by the existing methods and has the potential to inspire new scientific discovery.

Emily Fletcher, PhD Candidate, Anthropology, Purdue University

Although archaeological field notebooks are created as a resource for future archaeologists to reference in their research, the labor required to digitize handwritten notes presents a barrier to their incorporation in state-of-the-art computational analyses. In this research, I explore whether image preprocessing can improve the accuracy of text extracted from handwritten field notebooks by Handwritten Text Recognition. I apply image preprocessing to scans of handwritten field notebooks from the 1970s excavations of the Gulkana Site, a pre-contact Northern Dene site in Alaska’s Copper River Basin. These documents contain important data regarding native copper innovation that occurred at the Gulkana Site, but their current state has prevented analysis of that data.

Neil S. Gaikwad, Massachusetts Institute of Technology

The rise of AI has brought about a significant transformation in how algorithms engage with societal values, reshaping computational systems and human societies alike. However, despite its widespread adoption, AI innovation often overlooks individuals confronting poverty and heightened public health risks, especially in the face of climate change. To tackle these sustainability challenges, I introduce Public Interest Computing research, which centers on Responsible AI and Algorithmic Alignment, aiming to redefine Human-AI collaboration rooted in social norms. Illustrating through both theoretical grounding and practical examples, I present methods for integrating ethics and values into human-AI systems for societal decision-making. Firstly, by employing new social and democratic learning mechanisms to facilitate ethical decision-making, with machine learning applied to preferences gathered from 1.3 million individuals. Secondly, by developing value-sensitive design mechanisms that enhance the agency of historically marginalized communities in algorithmic decision-making for climate change adaptation policy, including addressing pressing issues like farmer suicides affecting 300,000 individuals. Thirdly, by redesigning socially and ethically responsible AI data market systems with incentive-compatible interactions to address equity concerns in data ecosystems. Public Interest Computing prioritizes ethics in human-AI collaboration from the inception rather than as an afterthought, offering a pathway to design technologies that are not only computationally efficient but also fair, value-sensitive, and accessible for everyone around the world.

Katherine R. Garcia, Rice University

The success of autonomous vehicles (AV) depends on artificial intelligence (AI). AI is responsible for sensing the driving environment, and planning, navigating, and executing a path for the vehicle. However, human involvement is crucial to ensure AV safety, especially when AI fails. This study used a think-aloud methodology to study how drivers perceive AI capabilities in AVs when identifying different road-sign images. Participants were tasked with rating how both they and the AI would classify six unique road-sign images with four manipulation types (original/no manipulation, projected gradient descent cyberattack, physical cyberattack, and scrambled manipulation). To understand their reasoning, half of the participants were prompted to speak their thoughts aloud during the study, while the other half were not. The results showed that participants accurately perceived that the AI would correctly classify the original images and would not correctly classify the scrambled ones, as predicted. However, they overestimated the AI’s capabilities when handling cyberattacks, even when trying to discern the differences from the originals. Participants may perceive the AI to have similar capabilities to their own. These findings suggest that drivers may not appropriately trust or understand AI in completing critical tasks, displaying the need for more explainable AI in AVs.

Ryan Gifford, PhD Candidate, Integrated Systems Engineering, The Ohio State University

In this research, we propose the CNN Tree algorithm, an intrinsically explainable model for time-series classification. The CNN Tree leverages the explainable structure of a Decision Tree and the power of Deep Learning to extract discriminative features from raw data. Recent techniques for explainable time-series classification rely on post-hoc explanations, which are not faithful to the true decision processes of the model they are trying to explain. As an alternative, the CNN Tree is explainable by design and shows hierarchical decision processes using both important time ranges and variables. We tested the CNN Tree with one private and nine open-source datasets; the CNN Tree has accuracy better than or equivalent to that of state-of-the-art explainable AI models while providing faithful explanations.

Bhanu Teja Gullapalli, University of California, San Diego

Mobile sensor devices equipped to monitor electrophysiological signals provide information about various health metrics. However, there exists a significant gap in their applicability to substance use, despite well-documented medical research on the cycle of addiction and changes in mental and physical states. My research focused on building biomarkers to monitor addiction states in opioids and cocaine. Initially, I demonstrate that monitoring breathing and ECG signals of a cocaine-dependent person during a drug binge session provides information on states of drug craving and euphoria. Subsequently, I illustrate how the intrinsic relationship between these states can be leveraged by models to enhance predictions. Similarly, I utilize wearable signals from medical-grade devices to monitor opioid administration. I demonstrate that incorporating domain knowledge, particularly the pharmacokinetics of the drugs, into purely data-driven models can enhance the reliability of opioid monitoring. I observe that the performance of these models is highly dependent on the population group, based on their dependence and drug usage patterns. Consequently, I develop an opioid screening model to differentiate opioid misusers from prescription users using cognitive and psychophysiological data. The findings from my research represent an initial step towards building digital biomarkers for better understanding and treating substance use disorders.

Yifei Huang, PhD Candidate, Mathematics, University of Illinois, Chicago

In this project, we address the problem of designing experiments with discrete and continuous (mixed) factors under general parametric statistical models. We propose the ForLion algorithm to search for optimal designs under the D-criterion. Simulation results show that the ForLion algorithm will reduce the number of distinct experimental settings while keeping the highest possible efficiency.

Zach Jacokes, University of Virginia

Autism Spectrum Disorder (ASD) spans a wide array of phenotypic expressions that make it a difficult condition to study. Other factors complicating ASD research include a sex-wise diagnostic disparity (boys are almost four times more likely to receive an ASD diagnosis than girls), cultural biases around ASD traits, and the dataset imbalances these issues can cause. This study examines the extent of the selection bias present in an in-progress ASD data collection effort and the issues with drawing generalizable conclusions from this dataset. In particular, this dataset is subject to collider bias, whereby the population of interest is artificially sampled in a way that can affect both the exposures (independent variables) and the outcomes (dependent variables). When the exposures include such variables as neuroanatomical feature size and neuronal interconnectivity between brain regions and outcomes include performance on behavioral surveys, there exist several key factors along the causal pathway between these that clearly impact their association. This study examines how artificially selecting autistic participants with low needs (measured by autism severity score) can act as a collider between exposures and outcomes.

Lavender Jiang, New York University

Although open data accelerates research, machine learning for healthcare has limited open data due to concerns about patient privacy. The Health Insurance Portability and Accountability Act of 1996 (HIPAA) was created to improve data portability, and it allows disclosing “de-identified health information” via Safe Harbor, which requires removing 18 types of identifiers and ensuring the individuals cannot be re-identified. A conventional approach is to detect any tokens that are deemed relevant to HIPAA-protected identifiers and remove or replace those tokens appropriately. Since it is time-consuming to do so manually, people often view the detection part as a named entity recognition (NER) problem and remove the detected entities appropriately. However, annotators could miss implicit contextual identifiers, giving rise to the possibility that a de-identifier achieves perfect precision and recall, yet still produces re-identifiable notes. We formalize the de-identification problem using a probabilistic graphical model (PGM), and show that it is impossible to achieve perfect de-identification without losing all utility. Empirically, we de-identified clinical notes using NER-based de-identifiers, and finetuned a public BERT model to predict annotated demographic attributes from the de-identified notes. We show that it can recover gender, borough, year, month, income and insurance at better than random chance with as few as 1000 labelled examples. These predicted attributes can be further used for re-identifying patients. Using the fully finetuned predictions, the probability of being uniquely identified is around 3 in a thousand. Using the 1000-example-finetuned predictions, the probability of being uniquely identified is around 380 in a million.
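
One simple way to think about a "probability of being uniquely identified" is the share of records whose combination of attributes appears exactly once in the population. The Python sketch below computes that share; the abstract does not describe how the authors computed their figures, so this definition, the function name `unique_fraction`, and the toy records are assumptions for illustration only.

```python
from collections import Counter


def unique_fraction(records):
    """Fraction of records whose attribute combination occurs exactly once."""
    counts = Counter(records)
    unique = sum(1 for record in records if counts[record] == 1)
    return unique / len(records)


# Hypothetical (gender, borough, year) tuples, not real patient data.
records = [
    ("F", "Bronx", 2015),
    ("F", "Bronx", 2015),
    ("M", "Queens", 2016),
    ("F", "Queens", 2016),
]
print(unique_fraction(records))
```

Each additional attribute that can be predicted from a de-identified note splits the population into smaller groups, so the share of unique records can only stay the same or grow.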

Wenxin Jiang, Purdue University

Deep neural networks are being adopted as components in software systems. Creating and specializing deep neural networks from scratch has grown increasingly difficult as state-of-the-art architectures grow more complex. Following the path of traditional software engineering, deep learning engineers have begun to reuse pretrained models (PTMs) and fine-tune them for downstream tasks and environments. However, unlike in traditional software, where reuse practices and challenges have been extensively studied, the knowledge foundation for PTM ecosystems remains underdeveloped. My research addresses this gap through a series of defect studies, case studies, and interviews, aiming to unearth detailed insights into the challenges and practices in PTM ecosystems. Utilizing mining software repository techniques, I’ve extracted, analyzed, and interpreted the rich data within PTM packages. My work first adopts the methodologies from traditional software engineering to understand the challenges and practices of deep learning software. I have also published two open-source datasets of PTM packages, aiming to support further research on this problem domain. My work focuses on enhancing the trustworthiness and reusability of PTMs. This involves improving transparency through comprehensive metadata extraction, identifying potential defects within the ecosystem, and developing optimized model selection strategies to support reuse.

Đorđe Klisura, University of Texas, San Antonio

Relational databases are integral to modern information systems, serving as the foundation for storing, querying, and managing data efficiently and effectively. Advancements in large language modeling have led to the emergence of text-to-SQL technologies, significantly enhancing the querying and extracting of information from these databases, while also raising concerns about privacy and security. Our research explores extracting the database schema elements underlying a text-to-SQL model. Knowledge of the schema can make attacks such as SQL injection easier. To this end, we have developed a novel zero-knowledge framework designed to probe various database schema elements without access to the schema itself. The text-to-SQL models process specially crafted questions to produce an output that we use to uncover the structure of the database schema. We apply the framework to specialized text-to-SQL models fine-tuned on text-SQL pairs and to general-purpose language models (e.g., GPT-3.5). Our current results show an average recall of 0.83 and a precision of 0.79 for fine-tuned models in uncovering the schema. This research embeds ethical and responsible AI use considerations, recognizing the importance of transparency in AI-driven systems. This work precedes future experiments, where we will explore regenerating training data used by fine-tuned text-to-SQL systems.
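
As a rough illustration of how recall and precision figures like those above could be computed, the sketch below compares a set of recovered schema elements against the true schema. The table and column names are hypothetical, and the abstract does not say exactly how elements were matched, so exact string matching is an assumption here.

```python
def precision_recall(predicted, actual):
    """Set-based precision and recall over schema element names."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical true schema and hypothetical elements recovered by probing.
actual = {"users.id", "users.email", "orders.id", "orders.user_id"}
predicted = {"users.id", "users.email", "orders.id", "orders.total", "users.name"}

p, r = precision_recall(predicted, actual)
print(p, r)  # 0.6 0.75
```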

Eugene Kim, PhD Candidate, Computer Science and Engineering, University of Michigan (Presenting Author)
Ben Fish, Asst. Prof., Computer Science and Engineering, University of Michigan
Elizabeth Bondi-Kelly, Asst. Prof., Computer Science and Engineering, University of Michigan

Participatory methods for AI have enjoyed considerable recent interest as a technique to enable inclusive and democratic AI systems by incorporating participation by stakeholders and those without AI expertise. These methods draw from a variety of disciplines including participatory design, computational social choice, and deliberative democracy, and have considerable promise for ensuring that AI systems incorporate the concerns and needs of historically marginalized groups in AI. However, due to the heterogeneity of methods used for participatory AI, there are no standard ways to evaluate participatory mechanisms, and there is yet no consensus on best practices for designing participatory mechanisms for AI. In order to contribute to a better understanding of participatory AI methods, we analyze over thirty case studies of Participatory AI with the perspective of understanding who is actually enabled to participate in the design of AI systems. We identify common techniques participatory AI researchers and practitioners use to recruit non-researcher stakeholders, and discuss these techniques in the context of the different kinds of stakeholders and their relationships to the AI system, and the goals of including participation for the AI project. We also include discussion on how barriers to participation, including financial, expertise, and infrastructural barriers, can affect recruitment, and thus affect the people who are enabled to participate in the design of AI systems. Finally, we introduce a design for an open database for the research community to collaboratively document the techniques used for participatory AI research, which can be used by researchers to evaluate different participatory mechanisms, learn about what mechanisms have been successful in the field, and establish best practices.

Session 2

Olivia Krebs, Case Western Reserve University

Overall survival (OS) in glioblastoma (GB) patients has been observed to depend on patient sex and, in part, immunological differences between males and females. This study investigated the relationship between the tumor immune microenvironment and OS in GB. Sex-specific survival models were developed utilizing spatial organization features of inflammatory cells extracted from digitized images of hematoxylin and eosin-stained resected GB tumor tissue. The inflammatory cell-based measurements were used to construct three survival risk-stratification models for male, female, and combined (male + female) cohorts. Patient-specific risk scores derived from these survival models were assessed using Kaplan-Meier estimates. The risk groups stratified by the sex-specific survival models were analyzed for differential expression of relevant cancer biology and treatment response pathways. Our findings indicate that organizational histological features of inflammatory cells, when used to train separate models for male and female GB patients, may be independently prognostic of OS. These findings suggest the potential of sex-specific immune-based approaches for constructing more accurate, patient-centric risk-assessment models.

Jessica Leivesley, University of Toronto

Canada’s recreational fishery contributed $7.9 billion to the national economy in 2015, and in Ontario alone freshwater recreational and commercial fisheries represent a $2.2 billion industry. To maintain sustainable and resilient fisheries, managers must have accurate information on the current status of stock health, population size, and fish communities for many water bodies at a given time. Generally, this information is gathered through resource-intensive and lethal sampling methods. Current hydroacoustic methods can assess individual fish sizes, but species identities cannot be discerned. The recent development of wideband acoustic transducers, which emit a wide range of frequencies in a single ping, may allow more information on body form to be extracted and thus may aid in species identification. In this study, we created a labelled dataset of acoustic responses of two fish species by tethering individual fish under a transducer emitting 249 frequencies between 45 kHz and 170 kHz. We then applied three different bespoke machine learning algorithms (deep, recurrent, and residual neural networks) to acoustic backscatter measures at each frequency and tested their ability to correctly classify the two fish species. We found that on unseen data all three methods had over 85% balanced classification accuracy. Further, extracting SHAP values for the deep neural network showed that no single range of frequencies is important for distinguishing the species; rather, the most important frequencies are distributed across the range of frequencies used. Eventually, these algorithms can be integrated into current abundance or biomass models and allow users to propagate classification uncertainty into these models. Overall, the use of wideband acoustics in conjunction with machine learning techniques offers the potential to drastically reduce the resources needed and costs associated with monitoring fish stocks.

Yaqi Li, University of Oklahoma

For almost all scientific research, the accuracy and reliability of findings, as well as the performance of predictive models, depend upon the quality of the data used. As highlighted by Arias et al. (2020) in their study, even minor errors within research datasets can significantly impact the overall accuracy of results. However, within the child welfare area, the issue of data quality has been relatively overlooked. Nonetheless, the quality of child welfare outcomes is intricately tied to the quality of data, especially in the context of automated decision-making in service delivery. For instance, Predictive Risk Modeling (PRM), a predictive model utilized to automate decisions regarding child maltreatment, has faced criticism for generating biased decisions in practice, largely due to data fraught with errors. The forthcoming presentation will elucidate the results of data quality evaluations conducted on a nationwide child welfare database. Additionally, strategies to identify potential factors contributing to suboptimal data quality will be addressed. The presentation aims to address critical gaps in understanding and addressing data quality issues within the child welfare area, ultimately aiming to improve the effectiveness and fairness of decision-making in this area.

Tony Liu, University of Pennsylvania

The gold standard for identifying causal effects is the randomized controlled trial (RCT), but RCTs may not always be feasible to conduct. When treatments depend on a threshold, however, such as the blood sugar threshold for diabetes diagnosis, we can still sometimes estimate causal effects with regression discontinuities (RDs). In practice, implementing RD studies can be difficult, as identifying treatment thresholds requires considerable domain expertise. Furthermore, the thresholds may differ across subgroups (e.g., the blood sugar threshold for diabetes may differ across demographics), and ignoring these differences can lower statistical power. Finding the thresholds and to whom they apply is an important problem currently solved manually by domain experts, and data-driven approaches are needed when domain expertise is not sufficient. Here, we introduce Regression Discontinuity SubGroup Discovery (RDSGD), a machine-learning method that identifies statistically powerful and interpretable subgroups for RD thresholds. Using a medical claims dataset with over 60 million patients, we apply RDSGD to multiple clinical contexts and identify subgroups with increased compliance to treatment assignment thresholds. As treatment thresholds matter for many diseases and policy decisions, RDSGD can be a powerful tool for discovering new avenues for causal estimation.
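
For readers unfamiliar with regression discontinuities, here is a minimal Python sketch of a sharp RD estimate: fit a line on each side of the threshold within a bandwidth and take the difference of the fitted values at the threshold. This illustrates the basic RD idea only, not RDSGD itself; the data are synthetic and noise-free, and the function names, threshold, and bandwidth are assumptions chosen for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rd_estimate(xs, ys, threshold, bandwidth):
    """Jump in the fitted outcome at the threshold (sharp RD, local linear fit)."""
    # Center the running variable so each intercept is the fitted value at the threshold.
    left = [(x - threshold, y) for x, y in zip(xs, ys)
            if threshold - bandwidth <= x < threshold]
    right = [(x - threshold, y) for x, y in zip(xs, ys)
             if threshold <= x <= threshold + bandwidth]
    a_left, _ = fit_line(*zip(*left))
    a_right, _ = fit_line(*zip(*right))
    return a_right - a_left

# Synthetic data: a smooth trend plus a jump of 2.0 for units at or above x = 5.
xs = [i * 0.5 for i in range(21)]
ys = [1.0 + 0.2 * x + (2.0 if x >= 5.0 else 0.0) for x in xs]

effect = rd_estimate(xs, ys, threshold=5.0, bandwidth=3.0)
print(round(effect, 6))  # 2.0
```

When compliance with the threshold is imperfect, as in the setting the abstract describes, a fuzzy RD design would additionally scale this jump by the jump in treatment uptake at the threshold.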

Stephanie Milani, Carnegie Mellon University

Many recent breakthroughs in multi-agent reinforcement learning (MARL) require the use of deep neural networks, which are challenging for human experts to interpret and understand. However, existing work on interpretable reinforcement learning (RL) has shown promise in extracting more interpretable decision tree-based policies from neural networks, but only in the single-agent setting. To fill this gap, we propose the first set of algorithms that extract interpretable decision-tree policies from neural networks trained with MARL. The first algorithm, IVIPER, extends VIPER, a recent method for single-agent interpretable RL, to the multi-agent setting. We demonstrate that IVIPER learns high-quality decision-tree policies for each agent. To better capture coordination between agents, we propose a novel centralized decision-tree training algorithm, MAVIPER. MAVIPER jointly grows the trees of each agent by predicting the behavior of the other agents using their anticipated trees, and uses resampling to focus on states that are critical for its interactions with other agents. We show that both algorithms generally outperform the baselines and that MAVIPER-trained agents achieve better-coordinated performance than IVIPER-trained agents on three different multi-agent particle-world environments.

Harsh Parikh, Johns Hopkins University

Randomized controlled trials (RCTs) serve as the cornerstone for understanding causal effects, yet extending inferences to target populations presents challenges due to effect heterogeneity and underrepresentation. Our work addresses the critical issue of identifying and characterizing underrepresented subgroups in RCTs, proposing a novel framework for refining target populations to improve generalizability. We introduce an optimization-based approach, Rashomon Set of Optimal Trees (ROOT), to characterize underrepresented groups. ROOT optimizes the target subpopulation distribution by minimizing the variance of the target average treatment effect estimate, ensuring more precise treatment effect estimations. Notably, ROOT generates interpretable characteristics of the underrepresented population, aiding researchers in effective communication. Our approach demonstrates improved precision and interpretability compared to alternatives, as illustrated with synthetic data experiments. We apply our methodology to extend inferences from the Starting Treatment with Agonist Replacement Therapies (START) trial — investigating the effectiveness of medication for opioid use disorder — to the real-world population represented by the Treatment Episode Dataset: Admissions (TEDS-A). By refining target populations using ROOT, our framework offers a systematic approach to enhance decision-making accuracy and inform future trials in diverse populations.

Rahul Ramesh, University of Pennsylvania

We develop information geometric techniques to understand the representations learned by deep networks when they are trained on different tasks using supervised, meta-, semi-supervised and contrastive learning. We shed light on the following phenomena that relate to the structure of the space of tasks: (1) the manifold of probabilistic models trained on different tasks using different representation learning methods is effectively low-dimensional; (2) supervised learning on one task results in a surprising amount of progress even on seemingly dissimilar tasks; progress on other tasks is larger if the training task has diverse classes; (3) the structure of the space of tasks indicated by our analysis is consistent with parts of the WordNet phylogenetic tree; (4) episodic meta-learning algorithms and supervised learning traverse different trajectories during training but they fit similar models eventually; (5) contrastive and semi-supervised learning methods traverse trajectories similar to those of supervised learning. We use classification tasks constructed from the CIFAR-10 and ImageNet datasets to study these phenomena.

Ransalu Senanayake, Arizona State University

The deployment of physical embodied AI systems, such as autonomous vehicles, is rapidly expanding. At the heart of these systems, there are numerous computer vision and large language modules that directly influence the downstream decision-making tasks by considering the presence of nearby humans, such as pedestrians. Despite the high accuracy of these models on held-out datasets, the potential presence of algorithmic bias is challenging to assess. We discuss our ongoing efforts at the Laboratory for Learning Evaluation of autoNomous Systems (LENS Lab) in analyzing disparate impacts for groups with different genders, skin tones, body sizes, professions, etc. in large-scale deep neural networks, especially under physical perturbations.

Subhasree Sengupta, Clemson University

As Artificial Intelligence (AI) becomes increasingly ingrained into society, ethical and regulatory concerns become critical. Given the vast array of philosophical considerations of AI ethics, there is a pressing need to understand and balance public opinion and expectations of how AI ethics should be defined and implemented, such that it centers the voice of experts and non-experts alike. This investigation explores the subreddit r/AIethics through a multi-methodological, multi-level approach. The analysis yielded six conversational themes, sentiment trends, and emergent roles that elicit narratives associated with expanding implementation, policy, critical literacy, communal preparedness, and increased awareness towards combining technical and social aspects of AI ethics. Such insights can help to distill necessary considerations for the practice of AI ethics beyond scholarly traditions and how informal spaces (such as virtual channels) can and should act as avenues of learning, raising critical consciousness, bolstering connectivity, and enhancing narrative agency on AI ethics.

Nasim Sonboli, Brown University

The General Data Protection Regulation (GDPR) is designed to protect personal data from harm, with mandatory adherence within the European Union and varying levels of alignment elsewhere. Complying with GDPR is complex due to the potential contradictions within the regulation itself. Operationalizing the regulation in machine learning systems adds further complexity. Hence, it’s crucial to assess the feasibility of simultaneously achieving GDPR compliance in general and in machine learning systems, and to consider potential trade-offs if full alignment proves unattainable. In this research, we survey the current research on data minimization in machine learning. We investigate the relationship between data minimization, fairness, and accuracy. Few works have investigated data minimization in machine learning, and even fewer have examined the conflict between data minimization and other GDPR principles. Our long-term goal is to provide guidelines on how to operationalize data minimization in machine learning systems for computer scientists, practitioners, and researchers in academia and industry. We explore the existing tools for implementing these regulations in machine learning systems and the advantages and disadvantages of these tools. Additionally, we investigate the potential trade-offs among GDPR principles and provide a roadmap for navigating them. By exploring these critical aspects, we offer valuable insights for developing machine learning systems that comply with data protection regulations.

Tiffany Tang, University of Michigan

Machine learning algorithms often assume that training samples are independent. When data points are connected by a network, the network creates dependency between samples. This is both a challenge, because it reduces the effective sample size, and an opportunity to improve prediction by leveraging information from network neighbors. Multiple prediction methods taking advantage of this opportunity are now available. Many methods, including graph neural networks, are not easily interpretable, limiting their usefulness in the biomedical and social sciences, where understanding how a model makes its predictions is often more important than the prediction itself. Some are interpretable, for example, network-assisted linear regression, but generally do not achieve prediction accuracy comparable to that of more flexible models. We bridge this gap by proposing a family of flexible network-assisted models built upon a generalization of random forests (RF+), which both achieves highly competitive prediction accuracy and can be interpreted through feature importance measures. In particular, we provide a suite of novel interpretation tools that enable practitioners to not only identify important features that drive model predictions, but also quantify the importance of the network contribution to prediction. This suite of general tools broadens the scope and applicability of network-assisted machine learning for high-impact problems where interpretability and transparency are essential.

Shantanu Vyas, Texas A&M University, College Station

In the initial stages of design, ambiguity and uncertainty present significant challenges to designers. Although generative AI shows promise in addressing these challenges, its premature application risks hindering creative exploration and inhibiting reflective thinking, both integral to the design process. Our work proposes strategies to responsibly integrate LLMs into the design process, by fostering reflective thinking over immediate solution generation. By reframing the role of LLMs to prompt contextual questioning and surface latent concepts in design problems, we aim to support designers in generating novel ideas while preserving their creative autonomy. We suggest techniques for incorporating explainability into generative design processes, utilizing multi-modal models trained on design language and 3D design concepts to provide explicit rationales for generated design solutions. Through these techniques, our objective is to instill trust in designers regarding solutions generated by AI models and, more importantly, to stimulate reflective thinking processes. Our work seeks to comprehend the responsible utilization of AI to nurture human creativity and critical thinking in the design process without replacing it.

Guanchu Wang, Rice University

My doctoral research centers on responsible AI, a critical area that demands the infusion of trust throughout the AI lifecycle. Within this overarching theme, my research delves into explainable AI, which specializes in developing algorithms to explain the behaviors of deep neural networks faithfully. The overarching goal of this thesis is to make the decision-making process within deep neural networks understandable to humans, thereby facilitating the safe deployment of machine learning in high-stakes application scenarios. This abstract highlights two significant milestones from my research in explainable AI: 1) Developing Shapley Value Explanation for DNNs: my seminal work SHEAR focuses on accurately estimating the Shapley value to explain the DNN decision under a limited sampling budget. In our healthcare project, SHEAR is capable of precisely assessing the impact of gene-gene interaction on Alzheimer’s disease. 2) Explaining Large Language Models: we propose a generative explanation framework, xLLM, for explaining the outputs of large language models (LLMs). xLLM can faithfully explain most existing LLMs, such as ChatGPT, LLaMA, and Claude, ensuring trustworthy decision-making in AI-driven healthcare.
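
To illustrate what estimating Shapley values under a sampling budget involves, the sketch below uses generic permutation sampling in Python. It is not SHEAR, whose details are not given in the abstract; the toy value function, the weights, and the sample count are assumptions made for the example.

```python
import random

def shapley_monte_carlo(value_fn, n_features, n_samples=2000, seed=0):
    """Estimate Shapley values by averaging marginal contributions over random orderings."""
    rng = random.Random(seed)
    phi = [0.0] * n_features
    for _ in range(n_samples):
        order = list(range(n_features))
        rng.shuffle(order)
        coalition = set()
        previous = value_fn(frozenset(coalition))
        for i in order:
            coalition.add(i)
            current = value_fn(frozenset(coalition))
            phi[i] += current - previous  # marginal contribution of feature i
            previous = current
    return [p / n_samples for p in phi]

# Toy value function: additive weights plus a bonus when features 0 and 1 appear together.
WEIGHTS = [1.0, 2.0, 3.0]

def toy_value(coalition):
    bonus = 1.0 if {0, 1} <= coalition else 0.0
    return sum(WEIGHTS[i] for i in coalition) + bonus

phi = shapley_monte_carlo(toy_value, n_features=3)
print([round(p, 2) for p in phi])
```

For this toy game the exact Shapley values are 1.5, 2.5, and 3.0, because the interaction bonus is split evenly between features 0 and 1. The estimates approach those values as the number of sampled orderings grows, and they always sum to the value of the full coalition.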

Haoyu Wang, Purdue University

Haoyu Wang’s research addresses the critical challenge of democratizing AI, focusing on making AI more accessible through data and parameter efficiency, and ensuring trustworthiness by emphasizing fairness, robustness, and interpretability. His work introduces innovative model compression techniques that facilitate AI deployment on low-resource devices, enhancing global accessibility. Furthermore, his efforts in cross-lingual and multi-lingual understanding aim to overcome language barriers in AI use. By advocating for ethical AI, his research aligns technical advancements with societal needs, ensuring AI’s benefits are equitably distributed. This body of work represents a significant step towards accessible, trustworthy AI for all.

Galen Weld, University of Washington

Online communities are powerful tools to connect people and are used worldwide by billions of people. Nearly all online communities rely upon moderators or admins to govern the community in order to mitigate potential harms such as harassment, polarization, and deleterious effects on mental health. However, online communities are complex systems, and studying the impact of community governance empirically at scale is challenging because of the many aspects of community governance and outcomes that must be quantified. In this work, we develop methods to quantify the governance of online communities at web scale. We survey community members to build a comprehensive understanding of what it means to make communities ‘better,’ then assess existing governance practices and associate them with important outcomes to inform community moderators. We collaborate with communities to deploy our governance interventions to maximize the positive impact of our work, and, at every step of the way, we make our datasets and methods public to support further research on this important topic.

Siyu Wu, Pennsylvania State University

This research uniquely integrates ACT-R’s cognitive framework within LLMs to provide structure and clarity to their reasoning, enhancing decision-making transparency and explainability – a step not yet explored in current studies. We first highlight the disparity between Large Language Models (LLMs) and human decision-making, noting LLMs’ focus on rapid, intuitive processes and their limitations in complex reasoning and learning continuity. To address these shortcomings, we then propose integrating LLMs with the ACT-R cognitive architecture, a framework that models human cognitive processes. This integration aims to enhance LLMs with human-like decision-making and learning patterns by correlating ACT-R decision-making data with LLM embeddings. The architecture we propose has the potential to enable LLMs to make decisions and learn in ways that more closely mirror human cognition, addressing the critical challenge of aligning machine reasoning with human processes.

Yuchen Zeng, University of Wisconsin, Madison

Recently, there has been a significant increase in the development of large language models (LLMs), which are now extensively used in everyday life. However, the fairness and safety of these models have become significant concerns. Existing studies suggest that parameter-efficient fine-tuning (PEFT) can help alleviate the inherent biases present in LLMs. Our research aims to comprehensively understand PEFT’s capabilities through both experimental and theoretical lenses. We demonstrate that Low-Rank Adaptation (LoRA), a popular PEFT method, excels in adapting LLMs for non-language tasks, including processing tabular datasets, a crucial type for fair classification tasks, as evidenced by extensive experiments. Furthermore, we theoretically establish that LoRA can fine-tune a randomly initialized model into any smaller target model, showcasing the potential of PEFT. Through an in-depth exploration of PEFT’s practical applications and theoretical underpinnings, our work lays the foundation for future research aimed at enhancing the fairness and safety of LLMs via PEFT.
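
As background on the mechanism, LoRA keeps a pretrained weight matrix W frozen and learns a low-rank update, so the adapted weight is W + scale * (B A), where B and A are much smaller than W. The pure-Python sketch below shows that update for a tiny 3x3 matrix with rank 1; the matrices and the scale value are made up for illustration and do not come from the abstract.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, b, a, scale=1.0):
    """Return the adapted weight W + scale * (B @ A); W itself is left unchanged."""
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Frozen 3x3 weight (identity here) and a rank-1 update: B is 3x1, A is 1x3.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[1.0], [0.0], [2.0]]
A = [[0.5, 0.0, 1.0]]

adapted = lora_weight(W, B, A)
print(adapted)  # [[1.5, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 3.0]]

# Trainable parameters: 6 for B and A together, versus 9 for the full matrix.
print(len(B) * len(B[0]) + len(A) * len(A[0]))  # 6
```

The savings are negligible at this size but grow quickly: for a d x d matrix and rank r, the update needs 2dr trainable values instead of d squared.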

Yongmei Bai, PhD Candidate, Peking University Health Science Center; National Institute of Health Data Science, Peking University; Medical School, University of Michigan (Presenting Author)
Yongqun Oliver He, Assoc. Professor, Medical School, University of Michigan
Jian Du, Peking University Health Science Center; National Institute of Health Data Science, Peking University

Methods: Biomedical knowledge was automatically extracted from scientific literature into Subject-Predicate-Object (SPO) triples. The concepts and relationships of the SPO triples were standardized by the Unified Medical Language System (UMLS) encoding terminologies. A knowledge graph was generated in Neo4j through Python, which was further used to query for scientific questions of interest based on graph algorithms. Medical hypotheses were also formulated using the knowledge graph, aiding in advanced research design and interpretation of observational results. A list of medical hypotheses was generated, analyzed, and selectively validated by Mendelian randomization (MR). R was used in programming.

Results: A PubMed-derived knowledge graph was constructed to include 28,183 concepts and 251,628 relationships. The directional edges between two nodes represent the relationships between concepts. Relationship attributes include the PMIDs and number of source studies, as well as the supporting sentences from those sources and their number. Through graph-based queries, we proposed that in the context of lymphoma patients undergoing CAR-T therapy, one adverse event, Cytokine Release Syndrome (CRS), might affect another, Blood Coagulation Disorders. Due to the lack of evidence from existing studies, we validated our hypothesis using Mendelian randomization (MR) methods.

Impact: Our method of automatically developing the knowledge graph is reusable and applicable to other domains. Because knowledge graphs are structured, visual, and easy to query, they can help biomedical researchers propose reasonable scientific hypotheses. Our approach can effectively reuse existing medical research results and promote the understanding of the causal relations between two variables.

Dataset: The knowledge graph we have built in Neo4j using Python is available at https://github.com/baiym13/knowledge-graph-construction-with-Neo4j-by-Python
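
The idea of querying SPO triples for candidate hypotheses can be sketched without a graph database. The Python example below stores triples in a plain dictionary and lists two-hop paths from a starting concept. The concept names, predicates, and source identifiers are placeholders rather than real extracted knowledge, and the authors' actual pipeline uses Neo4j rather than this in-memory structure.

```python
from collections import defaultdict

def build_graph(triples):
    """Index (subject, predicate, object, source) triples by subject."""
    graph = defaultdict(list)
    for subject, predicate, obj, source in triples:
        graph[subject].append((predicate, obj, source))
    return graph

def two_hop_paths(graph, start):
    """List paths start -> middle -> end, which can suggest indirect relationships."""
    paths = []
    for pred1, middle, _ in graph.get(start, []):
        for pred2, end, _ in graph.get(middle, []):
            if end != start:
                paths.append((start, pred1, middle, pred2, end))
    return paths

# Placeholder triples (not real findings); the last field stands in for a source ID.
triples = [
    ("Concept A", "CAUSES", "Concept B", "source-1"),
    ("Concept B", "AFFECTS", "Concept C", "source-2"),
    ("Concept A", "TREATS", "Concept D", "source-3"),
]

graph = build_graph(triples)
print(two_hop_paths(graph, "Concept A"))
# [('Concept A', 'CAUSES', 'Concept B', 'AFFECTS', 'Concept C')]
```

A path like the one printed is only a candidate hypothesis; as the abstract notes, it still has to be validated by other means.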

Tiffany Parise, MA Student, Electrical and Computer Engineering, University of Michigan (Presenting Author)
Vinod Raman, PhD Candidate, Statistics, University of Michigan
Sindhu Kutty, Lecturer, Electrical Engineering & Computer Science, University of Michigan

Machine learning models are increasingly deployed to aid decisions with significant societal impact. Defining and assessing the degree of fairness of these models, therefore, is both important and urgent. One thread of research in Machine Learning (ML) aims to quantify the fairness of ML models using probabilistic metrics. To ascertain the fairness of a given model, many popular fairness metrics measure the difference in predictive power of that model across different subgroups of a population – typically, where one subgroup has historically been marginalized.

A separate thread of research aims to construct robust ML models. Intuitively, robustness may be understood as the ability of a model to perform well even in the presence of noisy data. Typically, robust models are trained by intentionally introducing perturbations in the data.

Our work aims to connect these two threads of research. We hypothesize that models trained to be robust are naturally more fair than those trained using standard empirical risk minimization. To what extent are fairness and robustness related? Do some notions of fairness and robustness have a stronger correlation than others? We investigate these questions empirically by setting up experiments to measure the relationship between these concepts.

To study trade-offs between robustness, fairness, and nominal accuracy, we use a probabilistically robust learning framework (Robey et al., 2022) to train classifiers with varying levels of robustness on real-world datasets. We then use widely used statistical metrics (Barocas et al., 2019) to evaluate the fairness of these models. Preliminary results indicate that probabilistically robust learning reduces nominal accuracy but increases fairness with respect to the evaluated metrics. The significance of such a trade-off would be the conceptualization of fairness in terms of robustness and the ability to increase model fairness without explicitly optimizing for fairness.
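
As an illustration of the kind of statistical fairness metric described above, the Python sketch below computes two common group gaps from binary predictions: the difference in selection rates and the difference in true positive rates. The abstract does not name the specific metrics used, so the choice of these two is an assumption, and the labels, predictions, and group assignments are invented.

```python
def rate(values):
    """Mean of a list of 0/1 values, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0

def group_gaps(y_true, y_pred, group):
    """Return (selection-rate gap, true-positive-rate gap) across groups."""
    selection, tpr = {}, {}
    for g in sorted(set(group)):
        idx = [i for i, gi in enumerate(group) if gi == g]
        selection[g] = rate([y_pred[i] for i in idx])
        tpr[g] = rate([y_pred[i] for i in idx if y_true[i] == 1])
    return (max(selection.values()) - min(selection.values()),
            max(tpr.values()) - min(tpr.values()))

# Invented example: two groups of four individuals each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

selection_gap, tpr_gap = group_gaps(y_true, y_pred, group)
print(selection_gap, tpr_gap)  # 0.5 0.5
```

A gap of zero on either measure means the model treats the groups identically on that criterion, so comparing these gaps across classifiers trained at different robustness levels is one way to examine the hypothesized relationship.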
