This collection includes datasets that MIDAS manages for the campus, as well as other U-M and external datasets that are of interest to the MIDAS research community. If you have questions, or if you have datasets that you would like to share with the data science community, please email midas-research@umich.edu.
Academic Data Science Alliance
COVID-19
The Academic Data Science Alliance is working with partners to pull together data and data science resources related to the COVID-19 pandemic. This is a living list of resources and we welcome additions, suggestions, and collaborations. Please send additions, corrections, comments, and suggestions to us using this feedback form.
CoreLogic
Social Science
CoreLogic aggregates data from individual, parcel-level real estate transactions and financial records. We have licensed access to Tax, Deed, and Foreclosure data at the parcel level for every county in the United States.
The dataset consists of multiple pipe-delimited text files organized into Tax, Deed and Foreclosure. Each file covers the whole US.
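As a rough illustration of working with pipe-delimited text, the following minimal sketch reads such a file with Python's standard `csv` module. The column names and values are hypothetical placeholders, not the actual CoreLogic schema, and the presence of a header row is an assumption.

```python
import csv
import io


def read_pipe_delimited(handle):
    """Yield each row of a pipe-delimited text stream as a dict keyed by the header row."""
    # QUOTE_NONE treats quote characters as ordinary data, so stray quotes
    # inside a field do not change how the line is split.
    reader = csv.DictReader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


# Hypothetical sample data; the real column names are not given in this listing.
sample = io.StringIO(
    "parcel_id|county_fips|assessed_value\n"
    "0001|26161|185000\n"
    "0002|26163|92000\n"
)
rows = list(read_pipe_delimited(sample))
```

Because each file covers the whole US, it is probably safer to open a real file with `open(path, newline="")` and iterate over the generator row by row rather than loading everything into a list.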
If you have any questions about the datasets, please contact a librarian.
CrowdTangle
Social Science
CrowdTangle is a public insights tool from Facebook that makes it easy to follow, analyze, and report on what’s happening across social media. CrowdTangle started a pilot program in 2019 to partner with researchers and academics and help them study critical topics such as racial justice, misinformation, and elections. In addition to launching an online application, we’ve built a new hub with information about all Facebook data sets that are available for independent research.
Facebook COVID-19 Symptom Surveys
COVID-19
Offering the symptom survey datasets to academic and nonprofit researchers with a privacy-minded approach enables experts to generate more impactful insights to aid public health responses. Facebook and partner universities created a centralized webpage for researchers with more information about the symptom surveys and how they can use the data for their research.
Facebook Data for Good
COVID-19
Facebook Data for Good has a number of tools and initiatives that can help organizations respond to the COVID-19 pandemic.
GAIA Dataset
Transportation
Didi Chuxing shares some of its anonymized data with the academic community. The open datasets include trajectory data, large-scale driving video data, and traffic travel index data.
Health Insurance Dataset
Healthcare
The Institute for Healthcare Policy and Innovation (IHPI) has more than 20 terabytes of data, from more than 113 million Americans, for researchers to study how healthcare works and how to make it better. IHPI’s data is provided primarily by large insurance companies in the form of administrative claims. These are proprietary datasets that cover both the commercial and private payer insurance sectors, and they give researchers a longitudinal accounting of the healthcare utilization patterns of millions of US patients.
For questions, please email ihpi-data@umich.edu.
ICPSR COVID-19 Data Repository
COVID-19
ICPSR has created a new archive for data examining the social, behavioral, public health, and economic impact of the novel coronavirus global pandemic. The COVID-19 Data Repository is a free, self-publishing option for any researcher or journalist who wants to share data related to COVID-19. The data will be available to any interested user for secondary analysis.
Lyft Dataset
Transportation
The open datasets include: 1) Logs of the movement of traffic agents (cars, cyclists, and pedestrians) that Lyft's autonomous fleet encountered on Palo Alto routes. 2) Raw camera and LiDAR sensor inputs as perceived by autonomous vehicles in a bounded geographic area.
MCity Data Garage
Transportation
Data Garage is an Mcity-maintained dataset catalog. The data is primarily vehicle-level sensor (LiDAR and camera) data that covers multiple geographical areas, high-volume road user intersections, and multiple weather conditions. U-M credentials are required to access the datasets.
Precision Health Analytics Platform
Healthcare
The Precision Health Analytics Platform is a collaborative research effort among physicians and researchers at the University of Michigan with the goal of harmonizing patient electronic medical records with genetic data to gain novel biomedical insights.
U-M COVID-CORE Twitter Dataset
COVID-19
The COVID-CORE dataset, one of the COVID-19 Social Media datasets created by the U-M School of Information (UMSI) and MIDAS, contains a sequential sample of Tweets that explicitly mention various synonyms, aliases, or hashtags of the COVID-19 disease, the SARS-CoV-2 virus, or the pandemic. The team curated a list of keywords to generate filtering queries. Applying these queries to the Decahose stream (~10% sequential sample) of Tweets retrieves millions of Tweets per month. The extracted Tweets start on January 1, 2020. COVID-19 datasets filtered for medical/health and social/economic impact will be available soon. Please contact the creators, Dr. Xuan Lu (luxuan@umich.edu) and Dr. Qiaozhu Mei (qmei@umich.edu), with technical questions or special extraction needs. If you currently have access to the Twitter Decahose, contact Kristin Burgard (burgardk@umich.edu) to access COVID-CORE. If you do not already have access to the Twitter Decahose, you will first need to request access.
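The keyword-based filtering described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the keyword list, the `text` field name, and the in-memory list of Tweets are all hypothetical, and the team's actual queries and Decahose tooling are not documented here.

```python
import re

# Hypothetical keywords; the team's curated list is not reproduced in this listing.
KEYWORDS = ["covid-19", "covid19", "coronavirus", "sars-cov-2", "#covid19"]


def build_pattern(keywords):
    """Compile one case-insensitive pattern that matches any of the keywords."""
    # Longer keywords go first so that the alternation prefers the most specific match.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile("|".join(escaped), re.IGNORECASE)


def filter_tweets(tweets, pattern):
    """Keep only the Tweets whose text mentions at least one keyword."""
    return [t for t in tweets if pattern.search(t["text"])]


# Made-up records standing in for a stream sample.
tweets = [
    {"id": 1, "text": "New COVID-19 guidance released today"},
    {"id": 2, "text": "Great weather for a bike ride"},
    {"id": 3, "text": "Reading about SARS-CoV-2 variants #COVID19"},
]
matched = filter_tweets(tweets, build_pattern(KEYWORDS))
```

Substring matching of this kind is simple but can over-match; a production query would likely also need to handle word boundaries, languages, and retweets.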
UNIZIN Data Platform Dataset
Education
The UDP dataset includes most of the Canvas data found in the Unizin Data Warehouse, paired with extensive student demographic data from the U-M Student Information System. The UDP is now available in production; however, additional Canvas and demographics data are still being added to the schema.
UNIZIN Data Warehouse Dataset
Education
The UDW dataset consists of teaching and learning data created through the use of the Canvas LMS by U-M faculty, staff, and students from 2014 until the present. Researchers can use this data to help answer questions about student learning and learning outcomes. Administrators can use the data to track program outcomes. For teaching faculty, this data provides insights into teaching methodologies and instructional resource utilization.
Waymo Open Dataset
Transportation
The Waymo Open Dataset consists of high-resolution sensor (LiDAR and camera) data collected by Waymo self-driving cars in a wide variety of conditions. The company is releasing this dataset publicly to aid the research community in making advancements in machine perception and self-driving technology.
Waze For Cities Dataset
Transportation
Waze for Cities Data provides access to Waze's anonymized user data and traffic data.