Research Spotlight: Empowering new research with social media data

Social media is used by millions of people every single day. Vast amounts of data are constantly being generated, and thanks to data science and high-performance computing, researchers can now parse and use those data to enable new research.

MIDAS has made a concerted effort to support social media research in a variety of ways, from funding innovative projects through our pilot funding program to managing and providing access to several large-scale Twitter datasets.

The pilot projects that we funded use social media data in many innovative ways to address significant societal challenges, from identifying communities vulnerable to climate change, to understanding how students interact in and outside of classrooms, to tracking how misinformation starts and spreads.

Some pilot projects have since become major research initiatives. Dr. Libby Hemphill, Associate Professor of Information, was awarded a 2021 PODS Grant for her project "Ensuring FAIRness in Social Media Archives." This work set out to develop new standards for the ethical use of social media data and to create new infrastructure in the form of an archive built specifically for the challenges of social media data. Hemphill directly credits the PODS program with enabling her to obtain funding and partnership from Meta to create SOMAR (Social Media Archive) at ICPSR, which will help researchers around the world use social media data.

Beyond pilot funding, MIDAS also enables and manages access to certain datasets in partnership with Twitter. Hosted on the U-M Turbo Drive, these Twitter datasets are open to all U-M affiliates (with some restrictions):

  • Decahose
    • Over 80 terabytes of data from a 10% random sample of tweets dating back to 2009
    • Available to U-M principal investigators and lab members
  • COVID-core Decahose subset
    • Created by the U-M School of Information (UMSI) and MIDAS, this subset contains a sequential sample of tweets that explicitly mention various synonyms, aliases, or hashtags of the COVID-19 disease, the SARS-CoV-2 virus, or the pandemic. The team curated a list of keywords to generate the filtering queries.
  • Enahose
    • 1% random sample of tweets dating back to August 2021 using the Twitter API
    • Available to faculty, staff, and students for research and course projects
  • US-India Politicians dataset
    • Developed by Libby Hemphill, this dataset features social media data covering over 8,000 elected officials and candidates for public office from the U.S. and over 30,000 politicians and celebrities who talk about politics in India.
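
For readers curious how a keyword-filtered subset like the COVID-core Decahose subset might be produced, here is a minimal sketch in Python. It is an illustration only, not the team's actual pipeline: the keyword list, the function names, and the assumption that tweets arrive as one JSON object per line with a "text" field are all hypothetical, since the curated keyword list and data format are not given here.

```python
import json
import re

# Hypothetical keywords for illustration; the team's curated list is not shown in this article.
COVID_KEYWORDS = ["covid", "covid19", "covid-19", "coronavirus", "sars-cov-2", "pandemic"]

# One case-insensitive pattern. Each keyword is escaped, may carry a leading '#',
# and must not be embedded inside a longer word.
PATTERN = re.compile(
    r"(?<!\w)#?(?:" + "|".join(re.escape(k) for k in COVID_KEYWORDS) + r")(?!\w)",
    re.IGNORECASE,
)


def mentions_covid(text):
    """Return True if the text explicitly mentions any keyword or hashtag."""
    return PATTERN.search(text) is not None


def filter_tweets(lines):
    """Yield parsed tweets whose text matches, preserving the input order.

    Assumes one JSON object per line with a "text" field (an assumption
    about the file layout, not a documented format).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        if mentions_covid(tweet.get("text", "")):
            yield tweet
```

Because the filter keeps matching tweets in their original order, the result is a sequential sample of the input rather than a re-randomized one. A real pipeline would likely also need to handle retweets, quoted text, and non-English keywords.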

Access to these Twitter datasets has enabled more than 70 projects at U-M from 37 different PIs across 12 schools and colleges. The Decahose dataset alone has already enabled more than $3.2 million in external funding.

For more information about MIDAS datasets and to request access, please visit the MIDAS website.