The Twitter Decahose is a compilation of a 10% sample of tweets. This dataset can be requested by eligible UM Principal Investigators and accessed by UM affiliates or sponsored affiliates. It is stored on the locker drive and can be accessed using the Great Lakes High-Performance Computing environment at UM. Currently, accessing the data outside of UM resources is not possible, and moving it offsite is strictly forbidden. MIDAS, CSCAR, and ARC together manage and support the use of this data repository, including the historical archive of Decahose tweets and ongoing collection from the Decahose.
Access to the Decahose includes the COVID-CORE Twitter Dataset. This dataset, one of the COVID-19 Social Media datasets created by the U-M School of Information (UMSI) and MIDAS, contains a sequential sample of Tweets that explicitly mention various synonyms, aliases, or hashtags of the COVID-19 disease, the SARS-CoV-2 virus, or the pandemic. The team curated a list of keywords to generate filtering queries; applying these queries to the Decahose stream retrieves millions of Tweets per month. The extracted Tweets start on January 1, 2020. COVID-19 datasets filtered for medical/health and social/economic impact will be available soon. Please contact the creators, Dr. Xuan Lu (firstname.lastname@example.org) and Dr. Qiaozhu Mei (email@example.com), for technical questions or special extraction needs. If you do not already have access to the Twitter Decahose, you will first need to request access.
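To illustrate the kind of keyword filtering described above, here is a minimal Python sketch. It is not the team's actual pipeline: the keyword list is a made-up placeholder (the curated list is not reproduced here), the function names are hypothetical, and the sketch assumes each record is a JSON object with a `text` field and an optional `entities.hashtags` list, as in the standard Twitter v1.1 tweet format.

```python
import json
import re

# Illustrative keywords only; the curated COVID-CORE list is not given here.
KEYWORDS = ["covid-19", "covid19", "coronavirus", "sars-cov-2"]

# One case-insensitive pattern that matches any keyword as a substring.
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def mentions_keyword(tweet):
    """Return True if the tweet text or any hashtag contains a keyword."""
    text = tweet.get("text") or ""
    entities = tweet.get("entities") or {}
    hashtags = [h.get("text") or "" for h in entities.get("hashtags") or []]
    return any(PATTERN.search(s) for s in [text, *hashtags])


def filter_tweets(lines):
    """Yield parsed tweets from an iterable of JSON lines that mention a keyword."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed records
        if isinstance(tweet, dict) and mentions_keyword(tweet):
            yield tweet
```

Because `filter_tweets` accepts any iterable of lines, it can be pointed at an open file handle and will process records one at a time without loading the whole file into memory. For data at Decahose scale, the PySpark tutorial listed below is likely the more practical route.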
Coderspaces Office Hours – Free analytical consulting
HPC Training Videos – Training videos on how to use the Great Lakes platform and other resources
Decahose with Great Lakes (GitHub) – Tutorial for using Twitter Decahose data with PySpark on Great Lakes
Decahose Filter (GitHub) – Tutorial for using the command-line interface with batch jobs