
MIDAS Data Science for Music Challenge Initiative announces funded projects

By | Data, General Interest, Happenings, News, Research

From digital analysis of Bach sonatas to mining data from crowdsourced compositions, researchers at the University of Michigan are using modern big data techniques to transform how we understand, create and interact with music.

Four U-M research teams will receive support for projects that apply data science tools like machine learning and data mining to the study of music theory, performance, social media-based music making, and the connection between words and music. The funding is provided under the Data Science for Music Challenge Initiative through the Michigan Institute for Data Science (MIDAS).

“MIDAS is excited to catalyze innovative, interdisciplinary research at the intersection of data science and music,” said Alfred Hero, co-director of MIDAS and the John H. Holland Distinguished University Professor of Electrical Engineering and Computer Science. “The four proposals selected will apply and demonstrate some of the most powerful state-of-the-art machine learning and data mining methods to empirical music theory, automated musical accompaniment of text and data-driven analysis of music performance.”

Jason Corey, associate dean for graduate studies and research at the School of Music, Theatre & Dance, added: “These new collaborations between our music faculty and engineers, mathematicians and computer scientists will help broaden and deepen our understanding of the complexities of music composition and performance.”

The four projects represent the beginning of MIDAS’ support for the emerging Data Science for Music research. The long-term goal is to build a critical mass of interdisciplinary researchers for sustained development of this research area, which demonstrates the power of data science to transform traditional research disciplines.

Each project will receive $75,000 over a year. The projects are:

Understanding and Mining Patterns of Audience Engagement and Creative Collaboration in Large-Scale Crowdsourced Music Performances

Investigators: Danai Koutra and Walter Lasecki, both assistant professors of computer science and engineering

Summary: The project will develop a platform for crowdsourced music making and performance, and use data mining techniques to discover patterns in audience engagement and participation. The results can be applied to other interactive settings as well, including developing new educational tools.

Understanding How the Brain Processes Music Through the Bach Trio Sonatas
Investigators: Daniel Forger, professor of mathematics and computational medicine and bioinformatics; James Kibbie, professor and chair of organ and university organist

Summary: The project will develop and analyze a library of digitized performances of Bach’s Trio Sonatas, applying novel algorithms to study the music structure from a data science perspective. The team’s analysis will compare different performances to determine features that make performances artistic, as well as the common mistakes performers make. Findings will be integrated into courses both on organ performance and on data science.

The Sound of Text
Investigators: Rada Mihalcea, professor of electrical engineering and computer science; Anıl Çamcı, assistant professor of performing arts technology

Summary: The project will develop a data science framework that will connect language and music, developing tools that can produce musical interpretations of texts based on content and emotion. The resulting tool will be able to translate any text—poetry, prose, or even research papers—into music.

A Computational Study of Patterned Melodic Structures Across Musical Cultures
Investigators: Somangshu Mukherji, assistant professor of music theory; Xuanlong Nguyen, associate professor of statistics

Summary: This project will combine music theory and computational analysis to compare the melodies of music across six cultures—including Indian and Irish songs, as well as Bach and Mozart—to identify commonalities in how music is structured cross-culturally.

The Data Science for Music program is the fifth challenge initiative funded by MIDAS to promote innovation in data science and cross-disciplinary collaboration, while building on existing expertise of U-M researchers. The other four are focused on transportation, health sciences, social sciences and learning analytics.

Hero said the confluence of music and data science was a natural extension.

“The University of Michigan’s combined strengths in data science methodology and music makes us an ideal crucible for discovery and innovation at this intersection,” he said.

Contact: Dan Meisler, Communications Manager, Advanced Research Computing
734-764-7414, dmeisler@umich.edu

Interdisciplinary Committee on Organizational Studies (ICOS) Big Data Summer Camp, May 14-18

By | Data, Educational, General Interest, Happenings, News
Social and organizational life are increasingly conducted online through electronic media, from emails to Twitter feeds to dating sites to GPS phone tracking. The traces these activities leave behind have acquired the (misleading) title of “big data.” Within a few years, a standard part of graduate training in the social sciences will include a hefty dose of “using big data,” and we will all be using terms like API and Python.
This year ICOS, MIDAS, and ARC are again offering a one-week “big data summer camp” for doctoral students interested in organizational research, with a combination of detailed examples from researchers; hands-on instruction in Python, SQL, and APIs; and group work to apply these ideas to organizational questions.  Enrollment is free, but students must commit to attending all day for each day of camp, and be willing to work in interdisciplinary groups.

The dates of the camp are all day May 14th-18th.

https://ttc.iss.lsa.umich.edu/ttc/sessions/interdisciplinary-committee-on-organizational-studies-icos-big-data-summer-camp-3/ 

U-M will hold “hackathon” for health communication, with help from Sanjay Gupta and family

By | Educational, General Interest, Happenings, News

Disease outbreaks. Medical discoveries. Natural disasters. The hope — and hype — that can come with new treatment options.

Sanjay Gupta, M.D. has covered them all in his years as medical correspondent for CNN. He’s seen over and over the crucial role of communication in responding to the health effects of every kind of crisis. He’s also seen the delays, missed opportunities and even tragedy that can come from poor communication of health information.

That’s why he and his wife Rebecca have teamed up with his alma mater, the University of Michigan, to support an effort to bring new ideas and tools to health communication.

Applications are now open for participation in the marathon event, March 23-25, focused on innovation for sharing information in times of crisis and beyond.


U-M, Army research on robot/human interaction published in ECN Magazine

By | General Interest, News

Research being jointly conducted by the Army Research Laboratory and the University of Michigan focused on improving communications between humans and robots was recently published in ECN Magazine.

The research team developed a series of yes or no questions, borrowing from the game 20 Questions, which may lead to new techniques for machine-machine and machine-human interactions.

The U-M research team consists of Hye Won Chung, Lizhong Zheng and MIDAS co-director Alfred Hero.

U-M launches Data Science Master’s Program

By | Educational, General Interest, Happenings, News

The University of Michigan’s new, interdisciplinary Data Science Master’s Program is taking applications for its first group of students. The program is aimed at teaching participants how to extract useful knowledge from massive datasets using computational and statistical techniques.

The program is a collaboration between the College of Engineering (EECS), the College of Literature, Science, and the Arts (Statistics), the School of Public Health (Biostatistics), the School of Information, and the Michigan Institute for Data Science.

“We are very excited to be offering this unique collaborative program, which brings together expertise from four key disciplines at the University in a curriculum that is at the forefront of data science,” said HV Jagadish, Bernard A. Galler Collegiate Professor of Electrical Engineering and Computer Science, who chairs the program committee.

“MIDAS was a catalyst in bringing faculty from multiple disciplines together to work towards the development of this new degree program,” he added.

MIDAS will provide students in this program with interdisciplinary collaborations, intellectual stimulation, exposure to a broad range of practice, networking opportunities, and space on Central Campus to meet for formal and informal gatherings.

For more information, see the program website at https://lsa.umich.edu/stats/masters_students/mastersprograms/data-science-masters-program.html, and the program guide (PDF) at https://lsa.umich.edu/content/dam/stats-assets/StatsPDF/MSDS-Program-Guide.pdf.

Applications are due March 15.

MDST Places in the Parkinson’s Biomarker Challenge

By | MDSTPosts, MDSTProjects

Participants: Junhao Wang, Arya Farahi and Xinlin Song (Challenge 1); Yi-Lun Wu, Chun-Yu Hsiung and Xinlin Song (Challenge 2)

Parkinson’s disease (PD) is a degenerative disorder of the central nervous system that mainly affects the motor system. Currently, there is no objective test to diagnose PD, and the bedside examination by a neurologist remains the most important diagnostic tool. The examination assesses motor symptoms such as shaking, rigidity, slowness of movement and postural instability. However, these motor symptoms begin to occur only at a very late stage. Smartphones and smart watches have sensitive sensors (accelerometer, gyroscope, and pedometer) that can track the user’s motion more frequently than clinical examinations and at much lower cost. While the movement information is recorded by the sensors, the raw sensor data are hard to interpret and, on their own, give limited help in PD diagnosis.

In the Parkinson’s Biomarker Challenge, we were tasked with extracting useful features from time series accelerometer and gyroscope data. The data for Challenge 1 consist of ~35,000 records collected from ~3,000 participants using a phone app in their daily lives. The final goal is to predict whether a participant has Parkinson’s disease or not. The data for Challenge 2 consist of records from ~20 patients doing different tasks (such as drinking water, folding towels, and assembling nuts and bolts). The goal is to predict how severe the limb action tremor is.

The general method we used in both challenges was to generate multiple features from the time series sensor data and perform feature selection to get the top features. Finally, a machine learning model was built on the top features. The details of the methods we used can be found here:

Challenge 1: https://www.synapse.org/#!Synapse:syn10894377/wiki/470036

Challenge 2: https://www.synapse.org/#!Synapse:syn11317207/wiki/486357

The highest ranking the team received was 4th place in Challenge 2.
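
The general approach described above (summarize each sensor trace with several features, then keep the most informative ones) can be illustrated with a minimal sketch. This is not the team's actual pipeline, which is documented at the Synapse links above. The feature set, the correlation-based ranking, and all function names here are assumptions chosen only for illustration.

```python
import math
import statistics


def extract_features(signal):
    """Summarize one sensor trace (a list of floats) with a few scalar features.

    The feature choice is illustrative, not the team's actual feature set.
    """
    mean = statistics.fmean(signal)
    std = statistics.pstdev(signal)
    # Mean absolute change between consecutive samples (a crude "jerkiness" measure).
    mean_abs_diff = statistics.fmean(abs(b - a) for a, b in zip(signal, signal[1:]))
    # Sign changes of the mean-centered signal, a rough proxy for oscillation rate.
    centered = [x - mean for x in signal]
    zero_crossings = sum(1 for a, b in zip(centered, centered[1:]) if a * b < 0)
    return {
        "mean": mean,
        "std": std,
        "mean_abs_diff": mean_abs_diff,
        "zero_crossings": zero_crossings,
    }


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def rank_features(rows, labels):
    """Order feature names by absolute correlation with the label, strongest first.

    `rows` is a list of feature dicts (one per record); `labels` is a list of 0/1.
    """
    scores = {
        name: abs(pearson([row[name] for row in rows], labels))
        for name in rows[0]
    }
    return sorted(scores, key=scores.get, reverse=True)
```

A model would then be trained on the top-ranked features only. In practice the ranking should be computed on training data alone, so that the selection step does not leak information from the evaluation set.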

Student data science competition winners visit Quicken Loans headquarters in Detroit

By | Educational, General Interest, MDSTPosts, News

Earlier this year, three Michigan Data Science Team (MDST) members, winners of the Quicken Loans (QL) Lending Strategies Prediction Challenge, traveled to Detroit to visit QL headquarters, accept their prizes, and present their findings to the company’s Data Science team.

Back row left to right: Reddy Rachamallu, Alexandr, Alex, Mark Nuppnau, Brian Ball
Front row left to right: Jingshu Chen, Patrick, Alex’s wife Kenzie, Yvette Tian, Mike Tan, and Catherine Tu.

 

Alexander Zaitzeff, a graduate student in the Applied and Interdisciplinary Mathematics program, won first place; Alexandr Kalinin, a Bioinformatics graduate student, earned second; and Patrick Belancourt, a graduate student in Climate and Space Sciences and Engineering, took third.

The goal of the competition was to create a model that would predict whether potential clients would end up getting a mortgage based on the loan product originally offered to them. In order to create this model, each participant was given access to proprietary de-identified financial data from recent QL clients. The accuracy of their models was then evaluated on one month of client data.

Alexander Zaitzeff

“Every time I participate in a competition I try out a new technique,” Zaitzeff said. “MDST puts me in competitions with other U-M students who I can team up with and learn from.”

“This was a very valuable competition because it gives people experience working with real datasets, on actual problems that companies work on day to day,” said Jonathan Stroud, organizational chair of MDST.

Brian Ball, a data scientist at QL and U-M alum, said the input from MDST students gained through the competition helped confirm the company’s hope that “our system is predictable from a mathematical standpoint.”

“In that regard, we can use the results produced and the methods used to drive good decisions to most benefit our clients,” he added. “We view this as a total success as it was our hypothesis — and underlying hope — from the beginning.”

About 20 people from QL’s Data Science team, as well as vice presidents of the Business Intelligence unit, gathered to hear how the MDST winners developed their models.

The winning entry was an “ensemble model,” in which several models are synthesized into one predictive framework.

Finding that so many different kinds of models performed similarly was a confirmation that “the data tells the story,” Ball said.

“Allowing for each technique to contribute more strongly to the final score in areas where the model type performs well (referred to as “blending” or “stacking”) is an especially strong method and one we should consider moving forward,” he said.
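
The blending idea can be illustrated with a deliberately simplified sketch. It assigns each model a single global weight based on its validation error, whereas the approach Ball describes lets the weights vary by region of the data, which would need per-segment weights or a second-stage model. This is not the winning entry's method; the Brier-score weighting and the model names in the usage comment are assumptions made for illustration.

```python
def brier(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def blend_weights(val_preds, val_labels):
    """Give each model a weight proportional to 1 / (validation Brier score).

    `val_preds` maps a model name to its predicted probabilities on a
    held-out validation set. Weights are normalized to sum to 1.
    """
    inverse = {
        name: 1.0 / (brier(probs, val_labels) + 1e-9)
        for name, probs in val_preds.items()
    }
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def blend(preds, weights):
    """Weighted average of each model's predicted probabilities."""
    names = list(preds)
    n = len(preds[names[0]])
    return [sum(weights[m] * preds[m][i] for m in names) for i in range(n)]


# Usage with two hypothetical models:
# weights = blend_weights({"gbm": [...], "logit": [...]}, validation_labels)
# final = blend({"gbm": [...], "logit": [...]}, weights)
```

Fitting the weights on a held-out set rather than the training set matters here, because a model that overfits would otherwise receive an inflated weight.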

The competition began in September and ran until the end of the Fall semester. Over 70 students competed in this challenge, including both graduates and undergraduates from several schools and departments across the University.

MDST typically runs two or three competitions each year — the current competition involves predicting the value of NFL free agents, and is being conducted in partnership with the Baltimore Ravens. For more information, please visit MDST’s webpage: midas.umich.edu/mdst

MDST Competes in the Midwest Undergraduate Data Analytics Competition

By | MDSTPosts, MDSTProjects

Authors: Weifeng Hu and Divyansh Saini

The Competition

We represented the Michigan Data Science Team (MDST) in the Midwest Undergraduate Data Analytics Competition hosted by MinneAnalytics in Minnesota.

We were provided with insurance claims data for people diagnosed with Type II diabetes. At the novice level, the goal was to use this data to find meaningful patterns in age group, gender and geolocation for patients with Type II diabetes. This information would then be used to generate cost-saving mechanisms.

 

Experience

The first step in working on this challenge was to understand the dataset. We had medical claims from health care facilities, pharmacies and laboratories for patients who have Type II diabetes. The files in the dataset are described below:

medical_training.csv: claims when patients visited a hospital and related data

confinement_training.csv: claims for when patients were confined to the hospital

rx_training.csv: pharmacy claims for drugs prescribed to the patients

labresult_training.csv: claims for laboratory experiments

member_info.csv: information about the gender, age and location of patients

Through our meetings with MDST members, we realized that most teams would be approaching this problem with the purpose of predicting the diagnosis or finding patterns in the comorbidity.

To stand out, we decided to focus on finding the sub-populations with the highest cost after day 0, the day they were diagnosed with Type II diabetes.

 

Methodology

We performed k-means clustering on claims in medical_training.csv to find generalized patterns among different groups.

– Data Preprocessing:

The entries in medical_training.csv contain medical claims for each hospital visit by a patient. A patient will have multiple entries if he or she visited the hospital more than once. Each entry has at most five ICD-9-CM diagnosis codes for the visit. Each code contains five characters that are mostly digits, e.g. “12345”. These codes can be grouped into 19 categories of diseases, each covering a range of codes.

The first preprocessing step we performed was to “shrink” the number of rows. We used patient IDs to group every diagnosis under its patient. After that, we constructed 0/1 indicator feature vectors of length 19. Each column represents a category of disease, and the value in that column is 1 if the patient was diagnosed with a disease in that category. Each row of the resulting feature matrix contains the diagnosis information for one patient.
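
The preprocessing step can be sketched as follows. This is not our original competition code. The input layout (patient ID plus a list of code strings per claim) and the function names are assumptions, and the category boundaries follow the commonly listed ICD-9 chapters (17 numeric ranges plus the supplementary E and V code families), which should be checked against an official ICD-9-CM table before use.

```python
from collections import defaultdict

# Upper bound (first three digits) of each numeric ICD-9 chapter, e.g.
# 001-139, 140-239, 240-279, ... 800-999. Assumed boundaries; verify
# against an official code table.
CHAPTER_UPPER_BOUNDS = [139, 239, 279, 289, 319, 389, 459, 519, 579,
                        629, 679, 709, 739, 759, 779, 799, 999]
E_CODE_INDEX = len(CHAPTER_UPPER_BOUNDS)       # 17
V_CODE_INDEX = len(CHAPTER_UPPER_BOUNDS) + 1   # 18
NUM_CATEGORIES = len(CHAPTER_UPPER_BOUNDS) + 2 # 19


def category_index(code):
    """Map one diagnosis code string (e.g. "25000") to a category index 0..18."""
    code = code.strip().upper()
    if code.startswith("E"):
        return E_CODE_INDEX
    if code.startswith("V"):
        return V_CODE_INDEX
    prefix = int(code[:3])  # the first three digits identify the chapter
    for index, upper in enumerate(CHAPTER_UPPER_BOUNDS):
        if prefix <= upper:
            return index
    raise ValueError(f"unrecognized diagnosis code: {code!r}")


def build_indicator_vectors(claims):
    """Collapse per-visit claims into one 0/1 vector of length 19 per patient.

    `claims` is an iterable of (patient_id, [diagnosis codes]) pairs, one
    pair per hospital visit.
    """
    vectors = defaultdict(lambda: [0] * NUM_CATEGORIES)
    for patient_id, codes in claims:
        for code in codes:
            vectors[patient_id][category_index(code)] = 1
    return dict(vectors)
```

Because repeated visits only ever set an entry to 1, the vector records whether a patient was ever diagnosed in a category, not how often.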

– Hyperparameter Tuning

After converting the data into feature vectors, we can now perform k-means clustering. K-means clustering partitions observations into a finite number (k) of clusters, or groups, in which each observation belongs to the cluster with the nearest mean. However, we still need to decide the number of clusters k. We want the number of clusters that results in a small intracluster distance but does not overfit the data. To do this we used the “elbow heuristic”: plot the cost of k-means (the sum of the intracluster distances) against k, and choose the k value that has a significant drop in cost before it and no significant drop after it. In our graph, where the x-axis is the value of k and the y-axis is the cost of k-means clustering, we can see that k=19 is a good choice.
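
The elbow heuristic can be sketched with a small, dependency-free k-means. This is an illustrative implementation, not the code we ran for the competition. The initialization (random sample of the points), the iteration cap, and the function names are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm. Returns (centroids, cost).

    The cost is the sum of squared distances from each point to its
    nearest centroid, i.e. the quantity plotted on the elbow curve.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        new_centroids = [
            [sum(col) / len(members) for col in zip(*members)] if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    cost = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, cost


def elbow_costs(points, ks, seed=0):
    """Run k-means for every k in `ks` and return {k: cost} for plotting."""
    return {k: kmeans(points, k, seed=seed)[1] for k in ks}
```

Plotting the returned costs against k and looking for the point where the curve flattens gives the elbow. Since k-means depends on its starting centroids, running each k with several seeds and keeping the lowest cost gives a smoother curve.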


Result

From the data analysis we performed, we realized that the average cost of each visit for a patient remained the same regardless of when the hospital visit occurred, but the number of visits increased significantly after the day the patient was diagnosed. This resulted in a significant increase in the total cost, as can be seen in the graph below.

 

Another interesting finding was that although the number of patients diagnosed with Type II diabetes increased with age, the average cost was highest for ages 20-25 and 45-50, as shown in the graph below. This trend was common among different clusters. From this data, we were able to conclude that the people who were diagnosed in those two age groups had similar diseases. This suggests that we should focus on patients who are diagnosed with diabetes at these younger ages, since they will incur significantly higher cumulative costs over their lifetimes.

 

What we learned

From this competition, we learned that it is important to polish the slides for the presentation. The judges described our presentation as one of the most technically solid. However, we made some mistakes on the slides, and some axes on the graphs were not clearly labeled. As a result, we did not make it to the finals. Nevertheless, it was really encouraging to know that the judges were impressed by our analytical skills.

But beyond that, at the competition itself we were impressed by the various interesting ways that the upper-level teams predicted which patients were most at risk. One of them used models similar to those that credit card companies use to predict credit ratings for their clients. This was definitely out of the box, and we were surprised to see that it actually worked.

And finally, we realized that it is important to persist. We encountered a significant delay before we received the real data (in fact, we did not have it until 5 days before the competition). It was challenging to come up with a good analysis in that short period of time. However, with the help of our faculty mentor Sean, we were able to find meaningful patterns in the dataset. We wish we had had more time to explore trends in the other datasets.