
MDST Places in the Parkinson’s Biomarker Challenge


Participants: Junhao Wang, Arya Farahi and Xinlin Song (Challenge 1); Yi-Lun Wu, Chun-Yu Hsiung and Xinlin Song (Challenge 2)

Parkinson’s disease (PD) is a degenerative disorder of the central nervous system that mainly affects the motor system. Currently, there is no objective test to diagnose PD, and the bedside examination by a neurologist remains the most important diagnostic tool. The examination assesses motor symptoms such as shaking, rigidity, slowness of movement and postural instability. However, these motor symptoms only begin to appear at a very late stage. Smartphones and smart watches have sensitive sensors (accelerometer, gyroscope, and pedometer) that can track the user’s motion more frequently than clinical examinations and at much lower cost. Although the sensors record movement information, the raw sensor data is hard to interpret and gives limited help in diagnosing PD.

In the Parkinson’s Biomarker Challenge, we were tasked with extracting useful features from time-series accelerometer and gyroscope data. The data for Challenge 1 consist of ~35,000 records collected from ~3,000 participants using a phone app in their daily lives. The final goal is to predict whether a participant has Parkinson’s disease. The data for Challenge 2 consist of records from ~20 patients performing different tasks (such as drinking water, folding towels, and assembling nuts and bolts). The goal is to predict the severity of the limb action tremor.

The general method we used in both challenges was to generate multiple features from the time-series sensor data and perform feature selection to get the top features. Finally, a machine learning model was built on the top features. The details of the methods we used can be found here:

Challenge 1: https://www.synapse.org/#!Synapse:syn10894377/wiki/470036

Challenge 2: https://www.synapse.org/#!Synapse:syn11317207/wiki/486357
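
The general pipeline described above (generate features, then keep the top ones) can be sketched roughly as follows. This is a minimal illustration and not the team's actual code, which is documented at the links above. The particular summary statistics, the correlation-based ranking, and the names `extract_features`, `pearson` and `top_features` are all assumptions chosen to keep the example short.

```python
import math
import statistics


def extract_features(signal):
    """Summarize one 1-D sensor recording (e.g. one accelerometer axis)."""
    mean = statistics.fmean(signal)
    centered = [x - mean for x in signal]
    # Sign changes of the mean-centered signal: a crude proxy for frequency.
    crossings = sum(1 for a, b in zip(centered, centered[1:]) if a * b < 0)
    return {
        "mean": mean,
        "std": statistics.pstdev(signal),
        "range": max(signal) - min(signal),
        "energy": sum(x * x for x in centered) / len(signal),
        "mean_crossings": crossings,
    }


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def top_features(feature_rows, labels, k):
    """Rank features by |correlation with the label| and keep the top k."""
    names = list(feature_rows[0])
    scores = {
        name: abs(pearson([row[name] for row in feature_rows], labels))
        for name in names
    }
    return sorted(names, key=scores.get, reverse=True)[:k]
```

In this sketch, each recording becomes one dictionary of features, and the names returned by `top_features` would be the inputs to whatever model is trained afterwards.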

The highest ranking the team received was 4th place in Challenge 2.

MDST Competes in the Midwest Undergraduate Data Analytics Competition


Authors: Weifeng Hu and Divyansh Saini

The Competition

We represented the Michigan Data Science Team (MDST) in the Midwest Undergraduate Data Analytics Competition hosted by MinneAnalytics in Minnesota.

We were provided with insurance claims data for people diagnosed with Type II diabetes. At the novice level, the goal was to use this data to find meaningful patterns in age group, gender and geolocation for patients with Type II diabetes. This information would then be used to devise cost-saving mechanisms.



The first step in working on this challenge was to understand the dataset. We had medical claims for patients with Type II diabetes from health care facilities, pharmacies and laboratories. The files in the dataset are described below:

medical_training.csv: claims when patients visited a hospital and related data

confinement_training.csv: claims for when patients were confined to the hospital

rx_training.csv: pharmacy claims for drugs prescribed to the patients

labresult_training.csv: claims for laboratory tests

member_info.csv: information about the gender, age and location of patients

Through our meetings with MDST members, we realized that most teams would be approaching this problem with the purpose of predicting the diagnosis or finding patterns in the comorbidity.

To stand out, we decided to focus on finding the sub-populations with the highest costs after day 0, the day they were diagnosed with Type II diabetes.



We performed k-means clustering on claims in medical_training.csv to find generalized patterns among different groups.

– Data Preprocessing:

The entries in medical_training.csv contain medical claims for each hospital visit by a patient. A patient has multiple entries if they visited the hospital more than once. Each entry holds at most five ICD-9-CM diagnosis codes for the visit. Each code contains five characters that are mostly digits, e.g. “12345”. The codes can be grouped into 19 disease categories, each covering a range of codes.

The first preprocessing step we performed was to “shrink” the number of rows. We used the patient ID to group all diagnoses by patient. We then constructed 0/1 indicator feature vectors of length 19. Each column represented a disease category, and the value in that column was 1 if the patient had been diagnosed with a disease in that category. Each row of the resulting feature matrix contains the diagnosis information for one patient.
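
A rough sketch of this preprocessing step is shown below. It assumes that the 19 categories are the standard ICD-9-CM chapters (17 numeric ranges plus the E and V supplementary codes) and that the first three digits of a code identify its category. The post does not spell out the exact grouping, so treat the ranges, the input layout, and the names `category_index` and `build_indicator_vectors` as illustrative assumptions.

```python
# Upper bound of each numeric ICD-9-CM chapter (001-139, 140-239, ..., 800-999).
# Assumption: the post's 19 categories are these 17 chapters plus E and V codes.
CHAPTER_UPPER_BOUNDS = [139, 239, 279, 289, 319, 389, 459, 519, 579,
                        629, 679, 709, 739, 759, 779, 799, 999]
E_CODE_INDEX = len(CHAPTER_UPPER_BOUNDS)      # 17: external causes of injury
V_CODE_INDEX = len(CHAPTER_UPPER_BOUNDS) + 1  # 18: supplementary factors
NUM_CATEGORIES = V_CODE_INDEX + 1             # 19 categories in total


def category_index(code):
    """Map one diagnosis code (e.g. "25000") to a category index 0..18."""
    code = code.strip().upper()
    if code.startswith("E"):
        return E_CODE_INDEX
    if code.startswith("V"):
        return V_CODE_INDEX
    prefix = int(code[:3])  # the first three digits identify the category
    for index, upper in enumerate(CHAPTER_UPPER_BOUNDS):
        if prefix <= upper:
            return index
    raise ValueError(f"unrecognized ICD-9 code: {code!r}")


def build_indicator_vectors(claims):
    """Collapse per-visit claims into one 0/1 vector per patient.

    `claims` is an iterable of (patient_id, list_of_codes) pairs, one per
    visit; this layout is hypothetical, not the real file schema.
    """
    vectors = {}
    for patient_id, codes in claims:
        vector = vectors.setdefault(patient_id, [0] * NUM_CATEGORIES)
        for code in codes:
            if code and code.strip():  # skip empty diagnosis slots
                vector[category_index(code)] = 1
    return vectors
```

Each patient ends up with a single row regardless of how many visits they had, which is the “shrinking” described above.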

– Hyperparameter Tuning:

After turning the data into feature vectors, we can now perform k-means clustering. K-means clustering partitions observations into a fixed number (k) of clusters or groups, such that each observation belongs to the cluster with the nearest mean. However, we still need to decide the number of clusters k. We want a number of clusters that results in small intracluster distances but does not overfit the data. To do this we used the “elbow heuristic”: plot the cost of k-means (the sum of the intracluster distances) against k, and choose a value of k that has a significant drop in cost before that point and no significant drop after it. In the graph above, where the x-axis is the value of k and the y-axis is the cost of k-means clustering, we can see that k=19 is a good choice.
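
The cost curve behind the elbow heuristic can be computed along these lines. This is a sketch that uses a plain implementation of Lloyd's algorithm with random initialization and the sum of squared distances as the cost (the usual k-means objective). The post does not say which implementation was used, and the function names and the `seed` and `max_iter` parameters are assumptions.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points, k, seed=0, max_iter=100):
    """Run Lloyd's algorithm and return the final within-cluster cost."""
    rng = random.Random(seed)
    # Initialize the centroids with k points drawn from the data.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            [sum(col) / len(members) for col in zip(*members)]
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


def elbow_costs(points, ks):
    """Return {k: cost} so the cost can be plotted against k."""
    return {k: kmeans_cost(points, k) for k in ks}
```

Plotting the values returned by `elbow_costs` against k gives the kind of curve on which the elbow is read off. Because the initialization is random, several seeds are usually tried in practice and the lowest cost is kept for each k.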





From the data analysis we performed, we found that the average cost of each visit for a patient remained the same regardless of when the hospital visit occurred, but the number of visits increased significantly after the day they were diagnosed. This resulted in a significant increase in the total cost, as can be seen in the graph below.
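
The comparison behind this finding can be sketched as below: split visits at the day of diagnosis and compare visit counts and mean cost per visit on each side. The input layout (pairs of days-from-diagnosis and cost) and the name `summarize_visits` are assumptions for illustration, not the actual columns of the claims files.

```python
def summarize_visits(claims):
    """Compare visits before and after diagnosis (day 0).

    `claims` is an iterable of (days_from_diagnosis, cost) pairs, where a
    negative day means the visit happened before the diagnosis.
    """
    summary = {
        "before": {"visits": 0, "total_cost": 0.0},
        "after": {"visits": 0, "total_cost": 0.0},
    }
    for day, cost in claims:
        period = "after" if day >= 0 else "before"
        summary[period]["visits"] += 1
        summary[period]["total_cost"] += cost
    for stats in summary.values():
        # Mean cost per visit; 0.0 when there were no visits in the period.
        stats["mean_cost"] = (
            stats["total_cost"] / stats["visits"] if stats["visits"] else 0.0
        )
    return summary
```

A pattern like the one described would show up as similar `mean_cost` values in both periods but a much larger `visits` count, and therefore `total_cost`, after diagnosis.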


Another interesting finding was that although the number of patients diagnosed with Type II diabetes increased with age, the average cost was highest for ages 20-25 and 45-50, as shown in the graph below. This trend was common across the different clusters. From this data, we were able to conclude that the people who were diagnosed in those two age groups had similar diseases. This suggests that we should focus on people who are diagnosed with diabetes at these younger ages, because they will incur significantly higher cumulative costs over their lifetimes.


What we learned

From this competition, we learned that it is important to style our slides for the presentation. The judges described our talk as “one of the most technical solid presentation”. However, we made some mistakes on the slides, and some axes on the graphs were not clearly labeled. As a result, we did not make it to the finals. Nevertheless, it was really encouraging to know that the judges were impressed by our analytical skills.

But beyond that, at the competition itself we were impressed by the various interesting approaches the upper-level teams used to predict the highest at-risk patients. One of them used models similar to those that credit-card companies use to predict credit ratings for their clients. This was definitely out of the box, and we were surprised to see that it actually worked.

And finally, we realized that it is important to persist. We encountered a significant delay before we received the real data (in fact, we did not have it until 5 days before the competition). It was challenging to come up with a good analysis in that short period of time. However, with the help of our faculty mentor Sean, we were able to find meaningful patterns in the dataset. We wish we had had more time so that we could explore trends in the other datasets.


MDST – NFL Free Agency Value Prediction Competition Kick-Off – Nov. 9, 6pm


In this competition, student teams at the University of Michigan will use historical free agent data to predict the value of new contracts signed in the 2018 free agency period. These predictions will be evaluated against the actual contracts as they are signed. This competition is organized by the Michigan Data Science Team (MDST), in collaboration with the Baltimore Ravens and the Michigan Sports Analytics Society (MSAS). Food will be provided. This is the initial kick-off meeting of the competition.


Date, Time

Thursday, November 9 at 6:00 PM EST to Thursday, November 9 at 7:00 PM EST


Weiser Hall 10th Floor Auditorium
500 Church St, 48104, MI


Michigan Data Science Team



MDST announces Detroit blight data challenge; organizational meeting Feb. 16


The Michigan Data Science Team and the Michigan Student Symposium for Interdisciplinary Statistical Sciences (MSSISS) have partnered with the City of Detroit on a data challenge that seeks to answer the question: How can blight ticket compliance be increased?

An organizational meeting is scheduled for Thursday, Feb. 16 at 5:30 p.m. in EECS 1200.

The city is making datasets available containing building permits, trades permits, citizen complaints, and more.

The competition runs through March 15. For more information, see the competition website.

Google, U-M to build digital tools for Flint water crisis


FLINT—A partnership between Google and the University of Michigan’s Flint and Ann Arbor campuses aims to provide a smartphone app and other digital tools to Flint residents and officials to help them manage the ongoing water crisis.

The app and other tools will help predict where lead levels will be highest in the city’s water, and they’ll pull together information and resources to make the crisis easier to navigate for those affected. The project is made possible by a $150,000 grant from Google.

“This investment by Google is an outstanding commitment to our community. It creates an ideal combination of an industry powerhouse with faculty expertise. It will create new opportunities for students and continue building community partnerships—all so that we can provide quick and critically important information and analysis for our community as we move forward,” said Chancellor Susan E. Borrego of the University of Michigan-Flint.

The Android app is slated for roll-out this summer. It could help residents determine whether their homes are at high risk of having lead-contaminated water. It could also help them locate day-to-day resources for lead testing, water distribution, water bottle recycling, water filters, and volunteer opportunities. A website will offer similar resources and will be accessible on any computer, including those in public libraries.

Additional web-based tools for researchers and government officials could provide detailed insight on how to deploy repairs and resources. For example, they could help identify and prioritize the water service line replacements.

A student team at UM-Flint has already developed a prototype smartphone app for Flint residents. Google and U-M Ann Arbor will work with them through the spring and summer to add mapping features that use predictive analytics from U-M Ann Arbor’s Michigan Data Science Team. The team will also develop an improved user interface with assistance from Google.

Google has pledged a variety of resources to the project including a grant and remote and on-site assistance from its user experience and app development team. The company will also donate data resources to the Michigan Data Science Team including mapping, satellite imagery, and geo-location data.

Initial work by the data science team has already shown some success at predicting which homes and neighborhoods have a high risk of lead contamination. In the coming months, they’ll continue to apply predictive algorithms and machine learning techniques to data from a wide variety of sources including Google, the State of Michigan and the City of Flint. The data includes existing lead testing data; detailed information on the type and location of water infrastructure; and information on the size, age, type, and condition of every parcel of property in the city.

“There’s a lot of data on the water crisis, but it’s scattered over many different agencies and places,” said Jacob Abernethy, an assistant professor of computer science and engineering at U-M Ann Arbor and faculty advisor to the Michigan Data Science Team. “By organizing it in one place and analyzing it, we can predict which areas are likely to be at risk. We can help planners determine which infrastructure repairs will benefit the most residents, and how to allocate resources like bottled water most efficiently.”

Google and U-M also plan to create a separate set of web tools for city planners and other officials. They will include extensive mapping and predictive analytics, with details on waterline type and location and other infrastructure data.

Mark Allison is an assistant professor of computer science at UM-Flint and the faculty leader of the Flint student team. He says the project will be an opportunity for students to make a difference in the water crisis and pick up valuable real-world development experience along the way.

“Finding the best way to put resources close to where high lead levels are is a big part of managing this crisis, and it’s the kind of problem that analytics can solve. We also want to give residents more transparency by making it easier for anyone to get access to the most up-to-date information,” Allison said. “I think this project will be transformative. And for all of us here in Flint, it’s about much more than grades.”

Allison said the team is working to keep the tools they develop flexible, enabling them to be used by other cities that face similar crises. His team is developing the tools as part of UM-Flint Computer Science’s community-based learning program, which puts students to work on real-world challenges in and around Flint.

The Michigan Data Science Team is a competitive extra-curricular team at U-M Ann Arbor. Founded by Abernethy, the team builds and applies advanced computer algorithms that can analyze and “learn” from large sets of data. By finding connections and patterns within that data, they can make predictions about future events. The techniques are already widely used in areas like online retailing and advertising.

“Access to clean drinking water is a concern all over the world, but in the United States it’s often a foregone conclusion. That is not the case recently for the residents of Flint, Michigan,” said Mike Miller, head of Google Michigan. “I am proud that we can contribute to help with the recovery of and we hope we can help to support a resolution to this crisis and get the residents of Flint the resources and respect they so rightly deserve.”

The Flint Water crisis began after April of 2014, when the city’s drinking water source was changed from Lake Huron via Detroit’s water system to the Flint River. The water supply was not properly monitored for corrosion control and it caused lead to leach from service lines into the city’s drinking water. While the city has since switched its water supply back to the Detroit system, residents are still being advised not to drink unfiltered tap water.

Written by Gabe Cherry