Category

MDSTPosts

MDST group wins KDD best paper award

By | General Interest, Happenings, MDSTPosts, Research

A paper by members and faculty leaders of the Michigan Data Science Team (co-authors: Jacob Abernethy, Alex Chojnacki, Arya Farahi, Eric Schwartz, and Jared Webb) won the Best Student Paper award in the Applied Data Science track at the KDD 2018 conference in August in London.

The paper, ActiveRemediation: The Search for Lead Pipes in Flint, Michigan, details the group’s ongoing work in Flint to detect pipes made of lead and other hazardous material.

For more on the team’s work, see this recent U-M press release.

MDST Places in the Parkinson’s Biomarker Challenge

By | MDSTPosts, MDSTProjects

Participants: Junhao Wang, Arya Farahi and Xinlin Song (Challenge 1); Yi-Lun Wu, Chun-Yu Hsiung and Xinlin Song (Challenge 2)

Parkinson’s disease (PD) is a degenerative disorder of the central nervous system that mainly affects the motor system. Currently, there is no objective test to diagnose PD, and a bedside examination by a neurologist remains the most important diagnostic tool. The examination assesses motor symptoms such as shaking, rigidity, slowness of movement, and postural instability. However, these motor symptoms only appear at a very late stage. Smartphones and smartwatches have sensitive sensors (accelerometer, gyroscope, and pedometer) that can track a user’s motion far more frequently than clinical examinations, and at much lower cost. While the sensors record movement information, the raw sensor data are hard to interpret and, on their own, provide limited help for PD diagnosis.

In the Parkinson’s Biomarker Challenge, we were tasked with extracting useful features from time-series accelerometer and gyroscope data. The data for Challenge 1 consist of roughly 35,000 records collected from about 3,000 participants via a phone app during their daily lives; the goal is to predict whether a participant has Parkinson’s disease. The data for Challenge 2 consist of records from about 20 patients performing different tasks (such as drinking water, folding towels, and assembling nuts and bolts); the goal is to predict the severity of limb action tremor.

The general approach we used in both challenges was to generate many candidate features from the time-series sensor data, perform feature selection to identify the top features, and then build a machine learning model on those features (a minimal sketch of this kind of pipeline appears at the end of this post). The details of the methods we used can be found here:

Challenge 1: https://www.synapse.org/#!Synapse:syn10894377/wiki/470036

Challenge 2: https://www.synapse.org/#!Synapse:syn11317207/wiki/486357

The highest ranking the team received was 4th place in Challenge 2.
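As a rough illustration of this kind of pipeline (this is not our competition code; the feature set, the synthetic data, and the model choice below are placeholder assumptions), a Python sketch might look like:

```python
# Minimal sketch of the feature-extraction -> selection -> model pipeline.
# Not the competition code: features, data, and model here are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline

def extract_features(record):
    """Turn one (n_samples, 3) accelerometer record into summary statistics."""
    feats = []
    for axis in range(record.shape[1]):
        x = record[:, axis]
        feats += [x.mean(), x.std(), np.abs(np.diff(x)).mean(),  # smoothness
                  np.percentile(x, 95) - np.percentile(x, 5)]    # dynamic range
    return np.array(feats)

# Hypothetical data: raw sensor arrays and labels (1 = PD, 0 = control).
rng = np.random.default_rng(0)
records = [rng.normal(size=(500, 3)) for _ in range(100)]
labels = rng.integers(0, 2, size=100)

X = np.vstack([extract_features(r) for r in records])
model = make_pipeline(SelectKBest(f_classif, k=8),      # keep only the top features
                      RandomForestClassifier(n_estimators=200, random_state=0))
model.fit(X, labels)
```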

Student data science competition winners visit Quicken Loans headquarters in Detroit

By | Educational, General Interest, MDSTPosts, News

Earlier this year, three Michigan Data Science Team (MDST) members — winners of the Quicken Loans (QL) Lending Strategies Prediction Challenge — traveled to Detroit to visit QL headquarters, accept their prizes, and present their findings to the company’s Data Science team.

Back row left to right: Reddy Rachamallu, Alexandr, Alex, Mark Nuppnau, Brian Ball
Front row left to right: Jingshu Chen, Patrick, Alex’s wife Kenzie, Yvette Tian, Mike Tan, and Catherine Tu.

 

Alexander Zaitzeff, a graduate student in the Applied and Interdisciplinary Mathematics program, won first place; Alexandr Kalinin, a Bioinformatics graduate student, earned second; and Patrick Belancourt, a graduate student in Climate and Space Sciences and Engineering, took third.

The goal of the competition was to create a model that would predict whether potential clients would end up getting a mortgage based on the loan product originally offered to them. In order to create this model, each participant was given access to proprietary de-identified financial data from recent QL clients. The accuracy of their models was then evaluated on one month of client data.

Alexander Zaitzeff

“Every time I participate in a competition I try out a new technique,” Zaitzeff said. “MDST puts me in competitions with other U-M students who I can team up with and learn from.”

“This was a very valuable competition because it gives people experience working with real datasets, on actual problems that companies work on day to day,” said Jonathan Stroud, organizational chair of MDST.

Brian Ball, a data scientist at QL and U-M alum, said the input from MDST students gained through the competition helped confirm the company’s hope that “our system is predictable from a mathematical standpoint.”

“In that regard, we can use the results produced and the methods used to drive good decisions to most benefit our clients,” he added. “We view this as a total success as it was our hypothesis — and underlying hope — from the beginning.”

About 20 people from QL’s Data Science team, along with vice presidents of the Business Intelligence unit, gathered to hear how the MDST winners developed their models.

The winning entry was an “ensemble model,” in which several models are synthesized into one predictive framework.

Finding that so many different kinds of models performed similarly was a confirmation that “the data tells the story,” Ball said.

“Allowing for each technique to contribute more strongly to the final score in areas where the model type performs well (referred to as “blending” or “stacking”) is an especially strong method and one we should consider moving forward,” he said.
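For readers unfamiliar with the technique, here is a minimal sketch of stacking in scikit-learn; the base learners and synthetic data are placeholders, not the models from the winning entry:

```python
# Minimal sketch of a stacked ensemble: several base models feed their
# predictions into a final estimator that learns how to weight them.
# Base learners and synthetic data are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("gbt", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # learns how much to trust each model
    cv=5,  # out-of-fold predictions keep the blender from overfitting
)
print(cross_val_score(stack, X, y, cv=5, scoring="roc_auc").mean())
```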

The competition began in September and ran until the end of the Fall semester. Over 70 students competed in this challenge, including both graduates and undergraduates from several schools and departments across the University.

MDST typically runs two or three competitions each year — the current competition involves predicting the value of NFL free agents, and is being conducted in partnership with the Baltimore Ravens. For more information, please visit MDST’s webpage: midas.umich.edu/mdst

MDST Competes in the Midwest Undergraduate Data Analytics Competition

By | MDSTPosts, MDSTProjects

Authors: Weifeng Hu and Divyansh Saini

The Competition

We represented the Michigan Data Science Team (MDST) in the Midwest Undergraduate Data Analytics Competition hosted by MinneAnalytics in Minnesota.

We were provided insurance claims data for people diagnosed with Type II diabetes. At the novice level, the goal was to use this data to find meaningful patterns in age group, gender, and geolocation for patients with Type II diabetes. This information would then be used to identify cost-saving mechanisms.

 

Experience

The first step in working on this challenge was to understand the dataset. We had medical claims from health care facilities, pharmacies, and laboratories for patients with Type II diabetes. The files in the dataset are described below:

medical_training.csv: claims and related data for patients’ hospital visits

confinement_training.csv: claims for when patients were confined to the hospital

rx_training.csv: pharmacy claims for drugs prescribed to patients

labresult_training.csv: claims for laboratory tests

member_info.csv: information about patients’ gender, age, and location

Through our meetings with MDST members, we realized that most teams would approach this problem by predicting the diagnosis or finding patterns in comorbidities.

To stand out, we decided to focus on finding the sub-populations with the highest costs after day 0, the day of diagnosis with Type II diabetes.

 

Methodology

We performed k-means clustering on the claims in medical_training.csv to find generalized patterns among different groups.

– Data Preprocessing:

The entries in medical_training.csv contain one medical claim for each hospital visit by a patient; a patient has multiple entries if he or she visited the hospital more than once. Each entry lists at most five ICD-9 diagnosis codes for the visit. Each code contains five characters that are mostly digits, e.g. “12345”, and the codes can be grouped into 19 categories of diseases, with each category covering a range of codes.

The first preprocessing step we performed was to “shrink” the number of rows. We used the patient ID to group diagnoses by patient, then constructed a 0/1 indicator feature vector of length 19 for each patient. Each column represented a type of disease, and its value was 1 if the patient had been diagnosed with a disease in that category; each row of the resulting feature matrix therefore summarized a particular patient’s diagnoses.
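A minimal pandas sketch of this step (the column names are hypothetical stand-ins for the competition schema) might look like:

```python
# Sketch of building a 0/1 patient-by-category indicator matrix.
# Column names ("patient_id", "diag_category") are hypothetical stand-ins.
import pandas as pd

claims = pd.DataFrame({
    "patient_id":    [1, 1, 2, 3, 3, 3],
    "diag_category": [4, 7, 4, 1, 4, 18],   # ICD-9 code already mapped to 1..19
})

indicators = (
    pd.crosstab(claims["patient_id"], claims["diag_category"])
      .clip(upper=1)                                   # 0/1 rather than counts
      .reindex(columns=range(1, 20), fill_value=0)     # all 19 disease categories
)
print(indicators.head())
```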

– Hyperparameter Tuning

After converting the data into feature vectors, we could perform k-means clustering. k-means clustering partitions observations into a finite number (k) of clusters, where each observation belongs to the cluster with the nearest mean. However, we still needed to choose the number of clusters k: we wanted a value that yields small intra-cluster distances without overfitting the data. To do this we used the “elbow heuristic”: plotting the cost of k-means (the sum of the intra-cluster distances) against k, we choose a value of k with a significant drop in cost before that point and no significant drop after it. In our elbow plot (x-axis: value of k; y-axis: k-means cost), k=19 looked like a good choice.
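A minimal sketch of this heuristic (with synthetic indicator vectors in place of the real claims features) might look like:

```python
# Sketch of the "elbow" heuristic: plot k-means cost (inertia) against k
# and look for where the drop levels off. Data here is synthetic.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).integers(0, 2, size=(500, 19))  # fake 0/1 indicator vectors

ks = range(2, 31)
costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), costs, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("k-means cost (sum of intra-cluster distances)")
plt.show()
```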


Result

From the data analysis we performed, we found that the average cost of each visit for a patient remained the same regardless of when the hospital visit occurred, but the number of visits increased significantly after the day of diagnosis. This resulted in a significant increase in total cost, as can be seen in the graph below.

 

Another interesting finding was that although the number of patients diagnosed with Type II diabetes increased with age, the average cost was highest for ages 20-25 and 45-50, as shown in the graph below. This trend was common across the different clusters. From this data, we concluded that the people diagnosed in those two age groups had similar diseases. It also suggests focusing on the younger age group: people diagnosed with diabetes at these younger ages will accumulate significantly higher costs over their lifetimes.

 

What we learned

From this competition, we learned how important it is to polish the slides for the presentation. The judges described ours as “one of the most technically solid presentations.” However, we made some mistakes on the slides, and some axes on the graphs were not clearly labeled. As a result, we did not make it to the finals. Nevertheless, it was really encouraging to know that the judges were impressed by our analytical skills.

But beyond that, at the competition itself we were impressed by the interesting approaches the upper-level teams used to predict the highest at-risk patients. One team used models similar to those credit-card companies use to predict credit ratings for their clients. This was definitely out of the box, and we were surprised to see that it actually worked.

And finally, we realized that it is important to persist. We encountered a significant delay before we received the real data (in fact, we did not have it until five days before the competition). It was challenging to come up with a good analysis in that short period of time. However, with the help of our faculty mentor, Sean, we were able to find meaningful patterns in the dataset. We wish we had had more time to explore trends in the other datasets.

 

MDST – NFL Free Agency Value Prediction Competition Kick-Off – Nov. 9, 6pm

By | Data, Data sets, Educational, Events, Happenings, MDSTPosts, MDSTProjects, News

In this competition, student teams at the University of Michigan will use historical free agent data to predict the value of new contracts signed in the 2018 free agency period. These predictions will be evaluated against the actual contracts as they are signed. The competition is organized by the Michigan Data Science Team (MDST), in collaboration with the Baltimore Ravens and the Michigan Sports Analytics Society (MSAS). Food will be provided. This is the initial kick-off meeting of the competition.


Date, Time

Thursday, November 9, 6:00 PM to 7:00 PM EST

Location

Weiser Hall 10th Floor Auditorium
500 Church St, Ann Arbor, MI 48104

Host

Michigan Data Science Team

 

 

MDST Partners with City of Detroit to Improve Blight Enforcement

By | MDSTPosts, News

Author: Allie Cell, College of Engineering

PROBLEM OVERVIEW

Property blight, which refers to lots and structures not being properly maintained (as pictured above), is a major problem in Detroit: over 20% of lots in the city are blighted. As a measure to help curb this behavior, in 2005 the City of Detroit began issuing blight tickets to the owners of afflicted properties, with fees ranging from $20, for offenses such as leaving out a trash bin too far ahead of its collection date, to $10,000, for offenses such as dumping more than 5 cubic feet of waste. Unfortunately, only 7% of people who are issued tickets and found guilty actually pay them, leaving a balance of some $73 million in unpaid blight tickets.

Officials from the City of Detroit’s Department of Administrative Hearings and Department of Innovation and Technology who work with blight tickets initially came together with the Michigan Data Science Team this past February to sponsor a competition to forecast blight compliance. To provide a more actionable analysis for Detroit policymakers, we aimed to understand what sorts of people receive blight tickets, to use this knowledge to better grasp why blight tickets have not been effective, and to provide insights for policymakers accordingly.

THE DATA

In order to get an accurate picture of blight compliance in Detroit, we aggregated information from multiple datasets. The Detroit Open Data Portal is an online hub for public data, featuring datasets related to public health, education, transportation, and more; it is where we obtained most of our datasets. Some of the most valuable datasets we used included:

Blight Ticket Data, records of each blight ticket

Parcel Data, records of all properties in Detroit

Crime Data, records of all crimes in Detroit from 2009 through 2016

Demolition Data, records of each completed and scheduled demolition in Detroit

Improve Detroit, records of all issues submitted through an app whose goal is improving the city

PREDICTING BLIGHT TICKET COMPLIANCE

We built a model to predict whether a property owner would pay their blight ticket. Each record in the dataset contained one-hot encoded data from the sources listed above. Tree-based methods are easily interpretable and perform well on mixed data, so we considered scikit-learn random forests and xgboost gradient boosted trees (XGBClassifier). To choose the best model, we generated learning curves with 5-fold cross-validation for each classifier; xgboost performed well, with a cross-validation score of over 0.9, so we selected it.
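As a hedged illustration of this model-selection step (synthetic data stands in for the one-hot encoded ticket features, and the hyperparameters are placeholders rather than the ones we tuned), a sketch might look like:

```python
# Sketch of 5-fold cross-validated learning curves for a gradient-boosted
# tree classifier. Synthetic data replaces the real blight-ticket features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1),
    X, y, cv=5, scoring="roc_auc",
    train_sizes=np.linspace(0.1, 1.0, 5),
)
print("validation AUC at each training size:", val_scores.mean(axis=1))
```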


ANALYSIS OF TICKETED PROPERTY OWNERS

To gain a more holistic understanding of the relationships between ticketed property owners and their properties, we analyzed three categories of property owners:

Top Offenders, the small portion of offenders who own many blighted properties and receive the majority of tickets; this 20% of violators accounts for over 70% of unpaid blight fines

Live-In Owners, offenders who were determined to actually live in their blighted property, indicative of a stronger relationship between the owner and the house

Residential Rental Property Owners, offenders who own residential properties but were determined to not live in them, indicative of a more income-driven relationship between the owner and the property

After deciding to focus on comparing these three groups, we found some notable differences between each owner category:

Repeat Offenses: only 11% of live-in owners were issued more than two blight tickets (71% of live-in owners received only one blight ticket); the multiple offense rate jumps to 20% when looking at residential rental property owners (59% of residential rental property owners received only one blight ticket).

Property Conditions: only 4.8% of the ticketed properties owned by live-in owners were in poor condition, compared to 7.1% for those owned by residential rental property owners and 8.8% for those owned by top offenders

Compliance: only 6.5% of tickets issued to top offenders were paid (either on time or by less than 1 month late), which is significantly less than the 10% and 11% rates for residential rental property owners and live-in owners respectively

Occupancy Rates: live-in owners had the highest occupancy rate on properties that were issued blight tickets, 69%, followed by 57% for residential rental property owners and 47% for top offenders. While this trend makes sense, we would expect a 100% occupancy rate for properties that owners are actively living in; the gap from 100% is one testament to the data-quality problems our team faced, stemming both from inconsistencies in the records and from real estate turnover in Detroit.

Feel free to check out the whole paper ~ here ~

Driving with Data: MDST Partners with City of Detroit for Vehicle Maintenance Analysis Project

By | MDSTPosts

Author: Josh Gardner, School of Information

For this project, MDST partnered with the City of Detroit’s Operations and Infrastructure Group. The Operations and Infrastructure Group manages the City of Detroit’s vehicle fleet, which includes vehicles of every type: police cars, ambulances, fire trucks, motorcycles, boats, semis … all of the unique vehicles that the City of Detroit uses to manage the many tasks that keep this city of over 700,000 people functioning.

The Operations and Infrastructure Group was interested in exploring several aspects of its vehicle fleet, especially vehicle cost and reliability and their relationship to how the fleet of over 2,700 vehicles was maintained. This project kicked off at a pivotal time for the City of Detroit: the city filed for bankruptcy in 2013 and emerged in 2014, and cost-effectiveness with its vehicle fleet, one of the city’s most expensive and critical assets, was a key lever for operational effectiveness.

The city provided two data sources: a vehicles table, with one record for each vehicle purchased by the city — nearly 6700 vehicles, purchased as early as 1944 — and a maintenance table, with one record for each maintenance job performed on a vehicle owned by the city — over 200,000 records of jobs such as preventive maintenance, tire changes, body work, and windshield replacement.

Our team faced three challenges in this project, all of which are common to many data science tasks. First, as a team of student data scientists who were not experts in city government, we needed to work with our partners to determine which questions our analysis should address. The possibilities for analysis were nearly endless, and we needed to ensure that the ones we chose would help the city understand and act on its data. Second, the data was incomplete and potentially unreliable: it included paper records converted to digital records (like the vehicles from 1944) and data reflecting human coding decisions that may be inaccurate, arbitrary, or overlapping. Third, our team needed to recover complex structure from tabular data. Our data came in the form of two simple tables, but they contained complex patterns across time, vehicles, and locations, and we needed techniques that could help us discover and explore those patterns. This is a common challenge with data science in the “real world”, where complex data is stored in tables.

Data Analysis and Modeling

Tensor Decomposition

Tensor decomposition describes techniques used to decompose tensors, or multidimensional data arrays. We applied the PARallel FACtors, or PARAFAC, decomposition. We refer readers to our paper or to (Kolda and Bader 2009) for details on this technique. Also check out (Bader et al. 2008) for an interesting application of this technique to tracing patterns in the Enron email dataset that we enjoyed reading.

Applying the PARAFAC allowed us to automatically extract and visualize multidimensional patterns in the vehicle maintenance data. First, we created data tensors as shown below. Second, we applied the PARAFAC decomposition to these tensors, which produced a series of factor matrices which best reconstructed the original data tensor.
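For readers curious how this looks in code, here is a rough sketch using the tensorly library (not our actual analysis code; the tensor here is random, and its dimensions are placeholders for the real vehicle-by-job-by-time data):

```python
# Rough sketch of a PARAFAC (CP) decomposition using the tensorly library.
# The random tensor stands in for the real (vehicle x job type x time) tensor.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

# e.g. vehicles x job types x months, shrunk here for illustration
data = tl.tensor(np.random.default_rng(0).poisson(1.0, size=(50, 10, 24)).astype(float))

weights, factors = parafac(data, rank=5, n_iter_max=200)  # one factor matrix per mode
vehicle_factors, job_factors, time_factors = factors
print([f.shape for f in factors])   # [(50, 5), (10, 5), (24, 5)]
```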

Left: Depiction of (a) a 3-mode data tensor; (b) the same tensor as a stacked series of frontal slices, or arrays; (c) an example single frontal slice of a vehicle data tensor used in this analysis (each entry corresponds to the count of a specific job type for a vehicle at a fixed time). Right: Visual depiction of the PARAFAC decomposition.

We generated and explored “three-way plots” of these factors using two different types of time axis: absolute time, which allowed us to model things like seasonality over time; and vehicle lifetime, which allowed us to model patterns in how maintenance occurred over a vehicle’s lifetime (relative to when it was purchased). We show one such plot below, which demonstrates unique maintenance patterns for the 2015 Smeal SST Pumper fire truck (shown in the photo in the introduction above); for a detailed analysis of this and other three-way plots from the Detroit vehicles data, check out the paper. Here, we will mention that PARAFAC helped us discover unique patterns in maintenance, including a set of vehicles which were maintained at different specialty technical centers by Detroit, without us specifically looking for them or even knowing these patterns existed.

Differential Sequence Mining

Our tensor decomposition analysis revealed several interesting patterns, both in absolute time and in vehicle lifetime. As a second step of data exploration, we investigated whether there were statistically distinctive sequences of vehicle maintenance. To do this, we employed a differential sequence mining algorithm to compare common sequences between different groups and identify whether the differences in their prevalence are statistically significant. We compared each vehicle make/model to the rest of the fleet for this analysis.
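As a simplified illustration of the idea (not the exact algorithm we used; the job codes and sequences below are made up), one can count short maintenance patterns per group and test whether their prevalence differs:

```python
# Simplified sketch of differential sequence mining: compare how often a
# short maintenance pattern (here, a bigram of job codes) occurs for one
# make/model versus the rest of the fleet, and test the difference.
# Job codes and sequences are invented for illustration.
from collections import Counter
from scipy.stats import chi2_contingency

def bigram_counts(sequences):
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))   # consecutive pairs of job codes
    return counts

group = [["PM", "TIRE", "PM"], ["PM", "TIRE", "BRAKE"]]            # one make/model
rest  = [["BODY", "PM"], ["TIRE", "BRAKE", "BODY"], ["PM", "PM"]]  # rest of fleet

g, r = bigram_counts(group), bigram_counts(rest)
for pattern in set(g) | set(r):
    table = [[g[pattern], sum(g.values()) - g[pattern]],
             [r[pattern], sum(r.values()) - r[pattern]]]
    chi2, p, _, _ = chi2_contingency(table)
    print(pattern, "p-value:", round(p, 3))
```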

For nearly every vehicle, we found that the most common sequences were significantly different from the rest of the fleet. This indicated that there were strong, unique patterns in the sequences of these make/models relative to the rest of the fleet. This insight confirmed our finding from the tensor decomposition, demonstrating that there were useful patterns in maintenance by make/model that a sequence-based predictive modeling approach could utilize.

Predictive LSTM Model

Having demonstrated strong patterns in the way different vehicle makes/models were maintained over time, we then explored predictive modeling techniques that could capture these patterns in a vehicle’s maintenance history and use them to predict its next maintenance job.

We used a common sequential modeling algorithm, the Long Short-Term Memory (LSTM) network. This is a more sophisticated version of a standard neural network: instead of evaluating just a single observation (i.e., a single job or vehicle), it carries information about the observations it has seen before (i.e., the vehicle’s entire previous history). This allows the model to “remember” a vehicle’s past repairs and use all of the available historical information to predict the next job.

The specific model we used was a simple neural network adapted from prior work on predicting words in sentences, chosen for its ability to model complex sequences while avoiding over-fitting (Zaremba et al. 2014). In our paper, we demonstrate that this model is able to predict sequences effectively, even with a very small training set of only 164 vehicles, predicting on unseen test data of the same make/model (we used Dodge Chargers because they were the most common vehicle in the dataset and therefore provided the largest set of available data). While we don’t have any direct comparison for the model’s performance, we demonstrated that it achieves a perplexity score of 15.4, far outperforming a random baseline using the same distribution as the training sequences (perplexity of 260), and below state-of-the-art language models on the Penn Treebank (23.7 for the original model ours is based on) or the Google Billion Words Corpus (around 25) (Kuchaiev and Ginsburg 2017).
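As a rough sketch of what such a model looks like (this is not the exact architecture from the paper; the vocabulary size, hyperparameters, and data below are illustrative placeholders), a minimal PyTorch version might be:

```python
# Minimal sketch of an LSTM that predicts the next maintenance job code from
# a vehicle's job history. Vocabulary size, hyperparameters, and data are
# placeholders, not the model from the paper.
import torch
import torch.nn as nn

class NextJobLSTM(nn.Module):
    def __init__(self, n_job_types=40, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(n_job_types, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_job_types)

    def forward(self, job_ids):                 # job_ids: (batch, seq_len) of int codes
        h, _ = self.lstm(self.embed(job_ids))   # hidden state carries the history
        return self.out(h)                      # logits for the next job at each step

model = NextJobLSTM()
history = torch.randint(0, 40, (8, 12))          # 8 vehicles, 12 past jobs each (fake)
logits = model(history[:, :-1])                  # predict each following job
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 40), history[:, 1:].reshape(-1))
print(float(loss.exp()))                         # exp(cross-entropy) = perplexity
```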

Conclusion

Working with real-world data that contains complex, multidimensional patterns can be challenging. In this post, we demonstrated our approach to analyzing the Detroit vehicles dataset, applying tensor decomposition and differential sequence mining techniques to discover time- and vehicle-dependent patterns in the data, and then building an LSTM predictive model on this sequence data to demonstrate the ability to accurately predict a vehicle’s next job, given its past maintenance history.

We’re looking forward to continuing to explore the Detroit vehicles data, and have future projects in the works. We’d like to thank the General Services Department and the Operations and Infrastructure Group of the City of Detroit for bringing this project to our attention and making the data available for use.

For more information on the project, check out the full paper here.

MDST partners with Adamy Valuation for market analysis

By | Educational, General Interest, MDSTPosts, News

Authors: Michael Kovalcik, College of Engineering; Xinyu Tan, College of Engineering; Derek Chen, Ross School of Business.

Problem Overview

The Michigan Data Science Team partnered with Adamy Valuation, a Grand Rapids-based valuation firm, to bring data-driven insights to business equity valuation. Business valuation firms determine the market value of business interests in support of a variety of transactions, typically involving ownership interests in private businesses. Firms such as Adamy Valuation deliver this assessment along with a detailed report explaining why they believe it to be fair.

Valuations are performed by expert financial analysts, who use their knowledge of the factors that influence value to manually assess the value of the equity. Shannon Pratt’s Valuing a Business suggests that two key factors in particular influence value: risk and size. Risk is a measure of uncertainty about the company’s future and can be assessed by looking at total debt and cash flows. Size refers to a company’s economic power: larger companies spend and earn more than smaller ones. While these factors are quite informative, the degree to which they influence value varies considerably from industry to industry and even from company to company. Therefore, a valuation firm will often adjust its models manually to account for additional features, drawing on years of experience and industry expertise.

Our goals were to conduct a data-driven analysis of the valuation process and to build a predictive model that could learn to make value adjustments from historical data. A critical requirement of our approach was that the resulting model be interpretable. An algorithm that is extremely accurate but offers no insight into how a prediction was made, or what features it was based on, is of no use to Adamy Valuation because, at the end of the day, they must be able to validate the reasoning behind their assessment.

The Data Pipeline

While our goal is to value private companies, data related to these companies is difficult to come by.  Business valuation analysts address this issue by using market data from public companies as guideline data points to inform the valuation of a private subject company.  To this end, we acquired a dataset of 400 publicly-traded companies along with 20 financial metrics that are commonly used during valuation. We cleaned this dataset to only contain features that are relevant to private companies so that the model learned on public companies could later be applied to value private companies.

We separate the financial metrics into four categories: Size, Profitability, Growth, and Risk, as indicated by the colors in Fig. 1. Our goal was to determine which of the four categories, or more specifically, which features in these categories, contribute the most to the ratio TEV / EBITDA, where TEV (Total Enterprise Value) is a measure of a company’s market value that adjusts for things like debt and cash on hand, and EBITDA stands for earnings before interest, taxes, depreciation, and amortization. EBITDA allows analysts to focus on operating performance by minimizing the impact of non-operating factors such as the tax rates a company must pay and the degree to which its goods depreciate. In other words, EBITDA gives a clearer value for head-to-head comparisons of company performance. Valuation firms typically examine the ratio of TEV to EBITDA rather than TEV or EBITDA directly, because the ratio standardizes for the size of the company, making it easier to draw apples-to-apples comparisons with companies that may be much larger or smaller but are otherwise similar.

To study how feature importance varied across industries, we categorized each public company into one of three separate sectors:

  • Consumer Discretionary refers to companies that provide goods and services that are considered nonessential to the consumer. For example, Bed Bath and Beyond, Ford Motor Company, and Panera Bread are all part of this category.
  • Consumer Staples provide essential products such as food, beverages, and household items. Companies like Campbell’s Soup, Coca Cola, and Kellogg are considered Consumer Staples.
  • The Industrial Spending sector is a diverse category containing companies related to the manufacture and distribution of goods for industrial customers. In this dataset we see companies like Delta Airlines, FedEx, and Lockheed Martin.

Modeling

Our goal is not just to accurately estimate value, but also to identify key relationships between a company’s observable metrics and its ratio of TEV to EBITDA. We study 17 financial metrics, many of which have complex relationships with this ratio. To identify these relationships, we model the problem as a regression task. We use two simple but widely used frameworks, linear models and tree-based models, because both offer insight into how the predictions are actually made.

After fitting our models to the data, we identified the most predictive features of company value across industries and compared them to profit margin and size, the metrics most commonly cited in Valuing a Business. For our linear models, we used the coefficients of the regression equation to determine which features were most important. For our random forest model, we used the feature importance metric, which ranks features according to the information gained during the fitting process.
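As an illustration of how these importances can be read off the two model families (with synthetic data and placeholder feature names rather than our actual 17 metrics), a sketch might look like:

```python
# Sketch of feature-importance extraction: absolute coefficients for a
# standardized linear model, and impurity-based importances for a random
# forest. Feature names and data are placeholders.
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

feature_names = [f"metric_{i}" for i in range(17)]           # stand-ins for 17 financial metrics
X, y = make_regression(n_samples=400, n_features=17, random_state=0)

Xs = StandardScaler().fit_transform(X)                       # put coefficients on a common scale
lin = LinearRegression().fit(Xs, y)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)

ranking = pd.DataFrame({
    "abs_coefficient": np.abs(lin.coef_),
    "rf_importance": rf.feature_importances_,
}, index=feature_names)
print(ranking.sort_values("rf_importance", ascending=False).head(3))
```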

Results

The figure to the right depicts the accuracy of our models versus the market approach (also known as the comparable approach), the method used by valuation firms. Given the size of the dataset and the specificity of the market approach, we are not surprised that it outperforms our models. Rather, we are showing that our models are accurate enough for us to trust the interpretation of their features.

Also on the right we show the top three features per industry, according to information gain, as learned by our random forest model; the larger the bar, the more informative that variable was for prediction. The features that turn up in our model are indicators of profitability and size, which agrees with existing knowledge in the literature. It is interesting to note that return on assets shows up in each sector, which intuitively means the market values companies that achieve high returns regardless of sector.

Explanation of Key Predictors

Recall that our goal was to predict TEV/EBITDA, a measure of a company’s total value after standardizing for size, tax structure, and a number of other factors. Five distinct predictors stood out in our analysis.

Return on Assets is a measure of a company’s efficiency in generating profit.

Total Revenue is also known as total sales and is a measurement of how much a company receives from the sale of goods and services.

EBITDA 1-year growth: EBITDA is a measure of profitability, so growing EBITDA means growing profit and increasing company value.

Capital Expenditure (Capex) is the amount of money a company invests in property and equipment. Capex is often linked to the expansion or contraction of a business and is therefore a measure of growth; looking at Capex as a percentage of revenue provides a normalized measurement for comparison.

EBITDA Margin serves as an indicator of a company’s operating profitability. Higher EBITDA margin means the company is getting more EBITDA for every dollar of revenue.

MSSISS

MSSISS, the Michigan Student Symposium for Interdisciplinary Statistical Sciences, is an annual conference hosted by the University of Michigan. It brings together statistical work from a number of different fields, including computer science, electrical engineering, statistics, biostatistics, and industrial and operations engineering. Our poster was particularly interesting because it was the only one with a financial application. The novelty of our project drew in a number of viewers and impressed the judges. A major component of our poster score was determined by our ability to communicate our results to people outside the field. We received a certificate of merit for our work and our ability to communicate it to the other attendees at the conference.


MDST announces Detroit blight data challenge; organizational meeting Feb. 16

By | Educational, General Interest, MDSTPosts, MDSTProjects, News

The Michigan Data Science Team and the Michigan Student Symposium for Interdisciplinary Statistical Sciences (MSSISS) have partnered with the City of Detroit on a data challenge that seeks to answer the question: How can blight ticket compliance be increased?

An organizational meeting is scheduled for Thursday, Feb. 16 at 5:30 p.m. in EECS 1200.

The city is making datasets available containing building permits, trades permits, citizen complaints, and more.

The competition runs through March 15. For more information, see the competition website.

MDST Poster Wins Symposium Competition

By | MDSTPosts

Today, MDST participated in the student poster competition at the “Meeting the Challenges of Safe Transportation in an Aging Society Symposium”. The poster highlights the key findings from the Fatal Accident Reporting System (FARS) competition we held earlier this year. The Michigan Institute for Data Science (MIDAS) provided MDST members access to a dataset of fatal crashes in the US, with a labeled variable indicating whether alcohol was involved in the incident, and models were judged based on how well they could predict the value of this true/false variable.

The poster describes the winning model for the competition, an ensemble of a neural network and a boosted decision tree, and identifies crash time, location, and the number of passengers involved as the most predictive variables.

We want to thank MIDAS for funding the competition, Chengyu Dai and Guangsha Shi for representing MDST at the ATLAS Symposium, and the many members of MDST who participated in the FARS Challenge.

You can download the poster from the link below.

  • “Inferring Alcohol Involvement in Fatal Car Accidents with Ensembled Classifiers”, Guangsha Shi, Arya Farahi, Chengyu Dai, Cyrus Anderson, Jiachen Huang, Wenbo Shen, Kristjan Greenwald, and Jonathan Stroud.