MDST Partners with City of Detroit to Improve Blight Enforcement


Author: Allie Cell, College of Engineering

PROBLEM OVERVIEW

Property blight refers to lots and structures that are not properly maintained (as pictured above), and it is a major problem in Detroit: over 20% of lots in the city are blighted. To help curb this behavior, in 2005 the City of Detroit began issuing blight tickets to the owners of afflicted properties, with fines ranging from $20 (for offenses such as leaving out a trash bin too far ahead of its collection date) to $10,000 (for offenses such as dumping more than 5 cubic feet of waste). Unfortunately, only 7% of people who are issued tickets and found guilty actually pay, leaving a balance of some $73 million in unpaid blight tickets.

Officials who work with blight tickets in the City of Detroit's Department of Administrative Hearings and Department of Innovation and Technology first came together with the Michigan Data Science Team this past February to sponsor a competition to forecast blight ticket compliance. To provide a more actionable analysis for Detroit policymakers, we aimed to understand what sorts of people receive blight tickets, to use this knowledge to better grasp why blight tickets have not been effective, and to provide insights for policymakers accordingly.

THE DATA

To get an accurate picture of blight compliance in Detroit, we aggregated information from multiple datasets. We obtained most of them from the Detroit Open Data Portal, an online hub for public data with datasets covering public health, education, transportation, and more. Some of the most valuable datasets we used included:

Blight Ticket Data, records of each blight ticket

Parcel Data, records of all properties in Detroit

Crime Data, records of all crimes in Detroit from 2009 through 2016

Demolition Data, records of each completed and scheduled demolition in Detroit

Improve Detroit, records of all issues submitted through an app whose goal is improving the city

PREDICTING BLIGHT TICKET COMPLIANCE

We built a model to predict whether a property owner would pay their blight ticket. Each record in the dataset contained one-hot encoded features from the sources listed above. Tree-based methods are easily interpretable and perform well on mixed data, so we considered scikit-learn Random Forests and xgboost Gradient Boosted Trees (XGBClassifier). To choose between them, we generated learning curves with 5-fold cross-validation for each classifier; xgboost performed best, with a cross-validation score of over 0.9, so we selected this model.
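For readers who want a concrete picture of this comparison, here is a minimal sketch in Python; the synthetic dataset and hyperparameters are illustrative stand-ins rather than the actual blight ticket features and settings we used:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Synthetic stand-in for the one-hot encoded features and compliance labels
# assembled from the data sources listed above.
X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

models = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "xgboost": XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1),
}

for name, model in models.items():
    # 5-fold cross-validation; the learning curves described above extend this
    # idea across increasing amounts of training data.
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean CV AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")
```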

Figure: 20% of violators account for over 70% of unpaid blight fines.

ANALYSIS OF TICKETED PROPERTY OWNERS

To gain a more holistic understanding of the relationships between ticketed property owners and their properties, we analyzed three categories of property owners:

Top Offenders, the small portion of offenders who own many blighted properties and account for the majority of tickets (as shown above, 20% of violators account for over 70% of unpaid blight fines)

Live-In Owners, offenders who were determined to actually live in their blighted property, indicative of a stronger relationship between the owner and the house

Residential Rental Property Owners, offenders who own residential properties but were determined to not live in them, indicative of a more income-driven relationship between the owner and the property

After deciding to focus on these three groups, we found some notable differences between the owner categories:

Repeat Offenses: only 11% of live-in owners were issued more than two blight tickets (71% of live-in owners received only one blight ticket); the multiple offense rate jumps to 20% when looking at residential rental property owners (59% of residential rental property owners received only one blight ticket).

Property Conditions: only 4.8% of the ticketed properties owned by live-in owners were in poor condition, compared to 7.1% for those owned by residential rental property owners and 8.8% for those owned by top offenders

Compliance: only 6.5% of tickets issued to top offenders were paid on time or less than one month late, significantly less than the 10% and 11% rates for residential rental property owners and live-in owners, respectively.

Occupancy Rates: live-in owners saw the highest occupancy rate on properties that were issued blight tickets, 69%, followed by 57% for residential rental property owners and 47% for top offenders. While this trend makes sense, we would expect a 100% occupancy rate for properties whose owners supposedly live in them; the gap from 100% is one testament to the data quality problems our team faced, stemming both from inconsistency in the records and from real estate turnover in Detroit.

Feel free to check out the whole paper here.

Driving with Data: MDST Partners with City of Detroit for Vehicle Maintenance Analysis Project


Author: Josh Gardner, School of Information

For this project, MDST partnered with the City of Detroit’s Operations and Infrastructure Group. The Operations and Infrastructure Group manages the City of Detroit’s vehicle fleet, which includes vehicles of every type: police cars, ambulances, fire trucks, motorcycles, boats, semis … all of the unique vehicles that the City of Detroit uses to manage the many tasks that keep this city of over 700,000 people functioning.

The Operations and Infrastructure Group was interested in exploring several aspects of its vehicle fleet, especially vehicle cost and reliability and how they relate to the way its fleet of over 2,700 vehicles is maintained. The project kicked off at a pivotal time for the City of Detroit: the city filed for bankruptcy in 2013 and emerged in 2014, so cost-effectiveness with its vehicle fleet, one of the city's most expensive and critical assets, was a key lever for operational effectiveness.

The city provided two data sources: a vehicles table, with one record for each vehicle purchased by the city — nearly 6700 vehicles, purchased as early as 1944 — and a maintenance table, with one record for each maintenance job performed on a vehicle owned by the city — over 200,000 records of jobs such as preventive maintenance, tire changes, body work, and windshield replacement.

Our team faced three challenges in this project that are common to many data science tasks. First, as a team of student data scientists who were not experts in city government, we needed to work with our partners to determine which questions our analysis should address. The possibilities for analysis were nearly endless, and we needed to ensure that the ones we chose would help the city understand and act on its data. Second, the data was incomplete and potentially unreliable: it included paper records converted to digital records (like the vehicles from 1944) and data reflecting human coding decisions that may be inaccurate, arbitrary, or overlapping. Third, our team needed to recover complex structure from tabular data. Our data came in the format of two simple tables, but those tables contained complex patterns across time, vehicles, and locations, and we needed techniques that could help us discover and explore these patterns. This is a common challenge with data science in the "real world", where complex data is stored in tables.

Data Analysis and Modeling

Tensor Decomposition

Tensor decomposition describes techniques used to decompose tensors, or multidimensional data arrays. We applied the PARallel FACtors, or PARAFAC, decomposition; we refer readers to our paper or to (Kolda and Bader 2009) for details on this technique. We also enjoyed (Bader et al. 2008), an interesting application of this technique to tracing patterns in the Enron email dataset.

Applying the PARAFAC allowed us to automatically extract and visualize multidimensional patterns in the vehicle maintenance data. First, we created data tensors as shown below. Second, we applied the PARAFAC decomposition to these tensors, producing a set of factor matrices that best reconstruct the original data tensor.

Left: Depiction of (a) a 3-mode data tensor; (b) the same tensor as a stacked series of frontal slices, or arrays; (c) an example single frontal slice of a vehicle data tensor used in this analysis (each entry corresponds to the count of a specific job type for a vehicle at a fixed time). Right: Visual depiction of the PARAFAC decomposition.
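As a rough illustration of what this decomposition step can look like in code, the sketch below runs a PARAFAC (CP) decomposition with the tensorly library on a random tensor standing in for the (vehicle x job type x time) counts; the library choice, rank, and dimensions are assumptions for illustration, not necessarily what we used:

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(0)
# Random stand-in tensor: 100 vehicles x 20 job types x 36 months of job counts.
tensor = tl.tensor(rng.poisson(lam=1.0, size=(100, 20, 36)).astype(float))

# Decompose into 5 rank-one components; each factor matrix gets 5 columns.
weights, factors = parafac(tensor, rank=5, init="random", random_state=0)
vehicle_factors, job_factors, time_factors = factors

# Plotting column r of the three factor matrices side by side gives one
# "three-way plot": a joint pattern over vehicles, job types, and time.
print(vehicle_factors.shape, job_factors.shape, time_factors.shape)
```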

We generated and explored "three-way plots" of these factors using two different types of time axis: absolute time, which allowed us to model things like seasonality, and vehicle lifetime, which allowed us to model patterns in how maintenance occurred over a vehicle's lifetime (relative to when it was purchased). We show one such plot below, which demonstrates unique maintenance patterns for the 2015 Smeal SST Pumper fire truck (shown in the photo in the introduction above); for a detailed analysis of this and other three-way plots from the Detroit vehicles data, check out the paper. Here, we will simply note that PARAFAC helped us discover unique maintenance patterns, including a set of vehicles that Detroit maintained at different specialty technical centers, without us specifically looking for them or even knowing these patterns existed.

Differential Sequence Mining

Our tensor decomposition analysis demonstrated several interesting patterns, both in absolute time and in vehicle lifetime. As a second step of data exploration, we investigated whether there were statistically distinctive sequences of vehicle maintenance. To do this, we employed a differential sequence mining algorithm to compare common sequences between different groups and identify whether the differences in their prevalence are statistically significant. For this analysis, we compared each vehicle make/model to the rest of the fleet.
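The sketch below illustrates the basic idea on toy job sequences: count how often a candidate subsequence appears in one group versus another and test whether the difference is significant. The job codes, groups, and use of Fisher's exact test here are illustrative assumptions, not our exact algorithm:

```python
from scipy.stats import fisher_exact

def contains(seq, pattern):
    # True if `pattern` occurs as a contiguous subsequence of `seq`.
    m = len(pattern)
    return any(seq[i:i + m] == pattern for i in range(len(seq) - m + 1))

# Toy maintenance histories for one make/model vs. the rest of the fleet.
group_a = [["PM", "TIRES", "PM"], ["PM", "BRAKES"], ["PM", "TIRES", "BODY"]]
group_b = [["BODY", "GLASS"], ["BRAKES", "BRAKES"], ["PM", "BODY"], ["GLASS"]]

pattern = ["PM", "TIRES"]
a_hits = sum(contains(s, pattern) for s in group_a)
b_hits = sum(contains(s, pattern) for s in group_b)

# 2x2 contingency table: [has pattern, lacks pattern] for each group.
table = [[a_hits, len(group_a) - a_hits],
         [b_hits, len(group_b) - b_hits]]
odds_ratio, p_value = fisher_exact(table)
print(f"pattern {pattern}: p = {p_value:.3f}")
```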

For nearly every make/model, we found that its most common sequences were significantly different from those of the rest of the fleet. This indicated that there were strong, unique patterns in the maintenance sequences of each make/model relative to the rest of the fleet. This insight confirmed our finding from the tensor decomposition, demonstrating that there were useful patterns in maintenance by make/model that a sequence-based predictive modeling approach could exploit.

Predictive LSTM Model

Having demonstrated strong patterns in the way different vehicle make/models were maintained over time, we then explored predictive modeling techniques that could capture these patterns in a vehicle's maintenance history and use them to predict its next maintenance job.

We used a common sequential modeling algorithm, the Long Short-Term Memory (LSTM) network. This is a more sophisticated version of a standard neural network which, instead of evaluating just a single observation (i.e., a single job or vehicle), captures information about the observations it has seen before (i.e., the vehicle's entire previous history). This allows the model to "remember" a vehicle's past repairs and use all of the available historical information to predict the next job.

The specific model we used was a simple neural network architecture adapted from prior work on predicting words in sentences, chosen for its ability to model complex sequences while avoiding over-fitting (Zaremba et al. 2014). In our paper, we demonstrate that this model is able to predict maintenance sequences effectively, even with a very small training set of only 164 vehicles and when predicting on unseen test data of the same make/model (we used Dodge Chargers for this prediction because they were the most common vehicle in the dataset we evaluated, giving the largest set of available data). While we don't have any direct comparison for the model's performance, we showed that it achieves a perplexity of 15.4, far outperforming a random baseline drawn from the same distribution as the training sequences (perplexity of 260) and even below state-of-the-art language models on the Penn Treebank (23.7 for the original model ours is based on) or the Google Billion Words corpus (around 25) (Kuchaiev and Ginsburg 2017).
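To make the setup concrete, here is a minimal PyTorch sketch of this kind of next-job model: an embedding plus an LSTM over a vehicle's integer-coded job history, with a softmax over job types and perplexity as the evaluation metric. The layer sizes and data below are illustrative, not the configuration from our paper:

```python
import torch
import torch.nn as nn

class NextJobLSTM(nn.Module):
    def __init__(self, n_job_types, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(n_job_types, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_job_types)

    def forward(self, job_history):
        # job_history: (batch, seq_len) integer-coded maintenance jobs
        hidden, _ = self.lstm(self.embed(job_history))
        return self.out(hidden)  # logits over job types at each position

model = NextJobLSTM(n_job_types=50)
history = torch.randint(0, 50, (8, 12))  # 8 vehicles, 12 past jobs each
logits = model(history)

# Predict job t+1 from jobs 1..t; perplexity is exp(mean cross-entropy).
loss = nn.CrossEntropyLoss()(logits[:, :-1].reshape(-1, 50),
                             history[:, 1:].reshape(-1))
print("perplexity:", torch.exp(loss).item())
```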

Conclusion

Working with real-world data that contains complex, multidimensional patterns can be challenging. In this post, we demonstrated our approach to analyzing the Detroit vehicles dataset, applying tensor decomposition and differential sequence mining techniques to discover time- and vehicle-dependent patterns in the data, and then building an LSTM predictive model on this sequence data to demonstrate the ability to accurately predict a vehicle’s next job, given its past maintenance history.

We’re looking forward to continuing to explore the Detroit vehicles data, and have future projects in the works. We’d like to thank the General Services Department and the Operations and Infrastructure Group of the City of Detroit for bringing this project to our attention and making the data available for use.

For more information on the project, check out the full paper here.

MDST partners with Adamy Valuation for market analysis


Authors: Michael Kovalcik, College of Engineering; Xinyu Tan, College of Engineering; Derek Chen, Ross School of Business.

Problem Overview

The Michigan Data Science Team partnered with Adamy Valuation, a Grand Rapids-based valuation firm, to bring data-driven insights to business equity valuation. Business valuation firms determine the market value of business interests in support of a variety of transactions, typically involving ownership interests in private businesses. Valuation firms such as Adamy Valuation deliver this assessment together with a detailed report explaining why they believe it to be fair.

Valuations are performed by expert financial analysts, who use their knowledge about the factors that influence value to manually assess the value of the equity. Shannon Pratt’s Valuing a Business suggests that there are two key factors in particular that influence value: risk and size. Risk is a measure of uncertainty relating to the company’s future  and can be assessed by looking at total debt and cash flows. Size refers to a company’s economic power. Larger companies will spend and make more than smaller ones. While these factors are quite informative, the degree to which they influence value varies a lot from industry to industry and even from company to company. Therefore, a valuation firm will often adjust their models manually to account for additional features, using knowledge gained from years of experience and industry expertise.

Our goals were to conduct a data-driven analysis of the valuation process and to build a predictive model that could learn to make value adjustments from historical data. A critical requirement of our approach was that the resulting model be interpretable. An algorithm that is extremely accurate but offers no insight into how its predictions are made or what features they are based on is of no use to Adamy Valuation because, at the end of the day, the firm must be able to validate the reasoning behind its assessment.

The Data Pipeline

While our goal is to value private companies, data related to these companies is difficult to come by.  Business valuation analysts address this issue by using market data from public companies as guideline data points to inform the valuation of a private subject company.  To this end, we acquired a dataset of 400 publicly-traded companies along with 20 financial metrics that are commonly used during valuation. We cleaned this dataset to only contain features that are relevant to private companies so that the model learned on public companies could later be applied to value private companies.

We separated the financial metrics into four categories: Size, Profitability, Growth, and Risk, as indicated by the colors in Fig. 1. Our goal was to determine which of the four categories, or more specifically, which features within these categories, contribute the most to:

TEV / EBITDA

where TEV represents the Total Enterprise Value, a measure of a company's market value that adjusts for things like debt and cash on hand, and EBITDA stands for earnings before interest, tax, depreciation, and amortization. EBITDA allows analysts to focus on operating performance by minimizing the impact of non-operating factors, such as the tax rates a company must pay and the degree to which its goods depreciate; in other words, EBITDA gives a clearer value for head-to-head comparisons of company performance. Valuation firms typically examine the ratio of TEV to EBITDA rather than TEV or EBITDA directly, because the ratio standardizes for the size of the company, making it easier to make apples-to-apples comparisons with companies that may be much larger or smaller but are otherwise similar.
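As a quick, purely hypothetical illustration of the multiple (the numbers below are made up):

```python
# Hypothetical company, all figures in $M.
market_cap = 400.0   # value of equity
debt = 150.0         # total debt
cash = 50.0          # cash on hand
ebitda = 50.0        # earnings before interest, tax, depreciation, amortization

tev = market_cap + debt - cash   # total enterprise value
print(f"TEV/EBITDA = {tev / ebitda:.1f}x")   # -> 10.0x
```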

To study how feature importance varied across industries, we categorized each public company into one of three separate sectors:

  • Consumer Discretionary refers to companies that provide goods and services that are considered nonessential to the consumer. For example, Bed Bath and Beyond, Ford Motor Company, and Panera Bread are all part of this category.
  • Consumer Staples provide essential products such as food, beverages, and household items. Companies like Campbell’s Soup, Coca Cola, and Kellogg are considered Consumer Staples.
  • The Industrials sector is a diverse category containing companies related to the manufacture and distribution of goods for industrial customers. In this dataset we see companies like Delta Air Lines, FedEx, and Lockheed Martin.

Modeling

Our goal was not just to accurately estimate value, but also to identify key relationships between a company's observable metrics and its ratio of TEV to EBITDA. We studied 17 financial metrics, many of which have complex relationships with the TEV/EBITDA ratio. To identify these relationships, we modeled the problem as a regression task, using two simple but widely-used frameworks, linear models and tree-based models, because both offer insight into how their predictions are actually made.

After fitting our models to the data, we identified the most predictive features of company value across industries and compared these to profit margin and size, the metrics most commonly used in Valuing a Business. For our linear models, we used the coefficients in the regression equation to determine which features were most important. For our random forest model, we used the feature importance metric, which ranks features according to the information gained during the fitting process.
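A minimal sketch of these two approaches, using synthetic stand-in data rather than the actual 17 financial metrics and TEV/EBITDA targets, might look like this:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: 400 companies, 17 numeric features, continuous target.
X, y = make_regression(n_samples=400, n_features=17, noise=0.1, random_state=0)

linear = LinearRegression().fit(X, y)
forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)

# Linear model: coefficient magnitudes (on comparably scaled features) rank
# feature importance; random forest: impurity-based importances rank them.
print("largest |coefficient| at feature:", abs(linear.coef_).argmax())
print("highest forest importance at feature:", forest.feature_importances_.argmax())
```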

Results

The figure to the right depicts the accuracy of our models versus the market approach (also known as the comparable approach), the method used by valuation firms. Given the size of the dataset and the specificity of the market approach, we are not surprised that it outperforms our models. Rather, we are showing that our models are accurate enough that the interpretation of their features can be trusted.

Also on the right, we show the top 3 features per industry, according to information gain, as learned by our random forest model. The larger the bar, the more informative that variable was for predictions. The features that turn up in our model are indicators of profitability and size, which agrees with the existing knowledge in the literature. It is interesting that return on assets shows up in each sector, which intuitively means the market values companies that get high returns regardless of sector.

Explanation of Key Predictors

Recall that our goal was to predict TEV/EBITDA, a measure of a company's total value after standardizing for things such as size, tax structure, and a number of other factors. Five distinct predictors stood out in our analysis.

Return on Assets is a measure of a company’s efficiency in generating profit.

Total Revenue is also known as total sales and is a measurement of how much a company receives from the sale of goods and services.

EBITDA 1-Year Growth: EBITDA is a measure of profitability, so growing EBITDA means growing profit and increasing company value.

A Capital Expenditure (Capex) is the amount of money that a company invests in property and equipment. Capex is often linked to the expansion or contraction of a business and is therefore a measure of growth. Looking at Capex as a percentage of revenue provides a normalized measurement for comparison.

EBITDA Margin serves as an indicator of a company’s operating profitability. Higher EBITDA margin means the company is getting more EBITDA for every dollar of revenue.

MSSISS

MSSISS, the Michigan Student Symposium for Interdisciplinary Statistical Sciences, is an annual conference hosted by the University of Michigan. It brings together statistical work from a number of different fields, including computer science, electrical engineering, statistics, biostatistics, and industrial operations. Our poster stood out as the only one with a financial application, and the novelty of the project drew in a number of viewers and impressed the judges. A major component of our poster score was determined by our ability to communicate our results to people outside the field, and we received a certificate of merit for our work and our ability to communicate it to the other attendees at the conference.


MDST announces Detroit blight data challenge; organizational meeting Feb. 16


The Michigan Data Science Team and the Michigan Student Symposium for Interdisciplinary Statistical Sciences (MSSISS) have partnered with the City of Detroit on a data challenge that seeks to answer the question: How can blight ticket compliance be increased?

An organizational meeting is scheduled for Thursday, Feb. 16 at 5:30 p.m. in EECS 1200.

The city is making datasets available containing building permits, trades permits, citizen complaints, and more.

The competition runs through March 15. For more information, see the competition website.

MDST Poster Wins Symposium Competition


Today, MDST participated in the student poster competition at the “Meeting the Challenges of Safe Transportation in an Aging Society Symposium”. The poster highlights the key findings from the Fatal Accident Reporting System (FARS) competition we held earlier this year. The Michigan Institute for Data Science (MIDAS) provided MDST members access to a dataset of fatal crashes in the US, with a labeled variable indicating whether alcohol was involved in the incident, and models were judged based on how well they could predict the value of this true/false variable.

The poster describes the winning model for the competition, an ensemble of a neural network and a boosted decision tree, and identifies crash time, location, and the number of passengers involved as the most predictive variables.

We want to thank MIDAS for funding the competition, Chengyu Dai and Guangsha Shi for representing MDST at the ATLAS Symposium, and the many members of MDST who participated in the FARS Challenge.

You can download the poster from the link below.

Bloomberg Conference Accepts Both MDST Papers!


Earlier this summer, MDST submitted two papers to the Bloomberg Data For Good Exchange conference describing our work on the Flint Water Crisis and with the University Musical Society, respectively. It is my great pleasure to announce that both of our papers have been accepted for presentation at the conference in New York on September 25th!

Needless to say, we’re all very excited. 🎉

MDST Faculty Advisor Jacob Abernethy Interviewed for Machine Learning Podcast!


Our very own Jacob Abernethy was recently interviewed on the popular machine learning podcast, Talking Machines. Among other things, Jake was asked about his experiences working with the trove of municipal data available in Flint, his path to research at the University of Michigan, and our work with Google and UM-Flint.

You can find a link to the interview here. Fun fact: Talking Machines is produced by Katherine Gorman, a UM alumna!

MDST Submits Two Papers to Bloomberg Conference


While we are known for our participation in structured prediction challenges, MDST has picked up at least two community projects in the last year. MDST members of all experience levels got to participate in both our efforts in Flint and our work with UMS’s ticket purchase data. Around the time that we hit milestones in both projects, news of the Bloomberg Data 4 Good Exchange call for papers reached some members of MDST and we decided to take a shot.

The results of our foray into volunteer, remote, academic paper collaboration can be found below in the form of two successfully written MDST papers! We’re incredibly proud of the results and even prouder of our membership, who worked so hard to produce such quality work.