Monthly Archives

September 2017

Real estate dataset available to researchers


The University of Michigan Library system and the Data Acquisition for Data Sciences program (DADS) of the U-M Data Science Initiative (DSI) have recently joined forces to license a major data resource capturing parcel-level information about the property market in the United States.  

The data were licensed from the Corelogic corporation, which has assimilated deed, tax, and foreclosure information on nearly all properties in the US. Coverage dates vary by county; some county records go back fifty years, and coverage is more comprehensive from the 1990s to the present.

These data will support a variety of research efforts into regional economies, economic disparities, trends in land-use, housing market dynamics, and urban ecology, among many other areas.

The data are available on the Turbo Research Storage system for users of the U-M High Performance Computing infrastructure, and via the University of Michigan Library.

To access the data, researchers must first sign an MOU; contact Senior Associate Librarian Catherine Morse (cmorse@umich.edu) for more information, or visit https://www.lib.umich.edu/database/corelogic-parcel-level-real-estate-data.

MDST Partners with City of Detroit to Improve Blight Enforcement


Author: Allie Cell, College of Engineering

[Photo: a blighted property in Detroit]

PROBLEM OVERVIEW

Property blight, which refers to lots and structures not being properly maintained (as pictured above), is a major problem in Detroit: over 20% of lots in the city are blighted. To help curb this behavior, in 2005 the City of Detroit began issuing blight tickets to the owners of afflicted properties, with fines ranging from $20, for offenses such as setting out a trash bin too long before its collection date, to $10,000, for offenses such as dumping more than 5 cubic feet of waste. Unfortunately, only 7% of people who are issued tickets and found guilty actually pay, leaving a balance of some $73 million in unpaid blight tickets.

This past February, officials who work with blight tickets in the City of Detroit’s Department of Administrative Hearings and Department of Innovation and Technology came together with the Michigan Data Science Team to sponsor a competition to forecast blight compliance. To provide a more actionable analysis for Detroit policymakers, we aimed to understand what sorts of people receive blight tickets, to use this knowledge to better grasp why blight tickets have not been effective, and to provide insights for policymakers accordingly.

THE DATA

To get an accurate picture of blight compliance in Detroit, we aggregated information from multiple datasets, most of which came from the Detroit Open Data Portal, an online hub for public data on topics such as public health, education, and transportation. Some of the most valuable datasets we used are listed below, followed by a brief sketch of how such sources can be joined:

• Blight Ticket Data: records of each blight ticket

• Parcel Data: records of all properties in Detroit

• Crime Data: records of all crimes in Detroit from 2009 through 2016

• Demolition Data: records of each completed and scheduled demolition in Detroit

• Improve Detroit: records of all issues submitted through an app whose goal is improving the city
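As a rough illustration (not our actual pipeline), the pandas sketch below joins hypothetical exports of these datasets; the file names and the parcel_id join key are placeholders, since each Open Data Portal export uses its own schema.

import pandas as pd

# Load hypothetical CSV exports from the Detroit Open Data Portal.
tickets = pd.read_csv("blight_tickets.csv")       # one row per blight ticket
parcels = pd.read_csv("parcels.csv")              # one row per property
demolitions = pd.read_csv("demolitions.csv")      # completed and scheduled demolitions

# Attach parcel attributes to each ticket through a shared parcel identifier.
tickets = tickets.merge(parcels, on="parcel_id", how="left")

# Flag ticketed parcels that also appear in the demolition records.
tickets["demolished"] = tickets["parcel_id"].isin(demolitions["parcel_id"])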

PREDICTING BLIGHT TICKET COMPLIANCE

We built a model to predict whether a property owner would pay their blight ticket. Each record in the dataset combined one-hot-encoded features from the sources listed above. Tree-based methods are easily interpretable and tend to perform well on mixed data, so we considered scikit-learn Random Forests and xgboost Gradient Boosted Trees (XGBClassifier). To choose the best model, we generated learning curves with 5-fold cross-validation for each classifier; xgboost performed well, with a cross-validation score of over 0.9, so we selected it.
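As a rough sketch of that comparison, the snippet below builds learning curves with 5-fold cross-validation for both classifiers; the synthetic data stands in for the real one-hot-encoded features and labels, and the AUC metric is our assumption.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve
from xgboost import XGBClassifier

# Placeholder data standing in for the one-hot-encoded ticket features
# and the binary "ticket paid" label (roughly 7% positive).
X, y = make_classification(n_samples=2000, n_features=40, weights=[0.93], random_state=0)

models = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "xgboost": XGBClassifier(n_estimators=200, max_depth=4, random_state=0),
}

for name, model in models.items():
    # Learning curve with 5-fold cross-validation.
    sizes, train_scores, valid_scores = learning_curve(
        model, X, y, cv=5, train_sizes=np.linspace(0.2, 1.0, 5), scoring="roc_auc")
    print(name, valid_scores.mean(axis=1))  # validation AUC at each training-set size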


ANALYSIS OF TICKETED PROPERTY OWNERS

To gain a more holistic understanding of the relationships between ticketed property owners and their properties, we analyzed three categories of property owners:

• Top Offenders: the small portion of offenders who own many blighted properties and account for the majority of tickets; 20% of violators own over 70% of unpaid blight fines

• Live-In Owners: offenders who were determined to actually live in their blighted property, indicative of a stronger relationship between the owner and the house

• Residential Rental Property Owners: offenders who own residential properties but were determined not to live in them, indicative of a more income-driven relationship between the owner and the property

After deciding to focus on comparing these three groups, we found some notable differences among the owner categories:

• Repeat Offenses: only 11% of live-in owners were issued more than two blight tickets (71% of live-in owners received only one), while the multiple-offense rate jumps to 20% for residential rental property owners (59% of whom received only one ticket).

• Property Conditions: only 4.8% of the ticketed properties owned by live-in owners were in poor condition, compared to 7.1% for those owned by residential rental property owners and 8.8% for those owned by top offenders.

• Compliance: only 6.5% of tickets issued to top offenders were paid on time or less than one month late, significantly less than the 10% and 11% rates for residential rental property owners and live-in owners, respectively.

• Occupancy Rates: live-in owners saw the highest occupancy rate among ticketed properties, 69%, followed by 57% for residential rental property owners and 47% for top offenders. While this ordering makes sense, we would expect an occupancy rate near 100% for properties whose owners actively live in them; the gap from 100% is one testament to the data-quality problems our team faced, both from inconsistency in the records and from real estate turnover in Detroit.

Feel free to check out the whole paper here.

Driving with Data: MDST Partners with City of Detroit for Vehicle Maintenance Analysis Project


Author: Josh Gardner, School of Information

For this project, MDST partnered with the City of Detroit’s Operations and Infrastructure Group. The Operations and Infrastructure Group manages the City of Detroit’s vehicle fleet, which includes vehicles of every type: police cars, ambulances, fire trucks, motorcycles, boats, semis … all of the unique vehicles that the City of Detroit uses to manage the many tasks that keep this city of over 700,000 people functioning.

[Photo: a Smeal SST Pumper fire truck]

The Operations and Infrastructure Group was interested in exploring several aspects of its vehicle fleet, especially vehicle cost and reliability and their relationship to how its fleet of over 2,700 vehicles was maintained. This project kicked off at a pivotal time for the City of Detroit: the city filed for bankruptcy in 2013 and emerged in 2014, making cost-effectiveness with its vehicle fleet, one of the city’s most expensive and critical assets, a key lever for operational effectiveness in the city.

The city provided two data sources: a vehicles table, with one record for each vehicle purchased by the city (nearly 6,700 vehicles, purchased as early as 1944), and a maintenance table, with one record for each maintenance job performed on a city-owned vehicle (over 200,000 records of jobs such as preventive maintenance, tire changes, body work, and windshield replacement).

Our team faced three challenges in this project that are common to many data science tasks. First, as a team of student data scientists who were not experts in city government, we needed to work with our partners to determine which questions our analysis should address. The possibilities for analysis were nearly endless, and we needed to ensure that the ones we chose would help the city understand and act on its data. Second, the data were incomplete and potentially unreliable: some records had been converted from paper to digital form (like the vehicles from 1944), and others reflected human coding decisions that could be inaccurate, arbitrary, or overlapping. Third, our team needed to recover complex structure from tabular data. Our data came as two simple tables, but those tables contained complex patterns across time, vehicles, and locations, and we needed techniques that could help us discover and explore them. This is a common challenge with data science in the “real world”, where complex data is stored in tables.

Data Analysis and Modeling

Tensor Decomposition

Tensor decomposition refers to a family of techniques for decomposing tensors, or multidimensional data arrays. We applied the PARallel FACtors, or PARAFAC, decomposition; we refer readers to our paper or to (Kolda and Bader 2009) for details on this technique. Also check out (Bader et al. 2008), which we enjoyed reading, for an interesting application of this technique to tracing patterns in the Enron email dataset.

Applying the PARAFAC allowed us to automatically extract and visualize multidimensional patterns in the vehicle maintenance data. First, we created data tensors as shown below. Second, we applied the PARAFAC decomposition to these tensors, which produced a series of factor matrices that best reconstruct the original data tensor.

Left: Depiction of (a) a 3-mode data tensor; (b) the same tensor as a stacked series of frontal slices, or arrays; (c) an example single frontal slice of a vehicle data tensor used in this analysis (each entry corresponds to the count of a specific job type for a vehicle at a fixed time). Right: Visual depiction of the PARAFAC decomposition.
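To make the decomposition step concrete, here is a minimal sketch using the tensorly library (our choice for illustration; the post does not specify which implementation was used). The tensor axes follow the caption above, and the counts are random placeholders.

import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

# Placeholder (vehicle, job type, time period) tensor of maintenance-job counts.
n_vehicles, n_job_types, n_periods = 50, 12, 36
counts = np.random.poisson(1.0, size=(n_vehicles, n_job_types, n_periods)).astype(float)
T = tl.tensor(counts)

# Rank-3 PARAFAC: one factor matrix per mode; plotting the columns of these
# matrices gives the kind of "three-way plots" discussed below.
weights, factors = parafac(T, rank=3)
vehicle_factors, job_factors, time_factors = factors
print(vehicle_factors.shape, job_factors.shape, time_factors.shape)  # (50, 3) (12, 3) (36, 3)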

We generated and explored “three-way plots” of these factors using two different types of time axis: absolute time, which allowed us to model things like seasonality, and vehicle lifetime, which allowed us to model patterns in how maintenance occurred over a vehicle’s lifetime (relative to when it was purchased). One such plot demonstrates unique maintenance patterns for the 2015 Smeal SST Pumper fire truck (shown in the photo in the introduction above); for a detailed analysis of this and other three-way plots from the Detroit vehicles data, check out the paper. Here, we will simply note that PARAFAC helped us discover distinctive maintenance patterns, including a set of vehicles that Detroit maintained at different specialty technical centers, without us specifically looking for these patterns or even knowing they existed.

Differential Sequence Mining

Our tensor decomposition analysis revealed several interesting patterns, both in absolute time and in vehicle lifetime. As a second step of data exploration, we investigated whether there were statistically distinctive sequences of vehicle maintenance. To do this, we employed a differential sequence mining algorithm to compare common sequences between different groups and identify whether the differences in their prevalence are statistically significant. For this analysis, we compared each vehicle make/model to the rest of the fleet.
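The sketch below illustrates the general idea with a toy example rather than the exact algorithm from the paper: for a candidate sequence of consecutive job types, count how many vehicles in a make/model group versus the rest of the fleet contain it, and test the difference with Fisher’s exact test.

from scipy.stats import fisher_exact

def contains_pattern(jobs, pattern):
    # True if the ordered job list contains the pattern as consecutive jobs.
    k = len(pattern)
    return any(tuple(jobs[i:i + k]) == tuple(pattern) for i in range(len(jobs) - k + 1))

def differential_test(group_seqs, rest_seqs, pattern):
    # 2x2 contingency table: pattern present/absent in the group vs. the rest.
    a = sum(contains_pattern(s, pattern) for s in group_seqs)
    c = sum(contains_pattern(s, pattern) for s in rest_seqs)
    table = [[a, len(group_seqs) - a], [c, len(rest_seqs) - c]]
    return fisher_exact(table)

# Toy placeholder sequences: does "preventive maintenance then tire work"
# occur disproportionately often for one make/model?
group = [["pm", "tire", "body"], ["pm", "tire"], ["pm", "windshield"]]
rest = [["body", "pm"], ["windshield"], ["tire", "pm"], ["pm", "body"]]
print(differential_test(group, rest, ("pm", "tire")))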

For nearly every vehicle, we found that the most common sequences were significantly different from the rest of the fleet. This indicated that there were strong, unique patterns in the sequences of these make/models relative to the rest of the fleet. This insight confirmed our finding from the tensor decomposition, demonstrating that there were useful patterns in maintenance by make/model that a sequence-based predictive modeling approach could utilize.

Predictive LSTM Model

Having demonstrated strong patterns in how different vehicle make/models were maintained over time, we then explored predictive modeling techniques that could capture these patterns in a vehicle’s maintenance history and use them to predict its next maintenance job.

We used a common sequential modeling algorithm, the Long Short-Term Memory (LSTM) network. This is a more sophisticated version of a standard neural network which, instead of evaluating just a single observation (i.e., a single job), captures information about the observations it has seen before (i.e., the vehicle’s entire previous history). This allows the model to “remember” a vehicle’s past repairs and use all of the available historical information to predict its next job.
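As a minimal sketch of this kind of model (written in PyTorch for illustration; it is not the paper’s exact architecture), the snippet below treats job types like words in a vocabulary and trains the LSTM to predict each job from the jobs that precede it.

import torch
import torch.nn as nn

class NextJobLSTM(nn.Module):
    def __init__(self, n_job_types, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(n_job_types, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_job_types)

    def forward(self, job_ids):
        # job_ids: (batch, sequence length) integer-coded maintenance history
        hidden, _ = self.lstm(self.embed(job_ids))
        return self.out(hidden)  # logits over the next job at each position

# Toy usage: 20 job types and a batch of 4 placeholder histories of length 10.
model = NextJobLSTM(n_job_types=20)
history = torch.randint(0, 20, (4, 10))
logits = model(history)

# Predict each job from the jobs before it; exp(cross-entropy) is the perplexity.
loss = nn.CrossEntropyLoss()(logits[:, :-1].reshape(-1, 20), history[:, 1:].reshape(-1))
print(float(loss), float(torch.exp(loss)))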

The specific model we used was a simple LSTM network from prior work on predicting words in sentences, chosen for its ability to model complex sequences while avoiding over-fitting (Zaremba et al. 2014). In our paper, we demonstrate that this model can predict maintenance sequences effectively, even when trained on a very small set of only 164 vehicles and evaluated on unseen vehicles of the same make/model (we used Dodge Chargers, the most common vehicle in the dataset we evaluated, because they provided the largest set of available data). While we don’t have a direct comparison for the model’s performance, it achieves a perplexity of 15.4 (perplexity is the exponential of the average negative log-likelihood per predicted job, so lower is better), far outperforming a random baseline drawn from the same distribution as the training sequences (perplexity of 260) and below the perplexities reported for state-of-the-art language models on the Penn Treebank (23.7 for the original model ours is based on) or the Google Billion Words corpus (around 25) (Kuchaiev and Ginsburg 2017).

Conclusion

Working with real-world data that contains complex, multidimensional patterns can be challenging. In this post, we described our approach to analyzing the Detroit vehicles dataset: applying tensor decomposition and differential sequence mining techniques to discover time- and vehicle-dependent patterns in the data, and then building an LSTM model on the sequence data to show that a vehicle’s next job can be predicted accurately from its past maintenance history.

We’re looking forward to continuing to explore the Detroit vehicles data, and have future projects in the works. We’d like to thank the General Services Department and the Operations and Infrastructure Group of the City of Detroit for bringing this project to our attention and making the data available for use.

For more information on the project, check out the full paper here.

Mini-course: Introduction to Python — Sept. 11-14


Asst. Prof. Emanuel Gull, Physics, is offering a mini-course introducing the Python programming language in a four-lecture series. Beginners without any programming experience as well as programmers who usually use other languages (C, C++, Fortran, Java, …) are encouraged to come; no prior knowledge of programming languages is required!

For the first two lectures we will mostly follow the book Learning Python, which is available at our library. An earlier edition (with small differences, equivalent for all practical purposes) is available as an e-book. The last two lectures will introduce some useful Python libraries: numpy, scipy, and matplotlib.

By the end of the mini-course you will know enough about Python to use it for your grad class homework and your research.

Special meeting place: we will meet in 340 West Hall on Monday September 11 at 5 PM.

Please bring a laptop computer along to follow the exercises!

Syllabus (Dates & Location for Fall 2017)

  1. Monday September 11 5:00 – 6:30 PM: Welcome & Getting Started (hello.py). Location: 340 West Hall
  2. Tuesday September 12 5:00 – 6:30 PM: Numbers, Strings, Lists, Dictionaries, Tuples, Functions, Modules, Control flow. Location: 335 West Hall
  3. Wednesday September 13 5:00 – 6:30 PM: Useful Python libraries (part 1): numpy, scipy, matplotlib. Location: 335 West Hall
  4. Thursday September 14 5:00 – 6:30 PM: Useful Python libraries (part 2): 3d plotting in matplotlib and exercises. Location: 335 West Hall

For more information: https://sites.lsa.umich.edu/gull-lab/teaching/physics-514-fall-2017/introduction-to-python/

 

Info sessions on graduate studies in computational and data sciences — Sept. 21 and 25


Learn about graduate programs that will prepare you for success in computationally intensive fields — pizza and pop provided

  • The Ph.D. in Scientific Computing is open to all Ph.D. students who will make extensive use of large-scale computation, computational methods, or algorithms for advanced computer architectures in their studies. It is a joint degree program, with students earning a Ph.D. from their current departments, “… and Scientific Computing” — for example, “Ph.D. in Aerospace Engineering and Scientific Computing.”
  • The Graduate Certificate in Computational Discovery and Engineering trains graduate students in computationally intensive research so they can excel in interdisciplinary HPC-focused research and product development environments. The certificate is open to all students currently pursuing Master’s or Ph.D. degrees at the University of Michigan.
  • The Graduate Certificate in Data Science is focused on developing core proficiencies in data analytics:
    1) Modeling — Understanding of core data science principles, assumptions and applications;
    2) Technology — Knowledge of basic protocols for data management, processing, computation, information extraction, and visualization;
    3) Practice — Hands-on experience with real data, modeling tools, and technology resources.

Times / Locations: