MDST Partners with City of Detroit to Improve Blight Enforcement


Author: Allie Cell, College of Engineering

PROBLEM OVERVIEW

Property blight, which refers to lots and structures that are not properly maintained, is a major problem in Detroit: over 20% of lots in the city are blighted. To help curb this behavior, in 2005 the City of Detroit began issuing blight tickets to the owners of afflicted properties, with fines ranging from $20, for offenses such as leaving out a trash bin too far before its collection date, to $10,000, for offenses such as dumping more than 5 cubic feet of waste. Unfortunately, only 7% of people who are issued tickets and found guilty actually pay, leaving a balance of some $73 million in unpaid blight tickets.

Officials from the City of Detroit's Department of Administrative Hearings and Department of Innovation and Technology who work with blight tickets initially came together with the Michigan Data Science Team this past February to sponsor a competition to forecast blight compliance. To provide a more actionable analysis for Detroit policymakers, we aimed to understand what sorts of people receive blight tickets, to use this knowledge to better grasp why blight tickets have not been effective, and to provide insights for policymakers accordingly.

THE DATA

In order to get an accurate picture of blight compliance in Detroit, we aggregated information from multiple datasets. Most of them came from the Detroit Open Data Portal, an online hub for public data featuring datasets related to public health, education, transportation, and more. Some of the most valuable datasets we used included:

Blight Ticket Data, records of each blight ticket

Parcel Data, records of all properties in Detroit

Crime Data, records of all crimes in Detroit from 2009 through 2016

Demolition Data, records of each completed and scheduled demolition in Detroit

Improve Detroit, records of all issues submitted through an app whose goal is improving the city

PREDICTING BLIGHT TICKET COMPLIANCE

We built a model to predict whether a property owner would pay their blight ticket. Each record in the dataset had one-hot encoded data from the sources listed above. Tree-based methods are easily interpretable and perform well on mixed data, so we considered scikit-learn Random Forests and xgboost Gradient Boosted Trees (XGBClassifier). To choose the best model, we generated learning curves with 5-fold cross-validation for each classifier; xgboost performed well, with a cross-validation score of over 0.9, so we selected this model.
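To make the comparison concrete, here is a minimal sketch of this kind of learning-curve comparison, assuming scikit-learn and xgboost; the synthetic data is only a stand-in for the real aggregated feature matrix, whose construction is not shown here.

    # Sketch: compare classifiers via learning curves with 5-fold CV.
    # Synthetic data stands in for the one-hot encoded blight-ticket features.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import learning_curve
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=5000, n_features=40, random_state=0)

    candidates = {
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
        "xgboost": XGBClassifier(n_estimators=200, random_state=0),
    }

    for name, clf in candidates.items():
        sizes, train_scores, val_scores = learning_curve(
            clf, X, y, cv=5, scoring="roc_auc",
            train_sizes=np.linspace(0.1, 1.0, 5),
        )
        # Mean validation score at each training-set size.
        print(name, np.round(val_scores.mean(axis=1), 3))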


ANALYSIS OF TICKETED PROPERTY OWNERS

To gain a more holistic understanding of the relationships between ticketed property owners and their properties, we analyzed three categories of property owners:

Top Offenders, the small portion of offenders who own many blighted properties and account for the majority of tickets; 20% of violators account for over 70% of unpaid blight fines

Live-In Owners, offenders who were determined to actually live in their blighted property, indicative of a stronger relationship between the owner and the house

Residential Rental Property Owners, offenders who own residential properties but were determined to not live in them, indicative of a more income-driven relationship between the owner and the property

Comparing these three groups, we found some notable differences between the owner categories:

Repeat Offenses: only 11% of live-in owners were issued more than two blight tickets (71% received only one blight ticket); the multiple-offense rate jumps to 20% for residential rental property owners (59% of whom received only one blight ticket).

Property Conditions: only 4.8% of the ticketed properties owned by live-in owners were in poor condition, compared to 7.1% for those owned by residential rental property owners and 8.8% for those owned by top offenders

Compliance: only 6.5% of tickets issued to top offenders were paid on time or less than one month late, significantly less than the 10% and 11% rates for residential rental property owners and live-in owners, respectively

Occupancy Rates: live-in owners saw the highest occupancy rate on properties that were issued blight tickets, 69%, followed by 57% for residential rental property owners and 47% for top offenders. While this ordering makes sense, properties that owners actively live in should by definition show a 100% occupancy rate; the distance from 100% is one testament to the data-quality problems our team faced, both from inconsistency in records and from real estate turnover in Detroit.

Feel free to check out the whole paper ~ here ~

Mini-course: Introduction to Python — Sept. 11-14


Asst. Prof. Emanuel Gull, Physics, is offering a mini-course introducing the Python programming language in a four-lecture series. Beginners without any programming experience as well as programmers who usually use other languages (C, C++, Fortran, Java, …) are encouraged to come; no prior knowledge of programming languages is required!

For the first two lectures we will mostly follow the book Learning Python. This book is available at our library; an earlier edition (with small differences, equivalent for all practical purposes) is available as an e-book. The second week will introduce some useful Python libraries: numpy, scipy, and matplotlib.
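For a quick taste of what those libraries offer, here is a minimal sketch (not taken from the course materials) that plots a sine curve:

    # numpy builds the array of x values; matplotlib draws the plot.
    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 2 * np.pi, 200)  # 200 evenly spaced points
    plt.plot(x, np.sin(x), label="sin(x)")
    plt.legend()
    plt.show()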

At the end of the first two weeks you will know enough about Python to use it for your grad class homework and your research.

Special meeting place: we will meet in 340 West Hall on Monday September 11 at 5 PM.

Please bring a laptop computer along to follow the exercises!

Syllabus (Dates & Location for Fall 2017)

  1. Monday September 11 5:00 – 6:30 PM: Welcome & Getting Started (hello.py). Location: 340 West Hall
  2. Tuesday September 12 5:00 – 6:30 PM: Numbers, Strings, Lists, Dictionaries, Tuples, Functions, Modules, Control flow. Location: 335 West Hall
  3. Wednesday September 13 5:00 – 6:30 PM: Useful Python libraries (part I): numpy, scipy, matplotlib. Location: 335 West Hall
  4. Thursday September 14 5:00 – 6:30 PM: Useful Python libraries (part II): 3D plotting in matplotlib and exercises. Location: 335 West Hall

For more information: https://sites.lsa.umich.edu/gull-lab/teaching/physics-514-fall-2017/introduction-to-python/

 

Info sessions on graduate studies in computational and data sciences — Sept. 21 and 25


Learn about graduate programs that will prepare you for success in computationally intensive fields — pizza and pop provided

  • The Ph.D. in Scientific Computing is open to all Ph.D. students who will make extensive use of large-scale computation, computational methods, or algorithms for advanced computer architectures in their studies. It is a joint degree program, with students earning a Ph.D. from their current departments, “… and Scientific Computing” — for example, “Ph.D. in Aerospace Engineering and Scientific Computing.”
  • The Graduate Certificate in Computational Discovery and Engineering trains graduate students in computationally intensive research so they can excel in interdisciplinary HPC-focused research and product development environments. The certificate is open to all students currently pursuing Master’s or Ph.D. degrees at the University of Michigan.
  • The Graduate Certificate in Data Science is focused on developing core proficiencies in data analytics:
    1) Modeling — Understanding of core data science principles, assumptions and applications;
    2) Technology — Knowledge of basic protocols for data management, processing, computation, information extraction, and visualization;
    3) Practice — Hands-on experience with real data, modeling tools, and technology resources.

Times / Locations:

U-M, SJTU research teams share $1 million for data science projects


Five research teams from the University of Michigan and Shanghai Jiao Tong University in China are sharing $1 million to study data science and its impact on air quality, galaxy clusters, lightweight metals, financial trading and renewable energy.

Since 2009, the two universities have collaborated on a number of research projects that address challenges and opportunities in energy, biomedicine, nanotechnology and data science.

In the latest round of annual grants, the winning projects focus on data science and how it can be applied to chemistry and physics of the universe, as well as finance and economics.

For more, read the University Record article.

For descriptions of the research projects, see the MIDAS/SJTU partnership page.

Call for Proposals: Amazon Research Awards, deadline 9/15/17


The Amazon Research Awards (ARA) program offers awards of up to $80,000 in cash and $20,000 in AWS promotional credits to faculty members at academic institutions in North America and Europe for research in these areas:

  • Computer vision
  • General AI
  • Knowledge management and data quality
  • Machine learning
  • Machine translation
  • Natural language understanding
  • Personalization
  • Robotics
  • Search and information retrieval
  • Security, privacy and abuse prevention
  • Speech

The ARA program funds projects conducted primarily by PhD students or postdocs, under the supervision of the faculty member awarded the funds. To encourage collaboration and the sharing of insights, each funded proposal team is assigned an appropriate Amazon research contact. Amazon invites ARA recipients to speak at Amazon offices worldwide about their work and to meet with Amazon research groups face-to-face, and encourages recipients to publish their research outcomes and commit related code to open-source repositories.

Submissions are to be made online; details, including rules and who may apply, are located here.

Liza Levina, PhD, Chosen IMS Medallion Lecturer in 2019


Professor Liza Levina has been selected to present an Institute of Mathematical Statistics (IMS) Medallion Lecture at the 2019 Joint Statistical Meeting (JSM).

Each year eight Medallion Lecturers are chosen from across all areas of statistics and probability by the IMS Committee on Special Lectures. The Medallion nomination is an honor and an acknowledgment of a significant research contribution to one or more areas of research. Each Medallion Lecturer will receive a Medallion in a brief ceremony preceding the lecture.

SAVE THE DATE: MIDAS Annual Symposium, Oct. 11


Please join us for the 2017 Michigan Institute for Data Science Symposium.

The keynote speaker will be Cathy O’Neil, mathematician and best-selling author of “Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy.”

Other speakers include:

  • Nadya Bliss, Director of the Global Security Initiative, Arizona State University
  • Francesca Dominici, Co-Director of the Data Science Initiative and Professor of Biostatistics, Harvard T.H. Chan School of Public Health
  • Daniela Witten, Associate Professor of Statistics and Biostatistics, University of Washington
  • James Pennebaker, Professor of Psychology, University of Texas

More details, including how to register, will be available soon.

New Data Science Computing Platform Available to U-M Researchers


Advanced Research Computing – Technology Services (ARC-TS) is pleased to announce an expanded data science computing platform, giving all U-M researchers new capabilities to host structured and unstructured databases, and to ingest, store, query and analyze large datasets.

The new platform features a flexible, robust and scalable database environment, and a set of data pipeline tools that can ingest and process large amounts of data from sensors, mobile devices and wearables, and other sources of streaming data. The platform leverages the advanced virtualization capabilities of ARC-TS’s Yottabyte Research Cloud (YBRC) infrastructure, and is supported by U-M’s Data Science Initiative launched in 2015. YBRC was created through a partnership between Yottabyte and ARC-TS announced last fall.

The following functionalities are immediately available:

  • Structured databases: MySQL/MariaDB and PostgreSQL.
  • Unstructured databases: Cassandra, MongoDB, InfluxDB, Grafana, and ElasticSearch.
  • Data ingestion: Redis, Kafka, RabbitMQ.
  • Data processing: Apache Flink, Apache Storm, Node.js and Apache NiFi.

Other types of databases can be created upon request.
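As an illustration of the kind of streaming ingestion the platform supports, here is a minimal Kafka producer sketch, assuming the kafka-python client; the broker address and topic name are hypothetical, not actual YBRC endpoints.

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="kafka.example.umich.edu:9092",  # hypothetical broker
        value_serializer=lambda record: json.dumps(record).encode("utf-8"),
    )

    # Send one reading from a (hypothetical) wearable device to a topic.
    producer.send("sensor-readings", {"device": "wearable-01", "heart_rate": 72})
    producer.flush()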

These tools are offered to all researchers at the University of Michigan free of charge, provided that certain usage restrictions are not exceeded. Large-scale users who outgrow the no-cost allotment may purchase additional YBRC resources. All interested parties should contact hpc-support@umich.edu.

At this time, the YBRC platform only accepts unrestricted data. The platform is expected to accommodate restricted data within the next few months.

ARC-TS also operates a separate data science computing cluster available for researchers using the latest Hadoop components. This cluster also will be expanded in the near future.

MDST partners with Adamy Valuation for market analysis


Authors: Michael Kovalcik, College of Engineering; Xinyu Tan, College of Engineering; Derek Chen, Ross School of Business.

Problem Overview

The Michigan Data Science Team partnered with Adamy Valuation, a Grand Rapids-based valuation firm, to bring data-driven insights to business equity valuation. Business valuation firms determine the market value of business interests in support of a variety of transactions, typically involving ownership interests in private businesses. Firms such as Adamy Valuation deliver this assessment along with a detailed report explaining why they believe the valuation to be fair.

Valuations are performed by expert financial analysts, who use their knowledge of the factors that influence value to manually assess the value of the equity. Shannon Pratt's Valuing a Business suggests that two key factors in particular influence value: risk and size. Risk is a measure of uncertainty about the company's future and can be assessed by looking at total debt and cash flows. Size refers to a company's economic power; larger companies will spend and make more than smaller ones. While these factors are quite informative, the degree to which they influence value varies considerably from industry to industry and even from company to company. Therefore, a valuation firm will often adjust its models manually to account for additional features, using knowledge gained from years of experience and industry expertise.

Our goals were to conduct a data-driven analysis of the valuation process and to build a predictive model that could learn to make value adjustments from historical data. A critical requirement of our approach was that the resulting model must be interpretable. An algorithm that is extremely accurate but offers no insight into how a prediction was made, or what features it was based on, is of no use to Adamy Valuation because, at the end of the day, they must be able to validate the reasoning behind their assessment.

The Data Pipeline

While our goal is to value private companies, data related to these companies is difficult to come by.  Business valuation analysts address this issue by using market data from public companies as guideline data points to inform the valuation of a private subject company.  To this end, we acquired a dataset of 400 publicly-traded companies along with 20 financial metrics that are commonly used during valuation. We cleaned this dataset to only contain features that are relevant to private companies so that the model learned on public companies could later be applied to value private companies.
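A minimal sketch of that cleaning step might look like the following, assuming pandas; the file name and column names are hypothetical, not the actual schema of the dataset.

    import pandas as pd

    df = pd.read_csv("public_company_metrics.csv")  # hypothetical file

    # Drop metrics that exist only for publicly traded companies, so a model
    # fit on public firms can later be applied to private ones.
    public_only = ["share_price", "shares_outstanding", "beta"]  # hypothetical
    df = df.drop(columns=public_only)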

We separated financial metrics into four categories: Size, Profitability, Growth, and Risk. Our goal was to determine which of the four categories, or more specifically, which features within them, contribute the most to:

the ratio TEV / EBITDA,

where TEV is the Total Enterprise Value, a measure of a company's market value that adjusts for things like debt and cash on hand, and EBITDA stands for earnings before interest, tax, depreciation, and amortization. EBITDA allows analysts to focus on operating performance by minimizing the impact of non-operating factors such as the tax rates a company must pay and the degree to which its goods depreciate. In other words, EBITDA gives a clearer value for head-to-head comparisons of company performance. Valuation firms typically examine the ratio of TEV to EBITDA instead of examining either directly, because the ratio standardizes for the size of the company, making it easier to make apples-to-apples comparisons with companies that may be much larger or smaller but are otherwise similar.
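As a toy illustration (all numbers invented), TEV can be computed from market capitalization, debt, and cash, and the multiple is then the ratio:

    # Toy example of the TEV/EBITDA multiple; figures are made up.
    market_cap = 800.0  # equity value, $M
    total_debt = 250.0  # $M
    cash = 50.0         # $M

    tev = market_cap + total_debt - cash  # Total Enterprise Value: 1000.0
    ebitda = 125.0                        # $M
    print(tev / ebitda)                   # TEV/EBITDA multiple: 8.0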

To study how feature importance varied across industries, we categorized each public company into one of three separate sectors:

  • Consumer Discretionary refers to companies that provide goods and services that are considered nonessential to the consumer. For example, Bed Bath and Beyond, Ford Motor Company, and Panera Bread are all part of this category.
  • Consumer Staples provide essential products such as food, beverages, and household items. Companies like Campbell’s Soup, Coca Cola, and Kellogg are considered Consumer Staples.
  • Industrial Spending is a diverse sector containing companies related to the manufacture and distribution of goods for industrial customers. In this dataset we see companies like Delta Airlines, FedEx, and Lockheed Martin.

Modeling

Our goal is not just to accurately estimate value, but also to identify key relationships between a company's observable metrics and its ratio of TEV to EBITDA. We study 17 financial metrics, many of which have complex relationships with this ratio. To identify these relationships, we model the problem as a regression task. We use two simple but widely used frameworks, linear models and tree-based models, because both offer insight into how the predictions are actually made.

After fitting our models to the data, we identified the most predictive features of company value across industries and compared them to profit margin and size, the metrics most commonly used in Valuing a Business. For our linear models, we used the coefficients of the regression equation to determine which features were most important. For our random forest model, we used the feature-importance metric, which ranks features according to the information gained during the fitting process.
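A minimal sketch of this interpretability step, assuming scikit-learn, is shown below; the stand-in data and column names are hypothetical, not the actual 17 metrics.

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression

    # Stand-in for 400 companies, 17 financial metrics, TEV/EBITDA target.
    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.normal(size=(400, 17)),
                     columns=[f"metric_{i}" for i in range(17)])
    y = rng.normal(loc=8.0, scale=2.0, size=400)

    linear = LinearRegression().fit(X, y)
    forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

    # Linear model: coefficient magnitudes as a rough importance signal
    # (comparable only if the features are on similar scales).
    coef_rank = pd.Series(linear.coef_, index=X.columns).abs().sort_values(ascending=False)

    # Random forest: impurity-based feature importances.
    rf_rank = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)

    print(coef_rank.head(3), rf_rank.head(3), sep="\n")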

Results

We compared the accuracy of our models against the market approach (also known as the comparable approach), the method used by valuation firms. Given the size of the dataset and the specificity of the market approach, we are not surprised that it outperforms our models. Rather, we are showing that our models achieve a reasonable enough degree of accuracy to trust the interpretation of their features.

We also examined the top 3 features per industry, according to information gain, as learned by our random forest model. The features that turn up in our model are indicators of profitability and size, which agrees with the existing knowledge in the literature. It is interesting to note that return on assets shows up in each sector, which intuitively means the market values companies that get high returns regardless of sector.

Explanation of Key Predictors

Recall that our goal was to predict TEV/EBITDA, a measure of a company's total value after standardizing for size, tax structure, and a number of other factors. Five distinct predictors stood out in our analysis:

Return on Assets is a measure of a company’s efficiency in generating profit.

Total Revenue is also known as total sales and is a measurement of how much a company receives from the sale of goods and services.

EBITDA 1 year growth: EBITDA is a measure of profitability and growing EBITDA means growing profit and increasing value of a company.

Capital Expenditure (Capex) is the amount of money that a company invests in property and equipment. Capex is often linked to the expansion or contraction of a business and is therefore a measure of growth. Looking at Capex as a percentage of revenue provides a normalized measurement for comparison.

EBITDA Margin serves as an indicator of a company’s operating profitability. Higher EBITDA margin means the company is getting more EBITDA for every dollar of revenue.

MSSISS

MSSISS, the Michigan Student Symposium for Interdisciplinary Statistical Sciences, is an annual conference hosted by the University of Michigan. MSSISS brings together statistical work from a number of different fields, including computer science, electrical engineering, statistics, biostatistics, and industrial and operations engineering. Our poster was particularly interesting as it was the only one with a financial application. The novelty of our project drew in a number of viewers and impressed the judges. A major component of our poster score was determined by our ability to communicate our results to people outside the field. We received a certificate of merit for our work and our ability to communicate it to the other attendees at the conference.


HV Jagadish contributes to Big Data magazine article on diversity


HV Jagadish, a core MIDAS faculty member and Professor of Electrical Engineering and Computer Science, contributed as a co-author to an article on diversity in big data that appears in a special edition of Big Data magazine. Big Data is published by phys.org.

Jagadish co-authored the piece, titled "Diversity in Big Data, a Review," with researchers from the University of Ioannina in Greece and Drexel University. The article emphasizes the risks big data may pose to society and individuals if it fails to account for diversity and potential discrimination, and discusses connections between diversity and fairness in big data systems research.