Each year eight Medallion Lecturers are chosen from across all areas of statistics and probability by the IMS Committee on Special Lectures. The Medallion nomination is an honor and an acknowledgment of a significant research contribution to one or more areas of research. Each Medallion Lecturer will receive a Medallion in a brief ceremony preceding the lecture.
Please join us for the 2017 Michigan Institute for Data Science Symposium.
The keynote speaker will be Cathy O’Neil, mathematician and best-selling author of “Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy.”
Other speakers include:
- Nadya Bliss, Director of the Global Security Initiative, Arizona State University
- Francesca Dominici, Co-Director of the Data Science Initiative and Professor of Biostatistics, Harvard T.H. Chan School of Public Health
- Daniela Witten, Associate Professor of Statistics and Biostatistics, University of Washington
- James Pennebaker, Professor of Psychology, University of Texas
More details, including how to register, will be available soon.
Advanced Research Computing – Technology Services (ARC-TS) is pleased to announce an expanded data science computing platform, giving all U-M researchers new capabilities to host structured and unstructured databases, and to ingest, store, query and analyze large datasets.
The new platform features a flexible, robust and scalable database environment, and a set of data pipeline tools that can ingest and process large amounts of data from sensors, mobile devices and wearables, and other sources of streaming data. The platform leverages the advanced virtualization capabilities of ARC-TS’s Yottabyte Research Cloud (YBRC) infrastructure, and is supported by U-M’s Data Science Initiative launched in 2015. YBRC was created through a partnership between Yottabyte and ARC-TS announced last fall.
The following functionalities are immediately available:
- Structured databases: MySQL/MariaDB, and PostgreSQL.
- Unstructured databases: Cassandra, MongoDB, InfluxDB, Grafana, and ElasticSearch.
- Data ingestion: Redis, Kafka, RabbitMQ.
- Data processing: Apache Flink, Apache Storm, Node.js and Apache NiFi.
Other types of databases can be created upon request.
These tools are offered to all researchers at the University of Michigan free of charge, provided that certain usage restrictions are not exceeded. Large-scale users who outgrow the no-cost allotment may purchase additional YBRC resources. All interested parties should contact email@example.com.
At this time, the YBRC platform only accepts unrestricted data. The platform is expected to accommodate restricted data within the next few months.
ARC-TS also operates a separate data science computing cluster available for researchers using the latest Hadoop components. This cluster also will be expanded in the near future.
Authors: Michael Kovalcik, College of Engineering; Xinyu Tan, College of Engineering; Derek Chen, Ross School of Business.
The Michigan Data Science Team partnered with Adamy Valuation, a Grand Rapids-based valuation firm, to bring data-driven insights to business equity valuation. Business valuation firms determine the market value of business interests in support of a variety of different types of transactions typically involving ownership interests in private businesses. Valuation firms, such as Adamy Valuation, deliver this assessment, which includes a detailed report explaining the reasons why they believe it to be fair.
Valuations are performed by expert financial analysts, who use their knowledge about the factors that influence value to manually assess the value of the equity. Shannon Pratt’s Valuing a Business suggests that there are two key factors in particular that influence value: risk and size. Risk is a measure of uncertainty relating to the company’s future and can be assessed by looking at total debt and cash flows. Size refers to a company’s economic power. Larger companies will spend and make more than smaller ones. While these factors are quite informative, the degree to which they influence value varies a lot from industry to industry and even from company to company. Therefore, a valuation firm will often adjust their models manually to account for additional features, using knowledge gained from years of experience and industry expertise.
Our goals were to conduct a data-driven analysis of the valuation process and to build a predictive model that could learn to make value adjustments from historical data. A critical requirement of our approach was that the resulting model be interpretable. An algorithm that is extremely accurate but offers no insight into how a prediction was made, or which features it was based on, is of no use to Adamy Valuation because, at the end of the day, they must be able to validate the reasoning behind their assessment.
While our goal is to value private companies, data related to these companies is difficult to come by. Business valuation analysts address this issue by using market data from public companies as guideline data points to inform the valuation of a private subject company. To this end, we acquired a dataset of 400 publicly-traded companies along with 20 financial metrics that are commonly used during valuation. We cleaned this dataset to only contain features that are relevant to private companies so that the model learned on public companies could later be applied to value private companies.
We separate financial metrics into four categories: Size, Profitability, Growth, and Risk, as indicated by the colors in Fig. 1. Our goal was to determine which of the four categories, or more specifically, which features in these categories, contribute the most to the ratio:
TEV / EBITDA
where TEV represents the Total Enterprise Value, a measure of a company’s market value that adjusts for things like debt and cash on hand, and EBITDA stands for earnings before interest, taxes, depreciation, and amortization. EBITDA allows analysts to focus on operating performance by minimizing the impact of non-operating factors, such as the tax rates a company must pay and the degree to which its goods depreciate. In other words, EBITDA gives a clearer basis for head-to-head comparisons of company performance. Valuation firms typically examine the ratio of TEV to EBITDA instead of examining TEV or EBITDA directly, because the ratio standardizes for the size of the company, making it easier to make apples-to-apples comparisons with companies that may be much larger or smaller but are otherwise similar.
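To make the ratio concrete, here is a minimal sketch in Python. It assumes a simplified definition of TEV (equity market value plus total debt minus cash); the function names and the two guideline companies are hypothetical and are not taken from our dataset.

```python
def total_enterprise_value(market_cap, total_debt, cash):
    # Simplified assumption: TEV = equity market value + total debt - cash.
    # Fuller definitions may include further adjustments.
    return market_cap + total_debt - cash


def tev_to_ebitda(market_cap, total_debt, cash, ebitda):
    # The valuation multiple discussed above.
    return total_enterprise_value(market_cap, total_debt, cash) / ebitda


# Hypothetical guideline companies (figures in millions of dollars).
guideline_companies = {
    "large_co": {"market_cap": 9000.0, "total_debt": 1500.0,
                 "cash": 500.0, "ebitda": 1000.0},
    "small_co": {"market_cap": 65.0, "total_debt": 20.0,
                 "cash": 5.0, "ebitda": 10.0},
}

multiples = {
    name: tev_to_ebitda(**financials)
    for name, financials in guideline_companies.items()
}

for name, multiple in multiples.items():
    print(f"{name}: TEV/EBITDA = {multiple:.1f}")
```

Although the two hypothetical companies differ in size by roughly a factor of 100, their multiples land on the same scale, which is what makes the ratio convenient for comparison.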
To study how feature importance varied across industries, we categorized each public company into one of three separate sectors:
- Consumer Discretionary refers to companies that provide goods and services that are considered nonessential to the consumer. For example, Bed Bath and Beyond, Ford Motor Company, and Panera Bread are all part of this category.
- Consumer Staples provide essential products such as food, beverages, and household items. Companies like Campbell’s Soup, Coca Cola, and Kellogg are considered Consumer Staples.
- Industrial Spending is a diverse category that contains companies related to the manufacture and distribution of goods for industrial customers. In this dataset we see companies like Delta Air Lines, FedEx, and Lockheed Martin.
Our goal is not just to accurately estimate value, but also to identify key relationships between a company’s observable metrics and its ratio of TEV to EBITDA. We study 17 financial metrics, many of which have complex relationships with the ratio of TEV to EBITDA. To identify these relationships, we model the problem as a regression task. We use two simple but widely used frameworks, linear models and tree-based models, because both offer insight into how the predictions are actually made.
After fitting our models to the data, we identified the most predictive features of company value across industries, and compared this to profit margin and size, the metrics most commonly used in Valuing a Business. For our linear models we used the coefficients in our regression equation to determine which features were most important. For our random forest model we used the feature importance metric which ranks features according to the information gained during the fitting process.
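As an illustration of reading importance off linear coefficients, the sketch below fits an ordinary least squares model on standardized features and ranks them by absolute coefficient. It is a sketch under stated assumptions, not our actual pipeline: the helper names are hypothetical, the numbers are toy values rather than real financial data, and the target is constructed as an exact linear combination so the result is easy to check.

```python
from statistics import mean, pstdev


def standardize(column):
    # Rescale to zero mean and unit (population) standard deviation.
    m, s = mean(column), pstdev(column)
    return [(v - m) / s for v in column]


def solve(a, b):
    # Solve the linear system a x = b by Gaussian elimination
    # with partial pivoting.
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rank_features(features, target):
    # Least squares on standardized features and a centered target,
    # so no intercept term is needed. Returns (name, coefficient)
    # pairs sorted by absolute coefficient, largest first.
    names = list(features)
    cols = [standardize(features[k]) for k in names]
    y_mean = mean(target)
    y = [v - y_mean for v in target]
    k = len(names)
    xtx = [[sum(p * q for p, q in zip(cols[i], cols[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * q for p, q in zip(cols[i], y)) for i in range(k)]
    coefs = solve(xtx, xty)
    return sorted(zip(names, coefs), key=lambda pair: abs(pair[1]),
                  reverse=True)


# Toy data (hypothetical). The target depends more strongly on the
# first feature by construction.
features = {
    "return_on_assets": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "total_revenue": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
}
target = [3.0 * a + 0.5 * b
          for a, b in zip(features["return_on_assets"],
                          features["total_revenue"])]

ranking = rank_features(features, target)
for name, coef in ranking:
    print(f"{name}: {coef:+.3f}")
```

Standardizing first matters: without it, a coefficient's size would reflect the feature's units rather than its influence.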
The figure to the right depicts the accuracy of our models versus the market approach (also known as the comparable approach), the method used by valuation firms. Given the size of the dataset and the specificity of the market approach, we are not surprised that it outperforms our models. Rather, we are showing here that our models are accurate enough for us to trust their interpretation of the features.
Also on the right we show the top 3 features per industry, according to information gain, as learned by our random forest model. The larger the bar, the more informative that variable was for predictions. The features we see turning up in our model are indicators of profitability and size, which agrees with the existing knowledge in the literature. It is interesting to note that return on assets shows up in each sector, which intuitively means the market values companies that get high returns regardless of the sector.
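For intuition about what information gain measures in a regression setting, the sketch below scores each feature by the largest variance reduction a single threshold split can achieve. This is a one-split simplification, not our random forest, which aggregates such reductions over many splits and many trees. The function name and the toy values are hypothetical.

```python
from statistics import pvariance


def best_split_gain(values, target):
    # Largest reduction in target variance achievable by splitting
    # the samples at one threshold on this feature.
    n = len(target)
    parent = pvariance(target)
    pairs = sorted(zip(values, target))
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates equal feature values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        children = (len(left) * pvariance(left)
                    + len(right) * pvariance(right)) / n
        best = max(best, parent - children)
    return best


# Toy data (hypothetical): six companies and their TEV/EBITDA multiples.
multiples_observed = [8.0, 8.0, 8.0, 12.0, 12.0, 12.0]
candidate_features = {
    # Cleanly separates the low and high multiples.
    "ebitda_margin": [0.10, 0.12, 0.15, 0.30, 0.32, 0.35],
    # Only partially separates them.
    "capex_pct_revenue": [0.05, 0.20, 0.06, 0.21, 0.07, 0.22],
}

gains = {name: best_split_gain(values, multiples_observed)
         for name, values in candidate_features.items()}
gain_ranking = sorted(gains.items(), key=lambda pair: pair[1], reverse=True)

for name, gain in gain_ranking:
    print(f"{name}: variance reduction = {gain:.2f}")
```

A feature that cleanly separates low multiples from high ones removes all of the variance in one split, so it ranks first.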
Explanation of Key Predictors
Remember that our goal was to predict TEV/EBITDA, which is a measure of a company’s total value after standardizing for things such as size, tax structure, and a number of other factors. Five distinct predictors stood out in our analysis.
Return on Assets is a measure of a company’s efficiency in generating profit.
Total Revenue is also known as total sales and is a measurement of how much a company receives from the sale of goods and services.
EBITDA 1 year growth: EBITDA is a measure of profitability and growing EBITDA means growing profit and increasing value of a company.
Capital Expenditure (Capex) is the amount of money that a company invests in property and equipment. Capex is often linked to the expansion or contraction of a business and is therefore a measure of growth. Looking at Capex as a percentage of revenue provides a normalized measurement for comparison.
EBITDA Margin serves as an indicator of a company’s operating profitability. Higher EBITDA margin means the company is getting more EBITDA for every dollar of revenue.
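The ratio-based predictors above can be sketched as simple formulas. The definitions below are common textbook forms and are assumptions here (for example, return on assets as net income over total assets); the exact definitions in our dataset may differ. The function names and the example company are hypothetical.

```python
def return_on_assets(net_income, total_assets):
    # Assumed definition: profit generated per dollar of assets.
    return net_income / total_assets


def ebitda_one_year_growth(ebitda_current, ebitda_prior):
    # Fractional change in EBITDA relative to the prior year.
    return (ebitda_current - ebitda_prior) / ebitda_prior


def capex_pct_revenue(capex, total_revenue):
    # Capital expenditure normalized by total revenue.
    return capex / total_revenue


def ebitda_margin(ebitda, total_revenue):
    # EBITDA earned per dollar of revenue.
    return ebitda / total_revenue


# Hypothetical company (figures in millions of dollars).
company = {
    "net_income": 50.0,
    "total_assets": 500.0,
    "total_revenue": 1000.0,
    "ebitda": 200.0,
    "ebitda_prior": 160.0,
    "capex": 80.0,
}

predictors = {
    "return_on_assets": return_on_assets(
        company["net_income"], company["total_assets"]),
    "total_revenue": company["total_revenue"],
    "ebitda_1yr_growth": ebitda_one_year_growth(
        company["ebitda"], company["ebitda_prior"]),
    "capex_pct_revenue": capex_pct_revenue(
        company["capex"], company["total_revenue"]),
    "ebitda_margin": ebitda_margin(
        company["ebitda"], company["total_revenue"]),
}

for name, value in predictors.items():
    print(f"{name}: {value}")
```

Apart from total revenue, each predictor is a ratio, so it can be compared across companies of different sizes.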
MSSISS, the Michigan Student Symposium for Interdisciplinary Statistical Sciences, is an annual conference hosted by the University of Michigan. MSSISS brings together statistical work from a number of different fields, including computer science, electrical engineering, statistics, biostatistics, and industrial operations. Our poster stood out as the only one with a financial application. The novelty of our project drew in a number of viewers and impressed the judges. A major component of our poster score was determined by our ability to communicate our results to people outside the field. We received a certificate of merit for our work and for our ability to communicate it to the other attendees at the conference.
HV Jagadish, a core MIDAS faculty member and Professor of Electrical Engineering and Computer Science, contributed as a co-author on an article on diversity in big data that appears in a special edition of Big Data magazine. Big Data is published by phys.org.
Jagadish co-authored the piece, titled “Diversity in Big Data, a Review,” with researchers from the University of Ioannina in Greece and Drexel University. The article emphasizes the risks big data may pose to society and individuals if it fails to account for diversity and potential discrimination, and discusses connections between diversity and fairness in big data systems research.
The Big Data in Transportation and Mobility symposium held June 22-23, 2017, in Ann Arbor, MI brought together more than 150 data science practitioners from academia, industry and government to explore emerging issues in this expanding field.
Sponsored by the NSF-supported Midwest Big Data Hub (MBDH) and the Michigan Institute for Data Science (MIDAS), the symposium featured lightning talks from transportation research programs around the Midwest; tutorials and breakout sessions on specific issues and methods; a poster session; and a keynote address from two representatives of the Smart Columbus project: Chris Stewart, Ohio State University Associate Professor of Computer Science and Engineering, and Shoreh Elhami, GIS Manager for the city of Columbus.
Speakers and attendees came from a number of organizations across the Midwest, including the University of Michigan, University of Illinois, University of Nebraska, University of North Dakota, North Dakota State University, Ohio State University, Purdue University, Denso International America, Fiat Chrysler, Ford Motor Company, General Motors, IAV Automotive Engineering and Yottabyte.
“This was an extremely valuable opportunity to share information and ideas,” said Carol Flannagan, one of the organizers of the symposium and a researcher at MIDAS and the U-M Transportation Research Institute. “Cross-discipline and cross-institutional collaboration is crucial to the success of Big Data applications, and we took a significant step forward in that vein during this symposium.”
Topics addressed in talks, breakouts, and tutorials included:
- New Analytic Tools for Designing and Managing Transportation Systems
- New Mobility Options for Small and Mid-sized Cities in the Midwest
- Automated and Connected Vehicles
- Transforming Transportation Operations using High Performance Computing
- On-Demand Transit
- Using Big Data for Monitoring Bridges
At the closing session, participants outlined some areas that could be fruitful to focus on going forward, including increasing data-science literacy in the general public; diversity and workforce development in data science; public data-sharing platforms and partners; and privacy issues.
The University of Michigan is beginning the process of building our next generation HPC platform, “Big House.” Flux, the shared HPC cluster, has reached the end of its useful life. Flux has served us well for more than five years, but as we move forward with replacement, we want to make sure we’re meeting the needs of the research community.
ARC-TS will be holding a series of town halls to take input from faculty and researchers on the next HPC platform to be built by the University. These town halls are open to anyone and will be held at:
College of Engineering, Johnson Room, Tuesday, June 20th, 9:00a – 10:00a
NCRC Bldg 300, Room 376, Wednesday, June 21st, 11:00a – 12:00p
LSA #2001, Tuesday, June 27th, 10:00a – 11:00a
3114 Med Sci I, Wednesday, June 28th, 2:00p – 3:00p
Your input will help to ensure that U-M is on course for providing HPC, so we hope you will make time to attend one of these sessions. If you cannot attend, please email firstname.lastname@example.org with any input you want to share.
Advanced Research Computing – Technology Services (ARC-TS) has an exciting opportunity for a Research Cloud Administrator.
This position will be part of a team working on a novel platform for research computing in the university for data science and high performance computing. The primary responsibilities for this position will be to develop and create a resource sharing environment to enable execution of Data Science and HPC workflows using containers for University of Michigan researchers.
For more details and to apply, visit: http://careers.umich.edu/job_detail/142372/research_cloud_administrator_intermediate
The Institute for Healthcare Policy and Innovation (IHPI) is partnering with Advanced Research Computing (ARC) to bring two commercial claims datasets to campus researchers.
The OptumInsight and Truven Marketscan datasets contain nearly complete insurance claims and other health data on tens of millions of people representing the US private insurance population. Within each dataset, records can be linked longitudinally for over 5 years.
To begin working with the data, researchers should submit a brief analysis plan for review by IHPI staff, who will create extracts or grant access to primary data as appropriate.
CSCAR consultants are available to provide guidance on computational and analytic methods for a variety of research aims, including use of Flux and other UM computing infrastructure for working with these large and complex repositories.
The data acquisition and availability was funded by IHPI and the U-M Data Science Initiative.
The Michigan Institute for Data Science (MIDAS) is convening a research working group on mobile sensor analytics. Mobile sensors are taking on an increasing presence in our lives. Wearable devices allow for physiological and cognitive monitoring, and behavior modeling for health maintenance, exercise, sports, and entertainment. Sensors in vehicles measure vehicle kinematics, record driver behavior, and increase perimeter awareness. Mobile sensors are becoming essential in areas such as environmental monitoring and epidemiological tracking.
There are significant data science opportunities for theory and application in mobile sensor analytics, including real-time data collection, streaming data analysis, active on-line learning, mobile sensor networks, and energy efficient mobile computing.
Our working group welcomes researchers with interest in mobile sensor analytics in any scientific domain, including but not limited to health, transportation, smart cities, ecology and the environment.
Where and When:
Noon to 2 pm, April 13, 2017
School of Public Health I, Room 7625
Brief presentations about challenges and opportunities in mobile sensor analytics (theory and application);
A brief presentation of a list of funding opportunities;
Discussion of research ideas and collaboration in the context of grant application and industry partnership.
Future Plans: Based on the interest of participants, MIDAS will alert researchers to relevant funding opportunities, hold follow-up meetings for continued discussion and team formation as ideas crystallize for grant applications, and work with the UM Business Engagement Center to bring in industry partnerships.