ASA Conference: Women in Statistics and Data Science, La Jolla, California


The American Statistical Association invites you to join us at the 2017 Women in Statistics and Data Science Conference in La Jolla, California—the only conference in the field tailored specifically for women!

Join us to “share WISDOM (Women in Statistics, Data science, and -OMics).”

WSDS will gather professionals and students working in statistics and data science from academia, industry, and government. Find unique opportunities to grow your influence, your community, and your knowledge.

Whether you are a student, early-career professional, or an experienced statistician or data scientist, this conference will deliver new knowledge and connections in an intimate and comfortable setting.

Learn More!

2017 MIDAS Symposium


Please join us for the 2017 Michigan Institute for Data Science Symposium.

The keynote speaker will be Cathy O’Neil, mathematician and best-selling author of “Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy.”

Other speakers include:

  • Nadya Bliss, Director of the Global Security Initiative, Arizona State University
  • Francesca Dominici, Co-Director of the Data Science Initiative and Professor of Biostatistics, Harvard T.H. Chan School of Public Health
  • Daniela Witten, Associate Professor of Statistics and Biostatistics, University of Washington
  • James Pennebaker, Professor of Psychology, University of Texas

More details are available at: http://midas.umich.edu/2017-symposium/

U-M, SJTU research teams share $1 million for data science projects


Five research teams from the University of Michigan and Shanghai Jiao Tong University in China are sharing $1 million to study data science and its impact on air quality, galaxy clusters, lightweight metals, financial trading and renewable energy.

Since 2009, the two universities have collaborated on a number of research projects that address challenges and opportunities in energy, biomedicine, nanotechnology and data science.

In the latest round of annual grants, the winning projects focus on how data science can be applied to chemistry and the physics of the universe, as well as to finance and economics.

For more, read the University Record article.

For descriptions of the research projects, see the MIDAS/SJTU partnership page.

New Data Science Computing Platform Available to U-M Researchers


Advanced Research Computing – Technology Services (ARC-TS) is pleased to announce an expanded data science computing platform, giving all U-M researchers new capabilities to host structured and unstructured databases, and to ingest, store, query and analyze large datasets.

The new platform features a flexible, robust and scalable database environment, and a set of data pipeline tools that can ingest and process large amounts of data from sensors, mobile devices and wearables, and other sources of streaming data. The platform leverages the advanced virtualization capabilities of ARC-TS’s Yottabyte Research Cloud (YBRC) infrastructure, and is supported by U-M’s Data Science Initiative launched in 2015. YBRC was created through a partnership between Yottabyte and ARC-TS announced last fall.

The following functionalities are immediately available:

  • Structured databases: MySQL/MariaDB and PostgreSQL.
  • Unstructured databases: Cassandra, MongoDB, InfluxDB, Grafana, and ElasticSearch.
  • Data ingestion: Redis, Kafka, RabbitMQ.
  • Data processing: Apache Flink, Apache Storm, Node.js and Apache NiFi.

Other types of databases can be created upon request.
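
For a flavor of how these services can be used, here is a minimal sketch of streaming sensor readings into a Kafka topic from Python. It assumes the kafka-python package; the broker hostname and topic name are hypothetical placeholders, not actual YBRC endpoints:

    # Minimal sketch: publish JSON-encoded sensor readings to a Kafka topic.
    # Assumes the kafka-python package; hostname and topic are hypothetical.
    import json
    import time

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="kafka.example.umich.edu:9092",  # placeholder broker
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Emit one reading per second for ten seconds.
    for i in range(10):
        reading = {"sensor_id": "demo-01", "value": 20.0 + i, "ts": time.time()}
        producer.send("sensor-readings", reading)
        time.sleep(1)

    producer.flush()  # ensure all messages are delivered before exiting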

These tools are offered to all researchers at the University of Michigan free of charge, provided that certain usage limits are not exceeded. Large-scale users who outgrow the no-cost allotment may purchase additional YBRC resources. All interested parties should contact hpc-support@umich.edu.

At this time, the YBRC platform only accepts unrestricted data. The platform is expected to accommodate restricted data within the next few months.

ARC-TS also operates a separate data science computing cluster for researchers using the latest Hadoop components. This cluster will also be expanded in the near future.

MDST partners with Adamy Valuation for market analysis


Authors: Michael Kovalcik, College of Engineering; Xinyu Tan, College of Engineering; Derek Chen, Ross School of Business.

Problem Overview

The Michigan Data Science Team partnered with Adamy Valuation, a Grand Rapids-based valuation firm, to bring data-driven insights to business equity valuation. Business valuation firms determine the market value of business interests, typically ownership stakes in private companies, in support of a variety of transactions. Firms such as Adamy Valuation deliver this assessment along with a detailed report explaining why they believe it to be fair.

Valuations are performed by expert financial analysts, who use their knowledge of the factors that influence value to assess the value of the equity manually. Shannon Pratt’s Valuing a Business suggests that two key factors in particular influence value: risk and size. Risk is a measure of uncertainty about the company’s future and can be assessed by looking at total debt and cash flows. Size refers to a company’s economic power; larger companies spend and earn more than smaller ones. While these factors are quite informative, the degree to which they influence value varies considerably from industry to industry and even from company to company. Therefore, a valuation firm will often adjust its models manually to account for additional features, using knowledge gained from years of experience and industry expertise.

Our goals were to conduct a data-driven analysis of the valuation process and to build a predictive model that could learn to make value adjustments from historical data. A critical requirement of our approach was that the resulting model be interpretable. An algorithm that is extremely accurate but offers no insight into how a prediction was made, or which features it was based on, is of no use to Adamy Valuation because, at the end of the day, the firm must be able to validate the reasoning behind its assessment.

The Data Pipeline

While our goal is to value private companies, data on such companies is difficult to come by. Business valuation analysts address this issue by using market data from public companies as guideline data points to inform the valuation of a private subject company. To this end, we acquired a dataset of 400 publicly traded companies along with 20 financial metrics commonly used during valuation. We cleaned this dataset to retain only features that are also meaningful for private companies, so that a model trained on public companies could later be applied to value private ones.

We separate financial metrics into four categories: Size, Profitability, Growth, and Risk, as indicated by the colors in Fig. 1. Our goal was to determine which of the four categories, or more specifically, which features in these categories, contribute the most to:

TEV / EBITDA

where TEV is Total Enterprise Value, a measure of a company’s market value that adjusts for things like debt and cash on hand, and EBITDA stands for earnings before interest, taxes, depreciation, and amortization. EBITDA lets analysts focus on operating performance by minimizing the impact of non-operating factors, such as the tax rates a company must pay and the degree to which its goods depreciate. In other words, EBITDA gives a clearer basis for head-to-head comparisons of company performance. Valuation firms typically examine the ratio of TEV to EBITDA rather than either quantity directly, because the ratio standardizes for the size of the company, making it easier to draw apples-to-apples comparisons with companies that may be much larger or smaller but are otherwise similar.
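
For concreteness, here is a small illustrative computation of the multiple, using made-up figures rather than data from our study:

    # Illustrative only: compute the TEV/EBITDA multiple for a made-up company.
    def tev(market_cap, total_debt, cash):
        # Total Enterprise Value: equity value plus debt, net of cash on hand.
        return market_cap + total_debt - cash

    # Hypothetical figures, in millions of dollars.
    market_cap, total_debt, cash, ebitda = 500.0, 120.0, 40.0, 75.0

    multiple = tev(market_cap, total_debt, cash) / ebitda
    print(f"TEV/EBITDA = {multiple:.2f}")  # (500 + 120 - 40) / 75 = 7.73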

To study how feature importance varied across industries, we categorized each public company into one of three separate sectors:

  • Consumer Discretionary companies provide goods and services considered nonessential to the consumer. For example, Bed Bath & Beyond, Ford Motor Company, and Panera Bread are all part of this category.
  • Consumer Staples companies provide essential products such as food, beverages, and household items. Companies like Campbell’s Soup, Coca-Cola, and Kellogg are considered Consumer Staples.
  • Industrials is a diverse sector containing companies related to the manufacture and distribution of goods for industrial customers. In this dataset we see companies like Delta Air Lines, FedEx, and Lockheed Martin.

Modeling

Our goal is not just to estimate value accurately, but also to identify key relationships between a company’s observable metrics and its ratio of TEV to EBITDA. We study 17 financial metrics, many of which have complex relationships with this ratio. To identify these relationships, we model the problem as a regression task, using two simple but widely used frameworks, linear models and tree-based models, because both offer insight into how predictions are actually made.

After fitting our models to the data, we identified the most predictive features of company value across industries and compared them to profit margin and size, the metrics most emphasized in Valuing a Business. For our linear models, we used the coefficients of the regression equation to determine which features were most important. For our random forest model, we used the feature importance metric, which ranks features according to the information gained during the fitting process.
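
A rough sketch of this two-model comparison, assuming scikit-learn; the file and column names are hypothetical, since the underlying dataset is not public:

    # Sketch: fit a linear model and a random forest to predict TEV/EBITDA,
    # then compare what each says about feature importance.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("guideline_companies.csv")   # hypothetical dataset
    X = df.drop(columns=["tev_to_ebitda"])        # the 17 financial metrics
    y = df["tev_to_ebitda"]

    linear = LinearRegression().fit(X, y)
    forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

    # Linear model: coefficient magnitudes indicate influence (features should
    # be standardized beforehand for this comparison to be fair).
    coef_ranking = pd.Series(linear.coef_, index=X.columns).abs()
    print(coef_ranking.sort_values(ascending=False).head(3))

    # Random forest: impurity-based feature importances.
    rf_ranking = pd.Series(forest.feature_importances_, index=X.columns)
    print(rf_ranking.sort_values(ascending=False).head(3))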

Results

The figure to the right depicts the accuracy of our models versus the market approach (also known as the comparable approach), the method used by valuation firms. Given the size of the dataset and the specificity of the market approach, we are not surprised that it outperforms our models. Rather, we are showing that our models are accurate enough to trust the interpretation of their features.

Also shown on the right are the top three features per industry, ranked by information gain, as learned by our random forest model. The larger the bar, the more insightful that variable was for predictions. The features that turn up in our model are indicators of profitability and size, which agrees with the existing knowledge in the literature. It is interesting to note that return on assets shows up in each sector, which intuitively means the market values companies that earn high returns regardless of sector.

Explanation of Key Predictors

Recall that our goal was to predict TEV/EBITDA, a measure of a company’s total value after standardizing for size, tax structure, and a number of other factors. Five distinct predictors stood out in our analysis.

Return on Assets is a measure of a company’s efficiency in generating profit.

Total Revenue, also known as total sales, measures how much a company receives from the sale of goods and services.

EBITDA 1-year growth: EBITDA is a measure of profitability, so growing EBITDA means growing profit and an increasing company value.

A Capital Expenditure (Capex) is money a company invests in property and equipment. Capex is often linked to the expansion or contraction of a business and is therefore a measure of growth. Looking at Capex as a percentage of revenue provides a normalized measurement for comparison.

EBITDA Margin serves as an indicator of a company’s operating profitability. Higher EBITDA margin means the company is getting more EBITDA for every dollar of revenue.
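
For concreteness, each of these predictors reduces to a simple ratio; here is a small illustrative computation with made-up inputs:

    # Illustrative definitions of the key predictors (inputs are made up, in $M).
    net_income    = 45.0
    total_assets  = 600.0
    total_revenue = 400.0
    ebitda        = 75.0
    ebitda_prior  = 68.0   # EBITDA one year earlier
    capex         = 30.0

    return_on_assets  = net_income / total_assets       # profit-generating efficiency
    ebitda_growth_1yr = (ebitda - ebitda_prior) / ebitda_prior
    capex_pct_revenue = capex / total_revenue           # normalized investment in growth
    ebitda_margin     = ebitda / total_revenue          # operating profitability

    print(f"ROA {return_on_assets:.1%} | EBITDA growth {ebitda_growth_1yr:.1%} | "
          f"Capex/revenue {capex_pct_revenue:.1%} | EBITDA margin {ebitda_margin:.1%}")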

MSSISS

MSSISS, the Michigan Student Symposium for Interdisciplinary Statistical Sciences, is an annual conference hosted by the University of Michigan. MSSISS brings together statistical work from a number of fields, including computer science, electrical engineering, statistics, biostatistics, and industrial and operations engineering. Our poster stood out as the only one with a financial application. The novelty of our project drew in a number of viewers and impressed the judges. A major component of our poster score was our ability to communicate our results to people outside the field. We received a certificate of merit for our work and our ability to communicate it to the other attendees at the conference.



PyData June Meetup: Intro to Azure Machine Learning: Predict Who Survives the Titanic


Join us for a PyData Ann Arbor Meetup on Thursday, June 8th, at 6 PM, hosted by TD Ameritrade and MIDAS.

Interested in doing machine learning in the cloud? In this demo-heavy talk, Jennifer Marsman will set the stage with some information on the different types of machine learning (clustering, classification, regression, and anomaly detection) supported by Azure Machine Learning and when to use each. Then, for the majority of the session, she’ll demonstrate using Azure Machine Learning to build a model which predicts survival of individuals on the Titanic (one of the challenges on the Kaggle website). She’ll talk through how she analyzes the given data and why she chooses to drop or modify certain data, so you will see the entire process from data import to data cleaning to building, training, testing, and deploying a model. You’ll leave with practical knowledge on how to get started and build your own predictive models using Azure Machine Learning.
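
Azure Machine Learning Studio itself is a drag-and-drop environment, so the session is demo-driven; as a rough local analogue of the same import-clean-train-evaluate workflow, a minimal scikit-learn sketch might look like this (the column choices are illustrative, using the standard Kaggle train.csv):

    # Local analogue of the Titanic workflow: import, clean, train, evaluate.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("train.csv")  # the Kaggle Titanic training set

    # Minimal cleaning: keep a few informative columns, encode sex numerically,
    # and fill missing ages with the median.
    df = df[["Survived", "Pclass", "Sex", "Age", "Fare"]]
    df["Sex"] = (df["Sex"] == "female").astype(int)
    df["Age"] = df["Age"].fillna(df["Age"].median())

    X, y = df.drop(columns="Survived"), df["Survived"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"Held-out accuracy: {model.score(X_test, y_test):.3f}")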

Jennifer Marsman is a Principal Software Development Engineer in Microsoft’s Developer Experience group, where she educates developers on Microsoft’s new technologies with a focus on data science, machine learning, and artificial intelligence. Jennifer blogs at http://blogs.msdn.microsoft.com/jennifer and tweets at http://twitter.com/jennifermarsman.

PyData Ann Arbor is a group for amateurs, academics, and professionals currently exploring various data ecosystems. Specifically, we seek to engage with others around analysis, visualization, and management. We are primarily focused on how Python data tools can be used in innovative ways but also maintain a healthy interest in leveraging tools based in other languages such as R, Java/Scala, Rust, and Julia.

PyData Ann Arbor strives to be a welcoming and fully inclusive group and we observe the PyData Code of Conduct. PyData is organized by NumFOCUS.org, a 501(c)3 non-profit in the United States.

“use what you have learned to make something better and share with others”


PyData May Meetup: Scalable, Distributed, and Reproducible Machine Learning


Join us for a PyData Ann Arbor Meetup on Thursday, May 25th at 6 PM, hosted by TD Ameritrade and MIDAS.

The recent advances in machine learning and artificial intelligence are amazing! Yet, to have real value within a company, data scientists must be able to get their models off their laptops and deployed within the company’s data pipelines and infrastructure. Those models must also scale to production-size data. In this talk, we will implement a model locally in Python. We will then deploy both its training and its inference in a scalable manner to a production cluster with Pachyderm, an open source framework for distributed pipelining and data versioning. We will also learn how to update the production model online, track changes in our model and data, and explore our results.
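
For a flavor of the deployment step: a Pachyderm pipeline is described by a JSON spec naming a Docker image to run, a command, and an input data repository. A minimal illustrative spec, written out from Python (the pipeline name, image, and repo are hypothetical; older Pachyderm releases call the "pfs" input "atom"):

    # Sketch of a Pachyderm pipeline spec (names and image are hypothetical).
    # Pachyderm re-runs the command in the named Docker image whenever new
    # data lands in the input repo, versioning both the data and the results.
    import json

    train_pipeline = {
        "pipeline": {"name": "model-training"},
        "transform": {
            "image": "example/train-model:latest",   # hypothetical image
            "cmd": ["python3", "/code/train.py"],
        },
        "input": {"pfs": {"repo": "training-data", "glob": "/*"}},
    }

    with open("train_pipeline.json", "w") as f:
        json.dump(train_pipeline, f, indent=2)
    # The pipeline would then be created with the pachctl CLI.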

Daniel Whitenack (@dwhitena) is a Ph.D.-trained data scientist working with Pachyderm (@pachydermIO). Daniel develops innovative, distributed data pipelines that include predictive models, data visualizations, statistical analyses, and more. He has spoken at conferences around the world (ODSC, Spark Summit, Datapalooza, DevFest Siberia, GopherCon, and more), teaches data science/engineering with Ardan Labs (@ardanlabs), maintains the Go kernel for Jupyter, and is actively helping to organize contributions to various open source data science projects.

PyData Ann Arbor is a group for amateurs, academics, and professionals currently exploring various data ecosystems. Specifically, we seek to engage with others around analysis, visualization, and management. We are primarily focused on how Python data tools can be used in innovative ways but also maintain a healthy interest in leveraging tools based in other languages such as R, Java/Scala, Rust, and Julia.

PyData Ann Arbor strives to be a welcoming and fully inclusive group and we observe the PyData Code of Conduct. PyData is organized by NumFOCUS.org, a 501(c)3 non-profit in the United States.

“use what you have learned to make something better and share with others”

2017 MICDE Annual Symposium


Please join us for the Michigan Institute for Computational Discovery and Engineering 2017 Symposium. The event features eminent scientists from around the world and the U-M campus. The symposium this year focuses on the “New Era of Data-Enabled Computational Science.”

Speakers:

  • Frederica Darema — Director, Air Force Office of Scientific Research
  • George Karniadakis — Professor of Applied Mathematics, Brown University
  • Tinsley Oden — Director of the Institute for Computational Engineering and Sciences and V.P. for Research, University of Texas at Austin
  • Karen Willcox — Professor of Aeronautics and Astronautics, Massachusetts Institute of Technology, and Co-Director of the MIT Center for Computational Engineering
  • Jacqueline H. Chen — Distinguished Member of Technical Staff, Combustion Research Facility, Sandia National Laboratories
  • Laura Balzano — Assistant Professor, Electrical Engineering and Computer Science, U-M
  • Emanuel Gull — Assistant Professor, Physics, U-M

The symposium also features a poster competition and more. For more information and to register, go to http://micde.umich.edu/symposium17/

Past Symposia

2016 MICDE Annual Symposium

Research Computing Symposium Fall 2014


PyData April Meetup: Interactive Data Visualization in Jupyter Notebook Using bqplot


Join us for a PyData Ann Arbor Meetup on Thursday, April 13th at 6 PM, hosted by TD Ameritrade and MIDAS.

This month’s meetup will focus on bqplot, a Python plotting library based on d3.js that offers its functionality directly in the Jupyter Notebook, including selections, interactions, and arbitrary CSS customization. In bqplot, every element of a chart is an interactive widget that can be bound to a Python function, which serves as the callback when an interaction takes place. This allows the user to generate full-fledged interactive applications directly in the Notebook with just a few lines of Python code. In the second part of the talk, drawing examples from fields like data science and finance, we show examples of building interactive charts and dashboards using bqplot and the ipywidgets framework.

The talk will also cover bqplot’s interaction with the new JupyterLab IDE and what we plan for the future.
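
As a taste of the widget-and-callback pattern described above, here is a minimal sketch that wires a brush selection on a scatter chart to a plain Python callback (run in a Jupyter Notebook with bqplot installed; the data is random):

    # Minimal bqplot sketch: a scatter chart whose brush selection fires a
    # Python callback. Run in a Jupyter Notebook with bqplot installed.
    import numpy as np
    from bqplot import Axis, Figure, LinearScale, Scatter
    from bqplot.interacts import BrushSelector

    x, y = np.random.rand(2, 50)

    sx, sy = LinearScale(), LinearScale()
    scatter = Scatter(x=x, y=y, scales={"x": sx, "y": sy})
    ax_x = Axis(scale=sx, label="x")
    ax_y = Axis(scale=sy, label="y", orientation="vertical")

    # A brush that updates the scatter's `selected` trait as you drag.
    brush = BrushSelector(x_scale=sx, y_scale=sy, marks=[scatter])

    def on_selected(change):
        # Called in Python every time the browser-side selection changes.
        print(f"{len(change['new'] or [])} points selected")

    scatter.observe(on_selected, names=["selected"])

    # The last expression in a notebook cell renders the figure.
    Figure(marks=[scatter], axes=[ax_x, ax_y], interaction=brush,
           title="Brush to select points")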

Presenter: Dhruv Madeka is a Quantitative Researcher at Bloomberg LP. His current research interests focus on Machine Learning, Quantitative Finance, Data Visualization and Applied Mathematics. Having graduated from the University of Michigan with a BS in Operations Research and from Boston University with an MS in Mathematical Finance, Dhruv is part of one of the leading research teams in Finance, developing models, software and tools for users to make their data analysis experience richer.

 

PyData Ann Arbor is a group for amateurs, academics, and professionals currently exploring various data ecosystems. Specifically, we seek to engage with others around analysis, visualization, and management. We are primarily focused on how Python data tools can be used in innovative ways but also maintain a healthy interest in leveraging tools based in other languages such as R, Java/Scala, Rust, and Julia.

PyData Ann Arbor strives to be a welcoming and fully inclusive group and we observe the PyData Code of Conduct. PyData is organized by NumFOCUS.org, a 501(c)3 non-profit in the United States.

“use what you have learned to make something better and share with others”

NSF Federal Datasets Faculty Working Group


In response to a recent NSF solicitation (Dear Colleague Letter: Request for Input on Federal Datasets with Potential to Advance Data Science), the Michigan Institute for Data Science (MIDAS) invites faculty to join a working group that will collaborate on a joint submission (deadline: March 31).

The NSF DCL working group will identify federal government datasets that can enhance and support the growing data science research community. NSF is asking what federal data would be of value for data science and machine learning and would have a significant impact on science, engineering, education, and society.

If you have experience or interest in using federal datasets for your research, and would like to help shape how federal datasets can be preserved and utilized, please join this working group. We will discuss strategies for responding to NSF and potential funding (both federal and local) to support this effort. Please attend in person if possible.

RSVP