The Michigan Institute for Data Science (MINDS) and the Michigan Institute for Computational Discovery and Engineering (MICDE) held a symposium in April 2014 featuring speakers at the forefront of data and computational science. The symposium also received support from Yahoo!
Introduction: H.V. Jagadish, Bernard A. Galler Collegiate Professor of Electrical Engineering and Computer Science
Usama Fayyad, Chief Data Officer, Barclays Bank
The New CDO Challenge: Taming the Big Data Beast for Value and Insights
Usama Fayyad, Ph.D. is Chief Data Officer at Barclays. His responsibilities, globally across the Barclays Group, include the governance, performance, and management of the bank's operational and analytical data systems, as well as delivering value by using data and analytics to create growth opportunities and cost savings for the business. He previously led OASIS-500, a tech startup investment fund, following his appointment as Executive Chairman in 2010 by King Abdullah II of Jordan. He was also Chairman, Co-Founder and Chief Technology Officer of ChoozOn Corporation/Blue Kangaroo, a mobile search engine service for offers based in Silicon Valley.
In 2008, Usama founded Open Insights, a US-based data strategy, technology and consulting firm that helps enterprises deploy data-driven solutions that effectively and dramatically grow revenue and competitive advantage. Prior to this, he served as Yahoo!’s Chief Data Officer and Executive Vice President, where he was responsible for Yahoo!’s global data strategy, architecting its data policies and systems, and managing its data analytics and data processing infrastructure. The data teams he built at Yahoo! collected, managed, and processed over 25 terabytes of data per day, and drove a major part of ad targeting revenue and data insights businesses globally. In 2003 Usama co-founded and led the DMX Group, a data mining and data strategy consulting and technology company specializing in Big Data Analytics for Fortune 500 clients. DMX Group was acquired by Yahoo! in 2004. Prior to 2003, he co-founded and served as Chief Executive Officer of Audience Science. He also has experience at Microsoft, where he led the data mining and exploration group at Microsoft Research and also headed the data mining products group for Microsoft’s server division.
From 1989 to 1996 Usama held a leadership role at NASA’s Jet Propulsion Laboratory where his work garnered him the Lew Allen Award for Excellence in Research from Caltech, as well as a US Government medal from NASA.
Randomized matrix algorithms and large-scale scientific data analysis
Michael W. Mahoney, ICSI and UC Berkeley
Matrix problems are ubiquitous in many large-scale scientific data analysis applications, and in recent years randomization has proved to be a valuable resource for the design of better algorithms for many of these problems. Depending on the situation, “better” might mean faster in worst-case theory; faster in high-quality numerical implementations, e.g., in RAM or in parallel and distributed environments; or more useful for downstream domain scientists. This talk will describe the theory underlying randomized algorithms for matrix problems such as least-squares regression and low-rank matrix approximation, and it will describe the use of these algorithms in large-scale scientific data analysis and numerical computing applications. Examples of the former include the use of interpretable CUR matrix decompositions to extract informative markers from DNA single nucleotide polymorphism data as well as informative wavelength regions in astronomical galaxy spectra; examples of the latter include a randomized algorithm that beats LAPACK on dense overconstrained least-squares problems for data in RAM, and a randomized algorithm that solves the least absolute deviations problem on a terabyte of distributed data.
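The core “sketch-and-solve” idea behind randomized least-squares algorithms of this kind can be illustrated in a few lines of NumPy. This is a minimal sketch using a plain Gaussian random projection for clarity; the high-performance algorithms the talk refers to use structured random transforms and more refined pipelines, and all variable names and problem sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# An overconstrained least-squares problem: many more rows than columns.
n, d = 20000, 50
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.01 * rng.standard_normal(n)

# Exact solution, for reference.
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)

# Sketch-and-solve: randomly project the n rows down to k = O(d) rows,
# then solve the much smaller k-by-d problem instead.
k = 8 * d
S = rng.standard_normal((k, n)) / np.sqrt(k)
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)

# The sketched solution's residual is provably close to the optimum
# (with high probability), at a fraction of the cost on the small problem.
r_exact = np.linalg.norm(A @ x_exact - b)
r_sketch = np.linalg.norm(A @ x_sketch - b)
print(r_sketch / r_exact)  # typically close to 1
```

The theory guarantees that, for a large enough sketch size k relative to d, the sketched residual is within a (1 + ε) factor of the optimal residual with high probability; in practice the sketch can also serve as a preconditioner for an iterative solver on the full problem rather than being solved directly.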
Michael Mahoney works on algorithmic and statistical aspects of modern large-scale data analysis. Much of his recent research has focused on large-scale machine learning, including randomized matrix algorithms and randomized numerical linear algebra, geometric network analysis tools for structure extraction in large informatics graphs, scalable implicit regularization methods, and applications in genetics, astronomy, medical imaging, social network analysis, and internet data analysis. He received his PhD from Yale University with a dissertation in computational statistical mechanics, and he has worked and taught at Yale University in the mathematics department, at Yahoo Research, and at Stanford University in the mathematics department. Among other things, he is on the national advisory committee of the Statistical and Applied Mathematical Sciences Institute (SAMSI), he was on the National Research Council’s Committee on the Analysis of Massive Data, he runs the biennial MMDS Workshops on Algorithms for Modern Massive Data Sets, and he spent fall 2013 at UC Berkeley co-organizing the Simons Foundation’s program on the Theoretical Foundations of Big Data Analysis.
The NEEShub: A Scalable and Reliable Cyberinfrastructure for the Earthquake Engineering Community
Thomas Hacker, Purdue University
Since 2009, NEEScomm at Purdue University has developed and operated a distributed cyberinfrastructure to meet the needs of the Civil Engineering research community as a part of the NSF George E. Brown Network for Earthquake Engineering Simulation (NEES) project. As of March 2014, over 104,000 users annually connect to the NEEShub and the NEES Project Warehouse to access over 1.78M project files uploaded by the earthquake engineering community. In this talk, Hacker will discuss the NEES cyberinfrastructure, and describe some of the experiences and lessons learned in developing and operating this cyberinfrastructure and data repository for a large distributed engineering community.
Thomas Hacker is an Associate Professor of Computer and Information Technology at Purdue University, and a Visiting Professor in the Department of Electrical Engineering and Computer Science at the University of Stavanger in Norway. Dr. Hacker’s research interests center on high-performance computing and networking at the operating system and middleware layers. Recently his research has focused on cloud computing, cyberinfrastructure, the reliability of large-scale supercomputing systems, and data-oriented infrastructure. Dr. Hacker is the co-leader for Information Technology for the NSF Network for Earthquake Engineering Simulation (NEES), which brings together researchers from fourteen civil engineering laboratories across the country to share innovations in earthquake research and engineering.
The Web Changes Everything
Jaime Teevan, Microsoft Research
When you visit a colleague’s webpage, do the new articles she’s posted jump out at you? When you return to your favorite news website, is it easy to find the front page story you saw yesterday? The Web is a dynamic, ever-changing collection of information, and the changes can affect, drive, and interfere with people’s information seeking activities. With so much information online, it is now possible to capture and study content evolution and human interaction with evolving content on a scale previously unimaginable. This talk will use large-scale log analysis to explore how and why people revisit Web content that has changed, and illustrate how understanding the association between change and revisitation might improve browser, crawler, and search engine design.
Jaime Teevan is a Senior Researcher at Microsoft Research and an Affiliate Assistant Professor in the Information School at the University of Washington. Working at the intersection of human-computer interaction, information retrieval, and social media, she studies and supports people’s information seeking activities. Jaime was named a Technology Review (TR35) 2009 Young Innovator for her research on personalized search. She is particularly interested in understanding social and temporal context, co-authoring the first book on collaborative Web search and chairing the Web Search and Data Mining (WSDM) 2012 conference. Jaime also edited a book on Personal Information Management (PIM), edited a special issue of Communications of the ACM on the topic, and organized workshops on PIM and query log analysis. She has published numerous technical papers, including several best papers, and received a Ph.D. and S.M. from MIT and a B.S. in Computer Science from Yale University.
Research support and data management challenges: Is there such thing as too much data?
Sandra Cannon, Data Management Expert
The ever-growing assortment and volume of data available pose interesting challenges for researchers and for those who manage data and provide support to them: what to do with all these data? The tsunami of available data presents a variety of “opportunities” and may require new ways of thinking about long-standing activities in the data world. The changing data landscape poses interesting questions at each stage of the data lifecycle. What if data acquisition does not mean a carefully crafted data collection but the acquisition of petabytes of sensor data? Do data processing, editing, and management have different considerations when data are commercially sourced? Are “predictive analytics” a better approach to data analysis now, and how do they fit with “traditional” research methods? How can we think about archiving data that may be too big to handle effectively within our current frameworks but that we still want to store for posterity and make available for reuse? Given that “big data” are now big business, does the potential for licensing constraints or usage restrictions, on top of the more traditional privacy concerns, hamper the research process? This presentation will outline some of the topics to deliberate when thinking strategically about how to help advance research in the new data order.
Sandra “San” Cannon is a data management expert with more than 15 years of experience in research and analytical support. She has worked in all phases of the data life cycle, from collection through management to dissemination. She is active in the international data community, especially the academic and central banking spheres, and works closely with international statistical offices and US statistical and regulatory agencies on data and metadata issues. She has presented and published on topics that include metadata standards, copyright and licensing issues, and data dissemination challenges. She holds multiple degrees in Economics (a B.S. from the University of California, Irvine, an M.Sc. from the London School of Economics, and a Ph.D. from the University of Wisconsin-Madison) but has collaborated extensively with technology providers to help design systems that further advance data-intensive research and analysis. She continues to work vigorously toward her vision of building a data community that provides good governance, professional management, and better understanding of data and statistics for the broadest user base.