Approved Courses

In lieu of EECS 409, which will no longer be offered, students should participate in at least 7-9 data-science-specific seminars (1 semester) to enrich their formal didactic training. These seminars may be hosted by different schools, institutes, initiatives, centers, etc. Seminar attendance should be recorded on this Google form.

Course Catalog

Core Courses for the Graduate Certificate Program

Legend: AA=Algorithms and Applications, DM=Data Management, AM=Analysis Methods.

The course covers some fundamental methods in biomedical data analysis. Topics include: Database management in biomedical applications. Transforms and feature extraction, Fourier transform, wavelet transform, fundamentals of information theory, and statistical methods used in signal processing. Image enhancement, image segmentation, and image feature extraction methods. Brief introduction to natural language processing. Introduction to fundamental techniques in clustering and classification. Applications in medicine and biology.

This course focuses on machine learning methods and their applications in biomedical sciences. Topics include: 1) data management solutions for Big Data applications; 2) feature extraction and reduction methods; 3) clustering and classification methods; 4) testing and validation of models; 5) applications in systems biology and clinical informatics.

We will introduce the foundational machine learning techniques used in computational biology and describe their applications to biological data. Computational biology is a rich and growing field featuring large, complex, and noisy datasets. This exciting area both draws upon the techniques of machine learning for scientific discovery and offers challenging problems that push the boundaries of machine learning. The course will emphasize theoretical foundations and practical implementation of machine learning techniques, in addition to the biological background needed for computational biology applications.

Fundamental statistical concepts related to the practice of public health: descriptive statistics; probability; sampling; statistical distributions; estimation; hypothesis testing; chi-square tests; simple and multiple linear regression; one-way ANOVA. Taught at a more advanced mathematical level than Biostat 503. Use of the computer in statistical analysis.

A second course in applied biostatistical methods and data analysis. Concepts of data analysis and experimental design for health-related studies. Emphasis on categorical data analysis, multiple regression, analysis of variance and covariance.

Graphical methods, simple and multiple linear regression; simple, partial and multiple correlation; estimation; hypothesis testing, model building and diagnosis; introduction to nonparametric regression; introduction to smoothing methods (e.g., lowess). The course will include applications to real data.

Data analysis in experimental molecular biology, gene expression, genome sequence and epigenomics data

Masters/Ph.D. level course for students in Computer Science, Electrical Engineering, and Information School

Foundations of machine learning, mathematical derivation and implementation of the algorithms, and their applications

The course aims to build computational abilities, inferential thinking, and practical skills for tackling core data science challenges. It explores foundational concepts in data management, processing, statistical computing, and dynamic visualization using modern programming tools and agile web-services. Concepts, ideas, and protocols are illustrated through examples of real observational, simulated and research-derived datasets. Some prior quantitative experience in programming, calculus, statistics, mathematical models, or linear algebra will be necessary.

Modern analytical methods for advanced healthcare research. Specific focus on innovative modeling, computational, analytic and visualization techniques to address concrete driving biomedical and healthcare applications. The course covers the 5 dimensions of Big-Data (volume, complexity, time/scale, source and management).

This is a basic course in stochastic processes with emphasis on model building and probabilistic reasoning. The approach will be non-measure theoretic but otherwise rigorous. Topics to be covered include a review of elementary probability theory with particular attention to conditional expectation; the Poisson process; renewal theory; Markov chains; and some continuous state models including Brownian motion. Applications will be considered in queueing, reliability, and inventory theory. Topics: 1. Review of probability theory; 2. Conditional probability, random variables, distributions; 3. MGF, SLLN, WLLN, and CLT; 4. Conditional expectation; 5. Poisson processes; 6. Renewal processes; 7. Stopping times, Wald’s equation and the key renewal theorem; 8. Elementary renewal theorem, renewal reward theorem; 9. Discrete-time Markov chains: definition and examples; 10. Class properties: transience and recurrence; 11. Long-run behavior; 12. Continuous-time Markov chains (CTMC): transient analysis.

This is an advanced course on stochastic processes, which is a continuation of IOE 515. It covers measure-theoretic probability theory, discrete-time and continuous-time Markov chains, martingales, and renewal theory. Some useful applications, including queueing, inventory, Markov decision processes, and reinforcement learning, will also be discussed.

This course provides the theory and application of time series methods for understanding, forecasting and control of systems. Topics include: 1. Introduction: Review of basic statistics, Concept of time series modeling and analysis, Regression analysis, 2. ARMA modeling: Model selection procedure and strategy, F-test, Final validation, 3. Model Analysis: Green’s function, Autocovariance function, 4. Forecast: Conditional expectation, One-step ahead forecast, Multi-step ahead forecast, Optimal properties of multi-step ahead forecast, Exponential smoothing, 5. Nonstationary time series model: Trend and Seasonality, 6. ARCH/GARCH modeling and analysis.

Rigid vs. flexible models; decision trees; K-nearest neighbors; perceptron; support vector machines; introduction to caret: training and predicting

Content covered in the course includes linear algebra, multilinear algebra, dynamical systems, and information theory. The course will start with a basic introduction to data representation as vectors, matrices, and tensors. Then the course will move on to geometric methods for dimension reduction, also known as manifold learning, and topological data reduction. Using an application-based approach, the course will cover spectral graph theory, addressing the combinatorial meaning of eigenvalues and eigenvectors of their associated graph matrices and extensions to hypergraphs via tensors. The course will provide an introduction to the application of dynamical systems theory to data including dynamic mode decomposition. Real data examples will be given where possible, along with assistance writing code to implement these algorithms to solve these problems. The methods discussed in this class are shown primarily for biological data, but are useful in handling data across many fields. The course will also feature several guest lecturers from the industry and government.

Numerical methods for solving linear algebra problems (linear systems and eigenvalue problems), matrix decompositions, and convex optimization.

This course will offer the background necessary to take advantage of recent advances in computational materials science, data science applied to materials science problems, and experimental techniques that fill the gaps in the computational and data approaches. The specific topics include (but are not limited to): an introduction to MGI with recent success examples; computational approaches in MSE, including high-throughput approaches; data science tools, including statistical methods and data visualization; material property databases and how to extract data from them; and machine learning techniques, including text mining.

This is the first course in a two-semester sequence on data analysis. This course presents the “general linear model” with particular emphasis on exploratory data analysis, contrast analysis, residual analysis, and Euclidean distance. The topics covered over the two semesters include analysis of variance, regression, categorical data analysis, principal components analysis, multidimensional scaling, cluster analysis, multivariate ANOVA, canonical correlation, and structural equations modeling

Topics covered in this course include multidimensional scaling, cluster analysis, principal components, factor analysis, multivariate analysis of variance and canonical correlation. A brief introduction to reliability theory, structural equations modeling and hierarchical linear modeling will also be provided

Advanced course for master’s students of information and health informatics

This course introduces students to a variety of NLP methods available for reasoning about text in computational systems. We will focus on major algorithms used in NLP for various applications (e.g., part-of-speech tagging, parsing, machine translation), on the linguistic phenomena those algorithms attempt to model, and on the people who interpret and utter the language. Students will implement a variety of algorithms for different linguistic aspects (e.g., syntax, semantics) and also understand the creation of ground truth data through linguistically annotating data on which those algorithms depend.

Advanced course for graduate students of information, health informatics, and CS

This course covers the principles of data mining, exploratory analysis and visualization of complex data sets, and predictive modeling. The presentation balances statistical concepts (such as over-fitting data, and interpreting results) and computational issues. Students are exposed to algorithms, computations, and hands-on data analysis in the weekly discussion sessions.

The course covers methods for modern multivariate data analysis and statistical learning, including theoretical foundations and practical applications. Topics include principal component analysis and other dimension reduction techniques, classification (discriminant analysis, decision trees, nearest neighbor classifiers, logistic regression, support vector machines, ensemble methods), clustering (agglomerative and partitioning methods, model-based methods), and categorical data analysis. There is a significant data analysis component.

This course provides basic concepts and several modern techniques of Bayesian modeling and computation. Foundational topics include decision theoretic characterization of Bayesian inference and its relation to frequentist methods, de Finetti-type theorems and the existence of priors, conjugate priors and other notions of objective prior distributions, and Bayesian model selection. The course covers a number of advanced modeling techniques, both classical and modern, which belong to the class of hierarchical models, spatiotemporal models, dynamics models and Bayesian nonparametric models. A substantial part of the course is devoted to computational algorithms based on Markov Chain Monte Carlo sampling for complex models, sequential Monte Carlo methods, and deterministic methods such as variational approximation. A key component of the course would involve data analysis with Bayesian techniques.

Basic design principles, review of the analysis of variance, block designs, two-level and three-level factorial and fractional factorial experiments, data analysis techniques and case studies, basic response surface methodology, and introductory robust parameter designs.

This course teaches the basic tools in acquisition, management, and visualization of large data sets. Students will learn how to: store, manage, and query databases via SQL; quickly construct insightful visualizations of multi-attribute data using Tableau; use the Python programming language to manage data as well as connect to APIs to efficiently acquire public data. After taking this course, students will be able to efficiently construct large data sets that source underlying data from multiple sources, and form initial hypotheses based on visualization.

Elective Courses Approved for the Certificate Program

Intensive MS/PhD course for Bioinformatics students on Big Data projects

Offering: Winter annually

BIOINF 597 covers both conventional AI methods and more recently developed deep learning and generative AI methods, beyond those covered in BIOINF 580. We will discuss classic AI methods such as agent-based models, probabilistic methods (such as Hidden Markov Models and Bayesian networks), and interpretable methods (such as fuzzy models and fuzzy neural networks). We will also discuss more recently developed methods such as reinforcement learning, active learning and representation learning. We will emphasize some emerging topics in deep learning such as attention mechanisms, transformers and generative models.

Offering: Fall alt. years

Practical understanding of computational aspects in implementing statistical methods.

Offering: Annually

Theory and methods of spatial and spatio-temporal statistics, modeling and inference on spatial processes within a geostatistical and a hierarchical Bayesian framework

Offering: Annually

The course will introduce students to biomedical applications of Machine Learning algorithms. It will lay the foundation for analysis of any big biomedical data set. This course will provide an overview of a wide range of AI and machine-learning tools, biomedical data sets (imaging, omics, health records) and diseases (cancer, cardiovascular-, infectious- and brain diseases).

Theory and application of matrix methods to signal processing, data analysis and machine learning. Theoretical topics include subspaces, eigenvalue and singular value decomposition, projection theorem, constrained, regularized and unconstrained least squares techniques and iterative algorithms. Applications such as image deblurring, ranking of webpages, image segmentation and compression, social networks, circuit analysis, recommender systems and handwritten digit recognition.

Note: EECS 453 and EECS 551 share the same lectures but have different recitations.

Programming-focused introduction to Machine Learning

Offering: Annually

Undergraduate level course for Computer Science and Engineering students

Offering: Winter annually

Computer Vision has become ubiquitous in our society, with applications in search, image understanding, apps, mapping, medicine, drones, and self-driving cars. Core to many of these applications are visual recognition tasks such as image classification and object detection. Recent developments in neural network approaches have greatly advanced the performance of these state-of-the-art visual recognition systems. This course is a deep dive into details of neural-network based deep learning methods for computer vision. During this course, students will learn to implement, train and debug their own neural networks and gain a detailed understanding of cutting-edge research in computer vision. We will cover learning algorithms, neural network architectures, and practical engineering tricks for training and fine-tuning networks for visual recognition tasks.

The Computational Data Science course offers an in-depth introduction to computational methods in data science for identifying, fitting, extracting and making sense of patterns in large data sets.

Offering: Biannually

Understanding the mathematical underpinnings of machine learning.

Offering: Fall

Principles and progress in unsupervised feature learning algorithms for machine learning applications. Topics include clustering, sparse coding, autoencoders, restricted Boltzmann machines, and deep belief networks

Offering: Biannually

Introduction To Mathematical Modeling In Epidemiology And Public Health

Offering: Fall

Applied inference methods in studies involving multiple variables. Specific methods that will be discussed include linear regression, analysis of variance, and different regression models. This course emphasizes the scientific formulation, analytical modeling, computational tools and applied statistical inference in diverse health-sciences problems. Hands-on data interrogation, modeling approaches, rigorous interpretation and inference.

Offering: Fall annually

Introduction to concepts and methods of constrained and unconstrained continuous nonlinear optimization. The course revolves around three issues in optimization: building optimization models of problems, characterization of their solutions, and algorithms for finding these solutions. A list of topics for each lecture will be compiled on the course website as the semester progresses. The outline of topics covers: introduction to optimization, optimality conditions for unconstrained problems, algorithms for unconstrained problems (steepest descent, Newton’s, etc.) and analysis of their convergence, optimality conditions and constraint qualifications for constrained problems, convexity and its role in optimization, algorithms for constrained problems (SQP, barrier and penalty methods, etc.), conic optimization problems, their applications, and methods for their solution.

Offering: Winter

The course is about theories and applications of dynamic programming (DP), and how to build recursive dynamic equations and their use in sequential decision making. The goal is to set a foundation for future research in dynamic programming and related fields. We also introduce topics of stochastic dynamic programming, Markov Decision Processes, Partially Observable Markov Decision Processes, Approximate Dynamic Programming, and reinforcement learning. In the class, we discuss (i) deterministic DP problems such as knapsack problems, traveling salesman problems, shortest path problems and (ii) stochastic DP problems such as stochastic shortest path problems, stopping time problems, machine maintenance, medical decision making, and others. The course requires basic knowledge in probability, statistics, and computer coding. The students need to implement different DP-based algorithms to solve a variety of deterministic or stochastic sequential decision making problems.

Scheduling is an activity involving sequencing and timing of events to achieve a goal in a data-driven context. Important applications of scheduling arise in most areas of science and engineering including biology, computer/data science, energy, manufacturing, medicine, and transportation systems. This class will cover fundamental topics that form the foundation of scheduling theory and practice, in both data-rich (offline) and data-limited (online) environments. There will be an emphasis on classifying problems on the basis of their data availability, theoretical properties of scheduling models, and computational methods for solving models that arise in many contexts.

Students in this course will learn advanced techniques to parse and collate information from text-rich health documents such as electronic health records, clinical notes, and peer-reviewed medical literature. In this elective, students will be able to delve deeper into challenges in recognizing medical entities in text documents, extracting clinical information, addressing ambiguity and polysemy, and building searchable interfaces to efficiently and effectively query and retrieve relevant patient data. Students will develop tools and techniques to analyze new genres of health information, and build resources to help in these tasks. Students will also participate in a semester-long project on addressing specific natural language processing challenges in real-life health data sets.

Offering: Winter

Numerical methods for solving practical scientific problems involving accuracy, stability, efficiency and convergence

Offering: Fall annually

Sparse analysis, compressive sensing and data modeling

Offering: Winter annually

Linear Regression Models: definition, fitting, Gauss-Markov theorem, inference, interpretation of results, meaning of regression coefficients, diagnostics, influential observations, multi-collinearity, lack of fit, robust procedures, transformations, variable selection, ridge regression, principal components regression, ANOVA and analysis of covariance. Introduction to generalized linear models: general framework, binomial data, logistic regression, Poisson regression. The objective is to learn what methods are available and, more importantly, when they should be applied.

Offering: Winter

Topics include: (1) classification and machine learning, including support vector machines, recursive partitioning, and ensemble methods; (2) methods for analyzing sets of curves, surfaces and images, including functional data analysis, wavelets, independent component analysis, and random field models; (3) modern regression, including splines and generalized additive models; (4) methods for analyzing structured dependent data, including mixed effects models, hierarchical models, graphical models, and Bayesian networks; and (5) clustering, detection, and dimension reduction methods, including manifold learning, spectral clustering, and bump hunting

Offering: Annually

With the ongoing explosion in availability of large and complex business datasets (“Big Data”), Machine Learning (“ML”) algorithms are increasingly being used to automate the analytics process and better manage the volume, velocity and variety of Big Data. This course teaches how to apply the growing body of ML algorithms to various Big Data sources in a business context.

This graduate-level course prepares engineering students to use data science tools during their master’s and PhD thesis research as well as after graduation in industry, government, and academia. This course will familiarize students with the principles of modern data science techniques in the context of chemical engineering, materials science, and research. The central focus is on data science tools used in engineering and science applications, such as data curation, supervised and unsupervised machine learning, and data mining. Algorithms and frameworks covered include the perceptron, dimensionality reduction tools, kernel ridge regression, neural networks, subgroup discovery, compressed sensing, random forests, support vector machines, and causal inference, among others. Homework exercises include hands-on practice of using data science to solve science and engineering problems. Students will be responsible for a data science project on a topic of interest.

Offering: Winter

Social networks, the world wide web, information and biological networks; methods and computer algorithms for the analysis and interpretation of network data; graph theory; models of networks including random graphs and preferential attachment models; spectral methods and random matrix theory; maximum likelihood methods; percolation theory; network search

Offering: On Demand

Computational Physics graduate seminar, including an Introduction to Python mini-course at the start of the semester

Offering: Annually

This course aims to help students get started with their own data harvesting, processing, aggregation, and analysis. Data analysis is crucial to evaluating and designing solutions and applications, as well as understanding users’ information needs and use. In many cases the data we need to access is distributed online among many webpages, stored in a database, or available in a large text file. Often these data (e.g. web server logs) are too large to obtain and/or process manually. Instead, we need an automated way of gathering the data, parsing it, and summarizing it, before we can do more advanced analysis. Therefore, students will learn to use Python and its modules to accomplish these tasks in a ‘quick and easy’ yet useful and repeatable way. Next, students will learn techniques of exploratory data analysis, using scripting, text parsing, structured query language, regular expressions, graphing, and clustering methods to explore data. R modules will be used to accomplish these tasks. Students will be able to make sense of and see patterns in otherwise intractable quantities of data.

Offering: On Demand

Image models, multidimensional and multivariate data, design principles for visualization, hierarchical, network, textual and collaborative visualization, and visualization pipeline, data processing for visualization, visual representations, visualization system interaction design, and impact of perception

Offering: On Demand

Advanced master’s-level/doctoral course for students in information sciences

Offering: Fall annually

Methods of Survey Sampling is a moderately advanced course in applied statistics, with an emphasis on the practical problems of sample design, which provides students with an understanding of principles and practice in skills required to select subjects and analyze sample data. Topics covered include stratified, clustered, systematic multi-stage sample designs; unequal probabilities and probabilities proportional to size, area, and telephone sampling; ratio means; sampling errors; frame problems; cost factors; and practical designs and procedures.

Offering: Winter annually

The first part of this course provides an introduction to web scraping and APIs for gathering data from the web and then discusses how to store and manage (big) data from diverse sources efficiently. The second part of the course demonstrates techniques for exploring and finding patterns in (non-standard) data, with a focus on data visualization. Tools for reproducible research will be introduced to facilitate transparent and collaborative programming. The course focuses on R as the primary computing environment, with excursions into SQL and Big Data processing tools.

Offering: Fall annually

This is the first in a two term sequence in applied statistical methods covering topics such as regression, analysis of variance, categorical data, and survival analysis.

Offering: Fall annually

This course builds on the introduction to linear models and data analysis provided in Statistical Methods I. Topics include: Multivariate Analysis techniques (Hotelling’s T-square, Principal Components, Factor Analysis, Profile Analysis, MANOVA); Categorical Data Analysis (contingency tables, measurement of association, log-linear models for counts, logistic and polytomous regression, GEE); and Lifetime Data Analysis (Kaplan-Meier plots, logrank test, Cox regression).

Offering: Winter annually

Advanced Statistical Modeling, designed for students on both the social science and statistical tracks for the two programs in survey methodology, will provide students with exposure to applications of more advanced statistical modeling tools for both substantive and methodological investigations that are not fully covered in other MPSM or JPSM courses. Modeling techniques to be covered include multilevel modeling (with an application to methodological studies of interviewer effects), structural equation modeling (with an application of latent class models to methodological studies of measurement error), classification trees (with an application to prediction of response propensity), and alternative models for longitudinal data (with an application to panel survey data from the Health and Retirement Study). Discussions and examples of each modeling technique will be supplemented with methods for appropriately handling complex sample designs when fitting the models. The class will focus on essential concepts, practical applications, and software, rather than extensive theoretical discussions.

Offering: Fall annually

Applied Business Analytics and Decisions. Objective: The strategic and tactical decision problems that firms face have become too complex to solve by naive intuition and heuristics. Increasingly, making business decisions requires “intelligent” and “data oriented” decisions, aided by decision support tools and analytics. The ability to make such decisions and use available tools is critical for both managers and firms. In recent years, the toolbox of business analytics has grown. These tools provide the ability to make decisions supported by data and models. This course prepares students to model and manage business decisions with data analytics and decision models.

Other Data Science Courses (not approved for the Certificate Program)

This course is an introduction to the process used in estimating demand for passenger travel across modes and regions. The goal is to provide you with an overview of the different steps involved in travel demand forecasting and then focus in greater detail on understanding user behavior, choices and preferences as they relate to transportation mode and route choices. We will study different behavioral theories defining consumer behavior and learn how to model some of them using discrete choice methods. We will explore different families of discrete choice models, their behavioral, statistical and econometric foundations and estimation methods. We will also briefly cover sampling techniques and survey design as part of data collection and estimation for discrete choice models. Last, but most importantly, as travel choices are closely related to land use planning and policy, we will spend a few lectures learning about scenario planning and accessibility from experts in that area.

This course will provide opportunities for students to apply advanced data science techniques to real-world problems. Students will work in a team throughout the semester on a project as part of a collaboration between the Michigan Institute for Data Science (MIDAS) and its Industry Partner Yazaki. Throughout the course, students will apply various supervised and unsupervised machine learning techniques, with the ultimate goal of optimizing the engineering specifications used to ensure the safety and reliability of electrical connections within the automotive industry.

This course will introduce 1) the concepts of multistage carcinogenesis and the analysis of cancer epidemiology using mathematical models of carcinogenesis; 2) the analysis of cancer prevention strategies using Markov cancer natural history models. Students will learn how to develop and fit multistage and cancer natural history models in R.

Learning analytics involves collecting, analyzing, and communicating with data about learners and learning environments. In this course, we will address efforts to use new and novel data sources as well as diverse analytical techniques to improve learning opportunities in K-16 and professional settings (e.g., healthcare). As a field, learning analytics draws on theories and methods from multiple traditions, and in this course, we will use learning, organizational, and socio-technical theories to examine the possibilities and pitfalls of analytics-based interventions in education. This class is intended for students who are broadly interested in learning, whether that learning takes place in the classroom, in the home, or on the shop floor.

One in five Americans uses a wearable. These wearables score sleep, predict alertness/stress/performance, and integrate into medical practice, for example, in COVID detection. This course will use sensor data measuring physiological signals to predict human performance and diagnose disease. We aim to teach students how to analyze real data from wearables or sensors, including data from students, athletes, travelers, or patients. The course will integrate into many ongoing University initiatives, for example from the Biosciences Initiative or Precision Health. Mathematical techniques introduced in this course include 1) Time Series Analysis, 2) Building Models with Differential Equations, 3) Parameter Estimation, 4) Uncertainty Analysis, 5) Techniques from the Theory of Dynamical Systems. We will study potential applications, including Exercise and Heart Rate, Sleep, Circadian Rhythms, Mood, Weight, Music Performance, Infectious Disease, Addiction, … Course meetings will consist of prerecorded lectures, interactive lectures, and remote computer labs. Emphasis will be placed on the analysis of raw data. Consideration will be given in the problem sets and course projects to interdisciplinary student backgrounds. Teamwork will be encouraged. This course is taught by Daniel Forger and takes place T-Th 10-11:30 remotely.

The 5 courses in this University of Michigan specialization introduce learners to data science through the Python programming language. This skills-based specialization is intended for learners who have a basic Python or programming background and want to apply statistical, machine learning, information visualization, text analysis, and social network analysis techniques through popular Python toolkits such as pandas, matplotlib, scikit-learn, nltk, and networkx to gain insight into their data. Introduction to Data Science in Python (course 1), Applied Plotting, Charting & Data Representation in Python (course 2), and Applied Machine Learning in Python (course 3) should be taken in order and prior to any other course in the specialization. After completing those, courses 4 and 5 can be taken in any order. All 5 are required to earn a certificate.

Offering: Online

This course will introduce the learner to applied machine learning, focusing more on the techniques and methods than on the statistics behind these methods. The course will start with a discussion of how machine learning is different from descriptive statistics, and introduce the scikit-learn toolkit through a tutorial. The issue of dimensionality of data will be discussed, and the task of clustering data, as well as evaluating those clusters, will be tackled. Supervised approaches for creating predictive models will be described, and learners will be able to apply the scikit-learn predictive modelling methods while understanding process issues related to data generalizability (e.g. cross-validation, overfitting). The course will end with a look at more advanced techniques, such as building ensembles, and practical limitations of predictive models. By the end of this course, students will be able to identify the difference between a supervised (classification) and unsupervised (clustering) technique, identify which technique they need to apply for a particular dataset and need, engineer features to meet that need, and write Python code to carry out an analysis.

Offering: Online

This course will introduce the learner to information visualization basics, with a focus on reporting and charting using the matplotlib library. The course will start with a design and information literacy perspective, touching on what makes a visualization good or bad, and how statistical measures translate into visualizations. The second week will focus on matplotlib, the technology used to make visualizations in Python, and introduce users to best practices when creating basic charts and how to realize design decisions in the framework. The third week will be a tutorial on the functionality available in matplotlib, and will demonstrate a variety of basic statistical charts, helping learners to identify when a particular method is good for a particular problem. The course will end with a discussion of other forms of structuring and visualizing data.

Offering: Online

This course will introduce the learner to network analysis through tutorials using the NetworkX library. The course begins with an understanding of what network analysis is and motivations for why we might model phenomena as networks. The second week introduces the concept of connectivity and network robustness. The third week will explore ways of measuring the importance or centrality of a node in a network. The final week will explore the evolution of networks over time and cover models of network generation and the link prediction problem.

Offering: Online

This course will introduce the learner to text mining and text manipulation basics. The course begins with an understanding of how text is handled by Python, the structure of text both to the machine and to humans, and an overview of the nltk framework for manipulating text. The second week focuses on common manipulation needs, including regular expressions (searching for text), cleaning text, and preparing text for use by machine learning processes. The third week will apply basic natural language processing methods to text, and demonstrate how text classification is accomplished. The final week will explore more advanced methods for detecting the topics in documents and grouping them by similarity (topic modelling).

Offering: Online
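
As a rough illustration of the kind of regular-expression text cleaning the text-mining course description above mentions, here is a minimal sketch using only Python's standard `re` module. The function name `clean_text` and the specific cleaning rules (lowercasing, dropping URLs, stripping punctuation) are assumptions chosen for illustration, not material taken from the course.

```python
import re

def clean_text(raw: str) -> list:
    """Hypothetical helper: lowercase, drop URLs, strip punctuation, split into tokens."""
    text = raw.lower()
    # Replace anything that looks like an http(s) URL with a space.
    text = re.sub(r"https?://\S+", " ", text)
    # Replace every character that is not a lowercase letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Split on runs of whitespace to get a list of tokens.
    return text.split()

tokens = clean_text("Visit https://example.org NOW!! It's 2 good.")
print(tokens)
```

Tokens produced this way could then be passed to a text-classification or topic-modelling step; real pipelines usually need rules tuned to the data at hand.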

This is the MOOC version of the regular residential course HS650 Data Science and Predictive Analytics. The MOOC is an independent-study, self-guided version of the regular residential HS650 class. The course aims to build computational abilities, inferential thinking, and practical skills for tackling core data scientific challenges. It explores foundational concepts in data management, processing, statistical computing, and dynamic visualization using modern programming tools and agile web-services. Concepts, ideas, and protocols are illustrated through examples of real observational, simulated and research-derived datasets. Some prior quantitative experience in programming, calculus, statistics, mathematical models, or linear algebra will be necessary.

Offering: Online

As patients, we care about the privacy of our medical records; but as patients, we also wish to benefit from the analysis of data in medical records. As citizens, we want a fair trial before being punished for a crime; but as citizens, we want to stop terrorists before they attack us. As decision-makers, we value the advice we get from data-driven algorithms; but as decision-makers, we also worry about unintended bias. Many data scientists learn the tools of the trade and get down to work right away, without appreciating the possible consequences of their work. This course, focused on ethics specifically related to data science, will provide you with a framework to analyze these concerns. This framework is based on ethics, which are shared values that help differentiate right from wrong. Ethics are not law, but they are usually the basis for laws. Everyone, including data scientists, will benefit from this course. No previous knowledge is needed.

Offering: Online

This course will introduce the learner to the basics of the Python programming environment, including fundamental Python programming techniques such as lambdas, reading and manipulating CSV files, and the numpy library. The course will introduce data manipulation and cleaning techniques using the popular Python pandas data science library and introduce the abstraction of the Series and DataFrame as the central data structures for data analysis, along with tutorials on how to use functions such as groupby, merge, and pivot tables effectively. By the end of this course, students will be able to take tabular data, clean it, manipulate it, and run basic inferential statistical analyses.

Offering: Online
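
To make the pandas operations named in the course description above (groupby, merge, and pivot tables) more concrete, here is a minimal sketch. It assumes pandas is installed; the toy data and all column names are invented for illustration and are not taken from the course.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
scores = pd.DataFrame({
    "student": ["ana", "ben", "ana", "ben"],
    "term": ["fall", "fall", "winter", "winter"],
    "score": [80, 70, 90, 60],
})
majors = pd.DataFrame({
    "student": ["ana", "ben"],
    "major": ["stats", "bio"],
})

# groupby: mean score per student (returns a Series indexed by student).
mean_by_student = scores.groupby("student")["score"].mean()

# merge: attach each student's major to every score row (left join on "student").
merged = scores.merge(majors, on="student", how="left")

# pivot_table: one row per student, one column per term.
wide = scores.pivot_table(index="student", columns="term",
                          values="score", aggfunc="mean")

print(mean_by_student)
print(merged)
print(wide)
```

Here `mean_by_student` is a Series, while `merged` and `wide` are DataFrames, which matches the description's point that these two abstractions are the central data structures.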

Practical Learning Analytics has a specific goal: to help us collectively ponder learning analytics in a concrete way. To keep it practical, we will focus on using traditional student record data, the kinds of data every campus already has. To make it interesting, we will address questions raised by an array of different stakeholders, including campus leaders, faculty, staff, and especially students. To provide analytic teeth, each analysis we discuss will be supported by both realistic data and sample code.

Offering: Online

This course aims to teach everyone the basics of programming computers using Python. We cover the basics of how one constructs a program from a series of simple instructions in Python. The course has no prerequisites and avoids all but the simplest mathematics. Anyone with moderate computer experience should be able to master the materials in this course. This course will cover Chapters 1-5 of the textbook “Python for Everybody”. Once a student completes this course, they will be ready to take more advanced programming courses. This course covers Python 3.

Offering: Online

This specialization covers the fundamentals of surveys as used in market research, evaluation research, social science and political research, official government statistics, and many other topic domains. In six courses, you will learn the basics of questionnaire design, data collection methods, sampling design, dealing with missing values, making estimates, combining data from different sources, and the analysis of survey data. In the final Capstone Project, you’ll apply the skills learned throughout the specialization by analyzing and comparing multiple data sources.

Offering: Online