| Areas | Competency | Expectation | Notes |
| --- | --- | --- | --- |
| Algorithms and Applications | Tools | Working knowledge of basic software tools (command-line, GUI-based, or web services) | Familiarity with statistical programming languages, e.g., R or SciKit/Python, and database querying languages, e.g., SQL or NoSQL |
| | Algorithms | Knowledge of core principles of scientific computing, applications programming, APIs, algorithm complexity, and data structures | Best practices for scientific and application programming, efficient implementation of matrix linear algebra and graphics, elementary notions of computational complexity, user-friendly interfaces, string matching |
| | Application Domain | Data analysis experience from at least one application area, e.g., through coursework, an internship, or a research project | Applied domain examples include: computational social sciences, health sciences, business and marketing, learning sciences, transportation sciences, engineering and physical sciences |
| Data Management | Data validation & visualization | Curation, Exploratory Data Analysis (EDA), and visualization | Data provenance, validation, visualization via histograms, Q-Q plots, scatterplots (ggplot, Dashboard, D3.js) |
| | Data wrangling | Skills for data normalization, data cleaning, data aggregation, and data harmonization/registration | Data imperfections include missing values, inconsistent string formatting (‘2016-01-01’ vs. ‘01/01/2016’), PC/Mac/Linux time vs. timestamps, structured vs. unstructured data |
| | Data infrastructure | Handling databases, web services, Hadoop, multi-source data | Data structures, SOAP protocols, ontologies, XML, JSON, streaming |
| Analysis Methods | Statistical inference | Basic understanding of bias and variance, principles of (non)parametric statistical inference, and (linear) modeling | Biological variability vs. technological noise, parametric (likelihood) vs. non-parametric (rank order statistics) procedures, point vs. interval estimation, hypothesis testing, regression |
| | Study design and diagnostics | Design of experiments, power calculations and sample sizing, strength of evidence, p-values, false discovery rates | Multistage testing, variance normalizing transforms, histogram equalization, goodness-of-fit tests, model overfitting, model reduction |
| | Machine Learning | Dimensionality reduction, k-nearest neighbors, random forests, AdaBoost, kernelization, SVM, ensemble methods, CNN | Empirical risk minimization. Supervised, semi-supervised, and unsupervised learning. Transfer learning, active learning, reinforcement learning, multiview learning, instance learning |
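
As an illustration of the data-wrangling imperfections listed in the table (missing values and inconsistent date strings such as ‘2016-01-01’ vs. ‘01/01/2016’), the following is a minimal Python sketch, not a prescribed method. The function name `harmonize_date`, the `KNOWN_FORMATS` list, and the sample records are hypothetical, and the sketch assumes that slash-separated dates use month/day/year order.

```python
from datetime import date, datetime

# Formats this sketch recognizes. The list is an assumption and is not
# exhaustive; slash-separated dates are assumed to be month/day/year.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def harmonize_date(raw):
    """Return a datetime.date for a recognized string, else None.

    None is returned for missing values (None or blank strings) and for
    strings that match none of the known formats.
    """
    if raw is None or not raw.strip():
        return None  # missing value
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next known format
    return None  # unrecognized format


# Hypothetical sample records mixing both formats with missing entries.
records = ["2016-01-01", "01/01/2016", "", None, "Jan 1st"]
cleaned = [harmonize_date(r) for r in records]
```

Returning `None` for unparseable entries keeps missing and malformed values visible for a later cleaning step instead of silently guessing a date.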