TRANSCRIPT
How to train the next generation for Big Data Projects: building a curriculum
Christopher G. Wilson, Ph.D., Associate Professor, Physiology and Pediatrics, Center for Perinatal Biology
Experimental Biology, Mar 28th, 2015
Outline
• Assessing the need for a “Big Data” Analytics course
• Structure and grading of the course
• Overview of the curriculum
• Advantages to Python/IPython
• Examples/use cases
• Coalition institutions and participating faculty
Is a Big Data analytics course necessary?
• “Back in the day, when *I* was a graduate student…”
• First year Physics lab as a training ground…
• Contemporary students live in a digital world…
• Office suites are NOT suited to large-scale data analytics!
Work-flow of “Big Data” analysis
Or…
• Obtain data
• Scrub data
• Explore data
• Model the data
• Interpret the data
• Present the data
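As a rough illustration (not part of the original slides), the same workflow can be sketched in a few lines of Pandas; the file name and column names below are hypothetical placeholders, not a real dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Obtain: read a (hypothetical) CSV export of raw measurements
df = pd.read_csv("measurements.csv")

# Scrub: drop missing rows and physiologically implausible values
df = df.dropna()
df = df[df["heart_rate"].between(20, 250)]

# Explore: summary statistics and a quick histogram
print(df.describe())
df["heart_rate"].hist(bins=50)
plt.show()

# Model: a simple linear fit of heart rate against age
slope, intercept = np.polyfit(df["age"], df["heart_rate"], 1)

# Interpret/Present: report the fitted trend
print("heart_rate ~ %.2f * age + %.2f" % (slope, intercept))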
Why use Free/Open-Source Software?
• In this era of shrinking science funding, free software makes more economic sense.
• Bugs/security issues are fixed FASTER than proprietary software.
• With access to the source code, we can customize the software to fit OUR needs.
• Reproducibility of analyses and algorithms is easier when all code is free, can be shared, and examined/dissected.
• Free/Open-source software tends to be more reliable and stable.
• See Eric Raymond’s The Cathedral and the Bazaar for a more comprehensive explanation.
Using a “flipped” classroom
• On-line material or reading is provided to the student either before or during the class meeting time
• The instructor provides a short summary/overview lecture (~20 min)
• The remaining class time is spent working on the subject matter as individuals and groups—with the instructor and TA present
• More effective for learning “hands on” skills like programming, bioinformatics, web design, etc.
Why use a flipped classroom model instead of lecturing for 50 minutes and assigning homework?
The data analytics team
• Project manager—responsible for setting clear project objectives and deliverables.
The project manager should be someone with more experience in data analysis and a more comprehensive background than the other team members.
• Statistician—should have a strong mathematics/statistics background and will be responsible for reporting and developing the statistics workflow for the project.
• Visualization specialist—responsible for the design/development of data visualization (figures/animation) for the project.
• Database specialist—develops ontology/meta-tags to represent the data and incorporates this information in the team's chosen database schema.
• Content Expert—has the strongest background in the focus area of the project (Physiologist, systems biologist, molecular biologist, biochemist, clinician, etc.) and is responsible for providing background material relevant to the project's focus.
• Web developer/integrator—responsible for web-content related to the project, including the final report formatting (for web/hardcopy display).
• Data analyst—the most junior member of the team will take on general responsibilities to assist the other team members. This is a learning opportunity for a team member who is new to data analysis and needs time to develop the skills necessary to fully participate in the workflow.
Student self-assessment
Survey created using Google Forms
Student self-assessment
From Doing Data Science by Cathy O’Neil and Rachel Schutt
Grading
• Pass/No Pass
• Weekly quizzes (concepts from short lectures, on-line resources, simple code fragments/pseudo-code, etc.)
• Projects
• One individual project (basics of using IPython, simple statistics computed via interaction with R—or using Pandas—and simple visualization of a dataset; a short sketch follows this list).
• Two short projects (small group, designed to develop team-based distribution of workload, team roles assigned by instructor).
• Larger scale project using a Big Data dataset (students will “self-organize” their team roles). This project is envisioned as the final exam for the class and each team will present their results and project summary to the class.
• Final projects will be posted on the class website along with IPython notebooks and supporting materials used for the project.
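A minimal sketch of the kind of work the individual project asks for, using Pandas for simple statistics and visualization; the CSV file, group labels, and column names are made-up placeholders.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset: one row per subject, with a treatment group label
df = pd.read_csv("subjects.csv")

# Simple statistics: per-group mean, standard deviation, and count
summary = df.groupby("group")["response"].agg(["mean", "std", "count"])
print(summary)

# Simple visualization: distribution of the response in each group
df.boxplot(column="response", by="group")
plt.show()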
Syllabus Overview (10 week course)
Foundations 1: Using text editors, using the IPython notebook for data exploration, using version control software (git), using the class wiki.
Foundations 2: Using IPython/NumPy/SciPy, importing and manipulating data with Pandas, data visualization in IPython.
Analysis Methods: Basic signal theory overview, time-series data, plotting (lines, histograms, bars, etc.), dynamical systems analyses of data variability, information theory measures (entropy) of complexity, frequency domain/spectral measures (FFT, time-varying spectrum), wavelets.
Handling Sequence data: Using R/Bioconductor, differences between mRNA-Seq, gene-array, proteomics, and deep-sequencing data, visualizing data from gene/RNA arrays.
Data set storage and retrieval: Basics of relational databases, SQL vs. NOSQL, cloud storage/NAS/computing clusters, interfacing with Hadoop/MapReduce, metadata and ontology for biomedical/patient data (XML), using secure databases (REDCap).
Data integrity and security: The Health Insurance Portability and Accountability Act (HIPAA) and what it means for data management, de-identifying patient data (handling PHI), data security best practices, making data available to the public—implications for data transparency and large-scale data mining.
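As an illustration of the frequency-domain material in the Analysis Methods module, the sketch below builds a synthetic time series (a stand-in for real physiological data) and computes its power spectrum with NumPy's FFT routines.

import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for a physiological time series: 1 Hz and 7 Hz
# components plus noise, sampled at 100 Hz for 10 seconds
fs = 100.0
t = np.arange(0, 10, 1.0 / fs)
signal = np.sin(2 * np.pi * 1 * t) + 0.5 * np.sin(2 * np.pi * 7 * t)
signal += 0.2 * np.random.randn(len(t))

# Frequency-domain view: power spectrum via the FFT
freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
power = np.abs(np.fft.rfft(signal)) ** 2

plt.plot(freqs, power)
plt.xlabel("Frequency (Hz)")
plt.ylabel("Power")
plt.show()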
Why Python?
• Python is an easy-to-learn, complete programming language that has rapidly become an important scientific programming and data analysis environment with usage across multiple disciplines.
• Python was originally developed with a philosophy of “easy to read” code incorporating object-oriented, imperative, and functional programming styles.
• Python allows the incorporation of specialized modules based upon low-level code (C/C++) so it can run very fast.
• Python has modules developed specifically for scientific computing and signal processing (NumPy/SciPy).
• Python has well-documented import/export hooks into databases (both SQL and NOSQL) that are key to working with Big Data.
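A small, hedged example of those database hooks: the sketch below writes and reads a tiny table through Python's built-in sqlite3 driver and Pandas. The table and column names are invented for the example.

import sqlite3
import pandas as pd

# Create an in-memory SQLite database and a small example table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vitals (subject TEXT, heart_rate REAL)")
conn.executemany("INSERT INTO vitals VALUES (?, ?)",
                 [("s01", 72.0), ("s02", 88.5), ("s03", 64.2)])
conn.commit()

# Pull the table straight into a Pandas DataFrame for analysis
df = pd.read_sql_query("SELECT * FROM vitals", conn)
print(df.describe())
conn.close()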
Why IPython?
• IPython is an interactive data exploration and visualization shell that supports the inclusion of code, inline text, mathematical expressions, 2D/3D plotting, multimedia, and dynamic widgets.
• IPython is a suite of tools designed to cover scientific workflow from interactive data transformation and analysis to publication.
• The IPython notebook uses a web browser as its display “front end” and provides a rich interactive environment similar to that seen in Mathematica.
• IPython notebooks make it possible to save analysis procedures and output—providing reproducible, curatable data analysis and an easy way to share algorithms/methods.
• IPython supports parallel coding and distributed data analysis to take advantage of cloud/high-performance clusters.
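A minimal sketch of that parallel support, assuming an IPython cluster has already been started (e.g. with `ipcluster start`). This uses the IPython.parallel Client API as it existed around this time; treat it as illustrative rather than definitive.

from IPython.parallel import Client

# Connect to an already-running IPython cluster
rc = Client()
dview = rc[:]  # a view over all available engines

# Distribute a simple computation across the engines
squares = dview.map_sync(lambda x: x ** 2, range(16))
print(squares)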
Python as a data analytics environment
IPython interface
http://ipython.org
Line plots with error bars
import numpy as np
import matplotlib.pyplot as plt

# example data
x = np.arange(0.1, 4, 0.5)
y = np.exp(-x)

plt.errorbar(x, y, xerr=0.2, yerr=0.4)
plt.show()
Heatmaps
import numpy as np
import matplotlib.pyplot as plt

# Generate some test data
x = np.random.randn(8873)
y = np.random.randn(8873)

# Bin the points into a 2D histogram and display it as an image
heatmap, xedges, yedges = np.histogram2d(x, y, bins=50)
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]

plt.clf()
plt.imshow(heatmap, extent=extent)
plt.show()
Scatterplots
import numpy as np
import matplotlib.pyplot as plt

N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = np.pi * (15 * np.random.rand(N))**2  # 0 to 15 point radii

plt.scatter(x, y, s=area, c=colors, alpha=0.5)
plt.show()
3D contour map
from mpl_toolkits.mplot3d import axes3d
import matplotlib.pyplot as plt
from matplotlib import cm

fig = plt.figure()
ax = fig.gca(projection='3d')

# Built-in test surface plus contour projections onto each axis plane
X, Y, Z = axes3d.get_test_data(0.05)
ax.plot_surface(X, Y, Z, rstride=8, cstride=8, alpha=0.3)
cset = ax.contour(X, Y, Z, zdir='z', offset=-100, cmap=cm.coolwarm)
cset = ax.contour(X, Y, Z, zdir='x', offset=-40, cmap=cm.coolwarm)
cset = ax.contour(X, Y, Z, zdir='y', offset=40, cmap=cm.coolwarm)

ax.set_xlabel('X')
ax.set_xlim(-40, 40)
ax.set_ylabel('Y')
ax.set_ylim(-40, 40)
ax.set_zlabel('Z')
ax.set_zlim(-100, 100)
plt.show()
Example: Patient physiology waveforms + EMR
Example: Interrogating sequence data
Summary
• Free/Libre Open-Source software provides a viable “tool stack” for Big Data analytics.
• Python provides a robust, easy-to-use foundation for data analytics.
• IPython provides an easy-to-use interactive front-end for data transformation, analysis, visualization, presentation, and distribution.
• Team-based science depends upon developing a wide range of data analytics skills.
• We have developed a coalition of institutions to serve students who wish to become data scientists.
Coalition Institutions
The coding Queen and her Court…
Abby Dobyns
Princesses of Python
Rhaya Johnson Regie Felix and Adaeze Anyanwu
And a Princeling….
Jamie Tillett
Acknowledgements
Loma Linda
• Traci Marin
• Charles Wang
• Wilson Aruni
• Valery Filippov
UC Riverside
• Thomas Girke (Bioinformatics)
La Sierra University
• Marvin Payne
CSU San Bernardino
• Art Concepcion (Bioinformatics)
UC Irvine
• Alex Nicolau (Comp Sci/Bioinf)
My laboratory’s git repository: https://github.com/drcgw/bass
Further reading
• Doing Data Science by Cathy O’Neil and Rachel Schutt
• Data Analysis with Open-Source Tools by Philipp Janert
• The Art of R Programming by Norman Matloff
• R for Everyone by Jared P. Lander
• Python for Data Analysis by Wes McKinney
• Think Python by Allen B. Downey
• Think Stats by Allen B. Downey
• Think Complexity by Allen B. Downey
• Every one of Edward Tufte’s books (The Visual Display of Quantitative Information, Visual Explanations, Envisioning Information, Beautiful Evidence)
Questions?!