Additional Praise for Strategies in Biomedical Data Science: Driving Force for Innovation
“The allure of data analytics is in knowing what is currently unknowable by
identifying patterns in apparent chaos. If these insights could be applied in the
healthcare field to individualized patient care, the improvement in outcomes
could be profound indeed. This type of research and innovation right here in
Tempe (ASU) demonstrates why ASU is ranked as the number one most inno-
vative university in the nation.
“Industry analysts expect there will be three to four connected Internet of
Things (IoT) devices for every person on the planet by 2020. Healthcare can
and is leading the way in IoT adoption. To prepare for the coming deluge of IoT
data, healthcare IT organizations should be investing in data analytics capabil-
ity to convert that raw data flood into actionable information that delivers bet-
ter healthcare outcomes.”
—Steve Phillips
Senior Vice President and Chief Information Officer, Avnet, Inc.
Twitter: @Avnet
“I think it is really great that Jay Etchings is working on this; the dearth of infor-
mation for dealing with large, complex biomedical data sets makes building sys-
tems capable of supporting precision medicine very challenging. I would say that
we are not yet at the “blueprint” stage, but we certainly can use help in getting the
right people thinking about this, so we can build the recipes going forward. While
true clinical application at scale is still not here, we are rapidly approaching that
event horizon, and as we have learned in biomedical research, the infrastructure
challenges alone require careful planning and very deliberate applications of the
proper technologies to deal with the vast amount of data that is generated. The
algorithms to automate things such as true clinical decision guidance have yet to
be written, and although some approaches such as neuro-linguistic programming
or machine learning look promising, actually creating a “doc in a box” is probably
many years off. This does not mean we should not be striving to move forward as
rapidly as possible, because the impact that can be had on a patient’s life is truly
inspirational and that should always be remembered. This is not building systems
to showcase technology or how smart we are, it is to help propel a truly world
changing methodology of how medicine is practiced.”
—James Lowey, CIO
TGen, The Translational Genomics Research Institute
Twitter: @loweyj, @Tgen
“The journey to precision medicine will require the confluence and analysis of
enormous amounts of data from genomics, clinical and fundamental research,
clinical care, and environmental and lifestyle data, including connected health
data from the “Internet of Medical Things.” The entire healthcare ecosystem
needs to work together, along with the information and communications tech-
nology ecosystems, to collect, transport, analyze, and leverage the vast amount
of data that can be honed to develop insights and recommendations for preci-
sion medicine. The opportunity to improve healthcare is compelling, the data
is vast and will continue to grow, and we need to work together to realize
improved outcomes. We need to build the technology and process-enabled
capabilities to protect the data and the people. The need for increased TIPPSS—
trust, identity, privacy, protection, safety, and security—mechanisms is critical
to the success and safety in our ongoing healthcare journey.”
—Florence D. Hudson
Senior Vice President and Chief Innovation Officer, Internet2
Twitter: @FloInternet2
“In the last decade, the wave of data coming off modern sequencing instru-
ments is transforming bioscience into a digital science. Not only are the data
sets enormous, the need to work through them quickly to have a real-time
impact on therapy is crucial, requiring all of the elements of high-performance
computing: fast compute, storage and networking, sophisticated data manage-
ment, and highly parallel application codes.
The ability to quickly crunch massive amounts of disease and patient data
is at the heart of precision medicine. While much of the promise of precision
medicine is still on the horizon, advances have already led to life-saving treat-
ments for children and adults with lethal cancers and genetic diseases. At the
Center for Pediatric Genomic Medicine (CPGM) at Children’s Mercy Hospital in
Kansas City, MO, researchers used 25 hours of supercomputer time to decode
the genetic variants of an infant suffering from liver failure. Thanks to the fast
genomic diagnosis, doctors were able to proceed with the most effective treat-
ment and the baby is alive and well.”
—Tiffany Trader
Managing Editor, HPCwire
Strategies in Biomedical Data
Science
Wiley & SAS Business Series
The Wiley & SAS Business Series presents books that help senior-level manag-
ers with their critical management decisions.
Titles in the Wiley & SAS Business Series include:
Analytics in a Big Data World: The Essential Guide to Data Science and Its Applica-
tions by Bart Baesens
Bank Fraud: Using Technology to Combat Losses by Revathi Subramanian
Big Data Analytics: Turning Big Data into Big Money by Frank Ohlhorst
Big Data, Big Innovation: Enabling Competitive Differentiation through Business
Analytics by Evan Stubbs
Business Analytics for Customer Intelligence by Gert Laursen
Business Intelligence Applied: Implementing an Effective Information and Commu-
nications Technology Infrastructure by Michael Gendron
Business Intelligence and the Cloud: Strategic Implementation Guide by Michael
S. Gendron
Business Transformation: A Roadmap for Maximizing Organizational Insights by
Aiman Zeid
Connecting Organizational Silos: Taking Knowledge Flow Management to the Next
Level with Social Media by Frank Leistner
Data-Driven Healthcare: How Analytics and BI Are Transforming the Industry by
Laura Madsen
Delivering Business Analytics: Practical Guidelines for Best Practice by Evan Stubbs
Demand-Driven Forecasting: A Structured Approach to Forecasting, Second Edition
by Charles Chase
Demand-Driven Inventory Optimization and Replenishment: Creating a More Effi-
cient Supply Chain by Robert A. Davis
Developing Human Capital: Using Analytics to Plan and Optimize Your Learning
and Development Investments by Gene Pease, Barbara Beresford, and Lew
Walker
Economic and Business Forecasting: Analyzing and Interpreting Econometric Results
by John Silvia, Azhar Iqbal, Kaylyn Swankoski, Sarah Watt, and Sam Bullard
Foreign Currency Financial Reporting from Euros to Yen to Yuan: A Guide to Fun-
damental Concepts and Practical Applications by Robert Rowan
Harness Oil and Gas Big Data with Analytics: Optimize Exploration and Production
with Data Driven Models by Keith Holdaway
Health Analytics: Gaining the Insights to Transform Health Care by Jason Burke
Heuristics in Analytics: A Practical Perspective of What Influences Our Analytical
World by Carlos Andre Reis Pinheiro and Fiona McNeill
Human Capital Analytics: How to Harness the Potential of Your Organization’s
Greatest Asset by Gene Pease, Boyce Byerly, and Jac Fitz-enz
Implement, Improve and Expand Your Statewide Longitudinal Data System: Creat-
ing a Culture of Data in Education by Jamie McQuiggan and Armistead Sapp
Killer Analytics: Top 20 Metrics Missing from Your Balance Sheet by Mark Brown
Predictive Analytics for Human Resources by Jac Fitz-enz and John Mattox II
Predictive Business Analytics: Forward-Looking Capabilities to Improve Business
Performance by Lawrence Maisel and Gary Cokins
Retail Analytics: The Secret Weapon by Emmett Cox
Social Network Analysis in Telecommunications by Carlos Andre Reis Pinheiro
Statistical Thinking: Improving Business Performance, Second Edition by Roger
W. Hoerl and Ronald D. Snee
Style and Statistics: The Art of Retail Analytics by Brittany Bullard
Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams
with Advanced Analytics by Bill Franks
The Analytic Hospitality Executive: Implementing Data Analytics in Hotels and
Casinos by Kelly A. McGuire
The Executive’s Guide to Enterprise Social Media Strategy: How Social Networks
Are Radically Transforming Your Business by David Thomas and Mike Barlow
The Value of Business Analytics: Identifying the Path to Profitability by Evan
Stubbs
The Visual Organization: Data Visualization, Big Data, and the Quest for Better
Decisions by Phil Simon
Too Big to Ignore: The Business Case for Big Data by Phil Simon
Using Big Data Analytics: Turning Big Data into Big Money by Jared Dean
Win with Advanced Business Analytics: Creating Business Value from Your Data by
Jean Paul Isson and Jesse Harriott
For more information on any of the above titles, please visit www.wiley.com.
Strategies in Biomedical Data Science
Driving Force for Innovation
Jay Etchings
Cover image: DNA strand © Don Bishop/Getty Images, Inc.
Cover design: Wiley
Copyright © 2017 by SAS Institute, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Names: Etchings, Jay, 1966– author. | SAS Institute, issuing body.
Title: Strategies in biomedical data science : driving force for innovation / Jay Etchings.
Other titles: Wiley and SAS business series.
Description: Hoboken, New Jersey : John Wiley & Sons, Inc., [2017] | Series: Wiley & SAS business series | Includes bibliographical references and index.
Identifiers: LCCN 2016036794 (print) | LCCN 2016037346 (ebook) | ISBN 978-1-119-23219-3 (hardcover) | ISBN 978-1-119-25597-0 (ePub) | ISBN 978-1-119-25618-2 (ePDF)
Subjects: | MESH: Medical Informatics | Computational Biology—methods | Cybernetics—methods
Classification: LCC R859.7.A78 (print) | LCC R859.7.A78 (ebook) | NLM W 26.5 | DDC 610.285—dc23
LC record available at https://lccn.loc.gov/2016036794
Printed in the United States of America
Contents
Foreword
Acknowledgments
Introduction
    Who Should Read This Book?
    What’s in This Book?
    How to Contact Us
Chapter 1 Healthcare, History, and Heartbreak
    Top Issues in Healthcare
    Data Management
    Biosimilars, Drug Pricing, and Pharmaceutical Compounding
    Promising Areas of Innovation
    Conclusion
    Notes
Chapter 2 Genome Sequencing: Know Thyself, One Base Pair at a Time
    Content contributed by Sheetal Shetty and Jacob Brill
    Challenges of Genomic Analysis
    The Language of Life
    A Brief History of DNA Sequencing
    DNA Sequencing and the Human Genome Project
    Select Tools for Genomic Analysis
    Conclusion
    Notes
Chapter 3 Data Management
    Content contributed by Joe Arnold
    Bits about Data
    Data Types
    Data Security and Compliance
    Data Storage
    SwiftStack
    OpenStack Swift Architecture
    Conclusion
    Notes
Chapter 4 Designing a Data-Ready Network Infrastructure
    Research Networks: A Primer
    ESnet at 30: Evolving toward Exascale and Raising Expectations
    Internet2 Innovation Platform
    Advances in Networking
    InfiniBand and Microsecond Latency
    The Future of High-Performance Fabrics
    Network Function Virtualization
    Software-Defined Networking
    OpenDaylight
    Conclusion
    Notes
Chapter 5 Data-Intensive Compute Infrastructures
    Content contributed by Dijiang Huang, Yuli Deng, Jay Etchings, Zhiyuan Ma, and Guangchun Luo
    Big Data Applications in Health Informatics
    Sources of Big Data in Health Informatics
    Infrastructure for Big Data Analytics
    Fundamental System Properties
    GPU-Accelerated Computing and Biomedical Informatics
    Conclusion
    Notes
Chapter 6 Cloud Computing and Emerging Architectures
    Cloud Basics
    Challenges Facing Cloud Computing Applications in Biomedicine
    Hybrid Campus Clouds
    Research as a Service
    Federated Access Web Portals
    Cluster Homogeneity
    Emerging Architectures (Zeta Architecture)
    Conclusion
    Notes
Chapter 7 Data Science
    NoSQL Approaches to Biomedical Data Science
    Using Splunk for Data Analytics
    Statistical Analysis of Genomic Data with Hadoop
    Extracting and Transforming Genomic Data
    Processing eQTL Data
    Generating Master SNP Files for Cases and Controls
    Generating Gene Expression Files for Cases and Controls
    Cleaning Raw Data Using MapReduce
    Transpose Data Using Python
    Statistical Analysis Using Spark
    Hive Tables with Partitions
    Conclusion
    Notes
Appendix: A Brief Statistics Primer
    Content contributed by Daniel Penaherrera
Chapter 8 Next-Generation Cyberinfrastructures
    Next-Generation Cyber Capability
    NGCC Design and Infrastructure
    Conclusion
    Note
Conclusion
Appendix A The Research Data Management Survey: From Concepts to Practice
    Brandon Mikkelsen and Jay Etchings
Appendix B Central IT and Research Support
    Gregory D. Palmer
Appendix C HPC Working Example: Using Parallelization Programs Such as GNU Parallel and OpenMP with Serial Tools
Appendix D HPC and Hadoop: Bridging HPC to Hadoop
Appendix E Bioinformatics + Docker: Simplifying Bioinformatics Tools Delivery with Docker Containers
Glossary
About the Author
About the Contributors
Index
Foreword
The emergence of data science is radically transforming the biomedical knowl-
edge generation paradigm. While modern biomedicine has been a pioneer in
evidence-based science, its approach for decades has largely followed a well-
worn path of experimental design, data collection, analysis, and interpretation.
Data science introduces an alternative pathway—one that starts with the vast
collections of diverse digital data increasingly accessible to the community.
While the data science evidence generation concept has many birth par-
ents, Jim Gray of Microsoft best described the unique opportunity afforded by
this new paradigm. In a 2007 address to the U.S. National Research Council,
Gray argued: “With an exaflood of unexamined data and teraflops of cheap
computing power, we should be able to make many valuable discoveries
simply by searching all that information for unexpected patterns” [1]. Gray
coined the phrase “data-intensive scientific discovery.” Notably, he broke with
the high-performance computing “high priests” and advocated the adoption
of new models of computing. Following Gray’s untimely death shortly after
his address, his colleagues captured this concept in a collection of essays ulti-
mately published as The Fourth Paradigm: Data-Intensive Scientific Discovery [2].
It was within these essays that the term “big data” was introduced.
“Data science” and “big data” are now overburdened terms with many
meanings. The most useful definitions are operational in nature. One of the
most colorful comes from John Myles of Facebook, who indicates that big data
is any problem “so large that traditional approaches to data analysis are doomed
to failure” [3]. I find the definition of the chief architect of Data.gov, Philip
Ashlock, most elucidating: “Analysis that can help you find patterns, anoma-
lies, or new structures amidst otherwise chaotic or complex data points” [3].
Data science remains controversial in biomedicine. Jeff Drazen, the edi-
tor in chief of the New England Journal of Medicine, has described data science
practitioners as “research parasites” [4]. More subtly, Robert Weinberg openly
questions whether such approaches have any potential to generate real insight
in his article describing an emerging crisis in understanding cancer, “Coming
Full Circle—From Endless Complexity to Simplicity and Back Again” [5].
I have been an eyewitness and co-conspirator in the data science trans-
formation occurring in biomedicine. I grew up with the Human Genome
Project and the rapid accumulation of large volumes of big data it generates.
I have made contributions through the “Discovery Science” paradigm that the
Genome Project made acceptable in biomedicine. For example, with my col-
leagues at the Cooperative Human Linkage Center, we were early adopters of
computational science and the Internet (then NSFnet) in our efforts to con-
struct the map of human inheritance [6]. For us at the time, big data topped
out at a gigabyte! While serving as the founding director of the National Insti-
tutes of Health’s National Cancer Institute’s Center for Biomedical Informatics
and Information Technology, I was tasked with helping bring data science to
the cancer community. The charge was broad—including basic science, clini-
cal research, and health encounter data. It was technologically challenging—
predating many technology paradigms now taken for granted as standard in
data science. Through these pioneering efforts, I experienced the aforemen-
tioned controversial nature of data science and the second of Arthur C. Clarke’s
laws: “The only way of discovering the limits of the possible is to venture a little
way past them into the impossible” [7].
Strategies in Biomedical Data Science is an ambitious attempt to look at “the
limits of the possible” for data science in biomedicine. Unique in its scope, it
takes a comprehensive look at all aspects of data science. Work in the sciences
is routinely compartmentalized and segregated among specialists. This segre-
gation is particularly true in biomedicine as it wrestles with the integration of
data science and its underpinning in information technology. While such spe-
cialization is essential for progress within disciplines, the failure to have cross-
cutting discussions results in lost opportunities. This book is significant in that
it purposely embraces the “transdisciplinary” nature of biomedical data science.
Transdisciplinary research (a foundational aspect of Arizona State University’s
“New American University”) brings together different disciplines to create inno-
vations that are beyond the capacity of any single specialty. Data science is defi-
nitionally transdisciplinary and somewhat ironically is discipline-agnostic.
Strategies in Biomedical Data Science unapologetically mixes biology, analyt-
ics, and information technology. Its transdisciplinary topics cover diverse data
types—genomic, clinical encounter, personal monitoring devices—and the
data science opportunities (and challenges) in each. Within each of these top-
ics, it provides insights into the software capabilities that are used to wrangle
Gray’s “exaflood” of data and to find his “unexpected patterns.” It provides
insightful discussions of the underpinning computational and network infra-
structure necessary to realize the potential of data science. More specifically,
it provides practical blueprints that translate Gray’s suggested alternative to
traditional high-performance computing paradigms into reality. Within each of
these, it provides case studies written by experts that transition the topics from
concept to real-world examples. Importantly, these case studies are provided
by both academics and industry sources, demonstrating the importance of both
to the biomedical data science progress as well as the need to blend these often-
adversarial communities.
I have had the opportunity to know the author, Jay Etchings, for over three
years. Jay is a true computational renaissance man, as reflected in the breadth
of topics facilely presented in Strategies in Biomedical Data Science. I was first intro-
duced to Jay when he was an architect for Dell. Jay translated ASU’s vision for
a first-generation, purpose-built data science research platform into the opera-
tional Next Generation Cyber Capability (NGCC) described in the book. The
NGCC is a physical instantiation of what Gray envisioned. Now at ASU as the
director of Research Computing Operations, Jay and his team deliver biomedi-
cal data science to a diverse collection of international scientists.
Jay brings a fresh perspective and a diverse pedigree of work experiences
to biomedical data science. He has been at the forefront of developing and
deploying big data capabilities throughout his career. For example, Jay was on
the leading edge in bringing big data infrastructure to the gaming industry—
a community that is always an early adopter of breakthrough technology. Jay
has hands-on experience in the complexities of biomedical data from his efforts
to provide support for the Centers for Medicare and Medicaid Services. Jay’s
commercial background brings with it a can-do approach to problems and a
low tolerance for the arcane consternation that often paralyzes academics.
This fresh perspective and his enthusiasm for biomedicine pervade his writing.
Strategies in Biomedical Data Science is a one-stop shop of data science essentials
and is likely to serve as the go-to resource for years to come.
Ken Buetow, Ph.D.
Professor, Arizona State University
Director, Computational Science and Informatics Core Program
Director, Complex Adaptive Systems Initiative
Notes
1. David Snyder. 2016. “The Big Picture of Big Data—IEEE—The Institute.” http://theinstitute.ieee.org/ieee-roundup/members/achievements/the-big-picture-of-big-data.
2. Anthony J. G. Hey, ed. 2009. The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond, WA: Microsoft Research.
3. Jennifer Dutcher. 2014. “What Is Big Data?” September 3. https://datascience.berkeley.edu/what-is-big-data/.
4. Dan L. Longo and Jeffrey M. Drazen. 2016. “Data Sharing.” New England Journal of Medicine 374, no. 3: 276–277. doi: 10.1056/NEJMe1516564.
5. Robert A. Weinberg. 2014. “Coming Full Circle—From Endless Complexity to Simplicity and Back Again.” Cell 157, no. 1: 267–271. doi: 10.1016/j.cell.2014.03.004.
6. J. C. Murray, K. H. Buetow, J. L. Weber, S. Ludwigsen, T. Scherpbier-Heddema, F. Manion, et al. 1994. “A Comprehensive Human Linkage Map with Centimorgan Density. Cooperative Human Linkage Center (CHLC).” Science 265, no. 5181: 2049–2054.
7. Arthur C. Clarke. 1962. “Hazards of Prophecy: The Failure of Imagination.” In Profiles of the Future: An Inquiry into the Limits of the Possible. New York: Harper & Row.
Acknowledgments
Most broadly, this book has been inspired by the need for a collaborative and
multidisciplinary approach to solving the intricate puzzle that is cancer. Cancer
poses a complex adaptive challenge that reaches across all domains: medicine,
biology, technology, and the social sciences. Transdisciplinary collaboration is
the only true path to the future. Ubiquitous research computing in support of
“open science” and open big data has an essential role to play in this collabora-
tive process.
More specifically, this book is dedicated to Sue Stigler and the family she
leaves behind. Her three-and-a-half-year battle with cancer came to a close on
December 7, 2015. Sue’s kindness and devotion, and her endless support for
others even while ill, were remarkable; her selflessness will always be remem-
bered. If you would like to donate to the Stigler family college fund, please visit
their GoFundMe page, https://www.gofundme.com/bpebavas.
Author proceeds support childhood brain cancer research through an ASU
Foundation account supporting Dr. Joshua LaBaer’s work in the Biodesign
Institute. Dr. LaBaer is conducting cutting-edge research on pediatric low-grade
astrocytomas (PLGAs), which are the most common cancers of the brain in
children.
In the research and discovery leading to this book, I have worked with
more amazing and committed individuals than I could have ever imagined.
My mentor and friend Ken Buetow is fond of saying, “If you’re the smartest
person in the room, you are in the wrong room.” Time and again I have been
in the right room. I am able to count some of the smartest people on the planet
as colleagues and friends. Publication of this book was made a reality by their
support and example.
A very special thanks to my good friend Phil Simon for convincing me to
put thoughts, concepts, and theory on paper and share it with the world.
At Arizona State University I would like to thank Gordon Wishon, Dr. Elizabeth
Cantwell, and Dr. Sethuraman Panchanathan (“Panch”) for giving me the opportu-
nity to drive innovation at the university.
I would also like to recognize the dedication of our Research Computing
team at Arizona State University for the continued commitment to our “com-
mander’s intent” and to Christopher Myhill for sharing the commander’s intent
with me while at Dell Enterprise.
Tremendous thanks to the teamwork of Jon McNally, Johnathan “Jr.” Lee,
Lee Reynolds, Ram Polur, Daniel Penaherrera, Sheetal Shetty, James Napier,
Tiffany Marin, Deborah Whitten, Curtis Thompson, Srinivasa Mathkur, Marisa
Brazil, and of course Carol Schumacher, arguably the best administrative assis-
tant alive. Special thanks also to Wendy “DigDug” Cegielski for her editing
hours and continued motivation; next year you will be Dr. Wendy.
In no specific order I also would like to thank this list of super-smart
and generous folks as well as our many terrific and invaluable partners: Nim-
bleStorage, Brocade, Internet2, ESNET, Penguin Computing, TGEN, SwiftStack,
MarkLogic, the Open Daylight Foundation, the Linux Foundation, Open Net-
working Foundation, IT Partners, friends at University of Arizona, Northern
Arizona University, Dell Enterprise, University of Massachusetts-Lowell, Baylor
University, Washington State University, Georgia Tech, Broad Institute of MIT
and Harvard, University of Nevada Las Vegas, and the College of Southern Nevada
(formerly CCSN), and thanks for the support and mentorship from domain
professionals both public and private like Mark “Pup” Roberts, Brandon Mikkelsen,
Sean Dudley, Joel Dudley, James Lowey, Todd Decker, Jeff Creighton, Jim Scott,
Gregory Palmer, Neela Jacques, Al Ritacco, and of course my engineer stepbrother
Pedro Victor Gomes.
Last but certainly not least, I would like to recognize my awesome team of
Jacob, Dixon, and Annika for their enduring patience throughout the
never-ending collection of the data and experience that make up this text.
Heather, though you have departed from my arms, there is always a place
for you in my heart.
Introduction
Never let the future disturb you.
You will meet it, if you have to, with the same weapons of reason which today arm you against the present.
—Marcus Aurelius
Some time ago, while I was engaged as a consultant, it became painfully
obvious that the approaches to healthcare data management and overall
infrastructure architecture were stuck in the Stone Age. While data and
information technology (IT) professionals sprinted to remain on the cutting
edge of top tech trends, much of the healthcare system remained a techni-
cal backwater. The many explanations for this include compliance controls,
challenges associated with the rapid proliferation of data, and reliance on old
systems with proprietary code where porting was more painful than the day-
to-day operations. This state of affairs has been frustrating for all involved. But
beyond the very real frustrations, there are far more important negative impacts.
Technical inefficiencies increase costs, lead to a loss of research productivity, and
hurt clinical outcomes. In other words, everyone suffers. When I talk to people
about data management and IT support within the healthcare field, a recurring
theme is that much is “lost in translation” between the various stakeholders:
IT professionals, researchers, doctors, clinicians, and administrators.
Over the past 20 years, much of my time has been spent in medical and
technical fields. I have held positions with two large insurance payer providers
and have worked with the Centers for Medicare & Medicaid Services (CMS)
as a recovery audit contractor. I have even worked clinically as an emergency
medical technician with a strong background in exercise physiology. Seeking
greater challenges led me to Las Vegas, Nevada, where I was fortunate to work
on the first cloud-enabled centrally deterministic (Class 2) gaming systems for
the state lottery. This was well before the term “cloud” had even arrived. At the
close of the project, I returned to the medical field, joining a Fortune 50 payer
provider ingesting targeted acquisitions.
My wide-ranging work experiences have shown me that medical and
research professionals are usually not technology experts, and most do not
desire to be. At the same time, computer scientists and infrastructure experts
are not biologists, doctors, or researchers. This longtime disconnect paves the
way for high-paid consultants to act as intermediaries brought in to work
between IT and biomedical staff.
Not surprisingly, this does not work terribly well, nor does it best serve
the medical and research communities. Consultants typically demand high com-
pensation and often are not able to perform the sort of knowledge transfer nec-
essary to make a meaningful and sustainable impact. There are many different
permutations and possible explanations for this. But, in the end, I think it is at
heart a failure to adequately translate or bridge biomedicine and IT.
The primary motivation for this book is to begin to create a sustainable and
readily accessible bridge between IT and data technologists, on one hand, and the
community of clinicians, researchers, and academics who deliver and advance
healthcare, on the other hand. This book is thus a translational text that will hope-
fully work both ways. It can help IT staff learn more about clinical and research
needs within biomedicine. It also can help doctors and researchers learn more
about data and other technical tools that are potentially at their disposal.
My experience in healthcare has shown me that both IT professionals and
biologists tend to become isolated or siloed in their professional worlds. This
isolation hurts us all: IT staff, biologists, doctors, and patients alike. This is not
to suggest that IT staff and data managers should get master’s degrees in biol-
ogy or epidemiology. Rather, I am suggesting that as IT staff and data managers
learn more about the biomedical context of their work, they will be able to
work better and more efficiently. Furthermore, as biomedicine becomes ever
more dependent on computing and big data, there is more and more domain-
specific technical knowledge to assimilate.
As IT and biomedicine innovate with increasing rapidity, I predict that
we will see more and more hybrid job titles, such as health technologist and
bioinformatician. In order to stay current, both IT professionals and biomedi-
cal professionals will need to become less isolated. This book begins to bring
together these two fields that are so dependent on each other and have so
much to offer each other. It is my sincere hope that this work will narrow the
gap between those engaged in use-inspired research and those supporting that
research from an infrastructure delivery perspective.
In the interest of creating as accessible a bridge text as possible between
IT staff and biomedical personnel, this book is relatively nontechnical. For the
most part, the aim is to offer a conceptual introduction to key topics in data
management for the biomedical sciences. While a certain familiarity with IT,
networking, and applications is assumed, you will find very little in the way of
code examples. The goal is to equip you with some foundational concepts that
will leave you prepared to seek out whatever additional information you and
your institution might need.
I have worked in IT for over 20 years, but I am most inspired by how com-
puting technologies can be used to solve human problems. I certainly appreci-
ate elegant code and innovative technical solutions. But at the end of the day,
it is the prospect of improving patient outcomes that keeps me engaged and
driven to learn and continually extend the boundaries of the possible. One
area of biomedical research that I find particularly inspiring is the potential to
use targeted therapies to more effectively treat pediatric low-grade astrocyto-
mas (PLGAs). PLGAs are by far the most common cancer of the brain among
children. They are often fatal, and current chemotherapies frequently have
lifelong side effects, including neurocognitive impairment. Dr. Joshua LaBaer,
interim director of the Biodesign Institute at Arizona State University, is work-
ing to develop effective targeted therapies that reduce harmful effects on nor-
mal cells. Proceeds from this book support the ASU Research Foundation and
the work of Dr. Joshua LaBaer, Director, The Biodesign Institute, Personalized
Diagnostics and Virginia G. Piper Chair in Personalized Medicine.
In reflecting on the important roles to be played by humans and by
computing, I am reminded of a frequently cited quote by Leo M. Cherne, an
American economist and public servant, that is often inaccurately attributed to
Albert Einstein: “The computer is incredibly fast, accurate, and stupid. Man is
unbelievably slow, inaccurate, and brilliant. The marriage of the two is a force
beyond calculation.” As our capabilities to gather, analyze, and archive data
dramatically improve, computing is likely to be increasingly valuable to bio-
medical research and clinical medicine. Yet let us always remember the need
for humans, slow and inaccurate as we usually are.
Who Should Read This Book?
Strategies in Biomedical Data Science is designed to help anyone who works with
biomedical data. This certainly includes IT staff and systems administrators.
These readers will hopefully gain a deeper understanding of particular chal-
lenges and solutions for biomedical data management. The target audience also
includes bioscience researchers and clinical staff. While persons in these roles
are not typically directly responsible for data management, they are most cer-
tainly concerned with and affected by how data is created, used, and archived.
I hope these readers will gain a deeper understanding of how IT staff tend
to approach systems architecture and data management. Quite frequently we
focus on academic and other public research institutions. Such institutions
are tremendously important for cutting-edge research and collaboration.
Most of the best practices and scenarios presented in the book are, however,
equally applicable to private-sector use cases.
All readers are welcome to work through this book in whatever order best
suits their particular interests and needs.
What’s in This Book?
Strategies in Biomedical Data Science offers a relatively high-level introduction
to the cutting-edge and rapidly changing field of biomedical data. It provides
biomedical IT professionals with much-needed guidance toward managing the
increasing deluge of healthcare data. This book demonstrates ways in which
both technological development and more effective use of current resources
can better serve both patient and payer. The discussion explores the aggregation
of disparate data sources, current analytics and tool sets, the growing necessity
of smart bioinformatics, and more as data science and biomedical science grow
increasingly intertwined. Real-world use cases and clear examples are featured
throughout, and coverage of data sources, problems, and potential mitigation
provides necessary insight for forward-looking healthcare professionals.
The book begins with an overview of current technical challenges in health-
care and then moves into topics in biomedical data management, including
network infrastructure, compute infrastructure, cloud architecture, and finally
next-generation cyberinfrastructures.
Many of the chapters include use cases and/or case studies. Use cases examine
a general scenario and typically focus on one application or technology.
Case studies are more particular examinations of how a company or institution
has used an application or technology to meet an operational need. One of
our objectives is to shine a light into the black box that is the emerging realm
of precision medicine. Much of the case study data has been compiled over
the past few years and has been updated to include as much current data as
available. Please be aware that some case study materials have been anony-
mized at the request of the institution providing the information. Case studies
appear after chapters, while use cases are presented within the chapters.
Strategies in Biomedical Data Science has benefited tremendously from the
many wonderful experts who have generously contributed content. Contribu-
tors are acknowledged throughout the book, alongside their contributions, and
you can find their biographies in the “About the Contributors” section.
Chapter 1, “Healthcare, History, and Heartbreak,” examines some of the
current top issues in healthcare that pertain to data and IT. There are great
challenges but also tremendous opportunities for innovation in IT and data
science. Chapter 1 also presents some promising areas for innovation, includ-
ing the Internet of Things, cloud computing, and dramatic advances in data
storage. Chapter 2, “Genome Sequencing,” recaps the remarkable history of
how scientists deciphered the central dogma, the deceptively simple model
that explains the molecular basis of biological life. We then review the his-
tory of genomic sequencing from its origins to next-generation sequencing
(NGS) and recount its startling price drop. Perhaps most important, we sur-
vey some common genomics tools and resources for analyzing and working
with genomics data in silico. Following this chapter you will find a case
study presenting a dramatic example of exome sequencing leading to clinical
diagnosis.
Chapter 3, “Data Management,” explores challenges and solutions for man-
aging large quantities of biomedical data. The chapter begins with an overview
of different types of data and moves on to issues of security and compliance
in biomedical research. We offer a general research data life cycle to help you
plan and anticipate potential problems. Particular storage technologies covered
include iRODS, OpenStack Swift, SwiftStack, and NimbleStorage, a perfor-
mance storage array. Following this chapter you will find three case studies.
The first considers the data demands of genetic sequencing. The second offers
specifications for HudsonAlpha’s SwiftStack storage cluster. The third focuses
on the use of NimbleStorage’s predictive flash storage at ASU.
Chapter 4, “Designing a Data-Ready Network Infrastructure,” offers a brief
history of computer networking before examining research networks and some
advances in networking. We also share a model that can be used to deliver
secure and regulated data storage and services so that institutions can comply
with security standards. Networking advances discussed include InfiniBand,
a computer-networking communications standard used in high-performance
computing, which features very high throughput and very low latency; net-
work function virtualization (NFV); and software-defined networking (SDN).
The bulk of this chapter is a detailed guide to OpenDaylight, an open source
SDN platform.
Chapter 5, “Data-Intensive Compute Infrastructures,” is all about big data.
It starts with a brief survey of the current state of big data efforts in healthcare
and biomedicine. We consider big data applications as well as data sources.
From there we dive into infrastructure for big data analytics, first examining
service-oriented architecture and cloud computing. We then focus on hierarchical
system structures and discuss the following layers: sensing, data storage
and management, data computing, and application services. We end by
presenting graphics processing unit (GPU) accelerated
computing. Following the chapter you will find two case studies. The first
reports on how computational modeling and scientific computing can model
treatment options for vascular disease. The second presents how GPUs were used
to model the molecular dynamics of antibiotic resistance.
Chapter 6, “Cloud Computing and Emerging Architectures,” begins with an
overview of cloud computing, including service and deployment models as well
as challenges. After this we examine Research as a Service (RaaS) and cluster
homogeneity, key components of some versions of cloud computing, and we
also consider federated access. The second half of the chapter dives into Zeta
Architecture, an emergent architecture that is used by Google and that offers
better hardware utilization, fewer moving parts, and greater responsiveness
and flexibility. Zeta and other emerging architectures are catalyzed by limi-
tations on one-size-fits-all enterprise architectures. Following this chapter is
a case study on using on-demand computing for biomedical research on
ventricular tachycardia.
Chapter 7, “Data Science,” focuses on the tools and techniques demanded
by this exciting and rapidly growing field. First we examine some basic statisti-
cal concepts as these are the foundation of much data science. From there we
explore some NoSQL database offerings and Splunk, and offer a detailed exam-
ple of genomic analysis (eQTL), which entails Apache Spark and Hive tables.
Following this chapter you will find two case studies: one on UC Irvine Health’s
Hortonworks Data Platform and the second on subclonal variations and the
computing and data science strategies used to study these.
Chapter 8, “Next-Generation Cyberinfrastructures,” brings together many
of the central strands of this book. It reports on the Next-Generation Cyber
Capability (NGCC), which is Arizona State University’s approach to meeting
compute and data needs for its research community and key collaborators.
Following this chapter is a case study on one of the first NGCC projects, the
National Biomarker Development Alliance.
A brief conclusion reviews the book’s goals and invites feedback and
suggestions.
In addition to the case studies, Strategies in Biomedical Data Science contains
five appendixes and a glossary.
Appendix A reports on a survey about research management. Appendix
B reports on a survey about the current state and desired capabilities for IT
resources at research universities. Appendix C offers some high-performance
computing working examples. Appendix D details how to bridge high-
performance computing to Hadoop. Finally, Appendix E discusses using Docker
for bioinformatics.
Thanks for reading!
Should this book inspire the reader to dig deeper into research computing
or the research itself, we will consider it a win. If you find this book to be of
little value, please leave it on your next flight, bus ride, or at a homeless shelter
for some other reader to find and take to their next job interview.
How to Contact Us
As you use this book and work with biomedical data, we welcome your com-
ments and feedback. In the hybrid and rapidly evolving field of biomedical
data, collaboration and exchange are truly essential. We hope there will be a
second edition of this book, and I would value comments and feedback to help
improve this material.
You can reach Jay at [email protected] or [email protected].
Chapter 1: Healthcare, History, and Heartbreak
Over the past decade, we have unlocked many of the mysteries about DNA and RNA. This knowledge isn’t just sitting in books on the shelf nor is it confined to the workbenches of laboratories. We have used these research findings to pinpoint the causes of many diseases. Moreover, scientists have translated this genetic knowledge into several treatments and therapies prompting a bridge between the laboratory bench and the patient’s bedside.
—Barack Obama on the Genomics and Personalized Medicine Act (S. 976), March 23, 2007
While we are surely poised to continue to make tremendous medical
advances—notably in personalized medicine, pharmacogenomics, and
precision medicine—we are also facing substantial challenges. The chal-
lenges facing healthcare today are many, and if we do not adequately address
them we risk missing opportunities, pushing the cost of care up, and slowing the
pace of biomedical innovation. In briefly surveying the state of healthcare, it is not
my intention to offer a political diagnosis or solution. Rather, it is my intention
to use our current technical knowledge to point the way to practical solutions.
For example, a long-theorized solution to health records management would be
a single cloud-based system where healthcare information sharing exists univer-
sally. But if I were to present this as the best technical solution, it would not be my
intention to also advocate for a shift to a single-payer healthcare system. As much
as possible this book and the discussions in this chapter aim to avoid politics.
After decades of technological lag, biomedicine has started to embrace new
technologies with increasing rapidity. Next-generation sequencing, mobile tech-
nologies, wearable sensors, three-dimensional medical imaging, and advances in
analytic software now make it possible to capture vast amounts of information.
Yet we still struggle with the collection, management, security, and thoughtful
interpretation of all this information. At the same time, healthcare is changing
quickly as the field grapples with new technologies and is transformed by merg-
ers and new partnerships. As a complex adaptive system, healthcare is more
than the sum of its parts, and it is always difficult to predict the future. But we
do know that as the post–Affordable Care Act healthcare landscape takes shape,
the industry is shifting toward digitally enabled, consumer-focused care models.
Given these trends, technology will be granted many opportunities to improve
patient care.
At the outset of this book it is worth surveying some of the top issues in
healthcare. For many of you, these will be quite familiar. Whether you’re an
expert or not, you should feel free to skip ahead if you like. But it is my sin-
cere hope that the background material will be of real value in bridging the
gap between healthcare and biomedicine, on the one hand, and information
technology (IT) and data management, on the other. Just as doctors in an age
of increasing specialization can benefit from attending to the whole patient, it
is very valuable for IT staff to have a more holistic and systemic understanding
of healthcare.
Top Issues in Healthcare
There are many, many sources that comment on the state of healthcare and
biomedicine more broadly. Although I worked as a contractor for two of
the country’s largest Medicare/Medicaid contract holders, I am not a policy
expert. But I have come to appreciate the importance of taking in the bigger
picture. My admittedly incomplete survey of top healthcare issues is drawn
from PwC’s Top Health Industry Issues of 2016 and PwC’s Top Health Industry
Issues of 2015 [1]. These two brief reports offer compelling syntheses and analy-
ses of current trends. In rereading these reports and reflecting on my own
experiences in the field, I was struck by the number of top issues that are
substantially or in part data or IT issues. Many of the top healthcare issues
are centrally concerned with the storage, security, sharing, and analysis of
data. In other words, IT and data management will be called on to make major
contributions to advancing the dynamic healthcare field. Next I explore nine
key issues impacting healthcare.
Mergers and Partnerships
As the health sector continues to change in response to the Affordable Care Act
(2010), we are seeing many mergers and partnerships. “The ACA’s emphasis on
value and outcomes has sent ripples through the $3.2 trillion health sector, spread-
ing and shifting risk in its wake. At the same time, capital is inexpensive, thanks
to sustained low interest rates. Industry’s response? Go big” [2]. Mergers between
large insurance providers are consolidating the insurance market. In 2015, the
second largest U.S. insurer, Anthem, made a $48.4 billion offer for health and life
insurance provider Cigna. Mergers have also been common in the pharmaceutical
field, including Pfizer’s whopping $160 billion deal for specialty pharmaceutical
star Allergan. While these deals are still awaiting regulatory approval, 2016 and
2017 will likely see more mergers and acquisitions. Many new partnerships are
also being formed between pharmaceutical, life sciences, software, pharmacy,
healthcare provider, and engineering companies, among others.
Mergers, acquisitions, and partnerships are driven by a number of larger
market forces. Sometimes predicted lower IT or data costs drive consolida-
tion. More often it is simply that IT and data will need to be able to respond
nimbly to these changes. One of the largest challenges is postacquisition data
management.
Many providers in the healthcare space have grown through organic means
and have survived on shoestring budgets. When compliance moved to the fore-
front, many chief information officers were granted grace periods to meet com-
pliance and conducted internal audits, patching together existing components
to meet the objectives. This expenditure diverted funds away from
infrastructure improvements. As a result, many legacy systems were kept in
service, leaving organizations with out-of-date, proprietary, inflexible systems
that were simply not designed to interoperate on a larger scale. Now when a
smaller provider, which potentially maintains a large collection of
Medicare/Medicaid accounts, is acquired by a larger
entity, the most significant challenge is the integration of those legacy systems
without impacting operational activities. The challenge of migrating years of
patient data records into a system from an out-of-date platform encumbered
by complex and tangled spaghetti code and created by a resource long since
departed is substantial. The need to do so while maintaining business conti-
nuity drives many a large entity to maintain the down-level system for years
following the acquisition.
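To make the migration challenge a little more concrete, the following is a minimal Python sketch of one common pattern: convert each legacy row into the target schema, and set aside any row that cannot be converted safely so that it can be reviewed by hand while the rest of the data moves. Everything specific here is an assumption made up for illustration, including the legacy column names (PAT_NO, LNAME, DOB, PAYER), the MMDDYYYY date format, and the PatientRecord target schema. None of it is drawn from a particular vendor system.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass
class PatientRecord:
    """Target-system record. The fields are hypothetical."""
    patient_id: str
    family_name: str
    birth_date: date
    payer: Optional[str] = None


def convert_legacy_row(row: dict) -> PatientRecord:
    """Map one legacy row (hypothetical column names) to the target schema.

    Raises KeyError or ValueError when the row cannot be converted safely.
    """
    patient_id = row["PAT_NO"].strip()
    if not patient_id:
        raise ValueError("missing patient identifier")
    # Assumption: the legacy platform stored dates as MMDDYYYY text.
    birth_date = datetime.strptime(row["DOB"], "%m%d%Y").date()
    # Treat an absent or blank payer as unknown rather than as an error.
    payer = row.get("PAYER", "").strip() or None
    family_name = row["LNAME"].strip().title()
    return PatientRecord(patient_id, family_name, birth_date, payer)


def migrate(rows):
    """Convert what can be converted and quarantine the rest for review."""
    migrated, quarantined = [], []
    for row in rows:
        try:
            migrated.append(convert_legacy_row(row))
        except (KeyError, ValueError) as exc:
            # Keep the original row and the reason so nothing is silently lost.
            quarantined.append((row, str(exc)))
    return migrated, quarantined
```

Quarantining failed rows rather than halting the whole run is one way to keep operations moving during an integration, at the cost of maintaining a manual review queue. A real migration would also need auditing, reconciliation counts, and compliance controls that this sketch leaves out.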
Cybersecurity and Data Security
As more and more patient data is stored and shared, security is an increasing
concern. Patient data typically contains individualized information. If that data
is stolen, the risks of identity theft are substantial, and there exists a thriving
black market for stolen health records. Data security breaches are relatively
common. “During the summer of 2014, more than 5 million patients had their
personal data compromised” [1]. These breaches are often costly for compa-
nies. Medical devices themselves can also be hacked. For example, in 2015 the
government warned that “an infusion pump . . . could be modified to deliver a
fatal dose of medication” [2].
The needs for elastic scalability, rapid provisioning, resource orchestration,
high availability, and storage efficiency have contributed to the explosion in
cloud providers and niche service offerings. However, this explosion has also
opened holes in known security elements that were once sealed. Cloud security
challenges can range from the innocuous VM sprawl, where virtual machines
are orphaned in an on/off state and fall outside of the domain security policy
for things as basic as patching and maintenance [3]. On the other end of the