Additional Praise for Strategies in Biomedical Data Science: Driving Force for Innovation

“The allure of data analytics is in knowing what is currently unknowable by identifying patterns in apparent chaos. If these insights could be applied in the healthcare field to individualized patient care, the improvement in outcomes could be profound indeed. This type of research and innovation right here in Tempe (ASU) demonstrates why ASU is ranked as the number one most innovative university in the nation.

“Industry analysts expect there will be three to four connected Internet of Things (IoT) devices for every person on the planet by 2020. Healthcare can and is leading the way in IoT adoption. To prepare for the coming deluge of IoT data, healthcare IT organizations should be investing in data analytics capability to convert that raw data flood into actionable information that delivers better healthcare outcomes.”

—Steve Phillips

Senior Vice President and Chief Information Officer, Avnet, Inc.

Twitter: @Avnet

“I think it is really great that Jay Etchings is working on this; the dearth of information for dealing with large, complex biomedical data sets makes building systems capable of supporting precision medicine very challenging. I would say that we are not yet at the “blueprint” stage, but we certainly can use help in getting the right people thinking about this, so we can build the recipes going forward. While true clinical application at scale is still not here, we are rapidly approaching that event horizon, and as we have learned in biomedical research, the infrastructure challenges alone require careful planning and very deliberate applications of the proper technologies to deal with the vast amount of data that is generated. The algorithms to automate things such as true clinical decision guidance have yet to be written, and although some approaches such as neuro-linguistic programming or machine learning look promising, actually creating a “doc in a box” is probably many years off. This does not mean we should not be striving to move forward as rapidly as possible, because the impact that can be had on a patient’s life is truly inspirational and that should always be remembered. This is not building systems to showcase technology or how smart we are, it is to help propel a truly world changing methodology of how medicine is practiced.”

—James Lowey, CIO

TGen, The Translational Genomics Research Institute

Twitter: @loweyj, @Tgen

“The journey to precision medicine will require the confluence and analysis of enormous amounts of data from genomics, clinical and fundamental research, clinical care, and environmental and lifestyle data, including connected health data from the “Internet of Medical Things.” The entire healthcare ecosystem needs to work together, along with the information and communications technology ecosystems, to collect, transport, analyze, and leverage the vast amount of data that can be honed to develop insights and recommendations for precision medicine. The opportunity to improve healthcare is compelling, the data is vast and will continue to grow, and we need to work together to realize improved outcomes. We need to build the technology and process-enabled capabilities to protect the data and the people. The need for increased TIPPSS—trust, identity, privacy, protection, safety, and security—mechanisms is critical to the success and safety in our ongoing healthcare journey.”

—Florence D. Hudson

Senior Vice President and Chief Innovation Officer, Internet2

Twitter: @FloInternet2

“In the last decade, the wave of data coming off modern sequencing instruments is transforming bioscience into a digital science. Not only are the data sets enormous, the need to work through them quickly to have a real-time impact on therapy is crucial, requiring all of the elements of high-performance computing: fast compute, storage and networking, sophisticated data management, and highly parallel application codes.

The ability to quickly crunch massive amounts of disease and patient data is at the heart of precision medicine. While much of the promise of precision medicine is still on the horizon, advances have already led to life-saving treatments for children and adults with lethal cancers and genetic diseases. At the Center for Pediatric Genomic Medicine (CPGM) at Children’s Mercy Hospital in Kansas City, MO, researchers used 25 hours of supercomputer time to decode the genetic variants of an infant suffering from liver failure. Thanks to the fast genomic diagnosis, doctors were able to proceed with the most effective treatment and the baby is alive and well.”

—Tiffany Trader

Managing Editor, HPCwire

Strategies in Biomedical Data Science

Wiley & SAS Business Series

The Wiley & SAS Business Series presents books that help senior-level managers with their critical management decisions.

Titles in the Wiley & SAS Business Series include:

Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications by Bart Baesens

Bank Fraud: Using Technology to Combat Losses by Revathi Subramanian

Big Data Analytics: Turning Big Data into Big Money by Frank Ohlhorst

Big Data, Big Innovation: Enabling Competitive Differentiation through Business Analytics by Evan Stubbs

Business Analytics for Customer Intelligence by Gert Laursen

Business Intelligence Applied: Implementing an Effective Information and Communications Technology Infrastructure by Michael Gendron

Business Intelligence and the Cloud: Strategic Implementation Guide by Michael S. Gendron

Business Transformation: A Roadmap for Maximizing Organizational Insights by Aiman Zeid

Connecting Organizational Silos: Taking Knowledge Flow Management to the Next Level with Social Media by Frank Leistner

Data-Driven Healthcare: How Analytics and BI Are Transforming the Industry by Laura Madsen

Delivering Business Analytics: Practical Guidelines for Best Practice by Evan Stubbs

Demand-Driven Forecasting: A Structured Approach to Forecasting, Second Edition by Charles Chase

Demand-Driven Inventory Optimization and Replenishment: Creating a More Efficient Supply Chain by Robert A. Davis

Developing Human Capital: Using Analytics to Plan and Optimize Your Learning and Development Investments by Gene Pease, Barbara Beresford, and Lew Walker

Economic and Business Forecasting: Analyzing and Interpreting Econometric Results by John Silvia, Azhar Iqbal, Kaylyn Swankoski, Sarah Watt, and Sam Bullard

Foreign Currency Financial Reporting from Euros to Yen to Yuan: A Guide to Fundamental Concepts and Practical Applications by Robert Rowan

Harness Oil and Gas Big Data with Analytics: Optimize Exploration and Production with Data Driven Models by Keith Holdaway

Health Analytics: Gaining the Insights to Transform Health Care by Jason Burke

Heuristics in Analytics: A Practical Perspective of What Influences Our Analytical World by Carlos Andre Reis Pinheiro and Fiona McNeill

Human Capital Analytics: How to Harness the Potential of Your Organization’s Greatest Asset by Gene Pease, Boyce Byerly, and Jac Fitz-enz

Implement, Improve and Expand Your Statewide Longitudinal Data System: Creating a Culture of Data in Education by Jamie McQuiggan and Armistead Sapp

Killer Analytics: Top 20 Metrics Missing from Your Balance Sheet by Mark Brown

Predictive Analytics for Human Resources by Jac Fitz-enz and John Mattox II

Predictive Business Analytics: Forward-Looking Capabilities to Improve Business Performance by Lawrence Maisel and Gary Cokins

Retail Analytics: The Secret Weapon by Emmett Cox

Social Network Analysis in Telecommunications by Carlos Andre Reis Pinheiro

Statistical Thinking: Improving Business Performance, Second Edition by Roger W. Hoerl and Ronald D. Snee

Style and Statistics: The Art of Retail Analytics by Brittany Bullard

Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics by Bill Franks

The Analytic Hospitality Executive: Implementing Data Analytics in Hotels and Casinos by Kelly A. McGuire

The Executive’s Guide to Enterprise Social Media Strategy: How Social Networks Are Radically Transforming Your Business by David Thomas and Mike Barlow

The Value of Business Analytics: Identifying the Path to Profitability by Evan Stubbs

The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions by Phil Simon

Too Big to Ignore: The Business Case for Big Data by Phil Simon

Using Big Data Analytics: Turning Big Data into Big Money by Jared Dean

Win with Advanced Business Analytics: Creating Business Value from Your Data by Jean Paul Isson and Jesse Harriott

For more information on any of the above titles, please visit www.wiley.com.

Strategies in Biomedical Data Science

Driving Force for Innovation

Jay Etchings

Cover image: DNA strand © Don Bishop/Getty Images, Inc.

Cover design: Wiley

Copyright © 2017 by SAS Institute, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Names: Etchings, Jay, 1966– author. | SAS Institute, issuing body.
Title: Strategies in biomedical data science : driving force for innovation / Jay Etchings.
Other titles: Wiley and SAS business series.
Description: Hoboken, New Jersey : John Wiley & Sons, Inc., [2017] | Series: Wiley & SAS business series | Includes bibliographical references and index.
Identifiers: LCCN 2016036794 (print) | LCCN 2016037346 (ebook) | ISBN 978-1-119-23219-3 (hardcover) | ISBN 978-1-119-25597-0 (ePub) | ISBN 978-1-119-25618-2 (ePDF)
Subjects: | MESH: Medical Informatics | Computational Biology—methods | Cybernetics—methods
Classification: LCC R859.7.A78 (print) | LCC R859.7.A78 (ebook) | NLM W 26.5 | DDC 610.285—dc23
LC record available at https://lccn.loc.gov/2016036794

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

Contents

Foreword

Acknowledgments

Introduction
  Who Should Read This Book?
  What’s in This Book?
  How to Contact Us

Chapter 1 Healthcare, History, and Heartbreak
  Top Issues in Healthcare
  Data Management
  Biosimilars, Drug Pricing, and Pharmaceutical Compounding
  Promising Areas of Innovation
  Conclusion
  Notes

Chapter 2 Genome Sequencing: Know Thyself, One Base Pair at a Time
  Content contributed by Sheetal Shetty and Jacob Brill
  Challenges of Genomic Analysis
  The Language of Life
  A Brief History of DNA Sequencing
  DNA Sequencing and the Human Genome Project
  Select Tools for Genomic Analysis
  Conclusion
  Notes

Chapter 3 Data Management
  Content contributed by Joe Arnold
  Bits about Data
  Data Types
  Data Security and Compliance
  Data Storage
  SwiftStack
  OpenStack Swift Architecture
  Conclusion
  Notes

Chapter 4 Designing a Data-Ready Network Infrastructure
  Research Networks: A Primer
  ESnet at 30: Evolving toward Exascale and Raising Expectations
  Internet2 Innovation Platform
  Advances in Networking
  InfiniBand and Microsecond Latency
  The Future of High-Performance Fabrics
  Network Function Virtualization
  Software-Defined Networking
  OpenDaylight
  Conclusion
  Notes

Chapter 5 Data-Intensive Compute Infrastructures
  Content contributed by Dijiang Huang, Yuli Deng, Jay Etchings, Zhiyuan Ma, and Guangchun Luo
  Big Data Applications in Health Informatics
  Sources of Big Data in Health Informatics
  Infrastructure for Big Data Analytics
  Fundamental System Properties
  GPU-Accelerated Computing and Biomedical Informatics
  Conclusion
  Notes

Chapter 6 Cloud Computing and Emerging Architectures
  Cloud Basics
  Challenges Facing Cloud Computing Applications in Biomedicine
  Hybrid Campus Clouds
  Research as a Service
  Federated Access Web Portals
  Cluster Homogeneity
  Emerging Architectures (Zeta Architecture)
  Conclusion
  Notes

Chapter 7 Data Science
  NoSQL Approaches to Biomedical Data Science
  Using Splunk for Data Analytics
  Statistical Analysis of Genomic Data with Hadoop
  Extracting and Transforming Genomic Data
  Processing eQTL Data
  Generating Master SNP Files for Cases and Controls
  Generating Gene Expression Files for Cases and Controls
  Cleaning Raw Data Using MapReduce
  Transpose Data Using Python
  Statistical Analysis Using Spark
  Hive Tables with Partitions
  Conclusion
  Notes
  Appendix: A Brief Statistics Primer
  Content contributed by Daniel Penaherrera

Chapter 8 Next-Generation Cyberinfrastructures
  Next-Generation Cyber Capability
  NGCC Design and Infrastructure
  Conclusion
  Note

Conclusion

Appendix A The Research Data Management Survey: From Concepts to Practice
  Brandon Mikkelsen and Jay Etchings

Appendix B Central IT and Research Support
  Gregory D. Palmer

Appendix C HPC Working Example: Using Parallelization Programs Such as GNU Parallel and OpenMP with Serial Tools

Appendix D HPC and Hadoop: Bridging HPC to Hadoop

Appendix E Bioinformatics + Docker: Simplifying Bioinformatics Tools Delivery with Docker Containers

Glossary

About the Author

About the Contributors

Index

Foreword

The emergence of data science is radically transforming the biomedical knowledge generation paradigm. While modern biomedicine has been a pioneer in evidence-based science, its approach for decades has largely followed a well-worn path of experimental design, data collection, analysis, and interpretation. Data science introduces an alternative pathway—one that starts with the vast collections of diverse digital data increasingly accessible to the community.

While the data science evidence generation concept has many birth parents, Jim Gray of Microsoft best described the unique opportunity afforded by this new paradigm. In a 2007 address to the U.S. National Research Council, Gray argued: “With an exaflood of unexamined data and teraflops of cheap computing power, we should be able to make many valuable discoveries simply by searching all that information for unexpected patterns” [1]. Gray coined the phrase “data-intensive scientific discovery.” Notably, he broke with the high-performance computing “high priests” and advocated the adoption of new models of computing. Following Gray’s untimely death shortly after his address, his colleagues captured this concept in a collection of essays ultimately published as The Fourth Paradigm: Data-Intensive Scientific Discovery [2]. It was within these essays that the term “big data” was introduced.

“Data science” and “big data” are now overburdened terms with many meanings. The most useful definitions are operational in nature. One of the most colorful comes from John Myles of Facebook, who indicates that big data is any problem “so large that traditional approaches to data analysis are doomed to failure” [3]. I find the definition of the chief architect of Data.gov, Philip Ashlock, most elucidating: “Analysis that can help you find patterns, anomalies, or new structures amidst otherwise chaotic or complex data points” [3].

Data science remains controversial in biomedicine. Jeff Drazen, the editor in chief of the New England Journal of Medicine, has described data science practitioners as “research parasites” [4]. More subtly, Robert Weinberg openly questions whether such approaches have any potential to generate real insight in his article describing an emerging crisis in understanding cancer, “Coming Full Circle—From Endless Complexity to Simplicity and Back Again” [5].

I have been an eyewitness and co-conspirator in the data science transformation occurring in biomedicine. I grew up with the Human Genome Project and the rapid accumulation of large volumes of big data it generates. I have made contributions through the “Discovery Science” paradigm that the Genome Project made acceptable in biomedicine. For example, with my colleagues at the Cooperative Human Linkage Center, we were early adopters of computational science and the Internet (then NSFnet) in our efforts to construct the map of human inheritance [6]. For us at the time, big data topped out at a gigabyte! While serving as the founding director of the National Institutes of Health’s National Cancer Institute’s Center for Biomedical Informatics and Information Technology, I was tasked with helping bring data science to the cancer community. The charge was broad—including basic science, clinical research, and health encounter data. It was technologically challenging—predating many technology paradigms now taken for granted as standard in data science. Through these pioneering efforts, I experienced the aforementioned controversial nature of data science and the second of Arthur C. Clarke’s laws: “The only way of discovering the limits of the possible is to venture a little way past them into the impossible” [7].

Strategies in Biomedical Data Science is an ambitious attempt to look at “the limits of the possible” for data science in biomedicine. Unique in its scope, it takes a comprehensive look at all aspects of data science. Work in the sciences is routinely compartmentalized and segregated among specialists. This segregation is particularly true in biomedicine as it wrestles with the integration of data science and its underpinning in information technology. While such specialization is essential for progress within disciplines, the failure to have cross-cutting discussions results in lost opportunities. This book is significant in that it purposely embraces the “transdisciplinary” nature of biomedical data science. Transdisciplinary research (a foundational aspect of Arizona State University’s “New American University”) brings together different disciplines to create innovations that are beyond the capacity of any single specialty. Data science is definitionally transdisciplinary and somewhat ironically is discipline-agnostic.

Strategies in Biomedical Data Science unapologetically mixes biology, analytics, and information technology. Its transdisciplinary topics cover diverse data types—genomic, clinical encounter, personal monitoring devices—and the data science opportunities (and challenges) in each. Within each of these topics, it provides insights into the software capabilities that are used to wrangle Gray’s “exaflood” of data and to find his “unexpected patterns.” It provides insightful discussions of the underpinning computational and network infrastructure necessary to realize the potential of data science. More specifically, it provides practical blueprints that translate Gray’s suggested alternative to traditional high-performance computing paradigms into reality. Within each of these, it provides case studies written by experts that transition the topics from concept to real-world examples. Importantly, these case studies are provided by both academics and industry sources, demonstrating the importance of both to the biomedical data science progress as well as the need to blend these often-adversarial communities.

I have had the opportunity to know the author, Jay Etchings, for over three years. Jay is a true computational renaissance man, as reflected in the breadth of topics facilely presented in Strategies in Biomedical Data Science. I was first introduced to Jay when he was an architect for Dell. Jay translated ASU’s vision for a first-generation, purpose-built data science research platform into the operational Next Generation Cyber Capability (NGCC) described in the book. The NGCC is a physical instantiation of what Gray envisioned. Now at ASU as the director of Research Computing Operations, Jay and his team deliver biomedical data science to a diverse collection of international scientists.

Jay brings a fresh perspective and a diverse pedigree of work experiences to biomedical data science. He has been at the forefront of developing and deploying big data capabilities throughout his career. For example, Jay was on the leading edge in bringing big data infrastructure to the gaming industry—a community that is always an early adopter of breakthrough technology. Jay has hands-on experience in the complexities of biomedical data from his efforts to provide support for the Centers for Medicare and Medicaid Services. Jay’s commercial background brings with it a can-do approach to problems and a low tolerance for the arcane consternation that often paralyzes academics. This fresh perspective and his enthusiasm for biomedicine pervade his writing. Strategies in Biomedical Data Science is a one-stop shop of data science essentials and is likely to serve as the go-to resource for years to come.

Ken Buetow, Ph.D.,

Professor, Arizona State University

Director, Computational Science and Informatics Core Program

Director, Complex Adaptive Systems Initiative

Notes

1. David Snyder. 2016. “The Big Picture of Big Data—IEEE—The Institute.” http://theinstitute.ieee.org/ieee-roundup/members/achievements/the-big-picture-of-big-data.

2. Anthony J. G. Hey, ed. 2009. The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond, WA: Microsoft Research.

3. Jennifer Dutcher. 2014. “What Is Big Data?” September 3. https://datascience.berkeley.edu/what-is-big-data/.

4. Dan L. Longo and Jeffrey M. Drazen. 2016. “Data Sharing.” New England Journal of Medicine 374, no. 3: 276–277. doi: 10.1056/NEJMe1516564.

5. Robert A. Weinberg. 2014. “Coming Full Circle—From Endless Complexity to Simplicity and Back Again.” Cell 157, no. 1: 267–271. doi: 10.1016/j.cell.2014.03.004.

6. J. C. Murray, K. H. Buetow, J. L. Weber, S. Ludwigsen, T. Scherpbier-Heddema, F. Manion, et al. 1994. “A Comprehensive Human Linkage Map with Centimorgan Density. Cooperative Human Linkage Center (CHLC).” Science 265, no. 5181: 2049–2054.

7. Arthur C. Clarke. 1962. “Hazards of Prophecy: The Failure of Imagination.” In Profiles of the Future: An Inquiry into the Limits of the Possible. New York: Harper & Row.

Acknowledgments

Most broadly, this book has been inspired by the need for a collaborative and multidisciplinary approach to solving the intricate puzzle that is cancer. Cancer poses a complex adaptive challenge that reaches across all domains: medicine, biology, technology, and the social sciences. Transdisciplinary collaboration is the only true path to the future. Ubiquitous research computing in support of “open science” and open big data has an essential role to play in this collaborative process.

More specifically, this book is dedicated to Sue Stigler and the family she leaves behind. Her three-and-a-half-year battle with cancer came to a close on December 7, 2015. Sue’s kindness and devotion, and her endless support for others even while ill, were remarkable; her selflessness will always be remembered. If you would like to donate to the Stigler family college fund, please visit their GoFundMe page, https://www.gofundme.com/bpebavas.

Author proceeds support childhood brain cancer research through an ASU Foundation account supporting Dr. Joshua LaBaer’s work in the Biodesign Institute. Dr. LaBaer is conducting cutting-edge research on pediatric low-grade astrocytomas (PLGAs), which are the most common cancers of the brain in children.

In the research and discovery leading to this book, I have worked with more amazing and committed individuals than I could have ever imagined. My mentor and friend Ken Buetow is fond of saying, “If you’re the smartest person in the room, you are in the wrong room.” Time and again I have been in the right room. I am able to count some of the smartest people on the planet as colleagues and friends. Publication of this book was made a reality by their support and example.

A very special thanks to my good friend Phil Simon for convincing me to put thoughts, concepts, and theory on paper and share them with the world.

At Arizona State University I would like to thank Gordon Wishon, Dr. Elizabeth Cantwell, and Dr. Sethuraman Panchanathan (“Panch”) for giving me the opportunity to drive innovation at the university.

I would also like to recognize the dedication of our Research Computing team at Arizona State University for the continued commitment to our “commander’s intent” and to Christopher Myhill for sharing the commander’s intent with me while at Dell Enterprise.

Tremendous thanks to the teamwork of Jon McNally, Johnathan “Jr.” Lee, Lee Reynolds, Ram Polur, Daniel Penaherrera, Sheetal Shetty, James Napier, Tiffany Marin, Deborah Whitten, Curtis Thompson, Srinivasa Mathkur, Marisa Brazil, and of course Carol Schumacher, arguably the best administrative assistant alive. Special thanks also to Wendy “DigDug” Cegielski for her editing hours and continued motivation; next year you will be Dr. Wendy.

In no specific order I also would like to thank this list of super-smart and generous folks as well as our many terrific and invaluable partners: NimbleStorage, Brocade, Internet2, ESNET, Penguin Computing, TGEN, SwiftStack, MarkLogic, the Open Daylight Foundation, the Linux Foundation, Open Networking Foundation, IT Partners, friends at University of Arizona, Northern Arizona University, Dell Enterprise, University of Massachusetts-Lowell, Baylor University, Washington State University, Georgia Tech, Broad Institute of MIT and Harvard, University of Nevada Las Vegas, and the College of Southern Nevada (formerly CCSN), and thanks for the support and mentorship from domain professionals both public and private like Mark “Pup” Roberts, Brandon Mikkelsen, Sean Dudley, Joel Dudley, James Lowey, Todd Decker, Jeff Creighton, Jim Scott, Gregory Palmer, Neela Jacques, Al Ritacco, and of course my engineer stepbrother Pedro Victor Gomes.

Last but certainly not least, I would like to recognize my awesome team of Jacob, Dixon, and Annika for their enduring patience throughout the never-ending collecting of the data and experience that make up this text.

Heather, though you have departed from my arms, there is always a place for you in my heart.

Introduction

Never let the future disturb you.

You will meet it, if you have to, with the same weapons of reason which today arm you against the present.

—Marcus Aurelius

Some time ago, while I was engaged as a consultant, it became painfully obvious that the approaches to healthcare data management and overall infrastructure architecture were stuck in the Stone Age. While data and information technology (IT) professionals sprinted to remain on the cutting edge of top tech trends, much of the healthcare system remained a technical backwater. The many explanations for this include compliance controls, challenges associated with the rapid proliferation of data, and reliance on old systems with proprietary code where porting was more painful than the day-to-day operations. This state of affairs has been frustrating for all involved. But beyond the very real frustrations, there are far more important negative impacts. Technical inefficiencies increase costs, lead to a loss of research productivity, and hurt clinical outcomes. In other words, everyone suffers. When I talk to people about data management and IT support within the healthcare field, a recurring theme is that much is “lost in translation” between the various stakeholders: IT professionals, researchers, doctors, clinicians, and administrators.

Over the past 20 years, much of my time has been spent in medical and technical fields. I have held positions with two large insurance payer providers and have worked with the Centers for Medicare & Medicaid Services (CMS) as a recovery audit contractor. I have even worked clinically as an emergency medical technician with a strong background in exercise physiology. Seeking greater challenges led me to Las Vegas, Nevada, where I was fortunate to work on the first cloud-enabled centrally deterministic (Class 2) gaming systems for the state lottery. This was well before the term “cloud” had even arrived. At the close of the project, I returned to the medical field, joining a Fortune 50 payer provider ingesting targeted acquisitions.

My wide-ranging work experiences have shown me that medical and research professionals are usually not technology experts, and most do not desire to be. At the same time, computer scientists and infrastructure experts are not biologists, doctors, or researchers. This longtime disconnect paves the way for high-paid consultants to act as intermediaries brought in to work between IT and biomedical staff.

Not surprisingly, this does not work terribly well, nor does it best serve the medical and research communities. Consultants typically demand high compensation and often are not able to perform the sort of knowledge transfer necessary to make a meaningful and sustainable impact. There are many different permutations and possible explanations for this. But, in the end, I think it is at heart a failure to adequately translate or bridge biomedicine and IT.

The primary motivation for this book is to begin to create a sustainable and

readily accessible bridge between IT and data technologists, on one hand, and the

community of clinicians, researchers, and academics who deliver and advance

healthcare, on the other hand. This book is thus a translational text that will hope-

fully work both ways. It can help IT staff learn more about clinical and research

needs within biomedicine. It also can help doctors and researchers learn more

about data and other technical tools that are potentially at their disposal.

My experience in healthcare has shown me that both IT professionals and

biologists tend to become isolated or siloed in their professional worlds. This

isolation hurts us all: IT staff, biologists, doctors, and patients alike. This is not

to suggest that IT staff and data managers should get master’s degrees in biol-

ogy or epidemiology. Rather, I am suggesting that as IT staff and data managers

learn more about the biomedical context of their work, they will be able to

work better and more efficiently. Furthermore, as biomedicine becomes ever

more dependent on computing and big data, there is more and more domain-

specific technical knowledge to assimilate.

As IT and biomedicine innovate with increasing rapidity, I predict that

we will see more and more hybrid job titles, such as health technologist and

bioinformatician. In order to stay current, both IT professionals and biomedi-

cal professionals will need to become less isolated. This book begins to bring

together these two fields that are so dependent on each other and have so

much to offer each other. It is my sincere hope that this work will narrow the

gap between those engaged in use-inspired research and those supporting that

research from an infrastructure delivery perspective.

In the interest of creating as accessible a bridge text as possible between

IT staff and biomedical personnel, this book is relatively nontechnical. For the

most part, the aim is to offer a conceptual introduction to key topics in data

management for the biomedical sciences. While a certain familiarity with IT,

networking, and applications is assumed, you will find very little in the way of

code examples. The goal is to equip you with some foundational concepts that

will leave you prepared to seek out whatever additional information you and

your institution might need.

I have worked in IT for over 20 years, but I am most inspired by how com-

puting technologies can be used to solve human problems. I certainly appreci-

ate elegant code and innovative technical solutions. But at the end of the day,

it is the prospect of improving patient outcomes that keeps me engaged and

driven to learn and continually extend the boundaries of the possible. One

area of biomedical research that I find particularly inspiring is the potential to

use targeted therapies to more effectively treat pediatric low-grade astrocyto-

mas (PLGAs). PLGAs are by far the most common cancer of the brain among

children. They are often fatal, and current chemotherapies frequently have

lifelong side effects, including neurocognitive impairment. Dr. Joshua LaBaer,

interim director of the Biodesign Institute at Arizona State University, is work-

ing to develop effective targeted therapies that reduce harmful effects on nor-

mal cells. Proceeds from this book support the ASU Research Foundation and

the work of Dr. Joshua LaBaer, Director, The Biodesign Institute, Personalized

Diagnostics and Virginia G. Piper Chair in Personalized Medicine.

In reflecting on the important roles to be played by humans and by

computing, I am reminded of a frequently cited quote by Leo M. Cherne, an

American economist and public servant, that is often inaccurately attributed to

Albert Einstein: “The computer is incredibly fast, accurate, and stupid. Man is

unbelievably slow, inaccurate, and brilliant. The marriage of the two is a force

beyond calculation.” As our capabilities to gather, analyze, and archive data

dramatically improve, computing is likely to be increasingly valuable to bio-

medical research and clinical medicine. Yet let us always remember the need

for humans, slow and inaccurate as we usually are.

Who Should Read This Book?

Strategies in Biomedical Data Science is designed to help anyone who works with

biomedical data. This certainly includes IT staff and systems administrators.

These readers will hopefully gain a deeper understanding of particular chal-

lenges and solutions for biomedical data management. The target audience also

includes bioscience researchers and clinical staff. While persons in these roles

are not typically directly responsible for data management, they are most cer-

tainly concerned with and affected by how data is created, used, and archived.

I hope these readers will gain a deeper understanding of how IT staff tend

to approach systems architecture and data management. Quite frequently we

focus on academic and other public research institutions. Such institutions

are tremendously important for cutting-edge research and collaboration.

Most of the best practices and scenarios presented in the book are, however,

equally applicable to private-sector use cases.

All readers are welcome to work through this book in whatever order best

suits their particular interests and needs.


What’s in This Book?

Strategies in Biomedical Data Science offers a relatively high-level introduction

to the cutting-edge and rapidly changing field of biomedical data. It provides

biomedical IT professionals with much-needed guidance toward managing the

increasing deluge of healthcare data. This book demonstrates ways in which

both technological development and more effective use of current resources

can better serve both patient and payer. The discussion explores the aggregation

of disparate data sources, current analytics and tool sets, the growing necessity

of smart bioinformatics, and more as data science and biomedical science grow

increasingly intertwined. Real-world use cases and clear examples are featured

throughout, and coverage of data sources, problems, and potential mitigation

provides necessary insight for forward-looking healthcare professionals.

The book begins with an overview of current technical challenges in health-

care and then moves into topics in biomedical data management, including

network infrastructure, compute infrastructure, cloud architecture, and finally

next-generation cyberinfrastructures.

Many of the chapters include use cases and/or case studies. Use cases examine

a general scenario and typically focus on one application or technology.

Case studies are more particular examinations of how a company or institution

has used an application or technology to meet an operational need. One of

our objectives is to shine a light into the black box that is the emerging realm

of precision medicine. Much of the case study data has been compiled over

the past few years and has been updated to include as much current data as

available. Please be aware that some case study materials have been anony-

mized at the request of the institution providing the information. Case studies

appear after chapters, while use cases are presented within the chapters.

Strategies in Biomedical Data Science has benefited tremendously from the

many wonderful experts who have generously contributed content. Contribu-

tors are acknowledged throughout the book, alongside their contributions, and

you can find their biographies in the “About the Contributors” section.

Chapter 1, “Healthcare, History, and Heartbreak,” examines some of the

current top issues in healthcare that pertain to data and IT. There are great

challenges but also tremendous opportunities for innovation in IT and data

science. Chapter 1 also presents some promising areas for innovation, includ-

ing the Internet of Things, cloud computing, and dramatic advances in data

storage. Chapter 2, “Genome Sequencing,” recaps the remarkable history of

how scientists deciphered the central dogma, the deceptively simple model

that explains the molecular basis of biological life. We then review the his-

tory of genomic sequencing from its origins to next-generation sequencing

(NGS) and recount its startling price drop. Perhaps most important, we sur-

vey some common genomics tools and resources for analyzing and working

with genomics data in silico. Following this chapter you will find a case

study presenting a dramatic example of exome sequencing leading to clinical

diagnosis.

Chapter 3, “Data Management,” explores challenges and solutions for man-

aging large quantities of biomedical data. The chapter begins with an overview

of different types of data and moves on to issues of security and compliance

in biomedical research. We offer a general research data life cycle to help you

plan and anticipate potential problems. Particular storage technologies covered

include iRODS, OpenStack Swift, SwiftStack, and NimbleStorage, a perfor-

mance storage array. Following this chapter you will find three case studies.

The first considers the data demands of genetic sequencing. The second offers

specifications for HudsonAlpha’s SwiftStack storage cluster. The third focuses

on the use of NimbleStorage’s predictive flash storage at ASU.

Chapter 4, “Designing a Data-Ready Network Infrastructure,” offers a brief

history of computer networking before examining research networks and some

advances in networking. We also share a model that can be used to deliver

secure and regulated data storage and services so that institutions can comply

with security standards. Networking advances discussed include InfiniBand,

a computer-networking communications standard used in high-performance

computing, which features very high throughput and very low latency; net-

work function virtualization (NFV); and software-defined networking (SDN).

The bulk of this chapter is a detailed guide to OpenDaylight, an open source

SDN platform.

Chapter 5, “Data-Intensive Compute Infrastructures,” is all about big data.

It starts with a brief survey of the current state of big data efforts in healthcare

and biomedicine. We consider big data applications as well as data sources.

From there we dive into infrastructure for big data analytics, first examining

service-oriented architecture and cloud computing. We then focus on hierar-

chical system structures and discuss the following layers: sensing, data storage

and management, data computing, and application

services. We end by presenting graphics processing unit (GPU) accelerated

computing. Following the chapter you will find two case studies. The first

reports on how computational modeling and scientific computing can model

treatment options for vascular disease. The second presents how GPU computing was used

to model the molecular dynamics of antibiotic resistance.

Chapter 6, “Cloud Computing and Emerging Architectures,” begins with an

overview of cloud computing, including service and deployment models as well

as challenges. After this we examine Research as a Service (RaaS) and cluster

homogeneity, key components of some versions of cloud computing, and we

also consider federated access. The second half of the chapter dives into Zeta

Architecture, an emergent architecture that is used by Google and that offers

better hardware utilization, fewer moving parts, and greater responsiveness


and flexibility. Zeta and other emerging architectures are catalyzed by the limitations

of one-size-fits-all enterprise architectures. Following this chapter is

a case study on using on-demand computing for biomedical research on

ventricular tachycardia.

Chapter 7, “Data Science,” focuses on the tools and techniques demanded

by this exciting and rapidly growing field. First we examine some basic statisti-

cal concepts as these are the foundation of much data science. From there we

explore some NoSQL database offerings and Splunk, and offer a detailed exam-

ple of genomic analysis (eQTL), which entails Apache Spark and Hive tables.

Following this chapter you will find two case studies: one on UC Irvine Health’s

Hortonworks Data Platform and the second on subclonal variations and the

computing and data science strategies used to study these.

Chapter 8, “Next-Generation Cyberinfrastructures,” brings together many

of the central strands of this book. It reports on the Next-Generation Cyber

Capability (NGCC), which is Arizona State University’s approach to meeting

compute and data needs for its research community and key collaborators.

Following this chapter is a case study on one of the first NGCC projects, the

National Biomarker Development Alliance.

A brief conclusion reviews the book’s goals and invites feedback and

suggestions.

In addition to the case studies, Strategies in Biomedical Data Science contains

five appendixes and a glossary.

Appendix A reports on a survey about research management. Appendix

B reports on a survey about the current state and desired capabilities for IT

resources at research universities. Appendix C offers some high-performance

computing working examples. Appendix D details how to bridge high-

performance computing to Hadoop. Finally, Appendix E discusses using Docker

for bioinformatics.

Thanks for reading!

Should this book inspire the reader to dig deeper into research computing

or the research itself, we will consider it a win. If you find this book to be of

little value, please leave it on your next flight, bus ride, or at a homeless shelter

for some other reader to find and take to their next job interview.

How to Contact Us

As you use this book and work with biomedical data, we welcome your com-

ments and feedback. In the hybrid and rapidly evolving field of biomedical

data, collaboration and exchange are truly essential. We hope there will be a

second edition of this book, and I would value comments and feedback to help

improve this material.

You can reach Jay at [email protected] or [email protected].

Chapter 1: Healthcare, History, and Heartbreak

Over the past decade, we have unlocked many of the mysteries about DNA and RNA. This knowledge isn’t just sitting in books on the shelf nor is it confined to the workbenches of laboratories. We have used these research findings to pinpoint the causes of many diseases. Moreover, scientists have translated this genetic knowledge into several treatments and therapies prompting a bridge between the laboratory bench and the patient’s bedside.

—Barack Obama on the Genomics and Personalized Medicine Act (S. 976), March 23, 2007

While we are surely poised to continue to make tremendous medical

advances—notably in personalized medicine, pharmacogenomics, and

precision medicine—we are also facing substantial challenges. The chal-

lenges facing healthcare today are many, and if we do not adequately address

them we risk missing opportunities, pushing the cost of care up, and slowing the

pace of biomedical innovation. In briefly surveying the state of healthcare, it is not

my intention to offer a political diagnosis or solution. Rather, it is my intention

to use our current technical knowledge to point the way to practical solutions.

For example, a long-theorized solution to health records management would be

a single cloud-based system where healthcare information sharing exists univer-

sally. But if I were to present this as the best technical solution, it would not be my

intention to also advocate for a shift to a single-payer healthcare system. As much

as possible this book and the discussions in this chapter aim to avoid politics.

After decades of technological lag, biomedicine has started to embrace new

technologies with increasing rapidity. Next-generation sequencing, mobile tech-

nologies, wearable sensors, three-dimensional medical imaging, and advances in

analytic software now make it possible to capture vast amounts of information.

Yet we still struggle with the collection, management, security, and thoughtful

interpretation of all this information. At the same time, healthcare is changing

quickly as the field grapples with new technologies and is transformed by merg-

ers and new partnerships. As a complex adaptive system, healthcare is more

than the sum of its parts, and it is always difficult to predict the future. But we

do know that as the post–Affordable Care Act healthcare landscape takes shape,

the industry is shifting toward digitally enabled, consumer-focused care models.

Given these trends, technology will be granted many opportunities to improve

patient care.

At the outset of this book it is worth surveying some of the top issues in

healthcare. For many of you, these will be quite familiar. Whether you’re an

expert or not, you should feel free to skip ahead if you like. But it is my sin-

cere hope that the background material will be of real value in bridging the

gap between healthcare and biomedicine, on the one hand, and information

technology (IT) and data management, on the other. Just as doctors in an age

of increasing specialization can benefit from attending to the whole patient, it

is very valuable for IT staff to have a more holistic and systemic understanding

of healthcare.

Top Issues in Healthcare

There are many, many sources that comment on the state of healthcare and

biomedicine more broadly. Although I worked as a contractor for two of

the country’s largest Medicare/Medicaid contract holders, I am not a policy

expert. But I have come to appreciate the importance of taking in the bigger

picture. My admittedly incomplete survey of top healthcare issues is drawn

from PwC’s Top Health Industry Issues of 2016 and PwC’s Top Health Industry

Issues of 2015 [1]. These two brief reports offer compelling syntheses and analy-

ses of current trends. In rereading these reports and reflecting on my own

experiences in the field, I was struck by the number of top issues that are

substantially or in part data or IT issues. Many of the top healthcare issues

are centrally concerned with the storage, security, sharing, and analysis of

data. In other words, IT and data management will be called on to make major

contributions to advancing the dynamic healthcare field. Next I explore nine

key issues impacting healthcare.

Mergers and Partnerships

As the health sector continues to change in response to the Affordable Care Act

(2010), we are seeing many mergers and partnerships. “The ACA’s emphasis on

value and outcomes has sent ripples through the $3.2 trillion health sector, spread-

ing and shifting risk in its wake. At the same time, capital is inexpensive, thanks

to sustained low interest rates. Industry’s response? Go big” [2]. Mergers between

large insurance providers are consolidating the insurance market. In 2015, the

second largest U.S. insurer, Anthem, made a $48.4 billion offer for health and life

insurance provider Cigna. Mergers have also been common in the pharmaceutical

field, including Pfizer’s whopping $160 billion deal for specialty pharmaceutical

star Allergan. While these deals are still awaiting regulatory approval, 2016 and

2017 will likely see more mergers and acquisitions. Many new partnerships are

also being formed between pharmaceutical, life sciences, software, pharmacy,

healthcare providers, and engineering companies, among others.

Mergers, acquisitions, and partnerships are driven by a number of larger

market forces. Sometimes predicted lower IT or data costs drive consolida-

tion. More often it is simply that IT and data will need to be able to respond

nimbly to these changes. One of the largest challenges is postacquisition data

management.

Many providers in the healthcare space have grown through organic means

and have survived on shoestring budgets. When compliance moved to the fore-

front, many chief information officers were granted grace periods to meet com-

pliance and conducted internal audits, patching together existing components

to meet the objectives. This expenditure had the systemic impact of prevent-

ing the distribution of funds toward infrastructure improvements. The mainte-

nance of many legacy systems resulted, leaving organizations with out-of-date,

proprietary, inflexible systems that were simply not designed to interoperate

on the larger scale. Now when that smaller provider, which potentially main-

tains a large collection of Medicare/Medicaid accounts, is acquired by a larger

entity, the most significant challenge is the integration of those legacy systems

without impacting operational activities. The challenge of migrating years of

patient data records into a system from an out-of-date platform encumbered

by complex and tangled spaghetti code and created by a resource long since

departed is substantial. The need to do so while maintaining business conti-

nuity drives many a large entity to maintain the down-level system for years

following the acquisition.

Cybersecurity and Data Security

As more and more patient data is stored and shared, security is an increasing

concern. Patient data typically contains individualized information. If that data

is stolen, the risks of identity theft are substantial, and there exists a thriving

black market for stolen health records. Data security breaches are relatively

common. “During the summer of 2014, more than 5 million patients had their

personal data compromised” [1]. These breaches are often costly for compa-

nies. Medical devices themselves can also be hacked. For example, in 2015 the

government warned that “an infusion pump . . . could be modified to deliver a

fatal dose of medication” [2].

The needs for elastic scalability, rapid provisioning, resource orchestration,

high availability, and storage efficiency have contributed to the explosion in

cloud providers and niche service offerings. However, this explosion has also

opened holes in known security elements that were once sealed. Cloud security

challenges can range from the innocuous VM sprawl, where virtual machines

are orphaned in an on/off state and fall outside of the domain security policy

for things as basic as patching and maintenance [3]. On the other end of the
