

A Cloud-Based Seizure Alert System, p. 56

Visualizing High-Dimensional Data, p. 98

Computers in Cars, p. 108

cise.aip.org

www.computer.org/cise/

Vol. 18, No. 5 | September/October 2016



READ YOUR FAVORITE PUBLICATIONS YOUR WAY

SEARCH, ANNOTATE, UNDERLINE, VIEW VIDEOS, CHANGE TEXT SIZE, DEFINE

Now, your IEEE Computer Society technical publications aren't just the most informative and state-of-the-art; they're also the most exciting, interactive, and customizable to your reading preferences.

The new myCS format for all IEEE Computer Society digital publications is:

• Mobile friendly.

• Customizable.

• Adaptive.

• Personal.

Just go to www.computer.org/mycs-info

Login to mycs.computer.org




Kirk Borne, Principal Data Scientist, Booz Allen Hamilton

Satyam Priyadarshy, Chief Data Scientist, Halliburton

Bill Franks, Chief Analytics ...

Experience the Newest and Most Advanced Thinking in Big Data Analytics

Rock Star Speakers

Big Data: Big Hype or Big Imperative? BOTH.

Business departments know the promise of ...

www.computer.org/bda

03 November 2016 | Austin, TX



EDITORIAL BOARD MEMBERS
Joan Adler, Technion-IIT
Francis J. Alexander, Los Alamos National Laboratory
Isabel Beichl, Nat'l Inst. of Standards and Technology
Bruce Boghosian, Tufts Univ.
Hans-Joachim Bungartz, Technical University of Munich
Norman Chonacky, Yale Univ. (EIC Emeritus)
Massimo DiPierro, DePaul Univ.
Jack Dongarra, Univ. of Tennessee
Rudolf Eigenmann, Purdue Univ.
William J. Feiereisen, Intel Corporation
Geoffrey Fox, Indiana Univ.
K. Scott Hemmert, Sandia National Laboratories
David P. Landau, Univ. of Georgia
Konstantin Läufer, Loyola Univ. Chicago
Preeti Malakar, Argonne National Laboratory
James D. Myers, University of Michigan
Manish Parashar, Rutgers Univ.
John Rundle, Univ. of California, Davis
Robin Selinger, Kent State Univ.
Thomas L. Sterling, Indiana Univ.
West, University of Texas, Austin

DEPARTMENT EDITORS
Books: Stephen P. Weppner, Eckerd College
Computing Prescriptions: Ernst Mucke, Identity Solutions, ernst.mucke@gmail.com, and Francis Sullivan, IDA/Center for Computing Sciences
Computer Simulations: Barry I. Schneider, NIST, and Gabriel A. Wainer, Carleton University
Education: Rubin H. Landau, Oregon State Univ., and Scott Lathrop, University of Illinois
Leadership Computing: James J. Hack, ORNL, and Michael E. Papka, ANL
Novel Architectures: Volodymyr Kindratenko, University of Illinois, and Pedro Trancoso, Univ. of Cyprus
Scientific Programming: Konrad Hinsen, CNRS Orléans, and Matthew Turk, NCSA
Software Engineering Track: Jeffrey Carver, University of Alabama, and Damian Rouson, Sourcery Institute
The Last Word: Charles Day
Visualization Corner: Joao Comba, UFRGS, comba@inf.ufrgs.br, and Daniel Weiskopf, Univ. Stuttgart
Your Homework Assignment: Nargess Memarsadeghi, NASA Goddard Space Flight Center

STAFF
Editorial Product Lead: Cathy Martin
Editorial Management: Jennifer Stout
Operations Manager: Monette Velasco
Senior Advertising Coordinator: Marian Anderson
Director of Membership: Eric Berkowitz
Director, Products & Services: Evan Butterfield
Senior Manager, Editorial Services: Robin Baldwin
Manager, Editorial Services: Brian Brannon
Senior Business Development Manager: Sandra Brown

AMERICAN INSTITUTE OF PHYSICS STAFF
Marketing Director, Magazines: Jeff Bebee
Editorial Liaison: Charles Day

IEEE Antennas & Propagation Society Liaison: Don Wilton, Univ. of Houston
IEEE Signal Processing Society Liaison: Mrityunjoy Chakraborty, Indian Institute of Technology

CS MAGAZINE OPERATIONS COMMITTEE
Forrest Shull (chair), Brian Blake, Maria Ebling, Lieven Eeckhout, Miguel Encarnacao, Nathan Ensmenger, Sumi Helal, San Murugesan, Ahmad-Reza Sadeghi, Yong Rui, Diomidis Spinellis, George K. Thiruvathukal, Mazin Yousif, Daniel Zeng

CS PUBLICATIONS BOARD
David S. Ebert (VP for Publications), Alfredo Benso, Irena Bojanova, Greg Byrd, Min Chen, Robert Dupuis, Niklas Elmqvist, Davide Falessi, William Ribarsky, Forrest Shull, Melanie Tory

EDITORIAL OFFICE
Publications Coordinator: [email protected]
COMPUTING IN SCIENCE & ENGINEERING
c/o IEEE Computer Society, 10662 Los Vaqueros Circle, Los Alamitos, CA 90720 USA
Phone +1 714 821 8380; Fax +1 714 821 4010
Websites: www.computer.org/cise or http://cise.aip.org/

EDITOR IN CHIEF
George K. Thiruvathukal, Loyola Univ. Chicago

ASSOCIATE EDITORS IN CHIEF
Jeffrey Carver, University of Alabama
Jim X. Chen, George Mason Univ.
Judith Bayard Cushing, The Evergreen State College
Steven Gottlieb, Indiana Univ.
Douglass E. Post, Carnegie Mellon Univ.
Barry I. Schneider, NIST



SCIENCE AS A SERVICE

8 Guest Editors’ Introduction

Ravi Madduri and Ian Foster

10 A Case for Data Commons: Toward Data Science as a Service

Robert L. Grossman, Allison Heath, Mark Murphy, Maria Patterson, and Walt Wells

Data commons collocate data, storage, and computing infrastructure with core services and commonly used tools and applications for managing, analyzing, and sharing data to create an interoperable resource for the research community. An architecture for data commons is described, as well as some lessons learned from operating several large-scale data commons.

21 MRICloud: Delivering High-Throughput MRI Neuroinformatics as Cloud-Based Software as a Service

Susumu Mori, Dan Wu, Can Ceritoglu, Yue Li, Anthony Kolasny, Marc A. Vaillant, Andreia V. Faria, Kenichi Oishi, and Michael I. Miller

MRICloud provides a high-throughput neuroinformatics platform for automated brain MRI segmentation and analytical tools for quantification via distributed client-server remote computation and Web-based user interfaces. This cloud-based service approach improves the efficiency of software implementation, upgrades, and maintenance. The client-server model is also ideal for high-performance computing, allowing distribution of computational servers and client interactions across the world.

36 WaveformECG: A Platform for Visualizing, Annotating, and Analyzing ECG Data

Raimond L. Winslow, Stephen Granite, Christian Jurado

The electrocardiogram (ECG) is the most commonly collected data in cardiovascular research because of the ease with which it can be measured and because changes in ECG waveforms reflect underlying aspects of heart disease. Accessed through a browser, WaveformECG is an open source platform supporting interactive analysis, visualization, and annotation of ECGs.

COMPUTATIONAL CHEMISTRY

48 Chemical Kinetics: A CS Perspective

Dinesh P. Mehta, Anthony M. Dean, and Tina M. Kouri

CLOUD COMPUTING

56 A Cloud-Based Seizure Alert System for Epileptic Patients That Uses Higher-Order Statistics

Sanjay Sareen, Sandeep K. Sood, and Sunil Kumar Gupta

HYBRID SYSTEMS

68 The Feasibility of Amazon's Cloud Computing Platform for Parallel, GPU-Accelerated, Multiphase-Flow Simulations

Cole Freniere, Ashish Pathak, Mehdi Raessi, and Gaurav Khanna

STATEMENT OF PURPOSE

Computing in Science & Engineering (CiSE) aims to support and promote the emerging discipline of computational science and engineering and to foster the use of computers and computational techniques in scientific research and education. Every issue contains broad-interest theme articles, departments, news reports, and editorial comment. Collateral materials such as source code are made available electronically over the Internet. The intended audience comprises physical scientists, engineers, mathematicians, and others who would benefit from computational methodologies. All articles and technical notes in CiSE are peer-reviewed.

Cover illustration: Andrew Baker, www.debutart.com/illustration/andrew-baker

For more information on these and other computing topics, please visit the IEEE Computer Society Digital Library at www.computer.org/csdl.

September/October 2016, Vol. 18, No. 5



COLUMNS

4 From the Editors

Steven Gottlieb

The Future of NSF Advanced Computing Infrastructure Revisited

108 The Last Word

Charles Day

Computers in Cars

DEPARTMENTS

78 Computer Simulations

Christian D. Ott

Massive Computation for Understanding Core-Collapse Supernova Explosions

94 Leadership Computing

Laura Wolf

Multiyear Simulation Study Provides Breakthrough in Membrane Protein Research

Editorial: Unless otherwise stated, bylined articles, as well as product and service descriptions, reflect the author’s or firm’s opinion. Inclusion in Computing in Science & Engineering does not necessarily constitute endorsement by IEEE, the IEEE Computer Society, or the AIP. All submissions are subject to editing for style, clarity, and length. IEEE prohibits discrimination, harassment, and bullying. For more information, visit www.ieee.org/web/aboutus/whatis/policies/p9-26.html. Circulation: Computing in Science & Engineering (ISSN 1521-9615) is published bimonthly by the AIP and the IEEE Computer Society. IEEE Headquarters, Three Park Ave., 17th Floor, New York, NY 10016-5997; IEEE Computer Society Publications Office, 10662 Los Vaqueros Cir., Los Alamitos, CA 90720, phone +1 714 821 8380; IEEE Computer Society Headquarters, 2001 L St., Ste. 700, Washington, D.C., 20036; AIP Circulation and Fulfillment Department, 1NO1, 2 Huntington Quadrangle, Melville, NY, 11747-4502. Subscribe to Computing in Science & Engineering by visiting www.computer.org/cise. Reuse Rights and

Reprint Permissions: Educational or personal use of this material is permitted without fee, provided such use: 1) is not made for profit; 2) includes this notice and a full citation to the original work on the first page of the copy; and 3) does not imply IEEE endorsement of any third-party products or services. Authors and their companies are permitted to post the accepted version of IEEE-copyrighted material on their own web servers without permission, provided that the IEEE copyright notice and a full citation to the original work appear on the first screen of the posted copy. An accepted manuscript is a version that has been revised by the author to incorporate review suggestions, but not the published version with copy-editing, proofreading and formatting added by IEEE. For more information, please go to: http://www.ieee.org/publications_standards/publications/rights/paperversionpolicy.html. Permission to reprint/republish this material for commercial, advertising, or promotional purposes or for creating new collective works for resale or redistribution must be obtained from IEEE by writing to the IEEE Intellectual Property Rights Office, 445 Hoes Lane, Piscataway, NJ 08854-4141 or [email protected]. Copyright © 2016 IEEE. All rights reserved. Abstracting and Library

Use: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Dr., Danvers, MA 01923. Postmaster: Send undelivered copies and address changes to Computing in Science & Engineering, 445 Hoes Ln., Piscataway, NJ 08855. Periodicals postage paid at New York, NY, and at additional mailing offices. Canadian GST #125634188. Canada Post Corporation (Canadian distribution) publications mail agreement number 40013885. Return undeliverable Canadian addresses to PO Box 122, Niagara Falls, ON L2E 6S8 Canada. Printed in the USA.

98 Visualization Corner

Renato R.O. da Silva, Paulo E. Rauber, and Alexandru C. Telea

Beyond the Third Dimension: Visualizing High-Dimensional Data with Projections

RESOURCES

46 AIP Membership Information

47 IEEE Computer Society Information



FROM THE EDITORS

Steven Gottlieb | Indiana University

The Future of NSF Advanced Computing Infrastructure Revisited

I am in Sunriver, Oregon, having just enjoyed three days at the annual Blue Waters Symposium for Petascale Science and Beyond. It was a perfect opportunity to catch up on all the wonderful science being done on Blue Waters, the National Science Foundation's flagship supercomputer, located at the University of Illinois's National Center for Supercomputing Applications (NCSA). To be honest, you can't really catch up on all the science: most of the presentations are in parallel sessions with four simultaneous talks. There were also very interesting tutorials to help attendees make the best use of Blue Waters.

But what I'm most interested in discussing here isn't the petascale science, but the "beyond" issue. CiSE readers might recall that in the March/April 2015 issue, I used this space for a column entitled "Whither the Future of NSF Advanced Computing Infrastructure?" (vol. 17, no. 2, 2015, pp. 4–6). One focus of that piece was the interim report of the Committee on Future Directions for NSF Advanced Computing Infrastructure to Support US Science in 2017–2020. This committee was appointed through the Computer Science and Telecommunications Board of the National Research Council (NRC) and was expected to issue a final report in mid-2015 (in fact, it was announced nearly a year later, in a 4 May 2016 NSF press release). I had a chance to sit down with Bill Gropp (University of Illinois Urbana-Champaign), who cochaired the committee with Robert Harrison (Stony Brook) and gave a very well-received after-dinner talk at the symposium about the report.

Over the years, there has been a growing gap between requests for computer time through NSF's XSEDE (Extreme Science and Engineering Discovery Environment) program and the availability of such time. Making matters worse, Blue Waters is scheduled to shut down in 2018. At the symposium, William Kramer announced that the NCSA had requested a zero-cost extension to continue operations of Blue Waters until sometime in 2019. Extension of Blue Waters operations would be a very positive development. Unfortunately, the NSF hasn't announced a plan to replace Blue Waters with a more powerful computer, even in light of the NSF's role in the National Strategic Computing Initiative announced by President Obama on 29 July 2015. There could be a very serious shortage of computer time in the next few years that would broadly impact science and engineering research in the US.

My previous article mentioned that the Division of Advanced Cyberinfrastructure (ACI) is now part of the NSF's Directorate of Computer & Information Science & Engineering (CISE). Previously, the Office of Cyberinfrastructure reported directly to the NSF director. The NSF has asked for comments on the impact of this change, but the deadline is 30 June, well before you'll see this column. The NSF's request for comments was a major topic of conversation in an open meeting at the symposium held by NCSA Director Ed Seidel. I plan to let the NSF know that I think it's essential to go back to the previous arrangement: scientific computing isn't part of computer science, and it's very important that the people at the NSF planning for supercomputing be at the same level as the science directorates in order to get direct input on each directorate's computing needs.

The committee report I mentioned earlier has seven recommendations, most of which contain subpoints (see the "Committee Recommendations" sidebar for more information). The recommendations are organized into four main issues: maintaining US leadership in science and engineering, ensuring that resources meet community needs, helping computational scientists deal with the rapid changes in high-end computers, and sustaining the infrastructure for advanced computing.


Committee Recommendations

The full report is at http://tinyurl.com/advcomp17-20; the text here is a verbatim, unedited excerpt, reprinted with permission from "Future Directions for NSF Advanced Computing Infrastructure to Support US Science and Engineering in 2017-2020," Nat'l Academy of Sciences, 2015 (doi:10.17226/21886).

A: Position US for continued leadership in science and engineering

Recommendation 1. NSF should sustain and seek to grow its investments in advanced computing—to include hardware and services, software and algorithms, and expertise—to ensure that the nation's researchers can continue to work at frontiers of science and engineering.

Recommendation 1.1. NSF should ensure that adequate advanced computing resources are focused on systems and services that support scientific research. In the future, these requirements will be captured in its road maps.

Recommendation 1.2. Within today's limited budget envelope, this will mean, first and foremost, ensuring that a predominant share of advanced computing investments be focused on production capabilities and that this focus not be diluted by undertaking too many experimental or research activities as part of NSF's advanced computing program.

Recommendation 1.3. NSF should explore partnerships, both strategic and financial, with federal agencies that also provide advanced computing capabilities as well as federal agencies that rely on NSF facilities to provide computing support for their grantees.

Recommendation 2. As it supports the full range of science requirements for advanced computing in the 2017-2020 timeframe, NSF should pay particular attention to providing support for the revolution in data driven science along with simulation. It should ensure that it can provide unique capabilities to support large-scale simulations and/or data analytics that would otherwise be unavailable to researchers and continue to monitor the cost-effectiveness of commercial cloud services.

Recommendation 2.1. NSF should integrate support for the revolution in data-driven science into NSF's strategy for advanced computing by (a) requiring most future systems and services and all those that are intended to be general purpose to be more data-capable in both hardware and software and (b) expanding the portfolio of facilities and services optimized for data-intensive as well as numerically-intensive computing, and (c) carefully evaluating inclusion of facilities and services optimized for data-intensive computing in its portfolio of advanced computing services.

Recommendation 2.2. NSF should (a) provide one or more systems for applications that require a single, large, tightly coupled parallel computer and (b) broaden the accessibility and utility of these large-scale platforms by allocating high-throughput as well as high-performance work flows to them.

Recommendation 2.3. NSF should (a) eliminate barriers to cost-effective academic use of the commercial cloud and (b) carefully evaluate the full cost and other attributes (e.g., productivity and match to science work flows) of all services and infrastructure models to determine whether such services can supply resources that meet the science needs of segments of the community in the most effective ways.

B. Ensure resources meet community needs

Recommendation 3. To inform decisions about capabilities planned for 2020 and beyond, NSF should collect community requirements and construct and publish roadmaps to allow NSF to set priorities better and make more strategic decisions about advanced computing.

Recommendation 3.1. NSF should inform its strategy and decisions about investment trade-offs using a requirements analysis that draws on community input, information on requirements contained in research proposals, allocation requests, and foundation-wide information gathering.

Recommendation 3.2. NSF should construct and periodically update roadmaps for advanced computing that reflect these requirements and anticipated technology trends to help NSF set priorities and make more strategic decisions about science and engineering and to enable the researchers that use advanced computing to make plans and set priorities.

Recommendation 3.3. NSF should document and publish on a regular basis the amount and types of advanced computing capabilities that are needed to respond to science and engineering research opportunities.

Recommendation 3.4. NSF should employ this requirements analysis and resulting roadmaps to explore whether there are more opportunities to use shared advanced computing facilities to support individual science programs such as Major Research Equipment and Facilities Construction projects.

Recommendation 4. NSF should adopt approaches that allow investments in advanced computing hardware acquisition, computing services, data services, expertise, algorithms, and software to be considered in an integrated manner.

Recommendation 4.1. NSF should consider requiring that all proposals contain an estimate of the advanced computing resources required to carry out the proposed work and creating a standardized template for collection of the information as one step of potentially many toward more efficient individual and collective use of these finite, expensive, shared resources. (This information would also inform the requirements process.)

Recommendation 4.2. NSF should inform users and program managers of the cost of advanced computing allocation requests in dollars to illuminate the total cost and value of proposed research activities.

C. Aid the scientific community in keeping up with the revolution in computing

Recommendation 5. NSF should support the development and maintenance of expertise, scientific software, and software tools that are needed to make efficient use of its advanced computing resources.

Recommendation 5.1. NSF should continue to develop, sustain, and leverage expertise in all programs that supply or use advanced computing to help researchers use today's advanced computing more effectively and prepare for future machine architectures.

Recommendation 5.2. NSF should explore ways to provision expertise in more effective and scalable ways to enable researchers to make their software more efficient; for instance, by making more pervasive the XSEDE (Extreme Science and Engineering Discovery Environment) practice that permits researchers to request an allocation of staff time along with computer time.

Recommendation 5.3. NSF should continue to invest in and support scientific software and update the software to support new systems and incorporate new algorithms, recognizing that this work is not primarily a research activity but rather is support of software infrastructure.

Recommendation 6. NSF should also invest modestly to explore next-generation hardware and software technologies to explore new ideas for delivering capabilities that can be used effectively for scientific research, tested, and transitioned into production where successful. Not all communities will be ready to adopt radically new technologies quickly, and NSF should provision advanced computing resources accordingly.

D. Sustain the infrastructure for advanced computing

Recommendation 7. NSF should manage advanced computing investments in a more predictable and sustainable way.

Recommendation 7.1. NSF should consider funding models for advanced computing facilities that emphasize continuity of support.

Recommendation 7.2. NSF should explore and possibly pilot the use of a special account (such as that used for Major Research Equipment and Facilities Construction) to support large-scale advanced computing facilities.

Recommendation 7.3. NSF should consider longer-term commitments to center-like entities that can provide advanced computing resources and the expertise to use them effectively in the scientific community.

Recommendation 7.4. NSF should establish regular processes for rigorous review of these center-like entities and not just their individual procurements.

When I asked Gropp about the report's main message, he told me that "the community needs to get involved for the NSF to implement the recommendations." That's because we'll need to do a better job of describing our needs and our scientific plans. Gropp emphasized that it's important to distinguish between our wants and our needs. For example, Recommendation 3 calls on the NSF to collect information on the needs of the scientific community for advanced computing—one possibility is that all grant applications will need to supply information about their computing needs in a standard form (see recommendation 4.1).

The report also emphasizes that data-driven science needs to be supported along with simulation. The latter has often driven machine design, but there are many interesting scientific problems for which access to large amounts of data is the bottleneck, and there are also now many simulations that produce large volumes of data that must be read, stored, and visualized. It will be best to purchase computers that can support both requirements well.

"For many years, we have been blessed with rapid growth in computing power," Gropp stated, but in referring to stagnant clock speeds, he noted, "that period is over." New supercomputers are going to employ new technologies that will require new programming techniques to deal with the massive parallelism and deep memory hierarchies. Gropp quoted Ken Kennedy as saying that software transformations can take 10 years to reach maturity.


I note that my own community is eight years into GPU code development and three to four years into development for Intel Xeon Phi. The effort is continuing in anticipation of the next generation of supercomputers. The report strongly emphasizes that the NSF must help users to adapt their codes (Recommendation 5 and its subpoints).

Before my conversation with Gropp ended, I asked him about the delay from the original mid-2015 target date for the report's release. He mentioned the "grueling review process" and the need to respond to every comment. However, he said there were many thoughtful, useful comments and that responding to them made the report much better. Finally, Gropp left me with the thought that "Writing the report is not the end, it is the beginning." I certainly hope that my fellow CiSE readers will take that to heart and get involved with helping the NSF plan for our needs for advanced computing. You can find the entire report at http://tinyurl.com/advcomp17-20.

Steven Gottlieb is a distinguished professor of physics at Indiana University, where he directs the PhD minor in scientific computing. He's also an associate editor in chief of CiSE. Gottlieb's research is in lattice quantum chromodynamics, and he has a PhD in physics from Princeton University. Contact him at [email protected].

Stay relevant with the IEEE Computer Society

More at www.computer.org/myCS

Keeping YOU at the Center of Technology

Publications your way, when you want them. The future of publication delivery is now. Check out myCS today!

Mobile-friendly or desktop. Customizable. myCS Personal Archive: issues and search or retrieve them quickly on your personal myCS site.


GUEST EDITORS’ INTRODUCTION


Science as a Service

Ravi Madduri and Ian Foster | Argonne National Laboratory and the University of Chicago

Researchers are increasingly taking advantage of advances in cloud computing to make data analysis available as a service. As we see from the articles in this special issue, the science-as-a-service approach has many advantages: it accelerates the discovery process via a separation of concerns, with computational experts creating, managing, and improving services, and researchers using them for scientific discovery. We also see that making scientific software available as a service can lower costs and pave the way for sustainable scientific software. In addition, science services let users share their analyses, discover what others have done, and provide infrastructure for reproducing results, reanalyzing data, backward tracking rare or interesting events, performing uncertainty analysis, and verifying and validating experiments. Generally speaking, this approach lowers barriers to entry to large-scale analysis for theorists, students, and nonexperts in high-performance computing. It permits rapid hypothesis testing and exploration as well as serving as a valuable tool for teaching.

Computation and automation are vital in many scientific domains. For example, the decreased sequencing costs in biology have transformed the field from a data-limited to a computationally-limited discipline. Increasingly, researchers must process hundreds of sequenced genomes to determine statistical significance of variants. When datasets were small, they could be analyzed on PCs in modest amounts of time: a few hours or perhaps overnight. However, this approach does not scale to large, next-generation sequencing datasets—instead, researchers require high-performance computers and parallel


algorithms if they are to analyze their data in a timely manner. By leveraging services such as the cloud-based Globus Genomics, researchers can analyze hundreds of genomes in parallel using just a browser.

In this special issue, we present three great examples of efforts in science as a service. In "A Case for Data Commons: Toward Data Science as a Service," Robert L. Grossman and his colleagues present a flexible computational infrastructure that supports various activities in the data life cycle such as discovery, storage, analysis, and long-term archiving. The authors present a vision to create a data commons and discuss challenges that result from a lack of appropriate standards.

In "MRICloud: Delivering High-Throughput MRI Neuroinformatics as Cloud-Based Software as a Service," Susumu Mori and colleagues present MRICloud, a science as a service for large-scale analysis of brain images. This article illustrates how researchers can make novel analysis capabilities available to the scientific community at large by outsourcing key capabilities such as high-performance computing.

Finally, in "WaveformECG: A Platform for Visualizing, Annotating, and Analyzing ECG Data," Raimond Winslow and colleagues present a service for analyzing electrocardiogram data that lets researchers upload time-series ECG data and provides analysis capabilities to enable discovery of the underlying aspects of heart disease. WaveformECG is accessible through a browser and provides interactive analysis, visualization, and annotation of waveforms using standard medical terminology.

As adoption of public cloud computing resources for science increases, science as a service provides a great way to create sustainable, reliable services that accelerate the scientific discovery process and improve the adoption of various tools and thus increase software reuse.

Ravi Madduri is a project manager and a Senior Fellow at the Computation Institute at the Argonne National Laboratory and the University of Chicago. His research interests include high-performance computing, workflow technologies, and distributed computing. Madduri has an MS in computer science from the Illinois Institute of Technology. Contact him at [email protected].

Ian Foster is director of the Computation Institute, a joint institute of the University of Chicago and Argonne National Laboratory. He is also an Argonne Senior Scientist and Distinguished Fellow and the Arthur Holly Compton Distinguished Service Professor of Computer Science. His research deals with distributed, parallel, and data-intensive computing technologies, and innovative applications of those technologies to scientific problems in such domains as climate change and biomedicine. Foster received a PhD in computer science from Imperial College, United Kingdom.

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.

COMPUTER ENTREPRENEUR AWARD

DEADLINE FOR 2017 AWARD NOMINATIONS

DUE: 15 OCTOBER 2016

In 1982, on the occasion of its thirtieth anniversary, the IEEE Computer Society established the Computer Entrepreneur Award to recognize and honor the technical managers and entrepreneurial leaders who are responsible for the growth of some segment of the computer industry ... years earlier, and the industry ... openly visible.

All members of the profession are invited to nominate a colleague who they consider most eligible to be considered for this award. Awarded to individuals whose entrepreneurial leadership is responsible for the growth of some segment of the computer industry.

AWARD SITE: https://www.computer.org/web/awards/entrepreneur

www.computer.org/awards



SCIENCE AS A SERVICE

A Case for Data Commons: Toward Data Science as a Service

Robert L. Grossman, Allison Heath, Mark Murphy, and Maria Patterson | University of Chicago

Walt Wells | Center for Computational Science Research

Data commons collocate data, storage, and computing infrastructure with core services and commonly used tools and applications for managing, analyzing, and sharing data to create an interoperable resource for the research community. An architecture for data commons is described, as well as some lessons learned from operating several large-scale data commons.

With the amount of available scientific data being far larger than the ability of the research community to analyze it, there's a critical need for new algorithms, software applications, software services, and cyberinfrastructure to support data throughout its life cycle in data science. In this article, we make a case for the role of data commons in meeting this need. We describe the design and architecture of several data commons that we've developed and operated for the research community in conjunction with the Open Science Data Cloud (OSDC), a multipetabyte science cloud that the nonprofit Open Commons Consortium (OCC) has managed and operated since 2009.1 One of the distinguishing characteristics of the OSDC is that it interoperates with a data commons containing over 1 Pbyte of public research data through a service-based architecture. This is an example of what is sometimes called "data as a service," which plays an important role in some science-as-a-service frameworks.

There are at least two definitions for science as a service. The first is analogous to the software-as-a-service2 model, in which instead of managing data and software locally using your own storage and computing resources, you use the storage, computing, and software services offered by a service provider, such as a cloud service provider (CSP). With this approach, instead of setting up his or her own storage and computing infrastructure and installing the required software, a scientist uploads data to a CSP and uses preinstalled software for data


analysis. Note that a trained scientist is still required to run the software and analyze the data. Science as a service can also refer more generally to a service model that relaxes the requirement of needing a trained scientist to process and analyze data. With this service model, specific software and analysis tools are available for specific types of scientific data, which is uploaded to the science-as-a-service provider, processed using the appropriate pipelines, and then made available to the researcher for further analysis if required. Obviously these two definitions are closely connected in that a scientist can set up the required science-as-a-service framework, as in the first definition, so that less-trained technicians can use the service to process their research data, as in the second definition. By and large, we focus on the first definition in this article.

There are various science-as-a-service frameworks, including variants of the types of clouds formalized by the US National Institute of Standards and Technology (infrastructure as a service, platform as a service, and software as a service),2 as well as some more specialized services that are relevant for data science (data science support services and data commons):

■ data science infrastructure and platform services, in which virtual machines (VMs), containers, or platform environments containing commonly used applications, tools, services, and datasets are made available to researchers (the OSDC is an example);

■ data science software as a service, in which data is uploaded and processed by one or more applications or pipelines and results are stored in the cloud or downloaded (general-purpose platforms offering data science as a service include Agave,3 as well as more specialized services, such as those designed to process genomics data);

■ data science support services, including data storage services, data-sharing services, data transfer services, and data collaboration services (one example is Globus4); and

■ data commons, in which data, data science computing infrastructure, data science support services, and data science applications are collocated and available to researchers.

Data Commons

When we write of a "data commons," we mean cyberinfrastructure that collocates data, storage, and computing infrastructure with commonly used tools for analyzing and sharing data to create an interoperable resource for the research community.

In the discussion below, we distinguish among several stakeholders involved in data commons: the data commons service provider (DCSP), which is the entity operating the data commons; the data contributor (DC), which is the organization or individual providing the data to the DCSP; and the data user (DU), which is the organization or individual accessing the data. (Note that there's often a fourth stakeholder: the DCSP associated with the researcher accessing the data.) In general, there will be an agreement, often called the data contributors agreement (DCA), governing the terms by which the data is managed by the DCSP and the researchers accessing the data, as well as a second agreement, often called the data access agreement (DAA), governing the terms of any researcher who accesses the data.

As we describe in more detail later, we've built several data commons since 2009. Based on this experience, we've identified six main requirements that, if followed, would enable data commons to interoperate with each other, science clouds,1 and other cyberinfrastructure supporting science as a service:

■ Requirement 1, permanent digital IDs. The data commons must have a digital ID service, and datasets in the data commons must have permanent, persistent digital IDs. Associated with digital IDs are access controls specifying who can access the data and metadata specifying additional information about the data. Part of this requirement is that data can be accessed from the data commons through an API by specifying its digital ID.

■ Requirement 2, permanent metadata. There must be a metadata service that returns the associated metadata for each digital ID. Because the metadata can be indexed, this provides a basic mechanism for the data to be discoverable.

■ Requirement 3, API-based access. Data must be accessed by an API, not just by browsing through a portal (see the sketch after this list). Part of this requirement is that a metadata service can be queried to return a list of digital IDs that can then be retrieved via the API. For those data commons that contain controlled access data, another component of the requirement is that there's an authentication and authorization service so that users can first be authenticated and the data commons can check whether they are authorized to have access to the data.

■ Requirement 4, data portability. The data must be portable in the sense that a dataset in a data


commons can be transported to another data commons and be hosted there. In general, if data access is through digital IDs (versus referencing the data's physical location), then software that references data shouldn't have to be changed when data is rehosted by a second data commons.

■ Requirement 5, data peering. By "data peering," we mean an agreement between two data commons service providers to transfer data at no cost so that a researcher at data commons 1 can access data commons 2. In other words, the two data commons agree to transport research data between them with no access charges, no egress charges, and no ingress charges.

■ Requirement 6, pay for compute. Because, in practice, researchers' demand for computing resources is larger than available computing resources, computing resources must be rationed, either through allocations or by charging for their use. Notice the asymmetry in how a data commons treats storage and computing infrastructure. When data is accepted into a data commons, there's a commitment to store and make it available for a certain period of time, often indefinitely. In contrast, computing over data in a data commons is rationed in an ongoing fashion, as is the working storage and the storage required for derived data products, either by providing computing and storage allocations for this purpose or by charging for them. For simplicity, we refer to this requirement as "pay for computing," even though the model is more complicated than that.
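To make Requirements 1 through 3 concrete, the following Python sketch shows what digital-ID-based, API-driven access might look like from a client's point of view. The endpoint, paths, and field names are illustrative assumptions, not the actual API of any OSDC or OCC data commons.

```python
import requests

# Hypothetical data commons endpoint; illustrative only.
COMMONS = "https://datacommons.example.org"

def find_datasets(query):
    """Query the metadata service (Requirement 2) for digital IDs matching a term."""
    resp = requests.get(f"{COMMONS}/metadata", params={"q": query})
    resp.raise_for_status()
    return [entry["did"] for entry in resp.json()["results"]]

def fetch_dataset(did, token=None):
    """Retrieve a dataset by its permanent digital ID (Requirements 1 and 3).
    Controlled-access data would also require an authorization token."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    resp = requests.get(f"{COMMONS}/data/{did}", headers=headers)
    resp.raise_for_status()
    return resp.content

if __name__ == "__main__":
    for did in find_datasets("landsat"):
        print(did, len(fetch_dataset(did)), "bytes")
```

Because only digital IDs, not physical locations, appear in such client code, the same script would keep working if a dataset were rehosted by a second data commons, which is the point of Requirement 4.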

Although very important for many applications, we view other services, such as those for providing data provenance,5 data replication,6 and data collaboration,7 as optional and not core services.

OSDC and OCC Data Commons

The OSDC is a multipetabyte science cloud that serves the research community by collocating a multidisciplinary data commons containing approximately 1 Pbyte of scientific data with cloud-based computing, high-performance data transport services, and VM images and shareable snapshots containing common data analysis pipelines and tools.

The OSDC is designed to provide a long-term persistent home for scientific data, as well as a platform for data-intensive science, allowing new types of data-intensive algorithms to be developed, tested, and used over large sets of heterogeneous scientific data. Recently, OSDC researchers have logged about two million core hours each month, which translates to more than US$800,000 worth of cloud computing services (if purchased through Amazon Web Services' public cloud). This equates to more than 12,000 core hours per user, or a 16-core machine continuously used by each researcher on average.
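As a rough, back-of-the-envelope consistency check on these figures (added here for illustration, not taken from the article), a 16-core machine running continuously for a month and the reported monthly total line up as follows:

\[
16 \,\text{cores} \times 24 \,\tfrac{\text{hours}}{\text{day}} \times 30 \,\text{days} \approx 11{,}500 \,\text{core hours per month},
\]
\[
\frac{2{,}000{,}000 \,\text{core hours per month}}{12{,}000 \,\text{core hours per user}} \approx 170 \,\text{active researchers per month}.
\]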

OSDC researchers used a total of more than 18 million core hours in 2015. We currently target operating OSDC computing resources at approximately 85 percent of capacity, and storage resources at 80 percent of capacity. Given these constraints, we can determine how many researchers to support and what size allocations to provide them. Because the OSDC specializes in supporting data-intensive research projects, we've chosen to target researchers who need larger-scale resources (relative to our total capacity) for data-intensive science. In other words, rather than support more researchers with smaller allocations, we support fewer researchers with larger allocations. Table 1 shows the number of times researchers exceeded the indicated number of core hours in a single month during 2015.

The OSDC Community

The OSDC is developed and operated by the Open Commons Consortium, a nonprofit that supports the scientific community by operating data commons and cloud computing infrastructure to support scientific, environmental, medical, and healthcare-related research. OCC members and partners include universities (University of Chicago, Northwestern University, University of Michigan), companies (Yahoo, Cisco, Infoblox), US government agencies and national laboratories (NASA, NOAA), and international partners (Edinburgh University, University of Amsterdam, Japan's National Institute of Advanced Industrial Science and Technology). The OSDC is a joint project with the University of Chicago, which provides the OSDC's datacenter. Much of the support for the OSDC came from the Moore Foundation and from corporate donations.

The OSDC has a wide-reaching, multicampus, multi-institutional, interdisciplinary user base and has supported more than 760 research projects since its inception.

Table 1. Data-intensive users supported by the Open Science Data Cloud.

No. core hours per month    No. users
20,000                      120
50,000                      34
100,000                     23
200,000                     5


In 2015, 470 research groups from 54 universities in 14 countries received OSDC allocations. In a typical month (November 2015), 186 of these research groups were active. The most computationally intensive group projects in 2015 included projects around biological sciences and genomics research, analysis of Earth science satellite imagery data, analysis of text data in historical and scientific literature, and a computationally intensive project in sociology.

OCC Data Commons

The OCC operates several data commons for the research community.

OSDC data commons. We introduced our first data commons in 2009. It currently holds approximately 800 Tbytes of public open access research data, including Earth science data, biological data, social science data, and digital humanities data.

Matsu data commons. The OCC has collaborated with NASA since 2009 on Project Matsu, a data commons that contains six years of Earth Observing-1 (EO-1) data, with new data added daily, as well as selected datasets from other NASA satellites, including NASA's Moderate Resolution Imaging Spectrometer (MODIS) and the Landsat Global Land Surveys.

The OCC NOAA data commons. In April 2015, NOAA announced five data alliance partnerships (with Amazon, Google, IBM, Microsoft, and the OCC) that would have broad access to its data and help make it more accessible to the public. Currently, only a small fraction of the more than 20 Pbytes of data that NOAA has available in its archives is available to the public, but NOAA data alliance partners have broader access to it. The focus of the OCC data alliance is to work with the environmental research community to build an environmental data commons. Currently, the OCC NOAA data commons contains Nexrad data, with additional datasets expected in 2016.

National Cancer Institute's (NCI's) genomic data commons (GDC). Through a contract between the NCI and the University of Chicago and in collaboration with the OCC, we've developed a data commons for cancer data; the GDC contains genomic data and associated clinical data from NCI-funded projects. Currently, the GDC contains about 2 Pbytes of data, but this is expected to grow rapidly over the next few years.

Bionimbus protected data cloud. We also operate two private cloud computing platforms that are designed to hold human genomic and other sensitive biomedical data. These two clouds contain a variety of sensitive controlled-access biomedical data that we make available to the research community following the requirements of the relevant data access committees.

Common software stack. The core software stack for the various data commons and clouds described here is open source. Many of the components are developed by third parties, but some key services are developed and maintained by the OCC and other working groups. Although there are some differences between them, we try to minimize the differences between the software stacks used by the various data commons that we operate. In practice, as we develop new versions of the basic software stack, it usually takes a year or so until the changes can percolate throughout our entire infrastructure.

OSDC Design and Architecture

Figure 1 shows the OSDC’s architecture. We are currently transitioning from version 2 of the OSDC software stack1 to version 3. Both are based on OpenStack8 for infrastructure as a service. The primary change between version 2 and version 3 is that version 2 uses GlusterFS9 for storage, while version 3 uses Ceph10 for object storage in addition to OpenStack’s ephemeral storage. This is a significant user-facing change that comes with some tradeoffs. Version 2 used a POSIX-compliant file system for user home directory (scratch and persistent) data storage, which provides command-line utilities familiar to most OSDC users. Version 3’s object storage, however, provides the advantage of an increased level of interoperability, as Ceph’s object storage has an interface compatible with a large subset of Amazon’s S3 RESTful API in addition to OpenStack’s API.

In version 3, there’s thus a clearer distinction between scratch data and intermediate working results, which live on ephemeral storage that is simple to use and persists only until VMs are terminated, and longer-term data, which lives on object storage and requires the small extra effort of curation through the API interface. Although adopting object storage involves a learning curve, we’ve found that it’s small and easily overcome with examples in the documentation. Object storage also tempers the growth in storage usage that could otherwise stem from unnecessary data that isn’t actively removed.
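Because Ceph exposes an interface compatible with a large subset of Amazon’s S3 API, standard S3 tooling can typically be pointed at the object store. The following is a minimal sketch of that pattern, assuming an S3-compatible endpoint; the endpoint URL, bucket name, keys, and credentials are hypothetical rather than actual OSDC values.

```python
# Minimal sketch: talking to an S3-compatible object store (such as a Ceph RADOS
# Gateway) with boto3. The endpoint, bucket, keys, and credentials are hypothetical.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objstore.example.org",  # hypothetical S3-compatible endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Curate a longer-term result from ephemeral scratch space into object storage.
s3.upload_file("scratch/analysis.csv", "my-project-bucket", "runs/2016-09/analysis.csv")

# List what has been curated so far under this prefix.
resp = s3.list_objects_v2(Bucket="my-project-bucket", Prefix="runs/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```

The same calls work unchanged against Amazon S3 itself, which is the interoperability benefit described above.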

The OSDC has a portal called the Tukey portal, which provides a front-end Web portal interface for users to access, launch, and manage VMs


and storage. The Tukey portal interfaces with the Tukey middleware, which provides a secure authentication layer and interface between various software stacks. The OSDC uses federated login for authentication so that academic institutions with InCommon, CANARIE, or the UK Federation can use those credentials. We’ve worked with 145 academic universities and research institutions to release the appropriate attributes for authentication. We also support Gmail and Yahoo logins, but only for approved projects when other authentication options aren’t available.

We instrument all the resources that we operate so that we can meter and collect the data required for accounting and billing each user. We use Salesforce.com, one of the components of the OSDC that isn’t open source, to send out invoices. Even when computing resources are allocated and no payment is required, we’ve found that receipt of these invoices promotes responsible usage of OSDC community resources. We also operate

an interactive support ticketing system that tracks user support requests and system team responses for technical questions. Collecting this data lets us track usage statistics and build a comprehensive assessment of how researchers use our services.

While adding to our resources, we’ve developed an infrastructure automation tool called Yates to simplify bringing up new computing, storage, and networking infrastructure. We also try to automate as much of the security required to operate the OSDC as is practical.

The core OSDC software stack is open source, enabling interested parties to set up their own science cloud or data commons. The core software stack consists of third-party, open source software, such as OpenStack and Ceph, as well as open source software developed by the OSDC community. The latter is licensed under the open source Apache license. The OSDC does use some proprietary software, such as Salesforce.com for accounting and billing, as mentioned earlier.

Figure 1. The Open Science Data Cloud (OSDC) architecture. The various data commons that we have developed and operate share an architecture consisting of object-based storage, virtual machines (VMs) and containers for on-demand computing, and core services for digital IDs, metadata, data access, and access to computing resources, all of which are available through RESTful APIs. The data access and data submission portals are applications built using these APIs.


OCC Digital ID and Metadata Services

The digital ID (DID) service is accessible via an API that generates digital IDs, assigns key-value attributes to digital IDs, and returns key-value attributes associated with digital IDs. We also developed a metadata service that’s accessible via an API and can assign and retrieve metadata associated with a digital ID. Users can also edit metadata associated with digital IDs if they have write access to them. Due to different release schedules, there are some differences in the digital ID and metadata services between several of the data commons that we operate, but over time, we plan to converge these services.

Persistent Identifier Strategies

Although the necessity of assigning digital IDs to data is well recognized,11,12 there isn’t yet a widely accepted service for this purpose, especially for large datasets.13 This is in contrast to the generally accepted use of digital object identifiers (DOIs) or handles for referencing digital publications. An alternative to a DOI is an archival resource key (ARK), a Uniform Resource Locator (URL) that’s also a multipurpose identifier for information objects of any type.14,15 In practice, DOIs and ARKs are generally used to assign IDs to datasets, with individual communities sometimes developing their own IDs. DataCite is an international consortium that manages DOIs for datasets and supports services for finding, accessing, and reusing data.16 There are also services such as EZID that support both DOIs and ARKs.17

Given the challenges the community faces in coming to a consensus about which digital IDs to use, our approach has been to build an open source digital ID service that supports multiple kinds of digital IDs, supports “suffix pass-through,”13 and scales to large datasets.

Digital IDs

From the researcher viewpoint, the need for digital IDs associated with datasets is well appreciated.18,19

Here, we discuss some of the reasons that digital IDs are important for a data commons from an operational viewpoint.

First, with digital IDs, data can be moved from one physical location or storage system within a data commons to another without the need to change any code that references the data. As the amount of data grows, moving data between zones within a data commons or between storage systems becomes more and more common, and digital IDs allow this to take place without impeding researchers.

Second, digital IDs are an important component of the data portability requirement. More specifically, datasets can be moved between data commons, and, again, researchers don’t need to change their code. In practice, datasets can be migrated over time, with the digital IDs’ references updated as the migration proceeds.

Signpost is the digital ID service for the OSDC. Instead of using a hard-coded URL, the primary way to access managed data via the OSDC is through a digital ID. Signpost is an implementation of this concept via JavaScript Object Notation (JSON) documents.

The Signpost digital ID service integrates a mutable ID that’s assigned to the data with an immutable hash-based ID that’s computed from the data. Both IDs are accessible through a REST API interface. With this approach, data contributors can make updates to the data and retain the same ID, while the data commons service provider can use the hash-based ID to facilitate data management. To prevent unauthorized editing of digital IDs, an access control list (ACL) is kept for each digital ID, specifying the read/write permissions for different users and groups.

User-defined identities are flexible, can be of any format (including ARKs and DOIs), and provide a layer of human readability. They map to hashes of the identified data objects, with the bottom layer utilizing hash-based identifiers, which guarantee data immutability, allow for identification of duplicated data via hash collisions, and allow for verification upon retrieval. These map to known locations of the identified data.
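To make the layering concrete, here is a rough sketch of resolving such a layered digital ID; the base URL, endpoint paths, and JSON field names are hypothetical placeholders, not Signpost’s actual wire format.

```python
# Sketch: resolving a user-defined digital ID to its hash-based record and then to
# known data locations. The base URL, paths, and JSON fields are hypothetical.
import requests

BASE = "https://ids.example.org"

# Step 1: the human-readable alias (ARK, DOI, or other format) maps to content hashes.
alias = requests.get(f"{BASE}/alias/ark:/99999/fk4example").json()
# e.g. {"hashes": {"md5": "0f343b0931126a20f133d67c2b018a3b"}, "acl": ["read:public"]}

# Step 2: the immutable hash-based ID maps to the current storage locations.
record = requests.get(f"{BASE}/objects/{alias['hashes']['md5']}").json()
# e.g. {"urls": ["s3://commons-bucket/dataset.tar.gz"], "size": 1073741824}

for url in record["urls"]:
    print("data replica:", url)  # verify against the hash after download
```

Because researchers reference only the alias, the commons operator can move or replicate the underlying objects and simply update the location records, which is the data portability property described earlier.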

Metadata Service

The OSDC metadata service, Sightseer, lets users create, modify, and access searchable JSON documents containing metadata about digital IDs. The primary data can be accessed using Signpost and the digital ID. At its core, Sightseer places no restrictions on the JSON documents it can store. However, it has the ability to specify metadata types and associate them with JSON schemas, which helps prevent unexpected errors in metadata with defined schemas. Like Signpost, Sightseer provides ACLs that specify which users have read/write access to a specific JSON document.
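As an illustration of the kind of schema checking such a service can perform, the following sketch validates a metadata document against a JSON schema before it would be stored; the schema, field names, and document are invented for illustration and aren’t Sightseer’s actual metadata types.

```python
# Sketch: validating a metadata document against a JSON schema before storing it.
# The schema and document below are invented for illustration.
from jsonschema import ValidationError, validate

imaging_schema = {
    "type": "object",
    "required": ["digital_id", "instrument", "acquired"],
    "properties": {
        "digital_id": {"type": "string"},
        "instrument": {"type": "string"},
        "acquired": {"type": "string"},
    },
}

metadata = {
    "digital_id": "ark:/99999/fk4example",
    "instrument": "EO-1 Hyperion",
    "acquired": "2015-11-14",
}

try:
    validate(instance=metadata, schema=imaging_schema)
    print("metadata accepted")
except ValidationError as err:
    print("metadata rejected:", err.message)
```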

Case Studies

Two case studies illustrate some of the projects that can be supported with data commons.


Matsu

Project Matsu is a collaboration between NASA and the OCC that’s hosted by the University of Chicago, processes the data produced each day by NASA’s EO-1 satellite, and makes a variety of data products available to the research community, including flood maps. The raw data, processed data, and data products are all available through the OSDC. Project Matsu uses a framework called the OSDC Wheel to ingest raw data, process and analyze it, and deliver reports with actionable information to the community in near real time.20

Project Matsu uses the data commons architecture illustrated in Figure 1.

As part of Project Matsu, we host several focused analytic products with value-added data. Figure 2 shows a screenshot from one of these focused analytic products, the Project Matsu Namibia Flood Dashboard,20 which was developed as a tool for aggregating and rapidly presenting data and sources of information about ground conditions, rainfall, and other hydrological information to citizens and decision makers in the flood-prone areas of water basins in Namibia and the surrounding areas. The tool features a bulletin system that produces a short daily written report, a geospatial data visualization display using Google Maps/Earth and OpenStreetMap, methods for retrieving NASA images for a region of interest, and analytics for projecting flood potential using hydrological models. The Namibia Flood Dashboard is an important tool for developing better situational awareness and enabling fast decision making and is a model for the types of focused analytics products made possible by collocating related datasets with each other and with computational and analytic capabilities.

Bionimbus

The Bionimbus Protected Data Cloud21 is a petabyte-scale private cloud and data commons that has been operational since 13 March 2013. Since going online in 2013, it has supported more than 152 allocation recipients from over 35 different projects at 29 different institutions. Each month, Bionimbus provides more than 2.5 million core hours to researchers, which at standard Amazon AWS pricing would cost over $500,000. One of the largest users of Bionimbus is the Cancer Genome Atlas (TCGA)/International Cancer Genome Consortium (ICGC) PanCancer Analysis of Whole Genomes working group (PCAWG). PCAWG is currently undertaking a large-scale analysis of most of the world’s whole genome

Figure 2. A screenshot of part of the Namibia Flood Dashboard from 14 March 2014. This image shows water catchments (outlined and colored regions) and a one-day flood potential forecast of the area from hydrological models using data from the Tropical Rainfall Measuring Mission (TRMM), a joint space mission between NASA and the Japan Aerospace Exploration Agency.


cancer data available to the cancer community through the TCGA and ICGC consortia using several clouds, including Bionimbus.

Bionimbus also uses the data commons architecture illustrated in Figure 1. More specifically, the current architecture uses OpenStack to provide virtualized infrastructure, containers to provide a platform-as-a-service capability, and object-based storage with an AWS-compatible interface. Bionimbus is a National Institutes of Health (NIH) Trusted Partner22 that interoperates both with the NIH Electronic Research Administration Commons to authenticate researchers and with the NIH Database of Genotypes and Phenotypes system to authorize users to access specific controlled-access datasets, such as the TCGA dataset.

Discussion

Three projects that support infrastructures similar to the OCC data commons are described in the sidebar. With the appropriate services, data commons support three different but related functions. First, data commons can serve as a data repository or digital library for data associated with published research. Second, data commons can store data along with computational environments in VMs or containers so that computations supporting scientific discoveries can be reproducible. Third, data commons can serve as a platform, enabling future discoveries as more data, algorithms, and software applications are added to the commons.

Data commons fit well with the science-as-a-service model: although data commons allow researchers to download data, host it themselves, and analyze it locally, they also allow current data to be reanalyzed with new methods, tools, and applications using collocated computing infrastructure. New data can be uploaded for an integrated analysis, and hosted data can be made available to other resources and applications using a data-as-a-service model, in which data in a data commons is accessed through an API. A data-as-a-service model is enhanced when multiple data commons and science clouds peer so that data can be moved between them at no cost.

Related Work

Several projects share many of the goals of data commons in general, and the Open Commons Consortium (OCC) data commons in particular. Here, we discuss three of the most important: the National Institutes of Health (NIH) Big Data to Knowledge (BD2K) program, the Research Data Alliance (RDA), and the National Data Service (NDS).

The work described in the main text is most closely connected with the vision for a commons outlined by the BD2K program at the US National Institutes of Health.1 The commons described in this article can be viewed partly as an implementation of a commons that supports the principles of findability, accessibility, interoperability, and reusability,2 which are key requirements of the data-sharing component of the BD2K program.

Of the three projects mentioned, the largest and most mature is the RDA,3 the goals of which are to create concrete pieces of infrastructure that accelerate data sharing and exchange for a specific but substantive target community; adopt the infrastructure within the target community; and use the infrastructure to accelerate data-driven innovation.3

The goals of the NDS are to implement core services for discovering data; storing persistent copies of curated data and associated metadata; accessing data; linking data with other data, publications, and credit for reuse; and computing and analyzing data (www.nationaldataservice.org). Broadly speaking, the goals of the commons described here are similar to those of the NDS and, for this reason, the commons can be considered one possible way to implement the services proposed for the NDS.

The OCC and Open Science Data Cloud (OSDC) started in 2008, several years before BD2K, RDA, and NDS, and have been developing cloud-based computing and data commons services for scientific research projects ever since. Roughly speaking, the goals of these projects are similar, but the OSDC is strictly a science service provider and data commons provider, whereas the RDA is a much more general initiative. The BD2K program is focused on biomedical research, especially for NIH-funded researchers, while the NDS is a newer effort that involves the National Science Foundation supercomputing centers, their partners, and their users.

References

1. V. Bonazzi, “NIH Commons Overview, Framework & Pilots,” 2015; https://datascience.nih.gov/commons.
2. M.D. Wilkinson et al., “The FAIR Guiding Principles for Scientific Data Management and Stewardship,” Scientific Data, vol. 3, article 160018, 2016.
3. F. Berman, R. Wilkinson, and J. Wood, “Building Global Infrastructure for Data Sharing and Exchange through the Research Data Alliance,” D-Lib Magazine, vol. 20, 2014; www.dlib.org/dlib/january14/01guest_editorial.html.


Challenges

Perhaps the biggest challenge for data commons, especially large-scale data commons, is developing long-term sustainability models that support operations year after year.

Over the past several years, funding agencies have required data management plans for the dissemination and sharing of research results, but, by and large, they haven’t provided funding to support this requirement. As a result, a lot of data is searching for data commons and similar infrastructure, but very little funding is available to support this type of infrastructure.

Moreover, datacenters are sometimes divided into several “pods” to facilitate their management and build out; for lack of a better name, we sometimes use the term cyberpod to refer to the scale of a pod at a datacenter. Cyberinfrastructure at this scale is also sometimes called midscale computing,23 to distinguish it from the large-scale infrastructure available to Internet companies such as Google and Amazon and the HPC clusters generally available to campus research groups. A pod might contain 50 to several hundred racks of computing infrastructure. Large-scale Internet companies have developed specialized software for mid- to large-scale (datacenter-scale) computing,24 such as MapReduce (Google)25 and Dynamo (Amazon),26 but this proprietary software isn’t available to the research community. Although some software applications, such as Hadoop,23 are available to the research community and scale across multiple racks, there isn’t a complete open source software stack containing all the services required to build a large-scale data commons, including the infrastructure automation and management services, security services, and so on24 required to operate a data commons at midscale.

We single out three research challenges related to building data commons at the scale of cyberpods:

■ Software stacks for midscale computing. The first research challenge is to develop a scalable open source software stack that provides the infrastructure automation and monitoring, computing, storage, security, and related services required to operate at the scale of a cyberpod.

■ Datapods. The second research challenge is to develop data management services that scale out to cyberpods. We sometimes use the term datapods for data management infrastructure at this scale, that is, data management infrastructure that scales to midscale and larger computing infrastructure.

■ AnalyticOps. The third challenge is to develop an integrated development and operations methodology to support large-scale analysis and reanalysis of data. You might think of this as the analog of DevOps for large-scale data analysis.

An additional category of challenges is the lack of consensus within the research community for a core set of standards that would support data commons. There aren’t yet widely accepted standards for indexing data, APIs for accessing data, and authentication and authorization protocols for accessing controlled-access data.

Lessons Learned

Data reanalysis is an important capability. For many research projects, large datasets are periodically reanalyzed using new algorithms or software applications, and data commons are a convenient and cost-effective way to provide this service, especially as the data grows in size and becomes more expensive to transfer.

In addition, important discoveries are made at all computing resource levels. As mentioned, computing resources are rationed in a data commons (either directly through allocations or indirectly through charge backs). Typically, requests for computing allocations in a data commons span six to seven or more orders of magnitude, ranging from hundreds of core hours to tens of millions of core hours. The challenge is that important discoveries are usually made across the entire range of resource allocations, from the smallest to the largest. This is because when large datasets, especially multiple large datasets, are collocated, it’s possible to make interesting discoveries even with relatively small amounts of compute.

The tragedy of the commons can be alleviated with smart defaults in implementation. In the early stages of the OSDC, the number of users was smaller, and depletion of shared computational resources wasn’t an urgent concern. As the popularity of the system grew and attracted more users, we noted some user issues (for example, an increase in support tickets noting that larger VM instances wouldn’t launch) as compute core utilization surpassed 85 percent. Accounting and invoicing promote responsible usage of community resources. We also implemented a quarterly resource allocation system with a short survey to


users requiring opt-in for continued resource usage extending into the next quarter. This provides a more formal reminder every three months for users who are finishing research projects to relinquish their quotas, and it has been successful in tempering unnecessary core usage. Similarly, as we moved to object storage functionality, we noted more responsible usage of storage, as scratch space is in ephemeral storage and removed by default when the computing environment is terminated. The small extra effort of moving data to object storage via an API encourages more thoughtful curation and usage of resources.

Over the past several years, much of the research focus has been on designing and operating data commons and science clouds that are scalable, contain interesting datasets, and offer computing infrastructure as a service. We expect that as these types of science-as-a-service offerings become more common, there will be a variety of more interesting higher-order services, including discovery, correlation, and other analysis services that are offered within a commons or cloud and across two or more commons and clouds that interoperate.

Today, Web mashups are quite common, but analysis mashups, in which data is left in place but continuously analyzed as a distributed service, are relatively rare. As data commons and science clouds become more common, these types of services can be more easily built.

Finally, hybrid clouds will become the norm. At the scale of several dozen racks (a cyberpod), a highly utilized data commons in a well-run datacenter is less expensive than using today’s public clouds.22 For this reason, hybrid clouds consisting of privately run cyberpods hosting data commons that interoperate with public clouds seem to have certain advantages.

Properly designed data commons can serve several roles in science as a service: first, they can serve as an active, accessible, citable repository for research data in general and research data associated with published research papers in particular. Second, by collocating computing resources, they can serve as a platform for reproducing research results. Third, they can support future discoveries as more data is added to the commons, as new algorithms are developed and implemented in the commons, and as new software applications and tools are integrated into the commons. Fourth, they can serve as a core component in an interoperable “web of data” as the number of data commons begins to grow, as standards for data commons and their interoperability begin to mature, and as data commons begin to peer.

Acknowledgments

This material is based in part on work supported by the US National Science Foundation under grant numbers OISE 1129076, CISE 1127316, and CISE 1251201 and by National Institutes of Health/Leidos Biomedical Research through contracts 14X050 and 13XS021/HHSN261200800001E.

References

1. R.L. Grossman et al., “The Design of a Community Science Cloud: The Open Science Data Cloud Perspective,” Proc. High Performance Computing, Networking, Storage and Analysis, 2012, pp. 1051–1057.
2. P. Mell and T. Grance, The NIST Definition of Cloud Computing (Draft): Recommendations of the National Institute of Standards and Technology, Nat’l Inst. Standards and Tech., 2011.
3. R. Dooley et al., “Software-as-a-Service: The iPlant Foundation API,” Proc. 5th IEEE Workshop Many-Task Computing on Grids and Supercomputers, 2012; https://www.semanticscholar.org/paper/Software-as-a-service-the-Iplant-Foundation-Api-Dooley-Vaughn/ccde19b95773dbb55328f3269fa697a4a7d60e03/pdf.
4. I. Foster, “Globus Online: Accelerating and Democratizing Science through Cloud-Based Services,” IEEE Internet Computing, vol. 3, 2011, pp. 70–73.
5. Y.L. Simmhan, B. Plale, and D. Gannon, “A Survey of Data Provenance in E-Science,” ACM Sigmod Record, vol. 34, no. 3, 2005, pp. 31–36.
6. A. Chervenak et al., “Wide Area Data Replication for Scientific Collaborations,” Int’l J. High Performance Computing and Networking, vol. 5, no. 3, 2008, pp. 124–134.
7. J. Alameda et al., “The Open Grid Computing Environments Collaboration: Portlets and Services for Science Gateways,” Concurrency and Computation: Practice and Experience, vol. 19, no. 6, 2007, pp. 921–942.
8. K. Pepple, Deploying OpenStack, O’Reilly, 2011.
9. A. Davies and A. Orsaria, “Scale Out with GlusterFS,” Linux J., vol. 235, 2013, p. 1.
10. S.A. Weil et al., “Ceph: A Scalable, High-Performance Distributed File System,” Proc. 7th Symp. Operating Systems Design and Implementation, 2006, pp. 307–320.
11. M.S. Mayernik, “Data Citation Initiatives and Issues,” Bulletin Am. Soc. Information Science and Technology, vol. 38, no. 5, 2012, pp. 23–28.

12. R.E. Duerr et al., “On the Utility of Identification Schemes for Digital Earth Science Data: An Assessment and Recommendations,” Earth Science Informatics, vol. 4, no. 3, 2011, pp. 139–160.

13. C. Lagoze et al., “CED2AR: The Comprehensive Extensible Data Documentation and Access Repository,” Proc. IEEE/ACM Joint Conf. Digital Libraries, 2014, pp. 267–276.
14. J. Kunze, “Towards Electronic Persistence Using ARK Identifiers,” Proc. 3rd ECDL Workshop Web Archives, 2003; https://wiki.umiacs.umd.edu/adapt/images/0/0a/Arkcdl.pdf.
15. J.R. Kunze, The ARK Identifier Scheme, US Nat’l Library Medicine, 2008.
16. T. Pollard and J. Wilkinson, “Making Datasets Visible and Accessible: DataCite’s First Summer Meeting,” Ariadne, vol. 64, 2010; www.ariadne.ac.uk/issue64/datacite-2010-rpt.
17. J. Starr et al., “A Collaborative Framework for Data Management Services: The Experience of the University of California,” J. eScience Librarianship, vol. 1, no. 2, 2012, p. 7.
18. A. Ball and M. Duke, “How to Cite Datasets and Link to Publications,” Digital Curation Centre, 2011.
19. T. Green, “We Need Publishing Standards for Datasets and Data Tables,” Learned Publishing, vol. 22, no. 4, 2009, pp. 325–327.
20. D. Mandl et al., “Use of the Earth Observing One (EO-1) Satellite for the Namibia SensorWeb Flood Early Warning Pilot,” IEEE J. Selected Topics in Applied Earth Observations and Remote Sensing, vol. 6, no. 2, 2013, pp. 298–308.
21. A.P. Heath et al., “Bionimbus: A Cloud for Managing, Analyzing and Sharing Large Genomics Datasets,” J. Am. Medical Informatics Assoc., vol. 21, no. 6, 2014, pp. 969–975.
22. D.N. Paltoo et al., “Data Use under the NIH GWAS Data Sharing Policy and Future Directions,” Nature Genetics, vol. 46, no. 9, 2014, p. 934.
23. Future Directions for NSF Advanced Computing Infrastructure to Support US Science and Engineering in 2017–2020, Nat’l Academies Press, 2016.
24. L.A. Barroso, J. Clidaras, and U. Hölzle, “The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines,” Synthesis Lectures on Computer Architecture, vol. 8, no. 3, 2013, pp. 1–154.
25. J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Comm. ACM, vol. 51, no. 1, 2008, pp. 107–113.
26. G. DeCandia et al., “Dynamo: Amazon’s Highly Available Key-Value Store,” ACM SIGOPS Operating Systems Rev., vol. 41, no. 6, 2007, pp. 205–220.

Robert L. Grossman is director of the University of Chicago’s Center for Data Intensive Science, a professor in the Division of Biological Sciences at the University of Chicago, founder and chief data scientist of Open Data Group, and director of the nonprofit Open Commons Consortium. Grossman has a PhD from Princeton University from the Program in Applied and Computational Mathematics. He’s a Core Faculty and Senior Fellow at the University of Chicago’s Computation Institute. Contact him at [email protected].

Allison Heath is director of research for the University of Chicago’s Center for Data Intensive Science. Her research interests include scalable systems and algorithms tailored for data-intensive science, specifically with applications to genomics. Heath has a PhD in computer science from Rice University. Contact her at [email protected].

Mark Murphy is a software engineer at the University of Chicago’s Center for Data Intensive Science. His research interests include the development of software to support scientific pursuits. Murphy has a BS in computer science engineering and a BS in physics from the Ohio State University. Contact him at [email protected].

Maria Patterson is a research scientist at the University of Chicago’s Center for Data Intensive Science. She also serves as scientific lead for the Open Science Data Cloud and works with the Open Commons Consortium on its Earth science collaborations with NASA and NOAA. Her research interests include cross-disciplinary scientific data analysis and techniques and tools for ensuring research reproducibility. Patterson has a PhD in astronomy from New Mexico State University. Contact her at [email protected].

Walt Wells is director of operations at the Open Commons Consortium. His professional interests include using open data and data commons ecosystems to accelerate the pace of innovation and discovery. Wells received a BA in ethnomusicology/folklore from Indiana University and is pursuing an MS in data science at CUNY. Contact him at [email protected].

Selected articles and columns from IEEE Computer Society publications are also available for free at

http://ComputingNow.computer.org.



SCIENCE AS A SERVICE

MRICloud: Delivering High-Throughput MRI Neuroinformatics as Cloud-Based Software as a Service

Susumu Mori and Dan Wu | Johns Hopkins University School of Medicine

Can Ceritoglu | Johns Hopkins University, Whiting School of Engineering

Yue Li | AnatomyWorks

Anthony Kolasny | Johns Hopkins University, Whiting School of Engineering

Marc A. Vaillant | Animetrics

Andreia V. Faria and Kenichi Oishi | Johns Hopkins University School of Medicine

Michael I. Miller | Johns Hopkins University, Whiting School of Engineering

MRICloud provides a high-throughput neuroinformatics platform for automated brain MRI segmentation and analytical tools for quantification via distributed client-server remote computation and Web-based user interfaces. This cloud-based service approach improves the efficiency of software implementation, upgrades, and maintenance. The client-server model is also ideal for high-performance computing, allowing distribution of computational servers and client interactions across the world.

In our laboratories at Johns Hopkins University, we have more than 15 years of experience in developing image analysis tools for brain magnetic resonance imaging (MRI) and in sharing the tools with research communities. The effort started when we developed DtiStudio in 2000,1 as an executable program that could be downloaded from our website to perform tensor calculation of diffusion tensor imaging and 3D white matter tract reconstruction. In 2006, two more programs (RoiEditor and DiffeoMap) joined the family that we collectively called MriStudio. These two programs were designed to perform ROI (region of interest)-based


image quantification for any type of brain MRI data. The ROI could be manually defined, but DiffeoMap introduced our first capability for automated brain segmentation. We based our work on a single-subject atlas with more than 100 defined brain regions that were automatically deformed to image data, thus transferring the predefined ROIs in the atlas to the target to achieve automated brain segmentation. We call this image analysis pipeline high-throughput neuroinformatics,2 as it offers the user the opportunity to reduce MR imagery on the order of O(10^6 to 10^7) variables to O(1,000) dimensions associated with the neuro-ontology of atlas-defined structures. These 1,000 dimensions are searchable and can be used to support diagnostic workflows.

The core atlas-to-data image analysis is based on advanced diffeomorphic image registration algorithms for positioning information in human anatomical coordinate systems.3 To position dense atlas-based image ontologies, we use image-based large deformation diffeomorphic metric mapping (LDDMM),4 which is most efficiently implemented using high-performance networked systems, especially for large-volume data such as high-resolution T1-weighted images. In 2006, the term cloud wasn’t yet widely used, but we employed a concept similar to that of cloud storage to solve this problem. Specifically, we used an IBM supercomputer at Johns Hopkins University’s Institute for Computational Medicine to remotely and transparently process user data. Since the introduction of DiffeoMap, approximately 50,000 whole-brain MRI datasets have been processed using this approach. The platform naturally evolved into MRICloud, which we introduced in December 2014 as a beta testing platform. This Web-based software follows a cloud-based software-as-a-service (SaaS) model.

After 15 years of software development, the number of MriStudio’s registered users now approaches 10,000, and in 2015, the number of datasets processed through the new cloud system reached a record of 3,500 per month. One motivation to adopt a cloud system is to exploit publicly available supercomputing systems for CPU- and memory-intensive operations. For example, although each MR image is typically 10 to 20 Mbytes, our current image-segmentation algorithm with 16 reference atlases requires approximately 5 Gbytes of memory per dataset. In the Extreme Science and Engineering Discovery Environment (XSEDE) extreme computing environment, the pipeline can parallelize the registration process of the 16 atlases and complete the calculation in 15 to 20 minutes using 8 cores for each atlas registration (a total of 128 cores), which is equivalent to consuming 32 to 43 CPU-hour service units (SUs). Through the cloud system, users can transparently access this type of high-performance computation facility and run large cohorts of data in a short amount of time, which is certainly an advantage. However, in the transition from a conventional executable-distribution model to a cloud platform, it has become apparent that computational power is only one advantage that a cloud system can offer in terms of science as a service. It also changes the efficiency of new tool development and dissemination, enabling services that weren’t previously possible. In this article, we share the experiences we have accumulated during our period of development.

Software as a Service

The core mapping service is a computationally demanding high-throughput image analysis algorithm that parcels brain MRIs into upward of 400 structures by positioning the labeled atlas ontologies into the coordinates of the brain targets. The approach assumes that there exists a structure-preserving correspondence, what we term a diffeomorphism: a one-to-one smooth mapping $\phi$ between the target $I(x)$, $x \in X$, and the atlas $I_{\text{atlas}}$. Here, $I$ and $I_{\text{atlas}}$ are the target and atlas images, respectively, $x$ and $X$ denote an image’s individual coordinates and spatial domain, and $\phi$ denotes the diffeomorphic transformation between the two images. The correspondence between the individual and the atlas is termed the DiffeoMap. We interpret the morphisms $\phi(x)$, $x \in X$, as carrying the contrast MR imagery $I(x)$, $x \in X$. The morphisms provide a GPS3 for both transferring the atlas’s ontological semantic labeling and providing coordinates to statistically encode the anatomy.

Personalization of atlas coordinates to the target occurs via a smooth transformation of the atlas that minimizes the distance $\inf d\big(I, I_{\text{atlas}} \circ \phi_1^{-1}\big)$ between the individual’s representation $I$ and the transformed atlas $I_{\text{atlas}} \circ \phi_1^{-1}$,5 with the transformation solving the flow equation $\dot{\phi}_t = v_t(\phi_t)$, $t \in [0, 1]$, and minimizing the integrated cost

$$\inf_{v,\ t \in [0,1]:\ \dot{\phi}_t = v_t(\phi_t),\ I_{\text{atlas}} \circ \phi_0 = I_{\text{atlas}}} \int_0^1 \lVert v_t \rVert_V \, dt,$$

where $v_t$ is the time-dependent velocity vector field of the flow of deformation, $\phi_t$ is the diffeomorphism


at time $t$, $\dot{\phi}_t$ denotes the first-order time derivative of $\phi_t$, and $\int_0^1 \lVert v_t \rVert_V \, dt$ denotes the integration of the norm of $v_t$ over the flow, with $V$ the Hilbert space of smooth vector fields. Figure 1 shows examples of our structure-function model, including T1- and T2-weighted structural contrast imagery, orientation vector imagery (such as diffusion tensor MRI), metabolism measured via magnetic resonance spectroscopy, and functional connectivity via resting-state functional MRI (rs-fMRI).6 Each atlas carries with it the means and variances associated with each high-dimensional feature vector.

Evolution of Software Architecture

To highlight the software architecture’s evolution (see Figure 2), let’s first look at the functions of three key software programs. DtiStudio is software developed on the Windows platform and written in C++ for core algorithms. It also contains components of MS-Visual C, MFC, and OpenGL. User data and the executable file are both located on users’ local computers (Figure 2a).1 The executable file needs to be downloaded from our website (www.mristudio.org), but all operations are performed within users’ local computers, including data I/O, calculations, and the visualization interface. The input data are raw diffusion-weighted images and associated parameters from MRI scanners, from which diffusion tensor matrices are calculated. The software also offers ROI drawing and tractography tools to define white matter tracts and perform quantifications.

DiffeoMap is an example of a model in which external computation power is incorporated based on a seamless communication scheme (Figure 2b).7

Figure 1. The structural-functional model of atlas-based MRI informatics. MRI imagery from different modalities, such as T1- and T2-weighted structural MRI, diffusion tensor MRI, functional and resting-state functional MRI, and MRI spectroscopy images, can be parcellated into predefined structures based on the presegmented MRI atlases. This allows for extraction of multicontrast features, from hundreds of anatomical structures to millions of voxels, in a reduced dimension.


The software reads two images (a reference brain atlas and a user-provided image), and one image is transformed into the shape of the other, thereby anatomically registering the voxel coordinates of the two images. Basic image transformation (voxel resizing, image cropping, and linear transformation) and associated functions (file format conversion, intensity matching) are performed locally. The data I/O and visualization interfaces also remain in the local Windows platform. However, diffeomorphic image transformation, which is too CPU-intensive for local PCs, is performed by a remote server. Communication with the remote server is performed through HTTPS and FTP protocols and through notification to users via email. Once users’ data are automatically sent to the remote server, the server performs the diffeomorphic transformation, and the resultant transformation matrices, which are typically about 1 Gbyte, can be retrieved by DiffeoMap through 32-bit data identifiers provided in the email.

MRICloud is the latest evolution of our software, in which computationally intensive algorithms are migrated to a remote server (Figure 2c). We adopted the cloud computing model, an attractive client-server model, because of its ease of scalability, portability, accessibility, and maintenance: it provides a “virtual” hardware environment that decouples the computer from the physical hardware. The computer is referred to as a virtual machine and behaves like a software program that can run on another computer. Abstracting the computer from hardware facilitates movement and scaling of virtual machines on the fly.

Cloud System Architecture

The main entry point to the server infrastructure is through either the MRICloud Web application or its accompanying RESTful Web API.

Figure 2. Schematic diagrams of the architectures of (a) DtiStudio, (b) DiffeoMap, and (c) MRICloud. DtiStudio is an example of a conventional distribution model, in which an executable file is downloaded to local computers and the entire process takes place within local computers. DiffeoMap has an internal architecture similar to that of DtiStudio, but the CPU-demanding calculations associated with DiffeoMap’s large deformation diffeomorphic mapping occur on a remote Unix/Linux server. For the MRICloud system, the entire calculation occurs in the remote server, and the communication with users relies on a Web interface. The system has flexible scalability and contains a storage system for temporary storage of user data.


Data payloads can be several hundred Mbytes, and a special jQuery interface is used to facilitate resumable uploads because they aren’t directly supported by the HTTP protocol. A successful upload returns a job identifier that references the data and its pipeline throughout the system. The job ID is used to check the status of the processing and to reference the resulting processed data to be downloaded.

Figure 3 outlines our back-end processing pipeline, which is built from standard legacy protocols on a LAMP (Linux, Apache, MySQL, PHP) stack, also including FTP, SSH, SMTP, and high-level scripting (BASH, PHP) that keeps the system lightweight, simple, robust, and easily maintainable. Once the data are uploaded, they’re repacked in a zip payload structure to move through the system, first moving into a queue via FTP. Once consumed by the monitor, the payload is validated for completeness. Then, an available computational resource (www.xsede.org or www.marcc.jhu.edu) is identified and the data are submitted to the cluster’s processing queue using an SSH signal. The cluster uses SMTP to signal that the job is submitted, and the monitor then polls for complete jobs. Upon job completion, the resulting data are moved to an FTP server, and the user is notified by email with the URI to retrieve the data. Alternatively, the user can check on the status of the processing at any time via the MRICloud website and retrieve the data from there if they’re ready.
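For illustration only, the following is a highly simplified sketch of the monitor’s role in this flow, watching an FTP-fed incoming directory, validating zip payloads, and signaling a cluster over SSH. The paths, hostname, and queue command are invented, and the real monitor is implemented with BASH and PHP rather than this Python loop.

```python
# Conceptual sketch of the monitor loop: watch an incoming FTP directory, validate
# zip payloads, and signal the cluster's queue over SSH. Paths, hostname, and the
# remote queue command are invented for illustration.
import subprocess
import time
import zipfile
from pathlib import Path

INCOMING = Path("/srv/ftp/incoming")
CLUSTER = "pipeline@cluster.example.org"

def valid_payload(zip_path):
    # "Validated for completeness": here, just confirm the archive isn't corrupt.
    try:
        with zipfile.ZipFile(zip_path) as archive:
            return archive.testzip() is None
    except zipfile.BadZipFile:
        return False

while True:
    for payload in INCOMING.glob("*.zip"):
        if valid_payload(payload):
            # Submit to the cluster's processing queue using an SSH signal.
            subprocess.run(["ssh", CLUSTER, "submit_job", payload.name], check=True)
            payload.rename(payload.with_suffix(".queued"))
    time.sleep(60)  # poll the incoming queue once a minute
```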

To facilitate a programmatic interface to the processing, the RESTful Web API provides a service that can ping the status of the processing, as well as another service for downloading the data. Therefore, a user can batch process and retrieve results without a human in the loop and is notified when the MRI images being processed are completed. An example protocol might be api_submit data; in a loop, api_job_status every 30 seconds until complete; and api_download result. As with any RESTful Web API, this can be done programmatically in any language that supports the HTTP protocol.
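As a sketch of that example protocol in Python, the loop below submits a payload, polls until completion, and downloads the result. The endpoint names mirror api_submit, api_job_status, and api_download from the example above, while the base URL, request parameters, and response fields are hypothetical.

```python
# Sketch of a batch client for the job-based pipeline. Endpoint names follow the
# example protocol in the text; the base URL and response fields are hypothetical.
import time
import requests

BASE = "https://mricloud.example.org/api"

def process_scan(zip_path):
    # Submit the zipped payload and obtain a job identifier.
    with open(zip_path, "rb") as payload:
        job_id = requests.post(f"{BASE}/api_submit", files={"data": payload}).json()["job_id"]

    # Poll the job status every 30 seconds until the pipeline reports completion.
    while requests.get(f"{BASE}/api_job_status", params={"job_id": job_id}).json()["status"] != "complete":
        time.sleep(30)

    # Download the processed result referenced by the job ID.
    result = requests.get(f"{BASE}/api_download", params={"job_id": job_id})
    with open(f"{job_id}_result.zip", "wb") as out:
        out.write(result.content)

process_scan("T1_subject01.zip")
```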

To secure the processing pipeline, SSH is core to data transfer and signaling commands on remote systems. The validation and cluster allocation server uses public and private keys with authorized key restrictions. The root of the server allows SSHFS to mount a restricted area of data space for processing storage.

Figure 3. Diagram of core MriStudio and MRICloud server components. MRICloud.org or MriStudio applications generate a zip from the users’ data, which are uploaded to an anonymous FTP server (ftp.mristudio.org). Another server (io19.cis.jhu.edu) monitors the incoming queue for new data. Upon arrival, this server validates the data, identifies a computation resource, and copies the data to one of the clusters (currently, http://icm.jhu.edu, www.xsede.org or www.marcc.jhu.edu). The data are then queued using an SSH signal. The validation and allocation server also monitors job completion and updates the job status at www.mricloud.org or sends an email to MriStudio users with a URL of the data location.


A user-level SSH public/private key with authorized_key restrictions is used to signal the cluster for job submission. The user-level account doesn’t have direct shell access to the cluster but moves data through the root-level SSHFS mount point.

Step-by-Step Procedure for the Cloud Service

To provide a clear illustration of how the cloud-based SaaS functions, Figure 4 shows the actual steps involved in the image analysis services. The first step is to create an account in the login window (Figure 4a). Once logged in, users have access to several SaaSs, including T1-based brain segmentation, diffusion tensor imaging (DTI) data processing, resting-state fMRI analysis, and arterial spin labeling data processing.

If a T1-based brain segmentation SaaS is chosen, a data upload page appears (Figure 4b), in which users need to choose several options, including the processing server, image orientation (sagittal or axial), and multiatlas library. Currently, the SaaS accepts a specific file format generated by a small program that needs to be downloaded from the MRICloud website. If users want to compare their data with the internal control data being logged within MRICloud, the demography information must also be provided. Users have a choice of two processing servers: the Johns Hopkins University IBM Blade computer, supported by the Institute for Computational Medicine (http://icm.jhu.edu), and the University of California, San Diego, Gordon computer from the US National Science Foundation for the Computational Anatomy Gateway via XSEDE (www.xsede.org). Thus far, the services at the Computational Anatomy Gateway have been supported by the XSEDE grant program, which allows us to provide the MRI SaaS to users free of charge. Our current effort is focused on utilizing publicly available computational resources to make them available for users. By tracking the SUs and computing resources consumed, we can compare efficiency in terms of computing consumption. SUs can be defined as SU = (wall time in minutes/60) * (total number of CPUs). Given a T1-segmentation pipeline using 16 atlases, the wall time on XSEDE resources would be 32 minutes. The SU and runtime increase with the number of atlases, which also depends on the available number of CPUs, as illustrated in Figure 5. On the current MRICloud platform, 45 and 30 atlases are in use for adult and pediatric target images, respectively. The results in Figure 5a highlight the importance of parallelization and the enhanced efficiency gained by employing supercomputing resources. The runtime of the pipeline decreases drastically as more CPUs become available. Figures 5b and 5c demonstrate the pipeline’s scalability when a large number of cores/CPUs are available.
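To make the SU formula concrete, here’s a small sketch using the figures quoted earlier for the 16-atlas pipeline (8 cores per atlas registration, 128 cores total, 15 to 20 minutes of wall time); the helper function is ours, not part of MRICloud.

```python
# Service units as defined in the text: SU = (wall time in minutes / 60) * total CPU count.
def service_units(wall_time_min, total_cpus):
    return (wall_time_min / 60) * total_cpus

print(service_units(15, 128))  # 32.0 SUs for a 15-minute run on 128 cores
print(service_units(20, 128))  # ~42.7 SUs for a 20-minute run on 128 cores
```

These two values bracket the 32 to 43 CPU-hour SUs cited earlier for the parallelized 16-atlas registration.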

Figure 4. The actual login window and one of the SaaS interfaces of MRICloud. (a) Registration and login: user registration and authentication are essential for cloud-based services. (b) Software-as-a-service (SaaS) interface: currently, six SaaSs have been tested, including brain segmentation, diffusion tensor imaging (DTI) calculation, resting-state functional MRI (rs-fMRI), arterial spin labeling, angio scan parameter calculation, and surface mapping. The figure shows the interface for the brain segmentation based on T1-weighted images.


As long as a sufficient number of CPUs is available, increasing the number of atlases does not increase the runtime.
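As a quick check of the SU definition above, the sketch below reproduces the 16-atlas example; the assumption that the run used 128 CPUs (8 per atlas, the pairing shown in Figures 5b and 5c) is ours, for illustration only.

# Service units as defined in the text: SU = (wall time in minutes / 60) * total CPUs.
def service_units(wall_time_min: float, total_cpus: int) -> float:
    return (wall_time_min / 60.0) * total_cpus

# The 16-atlas T1-segmentation example: 32 minutes of wall time,
# assuming 128 CPUs (8 per atlas) for illustration.
print(service_units(32, 128))   # about 68 CPU hours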

Service status can be monitored in the "my job status" section; once completed, the results can be downloaded or viewed from the same page (Figure 6a). The "view results" option opens a new webpage that allows visual inspection of the segmentation results, which are displayed in three orthogonal views and a 3D surface rendering (Figure 6b). Users can examine the quality of the segmentation with these views and, in addition, if the age is specified at data submission, the volume of each defined structure can be compared to age-matched controls based on z-scores or within an age-versus-volume plot. These control data are stored in MongoDB; results on the Web are updated in real time as the control database evolves.

The downloaded files contain information about the volumes and intensities of segmented structures. Currently, we offer atlas version 7a, which has 289 defined structures and a five-level ontological relationship among these structures, as described in our previously published paper.8 This service converts T1-weighted images with more than 1 million voxels to standardized and quantitative matrices of [volume, intensity] values for 289 structures.

This T1-based brain segmentation service is linked to other SaaSs provided by MRICloud. For example, the rs-fMRI and arterial spin labeling (ASL) services incorporate the structural segmentation results and perform structure-to-structure connectivity analysis (rs-fMRI) or structure-specific quantification of blood flow (ASL). In this way, the cloud system can be a platform to link multiple SaaSs.

Advantages and Limitations of the Cloud Service

A cloud-based SaaS lowers the threshold for adoption by users as well as developers: there are numerous steps software developers take for granted that aren't obvious to application scientists. Each step of software installation, source code compilation, upgrade, and management can be a major obstacle. Technologies that eliminate or minimize these processes can be major factors in whether tools are adopted in research communities.

The cloud-based SaaS also drastically changes the efficiency of software development. After more than 15 years of software development, we learned that writing new software is not even half the story. Manufacturers constantly change their data formats and image parameters. Computer operating systems undergo upgrades every few years, and there's no guarantee that our software will run on every new system. Then, of course, new versions of our software need to be updated, distributed, and adopted.

Figure 5. Computational performance of the T1 segmentation pipeline on XSEDE Stampede and Gordon clusters. (a) The system units (CPU hours) used in each cluster increase as the number of atlases and the number of CPUs increase. (b) and (c) The pipeline is scalable, with nearly constant runtime, if the available number of CPUs for each atlas is also constant. [Panels plot runtime (s) and system units (CPU hours) against CPU number (a) and atlas number/CPU number (b, c) for the Stampede and Gordon clusters.]


Upgraded documentation needs to be distributed, and if a bug is found, we need to make sure all users receive a revised version. Failure to keep up with these efforts often leads to software becoming extinct. Laboratories that support image analysis software soon become swamped by software and user maintenance.

The cloud approach really shines in this respect. Only a Web browser is required on the user's machine, with very lightweight hardware specifications, because the heavy lifting is done at the server and communicated to the browser over the Internet. As such, the applications are readily available to anyone capable of running a browser, typically with no further component installation. At the same time, the vendor is relieved of maintaining compatibility across varying local hardware configurations and can make application upgrades effective immediately, without cooperation from the user and without the need to maintain legacy versions. When a bug is found, all affected results can be traced, users notified, and results recalculated. Specifically, if a bug is found after a new version of the software or atlas resources is deployed, we can trace all affected data based on the dates of submission, unique data identifiers, and user email addresses recorded in our log. We then send bug notices to the users with lists of affected data and reprocess the data. This approach offloads a substantial amount of the maintenance burden and provides a much better user experience.
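The bug-tracing workflow just described (find every job processed with the affected version, then notify the submitting users) amounts to a filter over the submission log. The sketch below is a minimal illustration; the record layout and field names are assumptions, not the actual MRICloud log schema.

# Minimal sketch: find jobs affected by a buggy release and group them by user.
# The log records and field names are illustrative assumptions only.
from collections import defaultdict
from datetime import date

BUGGY_RELEASE = date(2016, 3, 1)   # hypothetical date the affected version went live
FIX_RELEASE = date(2016, 5, 15)    # hypothetical date the fix was deployed

job_log = [
    {"job_id": "a1", "user_email": "user1@example.edu", "submitted": date(2016, 4, 2)},
    {"job_id": "b2", "user_email": "user2@example.edu", "submitted": date(2016, 2, 10)},
]

affected = defaultdict(list)
for record in job_log:
    if BUGGY_RELEASE <= record["submitted"] < FIX_RELEASE:
        affected[record["user_email"]].append(record["job_id"])

for email, job_ids in affected.items():
    # In the real service these jobs would be reprocessed and the user emailed.
    print(f"notify {email}: affected jobs {job_ids}")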

That said, file format and Health Insurance Portability and Accountability Act (HIPAA) issues need to be addressed: the cloud-based approach requires users' data to be transferred outside the institution, which raises the HIPAA issue of protecting personally identifiable information. This issue is also related to a more general question about file formats and header information. Regardless of the image source, original files are in one of the vendor-specific Digital Imaging and Communications in Medicine (DICOM) formats (the files exported from the MRI scanners). Externally or internally, at some point we need a tool that can read DICOM files from all MRI vendors and all versions of MR operating systems. Once read, the files are usually stored in a more standardized file format, such as NIfTI or Analyze, although the standards for these third-party file formats still contain significant variability, and consistency isn't guaranteed. When the SaaS is provided, this poses a substantial challenge. One thing that's clear is that once the raw MRI DICOM data go through a third-party program, including the PACS (Picture Archive and Communication System), the SaaS needs to support a large number of file formats and matrix definitions, because the variability of the original DICOM formats is multiplied by the variability of the third-party file definitions. The only practical solution is to restrict the data format to the original DICOM formats. However, these often include personal identifiers. For research, this raises the question of whether use of a cloud service should be included in each project's Institutional Review Board approval.

Figure 6. Different interfaces. (a) The status of the submitted data can be monitored in the "my job status" window. (b) Once the job is completed, the results can be downloaded or visualized. The color coding is based on z-scores using age-matched internal control data; red is more than three standard deviations larger and green is more than three standard deviations smaller. The actual volume-age relationships of the internal control data and the submitted data can be shown by right-clicking the structure of interest. In the plot, green dots are from the internal data, which are connected to the data stored in MongoDB and updated in real time.


For clinical purposes, it isn't immediately clear whether hospitals would allow their data to migrate to an outside entity, even temporarily, without proper permission.

Our current approach is to develop a small executable program that can read original DICOM files from most vendors; it is distributed to each local client, and file conversion is executed from DICOM to two simple files, a raw image matrix and a header file that contains only the matrix dimension information. (De-identification and file standardization are performed on local computers before the data are uploaded to the cloud service.) This executable file needs to be constantly updated and distributed as vendors change their DICOM contents. In this sense, the cloud system requires a lightweight download of a preprocessing executable program and isn't completely free from distribution burdens and local computations. This strategy also means that all de-identification is accomplished by users prior to submission of their data to the service and, therefore, the SaaS is free from HIPAA issues. However, questions remain about the unique signatures embedded in the image. We can assume that highly processed data, such as the volumes of 100 structures, are essentially anonymized, but we can also argue that unique identifiers associated with imaging features are the purpose of the SaaS. Certainly, at some point we need to define a line where HIPAA is or isn't applicable, although such a boundary isn't entirely clear. In addition, for science as a service, it would be more beneficial for users if the HIPAA issue were handled on the server side. Another interesting strategy, which we're testing for clinical applications, is to transplant the entire cloud service behind an institutional firewall. This hybrid approach falls between a distribution and a cloud model; it is highly viable because the cloud architecture is portable and transplantation is relatively straightforward, but it loses several advantages, such as access to public supercomputing resources, and multiplies the effort needed to maintain the servers. These issues deserve more discussion for science as a service in the future.
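As a rough illustration of that client-side conversion, the sketch below reads a DICOM series with the pydicom library and writes a raw image matrix plus a minimal header holding only the matrix dimensions. The file names, header layout, and choice of libraries are assumptions for illustration; the real MRICloud converter's format may differ.

# Sketch of the client-side de-identification step: DICOM in, a raw matrix and a
# dimensions-only header out. Assumes the pydicom and numpy packages; file names
# and header layout are illustrative, not the actual MRICloud format.
from pathlib import Path
import numpy as np
import pydicom

def convert_series(dicom_dir: str, out_prefix: str) -> None:
    slices = [pydicom.dcmread(p) for p in sorted(Path(dicom_dir).glob("*.dcm"))]
    slices.sort(key=lambda s: float(s.InstanceNumber))       # order slices through the volume
    volume = np.stack([s.pixel_array for s in slices])       # axes: slice, row, column

    volume.astype(np.int16).tofile(f"{out_prefix}.rawimg")   # raw image matrix only
    with open(f"{out_prefix}.hdr", "w") as hdr:               # header: matrix dimensions only
        hdr.write("{} {} {}\n".format(*volume.shape))
    # No patient name, ID, or dates are copied, so identifiers never leave the local site.

convert_series("subject01_dicom", "subject01")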

The SaaS model is powerful when technologies are mature: deployment of a new SaaS is, in theory, straightforward if the core programs are written without relying on platform-dependent libraries. If we already have local executable files that perform certain types of image analysis, they can be implemented on a computation server and linked to a Web interface for users. However, this process does require communication between the program developers and the cloud system developers, who must agree on exactly what the inputs and outputs are. The Web interface is then designed to meet the specific needs.

During this process, however, it is important to realize that two phases in science as a service profoundly affect service design. In the first phase, users must have access to all of the service software's parameters to provide scientific freedom, to maximally explore the data contents, and to evaluate tool efficacy. This process is highly interactive and thus requires extensive user-interaction interfaces that let users visually inspect results and store intermediate files at each step. The cloud approach's performance could degrade if each interaction requires a large amount of data transfer. The Web interface also demands modern designs to efficiently handle the frequent interactions between local and remote computers, especially for complex visualization and graphics interfaces. In the second phase, the technology matures, tool efficacy is established, and the majority of users start to use the same parameter sets and protocols. Here the cloud system is more efficient, as information transfer occurs only twice: data upload and results download. One of the frequent questions we receive is, "Can we modify the segmentation results?" Unfortunately, in our current setup, the segmentation files must be downloaded to the local computer and modified by ROI-management software on the local PC. The cloud approach requires a balance between server- and client-side uploads and downloads. We find that our downloadable executable programs, such as MriStudio, provide advantages in terms of physical public-network separation among data storage, visualization, and memory-based compute engines, facilitating user feedback and stepwise quality-control monitoring. However, the scaling arguments, the ease of software maintenance and upgrades, and large-scale computation distribution through national computing networks give the cloud solution its own distinct advantages.

In the first phase, it's important to stress that the maturation process takes place both through users' experience in testing and parameter choices and through developers' efforts to revise the software to accommodate user requests for better or newer functionality. In this period of dynamic updates, high-level programming languages, such as Matlab and IDL, provide an ideal environment for efficient revision. This could also facilitate open source strategies and user participation in software development.


As the software matures and services are solidified, the final phase of the maturation process should be tested by processing a large amount of data. For example, a simple task such as skull-stripping has two modes of failure. In the early phase of tool development, the tool is improved by minimizing the leakage (or removal) of brain definitions (say, 5 percent leakage of voxels outside the brain), but in the later phase, as tool performance improves, our interest shifts to occasional failures (5 percent of the population) that are encountered only in large-scale analysis. At this point, the low computational efficiency of the high-level language starts to become a major obstacle due to low throughput: every time we make a minor modification, we need to wait a week to complete 100 test datasets. At some point, recoding in C++ is inevitable, which can improve computation time by as much as 10,000 times over the original, depending on the algorithms. Considering the nature of the cloud-based SaaS and its position in science as a service, it makes more sense to deploy software using the lower-level language. This is especially important when we utilize national computation facilities, because we need to make every effort to maximize those resources. One practical limitation, however, is that it isn't always easy to secure the human resources to support these types of efforts. As much as we need the expertise and knowledge of trainees and faculty in academic institutes, much of the crucial effort needed to develop a sophisticated cloud and SaaS isn't a subject for academic publications.

Impact of SaaS on Data Sharing

In recent years, data sharing has become an important National Institutes of Health policy, and many data are available in the public domain, including the Alzheimer's Disease Neuroimaging Initiative (ADNI; www.adni-info.org) for Alzheimer's disease, Pediatric Imaging, Neurocognition, and Genetics (PING; http://pingstudy.ucsd.edu) for normal pediatric data, and the National Database for Autism Research (NDAR; https://ndar.nih.gov) for autism research. What is common to these databases is the availability of raw data to which research communities can apply their own tools to extract biologically or clinically important findings. For these types of public databases, proactive plans and coordinated efforts, as well as funding, are needed to acquire data in a uniform manner, establish a database structure, gather data, and maintain them. SaaS introduces a very different perspective on database development and sharing because a large number of datasets conforming to a relatively uniform imaging protocol would be submitted by users in order to perform the automated image analysis. This data collection doesn't require coordinated planning or specific funding; users have the motivation to acquire images of the specified types, submit their data, and gain access to the automated segmentation services. The cloud host then has the opportunity to build two types of databases: users' raw images and processed data (such as the standardized anatomical feature vectors shown in Figure 1).

This indeed could be a new approach to facilitate efficient knowledge building and sharing. However, several hurdles should also be noted. First, our current SaaS has a rule to erase users' data after 60 days of storage. To retain the data, we would need not only much larger storage space but also permission from users. Storage of anatomical feature vectors, on the other hand, could be less of an issue because they're much smaller and highly de-identified. In either case, probably the largest limitation is the availability of the associated nonimage information. In a regular data submission, users submit their images without demographic and clinical information. The resultant database would then contain only anatomical features, which wouldn't be very useful for many application studies. It's relatively straightforward to build an interface to gather demographic and clinical information as part of a SaaS (see Figure 4b), but the barrier would be the extra effort required for users to compile and input that information at the time of data submission. The incentive, therefore, would be building a useful database through SaaS for future data sharing. For example, if the service included image interpretation (potential diagnoses and their likelihood) based on detailed clinical patient data, users might be willing to make the extra effort to submit additional information associated with the images. For the actual method of distributing data, we currently use the GitHub (https://github.com) repository, which has become a de facto site for data sharing. Our rich atlas resources are available through this channel.

What New Things Can We Do with the Cloud Service?

In the previous sections, we discussed the advantages and limitations of the cloud-based SaaS, highlighting differences from classical distribution models. In this section, we focus on service concepts that are only possible on a cloud platform. The key concept is "knowledge-driven" analysis.


Data Interpretation

Usually, the role of an image analysis tool is completed when the requested quantification is achieved. For example, in MRICloud, T1-weighted anatomical images are converted to standardized 300-element vectors (volumes of 300 defined structures; Figure 6). Users are supposed to have self-inclusive data that consist of patient groups and age-matched control groups. All these data go through the same high-throughput image analysis pipelines, the 300 volumes are defined, and the volume data are statistically compared between the groups. But what if the analyzed metadata stay on the server, just as the travel industry keeps its customers' travel information for mass analysis? We call these 300-element feature vectors "brainprints": like fingerprints, they describe each individual's uniqueness. If these brainprints have associated clinical and demographic information, we can provide interesting knowledge-based services. The plot in Figure 6b is an example of this idea, in which age-matched control data are provided from our internal database, as well as from publicly available data, such as ADNI (http://adni.loni.usc.edu) and PING (http://pingstudy.ucsd.edu). On their own, the brainprints are merely strings of numbers.

However, if age-matched control data are available, each element of the brainprint can be compared to normal values and, for example, converted to 300 z-score values. Namely, the availability of age-matched normal data lets us interpret brainprints. As a cloud-based service, the normal database can be centrally managed, enriched, and utilized in real time. Extending this idea, it's possible to perform pattern matching of the brainprints to identify past cases with similar anatomical features and provide reports of population statistics about the diagnosis and prognosis of the identified cases.2,9,10 This, however, is possible only if the cloud database contains a vast amount of patient data and each brainprint is associated with rich clinical information. In the past, there have been efforts to establish centralized image archives for various diseases. In fact, the clinical PACS in each hospital stores tens of thousands of images and a great deal of clinical information. The aforementioned data interpretation services weren't available, not because of a lack of data but because the data couldn't be effectively utilized. These data are fragmented, and they remain in "high-dimensional" raw image formats that aren't suitable for downstream analysis, such as feature extraction and comparison. Conversion to low-dimensional representations of the images, such as brainprints, remains a bottleneck to increasing the fluidity of data usage, but cloud-based operations could be the key to opening this bottleneck.
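The conversion from a brainprint to z-scores is straightforward once age-matched control means and standard deviations are available. The sketch below illustrates the idea; the structure names and all numbers are made up for illustration and are not MRICloud control values.

# Sketch: convert a few brainprint elements (structure volumes) to z-scores
# against age-matched control statistics. All numbers are illustrative only.
control_stats = {            # structure: (mean volume, std dev) for age-matched controls
    "hippocampus_L": (3200.0, 280.0),
    "hippocampus_R": (3300.0, 290.0),
    "ventricle_L":   (9800.0, 2100.0),
}
brainprint = {"hippocampus_L": 2550.0, "hippocampus_R": 2610.0, "ventricle_L": 14400.0}

z_scores = {
    name: (brainprint[name] - mean) / std
    for name, (mean, std) in control_stats.items()
}
print(z_scores)   # large negative values flag atrophy; large positive values flag enlargement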

The limitations of the data interpretation service should also be stressed. The normal or pathological data stored in the cloud database are external to the user data, meaning they're most likely acquired under different imaging protocols. It is, therefore, essential that the image analysis tools be robust against the reasonable range of protocol variability that could be encountered in research and clinical data. For example, the ADNI database contains data acquired at two magnetic field strengths (1.5 and 3.0 Tesla) and from three manufacturers. When these data, with six different protocols, are processed together in our pipeline, the age-dependent anatomical changes have been shown to have a much larger effect size than the protocol-associated bias.11 However, it's reasonable to assume that if we're interested in pathological effects that are much smaller than age effects, the users' own control data would be needed to minimize the protocol impact and maximize the study's sensitivity.

Multiatlas-Based Analysis

Multiatlas analysis is another example of a knowledge-based approach that benefits greatly from a cloud architecture. To highlight this point, Figure 7 shows the concept of atlas-based image segmentation. For a computer algorithm to define structures of interest, a teaching file is needed that defines each structure's location, shape, and intensity features. This teaching file is called the atlas. The simplest form is a single-subject atlas, in which various structures are defined based on one person's anatomy. This atlas can be warped to individual patient images, and the boundary definitions can be transferred to those images.12–15 This approach is, however, not accurate if the atlas-to-subject image registration is not perfect. In particular, it is difficult to perform perfect image matching for brain regions with high cross-subject variability, such as cortical areas. In more advanced approaches, probabilistic atlases were created from population data, in which each voxel contains the location and intensity probability of structural labels.16,17 For example, the location probability of a given voxel near the brain surface could be 33 percent white matter, 33 percent cortex, and 34 percent cerebrospinal fluid (CSF) based on the probabilistic atlas. The atlas also teaches the average intensities in T1-weighted images (after intensity normalization): for example, white matter is 211 +/- 19, gray matter is 142 +/- 31, and CSF is 81 +/- 11. If a voxel in the patient image has an intensity of 154, it will probably be assigned to the gray matter.


In this way, the probabilistic atlas can teach an algorithm about the anatomical signatures (locations and intensities) of each structure label such that the best labeling accuracy can be achieved. In the multiatlas framework, the process by which a probabilistic map is created is omitted, and multiple atlases are directly registered to the patient image, followed by an arbitration process.18–20 This process opens up many new possibilities for knowledge-based image analysis.
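The voxel example above (location priors of 33/33/34 percent and intensity models of 211 +/- 19, 142 +/- 31, and 81 +/- 11) can be written as a tiny maximum-a-posteriori decision. The sketch below reproduces that arithmetic as one plausible reading of the example, combining the atlas prior with a Gaussian intensity likelihood, and assigns the intensity-154 voxel to gray matter.

# Sketch of probabilistic-atlas labeling for a single voxel: combine the atlas
# location prior with a Gaussian intensity likelihood and pick the most probable label.
import math

priors = {"white_matter": 0.33, "gray_matter": 0.33, "csf": 0.34}              # from the atlas
intensity_models = {"white_matter": (211, 19), "gray_matter": (142, 31), "csf": (81, 11)}

def gaussian(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def label_voxel(intensity):
    posteriors = {
        label: priors[label] * gaussian(intensity, *intensity_models[label])
        for label in priors
    }
    return max(posteriors, key=posteriors.get)   # highest posterior wins

print(label_voxel(154))   # -> 'gray_matter'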

Figure 7. Evolution of atlas-based brain segmentation approaches (single atlas, probabilistic atlas, and multiatlas). In a single-atlas-based approach, only one atlas is warped to the target image and, at the same time, transfers its presegmented labels to the target image. In the probabilistic atlas-based approach, multiple atlases are warped to the target image, and a probabilistic map is generated by averaging the label definitions from all atlases; image intensity information can be incorporated to determine the final segmentation. The multiatlas-based approach also warps multiple atlases to the target image, but employs arbitration algorithms (typically, weighting and fusion) to combine the multiple atlas labels into the final segmentation.


For example, in the multiatlas framework, the atlas library can be enriched, altered, or revised easily without creating a population-averaged atlas. The appropriate atlases can be dynamically chosen from a library. The criteria for appropriateness could be nonimage attributes, such as age, gender, race, or diagnosis.21 Image-based attributes, such as intensity, shape, or the amount of atrophy, could be used to determine the contributions of the atlases.12,22,23 Extending this notion, the selection of images from an atlas library can evolve into content-based image retrieval (CBIR),24,25 and if the library is sufficiently large, with various pathological cases and rich clinical information, statistics about the retrieved images, such as diagnosis information, could be generated.

While multiatlas-based analysis opens interesting new research frontiers, it also poses unique challenges. First, the algorithm is CPU-intensive. For segmentation and mapping based on a single atlas or a probabilistic atlas, image registration is required only once. However, for 30 atlases, the registration has to be repeated 30 times, followed by another CPU-intensive arbitration process. If the user chooses to select a subset of "appropriate" atlases from a 300-atlas library, further calculation is needed. This also raises the issue of content management for the atlas libraries. Because the libraries of data sources are dynamically evolving in quantity and quality, with frequent updates, it isn't realistic to distribute the entire library to every user and provide version management. The cloud-based approach provides a high-performance computation environment and centralized management of the atlas libraries,26 therefore enabling advanced multiatlas technologies and applications.
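To make the arbitration step concrete, the following sketch fuses labels from several warped atlases by per-voxel majority vote, the simplest of the weighting-and-fusion schemes mentioned in the Figure 7 caption; production multiatlas pipelines typically use more sophisticated, similarity-weighted fusion.

# Sketch: per-voxel majority-vote fusion of labels transferred from multiple warped
# atlases. Assumes numpy; real pipelines weight atlases by local image similarity.
import numpy as np

def majority_vote(atlas_labels: list) -> np.ndarray:
    """atlas_labels: one integer label volume per atlas, all the same shape."""
    stacked = np.stack(atlas_labels)                 # shape: (n_atlases, ...volume shape)
    n_labels = int(stacked.max()) + 1
    votes = np.stack([(stacked == lab).sum(axis=0) for lab in range(n_labels)])
    return votes.argmax(axis=0)                      # most frequent label per voxel

# Toy example with three 2x2 "volumes" and labels {0, 1}:
a = np.array([[0, 1], [1, 1]])
b = np.array([[0, 1], [0, 1]])
c = np.array([[1, 1], [0, 0]])
print(majority_vote([a, b, c]))   # [[0 1] [0 1]]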

Linkage of Services

The cloud-based SaaS provides a unique platform for linking different types of service tools. Many researchers in the image analysis community write their own programs to analyze their data or to assist in the interpretation of MR scans. Many of these tools are highly valuable to the community, and the developers are willing to share them. However, for their programs to be widely adopted, they need to develop user interfaces, distribution channels, and user management systems, such as registration and communications. Based on our experience in developing the MriStudio software family, we know how time-consuming it is to develop new programs as stand-alone software. On the cloud platform, the addition of a new SaaS is straightforward as long as the tool is mature enough to follow the API, which is nothing more than defining input and output parameters. The developers can then take advantage of the existing infrastructure for supercomputing resources, processing status management, data upload/download functions, and user management, such as registration and notifications. This kind of expandability doesn't have to rely on a specific cloud platform like MRICloud, because other developers can create their own cloud platforms and access our SaaS without going through our cloud interface. This has important implications for future extensions of medical imaging informatics.

In the past, attempts have been made to integrate results from multiple contrasts, multiple imaging modalities, and multiple medical records. These integrative analyses have been hampered, however, by the need to ensure that data from each modality have already been standardized, quantified, and dimension-reduced. If we use the analogy of building a house, a cloud platform such as MRICloud serves as one of the foundations on which to build vertical columns that correspond to each SaaS. If we come up with a new image analysis tool, it can be integrated into one of the cloud foundations as a new service column. In this context, the cloud platform's role is to provide an environment in which to readily establish new columns. The real power of the cloud strategy then materializes when a "horizontal service" (corresponding to the roof of the house in this analogy) emerges, spanning not only multiple service columns but also multiple cloud foundations.

In the field of medical records, there are high expectations for integrating the big data associated with available medical records to create a knowledge database and provide personalized medicine by comparing the features of individual patients to that knowledge database. This is a typical example of a horizontal service, but if we open the electronic health records currently available in each hospital, we soon realize that the data aren't standardized, structured, quantitative, consistent, or cohesive. The integrative analysis by a horizontal service would become prohibitively difficult if, for example, one aspect of the data were a raw MR image that didn't specify where the brain is within its 8 million voxels (a 200 x 200 x 200 image dimension). The success of horizontal services, therefore, hinges on the proliferation of high-quality vertical services. This is somewhat akin to integrative travel services, such as Orbitz, Expedia, and Booking.com, which rely on reservation SaaSs for each hotel, airline, or rental car company.


Vertical SaaSs, established in many different medical applications, have the potential to be linked via third-party horizontal services to perform higher-order integrative analysis, which, in the future, could realize new medical informatics that we haven't yet imagined.

The architecture of our cloud platform allows for powerful computational resources beyond traditional software packages and also facilitates the future development of image analysis functions and the incorporation of new services. We are currently working on making new services associated with arterial spin labeling and functional MRI available through MRICloud.

Acknowledgments

This publication was made possible by the following grants: P41EB015909 (MIM, MS), R01EB017638 (MIM), and R01NS084957 (MS). A potential conflict of interest is that MS and MIM own AnatomyWorks, with MS serving as its CEO. This arrangement is being managed by Johns Hopkins University in accordance with its conflict-of-interest policies.

References

1. H.Y. Jiang et al., "DtiStudio: Resource Program for Diffusion Tensor Computation and Fiber Bundle Tracking," Computer Methods and Programs in Biomedicine, vol. 81, no. 2, 2006, pp. 106–116.

2. M.I. Miller et al., "High-Throughput Neuroimaging Informatics," Frontiers in Neuroinformatics, vol. 7, 2013, p. 31.

3. M.I. Miller, A. Trouve, and L. Younes, "Diffeomorphometry and Geodesic Positioning Systems for Human Anatomy," Technology, vol. 2, 2013; http://dx.doi.org/10.1142/S2339547814500010.

4. M.I. Miller et al., "Increasing the Power of Functional Maps of the Medial Temporal Lobe by Using Large Deformation Diffeomorphic Metric Mapping," Proc. Nat'l Academy of Sciences USA, vol. 102, no. 27, 2005, pp. 9685–9690.

5. M.F. Beg et al., "Computing Large Deformation Metric Mapping via Geodesic Flows of Diffeomorphisms," Int'l J. Computer Vision, vol. 61, 2005, pp. 139–157.

6. A.V. Faria et al., "Atlas-Based Analysis of Resting-State Functional Connectivity: Evaluation for Reproducibility and Multi-modal Anatomy-Function Correlation Studies," NeuroImage, vol. 61, no. 3, 2012, pp. 613–621.

7. K. Oishi et al., "Atlas-Based Whole Brain White Matter Analysis Using Large Deformation Diffeomorphic Metric Mapping: Application to Normal Elderly and Alzheimer's Disease Participants," NeuroImage, 19 Jan. 2009, pp. 486–499.

8. A. Djamanakova et al., "Tools for Multiple Granularity Analysis of Brain MRI Data for Individualized Image Analysis," NeuroImage, vol. 101, 2014, pp. 168–176.

9. A.V. Faria et al., "Content-Based Image Retrieval for Brain MRI: An Image-Searching Engine and Population-Based Analysis to Utilize Past Clinical Data for Future Diagnosis," NeuroImage: Clinical, vol. 7, 2015, pp. 367–376.

10. S. Mori et al., "Atlas-Based Neuroinformatics via MRI: Harnessing Information from Past Clinical Cases and Quantitative Image Analysis for Patient Care," Ann. Rev. Biomedical Eng., vol. 15, 2013, pp. 71–92.

11. Z. Liang et al., "Evaluation of Cross-Protocol Stability of a Fully Automated Brain Multi-atlas Parcellation Tool," PLoS One, vol. 10, no. 7, 2015, article no. e0133533.

12. T. Rohlfing et al., "Evaluation of Atlas Selection Strategies for Atlas-Based Image Segmentation with Application to Confocal Microscopy Images of Bee Brains," NeuroImage, vol. 21, no. 4, 2004, pp. 1428–1442.

13. B. Fischl et al., "Whole Brain Segmentation: Automated Labeling of Neuroanatomical Structures in the Human Brain," Neuron, vol. 33, no. 3, 2002, pp. 341–355.

14. D.L. Collins et al., "Automatic 3D Model-Based Neuroanatomical Segmentation," Human Brain Mapping, vol. 3, no. 3, 1995, pp. 190–208.

15. M.I. Miller et al., "Mathematical Textbook of Deformable Neuroanatomies," Proc. Nat'l Academy of Sciences USA, vol. 90, no. 24, 1993, pp. 11944–11948.

16. B. Fischl et al., "Automatically Parcellating the Human Cerebral Cortex," Cerebral Cortex, vol. 14, no. 1, 2004, pp. 11–22.

17. D.W. Shattuck et al., "Construction of a 3D Probabilistic Atlas of Human Cortical Structures," NeuroImage, vol. 39, no. 3, 2008, pp. 1064–1080.

18. R.A. Heckemann et al., "Automatic Anatomical Brain MRI Segmentation Combining Label Propagation and Decision Fusion," NeuroImage, vol. 33, no. 1, 2006, pp. 115–126.

19. X. Artaechevarria, A. Muñoz-Barrutia, and C. Ortiz-de-Solorzano, "Combination Strategies in Multi-atlas Image Segmentation: Application to Brain MR Data," IEEE Trans. Medical Imaging, vol. 28, no. 8, 2009, pp. 1266–1277.

20. J.M.P. Lotjonen et al., "Fast and Robust Multi-atlas Segmentation of Brain Magnetic Resonance Images," NeuroImage, vol. 49, no. 3, 2010, pp. 2352–2365.


21. P. Aljabar et al., "Multi-atlas Based Segmentation of Brain Images: Atlas Selection and Its Effect on Accuracy," NeuroImage, vol. 46, no. 3, 2009, pp. 726–738.

22. F. Maes et al., "Multimodality Image Registration by Maximization of Mutual Information," IEEE Trans. Medical Imaging, vol. 16, no. 2, 1997, pp. 187–198.

23. M. Wu et al., "Optimum Template Selection for Atlas-Based Segmentation," NeuroImage, vol. 34, no. 4, 2007, pp. 1612–1618.

24. W. Hsu et al., "Context-Based Electronic Health Record: Toward Patient Specific Healthcare," IEEE Trans. Information Technology in Biomedicine, vol. 16, no. 2, 2012, pp. 228–234.

25. H. Müller et al., "A Review of Content-Based Image Retrieval Systems in Medical Applications—Clinical Benefits and Future Directions," Int'l J. Medical Informatics, vol. 73, no. 1, 2004, pp. 1–23.

26. D. Wu et al., "Resource Atlases for Multi-atlas Brain Segmentations with Multiple Ontology Levels Based on T1-Weighted MRI," NeuroImage, vol. 125, 2015, pp. 120–130.

Susumu Mori is a professor in the Department of Radiology at Johns Hopkins University School of Medicine. His research interest is developing new technologies for brain MRI data acquisition and analysis. Mori has a PhD in biophysics from Johns Hopkins University School of Medicine. He's a Fellow of the International Society for Magnetic Resonance in Medicine. Contact him at [email protected].

Dan Wu is a research associate in the Department of Radiology at Johns Hopkins University School of Medicine. Her research interests include advanced neuroimaging and quantitative brain MRI analysis, especially atlas-based neuroinformatics for clinical data analysis. Wu has a PhD in biomedical engineering from Johns Hopkins University. She's a Junior Fellow of the International Society for Magnetic Resonance in Medicine. Contact her at [email protected].

Can Ceritoglu is a research scientist and software engineer in the Center for Imaging Science at Johns Hopkins University. His research interests include medical image processing. Ceritoglu has a PhD in electrical and computer engineering from Johns Hopkins University. Contact him at [email protected].

Yue Li is an engineer at AnatomyWorks. His research interests include medical image analysis, visualization, high-performance computing, cloud-based medical image solutions, and magnetic resonance imaging. Li has a PhD in biomedical engineering from Johns Hopkins University School of Medicine. Contact him at [email protected].

Anthony Kolasny is an IT architect at Johns Hopkins University's Center for Imaging Science. His research interests include high-performance computing, and he is the JHU XSEDE Campus Champion. Kolasny has an MS in computer science from Johns Hopkins University. He's a professional member of the Society for Neuroscience, Usenix, and ACM. Contact him at [email protected].

Marc A. Vaillant is president and CTO of Animetrics, a software company that provides facial recognition solutions to law enforcement, government, and commercial markets. His research interests include computational anatomy in the brain sciences and machine learning. Vaillant has a PhD in biomedical engineering from Johns Hopkins University. Contact him at [email protected].

Andreia V. Faria is a radiologist and an assistant professor in the Department of Radiology at Johns Hopkins University School of Medicine. Her interests include the development, improvement, and application of techniques to study normal brain development and aging, as well as pathological models. Faria has a PhD in neurosciences from the State University of Campinas. Contact her at [email protected].

Kenichi Oishi is an associate professor in the Department of Radiology at Johns Hopkins University School of Medicine. His research interests include multimodal brain atlases and applied atlas-based image recognition and feature extraction methods for various neurological diseases. Oishi has an MD in medicine and a PhD in neuroscience from Kobe University School of Medicine in Japan. Contact him at [email protected].

Michael I. Miller is the Herschel Seder Professor of Biomedical Engineering and director of the Center for Imaging Science at Johns Hopkins University. He has been influential in pioneering the field of computational anatomy, focused on the study of the shape, form, and connectivity of human anatomy at the morpheme scale. Miller has a PhD in biomedical engineering from Johns Hopkins University. Contact him at [email protected].



SCIENCE AS A SERVICE

WaveformECG: A Platform for Visualizing, Annotating, and Analyzing ECG Data

Raimond L. Winslow, Stephen Granite, and Christian Jurado | Johns Hopkins University

The electrocardiogram (ECG) is the most commonly collected data in cardiovascular research because of the ease with which it can be measured and because changes in ECG waveforms reflect underlying aspects of heart disease. Accessed through a browser, WaveformECG is an open source platform supporting interactive analysis, visualization, and annotation of ECGs.

The electrocardiogram (ECG) is a measurement of time-varying changes in body surface potentials produced by the heart's underlying electrical activity. It's the most commonly collected data type in heart disease research, because changes in ECG waveforms reflect underlying aspects of heart disease such as intraventricular conduction, depolarization, and repolarization disturbances,1,2 coronary artery disease,3 and structural remodeling.4 Many studies have investigated the use of different ECG features to predict the risk of coronary events such as arrhythmia and sudden cardiac death; however, it remains an open challenge to identify markers that are both sensitive and specific.

Many different commercial vendors have developed information systems that accept, store, and analyze ECGs acquired via local monitors. The challenge in applying these systems to clinical research is that they're closed: they don't provide APIs by which other software systems can query and access their stored digital ECG waveforms for further analyses, nor the means for adding and testing novel data-processing algorithms. They're designed for use in patient care rather than for clinical research. Despite the ubiquity of ECGs in cardiac clinical research, there are no open, noncommercial platforms for interactive management, sharing, and analysis of these data. We developed WaveformECG to address this unmet need.

WaveformECG is a Web-based tool for managing and analyzing ECG data, developed as part of the CardioVascular Research Grid (CVRG) project funded by the US National Institutes of Health's National Heart, Lung, and Blood Institute.5


Users can browse their files and upload ECG data in a variety of vendor formats for storage. WaveformECG extracts and stores ECGs as time series; once data are uploaded, users can select, view, and scroll through individual digital ECG lead signals in a browser. Points and time intervals in ECG waveforms can be annotated using an ontology from the Bioportal ontology server operated by the National Center for Biomedical Ontology (NCBO),6 and annotations are stored with the waveforms for later retrieval, enabling features of interest to be marked and saved for others. Users can select groups of ECGs for computational analysis via multiple algorithms, and analyses can be distributed across multiple CPUs to decrease processing time. WaveformECG has also been integrated with the Informatics for Integrating Biology and the Bedside (I2B2) clinical data warehouse system.7 This bidirectional coupling lets users define study cohorts within I2B2, analyze ECGs within WaveformECG, and then store analysis results within I2B2.

WaveformECG has been used by hundreds of investigators in several large longitudinal studies of heart disease, including the Multi-Ethnic Study of Atherosclerosis (MESA),8 the Coronary Artery Risk Development in Young Adults (CARDIA),9 and the Prospective Observational Study of Implantable Cardioverter Defibrillators (PROSE-ICD)10 studies. A public demo version is available for use through the CVRG Portal. All software developed in this effort is available on GitHub, under the CVRG project open source repository, with an Apache 2.0 license. Instructions for deploying the WaveformECG tool are available on the CVRG wiki.

WaveformECG System Architecture

WaveformECG is accessed via a portal developed using Liferay Portal Community Edition, an open source product that lets developers build portals as an assembly of cascading style sheets (CSS), webpages, and portlets (see Figure 1).11 Liferay was extended to use Globus Nexus, a federated identity provider that's part of the Globus Online family of services,12 for authentication and authorization. Users log in to WaveformECG with their Globus credentials or with credentials from other identity providers (such as InCommon or Google) linked to their Globus identity. Following authentication, users can access four separate portlet interfaces: upload, visualize, analyze, and download. We developed several back-end libraries and tools supporting these interfaces that enable storage, retrieval, and analysis of ECG time-series data, metadata, and their annotations.

The upload and visualize portlets utilize an open source distributed storage system running on Apache Hadoop and HBase known as the open time-series database (OpenTSDB),13 a database system optimized for storage of streaming time-series data. OpenTSDB sits on top of Apache HBase,14 an open source nonrelational distributed (NoSQL) database developed as part of the Apache Software Foundation Hadoop project (https://hadoop.apache.org). Apache Zookeeper (https://zookeeper.apache.org) serves as the synchronization and naming registry for the deployment. This configuration allows HBase to be deployed across multiple servers and scaled to accommodate high-speed, real-time read/write access to massive datasets; this is an important consideration because WaveformECG is being extended to accept real-time ECG data streams from patient monitors. Reported ingest rates for OpenTSDB across many different applications range from 1 to 100 million points per second. OpenTSDB defines a time series as a set of time-value pairs and assigns a unique time-series identifier (TSUID) to each time series. OpenTSDB also supports execution of aggregation functions on query results (a query result is a read of time-series data). Examples of aggregators include calculations of sums, averages, max-min values, statistics, and custom functions.
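For illustration, the sketch below pushes a few ECG samples into OpenTSDB through its standard HTTP /api/put endpoint; the metric name, tags, and server address are assumptions for the example, not the identifiers WaveformECG actually uses.

# Sketch: write ECG samples to OpenTSDB via its HTTP /api/put endpoint.
# The metric name, tags, and server address are illustrative assumptions.
import requests

OPENTSDB = "http://localhost:4242"           # default OpenTSDB HTTP port
start_ms = 1465000000000                     # arbitrary start time in milliseconds
samples = [0.01, 0.02, 0.15, 0.60, 0.10]     # a few lead amplitudes (mV) at 500 Hz

datapoints = [
    {
        "metric": "ecg.lead",                # hypothetical metric name
        "timestamp": start_ms + i * 2,       # 2 ms between samples at 500 Hz
        "value": v,
        "tags": {"subject": "s0027lre", "lead": "II"},
    }
    for i, v in enumerate(samples)
]

resp = requests.post(f"{OPENTSDB}/api/put", json=datapoints, timeout=10)
resp.raise_for_status()                      # OpenTSDB returns 204 on success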

Figure 1. Platform architecture. Users authenticate to WaveformECG via Globus Nexus to upload, visualize, analyze, and download ECG data, analysis results, and annotations. To accomplish this, WaveformECG makes use of Java libraries and Web services that provide access to data, metadata, analysis results, annotations, and data analysis algorithms.

[Figure 1 components: end user; single sign-on and authentication via Globus Nexus; Liferay upload, visualize, analyze, and download portlets; Java libraries for data/metadata access and manipulation; metadata storage in PostgreSQL; ECG time-series and annotation storage in OpenTSDB (Hadoop, Zookeeper, HBase); backend processing; analysis algorithms exposed as Apache Axis2 Web services (local and remote), including QT screening, QRS detection, QRS score, and heart rate variability.]



OpenTSDB provides access to its storage and retrieval mechanisms via RESTful APIs.15 With this capability, other software systems can query OpenTSDB to retrieve ECG datasets. The open source relational database system PostgreSQL16 maintains file-related information and other metadata. PostgreSQL is also used for portal content management (user identities, portal configuration, the Liferay document and media library [LDML], and so on), storage of all uploaded ECG data files in their native format, and other ECG metadata (sampling rate, length of data capture, subject identifier, and so on).

Data Upload and Management

WaveformECG can import ECG data in several different vendor formats, including Philips XML 1.03/1.04, HL7aECG, Schiller XML, GE Muse/Muse XML 7+, Norav Raw Data (RDT), and the WaveForm DataBase (WFDB) format used in the Physionet Project.17 In addition to the ECG time series, Philips, Schiller, and GE Muse XML files also contain results from the execution of vendor-specific proprietary ECG analysis algorithms, metrics on signal quality, and other data. These data are also extracted and stored.

Within the upload interface (Figure 2a), users can browse their file system to locate folders containing ECG data. Files are selected for upload by clicking the "choose" button or by dragging and dropping files into the central display area (Figure 2a).

Clicking the "upload all" button initiates transfer of data from the user's file system to WaveformECG. WaveformECG automatically determines each file's format and follows a multistep procedure for storing and retrieving data (Figure 2b). Completion of these steps is used as an indicator of progress. Progress information is displayed in the right-most portion of the upload interface, under the "background queue" tab. In the first step, the system checks to make sure that all required files have transferred from the local source to the host. While most formats have only one file per dataset, some formats split information across multiple files. Figure 2a shows this for s0027lre, a WFDB-format ECG dataset. s0027lre's data are packaged in three different files, with dataset metadata in the header (.hea) file and time-series data in the other two files (.dat and .xyz).

Figure 2. Upload portal. (a) The listing on the left shows that the user has created the patient006 folder, into which data will be uploaded. Datasets under “my subjects” are owned by the user. Folders group datasets by subjects, and progress bars next to the file names in the center of the screen show progress on the upload of each file to the server. The background queue on the right provides users with a real-time update of progress on dataset processing. (b) The upload processing flow consists of five parts: server upload (“wait”); storage in LDML and parsing file data (“transferred”); transfer of time-series data and analysis results to OpenTSDB (“written”); transfer of metadata to PostgreSQL (“annotated”); and completion (“done”).

[Figure 2b flow: start upload → transfer file to server → are all necessary files transferred? (if not, wait) → store the ECG files in LDML and extract metadata and analysis results → write time-series/analysis results to OpenTSDB → store metadata in PostgreSQL → done.]


In this example, WaveformECG has fully received the .hea file, but the .dat and .xyz file transfers are still in progress. The progress bar for dataset s0027lre is empty, and the phase column of the background queue displays "wait" because these data files are still being transferred. Once each ECG file transfer to the service is complete, the files are stored in their native format in the LDML. Files at this stage of the workflow have a progress bar at 40 percent completion, with "transferred" displayed in the "background queue" area. The folder structure within the LDML corresponds to the folders created by the user in the upload interface. WaveformECG displays this folder structure on all screens where the user interacts directly with their uploaded files.
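The first check in this workflow, confirming that every companion file of a multi-file dataset has arrived, can be expressed as a simple set comparison. The sketch below uses the WFDB example from Figure 2; the required-extension table is an assumption for illustration and does not reflect WaveformECG's actual format handling.

# Sketch: check whether all companion files of an uploaded dataset have arrived.
# The extension table is an illustrative assumption; real formats vary.
from pathlib import Path

REQUIRED_EXTENSIONS = {
    "wfdb": {".hea", ".dat", ".xyz"},   # e.g., the s0027lre example in Figure 2
    "philips_xml": {".xml"},            # single-file formats need only one upload
}

def dataset_complete(upload_dir: str, record: str, fmt: str) -> bool:
    present = {p.suffix for p in Path(upload_dir).glob(record + ".*")}
    return REQUIRED_EXTENSIONS[fmt] <= present      # subset test: all required files present

print(dataset_complete("/uploads/patient006", "s0027lre", "wfdb"))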

Once transfer is complete, WaveformECG spawns a separate process to extract each ECG time series for storage in OpenTSDB. A single ECG file contains signals from multiple leads, and a time series for each lead signal is extracted and labeled with a unique TSUID. Once this step is complete for all leads of an ECG waveform, the progress bar moves to 60 percent completion, with “written” displayed in the “background queue” column.
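OpenTSDB accepts writes over HTTP at its /api/put endpoint. The minimal Python sketch below illustrates how one extracted lead might be written as a tagged time series; it is not WaveformECG’s actual code, and the metric name, tags, and server URL are assumptions.

```python
import requests

OPENTSDB_PUT = "http://localhost:4242/api/put"   # assumed local OpenTSDB instance

def write_lead(subject_id, lead_name, start_ms, sampling_rate_hz, samples):
    """Write one ECG lead to OpenTSDB as a tagged time series.

    Each sample becomes a data point whose millisecond timestamp is derived
    from the start time and sampling rate. The metric and tag names below are
    illustrative only.
    """
    step_ms = 1000.0 / sampling_rate_hz
    points = [
        {
            "metric": "ecg.lead.uv",                    # assumed metric name
            "timestamp": int(start_ms + i * step_ms),   # milliseconds
            "value": float(v),
            "tags": {"subject": subject_id, "lead": lead_name},
        }
        for i, v in enumerate(samples)
    ]
    resp = requests.post(OPENTSDB_PUT, json=points, timeout=30)  # /api/put takes a JSON array
    resp.raise_for_status()

# Example: three samples of lead II recorded at 500 Hz.
write_lead("patient006", "II", start_ms=1443000000000,
           sampling_rate_hz=500, samples=[12.0, 15.5, 9.25])
```

Because OpenTSDB treats each distinct metric-and-tag combination as its own series, tagging by subject and lead yields a per-lead series, which is what a TSUID identifies.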

Following completion of writing, another background process is spawned to extract ECG waveform analysis results in the files. Each result is labeled with an appropriate ontology term, selected from the NCBO Bioportal Electrocardiography Ontology (http://purl.bioontology.org/ontology/ECG), by storing the ontology ID along with the result. WaveformECG bundles this information, along with the subject identifier, the format of the uploaded ECG dataset, and the start time of the time series itself, and writes labeled analysis results into OpenTSDB. Once this is completed, the progress bar moves to 80 percent, with “annotated” displayed in the “background queue” phase column.

WaveformECG must be able to maintain a connection with the original uploaded ECG files, the stored time-series data, file metadata, analysis results, and manual annotations made to ECG waveforms. To do this, the OpenTSDB TSUID is stored in PostgreSQL. Once this is done, the progress bar moves to 100 percent, and “done” is displayed in the “background queue” phase column.

Data Analysis

ECG analysis algorithms are made available for use in WaveformECG as Web services. The analyze interface (Figure 3a) uses libraries for Web service implementations of ECG analysis algorithms accessed through Apache Axis2.18

Analysis Web services are developed by using Axis2 for communicating with the compiled version of the analysis algorithm.

Figure 3. Analysis portal. (a) Three datasets with different formats were selected for processing by multiple algorithms. The background queue shows progress in data processing: two datasets have each been processed using eight algorithms, while the third has completed processing by seven. (b) In the analysis process, data and algorithm selection follow a step-by-step workflow.



Axis2 is an open source XML-based framework that provides APIs for generating and deploying Web services. It runs on an Apache Tomcat server, rendering it operating-platform-independent. Algorithms developed using the interpreted language Matlab can be compiled using Matlab Compiler (www.mathworks.com/products/compiler/mcr/?requestedDomain=www.mathworks.com) and executed in Matlab Runtime, a stand-alone set of shared libraries that enables the execution of compiled Matlab applications or components on computers that don’t have Matlab installed. An XML file is developed that defines the service, the commands it will accept, and the acceptable values to pass to it. In a separate administrative portion of the analyze interface, a tool allows administrators to easily add algorithms implemented as Web services to the system. Upon entry of proper algorithm details and parameter information, WaveformECG can invoke an algorithm that the administrator has deployed. This approach simplifies the process of adding new algorithms to WaveformECG.

Figure 3a shows the analyze interface and Figure 3b shows the associated processing steps. Users select files or folders from the file chooser on the left; multiple files can be dragged and dropped into the central pane. Placing a file in that pane makes that file available for analysis by one or more algorithms, listed in the bottom center pane. Clicking the checkbox on an algorithm entry instructs the system to analyze the selected files with that algorithm. The checkbox at the top of the algorithm list allows a user to toggle the selection of all the available algorithms. All available algorithms have default settings—some have parameters that can be set via the “options” button, but all parameters set for an algorithm will be applied to all files to be analyzed. Upon selection of files to be processed and the algorithms with which to process them, the user clicks the “start analysis” button, which creates a thread to handle the processing. The thread dispatches a RESTful call to OpenTSDB to retrieve all the data requested. Depending on the algorithms chosen, the thread writes the data into the necessary formats required by the algorithms (for example, algorithms from the PhysioToolkit19 require that ECG data be in the WFDB file format). The thread then invokes the requested algorithms on the requested data. As long as the analyze screen remains open, the background queue will be updated, incrementing the number of algorithms that have finished processing. Upon processing completion of all selected algorithms for one file, the phase will update to “done” in the background queue.
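The dispatch logic described above can be sketched in a few lines with a thread pool; the helper callables below (data retrieval, algorithms, progress reporting) are hypothetical stand-ins for the RESTful OpenTSDB query, the deployed Web services, and the background-queue updates, not WaveformECG’s actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def run_analyses(selected_files, algorithms, fetch_data, report_progress):
    """Run every selected algorithm on every selected file in worker threads.

    fetch_data and the algorithm callables stand in for the RESTful OpenTSDB
    retrieval and the deployed analysis Web services; report_progress mimics
    the background-queue updates.
    """
    def process(file_id):
        data = fetch_data(file_id)                 # e.g., one RESTful query per file
        for done, (name, algorithm) in enumerate(algorithms.items(), start=1):
            algorithm(data)                        # may first convert data (e.g., to WFDB)
            report_progress(file_id, done, len(algorithms))
        return file_id

    with ThreadPoolExecutor(max_workers=4) as pool:
        for finished in pool.map(process, selected_files):
            report_progress(finished, len(algorithms), len(algorithms), phase="done")

# Example usage with trivial stand-ins.
run_analyses(
    ["s0027lre"],
    {"sqrs": lambda d: None, "wqrs": lambda d: None},
    fetch_data=lambda f: [0.1, 0.2, 0.3],
    report_progress=lambda f, done, total, phase="running": print(f, f"{done}/{total}", phase),
)
```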

Analyses of ECGs can provide information on both the heart’s normal and pathological functions. The lead V3 ECG waveform in Figure 4 shows body surface potential (μV; ordinate) as a function of time (sec; abscissa) measured over a single heartbeat. The ECG P, Q, R, S, and T waves are labeled. The P wave reflects depolarization of the cardiac atria in response to electrical excitation produced in the pacemaker region of the heart, the sinoatrial node.

Figure 4. Multilead visualization interface. In the multilead display for 4 of the 15 leads from a GE Marquette Universal System for Electrocardiography (MUSE) XML upload, the vertical bar in 3 of the graphs represents the cursor location in the first graph. The bars move with the cursor and change focus as the cursor changes graphs.


Onset of the Q wave corresponds to onset of depolarization of the cardiac interventricular septum. The R and S waves correspond to depolarization of the remainder of the cardiac ventricles and Purkinje fibers, respectively. Ventricular activation time is defined as the time interval between onset of the Q wave and the peak of the R wave. The T wave corresponds to repolarization of the ventricles to the resting state. The time interval between onset of the Q wave and completion of the T wave is known as the QT interval and represents the amount of time over which the heart is partially or fully electrically excited over the cardiac cycle. The time interval between successive R peaks (the RR interval) determines the instantaneous heart rate. Abnormalities of the shape, amplitude, and other features of these waves and intervals can reflect underlying heart disease, and there has been considerable effort in developing algorithms that can be used to automatically analyze ECGs to detect these peaks and intervals. Table 1 lists the algorithms available in the current release of WaveformECG.20,21
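For instance, once a QRS detector such as sqrs or wqrs has produced R-peak times, the instantaneous heart rate follows from the successive RR intervals (60/RR in beats per minute); the small illustration below is just that arithmetic, not one of the listed algorithms.

```python
def instantaneous_heart_rate(r_peak_times_s):
    """Convert successive R-peak times (in seconds) into instantaneous
    heart rates in beats per minute (60 / RR interval)."""
    rates = []
    for earlier, later in zip(r_peak_times_s, r_peak_times_s[1:]):
        rr = later - earlier              # RR interval in seconds
        rates.append(60.0 / rr)           # beats per minute
    return rates

# R peaks 0.80 s apart correspond to 75 beats per minute.
print(instantaneous_heart_rate([0.40, 1.20, 2.00, 2.75]))   # [75.0, 75.0, 80.0]
```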

Data Visualization

The visualize interface lets users examine and interact with stored ECG data. This feature also provides a mechanism for manually annotating waveforms. When the user selects a file to view in the visualization screen, it initially displays the data as a series of graphs, one for each lead in the dataset (15 leads for the GE MUSE dataset shown in Figure 4).

A calibration pulse of 1-mV amplitude and 200-msec duration is displayed in the left-most panel.

Initially, four leads are displayed, but additional leads can be viewed by grabbing and dragging the window scroll bar located on the right side of the browser display. Whenever the cursor is positioned within a display window, its x-y location is marked by a filled red dot, and time-amplitude values at that location are displayed at the bottom of the panel. Cursor display in all graphs is synchronized so that as the user navigates through one graph, others update with it. Lead name and number of annotations for that lead signal are displayed in each graph. File metadata, including subject ID, lead count, sampling rate, and ECG duration, are displayed above the graphs. WaveformECG supports scrolling through waveforms. Clicking on the “next” button at the top of the display steps through time in increments of 1.2 seconds. Users can jump to a particular time point by entering the time value into the panel labeled “jump to time (sec).”

By clicking on a lead graph, users can expand the view to see the data for that lead in detail, including any annotations that have been entered manually. A list of analysis results on the lead is displayed in a table at the left of the view, and the graph is displayed in the center right. In Figure 5a, WaveformECG displays part of the analysis results extracted from a Philips XML upload. While requesting the analysis results displayed, the visualize interface also checks the data originally returned from OpenTSDB to see if any annotations exist for the time frame displayed and, if so, Dygraphs (http://dygraphs.com), an open source JavaScript charting library, renders the annotation display on the screen.

Table 1. Algorithm listing for the analyze interface.

Algorithm name | Developer | Purpose/dependencies

Rdsamp | Physionet | Converts the Waveform Database (WFDB) ECG format to human-readable format

Sigamp | Physionet | Measures signal amplitudes of a WFDB record

sqrs/sqrs2csv | Physionet | Detects onset and offset times of the QRS complex in single leads; second implementation produces output in CSV format

wqrs/wqrs2csv | Physionet | Detects onset and offset times of the QRS complex in single leads using the length transform; second implementation produces output in CSV format

ihr (sqrs & wqrs implementation) | Physionet | Computes successive RR intervals (instantaneous heart rate); requires input from sqrs or wqrs

pNNx (sqrs & wqrs implementation) | Physionet | Calculates time domain measures of heart rate variability; requires input from sqrs or wqrs

QT Screening | Yuri Chesnokov and colleagues20 | Detects successive QT intervals based upon high- or low-pass filtering of ECG waveforms; works with data in WFDB format

QRS-Score | David Strauss and colleagues21 | Produces the Strauss-Selvester QRS arrhythmia risk score based on certain criteria derived from GE MUSE analysis


Figure 5. Visualization. (a) For lead II in a 12-lead ECG in Philips format, the table under “Analysis Results” displays the results of automated data processing by the Philips system used to collect this ECG. In the waveform graph, A denotes a QT interval annotation, with the yellow bar representing the interval itself. This annotation was made manually. The 1 denotes an R peak annotation, also made manually. All interval and point annotations are listed below the graph. (b) In the manual annotation interface, the R peak is highlighted and the information in the center shows the definition returned for that term selection. In addition, there are details about the ECG and the point at which the annotation was made. To create these displays, the visualize interface initiates a RESTful call to OpenTSDB to retrieve the first 2.5 seconds of time-series data associated with all the leads in the file. Dygraphs, an open source JavaScript charting library, generates each of the graphs displayed.


Figure 5a shows examples of the two types of supported waveform annotations: point annotations are associated with a specific ECG time-value pair—in this case, the time-value pair corresponding to the peak of the R wave, labeled with the number 1—and interval annotations are associated with a particular time interval. The user can scroll through the individual lead data using the slider control at the bottom of the display or the navigation buttons. There’s also a feature to jump to a specific point in time. Zooming can be performed using the slider bar at the bottom of the screen. To restore the original view of the graph, the user can double-click on it. Manual annotations can be added by clicking in the graph screen.
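Conceptually, a point annotation needs only a time-value pair and an interval annotation a start and end time, each tied to an ontology term and the annotated series. The record below is a hypothetical illustration of that distinction, not WaveformECG’s storage schema; the identifiers and field names are invented.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Annotation:
    """Illustrative record for a manual ECG annotation (not the actual schema)."""
    tsuid: str                        # OpenTSDB series ID of the annotated lead
    term_id: str                      # ontology identifier of the selected term
    label: str                        # human-readable term name
    start_s: float                    # time of the point, or start of the interval
    end_s: Optional[float] = None     # None for a point annotation
    value_uv: Optional[float] = None  # amplitude, meaningful for point annotations
    comment: str = ""

# Hypothetical examples: a point annotation and an interval annotation.
r_peak = Annotation("series-A", "ECG:RPeak", "R peak", start_s=0.42, value_uv=1150.0)
qt_interval = Annotation("series-A", "ECG:QTInterval", "QT interval",
                         start_s=0.40, end_s=0.78)
print(r_peak.end_s is None, qt_interval.end_s is not None)   # True True
```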

Data Annotation

Users can manually annotate ECG waveforms to mark features of interest. These annotations are then stored and redisplayed along with the waveform on subsequent visualization. To create an annotation, the user selects a point on the graph or highlights an interval using the mouse. This activates a graphical interface where the user enters the specific details of the annotation (Figure 5b). The system relies on the Bioportal ontology server to select annotation terms. Figure 5b shows the end result of the R peak point annotation listed in Figure 5a, labeled with a 1 on the waveform itself. On the upper-right-hand side of the screen, users can see the time and amplitude values for the point they selected. If the point is incorrect, the user can click the “chg” button. The annotation interface will then provide the user with a zoomed-in portion of the graph showing where the current selection is. If they so choose, users can change the point to another one and click “save.” If not, they can click “cancel.” The annotation interface refreshes and the onset will be updated to the chosen value.

When WaveformECG renders the annotation interface shown in Figure 5b, it dispatches a request to NCBO’s Bioportal to retrieve all the root-level terms from the ECG terms view of the ECG ontology in Bioportal. As the ontology itself conforms to the standards of a basic formal ontology (BFO), the ECG terms view provides a less formal display of the terms within. A button at the end of the terms listing in the annotation screen lets users change the display to or from the BFO and terms view of the ECG ontology.

At the top of the term listing is a text box labeled “search for class,” which lets a user type a search term.

As typing commences, a JavaScript application developed by the NCBO provides a list of terms in the target ontology that match the typed text. The user can then select a term from that list. Upon selection of a term, the lower box in the right-center screen will update with the term and the term definition retrieved from Bioportal. The user can then enter a comment in the text field below that describes any additional information to be included with the annotation. Upon completion of term selection and comment entry, the user clicks the “save” button. In Figure 5b, this button is grayed out because this figure shows the result of clicking on an existing annotation. This lets users delve into the details of existing annotations and see any comments entered previously in the comment box for annotation.
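A term search of this kind can also be issued directly against BioPortal’s REST interface. The sketch below assumes BioPortal’s documented search endpoint (data.bioontology.org/search), its API-key authorization header, and its usual paginated JSON layout; these details should be checked against the current BioPortal documentation rather than taken as given.

```python
import requests

def search_ecg_terms(query, api_key):
    """Search the ECG ontology on NCBO BioPortal for classes matching typed text.

    Assumes BioPortal's REST search endpoint and its usual paginated JSON
    (a "collection" of classes); verify against the current BioPortal docs.
    """
    resp = requests.get(
        "https://data.bioontology.org/search",
        params={"q": query, "ontologies": "ECG"},
        headers={"Authorization": f"apikey token={api_key}"},
        timeout=30,
    )
    resp.raise_for_status()
    return [{"id": cls.get("@id"), "label": cls.get("prefLabel")}
            for cls in resp.json().get("collection", [])]

# Example (requires a free BioPortal API key):
# for term in search_ecg_terms("R wave", api_key="YOUR_KEY"):
#     print(term["label"], term["id"])
```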

Integration with the I2B2 Clinical Data Warehouse

WaveformECG has been integrated with the Eureka Clinical Analytics system.22 Eureka provides Web interfaces and tools for simplifying the task of extracting, transferring, and loading clinical data into the I2B2 clinical data warehouse. The advantage of this integration is that subject cohorts identified in I2B2 can be easily sent to WaveformECG for further analysis using a newly developed Send Patient Set plug-in that communicates with WaveformECG using JavaScript Object Notation (JSON). In the example in Figure 6, we use I2B2 to query data from the MIMIC-II database,23 looking for all Asian subjects for whom there exists a measurement of cardiac output. The plug-in extracts subject IDs satisfying this particular query from I2B2 and sends them to WaveformECG as a JSON message. WaveformECG receives and processes the JSON, creating a folder displayed in the upload interface that corresponds to the I2B2 cohort name (Monitor-Asian@14:11:45[10-29-2015]). WaveformECG creates subfolders for each of the corresponding subjects, with their subject identifiers as the folder names. The user can then upload the waveform data for each of the subjects into their respective folders (left side of Figure 6). In the example, I2B2 returned three subjects who met these criteria, but only one of the subjects had corresponding waveform data. We then uploaded that subject’s data to WaveformECG and processed those waveforms with multiple analysis algorithms.
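The plug-in’s exact message format isn’t reproduced in the article, but conceptually the payload only needs the cohort name and the matching subject identifiers. The snippet below is a hypothetical illustration of such a message; the field names and subject IDs are invented, and only the cohort name comes from the example above.

```python
import json

# Hypothetical shape of the cohort message sent to WaveformECG; field names and
# subject IDs are invented for illustration.
cohort_message = {
    "cohortName": "Monitor-Asian@14:11:45[10-29-2015]",
    "subjects": ["subj-001", "subj-002", "subj-003"],
}

payload = json.dumps(cohort_message)
print(payload)
# On receipt, WaveformECG creates a folder named after the cohort and one
# subfolder per subject identifier, as described above.
```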

Eureka lets users define a data source and specify cohort definitions. In the case of WaveformECG, we defined OpenTSDB as the data source.


Through that definition, Eureka performed a RESTful call to OpenTSDB, searching for analysis results linked with files in the Eureka folder. Once found, analysis results along with their ECG ontology IDs are transferred to Eureka, where they are reorganized into a format acceptable for automatic loading into I2B2. A subset of those results can be seen in Figure 6 under the “EKG annotations” folder in the I2B2 “navigate terms” window.

WaveformECG Case Study

Sudden cardiac death (SCD) accounts for 200,000 to 450,000 deaths in the US annually.24 Current screening strategies fail to detect roughly 80 percent of those who die suddenly. The ideal screening method for increased risk should be simple, inexpensive, and reproducible in different settings so that it can be performed routinely in a physician’s office, yet be both sensitive and specific. A recent study has shown that features computed from the 12-lead ECG known as the QRS score and QRS-T angle can be used to identify patients with fibrotic scars (determined using late-gadolinium enhancement magnetic resonance imaging) with 98 percent sensitivity and 51 percent specificity.25 Motivated by these findings, we assisted in a large-scale screening of all ECGs obtained over a six-month period at two large hospital systems. The challenges faced in this study were the large number of subjects and ECGs (~35,000) to be managed and analyzed, the use of different ECG instrumentation and thus different data formats at the two sites, and the fact that instrument vendors don’t make either of the algorithms to be tested available in their systems. WaveformECG proved to be a powerful platform for supporting this study. The QRS score and QRS-T angle algorithms were implemented and deployed, making it possible for the research team to quickly select and analyze ECGs from different sites. The two ECG-based features were shown to be a useful initial method (a sensitivity of 70 percent and a specificity of 55 percent) for identifying those at risk of SCD in the population of patients having preserved left ventricular ejection fraction (LVEF > 35 percent).

Other physiological time-series data arise in many other healthcare applications. Blood pressure waveforms, peripheral capillary oxygen saturation, respiratory rate, and other physiological signals are measured from every patient in the modern hospital, particularly those in critical care settings. Currently in most hospitals, these data are “ephemeral,” meaning they appear on the bedside monitor and then disappear.

Figure 6. Integration with a clinical data warehouse. A split screen shows information sent from I2B2 (left) to WaveformECG. In the expanded EKG annotations folder, three analysis results can be returned from WaveformECG to I2B2.


These data are among the most actionable in the hospital because they reflect the patient’s moment-to-moment physiological functioning. Capturing these data and understanding how they can be used along with other data from the electronic health record to more precisely inform patient interventions has the potential to significantly improve healthcare outcomes. In future work, we will extend WaveformECG to serve as a general-purpose platform for working with other types of physiological time-series data.

Acknowledgments

Development of WaveformECG was supported by the National Heart, Lung and Blood Institute through NIH R24 HL085343, NIH R01 HL103727, and as a subcontract of NIH U54HG004028 from the National Center for Biomedical Ontology.

References

1. B. Surawicz et al., “AHA/ACCF/HRS Recommendations for the Standardization and Interpretation of the Electrocardiogram: Part III: Intraventricular Conduction Disturbances,” Circulation, vol. 119, 17 Mar. 2009, pp. e235–240.

2. P.M. Rautaharju et al., “AHA/ACCF/HRS Recommendations for the Standardization and Interpretation of the Electrocardiogram: Part IV: The ST Segment, T and U Waves, and the QT Interval,” Circulation, vol. 119, 17 Mar. 2009, pp. e241–250.

3. G.S. Wagner et al., “AHA/ACCF/HRS Recommendations for the Standardization and Interpretation of the Electrocardiogram: Part VI: Acute Ischemia/Infarction,” J. Am. College Cardiology, vol. 53, 17 Mar. 2009, pp. 1003–1011.

4. E.W. Hancock et al., “AHA/ACCF/HRS Recommendations for the Standardization and Interpretation of the Electrocardiogram: Part V: Electrocardiogram Changes Associated with Cardiac Chamber Hypertrophy,” Circulation, vol. 119, 17 Mar. 2009, pp. e251–261.

5. R. Winslow et al., “The CardioVascular Research Grid (CVRG) Project,” Proc. AMIA Summit on Translational Bioinformatics, 2011, pp. 77–81.

6. M.A. Musen et al., “BioPortal: Ontologies and Data Resources with the Click of a Mouse,” Proc. Am. Medical Informatics Assoc. Ann. Symp., 2008, pp. 1223–1224.

7. S.N. Murphy et al., “Serving the Enterprise and Beyond with Informatics for Integrating Biology and the Bedside (I2B2),” J. Am. Medical Informatics Assoc., vol. 17, no. 2, 2010, pp. 124–130.

8. D.E. Bild et al., “Multi-ethnic Study of Atherosclerosis: Objectives and Design,” Am. J. Epidemiology, vol. 156, 1 Nov. 2002, pp. 871–881.

9. E.B. Lynch et al., “Cardiovascular Disease Risk Factor Knowledge in Young Adults and 10-Year Change in Risk Factors: The Coronary Artery Risk Development in Young Adults (CARDIA) Study,” Am. J. Epidemiology, vol. 164, 15 Dec. 2006, pp. 1171–1179.

10. A. Cheng et al., “Protein Biomarkers Identify Patients Unlikely to Benefit from Primary Prevention Implantable Cardioverter Defibrillators: Findings from the Prospective Observational Study of Implantable Cardioverter Defibrillators (PROSE-ICD),” Circulation: Arrhythmia and Electrophysiology, vol. 7, no. 12, 2014, pp. 1084–1091.

11. J.X. Yuan, Liferay Portal Systems Development, Packt Publishing, 2012.

12. R. Ananthakrishnan et al., “Globus Nexus: An Identity, Profile, and Group Management Platform for Science Gateways and Other Collaborative Science Applications,” Proc. Int’l Conf. Cluster Computing, 2013, pp. 1–3.

13. B. Sigoure, “OpenTSDB: The Distributed, Scalable Time Series Database,” Proc. Open Source Convention, 2010; http://opentsdb.net/misc/opentsdb-oscon.pdf.

14. R.C. Taylor, “An Overview of the Hadoop/MapReduce/HBase Framework and Its Current Applications in Bioinformatics,” BMC Bioinformatics, vol. 11, 2010, p. S1.

15. C. Pautasso, “RESTful Web Services: Principles, Patterns, Emerging Technologies,” Springer, 2014, pp. 31–51.

16. K. Douglas and S. Douglas, PostgreSQL: A Comprehensive Guide to Building, Programming, and Administering PostgreSQL Databases, Sams Publishing, 2003.

17. G.B. Moody, R.G. Mark, and A.L. Goldberger, “PhysioNet: A Web-Based Resource for the Study of Physiologic Signals,” IEEE Eng. Medicine and Biology Magazine, vol. 20, no. 3, 2001, pp. 70–75.

18. D. Jayasinghe and A. Azeez, Apache Axis2 Web Services, Packt Publishing, 2011.

19. G.B. Moody, R.G. Mark, and A.L. Goldberger, “PhysioNet: Physiologic Signals, Time Series and Related Open Source Software for Basic, Clinical, and Applied Research,” Proc. Conf. IEEE Eng. Medicine and Biology Soc., vol. 2011, pp. 8327–8330.

20. Y. Chesnokov, D. Nerukh, and R. Glen, “Individually Adaptable Automatic QT Detector,” Computers in Cardiology, vol. 33, 2006, pp. 337–341.

21. D.G. Strauss et al., “Screening Entire Health System ECG Databases to Identify Patients at Increased Risk of Death,” Circulation: Arrhythmia and Electrophysiology, vol. 6, no. 12, 2013, pp. 1156–1162.

22. A. Post et al., “Semantic ETL into I2B2 with Eureka!,” AMIA Summit Translational Science Proc., 2013, pp. 203–207.


23. M. Saeed et al., “Multiparameter Intelligent Monitoring in Intensive Care II: A Public-Access Intensive Care Unit Database,” Critical Care Medicine, vol. 39, no. 5, 2011, pp. 952–960.

24. J.J. Goldberger et al., “American Heart Association/American College of Cardiology Foundation/Heart Rhythm Society Scientific Statement on Noninvasive Risk Stratification Techniques for Identifying Patients at Risk for Sudden Cardiac Death: A Scientific Statement from the American Heart Association Council on Clinical Cardiology Committee on Electrocardiography and Arrhythmias and Council on Epidemiology and Prevention,” Circulation, vol. 118, 30 Sept. 2008, pp. 1497–1518.

25. D.G. Strauss et al., “ECG Quantification of Myocardial Scar in Cardiomyopathy Patients with or without Conduction Defects: Correlation with Cardiac Magnetic Resonance and Arrhythmogenesis,” Circulation: Arrhythmia and Electrophysiology, vol. 1, no. 12, 2008, pp. 327–336.

Raimond L. Winslow is the Raj and Neera Singh Professor of Biomedical Engineering and director of the Institute for Computational Medicine at Johns Hopkins University. His research interests include the use of computational modeling to understand the molecular mechanisms of cardiac arrhythmias and sudden death, as well as the development of informatics technologies that provide researchers secure, seamless access to cardiovascular research study data and analysis tools. Winslow is principal investigator of the CardioVascular Research Grid Project and holds joint appointments in the departments of Electrical and Computer Engineering, Computer Science, and the Division of Health Care Information Sciences at Johns Hopkins University. He’s a Fellow of the American Heart Association, the Biomedical Engineering Society, and the American Institute for Medical and Biological Engineers. Contact him at [email protected].

Stephen Granite is the director of database and software development of the Institute for Computational Medicine at Johns Hopkins University. He’s also the program manager for the CardioVascular Research Grid Project. Granite has an MS in computer science with a focus in bioinformatics and an MS in business administration with a focus in competitive intelligence, both from Johns Hopkins University. Contact him at [email protected].

Christian Jurado is a software engineer in the Institute for Computational Medicine at Johns Hopkins University. He’s also lead developer of WaveformECG for the CardioVascular Research Grid Project. Jurado has a BS in computer science, specializing in Java Web development and Liferay. Contact him at [email protected].

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.


COMPUTATIONAL CHEMISTRY

Chemical Kinetics: A CS Perspective

Dinesh P. Mehta and Anthony M. Dean | Colorado School of Mines
Tina M. Kouri | Sandia National Labs

Chemical kinetics has played a critical role in understanding phenomena such as global climate change and photochemical smog, and researchers use it to analyze chemical reactors and alternative fuels. When computing is applied to the development of detailed chemical kinetic models, it allows scientists to predict the behavior of these complex chemical systems.

The 1995 Nobel Prize in Chemistry was awarded to Paul J. Crutzen, Mario J. Molina, and F. Sherwood Rowland “for their work in atmospheric chemistry, particularly concerning the formation and decomposition of ozone.”1 Molina and Rowland performed calculations predicting that chlorofluorocarbon (CFC) gases being released into the atmosphere would lead to the depletion of the ozone layer. Because the ozone layer absorbs ultraviolet light, its depletion would lead to an increase in ultraviolet light on the Earth’s surface, resulting in an increase in skin cancer and eye damage in humans. The subsequent international treaty, the Montreal Protocol on Substances that Deplete the Ozone Layer, was universally adopted and phased out the production of CFCs; it serves as an exemplar of public policy being informed by science.

The underlying calculations used by Molina and Rowland have their basis in chemical kinetics, which concerns the rate at which chemical reactions occur. When a chemical reaction (such as the combustion of methane) takes place, the overall reaction might appear simple—such as CH4 + 2O2 = CO2 + 2H2O—but the actual chemistry is typically much more complex (for details, see the “Related Work in Chemical Kinetics” sidebar). An accurate analysis of the underlying combustion phenomenon requires consideration of all the species (molecules) and elementary reactions, which could number in the hundreds and thousands, respectively. Other applications of kinetics include controlling photochemical smog through emissions regulations on automobiles and factories and the development of alternative fuels for the internal combustion engine. The experimental testing needed for fuel certification is expensive and time-consuming, leading to the development of computational approaches to minimize the experimental space.

The purpose of this article is to acquaint computer scientists with the application of computing in chemical kinetics and to outline future challenges, such as the need to couple kinetics and transport phenomena to obtain more accurate predictions. We focus on gas-phase chemistry; that is, all the chemicals are gases. Figure 1 shows the flow of the chemical kinetic computation. Aspects of this are similar to hardware and software design flows, where the upstream steps have considerable impact on the downstream steps, both in terms of computation time and quality of solution. This article’s organization mirrors the steps in the flowchart, with sections on mechanism generation, consistency and completeness analysis, and mechanism reduction.

Mechanism Generation

The automated development of a reaction mechanism involves starting with a set of reactants and determining the reactions they participate in, the intermediate species that are generated, and, ultimately, the products that are obtained.


This is an inherently iterative process because intermediate species might themselves participate in reactions, resulting in the formation of new species, which might in turn react with other species to generate even more species, and so on. This process can theoretically continue indefinitely, resulting in a combinatorial explosion of species and reactions. In practice, criteria are needed for two reasons: to decide when to terminate the process and to identify the most chemically important reactions and species.

Related Work in Chemical Kinetics

Chemical kinetics is a branch of chemistry concerned with the rate at which chemical reactions occur. This is opposed to chemical thermodynamics, which studies the enthalpy and entropy associated with chemical reactions; that is, it tells you whether a given chemical reaction might occur under certain conditions but not how fast it will occur.

For example, chemical thermodynamics suggests that the oxidation of graphite (carbon) resulting in carbon dioxide is highly favored at room temperature. However, graphite can be exposed to air indefinitely without any apparent changes because the reaction is very slow. The rate of a chemical reaction is affected by the reactants’ nature, physical state (solid, liquid, or gas), and concentrations, as well as the temperature, pressure, and the presence of catalysts and inhibitors.

Chemical reaction kinetics can be described by rate laws. Chemists and chemical engineers can use the resulting mathematical models to better understand a variety of chemical reactions and design chemical systems that maximize product yield, minimize harmful effects on the environment, and so on.

Consider the reaction O + N2O → 2NO. Here, one atom of oxygen reacts with one molecule of nitrous oxide to produce two molecules of nitric oxide. Let [N2O] denote the concentration of nitrous oxide. The reaction rate can be expressed as the rate of disappearance of a reactant (say, N2O) by denoting it by the derivative of concentration with respect to time: d[N2O]/dt. Note that d[N2O]/dt is negative because the concentration of N2O (a reactant) reduces with time. The rate of a reaction can also be expressed as the rate of product formation: in this case, (1/2)(d[NO]/dt). The 1/2 accounts for the production of two molecules of nitric oxide in the reaction. More generally, consider a reaction of the form aA + bB → cC + dD, where A and B are reactants, C and D are products, and a, b, c, and d are integers that denote the relative amounts of reactants and products consumed and produced, respectively. The reaction rate can then be expressed as

$$-\frac{1}{a}\frac{d[\mathrm{A}]}{dt} = -\frac{1}{b}\frac{d[\mathrm{B}]}{dt} = \frac{1}{c}\frac{d[\mathrm{C}]}{dt} = \frac{1}{d}\frac{d[\mathrm{D}]}{dt}.$$

The field of chemical kinetics was pioneered by Cato M. Guldberg and Peter Waage, who observed in 1864 that reaction rates are related to the concentrations of the reactants. Typically,

$$\mathrm{RATE} = k[\mathrm{A}]^{x}[\mathrm{B}]^{y},$$

where k is the rate constant. The values of x and y, which can be determined experimentally, depend on the reaction and aren’t necessarily equal to the stoichiometric coefficients a and b. The order of a reaction is defined by x + y.

It often turns out that a seemingly simple reaction actually consists of a number of even simpler steps. For example, the reaction H2 + Br2 → 2HBr consists of the following elementary steps:

Br2 → 2Br
Br + H2 → HBr + H
H + Br2 → HBr + Br
H + HBr → H2 + Br
2Br → Br2

This collection of elementary steps is called a reaction mechanism. It turns out that x = a and y = b in an elementary reaction—that is, the order is identical to the molecularity. For an elementary reaction, the rate law can therefore be written by inspection using the Guldberg–Waage mass action law.

Given a reaction mechanism, a quantitative characterization of the chemical system is achieved by developing a set of differential equations, one for each species in the mechanism (in our example, Br2, H2, Br, H, HBr). Assuming that the rate constants for the five reactions above are k1 through k5, respectively, we can write down the following set of ordinary differential equations (ODEs):

$$\begin{aligned}
-\frac{d[\mathrm{Br_2}]}{dt} &= k_1[\mathrm{Br_2}] + k_3[\mathrm{H}][\mathrm{Br_2}] - k_5[\mathrm{Br}]^2 \\
-\frac{d[\mathrm{H_2}]}{dt} &= k_2[\mathrm{Br}][\mathrm{H_2}] - k_4[\mathrm{H}][\mathrm{HBr}] \\
\frac{d[\mathrm{H}]}{dt} &= k_2[\mathrm{Br}][\mathrm{H_2}] - k_3[\mathrm{H}][\mathrm{Br_2}] - k_4[\mathrm{H}][\mathrm{HBr}] \\
\frac{d[\mathrm{Br}]}{dt} &= 2k_1[\mathrm{Br_2}] - k_2[\mathrm{Br}][\mathrm{H_2}] + k_3[\mathrm{H}][\mathrm{Br_2}] + k_4[\mathrm{H}][\mathrm{HBr}] - 2k_5[\mathrm{Br}]^2 \\
\frac{d[\mathrm{HBr}]}{dt} &= k_2[\mathrm{Br}][\mathrm{H_2}] + k_3[\mathrm{H}][\mathrm{Br_2}] - k_4[\mathrm{H}][\mathrm{HBr}]
\end{aligned}$$

These are then numerically integrated to give, for each species, a description of the concentration variation with time. The resulting predictions are compared to experimental data to assess the proposed mechanism’s suitability. Although our example reaction consists of five reactions and five species, a complex reaction can contain thousands of species and tens of thousands of reactions, requiring efficient integration algorithms. A description of the ODE solvers and the electronic structure calculations used to determine rate constants, while crucial to this technology, is beyond this article’s scope.
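To make the integration step concrete, the sketch below feeds the five ODEs above to SciPy’s stiff-capable BDF solver; the rate constants and initial concentrations are arbitrary placeholders chosen for illustration, not measured values.

```python
from scipy.integrate import solve_ivp

# Placeholder rate constants k1..k5: arbitrary illustrative values, not fitted data.
k1, k2, k3, k4, k5 = 1e-3, 1e2, 1e3, 1e1, 1e4

def rhs(t, y):
    """Right-hand side of the H2 + Br2 -> 2HBr mechanism ODEs.
    State vector y = [Br2, H2, H, Br, HBr]."""
    br2, h2, h, br, hbr = y
    d_br2 = -k1 * br2 - k3 * h * br2 + k5 * br ** 2
    d_h2 = -k2 * br * h2 + k4 * h * hbr
    d_h = k2 * br * h2 - k3 * h * br2 - k4 * h * hbr
    d_br = (2 * k1 * br2 - k2 * br * h2 + k3 * h * br2
            + k4 * h * hbr - 2 * k5 * br ** 2)
    d_hbr = k2 * br * h2 + k3 * h * br2 - k4 * h * hbr
    return [d_br2, d_h2, d_h, d_br, d_hbr]

y0 = [1.0, 1.0, 0.0, 0.0, 0.0]                    # [Br2, H2, H, Br, HBr] at t = 0
sol = solve_ivp(rhs, (0.0, 10.0), y0, method="BDF", rtol=1e-8, atol=1e-12)
print(sol.y[:, -1])                               # concentrations at t = 10
```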


Bond-Electron and Reaction Matrices

We now describe the use of matrices to generate products from a set of reactants and reaction types.2,3 The bond-electron (BE) matrix represents a species and is a variation of the classical adjacency matrix used to represent graphs. Specifically,

■ graph vertices are augmented with labels to denote atoms, such as C for carbon, H for hydrogen, and O for oxygen; and

■ multiple edges are permitted between vertices to account for bond order.

For example, a pair of C atoms can be joined by a single bond, a double bond, or a triple bond; these three cases must be distinguishable. Bond formation is governed by the participating atoms’ valences—that is, the number of unpaired electrons in their outermost shells. The valences for C, O, and H are 4, 2, and 1, respectively. A single bond is formed by the contribution of one unpaired electron from the two participating atoms.

Element Mij in an n × n BE matrix M of a molecule with n atoms denotes the number of bonds between atoms i and j, when i ≠ j. The diagonal element Mii (typically zero in an adjacency matrix) denotes the number of free electrons of atom i that aren’t used in its bonds. The sum of the elements in the ith row then gives the valence of atom i. Figure 2 illustrates the concept.

A reaction (R) matrix is used to capture the bond changes associated with a certain type of reaction. Several well-known reaction types exist—including hydrogen abstraction, β-scission, and recombination—and each has well-defined behavior with respect to the bond changes that occur in the reaction. This is illustrated using hydrogen abstraction—that is, the removal (abstraction) of an H atom from a molecule by a radical (a species with an atom with an unpaired electron in its outermost shell)—as follows:

X-H + Y* → X* + Y-H.

Here, the radical Y* abstracts the H atom from the molecule X-H, giving the molecule Y-H and the radical X*. The bonds associated with three atoms are impacted by this reaction:

■ Xa, the atom in X whose bond with the H atom is broken;
■ the H atom itself; and
■ Yb, the atom in Y with the unpaired electron, which forms a bond with the H atom.

These are reflected in the H-abstraction reaction matrix:

     Xa   H   Yb
Xa   +1  -1    0
H    -1   0   +1
Yb    0  +1   -1

Figure 3 illustrates in detail how the products of the reaction in Figure 2 are generated.
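The bookkeeping of Figures 2 and 3 is easy to reproduce with a matrix library. The sketch below, an illustration of the scheme rather than the authors’ code, builds the combined BE matrix for CH4 + OH*, adds the expanded H-abstraction R matrix, and confirms that every row sum (each atom’s valence) is preserved.

```python
import numpy as np

atoms = ["C", "H", "H", "H", "H", "O", "H"]    # CH4 atoms 0-4, then the O-H radical

be = np.zeros((7, 7), dtype=int)
for h in (1, 2, 3, 4):                          # four C-H bonds in methane
    be[0, h] = be[h, 0] = 1
be[5, 6] = be[6, 5] = 1                         # the O-H bond of the hydroxyl radical
be[5, 5] = 1                                    # O carries one unpaired electron

# Expanded H-abstraction R matrix: only the rows/columns of Xa (the C, index 0),
# the abstracted H (index 1), and Yb (the O, index 5) are nonzero.
r = np.zeros((7, 7), dtype=int)
r[0, 0] = 1                                     # C gains an unpaired electron
r[0, 1] = r[1, 0] = -1                          # the C-H bond is broken
r[1, 5] = r[5, 1] = 1                           # an O-H bond is formed
r[5, 5] = -1                                    # O's unpaired electron is now paired

product = be + r                                # BE matrices of CH3* and H2O, combined
assert (product.sum(axis=1) == be.sum(axis=1)).all()   # valences are conserved
print(product.sum(axis=1))                      # [4 1 1 1 1 2 1]
```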

Termination and Selection

Let the initial set of reactants be R0. Assume that reaction matrices are applied to all possible combinations of reactants to generate products as described in the previous section. Let the set of new products generated be R1. We now repeat the process on R0 ∪ R1 to generate R2.

This process can be repeated indefinitely. The challenge is twofold: to determine which criteria should be used to terminate this process and to identify how to select chemically significant species and reactions, while leaving chemically insignificant ones out.

Figure 1. Mechanism generation flow chart. The kineticist uses chemical insights gained through experiments and theory to develop rate rules, which are used by the generation algorithm to create a new reaction mechanism. The reaction mechanism is checked for consistency and completeness. This is followed by validation procedures. If the mechanism fails the validation procedures—that is, if predictions obtained by running ordinary differential equations (ODE) solvers don’t match experimental data—the process must be repeated. Upon passing the validation procedures, the size of the mechanism is reduced, giving a final mechanism. This mechanism can be used along with computational fluid dynamics (CFD) computation to perform accurate system simulations.



The rate-based approach,4 which has found favor within the kinetics community, uses reaction rate computations during mechanism generation to determine which species are chemically significant: assume as before that the initial reactants set is R0. Once all the reactions (including rate constants) involving R0 are determined, the system of ordinary differential equations (ODEs) is solved, giving the rate at which the various products (in R1) are formed. Only the products that are formed the fastest are added to R0, and the process is repeated. The process terminates when the rates at which all the products are formed are less than a user-specified threshold. The relative rates of individual reactions depend on the temperature and pressure of the system. Consequently, the mechanism derived for the same set of reactants could vary for different temperature–pressure combinations.
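In outline, rate-based generation is a loop of the following form; generate_reactions, solve_odes, and the promotion policy are hypothetical placeholders for the kinetics machinery described above.

```python
def rate_based_generation(initial_reactants, generate_reactions, solve_odes,
                          rate_threshold):
    """Outline of rate-based mechanism generation.

    generate_reactions(core) is assumed to return all reactions among the
    current core species (with rate constants), and solve_odes(reactions) to
    return a dict mapping each candidate product to its net formation rate at
    the temperature and pressure of interest. Both are placeholders here.
    """
    core = set(initial_reactants)                  # R0, the significant species
    while True:
        reactions = generate_reactions(core)
        formation_rates = solve_odes(reactions)
        candidates = {s: rate for s, rate in formation_rates.items()
                      if s not in core and rate >= rate_threshold}
        if not candidates:                         # everything else forms too slowly
            return core, reactions
        core.add(max(candidates, key=candidates.get))   # promote the fastest product
```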

Estimating Rate Constants

We now briefly describe how rate constants are estimated. Functional groups are specific groups of atoms within molecules that are responsible for the characteristic chemical behaviors of those molecules. For example, all acids (for example, HCl and H2SO4) contain the H atom and all alkalies (such as NaOH and KOH) contain the OH grouping. Reaction rate constants and other thermochemical properties that are required to set up the system of ODEs can be estimated from the functional groups that participate in a reaction. Estimates are required because the direct measurement of these quantities is often impractical.

Functional groups are represented in a rooted tree data model,5 in which the root represents a general functional group and its descendants represent more specialized groups. Figure 4 shows a portion of a functional group tree for classifying carbonyls (a carbon atom double-bonded to an oxygen atom). The more specialized the knowledge of functional groups in a reaction, the more accurate the rate constant estimate that can be obtained.

Consistency and Completeness Analysis

Mechanisms generated using these techniques must be checked for consistency and completeness. Due to the large sizes of the mechanisms, kineticists use software tools to verify accuracy and completeness.

Figure 2. Hydrogen (H) abstraction reaction example and bond-electron (BE) matrices for all participating species: (a) and (b) the reactants methane and hydroxyl radical, and (c) and (d) the products methyl radical and water. The * denotes an atom with an unpaired electron. Each atom has a label used later to identify it uniquely.


Figure 3. Reaction matrices for the reaction shown in Figure 2. (a) We combine the BE matrices of the reactants CH4 and OH and place boxes around matrix elements that will be impacted by the succeeding steps. (b) The expanded reaction matrix. (c) The result of adding the matrices from (a) and (b) with boxes around the elements that are affected by the addition. (d) Rows and columns are reordered, giving the products CH3 and H2O, by identifying connected components of the graph using, for example, breadth-first search.



Tools that automatically classify reactions into specific reaction types are key to this process because they allow the mechanism to be sorted into manageable groups and simplify the task of checking for completeness of reactions and consistency of rate coefficient assignments.

Automated Reaction Mapping

Automated reaction mapping (ARM) is a fundamental first step in the classification of chemical reactions. The objective is to determine which bonds are broken and formed in a reaction. Figure 5a shows our earlier hydrogen-abstraction reaction consisting of two reactants and two products. The input to the automated reaction mapping problem is a balanced chemical reaction; that is, the same number and types of atoms are present on both the left-hand side (LHS) and the right-hand side (RHS) of the reaction. The output is the list of bonds that were broken or formed to transform the reactants into products.

ARM is formally defined as an optimization problem: find a bijection (one-to-one mapping) from the reactant atoms to the product atoms that minimizes the number of bonds broken or formed. This mapping must respect atom labels; that is, a reactant C atom is mapped to a product C atom. Clearly, there’s only one way to map the C and O atoms in the reaction in Figure 5a. Of the 5! = 120 ways to map the five H atoms on the LHS to the five H atoms on the RHS, Figure 5a shows an optimal mapping that corresponds to breaking a C-H reactant bond and forming an O-H product bond for a total cost of two. Optimal solutions to ARM aren’t guaranteed to reflect the underlying chemistry, but have been found to be accurate for combustion reactions.

The general ARM problem is known to be NP-hard.6 Unlike other optimization applications, in which a suboptimal solution is often acceptable, it’s crucial to the reaction classification application that optimal solutions be found. We’ve developed a family of exhaustive algorithms that find optimal solutions by using an approach that systematically removes bonds from reactant and product graphs until the LHS and RHS of the reaction are identi-cal. In the example of Figure 5a, removing a C-H reactant bond and an O-H product bond results in the LHS becoming identical to the RHS. Bonds removed from reactant graphs represent bonds broken during the reaction, while bonds removed from product graphs represent bonds formed dur-ing the reaction.

To implement this algorithm, we need a meth-od to determine whether the LHS is identical to the RHS. This in turn boils down to the famous graph isomorphism question. Two graphs G and Hare isomorphic if there’s a bijection f from the verti-ces of G (denoted V(G)) to the vertices of H(V(H)) such that for any two vertices u and v in G, (u,v)is an edge in G if and only if ( f(u), f(v)) is an edge in H. In practice, this problem is solved for chemi-cal graphs using canonical labeling. The canoni-cal label CL(G) of a graph G is a character string such that for any two graphs G1 and G2, G1 is iso-morphic to G2 if and only if CL(G1) = CL(G2). In other words, to compare two chemical graphs, first convert them into strings using canonical label-ing and then perform a simple string comparison. Canonical labeling algorithms (e.g., Nauty) exist for chemical graphs that are fast in practice, but have exponential runtimes in the worst case.
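To make the canonical-labeling idea concrete, here is a deliberately naive Python sketch (not the authors' implementation) that canonicalizes a small labeled graph by trying every vertex relabeling and keeping the lexicographically smallest encoding; fast tools such as Nauty do this far more cleverly, and the atom/bond data layout is purely illustrative.

from itertools import permutations

def canonical_label(atoms, bonds):
    # Naive canonical label for a small labeled chemical graph.
    # atoms: list of element symbols indexed by vertex, e.g., ['C', 'H', 'H', 'H', 'H'].
    # bonds: iterable of (u, v) vertex-index pairs (single bonds only, for simplicity).
    # Two graphs receive the same string if and only if they are isomorphic.
    # Exponential in the number of atoms -- illustration only.
    n = len(atoms)
    bond_list = [tuple(b) for b in bonds]
    best = None
    for sigma in permutations(range(n)):        # sigma[v] = new index of old vertex v
        new_atoms = [None] * n
        for old, new in enumerate(sigma):
            new_atoms[new] = atoms[old]
        new_bonds = sorted(tuple(sorted((sigma[u], sigma[v]))) for u, v in bond_list)
        encoding = (tuple(new_atoms), tuple(new_bonds))
        if best is None or encoding < best:     # keep the smallest relabeled encoding
            best = encoding
    return repr(best)

# CH4 written down in two different vertex orders maps to the same label.
methane_a = canonical_label(['C', 'H', 'H', 'H', 'H'], [(0, 1), (0, 2), (0, 3), (0, 4)])
methane_b = canonical_label(['H', 'H', 'H', 'H', 'C'], [(4, 0), (4, 1), (4, 2), (4, 3)])
assert methane_a == methane_b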

ARM algorithms use a bit string data structure in which each bit corresponds to a reactant or product bond.

Figure 4. Portion of a carbonyl functional group tree. “!O” denotes any atom except oxygen.

[Figure 4 shows a tree rooted at the C=O group; the leaves of the portion shown include the aldehyde, ketone, and carbonate classes.]

Figure 5. Mapping a simple chemical reaction OH + CH4 → H2O + CH3. (a) A bijection, or one-to-one mapping, on vertices. (b) Each bond is labeled to indicate its position in the bit string. (c) A bit pattern of bonds that will be broken or retained.



A bit set to 0 indicates the bond shouldn't be broken, while a bit set to 1 indicates the bond should be broken. Figure 5c shows a bit string corresponding to the reaction of Figure 5b (in which each bond is labeled to indicate position in the bit string). In Figure 5c, the bit string indicates that the reactant bond labeled 3 and the product bond labeled 5 must be broken. We then use canonical labels to see whether this break results in the LHS becoming equal to the RHS; in this case, it does, giving us a mapping of cost 2, which is optimal. Note that there are bit patterns (such as 100000000) that don't give LHS = RHS and therefore aren't valid mappings. Also, because the reaction is balanced, breaking all the bonds—that is, a bit pattern with all 1s—guarantees the existence of a mapping.

Of course, we don't know a priori which of the $2^b$ subsets of b bonds will result in an optimal mapping. A relatively simple, but remarkably effective, approach is to try all the $\binom{b}{i}$ bit patterns with i 1s, with i going from 0 to b, stopping as soon as a mapping is found. In the worst case, as mentioned earlier, this approach is guaranteed to find a mapping only when i = b, when the bit pattern consists of all 1s, resulting in an exponential time complexity. In practice, the minimum number of bond changes in a chemical reaction is small. For hydrogen abstraction, this quantity is two, which means that the number of bit patterns examined by the algorithm is bounded by $O(b^2)$, a polynomial.
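A minimal Python sketch of this search strategy, reusing the canonical_label helper above and representing each side of the balanced reaction as a disjoint union of its species (one atom list and one bond list per side); this is an illustration of the idea under those assumptions, not the authors' code.

from itertools import combinations

def map_reaction(atoms_lhs, bonds_lhs, atoms_rhs, bonds_rhs):
    # Exhaustive ARM: find a smallest set of bonds whose removal makes the
    # LHS and RHS graphs isomorphic. Removed reactant bonds are "broken";
    # removed product bonds are "formed". Uses canonical_label() from above.
    tagged = [('LHS', b) for b in bonds_lhs] + [('RHS', b) for b in bonds_rhs]
    for k in range(len(tagged) + 1):                  # try 0, 1, 2, ... bond changes
        for removed in combinations(tagged, k):
            removed = set(removed)
            keep_lhs = [b for b in bonds_lhs if ('LHS', b) not in removed]
            keep_rhs = [b for b in bonds_rhs if ('RHS', b) not in removed]
            if canonical_label(atoms_lhs, keep_lhs) == canonical_label(atoms_rhs, keep_rhs):
                return ([b for side, b in removed if side == 'LHS'],   # bonds broken
                        [b for side, b in removed if side == 'RHS'])   # bonds formed

# OH + CH4 (atoms 0-6: O H C H H H H)  ->  H2O + CH3 (atoms 0-6: O H H C H H H)
broken, formed = map_reaction(
    ['O', 'H', 'C', 'H', 'H', 'H', 'H'], [(0, 1), (2, 3), (2, 4), (2, 5), (2, 6)],
    ['O', 'H', 'H', 'C', 'H', 'H', 'H'], [(0, 1), (0, 2), (3, 4), (3, 5), (3, 6)])
print(len(broken) + len(formed))   # 2: one C-H bond broken, one O-H bond formed

Because the loop tries bond subsets in order of increasing size, the first mapping it finds is optimal; for hydrogen abstraction it stops after examining patterns with at most two 1s.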

Reaction Classification

A classified and sorted reaction mechanism can be used to

■ check for completeness in the mechanism,
■ check the consistency of rate coefficient assignments,
■ focus on unclassified reactions when looking for problems if validation fails, and
■ compare multiple mechanisms that model the same phenomena.

Reaction classification is based on rules associated with the properties of the reaction and its species. These rules can be recorded in a rule-based system such as Jess, which allows for rule modification without requiring recompilation of the software and redeployment of the system.7 The rules for hydrogen abstraction are as follows:

(defrule habstraction
  (1) (Reaction {numReactants == 2})
  (2) (Reaction {numProducts == 2})
  (3) (Reaction {atLeastOneRadicalReactant == TRUE})
  (4) (Reaction {atLeastOneRadicalProduct == TRUE})
  (5) (Reaction {sameRadicalReactantAndProduct == FALSE})
  (6) (Mapping {allHydrogenBonds == TRUE})
  (7) (Mapping {hydrogenGoingFromStableToRadicalReactant == TRUE})
  (8) (Reaction {numBondsBroken == 1})
  (9) (Reaction {numBondsFormed == 1})
  =>
  (add (new String HydrogenAbstraction)))

Statements (1) and (2) verify that there are two reactants and two products using methods from the Reaction class. Statements (3) and (4) verify the existence of at least one radical reactant and one radical product using methods from the Reaction class. Statement (5) verifies that the radical reactant isn't the same as the radical product using a method from the Reaction class. Statement (6) verifies that each bond broken or formed was connected to a hydrogen atom using a method from the Mapping class. Statement (7) verifies that a hydrogen atom moved from a stable to a radical reactant using a method from the Mapping class. Statements (8) and (9) verify that exactly one reactant bond was broken and one product bond was formed using methods from the Reaction class. Notice that rules (8) and (9) pertain to bonds broken and formed in the reaction obtained using the ARM techniques described earlier.
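For readers who don't use Jess, the same test reads naturally as a single predicate; in the Python sketch below the attribute names simply mirror the methods referenced in the rule and are hypothetical, not an existing API.

def is_hydrogen_abstraction(reaction, mapping):
    # Python paraphrase of the Jess habstraction rule above.
    return (reaction.num_reactants == 2 and
            reaction.num_products == 2 and
            reaction.at_least_one_radical_reactant and
            reaction.at_least_one_radical_product and
            not reaction.same_radical_reactant_and_product and
            mapping.all_hydrogen_bonds and
            mapping.hydrogen_going_from_stable_to_radical_reactant and
            reaction.num_bonds_broken == 1 and
            reaction.num_bonds_formed == 1)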

The complicated nature of gas-phase reaction systems makes it impractical to devise a set of rules that classify all reactions. Unclassified reactions are important in their own right because they allow the kineticist to focus on problems in mechanisms that have failed validation procedures. Our system was able to determine the classification for about 95 percent of the reactions in a set of benchmark combustion mechanisms.

Mechanism Reduction

After consistency and completeness analysis, the system of ODEs is solved and concentration-time profiles are generated. With improvements in computing hardware and ODE solver algorithms, these large systems are now solved routinely. The mechanisms are validated by comparing the predictions with available data. Many of the problems of interest require coupling of the validated kinetic mechanism


with computational fluid dynamics (CFD) to address the coupled kinetics/transport problem. The resulting computational demand is such that the kinetics community has invested significant effort to develop mechanism reduction techniques that replace the original large mechanism with a much smaller one that closely approximates the behavior of the original. Specifically, the solution of the system of ODEs requires at least $O(N^2)$ time, where N is the number of species. Further, these sets of equations must be solved in large numbers of cells.

A representative mechanism reduction technique is based on directed relation graphs (DRG).8 Each species in the original mechanism is represented by a vertex in the DRG. Intuitively, there's a directed edge from species X to species Y in the DRG if and only if the removal of Y would significantly impact the production rate of X. In other words, an edge from X to Y means that we must retain Y in the reduced mechanism to correctly evaluate the production rate of X. We now describe a quantitative criterion based on the underlying chemistry to formalize this idea. The total production rate of X, denoted P(X), is the sum of its individual reaction production rates, added over all the reactions in which X participates. The production rate of X that can be directly attributed to Y, denoted as P(X, Y), is a similar sum restricted to the subset of reactions in which both X and Y participate. This is normalized to obtain

$$r_{XY} = \frac{P(X, Y)}{P(X)}.$$

There's an edge from X to Y in the DRG if and only if $r_{XY}$ exceeds a small user-defined threshold value $\epsilon$.

Given a starting user-specified set S of major species in the mechanism, the key algorithmic idea of this approach is to traverse the graph using, for example, depth-first search to identify, in linear time, all the vertices that are reachable from S. These reachable vertices are precisely the species that must be retained in the reduced mechanism. The species that aren't reachable, along with the reactions they participate in, are eliminated.
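A compact sketch of the DRG pruning step in Python, assuming the pairwise rates P(X, Y) and the totals P(X) have already been computed from the mechanism; the dictionary layout and function name are our own illustration, not the authors' implementation.

def reduce_mechanism(P_total, P_pair, seeds, eps=0.01):
    # Directed relation graph (DRG) reduction.
    # P_total : dict, species -> P(X), total production rate of X.
    # P_pair  : dict, (X, Y) -> P(X, Y), production of X attributable to Y.
    # seeds   : iterable of user-specified major species.
    # eps     : threshold for keeping an edge X -> Y.
    # Returns the set of species to retain in the reduced mechanism.

    # Build the DRG: edge X -> Y whenever removing Y would matter for X.
    edges = {}
    for (X, Y), pxy in P_pair.items():
        r_xy = pxy / P_total[X] if P_total[X] else 0.0
        if r_xy > eps:
            edges.setdefault(X, set()).add(Y)

    # Depth-first search from the seed set; reachable species are retained.
    retained, stack = set(seeds), list(seeds)
    while stack:
        X = stack.pop()
        for Y in edges.get(X, ()):
            if Y not in retained:
                retained.add(Y)
                stack.append(Y)
    return retained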

We've discussed pure chemical kinetics, in that mechanisms have been generated without regard to the spatial aspects of real systems. We've also assumed that systems are homogeneous; that is, species are uniformly distributed throughout a volume so that the likelihood of a particular reaction taking place is independent of location. This is also known as zero-dimensional kinetics. However, nature isn't necessarily homogeneous—different species occur in different regions with varying concentrations and temperatures. For example, in the context of atmospheric chemistry, a combination of factors leads to substantial stratification of species and temperature with height. Similarly, it's well known that tubular reactors have radial variations in flow properties that introduce heterogeneity in the system. Therefore, to more accurately model real systems, we need a tighter coupling of the computational chemical kinetics techniques described here with CFD. We expect that improvements in computing technology (hardware, software, and algorithms) will facilitate the integration of these two disciplines.9

References

1. M.J. Molina and F.S. Rowland, "Stratospheric Sink for Chlorofluoromethanes: Chlorine Atom-Catalysed Destruction of Ozone," Nature, vol. 249, 1974, pp. 810–812.

2. I. Ugi et al., "New Applications of Computers in Chemistry," Angewandte Chemie, Int'l Ed., vol. 18, no. 2, 1979, pp. 111–123.

3. L.J. Broadbelt, S.M. Stark, and M.T. Klein, "Computer Generated Pyrolysis Modeling: On-the-Fly Generation of Species, Reactions, and Rates," Industrial & Eng. Chemistry Research, vol. 33, no. 4, 1994, pp. 790–799.

4. R.G. Susnow et al., "Rate-Based Construction of Kinetic Models for Complex Systems," J. Physical Chemistry A, vol. 101, no. 20, 1997, pp. 3731–3740.

5. W.H. Green Jr., "Predictive Kinetics: A New Approach for the 21st Century," Chemical Eng. Kinetics, vol. 32, Academic Press, 2007, pp. 1–50.




6. J. Crabtree and D. Mehta, "Automated Reaction Mapping," J. Experimental Algorithmics, vol. 13, no. 15, 2009, article no. 15.

7. T. Kouri et al., "RCARM: Reaction Classification Using ARM," Int'l J. Chemical Kinetics, vol. 45, no. 2, 2013, pp. 125–139.

8. T. Lu and C.K. Law, "A Directed Relation Graph Method for Mechanism Reduction," Proc. Combustion Institute, vol. 30, no. 1, 2005, pp. 1333–1341.

9. S.W. Churchill, "Interaction of Chemical Reactions and Transport. 1. An Overview," Industrial & Eng. Chemistry Research, vol. 44, no. 14, 2005, pp. 5199–5212.

Dinesh P. Mehta is professor of electrical engineering and computer science at the Colorado School of Mines. His research interests include applied algorithms, VLSI design automation, and cheminformatics. Mehta received a PhD in computer and information science from the University of Florida. He's a member of IEEE and the ACM. Contact him at [email protected].

Anthony M. Dean is a professor of chemical engineering and vice president for research at the Colorado School of Mines. His research interests include quantitative kinetic characterization of reaction networks in a variety of systems. Dean received a PhD in physical chemistry from Harvard University. He's a member of the American Chemical Society, the American Institute of Chemical Engineers, and the Combustion Institute. Contact him at [email protected].

Tina M. Kouri is a research and development computer scientist at Sandia National Labs. Her research interests include applied algorithms and cheminformatics. Kouri received a PhD in mathematical and computer sciences from the Colorado School of Mines. She’s a member of the ACM. Contact her at [email protected].




CLOUD COMPUTING

A Cloud-Based Seizure Alert System for Epileptic Patients That Uses Higher-Order Statistics

Sanjay Sareen | Guru Nanak Dev University, Amritsar, India, and IK Gujral Punjab Technical University, Kapurthala, India

Sandeep K. Sood | Guru Nanak Dev University Regional Campus, Gurdaspur, India

Sunil Kumar Gupta | Beant College of Engineering and Technology, Gurdaspur, India

Automatic detection of an epileptic seizure before its occurrence could protect patients from accidents or even save lives. A framework that automatically predicts seizures can exploit cloud-based services to collect and analyze EEG data from a patient's mobile phone.

Epilepsy is a disorder that affects the brain, causing seizures. During a seizure, a patient could lose consciousness, including while walking or driving a vehicle, which could result in significant injury or death. According to a recent survey, the main causes of death for epileptic patients include sudden unexpected death during epilepsy due to drowning and accidents, which account for 89 percent of total epilepsy-related deaths in Australia.1 Such patients can benefit from an alert before the start of a seizure or emergency treatment when they have a seizure, thus improving their quality of life and safety considerably.

In a clinical study, Brian Litt and colleagues2 observed that an increase in the amount of abnormal electrical activity occurs before a seizure's onset. One of the most important steps to protect the life of an epileptic patient is the early detection of seizures, which can help patients take precautionary measures and prevent accidents. To detect a seizure during the transition from a normal state to an ictal (mid-seizure) state, the electrical activity in the patient's brain needs to be recorded continuously and efficiently around the clock. The electroencephalogram (EEG) is the most commonly used technique to measure electrical activity in the brain for the diagnosis of epileptic seizures.

In this direction, wireless sensor network (WSN) technology is emerging as one of the most promising options for real-time and continuous remote monitoring of chronically ill and assisted-living patients, thus minimizing the need for caregivers. One important segment of WSNs is body sensor networks (BSNs), which record the vital signs of a patient such as heart rate, electrocardiogram (ECG), and EEG. These wearable sensors are placed on the patient's body, and their key benefit is mobility: they enable the patient to move freely inside or outside the home. BSNs generate a huge amount of sensor data that needs to be processed in real time to provide timely help to the patient. Cloud computing provides the ability to store and analyze this rapidly generated sensor data in real time from sensors of different patients residing in different geographic locations. The cloud computing infrastructure integrated with BSNs provides an infrastructure to monitor and analyze the sensor data of large numbers of epileptic patients around the globe efficiently and in real time.3 The cloud service provider is bound to provide an agreed-upon quality of service based on a service-level agreement (SLA), and appropriate compensation is paid to the customer if the required service levels aren't met.4 To protect the patient from accidents when a seizure occurs, ideally, family members could continuously monitor the patient everywhere, which isn't feasible under traditional circumstances. Hence, the main objectives of our proposed system are


■ to detect preictal variations that occur prior to seizure onset so that the patient can be warned in a timely manner before the start of a seizure, and

■ to alert the patient, his or her family members, and a nearby hospital for emergency assistance.

To achieve these objectives, we propose a model in which each patient is registered by entering personal information through a mobile phone; a unique identification number (UID) is then allocated to that person. The data from body sensors in digital form is collected through patients' mobile phones using Bluetooth technology. The fast Walsh-Hadamard transform (FWHT) is used to extract features or abnormalities from the EEG signal; these features are then reduced using higher-order spectral analysis (HOSA) and classified into normal, preictal, and ictal states using a Gaussian process classification algorithm.

Related Work in Seizure Prediction

The sensor data stream generated from body sensor networks (BSNs) has drawn the attention of researchers, many of whom have taken the initiative to develop cloud-based systems based on BSNs to develop e-healthcare applications. In 2003, Leon Iasemidis and colleagues1 designed an algorithm based on the short-term maximum Lyapunov exponents (STLmax) to detect a seizure prior to its occurrence. However, it's based on the assumption that the occurrence of the first seizure is known. Joel Niederhauser and colleagues2 proposed a model to detect a seizure using a periodogram of the electroencephalogram (EEG) signal and demonstrated that EEG events occur prior to the electrical onset of a seizure. However, this model is only applicable to patients with temporal lobe epilepsy. In 2008, Hasan Ocak and colleagues3 proposed a technique for the analysis of seizures using wavelet packet decomposition and a genetic algorithm. Shayan Fakhr and colleagues4 presented a review of a variety of techniques for EEG signal analysis of patients in a sleep state. They studied different preprocessing, feature extraction, and classification techniques to process the sleep EEG signals. Sriram Ramgopal and colleagues5 presented another review on seizure prediction and detection methods and studied their usage in closed-loop warning systems. In 2015, Mohamed Menshawy and colleagues6 developed a mobile-based EEG monitoring system to detect epileptic seizures. They implemented an appropriate combination of different algorithms for preprocessing and feature extraction of EEG signals. In this model, a k-means clustering algorithm is used to classify the features into different homogeneous clusters in terms of their morphology.

Recently, cloud computing in the field of healthcare has started to gain momentum. Suraj Pandey and colleagues7 proposed an architecture for online monitoring of patient health using cloud computing technologies. Abdur Forkan and colleagues8 proposed a model based on service-oriented architecture that enables real-time assisted living services. It provides a flexible middleware layer that hides the complexity in the management of sensor data from different kinds of sensors as well as contextual information. Giancarlo Fortino and colleagues9 proposed an architecture based on the combined use of BSNs and the cloud computing infrastructure. It monitors assisted living via wearable body sensors that send data to the cloud with the help of a mobile phone. Muhannad Quwaider and Yaser Jararweh10 proposed a prototype for efficient sensor data collection from wireless body area networks that contains a virtual machine and a virtualized cloudlet that integrates the cloud capabilities with the sensor devices. Recently, Ahmed Lounis and colleagues11 proposed a new secure cloud-based system using wireless sensor networks (WSNs) that enables a healthcare institution to process data captured by a WSN for patients under supervision.

References

1. L.D. Iasemidis et al., "Adaptive Epileptic Seizure Prediction System," IEEE Trans. Biomedical Eng., vol. 50, no. 5, 2003, pp. 616–627.

2. J.J. Niederhauser et al., "Detection of Seizure Precursors from Depth-EEG Using a Sign Periodogram Transform," IEEE Trans. Biomedical Eng., vol. 51, no. 4, 2003, pp. 449–458.

3. H. Ocak, "Optimal Classification of Epileptic Seizures in EEG Using Wavelet Analysis and Genetic Algorithm," Signal Processing, vol. 88, 2008, pp. 1858–1867.

4. S.M. Fakhr et al., "Signal Processing Techniques Applied to Human Sleep EEG Signals—a Review," Biomedical Signal Processing and Control, vol. 10, 2014, pp. 21–33.

5. S. Ramgopal et al., "Seizure Detection, Seizure Prediction, and Closed-Loop Warning Systems in Epilepsy," Epilepsy and Behavior, vol. 37, 2014, pp. 291–307.

6. E.M. Menshawy, A. Benharref, and M. Serhani, "An Automatic Mobile-Health Based Approach for EEG Epileptic Seizures Detection," Expert Systems with Applications, vol. 42, 2015, pp. 7157–7174.

7. S. Pandey et al., "An Autonomic Cloud Environment for Hosting ECG Data Analysis Services," Future Generation Computer Systems, vol. 28, 2012, pp. 147–154.

8. A. Forkan, I. Khalil, and Z. Tari, "CoCaMAAL: A Cloud-Oriented Context-Aware Middleware in Ambient Assisted Living," Future Generation Computer Systems, vol. 35, 2014, pp. 114–127.

9. G. Fortino et al., "BodyCloud: A SaaS Approach for Community Body Sensor Networks," Future Generation Computer Systems, vol. 35, 2014, pp. 62–79.

10. M. Quwaider and Y. Jararweh, "Cloudlet-Based Efficient Data Collection in Wireless Body Area Networks," Simulation Modelling Practice and Theory, vol. 50, 2015, pp. 57–71.

11. A. Lounis et al., "Healing on the Cloud: Secure Cloud Architecture for Medical Wireless Sensor Networks," Future Generation Computer Systems, vol. 55, 2016, pp. 266–277.


GPS is used to track the location of patients from their respective mobile phones. Whenever the system detects the preictal state of the patient, an alert message is generated and sent to the patient's mobile phone, a family member, and a nearby hospital, depending upon the location of the patient.

See the “Related Work in Seizure Prediction” sidebar for more information on the use of cloud computing and wireless BSNs to predict epileptic seizures.

Proposed Model

The proposed model consists of target patients, a BSN, data acquisition and transmission, data collection, seizure prediction, and GPS-based location tracking. The BSN consists of wearable EEG sensors placed on different parts of the scalp for capturing EEG signals. The data acquisition and transmission component comprises a smartphone and an Android-based application that capture data from the body sensors and send it to the cloud along with the user's personal information, manually entered through the app. The data collection component is used to collect and store raw sensor data in a database and transforms it into a suitable form for further processing and analysis. It contains a cloud storage repository to store patients' personal information and their sensor data. The system assigns each user a UID at the time of registration. The seizure prediction component performs tasks such as data validation, feature extraction, and feature classification. The FWHT and higher-order statistics analyze and extract the feature set from the EEG signal. A Gaussian process classifies the feature set into the normal, preictal, and ictal states of seizure. Based on the classification, the system can generate an alert message and send it to the hospital closest to the user's geographic location, a family member, and the user. The objective in sending an alert message to users' mobile phones is to encourage them to take precautionary measures to protect themselves from injuries. The GPS-based location-tracking component keeps track of the location of the patients with the help of their mobile phones. Figure 1 demonstrates the design of our proposed system for predicting and detecting seizures.

Data Acquisition and Transmission

The EEG sensor device contains one or more electrodes to detect the voltage of current flowing through the brain neurons from multiple segments of the brain. In our model, we use an Emotiv EPOC headset, which contains 14 sensors placed on the scalp to read signals from different areas of the brain.

Figure 1. The architecture of the proposed cloud-based seizure alert system for epileptic patients. The model integrates wireless body sensor network, mobile phone, cloud computing, and Internet technology to predict the seizure in real time irrespective of the patient’s geographic location.



The signals are sampled at 200 Hz using an analog-to-digital converter before being sent to a mobile phone. The raw data streams generated by the EEG sensors are collected continuously in real time by the patient's own mobile phone using a wireless communication protocol. The mobile phone constitutes a wireless personal area network (WPAN) that receives data from the BSN. Bluetooth is used to transfer the data streams between Bluetooth-enabled devices over short distances. Several sensor devices can be connected to one Bluetooth server device (such as a mobile phone), which acts as a coordinator. An Android-based application collects digital sampled values from the body sensors. The mobile phone transmits the data to the cloud via a suitable communication protocol, such as Wi-Fi, 3G, or 4G networks.

Data Collection

Signal data streams from wearable body sensors are captured through mobile phones and sent to the cloud for storage and analysis. The received data is stored in a cloud database known as the epilepsy record (ER) database. Patients' personal information is stored in the ER database with their UID, as shown in Table 1.

Data Validation

The original EEG signals recorded by the sensors are contaminated with a variety of external influences such as noise and artifacts that originate from two sources: physiological and nonphysiological. Physiological artifacts are generated from sources within the body, such as eye movements, ECGs, and electromyography. Nonphysiological artifacts come from external sources such as electronic components, line power, and the environment. Such artifacts should be eliminated from the EEG signals by using a filtering mechanism such as a band-pass or low-pass filter.

Feature Extraction from EEG Signals

The EEG signal in its original form doesn't provide any information that can be helpful in detecting a seizure. The variation in signal pattern during different identifiable seizure states can be detected by applying an appropriate feature-extraction technique. Inadequate feature extraction might not provide good classification results, even when the classification method is highly optimized for the problem at hand. Several feature-extraction methods based on time-domain, frequency-domain, and wavelet transform (WT) features are available. Because EEG signals are nonstationary and their frequency response varies with time, conventional methods based on the time and frequency domains aren't suitable for seizure prediction.

Fast Walsh-Hadamard transform. The FWHT decomposes a signal into a group of rectangular or square waveforms with binary values of +1 or −1, known as Walsh functions. This generates a unique sequency value assigned to each Walsh function, which is used to estimate frequencies in the original signal. The FWHT has the ability to accurately detect signals that contain sharp discontinuities, and it takes less computation time using fewer coefficients.

The FWHT converts a signal from the time domain to the frequency domain and is effective for locating transient events, which might occur before the seizure onset, in both the time and frequency domains. It's capable of extracting and highlighting discriminating features of the EEG signal, such as epileptic spikes in the frequency and spectral domains, with greater speed and accuracy. The FWHT of a signal x(t) having length N can be defined as

$$y_n = \frac{1}{N} \sum_{i=0}^{N-1} x_i \, WAL(n, i),$$

where $i = 0, 1, \ldots, N-1$ and $WAL(n, i)$ are the Walsh functions.
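The authors use Matlab's fwht for this step; for experimentation elsewhere, the transform itself is only a few lines of Python. The generic in-place butterfly below (with 1/N scaling to match the equation above) returns coefficients in natural Hadamard order, which differs from Matlab's default sequency ordering only by a permutation.

import numpy as np

def fwht(x):
    # Fast Walsh-Hadamard transform with 1/N scaling (natural/Hadamard order).
    # len(x) must be a power of two; pad with zeros otherwise.
    a = np.array(x, dtype=float)
    n = len(a)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a / n

# One 4,096-point EEG epoch (random data as a stand-in), transformed and
# then normalized as described in the text.
epoch = np.random.randn(4096)
coeffs = fwht(epoch)
normalized = (coeffs - coeffs.mean()) / coeffs.std()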

The features extracted from the EEG signal in the form of FWHT coefficients are normalized to remove any possible errors that might occur due to inadequate extracted features. This can be done using the equation

$$np_i = \frac{y_i - \mu}{\sigma}, \quad \forall\, i = 1, 2, \ldots, n,$$

where $\mu$ and $\sigma$ are the mean and standard deviation, respectively, over all features.

Table 1. Personal attributes of a patient.

Serial number | Attribute | Data type
1 | Social Security Number | Integer
2 | Name | String
3 | Age | Integer
4 | Sex | String
5 | Address | String
6 | Mobile number | Integer
7 | Family member's name | String
8 | Family member's mobile number | Integer


Higher-order spectral analysis (HOSA). Seizure prediction for epileptic patients with a high degree of accuracy and rapid response is a major challenge. In the preictal state, bursts of complex epileptiform discharges occur in the patient's EEG signal. These quantitative EEG changes can be detected by appropriate analysis of EEG signals. Higher-order statistics are widely used for the analysis of EEG and ECG data to diagnose faults in the human body such as tremor, epilepsy, and heart disease.5 However, EEG signals contain significant nonlinear and non-Gaussian features among signal frequency components.6 Existing techniques aren't sufficient for handling these nonlinear and non-Gaussian characteristics. HOSA is able to effectively analyze such signals to diagnose signal abnormalities.

HOSA is the spectral representation of a signal's higher-order cumulants. It's used for power spectral analysis, which is a natural extension to the signal's higher-order powers and is represented by the average of the signal squared (that is, the second-order moment).7 The higher-order statistics are very useful in handling non-Gaussian random processes and are used to retrieve the higher-order cumulants of a signal, even in the presence of artifacts. The bispectrum of a signal x is calculated using the Fourier transform evaluated at f1 and f2 and is defined by the mathematical equation

$$B(f_1, f_2) = \sum_i X_i(f_1)\, X_i(f_2)\, X_i^{*}(f_1 + f_2),$$

where X(f1), X(f2), and X(f1 + f2) represent the spectral components computed by the fast Fourier transform (FFT) algorithm. The value X*(f) is the conjugate of X(f).

In our study, we used bicoherence, which is a normalized bispectrum and is very useful in analyzing EEG signals. Bispectrum values contain both the amplitude of the signal and the degree of phase coupling, whereas bicoherence values directly represent the degree of phase coupling. The bispectrum is normalized using bicoherence, such that it contains a value between 0 and 1, and is defined by the equation

$$BIC(f_1, f_2) = \frac{B(f_1, f_2)}{\sum_i P_i(f_1)\, P_i(f_2)\, P_i(f_1 + f_2)},$$

where P(f1), P(f2), and P(f1 + f2) represent the power spectrum. We use spectral estimation to

detect the distribution of the energy contained in a signal, and we use entropy parameters to characterize the irregularity (normal, preictal, and ictal) of the EEG signal. Different statistical characteristics are examined and the entropy-based parameters listed in Equations 1 through 3 are considered to be the most important and distinctive for seizure state detection. Different entropy values of the normalized bispectrum are evaluated and can be represented mathematically as follows.

Normalized Shannon entropy (E1) is

$$E_1 = -\sum_i p_i \log p_i, \quad \text{where} \quad p_i = \frac{BIC(f_1, f_2)}{\sum_{\Omega} BIC(f_1, f_2)}. \qquad (1)$$

Log energy entropy (E2) is

$$E_2 = -\sum_i \log(p_i), \quad \text{where} \quad p_i = \frac{BIC(f_1, f_2)^2}{\sum_{\Omega} BIC(f_1, f_2)^2}. \qquad (2)$$

The concentration of the ρ-norm entropy (E3) is

$$E_3 = \sum_i \log(p_i), \quad \text{where} \quad p_i = \frac{BIC(f_1, f_2)^{\rho}}{\sum_{\Omega} BIC(f_1, f_2)^{\rho}}. \qquad (3)$$
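To make Equations 1 through 3 concrete, the following Python sketch evaluates the three entropy features from a precomputed bicoherence matrix; the exponents follow our reading of the equations above, the choice of ρ is up to the analyst, and the input array is only a stand-in for the HOSA toolbox output.

import numpy as np

def entropy_features(bic, rho=3.0, eps=1e-12):
    # Entropy features of a bicoherence matrix over the region Omega (Eqs. 1-3).
    b = np.abs(bic).ravel() + eps          # small offset avoids log(0)
    p1 = b / b.sum()                       # normalization in Eq. 1
    E1 = -np.sum(p1 * np.log(p1))          # normalized Shannon entropy
    p2 = b**2 / np.sum(b**2)               # normalization in Eq. 2
    E2 = -np.sum(np.log(p2))               # log energy entropy
    p3 = b**rho / np.sum(b**rho)           # normalization in Eq. 3
    E3 = np.sum(np.log(p3))                # rho-norm entropy
    return E1, E2, E3

# Random 256 x 256 stand-in for a bicoherence estimate of one epoch.
E1, E2, E3 = entropy_features(np.random.rand(256, 256))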

Feature classification. Once the different entropy parameters are extracted from the EEG signal, an automatic classification of features into different seizure states is one of the essential processes of our model. We implemented an unsupervised classification technique because live EEG data coming from patients' mobile phones need to be analyzed in the cloud in real time, and it's not possible to first label such EEG data with the help of a physician. Thus, these techniques only depend on the information contained in the EEG data, and there's no need for prior knowledge about the data. While several unsupervised


classification techniques are available for classifying EEG signals, we adopted the Gaussian process technique based on the Laplace approximation. We made this choice because it can be applied to very large databases. In this technique, clustering is generally used prior to the classifier to prepare its training dataset.

The Gaussian process classifier is used to model the class probabilities of the three seizure states and is given by the equation

$$p(Y_i \mid f_i) = \frac{\exp\!\left(Y_i^{T} f_i\right)}{\sum_{j=1}^{c} \exp\!\left(f_{i,j}\right)},$$

where $f_i = [f_{i,1}, \ldots, f_{i,c}]^{T}$ is a vector of the latent function values related to data point $i$, and $Y_i = [y_{i,1}, \ldots, y_{i,c}]^{T}$ is the corresponding target vector, which has one entry for the correct class for observation $i$ and zero entries otherwise.
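The softmax step itself is easy to check in isolation. In the small numpy sketch below, the latent values would come from the fitted Gaussian process; here they are simply made up for illustration.

import numpy as np

def class_probabilities(f_i):
    # Softmax of the latent values f_i = [f_i1, ..., f_ic] for one data point,
    # giving the probability of each of the c seizure states.
    e = np.exp(f_i - np.max(f_i))   # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical latent values for one epoch: normal, preictal, ictal.
print(class_probabilities(np.array([0.2, 1.7, -0.4])))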

GPS-Based Location Tracking

The objective of location tracking is to identify the patient's location to provide him or her with immediate treatment whenever a seizure occurs. The mobile phone's GPS function is used to track the patient's location, which is sent to the cloud through the Internet. An alert message is generated before the triggering of the seizure and is sent to the patient's mobile phone, as well as to family members and a nearby hospital.

Experimental Results and Performance Analysis

We conducted different experiments to analyze and classify EEG signals. Our objective was to identify the preictal state so as to provide alerts to the patient before the seizure actually occurs. The EEG recordings used in this experiment were collected from five patients at a sampling rate of 173.61 Hz by using surface electrodes placed on the skull. Each set (A–E) contains 100 files, and each file consists of 4,096 values of one EEG time series in ASCII code. The first four sets (A–D) were obtained from nonepileptic patients. The last set (E) was recorded from an epileptic patient who had seizure activity. Therefore, our experimental data set contains a total of 500 single-channel EEG epochs (windows), out of which 400 are of nonepileptic patients and 100 are of an epileptic patient. Each EEG epoch is 23.6 s long. The recordings were captured using a 128-channel amplifier and converted into digital form at a sampling rate of 173.61 Hz and 12-bit analog/digital resolution. Table 2 shows some details of EEG recordings related to nonepileptic and epileptic patients.8

We analyzed the EEG signals using Matlab and its toolboxes. We performed our experiments on an Intel i5 CPU at 2.40 GHz with 2 Gbytes memory running on Windows 7. Our experiment performed the following tasks:

■ EEG signal decomposition,
■ bispectral analysis,
■ feature extraction based on entropy,
■ feature classification,
■ performance analysis on Amazon Elastic Compute Cloud (EC2), and
■ performance comparison.

EEG Signal Decomposition

In the first stage, we applied the FWHT to decompose the signal. We extracted the discriminating features in terms of the frequency and spectral domains. We applied Algorithm 1 to each patient's EEG data file, each of which contains 4,096 points and generates 8,192 coefficients. Figure 2 represents the original EEG signal and its FWHT coefficients for a nonepileptic and an epileptic patient.

One of the major problems of seizure state characterization is in identifying whether the process is Gaussian or linear. In our experiment, the Hinich test is applied to detect the nonskewness and linearity of the process.9

Table 2. Summary of analyzed EEG recordings.

Queries | Set A (Normal EEG) | Set B (Normal EEG) | Set C (Normal EEG) | Set D (Normal EEG) | Set E (Epileptic EEG)
Patient's type | Nonepileptic | Nonepileptic | Nonepileptic | Nonepileptic | Epileptic
Recording type | Surface | Surface | Surface | Surface | Intracranial
No. epochs | 100 | 100 | 100 | 100 | 100
Epoch duration | 23.6 s | 23.6 s | 23.6 s | 23.6 s | 23.6 s


Figure 2. Original EEG signal and fast Walsh-Hadamard transform (WHT) coefficients: (a) nonepileptic patient and (b) epileptic patient. The fast WHT coefficients extract the discriminating features of the EEG signal, such as epileptic spikes.


Algorithm 1. Feature extraction and selection.

Input: List of patient directories containing EEG data.
Output: Shannon entropy, log energy entropy, and norm entropy.

Let wh[ ] and np[ ] be the one-dimensional arrays that store the FWHT and normalized FWHT coefficients for each sample;
for each patient UID do
    Locate the directory labeled with the UID of the patient registered with the system;
    if EEG data for that UID already exists then
        Replace the existing data with the new data;
    else
        Create a new patient directory with the UID of the patient and store the EEG data;
    end if
    Read the EEG data from the directory;
    Compute the FWHT coefficients for each EEG sample by invoking the fwht Matlab function and save them in the one-dimensional array wh[ ];
    for each FWHT coefficient do
        Find the mean and standard deviation of the FWHT coefficients using the mean() and std() Matlab functions;
        Normalize each FWHT coefficient using np[i] = (wh[i] - mean(wh))/std(wh);
    end for
    Apply the bispecd and bicoher Matlab routines to generate the bispectrum and bicoherence, respectively, using the following parameters:
        (a) data vector of FWHT coefficients,
        (b) fast Fourier transform (FFT) length,
        (c) window specification for frequency-domain smoothing,
        (d) number of segments per sample, and
        (e) percentage of overlap;
    Compute the Shannon entropy, log energy entropy, and norm entropy of the bicoherence;
end for


Different statistical parameters such as mean, variance, skewness, Gaussianity, and so on, based on the FWHT coefficients, are evaluated in Table 3.

Bispectral Analysis

Bispectral analysis is a powerful tool for detecting interfrequency phase coherence, which is used for characterization of the EEG signal's different states. The bispectrum is computed for each dataset to perform in-depth analysis of features using the HOSA toolbox in Matlab.10 For this purpose, the normalized point (np) is calculated for each FWHT coefficient (y) by using the equation

$$np = \frac{y - \mathrm{mean}(y)}{\mathrm{std}(y)},$$

where the functions mean() and std() are used to calculate the mean and standard deviation, respectively, of the FWHT coefficients.

The bispectrum is computed by applying the direct FFT method from the HOSA toolbox to each normalized point. A data vector matrix of size 256 × 256 is obtained. Figure 3 shows the bispectrum of a nonepileptic and an epileptic patient.

The bicoherence, the normalized form of the bispectrum, is estimated using the direct FFT method in the HOSA toolbox. We used bicoherence to quantify the quadratic phase coupling in EEG signals, which is very useful in detecting nonlinear coupling in the time series for the characterization of different seizure states. Figure 4 represents the bicoherence of a nonepileptic and an epileptic patient.
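As a rough stand-in for the toolbox's direct-FFT estimators (bispecd and bicoher), the segment-averaged bispectrum and a normalized version can be sketched in Python as follows; windowing, overlap, and frequency-domain smoothing are omitted, and the normalization follows our reading of the bicoherence equation above, so the numbers will not match the toolbox exactly.

import numpy as np

def direct_bispectrum(x, nfft=256):
    # Segment-summed direct-FFT bispectrum over non-overlapping, unwindowed
    # segments of length nfft, plus a normalized version.
    # x: 1D signal, e.g., the normalized FWHT coefficients of one epoch.
    # Returns (B, bic), each of shape (nfft//2, nfft//2) for FFT bins
    # 0 <= f1, f2 < nfft/2.
    half = nfft // 2
    f2 = np.arange(half)
    B = np.zeros((half, half), dtype=complex)   # triple-product accumulator
    D = np.zeros((half, half))                  # power-spectrum-product accumulator
    for start in range(0, len(x) - nfft + 1, nfft):
        X = np.fft.fft(x[start:start + nfft])
        P = np.abs(X) ** 2
        for f1 in range(half):
            B[f1, :] += X[f1] * X[f2] * np.conj(X[f1 + f2])
            D[f1, :] += P[f1] * P[f2] * P[f1 + f2]
    bic = np.abs(B) / (D + 1e-12)   # normalization following the equation above
    return B, bic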

Feature Extraction Based on Entropy

In this stage of the experiment, we determine the best features relevant to the three seizure states (normal, preictal, and ictal) by evaluating different kinds of entropy from the bicoherence. In the seizure recognition, we considered three classes: normal, preictal, and ictal. Hence, we computed three different sets of entropy values for the recognition of different seizure states. Table 4 shows the mean values of different seizure states for the three selected features computed on the basis of third-order polyspectra for the five patients.

Table 3. Summary statistics for fast WHT coefficients.

Parameter | Ictal | Preictal | Normal
Mean | 1.9187e-017 | 2.28767e-017 | 2.30355e-017
Variance | 0.999878 | 0.999878 | 0.999878
Skewness (normalized) | 22.3652 | 0.505892 | 0.187998
Kurtosis (normalized) | 1,031.16 | 15.8076 | 7.19272
Gaussianity linearity test | 66,108.9418 | 180.1864 | 41.5563
R (estimated) | 2,733.5991 | 15.6339 | 1.266
  | 2,571.0311 | 7.1698 | 1.5579
R (theory) | 136.8142 | 7.4108 | 3.778
Maximum of bispectrum | 41.0551 | 1,226.2354 | 2,564.5996

Figure 3. Nonparametric bispectrum: (a) nonepileptic patient and (b) epileptic patient. The bispectrum is capable of retrieving the higher-order cumulants of a signal even in the presence of artifacts (FFT = fast Fourier transform).



Feature Classification

The Gaussian process model classifies the data into three classes, where each class represents a different seizure state. Algorithm 2 is designed to continuously take entropy-based features from the feature extraction component and perform an initial classification to detect different seizure states. After the initial classification, reclassification is performed as soon as new data is received. In our experiment, the three features selected from the 500 samples were used as input to the Gaussian classification model using Weka 3.6. Table 4 shows that the entropy parameters E1, E2, and E3 decrease from the normal state to the preictal state. The values of these parameters reduce significantly in the ictal state. Such variations in entropy parameters help label the different seizure states. Figure 5 depicts the different seizure states.

Performance Analysis Using Amazon EC2

We propose that the application for predicting a patient's seizure before its occurrence be hosted on the cloud so that we could test our model's performance in real time. For this purpose, a compute-optimized c4.xlarge single instance, consisting of four high-frequency Intel Xeon E5-2666 v3 (Haswell) vCPUs and 7.5 GiB (gibibytes) of RAM with dynamic provisioning, offered by Amazon EC2 (http://aws.amazon.com/ec2/instance-types), was used to host the application over the cloud. A Java-based application was designed and installed in the cloud to perform both the feature extraction and feature classification functions. EEG data for five patients isn't sufficient to evaluate our proposed model, so we used a bootstrapping technique11 to replicate the EEG data of these five patients to 50,000 patients randomly, using the coefficient minimum (−20.5920), maximum (25.5656), and mean (0.0122) as the variants. In a 60-minute experiment, the system started with 5,000 patients; after each 6-minute interval, the number of patients increased by 5,000.

Table 4. Entropy-based coefficients for the three selected features of seizure states.

Feature | Normal | Preictal | Ictal
E1 | 6.4122e+03 | 2.3083e+03 | 476.6912
E2 | 4.6655e+05 | 5.3641e+05 | 6.9407e+05
E3 | 1.3780e+04 | 6.0346e+03 | 1.3214e+03

Algorithm 2. EEG signal classification.

Input: Entropy parameters and the UID of a patient.
Output: Classified or reclassified feature set.

if the patient UID already exists then
    Replace the existing feature set with the new one;
    Execute the Gaussian process algorithm for the reclassification of the different seizure states;
else
    Create a new patient directory with the UID of the patient and store the data;
    Execute the Gaussian process algorithm;
end if
Label the three sets of clusters as normal, preictal, and ictal;
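Outside Weka, the same classify-or-reclassify step can be prototyped with scikit-learn's Gaussian process classifier, which also relies on a Laplace approximation; the (E1, E2, E3) vectors and cluster-derived labels below are synthetic stand-ins, not our experimental data.

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Synthetic stand-in for the entropy feature vectors and their
# cluster-derived labels: 0 = normal, 1 = preictal, 2 = ictal.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 3)) for c in (0.0, 1.5, 3.0)])
y = np.repeat([0, 1, 2], 50)

gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0)).fit(X, y)

# Class probabilities for a new epoch's entropy features.
new_epoch = np.array([[1.4, 1.6, 1.3]])
print(gpc.predict_proba(new_epoch))   # -> [[p_normal, p_preictal, p_ictal]]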

Figure 4. Bicoherence: (a) nonepileptic patient and (b) epileptic patient. The bicoherence is used to retrieve different types of entropy values to characterize the different seizure states.



Figure 6a represents our proposed model's resource utilization for several patients; Figure 6b shows the response time. The system has a lower response time for fewer patients due to the low computational load. Figure 6c shows the model's latency time, which increases with the number of patients.

We performed a comparative evaluation on a desktop computer and Amazon EC2 with different sets of patients, starting from 5,000 and increasing up to 50,000. Figure 7 shows the execution time to process and classify the EEG data. The results show that the time required for the computation of EEG data on Amazon EC2 is significantly lower than the time required on the desktop computer.

The accurate classification of patient data to detect different seizure states is a vital step in our proposed model. Different classification algorithms such as the multilayer perceptron (MLP),12 linear regression (LR),13,14 and least median of squares regression (LMSR)15 were also tested in Weka 3.616 to compare their performance with our proposed Gaussian process. Table 5 shows the summary statistics for the different classification models tested in Weka 3.6.

Table 6 shows the results of classification accuracy for the Gaussian process classifier. The classifier is able to classify normal, preictal, and ictal states with an accuracy of 84.20 percent, 86.40 percent, and 89.00 percent, respectively.

Next, we calculated the classification accuracy of detecting the preictal state versus a non-preictal state using the three statistical measures of sensitivity, specificity, and accuracy.17,18 The accuracy of each classification algorithm was tested in Weka 3.6; Table 7 shows the sensitivity, specificity, and accuracy scores. The proposed Gaussian process classification algorithm provides a high sensitivity of 83.6 percent and a high accuracy of 85.1 percent over all other classification models. Moreover, the Gaussian process classifier has a larger area under the receiver operating characteristic (ROC) curve than the other models. It's clear from Table 7 that the Gaussian classifier achieves the highest classification accuracy of 85.1 percent, which justifies its use in our proposed model.

Figure 5. Gaussian cluster plot based on three entropy features, E1, E2, and E3. The value of these entropy features decreases from normal state to ictal state and can be used to categorize the different seizure states.


Figure 6. Performance analysis of proposed model on Amazon EC2: (a) resource utilization of system, (b) response time of system, and (c) latency time of system.


Figure 7. Comparative performance of EEG signal analysis on the Amazon EC2 cloud and a desktop computer. The time required for the analysis of EEG data on Amazon EC2 is significantly lower than on the desktop computer.


Figure 8 compares the accuracy of the different classification algorithms.

Performance Comparison

We compared our model with the adaptive epileptic seizure prediction system,19 as both are designed to detect a seizure's preictal state.

The adaptive epileptic seizure prediction system19 relies on prior knowledge of the occurrence of the first seizure, so it can't be used for real-time monitoring. Our proposed model doesn't require such a condition and can therefore be used for real-time detection and monitoring of seizures.

The algorithm proposed by Leon Iasemidis and colleagues19 is based on the detection of critical electrode sites before a seizure, so its reliability and accuracy depend on the probability of detecting those critical electrode sites. Our proposed model takes EEG data from all the electrodes and extracts features relevant to the different seizure states; hence, it's more reliable and accurate.

Iasemidis and colleagues19 analyzed the spatiotemporal dynamical characteristics of multichannel intracranial EEG signals by measuring approximations of Lyapunov exponents, which are used to determine the stability of any steady-state behavior. Such dynamic behavior in the brain's spatiotemporal patterns occurs in patients with refractory temporal lobe epilepsy. Our model uses higher-order statistics to detect and extract features representing the nonlinearity of the EEG signal; hence, its use isn't limited to a specific part of the brain.

Table 5. Performance results of different classifiers tested in Weka 3.6.

Parameter                     Gaussian process   Multilayer perceptron   Linear regression   Least median of squares regression
Correlation coefficient       0.9782             0.9364                  0.9402              −0.8973
Mean absolute error           821.1310           982.234                 916.1129            6,485.2421
Root-mean-square error        1,032.1451         1,385.2702              1,466.9124          16,664.1458
Relative absolute error       22.4512%           29.4412%                27.5514%            192.4591%
Root relative squared error   22.0897%           28.1602%                30.6487%            342.4587%
Total no. instances           50,000             50,000                  50,000              50,000
Time taken                    21 s               11 s                    16 s                44 s

Table 6. Classification accuracy of the GP classifier with entropy features (E1, E2, E3).

Category   No. instances   No. correctly classified instances   Correct classification (%)
Normal     50,000          42,100                               84.2
Preictal   50,000          43,200                               86.4
Ictal      50,000          44,500                               89.0

Table 7. Detailed accuracy of the Gaussian and other models for EEG signal classification.

Classification model                 Sensitivity (%)   Specificity (%)   Accuracy (%)   ROC area
Gaussian process                     83.6              16.3              85.1           0.984
Multilayer perceptron                78.5              21.5              80.3           0.928
Linear regression                    71.7              28.3              77.4           0.892
Least median of squares regression   26.6              73.4              25.2           0.464

Figure 8. Performance analysis of the classification algorithms: classification accuracy (%) versus number of patients (in thousands) for the Gaussian process (GP), multilayer perceptron (MLP), linear regression (LR), and least median of squares regression (LMSR) classifiers.



Medical data needs to be shared among physicians, healthcare agencies, and other authorized users to provide better treatment at reduced costs. However, the privacy issues associated with sharing such sensitive data are a big concern. Future work will focus on incorporating new data privacy techniques to secure patients' personal and health information.

References
1. M. Bellon, R.J. Panelli, and F. Rillotta, "Epilepsy-Related Deaths: An Australian Survey of the Experiences and Needs of People Bereaved by Epilepsy," Seizure, vol. 29, 2015, pp. 162–168.
2. B. Litt et al., "Epileptic Seizures May Begin Hours in Advance of Clinical Onset: A Report of Five Patients," Neuron, vol. 30, 2001, pp. 51–64.
3. G. Fortino and M. Pathan, "Integration of Cloud Computing and Body Sensor Networks," Future Generation Computer Systems, vol. 35, 2014, pp. 57–61.
4. B. Javadi, J. Abawajy, and R. Buyya, "Failure-Aware Resource Provisioning for Hybrid Cloud Infrastructure," J. Parallel and Distributed Computing, vol. 72, no. 10, 2012, pp. 1318–1331.
5. J. Jakubowski et al., "Higher Order Statistics and Neural Network for Tremor Recognition," IEEE Trans. Biomedical Eng., vol. 49, no. 2, 2002, pp. 152–159.
6. P. Husar and G. Henning, "Bispectrum Analysis of Visually Evoked Potentials," IEEE Eng. Medicine and Biology, vol. 16, no. 1, 1997, pp. 57–63.
7. C.L. Nikias and J.M. Mendel, "Signal Processing with Higher-Order Spectra," IEEE Signal Processing, vol. 10, no. 3, 1993, pp. 10–37.
8. R.G. Andrzejak et al., "Indications of Nonlinear Deterministic and Finite Dimensional Structures in Time Series of Brain Electrical Activity: Dependence on Recording Region and Brain State," Physical Rev. E, vol. 64, no. 6, 2001, article no. 061907.
9. M.J. Hinich, "Testing for Gaussianity and Linearity of a Stationary Time Series," J. Time Series Analysis, vol. 3, no. 3, 1982, pp. 169–176.
10. A. Swami, C.M. Mendel, and C.L. Nikias, Higher-Order Spectral Analysis (HOSA) Toolbox, version 2.0.3, 2000; http://in.mathworks.com/matlabcentral/fileexchange/3013-hosa-higher-order-spectral-analysis-toolbox.
11. A. Bao et al., "Helping Mobile Apps Bootstrap with Fewer Users," Proc. 14th Int'l Conf. Ubiquitous Computing, 2012, pp. 1–10.
12. H. Yan et al., "A Multilayer Perceptron-Based Medical Decision Support System for Heart Disease Diagnosis," Expert Systems with Applications, vol. 30, no. 2, 2006, pp. 272–281.
13. D.M. Bates and D.G. Watts, Nonlinear Regression: Iterative Estimation and Linear Approximations, John Wiley & Sons, 1988.
14. M. Koc and A. Barkana, "Application of Linear Regression Classification to Low-Dimensional Datasets," Neurocomputing, vol. 131, 2014, pp. 331–335.
15. P.J. Rousseeuw, "Least Median of Squares Regression," J. Am. Statistical Assoc., vol. 79, no. 388, 1984, pp. 871–880.
16. M. Hall et al., "The WEKA Data Mining Software: An Update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, 2009, pp. 10–18.
17. A. Subasi, "EEG Signal Classification Using Wavelet Feature Extraction and a Mixture of Expert Model," Expert Systems with Applications, vol. 32, 2007, pp. 1084–1093.
18. P. Baldi et al., "Assessing the Accuracy of Prediction Algorithms for Classification: An Overview," Bioinformatics, vol. 16, no. 5, 2001, pp. 412–424.
19. L.D. Iasemidis et al., "Adaptive Epileptic Seizure Prediction System," IEEE Trans. Biomedical Eng., vol. 50, no. 5, 2003, pp. 616–627.

Sanjay Sareen is a system manager at Guru Nanak Dev University, Amritsar, Punjab, India. His research interests include cloud computing, the Internet of Things, and data security. Sareen is pursuing a PhD in computer applications at I.K. Gujral Punjab Technical University, Kapurthala, Punjab, India. Contact him at [email protected].

Sandeep K. Sood is a professor at Guru Nanak Dev University Regional Campus, Gurdaspur, Punjab, India. His research interests include cloud computing, data security, and big data. Sood has a PhD in computer science and engineering from IIT Roorkee, India. Contact him at [email protected].

Sunil Kumar Gupta is an associate professor at Beant College of Engineering and Technology, Gurdaspur, Punjab, India. His research interests include cloud computing, mobile computing, and distributed systems. Gupta has a PhD in computer science from Kurukshetra University, Kurukshetra. Contact him at [email protected].

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.


HYBRID SYSTEMS

The Feasibility of Amazon's Cloud Computing Platform for Parallel, GPU-Accelerated, Multiphase-Flow Simulations

Cole Freniere, Ashish Pathak, Mehdi Raessi, and Gaurav Khanna | University of Massachusetts Dartmouth

Amazon's Elastic Compute Cloud (EC2) service could be an alternative computational resource for running MPI-parallel, GPU-accelerated, multiphase-flow simulations. The EC2 service is competitive with a benchmark cluster in a certain range of simulations, but there are some performance limitations, particularly in the GPU and cluster network connection.

Since the 1980s, the US National Science Foundation (NSF) has funded supercomputers for use by scientific researchers and engineers, but continuing this practice today involves many challenges. An interim NSF report published in 2014 made it clear that the high cost of high-end facilities and shrinking NSF resources are compounded by the fact that the computing needs of scientists and engineers are becoming more diverse.1 For example, data analytics is a rapidly growing field that brings with it completely different computing requirements than conventional scientific and engineering simulations. For optimal application performance, a certain system structure is desired, and different disciplines tend to have different optimal systems. Some applications, for example, are shifting from conventional CPUs to heterogeneous parallelized architectures that include GPUs. Cloud computing could be a potential solution to meet these expanding computing needs. Although cloud computing services could be transformative for some fields, there's a high level of uncertainty about the cost tradeoffs, and the options must be evaluated carefully.1

A case in support of cloud computing is that if it's used as an alternative to constructing and maintaining a local high-performance computing cluster (HPCC), it would relieve institutions and companies from the drudgery and cost of building and maintaining their own local HPCCs. Instead, they can simply set up an account and run an application instantly for a fee, with no financial or installation maintenance overhead. Moreover, a cloud service can offer great flexibility: HPC users can outsource their lower-profile jobs to cloud servers and reserve the most critical ones for their local clusters. An additional benefit to using cloud computing is that various machine configurations can be expeditiously tested and explored for benchmarking purposes, which can lead to more appropriate decisions for those planning on building their own HPCC.

However, are the cloud services available today prepared to meet the needs of HPC applications? Is using the cloud a viable alternative to localized, conventional supercomputers? Amazon Web Services (AWS) is one of the most prevalent vendors in the cloud computing market,2 and its computing service, Amazon Elastic Compute Cloud (EC2), offers a variety of virtual computers (www.ec2instances.info). In recent years, several new services tailored toward HPC applications have been released, so AWS seems to be an appropriate provider with which to evaluate whether cloud computing is ready for HPC applications. The first work that evaluated Amazon's EC2 service for an HPC application ran coupled atmosphere-ocean climate models and performed standard benchmark tests.3


That work highlighted that the performance was significantly worse in the cloud than at dedicated supercomputer centers and was only competitive with low-cost cluster systems. The poor performance occurred because latencies and bandwidths were inferior to those of dedicated centers; the authors recommended that the interconnect network be upgraded to systems such as Myrinet or InfiniBand to be desirable for HPC use. Peter Zaspel and Michael Griebel4 evaluated AWS for their heterogeneous CPU-GPU parallel two-phase flow solver, similar to the solver we present in this article. This work concluded that the cloud was well prepared for moderately sized computational fluid dynamics (CFD) problems of up to 64 cores or 8 GPUs, and that it was a viable and cost-effective alternative to mid-sized parallel computing systems. However, if the cloud cluster was increased to more than eight nodes, network interconnect problems followed. In 2012, Piyush Mehrotra and coworkers5 of the NASA Ames Research Center compared Amazon's performance to their renowned Pleiades supercomputer. For single-node tests, AWS was highly competitive with Pleiades, but for large core counts, it was significantly slower because the Ethernet connection didn't compare well with Pleiades' InfiniBand network. The authors concluded that Amazon's computers aren't suitable for tightly coupled applications, where fast communication is paramount.

Many other studies conducted standard benchmark tests on AWS to compare it to a conventional HPCC and reached similar conclusions. Zach Hill and Marty Humphrey6 concluded that AWS's ease of use and low cost make it an attractive option for HPC, but not for tightly coupled applications. Keith Jackson and coworkers7 ran their own application in addition to standard benchmark tests and also concluded that AWS isn't suited for tightly coupled applications. Yan Zhai and coworkers8 included a variety of benchmark tests, application tests, and a highly detailed breakdown of the costs associated with the two alternatives, producing a more positive evaluation of AWS than most other studies, but with an admission that the cloud isn't ideal for codes that require many small messages between parallel processes. Aniruddha Marathe and coworkers9 ran benchmark tests and developed a pricing model to evaluate AWS as an alternative to a local cluster on a case-by-case basis but didn't use this model to present quantitative economic results. Overall, the general conclusions regarding cloud computing for HPC applications have evolved over time as the market has developed.

Our work is concerned with evaluating AWS for a GPU-accelerated, multiphase-flow solver, a 3D parallel code. Keeping in mind that Amazon's services are rapidly evolving and that new hardware options are constantly being added, the key question addressed is the following: Is outsourcing HPC workloads to the AWS cloud a viable alternative to using a local, purpose-built HPCC? This question is answered from the perspective of our own research group, and broader recommendations are made for other HPC users. Not surprisingly, the answer depends on many factors. We believe this work is the first to comprehensively test the g2.2xlarge GPU instance of AWS for multiphase-flow simulations; it's not limited to just standard benchmark tests.

Amazon Web Services

The user can manage cloud services with the AWS management console via a Web browser or the command line. AWS offers more than 40 different services, but the only ones necessary for our tests were the EC2 service for virtual computer rental and the Simple Storage Service (S3) for data storage. EC2 offers computers (known as instances) with a variety of hardware specifications: the most basic instance is a single-core CPU with 1 Gbyte of RAM, priced at US$0.013 per hour, and the most expensive instance consists of 32 cores with 104 Gbytes of RAM, priced at $6.82 per hour (www.ec2instances.info).

The user must select an Amazon Machine Image (AMI), which includes the operating system and software loaded onto the instance. Several default and community AMIs built by other customers are available. Because default AMIs are very bare-bones, the user may need to install several libraries and other software on the instance to run specific applications; we spent considerable time properly configuring the instance for our application. However, once the instance is set up to the user's liking, a new AMI can be saved from that machine and used as a template to easily create more instances in the future. This is a critical feature when building clusters of instances.
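As a rough illustration of this idea (not the exact workflow used in this study, which relied on StarCluster as described below), the following boto3 sketch saves an AMI from a configured instance and then launches several identical instances from it; the instance ID, image name, key pair, region, and group name are placeholders.

# Illustrative boto3 sketch: snapshot a configured instance as an AMI, then launch
# a batch of identical g2.2xlarge instances from it. All identifiers are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Save the fully configured machine as a reusable Amazon Machine Image (AMI).
image = ec2.create_image(InstanceId="i-0123456789abcdef0",   # placeholder instance ID
                         Name="gpu-flow-solver-template")
ami_id = image["ImageId"]
ec2.get_waiter("image_available").wait(ImageIds=[ami_id])     # wait until the AMI is ready

# A cluster placement group keeps the instances in the same facility (see the next section).
ec2.create_placement_group(GroupName="flow-solver-pg", Strategy="cluster")

# Launch eight identical instances from the saved AMI into the placement group.
ec2.run_instances(ImageId=ami_id,
                  InstanceType="g2.2xlarge",
                  MinCount=8, MaxCount=8,
                  KeyName="my-keypair",                        # placeholder key pair
                  Placement={"GroupName": "flow-solver-pg"})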

Building a Cluster in the Cloud

For instances to communicate over the same network, they must be launched into the same placement group, which ensures that the requested machines are physically located in the same computing facility. Three notable tools are available:

■ Cloud Formation Cluster (CfnCluster) is offered by AWS for cluster creation, but it's currently limited in its configuration options. Only one default AMI is available for all instance types, and because all custom AMIs must first be constructed off the default AMI, this posed limitations for us.

■ StarCluster was developed by MIT and enables easy configuration and management of clusters. In contrast to CfnCluster, many default AMIs are available for various instances. However, some newer instances aren't supported on StarCluster. We used StarCluster to set up our clusters and found it easy to use.

■ CloudFlu was developed specifically for the CFD program OpenFOAM to provide ease of use to scientists and engineers who are new to the cloud computing environment.

Hardware Specifications of the Benchmark and Amazon Clusters

The University of Massachusetts Dartmouth (UMD) HPCC is the benchmark cluster for this study. Each of its nodes contains two Intel Xeon (quad-core) E5620 2.4-GHz processors, 24 Gbytes of DDR3 ECC 1333-MHz RAM, one Nvidia Tesla (Fermi) M2050 GPU with 3 Gbytes of memory, and an InfiniBand network connection. The AWS instance that was most appropriate for our applications was the g2.2xlarge instance. It has eight high-frequency Intel Xeon E5-2670 (Sandy Bridge) 2.6-GHz processors, 15 Gbytes of RAM, and one Nvidia GRID K520 GPU with 1,536 CUDA cores and 4 Gbytes of memory. The Tesla GPU was purpose-built for scientific computing, while the GRID GPU is marketed for high-performance gaming. No numbers for the network connection speed of this instance type are published, but it's claimed to have high networking performance in EC2 listings (www.ec2instances.info). Other instances advertise a 10-Gbit/s Ethernet connection, such as the g2.8xlarge instance, which is identical to the g2.2xlarge instance except that it has four times as many GPUs, cores, and RAM. However, it was introduced during the time of this study, and building a cluster from it proved to be a large obstacle that wasn't overcome because of the lack of support from StarCluster. CfnCluster successfully launched a cluster of these instances, but the default AMI was too restrictive, and configuring the necessary libraries over the default AMI proved very difficult. Another instance, called cg1.4xlarge, is a cluster GPU instance, but it was first offered in 2010, is now considered a previous-generation instance, and wasn't deemed desirable for our needs.

Multiphase-Flow Solver

The multiphase-flow solver simulates a two-fluid flow interacting with moving rigid solid bodies.10,11 The two fluids are incompressible, immiscible, and Newtonian. The two-step projection method12 is implemented to solve the flow equations. The solution procedure includes a pressure Poisson problem that is solved iteratively at each time step using a Jacobi-preconditioned conjugate gradient method. The pressure solution is the bottleneck of the overall algorithm, taking 60 to 90 percent of the total execution time. To remove this bottleneck, Stephen Codyer and coworkers13 ported the pressure solution to GPUs using MPI parallelism. The pressure solver requires communication between CPUs and GPUs, which is done through the Peripheral Component Interconnect Express (PCIe) bus, peaking at 4 Gbytes/s on the benchmark cluster. Additionally, at the end of each iteration, the pressure solution for the MPI ghost cells is transferred to the neighboring MPI subdomains, constituting the CPU-CPU communication. Consequently, the flow solver's compute time depends heavily on GPU speed, communication time across the GPU device and the CPU, and communication time across different CPUs. We evaluated both the CPU-GPU and CPU-CPU communication times. The benchmarked problem was a freely falling, rigid solid wedge that's released in air and eventually impacts a water free surface.10
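To make the pressure-solution step concrete, here is a minimal serial Python sketch of a Jacobi-preconditioned conjugate gradient iteration on a small symmetric positive-definite system; it only illustrates the algorithm and is not the GPU-accelerated, MPI-parallel implementation used in the solver.

# Minimal serial sketch of a Jacobi-preconditioned conjugate gradient (PCG) solve.
import numpy as np

def jacobi_pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A, preconditioned by M = diag(A)."""
    x = np.zeros_like(b)
    minv = 1.0 / np.diag(A)          # inverse of the Jacobi (diagonal) preconditioner
    r = b - A @ x                    # initial residual
    z = minv * r                     # preconditioned residual
    p = z.copy()                     # initial search direction
    rz = r @ z
    for k in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, k
        z = minv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p    # update the search direction
        rz = rz_new
    return x, max_iter

# Small stand-in for a pressure Poisson system (1D Laplacian-like SPD matrix).
A = np.diag(np.full(5, 4.0)) + np.diag(np.full(4, -1.0), 1) + np.diag(np.full(4, -1.0), -1)
b = np.ones(5)
x, iters = jacobi_pcg(A, b)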

MPI Communication Benchmarks

To determine the MPI communication performance of both clusters, we used the OSU micro-benchmark suite developed by Ohio State University (mvapich.cse.ohio-state.edu/benchmarks). We conducted point-to-point tests to study internode latency and bandwidth between any two random nodes in the two clusters. This is representative of the ghost-cell data transfer that occurs between MPI subdomains. We also conducted collective latency tests that utilize all nodes in a cluster. The results for these tests are presented in logarithmic scale in Figure 1.
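For readers without access to the OSU suite, the following mpi4py sketch shows the basic idea behind such a point-to-point test: a two-rank ping-pong that estimates latency and bandwidth for a given message size (run with, for example, mpirun -np 2 python pingpong.py). It is a simplified stand-in, not the benchmark used here.

# Minimal ping-pong between two MPI ranks; a simplified stand-in for osu_latency/osu_bw.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

msg_bytes = 1 << 20                          # 1-Mbyte message
buf = np.zeros(msg_bytes, dtype=np.uint8)
iters = 100

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(iters):
    if rank == 0:
        comm.Send([buf, MPI.BYTE], dest=1, tag=0)
        comm.Recv([buf, MPI.BYTE], source=1, tag=0)
    elif rank == 1:
        comm.Recv([buf, MPI.BYTE], source=0, tag=0)
        comm.Send([buf, MPI.BYTE], dest=0, tag=0)
elapsed = MPI.Wtime() - t0

if rank == 0:
    latency = elapsed / (2 * iters)          # one-way time per message, in seconds
    bandwidth = msg_bytes / latency / 1e6    # Mbytes/s
    print(f"{msg_bytes} bytes: latency {latency * 1e6:.1f} us, bandwidth {bandwidth:.1f} MB/s")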

Point-to-point latency tests. Referring to Figure 1a, it's apparent that the latencies are 10 to 40 times larger on AWS than on the benchmark cluster. For small message sizes, the latencies for the benchmark cluster and AWS are 2 and 85 μs, respectively. For applications requiring frequent communication of small messages, a factor-of-40 deficit for AWS can drastically affect performance. However, as message size increases, the disparity isn't quite as large: the latency on AWS is a factor of 10 higher than on the benchmark cluster.


Point-to-point bandwidth tests. Figure 1b shows the communication bandwidth between two nodes. The maximum sustained bandwidth rates for AWS and the benchmark cluster are 984 Mbytes/s and 25.6 Gbytes/s, respectively. The bandwidth is 15 to 25 times lower on Amazon, illustrating the difference between the Ethernet connection in the AWS placement group and the InfiniBand connection on the benchmark cluster. Contrary to the latency tests, the bandwidth tests show that AWS suffers more at larger message sizes.

Collective latency tests. The graphs for the collective latency tests aren't presented in this article for brevity; the results are similar to the point-to-point tests. For an eight-node cluster on Amazon, the collective test MPI_alltoall approaches latencies of 700 μs, while on the benchmark cluster, it's 80 μs. Such large latencies drastically slow down the flow solver when quantities across multiple processes are collected and summed.

Connection Speed over the Internet from a Local Machine to Instances

For our purposes, it was convenient to simply secure copy (scp) the data directly from Amazon's virtual machines to ours, rather than using S3. The bandwidth fluctuated between 1 and 7 Mbytes/s, which is a reasonable connection. There could be some cases in which the data must persist past the lifetime of the instance, for example, if the output data can't be copied to a local server as quickly as the application produces it.

Performance of MPI-Parallel GPU-Accelerated Code

We tested the flow solver's performance on AWS by simulating a rigid, solid wedge free-falling through air and impacting a water surface.10 The time spent in communication between devices is termed communication overhead, and in the context of weak and strong scaling, we determined it for both CPU-GPU and CPU-CPU communication for various cluster sizes. Typically, when the flow solver is running on a conventional HPCC, about 10 to 25 percent of the execution time is spent just transferring data from the CPU to the GPU, and 5 to 10 percent is spent transferring data between CPUs through MPI-parallel calls. Thus, any decrease in communication speed on AWS can have a significant impact on overall execution time.

GPU Performance

GPU speed is of great importance and drastically affects execution time. The GPU on the g2.2xlarge instance was found to be about 25 percent slower than the benchmark cluster's GPU. This impediment plays a large role in the results for overall AWS performance.

Strong Scaling

The simulation tested for strong scaling required nearly all of the 15 Gbytes of memory offered by a single g2.2xlarge instance. As Figure 2 shows, the AWS cluster is 25 to 40 percent slower than the benchmark cluster. Note that the speedup is reported relative to one node on the benchmark cluster on a logarithmic scale. For low node counts, the AWS cluster is competitive with the benchmark cluster and is merely 25 percent slower than the benchmark. However, as the node count increases, AWS doesn't fare as well: the performance is 30 percent slower than the UMD HPCC for clusters with two nodes or more. The solver's general behavior for strong scaling is as follows: increasing the number of processes for a fixed problem size means less memory per process, that is, the number of cells per process decreases. Memory transfer between processes is directly related to the number of cells per process, and communication time between processes is proportional to the memory that must be transferred. Hence, strong scaling has the advantage of reducing the workload and communication time per process, but it has the disadvantage of requiring a large network size.
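The scaling bookkeeping used in Figures 2 and 5 reduces to simple ratios of wall-clock times; the short sketch below computes them, with timings that are made up purely to show the arithmetic.

# Strong-scaling speedup and parallel efficiency relative to one benchmark node.
t1_benchmark = 400.0                                      # hypothetical one-node wall time (s)
strong_times = {1: 400.0, 2: 230.0, 4: 130.0, 8: 80.0}    # hypothetical N-node wall times (s)

for n, t_n in sorted(strong_times.items()):
    speedup = t1_benchmark / t_n                          # ordinate of Figure 2
    efficiency = speedup / n                              # fraction of ideal linear scaling
    print(f"{n} nodes: speedup {speedup:.2f}, efficiency {efficiency:.2f}")

# Weak scaling: the problem grows with N, so the ideal ratio t1_benchmark / t_n stays near 1.
weak_times = {1: 400.0, 2: 430.0, 4: 470.0, 8: 520.0}     # hypothetical wall times (s)
for n, t_n in sorted(weak_times.items()):
    print(f"{n} nodes: weak-scaling ratio {t1_benchmark / t_n:.2f}")   # ordinate of Figure 5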

Figure 1. The average (a) point-to-point latency (μs) and (b) point-to-point bandwidth (Mbytes/s) versus message size (bytes) between two nodes on each cluster (the AWS cluster and the UMD HPCC benchmark cluster). Note that the horizontal axis is logarithmic base 2 and the vertical axis is logarithmic base 10.



CPU-GPU communication. A single node consists of one CPU (eight processors) and one GPU card. All the pressure field data for the eight CPU processes are transferred between the CPU and GPU twice during each iteration of the pressure solver. The pressure solver was set to iterate 1,000 times, and the communication time was determined by modifying the code to either allow communication between the CPU and GPU or not at all, which isolated the CPU-GPU communication time. Figure 3 shows the results for communication time in logarithmic scale as the cluster is scaled up. Recognizing that scaling up the cluster decreases the number of cells per process, the CPU-GPU communication time decreases accordingly. This behavior is observed for both the benchmark cluster and AWS, implying that network performance from the CPU across the PCIe bus to the GPU is highly competitive between the two clusters.

CPU-CPU communication. MPI communication constitutes the CPU-CPU communication in the solver. The Ohio State MPI benchmarks presented earlier are representative of this CPU-CPU communication. Using a similar procedure as in the CPU-GPU communication evaluation, we calculated the communication overhead: the pressure solver was allowed to iterate 1,000 times, both with and without CPU-CPU communication. One layer of ghost cells is necessary for each shared boundary, so as the number of subdomains increases, the number of ghost cells increases disproportionally. This is one of the limitations of domain decomposition and leads to diminishing returns for each node added. Figure 4 shows the results for these tests. For the benchmark cluster, CPU-CPU communication starts off at 2.7 seconds and drops to less than 1 second very consistently for all subsequent cluster sizes. On the other hand, on AWS, the overhead starts off lower than the benchmark, at 1.3 seconds, but when a second node is added, it steps up dramatically to 4.6 seconds. It's interesting to note that the AWS communication time increases with the addition of a second node, whereas the UMD HPCC communication time decreases. The key difference is that the addition of a second node on the AWS cluster requires the use of the Ethernet network, which negatively impacts performance. Another shortfall of AWS is that the performance of its Ethernet network is highly variable, which is visible in the error bars in Figure 4. Even though the CPU-CPU communication time on AWS is higher than on the benchmark cluster, the difference isn't significant for this particular application because the time spent in communication is relatively small compared to the total execution time.
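A quick way to see the diminishing returns mentioned above is to count interior cells versus ghost cells for a simple one-dimensional (slab) decomposition of a fixed grid; the grid size below is hypothetical and only illustrates the bookkeeping.

# Ghost-cell overhead for a slab decomposition of a fixed nx*ny*nz grid into k subdomains.
nx = ny = nz = 240                                   # hypothetical fixed global grid

for k in (1, 2, 4, 8):
    cells_per_proc = nx * ny * nz // k               # interior cells each process owns
    shared_faces = 2 * (k - 1)                       # each internal boundary is shared by 2 subdomains
    ghost_cells_total = shared_faces * ny * nz       # one ghost layer per shared face
    ratio = ghost_cells_total / (nx * ny * nz)
    print(f"{k} subdomains: {cells_per_proc} cells/proc, "
          f"{ghost_cells_total} ghost cells total ({100 * ratio:.1f}% of the grid)")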

Weak Scaling

Figure 5 shows the results of the weak scaling tests. Note that the scaling is presented relative to one UMD HPCC node. The AWS cluster is 25 to 45 percent slower overall than the benchmark cluster. As the number of nodes increases, AWS becomes progressively slower than the UMD HPCC. For example, AWS is 25 percent slower than the UMD HPCC for single-node test cases, but for high node counts, it becomes 45 percent slower. The 25 percent deficit for AWS for one node arises because its GPU is inherently less powerful than the benchmark cluster's. However, the increased deficit with large cluster size is due to the poor network communication of AWS relative to the benchmark.

Figure 2. Strong scaling speedup for the benchmark cluster at the University of Massachusetts Dartmouth (UMD) and the AWS cluster: speedup relative to one UMD HPCC node versus number of nodes (8 cores and 1 GPU per node). The vertical axis is logarithmic base 2.

Figure 3. Strong scaling CPU-GPU communication overhead time in seconds after 1,000 iterations of the pressure solver, versus number of nodes (8 cores and 1 GPU per node), for the AWS cluster and the UMD HPCC benchmark cluster. Error bars indicate the maximum and minimum data points, and plotted points are averages. The vertical axis is logarithmic base 2.



CPU-GPU communication. We used the same CPU-GPU communication tests for weak scaling that we used for strong scaling. Figure 6 shows the results for CPU-GPU communication time as a function of cluster size. Similar to strong scaling, in weak scaling, Amazon's instances are highly competitive with the benchmark cluster, although they're significantly less consistent. Note that the number of grid points per subdomain remains constant. At each iteration of the pressure solver, the pressure field information is transferred to the GPU device, and the amount of data transferred is proportional to the total number of grid points in the subdomain. These two facts imply that ideally the communication time to the GPU would remain constant because the same amount of data is being transferred in all tests. This inference is accurate for the benchmark cluster: the CPU-GPU communication overhead is consistently around 8 seconds, which is about 15 percent of the total execution time. AWS shows similar behavior, although less consistently. We have no explanation for why CPU-GPU communication increases at seven and eight nodes for AWS.

CPU-CPU communication. For weak scaling, the amount of data exchanged between parallel processes is the same regardless of cluster size, so theoretically the only variable from one run to the next is the overhead from increasing the total number of processes. When comparing the two clusters (see Figure 7), drastically different behavior is observed for the CPU-CPU communication overhead. For weak scaling, it's at 2 seconds or less on the benchmark cluster, and it remains relatively constant. On AWS, it increases from 1 second for the single-node case all the way up to 8 seconds for the eight-node cluster. Note the variability in AWS performance, represented by the error bars in Figure 7. As previously stated, the Ethernet network on AWS is much slower than the InfiniBand network on the local cluster. However, the time spent in communication between CPUs is still relatively small compared to the time spent in CPU-GPU communication and general computations, so the slow network connection doesn't pose as much of a problem as it did in previous studies.

Figure 7. Weak scaling CPU-CPU communication time in seconds after 1,000 iterations of the pressure solver, versus number of nodes (8 cores and 1 GPU per node), for the AWS cluster and the UMD HPCC benchmark cluster. Error bars indicate the maximum and minimum data points, and plotted points are averages.

Figure 4. Strong scaling CPU-CPU communication overhead time in seconds after 1,000 iterations of the pressure solver, versus number of nodes (8 cores and 1 GPU per node), for the AWS cluster and the UMD HPCC benchmark cluster. Error bars indicate the maximum and minimum data points, and plotted points are averages. The vertical axis is linear, not logarithmic.

Figure 5. Weak scaling performance for the benchmark cluster at UMD and the AWS cluster versus number of nodes (8 cores and 1 GPU per node). Note that the performance is reported relative to one UMD HPCC node: the ordinate is t1,benchmark/tN, where t1,benchmark is the computation time for one node on the benchmark cluster and tN is the time taken for a cluster of N nodes.

Figure 6. Weak scaling CPU-GPU communication time in seconds after 1,000 iterations of the pressure solver, versus number of nodes (8 cores and 1 GPU per node), for the AWS cluster and the UMD HPCC benchmark cluster. Error bars indicate the maximum and minimum data points, and plotted points are averages.



Cost Analysis

Cost is an important factor in our evaluation of AWS as an alternative to a local, conventional HPCC. Comparing the two alternatives on an hourly or total cost basis doesn't lead to an immediately obvious conclusion because many variables can affect the outcome. We used our local cluster for the cost analysis, and the results can be considered a case study. When building a local HPCC, the upfront cost is very large, but the investment is relatively long term because the cluster can last several years. In addition to the electricity cost, researchers using a local cluster might need to support IT professionals for maintenance services on the cluster. These additional costs over the cluster's lifetime can become significant compared to the cluster's upfront cost. Therefore, we make the cost analysis and comparison with AWS both with and without these additional costs in the following sections.

Purchasing cloud services is a fundamentally different approach to doing business. No maintenance or installation is required, and the upfront cost can be eliminated entirely by using an "on-demand" payment method that charges the user by rounding up to the nearest hour of usage time. Customers can commit to a certain number of reserved hours and pay an upfront cost that reduces the total cost compared to the on-demand payment method. Pricing on AWS depends mainly on the following factors:

■ compute time, which is the most expensive factor and depends on the instance type as well as the usage tier (on-demand or reserved);
■ number of nodes;
■ amount and duration of data storage in the cloud; and
■ amount of data that is transferred from AWS to the Internet.

AWS offers three usage tiers: on-demand, one-year reserved, and three-year reserved. Note that a reserved instance will use the same physical machine for the reservation period. Table 1 presents a cost comparison between the benchmark local cluster and AWS for each usage tier. The benchmark local cluster is considered with and without the additional costs associated with electricity and maintenance by IT professionals. In this analysis, we approximated the electricity cost per node as $3,200 for a period of five years. We also approximated that 30 percent of a full-time IT professional's time is spent on local cluster maintenance, which results in $1,600 in maintenance cost per node for a five-year period.
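The "equivalent cost per node-hour" entries in Table 1 follow from amortizing each total (per-node) cost over its useful life; the short calculation below reproduces them to within rounding.

# Amortized cost per node-hour, reproducing the last column of Table 1 to within rounding.
HOURS_PER_YEAR = 8760

options = {
    "local, no electricity/maintenance": (6000.0, 5 * HOURS_PER_YEAR),
    "local, with electricity/maintenance": (10800.0, 5 * HOURS_PER_YEAR),
    "AWS 1-year reservation": (3478.0, 1 * HOURS_PER_YEAR),
    "AWS 3-year reservation": (7410.0, 3 * HOURS_PER_YEAR),
}

for label, (total_cost, lifetime_hours) in options.items():
    print(f"{label}: ${total_cost / lifetime_hours:.3f} per node-hour")
# The on-demand tier is simply AWS's hourly price of $0.65 per node-hour.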

Integration of Performance with Cost

Next, we narrow our price analysis down to the price per unit of useful computational work; in other words, how many simulations can be completed for a given cost? This type of analysis, admittedly, could be highly variable: it depends on the cluster size and the simulation, as well as the cluster's hardware specifications. For the majority of the test cases shown in Figures 2 and 5, AWS was about 40 percent slower than the benchmark cluster. Consequently, simulations require roughly 40 percent more time to complete on AWS than on the benchmark cluster. To account for this, a weighting factor is applied to the results in Table 1, resulting in the "weighted cost per unit of work" shown in Table 2.

Breakdown of Total Cost

The total cost associated with running the test case simulation on AWS can be modeled by Equation 1. The cost of data storage is $0.03/(Gbyte-month) for both EC2 block storage and S3, while the cost of data transfer is $0.09/Gbyte:

Table 1. Cost comparison between the benchmark local cluster (first two rows) and the AWS cloud HPC with on-demand, one-year, and three-year reservations. The benchmark is considered with and without electricity and maintenance costs.

Option                               Total cost   Equivalent cost per node-hour   Useful life
Without electricity or maintenance   $6,000       $0.137                          5 years
With electricity and maintenance     $10,800      $0.247                          5 years
On-demand                            N/A          $0.65                           Hourly
1-year reservation                   $3,478       $0.40                           1 year
3-year reservation                   $7,410       $0.282                          3 years


Cost = (EC2) + (data storage) + (data transfer)
     = (p × t1 × n) + (4.175 × 10⁻⁵ × t2 × x1) + (0.09 × x2),  (1)

where p is the price of the instance ($/node-hour), t1 is the compute time (hours), n is the number of nodes, 4.175 × 10⁻⁵ is the price of data storage ($/Gbyte-hour), t2 is the duration of data storage (hours), x1 is the amount of data stored (Gbytes), 0.09 is the price of data transferred to the Internet ($/Gbyte), and x2 is the amount of data transferred (Gbytes).
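A direct transcription of Equation 1 makes the cost drivers easy to explore; the sketch below evaluates it for the sample calculation that follows (four g2.2xlarge nodes, 106 hours of on-demand compute, and 16 Gbytes of data assumed to be stored for roughly the length of the run and then transferred out).

# Total AWS cost from Equation 1: compute (EC2) + data storage + data transfer.
def aws_cost(p, t1, n, t2, x1, x2):
    """p: $/node-hour; t1: compute hours; n: nodes; t2: storage hours;
    x1: Gbytes stored; x2: Gbytes transferred to the Internet."""
    ec2 = p * t1 * n
    storage = 4.175e-5 * t2 * x1      # $0.03/(Gbyte-month) expressed per Gbyte-hour
    transfer = 0.09 * x2
    return ec2, storage, transfer

ec2, storage, transfer = aws_cost(p=0.65, t1=106, n=4, t2=106, x1=16, x2=16)
print(f"EC2: ${ec2:.2f}, storage: ${storage:.2f}, transfer: ${transfer:.2f}, "
      f"total: ${ec2 + storage + transfer:.2f}")
# Gives roughly $275.60 + $0.07 + $1.44, in line with the sample calculation below.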

Sample Calculation

The computational domain in the test case studied here consisted of 36 million grid points, which required 60 Gbytes of RAM distributed across four nodes. The simulation time was 106 hours on AWS and 71 hours on the benchmark cluster. On AWS, 16 Gbytes of data were stored and transferred from the on-demand instances, which translates into $275 for EC2, $0.08 for data storage, and $1.44 for data transfer.

Clearly, EC2 compute time is by far the largest contributor. On the benchmark cluster, the simulation cost is $39 when the electricity and maintenance costs are neglected and $70 when they're included. In both cases, running the simulation on the local cluster costs less than on AWS.

Consideration of Percent Utilization

In some cases, a local cluster might not be fully utilized at all times; that is, some nodes might be idle for an extended period. The share of nodes utilized on a cluster can be represented by a percent utilization quantity; utilization below 100 percent means that some nodes are paid for but aren't completing useful work. This increases the weighted cost per unit of work, which can be quantified by the percentage of a local cluster's utilization. If the local cluster's percent utilization is below a critical value, then using AWS is more cost-effective. Table 3 presents this critical value for the various AWS pricing options. For example, with the electricity and maintenance costs included, if the local cluster is utilized 27 percent or less, then the AWS on-demand option is more cost-effective. If the utilization of a local cluster is expected to be low, users could pool the resource with other local computational research groups, effectively subsidizing the cost and raising the utilization. It's important to note that reserved instances are less likely to achieve 100 percent utilization than on-demand instances, but the calculations for reserved instances are included with 100 percent utilization for consistency.
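One way to arrive at the critical values in Table 3 is to take the ratio of the local cluster's cost per node-hour (Table 1) to the AWS weighted unit cost (Table 2); the check below reproduces the table to within rounding, assuming that is indeed how the thresholds were computed.

# Break-even local-cluster utilization below which each AWS tier becomes cheaper.
local_unit_cost = {
    "without electricity/maintenance": 0.137,   # $/node-hour, Table 1
    "with electricity/maintenance": 0.247,
}
aws_weighted_unit_cost = {
    "on-demand": 0.91,                          # $/node-hour of useful work, Table 2
    "1-year reservation": 0.64,
    "3-year reservation": 0.37,
}

for local_label, c_local in local_unit_cost.items():
    for tier, c_aws in aws_weighted_unit_cost.items():
        breakeven = 100.0 * c_local / c_aws     # percent utilization at equal cost per unit of work
        print(f"{local_label:35s} {tier:20s} {breakeven:4.0f}%")
# Matches Table 3: 15/21/37 percent and 27/39/67 percent.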

The percent utilization of our local cluster is much higher than the percentages shown in Table 3. Therefore, AWS isn't a cost-effective option compared to our local cluster. The only AWS option that becomes relatively competitive when the costs associated with electricity and maintenance are included is the three-year reserved instance.

The performance of our in-house, 3D, MPI-parallel, GPU-accelerated, multiphase-flow solver was assessed on both Amazon's Elastic Compute Cloud service and the local HPC cluster at the University of Massachusetts Dartmouth, which is considered the benchmark.

Table 3. Utilization evaluation.

                                                          On-demand   1-yr. reservation   3-yr. reservation
Percent utilization without electricity or maintenance   15          21                  37
Percent utilization with electricity and maintenance     27          39                  67

Table 2. Weighted cost per unit of work summary.

Option                               Weighted total cost   Weighted unit cost   Useful life
Without electricity or maintenance   $6,000                $0.137               5 years
With electricity and maintenance     $10,800               $0.247               5 years
On-demand                            N/A                   $0.91                Hourly
1-year reservation                   $4,850                $0.64                1 year
3-year reservation                   $10,300               $0.37                3 years


For the type of application that we tested, AWS (the g2.2xlarge instance) isn't fully recommended as an alternative to a local HPCC; specifically, we found that the g2.2xlarge instance isn't optimized for GPU-accelerated simulations. In fact, the GPU offered by the Amazon instance is a gaming GPU, which exhibited slower performance than the Tesla GPU on the benchmark cluster. If the GPU offered by Amazon cloud computing were more suitable for HPC, the results would improve. Additionally, the interconnect for the Amazon instance is an Ethernet connection with about 1 Gbyte/s of bandwidth, which is about 25 times slower than the InfiniBand connection used on the benchmark cluster. The Amazon cloud cluster's performance is also highly variable, particularly in the MPI communication across the nodes, which is a serious issue for heavy HPC workloads like ours that require a heterogeneous CPU-GPU framework and frequent communication. However, our results show that the slow cluster network connection doesn't hinder performance as much as previous studies suggest. Nevertheless, these impediments result in simulations that can take 40 percent longer than on the benchmark cluster. From a cost viewpoint, the only AWS option that comes close to our local cluster when the costs associated with electricity and maintenance are included is the three-year reserved instance. All other AWS options are significantly more expensive than the local cluster.

It should be noted that performance on cloud clusters can vary considerably depending on application and hardware requirements. Members of the HPC community are encouraged to test their own applications on cloud computing services such as AWS. New instances that could allow HPC users to switch to more powerful instance types are frequently released on cloud computing services. An additional benefit is that cloud computing can be useful for companies or consultants who need quick access to medium-sized GPU clusters like the ones we tested, but as things currently stand, cloud computing probably wouldn't be suitable for researchers and scientists who continuously need to run large-scale simulations for long periods of time. If an HPC user's hardware needs are relatively simple, for instance, if the user doesn't require GPUs or parallel processing, the cloud becomes more appealing. Finally, for those who are planning on building a local HPCC, cloud computing services can be useful for testing various machine configurations for benchmarking purposes, which can lead to more effective decisions concerning future hardware investments.

Acknowledgments

We gratefully acknowledge support from the US National Science Foundation grants CBET-1236462, PHY-1303724, and PHY-1414440, and US Air Force support 10-RI-CRADA-09. We're also grateful to the University of Massachusetts Dartmouth Office of Undergraduate Research for funding this project.

References

1. Nat'l Research Council, Future Directions for NSF Advanced Computing Infrastructure to Support US Science and Engineering in 2017–2020: Interim Report, Nat'l Academies Press, 2014.
2. D. Eadline, "Moving HPC to the Cloud," Admin Magazine, 2015; www.admin-magazine.com/HPC/Articles/Moving-HPC-to-the-Cloud.
3. C. Evangelinos and C.N. Hill, "Cloud Computing for Parallel Scientific HPC Applications: Feasibility of Running Coupled Atmosphere-Ocean Climate Models on Amazon's EC2," Proc. 1st Workshop Cloud Computing and Its Applications, 2008.
4. P. Zaspel and M. Griebel, "Massively Parallel Fluid Simulations on Amazon's HPC Cloud," Proc. 1st Int'l Symp. Network Cloud Computing and Applications, 2011, pp. 73–78.
5. P. Mehrotra et al., "Performance Evaluation of Amazon EC2 for NASA HPC Applications," Proc. 3rd Workshop Scientific Cloud Computing, 2012, pp. 41–50.
6. Z. Hill and M. Humphrey, "A Quantitative Analysis of High Performance Computing with Amazon's EC2 Infrastructure: The Death of the Local Cluster?," Proc. 10th IEEE/ACM Int'l Conf. Grid Computing, 2009, pp. 26–33.
7. K. Jackson et al., "Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud," Proc. IEEE 2nd Int'l Conf. Cloud Computing Technology and Science, 2010, pp. 159–168.
8. Y. Zhai et al., "Cloud versus In-House Cluster: Evaluating Amazon Cluster Compute Instances for Running MPI Applications," Proc. Int'l Conf. High Performance Computing, Networking, Storage and Analysis, 2011, pp. 1–10.
9. A. Marathe et al., "A Comparative Study of High-Performance Computing on the Cloud," Proc. ACM Symp. High-Performance Parallel and Distributed Computing, 2013, pp. 239–250.
10. A. Pathak and M. Raessi, "A 3D, Fully Eulerian, VOF-Based Solver to Study the Interaction between Two Fluids and Moving Rigid Bodies Using the Fictitious Domain Method," J. Computational Physics, vol. 311, 2016, pp. 87–113.


11. A. Pathak and M. Raessi, "A Three-Dimensional Volume-of-Fluid Method for Reconstructing and Advecting Three-Material Interfaces Forming Contact Lines," J. Computational Physics, vol. 307, 2016, pp. 550–573.
12. A.J. Chorin, "Numerical Solution of the Navier-Stokes Equations," Mathematics of Computation, vol. 22, 1968, pp. 745–762.
13. S. Codyer, M. Raessi, and G. Khanna, "Using Graphics Processing Units to Accelerate Numerical Simulations of Interfacial Incompressible Flows," Proc. ASME Fluid Engineering Conf., 2012, pp. 625–634.

Cole Freniere is pursuing an MS in mechanical engineering at the University of Massachusetts Dartmouth. His research interests include renewable energy, fluid dynamics, and HPC. Specifically, he's interested in the application of advanced computational simulations to aid in the design of ocean wave energy converters. Contact him at [email protected].

Ashish Pathak is a PhD candidate in the Engineering and Applied Science program at the University of Massachusetts Dartmouth. His research interests include multiphase flows and their interaction with moving rigid bodies. Contact him at [email protected].

Mehdi Raessi (corresponding author) is an assistant professor in the Mechanical Engineering Department at the University of Massachusetts Dartmouth. His research interests include computational simulations of multiphase flows with applications in energy systems (renewable and conventional), material processing, and microscale transport phenomena. Raessi has a PhD in mechanical engineering from the University of Toronto. Contact him at [email protected].

Gaurav Khanna is an associate professor in the Physics Department at the University of Massachusetts Dartmouth. His primary research project is related to the coalescence of binary black hole systems using perturbation theory and estimation of the properties of the emitted gravitational radiation. Khanna has a PhD in physics from Penn State University. He's a member of the American Physical Society. Contact him at [email protected].

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.


COMPUTER SIMULATIONS

Massive Computation for Understanding Core-Collapse Supernova Explosions

Christian D. Ott | Caltech

Core-collapse supernova explosions come from stars more massive than 8 to 10 times the mass of the sun. Ten core-collapse supernovae explode per second in the universe; in fact, automated astronomical surveys discover multiple events per night, and one or two explode per century in the Milky Way. Core-collapse supernovae outshine entire galaxies in photons for weeks and output more power in neutrinos than the combined light output of all other stars in the universe, for tens of seconds. These explosions pollute the interstellar medium with the ashes of thermonuclear fusion. From these elements, planets form and life is made. Supernova shock waves stir the interstellar gas, trigger or shut off the formation of new stars, and eject hot gas from galaxies. At their centers, a strongly gravitating compact remnant, a neutron star or a black hole, is formed.

As the name alludes, the explosion is preceded by the collapse of a stellar core. At the end of its life, a massive star has a core composed mostly of iron-group nuclei. The core is surrounded by an onion-skin structure of shells dominated by successively lighter elements. Nuclear fusion is still ongoing in the shells, but the iron core is inert. The electrons in the core are relativistic and degenerate. They provide the lion's share of the pressure support stabilizing the core against gravitational collapse. In this, the iron core is very similar to a white dwarf star, the end product of low-mass stellar evolution. Once the iron core exceeds its maximum mass (the so-called effective Chandrasekhar mass of approximately 1.5 to 2 solar masses [M⦿]), gravitational instability sets in. Within a few tenths of a second, the inner core collapses from a central density of approximately 10¹⁰ g cm⁻³ to a density comparable to that in an atomic nucleus (approximately 2.7 × 10¹⁴ g cm⁻³).

Editors: Barry I. Schneider, [email protected] | Gabriel A. Wainer, [email protected]

qqM

Mq

qM

MqM

THE WORLD’S NEWSSTAND®

Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page

qqM

Mq

qM

MqM

THE WORLD’S NEWSSTAND®

Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page

___________________

Page 83: Contents | Zoom in | Zoom out For navigation instructions ...€¦ · Visualization Corner: Joao Comba, UFRGS, comba@inf.ufrgs.br, and ... IEEE Antennas & Propagation Society Liaison:

www.computer.org/cise 79

comparable to that in an atomic nucleus (approxi-mately 2.7 × 1014 g cm–3). There, the repulsive part of the nuclear force causes a stiffening of the equation of state (EOS; the pressure–density rela-tionship). The inner core first overshoots nuclear density, then rebounds (“bounces”) into the still collapsing outer core. The inner core then stabilizes and forms the inner regions of the newborn proto-neutron star. The hydrodynamic supernova shock is created at the interface of inner and outer cores. First, the shock moves outward dynamically. It then quickly loses energy by work done breaking up infalling iron-group nuclei into neutrons, protons, and alpha particles. The copious emission of neutri-nos from the hot (T 10 MeV 1011 K) gas fur-ther reduces energy and pressure behind the shock. The shock stalls and turns into an accretion shock: the ram pressure of accretion of the star’s outer core balances the pressure behind the shock.

The supernova mechanism must revive the stalled shock to drive a successful core-collapse supernova explosion. Depending on the structure of the progenitor star, this must occur within one to a few seconds of core bounce. Otherwise, continuing accretion pushes the protoneutron star over its maximum mass (approximately 2 to 3 M⦿), which results in the formation of a black hole and no supernova explosion. Figure 1 provides a schematic of the core-collapse supernova phenomenon and its outcomes.

If the shock is successfully revived, it must travel through the outer core and the stellar envelope before it breaks out of the star and creates the spectacular explosive display observed by astronomers on Earth. This could take more than a day for a red supergiant star (such as Betelgeuse, a 20 M⦿ star in the constellation Orion) or just tens of seconds for a star that has been stripped of its extended hydrogen-rich envelope by a strong stellar wind or mass exchange with a companion star in a binary system.

The photons observed by astronomers are emitted extremely far from the central regions, and they carry information on the overall energetics, the explosion geometry, and the products of the explosive nuclear burning triggered by the passing shock wave. They can, however, only provide weak constraints on the inner workings of the supernova. Direct observational information on the supernova mechanism can be gained only from neutrinos and gravitational waves that are emitted directly in the supernova core. Detailed computational models are required for gaining theoretical insight and for making predictions that can be contrasted with future neutrino and gravitational-wave observations from the next core-collapse supernova in the Milky Way.

Supernova Energetics and Mechanisms

Core-collapse supernovae are "gravity bombs." The energy reservoir from which any explosion mechanism must draw is the gravitational energy released in the collapse of the iron core to a neutron star: approximately 3 × 10^53 erg (3 × 10^46 J), a mass-energy equivalent of approximately 0.15 M⦿c^2. A fraction of this tremendous energy is stored initially as heat (and rotational kinetic energy) in the protoneutron star, and the rest comes from its subsequent contraction.

Figure 1. Schematic of core collapse and its simplest outcomes: the iron core (with an inner core of 1.5–2 M⦿) of an evolved massive star such as a red supergiant (radius R ≈ 10^9 km; not drawn to scale), surrounded by shells of Si, C/O, He, and H, collapses to a protoneutron star (PNS) behind a stalled accretion shock; within τ ≈ 1 to a few seconds, the shock is either revived, producing a core-collapse supernova explosion, or not, leading to black hole formation. The inset image shows SN 1987A, which exploded in the Large Magellanic Cloud. (Image: © Anglo-Australian Observatory.)


Astronomical observations, on the other hand, show the typical core-collapse supernova explosion energy to be in the range of 10^50 to 10^51 erg. Hypernova explosions can have up to 10^52 erg, but they make up only about 1 percent of all core-collapse supernovae. A small subset of hypernovae are associated with gamma-ray bursts.
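
To see where the quoted reservoir comes from, a rough order-of-magnitude estimate (my own, assuming a uniform-density neutron star of mass M_NS ≈ 1.4 M⦿ and radius R_NS ≈ 12 km) of the released gravitational binding energy is

\[ E_\mathrm{grav} \sim \frac{3}{5}\,\frac{G M_\mathrm{NS}^2}{R_\mathrm{NS}} \approx 3 \times 10^{53}\ \mathrm{erg}, \]

of which the observed explosion energy of roughly 10^51 erg is only about 1 percent.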

Where does all the gravitational energy that doesn't contribute to the explosion energy go? The answer is neutrinos. Antineutrinos and neutrinos of all flavors carry away approximately 99 percent (approximately 90 percent in the hypernova case) of the available energy over O(10) s as the protoneutron star cools and contracts. This was first theorized and then later observationally confirmed with the detection of neutrinos from SN 1987A, the most recent core-collapse supernova in the Milky Way vicinity.

Because neutrinos dominate the energy transport through the supernova, they might quite naturally have something to do with the explosion mechanism. The neutrino mechanism, in its current form, was proposed by Hans Bethe and Jim Wilson.1 In this mechanism, a fraction (approximately 5 percent) of the outgoing electron neutrinos and antineutrinos is absorbed in a layer between the protoneutron star and the stalled shock. In the simplest picture, this neutrino heating increases the thermal pressure behind the stalled shock. Consequently, the dynamical pressure balance at the accretion shock is violated and a runaway explosion is launched.

The neutrino mechanism fails in spherical symmetry but is very promising in multiple dimensions (axisymmetry [2D], 3D). This is due largely to multidimensional hydrodynamic instabilities that break spherical symmetry (see Figure 2 for an example2), increase the neutrino mechanism's efficiency, and facilitate explosion. I discuss this in more detail later in this article. The neutrino mechanism is presently favored as the mechanism driving most core-collapse supernova explosions (a recent review appears elsewhere3).

Despite its overall promise, the neutrino mechanism is very inefficient. Only about 5 percent of the outgoing total luminosity is deposited behind the stalled shock at any moment, and much of this deposition is lost again as heated gas flows down, leaves the heating region, and settles onto the protoneutron star. The neutrino mechanism may (barely) be able to power ordinary core-collapse supernovae, but it cannot deliver hypernova explosion energies or account for gamma-ray bursts.

An alternative mechanism that could be part of the explanation for such extreme events is the magnetorotational mechanism.4–6 In its modern form, a very rapidly spinning core collapses to a protoneutron star with a spin period of only 1 millisecond. Its core is expected to be spinning uniformly, but its outer regions will be extremely differentially rotating. These are ideal conditions for the magnetorotational instability (MRI7) to operate, amplify any seed magnetic field, and drive magnetohydrodynamic (MHD) turbulence. If a dynamo process is present, an ultra-strong large-scale (globally ordered) magnetic field is built up. This makes the protoneutron star a protomagnetar. Provided this occurs, magnetic pressure gradients and hoop stresses could lead to outflows along the axis of rotation. The MRI's fastest growing mode has a small wavelength and is extremely difficult to resolve numerically.

Because of this, all simulations of the magnetorotational mechanism to date have simply made the assumption that a combination of the MRI and a dynamo is operating.

Figure 2. Volume rendering of the specific entropy in the core of a neutrino-driven core-collapse supernova at the onset of explosion, based on 3D general-relativistic simulations2 and rendered by Steve Drasco (Cal Poly San Luis Obispo). Specific entropy is a preferred quantity for visualization: in the supernova's core, it typically ranges from 1 to 20 units of Boltzmann's constant k_B per baryon. Shown is the large-scale asymmetric shock front and a layer of hot expanding plumes behind it. The physical scale is roughly 600 × 400 km.


They then impose, ad hoc, a strong large-scale field as an initial condition. In 2D simulations, collimated jets develop along the axis of rotation. In 3D, the jets are unstable and a more complicated explosion geometry develops,4 as shown in Figure 3. Nevertheless, even in 3D, an energetic explosion could potentially be powered.

The magnetorotational mechanism requires one special property of the progenitor star: rapid core rotation. Currently, stellar evolution theory suggests that the cores of most massive stars should be slowly spinning. However, there could be exceptions of rapidly spinning cores at just about the right occurrence rate to explain hypernovae and long gamma-ray bursts. In addition to the neutrino and magnetorotational mechanisms, several other explosion mechanisms have been proposed. A full review on explosion mechanisms appears elsewhere.3

A Multiscale, Multiphysics, Multidimensional Computational Challenge

The core-collapse supernova problem is highly complex and inherently nonlinear, and it involves many branches of (astro)physics. Only limited progress can be made with analytic or perturbative methods. Computational simulation is a powerful means for gaining theoretical insight and for making predictions that could be tested with astronomical observations of neutrinos, gravitational waves, and electromagnetic radiation.

Core-collapse supernova simulations are time evolution simulations: starting from initial conditions, the matter, radiation, and gravitational fields are evolved in time. In the case of time-explicit evolution, the numerical time step is limited by causality, controlled by the speed of sound in Newtonian simulations and the speed of light in general-relativistic simulations. Because of this, an increase in the spatial resolution by a factor of two corresponds to a decrease in the time step by a factor of two. Hence, in a 3D simulation, the computational cost scales with the fourth power of resolution.
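
To make the scaling argument explicit (my own restatement of the Courant-Friedrichs-Lewy reasoning in the text), the largest stable explicit time step on a grid with cell size Δx is

\[ \Delta t \lesssim C_\mathrm{CFL}\,\frac{\Delta x}{v_\mathrm{max}}, \]

where v_max is the fastest signal speed (the sound speed or the speed of light) and C_CFL < 1, so the cost of evolving a 3D domain of size L for a physical time T scales as

\[ \mathrm{cost} \propto \left(\frac{L}{\Delta x}\right)^{3} \frac{T}{\Delta t} \propto \Delta x^{-4}. \]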

Multiscale

Taking the red supergiant in Figure 1 as an example, a complete core-collapse supernova simulation that follows the shock to the stellar surface would have to cover dynamics on a physical scale from approximately 10^9 km (stellar radius) down to 0.1 km (the typical scale over which the structure and thermodynamics of the protoneutron star change). These ten orders of magnitude in spatial scale are daunting. In practice, reviving the shock and tracking its propagation to the surface can be treated as (almost) independent problems. If our interest is in the shock revival mechanism, we need to include the inner 10,000 km of the star. Because information about core collapse is communicated to overlying layers with the speed of sound, stellar material at greater radii won't "know" that core collapse has occurred before it's hit by the revived expanding shock.

Even with only five decades in spatial scale, some form of grid refinement or adaptivity is called for.

Figure 3. Volume rendering of the specific entropy in the core of a magnetorotational core-collapse supernova. Bluish colors indicate low entropy, red colors high entropy, and green and yellow intermediate entropy. The vertical is the axis of rotation, and shown is a region of 1,600 × 800 km. The ultra-strong toroidal magnetic field surrounding the protoneutron star pushes hot plasma out along the rotation axis. The distorted, double-lobe structure is due to an MHD kink instability akin to those seen in Tokamak fusion experiments. This figure was first published elsewhere4 and is used with permission.


A 3D finite-difference grid with an extent of 10,000 km symmetric about the origin and a uniform 0.1 km cell size would require 57 Pbytes of RAM to store a single double-precision variable. Many tens to hundreds of 3D variables are required. Such high uniform resolution is not only currently impossible but also unnecessary. Most of the resolution is needed near the protoneutron star and in the region behind the stalled shock. The near-free-fall collapse of the outer core can be simulated with much lower resolution.
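
The quoted number is easy to check; a quick back-of-the-envelope script (mine, not from the article) reproduces it:

# Check of the memory estimate for a uniform 3D grid quoted above.
extent_km = 2 * 10_000                     # grid spans -10,000 km to +10,000 km per dimension
cell_km = 0.1                              # uniform cell size
n_per_dim = round(extent_km / cell_km)     # 200,000 cells per dimension
n_cells = n_per_dim ** 3                   # 8 x 10^15 cells in 3D
bytes_one_var = 8 * n_cells                # one double-precision (8-byte) variable
print(n_per_dim, bytes_one_var / 2**50)    # -> 200000, ~56.8 pebibytes (~57 Pbytes)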

Because of the broad range of physics involved and the limited available compute power, early core-collapse supernova simulations were spherically symmetric (1D). Such 1D simulations often employ a Lagrangian comoving mass coordinate discretization. This grid can be set up to provide just the right resolution where and when needed or can be dynamically re-zoned (an adaptive mesh refinement [AMR] technique). Other 1D codes discretize in the Eulerian frame and use a fixed grid whose cells are radially stretched using geometric progression.
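
As an illustration of the last point, here is a minimal sketch (illustrative parameter values, not taken from any particular code) of such a geometrically stretched Eulerian radial grid, in which each cell is a constant factor wider than the previous one:

# Radially stretched grid: dr_{i+1} = growth * dr_i, fine near the center, coarse far out.
import numpy as np

def stretched_radial_grid(dr0_km=0.5, growth=1.0045, n_cells=1000):
    """Return cell-interface radii r_i (in km) for a geometric progression of cell widths."""
    dr = dr0_km * growth ** np.arange(n_cells)
    return np.concatenate(([0.0], np.cumsum(dr)))

r = stretched_radial_grid()
# innermost zone ~0.5 km wide, outermost zone ~44 km wide, outer edge near 10,000 km
print(r[1] - r[0], r[-1] - r[-2], r[-1])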

In 2D simulations, Eulerian geometrically spaced fixed spherical grids are the norm, but some codes use cylindrical coordinates and AMR. Spherical grids, already in 2D, suffer from a coordinate singularity at the axis that can lead to numerical artifacts. In 3D, they become even more difficult to handle, and their focusing grid lines impose a severe time step constraint near the origin. Some 3D codes still use a spherical grid, while many others employ Cartesian AMR grids. Recent innovative approaches use so-called multiblock grids with multiple curvilinear touching or overlapping logically Cartesian "cubed-sphere" grids.8

Multiphysics

Core-collapse supernovae are very rich in physics. All fundamental forces are involved and essential to the core-collapse phenomenon. These forces are probed under conditions that are impossible (or exceedingly difficult) to create in Earthbound laboratories.

Gravity drives the collapse and provides the energy reservoir. It's so strong near the protoneutron star that general relativity becomes important and its Newtonian description doesn't suffice. The electromagnetic force describes the interaction of the dense, hot, magnetized, perfectly conducting plasma and the photons that provide thermal pressure and make the supernova light. The weak force governs the interactions of neutrinos, and the strong (nuclear) force is essential in the nuclear EOS and nuclear reactions. All this physics occurs at the microscopic, per-particle level. Fortunately, the continuum assumption holds, allowing us to describe core-collapse supernovae on a macroscopic scale by a coupled set of systems of nonlinear partial differential equations (PDEs).

(Magneto)hydrodynamics (MHD). The stellar plasma is in local thermodynamic equilibrium, essentially perfectly conducting, and essentially inviscid (although neutrinos might provide some shear viscosity in the protoneutron star). The ideal inviscid MHD approximation is appropriate under these conditions. The MHD equations are hyperbolic and can be written in flux-conservative form with source terms that don't include derivatives of the MHD variables. They are typically solved with standard time-explicit high-resolution shock-capturing methods that exploit the characteristic structure of the equations.9,10 Special attention must be paid to preserving the divergence-free property of the magnetic field. The MHD equations require an EOS as a closure.
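
Schematically (my notation, written for the Newtonian case), "flux-conservative form with source terms" means a system of the type

\[ \partial_t \mathbf{U} + \nabla\cdot\mathbf{F}(\mathbf{U}) = \mathbf{S}(\mathbf{U}), \qquad \mathbf{U} = \left(\rho,\ \rho\mathbf{v},\ E,\ \mathbf{B}\right), \]

where the source vector S contains, for example, gravitational terms but no derivatives of U, and the evolution must additionally respect the constraint ∇·B = 0.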

Unless ultra-strong (B ≳ 10^15 G), magnetic fields have little effect on the supernova dynamics and thus are frequently neglected. Because strong gravity and velocities up to a few tenths of the speed of light are involved, the MHD equations are best solved in a general-relativistic formulation. General-relativistic MHD is particularly computationally expensive because the conserved variables are not the primitive variables (density, internal energy/temperature, velocity, chemical composition). The latter are needed for the EOS and enter flux terms. After each update, they must be recovered from the conserved variables via multidimensional root finding.
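
To give a feel for that recovery step, here is a minimal sketch (my own simplification, not the article's code): special-relativistic hydrodynamics without magnetic fields and with an ideal-gas EOS, where the recovery reduces to a one-dimensional root find on the pressure. Units with c = 1; all names are illustrative.

import numpy as np
from scipy.optimize import brentq

GAMMA = 4.0 / 3.0  # adiabatic index (illustrative choice)

def prim_to_con(rho, v, eps):
    """Primitives (rest-mass density, velocity, specific internal energy) -> (D, S, tau)."""
    W = 1.0 / np.sqrt(1.0 - v * v)            # Lorentz factor
    p = (GAMMA - 1.0) * rho * eps             # ideal-gas pressure
    h = 1.0 + eps + p / rho                   # specific enthalpy
    return rho * W, rho * h * W**2 * v, rho * h * W**2 - p - rho * W

def con_to_prim(D, S, tau):
    """Recover primitives from conserved variables by root finding on the pressure."""
    def mismatch(p):
        v = S / (tau + D + p)                 # velocity implied by the pressure guess
        W = 1.0 / np.sqrt(1.0 - v * v)
        rho = D / W
        eps = (tau + D * (1.0 - W) + p * (1.0 - W**2)) / (D * W)
        return (GAMMA - 1.0) * rho * eps - p  # EOS pressure minus the guess
    p = brentq(mismatch, 1e-12, 10.0 * (tau + D))   # heuristic bracketing interval
    v = S / (tau + D + p)
    W = 1.0 / np.sqrt(1.0 - v * v)
    return D / W, v, p / ((GAMMA - 1.0) * (D / W))

# Round trip: build conserved variables from known primitives, then recover them.
D, S, tau = prim_to_con(rho=1.0, v=0.3, eps=0.05)
print(con_to_prim(D, S, tau))                 # -> approximately (1.0, 0.3, 0.05)

In a general-relativistic MHD code the same idea applies, but the root find becomes multidimensional and must be done in every cell after every update, which is part of the cost noted above.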

Gravity. Deviations in the strength of the gravitational acceleration between Newtonian and general-relativistic gravity are small in the precollapse core but become of order 10 to 20 percent in the protoneutron star phase. In the case of black hole formation, Newtonian physics breaks down completely. General relativistic gravity is included at varying levels in simulations. Some neglect it completely and solve the linear elliptic Newtonian Poisson equation to compute the gravitational potential. This is done by using direct multigrid methods or integral multipole expansion methods. Some codes modify the monopole term in the latter approach to approximate general relativistic effects.
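
For reference (standard textbook expressions, not specific to any code mentioned here), the Newtonian potential obeys

\[ \nabla^2 \Phi = 4\pi G \rho, \]

and the monopole (l = 0) term of the integral multipole solution, written for the angle-averaged density ρ(r), is

\[ \Phi_{l=0}(r) = -G\left[\frac{M(r)}{r} + 4\pi\int_r^\infty \rho(r')\,r'\,dr'\right], \qquad M(r) = 4\pi\int_0^r \rho(r')\,r'^2\,dr'. \]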

Including full general relativity is more challenging, in particular in 2D and 3D, because general relativity has radiative degrees of freedom (gravitational waves) there. An entire subfield of gravitational physics, numerical relativity, spent nearly five decades looking for ways to solve Einstein's equations on computers.11


In general relativity, changes in the gravitational field propagate at the speed of light. Hence, time evolution equations must be solved. This is done by splitting 4D spacetime into 3D spatial slices that are evolved in the time direction. In the simplest way of writing the equations (the so-called Arnowitt-Deser-Misner [ADM] formulation), they form a system of 12 partial differential evolution equations, 4 gauge variables that must be specified (and evolved in time or recalculated on each slice), and 4 elliptic constraint equations without time derivatives. The ADM formulation has poor numerical stability properties that lead to violations of the constraint equations and numerical instabilities that make long-term evolution impossible.

It took until the late 1990s and the early 2000s for numerical relativity to find formulations of Einstein's equations and gauge choices that together lead to stable long-term evolutions. In some cases, well-posedness and strong or symmetric hyperbolicity can be proven. The equations are typically evolved time-explicitly with straightforward high-order (fourth and higher) finite-difference schemes or with multidomain pseudospectral methods.

Because numerical relativity only recently became applicable to astrophysical simulations, very few core-collapse supernova codes are fully general relativistic at this point.2,12 The fully general-relativistic approach is much more memory and FLOP intensive than solving the Newtonian Poisson equation. Its advantage in large-scale computations, however, is the hyperbolic nature of the equations, which doesn't require global matrix inversions or summations and thus is advantageous for the parallel scaling of the algorithm.

Neutrino transport and neutrino-matter interactions. Neutrinos move at the speed of light (the very small neutrino masses are neglected) and can travel macroscopic distances between interactions. Therefore, they must be treated as nonequilibrium radiation. Radiation transport is closely related to kinetic theory's Boltzmann equation. It describes the phase-space evolution of the neutrino distribution function or, in radiation transport terminology, their specific intensity. This is a 6+1-D problem: three spatial dimensions, neutrino energy, and two momentum-space propagation angles in addition to time. The angles describe the directions from which neutrinos are coming and where they're going at a given spatial coordinate. In addition, the transport equation must be solved separately for multiple neutrino species: electron neutrinos, electron antineutrinos, and heavy-lepton (μ, τ) neutrinos and antineutrinos.
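
In its simplest, flat-spacetime form for a static medium (a schematic I am adding for orientation; the full problem also contains energy- and angle-advection terms from fluid motion and spacetime curvature), the transport equation for the distribution function f(t, x, ε, n) of each species reads

\[ \frac{1}{c}\frac{\partial f}{\partial t} + \mathbf{n}\cdot\nabla f = \left(\frac{\delta f}{\delta t}\right)_\mathrm{coll}, \]

where the collision term on the right collects emission, absorption, and scattering and couples f to the local matter state.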

Figure 4 shows map projections of the momentum-space angular neutrino distribution at different radii in a supernova core.

Figure 4. Map projections of the momentum-space neutrino radiation field (for electron neutrinos at an energy of 16.3 MeV) going outward radially, from top to bottom, at radii of 30, 60, 120, 150, and 240 km on the equator of a supernova core.9 Inside the protoneutron star (R ≲ 30 km), neutrinos and matter are in equilibrium, and the radiation field is isotropic. It becomes more forward peaked as the neutrinos decouple and become free streaming. Handling the transition from slow diffusion to free streaming correctly requires angle-dependent radiation transport, which is a 6+1-D problem and computationally extremely challenging.


In the dense protoneutron star, neutrinos are trapped and in equilibrium with matter. Their radiation field is isotropic. They gradually diffuse out and decouple from matter at the neutrinosphere (the neutrino equivalent of the photosphere). This decoupling is gradual and marked by the transition of the angular distribution into the forward (radial) direction. In the outer decoupling region, neutrino heating is expected to occur, and the heating rates are sensitive to the angular distribution of the radiation field.9 Eventually, at radii of a few hundred kilometers, the neutrinos have fully decoupled and are free streaming. Neutrino interactions with matter (and thus the decoupling process) are very sensitive to neutrino energy, since weak-interaction cross-sections scale with the square of the neutrino energy.

This is why neutrino transport needs to be multigroup, with typically a minimum of 10 to 20 energy groups covering supernova neutrino energies of 1 to O(100) MeV. Typical mean energies of electron neutrinos are around 10 to 20 MeV. Energy exchanges between matter and radiation occur via the collision terms in the Boltzmann equation. These are stiff sources/sinks that must be handled time-implicitly with (local) backward-Euler methods. The neutrino energy bins are coupled through frame-dependent energy shifts. Neutrino-matter interaction rates are usually precomputed and stored in dense multidimensional tables within which simulations interpolate.
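
A toy example (mine, with made-up numbers) shows why the local backward-Euler treatment of a stiff emission/absorption term is attractive: for a source of the form dE/dt = κ(E_eq − E), the implicit update reduces to a single algebraic solve per cell and energy group and stays stable even when κΔt ≫ 1, where an explicit update would blow up.

# Backward-Euler update for dE/dt = kappa * (E_eq - E):
#   E_new = E_old + dt * kappa * (E_eq - E_new)  =>  solve algebraically for E_new.
def implicit_update(E_old, E_eq, kappa, dt):
    return (E_old + dt * kappa * E_eq) / (1.0 + dt * kappa)

E = 0.0
for _ in range(5):
    # kappa * dt = 100 >> 1: explicit Euler would be wildly unstable here,
    # while the implicit update relaxes smoothly toward equilibrium.
    E = implicit_update(E, E_eq=1.0, kappa=1.0e4, dt=1.0e-2)
print(E)   # -> close to 1.0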

Full 6+1-D general-relativistic Boltzmann neutrino radiation-hydrodynamics is exceedingly challenging and so far hasn't been possible to include in core-collapse supernova simulations, but 3+1-D (1D in space, 2D in momentum space),13 5+1-D (2D in space, 3D in momentum space),9 and static 6D simulations14 have been carried out.

Most (spatially) multidimensional simulations treat neutrino transport in some dimensionally reduced approximation. The most common is an expansion of the radiation field into angular moments. The nth moment of this expansion requires information about the (n+1)th moment (and in some cases, the (n+2)th moment as well). This necessitates a closure relation for the moment at which the expansion is truncated. Multigroup flux-limited diffusion evolves the 0th moment (the radiation energy density). The flux limiter is the closure that interpolates between diffusion and free streaming. The disadvantages of this method are its very diffusive nature (it washes out spatial variations of the radiation field), its sensitivity to the choice of flux limiter, and the need for time-implicit integration (involving global matrix inversion) due to the stability properties of the parabolic diffusion equation. Two-moment transport is the next better approximation, solving equations for the radiation energy density and momentum (that is, the radiative flux) and requiring a closure that describes the radiation pressure tensor (also known as the Eddington tensor). This closure can be analytic and based on the local values of energy density and flux (the M1 approximation). Alternatively, some codes compute a global closure based on the solution of a simplified, time-independent Boltzmann equation. The major advantage of the two-moment approximation is that its advection terms are hyperbolic and can be handled with standard time-explicit finite-volume methods of computational hydrodynamics, and only the local collision terms need time-implicit updates.
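
As a concrete example of such an analytic closure (one commonly used choice in the literature, not necessarily the one adopted by the codes cited here), the radiation pressure tensor is written as

\[ P^{ij} = \left[\frac{1-\chi}{2}\,\delta^{ij} + \frac{3\chi-1}{2}\,n^i n^j\right] E, \qquad n^i = \frac{F^i}{|\mathbf{F}|}, \]

with an Eddington factor χ that interpolates between the diffusion limit (χ = 1/3) and free streaming (χ = 1) as a function of the flux factor f = |F|/(cE), for example

\[ \chi(f) = \frac{3 + 4f^2}{5 + 2\sqrt{4 - 3f^2}}. \]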

There are now implementations of multigroup two-moment neutrino radiation-hydrodynamics in multiple 2D/3D core-collapse supernova simulation codes.12,15,16 This method could be sufficiently close to the full Boltzmann solution (in particular, if a global closure is used) and appears to be the way toward massively parallel long-term 3D core-collapse supernova simulations.

Neutrino oscillations. Neutrinos have mass and can oscillate between flavors. The oscillations occur in a vacuum but can also be mediated by neutrino-electron scattering (the Mikheyev-Smirnov-Wolfenstein [MSW] effect) and neutrino-neutrino scattering. Neutrino oscillations depend on neutrino mixing parameters and on the neutrino mass eigenstates (the magnitudes of the mass differences are known but not their signs). Observation of neutrinos from the next galactic core-collapse supernova could help constrain the neutrino mass hierarchy.17

MSW oscillations occur in the stellar envelope. They're important for the neutrino signal observed in detectors on Earth, but they can't influence the explosion itself. The self-induced (via neutrino-neutrino scattering) oscillations, however, occur at the extreme neutrino densities near the core. They offer a rich phenomenology that includes collective oscillation behavior of neutrinos.17 The jury's still out on their potential influence on the explosion mechanism.

Collective neutrino oscillation calculations (essentially solving coupled Schrödinger-like equations) are computationally intensive.17 They're currently performed independently of core-collapse supernova simulations and don't take into account feedback on the stellar plasma.


Fully understanding collective oscillations and their impact on the supernova mechanism will quite likely require that neutrino oscillations, transport, and neutrino-matter interactions are solved for together in a quantum-kinetic approach.18

Equation of state and nuclear reactions. The EOS is essential for the (M)HD part of the problem and for updating the matter thermodynamics after neutrino-matter interactions. Baryons (protons, neutrons, alpha particles, heavy nuclei), electrons, positrons, and photons contribute to the EOS. Neutrino momentum transfer contributes an effective pressure that is taken into account separately because neutrinos are not everywhere in local thermodynamic equilibrium with the stellar plasma. In different parts of the star, different EOS physics applies.

At low densities and temperatures below approximately 0.5 MeV, nuclear reactions are too slow to reach nuclear statistical equilibrium (NSE). In this regime, the mass fractions of the various heavy nuclei (isotopes, in the following) must be tracked explicitly. As the core collapses, the gas heats up and nuclear burning must be tracked with a nuclear reaction network, a stiff system of ODEs. Solving the reaction network requires the inversion of sparse matrices at each grid point. Depending on the number of isotopes tracked (ranging typically from O(10) to O(100)), nuclear burning can be a significant contributor to the overall computational cost of a simulation. The EOS in the burning regime is simple: all isotopes can essentially be treated as noninteracting ideal Boltzmann gases. Often, corrections for Coulomb interactions are included. Photons and electrons/positrons can be treated everywhere as ideal Bose and Fermi gases, respectively. Because electrons will be partially or completely degenerate, computing the electron/positron EOS involves the FLOP-intensive solution of Fermi integrals. Because of this, their EOS is often included in tabulated form.

At temperatures above 0.5 MeV, nuclear statistical equilibrium holds. This greatly simplifies things, since now the electron fraction Y_e (the number of electrons per baryon; because of macroscopic charge neutrality, Y_e is equal to Y_p, the number fraction of protons) is the only compositional variable. The mass fractions of all other baryonic species can be obtained by solving Saha-like equations for compositional equilibrium. At densities below approximately 10^10 to 10^11 g cm^-3, the baryons can still be treated as ideal Boltzmann gases (but including Coulomb corrections).

The nuclear force becomes relevant at densities near and above 10^10 to 10^11 g cm^-3. It is an effective quantum many-body interaction of the strong force, and its detailed properties presently aren't known. Under supernova conditions, matter will be in NSE in the nuclear regime, and the EOS is a function of density, temperature, and Y_e. Starting from a nuclear force model, an EOS can be obtained in multiple ways,19 including direct Hartree-Fock many-body calculations, mean field models, or phenomenological models (such as the liquid-drop model).

Typically, the minimum of the Helmholtz free energy is sought, and all thermodynamic variables are obtained from derivatives of the free energy. In most cases, EOS calculations are too time-consuming to be performed during a simulation. As in the case of the electron/positron EOS, large (more than 200 Mbytes must be stored by each MPI process), densely spaced nuclear EOS tables are precomputed, and simulations efficiently interpolate in (log ρ, log T, Y_e) to obtain thermodynamic and compositional information.
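
A minimal sketch of such a table lookup (the grid ranges and the "table" below are placeholders, not a real nuclear EOS) might look as follows:

# Interpolating a precomputed table in (log10 rho, log10 T, Y_e).
import numpy as np
from scipy.interpolate import RegularGridInterpolator

log_rho = np.linspace(3.0, 15.0, 241)    # log10(density / g cm^-3)
log_T = np.linspace(-2.0, 2.0, 161)      # log10(temperature / MeV)
y_e = np.linspace(0.05, 0.55, 51)        # electron fraction
# Fake "log pressure" entries standing in for a table computed from a nuclear EOS.
table = log_rho[:, None, None] + log_T[None, :, None] + y_e[None, None, :]

lookup = RegularGridInterpolator((log_rho, log_T, y_e), table)
print(lookup([[11.0, 0.5, 0.3]]))        # interpolated value at one (rho, T, Y_e) point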

Multidimensionality

Stars are, at zeroth order, gas spheres. It's thus natural to start with assuming spherical symmetry in simulations—in particular, given the very limited compute power available to the pioneers of supernova simulations. After decades of work, it now appears clear that detailed spherically symmetric simulations robustly fail at producing explosions for stars that are observed to explode in nature. Spherical symmetry itself could be the culprit because symmetry is clearly broken in core-collapse supernovae:

■ Observations show that neutron stars receive "birth kicks," giving them typical velocities of O(100) km s^-1 with respect to the center of mass of their progenitors. The most likely and straightforward explanation for these kicks is that highly asymmetric explosions lead to neutron star recoil, owing to momentum conservation.

■ Deep observations of supernova remnants show that the innermost supernova ejecta exhibit low-mode asphericity similar to the geometry of the shock front shown in Figure 2.

■ Analytic considerations as well as 1D core-collapse simulations show that the protoneutron star and the region behind the stalled shock where neutrino heating takes place are both unstable to buoyant convection, which always leads to the breaking of spherical symmetry.

■ Rotation and magnetic fields naturally break spherical symmetry. Observations of young pulsars show that some neutron stars must be born with rotation periods on the order of 10 milliseconds. Magnetars could be born with even shorter spin periods if their magnetic field is derived from rapid differential rotation.

■ Multidimensional simulations of the violent nuclear burning in the shells overlying the iron core show that large-scale deviations from sphericity develop that couple into the precollapse iron core via the excitation of nonradial pulsations.20 These create perturbations from which convection will grow after core bounce.

Given the above, multidimensional simulations are essential for studying the dynamics of the supernova engine. The rapid increase of compute power since the early 1990s has facilitated increasingly detailed 2D radiation-hydrodynamics simulations over the past two and a half decades. Three-dimensional simulations with simplified neutrino treatments have been carried out since the early 2000s. The first 3D neutrino radiation-hydrodynamics simulations have become possible only in the past few years, thanks to the compute power of large petascale systems like the US-funded Blue Waters and Titan, and the Japanese K computer.

Core-Collapse Supernova Simulation Codes

Many 1D codes exist, some are no longer in use, and one is open source and free to download (http://GR1Dcode.org). There are approximately 10 (depending on how you count them) multidimensional core-collapse supernova simulation codes in the community. Many, in particular the 3D codes, follow the design encapsulated in Figure 5. They employ a simulation framework (such as FLASH, http://flash.uchicago.edu/site/flashcode, or Cactus, http://cactuscode.org) that handles domain decomposition, message passing, memory management, AMR, coupling of different physics components, execution scheduling, and I/O.

Given the tremendous memory requirement and FLOP consumption of the core-collapse supernova problem, these codes are massively parallel and employ both node-local OpenMP and internode MPI parallelization. All current codes follow a data-parallel paradigm with monolithic sequential scheduling. However, this limits scaling, can create load imbalances with AMR, and makes the use of GPU/MIC accelerators challenging because communication latencies between accelerator and CPU block execution in the current paradigm.

The Caltech Zelmani2 core-collapse simulation package is an example of a 3D core-collapse supernova code. It is based on the open source Cactus framework, uses 3D AMR Cartesian and multiblock grids, and employs many components provided by the open source Einstein Toolkit (http://einsteintoolkit.org). Zelmani has fully general-relativistic gravity and implements general-relativistic MHD. Neutrinos are included either via a rather crude energy-averaged leakage scheme that approximates the overall energetics of neutrino emission and absorption or via a general-relativistic two-moment M1 radiation-transport solver that has recently been deployed on first simulations.16

In full radiation-hydrodynamics simulations of the core-collapse supernova problem with eight levels of AMR, Zelmani exhibits good strong scaling with hybrid OpenMP/MPI to 16,000 cores on Blue Waters. At larger core counts, load imbalances due to AMR prolongation and synchronization operations begin to dominate the execution time.

Multidimensional Dynamics and Turbulence

Even before the first detailed 2D simulations of neutrino-driven core-collapse supernovae became possible in the mid-1990s, it was clear that buoyant convection in the protoneutron star and in the neutrino-heated region just behind the stalled shock breaks spherical symmetry. Neutrino-driven convection is due to a negative radial gradient in the specific entropy, making the plasma at smaller radii "lighter" than overlying plasma. This is a simple consequence of neutrino heating being strongest at the base of the heating region.

Figure 5. Multiphysics modules of core-collapse supernova simulation codes: hydrodynamics/MHD, neutrino transport and interactions, gravity, and equation of state/nuclear reactions, coupled through a simulation framework that provides parallelization and communication, I/O, execution scheduling, AMR, coupling, and memory management.


Rayleigh-Taylor-like plumes develop from small perturbations and grow to nonlinear convection. This convection is extremely turbulent because the physical viscosity in the heating region is vanishingly small. Neutrino-driven turbulence is anisotropic on large scales (due to buoyancy), mildly compressible (the flow reaches Mach numbers of approximately 0.5), and only quasi-stationary because an explosion eventually develops. Nevertheless, it turns out that Kolmogorov's description for isotropic, stationary, incompressible turbulence works surprisingly well for neutrino-driven turbulence (see Figure 6).
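
In that description (standard turbulence theory, restated here for reference), kinetic energy injected at large scales cascades through the inertial range at a constant rate ε, and the energy spectrum follows

\[ E(k) \propto \varepsilon^{2/3} k^{-5/3}, \]

which is the k^-5/3 scaling sketched in Figure 6.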

There is something special about neutrino-driven convection in core-collapse supernovae: unlike convection in globally hydrostatic stars, neutrino-driven convection occurs on top of a downflow of outer core material that has accreted through the stalled shock and is headed for the protoneutron star. The consequence of this is that there is a competition between the time it takes for a small perturbation to grow to macroscopic scale to become buoyant and the time it takes for it to leave the region that is convectively unstable (the heating region) as it is dragged with the background flow toward the protoneutron star. This means that there are three parameters governing the appearance of neutrino-driven convection: the strength of neutrino heating, the initial size of perturbations entering through the shock, and the downflow rate through the heating region. Because of this, neutrino-driven convection is not a given, and simulations find that it does not develop in some stars.

But even in the absence of neutrino-driven convection, there is another instability that breaks spherical symmetry in the supernova core: the standing accretion shock instability (SASI).3 SASI was first discovered in simulations that did not include neutrino heating. It works via a feedback cycle: small perturbations enter through the shock, flow down to the protoneutron star, and get reflected as sound waves that in turn perturb the shock. The SASI is a low-mode instability that is most manifest in an up-down sloshing (l = 1 in terms of spherical harmonics) along the symmetry axis in 2D and in a spiral mode (m = 1) in 3D. Once it has reached nonlinear amplitudes, the SASI creates secondary shocks (entropy perturbations) and shear flow from which turbulence develops. SASI appears to dominate in situations in which neutrino-driven convection is weak or absent: in conditions where neutrino heating is weak, the perturbations entering the shock are small, or the downflow rate through the heating region is high.

Independent of how spherical symmetry is broken in the heating region, all simulations agree that 2D/3D is much more favorable for explosion than 1D. Some 2D and 3D simulations yield explosions for stars where 1D simulations fail.21 Why is that?

The first reason has been long known and is seemingly trivial: the added degrees of freedom, lateral motion in 2D, and lateral and azimuthal motion in 3D all have the consequence that a gas element that enters through the shock front spends more time in the heating region before flowing down to settle onto the protoneutron star. Because it spends more time in the heating region, it can absorb more neutrino energy, increasing the neutrino mechanism's overall efficiency.

The second reason has to do with turbulence and has become apparent only in the past few years. Turbulence is often analyzed employing Reynolds decomposition, a method that separates the background flow from the turbulent fluctuations. Using this method, we can show that turbulent fluctuations lead to an effective dynamical ram pressure (Reynolds stress) that contributes to the overall momentum balance between the regions behind and in front of the stalled shock. The turbulent pressure is available only in 2D/3D simulations, and it has been demonstrated22 that because of this pressure, 2D/3D core-collapse supernovae explode with less thermal pressure and, consequently, with less neutrino heating.
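
Schematically (my notation), the decomposition writes each velocity component as a mean plus a fluctuation,

\[ v_i = \langle v_i \rangle + v_i', \]

and the Reynolds stress

\[ R_{ij} = \langle \rho\, v_i' v_j' \rangle \]

then enters the mean momentum equation like an additional pressure; its radial component adds to the thermal pressure in the balance at the stalled shock.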

Figure 6. Schematic view of turbulence: kinetic energy is injected into the flow at large scales and cascades through the inertial range via nonlinear interactions of turbulent eddies to small scales (high wave numbers in the spectral domain), where it dissipates into heat. The scaling of the turbulent kinetic energy with wavenumber in the inertial range is k^-5/3 for Kolmogorov turbulence. This scaling is also found in very high-resolution simulations of neutrino-driven convection.



Now, the Reynolds stress is dominated by turbulent fluctuations at the largest physical scales: a simulation that has more kinetic energy in large-scale motions will explode more easily than a simulation that has less. This realization readily explains recent findings by multiple simulation groups, namely, that 2D simulations appear to explode more readily than 3D simulations.21,22 This is likely a consequence of the different behaviors of turbulence in 2D and 3D. In 2D, turbulence transports kinetic energy to large scales (which is unphysical), artificially increasing the turbulent pressure contribution. In 3D, turbulence cascades energy to small scales (as it should and as is known experimentally), so a 3D supernova will generally have less turbulent pressure support than a 2D supernova.

Another recent finding by multiple groups is that simulations with lower spatial resolution appear to explode more readily than simulations with higher resolution. There are two possible explanations for this, and it is likely that they work hand in hand: one, low resolution creates a numerical bottleneck in the turbulent cascade, artificially trapping turbulent kinetic energy at large scales where it can contribute most to the explosion, and two, low resolution also increases the size of numerical perturbations that enter through the shock and from which buoyant eddies form. The larger these seed perturbations are, the stronger is the turbulent convection and the larger is the Reynolds stress.

The qualitative and quantitative behavior of turbulent flow is very sensitive to numerical resolution. This can be appreciated by looking at Figure 7, which shows the same 3D simulation of neutrino-driven convection at four different resolutions, spanning a factor of 12 from the reference resolution that is presently used in many 3D simulations and which underresolves the turbulent flow. As resolution is increased, turbulent flow breaks down to progressively smaller features. What also occurs, but cannot be appreciated from a still figure, is that the intermittency of the flow increases as the turbulence is better resolved. This means that flow features are not persistent but quickly appear and disappear through nonlinear interactions of turbulent eddies. In this way, the turbulent cascade can be temporarily reversed (this is called backscatter in turbulence jargon), creating large-scale intermittent flow features similar to what is seen at low resolution. The role of intermittency in neutrino-driven turbulence and its effect on the explosion mechanism remain to be studied.

A key challenge for 3D core-collapse supernova simulations is to provide sufficient resolution so that kinetic energy cascades away from the largest scales at the right rate. Resolution studies suggest that this could require between 2 and 10 times the resolution of current 3D simulations. A 10-fold increase in resolution in 3D corresponds to a 10,000-fold increase in computational cost. An alternative could be to devise an efficient subgrid model that, if included, provides the correct rate of energy transfer to small scales. Work in that direction is still in its infancy in the core-collapse supernova context.

Making Magnetars: Resolving the Magnetorotational Instability

The magnetorotational mechanism relies on the presence of an ultra-strong (10^15 to 10^16 G) global, primarily toroidal, magnetic field around the protoneutron star. Such a strongly magnetized protoneutron star is called a protomagnetar. It has been theorized that the MRI7 could generate a very strong local magnetic field that could be transformed into a global field by a dynamo process. While appealing, it was not at all clear that this is what happens.

Figure 7. Slices from four semiglobal 3D simulations of neutrino-driven convection with parameterized neutrino cooling and heating, carried out in a 45° wedge and shown at the reference resolution (labeled "Ref.") and at 2, 6, and 12 times that resolution. The color map is the specific entropy; blue colors mark low-entropy regions, and red corresponds to high entropy. Only the resolution is varied. The reference resolution (a radial cell size of 3.8 km and an angular cell size of 1.8°) corresponds to the resolution of present global 3D detailed radiation-hydrodynamics core-collapse supernova simulations. Note how low resolution favors large flow features and how the turbulence breaks down to progressively smaller features with increasing resolution. This figure includes simulations up to 12 times the reference resolution that were run on 65,536 cores of Blue Waters. Rendered by David Radice (Caltech).


The physics is fundamentally global and 3D, and global 3D MHD simulations with sufficient resolution to capture MRI-driven field growth were impossible to perform for core-collapse supernovae.

This changed with the advent of Blue Waters–class petascale supercomputers and is a testament to how increased compute power and capability systems like Blue Waters facilitate scientific discovery. Our group at Caltech carried out full-physics 3D global general-relativistic MHD simulations of 10 milliseconds of a rapidly spinning protoneutron star's life, starting shortly after core bounce.23 We cut out a central octant (with appropriate boundary conditions) from another, lower-resolution 3D AMR simulation, and covered a 3D region of 140 × 70 × 70 km with uniform resolution. We performed four simulations to study the MHD dynamics at resolutions of 500 m (approximately 2 points per MRI wavelength), 200 m, 100 m, and 50 m (approximately 20 points per MRI wavelength). Because we employed uniform resolution and no AMR, the simulations showed excellent strong scaling. The 50 m simulation was run on 130,000 Blue Waters cores and consumed roughly 3 million Blue Waters node hours (approximately 48 million CPU hours).

Our simulations with 100 m and 50 m resolution resolve the MRI and show exponential growth of the magnetic field. This growth saturates at small scales within a few milliseconds and is consistent with what we anticipate on the basis of analytical estimates. The MRI drives the MHD turbulence that is most prominent in the layer of greatest rotational shear, just outside of the protoneutron star core at radii of 20 to 30 km. What we did not anticipate is that in the highest-resolution simulation (which resolves the turbulence best), an inverse turbulent cascade develops that transports magnetic field energy toward large scales. It acts as a large-scale dynamo that builds up a global, primarily toroidal field, just in the way needed to power a magnetorotational explosion. Figure 8 shows the final toroidal magnetic field component in our 50 m simulation after 10 ms of evolution time. Regions of strongest positive and negative magnetic field are marked by yellowish and light blue colors, respectively, and are just outside the protoneutron star core. At the time shown, the magnetic field on large scales has not yet reached its saturated state. We expect this to occur after approximately 50 ms, which could not be simulated.

Our results suggest that the conditions necessary for the magnetorotational mechanism are a generic outcome of the collapse of rapidly rotating cores. The MRI is a weak-field instability and will grow to the needed saturation field strengths from any small seed magnetic field. The next step is to find a way to simulate for longer physical time and with a larger physical domain. This will be necessary to determine the long-term dynamical impact of the generated large-scale magnetic field. Such simulations will require algorithmic changes to improve parallel scaling and facilitate the efficient use of accelerators; they could even require larger and faster machines than Blue Waters.

Figure 8. Visualization by Robert R. Sisneros (NCSA) and Philipp Mösta (UC Berkeley) of the toroidal magnetic field built up by an inverse cascade (large-scale dynamo) from small-scale magnetoturbulence in a magnetorotational core-collapse supernova. Shown is a 140 × 70 × 70 km 3D octant region with periodic boundaries on the x-z and y-z faces. Regions of strongest positive and negative magnetic field are marked by light blue and yellowish colors. Dark blue and dark red colors mark regions of weaker negative and positive magnetic field.23


Core-collapse supernova theorists have always been among the top group of users of supercomputers. The CDCs and IBMs of the 1960s and 1970s, the vector Crays of the 1970s to 1990s, the large parallel scalar architectures of the 2000s, and the current massively parallel SIMD machines all paved the path of progress for core-collapse supernova simulations.

Today's 3D simulations are rapidly improving in their included macroscopic and microscopic physics. They are beginning to answer decades-old questions and are allowing us to formulate new ones. There is still much need for improvement, which will come at no small price in the post–Moore's law era of heterogeneous supercomputers.

One important issue that the community must address is the reproducibility of simulations and the verification of simulation codes. It still occurs more often than not that different codes starting from the same initial conditions and implementing nominally the same physics arrive at quantitatively and qualitatively different outcomes. In the mid-2000s, an extensive comparison of 1D supernova codes provided results that are still being used as benchmarks today.13 Efforts are now underway that will lead to the definition of multidimensional benchmarks. In addition to code comparisons, the increasing availability of open source simulation codes and routines for generating input physics (such as neutrino interactions) is furthering reproducibility. Importantly, these open source codes now allow new researchers to enter the field without the need to spend many years developing basic simulation technology that already exists.

Core collapse is, in essence, an initial value problem. Current simulations, even those in 3D, start from spherically symmetric precollapse conditions from 1D stellar evolution codes. However, stars rotate, and convection in the layers surrounding the inert iron core is violently aspherical. These asphericities have an impact on the explosion mechanism. For 3D core-collapse supernova simulations to provide robust and reliable results, the initial conditions must be reliable and robust, and obtaining them will likely require simulating the final phases of stellar evolution in 3D,20 which is another multidimensional, multiscale, multiphysics problem.

Neutrino quantum kinetics for including neutrino oscillations directly into simulations will be an important but algorithmically and computationally exceedingly challenging addition to the simulation physics. Formalisms for doing so are under development, and first implementations (in spatially 1D simulations) could be available in a few years. A single current top-of-the-line 3D neutrino radiation-hydrodynamics simulation can be carried out to approximately 0.5 to 1 second after core bounce at a cost of several tens of millions of CPU hours, but it still underresolves the neutrino-driven turbulence. What is needed now are many such simulations for studying sensitivity to initial conditions, such as rotation and progenitor structure, and to input physics. These simulations should be at higher resolution and carried out for longer so that the longer-term development of the explosion (or collapse to a black hole) and, for example, neutron star birth kicks can be reliably simulated.

Many longer simulations at higher resolution will require much more compute power than is currently available. The good news is that the next generation of petascale systems and, certainly, exascale machines in the next decade will provide the necessary FLOPS. The bad news: the radical and disruptive architectural changes necessary on the route to exascale will require equally disruptive changes in supernova simulation codes. Already at petascale, the traditional data-parallel, linear/sequential execution model of all present supernova codes is the key limiting factor of code performance and scaling. A central issue is the need to communicate many boundary points between subdomains for commonly employed high-order finite-difference and finite-volume schemes. With increasing parallel process count, communication eventually dominates over computation in current supernova simulations.

Because latencies cannot be hidden, efficiently offloading data and tasks to accelerators in heterogeneous systems is difficult for current supernova codes. The upcoming generation of petascale machines such as Summit and Sierra fully embraces heterogeneity. For exascale machines, power consumption will be the driver of computing architecture. The current Blue Waters system already draws approximately 10 MW of power, and there is not much upward flexibility for future machines. Unless there are unforeseen breakthroughs in semiconductor technology that provide increased single-core performance at orders of magnitude lower power footprints, exascale machines will likely be all-accelerator with hundreds of millions of slow, highly energy-efficient cores.

Accessing the compute power of upcoming petascale and exascale machines requires a radical departure from current code design and major code development efforts. Several supernova groups are exploring new algorithms, numerical methods, and parallelization paradigms. Discontinuous Galerkin (DG) finite elements24 have emerged as a promising discretization approach that guarantees high numerical order while minimizing the amount of subdomain boundary information that needs to be communicated between processes. In addition, switching to a new, more flexible parallelization approach will likely be necessary to prepare supernova codes (and other computational astrophysics codes solving similar equations) for exascale machines. A prime contender being considered by supernova groups is task-based parallelism, which allows for fine-grained dynamical load balancing and asynchronous execution and communication. Frameworks that can become task-based backbones of future supernova codes already exist, such as Charm++ (http://charm.cs.illinois.edu/research/charm), Legion (http://legion.stanford.edu/overview), and Uintah (http://uintah.utah.edu).
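The frameworks above are C++ systems; purely as an illustration of the general idea (and not of how Charm++, Legion, or Uintah are actually programmed—the patch sizes and worker count below are made-up assumptions), a minimal Python sketch of task-based execution with asynchronous completion looks like this:

from concurrent.futures import ProcessPoolExecutor, as_completed
import math

def update_patch(patch_id, n_cells):
    # Stand-in for one expensive sub-step on one patch of the computational grid.
    return patch_id, sum(math.sin(i) for i in range(n_cells))

# Patches of very different sizes give naturally imbalanced work.
patches = [(i, 10_000 * (1 + i % 7)) for i in range(64)]

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(update_patch, pid, n) for pid, n in patches]
        for future in as_completed(futures):   # results arrive as tasks finish, not in lockstep
            patch_id, result = future.result()

The point is that no global synchronization forces every worker to wait for the slowest patch; results are consumed as they arrive, which is the property that makes fine-grained load balancing possible.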

Acknowledgments

I acknowledge helpful conversations with and help from Adam Burrows, Sean Couch, Steve Drasco, Roland Haas, Kenta Kiuchi, Philipp Mösta, David Radice, Luke Roberts, Erik Schnetter, Ed Seidel, and Masaru Shibata. I thank the Yukawa Institute for Theoretical Physics at Kyoto University for hospitality while writing this article. This work is supported by the US National Science Foundation (NSF) under award numbers CAREER PHY-1151197 and TCAN AST-1333520, and by the Sherman Fairchild Foundation. Computations were performed on NSF XSEDE under allocation TG-PHY100033 and on NSF/NCSA Blue Waters under NSF PRAC award number ACI-1440083. Movies of simulation results can be found on www.youtube.com/SXSCollaboration.

References

1. H.A. Bethe and J.R. Wilson, "Revival of a Stalled Supernova Shock by Neutrino Heating," Astrophysical J., vol. 295, Aug. 1985, pp. 14–23.

2. C.D. Ott et al., "General-Relativistic Simulations of Three-Dimensional Core-Collapse Supernovae," Astrophysical J., vol. 768, May 2013, article no. 115.

3. H.-T. Janka, "Explosion Mechanisms of Core-Collapse Supernovae," Ann. Rev. Nuclear and Particle Science, vol. 62, Nov. 2012, pp. 407–451.

4. P. Mösta et al., "Magnetorotational Core-Collapse Supernovae in Three Dimensions," Astrophysical J. Letters, vol. 785, Apr. 2014, article no. L29.

5. G.S. Bisnovatyi-Kogan, "The Explosion of a Rotating Star as a Supernova Mechanism," Astronomicheskii Zhurnal, vol. 47, Aug. 1970, p. 813.

6. J.M. LeBlanc and J.R. Wilson, "A Numerical Example of the Collapse of a Rotating Magnetized Star," Astrophysical J., vol. 161, Aug. 1970, pp. 541–551.

7. S.A. Balbus and J.F. Hawley, "A Powerful Local Shear Instability in Weakly Magnetized Disks. I—Linear Analysis. II—Nonlinear Evolution," Astrophysical J., vol. 376, July 1991, pp. 214–233.

8. A. Wongwathanarat, H. Janka, and E. Muller, "Hydrodynamical Neutron Star Kicks in Three Dimensions," Astrophysical J. Letters, vol. 725, Dec. 2010, pp. L106–L110.

9. C.D. Ott et al., "Two-Dimensional Multiangle, Multigroup Neutrino Radiation-Hydrodynamic Simulations of Postbounce Supernova Cores," Astrophysical J., vol. 685, Oct. 2008, pp. 1069–1088.

10. E.F. Toro, Riemann Solvers and Numerical Methods for Fluid Dynamics, Springer, 1999.

11. T.W. Baumgarte and S.L. Shapiro, Numerical Relativity: Solving Einstein's Equations on the Computer, Cambridge Univ. Press, 2010.

12. T. Kuroda, T. Takiwaki, and K. Kotake, "A New Multi-Energy Neutrino Radiation-Hydrodynamics Code in Full General Relativity and Its Application to Gravitational Collapse of Massive Stars," Astrophysical J. Supplemental Series, vol. 222, Feb. 2016, article no. 20.

13. M. Liebendörfer et al., "Supernova Simulations with Boltzmann Neutrino Transport: A Comparison of Methods," Astrophysical J., vol. 620, Feb. 2005, pp. 840–860.

14. K. Sumiyoshi et al., "Multidimensional Features of Neutrino Transfer in Core-Collapse Supernovae," Astrophysical J. Supplemental Series, vol. 216, Jan. 2015, article no. 5.

15. E. O'Connor and S.M. Couch, "Two Dimensional Core-Collapse Supernova Explosions Aided by General Relativity with Multidimensional Neutrino Transport," submitted to Astrophysical J., Nov. 2015; arXiv:1511.07443.

16. L.F. Roberts et al., "General Relativistic Three-Dimensional Multi-Group Neutrino Radiation-Hydrodynamics Simulations of Core-Collapse Supernovae," submitted to Astrophysical J., Apr. 2016; arXiv:1604.07848.

17. A. Mirizzi et al., "Supernova Neutrinos: Production, Oscillations and Detection," La Rivista del Nuovo Cimento, vol. 39, Jan. 2016, pp. 1–112.

18. A. Vlasenko, G.M. Fuller, and V. Cirigliano, "Neutrino Quantum Kinetics," Physical Rev. D, vol. 89, no. 10, 2014, article no. 105004.


19. A.W. Steiner, M. Hempel, and T. Fischer, "Core-Collapse Supernova Equations of State Based on Neutron Star Observations," Astrophysical J., vol. 774, Sept. 2013, article no. 17.

20. S.M. Couch et al., "The Three-Dimensional Evolution to Core Collapse of a Massive Star," Astrophysical J. Letters, vol. 808, July 2015, article no. L21.

21. E.J. Lentz et al., "Three-Dimensional Core-Collapse Supernova Simulated Using a 15 M⊙ Progenitor," Astrophysical J. Letters, vol. 807, July 2015, article no. L31.

22. S.M. Couch and C.D. Ott, "The Role of Turbulence in Neutrino-Driven Core-Collapse Supernova Explosions," Astrophysical J., vol. 799, Jan. 2015, article no. 5.

23. P. Mösta et al., "A Large-Scale Dynamo and Magnetoturbulence in Rapidly Rotating Core-Collapse Supernovae," Nature, vol. 528, no. 7582, 2015, pp. 376–379; www.nature.com/nature/journal/v528/n7582/full/nature15755.html.

24. J.S. Hesthaven and T. Warburton, Nodal Discontinuous Galerkin Methods: Algorithms, Analysis, and Applications, 1st ed., Springer, 2007.

Christian D. Ott is a professor of theoretical astrophysics in the Theoretical Astrophysics Including Cosmology and Relativity (TAPIR) group of the Walter Burke Institute for Theoretical Physics at Caltech. His research interests include astrophysics and computational simulations of core-collapse supernovae, neutron star mergers, and black holes. Ott received a PhD in physics from the Max Planck Institute for Gravitational Physics and Universität Potsdam. Contact him at [email protected].

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.





LEADERSHIP COMPUTING
Editors: James J. Hack, [email protected] | Michael E. Papka, [email protected]


Multiyear Simulation Study Provides Breakthrough in Membrane Protein Research

Laura Wolf | Argonne National Laboratory

“Molecular machines,” composed of protein components, consume energy to perform specific biological functions. The concerted actions of the proteins trigger many of the critical activities that occur in living cells. However, like any machine, the components can break (through various mutations), and then the proteins fail to perform their functions correctly.

It's known that malfunctioning proteins can result in a host of diseases, but pinpointing when and how a malfunction occurs is a significant challenge. Very few functional states of molecular machines are determined by experimentalists working in wet laboratories. Therefore, more structure-function information is needed to develop an understanding of disease processes and to design novel therapeutic agents.

The research team of Benoît Roux, a professor in the University of Chicago's Department of Biochemistry and Molecular Biology and a senior scientist in Argonne National Laboratory's Center for Nanoscale Materials, relies on an integrative approach to discover and define the basic mechanisms of biomolecular systems—an approach that relies on theory, modeling, and running large-scale simulations on some of the fastest open science supercomputers in the world.

Computers have already changed the landscape of biology in considerable ways; modeling and simulation tools are routinely used to fill in knowledge gaps from experiments, helping design and define research studies. Petascale supercomputing provides a window into something else entirely: the ability to calculate all the interactions occurring between the atoms and molecules in a biomolecular system, such as a molecular machine, and to visualize the motion that emerges.

The Breakthrough

Roux's team recently concluded a three-year Innovative and Novel Computational Impact on Theory and Experiment (INCITE) project at the Argonne Leadership Computing Facility (ALCF), a US Department of Energy (DOE) Office of Science User Facility, to understand how P-type ATPase ion pumps—an important class of membrane transport proteins—operate. Over the past decade, Roux and his collaborators, Avisek Das, Mikolai Fajer, and Yilin Meng, have been developing new computational approaches to simulate virtual models of biomolecular systems with unprecedented accuracy.

The team exploits state-of-the-art developments in molecular dynamics (MD) and protein modeling. The MD simulation approach, frequently used in computational physics and chemistry, calculates the motions of all the atoms in a given molecular system over time—information that's impossible to access experimentally. In biology, large-scale MD simulations provide a perspective to understand how a biologically important molecular machine functions.

For several years, Roux's research has been focused on the membrane proteins that control the bidirectional flow of material and information in a cell. Now, in a major breakthrough, he and his team have described, in atomic detail, the complete transport cycle of a large calcium pump called sarco/endoplasmic reticulum calcium ATPase, or SERCA, which plays an important role in normal muscle contraction. This membrane protein uses the energy from ATP hydrolysis to transport calcium ions against their concentration gradient and, importantly, its malfunction causes cardiac and skeletal muscle diseases.

Roux and his team wanted to understand how SERCA functions in a membrane, so they set out to build a complete atomistic picture of the pump in action. Das, a postdoctoral research fellow in Roux's lab, did this by obtaining all the transition pathways for the entire ion transport cycle using an approach called the string method—essentially capturing a “molecular movie” of the transport process, frame by frame, of how different protein components and parts within the proteins communicated with each other (see Figure 1).

The Science

A membrane protein, like all protein molecules, consists of a long chain of amino acids. Once fully formed, it folds into a highly specific conformation that enables it to perform its biological function. Membrane proteins change shape and go through many conformational “states” to perform their functions.

“From a scientific standpoint, membrane proteins such as the calcium pump are very interesting because they undergo complex changes in their three-dimensional conformations,” said Roux. “Ultimately, a better understanding may have a great impact on human health.”

Experimentalists understand the structural details of proteins' stable conformational states but very little about the process by which a protein changes from one conformational state to another. “Only computer simulation can explore the interactions that occur during these structural transitions,” said Roux.

Intermediate conformations along these transitions could potentially provide the essential information needed for the design of novel therapeutic agents. (Drugs are essentially molecules that counteract the effect of bad mutations to help recover the normal functions of the protein.) Because membrane proteins regulate many aspects of cell physiology, they can serve as possible diagnostic tools or therapeutic targets.

Roux and his team are trying to obtain detailed knowledge about all the relevant conformational states that occur during SERCA's transport cycle. In years one and two of the study, Roux's team identified two of the conformation transition pathways needed to describe the cycle. Last year, the project shifted focus to the three remaining pathways.

The ALCF Advantage

As is the case for much of the domain science research being conducted on DOE leadership supercomputer systems today, biomolecular science relies on advances in methodology as well as in software and hardware technologies. The usefulness of Roux's simulations hinges on the accuracy of the modeling parameters and on the efficiency of the MD algorithm enabling the adequate sampling of motions.

Computational science teams can spend years refining their application code to do what they need it to do, which is often to simulate a particular physical phenomenon at the necessary space and time scales. Code advancements can push the simulation capabilities and take advantage of the machine's features, such as high processor counts or advanced chips, to evolve the system for longer and longer periods of time.

Roux and his team used a premier MD simulation code, called NAMD, that combines two advanced algorithms—the swarm-of-trajectory string method and multidimensional umbrella sampling.

NAMD, which was first developed at the University of Illinois at Urbana-Champaign by Klaus Schulten and Laxmikant Kale, is a program used to carry out classical simulations of biomolecular systems. It's based on the Charm++ parallel programming system and runtime library, which provides infrastructure for implementing highly scalable parallel applications. When combined with a machine-specific communication library (such as PAMI, available on Blue Gene/Q), the string method can achieve extreme scalability on leadership-class supercomputers.

ALCF staff provided maintenance and support for NAMD software and helped coordinate and monitor the jobs running on Mira, ALCF's 10-Pflops IBM Blue Gene/Q.

ALCF computational scientist Wei Jiang has been actively collaborating with Roux's team since 2012, as part of Mira's Early Science Program. Jiang worked with IBM's system software team on early stage porting and optimization of NAMD on the Blue Gene/Q architecture. He's also one of the core developers of NAMD's multiple copy algorithm, which is the foundation for multiple INCITE projects that use NAMD.

Jiang, who has a background in computational biology, considers the recent work a significant breakthrough.

Figure 1. Interaction of cytoplasmic domains in the calcium pump of sarcoplasmic reticulum. These six states (E1, E1–2Ca2+–ATP, E1P–2Ca2+–ADP, E2, E2–Pi, and E2P) have been structurally characterized and represent important intermediates along the reaction cycle. The blue domain, shown in surface representation, is called the phosphorylation domain (P). The red and green domains, shown as C traces, are called actuator (A) and nucleotide binding (N) domains, respectively. The red and green patches in the P domain are interacting with residues in the A and N domains, respectively. Two residues are considered to be in contact if at least one pair of non-hydrogen atoms is within 4 Å of each other. (Image: Avisek Das, University of Chicago, used with permission.)
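The contact criterion quoted in the caption is simple to state computationally. As a hedged illustration only (the coordinate arrays res_a and res_b are hypothetical inputs, for example the heavy-atom coordinates of two residues taken from a single simulation frame), it amounts to:

import numpy as np

def in_contact(res_a, res_b, cutoff=4.0):
    """True if any pair of heavy atoms from the two residues is within `cutoff` angstroms."""
    diff = res_a[:, None, :] - res_b[None, :, :]   # all pairwise displacement vectors
    dist = np.sqrt((diff ** 2).sum(axis=-1))       # pairwise distances in angstroms
    return bool((dist < cutoff).any())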


“Only in the third year of the project did we begin to see real progress,” he said. “The first and second year of an INCITE project is often accumulated experience.”

The computations Roux and his team ran for this breakthrough work will serve as a roadmap for simulating and visualizing the basic mechanisms of biomolecular systems going forward. By studying experimentally well-characterized systems of increasing size and complexity within a unified theoretical framework, Roux's approach offers a new route for addressing fundamental biological questions.

Acknowledgments

An award of computer time was provided by the US Department of Energy's Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program. This research used resources of the Argonne Leadership Computing Facility, which is a US DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

Laura Wolf is a science writer and editor for Argonne National Laboratory. Her interests include science communication, supercomputing, and new media art. Wolf received a BA in political science from the University of Cincinnati and an MA in journalism from Columbia College Chicago. Contact her at [email protected].



VISUALIZATION CORNER


Editors: Joao Comba, UFRGS, [email protected], and Daniel Weiskopf

Beyond the Third Dimension: Visualizing High-Dimensional Data with Projections

Renato R.O. da Silva | University of São Paulo, Brazil
Paulo E. Rauber and Alexandru C. Telea | University of Groningen, The Netherlands

Many application fields produce large amounts of multidimensional data. Simply put, these are datasets where, for each measurement point (also called data point, record, sample, observation, or instance), we can measure many properties of the underlying phenomenon. The resulting measurement values for all data points are usually called variables, dimensions, or attributes. A multidimensional dataset can thus be described as an n × m data table having n rows (one per observation) and m columns (one per dimension). When m is larger than roughly 5, such data is called high-dimensional. Such datasets are common in engineering (think of manufacturing specifications, quality assurance, and simulation or process control); medical sciences and e-government (think of electronic patient dossiers [EPDs] or tax office records); and business intelligence (think of large tables in databases).

While storing multidimensional data is easy, understanding it is not. The challenge lies not so much in having a large number of observations but in having a large number of dimensions. Consider, for instance, two datasets A and B. Dataset A contains 1,000 samples of a single attribute, say, the birthdates of 1,000 patients in an EPD. Dataset B contains 100 samples of 10 attributes, say, the amounts of 10 different drugs distributed to 100 patients. The total number of measurements in the two datasets is the same (1,000). Yet, understanding dataset A is quite easy, and it typically involves displaying either a (sorted) bar chart of its single variable or a histogram showing the patients' age distribution. In contrast, understanding dataset B can be very hard—for example, it might be necessary to examine the correlations between every pair of the 10 available dimensions.

In this article, we discuss projections, a particular type of tool that allows the efficient and effective visual analysis of multidimensional datasets. Projections have become increasingly interesting and important tools for the visual exploration of high-dimensional data. Compared to other techniques, they scale well in the number of observations and dimensions, are intuitive, and can be used with minimal effort. However, they need to be complemented by additional visual mechanisms to be of maximal added value. Also, as they've been originally developed in more formal communities, they're less known or accessible to mainstream scientists and engineers. We provide here a compact overview of how to use projections to understand high-dimensional data, present a classification of projection techniques, and discuss ways to visualize projections. We also comment on the advantages of projections as opposed to other visualization techniques for multidimensional data, and illustrate their added value in a complex visual analytics workflow for machine learning applications in medical science.

Exploring High-Dimensional Data

Before outlining solutions for exploring high-dimensional data, we need to outline the typical tasks that must be performed during such exploration. These can be classified into observation-centric tasks (which address questions focusing on observations) and dimension-centric tasks (which address questions focusing on the dimensions). Observation-centric tasks include finding groups of similar observations and finding outliers (observations that are very different from the rest of the data). Dimension-centric tasks include finding sets of dimensions that are strongly correlated and dimensions that are mutually independent. There are also tasks that combine observations and dimensions, such as finding which dimensions make a given group of observations different from the rest of the data. Several visual solutions exist to address (parts of) these tasks, as follows. More details on these and other visualization techniques for high-dimensional data appear elsewhere.1,2

Tables

Probably the simplest method is to display the entire dataset as an n × m table, as we do in a spreadsheet. Sorting rows on the values in a given column lets us find observations with minimal or maximal values for that column and then read all their dimensions horizontally in a row. Visually scanning a sorted column lets us see the distribution of values of a given dimension.

But while spreadsheet views are good for showing detailed information, they don't scale to datasets having thousands of observations and tens of dimensions or more. To address such scalability, table lenses refine the spreadsheet idea: they work much like zooming out of the drawing of a large table, thereby reducing every row to a row of pixels. Rather than showing the actual textual cell content, cell values are now drawn as horizontal pixel bars colored and scaled to reflect data values. As such, columns are effectively reduced to bar graphs. Using sorting, we can now view the variation of dimension values for much larger datasets. However, reasoning about the correlation of different dimensions isn't easy using table lenses.
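As a rough sketch of the table-lens idea in Python (the data here is random and purely illustrative, and a real table lens adds interaction on top of this), one can sort a table on one column, normalize each column, and draw every cell as a thin colored bar so that thousands of rows fit on screen:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
table = rng.normal(size=(2000, 6))                        # 2,000 observations, 6 dimensions
table = table[np.argsort(table[:, 0])]                    # sort rows on dimension 0
norm = (table - table.min(axis=0)) / (table.max(axis=0) - table.min(axis=0))

plt.imshow(norm, aspect="auto", interpolation="nearest", cmap="viridis")
plt.xlabel("dimension")
plt.ylabel("observation (sorted on dimension 0)")
plt.show()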

Scatterplots

Another well-known visualization technique for multidimensional data is the scatterplot, which shows the distribution of all observations with respect to two chosen dimensions i and j. Finding correlations, correlation strengths, and the overall distribution of data values is now easy. To do this for m dimensions, a so-called m × m scatterplot matrix can be drawn, showing the correlation of each dimension i with each other dimension j. However, reasoning about observations is hard now—an observation is basically a set of m² points, one in each scatterplot in the matrix. Also, scatterplot matrices don't scale well for datasets having more than roughly 8 to 10 dimensions.
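A minimal scatterplot-matrix example in Python (the small Iris dataset is used here only as a stand-in for any modest table with four dimensions) might look as follows:

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)   # 150 observations, 4 dimensions
scatter_matrix(df, figsize=(8, 8), diagonal="hist")         # one scatterplot per dimension pair
plt.show()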

Parallel Coordinates

A third solution for visualizing multidimensional data is parallel coordinates. Here, each dimension is shown as a vertical axis, thus the name parallel coordinates. Each observation is shown as a fractured line that connects the m points along these axes corresponding to its values in all the m dimensions. Correlations of dimensions (shown by adjacent axes) can now be spotted as bundles of parallel line segments; inverse correlations are shown by a typical x-shaped line-crossing pattern. Yet, parallel coordinates don't scale well beyond 10 to 15 dimensions. Also, they might require careful ordering of the axes to bring dimensions that one wants to compare close to each other in the plot.
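A comparable parallel-coordinates sketch (again using the Iris data purely as an illustrative stand-in) is equally short:

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = [iris.target_names[t] for t in iris.target]
parallel_coordinates(df, "species", alpha=0.4)   # one line per observation, one axis per dimension
plt.show()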

Multidimensional Projections

Projections take a very different approach to visualizing high-dimensional data. Think of the n data points in an m-dimensional space. The dataset can then be conceptually seen as a point cloud in this space. If we could see in m dimensions, we could then (easily) find outliers as those points that are far from all other points in the cloud and find important groups of similar observations as dense and compact regions in the point cloud.

However, we can't see in more than three dimensions. Note also that a key ingredient of performing the above-mentioned tasks is reasoning in terms of distances between the points in m dimensions. Hence, if we could somehow map, or project, our point cloud from m to two or three dimensions, keeping the distances between point pairs, we could do the same tasks by looking at a 2D or 3D scatterplot. Projections perform precisely this operation, as illustrated by Figure 1. Intuitively, they can be thought of as reducing the unnecessary dimensionality of the data (the original m dimensions), keeping the inherent dimensionality (that which encodes distances, or similarities, between points). Additionally, we can color-code the projected points by the values of one dimension, to get extra insights.

There are two main use cases for projections. The first is to reduce the number of dimensions by keeping only one dimension from a set of dimensions which are strongly correlated, or by dropping dimensions along which the data has a very low variance. Essentially, this preserves patterns in the data (clusters, outliers) but makes its usage simpler, as there are fewer dimensions to consider next. The simplified dataset can next be used instead of the original one in various processing or analysis tasks. The second use case involves reducing the number of dimensions to two or three, so that we can visually explore the reduced dataset. In contrast to the first case, this usually isn't done by dropping dimensions but by creating two or three synthetic dimensions along which the data structure is best preserved. We next focus on this latter use case.
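As a minimal sketch of this second use case (the Iris table again stands in for any n × m dataset, and PCA is just one of the many possible projection techniques discussed next), projecting to two synthetic dimensions and color-coding the points by one original dimension takes only a few lines of Python:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                          # n = 150 observations, m = 4 dimensions
xy = PCA(n_components=2).fit_transform(X)     # two synthetic dimensions
plt.scatter(xy[:, 0], xy[:, 1], c=X[:, 2], cmap="viridis", s=15)
plt.colorbar(label="value of one original dimension (petal length)")
plt.show()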

Projection Techniques

Many different techniques exist to create a 2D or 3D projection, and they can be classified according to several criteria, as follows.

Dimension versus distance. The dimension versus distance classification looks at the type of information used to construct a projection. Distance-based methods use only the distances, or similarities, between m-dimensional observations. Typical distances here are Euclidean and cosine; thus, the projection algorithm's input is an n × n distance matrix between all observation pairs. Such methods are also known as multidimensional scaling (MDS) because they intuitively scale the m-dimensional distances to 2D distances. Technically, this is done by optimizing a function that minimizes the so-called aggregated normalized stress, or summed difference between the inter-point distances in m dimensions and in 2D, respectively.

Figure 1. From a multivariate data table to a projection. Projections can be thought of as reducing the unnecessary dimensionality of the data (the original m dimensions) while keeping the inherent dimensionality (that which encodes distances, or similarities, between points). (Figure annotations: a table row gets mapped to a point; 2D point distance reflects nD row distance; color maps the values of a selected column.)


The main advantage of MDS methods is that they don't require the original dimensions—a dissimilarity matrix between observations is sufficient, which is extremely useful in cases where we can measure the similarities in some data collections but don't precisely know which attributes (dimensions) explain those similarities. The main disadvantage of MDS methods is that they require storing (and analyzing) an n × n distance matrix. For n being tens of thousands of observations, this can be very expensive.3 Several MDS refinements have been proposed, such as ISOMAP,4 Pivot MDS,5 and Fastmap,6 which can compute projections in (near) linear time in the number of observations.
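A hedged sketch of the distance-based route (metric MDS as implemented in scikit-learn, fed only an n × n Euclidean distance matrix that is computed here just for illustration) looks like this:

from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS
from sklearn.datasets import load_iris

X = load_iris().data
D = squareform(pdist(X))                                        # n x n dissimilarity matrix
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
xy = mds.fit_transform(D)                                       # 2D layout approximating D
print("stress of the layout:", mds.stress_)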

In contrast, dimension-based methods use as input the actual m dimensions of all observations. For datasets having many more observations than dimensions (n much larger than m), this gives considerable savings. However, we now need to have access to the original dimension values. Arguably the best-known method in this class is principal component analysis (PCA), whose variations are also known under the names of singular value decomposition (SVD) or Karhunen-Loève transform (KLT).7 Intuitively put, the idea of 2D PCA is to find the plane, in m dimensions, on which the projections of the n observations have the largest spread. Visualizing these 2D projections will then give us a good way of understanding the actual variance of the data in m dimensions.8

While simple and fast, PCA-based methods work well only if the observations are distributed close to a planar surface in m dimensions. To understand this, consider a set of observations uniformly distributed on the surface of the Earth (a ball in 3D). When projecting these, PCA will effectively squash the ball to a planar disk, projecting diametrically opposed observations on the ball's surface to the same location, meaning the projection won't preserve distances. What we actually want is a projection that acts much as a map construction process, where the Earth's surface is unfolded to a plane, with minimal distortions.

Global versus local. The global versus local classification looks at the type of operation used to construct a projection. Global methods define a single mapping, which is then applied for all observations. MDS and PCA methods fall in this class. The main disadvantage of global methods is that it can be very hard to find a single function that optimally preserves the distances of a complex dataset when projecting it (as in the Earth projection example). Another disadvantage is that computing such a global mapping can be expensive (as in the case of classical MDS). Local methods address both these issues, selecting a (small) subset of observations, called representatives, from the initial dataset and then projecting these by using a high-accuracy method. This isn't expensive, as the number of representatives is small. Finally, the remaining observations close to each representative are fit around the position of the representative's projection. This is cheaper, simpler, and also more accurate than using a global technique. Intuitively, think of our Earth example as splitting the ball surface into several small patches and projecting these to 2D. When such patches have low curvature, fitting them to a 2D surface is easier than if we were to project the entire ball at once. Good local methods include PLMP9 and LAMP.10 Using representatives has another added value: users can arrange these as desired in 2D, thereby controlling the projection's overall shape with little effort.

Distance versus neighborhood preserving. A final classification looks into what a projection aims to preserve. When it's important to accurately assess the similarity of points, distance preservation is preferred. All projection techniques listed above fall into this class. However, as we've seen, getting a good distance preservation for all points can be hard. When the number of dimensions is very high, the Euclidean (straight-line) distances between all point pairs in a dataset tend to become very similar, so accurately preserving such distances has less value. In such cases, it's often better to preserve neighborhoods in a projection—this way, the projection can still be used to reason about the groups and outliers existing in the high-dimensional dataset. Actually, the depiction of groups could get even clearer because the projection algorithm has more freedom to place observations in 2D, as long as the nearest neighbors of a point in 2D are the same as those of the same point in m dimensions. The best-known method in this class is t-distributed stochastic neighbor embedding (t-SNE), which is used in many applications in machine learning, pattern recognition, and data mining, and has a readily usable implementation (https://lvdmaaten.github.io/tsne).
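A minimal t-SNE example (the handwritten-digits dataset, with 1,797 observations and 64 dimensions, is used here only as a convenient stand-in, and the perplexity value is an assumption one would normally tune) is:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()                                    # 1,797 observations, 64 dimensions
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(digits.data)
plt.scatter(xy[:, 0], xy[:, 1], c=digits.target, cmap="tab10", s=8)
plt.show()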

Type of data. Most projection methods handle quantitative dimensions, whose values are typically continuously varying over some interval. Examples are temperature, time duration, speed, volume, or financial transaction values. However, projection techniques such as multiple correspondence analysis (MCA) can also handle categorical data (types) or mixed datasets of quantitative and categorical data. A good description of MCA and related techniques is given by Greenacre.11


The Projection Explorer is a very good place to start working with projections in practice.12 This tool implements a wide range of state-of-the-art projection techniques that can handle hundreds of thousands of observations with hundreds of dimensions and provides several visualizations to interactively customize and explore projections. The tool is freely downloadable from http://infoserver.lcad.icmc.usp.br/infovis2/Tools.

Visualizing Projections

The simplest and most widespread way to visualize a projection is to draw it as a scatterplot. Here, each point represents an observation, and the 2D distance between points reflects the similarities of the observations in m dimensions. Points can also be annotated with color, labels, or even thumbnails to explain several of their dimensions.

Figure 2a shows this for a dataset where observations are images. The projection shows image thumbnails, organized by similarity. We can easily see here that our image collection is split into two large groups; we can get more insight into the composition of the groups by looking at the thumbnails.

However, in many cases, there's no easy way to draw a small thumbnail-like depiction of all the m attributes of an observation. Projections will then show us groups and outliers, but how do we explain what these mean? In other words, how do we put the dimension information back into the picture? Without this, the added value of a projection is limited.

Figure 2. Projection visualizations with (a) thumbnails, (b) biplot axes, (c) and (d) axis legends, and (e) key local dimensions. (Panel annotations include the biplot axes 0 through 8 with their minimum and maximum extremities; the x, y, and error axis legends; the dimension selected for color mapping (gender, male/female); and, for the astrophysical dataset, variables such as the He+, He++, and H– mass abundances and the spike outlier.)


There are several ways of explaining projections. By far the simplest, and most common, is to color code the projection points by the value of a user-chosen dimension. If we next see strong color correlations with different point groups in the projection, we can explain these in terms of the selected dimension's specific values or value ranges. However, if we have tens of dimensions, using each one to color code the projection is tedious at best. Moreover, it could be that no single dimension can explain why certain observations are similar. Tooltips can be shown at user-chosen points, which does a good job explaining a few outliers one by one, but it doesn't work if we want to explain a large number of points together.

One early way to explain projections is to draw so-called biplot axes.13 For PCA projections and variants, these lines indicate, in the 2D space, the directions of maximal variation of all m dimensions. Intuitively put, biplot axes generalize the concept of a scatterplot, where we can read the values of two dimensions along the x and y axes, to the case where we have m dimensions. Moreover, strongly correlated dimensions appear as nearly parallel axes, and independent dimensions appear as nearly orthogonal axes. Finally, the relative lengths of the axes indicate the relative variation of the respective dimensions. Biplots can also be easily constructed for any other projection, including 3D projections that generate a 3D point cloud rather than a 2D scatterplot.14 In such cases, the biplot axes need not be straight lines. Figure 2b shows an example of biplot axes for a dataset containing 2,814 abstracts of scientific papers. Each observation (abstract) has nine dimensions, indicating the frequencies of the nine most used technical terms in all abstracts. The projection, created using a force-based technique, places points close to each other if the respective abstracts are similar. Labels can be added to the axes to tell their identity and also indicate their signs (extremities associated to minimum and maximum values). The curvature of the biplot axes tells us that the projection is highly nonlinear—intuitively, we can think that the nine-dimensional space gets distorted when squashed into the resulting 3D space. This is undesirable because reading the values of the dimensions along such curved axes is hard.
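For the simplest (linear) case, biplot axes are easy to reproduce; the sketch below is a generic PCA biplot in Python (the Iris data and the visual scale factor are illustrative assumptions, and curved biplot axes for nonlinear projections require dedicated tools instead):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=2)
xy = pca.fit_transform(iris.data)

plt.scatter(xy[:, 0], xy[:, 1], s=10, alpha=0.5)
scale = 3.0                                              # purely visual scaling of the axis arrows
for i, name in enumerate(iris.feature_names):
    dx, dy = pca.components_[0, i], pca.components_[1, i]   # direction of dimension i in 2D
    plt.arrow(0, 0, scale * dx, scale * dy, color="red", head_width=0.05)
    plt.text(scale * dx * 1.15, scale * dy * 1.15, name, fontsize=8)
plt.show()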

Still, interpreting biplot axes can be challenging, especially when we have 10 or more variables, as we get too many lines drawn in the plot. Moreover, most users are accustomed to interpreting a point cloud as a Cartesian scatterplot—that is, they want to know what the horizontal (x) and vertical (y) axes of the plot mean. For a projection, this isn't easy because these axes don't straightforwardly map to data dimensions but, rather, to combinations of dimensions. Luckily, we can compute the contribution of each of the original m data dimensions to the spread of points along the projection's x and y axes. Next, we can visualize these contributions by standard bar charts (see Figure 2c): for each dimension, the x and y axis legends show a bar indicating how much that dimension is visible on the x and y axes. Long bars, thus, indicate dimensions that strongly contribute to the spread of points along the horizontal and vertical directions. Figure 2c shows how this works: the dataset contains 583 patient records, each having 10 dimensions describing patients' gender, age, and eight blood measurements. The projection shows two clusters placed side by side.

How do we explain these? In the x axis legend, we see a tall orange bar, which tells us that this dimension (gender) is strongly responsible for the points' horizontal spread. If we color the points by their gender value, we see that, indeed, gender explains the clusters. Axis legends can also be used for 3D projections, as in Figure 2d, which shows a 3D projection of a 200,000-sample dataset with 10 dimensions coming from a simulation describing the formation of the early universe.14 As we rotate the 3D projection, the bars in the axis legends change lengths and are sorted from longest to shortest, indicating the best-visible dimensions from a given viewpoint (dimensions 5 and 7, in our case). A third legend (Figure 2d, top right) shows which dimensions we can't see well in the projection from the current viewpoint. These dimensions vary strongly along the viewing direction, so we shouldn't use the current viewpoint to reason about them.

Biplot axes can also be inspected to get more detail. For example, we see that the projection's saddle shape is mainly caused by variable 7 and that the spike outlier is caused by a combination of dimensions 5 and 6. This interactive viewpoint manipulation of 3D projections effectively lets us create an infinite set of 2D scatterplot-like visualizations on the fly. Both biplot axes and axis legends explain a projection globally. If well-separated groups of points are visible, we can't directly tell which variables are responsible for their appearance without visually correlating the groups' positions with annotations, which can be tedious. Local explanations address this by explicitly splitting the projection into groups of points that admit a single (simple) explanation, depicting this explanation atop the groups. Figure 2e shows this for a dataset containing 6,773 open source software projects, each having 11 quality metrics, along with their download count.15

The projection, constructed with LAMP, shows a concave shape but no clearly separated clusters.

Let's consider next every projected point and several of its close neighbors—that is, a small circular patch of projected points. Because these points are close in the projection, they should also be similar in m dimensions. We can analyze these points to find which dimension is most likely responsible for their similarity. By doing this for all points in turn, we can rank all m dimensions by the number of points whose neighborhoods they explain. If we color code points by their best-explaining dimension, the projection naturally splits into several clusters. We can next add labels with the names of their explaining dimensions. Finally, we can tune the points' brightness to show how much of a point's similarity is explained by the single selected dimension. In Figure 2e, we see, for instance, that the lines-of-code metric (purple) explains two clusters of points—by interactive brushing, we can find that one contains small software projects and the other has large software projects. The bright-to-dark color gradient shows how it's increasingly hard to explain a point's similarity with its neighbors once we approach the cluster border, that is, the place where another dimension becomes key to explaining local similarity. Doing this visual partitioning of the projection into groups explained by dimensions would have been hard using global methods only, such as biplot axes or axis legends. Besides explaining groups in a projection via single dimensions, we can also use tag clouds to show the names of several dimensions.16
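A simplified proxy for such local explanations (this is only a sketch of the general idea, not the exact measure used in the cited work, and the neighborhood size k is an assumption) labels every projected point with the dimension that varies least, relative to its global variance, inside the point's 2D neighborhood:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def best_explaining_dimension(X, xy, k=15):
    """X: (n, m) original data; xy: (n, 2) projected points. Returns one dimension index per point."""
    X = np.asarray(X, dtype=float)
    global_var = X.var(axis=0) + 1e-12                        # avoid division by zero
    _, idx = NearestNeighbors(n_neighbors=k).fit(xy).kneighbors(xy)
    local_var = X[idx].var(axis=1)                            # per-dimension variance in each 2D patch
    return np.argmin(local_var / global_var, axis=1)          # most homogeneous dimension per point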

Interpreting Projections

As already explained, projections can be used as visual proxies of high-dimensional spaces that enable reasoning about a dataset's structure. For this to work, however, a projection should faithfully preserve those elements of the data structure that are important for the task at hand. As such, before using a projection, it's essential to check its quality.

The easiest way to do this is to compute the aggregated normalized stress. Low values of this stress tell us that the projection preserves distances well. However, if this single figure indicates low quality, we don't know what that precisely means or which observations are affected. More insight can be obtained by showing scatterplots of the original distances in m dimensions versus the distances in the projection. Figure 3a illustrates this for several datasets and projection techniques.10 The ideal projection behavior is shown with red diagonal lines; the figures in each scatterplot show the aggregated normalized stress, telling us that LAMP is generally better than the other two studied projections. Yet, we don't know what this means precisely. Looking at the scatterplots' deviations from the red diagonals, we get more insight into the nature of the errors: points under the diagonal tell us that original distances are underestimated in the projection, that is, that the projection compressed the data. Note that this is quite a typical phenomenon: projections have to embed points in a much lower-dimensional space, so crowding is very likely to occur. For the isolet dataset, we see, for example, that small 2D distances can correspond to a wider range of high-dimensional distances than large 2D distances, so close points in a projection may or may not be that close in m dimensions. For the viscontest dataset, we see that LAMP has a constant spread around the diagonal, indicating a uniform error distribution for all distance ranges. In contrast, Glimmer shows a much worse error distribution.
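Computing the aggregated normalized stress for a given projection is straightforward; a minimal sketch (assuming X holds the original n × m data table and xy the projected 2D coordinates) is:

import numpy as np
from scipy.spatial.distance import pdist

def normalized_stress(X, xy):
    d_high = pdist(X)                     # pairwise distances in the original m dimensions
    d_low = pdist(xy)                     # pairwise distances in the 2D projection
    return np.sum((d_high - d_low) ** 2) / np.sum(d_high ** 2)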

While useful to reason about distances, such scatterplots don't tell us where in the projection we have errors. For this, we can use observation-centric error metrics.17 The aggregate error shows the normalized stress, aggregated per point rather than for all points. Figure 3b shows this for a projection created with LAMP. As we see, the projection overall is of good quality, with the exception of four small hot spots. Figure 3c shows errors created by false neighbors—that is, points close in 2D but far in m dimensions, or zones where the projection compressed the high-dimensional space. We see here only three hot spots, meaning that the fourth one in Figure 3b wasn't caused by false neighbors. Figure 3d shows errors created by missing neighbors—that is, points close in m dimensions but far in 2D. The missing neighbors of the selected point of interest are connected by lines, which are bundled to simplify the image. The discrepancy between the 2D and original distances is also color coded on the points themselves. In this image, we see that the missing neighbors of the selected point are quite well localized on the other side of the projection. This typically happens when a closed surface in m dimensions is split by the projection to be embedded in 2D. Finally, Figure 3e shows, for a selected group of points, all the points that are closer in m dimensions to a point in the group than to any other point but closer to points outside that group in 2D. This lets us easily see if groups that appear in the projection are indeed complete or if they actually miss members.

Using Projections in Visual Analytics Workflows

So far, we've shown how we can construct projections, check their quality, and visually annotate them to explain the contained patterns. But how are projections used in complex visual analytics workflows? The most common way is to visually explore them while searching for groups, and when such groups appear, to use tools like the ones presented so far to explain them in terms of dimensions and dimension values.2 This is often done in data mining and machine learning.

We illustrate this with a visual analytics work-flow for building classifiers for medical diagnosis.18

The advent of low-cost, high-accuracy imaging devices has enabled both doctors and the public to generate large collections of skin lesion images. Dermatologists want to automatically classify these into benign (moles) and potentially malignant (melanoma), so they can focus their precious time on analyzing the latter.

Figure 3. Projection visualized with (a) distance-centric methods and (b) through (e) observation-centric methods. The ideal projection behavior is shown with red diagonal lines. Figures in each scatterplot show the aggregated normalized stress, telling us that LAMP is generally better than the other two studied projections.

[Figure 3 panels: (a) scatterplots of projected distance (2 dimensions) versus original distance (m dimensions) for the viscontest, isolet, and wdbc datasets under the LAMP, Glimmer, and PLMP projections, annotated with their aggregated normalized stress values; (b) aggregate per-point error; (c) false neighbors; (d) missing neighbors of a selected point; (e) a selected group and its missing members.]


For this, image classifiers can be used: each skin image is described in terms of several dimensions, or features, such as color histograms, edge densities and orientations, texture patterns, and pigmentation. Next, dermatologists manually label a training dataset of images as benign or malignant and use it to train a classifier so that it becomes able to label new images. Other applications of machine learning include algorithm optimization, designing search engines, and predicting software quality.
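A toy, hypothetical version of this feature-extraction-plus-training step might look as follows. The images are random placeholders, and the two features (a per-channel color histogram and a crude edge-density estimate) only stand in for the richer dermatology-specific features a real system would use.

# A minimal, hypothetical sketch of the feature-extraction step described above:
# each image becomes a row of features, and a labeled subset trains a classifier.
# Images and labels are random placeholders, so the accuracy is near chance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
images = rng.random((200, 64, 64, 3))           # placeholder RGB images
labels = rng.integers(0, 2, size=200)           # 0 = benign, 1 = malignant (toy labels)

def extract_features(img):
    hist = np.concatenate([np.histogram(img[..., c], bins=8, range=(0, 1))[0]
                           for c in range(3)])  # per-channel color histogram
    gray = img.mean(axis=2)
    gy, gx = np.gradient(gray)
    edge_density = np.mean(np.hypot(gx, gy))    # crude edge-density feature
    return np.concatenate([hist / hist.sum(), [edge_density]])

X = np.array([extract_features(im) for im in images])   # observations x features
clf = RandomForestClassifier(random_state=0).fit(X[:150], labels[:150])
print("held-out accuracy:", clf.score(X[150:], labels[150:]))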

Designing good classifiers is a long-standing problem in machine learning and is often referred to as the “black art” of classifier design.19

The problem has several facets: understanding which features are discriminative, understanding which observations are hard to classify and why, and selecting and designing features to improve classification accuracy. Projections can help with all these tasks, via the workflow in Figure 4. Given a set of input observations, we first extract features that are typically known to capture their essence (step 1). This yields a high-dimensional data table with observations as rows and features as columns. We also construct a small training set by manual labeling. Next, we want to determine how easy the classification problem ahead of us will be. For this, we project the training set and color observations by their class labels (step 2). If the classes we wish to recognize are badly separated, it makes little sense to spend energy on designing and testing a classifier, since we seem to have a poor feature choice (step 4). We can then interactively select the desired class groups in the projection and see which features discriminate them best,18 repeating the cycle with a different feature subset (step 5). If, however, classes are well separated in the projection (step 3), our features discriminate them well, so the classification task isn’t too hard. We then proceed to design, train, and test the classifier (step 6). If the classifier yields good performance, we’re done: we have a production-ready system (step 7). If not, we can again use projections to see which observations are badly classified (step 8) and which features are responsible for this (step 9), and then engineer new features that separate these better (step 10). In this workflow, projections serve two key tasks: predicting the ease of building a good classifier ahead of its actual construction (T1), thereby saving us from designing a classifier with unsuitable features, and showing which observations are misclassified, along with their feature values (T2), thereby helping us design better features in a targeted way.
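A minimal sketch of step 2 and task T1 follows: project the labeled training set to 2D and inspect class separation before investing in classifier design. The breast-cancer dataset, the t-SNE projection, and the silhouette score (used here only as a rough numeric proxy for what the analyst judges visually) are illustrative assumptions rather than parts of the original workflow.

# A hedged sketch of step 2 / task T1: project the labeled training set and
# inspect class separation before designing a classifier. The silhouette score
# is a rough numeric proxy for the visual judgment of separability.
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()                      # stand-in for a labeled feature table
X2 = TSNE(n_components=2, random_state=0).fit_transform(
    StandardScaler().fit_transform(data.data))

print("2D silhouette:", silhouette_score(X2, data.target))   # higher = better separated
plt.scatter(X2[:, 0], X2[:, 1], c=data.target, cmap="coolwarm", s=8)
plt.title("Training set projected and colored by class label")
plt.show()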

Figure 4. Using projections to build and refine classifiers in supervised machine learning.

[Figure 4 workflow diagram. Stages: input objects; feature extraction; features; training and validation data; projection; classifier design; classifier testing; classification system ready; use in production. Decision points: good separation? or bad separation? after projecting the training data, and good performance? or too low performance? after testing. Feedback paths: iterative feature selection with a new feature subset, feature set redesign, feature-versus-observation study, studying the problem causes, and repeating the cycle with newly designed features. Steps are numbered 1 through 10, and T1 and T2 mark the two key tasks supported by projections.]


Projections are the new emerging instrument for the visual exploration of large high-dimensional datasets. Complemented by suitable visual explanations, they’re intuitive, easy to use, visually compact, and easy to learn for users familiar with scatterplots. Recent technical developments allow them to be computed automatically from large datasets in seconds, without requiring users to tune complex parameter settings or understand the underlying technicalities. As such, they’re part of the visual data scientist’s kit of indispensable tools.

But as projections become increasingly useful and usable, several new challenges have emerged. Users require new ways to manipulate a projection to improve its quality in specific areas, so as to obtain the best-tuned results for their datasets and problems. Developers require consolidated implementations of projections that would let them integrate these techniques into commercial-grade applications such as Tableau. And last but not least, users and scientists require more examples of workflows showing how projections can be used in visual analytics sensemaking to solve problems in increasingly diverse application areas.

References

1. S. Liu et al., “Visualizing High-Dimensional Data: Advances in the Past Decade,” Proc. EuroVis–STARs, 2015, pp. 127–147.

2. C. Sorzano, J. Vargas, and A. Pascual-Montano, “A Survey of Dimensionality Reduction Techniques,” 2014; http://arxiv.org/pdf/1403.2877.

3. W.S. Torgerson, “Multidimensional Scaling of Similarity,” Psychometrika, vol. 30, no. 4, 1965, pp. 379–393.

4. J.B. Tenenbaum, V. de Silva, and J.C. Langford, “A Global Geometric Framework for Nonlinear Dimensionality Reduction,” Science, vol. 290, no. 5500, 2000, pp. 2319–2323.

5. U. Brandes and C. Pich, “Eigensolver Methods for Progressive Multidimensional Scaling of Large Data,” Proc. Graph Drawing, Springer, 2007, pp. 42–53.

6. C. Faloutsos and K.-I. Lin, “FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets,” SIGMOD Record, vol. 24, no. 2, 1995, pp. 163–174.

7. K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, 1990.

8. I.T. Jolliffe, Principal Component Analysis, Springer, 2002, p. 487.

9. F.V. Paulovich, C.T. Silva, and L.G. Nonato, “Two-Phase Mapping for Projecting Massive Data Sets,” IEEE Trans. Visual Computer Graphics, vol. 16, no. 6, 2010, pp. 1281–1290.

10. P. Joia et al., “Local Affine Multidimensional Projection,” IEEE Trans. Visual Computer Graphics, vol. 17, no. 12, 2011, pp. 2563–2571.

11. M. Greenacre, Correspondence Analysis in Practice, 2nd ed., CRC Press, 2007.

12. P. Pagliosa et al., “Projection Inspector: Assessment and Synthesis of Multidimensional Projections,” Neurocomputing, vol. 150, 2015, pp. 599–610.

13. M. Greenacre, Biplots in Practice, CRC Press, 2007.

14. D. Coimbra et al., “Explaining Three-Dimensional Dimensionality Reduction Plots,” Information Visualization, vol. 15, no. 2, 2015, pp. 154–172.

15. R. da Silva et al., “Attribute-Based Visual Explanation of Multidimensional Projections,” Proc. EuroVA, 2015, pp. 134–139.

16. F.V. Paulovich et al., “Semantic Wordification of Document Collections,” Computer Graphics Forum, vol. 31, no. 3, 2012, pp. 1145–1153.

17. R.M. Martins et al., “Visual Analysis of Dimensionality Reduction Quality for Parameterized Projections,” Computers & Graphics, vol. 41, 2014, pp. 26–42.

18. P.E. Rauber et al., “Interactive Image Feature Selection Aided by Dimensionality Reduction,” Proc. EuroVA, 2015, pp. 54–61.

19. P. Domingos, “A Few Useful Things to Know about Machine Learning,” Comm. ACM, vol. 55, no. 10, 2012, pp. 78–87.

Renato R.O. da Silva is a PhD student at the University of São Paulo, Brazil. His research interests include multidimensional projections, information visualization, and high-dimensional data analytics. Contact him at [email protected].

Paulo E. Rauber is a PhD student at the University of Groningen, the Netherlands. His research interests include multidimensional projections, supervised classifier design, and visual analytics. Contact him at [email protected].

Alexandru C. Telea is a full professor at the University of Groningen, the Netherlands. His research interests include multiscale visual analytics, graph visualization, and 3D shape processing. Telea received a PhD in computer science (data visualization) from the Eindhoven University of Technology, the Netherlands. Contact him at [email protected].

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.


THE LAST WORD


by Charles Day

Computers in Cars

The major role that computational devices play in cars became dramatically apparent last September, when the Environmental Protection Agency announced the results of its investigation into Volkswagen. The EPA discovered that the German automaker had installed software in some of its diesel-engine cars that controlled a system for reducing the emission of environmentally hostile nitrogen oxides, but only during an emissions test. On the open road, the car belched nitrogen oxides, unbeknownst to its driver.

Computers have been stealthily controlling cars for decades. My second car, a 1993 Honda Civic hatchback, had a computational device, an engine control unit (ECU), whose microprocessor received data from sensors in and around the engine. On the basis of those data, the ECU would consult preprogrammed lookup tables and adjust actuators that controlled and optimized the mix of fuel and air, valve timing, idle speed, and other factors. This combination of ECU and direct fuel injection not only reduced emissions and boosted engine efficiency, it was also less bulky and mechanically simpler than the device it replaced, the venerable carburetor.

Unfortunately, however, the trend for computers in cars is toward greater complexity, not simplicity. Consider another Honda, the second-generation Acura NSX, which went on sale earlier this year. The supercar’s hybrid power train consists of a turbocharged V6 engine mated to three electric motors: one each for the two front wheels and one for the two rear wheels. An array of sensors, microprocessors, and actuators ensures that all three motors are optimally deployed during acceleration, cruising, and braking.

And talking of braking, the NSX’s brake pedal isn’t actually mechanically connected to the brakes. Rather, it activates a rheostat, which controls the brakes electronically. To preserve the feel of mechanical braking, a sensor gauges how much hydraulic pressure to push back on the driver’s foot.

In Formula One racing, the proliferation of computer control has led to an arms race among manufacturers, which reached its apogee in 1993. Thanks in part to its computer-controlled anti-lock brakes, traction control, and active suspension, the Williams FW15C won 10 of the season’s 16 races. The sport’s governing body responded by restricting electronic aids. By the 2008 season, all cars were compelled to use the same standard ECU. The 23-year-old Williams FW15C retains a strong claim to being the most technologically sophisticated Formula One car ever built.

Computers aren’t confined to supercars or racing cars. The July issue of Consumer Reports ranked cars’ infotainment systems, with Cadillac’s being among the worst. Owners reported taking months, even years, to master its user interface. “This car REALLY needs a co-pilot with an IT degree,” one despairing owner told the magazine. And this past May, USA Today reported that consumer complaints about vehicle software problems filed with the National Highway Traffic Safety Administration (NHTSA) jumped 22 percent in 2015 compared with 2014. Recalls blamed on software rose 45 percent.

I’m not against computers in cars. Rather, I worry that their encroachment will become so complete that consumers like me will be deprived of the choice to buy a car that lacks such fripperies as a remote vehicle starter system, rear vision camera, head-up display, driver seat memory, lane departure warning system, and so on. I worry, too, that even as the NHTSA records more software problems, it’s also considering whether to mandate computer-controlled safety features.

So although I wouldn’t turn down an Acura NSX, I’d rather drive one of its ancestors, the Honda S800 roadster, circa 1968.

Charles Day is Physics Today’s editor in chief. The views in this column are his own and not necessarily those of either Physics Today or its publisher, the American Institute of Physics.


CALL FOR NOMINEES Education Awards Nominations

Deadline: 15 October 2016
Nomination site: awards.computer.org

Taylor L. Booth Education Award

A bronze medal and US$5,000 honorarium are awarded for an outstanding record in computer science and engineering education. The individual must meet two or more of the following criteria in computer science and engineering education:

Achieving recognition as a teacher of renown.

Inspiring others to a career in computer science and engineering education.

Two endorsements are required for an award nomination.

www.computer.org/web/awards/booth

Computer Science and Engineering Undergraduate Teaching Award

A plaque, a certificate, and a US$2,000 stipend are awarded to recognize outstanding contributions to undergraduate education through both teaching and service, for helping to maintain interest and increase the visibility of the society, and for making a statement about the importance with which we view undergraduate education.

The award nomination requires a minimum of three endorsements.

www.computer.org/web/awards/cse-undergrad-teaching


Featured speakers include Sameer Chopra, Juan Miguel Lavista, and Haile Owusu.

Meet Analytics Experts Face-To-Face

Why Attend Rock Stars of Pervasive, Predictive Analytics?

Want to know how to avoid the 4 biggest problems of predictive analytics – from the Principal Data Scientist at Microsoft? Take a little time to discover how predictive analytics are used in the real world, like at Orbitz Worldwide. You can stop letting traditional perspectives limit you with help from Mashable’s chief data scientist.

Come to this dynamic one-day symposium.

www.computer.org/ppa

18 October 2016 | Mountain View, CA
