
CEDPS: Getting Your Data Downtown

Enabling Petascale Data Movement and Analysis

The flood of data being produced today and the tsunami of data expected within the decade raise a critical end-to-end question: how can researchers best access these data and use them for scientific discovery? To answer this question, the SciDAC Center for Enabling Distributed Petascale Science (CEDPS) is exploring diverse approaches — ranging from highly scalable data-transfer tools to methods enabling server-side data analysis.

The chair of a recent SciDAC workshop observed during his opening remarks that “if airplanes had gotten faster by the same factor as computers over the past 50 years, we would be able to cross the country in just one tenth of a second.” To this, some wag at the back of the room retorted, “yes, but it would still take you two hours to get downtown.”

The important truth underlying this witticism is that petascale science is an end-to-end problem. Over the past several years, investments by the U.S. Department of Energy (DOE) Office of Science and others have resulted in some of the most powerful supercomputers and scientific facilities ever known, including the computers at Argonne (ANL) and Oak Ridge (ORNL) national laboratories, and facilities such as the Advanced Photon Source (APS), the Spallation Neutron Source, and the Large Hadron Collider (LHC). These facilities allow researchers to produce in a few seconds data that previously would have taken days, and to obtain in days data that were previously unthinkable. However, making sense of these data requires that users be able to access the data, and most users are located far from where the data are produced. It serves little purpose to accelerate data generation (the airplane analogy) if we cannot also accelerate the time required to move data into the hands of scientists (getting downtown).

These considerations motivated the establishment of CEDPS, a SciDAC Center for Enabling Technology. CEDPS seeks to reduce time-to-discovery for DOE experimental and simulation science by accelerating access to remote data and software. CEDPS has been pursuing this goal since 2006. As described in this article, CEDPS has achieved significant successes in both the production of useful tools and the use of those tools within DOE and other projects.

Petascale Data in a Connected World

Advances in computational power, detector capabilities, and storage capacity — following Moore’s law — are resulting in ever greater data volumes, from not only scientific computations but also experimental facilities. For example, while the Earth System Grid currently manages 200 terabytes (TB) of climate simulation data, including the datasets produced for the recently completed fourth Intergovernmental Panel on Climate Change assessment, the fifth assessment is expected to comprise 50 petabytes (PB) by 2013.

These enormous quantities of data are produced as a result of billions of dollars invested in scientific instrumentation and the associated science. Thus, these data have considerable cost, as well as tremendous value, to scientists worldwide. Viewing science as an end-to-end problem, we must concern ourselves with how to reduce the time that elapses between the data being generated and the distribution of new insights to the scientific community.

It used to be common that a scientist generated data at some facility, then ran analysis programs at the same facility, and finally communicated results via a research publication.

22 S C I D A C R E V I E W W I N T E R 2 0 0 9 W W W . S C I D A C R E V I E W . O R G

It serves little purpose toaccelerate data generationif we cannot alsoaccelerate the timerequired to move data intothe hands of scientists.

SciDAC Winter09-Inside.qxd 11/23/09 4:00 PM Page 22

Page 2: Enabling Petascale Data Movement and Analysis - Mathematics and

This mode of working posed no particular demands on the distributed computing infrastructure. For various reasons, however, this relatively simple mode of working is not always satisfactory. One reason is that, as a result of specialization of infrastructure, it may not be feasible to perform analyses at the site where the data were generated. Indeed, many experimental facilities, and even some supercomputers, provide only limited storage space and post-processing facilities. In these cases, the user has no choice but to transfer data to a home institute or another facility for analysis. In the past, this transfer was often achieved via tape. However, increased data volumes make transfer by tape increasingly problematic. A second reason is that raw data can be of interest to many researchers other than the individual or team who first generated it. Thus, we want to find ways of making that data available to many. A third reason is that, as science becomes more complex and interdisciplinary, people are eager to combine data from multiple sources. For example, climate scientists want to compare results from different climate models, and materials scientists want to image the same sample with both neutrons and photons.

These and other considerations motivate us to pursue strategies to reduce barriers that restrict access to data produced at scientific facilities. Such strategies fall into two categories:

• Moving data to remote users via high-speed networks and software designed to facilitate the rapid, reliable, and secure movement of data

• Facilitating local analysis by remote users, via hardware and software designed to support rapid, reliable, and secure server-side data analysis

These strategies form two of the three CEDPS focus areas. The third area is troubleshooting, because distributed systems — especially high-performance distributed systems — are prone to failure.


Figure 1. Connectivity and bandwidth requirements of selected DOE science areas and facilities, as determined within requirements elicitation workshops conducted by DOE’s Energy Sciences Network (ESnet). This table is adapted from a presentation given by Eli Dart of ESnet to the Advanced Scientific Computing Advisory Committee (ASCAC) Networking Subcommittee Meeting on April 13, 2007.


It is unlikely that it will ever be possible or desirable to move all data. However, it is impressive just how much data can be moved. Over the 10 gigabit per second (Gbps) links that are common today, we can (in principle) move 100 TB in a day, if the storage systems and local area networks at the source and destination can sustain 10 Gbps speeds.


Figure 2. Challenges inherent in moving large quantities of data fast over wide-area networks, and potential solutions to those challenges.

Figure 3. Major protocol specifications used in GridFTP and some of the major features provided by those specifications. RFCs are specifications released by the Internet Engineering Task Force; a GFD is a specification issued by the Open Grid Forum.


Emerging 100 Gbps links can increase this number to a petabyte per day. At these speeds, it may still be faster and cheaper overall to ship data by filling a truck with disks, but the inherent complexity is so substantial that it becomes increasingly less attractive. Thus, even experiments operating at the LHC in Geneva, Switzerland, which expect to produce many petabytes per year, plan to transfer those data to the United States via the Internet. These considerations continue to drive bandwidth requirements for DOE networking ever higher, as illustrated in figure 1.
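The arithmetic behind these figures is simple enough to check with a few lines of code. The following Python sketch is illustrative only; the 90% efficiency factor is an assumption, since real transfers are also limited by storage systems, TCP tuning, and competing traffic.

    # Back-of-the-envelope estimate of how much data a dedicated link can carry.
    # Assumes the link can be driven at a given fraction of its nominal rate.

    def terabytes_per_day(link_gbps, efficiency=0.9):
        bytes_per_second = link_gbps * 1e9 / 8 * efficiency
        return bytes_per_second * 86400 / 1e12   # seconds per day -> TB

    for gbps in (1, 10, 100):
        print(f"{gbps:>3} Gbps link: ~{terabytes_per_day(gbps):,.0f} TB/day")

    # 10 Gbps yields roughly 100 TB/day and 100 Gbps roughly 1 PB/day,
    # matching the in-principle figures quoted above.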

Putting Data Where You Need Them

Say you have produced 10 TB of data and want to ensure these data are available elsewhere — perhaps at a single location, or perhaps at several — rapidly, reliably, and securely.

This task is not always easy to achieve. In principle, you should be able to move 10 TB over a modern 10 Gbps network in less than three hours. In practice you may find that it takes weeks, and much pain and suffering, to achieve such a transfer. Figure 2 lists some of the problems that can arise. The first set of problems relates to performance. Far too often, some component of the end-to-end path (or the top-to-bottom configuration of application, middleware, network protocols, and hardware) is misconfigured, with the result that transfers slow to a crawl. A second set of problems relates to reliability. Even the most carefully managed storage system, firewall, router, network, or computer will occasionally fail; and in the absence of fault tolerance mechanisms, any failure in an end-to-end path will cause the entire transfer to fail.

The solutions to many of these problems (figure 2) are known, at least in principle. But packaging, configuring, and delivering them in a form where they can be used easily can introduce significant challenges. CEDPS has adopted a two-pronged strategy to achieve this goal: scaling GridFTP to the petascale, and implementing higher-level behaviors in libraries.

Scaling GridFTP to the Petascale

The name GridFTP is commonly used to refer both to a data movement protocol and to software that implements that protocol. The GridFTP protocol specification describes both a profile of existing specifications and a set of extensions to the popular file transfer protocol (FTP) (figure 3). The result is a protocol that is well suited for high-performance transport, because of its support for features such as reliability, third-party transfer, striping, and protocol tuning.

Multiple implementations of the GridFTP protocol exist. CEDPS works with two of these implementations: one developed by the Globus team at ANL, and a second developed by the dCache team at Fermilab.

The Globus implementation of GridFTP is the most full-featured and efficient available. Its modular architecture (figure 4) permits it to access data stored in a variety of data systems, move data using a variety of transport protocols, and exploit striping for concurrency and high-performance transport. These features enable Globus GridFTP to achieve high end-to-end performance over both local-area and wide-area networks.


Figure 4. The Globus GridFTP architecture. Clients can request transfers to/from a server or from one server to another (a third-party transfer). A GridFTP server front end can control the operation of multiple data movers, in order to support many clients more efficiently and/or to access data striped across multiple disks. A modular data storage interface (DSI) can provide access to a variety of storage systems, including NFS, HPSS, OPENDAP, and Hadoop. The number of data movers can be varied dynamically, for example in response to changing client load.


Figure 5. GridFTP performance with and without “lots of small files” optimizations. In each case, 1 GB is transferred, with varying numbers of files (top x-axis) and file sizes (bottom x-axis). Without pipelining, data transfer performance (shown on the y-axis in Mbps) degrades once file size drops below 100 MB. With pipelining, high performance is sustained down to a file size of about 250 kB.


A popular recent enhancement permits authentication using SSH credentials as an alternative to DOE X.509 certificates. Another recent enhancement achieves dramatic performance improvements for many small files, via the pipelining and concurrent execution of many transfer commands (figure 5).
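To make these features concrete, the sketch below drives a transfer from Python by shelling out to the globus-url-copy client that ships with the Globus GridFTP distribution. It is a hedged example: the endpoint URLs are hypothetical placeholders, and the option names (-p for parallel streams, -pp for pipelining, -cc for concurrent files, -vb for progress reporting) should be verified against the installed client's help output.

    # Sketch: launch a GridFTP transfer with parallelism and pipelining enabled.
    # The endpoints below are hypothetical examples, not real hosts.
    import subprocess

    SRC = "gsiftp://dtn.source.example.org/data/run42/"      # placeholder source
    DST = "gsiftp://dtn.dest.example.org/archive/run42/"     # placeholder destination

    cmd = [
        "globus-url-copy",
        "-vb",          # report bytes moved and transfer rate as it runs
        "-p", "8",      # 8 parallel TCP streams per file
        "-pp",          # pipeline commands: key for lots-of-small-files workloads
        "-cc", "4",     # keep 4 files in flight concurrently
        "-r",           # recurse into the source directory
        "-cd",          # create destination directories as needed
        SRC, DST,
    ]
    subprocess.run(cmd, check=True)   # raises CalledProcessError on failure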

The following examples illustrate the impact GridFTP is having on DOE science.

• The Argonne Leadership Computing Facility (ALCF) has based its data management system on the Globus GridFTP software, using it to manage the movement of data to and from the HPSS mass store, the ALCF’s high-performance file servers, and external users (figure 6). This capability provides performance of 400 megabytes per second (MB/s), superior by 10% to the alternative performance of 360 MB/s for PFTP on a local-area network when using a single data mover. Support for multiple data movers, as developed by CEDPS, increases performance to 1.72 gigabytes per second (GB/s) on 75% of available nodes and should reach 2 GB/s when all are used.

• The DOE Energy Sciences Network (ESnet) recently announced speedups of 20 times or more, to 200 MB/s, over wide-area links from the National Energy Research Scientific Computing Center (NERSC) to the Oak Ridge Leadership Computing Facility (OLCF). These speedups were made possible by the use of GridFTP. “I admit to waiting more than an entire workday for a 33 GB input file to SCP and feeling extremely discouraged knowing I had 20 more to transfer,” says Dr. Hai Ah Nam, a computational scientist in the OLCF Scientific Computing Group researching the nuclear properties of carbon-14. She now transfers 40 TB of information between NERSC and OLCF for each nucleus she studies — and each such transfer takes less than three days.

• APS users in Australia report similar speedups relative to the SCP tool, which makes it possible for data from experiments to be transferred by network rather than courier. Also, APS is using GridFTP for automated data movement between its beamline data acquisition machine and its HPC cluster, where the acquired data are processed (figure 7). Earlier, this data movement was manual, and APS was getting a data rate of 23 MB/s using a Windows native protocol. With GridFTP, the data rates are significantly better — about 110 MB/s, a five-fold improvement. This allows the entire acquired dataset to be moved in the downtime between samples, so the samples do not build up on the acquisition machine.


Figure 6. The Argonne Leadership Computing Facility (ALCF) has based its data management system on the Globus GridFTP software, using it to manage the movement of data to and from the HPSS mass store, the ALCF’s high-performance file servers, and external users.


Figure 7. The Dark Energy Survey project operates a four-meter telescope and is constructing a new 500-megapixel optical camera for its survey.


Figure 8. The Dark Energy Survey project uses GridFTP to transfer large volumes of data across the country.


• The Dark Energy Survey is a five-year optical imaging survey project led by Fermilab to make precision measurements of the sky, aimed at determining the reasons the Universe is accelerating (figure 8). To accomplish this goal, researchers will use the Large Synoptic Survey Telescope to survey more than 20,000 square degrees to an unprecedented depth over 10 years, creating a database of more than 100 PB. The telescope’s data torrent presents significant challenges for near-real-time analysis of data streams. Preliminary experiments are using GridFTP for access to its enormous datasets (figure 9). “The project has needed to transfer data between NCSA and SDSC, between NCSA and Fermilab, from dedicated servers to shared supercomputer platforms, from parallel file systems to tape archives. For all of these use cases, GridFTP has provided the flexible and high-performance solution required by the Dark Energy Survey Data Management project,” says Dr. Greg Daues, a research programmer at NCSA who works on the Dark Energy Survey project.

CEDPS enhancements to Globus GridFTP seem to be correlated with substantial increases in both the number of GridFTP servers reporting usage via GridFTP’s optional usage reporting feature and the number of reported downloads (figure 10).

CEDPS also supports work on a second implementation of GridFTP that was incorporated in the dCache distributed storage system used in high-energy physics experiments. One important enhancement introduced in the dCache GridFTP implementation is support for checksum calculations — an important capability, given that the 16-bit Transmission Control Protocol checksum is widely viewed as inadequate for the large amounts of data transferred over modern networks. This feature detects an average of 40 errors per million transfers on data transferred by the D0 experiment at Fermilab and an unrevealed (but we expect similar) number in the more than 10 TB of data transferred per day for the European Organization for Nuclear Research’s Compact Muon Solenoid experiment.
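The idea is easy to illustrate. The sketch below is not dCache or GridFTP code; it simply shows a chunked whole-file digest that the two ends of a transfer could compute and compare, independently of the per-packet TCP checksum.

    # Sketch: compute a whole-file digest in fixed-size chunks so that source and
    # destination can compare values after a transfer. A mismatch indicates
    # corruption that TCP's weak 16-bit checksum failed to catch.
    import hashlib

    def file_digest(path, algorithm="md5", chunk_bytes=4 * 1024 * 1024):
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_bytes), b""):
                h.update(chunk)
        return h.hexdigest()

    # Hypothetical usage: run at both ends and compare the two hex strings;
    # on a mismatch, flag the file for retransfer.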

Raising the Level of Discourse

Rapid, secure, and reliable movement of individual files (and sets of files) is a necessary capability for many DOE science projects. However, it is not in itself sufficient for many purposes. In particular, many users ask for more sophisticated management functionality, so that they can, for example, request that “only files that have changed, in this directory, should be replicated.” CEDPS is developing a variety of higher-level tools and services to meet this sort of demand.


Figure 9. Automated data movement using GridFTP at APS.



Data mirroring refers to the synchronization of source and destination directories. A mirroring operation involves first the evaluation of the state of these directories, followed by the updating of the destination to bring it into synchrony. The evaluation of the state of synchrony of the directories may include comparing file names, sizes, modification timestamps, and/or checksums. Ideally, a data mirroring tool copies only files or portions of files that do not currently exist or that require updating at the destination site.

Data mirroring may be required for a variety of reasons. For example, we may want to increase availability in case of hardware failures, network outages, or even natural disasters. Alternatively, we may wish to improve performance by allowing a dataset to be accessed from the best of several locations or simultaneously from multiple locations. Probably the most common reason is a desire to have a copy of essential datasets at home institutions. We may also want to replicate data to a storage system near a specific computational resource to improve performance.

CEDPS has developed a data mirroring tool called globus-url-sync to provide secure and efficient data mirroring for high-bandwidth and large-data environments. Popular UNIX mirroring tools, such as rsync, were originally developed for low-bandwidth environments. Thus, these tools incur high CPU costs, because they use aggressive checksumming to minimize data transport. In addition, they do not support high-performance data protocols such as GridFTP and provide only limited data transfer security. For these reasons, the tools are not often used in scientific environments, particularly when large datasets are involved. The globus-url-sync tool extends the functionality of the popular globus-url-copy command line tool to mirror a source and destination directory, transmitting only modified files and using GridFTP as the underlying transfer protocol.
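The heart of such a tool, deciding which files actually need to be sent, can be sketched as follows. This is an illustration of the comparison strategy just described, not the globus-url-sync implementation: it uses cheap metadata checks first, optionally falls back to a caller-supplied checksum function, and would hand the resulting list to GridFTP for the actual transfer.

    # Sketch of mirroring logic: decide which source files must be copied to the
    # destination, using inexpensive metadata checks before any checksumming.
    import os

    def files_to_update(src_dir, dst_dir, checksum=None):
        stale = []
        for root, _dirs, names in os.walk(src_dir):
            for name in names:
                src = os.path.join(root, name)
                rel = os.path.relpath(src, src_dir)
                dst = os.path.join(dst_dir, rel)
                if not os.path.exists(dst):
                    stale.append(rel)                      # missing at destination
                elif os.path.getsize(src) != os.path.getsize(dst):
                    stale.append(rel)                      # size mismatch
                elif os.path.getmtime(src) > os.path.getmtime(dst):
                    stale.append(rel)                      # source is newer
                elif checksum and checksum(src) != checksum(dst):
                    stale.append(rel)                      # contents differ
        return stale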

The globus-url-sync tool has been designed in collaboration with Stephen Miller’s Neutron Scattering Science Division of the Spallation Neutron Source at ORNL. Their data mirroring requirements include sharing within their collaboration and also with the APS group at ANL. We are in the process of deploying this mirroring tool within the ORNL environment and performing initial data mirroring operations.

DataKoa (Further Reading) is a hosted data movement service — that is, a service operated by a third party to which a user can hand off the problem of managing data movement among two or more locations in a distributed system. CEDPS researchers believe that a hosted service can achieve better reliability and a more rapid response to problems than a service operated by an individual user on, for example, a desktop.

Scalable Services

Moving data to computation is not always feasible — and may be expected to be less feasible in the future, as data volumes continue to grow. Thus, we seek methods for enabling remote access to code and easily moving computation to remote computers. Two major initiatives in the CEDPS Services area address these two challenges.

Service-Oriented Science Tools

CEDPS has worked closely with researchers at Ohio State University and elsewhere to produce, deploy, and evaluate tools for creating, deploying, publishing information about, discovering, invoking, and composing web services — services that encapsulate some specific data or software of interest to remote users. In particular, CEDPS has developed tools for wrapping science applications as services — the grid Resource Allocation and Virtualization Environment (gRAVI) — and has worked with various DOE teams to apply these tools in different settings.


Figure 10. Opt-out usage reporting integrated into GridFTP provides partial data on usage. Shown here (left) is the number of unique GridFTP servers with usage reporting enabled, and (right) the number of transfers per day reported by those servers, over a roughly three-year period.


The CEDPS team has worked with the team at Ohio State University in developing Introduce, which provides one-stop shopping for building a fully functional service from scratch. Introduce aims to reduce the service development and deployment effort by hiding low-level details of the Globus Toolkit and to enable the implementation of strongly typed Grid services. Introduce also hides the complexity of developing and deploying secure web services. Introduce is built using a plug-in architecture that gRAVI utilizes to automatically generate Grid services around applications. The end user can start with an application or an executable and create a fully functional, secure Grid service around that application without writing a single line of code. Also, gRAVI annotates the generated services with appropriate metadata that can help scientists discover these services and use them in their workflows. In addition, gRAVI generates code to register the generated service to a centralized registry with appropriate metadata that makes it easier for users to discover the service. Introduce and gRAVI together provide the full spectrum of tools required to create, secure, deploy, and register application services.
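To convey the flavor of wrapping an executable as a service, here is a deliberately minimal sketch. It is not the gRAVI or Introduce API and omits everything that makes them useful in practice (security, strong typing, registry metadata); it merely exposes a hypothetical command-line application behind an HTTP endpoint.

    # Sketch: the "application as a service" idea in miniature. A POST body is
    # piped to a wrapped executable and its output is returned as JSON.
    # The wrapped command and the port are hypothetical examples.
    import json
    import subprocess
    from http.server import BaseHTTPRequestHandler, HTTPServer

    APPLICATION = ["wc", "-w"]   # hypothetical wrapped executable and arguments

    class AppServiceHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = self.rfile.read(length)
            result = subprocess.run(APPLICATION, input=payload,
                                    capture_output=True, check=False)
            body = json.dumps({"returncode": result.returncode,
                               "stdout": result.stdout.decode(),
                               "stderr": result.stderr.decode()}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8080), AppServiceHandler).serve_forever()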

The CEDPS Services area recently completed a full implementation of rapid creation and deployment of application services using gRAVI. APS is an early adopter of gRAVI in the area of high-throughput tomography. Tomography at APS uses high-intensity X-rays to create many projections through a sample across a range of angles. These projected images contain information about how the X-rays were absorbed through various paths in the sample. A numerical processing technique can then be used on the projections to develop a three-dimensional density map of the sample, known as a reconstruction (figure 11). This density map is a nondestructive look at the interior structures of the sample. Tomography, therefore, is a powerful tool to nondestructively examine the detailed structure within an optically opaque sample. Tomography is used to study a wide array of samples from many disciplines, such as materials science, biology, and medicine.

High-throughput tomography requires that a number of computational resources be available to efficiently process raw data into high-quality reconstructions. At APS, a parallel processing cluster coupled with a multi-terabyte disk array has been employed to process data at rates comparable to acquisition rates. Additional beamlines have since developed the ability to perform tomography experiments. The challenge at APS has thus become how to provide support for the same high-performance computational systems without necessarily duplicating the costs. A collaborative effort with the CEDPS team led to the development of an implementation that may answer this challenge.


Figure 11. A three-dimensional density map of a sample taken using gRAVI.


The group at APS used gRAVI to wrap analytical routines as secure services and created workflows that provide secure local and remote access to locate sample data within the system, browse the data, and create and run simple workflow systems to automate processing and delivery of large numbers of samples.

The gRAVI tools have also been adopted enthusiastically at NERSC for rapid virtualization and provisioning of applications. In addition, the tools are being used in the cancer Bioinformatics Grid and the Cardio Vascular Grid Research projects, both sponsored by the National Institutes of Health and the OMII-UK team.

Infrastructure as a Service (IaaS)

Also known as “cloud,” IaaS has emerged as a useful approach to delivering computing and storage capabilities to users who do not want to, or cannot afford to, operate their own dedicated systems (sidebar “A Cloud for STAR”). Use of virtual machine technologies also addresses code portability issues that often bedevil scientific projects. For example, the Solenoid Tracker at RHIC (STAR) team tells us that it can take several weeks to install and validate on a new computer the more than 200 packages that comprise their data analysis system. With virtual machines, it suffices to download and start up the virtual machine image.

CEDPS has contributed to the development of the Nimbus virtual machine provisioning software, which in turn has developed a substantial user community within and beyond DOE. An early application success was enabling the first production run of the nuclear physics STAR applications on Amazon’s EC2 cloud computing infrastructure in September 2007. The deployment of the STAR cluster on EC2 was orchestrated by the Nimbus Context Broker service, which enables automatic and secure deployment of turnkey virtual clusters, bridging the gap between the functionality provided by EC2 and the end product that scientific communities need to deploy their applications. Scientific production runs require careful and involved environment preparation and reliability; this run was a significant step toward convincing the broad STAR community that real science can be done using cloud computing.

A Cloud for STAR

The advantages of cloud computing were dramatically illustrated in March 2009 by researchers working on the STAR nuclear physics experiment at Brookhaven National Laboratory’s Relativistic Heavy-Ion Collider. The STAR scientists had a late-coming simulation request to produce results for the Quark Matter conference, but all the computational resources were either committed to other tasks or did not support the environment needed for STAR. Fortunately, working with the Nimbus team, the STAR researchers were able to dynamically provision virtual clusters and run the additional computations just in time.

This most recent achievement is the culmination of a few years of close collaboration between the Nimbus team and the STAR experiment. Nimbus is an open source cloud computing infrastructure consisting of tools that allow users to deploy virtual machines (VMs) on resources in a manner similar to Amazon’s EC2 (Workspace Service) and then combine them into “turnkey” virtual clusters, sharing security and configuration context, that can be used as deployment platforms for scientific computations (Context Broker).

The STAR team recognized the benefits of virtualization early on: it allows scientists to configure a VM image exactly to their needs and have a fully validated experimental software stack ready for use. A virtual cluster composed of hundreds of such images can be deployed on remote resources in minutes. In contrast, Grid resources available at sites not expressly dedicated to STAR can take months to have their configuration adapted to support the STAR environment. Furthermore, since software provisioning and updates of remote sites are hard to track, the scientists cannot always be sure that their data productions are consistent.

The STAR scientists started out by developing and deploying their VMs on a small Nimbus cloud configured at the University of Chicago. Once the virtual machines were deployed, they used the Nimbus Context Broker to configure them into Open Science Grid (OSG) clusters where a job could be submitted just as if it were another OSG cluster. However, since STAR production runs require hundreds of nodes, the collaborating teams soon started moving those clusters to commercial infrastructure. The Nimbus Context Broker was adapted to work with EC2, and a gateway was developed to facilitate access. The first STAR production run on EC2 took place in September 2007, and over the following year the STAR scientists, in collaboration with the Nimbus team, made further progress, evaluating performance and successfully conducting a few non-critical runs that prepared the ground for elastic use of EC2 in the latest run.

Troubleshooting

You have just created 10 TB of data. You fire up your GridFTP client to move these data to a remote computer for analysis, and you are told it should be done in three hours. It is late in the day, so you head home and plan to start the analysis tomorrow. The next morning, however, you discover that the transfer is only 10% done. Apparently something went wrong or was wrong all along. The cause could be your client settings, your host computer, the network gateway, the network itself, or the receiver’s hardware or software. How can you tell where the problem lies, and how to correct it?

This vignette captures the essence of the problem that CEDPS troubleshooting researchers seek to solve. It is a difficult problem because wide-area networks inevitably span administrative domains. Sending data involves engaging a multiscale complex system, just as driving your car involves engaging the complex internal machinery of a modern automobile. The big difference between the network and the automobile, from the perspectives of complexity and reliability, is that the set of components that make up your car have been designed and tested to work together (you hope!) by one manufacturer, using one set of specifications, quality control procedures, and so on. An end-to-end data path, on the other hand, must mesh together applications, middleware components, and networks that are developed and operated independently.

CEDPS troubleshooting focuses on the end-to-end problem — that is, on correlating all available performance information from the application layer, middleware, host, and network (figure 12). In order to perform these correlations, we have developed a framework for collecting monitoring data from a distributed system, then normalizing this information into a common data model and loading it into a relational database (figure 13). This framework has been incorporated into a software project that predates CEDPS, called NetLogger, and thus it is called the NetLogger Pipeline. To collect logs, we employ the widely used open-source software syslog-ng. To normalize and load the performance information into a database, we use a lightweight set of Python modules.
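The normalize-and-load step can be illustrated with a small sketch. This is not the NetLogger Pipeline code; it simply shows the pattern of parsing heterogeneous log lines into a common record and loading them into a relational table, using SQLite for self-containment and an assumed key=value log format.

    # Sketch: parse key=value log lines into uniform rows and insert them into a
    # relational table for later correlation queries across layers and hosts.
    import re
    import sqlite3

    FIELD = re.compile(r"(\S+)=(\S+)")

    def normalize(line):
        fields = dict(FIELD.findall(line))
        return (fields.get("ts"), fields.get("event"),
                fields.get("host"), float(fields.get("val", "0")))

    conn = sqlite3.connect("monitoring.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS events
                    (ts TEXT, event TEXT, host TEXT, val REAL)""")

    sample_logs = [                      # hypothetical monitoring records
        "ts=2009-10-01T12:00:00Z event=gridftp.xfer.rate host=dtn1 val=812.5",
        "ts=2009-10-01T12:00:01Z event=disk.write.rate host=dtn2 val=640.0",
    ]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)",
                     (normalize(line) for line in sample_logs))
    conn.commit()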

The flexibility of the NetLogger Pipeline is evidenced by the variety of applications to which it has been applied, including:

• On the NERSC Parallel Distributed Systems Facility (PDSF), analysis of data from STAR Berkeley Storage Manager (BeStMan) data transfers has revealed unexpected network performance, which has triggered PDSF upgrades and configuration changes.

• The NERSC Project Accounts team is using the NetLogger Pipeline to normalize the logs and log database to perform traceability analysis.

• The Pegasus team at USC/ISI uses the NetLogger Pipeline for large CyberShake workflows. Special analysis tools were able to efficiently analyze execution logs of earthquake science workflows consisting of upwards of a million tasks (figure 14).

• The Tech-X STAR job submission portal uses the NetLogger Pipeline database to drill down to site-specific information for a STAR portal job. A prototype of this functionality was demonstrated at SC08.

Storage Resource Managers (SRMs) are middleware components that provide a common access interface, dynamic space allocation, and file management for shared distributed storage systems.


Figure 12. The CEDPS troubleshooting group focuses on all aspects of the data problem, since virtually any component can impact successful data transfer.

Figure 13. Framework developed by CEDPS and incorporated into the NetLogger project.


A natural use of SRMs is the coordination of large-volume data streaming between sites. Optimizing the performance of SRMs over high-speed network interfaces is an important challenge that impacts science projects such as STAR and the Earth System Grid. The most recent version of an SRM, developed at Lawrence Berkeley National Laboratory (LBNL) and currently deployed in several projects by the SciDAC Storage Data Management Center, is called BeStMan. When managing multiple files, BeStMan can take advantage of the available network bandwidth by scheduling multiple concurrent file transfers. CEDPS and Storage Data Management researchers have been working together to understand and optimize the performance of BeStMan.
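The benefit of scheduling concurrent transfers can be sketched simply. The following illustration is not BeStMan code; transfer_one is a hypothetical stand-in for whatever moves a single file (for example, a GridFTP client call), and the point is merely that keeping several transfers in flight prevents one slow file from leaving the network idle.

    # Sketch: keep N file transfers in flight at once so the aggregate rate can
    # approach the available network bandwidth.
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def transfer_many(files, transfer_one, concurrency=4):
        failures = []
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            futures = {pool.submit(transfer_one, f): f for f in files}
            for future in as_completed(futures):
                if future.exception() is not None:
                    failures.append(futures[future])   # remember files to retry
        return failures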

The STAR experiment uses BeStMan to move datasets between Brookhaven National Laboratory (BNL) and the PDSF analysis cluster at NERSC. In this case, BeStMan communicates with GridFTP servers at BNL. Using CEDPS tools, we can view the transfer performance as seen by each endpoint (figure 15). The GridFTP and BeStMan estimates of bandwidth roughly track each other, although there are some discrepancies because each component records the start and end time of a transfer slightly differently.

More important is that initial analyses of the throughput between these sites revealed that the data transfer rate was far below what anyone expected. Note in figure 15 that the upper limit is around 50 megabits per second (Mbps); independent tests could achieve upwards of 500 Mbps! (For more details, see “PDSF–BNL Bandwidth” in Further Reading.)

This result triggered a flurry of activity that resulted in three discoveries:

• The network interface controller in the end-host at PDSF was 10 times slower than the cross-country link from California to New York. This problem was easily fixed but is a classic example of the configuration problems often encountered in the “last mile” of the network.

• The hardware at Brookhaven was older and slower than required. Large data transfers require a lot of memory, because the data nodes need to remember all the data in transit in case some of it is lost. The longer and “fatter” (that is, faster) the network is between the two endpoints, the more memory is required; a rough estimate of this requirement follows this list. Again, this problem was easily fixed once recognized.

• Even after end-host upgrades and under the best possible circumstances, the transfer rates are asymmetrical. Data flow roughly twice as fast from BNL to PDSF as the other way around. The reasons for this difference are still being investigated.
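That memory requirement is set by the bandwidth-delay product: the sending host must buffer everything still in flight on the path. A quick estimate, assuming a typical coast-to-coast round-trip time of about 70 ms (an illustrative figure, not a measurement of this particular path), looks like this:

    # Bandwidth-delay product: bytes in flight that the end hosts must buffer.
    def bdp_megabytes(link_gbps, rtt_seconds=0.070):
        return link_gbps * 1e9 / 8 * rtt_seconds / 1e6

    print(f"1 Gbps:  ~{bdp_megabytes(1):.0f} MB of buffering")
    print(f"10 Gbps: ~{bdp_megabytes(10):.0f} MB of buffering")
    # Roughly 9 MB at 1 Gbps and 88 MB at 10 Gbps, well above the default TCP
    # buffer sizes on older hosts such as the one described above.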

This application also reveals two problems with the current infrastructure. The first is that the monitoring is not truly end-to-end.


Figure 14. Earthquake movement from the San Joaquin Valley, California, to Mexico, across the Los Angeles basin, moments after a simulated rupture. Such simulations can generate up to 40 TB of data.


Figure 15. Throughput of the BeStMan client at NERSC communicating with GridFTP servers at Brookhaven National Laboratory, as seen from both ends of the connection.


This arrangement is essentially a political, and not a technical, problem — establishing agreement on what to monitor, and how, between many distinct hierarchies of command and control. The second is that configuration of data movement endpoints is difficult because there are many different endpoints being used for different projects; what is needed is a single managed set of hardware resources that is dedicated to data transfers. More detail on how CEDPS and other researchers are cooperating to solve both these problems is provided in the sidebar “Bottleneck Detection: Disk or Network?”

The Next Steps

Science is an end-to-end problem: even the most sophisticated numerical simulation and the most advanced experiment become valuable only when the data that they generate are translated into insight. Thus, methods for reliable, secure, and rapid data movement — or for avoiding data movement altogether by enabling effective server-side analysis — are essential elements of a petascale computing solution.

The SciDAC CEDPS is dedicated to providing such methods and ensuring they are applied effectively within DOE science projects. The project’s participants have worked on these problems for many years, both individually and (in many cases) together. What distinguishes CEDPS from past efforts is the focus on scaling to the petascale and, in the process, addressing performance and reliability problems that arise in the context of challenging DOE applications.

A current focus of CEDPS work is demonstrating and evaluating the capabilities of CEDPS tools at substantial scale. To that end, we are defining a set of data challenges, in which we seek to move large quantities of data over wide-area networks in an efficient, reliable, and hands-off manner. The first such data challenge will involve one million files of average size 10 MB, a second challenge will involve 10 million files, and a third will feature 100 million files. We encourage readers with interesting distributed data problems to contact us.

Contributors: Dr. Ian Foster, ANL; Dr. Ann Chervenak, USC/ISI; Dr. Dan Gunter, LBNL; Dr. Kate Keahey, ANL; Ravi Madduri, ANL; and Raj Kettimuthu, ANL

Further Reading

DataKoa
http://www.datakoa.org

PDSF–BNL Bandwidth
http://www.cedps.net/index.php/PDSF_-_BNL_bandwidth_measurements


Bottleneck Detection: Disk or Network?

As wide-area networks get faster, they are beginning to outpace disks. Until recently, even a single cheap disk could keep up with a very fast and very expensive wide-area network. Now, with 10 Gbps networks deployed and 100 Gbps around the corner, a single disk may not be anywhere near enough. Therefore, when optimizing the performance of a data transfer, it is important to know which part of the path — the sender disk, the network, or the receiver disk — is the bottleneck. It is also important to know, particularly for transfers that take several hours, where this bottleneck is before the transfer completes.

To address this problem, the CEDPS project has added features to the NetLogger Toolkit that allow it to efficiently and transparently summarize the current throughput of a data transfer that has had only a few lines of instrumentation added to its code. The frequency of summarization, from one summary per run, to one summary every N seconds, to a detailed log of every single read and write operation, can be controlled at run time. The instrumentation has been added to the GridFTP server software, so a running server can show whether the disk or network is a bottleneck for every transfer. An option has also been added to the client program so the user can get a bottleneck analysis report.

An example of one-second summaries on a short local transfer (figure 16) shows the write to disk as usually, but not always, the bottleneck.
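The kind of per-interval classification such a report produces can be sketched as follows. This is an illustration of the idea, not the NetLogger instrumentation itself; the per-second summary numbers below are made-up values standing in for sampled read, send, and write counters.

    # Sketch: given per-second throughput for disk reads, network sends, and
    # disk writes, label the slowest stage of each interval as the bottleneck.
    def bottleneck(disk_read_mbps, net_mbps, disk_write_mbps):
        stages = {"sender disk": disk_read_mbps,
                  "network": net_mbps,
                  "receiver disk": disk_write_mbps}
        return min(stages, key=stages.get)

    # Hypothetical one-second summaries (MB/s) for a short local transfer:
    samples = [(950, 900, 410), (940, 880, 395), (930, 860, 870)]
    for second, (r, n, w) in enumerate(samples):
        print(f"t={second}s: bottleneck = {bottleneck(r, n, w)}")
    # In this made-up trace the receiver's disk limits most intervals, the
    # pattern figure 16 shows for the example transfer.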


Figure 16. An example NetLogger summary report of disk and network throughput for a short local GridFTP transfer.
