Big Data Computing: Building a Vision for ARS Information Management

Feb. 5-7, 2012, GWCC, Beltsville, MD

Workshop Organizing Committee: Rosalind R. James, Carolyn Lawrence, Sharon Papiernik, Curt Van Tassell


Page 1: Workshop Organizing Committee: Rosalind R. James, Carolyn Lawrence, Sharon Papiernik, Curt Van Tassell


Big Data Computing: Building a Vision for ARS Information Management

Feb. 5-7, 2012, GWCC, Beltsville, MD

Workshop Organizing Committee: Rosalind R. James, Carolyn Lawrence, Sharon Papiernik, Curt Van Tassell

Page 2:

Workshop Purpose

Bring ARS scientific capability to the cutting edge

Page 3:

Workshop Purpose

Develop a vision and strategy that defines:

(1) ARS scientific Big Data needs

(2) An infrastructure for dealing with these needs

for now and into the future

Page 4:

What is Big Data?

Massive amounts of data that accumulate over time and are difficult to analyze and handle using common data management tools.

Page 5:

Size Isn’t Everything…

Big Data comes in V-Dimensions:

• Volume. With large size comes difficulty in finding what is relevant, finding space to store it, and deciding how to index it.

• Variety. Highly structured data, variably structured data, and unstructured data.

• Velocity. How fast is the data created, and how fast must it be processed?

• Veracity. Uncertain or imprecise data.

Page 6:

What makes Big Data so important?

Researchers no longer simply ask,

“What experimental design will best address this question?”

But rather,

“What can I glean from extant data?”

Or better yet,

“What insights can I glean if I could fuse data from multiple domains?”

From: The Fourth Paradigm: Data-Intensive Scientific Discovery

Page 7:

We are drowning in information… The world will be run by synthesizers, people able to put together the right information at the right time, think critically about it, and make important choices wisely.

E. O. Wilson. 1998. Consilience: The Unity of Knowledge

Page 8:

Scientific computing is becoming increasingly data intensive.

We are becoming increasingly able to

• Answer previously intractable questions,

• More efficiently solve problems,

• Characterize the natural world to a greater level of detail

Page 9:

An era of large datasets

Large Hadron Collider

• 15 Pbytes/year (15 x 10^6 Gbytes, 15 x 10^3 Tbytes)

Pan-STARRS (panoramic survey telescope)

• 2 Gbytes per image, taken every 30 sec from 4 cameras

• Several Tbytes/night/telescope

Natl. Human Genome Research Institute

• 1000 genomes = 200 Tbytes

Beijing Genomics Institute

• 5 Tbytes/day
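
The unit conversions and the Pan-STARRS nightly volume above can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units, a hypothetical 10-hour observing night, and one 2-Gbyte image per camera every 30 sec, which is one possible reading of the slide.

```python
# Decimal storage units (assumed; they match the slide's powers of ten).
GB = 10**9
TB = 10**12
PB = 10**15

# Large Hadron Collider: 15 Pbytes/year expressed in other units.
lhc_bytes_per_year = 15 * PB
lhc_gbytes = lhc_bytes_per_year // GB  # 15 x 10^6 Gbytes
lhc_tbytes = lhc_bytes_per_year // TB  # 15 x 10^3 Tbytes


def nightly_bytes(image_bytes, interval_s, night_hours):
    """Bytes one camera produces in a night at one image per interval."""
    exposures = night_hours * 3600 // interval_s
    return exposures * image_bytes


# Image size and cadence come from the slide; NIGHT_HOURS is hypothetical.
NIGHT_HOURS = 10
per_camera = nightly_bytes(2 * GB, 30, NIGHT_HOURS)

print(f"LHC: {lhc_gbytes:,} Gbytes/year = {lhc_tbytes:,} Tbytes/year")
print(f"Pan-STARRS: {per_camera / TB:.1f} Tbytes/night per camera, "
      f"{4 * per_camera / TB:.1f} Tbytes/night for 4 cameras")
```

Under these assumptions a single camera lands in the "several Tbytes/night" range quoted on the slide; a shorter or longer night scales the figure proportionally.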

Page 10:

GenBank Sequence Growth (to 2008)

Page 11:

What it takes to move Big Data

1 Gbyte data

• T1 line: 1.5 hrs

• Thin Ethernet: 14 min

• Fast Ethernet: 1 min

1 Tbyte data

• T1 line: 65 days, 22.5 hrs

• Thin Ethernet: 10 days, 4.3 hrs

• Fast Ethernet: 1 day, 0.5 hrs

• Gig-E: 2 hrs, 26 min
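
The transfer times above can be roughly reproduced by dividing the payload in bits by the line rate. The sketch below assumes nominal rates (T1 at 1.544 Mbit/s, Thin Ethernet at 10 Mbit/s, Fast Ethernet at 100 Mbit/s, Gig-E at 1 Gbit/s), binary units (1 Gbyte = 2^30 bytes, 1 Tbyte = 2^40 bytes), and no protocol overhead. None of these assumptions is stated on the slide; they were chosen because they land close to the listed figures.

```python
# Nominal link rates in bits per second (assumed, not given on the slide).
LINK_RATES_BPS = {
    "T1 line": 1.544e6,
    "Thin Ethernet": 10e6,
    "Fast Ethernet": 100e6,
    "Gig-E": 1e9,
}

# Binary units (assumed).
GBYTE = 2**30
TBYTE = 2**40


def transfer_seconds(n_bytes, rate_bps):
    """Ideal transfer time: payload bits divided by line rate, no overhead."""
    return n_bytes * 8 / rate_bps


def format_duration(seconds):
    """Render seconds as days plus hours, hours, or minutes."""
    days, rem = divmod(seconds, 86400)
    hours = rem / 3600
    if days >= 1:
        return f"{int(days)} d {hours:.1f} h"
    if hours >= 1:
        return f"{hours:.1f} h"
    return f"{seconds / 60:.1f} min"


for label, size in (("1 Gbyte", GBYTE), ("1 Tbyte", TBYTE)):
    for link, rate in LINK_RATES_BPS.items():
        duration = format_duration(transfer_seconds(size, rate))
        print(f"{label} over {link}: {duration}")
```

Real transfers are slower than these ideal figures because of protocol overhead and shared links, which strengthens the slide's point about moving a Tbyte over anything less than Gig-E.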

Page 12:

Moving into the cloud

Scientists need to be able to move and share large datasets.

Cloud/Cluster/Grid computing.

Not just for holding data, but for computations

Reduce the need to repeatedly move the same datasets.

Page 13:

Libraries: Provide access to and dissemination of information…

Page 14:

Existing Systems for Handling Big Data

XSEDE (replaces TeraGrid)

• A virtual system that scientists can use to interactively share supercomputer resources, data, and expertise

• Composite of several university advanced computer centers

iPlant (Texas Advanced Computing Center)

• Plant genomic data

• Cyberinfrastructure for the transfer, storage, analysis, visualization, meta-data control, discovery, etc.

• Cloud computing

Page 15:

Existing Big Data Systems (cont.)

Three Rivers Optical Exchange (part of XSEDE)

Amazon Cloud Computing

• Purchase computing power and storage, as needed

John Wesley Powell Center for Analysis & Synthesis

• USGS

• Earth sciences issues

• “Enhancing scientific discovery & problem solving through integrated research.”

European grid systems

Watson (?)

Page 16:

ARS Could Provide Leadership for Agricultural Data

OSTP Big Data Research and Development Initiative

John Holdren (3/29/2012)

• The government is underinvesting in data management

• The process of going from data to knowledge to understanding is being inhibited

• Human capital needs

  • People with deep analytical skills

  • Data-savvy managers/executives

  • Technicians with greater IT savvy, for both structured and unstructured data

Page 17:

What does ARS have to add?

• Decision support software could operate from a cloud system

• Public databases could be better organized and more easily accessible, collectively

• Large data

  • Currently wasting money on redundant hardware and software

  • Currently have difficulty moving the data

  • Cloud systems facilitate fusing datasets

• ARS capable of long-term stability for storage, analyses

Page 18:

Thus this Workshop Will

Gather together ARS scientists

• who are already working with large data,

• or with experience and knowledge of our current database collections,

• or who are trying to work with Big Data

Include speakers familiar with Big Scientific Data issues, who have developed solutions

Develop a Vision for what an ARS solution should look like.

Page 19:

Outcome of the Workshop

A white paper describing a vision for ARS Big Data, including examples of current needs and an infrastructure for meeting current and future needs.

This infrastructure will include

IT resources

Intellectual resources

Personnel resources

Page 20:

Recipients of the Information

• ARS Administrators (AC Council)

• ARS Office of National Programs

• OCIO and IT Specialists in the Field

• ARS Scientific Staff (scientists, technicians, computational biologists, statisticians)

Page 21:

The climb is steep, but there are cairns along the way.

Page 22:

Thank you!