First Hands-On Workshop on Leveraging High Performance Computing Resources for Managing Large Datasets

Ritu Arora

Email: [email protected]

October 27, 2014

Very Grateful to the Workshop Sponsors

• National Science Foundation (NSF)

• Texas Advanced Computing Center (TACC)

• National Energy Research Scientific Computing Center (NERSC)

• Lawrence Livermore National Lab (LLNL)

• Extreme Science and Engineering Discovery Environment (XSEDE)

What & Why?

What are High Performance Computing (HPC) and High Throughput Computing (HTC)?

Why use HPC and high-end storage resources for managing large datasets?

What policies and procedures should one be aware of while using open-science HPC and high-end resources for managing large datasets?

Why this workshop?

Large Datasets – “Large” is Relative Here

Source: http://www.garot.com/travels/2006/03_Mar_BigBend/images_11.asp

Source: http://1hdwallpapers.com/wide_load_comin_through-wallpaper.html

• “Large” is used mostly in relative terms

• Beyond a certain point, the magnitude of data by itself stops being the differentiator

• Bill Gates is already super rich, and I would still call him “super rich” even if he added more to his already large wealth. Likewise, 70 PB of data is large for me, and so is 100 PB

Some Challenges in Handling Large Datasets

• Data transfer becomes challenging

– Transferring 4.3 TB of data from the Stampede supercomputer in Austin to the Gordon supercomputer in San Diego took approx. 210 hours

– The transfer was restarted about 14 times between June 3 and June 18, 2014 (about 15 days)

– Had the transfer run without any interruptions, it would have completed in about 9 days at the given speed

– Interruptions had multiple causes: maintenance on Stampede or Gordon, other file-system issues, and network traffic or available bandwidth all affected the data transfer rate

• Data processing in a time-bound manner becomes challenging

• Data search also becomes challenging if proper data management tools and strategies are not used
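
The transfer figures above can be checked with a little arithmetic. The sketch below is only an illustration: it assumes decimal units (1 TB = 10**12 bytes) and reads the 210 hours as time spent actively transferring, which matches the "about 9 days" estimate on the slide. The function and variable names are hypothetical.

```python
# Figures quoted on the slide.
DATA_TB = 4.3          # amount of data transferred, in terabytes
ACTIVE_HOURS = 210     # approximate time spent transferring (assumed to exclude downtime)
WALL_CLOCK_DAYS = 15   # June 3 to June 18, 2014


def transfer_summary(data_tb, active_hours, wall_clock_days):
    """Return (average rate in MB/s, active days, fraction of wall-clock time used)."""
    data_bytes = data_tb * 10**12            # assumption: decimal terabytes
    active_seconds = active_hours * 3600
    rate_mb_per_s = data_bytes / active_seconds / 10**6
    active_days = active_hours / 24
    utilization = active_days / wall_clock_days
    return rate_mb_per_s, active_days, utilization


rate, days, util = transfer_summary(DATA_TB, ACTIVE_HOURS, WALL_CLOCK_DAYS)
print(f"Average rate: {rate:.2f} MB/s")
print(f"Uninterrupted transfer time: {days:.2f} days")
print(f"Share of the 15 days spent transferring: {util:.0%}")
```

Under these assumptions the average rate is a little under 6 MB/s, and the interruptions stretched a transfer of roughly 9 days to about 15.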

Taming the Large Dataset – Divide and Conquer
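
One way to read "divide and conquer" here, as a minimal sketch rather than the workshop's actual method: split the dataset into chunks, process each chunk independently, and combine the partial results. The file names, sizes, and helper functions below are hypothetical, and a thread pool stands in for what would be separate tasks or jobs on an HPC system.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_size):
    """Divide: cut a list of work items into consecutive chunks."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def process_chunk(chunk):
    """Conquer: placeholder per-chunk work (here, summing file sizes)."""
    return sum(size for _name, size in chunk)


def total_size(files, chunk_size=2, workers=4):
    """Combine: process the chunks concurrently and add up the partial results."""
    chunks = split_into_chunks(files, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(process_chunk, chunks))
    return sum(partial_sums)


# Hypothetical (file name, size in MB) pairs standing in for a large dataset.
files = [("a.tif", 120), ("b.tif", 300), ("c.tif", 80), ("d.tif", 500), ("e.tif", 50)]
print(total_size(files))
```

The same pattern applies when the per-chunk work is a checksum, a format conversion, or a metadata extraction step: only `process_chunk` and the final combine step change.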

What Resources and Skills are Needed?

• High-end storage and processing infrastructure

– Awareness that these resources are available at no direct cost is critical

• Scalable software tools for pre-processing, processing, and post-processing the large datasets

• Strategies for fast data transfer, processing, preservation, archival, and data sharing

– Culling

– Virtualization

– Visualization

– Human-Computer Interaction

• Skills development and training are needed, especially an understanding of basic Linux and the HPC user environment

Agenda

Morning Session - 8:30 AM to 12:10 PM

• 8:45 AM - 9:15 AM: Computing and Data Management at the Joint Genome Institute, by Kirsten Fagnan

• 9:15 AM - 9:45 AM: Unlocking the Power of High Performance Computing for Big Humanities Data Curation, by Jessica Trelogan

• 9:45 AM - 10:15 AM: IDC Update on How Big Data Is Redefining High Performance Computing, by Earl Joseph (IDC)

• 10:15 AM - 10:30 AM: Break

• 10:30 AM - 11:00 AM: Introduction to HPC and the National CyberInfrastructure, by Ritu Arora (TACC)

• 11:00 AM - 12:10 PM: Introduction to Linux & connecting to Stampede or Hopper, hands-on session, by Ritu Arora (TACC)

Lunch Session - 12:10 PM to 1:30 PM

• Workshop participants who are not supported by the NSF travel award are on their own for lunch

• Networking and mentoring activities for students sponsored by the NSF travel grant, coordinated by Elizabeth Bautista (NERSC) and Valerie Shilling (TACC)

– Each group of 3-4 students will be assigned a mentor from the list below

• Elizabeth, Kirsten, Jessica, Nicole, Raquell, Jeff Todd, Daniel Jacob Fedor-thurman

– Please feel free to ask for a mentor with whom you would like to interact

– Students are encouraged to discuss any technical and career-related questions with their mentors over lunch

Afternoon Session – 1:30 PM to 6:00 PM

• 1:30 PM - 1:40 PM: Introduction to the test-case to be used for further exercises, by Jessica Trelogan

• 1:40 PM - 2:40 PM: Hands-on exercises on data transfer, calculating checksum, and metadata extraction, by Ritu Arora

• 2:40 PM - 3:00 PM: Data transfer and sharing with Globus, by Vas Vasiliadis

• 3:00 PM - 3:20 PM: Design and development of data management workflows on HPC resources, by Ritu Arora

• 3:20 PM - 3:35 PM: Break

• 3:35 PM - 4:00 PM: Improv Session, by Raquell Holmes

• 4:00 PM - 6:00 PM: Hackathon Sessions, coordinated by TACC and NERSC staff

Evening Session – 7:00 PM to 8:30 PM

• This is a closed-group meet-up organized for students sponsored by the NSF travel grant

• Dinner meet-up with Lawrence Berkeley National Lab and Lawrence Livermore National Lab

– Venue: Check with Valerie

Hackathon & Improv Science Sessions

• Collaborative work is often required to solve big problems

– Need to bring in perspectives from different domains

– Need to develop a good working relationship with your team members so that you can achieve the project goals on time

• The workshop audience is diverse in skill set and background: learn from each other and foster collaboration

– Library and Information Science, Computer Science, Health Science, Biology, Information Technology, …

• You will be asked to pick a challenge to solve with your team: a chance to apply the knowledge that you gained during the workshop

Introduction to Speakers

Kirsten Fagnan

Kirsten Fagnan is a high-performance computing and bioinformatics consultant at the Lawrence Berkeley National Lab's National Energy Research Scientific Computing Center (NERSC) and the Joint Genome Institute (JGI). As a consultant, Fagnan is responsible for assisting scientists in managing their datasets, optimizing and debugging user code for data analysis, providing strategic project support for data management, training, and maintaining software applications and libraries. Fagnan's interests are scientific computing, mathematical biology and education technology. Fagnan earned her BA from UC Berkeley in 2002 and her PhD in Applied Mathematics at the University of Washington in 2010, and joined LBNL in 2010 as a petascale postdoctoral fellow.

Kirsten's Talk: Computing and Data Management at the Joint Genome Institute

Jessica Trelogan

Jessica Trelogan has an MA in Classics from the University of Texas at Austin, where she has also completed extensive graduate coursework in Geography. She is a Research Associate at the Institute of Classical Archaeology specializing in GIS and remote sensing, with long experience in the application of those technologies to archaeological fieldwork, conservation, research and publishing. Currently she is also acting as curator of a large and complex data collection that represents several decades' worth of excavation, survey, and study at sites in Italy and Ukraine. She has presented papers related to that work at conferences on Computing in Archaeology, Digital Curation, and Digital Humanities.

Jessica's Talk: Unlocking the Power of High Performance Computing for Big Humanities Data Curation

Earl Joseph

Earl Joseph is IDC's Program Vice President for High-Performance Computing (HPC) and Executive Director of the HPC User Forum. He leads IDC's HPC technical computing team, driving research and consulting efforts associated with the United States, Europe and Asia-Pacific markets for technical servers and supercomputers, clouds, visualization and clustering. This research includes market sizing, market share, segmentation, tracking, trending, data center issues, and vendor analysis for multi-user technical server technology. Dr. Joseph advises IDC clients on the competitive, managerial, technological, integration and implementation issues for technical servers. He also founded and operates IDC's highly successful, high-end HPC User Forum. Dr. Joseph holds a Ph.D. from the University of Minnesota where his research focus was the strategic management of high technology firms, and an undergraduate degree in business and technology from the University of Minnesota.

Earl's Talk: IDC Update on How Big Data Is Redefining High Performance Computing

Vas Vasiliadis

Vas is Director of Products, Communication and Development at the Computation Institute (CI) at the University of Chicago. Vas has over 25 years of experience in operational and consulting roles, spanning strategy, marketing and product management. Most recently, Vas was a principal at Strategos, the innovation consulting firm founded by Gary Hamel, where he led Fortune 100 management teams in defining their growth agenda. Prior to Strategos, Vas led marketing efforts at Univa, a leading provider of grid and cloud computing solutions. Vas joined Univa's founding team shortly after inception and was instrumental in defining the product vision, raising venture capital and launching the company's initial products.

Vas's Talk: Data transfer and sharing with Globus

Administrative Trivia

LinkedIn.com Group

• A closed group for discussing the use of HPC for big data management and opportunities in these areas

– Ask questions that you have on these topics

– Find potential collaborators

• For continued support and mentoring even after the workshop, please join the LinkedIn.com group

– Leveraging High Performance Computing Resources for Managing Large Datasets

Post-Workshop Survey

• A link to the survey for the workshop will be sent to you by the end of the day

• Please take a few minutes of your time to give feedback on the quality of the workshop

– This is crucial for reporting to funding agencies

– Important for us to understand what we did right or wrong so that we can adjust the agenda of our next workshop

Volunteers

• Six student volunteers, NERSC staff, and TACC staff are available to assist with the sessions

– Identify the volunteers near you

• Anyone taking notes/photos?

– Thank you in advance – could you please share those with us?

• Post directly to the LinkedIn.com group

• Via email to [email protected]

For Students Funded through NSF

• Please see Valerie to collect cash for your per diem and ground transportation - during breaks, lunch, or in the evening at 6 PM

• Please note: If you incur any ground transportation expenses that are larger than what can be covered through the cash provided, please email me and Valerie ASAP

Wishing you a very productive day!

Thanks!
