georgia advanced computing resource center …gacrc software software selection, installation,...

GEORGIA ADVANCED COMPUTING RESOURCE CENTER (GACRC) RESOURCES

Enterprise Information Technology Services Michael S. Lucas December 5th, 2012

GACRC STAFF

Greg Derda - IT Manager/Bioinformatics Consultant
Yecheng Huang - Bioinformatics Consultant
Shan-Ho Tsai - Computational Physics and High Performance Computing Consultant
Paul Brunk - Principal Unix Systems Administrator
Curtis Combs - Principal Storage Administrator
Jason Stone - Servers Systems Administrator


GACRC HARDWARE RESOURCES

HPC Computers

230 compute nodes (2600 compute cores), 32 with InfiniBand connectivity. Job submission for all nodes is managed by Univa Grid Engine, a variant of the Sun Grid Engine queue management system (zcluster).

An IBM p655 AIX cluster with a total of 32 8-way nodes available to general users (pcluster).

For large memory jobs, there are:
  Four 8-core, 192GB high-memory compute nodes
  Ten 12-core, 256GB high-memory compute nodes
  Two 32-core, 512GB high-memory compute nodes

Six 32-core, 64GB GPU control nodes.

One nVidia Tesla S1070 with four GPU cards (960 GPU cores) for programs written to use this architecture.

One nVidia 2075 GPU processor (448 GPU cores).
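As a rough illustration of how the high-memory node classes listed above might be matched to a job, the sketch below picks the smallest node class that satisfies a core and memory request. The `NodeClass` type and the `smallest_fit` helper are hypothetical names introduced for this example only; they are not part of any GACRC tool or of Grid Engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeClass:
    count: int      # number of nodes of this type
    cores: int      # cores per node
    memory_gb: int  # memory per node, in GB


# High-memory node classes as listed on the slide.
HIGH_MEMORY_NODES = [
    NodeClass(count=4, cores=8, memory_gb=192),
    NodeClass(count=10, cores=12, memory_gb=256),
    NodeClass(count=2, cores=32, memory_gb=512),
]


def smallest_fit(nodes, cores_needed, memory_gb_needed):
    """Return the node class with the least memory that fits the job, or None."""
    candidates = [
        n for n in nodes
        if n.cores >= cores_needed and n.memory_gb >= memory_gb_needed
    ]
    return min(candidates, key=lambda n: (n.memory_gb, n.cores), default=None)


# Aggregate capacity across all high-memory nodes.
total_cores = sum(n.count * n.cores for n in HIGH_MEMORY_NODES)
total_memory_gb = sum(n.count * n.memory_gb for n in HIGH_MEMORY_NODES)

print(total_cores, total_memory_gb)
print(smallest_fit(HIGH_MEMORY_NODES, cores_needed=10, memory_gb_needed=200))
```

A single-node job that needs more than 512GB or more than 32 cores does not fit any of these classes, in which case `smallest_fit` returns `None`.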


GACRC HARDWARE RESOURCES (CONT.)

Storage and Connectivity

The GACRC has a three-tiered storage architecture:
  Tier 1 = 150TB (usable) on a Panasas ActiveStor 12 storage cluster
  Tier 2 = 165TB (raw) on five Sun Fire X4500 Thumpers
  Tier 3 = 330TB (raw) on ten TEC Services ARCH storage subsystems

All of the GACRC's existing frames and storage subsystems are interconnected using Brocade switches over private networks and protected by a pair of Juniper firewall security appliances.


GACRC HARDWARE RESOURCES (CONT.)

New Hadoop Cluster


GACRC SOFTWARE

Software selection, installation, maintenance and troubleshooting, based on researchers' needs utilizing both open source solutions and commercial offerings.

Over 300 applications (Bioinformatics, Computational Chemistry, Computational Physics, Statistics and more) including compilers, debuggers, and math libraries.


RESEARCH COMPUTING RESOURCE SURVEY

We will be sending this link to you via email after the Lunch and Learn and appreciate your feedback.

https://ugeorgia.qualtrics.com/SE/?SID=SV_dmnOtNRxfsI9ZbL




CONTACT INFORMATION

Name: Michael Lucas

Email: [email protected]

Web: eits.uga.edu


AMAZON WEB SERVICES

Enterprise Information Technology Services Shawn Ellis December 5th, 2012

INFRASTRUCTURE AS A SERVICE (IAAS)

Customers still run their own servers
Purely virtual datacenter: no racks, hardware, network cabling, etc.
17 world-wide locations
UGA chooses which locations host UGA data and processing
Elasticity

Friday, January 04, 2013


AMAZON WEB SERVICES ENTERPRISE AGREEMENT

Features
  Right to audit: FISMA, SOC1 and Security Standards
  Pricing, pay-ahead credit

Challenges
  UGA is one of the first public institutions to negotiate an Enterprise Agreement for AWS, definitely the first in USG.
  Indemnity, data privacy, FERPA, HIPAA, GLBA
  Bandwidth, latency
  New paradigms for system administration

Advantages
  Amazon is very interested in working with Higher Education partners now!
  We have received funding grants, access to top-level executives and technical people.


WHAT IS HAPPENING?

The contract is being negotiated
EITS is working with researchers on testing Hadoop clustering, Hadoop with R.
Questions about communicating cost and capacity to existing and potential researchers in this model.


CONTACT INFORMATION

Name: Shawn Ellis

Email: [email protected]

Web: eits.uga.edu


BIG DATA: Why Should You Care and How Can You Deal with It?

Lakshmish Ramaswamy Dept. of Computer Science

[email protected]

You Should Care Because...

Big Data is Everywhere
LSST: 30 TB/Day
LHC: 16 TB/Day
1e+11 Base Pairs
Health: 6 TB/Day
260M Tr/Day
500 TB/Day
21 TB/Day
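To put the daily rates above in perspective, the short sketch below converts the three labeled figures (LSST, LHC, Health) into yearly volumes. It assumes a 365-day year and decimal units (1 PB = 1000 TB); both are assumptions made for this example, not figures from the slide.

```python
DAYS_PER_YEAR = 365   # assumption: non-leap year, continuous collection
TB_PER_PB = 1000      # assumption: decimal (SI) units

# Daily rates in TB/day, taken from the labeled entries on the slide.
DAILY_RATES_TB = {"LSST": 30, "LHC": 16, "Health": 6}


def yearly_volume_pb(tb_per_day):
    """Convert a sustained TB/day rate into PB accumulated per year."""
    return tb_per_day * DAYS_PER_YEAR / TB_PER_PB


yearly = {name: yearly_volume_pb(rate) for name, rate in DAILY_RATES_TB.items()}
for name, pb in yearly.items():
    print(f"{name}: {pb:.2f} PB/year")
```

Under these assumptions, a source producing 30 TB/day accumulates roughly 11 PB in a year, far beyond the capacity of a single server.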

What is Big Data?

• “Data whose size forces us to look beyond the tried and tested methods prevalent at that time”
• Currently, it is data that is too large to be placed in an RDBMS and analyzed with the help of a desktop-based statistics package
  – Requires parallel algorithms running on a server cluster
• The V3 View
• Volume --- Terabytes → Zettabytes
• Velocity --- Batch Data → Streaming Data
• Variety --- Structured → Structured, semi-structured, textual, multi-media, graphs

What is Causing Data Explosion?

• Proliferation of pervasive computing/communication devices
• Inexpensive data collection and storage
• Transformation to an e-based society
• Desire to monitor and harness micro-level characteristics and trends
  – Sampling is not a preferred option
• Data Hoarding
  – The “my great-grand PhD student may find it useful” syndrome !!!

Where is the Data Coming From?

• End-User created data
  – Emails, content shared on SNAs, Wikipedia, photos/videos, tweets
• Data collected about people
  – Financial data, surveillance cameras, academic records
• Scientific data
  – Atmospheric monitoring, high-energy physics, oceanics, deep-space exploration
• Medical data
  – Diagnostics, physician opinions, genetic information, scans
• Business data
  – Stock market, currency market, company performance, logistics, retailing, inventory

Limitations of Traditional Technologies

• RDBMS, data warehousing, data mining, supercomputers
• DBMSs designed for efficient transactions, not efficient analytics
• Data is increasingly unstructured
• Supercomputers are expensive, hard to program and hard to manage
• Data mining algorithms are centralized

Easier to Push Data into System than getting Information Out of the System

Big Data Computing Trends

• Clusters of commodity servers
• Distributed file system
• Simple and efficient data management
  – No Schema, no indexing, no transactional support
• High-level programming interfaces
• Simplified infrastructure management
• Powerful fault tolerance

Big Data Technologies

• Hadoop (map-reduce)
  – Cluster-based parallel processing framework
• Pig and Pig-latin
  – SQL-like interface for creating map-reduce programs
• HBase and Apache Cassandra (BigTable)
  – Non-relational distributed databases aka Key-value stores
• MongoDB
  – High-performance document-oriented storage
• Giraph and GPS (Pregel)
  – Cluster-based Graph Processing Engines
• Pegasus
  – Hadoop-based graph mining tool
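To give a feel for the vertex-centric style that Pregel-inspired engines such as Giraph and GPS use, here is a minimal single-process sketch, not the API of either system. Each vertex starts with its own id as a label; in every superstep a vertex reads incoming messages, adopts the smallest label it has seen, and notifies its neighbors only if its label changed. When no messages remain, vertices in the same connected component share a label. The function name `pregel_min_label` and the `max_supersteps` parameter are hypothetical, chosen for this example.

```python
def pregel_min_label(edges, max_supersteps=50):
    """Label connected components by propagating the minimum vertex id."""
    # Build an undirected adjacency map from the edge list.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Every vertex starts with its own id as its label.
    labels = {v: v for v in neighbors}

    # Superstep 0: every vertex sends its label to all of its neighbors.
    inbox = {v: [] for v in neighbors}
    for v in neighbors:
        for n in neighbors[v]:
            inbox[n].append(labels[v])

    for _ in range(max_supersteps):
        # Halt when no vertex has pending messages.
        if not any(inbox.values()):
            break
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            # A vertex only acts (and sends) if a smaller label arrived.
            if messages and min(messages) < labels[v]:
                labels[v] = min(messages)
                for n in neighbors[v]:
                    outbox[n].append(labels[v])
        inbox = outbox
    return labels


components = pregel_min_label([(1, 2), (2, 3), (4, 5)])
print(components)
```

In a real engine the vertices are partitioned across a cluster and messages cross the network between supersteps; the loop above only mimics that synchronization on one machine.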

Hadoop in Action

Input records (key, value):
30045 90, 30602 88, 30045 44, 30005 60, 30602 38, 30605 50, 30045 58, 30005 83, 30027 92, 30606 66, 30602 73, 30601 82, 30606 44

Input splits:
  Split 1: 30045 90, 30602 88, 30045 44
  Split 2: 30005 60, 30602 38, 30605 50
  Split 3: 30606 66, 30602 73, 30601 82, 30606 44

MAPPER output, (sum, count) per key:
  Mapper 1: 30045 (134, 2), 30602 (88, 1)
  Mapper 2: 30005 (60, 1), 30602 (38, 1), 30605 (50, 1)
  Mapper 3: 30606 (110, 2), 30602 (73, 1), 30601 (82, 1)

REDUCER output, average per key:
  Reducer 1: 30602 66.3, 30005 60, 30605 50
  Reducer 2: 30045 67, 30606 55, 30601 82
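The flow in this figure can be sketched in plain Python. This is a single-process illustration of the map-reduce pattern, not Hadoop's Java API: each mapper pre-aggregates its split into one (sum, count) pair per key, and the reducer merges those pairs before dividing. Passing (sum, count) rather than per-split averages matters, because averaging the averages would weight splits incorrectly when counts differ. The record values come from the three input splits in the figure; the function names `map_split` and `reduce_pairs` are hypothetical, and the two reducers in the figure are collapsed into one function here for brevity.

```python
from collections import defaultdict

# The three input splits shown in the figure: (key, value) records.
SPLITS = [
    [("30045", 90), ("30602", 88), ("30045", 44)],
    [("30005", 60), ("30602", 38), ("30605", 50)],
    [("30606", 66), ("30602", 73), ("30601", 82), ("30606", 44)],
]


def map_split(records):
    """Mapper with local combining: one (sum, count) pair per key in the split."""
    partial = defaultdict(lambda: [0, 0])
    for key, value in records:
        partial[key][0] += value
        partial[key][1] += 1
    return {key: (total, count) for key, (total, count) in partial.items()}


def reduce_pairs(mapper_outputs):
    """Reducer: merge (sum, count) pairs per key, then divide to get the average."""
    merged = defaultdict(lambda: [0, 0])
    for output in mapper_outputs:
        for key, (total, count) in output.items():
            merged[key][0] += total
            merged[key][1] += count
    return {key: total / count for key, (total, count) in merged.items()}


mapper_outputs = [map_split(split) for split in SPLITS]
averages = reduce_pairs(mapper_outputs)
for key in sorted(averages):
    print(key, round(averages[key], 1))
```

On a cluster, each call to `map_split` would run on the node holding that split, and keys would be partitioned across several reducers, so only the small (sum, count) pairs travel over the network.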

Current Research

• Resource-efficient, scalable and quality-aware data collection mechanisms
• High-speed networking
• Analytics on globally distributed data with globally distributed clusters
• Approximate analytics
• Scalable machine learning and data mining algorithms that can work in a distributed setting
• Security and privacy

Advanced Topics in Data Intensive Computing

• New course that was offered this semester
• Covers many of the Big Data technologies
• Requires Java programming and database experience

http://www.cs.uga.edu/~laks/courses/adic-fall2012.html

THANK YOU !!!

Business Analytics Concentration

MIS @ Terry

December 5th, 2012

MBA Business Analytics Concentration

• Five core classes:
  – Data Management
  – Business Process Management
  – Predictive Analytics
  – Data Warehousing and Mining
  – Emerging Analytical Technologies, Platforms & Applications
• Three electives
  – Energy Informatics
  – Marketing Analytics and Decision-Making
  – Introductory Biostatistics
  – Introduction to Epidemiology
  – Etc.

Hadoop Implementation

• Robert Bearden, CEO of Hortonworks, Terry Entrepreneur-in-Residence
• Hortonworks provided UGA with education and support for installing a Hadoop cluster to enable big data education and research
• Will be used in Spring 2013 in the Data Management and Energy Informatics classes

Big Data and The Changing Nature of Science

(…and the Importance of Cyberinfrastructure Centers)

Nick Berente Terry College of Business

University of Georgia

Traditional Science:

Deduction (abduction) Hypothesize-test

Presentation notes: Lone scientist in a lab

Computational Science: Still (largely) scientific method

Images: birth of a galaxy; hurricane simulation; heart muscle mitochondria. Source of images: TACC

Computational Science Cyberinfrastructure

Computational Resources: “Big Iron” – Condo model = Cycles Memory Disciplinary / Interdisciplinary code Parallelized code Gateways & Workflows Visualization

Presentation notes: Grad student writing spaghetti code

Big Data Science: Observation in Natural Sciences

Induction! Pattern identification & matching

ALMA Telescope Array: 66 telescopes

Global Ocean Observing System: 3000+ sensors



Big Data Science: Observation in Social Sciences

social media User-generated content – social network analysis

Sequence Analysis: “Organizational Genetics”



Big Data Science Cyberinfrastructure

Support for Inductive analysis – pattern identification and matching

Everything associated with computational science plus increased focus on interpretation:
  Visualization
  Next-generation analytic methods
  Unstructured / multi-source data

and, of course: Network throughput & Storage



My Research: Next Generation Computational Science Centers

Presentation notes: Now globally distributed instruments and teams, infrastructural software, data, etc.

Centers

Center: “a facility providing a place for a particular activity or service” (Merriam-Webster)

Cyberinfrastructure Innovation Centers

Centers – Significant Value

For universities, regions, nations, and globally
- Science
- Economic (local)
- Cross-disciplinary Knowledge
- Technological Innovation

RCN: Managing Collaborative Centers 1240160

EAGER: Supporting Successful Management- CI Centers 1148996

CI-TEAM: “Science Executive” education 1059153

Three NSF Research Projects

- Managing CI Centers – Oct. 2011 – UGA, Athens, GA
- Virtual Organizations – June 2012 – Case, Cleveland, OH
- Managing CI Centers – Feb. 2013 – UM, Ann Arbor, MI
- Science Executive Ed – May 2013 – UGA, Atlanta, GA
- Scientific Software – Oct. 2013 – UT, Austin, TX
- Virtual Organizations – May 2014 – UI, Urbana-Champaign, IL
- Scientific Software – May 2015 – CMU, Pittsburgh, PA
- Managing CI Centers – May 2016 – UGA, Atlanta, GA

Series of Workshops & Reports

Research Directions

Enabling sustained innovation (centers vs. projects)

Metrics & benchmarking

Funding & human resource issues

Software engineering

CI issues for unstructured data for social science

Thank you! Nick Berente [email protected]

NSF OCI # 1059153
