GEORGIA ADVANCED COMPUTING RESOURCE CENTER (GACRC) RESOURCES
Enterprise Information Technology Services Michael S. Lucas December 5th, 2012
GACRC STAFF
Greg Derda - IT Manager/Bioinformatics Consultant
Yecheng Huang - Bioinformatics Consultant
Shan-Ho Tsai - Computational Physics and High Performance Computing Consultant
Paul Brunk - Principal Unix Systems Administrator
Curtis Combs - Principal Storage Administrator
Jason Stone - Servers Systems Administrator
GACRC HARDWARE RESOURCES
HPC Computers
230 compute nodes (2600 compute cores), 32 with InfiniBand connectivity. Job submission for all nodes is managed by Univa Grid Engine, a variant of the Sun Grid Engine queue management system (zcluster).
An IBM p655 AIX cluster with a total of 32 8-way nodes available to general users (pcluster).
For large memory jobs, there are:
  Four 8-core, 192GB high-memory compute nodes
  Ten 12-core, 256GB high-memory compute nodes
  Two 32-core, 512GB high-memory compute nodes
Six 32-core, 64GB GPU control nodes.
One nVidia Tesla S1070 with four GPU cards (960 GPU cores) for programs written to use this architecture.
One nVidia 2075 GPU processor (448 GPU cores)
GACRC HARDWARE RESOURCES (CONT.)
Storage and Connectivity
The GACRC has a three-tiered storage architecture.
  Tier 1 = 150TB (usable) on a Panasas ActiveStor 12 storage cluster
  Tier 2 = 165TB (raw) on five Sun Fire X4500 Thumpers
  Tier 3 = 330TB (raw) on ten TEC Services ARCH storage subsystems
All of the GACRC's existing frames and storage subsystems are interconnected using Brocade switches over private networks and protected by a pair of Juniper firewall security appliances.
GACRC HARDWARE RESOURCES (CONT.)
New Hadoop Cluster
GACRC SOFTWARE
Software selection, installation, maintenance and troubleshooting, based on researchers' needs utilizing both open source solutions and commercial offerings.
Over 300 applications (Bioinformatics, Computational Chemistry, Computational Physics, Statistics and more) including compilers, debuggers, and math libraries.
RESEARCH COMPUTING RESOURCE SURVEY
We will be sending this link to you via email after the Lunch and Learn and appreciate your feedback.
https://ugeorgia.qualtrics.com/SE/?SID=SV_dmnOtNRxfsI9ZbL
AMAZON WEB SERVICES
Enterprise Information Technology Services Shawn Ellis December 5th, 2012
INFRASTRUCTURE AS A SERVICE (IAAS)
Customers still run their own servers
Purely virtual datacenter: no racks, hardware, network cabling, etc.
17 world-wide locations
UGA chooses which locations host UGA and processing
Elasticity
Friday, January 04, 2013
AMAZON WEB SERVICES ENTERPRISE AGREEMENT
Features
  Right to audit: FISMA, SOC1 and Security Standards
  Pricing, pay-ahead credit
Challenges
  UGA is one of the first public institutions to negotiate an Enterprise Agreement for AWS, definitely the first in USG.
  Indemnity, data privacy, FERPA, HIPAA, GLBA.
  Bandwidth, latency
  New paradigms for system administration
Advantages
  Amazon is very interested in working with Higher Education partners now!
  We have received funding grants, access to top-level executives and technical people.
WHAT IS HAPPENING?
The contract is being negotiated
EITS is working with researchers on testing Hadoop clustering, Hadoop with R.
Questions about communicating cost and capacity to existing and potential researchers in this model.
BIG DATA: Why Should You Care and How Can You Deal with It?
Lakshmish Ramaswamy Dept. of Computer Science
You Should Care Because..
Big Data is Everywhere
LSST: 30 TB/Day
LHC: 16 TB/Day
1e+11 Base Pairs
Health: 6 TB/Day
260M Tr/Day
500 TB/Day
21 TB/Day
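To put the daily rates above on a yearly scale, here is a minimal Python sketch. It uses only the three rates whose labels survive on the slide (LSST, LHC, Health) and assumes decimal units (1 PB = 1000 TB) and a 365-day year; the names `DAILY_TB` and `yearly_pb` are illustrative, not from the talk.

```python
# Daily rates quoted on the slide, in terabytes per day.
DAILY_TB = {"LSST": 30, "LHC": 16, "Health": 6}

TB_PER_PB = 1000      # assumption: decimal units (1 PB = 1000 TB)
DAYS_PER_YEAR = 365   # assumption: sustained rate over a non-leap year


def yearly_pb(tb_per_day):
    """Convert a sustained TB/day rate into PB/year."""
    return tb_per_day * DAYS_PER_YEAR / TB_PER_PB


if __name__ == "__main__":
    for source, rate in DAILY_TB.items():
        print(f"{source}: {rate} TB/day is about {yearly_pb(rate):.2f} PB/year")
```

Even the smallest of these sources accumulates petabytes per year, which is well beyond what a single desktop machine can hold.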
What is Big Data?
• “Data whose size forces us to look beyond the tried and tested methods prevalent at that time”
• Currently it is data that is too large to be placed in an RDBMS and analyzed with the help of a desktop-based statistics package
    Requires parallel algorithms running on a server cluster
• The V3 View
    • Volume --- Terabytes → Zettabytes
    • Velocity --- Batch Data → Streaming Data
    • Variety --- Structured → Structured, semi-structured, textual, multi-media, graphs
What is Causing Data Explosion?
• Proliferation of pervasive computing/communication devices
• Inexpensive data collection and storage
• Transformation to an e-based society
• Desire to monitor and harness micro-level characteristics and trends
    Sampling is not a preferred option
• Data Hoarding
    The “my great-grand PhD student may find it useful” syndrome !!!
Where is the Data Coming From?
• End-User created data
    Emails, content shared on SNAs, Wikipedia, photos/videos, tweets
• Data collected about people
    Financial data, surveillance cameras, academic records
• Scientific data
    Atmospheric monitoring, high-energy physics, oceanics, deep-space exploration
• Medical data
    Diagnostics, physician opinions, genetic information, scans
• Business data
    Stock market, currency market, company performance, logistics, retailing, inventory
Limitations of Traditional Technologies
• RDBMS, data warehousing, data mining, supercomputers
• DBMSs designed for efficient transactions not efficient analytics
• Data is increasingly unstructured
• Supercomputers are expensive, hard to program and hard to manage
• Data mining algorithms are centralized
Easier to Push Data into System than getting Information Out of the System
Big Data Computing Trends
• Clusters of commodity servers
• Distributed file system
• Simple and efficient data management
    No Schema, no indexing, no transactional support
• High-level programming interfaces
• Simplified infrastructure management
• Powerful fault tolerance
Big Data Technologies
• Hadoop (map-reduce)
    Cluster-based parallel processing framework
• Pig and Pig-latin
    SQL-like interface for creating map-reduce programs
• HBase and Apache Cassandra (BigTable)
    Non-relational distributed databases aka Key-value stores
• MongoDB
    High-performance document-oriented storage
• Giraph and GPS (Pregel)
    Cluster-based Graph Processing Engines
• Pegasus
    Hadoop-based graph mining tool
Hadoop in Action
Input records (key value): 30045 90, 30602 88, 30045 44, 30005 60, 30602 38, 30605 50, 30045 58, 30005 83, 30027 92, 30606 66, 30602 73, 30601 82, 30606 44
Each MAPPER emits a (sum, count) pair per key for its split; each REDUCER outputs the average per key.
Split 1: 30045 90, 30602 88, 30045 44 → MAPPER → 30045 (134, 2), 30602 (88, 1)
Split 2: 30005 60, 30602 38, 30605 50 → MAPPER → 30005 (60, 1), 30602 (38, 1), 30605 (50, 1)
Split 3: 30606 66, 30602 73, 30601 82, 30606 44 → MAPPER → 30606 (110, 2), 30602 (73, 1), 30601 (82, 1)
REDUCER → 30602 66.3, 30005 60, 30605 50
REDUCER → 30045 67, 30606 55, 30601 82
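The flow in this figure can be sketched in plain Python. This is a single-process illustration of the map, shuffle and reduce steps, not Hadoop's actual API; the function names (`map_with_combiner`, `shuffle`, `reduce_average`, `run_job`) are hypothetical, and the input is assumed to be the three splits shown on the slide.

```python
from collections import defaultdict

# The three input splits shown on the slide, as (key, value) records.
SPLITS = [
    [("30045", 90), ("30602", 88), ("30045", 44)],
    [("30005", 60), ("30602", 38), ("30605", 50)],
    [("30606", 66), ("30602", 73), ("30601", 82), ("30606", 44)],
]


def map_with_combiner(split):
    """Map one split to {key: (sum, count)}, pre-aggregating locally."""
    partial = defaultdict(lambda: (0, 0))
    for key, value in split:
        total, count = partial[key]
        partial[key] = (total + value, count + 1)
    return dict(partial)


def shuffle(mapper_outputs):
    """Group the partial (sum, count) pairs from all mappers by key."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for key, pair in output.items():
            grouped[key].append(pair)
    return grouped


def reduce_average(pairs):
    """Combine the partial (sum, count) pairs for one key into an average."""
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return total / count


def run_job(splits):
    """Run map, shuffle and reduce in sequence and return {key: average}."""
    mapper_outputs = [map_with_combiner(split) for split in splits]
    grouped = shuffle(mapper_outputs)
    return {key: reduce_average(pairs) for key, pairs in grouped.items()}


if __name__ == "__main__":
    for key, average in sorted(run_job(SPLITS).items()):
        print(key, round(average, 1))
```

Emitting (sum, count) pairs rather than raw averages matters: averages of averages are wrong when splits hold different numbers of records per key, while sums and counts can be safely added across mappers.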
Current Research
• Resource-efficient, scalable and quality-aware data collection mechanisms
• High-speed networking
• Analytics on globally distributed data with globally distributed clusters
• Approximate analytics
• Scalable machine learning and data mining algorithms that can work in a distributed setting
• Security and privacy
Advanced Topics in Data Intensive Computing
• New course that was offered this semester
• Covers many of the Big Data technologies
• Requires Java programming and database experience
http://www.cs.uga.edu/~laks/courses/adic-fall2012.html
THANK YOU !!!
Business Analytics Concentration
MIS @ Terry
December 5th, 2012
MBA Business Analytics Concentration
• Five core classes:
    – Data Management
    – Business Process Management
    – Predictive Analytics
    – Data Warehousing and Mining
    – Emerging Analytical Technologies, Platforms & Applications
• Three electives
    – Energy Informatics
    – Marketing Analytics and Decision-Making
    – Introductory Biostatistics
    – Introduction to Epidemiology
    – Etc.
Hadoop Implementation
• Robert Bearden, CEO of Hortonworks, Terry Entrepreneur-in-Residence
• Hortonworks provided UGA with education and support for installing a Hadoop cluster to enable big data education and research
• Will be used in Spring 2013 in the Data Management & Energy Informatics classes
Big Data and The Changing Nature of Science
(…and the Importance of Cyberinfrastructure Centers)
Nick Berente Terry College of Business
University of Georgia
Traditional Science:
Deduction (abduction) Hypothesize-test
Computational Science: Still (largely) scientific method
Images: birth of a galaxy; hurricane simulation; heart muscle mitochondria
Source of images: TACC
Computational Science Cyberinfrastructure
Computational Resources: “Big Iron” – Condo model = Cycles, Memory
Disciplinary / Interdisciplinary code
Parallelized code
Gateways & Workflows
Visualization
Big Data Science: Observation in Natural Sciences
Induction! Pattern identification & matching
ALMA Telescope Array: 66 telescopes
Global Ocean Observing System: 3000+ sensors
Big Data Science: Observation in Social Sciences
social media User-generated content – social network analysis
Sequence Analysis: “Organizational Genetics”
Big Data Science Cyberinfrastructure
Support for Inductive analysis – pattern identification and matching
Everything associated with computational science plus increased focus on interpretation:
  Visualization
  Next-generation analytic methods
  Unstructured / multi-source data
and, of course: Network throughput & Storage
My Research: Next Generation Computational Science Centers
Centers
Center: “a facility providing a place for a particular activity or service” (Merriam-Webster)
Cyberinfrastructure Innovation Centers
Centers – Significant Value
For universities, regions, nations, and globally
- Science
- Economic (local)
- Cross-disciplinary Knowledge
- Technological Innovation
Three NSF Research Projects
RCN: Managing Collaborative Centers 1240160
EAGER: Supporting Successful Management- CI Centers 1148996
CI-TEAM: “Science Executive” education 1059153
Series of Workshops & Reports
- Managing CI Centers – Oct. 2011 – UGA, Athens, GA
- Virtual Organizations – June 2012 – Case, Cleveland, OH
- Managing CI Centers – Feb. 2013 – UM, Ann Arbor, MI
- Science Executive Ed – May 2013 – UGA, Atlanta, GA
- Scientific Software – Oct. 2013 – UT, Austin, TX
- Virtual Organizations – May 2014 – UI, Urbana-Champaign, IL
- Scientific Software – May 2015 – CMU, Pittsburgh, PA
- Managing CI Centers – May 2016 – UGA, Atlanta, GA
Research Directions
Enabling sustained innovation (centers vs. projects)
Metrics & benchmarking
Funding & human resource issues
Software engineering
CI issues for unstructured data for social science
Thank you! Nick Berente
NSF OCI # 1059153