C-DAC HPC Activities & Experience on Accelerators
Goldi Misra, Group Coordinator & Head, HPC Solutions Group, C-DAC, India
• High Performance Computing & Grid Computing: Hardware, Software, Systems, Applications, Research, Technology, Infrastructure
• Multilingual Computing and Heritage Computing: Tools, Fonts, Products, Solutions, Research, Technology Development
• Health Informatics: Hospital Information System, Telemedicine, Decision Support System, Tools, Traditional Knowledge-base and DSS for Medicine
• Software Technologies including FOSS: FOSS, Multimedia, ICT for masses, E-Governance, Geomatics, ICT4D
• Professional Electronics including VLSI & Embedded Systems: Digital Broadband and Wireless Systems, Network Technologies, Power Electronics, Real-Time Systems, Control Electronics, Embedded Systems, VLSI/ASIC Design, Agri Electronics, Strategic Electronics
• Cyber Security & Cyber Forensics: Cyber Security tools, technologies & solution development, Research & Training
Education and Training form an important component of C-DAC activities, cutting across the above Thematic Areas
Thematic R & D Areas of C-DAC
Kaleidoscope of C-DAC Products
Spectrum of HPC Activities
Technologies
Systems
National Facilities
Applications
Solutions
Trainings
PARAM Series of Supercomputers
PARAM YUVA
Indian Supercomputing Scenario
1990-2000: Parallel Initiatives: ANUPAM (BARC), PARAM (C-DAC), ANURAG (DRDO), Flowsolver (NAL)
1990: C-DAC’s PARAM 8000, India’s first Gigascale Supercomputer
2000-2007: Several HPC Facilities set up, including those at C-DAC, IISc, BARC, NAL, CMMACS, DRDO, NCMRWF
2002: C-DAC’s PARAM Padma, India’s first Terascale Supercomputer, launched (Rank 171 in Top 500 List)
2007: CRL’s EKA ranks 4th in Top 500 List; 9 Terascale Systems from India in Top 500 List
2008: C-DAC’s PARAM Yuva launched (Rank 68 in Top 500 List)
2010: Only 4 systems from India in Top 500 List as against 41 systems from China; India’s best system ranked 47 in Top 500 List
2012: India Govt. takes initiative for a big leap in Supercomputing
HPC Activities @ C-DAC
National HPC Facilities
NPSF @ Pune (1998)
PARAM 10000 system
CTSF @ Bangalore (2003)
PARAM Padma System
NPSF @ Pune (2009)
PARAM Yuva System
HPC Applications: 1988-2011
Application domains: CFD, Weather Forecasting, Evolutionary Computing, Seismic Modelling, Bio-informatics, Structural Engineering
Timeline axis: ’88, ’91, ’95, ’98, ’00, ’02, ’08, ’11
T80 T172 RTWS, WRF
Seismic inversion, pre and post stack migration, 2D & 3D models
Fracture Mechanics
FRP ,Smart Structures
FRP
Protein folding (300 ns), REMD, MEME
Electro-magnetics
Protein structure IC Engine,
Seismic inversion
Protein folding (1 ns)
CFD (Launch Vehicle)
Pre and post stack migration, 1D models
Composites
1st Mission 2nd Mission 3rd Mission 4th Mission Garuda 54 TF
• InClus – HPC Cluster Building Toolkit
• CHReME – HPC Resource Management Engine
• ONAMA – HPC package for academic institutions
• Parallel File System
InClus Integrated Cluster Solution
• Web and Desktop based GUI
• Provision of Operating Systems on physical as well as virtual machines: RHEL5.x, RHEL 6, CentOS5.x, CentOS6.x
• Development platform; Compilers, Debuggers
• Scheduler and resource manager
• Policy based accounting
• High availability support
• Remote console. Powerful shell support.
• Quickly set up and control Management node services: DNS, HTTP, DHCP, TFTP
• User Management
• GPU Support
• Log monitoring
• Critical Error/ Warning reporting via Web interface
• SMS/Mail alerts for checking job status
InClus addresses the technical challenges of building and managing HPC clusters and makes clustering easy.
CHReME C-DAC’s HPC Resource Management Engine
The CHReME portal is an end-user job submission, management and monitoring tool that works with various schedulers and workload managers such as Torque, OpenPBS, Sun Grid Engine, Moab, LoadLeveler, etc.
Timely E-mail notification regarding job status; personalized job list and job status information
Secure credential specific access on web through https
Allows users to configure their execution environment through compilers and libraries selection, scheduling parameters etc.
Scientific & Research Applications specific portals
CHReME addresses the challenge of efficient and easy usage and management of HPC system resources
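To make the idea of configuring an execution environment and scheduling parameters concrete, the following is a minimal Python sketch that renders a Torque/OpenPBS batch script from a user's choices. It is not CHReME's code or API: the `JobRequest` class, the `render_pbs_script` function, the queue name, the module names and the use of `module load` and `mpirun` are all assumptions made for illustration. Only the `#PBS` directives themselves (`-N`, `-q`, `-l`, `-m`, `-M`) are standard Torque/OpenPBS options.

```python
from dataclasses import dataclass, field


@dataclass
class JobRequest:
    """Choices a job-submission portal might collect (hypothetical structure)."""
    name: str
    command: str
    nodes: int = 1
    procs_per_node: int = 8
    walltime: str = "01:00:00"
    queue: str = "batch"                          # assumed queue name
    modules: list = field(default_factory=list)   # assumed environment modules
    email: str = ""


def render_pbs_script(job: JobRequest) -> str:
    """Build the text of a Torque/OpenPBS batch script for the given request."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job.name}",
        f"#PBS -q {job.queue}",
        f"#PBS -l nodes={job.nodes}:ppn={job.procs_per_node}",
        f"#PBS -l walltime={job.walltime}",
    ]
    if job.email:
        # Mail on abort, begin and end of the job.
        lines += ["#PBS -m abe", f"#PBS -M {job.email}"]
    lines.append("cd $PBS_O_WORKDIR")
    # Assumes the cluster uses Environment Modules for compilers and libraries.
    lines += [f"module load {m}" for m in job.modules]
    total_procs = job.nodes * job.procs_per_node
    lines.append(f"mpirun -np {total_procs} {job.command}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    request = JobRequest(name="demo", command="./a.out", nodes=2,
                         modules=["example-compiler"], email="user@example.org")
    print(render_pbs_script(request))
```

A portal of this kind would typically hand the generated script to the scheduler's own submission command; that step is left out here because it depends on the site's configuration.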
ONAMA
Onama is an integrated package which opens a new door to future technocrats, giving them a quantum leap in developing a firm understanding of several engineering disciplines through HPC.
Onama comprises a well-selected set of parallel and serial applications and tools across various engineering disciplines such as Computer Science, Mechanical, Electronics and Communication, Electrical, Civil and Chemical engineering. It also includes a number of NVIDIA CUDA-enabled applications in several domains such as molecular dynamics and physics.
With a mission of “Equipping Premier Academic Institutions with top of the class HPC solutions from C-DAC packaged with open source software and world class services. This would enable the Premier Academic Institutions to benefit in terms of service delivery and affordability.”
• Parallel, Multicore and Manycore Programming
• System Administration & Management
• Network Security & Audits
• Storage Management Technologies
• Facility Operations Management and Maintenance
• GPU based Programming
• HPC User Symposiums
• C-DAC Certified HPC Professional
• Engineering & managing of large supercomputing systems and national supercomputing facility
• HW & SW skills in designing System Area Network
• Chip/PCB/system design skills (HW)
• Networking stack & system software (SW)
• Prototyping/Validation/Certification/Benchmarking/Training
• HW & SW skills in designing RC accelerators
• RC HW having up to 12 million logic gates for computing, with different host interfaces
• Porting applications/algorithms as HW circuits to achieve large speed-ups
• SW ecosystem design for various operating systems
Indigenous capability in
• Porting and scaling applications on large clusters
• Several collaborative projects in Science & Engineering research
• Increase of HPC user community in the nation
• Publications
Activities on Intel Many Integrated Core (MIC) Architecture
Knights Ferry Co-Processor Card
• 1.2 GHz, up to 32 cores, 2 GB GDDR5, 4 threads/core, 300W, 45nm process
• MIC Platform Software Stack (MPSS) 1.0 and 2.0
• Development Tools: Intel FORTRAN & C++ Compilers, Intel MPI and OpenMP, Intel MKL, IPP, TBB, ArBB, Cilk Plus, Support for Eclipse IDE, OpenCL support in future
Expected Specifications of Knights Corner
• More than 50 cores per chip
• 22nm process size
• ~1TF
• Mathematical Algorithms: Mandelbrot Set (an example of a simple mathematical definition leading to complex behavior)
• Molecular Dynamics: MD_OPENMP
• Oceanography: Tsunami-N2 (numerical simulation program using the linear theory in deep sea and the shallow water theory in shallow sea and on land, with constant grid length in the whole region)
• Astrophysics: CAMB (Code for Anisotropies in the Microwave Background; computes cosmic microwave background spectra given a set of input cosmological parameters)
• Linear scalability and results are encouraging
• Based on familiar x86 architecture
• Runs with standard, existing programming tools and methods
• Minimal Porting efforts
• No reprogramming for native compilation and execution
• Directive based offloading
• Availability of tools (profilers, debugging, monitoring etc.)
We intend to work on the commercial product KNC to get an exact idea of performance, scalability etc.
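The Mandelbrot set listed among the codes tried on MIC is a common first test for many-core hardware because every pixel is computed independently of the others. Below is a minimal Python sketch of the escape-time iteration; it is not the code that was run on Knights Ferry, and the function names, the viewing window and the iteration cap are illustrative assumptions. On a many-core card, the loop over pixels is the part that would be spread across cores and hardware threads, for instance with OpenMP.

```python
def escape_time(c: complex, max_iter: int = 100) -> int:
    """Return the iteration at which |z| first exceeds 2, or max_iter if it never does."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c  # the defining recurrence z -> z^2 + c
    return max_iter


def mandelbrot_grid(width: int, height: int, max_iter: int = 100):
    """Escape times on a width x height grid over [-2, 1] x [-1, 1] (assumed window)."""
    grid = []
    for row in range(height):
        im = -1.0 + 2.0 * row / max(height - 1, 1)
        # Each pixel depends only on its own coordinate, so rows and
        # columns can be processed in any order or in parallel.
        grid.append([
            escape_time(complex(-2.0 + 3.0 * col / max(width - 1, 1), im), max_iter)
            for col in range(width)
        ])
    return grid


if __name__ == "__main__":
    # Crude text rendering: '#' marks points that never escaped.
    for line in mandelbrot_grid(60, 20, max_iter=50):
        print("".join("#" if n == 50 else "." for n in line))
```

Points near the boundary of the set take many more iterations than points far outside it, so the work per pixel is uneven; that is one reason the kernel is useful for observing how well a platform balances load across many threads.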
Application Accelerators Facts
Long-term viability of the technology
• Technology is changing at a fast pace
• Many technology providers do not stay in business long enough
Application development methodology and tools
• Differ substantially from conventional multi-core programming
• Require a deep study of the underlying hardware architecture to achieve good performance
• Applications can be tuned to achieve good performance
• Codes are platform dependent
Emerging Standards
• OpenCL for many-core architectures and GPUs
• OpenFPGA efforts for reconfigurable computing
• 2nd and 3rd gen Reconfigurable Computing (RC) platform
• Uses RC hardware with state of the art FPGAs
• RC hardware has up to 12 million logic gates for computing, with different host interfaces
• Avatars – hardware routines/libraries
• Varada – APIs, kernel agent, Linux support
• Eco-friendly HPC solution
C-DAC RC
Digital Systems Design
Scientific Application
PCB Design & Assembly
Hardware Library Design (Avatars)
System Software Design
C-DAC’s Expertise
One of the enabling technologies useful in RC is the field-programmable gate array (FPGA).
Putting FPGAs on add-on cards or motherboards allows them to serve as co-processors for compute-intensive work.
FPGAs can be reconfigured over and over again to perform a multitude of operations. This enables application-specific, dynamically "programmable" hardware accelerators.
[Chart: Time. 3 hrs 8 min on 16 nodes (256 cores) versus 29 min on 16 RC cards]
Nodes: HP DL580G5 Quad Core Quad Socket Xeon 2.93 GHz

Query           Software (256 cores)    RC (16 cards)    Speed-up per card in terms of cores
AAN10358        2 hr 18 min 31 sec      22 min 8 sec     100.1
NP_597681       2 hr 41 min 2 sec       25 min 39 sec    100.4
XP_001065955    3 hr 8 min 28 sec       29 min 47 sec    101.2
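The speed-up column in the table above can be reproduced from the two timing columns. The sketch below assumes that "speed-up per card in terms of cores" means the wall-clock ratio multiplied by the 16 CPU cores that correspond to each RC card (256 cores / 16 cards); that reading is an inference from the numbers, and the helper names are illustrative.

```python
def to_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert an h/m/s timing to seconds."""
    return hours * 3600 + minutes * 60 + seconds


def cores_equivalent_per_card(software_s: float, rc_s: float,
                              cores: int = 256, cards: int = 16) -> float:
    """Wall-clock speed-up scaled by the number of CPU cores per RC card."""
    return (software_s / rc_s) * (cores / cards)


# Timings copied from the table: (software on 256 cores, RC on 16 cards).
TIMINGS = {
    "AAN10358": (to_seconds(2, 18, 31), to_seconds(0, 22, 8)),
    "NP_597681": (to_seconds(2, 41, 2), to_seconds(0, 25, 39)),
    "XP_001065955": (to_seconds(3, 8, 28), to_seconds(0, 29, 47)),
}

if __name__ == "__main__":
    for query, (software_s, rc_s) in TIMINGS.items():
        ratio = software_s / rc_s
        per_card = cores_equivalent_per_card(software_s, rc_s)
        print(f"{query}: {ratio:.2f}x wall-clock, {per_card:.1f} cores per card")
```

Under that assumption the three queries come out at about 100.1, 100.4 and 101.2, matching the table: 16 RC cards finish roughly 6.3 times sooner than 256 cores, so each card does the work of about 100 cores.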
Application: Smith-Waterman Sequence Search; Database: Protein
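For readers unfamiliar with the algorithm named here, a minimal software sketch of Smith-Waterman local alignment scoring follows. It is not C-DAC's RC implementation: the scoring values (match +2, mismatch -1, linear gap -1) are assumptions chosen for brevity, whereas production protein searches normally use a substitution matrix and affine gap penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between a and b (score only, no traceback)."""
    prev = [0] * (len(b) + 1)   # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("HEAGAWGHEE", "PAWHEAE"))
```

Each cell depends only on its left, upper and upper-left neighbours, so all cells on one anti-diagonal can be computed at the same time. That regular, fine-grained parallelism is what makes the algorithm a natural fit for hardware such as FPGAs.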
[Figure: FPGA versus CPU clock frequency (MHz, axis 0 to 3000) by year, 1990 to 2008. Source: Intel, IBM, Xilinx, Altera datasheets]
• Scientific and engineering applications in the areas of fracture mechanics, radio astronomy and bioinformatics ported on RC provided significant acceleration compared to purely software based solutions.
• These speedups increased many-fold depending on configuration and application. The bioinformatics sequence search solution using RC ran more than 100 times faster.
• C-DAC's own fracture mechanics code, having double precision Cholesky factorization and forward-backward substitution steps ported on RC provided 16X speedup.
• High speed data acquisition and signal processing solutions designed for Very Long Baseline Interferometry (VLBI) and power spectrum experimentation in radio astronomy, replaced a sizable computing cluster.
• Double precision matrix multiplication implemented on RC performed better than the standard math library.
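The fracture mechanics result above refers to Cholesky factorization followed by forward and backward substitution. A minimal pure-Python sketch of those steps for a small symmetric positive definite system is given below, only to show what the ported kernels compute; it is not C-DAC's code, the function names are illustrative, and it makes no attempt at the blocking or pipelining a hardware port would need.

```python
import math


def cholesky(a):
    """Return lower-triangular L with A = L * L^T for symmetric positive definite A."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                l[i][j] = math.sqrt(d)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def solve_spd(a, b):
    """Solve A x = b using Cholesky, then forward and backward substitution."""
    l = cholesky(a)
    n = len(b)
    # Forward substitution: L y = b
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    # Backward substitution: L^T x = y
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x


if __name__ == "__main__":
    print(solve_spd([[4.0, 2.0], [2.0, 3.0]], [2.0, 1.0]))
```

The factorization costs on the order of n^3 operations while each substitution costs on the order of n^2, so for large stiffness matrices the factorization is the step where acceleration matters most.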
• Evolution of reconfigurable logic design alongside more traditional computing paradigms.
• Development of a more efficient cache replacement policy for FPGA configurations.
• Reduction in run-time reconfiguration time.
• FPGAs with HPC will act as a solution to the scaling challenges of microprocessors (power consumption and clock frequencies).
• By mapping compute-intensive algorithms directly onto parallel FPGA hardware, tightly coupled to a conventional CPU through a high-speed I/O bus, complete applications can be accelerated by orders of magnitude over conventional CPU implementations.
• Development of run-time debugging of multi-platform enabled code.
Workstations Servers & Blades
Tesla Data Center & Workstation GPU Solutions
Tesla M-series GPUs M2090 | M2075/0 | M2050
Tesla C-series GPUs C2075/0 | C2050
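Tesla M-series                  M2090         M2075/0       M2050
Cores                           512           448           448
Memory                          6 GB          6 GB          3 GB
Memory bandwidth (ECC off)      177.6 GB/s    150 GB/s      148.8 GB/s
Peak Perf, Single Precision     1331 Gflops   1030 Gflops   1030 Gflops
Peak Perf, Double Precision     665 Gflops    515 Gflops    515 Gflops

Tesla C-series                  C2075/0       C2050
Cores                           448           448
Memory                          6 GB          3 GB
Memory bandwidth (ECC off)      148.8 GB/s    148.8 GB/s
Peak Perf, Single Precision     1030 Gflops   1030 Gflops
Peak Perf, Double Precision     515 Gflops    515 Gflops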
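The peak figures above follow from core count and clock rate. The sketch below makes three assumptions that are not stated on the slide: each CUDA core completes one fused multiply-add (two floating-point operations) per cycle, double precision runs at half the single-precision rate on these Fermi-based Tesla parts, and the processor clocks are 1.30 GHz for the M2090 and 1.15 GHz for the 448-core parts. The function name is illustrative.

```python
def peak_gflops(cores: int, clock_ghz: float,
                flops_per_core_per_cycle: int = 2) -> float:
    """Theoretical peak = cores x clock (GHz) x flops per core per cycle."""
    return cores * clock_ghz * flops_per_core_per_cycle


# (cores from the table, assumed processor clock in GHz)
CARDS = {
    "M2090": (512, 1.30),
    "M2050": (448, 1.15),
}

if __name__ == "__main__":
    for name, (cores, clock) in CARDS.items():
        sp = peak_gflops(cores, clock)
        dp = sp / 2  # assumed 1:2 double-to-single ratio
        print(f"{name}: SP {sp:.1f} Gflops, DP {dp:.1f} Gflops")
```

With those assumptions the M2090 works out to about 1331 Gflops single precision and the 448-core parts to about 1030 Gflops, with double precision at half of each, in line with the table. These are theoretical peaks; sustained application performance is normally well below them.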
Tesla: 2-3x Faster GPU Every 2 Years
[Figure: DP GFLOPS per Watt (axis 2 to 16) versus year (2008 to 2014) for the T10, Fermi, Kepler and Maxwell generations]
Worldwide GPU Supercomputer Momentum
Tesla GPUs Launched
First Double Precision GPU
Tesla 20-series (Fermi) Launched
Applications: Libraries, Directives, Programming Languages
Easiest Approach for 2x to 10x Acceleration
Maximum Performance
• Accelerated Communication With Network and Storage Devices
• Peer-To-Peer Transfers Between GPUs
• Peer-To-Peer Memory Access
• GPUDirect For Video
NVIDIA GPUDirect
PGI CUDA x86: CUDA Now Available for CPUs and GPUs
Single CUDA C / C++ Codebase
• NVIDIA C / C++ Compiler: GPU
• PGI CUDA x86 Compiler (C / C++ Support): CPU
GPU Computing @ C-DAC
Particulars           Tesla C1060            Tesla C2050            Tesla C2075
Architecture          Tesla 10 Series GPU    Tesla 20 Series GPU    Tesla 20 Series GPU
Compute Capability    1.3                    2.0                    2.0
No. of Cores          240                    448                    448
GPU Memory            4 GB                   3 GB                   6 GB
Memory Bandwidth      102 GB/s               150 GB/s               150 GB/s
Bio-informatics:
• GPU-HMMER (Does protein sequence alignment using profile HMMs)
• MrBayes (Bayesian inference of phylogenetic and evolutionary models)
• CUDA-BLASTP (Designed to accelerate NCBI BLASTP for scanning protein sequence databases)
• CUDA-MEME (Discovers motifs in groups of related DNA or protein sequences)
Weather:
• WSM5 (WRF Single Moment 5 Cloud Microphysics module)
• Strong scalability of certain codes on multiple cards
• Availability of numerous applications explicitly enabled with CUDA
• Significant reprogramming efforts with CUDA for maximum performance
• New and improved developer tools
• Programming Complexity
• Library & Tools Availability
• Power Vs performance
• Flexibility Vs Accessibility
• Acceleration