C-DAC HPC Activities & Experience on Accelerators

Goldi Misra, Group Coordinator & Head, HPC Solutions Group, C-DAC, India

Thematic R & D Areas of C-DAC

• High Performance Computing & Grid Computing: Hardware, Software, Systems, Applications, Research, Technology, Infrastructure

• Multilingual Computing and Heritage Computing: Tools, Fonts, Products, Solutions, Research, Technology Development

• Health Informatics: Hospital Information System, Telemedicine, Decision Support System, Tools, Traditional Knowledge-base and DSS for Medicine

• Software Technologies including FOSS: FOSS, Multimedia, ICT for masses, E-Governance, Geomatics, ICT4D

• Professional Electronics including VLSI & Embedded Systems: Digital Broadband and Wireless Systems, Network Technologies, Power Electronics, Real-Time Systems, Control Electronics, Embedded Systems, VLSI/ASIC Design, Agri Electronics, Strategic Electronics

• Cyber Security & Cyber Forensics: Cyber Security tools, technologies & solution development, Research & Training

Education and Training forms an important component of C-DAC activities cutting across the above Thematic Areas.

Kaleidoscope of C-DAC Products

Spectrum of HPC Activities

• Technologies

• Systems

• National Facilities

• Applications

• Solutions

• Trainings

PARAM Series of Supercomputers

PARAM YUVA

Indian Supercomputing Scenario

• 1990: C-DAC’s PARAM 8000, India’s first Gigascale Supercomputer

• 1990-2000: Parallel initiatives: ANUPAM (BARC), PARAM (C-DAC), ANURAG (DRDO), Flowsolver (NAL)

• 2000-2007: Several HPC facilities set up, including those at C-DAC, IISc, BARC, NAL, CMMACS, DRDO, NCMRWF

• 2002: C-DAC’s PARAM Padma, India’s first Terascale Supercomputer, launched (Rank 171 in Top 500 List)

• 2007: CRL’s EKA ranks 4th in Top 500 List; 9 Terascale Systems from India in Top 500 List

• 2008: C-DAC’s PARAM Yuva launched (Rank 68 in Top 500 List)

• 2010: Only 4 systems from India in Top 500 List, as against 41 systems from China; India’s best system ranked 47 in Top 500 List

• 2012: Indian Govt. takes initiative for a big leap in Supercomputing

HPC Activities @ C-DAC

National HPC Facilities

• NPSF @ Pune (1998): PARAM 10000 system

• CTSF @ Bangalore (2003): PARAM Padma system

• NPSF @ Pune (2009): PARAM Yuva system

HPC Applications: 1988-2011

[Figure: timeline of HPC application work from 1988 to 2011 (axis ticks ’88, ’91, ’95, ’98, ’00, ’02, ’08, ’11), spanning the 1st, 2nd, 3rd and 4th Missions and Garuda (54 TF)]

Application domains shown: CFD, Weather Forecasting, Evolutionary Computing, Seismic Modelling, Bio-informatics, Structural Engineering

Entries on the timeline:

• T80, T172, RTWS, WRF

• Seismic inversion; pre and post stack migration; 1D, 2D & 3D models

• Fracture Mechanics, Composites, FRP, Smart Structures

• Protein structure, Protein folding (1 ns), Protein folding (300 ns), REMD, MEME

• Electro-magnetics

• IC Engine, CFD (Launch Vehicle)

• InClus – HPC Cluster Building Toolkit

• CHReME – HPC Resource Management Engine

• ONAMA – HPC package for academic institutions

• Parallel File System

InClus Integrated Cluster Solution

• Web and Desktop based GUI

• Provision of Operating Systems on physical as well as virtual machines: RHEL 5.x, RHEL 6, CentOS 5.x, CentOS 6.x

• Development platform; Compilers, Debuggers

• Scheduler and resource manager

• Policy based accounting

• High availability support

• Remote console. Powerful shell support.

• Quickly set up and control Management node services: DNS, HTTP, DHCP, TFTP

• User Management

• GPU Support

• Log monitoring

• Critical Error/ Warning reporting via Web interface

• SMS/Mail alerts for checking job status

InClus addresses the technical challenges of building and managing HPC clusters; it makes clustering easy.

CHReME C-DAC’s HPC Resource Management Engine

The CHReME portal is an end-user job submission, management and monitoring tool that works with various schedulers or workload managers such as Torque, OpenPBS, Sun Grid Engine, Moab, LoadLeveler, etc.

Timely E-mail notification regarding job status; personalized job list and job status information

Secure credential specific access on web through https

Allows users to configure their execution environment through compilers and libraries selection, scheduling parameters etc.

Scientific & Research Applications specific portals

CHReME addresses the challenge of efficient and easy usage and management of the resources of HPC systems.
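To make the idea of configuring an execution environment and scheduling parameters concrete, here is a minimal sketch of how such choices could be turned into a Torque/PBS job script. It is not CHReME's implementation, which these slides do not describe. The function `render_pbs_script` and all its parameter names are hypothetical, and the `module load` lines assume the cluster uses Environment Modules.

```python
def render_pbs_script(job_name, command, nodes=1, ppn=1,
                      walltime="01:00:00", queue=None, modules=()):
    """Build the text of a Torque/PBS job script (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",                 # job name
        f"#PBS -l nodes={nodes}:ppn={ppn}",    # nodes and processors per node
        f"#PBS -l walltime={walltime}",        # wall-clock limit
        "#PBS -m abe",                         # mail on abort, begin and end
    ]
    if queue:
        lines.append(f"#PBS -q {queue}")       # target queue, if one was chosen
    lines.append("cd $PBS_O_WORKDIR")          # run from the submission directory
    for name in modules:
        # Assumption: compilers and libraries are selected via Environment Modules.
        lines.append(f"module load {name}")
    lines.append(command)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_pbs_script("demo", "mpirun ./a.out", nodes=2, ppn=8,
                            queue="batch", modules=("gcc", "openmpi")))
```

The resulting text would typically be written to a file and handed to the scheduler's `qsub` command; the queue and module names in the usage example are placeholders.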

ONAMA

Onama is an integrated package that opens a new door to future technocrats, giving them a quantum leap in developing a firm understanding, through HPC, of several engineering disciplines.

Onama comprises a well-selected set of parallel and serial applications and tools across engineering disciplines such as Computer Science, Mechanical, Electronics and Communication, Electrical, Civil and Chemical engineering. It also includes a number of NVIDIA CUDA enabled applications in domains such as molecular dynamics and physics.

Mission: “Equipping Premier Academic Institutions with top of the class HPC solutions from C-DAC, packaged with open source software and world class services. This would enable the Premier Academic Institutions to benefit in terms of service delivery and affordability.”

• Parallel, Multicore and Manycore Programming

• System Administration & Management

• Network Security & Audits

• Storage Management Technologies

• Facility Operations Management and Maintenance

• GPU based Programming

• HPC User Symposiums

• C-DAC Certified HPC Professional

Indigenous capability in

• Engineering & managing of large supercomputing systems and national supercomputing facility

• HW & SW skills in designing System Area Network: chip/PCB/system design skills (HW); networking stack & system software (SW); prototyping, validation, certification, benchmarking and training

• HW & SW skills in designing RC accelerators: RC HW having up to 12 million logic gates for computing, with different host interfaces; porting applications/algorithms as HW circuits to achieve large speed-ups; SW ecosystem design for various operating systems

Indigenous capability in (contd…)

• Porting and scaling applications on large clusters

• Several collaborative projects in Science & Engineering research

• Increase of HPC user community in the nation

• Publications

Activities on Intel Many Integrated Core (MIC) Architecture

Knights Ferry Co-Processor Card

• 1.2 GHz, up to 32 cores, 2 GB GDDR5, 4 threads/core, 300 W, 45 nm process

• MIC Platform Software Stack (MPSS) 1.0 and 2.0

• Development Tools: Intel Fortran & C++ Compilers, Intel MPI and OpenMP, Intel MKL, IPP, TBB, ArBB, Cilk Plus, support for Eclipse IDE, OpenCL support in future

Expected Specifications of Knights Corner

• More than 50 cores per chip

• 22nm process size

• ~1TF

• Mathematical Algorithms: Mandelbrot Set (an example of a simple mathematical definition leading to complex behavior)

• Molecular Dynamics: MD_OPENMP

• Oceanography: Tsunami-N2 (numerical simulation program with the linear theory in deep sea and with the shallow water theory in shallow sea and on land, with constant grid length in the whole region)

• Astrophysics: CAMB (Code for Anisotropies in the Microwave Background; computes cosmic microwave background spectra given a set of input cosmological parameters)
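As an illustration of why the Mandelbrot Set is a common many-core test case, the sketch below shows the standard escape-time iteration. Every pixel is computed independently, which is the property a parallel port exploits. This is plain serial Python for readability and not the code that was run on the MIC card. The function names, image size and iteration limit are assumptions chosen for the example.

```python
def escape_time(c, max_iter=100):
    """Iterations of z -> z*z + c before |z| exceeds 2 (capped at max_iter)."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter


def render(width=60, height=24, max_iter=100):
    """Return rows of text; '#' marks points that never escaped."""
    rows = []
    for j in range(height):
        y = -1.2 + 2.4 * j / (height - 1)
        row = ""
        for i in range(width):
            x = -2.0 + 3.0 * i / (width - 1)
            # Each pixel depends only on its own coordinate, so the two loops
            # can be split across cores or threads without communication.
            inside = escape_time(complex(x, y), max_iter) == max_iter
            row += "#" if inside else "."
        rows.append(row)
    return rows


if __name__ == "__main__":
    print("\n".join(render()))
```

In a compiled port the outer loop over rows is the natural place for an OpenMP-style parallel loop, since iteration counts vary per pixel but no data is shared between pixels.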

• Linear scalability and results are encouraging

• Based on familiar x86 architecture

• Run on standard, existing programming tools and methods

• Minimal Porting efforts

• No reprogramming for native compilation and execution

• Directive based offloading

• Availability of tools (profilers, debugging, monitoring etc.)

We intend to work on the commercial product KNC to get an exact idea of performance, scalability etc.

Application Accelerators Facts

Long-term viability of the technology

• Technology is changing at a fast pace

• Many technology providers do not stay in business long enough

Application development methodology and tools

• Differ substantially from conventional multi-core programming

• Require a deep study of the underlying hardware architecture to achieve good performance

• Applications can be tuned to achieve good performance

• Codes are platform dependent

Emerging Standards

• OpenCL for many-core architectures and GPUs

• OpenFPGA efforts for reconfigurable computing

• 2nd and 3rd gen Reconfigurable Computing (RC) platform

• Uses RC hardware with state of the art FPGAs

• RC hardware has up to 12 million logic gates for computing, with different host interfaces

• Avatars – hardware routines/libraries

• Varada – APIs, kernel agent, Linux support

• Eco-friendly HPC solution

C-DAC’s Expertise

C-DAC RC draws on:

• Digital Systems Design

• Scientific Application

• PCB Design & Assembly

• Hardware Library Design (Avatars)

• System Software Design

One of the enabling technologies useful in RC is the field-programmable gate array (FPGA).

Putting FPGAs on add-on cards or motherboards allows them to serve as compute-intensive co-processors.

FPGAs can be re-configured over and over again to perform a multitude of operations. This enables application-specific, dynamically "programmable" hardware accelerators.

[Figure: bar chart of run time, 3 hrs 8 min on 16 nodes (256 cores) versus 29 min on 16 RC cards]

Application: Smith-Waterman Sequence Search
Database: Protein
Nodes: HP DL580G5 Quad Core Quad Socket Xeon 2.93 GHz

Query | Software (256 cores) | RC (16 cards) | Speed-up per card in terms of cores
AAN10358 | 2 hr 18 min 31 sec | 22 min 8 sec | 100.1
NP_597681 | 2 hr 41 min 2 sec | 25 min 39 sec | 100.4
XP_001065955 | 3 hr 8 min 28 sec | 29 min 47 sec | 101.2
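The "speed-up per card in terms of cores" figures appear consistent with a simple calculation: the ratio of software time to RC time, multiplied by the number of CPU cores per RC card (256 cores / 16 cards = 16). The sketch below reproduces that arithmetic from the timings in the table. The formula is inferred from the numbers rather than stated in the slides, and the function names are chosen for illustration.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an h/m/s run time to seconds."""
    return hours * 3600 + minutes * 60 + seconds


def per_card_speedup(software_s, rc_s, cores=256, cards=16):
    """Number of CPU cores one RC card is equivalent to (inferred formula)."""
    return (software_s / rc_s) * (cores / cards)


# Timings copied from the table: (software on 256 cores, RC on 16 cards).
RUNS = {
    "AAN10358": (to_seconds(2, 18, 31), to_seconds(0, 22, 8)),
    "NP_597681": (to_seconds(2, 41, 2), to_seconds(0, 25, 39)),
    "XP_001065955": (to_seconds(3, 8, 28), to_seconds(0, 29, 47)),
}

if __name__ == "__main__":
    for query, (software_s, rc_s) in RUNS.items():
        print(f"{query}: {per_card_speedup(software_s, rc_s):.1f}")
```

Read this way, the 16-card RC system finishes each query a little over six times faster than the 256-core cluster, and each card does the work of roughly 100 cores.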

[Figure: clock frequency (MHz, 0 to 3000) of FPGAs versus CPUs by year, 1990 to 2008. Source: Intel, IBM, Xilinx, Altera datasheets]

• Scientific and engineering applications in fracture mechanics, radio astronomy and bioinformatics, when ported to RC, ran significantly faster than purely software-based solutions.

• Depending on the configuration and the application, these speedups increased many-fold. A bioinformatics sequence search solution using RC gave results more than 100 times faster.

• C-DAC's own fracture mechanics code, with its double precision Cholesky factorization and forward-backward substitution steps ported to RC, achieved a 16X speedup.

• High speed data acquisition and signal processing solutions designed for Very Long Baseline Interferometry (VLBI) and power spectrum experimentation in radio astronomy replaced a sizable computing cluster.

• Double precision matrix multiplication implemented on RC performed better than the standard math library.
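For readers unfamiliar with the kernel named in the fracture mechanics item, the sketch below shows Cholesky factorization followed by forward and backward substitution for a symmetric positive definite system A x = b. It is a textbook formulation in plain Python, not C-DAC's code or its hardware mapping, and the function names are chosen for this example.

```python
import math


def cholesky(a):
    """Return lower-triangular L with A = L * L^T (A symmetric positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def solve(a, b):
    """Solve A x = b using the Cholesky factor of A."""
    l = cholesky(a)
    n = len(b)
    # Forward substitution: L y = b
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    # Backward substitution: L^T x = y
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x


if __name__ == "__main__":
    print(solve([[4.0, 2.0], [2.0, 3.0]], [10.0, 8.0]))
```

The inner sums are multiply-accumulate chains, which is the kind of regular arithmetic that tends to map well onto dedicated hardware pipelines.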

• Evolution of reconfigurable logic design alongside more traditional computing paradigms.

• Development of more efficient cache replacement policies for FPGA configurations.

• Reduction in run-time reconfiguration time.

• FPGAs in HPC will act as a solution to the scaling challenges of microprocessors (power consumption and clock frequencies).

• By mapping compute-intensive algorithms directly onto parallel FPGA hardware, tightly coupled to a conventional CPU through a high-speed I/O bus, complete applications can be accelerated by orders of magnitude over conventional CPU implementations.

• Development of run-time debugging for multi-platform enabled code.

Tesla Data Center & Workstation GPU Solutions

Servers & Blades: Tesla M-series GPUs (M2090 | M2075/0 | M2050)
Workstations: Tesla C-series GPUs (C2075/0 | C2050)

Specification | M2090 | M2075/0 | M2050 | C2075/0 | C2050
Cores | 512 | 448 | 448 | 448 | 448
Memory | 6 GB | 6 GB | 3 GB | 6 GB | 3 GB
Memory bandwidth (ECC off) | 177.6 GB/s | 150 GB/s | 148.8 GB/s | 148.8 GB/s | 148.8 GB/s
Peak single precision (Gflops) | 1331 | 1030 | 1030 | 1030 | 1030
Peak double precision (Gflops) | 665 | 515 | 515 | 515 | 515
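The peak figures listed above can be approximated from the core count: peak single precision Gflops is roughly cores × clock (GHz) × 2, counting a fused multiply-add as two floating point operations, and double precision on these parts is half of that. The clock rates used below (1.3 GHz for the M2090, 1.15 GHz for the M2050 and C2050) are not given in the slides; they are assumptions taken from NVIDIA's published specifications, and the function names are illustrative.

```python
def peak_sp_gflops(cores, clock_ghz, flops_per_cycle=2):
    """Peak single precision Gflops: one fused multiply-add per core per cycle."""
    return cores * clock_ghz * flops_per_cycle


def peak_dp_gflops(cores, clock_ghz):
    """Peak double precision Gflops, taken as half the single precision rate."""
    return peak_sp_gflops(cores, clock_ghz) / 2


# (cores, assumed processor clock in GHz); clocks are not from the slides.
BOARDS = {
    "M2090": (512, 1.30),
    "M2050": (448, 1.15),
    "C2050": (448, 1.15),
}

if __name__ == "__main__":
    for name, (cores, clock) in BOARDS.items():
        sp = peak_sp_gflops(cores, clock)
        dp = peak_dp_gflops(cores, clock)
        print(f"{name}: SP ~{sp:.1f} Gflops, DP ~{dp:.1f} Gflops")
```

Under these assumptions the M2090 works out to about 1331 and 666 Gflops and the 448-core boards to about 1030 and 515 Gflops, close to the tabulated values; small differences come from rounding.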

Tesla: 2-3x Faster GPU Every 2 Years

[Figure: DP GFLOPS per Watt (axis 2 to 16) by year, 2008 to 2014, for the T10, Fermi, Kepler and Maxwell generations]

Worldwide GPU Supercomputer Momentum

[Figure: growth chart annotated with three milestones: Tesla GPUs launched; first double precision GPU; Tesla 20-series (Fermi) launched]

Ways to accelerate applications:

• Libraries

• Directives: easiest approach for 2x to 10x acceleration

• Programming Languages: maximum performance

NVIDIA GPUDirect

• Accelerated Communication With Network and Storage Devices

• Peer-To-Peer Transfers Between GPUs

• Peer-To-Peer Memory Access

• GPUDirect For Video

PGI CUDA x86: CUDA Now Available for CPUs and GPUs

[Figure: a single CUDA C/C++ codebase is compiled by the NVIDIA C/C++ Compiler for the GPU and by the PGI CUDA x86 Compiler (C/C++ support) for the CPU]

GPU Computing @ C-DAC

Particulars | Tesla C1060 | Tesla C2050 | Tesla 2075
Architecture | Tesla 10 Series GPU | Tesla 20 Series GPU | Tesla 20 Series GPU
Compute Capability | 1.3 | 2.0 | 2.0
No. of Cores | 240 | 448 | 448
GPU Memory | 4 GB | 3 GB | 6 GB
Memory Bandwidth | 102 GB/s | 150 GB/s | 150 GB/s

Bio-informatics:

• GPU-HMMER (protein sequence alignment using profile HMMs)

• MrBayes (Bayesian inference of phylogenetic and evolutionary models)

• CUDA-BLASTP (designed to accelerate NCBI BLASTP for scanning protein sequence databases)

• CUDA-MEME (discovers motifs in groups of related DNA or protein sequences)

Weather:

• WSM5 (the WRF Single Moment 5 cloud microphysics module)

• Strong scalability of certain codes on multiple cards

• Availability of numerous applications explicitly enabled with CUDA

• Significant reprogramming efforts with CUDA for maximum performance

• New and improved developer tools

• Programming Complexity

• Library & Tools Availability

• Power Vs performance

• Flexibility Vs Accessibility

• Acceleration

http://www.cdac.in/

http://www.hpcwire.com/

http://www.intel.com/

http://www.nvidia.com/

Thank You
