Page 1: Spectrum Scale AI ecosystem and how it supports GPU ......2020/09/01  · Spectrum Scale AI Insights Out Trained Model Inference Data In Transient Storage SDS/Cloud Global Ingest Throughput-oriented,

Spectrum Scale AI ecosystem and how it supports GPU based workloads including Power AI

Tomer Perry, Scalable I/O development

Includes content provided by Piyush Chaudhary, Ted Hoover, Andreas Koeninger, Simon Lorenz

Page 2: Disclaimer

IBM Confidential

IBM Spectrum Scale / Dec, 2019 / © 2019 IBM Corporation

The information in this document is IBM CONFIDENTIAL.

This information is provided on an "AS IS" basis without warranty of any kind, express or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. Some jurisdictions do not allow disclaimers of express or implied warranties in certain transactions; therefore, this statement may not apply to you.

This information is provided for information purposes only as a high level overview of possible future products. PRODUCT SPECIFICATIONS, ANNOUNCE DATES, AND OTHER INFORMATION CONTAINED HEREIN ARE SUBJECT TO CHANGE AND WITHDRAWAL WITHOUT NOTICE.

USE OF THIS DOCUMENT IS LIMITED TO SELECT IBM PERSONNEL AND TO BUSINESS PARTNERS WHO HAVE A CURRENT SIGNED NONDISCLOSURE AGREEMENT ON FILE WITH IBM. THIS INFORMATION CAN ALSO BE SHARED WITH CUSTOMERS WHO HAVE A CURRENT SIGNED NONDISCLOSURE AGREEMENT ON FILE WITH IBM, BUT THIS DOCUMENT SHOULD NOT BE GIVEN TO A CUSTOMER EITHER IN HARDCOPY OR ELECTRONIC FORMAT.

IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM's sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code, or functionality. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

IBM reserves the right to change product specifications and offerings at any time without notice. This publication could include technical inaccuracies or typographical errors. References herein to IBM products and services do not imply that IBM intends to make them available in all countries.

Page 3: Outline

• AI Pipeline

• HDP goes Mainstream

• Containers

• Storage for AI

• Getting the data closer to the GPU


Page 5: Workflow and Data Flow is complex

[Figure: AI workflow and data flow, from data sources through data preparation and model building to inference]

Data Source
• Traditional Business
• IoT & Sensors
• Collaboration Partners
• Mobile Apps & Social Media
• Legacy
• New Data, Years of Data

Data Preparation (weeks & months)
• Data Cleansing & Pre-Processing
• Training Dataset
• Testing Dataset

Build, Train, Optimize Models (days & weeks)
• AI Deep Learning Frameworks (TensorFlow & Caffe)
• Monitor & Advise
• Instrumentation
• Distributed & Elastic Deep Learning
• Parallel Hyper-Parameter Search & Optimization
• Network Models
• Hyper-Parameters

Inference (seconds to results)
• Trained Model
• Deploy in Production using Trained Model

Flow annotations: Heavy IO, Iterate

Page 6: Enterprise Data Pipeline with IBM Spectrum Storage

[Figure: enterprise data pipeline, from Data In to Insights Out via Trained Model and Inference]

Pipeline stages: EDGE ⇨ INGEST ⇨ ORGANIZE ⇨ ANALYZE ⇨ ML / DL ⇨ INSIGHTS

• Transient Storage (SDS/Cloud): throughput-oriented, software defined temporary landing zone
• Global Ingest (Cloud): throughput-oriented, globally accessible capacity tier
• Fast Ingest / Real-time Analytics (SSD): high throughput performance tier
• Classification & Metadata Tagging: high volume, index & auto-tagging zone
• ETL (SSD/Hybrid): high throughput, random I/O, performance & capacity tier
• Hadoop / Spark Data Lakes (Hybrid/HDD): throughput-oriented, performance & capacity tier
• ML / DL, Prep ⇨ Training ⇨ Inference (SSD/NVMe): high throughput, low latency, random I/O, performance tier
• Archive (HDD, Cloud, Tape): high scalability, large/sequential I/O, capacity tier

Page 7: Enterprise Data Pipeline with IBM Spectrum Storage

[Figure: the enterprise data pipeline from the previous page (EDGE ⇨ INGEST ⇨ ORGANIZE ⇨ ANALYZE ⇨ ML / DL ⇨ INSIGHTS), with each component annotated with the IBM products that serve it: Spectrum Scale, Spectrum Discover, Spectrum Archive, and Cloud Object Storage. Components shown: Transient Storage (SDS/Cloud), Global Ingest (Cloud), Fast Ingest / Real-time Analytics (SSD), Classification & Metadata Tagging, ETL (SSD/Hybrid), Hadoop / Spark Data Lakes (Hybrid/HDD), ML / DL Prep ⇨ Training ⇨ Inference (SSD/NVMe), and Archive (HDD, Cloud, Tape).]


Page 9: Integration of HDFS in CES

• HDFS Transparency becomes an integral part of Spectrum Scale

• Easy setup of HDFS Transparency using existing CES mechanisms, e.g. "mmces service enable"

• Only NameNodes will be managed by CES

• HDFS clients always talk to the same CES IP (for NameNode requests)

• CES monitors the NameNode and moves the CES IP to another available node if something goes wrong

• Multiple HDFS clusters supported through multiple CES groups

[Figure: IBM Spectrum Scale Cluster]

• HDFS Client: always talks to the same CES IP

• Special CES Group with single IP

• CES Node (Active HDFS NameNode)

• CES Node (Standby HDFS NameNode)

• Regular GPFS Nodes (HDFS DataNodes)

• If the Active NameNode fails, CES will move the IP to a working NameNode
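The failover behaviour described on this page can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class names `NameNode` and `CesGroup`, the function `resolve`, the host names, and the address `192.0.2.10` are all hypothetical and are not part of Spectrum Scale or CES. The model also assumes the IP stays on its current holder while that holder is healthy, which the slides do not specify.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class NameNode:
    """Hypothetical stand-in for a CES node running an HDFS NameNode."""
    host: str
    healthy: bool = True


@dataclass
class CesGroup:
    """Hypothetical model of a CES group that owns a single floating IP."""
    ces_ip: str
    nodes: list = field(default_factory=list)
    active: Optional[NameNode] = None

    def monitor(self) -> NameNode:
        """Return the node holding the IP, moving the IP if the holder failed."""
        # Assumption: no fail-back while the current holder is still healthy.
        if self.active is not None and self.active.healthy:
            return self.active
        for node in self.nodes:
            if node.healthy:
                self.active = node  # the single CES IP now points here
                return node
        self.active = None
        raise RuntimeError("no healthy NameNode available for " + self.ces_ip)


def resolve(group: CesGroup) -> str:
    """Host that a client reaches when it connects to the group's CES IP."""
    return group.monitor().host


# Demo: the client keeps using the same IP before and after a failure.
group = CesGroup("192.0.2.10", [NameNode("ces-node-1"), NameNode("ces-node-2")])
before = resolve(group)
group.nodes[0].healthy = False  # simulate a failure of the active NameNode
after = resolve(group)
print(group.ces_ip, before, after)
```

The point of the sketch is that the client-side address never changes; only the mapping from the CES IP to a NameNode host does.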


Page 11: Journey to cloud requires an open, hybrid approach

Productive, Predictable, Portable

• Container platform: future proof by building once, deploying anywhere for flexible data and workload placement

• Operational services: open and integrated consistent management services that ensure operational integrity and reduce cost

• Containerized software, secure by design: integrated and secure containerized software for an agile, yet governed, enterprise

Manage, Build, Move

Page 12: Goal: Deliver High Performance File Services to Containerized Application Workloads

Support for Multiple Clouds

• Public, Private, Hybrid

Support Hybrid Use Cases

• Cloud Burst – Single Name Space

• Multi Cloud Data Sharing

• Archive

• High Performance Tiering

Solution Integration (Partners)

Support Workloads that Require High Performance File Services

• Analytics & Cognitive

• High Performance Computing

• AI Data Pipeline

Support the Workload Ecosystem in the Cloud

• Containerized Applications, Storage

• Ephemeral and Persistent Storage Volumes

Flexible Deployment

• Dynamic Provisioning, Configuration, Upgrade

Page 13: Spectrum Scale Containers Models

Storage for Containers

• Container Ready Storage

Storage in Containers

• Containerized Storage

[Figure: two deployment models, each auto-deployed]

• Storage for Containers: application containers use the Container Storage Interface (CSI) for Scale (storage provision and attachment) on top of a Spectrum Scale client with connectivity to Spectrum Scale

• Storage in Containers: application containers use the Container Storage Interface (CSI) for Scale (storage provision and attachment) on top of Containerized Spectrum Scale
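From the application side, both models come down to a container asking Kubernetes for a volume and letting a CSI driver provision and attach it. The sketch below builds a standard Kubernetes PersistentVolumeClaim manifest as a plain dictionary. The helper `build_pvc`, the claim name `training-data`, and the storage class name `scale-csi-example` are hypothetical placeholders; the storage class that a real Spectrum Scale CSI deployment exposes will differ.

```python
import json


def build_pvc(name: str, storage_class: str, size_gib: int) -> dict:
    """Return a Kubernetes PersistentVolumeClaim manifest as a plain dict."""
    if size_gib <= 0:
        raise ValueError("size_gib must be positive")
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany lets several application containers share the
            # volume, a common reason to put a parallel file system behind CSI.
            "accessModes": ["ReadWriteMany"],
            # Hypothetical storage class; it selects the CSI provisioner.
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": f"{size_gib}Gi"}},
        },
    }


pvc = build_pvc("training-data", "scale-csi-example", 100)
print(json.dumps(pvc, indent=2))
```

The printed JSON could be saved to a file and applied with standard Kubernetes tooling once the storage class name matches one defined in the cluster.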

Page 14: Evolution of IBM Spectrum Scale Containers

[Figure: stack diagrams for five stages of Spectrum Scale container support]

• Spectrum Scale (bare metal deployment)

• Scale for Containers v1

• Scale for Containers v2

• Scale in a Container

• Scale in a Container w/ CloudPaks

Layers appearing in the stacks: Infrastructure (IBM & Partners, or IBM & Expanded Partner Ecosystem), OS Support, RHEL, OpenShift, OpenShift Interoperability, Kubernetes, SEC 2.0, CSI 1.0, CSI 1.1, CSI 1.2, Spectrum Scale, Common Services.

Page 15: Evolution of IBM Spectrum Scale on Cloud

[Figure: stack diagrams for Spectrum Scale cloud offerings, grouped into "Current Partner and Scale Offerings" and "Future Partner and Scale Offerings"]

• Spectrum Scale (bare metal deployment)

• Spectrum Scale IBM Cloud

• Spectrum Scale on AWS (AWS Common Services, AMI)

• Spectrum Scale Partner Solutions

• Scale in a Container On Cloud

• Scale in a Container w/ CloudPaks (Multi-Cloud)

Layers appearing in the stacks: Infrastructure (IBM & Partners, or IBM & Expanded Partner Ecosystem), OS Support, RHEL, OpenShift, Kubernetes, CSI 1.2, Spectrum Scale, Scale in a Container, Common Services.

Page 16: Evolution of Hybrid Cloud with IBM Spectrum Scale

[Figure: hybrid cloud stack diagrams, current and future]

• Current: Spectrum Scale on AWS (AWS Common Services, AMI), Spectrum Scale IBM Cloud, Spectrum Scale (bare metal deployment)

• Single Name Space w/AFM

• Future: Scale in a Container On Cloud, Spectrum Scale (bare metal deployment), Scale in a Container w/ CloudPaks (Multi-Cloud)

Layers appearing in the stacks: Infrastructure (IBM & Partners, or IBM & Expanded Partner Ecosystem), OS Support, RHEL, OpenShift, Kubernetes, CSI 1.2, Spectrum Scale, Scale in a Container, Common Services.


Page 18: Start small and scale easily from experiment to production at enterprise scale

Page 19: NVMe Flash for AI and Big Data Workloads

IBM Elastic Storage System 3000

All-new storage solution

• Leverages proven FS9100 technology

• Integrated scale-out advanced data management with end-to-end NVMe storage

• Containerized software for ease of install and update

• Fast initial configuration, update and scale-out expansion

• Performance, capacity, and ease of integration for AI and Big Data workflows


Page 21: Data Accelerator for AI and Analytics Infrastructure

• Performance Tier

• Maximize performance of storage: $/IOP & $/GB/s are key

• Low latency random I/O & High bandwidth sequential

• Relatively small compared to Capacity tier (say 5-25%)

• Can be Lower Durability, Lower Availability, Lower Reliability, if architected properly

• No Geo-distribution

• Capacity Tier (aka “Data Lake”)

• Minimize the cost of storage: $/TB is key

• High Durability, Availability, Reliability, Geo-distribution

[Figure: servers with CPUs & GPUs run Hadoop / Spark and ML / DL (Prep ⇨ Training ⇨ Inference) on shared storage]

• High Performance Tier: IBM Spectrum Scale

• Metadata Search Engine, Organizer / Porter

• Capacity Tier / Data Lake (Data Lakes / Archive): IBM Cloud Object Storage, IBM Spectrum Scale, NAS Filers

• Pipeline stages: Ingest, Organize, Analyze, ML / DL
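The sizing rule of thumb on this page (a performance tier of roughly 5-25% of the capacity tier) and its cost metrics ($/TB for capacity, $/GB/s for performance) can be turned into a quick back-of-the-envelope calculation. This is a minimal sketch: the function names are hypothetical, and every figure in the demo (2000 TB, the prices, the bandwidth) is made up for illustration and does not come from the slides.

```python
def performance_tier_range(capacity_tier_tb: float,
                           low: float = 0.05,
                           high: float = 0.25) -> tuple:
    """Return the (min, max) performance tier size in TB for the 5-25% rule."""
    if capacity_tier_tb <= 0:
        raise ValueError("capacity_tier_tb must be positive")
    if not 0 < low <= high:
        raise ValueError("expected 0 < low <= high")
    return capacity_tier_tb * low, capacity_tier_tb * high


def cost_per_unit(total_cost: float, units: float) -> float:
    """Generic ratio used for $/TB, $/GB/s or $/IOP comparisons."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost / units


# Made-up example: a 2000 TB data lake.
smallest, largest = performance_tier_range(2000.0)
print(f"performance tier: {smallest:.0f} to {largest:.0f} TB")

# Made-up prices: capacity tier judged by $/TB, performance tier by $/GB/s.
print(f"capacity tier:    {cost_per_unit(400_000, 2000):.2f} $/TB")
print(f"performance tier: {cost_per_unit(300_000, 40):.2f} $ per GB/s")
```

Comparing the tiers on different denominators mirrors the slide: the capacity tier is bought for cost per terabyte, the performance tier for cost per unit of throughput or IOPS.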

Page 22: Accelerating data for NVIDIA GPUs

• NVIDIA Magnum IO is a collection of software APIs and libraries to optimize storage and network I/O performance in multi-GPU, multi-node processing environments. NVIDIA developed Magnum IO in close collaboration with storage industry leaders, including IBM.

• Collaboration with NVIDIA continues, to align Spectrum Scale’s pagepool with GPU memory (“pagepool tiering”)

https://devblogs.nvidia.com/wp-content/uploads/2019/08/GPUDirect-Fig-1-New.png

https://www.nvidia.com/en-us/data-center/magnum-io/

Page 23: Questions?


Page 25: Updated and Simplified IBM Spectrum Scale Picture

IBM Confidential

IBM Spectrum Scale / June 07, 2019 / © 2019 IBM Corporation

[Figure: Data Accelerator for AI and Analytics]

• Workloads: New Gen. & Traditional Applications, Micro-Services, Apps

• File access: POSIX, NFS, SMB

• Analytics access: Transparent HDFS

• Object access: S3, Swift

• Container access (container runtime): Storage Enabler for Containers / Container Storage Interface

• IBM Spectrum Scale: automated data placement and data migration, Global Namespace, Management API (RESTful), Advanced GUI

• Storage: Flash, Disk, Tape, Shared Nothing Cluster, JBOD/JBOF (Spectrum Scale ECE)

• Worldwide Data Distribution (AFM): Site A, Site B, Site C

• Transparent Cloud Tier: IBM Cloud Object Storage, IBM Cloud, … and others

• Runs on…: IBM Cloud, … and others