TRANSCRIPT
Sanhita Sarkar, Ph.D., Global Director, Analytics Software Development
Track: Machine Learning
September 25, 2019
2019 Storage Developer Conference, Santa Clara, CA © 2019 Western Digital Corporation or its affiliates. All rights reserved.
A Scalable Artificial Intelligence Data Pipeline for Accelerating Time to Insight
Topics Covered
1. An AI Data Pipeline: Stages and Requirements
2. Disaggregated Architectures for AI Implementations
3. Training and Inference Performance on a Disaggregated Architecture
4. Data Ingestion Performance with Apache Kafka® to an NVMe™ All-Flash Array
5. Performance of a Data Pipeline from Object Notifications to Metadata Search
6. DEMO: A Real-World Use Case
7. Summary and Best Practices
AI Pipeline is Data Intensive
• Data Sources: IoT and sensors, business processes, legacy systems
• Ingest: transient, centralized, high performance
• Data Preparation: ETL (data cleansing, pre-processing)
• Model Training: deep learning frameworks producing trained models
• Model Serving: deploying trained models for inference
• The pipeline spans hours and days of data ingest, days and weeks of preparation and training, and seconds-to-results inference, demanding both speed and scale.
AI Pipeline has Varying Characteristics and Performance Requirements
• Data Sources: data variety, data volume, data disparity
• Ingest: data volume, data velocity
• Data Preparation: data quality
• Model Training: model quality, data caching, data access latency
• Model Serving: response time, throughput
AI Pipeline has Varying Infrastructure Requirements
• Ingest: throughput-oriented temporary tier (NVMe™/object storage)
• Centralized data repository: throughput-oriented, globally accessible capacity tier (object storage)
• Data Preparation: throughput-oriented, high-performance tier (NVMe™)
• Model Training: low-latency, throughput-oriented, scalable, random-I/O performance tier (NVMe™)
• Model Serving: throughput-oriented, random-I/O performance and capacity tier (NVMe™/hybrid)
• Archive: scalable, large/sequential-I/O, capacity archival tier (object storage)
Disaggregated Architectures for AI Implementations
Aggregated vs. Disaggregated Architecture for AI

Aggregated: GPU server(s) collocated with flash storage, connected over a data network to object storage.
• Model training is limited to the flash storage capacity integrated in the GPU servers, which incurs repeated data transfers from the object storage once data grows beyond that capacity.
• This incurs delays in model serving and inference.

Disaggregated: a pool of GPUs and a shared pool of NVMe™ flash storage, connected over a data network to object storage.
• Model training can scale independently on a disaggregated pool of GPUs, shared flash, and object storage, with no subsequent data transfers.
• Inference by the model-serving client is faster due to immediate access to trained models on the shared flash storage.

In both layouts, data flows from ingest clients into a centralized data repository, through data preparation, to model training (consuming training data) and model serving (publishing trained models) for inference clients.
Option 1: Disaggregated Architecture of an AI Data Pipeline
• Ingest and data preparation land raw S3 objects, training data, and validation data on object storage, which serves as a centralized, high-capacity landing zone.
• Model training: a GPU cluster runs an AI framework (TensorFlow, PaddlePaddle, PyTorch, Caffe, etc.) to train and validate models on batched training data, producing and publishing CNN/RNN models (v1 … vn, vn+1 … v2n) to high-performance rack-scale flash storage.
• Model serving deploys the trained models from the shared flash storage to serve inference clients.
Implemented Workflow: Option 1
1. New image, text, and video S3™ objects are ingested into a storage bucket on object storage.
2. A service for object notifications, acting as a Kafka producer, gets notifications of newly ingested S3 objects and publishes the system metadata of the new objects to a Kafka topic.
3. AI model serving (TensorFlow/PaddlePaddle/PyTorch) subscribes to the Kafka topic to get the S3 locations of the newly ingested objects, gets the S3 objects from object storage, and gets the AI model from the NVMe™ All-Flash Array.
4. It uses AI/deep learning algorithms to analyze the objects and generate labels and custom metadata. For objects whose inference score exceeds the threshold, it stores the labels, paths, and custom metadata of the S3 objects in Elasticsearch® (the key-value store).
5. For objects whose inference score falls below the threshold, it notifies the S3 locations of the failed objects to data preparation and AI model training on the GPU cluster, to be used for re-training the model.
6. Kibana™ dashboards search and visualize the custom metadata of the image/text/video S3 objects.
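The serving loop in this workflow can be approximated in a few lines of Python. The sketch below is a minimal illustration under stated assumptions, not the implementation from the talk: the topic names, bucket/index names, endpoints, and the classify() helper are all hypothetical, and the Kafka notification messages are assumed to carry JSON with bucket/key fields.

```python
import json

import boto3
from elasticsearch import Elasticsearch
from kafka import KafkaConsumer, KafkaProducer

SCORE_THRESHOLD = 0.8  # hypothetical cutoff; below it, objects go back for re-training

# Hypothetical endpoints and names; substitute your own.
consumer = KafkaConsumer(
    "new-objects",                       # topic fed by the object notification service
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)
s3 = boto3.client("s3", endpoint_url="http://object-store:9000")
es = Elasticsearch(["http://elasticsearch:9200"])


def classify(data: bytes):
    """Stand-in for the TensorFlow/PaddlePaddle/PyTorch model loaded from flash.

    Returns a (label, score) pair; a real implementation would run inference here.
    """
    return "unknown", 1.0


for msg in consumer:
    note = msg.value                     # e.g. {"bucket": "...", "key": "..."}
    body = s3.get_object(Bucket=note["bucket"], Key=note["key"])["Body"].read()
    label, score = classify(body)
    if score >= SCORE_THRESHOLD:
        # Store labels, path, and custom metadata in the key-value store.
        es.index(index="object-metadata", body={
            "path": f's3://{note["bucket"]}/{note["key"]}',
            "label": label,
            "score": score,
        })
    else:
        # Notify the S3 locations of failed objects for re-training, per the workflow.
        producer.send("retrain-objects", note)
```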
Option 2: Disaggregated Architecture of an AI Data Pipeline
• Ingest and data preparation land raw data, training data, and validation data directly on high-performance rack-scale flash storage.
• Model training: a GPU cluster runs an AI framework (TensorFlow, PaddlePaddle, PyTorch, Caffe, etc.) to train and validate models on batched training data, producing and publishing CNN/RNN models (v1 … vn, vn+1 … v2n) back to the flash storage.
• Model serving deploys the trained models to serve inference clients.
• Object storage acts as a high-capacity data archive holding archived raw data and archived training data.
Implemented Workflow: Option 2
1. Image, text, and video streams are ingested as raw data through Apache Kafka® onto the NVMe™ All-Flash Array.
2. Data preparation and AI model training (TensorFlow/PaddlePaddle/PyTorch on a GPU cluster) read training data from the flash array and write models back to it.
3. AI model serving gets the input data and models from the flash array, uses AI/deep learning algorithms to analyze the input data files and generate labels and custom metadata, and stores the labels, paths, and custom metadata of the files in Elasticsearch® (the key-value store) when the inference score exceeds the threshold.
4. When the inference score falls below the threshold, it notifies the locations of the failed data files to be used for re-training the model.
5. Kibana™ dashboards search and visualize the custom metadata of the image/text/video files.
6. An S3™ cloud connector compresses and moves old files and models from the flash array to object buckets on object storage.
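The "compresses and moves old files and models" step at the end of this workflow might look like the following sketch. Everything here is illustrative: the source directory, bucket name, age cutoff, and endpoint are assumptions, and gzip stands in for whatever compression the S3 cloud connector actually applies.

```python
import gzip
import os
import shutil
import time

import boto3

SRC_DIR = "/mnt/flash/models"        # hypothetical flash-array mount point
BUCKET = "archive"                   # hypothetical archive bucket
MAX_AGE_SECS = 30 * 24 * 3600        # archive files untouched for 30 days

s3 = boto3.client("s3", endpoint_url="http://object-store:9000")


def archive_old_files() -> None:
    cutoff = time.time() - MAX_AGE_SECS
    for entry in os.scandir(SRC_DIR):
        if not entry.is_file() or entry.stat().st_mtime > cutoff:
            continue
        gz_path = entry.path + ".gz"
        with open(entry.path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)                 # compress
        s3.upload_file(gz_path, BUCKET, os.path.basename(gz_path))  # move to object storage
        os.remove(entry.path)                            # free flash capacity
        os.remove(gz_path)


if __name__ == "__main__":
    archive_old_files()
```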
Option 3: Disaggregated Architecture of an AI Data Pipeline
• Ingest and data preparation land raw data, training data, and validation data on a centralized, globally accessible, high-capacity tier of object storage, which also archives raw data, training data, and models.
• Model training: a GPU cluster runs an AI framework (TensorFlow, PaddlePaddle, PyTorch, Caffe, etc.) to train and validate models on batched training data, producing and publishing CNN/RNN models (v1 … vn, vn+1 … v2n) to the same object storage tier.
• Model serving deploys the trained models to serve inference clients.
Implemented Workflow: Option 3
1. A data generator streams new image, text, and video files into a directory on object storage with a file interface.
2. A watcher sends event notifications and locations of newly ingested files, and pushes the latest files to a Kafka producer, which publishes the new files to Kafka topic 1 and the locations and metadata of the new files to Kafka topic 2.
3. AI model serving (TensorFlow/PaddlePaddle/PyTorch) consumes the streamed files, uses AI/deep learning algorithms to analyze them and generate custom metadata, and stores the paths and custom metadata of the files in Elasticsearch® (the key-value store) when the inference score exceeds the threshold.
4. When the inference score falls below the threshold, it notifies the locations of the failed files to data preparation and AI model training on the GPU cluster, which gets files and the AI model from the object storage file interface, to be used for re-training the model.
5. Kibana™ dashboards search and visualize the custom metadata of the image/text/video files.
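A minimal watcher of this kind can be built with plain polling and kafka-python. The sketch below is assumption-laden (the directory path, topic names, and polling interval are invented); a production watcher would more likely use inotify or the object store's own event API.

```python
import json
import os
import time

from kafka import KafkaProducer

WATCH_DIR = "/mnt/objectfs/incoming"   # hypothetical file-interface mount
POLL_SECS = 5

producer = KafkaProducer(bootstrap_servers=["kafka:9092"])

seen = set()
while True:
    for entry in os.scandir(WATCH_DIR):
        if not entry.is_file() or entry.path in seen:
            continue
        seen.add(entry.path)
        with open(entry.path, "rb") as f:
            producer.send("files", f.read())             # topic 1: the new file itself
        meta = {"path": entry.path,
                "size": entry.stat().st_size,
                "mtime": entry.stat().st_mtime}
        producer.send("file-metadata",
                      json.dumps(meta).encode("utf-8"))  # topic 2: location + metadata
    producer.flush()
    time.sleep(POLL_SECS)
```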
Training and Inference Performance on a Disaggregated Architecture
Image Training Performance: Disaggregated Flash and GPUs
• On a disaggregated architecture comprising an NVMe™ All-Flash Array and a single 8-GPU server, training throughput for most AI models scales almost linearly up to 4 GPUs and reaches ~7x with 8 GPUs; the exceptions are AlexNet and LeNet, whose training throughput saturates at a single GPU.
• With multiple GPU servers, training throughput scales linearly with the number of servers, irrespective of the choice of AI model.
[Chart: Image training throughput (images/sec) on an 8-GPU server with 1, 2, 4, and 8 GPUs for ResNet-50, AlexNet, ResNet-152, Inception-V3, LeNet, Inception-V4, and GoogleNet; higher is better.]
[Chart: Image training throughput (images/sec) by model (Inception-V4, ResNet-152, Inception-V3, ResNet-50, GoogleNet, AlexNet, LeNet), scaling from 8 to 64 GPUs; higher is better.]
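The per-server scaling shown above is the kind of behavior TensorFlow's data-parallel distribution produces. As a rough, hypothetical illustration of training one of the benchmarked models across all local GPUs (the dataset path, batch size, and hyperparameters are invented, and the talk's actual benchmark harness is not specified):

```python
import tensorflow as tf

# MirroredStrategy replicates the model across all visible GPUs on one server.
strategy = tf.distribute.MirroredStrategy()
print("GPUs in sync:", strategy.num_replicas_in_sync)

# Hypothetical input pipeline reading training data from the shared flash mount.
dataset = tf.keras.preprocessing.image_dataset_from_directory(
    "/mnt/flash/imagenet", image_size=(224, 224),
    batch_size=256 * strategy.num_replicas_in_sync)   # scale batch with GPU count

with strategy.scope():
    model = tf.keras.applications.ResNet50(weights=None, classes=1000)
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

model.fit(dataset, epochs=1)
model.save("/mnt/flash/models/resnet50_v1")   # publish the model to shared flash
```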
I/O Throughput: Image Training on Disaggregated Flash and GPUs
• On a disaggregated architecture comprising a single 8-GPU server and an NVMe™ All-Flash Array:
  – the average I/O throughput during training with the ResNet-50 model (compute-intensive) is ~653 MB/s at ~100% GPU utilization (the image dataset is 164 GB, each image ~100 KB);
  – the average I/O throughput during training with the LeNet model (I/O-intensive) is ~2.2 GB/s at ~20% GPU utilization. The LeNet model therefore requires ~3.4x the I/O throughput of ResNet-50.
[Chart: I/O throughput (MB/s) and GPU usage over time during image training with ResNet-50 on 8 GPUs.]
[Chart: I/O throughput (MB/s) and GPU usage over time during image training with LeNet on 8 GPUs.]
Image Inference Performance on Disaggregated Flash and GPUs
• Inference throughput is measured as the aggregate images/sec of inference results across multiple GPU containers, using ImageNet datasets.
• On a disaggregated architecture comprising an NVMe™ All-Flash Array and
  – a single 8-GPU server, the inference image-processing rates are ~3.3x to ~4x the training rates of the corresponding TensorFlow models;
  – multiple GPU servers, users have the flexibility to run mixed AI workloads by dedicating one or two GPUs out of every eight to inference and allocating the rest to training.
[Chart: Image inference throughput (images/sec) on an 8-GPU server with 1, 2, 4, and 8 GPUs for ResNet-50, Inception-V3, Inception-V4, VGG-16, and ResNet-152; higher is better.]
[Chart: Image inference throughput (images/sec) by model (Inception-V4, ResNet-152, VGG-16, Inception-V3, ResNet-50), scaling from 8 to 64 GPUs; higher is better.]
Example Configurations: GPU Servers and an NVMe™ All-Flash Array
• An example allocation strategy for the NVMe All-Flash Array's I/O bandwidth while executing AI workloads: 30% for model training, and the remaining 70% for phases such as data preparation, inference, and other activities.
• Using this allocation strategy and the measured training I/O throughput on a disaggregated architecture comprising a single NVMe All-Flash Array and an 8-GPU server, example configurations for the training phase follow (the arithmetic is sketched below):
  – with the ResNet-50 model, a single NVMe All-Flash Array can scale up to twelve 8-GPU servers at 100% GPU utilization;
  – with the LeNet model, a single NVMe All-Flash Array can optimally scale up to three 8-GPU servers.

Example configuration (1 NVMe™ All-Flash Array; mixed I/O throughput 23.5 GB/s at 80% reads / 20% writes):
  – Compute-intensive ResNet-50 model: scales to twelve 8-GPU servers
  – I/O-intensive LeNet model: scales to three 8-GPU servers
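One plausible way to reproduce this sizing arithmetic is sketched below. The per-model throughput figures come from the earlier measurements; the helper itself (its name, the 30% share, and the rounding) is an assumption, and it only approximately matches the slide's ResNet-50 server count.

```python
def servers_per_array(array_gbps: float, training_share: float,
                      per_server_gbps: float) -> int:
    """How many 8-GPU servers one flash array can feed during training."""
    return int(array_gbps * training_share / per_server_gbps)

ARRAY_GBPS = 23.5        # mixed I/O throughput of one NVMe All-Flash Array
TRAINING_SHARE = 0.30    # 30% of bandwidth reserved for model training

# Per-8-GPU-server training I/O from the earlier slides.
print(servers_per_array(ARRAY_GBPS, TRAINING_SHARE, 0.653))  # ResNet-50: ~10-12 servers
print(servers_per_array(ARRAY_GBPS, TRAINING_SHARE, 2.2))    # LeNet: 3 servers
```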
Data Ingestion Performance with Apache Kafka® to an NVMe™ All-Flash Array
Data Ingestion Benchmark Schematics
• Four concurrent client nodes stream data over 100 GigE connections to a 4-node Apache Kafka® cluster at ~4 GB/s. For each benchmark run, the 4 clients use a record size of 500 bytes/message, send 400M messages at 4 GB/s, and publish to a Kafka topic 8M messages at a time.
• For each benchmark run, the Kafka cluster can process 8M messages/sec at the 500-byte message size, with an average acknowledgment latency to the clients of ~40 ms/message.
• A 4-worker Apache Kafka® Connect cluster runs 8 connectors per worker (32 connectors in total); each connector subscribes to a Kafka topic in the Kafka cluster and sinks the data over 100 GigE to a single NVMe™ All-Flash Array.
• The achieved average write throughput to the single NVMe All-Flash Array is ~3.2 GB/s.
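A load generator with this benchmark's shape can be approximated with kafka-python, as in the hedged sketch below. The topic name, broker addresses, and batching knobs are assumptions; the talk does not specify its actual harness.

```python
import time

from kafka import KafkaProducer

RECORD = b"x" * 500                  # 500-byte messages, as in the benchmark
TOTAL = 400_000_000                  # 400M messages per run
BATCH = 8_000_000                    # published 8M messages at a time

producer = KafkaProducer(
    bootstrap_servers=["kafka1:9092", "kafka2:9092"],
    acks=1,                          # broker acknowledgment back to the client
    linger_ms=5,                     # allow client-side batching
)

start = time.time()
for i in range(TOTAL):
    producer.send("ingest-topic", RECORD)
    if (i + 1) % BATCH == 0:
        producer.flush()             # complete one 8M-message batch
elapsed = time.time() - start
print(f"~{TOTAL * len(RECORD) / elapsed / 1e9:.2f} GB/s")
```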
Data Ingestion Performance with Kafka to an NVMe™ All-Flash Array
• Each sink connector is dedicated to a Kafka topic, i.e., the number of connectors equals the number of topics/files.
• Write throughput increases linearly from 8 to 16 connectors, and by 1.5x from 16 to 32 connectors.
• With 4 Kafka Connect worker nodes and 32 sink connectors pointing to a single NVMe All-Flash Array, the achieved average write throughput is 3.2 GB/s.
• This test helps determine the number of connectors to configure in the Kafka Connect cluster, based on the number of flash arrays, the input ingestion rates, and the available I/O throughput from the flash arrays; a configuration sketch follows the charts below.
• One JVM is used for each worker node in the Kafka Connect cluster, with a JVM heap size of 64 GB.
[Chart: Kafka FileStream sink connector scalability with number of connectors (one connector per topic/file): 8 connectors → 1.1 GB/s, 16 → 2.1 GB/s, 32 → 3.2 GB/s write throughput; higher is better.]
[Chart: Kafka FileStream sink connector scalability with number of Connect worker nodes (8 connectors per worker node): 1 node → 1.1 GB/s, 2 → 2.1 GB/s, 4 → 3.2 GB/s write throughput; higher is better.]
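Registering FileStream sink connectors like these against the Connect cluster's REST interface could look like the following sketch. The host, topic naming scheme, and file paths on the flash-array mount are assumptions; the connector class and REST endpoint are standard Kafka Connect.

```python
import json
import urllib.request

CONNECT_URL = "http://connect-worker1:8083/connectors"  # hypothetical Connect REST endpoint


def create_sink(i: int) -> None:
    """Register one FileStream sink connector: topic i -> file i on the flash array."""
    payload = {
        "name": f"flash-sink-{i}",
        "config": {
            "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
            "tasks.max": "1",
            "topics": f"ingest-topic-{i}",
            "file": f"/mnt/flash/ingest/topic-{i}.data",
        },
    }
    req = urllib.request.Request(
        CONNECT_URL, data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST")
    urllib.request.urlopen(req)


# One connector per topic/file; 32 in total, as in the benchmark.
for i in range(32):
    create_sink(i)
```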
Performance of a Data Pipeline from Object Notifications to Metadata Search
Performance: Object Notifications with Respect to Object Puts
• The benchmark comprises 3 clients acting as load generators, performing a total of 2.4M 'put' operations into object storage using 300, 600, and 1200 connections. The average notification rates for these puts are measured simultaneously, with a batch size of 1,000, followed by measurements of Kafka-Elasticsearch connector throughput and latency.
• The average rate of object notifications lags slightly behind the average rate of puts, by approximately 572 and 730 notifications/sec with 600 and 1200 connections respectively; there is no lag with 300 connections.
• The Kafka-Elasticsearch connector consumes messages at the same average rate of 1.7 MB/s to 2.1 MB/s at which the object notification service (as a Kafka producer) sends them to the Kafka topic, with no visible latency in the Kafka cluster for acknowledging the notifications or in the Kafka-Elasticsearch connector as a consumer.
• Optimal throughput with no latency is achieved with a minimal Kafka configuration: 2 nodes in the Kafka cluster, 1-2 Connect worker nodes with a total of 4 connectors, and 4-8 partitions of the Kafka topic. CPU usage is 10-15% in both the Kafka cluster and the Connect worker node(s), with a Kafka JVM heap size of 32 GB.
[Chart: Object notification rates, single bucket to a single Kafka cluster/topic, 2.4M puts.
300 connections: 2,655 avg puts/sec; 2,677 avg object notification msgs/sec; 2,091 KB/s notification service/Kafka producer throughput; 2,091 KB/s Elasticsearch connector consumer throughput.
600 connections: 3,046 puts/sec; 2,474 msgs/sec; 1,933 KB/s; 1,933 KB/s.
1200 connections: 2,912 puts/sec; 2,182 msgs/sec; 1,705 KB/s; 1,705 KB/s.]
Reference Implementation of Kafka for Object Notification Service (2 nodes; 8 shards; Replication Factor (RF) = 1)

Total Kafka storage space¹ = retention period (secs) × object notification message rate × 800 × (1 + RF) bytes

Per-node machine configuration | Expected 90th-percentile producer acknowledgment latency | Object notification message rate (K messages/sec)
• 4 vcpus, 64 GB | <10 ms | 500
• 8 vcpus, 64 GB | <10 ms | 1,000
• 16 vcpus, 128 GB | <10 ms | 2,000
• 32 vcpus, 128 GB | <10 ms | 4,000
• 48 vcpus, 128 GB | <10 ms | 8,000

¹ For example, with an object notification message rate of 2,700 messages/sec, a retention period (in the Kafka topic) of 7 days, and RF = 1, the total Kafka storage space required = 2,700 × 800 bytes × (7 × 24 × 3,600) × 2 ≈ 2.4 TB.
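The footnote arithmetic generalizes into a one-liner. The sketch below simply restates the slide's formula; the function name and the default 800-byte message size are the only additions.

```python
def kafka_storage_bytes(msg_rate_per_sec: float, retention_secs: float,
                        replication_factor: int, msg_bytes: int = 800) -> float:
    """Total Kafka storage = retention x message rate x message size x (1 + RF)."""
    return retention_secs * msg_rate_per_sec * msg_bytes * (1 + replication_factor)


# The slide's worked example: 2,700 msgs/sec, 7-day retention, RF = 1.
total = kafka_storage_bytes(2_700, 7 * 24 * 3_600, 1)
print(f"{total / 2**40:.1f} TiB")   # ~2.4, matching the slide's ~2.4 TB
```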
Elasticsearch Implementation for Metadata Indexing and Search: Up to 3 Billion Object Notification Messages

[Diagram: a load balancer distributes client requests across a 4-node Elasticsearch cluster. Each node runs one ES instance holding two leader (primary) shards and two replica shards of an 8-shard index (e.g., Node 1 holds shards 1 and 2 as leaders and shards 7 and 8 as replicas), along with its indexed data files and transaction log files.]

1. Dividing an index into multiple shards speeds up index creation, but slows query response times due to the overhead of aggregating results from all the shards.
2. Optimal shard configurations were determined for indexing and querying 3 billion object notification messages of 800 bytes each:
• 2 primary shards (and 2 replicas) per Elasticsearch instance;
• as the number of messages grows, a maximum of 375M messages (~280 GB) per shard provides optimal performance for both querying and indexing.
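Creating an index with that shard layout is a single call in the Elasticsearch Python client. The index name and field mapping below are invented for illustration; the shard/replica settings mirror the 4-node layout above.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-node1:9200"])  # hypothetical cluster endpoint

# 8 primary shards with 1 replica each, matching the 4-node layout
# (2 primaries + 2 replicas per node).
es.indices.create(
    index="object-notifications",
    body={
        "settings": {"number_of_shards": 8, "number_of_replicas": 1},
        "mappings": {"properties": {
            "path": {"type": "keyword"},       # S3/file location
            "label": {"type": "keyword"},      # custom metadata from the model
            "ingested_at": {"type": "date"},
        }},
    },
)
```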
Elasticsearch Performance: Index Creation Times, Up to 1 Billion Object Notification Messages
• This benchmark measures Elasticsearch index creation times as the number of object notification messages grows to 1 billion. Tests were performed with 4 nodes and 8 shards in the Elasticsearch cluster, in batches of 25M puts into object storage with a 2-hour interval between batches. The average rate of puts and the average rate of notification messages during this benchmark were 1,100/sec and 1,092/sec, respectively.
• Results show that the Elasticsearch index creation rate keeps pace with the object notification message rate.
[Chart: Index creation times up to 1 billion object notification messages (1,100 avg. puts/sec; 1,092 avg. object notification messages/sec). Cumulative hours: 100M → 25.0, 200M → 50.5, 300M → 75.2, 400M → 100.3, 500M → 125.3, 600M → 150.5, 700M → 175.8, 800M → 201.3, 900M → 226.7, 1000M → 252.2.]
Elasticsearch Performance: Query Response Times and Throughput, with Object Notification Messages Growing to 3 Billion
• This benchmark executes a suite of 1,100 mixed queries, comprising simple and complex queries, against the object notification messages. Growth to 3 billion messages was simulated from the existing 1 billion messages to measure query response times at 3 billion and derive a reference implementation.
• For the same number of nodes, the median Elasticsearch response times for mixed queries increase with an increasing number of shards, with a corresponding reduction in throughput, due to the overhead of aggregating query results from all shards.
• For the same number of shards, median response times decrease by ~20%-38% going from 2 to 4 nodes, with a corresponding throughput increase of 14%-106%; the throughput difference is most significant at 3 billion messages.
• Optimal query response times and throughput are achieved with 4 nodes and 8 shards in the Elasticsearch cluster.
[Chart: Elasticsearch median response times for mixed queries (ms); lower is better. 1 billion messages: 2 nodes/8 shards → 4, 4 nodes/8 shards → 3, 4 nodes/16 shards → 3. 3 billion messages: 2 nodes/8 shards → 5, 4 nodes/8 shards → 4, 4 nodes/16 shards → 10. 4 nodes with 8 shards is an optimal choice.]
Note: here, a shard refers to a primary shard.
[Chart: Elasticsearch mixed-query throughput (queries/sec); higher is better. 1 billion messages: 2 nodes/8 shards → 45,282, 4 nodes/8 shards → 51,705, 4 nodes/16 shards → 47,031. 3 billion messages: 2 nodes/8 shards → 8,886, 4 nodes/8 shards → 18,342, 4 nodes/16 shards → 13,541. 4 nodes with 8 shards is an optimal choice.]
Reference Implementation of Elasticsearch for Indexing and Search, Up to 3 Billion Object Notification Messages (4 nodes; 8 shards; replication factor = 1; total storage space ~4.5 TB including one replica)

Per-node machine configuration | Expected 90th-percentile response time | Expected query throughput (queries/sec) | Expected indexing rate (K messages/sec)
• 4 vcpus, 256 GB | <1 s | 1,528 | 11
• 8 vcpus, 256 GB | <1 s | 3,057 | 23
• 16 vcpus, 256 GB | <1 s | 6,114 | 47
• 32 vcpus, 256 GB | <1 s | 12,228 | 93
• 48 vcpus, 256 GB | <1 s | 18,342 | 141
DEMO: A Real-World Use Case
Implemented Real-World Use Case
1. Car image and flower image S3™ objects are ingested into object storage.
2. A service for object notifications publishes the metadata of the image S3 objects to a Kafka topic.
3. AI model serving (TensorFlow) subscribes to the Kafka topic to get the S3 locations of the images, gets the S3 image objects from object storage, and gets the AI model from the NVMe™ All-Flash Array.
4. It uses AI/deep learning algorithms to analyze the images and generate custom metadata; when the inference score exceeds 80%, it stores the S3 object paths and custom metadata in Elasticsearch® (the key-value store).
5. When the inference score falls below 80%, it notifies the S3 locations of the failed flower image objects to data preparation and AI model training (TensorFlow on a GPU cluster, with training data on the flash array) to be used for re-training the model.
6. Kibana™ dashboards search and visualize the custom metadata of the car and flower image S3 objects.
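Once the custom metadata lands in Elasticsearch, the Kibana-style searches in this demo reduce to simple queries. A hedged example against the hypothetical index and field names used in the earlier sketches:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elasticsearch:9200"])

# Find recently indexed car images by their model-generated label.
resp = es.search(index="object-notifications", body={
    "query": {"bool": {
        "must": [{"match": {"label": "car"}}],
        "filter": [{"range": {"ingested_at": {"gte": "now-1d"}}}],
    }},
    "size": 10,
})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["path"], hit["_source"]["label"])
```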
Summary and Best Practices
• Implementing a disaggregated architecture of GPU compute, a shared pool of NVMe™ All-Flash Arrays, and object storage system(s) has multiple benefits for executing AI workloads:
  – Repeated data transfers in and out of the local SSDs of GPU servers are avoided as data grows beyond their capacity.
  – Inference is faster due to immediate access to trained models on the shared flash storage.
  – Businesses can independently scale GPU servers and shared flash arrays to meet the changing needs of their AI workloads, with the flexibility to run mixed AI workloads for training and inference.
  – With a preferred allocation strategy for the I/O bandwidth, various teams can share and scale the flash arrays to serve multiple GPU servers cost-effectively.
  – High-capacity object storage may be used as a landing zone for ingested data as well as an archival solution.
• Best practices for attaining optimal ingestion performance with Kafka to NVMe All-Flash Arrays:
  – Tune the number of connectors and worker nodes in the Kafka Connect cluster based on the number of flash arrays, the input ingestion rates, and the available I/O throughput from the arrays.
  – Based on the I/O throughput requirement, configure high-speed network interfaces and topology for the Kafka cluster, the Kafka Connect worker nodes, and the flash array(s) to eliminate network bottlenecks.
• Best practices for attaining optimal performance of object notifications and Elasticsearch:
  – A minimal Kafka configuration suffices: 2 nodes in the Kafka cluster, 1-2 Connect worker nodes with a total of 4 connectors, and 4-8 partitions of the Kafka topic.
  – The object notification message rate closely tracks the object put rate, and the Elasticsearch index creation rate closely tracks the object notification message rate.
  – Optimal indexing and search performance is achievable with 4 nodes and 8 shards in an Elasticsearch cluster for up to 3 billion object notification messages.
Links to Western Digital Product Pages and White Papers
• Product Pages: Storage Platforms; Object Storage
• White Papers: ActiveScale™ Data Pipeline Service with Elasticsearch® – Best Practices for Metadata Indexing and Search; Use Cases for ActiveScale™ Data Pipeline Service

Architecting Data Infrastructure for the Zettabyte Age