devaraj kavali, maygol kananizadeh, uma gangumalla · 2019-10-30 · legal disclaimers relative...

Devaraj Kavali, Maygol Kananizadeh, Uma Gangumalla


Post on 20-May-2020


Page 1: Devaraj Kavali, Maygol Kananizadeh, Uma Gangumalla · 2019-10-30 · Legal Disclaimers Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result,

Devaraj Kavali, Maygol Kananizadeh, Uma Gangumalla


Legal Disclaimers

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. 
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.


Legal Disclaimers

Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported.
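The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the function name `relative_performance`, the platform names, and the numbers are all hypothetical, and a higher-is-better metric (such as throughput) is assumed.

```python
def relative_performance(results, baseline_platform):
    """Normalize raw benchmark results so the baseline platform scores 1.0.

    `results` maps a platform name to its raw benchmark result.
    A higher-is-better metric (for example throughput) is assumed.
    """
    baseline = results[baseline_platform]
    if baseline <= 0:
        raise ValueError("baseline result must be positive")
    # Divide the baseline result into every platform's result.
    return {name: value / baseline for name, value in results.items()}


# Hypothetical throughput numbers, for illustration only.
example = relative_performance(
    {"platform_a": 200.0, "platform_b": 300.0, "platform_c": 500.0},
    baseline_platform="platform_a",
)
```

For a lower-is-better metric such as elapsed time, the ratio would normally be inverted (baseline divided by result) so that larger numbers still mean better performance.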

SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjbb, SPECjvm, SPECWeb, SPECompM, SPECompL, SPEC MPI, SPECjEnterprise* are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information. TPC-C, TPC-H, TPC-E are trademarks of the Transaction Processing Performance Council. See http://www.tpc.org for more information.

Hyper-Threading Technology requires a computer system with a processor supporting HT Technology and an HT Technology-enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. For more information including details on which processors support HT Technology, see here

Intel® Turbo Boost Technology requires a Platform with a processor with Intel Turbo Boost Technology capability. Intel Turbo Boost Technology performance varies depending on hardware, software and overall system configuration. Check with your platform manufacturer on whether your system delivers Intel Turbo Boost Technology. For more information, see http://www.intel.com/technology/turboboost

No computer system can provide absolute security. Requires an enabled Intel® processor and software optimized for use of the technology. Consult your system manufacturer and/or software vendor for more information.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families: Go to: Learn About Intel® Processor Numbers

Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s current plan of record product roadmaps.

Copyright © 2013 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon and Intel Core are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. All dates and products specified are for planning purposes only and are subject to change without notice

*Other names and brands may be claimed as the property of others.


About Us

Devaraj Kavali is a software engineer at Intel Corporation and has been designing and developing large-scale distributed systems for more than 10 years. He is also an Apache Hadoop committer and PMC member.

Maygol Kananizadeh has 6 years of experience in performance engineering and competitive analysis.

Uma Maheswara Rao G is an Apache Software Foundation member, an Apache Hadoop committer, a member of the Apache Hadoop PMC, and a long-term active contributor to the Apache Hadoop project. He is also a PMC member for the Apache BookKeeper project. Uma is a Software Architect at Intel and works on open source technologies.


Agenda

• Current status of AI and data analytics

• Introducing Data Analytics Reference Stack (DARS) and usages

• Benchmarking the performance of DARS

• Future works

• Resources


Current State of AI and Data Analytics

• In 2017, the average fiber to the home (FTTH) household generated 85 GB of Internet traffic per month and is expected to generate approximately 264 GB per month in 2022.

• For comparison, in a single day a smart car will generate 50 GB, a smart hospital 3,000 GB, a plane 40 TB, and a city safety system 50 PB. These predictions are for 2019; by 2022 there will be 3x more connected devices (28.5 billion) than the global population, which means 3x more traffic. The quantity of data generated is difficult to comprehend, much less make actionable.

• This exponential growth in volume and variety of data provides enterprises a tremendous opportunity to gain a competitive edge through analytics-driven insights.

• Those who turn the mountains of information into actionable intelligence will be positioned to make business operations more efficient, drive faster innovation, and deliver improved security.


Initiative of Software Stacks

• With this goal in mind, Intel is releasing a Data Analytics Reference Stack (DARS) to help enterprises analyze, classify, recognize, and process large amounts of data.

• Using a modern system stack such as this, built on Intel® Xeon® Scalable platforms and featuring software optimizations at each layer, enterprise customers and developers can gain a significant performance boost, from hardware up to the application layer.

• This ready-to-use stack gives application developers and architects a powerful way to store and process large amounts of data by using a distributed processing framework to efficiently build big-data solutions and solve domain-specific problems.

• Having a streamlined system stack frees users from the complexity of integrating multiple components and software versions, and helps to deliver a stable, performant platform upon which to quickly develop, test, and deploy solutions.


Intel OneStack

[Figure: Intel OneStack diagram, with the labels DARS, DBRS, and DLRS]

Data Analytics Reference Stack Solution

A highly optimized stack for machine learning workloads on Intel Architecture

• Retain flexibility for customized solutions

• Reduces complexity associated with software components, allows for quick prototyping

• Flexible deployment models for service providers and on-prem installations

• Help maximize performance of data science operations, container images tuned for hardware acceleration from IA

• Tailored images are optimized and tested together for various use cases

• Easy deployment from Docker Hub


Data Analytics Reference Stack

Help maximize the full potential of IA-based platforms for data analytics workloads.

Open-sourced to ensure developers have easy access to features & functionality of Intel platforms.

*Other names and brands may be claimed as the property of others

How the Data Analytics Reference Stack can be used in the Data Center & for Enterprises

• Highly-tuned and built for enterprises

• Having a streamlined system stack frees users from the complexity of integrating multiple components and software versions

• Provides the best-known configuration at each level of the software stack from the OS up to the framework layers

• Delivers the state-of-the-art open-source software optimized to utilize the latest Intel Architecture platform features

• Components carefully chosen, integrated and tested for functional interoperability and performance


Data Analytics Use Cases

• eCommerce: Quantitative analysis and interpretation of shopper habits for product recommendations and targeted advertising

• Public Transportation: Traveler patterns analyzed to create schedules optimized for resources and usage

• Genomics: Patient data, lab research combined and evaluated for future medical breakthroughs

• Retail: Discover insights on shopping aspects in a brick and mortar leading to operational agility

• Social Media: Advertisers gain valuable insight to user interests allowing for custom content

• Market Research: Data classification, scalable processing for decision making


DARS Configuration

Component     Baseline                          DARS – MKL                      DARS – OpenBLAS
OS            CentOS release 7.6.1810           ClearLinux 29100                ClearLinux 29100
Kernel        4.20.0-1.el7.elrepo.x86_64        5.0.2-717                       5.0.2-717
Java          OpenJDK-1.8_191                   OpenJDK-11.0.2                  OpenJDK-11.0.2
Math library  F2JBLAS 1.1                       MKL 2018.3.222                  OpenBLAS 0.2.20
Hadoop        Cloudera CDH-5.12                 Apache Hadoop 3.2.0 + patches#  Apache Hadoop 3.2.0 + patches#
Spark         Apache Spark 1.6 (from CDH-5.12)  Apache Spark 2.4.0 + patches#   Apache Spark 2.4.0 + patches#
Scala         2.11                              2.12.4                          2.12.4
Filesystem    HDFS (RF=3)                       HDFS (RF=3)                     HDFS (RF=3)

# Patches needed to resolve incompatibility with Java 11. Patches are available upstream.


Hardware Configuration

Component             Configuration
Nodes                 1 master + 4 worker nodes
CPU sockets/node      2
CPU                   Xeon Gold 6140
Cores / Threads       18 Cores / 36 Threads
Clock: Base / Turbo   2.3 GHz / 3.7 GHz
L3 Cache              24.75 MB L3
Memory/Node           384 GB (12 * 32 GB DDR4 DIMMs, rated @ 2400 MHz, operating @ 2400 MHz)
Storage/Node          5.6 TB (7 * 800GB SATA3 SSD)
Network               10 Gbps Ethernet
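As a quick sanity check on the per-node figures above, the aggregate worker resources can be worked out as below. This is only an arithmetic sketch: it assumes the cores/threads figure is per socket and that only the 4 worker nodes run benchmark tasks. The constant and function names are made up for illustration.

```python
# Per-node figures copied from the hardware configuration above.
WORKER_NODES = 4
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 18          # assumption: the 18 cores / 36 threads are per socket
THREADS_PER_SOCKET = 36
MEMORY_GB_PER_NODE = 12 * 32   # 12 DIMMs of 32 GB each
STORAGE_GB_PER_NODE = 7 * 800  # 7 SSDs of 800 GB each


def worker_totals(nodes=WORKER_NODES):
    """Aggregate resources across worker nodes (master excluded, an assumption)."""
    return {
        "cores": nodes * SOCKETS_PER_NODE * CORES_PER_SOCKET,
        "threads": nodes * SOCKETS_PER_NODE * THREADS_PER_SOCKET,
        "memory_gb": nodes * MEMORY_GB_PER_NODE,
        "storage_tb": nodes * STORAGE_GB_PER_NODE / 1000,
    }


totals = worker_totals()
```

The per-node products reproduce the 384 GB and 5.6 TB figures listed in the table.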


HiBench v7.0 (link)

• Overview:

• Collection of multiple workloads / micro-benchmarks

• Categories: micro, SQL, ML, websearch, graph, streaming

• 3 “vectors”: workload, dataset (tiny through bigdata), language (Java, Scala)

• New workloads added in HiBench (v7): Random Forest, Gradient Boosting Tree, ALS, PCA, Linear Regression, SVM, SVD, LDA

• In scope:

• Spark-Kmeans


Spark-Perf (link)

• Overview:

• Collection of multiple Spark workloads / micro-benchmarks

• In scope:

• ALS, GLM, SVD (ML workloads using Spark)

• With custom parameters


Benchmarks Configuration

Spark-Perf   Key parameters
ALS          Users=40M       Products=10M    Ratings=50M
SVD          Examples=60M    Features=500    Ranks=50
GLM          Examples=10M    Features=100K   N/A

HiBench      Key parameters
K-means      Clusters=5      Iterations=10   Samples=3.2B


DARS Performance Compared to CentOS

Baseline performance shown as 1.0x

[Chart: DARS-MKL speedup vs. baseline (higher is better). Categories: svd, als, k-means. Data labels: 1.89, 8.02, 2.25. Axis range: 0.00 to 9.00.]

[Chart: DARS-OpenBLAS speedup vs. baseline (higher is better). Categories: svd, glm, k-means, als. Data labels: 2.94, 1.23, 2.65, 1.02. Axis range: 0.00 to 9.00.]

Alternating Least Square (ALS) Hot Functions

Solves a symmetric positive definite linear system via Cholesky factorization. The input arguments are modified in-place to store the factorization and the solution.
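To make the in-place behavior described above concrete, here is a minimal pure-Python sketch of a Cholesky-based solve. It is an illustration only: it uses dense nested lists rather than the packed storage an optimized LAPACK routine would work on, and the function name `cholesky_solve_in_place` is hypothetical.

```python
import math


def cholesky_solve_in_place(a, b):
    """Solve a x = b for a symmetric positive definite matrix `a`.

    On return, the lower triangle of `a` holds the Cholesky factor L
    (a = L L^T) and `b` holds the solution x. Both inputs are overwritten.
    """
    n = len(b)
    # Factorization: overwrite the lower triangle of `a` with L, column by column.
    for j in range(n):
        diag = a[j][j] - sum(a[j][k] ** 2 for k in range(j))
        if diag <= 0.0:
            raise ValueError("matrix is not positive definite")
        a[j][j] = math.sqrt(diag)
        for i in range(j + 1, n):
            s = sum(a[i][k] * a[j][k] for k in range(j))
            a[i][j] = (a[i][j] - s) / a[j][j]
    # Forward substitution: solve L y = b, storing y in `b`.
    for i in range(n):
        s = sum(a[i][k] * b[k] for k in range(i))
        b[i] = (b[i] - s) / a[i][i]
    # Back substitution: solve L^T x = y, storing x in `b`.
    for i in reversed(range(n)):
        s = sum(a[k][i] * b[k] for k in range(i + 1, n))
        b[i] = (b[i] - s) / a[i][i]
    return b


# Small example system whose exact solution is [1, 2, 3].
a = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
b = [-20.0, -43.0, 192.0]
x = cholesky_solve_in_place(a, b)
```

In ALS, many small systems of this shape are solved per iteration, which is one reason a tuned math library can matter for this step.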


Singular Value Decomposition (SVD) Hot Functions


This method returns the next pseudorandom Gaussian distributed double number with mean 0.0 and standard deviation 1.0.
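The description above is that of a standard normal sampler. One classic way to produce such values is the Marsaglia polar method, sketched below in Python. The function name `next_gaussian` and the seed are illustrative, and whether the profiled code path uses exactly this algorithm is an assumption.

```python
import math
import random


def next_gaussian(rng):
    """Draw one value with mean 0.0 and standard deviation 1.0 (polar method)."""
    while True:
        # Pick a point uniformly in the square [-1, 1) x [-1, 1).
        v1 = 2.0 * rng.random() - 1.0
        v2 = 2.0 * rng.random() - 1.0
        s = v1 * v1 + v2 * v2
        # Keep only points strictly inside the unit circle, excluding the origin.
        if 0.0 < s < 1.0:
            break
    multiplier = math.sqrt(-2.0 * math.log(s) / s)
    # v2 * multiplier is a second independent sample; it is discarded here for brevity.
    return v1 * multiplier


rng = random.Random(42)
samples = [next_gaussian(rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((x - mean) ** 2 for x in samples) / len(samples))
```

Each draw needs a logarithm, a square root, and on average more than one pair of uniform values, so generating very large random datasets can spend a noticeable share of time in this call.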


GLM Classification Logistic Hot Functions


This method returns the next pseudorandom Gaussian distributed double number with mean 0.0 and standard deviation 1.0.


K-means Hot Functions


1. DDOT forms the dot product of two vectors. Uses unrolled loops for increments equal to one.

2. Return the squared Euclidean distance between two vectors.
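The two hot functions above are closely related: a squared Euclidean distance can be expanded into vector norms plus one dot product. The sketch below illustrates that identity in plain Python. The function names are hypothetical, and it is an assumption that the profiled distance routine follows this expansion.

```python
def ddot(x, y):
    """Dot product of two equal-length vectors (the operation BLAS DDOT performs)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(x, y))


def squared_distance(x, y):
    """Squared Euclidean distance via ||x||^2 + ||y||^2 - 2 * (x . y)."""
    return ddot(x, x) + ddot(y, y) - 2.0 * ddot(x, y)


def squared_distance_direct(x, y):
    """Reference implementation: sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


point = [1.0, 2.0, 3.0]
center = [4.0, 6.0, 3.0]
fast = squared_distance(point, center)
direct = squared_distance_direct(point, center)
```

When the norms are precomputed once per point and per cluster center, each distance costs a single dot product. The expansion can lose precision when the two vectors are nearly identical, so a direct computation is a reasonable fallback in that case.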


Future Works

New features:

• HDFS read cache with Apache Pass (AEP): https://issues.apache.org/jira/browse/HDFS-13762

• Intel Quick Assist Technology (QAT): https://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html

• Intel Integrated Performance Primitives: https://software.intel.com/en-us/ipp

• Optimized Analytics Package for Spark Platform (OAP): https://github.com/Intel-bigdata/OAP


Resources

• Download DARS: https://clearlinux.org/stacks/data-analytics

• HiBench 6.0: https://github.com/Intel-bigdata/HiBench/releases/tag/HiBench-6.0

• HiBench 7.0: https://github.com/Intel-bigdata/HiBench/releases/tag/HiBench-7.0

• Spark-Perf: https://gitlab.devtools.intel.com/SSP-Benchmarking/spark-perf-for-spark2


DARS components

• Benchmarks and sample applications: HiBench v7, spark-perf v2

• Deployment tools, utilities: Customized / proprietary configuration for deployment

• Spark: Apache Spark 2.4 + patches

• Hadoop (MR, HDFS, YARN): Apache Hadoop 3.2.0 + patches

• Runtimes (Java, python): OpenJDK-11.0.2

• Optimized system and math libraries: Intel MKL 2018.3.222 / OpenBLAS-0.2.20

• Clear Linux: Latest upstream version (kernel 5.x)

• Hardware: ISA (AVX-512), Storage (Optane SSDs), Memory (DCPMM)

DARS container

Available at: https://clearlinux.org/stacks/data-analytics


Latent Dirichlet Allocation (LDA) Hot Functions


DGEMV performs one of the matrix-vector operations

y := alpha*A*x + beta*y, or y := alpha*A'*x + beta*y,

where alpha and beta are scalars, x and y are vectors and A is an m by n matrix.
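The DGEMV operation defined above can be written out as a short reference loop. This is a simplified sketch for illustration: the real BLAS routine takes extra arguments (leading dimension, strides) that are left out here, and the boolean `trans` flag stands in for the BLAS character argument.

```python
def dgemv(trans, alpha, a, x, beta, y):
    """Update y in place: y := alpha*A*x + beta*y, or with A' if `trans` is True.

    `a` is an m-by-n matrix stored as a list of m rows.
    """
    m, n = len(a), len(a[0])
    # With A transposed, the result has n entries and x has m entries.
    rows, cols = (n, m) if trans else (m, n)
    if len(x) != cols or len(y) != rows:
        raise ValueError("dimension mismatch")
    for i in range(rows):
        if trans:
            acc = sum(a[k][i] * x[k] for k in range(cols))
        else:
            acc = sum(a[i][k] * x[k] for k in range(cols))
        y[i] = alpha * acc + beta * y[i]
    return y


matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
plain = dgemv(False, 2.0, matrix, [1.0, 1.0, 1.0], 0.5, [10.0, 20.0])
transposed = dgemv(True, 1.0, matrix, [1.0, 2.0], 0.0, [0.0, 0.0, 0.0])
```

An optimized library performs the same arithmetic but with vectorized, cache-aware kernels, which is where the choice of math library can show up in a profile.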
