sas® and hadoop share cluster architecture · 2017-11-28 · #analyticsx james kochuba sas senior...

3
#analyticsx James Kochuba SAS Senior Architect ABSTRACT SAS solutions run in Hadoop, providing customers with the best tools to transform data in Hadoop all the way to providing insights to make the right decisions for your business. Incorporating SAS products and solutions into your shared Hadoop cluster in a cooperative fashion with YARN to manage the resources is possible. Best practices and customer examples will be provided and discussed around how to build and manage a shared cluster with SAS applications and products. HADOOP BACKGROUND HADOOP CONCERNS SAS® and Hadoop Share Cluster Architecture Apache Hadoop – Open-Source software based on HDFS, YARN/MR Hadoop Environment – HDFS, YARN/MR, Hive, Pig, Spark, Impala, ZooKeeper, Oozie, etc Hadoop Distribution – Cloudera, Hortonworks, MapR, etc Hadoop - Cheap environment for distributed storage and distributed compute with linear expansion HDFS (Hadoop Distributed File System) MapReduce Impala More User and More Services = Resource Contention = User Complaining Hadoop Solution = YARN Fix the shared cluster Resource Negotiator Capabilities Scheduling of resources Containers to attempt resource restriction Master Node Worker Node Worker Node YARN RM 16 GB Total Memory 4 GB Min Container size 1 shared queue YARN Default queue Spark-Submit (Spark) 2 containers 4 GB per container Beeline (Hive) 4 containers 2 GB per container Beeline (Hive) Spark- Submit (Spark) YARN NM 8 GB Memory YARN NM 8 GB Memory 2 - 4 GB 1 - 4 GB 4 - 4 GB 3 - 4 GB 1 - 4 GB 2 - 4 GB YARN RM 16 GB Total Memory 1 GB Min Container size 1 shared queue Beeline (Hive) Spark- Submit (Spark) 4 - 2 GB 2 - 4 GB 2 - 2 GB 3 - 2 GB 1 - 2 GB 1 - 4 GB

Upload: others

Post on 25-Apr-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SAS® and Hadoop Share Cluster Architecture · 2017-11-28 · #analyticsx James Kochuba SAS Senior Architect SAS TECHNOLOGIES ON HADOOP RESULTS (CLICK TO EDIT) SAS® and Hadoop Share

#analyticsx

James Kochuba

SAS Senior Architect

ABSTRACT

• SAS solutions run in Hadoop, providing customers with the best tools to transform data in Hadoop all the way to providing insights to make the right decisions for your business.

• Incorporating SAS products and solutions into your shared Hadoop cluster in a cooperative fashion with YARN to manage the resources is possible.

• Best practices and customer examples will be provided and discussed around how to build and manage a shared cluster with SAS applications and products.

HADOOP BACKGROUND

HADOOP CONCERNS

SAS® and Hadoop Share Cluster Architecture

• Apache Hadoop – Open-Source software based on HDFS, YARN/MR

• Hadoop Environment – HDFS, YARN/MR, Hive, Pig, Spark, Impala, ZooKeeper, Oozie, etc

• Hadoop Distribution – Cloudera, Hortonworks, MapR, etc

• Hadoop - Cheap environment for distributed storage and distributed compute with linear expansion

HDFS (Hadoop Distributed File System)MapReduce

Impala

• More User and More Services = Resource Contention = User Complaining

• Hadoop Solution = YARN• Fix the shared cluster

• Resource Negotiator

• Capabilities

• Scheduling of resources

• Containers to attempt resource restriction

Mas

ter

No

de

Wo

rker

No

de

Wo

rker

No

de

YARN RM16 GB Total Memory4 GB Min Container size1 shared queue

YARN Default queue

Spark-Submit (Spark)• 2 containers • 4 GB per container

Beeline (Hive)• 4 containers • 2 GB per container

Beeline (Hive)

Spark-Submit (Spark)

YARN NM8 GB Memory

YARN NM8 GB Memory

2 - 4 GB

1 - 4 GB

4 - 4 GB

3 - 4 GB

1 - 4 GB

2 - 4 GB

YARN RM16 GB Total Memory1 GB Min Container size1 shared queue

Beeline (Hive)

Spark-Submit (Spark)

4 - 2 GB

2 - 4 GB

2 - 2 GB

3 - 2 GB

1 - 2 GB

1 - 4 GB

Page 2: SAS® and Hadoop Share Cluster Architecture · 2017-11-28 · #analyticsx James Kochuba SAS Senior Architect SAS TECHNOLOGIES ON HADOOP RESULTS (CLICK TO EDIT) SAS® and Hadoop Share

#analyticsx

James Kochuba

SAS Senior Architect

SAS TECHNOLOGIES ON HADOOP RESULTS (CLICK TO EDIT)

SAS® and Hadoop Share Cluster Architecture

SAS® Grid/Server

SAN

SAS In-memory(TKGrid / LASR)

High Speed Network

SAS Access

SAS In-database(Embedded Process -

EP)Hadoop Cluster

Analytic Platform

SAS

In-D

atab

ase

• SAS/ACCESS® Interfaces

• SAS® In-Database Analytics

• SAS® In-Memory Analytics

• SAS® Grid ComputingSAS® Grid/Server

libname hadoop hdfs(hdmd)libname hadoop (hive)libname hawq / impalaSAS reporting (SQL)

SAS ACCESS

Hive HDMD

Hawq \Impala

Scoring Accel callsCode Accel callsData Quality Accel calls

EP EP EP EP EP EP EP EP EP EP EP EP EP EP EP EP

YARN

Hadoop

SAS® TKGrid

HPA procs/LSAR

Page 3: SAS® and Hadoop Share Cluster Architecture · 2017-11-28 · #analyticsx James Kochuba SAS Senior Architect SAS TECHNOLOGIES ON HADOOP RESULTS (CLICK TO EDIT) SAS® and Hadoop Share

#analyticsx

James Kochuba

SAS Senior Architect

SAS ALL IN HADOOP

SAS® and Hadoop Share Cluster Architecture

REFERENCES

Kochuba, James. 2016. “Best Practices for Resource Management in Hadoop”. Available at http://support.sas.com/resources/papers/proceedings16/SAS2140-2016.pdf

“Apache Hadoop Yarn.” The Apache Software Foundation. June 29, 2015. Available at https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html.

“SAS Analytics on Your Hadoop Cluster Managed by YARN.” July 14, 2015. Available at http://support.sas.com/rnd/scalability/grid/hadoop/SASAnalyticsOnYARN.pdf.

Troung, Kim. 2014. “How Apache Hadoop Helps SAS: SAS HPA and SAS LASR Analytic Server Integrate with Apache Hadoop YARN.” Available at http://hortonworks.com/blog/apache-hadoop-helps-sas/.

SAS Institute Inc. 2015. “PROPERTIES= LIBNAME Option.” SAS/ACCESS 9.4 for Relational Databases: Reference. Cary, NC: SAS Institute Inc. Available at http://support.sas.com/documentation/cdl/en/acreldb/68028/HTML/default/viewer.htm#p0ly2onqqpbys8n1j9lra8qa6q20.htm.

SAS Institute Inc. “Introduction to SAS Grid Computing.” Available at http://support.sas.com/rnd/scalability/grid/. Accessed February 2016.

CONCLUSIONS

• YARN is the Hadoop answer to resource orchestration.

• SAS provides many technologies to run a SAS program that provide the best environment for the specific action.

• All of these technologies can run on a shared Hadoop environment and integrate with YARN as a central resource orchestrator.

• YARN is not perfect, and negotiation for resources during resource constrained times can be slow to change.

• The best tuning in the shared Hadoop environment for the YARN settings and OS will need to balance the software usage requirement and the business users’ expectations.

• The best practice is to first set up a simple resource orchestration that meets license requirements. Then slowly tune the settings to remove less-important items from taking resources, or reserve more resources for high-priority items. Leave the middle-priority items as is, because these will be extremely hard to modify with a resource negotiator.

• Remember, tools act differently when resources become constrained with a resource negotiator.

SAS® HPA

SAS® processes

libname hadoop hdfs (hdmd)libname hadoop (hive)libname hawq / impalaSAS SQL

SAS In Hadoop• SAS runs in Hadoop cluster• No separate SAS Compute servers• Leverages native YARN functionality for

SAS resource management

Scoring Accel callsCode Accel callsData Quality Accel calls

High PerformanceVisual AnalyticsIMSTAT

YARN

Hadoop

Hadoop Cluster with VA, VS, IMSTAT, HPA, Grid

SAS® LASR

SAS® EP

SAS® Grid