sas® and hadoop share cluster architecture · 2017-11-28 · #analyticsx james kochuba sas senior...
TRANSCRIPT
#analyticsx
James Kochuba
SAS Senior Architect
ABSTRACT
• SAS solutions run in Hadoop, providing customers with the best tools to transform data in Hadoop all the way to providing insights to make the right decisions for your business.
• Incorporating SAS products and solutions into your shared Hadoop cluster in a cooperative fashion with YARN to manage the resources is possible.
• Best practices and customer examples will be provided and discussed around how to build and manage a shared cluster with SAS applications and products.
HADOOP BACKGROUND
HADOOP CONCERNS
SAS® and Hadoop Share Cluster Architecture
• Apache Hadoop – Open-Source software based on HDFS, YARN/MR
• Hadoop Environment – HDFS, YARN/MR, Hive, Pig, Spark, Impala, ZooKeeper, Oozie, etc
• Hadoop Distribution – Cloudera, Hortonworks, MapR, etc
• Hadoop - Cheap environment for distributed storage and distributed compute with linear expansion
HDFS (Hadoop Distributed File System)MapReduce
Impala
• More User and More Services = Resource Contention = User Complaining
• Hadoop Solution = YARN• Fix the shared cluster
• Resource Negotiator
• Capabilities
• Scheduling of resources
• Containers to attempt resource restriction
Mas
ter
No
de
Wo
rker
No
de
Wo
rker
No
de
YARN RM16 GB Total Memory4 GB Min Container size1 shared queue
YARN Default queue
Spark-Submit (Spark)• 2 containers • 4 GB per container
Beeline (Hive)• 4 containers • 2 GB per container
Beeline (Hive)
Spark-Submit (Spark)
YARN NM8 GB Memory
YARN NM8 GB Memory
2 - 4 GB
1 - 4 GB
4 - 4 GB
3 - 4 GB
1 - 4 GB
2 - 4 GB
YARN RM16 GB Total Memory1 GB Min Container size1 shared queue
Beeline (Hive)
Spark-Submit (Spark)
4 - 2 GB
2 - 4 GB
2 - 2 GB
3 - 2 GB
1 - 2 GB
1 - 4 GB
#analyticsx
James Kochuba
SAS Senior Architect
SAS TECHNOLOGIES ON HADOOP RESULTS (CLICK TO EDIT)
SAS® and Hadoop Share Cluster Architecture
SAS® Grid/Server
SAN
SAS In-memory(TKGrid / LASR)
High Speed Network
SAS Access
SAS In-database(Embedded Process -
EP)Hadoop Cluster
Analytic Platform
SAS
In-D
atab
ase
• SAS/ACCESS® Interfaces
• SAS® In-Database Analytics
• SAS® In-Memory Analytics
• SAS® Grid ComputingSAS® Grid/Server
libname hadoop hdfs(hdmd)libname hadoop (hive)libname hawq / impalaSAS reporting (SQL)
SAS ACCESS
Hive HDMD
Hawq \Impala
Scoring Accel callsCode Accel callsData Quality Accel calls
EP EP EP EP EP EP EP EP EP EP EP EP EP EP EP EP
YARN
Hadoop
SAS® TKGrid
HPA procs/LSAR
#analyticsx
James Kochuba
SAS Senior Architect
SAS ALL IN HADOOP
SAS® and Hadoop Share Cluster Architecture
REFERENCES
Kochuba, James. 2016. “Best Practices for Resource Management in Hadoop”. Available at http://support.sas.com/resources/papers/proceedings16/SAS2140-2016.pdf
“Apache Hadoop Yarn.” The Apache Software Foundation. June 29, 2015. Available at https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html.
“SAS Analytics on Your Hadoop Cluster Managed by YARN.” July 14, 2015. Available at http://support.sas.com/rnd/scalability/grid/hadoop/SASAnalyticsOnYARN.pdf.
Troung, Kim. 2014. “How Apache Hadoop Helps SAS: SAS HPA and SAS LASR Analytic Server Integrate with Apache Hadoop YARN.” Available at http://hortonworks.com/blog/apache-hadoop-helps-sas/.
SAS Institute Inc. 2015. “PROPERTIES= LIBNAME Option.” SAS/ACCESS 9.4 for Relational Databases: Reference. Cary, NC: SAS Institute Inc. Available at http://support.sas.com/documentation/cdl/en/acreldb/68028/HTML/default/viewer.htm#p0ly2onqqpbys8n1j9lra8qa6q20.htm.
SAS Institute Inc. “Introduction to SAS Grid Computing.” Available at http://support.sas.com/rnd/scalability/grid/. Accessed February 2016.
CONCLUSIONS
• YARN is the Hadoop answer to resource orchestration.
• SAS provides many technologies to run a SAS program that provide the best environment for the specific action.
• All of these technologies can run on a shared Hadoop environment and integrate with YARN as a central resource orchestrator.
• YARN is not perfect, and negotiation for resources during resource constrained times can be slow to change.
• The best tuning in the shared Hadoop environment for the YARN settings and OS will need to balance the software usage requirement and the business users’ expectations.
• The best practice is to first set up a simple resource orchestration that meets license requirements. Then slowly tune the settings to remove less-important items from taking resources, or reserve more resources for high-priority items. Leave the middle-priority items as is, because these will be extremely hard to modify with a resource negotiator.
• Remember, tools act differently when resources become constrained with a resource negotiator.
SAS® HPA
SAS® processes
libname hadoop hdfs (hdmd)libname hadoop (hive)libname hawq / impalaSAS SQL
SAS In Hadoop• SAS runs in Hadoop cluster• No separate SAS Compute servers• Leverages native YARN functionality for
SAS resource management
Scoring Accel callsCode Accel callsData Quality Accel calls
High PerformanceVisual AnalyticsIMSTAT
YARN
Hadoop
Hadoop Cluster with VA, VS, IMSTAT, HPA, Grid
SAS® LASR
SAS® EP
SAS® Grid