The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates


Page 1: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Page 2: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates

The Enterprise and Connected Data: Trends in the Apache Hadoop Ecosystem
Alan Gates, Co-Founder, Hortonworks
@alanfgates

Speaker notes:
Speaker: Alan Gates, Hortonworks Co-Founder. Title: 10 Years of Apache Hadoop and Beyond. Duration: 40 minutes. Abstract: In 2006, Apache Hadoop had its first line of code committed to what has become a breakthrough technology. A decade later, we are witness to open source innovation that has literally changed the face of business. Hadoop and related technologies have become the enterprise data platform, fueled by a rich ecosystem capable of supporting any application, any data, anywhere. Join Hortonworks Co-Founder Alan Gates as he drills down into the current and future state of Hadoop and reviews community initiatives aimed at enabling the next wave of modern data applications that are well governed and easy to deploy on-premises and in the cloud.
Page 3: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates


Our Hadoop Journey Begins…

[Diagram: a cluster of nodes 1 … N running HDFS and MapReduce for batch apps – 2006]

Speaker notes:
Our Hadoop journey began in 2006 focused on executing batch MapReduce jobs on petabytes of data. Yahoo’s decision to contribute Hadoop to the Apache Software Foundation was critical because a vibrant set of related technologies began to appear around Hadoop. [NEXT]
Page 4: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates


Our Hadoop Journey: Ecosystem Innovation Accelerates

[Timeline: 2006 → 2011 → Today]

Speaker notes:
Fast forward to 2011 and the concept of YARN began to emerge. Its goal? Enable Hadoop to move from its batch-only roots and become a data platform capable of running batch, interactive, and real-time applications. YARN further accelerated innovation around Hadoop, with Spark, Kafka, Storm, and many other projects starting life as Apache Incubator proposals. [NEXT]
Page 5: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates


6 Years of Apache Hive and Beyond

• Apache Hive becomes a Top-Level Project

• HiveServer2 adds ODBC/JDBC

• SQL breadth expands with windowing and more

• Apache Tez enters incubation

• Hive 0.13 marks delivery of the Stinger Initiative with Tez, Vectorized Query and ORCFile support

• Standard SQL authorization, integration with Apache Ranger

• ACID transactions introduced

• Governance added with Apache Atlas integration

• Hive 2 introduces LLAP and intelligent in-memory caching

[Timeline: 2010 – 2016]

A SQL data warehouse infrastructure that delivers fast, scalable SQL processing on Hadoop and in the Cloud

• Extensive SQL:2011 Support
• Compatible with every major BI Tool
• Proven at 300+ PB Scale

Speaker notes:
I want to focus for a minute on one area of how Hadoop has developed. Apache Hive has participated in that move from batch to interactive, from ETL-only to enterprise-ready EDW.
Page 6: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates


Hive 2 with LLAP: Architecture Overview

[Architecture diagram: ODBC/JDBC SQL queries arrive at HiveServer2 (the query endpoint), which hands them to query coordinators. Coordinators drive LLAP daemons running in a YARN cluster; each LLAP daemon hosts query executors and an in-memory cache. Deep storage is HDFS, S3, or other HDFS-compatible filesystems.]
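To make the query path concrete, below is a minimal sketch of submitting SQL to a HiveServer2 endpoint from Python using PyHive, which speaks the same Thrift protocol the ODBC/JDBC drivers use. The host, username, and the sales table are hypothetical placeholders.

```python
# Minimal sketch: run SQL against a Hive 2 / LLAP cluster via HiveServer2.
# Host, username, and the "sales" table are hypothetical placeholders.
from pyhive import hive

conn = hive.connect(host="hiveserver2.example.com", port=10000,
                    username="analyst", database="default")
cursor = conn.cursor()

# A windowing query of the kind Hive gained during the Stinger work.
cursor.execute("""
    SELECT store_id, sale_date,
           SUM(amount) OVER (PARTITION BY store_id
                             ORDER BY sale_date) AS running_total
    FROM sales
""")
for row in cursor.fetchall():
    print(row)
conn.close()
```

Whether the executors and cache behind the endpooint are LLAP daemons or plain Tez containers is transparent to the client.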

Page 7: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates


Hive 2 with LLAP: 25+x Performance Boost

[Chart: per-query comparison of Hive 1 / Tez time (s) vs Hive 2 / LLAP time (s), with speedup (x factor); query time 0–250 s on one axis (lower is better), speedup 0–50x on the other.]

Hive 2 with LLAP averages 26x faster than Hive 1

Page 8: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates


What’s new in Spark 2.0?

API Improvements
– SparkSession – new entry point
– Unified DataFrame & DataSet API
– Structured Streaming / Continuous Applications

Performance Improvements
– Tungsten Phase 2 – whole-stage code generation

ML
– ML model persistence
– Distributed R algorithms (GLM, Naïve Bayes, K-Means, Survival Regression)

SparkSQL
– SQL:2003 support (new ANSI SQL parser, subquery support)

Speaker notes:
Spark is the Swiss Army knife of big data: it can do streaming, SQL, ETL, and ML, available from multiple languages (Python, Java, Scala).
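A minimal sketch of the SparkSession entry point and the unified DataFrame/SQL API; the input path is a hypothetical placeholder.

```python
# Minimal sketch of Spark 2.0's unified entry point.
# The input path is a hypothetical placeholder.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark2-demo")
         .getOrCreate())

# DataFrames and SQL share one session; no separate SQLContext/HiveContext.
df = spark.read.json("s3a://example-bucket/events/")
df.createOrReplaceTempView("events")
spark.sql("SELECT event_type, count(*) FROM events GROUP BY event_type").show()

spark.stop()
```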
Page 9: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates


Page 10: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates


How to Secure and Govern Access to Your Data?

[Diagram: entities in the data lake – streams, pipelines, feeds, Hive tables, HDFS files, HBase tables – on one side and policies – classification, prohibition, time, location – on the other, joined by a question mark.]

Speaker notes:
So the enterprise has invested in integrating Hadoop into its data lake architecture, landing petabytes of data from streams, pipelines, and data feeds into HDFS files, Hive and HBase tables, etc. The question arises of how we can set up policies for these data sets that enable us to secure and govern access to them. [NEXT]
Page 11: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates


Secure and Govern Your Data with Tag-Based Access Policies

[Diagram: Atlas tracks metadata and lineage in its metastore (tags, assets, entities) for the data lake – streams, pipelines, feeds, Hive tables, HDFS files, HBase tables. Ranger manages access policies (classification, prohibition, time, location) and audit logs; its policy decision point (PDP) keeps a resource cache and, as an Atlas client subscribed to the Atlas topic, gets metadata updates.]

Speaker notes:
The community has been hard at work on integrating Apache Atlas as a metadata catalog and Apache Ranger as the centralized security system to address this need. The result is a tag-based authorization model driven by the metadata catalog (i.e., Atlas) with access and audit policies applied to those tags (via Ranger). This enables a more flexible way to govern access to data and data sets than traditional role/group-based access policies. For example, as data pipelines land data, they can tag it as it lands, and the access policies set up for those tags immediately apply. Moreover, Ranger has added the notions of time-based and location-based access policies, so users can do things like limit access to data that is older than 90 days, or limit access to data from certain geographies. This provides important enterprise-focused capabilities that will help businesses deploy more modern data applications with the confidence that their data is secure and well-governed. [NEXT]
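As a hedged illustration of the tagging half of that flow, the sketch below attaches a classification (tag) to an entity through Atlas's v2 REST API; the host, credentials, entity GUID, and tag name are assumptions for illustration.

```python
# Hedged sketch: attach a classification (tag) to a data asset in Apache Atlas,
# so that Ranger tag-based policies bound to that tag apply to the asset.
# Host, credentials, entity GUID, and tag name are hypothetical placeholders.
import requests

ATLAS_URL = "http://atlas.example.com:21000"
entity_guid = "7a9b-example-guid"  # GUID of, e.g., a Hive table entity in Atlas

resp = requests.post(
    f"{ATLAS_URL}/api/atlas/v2/entity/guid/{entity_guid}/classifications",
    json=[{"typeName": "PII"}],   # classification type must already exist in Atlas
    auth=("admin", "admin-password"),
)
resp.raise_for_status()
# Ranger picks up the new tag via the Atlas topic and enforces whatever
# access policy is bound to the "PII" tag, with no per-resource policy edits.
```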
Page 12: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates


Data In Motion

[Diagram: data moves from SOURCES (constrained, high-latency, localized context) through REGIONAL INFRASTRUCTURE to CORE INFRASTRUCTURE (hybrid cloud/on-premises, low-latency, global context).]

Speaker notes:
TALK TRACK: People are no longer willing to wait until data is in the store before processing it. Hortonworks DataFlow is a platform for data in motion. It is powered by Apache NiFi, Kafka, and Storm for dataflow management and stream processing. MiNiFi/NiFi create dynamic, configurable data pipelines. Kafka supports adaptation to differing rates of data creation and delivery. Storm provides real-time stream processing to create immediate insights at massive scale. There are scenarios where NiFi will provide all that you need – especially in situations that only require dataflow management – but the orange and blue horizontal triangles provide a continuum of capability from edge to core, indicating varying degrees of need for the different products.
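As a hedged sketch of the Kafka leg of such a pipeline – absorbing data produced at the edge at whatever rate it arrives – the snippet below uses the kafka-python client; the broker address and topic name are assumptions.

```python
# Hedged sketch: publish edge events into Kafka, which buffers differing
# rates of data creation and delivery between producers and consumers.
# Broker address and topic name are hypothetical placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker.example.com:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("edge-sensor-readings", {"sensor_id": 42, "temp_c": 21.7})
producer.flush()  # block until buffered records are delivered
```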
Page 13: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates


Our Hadoop Journey: From the Data Center to the Cloud!

[Timeline: 2006 → Today]

Speaker notes:
So after 10 years, the Hadoop ecosystem is available everywhere: in the data center, within appliances, and across public and private clouds. This maximizes choice for people interested in getting started with Hadoop and deploying it at scale for transformational use cases. [NEXT]
Page 14: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates


Why Hadoop in the Cloud?

Unlimited Elastic Scale

Ephemeral & Long-Running

IT & Business Agility

No Upfront HW Costs

Speaker notes:
This session is focused on Hadoop in the cloud, so let’s start by answering the obvious question of why Hadoop in the cloud? The cloud saves time and money: Deploying Hadoop on-premises still requires servers and skilled Hadoop experts to set up, tune, and maintain them. A cloud service lets you spin up a Hadoop cluster in minutes without up-front costs. The cloud is flexible and scales fast: you pay only for the compute and storage you use, when you use it. Spin up a Hadoop cluster, analyze your data, then shut it down to stop the meter. The cloud makes you nimble: Create a Hadoop cluster in minutes–and add nodes on-demand. The cloud offers organizations immediate time to value and unlimited elastic scale. [NEXT]
Page 15: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates


Key Architectural Considerations for Hadoop in the Cloud

Shared Data & Storage

On-Demand Ephemeral Workloads

Elastic Resource Management

Shared Metadata, Security & Governance

Speaker notes:
While there are a range of great choices in the market today, there is more that we, as a community, can and should do to make Hadoop in the cloud better and first class. I'll spend the remainder of this talk on the key architectural considerations. Shared Data & Storage: the shared data lake is on cloud storage, not HDFS, and memory and local storage play a different role – that of caching. An important distinction in the cloud is On-Demand Ephemeral Workloads; this changes a number of things in fundamental ways. Shared Metadata, Security, and Governance remain important but need to be adjusted in the face of ephemeral clusters. And finally, I'll touch on Elastic Resource Management: we need to shift our thinking away from cluster resource management and more towards SLA-driven workloads. [NEXT]
Page 16: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates


Shared Data and Storage

Understand and Leverage Unique Cloud Properties
• Shared data lake is cloud storage accessible by all apps
• Cloud storage segregated from compute
• Built-in geo-distribution and DR

Focus Areas
• Address cloud storage consistency and performance
• Enhance performance via memory and local storage

Speaker notes:
In the cloud, the shared data lake is on cloud storage. It is not the HDFS of a specific Hadoop cluster. Note this is very different from a traditional on-premises cluster, where each cluster has an internal shared store representing its internal data lake. Moreover, it is desirable to have this shared data be accessible by all apps, not just Hadoop apps – cloud-native and third party. Good news: geo-distribution is built into cloud storage, and DR becomes simpler. Cloud storage has two limitations: eventual consistency, and an API that does not match the filesystem API expected by Hadoop and normal apps. Addressing these two issues is a key area of ongoing investment; I encourage you to attend today's breakout session by my fellow Hortonworkers that is focused on this topic. Cloud storage is designed for low cost and scale; unfortunately performance is not its strong point, due to segregation from compute. Memory and local storage play a different role in the cloud – a cache to enhance performance. [NEXT]
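A hedged sketch of the shared-data-lake pattern: compute reads directly from cloud object storage through Hadoop's s3a connector rather than from cluster-local HDFS. The bucket and path are assumptions.

```python
# Hedged sketch: an ephemeral cluster reading the shared data lake directly
# from cloud object storage (s3a://) instead of cluster-local HDFS.
# Bucket and path are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cloud-data-lake").getOrCreate()

# Any cluster (or any non-Hadoop app with an S3 client) can address this data.
df = spark.read.parquet("s3a://example-datalake/sales/2016/")
df.groupBy("region").count().show()
```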
Page 17: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates


Enhance Performance via Caching

Tabular Data: LLAP Read + Write-Through Cache
• Shared across jobs / apps and across engines
• Cache only the needed columns
• Spills to SSD when memory is full (anti-caching)
• Read & write-through cache
• Security: column-level and row-level

HDFS Caching for Non-tabular Data
• Cache data from cloud storage as needed
• Write-through cache

[Diagram: workloads access LLAP R/W tables and HDFS files through a cache layer in front of cloud storage.]

Speaker notes:
With respect to caching we need to consider both tabular data and non-tabular data. For tabular data, LLAP comes to the rescue: it provides a tabular cache that is shared across jobs, apps, and engines such as Hive and Spark. LLAP only caches the needed columns, so it is very efficient in its use of memory. Further, data is stored in an internal serialized form to optimize compute. The design center is anti-caching: put it all in memory and spill to disk/SSD when memory is full. LLAP currently provides read caching, but is being extended to support a write-through cache. And LLAP addresses a key security gap for the Hadoop ecosystem: it provides a convenient place to enforce column-level and row-level access control that works across all kinds of engines – Hive, Spark, Flink, or even old-fashioned MapReduce. Note this was not previously possible. From a non-tabular data perspective, HDFS can be used to cache cloud data – both as a read cache and a write-through cache. This essentially evolves HDFS to play a different role: a place to store intermediate data, and a finely-tuned caching layer between the applications and the cloud storage. [NEXT]
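On the non-tabular side, HDFS already ships a centralized cache facility, managed with the `hdfs cacheadmin` CLI, which is one building block for the caching role described here; a hedged sketch, with pool name and path as placeholders, wrapped in Python for consistency with the other examples.

```python
# Hedged sketch: pin hot data into HDFS's centralized cache using the
# `hdfs cacheadmin` CLI. Pool name and path are hypothetical placeholders.
import subprocess

# Create a cache pool, then direct HDFS to keep this dataset in memory.
subprocess.run(["hdfs", "cacheadmin", "-addPool", "hot-data"], check=True)
subprocess.run(["hdfs", "cacheadmin", "-addDirective",
                "-path", "/staging/sales/2016", "-pool", "hot-data"], check=True)
```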
Page 18: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates


Prescriptive On-Demand Ephemeral Workloads

[Diagram: on-demand ephemeral workloads – data science, ETL, warehouse, search – each provisioned with its own R/W tables and compute fabric.]

Speaker notes:
Always-on multitenant clusters are important for a range of mission-critical use cases. However, bringing forth an ephemeral cluster to support a specific workload is game changing. The agile nature of the cloud allows us to create prescriptive workload environments. Someone interested in modeling and analyzing data sets simply wants to interact with a pre-tuned environment optimized for the application; the complexities of configuring Spark, Hive, and Hadoop need to be hidden under the hood. Whether it is data science, data warehouse, ETL, or other common workload types, provide pre-configured and pre-tuned compute environments, and further, be able to manage them in an ephemeral fashion. The net: deliver user experiences that are focused on business agility rather than infinite configurability and cluster management. [NEXT]
Page 19: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates


Shared Data Requires Shared Metadata, Security, and Governance

Shared Metadata Across All Workloads

Metadata considerations
– Tabular data metastore
– Lineage and provenance metadata
– Pipeline and job management metadata
– Add upon ingest
– Update as processing modifies data

Access / tag-based policies and audit logs

Centrally stored to facilitate use across clusters
– Ex. backed by Cloud RDS (or shared DB)

[Diagram: shared metadata and policies – classification, prohibition, time, location – spanning streams, pipelines, feeds, tables, files, and objects.]

Speaker notes:
So far I have covered shared data and storage and how to optimize performance by caching. Shared data fundamentally requires a shared approach to metadata, security, and governance. The metadata is not just the classic Hive metadata that describes the tabular data; it is also about storing and tracking the lineage and provenance of data, and about details related to data pipeline processing and job management. Tabular data needs to be available to all applications so that SQL is an option regardless of where your data is. Also, as data is ingested and processed, metadata needs to be created and adjusted. Governance and securing the data remain critical, and their metadata needs to be managed across all workloads. The work done by projects such as Ranger and Atlas needs to evolve to fit the cloud environment; if we don't do this, the cloud will not be adopted aggressively for enterprise use. Getting back to shared metadata: each ephemeral cluster cannot have its own private copy of the metadata. In the cloud world, metadata must be centrally stored so it can be used across all ephemeral clusters. [NEXT]
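A hedged sketch of that last point: an ephemeral cluster attaching its engine to a long-lived shared Hive metastore (for example, one backed by a cloud RDS) rather than keeping a private local one. The metastore URI is an assumption.

```python
# Hedged sketch: an ephemeral cluster attaching to a shared, long-lived
# Hive metastore so table definitions outlive the cluster itself.
# The metastore URI is a hypothetical placeholder.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ephemeral-etl")
         .config("hive.metastore.uris",
                 "thrift://shared-metastore.example.com:9083")
         .enableHiveSupport()
         .getOrCreate())

# Tables created here are visible to every other cluster using this metastore.
spark.sql("SHOW TABLES").show()
```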
Page 20: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates


Elastic Resource Management in Context of Workload

Workload Management vs. Cluster Management
• Understand resource needs of different workload types
• Add / remove resources to meet workload SLAs
• Manage compute power and high-performance data access (e.g., LLAP)
• Pricing-aware: instances (spot, reserved), data, bandwidth

Speaker notes:
Final area: resource management. We up-level resource management: so far, YARN has focused on optimizing resources in the context of a cluster. The cloud is not about the cluster; it is about the workloads, and further, resources are elastic. The scheduler needs to change its focus to managing resources in the context of a workload and meeting the workload's SLA. It may need to get extra resources from the cloud, and get the right resources to match the needs of the workload. Sometimes adding compute power is not sufficient to meet an SLA, because latency/bandwidth to data may be the bottleneck; e.g., spin up LLAP memory in order to improve caching and hence meet the SLA. The cloud offers another dimension: that of cost and budgets. There are different costs tied to CPU, memory, and data access bandwidth, so elasticity and spot-pricing tradeoffs should be factored in. Resource management in this new dimension is important if you want to reap the benefit of low-cost cloud computing. To summarize: the better one understands the nature of a workload, the more we are able to take advantage of elasticity and spot pricing. CONCLUDE: While one could lift and shift Hadoop onto the cloud, I hope I have convinced you that we really need to evolve Hadoop to run first class in the cloud and to take advantage of unique cloud features such as elasticity. We at Hortonworks have been working on this over the past few months, and Ram will show you a quick demo of the tech preview we are releasing this week. [NEXT]
Page 21: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates


Transformational Applications Require Connected Data

[Diagram: edge data flows as data in motion into both the CLOUD (edge analytics, machine learning, data at rest) and the DATA CENTER (stream analytics, deep historical analysis, data at rest), with the two environments connected.]

Speaker notes:
Today we have talked about evolving Hadoop to run well in the cloud. At Hortonworks, we are focused on enabling a connected data architecture that seamlessly spans the cloud and the data center, as illustrated on the screen. It stresses two important points: the connectedness of the cloud and the on-premises infrastructure and data, and the connectedness of data in motion and data at rest. The era of the Internet of Things demands that we manage the entire lifecycle of all data, both data in motion and data at rest. It is about being able to collect and curate data across traditional silos so the various groups and lines of business can have a place where they can assemble a single view of data in order to drive deep historical insights. It is also about proactively managing data from its point of inception and securely acquiring and delivering it. Moreover, it is not just about point-to-point delivery; it is also about enabling bi-directional data flows that can leverage both real-time and historical insights to help shape and prioritize the flow of data. In this diagram, for example, the upper-left edge could represent the connected car, whereas the lower-left edge could represent data from the manufacturing line. Having a connected data architecture that lets you deal with all of this data unlocks the ability to figure out, for example, which manufacturing-line issues may be causing operational issues in cars in the field. In this world of next-generation applications, I am excited about evolving the Hadoop ecosystem to enable these types of use cases and usage models. [NEXT SLIDE]
Page 22: The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates

Thank You

Speaker notes:
THANK YOU AND HAVE A GREAT CONFERENCE!