The Enterprise and Connected Data: Trends in the Apache Hadoop Ecosystem
Alan Gates, Co-Founder, Hortonworks
@alanfgates
Speaker notes:
Speaker: Alan Gates, Hortonworks Co-Founder. Title: 10 Years of Apache Hadoop and Beyond. Duration: 40 minutes. Abstract: In 2006, Apache Hadoop had its first line of code committed to what has become a breakthrough technology. A decade later, we are witness to open source innovation that has literally changed the face of business. Hadoop and related technologies have become the enterprise data platform, fueled by a rich ecosystem capable of supporting any application, any data, anywhere. Join Hortonworks Co-Founder Alan Gates as he drills down into the current and future state of Hadoop and reviews community initiatives aimed at enabling the next wave of modern data applications that are well governed and easy to deploy on-premises and in the cloud.
Our Hadoop journey began in 2006 focused on executing batch MapReduce jobs on petabytes of data. Yahoo’s decision to contribute Hadoop to the Apache Software Foundation was critical because a vibrant set of related technologies began to appear around Hadoop. [NEXT]
Fast forward to 2011 and the concept of YARN began to emerge. Its goal? Enable Hadoop to move from its batch-only roots and become a data platform capable of running batch, interactive, and real-time applications. The emergence of YARN further accelerated the innovation around Hadoop with the emergence of Spark, Kafka, Storm, and many other projects that started life as Apache Incubator proposals. [NEXT]
• HiveServer2 adds ODBC/JDBC
• SQL breadth expands with windowing and more
• Apache Tez enters incubation
• Hive 0.13 marks delivery of the Stinger Initiative with Tez, Vectorized Query, and ORCFile support
• Standard SQL authorization, integration with Apache Ranger
• ACID transactions introduced
• Governance added with Apache Atlas integration
• Hive 2 introduces LLAP and intelligent in-memory caching
Timeline: 2010–2016
A SQL data warehouse infrastructure that delivers fast, scalable SQL processing on Hadoop and in the Cloud
• Extensive SQL:2011 support
• Compatible with every major BI tool
• Proven at 300+ PB scale
Speaker notes:
I want to focus for a minute on one area of how Hadoop has developed. Apache Hive has participated in that move from batch to interactive, from ETL-only to an enterprise-ready EDW.
So the enterprise has invested in integrating Hadoop into its data lake architecture, landing petabytes of data from streams, pipelines, and data feeds into HDFS files, Hive and HBase tables, etc. The question arises of how we can set up policies for these data sets that enable us to secure and govern access to them. [NEXT]
Secure and Govern Your Data with Tag-Based Access Policies
(Diagram: Apache Ranger manages access policies (classification, prohibition, time, location) and audit logs, served from a PDP resource cache. Apache Atlas tracks metadata and lineage; Atlas clients subscribe to a topic and get metadata updates from the metastore, where tags entitle assets. Entities in the data lake include streams, pipelines, feeds, Hive tables, HDFS files, and HBase tables.)
Speaker notes:
The community has been hard at work on integrating Apache Atlas as a metadata catalog and Apache Ranger as the centralized security system to address this need. The result is a tag-based authorization model driven by the metadata catalog (i.e., Atlas) with access and audit policies applied to those tags (via Ranger). This enables a more flexible way to govern access to data and data sets than traditional role/group-based access policies. For example, as data pipelines land data, they can tag it as it lands, and the access policies set up for those tags immediately apply. Moreover, Ranger has added the notions of time-based and location-based access policies, so users can do things like limit access to data that's older than 90 days, or limit access to data from certain geographies. This provides important enterprise-focused capabilities that will help businesses deploy more modern data applications with the confidence that their data is secure and well governed. [NEXT]
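The tag-based model described above can be sketched in a few lines of Python. This is an illustrative toy, not Ranger's actual API: the tag names, the policy fields, and the 90-day retention rule are all assumptions chosen to mirror the examples in the notes (policies attach to tags, so they apply the moment a pipeline tags newly landed data).

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class Asset:
    name: str
    tags: set = field(default_factory=set)
    created: datetime = field(default_factory=datetime.utcnow)

# Policies are keyed by tag, not by per-resource user/group grants,
# so newly tagged data is governed immediately. All names are hypothetical.
POLICIES = {
    "PII": {"allowed_users": {"analyst_team"}},
    # Time-based rule from the notes: deny access to data older than 90 days.
    "RETENTION_90D": {"max_age_days": 90},
}

def is_allowed(user: str, asset: Asset, now: datetime) -> bool:
    for tag in asset.tags:
        policy = POLICIES.get(tag)
        if policy is None:
            continue
        allowed = policy.get("allowed_users")
        if allowed is not None and user not in allowed:
            return False
        max_age = policy.get("max_age_days")
        if max_age is not None and now - asset.created > timedelta(days=max_age):
            return False
    return True
```

The key design point, as in the talk, is that ingest only needs to assign tags; no policy edits are required per data set.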
Hybrid (cloud/on-premises) • Low latency • Global context
(Diagram: sources feed regional infrastructure, which feeds core infrastructure.)
Speaker notes:
TALK TRACK: People are no longer willing to wait until data is in the store before processing it. Hortonworks DataFlow is a platform for data in motion. It is powered by Apache NiFi, Kafka, and Storm for dataflow management and stream processing. MiNiFi/NiFi creates dynamic, configurable data pipelines. Kafka supports adaptation to differing rates of data creation and delivery. Storm provides real-time stream processing to create immediate insights at massive scale. There are scenarios where NiFi will provide all that you need, especially in situations that only require dataflow management, but you will notice the orange and blue horizontal triangles provide a continuum of capability from edge to core, indicating varying degrees of need for the different products.
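The role the notes give Kafka, absorbing the difference between the rate data is created and the rate it is consumed, can be illustrated with a minimal bounded log. This is a toy stand-in, not the Kafka client API; a real broker would persist to disk and apply back-pressure rather than drop records.

```python
from collections import deque

class BoundedLog:
    """Toy stand-in for the buffering role Kafka plays between a fast
    producer (e.g. a NiFi flow) and a slower stream processor."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = deque()
        self.dropped = 0  # signal that the producer outpaced capacity

    def produce(self, record) -> bool:
        if len(self.entries) >= self.capacity:
            self.dropped += 1  # a real broker would block or spill to disk
            return False
        self.entries.append(record)
        return True

    def consume(self, max_batch: int):
        # The consumer pulls at its own pace, in batches it can handle.
        batch = []
        while self.entries and len(batch) < max_batch:
            batch.append(self.entries.popleft())
        return batch
```

The point of the sketch is the decoupling: the producer and consumer never interact directly, so each can run at its own rate.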
Our Hadoop Journey: From the Data Center to the Cloud! (2006 → Today)
Speaker notes:
So after 10 years, the Hadoop ecosystem is available everywhere. In the Data Center, within appliances, across public and private clouds. This maximizes choice for people interested in getting started with Hadoop and deploying it at scale for transformational use cases. [NEXT]
This session is focused on Hadoop in the cloud, so let's start by answering the obvious question: why Hadoop in the cloud? The cloud saves time and money: deploying Hadoop on-premises still requires servers and skilled Hadoop experts to set up, tune, and maintain them. A cloud service lets you spin up a Hadoop cluster in minutes without up-front costs. The cloud is flexible and scales fast: you pay only for the compute and storage you use, when you use it. Spin up a Hadoop cluster, analyze your data, then shut it down to stop the meter. The cloud makes you nimble: create a Hadoop cluster in minutes and add nodes on demand. The cloud offers organizations immediate time to value and unlimited elastic scale. [NEXT]
Key Architectural Considerations for Hadoop in the Cloud
Shared Data & Storage
On-Demand Ephemeral Workloads
Elastic Resource Management
Shared Metadata, Security & Governance
Speaker notes:
While there are a range of great choices in the market today, there's more that we, as a community, can and should do to make Hadoop in the cloud better and first class. I'll spend the remainder of this talk on the key architectural considerations. Shared data & storage: the shared data lake is on cloud storage, not HDFS; also, memory and local storage play a different role, that of caching. An important distinction in the cloud is on-demand ephemeral workloads; this changes a number of things in fundamental ways. Shared metadata, security, and governance remain important but need to be adjusted in the face of ephemeral clusters. And finally, I'll touch on elastic resource management: we need to shift our thinking away from cluster resource management and toward SLA-driven workloads. [NEXT]
Understand and Leverage Unique Cloud Properties
• Shared data lake is cloud storage accessible by all apps
• Cloud storage segregated from compute
• Built-in geo-distribution and DR

Focus Areas
• Address cloud storage consistency and performance
• Enhance performance via memory and local storage
Shared Data & Storage
Speaker notes:
In the cloud, the shared data lake is on cloud storage; it is not the HDFS of a specific Hadoop cluster. Note this is very different from a traditional on-premises cluster, where each cluster has an internal shared store representing its internal data lake. Moreover, it's desirable to have this shared data be accessible by all apps, not just Hadoop apps: cloud-native and third-party ones as well. Good news: geo-distribution is built into the cloud storage, and DR becomes simpler. Cloud storage has two limitations: eventual consistency, and an API that does not match the filesystem API expected by Hadoop and normal apps. Addressing these two issues is a key area of ongoing investment; I encourage you to attend today's breakout session by my fellow Hortonworkers that's focused on this topic. Cloud storage is designed for low cost and scale; unfortunately, performance is not its strong point due to segregation from compute. Memory and local storage play a different role in the cloud: a cache to enhance performance. [NEXT]
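The shifted role of memory and local storage described above, a cache in front of the durable cloud store rather than the system of record, can be sketched as a read-through / write-through wrapper. This is a conceptual sketch with plain dicts standing in for an object store and a local SSD; it is not any real connector's API.

```python
class ReadThroughCache:
    """Sketch: local memory/SSD caches a slower cloud object store.
    The cloud copy remains the durable system of record."""
    def __init__(self, cloud_store: dict):
        self.cloud = cloud_store  # stand-in for an object store such as S3
        self.local = {}           # stand-in for memory / local SSD
        self.cloud_reads = 0      # count of slow-path fetches

    def get(self, key):
        if key in self.local:
            return self.local[key]         # fast path: local cache hit
        self.cloud_reads += 1              # slow path: fetch from cloud
        value = self.cloud[key]
        self.local[key] = value
        return value

    def put(self, key, value):
        # Write-through: update the durable cloud copy and the cache together,
        # so a later cluster restart loses nothing.
        self.cloud[key] = value
        self.local[key] = value
```

Because the cache is disposable and the cloud copy is authoritative, an ephemeral cluster can be torn down at any time without data loss.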
Tabular Data: LLAP Read + Write-Through Cache
• Shared across jobs/apps and across engines
• Cache only the needed columns
• Spills to SSD when memory is full (anti-caching)
• Read & write-through cache
• Security: column-level and row-level
HDFS Caching for Non-Tabular Data
• Cache data from cloud storage as needed
• Write-through cache
(Diagram: workloads read and write through a cache layer (LLAP R/W tables, HDFS files) backed by cloud storage.)
Speaker notes:
With respect to caching, we need to consider both tabular and non-tabular data. For tabular data, LLAP comes to the rescue: it provides a tabular cache that's shared across jobs, apps, and engines such as Hive and Spark. LLAP caches only the needed columns, so it's very efficient in its use of memory. Further, data is stored in an internal serialized form to optimize compute. The design center is anti-caching: put it all in memory and spill to disk/SSD when memory is full. LLAP currently provides read caching, but is being extended to support a write-through cache. And LLAP addresses a key security gap for the Hadoop ecosystem: it provides a convenient place to enforce column-level and row-level access control that works across all kinds of engines, whether Hive, Spark, Flink, or even old-fashioned MapReduce. Note this was not previously possible. From a non-tabular data perspective, HDFS can be used to cache cloud data, as both a read cache and a write-through cache. This essentially evolves HDFS to play a different role: a place to store intermediate data and also a finely tuned caching layer between the applications and the cloud storage. [NEXT]
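The two LLAP ideas highlighted above, caching only the columns a query touches and anti-caching (keep everything in memory, spill to SSD only when the budget is exceeded), can be illustrated with a toy column cache. This is not LLAP's implementation; the eviction policy and the dict standing in for SSD are deliberate simplifications.

```python
class ColumnCache:
    """Toy sketch of LLAP-style column caching with anti-caching:
    whole columns live in memory and spill to an SSD tier when full."""
    def __init__(self, memory_budget: int):
        self.memory_budget = memory_budget  # budget in "values", for simplicity
        self.memory = {}   # (table, column) -> list of values
        self.ssd = {}      # spill target when memory is full

    def _memory_used(self):
        return sum(len(v) for v in self.memory.values())

    def load_column(self, table, column, values):
        # Cache only the requested column, never whole rows.
        key = (table, column)
        while self.memory and self._memory_used() + len(values) > self.memory_budget:
            evicted_key, evicted = self.memory.popitem()  # naive eviction choice
            self.ssd[evicted_key] = evicted               # anti-caching: spill, don't drop
        self.memory[key] = values

    def read(self, table, column):
        key = (table, column)
        if key in self.memory:
            return self.memory[key]
        return self.ssd.get(key)  # slower tier, but still avoids cloud I/O
```

The design point is that eviction spills rather than discards, so a spilled column is still served locally instead of being re-fetched from cloud storage.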
Always-on multitenant clusters are important for a range of mission-critical use cases. However, bringing forth an ephemeral cluster to support a specific workload is game-changing. The agile nature of the cloud allows us to create prescriptive workload environments. Someone interested in modeling and analyzing data sets simply wants to interact with a pre-tuned environment optimized for the application; the complexities of configuring Spark, Hive, and Hadoop need to be hidden under the hood. Whether it's data science, data warehousing, ETL, or other common workload types, we should provide pre-configured and pre-tuned compute environments, and further, we need to be able to manage them in an ephemeral fashion. The net: deliver user experiences that are focused on business agility rather than infinite configurability and cluster management. [NEXT]
Shared Data Requires Shared Metadata, Security, and Governance
Shared Metadata Across All Workloads
Metadata considerations:
– Tabular data metastore
– Lineage and provenance metadata
– Pipeline and job management metadata
– Add upon ingest
– Update as processing modifies data

Access/tag-based policies and audit logs, centrally stored to facilitate use across clusters
– e.g. backed by a cloud RDS (or shared DB)
(Diagram: shared metadata and policies (classification, prohibition, time, location) applied across streams, pipelines, feeds, tables, files, and objects.)
Speaker notes:
So far I have covered shared data and storage and how to optimize performance by caching. Shared data fundamentally requires a shared approach to metadata, security, and governance. The metadata is not just the classic Hive metadata that describes the tabular data; it is also about storing and tracking the lineage and provenance of data, and about details related to data pipeline processing and job management. Tabular metadata needs to be available to all applications so that SQL is an option regardless of where your data is. Also, as data is ingested and processed, metadata needs to be created and adjusted. Governance and securing the data remain critical, and their metadata needs to be managed across all workloads; the work done by projects such as Ranger and Atlas needs to evolve to fit the cloud environment. If we don't do this, the cloud will not be adopted aggressively for enterprise use. Getting back to shared metadata: each ephemeral cluster cannot have its own private copy of the metadata. In the cloud world, metadata must be centrally stored so it is used across all ephemeral clusters. [NEXT]
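The central point above, that metadata must outlive any one ephemeral cluster, can be sketched as a shared catalog object that clusters write into and that survives them. The class and method names are hypothetical illustrations, not the Hive metastore or Atlas APIs; a real deployment would back this with a cloud RDS or shared database, as the slide suggests.

```python
class SharedMetastore:
    """Sketch: one central catalog (e.g. backed by a cloud RDS) that every
    ephemeral cluster reads and writes, instead of per-cluster private copies."""
    def __init__(self):
        self.tables = {}   # table name -> schema
        self.lineage = []  # provenance records: which job made what from what

    def register_table(self, name, schema):
        self.tables[name] = schema

    def record_lineage(self, output, inputs, job):
        self.lineage.append({"output": output, "inputs": inputs, "job": job})

class EphemeralCluster:
    """A short-lived cluster holds only a handle to the shared catalog."""
    def __init__(self, metastore: SharedMetastore):
        self.metastore = metastore  # shared state survives the cluster

    def run_etl(self, job, inputs, output, schema):
        # Metadata is added upon ingest/processing, per the slide bullets.
        self.metastore.register_table(output, schema)
        self.metastore.record_lineage(output, inputs, job)
```

When the first cluster is torn down, a later cluster attached to the same metastore still sees the tables and lineage it produced.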
Final area: resource management. We need to up-level resource management. So far, YARN has focused on optimizing resources in the context of a cluster. The cloud is not about the cluster; it is about the workloads, and further, resources are elastic. The scheduler needs to change its focus to managing resources in the context of a workload and meeting the workload's SLA. It may need to get extra resources from the cloud, and to get the right resources to match the needs of the workload. Sometimes adding compute power is not sufficient to meet an SLA, because latency or bandwidth to data may be the bottleneck; e.g. spin up LLAP memory in order to improve caching and hence meet the SLA. The cloud offers another dimension: that of cost and budgets. There are different costs tied to CPU, memory, and data access bandwidth, so elasticity and spot-pricing tradeoffs should be factored in. Resource management in this new dimension is important if you want to reap the benefits of low-cost cloud computing. To summarize: the better one understands the nature of a workload, the more we are able to take advantage of elasticity and spot pricing. CONCLUDE: While one could lift and shift Hadoop onto the cloud, I hope I have convinced you that we really need to evolve Hadoop to run first class in the cloud and also to take advantage of unique cloud features such as elasticity. We at Hortonworks have been working on this over the past few months, and Ram will show you a quick demo of the tech preview we are releasing this week. [NEXT]
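The SLA-plus-cost reasoning above can be made concrete with a small capacity-planning sketch: size the cluster to meet the deadline, then fill as much of that requirement as possible with cheaper spot capacity. All inputs (work units per node-hour, prices, spot availability) are illustrative assumptions, not a real scheduler's model.

```python
import math

def nodes_needed(work_units, units_per_node_hour, hours_to_sla):
    # Minimum nodes so the workload finishes within its SLA window.
    return math.ceil(work_units / (units_per_node_hour * hours_to_sla))

def plan_capacity(work_units, units_per_node_hour, hours_to_sla,
                  on_demand_price, spot_price, spot_available):
    """Sketch: meet the SLA first, then trade cost by preferring spot nodes.
    Prices are per node-hour; a real planner would also model data bandwidth."""
    needed = nodes_needed(work_units, units_per_node_hour, hours_to_sla)
    spot = min(needed, spot_available)  # cheaper capacity first
    on_demand = needed - spot           # top up with on-demand to hit the SLA
    cost = hours_to_sla * (spot * spot_price + on_demand * on_demand_price)
    return {"nodes": needed, "spot": spot, "on_demand": on_demand, "cost": cost}
```

For example, 100 work units at 5 units per node-hour with a 2-hour SLA needs ceil(100 / 10) = 10 nodes; with 6 spot nodes available, the planner tops up with 4 on-demand nodes.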
Transformational Applications Require Connected Data
Speaker notes:
Today we have talked about evolving Hadoop to run well in the cloud. At Hortonworks, we are focused on enabling a connected data architecture that seamlessly spans the cloud and the data center. This is illustrated on the screen. It stresses two important points: the connectedness of the cloud and the on-premises infrastructure and data, and the connectedness of data in motion and data at rest. The era of the Internet of Things demands that we manage the entire lifecycle of all data, both in motion and at rest. It's about being able to collect and curate data across traditional silos so the various groups and lines of business can have a place where they can assemble a single view of data in order to drive deep historical insights. It's also about proactively managing data from its point of inception and securely acquiring and delivering it. Moreover, it's not just about point-to-point delivery; it's also about enabling bi-directional data flows that can leverage both real-time and historical insights to help shape and prioritize the flow of data. So in this diagram, for example, the upper-left edge could represent the connected car, whereas the lower-left edge could represent data from the manufacturing line. Having a connected data architecture that lets you deal with all of this data unlocks the ability to figure out, for example, which manufacturing-line issues may be causing operational issues in cars in the field. In this world of next-generation applications, I am excited about evolving the Hadoop ecosystem to enable these types of use cases and usage models. [NEXT SLIDE]