Simplifying Hadoop: A Secure and Unified Data Access Path for Compute Frameworks
Nong Li | Lenni Kuff | Stephen Romanoff
© Cloudera, Inc. All rights reserved.


Post on 08-Jan-2018



TRANSCRIPT

Slide 1. Simplifying Hadoop: A Secure and Unified Data Access Path for Compute Frameworks
Nong Li | Lenni Kuff | Stephen Romanoff
Cloudera, Inc. All rights reserved.

Slide 2. Introducing RecordService
Nong Li | Lenni Kuff | Stephen Romanoff

Slide 3. Introducing RecordService
RecordService is a distributed, scalable data access service for unified authorization in Hadoop.

Slide 4. Motivation
- As the Hadoop ecosystem expands, new components continue to be added.
  - This speaks to the overall flexibility of Hadoop.
  - This is good: more functionality, more workloads, more use cases.
- As use cases for Hadoop mature, user requirements and expectations increase:
  - Security
  - Performance
  - Compatibility
- The flexibility of Hadoop has come at the cost of increased complexity.

Slides 5-6. [Diagrams: Storage and Compute]

Slide 7. Example: Security
- Challenge: provide unified, fine-grained security across compute frameworks.
  - Integrating a consistent security layer into every component is not scalable.
  - Securing data at the file level precludes fine-grained access control (column/row).
  - File ACLs are not enough: a user can view all or nothing.
  - Currently, files must be split and data duplicated, at a large operational cost.
- Solution: add a level of abstraction, a secure service to access datasets in record format.
  - Fine-grained constraints can now be applied on a projection of the dataset.
  - The same access control policy can be applied uniformly across compute frameworks, uncoupled from the underlying storage layer.

Slide 8. Example: Security
- How to provide unified access control across compute frameworks?
  - Securing data at the file level precludes fine-grained access control (column/row).
  - File ACLs are not enough: a user can view all or nothing.
  - Currently, files must be split and data duplicated, at a large operational cost.
- Solution: add a layer of abstraction, a secure service to access datasets in record format.
  - Fine-grained constraints can now be applied on a projection of the dataset.
  - The same access control policy can be applied uniformly across compute frameworks, uncoupled from the underlying storage layer.

Slide 9. Introducing RecordService

Slide 10. Architecture Summary
- Simplifies
  - Provides a higher-level, logical abstraction for data (i.e. tables or views).
  - Returns schemed objects (instead of paths and bytes).
  - No need for applications to worry about storage APIs and file formats.
  - HCatalog? Similar concept, but RecordService is secure and performant. The plan is to support HCatalog as a data model on RecordService.
- Secures
  - Secure service that does not execute arbitrary user code.
  - Central location for all authorization checks, using Sentry metadata.
- Accelerates
  - A unified data access path allows platform-wide performance improvements.

Slide 11. Transition
Nong starts here?

Slide 12. Architecture
[Diagram]

Slide 13. Architecture
- Runs as a distributed service: Planner Servers and Worker Servers.
  - Servers do not store any state.
  - Easy HA and fault tolerance.
- Planner Servers are responsible for request planning.
  - Retrieve and combine metadata (NN, HMS, Sentry).
  - Split generation: creates tasks for workers.
  - Perform authorization.
- Worker Servers read from storage and construct records.
  - IO, file parsing, predicate evaluation.
  - Run as the source for a DAG computation.
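The planner/worker division of labor can be pictured with a small sketch. Everything here is hypothetical and for illustration only: `Split`, `Task`, and `ToyPlanner` are invented names, not RecordService classes, and the allow-list stands in for the Sentry lookup. The point it tries to show is that authorization happens once, at planning time, and that each resulting task carries its own locality hints.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical types for illustration only; not the RecordService API.
final class Split {
    final String path;         // file this split belongs to
    final List<String> hosts;  // nodes that hold a local replica

    Split(String path, List<String> hosts) {
        this.path = path;
        this.hosts = Collections.unmodifiableList(new ArrayList<>(hosts));
    }
}

final class Task {
    final String request;  // the original request, repeated in every task
    final Split split;     // the slice of data this task scans

    Task(String request, Split split) {
        this.request = request;
        this.split = split;
    }

    // Locality hint a scheduler could use to place the task near its data.
    List<String> preferredHosts() {
        return split.hosts;
    }
}

final class ToyPlanner {
    // Assumption: a plain allow-list stands in for the real authorization metadata.
    static List<Task> plan(String user, String request,
                           Set<String> allowedUsers, List<Split> splits) {
        // Authorize once, before any task exists.
        if (!allowedUsers.contains(user)) {
            throw new SecurityException("user " + user + " may not run: " + request);
        }
        // One task per split; no state is kept on the planner afterwards.
        List<Task> tasks = new ArrayList<>();
        for (Split split : splits) {
            tasks.add(new Task(request, split));
        }
        return tasks;
    }
}
```

Because the planner returns everything a worker needs inside the task, neither side has to remember anything between calls, which is what makes failover straightforward.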
Slide 14. Architecture: Server APIs
- Planner and Worker services expose Thrift APIs: PlanRequest(), Exec(), Fetch().
- PlanRequest()
  - Accepts SQL to specify the request: supports SELECT and PROJECT.
  - Gives access to tables and views stored in HMS.
  - Does not run operators that require data exchange; map only.
  - Generates a list of tasks which contain the request, each with locality.
- Exec()/Fetch()
  - Return records in a canonical, optimized columnar format.

Slide 15. Architecture: Fault tolerance
- Cluster state is persisted in ZK.
  - Membership, delegation tokens, secret keys.
  - Servers do not communicate with each other directly, which helps scalability.
- Planner services
  - Expected to run a few (e.g. 3) for HA.
  - Fault tolerance is handled by clients getting a list of planners and failing over.
  - Plan requests are short.
- Worker services
  - Expected to run on each node in the cluster that holds data.
  - Fault tolerance is handled by the framework (e.g. MR) rescheduling the task.

Slide 16. Architecture: Security
- Authentication uses Kerberos and delegation tokens.
- The planner authorizes the request using metadata in Sentry.
  - Column-level ACLs.
  - Row-level ACLs: create a view with a predicate.
  - Masking: create a view with the masking function in the select list.
- Tasks generated by the planner are signed with a shared key.
- The worker runs generated tasks.
  - It does not authorize; it relies on signed tasks.
  - It runs as a user with full access to data and does not run user code.

Slide 17. Architecture: Security example

    CREATE VIEW v AS
    SELECT mask(credit_card_number) AS ccn, name, balance, region
    FROM data
    WHERE region = 'Europe'

1. Restrict access to the data set: disable access to the data table and the underlying files in HDFS.
2. Give access by creating the view, v.
3. Set column-level permissions on v per user if necessary.

The write path (ingest) is unchanged. The job is expected to run as a privileged user.
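The signed-task handoff from the security slide (planner signs, worker verifies and does no authorization of its own) can be sketched as follows. This is an illustration under stated assumptions, not RecordService code: the deck only says tasks are "signed with a shared key", so HMAC-SHA256 is an assumed choice, and `TaskSigner` is a hypothetical class name. It uses only the JDK's `javax.crypto.Mac`.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper for illustration; not part of RecordService.
final class TaskSigner {
    // Assumption: the deck does not name the signing algorithm.
    private static final String ALGORITHM = "HmacSHA256";

    private final SecretKeySpec key;

    TaskSigner(byte[] sharedKey) {
        this.key = new SecretKeySpec(sharedKey, ALGORITHM);
    }

    // Planner side: compute a signature over the serialized task bytes.
    byte[] sign(byte[] serializedTask) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(key);
            return mac.doFinal(serializedTask);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("cannot sign task", e);
        }
    }

    // Worker side: recompute the signature and compare in constant time.
    // A task that fails this check was not produced by a planner holding the key.
    boolean verify(byte[] serializedTask, byte[] signature) {
        return MessageDigest.isEqual(sign(serializedTask), signature);
    }
}
```

In this sketch a worker would call `verify` before executing anything, so a client cannot edit a task (for example, to swap the view `v` for the underlying table) without invalidating the signature.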
Slide 18. Client APIs: Integration with ecosystem
- Similar APIs designed to integrate with MapReduce and Spark.
- Client APIs make things simpler. Clients don't need to:
  - Interact with HMS.
  - Care about the underlying storage format: the worker always returns records in a canonical format.
  - Care about storage engine details (e.g. S3).

Slide 19. Client Integration APIs
- Drop-in replacements for common existing InputFormats.
  - Text, Avro.
- Can be used with Spark as well.
- Spark SQL: integration with the Data Sources API.
  - Predicate pushdown, projection.
- Migration should be easy.

Slide 20. MR Example

    // Before: read Avro files directly from a path.
    //FileInputFormat.setInputPaths(job, new Path(args[0]));
    //job.setInputFormatClass(AvroKeyInputFormat.class);

    // After: read the table named by args[0] through RecordService.
    RecordServiceConfig.setInputTable(configuration, null, args[0]);
    job.setInputFormatClass(
        com.cloudera.recordservice.avro.mapreduce.AvroKeyInputFormat.class);

Slide 21. Spark Example

    // Comment out one or the other
    val file = sc.recordServiceTextFile(path)
    //val file = sc.textFile(path)

Slide 22. Spark SQL Example

    ctx.sql(s"""
      |CREATE TEMPORARY TABLE $tbl
      |USING com.cloudera.recordservice.spark.DefaultSource
      |OPTIONS (
      |  RecordServiceTable '$db.$tbl',
      |  RecordServiceTableSize '$size'
      |)
      """.stripMargin)

Slide 23. Performance
- Shares some core components with Impala.
  - IO management, optimized C++ code, runtime code generation, use of low-level storage APIs.
  - Highly efficient implementation of the scan functionality.
- Optimized columnar on-wire format.
  - Inspired by Apache Parquet.
- Accelerates performance for many workloads.

Slide 24. Terasort
- Roughly a worst-case scenario.
  - Minimal schema: a single STRING column.
- Custom RecordServiceTeraInputFormat (similar to TeraInputFormat).
- 78-node cluster (12 cores/24 Hyper-Threaded, 12 disks).
- Ran at 1 billion, 50 billion and 1 trillion (~100TB) scales.
- See the GitHub repo for more details and runnable examples.
Slide 25. TeraChecksum
[Chart]

Slide 26. Spark SQL
- Represents a more expected use case.
  - Data is fully schemed.
- TPC-DS
  - 500GB scale factor, on Parquet.
- Cluster
  - 5-node cluster.

Slide 27. Spark SQL
~15% improvement in query times; queries are not scan bound.

Slide 28. Spark SQL
[Chart]

Slide 29. RecordService at CapOne
- Early development partner.
- Let's hear their use case.

Implementing Record Service in Capital One's Data Lake
Stephen Romanoff, Director, Data Management

Slide 31. Capital One at a glance

Slide 32. We are building the technology foundation to ensure our analytics leadership
- Key Objectives
  - Build an analytics architecture centered on a Hadoop-based Enterprise Data Hub.
  - Provide state-of-the-art analytical tools and unconstrained data storage and processing.
  - Empower our associates to dream and disrupt.
- Delivery Principles
  - Fast prototyping, scaled agile delivery.
  - Smaller, cross-functional teams.
  - Collaboration, leveraging the power of open source.

Slide 33. Data duplication was the only way to meet fine-grained access control needs
[Diagram: in HDFS, the original source file is duplicated with horizontal and vertical filters into LOB A, LOB B, affiliate, NPI, non-NPI and other splits. SQL access: Impala, Hive. Non-SQL access: MapReduce, Spark, Pig.]
- Data is co-mingled: multiple business lines, affiliates, NPI, credit.
- Users need access for both SQL and non-SQL processing.
- We need to provide fine-grained controls for all types of access.
- Duplicating data was the only option.

Slide 34. Sentry + Record Service provides us fine-grained access controls across Hadoop compute frameworks
[Diagram: Pig, MR and Spark reach HDFS through Sentry + Record Service, using the Hive Metastore; a table is exposed as View 1, View 2, ... View n.]
- No data duplication.
- Fine-grained access controls for SQL, Pig, MR and Spark processing.
- Optimized IO scanners provide high performance.
- Abstraction from the physical storage of data.
- Existing applications migrated with minor code changes.
Slide 35. State of the project
- Available for beta already.
  - Integration with Spark and MR.
  - Pig soon (via HCatalog).
- Looking into other compute abstractions, e.g. Crunch.
  - More InputFormat support.
- Need your help!
  - We'll continually refresh the beta, in particular the client libraries.
- Apache 2.0 licensed.
  - Intent to donate to the Apache Software Foundation.

Slide 36. Conclusion
- RecordService provides a schemed data access service for Hadoop.
  - Logical data access instead of physical.
  - A much more powerful abstraction.
- Demonstrated security enforcement and improved performance.
- Simpler: clients don't need to worry about low-level details such as storage APIs and file formats.
- Opens the door for future improvements.

Slide 37. Contributing!
- Mailing list
- Discussion forum: http://community.cloudera.com/t5/Beta-Releases/bd-p/Beta
- Contributions
- Documentation
- Bug reporting: open a GitHub issue
- Beta download: ...rvice/0-1-0.html

Slide 38. Thank you