fine-grained security for spark and hive
1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Fine-Grained Security for Spark and Hive
Carter Shanklin - Director PM
Don Bosco Durai - Security Architect
June 29, 2016
Agenda
● Current security options and challenges
● Apache Ranger Overview
● LLAP Overview
● Use Cases and Demo
● Apache Atlas Integration
Current Options and Challenges
⬢ Limited to storage-level access control for Spark, Pig and MR
⬢ Column-level access via HiveServer2
⬢ Row-level filtering needs Hive views
– Multiple Hive views need to be created and managed
– Explicit permissions need to be given for each view/user
– Users need to know which view to use
⬢ Masking needs a custom UDF
– Needs to be wrapped using views
Apache Ranger Overview
Apache Ranger
Authorization
• Centralized platform to define, administer and manage security policies consistently
• Enforce policies within each component
Auditing
• Central audit location for all access requests
• Support multiple destination sources (HDFS, Solr, etc.)
• Real-time visual query interface
Ranger KMS
• Store and manage encryption keys
• Support HDFS TDE
• Integration with HSM
Ranger Architecture
[Architecture diagram. Labels: Enterprise Users, Legacy Tools and Data Governance, Ranger Administration Portal, Ranger Policy Server, Integration API, Ranger Audit Server, RDBMS. Hadoop components shown with a Ranger Plugin: HDFS, HBase, HiveServer2, Knox, NiFi, Solr, Kafka, YARN, Storm, Atlas.]
Ranger Audits - Data Access
Ranger Audits - Admin Actions
LLAP Overview
Hive 2.0 and LLAP
⬢ At a High Level:
– 2000+ features, improvements and bug fixes in Hive since HDP 2.4.
– 600+ of these from outside of Hortonworks.
⬢ Major Improvements:
– Preview: Hive LLAP: Persistent query servers with intelligent in-memory caching.
– ACID GA: Hardened and proven at scale.
– Expanded SQL Compliance: More capable integration with BI tools.
– Performance: Interactive query, 2x faster ETL.
– Security: Row / Column security extending to views, Column level security for Spark.
Hive 2 with LLAP: Architecture Overview
Hive 2 with LLAP: Open Interfaces
Ranger Integration with Hive and LLAP
Hive / LLAP Security Capabilities with Ranger
⬢ Ranger Hive plugin provides authorization / access controls.
⬢ Column Masking:
– Inject Hive UDFs that mask characters or hash values.
– Dynamic, per-user.
⬢ Dynamic Row Filtering:
– Query is analyzed and policies applied.
– Dynamic, per-user.
⬢ All operations run as ordinary SQL queries:
– Masking statements convert to expressions in the SQL select clause.
– Filters convert to predicates in the SQL where clause.
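
As a rough illustration of the last bullet above, the sketch below folds a masking UDF and a row filter into an ordinary SQL statement. It is a toy string rewriter, not how Ranger or Hive actually rewrites queries; the function `rewrite_query`, the table name `users` and the chosen policy are all hypothetical.

```python
from typing import Dict, List, Optional


def rewrite_query(
    table: str,
    columns: List[str],
    masks: Dict[str, str],
    row_filter: Optional[str],
) -> str:
    """Build a SELECT where masked columns are wrapped in a UDF call
    and the row filter (if any) becomes the WHERE clause."""
    select_items = []
    for column in columns:
        udf = masks.get(column)
        if udf is None:
            select_items.append(column)
        else:
            # Masking lands in the select clause; the alias keeps the column name.
            select_items.append(f"{udf}({column}) AS {column}")
    sql = f"SELECT {', '.join(select_items)} FROM {table}"
    if row_filter:
        # Filtering lands in the where clause.
        sql += f" WHERE {row_filter}"
    return sql


# Hypothetical policy: hash the credit card number, show only CA rows.
rewritten = rewrite_query(
    "users",
    ["customer_name", "customer_ccn", "customer_state"],
    {"customer_ccn": "mask_hash"},
    "customer_state = 'CA'",
)
print(rewritten)
```

The point of the sketch is only that the secured query remains plain SQL, so it can be planned and executed like any other query.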
Native Hive Masking Capabilities
UDF                Purpose                                     Example Start  Example Result
mask               Convert letters to X/x and numbers to n.    123 Fake St.   nnn Xxxx Xx.
mask_first_n       Mask only the first n characters.           433-54-3937    nnn-54-3937
mask_last_n        Mask only the last n characters.            433-54-3937    433-54-nnnn
mask_show_first_n  Mask, showing only the first n characters.  555-233-1234   555-nnn-nnnn
mask_show_last_n   Mask, showing only the last n characters.   433-54-3937    nnn-nn-3937
mask_hash          Produce a consistent hash of the field.     CA             21f241cccaa5cfa33190f56ff1510e37
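
To make the table concrete, here is a minimal Python emulation of the masking behavior it describes. These are not the Hive UDFs themselves, only stand-ins that reproduce the example rows: uppercase letters become X, lowercase letters become x, digits become n, and other characters pass through. The hash function is an assumption (SHA-256 is used as a placeholder), so `mask_hash` here will not reproduce the hash value shown on the slide.

```python
import hashlib


def _mask_char(ch: str) -> str:
    # Upper -> X, lower -> x, digit -> n, anything else unchanged.
    if ch.isupper():
        return "X"
    if ch.islower():
        return "x"
    if ch.isdigit():
        return "n"
    return ch


def mask(value: str) -> str:
    return "".join(_mask_char(ch) for ch in value)


def mask_first_n(value: str, n: int) -> str:
    return mask(value[:n]) + value[n:]


def mask_last_n(value: str, n: int) -> str:
    if n <= 0:
        return value
    return value[:-n] + mask(value[-n:])


def mask_show_first_n(value: str, n: int) -> str:
    return value[:n] + mask(value[n:])


def mask_show_last_n(value: str, n: int) -> str:
    if n <= 0:
        return mask(value)
    return mask(value[:-n]) + value[-n:]


def mask_hash(value: str) -> str:
    # Placeholder algorithm: consistent for equal inputs, but not Hive's output.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


print(mask("123 Fake St."))
print(mask_show_last_n("433-54-3937", 4))
```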
Delivering Spark Security
Key Features: Spark Column Security with LLAP
⬢ Fine-Grained Column Level Access Control for SparkSQL.
⬢ Fully dynamic policies per user. Doesn’t require views.
⬢ Use Standard Ranger policies and tools to control access and masking policies.
Flow:
1. SparkSQL gets data locations known as “splits” from HiveServer2 and plans the query.
2. HiveServer2 authorizes access using Ranger. Per-user policies like row filtering are applied.
3. Spark gets a modified query plan based on dynamic security policy.
4. Spark reads data from LLAP. Filtering / masking guaranteed by LLAP server.
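
The four steps can be pictured with a toy model. Everything below is hypothetical (the classes `ToyHiveServer2`, `ToyLlap` and `SecuredPlan`, the split names and the sample rows); it only illustrates the division of labor in which the planning side decides the policy and the data-serving side applies it before any rows reach the caller.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]


@dataclass
class SecuredPlan:
    """What the caller gets back: splits to read plus the filter to enforce."""
    splits: List[str]
    row_filter: Callable[[Row], bool]


class ToyHiveServer2:
    def __init__(self, splits_by_table: Dict[str, List[str]],
                 filters_by_user: Dict[str, Callable[[Row], bool]]) -> None:
        self.splits_by_table = splits_by_table
        self.filters_by_user = filters_by_user

    def plan(self, user: str, table: str) -> SecuredPlan:
        # Steps 1-3: find the splits, look up the per-user policy,
        # and hand back a plan that already carries that policy.
        row_filter = self.filters_by_user.get(user, lambda row: True)
        return SecuredPlan(self.splits_by_table[table], row_filter)


class ToyLlap:
    def __init__(self, rows_by_split: Dict[str, List[Row]]) -> None:
        self.rows_by_split = rows_by_split

    def read(self, plan: SecuredPlan) -> List[Row]:
        # Step 4: the filter is applied here, on the serving side.
        return [row
                for split in plan.splits
                for row in self.rows_by_split[split]
                if plan.row_filter(row)]


hs2 = ToyHiveServer2(
    {"users": ["split-0", "split-1"]},
    {"mark": lambda row: row["customer_state"] == "CA"},
)
llap = ToyLlap({
    "split-0": [{"customer_id": "1", "customer_state": "CA"}],
    "split-1": [{"customer_id": "2", "customer_state": "NY"}],
})
mark_rows = llap.read(hs2.plan("mark", "users"))
bill_rows = llap.read(hs2.plan("bill", "users"))
```

In this toy the plan object is simply trusted; how the real system protects the plan between HiveServer2, Spark and LLAP is not described on the slide and is not modeled here.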
Example: Per-User Row Filtering by Region in SparkSQL
Use Cases
Demo Setup
⬢ Customer User and Sales data in ORC (metadata in MetaStore)
⬢ Data can be accessed via SparkSQL or HiveServer2
⬢ Marketing needs access to Sales and Users data for analytics
⬢ Fraud Investigation team needs access to data for fraud detection
⬢ Billing team needs access to Sales and Users data for billing

Tables
Users: customer_id, customer_name, customer_email, customer_phone, customer_ccn, customer_state, customer_zip
Sales: customer_id, product_id, promotion_id, cookie_id, tracking_id

Group      Users
Fraud      frank
Marketing  mark
Billing    bill
Use Case 1: Restricting Column Access
This is a simple use case where certain groups or users don’t have permission to view certain columns.
⬢ Billing group has access to all columns in table Users
⬢ Marketing group can’t access the credit card column (customer_ccn) in table Users

Users: customer_id, customer_name, customer_email, customer_phone, customer_ccn, customer_state, customer_zip

User/Column       customer_phone  customer_ccn
bill (Billing)    😀              😀
mark (Marketing)  😀              😡
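
A minimal sketch of the access matrix above, assuming a simple per-group set of restricted columns. The names `GROUP_OF_USER`, `RESTRICTED_COLUMNS` and `can_select` are hypothetical, and an actual Ranger policy is defined in the Ranger admin UI rather than in code like this.

```python
from typing import Dict, Iterable, Set

# Group membership from the demo setup.
GROUP_OF_USER: Dict[str, str] = {"bill": "Billing", "mark": "Marketing"}

# Hypothetical encoding of the policy: columns a group may not read.
RESTRICTED_COLUMNS: Dict[str, Set[str]] = {"Marketing": {"customer_ccn"}}


def can_select(user: str, columns: Iterable[str]) -> bool:
    """Return True only if none of the requested columns is restricted."""
    restricted = RESTRICTED_COLUMNS.get(GROUP_OF_USER[user], set())
    return not any(column in restricted for column in columns)


bill_ok = can_select("bill", ["customer_phone", "customer_ccn"])
mark_phone_ok = can_select("mark", ["customer_phone"])
mark_ccn_ok = can_select("mark", ["customer_phone", "customer_ccn"])
print(bill_ok, mark_phone_ok, mark_ccn_ok)
```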
Ranger Policy - Restrict Columns
Ranger Policy - Restrict Columns - Results
bill from Billing
mark from Marketing
Ranger Policy - Restrict Columns - Audit Screen
Use Case 2: Column Masking
In this use case, certain groups or users won't be able to see the real values of certain columns.
⬢ Billing group can see the real/raw values for all columns in table Users
⬢ Fraud group can only see masked values of PII and PCI fields from table Users

Users: customer_id, customer_name, customer_email, customer_phone, customer_ccn, customer_state, customer_zip

User/Column     customer_email, customer_phone, customer_ccn
bill (Billing)  😀
frank (Fraud)   😎
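
The effect of this use case on a single row can be sketched as follows. The sample row is invented for illustration, and the choice of a full character mask for every protected column is an assumption: the slide does not say which masking type the demo policy used.

```python
from typing import Dict, Set

Row = Dict[str, str]

# Hypothetical policy: columns shown masked to each group.
MASKED_COLUMNS: Dict[str, Set[str]] = {
    "Fraud": {"customer_email", "customer_phone", "customer_ccn"},
}


def mask(value: str) -> str:
    # Upper -> X, lower -> x, digit -> n, anything else unchanged.
    out = []
    for ch in value:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("n")
        else:
            out.append(ch)
    return "".join(out)


def apply_masking(group: str, row: Row) -> Row:
    """Return a copy of the row with the group's protected columns masked."""
    protected = MASKED_COLUMNS.get(group, set())
    return {col: mask(val) if col in protected else val
            for col, val in row.items()}


# Invented sample data, not taken from the demo.
sample = {
    "customer_name": "Jane Doe",
    "customer_email": "jane@example.com",
    "customer_phone": "555-233-1234",
    "customer_ccn": "4111-1111-1111-1111",
}
billing_view = apply_masking("Billing", sample)
fraud_view = apply_masking("Fraud", sample)
```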
Ranger Policies - Mask Fields
Ranger Policy - Column Masking - Results
bill from Billing
frank from Fraud
Ranger Policy - Column Masking - Audit Screen
Use Case 3: Row Filtering
In this use case, certain groups or users won't be able to see all the rows from certain tables.
⬢ Billing group can see all the rows in the table Users
⬢ Marketing can only see rows/data from their region in the table Users

Users: customer_id, customer_name, customer_email, customer_phone, customer_ccn, customer_state, customer_zip

User/Column          Rows in Users table
bill (Billing)       😀
mark (Marketing-CA)  Only CA Users
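
In code form, the row-level rule above amounts to a per-group predicate over rows. The sketch below assumes the Marketing filter is keyed on `customer_state`; the mapping `ROW_FILTERS`, the function `visible_rows` and the sample rows are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]

# Hypothetical policy: Marketing sees only California customers.
ROW_FILTERS: Dict[str, Callable[[Row], bool]] = {
    "Marketing": lambda row: row["customer_state"] == "CA",
}


def visible_rows(group: str, rows: List[Row]) -> List[Row]:
    """Groups without a filter see every row; others see matching rows only."""
    keep = ROW_FILTERS.get(group)
    if keep is None:
        return list(rows)
    return [row for row in rows if keep(row)]


# Invented sample data.
users = [
    {"customer_id": "1", "customer_state": "CA"},
    {"customer_id": "2", "customer_state": "NY"},
    {"customer_id": "3", "customer_state": "CA"},
]
billing_rows = visible_rows("Billing", users)
marketing_rows = visible_rows("Marketing", users)
```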
Ranger Policies - Row Filtering
Ranger Policy - Row Filtering - Results
bill from Billing
mark from Marketing
Use Case 4: Row Filtering - Cross Table
This is an extension of the previous use case, where the context information for filtering the rows is in another table.
⬢ Billing group can see all the rows in the table Sales
⬢ Marketing can only see rows/data from their region in the table Sales. However, the Sales table doesn’t have the customer geographic information, so it needs to be derived from the Users table

User/Column          Rows in Sales table
bill (Billing)       😀
mark (Marketing-CA)  Only CA Users

Users: customer_id, customer_name, customer_email, customer_phone, customer_ccn, customer_state, customer_zip
Sales: customer_id, product_id, promotion_id, cookie_id, tracking_id
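
One way to picture the cross-table rule: a Sales row is visible to Marketing only if its `customer_id` belongs to a customer whose `customer_state` in Users is CA, which is roughly what a subquery filter on `customer_id` would express. The sketch below is an assumption about the shape of such a filter, not the expression used in the demo; `visible_sales` and the sample rows are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]


def visible_sales(group: str, sales: List[Row], users: List[Row]) -> List[Row]:
    """Filter Sales using region information that lives only in Users."""
    if group != "Marketing":
        return list(sales)
    # Derive the allowed customers from the Users table.
    ca_customer_ids = {user["customer_id"]
                       for user in users
                       if user["customer_state"] == "CA"}
    return [sale for sale in sales if sale["customer_id"] in ca_customer_ids]


# Invented sample data.
users = [
    {"customer_id": "1", "customer_state": "CA"},
    {"customer_id": "2", "customer_state": "NY"},
]
sales = [
    {"customer_id": "1", "product_id": "p-10"},
    {"customer_id": "2", "product_id": "p-11"},
    {"customer_id": "1", "product_id": "p-12"},
]
billing_sales = visible_sales("Billing", sales, users)
marketing_sales = visible_sales("Marketing", sales, users)
```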
Ranger Policies - Row Filtering - Cross Table
Apache Atlas Integration
Cross Product Symbiosis
[Diagram labels. Apache Atlas: Classification/Tagging, Governance, Lineage. Apache Ranger: Tag Based Policies, Dynamic Custom Policies. LLAP: Enforcement hooks. Data sources: HDFS, S3, MetaStore.]
* Column Masking and Row Filtering not yet supported by tag based policy
Ranger - Tag Based Policies
Q & A