TRANSCRIPT
Securing Your Hadoop Cluster With Apache Ranger, Atlas and Knox
Attila Kanto & Zsombor Gegesy
June 13th 2017 – Budapest Data Forum
2 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Disclaimer
This document may contain product features and technology directions that are under development, may be under development in the future or may ultimately never be developed.
Product capabilities are based on information that is publicly available within the Apache Software Foundation websites (“Apache”). Progress of the project capabilities can be tracked from inception to release through Apache; however, technical feasibility, market demand, user feedback and the overarching Apache Software Foundation community development process can all affect timing and final delivery.
This document’s description of these features and technology directions does not represent a contractual commitment, promise or obligation from Hortonworks to deliver these features in any generally available product.
Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
Since this document may contain an outline of general product development plans, customers should not rely upon it when making purchasing decisions.
Agenda
Security concepts overview
Apache Knox
Apache Ranger
Apache Atlas
Q&A
Five pillars of enterprise security
HDP Security: Comprehensive, Complete, Extensible
Perimeter Level Security
• Network Security (i.e. Firewalls)
• Apache Knox (i.e. Gateways)
Authentication
• LDAP / AD
• Kerberos
Authorization
• Consistent authorization control across all HDP components with Apache Ranger
Data protection
• Encrypt data in motion and data at rest, Apache Ranger KMS
OS Security
• Process isolation
• Namespaces
Apache Knox
What is Apache Knox?
REST API and Application Gateway for the Apache Hadoop Ecosystem
Why Apache Knox?
Extensible reverse proxy framework
Simplifies Access
• Kerberos encapsulation
• Single access point for all REST and HTTP interactions
• Multi-cluster support
Enhanced Security
• Eliminates the SSH edge node (securely exposes REST APIs and HTTP-based services at the perimeter)
• Protects the details of the cluster deployment
• Provides SSL for non-SSL services
• Central auditing
Enterprise Integration
• LDAP/AD integration
• SSO integration
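To make the "single access point" concrete, here is a minimal sketch (in Python) of how a client-side WebHDFS URL changes when it goes through Knox. The gateway host, port 8443 and the "sandbox" topology name are illustrative assumptions, not values from the talk.

```python
# Sketch: the same WebHDFS call, direct vs. through the Knox gateway.
# Host names and the "sandbox" topology are illustrative assumptions.

def direct_webhdfs_url(namenode_host, path, op="LISTSTATUS"):
    # Without Knox the client must know the NameNode's address and port.
    return f"http://{namenode_host}:50070/webhdfs/v1{path}?op={op}"

def knox_webhdfs_url(gateway_host, topology, path, op="LISTSTATUS"):
    # With Knox the client only knows the gateway; the topology name maps
    # the request onto the right cluster service behind the perimeter.
    return f"https://{gateway_host}:8443/gateway/{topology}/webhdfs/v1{path}?op={op}"

print(knox_webhdfs_url("knox.example.com", "sandbox", "/tmp"))
# https://knox.example.com:8443/gateway/sandbox/webhdfs/v1/tmp?op=LISTSTATUS
```

Note how the gateway URL reveals nothing about which host actually runs the NameNode – this is the "protects the details of the cluster deployment" point above.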
What Apache Knox isn’t
Not an alternative to firewalls
Not an alternative to Kerberos
Not a channel for high volume data ingest or export
Apache Knox Overview
Proxying Services
The primary goal of the Apache Knox project is to provide access to Apache Hadoop via proxying of HTTP resources.
Authentication Services
Authentication for REST API access as well as WebSSO flow for UIs. LDAP/AD, Header based PreAuth, Kerberos, SAML, OAuth are all available options.
Client DSL/SDK Services
Client development can be done with scripting through DSL or using the Knox Shell classes directly as SDK.
[Architecture diagram. Proxying Services (HTTP proxying of UIs, REST APIs and WebSockets) front cluster endpoints such as WebHDFS, WebHCat, Hive, HBase, Ambari, YARN RM, Oozie, Zeppelin, Ranger, Phoenix, Gremlin, SQL/DB and other Hadoop UIs. Authentication Services (KnoxSSO/token sessions, REST API classes) support LDAP/AD, SPNEGO, header-based, SAML and OAuth authentication and federation providers. Client DSL/SDK Services provide a Groovy-based DSL and the KnoxShell SDK.]
Cluster access through Edge Node
[Diagram: the user connects over SSH/SCP through the DMZ to an edge node running the Hadoop CLIs, which talk to the Hadoop services.]
• CLIs are hard to install on desktops
• Limited auditing
• CLIs must be aware of the cluster topology
Cluster access through Gateway
[Diagram: the user connects over a REST API through the DMZ to the Knox gateway, which forwards the REST API calls to the Hadoop services.]
• All activity is audited consistently
• Cluster topology is not exposed to the client
• User connects through a REST API
Authentication and Identity Propagation
[Diagram: the client talks to the gateway, and the gateway talks to the Hadoop services. The client is not aware that the cluster is secured with Kerberos.]
0. Configure Knox as a trusted proxy
1. REST API request
2. Authentication challenge
3. Client authenticates as user:secret
4. Knox authenticates as itself via SPNEGO (i.e. Kerberos)
5. REST API request is forwarded with doAs=user
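The steps above can be sketched in a few lines. The client side only ever deals with basic credentials; Knox, as the Kerberos-enabled trusted proxy, forwards the call with a `doAs` query parameter (the mechanism WebHDFS-style REST APIs use for impersonation). All host names below are illustrative assumptions.

```python
# Sketch of the identity-propagation steps above. The client authenticates
# with basic credentials; Knox holds the Kerberos credentials and forwards
# the call on the user's behalf via a doAs parameter.
# Host names and credentials are illustrative assumptions.
from urllib.parse import urlencode

def client_request(gateway_url, path, user):
    # Steps 1-3: the client only presents user:secret to the gateway.
    return {"url": f"{gateway_url}{path}", "auth": (user, "secret")}

def forwarded_request(service_url, path, authenticated_user):
    # Steps 4-5: Knox authenticates itself via SPNEGO/Kerberos and
    # impersonates the end user, so the client never sees Kerberos.
    query = urlencode({"doAs": authenticated_user, "op": "LISTSTATUS"})
    return f"{service_url}{path}?{query}"

print(forwarded_request("http://namenode:50070/webhdfs/v1", "/tmp", "alice"))
# http://namenode:50070/webhdfs/v1/tmp?doAs=alice&op=LISTSTATUS
```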
Scalability and Fault Tolerance
[Diagram: the user’s REST API calls go through a load balancer to multiple Knox gateway instances, which forward them to the Hadoop services.]
Multi-cluster support / multi-tenant support
[Diagram: a single Knox gateway routes the user’s requests to multiple Hadoop clusters.]
Extensibility: Providers and Services
Providers
• Features of the gateway that can be used by Services
Services
• Actual Hadoop services like WebHDFS, Hive, RM, etc.
• Definitions of endpoints on the gateway to expose a specific service
• Includes providing configuration (e.g. rewrite rules)
Topologies
• Assembly of providers and services
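A topology is an XML descriptor assembling providers and services. Below is a minimal sketch that generates one with Python's ElementTree; the `ShiroProvider` (LDAP authentication) and `WEBHDFS` entries follow common Knox examples, while the NameNode URL is an illustrative assumption – a real descriptor would carry additional provider parameters.

```python
# Sketch: assembling a minimal Knox topology (one provider + one service).
# Provider/service roles follow common Knox examples; the NameNode URL
# is an illustrative assumption.
import xml.etree.ElementTree as ET

topology = ET.Element("topology")

# Provider section: which gateway features (here: authentication) are active.
gateway = ET.SubElement(topology, "gateway")
provider = ET.SubElement(gateway, "provider")
ET.SubElement(provider, "role").text = "authentication"
ET.SubElement(provider, "name").text = "ShiroProvider"
ET.SubElement(provider, "enabled").text = "true"

# Service section: which cluster endpoint this topology exposes.
service = ET.SubElement(topology, "service")
ET.SubElement(service, "role").text = "WEBHDFS"
ET.SubElement(service, "url").text = "http://namenode.example.com:50070/webhdfs"

print(ET.tostring(topology, encoding="unicode"))
```

Deploying such a file into Knox's topologies directory is what makes the corresponding `/gateway/{topology-name}/...` URLs available.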
Apache Ranger
Why Apache Ranger?
Centralized Platform
• Define and administer security policies consistently
• Define a security policy once, and apply it across the component stack
• Deep visibility – detailed audit trail
Fine-Grained Security On
• Database
• Table
• Column
• Queue (be it Kafka or YARN)
• Any resource
Apache Ranger
All the Hadoop components have some kind of user management
Not integrated, not too sophisticated:
– HDFS – POSIX-like user/group access policy
– Hive/HBase – db/table-level restrictions for users/groups
– Etc.
It would be nice if restrictions could be:
– Driven from LDAP/Active Directory
– Based on the client IP address
– Based on the time of access
– Data masking – ‘support’ could only see the last 4 digits of a credit card number
– Data filtering – ‘sales’ could only see the data from their own region
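The two wish-list items above – masking and filtering – can be sketched as plain functions. In a real deployment Ranger applies these rules server-side via policies; the code below is purely illustrative of the effect a user sees.

```python
# Sketch of the masking and filtering rules described above.
# Ranger enforces these server-side via policies; this is illustrative only.

def mask_card_number(card_number: str, visible: int = 4) -> str:
    # 'support' sees only the last 4 digits of the credit card number.
    return "*" * (len(card_number) - visible) + card_number[-visible:]

def region_filter(rows, user_region):
    # 'sales' only sees rows from their own region.
    return [row for row in rows if row.get("region") == user_region]

print(mask_card_number("4929123456781234"))  # ************1234
print(region_filter([{"region": "HU"}, {"region": "US"}], "HU"))
```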
Apache Ranger
Solution for authorization with flexible policies – with a plugin architecture
Supports:
– HDFS
– Hive
– HBase
– Kafka
– Knox
– Storm
– YARN
– NiFi
– Atlas
Contributed community plugins:
– Apache HAWQ, Druid, Gaian DB
Apache Ranger
One policy to grant:
– hdfs://home/sales/{USERNAME} – to all the users in the ‘sales’ group
Hive database, table and column level access:
– row-level filter – e.g. ‘location = ”HU”’
– column masking – hashing / hiding / etc.
HBase:
– table, column-family and column-level filtering
YARN:
– limit processing queue access
Knox:
– limit access to topologies / services
Etc.
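The group home-directory example above can be sketched as a policy payload. Ranger policies are usually created in the Admin UI, but they can also be posted as JSON; the layout below follows the general shape of Ranger's public v2 REST API. The service name `hadoopdev` and the exact username-macro spelling are assumptions here – verify both against your Ranger version's documentation.

```python
# Sketch: the "one policy per sales home directory" example as a
# Ranger-style policy payload. Field layout follows the general shape of
# Ranger's public v2 REST API; the service name and the {USER} macro
# spelling are assumptions - check them against your installation.
import json

policy = {
    "service": "hadoopdev",          # the Ranger HDFS service instance (assumed name)
    "name": "sales-home-dirs",
    "resources": {
        "path": {"values": ["/home/sales/{USER}"], "isRecursive": True}
    },
    "policyItems": [{
        "groups": ["sales"],
        "accesses": [{"type": "read", "isAllowed": True},
                     {"type": "write", "isAllowed": True}],
    }],
}
print(json.dumps(policy, indent=2))
```

A payload of this shape would be POSTed to the Ranger Admin server's public policy endpoint, and the HDFS plugin would then enforce it on every NameNode request.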
Apache Ranger
Audit log about every access and decision:
– stored in HDFS and/or Solr
– so it is easy to search/filter for audit events
– can be sent to Kafka, for integration with other services
Ranger KMS:
– secure key management for HDFS encryption (“data at rest”)
– access control policies
– audit
Can we have more flexibility?
If table/column/row-specific restrictions are not enough:
– configuring hundreds of columns independently and manually is error-prone
Tag-based access decisions:
– every column tagged as ‘Personal Information’ should be hidden from ‘X’
– every table tagged with ’visibleBefore=2017-10-01’ should be hidden from that date on
But how do we get the tags?
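Evaluated in code, the two tag-based rules above might look like the sketch below. In a real deployment Ranger's tag-based policies perform this evaluation, with the tags supplied by Atlas; this standalone function is purely illustrative.

```python
# Sketch of a tag-based access decision: the policy keys off tags attached
# to a column/table instead of enumerating hundreds of columns.
# Illustrative only - Ranger evaluates such rules server-side.
from datetime import date

def allowed(tags: dict, user_group: str, today: date) -> bool:
    # Rule 1: 'Personal Information' columns are hidden from group 'X'.
    if "Personal Information" in tags and user_group == "X":
        return False
    # Rule 2: a 'visibleBefore' tag hides the table from that date on.
    visible_before = tags.get("visibleBefore")
    if visible_before is not None and today >= date.fromisoformat(visible_before):
        return False
    return True

print(allowed({"visibleBefore": "2017-10-01"}, "sales", date(2017, 12, 1)))  # False
```

Once a column is tagged, every component whose Ranger plugin sees that tag enforces the same rule – which is exactly why the tags themselves become the interesting part.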
Metadata Truth in Hadoop
[Diagram: structured and unstructured data – from traditional RDBMS, MPP appliances and metadata scattered across many projects – converge on the data lake with a central metadata store.]
Data Management
• Along the entire data lifecycle, with integrated provenance and lineage capability
Modeling with Metadata
• Cross-component dataset lineage; centralized location for all metadata inside Hadoop
Interoperable Solutions
• Single interface point for metadata exchange with platforms outside of Hadoop
Apache Atlas
Graph about the Metadata – ability to collect and link various information automatically
As a graph, it is highly extensible:
– define new nodes and edges between them, and even new node types
Dynamic query language
REST API – for external systems
External connectors with messaging frameworks
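The REST API mentioned above can be exercised with plain HTTP. The sketch below only builds the request URL for Atlas's DSL search endpoint; the `/api/atlas/v2/search/dsl` path follows Atlas's v2 REST API, while the host, the default port 21000 and the example query are illustrative assumptions.

```python
# Sketch: building a request URL for Atlas's DSL search endpoint.
# The /api/atlas/v2/search/dsl path follows the Atlas v2 REST API;
# host, port and the example query are illustrative assumptions.
from urllib.parse import urlencode

def atlas_dsl_search_url(atlas_host: str, query: str) -> str:
    params = urlencode({"query": query})
    return f"http://{atlas_host}:21000/api/atlas/v2/search/dsl?{params}"

print(atlas_dsl_search_url("atlas.example.com", "hive_table where name = 'customers'"))
```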
Apache Atlas
For Hive:
– which column contains what kind of data
– who created / who consumes the data – lineage
– lineage is tracked whether the data was created by
• Sqoop, Storm, Kafka, Falcon – or by Hive SQL
Tags – for marking something as personal info, etc.
Apache Atlas + Ranger
More fine-grained access decisions …
Apache Atlas and Ranger Integration
Basic tag policy – access and entitlements can be based on attributes.
– Personally Identifiable Information (PII) is a tag that can be leveraged to protect sensitive personal data.
Geo-based policy – access policy based on location.
– A user might be able to access data in North America, but may be restricted from access in EMEA due to privacy compliance.
Time-based policy – access policy based on time windows.
– A user might be able to access data only between 8 AM and 5 PM (common under SOX regulations).
Prohibitions – restrictions on combining two data sets that are each compliant on their own, but not when combined.
– e.g. names and health care records
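The time-based policy above reduces to a simple window check; Ranger evaluates such conditions server-side at request time, so the standalone function below is illustrative only (the 8 AM and 5 PM bounds are the ones from the example).

```python
# Sketch of the time-based policy above: access allowed only between
# 8 AM and 5 PM. Ranger evaluates the condition at request time;
# this standalone check is illustrative only.
from datetime import time

def within_business_hours(now: time,
                          start: time = time(8, 0),
                          end: time = time(17, 0)) -> bool:
    return start <= now <= end

print(within_business_hours(time(9, 30)))   # True
print(within_business_hours(time(18, 0)))   # False
```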
Q&A