Big Data Everywhere Chicago: The Big Data Imperative -- Discovering & Protecting Sensitive Data in Hadoop
TRANSCRIPT
© 2014 Dataguise Inc. All rights reserved.
Discovering & Protecting Sensitive Data in Hadoop
Goals For Today
• Big Data in banking, healthcare, tech, government, education, etc. needs data security (but few have workable approaches in production today)
• Hadoop security approaches (what works and what doesn't from the past; challenges in the present)
• Real-world case studies (data-centric protection):
  » Credit card security
  » Healthcare data lake (Data-as-a-Service)
  » Product analytics in the cloud
Market Overview
Data Growth
• 100% data growth, with 80% of it unstructured, by 2015 …finding and classifying sensitive data will only get harder
[Chart: data volume growth, in exabytes]
Real-world unstructured data scenarios
• Voice-to-text files in Hadoop for customer-service optimization
• Patient and doctor medical data in emails, PDFs, doctors' notes
• Web comment fields, customer surveys, CRM data
• Log data from wellheads and oil-drilling sensors
• Web e-commerce payment systems
The Importance of Automation
• From 2012 to 2020, enterprise Big Data will grow 7,500% (in the next 6-8 yrs)
• IT headcount for Big Data will grow only 1.5x
Why Security in Big Data
(Use cases by vertical, across the Refine, Explore, and Enrich stages)
• Retail & Web: Log Analysis / Site Optimization; Social Network Analysis; Dynamic Pricing; Session & Content Optimization
• Retail: Loyalty Program Optimization; Brand & Sentiment Analysis; Dynamic Pricing / Targeted Offers
• Intelligence: Threat Identification; Person-of-Interest Discovery; Cross-Jurisdiction Queries
• Finance: Risk Modeling & Fraud Identification; Trade Performance Analytics; Surveillance & Fraud Detection; Customer Risk Analysis; Real-time upsell and cross-sell marketing offers
• Energy: Smart Grid Production Optimization; Grid Failure Prevention; Smart Meters; Individual Power Grid
• Manufacturing: Supply Chain Optimization; Customer Churn Analysis; Dynamic Delivery; Replacement Parts
• Healthcare & Payer: Electronic Medical Records (EMPI); Clinical Trials Analysis; Insurance Premium Determination
Why Security in Big Data (continued)
The same use-case matrix, annotated with the classes of sensitive data involved: Privacy data, PCI or Financial data, and Personal Health Information (PHI) appear across every vertical.
Three Critical Considerations
1. Ensuring compliance
   • The big Ps (PCI, HIPAA, Privacy), data residency, FERPA, FISMA, FERC, etc.
   • 1,200 laws in 63 countries
2. Reducing breach risk
3. Quantifying both
   • How much sensitive data? ("un-announced")
   • Who is adding it? (ad hoc user directories)
   • Who is accessing it? (sharing, selling, re-purposing)
The Evolution of Hadoop Projects
1. Lab Project
   • Hadoop as R&D
   • Strictly data science
   • Zero $$$ or selection of distribution
   • Zero recognition of sensitive data or exposure
2. Proof Stage
   • Achieving value
   • Data-lake cost savings
   • Line-of-business ownership
   • Nodal expansion
   • Security elements? (unknown to InfoSec)
3. ROI Validity
   • ROI and TCO validity
   • Distribution selection and purchase
   • The security "a-ha" moment
   • Solved with legacy tools or "penalty box" Hadoop
4. On-Demand Hadoop
   • Full-scale production
   • Ad hoc new uses
   • Go faster: Spark, Kafka
   • Security sanctified
On-Demand Hadoop
• Without adequate sensitive-data protection, customers are left "penalty boxing" Hadoop
  » "Security zones" imposed by InfoSec
  » Slows the business; costly and cumbersome
• Data-centric protection can set those assets free
Data Protection In Hadoop
Security in Hadoop: In Summary
• Like cloud, mobile, and virtualization, Big Data drives fundamentally new rules in security
  » Ad hoc computing, wide-open data sets
  » Extended users and usages; sharing and selling
  » 3 Vs moving to 6 Vs (automation, non-blocking)
• Problem #1 is compliance
  » Reporting/auditing/monitoring is as important as, or more important than, data security itself
• Data-centric protection can help
Hadoop Security Framework
• The four approaches to security within Hadoop: Perimeter, Data, Access, Visibility
  » Perimeter: guarding access to the cluster itself. Technical concepts: authentication, network isolation
  » Data: protecting data in the cluster from unauthorized visibility. Technical concepts: encryption, tokenization, data masking
  » Access: defining what users and applications can do with data. Technical concepts: permissions, authorization
  » Visibility: reporting on where data came from and how it's being used. Technical concepts: auditing, lineage
• Dataguise discovers & protects at the data layer and provides visibility for audit reporting and data lineage
Kerberos on Hadoop
• Kerberos (developed at MIT) has been the de facto standard for strong authentication/authorization
  » Protects against user and service spoofing attacks, and allows enforcement of user HDFS access permissions
• What does Kerberos do?
  » Establishes identity for clients, hosts, and services
  » Prevents impersonation; passwords are never sent over the wire
  » Tickets grant cryptographic "permissions" to resources
• Kerberos has been the core of authentication in native Apache Hadoop since 2010
  » Used for access to ecosystem services (HDFS, JobTracker, Oozie) and for server-to-server traffic authentication, BUT complex to manage!
  » Many steps; see for example: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.3.0/CDH4-Security-Guide/cdh4sg_topic_3.html
MapR Improvements on Auth/Authz
• Vastly simpler
  » No requirement for Kerberos in core
  » Identity represented by a ticket issued by the MapR CLDB servers (Container Location DataBase)
  » Core services secured by default
• Easier integration
  » User identity independent of host or operating system
  » Local to MapR (no external Kerberos required)
• Faster
  » Leverages Intel-accelerated hardware crypto
Elements of Data-Centric Protection
1. Identify which elements you want to protect, via:
   » Delimiters (structured data), name-value pairs (semi-structured), or a data discovery service (unstructured)
2. Automated protection options; automatically apply protection via:
   » Format-preserving encryption (FPE)
   » Masking (replace, randomize, intellimask, static)
   » Redaction (nullify)
3. Audit strategy
   » Sensitive-data protection/access/lineage
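The masking and redaction options named above can be sketched in a few lines of Python. This is an illustrative sketch only, not Dataguise's implementation; the function names and the fixed seed are assumptions made for the example.

```python
import hashlib
import random

def mask_replace(value: str, token: str = "XXXX") -> str:
    """Static replacement: every value becomes the same fixed token."""
    return token

def mask_randomize(value: str, seed: str = "demo-seed") -> str:
    """Consistent randomization: the same input always yields the same
    random-looking value, so joins on masked columns still line up.
    Digits are replaced; separators and letters are preserved."""
    rng = random.Random(hashlib.sha256((seed + value).encode()).hexdigest())
    return "".join(str(rng.randint(0, 9)) if c.isdigit() else c
                   for c in value)

def redact(value: str) -> str:
    """Redaction (nullify): the value is removed entirely."""
    return ""
```

Note the consistency property: because the randomization is seeded from the input, two records carrying the same card number still mask to the same value.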
Discovery
• Within HDFS
  » Search for sensitive data per company policy (PII, PCI, …)
  » Handle complex data types such as addresses
  » Process incrementally (the default) to handle only new content
• In-flight
  » Process data on the fly as it is ingested into Hadoop HDFS
  » Plug-in solutions for FTP, Flume, Sqoop
  » Search for sensitive data per policy (PII, PCI, HIPAA, …)
  » NEXT UP: Kafka
How Discovery Works
• MapReduce or Flume/FTP/Sqoop agent
  » Root directories and drill-downs
  » Can scan the entire dataset or scan incrementally (watermarking)
• Runs pattern, logic, context, algorithm, and ontology filters
• Can utilize white/black lists and reference sets
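The incremental (watermarking) behavior can be sketched with a stored modification-time watermark: only files newer than the last run are rescanned. A simplified local-filesystem sketch (a real agent would track HDFS paths; the state-file format here is an assumption):

```python
import json
import os

def incremental_scan(root, state_file, scan_fn):
    """Scan only files modified since the watermark recorded by the
    previous run, then advance the watermark."""
    watermark = 0.0
    if os.path.exists(state_file):
        with open(state_file) as f:
            watermark = json.load(f)["watermark"]
    newest = watermark
    scanned = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            if mtime > watermark:      # new or changed since last run
                scan_fn(path)
                scanned.append(path)
                newest = max(newest, mtime)
    with open(state_file, "w") as f:
        json.dump({"watermark": newest}, f)
    return scanned
```

Running it twice in a row scans everything the first time and nothing the second, unless files changed in between.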
Protection Measures
• A protection plan should start with cutting
  » What data can we delete/cut?
  » What data can be redacted?
• Masking choices
  » Consistency
  » Realistic-looking data
  » Partial reveal (Intellimask), e.g. Credit Card # 4541 **** **** 3241
• What data needs reversibility?
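The partial-reveal pattern above (keep the first and last four digits, star out the middle, preserve separators) can be sketched directly. The function name and parameters are illustrative, not the product's API:

```python
def intellimask(card_number: str, show_first: int = 4,
                show_last: int = 4) -> str:
    """Partial reveal: keep leading/trailing digits, star the middle,
    and preserve separators so the result still reads as a card number."""
    total = sum(c.isdigit() for c in card_number)
    digits_seen = 0
    out = []
    for c in card_number:
        if not c.isdigit():
            out.append(c)          # keep spaces/dashes as-is
            continue
        digits_seen += 1
        if digits_seen <= show_first or digits_seen > total - show_last:
            out.append(c)          # revealed region
        else:
            out.append("*")        # masked region
    return "".join(out)
```

For example, `intellimask("4541 8123 9944 3241")` produces the slide's `4541 **** **** 3241`.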
Encryption "vs" Masking
• Encryption:
  + Reversible
  + Trusted, with security proofs
  + Format-preserving and partial reveals
  + Scale-out and distributed
  + The first hammer
  + De-centralized architectures
  - Complex
  - Key management
  - Useless without robust authentication and authorization
  - Data-value destruction
  - Needs both encrypt and decrypt tooling
• Masking:
  + Highest security
  + Realistic data
  + Range- and value-preserving
  + Format-preserving and partial reveals
  + Once and done
  + Scale-out and distributed
  + No performance impact on usage
  + No need for authentication, authorization, or key management
  - Not as well marketed
  - Not reversible
  - Perceived to grow data

The fundamental decision between masking and encryption comes down to reversibility:
• Some elements in analytics must resolve to the original (e.g. 66.249.22.145 or $34,332.12)
• Some elements are ideal for pseudonyms: Social Security numbers, credit card numbers, names
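For the pseudonym case, a keyed hash gives irreversibility while keeping consistency, so counts, joins, and group-bys still work on de-identified data. A minimal sketch; the key and truncation length are illustrative choices, not a product default:

```python
import hashlib
import hmac

def pseudonym(value: str, key: bytes = b"demo-key") -> str:
    """Irreversible but consistent pseudonym: the same input always
    maps to the same output under a given key, and the key never
    needs to support decryption."""
    digest = hmac.new(key, value.encode(), hashlib.sha256).hexdigest()
    return digest[:12]   # truncate for readability; length is a choice
```

Where analysts must recover the original (the IP address or dollar amount above), reversible encryption with key management is the tool instead.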
Real-World Performance
• Leveraging the power of MapReduce to run distributed encryption or masking
• Data volume: 2.2 TB
• Run time: 23 min
• Sensitive data: 8 of 50 columns, across 2.2 billion rows
• Run on a 360-node MapR system
• In old-world database technology, this type of job would have taken days or weeks
Audit Strategy
• Essential to all goals: compliance, breach protection, visibility, and metrics
• Avoids the "gotcha" moment
  » Show all sensitive elements (count, location)
  » Show what remediation was applied
  » Dashboard for fast access to critical policies, with drill-downs for file and user actions
How It Works: Detection and Protection In-Flight or At-Rest

Sources: RDBMS transaction data, data warehouse, web site.

In-flight (on ingest):
• DgFlume Agent / DgFlume Plug-in (Flume, FTP, WEB sources)
  1. Detect sensitive data
  2. Protect by applying masking/encryption policies
• DgSqoop Agent (Sqoop)
  1. Detect sensitive data
  2. Protect by applying masking/encryption policies

At-rest (production cluster, via the Hadoop API: Discover/Mask/Encrypt):
• DgHDFS Agent
  1. Detect sensitive data
  2. Protect by applying masking/encryption policies
• DgHive, HDFS bulk decryption, or Java app
  1. Selective decryption based on user/role and policy
• DgDiscover-Masker
  1. Discover in the database (Oracle, SQL.., SharePoint, files)
  2. Protect by applying masking/encryption policies

Overall flow:
1. Data is discovered and protected while loaded into HDFS
2. Data is masked or encrypted in HDFS with a MapReduce job
3. Users can now access the data
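The in-flight detect-then-protect step can be pictured as a per-event transform, the way a Flume interceptor sits between source and sink. A stand-in sketch in Python (the single SSN pattern and function names are assumptions for illustration; the real agents apply full masking/encryption policies):

```python
import re

# One illustrative pattern standing in for a full detection policy.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def protect_line(line: str) -> str:
    """Mask any SSN-shaped value before the event lands in HDFS."""
    return SSN.sub("***-**-****", line)

def ingest(events):
    """Stand-in for a Flume-style interceptor: each event is detected
    and protected in flight, so only de-identified data is written."""
    return [protect_line(e) for e in events]
```

Because protection happens before the write, unprotected sensitive values never reach the cluster at all.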
Case Studies
Protecting sensitive data in a top credit card firm

Objectives
! Consolidate existing payment-risk analysis inside high-scale, lower-cost Hadoop
! Provide tiered access & authorization for multiple business apps (fraud, risk, cross-sell)

Solution
! MapR Hadoop as a single, reliable, high-performance data-analysis platform
! Dataguise consistent masking enables analysis and unique index key values for de-identified data
! Unique ability to output protected data in an adjacent column, or appended with a delimiter inside the existing column, to protect data while governing access via authorization rules

Results & Benefits
• Continuous real-time protection (job runs every 5 minutes on ingest)
• Analytics draws on the secure purchasing data of 90 million credit card holders across 127 countries
• Incremental updates to HDFS automatically protected
• Selective access to sensitive data based on role and app

(Diagram: Omniture files and credit card transactions flow through source-data protection into analysis.)
Protecting personal health info (PHI) in an aggregate data lake

Objectives
! Reduce costly and preventable readmissions, decrease mortality rates, and improve patients' quality of life
! Internal data-service model: DAaaS (Data Architecture as a Service)

Solution
! Needs to protect structured and unstructured source data in database, data warehouse, and flat-file structures
! Customer required customization of encryption and key management to fit their existing corporate infrastructure and security policies
! Dataguise dashboard gives admins an easy way to identify directories/files containing sensitive data
! Seamless access to encrypted data from a variety of data-access methods (Hive, Pig, analytic tools)

Results & Benefits
• Delivered a cost-effective, easy way to determine where sensitive data resides within the cluster and how it's been protected
• HDFS authorization controlled through group membership in Active Directory

(Diagram: health records and SQL data flow through the DG FTP Agent over FTP into HDFS.)
Global Tech Product Analytics

Objectives
! Aggregate logging data (product, usage, user configuration) for all smartphones worldwide
! De-identify personal user info to ensure privacy and compliance with European/US privacy law

Solution
• Customer routes all device logging data into 7 global AWS clouds
• Uses the Dataguise Flume agent to protect all sensitive data being written to Amazon S3
• Runs Dataguise in AWS; also uses Dataguise EMR security agents to selectively decrypt for authorized analytics in AWS

Results & Benefits
• On-demand Hadoop for product analytics, user behavior, and supply-chain optimization
• High scale-out and high performance paramount
• 100% cloud-based security

(Diagram: smartphone device log collectors feed Apache Flume with the DG Flume Agent into virtualized DG Secure, landing protected data in Amazon S3; AWS clouds in Korea, Singapore, the US (3), the UK, and Ireland.)
Hadoop Data-Protection Checklist
✓ Discover sensitive data
✓ Automate protective measures
✓ Integrate into Hadoop authorization
✓ With continuous real-time tracking
✓ Dashboards, reports & auditing
✓ Automated risk assessment/scoring
✓ Automated inference protection (roadmap)
Thank You
Jeremy Stieglitz VP Products [email protected]