“Big Data - What’s the Big Deal?” - ISSA OC · IBM Security
© 2015 IBM Corporation
IBM Security
Cindy E. Compert, CIPM
Executive Security Specialist
CTO, Data Privacy
IBM Security
@CCBigData
“Big Data - What’s the Big Deal?”
Presentation to ISSA of OC, 6/11/15
Agenda
• Introduction
• What is Big Data?
• Security & Privacy considerations
• Architecture, Technical controls, best practices
• Wrap up
Short History Lesson
• 1997: “Big Data”
• 2001: Volume, Variety, Velocity
• 2004: MapReduce
• 2005: Hadoop, HDFS
What is “Big Data”? The 5 V’s
Transactional and Application Data
• Volume
• Structured
• Throughput
Machine Data
• Velocity
• Semi-structured
• Ingestion
Social Data
• Variety
• Highly unstructured
• Veracity
Enterprise Content
• Variety
• Highly unstructured
• Volume
Big Data Scenarios Span Many Industries
• Identify criminals and threats from disparate video, audio, and data feeds
• Make risk decisions based on real-time transactional data
• Predict weather patterns to plan optimal wind turbine usage, and optimize capital expenditure on asset placement
• Detect life-threatening conditions at hospitals in time to intervene
• Multi-channel customer sentiment and experience analysis
Big Data Typical Capabilities
• Manage & store huge volume of any data: Hadoop File System, MapReduce
• Manage streaming data: Stream Computing
• Analyze unstructured data: Text Analytics Engine
• Structure and control data: Data Warehousing
• Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM
• Understand and navigate federated big data sources: Federated Discovery and Navigation
What is Hadoop? Component Detail
HDFS (Hadoop distributed file system)
• YARN (MapReduce 2)
• MapReduce
Flume (log aggregator)
Sqoop (bulk data transfer)
Zookeeper (Coordination)
Oozie (Workflow)
Pig (Scripting)
Spark (analytics framework)
Impala / Tez / Big SQL / HAWQ (Low latency query eng.)
Hive (SQL-like query)
HBase (Columnar store)
Mahout (Machine learning)
Management Consoles
BigInsights Console, Hortonworks Ambari, Cloudera Navigator
HCatalog (metadata and storage abstraction)
Check out also http://wikibon.org/wiki/v/HBase,_Sqoop,_Flume_and_More:_Apache_Hadoop_Defined
Sample Big Data Architecture
Data in Motion, Data at Rest, Data in Many Forms
Information Ingestion and Operational Information
• Stream Processing
• Data Integration
• Master Data
Real-time Analytics
• Video/Audio
• Network/Sensor
• Entity Analytics
• Predictive
Streams
Landing Area, Analytics Zone and Archive
• Raw Data
• Structured Data
• Text Analytics
• Data Mining
• Entity Analytics
• Machine Learning
Exploration, Integrated Warehouse, and Mart Zones
• Discovery
• Deep Reflection
• Operational
• Predictive
Decision Management, BI and Predictive Analytics, Navigation and Discovery, Intelligence Analysis
Information Governance, Security and Business Continuity
A Data Lake vs. a Data Reservoir
• Data lake: data flows in “naturally” and just sits there
• Data reservoir: built to extract value from the data
$$$ QUIZ TIME $$$
1. Name 3 of the 5 V’s of Big Data
2. What year was the term ‘Big Data’ first used?
Agenda
• Introduction
• What is Big Data?
• Security & Privacy considerations
• Architecture, Technical controls, best practices
• Wrap up
Source: “IBM Data Governance”, a commissioned study conducted by Forrester Consulting on behalf of IBM, July, 2013
Q18. Which information governance tool would be the highest investment priority?
Base: 512 Director or VP level professionals with decision making authority for Big Data technologies
Security: A Priority in Big Data Environments
Why is Big Data different?
• More data, exponentially more risk
• Immature: less governance and discipline, rapidly evolving
• New types of data, new privacy implications
  • Smart meters, health monitors, connected home, connected car
  • Linked data: linking public and private data exposes new risks
• New uses of data: mapping to privacy policy
Big Data life cycle – from raw to production
search & survey
• Users: Business Users (With An Idea), Power Users, Data Analysts
• text search
• simple investigations
• peek / poke
exploratory analysis
• Users: Data Scientist / Data Miner, Advanced Business User, Application Developer
• from mountain of data into a structured world with apps to provide business value
• iterative in nature, many false starts
operational
• Users: Traditional IT / Application Developer
• creating/standing up applications, processes, systems with enterprise characteristics
• more formal environment, SLAs, etc.
Fit-for-purpose security and privacy
Initial / exploratory use cases | Used for business decisions
Few security or privacy concerns | Protect, Secure, Encrypt
Sporadic change management | Audit trail tracking access & changes
No data retention requirements | Preserve data for N years
Little to no regulation | Legislated requirements
No / isolated data quality concerns | Data quality imperatives
Sources of information are “interesting” | Sources must be trusted
No difference in data governance requirements once the data is used for making operational business decisions
Privacy is the ‘Why’ and ‘What’
Security is the ‘How’
What is ‘sensitive data’?
• Any information that identifies an individual
• Examples include name, address, physical characteristics, location, email, cell phone, political beliefs, etc.
• The definition of what constitutes sensitive data varies by country and even by state.
• Examples of sensitive data include PII (personally identifiable information) and PHI (protected health information)
• What is nonpersonal information? The data remaining after identifiers are removed, also known as de-identified or anonymized information. Optim Masking may accomplish this goal.
Key Takeaway: An organization’s Legal, Compliance, and Privacy functions determine how to enforce privacy regulations, based on risk and value.
IT and InfoSec should NOT be the arbiters.
How unique are you? A Lesson in Uniqueness and Anonymization
• Dr. Latanya Sweeney (Harvard, FTC Chief Technologist): a 1997 study using US Census data predicted that 87 percent of the U.S. population had a unique combination of just date of birth, gender, and zip code
• Try it yourself here: http://aboutmyinfo.org
• An additional study on the Personal Genome Project identified 84-97% of records, also using demographics plus data mining (http://dataprivacylab.org/projects/pgp/1021-1.pdf)
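The uniqueness idea can be checked on any dataset before it is shared. The following is a minimal sketch, not the method used in the cited studies: it counts how many records are the only ones with their combination of quasi-identifiers. The field names `dob`, `gender`, and `zip` and the function name are assumptions made for illustration.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers=("dob", "gender", "zip")):
    """Return the share of records whose quasi-identifier combination is unique.

    `records` is a list of dicts; the field names are hypothetical.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    if not keys:
        return 0.0
    counts = Counter(keys)
    # A record is "unique" when nobody else shares its combination.
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys)


people = [
    {"dob": "1970-01-01", "gender": "F", "zip": "92602"},
    {"dob": "1970-01-01", "gender": "F", "zip": "92602"},
    {"dob": "1982-05-17", "gender": "M", "zip": "92602"},
    {"dob": "1990-11-30", "gender": "F", "zip": "92618"},
]
print(unique_fraction(people))
```

A high fraction suggests that removing names alone does not anonymize the data; generalizing fields (for example, year of birth instead of full date) lowers it.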
Agenda
• Introduction
• What is Big Data?
• Security & Privacy considerations
• Architecture, Technical controls, best practices
• Wrap up
Architecture: Security & Privacy for Big Data
Security is Security- Same disciplines apply
• GRC: Provide insight into security control efficacy, compliance
• Security Intelligence and Analytics: Apply analytics and automation to data and incidents to detect threats, perform forensic analysis and automate compliance
• Identity and Access Management: Govern and enforce access across multiple channels, including mobile, social and cloud
• Data Security & Privacy: Prevent data loss and enable data access to support business operations, growth and innovation.
• Application Security: Test and verify applications before deployment to reduce risks and costs.
• Infrastructure Protection: Achieve in-depth security across networks, servers, virtual servers, mainframes and endpoints.
• Advanced Fraud Protection: Detect and prevent attack vectors responsible for the majority of online, mobile and cross-channel fraud.
A Hadoop Security Architecture
http://www.hadoopsphere.com/2013/01/security-architecture-for-apache-hadoop.html
Static Data (at rest)
Dynamic Data (in use)
...and masking
Monitoring and auditing challenges
• Many avenues to access
• Security and authentication is evolving
• Complex software stack with significant log data from each component
• Security and audit viewed in isolation from rest of data architecture
Data Security and Privacy Core Disciplines
Security Controls Core Disciplines: The ‘How’
Phases: Understand & Define, Secure & Protect, Monitor & Audit
• Discover sensitive assets & who has access
• Classify Assets & Quantify risk
• Define policies and metrics
• Assess Vulnerabilities
• Harden environments to reduce risk
• Redact/encrypt/mask sensitive data in all environments
• Implement Identity & Access Management, Activity Monitoring
• Monitor and enforce; Review policy exceptions
• Audit and report for compliance
Key Requirements
Utilize real-time data activity monitoring for privacy, security & compliance
• Implement on premise or cloud
• Non-invasive, non-disruptive, cross-platform architecture
• Separation of duties enforcement for Database Administrator (DBA) access
• Detect or block unauthorized & suspicious activity
• Granular, real-time policies
  • Who, what, when, how
• Continuous, policy-based, real-time monitoring of all data traffic activities, including actions by privileged users
• Centralize compliance reporting
• Data protection compliance automation
• Real-time alerting
Monitoring Appliance
Data Repositories (databases, warehouses, file shares, Big Data)
• 100% visibility including local admin access
• Minimal performance impact
• Should not rely on resident logs that can easily be erased by attackers, rogue insiders
• No environment changes
• Integration with broader privacy, security and compliance tools
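A granular "who, what, when, how" policy can be pictured as a rule evaluated against each access event. The sketch below is illustrative only and does not reflect any product's policy format; the class names, field names, and the "alert"/"block" actions are all assumptions.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class AccessEvent:
    user: str      # who
    resource: str  # what
    at: time       # when
    command: str   # how


@dataclass(frozen=True)
class Rule:
    name: str
    users: frozenset
    resources: frozenset
    commands: frozenset
    allowed_from: time   # start of the permitted window
    allowed_until: time  # end of the permitted window (exclusive)
    action: str          # hypothetical actions: "alert" or "block"

    def fires_on(self, event: AccessEvent) -> bool:
        """True when an in-scope access happens outside the permitted window."""
        in_scope = (
            event.user in self.users
            and event.resource in self.resources
            and event.command in self.commands
        )
        inside_window = self.allowed_from <= event.at < self.allowed_until
        return in_scope and not inside_window


def evaluate(rules, event):
    """Return the action of the first rule that fires, or "allow"."""
    for rule in rules:
        if rule.fires_on(event):
            return rule.action
    return "allow"


after_hours = Rule(
    name="privileged-read-after-hours",
    users=frozenset({"dba1"}),
    resources=frozenset({"/data/pii"}),
    commands=frozenset({"read"}),
    allowed_from=time(8, 0),
    allowed_until=time(18, 0),
    action="block",
)
print(evaluate([after_hours], AccessEvent("dba1", "/data/pii", time(23, 0), "read")))
```

Keeping the rule data separate from the evaluation loop makes it possible to manage policies centrally, away from the administrators being monitored.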
What applications are using the data?
Focus your resources on the unknown – unauthorized MapReduce jobs, YARN jobs
Monitor
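One simple way to "focus on the unknown" is to compare observed jobs against a list of authorized ones. This is a minimal sketch under the assumption that each job can be reduced to a (user, job name) pair; how those pairs are collected from the cluster is outside its scope, and the sample names are hypothetical.

```python
def unknown_jobs(observed, authorized):
    """Return observed (user, job_name) pairs that are not on the authorized list."""
    return sorted(set(observed) - set(authorized))


authorized = [("etl_svc", "nightly_load"), ("analyst1", "churn_model")]
observed = [
    ("etl_svc", "nightly_load"),
    ("analyst1", "churn_model"),
    ("analyst1", "export_all_customers"),  # not on the list
]
print(unknown_jobs(observed, authorized))
```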
Encrypt Data at Rest
• Encryption can provide Safe Harbor protection from breach disclosure in many states (consult your compliance team for details)
• Implement data protection for your database, Hadoop, and file system environments
  • Look for high performance encryption, access control and auditing
  • Data privacy for both online and backup environments
  • Unified policy and key management for centralized administration across multiple data servers
• Ideally, should be transparent to users, databases, applications, storage
  • No coding or changes to existing IT infrastructure
  • Protect data in any storage environment
  • User access to data same as before
• Centralized administration and Separation of Duties
  • Policy and Key management
  • Audit logs
  • High Availability
Data Obfuscation Terminology
Masking
• The ability to desensitize sensitive information and make it unreadable from its original form while preserving its format and referential integrity
• It is a one-way algorithm, i.e. masked data cannot be unmasked
• SDM – Static Data Masking
• DDM – Dynamic Data Masking
Tokenization
• The process of substituting a “token” which can be mapped to the original value
• Token is a non-sensitive equivalent which has no extrinsic value
• Must maintain a mapping between the tokens and the original values
Redaction
• The process of obscuring part of a text for security purposes.
• The ability to replace real data with substitute characters like (*)
Encryption
• The process of encoding data in such a way that only authorized individuals can read it by decrypting the encoded data with a key
• Format Preserving Encryption (FPE) is a special form of encryption
Examples
Original Value: 4536 6382 9896 5200
Masked Value: 4212 5454 6565 7780
Redacted Value: 4536 6382 **** ****
Token Value: ABCD GDIC JIJG VXYZ
Encrypted Value: 1@#43$%!xy1K2L4P
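The differences between redaction, masking, and tokenization are easier to see side by side. The sketch below uses only the Python standard library and is an illustration of the definitions above, not how any masking product works: the function names, the keyed-hash masking approach, and the in-memory vault are all assumptions. Encryption is left out because it needs a vetted cryptography library.

```python
import hashlib
import hmac
import secrets


def redact(card: str, visible: int = 8) -> str:
    """Redaction: keep the first `visible` digits, replace the rest with '*'."""
    out, seen = [], 0
    for ch in card:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen <= visible else "*")
        else:
            out.append(ch)  # keep separators so the format is preserved
    return "".join(out)


def mask(card: str, key: bytes) -> str:
    """Masking: one-way and format-preserving.

    Each digit is replaced by a digit derived from a keyed hash of the whole
    value, so the same input always yields the same output and there is no
    way to reverse it. Distinct inputs can occasionally collide.
    """
    digest = hmac.new(key, card.encode(), hashlib.sha256).digest()
    out, i = [], 0
    for ch in card:
        if ch.isdigit():
            out.append(str(digest[i % len(digest)] % 10))
            i += 1
        else:
            out.append(ch)
    return "".join(out)


class TokenVault:
    """Tokenization: random tokens plus a stored mapping back to the originals."""

    def __init__(self):
        self._to_token = {}
        self._to_value = {}

    def tokenize(self, value: str) -> str:
        if value not in self._to_token:
            token = secrets.token_hex(8)  # carries no information about value
            self._to_token[value] = token
            self._to_value[token] = value
        return self._to_token[value]

    def detokenize(self, token: str) -> str:
        return self._to_value[token]


card = "4536 6382 9896 5200"
vault = TokenVault()
print(redact(card))
print(mask(card, b"demo-key"))
print(vault.tokenize(card))
```

The vault is the sensitive component in tokenization: whoever can read the mapping can recover the originals, so it needs the same protection as the data itself.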
Diverse requirements for masking
Use Case: Business Benefits
1. Masking for databases
• Protect production and non-production environments
• Ensure only those with a valid purpose see sensitive data; prevent misuse by third parties
2. Masking for warehouses
• Protect data during the ETL process
• Deliver privacy to data warehouse environments
3. Masking for Hadoop systems
• Protect sensitive data moving into and out of Hadoop systems
• Build in privacy from the start
4. Masking for reports
• Protect sensitive data in reports without inhibiting business processes
• Distribute reports across teams to facilitate sharing without compromising privacy
5. Semantic masking for analytics
• Deliver privatized data to analytics platforms
• Facilitate analytics and support compliance
Sample Data Masking techniques
De-identify sensitive information with realistic but fictional data for testing purposes
Comprehensive data masking techniques to transform or de-identify data, including:
• String literal values
• Character substrings
• Random or sequential numbers
• Arithmetic expressions
• Concatenated expressions
• Date aging
• Lookup values
  o First and Last Names
  o Addresses
• Intelligent Transformers
  o National ID #s
  o Credit Card #
  o Email Address
• Key propagation
• AES-256 Format Preserving Encryption
Example: JASON MICHAELS → ROBERT SMITH
Personal Info Table
PersNbr | FirstName | LastName
08054 | Alice | Bennett
19101 | Carl | Davis
27645 | Elliot | Flynn
Account Table
SSN | FirstName | LastName
10000 | Jeanne | Renoir
10001 | Claude | Monet
10002 | Pablo | Picasso
Payment Table
SSN | Item # | Order Date
10002 | 80-2382 | 06/20/2004
10002 | 86-4538 | 10/10/2005
Referential integrity is maintained with key propagation
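Key propagation can be sketched as one shared mapping from original keys to masked keys, applied to every table that carries the key. This is an illustrative sketch using sequential replacement numbers; the function name, the `ssn` field, and the sample rows are hypothetical and not taken from any product.

```python
import itertools


def propagate_keys(parent_rows, child_rows, key="ssn", start=10000):
    """Mask `key` in both tables with the same mapping so joins still work.

    Rows are dicts; the inputs are not modified.
    """
    counter = itertools.count(start)
    mapping = {}

    def masked(value):
        # The first time a key is seen it gets the next sequential number;
        # every later occurrence, in any table, reuses that number.
        if value not in mapping:
            mapping[value] = str(next(counter))
        return mapping[value]

    masked_parent = [{**row, key: masked(row[key])} for row in parent_rows]
    masked_child = [{**row, key: masked(row[key])} for row in child_rows]
    return masked_parent, masked_child


accounts = [
    {"ssn": "111-22-3333", "name": "A"},
    {"ssn": "444-55-6666", "name": "B"},
]
payments = [
    {"ssn": "444-55-6666", "item": "80-2382"},
    {"ssn": "444-55-6666", "item": "86-4538"},
]
masked_accounts, masked_payments = propagate_keys(accounts, payments)
print(masked_accounts)
print(masked_payments)
```

Because both payment rows still point at the same masked account key, a join between the two masked tables returns the same row pairs as before masking.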
Putting it all together: Sample Solution Architecture
Components and capabilities:
1. Information Catalogue: Define privacy policies and share
2. Sensitive Data Discovery: Discover and classify sensitive data
3. Masking and Redaction: Data masking and document redaction
4. Hadoop Activity Monitoring (HAM): Monitor and audit Big Data (Hadoop) activity
Diagram labels:
• Data sources: databases, files, documents
• Masking: files to masked files
• Redaction: documents to redacted documents
• Big Data (HDFS) Loader
• Hadoop cluster: Big Data Processing (MapReduce), Hadoop masking, output files
• Information Catalog: policies and value, business policies
• Discovery: sensitive data discovery
• Activity Monitoring: monitor & audit Big Data access (HDFS, Hive, HBase, MapReduce, HUE, etc.)
• Real-time alerting and SIEM (Security Information and Event Management) integration
Best Practices: Build the foundation
First, know your data
1. Understand the data source, its “trust factor”, the data context and meaning, and how it maps to other enterprise data sources.
2. Determine whether to operationalize (and retain) specific data sources, and in which zone to land the data, e.g. Hadoop, Data Warehouse, leave in place, etc.
Steps to Assess and Protect:
1. Conduct a Privacy Impact Assessment and a Security Risk Assessment.
2. Inventory and classify sensitive data.
3. Identify and match against legal, contractual, and organizational data protection requirements with assistance from your security, privacy, and compliance organization.
4. Identify protection standards for each classification. For example, all credit card numbers must be encrypted in accordance with PCI DSS.
5. Identify the gaps and set up remediation plans.
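Steps 2, 4, and 5 above can be sketched as a small pipeline: classify sample values, look up the required protection for each classification, and report where the applied protection falls short. This is a rough sketch, not a complete discovery tool: the patterns are deliberately simple, and apart from the credit card encryption example taken from the slide, the `REQUIRED` standards and all names are assumptions.

```python
import re

# Simplified patterns; real discovery tools use far richer detection.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

# Hypothetical protection standard per classification.
REQUIRED = {"credit_card": "encrypt", "national_id": "mask", "email": "mask"}


def luhn_ok(number: str) -> bool:
    """Luhn checksum, used to cut false positives on card-like digit runs."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def classify(text: str) -> set:
    """Return the set of sensitive-data classes detected in one value."""
    found = set()
    if any(luhn_ok(m.group()) for m in CARD_RE.finditer(text)):
        found.add("credit_card")
    if SSN_RE.search(text):
        found.add("national_id")
    if EMAIL_RE.search(text):
        found.add("email")
    return found


def find_gaps(columns: dict, applied: dict) -> list:
    """List (column, class, required protection) where protection is missing.

    `columns` maps a column name to sample values; `applied` maps a column
    name to the protection already in place.
    """
    gaps = []
    for name, samples in columns.items():
        classes = set()
        for sample in samples:
            classes |= classify(sample)
        for cls in classes:
            if applied.get(name) != REQUIRED[cls]:
                gaps.append((name, cls, REQUIRED[cls]))
    return sorted(gaps)


columns = {
    "notes": ["call 555"],
    "pan": ["4111 1111 1111 1111"],  # well-known test card number
    "contact": ["a@example.com"],
}
applied = {"contact": "mask"}
print(find_gaps(columns, applied))
```

The output of `find_gaps` is the input to a remediation plan: each tuple names a column, what was found in it, and the protection it still needs.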
A recommended approach for Big Data: Activity Monitoring
1. Identify users and classes of users: “privileged” users, data scientists. Who is allowed to access sensitive data?
• Validate with activity monitoring
2. Identify the applications, jobs, ad-hoc analysis
• Validate with activity monitoring
3. When possible, identify, encrypt and mask sensitive data before it enters the cluster, and identify a specific directory location in the cluster for that data. Put tighter monitoring controls around that data.
4. Look at exceptions: permission exceptions, other operational errors. Use machine learning to identify patterns of suspicious activity.
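Step 4 can start much simpler than machine learning. The sketch below swaps in a plain statistical threshold: it flags users whose count of permission exceptions is far above the median across users. It is a stand-in for illustration, not a learned model, and the event fields `user` and `status` and the `PERMISSION_DENIED` value are assumptions about how audit events might be shaped.

```python
from collections import Counter
from statistics import median


def suspicious_users(events, factor=3.0, min_count=5):
    """Flag users with unusually many permission exceptions.

    A user is flagged when their denied count is at least `min_count`
    and more than `factor` times the median denied count.
    """
    denied = Counter(
        e["user"] for e in events if e["status"] == "PERMISSION_DENIED"
    )
    if not denied:
        return []
    baseline = median(denied.values())
    return sorted(
        user
        for user, count in denied.items()
        if count >= min_count and count > factor * baseline
    )


events = (
    [{"user": "alice", "status": "PERMISSION_DENIED"}]
    + [{"user": "bob", "status": "PERMISSION_DENIED"}] * 2
    + [{"user": "carol", "status": "PERMISSION_DENIED"}]
    + [{"user": "carol", "status": "OK"}] * 50
    + [{"user": "mallory", "status": "PERMISSION_DENIED"}] * 20
)
print(suspicious_users(events))
```

The `min_count` floor keeps the check quiet on small clusters where one or two denials would otherwise look like an outlier.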
Agenda
• Introduction
• What is Big Data?
• Security & Privacy considerations
• Architecture, Technical controls, best practices
• Wrap up
Summary: Keys to Success
1. Manage security and privacy at point of impact
2. Use multiple complementary approaches to secure critical data.
3. Understand that different types of data have different protection requirements
4. Use a holistic approach to safeguarding information no matter where it is. Include the following items:
   1. Understand and document where the data exists along with the exposure risk.
   2. Secure and continuously monitor access to data.
   3. Safeguard both structured and unstructured data
   4. Protect sandboxes and non-production environments
   5. Demonstrate compliance to pass audits
The Pragmatist’s Best Practices Approach to Big Data Security & Privacy
Incremental
• Knowing all your sensitive data is a long journey and you can’t afford to wait
• Identify and secure data incrementally so that projects can start
Collaborative
• To gain business value start on highest risk projects right away
• Have business analysts work with data analysts
Iterative
• Put a stake in the ground to progress the project and get results
• Review and make improvements
Think big, start small, deliver fast
Speak One Language, Share One Framework
www.ibm.com/security
© Copyright IBM Corporation 2014. All rights reserved. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in these materials may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. IBM, the IBM logo, and other IBM products and services are trademarks of the International Business Machines Corporation, in the United States, other countries or both. Other company, product, or service names may be trademarks or service marks of others.
Statement of Good Security Practices: IT system security involves protecting systems and information through prevention, detection and response to improper access from within and outside your enterprise. Improper access can result in information being altered, destroyed or misappropriated or can result in damage to or misuse of your systems, including to attack others. No IT system or product should be considered completely secure and no single product or security measure can be completely effective in preventing improper access. IBM systems and products are designed to be part of a comprehensive security approach, which will necessarily involve additional operational procedures, and may require other systems, products or services to be most effective. IBM DOES NOT WARRANT THAT SYSTEMS AND PRODUCTS ARE IMMUNE FROM THE MALICIOUS OR ILLEGAL CONDUCT OF ANY PARTY.
Resources
• Webcast “How Vulnerable is your Critical Data?” 6/23/15 8am PT: http://securityintelligence.com/events/how-vulnerable-is-your-critical-data
• IBM Security: http://www-03.ibm.com/security/
• IBM Data Security & Privacy: http://www-03.ibm.com/software/products/en/category/SWP23
• Top 10 Data Privacy Tips blog: http://www.securityintelligence.com/find-the-map-locate-the-treasure-and-keep-the-pirates-away-10-data-privacy-best-practices
• Guardium Activity Monitoring for Hadoop info page: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Wf32fc3a2c8cb_4b9c_83e4_09b3c6f60e46/page/Hadoop
• IBM QRadar Security Intelligence: http://www-03.ibm.com/software/products/en/qradar-siem
• IBM Redbook “Information Governance Principles and Practices for a Big Data Landscape”: https://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/sg248165.html?Open
• Top Tips for Securing Big Data Environments: www.ibm.com/services/forms/signup.do?source=sw-infomgt&S_PKG=500031830&S_CMP=Guardium_big_data_ebook