Apache Hadoop India Summit 2011 talk "Informatica and Big Data" by Sanjeev Kumar
1
Informatica & Big Data
Sanjeev Kumar, VP & MD, Informatica India
Apache Hadoop India Summit 2011
2
Agenda
• Big Data
• Big Data in Enterprise
• Informatica & Data
• Informatica & Big Data
3
Source: An IDC White Paper - sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009.
Why “Big Data” Now? Exploding Data Volumes
Relational
Complex, Unstructured
• 2,500 exabytes of new information in 2012, with the Internet as the primary driver
• The digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year
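The figures above mix exabytes, petabytes, and zettabytes. As a quick sanity check, the sketch below converts them to a common unit. It assumes decimal (SI) prefixes, which the slide does not specify, and the helper name `to_zettabytes` is illustrative rather than something from the talk.

```python
# Decimal (SI) byte units; an assumption, since the slide does not name a convention.
UNITS = {"PB": 10**15, "EB": 10**18, "ZB": 10**21}

def to_zettabytes(value, unit):
    """Convert a quantity in the given unit to zettabytes."""
    return value * UNITS[unit] / UNITS["ZB"]

last_year = to_zettabytes(800_000, "PB")  # 800K petabytes
this_year = to_zettabytes(1.2, "ZB")      # projected size this year
in_2012 = to_zettabytes(2_500, "EB")      # new information in 2012

# Growth implied by going from last year's size to this year's projection.
implied_growth = this_year / last_year - 1

print(f"last year: {last_year} ZB, this year: {this_year} ZB, 2012: {in_2012} ZB")
print(f"implied growth this year: {implied_growth:.0%}")
```

Under these assumptions, 800K petabytes is 0.8 zettabytes, so reaching 1.2 zettabytes corresponds to roughly 50% growth, and 2,500 exabytes is 2.5 zettabytes.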
4
Why Now? Exploding Data Volumes
• Explosion in user-generated content
  • e.g. blogs, Twitter, Facebook, etc.
• Proliferation of web-connected devices
  • Smartphone interactions with the web
• Increased consumption of digital content
  • Netflix, Hulu, Pandora, etc.
• Internet of things
  • Smart grids and smart meters
  • Machine-generated data via the web
5
Why Now? New Apps/Use-cases
• Analyze customer/market sentiment
  • Text analytics on social media, blogs
• Achieve operational efficiency
  • e.g. analyze CDRs to optimize cell tower placements
• Make recommendations
  • Data mining on click-stream, purchase history
• Predict the future
  • e.g. Flightcast predicts flight delays
6
Big Data Challenges
• Storage
  • Cost-effective scalability: to multi-terabytes and petabytes
  • Non-traditional data models: complex, semi-structured data
• Processing
  • Data mining, collaborative filtering for structured data
  • Text analytics, classification, etc. for unstructured data
• Regulatory Compliance
  • Data Privacy / Masking
  • Data Archival
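To make the Data Privacy / Masking item concrete, here is a minimal sketch of deterministic masking using only the Python standard library. It illustrates the general idea and is not Informatica's masking implementation; the function name `mask_value`, the key, and the sample record are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice this would come from a key store.
SECRET_KEY = b"example-key-not-for-production"

def mask_value(value: str, length: int = 12) -> str:
    """Replace a sensitive value with a keyed, deterministic token.

    The same input always maps to the same token, so joins and
    group-bys on the masked column still line up.
    """
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]

# Hypothetical record with one sensitive field.
record = {"caller": "+91-98450-00000", "duration_sec": 42}
masked = {**record, "caller": mask_value(record["caller"])}
print(masked)
```

A keyed hash is used here instead of a plain hash so that tokens cannot be reproduced without the key; other masking styles (format-preserving, random substitution) trade off differently.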
7
Addressing Big Data Challenges
• Storage
  • Parallel Databases
    • Greenplum (EMC), Vertica, AsterData
  • Distributed Key/Value Stores
    • HBase, Google’s BigTable, Amazon’s SimpleDB
  • Distributed File Systems
    • HDFS, GFS, ParAccel
• Analytics
  • SQL with extensions
  • MapReduce
  • Dataflow languages: Pig, Sawzall, etc.
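As an illustration of the MapReduce model listed under Analytics, the sketch below runs the map, shuffle, and reduce phases in a single Python process. It is not Hadoop code. The record layout and the names `map_cdr`, `reduce_counts`, and `run_mapreduce` are assumptions made for this example, which counts call detail records (CDRs) per cell tower.

```python
from collections import defaultdict

def map_cdr(record):
    """Map phase: emit (tower_id, 1) for each call detail record."""
    tower_id, _caller, _duration = record
    yield tower_id, 1

def reduce_counts(key, values):
    """Reduce phase: sum the counts emitted for one tower."""
    return key, sum(values)

def run_mapreduce(records, mapper, reducer):
    # Shuffle: group every mapped value under its key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Each key's group is reduced independently; a cluster would
    # spread this work across machines.
    return dict(reducer(k, v) for k, v in sorted(groups.items()))

# Hypothetical CDRs: (tower_id, caller, duration in seconds)
cdrs = [("T1", "A", 30), ("T2", "B", 45), ("T1", "C", 10), ("T1", "A", 5)]
calls_per_tower = run_mapreduce(cdrs, map_cdr, reduce_counts)
print(calls_per_tower)
```

Because the mapper and reducer only see one record or one key at a time, the same two functions could in principle be distributed over many nodes, which is the property that frameworks such as Hadoop exploit.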
8
Hadoop Technology Stack
HDFS
HBase
Map/Reduce
Pig, Hive, Cascading
ZooKeeper
9
Hadoop Momentum
Search Volume Index
News Reference Volume
Job Trends from Indeed.com
10
Big Data in the Enterprise – Hadoop Usage
11
Big Data in the Enterprise
Case Studies: Hadoop World 2009
• Yahoo!: Social Graph Analysis
• VISA: Large Scale Transaction Analysis
• China Mobile: Data Mining Platform for Telecom Industry
• JP Morgan Chase: Data Processing for Financial Services
• eHarmony: Matchmaking in the Hadoop Cloud
• Rackspace: Cross Data Center Log Processing
• Visible Technologies: Real-Time Business Intelligence
• Booz Allen Hamilton: Protein Alignment using Hadoop
Slides and Videos at http://www.cloudera.com/hadoop-world-nyc
12
Big Data in the Enterprise
Case Studies: Hadoop World 2010
• eBay: Hadoop at eBay
• Twitter: The Hadoop Ecosystem at Twitter
• General Electric: Sentiment Analysis powered by Hadoop
• Yale University: MapReduce and Parallel Database Systems
• AOL: AOL’s Data Layer
• Facebook: HBase in Production
• Bank of America: The Business of Big Data
• StumbleUpon: Mixing Real-Time and Batch Processing
• Raytheon: SHARD: Storing and Querying Large-Scale Data
More info at http://www.cloudera.com/company/press-center/hadoop-world-nyc/
13
Agenda
• Big Data
• Big Data in Enterprise
• Informatica & Data
• Informatica & Big Data
14
Informatica – Our Singular Mission
Enabling The Information Economy
We enable organizations to gain a competitive advantage from all their information assets to drive their top business imperatives
15
Informatica – What We Do
Comprehensive, Unified, Open and Economical platform
Application, Database, Partner Data (SWIFT, NACHA, HIPAA, …), Cloud Computing, Unstructured
• Data Warehouse
• Data Migration
• Test Data Management & Archiving
• Master Data Management
• Data Synchronization
• B2B Data Exchange
• Data Consolidation
• Complex Event Processing
• Ultra Messaging
16
Informatica & Data
Verbs on Data – We do things to data!
INFA = Data + [ Archival | As a Service | Cleansing | Clustering | Consolidation |
Conversion | De-duping | Exchange | Extraction | Federation |
Hub | Identity | Integration | Life-cycle Management |
Loading | Masking | Mastering | Matching | Migration | On Demand |
Privacy | Profiling | Provisioning | Quality | Quality Assessment |
Registry | Replication | Retirement | Services | Stewardship |
Sub-setting | Synchronization | Test Management | Transformation |
Validation | Virtualization | Warehousing ]
17
Informatica & Big Data
• HDFS as a source and a target - Enable universal data connectivity for Hadoop developers
• Enable Hadoop developers to leverage prebuilt Data Transformation and Data Quality logic
• Lower the barrier to entry for Hadoop by using Informatica Developer as a development tool
• Support virtualized access to data split across HDFS and (relational) data-warehouses
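The last bullet, virtualized access to data split across HDFS and a relational warehouse, can be pictured as a single query over both stores. The sketch below is only a toy model of that idea: an in-memory SQLite database stands in for the warehouse and a CSV string stands in for a file on HDFS. The table, file contents, and function name are hypothetical, and nothing here reflects Informatica's actual product APIs.

```python
import csv
import io
import sqlite3

# Stand-in for a weblog file stored on HDFS (hypothetical contents).
HDFS_FILE = "customer_id,page\n1,/home\n2,/pricing\n1,/pricing\n"

# Stand-in for a relational data warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
)

def page_views_by_customer():
    """Join file-based clicks with warehouse customers behind one call."""
    # Look up customer names from the relational side.
    names = dict(warehouse.execute("SELECT id, name FROM customers"))
    # Stream the file-based side and count views per customer name.
    views = {}
    for row in csv.DictReader(io.StringIO(HDFS_FILE)):
        name = names[int(row["customer_id"])]
        views[name] = views.get(name, 0) + 1
    return views

result = page_views_by_customer()
print(result)
```

The caller sees one function and one result, without needing to know which half of the data lives in files and which in tables; that separation is the point of a virtualized access layer.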
18
Informatica & Hadoop – Big Picture
• Hadoop Cluster: HDFS Data Node, HDFS Name Node, HDFS Job Tracker
• Weblogs, Enterprise Applications, Databases, Semi-structured / Un-structured
• BI, DW/DM
• Metadata Repository
• Graphical IDE for Hadoop Development
• Enterprise Connectivity for Hadoop programs
• Transformation Engine for custom data processing
19