1
Why Data is Drowning the (IT) World?
Sanjeev Kumar, VP & MD, Informatica India
Infovision 2012 Summit, October 2012
3
Agenda
• Why the Data Deluge?
• Trends Affecting Data Growth
• New Use-cases Enabled by Big Data
• Trends Underlying Big Data
• Building-blocks for Managing Big Data
• Q&A
4
Data is the New Plastic
10
Where Are We? Computing Circa 2012!
• Six decades into the Computer Revolution
• Four decades since the invention of Microprocessor
• Two decades into the rise of modern Internet
• Two billion people using the broadband Internet
Major businesses and industries running on software and delivered as online services*
*“Why Software Is Eating the World”, Marc Andreessen, WSJ, Aug 2011
11
Trends: Exploding Data Volumes, “Big Data”
• 2,500 exabytes of new information in 2012, with the Internet as primary driver
• Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 zettabytes this year
Data types: Relational; Complex, Unstructured
Source: An IDC White Paper, sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009.
Kilo – Mega – Giga – Tera – Peta – Exa – Zetta – Yotta
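As a quick check on the figures on this slide, each prefix in the ladder is a factor of 1,000 larger than the previous one. The sketch below assumes decimal (SI) prefixes; the helper name `convert` is illustrative and not from the talk.

```python
# Decimal (SI) prefixes in ascending order; each step is a factor of 1,000.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]

def convert(value, from_prefix, to_prefix):
    """Convert a byte count between two SI prefixes (hypothetical helper)."""
    steps = PREFIXES.index(from_prefix) - PREFIXES.index(to_prefix)
    if steps >= 0:
        return value * (1000 ** steps)
    return value / (1000 ** -steps)

# 800K petabytes and 2,500 exabytes, both expressed in zettabytes.
print(convert(800_000, "peta", "zetta"))
print(convert(2_500, "exa", "zetta"))
```

Expressed this way, 800K petabytes is a little under one zettabyte, which is why the next step on the slide is quoted in zettabytes.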
18
Big Data Buzz!
• 16 Big Data “V”s; original 3: Volume, Variety & Velocity
• 120+ Twitter accounts relating to Big Data
• 9,000 job search results for “data scientists”
• 70,000 Wikipedia “big data” hits per month
• 2,000,000 PDFs from search on “big data white paper”
• 112,000,000 blog posts discussing big data
• 1,350,000,000 Google results for “What is big data?”
Source: IBM, 2012
19
Why Now? Exploding Data Volumes
Internet of things
Increased consumption of digital content
Explosion in user generated content
Proliferation of web connected devices
20
Trends: Changing Data Economics
Low ROB
Return on Byte = value to be extracted from that byte / cost of storing that byte.
High ROB
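The Return on Byte ratio can be worked through with a few lines of Python. This is only a sketch: the function name and the per-byte figures are hypothetical, chosen to show how the same storage cost yields a high or low ROB depending on the value extracted.

```python
def return_on_byte(value_extracted, storage_cost):
    """Return on Byte = value extracted from a byte / cost of storing that byte."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value_extracted / storage_cost

# Hypothetical figures, in the same currency unit per byte.
dataset_a_rob = return_on_byte(value_extracted=8.0, storage_cost=2.0)
dataset_b_rob = return_on_byte(value_extracted=0.5, storage_cost=2.0)

print(dataset_a_rob, dataset_b_rob)
```

Lowering the storage cost raises the ratio for both datasets, which is one way to read the slide's point about changing data economics.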
21
Trends: Data Seen as a Strategic Asset
• Companies leveraging data assets to:
  • Create new and differentiated products
    • Product recommendation engines
  • Increase revenues
    • Optimize ad placement to improve click-through
  • Improve customer satisfaction / retention
    • Analyze CDRs for dropped calls
“The sexy job in the next ten years will be statisticians. The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that’s going to be a hugely important skill.” Hal Varian, Chief Economist, Google
22
Big Data in the Enterprise
26
Why Now? Big Data Use-cases – User Behavior
• Location & Proximity Tracking
  • GPS in operational apps, security analysis, navigation & social media
  • New business opportunities for sales and services in proximity
• Ad Tracking
  • Dynamic changes in ad placement, color, size and wording
  • Improved click-through behavior
• Social CRM
  • Text analytics on huge array of unstructured social media
  • KPIs: share of voice, audience engagement, conversation reach, …
• Causal Factor Discovery in Retail
  • Deviations based on competition, weather, promos, holidays, events
30
Why Now? “Hadoop-able” Use-cases – Sensors
• Building Sensors
  • Temperature, humidity, vibration and noise
  • Energy usage, security violations, failures in a/c, heat, plumbing
• In-flight Aircraft Sensors
  • Variables on engines, hydraulics, fuel & electrical systems
  • Real-time adaptive control, fuel usage, part failure prediction
• Smart Utility Meters – Electric Grid
  • One read-out per second per meter across entire customer base
  • Dynamic load balancing on grid, failure response, adaptive pricing
• Mobile Cell Tower Networks
  • Analyze call detail records (CDRs) to optimize cell tower placement
  • Improved user experience and network monetization
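To get a feel for the smart-meter case, the arithmetic of one read-out per second per meter can be sketched as follows. The customer base of one million meters is an assumed figure for illustration, not a number from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def readings_per_day(meter_count, reads_per_second=1):
    """Total meter read-outs collected in one day."""
    return meter_count * reads_per_second * SECONDS_PER_DAY

# Assumed customer base of one million meters (hypothetical).
print(readings_per_day(1_000_000))
```

Even before counting the bytes per reading, the row count alone grows by tens of billions per day at this scale.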
34
“Hadoop-able” Use-cases – Computing Deltas
• Commercial Seed Gene Sequencing
  • Analyzing the sequence, identifying genes and gene families
  • Baseline reference for the larger cotton crop genome
• Satellite Image Comparison
  • Overlay of images to create “hot spot” maps to show differences
  • Construction, destruction, changes due to disasters, encroachment
• CAT Scan Comparison
  • Images taken as “slices” of human body
  • Automatic diagnosis of medical issues and their prevalence
• Document Similarity Testing
  • Latent semantic analysis: “documents that agree with my doc”
  • Threat discovery, sentiment analysis and opinion polls
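The document similarity idea can be illustrated with a much simpler technique than latent semantic analysis: cosine similarity over raw word counts. The sketch below is a stand-in, not LSA itself (LSA would additionally reduce the term space with a matrix factorization), and the function name is illustrative.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two documents using bag-of-words counts."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Dot product over the words the two documents share.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

my_doc = "big data is growing fast"
print(cosine_similarity(my_doc, "data is growing very fast"))
print(cosine_similarity(my_doc, "cotton crop genome"))
```

Documents that "agree with my doc" score closer to 1, and documents with no words in common score 0.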
35
Agenda
• Why the Data Deluge?
• Trends Affecting Data Growth
• New Use-cases Enabled by Big Data
• Trends Underlying Big Data
• Building-blocks for Managing Big Data
• Q&A
36
Big Data: Confluence of Big Transaction, Big Interaction and Big Data Processing
BIG TRANSACTION DATA
• Online Transaction Processing (OLTP)
• Online Analytical Processing (OLAP) & DW Appliances
BIG INTERACTION DATA
• Social Media Data
• Device Sensor Data
• Scientific, genomic
• Machine/Device
• Call detail records, image, click stream data
BIG DATA PROCESSING
37
Big Transaction Data: OLTP and Analytic Databases
BIG TRANSACTION DATA
• Online Transaction Processing (OLTP): Oracle, DB2, Britton-Lee, Ingres, Informix, Sybase, SQL Server
• Online Analytical Processing (OLAP) & DW Appliances: Teradata, Red Brick, Essbase, Sybase IQ, Netezza, Greenplum, DATAllegro, Aster Data, Vertica, ParAccel, HANA
38
Big Transaction Data: Changing Economics of Computing From Buy To Rent
• HR Application
• CRM Application
• Mainframe Custom Application
• Custom Applications
39
Big Interaction Data: Changing Role Of Computing From Transactions to Interactions
BIG INTERACTION DATA
• Social Media Data
• Device Sensor Data
• Clickstream
• Image/Text
• Scientific
  • Genomic/Pharma
  • Medical
• Machine/Device
  • Sensors/Meters/RFID Tags
  • CDR/Mobile
40
Big Interaction Data: From Operational Efficiency To Organizational Effectiveness
Relational Transactions (1970 – Current)
• Business Management
  • Business Analysis
  • Operational Automation
Social Interactions (2008 – Current)
• Brand Management
  • Sentiment Analysis
  • Proactive Customer Engagement
41
Big Interaction Data: How Do You Leverage Device Sensor Data?
• Geo Encoding
• Cell-phone Towers
• Medical Sensors
• RFID Tags
• Edge Networks
42
Big Data Processing: Highly Scalable Processing Of All Data
BIG TRANSACTION DATA
• Online Transaction Processing (OLTP)
• Online Analytical Processing (OLAP) & DW Appliances
BIG INTERACTION DATA
• Social Media Data
• Device Sensor Data
• Scientific, genomic
• Machine/Device
• Call detail records, image, click stream data
BIG DATA PROCESSING
43
Big Data Processing: What is Hadoop?
• PARALLEL
• PERSISTENCE
• SCRIPTING
• SQL QUERY
44
Big Data Processing: What does Hadoop do?
• Cost effective scalability
  • Scale out on commodity hardware
• Support for processing all data types
  • Structured, semi-structured and unstructured data
• Extensibility
  • Open APIs to implement custom data processing logic
• Hadoop Challenges
  • Data movement into/out of Hadoop / HDFS
  • Requires specialized development skills
    • Java, Hive, Pig etc.
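The kind of custom processing logic referred to above is typically written against Hadoop's MapReduce programming model. The sketch below imitates that model in plain, single-process Python for a word count; it does not use Hadoop's actual APIs, and all function names are illustrative.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

print(word_count(["big data", "Big deal"]))
```

On a real cluster the map and reduce calls run in parallel on different machines, with the framework handling the shuffle between them; the "specialized development skills" challenge is largely about expressing a problem in this shape.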
45
Ingest Data Into HDFS
Support for over 100 different data sources
Native HDFS source and target support
Integrated development environment with metadata and preview support
Perform any preprocessing needed before ingestion
46
Design and Execute Data Integration Logic on Hadoop
Design integration logic for Hadoop in a graphical and metadata driven environment
Configure where the integration logic should run – Hadoop or Native
47
Design and Execute Data Quality on Hadoop: Big Data Cleansing, Dedup, Unstructured Parsing
• Address Validation: address validation and geocoding enrichment across 260 countries
• Standardize: standardization and reference data management
• Parsing: parsing of unstructured data/text fields for all types of data (customer/product/social/logs)
• Matching: probabilistic or deterministic matching
DQ logic pushed down/run natively on Hadoop
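The difference between deterministic and probabilistic matching can be sketched in a few lines. This is not Informatica's matching logic: the fuzzy variant uses a string similarity score from Python's standard `difflib` as a simple stand-in for a probabilistic model, and the 0.85 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Standardize case and whitespace before comparing."""
    return " ".join(name.lower().split())

def deterministic_match(a, b):
    """Deterministic: records match only if the normalized values are equal."""
    return normalize(a) == normalize(b)

def fuzzy_match(a, b, threshold=0.85):
    """Similarity-based: records match if the score clears a threshold (assumed 0.85)."""
    score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return score >= threshold

print(deterministic_match("Jon Smith", "John Smith"))
print(fuzzy_match("Jon Smith", "John Smith"))
```

Deterministic rules are easy to explain and audit, while similarity-based rules catch spelling variants at the cost of occasional false matches; de-duplication jobs often combine both.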
48
Extract data from HDFS and Hive
Extract from HDFS as a native source
Extract from Hive as a native source
Perform any post processing needed after extraction
Persist and write Hadoop data into DW, HDFS or any target systems
49
Processing Big Data: What is missing?
• Support for graph/networked data
  • How does one visualize complex relationships?
• Data with dynamic schemas
  • Do the current patterns scale for a very large number of columns?
  • Are mappings the right paradigm?
• Ability to extract entities from unstructured data
50
References
• Why Software Is Eating the World
  • Marc Andreessen, WSJ, Aug 2011
• Evolving Role of EDW in Era of Big Data Analytics
  • Ralph Kimball, Kimball Group, 2011
• Data Scientist: Sexiest Job of the 21st Century
  • Thomas H. Davenport & D.J. Patil, HBR, Sept 2012
• Newly Emerging Best Practices for Big Data
  • Ralph Kimball, Kimball Group, Oct 2012
51
Questions
52
Informatica & Data
Verbs on Data – We do things to data!
INFA = Data + [ Archival | As a Service | Cleansing | Clustering | Consolidation | Conversion | De-duping | Exchange | Extraction | Federation | Hub | Identity | Integration | Life-cycle Management | Loading | Masking | Mastering | Matching | Migration | On Demand | Privacy | Profiling | Provisioning | Quality | Quality Assessment | Registry | Replication | Retirement | Services | Stewardship | Sub-setting | Synchronization | Test Management | Transformation | Validation | Virtualization | Warehousing ]