big data université paris 13
TRANSCRIPT
BIG DATA
Philippe Julio – Big Data Consulting Practice Manager
AGENDA
BIG DATA
• Who is KEYRUS?
• Big Data & Analytics: what is it?
• Positioning
• Software & Tools
• Technical Architecture
• Value Proposition
A UNIQUE VALUE PROPOSITION
KEYRUS
A GROUP STRONG AND AGILE
• €153m 2012 revenues
• 1,650 employees
• The infrastructures and processes (quality, HR...) of a large professional services group
• Simple and formalized governance to maintain agility at all times
• A customer-focused decision center
• Listed on NYSE-Euronext Paris
SPECIALIST IN ORGANIZATIONS PERFORMANCE
• 350 large accounts* & LME (*including 80 Global Fortune 500)
• 3,800 SME customers
• An ability to act on performance management strategy, systems and organizations
• Different Business Units to serve different types of clients (large corporations, mid-market, and SMEs)
• Functional, Industry and Technology skills
Revenue by Sector
• Industries: 31%
• Banking - Insurance: 19%
• Services - Distribution: 16%
• Public Services: 14%
• Utilities: 12%
• Telecom: 8%
OUR VALUES FOR THE BENEFIT OF OUR CUSTOMERS
• Entrepreneurship
• Customer proximity
• Building our brand on quality of service
• A culture of innovation that defines how we operate and is also part of our value proposition
• Diversity as a key component of our HR policy
AN INTERNATIONAL DIMENSION
• 12 countries on 4 continents
• Expertise in deploying international projects
• Nearshore & offshore capacities
• Belgium, Brazil, Canada, China, Spain, France, Mauritius, Israel, Luxembourg, Switzerland, Tunisia, USA
© Keyrus - All rights reserved
BIG DATA & ANALYTICS, WHAT IS IT ?
• 5 Billion: # of cell phone users worldwide in 2010
• 2 Billion: # of Internet users worldwide in 2010
• 10x: growth in digital data every 5 years
• 35 ZB: by 2020, the Digital Universe will be 44 times as big as it was in 2009
• 30 Billion: pieces of content shared on Facebook every month
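The two growth figures on this slide (10x every 5 years, 44x between 2009 and 2020) can be turned into yearly rates for comparison. The following is a minimal sketch that assumes constant compound growth, which the slide does not state; the function name is illustrative only.

```python
# Implied compound annual growth for the two growth figures quoted on the slide.
# Assumption: growth is constant from year to year (not stated in the source).

def annual_growth_factor(total_factor: float, years: float) -> float:
    """Return the yearly multiplier that compounds to total_factor after `years`."""
    return total_factor ** (1.0 / years)

# "10x growth in digital data every 5 years"
ten_x_in_five = annual_growth_factor(10, 5)

# "44 times as big in 2020 as it was in 2009", i.e. over 11 years
forty_four_x = annual_growth_factor(44, 2020 - 2009)

print(f"10x over 5 years  -> x{ten_x_in_five:.2f} per year")
print(f"44x over 11 years -> x{forty_four_x:.2f} per year")
```

Under that assumption the two statements describe broadly similar dynamics: data volume grows by roughly half again every year.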
LARGE HADRON COLLIDER OF CERN (SWITZERLAND)
BIG DATA
Large Hadron Collider (LHC) of CERN: 15 PB of data per year
BIG DATA ?
NOT ONLY DATA VOLUME
• Volume: Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information. (TB, records, transactions, tables, files)
• Velocity: Often time-sensitive, big data must be used as it streams into the enterprise in order to maximize its value to the business. (Batch, near time, real time, streams)
• Variety: Big data extends beyond structured data, including semi-structured and unstructured data of all varieties: text, audio, video, click streams, log files and more. (Multi-structured: unstructured, semi-structured, structured)
• Value: innovate new business models, replace/support human decisions, take custom actions, discover needs, improve performance, create transparency
WHAT DOES ANALYTICS MEAN?
FOR SALES MANAGEMENT BY EXAMPLE
• Analysis
  – Quarterly sales reporting
  – Sales growth plan
• Simulation & Forecast
  – Run alternative sales scenarios to identify the best product mix for next quarter
  – Run simulations to determine the ideal number of sales professionals to assign to a particular new territory
• Strategy
  – Forecast vs. results analysis
  – Predictable patterns
  – Decision-making
Past: What happened and when? How and why did it happen?
Future: What will happen? How can it be done better?
Facts / Interpretation
ANALYZING DATA
SKILL OF THE FUTURE
McKinsey: by 2018, the United States alone could face a shortage of:
• 140,000-190,000 people with deep analytical skills
• 1.5M managers/analysts with the know-how to use the analysis of big data to make effective decisions
• www.mckinsey.com/mgi/publications/big_data/
Stacy Collett, Computerworld, August 23, 2010
DATA SCIENTIST, STATISTICIAN
WHAT IS THE DIFFERENCE
Data Scientist
• Working on global data
• Modeling complex business problems
• Using Big Data software packages (Mahout, Lucene…)
• Discovering business insights
• Identifying opportunities
• Skills for coding, integrating and preparing large, varied data sets
• Advanced analytics and modeling skills to reveal and understand hidden relationships
• Business knowledge and communication skills to present results
Statistician
• Working on data sampling
• From data sampling to global data by projection
• Using statistical software packages (SAS, SPSS…)
• Skills for probability, regression and modeling
• Practical experience on data cleansing, simulation and data visualization
• Skills for data interpretation, analysis, categorization, correlation, explanation
• Communication skills to present results
BIG DATA POSITIONING
HYPE CYCLE 2012
GARTNER ANALYSIS
STRATEGIC TECHNOLOGY FROM GARTNER
TRENDS 2013
• Strategic Big Data
  – Big Data is moving from a focus on individual projects to an influence on enterprises' strategic information architecture
• Actionable Analytics
  – Provides simulation, prediction, optimization and other analytics, to empower even more decision flexibility at the time and place of every business process action
• In-Memory Computing
  – The execution of certain types of hours-long batch processes can be squeezed into minutes or even seconds
• Integrated Ecosystems
  – Packaging of software and services to address infrastructure or application workload
BIG DATA STATISTICS
REPORT
[Figure: bar chart, Amount of Stored Data by Sector (in petabytes, 2009); bar values 966, 848, 715, 619, 434, 364, 269, 227]
Sources: "Big Data: The Next Frontier for Innovation, Competition and Productivity." US Bureau of Labor Statistics | McKinsey Global Institute Analysis
35 ZB -> a stack of 50 GB Blu-ray discs reaching from Earth to the Moon (x2)
1 ZB = 10 ** 21 bytes
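The Blu-ray comparison can be checked with a few lines of arithmetic. This is a rough sketch: the 1.2 mm disc thickness and the 384,400 km average Earth-Moon distance are assumptions brought in for the check and do not appear on the slide.

```python
# Sanity check of "35 ZB = a stack of 50 GB Blu-ray discs reaching the Moon, x2".
ZETTABYTE = 10 ** 21          # bytes, as stated on the slide
GIGABYTE = 10 ** 9            # bytes
DISC_CAPACITY_BYTES = 50 * GIGABYTE

DISC_THICKNESS_M = 1.2e-3     # assumption: standard optical disc thickness
EARTH_MOON_KM = 384_400       # assumption: average Earth-Moon distance

# Number of discs needed to hold 35 ZB.
discs = 35 * ZETTABYTE // DISC_CAPACITY_BYTES

# Height of the stack, converted from metres to kilometres.
stack_km = discs * DISC_THICKNESS_M / 1000

# How many Earth-Moon distances the stack covers.
trips = stack_km / EARTH_MOON_KM

print(f"{discs:,} discs, stack of {stack_km:,.0f} km, {trips:.2f} x Earth-Moon")
```

With these assumptions the stack covers a little over two Earth-Moon distances, which is consistent with the "x2" on the slide.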
BIG DATA BUSINESS DRIVERS
ON MAJOR INDUSTRIES
• Telecommunications: more reliable network where we can predict and prevent failure; customer attrition
• Bank/Insurance: risk management; Basel III; customer qualification; fraud management
• Retail: a personal experience with products and offers that are just what you need
• Life Science: better targeted medicines with fewer complications and side effects
• Media: more content that is lined up with your personal preferences
• Marketing: e-reputation; trends analysis on web sites
• Healthcare: prevention system; epidemiological surveillance
• Government: government services that are based on hard data, not just gut feeling
• IT: support optimization; electric consumption analysis
• Gaming: determining the future direction of games
BIG DATA DOMAINS
A LARGE ACTIVITY
• Digital marketing optimization (e.g., web analytics, attribution, golden path analysis)
• Data exploration and discovery (e.g., data scientists, identifying new data-driven products, new markets)
• Fraud detection and prevention (e.g., revenue protection, site integrity, credit card protection, suspect transactions, fight against money laundering)
• Machine-generated data analytics (e.g., remote device insight, remote sensing, location-based intelligence)
• Social network and relationship analysis (e.g., influencer marketing, crowdsourcing, attrition prediction)
• Data retention (e.g., long-term conservation of data, data archiving)
Source: Teradata
TRENDS
NEW DATA & MANAGEMENT ECONOMICS
Storage Trend: New Data Structure (Distributed File Systems, NoSQL, NewSQL…)
Compute Trend: New Analytics (Massively Parallel Processing, MapReduce, Algorithms…)
[Figure: diagram labels]
• Data warehouses: proprietary and dedicated data warehouse, OLTP is the data warehouse, general purpose data warehouse, enterprise data warehouse, elastic data warehouse
• Storage and replication: Object Storage, Distributed FS, Federated/Sharded, Master/Slave, Master/Master
• Multi-Structured Data, Master Data Management, Data Quality
BIG DATA SOFTWARE & TOOLS
BIG DATA IS MOSTLY OPEN SOURCE SOFTWARE
OPEN SOURCE NOT ONLY FREE
• Shared source code
• Publicly available and free
• Support subscription not free
• No software vendor lock-in
• For the use and benefit of all without favour
Open Source software
Commercial software
DATA WAREHOUSE
GARTNER ANALYSIS
• Data Warehouse appliances
  – EMC Greenplum
  – Parallel Data Warehouse (Microsoft)
  – IBM Netezza
  – Oracle Exadata
  – SAP HANA
  – ParAccel Analytic Database
  – Teradata
  – HP Vertica
• Massively Parallel Processing
• Hadoop Connectivity
• Column-Oriented database
• In-Memory database
Source: Gartner, January 2013
DATA MANAGEMENT
GARTNER ANALYSIS
[Figure: three Gartner charts (Data Integration, Data Quality, Master Data), from 2011 position (in orange) to 2012 position (in red). Source: Gartner, October 2012]
Data Integration
• Data acquisition
• Consolidation
• Data migrations/conversions
• Synchronization of data between operational applications
• Interenterprise data sharing
• Delivery of data services in an SOA context
Data Quality
• Profiling
• Parsing and standardization
• Data cleansing
• Matching
• Monitoring
• Enrichment
Master Data
• Identify, link and synchronize the information across heterogeneous data sources
• Create and manage a central database of record or index
• Support master data and governance requirements through workflow
BUSINESS AND IT IMPACTS
BIG DATA QUALITY
Governance: Business, IT
1. Business consistency: wrong figures; visualization not clear for decision-making
2. Technical consistency: incorrect data, duplicates
3. Accessibility: external data access; open data access; easy data collection
4. Freshness: decision-making impact; data update
5. Completeness: all data in the context; global data
6. Explicable: data understanding
7. Traceability: data life cycle; from sources to users
8. Security: data loss; data intrusion; data access rights
BUSINESS INTELLIGENCE
GARTNER ANALYSIS
• Predictive analysis
• Advanced visualization
• Geospatial analysis
• Cloud analytics platform
• Innovation
• Last years' acquisitions
  – IBM > Cognos, Algorithmics
  – SAP > BusinessObjects
  – Oracle > Hyperion, Siebel, Endeca
Source: Gartner, February 2012
HADOOP OVERVIEW
OPEN SOURCE FRAMEWORK
What is Hadoop?
“Open Source software flexible and available architecture for large scale computation and data processing on a network of commodity hardware”
• Top-level Apache Software Foundation project
• Large, active user base, mailing lists, user groups
• Very active community, strong development team
Why Hadoop?
• Searching
• Log Processing
• Data Analytics
• Video and Image Analysis
• Data Retention
HADOOP PROVIDERS
FORRESTER ANALYSIS
• Amazon is the most prominent Hadoop cloud service provider
• IBM has the deepest Hadoop platform and application portfolio
• EMC Greenplum is the first mover in Hadoop appliances
• MapR has a strong OEM business for its Hadoop distribution
• Cloudera is the Hadoop pure play with the greatest adoption
• Hortonworks provides professional services to the Hadoop ecosystem
• Pentaho executes Hadoop MapReduce models and Pig scripts for data integration and analytics products
• DataStax embeds Cassandra for real-time Hadoop applications
• Datameer provides a user-friendly Hadoop modeling tool
• Platform Computing brings proven cluster management tools to Hadoop
• Zettaset specializes in Hadoop cluster management tools
• Outerthought focuses on Hadoop search applications
• HStreaming provides complex event processing middleware for Hadoop
Source: Forrester Research Inc., February 2012
CLOUDERA
HADOOP DISTRIBUTION - CDH
CDH4 – June 2012
Components
• Web Console: Hue
• Job Workflow: Apache Oozie
• Metadata: Apache Hive MetaStore
• Interactive SQL: Impala
• Data Mining Lib: Apache Mahout
• Data Processing Lib: DataFu for Pig
• Cloud Deployment: Apache Whirr
• Coordination: Apache ZooKeeper
• Data Integration: Apache Flume, Apache Sqoop
• Fast Read/Write Access: Apache HBase
• Batch Processing Languages: Apache Pig, Apache Hive
• Build/Test: Apache Bigtop
• Cloudera Manager Free Edition (Installation Wizard)
• Hadoop Core Kernel: MapReduce, HDFS
• Connectivity: ODBC/JDBC/FUSE/HTTPS
Hadoop Framework
• Hadoop is a framework based on a flexible and available architecture for large-scale computation and data processing on a network of commodity hardware
• HDFS / MapReduce: Hadoop Distributed File System for storage and Hadoop MapReduce for compute. High availability and scalability. Open source software
• Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Hadoop, it provides tools to enable easy data extract/transform/load, a mechanism to impose structure on a variety of data formats, access to files stored either directly in HDFS or in other data storage systems such as HBase, and query execution via MapReduce
• Pig is a high-level data-flow language and execution framework for parallel computation. It makes MapReduce programs simple to write, abstracts away low-level detail, and keeps the focus on data processing, data flow and data manipulation. It helps extract, transform and load data into HDFS, or from HDFS into any target system. Open source software
• Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS
• Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
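Hive's "mechanism to impose structure on a variety of data formats" is often described as schema-on-read: the raw file stays untyped, and column names and types are applied when it is queried. The sketch below imitates that idea in plain Python and does not use Hive itself. The log content, the column names and both helper functions are hypothetical, and the aggregation is only roughly what a HiveQL `SELECT country, SUM(bytes_sent) ... GROUP BY country` would return.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw, tab-separated log lines, standing in for a file in HDFS.
RAW_LOG = "2012-06-01\tFR\t120\n2012-06-01\tUS\t300\n2012-06-02\tFR\t80\n"

# Hypothetical schema applied at read time: (column name, type converter).
SCHEMA = [("day", str), ("country", str), ("bytes_sent", int)]

def read_with_schema(raw, schema):
    """Yield one typed dict per line; the raw text itself is never rewritten."""
    reader = csv.reader(io.StringIO(raw), delimiter="\t")
    for fields in reader:
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}

def total_by(rows, key, measure):
    """Group rows by `key` and sum `measure`, like a GROUP BY with SUM."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)

totals = total_by(read_with_schema(RAW_LOG, SCHEMA), "country", "bytes_sent")
print(totals)
```

In a real cluster the same query would be compiled into MapReduce jobs over HDFS blocks rather than run in a single process.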
MAPREDUCE
MASSIVELY PARALLEL PROCESSING
MapReduce
• MapReduce is the programming paradigm popularized by Google researchers
• Open-source Hadoop implementation of MapReduce by Yahoo
• Open source software framework for distributed computation
• Parallel computation (Map) on each block (Split) of data in an HDFS file, outputting a stream of (Key, Value) pairs to the local file system
• JobTracker schedules and manages jobs
• TaskTracker executes individual map() and reduce() tasks on each cluster node
Algorithms
• Association Rule Learning Algorithms
• Genetic Algorithms
• Neural Network Algorithms
• Statistical Algorithms (Pandas)
• Machine Learning Algorithms (Mahout, Weka, Scikit Learn)
• Natural Language Processing Algorithms
• Trading Algorithms
• Clinical design Algorithms
• Searching Algorithms (Lucene, Solr, Katta, ElasticSearch, OpenSearchServer…)
Languages
• PHP
• Erlang
• Python
• Ruby
• R
• Java
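The map, (Key, Value) and reduce steps from the MapReduce slide can be illustrated with the classic word count. This is a single-process sketch in plain Python and does not use the Hadoop API. The function names and the sample splits are illustrative assumptions. In Hadoop, the map and reduce tasks would run on TaskTrackers across the cluster, and the grouping step would be done by the framework.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

def map_phase(split: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one split of the input."""
    for word in split.lower().split():
        yield word, 1

def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(splits: List[str]) -> Dict[str, int]:
    """Run map on every split, group by key, then reduce each group."""
    mapped = [pair for split in splits for pair in map_phase(split)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Two hypothetical splits, standing in for two blocks of an HDFS file.
splits = ["big data is big", "data is everywhere"]
counts = word_count(splits)
print(counts)
```

Because each split is mapped independently and each key is reduced independently, both phases can be spread over many nodes, which is the point of the paradigm.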
MAJOR CATEGORIES
NOSQL DATABASES CATEGORIES
• NoSQL = Not only SQL
• Popular name for a subset of structured storage software that is designed with the intention of delivering increased optimization for high-performance operations on large datasets
• Basically available, scalable, eventually consistent
• Easy to use
• Tolerant of scale by way of horizontal distribution
Categories
• Key-Value: Redis, Riak (Basho), CouchBase, Voldemort (LinkedIn), MemcacheDB…
• Column: BigTable (Google), HBase, Cassandra (DataStax), Hypertable…
• Document: MongoDB (10Gen), CouchDB, Terrastore, SimpleDB (AWS)…
• Graph: Neo4j (Neo Technology), Jena, InfiniteGraph (Objectivity), FlockDB (Twitter)…
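To make the four NoSQL categories concrete, the sketch below shows how the same small piece of information might be shaped in each model. It uses only plain Python structures and no database client. The user record, keys and relationship name are hypothetical, and real products in each category differ in many details.

```python
import json

# Key-Value: an opaque value retrieved by a single key.
key_value = {"user:42": '{"name": "Alice", "city": "Paris"}'}

# Column (wide-column): row key -> column family -> column -> value.
column = {
    "42": {
        "profile": {"name": "Alice", "city": "Paris"},
        "stats": {"orders": 2},
    }
}

# Document: a nested, self-describing record that can be queried by field.
document = {
    "_id": 42,
    "name": "Alice",
    "city": "Paris",
    "orders": [{"sku": "A1"}, {"sku": "B7"}],
}

# Graph: nodes plus typed edges between them.
nodes = {42: {"name": "Alice"}, 7: {"name": "Bob"}}
edges = [(42, "FOLLOWS", 7)]

# The same kind of question, asked in the way each model favours.
name_from_kv = json.loads(key_value["user:42"])["name"]      # lookup by key
name_from_column = column["42"]["profile"]["name"]           # row + column
skus = [order["sku"] for order in document["orders"]]        # nested fields
followed = [nodes[dst]["name"] for src, rel, dst in edges
            if src == 42 and rel == "FOLLOWS"]               # edge traversal

print(name_from_kv, name_from_column, skus, followed)
```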
BIG DATA TECHNICAL ARCHITECTURE
CLOUD FOR BIG DATA
CLOUD MODELS
Cloud Computing: Private, Public, Hybrid
Stack layers (top to bottom)
• Applications
• Platform Tools & Services (Java, Ruby, Python, PHP, Erlang, R)
• Operating Systems (Linux, Windows, Unix…)
• Virtualization
• Hardware (server, storage, network)
Cloud models
• SaaS: SalesForce.com, Facebook, Twitter, LinkedIn…
• PaaS: Amazon Web Services, Microsoft Windows Azure, Google…
• IaaS: Amazon Web Services, CloudWatt…
INFRASTRUCTURE AS A SERVICE
IAAS MODEL
General Purpose
• Combine server with storage & networking (Hyper-Scale Server)
• Specialized software enables general-purpose system designs to provide high-performance data services
Data services move to the infrastructure
[Figure: three stacks (Legacy, Emerging, Future), each with Application, Data Services, Metadata Mgmt and Storage layers, split between Application and Infrastructure]
BI ARCHITECTURE VS. BIG DATA ARCHITECTURE
ALIGNING ARCHITECTURE ON BUSINESS
BI & DWH Architecture - Traditional
• SQL based
• Commercial software
• SAP BO, IBM Cognos, Oracle Hyperion…
• High availability
• Enterprise database
• Right design for structured data
• Current storage hardware (SAN, NAS, DAS)
• Components: App Servers, Database Servers, Network Switches, SAN Switch, Storage Array
Analytics Architecture – New Generation
• Not only SQL based
• Hadoop, Cassandra…
• High scalability, availability and flexibility
• Compute and storage in the same box for reducing the network latency
• Right design for semi-structured and unstructured data
• Components: Edge Nodes, Network Switches, Data Nodes
HADOOP ARCHITECTURE
OVERVIEW
Network Switches
Edge Nodes: 2 x EdgeNode
• 2 CPU 6 core
• 96 GB RAM
• 6 x HDD 600 GB 15K (RAID 10)
• 2 x 10GbE ports
Control Nodes: 2 x NameNode/BackupNode
• 2 CPU 6 core
• 96 GB RAM
• 6 x HDD 600 GB 15K (RAID 10)
• 2 x 10GbE ports
Worker Nodes: 3 to n DataNode
• 2 CPU 6 core
• 48 GB RAM
• 12 x HDD 3 TB 7.5K
• 2 x 10GbE ports
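The DataNode specification (12 x 3 TB disks per node, 3 to n nodes) gives a quick way to estimate cluster storage. The sketch below is a rough sizing aid, not a rule from the slides. It assumes the HDFS default replication factor of 3, and the 75% usable fraction reserved for temporary and intermediate data is a hypothetical planning value.

```python
# Rough HDFS capacity estimate from the DataNode specification on the slide.
DISKS_PER_DATANODE = 12   # from the slide
DISK_TB = 3               # from the slide

HDFS_REPLICATION = 3      # assumption: HDFS default replication factor
USABLE_FRACTION = 0.75    # assumption: hypothetical share left for HDFS data

def raw_capacity_tb(datanodes: int) -> int:
    """Total disk space across all DataNodes, before replication."""
    return datanodes * DISKS_PER_DATANODE * DISK_TB

def usable_capacity_tb(datanodes: int,
                       replication: int = HDFS_REPLICATION,
                       usable_fraction: float = USABLE_FRACTION) -> float:
    """Approximate space for unique data once every block is replicated."""
    return raw_capacity_tb(datanodes) * usable_fraction / replication

for n in (3, 10):
    print(f"{n} DataNodes: raw {raw_capacity_tb(n)} TB, "
          f"usable about {usable_capacity_tb(n):.0f} TB")
```

Because capacity grows linearly with the number of DataNodes, storage is scaled by adding worker nodes rather than by enlarging a central storage array.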
360° INSIGHT
ENTERPRISE DATA ARCHITECTURE
[Figure: enterprise data architecture diagram]
• System operators: Cloudera Manager
• Engineers: Dev./Int.
• Analysts: BI / Analytics
• Business users: Enterprise Reporting
• Customers: Web/Mobile Applications
• Data scientists: Modeling Tools
• Data administrator: Meta Data/ETL Tools
• Data sources: Logs, Files, Web Data, RDBMS
• Connected systems: Enterprise Data Warehouse, Online Serving Systems
BIG DATA VALUE PROPOSITION
BIG DATA - TCO / ROI APPROACH
KEY QUESTIONS
• Evaluate the investment opportunity
  – What can we expect from the investment?
  – Is it worth investing in-house?
  – How long to payback on investment?
  – What is the competitive advantage value?
  – What is the risk if we don’t start the project?
• Costs
  – Hardware & software products costs
  – Services & support costs
  – Training & communication costs
  – Energy & professional costs
• Benefits
  – Increase productivity
  – Increase margins and revenues
  – Reduce time to access relevant information
  – Reduce time to decision making
  – Enhance quality of information
  – Enhance user satisfaction
• TCO = Costs
• ROI = (Benefits – TCO) / TCO
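The two formulas above can be applied directly once costs and benefits are expressed in the same currency over the same period. The sketch below is a minimal illustration. The cost categories mirror the slide, but every amount is a hypothetical figure chosen only to make the arithmetic easy to follow.

```python
from typing import Dict

def tco(costs: Dict[str, float]) -> float:
    """TCO = sum of all cost items."""
    return sum(costs.values())

def roi(benefits: float, total_cost: float) -> float:
    """ROI = (Benefits - TCO) / TCO, returned as a ratio (0.25 means 25%)."""
    if total_cost == 0:
        raise ValueError("TCO must be non-zero to compute ROI")
    return (benefits - total_cost) / total_cost

# Hypothetical project figures over the evaluation period.
costs = {
    "hardware_software": 400_000,
    "services_support": 250_000,
    "training_communication": 50_000,
    "energy_professional": 100_000,
}
benefits = 1_000_000

total_cost = tco(costs)
project_roi = roi(benefits, total_cost)
print(f"TCO = {total_cost:,.0f}, ROI = {project_roi:.0%}")
```

A positive ROI means the quantified benefits exceed the total cost of ownership; benefits such as better information quality usually have to be estimated before they can enter the formula.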
BIG DATA VALUE PROPOSITION
• Keyrus, leader in Business Intelligence (Consulting & Delivery)
• Works closely with the “big data” leaders
• Works with high-level profiles: Statistician, Architect, BIDW Specialist, Consultant, Manager…
• Develops partnerships
• Develops innovation
• Uses open source software
  – No software vendor lock-in
  – Low TCO
• Apache Hadoop framework
  – HDFS, MapReduce, Hive…
• Big data integration software
  – Informatica, Talend…
• Big data analytics & visualization software
  – SAS, SAP, QlikTech, Tableau Software…
• DWH appliances and big data connectivity
  – Vertica, Exadata, Greenplum, Netezza, Teradata, SAP HANA, MS Parallel Data Warehouse
QUESTIONS & ANSWERS
WHO, WHAT, WHEN, WHERE…
THANK YOU
FOR YOUR ATTENTION