big data université paris 13

BIG DATA Philippe Julio – Big Data Consulting Practice Manager

Upload: rita-sassou

Post on 10-Nov-2014


TRANSCRIPT

Page 1: Big Data Université Paris 13

BIG DATA

Philippe Julio – Big Data Consulting Practice Manager

Page 2: Big Data Université Paris 13

AGENDA

BIG DATA

• Who is KEYRUS?
• Big Data & Analytics: what is it?
• Positioning
• Software & Tools
• Technical Architecture
• Value Proposition

Page 3: Big Data Université Paris 13

KEYRUS

A UNIQUE VALUE PROPOSITION

• €153m in 2012 revenues
• 1,650 employees
• 350 large accounts* & LME (*including 80 Global Fortune 500)
• 3,800 SME customers
• 12 countries on 4 continents

A GROUP STRONG AND AGILE
• The infrastructures and processes (quality, HR...) of a large professional services group
• Simple and formalized governance to maintain agility at all times
• A customer-focused decision center
• Listed on NYSE-Euronext Paris

SPECIALIST IN ORGANIZATIONS PERFORMANCE
• An ability to act on performance management strategy, systems and organizations
• Different Business Units to serve different types of clients (large corporations, mid-market, and SMEs)
• Functional, Industry and Technology skills

OUR VALUES FOR THE BENEFIT OF OUR CUSTOMERS
• Entrepreneurship
• Customer proximity
• Building our brand on quality of service
• A culture of innovation that defines how we operate and is also part of our value proposition
• Diversity as a key component of our HR policy

AN INTERNATIONAL DIMENSION
• Expertise in deploying international projects
• Nearshore & offshore capacities
• Belgium, Brazil, Canada, China, Spain, France, Mauritius, Israel, Luxembourg, Switzerland, Tunisia, USA

Revenue by Sector
• Industries: 31%
• Banking - Insurance: 19%
• Services - Distribution: 16%
• Public Services: 14%
• Utilities: 12%
• Telecom: 8%

© Keyrus - All rights reserved

Page 4: Big Data Université Paris 13

BIG DATA & ANALYTICS, WHAT IS IT ?

Page 5: Big Data Université Paris 13

• 5 Billion: # of cell phone users worldwide in 2010
• 2 Billion: # of Internet users worldwide in 2010
• 10x: growth in digital data every 5 years
• 35 ZB: by 2020, the Digital Universe will be 44 times as big as it was in 2009
• 30 Billion: pieces of content shared on Facebook every month

Page 6: Big Data Université Paris 13

LARGE HADRON COLLIDER OF CERN (SWITZERLAND)

BIG DATA

Large Hadron Collider (LHC) of CERN: 15 PB of data per year!

Page 7: Big Data Université Paris 13

BIG DATA ?

NOT ONLY DATA VOLUME

Velocity
Often time-sensitive, big data must be used as it is streaming into the enterprise in order to maximize its value to the business.
Batch, Near time, Real time, Streams

Volume
Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
TB, Records, Transactions, Tables, Files

Variety
Big data extends beyond structured data, including semi-structured and unstructured data of all varieties: text, audio, video, click streams, log files and more.
Multi-structured: Unstructured, Semi-Structured, Structured

Value
• Innovate new business models
• Replace/Support human decision
• Custom actions
• Discover needs
• Improve performance
• Create transparency

Page 8: Big Data Université Paris 13

WHAT DOES ANALYTICS MEAN?

FOR SALES MANAGEMENT BY EXAMPLE

• Analysis
  – Quarterly sales reporting
  – Sales growth plan
• Simulation & Forecast
  – Run alternative sales scenarios to identify the best product mix for next quarter
  – Run simulations to determine the ideal number of sales professionals to assign to a particular new territory
• Strategy
  – Forecast vs. results analysis
  – Predictable patterns
  – Decision-making

Past: What happened and when? How and why did it happen?
Future: What will happen? How can it be done better?
Axis labels: Facts, Interpretation

Page 9: Big Data Université Paris 13

ANALYZING DATA

SKILL OF THE FUTURE

McKinsey: by 2018, the United States alone could face a shortage of:
• 140,000 to 190,000 people with deep analytical skills
• 1.5M managers/analysts with the know-how to use the analysis of big data to make effective decisions

Sources:
• www.mckinsey.com/mgi/publications/big_data/
• Stacy Collett, Computerworld, August 23, 2010

Page 10: Big Data Université Paris 13

DATA SCIENTIST, STATISTICIAN

WHAT IS THE DIFFERENCE

Data Scientist
• Working on global data
• Modeling complex business problems
• Using Big Data software packages (Mahout, Lucene…)
• Discovering business insights
• Identifying opportunities
• Skills for coding, integrating and preparing large, varied data sets
• Advanced analytics and modeling skills to reveal and understand hidden relationships
• Business knowledge and communication skills to present results

Statistician
• Working on data sampling
• From data sampling to global data by projection
• Using statistical software packages (SAS, SPSS…)
• Skills for probability, regression and modeling
• Practical experience on data cleansing, simulation and data visualization
• Skills for data interpretation, analysis, categorization, correlation, explanation
• Communication skills to present results

Page 11: Big Data Université Paris 13

BIG DATA POSITIONING

Page 12: Big Data Université Paris 13

HYPE CYCLE 2012

GARTNER ANALYSIS

Page 13: Big Data Université Paris 13

STRATEGIC TECHNOLOGY FROM GARTNER

TRENDS 2013

• Strategic Big Data
  – Big Data is moving from a focus on individual projects to an influence on enterprises' strategic information architecture
• Actionable Analytics
  – Provides simulation, prediction, optimization and other analytics, to empower even more decision flexibility at the time and place of every business process action
• In Memory Computing
  – The execution of certain types of hours-long batch processes can be squeezed into minutes or even seconds
• Integrated Ecosystems
  – Packaging of software and services to address infrastructure or application workload

Page 14: Big Data Université Paris 13

BIG DATA STATISTICS

REPORT

[Chart: Amount of Stored Data by Sector (in petabytes, 2009). Bar values: 966, 848, 715, 619, 434, 364, 269, 227. Sector labels are not recoverable from the transcript.]

Sources: "Big Data: The Next Frontier for Innovation, Competition and Productivity." US Bureau of Labor Statistics | McKinsey Global Institute Analysis

35 ZB -> a stack of 50 GB Blu-ray discs reaching from the Earth to the Moon, twice over

1 ZB = 10 ** 21 bytes
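The Blu-ray comparison above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the disc thickness (1.2 mm) and the mean Earth-Moon distance (384,400 km) are assumptions that do not appear on the slide, and the function name is hypothetical.

```python
ZETTABYTE = 10 ** 21                    # bytes, as stated on the slide
DISC_CAPACITY_BYTES = 50 * 10 ** 9      # 50 GB per disc, as stated on the slide
DISC_THICKNESS_M = 1.2e-3               # assumption: standard optical disc thickness
EARTH_MOON_DISTANCE_M = 384_400_000     # assumption: mean Earth-Moon distance


def stack_of_discs(total_bytes):
    """Return (number of discs, stack height expressed in Earth-Moon distances)."""
    discs = total_bytes / DISC_CAPACITY_BYTES
    height_m = discs * DISC_THICKNESS_M
    return discs, height_m / EARTH_MOON_DISTANCE_M


discs, moon_distances = stack_of_discs(35 * ZETTABYTE)
print(f"{discs:.1e} discs, {moon_distances:.2f} x the Earth-Moon distance")
```

Under these assumptions the stack comes out a little over twice the Earth-Moon distance, which is consistent with the "x2" claim.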

Page 15: Big Data Université Paris 13

BIG DATA BUSINESS DRIVERS

ON MAJOR INDUSTRIES

• Telecommunications: more reliable network where we can predict and prevent failure; customer attrition
• Bank/Insurance: risk management (Basel III), customer qualification, fraud management
• Retail: a personal experience with products and offers that are just what you need
• Life Science: better targeted medicines with fewer complications and side effects
• Media: more content that is lined up with your personal preferences
• Marketing: e-reputation, trends analysis on web sites
• Healthcare: prevention system, epidemiological surveillance
• Government: government services that are based on hard data, not just gut feeling
• IT: support optimization, electric consumption analysis
• Gaming: determining the future direction of the games

Page 16: Big Data Université Paris 13

BIG DATA DOMAINS

A LARGE ACTIVITY

• Digital marketing optimization (e.g., web analytics, attribution, golden path analysis)
• Data exploration and discovery (e.g., data scientists, identifying new data-driven products, new markets)
• Fraud detection and prevention (e.g., revenue protection, site integrity, credit card protection, suspect transactions, fight against money laundering)
• Machine-generated data analytics (e.g., remote device insight, remote sensing, location-based intelligence)
• Social network and relationship analysis (e.g., influencer marketing, crowdsourcing, attrition prediction)
• Data retention (e.g., long-term conservation of data, data archiving)

Source: Teradata

Page 17: Big Data Université Paris 13

TRENDS

NEW DATA & MANAGEMENT ECONOMICS

• Storage Trend: New Data Structure (Distributed File Systems, NoSQL, NewSQL…)
• Compute Trend: New Analytics (Massively Parallel Processing, MapReduce, Algorithms…)

Diagram labels:
• Data warehouse: OLTP is the data warehouse, Proprietary and dedicated data warehouse, General purpose data warehouse, Enterprise data warehouse, Elastic data warehouse
• Storage: Master/Slave, Master/Master, Federated/Sharded, Distributed FS, Object Storage
• Multi-Structured Data, Master Data Management, Data Quality

Page 18: Big Data Université Paris 13

BIG DATA SOFTWARE & TOOLS

Page 19: Big Data Université Paris 13

BIG DATA IS MOSTLY OPEN SOURCE SOFTWARE

OPEN SOURCE NOT ONLY FREE

• Shared source code
• Publicly available and free
• Support subscription not free
• No software vendor lock-in
• For the use and benefit of all without favour

Open Source software

Commercial software

Page 20: Big Data Université Paris 13

DATA WAREHOUSE

GARTNER ANALYSIS

• Data Warehouse appliances
  – EMC Greenplum
  – Parallel Data Warehouse (Microsoft)
  – IBM Netezza
  – Oracle Exadata
  – SAP HANA
  – ParAccel Analytic Database
  – Teradata
  – HP Vertica
• Massively Parallel Processing
• Hadoop Connectivity
• Column-Oriented database
• In-Memory database

Source: Gartner, January 2013

Page 21: Big Data Université Paris 13

DATA MANAGEMENT

GARTNER ANALYSIS

[Figures: Gartner positioning charts for Data Integration, Data Quality and Master Data, showing the 2011 position (in orange) to the 2012 position (in red). Source: Gartner, October 2012]

Data Integration
• Data acquisition
• Consolidation
• Data migrations/conversions
• Synchronization of data between operational applications
• Interenterprise data sharing
• Delivery of data services in an SOA context

Data Quality
• Profiling
• Parsing and standardization
• Data cleansing
• Matching
• Monitoring
• Enrichment

Master Data
• Identify, link and synchronize the information across heterogeneous data sources
• Create and manage a central database of record or index
• Support master data and governance requirements through workflow

Page 22: Big Data Université Paris 13

BUSINESS AND IT IMPACTS

BIG DATA QUALITY

Business | IT | Governance

1. Business consistency: wrong figures; visualization not clear for decision-making
2. Technical consistency: incorrect data, duplicates
3. Accessibility: external data access; open data access; easy data collection
4. Freshness: decision-making impact; data update
5. Completeness: all data in the context; global data
6. Explicable: data understanding
7. Traceability: data life cycle; from sources to users
8. Security: data loss; data intrusion; data access rights

Page 23: Big Data Université Paris 13

BUSINESS INTELLIGENCE

GARTNER ANALYSIS

• Predictive analysis
• Advanced visualization
• Geospatial analysis
• Cloud analytics platform
• Innovation
• Recent years' acquisitions
  – IBM > Cognos, Algorithmics
  – SAP > BusinessObjects
  – Oracle > Hyperion, Siebel, Endeca

Source: Gartner, February 2012

Page 24: Big Data Université Paris 13

HADOOP OVERVIEW

OPEN SOURCE FRAMEWORK

What is Hadoop?
“Open source software: flexible and available architecture for large-scale computation and data processing on a network of commodity hardware”
• Top level Apache Foundation project
• Large, active user base, mailing lists, user groups
• Very active community, strong development team

Why Hadoop?
• Searching
• Log Processing
• Data Analytics
• Video and Image Analysis
• Data Retention

Page 25: Big Data Université Paris 13

HADOOP PROVIDERS

FORRESTER ANALYSIS

• Amazon is the most prominent Hadoop cloud service provider
• IBM has the deepest Hadoop platform and application portfolio
• EMC Greenplum is the first mover in Hadoop appliances
• MapR has a strong OEM business for its Hadoop distribution
• Cloudera is the Hadoop pure play with the greatest adoption
• Hortonworks provides professional services to the Hadoop ecosystem
• Pentaho executes Hadoop MapReduce models and Pig scripts for data integration and analytics products
• DataStax embeds Cassandra for real-time Hadoop applications
• Datameer provides a user-friendly Hadoop modeling tool
• Platform Computing brings proven cluster management tools to Hadoop
• Zettaset specializes in Hadoop cluster management tools
• Outerthought focuses on Hadoop search applications
• HStreaming provides complex event processing middleware for Hadoop

Source: Forrester Research Inc., February 2012

Page 26: Big Data Université Paris 13

CLOUDERA

HADOOP DISTRIBUTION - CDH

Hadoop Framework (CDH4, June 2012)

• Web Console: Hue
• Job Workflow: Apache Oozie
• Metadata: Apache Hive MetaStore
• Interactive SQL: Impala
• Data Mining Lib: Apache Mahout
• Data Processing Lib: DataFu for Pig
• Cloud Deployment: Apache Whirr
• Coordination: Apache ZooKeeper
• Data Integration: Apache Flume, Apache Sqoop
• Fast Read/Write Access: Apache HBase
• Batch Processing Languages: Apache Pig, Apache Hive
• Build/Test: Apache Bigtop
• Installation Wizard: Cloudera Manager Free Edition
• Hadoop Core Kernel: MapReduce, HDFS
• Connectivity: ODBC/JDBC/FUSE/HTTPS

• Hadoop is a framework based on a flexible and available architecture for large-scale computation and data processing on a network of commodity hardware.
• HDFS / MapReduce: Hadoop Distributed File System for storage and Hadoop MapReduce for compute. High availability and scalability. Open source software.
• Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Hadoop, it provides tools to enable easy data extract/transform/load, a mechanism to impose structure on a variety of data formats, access to files stored either directly in HDFS or in other data storage systems such as HBase, and query execution via MapReduce.
• Pig is a high-level data-flow language and execution framework for parallel computation. It makes MapReduce programs simple to write and abstracts away low-level detail, so the focus stays on data processing, data flow and data manipulation. It is used to extract, transform and load data into HDFS, or from HDFS into any target system. Open source software.
• Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
• Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Page 27: Big Data Université Paris 13

MAPREDUCE

MASSIVELY PARALLEL PROCESSING

MapReduce
• MapReduce is the programming paradigm popularized by Google researchers
• Open-source Hadoop implementation of MapReduce by Yahoo
• Open source software framework for distributed computation
• Parallel computation (Map) on each block (Split) of data in an HDFS file, with output of a stream of (Key, Value) pairs to the local file system
• JobTracker schedules and manages jobs
• TaskTracker executes individual map() and reduce() tasks on each cluster node

Algorithms
• Association Rule Learning Algorithms
• Genetic Algorithms
• Neural Network Algorithms
• Statistical Algorithms (Pandas)
• Machine Learning Algorithms (Mahout, Weka, scikit-learn)
• Natural Language Processing Algorithms
• Trading Algorithms
• Clinical design Algorithms
• Searching Algorithms (Lucene, Solr, Katta, Elasticsearch, OpenSearchServer…)

Languages
• PHP
• Erlang
• Python
• Ruby
• R
• Java
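To make the map / (Key, Value) / reduce flow described on this slide concrete, here is a minimal single-process sketch of the paradigm in plain Python. It is not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical, and a real cluster would run the map and reduce tasks in parallel on separate nodes.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one split of the input."""
    for line in split:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(splits):
    """Run map over every split, then shuffle, then reduce each key."""
    pairs = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Two hypothetical splits standing in for two blocks of a file.
splits = [["big data is big"], ["data is everywhere"]]
counts = word_count(splits)
print(counts)
```

Each split is mapped independently, which is what allows the map step to be distributed; only the shuffle needs to see the pairs from all splits.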

Page 28: Big Data Université Paris 13

MAJOR CATEGORIES

NOSQL DATABASES CATEGORIES

• NoSQL = Not only SQL
• Popular name for a subset of structured storage software that is designed with the intention of delivering increased optimization for high-performance operations on large datasets
• Basically available, scalable, eventually consistent
• Easy to use
• Tolerant of scale by way of horizontal distribution

Categories
• Key-Value: Redis, Riak (Basho), CouchBase, Voldemort (LinkedIn), MemcacheDB…
• Column: BigTable (Google), HBase, Cassandra (DataStax), Hypertable…
• Document: MongoDB (10Gen), CouchDB, Terrastore, SimpleDB (AWS)…
• Graph: Neo4j (Neo Technology), Jena, InfiniteGraph (Objectivity), FlockDB (Twitter)…
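The four NoSQL categories differ mainly in the shape of the data they store. The sketch below mimics each shape with plain Python structures; it uses no real database client, and every key, field and value in it is hypothetical sample data chosen for illustration.

```python
# Key-value: an opaque value looked up by a single key.
key_value = {"session:42": b"opaque-bytes"}

# Column (wide-column): row key -> column family -> column -> value.
column = {
    "user:42": {
        "profile": {"name": "Ada", "city": "Paris"},
        "stats": {"logins": 7},
    }
}

# Document: a self-describing nested record that can be queried by field.
document = {"_id": 42, "name": "Ada", "tags": ["analytics", "hadoop"]}

# Graph: nodes plus typed edges (source, relation, destination).
graph = {
    "nodes": {"ada", "bob", "keyrus"},
    "edges": [("ada", "WORKS_AT", "keyrus"), ("bob", "KNOWS", "ada")],
}


def neighbors(graph, node):
    """Return the nodes reachable from `node` by following one outgoing edge."""
    return sorted(dst for src, _, dst in graph["edges"] if src == node)


print(key_value["session:42"])
print(column["user:42"]["stats"]["logins"])
print(document["tags"])
print(neighbors(graph, "ada"))
```

The access pattern follows the shape: a key-value store answers only "give me the value for this key", while the column, document and graph models expose progressively more structure to query.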

Page 29: Big Data Université Paris 13

BIG DATA TECHNICAL ARCHITECTURE

Page 30: Big Data Université Paris 13

CLOUD FOR BIG DATA

CLOUD MODELS

Cloud Computing: Private Cloud, Public Cloud, Hybrid Cloud

Stack layers (top to bottom)
• Applications
• Platform Tools & Services (Java, Ruby, Python, PHP, Erlang, R)
• Operating Systems (Linux, Windows, Unix…)
• Virtualization
• Hardware (server, storage, network)

Cloud models and example providers
• SaaS: SalesForce.com, Facebook, Twitter, LinkedIn…
• PaaS: Amazon Web Services, Microsoft Windows Azure, Google…
• IaaS: Amazon Web Services, CloudWatt…

Page 31: Big Data Université Paris 13

INFRASTRUCTURE AS A SERVICE

IAAS MODEL

General Purpose
• Combine server with storage & networking (Hyper-Scale Server)
• Specialized software enables general purpose system designs to provide high performance data services

Data services move to the infrastructure

[Diagram: three stacks labelled Legacy, Emerging and Future, each made of Application, Data Services, Metadata Mgmt and Storage layers, split between Application and Infrastructure.]

Page 32: Big Data Université Paris 13

BI ARCHITECTURE VS. BIG DATA ARCHITECTURE

ALIGNING ARCHITECTURE ON BUSINESS

BI & DWH Architecture - Traditional
• SQL based
• Commercial software
• SAP BO, IBM Cognos, Oracle Hyperion…
• High availability
• Enterprise database
• Right design for structured data
• Current storage hardware (SAN, NAS, DAS)
[Diagram: App Servers, Database Servers, Network Switches, SAN Switch, Storage Array]

Analytics Architecture - New Generation
• Not only SQL based
• Hadoop, Cassandra…
• High scalability, availability and flexibility
• Compute and storage in the same box for reducing the network latency
• Right design for semi-structured and unstructured data
[Diagram: Edge Nodes, Network Switches, Data Nodes]

Page 33: Big Data Université Paris 13

HADOOP ARCHITECTURE

OVERVIEW

Network Switches

Edge Nodes: 2 x EdgeNode
• 2 CPU 6 core
• 96GB RAM
• 6 x HDD 600GB 15K (Raid10)
• 2 x 10GbE Ports

Control Nodes: 2 x NameNode/BackupNode
• 2 CPU 6 core
• 96GB RAM
• 6 x HDD 600GB 15K (Raid10)
• 2 x 10GbE Ports

Worker Nodes: 3 to n DataNode
• 2 CPU 6 core
• 48GB RAM
• 12 x HDD 3TB 7.5K
• 2 x 10GbE Ports
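The DataNode specification above (12 x 3TB disks per node) gives a quick way to size a cluster. The sketch below is a rough estimate only: it assumes the HDFS default replication factor of 3, which the slide does not state, and it ignores operating system, temporary and intermediate data overhead. The function name is hypothetical.

```python
def hdfs_capacity_tb(data_nodes, disks_per_node=12, disk_tb=3, replication=3):
    """Return (raw TB, usable TB after replication) for a cluster of DataNodes.

    disks_per_node and disk_tb come from the slide; replication=3 is an
    assumption (the HDFS default), not a figure from the slide.
    """
    if data_nodes < 1 or replication < 1:
        raise ValueError("data_nodes and replication must be at least 1")
    raw_tb = data_nodes * disks_per_node * disk_tb
    return raw_tb, raw_tb / replication


for nodes in (3, 10):
    raw, usable = hdfs_capacity_tb(nodes)
    print(f"{nodes} DataNodes: {raw} TB raw, about {usable:.0f} TB usable")
```

With the minimum of 3 DataNodes this gives 108 TB raw and about 36 TB usable, so each additional worker node adds roughly 12 TB of replicated storage under these assumptions.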

Page 34: Big Data Université Paris 13

360° INSIGHT

ENTERPRISE DATA ARCHITECTURE

Users and tools
• System operators: Cloudera Manager
• Engineers: Dev./Int.
• Analysts: BI / Analytics
• Business users: Enterprise Reporting
• Customers: Web/Mobile Applications
• Data scientists: Modeling Tools
• Data administrator: Meta Data/ETL Tools

Data sources: Logs, Files, Web Data, RDBMS

Connected systems: Enterprise Data Warehouse, Online Serving Systems

Page 35: Big Data Université Paris 13

BIG DATA VALUE PROPOSITION

Page 36: Big Data Université Paris 13

BIG DATA - TCO / ROI APPROACH

KEY QUESTIONS

• Evaluate the investment opportunity
  – What can we expect from the investment?
  – Is it worth investing in-house?
  – How long until the investment pays back?
  – What is the competitive advantage value?
  – What is the risk if we don't start the project?
• Costs
  – Hardware & software product costs
  – Services & support costs
  – Training & communication costs
  – Energy & professional costs
• Benefits
  – Increase productivity
  – Increase margins and revenues
  – Reduce time to access relevant information
  – Reduce time to decision making
  – Enhance quality of information
  – Enhance user satisfaction

• TCO = Costs
• ROI = (Benefits - TCO) / TCO
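The two formulas above translate directly into code. In the sketch below the cost categories mirror the slide's list, but every amount is a made-up illustration, and the function names `tco` and `roi` are hypothetical.

```python
def tco(costs):
    """Total cost of ownership: the sum of all cost items."""
    return sum(costs.values())


def roi(benefits, costs):
    """Return on investment: (Benefits - TCO) / TCO."""
    total = tco(costs)
    if total == 0:
        raise ValueError("TCO must be non-zero to compute ROI")
    return (benefits - total) / total


# Hypothetical figures, for illustration only.
costs = {
    "hardware_software": 400_000,
    "services_support": 250_000,
    "training_communication": 50_000,
    "energy_professional": 100_000,
}
benefits = 1_000_000

print(f"TCO = {tco(costs)}")
print(f"ROI = {roi(benefits, costs):.0%}")
```

With these sample numbers the TCO is 800,000 and the ROI is 25%; a negative ROI would mean the quantified benefits do not cover the costs.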

Page 37: Big Data Université Paris 13

BIG DATA VALUE PROPOSITION

• Keyrus, leader in Business Intelligence (Consulting & Delivery)
• Works closely with the “big data” leaders
• Works with high level profiles: Statistician, Architect, BIDW Specialist, Consultant, Manager…
• Develops partnerships
• Develops innovation
• Uses open source software
  – No software vendor lock-in
  – Low TCO
• Apache Hadoop framework
  – HDFS, MapReduce, Hive…
• Big data integration software
  – Informatica, Talend…
• Big data analytics & visualization software
  – SAS, SAP, QlikTech, Tableau Software…
• DWH appliances and big data connectivity
  – Vertica, Exadata, Greenplum, Netezza, Teradata, SAP HANA, MS Parallel Data Warehouse

Page 38: Big Data Université Paris 13

QUESTIONS & ANSWERS

WHO, WHAT, WHEN, WHERE…

Page 39: Big Data Université Paris 13

THANK YOU

FOR YOUR ATTENTION