The Technology of the Business Data Lake
Appendix
BIM
Copyright © 2013 Capgemini. All rights reserved.
The Business Data Lake – Technical perspective
Pivotal data products
Term: Description
Greenplum Database: A massively parallel platform for large-scale data analytics that manages and analyzes petabytes of data. Also available with Hadoop HDFS storage tier integration (HAWQ, an add-on for PHD). HAWQ brings mature MPP technology to SQL on Hadoop. MADlib, an in-database parallel implementation of common analytics functions, will also work with HAWQ soon.
GemFire: A real-time distributed data store with linear scalability and continuous uptime capabilities. Now available with a storage tier integrated on Hadoop HDFS (GemFire XD).
Pivotal HD: Commercially supported Apache Hadoop. HAWQ brings mature, enterprise-class SQL capabilities to Hadoop, and GemFire XD brings real-time data access to Hadoop.
Spring XD: Simplifies the process of creating real-world big data solutions. It simplifies high-throughput data ingestion and export, and supports the creation of cross-platform workflows.
Pivotal Data Dispatch (PDD): On-demand big data access across and beyond the enterprise. PDD gives data workers security-controlled, self-service access to data. IT manages data modeling, access, compliance, and data lifecycle policies for all data provided through PDD.
Pivotal Analytics: Provides the business community with visualizations and insights from big data. It can join data from different sources to quickly create visualizations and dashboards. Pivotal Analytics can infer schemas from data sources and automatically create insights as it ingests data from various sources, freeing business analysts to focus on analyzing data and generating insights rather than manipulating data.
Appendix 2: Terminology
Terminology
Term: Description
Synchronous path: Processing that happens while the user is waiting for the results of an action (usually a click). The results are usually returned from information stored in the real-time systems.
Asynchronous path: Processing that happens in the background, with no user waiting for the results of the analysis. The results influence synchronous processing by refreshing the information that synchronous path processing relies on.
Streaming: Processing (collection, scoring, aggregation, deposition) of a single event as it happens. Streaming is usually associated with synchronous path processing.
Micro batch: Processing of a group of events that arrive frequently in a compact package, usually every few seconds or minutes.
Batch: Processing of a large group of events arriving in a package, usually hourly, daily or monthly.
Mega batch: Infrequent processing of all or most of the data (a very large amount). Although repeatable, it is usually done once a quarter or even less frequently.
Frequency: Rate at which events are generated, a.k.a. "event rate".
Latency: Time delay between the generation of an event (resulting from a business activity) and its receipt.
SLA: Service level agreement with the data consumer on the latency, quality and completeness of the data, along with uptime guarantees.
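The micro batch, latency and SLA definitions above can be illustrated with a small Python sketch. The `Event` type, the 5-second window and the function names are assumptions made for illustration; the slides do not prescribe any of them, and the SLA check covers only the latency part of an agreement.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; the field names are illustrative only."""
    key: str
    generated_at: float  # seconds, when the business activity produced the event
    received_at: float   # seconds, when the event was received


def latency(event: Event) -> float:
    """Latency: delay between event generation and receiving the event."""
    return event.received_at - event.generated_at


def micro_batches(events, window_seconds=5.0):
    """Group events into compact packages by arrival window (5 s is an assumption)."""
    batches = defaultdict(list)
    for event in events:
        window_index = int(event.received_at // window_seconds)
        batches[window_index].append(event)
    # Return the packages in arrival order.
    return [batches[index] for index in sorted(batches)]


def meets_latency_sla(events, max_latency_seconds):
    """Check the latency clause only; quality and completeness are out of scope here."""
    return all(latency(event) <= max_latency_seconds for event in events)
```

Widening `window_seconds` to hours or days turns the same grouping into what the table calls a batch; the difference is only how often a package is closed and processed.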
Terminology (cont’d.)
Term: Description
Real-time response time: Very low latency between the occurrence of an event and the generation of an insight, usually within seconds of the event.
Interactive response time: The time a user has to wait for the results. If it is within minutes, it is considered interactive. If the user needs to take a coffee break, it is batch.
Near real-time response time: Slightly higher latency than real time, usually within a few minutes of the event.
Analytics: Algorithms that run on the data. They range widely, from simple pre-computed aggregations to complex algorithms that look for patterns in data.
Insights: Results from the analytical algorithms, made available to applications or business users.
Actions: The activities that a business or an application performs in response to the information from the insights.
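The response-time terms above are defined loosely ("within seconds", "within a few minutes"), so any concrete cut-off is a judgment call. The sketch below picks illustrative thresholds, all of them assumptions and not values from the slides, to show how the categories relate to one another.

```python
# Assumed thresholds; the slides only say "seconds" and "a few minutes".
REAL_TIME_LIMIT_S = 10
NEAR_REAL_TIME_LIMIT_S = 300
INTERACTIVE_LIMIT_S = 300


def classify_insight_latency(seconds: float) -> str:
    """Classify the delay between an event occurring and its insight being generated."""
    if seconds < 0:
        raise ValueError("latency cannot be negative")
    if seconds <= REAL_TIME_LIMIT_S:
        return "real-time"
    if seconds <= NEAR_REAL_TIME_LIMIT_S:
        return "near real-time"
    return "slower than near real-time"


def classify_user_wait(seconds: float) -> str:
    """Classify how long a user waits for results: interactive, otherwise batch."""
    if seconds < 0:
        raise ValueError("wait time cannot be negative")
    return "interactive" if seconds <= INTERACTIVE_LIMIT_S else "batch"
```

Note that the two functions measure different things: the first is about event-to-insight delay, the second about how long a person waits after asking a question.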
Appendix 3: Components of Business Data Lake
Business Data Lake Criteria
How is Business Data Lake different?
Criterion | Business Data Lake | EDW
Common data model | Base class = standard data; derived classes = local data | Single class = single view across the enterprise
Data quality | Full spectrum |
Data integration | |
Multiple interfaces | SQL, SAS, R, MapReduce, NoSQL | SQL access integration with SAS, R and other analytical interfaces
Mixed workload with varying QoS | Support low latency, interactive and batch | Limited QoS separation required
Generic Business Data Lake architecture
Diagram of the tiers between Sources and the Action tier:
Ingestion tier: real-time ingestion, micro batch ingestion, batch ingestion (timelines: real time, micro batch, mega batch).
Distillation tier and Processing tier (workflow management), with in-memory and MPP database engines.
HDFS storage: unstructured and structured data.
Query interfaces: SQL, NoSQL, MapReduce.
Insights tier: real-time insights, interactive insights, batch insights.
Unified data management tier: data mgmt. services, MDM, RDM, audit and policy mgmt.
Unified operations tier: system monitoring, system management.
Components of Business Data Lake
Term: Description
Storage: Ability to store ALL data (structured and unstructured) cost-efficiently in the Business Data Lake.
Ingestion: Ability to bring in data from multiple data sources across all timelines with varying QoS.
Distillation: Ability to take the data stored in the storage tier and convert it to structured data for easier analysis by downstream applications.
Processing: Ability to run analytical algorithms and user queries with varying QoS (real time, interactive, batch) to generate structured data for easier analysis by downstream applications.
Insights: Ability to analyze all the data with varying QoS (real time, interactive and batch) to generate insights for business decisioning.
Action: Ability to integrate the insights with the business decisioning systems.
Unified data management: Ability to manage the data lifecycle and access policy definition, and to provide master data management and reference data management services.
Unified operations: Ability to monitor, configure and manage the whole Data Lake from a single operations environment.
Pivotal components for the tiers
Term: Description
Storage: Pivotal HD.
Ingestion: GemFire XD, HAWQ, Pivotal HD and Spring XD.
Distillation: Pivotal Data Dispatch.
Processing: Pivotal HD, HAWQ and GemFire XD queries, optionally managed via Spring XD workflows.
Insights: Pivotal HD, HAWQ and GemFire XD queries from user applications.
Action: Big data applications, a.k.a. business decisioning systems.
Unified data management: Pivotal Data Dispatch, master data management and reference data management services.
Unified operations: Pivotal Command Center (component of Pivotal HD to manage HAWQ and GemFire XD*), Spring XD monitoring and Pivotal Data Dispatch monitoring.
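The tier-to-product mapping above is easy to query once it is written down as data. The following is a minimal sketch that copies the product names from the table for the first five tiers; the dictionary layout and the `tiers_for` helper are illustrative assumptions, not part of any Pivotal tooling.

```python
# Product names are taken from the table above; the structure is illustrative.
TIER_COMPONENTS = {
    "Storage": ["Pivotal HD"],
    "Ingestion": ["GemFire XD", "HAWQ", "Pivotal HD", "Spring XD"],
    "Distillation": ["Pivotal Data Dispatch"],
    # Spring XD only optionally manages these queries, so it is not listed here.
    "Processing": ["Pivotal HD", "HAWQ", "GemFire XD"],
    "Insights": ["Pivotal HD", "HAWQ", "GemFire XD"],
}


def tiers_for(component: str) -> list:
    """Return the tiers in which a given product appears, in table order."""
    return [tier for tier, components in TIER_COMPONENTS.items() if component in components]
```

For example, `tiers_for("HAWQ")` lists Ingestion, Processing and Insights, which makes it visible that the same SQL engine is reused across several tiers.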
Data Lake interfaces
Ingestion Streaming Micro batch Batch Mega batch
Data Loader Yes Yes Yes
GemFire XD Yes
PDD
Spring XD Yes Yes Yes Yes
Sqoop Yes Yes
Distcp Yes Yes
Flume Yes Yes Yes
HDFS put Yes Yes
Talend Yes Yes
Informatica Yes Yes
Interface Real time Interactive Batch
GemFire XD (SQL) Yes Yes
HAWQ (SQL) Yes Yes Yes
Hive (HiveQL) Yes
HBase (NoSQL) Yes Yes
MapReduce Yes
Pig Yes
Impala (SQL) Yes Yes
BI Tools GemFire XD HAWQ Hive
MicroStrategy Yes Yes
BusinessObjects Yes Yes
Spotfire Yes Yes
Tableau Yes Yes
Microsoft Excel Yes Yes
Datameer Yes Yes
Karmasphere Yes Yes
Legend: Pivotal, Apache, Partner, Competition.
Operations diagram: monitoring, data management, and configuration/install (Pivotal Command Center, Pivotal Data Dispatch); data access, ingestion, and analytics.
Data ingestion
Diagram: event collection (files or events, high to low throughput) against event processing (streaming, micro batch, batch, mega batch, real time), with GemFire XD, Spring XD and Data loader placed on the grid and one cell marked N/A.
Spring XD: Out-of-the-box support for HTTP, Tail, Mail, Twitter, GemFire, TCP, JMS, RabbitMQ, Time, MQTT, …
Data loader: Moves massive amounts of data at wire speed, with throttling capabilities.
GemFire XD: SQL inserts into GemFire XD, plus an API to send data to GemFire XD.
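The slide mentions throttling for bulk data movement without saying how it works. As a generic illustration only, and not the Data Loader's actual mechanism or API, the sketch below caps the average transfer rate of a chunked copy. The function name, its parameters and the injectable `clock` and `sleep` hooks are all assumptions.

```python
import time


def throttled_copy(chunks, write, max_bytes_per_second,
                   clock=time.monotonic, sleep=time.sleep):
    """Write each chunk, pausing so the average rate stays at or below the limit."""
    start = clock()
    sent = 0
    for chunk in chunks:
        write(chunk)
        sent += len(chunk)
        # Earliest moment at which `sent` bytes are allowed to have been written.
        earliest_finish = start + sent / max_bytes_per_second
        delay = earliest_finish - clock()
        if delay > 0:
            sleep(delay)
    return sent
```

Passing `clock` and `sleep` as parameters keeps the function testable without real waiting; in normal use the defaults from the `time` module apply.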
Data access
HAWQ: SQL queries for interactive data access. Connectivity with industry-standard BI tools.
Hive, HBase, MapReduce: HiveQL and MapReduce for batch data access. HBase for real-time lookups and simple data queries.
GemFire XD: SQL queries, NoSQL and alerting APIs for real-time data. Data persisted on HDFS is immediately available for interactive queries.
Diagram: access type (analytics, lookup, query) against response time (batch, real time, interactive), populated with HAWQ, Hive, MapReduce, GemFire XD, HBase and Pig.
Data distillation: MapReduce and Pig. Use connectors, programs and models to convert to structured data.
Diagram: event storage (unstructured, structured) against event access methods (unstructured, structured interfaces). Interfaces: SQL, HiveQL, HBase APIs, MapReduce, Pig.
Data distillation
Distillation tools: MapReduce, Pig.
Connectors from Hadoop: Greenplum database, GemFire/SQL Fire.
Diagram: processing platform against data storage (native Hadoop, native HDFS), with HAWQ, GemFire XD and PXF connectors.
Unified data management: Pivotal Data Dispatch
All data stored on HDFS:
• Pivotal: GemFire XD/HAWQ
• Hadoop data: Hive/HBase
• Raw ingested data
IT managed:
• Data registered in PDD
• Data source connected and automated
• Target support for sandbox creation
• Auditable data access policy definition
Data work:
• Self-service ability to access data on demand in a target sandbox from various sources while conforming to the data access policies.
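The split between IT-managed registration and self-service access can be sketched as a small policy check. This is not the Pivotal Data Dispatch API; the `Registry` class, its method names and the role-based policy model are hypothetical and serve only to show how registered data, access policies and an audit trail fit together.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Hypothetical data registry with role-based access policies and an audit log."""
    policies: dict = field(default_factory=dict)   # dataset name -> set of allowed roles
    audit_log: list = field(default_factory=list)  # (user, dataset, sandbox, allowed)

    def register(self, dataset, allowed_roles):
        """IT side: register a dataset together with its access policy."""
        self.policies[dataset] = set(allowed_roles)

    def request_access(self, user, role, dataset, sandbox):
        """Data worker side: ask for a dataset in a sandbox; every decision is logged."""
        allowed = role in self.policies.get(dataset, set())
        self.audit_log.append((user, dataset, sandbox, allowed))
        return allowed
```

Unregistered datasets are denied by default, and denied requests are logged as well, which is what makes the policy auditable in this sketch.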
Action tier: Decision maker expectations
Informational
Ability to get information in a dashboard
Integration with business intelligence tools
Tableau, MicroStrategy, BusinessObjects, Pentaho.
Alerting
Ability to alert the decision maker
Integration with the alert systems
Dashboard, alarms, emails, pagers, phones etc.
Automation
Ability to integrate with business decisioning systems
Integration with the applications to take automated actions
MessageMQ, Rabbit, Spring, & other technologies.
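The three expectations above (informational, alerting, automation) amount to routing each insight to the right kind of consumer. Here is a minimal sketch of that routing, assuming insights are plain dictionaries with a `kind` field and handlers are ordinary callables; none of these names come from the slides or from any of the listed products.

```python
def route_insight(insight, handlers):
    """Pass an insight to every handler registered for its kind.

    `handlers` maps a kind (for example "alert" or "automation") to a list of
    (name, callable) pairs. Returns the names of the handlers that were called.
    """
    used = []
    for name, handler in handlers.get(insight["kind"], []):
        handler(insight)
        used.append(name)
    return used
```

In practice the callables would wrap whatever dashboard, alerting or messaging integration is in place; kinds with no registered handler are simply ignored.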
The information contained in this presentation is proprietary.
Copyright © 2013 Capgemini. All rights reserved.
Rightshore® is a trademark belonging to Capgemini.
www.capgemini.com/bim
About Capgemini
With more than 130,000 people in 44 countries, Capgemini is one
of the world's foremost providers of consulting, technology and
outsourcing services. The Group reported 2012 global revenues
of EUR 10.3 billion.
Together with its clients, Capgemini creates and delivers
business and technology solutions that fit their needs and drive
the results they want. A deeply multicultural organization,
Capgemini has developed its own way of working, the
Collaborative Business Experience™, and draws on Rightshore®,
its worldwide delivery model.