
The Technology of the Business Data Lake

Appendix


BIM

Copyright © 2013 Capgemini. All rights reserved.

The Business Data Lake – Technical perspective

Pivotal data products

Term: Description

Greenplum Database: A massively parallel platform for large-scale data analytics that manages and analyzes petabytes of data. It is also available with Hadoop HDFS storage tier integration (HAWQ, an add-on for Pivotal HD, PHD). HAWQ brings mature MPP technology to SQL on Hadoop. MADlib, an in-database parallel implementation of common analytics functions, will also work with HAWQ soon.

GemFire: A real-time distributed data store with linear scalability and continuous uptime capabilities. It is now available with a storage tier integrated on Hadoop HDFS (GemFire XD).

Pivotal HD: Commercially supported Apache Hadoop. HAWQ brings mature enterprise-class SQL capabilities to Hadoop, and GemFire XD brings real-time data access to Hadoop.

Spring XD: Simplifies the process of creating real-world big data solutions. It simplifies high-throughput data ingestion and export, and adds the ability to create cross-platform workflows.

Pivotal Data Dispatch: On-demand big data access across and beyond the enterprise. Pivotal Data Dispatch (PDD) gives data workers security-controlled, self-service access to data. IT manages data modeling, access, compliance, and data lifecycle policies for all data provided through PDD.

Pivotal Analytics: Provides the business community with visualizations and insights from big data. It can join data from different sources to quickly create visualizations and dashboards. Pivotal Analytics can infer schemas from data sources and automatically create insights as it ingests data from various sources, freeing business analysts to focus on analyzing data and generating insights rather than manipulating data.




Appendix 2: Terminology




Terminology

Term: Description

Synchronous path: Processing that happens while the user is waiting for the results of an action (usually a click). The results are usually returned from information stored in the real-time systems.

Asynchronous path: Processing that happens in the background, with no user waiting for the results of the analysis. The results of this processing influence synchronous processing by refreshing the information that synchronous path processing relies on.

Streaming: Processing (collection, scoring, aggregation, deposition) of a single event as it happens. Streaming is usually associated with synchronous path processing.

Micro batch: Processing of a group of events that arrive frequently in a compact package, usually every few seconds or minutes.

Batch: Processing of a large group of events that arrive in a package, usually hourly, daily or monthly.

Mega batch: Infrequent processing of all or most of the data (a very large amount). Although repeatable, it is usually done once a quarter or even less frequently.

Frequency: Rate at which events are generated, also known as the "event rate".

Latency: Time delay between the generation of an event (resulting from a business activity) and its receipt.

SLA: Service level agreement with the data consumer on the latency, quality and completeness of the data, along with uptime guarantees.
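The latency and micro batch definitions above can be illustrated with a minimal Python sketch. The `Event` type, its field names and the five-second window are assumptions made for illustration; none of them come from the source or from any Pivotal product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event record; field names are illustrative only.
    generated_at: float  # seconds, when the business activity produced the event
    received_at: float   # seconds, when the event was received
    payload: str


def latency_seconds(event: Event) -> float:
    """Latency as defined above: delay between generation and receipt."""
    return event.received_at - event.generated_at


def micro_batches(events, window_seconds=5.0):
    """Group events into fixed windows by receipt time.

    The window length is an assumed value standing in for
    "every few seconds or minutes".
    """
    buckets = defaultdict(list)
    for event in events:
        buckets[int(event.received_at // window_seconds)].append(event)
    return [buckets[key] for key in sorted(buckets)]


events = [
    Event(0.0, 1.0, "click"),
    Event(2.0, 4.5, "click"),
    Event(5.0, 6.0, "purchase"),
]
batches = micro_batches(events)
```

With the default window, the first two events fall into one micro batch and the third into the next. Streaming, by contrast, would handle each event individually as it arrives.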




Terminology (cont’d.)

Term: Description

Real-time response time: Very low latency between the event occurrence and insight generation, usually within seconds of the event occurrence.

Interactive response time: Time a user has to wait for the results. If it is within minutes, it is considered interactive. If the user needs to take a coffee break, it is batch.

Near real-time response time: Slightly higher latency than real time, usually within a few minutes of the event occurrence.

Analytics: Algorithms that run on the data. They range widely, from simple pre-computed aggregation to complex algorithms looking for patterns in data.

Insights: Results from the analytical algorithms, made available to applications or business users.

Actions: The activities that a business or an application performs in response to the information from the insights.
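As a rough sketch, the response-time terms above can be expressed as threshold checks. The numeric cut-offs are assumptions chosen only to make "within seconds" and "within minutes" concrete, as is the choice to label anything slower than near real-time as batch; the source gives no exact values.

```python
# Thresholds are illustrative assumptions, not values from the source.
REAL_TIME_MAX_S = 10.0           # stands in for "within seconds"
NEAR_REAL_TIME_MAX_S = 5 * 60.0  # stands in for "within a few minutes"
INTERACTIVE_MAX_S = 5 * 60.0     # stands in for "within minutes"


def classify_insight_latency(seconds: float) -> str:
    """Classify the delay between event occurrence and insight generation."""
    if seconds < 0:
        raise ValueError("latency cannot be negative")
    if seconds <= REAL_TIME_MAX_S:
        return "real-time"
    if seconds <= NEAR_REAL_TIME_MAX_S:
        return "near real-time"
    return "batch"


def classify_user_wait(seconds: float) -> str:
    """Classify how long a user waits for results."""
    if seconds < 0:
        raise ValueError("wait time cannot be negative")
    return "interactive" if seconds <= INTERACTIVE_MAX_S else "batch"
```

Two functions are used because the definitions measure different things: real-time and near real-time describe event-to-insight latency, while interactive describes how long a user waits.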




Appendix 3: Components of Business Data Lake




How is Business Data Lake different?

Criteria: Business Data Lake compared with EDW

Common data model
Business Data Lake: Base class = standard data; derived classes = local data.
EDW: Single class = single view across the enterprise.

Data quality
Business Data Lake: Full spectrum.

Data integration
[Shown as a graphic of binary digits in the original slide; no text recoverable.]

Multiple interfaces
Business Data Lake: SQL, SAS, R, MapReduce, NoSQL.
EDW: SQL access; integration with SAS, R and other analytical interfaces.

Mixed workload with varying QoS
Business Data Lake: Support for low latency, interactive and batch.
EDW: Limited QoS separation required.
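The "base class = standard data, derived classes = local data" idea in the common data model comparison can be sketched with ordinary class inheritance. The `Customer` and `RetailCustomer` names and their fields are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    """Base class: standard data shared across the enterprise (hypothetical fields)."""
    customer_id: str
    name: str


@dataclass
class RetailCustomer(Customer):
    """Derived class: adds local data used by one business unit (hypothetical field)."""
    loyalty_tier: str = "none"


def standard_view(customer: Customer) -> dict:
    """Return only the standard (base-class) fields, whatever the derived type."""
    return {"customer_id": customer.customer_id, "name": customer.name}


retail = RetailCustomer("c1", "Ada", "gold")
shared = standard_view(retail)
```

Any consumer that only needs the standard data can work against the base class, while a local team keeps its extra attributes in the derived class. The EDW approach in the comparison would instead force every attribute into a single class.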




Generic Business Data Lake architecture

[Figure: generic Business Data Lake architecture. Recoverable labels are listed below.]

Sources

Ingestion tier: real-time ingestion, micro batch ingestion, batch ingestion

Unified operations tier: system monitoring, system management

Unified data management tier: data management services, MDM, RDM, audit and policy management

Processing tier: workflow management

Distillation tier

Storage: HDFS storage (unstructured and structured data), in-memory, MPP database

Processing modes: real time, micro batch, mega batch

Query interfaces: SQL, NoSQL, MapReduce

Insights tier: real-time insights, interactive insights, batch insights

Action tier




Components of Business Data Lake

Term: Description

Storage: Ability to store ALL data (structured and unstructured) cost-efficiently in the Business Data Lake.

Ingestion: Ability to bring in data from multiple data sources across all timelines with varying QoS.

Distillation: Ability to take the data stored in the storage tier and convert it to structured data for easier analysis by downstream applications.

Processing: Ability to run analytical algorithms and user queries with varying QoS (real time, interactive, batch) to generate structured data for easier analysis by downstream applications.

Insights: Ability to analyze all the data with varying QoS (real time, interactive and batch) to generate insights for business decisioning.

Action: Ability to integrate the insights with the business decisioning systems.

Unified data management: Ability to manage the data lifecycle and access policy definition, and to provide master data management and reference data management services.

Unified operations: Ability to monitor, configure and manage the whole Data Lake from a single operations environment.




Pivotal components for the tiers

Term: Description

Storage: Pivotal HD.

Ingestion: GemFire XD, HAWQ, Pivotal HD and Spring XD.

Distillation: Pivotal Data Dispatch.

Processing: Pivotal HD, HAWQ and GemFire XD queries, optionally managed via Spring XD workflows.

Insights: Pivotal HD, HAWQ and GemFire XD queries from user applications.

Action: Big data applications, also known as business decisioning systems.

Unified data management: Pivotal Data Dispatch, master data management and reference data management services.

Unified operations: Pivotal Command Center (component of Pivotal HD to manage HAWQ and GemFire XD*), Spring XD monitoring and Pivotal Data Dispatch monitoring.
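The tier-to-component mapping above is easy to hold as a lookup table. The following is a small sketch that copies the product names from the table for five of the tiers. The dictionary layout and the `tiers_using` helper are illustrative assumptions, not part of any Pivotal API.

```python
# Product names are taken from the table above; the structure is illustrative.
TIER_COMPONENTS = {
    "storage": ["Pivotal HD"],
    "ingestion": ["GemFire XD", "HAWQ", "Pivotal HD", "Spring XD"],
    "distillation": ["Pivotal Data Dispatch"],
    # Spring XD workflows are optional for processing, so they are not listed here.
    "processing": ["Pivotal HD", "HAWQ", "GemFire XD"],
    "insights": ["Pivotal HD", "HAWQ", "GemFire XD"],
}


def tiers_using(component: str) -> list:
    """Return the tiers (sorted by name) whose component list includes `component`."""
    return sorted(
        tier for tier, components in TIER_COMPONENTS.items() if component in components
    )


hawq_tiers = tiers_using("HAWQ")
```

A reverse lookup like this makes it quick to see which tiers are affected when one product is upgraded or replaced.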




Data Lake interfaces

Ingestion Streaming Micro batch Batch Mega batch

Data Loader Yes Yes Yes

GemFire XD Yes

PDD

Spring XD Yes Yes Yes Yes

Sqoop Yes Yes

Distcp Yes Yes

Flume Yes Yes Yes

HDFS put Yes Yes

Talend Yes Yes

Informatica Yes Yes

Interface Real time Interactive Batch

GemFire XD (SQL) Yes Yes

HAWQ (SQL) Yes Yes Yes

Hive (HiveQL) Yes

HBase (NoSQL) Yes Yes

MapReduce Yes

Pig Yes

Impala (SQL) Yes Yes

BI Tools GemFire XD HAWQ Hive

MicroStrategy Yes Yes

BusinessObjects Yes Yes

Spotfire Yes Yes

Tableau Yes Yes

Microsoft Excel Yes Yes

Datameer Yes Yes

Karmasphere Yes Yes

Legend (color coding in the original slide): Pivotal, Apache, Partner, Competition.

[Figure labels: Pivotal Data Dispatch; Pivotal Command Center (monitoring, data management, configuration, install); data access; ingestion; analytics.]




Data ingestion

[Figure: event processing grid. It places GemFire XD, Spring XD and Data loader by event collection type (events or files), throughput (high or low) and timing (real time, streaming, micro batch, batch, mega batch). One cell is marked N/A.]

Spring XD: Out-of-the-box support for HTTP, Tail, Mail, Twitter, GemFire, TCP, JMS, RabbitMQ, Time, MQTT, …

Data loader: Moves massive amounts of data at wire speed with throttling capabilities.

GemFire XD: Insert data into GemFire XD with SQL, or send data to GemFire XD through an API.




Data access

HAWQ: SQL query for interactive data access. Connectivity with industry-standard BI tools.

Hive, HBase, MapReduce: HiveQL and MapReduce for batch data access. HBase for real-time lookup and simple data queries.

GemFire XD: SQL queries, NoSQL and alerting APIs for real-time data. Data persisted on HDFS is immediately available for interactive queries.

[Figure: data access grid. It places HAWQ, Hive, MapReduce, GemFire XD, HBase and Pig by access type (analytics, lookup, query) and timing (batch, real time, interactive).]

Data distillation

MapReduce, Pig: Use connectors, programs and models to convert to structured data.

[Figure: event access methods. Labels: event storage (unstructured, structured); interfaces: SQL, HiveQL, HBase APIs, MapReduce, Pig.]




Data distillation

MapReduce, Pig

Connectors from: Hadoop, Greenplum Database, GemFire/SQL Fire

[Figure: processing platform. Labels: data storage; native Hadoop; native HDFS; HAWQ; GemFire XD; PXF connectors.]





Unified data management: Pivotal Data Dispatch

All data stored on HDFS:

• Pivotal: GemFire XD/HAWQ

• Hadoop data: Hive/HBase

• Raw ingested data

IT managed:

• Data registered in PDD

• Data source connected and automated

• Target support for sandbox creation

• Auditable data access policy definition

Data worker:

• Self-service ability to access data on demand in a target sandbox from various sources while conforming to the data access policies.
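The division of labor above (IT registers data and defines auditable access policies, and data workers then request data for a sandbox) can be sketched as a small policy check. This is a hypothetical model and not the Pivotal Data Dispatch API: the `AccessPolicy` and `Catalog` classes, the role names and the sandbox identifiers are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AccessPolicy:
    # Hypothetical policy record defined by IT for one registered dataset.
    dataset: str
    allowed_roles: frozenset


@dataclass
class Catalog:
    # Hypothetical registry of datasets, with an audit trail of every request.
    policies: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def register(self, policy: AccessPolicy) -> None:
        """IT side: register a dataset together with its access policy."""
        self.policies[policy.dataset] = policy

    def request(self, user: str, role: str, dataset: str, sandbox: str) -> bool:
        """Data worker side: ask for a dataset to be provisioned to a sandbox."""
        policy = self.policies.get(dataset)
        granted = policy is not None and role in policy.allowed_roles
        # Every request is recorded, granted or not, so access stays auditable.
        self.audit_log.append((user, dataset, sandbox, granted))
        return granted


catalog = Catalog()
catalog.register(AccessPolicy("sales", frozenset({"analyst"})))
granted = catalog.request("ana", "analyst", "sales", "sandbox-1")
denied = catalog.request("bob", "intern", "sales", "sandbox-2")
```

Recording denied requests alongside granted ones is a design choice here; it keeps the audit trail complete without any extra step from the requester.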




Action tier: Decision maker expectations

Informational

Ability to get information in a dashboard

Integration with business intelligence tools

Tableau, MicroStrategy, BusinessObjects, Pentaho.

Alerting

Ability to alert the decision maker

Integration with the alert systems

Dashboard, alarms, emails, pagers, phones etc.

Automation

Ability to integrate with business decisioning systems

Integration with the applications to take automated actions

MessageMQ, Rabbit, Spring, & other technologies.

The information contained in this presentation is proprietary.


Rightshore® is a trademark belonging to Capgemini.

www.capgemini.com/bim

About Capgemini

With more than 130,000 people in 44 countries, Capgemini is one

of the world's foremost providers of consulting, technology and

outsourcing services. The Group reported 2012 global revenues

of EUR 10.3 billion.

Together with its clients, Capgemini creates and delivers

business and technology solutions that fit their needs and drive

the results they want. A deeply multicultural organization,

Capgemini has developed its own way of working, the

Collaborative Business Experience™, and draws on Rightshore®,

its worldwide delivery model.

