The Future of Analytics, Data Integration and BI on Big Data Platforms

Mark Rittman, Oracle ACE Director THE FUTURE OF ANALYTICS, DATA INTEGRATION AND BI ON BIG DATA PLATFORMS HADOOP USER GROUP IRELAND (HUG IRL) Dublin, September 2016

Upload: mark-rittman

Post on 16-Apr-2017

TRANSCRIPT

Page 1: The Future of Analytics, Data Integration and BI on Big Data Platforms

Mark Rittman, Oracle ACE Director

THE FUTURE OF ANALYTICS, DATA INTEGRATION

AND BI ON BIG DATA PLATFORMS

HADOOP USER GROUP IRELAND (HUG IRL)

Dublin, September 2016

Page 2: The Future of Analytics, Data Integration and BI on Big Data Platforms

•Mark Rittman, Co-Founder of Rittman Mead

•Oracle ACE Director, specialising in Oracle BI&DW

•14 Years Experience with Oracle Technology

•Regular columnist for Oracle Magazine

•Author of two Oracle Press Oracle BI books

•Oracle Business Intelligence Developers Guide

•Oracle Exalytics Revealed

•Writer for Rittman Mead Blog : http://www.rittmanmead.com/blog

•Email : [email protected]

•Twitter : @markrittman

About the Speaker

Page 3: The Future of Analytics, Data Integration and BI on Big Data Platforms

OR AS I SAY AT PARTIES…

Page 4: The Future of Analytics, Data Integration and BI on Big Data Platforms

Page 5: The Future of Analytics, Data Integration and BI on Big Data Platforms

BUT SERIOUSLY…

Page 6: The Future of Analytics, Data Integration and BI on Big Data Platforms

•Started back in 1996 on a bank Oracle DW project

•Our tools were Oracle 7.3.4, SQL*Plus, PL/SQL and shell scripts

•Went on to use Oracle Developer/2000 and Designer/2000

•Our initial users queried the DW using SQL*Plus

•And later on, we rolled out Discoverer/2000 to everyone else

•And life was fun…

20 Years in Old-school BI & Data Warehousing

Page 7: The Future of Analytics, Data Integration and BI on Big Data Platforms

•Data warehouses provided a unified view of the business

•Single place to store key data and metrics

•Joined-up view of the business

•Aggregates and conformed dimensions

•ETL routines to load, cleanse and conform data

•BI tools for simple, guided access to information

•Tabular data access using SQL-generating tools

•Drill paths, hierarchies, facts, attributes

•Fast access to pre-computed aggregates

•Packaged BI for fast-start ERP analytics

Data Warehouses and Enterprise BI Tools

[Diagram. Source systems: Core ERP Platform, Retail, Banking, Call Center, E-Commerce, CRM. Source databases: Oracle, MongoDB, Sybase, IBM DB/2, MS SQL Server. Data Warehouse: ODS/Foundation Layer, Access & Performance Layer. Business Intelligence Tools.]

Page 8: The Future of Analytics, Data Integration and BI on Big Data Platforms

•Examples were Crystal Reports, Oracle Reports, Cognos Impromptu, Business Objects

•Reports written against carefully-curated BI datasets, or directly connecting to ERP/CRM

•Adding data from external sources, or other RDBMSs, was difficult and involved IT resources

•Report-writing was a skilled job

•High ongoing cost for maintenance and changes

•Little scope for analysis, predictive modeling

•Often led to user frustration over the pace of delivery

Reporting Back Then…

Page 9: The Future of Analytics, Data Integration and BI on Big Data Platforms

•For example Oracle OBIEE, SAP Business Objects, IBM Cognos

•Full-featured, IT-orientated enterprise BI platforms

•Metadata layers, integrated security, web delivery

•Pre-built ERP metadata layers, dashboards + reports

•Federated queries across multiple sources

•Single version of the truth across the enterprise

•Mobile, web dashboards, alerts, published reports

•Integration with SOA and web services

Then Came Enterprise BI Tools

Page 10: The Future of Analytics, Data Integration and BI on Big Data Platforms

THEN CAME … BIG DATA

Page 11: The Future of Analytics, Data Integration and BI on Big Data Platforms

AND HADOOP

Page 12: The Future of Analytics, Data Integration and BI on Big Data Platforms

BIG, FAST AND FAULT-TOLERANT

Page 13: The Future of Analytics, Data Integration and BI on Big Data Platforms

•Data from new-world applications is not like historic data

•Typically comes in non-tabular form

•JSON, log files, key/value pairs

•Users often want it speculatively

•Haven’t thought it through

•Schema can evolve

•Or maybe there isn’t one

•But the end-users want it now

•Not when you’re ready

But Why Hadoop? Reason #1 - Flexible Storage

[Diagram: Big Data Management Platform. Discovery & Development Labs (safe & secure discovery and development environment): datasets and samples, models and programs. Single Customer View and Enriched Customer Profile: correlating, modeling, machine learning, scoring. Schema-on-Read Analysis.]
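The schema-on-read idea behind this slide can be sketched in a few lines of Python. This is an illustration only, not how any particular Hadoop tool implements it: the sample events, the field names and the `read_with_schema` helper are all hypothetical. Raw JSON lines are stored untouched and a schema is applied only when the data is read, so records with missing or newly added fields still load.

```python
import json

# Raw events land as-is (hypothetical sample data). The second record has
# an extra field and the third is missing "amount".
RAW_LINES = [
    '{"user": "alice", "amount": "12.50"}',
    '{"user": "bob", "amount": "3", "channel": "mobile"}',
    '{"user": "carol"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time; schema maps field -> (cast, default)."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else default
        yield row


# Today's question only needs user and amount...
schema_v1 = {"user": (str, ""), "amount": (float, 0.0)}
# ...tomorrow's adds channel, without reloading the raw data.
schema_v2 = {**schema_v1, "channel": (str, "unknown")}

rows_v1 = list(read_with_schema(RAW_LINES, schema_v1))
rows_v2 = list(read_with_schema(RAW_LINES, schema_v2))
total = sum(row["amount"] for row in rows_v1)
print(total)
```

The design point is that the schema lives with the query rather than with the stored data, which is what lets it evolve, or be absent, at load time.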

Page 14: The Future of Analytics, Data Integration and BI on Big Data Platforms

•Enterprise High-End RDBMSs such as Oracle can scale

•Clustering for single-instance DBs can scale to >PB

•Exadata scales further by offloading queries to storage

•Sharded databases (e.g. Netezza) can scale further

•But cost (and complexity) become limiting factors

•$1m/node is not uncommon

But Why Hadoop? Reason #2 - Massive Scalability

Page 15: The Future of Analytics, Data Integration and BI on Big Data Platforms

•Hadoop started by being synonymous with MapReduce, and Java coding

•But YARN (Yet Another Resource Negotiator) broke this dependency

•Modern Hadoop platforms provide overall cluster resource management, but support multiple processing frameworks

•General-purpose (e.g. MapReduce)

•Graph processing

•Machine Learning

•Real-Time Processing (Spark Streaming, Storm)

•Even the Hadoop resource management framework can be swapped out

•Apache Mesos

Reason #3 - Processing Frameworks

[Diagram: Big Data Platform, all running natively under Hadoop. HDFS (cluster filesystem holding raw data); YARN (cluster resource management); Batch (MapReduce); Interactive (Impala, Drill, Tez, Presto); Streaming + In-Memory (Spark, Storm); Graph + Search (Solr, Giraph). Also shown: Enriched Customer Profile, Modeling, Scoring.]
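To make the "general-purpose (e.g. MapReduce)" entry concrete, here is a single-process Python sketch of the map, shuffle and reduce phases. It only imitates the programming model: a real Hadoop job distributes these phases across the cluster, and the function names below are illustrative rather than any Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (word, 1) pairs for one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data on Hadoop", "fast data on Hadoop"])
print(counts)
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the work can be split across many machines, which is the property the other frameworks on this slide also build on.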

Page 16: The Future of Analytics, Data Integration and BI on Big Data Platforms

•Data is now landed in Hadoop clusters, NoSQL databases and Cloud Storage

•Flexible data storage platform with cheap storage, flexible schema support + compute

•Data lands in the data lake or reservoir in raw form, and is then minimally processed

•Data is then accessed directly by “data scientists”, or processed further into a DW

Meet the New Data Warehouse : The “Data Lake”

[Diagram: Hadoop Platform, with Data Transfer, Data Factory, Data Reservoir and Data Access areas. Integration: File Based Integration, Stream Based Integration, ETL Based Integration. Inputs: Data streams; Operational Data (Transactions, Customer Master Data); Unstructured Data (Voice + Chat Transcripts). Raw Customer Data: data stored in the original format (usually files) such as SS7, ASN.1, JSON etc. Mapped Customer Data: datasets produced by mapping and transforming raw data. Discovery & Development Labs (safe & secure discovery and development environment): datasets and samples, models and programs. Consumers: Business Intelligence Tools; Marketing/Sales Applications (Models, Machine Learning, Segments).]

Page 17: The Future of Analytics, Data Integration and BI on Big Data Platforms

NEW STARTUPS ENABLING A HYBRID “OLD WORLD/NEW WORLD” APPROACH

Page 18: The Future of Analytics, Data Integration and BI on Big Data Platforms

AND PERFECT FOR ANALYTICS

Page 19: The Future of Analytics, Data Integration and BI on Big Data Platforms

•Enterprise High-End RDBMSs such as Oracle can scale into the petabytes, using clustering

•Sharded databases (e.g. Netezza) can scale further but with complexity / single workload trade-offs

•Hadoop was designed from the outset for massive horizontal scalability - using cheap hardware

•Anticipates hardware failure and makes multiple copies of data as protection

•The more nodes you add, the more stable it becomes

•And at a fraction of the cost of traditional RDBMS platforms

Hadoop : The Default Platform Today for Analytics
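The "multiple copies of data as protection" point can be sketched as follows. This is a simplified model under stated assumptions: the node names are hypothetical, and blocks are placed round-robin, whereas real HDFS placement is rack-aware. The replication factor of 3 matches the HDFS default.

```python
REPLICATION = 3  # HDFS's default replication factor


def place_blocks(block_ids, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes). Real HDFS placement is rack-aware;
    round-robin just keeps the sketch short.
    """
    placement = {}
    for offset, block in enumerate(block_ids):
        placement[block] = {
            nodes[(offset + i) % len(nodes)] for i in range(replication)
        }
    return placement


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {block for block, holders in placement.items() if holders - failed}


nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(range(10), nodes)

after_two_failures = readable_blocks(placement, ["node1", "node2"])
after_three_failures = readable_blocks(placement, ["node1", "node2", "node3"])
print(len(after_two_failures), len(after_three_failures))
```

With three replicas per block, any two simultaneous node failures leave every block readable. Data is only at risk when all three holders of a block fail before the cluster has re-replicated it.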

Page 20: The Future of Analytics, Data Integration and BI on Big Data Platforms

BI INNOVATION IS HAPPENING AROUND HADOOP

Page 21: The Future of Analytics, Data Integration and BI on Big Data Platforms

“WE’RE WINNING!”

Page 22: The Future of Analytics, Data Integration and BI on Big Data Platforms
Page 23: The Future of Analytics, Data Integration and BI on Big Data Platforms

BUT…

Page 24: The Future of Analytics, Data Integration and BI on Big Data Platforms

Isn’t Hadoop slow?

Page 25: The Future of Analytics, Data Integration and BI on Big Data Platforms

Too slow for ad-hoc querying?

Page 26: The Future of Analytics, Data Integration and BI on Big Data Platforms

WELCOME TO 2016

Page 27: The Future of Analytics, Data Integration and BI on Big Data Platforms
Page 28: The Future of Analytics, Data Integration and BI on Big Data Platforms
Page 29: The Future of Analytics, Data Integration and BI on Big Data Platforms

(HADOOP 2.0)

Page 30: The Future of Analytics, Data Integration and BI on Big Data Platforms
Page 31: The Future of Analytics, Data Integration and BI on Big Data Platforms

HADOOP IS NOW FAST

Page 32: The Future of Analytics, Data Integration and BI on Big Data Platforms

Hadoop 2.0 Processing Frameworks + Tools

Page 33: The Future of Analytics, Data Integration and BI on Big Data Platforms

•Cloudera’s answer to Hive query response time issues

•MPP SQL query engine running on Hadoop, bypasses MapReduce for direct data access

•Mostly in-memory, but spills to disk if required

•Uses Hive metastore to access Hive table metadata

•Similar SQL dialect to Hive - not as rich though and no support for Hive SerDes, storage handlers etc

Cloudera Impala - Fast, MPP-style Access to Hadoop Data

Page 34: The Future of Analytics, Data Integration and BI on Big Data Platforms

•Beginners usually store data in HDFS using text file formats (CSV) but these have limitations

•Apache Avro often used for general-purpose processing

•Splittability, schema evolution, in-built metadata, support for block compression

•Parquet now commonly used with Impala due to column-orientated storage

•Mirrors work in RDBMS world around column-store

•Only return (project) the columns you require across a wide table

Parquet - Column-Orientated Storage for Analytics
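The benefit of projecting only the columns you need can be shown with a toy comparison in Python. This is a sketch of the general column-store idea, not of Parquet's actual file layout: the table contents and the "values touched" cost measure are hypothetical stand-ins for I/O.

```python
# The same three-row table, held row by row (hypothetical data).
ROWS = [
    {"id": 1, "country": "IE", "revenue": 120.0, "notes": "x" * 50},
    {"id": 2, "country": "UK", "revenue": 80.0, "notes": "y" * 50},
    {"id": 3, "country": "IE", "revenue": 40.0, "notes": "z" * 50},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


COLUMNS = to_columns(ROWS)


def project_row_store(rows, wanted):
    """A row store reads whole rows, then discards the unwanted fields."""
    touched = sum(len(row) for row in rows)
    result = [[row[name] for name in wanted] for row in rows]
    return result, touched


def project_column_store(columns, wanted):
    """A column store reads only the projected columns."""
    touched = sum(len(columns[name]) for name in wanted)
    result = [list(values) for values in zip(*(columns[name] for name in wanted))]
    return result, touched


row_result, row_cost = project_row_store(ROWS, ["country", "revenue"])
col_result, col_cost = project_column_store(COLUMNS, ["country", "revenue"])
print(row_cost, col_cost)
```

Both layouts return the same answer, but the column layout touches half as many values here, and the gap widens as the table gets wider.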

Page 35: The Future of Analytics, Data Integration and BI on Big Data Platforms

•But Parquet (and HDFS) have significant limitations for real-time analytics applications

•Append-only orientation, focus on column-store makes streaming ingestion harder

•Cloudera Kudu aims to combine best of HDFS + HBase

•Real-time analytics-optimised

•Supports updates to data

•Fast ingestion of data

•Accessed using SQL-style tables and get/put/update/delete API

Cloudera Kudu - Best of HBase and Column-Store

Page 36: The Future of Analytics, Data Integration and BI on Big Data Platforms

•Kudu storage used with Impala - create tables using Kudu storage handler

•Can now UPDATE, DELETE and INSERT into Hadoop tables, not just SELECT and LOAD DATA

Example Impala DDL + DML Commands with Kudu

CREATE TABLE `my_first_table` (
  `id` BIGINT,
  `name` STRING
)
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'my_first_table',
  'kudu.master_addresses' = 'kudu-master.example.com:7051',
  'kudu.key_columns' = 'id'
);

INSERT INTO my_first_table VALUES (99, "sarah");
INSERT IGNORE INTO my_first_table VALUES (99, "sarah");

UPDATE my_first_table SET name="bob" WHERE id = 3;

DELETE FROM my_first_table WHERE id < 3;

DELETE c FROM my_second_table c, stock_symbols s WHERE c.name = s.symbol;

Page 37: The Future of Analytics, Data Integration and BI on Big Data Platforms

AND IT’S NOW IN-MEMORY

Page 38: The Future of Analytics, Data Integration and BI on Big Data Platforms
Page 39: The Future of Analytics, Data Integration and BI on Big Data Platforms

Accompanied by Innovations in Underlying Platform

Cluster Resource Management to support multi-tenant distributed services

In-Memory Distributed Storage, to accompany In-Memory Distributed Processing

Page 40: The Future of Analytics, Data Integration and BI on Big Data Platforms

DATAFLOW PIPELINES ARE THE NEW ETL

Page 41: The Future of Analytics, Data Integration and BI on Big Data Platforms

New ways to do BI

Page 42: The Future of Analytics, Data Integration and BI on Big Data Platforms

New ways to do BI

Page 43: The Future of Analytics, Data Integration and BI on Big Data Platforms

HADOOP IS THE NEW ETL ENGINE

Page 44: The Future of Analytics, Data Integration and BI on Big Data Platforms

Proprietary ETL is Dead. Apache-based ETL is What’s Next

[Timeline diagram, credited on the slide to Oracle Open World 2015 (Copyright © 2015, Oracle and/or its affiliates. All rights reserved.). Eras shown: Eon of Scripts and PL-SQL; Period of Proprietary Batch ETL Engines; Era of SQL E-LT/Pushdown; Big Data ETL in Batch; Streaming ETL. Dates marked: 1990’s, 1994. Tools shown: Scripted SQL, Stored Procs; Informatica, Ascential/IBM, Ab Initio, Acta/SAP, SyncSort; Warehouse Builder, Oracle Data Integrator; ODI for Columnar, ODI for In-Mem, ODI for Exadata; ODI for Hive, ODI for Pig & Oozie; ODI for Spark, ODI for Spark Streaming. Callout: "Proprietary ETL engines die circa 2015 - folded into big data".]

Page 45: The Future of Analytics, Data Integration and BI on Big Data Platforms

MACHINE LEARNING & SEARCH FOR “AUTOMAGIC” SCHEMA DISCOVERY

Page 46: The Future of Analytics, Data Integration and BI on Big Data Platforms

New ways to do BI

Page 47: The Future of Analytics, Data Integration and BI on Big Data Platforms

•By definition there's lots of data in a big data system ... so how do you find the data you want?

•Google's own internal solution - GOODS ("Google Dataset Search")

•Uses crawler to discover new datasets

•ML classification routines to infer domain

•Data provenance and lineage

•Indexes and catalogs 26bn datasets

•Other users, vendors also have solutions

•Oracle Big Data Discovery

•Datameer

•Platfora

•Cloudera Navigator

Google GOODS - Catalog + Search At Google-Scale
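The crawl, classify and catalog steps listed above can be sketched in miniature. This is a toy illustration, not a description of how GOODS or any of the named products work: the dataset paths and column names are hypothetical, and simple keyword rules stand in for the ML classification step.

```python
# Hypothetical datasets a crawler might discover: path -> column names.
DISCOVERED = {
    "/lake/raw/orders_2016.csv": ["order_id", "customer_id", "amount"],
    "/lake/raw/clicks.json": ["session_id", "url", "timestamp"],
    "/lake/mapped/customers.parquet": ["customer_id", "name", "email"],
}

# Keyword rules standing in for the ML classification routines.
DOMAIN_KEYWORDS = {
    "sales": {"order_id", "amount"},
    "web": {"url", "session_id"},
    "customer": {"name", "email"},
}


def infer_domain(columns):
    """Pick the domain whose keywords overlap the columns the most."""
    scores = {
        domain: len(keywords & set(columns))
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def build_catalog(discovered):
    """One catalog entry per dataset: its columns plus an inferred domain."""
    return {
        path: {"columns": columns, "domain": infer_domain(columns)}
        for path, columns in discovered.items()
    }


def search(catalog, column):
    """Find every dataset that exposes the given column."""
    return sorted(path for path, meta in catalog.items() if column in meta["columns"])


catalog = build_catalog(DISCOVERED)
hits = search(catalog, "customer_id")
print(hits)
```

Searching the catalog for `customer_id` surfaces both the raw orders file and the mapped customer dataset, which is the kind of "where is the data I want?" question the slide describes.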

Page 48: The Future of Analytics, Data Integration and BI on Big Data Platforms

A NEW TAKE ON BI

Page 49: The Future of Analytics, Data Integration and BI on Big Data Platforms

•Came out of the data science movement, as a way to "show workings"

•A set of reproducible steps that tell a story about the data

•as well as being a better command-line environment for data analysis

•One example is Jupyter, the evolution of the IPython Notebook

•Supports PySpark, pandas etc

•See also Apache Zeppelin

Web-Based Data Analysis Notebooks

Page 50: The Future of Analytics, Data Integration and BI on Big Data Platforms

AND EMERGING OPEN-SOURCE BI TOOLS AND PLATFORMS

Page 51: The Future of Analytics, Data Integration and BI on Big Data Platforms

And Emerging Open-Source BI Tools and Platforms

http://larrr.com/wp-content/uploads/2016/05/paper.pdf

Page 52: The Future of Analytics, Data Integration and BI on Big Data Platforms
Page 53: The Future of Analytics, Data Integration and BI on Big Data Platforms

And Emerging Open-Source BI Tools and Platforms

Page 54: The Future of Analytics, Data Integration and BI on Big Data Platforms

WELCOME TO THE FUTURE

Page 55: The Future of Analytics, Data Integration and BI on Big Data Platforms
