reducing time to value - dm and analytical tools available ...files.meetup.com/10751222/reducing...


Page 1: Reducing Time To Value - DM And Analytical Tools Available ...files.meetup.com/10751222/Reducing Time To Value - DM And An… · 28/04/2016 Copyright © Intelligent Business Strategies

28/04/2016

Copyright©IntelligentBusinessStrategies1992-2016–AllRightsReserved 1

Reducing Time To Value - Data Management And Analytical Tools On Spark and Hadoop

Mike Ferguson
Managing Director, Intelligent Business Strategies
HUG Manchester Meetup, April 2016

2 Copyright © Intelligent Business Strategies 1992-2016!

About Mike Ferguson

Mike Ferguson is Managing Director of Intelligent Business Strategies Limited. As an analyst and consultant he specializes in BI/Analytics, data management and big data. With over 34 years of IT experience, Mike has consulted for dozens of companies, spoken at events all over the world and written numerous articles. Formerly he was a principal and co-founder of Codd and Date Europe Limited – the inventors of the Relational Model, a Chief Architect at Teradata on the Teradata DBMS and European Managing Director of DataBase Associates.

www.intelligentbusiness.biz

Twitter: @mikeferguson1

Tel/Fax (+44)1625 520700


Topics

§  Speeding up Data Science - why no programming is a valid option

§  Key requirements for tools if they are to improve productivity

§  Preparing data for analysis without programming using data wrangling tools

§ Model development tools that exploit Spark and in-Hadoop analytics

§  Building workflow based analytical applications without programming

§  Building streaming analytic applications without programming

§  Text analytics and the power of search

§  Interactive data discovery and data visualization tools


Today Both Structured And Multi-Structured Data Are Needed For Deeper Insight

Multi-structured data (often un-modelled and may not be well understood)

    •  Click stream web log data
    •  Customer interaction data
    •  Social interaction data
    •  Sensor data
    •  Rich media data (video, audio)
    •  External content
    •  Documents
    •  Internal web content
    •  Seismic data (oil & gas)

Structured data (often a schema is defined and data is well understood)

    •  OLTP system data
    •  Data warehouse data
    •  Personal data stores e.g. Excel, Access


Different Platforms Optimised For Different Analytical Workloads Are Now Needed

Big Data analytical workloads have resulted in multiple platforms now being used for analytical processing

[Diagram: analytical platforms shown side by side. Data warehouse RDBMS (EDW, DW & marts); DW appliance / analytical RDBMS for advanced analytics on structured data; streaming analytics for real-time streaming analytics and decision management on streaming data; Hadoop data store for advanced analytics on multi-structured data and investigative / exploratory analysis; NoSQL DBMS (e.g. graph DB) for graph analysis; MDM (Cust, Prod, Asset; CRUD). Tools on top: self-service BI and analytical tools; IT developed queries, reports & dashboards; data mining and model development.]


Data Scientists Are Doing Exploratory Analysis, Developing Analytical Models And Applications Across The Ecosystem

[Diagram: the same platforms (data warehouse RDBMS with EDW, DW & marts; DW appliance / analytical RDBMS for advanced analytics on structured data; streaming analytics for real-time streaming analytics and decision management; Hadoop data store for advanced analytics on multi-structured data; NoSQL DBMS e.g. graph DB; MDM with Cust, Prod, Asset; CRUD), with the Data Scientist working across all of them: exploratory analysis, graph analysis, text analysis, data mining and model development.]


Problems With Big Data Analytic Application Development

§  Reliance on very highly skilled data scientists has become a barrier to adoption
    •  Limited availability of skilled employees

§  Very high bar set to find people with skills in:
    •  Data engineering
    •  Mathematics and statistics
    •  Java, Python or Scala programming
    •  R programming
    •  Data visualisation
    •  Communication with business

§  Slow pace of building analytic applications
    •  Writing code is time consuming and expensive
    •  Too dependent on developers who may be a bottleneck
    •  High maintenance costs, no metadata, staff turnover…


Speeding up Data Science Requires Automation, Simplification and Provisioning of Insight

§  Need more automation and simplification
    •  E.g. Raise level of abstraction where programming skills are no longer needed to prepare and integrate data

§  Lower the bar on skillsets
    •  Enable the ‘Citizen Data Scientists’ – business analysts
    •  Need a greater reliance on business analysts and data architects in big data environments in future

§  Introduce automation to increase agility and reduce time to value
    •  Automated data discovery and profiling of new data sources
    •  Generate code to exploit new technology more rapidly and reduce reliance on programming

§  Deliver actionable insight to the point of need by integrating into processes and applications


New Tools Are Needed To Allow Business Analysts To Become “Citizen Data Scientists”

[Diagram: three roles (Data Scientist; Business Analyst; Business Manager / Operations worker / Customer) with their activity labels. Labels shown: exploratory analysis; predictive / statistical model producer; model consumer; data blending; data visualisation; information producer (build reports, build and publish dashboards); insight interpreter; storyteller and collaborator; business communicator; information consumer; decision maker; collaborator; action taker. New tools add "Citizen Data Scientist" to the Business Analyst role.]


Key Requirements For Tools To Improve Productivity And Reduce Time To Value - 1

§  Be able to develop batch Spark and stream processing analytical applications without the need for programming

§  Develop Spark and streaming analytic applications using pipelines (workflows) so ETL developers can retain their skills and business analysts can participate

§  Be able to filter data from streaming data sources for storage in Hadoop or an analytic RDBMS

§  Deploy analytics in-database, in-stream and in-Hadoop for scalability

§  Align analytics with business strategy by tagging them
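As an illustration of the requirement to filter streaming data before it is stored in Hadoop or an analytic RDBMS, the following is a minimal Python sketch. It is not taken from any of the tools in this deck: the `filter_and_batch` function, the sensor readings and the temperature threshold are all hypothetical, and a real tool would generate equivalent logic from a graphical pipeline.

```python
from itertools import islice


def filter_and_batch(events, keep, batch_size):
    """Yield lists of up to batch_size events for which keep(event) is true."""
    wanted = (event for event in events if keep(event))
    while True:
        batch = list(islice(wanted, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical sensor readings; only hot readings are kept for storage.
readings = [{"sensor": "a", "temp": t} for t in (20, 95, 30, 99, 101)]
batches = list(filter_and_batch(readings, lambda e: e["temp"] > 90, batch_size=2))

for batch in batches:
    # A real pipeline would write each batch to HDFS or a database here.
    print([event["temp"] for event in batch])
```

Batching the surviving events keeps writes to the target store coarse-grained, which is usually what a file system such as HDFS prefers.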


Key Requirements For Tools To Improve Productivity And Reduce Time To Value - 2

§  Publish data integration workflows as trusted data services for consumption (use) by:
    •  People building analytics e.g. developers, data scientists, business analysts

§  Publish analytic workflows as services so they can be
    •  Consumed in other tools and apps to build powerful data driven analytic applications
    •  Nested in other workflows

§  Create a catalog of available trusted data, data services and analytic services

§  Enrich customer master data, data warehouses and data marts with new data and insights


Reducing Time To Value - The Objective Is To Accelerate The Creation of Analytical Process

[Process diagram: Acquire data; Data Integration / Data Preparation (clean, transform, filter); Analyse (e.g. Score); Visualise; Decide; Act; with Embed as an alternative route for delivering insight.]

How do you accelerate this process?

Do you have to code everything?

What tools are available to the ‘Citizen Data Scientist’ to help accelerate elements of this process or even the whole process?

What other factors are critical to success?


Technology Frequently Used In Data Science

§  Self-service data preparation tools

§  Data mining tools
    •  Workflow based data mining tools
    •  Statistical analysis
    •  Built-in data preparation
    •  Machine learning algorithms

§  Streaming analytics platform workbenches

§  Analytical application development
    •  Typically on Spark or Hadoop MapReduce
    •  Programming in Python, Java, Scala and R
    •  Often using interactive workbench technologies


Topics – Where Are We?

§  Speeding up Data Science - why no programming is a valid option

§  Key requirements for tools if they are to improve productivity

Ø Preparing data for analysis without programming using data wrangling tools

§ Model development tools that exploit Spark and in-Hadoop analytics

§  Building workflow based analytical applications without programming

§  Building streaming analytic applications without programming

§  Text analytics and the power of search

§  Interactive data discovery and data visualization tools


Evolution of Big Data Integration Has Followed The Same Cycle as it Did in Data Warehousing

[Diagram: Evolution of Big Data Integration. In data warehousing, hand coded ETL programs were followed by ELT processing; on Hadoop, hand coded programs are being followed by generated Spark or MapReduce ELT processing.]


Scaling ETL Transformations for In-Hadoop ELT Processing

Data Cleansing and Integration Tool: Extract, Parse, Clean, Transform, Analyse, Load, Insights

§  Option 1: ETL tool generates HQL, or converts generated SQL to HQL

§  Option 2: ETL tool generates Pig (compiler converts every transform to a MapReduce job) or JAQL

§  Option 3: ETL tool generates 3GL MR or Spark code

§  Option 4 (other): Native massively parallel transformation and integration, bypassing any Hadoop execution engine

Allows ETL developers to use their skills to prepare and integrate data at scale without the need for programming
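To make the first option concrete, here is a minimal sketch of how a tool might turn a declarative mapping into HiveQL. It is illustrative only and does not reflect any vendor's code generator: the `generate_hql` function, the table names and the column mappings are hypothetical, while `upper`, `trim` and `regexp_replace` are standard Hive built-in functions.

```python
def generate_hql(source, target, columns, filters=()):
    """Build an INSERT ... SELECT statement from a declarative mapping.

    columns is a list of (output_name, expression) pairs;
    filters is a sequence of SQL predicates combined with AND.
    """
    select_list = ",\n  ".join(f"{expr} AS {name}" for name, expr in columns)
    hql = f"INSERT OVERWRITE TABLE {target}\nSELECT\n  {select_list}\nFROM {source}"
    if filters:
        hql += "\nWHERE " + " AND ".join(f"({predicate})" for predicate in filters)
    return hql


# Hypothetical mapping a developer would draw in a graphical ETL tool.
statement = generate_hql(
    source="raw_customers",
    target="clean_customers",
    columns=[
        ("customer_id", "customer_id"),
        ("last_name", "upper(trim(last_name))"),
        ("postcode", "regexp_replace(postcode, ' ', '')"),
    ],
    filters=["customer_id IS NOT NULL"],
)
print(statement)
```

The developer only maintains the mapping; the generated statement is what runs at scale inside Hadoop, which is the point of ELT push-down.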


Generation of Spark MPP In-Memory Data Integration AND Analysis Jobs – E.g. Talend

Source: Talend



IBM BigIntegrate Supports Data Pipelining With Auto Data Repartitioning for Maximum Throughput

Source: IBM

[Diagram: data is pipelined from source to target and automatically repartitioned between stages, first by customer last name (ranges A-F, G-M, N-T, U-Z), then by customer postcode, then by credit card number.]

Runs on
    •  BigInsights
    •  ODP with Apache Hadoop
    •  Hortonworks
    •  Cloudera CDH
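The repartitioning idea on this slide can be sketched in a few lines of Python. This is an assumption-laden illustration, not IBM's implementation: the function names, the sample customers and the use of a CRC32 hash for the second stage are all hypothetical choices. Only the last-name ranges come from the slide.

```python
import zlib
from collections import defaultdict

# Last-name ranges as shown on the slide.
NAME_RANGES = [("A", "F"), ("G", "M"), ("N", "T"), ("U", "Z")]


def range_partition(records, key, ranges):
    """Place each record in the partition whose range covers the key's initial."""
    partitions = defaultdict(list)
    for record in records:
        initial = record[key][:1].upper()
        index = next(
            (i for i, (low, high) in enumerate(ranges) if low <= initial <= high),
            len(ranges),  # overflow partition for keys outside every range
        )
        partitions[index].append(record)
    return partitions


def hash_repartition(partitions, key, count):
    """Redistribute records so that equal key values share a partition."""
    result = defaultdict(list)
    for records in partitions.values():
        for record in records:
            bucket = zlib.crc32(record[key].encode("utf-8")) % count
            result[bucket].append(record)
    return result


customers = [
    {"last_name": "Adams", "postcode": "M1 1AA"},
    {"last_name": "Khan", "postcode": "SK9 5AF"},
    {"last_name": "Turner", "postcode": "M1 1AA"},
    {"last_name": "Young", "postcode": "SK9 5AF"},
]
by_name = range_partition(customers, "last_name", NAME_RANGES)
by_postcode = hash_repartition(by_name, "postcode", count=2)
```

Each stage of a pipeline can then work on the key it needs (for example matching on postcode) without the developer coding the data movement between stages.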


Informatica Is Also Running Native On Hadoop - Blaze Engine And Cluster Aware Layer (CAL)

§  Distributed engine runs directly on Hadoop YARN

§  Leverages all the compute nodes on a Hadoop cluster

§  Automatic intelligent data pipelining, job partitioning, scaling for large concurrent workloads

§  Cluster Aware Layer (CAL) hides cluster specific interactions
    •  Resource Management
    •  Distributed File System
    •  Cluster Management

§  Choice of execution on Map-Reduce, BLAZE or INFA engines outside of Hadoop

[Diagram: the INFA DIS (Data Integration Server) contains the Data Engine Compiler, INFA Hive Executor, Blaze Executor and DIS CAL. It talks to the Hadoop cluster through the Hive Driver and Hive MetaStore. Inside the cluster, the Hive Runtime runs on Map-Reduce and the Blaze Runtime runs on YARN through the Hadoop CAL, all over HDFS.]

Source: Informatica


Self-Service Data Integration Tool Vendors

§  Actian Dataflow

§  Alteryx

§  ClearStory Data

§  Datameer

§  IBM DataWorks

§  Informatica Rev

§  Paxata

§  SAS Data Loader for Hadoop

§  Tamr

§  Trifacta

[Diagram: two copies of the process strip (Acquire data, Data Integration / Data Preparation, Analyse, Visualise, Decide, Act, Embed) grouping the vendors into those covering data preparation, integration, analysis & visualisation, and those covering data preparation and integration only.]


Business User Data Wrangling Product Example - Paxata (Aimed At Data Scientists)

Paxata auto data profiling

Paxata in-line transformations


Business User Data Wrangling Product Example - Paxata Cluster and Edit To Help De-Duplicate Data

Source: Paxata

Paxata applies k-means clustering to each column to group together similar values
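For readers unfamiliar with the technique, the sketch below shows one way k-means can group similar spellings in a column. It is not Paxata's implementation, whose details the slide does not give. Representing each value as character-bigram counts, seeding the centroids from two chosen values, and the sample company names are all assumptions made for illustration.

```python
from collections import Counter


def bigrams(value):
    """Represent a string as counts of its lower-cased character bigrams."""
    text = value.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def distance(a, b):
    """Squared Euclidean distance between two sparse bigram vectors."""
    return sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in set(a) | set(b))


def mean(vectors):
    """Component-wise mean of sparse vectors."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {k: count / len(vectors) for k, count in total.items()}


def kmeans(values, seeds, iterations=10):
    """Cluster strings with k-means, one cluster per seed value."""
    vectors = [bigrams(v) for v in values]
    centroids = [bigrams(s) for s in seeds]
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid for every value.
        labels = [
            min(range(len(centroids)), key=lambda c: distance(vec, centroids[c]))
            for vec in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [vec for vec, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean(members)
    return labels


names = ["Acme Corp", "ACME Corp.", "Acme Corporation",
         "Globex", "Globex Inc", "globex inc."]
labels = kmeans(names, seeds=["Acme Corp", "Globex"])
print(list(zip(names, labels)))
```

Values that land in the same cluster are candidates for being edited to a single canonical spelling, which is the de-duplication step the slide describes.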


Paxata Technical Architecture

© Paxata, Inc.

[Architecture diagram, layers from top to bottom:]

§  UI Layer: multi-user aware, HTML5 & multi-device ready, Data Driven Design, etc.

§  Connectivity And API Toolkit: ODBC, JDBC, Web services, etc.

§  Data Preparation Application Web Services: Data Manager, Script Manager, Semantic Relationships, Projects, Users, Tenants

§  Parallel In-Memory Pipelined Data Prep Engine powered by Intellifusion™
    •  Pax Requests (view, histograms, cluster, relationships)
    •  Pax RDDs (projection, filter, aggregation, pivot)
    •  Pax AnswerSet Manager
    •  Pax Compiler
    •  Pax Data Access (Pax Data Library)
    •  Pax Cache Manager (remote + in-line)

§  Distributed Processing Engine

§  Scheduling and Resource Management: YARN

§  Distributed file system: HDFS

Source: Paxata


Trifacta Predictive Interaction User Interface – Text Extraction Example

Figure 1: Predictive Interaction for text pattern specification. The left image shows the interface after the user has highlighted the string mobile in line 34. The right shows the interface after one more gesture: highlighting the string dynamic in line 31. Note that the top-ranked suggested transform changes after the second highlight, and hence so do the Source and Preview contents.

Figure 2: A ranked list of regular expressions.

a visual rendering of their data in a familiar tabular grid. They can guide the system by highlighting substrings in the table, which are added to an example set. Based on this set, an inference algorithm produces a ranked list of suggested text patterns that model the set well. For the top-ranked pattern, the table renderer highlights any matches found, and shows how those matches will be used.

Figure 1 shows the states of the interface after the user makes each of two guiding interactions: first, highlighting the string mobile in row 34, and then highlighting the additional string dynamic in row 31. The user interface shows the highlighted patterns in the source (blue), and the outcome of a text extraction transform in a preview column (tan). The user can choose to view the outputs of other suggested transforms by clicking on them in the top panel; they can also edit the patterns directly in a Transform Editor. When the user decides on the best pattern, they can click the “plus” (+) to the right of the transform to add it to a DSL script.

In our initial prototype the suggested transforms looked different than what is shown in Figure 1. Originally, users would see a ranked list of REs in a traditional syntax, as shown in Figure 2 (corresponding to the ranked list of suggested transforms on the right of Figure 1). In user studies we found that even experienced programmers had difficulty deciding quickly and accurately among alternative REs. It seems that RE syntax is better suited to writing patterns than to reading them. Hence we changed our DSL to a new pattern language (compilable to REs) that is better suited to rapid disambiguation among options.

In essence, we evolved our DSL design to simplify the way that users can interact with automated predictions. Although simple, this example illustrates some of the subtleties involved in co-designing Predictive Interaction across the three streams of traditional research mentioned above. The visualization has to be informative and the affordances for user guidance clear; the predictive model has to receive information-rich guidance from the interactions, and do a good job of surfacing probable but diverse choices; the DSL has to be expressive yet sufficiently small for tractable inference and simple user interaction.

[Remainder of the paper page captured in the screenshot: Figure 3 (lifting a textual DSL to a visual domain of visualization and interaction), Figure 4 (Query By Example: qualified retrieval using links), and the opening of the paper's section 2, "Lifting to Visual Languages", which discusses Query-By-Example, Microsoft Access and Tableau.]

1.  User highlights text

2.  Trifacta predictive models generate ranked suggested transforms

3.  Outcome of the suggested text pattern transform in Preview column

4.  User adds the selected transform to the script

Source: Trifacta
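The interaction loop above (highlight, rank suggestions, preview, accept) can be illustrated with a toy ranking function. This is only a sketch of the general idea, not Trifacta's inference algorithm: the candidate patterns, the sample rows and the "fewest extra matches first" scoring rule are all hypothetical.

```python
import re


def rank_patterns(examples, rows, candidates):
    """Rank candidate extraction patterns against highlighted examples.

    Each candidate is a regex with one capturing group. A candidate is kept
    only if it extracts every highlighted example; kept candidates are ordered
    by how few additional values they extract (most specific first).
    """
    wanted = set(examples)
    scored = []
    for name, pattern in candidates.items():
        regex = re.compile(pattern)
        extracted = {m.group(1) for row in rows for m in regex.finditer(row)}
        if wanted <= extracted:
            scored.append((len(extracted - wanted), name))
    return [name for _, name in sorted(scored)]


rows = ["type=mobile id=34", "type=dynamic id=31", "type=static id=12"]
candidates = {
    "literal 'mobile'": r"(mobile)",
    "value after 'type='": r"type=([a-z]+)",
    "any lowercase word": r"([a-z]+)",
}

after_first = rank_patterns(["mobile"], rows, candidates)
after_second = rank_patterns(["mobile", "dynamic"], rows, candidates)
print(after_first[0])
print(after_second[0])
```

As in the figure, the top-ranked suggestion changes after the second highlight: one example is best explained by a literal match, while two different examples force a more general pattern.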


Datawatch Monarch Provides Automated Extraction of Structured Data From Documents


IT Professionals Are Very Concerned About Data Governance As Departments Buy Different Tools

[Diagram: IT (Data Architect) uses the EIM Tool Suite, while Data Scientists and Business Analysts use separately purchased tools, prompting the questions “What about Data Governance?” and “Lineage?”]

EIM Tool Suite
    •  Data & Metadata Relationship Discovery Services
    •  Data Quality Profiling & Monitoring Services
    •  Data Modeling Services
    •  Data Cleansing & Matching Services
    •  Data Integration Services
    •  Business Glossary / Info Catalog Services
    •  Data Privacy & Lifecycle Management Services
    •  Data Audit & Protection Services
    •  Data Governance/Management Console

Tools bought by departments
    •  Stand-alone Data Wrangling tools
    •  Self-Service DI embedded in Self-Service BI tools, e.g. PowerQuery
    •  Cloud DI: Dell Boomi, IBM DataWorks, Informatica Rev, Microsoft Data Factory, SnapLogic


Interoperability Is Needed Across Tools To Re-Use Data Preparation Jobs Developed By Different Users

[Diagram: metadata is exchanged between the EIM Tool Suite used by IT (Data Architect) and each category of tool used by Data Scientists and Business Analysts, to give interoperability.]

EIM Tool Suite
    •  Data & Metadata Relationship Discovery Services
    •  Data Quality Profiling & Monitoring Services
    •  Data Modeling Services
    •  Data Cleansing & Matching Services
    •  Data Integration Services
    •  Business Glossary / Info Catalog Services
    •  Data Privacy & Lifecycle Management Services
    •  Data Audit & Protection Services
    •  Data Governance/Management Console

Tools exchanging metadata with the EIM Tool Suite
    •  Stand-alone Data Wrangling tools
    •  Self-Service DI embedded in Self-Service BI tools, e.g. PowerQuery
    •  Cloud DI: Microsoft Data Factory, Dell Boomi, SnapLogic, IBM DataWorks, Informatica Rev

No Standard APIs, still Incomplete – Work In Progress


What Happens If You Have An EIM Tool Suite, MDM AND Best-of-Breed Self-Service Data Integration Tools?

[Diagram: IT owns the EIM Tool Suite and the MDM System (Cust, Prod, LSP; CRUD); Business Users own Self-Service DI tools.]

EIM Tool Suite
    •  Data & Metadata Relationship Discovery Services
    •  Data Quality Profiling & Monitoring Services
    •  Data Modeling Services
    •  Data Cleansing & Matching Services
    •  Data Integration Services
    •  Business Glossary / Info Catalog Services
    •  Data Privacy & Lifecycle Management Services
    •  Data Audit & Protection Services
    •  Data Governance/Management Console

Answer is they HAVE TO Integrate to solve the data governance problem

§  Invoke SSDI services from EIM workflows

§  Invoke EIM & MDM services from SSDI tools

§  RESTful APIs e.g. Paxata RESTful API


Business And IT User Data Refinery Tools e.g. Informatica

[Diagram: the IT Data Architect, Data Scientist and Business Analyst share metadata through the Informatica Catalog & Live Data Map. The Analyst tool and Informatica Rev (self-service Cloud DI) exchange metadata with the EIM Tool Suite.]

EIM Tool Suite
    •  Data & Metadata Relationship Discovery Services
    •  Data Quality Profiling & Monitoring Services
    •  Data Modeling Services
    •  Data Cleansing & Matching Services
    •  Data Integration Services
    •  Business Glossary / Info Catalog Services
    •  Data Privacy & Lifecycle Management Services
    •  Data Audit & Protection Services
    •  Data Governance/Management Console


Metadata Management In A Data Reservoir – Importing 3rd Party Metadata Into An EIM Platform Using Apache Atlas

[Diagram: stand-alone data wrangling tools, self-service DI embedded in self-service BI tools (e.g. PowerQuery) and Cloud DI tools (Microsoft Data Factory, Dell Boomi, SnapLogic, IBM DataWorks, Informatica Rev) each pass metadata through Apache Atlas into a graph store behind the Information Catalog of the EIM Tool Suite, which the IT Data Architect, Data Scientist and Business Analyst access through its services and the Data Governance/Management Console.]


Topics – Where Are We?

§  Speeding up Data Science - why no programming is a valid option

§  Key requirements for tools if they are to improve productivity

§  Preparing data for analysis without programming using data wrangling tools

Ø Model development tools that exploit Spark and in-Hadoop analytics

§  Building workflow based analytical applications without programming

§  Building streaming analytic applications without programming

§  Text analytics and the power of search

§  Interactive data discovery and data visualization tools


Requirement Is Now To Deploy Analytics In Analytical DBMSs, In-Hadoop and In-Stream For Scalability & Reuse

[Diagram: analytics are developed on an Analytics Platform, aligned with Business Strategy (Customer, Operations, Risk, Finance, Sustainability), and deployed via PMML for execution as in-database analytics (EDW, DW appliance sandboxes), in-stream analytics (streaming data) and in-Hadoop analytics.]
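PMML is an XML format for exchanging trained models, which is what lets one model be developed once and executed in a database, a stream processor or Hadoop. The sketch below parses a small PMML regression model and scores a record with it in plain Python. The model itself (field names, intercept and coefficients) is hypothetical, and the scorer handles only this simple `RegressionModel` case rather than the full standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical model, for illustration only.
PMML_DOC = """<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="Hypothetical score, for illustration only"/>
  <DataDictionary numberOfFields="3">
    <DataField name="calls" optype="continuous" dataType="double"/>
    <DataField name="tenure" optype="continuous" dataType="double"/>
    <DataField name="score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="calls"/>
      <MiningField name="tenure"/>
      <MiningField name="score" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="calls" coefficient="0.25"/>
      <NumericPredictor name="tenure" coefficient="-0.125"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

NS = {"p": "http://www.dmg.org/PMML-4_2"}


def load_scorer(pmml_text):
    """Return a scoring function for a simple PMML linear regression model."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("p:NumericPredictor", NS)
    }

    def score(record):
        return intercept + sum(c * record[name] for name, c in coefficients.items())

    return score


score = load_scorer(PMML_DOC)
print(score({"calls": 4, "tenure": 8}))
```

Because the model travels as data rather than code, any engine with a PMML scorer can execute it, which is what makes deployment in-database, in-stream and in-Hadoop possible from a single development step.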


Advanced Analytics Tool Product Example - Knime


KNIME Integration With Spark Is Much More Than Using MLlib – It Can Exploit Spark Transformations

Source: Knime

Page 18: Reducing Time To Value - DM And Analytical Tools Available ...files.meetup.com/10751222/Reducing Time To Value - DM And An… · 28/04/2016 Copyright © Intelligent Business Strategies


35 Copyright © Intelligent Business Strategies 1992-2016!

KNIME Integration With Spark MLlib

§  Spark RDDs as input/output format

§  Native MLlib model learning and prediction

§  Data stays within your Spark cluster

§  No unnecessary data movements

§  Several input/output nodes e.g. Hive, HDFS files, …

Native MLlib model

Source: KNIME

36 Copyright © Intelligent Business Strategies 1992-2016!

Model Development - RapidMiner Can Exploit Spark MLlib Algorithms on Hadoop Data To Build Scalable Models

§  Access data in an HDFS data set

§  Spark MLlib decision tree algorithm

§  Develop and train the model on Spark

§  Deploy and execute it on Spark / Hadoop

§  Push down analytics closer to the data

Source: RapidMiner
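The benefit of pushing analytics down to the data can be sketched in plain Python, with no Spark involved. In this hypothetical example, three lists stand in for blocks of an HDFS data set, and a simple spend threshold (effectively a one-split decision rule) stands in for a trained model. The sketch compares how many rows have to move when scoring happens at the client versus next to each partition.

```python
# Hypothetical data: three "HDFS blocks", each a list of (customer_id, monthly_spend) rows.
PARTITIONS = [
    [("c1", 120.0), ("c2", 35.0)],
    [("c3", 80.0), ("c4", 10.0), ("c5", 95.0)],
    [("c6", 60.0)],
]


def is_high_value(row, threshold=75.0):
    """Toy stand-in for a trained model: flag customers above a spend threshold."""
    return row[1] > threshold


def score_at_client(partitions):
    """Pull every row to the client, then score: all rows cross the network."""
    rows = [row for part in partitions for row in part]
    hits = [row[0] for row in rows if is_high_value(row)]
    return hits, len(rows)


def score_pushed_down(partitions):
    """Run the scoring function next to each partition; only matches come back."""
    hits, moved = [], 0
    for part in partitions:
        local = [row[0] for row in part if is_high_value(row)]
        hits.extend(local)
        moved += len(local)
    return hits, moved


client_hits, client_moved = score_at_client(PARTITIONS)
pushed_hits, pushed_moved = score_pushed_down(PARTITIONS)
print(client_hits == pushed_hits, client_moved, pushed_moved)
```

Both approaches return the same customers, but the pushed-down version moves only the matching rows, which is the effect tools such as those on this slide aim for at cluster scale.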

Page 19: Reducing Time To Value - DM And Analytical Tools Available ...files.meetup.com/10751222/Reducing Time To Value - DM And An… · 28/04/2016 Copyright © Intelligent Business Strategies


37 Copyright © Intelligent Business Strategies 1992-2016!

These Nodes Can Be Shared With Non-Programmer Data Scientists To Democratize Access To Spark Capabilities

Utilise Spark nodes in SPSS models

Spark MLlib becomes usable for non-programmers, with code abstracted behind an SPSS Modeler GUI

Create new Spark MLlib based IBM SPSS Modeler nodes

E.g. Spark based collaborative filtering in SPSS

IBM SPSS Modeler v17.1

38 Copyright © Intelligent Business Strategies 1992-2016!

Model Development - Dell Statistica Has Support For Hadoop HDFS And In-Hadoop Analytical Algorithms

Page 20: Reducing Time To Value - DM And Analytical Tools Available ...files.meetup.com/10751222/Reducing Time To Value - DM And An… · 28/04/2016 Copyright © Intelligent Business Strategies


39 Copyright © Intelligent Business Strategies 1992-2016!

Topics – Where Are We?

§  Speeding up Data Science - why no programming is a valid option

§  Key requirements for tools if they are to improve productivity

§  Preparing data for analysis without programming using data wrangling tools

§ Model development tools that exploit Spark and in-Hadoop analytics

Ø Building workflow based analytical applications without programming

§  Building streaming analytic applications without programming

§  Text analytics and the power of search

§  Interactive data discovery and data visualization tools

40 Copyright © Intelligent Business Strategies 1992-2016!

Building Analytical Workflows That Leverage Spark For Data Blending - E.g. Alteryx

Page 21: Reducing Time To Value - DM And Analytical Tools Available ...files.meetup.com/10751222/Reducing Time To Value - DM And An… · 28/04/2016 Copyright © Intelligent Business Strategies


41 Copyright © Intelligent Business Strategies 1992-2016!

Expediting The Data Refinery Process On Hadoop With Automated Analysis – From ETL to Analytical Workflows

[Diagram: ELT workflow in the data refinery: load raw data and other data into Hadoop → parse & prepare data in Hadoop → transform & cleanse data in Hadoop → discover data in Hadoop, with automated invocation of custom-built & pre-built analytics on Hadoop. Outputs: clean, high-value data and new high-value insights, published (pub/sub) to the EDW, graph DBMS and DW appliance.]
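The refinery stages can be pictured as small functions chained into one workflow. The sketch below is a toy, single-machine version in Python: the raw lines, the field names and the per-city total are all invented for illustration, and a real refinery would run equivalent steps as jobs on Hadoop.

```python
import json

# Hypothetical raw data as it might land in Hadoop: one JSON document per line.
RAW_LINES = [
    '{"id": 1, "city": " manchester ", "spend": "120.5"}',
    '{"id": 2, "city": "LONDON", "spend": "n/a"}',
    "not json at all",
    '{"id": 3, "city": "Leeds", "spend": "80"}',
]


def parse(lines):
    """Parse & prepare: keep only lines that are valid JSON."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def cleanse(records):
    """Transform & cleanse: normalise city names, drop non-numeric spend values."""
    for rec in records:
        try:
            spend = float(rec["spend"])
        except ValueError:
            continue
        yield {"id": rec["id"], "city": rec["city"].strip().title(), "spend": spend}


def analyse(records):
    """Automated analytic step: total spend per city."""
    totals = {}
    for rec in records:
        totals[rec["city"]] = totals.get(rec["city"], 0.0) + rec["spend"]
    return totals


# The workflow is the composition of the stages.
insights = analyse(cleanse(parse(RAW_LINES)))
print(insights)
```

Because each stage only consumes the output of the previous one, stages can be swapped or reused independently, which is what makes an automated workflow practical.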

42 Copyright © Intelligent Business Strategies 1992-2016!

Building Analytic Applications (No Programming) - E.g. Actian DataFlow (Uses A KNIME UI)

Works with flat files, relational databases, NoSQL databases and Hadoop file system (HDFS)

>> This kind of tool significantly reduces time to value

Dataflows execute on a proprietary DataFlow cluster that can run on YARN

Page 22: Reducing Time To Value - DM And Analytical Tools Available ...files.meetup.com/10751222/Reducing Time To Value - DM And An… · 28/04/2016 Copyright © Intelligent Business Strategies


43 Copyright © Intelligent Business Strategies 1992-2016!

Topics – Where Are We?

§  Speeding up Data Science - why no programming is a valid option

§  Key requirements for tools if they are to improve productivity

§  Preparing data for analysis without programming using data wrangling tools

§ Model development tools that exploit Spark and in-Hadoop analytics

§  Building workflow based analytical applications without programming

Ø Building streaming analytic applications without programming

§  Text analytics and the power of search

§  Interactive data discovery and data visualization tools

44 Copyright © Intelligent Business Strategies 1992-2016!

Building Storm And Spark Streaming Applications With No Programming – E.g. Impetus StreamAnalytix

§  Drag and drop workflow based Spark Streaming or Storm applications

§  Generates the code for Spark Streaming or Storm (uses Trident)

§  Includes Kafka support

[Diagram: a Storm topology reading from Kafka through a spout into three bolts]

Source: Impetus
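For readers unfamiliar with Storm terminology, the spout-and-bolt shape of such an application can be mimicked with Python generators. This is only a conceptual sketch: it does not use Storm, Spark Streaming or Kafka, the message format and alert limit are invented, and it is not the code a tool like this would generate.

```python
def spout(messages):
    """Spout: emits tuples from a source (a plain list stands in for a Kafka topic)."""
    for msg in messages:
        yield msg


def parse_bolt(stream):
    """Bolt 1: split 'sensor,reading' text into a typed tuple."""
    for msg in stream:
        sensor, reading = msg.split(",")
        yield sensor, float(reading)


def filter_bolt(stream, limit=100.0):
    """Bolt 2: pass through only readings above the alert limit."""
    for sensor, reading in stream:
        if reading > limit:
            yield sensor, reading


def count_bolt(stream):
    """Bolt 3: count alerts per sensor."""
    counts = {}
    for sensor, _ in stream:
        counts[sensor] = counts.get(sensor, 0) + 1
    return counts


# Hypothetical messages on the topic.
TOPIC = ["s1,99.0", "s2,140.5", "s1,101.0", "s2,150.0", "s3,20.0"]

# Wiring the topology: spout -> parse -> filter -> count.
alerts = count_bolt(filter_bolt(parse_bolt(spout(TOPIC))))
print(alerts)
```

A drag and drop tool lets the user assemble this kind of wiring visually and leaves the generation of the distributed code to the product.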

Page 23: Reducing Time To Value - DM And Analytical Tools Available ...files.meetup.com/10751222/Reducing Time To Value - DM And An… · 28/04/2016 Copyright © Intelligent Business Strategies


45 Copyright © Intelligent Business Strategies 1992-2016!

Topics – Where Are We?

§  Speeding up Data Science - why no programming is a valid option

§  Key requirements for tools if they are to improve productivity

§  Preparing data for analysis without programming using data wrangling tools

§ Model development tools that exploit Spark and in-Hadoop analytics

§  Building workflow based analytical applications without programming

§  Building streaming analytic applications without programming

Ø Text analytics and the power of search

§  Interactive data discovery and data visualization tools

46 Copyright © Intelligent Business Strategies 1992-2016!

Several Search Based Products Have Support for Big Data

§  Attivio

§  Cloudera Search

§  Connexica

§  HP Autonomy IDOL – integrates with Vertica and Hadoop

§  Information Builders webFOCUS Magnify

§  IBM BigIndex and Watson Explorer

§  LucidWorks Big Data

§ Maana

§ MapR with LucidWorks Search

§  Oracle Endeca and Oracle Big Data Appliance

§  Quid

§  Splunk

Page 24: Reducing Time To Value - DM And Analytical Tools Available ...files.meetup.com/10751222/Reducing Time To Value - DM And An… · 28/04/2016 Copyright © Intelligent Business Strategies


47 Copyright © Intelligent Business Strategies 1992-2016!

Exploratory Analysis Of Multi-Structured Data In Hadoop Via Search, e.g. Lucene Or IBM BigIndex

[Diagram: content from a CMS, image server, collaboration tools, file servers, web feeds, email and web sites is loaded into Hadoop and indexed; the resulting index partitions are accessed by BI tools, applications and mashups.]

§  Use massively parallel MapReduce to build a partitioned search index

§  Useful for analysing un-modelled semi-structured content that is not well understood
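Building a partitioned search index with a map step and a reduce step can be sketched in a few lines of Python. This is a single-process illustration, not Hadoop MapReduce or Lucene: the documents, the number of partitions and the CRC-based routing are all assumptions made for the example.

```python
from collections import defaultdict
import zlib

# Hypothetical multi-structured content, keyed by document id.
DOCS = {
    "email-1": "Spark cluster outage report",
    "wiki-7": "Hadoop cluster sizing guide",
    "feed-3": "Spark release notes",
}
NUM_PARTITIONS = 2


def map_phase(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct word in a document."""
    return [(term, doc_id) for term in set(text.lower().split())]


def partition_for(term):
    """Route each term to an index partition with a stable hash."""
    return zlib.crc32(term.encode("utf-8")) % NUM_PARTITIONS


def reduce_phase(pairs):
    """Reduce: group document ids by term within each partition."""
    partitions = [defaultdict(set) for _ in range(NUM_PARTITIONS)]
    for term, doc_id in pairs:
        partitions[partition_for(term)][term].add(doc_id)
    return partitions


def search(partitions, term):
    """Look up a term in the one partition that owns it."""
    term = term.lower()
    return sorted(partitions[partition_for(term)].get(term, set()))


pairs = [pair for doc_id, text in DOCS.items() for pair in map_phase(doc_id, text)]
index = reduce_phase(pairs)
print(search(index, "cluster"))
```

Since each term lives in exactly one partition, the partitions can be built and queried in parallel, and no schema has to be designed before the content becomes searchable.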

48 Copyright © Intelligent Business Strategies 1992-2016!

Hadoop Search Based Analytics - Product Example Splunk Hunk (Splunk on Hadoop)

Page 25: Reducing Time To Value - DM And Analytical Tools Available ...files.meetup.com/10751222/Reducing Time To Value - DM And An… · 28/04/2016 Copyright © Intelligent Business Strategies


49 Copyright © Intelligent Business Strategies 1992-2016!

Hadoop Search Based Analytics - Splunk (Hunk) Is Very Popular For Analysing Machine Data

50 Copyright © Intelligent Business Strategies 1992-2016!

Enterprise Search With A Search AND SQL API - Attivio Active Intelligence Engine (Supports Hadoop)

Source: Attivio

Page 26: Reducing Time To Value - DM And Analytical Tools Available ...files.meetup.com/10751222/Reducing Time To Value - DM And An… · 28/04/2016 Copyright © Intelligent Business Strategies


51 Copyright © Intelligent Business Strategies 1992-2016!

Tibco Spotfire Dashboard Created From Accessing Multi-Structured Data (including Email) Via Attivio

52 Copyright © Intelligent Business Strategies 1992-2016!

Text Analysis On Hadoop With No Programming - E.g. Datameer (Generates Code For You)

Data Cleansing And Preparation

Entity extraction

Part of speech tagging

Page 27: Reducing Time To Value - DM And Analytical Tools Available ...files.meetup.com/10751222/Reducing Time To Value - DM And An… · 28/04/2016 Copyright © Intelligent Business Strategies


53 Copyright © Intelligent Business Strategies 1992-2016!

Datameer Sentiment Analysis

54 Copyright © Intelligent Business Strategies 1992-2016!

Text Analytics Product Example - Microsoft Azure ML Text Analytics Service

Page 28: Reducing Time To Value - DM And Analytical Tools Available ...files.meetup.com/10751222/Reducing Time To Value - DM And An… · 28/04/2016 Copyright © Intelligent Business Strategies


55 Copyright © Intelligent Business Strategies 1992-2016!

Topics – Where Are We?

§  Speeding up Data Science - why no programming is a valid option

§  Key requirements for tools if they are to improve productivity

§  Preparing data for analysis without programming using data wrangling tools

§ Model development tools that exploit Spark and in-Hadoop analytics

§  Building workflow based analytical applications without programming

§  Building streaming analytic applications without programming

§  Text analytics and the power of search

Ø Interactive data discovery and data visualization tools

56 Copyright © Intelligent Business Strategies 1992-2016!

Historically BI Platforms Were Suites Of Separate Tools For Different Types Of Analysis

[Diagram: a BI platform made up of separate tools (production pixel-perfect reporting, ad hoc query and reporting, Office integration, OLAP, dashboard builder, mobile BI and visual discovery), used by business analysts and information consumers on top of a data warehouse RDBMS (EDW, DW & marts).]

Page 29: Reducing Time To Value - DM And Analytical Tools Available ...files.meetup.com/10751222/Reducing Time To Value - DM And An… · 28/04/2016 Copyright © Intelligent Business Strategies


57 Copyright © Intelligent Business Strategies 1992-2016!

BI Platforms And Advanced Analytics Are Merging

[Diagram: a modern analytics platform combining the BI platform (business analysts, information consumers) with advanced analytics (data scientists). Data sources: EDW, DW appliance, mart, streaming data, office data, cloud data, logs, machine data and social data.]

BI vendors missing advanced analytics will add this capability and vice-versa

58 Copyright © Intelligent Business Strategies 1992-2016!

BI/Analytics Tools Are Connecting To Structured, Semi Structured And Unstructured Data Sources – E.g. Zoomdata

Page 30: Reducing Time To Value - DM And Analytical Tools Available ...files.meetup.com/10751222/Reducing Time To Value - DM And An… · 28/04/2016 Copyright © Intelligent Business Strategies


59 Copyright © Intelligent Business Strategies 1992-2016!

The Modern BI/Analytics Platform Is Becoming Service Oriented, Role Based, Embeddable And Extendable

[Diagram: layered architecture of the modern BI/analytics platform.
Users: business analyst, information consumer, data scientist, applications, processes, websites.
Access: customisable role-based user interface; API (embedded analytics).
Services: visualisations (device agnostic), collaboration & storytelling, query & reporting, dashboard development, aggregation & OLAP, advanced analytics (graph, text, predictive), model management, decision engine, action services (e.g. alerts, recommendations), data management services, orchestration, information & artifacts catalog.
Cross-cutting: security, extensibility APIs.
Engine: analytics engine and optimizer, in-memory columnar data store, connectors.
Data sources: EDW, DW appliance, mart, sandbox, streaming data, office data, cloud data, logs, machine data, social data.]

60 Copyright © Intelligent Business Strategies 1992-2016!

Analytics Consumption – Need To Utilise In-Database And In-Hadoop Predictive Analytics In Self-Service BI Tools

[Diagram: self-service BI tools (e.g. SAS Visual Analytics, Tibco Spotfire (Mobile), Tableau Forecasting) invoking in-Hadoop analytics and in-database analytics on an analytical RDBMS or analytics platform: R analytics, scientific analytics, data prep, data mining, predictive analytics, spatial.]

Can the analytics run in parallel?

Page 31: Reducing Time To Value - DM And Analytical Tools Available ...files.meetup.com/10751222/Reducing Time To Value - DM And An… · 28/04/2016 Copyright © Intelligent Business Strategies


61 Copyright © Intelligent Business Strategies 1992-2016!

The Modern BI/Analytics Platform – Spark Is Claiming Market Share In Scalable Analytics

[Diagram: layered platform architecture.
Users: business analyst, information consumer, data scientist, applications, processes, websites.
Access: customisable role-based user interface; API (embedded analytics).
Services: visualisations (device agnostic), collaboration & storytelling, query & reporting, dashboard development, aggregation & OLAP, advanced analytics (graph, text, predictive), model management, decision engine, action services (e.g. alerts, recommendations), data management services, orchestration, information & artifacts catalog.
Cross-cutting: security, extensibility APIs.
Engine: analytics engine and optimizer, in-memory columnar data store, connectors.
Data sources: EDW, DW appliance, mart, sandbox, streaming data, office data, cloud data, logs, machine data, social data.]

62 Copyright © Intelligent Business Strategies 1992-2016!

Spark And MapReduce Based Self-Service Analytical Tool Example - Datameer

Predictive Analytics – E.g. Decision Trees

Spreadsheet style user interface

Datameer offers end-to-end processing from ETL to analytics to data visualisation

It generates Spark & MR code to run on Hadoop

Page 32: Reducing Time To Value - DM And Analytical Tools Available ...files.meetup.com/10751222/Reducing Time To Value - DM And An… · 28/04/2016 Copyright © Intelligent Business Strategies


63 Copyright © Intelligent Business Strategies 1992-2016!

Datameer – Data Visualisation

§  Social network relationship analysis

§  Twitter analysis dashboard

§  Custom visualisations

64 Copyright © Intelligent Business Strategies 1992-2016!

SQL Access to Hadoop Is Needed To Allow Hadoop Data To Be Accessed By Users With Self-Service BI Tools

[Diagram: a business analyst or data scientist uses a data discovery & visualisation, dashboard or analytical workflow server to reach personal & office data, predictive models, transaction systems, the DW, and HDFS / HBase / Hive through SQL on Hadoop (e.g. a Hive interface). Results are published and shared with a community that collaborates on, consumes, enhances and re-publishes them.]
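What a self-service BI tool needs from Hadoop is simply the ability to send SQL and get rows back. The sketch below shows that interaction in Python, with the standard-library `sqlite3` module standing in for a SQL-on-Hadoop endpoint purely so the example runs anywhere. The `web_logs` table and its columns are hypothetical, and a real deployment would connect through a Hive (or similar) ODBC/JDBC driver instead.

```python
import sqlite3

# sqlite3 is only a stand-in here for a SQL-on-Hadoop endpoint such as a Hive interface.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_logs (page TEXT, visitor TEXT, ms INTEGER)")
conn.executemany(
    "INSERT INTO web_logs VALUES (?, ?, ?)",
    [("/home", "v1", 120), ("/home", "v2", 80), ("/pricing", "v1", 200)],
)

# The kind of aggregate query a BI tool would generate behind a chart.
QUERY = """
    SELECT page, COUNT(DISTINCT visitor) AS visitors, AVG(ms) AS avg_ms
    FROM web_logs
    GROUP BY page
    ORDER BY page
"""
rows = conn.execute(QUERY).fetchall()
for page, visitors, avg_ms in rows:
    print(page, visitors, avg_ms)
```

The BI tool does not need to know where or how the data is stored, only that the endpoint accepts SQL, which is why a SQL interface opens Hadoop data to non-programmers.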

Page 33: Reducing Time To Value - DM And Analytical Tools Available ...files.meetup.com/10751222/Reducing Time To Value - DM And An… · 28/04/2016 Copyright © Intelligent Business Strategies


65 Copyright © Intelligent Business Strategies 1992-2016!

Also Is It Necessary To Build The Entire Analytical Workflow Every Time?

[Diagram: three analytics producers (marketing, finance, operations) each build the same end-to-end workflow separately: Acquire data → Data integration → Data preparation (clean, transform, filter) → Analyse (e.g. score) → Visualise → Decide → Act / Embed.]

66 Copyright © Intelligent Business Strategies 1992-2016!

Reducing Time To Value Using Publish And Subscribe And Pipeline Components

[Diagram: pipeline components linked by publish and subscribe.
1. Acquire data from sources, apply data integration and data preparation (clean, transform, filter), and publish trusted, integrated data as a service to the info catalog.
2. Subscribe to and consume that data to analyse it (e.g. score), and publish new predictive analytic pipelines (as a service) to the analytics catalog.
3. Subscribe to and consume those pipelines to visualise, decide and act (or for other uses, e.g. embedding in analytic applications), and publish new prescriptive analytic pipelines and new analytic applications to the solutions catalog for use.]
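The publish-and-subscribe idea can be made concrete with a very small Python sketch. The `Catalog` class, the component names and the scoring rule are all hypothetical; the point is only that each producer publishes a reusable component and the next one subscribes to it instead of rebuilding the whole workflow.

```python
class Catalog:
    """Hypothetical in-memory catalog of named, reusable pipeline components."""

    def __init__(self):
        self._components = {}

    def publish(self, name, component):
        self._components[name] = component

    def subscribe(self, name):
        return self._components[name]


catalog = Catalog()


# Producer 1 (data integration) publishes trusted data as a service.
def trusted_customers():
    raw = [{"id": 1, "spend": "120"}, {"id": 2, "spend": None}, {"id": 3, "spend": "40"}]
    return [
        {"id": r["id"], "spend": float(r["spend"])}
        for r in raw
        if r["spend"] is not None
    ]


catalog.publish("trusted_customers", trusted_customers)


# Producer 2 (data scientist) subscribes to the data and publishes a scoring pipeline.
def score_customers():
    customers = catalog.subscribe("trusted_customers")()
    return [{**c, "high_value": c["spend"] >= 100.0} for c in customers]


catalog.publish("customer_scores", score_customers)

# Consumer (business analyst) reuses the scoring pipeline rather than rebuilding it.
scores = catalog.subscribe("customer_scores")()
high_value_ids = [c["id"] for c in scores if c["high_value"]]
print(high_value_ids)
```

Each team only builds the step it is responsible for, so a new analysis starts from published components instead of from raw data.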

Page 34: Reducing Time To Value - DM And Analytical Tools Available ...files.meetup.com/10751222/Reducing Time To Value - DM And An… · 28/04/2016 Copyright © Intelligent Business Strategies


67 Copyright © Intelligent Business Strategies 1992-2016!

Conclusion

§  We are at the point where ‘citizen data scientists’ no longer need to know how to write code to be productive on Hadoop

§  Tools exist to accelerate the analytical process
•  Data preparation and integration
•  Model development and deployment on Spark and Hadoop
•  Text extraction and analysis
•  Machine learning
•  End-to-end analytical application development
•  Visual data discovery

§  It is important to ensure that tools are integrated

§  Technology alone is not enough
•  Companies need to organise for success so that IT, data scientists and business analysts work together as a team

68 Copyright © Intelligent Business Strategies 1992-2016!

www.intelligentbusiness.biz [email protected]

Twitter: @mikeferguson1

Tel/Fax (+44)1625 520700

Thank You!

Please join me for my Big Data and Analytics Master Class – London, May 12-13, 2016

Book at http://www.q4k.com/content/big-data-analytics-strategy-implementation-2