28/04/2016
Copyright©IntelligentBusinessStrategies1992-2016–AllRightsReserved 1
Reducing Time To Value - Data Management And Analytical Tools On Spark and Hadoop
Mike Ferguson Managing Director Intelligent Business Strategies HUG Manchester Meetup April 2016
About Mike Ferguson
Mike Ferguson is Managing Director of Intelligent Business Strategies Limited. As an analyst and consultant he specializes in BI/Analytics, data management and big data. With over 34 years of IT experience, Mike has consulted for dozens of companies, spoken at events all over the world and written numerous articles. Formerly he was a principal and co-founder of Codd and Date Europe Limited – the inventors of the Relational Model, a Chief Architect at Teradata on the Teradata DBMS and European Managing Director of DataBase Associates.
www.intelligentbusiness.biz [email protected]
Twitter: @mikeferguson1
Tel/Fax (+44)1625 520700
Topics
§ Speeding up Data Science - why no programming is a valid option
§ Key requirements for tools if they are to improve productivity
§ Preparing data for analysis without programming using data wrangling tools
§ Model development tools that exploit Spark and in-Hadoop analytics
§ Building workflow based analytical applications without programming
§ Building streaming analytic applications without programming
§ Text analytics and the power of search
§ Interactive data discovery and data visualization tools
Today Both Structured And Multi-Structured Data Are Needed For Deeper Insight
Multi-structured data (often un-modelled and may not be well understood)
• Click stream web log data
• Customer interaction data
• Social interaction data
• Sensor data
• Rich media data (video, audio)
• External content
• Documents
• Internal web content
• Seismic data (oil & gas)
Structured data (often a schema is defined and data is well understood)
• OLTP system data
• Data warehouse data
• Personal data stores e.g. Excel, Access
Different Platforms Optimised For Different Analytical Workloads Are Now Needed
Big Data analytical workloads have resulted in multiple platforms now being used for analytical processing
[Diagram: platforms and the analytical workloads they serve]
• Data warehouse RDBMS (EDW, DW & marts): IT developed queries, reports & dashboards; self-service BI and analytical tools
• DW appliance (analytical RDBMS): advanced analytics on structured data; data mining, model development
• Streaming data: real-time streaming analytics & decision management
• Hadoop data store: advanced analytics on multi-structured data; investigative / exploratory analysis; data mining, model development
• NoSQL DBMS, e.g. graph DB: graph analysis
• MDM (C R U D): Prod, Asset, Cust
Data Scientists Are Doing Exploratory Analysis, Developing Analytical Models And Applications Across The Ecosystem
[Diagram: the data scientist works across every platform in the ecosystem]
• Data warehouse RDBMS (EDW, DW & marts)
• DW appliance (analytical RDBMS): advanced analytics on structured data; data mining, model development
• Streaming data: real-time streaming analytics & decision management
• Hadoop data store: advanced analytics on multi-structured data; exploratory analysis; text analysis; data mining, model development
• NoSQL DBMS, e.g. graph DB: graph analysis
• MDM (C R U D): Prod, Asset, Cust
Problems With Big Data Analytic Application Development
§ Reliance on very highly skilled data scientists has become a barrier to adoption
• Limited availability of skilled employees
§ Very high bar set to find people with skills in:
• Data engineering
• Mathematics and statistics
• Java, Python or Scala programming
• R programming
• Data visualisation
• Communication with business
§ Slow pace of building analytic applications
• Writing code is time consuming and expensive
• Too dependent on developers who may be a bottleneck
• High maintenance costs, no metadata, staff turnover…
Speeding up Data Science Requires Automation, Simplification and Provisioning of Insight
§ Need more automation and simplification
• E.g. raise the level of abstraction so that programming skills are no longer needed to prepare and integrate data
§ Lower the bar on skillsets
• Enable the ‘Citizen Data Scientists’ (business analysts)
• Need a greater reliance on business analysts and data architects in big data environments in future
§ Introduce automation to increase agility and reduce time to value
• Automated data discovery and profiling of new data sources
• Generate code to exploit new technology more rapidly and reduce reliance on programming
§ Deliver actionable insight to the point of need by integrating into processes and applications
New Tools Are Needed To Allow Business Analysts To Become “Citizen Data Scientists”
[Diagram: roles, and the new tools that let a business analyst become a citizen data scientist]
• Data Scientist: exploratory analysis; predictive / statistical model producer
• Business Analyst (+ new tools = Citizen Data Scientist): model consumer; data blending; data visualisation; information producer (build reports, build and publish dashboards); insight interpreter; storyteller and collaborator; business communicator
• Business Manager / Operations worker / Customer: information consumer; decision maker; collaborator; action taker
Key Requirements For Tools To Improve Productivity And Reduce Time To Value - 1
§ Be able to develop batch Spark and stream processing analytical applications without the need for programming
§ Develop Spark and streaming analytic applications using pipelines (workflows) so ETL developers can retain their skills and business analysts can participate
§ Be able to filter data from streaming data sources for storage in Hadoop or an analytic RDBMS
§ Deploy analytics in-database, in-stream and in-Hadoop for scalability
§ Align analytics with business strategy by tagging them
Key Requirements For Tools To Improve Productivity And Reduce Time To Value - 2
§ Publish data integration workflows as trusted data services for consumption (use) by:
• People building analytics e.g. developers, data scientists, business analysts
§ Publish analytic workflows as services so they can be
• Consumed in other tools and apps to build powerful data driven analytic applications
• Nested in other workflows
§ Create a catalog of available trusted data, data services and analytic services
§ Enrich customer master data, data warehouses and data marts with new data and insights
Reducing Time To Value - The Objective Is To Accelerate The Creation of Analytical Process
[Process diagram stages: Acquire data, Data Integration, Data Preparation (clean, transform, filter), Analyse (e.g. score), Visualise, Decide, Act, Embed]
How do you accelerate this process?
Do you have to code everything?
What tools are available to the ‘Citizen Data Scientist’ to help accelerate elements of this process or even the whole process?
What other factors are critical to success?
Technology Frequently Used In Data Science
§ Self-service data preparation tools
§ Data mining tools
• Workflow based data mining tools
• Statistical analysis
• Built-in data preparation
• Machine learning algorithms
§ Streaming analytics platform workbenches
§ Analytical application development
• Typically on Spark or Hadoop MapReduce
• Programming in Python, Java, Scala and R
• Often using interactive workbench technologies
Topics – Where Are We?
§ Speeding up Data Science - why no programming is a valid option
§ Key requirements for tools if they are to improve productivity
Ø Preparing data for analysis without programming using data wrangling tools
§ Model development tools that exploit Spark and in-Hadoop analytics
§ Building workflow based analytical applications without programming
§ Building streaming analytic applications without programming
§ Text analytics and the power of search
§ Interactive data discovery and data visualization tools
Evolution of Big Data Integration Has Followed The Same Cycle as it Did in Data Warehousing
[Diagram: Evolution of Big Data Integration]
• Data warehousing: hand coded ETL programs, then ELT processing
• Hadoop: hand coded programs, then generated Spark or MR ELT processing
Scaling ETL Transformations for In-Hadoop ELT Processing
Data cleansing and integration tool pipeline: Extract, Parse, Clean, Transform, Analyse, Load, Insights
§ Option 1: ETL tool generates HQL, or converts generated SQL to HQL
§ Option 2: ETL tool generates Pig (compiler converts every transform to a MapReduce job) or JAQL
§ Option 3: ETL tool generates 3GL MR or Spark code
§ Option 4 (other): native massively parallel transformation and integration bypassing any Hadoop execution engine
Allows ETL developers to use their skills to prepare and integrate data at scale without the need for programming
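To make Option 1 concrete, here is a minimal sketch of how a tool might turn a declaratively captured transform into HiveQL. It is not taken from any vendor product: the `TransformStep` class, the `generate_hql` function and the table and column names are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TransformStep:
    """One transform as a designer might define it in a GUI (hypothetical model)."""
    source_table: str
    target_table: str
    columns: List[str]            # SQL expressions to project
    where: Optional[str] = None   # optional filter predicate


def generate_hql(step: TransformStep) -> str:
    """Render the step as a HiveQL INSERT OVERWRITE ... SELECT statement."""
    select_list = ",\n  ".join(step.columns)
    hql = (
        f"INSERT OVERWRITE TABLE {step.target_table}\n"
        f"SELECT\n  {select_list}\n"
        f"FROM {step.source_table}"
    )
    if step.where:
        hql += f"\nWHERE {step.where}"
    return hql + ";"


# Hypothetical cleansing step: trim and upper-case a name, drop rows without a key.
step = TransformStep(
    source_table="raw_customers",
    target_table="clean_customers",
    columns=["customer_id", "upper(trim(last_name)) AS last_name"],
    where="customer_id IS NOT NULL",
)
print(generate_hql(step))
```

The point of the sketch is that the developer only describes the transform; the generated statement is what actually runs in Hadoop.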
Generation of Spark MPP In-Memory Data Integration AND Analysis Jobs – E.g. Talend
Source: Talend
[Process diagram stages: Acquire data, Data Integration, Data Preparation (clean, transform, filter), Analyse (e.g. score), Visualise, Decide, Act, Embed]
IBM BigIntegrate Supports Data Pipelining With Auto Data Repartitioning for Maximum Throughput
Source: IBM
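[Diagram: data flows from Source to Target through pipelined stages, with automatic repartitioning between stages]
• Partitioning keys shown: customer last name, customer postcode, credit card number
• Example range partitions: A-F, G-M, N-T, U-Z
Runs on
• BigInsights
• ODP with Apache Hadoop
• Hortonworks
• Cloudera CDH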
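The repartitioning idea can be sketched in a few lines of Python. This is only an illustration of the general technique (range partition on one key, then redistribute on another key between pipeline stages); it is not IBM's implementation, and the function names, record fields and the use of a CRC32 hash are assumptions of this sketch.

```python
import zlib
from collections import defaultdict

# Range partitions as labelled on the slide.
RANGES = [("A", "F"), ("G", "M"), ("N", "T"), ("U", "Z")]


def range_partition(records, key):
    """Stage 1: place each record in a partition by the first letter of `key`."""
    parts = defaultdict(list)
    for rec in records:
        initial = rec[key][0].upper()
        for low, high in RANGES:
            if low <= initial <= high:
                parts[f"{low}-{high}"].append(rec)
                break
        else:
            parts["other"].append(rec)  # non A-Z initials
    return dict(parts)


def hash_repartition(partitions, key, num_partitions):
    """Stage 2: redistribute records so equal values of `key` are co-located."""
    out = defaultdict(list)
    for recs in partitions.values():
        for rec in recs:
            target = zlib.crc32(rec[key].encode("utf-8")) % num_partitions
            out[target].append(rec)
    return dict(out)


records = [
    {"last_name": "Adams", "postcode": "M1 1AA"},
    {"last_name": "Garcia", "postcode": "SK9 1AB"},
    {"last_name": "Young", "postcode": "M1 1AA"},
]
by_name = range_partition(records, "last_name")
by_postcode = hash_repartition(by_name, "postcode", 4)
```

After the second step, the two records that share a postcode sit in the same partition even though they started in different last-name ranges, which is what lets the next stage (for example a match on postcode) run in parallel.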
Informatica Is Also Running Native On Hadoop - Blaze Engine And Cluster Aware Layer (CAL)
§ Distributed engine runs directly on Hadoop YARN
§ Leverages all the compute nodes on a Hadoop cluster
§ Automatic intelligent data pipelining, job partitioning, scaling for large concurrent workloads
§ Cluster Aware Layer (CAL) hides cluster specific interactions
• Resource Management
• Distributed File System
• Cluster Management
§ Choice of execution on Map-Reduce, BLAZE or INFA engines outside of Hadoop
[Architecture diagram components]
• INFA DIS (Data Integration Server): Data Engine Compiler, INFA Hive Executor, Hive Driver, Blaze Executor, DIS CAL
• Hadoop cluster: YARN, Blaze Runtime, Hadoop CAL, Hive Runtime, Hive MetaStore, Map-Reduce, HDFS
Source: Informatica
Self-Service Data Integration Tool Vendors
§ Actian Dataflow
§ Alteryx
§ ClearStory Data
§ Datameer
§ IBM DataWorks
§ Informatica Rev
§ Paxata
§ SAS Data Loader for Hadoop
§ Tamr
§ Trifacta
[Diagram: vendors are grouped by how much of the process they cover (Acquire data, Data Integration, Data Preparation (clean, transform, filter), Analyse (e.g. score), Visualise, Decide, Act, Embed)]
• Data preparation, integration, analysis & visualisation
• Data preparation and integration
Business User Data Wrangling Product Example - Paxata (Aimed At Data Scientists)
Paxata auto data profiling
Paxata in-line transformations
Business User Data Wrangling Product Example - Paxata Cluster and Edit To Help De-Duplicate Data
Source: Paxata
Paxata applies k-means clustering to each column to group together similar values
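As a rough illustration of clustering the values of one column, the sketch below runs a small k-means over character-bigram count vectors. The source does not say how Paxata represents values or initialises the clusters, so the bigram features, the farthest-point initialisation and all function names here are assumptions made purely for illustration.

```python
from collections import Counter


def bigram_vector(value):
    """Normalise a value and count its character bigrams (assumed featurisation)."""
    text = "".join(ch for ch in value.lower() if ch.isalnum() or ch == " ").strip()
    padded = f" {text} "
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in set(a) | set(b))


def mean_vector(vectors):
    """Component-wise mean of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {k: count / len(vectors) for k, count in total.items()}


def kmeans_values(values, k, iterations=10):
    """Group similar column values with k-means; returns a list of clusters."""
    vectors = [bigram_vector(v) for v in values]
    # Deterministic farthest-point initialisation (an assumption of this sketch).
    centroids = [dict(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(dict(farthest))
    assignment = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every value.
        assignment = [
            min(range(k), key=lambda j: sq_dist(vec, centroids[j])) for vec in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [vec for vec, a in zip(vectors, assignment) if a == j]
            if members:
                centroids[j] = mean_vector(members)
    clusters = {}
    for value, a in zip(values, assignment):
        clusters.setdefault(a, []).append(value)
    return list(clusters.values())


column = ["Acme Corp", "ACME Corp.", "Acme Corporation", "Globex Inc", "Globex Inc."]
for cluster in kmeans_values(column, k=2):
    print(cluster)
```

Each printed cluster is a candidate set of duplicates that a user could review and edit to a single canonical value.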
Paxata Technical Architecture
Paxata technical architecture (© Paxata, Inc.), top layer first:
• UI Layer: multi-user aware, HTML5 & multi-device ready, Data Driven Design, etc…
• Connectivity And API Toolkit: ODBC, JDBC, Web services, etc…
• Data Preparation Application Web Services: Data Manager, Script Manager, Semantic Relationships, Projects, Users, Tenants
• Parallel In-Memory Pipelined Data Prep Engine powered by Intellifusion™: Pax AnswerSet Manager, Pax Requests (view, histograms, cluster, relationships), Pax RDDs (projection, filter, aggregation, pivot), Pax Data Access (Pax Data Library), Pax Compiler, Pax Cache Manager (remote + in-line)
• Distributed Processing Engine
• Scheduling and Resource Management: YARN
• Distributed file system: HDFS
Source: Paxata
Trifacta Predictive Interaction User Interface – Text Extraction Example
[Screenshot of a page from the Trifacta Predictive Interaction paper; the recoverable text follows]
Figure 1: Predictive Interaction for text pattern specification. The left image shows the interface after the user has highlighted the string mobile in line 34. The right shows the interface after one more gesture: highlighting the string dynamic in line 31. Note that the top-ranked suggested transform changes after the second highlight, and hence so do the Source and Preview contents.
Figure 2: A ranked list of regular expressions.
a visual rendering of their data in a familiar tabular grid. They can guide the system by highlighting substrings in the table, which are added to an example set. Based on this set, an inference algorithm produces a ranked list of suggested text patterns that model the set well. For the top-ranked pattern, the table renderer highlights any matches found, and shows how those matches will be used.
Figure 1 shows the states of the interface after the user makes each of two guiding interactions: first, highlighting the string mobile in row 34, and then highlighting the additional string dynamic in row 31. The user interface shows the highlighted patterns in the source (blue), and the outcome of a text extraction transform in a preview column (tan). The user can choose to view the outputs of other suggested transforms by clicking on them in the top panel; they can also edit the patterns directly in a Transform Editor. When the user decides on the best pattern, they can click the “plus” (+) to the right of the transform to add it to a DSL script.
In our initial prototype the suggested transforms looked different than what is shown in Figure 1. Originally, users would see a ranked list of REs in a traditional syntax, as shown in Figure 2 (corresponding to the ranked list of suggested transforms on the right of Figure 1). In user studies we found that even experienced programmers had difficulty deciding quickly and accurately among alternative REs. It seems that RE syntax is better suited to writing patterns than to reading them. Hence we changed our DSL to a new pattern language (compilable to REs) that is better suited to rapid disambiguation among options.
In essence, we evolved our DSL design to simplify the way that users can interact with automated predictions. Although simple, this example illustrates some of the subtleties involved in co-designing Predictive Interaction across the three streams of traditional research mentioned above. The visualization has to be informative and the affordances for user guidance clear; the predictive model has to receive information-rich guidance from the interactions, and do a good job of surfacing probable but diverse choices; the DSL has to be expressive yet sufficiently small for tractable inference and simple user interaction.
In the remainder of the paper, we provide a general framework for Predictive Interaction, putting it in context with previous approaches to visual languages for managing data, and highlighting research challenges and opportunities for the community.
Figure 3: Lifts. A traditional lift (a): given a map f : X → Y, and a map g : Z → Y, the lifting problem is to find a map h : X → Z such that g ∘ h = f. Lifting in the context of visual specifications (b): rather than write expressions in a textual DSL, we define a lift to a domain of data visualization and interactions, such that the interactions in that domain lead to final outputs: compilation ∘ interaction ∘ visualization = DSL programming.
Figure 4: Query By Example: qualified retrieval using links [32]. [The figure reproduces a page from Zloof's Query By Example article in the IBM Systems Journal, covering qualified retrieval, partially underlined qualified retrieval and qualified retrieval using links.]
2. LIFTING TO VISUAL LANGUAGES
To set the stage for our discussion, we re-examine the more traditional integration of two of our three themes: visualization and data-centric languages. There are a number of influential prior efforts along these lines, including Query-By-Example (QBE) [32], Microsoft Access, and Tableau. These interfaces take a textual data manipulation language (e.g., relational calculus) and “lift” it into an isomorphic higher-level visual language intended to be more natural for users. Given a visual specification of a query, a system can translate (“ground”) to the domain of the textual language for processing. Lifting is a basic idea from category theory, sometimes used in the design of functional programming languages (Figure 3).
Lifting to a visual domain has proven to be useful for the specification of standard select-project-join-aggregate queries. As illustration, we review two influential systems: QBE and Tableau.
Example 1: QBE. The main idea in QBE is to lift the database
1. User highlights text
2. Trifacta predictive models generate ranked suggested transforms
3. Outcome of the suggested text pattern transform is shown in the Preview column
4. User adds the selected transform to the script
Source: Trifacta
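The highlight, suggest, preview loop can be imitated with a toy ranking function. This is not Trifacta's inference algorithm or pattern language: the fixed candidate templates, their ordering from most to least specific, and the function names are assumptions of this sketch, which uses ordinary Python regular expressions.

```python
import re


def candidate_patterns(examples):
    """Candidate patterns, ordered from most to least specific (assumed ordering)."""
    literal = "|".join(re.escape(e) for e in sorted(set(examples)))
    lengths = [len(e) for e in examples]
    return [
        f"(?:{literal})",                             # exactly the highlighted strings
        f"[a-z]{{{min(lengths)},{max(lengths)}}}",    # lower-case words of similar length
        "[a-z]+",                                     # any lower-case word
        r"\w+",                                       # any word
        r"\S+",                                       # any run of non-space characters
    ]


def rank_patterns(examples):
    """Keep only candidates that fully match every highlighted example."""
    return [
        pattern
        for pattern in candidate_patterns(examples)
        if all(re.fullmatch(pattern, e) for e in examples)
    ]


def preview(pattern, rows):
    """First match per row (or None), like the Preview column in the UI."""
    compiled = re.compile(pattern)
    matches = []
    for row in rows:
        found = compiled.search(row)
        matches.append(found.group(0) if found else None)
    return matches


rows = ["type=mobile ua", "type=dynamic ua", "type=static ua"]
after_one = rank_patterns(["mobile"])              # user highlights "mobile"
after_two = rank_patterns(["mobile", "dynamic"])   # then highlights "dynamic"
print(after_one[0], preview(after_one[0], rows))
print(after_two[0], preview(after_two[0], rows))
```

As in the slide, the top-ranked suggestion changes after the second highlight, and the preview changes with it.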
Datawatch Monarch Provides Automated Extraction of Structured Data From Documents
IT Professionals Are Very Concerned About Data Governance As Departments Buy Different Tools
[Diagram: data integration tooling spread across the enterprise]
• EIM Tool Suite with a Data Governance/Management Console: Data & Metadata Relationship Discovery Services; Data Quality Profiling & Monitoring Services; Data Modeling Services; Data Cleansing & Matching Services; Data Integration Services; Business Glossary / Info Catalog Services; Data Privacy & Lifecycle Management Services; Data Audit & Protection Services
• Stand-alone data wrangling tools
• Self-service DI embedded in self-service BI tools, e.g. PowerQuery
• Cloud DI: Dell Boomi, IBM DataWorks, Informatica Rev, Microsoft Data Factory, SnapLogic
• Users: IT Data Architect, Data Scientist, Business Analyst
“What about Data Governance?” Lineage?
Interoperability Is Needed Across Tools To Re-Use Data Preparation Jobs Developed By Different Users
[Diagram: interoperability through metadata exchange between the tool categories]
• EIM Tool Suite with a Data Governance/Management Console: Data & Metadata Relationship Discovery Services; Data Quality Profiling & Monitoring Services; Data Modeling Services; Data Cleansing & Matching Services; Data Integration Services; Business Glossary / Info Catalog Services; Data Privacy & Lifecycle Management Services; Data Audit & Protection Services
• Stand-alone data wrangling tools
• Self-service DI embedded in self-service BI tools, e.g. PowerQuery
• Cloud DI: Microsoft Data Factory, Dell Boomi, SnapLogic, IBM DataWorks, Informatica Rev
• Users: IT Data Architect, Data Scientist, Business Analyst
• Interoperability: metadata flows between each tool category and the EIM tool suite
No Standard APIs, still Incomplete – Work In Progress
What Happens If You Have An EIM Tool Suite, MDM AND Best-of-Breed Self-Service Data Integration Tools?
[Diagram, before integration: IT uses an EIM Tool Suite and an MDM System (C R U D: Prod, LSP, Cust); business users use Self-Service DI tools]
• EIM Tool Suite with a Data Governance/Management Console: Data & Metadata Relationship Discovery Services; Data Quality Profiling & Monitoring Services; Data Modeling Services; Data Cleansing & Matching Services; Data Integration Services; Business Glossary / Info Catalog Services; Data Privacy & Lifecycle Management Services; Data Audit & Protection Services
Answer is they HAVE TO integrate to solve the data governance problem
[Diagram, after integration: the same components linked by RESTful APIs, e.g. Paxata RESTful API; one connection is marked “?”]
• Invoke SSDI services from EIM workflows
• Invoke EIM & MDM services from SSDI tools
Business And IT User Data Refinery Tools e.g. Informatica
[Diagram: Informatica tools sharing metadata]
• Informatica Catalog & Live Data Map
• EIM Tool Suite with a Data Governance/Management Console: Data & Metadata Relationship Discovery Services; Data Quality Profiling & Monitoring Services; Data Modeling Services; Data Cleansing & Matching Services; Data Integration Services; Business Glossary / Info Catalog Services; Data Privacy & Lifecycle Management Services; Data Audit & Protection Services
• Analyst tool
• Informatica Rev (self-service cloud DI)
• Metadata is exchanged between the self-service tools and the EIM tool suite
• Users: IT Data Architect, Data Scientist, Business Analyst
Metadata Management In A Data Reservoir – Importing 3rd Party Metadata Into An EIM Platform Using Apache Atlas
[Diagram: third-party tools pass metadata through Apache Atlas into the EIM platform]
• Metadata sources: stand-alone data wrangling tools; self-service DI embedded in self-service BI tools, e.g. PowerQuery; cloud DI (Microsoft Data Factory, Dell Boomi, SnapLogic, IBM DataWorks, Informatica Rev)
• Atlas, with a graph store, sits between the tools and the EIM platform
• EIM Tool Suite: services, Data Governance/Management Console, Information Catalog
• Users: IT Data Architect, Data Scientist, Business Analyst
Topics – Where Are We?
§ Speeding up Data Science - why no programming is a valid option
§ Key requirements for tools if they are to improve productivity
§ Preparing data for analysis without programming using data wrangling tools
Ø Model development tools that exploit Spark and in-Hadoop analytics
§ Building workflow based analytical applications without programming
§ Building streaming analytic applications without programming
§ Text analytics and the power of search
§ Interactive data discovery and data visualization tools
Requirement Is Now To Deploy Analytics In Analytical DBMSs, In-Hadoop and In-Stream For Scalability & Reuse
[Diagram: develop analytics once, deploy them where the data is]
• Develop analytics on an analytics platform, using sandboxes (DW appliance)
• Deploy analytics via PMML for analytics execution as in-database analytics (EDW), in-stream analytics (streaming data) and in-Hadoop analytics
• Align analytics with business strategy: Customer, Operations, Risk, Finance, Sustainability
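PMML is what makes "develop once, deploy in several engines" possible: the model is exported as an XML document and each engine scores from that document. The sketch below, using only the Python standard library, reads a linear regression from a trimmed PMML 4.2 document (the mandatory Header element is omitted for brevity) and scores one record. The field names, coefficients and function names are invented for illustration; a real deployment would use the scoring engine built into the target database, stream processor or Hadoop distribution.

```python
import xml.etree.ElementTree as ET

PMML_NS = {"pmml": "http://www.dmg.org/PMML-4_2"}

# Trimmed, hypothetical model: score = 10.0 + 0.5 * tenure_months + 2.0 * monthly_spend
EXAMPLE_PMML = """<?xml version="1.0"?>
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <DataDictionary numberOfFields="3">
    <DataField name="tenure_months" optype="continuous" dataType="double"/>
    <DataField name="monthly_spend" optype="continuous" dataType="double"/>
    <DataField name="score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="tenure_months"/>
      <MiningField name="monthly_spend"/>
      <MiningField name="score" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="tenure_months" coefficient="0.5"/>
      <NumericPredictor name="monthly_spend" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def load_linear_model(pmml_text):
    """Extract the intercept and coefficients from a PMML RegressionTable."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", PMML_NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("pmml:NumericPredictor", PMML_NS)
    }
    return intercept, coefficients


def score(model, record):
    """Apply the linear model to one record (a dict of field name to value)."""
    intercept, coefficients = model
    return intercept + sum(coef * record[name] for name, coef in coefficients.items())


model = load_linear_model(EXAMPLE_PMML)
print(score(model, {"tenure_months": 12, "monthly_spend": 30.0}))
```

The same `EXAMPLE_PMML` text could be handed unchanged to any PMML-aware engine, which is the reuse the slide is arguing for.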
Advanced Analytics Tool Product Example - Knime
KNIME Integration With Spark Is Much More Than Using MLlib – It Can Exploit Spark Transformations
Source: Knime
Knime Integration With Spark MLlib
§ Spark RDDs as input/output format
§ Native MLlib model learning and prediction
§ Data stays within your Spark cluster
§ No unnecessary data movements
§ Several input/output nodes e.g. Hive, HDFS files, …
Native MLlib model
Source: Knime
Model Development - RapidMiner Can Exploit Spark MLlib Algorithms on Hadoop Data To Build Scalable Models
• Access data in an HDFS data set
• Spark MLlib decision tree algorithm
• Develop and train the model on Spark
• Deploy and execute it on Spark / Hadoop
• Push down analytics closer to the data
Source: RapidMiner
These Nodes Can Be Shared With Non-Programmer Data Scientists To Democratize Access To Spark Capabilities
§ Create new Spark MLlib based IBM SPSS Modeler nodes (IBM SPSS Modeler v17.1)
§ Utilise Spark nodes in SPSS models, e.g. Spark based collaborative filtering in SPSS
§ Spark MLlib becomes usable for non-programmers with code abstracted behind an SPSS Modeler GUI
Model Development - Dell Statistica Has Support For Hadoop HDFS And In-Hadoop Analytical Algorithms
Topics – Where Are We?
§ Speeding up Data Science - why no programming is a valid option
§ Key requirements for tools if they are to improve productivity
§ Preparing data for analysis without programming using data wrangling tools
§ Model development tools that exploit Spark and in-Hadoop analytics
Ø Building workflow based analytical applications without programming
§ Building streaming analytic applications without programming
§ Text analytics and the power of search
§ Interactive data discovery and data visualization tools
Building Analytical Workflows That Leverage Spark For Data Blending - E.g. Alteryx
Expediting The Data Refinery Process On Hadoop With Automated Analysis – From ETL to Analytical Workflows
[Diagram: the data refinery process on Hadoop, run as an ELT workflow]
• Load raw data and other data into Hadoop
• ELT workflow steps shown: parse & prepare data in Hadoop; transform & cleanse data in Hadoop; discover data in Hadoop
• Automated invocation of custom built & pre-built analytics on Hadoop
• The data refinery contains clean, high value data
• New high value insights are delivered (pub/sub) to the EDW, graph DBMS and DW appliance
Building Analytic Applications (No Programming) - E.g. Actian DataFlow (Uses A Knime UI)
Works with flat files, relational databases, NoSQL databases and Hadoop file system (HDFS)
>> This kind of tool significantly reduces time to value
Dataflows execute on a proprietary DataFlow cluster that can run on YARN
Topics – Where Are We?
§ Speeding up Data Science - why no programming is a valid option
§ Key requirements for tools if they are to improve productivity
§ Preparing data for analysis without programming using data wrangling tools
§ Model development tools that exploit Spark and in-Hadoop analytics
§ Building workflow based analytical applications without programming
Ø Building streaming analytic applications without programming
§ Text analytics and the power of search
§ Interactive data discovery and data visualization tools
Building Storm And Spark Streaming Applications With No Programming – E.g. Impetus StreamAnalytix
[Diagram: a Kafka spout feeding a chain of bolts]
§ Drag and drop workflow based Spark Streaming or Storm applications
§ Generates the code for Spark Streaming or Storm (uses Trident)
§ Includes Kafka support
Source: Impetus
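For readers who have not seen a spout-and-bolt topology, the dataflow that such a tool draws and then generates code for can be imitated with plain Python generators. This is only a conceptual sketch: it uses no Storm, Spark Streaming or Kafka APIs, and the message format, threshold and function names are invented for illustration.

```python
from collections import Counter


def kafka_spout(messages):
    """Stand-in for a Kafka spout: emits one tuple per incoming message."""
    for offset, message in enumerate(messages):
        yield {"offset": offset, "value": message}


def parse_bolt(stream):
    """Bolt 1: parse 'device,reading' text into typed fields."""
    for tup in stream:
        device, reading = tup["value"].split(",")
        yield {"device": device, "reading": float(reading)}


def filter_bolt(stream, threshold):
    """Bolt 2: keep only readings above the threshold."""
    for tup in stream:
        if tup["reading"] > threshold:
            yield tup


def count_bolt(stream):
    """Bolt 3: count alerts per device."""
    counts = Counter()
    for tup in stream:
        counts[tup["device"]] += 1
    return dict(counts)


def run_topology(messages, threshold):
    """Wire spout -> parse -> filter -> count, as a workflow designer would."""
    stream = kafka_spout(messages)
    stream = parse_bolt(stream)
    stream = filter_bolt(stream, threshold)
    return count_bolt(stream)


print(run_topology(["pump-1,71.5", "pump-2,64.0", "pump-1,80.2"], threshold=70.0))
```

In a drag-and-drop tool each function above would be a box on the canvas and `run_topology` would be the wiring between them.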
Topics – Where Are We?
§ Speeding up Data Science - why no programming is a valid option
§ Key requirements for tools if they are to improve productivity
§ Preparing data for analysis without programming using data wrangling tools
§ Model development tools that exploit Spark and in-Hadoop analytics
§ Building workflow based analytical applications without programming
§ Building streaming analytic applications without programming
Ø Text analytics and the power of search
§ Interactive data discovery and data visualization tools
Several Search Based Products Have Support for Big Data
§ Attivio § Cloudera Search
§ Connexica
§ HP Autonomy IDOL – integrates with Vertica and Hadoop
§ Information Builders webFOCUS Magnify
§ IBM BigIndex and Watson Explorer
§ LucidWorks Big Data
§ Maana
§ MapR with LucidWorks Search
§ Oracle Endeca and Oracle Big Data Appliance
§ Quid
§ Splunk
47 Copyright © Intelligent Business Strategies 1992-2016!
Exploratory Analysis Of Multi-Structured Data In Hadoop Via Search, e.g. Lucene Or IBM BigIndex
[Diagram: content sources are loaded into Hadoop, indexed into partitions, and searched by front-end tools]
Sources: CMS, Image server, Collab tools, File servers, Web feeds, Web sites
LOAD
Use massively parallel Map Reduce to build a partitioned search index (index partitions)
Consumers: BI Tools, Applications, Mashups
Useful for analysing un-modelled semi-structured content that is not well understood
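The idea of using a map and reduce step to build a partitioned search index can be sketched in a few lines of plain Python. This is a single-process illustration only, not how Lucene or IBM BigIndex is implemented: the term-hash partitioning scheme, the function names and the sample documents are all assumptions made for the example.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterator, List, Set, Tuple


def map_phase(doc_id: str, text: str) -> Iterator[Tuple[str, str]]:
    """Map step: emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def partition_for(term: str, num_partitions: int) -> int:
    """Assumed scheme: a stable hash of the term picks the index partition."""
    return zlib.crc32(term.encode("utf-8")) % num_partitions


def build_partitioned_index(
    docs: Dict[str, str], num_partitions: int = 3
) -> List[Dict[str, Set[str]]]:
    """Reduce step: group doc ids by term inside each partition."""
    partitions: List[Dict[str, Set[str]]] = [
        defaultdict(set) for _ in range(num_partitions)
    ]
    for doc_id, text in docs.items():
        for term, emitted_doc_id in map_phase(doc_id, text):
            partitions[partition_for(term, num_partitions)][term].add(emitted_doc_id)
    return partitions


def search(partitions: List[Dict[str, Set[str]]], term: str) -> List[str]:
    """Look a term up in the one partition that can contain it."""
    term = term.lower()
    partition = partitions[partition_for(term, len(partitions))]
    return sorted(partition.get(term, set()))


# Made-up multi-structured content, for illustration only.
docs = {
    "log-1": "disk error on node7",
    "email-4": "Customer reports disk failure",
    "feed-9": "node7 restarted",
}
index = build_partitioned_index(docs)
print(search(index, "disk"))
print(search(index, "node7"))
```

On a real cluster the map calls would run in parallel across the documents and each reducer would write one index partition. Partitioning by document rather than by term is an equally plausible design.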
48 Copyright © Intelligent Business Strategies 1992-2016!
Hadoop Search Based Analytics - Product Example Splunk Hunk (Splunk on Hadoop)
49 Copyright © Intelligent Business Strategies 1992-2016!
Hadoop Search Based Analytics - Splunk (Hunk) Is Very Popular For Analysing Machine Data
50 Copyright © Intelligent Business Strategies 1992-2016!
Enterprise Search With A Search AND SQL API - Attivio Active Intelligence Engine (Supports Hadoop)
Source: Attivio
51 Copyright © Intelligent Business Strategies 1992-2016!
Tibco Spotfire Dashboard Created From Accessing Multi-Structured Data (including Email) Via Attivio
52 Copyright © Intelligent Business Strategies 1992-2016!
Text Analysis On Hadoop With No Programming - E.g. Datameer (Generates Code For You)
Data Cleansing And Preparation
Entity extraction
Part of speech tagging
53 Copyright © Intelligent Business Strategies 1992-2016!
Datameer Sentiment Analysis
54 Copyright © Intelligent Business Strategies 1992-2016!
Text Analytics Product Example - Microsoft Azure ML Text Analytics Service
55 Copyright © Intelligent Business Strategies 1992-2016!
Topics – Where Are We?
§ Speeding up Data Science - why no programming is a valid option
§ Key requirements for tools if they are to improve productivity
§ Preparing data for analysis without programming using data wrangling tools
§ Model development tools that exploit Spark and in-Hadoop analytics
§ Building workflow based analytical applications without programming
§ Building streaming analytic applications without programming
§ Text analytics and the power of search
Ø Interactive data discovery and data visualization tools
56 Copyright © Intelligent Business Strategies 1992-2016!
Historically BI Platforms Were Suites Of Separate Tools For Different Types Of Analysis
[Diagram: BI Platform]
Users: Business Analyst, Information Consumer
Tools: Production Pixel Perfect Reporting, Ad hoc query and Reporting, Office Integration, OLAP, Dashboard Builder, Mobile BI, Visual Discovery
Data: Data Warehouse RDBMS (EDW, DW & marts, mart)
57 Copyright © Intelligent Business Strategies 1992-2016!
BI Platforms And Advanced Analytics Are Merging
[Diagram: Modern Analytics Platform containing BI Platform and Advanced Analytics]
Users: Business Analyst, Information Consumer, Data Scientist
Data sources: EDW, mart, DW Appliance, streaming data, office data, cloud data, logs, machine data, social data
BI Vendors missing advanced analytics will add this capability and vice-versa
58 Copyright © Intelligent Business Strategies 1992-2016!
BI/Analytics Tools Are Connecting To Structured, Semi Structured And Unstructured Data Sources – E.g. Zoomdata
59 Copyright © Intelligent Business Strategies 1992-2016!
The Modern BI/Analytics Platform Is Becoming Service Oriented, Role Based, Embeddable And Extendable
[Diagram: layered platform architecture]
Users and consumers: Bus. Analyst, Information consumer, Data Scientist, Applications, processes, websites
Interfaces: Customisable Role-based User Interface, API (embedded analytics)
Services: Visualisations (device agnostic), Collaboration & Story telling, Dashboard development, Query & Reporting, Aggregation & OLAP, Advanced analytics (Graph, Text, Predictive), Model management, Decision engine, Action services (e.g. alerts, recommendations), Data Management services, Orchestration, Information & Artifacts Catalog, sandbox
Cross-cutting: Security, Extensibility APIs
Engine: Analytics Engine And Optimizer, In-memory columnar data store, Connectors
Data sources: EDW, mart, DW Appliance, streaming data, office data, cloud data, logs, machine data, social data
60 Copyright © Intelligent Business Strategies 1992-2016!
Analytics Consumption – Need To Utilise In-Database And In-Hadoop Predictive Analytics In Self-Service BI Tools
[Diagram: Analytics Platform]
Self-service BI tools: e.g. SAS Visual Analytics, Tibco Spotfire (Mobile), Tableau Forecasting
In-Hadoop Analytics: R Analytics, Scientific Analytics, Data Prep, Data Mining, Predictive Analytics, Spatial
In-Database Analytics (Analytical RDBMS): R Analytics, Scientific Analytics, Data Prep, Data Mining, Predictive Analytics, Spatial
Can the analytics run in parallel?
61 Copyright © Intelligent Business Strategies 1992-2016!
The Modern BI/Analytics Platform – Spark Is Claiming Market Share In Scalable Analytics
[Diagram: layered platform architecture]
Users and consumers: Bus. Analyst, Information consumer, Data Scientist, Applications, processes, websites
Interfaces: Customisable Role-based User Interface, API (embedded analytics)
Services: Visualisations (device agnostic), Collaboration & Story telling, Dashboard development, Query & Reporting, Aggregation & OLAP, Model management, Decision engine, Action services (e.g. alerts, recommendations), Data Management services, Orchestration, Information & Artifacts Catalog, sandbox
Cross-cutting: Security, Extensibility APIs
Analytics: Advanced analytics (Graph, Text, Predictive), Analytics Engine And Optimizer, In-memory columnar data store, Connectors
Data sources: EDW, mart, DW Appliance, streaming data, office data, cloud data, logs, machine data, social data
62 Copyright © Intelligent Business Strategies 1992-2016!
Spark And MapReduce Based Self-Service Analytical Tool Example - Datameer
Predictive Analytics – E.g. Decision Trees
Spreadsheet style user interface
Datameer offers end-to-end processing from ETL to analytics to data visualisation
It generates Spark & MR code to run on Hadoop
63 Copyright © Intelligent Business Strategies 1992-2016!
Datameer – Data Visualisation
Social Network Relationship Analysis
Twitter Analysis Dashboard
Custom visualisations
64 Copyright © Intelligent Business Strategies 1992-2016!
SQL Access to Hadoop Is Needed To Allow Hadoop Data To Be Accessed By Users With Self-Service BI Tools
[Diagram: self-service BI tool and the data sources it connects to]
User: Business Analyst or Data Scientist
Tool: Data Discovery & Visualisation, Dashboard or Analytical workflow server
Data sources: personal & office data, Predictive models, Transaction systems, DW, HDFS / HBase / Hive (via SQL on Hadoop, e.g. Hive interface)
Community: collaborate, Publish / Share, Consume / Enhance / Re-publish
65 Copyright © Intelligent Business Strategies 1992-2016!
Also Is It Necessary To Build The Entire Analytical Workflow Every Time?
Analytics producers: marketing, finance, operations
Workflow (repeated in the diagram for each producer): Acquire → Data Preparation (clean, transform, filter) → Analyse (e.g. Score) → Visualise → Decide → Act
Other diagram labels: data, Data Integration, Embed
66 Copyright © Intelligent Business Strategies 1992-2016!
Reducing Time To Value Using Publish And Subscribe And Pipeline Components
[Diagram: pipeline stages publish to, and subscribe from, shared catalogs]
§ data source → Acquire → publish to Info catalog: trusted data as a service
§ Acquire → Data Preparation (clean, transform, filter) / Data Integration → publish to Info catalog: trusted, integrated data as a service
§ subscribe / consume → Analyse (e.g. score) → publish to Analytics catalog: new predictive analytic pipelines (as a service)
§ subscribe / consume → Visualise, Decide, Act → publish to Solutions catalog: new prescriptive analytic pipelines
§ subscribe / consume → other, e.g. embed → publish: new analytic applications
§ use: analytic applications
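A minimal sketch of the publish and subscribe idea, assuming a catalog is nothing more than a named registry of callable services. The `Catalog` class, the service names and the sample rows are hypothetical. The slide does not prescribe any implementation, so this only shows how a later stage can reuse a published component instead of rebuilding the acquire and prepare steps.

```python
from typing import Callable, Dict, List


class Catalog:
    """Hypothetical catalog: a named registry of reusable services."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._services: Dict[str, Callable] = {}

    def publish(self, service_name: str, service: Callable) -> None:
        self._services[service_name] = service

    def subscribe(self, service_name: str) -> Callable:
        return self._services[service_name]


info_catalog = Catalog("Info catalog")
analytics_catalog = Catalog("Analytics catalog")


def acquire() -> List[Dict[str, str]]:
    """Acquire raw rows (made-up data, one row has a missing value)."""
    return [
        {"customer": "a", "spend": "120"},
        {"customer": "b", "spend": ""},
        {"customer": "c", "spend": "40"},
    ]


def prepare(rows: List[Dict[str, str]]) -> List[Dict]:
    """Data preparation: filter out incomplete rows, convert types."""
    return [
        {"customer": row["customer"], "spend": float(row["spend"])}
        for row in rows
        if row["spend"]
    ]


# The data producer publishes trusted data as a service, once.
info_catalog.publish("trusted_customer_spend", lambda: prepare(acquire()))


def score(rows: List[Dict]) -> Dict[str, str]:
    """Analyse step: a toy rule standing in for a predictive model."""
    return {row["customer"]: "high" if row["spend"] >= 100 else "low" for row in rows}


# The analytics producer subscribes to the trusted data instead of
# rebuilding acquire and prepare, then publishes its own pipeline.
trusted = info_catalog.subscribe("trusted_customer_spend")
analytics_catalog.publish("spend_score", lambda: score(trusted()))

scores = analytics_catalog.subscribe("spend_score")()
print(scores)
```

A visualisation or decision stage would subscribe to `spend_score` in the same way, so each team adds only its own step.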
67 Copyright © Intelligent Business Strategies 1992-2016!
Conclusion
§ We are at the point where ‘citizen data scientists’ no longer need to know how to write code to be productive on Hadoop
§ Tools exist to accelerate the analytical process
• Data preparation and integration
• Model development and deployment on Spark and Hadoop
• Text extraction and analysis
• Machine learning
• End-to-end analytical application development
• Visual data discovery
§ It is important to ensure that tools are integrated
§ Technology alone is not enough
• Companies need to organise for success so that IT, data scientists and business analysts work together as a team
68 Copyright © Intelligent Business Strategies 1992-2016!
www.intelligentbusiness.biz [email protected]
Twitter: @mikeferguson1
Tel/Fax (+44)1625 520700
Thank You!
Please join me for my Big Data and Analytics Master Class – London, May 12-13, 2016
Book at http://www.q4k.com/content/big-data-analytics-strategy-implementation-2