three signs your architecture is too small for big data. camp it december 2014
Post on 05-Aug-2015
© 2014 Craig Jordan
3 Signs Your Business Intelligence Architecture Is Too Small for Big Data
Capabilities, Experiments, and Architecture Patterns
CampIT, December 4, 2014
Agenda
• The three V-s and their impact on a classic BI architecture
• Three capabilities Big Data requires
  • Near real time data processing
  • Machine learning
  • Text processing
• Extending the classic BI architecture
• Q&A
Defining Big Data
[Diagram: Big Data at the center of four dimensions]
• Volume: transactions & master data, click stream, sensor, log, event, [scanned] document, speech, audio, social media
• Variety: unstructured, semi-structured, structured
• Velocity: speed of generation, analysis latency, decision latency, time to action
• Veracity: untrusted, uncleansed, master data, transactions
Classic BI Architecture
[Diagram: classic BI architecture]
• Stages: Create, Acquire, Integrate, Present, Use
• Sources: operational system data, third-party data, SaaS data
• Processes: extract process, transform & load
• Stores: EDW stage, EDW, data marts
• Spanning all stages: metadata management; data quality management & data governance
Classic BI Architecture Capabilities
• Standard reports and dashboards with reliable performance
• Ad hoc analysis through defined dimensions
• Measure-centric analysis
• Flexible representation of change over time
• Manual insight discovery
• Quantifiable quality
• Analytic workload isolated from operations
Big Data 3 V-s Impact
• Volume
  • Increases the size of every “cylinder”
  • Increases the throughput required for every arrow & process
• Variety
  • Detail is discarded during extract & transform
• Velocity
  • Challenges every process to be available and responsive
• Veracity
  • Decisions must be encoded by the processing
[Diagram: the classic BI architecture, repeated from the previous slide]
Warning?
• Your data warehouse is running out of disk space.
• You don’t store tweets in your data warehouse.
• Your ETL processes still execute in batch.
Why the 3 V-s are Insufficient
• Volume
  • Database sizes can be increased
  • How large is large enough?
• Variety
  • Which data types are required, nice to have, unnecessary?
  • Should an architecture provide techniques for every data type?
• Velocity
  • ETL processes can be enhanced to execute immediately
  • However, doing so for all is not feasible
  • Which processes must be immediate?
  • How do you handle data that is involved in both immediate and batch processing requirements?
• Veracity
  • Like beauty, [at least some] truth is in the eye of the beholder.
Tomorrow may be too late
Your business intelligence architecture is too small for big data if…
Some of your key business processes complete with undesirable outcomes due to unnecessary data delay.
What processes complete today?
Finding architecture requirements for increased velocity
• Not all processes have the same duration
  • Find the ones that can start and finish in a single day
• What information sources do they use?
  • Find the ones that change during the process and impact its outcome
• What consumers need the information during the process?
  • Find the ones that could take different action to affect the business result
Reducing the time to insight
• Investigate near-real-time data movement aligned with your source technology
  • XML, Avro, JSON
  • JMS
  • SOA
• Minimize the change to the target by including as few volatile sources as possible
[Diagram, before: a nightly ETL process loads both volatile data and slowly changing data from the operational systems into the data mart target used by the business process]
[Diagram, after: the nightly ETL process continues to load the slowly changing data, while an on-demand “ETL” process supplies the volatile data to the data mart]
Ingest & respond, Accumulate & consolidate
The Lambda Architecture
[Diagram: incoming data feeds both the batch layer (all data, precomputed information, batch data views) and the speed layer (process stream, incremental information, real-time views); a query against the serving layer draws on the batch data views and the real-time views]
• Batch layer
  • Type II Slowly Changing Dimensions
  • Accumulating Snapshots
  • Batch machine learning calculations
• Speed layer
  • Micro transformations to make data accessible
  • Real-time machine learning calculations
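A minimal Python sketch of how the three layers could fit together, assuming a hypothetical "events per customer" metric. The names `build_batch_view`, `SpeedLayer`, and `query` are illustrative assumptions, not part of any product or of the original slides.

```python
from collections import Counter


def build_batch_view(all_events):
    """Batch layer: recompute the whole view from all accumulated data."""
    return Counter(event["customer"] for event in all_events)


class SpeedLayer:
    """Speed layer: update a small view incrementally as events arrive."""

    def __init__(self):
        self.realtime_view = Counter()

    def process(self, event):
        self.realtime_view[event["customer"]] += 1

    def reset(self):
        # Call once a batch run has absorbed the recent events.
        self.realtime_view.clear()


def query(batch_view, realtime_view, customer):
    """Serving layer: combine the batch view with the real-time view."""
    return batch_view[customer] + realtime_view[customer]


# Hypothetical events already covered by the last batch run.
history = [{"customer": "a"}, {"customer": "b"}, {"customer": "a"}]
batch_view = build_batch_view(history)

# Hypothetical events that arrived after the batch run.
speed = SpeedLayer()
for event in [{"customer": "a"}, {"customer": "c"}]:
    speed.process(event)

events_for_a = query(batch_view, speed.realtime_view, "a")
```

The point of the split is that the batch view can be slow and complete while the real-time view stays small and current; the query hides the difference from the consumer.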
Action Steps: Reducing time to insight
• Find the processes that complete in a single day
• Determine the volatile sources that impact the process outcome
• Select near-real-time data acquisition techniques aligned to source
• Minimize impact on the target
• Prepare for a big data architecture by
  • Selecting data formats suitable for big data (Avro & CSV rather than XML and JSON)
  • Understanding the concepts of the lambda architecture
• Experiment with prototype implementations
• Select the architecture that most readily reduces arbitrary delay
• Architect your analytic targets for “accumulation” and “consolidation”
  • Lambda architecture (speed layer, serving layer, consolidation layer)
  • Type-II dimensions & summaries
  • Accumulating snapshots
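As an illustration of a Type-II dimension, the sketch below keeps history by closing the current row and appending a new version whenever a tracked attribute changes. The column names (`valid_from`, `valid_to`, `is_current`) and the function name are assumptions chosen for the example, not taken from the slides.

```python
from datetime import date


def apply_type2_change(dimension, key, new_attrs, effective):
    """Close the current row for `key` if it changed, then append a new version."""
    current = next(
        (row for row in dimension if row["key"] == key and row["is_current"]),
        None,
    )
    if current is not None:
        if all(current.get(name) == value for name, value in new_attrs.items()):
            return dimension  # no change, so history stays as it is
        current["valid_to"] = effective
        current["is_current"] = False
    dimension.append(
        {
            "key": key,
            **new_attrs,
            "valid_from": effective,
            "valid_to": None,
            "is_current": True,
        }
    )
    return dimension


# Hypothetical customer dimension with one attribute change.
customers = []
apply_type2_change(customers, "C1", {"segment": "retail"}, date(2014, 1, 1))
apply_type2_change(customers, "C1", {"segment": "wholesale"}, date(2014, 12, 4))
```

After the second call the dimension holds two rows for `C1`: the closed "retail" version and the current "wholesale" version, so facts can still be reported against the segment that applied at the time.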
Thresholds and indicators are not enough
Your business intelligence architecture is too small for big data if…
Finding insights from your data depends upon the day-to-day work of your analysts.
Predicting the expected, Finding the unexpected
• Statistical analysts use historical observations to
  • Predict future performance
  • Characterize normal capability
  • Identify deviations from the norm
[Chart: control chart over time with an Upper Control Limit (UCL) and a Lower Control Limit (LCL). From: http://www.bottomlineanalytics.com/]
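A small Python sketch of the control-limit idea, assuming the common convention of placing the limits three sample standard deviations either side of the historical mean. This is a simplification (formal control charts often estimate sigma from moving ranges), and the numbers are made up for illustration.

```python
from statistics import mean, stdev


def control_limits(history, sigmas=3.0):
    """Return (LCL, UCL) as mean -/+ `sigmas` sample standard deviations."""
    center = mean(history)
    spread = stdev(history)
    return center - sigmas * spread, center + sigmas * spread


def deviations(observations, lcl, ucl):
    """Return the observations that fall outside the control limits."""
    return [x for x in observations if x < lcl or x > ucl]


# Hypothetical historical observations that characterize normal capability.
history = [100, 102, 98, 101, 99, 100, 103, 97]
lcl, ucl = control_limits(history)

# New observations to check against the limits.
flagged = deviations([101, 96, 120, 99], lcl, ucl)
```

With this history the mean is 100 and the sample standard deviation is 2, so the limits are 94 and 106 and only the observation 120 is flagged.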
Experiment first
• Begin with machine learning by enabling qualified analysts through appropriate tools
  • R
  • Python
  • SAS/EM
[Diagram: Interactive Advanced Analytic Platform. A data scientist works in an analytic workbench (python, R, SAS, …) with a cache, building analytic models (M) on the enterprise data asset]
Extend to the operational case
• Confirm through data analyst involvement
[Diagram: Interactive Advanced Analytic Platform extended with a BI tool. The data scientist builds analytic models (M) in the analytic workbench (python, R, SAS, …) on the enterprise data asset, with a cache; the data analyst issues queries (Q) from a BI tool, which accesses the model through SQL extensions & R integration. Legend: Q = query, report, etc.; M = analytic model]
Extend to the operational case, cont
• Fully operationalize by materializing model output for online access; or by enabling near-real-time execution
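The two options can be sketched in Python with the standard-library `sqlite3` module: scoring in batch and storing the result as a table (materialized output), or registering the model as a SQL function so it runs at query time. The `score` function is a hypothetical stand-in for a trained model, and the table and column names are assumptions for the example.

```python
import sqlite3


def score(recency_days, orders):
    """Hypothetical stand-in for a trained model's scoring function."""
    weight = 1.0 if recency_days <= 30 else 0.5
    return round(min(1.0, orders / 10) * weight, 3)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id TEXT PRIMARY KEY, recency_days INTEGER, orders INTEGER)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [("C1", 5, 8), ("C2", 90, 4)],
)

# Option 1: materialize model output in batch for online access.
conn.execute("CREATE TABLE customer_score (id TEXT PRIMARY KEY, score REAL)")
rows = conn.execute("SELECT id, recency_days, orders FROM customer").fetchall()
conn.executemany(
    "INSERT INTO customer_score VALUES (?, ?)",
    [(cid, score(recency, orders)) for cid, recency, orders in rows],
)
materialized = dict(conn.execute("SELECT id, score FROM customer_score"))

# Option 2: expose the model as a SQL function and execute it at query time.
conn.create_function("model_score", 2, score)
live = conn.execute(
    "SELECT model_score(recency_days, orders) FROM customer WHERE id = ?", ("C1",)
).fetchone()[0]
```

Materializing keeps reads cheap and predictable but is only as fresh as the last batch run; executing at query time is always current but puts the model's cost on every query.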
[Diagram: the enterprise data asset serving three environments]
• Isolated Exploration Environment: extract to private data; on-demand interactive query through a BI tool with a cache; used by the data analyst and the business leader
• Live Interactive Query: desktop and mobile BI tools issuing queries (Q); used by the data analyst and the developer
• Interactive Advanced Analytic Platform: analytic workbench (python, R, SAS, …) with a cache and analytic models (M); used by the data analyst and the data scientist
Action Steps: Enhancing automatic insight
• Find statistical experts who already
  • Classify normal and special cause variation
  • Cluster customers, sales, products, etc.
  • Forecast future performance
• Experiment with machine learning algorithms
• Confirm machine learning models through extended advanced analytic environment
• Operationalize machine learning
Text doesn’t fit in your architecture
Your business intelligence architecture is too small for big data if…
Your analysts can’t find textual information that matters.
Finding text to analyze
• Select text directly related to your key processes
  • Email
  • Notes
  • Customer surveys
  • Phone call transcripts
  • Web chat transcripts
Text processing techniques
• Investigate text processing techniques and the outputs they can create
  • Entity extraction (Creating relationships)
  • Polarity (aka Sentiment)
  • Search-based sets
    • Keyword identification / facets
    • Relevancy scores
• Example Technologies
  • Python
    • Standard Library Text Processing Services
    • Natural Language Toolkit (NLTK)
    • OpenNLP
  • R
    • qdap (Oct 2014)
    • tm (June 2014)
  • DB
    • Gptext (SOLR and MADlib)
    • MarkLogic (free-text search)
    • Expose as UDFs through SQL
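A minimal sketch of two of these outputs, keyword identification and polarity, using only the Python standard library. The stop-word list and the positive/negative word lists are tiny hypothetical lexicons made up for the example; real work would rely on a toolkit such as NLTK and a proper lexicon or model.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists for illustration only.
STOP_WORDS = {"the", "a", "is", "was", "and", "but", "to", "my", "i", "it"}
POSITIVE = {"great", "helpful", "fast"}
NEGATIVE = {"late", "broken", "slow"}


def tokens(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def keywords(text, top=3):
    """Keyword identification: the most frequent non-stop-word tokens."""
    counts = Counter(t for t in tokens(text) if t not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top)]


def polarity(text):
    """Polarity: count of positive words minus count of negative words."""
    words = tokens(text)
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


# Hypothetical customer note.
note = (
    "Delivery was late. Late delivery again, and the box was broken, "
    "but support was helpful."
)
top_terms = keywords(note, top=2)
note_polarity = polarity(note)
```

Functions like these are the kind of logic that could be exposed as UDFs so that keyword and polarity outputs become columns an analyst can query alongside structured data.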
Extend your architecture to include text
[Diagram: architecture extended for text]
• Sources: operational system data, third-party data, SaaS data
• Processes: extract/transform/load processes, replication processes
• Stores: EDW, data marts, NoSQL DB, Hadoop cluster
• Spanning all stages: metadata management; data quality management & data governance
Action Steps: Deriving insights from text sources
• Find the text sources that relate to your core processes
• Experiment with text processing
  • Entity extraction
  • Polarity
  • Faceting
  • Relevancy
• Experiment with navigation
  • KPI-based
    • Sentiment
    • Relevance
  • Search-based
  • Facet-based
Classic BI Architecture
[Diagram: the classic BI architecture shown earlier, with stages Create, Acquire, Integrate, Present, Use]
Extended for near real-time processing
[Diagram: the classic BI architecture with two callouts]
• Experiment with speed layer techniques
• Near real-time ingestion technologies
Extended for machine learning
[Diagram: the classic BI architecture with two callouts]
• Add ML algorithms
• Add analytic workbench for experimentation
Extended for text processing
[Diagram: the classic BI architecture with a NoSQL DB and a Hadoop cluster added, and three callouts]
• Add analytic workbench for experimentation
• Add natural language processing & faceted search
• Add flexible storage engines for unstructured data