MicroStrategy and Big Data
Presented by: Javier Valladares,
November 2015
Agenda
MicroStrategy Overview
Leveraging Hadoop for BI
Customer cases
MicroStrategy Enterprise Analytics
A powerful Business Intelligence solution that meets the needs of Business and IT in a single platform.
Best for IT | Best for business
Data Discovery and Visualization
• Exceptional Ease of Use
• Schema-free
• Data Preparation + Blending
• Rapid prototyping
• Agile and visual analysis
Geo-Spatial | Apps | Graphs | Dashboards | Banded Reports | OLAP Reports | Data Discovery | Predictive Analytics
• Reusable Object Model
• Single Security Architecture
• Single Metadata
• Optimized Multi-Source Data Access
• Design Once Deploy Everywhere
• Enterprise Reporting on any device
• Highest User Scale
• High Data Scale
• Fastest Query Performance
• Secure, Personalized Analytics for 10,000s
MicroStrategy Analytics Platform | MicroStrategy Desktop
Alerts & Distribution | Visualizations
Rapid | Intuitive | Powerful | Scalable | Extensible | Governed | Highly Performant | Secure
What is Hadoop?
• Hadoop is an open source software framework that supports distributed storage and processing of large datasets on commodity hardware.
• It is a software platform designed to store and process quantities of data that are too large for a single device or server.
• It has two main components:
  • HDFS: Hadoop Distributed File System. It is the “secret sauce” that enables Hadoop to store huge files. It’s a scalable file system that distributes and stores data across all machines in a Hadoop cluster.
  • MapReduce: the system used to efficiently process the large amount of data Hadoop stores in HDFS. Originally created by Google, its strength lies in the ability to divide a single large data processing job into smaller tasks.
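The idea of dividing one large job into smaller tasks can be sketched in a few lines. This is a single-process illustration of the map, shuffle, and reduce phases, not Hadoop's actual API; the function names and the sample chunks are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map task: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    """Group emitted values by key, as happens between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce task: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(chunks):
    # Each chunk stands in for a block of a large file handled by its own map task.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

chunks = ["big data on hadoop", "hadoop stores big files", "data data data"]
counts = word_count(chunks)
```

In a real cluster the map tasks would run on the machines that hold each block, which is what lets the job scale with the data.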
What is the reason behind Hadoop’s value proposition?
Extreme Scalability and Reliability
Hadoop-based sources provide scalable and reliable data storage that is designed to span large clusters of commodity servers.
Affordable Data Storage
Storing large volumes and varieties of data in a relational source is expensive and often no longer practical. Hadoop offers much cheaper data storage.
Highly Flexible
Hadoop bypasses the need to specify a schema or structure for the data up front. It allows you to dump the data and ask questions later.
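"Dump the data and ask questions later" is often called schema-on-read. A minimal sketch, assuming raw records arrive as JSON lines; the field names and the helper read_with_schema are hypothetical and only illustrate the idea:

```python
import json

# Raw records are stored as-is, with no schema declared up front.
raw_lines = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "page": "/cart"}',
    '{"user": "u1", "event": "click", "target": "buy"}',
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: keep the requested fields, None where absent."""
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}

# The question ("which pages did users view?") is decided only when reading.
page_views = [
    row for row in read_with_schema(raw_lines, ["user", "page"]) if row["page"]
]
```

A different question can project different fields from the same stored lines without reloading or restructuring them.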
Challenges with Big Data Analytics
Performance: Organizations seeking to implement advanced analytics on Hadoop struggle to achieve high performance. MapReduce persists intermediate results to disk after each pass through the data; as a result, iterative algorithms implemented in MapReduce run significantly slower than they do on distributed in-memory platforms.
Data Federation: Many real-world applications require integration across projects, which is challenging due to the multiple analytic point solutions introduced by Hadoop.
Time to Market: Enterprises are always keen to shorten their time to market, which is challenging when dealing with several types of sources for varied types of data and different technologies to query them.
Data Cleansing: Enterprises find it challenging to cleanse varied forms of data to make them ready for analytics.
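The performance point about iterative algorithms can be illustrated with a toy comparison. This sketch is not MapReduce itself: it only contrasts a loop that writes and rereads its intermediate result on every pass with one that keeps the result in memory. The update rule and file name are arbitrary assumptions.

```python
import json
import os
import tempfile

def step(values):
    """One pass of an arbitrary iterative update."""
    return [v / 2 + 1 for v in values]

def iterate_via_disk(values, passes, workdir):
    """Persist the intermediate result after each pass and read it back for the next."""
    path = os.path.join(workdir, "intermediate.json")
    disk_round_trips = 0
    with open(path, "w") as f:
        json.dump(values, f)
    for _ in range(passes):
        with open(path) as f:
            current = json.load(f)
        with open(path, "w") as f:
            json.dump(step(current), f)
        disk_round_trips += 1
    with open(path) as f:
        return json.load(f), disk_round_trips

def iterate_in_memory(values, passes):
    """Keep the intermediate result in memory across all passes."""
    current = list(values)
    for _ in range(passes):
        current = step(current)
    return current

with tempfile.TemporaryDirectory() as workdir:
    disk_result, round_trips = iterate_via_disk([0, 4, 10], 5, workdir)
memory_result = iterate_in_memory([0, 4, 10], 5)
```

Both versions compute the same values; the disk version pays one write and one read per pass, and that cost grows with the number of iterations.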
Query Execution Times in an environment with Hadoop
[Chart: execution time (seconds to minutes) versus data volume (GBs to PBs) for In-memory, RDBMS, and Hadoop]
Support for More Big Data Sources
Optimized Access to Your Entire Big Data Ecosystem as If It Were a Single Database
Data Warehouse Appliances
MapReduce & NoSQL Databases
Relational Databases
Multidimensional Databases
Columnar Databases
SaaS-Based App Data
HANA | BigInsights | Parallel Data Warehouse | Elastic MapReduce | Analysis Services | Redshift
Bring All Relevant Data to Decision Makers
Distribution | Clipboard | MicroStrategy Dataset | Analytics | Zendesk | HDFS | Generic Web Services (SOAP, REST) | Generic Web Services with OAuth | ...many more...
User / Departmental Data
Usage Patterns for MicroStrategy with Hadoop as a Data Source
Maturity of Data Access
1. Visually explore a subject-matter extract in-memory through a one-time query to Hadoop
2. Self-service parameterized queries directly to Hadoop
3. Model-driven access to Hadoop
4. Query a multi-source schema model and drill down among Intelligent Cubes, EDW, Hive
[Diagram labels: RDBMS | ETL | Multi-dimensional Business Model]
Ways to Connect and Query Hadoop
#1 SQL on Hadoop/HDFS
[Diagram: MicroStrategy Analytics Platform → Hive ODBC Connector → Hive (Hadoop Distribution) → Hadoop HDFS]
• This is the most popular way of querying Hadoop, via Hive/Impala
• Hive allows users who aren’t familiar with programming to access and analyze big data in a less technical way, using a SQL-like syntax called Hive Query Language (HiveQL). Hive is used for complex, long-running tasks and analyses on large sets of data, e.g. analyzing the performance of every store within a particular region for a chain retailer.
• Impala: Like Hive, Impala also uses SQL syntax to query Hadoop. Impala is used for analyses that you want to run and return quickly on a small subset of your data, e.g. analyzing company finances for a daily or weekly report. It is not ideal for complex data manipulation, data preparation, etc.
• Hive is a screwdriver and Impala is a drill bit.
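As an illustration of the store-performance example, here is a sketch of the kind of SQL-like text that would be sent to Hive. The table and column names (store_sales, store_id, region, amount) are hypothetical, and the `?` placeholder follows the ODBC parameter-marker convention; whether a given Hive or Impala driver accepts it is an assumption. The code only builds the query text and does not connect to anything.

```python
def store_performance_query(region):
    """Return query text and parameters for per-store sales in one region.

    Table and column names are hypothetical, chosen only for this example.
    """
    sql = (
        "SELECT store_id, SUM(amount) AS total_sales "
        "FROM store_sales "
        "WHERE region = ? "
        "GROUP BY store_id"
    )
    return sql, (region,)

sql, params = store_performance_query("Northeast")
```

Keeping the region as a parameter rather than pasting it into the text is what makes the same query reusable as a self-service prompt.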
SQL on Hadoop: Apache Hive | Apache Pig | Apache Shark/Spark
How does MicroStrategy integrate with Hadoop?
• MicroStrategy certifies Cloudera Impala, Google BigQuery and Pivotal HAWQ as data sources.
• MicroStrategy optimizes and certifies Hadoop/Hive as a data source.
• MicroStrategy certifies Spark/Shark on HDFS.
• MicroStrategy also provides a connector to execute Freeform Pig-Latin reports.
Ways to Connect and Query Hadoop
#2 Tap into Hadoop Natively (NEW)
[Diagram: MicroStrategy Analytics Platform → Big Data Engine/Hadoop Gateway → Hadoop HDFS]
• We launched this connectivity with v10. Big Data Engine (BDE) is a native YARN-based application that enables direct access to HDFS.
• YARN (Yet Another Resource Negotiator) is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters.
• Use case: load data from HDFS faster and leverage our in-memory layer for analytics.
How does the Big Data Engine work?
[Diagram: a Hadoop Cluster with a Name Node and Data Nodes; each Data Node runs a Big Data Execution Engine over its data partition. The Big Data Query Engine and BDE Streamer connect the cluster to the In-memory Cubes (PRIME) on the I-Server.]
• Big Data Engine has two components:
  • Big Data Query Engine (BDQE)
  • Big Data Execution Engine (BDEE)
• The I-Server sends the query to the BDQE. The BDQE then assigns sub-tasks to the relevant BDEEs, which run on each data node of the Hadoop cluster
• The BDEEs work in parallel, perform the needed aggregation and wrangling, and push the data to the I-Server
• The BDE Streamer merges the data from each BDEE and passes the final result to the Analytical Engine, which either publishes the cube or renders it directly in VI
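The flow described above (a coordinator fans a query out to per-node engines, each aggregates its own partition, and a streamer merges the partial results) can be sketched as follows. This is an illustration of the pattern only, not MicroStrategy's implementation; the function names and sample data are assumptions, and threads stand in for data nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def execution_engine(partition):
    """Stand-in for a per-node engine: aggregate one local data partition."""
    partial = Counter()
    for region, amount in partition:
        partial[region] += amount
    return partial

def query_engine(partitions):
    """Stand-in for the coordinator: one sub-task per partition, run in parallel."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(execution_engine, partitions))

def streamer(partials):
    """Stand-in for the merge step: combine partial aggregates into one result."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return dict(merged)

# Each inner list plays the role of the data partition held by one data node.
partitions = [
    [("east", 10), ("west", 5)],
    [("east", 7), ("north", 3)],
    [("west", 8)],
]
result = streamer(query_engine(partitions))
```

Because each engine returns an already aggregated partial result, only small summaries travel to the merge step instead of the raw rows.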
Three Steps for Self Service Access to Hadoop with Native Connectivity
1. Import Data from HDFS directly
Web logs, survey/feedback forms, machine generated data…
2. Cleanse, Refine with Data Wrangler
• Cleanse, refine and transform data from HDFS, make it ready for analysis.
• Designed for business users
3. Analyze with Visual Insight
Get full insights from Hadoop/HDFS data using Visual Insight
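The three steps can be mimicked on a handful of web log lines. This sketch does not use Data Wrangler or Visual Insight; the log layout (date, time, method, path, status) and the cleansing rules are assumptions chosen for the example.

```python
from collections import Counter

# Step 1: import. These lines stand in for raw web logs read from HDFS.
raw_logs = [
    "2015-11-02 10:00:01 GET /home 200",
    "2015-11-02 10:00:05 get /Cart 200",
    "corrupted line",
    "2015-11-02 10:01:12 GET /home 500",
]

def cleanse(lines):
    """Step 2: drop malformed lines and normalize case and types."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 5 or not parts[4].isdigit():
            continue  # skip lines that do not match the assumed layout
        date, time, method, path, status = parts
        rows.append({
            "date": date,
            "time": time,
            "method": method.upper(),
            "path": path.lower(),
            "status": int(status),
        })
    return rows

rows = cleanse(raw_logs)

# Step 3: analyze. Count successful hits per page.
hits_per_page = Counter(row["path"] for row in rows if row["status"] == 200)
```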
Some experience
Retail
Key BI Characteristics:
INDUSTRY: Retail (Online Commerce)
BI COMPONENTS: Reports, Dashboards, VI
USERS: ~200
DATABASE: Hadoop, Oracle
HADOOP DISTRIBUTION: Apache
VOLUME OF DATA: Petabytes
TYPE OF DATA: Web Logs, Online behavior
APPLICATIONS: Sales Analysis
Business Use and Benefits
• Analyzing web logs/online behavior stored in Hadoop. Dashboards and VI analysis run against in-memory cubes, while ad-hoc reports run live against the Hadoop data using a combination of Hive/Shark
• Match customer transactions in the Oracle DWH against clickstream data in Hadoop to gather a holistic view of the online customer
• End users do not need to code with MapReduce
• Developers are more productive delivering self-service BI through a tool instead of coding custom user interfaces
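Matching warehouse transactions against clickstream data amounts to blending two sources on a shared customer key. A minimal sketch with in-memory stand-ins for both sources; the field names and the blend helper are hypothetical:

```python
from collections import defaultdict

# Stand-in for transaction rows from the relational warehouse.
transactions = [
    {"customer_id": "c1", "order_total": 40.0},
    {"customer_id": "c2", "order_total": 15.5},
    {"customer_id": "c1", "order_total": 9.5},
]

# Stand-in for clickstream events stored in Hadoop.
clicks = [
    {"customer_id": "c1", "page": "/shoes"},
    {"customer_id": "c1", "page": "/checkout"},
    {"customer_id": "c3", "page": "/home"},
]

def blend(transactions, clicks):
    """Combine both sources into one per-customer view keyed by customer_id."""
    view = defaultdict(lambda: {"spend": 0.0, "page_views": 0})
    for row in transactions:
        view[row["customer_id"]]["spend"] += row["order_total"]
    for event in clicks:
        view[event["customer_id"]]["page_views"] += 1
    return dict(view)

customer_view = blend(transactions, clicks)
```

Customers that appear in only one source still show up in the view, which is what makes visitors who browse but never buy visible.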
Entertainment
Key BI Characteristics:
INDUSTRY: Entertainment
BI COMPONENTS: Traditional Reports, VI
USERS: ~200
DATABASE: Hadoop, Teradata
HADOOP DISTRIBUTION: Amazon EMR
VOLUME OF DATA: Petabytes
TYPE OF DATA: Log and Events data from Streaming Service
APPLICATIONS: Sales Analysis
Business Use and Benefits
• Sales analysis, generally tied to a new launch in a new region: quick report analysis to understand new accounts, number of hours of viewing, etc.
• Directly querying and reporting from MicroStrategy on logs via Hive
• Able to make better sales decisions
• Short-lived analytics on the use of the streaming service
• Easy access for analysts to Hadoop data without using MapReduce
• Shortcut the ETL-to-warehouse cycle that would otherwise take weeks
• Extend business model to create own content: https://en.wikipedia.org/wiki/List_of_original_programs_distributed_by_Netflix
Digital Media
Key BI Characteristics:
INDUSTRY: Digital Media
BI COMPONENTS: 1 Application; Reports, VI, Dashboards
DATABASE: Hadoop, Hive, Impala
HADOOP DISTRIBUTION: Cloudera
VOLUME OF DATA: Over 1 billion traffic attribute combinations
APPLICATIONS: Traffic Attribute Multiplier
Business Use and Benefits
• The Traffic Attribute Multiplier application is helping Adconion to:
o Target their digital ads better
o Shorten the time to prepare and tune models
o Provide better ad delivery ROI for their customers
• Leverages MicroStrategy’s integration with Impala and its rich visualization library, making the application scalable and easy for business users to consume
• Data blending and data clustering for better business insights
• Achieved a 2.4% improvement in ad budget spending efficiency
• Evaluated MSTR against Tableau, Pentaho, and Jaspersoft and chose us for our completeness