in the mix: how native integration improves mapreduce and hadoop

36
Tuesday, May 22, 12

Upload: inside-analysis

Post on 20-Aug-2015

921 views

Category:

Technology


1 download

TRANSCRIPT

Tuesday, May 22, 12

[email protected]

Twitter Tag: #briefrTuesday, May 22, 12

Reveal the essential characteristics of enterprise software, good and bad

Provide a forum for detailed analysis of today’s innovative technologies

Give vendors a chance to explain their product to savvy analysts

Allow audience members to pose serious questions... and get answers!

Twitter Tag: #briefr

Tuesday, May 22, 12

May: Analytics

June: Intelligence

July: Governance

August: Analytics

September: Integration

October: Database

Twitter Tag: #briefr

Tuesday, May 22, 12

Twitter Tag: #briefr

Ultimately analytics is about businesses making optimal decisions, although the range of technologies that inhabit this area is wide: statistical analysis, data mining, process mining, predictive analytics, predictive modeling, business process modeling and complex event processing.

With the advent of big data, analytics has become “big analytics” with organizations diving into large heaps of data that previously was not available or usable.

A major challenge with this market trend is to be able to provide adequate performance for all BI and analytics workloads on the volumes of data that are now being assembled and which are continuously growing.

Tuesday, May 22, 12

Twitter Tag: #briefr

Robin Bloor is Chief Analyst at The Bloor Group.

[email protected]

Tuesday, May 22, 12

Twitter Tag: #briefr

SAP Sybase has a history of database innovation and application from the corporate RDBMS through to the mobile and embedded market.

Sybase IQ has been deployed in many areas of application and is used in many complex predictive analytics deployments, where speed data capacity and versatility are critical.

Recently it has been upgraded to be used in a symbiotic manner with Hadoop in order to provide a comprehensive capability as a BI and analytics engine for Big Data applications

Tuesday, May 22, 12

Twitter Tag: #briefr

David Jonker works in the area of Data Management & Analytics for SAP and is Product Marketing Director for Sybase IQ. In the last 5 years David has led product marketing teams for Sybase’s Data Management & Analytics product lines, including Sybase IQ, Sybase ASE, SQL Anywhere, and Advantage Database Server. His career includes over 10 years in software engineering and product management. Before joining Sybase, David had consulting, product management and software development roles.

Courtney Claussen is a product manager at Sybase, Inc., focusing on Sybase's data warehousing and analytics products. She has enjoyed a 30 year career in software development,

technical support and product marketing in the areas of computer aided design, computer aided software engineering,

database management systems, middleware, and analytics.

Tuesday, May 22, 12

Sybase IQ 15.4 Overview —Big data analytics & Hadoop

Tuesday, May 22, 12

© 2012 SAP AG. All rights reserved. 10

Sybase IQWidespread success

Stands out as the leading enterprise data warehouse among the largest banks, insurance agencies, and telecom operators worldwide

Manage and analyze statistical measures for the entire nation of Canada

Analyze complex models in more than 200 financial institutions worldwide

Analyze ALL Federal tax returns in the US

Store and analyze massive amounts of industry segment data in 30 of the largest information providers in the world, including Transunion, Nielsen and Axiom

Tuesday, May 22, 12

© 2012 SAP AG. All rights reserved. 11

Volume

Managing and harnessing

terabytes of data

SkillsLack of adequate

skills for non-standard platforms

and APIs

Variety

Harmonizing silos of structured and

unstructured data

Velocity

Keeping up with unpredictable data

and query flows

Costs

Too expensive to acquire, operate,

and expand

Dealing with volume, variety, velocity, costs, skills BIG DATA ANALYTICS ISSUES

BIG DATA

ANALYTICS

Tuesday, May 22, 12

© 2012 SAP AG. All rights reserved. 12

Big data analytics

Sybase IQ 15A powerful big data analytics platform in the making

v15.0

2009

VLDB Platform FoundationVolume

v15.4

2011

MapReduce APISkills

v15.3

2011

PlexQ™ MPP FoundationCosts

v15.2

2010

Text Search, Web 2.0 APIVariety

v15.1

2009

In-Database Analytics APIVelocity

Tuesday, May 22, 12

© 2012 SAP AG. All rights reserved. 13

Sybase IQ 15.4A comprehensive platform for big data analytics

SybaseCONTROL CENTER

SybasePOWERDESIGNER

CERTIFITEDISV TOOLS

Web 2.0 Java C/C++ SQL

DMBS

AppServices

Eco-SystemUnstructured Data(Hadoop, Content Mgmt)

Structured Data(DBMS)

Ingest + PersistFederation

Tuesday, May 22, 12

Details: In-Database Analytics & Hadoop

Tuesday, May 22, 12

© 2012 SAP AG. All rights reserved. 15

No compromise for complex analyticsBasic to advanced analytical functions available to SQL directly from Sybase IQ engineData never leaves the database until results are materializedAnalytics code / models must be shareable yet must allow AD-HOC analysisAnalytics code / models must be applicable to the latest data setStandards based access, concept extensibility is compulsoryPerformance and scalability is a givenAverage developer must be able to build In-database analytical models

Sybase  IQ  Process

Built-­‐In  func6ons External  DLL  “A”

External  DLL  “A”

Database  =  Logic/Filtering

Applied  in  database

   Analy7cs  simplified:  Logic  To  Data    =  Fast  +  Efficient

 

In-database analytics in Sybase IQ

Tuesday, May 22, 12

Tuesday, May 22, 12

© 2012 SAP AG. All rights reserved. 17

In-database analytics in Sybase IQCustom functions APIs

Several different forms of C++ and JAVA UDF APIs for building custom In-database analytics, each valid at different locations within queries

1.{Scalar} to {Scalar functions} e.g. sin, cosine, …

2.{Scalar set} to {Scalar functions} e.g. max, min, …

3.{Scalar set} to {Scalar set} e.g. OLAP windows, …

4.{Scalar set} to {Tables} e.g. join result sets, …

5.{Scalar set, Tables} to {Tables} e.g. MapReduce, …

All variants are parallelizable, but (5) is also distributable across the PlexQ™ grid

Tuesday, May 22, 12

© 2012 SAP AG. All rights reserved. 18

• Ideal for ISV or custom Data Mining libraries for Healthcare, eCommerce, Public SectorApps include: – ISV partner Zementis built a plug-in

for PMML (Predictive Modeling Markup Language) models– Validates PMML from SAS, R,..– Translates PMML to JAVA UDFs– JAVA UDFs called from SQL

In-database analytics in Sybase IQ Java custom functions

Big Data Use Cases

•External algorithms written as JAVA fns, plugged into Sybase IQ

•JAVA fns via SQL: runs In-Database, much faster than client side

•JAVA fns run protected/fault tolerant (in separate process)

•Supports scalar and table outputs•Supports all data types

CharacteristicsJAVA User

Defined Function offers a new in-

database analytics API

Feature3

Sybase IQ

Plug-In

JAVA UDF

PMML Zementis

Tuesday, May 22, 12

© 2012 SAP AG. All rights reserved. 19

• Ideal for bringing together Big Data Analytics pre-computations from different domains

• Example — In Telecommunication: Sybase IQ with aggregated customer loyalty data & Hadoop with aggregated network utilization data; Quest Toad for Cloud can bring data from both sources, linking customer loyalty to network utilization or network faults (e.g. dropped calls)

App services — integrating Sybase IQ + Hadoop: at client side

SYBASE IQ 15.4 DECONSTRUCTED

Big Data Use Cases

•Client tool capable of querying Sybase IQ and Hadoop

•Currently certified client tool is Quest Toad for Cloud

•Better performance when results from sources are pre-computed/pre-aggregated

CharacteristicsClient side

federation: Join data from

Sybase IQ AND Hadoop at a client application level

Feature6a

Sybase IQ

$

Toad for Cloud Databases

Hadoop Hive

Tuesday, May 22, 12

© 2012 SAP AG. All rights reserved. 20

• Ideal for combining subsets of HDFS unstructured data or summary of HDFS data into Sybase IQ for mid to long term usage in business reports

• Example — In eCommerce: clickstream data from weblogs stored in HDFS and outputs of MR jobs on that data (to study browsing behavior) ETL’d into Sybase IQ. The transactional sales data in Sybase IQ joined with clickstream data to understand and predict customer browsing to buying behavior

Big Data Use Cases• Extract & load subsets of HDFS data into Sybase IQ column store–Raw data from HDFS–Results of Hadoop MR jobs

• HDFS data stored in Sybase IQ is treated like other Sybase IQ data–Gets ACID properties of a DBMS–Can be indexed, joined, parallelized–Can be queried in an ad-hoc way

• Visible to BI and other client tools via Sybase IQ ANSI SQL API only

• Currently, the Apache bulk data transfer utility SQOOP (built by Cloudera) is certified to provide this ETL capability

CharacteristicsLoad Hadoop

data into Sybase IQ column store: Extract, transform,

load data from HDFS (Hadoop Distributed File System) into Sybase IQ schemas

Feature

App services — integrating Sybase IQ + Hadoop: using ETL

SYBASE IQ 15.4 DECONSTRUCTED

6b

ETL

SQOOP

Clickstream Data

Sales Data

Sybase IQHDFS

Tuesday, May 22, 12

© 2012 SAP AG. All rights reserved. 21

• Ideal for combining subsets of HDFS data with Sybase IQ data for operational (transient) business reports

• Example — In Retail: Point Of Sale (POS) detailed data stored in HDFS. Sybase IQ EDW fetches POS data at fixed intervals from HDFS of specific hot selling SKUs, combines with inventory data in Sybase IQ to predict and prevent inventory “stockouts”

Big Data Use Cases• Scan and fetch specified data subsets from HDFS via table UDF–Can read and fetch HDFS data

subsets–Called as part of Sybase IQ SQL

query–Output joinable with Sybase IQ data

• HDFS data not stored in Sybase IQ–Fetched into Sybase IQ In-memory

tables–ACID properties not applicable

• Visible to BI/other client tools via Sybase IQ ANSI SQL API

CharacteristicsJoin HDFS data with Sybase IQ data on the fly: Fetch and join

subsets of HDFS data on-demand

using SQL queries from Sybase IQ

(Data Federation technique)

Feature

App services — integrating Sybase IQ + Hadoop: using Data Federation

SYBASE IQ 15.4 DECONSTRUCTED

6c

POS Data Inventory Data

Sybase IQUDF BridgeHDFS

Tuesday, May 22, 12

© 2012 SAP AG. All rights reserved. 22

App services — integrating Sybase IQ + Hadoop: using Query Federation

SYBASE IQ 15.4 DECONSTRUCTED

• Ideal for combining results of Hadoop MR job results with Sybase IQ data for operational (transient) business reports

• Example – In Utilities: Smart meter and smart grid data can be combined for load monitoring and demand forecast. Smart grid transmission quality data (multi-attribute time series data) stored in HDFS can be computed via Hadoop MR jobs triggered from Sybase IQ and combined with Smart meter data stored in Sybase IQ to analyze demand and workload.

Big Data Use CasesCharacteristics• Trigger and fetch Hadoop MR job results via table UDF

– Can trigger Hadoop MR jobs

– Called as part of Sybase IQ SQL query

– Output joinable with Sybase IQ data

• HDFS data not stored in Sybase IQ

– Fetched into Sybase IQ In-memory tables

– ACID properties not applicable

• Repeated use: put fetched data in tables

• Visible to BI and other client tools via Sybase IQ ANSI SQL API

CharacteristicsCombine results of

Hadoop MR jobs with Sybase IQ data on the fly: Initiate and

Join results of Hadoop MR jobs on-demand using SQL queries

from Sybase IQ data (Query Federation

technique)

Feature6d

Smart Grid Transmission Data

Smart Meter Consumption Data

UDF Bridge Sybase IQHDFS

Tuesday, May 22, 12

© 2012 SAP AG. All rights reserved. 23

Data  Discovery  (Data  Scien7sts)

Applica6on  Modeling  (Business  Analysts)

Reports/Dashboards  (BI  Programmers)

Business  Decisions  (Business  End  Users)

Infrastructure  Management  

(DBAs)

• Dynamic, elastic PlexQ™ MPP grid–Grow, shrink, provision on-demand–Heavy parallelization

• Load, prepare, mine, report in a workflow–Privacy through isolation of resources–Collaboration through sharing of results/data via sharing of resources

Unique, user community focused platform for big data analyticsSYBASE IQ 15.4

Full  Mesh  High  Speed  Interconnect

SAN Fabric

                     

                         

                     

                         

                     

                         

                     

                         

Tuesday, May 22, 12

Thank you

Courtney ClaussenProduct Manager, Sybase [email protected]

David JonkerProduct Marketing Director, Sybase [email protected]

Tuesday, May 22, 12

Twitter Tag: #briefr

Tuesday, May 22, 12

Tuesday, May 22, 12

Most of the Big Data opportunity is, in the end, a Big Analytics opportunity.

There are two challenges in this:

Managing the data and the data flow

Providing acceptable performance for analytics applications

Hadoop and its associated technologies can be both a blessing and a curse.

Twitter Tag: #briefr

Tuesday, May 22, 12

• Hadoop = Key-value store & Parallel processing framework• Some NoSQL databases are DHT-based, some are specialized DBMS• Column-store DBMS vary, but in general they are MPP RDBMS and NewSQL DBMS

Twitter Tag: #briefr

Tuesday, May 22, 12

Twitter Tag: #briefr

Data volumes (includes complexity of data structure)

Concurrency (includes also workload variability)

Computation (is application dependent)

Data flow architecture is a factor

Tuesday, May 22, 12

Twitter Tag: #briefr

In many ways this is similar to the Data Warehouse data flow challenge; writ larger

Latency is about application service levels

This is probably still a three stage process

This is, by the way, a simplification

Tuesday, May 22, 12

Big Analytics is here to stay

In some analytical application areas speed is desirable, in others speed is critical.

Warning: Workloads can be mixed

Analytic speed depends upon the database engine, but also data flow architecture

Business effectiveness depends upon integration with the business process

Twitter Tag: #briefr

Tuesday, May 22, 12

Twitter Tag: #briefr

The prebuilt functions clearly make sense (for speed of processing). Are they intended to make some analytic tools unnecessary or simply to be called directly by such tools?

What does SAP see as the appropriate role(s) for Hadoop in most businesses?

As I understand it, Sybase IQ can fully replace Hadoop in some contexts. What are the situations where you think Hadoop AND Sybase IQ is appropriate?

I’m intrigued by the idea of JOINing data between Hadoop results and Sybase IQ, but I’m not sure of the role of such a capability. How is this different from using MR for data ingest?

As you can link up to Hadoop/Sybase IQ at the front or at the back-end, which would you tend to use when?

Tuesday, May 22, 12

Twitter Tag: #briefr

You speak of broad and comprehensive capability, in combination with Hadoop.

So which areas do you think are sweet spots?

And which kinds of application and/or data collections do you think require different approaches?

Who have been the early adopters of this Hadoop/Sybase IQ capability and what kind of business problems are they trying to solve?

What do you see as SAP HANA’s role in this? Are the same analytical capabilities being added to SAP HANA?

Tuesday, May 22, 12

Tuesday, May 22, 12

Twitter Tag: #briefr

May: Analytics

June: Intelligence

July: Governance

August: Analytics

September: Integration

October: Database

Tuesday, May 22, 12

Tuesday, May 22, 12