oracle big data sql - dataznalosti.cz · •performance of bi tools like cognos, oracle biee, sas,...

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

Oracle Big Data SQL Konference Data a znalosti 2015

Jakub ILLNER Information Management Architect XLOB Enterprise Cloud Architects 23 July 2015, version 2


Agenda

Is SQL Dead?

Introducing Oracle Big Data SQL

How Oracle Big Data SQL Works

Demonstration

Questions and Answers

1

2

3

4

2

5


Is SQL Dead? With all the Big Data and NoSQL technologies, why to bother with SQL

3

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. | 4

Is SQL Dead?


//Part 5 // Keeping context

var lastUniqueId = "foobar"

var lastRecord: (DataKey, Int) = null

var lastLastRecord: (DataKey, Int) = null

var position = 0

it.foreach( r => {

position = position + 1

if (!lastUniqueId.equals(r._1.uniqueId)) {

lastRecord = null

lastLastRecord = null

}

//Part 6 : Finding those peaks and valleys

if (lastRecord != null && lastLastRecord != null) {

if (lastRecord._2 < r._2 && lastRecord._2 < lastLastRecord._2) {

results.+=(new PivotPoint(r._1.uniqueId,

position,

lastRecord._1.eventTime,

lastRecord._2,

false))

} else if (lastRecord._2 > r._2 && lastRecord._2 > lastLastRecord._2) {

results.+=(new PivotPoint(r._1.uniqueId,

position,

lastRecord._1.eventTime,

lastRecord._2,

true))

}

}

lastUniqueId = r._1.uniqueId

lastLastRecord = lastRecord

lastRecord = r

})

results.iterator

})

//Part 7 : pretty everything up

pivotPointRDD.map(r => {

val pivotType = if (r.isPeak) "peak" else "valley"

r.uniqueId + "," +

r.position + "," +

r.eventTime + "," +

r.eventValue + "," +

pivotType

} ).saveAsTextFile(outputPath)

}

class DataKey(val uniqueId:String, val eventTime:Long)

extends Serializable with Comparable[DataKey] {

override def compareTo(other:DataKey): Int = {

val compare1 = uniqueId.compareTo(other.uniqueId)

if (compare1 == 0) {

eventTime.compareTo(other.eventTime)

} else {

compare1

}

}

}

class PivotPoint(val uniqueId: String,

val position:Int,

val eventTime:Long,

val eventValue:Int,

val isPeak:Boolean) extends Serializable {}

}

package com.hadooparchitecturebook.spark.peaksandvalleys

import org.apache.hadoop.io.{Text, LongWritable}

import org.apache.hadoop.mapred.TextInputFormat

import org.apache.spark.rdd.ShuffledRDD

import org.apache.spark.{Partitioner, SparkContext, SparkConf}

import scala.collection.mutable

/**

* Created by ted.malaska on 12/7/14.

*/

object SparkPeaksAndValleysExecution {

def main(args: Array[String]): Unit = {

if (args.length == 0) {

println("{inputPath} {outputPath} {numberOfPartitions}")

5

Peaks and Valleys in Spark vs. SQL

SELECT PRIMARY KEY

, POSITION

, EVENT_VALUE

, CASE

WHEN LEAD_EVENT_VALUE is null or LAG_EVENT_VALUE is null

then 'EDGE'

WHEN EVENT_VALUE < LEAD_EVENT_VALUE AND EVENT_VALUE < LAG_EVENT_VALUE

then 'VALLEY'

WHEN EVENT_VALUE > LEAD_EVENT_VALUE AND EVENT_VALUE > LAG_EVENT_VALUE

then 'PEAK'

ELSE 'SLOPE'

AND AS POINT_TYPE

FROM (

SELECT PRIMARY_KEY

, POSITION

, EVENT_VALUE

, LEAD(EVENT_VALUE,1,null)

OVER (PARTITION BY PRIMARY_KEY ORDER BY POSITION) AS LEAD_EVENT_VALUE

, LAG(EVENT_VALUE,1,null)

OVER (PARTITION BY PRIMARY_KEY ORDER BY POSITION) AS LAG_EVENT_VALUE

FROM PEAK_AND_VALLEY_TABLE

)

132 Lines of Scala/Spark 17 Lines of SQL

7x less code

Finding Peaks and Valleys in Stock Market Data

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. 5

10:00 10:05 10:10 10:15 10:20 10:25

Ticker


Example taken from Hadoop Application Architectures by Mark Grover, Ted Malaska, Jonathan Seidman & Gwen Shapira (O’Reilly, July 2015)


• Declarative

• Abstracted (from storage)

• Concise

• Powerful

• Simple to learn

• Rich analytical functions

• Fast

• Secure

• Standardized

• Widely used & known

6

Power of SQL


• Hive

– First SQL engine on Hadoop

– Uses MapReduce for execution

– Contains metastore (HCatalog)

– New Hive-on-Spark project

• SparkSQL

– Spark module for accessing structured data

– Fast, in-memory execution

– Compatible with Hive

7

• Impala

– Developed by Cloudera

– Fast, in-memory execution

– Introduced Parquet format


• Presto

– Developed by Facebook

– Fast, low latency execution


– Connectivity to other sources

SQL on Hadoop


What if you need to query Hadoop & RDBMS data?

• Pure Hadoop-on-SQL engines can access data in Hadoop only (HDFS, Hive, Parquet, ORC, HBase etc.)

• Performance of BI tools like Cognos, Oracle BIEE, SAS, Tableau for large and complex federated queries is limited

• Possible solution is to use SQL interface & Hadoop integration available with the DBMS platform of choice

8


Which SQL-on-Hadoop approach will you use most?

From 9% in 2014 to 19% in 2015

http://blogs.gartner.com/merv-adrian/2015/02/22/which-sql-on-hadoop-poll-still-says-whatever-but-dbms-providers-gain/


Introducing Oracle Big Data SQL One SQL to query ALL the data

10


What if you could …

• Make all data easily available to all your Oracle Database applications

• While supporting the full breadth of Oracle SQL query language

• With all the security of Oracle Database 12c

• Without moving data between your Hadoop cluster and the RDBMS

• And deliver fast query performance

• While leveraging your existing skills

• And still utilize the latest Hadoop innovations

11


One SQL to query ALL the data

NoSQL

CQL N1QL UnQL HiveQL SQL


• Ranking functions

– Rank, dense_rank, cume_dist, percent_rank, ntile

• Window Aggregate functions (moving and cumulative)

– Avg, sum, min, max, count, variance, stddev, first_value, last_value

• LAG/LEAD functions

– Direct inter-row reference using offsets

• Reporting Aggregate functions

– Sum, avg, min, max, variance, stddev, count, ratio_to_report

• Statistical Aggregates

– Correlation, linear regression family, covariance

• Linear regression

– Fitting of an ordinary-least-squares regression line to a set of number pairs.

– Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions

• Descriptive Statistics

– DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- n sigma values, top/bottom 5 values

• Correlations

– Pearson’s correlation coefficients, Spearman's and Kendall's (both nonparametric).

• Cross Tabs

– Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa

• Hypothesis Testing

– Student t-test , F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA

• Distribution Fitting

– Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal, Uniform, Weibull, Exponential

Rich Analytical Functions with Oracle SQL

13


next = lineNext.getQuantity();

}

if (!q.isEmpty() && (prev.isEmpty() || (eq(q, prev) && gt(q, next)))) {

state = "S";

return state;

}

if (gt(q, prev) && gt(q, next)) {

state = "T";

return state;

}

if (lt(q, prev) && lt(q, next)) {

state = "B";

return state;

}

if (!q.isEmpty() && (next.isEmpty() || (gt(q, prev) && eq(q, next)))) {

state = "E";

return state;

}

if (q.isEmpty() || eq(q, prev)) {

state = "F";

return state;

}

return state;

}

private boolean eq(String a, String b) {

if (a.isEmpty() || b.isEmpty()) {

return false;

}

return a.equals(b);

}

private boolean gt(String a, String b) {


return false;

}

return Double.parseDouble(a) > Double.parseDouble(b);

}

private boolean lt(String a, String b) {


return false;

}

return Double.parseDouble(a) < Double.parseDouble(b);

}

public String getState() {

return this.state;

}

}

BagFactory bagFactory = BagFactory.getInstance();

@Override

public Tuple exec(Tuple input) throws IOException {

long c = 0;

String line = "";

String pbkey = "";

V0Line nextLine;

V0Line thisLine;

V0Line processLine;

V0Line evalLine = null;

V0Line prevLine;

boolean noMoreValues = false;

String matchList = "";

ArrayList<V0Line> lineFifo = new ArrayList<V0Line>();

boolean finished = false;

DataBag output = bagFactory.newDefaultBag();

if (input == null) {

return null;

}

if (input.size() == 0) {

return null;

}

Object o = input.get(0);

if (o == null) {

return null;

}

//Object o = input.get(0);

if (!(o instanceof DataBag)) {

int errCode = 2114;

String msg = "Expected input to be DataBag, but"

14

Pattern Matching With Oracle SQL

SELECT first_x, last_z

FROM ticker MATCH_RECOGNIZE (

PARTITION BY name ORDER BY time

MEASURES FIRST(x.time) AS first_x,

LAST(z.time) AS last_z

ONE ROW PER MATCH

PATTERN (X+ Y+ W+ Z+)

DEFINE X AS (price < PREV(price)),

Y AS (price > PREV(price)),

W AS (price < PREV(price)),

Z AS (price > PREV(price) AND

z.time - FIRST(x.time) <= 7 ))

250+ Lines of Java UDF 12 Lines of Oracle SQL

20x less code

Finding Patterns in Stock Market Data - Double Bottom (W)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. 14

10:00 10:05 10:10 10:15 10:20 10:25

Ticker



Security – Virtual Private Database with Oracle SQL

• Oracle Virtual Private Database (VPD) enables you to create security policies to control database access at the row and column level.

• Oracle VPD adds a dynamic WHERE clause to a SQL statement that is issued against the table, view, or synonym to which an Oracle Virtual Private Database security policy was applied.

• Because you attach security policies directly to the database objects (tables, views), and the policies are automatically applied whenever a user accesses data, there is no way to bypass security.

• With Big Data SQL the Oracle Virtual Private Database is available for Hadoop data

B B B

SELECT * FROM my_bigdata_table WHERE SALES_REP_ID = SYS_CONTEXT('USERENV','SESSION_USER');

Filter on SESSION_USER

Oracle Database 12c Big Data SQL on Hadoop Cluster

15


Security – Data Redaction with Oracle SQL • Oracle Data Redaction enables you to create

security policies to control what data is visible for sensitive columns with personal or security information.

• Oracle Data Redaction dynamically applies redaction function to columns. The function transforms, obfuscates or hides the sensitive information for unauthorized users.

• Since the policy is applied automatically by Oracle Database there is no way to bypass security and get the un-redacted data.

• With Big Data SQL the Oracle Virtual Private Database is available for Hadoop data B B B

DBMS_REDACT.ADD_POLICY( object_schema => 'MCLICK', object_name => 'TWEET_V', column_name => 'USERNAME', policy_name => 'tweet_redaction', function_type => DBMS_REDACT.PARTIAL, function_parameters => 'VVVVVVVVVVVVVVVVVVVVVVVVV,*,3,25', expression => '1=1' );

***


16


How Oracle Big Data SQL Works Marriage of Hadoop and Oracle Database Query Processing

17


Hadoop data accessible through Oracle External Tables

• New set of properties

– ORACLE_HIVE and ORACLE_HDFS access drivers

– Identify a Hadoop cluster, data source, column mapping, error handling, overflow handling, logging

• New table metadata passed from Oracle DDL to Hadoop readers at query execution

• Architected for extensibility – StorageHandler capability enables future support

for many other data sources

– Examples: MongoDB, HBase, Oracle NoSQL DB

18

CREATE TABLE movielog (click VARCHAR2(4000)) ORGANIZATION EXTERNAL ( TYPE ORACLE_HIVE DEFAULT DIRECTORY Dir1 ACCESS PARAMETERS ( com.oracle.bigdata.tablename logs com.oracle.bigdata.cluster mycluster) ) REJECT LIMIT UNLIMITED


B B B

How Oracle executes a query with Big Data SQL

HDFS Data Node BDSQL

HDFS Data Node BDSQL

Query compilation determines: • Data locations • Data structure • Parallelism

1

Fast reads using Big Data SQL Server • Schema-for-read using Hadoop classes • Smart Scan selects only relevant data

2

Process filtered result • Move relevant data to database • Join with database tables • Apply database security policies

3 Hive Metastore

HDFS NameNode 1

2 3

19



Storage Layer

20

Big Data SQL: A New Hadoop Processing Engine

Filesystem (HDFS) NoSQL Databases

(Oracle NoSQL DB, HBase)

Resource Management (YARN, cgroups)

Processing Layer

MapReduce and Hive

Spark Impala Search Big Data

SQL


Big Data SQL Uses Hive Metastore, not MapReduce

Hive Metastore

Hive Impala SparkSQL Oracle Big Data SQL …

Common semantic repository (schemas, Java classes) for most of SQL-on-Hadoop tools

Metastore maps DDL to Java access classes


How Data is Stored in Hadoop

22

{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8} {"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7} {"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9} {"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7} {"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6} {"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8} {"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9} {"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7} {"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9} {"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7} {"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-07-01:00:01:26","recommended":"Y","activity":7} {"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-07-01:00:01:35","recommended":"N","activity":7} {"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:01:39","recommended":"Y","activity":7}} {"custId":1346299,"movieId":424,"genreId":1,"time":"2012-07-01:00:05:02","recommended":"Y","activity":4}

Example: 1TB JSON File

Block B1

Block B2

Block B3

• 1 block = 256 MB • Example File = 4096 blocks • InputSplits = 4096 Potential scan parallelism


How MapReduce and Hive Read Data

23

Data Node

disk

Consumer

SCAN

Create ROWS

& COLUMNS

• Scan and row creation needs to be able to work on “any” data format

• Data definitions and column deserializations are needed to provide a table

RecordReader => Scans data (keys and values) InputFormat => Defines parallelism SerDe => Makes columns Metastore => Maps DDL to Java access classes H

ive

Sto

rage

H

and

ler


Big Data SQL Server Dataflow

Disks

Data Node

Big Data SQL

External Table Services

Smart Scan

Read data from HDFS Data Node • Direct-path reads • C-based readers when possible • Use native Hadoop classes otherwise

1

Translate bytes to Oracle 2

Apply Smart Scan to Oracle bytes • Apply filters • Project Columns • Parse JSON/XML • Score models

3 RecordReader

SerDe 1

01

10

01

01

00

11

1

10

11

00

10

10

01

11

1

01

10

01

0 1

2

3

24


Operations Pushed Down to Hadoop

JSON


Pushed down to Big Data SQL Cell − Hadoop scans (InputFormat, SerDe) − JSON parsing − WHERE clause evaluation − Storage index evaluation − Column projection − Bloom filters for faster joins − Score Data Mining models

Handled by Oracle Database − Query Compilation & Optimization − Joins − Aggregations − Ordering of results − PL/SQL evaluation − Table functions − Security policies

Request

Oracle Data Stream

Smart Scan – Only

relevant data are emitted


Oracle Big Data SQL Storage Index

• Storage index provides query speed-up through transparent IO elimination of HDFS Blocks

• Columns in SQL are mapped to fields in the HDFS file via External Table Definitions

• Min / max value is recorded for each HDFS Block in a in-memory storage index

26

HDFS

Field 1, Field 2, 1001 1010 1045 1109 1043 1001 1045 1609 1043 11455 1909 12430 13010 10450 1909 2043

Field 3, … , Field n

HDFS Block1

(256MB)

HDFS Block2

(256MB)

Index

B1 – Movie_ID Min: 1001 Max: 1609

B2 – Movie_ID Min: 1909 Max: 13010

Example: Find all ratings from movies with a MOVIE_ID of 1109

✓

✗


Oracle Parallel Query and Hadoop

B B B

Hive Metastore

HDFS NameNode

Determine Hadoop Parallelism • Determine schema-for-read • Determine InputSplits • Arrange splits for best performance

1

Map to Oracle Parallelism • Map splits to granules • Assign batches of granules to PX Servers

2

PX Servers Route Work • Send granule requests async to cells • Reap results

3

1

2

PX

InputSplits

27



Oracle and Hadoop Parallelism

SELECT /*+PARALLEL(EVE,8)*/

CUST.NAME

, CUST.MSISDN

, EVE.MONTH

, EVE.EVENT_TYPE

, COUNT(*) AS EVENT_COUNT

, SUM(EVE.DURATION) AS DURATION

FROM D_CUSTOMERS CUST

, F_NETWORK_EVENTS EVE

WHERE CUST.MSISDN = EVE.MSISDN

GROUP BY CUST.NAME

, CUST.MSISDN

, EVE.MONTH

, EVE.EVENT_TYPE

ORDER BY 1,2,3,4

PX Server #1

PX Server #2

PX Server #3

PX Server #4

PX Server #5

PX Server #6

PX Server #7

PX Server #8

Parallelism defined by Degree of Parallelism (DOP) – dynamic, statement, table level

DOP can be throttled by database if maximum DOP exceeded or table too small

Oracle Database 12c Granule Request

Granule Request

Parallelism defined by Hadoop InputSplits (fan-out from PX to Hadoop)

Utilized as many cores as provided by cgroups (first-come-first-serve)

Big Data SQL on Hadoop Cluster

Async

Granule Request

Granule Request

Granule Request

Granule Request

Granule Request

Granule Request

Granule Request

Granule Request

Granule Request

Granule Request

Granule Request

Granule Request

Granule Request

Granule Request

Ho

st 1

H

ost

4

Ho

st 3

H

ost

2


Big Data SQL Prerequisites

• Oracle 12c on Linux

• Oracle Exadata

• Oracle Big Data Appliance

• Infiniband interconnection between Oracle Exadata and Oracle Big Data Appliance

29

B B B

Oracle Big Data Appliance

with CDH

Oracle 12c

on Oracle Exadata

Infiniband


Demonstration Big Data SQL in Action

30


Questions and Answers Provided we still have some time …

36


Knowledge Check True or False Hive is leveraged by Big Data SQL as a query execution engine - allowing BDS queries to automatically execute faster as the Hive execution engine improves (e.g. Spark replaces MapReduce)

False Big Data SQL leverages only the Hive Metastore (HCatalog) and the corresponding classes (InputFormat, RecordReader, SerDe) but it does not use Hive for execution

37


Knowledge Check True or False Oracle Big Data SQL sends all the data from Hadoop to Oracle Database where the query is processed. Oracle Database does column selection, it applies WHERE, GROUP BY and ORDER BY clauses etc.

False Big Data SQL Smart Scan performs low level processing of query. Smart Scan does the column projection, it applies WHERE condition & Bloom filters, it processes JSON etc.

38


Knowledge Check True or False Oracle’s ambition with Big Data SQL is to supersede all the Hadoop-on-SQL engines like Hive, SparkSQL, Impala, Drill or Presto.

False Big Data SQL is for companies with significant Oracle assets (e.g. Oracle Data Warehouse) who wants to access and process both Hadoop and Oracle data from single SQL environment

39


Thank You!


Safe Harbor Statement

The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

41

oracle big data sql - dataznalosti.cz · •performance of bi tools like cognos, oracle biee, sas,...

Documents