seamless access from oracle database to your big...

Seamless Access from Oracle Database to Your Big Data

Brian MacdonaldBig Data and Analytics SpecialistOracle Enterprise Architect

September 24, 2015

Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |

Agenda• Hadoop and SQL access methods• What is Oracle Big Data SQL• Big Data SQL Architecture• Big Data SQL Configuration• Roadmap• Customer Story• Q&A

9/23/2015 2


First Lets Define Big Data

Structured & Unstructured Data

&


SQL on Hadoop is ObviousAlthough Implementations Vary

• Hive• Impala, HAWQ, IBM Big SQL• Oracle SQL Connector for Hadoop (OSCH)• Oracle Big Data SQL • A million more (Tez, Presto, Hadapt, Stinger, Polybase, Drill, Lots of

start ups)

Stinger


SQL Analytics ChallengeSeparate silos of information to analyze

5


SQL Analytics ChallengeNo comprehensive SQL interface

6


Oracle Big Data SQLHadoop + NoSQL + Relational…

7


Oracle Big Data SQL – A New Architecture

• Powerful, high-performance SQL on Hadoop– Full Oracle SQL capabilities on Hadoop– SQL query processing local to Hadoop nodes

• Simple data integration of Hadoop and Oracle Database– Single SQL point-of-entry to access all data– Scalable joins between Hadoop and RDBMS data

• Oracle Security – Govern all Data through a Single Set of Security Policies– Redaction, VPD, etc.– Tool Access

8


Use Rich Oracle SQL Dialect Over All DataSnapshot of Oracle SQL Analytic Functions

• Ranking functions

– rank, dense_rank, cume_dist, percent_rank, ntile

• Window Aggregate functions (moving and cumulative)

– Avg, sum, min, max, count, variance, stddev, first_value, last_value

• LAG/LEAD functions

– Direct inter-row reference using offsets

• Reporting Aggregate functions

– Sum, avg, min, max, variance, stddev, count, ratio_to_report

• Statistical Aggregates

– Correlation, linear regression family, covariance

• Linear regression

– Fitting of an ordinary-least-squares regression line to a set of number pairs.

– Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions

• Descriptive Statistics

– DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- n sigma values, top/bottom 5 values

• Correlations

– Pearson’s correlation coefficients, Spearman's and Kendall's (both nonparametric).

• Cross Tabs

– Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa

• Hypothesis Testing

– Student t-test , F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA

• Distribution Fitting

– Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal, Uniform, Weibull, Exponential


Oracle Big Data SQL Architecture• Two components of Oracle Big Data SQL

– External Table extension– Big Data SQL Server Software on Big Data Appliance

Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | 12

A Smarter Oracle External Table

Oracle Table

HDFS Data

You define:

You get:

• Table name

• Automatic discovery of Hive table metadata

• Oracle types

• Automatic translation from Hadoop types• Automatic conversion from any InputFormat

• Any Degree of Parallelism

• Fan-out Parallelism across the Hadoop cluster


Unify Metadata: Publish Hive Metadata to Oracle Catalog

13

CREATE TABLE movieapp_log_json(click VARCHAR2(4000))ORGANIZATION EXTERNAL(TYPE ORACLE_HIVEDEFAULT DIRECTORY DEFAULT_DIR)

REJECT LIMIT UNLIMITED;

Big Data Appliance+

Hadoop/NoSQL

Exadata+

Oracle Database

Oracle Catalog

External Table

Hive metadata

External Table

Hive Metastore


Accessible through Oracle Data DictionaryImmediately – So the DBA doesn’t need to go to Hadoop

• ALL_HIVE_DATABASES

• ALL_HIVE_TABLES

• ALL_HIVE_COLUMNS

• DBA_HIVE_DATABASES

• DBA_HIVE_TABLES

• DBA_HIVE_COLUMNS

• USER_HIVE_DATABASES

• USER_HIVE_TABLES

• USER_HIVE_COLUMNS


Extend Oracle External Tables

CREATE TABLE movielog (

click VARCHAR2(4000))

ORGANIZATION EXTERNAL (

TYPE ORACLE_HIVE

DEFAULT DIRECTORY DEFAULT_DIR

ACCESS PARAMETERS

(

com.oracle.bigdata.tablename logs

com.oracle.bigdata.cluster mycluster

))

REJECT LIMIT UNLIMITED;

• New types of external tables– ORACLE_HIVE (inherit metadata)– ORACLE_HDFS (specify metadata)

• Access parameters for Big Data– Hadoop cluster– Remote Hive database/table

• DBMS_HADOOP Package for automatic import

• SQLDeveloper Integration (Create Table)


SQLDeveloper Integration


How Data is Stored in HadoopAs files. Pretty Simple

17

{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7}{"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6}{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}{"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}{"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7}{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}{"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7}{"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-07-01:00:01:26","recommended":"Y","activity":7}{"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-07-01:00:01:35","recommended":"N","activity":7}{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:01:39","recommended":"Y","activity":7}}{"custId":1346299,"movieId":424,"genreId":1,"time":"2012-07-01:00:05:02","recommended":"Y","activity":4}

Example: 1TB File

CREATE TABLE ORDER (custid VARCHAR2(10), recommended VARCHAR2(20), activity (NUMBER 8,2))ORGANIZATION EXTERNAL (TYPE oracle_hdfs)LOCATION ("hdfs:/usr/cust/summary/*");

• Assumes Default Values• Table Options

– Fields– Column Maps– Delimiters– Fileformats

• json, textfile, sequencefile,…

– Serdes• i.e regex

More (See Docs)


Creating an External Table against Hive

DBMS_HADOOP.CREATE_EXTDDL_FOR_HIVE (cluster_id IN VARCHAR2,db_name IN VARCHAR2 := NULL,hive_table_name IN VARCHAR2,hive_partition IN BOOLEAN,table_name IN VARCHAR2 := NULL,perform_ddl IN BOOLEAN DEFAULT FALSE,text_of_ddl OUT VARCHAR2);

set serveroutput onDECLAREDDLout VARCHAR2(4000);BEGINdbms_hadoop.create_extddl_for_hive(CLUSTER_ID=>'bigdatalite',DB_NAME=>'brian',HIVE_TABLE_NAME=>'movie',HIVE_PARTITION=>FALSE,TABLE_NAME=>'movie',PERFORM_DDL=>FALSE,TEXT_OF_DDL=>DDLout);dbms_output.put_line(DDLout);END;


Oracle External Tables – Flexibility for Varied File Structures

• Transparent schema-for-read– Use fast C-based readers when possible– Use native Hadoop classes otherwise

• Engineered to understand parallelism– Map external units of parallelism to Oracle

• Architected for extensibility– StorageHandler capability enables support for other

data sources– Examples: MongoDB, HBase, Oracle NoSQL DB

CREATE TABLE ORDER (cust_num VARCHAR2(10),

order_num VARCHAR2(20), order_total NUMBER(8,2))ORGANIZATION EXTERNAL (

TYPE ORACLE_HIVEDEFAULT DIRECTORY DEFAULT_DIR

)PARALLEL 20REJECT LIMIT UNLIMITED;


StorageHandlers: Extensibility Beyond HDFS

Hive Metastore

Oracle Big Data SQL

StorageHandlers are a metadata bridge.


Oracle Big Data SQL Architecture• Two components of Oracle Big Data SQL

– External Table extension– Big Data SQL Server Software on Big Data Appliance


What gives Exadata extreme performance?

22

Oracle Database 12c

SQL

Offload Query toExadata Storage Servers

Small data subsetquickly returned


Introducing Oracle Big Data SQL

23

Massively Parallel SQL Query across Oracle, Hadoop and NoSQL

Oracle Database 12cHadoop & NoSQL


Big Data Appliance X5-2

Sun Oracle X5-2L Servers with per server:• 2 * 18 Core Intel Xeon E5 Processors• 128 GB Memory• 96TB Disk space

Integrated Software (4.2):• Oracle Linux 6.6• Oracle Big Data SQL 1.1*• Cloudera Distribution of Apache Hadoop 5.4 – EDH Edition• Cloudera Manager 5.4• Oracle R Distribution• Oracle NoSQL Database CE

24

* Oracle Big Data SQL is separately licensed


Introducing Oracle Big Data SQL

25

Massively Parallel SQL Query across Oracle, Hadoop and NoSQL

Oracle Database 12c

Offload Query toExadata Storage Servers

Small data subsetquickly returned

Hadoop & NoSQL

Offload Query toData Nodes

SQL

data subset

SQL


Storage Layer

26

Big Data SQL Server: A New Hadoop Processing Engine

Filesystem (HDFS) NoSQL Databases(Oracle NoSQL DB, Hbase)

Resource Management (YARN, cgroups)

Processing Layer

MapReduceand Hive Spark Impala Search Big Data

SQL


How do we query Hadoop?Big Data SQL Query Execution

HDFSData NodeBDSQL

HDFS Data NodeBDSQL

Query compilation determines:• Data locations • Data structure• Parallelism

1

Fast reads using Big Data SQL Server• Schema-for-read using Hadoop classes• Smart Scan selects only relevant data

2

Process filtered result• Move relevant data to database• Join with database tables• Apply database security policies

3Hive Metastore

HDFSNameNode 1

2 3


Apply Advanced Security on Hadoop & NoSQL

28

Same security policies across all data

Oracle Database 12cHadoop

Redacteddata

subset

SQLRaw JSON

data inHadoop

JSON

Customer data

in Oracle

DBMS_REDACT.ADD_POLICY(object_schema => 'sales',object_name => 'customer_detail',column_name => 'last_name',policy_name => 'customer_privacy',function_type => DBMS_REDACT.FULL,expression => '1=1');

• Redaction• Virtual Private Database• Fine-grain Access Control


Configuration• Install Oracle Big Data SQL on the BDA using Mammoth• Run the Big Data SQL-Exadata installation script on each Oracle Exadata

database node– Sets up connectivity from Exadata to the Big Data SQL Servers on the BDA.– Installs a Hadoop client– Configure directories and files– Big Data SQL Agent– Oracle directory objects– Others


Directories• Two Types of directories are created• Common Directory – must be on cluster wide shared files system

– Subdirectories for jar files– bigdata.properties (paths,etc.)

• Cluster Directory(s)– Configuration details for each BDA Cluster– Sub directory of Common directory

• Oracle Directories that point to these Dirs– ORACLE_BIGDATA_CONFIG – Common Directory– ORACLE_BIGDATA_CL_XXXX – One for each Cluster directory (case sensitive)


Big Data SQL Agents • Created by Install Script• This multi-threaded agent bridges the metadata between Oracle Database

and Hadoop.• It launches a single JVM - instead of one for every process (which can be

quite slow). • create public database link BDSQL$_XXXX using

'extproc_connection_data'; (XXXX is the name of each BDA cluster from Cluster Directories

• create public database link BDSQL$_DEFAULT_CLUSTER using 'extproc_connection_data';


If Kerberos is used on BDA

• Must create ticket (kinit) for BDS user• BDS runs as Oracle User• Need to renew tickets

– cron– Other automation to be released soon


Requirements - For Now• Exadata• Oracle 12.1.0.2.1+• Storage Servers 2.1.1.1 or 12.1.1.0• Exadata configured on the same InfiniBand subnet as BDA• Exadata and BDA connected by InfiniBand


RoadmapSubsequent content subject to change!


Today

• Hadoop DoP linked to RDBMS DoP– Lead to many idle PQ processes– Required explicit declaration

Next

• Unlink Hadoop and RDBMS DoP• Automatic max Hadoop parallelism

– Even on serial tables

• An average of 40% faster– Even at equivalent DoP

Enhanced Parallelism


Today

• All blocks in a query must be read from disk– Large (256MB) disk I/O for each block

Next

• Automatically create Storage Indexes in Big Data SQL Agents

• Check index before reading blocks– Skip unnecessary I/Os

• An average of 65% faster– Up to 100x faster for highly selective

queries

Storage Indexing


Customer Examples

37


Building Customer Loyalty

Deliver personalized multi-channel content to every customer (example: Kroger’s ‘MyMagazine’)

Expand to a wide variety of interaction data to build customer profiles

Challenges

Customer loyalty marketing and programs for major retailers and consumer brands

Company Overview

2x improvements in campaign performance Large-scale concurrent processing of complex SQL

70% of analysis is done in SQL, uses R as well

Benefits

Solution Overview Oracle Exadata X3-8 Oracle Database with Advanced Analytics Oracle ZFS Backup Appliance Big Data Appliance Next: Big Data SQL

SQL Analysis R-based Analysis Machine Learning

BDAZFS X3-8X3-8

Source Systems (at Client)


Thank You&

Q&A

seamless access from oracle database to your big...

Documents