in3020/4020 –database systems spring 2020, week 16.2, part ...€¦ · in3020/4020 –database...

Post on 29-Jul-2020

0 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

IN3020/4020 – Database Systems Spring 2020, Week 16.2, Part 1

Data Science

Dr. M. Naci Akkøk, Chief Architect, Oracle NordicsCo-founder & CEO, In-Virtualis

Also based upon slides from Renée Wikestad, Senior Solution Engineer, Oracle

Major driver of DBMS development/usageo If you haven´t noticed yet, one of the major users of (and hence

drivers in research for) databases and database management systems is data-intensive or data-driven applications:

o Analyticso Data Science (or applied statistics)o Artificial Intelligence (AI) and Machine Learning (ML)o Also including pattern recognition, text and speech

analysis/synthesis etc.

Artificial Intelligence

Machine Learning

Deep Learning

So what is … ?

Data Science

Artificial Intelligence

Machine Learning

Deep Learning

Data Scientist; Skills

Math and Statistics

Business/DomainsKnowledge

Computer Science

Machine LearningDiagnosticAnalysis

Soft

war

e De

velo

pmen

t

What is Machine Learning?

ML is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead.

What is Machine Learning?

ML is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead.

What is Machine Learning?

ML is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead.

What is Machine Learning?

ML is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead.

What is Machine Learning?

ML is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead.

There are hybrid models, but such applications arealways data driven, data intensive.

The old way

Human written rules guided by experience

Problem: Difficult to determine the right rules, rules may change, impossible to evaluate the fit for every individual

?With Machine Learning

Machine finds patterns in historical data (model)

Uses this knowledge to predict best audience

Example: Address the right audience for my product

? pneumonia

healthy

The old way

Human work by highly skilled doctors.

With Machine Learning

Machine finds patterns in labelled historical images.

Uses this knowledge to diagnose new images.

Example: Medical diagnosis

Example: Fraud Detection

The old way

Rule based systems.

High amount of false positives.

With Machine Learning

Combination of techniques, including Anomaly Detection.

Higher accuracy.

SalePrice

Size (m2)

400K

200100 300

300K

200K

100K

500K

A model is a function

f(x)=Target

x = Input Feature

f(x) = 150 + 1.25 x

How to find a model with Linear regression?

Hypothesis:

Parameters:

Cost Function:

Target:

• We use a training set of historical data to find parameters that provide the function that best fits the training data.

• We then use a test data set of historical data to checkhow well the model fits this test data.

• If the model is good enough, we can use it to predictfuture values.

Machine Learning Types

Supervised

CLASSIFICATIONREGRESSION

Machine Learning Types

Supervised

CLASSIFICATIONREGRESSION CLUSTERING

Unsupervised

Machine Learning Types

CLASSIFICATIONREGRESSION CLUSTERING

Supervised Unsupervised Reinforcement

Agent

Environment

state

reward

action

o Supervised Learning: Uses historical data to find a model (mathematical function) or classification / pattern that best fits the data in order to predict valuesfor new data.

o Unsupervised Learning:Finds trends and patterns in data, without previous training on historicaldata. Here you try to find patterns/classifications of the data with littleor no idea what the result should look like.

Reinforcement Learning: Uses observations to learn. Learning is based on the fact that all actionshave an effect on the model and that feedback is given from the actionsthat guide and improve the learning.

Machine Learning Algorithms can be divided into categories according to their purpose.Machine Learning Models

S

M

Top languages for Data Scientists

Python and SQL are leading, with R in a good third place.

Access to Data

• On-prem/cloud/hybrid data management

• Finding data

• Combining data

Challenges of Data Science

Complex lifecycle to build models

• Multiple roles involved

• Reuse of past work

• Access to tools and infrastructure

• Access to flexible compute power

Operationalizing ML• Packaging models for

consumption

• Monitoring models

• Updating models

Problem Definition

Dataset Discovery and

Access

Data Exploration + Preparation

Model BuildingModel

Evaluation + Selection

Model Deployment

Model Evaluation

The Data Science Process

The Machine Learning Process

Business Understanding

Data Understanding Data Preparation Modeling Evaluation Deployment

Capture DataDescribe DataExplore Data

Select DataClean DataConstruct DataCombine Data

Select Modeling TechniqueBuild ModelAsses Model

Evaluate resultDecide Next Step

ObjectivesSuccess Criteria

Analytics Development

Reports & Dashboards

Data Information Predictions & Insights Appls with ML

Analytical Maturity

Valu

e of

Ana

lytic

s

Diagnostic Analysis & Reports

Predictive / Machine Learning

“ML Enabled” Applications

What Happened?

Why it Happened?

What WILL happen?

Automated ML Appls

Machine Learning Differentiators

Business Understanding

• Predict employees that voluntarily leave• Predict customers that are likely to churn• Target “best” customers • Find items that will help me sell more most profitable items• What is a specific customer most likely to purchase next?• How can I combat fraud?• ..and so on

Start with a Well-Defined Business Problem Statement

“If I had an hour to solve a problem I'd spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.”

― Albert Einstein

Business UnderstandingStart with a Well-Defined Business Problem Statement

Move the Algorithms, Not the Data!

Machine Learning Algorithms Require Data

X1

X2

A1 A2 A3 A4 A5 A6 A7

Algorithms automatically sift through large amounts of data to discover hidden patterns, new insights and make predictions

In-database ML

X1

X2

A1 A2 A3 A4 A5 A6 A7

Identify most important factor (Attribute Importance)

Predict customer behavior (Classification)Find profiles of targeted people or items (Classification)

Predict or estimate a value (Regression)

Segment a population (Clustering)

Find fraudulent or “rare events” (Anomaly Detection)

Determine co-occurring items in a “basket” (Associations)

Built-in ml library. Try these in always free Oracle Autonomous Data Warehouse or Transaction Processing databases: https://www.oracle.com/cloud/free/?source=:ow:o:p:nav:0916BCButton&intcmp=:ow:o:p:nav:0916BCButton

Classification• Naïve Bayes• Logistic Regression (GLM)• Decision Tree• Random Forest• Neural Network• Support Vector Machine• Explicit Semantic Analysis

Clustering• Hierarchical K-Means• Hierarchical O-Cluster• Expectation Maximization (EM)

Anomaly Detection• One-Class SVM

Time Series• Forecasting –

Exponential Smoothing• Includes popular models

e.g. Holt-Winters with trends, seasonality, irregularity, missing data

Regression• Linear Model• Generalized Linear Model• Support Vector Machine (SVM)• Stepwise Linear regression• Neural Network• LASSO

Attribute Importance• Minimum Description Length• Principal Comp Analysis (PCA)• Unsupervised Pair-wise KL Div• CUR decomposition for row & AI

Association Rules• A priori/ market basket

Predictive Queries• Predict, cluster, detect, features

SQL Analytics• SQL Windows• SQL Patterns• SQL Aggregates

Feature Extraction• Principal Comp Analysis (PCA)• Non-negative Matrix Factorization• Singular Value Decomposition (SVD)• Explicit Semantic Analysis (ESA)

Text Mining Support• Algorithms support text• Tokenization and theme extraction• Explicit Semantic Analysis (ESA) for

document similarity

Statistical Functions• Basic statistics:

min, max, median, stdev, t-test, F-test, Pearson’s, Chi-Sq, ANOVA, etc.

Model Deployment• SQL—1st Class Objects • Oracle RESTful API (ORDS)• OML Microservices (for Apps)

X1

X2

A1 A2 A3 A4 A5 A6 A7

o STATISTICAL FUNCTIONS

o Descriptive statistics (e.g. median, stdev, mode, sum, etc.)

o Hypothesis testing (t-test, F-test, Kolmogorov-Smirnov test, Mann Whitney test, Wilcoxon Signed Ranks test

o Correlations analysis(parametric and nonparametric e.g.Pearson’s test for correlation, Spearman's rho coefficient, Kendall's tau-b correlation coefficient)

o Ranking functions

o Cross Tabulations with Chi-square statistics

o Linear regression

o ANOVA (Analysis of variance)

o Test Distribution fit (e.g., Normal distribution test,Binomial test, Weibull test,Uniform test, Exponential test,Poisson test)

o Statistical Aggregates(min, max, mean, median, stdev, mode, quantiles, plus x sigma, minus x sigma, top n outliers, bottom n outliers)

o ANALYTICAL SQL

o SQL Windows

o SQL Aggregate functions

o LAG/LEAD functions

o SQL for Pattern Matching

o Additional approximate query processing: APPROX_COUNT, APPROX_SUM, APPROX_RANK

o Regular Expressions

Statistical Functions and Analytical SQL

In-database ML: Solution to one of the biggest problems

One of the major challenges with ALL data-drive, data-intensive applications is getting the data into the database!

Time consumingSecurity / quality riskOptimization & performance challenges … and others

The way Oracle natively includes of ML algorithmsPerform parallel ML directly in Data Warehouse or OLTP for fast model building and real time scoring of new dataKeep data secure by avoiding copy contagion – copies of data in analytics systems are a common source of data breaches

Obstacles to adoption of ML in database have been removedAbility to call out to latest R and Python algorithmsFor example accessible via SQL, REST API, Oracle R Enterprise and Python

Traditional approaches to building models are cumbersome

Modelbuilding

Scoring

Data extraction

Data exploration and transformation

Model evaluation

Modeldeployment

o There is a better way

§ Optimized algorithms§ Models produced faster and with less effort§ Better data protection and security§ Faster access to data

In-database machine learning is better

Modelbuilding

Modeldeployment

Data extraction

Data exploration and transformation

Model evaluation

Scoring

In-database machine learning

• Parallel, distributed, in-DB execution• Improved memory utilization for

greater scalability• Wide range of algorithms• Flexible model building

SQL notebook R Data Miner (no coding)Python

• Data stays secured in database• Deploy as REST API available

Machine Learning on Data Platforms

OCI Data Science Service Bring Your Own Analytics

Approaches to building ML models

Build ML on databaseManage projects and collaboratively build, deploy,

manage models

Architect your own platform with

open source

Business user friendly approach to ML

• Large amounts of data in database?

• Governance / security?

• Skillset?

Build models in a notebook

Oracle Machine Learning notebooks

o Included in Autonomous Data Warehouse

o Familiar notebook for data scientists and collaborators

o Easy sharing of notebooks and templates with permissions, versioning, and execution scheduling

o In-database machine learning algorithms and analytics functions via SQL, and later Python and R

o Secure access to data in the database

Machine learning is better in the cloudThis is what Oracle Cloud offers:

High performance compute

§ Flexible shapes from VMs to Bare Metal

§ 40% better performance than other clouds

§ 30% cheaper GPUs

Scalable, independent storage

§ NVMe SSD, Object, Block and Archive Storage

§ 50% faster HDFS performance§ 90% lower Data Lake costs with

Data Tiering

Guaranteed network performance

§ Non-over subscribed, flat 25 Gbps bandwidth between nodes

§ Only cloud with Network performance SLA

§ No egress charges

The Evolution of the Database Developer/DBA Role

o Data Engineero Architecture, o “data wrangler”

Machine Learning Solving data-driven problemsDiscovering insights Making predictions

Data SecurityData classification,Data life-cycle mgmt

Application TuningSQL tuning, connection mgmt

o Data remains in Database

• Model building and scoring occur in-database• Leverage investment in Oracle IT

o Eliminate data duplication• Eliminate separate analytical servers

o Deliver enterprise-wide “predictive” applications

Savings with in-database Machine Learning

Don’t Move the Data!Traditional ML

Hours, Days or Weeks

Data Extraction

Data Prep & Transformation

Model Building

Model “Scoring”

Data Prep. &Transformation

Data Import

avings

Model “Scoring”Embedded Data Prep

Data Preparation

Model Building

Oracle Machine Learning

Seconds, Minutes or Hours

.

SELECT SUBSTR(cust_income_level, 1, 22) income_level,AVG(DECODE(cust_gender, 'M', amount_sold, null)) sold_to_men,AVG(DECODE(cust_gender, 'F', amount_sold, null)) sold_to_women,STATS_T_TEST_INDEPU(cust_gender, amount_sold, 'STATISTIC', 'F') t_observed,STATS_T_TEST_INDEPU(cust_gender, amount_sold) two_sided_p_value

FROM customers c, sales sWHERE c.cust_id = s.cust_idGROUP BY ROLLUP(cust_income_level)ORDER BY income_level, sold_to_men, sold_to_women, t_observed;

SQL Statistical Functions

STATS_T_TEST_INDEPU (SQL) Example; P_Values < 0.05 show statistically significantly differences in the amounts purchased by men vs. women

Compare AVG Purchase Amounts Men vs. Women Grouped_By INCOME_LEVELSimple SQL Syntax—Statistical Comparisons (t-tests)

o BEGINo DBMS_DATA_MINING.CREATE_MODEL(o model_name => 'BUY_INSUR1',o mining_function => dbms_data_mining.classification,o data_table_name => 'CUST_INSUR_LTV',o case_id_column_name => 'CUST_ID',o target_column_name => 'BUY_INSURANCE',o settings_table_name => 'CUST_INSUR_LTV_SET');o END;o /

Simple SQL Syntax—Classification ModelOML for SQL Model Build & Apply

Select prediction_probability(BUY_INSUR1, 'Yes'USING 3500 as bank_funds, 825 as checking_amount, 400 as credit_balance, 22 as age, 'Married' as marital_status, 93 as

MONEY_MONTLY_OVERDRAWN, 1 as house_ownership)from dual;

ML Model Build (PL/SQL)

Model Apply (SQL query)

.

o BEGINo DBMS_DATA_MINING.CREATE_MODEL(o model_name => 'BUY_INSURANCE_AI',o mining_function => DBMS_DATA_MINING.ATTRIBUTE_IMPORTANCE,o data_table_name => 'CUST_INSUR_LTV',o case_id_column_name => 'cust_id',o target_column_name => 'BUY_INSURANCE',o settings_table_name => 'Att_Import_Mode_Settings');o END;o /

Simple SQL Syntax—Attribute ImportanceOML for SQL Model Build

SELECT attribute_name, explanatory_value, rankFROM BUY_INSURANCE_AIORDER BY rank, attribute_name;

ML Model Build (PL/SQL)

Model Results (SQL query)

ATTRIBUTE_NAME RANK ATTRIBUTE_VALUEBANK_FUNDS 1 0.2161MONEY_MONTLY_OVERDRAWN 2 0.1489N_TRANS_ATM 3 0.1463N_TRANS_TELLER 4 0.1156T_AMOUNT_AUTOM_PAYMENTS 5 0.1095

A1 A2 A3 A4 A5 A6 A7

o Collaborative UI o Based on Apache Zeppelino Supports data scientists, data analysts,

application developers, DBAso Easy sharing of notebooks and templates o Permissions, versioning, and execution scheduling

o Included with Autonomous Databaseo Automatically provisioned, managed, backed upo In-database SQL algorithms and analytics functions o Supports deployment of ML analytics via ORDS

(Oracle REST Data Services)o Soon to be augmented with Python and R

Autonomous Database + Machine Learning PlatformOracle Machine Learning Notebooks

Demo ML in Databaseo Using Oracle Autonomous Database

top related