dba to data scientist -satyendra

62
© Copyright 2013. Apps Associates LLC. 1 DBA to Data Scientist with Oracle Big Data August 23 rd , 2014

Upload: pasalapudi123

Post on 11-Nov-2014

256 views

Category:

Technology


4 download

DESCRIPTION

AIOUG Techday August 23rd Hyderabad - 12c Theme

TRANSCRIPT

Page 1: Dba to data scientist -Satyendra

© Copyright 2013. Apps Associates LLC. 1

DBA to Data Scientist with Oracle Big Data

August 23rd , 2014

Page 2: Dba to data scientist -Satyendra

© Copyright 2013. Apps Associates LLC. 2

Satyendra Kumar Pasalapudi

Associate Practice Director – IMS @ Apps Associates

Co Founder & Vice President of AIOUG

@pasalapudi

Page 3: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 4

Agenda

• What is Big Data

• Big Data Growth

• 4 Phases of Big Data

• NoSQL Databases

• Hadoop Basics

• Big Data Appliance

• Skills Required for DBA Scientist

Page 5: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 6

Cost effectively manage

and analyze

all available data in its

native form

unstructured,

structured, streaming

ERP CRM

RFID

Website

Network Switches

Social Media

Billing

Big data Challenge

Page 6: Dba to data scientist -Satyendra

90%

Of the world’s data has been created in the last two years

Source: IBM

Page 7: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 8

3 Macro Trends Driving Disruption

Page 8: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 9

Gen X Stats

Page 9: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 10

Big Data – High Data Varity & Velocity

Page 10: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 11

How Did Big Data Evolve?

• More people interacting with data

• Smartphones

• Internet

• Greater volumes of data being generated (machine-to-machine generation)

• Sensors

• General Packet Radio Services (GPRS)

Page 11: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 12

What Is Big Data?

Big data is defined as voluminous unstructured data from many different sources, such as:

• Social networks

• Banking and financial services

• E-commerce services

• Web-centric services

• Internet search indexes

• Scientific searches

• Document searches

• Medical records

• Weblogs

Page 12: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 13

Big Data

• Extremely large datasets that are hard to deal with using Relational Databases

– Storage/Cost

– Search/Performance

– Analytics and Visualization

• Need for parallel processing on hundreds of machines

– ETL cannot complete within a reasonable time

– Beyond 24hrs – never catch up

Page 13: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 14

Characteristics of Big Data

Volume Variety Velocity Value

Social Networks

RSS Feeds

Micro Blogs

Page 14: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 15

Characteristics of Big Data

Page 15: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 16

Big Data Eco System

Page 16: Dba to data scientist -Satyendra

Financial services Discover fraud patterns based on multi-years worth of credit card transactions and in a time scale that does not allow new patterns to accumulate significant losses. Measure transaction processing latency across many business processes by processing and correlating system log data.

Internet retailer Discover fraud patterns in Internet retailing by mining Web click logs. Assess risk by product type and session/Internet Protocol (IP) address activity.

Retailers Perform sentiment analysis by analyzing social media data.

Drug discovery Perform large-scale text analytics on publicly available information sources.

Healthcare Analyze medical insurance claims data for financial analysis, fraud detection, and preferred patient treatment plans. Analyze patient electronic health records for evaluation of patient care regimes and drug safety.

Mobile telecom Discover mobile phone churn patterns based on analysis of CDRs and correlation with activity in subscribers’ networks of callers.

IT technical support Perform large-scale text analytics on help desk support data and publicly available support forums to correlate system failures with known problems.

Scientific research Analyze scientific data to extract features (e.g., identify celestial objects from telescope imagery).

Internet travel Improve product ranking (e.g., of hotels) by analysis of multi-years worth of Web click logs.

Big Data /Hadoop Use Cases

Page 17: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 18

The Four Phases of Data Conversion

Acquire Organize Analyze Decide

1 4 3 2

Page 18: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 19

Database Market Disruption

$30B Database Market Being Disrupted

Page 19: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 20

Operational vs. Analytical Databases

Page 20: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 21

Growth is the New Reality

Instagram gained nearly 1 million users overnight when they expanded to Android

Page 21: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 22

Draw Something Viral Growth

Page 22: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 23

How Do You Take This Growth?

Page 23: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 24

Scaling Out RDBMS

Page 24: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 25

RDBMS are Not Enough?

Page 25: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 26

NoSQL Technology Scales Out

Page 26: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 27

A New Technology

Page 27: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 28

Use Cases

Page 28: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 29

Relational vs. Documental Data Model

JSON or JavaScript Object Notation, is a text-based open standard designed for human-readable data interchange. It is derived from the JavaScript scripting language for representing simple data structures and associative arrays, called objects. Despite its relationship to JavaScript, it is language-independent, with parsers available for many languages

Page 29: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 30

Brewer's CAP Theorem

Page 30: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 31

Brewer's CAP Theorem

Page 31: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 32

NoSQL Technology Spectrum

Page 32: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 33

Operational vs. Analytical Databases

Page 33: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 34

Hadoop Design Principles

• System shall manage and heal itself

– Automatically and transparently route around failure

– Speculatively execute redundant tasks if certain nodes are detected to be slow

• Performance shall scale linearly

– Proportional change in capacity with resource change

• Compute should move to data

– Lower latency, lower bandwidth

• Simple core, modular and extensible

Page 34: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 35

Hadoop Intro

• At Google MapReduce operation are run on a special file system called Google File System (GFS) that is highly optimized for this purpose.

• GFS is not open source.

• Doug Cutting and others at Yahoo! reverse engineered the GFS and called it Hadoop Distributed File System (HDFS).

• The software framework that supports HDFS, MapReduce and other related entities is called the project Hadoop or simply Hadoop.

• Projects Nutch and Lucene were started with “search” as the application in mind;

Page 35: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 36

Hadoop Intro

• Hadoop Distributed file system and mapreduce were found to have applications beyond search.

• HDFS and MapReduce were moved out of Nutch as a sub-project of Lucene and later promoted into a apache project Hadoop

Page 36: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 37

Hadoop History

• Dec 2004 – Google GFS paper published

• July 2005 – Nutch uses MapReduce

• Feb 2006 – Starts as a Lucene subproject

• Apr 2007 – Yahoo! on 1000-node cluster

• Jan 2008 – An Apache Top Level Project

• Jul 2008 – A 4000 node test cluster

• May 2009 – Hadoop sorts Petabyte in 17 hours

Page 37: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 38

What & Where is Hadoop Used For?

Search

• Yahoo, Amazon, Zvents

Log Processing

• Facebook, Yahoo, ContextWeb. Joost, Last.fm

Recommendation Systems

• Facebook

Data Warehouse

• Facebook, AOL

Video and Image Analysis

• New York Times, Eyealike

Page 38: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 39

What & Where is Hadoop Used For?

Amazon.com, Ancestry.com, Akamai, American Airlines, AOL, Apple, AVG , eBay, Electronic Arts, Hortonworks, Federal Reserve Board of Governors, Foursquare, Fox Interactive Media, Google, HewlettPackard, IBM, ImageShack, ISI, InMobi, Intuit, Joost, Last.fm, LinkedIn, Microsoft, NetApp, Netflix, Ooyala, Riot Games, Spotify, Qualtrics, The New York Times, SAP AG, SAS Institute, StumbleUpon, Twitter, Yodlee

Page 39: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 40

Hadoop Ecosystem

HDFS (Hadoop Distributed File System)

HBase (key-value store)

MapReduce (Job Scheduling/Execution System)

Data Access

Sqoop Flume

Client Access

Hue Hive(Sql)

Pig(Pl/Sql)

Zoo

Kee

pe

r (C

oo

rdin

atio

n)

(Streaming/Pipes APIs)

Ch

ukw

a (M

on

ito

rin

g)

Data Mining

Mahout

OS – Redhat, Suse, Ubuntu,Windows

Commodity Hardware

Java Virtual Machine

Networking

Orchestration

Oozie

Page 40: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 43

PIG

• Compiled into a series of MapReduce jobs

– Easier to program

– Optimization opportunities

• grunt> A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

• grunt> B = FOREACH A GENERATE name;

Page 41: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 44

Hive

Managing and querying structured data • MapReduce for execution • SQL like syntax • Extensible with types,

functions, scripts • Metadata stored in a RDBMS

(MySQL) • Joins, Group By, Nesting • Optimizer for number of

MapReduce required

hive> SELECT a.foo FROM invites a WHERE a.ds='<DATE>‘;

Page 42: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 45

Sqoop

• It supports incremental loads of a single table or a free form SQL query as well as saved jobs which can be run multiple times to import updates made to a database since the last import

• Imports can also be used to populate tables in Hive or HBase

• Exports can be used to put data from Hadoop into a relational database

Page 43: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 47

HDFS Architecture

Page 44: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 50

Architecture Overview

Page 45: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 51

HDFS Distributions

Page 47: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 53

Oracle Big Data Appliance: Introduction

Oracle Big Data Appliance: Introduction

• Oracle Big Data Appliance is an engineered system containing both hardware and software components. Oracle Big Data Appliance delivers:

‒ A complete and optimized solution for big data

‒ Single-vendor support for both hardware and software

‒ An easy-to-deploy solution

‒ Tight integration with Oracle Database

Page 48: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 54

Oracle Big Data Appliance: Where It Stands?

Big Data Appliance

Data Variety

Unstructured

Schema-less

Schema

Information Acquire Organize Analyze

Page 49: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 55

Oracle Big Data: Software Components

Oracle Big Data Appliance

Oracle NoSQL Database

Oracle Big Data Connectors

Open Source R Distribution

Cloudera Manager & Cloudera’s Distribution Including Apache Hadoop

Oracle Linux 5.6 and Java Hotspot VM

Page 50: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 56

Oracle Big Data with Oracle Exadata

Page 51: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 57

Oracle Big Data Solution

Oracle BI Foundation Suite

Oracle Real-Time Decisions

Endeca Information Discovery

Decide

Oracle Advanced Analytics

Oracle Database

Oracle Spatial & Graph

Acquire – Organize – Analyze

Oracle Big Data Connectors

Oracle Data Integrator

Stream

Oracle Event Processing

Apache Flume

Oracle GoldenGate

Oracle NoSQL Database

Cloudera Hadoop

Oracle R Distribution

Page 52: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 58

Mapping the Phases with Software

Acquire Phase

– Hadoop Distributed File System

– Oracle NoSQL Database

Organize Phase

– Hadoop Software Framework

– Oracle Data Integrator

Analyze Phase

– R Statistical Programming Environment

– Oracle Data Warehouse

Page 53: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 61

Analyze Phase

Analyze Database

+ Oracle R Enterprise

Statistical Functions

Data Mining Algorithms

Query Capabilities

Page 54: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 63

Oracle Big Data: Software Components

Oracle Big Data Appliance

Oracle NoSQL Database

Oracle Big Data Connectors

Open Source R Distribution

Cloudera Manager & Cloudera’s Distribution Including Apache Hadoop

Oracle Linux 5.6 and Java Hotspot VM

Page 55: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 64

Oracle Big Data Sql

• New Technology Bridges Oracle, Hadoop, and NoSQL Data Stores

Using Oracle Big Data SQL, organizations can:

• Combine data from Oracle Database, Hadoop and NoSQL in a single SQL query

• Query and analyze data in Hadoop and NoSQL

• Integrate big data analysis into existing applications and architectures

• Extend security and access policies from Oracle Database to data in Hadoop and NoSQL

• Maximize query performance on all data using Smart Scan

• One Fast SQL Query for All Your Data

Page 56: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 65

Data Science

Source :http://wikibon.org/blog/role-of-the-data-scientist/

Page 57: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 66

Data Scientist

Source :http://wikibon.org/blog/role-of-the-data-scientist/

Page 58: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 67

DBA to Data Scientist

Hadoop HDFS Map Reduce NoSQL Database Hive Pig OR All the above with Big Data Appliance

Page 59: Dba to data scientist -Satyendra

© Copyright 2014. Apps Associates LLC. 68

Big Data Eco System

Page 60: Dba to data scientist -Satyendra

© Copyright 2013. Apps Associates LLC. 69

On Premise Hadoop as RDBMS “active archive”

SALES 2013

Oracle Database

Structured Data Analytics from Apps

SALES 2012

SALES 2011

SALES 2010

SALES 2011

SALES 2010

“Hive” provides an SQL-like query layer over Hadoop and MapReduce

Unstructured + Structured Data Analytics from Apps

Hadoop for Structured Archive and Unstructured data

Page 61: Dba to data scientist -Satyendra

© Copyright 2013. Apps Associates LLC. 70

AWS EMR as RDBMS “active archive”

SALES 2013

Oracle Database

Structured Data Analytics from Apps

SALES 2012

SALES 2011

SALES 2010

SALES 2011

SALES 2010

“Hive” provides an SQL-like query layer over Amazon EMR

Unstructured + Structured Data Analytics from Apps

AWS EMR for Structured Archive and Unstructured data

Amazon Elastic MapReduce (Amazon EMR)

Page 62: Dba to data scientist -Satyendra

Thank You! @pasalapudi