Investigates and Discovers Data That Positively or Negatively Impacts Your Business Performance

Steven Meister (847) 791-7838 [email protected] Confidential Information © 2016

Top Challenges for Managing Big Data

IoT: collecting and analyzing real-time data such as machine health check logs

Exploding volumes of real-time data arriving from social media and IoT sources

Data is highly fragmented across databases, emails, PDFs, spreadsheets and more

Growing concerns about the release of personal information and the security and privacy of all data

U.S. and European banks paid nearly $65 billion in penalties and fines, about 40% more than in 2013, the previous high, according to the Boston Consulting Group.


Recent World Events Supporting BDR

Business Value

EU DPA Regulation

The regulation returns control over citizens’ personal data to citizens

F.C.C. Fines AT&T $25 Million for Privacy Breach

NY Times, April 2015

It is estimated that poor data quality costs US companies $600 billion per year

TechRepublic, December 21, 2015

Good news! Big banks only have $65 billion in legal fines left to pay.

Yahoo Finance, August 26, 2015

In the United States, it is reported that by 2018 there will be more than 490,000 data science positions available, but only 200,000 qualified people to fill the roles.

Datanami, January 22, 2016


Recent World Events Supporting BDR

Business Value

HIPAA Violation

Lahey Hospital and Medical Center in Burlington, Mass., agreed to pay $850,000 to settle potential HIPAA violations

On September 2, 2015, the HHS Office for Civil Rights (OCR) issued a press release

A 2011 report from McKinsey Global Institute predicted that by 2018 the U.S. could face a shortage of 140,000 to 190,000 qualified data analysts, as well as 1.5 million managers who know how to use big data to make decisions.

Worcester Business Journal, January 4, 2016

HIPAA Violation

University of Washington Medicine in Seattle agreed to pay $750,000 to settle violation allegations.

On September 2, 2015, the HHS Office for Civil Rights (OCR) issued a press release


BDR Support for the Process and Analyze Phases of the Big Data Life Cycle

Big Data Life Cycle: Ingest, Store, Process, Analyze, Visualize, Action

Process
• Classification
• Cataloging
• Meta Data Roadmap
• Accuracy
• Streaming Data
• Pre-Built Database Connectors

Analyze
• Outlier Discovery
• Correlation Coefficients
• Provide Relevant Data to BI and Analytics
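
The "Correlation Coefficients" item of the Analyze phase can be illustrated with a minimal sketch. This is plain Python computing a Pearson coefficient for two numeric columns; it is not BDR code, and the column names and values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical columns: units shipped and support tickets per week.
units = [10, 20, 30, 40, 50]
tickets = [12, 24, 33, 46, 55]
r = pearson(units, tickets)
```

A value of `r` close to 1 or -1 marks a column pair worth passing on to BI and analytics; a value near 0 suggests no linear relationship.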


Investigate and Discover Data

Fortune 1000 companies need solutions that plug into existing business processes

Discover outliers that impact business performance and compliance violations

Data scientists are expected to find correlation coefficients, outliers and other data anomalies

"According to a report from Experian, the average company estimates 27 percent of its revenue is wasted due to inaccurate or incomplete data"


Unique Value

Big Data Revealed Runs On The Hadoop Native Framework

Leverages Existing Investment In Technology

A Complete Solution - Repeatable, Collaborative, And Extensible

Data Discovery, Compliance Validation, Anomalies, Outlier Detection/Alerts, and User Definable Discovery

Read/Source Data Directly: HDFS, Teradata, Oracle, DB2, MySQL, PDF, DOCX, HTML, Excel and more

Process Static HDFS, RDBMS Data As Well As Live Streaming Data Feeds With Real-time Discovery

Run With BDR GUI Or Use Callable Modules Within Current Production Processes And ETL/BI Processes


Certifications


Cloudera Support


Kerberos Support

Kerberos /ˈkərbərəs/ is a computer network authentication protocol which works on the basis of 'tickets' to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner.


Architecture Overview

Installs on any cluster client

*Database jobs require distributing drivers to the cluster

Web front end runs on JBoss 7.1.1

Defaults to port 8282

Requires MySQL for application support tables

Utilizes MapReduce and Spark via YARN

Most jobs store their results in Hive/HBase

Leverages Cloudera Impala for performant queries

Leverages Cloudera Navigator for enhanced metadata
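
As a small operational sketch tied to the default port above, the following checks whether a TCP listener is reachable on a given host and port. Only the port number 8282 comes from the slide; the function name and any host you pass to it are assumptions, and a successful connection only shows that something is listening, not that it is the BDR front end.

```python
import socket

# 8282 is the default web front end port named on the slide.
BDR_DEFAULT_PORT = 8282

def is_port_open(host, port=BDR_DEFAULT_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `is_port_open("cluster-client.example.com")` (a hypothetical host name) would report whether the default port answers on that machine.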


Architecture Legend

Dark Blue Components are part of BigDataRevealed

Light Blue Components are part of the Cloudera ecosystem

Black/Grey Components are part of the larger open source ecosystem

BigDataRevealed REST API

Hive

Cloudera Impala

Cloudera Navigator


Anatomy of a BDR Map Reduce Job

[Diagram: components of a BDR MapReduce job]

BigDataRevealed REST API

Parameters:
• Source
• Custom REGEXs

Source Data: HDFS

BigDataRevealed MapReduce Processes:
• Data Discovery
• Batch Outlier
• Compliance
• Quick Class

Libraries:
• Tika
• OpenNLP

Intermediate data

Results: Hive, Cloudera Impala

Metadata: Cloudera Navigator

Kerberos Authentication


Anatomy of a BDR Map Reduce Job

UI utilizes Cloudera Navigator and WebHDFS to gather file system information

REST API initiates the MapReduce job with user-selected parameters

If the files are binary, they are processed through Tika, with intermediate results stored back to HDFS

Batch process executes and stores results in Hive

Results are checked against user watch conditions, and notifications are sent

UI leverages Cloudera Impala to provide a performant view into result data
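
To make the map and reduce steps concrete, here is a minimal sketch in plain Python of how a "Custom REGEXs" parameter could drive a discovery pass over text lines. It imitates the map/reduce split in a single process and is not BDR's implementation; the pattern names, the regular expressions and the sample lines are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical patterns standing in for the "Custom REGEXs" job parameter.
PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def map_line(line):
    """Map step: emit one (pattern_name, 1) pair per match found in a line."""
    for name, pattern in PATTERNS.items():
        for _ in pattern.finditer(line):
            yield name, 1

def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per pattern name."""
    totals = Counter()
    for name, count in pairs:
        totals[name] += count
    return dict(totals)

def run_job(lines):
    """Run the map step over every line, then reduce all emitted pairs."""
    return reduce_counts(pair for line in lines for pair in map_line(line))

sample = [
    "Customer 123-45-6789 wrote from jane@example.com",
    "No sensitive values here",
    "Second SSN 987-65-4321",
]
result = run_job(sample)
```

In a real cluster the map step would run in parallel over file splits and the per-pattern totals would be written to a results table rather than held in a dictionary.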


Anatomy of a BDR Database Job

[Diagram: components of a BDR database job]

BigDataRevealed REST API

Parameters:
• Source
• Custom REGEXs

Source Data: RDBMS (Oracle, DB2, Teradata, MySQL)

BigDataRevealed MapReduce Processes:
• Data Discovery
• Batch Outlier
• Compliance
• Quick Class

Libraries:
• OpenNLP

Results: Hive, Cloudera Impala

Metadata: JDBC

Kerberos Authentication


Anatomy of a BDR Database Job

UI utilizes JDBC to gather database metadata

REST API initiates the MapReduce job with user-selected parameters

Batch process executes and stores results in Hive

Results are checked against user watch conditions, and notifications are sent

UI leverages Cloudera Impala to provide a performant view into result data
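
The first step, gathering database metadata, can be sketched as follows. BDR is described as using JDBC; this example swaps in Python's built-in `sqlite3` module purely as a stand-in so that it runs anywhere, and the table and column names are hypothetical.

```python
import sqlite3

def describe_tables(conn):
    """Return {table_name: [(column_name, declared_type), ...]} for a SQLite connection."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        # Table names come from sqlite_master itself, so quoting them suffices here.
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [(col[1], col[2]) for col in cols]
    return schema

# Hypothetical in-memory table used only for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, balance REAL)")
schema = describe_tables(conn)
```

A list of tables and columns like `schema` is what a user would pick from when choosing the source for a database job.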


Anatomy of a BDR Spark Job

[Diagram: components of a BDR Spark job]

BigDataRevealed REST API

Parameters:
• Source
• Custom REGEXs

Source Data: HDFS

BigDataRevealed Spark Processes:
• Spark Outlier
• Correlation

Libraries:
• MLlib

Intermediate data

Results: HDFS, Apache Drill

Metadata: Cloudera Navigator

Kerberos Authentication


Anatomy of a BDR Spark Job

UI utilizes Cloudera Navigator and WebHDFS to gather file system information

REST API initiates the Spark job with user-selected parameters

Spark process executes and stores results in HDFS

Results are checked against user watch conditions, and notifications are sent

UI leverages Apache Drill to view HDFS file contents
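
The outlier and watch-condition steps can be illustrated with a small sketch. The deck does not say which outlier method the Spark job uses, so this example assumes a simple interquartile-range (IQR) rule in plain Python rather than Spark MLlib; the `check_watch` function, its threshold and the readings are hypothetical.

```python
import statistics

def find_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

def check_watch(outliers, max_allowed=0):
    """Hypothetical watch condition: return a notification text, or None."""
    if len(outliers) > max_allowed:
        return f"{len(outliers)} outlier(s) found: {outliers}"
    return None

# Hypothetical readings with one obvious anomaly.
readings = [10, 12, 11, 13, 12, 11, 95]
outliers = find_outliers(readings)
message = check_watch(outliers)
```

Here `message` is the text a notification would carry; when no value falls outside the fences, `check_watch` returns `None` and nothing is sent.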


Anatomy of a BDR Streaming Job

[Diagram: components of a BDR streaming job]

BigDataRevealed REST API

Parameters:
• Source
• Custom REGEXs

Source Data: Stream

BigDataRevealed Streaming Spark Processes:
• Twitter
• Future Expansion

Libraries:
• MLlib

Results: HBase, Cloudera Impala

Kerberos Authentication


Anatomy of a BDR Streaming Job

User selects from a list of admin-entered data streams

REST API initiates the Spark job with user-selected parameters

Spark executes and stores results in HBase

UI leverages Cloudera Impala to provide a view into result data
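
Spark Streaming processes a live feed as a series of small batches. The sketch below imitates that idea in plain Python, batching by item count rather than by time interval, and counts regex matches per batch. It is an illustration only, not BDR code; the messages and the pattern are hypothetical, and results are kept in a list instead of HBase.

```python
import re
from itertools import islice

def micro_batches(stream, batch_size):
    """Yield successive lists of up to batch_size items from an iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def count_matches(batch, pattern):
    """Count how many messages in a batch contain the pattern."""
    return sum(1 for message in batch if pattern.search(message))

# Hypothetical messages standing in for an admin-entered data stream.
messages = ["outage in #chicago", "all good", "#chicago update", "lunch", "#chicago again"]
pattern = re.compile(r"#chicago\b")
per_batch = [count_matches(batch, pattern) for batch in micro_batches(messages, 2)]
```

Each entry of `per_batch` is the match count for one batch, which is the kind of running result a view over the stored data would display.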



BigDataRevealed Contacts

Steven Meister

BigDataRevealed

[email protected]

847-791-7838
