Unlocking Operational Intelligence from the Data Lake
Mat Keep, Director, Product & Market Analysis
[email protected] @matkeep

Post on 07-Apr-2017

TRANSCRIPT

Page 1: Unlocking Operational Intelligence from the Data Lake

Unlocking Operational Intelligence from the Data Lake

Mat Keep
Director, Product & Market Analysis
[email protected] @matkeep

Page 2: Unlocking Operational Intelligence from the Data Lake

The World is Changing
Digital Natives & Digital Transformation

• Data: Volume, Velocity, Variety
• Time: Iterative, Agile, Short Cycles
• Risk: Always On, Secure, Global
• Cost: Open-Source, Cloud, Commodity

Page 3: Unlocking Operational Intelligence from the Data Lake

Creating the “Insight Economy”

Page 4: Unlocking Operational Intelligence from the Data Lake

Data Warehouse Challenges

Page 5: Unlocking Operational Intelligence from the Data Lake

The Rise of the Data Lake

Page 6: Unlocking Operational Intelligence from the Data Lake

“Big Data” is More than Just Hadoop

• 24% CAGR: Hadoop, Spark & Streaming
• 18% CAGR: Databases
• Databases are key components within the big data landscape

Page 7: Unlocking Operational Intelligence from the Data Lake

Apache Hadoop Data Lake

• Risk modeling
• Retrospective & predictive analytics
• Machine learning & pattern matching
• Customer segmentation & churn analysis
• ETL pipelines
• Active archives

NoSQL Database

Page 8: Unlocking Operational Intelligence from the Data Lake

“Thru 2018, 70 percent of Hadoop deployments will not meet cost savings and revenue generation objectives due to skills and integration challenges.”
Nick Heudecker, Research Director, Data Management & Integration

Source: http://www.infoworld.com/article/2980316/big-data/why-your-big-data-strategy-is-a-bust.html

Page 9: Unlocking Operational Intelligence from the Data Lake

How to Avoid Being in the 70%?

1. Unify data lake analytics with the operational applications
2. Create smart, contextually aware, data-driven apps & insights
3. Integrate a database layer with the data lake

Page 10: Unlocking Operational Intelligence from the Data Lake

MongoDB & Hadoop: What’s Common
Distributed Processing & Analytics

Common Attributes
• Schema-on-read
• Multiple replicas
• Horizontal scale
• High throughput
• Low TCO

Page 11: Unlocking Operational Intelligence from the Data Lake

MongoDB & Hadoop: What’s Different
Distributed Processing & Analytics

Hadoop
• Data stored as large files (64MB-128MB blocks). No indexes
• Write-once-read-many, append-only
• Designed for high throughput scans across TB/PB of data
• Multi-minute latency

Page 12: Unlocking Operational Intelligence from the Data Lake

MongoDB & Hadoop: What’s Different
Distributed Processing & Analytics

MongoDB
• Random access to subsets of data
• Millisecond latency
• Expressive querying, rich aggregations & flexible indexing
• Update fast-changing data without rewriting or recomputing the entire data set
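The difference between scanning unindexed data and reading a subset through an index can be illustrated with a small in-memory sketch. It does not use Hadoop or MongoDB; the record layout and the function names (`full_scan`, `build_index`, `indexed_lookup`) are hypothetical and exist only to show how many records each access style has to touch.

```python
from collections import defaultdict

# Hypothetical records standing in for rows of customer data.
records = [
    {"customer_id": i, "city": "London" if i % 3 == 0 else "Paris", "spend": i * 10}
    for i in range(1, 10)
]


def full_scan(rows, city):
    """Scan-style access: every record is read to answer the question."""
    touched = 0
    matches = []
    for row in rows:
        touched += 1
        if row["city"] == city:
            matches.append(row["customer_id"])
    return matches, touched


def build_index(rows, field):
    """Build a secondary index: field value -> positions of matching rows."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[field]].append(position)
    return index


def indexed_lookup(rows, index, value):
    """Index-style access: only the matching records are read."""
    positions = index.get(value, [])
    return [rows[p]["customer_id"] for p in positions], len(positions)


city_index = build_index(records, "city")
scan_ids, scan_touched = full_scan(records, "London")
index_ids, index_touched = indexed_lookup(records, city_index, "London")
```

Both calls return the same customers, but the scan reads all nine records while the indexed lookup reads only the three that match. The gap widens as the data set grows, which is the trade-off the two bullet lists describe.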

Page 13: Unlocking Operational Intelligence from the Data Lake

Bringing it Together

Online services, powered by MongoDB
• User account & personalization
• Product catalog
• Session management & shopping cart
• Recommendations

Back-end machine learning, powered by Hadoop
• Customer classification & clustering
• Basket analysis
• Brand sentiment
• Price optimization

MongoDB Connector for Hadoop

Page 14: Unlocking Operational Intelligence from the Data Lake

Design Pattern: Operationalized Data Lake

[Architecture diagram; labels recovered from the slide]
• Data sources: Sensors, User Data, Clickstreams, Logs
• Message Queue
• Operational apps: Customer Data Mgmt, Mobile App, IoT App, Live Dashboards
• Raw Data, Processed Events
• Distributed Processing Frameworks
• Real-Time Access: Millisecond latency. Expressive querying & flexible indexing against subsets of data. Updates in place. In-database aggregations & transformations
• Batch Processing, Batch Views: Multi-minute latency with scans across TB/PB of data. No indexes. Data stored in 128MB blocks. Write-once-read-many & append-only storage model
• Analytics: Churn Analysis, Enriched Customer Profiles, Risk Modeling, Predictive Analytics

Page 15: Unlocking Operational Intelligence from the Data Lake

Design Pattern: Operationalized Data Lake

Configure where to land incoming data

Page 16: Unlocking Operational Intelligence from the Data Lake

Design Pattern: Operationalized Data Lake

Raw data processed to generate analytics models

Page 17: Unlocking Operational Intelligence from the Data Lake

Design Pattern: Operationalized Data Lake

MongoDB exposes analytics models to operational apps and handles real-time updates

Page 18: Unlocking Operational Intelligence from the Data Lake

Design Pattern: Operationalized Data Lake

Compute new models against MongoDB & HDFS
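The steps of this design pattern (land data, compute models in batch, expose them to applications, apply real-time updates) can be walked through with a toy, in-memory sketch. Everything here is an assumption made for illustration: `lake` is a plain list standing in for HDFS, `serving` is a dictionary standing in for the operational database, and the "churn risk" rule is invented.

```python
from collections import Counter

# Stand-ins for the two stores in the diagram (both hypothetical):
# `lake` is an append-only list of raw events, `serving` is a keyed store
# that plays the role of the operational database.
lake = []
serving = {}


def land_event(event):
    """Land incoming data. Here every raw event is appended to the lake."""
    lake.append(event)


def batch_compute_models():
    """Scan all raw data and derive a per-customer model (a toy
    'churn risk' flag based on the number of complaint events)."""
    complaints = Counter(e["customer"] for e in lake if e["type"] == "complaint")
    customers = {e["customer"] for e in lake}
    return {c: {"churn_risk": "high" if complaints[c] >= 2 else "low"} for c in customers}


def publish_models(models):
    """Expose batch results to operational apps via the keyed store."""
    for customer, model in models.items():
        serving.setdefault(customer, {}).update(model)


def record_login(customer, timestamp):
    """Real-time update applied in place, without re-running the batch job."""
    serving.setdefault(customer, {})["last_login"] = timestamp


for e in [
    {"customer": "c1", "type": "complaint"},
    {"customer": "c1", "type": "complaint"},
    {"customer": "c2", "type": "purchase"},
]:
    land_event(e)

publish_models(batch_compute_models())
record_login("c1", "2017-04-07T09:00:00Z")
```

After these calls an application can read a customer's profile with a single key lookup on `serving`, while the batch step remains free to rescan the whole lake on its own schedule.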

Page 19: Unlocking Operational Intelligence from the Data Lake

Operational Database Requirements

1. “Smart” integration with the data lake
2. Powerful real-time analytics
3. Flexible, governed data model
4. Scale with the data lake
5. Sophisticated management & security

Page 20: Unlocking Operational Intelligence from the Data Lake

Evaluating your Options

Page 21: Unlocking Operational Intelligence from the Data Lake

Query and Data Model

Capability | MongoDB | Relational | Column Family (e.g. HBase)
Rich query language & secondary indexes | Yes | Yes | Requires integration with separate Spark / Hadoop cluster
In-database aggregations & search | Yes | Yes | Requires integration with separate Spark / Hadoop cluster
Dynamic schema | Yes | No | Partial
Data validation | Yes | Yes | App-side code

• Why it matters
  – Query & Aggregations: Rich, real-time analytics against operational data
  – Dynamic Schema: Manage multi-structured data
  – Data Validation: Enforce data governance between data lake & operational apps

Page 22: Unlocking Operational Intelligence from the Data Lake

Data Lake Integration

Capability | MongoDB | Relational | Column Family (e.g. HBase)
Hadoop + secondary indexes | Yes | Yes: Expensive | No secondary indexes
Spark + secondary indexes | Yes | Yes: Expensive | No secondary indexes
Native BI connectivity | Yes | Yes | 3rd-party connectors
Workload isolation | Yes | Yes: Expensive | Load data to separate Spark/Hadoop cluster

• Why it matters
  – Hadoop + Spark: Efficient data movement between data lake, processing layer & database
  – Native BI Connectivity: Visualizing operational data
  – Workload Isolation: Separation between operational and analytical workloads

Page 23: Unlocking Operational Intelligence from the Data Lake

Operationalizing for Scale & Security

Capability | MongoDB | Relational | Column Family (e.g. HBase)
Robust security controls | Yes | Yes | Yes
Scale-out on commodity hardware | Yes | No | Yes
Sophisticated management platform | Yes | Yes | Monitoring only

• Why it matters
  – Security: Data protection for regulatory compliance
  – Scale-Out: Grow with the data lake
  – Management: Reduce TCO with platform automation, monitoring, disaster recovery

Page 24: Unlocking Operational Intelligence from the Data Lake

MongoDB Nexus Architecture

Page 25: Unlocking Operational Intelligence from the Data Lake

Adoption & Skills Availability

Page 26: Unlocking Operational Intelligence from the Data Lake

Operational Data Lake in Action

Page 27: Unlocking Operational Intelligence from the Data Lake

UK’s Leading Price Comparison Site
Out-pacing Internet search giants with a continuous delivery pipeline powered by microservices & Docker running MongoDB, Kafka and Hadoop in the cloud

Problem
• Existing EDW with nightly batch loads
• No real-time analytics to personalize user experience
• Application changes broke ETL pipeline
• Unable to scale as services expanded

Solution
• Microservices architecture running on AWS
• All application events written to Kafka queue, routed to MongoDB and Hadoop
• Events that personalize real-time experience (e.g. triggering email send, additional questions, offers) written to MongoDB
• All event data aggregated with other data sources and analyzed in Hadoop, updated customer profiles written back to MongoDB

Results
• 2x faster delivery of new services after migrating to new architecture
• Enabled continuous delivery: pushing new features every day
• Personalized user experience, plus higher uptime and scalability
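The routing rule in this case study, where every event reaches the analytics store and only personalization events also reach the operational store, could be sketched as below. This is a minimal illustration with plain lists as sinks. It does not use Kafka, MongoDB or Hadoop client libraries, and the event type names in `REALTIME_TYPES` are hypothetical.

```python
# Event types that drive real-time personalization (hypothetical names).
REALTIME_TYPES = {"email_trigger", "follow_up_question", "offer"}


def route(events):
    """Fan events out to two sinks: every event goes to the analytics sink,
    and only personalization events also go to the operational sink."""
    operational_sink = []   # stand-in for the operational database
    analytics_sink = []     # stand-in for the data lake
    for event in events:
        analytics_sink.append(event)
        if event["type"] in REALTIME_TYPES:
            operational_sink.append(event)
    return operational_sink, analytics_sink


stream = [
    {"id": 1, "type": "page_view"},
    {"id": 2, "type": "offer"},
    {"id": 3, "type": "email_trigger"},
]
to_operational, to_analytics = route(stream)
```

Keeping the rule in one function makes it easy to see that the analytics side always receives the complete event history, which is what later allows enriched profiles to be computed and written back.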

Page 28: Unlocking Operational Intelligence from the Data Lake

Leading Global Airline
Customer Data Management
Single view and real-time analytics with MongoDB, Spark, & Hadoop

Problem
• Customer data scattered across 100+ different systems
• Poor customer experience: no personalization, no consistent experience across brands or devices
• No way to analyze customer behavior to deliver targeted offers

Solution
• Selected MongoDB over HBase for schema flexibility and rich query support
• MongoDB stores all customer profiles, served to web, mobile & call-center apps
• Distributed across multiple regions for DR and data locality
• All customer interactions stored in MongoDB, loaded into Hadoop for customer segmentation
• Unified processing pipeline with Spark running across MongoDB and Hadoop

Results
• Single profile created for each customer, personalizing experience in real time
• Revenue optimization by calculating best ticket prices
• Reduce competitive pressures by identifying gaps in product offerings

Page 29: Unlocking Operational Intelligence from the Data Lake

World’s Most Sophisticated Traveler Safety Platform
Analyzing PBs of Data with MongoDB, Hadoop, Apache NiFi & SAP HANA

Problem
• Commercialize a national security platform
• Massive volumes of multi-structured data: news, RSS & social feeds, geospatial, geological, health & crime stats
• Requires complex analysis, delivered in real time, always on

Solution
• Apache NiFi for data ingestion, routing & metadata management
• Hadoop for text analytics
• HANA for geospatial analytics
• MongoDB correlates analytics with user profiles & location data to deliver real-time alerts to corporate security teams & individual travelers

Results
• Enables Prescient to uniquely blend big data technology with its security IP developed in government
• Dynamic data model supports indexing 38k data sources, growing at 200 per day
• 24x7 continuous availability
• Scalability to PBs of data

Page 30: Unlocking Operational Intelligence from the Data Lake

Powering Global Threat Intelligence
Cloud-based real-time analytics with MongoDB & Hadoop

Problem
• Requirement to analyze data over many different dimensions to detect real time threat profiles
• HBase unable to query data beyond primary key lookups
• Lucene search unable to scale with growth in data

Solution
• MongoDB + Hadoop to collect and analyze data from internet sensors in real time
• MongoDB dynamic schema enables sensor data to be enriched with geospatial tags
• Auto-sharding to scale as data volumes grow

Results
• Run complex, real-time analytics on live data
• Improved query performance by over 3x
• Scale to support doubling of data volume every 24 months
• Deploy across global data centers for low latency user experience
• Engineering teams have more time to develop new features

Page 31: Unlocking Operational Intelligence from the Data Lake

Wrapping Up

Page 32: Unlocking Operational Intelligence from the Data Lake

Conclusion

1. Data lakes enable enterprises to affordably capture & analyze more data
2. Operational and analytical workloads are converging
3. MongoDB is the key technology to operationalize the data lake

Page 33: Unlocking Operational Intelligence from the Data Lake

MongoDB Enterprise Advanced

• MongoDB Ops Manager: Monitoring & Alerting, Query Optimization, Backup & Recovery, Automation & Configuration
• MongoDB Compass: Schema Visualization, Data Exploration, Ad-Hoc Queries
• MongoDB Connector for BI: Visualization, Analysis, Reporting
• MongoDB Enterprise Server
• Authentication, Authorization, Auditing, Encryption (In Flight & at Rest)
• REST API
• 24x7 Support (1 hour SLA)
• Emergency Patches
• Customer Success Program
• On-Demand Online Training
• Commercial License (No AGPL Copyleft Restrictions)
• Warranty, Limitation of Liability, Indemnification
• Platform Certifications

Page 34: Unlocking Operational Intelligence from the Data Lake

About MongoDB, Inc.

• 500+ employees
• 2,000+ customers
• 13 offices worldwide
• $311M in funding

Page 36: Unlocking Operational Intelligence from the Data Lake
Page 37: Unlocking Operational Intelligence from the Data Lake

For More Information

Resource | Location
Case Studies | mongodb.com/customers
Presentations | mongodb.com/presentations
Free Online Training | education.mongodb.com
Webinars and Events | mongodb.com/events
Documentation | docs.mongodb.org
MongoDB Downloads | mongodb.com/download
Additional Info | [email protected]

Page 38: Unlocking Operational Intelligence from the Data Lake

One of World’s Largest Banks
Creating new customer insights with MongoDB & Spark

Problem
• System failures in online banking systems creating customer satisfaction issues
• No personalization experience across channels
• No enrichment of user data with social media chatter

Solution
• Apache Flume to ingest log data & social media streams, Apache Spark to process log events
• MongoDB to persist log data and KPIs, immediately rebuild user sessions when a service fails
• Integration with MongoDB query language and secondary indexes to selectively filter and query data in real time

Results
• Improved user experience, with more customers using online, self-service channels
• Improved services following deeper understanding of how users interact with systems
• Greater user insight by adding social media insights

Page 39: Unlocking Operational Intelligence from the Data Lake

The New Enterprise Stack

LAYER | LEGACY | FUTURE STATE
APPS | On-Premise, Monoliths | SaaS, Microservices
DATABASE | Relational (Oracle) | Non-Relational (MongoDB)
EDW | Teradata, Oracle, etc. | Hadoop
COMPUTE | Scale-Up Server | Containers / Commodity Server / Cloud
STORAGE | SAN | Local Storage & Data Lakes
NETWORK | Routers and Switches | Software-Defined Networks

Page 40: Unlocking Operational Intelligence from the Data Lake

Workload Isolation for Real-Time Analytics

[Diagram: one MongoDB Primary and two MongoDB Secondaries, serving an Operational Application and an Analytics Application]
• Querying operational data
• Real-time analytics to inform operational application
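The slide does not say how the two applications are pointed at different replica set members. One common way, assumed here, is the `readPreference` option of the MongoDB connection string. The sketch below only builds the two URIs; the host names and the replica set name `rs0` are hypothetical, and no driver is imported or connection opened.

```python
from urllib.parse import urlencode


def build_uri(hosts, replica_set, read_preference):
    """Assemble a MongoDB connection string for a replica set.

    `readPreference` controls which members serve reads: "primary" keeps
    reads on the primary, "secondary" sends them to secondaries.
    """
    query = urlencode({"replicaSet": replica_set, "readPreference": read_preference})
    return "mongodb://" + ",".join(hosts) + "/?" + query


# Hypothetical three-member replica set.
HOSTS = ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"]

operational_uri = build_uri(HOSTS, "rs0", "primary")
analytics_uri = build_uri(HOSTS, "rs0", "secondary")
```

With this split, the analytics application's long-running reads are served by secondaries, so they do not compete with the operational application's traffic on the primary. Reads from secondaries can lag slightly behind the primary, which is usually acceptable for analytics.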

Page 41: Unlocking Operational Intelligence from the Data Lake

Handling Multi-Structured Data from the Data Lake
Flexible, Governed Data Model

{
  first_name: 'Paul',
  surname: 'Miller',
  cell: 447557505611,
  city: 'London',
  location: [45.123, 47.232],
  Profession: ['banking', 'finance', 'trader'],
  cars: [
    { model: 'Bentley', year: 1973, value: 100000, … },
    { model: 'Rolls Royce', year: 1965, value: 330000, … }
  ]
}

• Typed field values: String, Number, Geo-Location
• Fields can contain arrays
• Fields can contain an array of sub-documents
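To make the callouts concrete, the same profile can be written as a Python dictionary and inspected. This is only a sketch: the fields elided with "…" on the slide are left out rather than guessed, and `describe_types` is a hypothetical helper, not part of any MongoDB driver.

```python
# The profile from the slide as a Python dictionary. The elided "…" fields
# inside each car are omitted rather than invented.
profile = {
    "first_name": "Paul",
    "surname": "Miller",
    "cell": 447557505611,
    "city": "London",
    "location": [45.123, 47.232],
    "Profession": ["banking", "finance", "trader"],
    "cars": [
        {"model": "Bentley", "year": 1973, "value": 100000},
        {"model": "Rolls Royce", "year": 1965, "value": 330000},
    ],
}


def describe_types(document):
    """Map each top-level field to the name of its Python type."""
    return {field: type(value).__name__ for field, value in document.items()}


field_types = describe_types(profile)

# Sub-documents inside an array are reached by ordinary iteration.
car_models = [car["model"] for car in profile["cars"]]
```

Strings, numbers, arrays and arrays of sub-documents all sit in one record, so the cars need no separate table or join.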

Page 42: Unlocking Operational Intelligence from the Data Lake

Expressive Query Language, Rich Secondary Indexes

• Rich Queries
  – Find Paul’s cars
  – Find everybody in London with a car between 1970 and 1980
• Geospatial
  – Find all of the car owners within 5km of Trafalgar Sq.
• Text Search
  – Find all the cars described as having leather seats
• Aggregation
  – Calculate the average value of Paul’s car collection
• Map Reduce
  – What is the ownership pattern of colors by geography over time (is purple trending in China?)
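Several of the questions above can be written as MongoDB filter and pipeline documents. The sketch below builds them as plain Python dictionaries, in the form one would pass to PyMongo's `Collection.find` and `Collection.aggregate`; no driver is imported and nothing is executed. The field names come from the sample profile document. The Trafalgar Square coordinates are approximate, the `[longitude, latitude]` ordering of `location` is an assumption, and the text search assumes a text index exists on the relevant description field. The Map Reduce example is not covered.

```python
EARTH_RADIUS_KM = 6378.1
TRAFALGAR_SQ = [-0.1281, 51.5080]  # approximate [longitude, latitude]

# Rich query: Paul's cars (a filter plus a projection of the cars field).
pauls_cars_filter = {"first_name": "Paul"}
pauls_cars_projection = {"cars": 1, "_id": 0}

# Rich query: everybody in London with a car from 1970 to 1980.
# $elemMatch requires one array element to satisfy both bounds.
london_70s_filter = {
    "city": "London",
    "cars": {"$elemMatch": {"year": {"$gte": 1970, "$lte": 1980}}},
}

# Geospatial: owners within 5 km of Trafalgar Square.
# $centerSphere takes its radius in radians (distance / Earth radius).
near_trafalgar_filter = {
    "location": {
        "$geoWithin": {"$centerSphere": [TRAFALGAR_SQ, 5 / EARTH_RADIUS_KM]}
    }
}

# Text search: the escaped quotes request the exact phrase.
leather_seats_filter = {"$text": {"$search": "\"leather seats\""}}

# Aggregation: average value of Paul's car collection.
avg_value_pipeline = [
    {"$match": {"first_name": "Paul"}},
    {"$unwind": "$cars"},
    {"$group": {"_id": "$surname", "avg_value": {"$avg": "$cars.value"}}},
]
```

Expressing the date range with `$elemMatch` matters: without it, the two bounds could be satisfied by two different cars in the same array.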

Page 43: Unlocking Operational Intelligence from the Data Lake

Visualizing Operational Data
MongoDB Connector for BI

Visualize and explore multi-structured data using SQL-based BI platforms.

[Diagram: Your BI Platform, BI Connector]
BI Connector:
• Provides Schema
• Translates Queries
• Translates Response

Page 44: Unlocking Operational Intelligence from the Data Lake

Enterprise-Grade Security

BUSINESS NEEDS | SECURITY FEATURES
Authentication | SCRAM, LDAP*, Kerberos*, x.509 Certificates
Authorization | Built-in Roles, User-Defined Roles, Field-Level Redaction
Auditing* | Admin, DML, DDL, Role-based
Encryption | Network: SSL (with FIPS 140-2), Disk: Encrypted Storage Engine* or Partner Solutions

*Included with MongoDB Enterprise Advanced

Page 45: Unlocking Operational Intelligence from the Data Lake

Scale-Out Across Commodity Hardware & Regions

Page 46: Unlocking Operational Intelligence from the Data Lake

Management Tooling: MongoDB Ops Manager

• Monitoring & alerting
• Integration to APM platforms
• Prescriptive management with query profiling
• Automated cluster provisioning, scaling and upgrades
• Continuous, point in time backup