Unlocking Operational Intelligence from the Data Lake

Mat Keep
Director, Product & Market [email protected]
@matkeep

The World is Changing
Digital Natives & Digital Transformation

Data: Volume, Velocity, Variety
Time: Iterative, Agile, Short Cycles
Risk: Always On, Secure, Global
Cost: Open-Source, Cloud, Commodity

Creating the “Insight Economy”


Data Warehouse Challenges


The Rise of the Data Lake

“Big Data” is More than Just Hadoop

• 24% CAGR: Hadoop, Spark & Streaming
• 18% CAGR: Databases
• Databases are key components within the big data landscape

Apache Hadoop Data Lake

• Risk modeling
• Retrospective & predictive analytics
• Machine learning & pattern matching
• Customer segmentation & churn analysis
• ETL pipelines
• Active archives

NoSQL Database

“Thru 2018, 70 percent of Hadoop deployments will not meet cost savings and revenue generation objectives due to skills and integration challenges.”
Nick Heudecker, Research Director, Data Management & Integration

Source: http://www.infoworld.com/article/2980316/big-data/why-your-big-data-strategy-is-a-bust.html

How to Avoid Being in the 70%?

1. Unify data lake analytics with the operational applications
2. Create smart, contextually aware, data-driven apps & insights
3. Integrate a database layer with the data lake

MongoDB & Hadoop: What’s Common
Distributed Processing & Analytics

Common Attributes
• Schema-on-read
• Multiple replicas
• Horizontal scale
• High throughput
• Low TCO

MongoDB & Hadoop: What’s Different
Distributed Processing & Analytics

MongoDB
• Random access to subsets of data
• Millisecond latency
• Expressive querying, rich aggregations & flexible indexing
• Update fast-changing data; avoid re-writing / re-computing the entire data set

Hadoop
• Data stored as large files (64MB-128MB blocks). No indexes
• Write-once-read-many, append-only
• Designed for high throughput scans across TB/PB of data
• Multi-minute latency
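The contrast between the two storage models can be illustrated with a small, self-contained Python sketch. This is not MongoDB or HDFS code: `scan_blocks` and `IndexedStore` are hypothetical stand-ins, assumed here only to show why an append-only store without indexes has to read every block to answer a point query, while an indexed store can read and update a single record directly.

```python
from typing import Dict, Iterable, List, Optional

Record = Dict[str, object]


def scan_blocks(blocks: Iterable[List[Record]], user_id: str) -> Optional[Record]:
    """Append-only, no indexes: the latest value is found by reading every block."""
    latest = None
    for block in blocks:              # every block is read, however large
        for record in block:
            if record["user_id"] == user_id:
                latest = record       # later appends supersede earlier ones
    return latest


class IndexedStore:
    """Hypothetical stand-in for an indexed operational database."""

    def __init__(self) -> None:
        self._by_user: Dict[str, Record] = {}   # plays the role of an index

    def upsert(self, record: Record) -> None:
        # Update in place: one record changes, nothing else is rewritten.
        self._by_user[str(record["user_id"])] = record

    def find_one(self, user_id: str) -> Optional[Record]:
        # Random access to a subset of the data via the index.
        return self._by_user.get(user_id)


blocks = [
    [{"user_id": "u1", "score": 10}, {"user_id": "u2", "score": 7}],
    [{"user_id": "u1", "score": 12}],           # appended later
]
store = IndexedStore()
for block in blocks:
    for record in block:
        store.upsert(record)

print(scan_blocks(blocks, "u1"))
print(store.find_one("u1"))
```

Both calls return the same record, but the scan touches all three records while the indexed lookup touches one. That difference is what separates multi-minute from millisecond latency once the data reaches TB/PB scale.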



Bringing it Together

Online Services powered by MongoDB
• User account & personalization
• Product catalog
• Session management & shopping cart
• Recommendations

Back-end machine learning powered by Hadoop
• Customer classification & clustering
• Basket analysis
• Brand sentiment
• Price optimization

The two are linked by the MongoDB Connector for Hadoop.

Design Pattern: Operationalized Data Lake

[Diagram: Operationalized Data Lake]
Data sources: Sensors, User Data, Clickstreams, Logs
Ingest: Message Queue
Data lake: Raw Data, Processed Events, Distributed Processing Frameworks (Batch Processing, Batch Views)
Analytics outputs: Churn Analysis, Enriched Customer Profiles, Risk Modeling, Predictive Analytics
Operational apps (Real-Time Access): Customer Data Mgmt, Mobile App, IoT App, Live Dashboards

MongoDB: Millisecond latency. Expressive querying & flexible indexing against subsets of data. Updates in place. In-database aggregations & transformations
Hadoop: Multi-minute latency with scans across TB/PB of data. No indexes. Data stored in 128MB blocks. Write-once-read-many & append-only storage model

1. Configure where to land incoming data
2. Raw data processed to generate analytics models
3. MongoDB exposes analytics models to operational apps. Handles real time updates
4. Compute new models against MongoDB & HDFS
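The steps of the design pattern can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: `raw_data` stands in for files landed in HDFS, `operational_db` stands in for a MongoDB collection, and the "model" is only a per-user event count. A real deployment would use a message queue, a distributed processing framework and the database drivers instead.

```python
from collections import defaultdict
from typing import Dict, List

Event = Dict[str, object]

raw_data: List[Event] = []                 # stand-in for raw files in the data lake
operational_db: Dict[str, Event] = {}      # stand-in for an operational collection


def land(event: Event) -> None:
    """Step 1: land every event in the lake; events apps need now also go to the database."""
    raw_data.append(event)
    if event.get("realtime"):
        profile = operational_db.setdefault(str(event["user"]), {"user": event["user"]})
        profile["last_event"] = event["type"]


def build_models(events: List[Event]) -> Dict[str, int]:
    """Step 2: a batch pass over the raw data (here, a simple per-user event count)."""
    counts: Dict[str, int] = defaultdict(int)
    for event in events:
        counts[str(event["user"])] += 1
    return dict(counts)


def publish(models: Dict[str, int]) -> None:
    """Step 3: write model output back so operational apps can read it with low latency."""
    for user, count in models.items():
        operational_db.setdefault(user, {"user": user})["event_count"] = count


for incoming in [
    {"user": "alice", "type": "click", "realtime": False},
    {"user": "alice", "type": "purchase", "realtime": True},
    {"user": "bob", "type": "click", "realtime": False},
]:
    land(incoming)

publish(build_models(raw_data))
print(operational_db["alice"])
```

Step 4 corresponds to re-running `build_models` as new events arrive, reading from both stores so that the next model reflects the latest operational updates.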


Operational Database Requirements

1 “Smart” integration with the data lake

2 Powerful real-time analytics

3 Flexible, governed data model

4 Scale with the data lake

5 Sophisticated management & security


Evaluating your Options

Query and Data Model

Capability                              | MongoDB | Relational | Column Family (e.g. HBase)
Rich query language & secondary indexes | Yes     | Yes        | Requires integration with separate Spark / Hadoop cluster
In-database aggregations & search       | Yes     | Yes        | Requires integration with separate Spark / Hadoop cluster
Dynamic schema                          | Yes     | No         | Partial
Data validation                         | Yes     | Yes        | App-side code

• Why it matters
– Query & Aggregations: Rich, real time analytics against operational data
– Dynamic Schema: Manage multi-structured data
– Data Validation: Enforce data governance between data lake & operational apps

Data Lake Integration

Capability                 | MongoDB | Relational     | Column Family (e.g. HBase)
Hadoop + secondary indexes | Yes     | Yes: Expensive | No secondary indexes
Spark + secondary indexes  | Yes     | Yes: Expensive | No secondary indexes
Native BI connectivity     | Yes     | Yes            | 3rd-party connectors
Workload isolation         | Yes     | Yes: Expensive | Load data to separate Spark/Hadoop cluster

• Why it matters
– Hadoop + Spark: Efficient data movement between data lake, processing layer & database
– Native BI Connectivity: Visualizing operational data
– Workload Isolation: Separation between operational and analytical workloads

Operationalizing for Scale & Security

Capability                        | MongoDB | Relational | Column Family (e.g. HBase)
Robust security controls          | Yes     | Yes        | Yes
Scale-out on commodity hardware   | Yes     | No         | Yes
Sophisticated management platform | Yes     | Yes        | Monitoring only

• Why it matters
– Security: Data protection for regulatory compliance
– Scale-Out: Grow with the data lake
– Management: Reduce TCO with platform automation, monitoring, disaster recovery

MongoDB Nexus Architecture

Adoption & Skills Availability

Operational Data Lake in Action

UK’s Leading Price Comparison Site
Out-pacing Internet search giants with continuous delivery pipeline powered by microservices & Docker running MongoDB, Kafka and Hadoop in the cloud

Problem
• Existing EDW with nightly batch loads
• No real-time analytics to personalize user experience
• Application changes broke ETL pipeline
• Unable to scale as services expanded

Solution
• Microservices architecture running on AWS
• All application events written to Kafka queue, routed to MongoDB and Hadoop
• Events that personalize the real-time experience (e.g. triggering an email send, additional questions, offers) written to MongoDB
• All event data aggregated with other data sources and analyzed in Hadoop; updated customer profiles written back to MongoDB

Results
• 2x faster delivery of new services after migrating to new architecture
• Enabled continuous delivery: pushing new features every day
• Personalized user experience, plus higher uptime and scalability

Leading Global Airline
Customer Data Management
Single view and real-time analytics with MongoDB, Spark, & Hadoop

Problem
• Customer data scattered across 100+ different systems
• Poor customer experience: no personalization, no consistent experience across brands or devices
• No way to analyze customer behavior to deliver targeted offers

Solution
• Selected MongoDB over HBase for schema flexibility and rich query support
• MongoDB stores all customer profiles, served to web, mobile & call-center apps
• Distributed across multiple regions for DR and data locality
• All customer interactions stored in MongoDB, loaded into Hadoop for customer segmentation
• Unified processing pipeline with Spark running across MongoDB and Hadoop

Results
• Single profile created for each customer, personalizing experience in real time
• Revenue optimization by calculating best ticket prices
• Reduce competitive pressures by identifying gaps in product offerings

World’s Most Sophisticated Traveler Safety Platform
Analyzing PBs of Data with MongoDB, Hadoop, Apache NiFi & SAP HANA

Problem
• Commercialize a national security platform
• Massive volumes of multi-structured data: news, RSS & social feeds, geospatial, geological, health & crime stats
• Requires complex analysis, delivered in real time, always on

Solution
• Apache NiFi for data ingestion, routing & metadata management
• Hadoop for text analytics
• HANA for geospatial analytics
• MongoDB correlates analytics with user profiles & location data to deliver real-time alerts to corporate security teams & individual travelers

Results
• Enables Prescient to uniquely blend big data technology with its security IP developed in government
• Dynamic data model supports indexing 38k data sources, growing at 200 per day
• 24x7 continuous availability
• Scalability to PBs of data

Powering Global Threat Intelligence
Cloud-based real-time analytics with MongoDB & Hadoop

Problem
• Requirement to analyze data over many different dimensions to detect real time threat profiles
• HBase unable to query data beyond primary key lookups
• Lucene search unable to scale with growth in data

Solution
• MongoDB + Hadoop to collect and analyze data from internet sensors in real time
• MongoDB dynamic schema enables sensor data to be enriched with geospatial tags
• Auto-sharding to scale as data volumes grow

Results
• Run complex, real-time analytics on live data
• Improved query performance by over 3x
• Scale to support doubling of data volume every 24 months
• Deploy across global data centers for low latency user experience
• Engineering teams have more time to develop new features

Wrapping Up

Conclusion

1 Data lakes enable enterprises to affordably capture & analyze more data

2 Operational and analytical workloads are converging

3 MongoDB is the key technology to operationalize the data lake

MongoDB Enterprise Advanced

MongoDB Enterprise Server: Authentication, Authorization, Auditing, Encryption (In Flight & at Rest)
MongoDB Ops Manager: Monitoring & Alerting, Query Optimization, Backup & Recovery, Automation & Configuration
MongoDB Compass: Schema Visualization, Data Exploration, Ad-Hoc Queries
MongoDB Connector for BI: Visualization, Analysis, Reporting

Also included:
• 24 x 7 Support (1 hour SLA)
• Commercial License (No AGPL Copyleft Restrictions)
• Platform Certifications
• REST API
• Emergency Patches
• Customer Success Program
• On-Demand Online Training
• Warranty
• Limitation of Liability
• Indemnification

About MongoDB, Inc.

• 500+ employees
• 2,000+ customers
• 13 offices worldwide
• $311M in funding

Resources to Learn More

• Guide: Operational Data Lake
  https://www.mongodb.com/collateral/unlocking-operational-intelligence-from-the-data-lake

• Whitepaper: Real-Time Analytics with Apache Spark & MongoDB
  https://www.mongodb.com/collateral/apache-spark-and-mongodb-turning-analytics-into-real-time-action

For More Information

Resource Location

Case Studies mongodb.com/customers

Presentations mongodb.com/presentations

Free Online Training education.mongodb.com

Webinars and Events mongodb.com/events

Documentation docs.mongodb.org

MongoDB Downloads mongodb.com/download

Additional Info [email protected]

One of World’s Largest Banks
Creating new customer insights with MongoDB & Spark

Problem
• System failures in online banking systems creating customer sat issues
• No personalization experience across channels
• No enrichment of user data with social media chatter

Solution
• Apache Flume to ingest log data & social media streams, Apache Spark to process log events
• MongoDB to persist log data and KPIs, immediately rebuild user sessions when a service fails
• Integration with MongoDB query language and secondary indexes to selectively filter and query data in real time

Results
• Improved user experience, with more customers using online, self-service channels
• Improved services following deeper understanding of how users interact with systems
• Greater user insight by adding social media insights

The New Enterprise Stack

         | LEGACY                 | FUTURE STATE
APPS     | On-Premise, Monoliths  | SaaS, Microservices
DATABASE | Relational (Oracle)    | Non-Relational (MongoDB)
EDW      | Teradata, Oracle, etc. | Hadoop
COMPUTE  | Scale-Up Server        | Containers / Commodity Server / Cloud
STORAGE  | SAN                    | Local Storage & Data Lakes
NETWORK  | Routers and Switches   | Software-Defined Networks

Workload Isolation for Real-Time Analytics

[Diagram: an Operational Application and an Analytics Application connect to one replica set (MongoDB Primary, two MongoDB Secondaries). Labels: "Querying operational data"; "Real-time analytics to inform operational application".]
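One common way to get this kind of isolation is to give the analytics application a read preference that targets secondaries. This can be set with the standard `readPreference` and `readPreferenceTags` connection string options. The sketch below only builds the two connection strings. The host names, the replica set name `rs0`, the tag `use:analytics` and the helper `replica_set_uri` are all hypothetical, chosen for illustration.

```python
from typing import List, Optional
from urllib.parse import urlencode


def replica_set_uri(hosts: List[str], replica_set: str,
                    read_preference: str = "primary",
                    tags: Optional[str] = None) -> str:
    """Build a MongoDB connection string that carries a read preference."""
    options = {"replicaSet": replica_set, "readPreference": read_preference}
    if tags:
        # Tag sets are written as key:value pairs, so keep ':' and ',' unescaped.
        options["readPreferenceTags"] = tags
    return "mongodb://" + ",".join(hosts) + "/?" + urlencode(options, safe=":,")


# Hypothetical three-member replica set.
hosts = ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"]

# The operational application reads and writes on the primary.
operational_uri = replica_set_uri(hosts, "rs0")

# The analytics application reads only from secondaries tagged for analytics.
analytics_uri = replica_set_uri(hosts, "rs0", "secondary", tags="use:analytics")

print(operational_uri)
print(analytics_uri)
```

Each application would pass its own URI to the driver, so analytical queries never compete with operational traffic on the primary. Reads from secondaries can lag slightly behind the primary, which is usually acceptable for analytics.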


Handling Multi-Structured Data from the Data Lake
Flexible, Governed Data Model

{
  first_name: 'Paul',
  surname: 'Miller',
  cell: 447557505611,
  city: 'London',
  location: [45.123,47.232],
  Profession: ['banking', 'finance', 'trader'],
  cars: [
    { model: 'Bentley', year: 1973, value: 100000, … },
    { model: 'Rolls Royce', year: 1965, value: 330000, … }
  ]
}

• Typed field values: String, Number, Geo-Location
• Fields can contain arrays
• Fields can contain an array of sub-documents
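To make "flexible, governed" concrete, the sample document can be written as a Python dictionary next to a validation rule expressed with the query operators `$type` and `$exists`. The choice of required fields is an assumption made for illustration, and the `passes` helper is a hypothetical local check that understands only those two operators. In a real deployment the server enforces the rule, for example when it is supplied as the `validator` option at collection creation.

```python
customer = {
    "first_name": "Paul",
    "surname": "Miller",
    "cell": 447557505611,
    "city": "London",
    "location": [45.123, 47.232],
    "Profession": ["banking", "finance", "trader"],
    "cars": [   # fields elided on the slide are omitted here
        {"model": "Bentley", "year": 1973, "value": 100000},
        {"model": "Rolls Royce", "year": 1965, "value": 330000},
    ],
}

# Governance rule (assumed for illustration): names must be strings, city must exist.
validator = {
    "first_name": {"$type": "string"},
    "surname": {"$type": "string"},
    "city": {"$exists": True},
}


def passes(doc: dict, rules: dict) -> bool:
    """Local illustration only: understands just the two operators used above."""
    for field, rule in rules.items():
        if rule.get("$exists") and field not in doc:
            return False
        if rule.get("$type") == "string" and not isinstance(doc.get(field), str):
            return False
    return True


print(passes(customer, validator))
print(passes({"first_name": "Paul"}, validator))
```

Fields outside the rule, such as `cars` or `Profession`, stay free to vary from document to document. Only the named fields are governed, which preserves the dynamic schema.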


Expressive Query Language, Rich Secondary Indexes

Rich Queries
• Find Paul’s cars
• Find everybody in London with a car between 1970 and 1980

Geospatial
• Find all of the car owners within 5km of Trafalgar Sq.

Text Search
• Find all the cars described as having leather seats

Aggregation
• Calculate the average value of Paul’s car collection

Map Reduce
• What is the ownership pattern of colors by geography over time (is purple trending in China?)
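As a rough guide, most of these examples can be written as MongoDB query documents, shown here as Python dictionaries in the form a driver would accept. Several details are assumptions: the field names follow the sample customer document, the geospatial query assumes a 2dsphere index on `location` with [longitude, latitude] ordering, the Trafalgar Square coordinates are approximate, and the text search assumes a text index on a description field. The map-reduce example is not covered.

```python
paul = {"first_name": "Paul", "surname": "Miller"}

# Rich queries
pauls_cars = {"filter": paul, "projection": {"cars": 1}}
london_1970s = {
    "city": "London",
    "cars": {"$elemMatch": {"year": {"$gte": 1970, "$lte": 1980}}},
}

# Geospatial: owners within 5 km (distance in meters) of an approximate point
near_trafalgar = {
    "location": {
        "$nearSphere": {
            "$geometry": {"type": "Point", "coordinates": [-0.128, 51.508]},
            "$maxDistance": 5000,
        }
    }
}

# Text search: the escaped quotes request the exact phrase
leather_seats = {"$text": {"$search": "\"leather seats\""}}

# Aggregation: average value of Paul's car collection
avg_pipeline = [
    {"$match": paul},
    {"$unwind": "$cars"},
    {"$group": {"_id": None, "avg_value": {"$avg": "$cars.value"}}},
]

# The same average computed locally, to show what the pipeline returns
cars = [
    {"model": "Bentley", "year": 1973, "value": 100000},
    {"model": "Rolls Royce", "year": 1965, "value": 330000},
]
local_average = sum(car["value"] for car in cars) / len(cars)
print(local_average)
```

With a driver such as PyMongo these would be passed to `collection.find(...)` and `collection.aggregate(...)`. The `$elemMatch` operator matters in the second query: it requires a single car to satisfy both year bounds, rather than two different cars each satisfying one.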

Visualizing Operational Data
MongoDB Connector for BI

Visualize and explore multi-structured data using SQL-based BI platforms.

[Diagram: Your BI Platform connects to MongoDB through the BI Connector, which provides schema, translates queries and translates responses.]

Enterprise-Grade Security

BUSINESS NEEDS | SECURITY FEATURES
Authentication | SCRAM, LDAP*, Kerberos*, x.509 Certificates
Authorization  | Built-in Roles, User-Defined Roles, Field-Level Redaction
Auditing*      | Admin, DML, DDL, Role-based
Encryption     | Network: SSL (with FIPS 140-2), Disk: Encrypted Storage Engine* or Partner Solutions

*Included with MongoDB Enterprise Advanced

Scale-Out Across Commodity Hardware & Regions

Management Tooling: MongoDB Ops Manager

• Monitoring & alerting
• Integration to APM platforms
• Prescriptive management with query profiling
• Automated cluster provisioning, scaling and upgrades
• Continuous, point in time backup