making enterprise data available in real time with...

25
MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCH Yann Cluchey CTO @ Cogenta CTO @ GfK Online Pricing Intelligence

Upload: others

Post on 04-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

MAKING ENTERPRISE DATA

AVAILABLE IN REAL TIME

WITH ELASTICSEARCH

Yann Cluchey

CTO @ Cogenta

CTO @ GfK Online Pricing Intelligence

Page 2: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

What is Enterprise Data?

Page 3: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

What is Enterprise Data?

Page 4: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Online Pricing Intelligence

1. Gather data from 500+ of eCommerce sites

2. Organise into high quality market view

3. Competitive intelligence tools

Page 5: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Price,

Stock,

Meta

Price,

Stock,

Meta

Price,

Stock,

Meta Price,

Stock,

Meta

HTML

Custom Crawler

Parse web content

Discover product data

Tracking 20m products

Daily+

HTML

HTML

HTML

Page 6: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Database

Processing, Storage

Enrichment

Persistent Storage

Product Catalogue

+ time series data

Processing

Page 7: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Database

Thing #1 - Detection

Identify distinct products

Automated information retrieval

Lucene + custom index builder

Continuous process

(Humans for QA)

Lucene

Index

Index Builder

GUI

Matcher

Page 8: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Thing #2 - BI Tools

Web Applications

Also based on Lucene

Batch index build process

Per-customer indexes

Database

Customer

Index 1

Index Builder

BI Tools

Customer

Index 2

Customer

Index 3

Page 9: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Thing #1 - Pain

Continuously indexing

Track changes, read back out to index

Drain on performance

Latency, coping with peaks

Full rebuild for index schema change

or inconsistencies

Full rebuild doesn’t scale well…

Unnecessary work..? Lucene

Index

Index Builder

GUI

Database

Page 10: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Customer

Index 2

Thing #2 - Pain

Twice daily batch rebuild, per customer

Very slow

Moar customers?

Moar data?

Moar often?

Data set too complex,

keeps changing

Index shipping

Moar web servers?

Database

Customer

Index 1

Index Builder

BI Tools

Customer

Index 3

Indexing

Database

Batch Sync

Web Server 1 Web Server 2

Page 11: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Pain Points

As data, customers scale,

processes slow down

Adapting to change

Easy to layer on,

hard to make fundamental changes

Read vs write concerns

Database Maintenance

Index

Index Builder

Database

Page 12: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Goals

Eliminate latencies

Improve scalability

Improve availability

Something achievable

Your mileage will vary

Page 13: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

elasticsearch

Open source, distributed search engine

Based on Lucene, fully featured API

Querying, filtering, aggregation

Text processing / IR

Schema-free

Yummy

(real-time, sharding, highly available)

Silver bullets not included

Page 14: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Indexing Database

Indexing Database

Our Pipeline

Database

Crawlers Crawlers

Processors Processors

Processors Processors

Processors Processors Indexers

Indexes Indexes

Indexes

Web Servers Web Servers

Web Servers

Page 15: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Our New Pipeline

Database

Crawlers Crawlers

Processors Processors

Processors Processors

Processors Processors Indexers

Indexes Indexes

Indexes Web Servers

Web Servers Web Servers

Page 16: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Event Hooks

Messages fired OnCreate.. and OnUpdate

Payload contains everything needed for indexing

The data

Keys (still mastered in SQL)

Versioning

Sender has all the information already

Use RabbitMQ to control event message flow

Messages are durable

Page 17: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Indexing Strategy

RESTful API (HTTP, Thrift, Memcache)

Use bulk methods

They support percolation

Rivers (pull)

RabbitMQ River

JDBC River

Mongo/Couch/etc. River

Logstash

Index Q Indexer

Event Q

Page 18: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Model Your Data

What’s in your documents?

Database = Index

Table = Type ...?

Start backwards

What do your applications need?

How will they need to query the data?

Prototyping! Fail quickly!

elasticsearch supports Nested objects, parent/child docs

Page 19: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Joins

Events relate to line-items

Amazon decreased price

Pixmania is running a promotion

Need to group by Product

Use key/value store

Get full Product document

Modify it, write it back

Enqueue indexing instruction

Indexer Event Q 3 3 5

1 4

1

2

1 3

4 Index Q

Key/value

store

5

Page 20: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Where to join?

elasticsearch

Consider performance

Depends how data is structured/indexed (e.g. parent/child)

Compression, collisions

In-memory cache (e.g. Memcache)

Persistent storage (e.g. Cassandra or Mongo)

Two awesome benefits

Quickly re-index if needed

Updates have access to the full Product data

Serialisation is costly

Page 21: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Synchronisation & Concurrency

Fault tolerance

Code to expect missing data

Out of sequence events

Concurrency Control

Apply Optimistic Concurrency Control at Mongo

Optimise for collisions

Page 22: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Synchronisation & Concurrency

Synchronisation

Out of sequence index instructions

elasticsearch external versioning

Can rebuild from scratch if need to

Consistency

Which version is right?

Dates

Revision numbers from SQL

Independent updates

Page 23: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Figures

Ingestion

20m data points/day (continuously)

~ 200GB

3K msgs/second at peak

Hardware

SQL: 2 x 12-core, 64GB, 72-spindle SAN

Indexing: 4 x 4-core, 8GB

Mongo: 1 x 4-core, 16GB, 1xSSD

Elastic: 5 x 4-core, 16GB, 1xSSD

Custom-Built

Lucene

elasticsearch

Latency 3 hours < 1 second

Bottleneck Disk (SQL) CPU

Page 24: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Managing Change

Key/value

store

Index_A

Client

Indexer Event Q

Alias

Index_B

Index

Index_B Index_A

Page 25: MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCHgotocon.com/dl/goto-aar-2014/slides/YannCluchey_Cogenta... · 2014-09-30 · WITH ELASTICSEARCH Yann Cluchey CTO @

Thanks

@YannCluchey

Concurrency Patterns with MongoDB http://slidesha.re/YFOehF

Consistency without Consensus Peter Bourgon, SoundCloud http://bit.ly/1DUAO1R

Eventually Consistent Data Structures Sean Cribbs, Basho https://vimeo.com/43903960