dev nexus 2017

Big Data processing with Elasticsearch and Apache Spark

DevNexus, 2017

Who Am I•Roy Russo•VP Engineering, Predikto•Co-Author - Elasticsearch in

Action•ElasticHQ.org

2

http://ElasticHQ.org/

Why Am I Here?•What is Search•Elasticsearch 101 in 10 minutes•Elasticsearch and Big Data processing-The Predikto process -Elasticsearch at scale

3

Search is about filtering

information and

determining relevance.

4

How does a Search Engine Work?

5

Select * FROM make WHERE name LIKE ‘%Tesla%’

Search Engines use Magic

6

Where Magic == Inverted Index

It’s FM!

Inverted Index1.Take some documents2.Tokenize them3.Find unique tokens4.Map tokens to documents

7

apple oranges peachDocument 1Document 2Document 3Document 4Document 5Document 6

Inverted Index

8

apples oranges peachDocument 1Document 2Document 3Document 4Document 5Document 6

Search for “apple peach”

What is Elasticsearch?

9

Elasticsearch is…•Search and Analytics engine-Yes, analytics too.

•Document Store-Every field is indexed/searchable

•Distributed- Fault-tolerant

10

What Elasticsearch is not•Key-Value Store-Redis, Riak

•Column Family Store-C*, HBase

•Graph Database-Neo4J

•Relational Database

11

ElasticSearch in a Nutshell•Based on Apache Lucene•Distributed•Document-Oriented•Schema "free"•HTTP + JSON•(Near) Real-time search•Ecosystem-Hosting, Monitoring Apps, Clients (SDK)

12

Where can I get it?•Free and Open Source•https://www.elastic.co/•https://github.com/elastic/elasticsearch•Backed by a Company, Elastic-Training-Support-Auth/AuthZ-Kibana for UI

13

https://www.elastic.co/

https://github.com/elastic/elasticsearch

How do I run it?•Download it-https://www.elastic.co/downloads

•bin/elasticsearch•http://localhost:9200

14

{status: 200,name: "Tesla",cluster_name: "elasticsearch_royrusso",version: {

number: "1.4.2",build_hash: "927caff6f05403e936c20bf4529f144f0c89fd8c",build_timestamp: "2014-12-16T14:11:12Z",build_snapshot: false,lucene_version: "4.10.2"

},tagline: "You Know, for Search"

}

https://www.elastic.co/downloads

http://localhost:9200/

Elasticsearch requires Java. *

15

* You have 5 seconds to whine about it and then shutup.

ElasticSearch for Centralized Logs

•Logstash + ElasticSearch + Kibana (ELK)

16

Based on Apache Lucene•Free and Open Source•Started in 1999•What’s it do?

17

Elasticsearch is a Document Store

18

Document Store•Like MongoDB and CouchDB… but more

like Solr•Document DB

19

Modeled in JSON

20

{ "genre": "Crime", "language": "English", "country": "USA", "runtime": 170, "title": "Scarface", "year": 1983}

{ "_index": "imdb", "_type": "movie", "_id": "u17o8zy9RcKg6SjQZqQ4Ow", "_version": 1, "_source": { "genre": "Crime", "language": "English", "country": "USA", "runtime": 170, "title": "Scarface", "year": 1983 }}

Schema-Free•Dynamic Mapping-Elasticsearch "guesses" the data-types (string,

int, float…)

21

"imdb": { "movie": { "properties": { "country": { "type": "string“, “store”:true, “index”:false }, "genre": { "type": "string“, "null_value" : "na“, “store”:false, “index:true }, "year": { "type": "long" } } }}

Terminology

•Cluster: 1..N Nodes w/ same Cluster Name•Node: One ElasticSearch instance (1 java

proc)•Shard = One Lucene instance-0 or more replicas 2

2

MySQL ElasticsearchDatabase Index

Table TypeRow Document

Column FieldSchema Mapping

Index (Everything is indexed)SQL Query DSL

REST•HTTP Verbs: GET, POST, PUT, DELETE•JSON •_cat API

23

% curl '192.168.56.10:9200/_cat/health?v&ts=0'cluster status nodeTotal nodeData shards pri relo init unassignfoo green 3 3 3 3 0 0 0

Random notes about Elasticsearch•I recommend public trainings for

beginners; developers and devops… •The QDSL is NOT intuitive.•Official support is expensive. •Release roadmap on github is hit-or-miss

in bizarro-land.•Beware client SDKs. Some of the

maintainers can be finicky.•The technology is great, but the space is

noisy.

24

Elasticsearch is Distributed

25

High Availability•No need for load balancer•Different Node Types•Indices are partitioned in to Shards•Replica shards on different Nodes•Automatic Master election & failover

26

Cluster Topology

27

A1 A2B2 B2 B1

B3

B1 A1 A2

B3

4 Node Cluster

Index A: 2 Shards & 1 ReplicaIndex B: 3 Shards & 1 Replica

Discovery•Nodes discover each other using multicast.-Unicast is an option

•Each cluster has an elected master node-Beware of split-brain

28

discovery.zen.ping.multicast.enabled: falsediscovery.zen.ping.unicast.hosts: ["host1", "host2:port", "host3"]

Nodes• Master node handles cluster-wide (Meta-API) events- Node participation- New indices create/delete- Re-Allocation of shards

• Data Nodes- Indexing / Searching operations

• Tribe Nodes- Cross-cluster search

• Client Nodes- REST calls- Light-weight load balancers

• Ingest Nodes- Pipelines/Processors: Date, Convert, Rename, Split, Sort, UC, LC,

Foreach29

The Basics - Shards•Primary Shard:- First time Indexing- Index has 1..N primary shards (default: 5)-# Not changeable once index created

•Replica Shard:-Copy of the primary shard-# Can be changed later-Each primary has 0..N replicas- HA:• Promoted to primary if primary fails• Get/Search handled by primary||replica3

0

Shard Auto-Allocation

•Shard Phases:-Unassigned- Initializing-Started-Relocating

31

Node 1

0P

1R

Node 2

1P

0R

Node 3

0R

Add a Node: Shards Relocate

New in 2.x and 5.x•2.x•Pipeline Aggregations•Say goodbye to Filters

•5.x• Ingest Node•Painless scripting•Data structures:• Not_analyzed -> text

Elasticsearch in the Real World*

33

* Real World == My World

From Big Data to Predictions

What we do1. Ingest Data : Sensors, maintenance,

weather, faults2. Generate Features3. Train Models4. Issue Predictions5. Visualize

How it Works

How it Works

36

MAX

Raw datain TB

Auto dynamic engineered

features

Auto dynamic feature

selection

Method and algorithmselection

Dynamic self learning

How it REALLY Works

37

•Customer hands us X years of data•It’s a horrible mess•ETL to Elasticsearch via Spark•Feature generation via Spark•Train Models•Predictions Issued

85%

10%

5%

Onboarding

ETL Feature Generation Predictive Models

What is feature generation?

38

•We want to predict engine overheating:•Feature 1: Max temperature of this

engine per week•Feature 2: Difference in temperature of

one reading compared to the mean of all readings across the fleet.

•Feature 3: For the month of July, how does mean engine temperature compare to all other engines for all July’s on record.

•…•Feature N

The problem with 10 ms

39

•10ms response times are fast, unless:•10ms X 100 million queries:•16,667 minutes•277.8 hours•11.57 days

•More queries, slower responses•Downward spiral and cascade issues at

requester

Inconvenient Truths in Data Science

40

•Everyone is a data scientist or Uber driver today; or both.

•Space is noisy. Lots of consulting-ware.•Customer data is a disaster.•Watson can’t even predict IBM layoffs.

Elasticsearch at Predikto

41

•Write From Spark to Elasticsearch•Query from Spark to Elasticsearch•Visualize

How we use Apache Spark

42

•It’s all DAGs•No new code-per-customer

•ETL:•DAG-driven•Configuration triggers functions.

•Feature Generation:•DAG-driven•Some magic

DAG-driven ETL

43

Architecture Considerations•Fast read•Mostly time-series data

•Fast write•Dynamic querying•Apache Spark integration•Easily deployed on AWS •Commodity hardware•HA•Schema free•GeoJSON

Contenders•Cassandra •Datastax Enterprise•Relational (PSQL, MySQL)•Mongo•Solr•Hadoop

Why Elasticsearch?•Fast read•Fast write (?)•Dynamic querying•Apache Spark integration via Hadoop

connector•Easy EC2 setup/configuration•HA•Schema free

Heavy use in Machine Learning

47

•ETL process; FAST WRITES•Data is denormalized•Fields typed / standardized•Parallelized via Spark

•Feature generation; FAST READS•Billions of queries during process•Parallelized via Spark

•User Interface / API; FAST READS / DYNAMIC QUERYING•Query anything, any way

Current Infrastructure•NGINX as reverse proxy•Why?

•Elasticsearch nodes are EC2 r4.2xl w/ 60GB RAM•EBS mount• SSD. Never use spinning media!• ~500GB per volume

•VPC•Cluster-per-customer

Node Configuration•A base yml config:•bootstrap.mlockall: true•discovery.zen.ping.multicast.enabled:

false• index.search.slowlog.threshold.query.de

bug: 1s•plugin.mandatory: cloud-aws•discovery.type: ec2

•ES_HEAP_SIZE=30g•MAX_OPEN_FILES=65535•MAX_LOCKED_MEMORY=unlimited

Deploying on AWS•Highly recommend the Cloud-AWS plugin

for discovery. https://www.elastic.co/guide/en/elasticsearch/plugins/current/discovery-ec2.html

1. Nodes should be configured with the same cluster name AND security group.

2. All EC2 instances must have the same IAM Role at creation!

3. Security group should be open on 9200-9300.

4. Discovery will happen on boot.

https://www.elastic.co/guide/en/elasticsearch/plugins/current/discovery-ec2.html



More about Elasticsearch on AWS

•Volumes:•Use EBS for smaller clusters (3-5)ish

nodes.• Instance store for larger clusters• ephemeral

•Amazon Linux AMIs•EC2 performance enhancements may

help here.•Small instance types get network

throttled.•Tribe nodes for cross-region synch•I don’t recommend the AWS Elasticsearch

service

Tending to Elasticsearch•Curator: https://github.com/elastic/curator•Actions:•Create INDEX•Alias•Snapshot

•Backups on AWS:•For 5.x, install the Snapshot plugin.•Snapshots to S3• Requires key/secret

https://github.com/elastic/curator

https://github.com/elastic/curator

Handling Fast Writes•Consider collectors for throttling writes•Always use the bulk API•Disable replicas•Index.refresh_interval : -1•Tuning for concurrent writing is an art.• Indexing buffer sizes•Translog flush size

Handling Fast Reads•Fast hardware•Avoid joins (nested & parent-child)•Avoid scripting•Pre 5.x: Use filters•Cache

•indices.fielddata.cache.size•Atomic queries

Maintaining Big Data•Big data grows•Archive/close unused data•Partition data:• Index by time, Index by time AND

device• Aliases• Templates / Curator

•Get your mapping right.•Re-indexing big data is not trivial• Elasticsearch 5.x reindexer

Questions?

56

dev nexus 2017

Software