dev nexus 2017
TRANSCRIPT
![Page 1: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/1.jpg)
Big Data processing with Elasticsearch and Apache Spark
DevNexus, 2017
![Page 2: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/2.jpg)
Who Am I•Roy Russo•VP Engineering, Predikto•Co-Author - Elasticsearch in
Action•ElasticHQ.org
2
![Page 3: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/3.jpg)
Why Am I Here?•What is Search•Elasticsearch 101 in 10 minutes•Elasticsearch and Big Data processing-The Predikto process -Elasticsearch at scale
3
![Page 4: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/4.jpg)
Search is about filtering
information and
determining relevance.
4
![Page 5: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/5.jpg)
How does a Search Engine Work?
5
Select * FROM make WHERE name LIKE ‘%Tesla%’
![Page 6: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/6.jpg)
Search Engines use Magic
6
Where Magic == Inverted Index
It’s FM!
![Page 7: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/7.jpg)
Inverted Index1.Take some documents2.Tokenize them3.Find unique tokens4.Map tokens to documents
7
apple oranges peachDocument 1Document 2Document 3Document 4Document 5Document 6
![Page 8: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/8.jpg)
Inverted Index
8
apples oranges peachDocument 1Document 2Document 3Document 4Document 5Document 6
Search for “apple peach”
![Page 9: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/9.jpg)
What is Elasticsearch?
9
![Page 10: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/10.jpg)
Elasticsearch is…•Search and Analytics engine-Yes, analytics too.
•Document Store-Every field is indexed/searchable
•Distributed- Fault-tolerant
10
![Page 11: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/11.jpg)
What Elasticsearch is not•Key-Value Store-Redis, Riak
•Column Family Store-C*, HBase
•Graph Database-Neo4J
•Relational Database
11
![Page 12: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/12.jpg)
ElasticSearch in a Nutshell•Based on Apache Lucene•Distributed•Document-Oriented•Schema "free"•HTTP + JSON•(Near) Real-time search•Ecosystem-Hosting, Monitoring Apps, Clients (SDK)
12
![Page 13: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/13.jpg)
Where can I get it?•Free and Open Source•https://www.elastic.co/•https://github.com/elastic/elasticsearch•Backed by a Company, Elastic-Training-Support-Auth/AuthZ-Kibana for UI
13
![Page 14: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/14.jpg)
How do I run it?•Download it-https://www.elastic.co/downloads
•bin/elasticsearch•http://localhost:9200
14
{status: 200,name: "Tesla",cluster_name: "elasticsearch_royrusso",version: {
number: "1.4.2",build_hash: "927caff6f05403e936c20bf4529f144f0c89fd8c",build_timestamp: "2014-12-16T14:11:12Z",build_snapshot: false,lucene_version: "4.10.2"
},tagline: "You Know, for Search"
}
![Page 15: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/15.jpg)
Elasticsearch requires Java. *
15
* You have 5 seconds to whine about it and then shutup.
![Page 16: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/16.jpg)
ElasticSearch for Centralized Logs
•Logstash + ElasticSearch + Kibana (ELK)
16
![Page 17: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/17.jpg)
Based on Apache Lucene•Free and Open Source•Started in 1999•What’s it do?
17
![Page 18: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/18.jpg)
Elasticsearch is a Document Store
18
![Page 19: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/19.jpg)
Document Store•Like MongoDB and CouchDB… but more
like Solr•Document DB
19
![Page 20: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/20.jpg)
Modeled in JSON
20
{ "genre": "Crime", "language": "English", "country": "USA", "runtime": 170, "title": "Scarface", "year": 1983}
{ "_index": "imdb", "_type": "movie", "_id": "u17o8zy9RcKg6SjQZqQ4Ow", "_version": 1, "_source": { "genre": "Crime", "language": "English", "country": "USA", "runtime": 170, "title": "Scarface", "year": 1983 }}
![Page 21: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/21.jpg)
Schema-Free•Dynamic Mapping-Elasticsearch "guesses" the data-types (string,
int, float…)
21
"imdb": { "movie": { "properties": { "country": { "type": "string“, “store”:true, “index”:false }, "genre": { "type": "string“, "null_value" : "na“, “store”:false, “index:true }, "year": { "type": "long" } } }}
![Page 22: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/22.jpg)
Terminology
•Cluster: 1..N Nodes w/ same Cluster Name•Node: One ElasticSearch instance (1 java
proc)•Shard = One Lucene instance-0 or more replicas 2
2
MySQL ElasticsearchDatabase Index
Table TypeRow Document
Column FieldSchema Mapping
Index (Everything is indexed)SQL Query DSL
![Page 23: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/23.jpg)
REST•HTTP Verbs: GET, POST, PUT, DELETE•JSON •_cat API
23
% curl '192.168.56.10:9200/_cat/health?v&ts=0'cluster status nodeTotal nodeData shards pri relo init unassignfoo green 3 3 3 3 0 0 0
![Page 24: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/24.jpg)
Random notes about Elasticsearch•I recommend public trainings for
beginners; developers and devops… •The QDSL is NOT intuitive.•Official support is expensive. •Release roadmap on github is hit-or-miss
in bizarro-land.•Beware client SDKs. Some of the
maintainers can be finicky.•The technology is great, but the space is
noisy.
24
![Page 25: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/25.jpg)
Elasticsearch is Distributed
25
![Page 26: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/26.jpg)
High Availability•No need for load balancer•Different Node Types•Indices are partitioned in to Shards•Replica shards on different Nodes•Automatic Master election & failover
26
![Page 27: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/27.jpg)
Cluster Topology
27
A1 A2B2 B2 B1
B3
B1 A1 A2
B3
4 Node Cluster
Index A: 2 Shards & 1 ReplicaIndex B: 3 Shards & 1 Replica
![Page 28: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/28.jpg)
Discovery•Nodes discover each other using multicast.-Unicast is an option
•Each cluster has an elected master node-Beware of split-brain
28
discovery.zen.ping.multicast.enabled: falsediscovery.zen.ping.unicast.hosts: ["host1", "host2:port", "host3"]
![Page 29: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/29.jpg)
Nodes• Master node handles cluster-wide (Meta-API) events- Node participation- New indices create/delete- Re-Allocation of shards
• Data Nodes- Indexing / Searching operations
• Tribe Nodes- Cross-cluster search
• Client Nodes- REST calls- Light-weight load balancers
• Ingest Nodes- Pipelines/Processors: Date, Convert, Rename, Split, Sort, UC, LC,
Foreach29
![Page 30: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/30.jpg)
The Basics - Shards•Primary Shard:- First time Indexing- Index has 1..N primary shards (default: 5)-# Not changeable once index created
•Replica Shard:-Copy of the primary shard-# Can be changed later-Each primary has 0..N replicas- HA:• Promoted to primary if primary fails• Get/Search handled by primary||replica3
0
![Page 31: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/31.jpg)
Shard Auto-Allocation
•Shard Phases:-Unassigned- Initializing-Started-Relocating
31
Node 1
0P
1R
Node 2
1P
0R
Node 3
0R
Add a Node: Shards Relocate
![Page 32: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/32.jpg)
New in 2.x and 5.x•2.x•Pipeline Aggregations•Say goodbye to Filters
•5.x• Ingest Node•Painless scripting•Data structures:• Not_analyzed -> text
![Page 33: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/33.jpg)
Elasticsearch in the Real World*
33
* Real World == My World
From Big Data to Predictions
![Page 34: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/34.jpg)
What we do1. Ingest Data : Sensors, maintenance,
weather, faults2. Generate Features3. Train Models4. Issue Predictions5. Visualize
![Page 35: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/35.jpg)
How it Works
![Page 36: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/36.jpg)
How it Works
36
MAX
Raw datain TB
Auto dynamic engineered
features
Auto dynamic feature
selection
Method and algorithmselection
Dynamic self learning
![Page 37: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/37.jpg)
How it REALLY Works
37
•Customer hands us X years of data•It’s a horrible mess•ETL to Elasticsearch via Spark•Feature generation via Spark•Train Models•Predictions Issued
85%
10%
5%
Onboarding
ETL Feature Generation Predictive Models
![Page 38: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/38.jpg)
What is feature generation?
38
•We want to predict engine overheating:•Feature 1: Max temperature of this
engine per week•Feature 2: Difference in temperature of
one reading compared to the mean of all readings across the fleet.
•Feature 3: For the month of July, how does mean engine temperature compare to all other engines for all July’s on record.
•…•Feature N
![Page 39: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/39.jpg)
The problem with 10 ms
39
•10ms response times are fast, unless:•10ms X 100 million queries:•16,667 minutes•277.8 hours•11.57 days
•More queries, slower responses•Downward spiral and cascade issues at
requester
![Page 40: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/40.jpg)
Inconvenient Truths in Data Science
40
•Everyone is a data scientist or Uber driver today; or both.
•Space is noisy. Lots of consulting-ware.•Customer data is a disaster.•Watson can’t even predict IBM layoffs.
![Page 41: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/41.jpg)
Elasticsearch at Predikto
41
•Write From Spark to Elasticsearch•Query from Spark to Elasticsearch•Visualize
![Page 42: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/42.jpg)
How we use Apache Spark
42
•It’s all DAGs•No new code-per-customer
•ETL:•DAG-driven•Configuration triggers functions.
•Feature Generation:•DAG-driven•Some magic
![Page 43: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/43.jpg)
DAG-driven ETL
43
![Page 44: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/44.jpg)
Architecture Considerations•Fast read•Mostly time-series data
•Fast write•Dynamic querying•Apache Spark integration•Easily deployed on AWS •Commodity hardware•HA•Schema free•GeoJSON
![Page 45: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/45.jpg)
Contenders•Cassandra •Datastax Enterprise•Relational (PSQL, MySQL)•Mongo•Solr•Hadoop
![Page 46: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/46.jpg)
Why Elasticsearch?•Fast read•Fast write (?)•Dynamic querying•Apache Spark integration via Hadoop
connector•Easy EC2 setup/configuration•HA•Schema free
![Page 47: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/47.jpg)
Heavy use in Machine Learning
47
•ETL process; FAST WRITES•Data is denormalized•Fields typed / standardized•Parallelized via Spark
•Feature generation; FAST READS•Billions of queries during process•Parallelized via Spark
•User Interface / API; FAST READS / DYNAMIC QUERYING•Query anything, any way
![Page 48: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/48.jpg)
Current Infrastructure•NGINX as reverse proxy•Why?
•Elasticsearch nodes are EC2 r4.2xl w/ 60GB RAM•EBS mount• SSD. Never use spinning media!• ~500GB per volume
•VPC•Cluster-per-customer
![Page 49: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/49.jpg)
Node Configuration•A base yml config:•bootstrap.mlockall: true•discovery.zen.ping.multicast.enabled:
false• index.search.slowlog.threshold.query.de
bug: 1s•plugin.mandatory: cloud-aws•discovery.type: ec2
•ES_HEAP_SIZE=30g•MAX_OPEN_FILES=65535•MAX_LOCKED_MEMORY=unlimited
![Page 50: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/50.jpg)
Deploying on AWS•Highly recommend the Cloud-AWS plugin
for discovery. https://www.elastic.co/guide/en/elasticsearch/plugins/current/discovery-ec2.html
1. Nodes should be configured with the same cluster name AND security group.
2. All EC2 instances must have the same IAM Role at creation!
3. Security group should be open on 9200-9300.
4. Discovery will happen on boot.
![Page 51: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/51.jpg)
More about Elasticsearch on AWS
•Volumes:•Use EBS for smaller clusters (3-5)ish
nodes.• Instance store for larger clusters• ephemeral
•Amazon Linux AMIs•EC2 performance enhancements may
help here.•Small instance types get network
throttled.•Tribe nodes for cross-region synch•I don’t recommend the AWS Elasticsearch
service
![Page 52: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/52.jpg)
Tending to Elasticsearch•Curator: https://github.com/elastic/curator•Actions:•Create INDEX•Alias•Snapshot
•Backups on AWS:•For 5.x, install the Snapshot plugin.•Snapshots to S3• Requires key/secret
![Page 53: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/53.jpg)
Handling Fast Writes•Consider collectors for throttling writes•Always use the bulk API•Disable replicas•Index.refresh_interval : -1•Tuning for concurrent writing is an art.• Indexing buffer sizes•Translog flush size
![Page 54: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/54.jpg)
Handling Fast Reads•Fast hardware•Avoid joins (nested & parent-child)•Avoid scripting•Pre 5.x: Use filters•Cache
•indices.fielddata.cache.size•Atomic queries
![Page 55: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/55.jpg)
Maintaining Big Data•Big data grows•Archive/close unused data•Partition data:• Index by time, Index by time AND
device• Aliases• Templates / Curator
•Get your mapping right.•Re-indexing big data is not trivial• Elasticsearch 5.x reindexer
![Page 56: Dev nexus 2017](https://reader035.vdocuments.us/reader035/viewer/2022070600/58e498a81a28aba3458b461f/html5/thumbnails/56.jpg)
Questions?
56