Download - Dev nexus 2017
Big Data processing with Elasticsearch and Apache Spark
DevNexus, 2017
Who Am I•Roy Russo•VP Engineering, Predikto•Co-Author - Elasticsearch in
Action•ElasticHQ.org
2
Why Am I Here?•What is Search•Elasticsearch 101 in 10 minutes•Elasticsearch and Big Data processing-The Predikto process -Elasticsearch at scale
3
Search is about filtering
information and
determining relevance.
4
How does a Search Engine Work?
5
Select * FROM make WHERE name LIKE ‘%Tesla%’
Search Engines use Magic
6
Where Magic == Inverted Index
It’s FM!
Inverted Index1.Take some documents2.Tokenize them3.Find unique tokens4.Map tokens to documents
7
apple oranges peachDocument 1Document 2Document 3Document 4Document 5Document 6
Inverted Index
8
apples oranges peachDocument 1Document 2Document 3Document 4Document 5Document 6
Search for “apple peach”
What is Elasticsearch?
9
Elasticsearch is…•Search and Analytics engine-Yes, analytics too.
•Document Store-Every field is indexed/searchable
•Distributed- Fault-tolerant
10
What Elasticsearch is not•Key-Value Store-Redis, Riak
•Column Family Store-C*, HBase
•Graph Database-Neo4J
•Relational Database
11
ElasticSearch in a Nutshell•Based on Apache Lucene•Distributed•Document-Oriented•Schema "free"•HTTP + JSON•(Near) Real-time search•Ecosystem-Hosting, Monitoring Apps, Clients (SDK)
12
Where can I get it?•Free and Open Source•https://www.elastic.co/•https://github.com/elastic/elasticsearch•Backed by a Company, Elastic-Training-Support-Auth/AuthZ-Kibana for UI
13
How do I run it?•Download it-https://www.elastic.co/downloads
•bin/elasticsearch•http://localhost:9200
14
{status: 200,name: "Tesla",cluster_name: "elasticsearch_royrusso",version: {
number: "1.4.2",build_hash: "927caff6f05403e936c20bf4529f144f0c89fd8c",build_timestamp: "2014-12-16T14:11:12Z",build_snapshot: false,lucene_version: "4.10.2"
},tagline: "You Know, for Search"
}
Elasticsearch requires Java. *
15
* You have 5 seconds to whine about it and then shutup.
ElasticSearch for Centralized Logs
•Logstash + ElasticSearch + Kibana (ELK)
16
Based on Apache Lucene•Free and Open Source•Started in 1999•What’s it do?
17
Elasticsearch is a Document Store
18
Document Store•Like MongoDB and CouchDB… but more
like Solr•Document DB
19
Modeled in JSON
20
{ "genre": "Crime", "language": "English", "country": "USA", "runtime": 170, "title": "Scarface", "year": 1983}
{ "_index": "imdb", "_type": "movie", "_id": "u17o8zy9RcKg6SjQZqQ4Ow", "_version": 1, "_source": { "genre": "Crime", "language": "English", "country": "USA", "runtime": 170, "title": "Scarface", "year": 1983 }}
Schema-Free•Dynamic Mapping-Elasticsearch "guesses" the data-types (string,
int, float…)
21
"imdb": { "movie": { "properties": { "country": { "type": "string“, “store”:true, “index”:false }, "genre": { "type": "string“, "null_value" : "na“, “store”:false, “index:true }, "year": { "type": "long" } } }}
Terminology
•Cluster: 1..N Nodes w/ same Cluster Name•Node: One ElasticSearch instance (1 java
proc)•Shard = One Lucene instance-0 or more replicas 2
2
MySQL ElasticsearchDatabase Index
Table TypeRow Document
Column FieldSchema Mapping
Index (Everything is indexed)SQL Query DSL
REST•HTTP Verbs: GET, POST, PUT, DELETE•JSON •_cat API
23
% curl '192.168.56.10:9200/_cat/health?v&ts=0'cluster status nodeTotal nodeData shards pri relo init unassignfoo green 3 3 3 3 0 0 0
Random notes about Elasticsearch•I recommend public trainings for
beginners; developers and devops… •The QDSL is NOT intuitive.•Official support is expensive. •Release roadmap on github is hit-or-miss
in bizarro-land.•Beware client SDKs. Some of the
maintainers can be finicky.•The technology is great, but the space is
noisy.
24
Elasticsearch is Distributed
25
High Availability•No need for load balancer•Different Node Types•Indices are partitioned in to Shards•Replica shards on different Nodes•Automatic Master election & failover
26
Cluster Topology
27
A1 A2B2 B2 B1
B3
B1 A1 A2
B3
4 Node Cluster
Index A: 2 Shards & 1 ReplicaIndex B: 3 Shards & 1 Replica
Discovery•Nodes discover each other using multicast.-Unicast is an option
•Each cluster has an elected master node-Beware of split-brain
28
discovery.zen.ping.multicast.enabled: falsediscovery.zen.ping.unicast.hosts: ["host1", "host2:port", "host3"]
Nodes• Master node handles cluster-wide (Meta-API) events- Node participation- New indices create/delete- Re-Allocation of shards
• Data Nodes- Indexing / Searching operations
• Tribe Nodes- Cross-cluster search
• Client Nodes- REST calls- Light-weight load balancers
• Ingest Nodes- Pipelines/Processors: Date, Convert, Rename, Split, Sort, UC, LC,
Foreach29
The Basics - Shards•Primary Shard:- First time Indexing- Index has 1..N primary shards (default: 5)-# Not changeable once index created
•Replica Shard:-Copy of the primary shard-# Can be changed later-Each primary has 0..N replicas- HA:• Promoted to primary if primary fails• Get/Search handled by primary||replica3
0
Shard Auto-Allocation
•Shard Phases:-Unassigned- Initializing-Started-Relocating
31
Node 1
0P
1R
Node 2
1P
0R
Node 3
0R
Add a Node: Shards Relocate
New in 2.x and 5.x•2.x•Pipeline Aggregations•Say goodbye to Filters
•5.x• Ingest Node•Painless scripting•Data structures:• Not_analyzed -> text
Elasticsearch in the Real World*
33
* Real World == My World
From Big Data to Predictions
What we do1. Ingest Data : Sensors, maintenance,
weather, faults2. Generate Features3. Train Models4. Issue Predictions5. Visualize
How it Works
How it Works
36
MAX
Raw datain TB
Auto dynamic engineered
features
Auto dynamic feature
selection
Method and algorithmselection
Dynamic self learning
How it REALLY Works
37
•Customer hands us X years of data•It’s a horrible mess•ETL to Elasticsearch via Spark•Feature generation via Spark•Train Models•Predictions Issued
85%
10%
5%
Onboarding
ETL Feature Generation Predictive Models
What is feature generation?
38
•We want to predict engine overheating:•Feature 1: Max temperature of this
engine per week•Feature 2: Difference in temperature of
one reading compared to the mean of all readings across the fleet.
•Feature 3: For the month of July, how does mean engine temperature compare to all other engines for all July’s on record.
•…•Feature N
The problem with 10 ms
39
•10ms response times are fast, unless:•10ms X 100 million queries:•16,667 minutes•277.8 hours•11.57 days
•More queries, slower responses•Downward spiral and cascade issues at
requester
Inconvenient Truths in Data Science
40
•Everyone is a data scientist or Uber driver today; or both.
•Space is noisy. Lots of consulting-ware.•Customer data is a disaster.•Watson can’t even predict IBM layoffs.
Elasticsearch at Predikto
41
•Write From Spark to Elasticsearch•Query from Spark to Elasticsearch•Visualize
How we use Apache Spark
42
•It’s all DAGs•No new code-per-customer
•ETL:•DAG-driven•Configuration triggers functions.
•Feature Generation:•DAG-driven•Some magic
DAG-driven ETL
43
Architecture Considerations•Fast read•Mostly time-series data
•Fast write•Dynamic querying•Apache Spark integration•Easily deployed on AWS •Commodity hardware•HA•Schema free•GeoJSON
Contenders•Cassandra •Datastax Enterprise•Relational (PSQL, MySQL)•Mongo•Solr•Hadoop
Why Elasticsearch?•Fast read•Fast write (?)•Dynamic querying•Apache Spark integration via Hadoop
connector•Easy EC2 setup/configuration•HA•Schema free
Heavy use in Machine Learning
47
•ETL process; FAST WRITES•Data is denormalized•Fields typed / standardized•Parallelized via Spark
•Feature generation; FAST READS•Billions of queries during process•Parallelized via Spark
•User Interface / API; FAST READS / DYNAMIC QUERYING•Query anything, any way
Current Infrastructure•NGINX as reverse proxy•Why?
•Elasticsearch nodes are EC2 r4.2xl w/ 60GB RAM•EBS mount• SSD. Never use spinning media!• ~500GB per volume
•VPC•Cluster-per-customer
Node Configuration•A base yml config:•bootstrap.mlockall: true•discovery.zen.ping.multicast.enabled:
false• index.search.slowlog.threshold.query.de
bug: 1s•plugin.mandatory: cloud-aws•discovery.type: ec2
•ES_HEAP_SIZE=30g•MAX_OPEN_FILES=65535•MAX_LOCKED_MEMORY=unlimited
Deploying on AWS•Highly recommend the Cloud-AWS plugin
for discovery. https://www.elastic.co/guide/en/elasticsearch/plugins/current/discovery-ec2.html
1. Nodes should be configured with the same cluster name AND security group.
2. All EC2 instances must have the same IAM Role at creation!
3. Security group should be open on 9200-9300.
4. Discovery will happen on boot.
More about Elasticsearch on AWS
•Volumes:•Use EBS for smaller clusters (3-5)ish
nodes.• Instance store for larger clusters• ephemeral
•Amazon Linux AMIs•EC2 performance enhancements may
help here.•Small instance types get network
throttled.•Tribe nodes for cross-region synch•I don’t recommend the AWS Elasticsearch
service
Tending to Elasticsearch•Curator: https://github.com/elastic/curator•Actions:•Create INDEX•Alias•Snapshot
•Backups on AWS:•For 5.x, install the Snapshot plugin.•Snapshots to S3• Requires key/secret
Handling Fast Writes•Consider collectors for throttling writes•Always use the bulk API•Disable replicas•Index.refresh_interval : -1•Tuning for concurrent writing is an art.• Indexing buffer sizes•Translog flush size
Handling Fast Reads•Fast hardware•Avoid joins (nested & parent-child)•Avoid scripting•Pre 5.x: Use filters•Cache
•indices.fielddata.cache.size•Atomic queries
Maintaining Big Data•Big data grows•Archive/close unused data•Partition data:• Index by time, Index by time AND
device• Aliases• Templates / Curator
•Get your mapping right.•Re-indexing big data is not trivial• Elasticsearch 5.x reindexer
Questions?
56