using apache spark for generating elasticsearch indices offline · 2017-12-14 · elasticsearch...
TRANSCRIPT
![Page 1: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/1.jpg)
Using Apache Spark for generating ElasticSearch indices offline
Andrej BabolčaiESET Database systems engineer
Apache: Big Data Europe 2016
![Page 2: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/2.jpg)
Who am I
• Software engineer in database systems team
• Responsible for collecting, moving and providing access to data
![Page 3: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/3.jpg)
Context
Apache Kafka
Apache Hive/Impala
ElasticSearch
![Page 4: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/4.jpg)
Agenda
• Approaches we tried and why they failed
• Solution used, Spark + ES
• Benchmark, summary and possible improvements
![Page 5: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/5.jpg)
Agenda
• Approaches we tried and why they failed
• Solution used, Spark + ES
• Benchmark, summary and possible improvements
![Page 6: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/6.jpg)
Indexing data to live cluster
• Failed because of
• Slowed search and near real-time (NRT) import
• Reduce ingestion speed - too slow
![Page 7: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/7.jpg)
Spark job with Lucene library
• Approach
• Generate indices with Lucene and “import” them to ES
• Indexing with Lucene is fast, hundreds of GB/hour
• Failed because of
• ES types in Lucene
• ES translog and checksum
![Page 8: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/8.jpg)
Agenda
• Approaches tried and why they failed
• Solution used, Spark + ES
• Benchmark, summary and possible improvements
![Page 9: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/9.jpg)
Goal
• Offload ES cluster and generate indices on Spark cluster
• We want indices to be “ready to use”
• When appropriate copy them to ES
![Page 10: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/10.jpg)
Spark + local ES
• Based on https://github.com/MyPureCloud/elasticsearch-
lambda
• Similar approach to Cloudera Solr MapReduceIndexerTool
![Page 11: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/11.jpg)
How do we generate indices offline
S
p
a
r
k
0 partition
1 partition
n partition
…
Start ESn shards/1 with data
Start ES
Start ES
…
…
HDFS repository
Index/0
Index/1
Index/n
InputData
1 partition 1. shard (data)
0. shard (empty)
n. shard (empty)
…
SnapshotConstant shard routing
0. sh. snapshot(empty)
1. sh. snapshot
n. sh. snapshot(empty)
![Page 12: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/12.jpg)
HDFS snapshot repository layout
Dest dir
indices
idxname-2015
idxname-2016
0
__r
__z
1meta-idxname-
2016.dat
snap-idxname-2016.dat
![Page 13: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/13.jpg)
Creating local ES node
val nodeSettings: Settings = Settings.builder.put("http.enabled", false).put("processors", 1).put("index.merge.scheduler.max_thread_count", 1)
….build
val node: Node = nodeBuilder().settings(nodeSettings).local(true).node()
node.startval client: Client = node.clientclient.admin.indices. … .setSource(mapping).get
HTTP unnecessary, use transport interface
Only JVM local node discovery
Same json mapping as Index API (http)
![Page 14: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/14.jpg)
RDD export like saveAsTextFile
rddToIndex.repartition(config.numShards).saveToESSnapshot(config,…
![Page 15: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/15.jpg)
We use implicit conversions
package object spark {
implicit class
DBSysSparkRDDFunctions
[T <: Map[String,Object]] //Row
(val rdd: RDD[T]) extends AnyVal {
def saveToESSnapshot(config:String,…):Unit = {
…
rdd.sparkContext.runJob(rdd,esWriter.processPartition _)
Input RDD type bound
Indexing method
Infiltrate spark namespace
![Page 16: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/16.jpg)
Useful ES commands
• Create HDFS snapshot repository:
curl -s -XPUT 'localhost:9200/_snapshot/<Repo name>' -d '{
"type": "hdfs",
"settings": {
"uri": "hdfs://namenode:8020/",
"path": "/user/<username>/<Snapshot repo hdfs path>",
"load_defaults": "false"
}
}’
![Page 17: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/17.jpg)
Useful ES commands
• Start restore process:
curl -XPOST 'localhost:9200/_snapshot/<Repo name>/<snapshot name>/_restore’
• Monitor restore progress:curl -s –XGET 'localhost:9200/_cat/recovery?v' |
awk '{print $1 " " $11}' |
fgrep -v " 0.0%" |
fgrep -v "100.0%"
![Page 18: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/18.jpg)
Agenda
• Approaches tried and why they failed
• Solution used, Spark + ES
• Benchmark, summary and possible improvements
![Page 19: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/19.jpg)
ES cluster configurationProperty Value
Number of nodes 24
ES heap size 29GB
CPUs 8 (/proc/cpuinfo)
HDD 2x3.5TB / node
No. of indices 130
No. of shards ~3900
Data size 16 TB
No. of docs > 60 billion
Indexing speed (what we can handle…) ~1000 docs/s
![Page 20: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/20.jpg)
Offline indexing environmentProperty Value
Input size 135GB compr. parquet
Number of docs 470M
CPUs (indexing) 15 spark workers
Memory 4GB /worker
Output index layout 20 string fields, 15 part.
Job duration ~3.5h
Restore duration ~20m
Duration total ~4h
Indexing speed >30k docs/s
![Page 21: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/21.jpg)
Future work
• Shard routing
• Indexing on local FS, use directly HDFS
• Speed up indexing
• Use for stream indexing
![Page 22: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe](https://reader034.vdocuments.us/reader034/viewer/2022052500/5f1b7df74e77266f5c01a7b2/html5/thumbnails/22.jpg)
Summary
• Hard to directly compare RT with our offline approach
• What we wanted was to make historical data available for
users, without influencing production systems
https://github.com/andybab/OfflineESIndexGenerator