1. what spark is about 2. why spark on ceph? 3. implementation … · 2018-11-19 · 1. what spark...

Spark on Ceph at UPSud/LAL

1. What Spark is about

2. Why Spark on Ceph?

3. Implementation ideas

Julien Nauroy Spark on Ceph


1. What Spark is about

• Spark is a computing framework

– Similar to Hadoop MapReduce… from afar

• Many more use cases

– Machine Learning, Bioinformatics, …

• Key concept: Resilient Distributed Dataset (RDD)

– Tries to fit the dataset into RAM
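The RDD idea can be illustrated with a toy, single-machine sketch. This is not Spark's API: `ToyDataset` and its methods are hypothetical names chosen for the illustration. It only shows the two properties named above: transformations are recorded lazily, and a dataset marked with `cache()` is kept in RAM after its first computation. Because the recipe (the lineage) is kept, lost data could be rebuilt from it.

```python
from typing import Callable, Iterable, List, Optional


class ToyDataset:
    """Hypothetical stand-in for an RDD: a lazy chain of transformations."""

    def __init__(self, compute: Callable[[], Iterable[int]]):
        self._compute = compute          # how to rebuild the data (the lineage)
        self._cached: Optional[List[int]] = None
        self._keep_in_ram = False
        self.computations = 0            # counts how often the data was rebuilt

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        # Nothing runs here; only the new step is recorded.
        return ToyDataset(lambda: (fn(x) for x in self.collect()))

    def cache(self) -> "ToyDataset":
        # Ask for the result to be kept in memory once it has been computed.
        self._keep_in_ram = True
        return self

    def collect(self) -> List[int]:
        if self._cached is not None:
            return self._cached
        self.computations += 1
        data = list(self._compute())
        if self._keep_in_ram:
            self._cached = data
        return data


squares = ToyDataset(lambda: range(5)).map(lambda x: x * x).cache()
first = squares.collect()    # computed from the lineage
second = squares.collect()   # served from RAM, not recomputed
```

In real Spark the data is partitioned across the cluster and only the partitions that fit are kept in memory; the sketch ignores both aspects.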


1. What Spark is about

• Spark runs on a cluster

– Uses YARN, Mesos, or standalone mode

• Reads from/writes to distributed filesystems

– HDFS, S3, …

– Not to Ceph (yet)

• Preferably uses HDFS

– Data locality – but doesn’t make sense in VMs

– Uses rename on writes – possible problem
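"Rename on writes" most likely refers to the usual Hadoop-style commit pattern: output is written to a temporary location, then renamed into place. A minimal local sketch of that pattern follows; the function name and the `_temporary_` prefix are assumptions made for the illustration, not Spark code.

```python
import os
import tempfile


def commit_by_rename(final_path: str, payload: bytes) -> None:
    """Write to a temporary file, then rename it into place."""
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix="_temporary_")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
        # On a POSIX filesystem this rename is atomic: readers see either
        # no file or the complete file, never a partial one.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the temporary file if the write or the rename failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "part-00000")
commit_by_rename(target, b"result\n")
```

On HDFS a rename is a cheap metadata operation. On object stores such as S3 it is typically emulated by copy plus delete, which is presumably the "possible problem" for a non-HDFS backend.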


1. Experiments at UPSud

• Life Sciences

– DNA/RNA Sequence alignment

– Galaxy on Spark

– Simulating turtle embryo growth

• Astrophysics

– Image coaddition

– Cross matching catalogs (CDS Strasbourg)


How HDFS works

1. Split files into blocks

• Split on data structure boundaries (e.g. line)

• Indicative size: 128 MB

[Figure: a file divided into five consecutive blocks, numbered 1 to 5]
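A small sketch of the splitting step, following the simplified model of this slide: blocks of roughly a target size, each ending on a line boundary. The function name and the tiny 25-byte target are assumptions for the illustration. Real HDFS cuts at fixed byte offsets and lets the input reader handle records that straddle two blocks.

```python
from typing import List


def split_into_blocks(data: bytes, target_size: int) -> List[bytes]:
    """Cut data into blocks of about target_size bytes, each ending on a newline."""
    blocks = []
    start = 0
    while start < len(data):
        end = min(start + target_size, len(data))
        if end < len(data):
            # Extend the block to the end of the current line.
            newline = data.find(b"\n", end - 1)
            end = len(data) if newline == -1 else newline + 1
        blocks.append(data[start:end])
        start = end
    return blocks


# Ten records of 11 bytes each, split with a deliberately tiny block size.
lines = b"".join(b"record-%03d\n" % i for i in range(10))
blocks = split_into_blocks(lines, target_size=25)
```

With these inputs each block holds whole records only, and concatenating the blocks gives back the original data.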


How HDFS works

2. Copy each block on multiple nodes

• In general, 3 copies

[Figure: blocks 1 to 5 are placed one at a time until each block has three copies spread over five nodes]

Node A: blocks 1, 3, 5
Node B: blocks 2, 3, 5
Node C: blocks 1, 2, 4
Node D: blocks 1, 4, 5
Node E: blocks 2, 3, 4
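The replication step can be sketched as follows. The round-robin policy is an assumption chosen for simplicity, and `place_replicas` is a hypothetical helper. It does not reproduce the exact placement of the figure, and real HDFS placement also takes racks and free space into account.

```python
from itertools import cycle
from typing import Dict, List


def place_replicas(blocks: List[int], nodes: List[str], copies: int = 3) -> Dict[str, List[int]]:
    """Assign each block to `copies` distinct nodes, walking the nodes round-robin."""
    if copies > len(nodes):
        raise ValueError("need at least as many nodes as copies")
    placement: Dict[str, List[int]] = {node: [] for node in nodes}
    ring = cycle(nodes)
    for block in blocks:
        # Consecutive nodes on the ring are distinct as long as copies <= len(nodes).
        for _ in range(copies):
            placement[next(ring)].append(block)
    return placement


placement = place_replicas([1, 2, 3, 4, 5], ["A", "B", "C", "D", "E"])
```

As in the figure, every block ends up on three different nodes and every node holds three blocks, so losing a single node never loses data.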


How MapReduce works

1. Select nodes on which to run computations

• Data has to be node-local (if possible)

• The node must not be busy

[Figure: the five-node block placement, shown over several steps as nodes are selected]

Node A: blocks 1, 3, 5
Node B: blocks 2, 3, 5
Node C: blocks 1, 2, 4
Node D: blocks 1, 4, 5
Node E: blocks 2, 3, 4
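The two selection rules (node-local data, idle node) can be sketched with a greedy scheduler over the placement shown in the figure. The one-task-per-node limit and the function names are assumptions for the illustration; actual schedulers are more elaborate.

```python
from typing import Dict, List, Optional, Set

# Block placement shown on the slides.
PLACEMENT: Dict[str, List[int]] = {
    "A": [1, 3, 5],
    "B": [2, 3, 5],
    "C": [1, 2, 4],
    "D": [1, 4, 5],
    "E": [2, 3, 4],
}


def pick_node(block: int, busy: Set[str]) -> Optional[str]:
    """Return an idle node holding `block`, or None if all its replicas are busy."""
    for node, blocks in PLACEMENT.items():
        if block in blocks and node not in busy:
            return node
    return None


def schedule(blocks: List[int]) -> Dict[int, Optional[str]]:
    """Greedily give each block one task on a node that stores it."""
    busy: Set[str] = set()
    assignment: Dict[int, Optional[str]] = {}
    for block in blocks:
        node = pick_node(block, busy)
        if node is not None:
            busy.add(node)   # assumed: one task per node at a time
        assignment[block] = node
    return assignment


plan = schedule([1, 2, 3, 4, 5])
```

With three copies per block, every one of the five tasks finds an idle node that already holds its data. A greedy pass is not guaranteed to succeed for every placement; when it fails, a scheduler would wait or fall back to reading the block over the network.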


2. Why Spark on Ceph?

• Spark clusters in VMs work great

– For computations at least

– Main usage of Spark (public clouds)

• Spark requires distributed storage

– HDFS, S3, NFS …

– HDFS in a VM will not solve the problem

• HDFS over Ceph = double penalty

• Data locality doesn’t make sense in VMs
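One way to read the "double penalty" is in terms of replication: HDFS keeps its own copies of each block, and if the VM disks live on Ceph, Ceph replicates each of those copies again. A rough calculation, assuming 3 HDFS copies (as on the earlier slides) and a Ceph pool with 3 replicas (an assumption, not a figure from the talk):

```python
def raw_storage_tb(data_tb: float, hdfs_copies: int = 3, ceph_copies: int = 3) -> float:
    """Raw capacity consumed when HDFS copies sit on top of Ceph replicas."""
    return data_tb * hdfs_copies * ceph_copies


stacked = raw_storage_tb(10)                 # HDFS inside VMs backed by Ceph
direct = raw_storage_tb(10, hdfs_copies=1)   # Spark reading Ceph directly
```

Under these assumptions 10 TB of data consumes 90 TB of raw capacity when stacked, against 30 TB when Ceph is used directly. The same stacking applies to network traffic, since every HDFS write fans out into several Ceph writes.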


2. Why Spark on Ceph?

• Ceph is coupled with our OpenStack cluster

– Local expertise

• HDFS is not an option

– Problems with data locality

– Computing and storage not paired in our cloud


3. Spark on Ceph – ideas

1. Using RGWFS

2. Using CephFS-Hadoop

3. Using a gateway with an S3 endpoint


3.1 - RGWFS

• http://www.slideshare.net/zhouyuan/hadoop-over-rgw

• Pros

– Should integrate well with Spark through rgw://

• Cons

– Git repo doesn’t exist anymore

– Cannot find more info – vaporware?


3.2 – S3 Gateway

• http://docs.ceph.com/docs/master/radosgw/s3/

• Pros

– Hadoop supports the S3 protocol

– VMs stay outside of the OSD network

• Cons

– Another layer of indirection?

– Performance depends on the number of gateways?
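For the S3 gateway option, Spark would go through Hadoop's S3A connector. A sketch of the properties involved follows. The `fs.s3a.*` keys are standard Hadoop S3A settings; the helper name, the endpoint URL and the credentials are placeholders invented for the illustration.

```python
from typing import Dict


def s3a_settings(endpoint: str, access_key: str, secret_key: str) -> Dict[str, str]:
    """Build Spark properties that point the Hadoop S3A connector at a gateway."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Gateways are usually addressed as host/bucket rather than bucket.host.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(endpoint.startswith("https://")).lower(),
    }


# Placeholder endpoint and credentials, for illustration only.
settings = s3a_settings("http://rgw.example.org:7480", "ACCESS", "SECRET")
```

Each pair could then be passed with `--conf key=value` or through the session builder, and data addressed as `s3a://bucket/path`, provided the `hadoop-aws` module is on the classpath. Whether one gateway is enough is exactly the open performance question above.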


Which solution is best suited?

• Discussion
