Leonid Blokhin - "Mist as Part of Hydrosphere"


Upload: provectus

Post on 22-Jan-2018


Page 1: Leonid Blokhin - "Mist as Part of Hydrosphere"

Mist

https://github.com/Hydrospheredata/mist

www.provectus.com

© Provectus, Inc.

• Leonid Blokhin

• Big Data Engineer

• [email protected]

• +7 (917) 295 - 40 - 49

Page 2: Leonid Blokhin - "Mist as Part of Hydrosphere"

Mist

• HydroSphere

• Spark

• Why We Needed a Mist

• Running

• Configuration

• Spark Job at Mist

• Road Map


Page 3: Leonid Blokhin - "Mist as Part of Hydrosphere"

http://hydrosphere.io/

Hydrosphere is an open-source Big Data and analytics platform built with DevOps culture in mind.

Page 4: Leonid Blokhin - "Mist as Part of Hydrosphere"

http://hydrosphere.io/

Page 5: Leonid Blokhin - "Mist as Part of Hydrosphere"

http://spark.apache.org/

Apache Spark™ is a fast and general engine for large-scale data processing.

Page 6: Leonid Blokhin - "Mist as Part of Hydrosphere"

Page 7: Leonid Blokhin - "Mist as Part of Hydrosphere"

• Mist is a thin service on top of Spark. It lets application layers execute Scala and Python Spark jobs, receive synchronous, asynchronous, and reactive results, and expose an API to external clients.

• It implements Spark as a Service and creates a unified API layer for building enterprise solutions and services on top of a Big Data lake.

Page 8: Leonid Blokhin - "Mist as Part of Hydrosphere"

Why We Needed a Mist

● HTTP and Messaging (MQTT) API

● Scala & Python Spark job execution

● Works with Standalone, Mesos, YARN, or any other Spark config

● Support for Spark SQL and Hive

● High Availability and Fault Tolerance

● Persisted job state for self-healing

● Async and sync API, JSON job results

Page 9: Leonid Blokhin - "Mist as Part of Hydrosphere"

Running

Build the project

git clone https://github.com/hydrospheredata/mist.git
cd mist
./sbt/sbt -DsparkVersion=1.5.2 assembly

Create a configuration file

Run

spark-submit --class io.hydrosphere.mist.Mist \
  --driver-java-options "-Dconfig.file=/path/to/application.conf" \
  target/scala-2.10/mist-assembly-0.2.0.jar

Page 10: Leonid Blokhin - "Mist as Part of Hydrosphere"

Configuration

# Spark master URL, one of three kinds: local, yarn, mesos (local by default)
mist.spark.master = "local[*]"

# number of threads: one thread per job
mist.settings.threadNumber = 16

# HTTP interface (off by default)
mist.http.on = true
mist.http.host = "192.168.10.13"
mist.http.port = 2003
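With the HTTP interface switched on as above, a client could submit a job as a plain JSON POST. The sketch below is hypothetical: it reuses the host and port from this slide and the request fields from the MQTT example later in the deck, and it assumes a `/jobs` endpoint path, which the slides do not state. `build_job_request` is an illustrative helper name, not part of Mist.

```python
import json
import urllib.request

# Host and port come from the mist.http.* settings on this slide.
MIST_HTTP_HOST = "192.168.10.13"
MIST_HTTP_PORT = 2003
# ASSUMPTION: the endpoint path is not given in the slides.
JOBS_PATH = "/jobs"


def build_job_request(jar_path, class_name, parameters, name, external_id):
    """Build (but do not send) a JSON POST request describing a job."""
    body = {
        "jarPath": jar_path,
        "className": class_name,
        "parameters": parameters,
        "external_id": external_id,
        "name": name,
    }
    url = "http://{}:{}{}".format(MIST_HTTP_HOST, MIST_HTTP_PORT, JOBS_PATH)
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_job_request(
    jar_path="/vagrant/examples/target/scala-2.11/mist_examples_2.11-0.0.1.jar",
    class_name="SimpleContext$",
    parameters={"digits": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]},
    name="foo",
    external_id="12345678",
)
print(request.full_url)

# To send it to a running Mist instance:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it keeps the example runnable without a live Mist instance.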

Page 11: Leonid Blokhin - "Mist as Part of Hydrosphere"

Configuration

# MQTT interface (off by default)
mist.mqtt.on = true
mist.mqtt.host = "192.168.10.33"
mist.mqtt.port = 1883
# Mist listens on this topic for incoming requests
mist.mqtt.subscribeTopic = "foo"
# Mist publishes the results to this topic
mist.mqtt.publishTopic = "foo"

Page 12: Leonid Blokhin - "Mist as Part of Hydrosphere"

Configuration

# job recovery (off by default)
mist.recovery.on = true
mist.recovery.multilimit = 10
mist.recovery.typedb = "MapDb"
mist.recovery.dbfilename = "file.db"

Page 13: Leonid Blokhin - "Mist as Part of Hydrosphere"

Configuration

# default settings for all contexts
# timeout for each job in a context
mist.contextDefaults.timeout = 100 days
# Mist can kill the context after a job finishes (off by default)
mist.contextDefaults.disposable = false

# settings for SparkConf
mist.contextDefaults.sparkConf = {
  spark.default.parallelism = 128
  spark.driver.memory = "10g"
  spark.scheduler.mode = "FAIR"
}

Page 14: Leonid Blokhin - "Mist as Part of Hydrosphere"

Configuration

# settings can be overridden for each context
mist.contexts.foo.timeout = 100 days

mist.contexts.foo.sparkConf = {
  spark.scheduler.mode = "FIFO"
}

mist.contexts.bar.timeout = 1000 second
mist.contexts.bar.disposable = true

# Mist can create contexts on start, so no time is wasted on the first request
mist.contextSettings.onstart = ["foo"]
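The default-plus-override scheme from the configuration slides can be pictured as a dictionary merge. The following is only an illustration of that idea, not Mist's actual resolution code. `effective_settings` is a hypothetical helper, and the slides do not say whether a per-context sparkConf is merged key by key with the defaults or replaces them wholesale; this sketch assumes a key-by-key merge.

```python
import copy

# Values copied from the mist.contextDefaults.* settings.
CONTEXT_DEFAULTS = {
    "timeout": "100 days",
    "disposable": False,
    "sparkConf": {
        "spark.default.parallelism": 128,
        "spark.driver.memory": "10g",
        "spark.scheduler.mode": "FAIR",
    },
}

# Values copied from the mist.contexts.<name>.* settings.
CONTEXT_OVERRIDES = {
    "foo": {"timeout": "100 days", "sparkConf": {"spark.scheduler.mode": "FIFO"}},
    "bar": {"timeout": "1000 second", "disposable": True},
}


def effective_settings(context_name):
    """Return the defaults with the named context's overrides applied."""
    settings = copy.deepcopy(CONTEXT_DEFAULTS)
    for key, value in CONTEXT_OVERRIDES.get(context_name, {}).items():
        if key == "sparkConf":
            # ASSUMPTION: sparkConf entries are merged, not replaced.
            settings["sparkConf"].update(value)
        else:
            settings[key] = value
    return settings


for name in ("foo", "bar", "baz"):
    print(name, effective_settings(name))
```

Under these assumptions, context foo keeps the default driver memory but switches to the FIFO scheduler, bar becomes disposable with a shorter timeout, and an unconfigured name such as baz gets the defaults unchanged.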

Page 15: Leonid Blokhin - "Mist as Part of Hydrosphere"

Spark Job at Mist

Mist Scala Spark Job

To prepare a job to run on Mist, extend a Scala object from MistJob and implement the abstract method doStuff for the context you need:

def doStuff(context: SparkContext, parameters: Map[String, Any]): Map[String, Any] = ???
def doStuff(context: SQLContext, parameters: Map[String, Any]): Map[String, Any] = ???
def doStuff(context: HiveContext, parameters: Map[String, Any]): Map[String, Any] = ???

Page 16: Leonid Blokhin - "Mist as Part of Hydrosphere"

Spark Job at Mist

Example:

object SimpleContext extends MistJob {
  override def doStuff(context: SparkContext, parameters: Map[String, Any]): Map[String, Any] = {
    val numbers: List[BigInt] = parameters("digits").asInstanceOf[List[BigInt]]
    val rdd = context.parallelize(numbers)
    Map("result" -> rdd.map(x => x * 2).collect())
  }
}

Building Mist jobs

Add Mist as a dependency in your build.sbt:

libraryDependencies += "io.hydrosphere" % "mist" % "0.2.0"

Page 17: Leonid Blokhin - "Mist as Part of Hydrosphere"

Spark Job at Mist

Mist Python Spark Job

Import mist and implement the method doStuff. The following Spark context aliases are provided for convenience:

job.sc = SparkContext
job.sqlc = SQL Context
job.hc = Hive Context

Page 18: Leonid Blokhin - "Mist as Part of Hydrosphere"

Spark Job at Mist

For example:

import mist

class MyJob:
    def __init__(self, job):
        job.sendResult(self.doStuff(job))

    def doStuff(self, job):
        # The parameter values appear to arrive as a Scala collection
        # (hence head/tail/size), so copy the first value into a Python list.
        val = job.parameters.values()
        remaining = val.head()
        pylist = []
        # Walk the collection head-first until nothing is left.
        while remaining.size() > 0:
            pylist.append(remaining.head())
            remaining = remaining.tail()
        rdd = job.sc.parallelize(pylist)
        result = rdd.map(lambda s: 2 * s).collect()
        return result

if __name__ == "__main__":
    job = MyJob(mist.Job())

Page 19: Leonid Blokhin - "Mist as Part of Hydrosphere"

mosquitto_pub -h 192.168.10.33 -p 1883 -m '{
  "jarPath": "/vagrant/examples/target/scala-2.11/mist_examples_2.11-0.0.1.jar",
  "className": "SimpleContext$",
  "parameters": {"digits": [1,2,3,4,5,6,7,8,9,0]},
  "external_id": "12345678",
  "name": "foo"
}' -t 'foo'
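The same request could be published from Python instead of mosquitto_pub. This is a sketch under two assumptions: the third-party paho-mqtt package is installed, and the broker address and topic match the MQTT settings from the configuration slides. `build_payload` and `publish_job` are illustrative names, not part of Mist.

```python
import json

# Broker address and topic, taken from the mist.mqtt.* settings.
MQTT_HOST = "192.168.10.33"
MQTT_PORT = 1883
REQUEST_TOPIC = "foo"  # mist.mqtt.subscribeTopic


def build_payload(digits):
    """Serialize the job request shown in the mosquitto_pub example."""
    return json.dumps({
        "jarPath": "/vagrant/examples/target/scala-2.11/mist_examples_2.11-0.0.1.jar",
        "className": "SimpleContext$",
        "parameters": {"digits": digits},
        "external_id": "12345678",
        "name": "foo",
    })


def publish_job(payload):
    """Send one message to the broker (requires paho-mqtt and a reachable broker)."""
    # Imported here so build_payload still works without paho-mqtt installed.
    import paho.mqtt.publish as publish

    publish.single(REQUEST_TOPIC, payload=payload, hostname=MQTT_HOST, port=MQTT_PORT)


payload = build_payload([1, 2, 3, 4, 5, 6, 7, 8, 9, 0])
print(payload)
# publish_job(payload)  # uncomment when the broker is reachable
```

Because the configuration uses "foo" for both the subscribe and the publish topic, a client listening on that topic would see its own request as well as the reply.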

Page 20: Leonid Blokhin - "Mist as Part of Hydrosphere"

Page 21: Leonid Blokhin - "Mist as Part of Hydrosphere"

{
  "success": true,
  "payload": {"result": [2,4,6,8,10,12,14,16,18,0]},
  "errors": [],
  "request": {
    "jarPath": "src/test/resources/mistjob_2.10-1.0.jar",
    "className": "SimpleContext$",
    "name": "foo",
    "parameters": {"digits": [1,2,3,4,5,6,7,8,9,0]},
    "external_id": "12345678"
  }
}
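A client receiving the reply above might unpack it roughly as follows. This is a minimal sketch that relies only on the fields visible in this one sample (success, payload, errors, request). `extract_result` is a hypothetical helper, and the shape of a failed reply is an assumption.

```python
import json

# The reply from this slide, copied verbatim.
RESPONSE = (
    '{"success":true,"payload":{"result":[2,4,6,8,10,12,14,16,18,0]},'
    '"errors":[],"request":{"jarPath":"src/test/resources/mistjob_2.10-1.0.jar",'
    '"className":"SimpleContext$","name":"foo",'
    '"parameters":{"digits":[1,2,3,4,5,6,7,8,9,0]},"external_id":"12345678"}}'
)


def extract_result(raw):
    """Return the job result, or raise if the reply reports a failure."""
    message = json.loads(raw)
    if not message["success"]:
        # ASSUMPTION: failed replies carry their messages in "errors".
        raise RuntimeError("Mist job failed: {}".format(message["errors"]))
    return message["payload"]["result"]


result = extract_result(RESPONSE)
print(result)
```

The external_id echoed back inside "request" gives the client a way to match an asynchronous reply to the request that produced it.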

Page 22: Leonid Blokhin - "Mist as Part of Hydrosphere"

Road Map

● Super parallel mode: support for multiple JVMs

● Cluster mode and node framework

● Add logging

● RESTification

● Support for streaming contexts and jobs

● Apache Kafka support

● AMQP support

● Web UI

Your contributions are very welcome on GitHub!

https://github.com/Hydrospheredata/mist

Page 23: Leonid Blokhin - "Mist as Part of Hydrosphere"

Thanks!

Questions?

Leonid Blokhin

Skype: leonid_niko

Email: [email protected]

www.provectus.com