Leonid Blokhin - "Mist as Part of Hydrosphere"


Mist
https://github.com/Hydrospheredata/mist


• Леонид Блохин

• Big Data Engineer

• lblokhin@provectus.com

• +7 (917) 295 - 40 - 49

Mist

• HydroSphere

• Spark

• Why We Needed Mist

• Running

• Configuration

• Spark Job at Mist

• Road Map


http://hydrosphere.io/

Hydrosphere is an open-source Big Data and Analytics platform built with a DevOps culture in mind.



http://spark.apache.org/

Apache Spark™ is a fast and general engine for large-scale data processing.


Mist

• Mist is a thin service on top of Spark which makes it possible to execute Scala & Python Spark jobs from application layers and get synchronous, asynchronous, and reactive results, as well as provide an API to external clients.

• It implements Spark as a Service and creates a unified API layer for building enterprise solutions and services on top of a Big Data lake.


● HTTP and Messaging (MQTT) API
● Scala & Python Spark job execution
● Works with Standalone, Mesos, and YARN (any Spark config)
● Support for Spark SQL and Hive
● High Availability and Fault Tolerance
● Persists job state for self-healing
● Async and sync API, JSON job results


Why We Needed Mist

Running

Build the project:

git clone https://github.com/hydrospheredata/mist.git

cd mist

./sbt/sbt -DsparkVersion=1.5.2 assembly

Create a configuration file

Run:

spark-submit --class io.hydrosphere.mist.Mist \
  --driver-java-options "-Dconfig.file=/path/to/application.conf" \
  target/scala-2.10/mist-assembly-0.2.0.jar


Configuration


# spark master url can be either of three: local, yarn, mesos (local by default)
mist.spark.master = "local[*]"

# number of threads: one thread for one job
mist.settings.threadNumber = 16

# http interface (off by default)
mist.http.on = true
mist.http.host = "192.168.10.13"
mist.http.port = 2003


# MQTT interface (off by default)
mist.mqtt.on = true
mist.mqtt.host = "192.168.10.33"
mist.mqtt.port = 1883
# mist listens on this topic for incoming requests
mist.mqtt.subscribeTopic = "foo"
# mist answers on this topic with the results
mist.mqtt.publishTopic = "foo"
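
As a quick check of the MQTT setup, the results topic can be watched with the standard Mosquitto client (a usage sketch; broker host, port, and topic are taken from the settings above):

mosquitto_sub -h 192.168.10.33 -p 1883 -t 'foo'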


# recovery job (off by default)
mist.recovery.on = true
mist.recovery.multilimit = 10
mist.recovery.typedb = "MapDb"
mist.recovery.dbfilename = "file.db"


# default settings for all contexts
# timeout for each job in a context
mist.contextDefaults.timeout = 100 days
# mist can kill a context after the job has finished (off by default)
mist.contextDefaults.disposable = false

# settings for SparkConf
mist.contextDefaults.sparkConf = {
  spark.default.parallelism = 128
  spark.driver.memory = "10g"
  spark.scheduler.mode = "FAIR"
}


# settings can be overridden for each context
mist.contexts.foo.timeout = 100 days

mist.contexts.foo.sparkConf = {
  spark.scheduler.mode = "FIFO"
}

mist.contexts.bar.timeout = 1000 seconds
mist.contexts.bar.disposable = true

# mist can create contexts on start, so we don't waste time on the first request
mist.contextSettings.onstart = ["foo"]

Spark Job at Mist

Mist Scala Spark Job

In order to prepare your job to run on Mist, extend a Scala object from MistJob and implement the abstract method doStuff:

def doStuff(context: SparkContext, parameters: Map[String, Any]): Map[String, Any] = ???

def doStuff(context: SQLContext, parameters: Map[String, Any]): Map[String, Any] = ???

def doStuff(context: HiveContext, parameters: Map[String, Any]): Map[String, Any] = ???


Example:

object SimpleContext extends MistJob {

  override def doStuff(context: SparkContext, parameters: Map[String, Any]): Map[String, Any] = {
    val numbers: List[BigInt] = parameters("digits").asInstanceOf[List[BigInt]]
    val rdd = context.parallelize(numbers)
    Map("result" -> rdd.map(x => x * 2).collect())
  }
}

Building Mist jobs

Add Mist as a dependency in your build.sbt:

libraryDependencies += "io.hydrosphere" % "mist" % "0.2.0"
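
A minimal build.sbt sketch for a job project; the project name, the Scala version, and the "provided" scope for Spark are assumptions chosen to match the Spark 1.5.2 / Scala 2.10 assembly built earlier:

// minimal sketch; versions are assumptions matching the assembly above
name := "mist-examples"

scalaVersion := "2.10.6"

// dependency from the slide above
libraryDependencies += "io.hydrosphere" % "mist" % "0.2.0"

// Spark itself is assumed to be supplied by the Mist/Spark installation at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "provided"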


Mist Python Spark Job

Import mist and implement the method doStuff. The following Spark context aliases are available for convenience:

job.sc = SparkContext
job.sqlc = SQLContext
job.hc = HiveContext


For example:

import mist

class MyJob:
    def __init__(self, job):
        job.sendResult(self.doStuff(job))

    def doStuff(self, job):
        val = job.parameters.values()
        list = val.head()
        # copy the incoming Scala list into a Python list;
        # capture the size up front, since list is re-assigned to its tail below
        size = list.size()
        pylist = []
        count = 0
        while count < size:
            pylist.append(list.head())
            count = count + 1
            list = list.tail()
        rdd = job.sc.parallelize(pylist)
        result = rdd.map(lambda s: 2 * s).collect()
        return result

if __name__ == "__main__":
    job = MyJob(mist.Job())

Submitting a job over MQTT:

mosquitto_pub -h 192.168.10.33 -p 1883 -t 'foo' -m '{
  "jarPath": "/vagrant/examples/target/scala-2.11/mist_examples_2.11-0.0.1.jar",
  "className": "SimpleContext$",
  "parameters": {"digits": [1,2,3,4,5,6,7,8,9,0]},
  "external_id": "12345678",
  "name": "foo"
}'
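
The HTTP interface configured earlier can carry the same payload. A sketch with curl, assuming the 0.2.x service accepts job requests at a /jobs endpoint on mist.http.host and mist.http.port (the path is an assumption):

# hypothetical HTTP submission; host and port come from mist.http.* above, the /jobs path is an assumption
curl -X POST http://192.168.10.13:2003/jobs \
  -H 'Content-Type: application/json' \
  -d '{
        "jarPath": "/vagrant/examples/target/scala-2.11/mist_examples_2.11-0.0.1.jar",
        "className": "SimpleContext$",
        "parameters": {"digits": [1,2,3,4,5,6,7,8,9,0]},
        "external_id": "12345678",
        "name": "foo"
      }'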

Mist answers with the job result as JSON:

{"success":true,"payload":{"result":[2,4,6,8,10,12,14,16,18,0]},"errors":[],"request":{"jarPath":"src/test/resources/mistjob_2.10-1.0.jar","className":"SimpleContext$","name":"foo","parameters":{"digits":[1,2,3,4,5,6,7,8,9,0]},"external_id":"12345678"}

}

Road Map

● Super parallel mode: support for multiple JVMs
● Cluster mode and node framework
● Add logging
● RESTification
● Support for streaming contexts/jobs
● Apache Kafka support
● AMQP support
● Web UI

Your contributions are very welcome on GitHub!
https://github.com/Hydrospheredata/mist

Thanks!

Questions?


Леонид Блохин

Skype: leonid_niko

Email: lblokhin@provectus.com

www.provectus.com
