WHEN STORM HITS DATA.
DATA STREAM PROCESSING IN REAL TIME.
MARCIN STANISLAWSKI
WHO AM I?
Architect/Developer at Interia.pl
Storm and Hadoop user
Github: webik
Twitter: @unilama
RUN JOB
COFFEE BREAK*
RESULTS
* there are some solutions
IMPALA
implemented in C++
non-MapReduce solution
KIJI
KijiREST
HDFS/HBase/Cassandra
BATCH PROCESSING VS. STREAMING
STREAMING SOLUTIONS
Yahoo S4
Akka
Spark Streaming
Storm
STORM: WHAT IS THAT?
README.MD
Storm is a distributed realtime computation system.
Storm is simple, can be used with any programming language, and is a lot of fun to use!
CURRENT STATUS
Apache Incubation
Included in Hortonworks Data Platform
Contributed by Yahoo
Easy deploy to Amazon EC2
SPOUTS
TAKE EVENTS FROM:
Kafka
Kestrel
RabbitMQ
...
AND PASS THEM TO...
BOLTS
TUPLES ARE PROCESSED IN THE WAY THAT YOU IMPLEMENT
EVENTS ARE TUPLES
( 1, "TEST", "ATMOSPHERE", "2014-05-20 10:00:40", ... )
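To make the spout -> bolt flow concrete, here is a minimal sketch in plain Python. It does not use the Storm API; `sample_spout`, `project_bolt`, `drop_storm_bolt` and `run_topology` are hypothetical names chosen for illustration, and the whole "topology" runs in one process as a simple linear chain.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Event = Tuple  # an event is just a tuple of values


def sample_spout() -> Iterator[Event]:
    """Hypothetical spout: yields tuples as if they were read from a queue."""
    yield (1, "TEST", "ATMOSPHERE", "2014-05-20 10:00:40")
    yield (2, "STORM", "ATMOSPHERE", "2014-05-20 10:00:41")


def project_bolt(event: Event) -> Iterable[Event]:
    """Hypothetical bolt: keeps the id and the lower-cased second field."""
    yield (event[0], event[1].lower())


def drop_storm_bolt(event: Event) -> Iterable[Event]:
    """Hypothetical bolt: a bolt may emit zero tuples, which acts as a filter."""
    if event[1] != "storm":
        yield event


def apply_bolt(stream: Iterable[Event],
               bolt: Callable[[Event], Iterable[Event]]) -> Iterator[Event]:
    """Feed every tuple of the stream to the bolt and pass on what it emits."""
    for event in stream:
        yield from bolt(event)


def run_topology(spout: Iterable[Event],
                 bolts: List[Callable[[Event], Iterable[Event]]]) -> List[Event]:
    """Chain the bolts after the spout and collect the final tuples."""
    stream = spout
    for bolt in bolts:
        stream = apply_bolt(stream, bolt)
    return list(stream)
```

Calling `run_topology(sample_spout(), [project_bolt, drop_storm_bolt])` pushes each tuple through both bolts in order. A real topology is a DAG distributed over many workers, which this sketch deliberately ignores.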
OBJECTS ARE SERIALIZED USING KRYO
WRITTEN IN JAVA & CLOJURE
TOPOLOGIES ARE DAGS
ARCHITECTURE
Nimbus
Nodes (Supervisors)
UI
DRPC
EACH EVENT IS PROCESSED ONE OR MORE TIMES.
ACKING FRAMEWORK
Each tuple must be acked or failed
TUPLE TRACKING
each tuple has a random 64-bit id
the tracker keeps the xor of all tuple ids that have been created and/or acked in the tree
if that xor value equals 0, the tuple is fully processed
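The xor trick works because every id enters the running value twice, once when the tuple is created and once when it is acked, and `x ^ x == 0`. Below is a minimal sketch of that bookkeeping in plain Python. `AckTracker` and `new_tuple_id` are hypothetical names, not Storm classes, and the sketch records "created" and "acked" as separate calls for readability.

```python
import random


def new_tuple_id() -> int:
    """Random non-zero 64-bit id (hypothetical helper)."""
    return random.getrandbits(64) or 1


class AckTracker:
    """Keeps one running xor value per root (spout) tuple."""

    def __init__(self) -> None:
        self.ack_val = {}  # root id -> xor of all ids created/acked so far

    def _xor(self, root_id: int, tuple_id: int) -> None:
        self.ack_val[root_id] = self.ack_val.get(root_id, 0) ^ tuple_id

    def created(self, root_id: int, tuple_id: int) -> None:
        """A tuple was emitted somewhere in the tree of root_id."""
        self._xor(root_id, tuple_id)

    def acked(self, root_id: int, tuple_id: int) -> None:
        """A tuple in the tree of root_id was acked."""
        self._xor(root_id, tuple_id)

    def fully_processed(self, root_id: int) -> bool:
        """True once every created id has also been acked (xor is 0)."""
        return self.ack_val.get(root_id) == 0
```

The tracker needs only a fixed amount of memory per root tuple, no matter how large the tuple tree grows. The price is a tiny chance of a false "fully processed" when unrelated ids happen to xor to 0, which 64-bit random ids make very unlikely.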
COMMUNICATION
Between:
Tasks: LMAX Disruptor
Workers: ØMQ -> Netty
TRIDENT
high-level abstraction
same as Cascading/Scalding in Hadoop World
SPOUT
Key difference: producing Stream(s)
STREAM
Batches chain with multiplication ability
STREAM OPERATIONS
Functions
Filters
Projections
Joins
Merges
STATE
Operations:
Grouping
Aggregate
Query
STATE TYPES
non-transactional
transactional
opaque transactional
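The three state types differ in what they store next to the value and in what they assume about replayed batches. The sketch below shows the idea for a simple counter in plain Python. The class names are hypothetical and this is not the Trident API; it assumes batches are applied in txid order, and that a replayed batch keeps its txid (with identical content for the transactional case, possibly different content for the opaque case).

```python
class NonTransactionalCount:
    """Stores only the value: a replayed batch is counted twice."""

    def __init__(self) -> None:
        self.value = 0

    def apply(self, txid: int, batch_count: int) -> int:
        self.value += batch_count
        return self.value


class TransactionalCount:
    """Stores (txid, value): a replay of the same txid is skipped."""

    def __init__(self) -> None:
        self.txid = None
        self.value = 0

    def apply(self, txid: int, batch_count: int) -> int:
        if txid != self.txid:  # first time this batch is seen
            self.value += batch_count
            self.txid = txid
        return self.value


class OpaqueCount:
    """Stores (txid, prev, value): a replay is recomputed from prev."""

    def __init__(self) -> None:
        self.txid = None
        self.prev = 0
        self.value = 0

    def apply(self, txid: int, batch_count: int) -> int:
        if txid == self.txid:
            # Same txid again: the batch content may have changed,
            # so rebuild the value from the state before this txid.
            self.value = self.prev + batch_count
        else:
            self.prev = self.value
            self.value += batch_count
            self.txid = txid
        return self.value
```

For example, applying txid 1 with a count of 3 twice leaves `TransactionalCount` at 3 but `NonTransactionalCount` at 6. Replaying txid 1 with a different count of 4 moves `OpaqueCount` to 4 instead of 7.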
STATE
In-memory state
NoSQL databases
External systems via APIs
DRPC TOPOLOGY
NAMED DRPC SPOUT
USES MAIN TOPOLOGY STATES
GENERATES ONE TUPLE OUTPUT
DRPC ELEMENTS
THRIFT SERVER(S)
WITH PREDEFINED SPOUT AND BOLT
ARE YOU PROGRAMMING IN A NON-JVM LANGUAGE?
NO PROBLEM :)
Ruby
Python
Perl
PHP
...
STREAMING API
API defined as Thrift
JSON-based communication
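As a rough illustration of the JSON-based communication, the sketch below frames messages the way Storm's multi-language protocol does as far as I recall it: each message is a JSON document followed by a line containing only `end`, exchanged over the component's stdin/stdout. The helper names (`encode_message`, `decode_messages`, `handle_tuple`) are hypothetical, the bolt logic is invented for the example, and the initial handshake of the real protocol is left out.

```python
import json
from typing import Iterable, Iterator, List


def encode_message(message: dict) -> str:
    """Serialize one message: JSON, then a line containing only 'end'."""
    return json.dumps(message) + "\nend\n"


def decode_messages(lines: Iterable[str]) -> Iterator[dict]:
    """Rebuild messages from a stream of lines, splitting on 'end' lines."""
    buffer: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line == "end":
            yield json.loads("\n".join(buffer))
            buffer = []
        else:
            buffer.append(line)


def handle_tuple(incoming: dict) -> List[dict]:
    """Hypothetical bolt logic: emit the first field upper-cased, then ack."""
    emit = {
        "command": "emit",
        "anchors": [incoming["id"]],  # anchor the new tuple to its input
        "tuple": [str(incoming["tuple"][0]).upper()],
    }
    ack = {"command": "ack", "id": incoming["id"]}
    return [emit, ack]
```

A non-JVM bolt would loop over `decode_messages(sys.stdin)` and write `encode_message(reply)` for each reply to stdout. Libraries such as the Python and Ruby adapters hide this framing behind a small base class.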
RED STORM
Writing topologies in Ruby
REAL TIME ALGORITHMS
SIMPLE OPERATIONS
Sum
Count
Multiplication
MAXIMUM AND MINIMUM
don't lose current value
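These operations share one pattern: the stream is never stored, so each aggregate is updated in place from the incoming value and the current state. A minimal sketch in plain Python follows; `RunningStats` is a hypothetical name, and in a real topology this state would live inside a bolt or a Trident state.

```python
class RunningStats:
    """Incremental sum, count, product, minimum and maximum of a stream."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0
        self.product = 1.0
        self.minimum = None  # no value seen yet
        self.maximum = None

    def update(self, x: float) -> None:
        """Fold one new value into every aggregate."""
        self.count += 1
        self.total += x
        self.product *= x
        # Keep the current extreme unless the new value beats it.
        self.minimum = x if self.minimum is None else min(self.minimum, x)
        self.maximum = x if self.maximum is None else max(self.maximum, x)
```

Because the old values are gone, the current minimum and maximum are the only record of the past. If that state is lost, for example after a worker restart without persistent state, it cannot be recomputed from the stream.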
USUALLY TWO TOPOLOGIES
LEARNING
Classification
Clustering
MODEL
Evaluator
Visualiser
BASIC ELEMENT TABLE
ALGORITHM EXAMPLES
k-means clustering
statistical tests (T, F, Z, Chi2)
Hidden Markov Models
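As an illustration of how k-means can run on a stream, here is a sketch of the sequential (online) variant: each point is assigned to its nearest centroid, and that centroid moves toward the point by 1/n, where n is the number of points the centroid has absorbed. `OnlineKMeans` is a hypothetical name. The sketch assumes the initial centroids are supplied by the caller and each counts as one observed point; it is not necessarily the variant used in the talk.

```python
from typing import List, Sequence


class OnlineKMeans:
    """Sequential k-means: centroids are updated one point at a time."""

    def __init__(self, initial_centroids: Sequence[Sequence[float]]) -> None:
        self.centroids: List[List[float]] = [list(c) for c in initial_centroids]
        # Assumption: every initial centroid counts as one observed point.
        self.counts: List[int] = [1] * len(self.centroids)

    def nearest(self, point: Sequence[float]) -> int:
        """Index of the centroid with the smallest squared distance."""
        def squared_distance(i: int) -> float:
            return sum((p - c) ** 2 for p, c in zip(point, self.centroids[i]))
        return min(range(len(self.centroids)), key=squared_distance)

    def update(self, point: Sequence[float]) -> int:
        """Assign the point to a cluster and move that centroid toward it."""
        i = self.nearest(point)
        self.counts[i] += 1
        step = 1.0 / self.counts[i]
        self.centroids[i] = [
            c + step * (p - c) for p, c in zip(point, self.centroids[i])
        ]
        return i
```

In the two-topology layout, a learning topology would call `update` for every training tuple, while the model side only needs `nearest` to evaluate new events against the current centroids.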
STORMUNIT
http://github.com/webik/StormUnit
MAVEN MOJO - COMING SOON :)
http://github.com/webik/storm-maven
SUMMINGBIRD
Write once, run on:
Storm
Hadoop (Scalding)
Amazon Kinesis
MAYBE BACK INTO ZOO
STORM YARN