TRANSCRIPT
Floe: Designing a Continuous Data Flow Engine for Dynamic Applications on the Cloud
Viktor Prasanna, Yogesh Simmhan, Alok Kumbhare, Sreedhar Natarajan
04/20/2012
• Workflow and stream processing systems have been used for pipeline-based applications
• D3 Science: Dynamic, Distributed, Data-Intensive applications exhibit dynamism
− Data is not static; it flows continuously
− Data rates and sizes change depending on domain (QoS) requirements
• Workflows have compositional characteristics but limit dynamism
• Stream processing systems provide real-time processing but lack compositional and data-diversity support
• The MapReduce framework offers dynamism in data flow but severely lacks compositional flexibility
• Goal: an architecture that provides compositional capability, allows real-time stream processing, and supports MapReduce-based key-value exchange
Motivation
• Data Flow Model
− Workflows follow both control flow and data flow
− For continuous data, it is difficult to define a strict control flow
− Floe follows a data flow model, which allows pipelined execution
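The pipelined execution described above can be sketched with Python generators, where each stage consumes its predecessor's output as soon as it is produced. The stage names (`source`, `parse`, `aggregate`) and the three-stage pipeline are illustrative assumptions, not Floe's actual API.

```python
# A minimal sketch of pipelined data-flow execution: tuples stream
# through the stages without any strict control flow coordinating them.

def source(events):
    # Emits raw events (a finite list here, for illustration only).
    for e in events:
        yield e

def parse(stream):
    # Transforms each event as soon as the upstream stage produces it.
    for e in stream:
        yield e.strip().lower()

def aggregate(stream):
    # Maintains running state over the continuous stream.
    count = 0
    for e in stream:
        count += 1
        yield (e, count)

pipeline = aggregate(parse(source(["A ", " B", "C "])))
print(list(pipeline))  # [('a', 1), ('b', 2), ('c', 3)]
```

Because every stage is lazy, a downstream stage starts working on the first tuple before the upstream stage has seen the last one, which is the essence of the pipelined model.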
• Dynamic Data Mapping
− Decide whether an output is sent to a single output channel (round robin) or the same output is sent to every output channel
− The MapReduce framework wires all mappers to reducers; Floe dynamically maps data to reducers at runtime
• Typed Output Channel
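The three mapping strategies above can be sketched as routing functions over a list of channels. The function names and the hash-on-key scheme for runtime reducer mapping are illustrative assumptions.

```python
# A sketch of the data-mapping strategies: round robin to one channel,
# broadcast to every channel, and MapReduce-style key-based mapping.
import itertools

def round_robin(items, channels):
    # Each item goes to exactly one channel, in rotation.
    cycle = itertools.cycle(range(len(channels)))
    for item in items:
        channels[next(cycle)].append(item)

def broadcast(items, channels):
    # The same output is duplicated to every channel.
    for item in items:
        for ch in channels:
            ch.append(item)

def key_mapped(items, channels):
    # The key decides the target (reducer) channel at runtime.
    for key, value in items:
        channels[hash(key) % len(channels)].append((key, value))

chans = [[], []]
round_robin(["a", "b", "c"], chans)
print(chans)  # [['a', 'c'], ['b']]
```

With `key_mapped`, all values for the same key land on the same channel, which is what lets a reducer-style task see its full key group without a static mapper-to-reducer wiring.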
Design Paradigms of Floe
• Continuous Execution
− The system should support continuous processing of data, alongside batch processing that takes an input and runs once
− The framework should be able to pause and resume execution
− For low-latency applications, resources are kept provisioned and the workflow is executed for the next batch of input
• Decentralized Orchestration
− A centralized workflow engine becomes a bottleneck when data flows between distributed tasks
− Decentralized orchestration is better suited: each component is aware of its subsequent component (input connections, output connections, etc.)
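The decentralized scheme above can be sketched as tasks that hold direct references to their successors and push results onward themselves, with no central coordinator. The `Task` class and its methods are illustrative assumptions, not Floe's real interfaces.

```python
# A sketch of decentralized orchestration: wiring is purely local, and
# each task forwards its output directly to the components it knows.

class Task:
    def __init__(self, name, logic):
        self.name = name
        self.logic = logic      # the task's processing function
        self.outputs = []       # successor tasks (output connections)

    def connect(self, successor):
        # A task only knows its immediate neighbours, not the whole graph.
        self.outputs.append(successor)

    def process(self, item):
        result = self.logic(item)
        for nxt in self.outputs:    # push directly; no central engine
            nxt.process(result)

sink_results = []
a = Task("double", lambda x: x * 2)
b = Task("sink", lambda x: sink_results.append(x) or x)
a.connect(b)
a.process(21)
print(sink_results)  # [42]
```

Because data moves task-to-task rather than through a central engine, the coordinator never sits on the data path and cannot become the bottleneck described above.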
• Dynamism in Data Rates & Latency Needs
− Apart from dynamism in data flow, dynamism occurs in data rates and data sizes
− The QoS requirements of the application determine the execution rate, met by adding new resources at runtime
− The framework should be able to handle this
Design Paradigms of Floe (contd.)
• Elastic Resources
− The cloud inherently provides dynamic provisioning of resources
− Resources need to be provisioned ahead of time, considering the latency involved in initialization
− The application should be resilient enough to overcome failures
• Dynamic Task Update
− Considering continuous data flow execution, pausing, updating task logic, and resuming the workflow in place is costly, since in-flight data must be stored
− A nicer approach is an update tracer event that updates task logic without pausing the workflow
• Dynamic Data Flow Updates
− Depending on the requirements, the structure of a data flow may change: tasks can be added or removed
− A similar update tracer can be used to update the edge properties rather than the task properties
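The update-tracer idea above can be sketched as a special event that travels through the data stream in order and swaps a task's logic in place when it arrives, so the workflow never pauses and no in-flight data has to be buffered. The `UpdateTracer` class and the task loop are illustrative assumptions about how such an event might be delivered.

```python
# A sketch of an update tracer: items before the tracer are processed
# with the old logic, items after it with the new logic, with no pause.

class UpdateTracer:
    def __init__(self, new_logic):
        self.new_logic = new_logic

def run_task(stream, logic):
    out = []
    for item in stream:
        if isinstance(item, UpdateTracer):
            logic = item.new_logic   # swap logic in place; keep streaming
            continue
        out.append(logic(item))
    return out

stream = [1, 2, UpdateTracer(lambda x: x * 10), 3]
print(run_task(stream, lambda x: x + 1))  # [2, 3, 30]
```

Delivering the update as an in-band event also gives a natural ordering guarantee: every tuple ahead of the tracer sees the old logic, every tuple behind it sees the new one. An analogous tracer carrying new edge properties instead of new task logic would cover the data-flow updates described above.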
Design Paradigms of Floe (contd.)
Floe Architecture
Smart Grid Streaming Pipeline
Use Case