
Generating example tuples for

Data-Flow programs in Apache Flink

Master Thesis

by

Amit Pawar

Submitted to Faculty IV, Electrical Engineering and Computer Science

Database Systems and Information Management Group

in partial fulfillment of the requirements for the degree of

Master of Science in Computer Science

as part of the ERASMUS MUNDUS programme IT4BI

at the

Technische Universität Berlin

July 31, 2015

Thesis Advisors:

Johannes Kirschnick

Thesis Supervisor:

Prof. Dr. Volker Markl


Eidesstattliche Erklärung

Ich erkläre an Eides statt, dass ich die vorliegende Arbeit selbstständig verfasst, andere als die angegebenen Quellen/Hilfsmittel nicht benutzt, und die den benutzten Quellen wörtlich und inhaltlich entnommenen Stellen als solche kenntlich gemacht habe.

Statutory Declaration

I declare that I have authored this thesis independently, that I have not used other

than the declared sources/resources, and that I have explicitly marked all material

which has been quoted either literally or by content from the used sources.

Berlin, July 31, 2015

Amit Pawar


GENERATING EXAMPLE TUPLES FOR DATA-FLOW

PROGRAMS IN APACHE FLINK

by Amit Pawar

Database Systems and Information Management Group

Electrical Engineering and Computer Science

Master's in Information Technology for Business Intelligence

Abstract

Dataflow programming is a programming paradigm in which computational logic is modeled as a directed graph from the input data sources to the output sink. The intermediate nodes between sources and sink act as processing units that define what action is to be performed on the incoming data. Due to its inherent support

for concurrency, dataflow programming is a natural choice for many data-intensive

parallel processing systems and is being used extensively in the current big-data

market. Among the wide range of parallel processing platforms available, Apache

Hadoop (with MapReduce framework), Apache Pig (runs on top of MapReduce in

Hadoop ecosystem), Apache Flink and Apache Spark (with their own runtime and

optimizer) are some of the examples that leverage dataflow programming style.

Dataflow programs can handle terabytes of data and perform efficiently, but data at such a scale introduces difficulties such as understanding the complete dataflow (what is the output of the dataflow or of any intermediate node), debugging (it is impractical to track large-scale data throughout the program using breakpoints or watches), and visual representation (it is quite difficult to display terabytes of data flowing through the tree of nodes).

This thesis aims to address these limitations for dataflow programs on the Apache Flink platform using the concept of Example Generation, a technique to generate sample example tuples after each intermediate operation from source to sink. This

allows the user to view and validate the behavior of the underlying operators and

thus the overall dataflow. We implement the example generator algorithm for a

defined set of operators, and evaluate the quality of generated examples. For ease

of visual representation of the dataflow, we integrate this implementation with the

Interactive Scala Shell available in Apache Flink.


Acknowledgements

I would like to express my deep sense of gratitude to my advisor Johannes Kirschnick

for his excellent guidance, scientific advice and constant encouragement through-

out this thesis work. His apt assistance helped me understand and tackle many

challenges during the research, implementation and writing of this thesis.

I would like to thank Prof. Dr. Volker Markl and Dr. Ralf-Detlef Kutsche for

their kind co-ordination efforts. I would also like to thank Nikolaas Steenbergen,

Stephan Ewen and all the members of Flink dev user group for their kind advice

and clarifications on understanding the platform.

I express my sincere thanks to all the staff and professors of the IT4BI programme for their help and support during my Master's degree.

Finally, I would like to thank all my friends from the IT4BI programme, generations I and II, for their constant support and cherished memories over the past two years.


Contents

Abstract

Acknowledgements

List of Figures

List of Tables

1 Introduction
  1.1 Motivation
  1.2 Goal
  1.3 Outline

2 Dataflow Programming
  2.1 Apache Hadoop
  2.2 Apache Pig
  2.3 Apache Flink
  2.4 Comparison of Big-Data Analytics Systems
  2.5 Limitations of Dataflow Programming

3 Example Generation
  3.1 Definition
  3.2 Dataflow Concepts
  3.3 Example Generation Properties
  3.4 Example Generation Techniques
    3.4.1 Downstream Propagation
    3.4.2 Upstream Propagation
    3.4.3 Example Generator Algorithm
  3.5 Example Generation in Apache Flink

4 Implementation
  4.1 Operator Tree
    4.1.1 Single Operator
    4.1.2 Constructing the Operator Tree
  4.2 Supported Operators
  4.3 Equivalence Class Model
  4.4 Algorithm
    4.4.1 Overview
    4.4.2 Downstream Pass
    4.4.3 Upstream Pass
    4.4.4 Pruning Pass
  4.5 Flink Add-ons
    4.5.1 Using Semantic Properties
    4.5.2 Interactive Scala Shell

5 Evaluation
  5.1 Performance Metrics
    5.1.1 Completeness
    5.1.2 Conciseness
    5.1.3 Realism
  5.2 Experiments
    5.2.1 Setup and Datasets
    5.2.2 Quality of Generated Examples
    5.2.3 Running Time

6 Related Work

7 Conclusion
  7.1 Discussion and Future Work

A Appendix
  A.1 Word-Count Example in different Dataflow Systems
    A.1.1 Apache Pig
    A.1.2 Apache Hadoop
    A.1.3 Apache Flink
  A.2 Sample Scala Script for Flink's Interactive Scala Shell
  A.3 Experiment's Dataflow Programs
  A.4 Flink-Illustrator Source Code

Bibliography

List of Figures

3.1 Sample Dataflow that returns highly populated countries
3.2 Sample Dataflow with output examples
3.3 Dataflow Concepts
3.4 Ideal set of output for sample dataflow
3.5 Incomplete set of examples

4.1 Single Operator Class Diagram
4.2 Tree Construction Approaches
4.3 Sample Dataflow to retrieve users visiting high page rank websites
4.4 Downstream pass on Sample Dataflow
4.5 Empty Equivalence class as a result of only Downstream pass
4.6 Upstream Pass: Synthetic records introduction
4.7 Upstream Pass: Synthetic records converted to Real records
4.8 Pruning Pass: Start
4.9 Pruning Pass: After pruning sink operator
4.10 Pruning Pass: Complete pruning (Example 1)
4.11 Pruning Pass: Complete pruning (Example 2)
4.12 Forwarded Fields Annotations

5.1 Completeness of Filter operator
5.2 Realism of Filter operator
5.3 Evaluation of Programs 1 to 6 and 13
5.4 Evaluation of Programs 7 to 9
5.5 Evaluation of Programs 10 to 12
5.6 Flink Illustrator Runtime comparison

List of Tables

2.1 Comparison of Big-Data systems in context of Dataflow Programming
4.1 Description of SingleOperator properties
5.1 Locally created DataSet used for evaluation purpose
5.2 Experiment Dataset's description and sizes


Chapter 1

Introduction

Dataflow programming has gained popularity with the booming big data market

over the past decade. It is a data processing paradigm that makes data movement the focal point, in contrast to traditional programming (object-oriented, imperative, or procedural), where passing control from one object to another forms the basis of the respective programming model. Distributed data processing frameworks leverage dataflow programming by splitting a large dataset and distributing it across different computing nodes, where each node performs the dataflow operations on its local data. A single dataflow program can be seen as a directed

graph from the source nodes to the sink nodes, where source nodes represent the

input datasets consumed and sink nodes represent output datasets generated by

executing the dataflow program. All the intermediate nodes between source and sink act as processing units that define what action is to be performed on the

incoming data. These actions can be categorized as either: i. General relational

algebra operators (e.g., join, cross, project, distinct, etc.) or, ii. User-defined

function operators (e.g., flatmap, map, reduce, etc.). Apache Hadoop [1] with its

MapReduce [2] framework uses dataflow programming style. Apache Flink [3] is a

distributed streaming dataflow engine that falls into a newer category of big data

framework, an alternative to Hadoop’s MapReduce, along with Apache Spark [4].

Other dataflow programming systems include Apache Pig [5], Aurora [6], Dryad

[7], River [8], Tioga [9] and CIEL [10].

This thesis is based on dataflow programs from Apache Flink, where we generate

sample examples after each intermediate node (operator) from source to sink,

allowing the Flink user to:


1. View and validate the behavior of the underlying set of operators and thus

understand and learn the complete dataflow

2. Optimize the dataflow by determining the correct set of operators to achieve

the final output

3. Monitor iterations in case of an iterative KDDM (Knowledge discovery and

Data Mining) algorithm

4. Understand the behavior of user-defined function (UDF) operators (otherwise seen as black boxes)

1.1 Motivation

Dataflow programming, like any other programming paradigm, is an iterative, incremental process. To arrive at a final, correct version of a dataflow, the user may need more than one iteration. Each iteration consists of coding the dataflow, building and executing it on the respective system, and finally analyzing the output or the error log to determine whether the dataflow produced the expected outcome; if not, the user revises the program (normally via debugging) and repeats with a new iteration. Any programming paradigm incurs long execution and testing times when dealing with large-scale datasets, and the same applies to dataflow programming. The iterative development model is therefore time consuming and inefficient when handling large-scale data. The whole process can be made more efficient if the user can verify the underlying execution of the dataflow, i.e., verify the execution at each node (operator). This can be done by inspecting the dataset consumed and generated at any given operator, which in turn allows the user to pinpoint and rectify faulty logic or errors (if any). Visualizing the dataflow with an example dataset after each operator lets the user test the assumptions made in the program and, in a way, removes the need for debugging with breakpoints and watches. The overall process of generating examples after each operator permits the user to learn the logic of the individual operators as well as the complete dataflow program.

A dataflow program is a tree of operators from source to sink, where one or more sources (leaves) converge into a single sink (root) via intermediate nodes.


To obtain optimal performance from the program, choosing the appropriate operator type is of utmost importance. For example, the Join operator in Apache Flink can be executed in multiple ways, such as Repartition or Broadcast, depending on the input size and order. With these options, the user can select the best operator to optimize overall performance. Similarly, the user can decide to replace Join, or any costly operator, with a transformation, aggregation, or other cheaper operator as long as the final objective is accomplished. Dataflow programming can thus be seen as a plug-and-play model, where the user can plug or unplug (add or remove) operators and then play (execute) the complete dataflow to decide on the most suitable set of operators for the final version of the program. In such a scenario, having a concise set of examples (instead of large datasets) that completely illustrates the dataflow is of great help to the user, as it avoids repeated, costly, and time-consuming executions of the program after each plugging or unplugging.

Dataflow programming frameworks such as Flink and Spark are well suited for implementing machine-learning algorithms for Knowledge Discovery and Data Mining (KDDM). KDDM is itself an iterative process, in which a model (classification, clustering, etc.) is repeatedly trained for better prediction accuracy; for example, the k-means clustering algorithm, an iterative refinement process, is used in many data mining and related applications. In k-means, given k clusters and n observations, the observations are iteratively assigned to the most appropriate cluster based on the computation of the cluster mean. This problem is considered NP-hard, so training such predictive models can be a cumbersome process in which constant fine-tuning (via re-sampling, dataset feature changes, or mathematical re-computation) is needed. The overall modeling process in a dataflow environment can be complicated [11] and is pictured as a black box, because the user has little insight into how a newly tuned dataflow might behave and has to wait through the entire execution before being able to verify the output and take the necessary action. Sample examples that quickly demonstrate the dataflow are of great help to the user, as they open up the black box and provide a quick, efficient way to fine-tune the dataflow and, in turn, the predictive model.

Among the operators available in dataflow programming, UDF operators pose a readability problem for a new user (one who is new to the system or the code), as it can be difficult to guess the purpose of a given UDF. It could be a grouping function, a transformation function, or a partition function; to interpret a UDF operator, one needs to investigate at the code level, which can be a tedious task for a new user. If instead the user has access to the input consumed and the output generated by the UDF operator, it is easier to reason about the behavior of that operator.

The idea of presenting examples after each operator has been realized and is used for testing and diagnostic purposes in Apache Pig. Apache Pig, a high-level dataflow language and execution framework for parallel computation, runs on top of the MapReduce framework in the Hadoop ecosystem. Pig features the Pig Latin language layer, which simplifies MapReduce programming through its set of operators. One of its diagnostic operators is ILLUSTRATE¹, which displays sample examples after each statement in a step-by-step execution of a sequence of statements (a dataflow program), where each statement represents an operator. It allows the user to review how data is transformed through a sequence of Pig Latin statements forming a dataflow. This feature is part of Pig's testing and diagnostics package, as it lets the user test dataflows on small datasets and get faster turnaround times. In this thesis, we enhance the understanding of large-scale dataflow program execution by introducing a new feature in Apache Flink that provides the user with sample examples at the operator level, similar to the ILLUSTRATE function in Apache Pig.

This thesis work, like Apache Pig, is based on an example generator [12]. The algorithm used in the example generator works by retrieving a small sample of the input data and then propagating this data through the dataflow, i.e., through the operator tree. However, some operators, such as JOIN and FILTER, can eliminate these sample examples from the dataflow. For example, consider an operator joining two input datasets A(x, y) and B(x, z) on the common attribute x (the join key): if both A and B contain many distinct values for x, initial sampling of A and B may yield unmatched x values, so the join may produce no output example because no common join key appears in both samples. Similarly, a filter operator executed on a sample dataset might produce an empty result if no input example satisfies the filtering predicate. To address such issues, the algorithm in [12] generates synthetic example data, which allows the user to examine the complete semantics of the given dataflow program.

¹ http://pig.apache.org/docs/r0.15.0/test.html#illustrate
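The join scenario above can be sketched in a few lines of Python (an illustrative sketch only: the datasets `A` and `B` and the helper `join_on_x` are hypothetical, not the thesis implementation or Flink code):

```python
import random

# Hypothetical datasets A(x, y) and B(x, z) with many distinct join keys x.
A = [(x, f"y{x}") for x in range(1000)]
B = [(x, f"z{x}") for x in range(1000)]

def join_on_x(a, b):
    """Naive equi-join on the common attribute x (the first field)."""
    return [(x, y, z) for x, y in a for xb, z in b if x == xb]

# Downstream-only approach: sample each input independently.
random.seed(0)
sample_a = random.sample(A, 3)
sample_b = random.sample(B, 3)
# With 1000 distinct keys and only 3 samples per side, the two samples
# almost certainly share no key, so the join yields no output example.
print(join_on_x(sample_a, sample_b))  # typically []

# The remedy described in [12]: synthesize a record on the B side that
# carries a sampled key from A, so the join produces at least one example.
x0, y0 = sample_a[0]
sample_b.append((x0, "z-synthetic"))  # synthetic record, marked as such
print(join_on_x(sample_a, sample_b))  # contains (x0, y0, "z-synthetic")
```

The same reasoning applies to a filter on a sample: if no sampled record satisfies the predicate, a synthetic record that does satisfy it is introduced.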


1.2 Goal

This thesis concentrates on dataflow programs in Apache Flink. Our main goal is to implement an ILLUSTRATE-like example generator feature in Flink that helps users tackle the problems faced in dataflow programming in a big-data environment. As this thesis is oriented towards the implementation of a new feature in Flink, we try to answer the following questions:

1. What are the challenges in implementing the example generator, and how can they be tackled?

2. How can the concept of an example generator be realized in Flink? What inputs are required for the implementation, and how can they be obtained from Flink?

3. What extra features of Flink can be exploited to differentiate this implementation from others?

4. How well does the implemented algorithm perform with respect to the defined metrics?

1.3 Outline

The outline lays out the approach we followed for this thesis, which subsequently helped us answer the questions raised in the previous section.

• A comprehensive study of dataflow programming and the systems that take advantage of this paradigm (Chapter 2)

• A broad introduction to the example generation problem, its challenges, and the approach towards a solution (Chapter 3)

• Implementation of the example generator algorithm in Flink, with an extensive description of the input construction, the various algorithm steps, and the Flink features used (Chapter 4)


• Experiments evaluating the performance of the implemented algorithm against the defined metrics, executing the implemented code on different dataflow programs and diverse datasets, covering the operators as well as the stages of the algorithm (Chapter 5)

• A brief overview of related work in the field of example generation (Chapter 6)

• A conclusion discussing the results and findings (Chapter 7)


Chapter 2

Dataflow Programming

Dataflow programming introduces a programming paradigm that internally represents applications as a directed graph, similar to a dataflow diagram [13]. A program is represented as a set of nodes (also called blocks) with input and/or output ports. These nodes act as sources, sinks, or processing blocks for the data flowing through the system. Nodes are connected by directed edges that define the flow of data between them. This is reminiscent of the Pipes and Filters architectural model, where Filters are the processing nodes and Pipes serve as passages for the data streams between the filters.

One of the main reasons dataflow programming surged with the big-data hype is its inherent support for concurrency, which allows increased parallelism. In a dataflow program, each node is internally an independent processing block, i.e., each node can proceed on its own once it has its respective input data. This kind of execution enables data streaming: an intermediate node in the dataflow starts working as soon as data arrives from the previous node and transfers its output to the next node, avoiding the need to wait for the previous node to finish its complete execution.
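The node-and-edge model with streaming can be illustrated with a minimal Python sketch built from generators (the node functions `source`, `mapper`, and `sink` are hypothetical names for illustration, not any framework's API); each node processes a record as soon as it arrives rather than waiting for the upstream node to finish:

```python
def source(records):
    """Source node: emits input records one at a time."""
    for r in records:
        yield r

def mapper(upstream, fn):
    """Processing node: applies fn to each record as soon as it
    arrives from the upstream node (no full materialization)."""
    for r in upstream:
        yield fn(r)

def sink(upstream):
    """Sink node: materializes the final output."""
    return list(upstream)

# A linear dataflow: source -> double -> increment -> sink.
pipeline = mapper(mapper(source([1, 2, 3]), lambda x: x * 2), lambda x: x + 1)
print(sink(pipeline))  # [3, 5, 7]
```

Pulling a single record from `pipeline` with `next()` drives it through every node in turn, mirroring how a streaming engine propagates one record end to end without buffering the whole dataset at any node.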

In a big-data environment, the aforementioned characteristics, parallelism and streaming, allow dataflow programs to run on distributed systems with computer clusters. Such a system splits a large dataset into smaller chunks and distributes them across cluster nodes; each node then executes the dataflow on its local chunk. The sub-results produced by all the nodes are combined to obtain the final result for the large dataset. This is the gist of how a dataflow program is executed on a big-data system with a distributed processing environment.

In this chapter, we discuss different big-data analytics systems that take advantage of the dataflow programming paradigm.

2.1 Apache Hadoop

Apache Hadoop [1] is a big-data framework that allows distributed processing of large-scale datasets across clusters of computers using simple programming models. Hadoop leverages dataflow programming via its MapReduce module. Hadoop MapReduce is a software framework for writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters in a reliable, fault-tolerant manner. A typical MapReduce program consists of a chain of mappers (with a Map method) and reducers (with a Reduce method), in that order. The Map and Reduce methods are second-order functions that consume an input dataset (either whole or part of a larger dataset) and a user-defined function (UDF) that is applied to the corresponding input data. In a MapReduce program, the mappers and reducers form the processing nodes of the dataflow, with sources and sinks (of respective formats) explicitly specified.
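The idea of Map and Reduce as second-order functions parameterized by a UDF can be sketched as follows (a simplified single-machine sketch for illustration, not the Hadoop API; the names `map_phase` and `reduce_phase` are assumptions):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, map_udf):
    """Second-order Map: applies the user-defined function to every
    record, collecting all emitted (key, value) pairs."""
    return [kv for r in records for kv in map_udf(r)]

def reduce_phase(pairs, reduce_udf):
    """Second-order Reduce: groups pairs by key (the shuffle/sort
    stage) and applies the UDF once per group."""
    pairs.sort(key=itemgetter(0))
    return [reduce_udf(k, [v for _, v in grp])
            for k, grp in groupby(pairs, key=itemgetter(0))]

# Word count: the mappers and reducers are the processing nodes.
lines = ["to be or not", "to be"]
tokenize = lambda line: [(w, 1) for w in line.split()]
count = lambda word, ones: (word, sum(ones))
print(reduce_phase(map_phase(lines, tokenize), count))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

Only the UDFs (`tokenize`, `count`) are written by the user; the framework supplies the second-order scaffolding, which is exactly why every other operation must be expressed in terms of map and reduce.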

Despite its ability to scale to very large datasets, previous studies have reported limitations of Hadoop MapReduce; the issues related to dataflow programming are as follows:

• Lack of support for dedicated relational algebra operators such as Join, Cross, Union, and Filter [14, 15]. These operators are frequently used in many iterative algorithms; e.g., Join is an important operator in the PageRank algorithm. This forces the user to custom-code the logic for these common operations and to think in terms of map and reduce to implement them.

• Lack of inherent support for iterative programming [16], an integral feature of any dataflow programming framework and one useful for many data analytics algorithms, e.g., the k-means algorithm, an iterative clustering process that finds k clusters. The workaround for iteration in MapReduce is an external program that repeatedly invokes the mappers and reducers, but this adds the overhead of increased I/O latency and serialization issues, with no explicit support for specifying a termination condition.
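The external-driver workaround can be sketched as a toy one-dimensional k-means in Python (illustrative only: `assign` and `recompute` stand in for submitted map and reduce jobs, not real Hadoop code):

```python
def assign(points, centers):
    """'Map' job: assign each point to its nearest center."""
    return [(min(centers, key=lambda c: abs(c - p)), p) for p in points]

def recompute(assignments):
    """'Reduce' job: recompute each center as the mean of its points."""
    groups = {}
    for c, p in assignments:
        groups.setdefault(c, []).append(p)
    return sorted(sum(ps) / len(ps) for ps in groups.values())

# External driver: MapReduce has no iteration operator, so a client-side
# loop resubmits the two jobs until the centers stop moving; the
# termination condition lives outside the framework, and each round
# would pay the full job-submission and I/O cost.
points = [1.0, 2.0, 9.0, 10.0]
centers = [1.0, 2.0]
for _ in range(10):
    new_centers = recompute(assign(points, centers))
    if new_centers == centers:
        break
    centers = new_centers
print(centers)  # [1.5, 9.5]
```

In a real deployment every pass of this loop rereads its input from disk, which is exactly the I/O and serialization overhead the bullet above describes.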

The first limitation, concerning relational operators, is addressed in the Hadoop ecosystem through another component, Apache Pig.

2.2 Apache Pig

Apache Pig is a high-level dataflow language and execution framework for parallel computation on Apache Hadoop. Apache Pig [5] was initially developed at Yahoo to let Hadoop users focus on analyzing datasets rather than investing time in writing complex code with the map and reduce operators. Internally, Pig is an abstraction over MapReduce: all Pig scripts are compiled into Map and Reduce tasks. In this way, Pig makes programming MapReduce applications easier. The platform's language is a simple scripting language called Pig Latin [17], which abstracts from the Java MapReduce idiom into a form similar to SQL. Pig Latin allows the user to write a dataflow that describes how the data will be transformed (via aggregation, join, or sort) as well as to develop their own functions (UDFs) for reading, processing, and writing data.

A typical Pig program consists of the following steps:

1. LOAD the data for manipulation.

2. Run the data through a set of transformations. These transformations can be either relational algebra transformations (join, cross, filter, etc.) or user-defined functions. (All transformations are internally translated into Map and Reduce tasks by the compiler.)

3. DUMP (display) the data to the screen or STORE the results in a file.

When relating a Pig program to a dataflow, the LOAD operators are the source nodes, the set of transformations forms the processing nodes, and DUMP or STORE are the sink nodes.


Pig addresses the limitations of MapReduce by providing a suite of relational operators for ease of data manipulation [5]. However, to achieve iteration as well as other control-flow structures (if-else, for, while, etc.), one needs to use Embedded Pig, where Pig Latin statements and Pig commands are nested inside scripting languages such as Python, JavaScript, or Groovy. This reduces the simplicity of programming in Pig (one of its major selling points), as it introduces a JDBC-like compile, bind, run model that adds the extra overhead of complex invocations. Because Pig translates all statements and commands from its scripts into MapReduce tasks, it is considered slower than well-written MapReduce code². Although Pig offers better scope for optimization than MapReduce, an optimized Pig script can perform on par with MapReduce code.

2.3 Apache Flink

Apache Flink (formerly known as Stratosphere) [3] is a data processing system and an alternative to Hadoop's MapReduce module. Unlike Pig, which runs on top of MapReduce, Flink comes with its own runtime: a distributed streaming dataflow engine that provides data distribution and communication for distributed computations over data streams. It features powerful programming abstractions in multiple languages, Java and Scala, thus giving the user different language options to program a dataflow. Flink also supports automatic program optimization, allowing the user to focus on other data handling issues. It has native support for iterative programming via iteration operators such as bulk iterations and incremental iterations. It also supports programs consisting of large directed acyclic graphs (DAGs) of operations.

One of the essential components in the Flink framework is the Flink optimizer [18], which provides automatic optimization for a given Flink job and offers techniques to minimize the amount of data shuffling, thus formulating an optimized data processing pipeline. The Flink optimizer is based on the PACT (Parallelization Contract) programming model [19], which extends the concepts of MapReduce but is also applicable to more complex operations. This allows Flink to support relational operators such as Join, Cross, Union, etc. The output of the Flink optimizer is a compiled and optimized PACT program, which is nothing but

¹ https://cwiki.apache.org/confluence/display/PIG/PigMix


a DAG-based dataflow program. This is how dataflow programming is achieved

in Apache Flink.

A typical Flink program consists of the same basic steps:

1. Load/Create the initial data.

2. Specify transformations of this data.

3. Specify where to put the results of your computations, and

4. Trigger the program execution.

Step 1 defines the source nodes of the dataflow program, the transformations in Step 2 form the processing nodes, and Step 3 defines the sink nodes.

Flink's runtime natively supports iterative programming [20], a feature lacking in MapReduce and not easily accessible in Pig, through its two types of iteration operators: Bulk and Delta. These operators encapsulate a part of the program and execute it repeatedly, feeding the result of one iteration back into the next. They also allow the user to explicitly declare a termination criterion. Such an iterative processing system makes the Flink framework extremely fast for data-intensive and iterative jobs compared to Hadoop's MapReduce and Apache Pig.

2.4 Comparison of Big-Data Analytics Systems

The implementation of a sample dataflow program, Word Count (which reads text from files or a string variable and counts how often each word occurs), is presented in the Appendix for each framework, Hadoop MapReduce (A.1.2), Pig (A.1.1), and Flink (A.1.3), in order to observe the respective programming techniques and to get an idea of the effort needed to code the same program.

As shown in Table 2.1, Apache Flink has a clear advantage over the other two systems, with its faster processing times and native support for relational operators as well as iterative programming. This makes Flink a natural choice when implementing a dataflow program for a large-scale dataset.


|                           | Apache Hadoop (MR)                                             | Apache Pig                 | Apache Flink                      |
|---------------------------|----------------------------------------------------------------|----------------------------|-----------------------------------|
| Framework                 | MapReduce                                                      | MapReduce                  | Flink optimizer and Flink runtime |
| Language supported        | Java, C++, etc.                                                | Pig Latin                  | Java, Scala, Python (beta)        |
| Dataflow processing nodes | Mappers and Reducers                                           | Suite of operators, UDFs   | Suite of operators                |
| Ease of programming       | Simple, but tricky when implementing join or similar operators | Simple                     | Simple                            |
| Relational operators      | Not natively supported, implemented via MR                     | Natively supported         | Natively supported                |
| Iterative programming     | Not natively supported                                         | Supported via Embedded Pig | Natively supported                |
| Processing time           | Fast                                                           | Slow                       | Fastest                           |

Table 2.1: Comparison of Big-Data systems in the context of dataflow programming

2.5 Limitations of Dataflow Programming

As discussed in [13], the major limitations of the dataflow programming paradigm are visual representation and debugging. In a big-data environment, with large-scale data flowing, these limitations are even more severe [21]. It is quite impractical to visually represent terabytes of data flowing through the tree of operators and to denote what action is being performed on that data at each operator.

To track an error in a program, the user often introduces breakpoints and watches to monitor the flow of execution along with changes in variable values or output data. However, this approach of monitoring is futile when it comes to data of such vast size.

This thesis work tries to address these limitations in the Apache Flink environment. For a Flink dataflow program, we generate a concise set of example data at each node in the operator tree, such that the user can validate the behavior of each operator as well as of the complete dataflow. This eliminates the need for debugging (via breakpoints, watches, etc.) to an extent, as the user can diagnose an error (after observing the flow of sample examples through the dataflow) by locating the problematic operator in the tree and rectifying the logic at that very operator.

Visual representation of a dataflow is available in Flink via its Web-Client interface, the Flink job submission interface, though the flow of data is not part of this representation. This thesis work integrates with Flink's Interactive Scala Shell, a new feature in Flink, to display the set of examples for the dataflow


program executed via the interactive shell. This makes the generated examples visually accessible to users.

This is how we address the limitations of the dataflow programming paradigm in a big-data system, Apache Flink. In the next chapter, we define the example generation problem and explain different approaches to generating a concise set of examples that allows the user to reason about the complete dataflow.


Chapter 3

Example Generation

In this chapter, we describe the problem of generating example records for dataflow programs and mention the existing approaches, followed by the challenges faced and their drawbacks. We also describe the theoretical terms and concepts used throughout this thesis, with a brief introduction to the algorithm that improves upon the drawbacks of the existing approaches.

3.1 Definition

A dataflow program is a directed graph G = (V, E), where V is the set of nodes denoting operators (sources, data transformation/processing nodes, sinks) and E is the set of directed edges denoting the flow of data from one operator to the next. Example generation is the process of producing a concise set of examples after each operator in the dataflow, such that it allows a user to understand the complete semantics of the dataflow program. Let us demonstrate this concept using an input dataflow example that returns a list of highly populated countries.

Figure 3.1: Dataflow that returns highly populated countries



Figure 3.1 is a dataflow that LOADs two datasets, Countries (ISO Code, Name) and Cities (Name, Country, Population¹), and performs a JOIN on the attributes Name/Country (Countries/Cities). The joined result is then grouped by country, and an aggregation (SUM) is performed on the population. Finally, we filter out the countries with a population less than or equal to 4 million and present the remaining list as output.

Example generation, when applied to the dataflow from Figure 3.1, gives us the output shown in Figure 3.2, where we have sample examples after each operator to facilitate the user's understanding of the operators as well as of the complete dataflow.
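The semantics of this dataflow can be sketched in a few lines of plain Python. The country and city records below are invented for illustration (populations in millions), and the operator chain is deliberately reduced to list and dict operations:

```python
# Toy version of the dataflow in Figure 3.1 (values invented; populations in millions).
countries = [("DE", "Germany"), ("NO", "Norway")]        # (ISO code, name)
cities = [("Berlin", "Germany", 4),                      # (name, country, population)
          ("Hamburg", "Germany", 2),
          ("Oslo", "Norway", 1)]

# JOIN on Countries.Name == Cities.Country
joined = [(iso, cname, city, pop)
          for (iso, cname) in countries
          for (city, country, pop) in cities
          if cname == country]

# GROUP by country and SUM the population
totals = {}
for (_iso, cname, _city, pop) in joined:
    totals[cname] = totals.get(cname, 0) + pop

# FILTER out countries with population <= 4 million
result = [(c, p) for (c, p) in totals.items() if p > 4]
print(result)  # [('Germany', 6)]
```

Example generation would attach sample records such as these after each of the four stages, not just the final one.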

Figure 3.2: Sample output of Example Generation on dataflow from 3.1

This allows the user to verify the behavior of each operator just by looking at the output sample, e.g., verifying whether the aggregation has the intended effect or whether the filter behaves correctly. Instead of cross-checking the whole logic, the user can now target only the faulty operators.

Figure 3.1 and Figure 3.2 are respectively considered the input and the output of any example generation algorithm.

Let us now briefly explain the terms used throughout this thesis with respect to the dataflow example from Figure 3.1.

¹ In millions


3.2 Dataflow Concepts

Source: The Load operators are the sources in the above dataflow, as they read the data from the input files/tables/collections.

Sink: Operators that produce (by storing in a file or displaying) the final result are the sinks. Filter is the sink in the above dataflow.

Downstream pass: Starting from Sources we move in the direction toward Sink,

i.e., from Loads to Filter.

Upstream pass: Starting from Sink we move in the direction toward Sources,

i.e., from Filter to Loads.

Downstream operator/neighbor: If Operator1 consumes the output of Operator2, then Operator1 is the downstream neighbor of Operator2. Join is the downstream operator of both Loads; similarly, Group is the downstream neighbor of Join, and Filter of Group.

Upstream operator/neighbor: If Operator1 consumes the output of Operator2, then Operator2 is the upstream neighbor of Operator1. Group is the upstream neighbor of Filter; similarly, Join is of Group, and the Loads are of Join.

Operator tree: The complete chain of operators from the sources to the sink makes up the operator tree. The operator tree is the representation of the given dataflow program in terms of the operators used.

All these concepts can be visualized as shown in Figure 3.3.

Figure 3.3: Concepts related to a dataflow program
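These concepts map naturally onto a small adjacency structure. The sketch below (the helper names are ours, not the thesis implementation) encodes the dataflow of Figure 3.1 as a list of edges and derives the upstream/downstream neighbors as well as the sources and sink:

```python
# Edges point in the direction of data flow (upstream -> downstream).
edges = [("Load1", "Join"), ("Load2", "Join"),
         ("Join", "Group"), ("Group", "Filter")]

def downstream(op):
    """Operators that consume op's output."""
    return [b for (a, b) in edges if a == op]

def upstream(op):
    """Operators whose output op consumes."""
    return [a for (a, b) in edges if b == op]

# Sources have no upstream neighbor; sinks have no downstream neighbor.
sources = sorted(op for op in {a for a, _ in edges} if not upstream(op))
sinks = [op for op in {b for _, b in edges} if not downstream(op)]

print(downstream("Join"), upstream("Join"))  # ['Group'] ['Load1', 'Load2']
print(sources, sinks)                        # ['Load1', 'Load2'] ['Filter']
```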


3.3 Example Generation Properties

Let us now briefly explain the properties that define a good example generation

technique.

Completeness

The examples generated at each operator in the dataflow should collectively be able to illustrate the semantics of that operator. For example, in the dataflow output of Figure 3.2, the examples before and after the Filter operator clearly explain the semantics of filtering by population, as examples with low population are not propagated to the output. The same can be said of the Group operator: the examples illustrate the meaning of grouping countries and aggregating (via sum) the population. Completeness is the most important property, as it allows the user to verify the behavior of each and every operator in the dataflow.

Conciseness

The set of examples generated after each operator should be as small as possible, so as to minimize the effort of examining the data in order to verify the behavior of that operator. The output presented in Figure 3.2 cannot be considered concise, as the behavior of the operators could be explained with a smaller set of examples, as shown in Figure 3.4.

Realism

An example is considered real if it is present in the respective source file/table/collection. Thus, the set of examples generated by any of the operators must be a subset of the examples in the sources in order to be considered real. In the dataflow output shown in Figure 3.2, assume that all the examples generated by the Load operators come from the respective source files; this means all the examples generated in the output of Figure 3.2 are real.

In Figure 3.4, we have altered the output of the dataflow (Figure 3.2) to make it more concise and thus ideal with respect to the properties defined.


Figure 3.4: Ideal set of examples satisfying all properties

3.4 Example Generation Techniques

In this section, we discuss the techniques that can be considered for example generation, the problems related to each approach, and how these problems can be overcome.

3.4.1 Downstream Propagation

The simplest way to generate examples at each operator in the dataflow is to sample a few examples from the source files and push them through the operator tree, executing each operator and recording the results after each execution. The problem with this approach is that the sampled examples do not always allow all operators in the operator tree to achieve total completeness. For example, considering our dataflow from Figure 3.1, the initial sampling might produce examples that cannot be joined, resulting in a lack of completeness (as shown in Figure 3.5). This can be averted by increasing the sample size and pushing as many examples as possible through the operator tree [22], but that violates the property of conciseness. As we seek an approach that generates a complete and concise set of examples, relying only on downstream propagation cannot be the choice for our implementation.
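The failure mode is easy to reproduce: if the sampled tuples from the two sources happen to share no join key, the join's output is empty and its semantics cannot be illustrated. A minimal sketch, with invented sample records:

```python
# Two tiny samples that happen to share no join key (data invented).
sample_countries = [("DE", "Germany")]   # sampled from the Countries source
sample_cities = [("Oslo", "Norway", 1)]  # sampled from the Cities source

# Downstream pass: execute the JOIN on the samples.
joined = [(c, t) for c in sample_countries for t in sample_cities
          if c[1] == t[1]]

# The join produced nothing, so the JOIN operator (and everything
# downstream of it) has no examples -- the illustration is incomplete.
print(joined)  # []
```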


Figure 3.5: Incompleteness is often the case in downstream propagation

3.4.2 Upstream Propagation

The second approach that can be considered for example generation is to move from the sink to the sources, generating the output examples based on the characteristics of each operator (joined examples, crossed examples, or filtered examples). Based on these outputs, we generate the inputs and propagate upstream recursively until we reach the sources. This approach works fine as long as we are aware of the operator behavior (join, union, cross, etc.), but it fails when a UDF operator is part of the operator tree. Since a UDF is a black box whose behavior is difficult to predict beforehand, this approach fails to generate examples for it, resulting in incompleteness. The only way to generate examples for such UDF operators is to push data through them in the downstream direction.

Hence, neither downstream nor upstream propagation alone is sufficient to generate an ideal set of examples satisfying both completeness and conciseness. Nonetheless, a combination of both passes, together with pruning of redundant examples, can lead to a better set of results.

3.4.3 Example Generator Algorithm

The algorithm with the steps (in this order)

1. Downstream Pass, 2. Pruning, 3. Upstream Pass, 4. Pruning


was proposed in [12] and was observed to be more efficient (in generating a complete and concise set of examples) than the downstream-only and upstream-only approaches.

To ensure the completeness of all the operators in an operator tree, [12] introduces

the Equivalence Class Model.

Equivalence Class Model

For a given operator O, a set of equivalence classes ε_O = {E1, E2, ..., Em} is defined such that each class denotes one aspect of the operator's semantics. For example, a FILTER operator's equivalence set is denoted by ε_filter = {E1, E2}, where E1 denotes the set of examples passing the filtering predicate and E2 denotes the set of examples failing it. In Section 4.3 we define equivalence classes for the set of supported operators.
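For instance, completeness of a FILTER operator can be checked by testing whether each of its two equivalence classes is witnessed by at least one example. A sketch, where the class and record representations are our own simplification:

```python
def filter_equivalence_classes(records, predicate):
    """E1: records passing the predicate; E2: records failing it."""
    return {"E1": [r for r in records if predicate(r)],
            "E2": [r for r in records if not predicate(r)]}

def is_complete(classes):
    # An operator is completely illustrated only if every
    # equivalence class contains at least one example.
    return all(len(examples) > 0 for examples in classes.values())

pred = lambda population: population > 4
print(is_complete(filter_equivalence_classes([6, 1], pred)))  # True
print(is_complete(filter_equivalence_classes([6, 7], pred)))  # False (E2 is empty)
```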

Following is the pseudo-code of the algorithm [12] used in this thesis for example

generation.

Algorithm 1 Example Generator

Step 1: Use the downstream pass to generate possible sample examples at each operator in the operator tree.

Step 2: Using the equivalence class model, check for operator completeness; for any incompleteness identified, generate sample examples using the upstream pass.

Step 3: Use the pruning pass to remove redundant examples from the operators.
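The three passes can be sketched end to end for the smallest interesting pipeline, LOAD followed by FILTER. The function names, the synthesis callback, and the record values are all our own invention; the real implementation works on operator trees rather than a single operator:

```python
# Minimal end-to-end sketch of the three passes for a LOAD -> FILTER pipeline.
def generate_examples(source_records, predicate, synthesize):
    # 1. Downstream pass: sample a few records from the source.
    examples = list(source_records[:2])

    # 2. Upstream pass: if an equivalence class of FILTER is empty
    #    (no passing or no failing record), synthesize a witness for it.
    if not any(predicate(r) for r in examples):
        examples.append(synthesize(True))
    if not any(not predicate(r) for r in examples):
        examples.append(synthesize(False))

    # 3. Pruning pass: keep one witness per equivalence class.
    passing = next(r for r in examples if predicate(r))
    failing = next(r for r in examples if not predicate(r))
    pruned = [passing, failing]
    return pruned, [r for r in pruned if predicate(r)]

pop_gt_4 = lambda p: p > 4
before, after = generate_examples([6, 9], pop_gt_4,
                                  lambda ok: 9 if ok else 1)
print(before, after)  # [6, 1] [6]
```

Note how the sampled records [6, 9] both pass the predicate, so the upstream pass synthesizes a failing record (1), and pruning then drops the redundant passing record (9).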

Let us now formally introduce the example generation problem with respect to

Apache Flink that we are addressing in this thesis.

3.5 Example Generation In Apache Flink

As discussed in Section 1.2, we implement the ILLUSTRATE² feature available in Apache Pig on the completely distinct computing platform of Apache Flink. This work is based on the paper [12], which describes and proves the best possible

² http://pig.apache.org/docs/r0.15.0/test.html#illustrate


way to generate a set of complete, concise, and real examples for a given dataflow program. In the implementation, we have considered a few notable differences between Pig and Flink and defined our requirements accordingly:

1. Flink is a multi-language framework (with support for Java, Scala, and Python), in contrast to Pig, which uses a textual scripting language called Pig Latin. In our implementation, we focus on Flink's Java jobs.

2. Pig allows ILLUSTRATE either on a complete dataflow script or after any given dataset (also known as a relation) within the script. We have adapted this to support only a complete Flink job, i.e., a dataflow from source to sink. We are of the opinion that illustrating the complete dataflow is more practical than illustrating after every dataset, because eventually all used datasets are displayed in a complete job illustration.

3. Pig's input is transformed into a relation; a relation is a collection of tuples and is similar to a table in a relational database. In Flink, the input can be of any format (txt, csv, hdfs, collection), so for our implementation we need to convert these inputs into tuple format before performing any transformation on them, as our intention is to find example "tuples".

These are the differences we observed and adapted accordingly in our implementation. To get an overall picture of the implementation, let us briefly explain the basic life cycle when the Flink illustrator, the module that generates sample examples, is invoked by a Flink job:

1. For a submitted job, an operator tree is created.

2. This operator tree forms the main input for our algorithm; the different passes then act on this input to generate complete, real, and concise examples.

3. The generated examples are then displayed to the user (via the console or the Scala shell).

After this overview of example generation concepts and the introduction to the algorithm and the equivalence class model, the next chapter provides a detailed explanation of the implementation with respect to Apache Flink: the process of


generating the input (the operator tree), followed by the list of supported Flink operators and their respective equivalence classes, and each algorithm pass. Finally, we mention the Flink features that have been integrated with the implementation, making it distinct and novel.


Chapter 4

Implementation

Having defined what example generation is and how it is useful in Apache Flink, this chapter gives an in-depth account of how this concept is realized in Flink. The topic is best treated under the following headings:

1. Operator Tree

2. Supported Operators

3. Equivalence Class

4. Algorithm

5. Flink add-ons

4.1 Operator Tree

The operator tree is a chain or list of single operators that starts at the input sources and ends at the output sink, with one or more data processing/transforming operators between the two. The operator tree is created once the submitted Flink job invokes the illustrator module. Before going into further detail on how we build an operator tree for a given job, let us define the SingleOperator object, which is the granular building block of an operator tree.



4.1.1 Single Operator

The SingleOperator object defines the different properties of the operator under consideration. There were two reasons for formalizing our own operator object rather than using the one available in Flink¹: (i) we were not going to use all the properties of the Flink operator class, as some of them (e.g., the degree of parallelism and compiler hints) add no value to our implementation; and (ii) a few things needed specifically for our implementation were missing from the available Flink operator class, e.g., having the join keys easily available for a given input, having the output of each operator at hand for further manipulation, and having details that verify that the operator produced an output for its set of inputs. Our SingleOperator class along with its properties is shown in Figure 4.1.

SingleOperator
    equivalenceClasses : List<EquivalenceClass>
    operator : Operator<?>
    parentOperators : List<SingleOperator>
    operatorOutputAsList : List<Object>
    syntheticRecord : Tuple
    semanticProperties : SemanticProperties
    JUCCondition : JUCCondition
    operatorName : String
    operatorType : OperatorType
    operatorOutputType : TypeInformation<?>

JUCCondition
    firstInput : int
    secondInput : int
    firstInputKeyColumns : int[ ]
    secondInputKeyColumns : int[ ]
    operatorType : OperatorType

Figure 4.1: The SingleOperator class

¹ https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/api/java/operators/Operator.html


| Property             | Description                                                                                                                                                      |
|----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| equivalenceClasses   | Defines the completeness of the output set                                                                                                                       |
| operator             | The Flink operator object                                                                                                                                        |
| parentOperators      | Set of input operators whose output is consumed by the current operator                                                                                          |
| operatorOutputAsList | Output examples produced by the current operator (stored as a List collection)                                                                                   |
| syntheticRecord      | Synthetic record constructed to ensure operator completeness                                                                                                     |
| semanticProperties   | Defines the transformations on tuple fields in the UDF operators                                                                                                 |
| JUCCondition         | An inner class that facilitates manipulation of details related to the Join, Union, and Cross operators, such as providing the input number and the assigned key columns |
| operatorName         | Name of the current operator                                                                                                                                     |
| operatorType         | The type of the operator (JOIN, LOAD, CROSS, etc.)                                                                                                               |
| operatorOutputType   | Defines the structure and datatypes of the output produced                                                                                                       |

Table 4.1: Description of SingleOperator properties

The properties of a given SingleOperator object are explained in Table 4.1.

4.1.2 Constructing the Operator Tree

Having defined what our custom operator class consists of, let us now look at two different ways to construct the list of these custom operators that constitutes our main input, the operator tree. For better understanding, consider the sample dataflow program shown in Figure 4.2, which takes two input Load operators, performs a Join on them, and Projects some result as output. In this scenario, Load1 and Load2 form our input sources and Project is our output sink.

Figure 4.2: Tree Construction Approaches

1. Top-Down Push Approach:

This is the approach used in our final implementation. In this method, we start from the source operators and then move downstream towards the sink output


operator. If we consider the dataflow program in Figure 4.2, we get two source operators, Load1 and Load2, in the order they are defined in the program. Suppose Load1 is defined first: we construct Load1's SingleOperator object and add it to the operator tree; we then seek the downstream operator of Load1 and thus reach the Join operator. As Join is a dual-input operator (similar to Cross and Union), we make sure both of its parents are added to the tree before adding the Join operator itself. This allows better handling of the outputs and execution of a dual-input operator. For example, if we add and execute in the order "Load1, Join, Project, Load2", Join will have input from only one parent operator, Load1, and the execution will halt. Hence, we make sure Load2's SingleOperator object is added to the tree before Join's. Using this technique, we continue moving downstream until the sink is reached, and thus obtain the complete operator tree for the given sample dataflow program. This method can be expressed as the following pseudo code (Algorithm 2):

Algorithm 2 Top-Down Push Method

Step 1: Let List<SingleOperator> be an operatorTree.

Step 2: Get all the input sources for the given dataflow program.

Step 3: Get the first input from the input sources.

Step 4: Construct a SingleOperator object for the received input and add it to the operatorTree.

Step 5: Get the downstream operator of the added input and check whether it is a single-input or a dual-input operator; if single-input, go to Step 6.1, if dual-input, go to Step 6.2.

Step 6.1: Construct a SingleOperator object for this downstream operator and add it to the operatorTree.

Step 6.2: Check whether both parents are present in the operatorTree. If present, construct a SingleOperator object for this downstream operator and add it to the operatorTree; if not, get the next input from the input sources and go to Step 4.

Step 7: Continue from Step 5 until the sink is reached, i.e., the added operator has no downstream operator.
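The steps above can be condensed into a short sketch. Operators are reduced to plain names and the SingleOperator details are elided; the dual-input rule is the `all(... parents ...)` check, which postpones an operator until every parent is already in the tree:

```python
def build_tree_top_down(sources, children, parents):
    """children/parents: dicts mapping operator name -> list of names."""
    tree, pending = [], list(sources)
    while pending:
        op = pending.pop(0)
        # A dual-input operator is added only once all its parents are in;
        # otherwise it is dropped here and re-enqueued by its other parent.
        if all(p in tree for p in parents.get(op, [])):
            if op not in tree:
                tree.append(op)
            pending.extend(children.get(op, []))
    return tree

# The sample dataflow of Figure 4.2.
children = {"Load1": ["Join"], "Load2": ["Join"], "Join": ["Project"]}
parents = {"Join": ["Load1", "Load2"], "Project": ["Join"]}
print(build_tree_top_down(["Load1", "Load2"], children, parents))
# ['Load1', 'Load2', 'Join', 'Project']
```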


2. Bottom-Up Volcano Approach

This is an alternate technique to create an operator tree. As the name suggests, this approach constructs the operator tree starting from the sink, i.e., the output operator, and then moves upstream towards the input source operators. Once a Flink job for the sample dataflow program (Figure 4.2) is submitted, we seek the sink operator, in this case Project. We construct the SingleOperator object for the Project operator by setting the necessary details and then add it to the operator tree. We then seek the input/parent operators of Project and get the Join operator. We continue in a similar fashion until an operator has no inputs, which means we have reached a source operator, resulting in the completion of the operator tree for the given sample dataflow program.

This whole process can be expressed as the following pseudo code (Algorithm 3):

Algorithm 3 Bottom-Up Volcano Method

Step 1: Let List<SingleOperator> be an operatorTree.

Step 2: Get the sink of the given dataflow program.

Step 3: Construct a SingleOperator object for the sink and add it to the operatorTree.

Step 4: Get the upstream operators of the object added to the operatorTree.

Step 5: Construct a SingleOperator object for every upstream operator and add it to the operatorTree.

Step 6: Continue from Steps 4 to 5 until the sources are reached, i.e., no upstream operator is found for the added operator.
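The same tree can be built bottom-up with even less bookkeeping, since walking upstream needs no dual-input check; a sketch on the sample dataflow of Figure 4.2, again with operators reduced to names:

```python
def build_tree_bottom_up(sink, parents):
    """parents: dict mapping operator name -> list of upstream names."""
    tree, pending = [], [sink]
    while pending:
        op = pending.pop(0)
        if op not in tree:
            tree.append(op)
            pending.extend(parents.get(op, []))  # walk upstream
    return tree

parents = {"Project": ["Join"], "Join": ["Load1", "Load2"]}
print(build_tree_bottom_up("Project", parents))
# ['Project', 'Join', 'Load1', 'Load2']
```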

4.2 Supported Operators

Having seen the techniques to construct an operator tree, we now define the types of operators that are allowed and supported in our implementation and also look into the different conditions under which an operator may not produce any output.

• LOAD: The first operator in the operator tree; it acts as a source for the operators downstream. It will always produce records unless the input


source file or collection is empty. Load in Flink can also be achieved via

FlatMap User defined function (UDF) operator, that allows us to convert

input into requested tuple input format.

• DISTINCT : This operator de-duplicates the records present in the up-

stream operator. Unless the upstream operator generates empty set of tu-

ples, Distinct will always produce output records.

• JOIN : Is a dual-input operator, that combines the records from the up-

stream parent operators on a common join key. If there does not exist

a matching field on the join key in the parent operators output, Join will

produce an empty output. Join in Flink behaves differently compared to tra-

ditional SQL Join, in Flink it will join the tuples from both inputs to form

a single tuple of size 2 (tuple from input 1, tuple from input 2), whereas in

SQL the tuples will be merged as one tuple. Figure 3.2 shows the behavior

of an SQL Join, and Figure 4.4 depicts Flink Join behavior.

• UNION : Is a dual-input operator, that combines the outputs of the up-

stream parent operators. Here the output type of both parent operators

must be same, i.e., they must produce same type of tuple fields. Unless

both the parent operators generates the empty result set, Union will always

produce output records.

• CROSS : Is also a dual-input operator; it produces the cross product of the records from the upstream parent operators, i.e., it builds all pairwise combinations of the records from both parent operators' outputs. Cross will produce an empty output if either upstream operator's result set is empty.

• FILTER : Is a user-defined function (UDF) operator in Flink, which takes a filtering predicate and applies it to the upstream operator's result set. If no record in the upstream operator's result set passes the predicate, Filter's result will be an empty set.

• PROJECT : Is an operator that projects the specified fields from the tuples. Unless the upstream operator's record set is empty, Project will always have an output result set.
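The contrast between the two join shapes described in the JOIN bullet above can be sketched in plain Java, without a Flink dependency. Class and method names here are illustrative, not from the thesis implementation; tuples are modeled as string arrays for brevity.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the two join shapes (no Flink dependency).
public class JoinShapes {

    // Flink-style join: each match becomes a nested pair (a tuple of size 2).
    static List<Map.Entry<String[], String[]>> flinkJoin(
            List<String[]> visits, List<String[]> ranks) {
        List<Map.Entry<String[], String[]>> out = new ArrayList<>();
        for (String[] v : visits)           // v = {user, url, time}
            for (String[] r : ranks)        // r = {url, rank}
                if (v[1].equals(r[0]))      // join key: url
                    out.add(new SimpleEntry<>(v, r));
        return out;
    }

    // SQL-style join: matching rows are merged into one flat row.
    static List<String[]> sqlJoin(List<String[]> visits, List<String[]> ranks) {
        List<String[]> out = new ArrayList<>();
        for (String[] v : visits)
            for (String[] r : ranks)
                if (v[1].equals(r[0]))
                    out.add(new String[] {v[0], v[1], v[2], r[1]});
        return out;
    }

    public static void main(String[] args) {
        List<String[]> visits = new ArrayList<>();
        visits.add(new String[] {"Amit", "facebook", "5pm"});
        List<String[]> ranks = new ArrayList<>();
        ranks.add(new String[] {"facebook", "4"});
        // Flink: <<Amit, facebook, 5pm>, <facebook, 4>> -- a pair of tuples
        System.out.println(flinkJoin(visits, ranks).get(0).getKey()[0]);
        // SQL: <Amit, facebook, 5pm, 4> -- one merged row
        System.out.println(String.join(",", sqlJoin(visits, ranks).get(0)));
    }
}
```

The nested Flink shape is what the operator tree in this chapter propagates downstream, which is why later examples show tuples of the form <<user, url, time>, <url, rank>>.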

Of the plethora of Flink operators, which include many other UDF operators, our implementation is limited to this set because it suffices to convey the overall idea of how example generation works in Flink. As of now, Filter, FlatMap, and Map are the only Flink UDF operators included in this implementation, although, after defining fitting semantics, it is possible to extend the implementation to include the currently unsupported operators.

The Flink operators not supported in this implementation are: GroupReduce, GroupCombine, CoGroup, Aggregation, Iteration, and Partition. These operators are excluded because one needs to define exact semantics for each of them, and since most of them are UDF operators (where the functionality is defined at code level), it is difficult to characterize those semantics. We were able to include the FlatMap and Map UDF operators because, in contrast to grouping or iterations, the structural transformation of the input tuple is the only semantic that has to be taken into account.

In order to maintain completeness, i.e., to avoid situations where an operator generates an empty set and to cover all the necessary semantics of an operator, we adapt the model stated in [12], known as the Equivalence Class Model, which is described in the next section.

4.3 Equivalence Class Model

In this model, we assign each operator in the operator tree a list of equivalence classes, such that each equivalence class depicts one unique aspect of the operator's semantics. Let us formulate this concept: for a given operator O, εO = {E1, E2, ..., Em} is the set of equivalence classes, and each Ei specifies and illustrates one aspect of the operator's semantics. When sample examples are generated for an operator, either as input or output, we make sure that each generated example belongs to an equivalence class. We must also cover all equivalence classes of the given operator by having at least one example for each equivalence class in the result set. This way we achieve semantic completeness for the operator under consideration. Let us now define the equivalence classes for the operators we have considered:


• LOAD : Every record from the source file or collection that is being loaded forms the single equivalence class E1. Stated differently, every record that Load generates is assigned to a single equivalence class E1.

• DISTINCT : Every output record is assigned to a single equivalence class E1. The behavior can best be described when there exists at least one duplicate record in the input set, but this may not always be possible, as it depends on the output generated by the upstream operator.

• JOIN : Every output record is assigned to a single equivalence class E1. This illustrates that a matching join key exists in the parent operators' outputs, thereby clarifying the semantics.

• UNION : Every input record from the first parent operator is assigned to E1, and every input record from the second parent operator is assigned to E2. This allows us to confirm that Union's output contains at least one record from each parent operator.

• CROSS : Similar to Union, every input record from the first parent operator is assigned to E1, and every input record from the second parent operator is assigned to E2. This way we confirm that both parents provide input, which Cross needs to produce any output.

• FILTER : Every input record that passes the filtering predicate is assigned to E1, whereas every record that fails is assigned to E2. This way we have at least one record that passes the filter, so Filter produces an output, and we also depict the type of examples that are blocked by the filtering predicate.

• PROJECT : Every output record is assigned to E1; this verifies that Project had appropriate input to produce an output.
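The class assignments above, taking Filter as the example, can be sketched in plain Java. Names and types are illustrative, not from the thesis code; completeness here means that every equivalence class holds at least one example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch: assign Filter's input records to equivalence classes E1/E2 and
// check completeness (every class must hold at least one example).
public class FilterEquivalenceClasses {

    static List<List<Integer>> classify(List<Integer> input, Predicate<Integer> pred) {
        List<Integer> e1 = new ArrayList<>();   // E1: records passing the predicate
        List<Integer> e2 = new ArrayList<>();   // E2: records failing the predicate
        for (Integer rec : input)
            (pred.test(rec) ? e1 : e2).add(rec);
        return List.of(e1, e2);
    }

    static boolean isComplete(List<List<Integer>> classes) {
        return classes.stream().noneMatch(List::isEmpty);
    }

    public static void main(String[] args) {
        Predicate<Integer> rankAbove3 = rank -> rank > 3;
        // Both classes covered -> complete
        System.out.println(isComplete(classify(List.of(4, 2), rankAbove3)));  // true
        // E2 empty -> incomplete; the upstream pass would add a synthetic record
        System.out.println(isComplete(classify(List.of(4, 5), rankAbove3)));  // false
    }
}
```

An empty class signals exactly the incompleteness that the upstream pass (Section 4.4.3) later repairs.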

4.4 Algorithm

In this section, we describe our algorithm for generating example tuples for each operator in the operator tree. The algorithm needs two main inputs: (1) an operator tree for a submitted Flink job, created using the technique seen in Section 4.1.2, and (2) the base tables, i.e., the input source files or collections, depending on the input format used in the dataflow program. The algorithm is generic with respect to input formats and supports the different formats used by Flink, such as a CSV file, text file, Java collection, or HDFS file. Every input record is then converted to a tuple structure, as tuples form the basis for data movement through the operator tree. With respect to operator support, we are limited to the set of operators mentioned above, for the reasons discussed in Section 4.2. Let us quickly overview the different steps we perform throughout our algorithm in the next section.

4.4.1 Overview

After receiving the inputs of an operator tree for a given Flink dataflow program

and the input sources, we perform in total three different passes in order to generate

the final concise, complete set of examples. The algorithm can be summarized as:

1. Downstream pass: In this pass we move from the source operators to the sink, considering one operator at a time. At the source operators, we randomly sample n records from the input sources and propagate these records to the downstream operator. The downstream operator then acts on the records to generate an intermediate output, which is passed to the next downstream operator. We continue in this fashion until we reach the sink operator. After reaching the sink, we set the equivalence classes for each operator in the tree using the examples generated. Based on the initially sampled records, a downstream operator may or may not generate an example. To tackle this incompleteness, the case of an empty equivalence class, we have the next pass.

2. Upstream pass: In this pass we move from the sink to the source operators, checking at each operator whether any incompleteness exists by identifying empty equivalence classes. In case of incompleteness, we create a synthetic record that resolves the incompleteness for the operator under consideration. We then propagate this synthetic record up to the source operator and convert it to a concrete real example. We continue in this fashion until the source is reached.

3. Pruning pass: After resolving the incompleteness issue, we tackle conciseness in this pass. There may exist records that are redundant, in the sense that they do not add any extra semantic value for the operator under consideration. We eliminate such records in this pass, along with their trail throughout the tree.

In the next sections, we explain each pass in detail. For better understanding, let us consider an example dataflow program and use its operator tree and sources as the inputs to our algorithm. Figure 4.3 shows a dataflow program that provides the list of users that tend to visit URLs with high page rank.

Figure 4.3: Sample example - Dataflow program that retrieves users visiting pages with high page rank

In the above running example, we load two datasets, UserVisits and PageRanks, from their respective source files. Both datasets are joined on the field url; the joined records are then filtered such that only records having rank greater than three are accepted. Finally, the filtered records are projected into a tuple format of <user, url, rank>, thereby stating the high-ranked url for that user. Once the flink-illustrator module is invoked, the operator tree is created; for our running example, the tree will be the same as Figure 4.3, i.e., {Load UserVisits, Load PageRanks, Join, Filter, Project}, as per our Top-Down Push approach (Algorithm 2). As the input is ready, our algorithm invokes the downstream pass.

4.4.2 Downstream Pass

In our downstream pass, we randomly select n records from each input source; these form the result sets of the Load operators. The records are then propagated to the downstream operator, which acts on the received input records and generates an output, which in turn acts as input to the next downstream operator, and we continue in this manner until we reach the sink. Consider our operator tree as a list of operators {O1, O2, O3, ..., Om}. Given this order, O1's output will be the input of O2; operator O2 will then evaluate on the received set of inputs and possibly generate an output, and this output in turn becomes the input for


O3. This way we continue until we reach Om, which by definition is the sink operator, thereby completing the downstream pass. We store all the intermediate results produced by each operator. Let us apply this concept to the running example.

Figure 4.4: Downstream pass applied on our running example 4.3

After the downstream pass, the examples generated for each operator can be viewed as in Figure 4.4. As stated, the Loads hold randomly selected inputs from the source files. The Join operator receives these records as inputs and evaluates the join on url, the joining key. There are two common urls present, namely youtube and facebook, which allows Join to produce output by joining the respective tuples from the parent operators. The next downstream operator, Filter, accepts only those records with rank greater than three, while discarding the rest. Finally, Project, the sink operator, creates a tuple in the requested structure.

Record’s Lineage

During the downstream pass, we track the flow of each individual record over its lifetime in the operator tree; this tracking of an individual record is known as the lineage of that record. For example, the lineage of the record <Amit, facebook, 5pm>, moving in the direction from source to sink, is constructed as follows. The left-hand side denotes the operator traversal in the operator tree (and the effect each operator has on the tuple's structure), whereas the right-hand side denotes the record's traversal through the operator tree and shows the effect of evaluation on it after the respective operator.


Operator traversal                             Record's traversal

↓ Load                                         ↓ Load
<Username, URL, Time>                          <Amit, facebook, 5pm>

↓ Join (UserVisits with PageRanks on URL)      ↓ Join
<<Username, URL, Time>, <URL, Rank>>           <<Amit, facebook, 5pm>, <facebook, 4>>

↓ Filter (Rank > 3)                            ↓ Filter
<<Username, URL, Time>, <URL, Rank>>           <<Amit, facebook, 5pm>, <facebook, 4>>

↓ Project (Username, URL, Rank)                ↓ Project
<Username, URL, Rank>                          <Amit, facebook, 4>

Hence the lineage of record <Amit, facebook, 5pm> is:

<Amit, facebook, 5pm> → <<Amit, facebook, 5pm>, <facebook, 4>> → <<Amit, facebook, 5pm>, <facebook, 4>> → <Amit, facebook, 4>

Similarly, lineage of record <youtube,2> will be:

<youtube, 2> → <<Amit, youtube, 3pm>, <youtube, 2>>

Having these lineages will in turn let us form the lineage groups, which are the technique used for our pruning pass.

Lineage Group

A lineage group for a given record adds some extra information to the available lineage, such as the operator that acts on the record and the output it generates. For example, the lineage group of <Amit, facebook, 5pm> will be as follows:

<Amit, facebook, 5pm> → {Load, <Amit, facebook, 5pm>}
                        {Join, <<Amit, facebook, 5pm>, <facebook, 4>>}
                        {Filter, <<Amit, facebook, 5pm>, <facebook, 4>>}
                        {Project, <Amit, facebook, 4>}
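This bookkeeping can be sketched in plain Java as a map from each source record to its ordered trace of (operator, intermediate record) entries. The structure and names are illustrative, not the thesis implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of lineage groups: for each source record, keep the ordered list of
// (operator, intermediate record) entries produced as it flows downstream.
public class LineageGroups {

    static final Map<String, List<String[]>> GROUPS = new LinkedHashMap<>();

    static void track(String sourceRecord, String operator, String result) {
        GROUPS.computeIfAbsent(sourceRecord, k -> new ArrayList<>())
              .add(new String[] {operator, result});
    }

    public static void main(String[] args) {
        String rec = "<Amit, facebook, 5pm>";
        track(rec, "Load",    "<Amit, facebook, 5pm>");
        track(rec, "Join",    "<<Amit, facebook, 5pm>,<facebook, 4>>");
        track(rec, "Filter",  "<<Amit, facebook, 5pm>,<facebook, 4>>");
        track(rec, "Project", "<Amit, facebook, 4>");
        // The group records the full trace, so the pruning pass can later
        // remove the record at every operator it touched.
        for (String[] entry : GROUPS.get(rec))
            System.out.println("{" + entry[0] + ", " + entry[1] + "}");
    }
}
```

Keeping the trace per source record is what makes the pruning pass cheap: removing a record reduces to walking its group instead of re-evaluating the tree.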

These lineage groups are then used for pruning in the pruning pass. After the completion of the downstream pass, we examine the generated example set at each operator and set the equivalence classes for the respective operator. The complete downstream pass can be summarized in the following pseudo-code (Algorithm 4):


Algorithm 4 Downstream Pass

Step 1 : let operatorTree be the input

Step 2 : set the LOAD operators' records by randomly selecting n values from the respective sources; store these records as the output for the LOAD operators

Step 3 : initiate lineage groups for every record from the above outputs as <record → <Load, record>>

Step 4 : get the next downstream operator from the operator tree, evaluate the operator on the available input, store the intermediate output, and add <record → <operator, record's evaluation on this operator>> to the record's respective lineage group

Step 5 : continue Step 4 until the sink is reached

Step 6 : set the equivalence classes for each operator based on the output generated
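The core loop of Steps 2 through 5 can be sketched for a linear operator chain in plain Java. Operator functions and names here are illustrative stand-ins, not the thesis implementation; dual-input operators would need a tree walk instead of a flat list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Minimal sketch of the downstream pass over a linear operator chain:
// each operator consumes the upstream result, and every intermediate
// output is stored for the later passes.
public class DownstreamPass {

    static List<List<String>> run(List<String> sample,
                                  List<UnaryOperator<List<String>>> operators) {
        List<List<String>> intermediates = new ArrayList<>();
        List<String> current = sample;            // Step 2: sampled Load output
        intermediates.add(current);
        for (UnaryOperator<List<String>> op : operators) {  // Steps 4-5
            current = op.apply(current);          // evaluate, then propagate
            intermediates.add(current);           // store intermediate output
        }
        return intermediates;                     // one result set per operator
    }

    public static void main(String[] args) {
        UnaryOperator<List<String>> filter =
            in -> in.stream().filter(s -> s.length() > 3).toList();
        UnaryOperator<List<String>> project =
            in -> in.stream().map(s -> s.substring(0, 3)).toList();
        List<List<String>> results =
            run(List.of("youtube", "web"), List.of(filter, project));
        System.out.println(results.get(2));  // prints [you]
    }
}
```

Storing every intermediate result is deliberate: both the equivalence-class check (Step 6) and the later passes read these per-operator result sets.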

In the above running dataflow example, each operator generated an output and every equivalence class of each operator was filled. For example, Join had a joined example, thereby filling E1; Filter had both equivalence classes filled, E1 with two records passing and E2 with one record failing the filtering predicate. This may not always be the case: for instance, consider the situation shown in Figure 4.5 for the same dataflow program with a slight modification, where we remove the Filter operator for simplicity. Such scenarios might occur in any example dataflow; to tackle them, we move on to the upstream pass.

4.4.3 Upstream Pass

In this pass, we move from the sink towards the sources, inspecting one operator at a time. At each operator, we identify possible cases of incompleteness and address them by introducing synthetic records that boost the completeness of the operator under consideration. We identify incompleteness based on the equivalence classes set in the downstream pass: if, for a given operator O, any equivalence class is empty, we introduce a synthetic record into O's input set such that, after evaluation, it makes the respective equivalence class non-empty. After introducing the synthetic record, we propagate it upstream until the respective source is reached, where we then convert this synthetic record to an actual


Figure 4.5: Unmatching join-key attribute leads to an empty result at the Join operator

record by matching it as closely as possible with data available at the source.

In the following section, we explain how we generate a synthetic record for an

operator’s equivalence class.

A synthetic record is a concise representation of the tuple that an operator consumes or produces, based on the equivalence class we are trying to fill. Depending on the operator under consideration, we explicitly define whether a field in the tuple is of particular significance. For example, in the case of a Join operator we mark whether a field is the join key or not. For the set of operators considered in our implementation, the language to define synthetic records is very simple:

i. if the field is a join key, add the marker "Join key" to the respective field position in the tuple,

ii. all the remaining fields are marked as "Don't care" (with a hint of the field's datatype), which means any value is acceptable at this position.
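This two-marker language can be sketched in plain Java. The enum, class, and method names are illustrative, not the thesis code; the sketch also shows how a concrete source record can be checked against a synthetic record when converting it to a real example.

```java
// Sketch of the synthetic-record language: a field is either the join key
// or a "don't care" placeholder (any value of the right type is acceptable).
public class SyntheticRecord {

    enum Marker { JOIN_KEY, DONT_CARE }

    final Marker[] fields;

    SyntheticRecord(Marker... fields) { this.fields = fields; }

    // A concrete source record can realize this synthetic record if it agrees
    // on every join-key position with the requested key value; "don't care"
    // positions match anything.
    boolean matches(String[] record, String joinKeyValue) {
        for (int i = 0; i < fields.length; i++)
            if (fields[i] == Marker.JOIN_KEY && !record[i].equals(joinKeyValue))
                return false;
        return true;
    }

    public static void main(String[] args) {
        // Load (UserVisits): <Don't care, Join key, Don't care>
        SyntheticRecord visits = new SyntheticRecord(
            Marker.DONT_CARE, Marker.JOIN_KEY, Marker.DONT_CARE);
        // <John, linkedin, 6pm> can realize the synthetic record for key "linkedin"
        System.out.println(visits.matches(
            new String[] {"John", "linkedin", "6pm"}, "linkedin"));  // true
        System.out.println(visits.matches(
            new String[] {"June", "gmail", "5pm"}, "linkedin"));     // false
    }
}
```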

This could certainly be extended to support a richer set of operators, e.g., in the case of the Filter operator, stating which field in the tuple the filtering predicate applies to, or in the case of the FlatMap operator, stating that a field is transformed. But as these are UDF operators in Flink, retrieving such information is not yet possible. Indeed, the semantic properties (Section 4.5.1) of an operator are useful in this scenario, but they cannot be generalized to the overall implementation, as they are optional properties and may not always be set.

Now let us explain how this works with respect to the example in Figure 4.5. As we can see, there is no common join key in the two Loads, causing Join to produce an empty result; hence the equivalence class E1 is empty. As we traverse from sink to source, we first come across the Project operator. Since Project merely displays results in the specified structure, we know that having some records in its input will solve its incompleteness, and hence we ignore it: if Join produces a result, Project surely will too. Thus we move upstream, reach the Join operator, and find E1, i.e., the class of joined examples, to be empty. Hence we attempt to introduce records into Join's input that will resolve the incompleteness. Since url is the join key, our synthetic records will be as follows:

Load (UserVisits)  < Don't care, Join key, Don't care >

Load (PageRanks)   < Join key, Don't care >

Having the above set of records as input to the Join operator resolves its incompleteness, as we get the resulting joined set << Don't care, Join key, Don't care >, < Join key, Don't care >>, and progressively Project's as well, as shown in Figure 4.6. We then move these synthetic records upstream to the sources and convert them to records with real matching data values taken from the respective source files. These new data values are obtained from the examples that were not used in the downstream pass; as shown in Figure 4.7, < John, linkedin, 6pm > and < linkedin, 4 >, unused in the downstream pass, are the records introduced to boost completeness.


Figure 4.6: Upstream Pass: Synthetic records introduction and assumed operator evaluation

Figure 4.7: Upstream Pass: Synthetic records to real records


This way our upstream pass attempts to resolve any incompleteness present; the whole process can be seen as a single iteration of introducing synthetic records, converting them to real records, and evaluating the operators once again. For the Join, Cross, or Union operators, we can fill an empty equivalence class within a single iteration, but for the Filter operator one iteration may not always suffice, because finding a specific record that either satisfies or fails the filtering predicate (depending on the equivalence class to fill) is a time-exhausting process on big datasets. This is also due to the fact that we are not exposed to the filtering predicate, as it is a UDF in Flink; hence manipulating the synthetic record to our benefit is not possible. Having said that, we use the same method for Filter as for the other operators to generate the best possible record in a single iteration. Given the input from the downstream pass, the upstream pass proceeds as per the following pseudo-code (Algorithm 5), where we move from sink to source examining one operator at a time.

Algorithm 5 Upstream Pass

Step 1 : let O be the operator under consideration (initially O will be the sink operator)

Step 2 : check for empty equivalence classes in O; if any equivalence class is empty go to Step 3.1, if all equivalence classes are filled get the upstream operator and go to Step 1

Step 3.1 : introduce synthetic records into O's input set such that they make the equivalence class non-empty, and propagate these records in the upstream direction to the source

Step 3.2 : at the source, convert each synthetic record to a closely matching real record from the source's input

Step 4 : evaluate the operators again, store the newly added records, update the respective lineage groups, and discard the synthetic records

Step 5 : set the equivalence classes for the operators based on the newly available records

Step 6 : get the next upstream operator (if any) and continue from Step 1

After tackling the incompleteness issue in the upstream pass, we now move towards making the result set as concise as possible by pruning the generated examples. This is done in our last pass, the pruning pass.


4.4.4 Pruning Pass

In this pass, given the set of examples generated by each operator after the above two passes, we reduce the example set for every operator where possible, such that it does not affect the completeness of that operator. This is done using the record lineages and lineage groups that were set in the downstream and upstream passes. We start from the sink operator, consider its example set, and try removing one example at a time. If the removed example keeps the operator's completeness intact, we then use the record's lineage group and attempt to remove the trace of that record throughout the tree. While pruning, we make sure that completeness is not compromised at any operator in the tree. We then move upstream and continue in a similar fashion at each operator, making sure that removing a record at an upstream operator does not impact the completeness of the current operator or of the downstream operators, until the source operators are reached. Let us now explain this concept and how it is realized with respect to the examples from Figures 4.4 and 4.7.

Let us start with the example from Figure 4.4 (recreated here for reference).

Figure 4.8: Pruning pass on example from 4.4: Initial State

Here the sink operator is Project; we consider its output and see two examples. Since Project is an operator where every output record is assigned to the single equivalence class E1, having just one record in E1 suffices to maintain Project's completeness. This allows us to prune one of the two examples from Project's output set {< Hung, facebook, 4 >, < Amit, facebook, 4 >}. Let us try


to prune the record < Hung, facebook, 4 >. This record is part of two lineage groups, as follows:

Lineage Group 1:

< Hung, facebook, 4pm > → {Load, < Hung, facebook, 4pm >}
                          {Join, << Hung, facebook, 4pm >, < facebook, 4 >>}
                          {Filter, << Hung, facebook, 4pm >, < facebook, 4 >>}
                          {Project, < Hung, facebook, 4 >}

Lineage Group 2:

< facebook, 4 > → {Load, < facebook, 4 >}
                  {Join, << Hung, facebook, 4pm >, < facebook, 4 >>}
                  {Join, << Amit, facebook, 5pm >, < facebook, 4 >>}
                  {Filter, << Hung, facebook, 4pm >, < facebook, 4 >>}
                  {Filter, << Amit, facebook, 5pm >, < facebook, 4 >>}
                  {Project, < Hung, facebook, 4 >}
                  {Project, < Amit, facebook, 4 >}

Considering Lineage Group 1, as we move from sink to source, we see at Project that eliminating < Hung, facebook, 4 > does not affect its completeness, and thus we are allowed to prune the example at the sink Project operator. The next upstream operator is Filter; here too, removing << Hung, facebook, 4pm >, < facebook, 4 >> has no impact on Filter's completeness, as both of its equivalence classes, E1 (passing the filter: << Amit, facebook, 5pm >, < facebook, 4 >>) and E2 (failing the filter: << Amit, youtube, 3pm >, < youtube, 2 >>), are still filled, and Project's completeness is intact as well. Thus << Hung, facebook, 4pm >, < facebook, 4 >> is pruned at the Filter operator. The same holds at the Join and Load (UserVisits) operators, and the respective record is pruned there as well. We have hence pruned the complete trace of < Hung, facebook, 4 > throughout the operator tree using Lineage Group 1. After this first round of pruning, Lineage Group 2 will look as follows:


< facebook, 4 > → {Load, < facebook, 4 >}
                  {Join, << Amit, facebook, 5pm >, < facebook, 4 >>}
                  {Filter, << Amit, facebook, 5pm >, < facebook, 4 >>}
                  {Project, < Amit, facebook, 4 >}

As the remaining records at Join, Filter, and Project are not related to < Hung, facebook, 4 >, we cannot prune them. But we are allowed to prune < facebook, 4 > at Load; we attempt to remove it but realize that it would disturb the completeness of the downstream Filter operator, as there would be no record passing the filter, making E1 void. Hence < facebook, 4 > is retained at Load (PageRanks). After finishing pruning at the sink operator (Project), the resulting operator tree with examples will look as in Figure 4.9; we then move to the next upstream operator, Filter.

Figure 4.9: Pruning pass: After pruning Project operator

At Filter, it is not possible to prune the only available record from its output, as doing so would make Filter's, as well as Project's, result set empty. Moving further upstream to the Join operator, the output set consists of two records, << Amit, facebook, 5pm >, < facebook, 4 >> and << Amit, youtube, 3pm >, < youtube, 2 >>. At Join it is possible to prune either one of these records without affecting Join's completeness, but pruning either will reduce completeness at the downstream Filter operator: pruning << Amit, facebook, 5pm >, < facebook, 4 >> would make Filter's E1 (passing records) void, and pruning << Amit, youtube, 3pm >, < youtube, 2 >> would make Filter's E2 (failing records) void. Hence at Join, as at Filter, no record is pruned. Applying similar logic to the remaining upstream operators Load (UserVisits) and Load (PageRanks): at Load (UserVisits) we can prune < June, gmail, 5pm >, and at Load (PageRanks) we can prune < yahoo, 3 > and < skysports, 2 >, as this does not upset the completeness of the respective Loads or of any downstream operators. The overall operator tree with examples after completion of the pruning pass will look as shown in Figure 4.10.

Figure 4.10: Pruning pass: Complete pruning of example from Figure 4.4

Using the same technique, the pruning pass on the example from Figure 4.7 produces the result set shown in Figure 4.11.

The complete pruning pass can be summarized as per the pseudo-code in Algorithm 6.


Figure 4.11: Pruning pass: Complete pruning of example from Figure 4.7

Algorithm 6 Pruning Pass

Step 1 : let O be the operator under consideration (initially O will be the sink operator); for every record R in O's result set perform Step 2

Step 2 : verify whether pruning R would result in incompleteness at O; if completeness is intact proceed to Step 3, else get the next R (if any) from O's result set and redo Step 2

Step 3 : get R's lineage group l; for every lineage entry in l execute Step 4

Step 4 : get record Rli (the record at operator i in lineage group l); initially i will be the farthest downstream operator. Verify whether pruning Rli affects completeness at operator i, as well as the completeness of i's downstream operators (if any). If completeness is intact everywhere, remove Rli from operator i, move to the next lineage entry record Rl(i−1) (if any), where i − 1 is the upstream operator of operator i, and redo Step 4 (now Rli = Rl(i−1))

Step 5 : repeat Steps 1 through 4 for the next upstream operator until we reach the source
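The completeness check at the heart of Steps 2 and 4 can be sketched in plain Java: a record may only be pruned if it is not the last witness of any equivalence class. Names are illustrative; the full algorithm additionally repeats this check for all downstream operators via the lineage group.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the pruning check: a record may be removed from an operator's
// result set only if every equivalence class of that operator stays
// non-empty afterwards.
public class PruningCheck {

    // classes: each inner list is one equivalence class of the operator
    static boolean canPrune(List<List<String>> classes, String record) {
        for (List<String> cls : classes) {
            if (cls.contains(record) && cls.size() == 1)
                return false;  // record is the last witness of this class
        }
        return true;
    }

    public static void main(String[] args) {
        List<List<String>> projectClasses =
            List.of(new ArrayList<>(List.of("<Hung, facebook, 4>", "<Amit, facebook, 4>")));
        // Two examples in E1: one of them is redundant and may be pruned.
        System.out.println(canPrune(projectClasses, "<Hung, facebook, 4>"));  // true
        List<List<String>> filterClasses = List.of(
            List.of("<<Amit, facebook, 5pm>,<facebook, 4>>"),   // E1: passing
            List.of("<<Amit, youtube, 3pm>,<youtube, 2>>"));    // E2: failing
        // Each class has a single witness: nothing can be pruned at Filter.
        System.out.println(canPrune(filterClasses,
            "<<Amit, facebook, 5pm>,<facebook, 4>>"));  // false
    }
}
```

This mirrors the worked example above: the duplicate Project output is prunable, while Filter's single passing and single failing records must both be retained.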

These were the main passes used in our example generation algorithm. In the next section, we explain the extensibility features of Flink used in our implementation.


4.5 Flink Add-ons

4.5.1 Using Semantic Properties

Flink’s optimizer provides a way to inspect UDFs, normally seen as a black box,

by presenting information such as location of the fields of the input tuples that

are being accessed, modified or copied. Hence providing an indirect access to

data movement in a UDF. To access this information, one needs to add Function

Annotations 2 to a UDF. This can be done in a manner similar to shown in Code-

Snippet 4.1 as follows:

@FunctionAnnotation.ForwardedFields("f0.f0->f0;f0.f1->f1;f1.f1->f2")
public static class PrintResultWithAnnotation implements
        FlatMapFunction<Tuple2<Tuple2<String, String>, Tuple2<String, Long>>,
                        Tuple3<String, String, Long>> {

    // Merges the tuples to create tuples of <User, URL, PageRank>
    public void flatMap(
            Tuple2<Tuple2<String, String>, Tuple2<String, Long>> joinSet,
            Collector<Tuple3<String, String, Long>> collector)
            throws Exception {
        collector.collect(new Tuple3<String, String, Long>(
                joinSet.f0.f0, joinSet.f0.f1, joinSet.f1.f1));
    }
}

Listing 4.1: Demonstrating ForwardedFields Usage

The class PrintResultWithAnnotation, a FlatMap UDF operator, takes an input tuple of the format << String, String >, < String, Long >>, corresponding to << User, URL >, < URL, PageRank >>, and returns a new tuple of the format < String, String, Long >, corresponding to < User, URL, PageRank >. If we try to paint a picture of what is going on inside the FlatMap function, it can be viewed as shown in Figure 4.12:

2 https://ci.apache.org/projects/flink/flink-docs-release-0.9/apis/programming_guide.html#semantic-annotations


Figure 4.12: Inside view of the FlatMap function of Listing 4.1

This is exactly what the ForwardedFields annotation, placed just before the class definition, states:

@FunctionAnnotation.ForwardedFields("f0.f0 -> f0; f0.f1 -> f1; f1.f1 -> f2")

input tuple:    <  < User , URL >   ,   < URL , PageRank >  >
                     |       |            |        |
                   f0.f0   f0.f1        f1.f0    f1.f1

output tuple:   <  User ,  URL ,  PageRank >
                    f0      f1       f2

To explain this annotation in detail, we need to understand the input tuple structure. Overall, << User, URL >, < URL, PageRank >> is a Tuple2 representation, where field0 is < User, URL > and field1 is < URL, PageRank >; for simplicity, let us call them main-field0 and main-field1. Both main fields are in turn Tuple2 representations: in < User, URL >, User is field0 and URL is field1; similarly, in < URL, PageRank >, URL is field0 and PageRank is field1. In the expression f0.f1 -> f1, the left-hand side f0.f1 denotes URL, which is field1 in main-field0; it is simply copied, or forwarded, to f1, i.e., the location field1 in the output tuple < User, URL, PageRank >. This demonstrates how function annotations allow us to drill into a UDF's internal operation.
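As an aside, the mapping encoded in such an annotation string can be recovered mechanically. The following sketch parses a ForwardedFields expression into source-to-target field pairs; it is a simplified illustration, not Flink's actual semantic-properties parser:

```java
import java.util.*;

// Illustrative parser for a ForwardedFields expression such as
// "f0.f0->f0;f0.f1->f1;f1.f1->f2". Splits the string into entries on ';'
// and each entry into a (source field, target field) pair on "->".
public class ForwardedFieldsParser {

    // Returns a map from source field path to target field path.
    static Map<String, String> parse(String annotation) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String entry : annotation.split(";")) {
            String[] parts = entry.split("->");
            mapping.put(parts[0].trim(), parts[1].trim());
        }
        return mapping;
    }
}
```

For the annotation above, the parser recovers that f0.f1 (URL) is forwarded to output field f1, exactly as illustrated in Figure 4.12.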

In our implementation, we exploit this feature of Flink for the FlatMap UDF operator. It can be further extended to other UDFs, given that the annotation information is available.

Static Code Analysis (SCA) [23] is a new feature in the latest Flink release. SCA implicitly generates the same type of semantic information that we gain using Function Annotations, so the user can avoid mentioning annotations explicitly in order to capture the UDF behavior. Our implementation could further leverage static code analysis, as it would expose the input-output mapping of all UDF operators; extending our implementation to other UDFs in this way can be considered future enhancement.

4.5.2 Interactive Scala Shell

The Interactive Scala Shell is the latest addition to Flink's features; it allows command-line execution of Flink Scala code, evaluated on a per-line basis. It provides a very good platform for displaying the example result sets generated by our algorithm for a given dataflow program. We have therefore added an extra command, ":ILLUSTRATE", to the shell's list of standard commands; upon receiving this command, the shell calls our flink-illustrator module and displays the generated examples. The command takes two arguments: Flink's execution environment and the sink operator (or the sink operator's dataset). As our implementation is limited to Flink Java jobs, executing from the Scala shell requires the necessary adjustments to use the respective Java API modules. Appendix A.2 contains a sample Scala script adapted to the Java API modules and the usage of the ILLUSTRATE command.

In this chapter, we have explained the details of our algorithm implementation as well as the Flink extensions that it builds on. The next chapter describes the experiments and evaluations conducted on the algorithm.


Chapter 5

Evaluation

This chapter addresses one of the main objectives of this thesis: evaluating the performance of the algorithm with respect to four key metrics:

Completeness: verifies that the examples generated at an operator cover all the semantics of that operator,

Conciseness: verifies that the example set is as small as possible, allowing a quick overview of the operator functionality and the dataflow,

Realism: verifies that the examples indeed come from the input sources, and

Runtime: verifies that example generation is indeed faster than executing the complete dataflow.

In the following sections, we define each metric, describe the experimental setup and workload (the dataflow programs), and finally measure the performance of our algorithm (with respect to the above metrics) against two other example generation approaches:

• Downstream Propagation : a single downstream pass from the sources to the sink.

• Sink-Pruning : pruning only at the final sink operator.


5.1 Performance Metrics

5.1.1 Completeness

The completeness of an operator O is defined using the concept of equivalence classes, as the ratio of the number of O's equivalence classes with at least one example to the total number of O's equivalence classes:

Completeness_O = (# O's equivalence classes with at least one example) / (# O's equivalence classes)

[This value always ranges between 0 and 1; 1 denotes total completeness.]

For example, the Filter operator has two equivalence classes for its input records:

E1 - records passing the filtering predicate

E2 - records failing the filtering predicate

Figure 5.1: Completeness of Filter = 0.5

Consider a scenario where Filter's input only has records that pass the filter, as shown in Figure 5.1, where the operator accepts records with age above 18. From the available set of examples, E1 is filled and E2 is empty, i.e., only one out of two equivalence classes has at least one example; substituting these values in the above formula gives 1/2. Using this technique, we define the per-operator completeness for all the operators in a given dataflow program.

The overall completeness of a given dataflow program P can then be defined as the average of the per-operator completeness of all m operators in P:

Completeness_P = (Completeness_O1 + Completeness_O2 + ... + Completeness_Om) / m
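The two formulas above can be sketched in code as follows; the method names are illustrative, not taken from the thesis implementation. For the Filter scenario of Figure 5.1, one filled class out of two yields a completeness of 0.5:

```java
// Sketch of the completeness metric: the fraction of an operator's
// equivalence classes that hold at least one example, clamped to [0, 1],
// and the program-level average over all m operators.
public class CompletenessMetric {

    static double operatorCompleteness(int filledClasses, int totalClasses) {
        if (totalClasses == 0) return 1.0; // no classes to cover
        return Math.min(1.0, (double) filledClasses / totalClasses);
    }

    // Program-level completeness: the average over all m operators.
    static double programCompleteness(double[] perOperator) {
        double sum = 0.0;
        for (double c : perOperator) sum += c;
        return sum / perOperator.length;
    }
}
```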


5.1.2 Conciseness

The conciseness of an operator O is defined as the ratio of the number of the operator's equivalence classes to the total number of example records for that operator, with a ceiling at 1:

Conciseness_O = min(1, (# O's equivalence classes) / (# O's example records))

[This value always ranges between 0 and 1; 1 denotes that the operator has produced the most concise set of examples.]

Using the example from Figure 5.1, Filter generates two examples and has two equivalence classes, hence the conciseness for this operator is 2/2 = 1, which means that the records generated are indeed concise, although not complete. Given the above method for evaluating per-operator conciseness, the overall conciseness of a dataflow program P is defined as the average of the per-operator conciseness of all the operators in P:

Conciseness_P = (Conciseness_O1 + Conciseness_O2 + ... + Conciseness_Om) / m

5.1.3 Realism

The realism of an operator O is defined as the fraction of the records in O's output example set that are real. A record is considered real if it comes from the source operators or is derived via operations on records from the source operators. An operator's output set may or may not contain synthetic records; having synthetic records reduces realism:

Realism_O = (# real records in O's output example set) / (# records in O's output example set)

[This value always ranges between 0 and 1.]

Similarly, the overall realism of a dataflow program P is:

Realism_P = (Realism_O1 + Realism_O2 + ... + Realism_Om) / m
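Both per-operator ratios can be sketched as follows, with the clamp to [0, 1] made explicit; the names are illustrative only. For the Figure 5.1 example, conciseness is 2/2 = 1, and adding one synthetic record to an output set of three total records drops realism to 2/3:

```java
// Sketch of the conciseness and realism metrics as defined above.
public class ExampleQualityMetrics {

    // Conciseness: equivalence classes over example records, ceiling at 1.
    static double conciseness(int equivalenceClasses, int exampleRecords) {
        if (exampleRecords == 0) return 1.0; // nothing redundant to report
        return Math.min(1.0, (double) equivalenceClasses / exampleRecords);
    }

    // Realism: real (non-synthetic) records over all records in the output set.
    static double realism(int realRecords, int totalRecords) {
        if (totalRecords == 0) return 1.0;
        return Math.min(1.0, (double) realRecords / totalRecords);
    }
}
```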


Considering the same example as in Figure 5.1, the realism at Filter is 1, as there exist no synthetic records. Suppose during the upstream pass a synthetic record < PQR, 17 > is added to Filter's input to boost its completeness, as shown in Figure 5.2.

Figure 5.2: Realism of Filter after boosting completeness = 2/3

By adding the synthetic record, we indeed increase the completeness of the Filter operator, but at the expense of a reduction in realism. Given the nature of this problem, no solution can guarantee full completeness together with full realism; one of them is always compromised. A similar scenario may appear in a real-world dataflow program, and one cannot guarantee a score of 1 on each of completeness, conciseness and realism. Our algorithm favors completeness over realism, as it is more important to determine the complete behavior of an operator.

5.2 Experiments

5.2.1 Setup and Datasets

In this section, we evaluate our dataflow programs with respect to the above performance metrics. Before diving into the details, let us briefly describe the experimental environment. The experiments were conducted on a local Linux box running Ubuntu 14.04 with 4 cores and 12 GB RAM. Apache Flink version 0.9.0 was used in our implementation. Most of our experiments were conducted with local text files as input sources, while some used HDFS (Hadoop version 2.6.0) as the storage layer.

We use locally constructed datasets to test our implementation; they cover all the possible operators under different scenarios for complete algorithm coverage.


We try to introduce empty equivalence classes by maneuvering the input files accordingly; this allows us to verify the completeness aspect of the operators and of the overall dataflow program. The dataset details can be seen in Table 5.1.

DataSet       | Details                                             | Format                                              | #rows
UserVisits    | Log of users' visits to websites during the day     | csv and hdfs file (Username, URL, Visit TimeStamp)  | 100
UserVisitsEU  | Log of EU users' visits to websites during the day  | csv and hdfs file (Username, URL, Visit TimeStamp)  | 70
PageRanks     | Contains the pagerank of the respective website     | csv (URL, Rank)                                     | 50
JobSeeker     | Contains user details                               | csv (UserId, Username)                              | 50
Jobs          | Contains job and salary details                     | csv (JobId, City, Salary)                           | 20
JobsEU        | Contains job and salary details in the EU           | csv (JobId, City, Salary)                           | 20

Table 5.1: Locally created datasets used for evaluation purposes

For thorough evaluation, we define in total 13 dataflow programs that use variations of the above datasets as input. These dataflow programs can be found in Appendix A.3. We categorize these 13 programs into the following types (based on the type of evaluation):

1. Single-operator: the dataflow contains only one data transformation operator (Programs 1 to 6 and 13).

2. Multi-operator: the dataflow contains multiple data transformation operators (Programs 7 to 9).

3. Empty equivalence classes: the dataflow inputs are manipulated such that they result in an empty set or empty equivalence classes for the operator under test (Programs 10 to 12).

5.2.2 Quality of Generated Examples

We evaluate the above programs using three different approaches, measured against the defined metrics: completeness, conciseness and realism. The approaches are:

1. Our Algorithm: the algorithm from our implementation, with downstream, upstream and full pruning passes (Section 4.4).

2. Downstream: we propagate the sampled example sets from the source LOAD operators to the sink, evaluating at each operator in the operator tree. This approach allows us to validate the completeness metric.

3. Sink-level Pruning: a variant of our algorithm, with pruning limited to the sink operator rather than applied at each operator in the operator tree. This approach allows us to validate the conciseness metric.

We ran three iterations of each program (in order to compensate for the initial sampling); in Figures 5.3, 5.4 and 5.5 the Y-axis thus denotes the average of the metrics, while the X-axis denotes the approach. The idea behind the multiple executions of each program is that the initial sampling at LOAD might produce examples that cannot be consumed by the downstream operators to generate an output, resulting in incompleteness. For example, two different LOADs may produce examples with no common join key at a JOIN, or a LOAD may produce no examples that pass the filtering predicate at a FILTER. Multiple runs thus helped us verify whether our algorithm is efficient enough to tackle any incompleteness being introduced.

Let us analyze the experimental observations.

1. Single-operator dataflows:

With respect to the completeness of the dataflows containing only one data transformation operator, our algorithm achieves the highest score of 1 for all 7 programs, i.e., it illustrates all the necessary semantics of the operators. In contrast, the Downstream approach fails to attain total completeness on more than one occasion: at Join (due to the unavailability of a common join key) and at Filter (due to a missing record that passes/fails the filtering predicate). Our algorithm managed to tackle the incompleteness so introduced. As stated earlier, we emphasize completeness over realism; from Figure 5.3 (a/e/f) it is clear that we introduce a synthetic record to address the Join incompleteness, thereby compromising on realism. Compared to the other approaches, the examples generated by our algorithm are more concise and attain the highest score, whereas Downstream and Sink-Pruning remain in the range of 0.4 - 0.8. It is thus clear that our algorithm produces the most complete and concise set of examples for single-operator dataflows.


Figure 5.3: Evaluation of single-operator dataflows: (a) Join, (b) Union, (c) Cross, (d) Filter, (e) Join with Project, (f) Join with Project using Annotations, (g) Join using HDFS


2. Multi-operator dataflows

Figure 5.4: Evaluation of multi-operator dataflows: (a) Union, Join, Project, (b) Union, Filter, Join, Project, (c) Union, Cross

As seen in Figure 5.4 for multi-operator dataflows, our algorithm attains total completeness in 2 out of 3 programs. Program 8 is a multi-operator dataflow with a FILTER, where the upstream pass may not always succeed in filling the empty equivalence class. This can be rectified by running multiple iterations until the equivalence class is filled, but at the expense of longer running time as well as increased pruning cost (each iteration might add redundant records). Programs 7 and 8, which contain a UNION followed by a JOIN in the operator tree, produce more than one joined example in some scenarios and hence lack conciseness; even so, their conciseness is still higher than that of the other two approaches. For multi-operator dataflows, the set of examples produced by our algorithm may not always be totally complete, but it indeed fares better than the downstream-only approach.


3. Empty Equivalence Classes:

Figure 5.5: Evaluation of special cases with intentional manipulation of inputs to create empty equivalence classes at (a) Join, (b) Union, (c) Cross

For the special cases of Programs 10-12, the input sources were intentionally altered or kept empty so as to cause incompleteness during the downstream pass. The algorithm still managed to achieve total completeness, at the expense of zero realism, because only synthetic records traversed the dataflow. As we can see in Figure 5.5, compared to our algorithm the downstream approach achieved very low completeness scores, as there were no records in the input sources to fully perform the join (in 10), union (in 11) or cross (in 12). Again, our approach managed to produce a concise set of examples with the highest score of 1 on every occasion.

To summarize the evaluation of the quality of the generated examples: our algorithm achieves total completeness on 12 out of 13 programs. A lack of realism reflects the fact that the downstream pass introduced incompleteness and that, irrespective of this, the algorithm still achieved the highest completeness. The generated examples are always concise, achieving a score of 1 in 11 out of 13 programs. Their conciseness is higher than that of the other two approaches because the Downstream approach performs no pruning, whereas Sink-Pruning prunes only at the final sink operator. Our implementation achieves good conciseness via pruning while keeping completeness intact. We can hence confirm that our algorithm's combination of downstream, upstream and pruning passes is qualitatively and quantitatively the better approach to the example generation problem.

5.2.3 Running Time

In this section, we compare the performance of the implementation with respect to running time. In this experiment, we ran our implementation against varying sizes of input sources, ranging from 17 MB to 1.1 GB. As we ran our experiments on a local machine and cluster, we limited the maximum dataset size to 1.1 GB. The datasets were obtained from the SNAP Datasets Library [24]; the details are listed in Table 5.2:

DataSet           | Details                                                   | Size (in MB)
Reddit            | Resubmitted contents on reddit.com                        | 17
Skitter           | Internet topology graph, from daily trace-routes in 2005  | 150
Patent-Citations  | Citation networks among US patents                        | 280
Pokec             | Pokec online social network                               | 500
LiveJournal1      | LiveJournal online social network                         | 1100

Table 5.2: Experiment datasets' descriptions and sizes

For this experiment, we ran a variation of dataflow Program 1, in which we JOIN two datasets (a dataset from Table 5.2 and a random collection of integers). All the datasets in Table 5.2 contain an id field as an integer, so we join on this id (from the dataset) and the integer (from the collection). The purpose of this experiment was to test the runtime of flink-illustrator over varying input dataset sizes, hence we limited the dataflow to the simple logic of a join on integers. Figure 5.6 shows the difference in runtime (in seconds) between the flink-illustrator module and the Flink job execution. For a given dataflow program, the flink-illustrator module executes on a small sample of the input dataset to produce the concise and complete set of examples, whereas the Flink job execution means the total execution of the dataflow over every input record. Figure 5.6 clearly shows that the illustrator module indeed provides a quick method to overview the dataflow, irrespective of the input dataset sizes, with runtimes ranging from just 0.5 to 2 seconds. This


Figure 5.6: Comparison of illustrator runtime vs. Flink job execution time for varying input sizes

allows the Flink user to verify the dataflow very quickly and avoids the need to run the complete job, with its longer waiting times (around 5 to 40 seconds for the above input sizes), over and over again in order to make the necessary adjustments to the dataflow program.

This verifies our claim that flink-illustrator provides a fast and effective way to understand and learn about the dataflow (and its operators), compared to complete execution of the job, when it comes to large-scale input data.


Chapter 6

Related Work

Work on generating example tuples started as early as 1985 [25], when automatic techniques were introduced and studied to generate a test database for a given query. This work was limited to relational database systems and assisted in testing relational queries. The methods in [25] illustrate the semantics of the complete query, whereas our work illustrates the semantics of every operator in the given dataflow program.

QAGen [26] improves on traditional (trial-and-error) test database generation techniques and generates a query-aware test database that facilitates a wide range of DBMS testing tasks. QAGen takes the query and a set of constraints defined on the query as input, and generates a query-aware test database with all the intermediate query results. This is similar to our approach for dataflow programs, where we take the dataflow program as input and generate example tuples after every intermediate operator.

Recent work on schema mapping via data examples [27] presents good data examples that illustrate schema mappings between two databases. The work in [27] documents the capabilities and limitations of data examples in explaining and understanding schema mappings. This technique could indeed be extended to our work, most notably for UDF mappings, although the main requirement would be to make the schema of each UDF operator available beforehand.

Reverse query processing [28] is closely related to the logic of our upstream pass: it takes a query and a result as input and returns a possible database instance that could have produced that result for the query. In the upstream pass, we likewise know which output will resolve the incompleteness at an operator, and we introduce a possible set of input examples that will produce that output. This work is effective for traditional relational operators (JOIN, CROSS, UNION, AGGREGATIONS), but makes no mention of UDFs.

The work in [29] extends the concepts from [12] by introducing a new technique to generate example data for dataflow programs, known as Symbolic Example Data GEneration (SEDGE), which uses dynamic symbolic execution for example generation. It relies on transforming the program into symbolic constraints and then solving these constraints with a symbolic reasoning engine. Like [12], this work has been tried and tested only on Apache Pig dataflow programs written in Pig Latin.


Chapter 7

Conclusion

This thesis has shown the significance of dataflow programming in current big-data environments. We provided a comparative study of three different big-data systems, Apache Hadoop, Apache Pig and Apache Flink, that take advantage of the dataflow programming paradigm. The presented study makes several noteworthy observations about the challenges of dataflow programming in these systems, namely debugging and visualization, and shows how the example generation technique can overcome the related issues.

Apache Flink provides a comprehensive computing platform that supports fast data-intensive dataflow processing, with inherent support for iterative programming and traditional relational operators (join, cross, union, etc.). This thesis addresses the limitations of dataflow programming in the Apache Flink environment by implementing the example generation "Illustrator" module. Flink-Illustrator provides the user with a set of complete and concise example tuples after each operator, for better understanding and easy visualization of the dataflow.

Example generation is a concept that is difficult to realize and implement in practice. When only basic techniques are used, such as downstream-only or upstream-only propagation, many scenarios lead to the generation of incomplete examples, or of lengthy and repetitive examples, at a given operator. To avoid these situations, we use a combination of downstream, upstream and pruning passes to generate a complete and concise set of examples: the downstream pass takes the initial sampled set of examples and propagates it through the dataflow, the upstream pass handles incompleteness (if any) by introducing synthetic records, and the pruning pass removes the redundant examples (examples that add no extra meaning to the present set).

The algorithm indeed proves efficient compared to the other techniques (downstream-only propagation, sink-only pruning) and generates the most complete and concise set of examples. The generated examples may not always be real (from the input sources), because synthetic examples might be introduced to boost completeness. With respect to runtime, Flink-Illustrator is definitely the way forward for faster debugging and dataflow/operator understanding compared to complete dataflow execution. This is clearly visible in the experimental observations on large input data: the illustrator is 5x - 35x faster than complete execution over increasing input sizes.

We have made use of several Flink features to make our implementation novel and different from the existing one in Pig. Apache Flink allows us to see into UDF operators via its semantic annotations; we use this property in our implementation to read the tuple mappings in the FlatMap UDF operator. For visualization purposes, we integrate our implementation with Flink's Interactive Scala Shell by introducing a new command, ":ILLUSTRATE", that invokes the flink-illustrator module and displays the set of examples in the Scala shell.

We conclude by saying that example generation is indeed a technique that makes life easier for developers working on big-data systems. We extend this technique to dataflow programs in Apache Flink by implementing a tried and tested algorithm to generate a complete and concise set of examples, for better program understanding and visualization.

7.1 Discussion and Future Work

As stated in Section 3.5, our implementation is limited to Flink's Java dataflow programs; we could further extend it to the available Scala API, making flink-illustrator a language-friendly tool. We are also restricted to tuple formats in our input datasets, as they form the basis of our implementation and are traversed throughout the dataflow. It would be a great addition to support POJO input types, using a reflection API such as Trail [30] to access the appropriate properties of the objects. Further additions could be support for the richer UDF operators available in Flink (GroupReduce, etc.) as well as support for iterative dataflows. In order to support new operators (UDFs and iterative ones), we must broaden the language used for defining equivalence classes as well as the rules used to create synthetic records during an upstream pass. We can take advantage of techniques such as reverse query processing [28] or symbolic query processing [26] to enhance these rules. Hence, more research-based feature extensions can be implemented as an advancement of this thesis work, covering further aspects of Flink dataflow programming.


Appendix A

Appendix

A.1 Word-Count Example in Different Dataflow Systems

A.1.1 Apache Pig

A = load 'data.txt' as (line:chararray);
B = foreach A generate TOKENIZE(line) as tokens;
C = foreach B generate flatten(tokens) as words;
D = group C by words;
E = foreach D generate group, COUNT(C);
F = order E by $1;
dump F;

Listing A.1: Word-Count in Apache Pig


A.1.2 Apache Hadoop

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Listing A.2: Word-Count in Apache Hadoop


A.1.3 Apache Flink

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class WordCountExample {

  public static void main(String[] args) throws Exception {
    final ExecutionEnvironment env =
        ExecutionEnvironment.getExecutionEnvironment();

    DataSet<String> text = env.fromElements(
        "Who's there?",
        "I think I hear them. Stand, ho! Who's there?");

    DataSet<Tuple2<String, Integer>> wordCounts = text
        .flatMap(new LineSplitter())
        .groupBy(0)
        .sum(1);

    wordCounts.print();
  }

  public static class LineSplitter
      implements FlatMapFunction<String, Tuple2<String, Integer>> {

    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
      for (String word : line.split(" ")) {
        out.collect(new Tuple2<String, Integer>(word, 1));
      }
    }
  }
}

Listing A.3: Word-Count in Apache Flink


A.2 Sample Scala Script for Flink’s Interactive Scala Shell

import java.util.regex.Pattern

val environ =
  org.apache.flink.api.java.ExecutionEnvironment.getExecutionEnvironment

val input1 = environ.readTextFile("src/test/resources/CrossInput1")
val input2 = environ.readTextFile("src/test/resources/CrossInput2")

class OneReader extends org.apache.flink.api.common.functions.FlatMapFunction
    [String, org.apache.flink.api.java.tuple.Tuple2[Integer, String]] {

  private val SEPARATOR = Pattern.compile("[ \t,]")

  def flatMap(readLineFromFile: String, collector: org.apache.flink.util.Collector
      [org.apache.flink.api.java.tuple.Tuple2[Integer, String]]) {
    if (!readLineFromFile.startsWith("%")) {
      val tokens = SEPARATOR.split(readLineFromFile)
      val id = java.lang.Integer.parseInt(tokens(0))
      val city = tokens(1)
      collector.collect(new org.apache.flink.api.java.tuple.Tuple2[Integer,
        String](id, city))
    }
  }
}

class TwoReader extends org.apache.flink.api.common.functions.FlatMapFunction
    [String, org.apache.flink.api.java.tuple.Tuple2[Integer, Double]] {

  private val SEPARATOR = Pattern.compile("[ \t,]")

  def flatMap(readLineFromFile: String, collector: org.apache.flink.util.Collector
      [org.apache.flink.api.java.tuple.Tuple2[Integer, Double]]) {
    if (!readLineFromFile.startsWith("%")) {
      val tokens = SEPARATOR.split(readLineFromFile)
      val id = java.lang.Integer.parseInt(tokens(0))
      val salary = java.lang.Double.parseDouble(tokens(1))
      collector.collect(new org.apache.flink.api.java.tuple.Tuple2[Integer,
        Double](id, salary))
    }
  }
}

val set1 = input1.flatMap(new OneReader())
val set2 = input2.flatMap(new TwoReader())
val crossSet = set1.cross(set2)

:ILLUSTRATE environ crossSet // Call to Flink-Illustrator

Listing A.4: Scala script from Flink’s Interactive Scala shell with :ILLUSTRATE command usage


A.3 Experiment’s Dataflow Programs

Program 1 (page rank for user-visited urls):

1. LOAD set UserVisits using a csv input source

2. FLATMAP to input tuple format < user, url >

3. DISTINCT on UserVisits

4. LOAD set PageRanks using a csv input source

5. FLATMAP to input tuple format < url, pagerank >

6. DISTINCT on PageRanks

7. JOIN UserVisits and PageRanks
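
The dataflow above can be sketched in plain Java (without Flink) to make the operator semantics concrete. The input data, field names, and the Pair record below are illustrative assumptions, not part of the thesis experiments; DISTINCT is mimicked by distinct() over a record type and JOIN by a hash lookup on the url field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class Program1Sketch {

    // Hypothetical stand-in for a Flink Tuple2; records get equals/hashCode,
    // so distinct() behaves like the DISTINCT operator.
    record Pair(String first, String second) {}

    static List<String> run() {
        // In-memory stand-ins for the two csv input sources (made-up example data).
        List<String> userVisitsCsv = List.of("alice,url1", "alice,url1", "bob,url2");
        List<String> pageRanksCsv  = List.of("url1,0.9", "url2,0.4");

        // Steps 1-3: LOAD UserVisits, FLATMAP to <user, url>, DISTINCT.
        List<Pair> userVisits = userVisitsCsv.stream()
                .map(l -> { String[] t = l.split(","); return new Pair(t[0], t[1]); })
                .distinct()
                .collect(Collectors.toList());

        // Steps 4-6: LOAD PageRanks, FLATMAP to <url, pagerank>, DISTINCT.
        Map<String, String> pageRanks = pageRanksCsv.stream()
                .map(l -> { String[] t = l.split(","); return new Pair(t[0], t[1]); })
                .distinct()
                .collect(Collectors.toMap(Pair::first, Pair::second));

        // Step 7: JOIN the two sets on the url field.
        List<String> joined = new ArrayList<>();
        for (Pair visit : userVisits) {
            String rank = pageRanks.get(visit.second());
            if (rank != null) {
                joined.add(visit.first() + "," + visit.second() + "," + rank);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

In the actual experiments these steps run as Flink DataSet operators; the sketch only mirrors their relational semantics.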

Program 2 (combining two continent’s user-visited urls):

1. LOAD set UserVisits using a csv input source

2. FLATMAP to input tuple format < user, url, count >

3. LOAD set UserVisitsEU using a csv input source

4. FLATMAP to input tuple format < user, url, count >

5. UNION UserVisits and UserVisitsEU sets
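
UNION here has bag semantics: Flink's union keeps duplicate tuples rather than removing them. A minimal plain-Java sketch, with made-up example tuples, illustrating that behavior:

```java
import java.util.ArrayList;
import java.util.List;

public class Program2Sketch {

    // Concatenating the two lists mirrors the bag-union semantics of the
    // UNION operator: duplicates across the inputs are kept, not removed.
    static List<String> run() {
        List<String> userVisits   = List.of("alice,url1,3", "bob,url2,1");
        List<String> userVisitsEU = List.of("carl,url3,2", "alice,url1,3");

        List<String> union = new ArrayList<>(userVisits);
        union.addAll(userVisitsEU); // "alice,url1,3" now appears twice
        return union;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```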

Program 3 (options for available jobs):

1. LOAD set JobSeeker using a csv input source

2. FLATMAP to input tuple format < userID, user >

3. LOAD set Jobs using a csv input source

4. FLATMAP to input tuple format < jobId, salary >

5. CROSS set JobSeeker and Jobs

Program 4 (option for high paying jobs):

1. LOAD set JobSeeker using a csv input source

2. FLATMAP to input tuple format < userID, user >

3. LOAD set Jobs using a csv input source

4. FLATMAP to input tuple format < jobId, salary >

5. FILTER set Jobs on salary field

6. CROSS set JobSeeker and Jobs
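
Steps 5-6 apply a selection before the Cartesian product. The following plain-Java sketch (the field names and the salary threshold are hypothetical, not values from the experiments) mirrors FILTER followed by CROSS:

```java
import java.util.List;
import java.util.stream.Collectors;

public class Program4Sketch {

    // Hypothetical job tuple; an illustrative stand-in for the Jobs set.
    record Job(String id, double salary) {}

    static List<String> run() {
        List<String> jobSeekers = List.of("u1,alice", "u2,bob");
        List<Job> jobs = List.of(new Job("j1", 90000.0), new Job("j2", 40000.0));

        // Step 5: FILTER the Jobs set on the salary field (assumed threshold).
        List<Job> highPaying = jobs.stream()
                .filter(j -> j.salary() > 50000.0)
                .collect(Collectors.toList());

        // Step 6: CROSS JobSeeker with the filtered Jobs set, i.e. every
        // seeker is paired with every remaining job (Cartesian product).
        return jobSeekers.stream()
                .flatMap(s -> highPaying.stream().map(j -> s + "," + j.id()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```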


Program 5 (users visiting high pagerank urls using simple flatmap for

projection):

1. LOAD set UserVisits using a csv input source
2. FLATMAP to input tuple format < user, url >
3. DISTINCT on UserVisits
4. LOAD set PageRanks using a csv input source
5. FLATMAP to input tuple format < url, pagerank >
6. DISTINCT on PageRanks
7. JOIN UserVisits and PageRanks
8. PROJECT joined set in the tuple format as < user, url, pagerank >

Program 6 (users visiting high pagerank urls using function annotated

flatmap for projection):

1. LOAD set UserVisits using a csv input source
2. FLATMAP to input tuple format < user, url >
3. DISTINCT on UserVisits
4. LOAD set PageRanks using a csv input source
5. FLATMAP to input tuple format < url, pagerank >
6. DISTINCT on PageRanks
7. JOIN UserVisits and PageRanks
8. PROJECT joined set in the tuple format as < user, url, pagerank > by setting forwarded fields annotations.

Program 7 (page rank for combined user-visited urls):

1. LOAD set UserVisits using a csv input source

2. FLATMAP to input tuple format < user, url >

3. LOAD set UserVisitsEU using a csv input source

4. FLATMAP to input tuple format < user, url >

5. UNION UserVisits and UserVisitsEU sets

6. LOAD set PageRanks using a csv input source

7. FLATMAP to input tuple format < url, pagerank >

8. JOIN unioned set of UserVisits and UserVisitsEU with Page-Ranks

9. PROJECT joined set


Program 8 (combined user-visits urls with only high page ranks):

1. LOAD set UserVisits using a csv input source

2. FLATMAP to input tuple format < user, url >

3. LOAD set UserVisitsEU using a csv input source

4. FLATMAP to input tuple format < user, url >

5. UNION UserVisits and UserVisitsEU sets

6. LOAD set PageRanks using a csv input source

7. FLATMAP to input tuple format < url, pagerank >

8. FILTER set PageRanks on field pagerank

9. JOIN unioned set of UserVisits and UserVisitsEU with Page-Ranks

10. PROJECT joined set

Program 9 (options for finding jobs in different continents):

1. LOAD set JobSeeker using a csv input source

2. FLATMAP to input tuple format < userID, user >

3. LOAD set Jobs using a csv input source

4. FLATMAP to input tuple format < jobID, city, salary >

5. LOAD set JobsEU using a csv input source

6. FLATMAP to input tuple format < jobID, city, salary >

7. UNION sets Jobs and JobsEU

8. CROSS set JobSeeker and unioned set of Jobs and JobsEU

Program 10:

Variation of Program 1, with no common key (url) between the two input sets, resulting in an empty equivalence class at JOIN.

Program 11:

Variation of Program 2, with no/blank input in one of the files, resulting in an empty equivalence class at UNION.

Program 12:

Variation of Program 3, with no/blank input in one of the files, resulting in an empty equivalence class at CROSS.

Program 13:

Variation of Program 1, with a Hadoop HDFS file as the input source.


A.4 Flink-Illustrator Source Code

Source Code


Bibliography

[1] D. Borthakur, J. Gray, J. S. Sarma, K. Muthukkaruppan, N. Spiegelberg,

H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash, and others,

“Apache Hadoop goes realtime at Facebook,” in Proceedings of the 2011 ACM

SIGMOD International Conference on Management of data, pp. 1071–1080,

ACM, 2011.

[2] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large

clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[3] A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise,

O. Kao, M. Leich, U. Leser, V. Markl, and others, “The Stratosphere platform

for big data analytics,” The VLDB Journal, vol. 23, no. 6, pp. 939–964, 2014.

[4] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark:

cluster computing with working sets,” in Proceedings of the 2nd USENIX

conference on Hot topics in cloud computing, vol. 10, p. 10, 2010.

[5] A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy,

C. Olston, B. Reed, S. Srinivasan, and U. Srivastava, “Building a high-level

dataflow system on top of Map-Reduce: the Pig experience,” Proceedings of

the VLDB Endowment, vol. 2, no. 2, pp. 1414–1425, 2009.

[6] D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee,
M. Stonebraker, N. Tatbul, and S. Zdonik, “Aurora: a new model and architecture
for data stream management,” The VLDB Journal, vol. 12, no. 2, pp. 120–139, 2003.

[7] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: distributed

data-parallel programs from sequential building blocks,” in ACM SIGOPS

Operating Systems Review, vol. 41, pp. 59–72, ACM, 2007.


[8] R. H. Arpaci-Dusseau, “Run-time adaptation in river,” ACM Transactions

on Computer Systems (TOCS), vol. 21, no. 1, pp. 36–86, 2003.

[9] M. Stonebraker, J. Chen, N. Nathan, C. Paxson, and J. Wu, “Tioga: Providing
data management support for scientific visualization applications,” in

VLDB, vol. 93, pp. 25–38, Citeseer, 1993.

[10] D. G. Murray, M. Schwarzkopf, C. Smowton, S. Smith, A. Madhavapeddy,

and S. Hand, “CIEL: A Universal Execution Engine for Distributed Data-Flow
Computing,” in NSDI, vol. 11, pp. 9–9, 2011.

[11] W. Lee, S. J. Stolfo, and K. W. Mok, “Mining in a data-flow environment:

Experience in network intrusion detection,” in Proceedings of the fifth ACM

SIGKDD international conference on Knowledge discovery and data mining,

pp. 114–124, ACM, 1999.

[12] C. Olston, S. Chopra, and U. Srivastava, “Generating example data for

dataflow programs,” in Proceedings of the 2009 ACM SIGMOD International

Conference on Management of data, pp. 245–256, ACM, 2009.

[13] T. B. Sousa, “Dataflow Programming Concept, Languages and Applications,”

in Doctoral Symposium on Informatics Engineering, 2012.

[14] H.-c. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker, “Map-reduce-merge:

simplified relational data processing on large clusters,” in Proceedings of

the 2007 ACM SIGMOD international conference on Management of data,

pp. 1029–1040, ACM, 2007.

[15] D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke,
“Nephele/PACTs: a programming model and execution framework for web-scale
analytical processing,” in Proceedings of the 1st ACM symposium on Cloud
computing, pp. 119–130, ACM, 2010.

[16] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, “HaLoop: efficient iterative

data processing on large clusters,” Proceedings of the VLDB Endowment,

vol. 3, no. 1-2, pp. 285–296, 2010.

[17] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, “Pig latin: a

not-so-foreign language for data processing,” in Proceedings of the 2008 ACM

SIGMOD international conference on Management of data, pp. 1099–1110,

ACM, 2008.


[18] F. Hueske, M. Peters, A. Krettek, M. Ringwald, K. Tzoumas, V. Markl,

and J. Freytag, “Peeking into the optimization of data flow programs with

mapreduce-style udfs,” in Data Engineering (ICDE), 2013 IEEE 29th International
Conference on, pp. 1292–1295, IEEE, 2013.

[19] A. Alexandrov, S. Ewen, M. Heimel, F. Hueske, O. Kao, V. Markl,
E. Nijkamp, and D. Warneke, “MapReduce and PACT - Comparing Data Parallel
Programming Models,” in BTW, pp. 25–44, 2011.

[20] S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl, “Spinning fast iterative

data flows,” Proceedings of the VLDB Endowment, vol. 5, no. 11, pp. 1268–

1279, 2012.

[21] J. Tan, X. Pan, S. Kavulya, R. Gandhi, and P. Narasimhan, “Mochi: visual

log-analysis based tools for debugging hadoop,” in USENIX Workshop on Hot

Topics in Cloud Computing (HotCloud), San Diego, CA, vol. 6, 2009.

[22] S. Chaudhuri, R. Motwani, and V. Narasayya, “On random sampling over

joins,” in ACM SIGMOD Record, vol. 28, pp. 263–274, ACM, 1999.

[23] F. Hueske, A. Krettek, and K. Tzoumas, “Enabling operator reordering
in data flow programs through static code analysis,” arXiv preprint

arXiv:1301.4200, 2013.

[24] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford Large Network
Dataset Collection.”

[25] H. Mannila and K. J. Räihä, “Test data for relational queries,” in Proceedings

of the fifth ACM SIGACT-SIGMOD symposium on Principles of database

systems, pp. 217–223, ACM, 1985.

[26] C. Binnig, D. Kossmann, E. Lo, and M. T. Özsu, “QAGen: generating
query-aware test databases,” in Proceedings of the 2007 ACM SIGMOD
international conference on Management of data, pp. 341–352, ACM, 2007.

[27] B. Alexe, B. Ten Cate, P. G. Kolaitis, and W.-C. Tan, “Designing and refining

schema mappings via data examples,” in Proceedings of the 2011 ACM SIGMOD
International Conference on Management of data, pp. 133–144, ACM, 2011.


[28] C. Binnig, D. Kossmann, and E. Lo, “Reverse query processing,” in Data

Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on,

pp. 506–515, IEEE, 2007.

[29] K. Li, C. Reichenbach, Y. Smaragdakis, Y. Diao, and C. Csallner, “SEDGE:

Symbolic example data generation for dataflow programs,” in Automated Software
Engineering (ASE), 2013 IEEE/ACM 28th International Conference on,

pp. 235–245, IEEE, 2013.

[30] D. Green, “Trail: The Reflection API,” The Java Tutorial Continued: The

Rest of the JDK (TM). Addison-Wesley Pub Co, 1998.