FLINK - A CONVENIENT ABSTRACTION LAYER FOR YARN? VYACHESLAV ZHOLUDEV

Upload: flink-forward

Post on 08-Jan-2017


TRANSCRIPT

Page 1: Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?

FLINK - A CONVENIENT ABSTRACTION LAYER FOR YARN?

VYACHESLAV ZHOLUDEV

Page 2

INTRODUCTION

• YARN opened Hadoop up to many more developers
  • An API to integrate into a Hadoop cluster
  • Flexibility
  • Applications: MR, Tez, Flink, Spark, …

• Flink has made great use of this opportunity
  • Flexible program execution graphs
  • Operators other than Map and Reduce
  • A clean and convenient API
  • Efficient I/O

Page 3

EXPECTATIONS FROM YARN

• New programming models in addition to MapReduce
• More alternatives for the cases that the MapReduce paradigm does not fit well
• Flexibility in expressing operations on data
• Elasticity of the cluster
• The ability to write our own applications that distribute computations across the cluster

Page 4

DISTRIBUTING COMPUTATIONAL TASKS

• Writing your own YARN application is
  • Complicated
  • Tedious
  • Error-prone
• Somebody must have done something simpler
  • Apache Twill
  • Still not simple enough
• Execute CLI tools remotely (if everything else fails)
• Flink?

Page 5

FLINK AT RESEARCHGATE

Lots of benefits:
• Made MapReduce jobs more readable
  • More compact
  • Less boilerplate code
  • Easier to understand and maintain
• Got rid of ugly Hive queries and optimised the runtime
• Better and cleaner orchestration of workflow subtasks (before, we had to glue multiple MR jobs together)
• Iterative machine learning algorithms
• Distributing computational tasks across a cluster

Page 6

REAL USE CASE: MONGODB TO AVRO BRIDGE

Page 7

REAL USE CASE

• In essence:
  • Reads MongoDB documents
  • Converts them to Avro records (based on a provided Avro schema)
  • Persists them on HDFS
• Avrongo's evolution:
  • A single-threaded program
  • A multi-threaded program talking to different shards in parallel
  • Distributed across the cluster
• Reasons for distributing:
  • We were CPU-bound
  • HDFS load distribution

A MongoDB to Avro bridge (aka Avrongo), used to dump live DB data to HDFS for further batch processing and analytics.

Page 8

HOW DOES AVRONGO WORK?

Basic version:
• One thread
• One MongoDB cursor iterating over the whole collection
• Suitable for smaller collections
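The basic flow can be sketched in a few lines of plain Java. `AvroConverter` and the in-memory "cursor" below are illustrative stand-ins, not ResearchGate's actual code: the real tool iterates a live MongoDB cursor and appends Avro records to a file on HDFS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Minimal sketch of the single-threaded MongoDB-to-Avro loop.
public class BasicAvrongo {

    // Hypothetical converter: maps one document to one record
    // according to a provided Avro schema.
    interface AvroConverter<R> {
        R convert(Map<String, Object> document);
    }

    static <R> List<R> dumpCollection(Iterable<Map<String, Object>> cursor,
                                      AvroConverter<R> converter) {
        List<R> records = new ArrayList<>();
        for (Map<String, Object> doc : cursor) { // one cursor, whole collection
            records.add(converter.convert(doc));
        }
        return records; // real code: write each record to an Avro file on HDFS
    }
}
```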

Page 9

MONGODB SHARDS AND CHUNKS

Utilizing MongoDB chunks:
• Controls the load on the MongoDB cluster
• Provides a deterministic way of splitting a collection for input

Page 10

AVRONGO - SHARDED VERSION

• Collect chunk information (the sets of documents living on a particular shard)
• Process each shard's chunks in a separate group of threads
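The "group of threads per shard" idea can be sketched with plain `ExecutorService` pools; all names here are hypothetical, and the chunk processing is stubbed out where the real tool would scan a key range and write Avro to HDFS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the sharded version: one small thread group per shard,
// each group working through that shard's chunks.
public class ShardedAvrongo {

    static int processAllShards(Map<String, List<String>> chunksByShard,
                                int threadsPerShard) {
        AtomicInteger processedChunks = new AtomicInteger();
        List<ExecutorService> pools = new ArrayList<>();
        for (List<String> shardChunks : chunksByShard.values()) {
            // a dedicated thread group per shard bounds the load on that shard
            ExecutorService pool = Executors.newFixedThreadPool(threadsPerShard);
            pools.add(pool);
            for (String chunk : shardChunks) {
                pool.submit(() -> {
                    // real code: open a cursor restricted to this chunk's key
                    // range, convert the documents to Avro, write them to HDFS
                    processedChunks.incrementAndGet();
                });
            }
        }
        for (ExecutorService pool : pools) {
            pool.shutdown();
            try {
                pool.awaitTermination(1, TimeUnit.MINUTES);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return processedChunks.get();
    }
}
```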

Page 11

AVRONGO - FLINK VERSION

• A custom InputFormat that distributes MongoDB chunks uniformly
• A FlatMap operator
• Number of task nodes = (number of shards) × (parallelism per shard)
• A custom generic AvroOutputFormat
• Slower shards receive a bit more attention
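The core of such a custom InputFormat is the split-assignment logic; a minimal sketch (not the real class) is to hand chunks to input splits round-robin so every task gets roughly the same number. In the actual job this logic would live inside an implementation of Flink's `InputFormat` interface.

```java
import java.util.ArrayList;
import java.util.List;

// Uniform distribution of MongoDB chunks over input splits.
public class ChunkSplitter {

    // Distribute `chunks` over `numSplits` groups as evenly as possible.
    static List<List<String>> createSplits(List<String> chunks, int numSplits) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            splits.add(new ArrayList<>());
        }
        for (int i = 0; i < chunks.size(); i++) {
            splits.get(i % numSplits).add(chunks.get(i)); // round-robin assignment
        }
        return splits;
    }
}
```

Since Flink hands unprocessed splits to task slots as they become free, tasks reading from slower shards naturally end up getting a larger share of the scheduler's attention.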

Page 12

FLINK APPROACH

Outcome:
• No longer CPU-bound
• Imports to HDFS are faster
  • Some collections: from 6h to 2.5h, or from 3.5h to 2h

Benefits:
• Very few lines of code
• The same command-line interface (no effort needed to migrate to the Flink-based version)
• Reuses the same converter as the standalone versions
• All orchestration and parallelisation work is done automatically by Flink

Page 13

ANOTHER USE CASE: DISTRIBUTED FILE COPYING

Page 14

HADOOP DISTCP

• Generates a MapReduce job that copies a large amount of data
• A list of files serves as the input to a Map task
• Two types of input format:
  • UniformSizeInputFormat
  • DynamicInputFormat
    • Gives more load to faster mappers
    • Complicated code
    • Uses the file system to feed the mappers

https://hadoop.apache.org/docs/r1.2.1/distcp2.html
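The UniformSizeInputFormat strategy can be sketched as follows: walk the file list in order and cut a new split once the running byte count reaches the target of total bytes divided by the number of splits. The sketch works on plain file sizes; Hadoop's real implementation reads the copy listing from a sequence file.

```java
import java.util.ArrayList;
import java.util.List;

// Uniform-size splitting: each split gets roughly the same byte count.
public class UniformSizeSplitter {

    static List<List<Long>> split(List<Long> fileSizes, int numSplits) {
        long total = 0;
        for (long size : fileSizes) {
            total += size;
        }
        long target = (total + numSplits - 1) / numSplits; // bytes per split, rounded up
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            current.add(size);
            currentBytes += size;
            // cut a split when the target is reached, keeping the last split open
            if (currentBytes >= target && splits.size() < numSplits - 1) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
        }
        splits.add(current);
        return splits;
    }
}
```

Because the assignment is fixed up front, a mapper stuck with slow files cannot hand work to faster mappers; that is the gap DynamicInputFormat fills.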

Page 15

FLINK DISTCP

• Implements the same logic as the DynamicInputFormat of Hadoop's distcp
• Far fewer lines of code
• The same runtime as Hadoop distcp
• Available in the Flink Java examples
• Not fault-tolerant (yet)

https://github.com/apache/flink/tree/master/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/distcp
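The dynamic-assignment idea that the example replicates boils down to a pull model: workers take the next file from a shared queue as soon as they finish the previous one, so faster workers copy more files. The sketch below uses a thread pool as a stand-in for parallel Flink subtasks, with the actual copy stubbed out.

```java
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Pull-based file copying: no static assignment of files to workers.
public class DynamicCopy {

    static int copyAll(List<String> files, int workers) {
        ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>(files);
        AtomicInteger copied = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                String file;
                while ((file = pending.poll()) != null) { // pull the next file
                    // real code: copy `file` from the source FS to the target FS
                    copied.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return copied.get();
    }
}
```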

Page 16

CONCLUSIONS

Page 17

CONCLUSIONS

• Flink: a thin layer for implementing your own YARN application that parallelises independent tasks on the cluster
  • Thanks to custom input formats, which are easy to implement
  • No boilerplate code

Would be nice to have:
• Elasticity
• Better progress tracking
• Fault tolerance

Custom input format + a Flink operator with business logic = happiness

Page 18

QUESTIONS?

https://www.researchgate.net/careers