Page 1:

Avro Apache Course: Distributed class

Student ID: AM20144203

Name: Azzaya Galbazar

2014.12.17

Page 2:

Overview: What is Avro?

Avro is an Apache open source project that provides two services for Hadoop: data serialization and data exchange.

Avro is a relatively recent serialization system.

Interoperability:

Can serialize into Avro/Binary or Avro/JSON.

Supports reading and writing Protocol Buffers and Thrift data.

Page 3:

Overview: What does Avro provide?

Rich data structures, with schemas defined in JSON.

A compact, fast binary format.

A container file, to store persistent data.

Remote procedure call (RPC).

Simple integration with dynamic languages.

Code generation is not required to read or write data files, nor to use or implement RPC protocols.

Code generation is an optional optimization, only worth implementing for statically typed languages.
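To make "compact binary format" concrete, here is a minimal sketch of how Avro's binary encoding represents a long (zig-zag coding followed by a variable-length byte sequence) and a string (byte length as a long, then the UTF-8 bytes). The class and method names are hypothetical and use only the Java standard library; the real Avro library does this internally in its encoder classes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of the Avro library.
class AvroBinarySketch {

    // Zig-zag maps signed values to unsigned ones so that small
    // magnitudes stay small: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static long zigZag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    // Writes a long as a zig-zag varint: 7 payload bits per byte,
    // with the high bit set on every byte except the last one.
    static void writeLong(long n, ByteArrayOutputStream out) {
        long z = zigZag(n);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    // Writes a string as its UTF-8 byte length (encoded as a long)
    // followed by the UTF-8 bytes themselves.
    static void writeString(String s, ByteArrayOutputStream out) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(utf8.length, out);
        out.write(utf8, 0, utf8.length);
    }

    static byte[] encodeLong(long n) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(n, out);
        return out.toByteArray();
    }

    static byte[] encodeString(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(s, out);
        return out.toByteArray();
    }

    // Formats bytes as space-separated lowercase hex for display.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encodeLong(1)));       // 02
        System.out.println(hex(encodeLong(-64)));     // 7f
        System.out.println(hex(encodeLong(64)));      // 80 01
        System.out.println(hex(encodeString("foo"))); // 06 66 6f 6f
    }
}
```

Small values take a single byte, and no type tag or field name is written; the reader relies on the schema to know what comes next.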

Page 4:

Overview

Avro uses JSON as its interface description language (IDL):

To specify data types.

To specify protocols.

Review: JavaScript Object Notation (JSON) is a lightweight, text-based standard for data interchange.

Page 5:

Overview: Why the need for Avro?

Its primary use is in Hadoop, where it provides a standard:

Serialization format for persistent data.

Wire format for communication:

Among Hadoop nodes.

From client programs to Hadoop services.

Page 6:

Overview

Avro relies on schemas.

The schema is stored with the data.

Each datum is written with no per-value overhead.

Thus serialization is fast and the output is small.

Avro in RPC:

Schemas are exchanged during the client-server handshake.

Correspondence between fields can be easily resolved.
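The idea that field correspondence "can be easily resolved" can be sketched as follows. Because both schemas are known after the handshake, the reader matches fields by name: writer fields the reader does not know are ignored, and reader fields the writer did not send fall back to the reader's default. This is a simplified illustration using plain maps; the class name, method name, and sample field names are hypothetical, and real Avro resolution also checks types and treats a missing field without a default as an error.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class, not part of the Avro library.
class SchemaResolutionSketch {

    // Projects a written record onto the reader's fields.
    // readerDefaults maps each reader field name to its default value
    // (a simplification: real schemas also carry a type per field).
    static Map<String, Object> resolve(Map<String, Object> written,
                                       Map<String, Object> readerDefaults) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerDefaults.entrySet()) {
            String name = field.getKey();
            // Take the written value when the writer had this field,
            // otherwise use the reader's default.
            Object value = written.containsKey(name)
                    ? written.get(name)
                    : field.getValue();
            result.put(name, value);
        }
        return result;
    }

    public static void main(String[] args) {
        // Record as the writer produced it (sample values for illustration).
        Map<String, Object> written = new LinkedHashMap<>();
        written.put("title", "Sample Title");
        written.put("pages", 120);

        // Fields the reader expects, with their defaults.
        Map<String, Object> readerDefaults = new LinkedHashMap<>();
        readerDefaults.put("title", "");
        readerDefaults.put("year", 0);

        // "pages" is dropped, "year" takes its default.
        System.out.println(resolve(written, readerDefaults));
    }
}
```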

Page 7:

Overview: APIs

APIs are available for:

Java, C, C++, C#, Python, Ruby

Page 8:

Specification

A schema is represented in JSON by one of:

A JSON string, naming a defined type.

A JSON object, of the form: {"type": "typeName", ...attributes...}

A JSON array, representing a union of embedded types.

Primitive types: null, boolean, int, long, float, double, bytes, string

Complex types: records, enums, arrays, maps, unions, fixed

Page 9:

Apache Avro with Maven Java

Apache Maven is a software project management and comprehension tool.

Based on the concept of a project object model (POM), Maven can manage a project's build, reporting, and documentation from a central piece of information.

Page 10-11:

Apache Avro with Maven Java

1. Add two entries to pom.xml: one is the Apache Avro library dependency, the other is the Maven plugin that allows us to generate Java classes.
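The pom.xml from the original slides is not preserved in this transcript. A fragment along the following lines would add the Avro library and the code-generation plugin. The version number and the source and output directories are assumptions; adjust them to the Avro release and project layout you use.

```xml
<!-- Fragment of pom.xml. The version and directories are assumptions. -->
<dependencies>
  <dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.7.7</version>
  </dependency>
</dependencies>

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-maven-plugin</artifactId>
      <version>1.7.7</version>
      <executions>
        <execution>
          <!-- Generate Java classes from schema files before compilation -->
          <phase>generate-sources</phase>
          <goals>
            <goal>schema</goal>
          </goals>
          <configuration>
            <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
            <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

Strictly speaking, the plugin is declared under the build section rather than as a dependency, which is why it appears in its own block here.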

Page 12:

Apache Avro with Maven Java

2. Defining a schema

# A schema file can only contain a single schema definition.
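The schema shown on the original slide is not preserved in this transcript. As an illustration only, a schema file for a book record might look like the sketch below. The namespace, record name, and every field name are hypothetical. It uses all three JSON forms from the specification: strings naming a type ("string", "int"), an object (the array type), and an array (the union of null and string).

```json
{
  "namespace": "example.avro",
  "type": "record",
  "name": "Book",
  "fields": [
    {"name": "title", "type": "string"},
    {"name": "author", "type": ["null", "string"], "default": null},
    {"name": "pages", "type": "int"},
    {"name": "tags", "type": {"type": "array", "items": "string"}}
  ]
}
```

Schema files conventionally use the .avsc extension and would be placed in the source directory that the Maven plugin is configured to read.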

Page 13-14:

Apache Avro with Maven Java

3. Serializing to and deserializing from a file

# Serializes a book to a file, then deserializes it and prints it to the output.

Page 15:

Apache Avro with Maven Java

Describing the classes used

# DatumWriter converts Java objects into an in-memory serialized format.

# SpecificDatumWriter extracts the schema from the specified generated type.

# DataFileWriter writes the serialized records, as well as the schema, to the file.

Page 16:

Apache Avro with Maven Java

4. Running the example code

5. Result output

Page 17:

Thank you for your attention