Apache AVRO (Boston HUG, Jan 19, 2010)


Upload: cloudera-inc

Post on 10-May-2015


TRANSCRIPT

Page 1: Apache AVRO (Boston HUG, Jan 19, 2010)

Apache AVRO
What's new?

Philip Zeyliger, Cloudera (AVRO committer)

Boston HUG
January 19, 2010

Page 2: Apache AVRO (Boston HUG, Jan 19, 2010)

What's AVRO?

A data serialization system

Includes:
* A schema language
* A compact serialized form
* An RPC framework
* A handful of APIs, in a handful of languages

Goals:
* Cross-language
* Support for dynamic access
* Simple but expressive schema evolution

Same "space" as Apache Thrift, Google Protocol Buffers, Binary JSON, and XDR. Subtle differences with all of them.

Page 3: Apache AVRO (Boston HUG, Jan 19, 2010)

AVRO Protocols & Schemas

    @namespace("org.apache.avro.demo")
    protocol CurrencyConversion {
      enum Currency { USD, GBP, EUR, JPY }

      record Money {
        Currency currency;
        int amount;
      }

      error UnknownRateError {
        Currency currency;
      }

      Money convert(Money input, Currency targetCurrency) throws UnknownRateError;
      double rate(Currency input, Currency output) throws UnknownRateError;
    }

"genavro" IDL (AVRO-258)

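To make the "compact serialized form" concrete, here is a minimal sketch of how a Money value from the protocol above could be laid out in Avro's binary encoding: an enum is written as the index of its symbol, an int as a zig-zag varint, and a record as its fields back to back with no tags or names. The helper names `encode_int` and `encode_money` are made up for this illustration and are not part of any Avro API.

```python
# Symbols in the order declared by the Currency enum.
CURRENCY_SYMBOLS = ["USD", "GBP", "EUR", "JPY"]


def encode_int(n):
    """Zig-zag encode a signed 32-bit int, then write it as a base-128 varint."""
    zigzag = (n << 1) ^ (n >> 31)
    out = bytearray()
    while zigzag & ~0x7F:
        out.append((zigzag & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        zigzag >>= 7
    out.append(zigzag)
    return bytes(out)


def encode_money(currency, amount):
    """Hypothetical helper: a record is just its fields, in schema order."""
    return encode_int(CURRENCY_SYMBOLS.index(currency)) + encode_int(amount)


if __name__ == "__main__":
    print(encode_money("EUR", 100).hex())  # prints 04c801
```

Three bytes carry the whole record because the schema, not the data, says which field is which.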
Page 4: Apache AVRO (Boston HUG, Jan 19, 2010)

    $ java -jar avro-tools-1.2.0-dev.jar genavro < demo.genavro
    {
      "protocol" : "CurrencyConversion",
      "namespace" : "org.apache.avro.demo",
      "types" : [ {
        "type" : "enum",
        "name" : "Currency",
        "symbols" : [ "USD", "GBP", "EUR", "JPY" ]
      }, {
        "type" : "record",
        "name" : "Money",
        "fields" : [ {
          "name" : "currency",
          "type" : "Currency"
        }, {
          "name" : "amount",
          "type" : "int"
        } ]
      }, {
        "type" : "error",
        "name" : "UnknownRateError",
        "fields" : [ {
          "name" : "currency",
          "type" : "Currency"
        } ]
      } ],
      "messages" : {
        "convert" : {
          "request" : [ {
            "name" : "input",
            "type" : "Money"
          }, {
            "name" : "targetCurrency",
            "type" : "Currency"
          } ],
          "response" : "Money",
          "errors" : [ "UnknownRateError" ]
        },
        "rate" : {
          "request" : [ {
            "name" : "input",
            "type" : "Currency"
          }, {
            "name" : "output",
            "type" : "Currency"
          } ],
          "response" : "double",
          "errors" : [ "UnknownRateError" ]
        }
      }
    }

JSON Representation of Protocol and Schemas
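Because the protocol is ordinary JSON, any JSON library can inspect it without Avro-specific tooling. The sketch below uses only Python's standard `json` module to turn the messages back into readable signatures. `describe_messages` is a hypothetical helper written for this example, and the protocol text is a compacted copy of the genavro output.

```python
import json

PROTOCOL_JSON = """
{
  "protocol": "CurrencyConversion",
  "namespace": "org.apache.avro.demo",
  "types": [
    {"type": "enum", "name": "Currency", "symbols": ["USD", "GBP", "EUR", "JPY"]},
    {"type": "record", "name": "Money", "fields": [
      {"name": "currency", "type": "Currency"},
      {"name": "amount", "type": "int"}]},
    {"type": "error", "name": "UnknownRateError", "fields": [
      {"name": "currency", "type": "Currency"}]}
  ],
  "messages": {
    "convert": {
      "request": [{"name": "input", "type": "Money"},
                  {"name": "targetCurrency", "type": "Currency"}],
      "response": "Money",
      "errors": ["UnknownRateError"]},
    "rate": {
      "request": [{"name": "input", "type": "Currency"},
                  {"name": "output", "type": "Currency"}],
      "response": "double",
      "errors": ["UnknownRateError"]}
  }
}
"""


def describe_messages(protocol_text):
    """Hypothetical helper: one signature string per message, sorted by name."""
    protocol = json.loads(protocol_text)
    signatures = []
    for name, message in sorted(protocol["messages"].items()):
        params = ", ".join(p["type"] + " " + p["name"] for p in message["request"])
        errors = ", ".join(message.get("errors", []))
        signatures.append(
            message["response"] + " " + name + "(" + params + ") throws " + errors
        )
    return signatures


if __name__ == "__main__":
    for signature in describe_messages(PROTOCOL_JSON):
        print(signature)
```

The two strings it produces match the two message declarations in the IDL form, apart from the trailing semicolons.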

Page 5: Apache AVRO (Boston HUG, Jan 19, 2010)

Types

primitive
* string
* bytes
* int & long
* float & double
* boolean
* null

complex
* record
* array
* map: string -> T
* union
* fixed<N>
* enum
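As a rough illustration of how some of the primitive types look on the wire, the sketch below hand-rolls Avro's binary encoding for long, string, double, and boolean using only the Python standard library. The function names are invented for this example; a real program would use an Avro library rather than this code.

```python
import struct


def encode_long(n):
    """Zig-zag encode a signed 64-bit value, then write a base-128 varint."""
    zigzag = (n << 1) ^ (n >> 63)
    out = bytearray()
    while zigzag & ~0x7F:
        out.append((zigzag & 0x7F) | 0x80)  # 7 payload bits plus continuation bit
        zigzag >>= 7
    out.append(zigzag)
    return bytes(out)


def decode_long(data):
    """Inverse of encode_long for a bytes object that starts with a varint."""
    accum = 0
    shift = 0
    for byte in data:
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)


def encode_string(text):
    """A string is its UTF-8 byte length (as a long) followed by the bytes."""
    raw = text.encode("utf-8")
    return encode_long(len(raw)) + raw


def encode_double(value):
    """A double is 8 bytes, little-endian IEEE 754."""
    return struct.pack("<d", value)


def encode_boolean(value):
    """A boolean is a single byte, 1 for true and 0 for false."""
    return b"\x01" if value else b"\x00"


if __name__ == "__main__":
    print(encode_long(64).hex())       # prints 8001
    print(encode_string("foo").hex())  # prints 06666f6f
```

Small magnitudes, positive or negative, take one byte, which is where much of the compactness comes from. A null is written as zero bytes.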

Page 6: Apache AVRO (Boston HUG, Jan 19, 2010)

Schema Evolution & Projection

* AVRO binary data never travels without its schema. This allows dynamic tooling.
* Writer's Schema and Reader's Schema may be different.

Writer's schema:

    {
      "type" : "record",
      "name" : "Person",
      "fields" : [
        { "name" : "first", "type" : "string" },
        { "name" : "sport", "type" : "string" }
      ]
    }

Serialized Data:

"Alice", "Ultimate Frisbee"

Reader's schema:

    {
      "type" : "record",
      "name" : "Person",
      "fields" : [
        { "name" : "first", "type" : "string" },
        { "name" : "age", "type" : "int", "default" : 0 }
      ]
    }

Data presented to application:

"Alice", 0

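The Person example on this slide can be reproduced with a small sketch of schema resolution. It assumes records whose fields are only strings and ints, and the function names (`write_record`, `read_record`) are invented for the illustration. Real Avro resolution also handles type promotion, unions, nested types, and aliases, none of which appear here.

```python
import io


def write_long(out, n):
    """Append a zig-zag varint to a bytearray."""
    zigzag = (n << 1) ^ (n >> 63)
    while zigzag & ~0x7F:
        out.append((zigzag & 0x7F) | 0x80)
        zigzag >>= 7
    out.append(zigzag)


def read_long(stream):
    """Read a zig-zag varint from a binary stream."""
    accum, shift = 0, 0
    while True:
        byte = stream.read(1)[0]
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (accum >> 1) ^ -(accum & 1)
        shift += 7


def write_string(out, text):
    raw = text.encode("utf-8")
    write_long(out, len(raw))
    out.extend(raw)


def read_string(stream):
    return stream.read(read_long(stream)).decode("utf-8")


WRITERS = {"string": write_string, "int": write_long}
READERS = {"string": read_string, "int": read_long}


def write_record(schema, record):
    """Serialize the fields in the writer's declared order."""
    out = bytearray()
    for field in schema["fields"]:
        WRITERS[field["type"]](out, record[field["name"]])
    return bytes(out)


def read_record(writer_schema, reader_schema, data):
    """Decode with the writer's schema, then project onto the reader's schema."""
    stream = io.BytesIO(data)
    decoded = {}
    for field in writer_schema["fields"]:
        # Every written field must be consumed, even if the reader drops it.
        decoded[field["name"]] = READERS[field["type"]](stream)
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in decoded:
            result[name] = decoded[name]
        elif "default" in field:
            result[name] = field["default"]  # field unknown to the writer
        else:
            raise ValueError("no value or default for field " + name)
    return result


WRITER = {"type": "record", "name": "Person", "fields": [
    {"name": "first", "type": "string"},
    {"name": "sport", "type": "string"}]}
READER = {"type": "record", "name": "Person", "fields": [
    {"name": "first", "type": "string"},
    {"name": "age", "type": "int", "default": 0}]}

if __name__ == "__main__":
    data = write_record(WRITER, {"first": "Alice", "sport": "Ultimate Frisbee"})
    print(read_record(WRITER, READER, data))  # prints {'first': 'Alice', 'age': 0}
```

The reader cannot skip "sport" without the writer's schema, since nothing in the bytes marks where one field ends and the next begins. That is why the data always travels with its schema.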
Page 7: Apache AVRO (Boston HUG, Jan 19, 2010)

APIs

Python
* Dynamic

Java
* Specific (generated code)
* Generic (container-based)
* Reflection (induces schemas from classes)

C
C++
Ruby

Page 8: Apache AVRO (Boston HUG, Jan 19, 2010)

C API

    #include <avro.h>

    /* Write a string datum into an in-memory buffer. */
    char buf[64];
    avro_writer_t writer = avro_writer_memory(buf, sizeof(buf));
    avro_schema_t writers_schema = avro_schema_string();
    avro_datum_t datum = avro_string("Hello, world!");
    avro_write_data(writer, writers_schema, datum);

    /* Read it back, resolving the writer's schema against the reader's. */
    avro_reader_t reader = avro_reader_memory(buf, sizeof(buf));
    avro_schema_t readers_schema = avro_schema_string();
    avro_datum_t read_datum;
    avro_read_data(reader, writers_schema, readers_schema, &read_datum);

Page 9: Apache AVRO (Boston HUG, Jan 19, 2010)

Data File Format (AVRO-160)

Features:
* Splittable (important for Hadoop!)
* Append only with same schema.
* Compression
* Arbitrary metadata
* Simple
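Splittability comes from sync markers between blocks: a reader dropped at an arbitrary byte offset scans forward to the next marker and starts decoding there. The sketch below shows the idea with a deliberately simplified layout, namely a marker, then repeated (4-byte length, payload, marker). This is an assumption made for illustration and is not the actual Avro container layout, which also carries a header with metadata and the schema, plus per-block object counts. `write_file` and `read_split` are hypothetical names.

```python
import os
import struct

SYNC_SIZE = 16  # marker length used by this sketch


def write_file(blocks, sync):
    """Lay out: sync marker, then for each block a 4-byte length, payload, sync."""
    out = bytearray(sync)
    for payload in blocks:
        out += struct.pack(">I", len(payload))
        out += payload
        out += sync
    return bytes(out)


def read_split(data, start, end):
    """Return the payloads owned by the byte range [start, end).

    A block is owned by the split that contains the first byte of the sync
    marker immediately preceding it, so every block is read exactly once.
    """
    sync = data[:SYNC_SIZE]          # the file starts with its own marker
    marker = data.find(sync, start)  # skip forward to the next block boundary
    payloads = []
    while marker != -1 and marker < end:
        pos = marker + SYNC_SIZE
        if pos >= len(data):         # that was the trailing marker
            break
        (length,) = struct.unpack_from(">I", data, pos)
        pos += 4
        payloads.append(data[pos:pos + length])
        marker = pos + length        # the next marker follows the payload
    return payloads


if __name__ == "__main__":
    blocks = [b"first block", b"second block", b"third block"]
    data = write_file(blocks, os.urandom(SYNC_SIZE))
    middle = len(data) // 2
    # Two independent readers, as two map tasks would be.
    print(read_split(data, 0, middle))
    print(read_split(data, middle, len(data)))
```

A random 16-byte marker makes an accidental match inside a payload extremely unlikely, which is what lets the scan work without parsing the preceding bytes.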

Page 10: Apache AVRO (Boston HUG, Jan 19, 2010)

Hadoop Integration

Users
* AvroInputFormat/AvroOutputFormat (MR-815)
* Using AVRO in the shuffle (MR-1126)

Note that AVRO schemas let you specify sort order; hand-written binary comparators are a thing of the past.

* Many Writables can be AVRO+Reflection instead
* AVRO sort order leaves hand-writing RawComparators in the past; for Streaming, you now get fast comparators for free!

Framework
* AVRO for Hadoop RPC (e.g., HDFS-982)

Goals
* Open up protocols for cross-language use
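The sort-order point above rests on the per-field "order" attribute in a record schema, which may be "ascending" (the default), "descending", or "ignore". The sketch below applies that rule to already-decoded records. It is an illustration only: the schema is made up for this example, `compare_records` is a hypothetical helper, and the benefit described on the slide comes from comparing the serialized bytes directly, which this sketch does not attempt.

```python
from functools import cmp_to_key

# Hypothetical schema: sort by age (oldest first), then by first name;
# the sport field does not take part in the ordering.
PERSON_SCHEMA = {"type": "record", "name": "Person", "fields": [
    {"name": "age", "type": "int", "order": "descending"},
    {"name": "first", "type": "string"},
    {"name": "sport", "type": "string", "order": "ignore"}]}


def compare_records(schema, a, b):
    """Compare field by field in schema order; return -1, 0, or 1."""
    for field in schema["fields"]:
        order = field.get("order", "ascending")
        if order == "ignore":
            continue
        x, y = a[field["name"]], b[field["name"]]
        result = (x > y) - (x < y)
        if result:
            return -result if order == "descending" else result
    return 0


def sort_records(schema, records):
    """Sort a list of record dicts by the schema-defined order."""
    return sorted(records, key=cmp_to_key(lambda a, b: compare_records(schema, a, b)))


if __name__ == "__main__":
    people = [
        {"first": "Bob", "age": 25, "sport": "Chess"},
        {"first": "Carol", "age": 30, "sport": "Rowing"},
        {"first": "Alice", "age": 30, "sport": "Ultimate Frisbee"},
    ]
    print([p["first"] for p in sort_records(PERSON_SCHEMA, people)])
    # prints ['Alice', 'Carol', 'Bob']
```

Since the ordering is declared in the schema, every language binding can derive the same comparator instead of each job shipping its own.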

Page 11: Apache AVRO (Boston HUG, Jan 19, 2010)

avro-tools

Available tools:
    compile      Generates Java code for the given schema.
    fragtojson   Renders a binary-encoded Avro datum as JSON.
    fromjson     Reads JSON records and writes an Avro data file.
    genavro      Generates a JSON schema from a GenAvro file.
    getschema    Prints out schema of an Avro data file.
    induce       Induce a schema/protocol from Java class/interface.
    jsontofrag   Renders a JSON-encoded Avro datum as binary.
    rpcreceive   Opens an HTTP RPC Server and listens for one message.
    rpcsend      Sends a single RPC message.
    tojson       Dumps an Avro data file as JSON, one record per line.

Page 12: Apache AVRO (Boston HUG, Jan 19, 2010)

1.3 to be released soon...

Good time to try it out!

What's evolving?
* Trying not to evolve the serialized format.
* APIs are evolving.
* Transports are evolving.

Page 13: Apache AVRO (Boston HUG, Jan 19, 2010)

Obligatory Links

Web page: http://hadoop.apache.org/avro/
Mailing list: [email protected]
Repository: http://svn.apache.org/repos/asf/hadoop/avro/

Page 14: Apache AVRO (Boston HUG, Jan 19, 2010)

Thanks!

Questions?

Philip Zeyliger
[email protected]