apachecon09: avro

Post on 10-May-2015

6.050 Views

Category:

Technology

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

Doug Cutting presents Avro at the 2009 ApacheCon US.

TRANSCRIPT

State of the Elephant

Doug CuttingCloudera

2009 Hadoop Milestones

● A 2009 Sort Champion (O'Malley)● won 100TB “Gray” sort @ .578TB/minute● won “Minute” sort with 500GB

● Split Core Project in Three● Common, HDFS & MapReduce

● Released 0.18.3, 0.19.[0-2], 0.20.[0-1]● Many meetups, conferences, etc.● Yada, yada, yada.

Goals for 2010 and beyond

● IMHO, YMMV, IANAM, etc.

● Concrete● 1.0 release

– compatible APIs & RPCs for > 1 year

– Kerberos-based authentication

● Abstract● faster, more reliable, available● easier sharing

– of data & hardware resources

● spreadsheet-like interfaces– provide non-programmers

– with powerful, interactive tools

Abstract Requirements

● security● facilitate sharing of resources

● stable cross-language APIs● facilitate diverse tools & apps

● expressive, inter-operable data● facilitates sharing of datasets● facilitates dynamic analyses

Data Formats

● today in Hadoop:● text

– pro: inter-operable– con: not expressive, inefficient

● Java Writable– pro: expressive, efficient– con: platform-specific, fragile

Protocol Buffers & Thrift

● expressive● efficient (small & fast)● but not very dynamic

● cannot browse arbitrary data● no DESCRIBE or SHOW● viewing a new dataset

– requires code generation & load● writing a new dataset

– requires generating schema text– plus code generation & load

Avro Data

● as expressive● smaller and faster● dynamic

● schema stored with data– but factored out of instances

● APIs permit reading & creating– arbitrary datatypes– without generating & loading code

Avro Data

● includes a file format● includes a textual encoding● handles versioning

● if schema changes● can still process data

● hope Hadoop apps will● upgrade from text; &● and standardize on Avro for data

Avro RPC

● leverage versioning support● to permit different versions of services to

interoperate

● for Hadoop services, will● provide cross-language access● let apps talk to clusters running different versions

Avro Status

● Java & Python APIs● C & C++ APIs making rapid progress

● 1.1 release● added JSON data and comparators

● 1.2 release● added HTTP & UDP-based RPC

● included in Hadoop 0.21● as format for job history● in sequence files

Avro Near Future

● full mapreduce support for Avro data● enables fast comparators for non-Java apps

● Avro RPC used in Hadoop 0.22 (1.0)?● provides compatibility; &● native access from non-Java

Thanks!

hadoop.apache.org/avro

top related