Doug Cutting presents Avro at the 2009 ApacheCon US.

Page 1: ApacheCon09: Avro

State of the Elephant

Doug Cutting, Cloudera

2009 Hadoop Milestones

● A 2009 Sort Champion (O'Malley)
  ● won 100TB “Gray” sort @ 0.578 TB/minute
  ● won “Minute” sort with 500GB
● Split Core Project in Three
  ● Common, HDFS & MapReduce
● Released 0.18.3, 0.19.[0-2], 0.20.[0-1]
● Many meetups, conferences, etc.
● Yada, yada, yada.

Goals for 2010 and beyond

● IMHO, YMMV, IANAM, etc.
● Concrete
  ● 1.0 release
    – compatible APIs & RPCs for > 1 year
    – Kerberos-based authentication
● Abstract
  ● faster, more reliable, available
  ● easier sharing
    – of data & hardware resources
  ● spreadsheet-like interfaces
    – provide non-programmers
    – with powerful, interactive tools

Abstract Requirements

● security
  ● facilitates sharing of resources
● stable cross-language APIs
  ● facilitate diverse tools & apps
● expressive, inter-operable data
  ● facilitates sharing of datasets
  ● facilitates dynamic analyses

Data Formats

● today in Hadoop:
  ● text
    – pro: inter-operable
    – con: not expressive, inefficient
  ● Java Writable
    – pro: expressive, efficient
    – con: platform-specific, fragile

Protocol Buffers & Thrift

● expressive
● efficient (small & fast)
● but not very dynamic
  ● cannot browse arbitrary data
  ● no DESCRIBE or SHOW
  ● viewing a new dataset
    – requires code generation & load
  ● writing a new dataset
    – requires generating schema text
    – plus code generation & load

Avro Data

● as expressive
● smaller and faster
● dynamic
  ● schema stored with data
    – but factored out of instances
  ● APIs permit reading & creating
    – arbitrary datatypes
    – without generating & loading code
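A minimal sketch of the "schema stored with data, but factored out of instances" idea, in plain Python with no Avro library. The schema is written once as JSON, and each record is encoded by walking that schema, so the encoded instance carries no field names or tags. The zig-zag variable-length encoding for `long` and the length-prefixed UTF-8 encoding for `string` follow the Avro specification's binary encoding; the helper names and the `User` schema are hypothetical and only cover two primitive types.

```python
import io
import json

# Hypothetical schema for illustration, written in Avro's JSON schema style.
SCHEMA = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "name", "type": "string"},
            {"name": "age",  "type": "long"}]}
""")


def encode_long(n, out):
    """Write a 64-bit signed integer as a zig-zag varint."""
    z = (n << 1) ^ (n >> 63)          # zig-zag: small magnitudes stay small
    while z & ~0x7F:                  # more than 7 bits left
        out.write(bytes([(z & 0x7F) | 0x80]))
        z >>= 7
    out.write(bytes([z]))


def decode_long(inp):
    """Read a zig-zag varint and return the signed integer."""
    shift = 0
    z = 0
    while True:
        b = inp.read(1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)


def encode_value(type_name, value, out):
    if type_name == "long":
        encode_long(value, out)
    elif type_name == "string":
        data = value.encode("utf-8")
        encode_long(len(data), out)   # byte length, then the bytes
        out.write(data)
    else:
        raise ValueError(f"type not covered by this sketch: {type_name}")


def decode_value(type_name, inp):
    if type_name == "long":
        return decode_long(inp)
    if type_name == "string":
        length = decode_long(inp)
        return inp.read(length).decode("utf-8")
    raise ValueError(f"type not covered by this sketch: {type_name}")


def encode_record(schema, record):
    """Encode field values in schema order; no names or tags are written."""
    out = io.BytesIO()
    for field in schema["fields"]:
        encode_value(field["type"], record[field["name"]], out)
    return out.getvalue()


def decode_record(schema, data):
    """Decode by walking the same schema; works for any schema read at runtime."""
    inp = io.BytesIO(data)
    return {f["name"]: decode_value(f["type"], inp) for f in schema["fields"]}
```

With this sketch, `encode_record(SCHEMA, {"name": "Ada", "age": 36})` produces 5 bytes: one length byte, three bytes of UTF-8, and one byte for the age. Because `decode_record` only needs the schema as data, a new dataset can be read without generating or loading any code.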

Avro Data

● includes a file format
● includes a textual encoding
● handles versioning
  ● if schema changes
  ● can still process data
● hope Hadoop apps will
  ● upgrade from text; &
  ● standardize on Avro for data
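A rough sketch of how "if schema changes, can still process data" can work for records, assuming the record-resolution rules described in the Avro specification: fields are matched by name, a writer field the reader does not know is ignored, and a reader field the writer lacks takes the reader's default. The function name `resolve_record` and both schemas are hypothetical, and the sketch works on already-decoded dictionaries rather than on binary data.

```python
# Hypothetical "old" schema used when the data was written.
WRITER_SCHEMA = {
    "type": "record", "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "long"},
    ],
}

# Hypothetical "new" schema used by the reading application:
# "age" was dropped and "email" was added with a default.
READER_SCHEMA = {
    "type": "record", "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string", "default": ""},
    ],
}


def resolve_record(writer_schema, reader_schema, written):
    """Project a record written with writer_schema onto reader_schema."""
    writer_names = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_names:
            # Field exists in both versions: take the written value.
            resolved[name] = written[name]
        elif "default" in field:
            # Field is new to the reader: fall back to its default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no written value or default for field {name!r}")
    # Writer fields missing from the reader schema are simply not copied.
    return resolved


old_record = {"name": "Ada", "age": 36}
new_view = resolve_record(WRITER_SCHEMA, READER_SCHEMA, old_record)
```

Here `new_view` is `{"name": "Ada", "email": ""}`: the old data remains readable under the new schema. This only works because the writer's schema travels with the data, so the reader always has both schemas to compare.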

Avro RPC

● leverage versioning support
  ● to permit different versions of services to interoperate
● for Hadoop services, will
  ● provide cross-language access
  ● let apps talk to clusters running different versions

Avro Status

● Java & Python APIs
● C & C++ APIs making rapid progress
● 1.1 release
  ● added JSON data and comparators
● 1.2 release
  ● added HTTP & UDP-based RPC
● included in Hadoop 0.21
  ● as format for job history
  ● in sequence files

Avro Near Future

● full MapReduce support for Avro data
  ● enables fast comparators for non-Java apps
● Avro RPC used in Hadoop 0.22 (1.0)?
  ● provides compatibility; &
  ● native access from non-Java

Thanks!

hadoop.apache.org/avro