apachecon09: avro
DESCRIPTION
Doug Cutting presents Avro at the 2009 ApacheCon US.
TRANSCRIPT
State of the Elephant
Doug Cutting, Cloudera
2009 Hadoop Milestones
● A 2009 Sort Champion (O'Malley)
  ● won 100TB “Gray” sort @ .578TB/minute
  ● won “Minute” sort with 500GB
● Split Core Project in Three
  ● Common, HDFS & MapReduce
● Released 0.18.3, 0.19.[0-2], 0.20.[0-1]
● Many meetups, conferences, etc.
● Yada, yada, yada.
Goals for 2010 and beyond
● IMHO, YMMV, IANAM, etc.
● Concrete
  ● 1.0 release
    – compatible APIs & RPCs for > 1 year
    – Kerberos-based authentication
● Abstract
  ● faster, more reliable, available
  ● easier sharing
    – of data & hardware resources
  ● spreadsheet-like interfaces
    – provide non-programmers with powerful, interactive tools
Abstract Requirements
● security
  ● facilitates sharing of resources
● stable cross-language APIs
  ● facilitate diverse tools & apps
● expressive, inter-operable data
  ● facilitates sharing of datasets
  ● facilitates dynamic analyses
Data Formats
● today in Hadoop:
  ● text
    – pro: inter-operable
    – con: not expressive, inefficient
  ● Java Writable
    – pro: expressive, efficient
    – con: platform-specific, fragile
Protocol Buffers & Thrift
● expressive
● efficient (small & fast)
● but not very dynamic
  ● cannot browse arbitrary data
  ● no DESCRIBE or SHOW
  ● viewing a new dataset
    – requires code generation & load
  ● writing a new dataset
    – requires generating schema text
    – plus code generation & load
Avro Data
● as expressive
● smaller and faster
● dynamic
  ● schema stored with data
    – but factored out of instances
  ● APIs permit reading & creating
    – arbitrary datatypes
    – without generating & loading code
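
The "schema stored with data, but factored out of instances" idea can be sketched in plain Python. This is a toy layout, not Avro's actual file format or binary encoding: the length-prefixed JSON header, the fixed-width integers, the `User` schema, and the `write_container` / `read_container` names are all assumptions made for illustration. The point is only that the schema is written once, each record carries bare values, and a reader can decode them generically from the embedded schema without generated classes.

```python
import io
import json
import struct

# Hypothetical schema, written in Avro's JSON schema style.
SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}


def write_container(schema, records):
    """Write the schema once, then each record as bare values in field order."""
    out = io.BytesIO()
    header = json.dumps(schema).encode("utf-8")
    out.write(struct.pack(">I", len(header)))  # toy framing, not Avro's
    out.write(header)
    for rec in records:
        for field in schema["fields"]:
            value = rec[field["name"]]
            if field["type"] == "string":
                data = value.encode("utf-8")
                out.write(struct.pack(">I", len(data)))
                out.write(data)
            elif field["type"] == "int":
                out.write(struct.pack(">i", value))
            else:
                raise ValueError("type not handled in this sketch: %r" % field["type"])
    return out.getvalue()


def read_container(blob):
    """Recover the schema from the header, then decode every record generically."""
    buf = io.BytesIO(blob)
    header_len = struct.unpack(">I", buf.read(4))[0]
    schema = json.loads(buf.read(header_len).decode("utf-8"))
    records = []
    while buf.tell() < len(blob):
        rec = {}
        for field in schema["fields"]:
            if field["type"] == "string":
                size = struct.unpack(">I", buf.read(4))[0]
                rec[field["name"]] = buf.read(size).decode("utf-8")
            elif field["type"] == "int":
                rec[field["name"]] = struct.unpack(">i", buf.read(4))[0]
            else:
                raise ValueError("type not handled in this sketch: %r" % field["type"])
        records.append(rec)
    return schema, records


if __name__ == "__main__":
    blob = write_container(SCHEMA, [{"name": "ada", "age": 36}])
    print(read_container(blob))
```

Because field names live only in the header, each additional record grows the file by the size of its values alone, and the reader needs nothing beyond the file itself to interpret them.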
Avro Data
● includes a file format
● includes a textual encoding
● handles versioning
  ● if schema changes
  ● can still process data
● hope Hadoop apps will
  ● upgrade from text; &
  ● standardize on Avro for data
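
One way "if schema changes, can still process data" can work is sketched below, assuming fields are matched by name and the newer schema supplies defaults for fields the old data lacks. The `resolve` helper and both `User` schemas are hypothetical names used for illustration; this is a simplified sketch of the idea, not the Avro library's API.

```python
# Hypothetical schema used when the data was written (old version).
WRITER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

# Hypothetical schema the reading application expects (new version):
# "age" was dropped and "email" was added with a default.
READER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string", "default": "unknown"},
    ],
}


def resolve(writer_schema, reader_schema, record):
    """Project a record written with writer_schema onto reader_schema."""
    written = {field["name"] for field in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            # Field exists in the old data: carry the value over.
            result[name] = record[name]
        elif "default" in field:
            # Field is new: fall back to the reader's declared default.
            result[name] = field["default"]
        else:
            raise ValueError("no value or default for field %r" % name)
    # Fields only the writer knew about (here "age") are simply skipped.
    return result


if __name__ == "__main__":
    old_record = {"name": "ada", "age": 36}
    print(resolve(WRITER_SCHEMA, READER_SCHEMA, old_record))
```

This only works because the writer's schema travels with the data: the reader can compare the two schemas at read time instead of assuming both sides were compiled against the same version.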
Avro RPC
● leverage versioning support
  ● to permit different versions of services to interoperate
● for Hadoop services, will
  ● provide cross-language access
  ● let apps talk to clusters running different versions
Avro Status
● Java & Python APIs
● C & C++ APIs making rapid progress
● 1.1 release
  ● added JSON data and comparators
● 1.2 release
  ● added HTTP & UDP-based RPC
● included in Hadoop 0.21
  ● as format for job history
  ● in sequence files
Avro Near Future
● full MapReduce support for Avro data
  ● enables fast comparators for non-Java apps
● Avro RPC used in Hadoop 0.22 (1.0)?
  ● provides compatibility; &
  ● native access from non-Java
Thanks!
hadoop.apache.org/avro