Efficient Schemas in Motion with Kafka and Schema Registry

Upload: pat-patterson

Post on 29-Jan-2018

Category:

Software


TRANSCRIPT

Page 1: Efficient Schemas in Motion with Kafka and Schema Registry

Efficient Schemas in Motion with Kafka and Schema Registry

Pat Patterson

Community Champion

@metadaddy

[email protected]

Page 2: Efficient Schemas in Motion with Kafka and Schema Registry

Enterprise Data DNA

Commercial Customers Across Verticals

250,000+ downloads

50+ of the Fortune 100

Doubling each quarter

Strong Partner Ecosystem Open Source Success

Mission: empower enterprises to harness their data in motion.

Who is StreamSets?

Page 3: Efficient Schemas in Motion with Kafka and Schema Registry

Avro

Schema Registry

Demo

Agenda

Page 4: Efficient Schemas in Motion with Kafka and Schema Registry

Joined ASF as a Hadoop subproject in 2009

Record-oriented serialization format

Binary (most common) and JSON (human readable) encodings

Apache Avro

Page 5: Efficient Schemas in Motion with Kafka and Schema Registry

Avro Prehistory

Page 6: Efficient Schemas in Motion with Kafka and Schema Registry

Schema defined in JSON
• Relatively readable

Schema evolution
• Can add new fields, rename fields in schema
• Existing data can still be read under the new schema

Untagged binary data
• Space-efficient!

Avro Advantages

Page 7: Efficient Schemas in Motion with Kafka and Schema Registry

{
  "type": "record",
  "namespace": "com.example",
  "name": "Person",
  "fields": [
    { "name": "first_name", "type": "string" },
    { "name": "last_name", "type": "string" }
  ]
}

Avro Schema Definition

Page 8: Efficient Schemas in Motion with Kafka and Schema Registry

• null: 0 bytes

• boolean: 1 byte

• int/long: variable-length, zig-zag encoded

• float/double: 4/8 bytes

• bytes: length as long, then data

• string: length as long, then UTF-8-encoded data

Avro Binary Encoding - Simple Types
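The simple-type rules above can be sketched in a few lines of Python. This is an illustrative hand-rolled encoder, not the Avro library's API; the function names `zigzag`, `encode_long` and `encode_string` are made up for this example.

```python
def zigzag(n: int) -> int:
    """Map a signed 64-bit value to an unsigned one so small magnitudes stay small."""
    return (n << 1) ^ (n >> 63)


def encode_long(n: int) -> bytes:
    """Encode an int/long as a zig-zag varint: 7 bits per byte, low-order bits first."""
    z = zigzag(n)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set means "more bytes follow"
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode a string as its UTF-8 byte length (as a long), then the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


print(encode_long(-1).hex())       # 01
print(encode_long(64).hex())       # 8001
print(encode_string("Pat").hex())  # 06506174
```

Small values, whether positive or negative, fit in a single byte, which is where much of the space saving for typical records comes from.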

Page 9: Efficient Schemas in Motion with Kafka and Schema Registry

• Record: concatenate the field encodings

• Enum: zero-based index of symbol, as int

• Array: blocks of items, each preceded by a long count; zero count terminates array

• Map: blocks of K-V pairs, each preceded by a long count; zero count terminates map

• Union: zero-based position of the value's type within the union, as a long, then the value

• Fixed: the number of bytes defined in the schema

Avro Binary Encoding - Complex Types
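As a rough illustration of the complex-type rules, the sketch below encodes a Person record, an array of strings and a union value by hand. The helper names are hypothetical and the code only covers the cases shown; a real application would use an Avro library.

```python
def encode_long(n: int) -> bytes:
    """Zig-zag varint encoding used for int/long values, counts and indexes."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_person(first_name: str, last_name: str) -> bytes:
    """Record: field encodings concatenated in schema order, with no tags or names."""
    return encode_string(first_name) + encode_string(last_name)


def encode_array(items, encode_item) -> bytes:
    """Array: one block (count, then items) if non-empty, then a zero count."""
    out = b""
    if items:
        out += encode_long(len(items)) + b"".join(encode_item(i) for i in items)
    return out + encode_long(0)


def encode_union(index: int, encoded_value: bytes) -> bytes:
    """Union: zero-based index of the branch, then the value encoded for that branch."""
    return encode_long(index) + encoded_value


print(encode_person("Pat", "Patterson"))         # b'\x06Pat\x12Patterson'
print(encode_array(["a", "b"], encode_string))   # b'\x04\x02a\x02b\x00'
print(encode_union(1, encode_string("a")))       # b'\x02\x02a' for ["null", "string"]
```

Note that the record carries no field names at all: the 14 bytes for Pat Patterson are meaningless without the schema that wrote them.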

Page 10: Efficient Schemas in Motion with Kafka and Schema Registry

{
  "type": "record",
  "namespace": "com.example",
  "name": "Person",
  "fields": [
    { "name": "first_name", "type": "string" },
    { "name": "last_name", "type": "string" },
    { "name": "age", "type": "int", "default": -1 }
  ]
}

Avro Schema Evolution

Page 11: Efficient Schemas in Motion with Kafka and Schema Registry

Compatibility Rules:
• New fields must have a default
• Deleted field must have had a default
• Doc/Order can be added/removed/changed
• Field default can be added/changed
• Field/type aliases can be added/removed
• Non-union can be converted to union with just that type, or vice versa

General rule is that old data can be read under the new schema

Avro Schema Evolution
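To make the "old data under the new schema" rule concrete, here is a minimal sketch of reading a record written with the two-field Person schema through the three-field one. It handles only string and int/long fields and only the add-a-field-with-default case; `read_record` and the other names are invented for this example and are not the Avro library's API.

```python
import io


def read_long(buf: io.BytesIO) -> int:
    """Decode a zig-zag varint."""
    shift = 0
    result = 0
    while True:
        b = buf.read(1)[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def read_string(buf: io.BytesIO) -> str:
    return buf.read(read_long(buf)).decode("utf-8")


READERS = {"string": read_string, "int": read_long, "long": read_long}


def read_record(data: bytes, writer_schema: dict, reader_schema: dict) -> dict:
    """Decode with the writer's schema, then project onto the reader's schema."""
    buf = io.BytesIO(data)
    # The bytes are untagged, so they can only be parsed in the writer's field order.
    written = {f["name"]: READERS[f["type"]](buf) for f in writer_schema["fields"]}
    record = {}
    for field in reader_schema["fields"]:
        if field["name"] in written:
            record[field["name"]] = written[field["name"]]
        elif "default" in field:
            record[field["name"]] = field["default"]  # new field: fall back to default
        else:
            raise ValueError(f"no value or default for field {field['name']!r}")
    return record


V1 = {"type": "record", "name": "Person", "fields": [
    {"name": "first_name", "type": "string"},
    {"name": "last_name", "type": "string"}]}
V2 = {"type": "record", "name": "Person", "fields": V1["fields"] + [
    {"name": "age", "type": "int", "default": -1}]}

print(read_record(b"\x06Pat\x12Patterson", V1, V2))
# {'first_name': 'Pat', 'last_name': 'Patterson', 'age': -1}
```

The reader needs both schemas: the writer's to parse the bytes, and its own to decide which fields to keep and which defaults to apply. That is why the schema has to travel with the data in some form.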

Page 12: Efficient Schemas in Motion with Kafka and Schema Registry

Avro Schema Serialization

Various options, depending on file/message orientation, but, generally:
• Metadata, including the schema
• Data

Great for files - schema is sent just once, but what about messages?
• Send just once? Periodically?
• Send per message?
• Agree out of band?
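A back-of-the-envelope sketch of the trade-off: compare the bytes sent when the schema accompanies every message with the bytes sent when it is transmitted once. The schema is serialized here as compact JSON purely for measurement; real Avro containers add their own framing, so treat the numbers as indicative only. `bytes_on_wire` is a hypothetical helper.

```python
import json

PERSON_SCHEMA = {
    "type": "record",
    "namespace": "com.example",
    "name": "Person",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": "string"},
    ],
}


def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    if len(data) >= 64:
        raise ValueError("this sketch only handles single-byte length prefixes")
    return bytes([len(data) << 1]) + data  # zig-zag length fits in one byte


schema_bytes = json.dumps(PERSON_SCHEMA, separators=(",", ":")).encode("utf-8")
payload = encode_string("Pat") + encode_string("Patterson")


def bytes_on_wire(messages: int, schema_per_message: bool) -> int:
    """Total bytes for `messages` identical records under either strategy."""
    if schema_per_message:
        return messages * (len(schema_bytes) + len(payload))
    return len(schema_bytes) + messages * len(payload)


print(len(payload), len(schema_bytes))
print(bytes_on_wire(1000, schema_per_message=True))
print(bytes_on_wire(1000, schema_per_message=False))
```

For a record this small the schema is several times larger than the data, so sending it per message multiplies the traffic accordingly.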

Page 13: Efficient Schemas in Motion with Kafka and Schema Registry

Schema Overhead

Demo

Page 14: Efficient Schemas in Motion with Kafka and Schema Registry

Online schema repository
• Simple REST API

Each schema has an ID
• Unique within the repository

Schemas versioned within subjects
• Supports schema evolution
• Subject loosely corresponds to topic
• Subject + version -> ID

Schema Registry
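The data model on this slide can be mimicked with a toy in-memory class. This is only a sketch of the ID, subject and version relationships; `InMemoryRegistry` and its methods are invented here, it performs no compatibility checking, and it is not the Schema Registry's implementation.

```python
class InMemoryRegistry:
    """Toy registry: IDs are unique across the registry, versions are per subject."""

    def __init__(self):
        self._schemas = []    # position i holds the schema with ID i + 1
        self._ids = {}        # schema text -> ID
        self._subjects = {}   # subject -> list of IDs; version = position + 1

    def register(self, subject: str, schema: str) -> int:
        schema_id = self._ids.get(schema)
        if schema_id is None:
            self._schemas.append(schema)
            schema_id = len(self._schemas)
            self._ids[schema] = schema_id
        versions = self._subjects.setdefault(subject, [])
        if schema_id not in versions:
            versions.append(schema_id)  # a new schema becomes the next version
        return schema_id

    def schema_by_id(self, schema_id: int) -> str:
        return self._schemas[schema_id - 1]

    def id_for(self, subject: str, version: int) -> int:
        """Subject + version -> ID."""
        return self._subjects[subject][version - 1]


registry = InMemoryRegistry()
v1_id = registry.register("people-value", '{"name": "Person", "fields": ["v1"]}')
v2_id = registry.register("people-value", '{"name": "Person", "fields": ["v2"]}')
print(v1_id, v2_id, registry.id_for("people-value", 2))  # 1 2 2
```

The subject name `people-value` is an assumption for illustration, following the common convention of deriving the subject from the topic name.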

Page 15: Efficient Schemas in Motion with Kafka and Schema Registry

Register schema, registry returns an ID

Sender passes schema ID in each message

Recipient looks up ID in registry

Solves the Avro-by-Message Problem
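One way to pass the ID in each message is the framing used by Confluent's serializers: a zero "magic" byte, a 4-byte big-endian schema ID, then the Avro binary payload. The slides do not spell this layout out, so take the sketch below as an illustration of the idea; `frame` and `unframe` are hypothetical names.

```python
import struct

MAGIC_BYTE = 0
HEADER = struct.Struct(">bI")  # 1-byte magic + 4-byte big-endian schema ID


def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Sender side: prefix the encoded record with the schema ID."""
    return HEADER.pack(MAGIC_BYTE, schema_id) + avro_payload


def unframe(message: bytes):
    """Recipient side: split the message into (schema ID, Avro payload)."""
    magic, schema_id = HEADER.unpack_from(message)
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id, message[HEADER.size:]


message = frame(42, b"\x06Pat\x12Patterson")
print(len(message))      # 19: five bytes of header plus the 14-byte record
print(unframe(message))  # (42, b'\x06Pat\x12Patterson')
```

The recipient then looks up the ID in the registry, typically caching the result, so the per-message overhead stays at five bytes regardless of schema size.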

Page 16: Efficient Schemas in Motion with Kafka and Schema Registry

Schema By Reference

Demo

Page 17: Efficient Schemas in Motion with Kafka and Schema Registry

Just register a new (compatible) schema under the same subject (topic)

Schema is assigned a new ID

Evolution with Schema Registry
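Registering the evolved schema is a single REST call. The sketch below builds, but does not send, a `POST /subjects/{subject}/versions` request in the shape the Confluent Schema Registry REST API expects: the schema is passed as a JSON string inside a JSON body. The base URL `http://localhost:8081` and the subject `people-value` are assumptions for illustration.

```python
import json
import urllib.request
from urllib.parse import quote

PERSON_V2 = {
    "type": "record",
    "namespace": "com.example",
    "name": "Person",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": "string"},
        {"name": "age", "type": "int", "default": -1},
    ],
}


def build_register_request(base_url: str, subject: str, schema: dict) -> urllib.request.Request:
    """Build the registration request; the caller decides when to send it."""
    body = json.dumps({"schema": json.dumps(schema)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/subjects/{quote(subject, safe='')}/versions",
        data=body,
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )


request = build_register_request("http://localhost:8081", "people-value", PERSON_V2)
print(request.get_method(), request.full_url)
# POST http://localhost:8081/subjects/people-value/versions

# Against a running registry, sending it would look like this:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))  # a JSON object containing the assigned "id"
```

If the new schema violates the subject's configured compatibility level, the registry rejects the request instead of assigning an ID.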

Page 18: Efficient Schemas in Motion with Kafka and Schema Registry

Schema Evolution

Demo

Page 19: Efficient Schemas in Motion with Kafka and Schema Registry

Landoop schema-registry-ui

https://github.com/Landoop/schema-registry-ui

Bonus Feature: Web UI

Page 20: Efficient Schemas in Motion with Kafka and Schema Registry

Schema Evolution Part Deux

Demo

Page 21: Efficient Schemas in Motion with Kafka and Schema Registry

Conclusion

Avro: a row-oriented, self-describing format for data serialization

Default Avro is inefficient in a message-passing setting, since the schema travels with the data

Referencing schema by ID dramatically reduces the volume of network traffic

Page 22: Efficient Schemas in Motion with Kafka and Schema Registry

Thank You!

Pat Patterson

Community Champion

@metadaddy

[email protected]