south bay cassandra meetup 4/23: building a flexible, real-time big data applications platform on...

Building a Flexible, Real-time Big Data Applications Platform on Cassandra with Kiji. Clint Kelly, Member of Technical Staff, WibiData. Cassandra Meetup, 23 April 2014.

Upload: planet-cassandra

Post on 15-Jan-2015


DESCRIPTION

The Kiji Project is a modular, open-source framework that enables developers to efficiently build real-time Big Data applications. Kiji is built upon popular open-source technologies such as Cassandra, HBase, Hadoop, and Scalding, and contains components that implement functionality critical for Big Data applications, including the following:

• Support for evolvable schemas of complex data types
• Batch training of machine learning models with Hadoop
• Real-time scoring with trained models
• Integration with Hive and R
• A REST endpoint

Recently, we have updated Kiji to use Cassandra as a backing data store (previously, Kiji worked only with HBase). In this talk, we describe the process of integrating Cassandra and Kiji. Topics we cover include the following:

• The Kiji architecture and data model
• Implementing the Kiji data model in Cassandra using the Java driver and CQL3
• Integrating Cassandra with Hadoop 2.x
• Building a flexible middleware platform that supports Cassandra and HBase (including projects that use both simultaneously)
• Exposing unique features of Cassandra (e.g., variable consistency) to Kiji users

TRANSCRIPT

Page 1: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Building a Flexible, Real-time Big Data Applications Platform

on Cassandra with Kiji

Clint Kelly
Member of Technical Staff
WibiData

Cassandra Meetup
23 April 2014

Pages 2-5: Agenda

The problem
How Kiji works
Kiji on Cassandra

Pages 6-21: (image-only slides)

Open source software

Pages 22-24: Data in

REST

Pages 25-29: Inspect

Pages 30-36: Train

“Trained model”

Pages 37-42: Model

AaBb

Pages 43-47: Apply

Batch

AaBb AaBb AaBb AaBb AaBb AaBb AaBb AaBb AaBb

Pages 48-51: Data out

REST

Pages 52-59: REST

Pages 60-68: AaBb

Pages 69-72: Experiments / Deployment

Page 73: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

3

Pages 74-75: Data in / out (REST)

Page 76: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Inspect and train

Pages 77-78: Apply (real-time)

Pages 79-80: Kiji

Page 81: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

How Kiji works

Pages 82-89: Kiji History

Page 90: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

In production now

Fortune 500 retailer: Personalized recommendations

Opower: Energy usage and analytics reporting

Pages 91-128: How does it work?

(Architecture diagram, built up one element per slide. Labels recovered from the slides, in order of first appearance.)

Kiji
Engineering
Data Science
Write
Channels
Logs
DBs
KijiMR
KijiREST
Stream
Read
KijiSchema (Cassandra)
User 1, User 2, User 3
C
Query
KijiHive
Data
KijiExpress
Scorer
R
Kiji Model Repository
KijiScoring
Freshness Policy

Page 129: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

3

Page 130: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Data in / out: KijiREST, KijiMR

Page 131: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Inspect and train: KijiHive, KijiMR, KijiExpress

Page 132: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Apply (real-time): Kiji Model Repository, KijiScoring

Page 133: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Modular

Page 134: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Kiji on Cassandra

Page 135: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Kiji ~ BigTable

Page 136: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

table

Page 137: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

table

row row row row row row row row row row row row

Page 138: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

row

Page 139: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Row key = entity ID

entity ID data

Page 140: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Composite entity IDs

0xfa “bob” data
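
The slide draws the entity ID as a one-byte value (0xfa) followed by the component “bob”. The sketch below shows one way such a key could be assembled, assuming the leading byte is a hash of the components that spreads rows evenly across the cluster. The function name, the choice of MD5, and the NUL separator are illustrative assumptions; this is not Kiji's actual entity ID encoding.

```python
import hashlib


def make_entity_id(*components: str, prefix_bytes: int = 1) -> bytes:
    """Build a row key: a short hash prefix followed by the components.

    Hypothetical helper for illustration only; it does not reproduce
    Kiji's real entity ID format.
    """
    if not components:
        raise ValueError("an entity ID needs at least one component")
    # Join components with a NUL separator so ("ab", "c") != ("a", "bc").
    body = b"\x00".join(c.encode("utf-8") for c in components)
    # The prefix depends only on the components, so it is deterministic.
    prefix = hashlib.md5(body).digest()[:prefix_bytes]
    return prefix + body


single = make_entity_id("bob")
composite = make_entity_id("SF", "bob")
print(single.hex(), composite.hex())
```

Because the prefix is derived from the components themselves, the same entity always maps to the same key, while keys for neighbouring entities are scattered rather than sorted together.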

Page 141: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Column families

0xfa “bob”: payment, interactions, recommendations

Page 142: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Columns

0xfa “bob”: inter:clicks, inter:search, payment:cardnum, payment:address, rec:scorer1, rec:scorer2

Page 143: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Timestamped versions

Row 0xfa “bob”

Columns: inter:search, inter:clicks, payment:cardnum, payment:address, rec:scorer1, rec:scorer2, rec:scorer3 (shown 3 times), songs:let it be (shown 4 times)

Timestamps shown: 1396560123, 1395650231

Page 144: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Complex data types

record Search {
  string search_term;
  long session_id;
  device_type device;
}

(Row diagram as on page 143.)

Pages 145-155: Locality group

Column families

Batch, Batch, Batch
Real-time, Real-time, Real-time

locality_group_real_time
locality_group_batch

On disk. Compressed.
In memory.
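
The locality-group slides can be read as a small configuration: column families are assigned to named groups, and each group carries its own storage settings. The sketch below is a hypothetical model of that idea, not Kiji's layout format. The group names come from the slides; which families belong to which group, and the pairing of “on disk, compressed” with the batch group and “in memory” with the real-time group, are assumptions, since the extracted slides do not preserve which annotation belongs to which group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalityGroup:
    """Storage settings shared by every column family in the group."""
    name: str
    in_memory: bool
    compressed: bool
    families: tuple = ()


# Group names are from the slides; the family assignments and the
# storage flags per group are assumed for illustration.
LAYOUT = (
    LocalityGroup("locality_group_batch", in_memory=False, compressed=True,
                  families=("payment", "recommendations")),
    LocalityGroup("locality_group_real_time", in_memory=True, compressed=False,
                  families=("interactions",)),
)


def group_for(family: str) -> LocalityGroup:
    """Look up the locality group that stores a given column family."""
    for group in LAYOUT:
        if family in group.families:
            return group
    raise KeyError(f"family {family!r} is not assigned to a locality group")


print(group_for("interactions").name)
```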

Page 156: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Row ➔ transactional consistency

Page 157: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Locality group ➔ Column family

CREATE TABLE loc_grp

(Row diagram as on page 143.)

Page 158: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Entity ID ➔ Primary key

CREATE TABLE loc_grp (city text, user text,
  PRIMARY KEY (city, user) )
  WITH CLUSTERING ORDER BY (user ASC);

(Row diagram as on page 143.)

Page 159: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Family, Qualifier, Version ➔ Clustering Columns

CREATE TABLE loc_grp (city text, user text,
  family text, qualifier text, version bigint,
  PRIMARY KEY (city, user, family, qualifier, version) )
  WITH CLUSTERING ORDER BY (user ASC, family ASC, qualifier ASC, version DESC);

(Row diagram as on page 143.)

Page 160: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Column values ➔ Blobs

CREATE TABLE loc_grp (city text, user text,
  family text, qualifier text, version bigint, value blob,
  PRIMARY KEY (city, user, family, qualifier, version) )
  WITH CLUSTERING ORDER BY (user ASC, family ASC, qualifier ASC, version DESC);

(Row diagram as on page 143.)
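
To see what the clustering order in this table implies, the following sketch sorts a few in-memory cells the way the `CLUSTERING ORDER BY (user ASC, family ASC, qualifier ASC, version DESC)` clause would, so that the newest version of a column is the first one encountered. It is a plain-Python simulation, not driver code. The `Cell` type, the helper names, the city value “SF” and the byte values are hypothetical; the two timestamps are the ones shown on the slides.

```python
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    city: str        # partition key
    user: str        # clustering column
    family: str      # clustering column
    qualifier: str   # clustering column
    version: int     # clustering column, newest first
    value: bytes     # serialized column value (the blob)


def clustering_key(cell: Cell):
    # Mirrors (user ASC, family ASC, qualifier ASC, version DESC):
    # negating the version makes larger timestamps sort first.
    return (cell.user, cell.family, cell.qualifier, -cell.version)


def partition(cells, city):
    """Return one partition's cells in the order the table would store them."""
    return sorted((c for c in cells if c.city == city), key=clustering_key)


def latest(cells, city, user, family, qualifier) -> Optional[Cell]:
    """Newest version of one column: the first match in clustering order."""
    for cell in partition(cells, city):
        if (cell.user, cell.family, cell.qualifier) == (user, family, qualifier):
            return cell
    return None


cells = [
    Cell("SF", "bob", "inter", "clicks", 1395650231, b"older click"),
    Cell("SF", "bob", "payment", "address", 1395650231, b"1234 Main St, SF"),
    Cell("SF", "bob", "inter", "clicks", 1396560123, b"newer click"),
]
for cell in partition(cells, "SF"):
    print(cell.family, cell.qualifier, cell.version)
```

Storing versions in descending order means that a "most recent value" read can stop at the first matching cell instead of scanning a column's whole history.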

Page 161: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Row 0xfa:

bob:pay:cardnum:t ➔ AMEX1234...
bob:pay:addr:t5 ➔ 1234 Main St, SF
bob:inter:clicks:t9 ➔ ...
bob:inter:clicks:t7 ➔ ...
bob:inter:clicks:t6 ➔ ...
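The effect of the clustering order in the CREATE TABLE statements above can be imitated in a few lines of plain Python. This is only a sketch of the ordering rule (user, family, and qualifier ascending, version descending), not Kiji or driver code; the Cell type and the helper names are made up for illustration, and the sample cells borrow the bob:inter:clicks labels from the slide.

```python
from typing import List, NamedTuple, Optional


class Cell(NamedTuple):
    """Hypothetical in-memory stand-in for one CQL row of loc_grp."""
    user: str
    family: str
    qualifier: str
    version: int
    value: bytes


def clustering_key(cell: Cell):
    # ASC on user, family, qualifier; DESC on version (negate the integer).
    return (cell.user, cell.family, cell.qualifier, -cell.version)


def order_partition(cells: List[Cell]) -> List[Cell]:
    """Return cells in the order the clustering columns would store them."""
    return sorted(cells, key=clustering_key)


def latest(cells: List[Cell], user: str, family: str, qualifier: str) -> Optional[Cell]:
    """First matching cell in clustering order, i.e. the highest version."""
    for cell in order_partition(cells):
        if (cell.user, cell.family, cell.qualifier) == (user, family, qualifier):
            return cell
    return None


# Illustrative cells only; versions 9, 7, 6 echo t9, t7, t6 on the slide.
cells = [
    Cell("bob", "inter", "clicks", 6, b"..."),
    Cell("bob", "inter", "clicks", 9, b"..."),
    Cell("bob", "pay", "addr", 5, b"1234 Main St, SF"),
    Cell("bob", "inter", "clicks", 7, b"..."),
]
ordered = order_partition(cells)
newest_click = latest(cells, "bob", "inter", "clicks")
```

Because version sorts descending, the most recent cell for a column is the first one encountered, so a "latest value" read can stop after one match.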

Page 166: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Implementation notes

DataStax Java driver
Cassandra 2.0.6
Async API
New MapReduce InputFormat

Page 167: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Issues

Page 177: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Operations across locality groups

Kiji locality group ➔ C* column family

Read across locality groups ➔ multiple C* reads (async API!)

Compare-and-set across locality groups ➔ not allowed in C* Kiji

Lose transactional consistency
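A read that spans locality groups turns into one read per C* column family, issued concurrently and merged on the client. The sketch below imitates that pattern with the Python standard library. The thread pool merely stands in for the driver's asynchronous API, and the table names, the TABLES dictionary, and both functions are hypothetical stand-ins rather than Kiji code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical in-memory stand-ins for two C* column families,
# one per Kiji locality group. Names and values are illustrative only.
TABLES: Dict[str, Dict[str, Dict[str, str]]] = {
    "lg_payment": {"bob": {"payment:address": "1234 Main St, SF"}},
    "lg_inter": {"bob": {"inter:clicks": "..."}},
}


def read_locality_group(table: str, entity: str) -> Dict[str, str]:
    """Stand-in for a single read against one column family."""
    return dict(TABLES[table].get(entity, {}))


def read_row(entity: str, tables: List[str]) -> Dict[str, str]:
    """Issue one read per locality group concurrently, then merge the results."""
    merged: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(tables))) as pool:
        futures = [pool.submit(read_locality_group, t, entity) for t in tables]
        for future in futures:
            merged.update(future.result())
    return merged


row = read_row("bob", ["lg_payment", "lg_inter"])
```

Each read completes independently, so nothing guarantees that the merged result reflects a single point in time. That is one way to read the "lose transactional consistency" note on the slide.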

Page 179: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Filters

HBase ➔ Rich server-side filters
Cassandra ➔ WHERE clauses

Client-side filtering
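Client-side filtering can be pictured as two stages: a coarse query that a WHERE clause can express, then an arbitrary predicate applied once the rows reach the client. The following is a minimal sketch under that assumption. The ROWS list and all three functions are hypothetical, and the regular-expression predicate is just one example of a filter assumed here to be beyond what the WHERE clause offers.

```python
import re
from typing import Callable, Dict, List

Row = Dict[str, str]

# Hypothetical rows shaped like the loc_grp table from the earlier slides.
ROWS: List[Row] = [
    {"city": "0xfa", "user": "bob", "family": "rec", "qualifier": "scorer1"},
    {"city": "0xfa", "user": "bob", "family": "rec", "qualifier": "scorer2"},
    {"city": "0xfa", "user": "bob", "family": "rec", "qualifier": "scorer3"},
    {"city": "0x01", "user": "amy", "family": "rec", "qualifier": "scorer1"},
]


def query_partition(rows: List[Row], city: str) -> List[Row]:
    """Stand-in for a SELECT whose WHERE clause restricts the partition key."""
    return [row for row in rows if row["city"] == city]


def client_side_filter(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Apply an arbitrary predicate after the rows have been fetched."""
    return [row for row in rows if predicate(row)]


def qualifier_matches(pattern: str) -> Callable[[Row], bool]:
    """Build a predicate that keeps rows whose qualifier fully matches a regex."""
    compiled = re.compile(pattern)
    return lambda row: compiled.fullmatch(row["qualifier"]) is not None


fetched = query_partition(ROWS, "0xfa")
scorers = client_side_filter(fetched, qualifier_matches(r"scorer[12]"))
```

The trade-off is that every row in the partition crosses the network before the predicate runs, whereas a server-side filter would discard non-matching rows first.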

Page 180: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Entity IDs with unhashed components

Page 186: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

EntityId(state, city, username)

(state, city): hashed
username: unhashed

HBase:

0x235af-alice
0x235af-bob
0x235af-cathy
0x235af-dave
0x38e0a-andy
0x38e0a-jane
0x38e0a-lucy
0x38e0a-nancy

Cassandra:

0x235af | alice | bob | cathy | dave
0x38e0a | andy | jane | lucy | nancy

Limited to width of C* wide row!
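The two layouts can be compared with a small sketch. It assumes, as the keys on the slide suggest, that state and city are folded into one hashed prefix while username stays unhashed. The hash function here (a truncated MD5 digest) is an arbitrary choice for illustration and is not Kiji's actual hashing scheme; all function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

EntityId = Tuple[str, str, str]  # (state, city, username)


def hash_prefix(*components: str) -> str:
    """Illustrative hash of the hashed components (not Kiji's real scheme)."""
    digest = hashlib.md5("|".join(components).encode("utf-8")).hexdigest()
    return "0x" + digest[:5]


def hbase_row_key(state: str, city: str, username: str) -> str:
    """HBase-style layout: one row per entity, keyed by hash prefix plus username."""
    return f"{hash_prefix(state, city)}-{username}"


def cassandra_layout(entity_ids: Iterable[EntityId]) -> Dict[str, List[str]]:
    """Cassandra-style layout: one partition per hash prefix, with the
    unhashed component acting as a sorted clustering column inside it."""
    partitions: Dict[str, List[str]] = defaultdict(list)
    for state, city, username in entity_ids:
        partitions[hash_prefix(state, city)].append(username)
    return {key: sorted(users) for key, users in partitions.items()}


# Illustrative entity IDs; the state and city values are made up.
ids = [("CA", "SF", "bob"), ("CA", "SF", "alice"), ("NY", "NYC", "andy")]
layout = cassandra_layout(ids)
row_keys = [hbase_row_key(*eid) for eid in ids]
```

In the Cassandra-style layout every entity that shares the hashed components lands in the same partition, which is why the number of distinct unhashed values is bounded by the width of a wide row.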

Page 187: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Project status

Page 189: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Next quarter

Cassandra in all Kiji components
Run MapReduce jobs with KijiExpress
Expose Cassandra-specific features

Page 190: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

3

Page 191: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Data in / out

KijiREST
KijiMR

Page 192: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Inspect and train

KijiHive
KijiMR
KijiExpress

Page 193: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Apply (real-time)

KijiModelRepository
KijiScoring

Page 194: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra

Thanks to Cassandra community

Mailing lists
Meetups, webinars, conferences

Page 196: South Bay Cassandra Meetup 4/23: Building a flexible, real-time Big Data Applications platform on Cassandra