lily at hug uk

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org Lily Smart data at scale

Upload: ngdata

Post on 08-Jul-2015


Category: Technology

DESCRIPTION

presentation given 10/feb

TRANSCRIPT

Page 1: Lily at HUG UK

Lily: Smart data at scale

Page 2: Lily at HUG UK

big data, big problems

Page 3: Lily at HUG UK

MOORE vs data

» coping with volume + need for timeliness = parallel processing

» data becomes business-critical = resilience through distributed architectures

» Hadoop, MapReduce, HBase: the future data platform

[Chart labels: moore, data]

Page 4: Lily at HUG UK

the CHALLENGES

» process ALL data
» process data in REAL-TIME
» derive INSIGHTS
» provide INSTANT FEEDBACK

Page 5: Lily at HUG UK

current thinking

[Diagram: data STORE, ETL, data warehouse, analytics]

batched, off-line, overnight

Page 6: Lily at HUG UK

[Diagram: DATA]

1. store and manage all YOUR data

Page 7: Lily at HUG UK

[Diagram: DATA; USER Behavior]

2. store user behaviour, nearby

Page 8: Lily at HUG UK

[Diagram: DATA; data processing; USER Behavior]

3. analyze usage patterns

Page 9: Lily at HUG UK

[Diagram: DATA; data processing; domain knowledge: patterns, rules, keywords, lists, ...; USER Behavior]

4. add domain knowledge

Page 10: Lily at HUG UK

[Diagram: DATA; data processing; domain knowledge: patterns, rules, keywords, lists, ...; USER Behavior; recommendations, semantic augmentation, Analytics]

5. process, in real-time

Page 11: Lily at HUG UK

[Diagram: DATA; data processing; domain knowledge: patterns, rules, keywords, lists, ...; USER Behavior; recommendations, semantic augmentation, Analytics]

6. augment data

Page 12: Lily at HUG UK

[Diagram: SMARTER DATA; data processing; domain knowledge: patterns, rules, keywords, lists, ...; data insights; relations; recommendations, semantic augmentation, Analytics]

Page 13: Lily at HUG UK

[Diagram: SMARTER DATA; data processing; domain knowledge: patterns, rules, keywords, lists, ...; relations; recommendations, semantic augmentation, Analytics; data insights]

SMART DATA, at SCALE... and in real time

Page 14: Lily at HUG UK


stories

Page 15: Lily at HUG UK

NEWS

organisations, names, locations, brands

HYPER-PERSONAL recommendations

TOGETHERNESS, interestingness

news aggregator

scale

Page 16: Lily at HUG UK

product CATALOG

product families, related activities, social graph

up-selling, CROSS-SELLING

recommendedness, relatedness

e-retail

real-time

Page 17: Lily at HUG UK

patents

companies, people, materials, processes

competitive innovation

(dis)SIMILARITY

IP research

insights

Page 18: Lily at HUG UK

Outerthought

» software product company
» scalable content applications
» open source product portfolio
» Java, REST, internet


“The world is moving from content as a cost to data as an opportunity.”

Page 19: Lily at HUG UK

[Diagram: data STORE + data warehouse + analytics; real time; brace labels: Lily 1.0 (CR), Lily 2.0]

Page 20: Lily at HUG UK

Lily (now)

» Large-scale content storage, indexing and search
» Current pilots
  e-retail mobile media isp e-gov ip research
» up to now: 4 man-years investment (since Sept/2009)

Page 21: Lily at HUG UK

roadmap

» now: Lily 0.3
» April 2011: Lily 1.0
» Q3 2011
  » real-time statistics + analytics
» Q2 2012: Lily 2.0
  » real-time data processing engine
  » Data Insights
» Along the road: Lily SaaS edition

Page 22: Lily at HUG UK

open source

» www.lilyproject.org
» docs.outerthought.org/lily-docs-current/

Page 23: Lily at HUG UK

Lily Core Concepts

» storage
  » HBase
» repository model
  » versioning, varianting, mixins
» indexing
  » mapping
» search
  » SOLR

Page 24: Lily at HUG UK

falling in love with HBase: phase 1

» automatic scaling to large data sets
» fault-tolerance
» flexible datamodel with sparse data
» commodity hardware
» efficient random access
» community-based open source
» Java if possible

Page 25: Lily at HUG UK

falling in love with HBase: phase 2

» need for consistency
» atomic single-row updates
» M/R for index regeneration

Page 26: Lily at HUG UK

falling in love with HBase: phase 3

HBase
» datamodel with column families and cell versioning
» ordered tables with range scans
» HDFS for blob storage
» Apache
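The HBase properties named across the three phases (ordered tables with range scans, column families, cell versioning) can be pictured with a small in-memory model. This is a toy sketch in plain Python, not the HBase client API: the class SortedTable, its method names, the row keys and the version limit are all assumptions made up for illustration.

```python
import bisect


class SortedTable:
    """Toy in-memory model of an ordered, versioned table (NOT the HBase API)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> {(family, qualifier): [(ts, value), ...]}

    def put(self, row, family, qualifier, value, ts):
        """Store one cell version; all changes touch a single row only."""
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        cell = self._rows[row].setdefault((family, qualifier), [])
        cell.append((ts, value))
        cell.sort(key=lambda version: version[0], reverse=True)  # newest first
        del cell[self.max_versions:]                              # bounded history

    def get(self, row, family, qualifier, versions=1):
        """Return up to `versions` values of a cell, newest first."""
        cell = self._rows.get(row, {}).get((family, qualifier), [])
        return [value for _, value in cell[:versions]]

    def scan(self, start, stop):
        """Return the row keys in [start, stop), in sorted order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return self._keys[lo:hi]
```

Because row keys stay sorted, a scan over a key prefix is just a slice of the key list, which is the property that makes range scans cheap in an ordered store.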

Page 27: Lily at HUG UK


Lily Repository Model

Page 28: Lily at HUG UK

Lily Datatypes

Page 29: Lily at HUG UK

Mixins

Page 30: Lily at HUG UK

Sample Lily Schema (excerpt)

namespaces: {
  /* Declaration of namespace prefixes. */
  "org.lilyproject.bookssample": "b",
  "org.lilyproject.vtag": "vtag"
},
fieldTypes: [
  { name: "b$title", valueType: { primitive: "STRING" }, scope: "versioned" },
  { name: "b$pages", valueType: { primitive: "INTEGER" }, scope: "versioned" },
  { name: "b$language", valueType: { primitive: "STRING" }, scope: "versioned" },
  { name: "b$authors", valueType: { primitive: "LINK", multiValue: true }, scope: "versioned" },
  { name: "b$name", valueType: { primitive: "STRING" }, scope: "versioned" },
  { name: "b$bio", valueType: { primitive: "STRING" }, scope: "versioned" },
  { name: "vtag$last", valueType: { primitive: "LONG" }, scope: "non_versioned" }
],
recordTypes: [
  {
    name: "b$Book",
    fields: [
      { name: "b$title", mandatory: true },
      { name: "b$pages", mandatory: false },
      { name: "b$language", mandatory: false },
      { name: "b$authors", mandatory: false },
      { name: "vtag$last", mandatory: false }
    ]
  },

...
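As a rough illustration of what the mandatory flags in the b$Book record type imply, the sketch below checks a candidate record against that record type. The record type is copied from the excerpt; the function check_record is hypothetical, and nothing here claims to reflect how Lily itself validates records.

```python
# Record type copied from the schema excerpt (b$Book).
BOOK_TYPE = {
    "name": "b$Book",
    "fields": [
        {"name": "b$title", "mandatory": True},
        {"name": "b$pages", "mandatory": False},
        {"name": "b$language", "mandatory": False},
        {"name": "b$authors", "mandatory": False},
        {"name": "vtag$last", "mandatory": False},
    ],
}


def check_record(record_type, fields):
    """Hypothetical check: return (missing mandatory names, undeclared names)."""
    declared = {f["name"] for f in record_type["fields"]}
    mandatory = {f["name"] for f in record_type["fields"] if f["mandatory"]}
    present = set(fields)
    missing = sorted(mandatory - present)
    unknown = sorted(present - declared)
    return missing, unknown


# A record without a title: b$title is reported as missing.
missing, unknown = check_record(BOOK_TYPE, {"b$pages": 320})
```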

Page 31: Lily at HUG UK

Lily Versioning

Page 32: Lily at HUG UK

Flexible content model

» generic enough to accommodate many popular content schemas
  » HTML5, CMIS, RDF, NewsML, Dublin Core, ...
» academically verified
» not limited to ‘content applications’ only
» developer convenience
  » higher level constructs
  » schema reuse
  » versioning, linking, ...

Page 33: Lily at HUG UK

Lily Architecture (deployment)

Page 34: Lily at HUG UK

Lily Architecture (components)

Page 35: Lily at HUG UK

HBase RowLog Library

» need for sync/async operations
» updating of secondary indexes (i.e. tables)
» feeding of Indexer (= bridge to SOLR index maintenance)
» not: transactions
» need for distribution and durability

Page 36: Lily at HUG UK

HBase RowLog Library

» WAL
  » guaranteed execution of synchronous actions
  » call doesn’t return before secondary action finishes
  » e.g. update secondary index tables
  » if all goes well, size = #concurrent ops
  » useful outside of Lily context as well!
» Queue
  » triggering of async actions
  » e.g. (re)index (updated) record with SOLR back-end
  » size depends on speed of back-end process
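The two roles described on this slide, a write-ahead log for synchronous secondary actions and a queue for asynchronous ones, can be sketched in a few lines. This is an in-memory illustration only and not the Lily RowLog API: ToyRowLog and its methods are hypothetical names, and a real implementation would persist both structures (in HBase, per the slides) rather than keep them in process memory.

```python
from collections import deque


class ToyRowLog:
    """In-memory illustration only; NOT the Lily RowLog API."""

    def __init__(self):
        self.wal = {}          # message id -> payload of unfinished sync actions
        self.queue = deque()   # payloads waiting for asynchronous processing
        self._next_id = 0

    def update(self, payload, sync_action):
        # 1. Log the intent first, so a failure in step 2 can be replayed.
        msg_id = self._next_id
        self._next_id += 1
        self.wal[msg_id] = payload
        # 2. Run the secondary action before returning to the caller.
        sync_action(payload)
        # 3. Done: drop the WAL entry and hand the message to the async queue.
        del self.wal[msg_id]
        self.queue.append(payload)

    def replay(self, sync_action):
        """Re-run secondary actions whose WAL entry survived a failure."""
        for msg_id in sorted(self.wal):
            payload = self.wal[msg_id]
            sync_action(payload)
            del self.wal[msg_id]
            self.queue.append(payload)

    def drain(self, async_action):
        """Process queued messages (e.g. reindexing); return how many ran."""
        handled = 0
        while self.queue:
            async_action(self.queue.popleft())
            handled += 1
        return handled
```

In this sketch the WAL only holds entries for operations still in flight, which matches the "size = #concurrent ops" remark, while the queue grows whenever the consumer calling drain falls behind.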

Page 37: Lily at HUG UK

The Lily Indexer

» denormalization
» indexing of multiple versions of a record
» incremental index updating
» batch index building
» blob content extraction
» sharding towards multiple SOLR instances

Page 38: Lily at HUG UK

Indexing configuration (SOLR)

<schema name="example" version="1.2">
  <types> [snipped: see SOLR example schema] </types>

  <fields>
    <!-- Fields which are required by Lily -->
    <field name="@@key" type="string" indexed="true" stored="true" required="true"/>
    <field name="@@id" type="string" indexed="true" stored="true" required="true"/>
    <field name="@@vtag" type="string" indexed="true" stored="true" required="true"/>
    <field name="@@versionless" type="string" indexed="true" stored="true" required="false"/>

    <!-- Your own fields -->
    <field name="title" type="text" indexed="true" stored="true" required="false"/>
    <field name="authors" type="text" indexed="true" stored="true" required="false" multiValued="true"/>
  </fields>

  <uniqueKey>@@key</uniqueKey>

  <defaultSearchField>title</defaultSearchField>

  <solrQueryParser defaultOperator="OR"/>
</schema>

Page 39: Lily at HUG UK

Indexer configuration (Lily)

<?xml version="1.0"?>
<indexer xmlns:b="org.lilyproject.bookssample">
  <cases>
    <case recordType="b:Book" variant="*" vtags="last" indexVersionless="true"/>
  </cases>

  <indexFields>
    <indexField name="title">
      <value>
        <field name="b:title"/>
      </value>
    </indexField>

    <indexField name="authors">
      <value>
        <deref>
          <follow field="b:authors"/>
          <field name="b:name"/>
        </deref>
      </value>
    </indexField>
  </indexFields>
</indexer>
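The deref element is where denormalization happens: the authors index field is filled by following the b:authors links and reading b:name from each linked record. The sketch below mimics that evaluation in plain Python. The record ids and values are invented, and field, deref and build_index_doc are hypothetical helpers, not part of Lily.

```python
# Hypothetical in-memory records; ids and values are made up for illustration.
RECORDS = {
    "author-1": {"b:name": "Author One"},
    "author-2": {"b:name": "Author Two"},
    "book-1": {"b:title": "Some Title", "b:authors": ["author-1", "author-2"]},
}


def field(name):
    """Value expression: read a field from the record itself."""
    return lambda record, records: [record[name]] if name in record else []


def deref(link_field, name):
    """Value expression: follow a link field, then read a field of each target."""
    def evaluate(record, records):
        targets = record.get(link_field, [])
        return [records[t][name] for t in targets if name in records.get(t, {})]
    return evaluate


# Mirrors the two <indexField> elements of the indexer configuration.
INDEX_FIELDS = {
    "title": field("b:title"),
    "authors": deref("b:authors", "b:name"),
}


def build_index_doc(record_id, records):
    """Evaluate every index field for one record and return the flat document."""
    record = records[record_id]
    return {name: expr(record, records) for name, expr in INDEX_FIELDS.items()}


doc = build_index_doc("book-1", RECORDS)
```

The resulting document carries the author names directly, so a search on authors needs no join at query time. The trade-off is that a change to an author record means the linked books have to be reindexed.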

Page 40: Lily at HUG UK

(opt.) Sharding configuration

{
  shardingKey: {
    value: {
      source: "variantProperty",
      property: "language"
    },
    type: "string"
  },

  mapping: {
    type: "list",
    entries: [
      { shard: "shard1", values: ["en", "it"] },
      { shard: "shard2", values: ["nl", "de", "es"] }
    ]
  }
}
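Read literally, this configuration picks a shard by looking up the record's language variant property in a list of values per shard. A minimal sketch of that lookup follows. The dictionary repeats the configuration above, select_shard is a hypothetical function, and raising an error for an unlisted value is an assumption of this sketch, not documented Lily behaviour.

```python
# Same content as the sharding configuration, written as a Python dict.
SHARDING = {
    "shardingKey": {
        "value": {"source": "variantProperty", "property": "language"},
        "type": "string",
    },
    "mapping": {
        "type": "list",
        "entries": [
            {"shard": "shard1", "values": ["en", "it"]},
            {"shard": "shard2", "values": ["nl", "de", "es"]},
        ],
    },
}


def select_shard(config, variant_properties):
    """Return the shard name whose value list contains the sharding key."""
    key_spec = config["shardingKey"]["value"]
    if key_spec["source"] != "variantProperty":
        raise ValueError("this sketch only handles variantProperty keys")
    key = variant_properties[key_spec["property"]]
    for entry in config["mapping"]["entries"]:
        if key in entry["values"]:
            return entry["shard"]
    # Assumption: treat an unlisted value as an error.
    raise LookupError(f"no shard lists the value {key!r}")


shard = select_shard(SHARDING, {"language": "nl"})
```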

Page 41: Lily at HUG UK

Lily API

» Java (using Avro)
  » http://docs.outerthought.org/lily-docs-current/g3/g1/390-lily.html
» REST (HTTP + JSON)
  » http://docs.outerthought.org/lily-docs-current/g3/g2/427-lily.html
» All docs
  » http://docs.outerthought.org/lily-docs-current/ext/toc/

Page 42: Lily at HUG UK

Demo

» http://outerthought.blip.tv/file/4245615/

Page 43: Lily at HUG UK

Lily and HBase

» adds high-level content model
» data types
» versioning
» blob storage on HDFS
» focus on sparse (efficient) storage
» RowLog for synchronous cross-table updates and async message queues

Page 44: Lily at HUG UK

Lily and SOLR

» provides flexible mapping between HBase content model and SOLR index fields
» interactive and batch (M/R) index maintenance
» sharding
» use(s) SOLR as-is: loose, flexible, extensible coupling
» search access via SOLR (HTTP) API
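Since search goes through SOLR's own HTTP API, a client only needs to build an ordinary select request. A minimal sketch, assuming a SOLR instance at http://localhost:8983/solr (host, port and path are assumptions) and the title field from the example SOLR schema. The helper solr_select_url is hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, rows=10):
    """Build the URL of a SOLR select request that returns JSON."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


# Search the "title" index field; the base URL is an assumed local instance.
url = solr_select_url("http://localhost:8983/solr", "title:hbase")
```

The URL can then be fetched with any HTTP client. With sharding in place, the same request would go to whichever SOLR instance fronts the shards.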

Page 45: Lily at HUG UK

Lily and CDH

» we intend to rely on CDH-‘blessed’ versions of HBase/HDFS/ZK
» 700 patches and testing
» next: adopting similar distribution layout
» since we contribute patches to ASF HBase trunk, we would expect CDH to track closely (until HBase 1.0)
» some Lily users could be interested in ‘CDH-level’ services

Page 46: Lily at HUG UK

goodbye

» It’s open source!
» Content Repository: available now (Lily model + HBase + SOLR + RowLog)
» Lily 1.0 soon, will mainly focus on differentiating open source and enterprise edition
» “HBase is wa de max maat.” (Flemish, roughly: “HBase is the best, mate.”)

Page 47: Lily at HUG UK

Thank you!
for your attention
for your questions

» [email protected]

» @stevenn