Lily at HUG UK

Post on 08-Jul-2015


DESCRIPTION

presentation given 10/feb

TRANSCRIPT

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Lily: Smart data at scale

big data, big problems

MOORE vs data

» coping with volume + need for timeliness = parallel processing

» data becomes business-critical = resilience through distributed architectures

» Hadoop, MapReduce, HBase: the future data platform

[chart: moore vs. data]

the CHALLENGES

» process ALL data
» process data in REAL-TIME
» derive INSIGHTS
» provide INSTANT FEEDBACK

current thinking

[diagram: data STORE, ETL, data warehouse, analytics]

batched, off-line, overnight

[diagram: DATA]

1. store and manage all YOUR data

[diagram: DATA, USER behavior]

2. store user behaviour, nearby

[diagram: DATA, USER behavior, data processing]

3. analyze usage patterns

[diagram: DATA, USER behavior, domain knowledge (patterns, rules, keywords, lists, ...)]

4. add domain knowledge

[diagram: DATA, USER behavior, data processing, domain knowledge (patterns, rules, keywords, lists, ...), recommendations, semantic augmentation, analytics]

5. process, in real-time

[diagram: DATA, USER behavior, data processing, domain knowledge (patterns, rules, keywords, lists, ...), recommendations, semantic augmentation, analytics]

6. augment data

[diagram: SMARTER DATA, data processing, domain knowledge (patterns, rules, keywords, lists, ...), data insights, relations, recommendations, semantic augmentation, analytics]

[diagram: SMARTER DATA, data processing, domain knowledge (patterns, rules, keywords, lists, ...), data insights, relations, recommendations, semantic augmentation, analytics]

SMART DATA, at SCALE... and in real time


stories

NEWS

organisations, names, locations, brands

HYPER-PERSONAL recommendations

TOGETHERNESS, interestingness

news aggregator

scale

product CATALOG

product families, related activities, social graph

up-selling, CROSS-SELLING

recommendedness, relatedness

e-retail

real-time

patents

companies, people, materials, processes

competitive innovation

(dis)SIMILARITY

IP research

insights

Outerthought

» software product company
» scalable content applications
» open source product portfolio
» Java, REST, internet


“The world is moving from content as a cost to data as an opportunity.”

[diagram: data STORE + data warehouse + analytics, real time; Lily 1.0 (CR), Lily 2.0]

Lily (now)

» Large-scale content storage, indexing and search
» Current pilots
» up to now: 4 man-years investment (since Sept/2009)

e-retail mobile media isp e-gov ip research

roadmap

» now: Lily 0.3
» April 2011: Lily 1.0
» Q3 2011
    » real-time statistics + analytics
» Q2 2012: Lily 2.0
    » real-time data processing engine
    » Data Insights
» Along the road: Lily SaaS edition

open source

» www.lilyproject.org
» docs.outerthought.org/lily-docs-current/

Lily Core Concepts

» storage
    » HBase
» repository model
    » versioning, varianting, mixins
» indexing
    » mapping
» search
    » SOLR

falling in love with HBase: phase 1

» automatic scaling to large data sets
» fault-tolerance
» flexible datamodel with sparse data
» commodity hardware
» efficient random access
» community-based open source
» Java if possible

falling in love with HBase: phase 2

» need for consistency
» atomic single-row updates
» M/R for index regeneration

falling in love with HBase: phase 3

HBase

» datamodel with column families and cell versioning
» ordered tables with range scans
» HDFS for blob storage
» Apache
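To make the "ordered tables with range scans" point concrete: when rows are kept sorted by key, a range scan amounts to locating the start key and reading forward until the stop key. The sketch below mimics that idea with a sorted Python list. `SortedTable`, `put`, and `scan` are hypothetical names chosen for illustration; this is not the HBase client API, and the row keys are made up.

```python
import bisect


class SortedTable:
    """Toy in-memory table that keeps rows ordered by row key."""

    def __init__(self):
        self._keys = []   # row keys, always kept sorted
        self._rows = {}   # row key -> row data

    def put(self, key, row):
        # Insert the key at its sorted position the first time it is seen.
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = row

    def scan(self, start_key, stop_key):
        """Yield (key, row) pairs with start_key <= key < stop_key."""
        i = bisect.bisect_left(self._keys, start_key)
        while i < len(self._keys) and self._keys[i] < stop_key:
            key = self._keys[i]
            yield key, self._rows[key]
            i += 1


table = SortedTable()
for key in ["book:003", "author:001", "book:001", "book:002"]:
    table.put(key, {"id": key})

# ";" is the character right after ":", so this range covers every "book:" key.
book_keys = [key for key, _ in table.scan("book:", "book;")]
print(book_keys)
```

Choosing row keys with a common prefix, as in this sketch, is what lets one scan fetch a whole group of related rows.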


Lily Repository Model

Lily Datatypes

Mixins

Sample Lily Schema (excerpt)

namespaces: {
  /* Declaration of namespace prefixes. */
  "org.lilyproject.bookssample": "b",
  "org.lilyproject.vtag": "vtag"
},
fieldTypes: [
  { name: "b$title", valueType: { primitive: "STRING" }, scope: "versioned" },
  { name: "b$pages", valueType: { primitive: "INTEGER" }, scope: "versioned" },
  { name: "b$language", valueType: { primitive: "STRING" }, scope: "versioned" },
  { name: "b$authors", valueType: { primitive: "LINK", multiValue: true }, scope: "versioned" },
  { name: "b$name", valueType: { primitive: "STRING" }, scope: "versioned" },
  { name: "b$bio", valueType: { primitive: "STRING" }, scope: "versioned" },
  { name: "vtag$last", valueType: { primitive: "LONG" }, scope: "non_versioned" }
],
recordTypes: [
  {
    name: "b$Book",
    fields: [
      { name: "b$title", mandatory: true },
      { name: "b$pages", mandatory: false },
      { name: "b$language", mandatory: false },
      { name: "b$authors", mandatory: false },
      { name: "vtag$last", mandatory: false }
    ]
  },

...
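The `mandatory` flags in the `b$Book` record type suggest a simple check: a record is acceptable only if it supplies every mandatory field. The following is a minimal sketch of that check, assuming records are plain dictionaries keyed by field name. `missing_mandatory_fields` is a hypothetical helper, not part of the Lily API, and the sample records are invented.

```python
# Field list of the b$Book record type, transcribed from the schema excerpt.
BOOK_RECORD_TYPE = {
    "name": "b$Book",
    "fields": [
        {"name": "b$title", "mandatory": True},
        {"name": "b$pages", "mandatory": False},
        {"name": "b$language", "mandatory": False},
        {"name": "b$authors", "mandatory": False},
        {"name": "vtag$last", "mandatory": False},
    ],
}


def missing_mandatory_fields(record_type, record_fields):
    """Return names of mandatory fields that the record does not supply."""
    return [
        field["name"]
        for field in record_type["fields"]
        if field["mandatory"] and field["name"] not in record_fields
    ]


# Invented sample records: one with a title, one without.
complete = {"b$title": "Example title", "b$pages": 120}
incomplete = {"b$pages": 120}

print(missing_mandatory_fields(BOOK_RECORD_TYPE, complete))
print(missing_mandatory_fields(BOOK_RECORD_TYPE, incomplete))
```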

Lily Versioning

Flexible content model

» generic enough to accommodate many popular content schemas
    » HTML5, CMIS, RDF, NewsML, Dublin Core, ...
» academically verified
» not limited to ‘content applications’ only
» developer convenience
    » higher level constructs
    » schema reuse
    » versioning, linking, ...

Lily Architecture (deployment)

Lily Architecture (components)

HBase RowLog Library

» need for sync/async operations
» updating of secondary indexes (i.e. tables)
» feeding of Indexer (= bridge to SOLR index maintenance)
» not: transactions
» need for distribution and durability

HBase RowLog Library

» WAL
    » guaranteed execution of synchronous actions
    » call doesn’t return before secondary action finishes
    » e.g. update secondary index tables
    » if all goes well, size = #concurrent ops
    » useful outside of Lily context as well!
» Queue
    » triggering of async actions
    » e.g. (re)index (updated) record with SOLR back-end
    » size depends on speed of back-end process
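The two RowLog roles can be contrasted in a few lines of code. The sketch below is an in-memory toy under stated assumptions: `ToyRowLog` and its methods are hypothetical names, not the RowLog library's API, and the durability and distribution that the real library needs are deliberately left out. It only illustrates that the WAL role blocks until the secondary action finishes, while the queue role returns immediately and grows with the backlog of the consumer.

```python
from collections import deque


class ToyRowLog:
    """In-memory stand-in for the two RowLog roles: WAL and queue."""

    def __init__(self):
        self.wal = {}          # message id -> payload of in-flight sync actions
        self.queue = deque()   # payloads waiting for an async consumer
        self._next_id = 0

    def run_sync(self, payload, action):
        """WAL role: log the intent, run the action, then clear the entry.

        The call returns only after the secondary action has finished. If
        the action raises, the entry stays in the WAL so it can be retried.
        """
        msg_id = self._next_id
        self._next_id += 1
        self.wal[msg_id] = payload
        action(payload)
        del self.wal[msg_id]

    def enqueue(self, payload):
        """Queue role: record work for a slower back-end and return at once."""
        self.queue.append(payload)

    def drain(self, action):
        """Let an async consumer process everything queued so far."""
        processed = 0
        while self.queue:
            action(self.queue.popleft())
            processed += 1
        return processed


secondary_index = {}   # stands in for a secondary index table
search_index = []      # stands in for the SOLR back-end

log = ToyRowLog()
record = {"id": "book-1", "title": "Example title"}

# Synchronous: the secondary index is updated before run_sync returns.
log.run_sync(record, lambda r: secondary_index.setdefault(r["title"], r["id"]))

# Asynchronous: indexing is only queued here and happens when drained.
log.enqueue(record)
pending_before = len(log.queue)
indexed = log.drain(search_index.append)
```

In this toy, the WAL is empty whenever no synchronous action is in flight, which matches the "size = #concurrent ops" remark on the slide.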

The Lily Indexer

» denormalization
» indexing of multiple versions of a record
» incremental index updating
» batch index building
» blob content extraction
» sharding towards multiple SOLR instances

Indexing configuration (SOLR)

<schema name="example" version="1.2">

  <types> [snipped: see SOLR example schema] </types>

  <fields>
    <!-- Fields which are required by Lily -->
    <field name="@@key" type="string" indexed="true" stored="true" required="true"/>
    <field name="@@id" type="string" indexed="true" stored="true" required="true"/>
    <field name="@@vtag" type="string" indexed="true" stored="true" required="true"/>
    <field name="@@versionless" type="string" indexed="true" stored="true" required="false"/>

    <!-- Your own fields -->
    <field name="title" type="text" indexed="true" stored="true" required="false"/>
    <field name="authors" type="text" indexed="true" stored="true" required="false" multiValued="true"/>
  </fields>

  <uniqueKey>@@key</uniqueKey>

  <defaultSearchField>title</defaultSearchField>

  <solrQueryParser defaultOperator="OR"/>

</schema>

Indexer configuration (Lily)

<?xml version="1.0"?>
<indexer xmlns:b="org.lilyproject.bookssample">
  <cases>
    <case recordType="b:Book" variant="*" vtags="last" indexVersionless="true"/>
  </cases>

  <indexFields>
    <indexField name="title">
      <value>
        <field name="b:title"/>
      </value>
    </indexField>

    <indexField name="authors">
      <value>
        <deref>
          <follow field="b:authors"/>
          <field name="b:name"/>
        </deref>
      </value>
    </indexField>
  </indexFields>

</indexer>
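The `authors` index field is the interesting part of this configuration: its value is not taken from the Book record itself but from the records that `b:authors` links to. The sketch below shows that denormalization step on plain dictionaries. It is an illustration only: `build_index_document` is a hypothetical function, the records are invented, and the real indexer also handles versions, variants, and vtags, which are ignored here.

```python
# Invented sample records keyed by record id; link fields hold record ids.
RECORDS = {
    "author-1": {"b:name": "Author One"},
    "author-2": {"b:name": "Author Two"},
    "book-1": {"b:title": "Example title", "b:authors": ["author-1", "author-2"]},
}


def build_index_document(record_id, records):
    """Map one Book record to the two index fields from the configuration."""
    record = records[record_id]
    document = {}

    # <field name="b:title"/>: copy the value as-is.
    if "b:title" in record:
        document["title"] = record["b:title"]

    # <deref>: follow each b:authors link, then read b:name on the target.
    names = [
        records[author_id]["b:name"]
        for author_id in record.get("b:authors", [])
        if "b:name" in records.get(author_id, {})
    ]
    if names:
        document["authors"] = names

    return document


doc = build_index_document("book-1", RECORDS)
print(doc)
```

Because author names are copied into the book's index entry, a change to an author record means the linking books need re-indexing, which is one reason incremental index updating matters.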

(opt.) Sharding configuration

{
  shardingKey: {
    value: {
      source: "variantProperty",
      property: "language"
    },
    type: "string"
  },

  mapping: {
    type: "list",
    entries: [
      { shard: "shard1", values: ["en", "it"] },
      { shard: "shard2", values: ["nl", "de", "es"] }
    ]
  }
}
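Read literally, the sharding configuration says: take the `language` variant property of a record as the sharding key, then pick the shard whose value list contains it. A minimal sketch of that lookup follows. `select_shard` is a hypothetical function, and raising an error for an unmapped language is an assumption of this sketch; the slides do not say how Lily treats that case.

```python
# The sharding configuration from the slide, written as a Python dictionary.
SHARDING_CONFIG = {
    "shardingKey": {
        "value": {"source": "variantProperty", "property": "language"},
        "type": "string",
    },
    "mapping": {
        "type": "list",
        "entries": [
            {"shard": "shard1", "values": ["en", "it"]},
            {"shard": "shard2", "values": ["nl", "de", "es"]},
        ],
    },
}


def select_shard(config, variant_properties):
    """Return the shard whose value list contains the record's sharding key."""
    key_spec = config["shardingKey"]["value"]
    if key_spec["source"] != "variantProperty":
        raise ValueError("this sketch only handles variantProperty keys")
    key = variant_properties[key_spec["property"]]

    for entry in config["mapping"]["entries"]:
        if key in entry["values"]:
            return entry["shard"]
    # Assumption: an unmapped key is treated as a configuration error.
    raise LookupError(f"no shard configured for sharding key {key!r}")


shard_for_dutch = select_shard(SHARDING_CONFIG, {"language": "nl"})
shard_for_english = select_shard(SHARDING_CONFIG, {"language": "en"})
print(shard_for_dutch, shard_for_english)
```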

Lily API

» Java (using Avro)
    » http://docs.outerthought.org/lily-docs-current/g3/g1/390-lily.html
» REST (HTTP + JSON)
    » http://docs.outerthought.org/lily-docs-current/g3/g2/427-lily.html
» All docs
    » http://docs.outerthought.org/lily-docs-current/ext/toc/

Demo

» http://outerthought.blip.tv/file/4245615/

Lily and HBase

» adds high-level content model
    » data types
    » versioning
    » blob storage on HDFS
» focus on sparse (efficient) storage
» RowLog for synchronous cross-table updates and async message queues

Lily and SOLR

» provides flexible mapping between HBase content model and SOLR index fields
» interactive and batch (M/R) index maintenance
» sharding
» use(s) SOLR as-is: loose, flexible, extensible coupling
» search access via SOLR (HTTP) API
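Since search goes through the plain SOLR HTTP API, a client only needs to build a request for Solr's standard `/select` handler. The sketch below constructs such a URL with the standard library and does not send it. The host, port, and core path are hypothetical placeholders, and `title` is the field from the example SOLR schema in this deck.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, rows=10):
    """Build a URL for Solr's standard /select request handler."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


# Hypothetical host and port; "title" is the field from the example schema.
url = solr_select_url("http://localhost:8983/solr", "title:lily")
print(url)
```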

Lily and CDH

» we intend to rely on CDH-‘blessed’ versions of HBase/HDFS/ZK
» 700 patches and testing
» next: adopting similar distribution lay-out
» since we contribute patches to ASF HBase trunk, we would expect CDH to track closely (until HBase 1.0)
» some Lily users could be interested in ‘CDH-level’ services

goodbye

» It’s open source!
» Content Repository: available now (Lily model + HBase + SOLR + RowLog)
» Lily 1.0 soon, will mainly focus on differentiating open source and enterprise edition
» “HBase is simply the best, mate.”

Thank you!

for your attention
for your questions

» stevenn@outerthought.org

» @stevenn