Lily at HUG UK
Post on 08-Jul-2015
IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
Lily: Smart data at scale
big data, big problems
MOORE vs data
» coping with volume + need for timeliness = parallel processing
» data becomes business-critical = resilience through distributed architectures
» Hadoop, MapReduce, HBase: the future data platform
[Figure labels: moore, data]
the CHALLENGES
» process ALL data
» process data in REAL-TIME
» derive INSIGHTS
» provide INSTANT FEEDBACK
current thinking
[Diagram: data STORE, ETL, data warehouse, analytics]
batched, off-line, overnight
DATA
1. store and manage all YOUR data
DATA
USER Behavior
2. store user behaviour, nearby
DATA
data processing
USER Behavior
3. analyze usage patterns
DATA
domain knowledge
patterns, rules, keywords, lists, ...
USER Behavior
4. add domain knowledge
DATA
data processing
domain knowledge
patterns, rules, keywords, lists, ...
USER Behavior
recommendations, semantic augmentation, Analytics
5. process, in real-time
DATA
data processing
domain knowledge
patterns, rules, keywords, lists, ...
USER Behavior
recommendations, semantic augmentation, Analytics
6. augment data
SMARTER DATA
data processing
domain knowledge
patterns, rules, keywords, lists, ...
data insights
relations
recommendations, semantic augmentation, Analytics
SMARTER DATA
data processing
domain knowledge
patterns, rules, keywords, lists, ...
relations
SMART DATA, at SCALE... and in real time
recommendations, semantic augmentation, Analytics
data insights
stories
NEWS
organisations, names, locations, brands
HYPER-PERSONAL recommendations
TOGETHERNESS, interestingness
news aggregator
scale
product CATALOG
product families, related activities, social graph
up-selling, CROSS-SELLING
recommendedness, relatedness
e-retail
real-time
patents
companies, people, materials, processes
competitive innovation
(dis)SIMILARITY
IP research
insights
Outerthought
» software product company
» scalable content applications
» open source product portfolio
» Java, REST, internet
“The world is moving from content as a cost to data as an opportunity.”
data STORE + data warehouse + analytics
real time
Lily 1.0 (CR)
Lily 2.0
Lily (now)
» Large-scale content storage, indexing and search
» Current pilots
  e-retail mobile media isp e-gov ip research
» up to now: 4 man-years investment (since Sept/2009)
roadmap
» now: Lily 0.3
» April 2011: Lily 1.0
» Q3 2011
  » real-time statistics + analytics
» Q2 2012: Lily 2.0
  » real-time data processing engine
  » Data Insights
» Along the road: Lily SaaS edition
open source
» www.lilyproject.org
» docs.outerthought.org/lily-docs-current/
Lily Core Concepts
» storage
  » HBase
» repository model
  » versioning, varianting, mixins
» indexing
  » mapping
» search
  » SOLR
falling in love with HBase: phase 1
» automatic scaling to large data sets
» fault-tolerance
» flexible datamodel with sparse data
» commodity hardware
» efficient random access
» community-based open source
» Java if possible
falling in love with HBase: phase 2
» need for consistency
» atomic single-row updates
» M/R for index regeneration
falling in love with HBase: phase 3
HBase
» datamodel with column families and cell versioning
» ordered tables with range scans
» HDFS for blob storage
» Apache
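The "ordered tables with range scans" point can be illustrated with a small in-memory sketch. This is only a toy model under stated assumptions: `OrderedTableSketch` and the row keys are hypothetical, and the only behaviour borrowed from HBase is that rows are kept sorted by key and that a scan covers a start key (inclusive) up to a stop key (exclusive).

```python
import bisect


class OrderedTableSketch:
    """Toy stand-in for a table whose rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []   # row keys, always kept in sorted order
        self._rows = {}   # row key -> row value

    def put(self, key, value):
        # Insert the key at its sorted position the first time it is seen.
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def scan(self, start_key, stop_key):
        """Return (key, value) pairs with start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(key, self._rows[key]) for key in self._keys[lo:hi]]


table = OrderedTableSketch()
table.put("book:002", "second book")
table.put("author:001", "an author")
table.put("book:001", "first book")

# ";" is the character right after ":" in ASCII, so this range covers
# every key that starts with "book:".
print(table.scan("book:", "book;"))
# [('book:001', 'first book'), ('book:002', 'second book')]
```

Because the keys are sorted, a prefix lookup becomes a contiguous slice instead of a full table walk, which is why row key design matters so much in this kind of store.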
Lily Repository Model
Lily Datatypes
Mixins
Sample Lily Schema (excerpt)
namespaces: {
  /* Declaration of namespace prefixes. */
  "org.lilyproject.bookssample": "b",
  "org.lilyproject.vtag": "vtag"
},
fieldTypes: [
  { name: "b$title", valueType: { primitive: "STRING" }, scope: "versioned" },
  { name: "b$pages", valueType: { primitive: "INTEGER" }, scope: "versioned" },
  { name: "b$language", valueType: { primitive: "STRING" }, scope: "versioned" },
  { name: "b$authors", valueType: { primitive: "LINK", multiValue: true }, scope: "versioned" },
  { name: "b$name", valueType: { primitive: "STRING" }, scope: "versioned" },
  { name: "b$bio", valueType: { primitive: "STRING" }, scope: "versioned" },
  { name: "vtag$last", valueType: { primitive: "LONG" }, scope: "non_versioned" }
],
recordTypes: [
  {
    name: "b$Book",
    fields: [
      { name: "b$title", mandatory: true },
      { name: "b$pages", mandatory: false },
      { name: "b$language", mandatory: false },
      { name: "b$authors", mandatory: false },
      { name: "vtag$last", mandatory: false }
    ]
  },
  ...
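In the excerpt, the b$Book record type marks b$title as mandatory and every other field as optional. A minimal sketch of what a mandatory-field check could look like is given below. The dictionary mirrors the excerpt, but `missing_mandatory_fields` and the sample records are hypothetical and are not part of the Lily API; Lily performs its own validation.

```python
# Hypothetical illustration only. The record type mirrors the b$Book
# entry of the schema excerpt; the helper below is not a Lily function.
BOOK_RECORD_TYPE = {
    "name": "b$Book",
    "fields": [
        {"name": "b$title", "mandatory": True},
        {"name": "b$pages", "mandatory": False},
        {"name": "b$language", "mandatory": False},
        {"name": "b$authors", "mandatory": False},
        {"name": "vtag$last", "mandatory": False},
    ],
}


def missing_mandatory_fields(record_type, record):
    """Return the names of mandatory fields that the record does not set."""
    return [
        field["name"]
        for field in record_type["fields"]
        if field["mandatory"] and field["name"] not in record
    ]


complete = {"b$title": "Some title", "b$pages": 120}
incomplete = {"b$pages": 120}

print(missing_mandatory_fields(BOOK_RECORD_TYPE, complete))    # []
print(missing_mandatory_fields(BOOK_RECORD_TYPE, incomplete))  # ['b$title']
```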
Lily Versioning
Flexible content model
» generic enough to accommodate many popular content schemas
  » HTML5, CMIS, RDF, NewsML, Dublin Core, ...
» academically verified
» not limited to ‘content applications’ only
» developer convenience
» higher level constructs
» schema reuse
» versioning, linking, ...
Lily Architecture (deployment)
Lily Architecture (components)
HBase RowLog Library
» need for sync/async operations
» updating of secondary indexes (i.e. tables)
» feeding of Indexer (= bridge to SOLR index maintenance)
» not: transactions
» need for distribution and durability
HBase RowLog Library
» WAL
  » guaranteed execution of synchronous actions
  » call doesn’t return before secondary action finishes
  » e.g. update secondary index tables
  » if all goes well, size = #concurrent ops
  » useful outside of Lily context as well!
» Queue
  » triggering of async actions
  » e.g. (re)index (updated) record with SOLR back-end
  » size depends on speed of back-end process
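The split between the WAL role and the queue role can be sketched in a few lines. This is a single-process toy model, not the RowLog library: `RowLogSketch`, its `put` method and the two sample actions are hypothetical names, and the idea that a failed entry stays in the log so it can be retried is an assumption made for illustration.

```python
import queue


class RowLogSketch:
    """Toy model of the two roles: a WAL for sync actions, a queue for async ones."""

    def __init__(self):
        self.wal = {}                     # sequence number -> pending (row_key, data)
        self.async_queue = queue.Queue()  # messages awaiting asynchronous processing
        self._next_seq = 0

    def put(self, row_key, data, sync_action):
        self._next_seq += 1
        seq = self._next_seq
        self.wal[seq] = (row_key, data)        # 1. log the intent first
        sync_action(row_key, data)             # 2. the call blocks until this finishes
        del self.wal[seq]                      # 3. success: the entry leaves the WAL
        self.async_queue.put((row_key, data))  # 4. e.g. (re)index the record later
        return seq


secondary_index = {}


def update_secondary_index(row_key, data):
    secondary_index[data["title"]] = row_key


def failing_action(row_key, data):
    raise RuntimeError("secondary table unavailable")


log = RowLogSketch()
log.put("row-1", {"title": "Some title"}, update_secondary_index)
print(secondary_index)          # {'Some title': 'row-1'}
print(len(log.wal))             # 0
print(log.async_queue.qsize())  # 1

try:
    log.put("row-2", {"title": "Other title"}, failing_action)
except RuntimeError:
    pass
print(len(log.wal))             # 1
```

When every synchronous action succeeds, the WAL only ever holds the operations currently in flight, which matches the "size = #concurrent ops" remark. The queue, by contrast, grows or shrinks with the speed of whatever consumes it.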
The Lily Indexer
» denormalization
» indexing of multiple versions of a record
» incremental index updating
» batch index building
» blob content extraction
» sharding towards multiple SOLR instances
Indexing configuration (SOLR)
<schema name="example" version="1.2">
  <types> [snipped: see SOLR example schema] </types>
  <fields>
    <!-- Fields which are required by Lily -->
    <field name="@@key" type="string" indexed="true" stored="true" required="true"/>
    <field name="@@id" type="string" indexed="true" stored="true" required="true"/>
    <field name="@@vtag" type="string" indexed="true" stored="true" required="true"/>
    <field name="@@versionless" type="string" indexed="true" stored="true" required="false"/>
    <!-- Your own fields -->
    <field name="title" type="text" indexed="true" stored="true" required="false"/>
    <field name="authors" type="text" indexed="true" stored="true" required="false" multiValued="true"/>
  </fields>
  <uniqueKey>@@key</uniqueKey>
  <defaultSearchField>title</defaultSearchField>
  <solrQueryParser defaultOperator="OR"/>
</schema>
Indexer configuration (Lily)
<?xml version="1.0"?>
<indexer xmlns:b="org.lilyproject.bookssample">
  <cases>
    <case recordType="b:Book" variant="*" vtags="last" indexVersionless="true"/>
  </cases>
  <indexFields>
    <indexField name="title">
      <value>
        <field name="b:title"/>
      </value>
    </indexField>
    <indexField name="authors">
      <value>
        <deref>
          <follow field="b:authors"/>
          <field name="b:name"/>
        </deref>
      </value>
    </indexField>
  </indexFields>
</indexer>
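The `deref` element is the denormalization step: the "authors" index field is filled by following the b:authors links of a book and reading b:name from each linked record. A rough sketch of that idea on in-memory data follows. The record IDs, the sample values and both helper functions are hypothetical; the real indexer works against the Lily repository, not a dictionary.

```python
# Hypothetical in-memory records; IDs and values are made up for illustration.
RECORDS = {
    "book-1": {"b:title": "Some title", "b:authors": ["author-1", "author-2"]},
    "author-1": {"b:name": "First Author"},
    "author-2": {"b:name": "Second Author"},
}


def deref(records, record_id, follow_field, target_field):
    """Follow the links in follow_field and collect target_field from each linked record."""
    links = records[record_id].get(follow_field, [])
    return [
        records[link][target_field]
        for link in links
        if target_field in records.get(link, {})
    ]


def build_index_document(records, record_id):
    """Build the two index fields declared in the indexer configuration."""
    record = records[record_id]
    return {
        "title": record.get("b:title"),
        "authors": deref(records, record_id, "b:authors", "b:name"),
    }


print(build_index_document(RECORDS, "book-1"))
# {'title': 'Some title', 'authors': ['First Author', 'Second Author']}
```

The consequence of denormalizing is that a change to an author record affects the index entry of every book linking to it, which is one reason incremental index updating needs to track such dependencies.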
(opt.) Sharding configuration
{
  shardingKey: {
    value: { source: "variantProperty", property: "language" },
    type: "string"
  },
  mapping: {
    type: "list",
    entries: [
      { shard: "shard1", values: ["en", "it"] },
      { shard: "shard2", values: ["nl", "de", "es"] }
    ]
  }
}
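Read literally, this configuration takes the "language" variant property of a record as the sharding key and looks it up in a list of values per shard. The sketch below shows one way such a lookup could work. `select_shard` is a hypothetical helper, and raising an error for an unmapped language is an assumption; the slides do not say what Lily does in that case.

```python
# The dictionary mirrors the sharding configuration shown on the slide.
SHARDING_CONFIG = {
    "shardingKey": {
        "value": {"source": "variantProperty", "property": "language"},
        "type": "string",
    },
    "mapping": {
        "type": "list",
        "entries": [
            {"shard": "shard1", "values": ["en", "it"]},
            {"shard": "shard2", "values": ["nl", "de", "es"]},
        ],
    },
}


def select_shard(config, variant_properties):
    """Pick the shard whose value list contains the record's sharding key."""
    key_spec = config["shardingKey"]["value"]
    if key_spec["source"] != "variantProperty":
        raise ValueError("this sketch only handles variantProperty keys")
    key = variant_properties[key_spec["property"]]
    for entry in config["mapping"]["entries"]:
        if key in entry["values"]:
            return entry["shard"]
    # Assumption: an unmapped key is treated as an error.
    raise LookupError(f"no shard configured for {key!r}")


print(select_shard(SHARDING_CONFIG, {"language": "en"}))  # shard1
print(select_shard(SHARDING_CONFIG, {"language": "nl"}))  # shard2
```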
Lily API
» Java (using Avro)
  » http://docs.outerthought.org/lily-docs-current/g3/g1/390-lily.html
» REST (HTTP + JSON)
  » http://docs.outerthought.org/lily-docs-current/g3/g2/427-lily.html
» All docs
  » http://docs.outerthought.org/lily-docs-current/ext/toc/
Demo
» http://outerthought.blip.tv/file/4245615/
Lily and HBase
» adds high-level content model
» data types
» versioning
» blob storage on HDFS
» focus on sparse (efficient) storage
» RowLog for synchronous cross-table updates and async message queues
Lily and SOLR
» provides flexible mapping between HBase content model and SOLR index fields
» interactive and batch (M/R) index maintenance
» sharding
» use(s) SOLR as-is: loose, flexible, extensible coupling
» search access via SOLR (HTTP) API
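Since search goes through the plain SOLR HTTP API, a client only needs to build a query URL. The sketch below assembles one for the `title` index field from the sample SOLR schema. The host, port and `/solr/select` path are assumptions about a default local installation, and `build_solr_query_url` is a hypothetical helper; `q`, `rows` and `wt` are standard SOLR query parameters.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, text, rows=10):
    """Build a SOLR select URL that searches the 'title' index field."""
    params = {
        "q": f"title:{text}",  # field name taken from the sample SOLR schema
        "rows": rows,          # maximum number of hits to return
        "wt": "json",          # ask for a JSON response
    }
    return f"{base_url}/select?{urlencode(params)}"


# Assumed base URL of a local SOLR instance; adjust for a real deployment.
url = build_solr_query_url("http://localhost:8983/solr", "hadoop")
print(url)
# http://localhost:8983/solr/select?q=title%3Ahadoop&rows=10&wt=json
```

The sketch only builds the URL and sends nothing, so it runs without a SOLR server.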
Lily and CDH
» we intend to rely on CDH-‘blessed’ versions of HBase/HDFS/ZK
» 700 patches and testing
» next: adopting similar distribution lay-out
» since we contribute patches to ASF HBase trunk, we would expect CDH to track closely (until HBase 1.0)
» some Lily users could be interested in ‘CDH-level’ services
goodbye
» It’s open source!
» Content Repository: available now (Lily model + HBase + SOLR + RowLog)
» Lily 1.0 soon, will mainly focus on differentiating open source and enterprise edition
» “HBase is wa de max maat.” (Flemish, roughly: “HBase is really great, mate.”)
Thank you!
for your attention
for your questions
» stevenn@outerthought.org
» @stevenn