Sunbirst: A distributed worker model for Apache Solr (@sleepyfox for sourcesense)

Upload: nigel-runnels-moss

Post on 24-May-2015

DESCRIPTION

Introducing Sunbirst, a distributed pipeline processing architecture for Apache Solr.

TRANSCRIPT

Page 1: Sunbirst

Sunbirst: A distributed worker model for Apache Solr

@sleepyfox for sourcesense

Page 2: Sunbirst

What’s in the box

• Context
• Problem definition
• One possible solution
• Discussion
• ...

Page 3: Sunbirst

Where we are now

Page 6: Sunbirst

Existing system

• Usual Solr production configuration:
  • High-volume search
  • Low-volume indexing
• Our customer:
  • High-volume indexing
  • Low-volume search

Page 12: Sunbirst

Volumes

• 3m new docs indexed/day
• 60-day archive
  • = 180m docs indexed
• 10k searches/day
  • = about 1 search every 9 seconds on average
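The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch: the three input numbers come from the slide, while the even spread of load across the day is an assumption made here to get averages.

```python
# Inputs taken from the slide.
DOCS_PER_DAY = 3_000_000
ARCHIVE_DAYS = 60
SEARCHES_PER_DAY = 10_000

SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: load is spread evenly over the day, so these are averages only.
docs_in_archive = DOCS_PER_DAY * ARCHIVE_DAYS            # docs held in the index
seconds_per_search = SECONDS_PER_DAY / SEARCHES_PER_DAY  # average gap between searches
docs_per_second = DOCS_PER_DAY / SECONDS_PER_DAY         # average indexing rate

print(f"{docs_in_archive:,} docs in the archive")
print(f"one search every {seconds_per_search:.2f} s on average")
print(f"{docs_per_second:.1f} docs indexed per second on average")
```

On these averages the system indexes dozens of documents in the time between two searches, which is the inverse of the usual Solr workload.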

Page 13: Sunbirst

Existing architecture

Page 14: Sunbirst

How it works

• 2 rows, each with 20 shards + coordinator
• Partitioning algorithm = (id % 20)
• Each shard has:
  • Solr instance
  • Indexer
  • Optimizer
  • Committer
  • Purger
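The partitioning rule above can be sketched in Python as follows. The function name `shard_for` is hypothetical, and the sketch assumes document ids are non-negative integers.

```python
NUM_SHARDS = 20  # shards per row, from the slide


def shard_for(doc_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Return the index of the shard that owns doc_id under modulo partitioning."""
    # Assumption: doc_id is a non-negative integer.
    return doc_id % num_shards


# Consecutive ids are spread round-robin across the shards.
for doc_id in (0, 1, 19, 20, 41):
    print(doc_id, "-> shard", shard_for(doc_id))
```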

Page 15: Sunbirst

How it works

• Documents retrieved by coordinator in blocks of 500
• These are allocated by id to shards according to the partitioning scheme
• Shards poll metabases for their content
• Shards index content
• Coordinator archives content
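The first two steps above (fetch in blocks of 500, allocate by id) might look roughly like this. It is a sketch, not the existing system's code: `blocks` and `allocate` are hypothetical names, ids are assumed to be non-negative integers, and the polling, indexing and archiving steps are left out.

```python
from collections import defaultdict
from itertools import islice
from typing import Dict, Iterable, Iterator, List

BLOCK_SIZE = 500  # block size from the slide
NUM_SHARDS = 20   # shards per row, from the slide


def blocks(doc_ids: Iterable[int], size: int = BLOCK_SIZE) -> Iterator[List[int]]:
    """Yield successive lists of at most `size` ids."""
    it = iter(doc_ids)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def allocate(block: List[int], num_shards: int = NUM_SHARDS) -> Dict[int, List[int]]:
    """Group the ids in one block by owning shard (id % num_shards)."""
    by_shard: Dict[int, List[int]] = defaultdict(list)
    for doc_id in block:
        by_shard[doc_id % num_shards].append(doc_id)
    return dict(by_shard)


# Stand-in for ids retrieved by the coordinator.
for block in blocks(range(1200)):
    allocation = allocate(block)
    print(len(block), "docs across", len(allocation), "shards")
```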

Page 16: Sunbirst

Challenges

• Coordinator responsible for 2 things:
  • Archiving content
  • Routing searches
• Redundant data flow from metabases
• Partitioning scheme means ((n-1)/n)*100 percent of docs move on adding a shard (n = shard count after the addition)
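The resharding cost can be checked empirically. The sketch below reads n as the shard count after the addition, so the expected fraction is (n-1)/n, and it assumes consecutive integer ids; `fraction_moved` is a hypothetical helper.

```python
def fraction_moved(old_shards: int, new_shards: int, num_docs: int) -> float:
    """Fraction of ids 0..num_docs-1 whose shard changes when the modulus changes."""
    moved = sum(
        1
        for doc_id in range(num_docs)
        if doc_id % old_shards != doc_id % new_shards
    )
    return moved / num_docs


n_old, n_new = 20, 21
# 420,000 is a multiple of 20 * 21, so every residue pair occurs equally often.
measured = fraction_moved(n_old, n_new, num_docs=420_000)
predicted = (n_new - 1) / n_new

print(f"measured {measured:.4f}, predicted {predicted:.4f}")
```

Going from 20 to 21 shards relocates about 95% of the documents, which is why modulo partitioning makes adding capacity so expensive.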

Page 17: Sunbirst

One possible future

Page 18: Sunbirst

Distributed workflow

• Different worker pools:
  • Indexer
  • Searcher
  • Archiver
  • Coordinator
  • Content enricher
  • ...
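One way to picture separate worker pools is a thread pool per role, each sized independently. This is a single-process sketch only: the pool sizes and the `index`, `archive` and `search` functions are hypothetical stand-ins, and the slides describe distributed workers rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sizes: indexing dominates this workload, search is light.
POOL_SIZES = {"indexer": 4, "archiver": 2, "searcher": 1}


def index(doc_id: int) -> str:
    return f"indexed {doc_id}"       # stand-in for sending a doc to Solr


def archive(doc_id: int) -> str:
    return f"archived {doc_id}"      # stand-in for writing a doc to the archive


def search(query: str) -> str:
    return f"results for {query!r}"  # stand-in for running a Solr query


# One pool per role, so each role can be scaled without touching the others.
pools = {
    role: ThreadPoolExecutor(max_workers=size, thread_name_prefix=role)
    for role, size in POOL_SIZES.items()
}

doc_ids = range(5)
indexed = [pools["indexer"].submit(index, d) for d in doc_ids]
archived = [pools["archiver"].submit(archive, d) for d in doc_ids]
found = pools["searcher"].submit(search, "solr")

# Collect results in submission order.
index_results = [f.result() for f in indexed]
archive_results = [f.result() for f in archived]
search_result = found.result()

for pool in pools.values():
    pool.shutdown(wait=True)

print(index_results)
print(archive_results)
print(search_result)
```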

Page 19: Sunbirst

Ingest Pipeline

[Diagram. Components shown: Ingest queue, Ingester, Enricher, Archiver, Indexer, Coordinator, Ref. data, Archive, Solr, Disk (x2)]
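The diagram's connections did not survive in this transcript, so the following is a guess at the flow rather than the actual design. It assumes documents go from the ingest queue through an enricher that consults reference data, then to both the archive and Solr. All names, the document fields and the in-memory stand-ins for the archive and Solr are hypothetical.

```python
import queue
from typing import Dict, Iterable, Iterator, List

# Hypothetical reference data used by the enricher.
REF_DATA = {"en": "English", "it": "Italian"}


def ingest(ingest_queue: "queue.Queue[dict]") -> Iterator[dict]:
    """Drain the ingest queue, yielding documents until it is empty."""
    while True:
        try:
            yield ingest_queue.get_nowait()
        except queue.Empty:
            return


def enrich(docs: Iterable[dict]) -> Iterator[dict]:
    """Add a field looked up from the reference data."""
    for doc in docs:
        yield {**doc, "language_name": REF_DATA.get(doc["lang"], "unknown")}


def archive_and_index(docs: Iterable[dict], archive: List[dict], solr: Dict[int, dict]) -> None:
    """Write each enriched doc to the archive and to the (stand-in) Solr index."""
    for doc in docs:
        archive.append(doc)
        solr[doc["id"]] = doc


ingest_queue: "queue.Queue[dict]" = queue.Queue()
ingest_queue.put({"id": 1, "lang": "en", "body": "hello"})
ingest_queue.put({"id": 2, "lang": "fr", "body": "bonjour"})

archive: List[dict] = []   # stand-in for the archive on disk
solr: Dict[int, dict] = {}  # stand-in for the Solr index

archive_and_index(enrich(ingest(ingest_queue)), archive, solr)
print(len(archive), "docs archived;", len(solr), "docs indexed")
```

Because each stage only consumes an iterable and produces one, a stage could be swapped for a pool of workers reading from a message queue without changing its neighbours.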

Page 20: Sunbirst

ESB

• Orchestration, workflow and enterprise integration patterns by Apache ServiceMix
• Messaging by Apache ActiveMQ
• REST by Apache CXF
• Runtime container by Apache Karaf
• 100% Open Source Software

Page 21: Sunbirst

Call to arms

• Designed to be more generic than the initial itch that needed scratching
• Have Solr/Lucene committers
• Happy to accept outside contributors
• May eventually become an Apache Incubator project
• Contact: Nigel Runnels-Moss
  • @sleepyfox on Twitter
  • [email protected]

Page 22: Sunbirst

Questions