sunbirst

Post on 24-May-2015

Category:

Technology

DESCRIPTION

Introducing Sunbirst, a distributed pipeline processing architecture for Apache Solr.

TRANSCRIPT

Sunbirst

A distributed worker model for Apache Solr

@sleepyfox for sourcesense

What’s in the box

• Context
• Problem definition
• One possible solution
• Discussion
• ...

Where we are now

Existing system

• Usual Solr production configuration:
  • High-volume search
  • Low-volume indexing
• Our customer:
  • High-volume indexing
  • Low-volume search

Volumes

• 3m new docs indexed/day
• 60-day archive
  • = 180m docs indexed
• 10k searches/day
  • ≈ 1 search every 8.6 seconds
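The figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes load is spread evenly over a 24-hour day, which the slides do not claim, and the constant names are made up for illustration.

```python
# Figures taken from the slide; an even spread over 24 hours is an assumption.
DOCS_PER_DAY = 3_000_000
ARCHIVE_DAYS = 60
SEARCHES_PER_DAY = 10_000
SECONDS_PER_DAY = 24 * 60 * 60

docs_in_archive = DOCS_PER_DAY * ARCHIVE_DAYS            # total docs held in the index
seconds_per_search = SECONDS_PER_DAY / SEARCHES_PER_DAY  # average gap between searches
docs_indexed_per_second = DOCS_PER_DAY / SECONDS_PER_DAY # average indexing rate
docs_indexed_per_search = DOCS_PER_DAY / SEARCHES_PER_DAY

print(f"archive size:       {docs_in_archive:,} docs")
print(f"search interval:    {seconds_per_search:.2f} s")
print(f"indexing rate:      {docs_indexed_per_second:.1f} docs/s")
print(f"indexed per search: {docs_indexed_per_search:.0f} docs")
```

On these assumptions the system indexes a few hundred documents for every search it serves, which is the inversion of the usual Solr workload described earlier.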

Existing architecture

How it works

• 2 rows, each 20 shards + coordinator
• Partitioning algorithm = (id % 20)
• Each shard has:
  • Solr instance
  • Indexer
  • Optimizer
  • Committer
  • Purger
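The partitioning rule can be sketched in a couple of lines. The function name `shard_for` and the use of integer document ids are assumptions for illustration; the slides only state the rule `id % 20`.

```python
from collections import Counter

NUM_SHARDS = 20  # shards per row, from the slide


def shard_for(doc_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Return the shard index for a document id using modulo partitioning."""
    return doc_id % num_shards


# With sequential ids, modulo partitioning spreads documents evenly.
distribution = Counter(shard_for(doc_id) for doc_id in range(1000))
print(shard_for(41))                # shard index for document 41
print(set(distribution.values()))   # number of docs per shard
```

The appeal of this scheme is that every component can compute a document's shard without a lookup table; the cost shows up when the shard count changes.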

How it works

• Documents retrieved by coordinator in blocks of 500
• These are allocated by id to shards according to the partitioning scheme
• Shards poll metabases for their content
• Shards index content
• Coordinator archives content
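The first two steps above (fetch in blocks of 500, allocate by id) might look roughly like this. It is a minimal sketch, not the customer's code: `blocks` and `allocate` are hypothetical helper names, ids are assumed to be integers, and the metabase polling, indexing and archiving steps are left out.

```python
from collections import defaultdict
from itertools import islice

BLOCK_SIZE = 500   # block size from the slide
NUM_SHARDS = 20    # shards per row, from the earlier slide


def blocks(doc_ids, size=BLOCK_SIZE):
    """Yield successive lists of at most `size` document ids."""
    it = iter(doc_ids)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def allocate(block, num_shards=NUM_SHARDS):
    """Group the ids in one block by target shard (id % num_shards)."""
    by_shard = defaultdict(list)
    for doc_id in block:
        by_shard[doc_id % num_shards].append(doc_id)
    return dict(by_shard)


for block in blocks(range(1200)):
    allocation = allocate(block)
    print(len(block), "docs across", len(allocation), "shards")
```

Each shard would then fetch and index only the ids in its own list.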

Challenges

• Coordinator responsible for 2 things:
  • Archiving content
  • Routing searches
• Redundant data flow from metabases
• Partitioning scheme means ((n-1)/n)*100 percent of docs move when a shard is added
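The resharding cost can be verified by brute force. The sketch below assumes integer ids spread evenly over the id space and takes n to be the shard count after the addition; `fraction_moved` is a hypothetical helper, not part of the system described.

```python
def fraction_moved(old_shards: int, new_shards: int) -> float:
    """Fraction of ids whose shard changes when the modulus changes.

    Checks one full period of ids (old_shards * new_shards), which is
    enough because the pair (id % old, id % new) repeats after that.
    """
    period = old_shards * new_shards
    moved = sum(1 for i in range(period) if i % old_shards != i % new_shards)
    return moved / period


# Growing a row from 20 to 21 shards.
moved = fraction_moved(20, 21)
print(f"{moved:.1%} of docs change shard")   # compare with (n-1)/n for n = 21
```

Only ids whose remainder is the same under both moduli stay put, so nearly the whole index has to be re-homed for a single extra shard.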

One possible future

Distributed workflow

• Different worker pools:
  • Indexer
  • Searcher
  • Archiver
  • Coordinator
  • Content enricher...

Ingest Pipeline

[Diagram. Labels: Ingest queue, Ingester, Enricher, Archiver, Indexer, Coordinator, Disk (x2), Ref. data, Archive, Solr]
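As a rough illustration of the worker-pool idea, the sketch below wires two pools together with in-process queues. Everything here is an assumption: the stage order (ingest queue, then enricher, then indexer), the pool sizes, and the helpers `run_pool`, `stop_pool` and `run_pipeline` are invented for the example, and Python threads stand in for what would be separate processes talking over a message broker.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to shut down


def run_pool(name, handler, inbox, outbox, size):
    """Start `size` threads that apply `handler` to each item from `inbox`."""
    def work():
        while True:
            item = inbox.get()
            if item is STOP:
                break
            result = handler(item)
            if outbox is not None:
                outbox.put(result)

    threads = [threading.Thread(target=work, name=f"{name}-{i}") for i in range(size)]
    for t in threads:
        t.start()
    return threads


def stop_pool(threads, inbox):
    """Queue one STOP per worker (behind any pending work) and wait for exit."""
    for _ in threads:
        inbox.put(STOP)
    for t in threads:
        t.join()


def run_pipeline(docs, enrichers=2, indexers=3):
    ingest_q, index_q = queue.Queue(), queue.Queue()
    indexed, lock = [], threading.Lock()

    def enrich(doc):
        # Hypothetical enrichment step: tag the document.
        return {**doc, "enriched": True}

    def index(doc):
        # Stand-in for sending the document to Solr.
        with lock:
            indexed.append(doc)

    enrich_pool = run_pool("enricher", enrich, ingest_q, index_q, enrichers)
    index_pool = run_pool("indexer", index, index_q, None, indexers)
    for doc in docs:
        ingest_q.put(doc)
    stop_pool(enrich_pool, ingest_q)  # all docs are enriched once this returns
    stop_pool(index_pool, index_q)
    return sorted(indexed, key=lambda d: d["id"])


result = run_pipeline([{"id": i} for i in range(10)])
print(len(result), "documents indexed")
```

Because each stage only knows about its inbox and outbox, a pool can be resized or a new stage inserted without touching the others.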

ESB

• Orchestration, workflow and EI patterns by Apache ServiceMix
• Messaging by Apache ActiveMQ
• REST by Apache CXF
• Runtime container by Apache Karaf
• 100% Open Source Software

Call to arms

• Designed to be more generic than the initial itch that needed scratching
• Have Solr/Lucene committers
• Happy to accept outside contributors
• May eventually become an Apache Incubator project
• Contact: Nigel Runnels-Moss
  • @sleepyfox on Twitter
  • n.runnels-moss@sourcesense.com

Questions
