Sunbirst: A distributed worker model for Apache Solr
@sleepyfox for sourcesense
What’s in the box
• Context
• Problem definition
• One possible solution
• Discussion
• ...
Where we are now
Existing system
• Usual Solr production configuration:
  • High-volume search
  • Low-volume indexing
• Our customer:
  • High-volume indexing
  • Low-volume search
Volumes
• 3m new docs indexed/day
• 60-day archive
• = 180m docs indexed
• 10k searches/day
• = about 1 search every 9 seconds
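The arithmetic behind these figures can be checked in a few lines. This is only a sketch: it assumes load is spread evenly over a 24-hour day, which the slides do not state, and the constant names are made up for illustration.

```python
# Figures quoted on the slide.
DOCS_PER_DAY = 3_000_000
ARCHIVE_DAYS = 60
SEARCHES_PER_DAY = 10_000

# Assumption: load is uniform across the whole day.
SECONDS_PER_DAY = 24 * 60 * 60

docs_in_archive = DOCS_PER_DAY * ARCHIVE_DAYS              # total docs held in the index
docs_per_second = DOCS_PER_DAY / SECONDS_PER_DAY           # average indexing rate
seconds_per_search = SECONDS_PER_DAY / SEARCHES_PER_DAY    # average gap between searches

print(f"{docs_in_archive:,} docs in archive")
print(f"{docs_per_second:.1f} docs indexed per second")
print(f"one search every {seconds_per_search:.2f} seconds")
```

Under that assumption the system indexes a few dozen documents for every search it serves, which is the inverse of the usual Solr workload.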
Existing architecture
How it works
• 2 rows, each 20 shards + coordinator
• Partitioning algorithm = (id % 20)
• Each shard has:
  • Solr instance
  • Indexer
  • Optimizer
  • Committer
  • Purger
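The partitioning rule is simple enough to sketch directly. The function name `shard_for` is hypothetical, and the sketch assumes document ids are non-negative integers, which the slide implies but does not say.

```python
from collections import Counter

SHARDS_PER_ROW = 20  # from the slide: each row has 20 shards


def shard_for(doc_id: int, shards: int = SHARDS_PER_ROW) -> int:
    """Return the shard index for a document under the (id % 20) scheme."""
    return doc_id % shards


# Sequential ids spread evenly: 1000 ids over 20 shards gives 50 per shard.
counts = Counter(shard_for(i) for i in range(1000))
print(shard_for(12345))      # 5
print(set(counts.values()))  # {50}
```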
How it works
• Documents retrieved by coordinator in blocks of 500
• These are allocated by id to shards according to the partitioning scheme
• Shards poll metabases for their content
• Shards index content
• Coordinator archives content
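The first two steps above might look roughly like this. It is an illustration only: `blocks` and `allocate` are hypothetical helpers, ids are assumed to be integers, and the metabase polling, indexing and archiving steps are left out.

```python
from collections import defaultdict
from itertools import islice

BLOCK_SIZE = 500  # from the slide
SHARDS = 20       # from the (id % 20) partitioning scheme


def blocks(doc_ids, size=BLOCK_SIZE):
    """Yield successive lists of at most `size` ids, as the coordinator retrieves them."""
    it = iter(doc_ids)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def allocate(block, shards=SHARDS):
    """Group one block's ids by target shard using id % shards."""
    by_shard = defaultdict(list)
    for doc_id in block:
        by_shard[doc_id % shards].append(doc_id)
    return dict(by_shard)


all_blocks = list(blocks(range(1, 1201)))   # 1200 pending ids
first = allocate(all_blocks[0])             # shard -> ids for the first block
print([len(b) for b in all_blocks])
print(len(first[0]), first[0][:3])
```

With sequential ids, every block of 500 touches all 20 shards, so each shard receives a small slice of every block.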
Challenges
• Coordinator responsible for 2 things:
  • Archiving content
  • Routing searches
• Redundant data flow from metabases
• Partitioning scheme means ((n-1)/n)*100 percent of docs move on adding a shard (n = shard count after the addition)
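The cost of adding a shard under modulo partitioning can be checked by brute force. This sketch assumes integer ids spread uniformly and takes n to be the shard count after the addition; `moved_fraction` is a hypothetical helper.

```python
from fractions import Fraction


def moved_fraction(old_shards, new_shards, sample):
    """Fraction of ids whose shard changes when going from id % old to id % new."""
    moved = sum(1 for i in sample if i % old_shards != i % new_shards)
    return Fraction(moved, len(sample))


# Use a whole number of 20*21 cycles so the result is exact.
sample = range(20 * 21 * 10)
frac = moved_fraction(20, 21, sample)

print(frac)                       # 20/21
print(f"{float(frac) * 100:.1f}%")
```

Going from 20 to 21 shards relocates 20/21 of the documents, which matches (n-1)/n with n = 21. For a 180m-document index that is roughly 171m documents to reindex.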
One possible future
Distributed workflow
• Different worker pools:
  • Indexer
  • Searcher
  • Archiver
  • Coordinator
  • Content enricher
  • ...
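One way to picture a worker pool is as a set of identical workers competing for items on a shared queue, so that any worker can take any document. The sketch below shows that general pattern with Python threads. It is not Sunbirst's implementation, and `run_pool` and its parameters are invented for illustration.

```python
import queue
import threading


def run_pool(name, size, tasks, handler):
    """Run `size` worker threads that compete for items on one shared queue."""
    q = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            item = q.get()
            if item is None:              # sentinel: no more work for this worker
                q.task_done()
                return
            out = handler(item)
            with lock:                    # results list is shared between threads
                results.append((f"{name}-{worker_id}", out))
            q.task_done()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(size)]
    for t in threads:
        t.start()
    for item in tasks:
        q.put(item)
    for _ in threads:
        q.put(None)                       # one sentinel per worker
    q.join()
    for t in threads:
        t.join()
    return results


docs = [{"id": i, "body": f"doc {i}"} for i in range(10)]
indexed = run_pool("indexer", 3, docs, lambda d: d["id"])
print(sorted(doc_id for _, doc_id in indexed))
```

Because no document is tied to a particular worker, a pool can grow or shrink without moving any data, unlike the id-based partitioning of the existing system.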
Ingest Pipeline
[Diagram labels: Ingester, Enricher, Archiver, Indexer, Coordinator, Ingest queue, Disk, Disk, Ref. data, Archive, Solr]
ESB
• Orchestration, workflow and EI patterns by Apache ServiceMix
• Messaging by Apache ActiveMQ
• REST by Apache CXF
• Runtime container by Apache Karaf
• 100% Open Source Software
Call to arms
• Designed to be more generic than the initial itch that needed scratching
• Have Solr/Lucene committers
• Happy to accept outside contributors
• May eventually become an Apache Incubator project
• Contact: Nigel Runnels-Moss
  • @sleepyfox on Twitter
  • [email protected]
Questions