TRANSCRIPT
Un-Structured
Or: How I Learned to Stop Worrying and Love the XML
Mike Nibeck, Asim Shaikh
1st NF, 2nd NF, 3rd NF
It’s The Way It’s Done
Maintainability vs. Performance
I’m Feeling Lucky
Solr v4.4
Extension of Apache Lucene
• Full Text Search
• Open Interfaces (XML, JSON, HTTP)
• Faceted Search
• Database Ingest
• Document Indexing (PDF, Word, etc.)
• Spelling Suggestions
• Auto Suggest
• “Cloudy”
• Advanced Input Parsing
• Relevance Ranking
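As a rough illustration of the "Open Interfaces" and "Faceted Search" items, the sketch below builds a plain HTTP query URL that asks Solr for JSON results plus facet counts. The host, the core name `example_core`, and the field name `subject` are assumptions made for the example and are not taken from the slides; `q`, `wt`, `rows`, `facet`, and `facet.field` are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Hypothetical host and core name, used only for illustration.
SOLR_BASE = "http://localhost:8983/solr/example_core"


def build_facet_query(text, facet_field, rows=10):
    """Build a /select URL asking for JSON results plus facet counts."""
    params = [
        ("q", text),                  # full-text query
        ("wt", "json"),               # response format
        ("rows", rows),               # number of documents to return
        ("facet", "true"),            # turn faceting on
        ("facet.field", facet_field), # field to count values for
    ]
    return SOLR_BASE + "/select?" + urlencode(params)


# "subject" is an assumed field name, not one from the presentation.
url = build_facet_query("civil war", "subject")
print(url)
```

Because the interface is just HTTP, the same URL can be issued from a browser, from curl, or from any of the application stacks listed later in the deck.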
You got your chocolate in my peanut butter!
It’s a Hammer. A really nice, efficient
and free hammer.
A Mental Shift
Pancakes & Relevancy
Chronicling America
• 6.8 million documents
• 10 billion vectors
• 50,000 queries/day
• Index 250GB
• +100K documents per month
Congress.gov
• 4 million documents
• 3.3+ million queries/day (user and system)
• 36 GB indexes
• Adding many thousands/month
Library Web Search
• 18+ million documents
• 9,000 queries/day
• 28GB index size
• + many thousands/month
World Digital Library
• 120k documents
• 7 different languages
• 10-50k queries/day
• Index < 1GB
• +100 documents/month
Solr Architecture - congress.gov
[Architecture diagram. Labeled components: Users, Web Cache, Load Balancer, App Servers, SOLR Cores (two groups), Indexing, Database, Filesystem, Legacy Systems, Data Partners]
ETL Processing
Extract, Translate, Load
Master Data Sources
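The extract, translate, load steps named on this slide can be sketched as three small functions. This is a minimal illustration only: the source column names (`BILL_ID`, `TITLE`, `CONGRESS`) and the target field names are hypothetical, and the slides do not describe the actual pipeline.

```python
import json


def extract(rows):
    """Extract: yield raw records from a source (here, an in-memory list)."""
    for row in rows:
        yield row


def translate(record):
    """Translate: map source column names onto flat index field names."""
    return {
        "id": str(record["BILL_ID"]),
        "title": record["TITLE"].strip(),
        "congress": int(record["CONGRESS"]),
    }


def load(docs):
    """Load: serialize documents as a JSON array, ready to hand to an indexer."""
    return json.dumps(list(docs))


# Hypothetical sample row standing in for a master data source.
source_rows = [{"BILL_ID": 101, "TITLE": "  A sample bill ", "CONGRESS": "113"}]
payload = load(translate(r) for r in extract(source_rows))
print(payload)
```

Keeping the translate step separate makes the denormalization explicit: each source row becomes one flat document, which is the shape a search index expects.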
Analyzers, Tokenizers and Filters. Oh My!
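The idea behind an analysis chain can be sketched in plain Python: a tokenizer splits text into tokens, and each filter then transforms the token stream in turn. This is a conceptual illustration, not Solr's API; the stop-word list and function names are assumptions made for the example.

```python
import re

# Illustrative stop words only; not Solr's default list.
STOP_WORDS = {"a", "an", "and", "the", "of"}


def tokenize(text):
    """Tokenizer: split text into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase_filter(tokens):
    """Filter: normalize case so 'America' and 'america' match."""
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    """Filter: drop very common words that add little to relevance."""
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text):
    """Analyzer: one tokenizer followed by an ordered chain of filters."""
    tokens = tokenize(text)
    for token_filter in (lowercase_filter, stop_filter):
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Chronicles of America"))  # ['chronicles', 'america']
```

Filter order matters: here the stop filter runs after lowercasing, so "The" is removed even though the stop list holds only lowercase words.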
Cores? We Don’t Need No Stinkin' Cores
Data Import Handler
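For context, the Data Import Handler is driven over HTTP with a `command` parameter. The sketch below only builds the request URLs; the host and core name are hypothetical, and it assumes the handler is registered at the conventional `/dataimport` path in `solrconfig.xml`, which may differ per installation.

```python
from urllib.parse import urlencode

# Hypothetical host and core name, used only for illustration.
SOLR_BASE = "http://localhost:8983/solr/example_core"


def dih_url(command, **options):
    """Build a Data Import Handler URL for the given command and options."""
    params = {"command": command}
    params.update(options)
    return SOLR_BASE + "/dataimport?" + urlencode(params)


# Rebuild the index from the configured data source, then commit.
full_import = dih_url("full-import", clean="true", commit="true")
# Poll the progress of a running import.
status = dih_url("status")

print(full_import)
print(status)
```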
Next Steps
Open Source Tools
• PHP / Zend
• Python / Django
• MySQL
• RabbitMQ
• Varnish
• Jenkins
• Graphite, StatsD
Mike Nibeck - [email protected]
Asim Shaikh - [email protected]