
Un-Structured

Or: How I Learned to Stop Worrying and Love the XML

Mike Nibeck, Asim Shaikh

1st NF, 2nd NF, 3rd NF

It’s The Way It’s Done

Maintainability vs. Performance

I’m Feeling Lucky

Solr

Extension of Apache Lucene

• Full Text Search
• Open Interfaces (XML, JSON, HTTP)
• Faceted Search
• Database Ingest
• Document Indexing (PDF, Word, etc.)
• Spelling Suggestions
• Auto Suggest
• “Cloudy”
• Advanced Input Parsing
• Relevance Ranking
• v4.4
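
As a rough illustration of the open HTTP interface and faceted search listed above, the sketch below builds a query URL for Solr's standard `/select` handler. The host, the core name `documents`, and the field names `language` and `year` are hypothetical placeholders, not details from the talk; `q`, `wt`, `rows`, `facet`, and `facet.field` are standard Solr query parameters. No request is sent, so it runs without a Solr server.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, facet_fields=(), rows=10):
    """Build a URL for Solr's /select handler, optionally with field facets."""
    params = [("q", query), ("wt", "json"), ("rows", str(rows))]
    if facet_fields:
        # Faceting is switched on once, then one facet.field is added per field.
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical host, core, and field names, for illustration only.
url = build_select_url(
    "http://localhost:8983/solr",
    "documents",
    "title:pancakes",
    facet_fields=["language", "year"],
)
print(url)
```

Passing the parameters as a list of pairs, rather than a dict, is what allows `facet.field` to repeat once per faceted field.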

You got your chocolate in my peanut butter!

It’s a Hammer. A really nice, efficient, and free hammer.

A Mental Shift

Pancakes & Relevancy

Chronicling America

• 6.8 million documents
• 10 billion vectors
• 50,000 queries/day
• Index: 250 GB
• +100K documents per month

Congress.gov

• 4 million documents
• 3.3+ million queries/day (user and system)
• 36 GB indexes
• Adding many thousands/month

Library Web Search

• 18+ million documents
• 9,000 queries/day
• 28 GB index size
• + many thousands/month

World Digital Library

• 120k documents
• 7 different languages
• 10-50k queries/day
• Index < 1 GB
• +100 documents/month

Solr Architecture - congress.gov

[Diagram components: Users, App Servers, Web Cache, Load Balancer, Solr Cores, Indexing, Database, Filesystem, Legacy Systems, Data Partners]

ETL Processing

Extract, Translate, Load

Master Data Sources

Analyzers, Tokenizers and Filters. Oh My!
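
To make the analyzer idea concrete, here is a toy analysis chain in plain Python: a tokenizer splits text into tokens, and each filter transforms the token stream in turn. This is only a conceptual sketch, not Solr's API. In Solr, analysis chains are declared in the schema configuration, and the function names and stop-word list below are invented for illustration.

```python
import re

# Illustrative stop-word list, not Solr's default.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}


def tokenize(text):
    """Tokenizer: split raw text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Filter: normalize case so 'XML' and 'xml' match."""
    return [token.lower() for token in tokens]


def stop_filter(tokens):
    """Filter: drop very common words that add little to relevance."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text, tokenizer=tokenize, filters=(lowercase_filter, stop_filter)):
    """Analyzer: run one tokenizer, then each filter in order."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("How I Learned to Stop Worrying and Love the XML"))
# → ['how', 'i', 'learned', 'stop', 'worrying', 'love', 'xml']
```

Filter order matters: the stop filter here only matches lowercase words, so it must run after the lowercase filter.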

Cores? We Don’t Need No Stinkin' Cores

Data Import Handler

Next Steps

Open Source Tools

• PHP / Zend
• Python / Django
• MySQL
• RabbitMQ
• Varnish
• Jenkins
• Graphite, StatsD

Mike Nibeck - [email protected]

Asim Shaikh - [email protected]
