Un-Structured Or: How I Learned to Stop Worrying and Love the XML Mike Nibeck, Asim Shaikh

Upload: lucidworks-archived

Post on 11-May-2015


TRANSCRIPT

Page 1: Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented by Mike Nibeck and Asim Shaikh

Un-Structured!

Or: How I Learned to Stop Worrying and Love the XML

Mike Nibeck, Asim Shaikh

Page 2:

1st NF, 2nd NF, 3rd NF!

It’s The Way It’s Done

Page 3:

Maintainability vs. Performance

Page 4:

I’m Feeling Lucky

Page 5:

Solr

• Extension of Apache Lucene • Full Text Search • Open Interfaces (XML, JSON, HTTP) • Faceted Search • Database Ingest • Document Indexing (PDF, Word, etc.) • Spelling Suggestions • Auto Suggest • “Cloudy” • Advanced Input Parsing • Relevance Ranking • v4.4
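The "Open Interfaces" and "Faceted Search" items can be made concrete with a small sketch. It only builds the URL for an HTTP request to Solr's select handler, asking for JSON output and facet counts. The parameter names (q, wt, rows, facet, facet.field) are standard Solr query parameters; the host, the core name "bills", and the field names are hypothetical and do not come from the slides.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, facet_fields, rows=10):
    """Build a Solr /select URL that requests JSON results plus facet counts."""
    params = [
        ("q", query),        # the user's search expression
        ("wt", "json"),      # response writer: JSON instead of XML
        ("rows", str(rows)),
        ("facet", "true"),   # turn faceting on
    ]
    # facet.field may be repeated, once per field to count
    params += [("facet.field", name) for name in facet_fields]
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Hypothetical host, core, and fields, for illustration only
url = build_select_url(
    "http://localhost:8983/solr", "bills", "title:budget", ["congress", "chamber"]
)
print(url)
```

Because the interface is plain HTTP, the same URL can be issued from a browser, curl, or any application server language.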

Page 6:

You got your chocolate in my peanut butter!

Page 7:

It’s a Hammer. A really nice, efficient and free hammer.

Page 8:

A Mental Shift Pancakes & Relevancy

Page 9:

Chronicling America

• 6.8 million documents • 10 billion vectors • 50,000 queries/day • 250 GB index • +100K documents per month

Congress.gov

• 4 million documents • 3.3+ million queries/day (user and system) • 36 GB indexes • Adding many thousands/month

Library Web Search

• 18+ million documents • 9,000 queries/day • 28GB index size • + many thousands/month

World Digital Library

• 120k documents • 7 different languages • 10-50k queries/day • Index < 1GB • +100 documents/month
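To put the daily query volumes above on a per-second scale, the following sketch divides each figure by the number of seconds in a day. The inputs are the numbers from the slide (Congress.gov is listed as "3.3+ million", so its result is a lower bound); the assumption of an even spread over 24 hours is a simplification, and real peak load will be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Daily query volumes as stated on the slide
daily_queries = {
    "Chronicling America": 50_000,
    "Congress.gov": 3_300_000,
    "Library Web Search": 9_000,
}


def average_qps(queries_per_day):
    """Average queries per second, assuming traffic is spread evenly over the day."""
    return queries_per_day / SECONDS_PER_DAY


for name, per_day in daily_queries.items():
    print(f"{name}: {average_qps(per_day):.2f} queries/second on average")
```

On these figures Congress.gov averages roughly 38 queries per second, while the other two sites average under one.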

Page 10:

Solr Architecture - congress.gov

[Architecture diagram. Labeled components: Users, Web Cache, Load Balancer, App Servers, SOLR Cores (two groups), Indexing, Database, Filesystem, ETL Processing (Extract, Translate, Load), Master Data Sources, Legacy Systems, Data Partners]
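The ETL stage in the diagram can be sketched as three small functions that end in a Solr XML update message. This is a minimal illustration, not the congress.gov pipeline: the sample rows, column names, and field names are invented, and the middle step (labeled "Translate" on the slide) is shown here as a simple cleanup function. The only Solr-specific part is the add/doc/field layout of the XML update format.

```python
import xml.etree.ElementTree as ET


def extract():
    """Stand-in for reading rows from a master data source (invented sample rows)."""
    return [
        {"ID": 1, "TITLE": "  A Bill to Fund Libraries ", "CHAMBER": "house"},
        {"ID": 2, "TITLE": "A Resolution on Archives", "CHAMBER": "senate"},
    ]


def transform(row):
    """Map source columns to index field names and tidy the values."""
    return {
        "id": str(row["ID"]),
        "title": row["TITLE"].strip(),
        "chamber": row["CHAMBER"].capitalize(),
    }


def load(docs):
    """Serialize documents as a Solr XML <add> message, ready to POST to an update handler."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")


payload = load(transform(row) for row in extract())
print(payload)
```

Keeping the three steps separate means the source-specific code (extract) can change without touching the index-specific code (load).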

Page 11:

Analyzers, Tokenizers and Filters. Oh My!
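In Solr, an analyzer for a field type is a tokenizer followed by a chain of token filters. The toy pipeline below imitates that shape in plain Python so the roles are easy to see. It is an illustration of the concept only, not Solr's implementation: the function names and the stop-word list are made up for this example.

```python
import re

# Tiny illustrative stop-word list, not Solr's default
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}


def tokenize(text):
    """Tokenizer: split raw text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Filter: normalize case so 'Library' and 'library' match."""
    return [token.lower() for token in tokens]


def stop_filter(tokens):
    """Filter: drop very common words that add little to relevance."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text, tokenizer=tokenize, filters=(lowercase_filter, stop_filter)):
    """Analyzer: run one tokenizer, then each filter in order."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Library of Congress"))  # ['library', 'congress']
```

Filter order matters: the stop filter here compares against lowercase words, so it has to run after the lowercase filter.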

Page 12:

Cores? We Don’t Need No Stinkin' Cores

Page 13:

Data Import Handler
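The Data Import Handler is driven over HTTP with a command parameter such as full-import, delta-import, or status. The sketch below builds those request URLs. It assumes the handler is registered under the conventional "/dataimport" path in solrconfig.xml; the host and the core name "bills" are hypothetical.

```python
from urllib.parse import urlencode


def dih_url(base_url, core, command, handler="dataimport", **options):
    """Build a Data Import Handler request URL for the given command."""
    params = {"command": command}
    # Render Python booleans as the lowercase true/false that Solr expects
    params.update({key: str(value).lower() for key, value in options.items()})
    return f"{base_url}/{core}/{handler}?{urlencode(params)}"


# Hypothetical host and core, for illustration only
base = "http://localhost:8983/solr"
full_import = dih_url(base, "bills", "full-import", clean=True, commit=True)
status = dih_url(base, "bills", "status")

print(full_import)
print(status)
```

A full import with clean=true empties the index before reloading, so polling the status URL before switching traffic to a freshly loaded core is a sensible precaution.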

Page 14:

Next Steps

Page 15:

Open Source Tools

• PHP / Zend • Python / Django • MySQL • RabbitMQ • Varnish • Jenkins • Graphite, StatsD