Turning search upside down with powerful open source search software

Charlie Hull - Managing Director
28th November 2014

[email protected]/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch

Turning Search Upside Down

with open source software

We design, build and support open source powered search applications

Who are Flax?


Based in Cambridge U.K., technology agnostic & independent – but open source exponents & committers

UK Authorized Partner of

Customers include Reed Specialist Recruitment, Mydeco, NLA, Gorkana, Financial Times, News UK, EMBL-EBI, Accenture, University of Cambridge, UK Government...

Customers in recruitment, government, e-commerce, news & media, bioinformatics, consulting, law...

What I'll cover today

Monitoring & classifying data with stored queries

Media monitoring - how it used to be done

What we can't change

Turning search upside down with Flax's Luwak

Case studies:
– Media monitoring in Scandinavia
– Who else is using Luwak?

Conclusions

Monitoring & classifying with stored queries

'Saved searches' / 'alerting' / 'stored profiles'


Use cases:
– Monitor news for client interests
– Tag/classify incoming data according to certain rules


Volume & speed:
– 10k → 1m stored queries
– 10k → 1m new items a day
– Latencies as low as 50-100ms

Media monitoring

Provide a news monitoring service to clients

Usually also provide a searchable news archive

People translate requirements & verify output...

..but volume is handled by software

UK/Europe – Gorkana, Precise, Cision; USA – BurellesLuce

Media monitoring – how it used to be done

Manual process of translating client interests into stored queries
– Many pages of Boolean terms
– 10k-250k characters
– Evolved over many years, grown not built
– Difficult to maintain and troubleshoot

Like this...

((("!MOBILE PHONE*" OR "PHONE MAST*" OR "HANDSET*" OR "CELL* PHONE*" OR "3G" OR "GPRS" OR "G.P.R.S" OR "!GENERAL !RADIO PACKET SERVICE*" OR "GSM" OR "G.S.M" OR "!GLOBAL SYSTEM FOR !MOBILE COMM*" OR "HSDPA" OR "H.S.D.P.A" OR "HIGH SPEED DOWNLINK !PACKET ACCESS" OR "HSUPA" OR "H.S.U.P.A" OR "HIGH SPEED !UPLINK !PACKET ACCESS" OR "UMTS" OR "U.M.T.S" OR "MVNO" OR "M.V.N.O" OR "SMS" OR "SHORT MESSAGE !SERVICE*" OR "MMS" OR "!MULTIMEDIA MESSAGE !SERVICE*" OR "!MOBILES" OR "!CELLPHONE*" OR "!TELECOM*" OR "!LANDLINE*" OR "!TELEPHONE*" OR "PHONE*" OR "!TELEKOM*" OR "TELCO*" OR "VODAFONE" OR "T-MOBILE" OR "TMOBILE" OR "!TELEFONICA" OR "BT" OR "!MOBILE USER*" OR "TEXT MESSAG*" OR "SMARTPHONE*" OR "!VIRGIN !MEDIA*" OR "CABLE & !WIRELESS" OR "CABLE AND !WIRELESS") W/48 (("PROFIT*" OR "LOSS*" OR "BAN" OR "BANNED" OR "PREMIUM RATE*" OR "FINANC*" OR "!REFINANC*" OR "OFFICE OF FAIR TRADING" OR "MERGER*" OR "!ACQUISIT*" OR "ACQUIR*" OR "TAKEOVER*" OR "BUYOUT*" OR "BUY-OUT*" OR "NEW PRODUCT*" OR "INVEST*" OR "SHARES" OR "MARKET*" OR "ACCOUNT*" OR "MONEY" OR "CASH*" OR "SECURIT*" OR "!ENTERPRIS*" OR "!BUSINESS*" OR "PRICE*" OR "JOINT*" OR "NEW VENTURE*" OR "PRICING" OR "COST*" OR "CHAIRM?N" OR "APPOINT*" OR "!EXECUTIVE" OR "SALE*" OR "SELL*" OR "FULL YEAR" OR "REGULAT*" OR "!DIRECTIVE*" OR "LAW" OR "LAWS" OR "!LEGISLAT*" OR "GREEN PAPER" OR "WHITE PAPER*" OR "!MEDIAWATCH" OR "MORAL*" OR "ETHIC*" OR "ADVERT*" OR "AD" OR "ADS" OR "MARKETING" OR "!COMPLAIN*" OR "MIS-SOLD" OR "MIS-SELL*" OR "SPONSOR" OR "COSTCUT*" OR "COST CUT*" OR "CUT* COST*" OR "FIBRE OPTIC*" OR "TAX" OR "TAXES" OR "TAXED" OR "EXPAND*" OR "!EXPANSION" OR "EMPLOY*" OR "STAFF" OR "WORKER*" OR "SPOKESM?N" OR "DEBUT" OR "BRAND*" OR "DIRECTOR*") OR (("FAIR" OR "UNFAIR" OR "%UNSCRUPULOUS" OR "NOT FAIR" OR "UNJUST*" OR "!PENALISE*") W/12 ("CHARG*" OR "TARIFF*" OR "PRICE PLAN*" OR "GLOBAL")))) AND NOT ("EXPRESS OFFER" OR "TIMES OFFER" OR "READER OFFER" OR (("CALLS COST") W/6 ("FROM A LANDLINE" OR "FROM LANDLINE*" OR "BT LANDLINE*"))))

…and that's an easy one!

Media monitoring – how it used to be done

Complex pipelines with many sources monitored
– Print & scan & OCR
– XML feeds (e.g. NLA media access in UK)
– Web content, Twitter etc.

Lots of search engines running in parallel
– dtSearch on 100+ servers, then merged
– 'Stored query' functionality built in, e.g. Verity

What we can't change...

Syntax of the stored queries (sometimes)
– Rewriting these is a huge, risky job for the business



Volume keeps increasing




The need to test new and old queries on archive data - using the same syntax




Source data will often be low quality
– Scans
– PDF extracts
– Re-typed XML

Speaking the same language

Parse the old query syntax into a new form
– Either on the fly (dtSearch, Verity)
– Or off-line (Verity)
– In theory we can translate any old syntax
– A neutral syntax protects against lock-in
  • Why not?

Testing is very, very important
– Agree a test set
– Watch for broken queries (in several ways)
– Use the client's resources to help
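One way to use an agreed test set is to compare, query by query, which documents the old and new systems matched. The following is a minimal sketch only: the function name, the dictionary layout and the sample IDs are assumptions made for illustration, not part of any Flax or Luwak tooling.

```python
def compare_matches(old_matches, new_matches):
    """Compare per-query match sets from two systems over the same test set.

    Both arguments map a query id to the set of document ids it matched.
    Returns only the queries whose results differ.
    """
    report = {}
    for query_id in sorted(set(old_matches) | set(new_matches)):
        old = old_matches.get(query_id, set())
        new = new_matches.get(query_id, set())
        if old != new:
            report[query_id] = {
                "missing": sorted(old - new),  # matched by the old system only
                "extra": sorted(new - old),    # matched by the new system only
            }
    return report


# Hypothetical results for three stored queries over one test set.
old_system = {"q1": {"d1", "d2"}, "q2": {"d3"}, "q3": set()}
new_system = {"q1": {"d1", "d2"}, "q2": {"d3", "d4"}, "q3": {"d5"}}
differences = compare_matches(old_system, new_system)
```

A difference is not automatically a regression: a query that was silently broken in the old system may correctly start matching in the new one, so each entry in the report still needs a person to review it.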

Turning search upside down..

[Diagram: Docs, Stored Queries, Result. Labels: 1 million queries, some 250k long, complex rules; 1 million new documents a day; within 5-100ms; $$$$$$]

[Diagram: Docs, Stored Queries, Query Subset (~200), Doc, Result. Steps: 1. Presearch, 2. Search]


How to avoid running 1 million searches on 1 million documents? Don't run all of them!
– 1. Presearch: find the likely candidates first by searching the stored queries using the document
– 2. Search: then perform a normal search, but with far fewer queries, and do it in memory for speed
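The two stages can be illustrated with a toy example. This is a sketch under heavy simplifying assumptions, not Luwak's API or its actual algorithm: the `StoredQuery` and `Monitor` classes are hypothetical, queries are plain "all of these terms, none of those" rules, and documents are split on whitespace.

```python
from collections import defaultdict


class StoredQuery:
    """Hypothetical query: matches if ALL required and NO excluded terms occur."""

    def __init__(self, query_id, required, excluded=()):
        if not required:
            raise ValueError("this sketch needs at least one required term")
        self.query_id = query_id
        self.required = frozenset(t.lower() for t in required)
        self.excluded = frozenset(t.lower() for t in excluded)

    def matches(self, doc_terms):
        return self.required <= doc_terms and not (self.excluded & doc_terms)


class Monitor:
    def __init__(self):
        self.queries = {}
        self.index = defaultdict(set)  # term -> ids of queries filed under it

    def add(self, query):
        self.queries[query.query_id] = query
        # File the query under one required term: the query cannot match a
        # document that lacks this term, so it is a safe presearch key.
        self.index[min(query.required)].add(query.query_id)

    def presearch(self, doc_terms):
        """Stage 1: use the document's terms to look up candidate queries."""
        candidates = set()
        for term in doc_terms:
            candidates |= self.index.get(term, set())
        return candidates

    def match(self, text):
        """Stage 2: fully evaluate only the candidates against the document."""
        doc_terms = frozenset(text.lower().split())
        candidates = self.presearch(doc_terms)
        matched = sorted(q for q in candidates if self.queries[q].matches(doc_terms))
        return matched, len(candidates)


monitor = Monitor()
monitor.add(StoredQuery("telecoms", ["vodafone", "merger"]))
monitor.add(StoredQuery("tax", ["tax"], excluded=["offer"]))
monitor.add(StoredQuery("football", ["football"]))
matched, evaluated = monitor.match("Vodafone announces merger and new tax plan")
```

Here only two of the three stored queries are evaluated in full. The presearch may return false positives, which stage 2 discards, but it must never drop a query that would have matched.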

Introducing Luwak

Based on a (slight) fork of Apache Lucene

A library for efficiently running many stored queries

Two stages:
– Presearcher (find candidate queries)
– Monitor Searcher (run candidate queries against a single document in-memory index)

Very fast – 70-100k queries per second when first tested and 3-4 times faster now

https://github.com/flaxsearch/luwak

The gory details...

Custom QueryParsers for dtSearch, Verity VQL/OTL etc.

Hacked Lucene to expose positional information
– LUCENE-3831, LUCENE-3827 in Lucene 4.x
– LUCENE-2878 'Nuke Spans' coming in Lucene 5.0

Works with standard Lucene queries
– TermQuery, BooleanQuery, RegexpQuery...
– Can be extended to work with custom query types

Lots (and lots) of performance tuning

Using Luwak in a larger system

Run multiple instances of Luwak to handle throughput

Use message passing platforms (e.g. RabbitMQ, Apache Kafka)

Build a separate index of articles – a searchable archive – for testing new stored queries and/or for customers

Use RESTful Web Service APIs everywhere

Test against known matches from older systems
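The first two points above (several matcher instances fed by a message queue) can be sketched with the Python standard library. Everything here is a stand-in chosen for illustration: `queue.Queue` plays the role of a broker such as RabbitMQ or Kafka, the hypothetical `match` function plays the role of a Luwak instance, and the substring test is far simpler than a real stored query.

```python
import queue
import threading


def match(doc, queries):
    """Hypothetical matcher: a query matches if its phrase occurs in the text."""
    text = doc["text"].lower()
    return sorted(qid for qid, phrase in queries.items() if phrase in text)


def worker(inbox, outbox, queries):
    """One matcher instance: consume documents until a None sentinel arrives."""
    while True:
        doc = inbox.get()
        if doc is None:
            break
        outbox.put((doc["id"], match(doc, queries)))


def run_pipeline(docs, queries, n_workers=2):
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(inbox, outbox, queries))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for doc in docs:
        inbox.put(doc)
    for _ in threads:
        inbox.put(None)  # one sentinel per worker so each one stops
    for thread in threads:
        thread.join()
    results = {}
    while not outbox.empty():
        doc_id, matched = outbox.get()
        results[doc_id] = matched
    return results


docs = [
    {"id": 1, "text": "Vodafone merger talks"},
    {"id": 2, "text": "Football results"},
    {"id": 3, "text": "Tax rises hit Vodafone"},
]
queries = {"q-vodafone": "vodafone", "q-tax": "tax"}
results = run_pipeline(docs, queries)
```

Because every worker holds the full query set and documents are independent of each other, throughput can be raised by adding workers without changing the matching logic.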

Likely problems

Some old queries will be broken
– Replicate the broken behaviour, or fix it?

Some queries will need troubleshooting
– Queries can be very large
– Possibly in a foreign language

Performance tuning is hard
– JVM settings, caching, storage types

Remember this is a different system – 1 to 1 matching correspondence is impossible!

Case study – media monitoring in Scandinavia

“The leading Danish provider of media intelligence, i.e. media search, media monitoring and media analysis”

Old system based on Verity (monitoring) and Autonomy IDOL (archive search)

Issues:
– Systems at full capacity - even with lots of hardware
– Archive search very, very slow
– Complex workflow with some very old parts (FTP, batch files...)
– Proprietary query format (Verity VQL)


Flax engaged in 2013 as part of a complete re-architecture

Phased approach:
1. Translate all old queries to new in-house syntax 'IQL'
2. New monitoring pipeline (RabbitMQ, RESTful Web Services, Flax's Luwak)
3. New archive search (Apache Solr)


Performance/volume targets:
– Monitoring 20 articles/second, aim for 100 eventually
– 100 qps for archive search, response within 250ms
– 17000 stored queries, 85 million articles in archive
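To put the monitoring targets in perspective, a quick back-of-envelope calculation shows what they would mean if every stored query had to run against every article. The input figures come from the targets above; the brute-force framing and the function names are illustrative assumptions, not measurements from the project.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def brute_force_load(articles_per_second, stored_queries):
    """Query evaluations per second if every query ran against every article."""
    return articles_per_second * stored_queries


def articles_per_day(articles_per_second):
    """Daily article volume at a sustained per-second rate."""
    return articles_per_second * SECONDS_PER_DAY


current_load = brute_force_load(20, 17_000)    # 340,000 evaluations/second
eventual_load = brute_force_load(100, 17_000)  # 1,700,000 evaluations/second
daily_volume = articles_per_day(20)            # 1,728,000 articles/day

print(current_load, eventual_load, daily_volume)
```

Presearching to a small candidate set for each article is what keeps the real workload far below these brute-force numbers.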


Monitoring using 2 servers, Archive search using 8 servers (including failover)

6 cores, 18GB per server

Launch more servers to scale the system




All stored queries translated to new IQL syntax
– Including the ones Verity didn't tell us weren't working...

Monitoring customers are being migrated in stages from Oct 2014, Archive search beta testing during Oct 2014, completion by end 2014


The benefits:

– No vendor lock-in (open source, IQL)

– Industry standard software (Apache Lucene/Solr, RabbitMQ)

– Predictable scaling & costs

Who else is using Luwak?

Bloomberg http://www.flax.co.uk/blog/2014/03/24/london-search-meetup-serious-solr-at-bloomberg-elasticsearch-1-0/

Booz Allen Hamilton http://www.slideshare.net/BryanBende/realtime-inverted-search-nyc-aslug-oct-2014

Early versions are installed at

We know of users in other sectors e.g. recruitment

We're hoping to make it a standard part of Lucene/Solr and Elasticsearch (i.e. to improve the Percolator)

Apache Samza integration? https://github.com/romseygeek/samza-luwak


Conclusions

To solve a hard problem, sometimes you need to turn it upside down

When you migrate to open source, you can keep some of the parts of your system (if you must)

Open source helps you scale - with predictable costs

Leading companies are doing it

Go and turn Search Upside Down!

Another thing...

BioSolr – a funded project to improve the way search technology is used in bioinformatics
– EBI & NCBI are involved
– http://www.flax.co.uk/blog/2014/10/02/biosolr-begins-with-a-workshop-day/
– Already producing Solr patches, e.g. SOLR-1387 (https://issues.apache.org/jira/browse/SOLR-1387)
– Come and join us!

Thank you!

Any questions?

[email protected]/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch
