Charlie Hull - Managing Director28th November 2014
[email protected]/blog+44 (0) 8700 118334Twitter: @FlaxSearch
Turning Search Upside Down
with open source software
We design, build and support open source powered search applications
Who are Flax?
We design, build and support open source powered search applications
Based in Cambridge U.K., technology agnostic & independent – but open source exponents & committers
Who are Flax?
We design, build and support open source powered search applications
Based in Cambridge U.K., technology agnostic & independent – but open source exponents & committers
UK Authorized Partner of
Who are Flax?
We design, build and support open source powered search applications
Based in Cambridge U.K., technology agnostic & independent – but open source exponents & committers
UK Authorized Partner of
Customers include Reed Specialist Recruitment, Mydeco, NLA, Gorkana, Financial Times, News UK, EMBL-EBI, Accenture, University of Cambridge, UK Government...
Who are Flax?
We design, build and support open source powered search applications
Based in Cambridge U.K., technology agnostic & independent – but open source exponents & committers
UK Authorized Partner of
Customers in recruitment, government, e-commerce, news & media, bioinformatics, consulting, law...
Who are Flax?
We design, build and support open source powered search applications
Based in Cambridge U.K., technology agnostic & independent – but open source exponents & committers
UK Authorized Partner of
Customers in recruitment, government, e-commerce, news & media, bioinformatics, consulting, law...
Who are Flax?
What I'll cover today
Monitoring & classifying data with stored queries
Media monitoring - how it used to be done
What we can't change
Turning search upside down with Flax's Luwak
Case studies: Media monitoring in Scandinavia Who else is using Luwak?
Conclusions
Monitoring & classifying with stored queries 'Saved searches' / 'alerting' / 'stored profiles'
Monitoring & classifying with stored queries 'Saved searches' / 'alerting' / 'stored profiles'
Use cases:– Monitor news for client interests– Tag/classify incoming data according to certain rules
Monitoring & classifying with stored queries 'Saved searches' / 'alerting' / 'stored profiles'
Use cases:– Monitor news for client interests– Tag/classify incoming data according to certain rules
Volume & speed– 10k → 1m stored queries– 10k → 1m new items a day– Latencies as low as 50-100ms
Media monitoring
Provide a news monitoring service to clients
Media monitoring
Provide a news monitoring service to clients
Usually also provide a searchable news archive
Media monitoring
Provide a news monitoring service to clients
Usually also provide a searchable news archive
People translate requirements & verify output...
Media monitoring
Provide a news monitoring service to clients
Usually also provide a searchable news archive
People translate requirements & verify output...
..but volume is handled by software
Media monitoring
Provide a news monitoring service to clients
Usually also provide a searchable news archive
People translate requirements & verify output...
..but volume is handled by software
UK/Europe – Gorkana, Precise, Cision; USA – BurellesLuce
Media monitoring – how it used to be done
Manual process of translating client interests into stored queries
Media monitoring – how it used to be done
Manual process of translating client interests into stored queries – many pages of Boolean terms– 10k-250k characters– Evolved over many years, grown not built– Difficult to maintain and troubleshoot
Like this...
(((";!MOBILE PHONE*"; OR ";PHONE MAST*"; OR ";HANDSET*"; OR ";CELL* PHONE*"; OR ";3G"; OR ";GPRS"; OR ";G.P.R.S"; OR ";!GENERAL !RADIO PACKET SERVICE*"; OR ";GSM"; OR ";G.S.M"; OR ";!GLOBAL SYSTEM FOR !MOBILE COMM*"; OR ";HSDPA"; OR ";H.S.D.P.A"; OR ";HIGH SPEED DOWNLINK !PACKET ACCESS"; OR ";HSUPA"; OR ";H.S.U.P.A"; OR ";HIGH SPEED !UPLINK !PACKET ACCESS"; OR ";UMTS"; OR ";U.M.T.S"; OR ";MVNO"; OR ";M.V.N.O"; OR ";SMS"; OR ";SHORT MESSAGE !SERVICE*"; OR ";MMS"; OR ";!MULTIMEDIA MESSAGE !SERVICE*"; OR ";!MOBILES"; OR ";!CELLPHONE*"; OR ";!TELECOM*"; OR ";!LANDLINE*"; OR ";!TELEPHONE*"; OR ";PHONE*"; OR ";!TELEKOM*"; OR ";TELCO*"; OR ";VODAFONE"; OR ";T-MOBILE"; OR ";TMOBILE"; OR ";!TELEFONICA"; OR ";BT"; OR ";!MOBILE USER*"; OR ";TEXT MESSAG*"; OR ";SMARTPHONE*"; OR ";!VIRGIN !MEDIA*"; OR ";CABLE & !WIRELESS"; OR ";CABLE AND !WIRELESS";) W/48 ((";PROFIT*"; OR ";LOSS*"; OR ";BAN"; OR ";BANNED"; OR ";PREMIUM RATE*"; OR ";FINANC*"; OR ";!REFINANC*"; OR ";OFFICE OF FAIR TRADING"; OR ";MERGER*"; OR ";!ACQUISIT*"; OR ";ACQUIR*"; OR ";TAKEOVER*"; OR ";BUYOUT*"; OR ";BUY-OUT*"; OR ";NEW PRODUCT*"; OR ";INVEST*"; OR ";SHARES"; OR ";MARKET*"; OR ";ACCOUNT*"; OR ";MONEY"; OR ";CASH*"; OR ";SECURIT*"; OR ";!ENTERPRIS*"; OR ";!BUSINESS*"; OR ";PRICE*"; OR ";JOINT*"; OR ";NEW VENTURE*"; OR ";PRICING"; OR ";COST*"; OR ";CHAIRM?N"; OR ";APPOINT*"; OR ";!EXECUTIVE"; OR ";SALE*"; OR ";SELL*"; OR ";FULL YEAR"; OR ";REGULAT*"; OR ";!DIRECTIVE*"; OR ";LAW"; OR ";LAWS"; OR ";!LEGISLAT*"; OR ";GREEN PAPER"; OR ";WHITE PAPER*"; OR ";!MEDIAWATCH"; OR ";MORAL*"; OR ";ETHIC*"; OR ";ADVERT*"; OR ";AD"; OR ";ADS"; OR ";MARKETING"; OR ";!COMPLAIN*"; OR ";MIS-SOLD"; OR ";MIS-SELL*"; OR ";SPONSOR"; OR ";COSTCUT*"; OR ";COST CUT*"; OR ";CUT* COST*"; OR ";FIBRE OPTIC*"; OR ";TAX"; OR ";TAXES"; OR ";TAXED"; OR ";EXPAND*"; OR ";!EXPANSION"; OR ";EMPLOY*"; OR ";STAFF"; OR ";WORKER*"; OR ";SPOKESM?N"; OR ";DEBUT"; OR ";BRAND*"; OR ";DIRECTOR*";) OR ((";FAIR"; OR ";UNFAIR"; OR ";%UNSCRUPULOUS"; OR ";NOT FAIR"; OR ";UNJUST*"; OR ";!PENALISE*";) W/12 (";CHARG*"; OR ";TARIFF*"; OR ";PRICE PLAN*"; OR ";GLOBAL";)))) AND NOT (";EXPRESS OFFER"; OR ";TIMES OFFER"; OR ";READER OFFER"; OR ((";CALLS COST";) W/6 (";FROM A LANDLINE"; OR ";FROM LANDLINE*"; OR ";BT LANDLINE*";))))
…and that's an easy one!
Media monitoring – how it used to be done Complex pipelines with many sources monitored– print & scan & OCR– XML feeds (e.g. NLA media access in UK)– Web content, Twitter etc.
Lots of search engines running in parallel– dtSearch on 100+ servers, then merged– 'Stored query' functionality built in e.g. Verity
What we can't change...
Syntax of the stored queries (sometimes) – Rewriting these is a huge, risky job for the business
What we can't change...
Syntax of the stored queries (sometimes) – Rewriting these is a huge, risky job for the business
Volume keeps increasing
What we can't change...
Syntax of the stored queries (sometimes) – Rewriting these is a huge, risky job for the business
Volume keeps increasing
The need to test new and old queries on archive data - using the same syntax
What we can't change...
Syntax of the stored queries (sometimes) – Rewriting these is a huge, risky job for the business
Volume keeps increasing
The need to test new and old queries on archive data - using the same syntax
Source data will often be low quality– Scans– PDF extracts– Re-typed XML
Speaking the same language
Parse the old query syntax into a new form– Either on the fly (dtSearch, Verity)– Or off-line (Verity) – In theory we can translate any old syntax– A neutral syntax protects against lock-in• Why not?
Speaking the same language
Parse the old query syntax into a new form– Either on the fly (dtSearch, Verity)– Or off-line (Verity) – In theory we can translate any old syntax– A neutral syntax protects against lock-in• Why not?
Testing very, very important– Agree a test set– Watch for broken queries (in several ways)– Use the client's resources to help
Turning search upside down..
Docs
Result
QueryQueryStoredQueries
Turning search upside down..
Docs
Result
QueryQueryStoredQueries
1 million queriesSome 250k longComplex rules
1 million new documents a day
Within 5-100ms
Turning search upside down..
Docs
Result
QueryQueryStoredQueries
1 million queriesSome 250k longComplex rules
1 million new documents a day
$$$$$$
Within 5-100ms
Turning search upside down..
Docs
Result
QueryQueryStoredQueries
1 million queriesSome 250k longComplex rules
1 million new documents a day
$$$$$$
Within 5-100ms
Turning search upside down..
Docs
QueryQueryStoredQueries 1.
Pre
QuerySubset
1 million queriesSome 250k longComplex rules
~200
Doc
1 million new documents a day
Turning search upside down..
Docs
QueryQueryStoredQueries 1.
Pre
QuerySubset
Result
1 million queriesSome 250k longComplex rules
~200
2.Search
Doc
1 million new documents a day
Turning search upside down..
How to avoid running 1 million searches on 1 million documents?
Turning search upside down..
How to avoid running 1 million searches on 1 million documents? – don't run all of them!
– 1. Presearch - Find the likely candidates first by searching the stored queries using the document
Turning search upside down..
• How to avoid running 1 million searches on 1 million documents? – don't run all of them!
– 1. Presearch - Find the likely candidates first by searching the stored queries using the document
– 2. Search - Then perform a normal search, but with far fewer queries, and do it in memory for speed
Introducing Luwak
Based on a (slight) fork of Apache Lucene
A library for efficiently running many stored queries
Two stages:– Presearcher (find candidate queries)– Monitor Searcher (run candidate queries against a
single document in-memory index)
Very fast – 70-100k queries per second when first tested and 3-4 times faster now
https://github.com/flaxsearch/luwak
The gory details...
Custom QueryParsers for dtSearch, Verity VQL/OTL etc.
Hacked Lucene to expose positional informationLUCENE-3831, LUCENE-3827 in Lucene 4.xLUCENE-2878 'Nuke Spans' coming in Lucene 5.0
Works with standard Lucene queries TermQuery, BooleanQuery, RegexpQuery...Can be extended to work with custom query types
Lots (and lots) of performance tuning
Using Luwak in a larger system
Run multiple instances of Luwak to handle throughput
Use message passing platforms (e.g. RabbitMQ, Apache Kafka)
Build a separate index of articles – a searchable archive – for testing new stored queries and/or for customers
Use RESTful Web Service APIs everywhere
Test against known matches from older systems
Likely problems
Some old queries will be broken– Replicate the broken behaviour, or fix it?
Likely problems
Some old queries will be broken– Replicate the broken behaviour, or fix it?
Some queries will need troubleshooting– Queries can be very large– Possibly in a foreign language
Likely problems
Some old queries will be broken– Replicate the broken behaviour, or fix it?
Some queries will need troubleshooting– Queries can be very large– Possibly in a foreign language
Performance tuning is hard– JVM settings, caching, storage types
Likely problems
Some old queries will be broken– Replicate the broken behaviour, or fix it?
Some queries will need troubleshooting– Queries can be very large– Possibly in a foreign language
Performance tuning is hard– JVM settings, caching, storage types
Remember this is a different system – 1 to 1 matching correspondence is impossible!
Case study – media monitoring in Scandinavia
“The leading Danish provider of media intelligence, i.e. media search, media monitoring and media analysis”
Old system based on Verity (monitoring) and Autonomy IDOL (archive search)
Issues:Systems at full capacity - even with lots of hardwareArchive search very, very slowComplex workflow with some very old parts (FTP, batch files...)Proprietary query format (Verity VQL)
Case study – media monitoring in Scandinavia
Flax engaged in 2013 as part of a complete re-architecture
Phased approach:1. Translate all old queries to new in-house syntax 'IQL'2. New monitoring pipeline (RabbitMQ, RESTful Web Services, Flax's Luwak)3. New archive search (Apache Solr)
Case study – media monitoring in Scandinavia
Flax engaged in 2013 as part of a complete re-architecture
Phased approach:1. Translate all old queries to new in-house syntax 'IQL'2. New monitoring pipeline (RabbitMQ, RESTful Web Services, Flax's Luwak)3. New archive search (Apache Solr)
Performance/volume targets:Monitoring 20 articles/second, aim for 100 eventually100 qps for archive search, response within 250ms17000 stored queries, 85 million articles in archive
Case study – media monitoring in Scandinavia
Monitoring using 2 servers, Archive search using 8 servers (including failover)
6 cores, 18GB per serverlaunch more servers to scale the system
Case study – media monitoring in Scandinavia
Monitoring using 2 servers, Archive search using 8 servers (including failover)
6 cores, 18GB per serverlaunch more servers to scale the system
All stored queries translated to new IQL syntaxIncluding the ones Verity didn't tell us weren't working...
Case study – media monitoring in Scandinavia
Monitoring using 2 servers, Archive search using 8 servers (including failover)
6 cores, 18GB per serverlaunch more servers to scale the system
All stored queries translated to new IQL syntaxIncluding the ones Verity didn't tell us weren't working...
Monitoring customers are being migrated in stages from Oct 2014, Archive search beta testing during Oct 2014, completion by end 2014
Case study – media monitoring in Scandinavia
The benefits:
– No vendor lock-in (open source, IQL)
– Industry standard software (Apache Lucene/Solr, RabbitMQ)
– Predictable scaling & costs
Who else is using Luwak?
Bloomberg http://www.flax.co.uk/blog/2014/03/24/london-search-meetup-serious-solr-at-bloomberg-elasticsearch-1-0/
Booz Allen Hamilton http://www.slideshare.net/BryanBende/realtime-inverted-search-nyc-aslug-oct-2014
Early versions are installed at
We know of users in other sectors e.g. recruitment
We're hoping to make it a standard part of Lucene/Solr and Elasticsearch (i.e. to improve the Percolator)
Apache Samza integration?https://github.com/romseygeek/samza-luwak
Conclusions
To solve a hard problem, sometimes you need to turn it upside down
Conclusions
To solve a hard problem, sometimes you need to turn it upside down
When you migrate to open source, you can keep some of the parts of your system (if you must)
Conclusions
To solve a hard problem, sometimes you need to turn it upside down
When you migrate to open source, you can keep some of the parts of your system (if you must)
Open source helps you scale - with predictable costs
Conclusions
To solve a hard problem, sometimes you need to turn it upside down
When you migrate to open source, you can keep some of the parts of your system (if you must)
Open source helps you scale - with predictable costs
Leading companies are doing it
Conclusions
To solve a hard problem, sometimes you need to turn it upside down
When you migrate to open source, you can keep some of the parts of your system (if you must)
Open source helps you scale - with predictable costs
Leading companies are doing it
Go and turn Search Upside Down!
Another thing...
BioSolr – a funded project to improve the way search technology is used in bioinformatics– EBI & NCBI are involved– http://www.flax.co.uk/blog/2014/10/02/biosolr-begins-with-a-workshop-day/
– Already producing Solr patches e.g SOLR-1387– Come and join us!