What's the story with open source? Searching and monitoring news media with open source technology
Charlie Hull, Flax
BCS IRSG Search Solutions 2010
Photo source: http://www.flickr.com/photos/shironekoeuro/
www.flax.co.uk 2
What is Flax?
- Search engine specialists
- Formed in 2001 from the ashes of Muscat Ltd and Webtop as Lemur Consulting Ltd
- Based in Cambridge, UK
- Contributors to and users of Xapian
- Recently selected as UK Authorized Partner by Lucid Imagination
- Customers include Mydeco, NLA, Durrants Ltd, Financial Times, MediaMiser, MySkreen
Apache Lucene and Solr are trademarks of The Apache Software Foundation
The challenges
- Content is created for publication, not for search
- Content isn't published consistently or available to all
- Ranking is never simple
- “We just want something like Google”
- Every system will have to scale beyond its originally planned size
- Every project is different
So how do we build news search?
Indexing
- Historical, daily & updates (i.e. later editions)
- Must cope with high volume, quickly
- Essential metadata – byline, title, source
- File format translation not always necessary
- BUT pre-processing sometimes required
- Content restriction & embargo data
Solution
- Lightweight, customisable index scripts using powerful open source libraries
So how do we build news search?
    from datetime import datetime

    import xapian
    import flax.core

    # Create a new Xapian database on disk
    db = xapian.WritableDatabase('db', xapian.DB_CREATE)

    # Describe the fields and save the field map in the database
    fm = flax.core.Fieldmap()
    fm.language = 'en'            # stem for English
    fm.setfield('mytext', False)  # freetext field
    fm.setfield('mydate', True)   # filter field
    fm.save(db)

    # Build one document, add it and flush it to disk
    doc = fm.document()
    doc.index('mytext', "I don't like spam.")
    doc.index('mydate', datetime(2010, 2, 3, 12, 0))
    fm.add_document(db, doc)
    db.flush()
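The pre-processing step mentioned on the indexing slide could look something like the sketch below. It is only an illustration using the Python standard library: the `Article` record, the `preprocess` function and every field name in the raw feed are assumptions made for this example, not part of Flax or Xapian.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Article:
    """Hypothetical normalised record handed on to the index script."""
    title: str
    byline: str
    source: str
    text: str
    published: datetime
    embargo_until: Optional[datetime] = None


def preprocess(raw: dict) -> Article:
    """Clean one raw feed record; all field names here are illustrative."""
    def clean(value):
        # Collapse runs of whitespace left over from publishing systems.
        return " ".join(str(value or "").split())

    embargo = raw.get("embargo_until")
    return Article(
        title=clean(raw.get("title")),
        byline=clean(raw.get("byline")) or "Unknown",
        source=clean(raw.get("source")),
        text=clean(raw.get("text")),
        published=datetime.fromisoformat(raw["published"]),
        embargo_until=datetime.fromisoformat(embargo) if embargo else None,
    )
```

Keeping the embargo date on the record at this stage means it can be indexed as a filter field alongside the other metadata.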
So how do we build news search?
Searching
- Free text with Boolean operators
- Filters for metadata & date ranges
- Combine date and relevance ranking
- Faceted search where appropriate
- Saved searches & alerting
- 'More like this'
- Content restriction & embargo filters
Solution
- Template-based user interface scripts, again using open source libraries
- Beware Javascript & older browsers!
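One possible way to combine date and relevance ranking, together with an embargo filter, is sketched below. This is an assumption-laden illustration, not how Flax or Xapian does it: the half-life decay, the `date_weight` parameter and the dictionary keys on each hit are all invented for the example.

```python
from datetime import datetime, timedelta


def blended_score(relevance, published, now, half_life_days=7.0, date_weight=0.5):
    """Mix a relevance score in [0, 1] with an exponentially decaying recency score."""
    age_days = max((now - published).total_seconds() / 86400.0, 0.0)
    recency = 0.5 ** (age_days / half_life_days)  # 1.0 when new, 0.5 after one half-life
    return (1.0 - date_weight) * relevance + date_weight * recency


def rank(hits, now):
    """Drop hits still under embargo, then sort the rest by blended score, best first."""
    visible = [
        h for h in hits
        if h.get("embargo_until") is None or h["embargo_until"] <= now
    ]
    return sorted(
        visible,
        key=lambda h: blended_score(h["relevance"], h["published"], now),
        reverse=True,
    )
```

Setting `date_weight` to 0 gives pure relevance ranking and 1 gives pure date ranking, so the balance can be tuned per project.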
So how do we build news search?
Administration
- Indexing failures common
- Logging is essential
- Log to text as a first pass, reports later
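"Log to text as a first pass" might be sketched with Python's standard `logging` module as below. The function names, the log format and the `index_one` callback are hypothetical; the point is only that a failed article is recorded in a plain-text file and the batch carries on.

```python
import logging


def make_index_logger(log_path):
    """Return a logger that appends plain-text lines to log_path."""
    logger = logging.getLogger("indexer.%s" % log_path)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding a second handler on repeat calls
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def index_batch(articles, index_one, logger):
    """Index each (id, payload) pair, logging failures instead of aborting."""
    indexed, failed = 0, 0
    for article_id, payload in articles:
        try:
            index_one(payload)
            indexed += 1
        except Exception:
            failed += 1
            # logger.exception also writes the traceback to the log file.
            logger.exception("failed to index %s", article_id)
    logger.info("batch done: %d indexed, %d failed", indexed, failed)
    return indexed, failed
```

Reports can be built later by parsing the text file, which keeps the indexing script itself simple.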
Scalability
- Content is always growing
- Both indexing & searching must scale
- Open source search libraries provide distributed indexing, replication, remote indexes
- Not simple to get this right!
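As a rough illustration of searching across distributed indexes, the sketch below merges ranked hit lists returned by several index shards. It uses only the standard library; the `(score, doc_id)` tuple layout and the function name are assumptions, and it presumes scores from different shards are comparable, which is one of the things that is not simple to get right.

```python
import heapq
from itertools import islice


def merge_shard_results(shard_results, limit=10):
    """Merge per-shard hit lists into one top-`limit` list.

    Each shard list holds (score, doc_id) tuples already sorted by
    descending score, as a search library would return them.
    """
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))
```

Because each shard's list is already sorted, the merge is lazy and only reads as many hits as the caller asks for.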
So how do we build news search?
Available open source technologies
- Languages – C/C++, Java, Python, Javascript
- Search libraries – Xapian, Lucene
- Search bindings/servers – Xappy, Flax.core, Solr
- External libraries – pyparsing, CherryPy, xmllib, mxODBC, ...
- Presentation & UI – HTMLTemplate, MochiKit, JQuery, Yahoo! User Interface (YUI), ...
We can use whatever works!
Some examples
Newspaper Licensing Agency – NLA Clipshare
- 20 million newspaper stories
- 6500 users
- Content from every major newspaper (and most regionals)
- Used by journalists, clippings agencies, media monitors
- Replacing internal systems at major newspapers
- One of very few ways to search content from all the papers within hours of publication
http://www.nla-clipshare.com
Some examples
Financial Times – press cuttings
- Web Service for easy integration
- XML source data
- Faceted search
- Area filters (whole article, body, headline, byline or any combination)
- Synonyms, spelling suggestions
- Built from scratch in a fortnight
- Designed as a prototype, scaled to production use without significant change
http://presscuttings.ft.com
A different task – news monitoring
- Non-traditional use of search
- Many automated searches on incoming content
- Searches reflect complex client needs
- False positives require human checking
- False negatives should never occur!
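The idea of running many stored searches against each incoming article can be sketched as follows. This is a deliberately simplified stand-in using plain Python sets: the profile layout (`must`, `any`, `never`) and the function names are assumptions for illustration, and a real monitoring system would use a full query language rather than bare term lists.

```python
import re


def tokens(text):
    """Lower-case word tokens; a real system would also stem and normalise."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def matches(profile, article_tokens):
    """True when all `must` terms, at least one `any` term (if any are given)
    and none of the `never` terms appear in the article."""
    must = set(profile.get("must", ()))
    any_of = set(profile.get("any", ()))
    never = set(profile.get("never", ()))
    return (must <= article_tokens
            and (not any_of or bool(any_of & article_tokens))
            and not (never & article_tokens))


def route(article_text, profiles):
    """Return the name of every stored profile the incoming article satisfies."""
    article_tokens = tokens(article_text)
    return [name for name, profile in profiles.items()
            if matches(profile, article_tokens)]
```

Here the article is the query and the stored profiles are the collection, which is the reverse of a traditional search.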
A different task – news monitoring: an example
Durrants Ltd.
- Thousands of client search profiles
- Hundreds of thousands of articles per day
- Complex publication hierarchy
- Established pipeline
Solution
- Flexible query language allows OCR errors, punctuation, fuzzy matching, weighting
- Supports features of previous engine
- Scalable master-slave architecture
Accuracy improved in some cases from 95% rejected to 95% accepted
Hardware budget 15% of previous system
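To give a feel for matching that tolerates OCR errors, here is a minimal sketch using `difflib` from the Python standard library. It is not the query language built for Durrants; the function name and the 0.85 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def fuzzy_contains(term, words, threshold=0.85):
    """True if any word is at least `threshold` similar to `term`.

    Similarity is difflib's ratio: 1.0 for identical strings,
    lower as more characters differ.
    """
    term = term.lower()
    return any(
        SequenceMatcher(None, term, word.lower()).ratio() >= threshold
        for word in words
    )
```

With this threshold a scanned "parllament" still matches the profile term "parliament", while an unrelated word such as "payment" does not. Lowering the threshold catches more OCR damage at the cost of more false positives for human checkers.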
Why open source?
- Flexible, extendable
- Powerful & scalable
- Lower cost
- Commercial support available as necessary
- Freedom to innovate
Looking to the future
- More and more content including social media
- Multiple delivery platforms
- Search-powered websites & applications
- 'No-SQL'
- Cloud
Search no longer a bolt-on, but a platform for innovation
Open source no longer an outsider, but the obvious choice
Thank you!
Questions?
[email protected]/blog
Twitter: @FlaxSearch
Photo source: http://www.flickr.com/photos/katerha/4259440136/