![Page 1: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/1.jpg)
IR: open source state
Dmitry Kan, AlphaSense, Insider Solutions
University of Helsinki, Information Retrieval and Search Engines course, Feb 21, 2017
![Page 2: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/2.jpg)
About me
● PhD in CS (Saint Petersburg State University), 2011
● Running a Search Engine team at AlphaSense since 2014
● Founded Insider Solutions in 2009: text analytics solutions + consulting
● Co-committer on the Luke project, a toolbox for Lucene indexes, since 2013
![Page 3: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/3.jpg)
What is AlphaSense
● Google for financial analysts
● Semantic research engine
● Edit, tag, annotate, share your data in a team
● Oracle, JP Morgan, Credit Suisse
● Engineering is 98% in Helsinki + 1% NYC + 1% India
● #1 fastest growing IT startup in Finland by Deloitte (2015)
www.alpha-sense.com
![Page 4: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/4.jpg)
Insider Solutions
● Founded 2009
● BigText Analytics APIs and on-premise solutions
○ Sentiment analysis: Russian, Chinese, English
○ Searchable trend extraction
● Consulting: startups and corporates
https://semanticanalyzer.info
![Page 5: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/5.jpg)
Outline
● Search engine architecture
● Open source search ecosystem
● Research directions for applied IR
![Page 6: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/6.jpg)
Search engine: building blocks
● Web crawler: Apache Nutch (based on Hadoop)
● Data ingestion pipeline: receiving, cleaning, data extraction
● SolrCloud OR Elasticsearch (both based on Lucene)
● Shards: storing the index on disk and/or in memory
![Page 7: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/7.jpg)
Lucene / Solr history timeline
![Page 8: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/8.jpg)
![Page 9: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/9.jpg)
[Figure: crawl workflow diagram with steps labeled "Inject URLs", "Create segments", "New URLs"]
![Page 10: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/10.jpg)
Search Engine Software Components
● Schema
● Query parser
● Scoring algorithm
● Snippet highlighter
● Index (on-disk or in-memory)
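To show how the components listed above fit together, here is a minimal single-field sketch in Python. It is not Lucene's implementation: `TinyIndex`, `analyze`, and the scoring weights are hypothetical names and choices made for illustration only. The "schema" is implicit (one text field), the query parser is a plain tokenizer, the scoring is a simplified TF-IDF, and the highlighter wraps matching words in `<b>` tags.

```python
import math
from collections import defaultdict

PUNCT = ".,!?;:"


def analyze(text):
    """Toy analyzer / query parser: split on whitespace, strip punctuation, lowercase."""
    tokens = (tok.strip(PUNCT).lower() for tok in text.split())
    return [tok for tok in tokens if tok]


class TinyIndex:
    """In-memory inverted index over a single text field (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.docs = {}                     # doc_id -> original text

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in analyze(text):
            self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1

    def search(self, query):
        """Return (doc_id, score) pairs, best first, using a simplified TF-IDF."""
        scores = defaultdict(float)
        n_docs = len(self.docs)
        for term in analyze(query):
            matches = self.postings.get(term, {})
            if not matches:
                continue
            idf = 1.0 + math.log(n_docs / (len(matches) + 1))
            for doc_id, tf in matches.items():
                scores[doc_id] += math.sqrt(tf) * idf
        # Sort by descending score, then by doc_id for a stable order.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

    def highlight(self, doc_id, query):
        """Toy snippet highlighter: wrap words that match a query term."""
        terms = set(analyze(query))
        words = self.docs[doc_id].split()
        return " ".join(
            f"<b>{w}</b>" if w.strip(PUNCT).lower() in terms else w for w in words
        )
```

For example, after adding a few documents, `idx.search("lucene")` returns the matching document ids with scores, and `idx.highlight(doc_id, "lucene")` returns the document text with the hit marked up.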
![Page 11: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/11.jpg)
Query analysis and suggestions
![Page 12: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/12.jpg)
British vs US English handling
![Page 13: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/13.jpg)
One shard of the index
![Page 14: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/14.jpg)
Content extraction
Apache Tika for parsing formats:
● HTML, XML
● Microsoft Office & iWork document formats
● Audio, image, video
● Source code
![Page 15: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/15.jpg)
Inspecting a Lucene index with Luke
Originally implemented by Andrzej Bialecki; since 2013 maintained by Dmitry Kan (Finland) and Tomoko Uchida (Japan)
● Perform index maintenance
● Prototype similarity functions
● Search for documents, reconstruct field values from the index
● Read index from HDFS (Hadoop’s distributed file system)
● Supports Apache Solr and Elasticsearch
![Page 16: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/16.jpg)
![Page 17: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/17.jpg)
Learning to rank: Solr
Contributed by Bloomberg
Machine-learned model for reranking documents based on user feedback
Trained on features such as: views, popularity, whether the hit was in the title, document length, whether the document can be viewed on a mobile device
Models: LambdaMART, RankSVM
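The reranking idea can be sketched in a few lines of Python. This is not the Solr plugin's API and not LambdaMART; it is a plain linear scoring function (roughly the form a linear ranker such as RankSVM learns), applied to the top results of a first-pass query. The `Candidate` fields mirror the features named on the slide, while the weight values and the names `WEIGHTS`, `model_score`, and `rerank` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: str
    views: int
    popularity: float
    hit_in_title: bool
    length: int
    mobile_viewable: bool


# Hypothetical weights, standing in for the output of offline training.
WEIGHTS = {
    "views": 0.001,
    "popularity": 1.0,
    "hit_in_title": 2.0,
    "length": -0.0001,
    "mobile_viewable": 0.5,
}


def model_score(c: Candidate) -> float:
    """Linear model: weighted sum of the candidate's feature values."""
    feature_values = {
        "views": float(c.views),
        "popularity": c.popularity,
        "hit_in_title": 1.0 if c.hit_in_title else 0.0,
        "length": float(c.length),
        "mobile_viewable": 1.0 if c.mobile_viewable else 0.0,
    }
    return sum(WEIGHTS[name] * value for name, value in feature_values.items())


def rerank(candidates, top_n):
    """Reorder only the first top_n results by model score; keep the tail as is."""
    head = sorted(candidates[:top_n], key=model_score, reverse=True)
    return head + list(candidates[top_n:])
```

Reranking only the top N keeps the learned model off the hot path for the full result set: the cheap first-pass ranking picks the candidates, and the model only reorders the head of the list.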
![Page 18: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/18.jpg)
Lucene scoring formula
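The slide itself shows the formula as an image. For reference, the practical scoring function described in the Lucene 4.0 `TFIDFSimilarity` Javadoc (cited in the references) has the following shape; the notation below is a transcription into LaTeX, with `boost(t)` standing for the query-time term boost.

```latex
\mathrm{score}(q, d) =
  \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^{2}
  \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t, d) \Big)

% Defaults in DefaultSimilarity:
\mathrm{tf}(t \in d) = \sqrt{\mathrm{freq}(t, d)}, \qquad
\mathrm{idf}(t) = 1 + \ln\!\left( \frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1} \right)
```

Here `coord` rewards documents that match more of the query terms, `queryNorm` makes scores comparable across queries, and `norm(t, d)` folds in index-time boosts and field length normalization.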
![Page 19: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/19.jpg)
![Page 20: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/20.jpg)
![Page 21: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/21.jpg)
![Page 22: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/22.jpg)
Feature: is person and executive?
![Page 23: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/23.jpg)
Feature: recency of the document
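One common way to turn document age into a numeric ranking feature is exponential decay. The sketch below is an assumption for illustration, not the formula used on the slide: the function name `recency_feature` and the 30-day half-life are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def recency_feature(published: datetime, now: datetime,
                    half_life_days: float = 30.0) -> float:
    """Map document age to (0, 1]: 1.0 for a brand-new document,
    halving every `half_life_days`. Future dates are clamped to age 0."""
    age_days = max((now - published).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)
```

With the default half-life, a document published 30 days ago gets a feature value of 0.5 and one published 60 days ago gets 0.25, so the value can be fed to a ranking model alongside the other features.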
![Page 24: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/24.jpg)
Features as signal of result importance
![Page 25: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/25.jpg)
Learnt model
![Page 26: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/26.jpg)
Word vectors with Lucene
Word2vec was released as open source by Google
Possible to train word2vec on a Lucene index: https://github.com/kojisekig/word2vec-lucene
● NO need to provide a text file besides the Lucene index
● NO need to normalize text: normalization is already done in the index, or the Analyzer does it for you during processing
● Use part of the index by specifying a filter query
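Once word vectors are trained, related terms are typically found by cosine similarity between vectors. The Python sketch below shows only that lookup step; it does not use the word2vec-lucene API, and the three-dimensional vectors in `TOY_VECTORS` are made-up values for illustration (real word2vec vectors usually have hundreds of dimensions).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(word, vectors, k=2):
    """Return the k words whose vectors are most similar to `word`'s vector."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Hypothetical toy vectors, not output of a trained model.
TOY_VECTORS = {
    "stock": [0.9, 0.1, 0.0],
    "share": [0.8, 0.2, 0.1],
    "bond": [0.6, 0.5, 0.0],
    "banana": [0.0, 0.1, 0.9],
}
```

In a search engine, the neighbours returned by `nearest` could be used for query expansion or for suggesting related terms.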
![Page 27: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/27.jpg)
![Page 28: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/28.jpg)
![Page 29: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/29.jpg)
![Page 30: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/30.jpg)
![Page 31: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/31.jpg)
![Page 32: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/32.jpg)
![Page 33: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/33.jpg)
Questions?
Reach me at:
Twitter: @dmitrykan
Quora: https://www.quora.com/profile/Dmitry-Kan
![Page 34: IR: Open source state](https://reader031.vdocuments.us/reader031/viewer/2022021920/58cfd9391a28ab13238b53af/html5/thumbnails/34.jpg)
References
1. Luke: https://github.com/DmitryKey/luke
2. My blog: http://dmitrykan.blogspot.fi/
3. Solr vs Elasticsearch (overview): https://sematext.com/blog/2015/01/30/solr-elasticsearch-comparison/
4. Solr vs Elasticsearch (in-depth): https://sematext.com/blog/2012/08/23/solr-vs-elasticsearch-part-1-overview/
5. Introduction to Apache Solr: http://www.slideshare.net/ChristosManios/introduction-to-apache-solr-54076189
6. Word2vec-lucene: https://github.com/kojisekig/word2vec-lucene
7. Apache Tika: https://tika.apache.org/
8. Apache Solr: http://lucene.apache.org/solr/
9. Elasticsearch: https://github.com/elastic/elasticsearch
10. Learning to rank in Solr (video): https://www.youtube.com/watch?v=M7BKwJoh96s
11. Learning to rank in Solr (slides): https://lucidworks.com/2016/08/17/learning-to-rank-solr/
12. Word2vec: https://en.wikipedia.org/wiki/Word2vec#Analysis
13. Lucene scoring formula: https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html