the original vision of nutch, 14 years later: building an open source search engine

61
The original vision of Nutch, 14 years later: Building an open source search engine Apache Big Data Europe 2016 [email protected] @sylvinus

Upload: sylvain-zimmer

Post on 17-Jan-2017

162 views

Category:

Technology


1 download

TRANSCRIPT

Page 1: The original vision of Nutch, 14 years later: Building an open source search engine

The original vision of Nutch, 14 years later: Building an open source search engine

Apache Big Data Europe 2016

[email protected] @sylvinus

Page 2: The original vision of Nutch, 14 years later: Building an open source search engine

/usr/bin/whoami

• Jamendo (Founder & CTO, 2004-2011)

• TEDxParis (Co-founder, 2009-2012)

• dotConferences (Founder, 2012-)

• Pricing Assistant (Co-founder & CTO, 2012-)

Page 3: The original vision of Nutch, 14 years later: Building an open source search engine

"The original motivation for the Nutch project was to provide a transparent alternative to the growing power of a

handful of private search services over most users’ view of the Web.

CommerceNet Labs Technical Report, Nov 2004

However, as Nutch has been adopted with greater enthusiasm by smaller organizations, the Nutch

Organization has de-emphasized operating a multi-billion-page index in the public interest."

Page 4: The original vision of Nutch, 14 years later: Building an open source search engine
Page 5: The original vision of Nutch, 14 years later: Building an open source search engine

again?

Page 6: The original vision of Nutch, 14 years later: Building an open source search engine
Page 7: The original vision of Nutch, 14 years later: Building an open source search engine
Page 8: The original vision of Nutch, 14 years later: Building an open source search engine
Page 9: The original vision of Nutch, 14 years later: Building an open source search engine
Page 10: The original vision of Nutch, 14 years later: Building an open source search engine

transparency

reproducibility

Page 11: The original vision of Nutch, 14 years later: Building an open source search engine
Page 12: The original vision of Nutch, 14 years later: Building an open source search engine

https://uidemo.commonsearch.org

Page 13: The original vision of Nutch, 14 years later: Building an open source search engine

https://explain.commonsearch.org/?q=python&g=en

Page 14: The original vision of Nutch, 14 years later: Building an open source search engine

Agenda

• Values & tech choices

• Search engine components

• Challenges

• Opportunities

Page 15: The original vision of Nutch, 14 years later: Building an open source search engine

Values & tech choices

Page 16: The original vision of Nutch, 14 years later: Building an open source search engine
Page 17: The original vision of Nutch, 14 years later: Building an open source search engine

Radical transparency

• Open source (Apache License v2)

• Open data

• (Governance)

Page 18: The original vision of Nutch, 14 years later: Building an open source search engine

Privacy

• Results can be tailored by language/country, but NOT by user/cookie/sessionid

• \o/ Cache everything!

• Tor service: http://comsearchl2zlnre.onion

Page 19: The original vision of Nutch, 14 years later: Building an open source search engine

Participation & Pragmatism

• Use high-level languages as much as possible (Python, Go)

• Embrace active communities (Spark, Elasticsearch)

• Use mainstream participation platforms, even if they are nonfree (GitHub, Slack)

Page 20: The original vision of Nutch, 14 years later: Building an open source search engine

Search engines

Page 21: The original vision of Nutch, 14 years later: Building an open source search engine

http://infolab.stanford.edu/~backrub/google.htmlThe Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)

Crawler

Indexer

Database

SearcherRanker

Page 22: The original vision of Nutch, 14 years later: Building an open source search engine

Crawler

Page 23: The original vision of Nutch, 14 years later: Building an open source search engine

http://commoncrawl.org

Page 24: The original vision of Nutch, 14 years later: Building an open source search engine

Today at 3:30pm!

Page 25: The original vision of Nutch, 14 years later: Building an open source search engine

http://scrapy.org

Page 26: The original vision of Nutch, 14 years later: Building an open source search engine

http://github.com/cocrawler/cocrawler

Page 27: The original vision of Nutch, 14 years later: Building an open source search engine

Indexer

Page 28: The original vision of Nutch, 14 years later: Building an open source search engine

Specs

• HTML parsing & analysis

• Tokenization / NLP

• Static rankings

• Language detection

• I/O from crawls to databases

Page 29: The original vision of Nutch, 14 years later: Building an open source search engine
Page 30: The original vision of Nutch, 14 years later: Building an open source search engine

Common Search Pipeline

Doc sourcesCommon Crawl,

WARC files, URLs ...

Filter plugins

Document parsing

Output plugins

Data outputDatabase, file, HDFS, S3, ...

Page 31: The original vision of Nutch, 14 years later: Building an open source search engine

HTML parsers

• BeautifulSoup & friends

• lxml

• html5lib

• Gumbo!

Page 32: The original vision of Nutch, 14 years later: Building an open source search engine

https://github.com/google/gumbo-parser

Page 33: The original vision of Nutch, 14 years later: Building an open source search engine

Gumbocy

• Use Cython instead of ctypes

• Smaller API

• Tree traversal on the Cython side with basic boilerplate/visibility support

https://github.com/commonsearch/gumbocy

Page 34: The original vision of Nutch, 14 years later: Building an open source search engine

https://github.com/commonsearch/urlparse4

Page 35: The original vision of Nutch, 14 years later: Building an open source search engine

Database(s)

Page 36: The original vision of Nutch, 14 years later: Building an open source search engine
Page 37: The original vision of Nutch, 14 years later: Building an open source search engine

http://lucene.apache.org/

Page 38: The original vision of Nutch, 14 years later: Building an open source search engine
Page 39: The original vision of Nutch, 14 years later: Building an open source search engine

Ranker

Page 40: The original vision of Nutch, 14 years later: Building an open source search engine

Ranking formula

rank = f( static_score , dynamic_score( query ) )

Alexa DMOZ

Blacklists PageRank

...

ElasticSearch & Lucene TF-IDF BM25

...

Page 41: The original vision of Nutch, 14 years later: Building an open source search engine
Page 42: The original vision of Nutch, 14 years later: Building an open source search engine

https://about.commonsearch.org/developer/get-started

Page 43: The original vision of Nutch, 14 years later: Building an open source search engine

Today @ 4:30pm ;-)

Page 44: The original vision of Nutch, 14 years later: Building an open source search engine

Searcher / Frontend

Page 45: The original vision of Nutch, 14 years later: Building an open source search engine

Specs

• Send user query to databases

• Search-as-you-type

• HTML & JSON endpoints

• High performance

Page 46: The original vision of Nutch, 14 years later: Building an open source search engine
Page 47: The original vision of Nutch, 14 years later: Building an open source search engine

https://github.com/commonsearch/cosr-front

Page 48: The original vision of Nutch, 14 years later: Building an open source search engine

http://infolab.stanford.edu/~backrub/google.htmlThe Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)

Crawler

Parser

Index

SearcherRanker

Page 49: The original vision of Nutch, 14 years later: Building an open source search engine

Challenges

Page 50: The original vision of Nutch, 14 years later: Building an open source search engine

Funding / Scale

• Frugalism

• Caching

• In-kind services

• Individual donations / Foundation grants

• General economic incentives

Page 51: The original vision of Nutch, 14 years later: Building an open source search engine

Spam

• Email spam

• Wikipedia vandalism

• Algorithm complexity & scale

• Given enough eyeballs, all spam is shallow?

Page 52: The original vision of Nutch, 14 years later: Building an open source search engine

Relevance

• Exhaustivity

• Rescoring

• Evaluation

• More at 4:30pm ;-)

Page 53: The original vision of Nutch, 14 years later: Building an open source search engine

More search dimensions

• Realtime search

• Local search

• Universal search

Page 54: The original vision of Nutch, 14 years later: Building an open source search engine

Semantic search

• Wikidata

• YAGO

• Conversational / Voice search

Page 55: The original vision of Nutch, 14 years later: Building an open source search engine

Outreach

• Easy onboarding & docs

• Making people care believe

Page 56: The original vision of Nutch, 14 years later: Building an open source search engine

Opportunities

Page 57: The original vision of Nutch, 14 years later: Building an open source search engine

Decentralization

• YaCy

• Extremely high technical & social cost!

• Transparency?

Page 58: The original vision of Nutch, 14 years later: Building an open source search engine

Research

• More people should know how to build search engines

• Spam, Relevance, Large-scale data processing

• We need more open datasets!

Page 59: The original vision of Nutch, 14 years later: Building an open source search engine

https://about.commonsearch.org/blog/

Page 60: The original vision of Nutch, 14 years later: Building an open source search engine

Make the Web a better place!

• SEO

• Transparency

• Influence of money

• Public service

Page 61: The original vision of Nutch, 14 years later: Building an open source search engine

Questions?https://about.commonsearch.org/contributing

https://github.com/commonsearch [email protected]

slack.commonsearch.org