November 16-18, 2016 • Seville, Spain

Building a Search Engine for the Cuban Web

Jorge Luis Betancourt

Search/Crawl Engineer

Who am I

Jorge Luis Betancourt González

Search/Crawl Engineer

Apache Nutch Committer & PMC

Apache Solr/ES enthusiast

Agenda

• Introduction & motivation

• Technologies used

• Customizations

• Conclusions and future work

Introduction / Motivation

Cuba

Internet Intranet

Global search engines can’t access documents hosted on the Cuban intranet

Writing your own web search engine from scratch?

or …

Common search engine features

Web search: HTML & documents (PDF, DOC)

• highlighting

• filters (facets)

• suggestions

• autocorrection

Image search (size, format, color, objects)

• thumbnails

• filters (facets)

• show metadata

• match text with images

News search (alerting, notifications)

• near real time

• email, push, SMS

How to fulfill these requirements?

At its core, a search engine stores some information and retrieves it when a question is received (store, then query).
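The store/query idea above can be illustrated with a toy inverted index. This is only a minimal sketch to make the concept concrete: the class name `TinyIndex` and the sample documents are invented for illustration, and it bears no relation to how Lucene or Solr are actually implemented.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each term to the ids of the documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.docs = {}                    # document id -> original text

    def store(self, doc_id, text):
        """Keep the document and register each of its terms."""
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def query(self, question):
        """Return the ids of the documents that contain every term of the question."""
        terms = question.lower().split()
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)


index = TinyIndex()
index.store(1, "Havana news portal")
index.store(2, "Cuban university portal")
print(index.query("portal"))        # [1, 2]
print(index.query("cuban portal"))  # [2]
```

A real engine adds analysis (tokenizing, stemming), ranking and persistence on top of this basic structure.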

Open Source to the rescue …

• Index server

• Crawler

• Web interface

Apache Nutch

“Nutch is a well matured, production ready Web crawler. Enables fine grained configuration, relying on Apache Hadoop™ data structures, which are great for batch processing.”

Apache Nutch

• Highly scalable

• Highly extensible

• Pluggable parsing, protocols, storage, indexing, scoring

• Active community

• Apache License

Apache Solr

• Total downloads: 8M+

• Monthly downloads: 250,000+

• Apache License

• Highly modular

• Based on Lucene

• Great community

• Stability / Scalability

• Battle tested

Back to the list of features

Image search and thumbnails

• Custom parser & indexer: store the image thumbnail

• Custom parser, indexer & scoring: identify and store the text related to an image

[Diagram: an img element next to p and h1 elements]
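One way to picture "identify the text related to an image" is to collect, for each img element, its alt text plus the nearest heading and paragraph. The sketch below uses Python's standard `html.parser` purely as an illustration. The actual customization in the talk is a Nutch parser/indexer plugin, and the heuristic here (closest preceding h1, first paragraph that closes after the image) is an assumption, not the author's algorithm.

```python
from html.parser import HTMLParser


class ImageContextParser(HTMLParser):
    """For each <img>, collect alt text, the closest preceding <h1> and the next <p>."""

    def __init__(self):
        super().__init__()
        self.images = []         # one dict per image found
        self._current = None     # tag whose text is being collected ("h1" or "p")
        self._buffer = []
        self._last_heading = ""

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            self.images.append({
                "src": attrs.get("src") or "",
                "alt": attrs.get("alt") or "",
                "heading": self._last_heading,
                "paragraph": "",
            })
        elif tag in ("h1", "p"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        text = " ".join("".join(self._buffer).split())  # normalize whitespace
        if tag == "h1":
            self._last_heading = text
        else:
            # Attach this paragraph to every image still waiting for one.
            for image in self.images:
                if not image["paragraph"]:
                    image["paragraph"] = text
        self._current = None


page = """
<h1>Old Havana</h1>
<img src="/img/plaza.jpg" alt="Main square">
<p>A photo taken at the main square.</p>
"""
parser = ImageContextParser()
parser.feed(page)
print(parser.images[0])
```

The collected heading and paragraph text could then be indexed in separate fields so that each can be weighted differently at query time.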

How does it work?

[Diagram: three numbered steps over a page containing img, p and h1 elements]

News search (NRT & alerting)

Nutch is not really suited for this task: the batch nature of the Hadoop jobs doesn’t fit well in this scenario

Our topology

http://news-site.com → RSS → fetch → parse → index

RSS: parses the RSS feed and outputs the news links to be processed according to the SC protocol.

https://github.com/commoncrawl/news-crawl
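The RSS step of the topology can be sketched as a function that takes a feed and emits the links of its items. This is a standalone illustration using only the Python standard library, not code from news-crawl. The function name `extract_news_links` and the sample feed are made up for the example.

```python
import xml.etree.ElementTree as ET


def extract_news_links(rss_xml):
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


sample_feed = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>News site</title>
    <link>http://news-site.com</link>
    <item><title>First story</title><link>http://news-site.com/a</link></item>
    <item><title>Second story</title><link>http://news-site.com/b</link></item>
  </channel>
</rss>"""

links = extract_news_links(sample_feed)
print(links)  # ['http://news-site.com/a', 'http://news-site.com/b']
```

In a streaming topology, each extracted link would be emitted as soon as the feed is parsed, so that fetching, parsing and indexing can start without waiting for a batch to complete.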

monitor: flaxsearch/luwak
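The monitor step relies on stored queries: alerts are registered up front, and each incoming news document is matched against all of them. The toy below illustrates that idea only. It is not Luwak's API (Luwak is a Java library), and the class `QueryMonitor` and the alert ids are hypothetical.

```python
class QueryMonitor:
    """Toy stored-query matcher: a query matches when all its terms occur in the text."""

    def __init__(self):
        self.queries = {}  # query id -> set of required terms

    def register(self, query_id, terms):
        """Store a query once, before any document arrives."""
        self.queries[query_id] = {t.lower() for t in terms.split()}

    def match(self, text):
        """Return the ids of all stored queries satisfied by this document."""
        words = set(text.lower().split())
        return sorted(qid for qid, terms in self.queries.items() if terms <= words)


monitor = QueryMonitor()
monitor.register("alert-sports", "baseball")
monitor.register("alert-weather", "hurricane warning")

matched = monitor.match("Hurricane warning issued for the western provinces")
print(matched)  # ['alert-weather']
```

Each matched id would then trigger the corresponding notification (email, push or SMS).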

Querying the data

Apache Solr

• Solr has full support for highlighting (3 implementations)

• powerful faceting capabilities (even more in recent releases)

• autocorrection support based on the index content

• awesome scalability (SolrCloud, classic master-slave replication)
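As a rough illustration of how these features combine in a single request, the sketch below builds a Solr select URL with highlighting, faceting and spellchecking enabled. The parameter names (`hl`, `hl.fl`, `facet`, `facet.field`, `spellcheck`, `spellcheck.collate`) are standard Solr request parameters. The core name `web`, the field names `content`, `host` and `type`, and the helper `build_search_url` are assumptions made for the example, and spellchecking also requires the spellcheck component to be configured on the request handler.

```python
from urllib.parse import urlencode


def build_search_url(base_url, user_query, facet_fields, highlight_field):
    """Build a Solr /select URL asking for highlights, facets and spelling suggestions."""
    params = [
        ("q", user_query),
        ("wt", "json"),
        ("hl", "true"),                  # highlighting
        ("hl.fl", highlight_field),      # field to take snippets from
        ("facet", "true"),               # faceting (filters)
        ("spellcheck", "true"),          # autocorrection from index content
        ("spellcheck.collate", "true"),  # ask for a corrected version of the whole query
    ]
    params += [("facet.field", field) for field in facet_fields]
    return base_url + "/select?" + urlencode(params)


# Hypothetical core and field names, for illustration only.
url = build_search_url(
    "http://localhost:8983/solr/web",
    "universidad habana",
    facet_fields=["host", "type"],
    highlight_field="content",
)
print(url)
```

The facet counts in the response can drive the filter sidebar, while the collated spellcheck suggestion can back a "did you mean" link.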

The features, once again

Other features - monitoring

We needed a way of monitoring our infrastructure. Without a great Internet connection you can’t send GBs of logs to a cloud environment, so …

(and facets) analytical tool

(and logs)

(and metrics) time series store

(and logs) parsing & aggregation

Banana (Kibana port) for visualizations

Infrastructure

[Diagram: Solr master, crawlers (Nutch), Solr replicator and the web, connected by links labeled HTTP and JAVABIN]

Some usage stats

Fewer than 10,000 visits

Around 600 unique visitors

Future work

• Apply deep learning techniques to process the raw images and combine the results with the current approach

• Increase the number of signals that we get from our crawlers (correlate even more crawl-related events)

Thanks

Questions?

jorgelbg@apache.org

@jorgelbg
