Search Bootstrapping How / Where to get started

Post on 19-Dec-2015



Crawling

• Start with Nutch
  – http://nutch.apache.org/

• Index directly to Solr
  – http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/

• Create a seed list from the DMOZ RDF
  – http://www.dmoz.org/rdf.html
  – http://wiki.apache.org/nutch/NutchTutorial
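
The seed-list step can be sketched in a few lines of Python. This is a minimal illustration, not Nutch's own tooling: it assumes the DMOZ content dump lists each page as an `ExternalPage` element with the URL in an `about` attribute, and the function name `seed_urls` and the inline `SAMPLE` data are hypothetical. The output is one URL per line, which is the plain-text shape a crawler seed file usually takes.

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up sample in the assumed shape of a DMOZ content dump.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns:d="http://purl.org/dc/elements/1.0/" xmlns="http://dmoz.org/rdf/">
  <Topic r:id="Top/Computers">
    <link r:resource="http://example.com/"/>
  </Topic>
  <ExternalPage about="http://example.com/">
    <d:Title>Example</d:Title>
    <topic>Top/Computers</topic>
  </ExternalPage>
  <ExternalPage about="http://example.org/docs">
    <d:Title>Docs</d:Title>
    <topic>Top/Computers</topic>
  </ExternalPage>
</RDF>
"""


def seed_urls(rdf_source, limit=None):
    """Yield page URLs from a DMOZ-style dump, parsing it as a stream."""
    count = 0
    for _event, elem in ET.iterparse(rdf_source, events=("end",)):
        # Tags may carry a namespace, e.g. "{http://dmoz.org/rdf/}ExternalPage".
        if elem.tag.rsplit("}", 1)[-1] == "ExternalPage":
            url = elem.get("about")
            if url and url.startswith(("http://", "https://")):
                yield url
                count += 1
                if limit is not None and count >= limit:
                    return
        # Drop the element's text, attributes, and children once handled.
        elem.clear()


if __name__ == "__main__":
    # Print a seed list, one URL per line; redirect to a file to save it.
    for url in seed_urls(io.BytesIO(SAMPLE), limit=1000):
        print(url)
```

For a real dump, pass an open binary file instead of `io.BytesIO(SAMPLE)`; the `limit` argument keeps a first crawl small.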


Understanding Content

• Entity Extraction
  – LingPipe http://alias-i.com/lingpipe/
  – OpenNLP http://incubator.apache.org/opennlp/

• Entity Identification / Taxonomies
  – Freebase http://www.freebase.com/
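
To make the two steps concrete, here is a toy sketch in Python: find entity mentions in text, then map each mention to an identifier in a taxonomy. It deliberately swaps in a plain dictionary lookup for the statistical models that LingPipe and OpenNLP provide, and it does not call Freebase. The `TAXONOMY` table, its identifiers, and the function name `find_entities` are all made up for illustration.

```python
import re

# Hypothetical mini-taxonomy: lowercase surface form -> (entity id, type).
TAXONOMY = {
    "apache solr": ("ent:solr", "Software"),
    "apache nutch": ("ent:nutch", "Software"),
    "new york": ("ent:nyc", "Location"),
}


def find_entities(text, taxonomy=TAXONOMY):
    """Return (start, end, surface, entity_id, type) for each taxonomy match."""
    # Try longer names first so a long name wins over a shorter one
    # that starts at the same position.
    names = sorted(taxonomy, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    results = []
    for match in pattern.finditer(text):
        surface = match.group(1)
        entity_id, entity_type = taxonomy[surface.lower()]
        results.append((match.start(), match.end(), surface, entity_id, entity_type))
    return results


if __name__ == "__main__":
    for entity in find_entities("We crawl with Apache Nutch and index into apache solr."):
        print(entity)
```

A dictionary lookup only finds names it already knows; a trained extractor is what lets you discover mentions that are not yet in the taxonomy.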


Some Additional Links

• Basic Web Page Parser
  – https://github.com/pjaol/Webcrawler

• Example of OpenNLP usage
  – https://github.com/pjaol/entity_extractor
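
As a rough idea of what a basic web page parser does, the sketch below pulls the title, the outgoing links, and the visible text out of an HTML string using only the Python standard library. It is not the code from the repositories linked above; the class name `PageParser` and the helper `parse_page` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collect the title, absolute link URLs, and visible text of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style bodies
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


def parse_page(html, base_url):
    """Return (title, links, text) for one HTML document."""
    parser = PageParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links, " ".join(parser.text_parts)
```

The extracted links can feed the crawl frontier, while the title and text are the fields you would typically send to the index or to entity extraction.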