nutch homepage search engine

HOMEPAGE & SEARCH ENGINE 2008.12.08

Upload: kay-kim

Post on 24-Jan-2015

DESCRIPTION

Source: http://flyingbono.tistory.com/entry/Nutch-%EC%82%AC%EC%9A%A9%ED%95%B4%EB%B3%B4%EA%B8%B0

TRANSCRIPT

Page 1: Nutch Homepage Search Engine

HOMEPAGE & SEARCH ENGINE 2008.12.08

Page 2: Nutch Homepage Search Engine

2. About Cloud computing
3. Application Introduction
  - Nutch
  - Google App Engine
4. Presentation

Contents

Page 3: Nutch Homepage Search Engine

2. ABOUT CLOUD COMPUTING

Page 4: Nutch Homepage Search Engine

Cloud computing is Internet-based ("cloud") development and use of computer technology ("computing").

Cloud computing is a general concept that incorporates software as a service (SaaS), Web 2.0 and other recent, well-known technology trends, in which the common theme is reliance on the Internet for satisfying the computing needs of the users.

2. What is Cloud computing?

Page 5: Nutch Homepage Search Engine

3. APPLICATION INTRODUCTION

Page 6: Nutch Homepage Search Engine

Open source web-search software based on Lucene

Originally a sub-project of the Apache Lucene project, intended to make Lucene easier to use

Lucene Java:
• Apache's very well-known open source search engine library

3-1. What is ‘Nutch’?

Page 7: Nutch Homepage Search Engine

Transparency.
• Nutch is open source, so anyone can see how the ranking algorithms work.

Understanding.
• Nutch has been built using ideas from academia and industry
• for instance, core parts of Nutch are currently being re-implemented to use the MapReduce distributed processing model
• Nutch is attractive for researchers who want to try out new search algorithms, since it is so easy to extend.

3-1. What is Nutch?

Page 8: Nutch Homepage Search Engine

Extensibility.
• Nutch is very flexible

• it can be customized and incorporated into your application.

• For developers, Nutch is a great platform for adding search to heterogeneous collections of information, and being able to customize the search interface, or extend the out-of-the-box functionality through the plugin mechanism.

3-1. What is Nutch?

Page 9: Nutch Homepage Search Engine

Nutch divides naturally into two pieces:
• the crawler
• the searcher

Crawl
• Collects pages
• Builds an index of the pages
• The index acts as the bridge between Crawl and Search

Search
• Finds and displays the information that matches the user's request
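The split between the two pieces, with the index as the bridge, can be sketched in a few lines of Python. This is only a toy illustration and not Nutch or Lucene code: the PAGES data and the build_index and search functions are hypothetical names invented for this example, and the "index" is a plain in-memory dictionary.

```python
from collections import defaultdict

# Hypothetical crawl result: URL -> page text (illustrative data only).
PAGES = {
    "http://example.com/a": "nutch is an open source search engine",
    "http://example.com/b": "lucene is a search library",
}

def build_index(pages):
    """Crawler side: map each term to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """Searcher side: return the URLs that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return sorted(hits)

index = build_index(PAGES)
```

Calling search(index, "nutch search") returns only the first URL, because the searcher reads nothing but the index that the crawler side produced.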

3-1. What is Nutch?

Page 10: Nutch Homepage Search Engine

More detail about crawler

• The Nutch crawler system produces three key data structures:
  • The WebDB, containing the web graph of pages and links.
  • A set of segments, containing the raw data retrieved from the Web by the fetchers.
  • The merged index, created by indexing and de-duplicating parsed data from the segments.

3-1. What is Nutch?

Page 11: Nutch Homepage Search Engine

More detail about searcher

• The Nutch searcher looks for the index and the segments in the index and segments subdirectories of the directory defined in the searcher.dir property.

• The default value for searcher.dir is the current directory (.), which is where you started Tomcat.
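Since the searcher depends on finding those two subdirectories, a small check before starting Tomcat can save some confusion. The helper below is a minimal sketch and not part of Nutch; the function name missing_searcher_subdirs is hypothetical, and it only tests that the directories exist, not that their contents are valid.

```python
import os

def missing_searcher_subdirs(searcher_dir="."):
    """Return the expected subdirectories that are absent under searcher_dir."""
    expected = ("index", "segments")
    return [name for name in expected
            if not os.path.isdir(os.path.join(searcher_dir, name))]
```

An empty list means both subdirectories are present under the given directory.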

3-1. What is Nutch?

Page 12: Nutch Homepage Search Engine

1. Generate a list of URLs from the crawl db.
2. Fetch the listed URLs into a segment.
3. Parse the contents fetched into the segment.
4. Update the crawl db with the parsed data from the segment.
5. Invert the links found in the segments.
6. Build an index over the segment documents and their anchor text.

• These steps are repeated continuously
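The cycle above can be sketched as a toy loop in Python. This is not Nutch code: the WEB dictionary is a hypothetical link graph standing in for the real web, all function names are invented for this example, fetching and parsing are collapsed into one step, and the indexing step (6) is left out.

```python
# Hypothetical link graph standing in for the real web: URL -> outgoing links.
WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/c"],
    "http://example.com/b": [],
    "http://example.com/c": [],
}

def generate(crawl_db, top_n):
    """Step 1: pick up to top_n not-yet-fetched URLs from the crawl db."""
    return sorted(u for u, fetched in crawl_db.items() if not fetched)[:top_n]

def fetch_and_parse(urls):
    """Steps 2-3: 'fetch' each URL and extract its outlinks into a segment."""
    return {url: WEB.get(url, []) for url in urls}

def update_db(crawl_db, segment):
    """Step 4: mark fetched URLs and add newly discovered ones."""
    for url, outlinks in segment.items():
        crawl_db[url] = True
        for link in outlinks:
            crawl_db.setdefault(link, False)

def invert_links(segments):
    """Step 5: build a map from each URL to the pages that link to it."""
    inlinks = {}
    for segment in segments:
        for url, outlinks in segment.items():
            for link in outlinks:
                inlinks.setdefault(link, []).append(url)
    return inlinks

def crawl(seeds, depth, top_n):
    """Repeat generate/fetch/parse/update for up to depth rounds."""
    crawl_db = {url: False for url in seeds}
    segments = []
    for _ in range(depth):
        fetch_list = generate(crawl_db, top_n)
        if not fetch_list:
            break
        segment = fetch_and_parse(fetch_list)
        update_db(crawl_db, segment)
        segments.append(segment)
    return crawl_db, segments, invert_links(segments)
```

Each round produces one segment, so a crawl of depth 3 from a single seed can reach pages up to three fetch rounds away from it.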

3-1. What is Nutch?

Page 13: Nutch Homepage Search Engine

How to run Nutch

• Start crawling from the directory where Nutch is installed
>> bin/nutch crawl urls -dir crawl -depth 3 -topN 10

• Start Tomcat 5.5
  • Note: Tomcat must be started from the Nutch directory
>> /opt/apache-tomcat-5.5.27/bin/catalina.sh start

• Then open http://localhost:8080/en/
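If the crawl needs to be started from a script, the command line can be assembled programmatically. The sketch below assumes the Nutch 0.9 argument order (crawl, then the URL directory, then -dir, -depth and -topN); build_crawl_command, run_crawl and the nutch_home parameter are hypothetical names introduced for this example.

```python
import subprocess

def build_crawl_command(url_dir="urls", crawl_dir="crawl", depth=3, top_n=10):
    """Return the argument list for a one-shot Nutch crawl."""
    return ["bin/nutch", "crawl", url_dir,
            "-dir", crawl_dir,
            "-depth", str(depth),
            "-topN", str(top_n)]

def run_crawl(nutch_home, **options):
    """Run the crawl from the Nutch install directory.

    nutch_home is a hypothetical path to the Nutch installation. On POSIX
    systems the relative bin/nutch path is resolved against cwd.
    """
    return subprocess.run(build_crawl_command(**options),
                          cwd=nutch_home, check=True)
```

Keeping command construction separate from execution makes it easy to inspect the arguments without launching a crawl.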

3-1. What is Nutch?

Page 14: Nutch Homepage Search Engine

• Nutch 0.9 from the Apache Nutch homepage
• Java JDK 6
• Tomcat 5.5 or later
• OS: Linux Server Edition
• Cygwin for Windows developers

3-2. Development environment of Nutch

Page 15: Nutch Homepage Search Engine

• Google's cloud computing project
• Google's web application platform
• Easy to build, easy to maintain, and easy to scale as your traffic and data storage needs grow
• No servers to maintain: with App Engine, you just upload your application, and it's ready to serve your users.

3-3. What is ‘Google App Engine’?

Page 16: Nutch Homepage Search Engine

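Features provided by Google App Engine

• The standard functionality of Python
  • because it is built on Python
• A robust Datastore backed by BigTable/GFS technology
  • A database built by Google, comparable to existing ones such as Oracle and MySQL
• Scalable hosting space
• Free ‘Google’ account
• Local development and testing with the SDK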

3-3. What is ‘Google App Engine’?

Page 17: Nutch Homepage Search Engine

Google's motto: “Web Development that doesn't hurt”

With Google App Engine, web service developers gain another option: developing without the pain.

Google App Engine takes care of load balancing, automatic scaling, dynamic web serving and so on, so you can concentrate on developing the application.

This choice, however, comes with three constraints.
1. All code must be written in Python.
  • Perl support is currently under development
2. Usage quotas mean you may end up having to pay.
  • Free quota: 500MB of persistent storage and enough CPU and bandwidth for about 5 million page views a month
3. All data lives on Google's platform and ends up in Google's hands. An application tied to Google's platform will not be able to leave it easily. This third constraint is the most critical drawback of Google App Engine.
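To put the free quota in perspective, the monthly page-view figure can be converted into an average request rate. The sketch below assumes a 30-day month and evenly spread traffic; both are simplifying assumptions made for this example, not figures from Google.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def average_views_per_second(monthly_views, days_in_month=30):
    """Average page views per second for a given monthly total."""
    return monthly_views / (days_in_month * SECONDS_PER_DAY)

rate = average_views_per_second(5_000_000)
```

Under these assumptions, 5 million page views a month works out to a little under two page views per second on average.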

3-3. What is ‘Google App Engine’?

Page 18: Nutch Homepage Search Engine

How to run Google App Engine

• Change to the directory where Google App Engine is installed
• Commands for running Google App Engine
  • dev_appserver.py bono/ : for local testing
  • appcfg.py update bono/ : uploads to the web
    • You enter your ID and password for each upload
• Check the result
  • http://localhost:8080/
  • http://flyingbono.appspot.com
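The slides do not show what the bono/ application directory contains. As a rough idea of the kind of request handler such an application serves, here is a minimal sketch written as a plain WSGI callable using only the Python standard library. It deliberately does not use the App Engine SDK or its webapp framework, and the names application and serve_locally, as well as the response text, are assumptions made for this example.

```python
from wsgiref.simple_server import make_server

def application(environ, start_response):
    """Minimal WSGI callable: answer every request with a plain-text greeting."""
    body = b"Hello from bono\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]

def serve_locally(port=8080):
    """Serve the app on localhost with the standard-library test server.

    This only stands in for a local development server; it is not
    dev_appserver.py.
    """
    make_server("localhost", port, application).serve_forever()
```

Calling serve_locally() and opening http://localhost:8080/ would show the greeting, much as the local test step above shows the real application.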

3-3. What is Google App Engine?

Page 19: Nutch Homepage Search Engine

• Google App Engine, using the App Engine software development kit (SDK)
• Python 2.5
  • On Windows you need ActivePython
• OS: Windows, Mac OS X, Linux

3-4. The Development Environment

Page 20: Nutch Homepage Search Engine

4. PRESENTATION

Page 21: Nutch Homepage Search Engine

Nutch

Google App Engine + Nutch

Another example of using Google App Engine

4. Presentation