INFO 2402 Assignment 2: Crawler


Upload: shahriar-rafee

Post on 08-Aug-2015




INFO 2402 – Assignment 2

Due Date: 4th March (Report)

Weightage: 10%

Maximum of 2 per group. Create a simple web crawler using any open-source search engine, e.g. Galago:

https://code.google.com/p/galagosearch/

Specification of your tasks for option 1:

o The seed should be passed as an argument to your web crawler program:

URL seed = new URL(args[0]);

o Open a connection to the document based on the seed:

URLConnection connection = seed.openConnection();
connection.connect();
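Taken together, the seed and connection steps can be sketched as a small runnable class. The openSeed helper and the 5-second timeouts are additions for illustration, not part of the assignment text:

```java
import java.net.URL;
import java.net.URLConnection;

public class SeedConnection {
    // Build the seed URL from a command-line argument and open a connection.
    // The timeouts are an assumption, added so the crawler fails fast on
    // unreachable hosts instead of hanging.
    static URLConnection openSeed(String arg) throws Exception {
        URL seed = new URL(arg);
        URLConnection connection = seed.openConnection();
        connection.setConnectTimeout(5000);
        connection.setReadTimeout(5000);
        return connection;
    }

    public static void main(String[] args) throws Exception {
        URLConnection connection = openSeed(args[0]);
        connection.connect();  // only this call actually contacts the server
        System.out.println("Content type: " + connection.getContentType());
    }
}
```

Note that openConnection() is lazy: no network traffic happens until connect() (or getInputStream()) is called.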

o Get an input stream that reads from this connection:

InputStream stream = connection.getInputStream();

o Read from the connection up to a maximum of 256 KB. The loop must also stop once the buffer is full; otherwise read() is called with a length of 0, which returns 0 rather than -1, and the loop never terminates for documents of 256 KB or more:

byte[] docData = new byte[256*1024];
int accumulateRead = 0;
int recentRead = 0;
while (recentRead >= 0 && accumulateRead < docData.length) {
    recentRead = stream.read(docData, accumulateRead, docData.length - accumulateRead);
    if (recentRead > 0) {
        accumulateRead += recentRead;
    }
}
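The read loop can be exercised offline by swapping the network stream for an in-memory one. The sketch below also caps the loop at the buffer size, since InputStream.read() returns 0 (not -1) when asked for 0 bytes; the readUpTo helper is my own wrapper, not part of the assignment text:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BoundedReader {
    // Read from the stream into docData until end-of-stream or the buffer
    // is full, returning the number of bytes actually read.
    static int readUpTo(InputStream stream, byte[] docData) throws IOException {
        int accumulateRead = 0;
        int recentRead = 0;
        while (recentRead >= 0 && accumulateRead < docData.length) {
            recentRead = stream.read(docData, accumulateRead,
                                     docData.length - accumulateRead);
            if (recentRead > 0) {
                accumulateRead += recentRead;
            }
        }
        return accumulateRead;
    }

    public static void main(String[] args) throws IOException {
        byte[] docData = new byte[256 * 1024];
        InputStream stream =
            new ByteArrayInputStream("<html>hello</html>".getBytes("UTF-8"));
        int n = readUpTo(stream, docData);
        String docString = new String(docData, 0, n, "UTF-8");
        System.out.println(n + " bytes: " + docString);
    }
}
```

A ByteArrayInputStream stands in for the URLConnection stream here, so the loop's termination behaviour can be checked without any network access.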

o Convert docData into a String:

String docString = new String(docData, 0, accumulateRead, "UTF-8");

o Use the TagTokenizer, Document, and Tag classes to break docString into tokens and search through the tags. These classes can be downloaded from www.galagosearch.org:

import org.galagosearch.core.parse.Document;
import org.galagosearch.core.parse.Tag;
import org.galagosearch.core.parse.TagTokenizer;

o Scan the tags for a <base> element to establish the base URL, then resolve each <a href> against it and print the results. The two scans must be separate loops (nesting them as written would redeclare the loop variable tag and fail to compile):

// First pass: a <base> tag, if present, overrides the base URL
URL baseUrl = seed;  // fall back to the seed when no <base> tag exists
for (Tag tag : document.tags) {
    if (tag.name.equals("base") && tag.attributes.containsKey("href")) {
        baseUrl = new URL(tag.attributes.get("href"));
    }
}

// Second pass: resolve every anchor's href against the base URL
for (Tag tag : document.tags) {
    if (tag.name.equals("a") && tag.attributes.containsKey("href")) {
        URL embeddedUrl = new URL(baseUrl, tag.attributes.get("href"));
        System.out.println(" Found: " + embeddedUrl.toString());
    }
}
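If Galago's TagTokenizer is not at hand while drafting the report, the link-resolution step can be illustrated with a simple regular expression as a stand-in. The LinkExtractor class and its regex are my own simplification for illustration, not Galago's API, and a regex is far less robust than a real HTML parser:

```java
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Crude stand-in for a tag tokenizer: pull double-quoted href values
    // out of <a> tags. Only for demonstrating URL resolution.
    private static final Pattern HREF =
        Pattern.compile("<a\\s+[^>]*href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static List<URL> extractLinks(String docString, URL baseUrl) throws Exception {
        List<URL> links = new ArrayList<>();
        Matcher m = HREF.matcher(docString);
        while (m.find()) {
            // new URL(base, spec) resolves relative hrefs against the base
            links.add(new URL(baseUrl, m.group(1)));
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        URL base = new URL("http://www.iium.edu.my/");
        String html = "<html><a href=\"/admissions\">x</a>"
                    + "<a href=\"http://example.org/\">y</a></html>";
        for (URL u : extractLinks(html, base)) {
            System.out.println(" Found: " + u);
        }
    }
}
```

The key point carried over from the assignment is the two-argument URL constructor, which turns a relative href like /admissions into an absolute URL using the base.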

You are required to write a simple report with explanations and screenshots recording the tasks described above. You may also include any problems faced during the project. Make sure you include the full source code.

You also need to include a short section in your report describing what you have learned from this project, relating it to the concepts introduced in class.

I have tested a simple web crawler; the screenshot in Fig. 1 below lists the URLs retrieved from the seed – www.iium.edu.my:

Figure 1: The Output of a Simple Web Crawler Using www.iium.edu.my as Its Seed