lis512 lecture 12 conclusions

LIS512 lecture 12 conclusions

Thomas Krichel2010-12-22

knowledge organization• As I have already said, the course is not about

knowledge organization.• Instead it is about the organization of

information. • In fact it is only about library cataloging.• Library cataloging represents a very tiny

amount of organized information as used by people. It follows pre-web and even pre-computer standards.

information retrieval• is generally about the getting to information

that it is in some type of system.• Usually this is a computer system.• Most of the time it’s about what algorithms

the system needs to implement. • Recently information retrieval has become a

huge because of search engines.

information retrieval first• One may argue that one has to understand

information retrieval first. • Information retrieval in the sense that you

have to understand how people use information.

• Then one designs a system that helps people with the way they search.

search engines• Search engines such as Google have become

the way most people seek information.• Information found in search engines comes

mostly from the open web. • That means from web pages that can be

reached by browsing the web using links.

deep web• Library catalogs are mainly part of what is

referred to as the “deep web”. • This is information on the web that is not

available through browsing. • The information we have in our catalog is part

of it, although it really would not have to, we could also publish it in a non-deep web form.

• I don’t know why this has not been done for koha.

modern information organization• Since people use web search engine find

information, people running search engines de facto control access to information.

• But the search engine (usually) can only show what has been put on the web sites.

• Thus webmasters have a crucial role to play.

how do search engines work?• Initially web search engines worked a bit like

the search engine we have on koha.• That engine basically looks for the words of

the query in a record and shows those records that have them.

• Some people in the 1990, mainly advocated some form of cataloging for web pages.

metadata• The idea of metadata for the web is that we

can add some information to pages that give some more/other information than the text that is on it.

• It’s been largely discredited by the amount of effort required to do this and the lack of support by search engines (past and present) for it.

the big computer brain• If a computer can just simply suck in all the

information on the web, and build a search engine, then is there any need for any metadata?

• Hmm, the information that is on the web is already structured in some form. – html headers– image file header– mp3 header

metadata not dead• At it’s start Google bragged about it not using

any metadata at all.• The words you would search would be on the

pages.• Sometimes, however, words from the links to

the page are used.• Sometimes, special components of a page are

used.

metadata not dead• Users use search engine generally to reach a

certain site.• Once they are on the site, they like to follow

links around the site rather than doing more searching (I think).

• To organize a site, you need to have some form of organization of your material.

digital preservation• The big problem looming with digital

information is how to preserve it. • Some information may not be worth

preserving.• But any crap becomes valuable when it is old

enough, or when there is a lot of it. • One thing for sure, it can not be done without

organizing information.

back to cataloging• It cataloged material receives so little

attention, it is worth to spend a whole course on it?

• Yes, physical libraries still need it. We don’t know however, how long they will be around.

• On the other hand the cataloging rule bring in some thinking about structure information that can be of some use in other areas of information organization.

back to cataloging• Other areas of information organization are

less human organized.• A lot of them are – subject to research– owned by companies

• Therefore I have no problems with spending so much time on cataloging.

• But they way it is done is to relate cataloging rules closer to general principles.

the future• Webmasters will have more and more control

over what people can/will read.• Search engine builders will have more and

more control over what people can find.• While we will find it difficult to build search

engines, we all can build web sites.• Learn web site architecture and see

information organization in some other action.

web information retrieval• We can think of the web as a pile of documents called pages.

• Some pages are hard to index– PDF documents• some index, some don’t

– Pictures– Sound files

• But a majority of pages are written in HTML– easy to index– have a loose structure

Google uses the structure of HTML• Google finds the title of the page, i.e. the

contents of the <title> element. • Google analysis headings and large font sizes

and gives priority weight to terms found there.• Most importantly, Google uses the link

structure of the web to find important pages.

Google finds important pages• The idea is that the documents on the web

have different degrees of importance.• Google will show the most important pages

first.• The ideas is that more important pages are

likely to be more relevant to any query than non-important pages.

Google's monkey• Imagine that the web has P pages. Each page

has its own address (URL). • Imagine a monkey who sits at a terminal. He

follows links at random, but on rare occasions he gets bored and types in an address of a random page out of those P.

• Will the monkey visit all pages with equal probability?

page rank• Google page rank of a page is the probability

that the Google's monkey will visit the page. – The monkey will come frequently to pages that

have a lot of links to them.– Once he is there, he will likely go to a page that it

linked by one of the pages that an important page links to.

• The structure of all the links on the entire web reveals the importance of the page.

many page ranks• There is an infinite number of ways to

calculate the page rank depending on– how likely the monkey gets bored.– the probability of the monkey to visit each page.

• Potentially, there is a page rank for each user of the web. Google tries to observe users and may be associating personal page ranks.

notation• Assume that a monkey gets bored with

probability d. If bored, it will visit page p with probability π_p.

• For any page p, let o_p the number of outgoing links.

• Let l(p',p) be the number of links from page p' to page p.

page rank formula• The page rank for a page p is r_p = π_p d + (1-d) ∑ l(p',p) r_p' / o_p'• In words, it is likelihood that, if bored the

money goes to the page p plus the likelihood that he gets there from another page p'. The likelihood getting there from p' is the likelihood of being there, times the number of links between p' and p, divided by the number of outgoing links on p'.

example• Let there be a web of four pages A B C D• A links to B.• B links to C. • C links to A and D.• D links to A.• Let the probability to get bored be ¼ and

there be a ¼ chance to move to any page when bored.

page ranks• The following system calculates the ranks r_A = ¼ ¼ + ¾ (r_C / 2 + r_D) r_B = ¼ ¼ + ¾ r_A r_C = ¼ ¼ + ¾ r_B r_D = ¼ ¼ + ¾ r_C / 2• Since this is fairly complicated, Google uses an

iterative approximation to calculate the rank.• Note that the sum of all ranks is 1.

http://openlib.org/home/krichel

Thank you for your attention!

Please switch off machines b4 leaving!

lis512 lecture 12 conclusions

Documents

web search engines

organization of information

moreother information

organized information

information retrievalis

web pages

information retrieval

web sites