LIS512 lecture 12 conclusions Thomas Krichel 2010-12-22


Page 1: LIS512 lecture 12  conclusions

LIS512 lecture 12 conclusions

Thomas Krichel, 2010-12-22

Page 2: LIS512 lecture 12  conclusions

knowledge organization
• As I have already said, the course is not about knowledge organization.
• Instead it is about the organization of information.
• In fact it is only about library cataloging.
• Library cataloging represents a very tiny amount of organized information as used by people. It follows pre-web and even pre-computer standards.

Page 3: LIS512 lecture 12  conclusions

information retrieval
• is generally about getting to information that is in some type of system.
• Usually this is a computer system.
• Most of the time it's about what algorithms the system needs to implement.
• Recently, information retrieval has become a huge field because of search engines.

Page 4: LIS512 lecture 12  conclusions

information retrieval first
• One may argue that one has to understand information retrieval first.
• Information retrieval in the sense that you have to understand how people use information.
• Then one designs a system that helps people with the way they search.

Page 5: LIS512 lecture 12  conclusions

search engines
• Search engines such as Google have become the way most people seek information.
• Information found in search engines comes mostly from the open web.
• That means from web pages that can be reached by browsing the web using links.

Page 6: LIS512 lecture 12  conclusions

deep web
• Library catalogs are mainly part of what is referred to as the "deep web".
• This is information on the web that is not available through browsing.
• The information we have in our catalog is part of it, although it really would not have to be; we could also publish it in a non-deep-web form.
• I don't know why this has not been done for Koha.

Page 7: LIS512 lecture 12  conclusions

modern information organization
• Since people use web search engines to find information, the people running search engines de facto control access to information.
• But a search engine (usually) can only show what has been put on web sites.
• Thus webmasters have a crucial role to play.

Page 8: LIS512 lecture 12  conclusions

how do search engines work?
• Initially, web search engines worked a bit like the search engine we have in Koha.
• That engine basically looks for the words of the query in a record and shows those records that have them.
• In the 1990s, some people advocated some form of cataloging for web pages.
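The word-matching approach described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not Koha's actual code; the records are made up.

```python
records = [
    "Introduction to cataloging and classification",
    "Information retrieval on the web",
    "The organization of information",
]

def search(query, records):
    """Return records that contain every word of the query (case-insensitive)."""
    words = query.lower().split()
    return [r for r in records if all(w in r.lower() for w in words)]

print(search("information organization", records))
# → ['The organization of information']
```

A record matches only if all query words occur in it, which is why the second record (containing "information" but not "organization") is not returned.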

Page 9: LIS512 lecture 12  conclusions

metadata
• The idea of metadata for the web is that we can add information to pages that gives more, or other, information than the text that is on them.
• It has been largely discredited by the amount of effort required to do this and by the lack of support for it by search engines (past and present).

Page 10: LIS512 lecture 12  conclusions

the big computer brain
• If a computer can simply suck in all the information on the web and build a search engine, is there any need for any metadata?
• Hmm, the information that is on the web is already structured in some form:
– HTML headers
– image file headers
– MP3 headers
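The file headers listed above can be illustrated with the "magic" bytes that open such files. The signatures below are real, standardized prefixes; the detect() helper is just a sketch for this lecture, not production file-type detection.

```python
# Well-known "magic number" prefixes that identify a file's format.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"ID3": "MP3 audio with an ID3 tag",
}

def detect(first_bytes):
    """Guess a file type from its opening bytes."""
    for magic, kind in SIGNATURES.items():
        if first_bytes.startswith(magic):
            return kind
    return "unknown"

print(detect(b"\x89PNG\r\n\x1a\n\x00\x00"))  # → PNG image
```

Even without any cataloger-supplied metadata, a crawler can recover this much structure from the raw bytes alone.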

Page 11: LIS512 lecture 12  conclusions

metadata not dead
• At its start, Google bragged about not using any metadata at all.
• The words you searched for would be found on the pages themselves.
• Sometimes, however, words from the links to the page are used.
• Sometimes, special components of a page are used.

Page 12: LIS512 lecture 12  conclusions

metadata not dead
• Users generally use a search engine to reach a certain site.
• Once they are on the site, they like to follow links around the site rather than doing more searching (I think).
• To organize a site, you need to have some form of organization of your material.

Page 13: LIS512 lecture 12  conclusions

digital preservation
• The big problem looming with digital information is how to preserve it.
• Some information may not be worth preserving.
• But any crap becomes valuable when it is old enough, or when there is a lot of it.
• One thing is for sure: it cannot be done without organizing information.

Page 14: LIS512 lecture 12  conclusions

back to cataloging
• If cataloged material receives so little attention, is it worth spending a whole course on it?
• Yes, physical libraries still need it. We don't know, however, how long they will be around.
• On the other hand, the cataloging rules bring in some thinking about structuring information that can be of some use in other areas of information organization.

Page 15: LIS512 lecture 12  conclusions

back to cataloging
• Other areas of information organization are less human-organized.
• A lot of them are
– subject to research
– owned by companies
• Therefore I have no problem with spending so much time on cataloging.
• But the way to do it is to relate cataloging rules more closely to general principles.

Page 16: LIS512 lecture 12  conclusions

the future
• Webmasters will have more and more control over what people can and will read.
• Search engine builders will have more and more control over what people can find.
• While we will find it difficult to build search engines, we can all build web sites.
• Learn web site architecture and see information organization in action in another setting.

Page 17: LIS512 lecture 12  conclusions

web information retrieval
• We can think of the web as a pile of documents called pages.
• Some pages are hard to index
– PDF documents (some engines index them, some don't)
– pictures
– sound files
• But a majority of pages are written in HTML
– easy to index
– have a loose structure

Page 18: LIS512 lecture 12  conclusions

Google uses the structure of HTML
• Google finds the title of the page, i.e. the contents of the <title> element.
• Google analyzes headings and large font sizes and gives priority weight to terms found there.
• Most importantly, Google uses the link structure of the web to find important pages.
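The idea of weighting terms by the HTML element they appear in can be sketched with Python's standard html.parser. The weights below are made-up numbers for illustration, not Google's; the point is only that <title> and heading text count for more than body text.

```python
from html.parser import HTMLParser

# Illustrative weights: terms in <title> and headings count more than body text.
WEIGHTS = {"title": 5, "h1": 3, "h2": 2}

class TermWeighter(HTMLParser):
    """Accumulate a weight for each word, based on the enclosing elements."""
    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags
        self.weights = {}  # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # drop tags up to and including the matching open tag
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        weight = max((WEIGHTS.get(t, 1) for t in self.stack), default=1)
        for word in data.lower().split():
            self.weights[word] = self.weights.get(word, 0) + weight

page = "<html><title>Cataloging</title><body><h1>Rules</h1>plain text</body></html>"
parser = TermWeighter()
parser.feed(page)
print(parser.weights)
# → {'cataloging': 5, 'rules': 3, 'plain': 1, 'text': 1}
```

A query term that matched "cataloging" here would score five times what a body-text match scores, which is the flavor of structural weighting the slide describes.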

Page 19: LIS512 lecture 12  conclusions

Google finds important pages
• The idea is that the documents on the web have different degrees of importance.
• Google will show the most important pages first.
• The idea is that more important pages are likely to be more relevant to any query than non-important pages.

Page 20: LIS512 lecture 12  conclusions

Google's monkey
• Imagine that the web has P pages. Each page has its own address (URL).
• Imagine a monkey who sits at a terminal. He follows links at random, but on rare occasions he gets bored and types in the address of a random page out of those P.
• Will the monkey visit all pages with equal probability?

Page 21: LIS512 lecture 12  conclusions

page rank
• The Google page rank of a page is the probability that Google's monkey will visit the page.
– The monkey will come frequently to pages that have a lot of links to them.
– Once he is on an important page, he is likely to go next to one of the pages that the important page links to.
• The structure of all the links on the entire web reveals the importance of a page.

Page 22: LIS512 lecture 12  conclusions

many page ranks
• There is an infinite number of ways to calculate the page rank, depending on
– how likely the monkey is to get bored.
– the probability of the monkey visiting each page when bored.
• Potentially, there is a page rank for each user of the web. Google tries to observe users and may be associating personal page ranks with them.

Page 23: LIS512 lecture 12  conclusions

notation
• Assume that the monkey gets bored with probability d. If bored, it will visit page p with probability π_p.
• For any page p, let o_p be the number of outgoing links.
• Let l(p',p) be the number of links from page p' to page p.

Page 24: LIS512 lecture 12  conclusions

page rank formula
• The page rank of a page p is
  r_p = π_p d + (1-d) ∑_{p'} l(p',p) r_{p'} / o_{p'}
• In words, it is the likelihood that, if bored, the monkey goes to page p, plus the likelihood that he gets there from another page p'. The likelihood of getting there from p' is the likelihood of being at p', times the number of links between p' and p, divided by the number of outgoing links on p'.

Page 25: LIS512 lecture 12  conclusions

example
• Let there be a web of four pages: A, B, C, D.
• A links to B.
• B links to C.
• C links to A and D.
• D links to A.
• Let the probability of getting bored be ¼, and let there be a ¼ chance of moving to any page when bored.

Page 26: LIS512 lecture 12  conclusions

page ranks
• The following system calculates the ranks:
  r_A = ¼·¼ + ¾ (r_C / 2 + r_D)
  r_B = ¼·¼ + ¾ r_A
  r_C = ¼·¼ + ¾ r_B
  r_D = ¼·¼ + ¾ r_C / 2
• Since solving this directly is fairly complicated, Google uses an iterative approximation to calculate the ranks.
• Note that the sum of all ranks is 1.

Page 27: LIS512 lecture 12  conclusions

http://openlib.org/home/krichel

Thank you for your attention!

Please switch off machines b4 leaving!