lis512 lecture 12 conclusions
DESCRIPTION
Thomas Krichel 2010-12-22. LIS512 lecture 12 conclusions. knowledge organization. As I have already said, the course is not about knowledge organization. Instead it is about the organization of information. In fact it is only about library cataloging. - PowerPoint PPT PresentationTRANSCRIPT
LIS512 lecture 12 conclusions
Thomas Krichel2010-12-22
knowledge organization• As I have already said, the course is not about
knowledge organization.• Instead it is about the organization of
information. • In fact it is only about library cataloging.• Library cataloging represents a very tiny
amount of organized information as used by people. It follows pre-web and even pre-computer standards.
information retrieval• is generally about the getting to information
that it is in some type of system.• Usually this is a computer system.• Most of the time it’s about what algorithms
the system needs to implement. • Recently information retrieval has become a
huge because of search engines.
information retrieval first• One may argue that one has to understand
information retrieval first. • Information retrieval in the sense that you
have to understand how people use information.
• Then one designs a system that helps people with the way they search.
search engines• Search engines such as Google have become
the way most people seek information.• Information found in search engines comes
mostly from the open web. • That means from web pages that can be
reached by browsing the web using links.
deep web• Library catalogs are mainly part of what is
referred to as the “deep web”. • This is information on the web that is not
available through browsing. • The information we have in our catalog is part
of it, although it really would not have to, we could also publish it in a non-deep web form.
• I don’t know why this has not been done for koha.
modern information organization• Since people use web search engine find
information, people running search engines de facto control access to information.
• But the search engine (usually) can only show what has been put on the web sites.
• Thus webmasters have a crucial role to play.
how do search engines work?• Initially web search engines worked a bit like
the search engine we have on koha.• That engine basically looks for the words of
the query in a record and shows those records that have them.
• Some people in the 1990, mainly advocated some form of cataloging for web pages.
metadata• The idea of metadata for the web is that we
can add some information to pages that give some more/other information than the text that is on it.
• It’s been largely discredited by the amount of effort required to do this and the lack of support by search engines (past and present) for it.
the big computer brain• If a computer can just simply suck in all the
information on the web, and build a search engine, then is there any need for any metadata?
• Hmm, the information that is on the web is already structured in some form. – html headers– image file header– mp3 header
metadata not dead• At it’s start Google bragged about it not using
any metadata at all.• The words you would search would be on the
pages.• Sometimes, however, words from the links to
the page are used.• Sometimes, special components of a page are
used.
metadata not dead• Users use search engine generally to reach a
certain site.• Once they are on the site, they like to follow
links around the site rather than doing more searching (I think).
• To organize a site, you need to have some form of organization of your material.
digital preservation• The big problem looming with digital
information is how to preserve it. • Some information may not be worth
preserving.• But any crap becomes valuable when it is old
enough, or when there is a lot of it. • One thing for sure, it can not be done without
organizing information.
back to cataloging• It cataloged material receives so little
attention, it is worth to spend a whole course on it?
• Yes, physical libraries still need it. We don’t know however, how long they will be around.
• On the other hand the cataloging rule bring in some thinking about structure information that can be of some use in other areas of information organization.
back to cataloging• Other areas of information organization are
less human organized.• A lot of them are – subject to research– owned by companies
• Therefore I have no problems with spending so much time on cataloging.
• But they way it is done is to relate cataloging rules closer to general principles.
the future• Webmasters will have more and more control
over what people can/will read.• Search engine builders will have more and
more control over what people can find.• While we will find it difficult to build search
engines, we all can build web sites.• Learn web site architecture and see
information organization in some other action.
web information retrieval• We can think of the web as a pile of documents called pages.
• Some pages are hard to index– PDF documents• some index, some don’t
– Pictures– Sound files
• But a majority of pages are written in HTML– easy to index– have a loose structure
Google uses the structure of HTML• Google finds the title of the page, i.e. the
contents of the <title> element. • Google analysis headings and large font sizes
and gives priority weight to terms found there.• Most importantly, Google uses the link
structure of the web to find important pages.
Google finds important pages• The idea is that the documents on the web
have different degrees of importance.• Google will show the most important pages
first.• The ideas is that more important pages are
likely to be more relevant to any query than non-important pages.
Google's monkey• Imagine that the web has P pages. Each page
has its own address (URL). • Imagine a monkey who sits at a terminal. He
follows links at random, but on rare occasions he gets bored and types in an address of a random page out of those P.
• Will the monkey visit all pages with equal probability?
page rank• Google page rank of a page is the probability
that the Google's monkey will visit the page. – The monkey will come frequently to pages that
have a lot of links to them.– Once he is there, he will likely go to a page that it
linked by one of the pages that an important page links to.
• The structure of all the links on the entire web reveals the importance of the page.
many page ranks• There is an infinite number of ways to
calculate the page rank depending on– how likely the monkey gets bored.– the probability of the monkey to visit each page.
• Potentially, there is a page rank for each user of the web. Google tries to observe users and may be associating personal page ranks.
notation• Assume that a monkey gets bored with
probability d. If bored, it will visit page p with probability π_p.
• For any page p, let o_p the number of outgoing links.
• Let l(p',p) be the number of links from page p' to page p.
page rank formula• The page rank for a page p is r_p = π_p d + (1-d) ∑ l(p',p) r_p' / o_p'• In words, it is likelihood that, if bored the
money goes to the page p plus the likelihood that he gets there from another page p'. The likelihood getting there from p' is the likelihood of being there, times the number of links between p' and p, divided by the number of outgoing links on p'.
example• Let there be a web of four pages A B C D• A links to B.• B links to C. • C links to A and D.• D links to A.• Let the probability to get bored be ¼ and
there be a ¼ chance to move to any page when bored.
page ranks• The following system calculates the ranks r_A = ¼ ¼ + ¾ (r_C / 2 + r_D) r_B = ¼ ¼ + ¾ r_A r_C = ¼ ¼ + ¾ r_B r_D = ¼ ¼ + ¾ r_C / 2• Since this is fairly complicated, Google uses an
iterative approximation to calculate the rank.• Note that the sum of all ranks is 1.
http://openlib.org/home/krichel
Thank you for your attention!
Please switch off machines b4 leaving!