
Modern Information Retrieval Course, Semantic Web Research Laboratory

Information Filtering


Outline

- Introduction
- Information Filtering concept
- Previous work
- Filtering general features
- Filtering rules and attributes
- Types of filters
- Profiling and Filtering Technologies
- User-modeling techniques
- Conclusion


Introduction

The Internet and information overload

A vast amount of information of varying quality is disseminated.

There are lots of interesting things, but also lots of trash.

Filtering tools help people find the most valuable information.


Introduction

The goal of an information filtering system is to sort through large volumes of dynamically generated information and present to the user those items that are likely to satisfy his or her information requirement.


Introduction

In order to identify information that satisfies a user's information requirement or interest, an IF system needs to acquire an information filter that, when applied to an information item, evaluates whether the item is of interest or not.

The information filter represents the user's interests, identifying only those pieces of information that the user would find interesting.

The key question in designing an IF system is how to acquire such an information filter.



Information Filtering concept

Filtering information is not a new concept, nor is it one that is limited to electronic documents.

Information filtering occurs even when we read standard paper texts.

We only buy certain magazines, since other magazines may contain information that is redundant with or irrelevant to our interests.

With the increasing availability of information in electronic form, it becomes more important and feasible to have automatic methods to filter information.


An information filtering system can be described as an automatic mechanism that monitors a continuous flow of documents and selects those that are relevant to a particular user or group of users, according to their needs.

Filtering is based on descriptions of individual or group information preferences, often called profiles. Such profiles typically represent long-term interests.


These needs are represented through a profile of interests associated with the user or user group.

The ability to select relevant documents relies on information retrieval mechanisms that compute the similarity between the documents in the collection and the profiles.

Documents that are highly similar to the profile are considered important for the user or user group.
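
The similarity computation between documents and a profile can be sketched as follows. This is only a minimal illustration: the slides do not name a specific measure, so the bag-of-words vectors, the cosine similarity, the helper `select_relevant`, the threshold of 0.3, and the sample texts are all assumptions.

```python
import math
from collections import Counter


def term_vector(text):
    """Turn text into a bag-of-words vector (term -> count)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_relevant(profile_text, documents, threshold=0.3):
    """Return the documents whose similarity to the profile meets the threshold."""
    profile = term_vector(profile_text)
    return [d for d in documents
            if cosine_similarity(profile, term_vector(d)) >= threshold]


# Hypothetical profile and incoming document stream.
profile = "information filtering user profiles"
stream = [
    "new methods for information filtering with user profiles",
    "football results from the weekend",
]
selected = select_relevant(profile, stream)
```

With these sample texts only the first document shares terms with the profile, so it is the only one selected.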


Due to personal or professional reasons, a user’s interests may shift or change.

These changes may happen in a relatively short duration of time or over a long period of time.

The shifts can affect the user’s interests partially or fully.

To cope with this problem, it should be possible to reformulate the user’s profile.

This update is driven by feedback sent to the system about the relevance of the documents received.
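
One common way to reformulate a profile from relevance feedback is a Rocchio-style update, sketched below. The slides do not specify an update method, so the choice of Rocchio, the function `update_profile`, and the weights `alpha`, `beta`, and `gamma` are assumptions made for illustration.

```python
from collections import defaultdict


def update_profile(profile, relevant_docs, irrelevant_docs,
                   alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio-style update: move the profile toward relevant documents
    and away from irrelevant ones. All vectors are dicts term -> weight."""
    updated = defaultdict(float)
    for term, weight in profile.items():
        updated[term] += alpha * weight
    for doc in relevant_docs:
        for term, weight in doc.items():
            updated[term] += beta * weight / len(relevant_docs)
    for doc in irrelevant_docs:
        for term, weight in doc.items():
            updated[term] -= gamma * weight / len(irrelevant_docs)
    # Drop terms whose weight fell to zero or below.
    return {t: w for t, w in updated.items() if w > 0}


# Hypothetical feedback: one document judged relevant, one judged irrelevant.
profile = {"compilers": 1.0, "parsing": 1.0}
liked = [{"databases": 1.0, "indexing": 1.0}]
disliked = [{"parsing": 1.0, "football": 1.0}]
new_profile = update_profile(profile, liked, disliked)
```

In this example the profile gains the terms of the relevant document, the weight of "parsing" is reduced, and "football" never enters the profile. This models a partial shift of interest rather than a full one.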




One of the simplest methods of determining whether information matches a user's interests is through keyword matching.

If a user's interests are described by certain words, then information containing those words should be relevant.

This straightforward keyword matching often fails, however.

Inappropriate matches can arise because:
- The words people use do not unambiguously reflect the topic or content.
- A single word can have more than one meaning (e.g., chip).
- The same concept can be described by surprisingly many different words (e.g., human factors, ergonomics).
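
Both failure modes are easy to reproduce with a naive matcher. The sketch below is illustrative only; the function `keyword_match` and the sample texts are hypothetical.

```python
def keyword_match(profile_keywords, text):
    """Return True if any profile keyword occurs as a word in the text."""
    words = set(text.lower().split())
    return any(k.lower() in words for k in profile_keywords)


# A hypothetical user interested in computer hardware and ergonomics.
profile = {"chip", "ergonomics"}

# Polysemy: "chip" matches a text about food, which is a false positive.
false_positive = keyword_match(profile, "how to bake the perfect potato chip")

# Synonymy: a text on "human factors" is missed, which is a false negative.
false_negative = not keyword_match(profile, "a study of human factors in cockpit design")
```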


Furnas showed that two people use the same main word to describe an object only 10 to 20 percent of the time.

Bates has reported comparably poor agreement in the generation of search terms by trained intermediaries.


Previous work

Conventional information retrieval (IR) is very closely related to information filtering (IF).

Both aim to retrieve information relevant to what a user wants while minimizing the amount of irrelevant information retrieved.



Previous work

One of the earliest forms of electronic information filtering came from work on Selective Dissemination of Information (SDI).

SDI was designed as an automatic way of keeping scientists informed of new documents published in their areas of specialization.

The scientist could create and modify a user profile of keywords that described his or her interests.

SDI used the profile to match the keywords against new articles in order to predict which new articles would be most relevant to the scientist's interests.



Previous work

Allen conducted a series of experiments to explore user models in predicting preferences for news articles.

He predicted which articles a person would read based on previous articles read using a measure of overlap of nouns between the new and old articles.

While the predictions were better than chance, the average correlation between the predicted articles and the subjects' ratings of the articles was fairly low (r=0.44).



Previous work

The models were more successful at predicting user preferences for general categories of articles than for specific articles.

Predicting what news articles a person will read may be an especially difficult task.

News topics vary from day to day, making it difficult to get stable estimates of interest. In addition, external sources of news probably influenced what people read in the experiment.

We believe that users' interests for technical literature will be more stable over time.



Previous work

In Allen's research, the subject's past preferences were used to construct an implicit model for retrieving relevant articles.

A different approach is to let the user explicitly structure the information.

For example, the Information Lens system allows users to create rules to filter mail messages based on keyword matches in the mail fields.

Mail messages have some structure (e.g., sender, subject), and these rules can take advantage of it to perform user-specified actions on the messages.



Previous work

While a variety of information systems have been developed, there has been little systematic evaluation of what features are most effective for filtering.

This leaves many unanswered questions, such as:
- What are the most effective methods for matching a user's interests to the information available?
- How should a user's interests be described?
- How will the performance of filtering methods vary in different domains?



Filtering general features

An information filtering system is an information system designed for unstructured or semi-structured data.

This contrasts with a typical database application that involves very structured data, such as employee records.

The notion of structure being used here is not only that the data conforms to a format such as a record type description, but also that the fields of the records consist of simple data types with well-defined meanings.

Email messages are an example of semi-structured data in that they have well-defined header fields and an unstructured text body.




Information filtering systems deal primarily with textual information.

Unstructured data is often used as a synonym for textual data.

It is, however, more general than that and should include other types of data such as images, voice, and video that are part of multimedia information systems.

None of these data types are handled well by conventional database systems, and all have meanings that are difficult to represent.


Filtering systems involve large amounts of data.

Typical applications would deal with gigabytes of text, or much larger amounts of other media.

Filtering applications typically involve streams of incoming data, either being broadcast by remote sources (such as newswire services), or sent directly by other sources (email).

Filtering has also been used to describe the process of accessing and retrieving information from remote databases, in which case the incoming data is the result of the database searches.




Filtering is based on descriptions of individual or group information preferences, often called profiles. Such profiles typically represent long-term interests.

Filtering is often meant to imply the removal of data from an incoming stream, rather than finding data in that stream.

In the first case, the users of the system see what is left after the data is removed; in the latter case, they see the data that is extracted.

A common example of the first approach is an email filter designed to remove junk mail.

Profiles may express not only what people want, but also what they do not want.




Many of these features are virtually the same as those found in a variety of other text-based information systems.

Text routing, for example, involves sending relevant incoming data to individuals or groups. This process is essentially identical to filtering.

Categorization systems are designed to attach one or more predefined categories to incoming objects (this is done by newswire services, for example).

The major difference from filtering in this case is the static nature of the categories, when compared to profiles.



IF vs. IR

The entities and processes relevant to IF are almost identical to those that are relevant to IR.

The major differences appear to be:

IR is typically concerned with single uses of the system, by a person with a one-time goal and one-time query.

IF is concerned with repeated uses of the system, by a person or persons with long-term goals or interests.



IF vs. IR

IR recognizes inherent problems in the adequacy of queries as representations of information needs.

IF assumes that profiles can be correct specifications of information interests.

IR is concerned with the collection and organization of texts.

IF is concerned with the distribution of texts to groups or individuals.



IF vs. IR

IR is typically concerned with the selection of texts from a relatively static database.

IF is mainly concerned with selection or elimination of texts from a dynamic data stream.

IR is concerned with responding to the user’s interaction with texts within a single information-seeking episode.

IF is concerned with long-term changes over a series of information-seeking episodes.



IF vs. IR

In addition to these distinctions based on the models of IR and IF, there seem to be some other, contextual differences that might also be relevant to research interests.

These arise from differences in the social and/or practical situations with which IR and IF have been concerned.

These differences can be categorized according to whether they are associated with the texts, the users, or the general environment of concern to each.


IF vs. IR

Text-related issues: For IF, the timeliness of a text is often of overriding significance. For IR, this has typically not been the case.

User-related issues: IR has, by and large, studied well-defined user groups in well-defined, specific domains, largely in science and technology. IF, however, is often concerned with very undefined user communities.

Environmental issues: IF is highly concerned, in many situations, with issues of privacy. IR, for a variety of reasons, has paid almost no attention to this kind of problem.



Filtering using IR

In general, the idea for filtering is to create a space of documents, some of which have previously been judged by a user to be relevant to his or her interests.

If a new document is close to relevant documents in the space, then it would be considered likely to be interesting to the user.

In the comparisons that follow, the only difference between Latent Semantic Indexing (LSI) and the keyword matching methods is that LSI represents terms and documents in a reduced-dimensional space of derived indexing dimensions.



Filtering using IR

Foltz compared LSI and keyword vector matching for filtering of Netnews articles.

In an experiment, subjects rated Netnews articles as either relevant or not relevant to their interests.

The ratings from the initial 80% of the articles they read were used to predict the relevance of the remaining 20% of the articles for each person.

Foltz found that LSI filtering improved prediction performance over the keyword matching method by an average of 13% and showed a 26% improvement in precision.




Automatic vs. Social filtering

Automatic filtering is where the computer evaluates what is of value for you.

Social filtering (collaborative filtering) is where other people help you evaluate what is of most value to read, just as publishers and organizations did in society before the Internet.


Social filtering

Social filtering means that some kind of rating is assigned to documents.

The ratings can be compared to the stars (***) which newspapers often assign to films, books and other consumer products.

But the ratings can also include categorization into subject areas or according to particular scales.

Social filtering has some similarities to the filtering done by editors, journalists and publishers, since in both cases humans select the filtering attributes.


Social filtering

Why use social filtering? It is difficult to design automatic or intelligent filtering algorithms that can really evaluate the content of a document and judge its value. Humans are more capable of deciding the value of a document.

Who makes the ratings? Ratings for use in social filtering can be provided by:


Social filtering

Editors: people with the special task of doing such rating. An example is the people selecting which messages to put into services like Yahoo.

Readers: ordinary readers might input ratings on what they read, and these ratings might be collected and put into databases to help other people.

Authors: can provide certain kinds of ratings themselves.


Social filtering

The most successful social filtering system is Yahoo.

Yahoo employs humans to evaluate documents and puts the interesting ones into its structured information database.

This is very similar to what the publishers, editors, journalists and organizations did in the world before the Internet.


Social filtering

The simplest and most common form of filtering is organizing discussions into groups (newsgroups, mailing lists, forums, etc.).

Each group has a topic and wants only contributions within that topic.

Sometimes the right to submit contributions is restricted:
- Only members can submit.
- Competence control is done before accepting a new member.
- Special moderators must approve contributions before distribution.

The act of a recipient selecting which groups to subscribe to can thus be seen as an act of setting a personal filter.


Thread filtering

Another simple and common filtering method is to filter by thread.

A thread is a set of messages that directly or indirectly refer to each other.

People can use threads for filtering by specifying that they want to skip existing and future contributions in certain threads.

In Usenet News, this functionality is known under the term kill buffer.
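
Skipping whole threads can be sketched as follows. This is a simplified illustration, assuming each message carries the identifier of the message it replies to; the pair-based message format and the function names are hypothetical and do not reflect any particular news reader.

```python
def thread_root(message_id, parent_of):
    """Follow parent links up to the first message of the thread."""
    seen = set()
    while message_id in parent_of and message_id not in seen:
        seen.add(message_id)
        message_id = parent_of[message_id]
    return message_id


def filter_killed_threads(messages, killed_roots):
    """Keep only messages whose thread root is not in the kill list.

    Each message is a (message_id, parent_id_or_None) pair.
    """
    parent_of = {mid: parent for mid, parent in messages if parent is not None}
    return [mid for mid, _ in messages
            if thread_root(mid, parent_of) not in killed_roots]


messages = [
    ("m1", None),   # starts thread A
    ("m2", "m1"),   # reply in thread A
    ("m3", None),   # starts thread B
    ("m4", "m2"),   # indirect reply in thread A
    ("m5", "m3"),   # reply in thread B
]
visible = filter_killed_threads(messages, killed_roots={"m1"})
```

Because the kill list stores the thread root rather than individual messages, replies that arrive later in the same thread are skipped as well when the filter is run again.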



Thread filtering

In discussion groups, messages often belong to threads.

It may then not be possible to understand a single message without seeing other messages in the same thread.

A filter or search facility that selects only certain individual messages out of threads might then not satisfy its users.

The filter must either select several items in the thread, or at least make it very easy for users, when reading one selected message, to traverse the tree up and down from this message.


Filtering rules and attributes

Filtering is done by applying filtering rules to attributes of the documents to be filtered.

Filtering rules are often Boolean conditions. They are usually put in an ordered list, which is scanned for each item to be filtered.

The attributes of documents to be used in filtering are:
- words in the titles, abstracts, or the whole document
- automatic measurements of stylistic and language quality
- name of author
- ratings on the documents supplied by the author or by other people
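
An ordered list of Boolean rules over document attributes can be sketched as below. The attribute names, the folder names used as actions, and the sample documents are hypothetical; the point is only that the list is scanned in order and the first matching rule decides.

```python
def apply_rules(rules, document, default="inbox"):
    """Scan the ordered rule list and return the action of the first
    rule whose Boolean condition holds for the document."""
    for condition, action in rules:
        if condition(document):
            return action
    return default


# Each rule is (Boolean condition over document attributes, folder name).
rules = [
    (lambda d: d["author"] == "boss@example.org", "urgent"),
    (lambda d: "filtering" in d["title"].lower() and d["rating"] >= 4, "research"),
    (lambda d: d["rating"] <= 1, "trashcan"),
]

docs = [
    {"title": "Budget meeting", "author": "boss@example.org", "rating": 1},
    {"title": "Advances in Information Filtering", "author": "a@example.org", "rating": 5},
    {"title": "Win a prize", "author": "x@example.org", "rating": 1},
    {"title": "Lunch menu", "author": "c@example.org", "rating": 3},
]
folders = [apply_rules(rules, d) for d in docs]
```

The first document has a low rating but still lands in "urgent", because the author rule comes earlier in the list than the rating rule. Reordering the rules would change the outcome.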


Filtering can be done in servers or in clients.

This figure shows how a server can filter messages before downloading them to the client.

Advantage: Filtering can be done in the background.

Disadvantage: Communication between user and filtering system becomes more complex.




Alternatively, filters may be part of the client, and apply to sets of documents after they have been downloaded to the client.


Delivery of filtering results

The most common way of delivering filtering results is to sort documents into different folders.

Users choose to read new items one folder at a time.

The filter helps users read messages on the same topic at the same time.

The user can also set a personal priority for the order in which the folders are read.

Unwanted messages can be filtered to special “trashcan” folders.


Intelligent filtering

Intelligent filtering means the use of artificial intelligence (AI) methods to enhance filtering.

This can be done in different ways:
- to derive attributes for documents
- to derive filtering rules
- for the filtering process itself

With the machine learning approach, such filtering can be done in the background, with little or no interaction with the user.

It can also be done in a way where the user interacts with the filter and helps it understand why the user likes certain messages.


Filtering against spamming

Many people want filters that remove unsolicited direct marketing e-mail messages, so-called spam.

The filter has to recognize special properties of spam messages that distinguish them from other messages.

Examples of such properties are:
- A message does not have your name or e-mail address in the message heading, yet it does not come from any mailing list that you subscribe to.


Filtering against spamming

Examples of such properties are:
- The author or sender of a message has an invalid e-mail address.
- Certain words, such as “money” or “$$$”, appear in the subject. This is not very dependable; it has the same problem as all intelligent filtering.
- If you often get similar spam, you might be able to recognize special properties of it to use to stop further similar spam.
- The same message, with identical content, was sent to very many users.
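
The properties listed on these slides can be combined into a simple score, sketched below. Everything here is a hypothetical illustration: the message fields, the crude address check, the word list, and the threshold of 10 identical copies are assumptions, not a description of any real spam filter.

```python
def spam_score(message, my_address, subscribed_lists, seen_bodies):
    """Count how many of the listed spam properties a message shows."""
    score = 0
    recipients = [r.lower() for r in message["to"]]
    addressed_to_me = my_address.lower() in recipients
    from_my_list = any(r in subscribed_lists for r in recipients)
    # Property 1: not addressed to me, and not sent via a list I subscribe to.
    if not addressed_to_me and not from_my_list:
        score += 1
    # Property 2: sender address is malformed (a crude check, an assumption).
    sender = message["from"]
    if sender.count("@") != 1 or "." not in sender.split("@")[-1]:
        score += 1
    # Property 3: suspicious words in the subject (not very dependable).
    subject = message["subject"].lower()
    if any(w in subject for w in ("money", "$$$")):
        score += 1
    # Property 4: identical content already seen in many other messages.
    if seen_bodies.get(message["body"], 0) >= 10:
        score += 1
    return score


spam = {"to": ["undisclosed@example.net"], "from": "winner@@nowhere",
        "subject": "Make MONEY fast $$$", "body": "Click here"}
ham = {"to": ["me@example.org"], "from": "alice@example.org",
       "subject": "Meeting notes", "body": "See attached"}
seen = {"Click here": 250}  # body text -> number of copies observed
lists = {"ir-list@example.org"}
spam_points = spam_score(spam, "me@example.org", lists, seen)
ham_points = spam_score(ham, "me@example.org", lists, seen)
```

Scoring several weak signals together, rather than rejecting on any single one, reflects the point that individual properties such as subject words are not very dependable.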


Types of filters

Various types of filters:
- Content-based filters
- Collaborative filters
- Hybrid filters


Content-based Filters

A content-based filter makes use of the content of the information items to evaluate whether an item is interesting.

Profiles are either in the form of user-specified keywords or rules, and reflect the long-term interests of the user.

Often, the user would like the system to learn the user profile rather than being required to provide one.

This generally involves the application of Machine Learning (ML) techniques.

The user’s feedback can be acquired either implicitly, by observing the user, or explicitly, by asking the user to rate the information items seen.


Content-based Filters (cont.)

The two primary weaknesses of using ML techniques to learn a user profile are:
- Most techniques require large amounts of data.
- If a new information item is significantly different from anything seen (and hence labeled) by the user before, the learned profile cannot make an accurate prediction.

Content-based filters have been used successfully in various domains, including:
- Web browsing (Letizia and Syskill & Webert)
- News filtering (NewsWeeder, WebMate, and NewsDude)
- Email filtering (Re:Agent and EmailValet)


Collaborative Filters

Collaborative filters, also known as social filters, are often used in recommender systems.

A collaborative filter makes use of a database of user preferences to find users with similar interests, and predicts whether an unseen information item is likely to be of interest to you based on how those users have rated it.

A community of users has to continuously rate whether the information they have seen is interesting to them or not.

Generally this rating is on a scale (e.g., from 1, meaning “not interesting”, to 5, meaning “very interesting”).
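
A user-based prediction of this kind can be sketched as follows. The slides do not prescribe a similarity measure or a prediction formula, so the Pearson correlation, the similarity-weighted average, the function names, and the sample ratings are assumptions chosen for illustration.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users have rated."""
    common = [i for i in ratings_a if i in ratings_b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den = math.sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common)
                    * sum((ratings_b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0


def predict(user, item, all_ratings):
    """Similarity-weighted average of other users' ratings of the item.
    Returns None when no positively correlated user has rated it."""
    weighted, total = 0.0, 0.0
    for other, ratings in all_ratings.items():
        if other == user or item not in ratings:
            continue
        sim = pearson(all_ratings[user], ratings)
        if sim > 0:
            weighted += sim * ratings[item]
            total += sim
    return weighted / total if total else None


# Hypothetical ratings on a 1 to 5 scale.
ratings = {
    "ann": {"a": 5, "b": 1, "c": 4},
    "bob": {"a": 5, "b": 2, "c": 4, "d": 5},
    "cat": {"a": 1, "b": 5, "c": 2, "d": 1},
}
prediction = predict("ann", "d", ratings)
unrated = predict("ann", "z", ratings)
```

Here "bob" rates items much like "ann" while "cat" rates them in the opposite way, so the prediction for item "d" follows bob's rating. For item "z", which nobody has rated, the sketch returns no prediction at all.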


Collaborative Filters (cont.)

Collaborative filters have two common weaknesses:
- The first-rater problem: if no users have rated an information item, the filter cannot evaluate whether that item is likely to be of interest to its user.
- Sparse data: most users do not rate all that much information due to the time it takes, and as such, it is not always easy to find users with similar profiles.

Collaborative filters work quite well and have been applied successfully in a variety of domains, including:
- Finding people who are knowledgeable in a given field (Tapestry)
- Netnews (GroupLens)
- Music recommendation (Ringo)
- Helping people to find Web resources (PHOAKS)
- CDNow.com, reel.com, and Amazon.com


Hybrid Filters

The goal of hybrid filters is to combine the best features of content-based and collaborative filters and minimize the impact of their weaknesses, so as to outperform each individually.

Generally they start with one type of filter (content-based or collaborative) and incorporate features from the other type of filter to improve the performance of the original filter.

One simple approach is to have the content-based and collaborative filters each produce separate recommendations, and then combine their predictions.
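
Combining the two predictions can be as simple as a weighted average. The sketch below assumes both filters score items on the same scale and that a filter returns `None` when it has no opinion; the function name `hybrid_score` and the default weight of 0.5 are hypothetical.

```python
def hybrid_score(content_score, collaborative_score, weight=0.5):
    """Weighted average of the two predictions, both on the same scale.

    If one filter cannot produce a prediction (None), fall back to the other.
    """
    if content_score is None:
        return collaborative_score
    if collaborative_score is None:
        return content_score
    return weight * content_score + (1 - weight) * collaborative_score


combined = hybrid_score(4.0, 2.0)    # both filters have an opinion
fallback = hybrid_score(4.0, None)   # e.g. an item nobody has rated yet
```

The fallback branch is where the hybrid helps most: the content-based score still yields a recommendation when the collaborative filter has no ratings to work from.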


Profiling and Filtering Technologies

Most information filtering systems are based on a number of key techniques used to describe information, create a user profile, and provide the interaction and filtering needed for a useful system.

Key filtering technologies:
- Keyword vectors
- N-grams
- Hyperlink structures
- Collaborative and economic-based filtering
- Data-mining techniques


Keyword vectors

Keywords are the most popular way of representing documents and are also used to represent user profiles.

Most representations are based on a standard information retrieval technique called weighted vector representation.

Document similarity, and the distance of a document to a preferred profile vector, can be obtained easily by comparing the respective vectors, for instance with k-nearest-neighbor algorithms.

User profiles can be obtained by determining (clusters of) document vectors that are indicative of the type of information of interest to the user.
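
A weighted vector representation with a k-nearest-neighbor decision can be sketched as follows. The slides do not fix a weighting scheme, so the TF-IDF variant used here (term frequency times `log(n / df) + 1`), the cosine measure, the function names, and the sample documents are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """One weighted vector per text: term frequency times a smoothed
    inverse document frequency, log(n / df) + 1."""
    counts = [Counter(t.lower().split()) for t in texts]
    doc_freq = Counter(term for c in counts for term in c)
    n = len(texts)
    return [{t: tf * (math.log(n / doc_freq[t]) + 1.0) for t, tf in c.items()}
            for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse weighted vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_is_interesting(new_text, rated, k=3):
    """rated is a list of (text, liked) pairs; take a majority vote among
    the k rated documents most similar to the new one."""
    vectors = tfidf_vectors([text for text, _ in rated] + [new_text])
    new_vec = vectors[-1]
    ranked = sorted(zip(vectors[:-1], (liked for _, liked in rated)),
                    key=lambda pair: cosine(new_vec, pair[0]), reverse=True)
    votes = [liked for _, liked in ranked[:k]]
    return sum(votes) > len(votes) / 2


# Hypothetical documents the user has already judged.
rated = [
    ("latent semantic indexing for text retrieval", True),
    ("vector space retrieval of text documents", True),
    ("keyword indexing and retrieval evaluation", True),
    ("football league results and scores", False),
    ("transfer news from the football league", False),
]
likes_ir = knn_is_interesting("semantic retrieval of text", rated)
likes_sport = knn_is_interesting("football scores", rated)
```

In this setup the set of rated document vectors plays the role of the user profile: a new document is judged by the company it keeps among documents the user has already rated.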


N-grams

An n-gram is a sequence of n letters. Typically n is at least three.

For a given n and alphabet size, there is a finite number of letter sequences of length n, and thus a fixed number of possible n-grams.

A text can be converted to an n-gram distribution by counting the number of times each possible n-gram appears within the text.

The main benefit of n-grams is that they are less sensitive to spelling errors, and that the (large!) n-gram vector also captures more of the document structure than keywords do.
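
The robustness to spelling errors can be illustrated with a small sketch. It assumes trigrams over lowercase letters and spaces and uses a Dice-style overlap between the two count distributions; these choices and the function names are assumptions, not something the slides specify.

```python
from collections import Counter


def ngram_counts(text, n=3):
    """Count every sequence of n consecutive characters (letters and
    spaces) in the lowercased text."""
    letters = "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ")
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))


def overlap(a, b):
    """Share of n-gram occurrences the two distributions have in common
    (a Dice-style coefficient between 0 and 1)."""
    shared = sum((a & b).values())
    total = sum(a.values()) + sum(b.values())
    return 2 * shared / total if total else 0.0


correct = ngram_counts("information retrieval")
misspelt = ngram_counts("information retrival")

# An exact keyword match fails on the misspelt text...
keyword_hit = "retrieval" in "information retrival".split()
# ...while most trigrams are still shared between the two texts.
similarity = overlap(correct, misspelt)
```

A single missing letter changes only the few trigrams that span it, so the two distributions stay close even though the keyword itself no longer matches.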


Hyperlink structures

Especially for documents with linked structures, such as web pages, graph-like representations can be extracted, mapping out the relationships between documents, and between words near links to other documents.

Such structure can be exploited to filter web pages into different categories.


Collaborative and economic-based filtering

Collaborative (or social) filtering utilizes feedback and ratings from different users to filter out irrelevant information.

The information of interest to a user is gathered on the fly by using the opinions of other users with similar interests.

Economic-based filtering augments this idea with a cost-benefit analysis on behalf of the user.

It takes into consideration parameters like the price of a document and its cost of transmission when making filtering decisions.
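
The cost-benefit decision can be sketched in its simplest form as below. It assumes the predicted value of a document, its price, and its transmission cost are all expressed in the same unit; the function name, the field names, and the numbers are hypothetical.

```python
def worth_delivering(predicted_value, price, transmission_cost):
    """Deliver a document only if its expected benefit exceeds its total cost.
    All three quantities are assumed to be in the same (hypothetical) unit."""
    return predicted_value > price + transmission_cost


candidates = [
    {"id": "report", "predicted_value": 5.0, "price": 2.0, "transmission_cost": 0.5},
    {"id": "video", "predicted_value": 3.0, "price": 1.0, "transmission_cost": 4.0},
]
delivered = [c["id"] for c in candidates
             if worth_delivering(c["predicted_value"], c["price"],
                                 c["transmission_cost"])]
```

In practice the predicted value could come from a collaborative or content-based filter, with the economic test applied as a final step.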


Data-mining techniques

Data-mining techniques can be employed to find similarities between data entries, and thus to infer that the profile of a given user might be very close to the profile of some other users.

Correlating the current customer to previous users (people-to-people correlation, e.g. you are like this type of customer who typically likes...) or to these previous users' interests (item-to-item correlation, e.g. this item you are considering is very much like these items...) allows companies to present a customer with information that he or she is likely to be interested in.


Key user-modeling techniques

User modeling can be defined as the effort to create a profile of the user's interests and habits.

Profiles can be acquired or generated in a variety of ways:

By explicit modeling by humans:
- By direct user interviews and questionnaires
- By knowledge engineers using user stereotypes
- Rule-based profiles, where the users specify their own rules in the profile, rules that control the behavior of the model


Key user-modeling techniques (cont.)

By automated software techniques:
- Machine learning techniques like inference, induction, and classification, where the modeler tries to identify certain patterns in the user's behavior
- Profile building by example, where the user provides examples of his/her behavior and the modeling software records them

At the moment, the first method (explicit modeling) is much further developed and significantly more widely applied than the second (automated techniques), which is still in its development phase.



Conclusion

Information retrieval and Information filtering are indeed two sides of the same coin.

They work together to help people get the information needed to perform their tasks.
