Recommender System Notes

Recommender Systems (RSs) are software tools and techniques providing suggestions for items to be of use to a user. The suggestions relate to various decision-making processes, such as what items to buy, what music to listen to, or what online news to read. "Item" is the general term used to denote what the system recommends to users. A RS normally focuses on a specific type of item (e.g., CDs or news) and accordingly its design, its graphical user interface, and the core recommendation technique used to generate the recommendations are all customized to provide useful and effective suggestions for that specific type of item. RSs are primarily directed towards individuals who lack sufficient personal experience or competence to evaluate the potentially overwhelming number of alternative items that a Web site, for example, may offer.

A case in point is a book recommender system that assists users to select a book to read. The popular Web site Amazon.com, for example, employs a RS to personalize the online store for each customer. Since recommendations are usually personalized, different users or user groups receive diverse suggestions. In addition, there are also non-personalized recommendations. These are much simpler to generate and are normally featured in magazines or newspapers; typical examples include the top-ten selections of books, CDs, etc. While they may be useful and effective in certain situations, these types of non-personalized recommendations are not typically addressed by RS research. In their simplest form, personalized recommendations are offered as ranked lists of items. In performing this ranking, RSs try to predict what the most suitable products or services are, based on the user's preferences and constraints. In order to complete such a computational task, RSs collect from users their preferences, which are either explicitly expressed, e.g., as ratings for products, or are inferred by interpreting user actions. For instance, a RS may consider the navigation to a particular product page as an implicit sign of preference for the items shown on that page. In seeking to mimic this behavior, the first RSs applied algorithms to leverage recommendations produced by a community of users to deliver recommendations to an active user, i.e., a user looking for suggestions. The recommendations were for items that similar users (those with similar tastes) had liked. This approach is termed collaborative filtering, and its rationale is that if the active user agreed in the past with some users, then the other recommendations coming from these similar users should be relevant as well and of interest to the active user.

As noted above, the study of recommender systems is relatively new compared to research into other classical information system tools and techniques (e.g., databases or search engines). Recommender systems emerged as an independent research area in the mid-1990s.

In recent years the interest in recommender systems has dramatically increased, as the following facts indicate:

1. Recommender systems play an important role in such highly rated Internet sites as Amazon.com, YouTube, Netflix, Yahoo!, Tripadvisor, Last.fm, and IMDb. Moreover, many media companies are now developing and deploying RSs as part of the services they provide to their subscribers. For example, Netflix, the online movie rental service, awarded a million-dollar prize to the team that first succeeded in substantially improving the performance of its recommender system.

2. There are dedicated conferences and workshops related to the field. We refer specifically to ACM Recommender Systems (RecSys), established in 2007 and now the premier annual event in recommender technology research and applications. In addition, sessions dedicated to RSs are frequently included in the more traditional conferences in the areas of databases, information systems, and adaptive systems. Among these, it is worth mentioning ACM's Special Interest Group on Information Retrieval (SIGIR), User Modeling, Adaptation and Personalization (UMAP), and ACM's Special Interest Group on Management of Data (SIGMOD).

3. At institutions of higher education around the world, undergraduate and graduate courses are now dedicated entirely to RSs, tutorials on RSs are very popular at computer science conferences, and recently a book introducing RS techniques was published.

4. There have been several special issues in academic journals covering research and developments in the RS field. Among the journals that have dedicated issues to RSs are: AI Communications (2008), IEEE Intelligent Systems (2007), International Journal of Electronic Commerce (2006), International Journal of Computer Science and Applications (2006), ACM Transactions on Computer-Human Interaction (2005), and ACM Transactions on Information Systems (2004).

In general, we can say that, from the service provider's point of view, the primary goal for introducing a RS is to increase the conversion rate, i.e., the number of users that accept the recommendation and consume an item, compared to the number of simple visitors that just browse through the information.

• Sell more diverse items. Another major function of a RS is to enable the user to select items that might be hard to find without a precise recommendation. For instance, in a movie RS such as Netflix, the service provider is interested in renting all the DVDs in the catalogue, not just the most popular ones. This could be difficult without a RS, since the service provider cannot afford the risk of advertising movies that are not likely to suit a particular user's taste. Therefore, a RS suggests or advertises unpopular movies to the right users.

• Increase user satisfaction. A well-designed RS can also improve the experience of the user with the site or the application. The user will find the recommendations interesting and relevant and, with a properly designed human-computer interaction, she will also enjoy using the system. The combination of effective, i.e., accurate, recommendations and a usable interface will increase the user's subjective evaluation of the system. This in turn will increase system usage and the likelihood that the recommendations will be accepted.

• Increase user fidelity. A user should be loyal to a Web site which, when visited, recognizes the returning customer and treats him as a valuable visitor. This is a normal feature of a RS, since many RSs compute recommendations leveraging the information acquired from the user in previous interactions, e.g., her ratings of items. Consequently, the longer the user interacts with the site, the more refined her user model becomes, i.e., the system's representation of the user's preferences, and the more the recommender output can be effectively customized to match the user's preferences.

• Better understand what the user wants. Another important function of a RS, which can be leveraged in many other applications, is the description of the user's preferences, either collected explicitly or predicted by the system. The service provider may then decide to re-use this knowledge for a number of other goals, such as improving the management of the item's stock or production. For instance, in the travel domain, destination management organizations can decide to advertise a specific region to new customer sectors, or to advertise a particular type of promotional message derived by analyzing the data collected by the RS (transactions of the users).

We mentioned above some important motivations as to why e-service providers introduce RSs. But users also may want a RS if it will effectively support their tasks or goals. Consequently, a RS must balance the needs of these two players and offer a service that is valuable to both.

Herlocker et al., in a paper that has become a classical reference in this field, define eleven popular tasks that a RS can assist in implementing. Some may be considered as the main or core tasks that are normally associated with a RS, i.e., to offer suggestions for items that may be useful to a user. Others might be considered as more "opportunistic" ways to exploit a RS. As a matter of fact, this task differentiation is very similar to what happens with a search engine. Its primary function is to locate documents that are relevant to the user's information need, but it can also be used to check the importance of a Web page (looking at the position of the page in the result list of a query) or to discover the various usages of a word in a collection of documents.

• Find some good items. Recommend to a user some items as a ranked list, along with predictions of how much the user would like them (e.g., on a one-to-five star scale). This is the main recommendation task that many commercial systems address (see, for instance, Chapter 9). Some systems do not show the predicted rating.

• Find all good items. Recommend all the items that can satisfy some user needs. In such cases it is insufficient to just find some good items. This is especially true when the number of items is relatively small or when the RS is mission-critical, such as in medical or financial applications. In these situations, in addition to the benefit derived from carefully examining all the possibilities, the user may also benefit from the RS's ranking of these items or from additional explanations that the RS generates.

• Annotation in context. Given an existing context, e.g., a list of items, emphasize some of them depending on the user's long-term preferences. For example, a TV recommender system might annotate which TV shows displayed in the electronic program guide (EPG) are worth watching (Chapter 18 provides interesting examples of this task).

• Recommend a sequence. Instead of focusing on the generation of a single recommendation, the idea is to recommend a sequence of items that is pleasing as a whole. Typical examples include recommending a TV series, a book on RSs after having recommended a book on data mining, or a compilation of musical tracks.

• Recommend a bundle. Suggest a group of items that fits well together. For instance, a travel plan may be composed of various attractions, destinations, and accommodation services that are located in a delimited area. From the point of view of the user, these various alternatives can be considered and selected as a single travel destination.

• Just browsing. In this task the user browses the catalog without any imminent intention of purchasing an item. The task of the recommender is to help the user browse the items that are more likely to fall within the scope of the user's interests for that specific browsing session. This is a task that has also been supported by adaptive hypermedia techniques.

• Find a credible recommender. Some users do not trust recommender systems, so they play with them to see how good they are at making recommendations. Hence, some systems may also offer specific functions that let users test their behavior, in addition to those required just for obtaining recommendations.

• Improve the profile. This relates to the capability of the user to provide (input) information to the recommender system about what he likes and dislikes. This is a fundamental task that is strictly necessary to provide personalized recommendations. If the system has no specific knowledge about the active user, then it can only provide him with the same recommendations that would be delivered to an "average" user.

• Express self. Some users may not care about the recommendations at all. Rather, what is important to them is that they be allowed to contribute their ratings and express their opinions and beliefs. The user satisfaction from that activity can still act as leverage for holding the user tightly to the application (as we mentioned above in discussing the service provider's motivations).

• Help others. Some users are happy to contribute information, e.g., their evaluation of items (ratings), because they believe that the community benefits from their contribution. This could be a major motivation for entering information into a recommender system that is not used routinely. For instance, with a car RS, a user who has already bought her new car is aware that the rating entered in the system is more likely to be useful for other users than for the next time she buys a car.

• Influence others. In Web-based RSs, there are users whose main goal is to explicitly influence other users into purchasing particular products. As a matter of fact, there are also some malicious users that may use the system just to promote or penalize certain items (see Chapter 25).

In any case, as a general classification, the data used by RSs refers to three kinds of objects: items, users, and transactions, i.e., relations between users and items.

Items. Items are the objects that are recommended. Items may be characterized by their complexity and their value or utility. The value of an item may be positive if the item is useful for the user, or negative if the item is not appropriate and the user made a wrong decision when selecting it. We note that when a user is acquiring an item she will always incur a cost, which includes the cognitive cost of searching for the item and the real monetary cost eventually paid for the item. For instance, the designer of a news RS must take into account the complexity of a news item, i.e., its structure, the textual representation, and the time-dependent importance of any news item. But at the same time the RS designer must understand that, even if the user is not paying for reading news, there is always a cognitive cost associated with searching for and reading news items. If a selected item is relevant for the user, this cost is dominated by the benefit of having acquired useful information, whereas if the item is not relevant, the net value of that item for the user, and of its recommendation, is negative. In other domains, e.g., cars or financial investments, the true monetary cost of the items becomes an important element to consider when selecting the most appropriate recommendation approach. Items with low complexity and value include news, Web pages, books, CDs, and movies. Items with larger complexity and value include digital cameras, mobile phones, PCs, etc. The most complex items that have been considered are insurance policies, financial investments, travel plans, and jobs. RSs, according to their core technology, can use a range of properties and features of the items. For example, in a movie recommender system, the genre (such as comedy, thriller, etc.), as well as the director and actors, can be used to describe a movie and to learn how the utility of an item depends on its features. Items can be represented using various information and representation approaches, e.g., in a minimalist way as a single id code, in a richer form as a set of attributes, or even as a concept in an ontological representation of the domain.

Users. Users of a RS, as mentioned above, may have very diverse goals and characteristics. In order to personalize the recommendations and the human-computer interaction, RSs exploit a range of information about the users. This information can be structured in various ways, and again the selection of what information to model depends on the recommendation technique. For instance, in collaborative filtering, users are modeled as a simple list containing the ratings provided by the user for some items. In a demographic RS, sociodemographic attributes such as age, gender, profession, and education are used. User data is said to constitute the user model. The user model profiles the user, i.e., encodes her preferences and needs. Various user modeling approaches have been used, and, in a certain sense, a RS can be viewed as a tool that generates recommendations by building and exploiting user models. Since no personalization is possible without a convenient user model, unless the recommendation is non-personalized, as in the top-10 selection, the user model will always play a central role. For instance, considering again a collaborative filtering approach, the user is either profiled directly by her ratings of items, or, using these ratings, the system derives a vector of factor values, where users differ in how much weight each factor has in their model. Users can also be described by their behavior pattern data, for example, site browsing patterns (in a Web-based recommender system) or travel search patterns (in a travel recommender system). Moreover, user data may include relations between users, such as the trust level of these relations. A RS might utilize this information to recommend items to users that were preferred by similar or trusted users.

Transactions. We generically refer to a transaction as a recorded interaction between a user and the RS. Transactions are log-like data that store important information generated during the human-computer interaction and which are useful for the recommendation generation algorithm that the system is using. For instance, a transaction log may contain a reference to the item selected by the user and a description of the context (e.g., the user goal/query) for that particular recommendation. If available, the transaction may also include explicit feedback the user has provided, such as the rating for the selected item.

In fact, ratings are the most popular form of transaction data that a RS collects. These ratings may be collected explicitly or implicitly. In the explicit collection of ratings, the user is asked to provide her opinion about an item on a rating scale. Ratings can take on a variety of forms:

• Numerical ratings, such as the 1-5 stars provided in the book recommender associated with Amazon.com.

• Ordinal ratings, such as "strongly agree, agree, neutral, disagree, strongly disagree", where the user is asked to select the term that best indicates her opinion regarding an item (usually via questionnaire).

• Binary ratings, which model choices in which the user is simply asked to decide if a certain item is good or bad.

• Unary ratings, which can indicate that a user has observed or purchased an item, or otherwise rated the item positively. In such cases the absence of a rating indicates that we have no information relating the user to the item (perhaps she purchased the item somewhere else).

Another form of user evaluation consists of tags associated by the user with the items the system presents. For instance, in the MovieLens RS (http://movielens.umn.edu), tags represent how MovieLens users feel about a movie, e.g., "too long" or "acting". In transactions collecting implicit ratings, the system aims to infer the user's opinion based on the user's actions. For example, if a user enters the keyword "Yoga" at Amazon.com, she will be provided with a long list of books. In return, the user may click on a certain book on the list in order to receive additional information. At this point the system may infer that the user is somewhat interested in that book.

In conversational systems, i.e., systems that support an interactive process, the transaction model is more refined. In these systems user requests alternate with system actions. That is, the user may request a recommendation and the system may produce a suggestion list, but it can also request additional user preferences to provide the user with better results. Here, in the transaction model, the system collects the various requests and responses and may eventually learn to modify its interaction strategy by observing the outcome of the recommendation process.

Recommendation Techniques

In order to implement its core function, identifying the useful items for the user, a RS must predict that an item is worth recommending. In order to do this, the system must be able to predict the utility of some items, or at least compare the utility of some items, and then decide what items to recommend based on this comparison. The prediction step may not be explicit in the recommendation algorithm, but we can still apply this unifying model to describe the general role of a RS.

To illustrate the prediction step of a RS, consider for instance a simple, non-personalized recommendation algorithm that recommends just the most popular songs. The rationale for using this approach is that, in the absence of more precise information about the user's preferences, a popular song, i.e., something that is liked (has high utility) by many users, will probably also be liked by a generic user, at least more than another randomly selected song. Hence the utility of these popular songs is predicted to be reasonably high for this generic user.
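
As a minimal sketch of this non-personalized approach (the list of "likes", item names, and function name are invented for illustration, not from any particular system), popularity can be estimated by simply counting how many users liked each song:

```python
from collections import Counter

# Hypothetical implicit feedback: (user, song) pairs meaning "user liked song".
likes = [("u1", "songA"), ("u2", "songA"), ("u3", "songA"),
         ("u1", "songB"), ("u3", "songB"), ("u2", "songC")]

# Non-personalized utility estimate: popularity = number of users who liked the song.
popularity = Counter(song for _, song in likes)

def recommend_most_popular(k=2):
    # Every user receives the same list: the k most popular songs.
    return [song for song, _ in popularity.most_common(k)]

print(recommend_most_popular())  # ['songA', 'songB']
```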

This view of the core recommendation computation as the prediction of the utility of an item for a user has been suggested in the literature. The degree of utility of the user u for the item i is modeled as a (real-valued) function R(u, i), as is normally done in collaborative filtering by considering the ratings of users for items. Then the fundamental task of a collaborative filtering RS is to predict the value of R over pairs of users and items, i.e., to compute R̂(u, i), where we denote with R̂ the estimation, computed by the RS, of the true function R. Consequently, having computed this prediction for the active user u on a set of items, i.e., R̂(u, i_1), ..., R̂(u, i_N), the system will recommend the items i_{j_1}, ..., i_{j_K} (K ≤ N) with the largest predicted utility. K is typically a small number, i.e., much smaller than the cardinality of the item data set or of the items for which a user utility prediction can be computed; in other words, RSs "filter" the items that are recommended to users.
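
In code, this "predict, then filter" view can be sketched as follows; r_hat is a placeholder for whatever trained model the RS actually uses, and the toy scores below are invented for illustration:

```python
def recommend_top_k(user, candidate_items, r_hat, k=10):
    """Generic recommendation step: score every candidate item with the
    estimated utility function r_hat(u, i) and keep the K highest-scoring ones."""
    scored = [(item, r_hat(user, item)) for item in candidate_items]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # largest predicted utility first
    return scored[:k]

# Toy utility estimator: in a real RS this would be a trained prediction model.
toy_scores = {("u1", "i1"): 4.2, ("u1", "i2"): 3.1, ("u1", "i3"): 4.9}
r_hat = lambda u, i: toy_scores.get((u, i), 0.0)

print(recommend_top_k("u1", ["i1", "i2", "i3"], r_hat, k=2))
# [('i3', 4.9), ('i1', 4.2)]
```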

As mentioned above, some recommender systems do not fully estimate the utility before making a recommendation, but they may apply some heuristics to hypothesize that an item is of use to a user. This is typical, for instance, in knowledge-based systems. These utility predictions are computed with specific algorithms (see below) and use various kinds of knowledge about users, items, and the utility function itself. For instance, the system may assume that the utility function is Boolean, and therefore it will just determine whether an item is or is not useful for the user. Consequently, assuming that there is some available knowledge (possibly none) about the user who is requesting the recommendation, as well as knowledge about items and other users who received recommendations, the system will leverage this knowledge with an appropriate algorithm to generate various utility predictions and hence recommendations.

To provide a first overview of the different types of RSs, we want to quote a taxonomy that has become a classical way of distinguishing between recommender systems and referring to them. [25] distinguishes between six different classes of recommendation approaches:

Content-based: The system learns to recommend items that are similar to the ones that the user liked in the past. The similarity of items is calculated based on the features associated with the compared items. For example, if a user has positively rated a movie that belongs to the comedy genre, then the system can learn to recommend other movies from this genre. Chapter 3 provides an overview of content-based recommender systems, imposing some order among the extensive and diverse aspects involved in their design and implementation. It presents the basic concepts and terminology of content-based RSs, their high-level architecture, and their main advantages and drawbacks. The chapter then surveys state-of-the-art systems that have been adopted in several application domains. The survey encompasses a thorough description of both classical and advanced techniques for representing items and user profiles. Finally, it discusses trends and future research which might lead towards the next generation of recommender systems.
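
A very rough sketch of the content-based idea (not the full architecture surveyed in Chapter 3): represent each item by a set of features, build a user profile from the features of liked items, and score unseen items by feature overlap. The item names and features below are invented:

```python
# Items described by feature sets (e.g., genres); invented for illustration.
item_features = {
    "movie1": {"comedy", "romance"},
    "movie2": {"comedy", "crime"},
    "movie3": {"thriller", "crime"},
}

liked = ["movie1"]  # items the user rated positively

# User profile = union of the features of the liked items.
profile = set().union(*(item_features[i] for i in liked))

def jaccard(a, b):
    # Simple overlap-based similarity between two feature sets.
    return len(a & b) / len(a | b) if a | b else 0.0

# Score unseen items by similarity between their features and the profile.
scores = {i: jaccard(profile, f) for i, f in item_features.items() if i not in liked}
print(max(scores, key=scores.get))  # "movie2": it shares the "comedy" feature
```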

Collaborative filtering: The simplest and original implementation of this approach [93] recommends to the active user the items that other users with similar tastes liked in the past. The similarity in taste of two users is calculated based on the similarity in their rating histories. This is the reason why [94] refers to collaborative filtering as "people-to-people correlation". Collaborative filtering is considered to be the most popular and widely implemented technique in RSs.

Neighborhood-based methods are a comprehensively surveyed family of collaborative filtering techniques. Neighborhood methods focus on relationships between items or, alternatively, between users. An item-item approach models the preference of a user for an item based on the ratings the same user gave to similar items. Nearest-neighbor methods enjoy considerable popularity due to their simplicity, their efficiency, and their ability to produce accurate and personalized recommendations. Surveys of these methods address the essential decisions that are required when implementing a neighborhood-based recommender system and provide practical information on how to make such decisions. Finally, they deal with the problems of data sparsity and limited coverage often observed in large commercial recommender systems, and present a few solutions to overcome these problems.
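
A minimal item-item neighborhood sketch, assuming a tiny in-memory rating dictionary and plain cosine similarity between item rating vectors (real systems add significance weighting, neighborhood selection, and bias terms):

```python
import math

# user -> {item: rating}; a toy user-rating matrix for illustration.
ratings = {
    "u1": {"i1": 5, "i2": 3, "i3": 4},
    "u2": {"i1": 4, "i2": 2},
    "u3": {"i2": 5, "i3": 5},
}

def item_vector(item):
    # Ratings given to `item`, keyed by user.
    return {u: r[item] for u, r in ratings.items() if item in r}

def cosine(a, b):
    common = set(a) & set(b)
    num = sum(a[u] * b[u] for u in common)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def predict(user, item):
    # Weighted average of the user's own ratings on items similar to `item`.
    target = item_vector(item)
    sims = weighted = 0.0
    for other, rating in ratings[user].items():
        if other == item:
            continue
        s = cosine(target, item_vector(other))
        sims += s
        weighted += s * rating
    return weighted / sims if sims else 0.0

print(round(predict("u2", "i3"), 2))  # predicted rating of user u2 for item i3
```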

Demographic: This type of system recommends items based on the demographic profile of the user. The assumption is that different recommendations should be generated for different demographic niches. Many Web sites adopt simple and effective personalization solutions based on demographics. For example, users are dispatched to particular Web sites based on their language or country, or suggestions may be customized according to the age of the user. While these approaches have been quite popular in the marketing literature, there has been relatively little proper RS research into demographic systems.

Knowledge-based: Knowledge-based systems recommend items based on specific domain knowledge about how certain item features meet users' needs and preferences and, ultimately, how the item is useful for the user. Notable knowledge-based recommender systems are case-based. In these systems a similarity function estimates how much the user needs (the problem description) match the recommendations (the solutions of the problem). Here the similarity score can be directly interpreted as the utility of the recommendation for the user. Constraint-based systems are another type of knowledge-based RS.

In terms of the knowledge used, both systems are similar: user requirements are collected, repairs for inconsistent requirements are automatically proposed in situations where no solutions could be found, and recommendation results are explained. The major difference lies in the way solutions are calculated. Case-based recommenders determine recommendations on the basis of similarity metrics, whereas constraint-based recommenders predominantly exploit predefined knowledge bases that contain explicit rules about how to relate customer requirements with item features.

Knowledge-based systems tend to work better than others at the beginning of their deployment, but if they are not equipped with learning components they may be surpassed by other, shallower methods that can exploit the logs of the human-computer interaction (as in CF).

Community-based: This type of system recommends items based on the preferences of the user's friends. This technique follows the epigram "Tell me who your friends are, and I will tell you who you are." Evidence suggests that people tend to rely more on recommendations from their friends than on recommendations from similar but anonymous individuals. This observation, combined with the growing popularity of open social networks, is generating a rising interest in community-based systems or, as they are usually referred to, social recommender systems. This type of RS models and acquires information about the social relations of the users and the preferences of the user's friends. The recommendation is based on ratings that were provided by the user's friends. In fact, these RSs are following the rise of social networks and enable a simple and comprehensive acquisition of data related to the social relations of the users.

Hybrid recommender systems: These RSs are based on the combination of the above-mentioned techniques. A hybrid system combining techniques A and B tries to use the advantages of A to fix the disadvantages of B.

For instance, CF methods suffer from new-item problems, i.e., they cannot recommend items that have no ratings. This does not limit content-based approaches, since the prediction for new items is based on their description (features), which is typically easily available. Given two (or more) basic RS techniques, several ways have been proposed for combining them to create a new hybrid system. As we have already mentioned, the context of the user when she is seeking a recommendation can also be used to better personalize the output of the system. For example, in a temporal context, vacation recommendations in winter should be very different from those provided in summer, and a restaurant recommendation for a Saturday evening with your friends should be different from that suggested for a workday lunch with co-workers.
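
One common combination is a weighted hybrid, where each component produces a score and the hybrid blends them. The sketch below uses stubbed-in scoring functions (invented values) only to show the combination logic and how a content-based component can cover the new-item case:

```python
def hybrid_score(user, item, cf_score, cb_score, alpha=0.7):
    """Weighted hybrid: blend a collaborative score with a content-based one.
    alpha controls how much weight the collaborative component receives."""
    return alpha * cf_score(user, item) + (1 - alpha) * cb_score(user, item)

# Stub components for illustration; a real system would plug in trained models.
cf_score = lambda u, i: {"i1": 0.2, "i2": 0.9}.get(i, 0.0)            # no score for new items
cb_score = lambda u, i: {"i1": 0.8, "i2": 0.4, "i_new": 0.7}.get(i, 0.0)

for item in ["i1", "i2", "i_new"]:
    print(item, round(hybrid_score("u1", item, cf_score, cb_score), 2))
# The content-based part still gives "i_new" a non-zero score,
# mitigating the collaborative new-item problem mentioned above.
```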

Application and Evaluation

Recommender system research is being conducted with a strong emphasis on practice and commercial applications since, aside from its theoretical contribution, it is generally aimed at practically improving commercial RSs. Thus RS research involves practical aspects that apply to the implementation of these systems. These aspects are relevant to different stages in the life cycle of a RS, namely the design of the system, its implementation, and its maintenance and enhancement during system operation.

The aspects that apply to the design stage include factors that might affect the choice of the algorithm. The first factor to consider, the application's domain, has a major effect on the algorithmic approach that should be taken. [72] provide a taxonomy of RSs and classify existing RS applications into specific application domains. Based on these specific application domains, we define more general classes of domains for the most common recommender system applications:

• Entertainment: recommendations for movies, music, and IPTV.

• Content: personalized newspapers, recommendation of documents, recommendations of Web pages, e-learning applications, and e-mail filters.

• E-commerce: recommendations for consumers of products to buy, such as books, cameras, PCs, etc.

• Services: recommendations of travel services, recommendation of experts for consultation, recommendation of houses to rent, or matchmaking services.

Data Collection

The logical component in charge of pre-processing the data and generating the input of the recommender algorithm is referred to as the data collector. The data collector gathers data from different sources, such as the EPG for information about the live programs, the content provider for information about the VOD catalog, and the service provider for information about the users.

The Fastweb recommender system does not rely on personal information from the users (e.g., age, gender, occupation). Recommendations are based on the users' past behavior (what they watched) and on any explicit preferences they have expressed (e.g., preferred genres). If the users did not specify any explicit preferences, the system is able to infer them by analyzing the users' past activities.

An important question was raised in Section 9.2: users interact with the IPTV system by means of the STB, but typically we cannot identify who is actually in front of the TV. Consequently, the STB collects the behavior and the preferences of a set of users (e.g., the members of a family). This represents a considerable problem, since we are limited to generating per-STB recommendations. In order to simplify the notation, in the rest of the paper we will refer to user and STB as the same entity. The user-disambiguation problem has been partially solved by separating the collected information according to the time slot it refers to. For instance, we can roughly assume the following pattern: housewives tend to watch TV during the morning, children during the afternoon, the whole family in the evening, while only adults watch TV during the night. By means of this simple time-slot distinction, we are able to distinguish among different potential users of the same STB.
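
A sketch of this time-slot heuristic: events logged by a STB are bucketed by hour of day, and each (STB, slot) pair is treated as a distinct virtual user. The slot boundaries below are illustrative assumptions, not the actual configuration used by Fastweb:

```python
from datetime import datetime

def time_slot(ts: datetime) -> str:
    """Map an event timestamp to a coarse time slot (assumed boundaries)."""
    hour = ts.hour
    if 6 <= hour < 12:
        return "morning"    # e.g., housewives
    if 12 <= hour < 18:
        return "afternoon"  # e.g., children
    if 18 <= hour < 23:
        return "evening"    # the whole family
    return "night"          # adults only

def virtual_user(stb_id: str, ts: datetime) -> str:
    # One STB becomes several virtual users, one per time slot.
    return f"{stb_id}:{time_slot(ts)}"

print(virtual_user("stb42", datetime(2010, 5, 1, 21, 30)))  # stb42:evening
```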

Formally, the available information has been structured into two main matrices, practically stored in a relational database: the item-content matrix (ICM) and the user-rating matrix (URM).

The former describes the principal characteristics (metadata) of each item. In the following we will refer to the item-content matrix as W, whose elements w_ci represent the relevance of characteristic (metadata) c for item i. The ICM is generated from the analysis of the set of information given by the content provider (i.e., the EPG). Such information concerns, for instance, the title of a movie, the actors, the director(s), the genre(s), and the plot. Note that in a real environment we may be faced with inaccurate information, especially because of the rate at which new content is added every day. The information provided by the ICM is used to generate content-based recommendations, after being filtered by means of techniques for PoS (part-of-speech) tagging, stop-word removal, and latent semantic analysis. Moreover, the ICM can be used to perform some kind of processing on the items (e.g., parental control).

The URM represents the ratings (i.e., preferences) of users about items. In the following we will refer to this matrix as R, whose elements r_pi represent the rating of user p for item i. Such preferences constitute the basic information for any collaborative algorithm. The user rating can be either explicit or implicit, depending on whether the ratings are explicitly expressed by users or implicitly collected by the system.
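
As a simple illustration of these two data structures, the ICM and URM can be kept as sparse mappings keyed by (row, column); the identifiers and values below are invented:

```python
# Item-content matrix W: W[(c, i)] = relevance of metadata c for item i.
W = {
    ("genre:thriller", "movie1"): 1.0,
    ("actor:smith", "movie1"): 0.8,
    ("genre:comedy", "movie2"): 1.0,
}

# User-rating matrix R: R[(p, i)] = rating of user (or STB time slot) p for item i.
R = {}
R[("stb42:evening", "movie1")] = 4      # explicit rating on a 1-5 scale
R[("stb42:evening", "movie2")] = 1.0    # implicit: the movie was watched to the end

def items_rated_by(user):
    # All items for which user `user` has an entry in the URM.
    return [item for (p, item) in R if p == user]

print(items_rated_by("stb42:evening"))  # ['movie1', 'movie2']
```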

Explicit ratings confidently represent the user's opinion, even though they can be affected by biases due to user subjectivity, item popularity, or global rating tendencies. The first bias depends on arbitrary interpretations of the rating scale. For instance, on a rating scale between 1 and 5, some users could use the value 3 to indicate an interesting item, while others could use 3 for a not very interesting item. Similarly, popular items tend to be overrated, while unpopular items are usually underrated. Finally, explicit ratings can be affected by global attitudes (e.g., users are more willing to rate movies they like).

On the other hand, implicit ratings are inferred by the system on the basis of the user-system interaction, which might not match the user's opinion. For instance, the system is able to monitor whether a user has watched a live program on a certain channel or whether the user has uninterruptedly watched a movie. Although explicit ratings are more reliable than implicit ratings in representing the actual user interest in an item, their collection can be annoying from the user's perspective.

Long Tail

In statistics, a long tail of some distributions of numbers is the portion of the distribution having a large number of occurrences far from the head or central part of the distribution. The distribution could involve popularities, random numbers of occurrences of events with various probabilities, etc. A probability distribution is said to have a long tail if a larger share of the population rests within its tail than would under a normal distribution. A long-tail distribution will arise with the inclusion of many values unusually far from the mean, which increase the magnitude of the skewness of the distribution. A long-tailed distribution is a particular type of heavy-tailed distribution. The distribution and inventory costs of businesses successfully applying this strategy allow them to realize significant profit out of selling small volumes of hard-to-find items to many customers, instead of only selling large volumes of a reduced number of popular items. The total sales of this large number of non-hit items is called the long tail.

Personalization

Today, personalization is something that occurs separately within each system that one interacts with. Recommender systems are one technique for personalization: in essence, the personalization occurs slowly as each system builds up information about your likes and dislikes, about what interests you and what fails to interest you. There are numerous other personalization techniques; most of these rely either on the collection of system usage history, which is then employed to change the behavior of the system, or on the user taking the time and trouble to explicitly personalize the behavior of the system in various ways, by setting parameters, making selections, or engaging in dialogs with the system.

There are several problems with this model, at least from the user's point of view. Investments in personalizing one system (either through explicit action or just long use) are not transferable to another system. (Of course, from the system operator's point of view this may be very desirable: it increases switching costs for users and thus helps lock in a user base.) Information such as likes and dislikes or usage patterns is scattered across multiple systems and can't be combined to obtain maximum leverage. And the user does not have control of the information bases that define his or her profile. If you want to buy books from multiple online booksellers, this is annoying. But if we are concerned with developing information discovery systems to assist users in a world of information overload, these problems are critical. People obtain information from a multiplicity of sources, and personalization has to happen close to the end user: this is the only place where there is enough information to do personalization effectively, to keep track of what's new and what isn't, what has and has not proven useful. The user needs to become a hub and a switch, moving data to allow accurate personalization from one system to another.

Levels of measurement

What a scale actually means and what we can do with it depends on what its numbers represent. Numbers can be grouped into 4 types or levels: nominal, ordinal, interval, and ratio. Nominal is the most simple and ratio the most sophisticated. Each level possesses the characteristics of the preceding level, plus an additional quality.

Nominal

Nominal is hardly measurement. It refers to quality more than quantity. A nominal level of measurement is simply a matter of distinguishing by name, e.g., 1 = male, 2 = female. Even though we are using the numbers 1 and 2, they do not denote quantity. The binary categories of 0 and 1 used for computers are a nominal level of measurement. They are categories or classifications. Nominal measurement is like using categorical levels of variables, described in the Doing Scientific Research section of the Introduction module.

Examples: MEAL PREFERENCE: Breakfast, Lunch, Dinner. RELIGIOUS PREFERENCE: 1 = Buddhist, 2 = Muslim, 3 = Christian, 4 = Jewish, 5 = Other. POLITICAL ORIENTATION: Republican, Democratic, Libertarian, Green.

Nominal time of day: categories; no additional information.

Ordinal

Ordinal refers to order in measurement. An ordinal scale indicates direction, in addition to providing nominal information. Low/Medium/High or Faster/Slower are examples of ordinal levels of measurement. Ranking an experience as a nine on a scale of 1 to 10 tells us that it was higher than an experience ranked as a six. Many psychological scales or inventories are at the ordinal level of measurement.

Examples: RANK: 1st place, 2nd place, ..., last place. LEVEL OF AGREEMENT: No, Maybe, Yes. POLITICAL ORIENTATION: Left, Center, Right.

Ordinal time of day: indicates direction or order of occurrence; spacing between values is uneven.

Interval

Interval scales provide information about order and also possess equal intervals. From the previous example, if we knew that the distance between 1 and 2 was the same as that between 7 and 8 on our 10-point rating scale, then we would have an interval scale. An example of an interval scale is temperature, measured on either a Fahrenheit or a Celsius scale. A degree represents the same underlying amount of heat, regardless of where it occurs on the scale. Measured in Fahrenheit units, the difference between a temperature of 46 and 42 is the same as the difference between 72 and 68. Equal-interval scales of measurement can be devised for opinions and attitudes. Constructing them involves an understanding of mathematical and statistical principles beyond those covered in this course. But it is important to understand the different levels of measurement when using and interpreting scales.

Examples: TIME OF DAY on a 12-hour clock. POLITICAL ORIENTATION: score on a standardized scale of political orientation. OTHER: scales constructed so as to possess equal intervals.

Interval time of day: equal intervals on an analog (12-hr) clock; the difference between 1 and 2 pm is the same as the difference between 11 and 12 am.

Ratio

In addition to possessing the qualities of nominal, ordinal, and interval scales, a ratio scale has an absolute zero (a point where none of the quality being measured exists). Using a ratio scale permits comparisons such as being twice as high or one-half as much. Reaction time (how long it takes to respond to a signal of some sort) uses a ratio scale of measurement: time. Although an individual's reaction time is always greater than zero, we conceptualize a zero point in time and can state that a response of 24 milliseconds is twice as fast as a response time of 48 milliseconds.

Examples: RULER: inches or centimeters. YEARS of work experience. INCOME: money earned last year. NUMBER of children. GPA: grade point average.

Ratio time of day: 24-hr time has an absolute 0 (midnight); 14 o'clock is twice as far from midnight as 7 o'clock.

Applications

The level of measurement for a particular variable is defined by the highest category that it achieves. For example, categorizing someone as extroverted (outgoing) or introverted (shy) is nominal. If we categorize people as 1 = shy, 2 = neither shy nor outgoing, 3 = outgoing, then we have an ordinal level of measurement. If we use a standardized measure of shyness (and there are such inventories), we would probably assume the shyness variable meets the standards of an interval level of measurement. As to whether or not we might have a ratio scale of shyness: although we might be able to measure zero shyness, it would be difficult to devise a scale where we would be comfortable talking about someone's being three times as shy as someone else.

Measurement at the interval or ratio level is desirable because we can use the more powerful statistical procedures available for means and standard deviations. To have this advantage, ordinal data are often treated as though they were interval; for example, subjective rating scales (1 = terrible, 2 = poor, 3 = fair, 4 = good, 5 = excellent). The scale probably does not meet the requirement of equal intervals: we don't know that the difference between 2 (poor) and 3 (fair) is the same as the difference between 4 (good) and 5 (excellent). In order to take advantage of more powerful statistical techniques, researchers often assume that the intervals are equal.

Data Preprocessing

Data have quality if they satisfy the requirements of the intended use. There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.

Imagine that you are a manager at AllElectronics and have been charged with analyzing the company's data with respect to your branch's sales. You immediately set out to perform this task. You carefully inspect the company's database and data warehouse, identifying and selecting the attributes or dimensions (e.g., item, price, and units sold) to be included in your analysis. Alas! You notice that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions. In other words, the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), inaccurate or noisy (containing errors, or values that deviate from the expected), and inconsistent (e.g., containing discrepancies in the department codes used to categorize items). Welcome to the real world!

This scenario illustrates three of the elements defining data quality: accuracy, completeness, and consistency. Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses. There are many possible reasons for inaccurate data (i.e., data having incorrect attribute values). The data collection instruments used may be faulty. There may have been human or computer errors occurring at data entry. Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (e.g., by choosing the default value "January 1" displayed for birthday). This is known as disguised missing data. Errors in data transmission can also occur. There may be technology limitations, such as limited buffer size for coordinating synchronized data transfer and consumption. Incorrect data may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g., date). Duplicate tuples also require data cleaning.

Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because they were not considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the data history or modifications may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.

Recall that data quality depends on the intended use of the data. Two different users may have very different assessments of the quality of a given database. For example, a marketing analyst may need to access the database mentioned before for a list of customer addresses. Some of the addresses are outdated or incorrect, yet overall 80% of the addresses are accurate. The marketing analyst considers this to be a large customer database for target marketing purposes and is pleased with the database's accuracy, although, as sales manager, you found the data inaccurate.

Timeliness also affects data quality. Suppose that you are overseeing the distribution of monthly sales bonuses to the top sales representatives at AllElectronics. Several sales representatives, however, fail to submit their sales records on time at the end of the month. There are also a number of corrections and adjustments that flow in after the month's end. For a period of time following each month, the data stored in the database are incomplete. However, once all of the data are received, it is correct. The fact that the month-end data are not updated in a timely fashion has a negative impact on the data quality.

Two other factors affecting data quality are believability and interpretability. Believability reflects how much the data are trusted by users, while interpretability reflects how easily the data are understood. Suppose that a database at one point had several errors, all of which have since been corrected. The past errors, however, had caused many problems for sales department users, and so they no longer trust the data. The data also use many accounting codes, which the sales department does not know how to interpret. Even though the database is now accurate, complete, consistent, and timely, sales department users may regard it as of low quality due to poor believability and interpretability.

Data Cleaning

Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. In this section you will study basic methods for data cleaning. Section 3.2.1 looks at ways of handling missing values. Section 3.2.2 explains data smoothing techniques. Section 3.2.3 discusses approaches to data cleaning as a process.

3.2.1 Missing Values

Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded value for several attributes, such as customer income. How can you go about filling in the missing values for this attribute? Let's look at the following methods:

1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably. By ignoring the tuple, we do not make use of the remaining attributes' values in the tuple. Such data could have been useful to the task at hand.

2. Fill in the missing value manually: In general, this approach is time consuming and may not be feasible given a large data set with many missing values.

3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown" or −∞. If missing values are replaced by, say, "Unknown", then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of "Unknown". Hence, although this method is simple, it is not foolproof.

4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value: Chapter 2 discussed measures of central tendency, which indicate the "middle" value of a data distribution. For normal (symmetric) data distributions, the mean can be used, while skewed data distributions should employ the median (Section 2.2). For example, suppose that the data distribution regarding the income of AllElectronics customers is symmetric and that the mean income is $56,000. Use this value to replace the missing value for income.

5. Use the attribute mean or median for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, we may replace the missing value with the mean income value for customers in the same credit risk category as that of the given tuple. If the data distribution for a given class is skewed, the median value is a better choice.

6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income. Decision trees and Bayesian inference are described in detail in Chapters 8 and 9, respectively, while regression is introduced in Section 3.4.5.
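
As a small illustration of strategies 4 and 5 (the table, column names, and values are invented), missing incomes can be filled with the overall mean or with the mean of the same credit-risk class using pandas:

```python
import pandas as pd

# Toy customer table with missing income values (invented data).
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [56000, None, 23000, None, 61000],
})

# Strategy 4: fill with the overall attribute mean.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Strategy 5: fill with the mean income of the same credit-risk class.
class_mean = df.groupby("credit_risk")["income"].transform("mean")
df["income_class_mean"] = df["income"].fillna(class_mean)

print(df)
```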

Noisy Data

"What is noise?" Noise is a random error or variance in a measured variable. In Chapter 2, we saw how some basic statistical description techniques (e.g., boxplots and scatter plots) and methods of data visualization can be used to identify outliers, which may represent noise. Given a numeric attribute such as, say, price, how can we "smooth" out the data to remove the noise? Let's look at the following data smoothing techniques.

Binning: Binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values around it. The sorted values are distributed into a number of "buckets", or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. Figure 3.2 illustrates some binning techniques. In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9.

Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value. In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equal width, where the interval range of values in each bin is constant. Binning is also used as a discretization technique and is further discussed in Section 3.5.
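
To make these smoothing variants concrete, here is a short sketch of equal-frequency binning with smoothing by bin means and by bin boundaries; the price list extends the values 4, 8, and 15 mentioned above with additional illustrative numbers:

```python
def equal_frequency_bins(sorted_values, size):
    """Partition sorted data into bins of `size` values each (equal-frequency binning)."""
    return [sorted_values[i:i + size] for i in range(0, len(sorted_values), size)]

def smooth_by_means(bins):
    # Replace every value in a bin by the bin mean.
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace every value by the closer of the bin's min/max boundary.
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

# Sorted prices: 4, 8, 15 appear in the text; the rest are illustrative.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)                      # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))     # the first bin becomes [9.0, 9.0, 9.0]
print(smooth_by_boundaries(bins))
```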

Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function. Linear regression involves finding the "best" line to fit two attributes (or variables), so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface. Regression is further described in Section 3.4.5.

Outlier analysis: Outliers may be detected by clustering, for example, where similar values are organized into groups, or "clusters". Intuitively, values that fall outside of the set of clusters may be considered outliers (Figure 3.3). Chapter 12 is dedicated to the topic of outlier analysis.

Data Integration

Data mining often requires data integration, the merging of data from multiple data stores. Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help improve the accuracy and speed of the subsequent data mining process.

Data Transformation Strategies Overview

In data transformation, the data are transformed or consolidated into forms appropriate for mining.

Strategies for data transformation include the following:

1. Smoothing, which works to remove noise from the data. Techniques include binning, regression, and clustering.

2. Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.

3 Aggregation where summary or aggregation operations are applied to the data For example the daily sales data may be aggregated so as to compute monthly and annual total amounts This step is typically used in constructing a data cube for data analysis at multiple abstraction levels

4 Normalization where the attribute data are scaled so as to fall within a smaller range such as 104857610 to 10 or 00 to 10

5 Discretization where the raw values of a numeric attribute (eg age) are replaced by interval labels (eg 0ndash10 11ndash20 etc) or conceptual labels (eg youth adult senior) The labels in turn can be recursively organized into higher-level concepts resulting in a concept hierarchy for the numeric attribute Figure 312 shows a concept hierarchy for the attribute price More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users

6 Concept hierarchy generation for nominal data where attributes such as street can be generalized to higher-level concepts like city or country Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level

Data Transformation by Normalization

The measurement unit used can affect the data analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results. In general, expressing an attribute in smaller units will lead to a larger range for that attribute, and thus tend to give such an attribute greater effect or "weight." To help avoid dependence on the choice of measurement units, the data should be normalized or standardized. This involves transforming the data to fall within a smaller or common range, such as [-1, 1] or [0.0, 1.0]. (The terms standardize and normalize are used interchangeably in data preprocessing, although in statistics the latter term also has other connotations.)

Normalizing the data attempts to give all attributes an equal weight. Normalization is particularly useful for classification algorithms involving neural networks, or distance measurements such as nearest-neighbor classification and clustering. If using the neural network backpropagation algorithm for classification mining (Chapter 9), normalizing the input values for each attribute measured in the training tuples will help speed up the learning phase. For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes). It is also useful when no prior knowledge of the data is available.

There are many methods for data normalization. We study min-max normalization, z-score normalization, and normalization by decimal scaling. For our discussion, let A be a numeric attribute with n observed values v1, v2, ..., vn.

Min-max normalization performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of an attribute A. Min-max normalization maps a value vi of A to v'i in the range [new_minA, new_maxA] by computing:
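That is, the standard min-max mapping is

\[
v'_i = \frac{v_i - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A .
\]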

Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for A.

In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean (i.e., average) and standard deviation of A. A value vi of A is normalized to v'i by computing:
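In symbols,

\[
v'_i = \frac{v_i - \bar{A}}{\sigma_A},
\]

where \(\bar{A}\) and \(\sigma_A\) are the mean and standard deviation of A.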

Decimal scaling: Suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
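The decimal-scaling transformation is

\[
v'_i = \frac{v_i}{10^{\,j}},
\]

where j is the smallest integer such that \(\max_i |v'_i| < 1\).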

Note that normalization can change the original data quite a bit, especially when using z-score normalization or decimal scaling. It is also necessary to save the normalization parameters (e.g., the mean and standard deviation if using z-score normalization) so that future data can be normalized in a uniform manner.
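A minimal NumPy sketch of the three normalization methods (the sample values and the target range [0, 1] are illustrative assumptions):

```python
import numpy as np

values = np.array([-986.0, 200.0, 917.0])

# Min-max normalization into [0.0, 1.0].
min_a, max_a = values.min(), values.max()
minmax = (values - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0

# Z-score normalization: subtract the mean, divide by the standard deviation.
zscore = (values - values.mean()) / values.std()

# Decimal scaling: divide by 10^j, with j the smallest integer such that
# every scaled absolute value falls below 1 (here j = 3).
j = int(np.floor(np.log10(np.abs(values).max()))) + 1
decimal = values / (10 ** j)

# Keep the parameters (min_a, max_a, mean, std, j) so future data can be
# normalized in exactly the same way.
```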

Discretization by Binning

Binning is a top-down splitting technique based on a specified number of bins. Section 3.2.2 discussed binning methods for data smoothing. These methods are also used as discretization methods for data reduction and concept hierarchy generation. For example, attribute values can be discretized by applying equal-width or equal-frequency binning, and then replacing each bin value by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively. These techniques can be applied recursively to the resulting partitions to generate concept hierarchies. Binning does not use class information and is therefore an unsupervised discretization technique. It is sensitive to the user-specified number of bins, as well as to the presence of outliers.
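A short pandas sketch contrasting equal-width and equal-frequency discretization (the age values, bin count, and labels are illustrative assumptions):

```python
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70])

# Equal-width binning: each interval spans the same range of values.
equal_width = pd.cut(ages, bins=3, labels=["low", "mid", "high"])

# Equal-frequency binning: each bin holds (roughly) the same number of values.
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))
```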

Distance/Similarity Measures

Similarity: a measure of how close to each other two instances are. The "closer" the instances are to each other, the larger the similarity value.

Dissimilarity: a measure of how different two instances are. Dissimilarity is large when instances are very different, and small when they are close.

Proximity refers to either similarity or dissimilarity.

Distance metric: a measure of dissimilarity that obeys the laws of a metric, namely non-negativity (d(x, y) >= 0), identity (d(x, y) = 0 if and only if x = y), symmetry (d(x, y) = d(y, x)), and the triangle inequality (d(x, z) <= d(x, y) + d(y, z)).

Conversion of similarity and dissimilarity measures

Typically, given a similarity measure, one can convert it into a dissimilarity measure, and vice versa.

Conversions may differ. For example, if d is a distance measure, one can transform it into a corresponding similarity measure, and if s is a similarity measure that ranges between 0 and 1 (a so-called degree of similarity), then a corresponding dissimilarity measure can be defined from it in turn. Common conversions are shown below.
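Typical choices (these particular formulas are common conventions, not necessarily the ones intended in the original notes):

\[
s = \frac{1}{1 + d} \quad\text{or}\quad s = e^{-d}, \qquad\qquad d = 1 - s .
\]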

In general, any monotonically decreasing transformation can be applied to convert similarity measures into dissimilarity measures, and any monotonically increasing transformation can be applied to convert the measures the other way around.

Distance Metrics for Numeric Attributes

When the data set is presented in standard form, each instance can be treated as a vector x = (x1, ..., xN) of measures for attributes numbered 1, ..., N.

For now, consider only non-nominal scales.
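For such numeric vectors, the most common distance metrics are instances of the Minkowski distance (a standard example; the original notes may list others):

\[
d_p(x, y) = \Big( \sum_{k=1}^{N} |x_k - y_k|^{p} \Big)^{1/p},
\]

with p = 2 giving the Euclidean distance and p = 1 the Manhattan (city-block) distance.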

Computing User Similarity

A critical design decision in implementing user-user CF is the choice of similarity function. Several different similarity functions have been proposed and evaluated in the literature. Pearson correlation: this method computes the statistical correlation (Pearson's r) between two users' common ratings to determine their similarity. GroupLens and BellCore both used this method. The correlation is computed as follows:
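In its standard form (with I_uv denoting the set of items rated by both users u and v, a notational assumption here):

\[
\mathrm{sim}(u, v) = \frac{\sum_{i \in I_{uv}} (r_{u,i} - \bar{r}_u)\,(r_{v,i} - \bar{r}_v)}
{\sqrt{\sum_{i \in I_{uv}} (r_{u,i} - \bar{r}_u)^2}\;\sqrt{\sum_{i \in I_{uv}} (r_{v,i} - \bar{r}_v)^2}},
\]

where \(\bar{r}_u\) is user u's mean rating over those items.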

(External reference: Collaborative Filtering Recommender Systems.)

Clustering

External references for this topic: Applied Multivariate Statistical Analysis (ebook), Model-Based Clustering, Expectation Maximization, UV-Decomposition, and the SVD method.


3. At institutions of higher education around the world, undergraduate and graduate courses are now dedicated entirely to RSs; tutorials on RSs are very popular at computer science conferences; and recently a book introducing RS techniques was published.

4. There have been several special issues in academic journals covering research and developments in the RS field. Among the journals that have dedicated issues to RS are: AI Communications (2008); IEEE Intelligent Systems (2007); International Journal of Electronic Commerce (2006); International Journal of Computer Science and Applications (2006); ACM Transactions on Computer-Human Interaction (2005); and ACM Transactions on Information Systems (2004).

In general, we can say that, from the service provider's point of view, the primary goal for introducing a RS is to increase the conversion rate, i.e., the number of users that accept the recommendation and consume an item, compared to the number of simple visitors that just browse through the information.

• Sell more diverse items. Another major function of a RS is to enable the user to select items that might be hard to find without a precise recommendation. For instance, in a movie RS such as Netflix, the service provider is interested in renting all the DVDs in the catalogue, not just the most popular ones. This could be difficult without a RS, since the service provider cannot afford the risk of advertising movies that are not likely to suit a particular user's taste. Therefore, a RS suggests or advertises unpopular movies to the right users.

• Increase user satisfaction. A well-designed RS can also improve the experience of the user with the site or the application. The user will find the recommendations interesting and relevant and, with a properly designed human-computer interaction, she will also enjoy using the system. The combination of effective, i.e., accurate, recommendations and a usable interface will increase the user's subjective evaluation of the system. This in turn will increase system usage and the likelihood that the recommendations will be accepted.

• Increase user fidelity. A user should be loyal to a Web site which, when visited, recognizes the old customer and treats him as a valuable visitor. This is a normal feature of a RS, since many RSs compute recommendations by leveraging the information acquired from the user in previous interactions, e.g., her ratings of items. Consequently, the longer the user interacts with the site, the more refined her user model becomes, i.e., the system representation of the user's preferences, and the more the recommender output can be effectively customized to match the user's preferences.

• Better understand what the user wants. Another important function of a RS, which can be leveraged in many other applications, is the description of the user's preferences, either collected explicitly or predicted by the system. The service provider may then decide to re-use this knowledge for a number of other goals, such as improving the management of the item's stock or production. For instance, in the travel domain, destination management organizations can decide to advertise a specific region to new customer sectors, or to advertise a particular type of promotional message, derived by analyzing the data collected by the RS (the transactions of the users).

We mentioned above some important motivations as to why e-service providers introduce RSs. But users also may want a RS, if it effectively supports their tasks or goals. Consequently, a RS must balance the needs of these two players and offer a service that is valuable to both.

Herlocker et al., in a paper that has become a classical reference in this field, define eleven popular tasks that a RS can assist in implementing. Some may be considered as the main or core tasks that are normally associated with a RS, i.e., to offer suggestions for items that may be useful to a user. Others might be considered as more "opportunistic" ways to exploit a RS. As a matter of fact, this task differentiation is very similar to what happens with a search engine. Its primary function is to locate documents that are relevant to the user's information need, but it can also be used to check the importance of a Web page (looking at the position of the page in the result list of a query) or to discover the various usages of a word in a collection of documents.

• Find some good items. Recommend to a user some items as a ranked list, along with predictions of how much the user would like them (e.g., on a one- to five-star scale). This is the main recommendation task that many commercial systems address (see, for instance, Chapter 9). Some systems do not show the predicted rating.

• Find all good items. Recommend all the items that can satisfy some user needs. In such cases it is insufficient to just find some good items. This is especially true when the number of items is relatively small, or when the RS is mission-critical, such as in medical or financial applications. In these situations, in addition to the benefit derived from carefully examining all the possibilities, the user may also benefit from the RS ranking of these items or from the additional explanations that the RS generates.

• Annotation in context. Given an existing context, e.g., a list of items, emphasize some of them depending on the user's long-term preferences. For example, a TV recommender system might annotate which TV shows displayed in the electronic program guide (EPG) are worth watching (Chapter 18 provides interesting examples of this task).

• Recommend a sequence. Instead of focusing on the generation of a single recommendation, the idea is to recommend a sequence of items that is pleasing as a whole. Typical examples include recommending a TV series, a book on RSs after having recommended a book on data mining, or a compilation of musical tracks.

• Recommend a bundle. Suggest a group of items that fits well together. For instance, a travel plan may be composed of various attractions, destinations, and accommodation services that are located in a delimited area. From the point of view of the user, these various alternatives can be considered and selected as a single travel destination.

• Just browsing. In this task, the user browses the catalog without any imminent intention of purchasing an item. The task of the recommender is to help the user browse the items that are more likely to fall within the scope of the user's interests for that specific browsing session. This is a task that has also been supported by adaptive hypermedia techniques.

• Find credible recommender. Some users do not trust recommender systems, and thus they play with them to see how good they are at making recommendations. Hence, some systems may also offer specific functions that let users test their behavior, in addition to those required just for obtaining recommendations.

• Improve the profile. This relates to the capability of the user to provide (input) information to the recommender system about what he likes and dislikes. This is a fundamental task that is strictly necessary to provide personalized recommendations. If the system has no specific knowledge about the active user, then it can only provide him with the same recommendations that would be delivered to an "average" user.

• Express self. Some users may not care about the recommendations at all. Rather, what is important to them is that they be allowed to contribute their ratings and express their opinions and beliefs. The user's satisfaction with that activity can still act as a lever for holding the user tightly to the application (as we mentioned above in discussing the service provider's motivations).

• Help others. Some users are happy to contribute information, e.g., their evaluation of items (ratings), because they believe that the community benefits from their contribution. This could be a major motivation for entering information into a recommender system that is not used routinely. For instance, with a car RS, a user who has already bought her new car is aware that the rating entered in the system is more likely to be useful to other users than to herself the next time she buys a car.

• Influence others. In Web-based RSs, there are users whose main goal is to explicitly influence other users into purchasing particular products. As a matter of fact, there are also some malicious users that may use the system just to promote or penalize certain items (see Chapter 25).

In any case, as a general classification, the data used by RSs refer to three kinds of objects: items, users, and transactions, i.e., relations between users and items.

Items. Items are the objects that are recommended. Items may be characterized by their complexity and their value or utility. The value of an item may be positive if the item is useful for the user, or negative if the item is not appropriate and the user made a wrong decision when selecting it. We note that when a user is acquiring an item, she will always incur a cost, which includes the cognitive cost of searching for the item and the real monetary cost eventually paid for the item. For instance, the designer of a news RS must take into account the complexity of a news item, i.e., its structure, the textual representation, and the time-dependent importance of any news item. But at the same time, the RS designer must understand that even if the user is not paying for reading news, there is always a cognitive cost associated with searching for and reading news items. If a selected item is relevant for the user, this cost is dominated by the benefit of having acquired useful information, whereas if the item is not relevant, the net value of that item for the user, and of its recommendation, is negative. In other domains, e.g., cars or financial investments, the true monetary cost of the items becomes an important element to consider when selecting the most appropriate recommendation approach. Items with low complexity and value are news, Web pages, books, CDs, and movies. Items with larger complexity and value are digital cameras, mobile phones, PCs, etc. The most complex items that have been considered are insurance policies, financial investments, travels, and jobs. RSs, according to their core technology, can use a range of properties and features of the items. For example, in a movie recommender system, the genre (such as comedy, thriller, etc.), as well as the director and actors, can be used to describe a movie and to learn how the utility of an item depends on its features. Items can be represented using various information and representation approaches, e.g., in a minimalist way as a single id code, in a richer form as a set of attributes, or even as a concept in an ontological representation of the domain.

Users. Users of a RS, as mentioned above, may have very diverse goals and characteristics. In order to personalize the recommendations and the human-computer interaction, RSs exploit a range of information about the users. This information can be structured in various ways, and again the selection of what information to model depends on the recommendation technique. For instance, in collaborative filtering, users are modeled as a simple list containing the ratings provided by the user for some items. In a demographic RS, sociodemographic attributes such as age, gender, profession, and education are used. User data is said to constitute the user model. The user model profiles the user, i.e., encodes her preferences and needs. Various user modeling approaches have been used, and in a certain sense a RS can be viewed as a tool that generates recommendations by building and exploiting user models. Since no personalization is possible without a convenient user model (unless the recommendation is non-personalized, as in the top-10 selection), the user model will always play a central role. For instance, considering again a collaborative filtering approach, the user is either profiled directly by her ratings of items or, using these ratings, the system derives a vector of factor values, where users differ in how much each factor weighs in their model. Users can also be described by their behavior pattern data, for example, site browsing patterns (in a Web-based recommender system) or travel search patterns (in a travel recommender system). Moreover, user data may include relations between users, such as the trust level of these relations. A RS might utilize this information to recommend to users items that were preferred by similar or trusted users.

Transactions. We generically refer to a transaction as a recorded interaction between a user and the RS. Transactions are log-like data that store important information generated during the human-computer interaction, which is useful for the recommendation generation algorithm that the system is using. For instance, a transaction log may contain a reference to the item selected by the user and a description of the context (e.g., the user goal/query) for that particular recommendation. If available, a transaction may also include explicit feedback the user has provided, such as the rating for the selected item.

In fact, ratings are the most popular form of transaction data that a RS collects. These ratings may be collected explicitly or implicitly. In the explicit collection of ratings, the user is asked to provide her opinion about an item on a rating scale. Ratings can take on a variety of forms:

• Numerical ratings, such as the 1-5 stars provided in the book recommender associated with Amazon.com.

• Ordinal ratings, such as "strongly agree, agree, neutral, disagree, strongly disagree," where the user is asked to select the term that best indicates her opinion regarding an item (usually via a questionnaire).

• Binary ratings, which model choices in which the user is simply asked to decide whether a certain item is good or bad.

• Unary ratings, which can indicate that a user has observed or purchased an item, or otherwise rated the item positively. In such cases, the absence of a rating indicates that we have no information relating the user to the item (perhaps she purchased the item somewhere else).

Another form of user evaluation consists of tags associated by the user with the items the system presents. For instance, in the MovieLens RS (http://movielens.umn.edu), tags represent how MovieLens users feel about a movie, e.g., "too long" or "acting." In transactions collecting implicit ratings, the system aims to infer the user's opinion based on the user's actions. For example, if a user enters the keyword "Yoga" at Amazon.com, she will be provided with a long list of books. In return, the user may click on a certain book on the list in order to receive additional information. At this point, the system may infer that the user is somewhat interested in that book.

In conversational systems, i.e., systems that support an interactive process, the transaction model is more refined. In these systems, user requests alternate with system actions. That is, the user may request a recommendation and the system may produce a suggestion list, but it can also request additional user preferences to provide the user with better results. Here, in the transaction model, the system collects the various requests and responses, and may eventually learn to modify its interaction strategy by observing the outcome of the recommendation process.

Recommendation Techniques

In order to implement its core function, identifying the useful items for the user, a RS must predict that an item is worth recommending. To do this, the system must be able to predict the utility of some of these items, or at least compare the utility of some items, and then decide what items to recommend based on this comparison. The prediction step may not be explicit in the recommendation algorithm, but we can still apply this unifying model to describe the general role of a RS.

To illustrate the prediction step of a RS, consider, for instance, a simple, non-personalized recommendation algorithm that recommends just the most popular songs. The rationale for using this approach is that, in the absence of more precise information about the user's preferences, a popular song, i.e., something that is liked (has high utility) by many users, will probably also be liked by a generic user, at least more than another randomly selected song. Hence, the utility of these popular songs is predicted to be reasonably high for this generic user.

This view of the core recommendation computation as the prediction of the utility of an item for a user has been suggested in the literature. The degree of utility of the user u for the item i is modeled as a (real-valued) function R(u, i), as is normally done in collaborative filtering by considering the ratings of users for items. Then the fundamental task of a collaborative filtering RS is to predict the value of R over pairs of users and items, i.e., to compute R^(u, i), where we denote by R^ the estimation, computed by the RS, of the true function R. Consequently, having computed this prediction for the active user u on a set of items, i.e., R^(u, i1), ..., R^(u, iN), the system will recommend the items ij1, ..., ijK (K <= N) with the largest predicted utility. K is typically a small number, i.e., much smaller than the cardinality of the item data set or of the items for which a user utility prediction can be computed: RSs "filter" the items that are recommended to users.
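A minimal sketch of this "predict, then filter" step (the prediction function below is a stand-in; any collaborative or content-based estimator of R^(u, i) could be plugged in):

```python
import heapq

def recommend_top_k(user, candidate_items, predict_rating, k=5):
    """Rank candidate items by predicted utility R^(u, i) and keep only the top K."""
    scored = ((predict_rating(user, item), item) for item in candidate_items)
    return heapq.nlargest(k, scored)  # list of (predicted utility, item) pairs

# Toy example: popularity counts standing in for the predicted utility.
popularity = {"song_a": 120, "song_b": 340, "song_c": 45, "song_d": 210}
print(recommend_top_k("any_user", popularity, lambda u, i: popularity[i], k=2))
```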

As mentioned above, some recommender systems do not fully estimate the utility before making a recommendation, but may apply heuristics to hypothesize that an item is of use to a user. This is typical, for instance, in knowledge-based systems. These utility predictions are computed with specific algorithms (see below) and use various kinds of knowledge about users, items, and the utility function itself. For instance, the system may assume that the utility function is Boolean, and therefore it will just determine whether an item is or is not useful for the user. Consequently, assuming that there is some available knowledge (possibly none) about the user who is requesting the recommendation, as well as knowledge about items and about other users who received recommendations, the system will leverage this knowledge with an appropriate algorithm to generate various utility predictions, and hence recommendations.

To provide a first overview of the different types of RSs, we quote a taxonomy that has become a classical way of distinguishing between recommender systems and referring to them. [25] distinguishes between six different classes of recommendation approaches:

Content-based: The system learns to recommend items that are similar to the ones that the user liked in the past. The similarity of items is calculated based on the features associated with the compared items. For example, if a user has positively rated a movie that belongs to the comedy genre, then the system can learn to recommend other movies from this genre. Chapter 3 provides an overview of content-based recommender systems, imposing some order among the extensive and diverse aspects involved in their design and implementation. It presents the basic concepts and terminology of content-based RSs, their high-level architecture, and their main advantages and drawbacks. The chapter then surveys state-of-the-art systems that have been adopted in several application domains. The survey encompasses a thorough description of both classical and advanced techniques for representing items and user profiles. Finally, it discusses trends and future research which might lead towards the next generation of recommender systems.

Collaborative filtering: The simplest and original implementation of this approach [93] recommends to the active user the items that other users with similar tastes liked in the past. The similarity in taste of two users is calculated based on the similarity in the rating history of the users. This is the reason why [94] refers to collaborative filtering as "people-to-people correlation." Collaborative filtering is considered to be the most popular and widely implemented technique in RS.

A comprehensive survey of neighborhood-based methods for collaborative filtering is also available. Neighborhood methods focus on relationships between items or, alternatively, between users. An item-item approach models the preference of a user for an item based on the ratings of similar items by the same user. Nearest-neighbor methods enjoy considerable popularity due to their simplicity, their efficiency, and their ability to produce accurate and personalized recommendations. The authors address the essential decisions that are required when implementing a neighborhood-based recommender system and provide practical information on how to make such decisions.

Finally, the chapter deals with the problems of data sparsity and limited coverage, often observed in large commercial recommender systems, and presents a few solutions to overcome them.
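As a concrete example of the user-based neighborhood approach (a standard formulation; the neighborhood N_k(u) and the notation are assumptions of this sketch, not taken from the survey itself), the rating of user u for item i can be predicted as

\[
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in N_k(u)} \mathrm{sim}(u, v)\,(r_{v,i} - \bar{r}_v)}{\sum_{v \in N_k(u)} |\mathrm{sim}(u, v)|},
\]

where N_k(u) is the set of the k users most similar to u who rated item i, and sim(u, v) is, for example, the Pearson correlation introduced earlier.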

Demographic: This type of system recommends items based on the demographic profile of the user. The assumption is that different recommendations should be generated for different demographic niches. Many Web sites adopt simple and effective personalization solutions based on demographics. For example, users are dispatched to particular Web sites based on their language or country, or suggestions may be customized according to the age of the user. While these approaches have been quite popular in the marketing literature, there has been relatively little proper RS research into demographic systems.

Knowledge-based: Knowledge-based systems recommend items based on specific domain knowledge about how certain item features meet user needs and preferences, and ultimately how the item is useful for the user. Notable knowledge-based recommender systems are case-based. In these systems, a similarity function estimates how well the user needs (the problem description) match the recommendations (the solutions of the problem). Here the similarity score can be directly interpreted as the utility of the recommendation for the user. Constraint-based systems are another type of knowledge-based RS.

In terms of the knowledge used, both kinds of system are similar: user requirements are collected; repairs for inconsistent requirements are automatically proposed in situations where no solution can be found; and the recommendation results are explained. The major difference lies in the way solutions are calculated: case-based recommenders determine recommendations on the basis of similarity metrics, whereas constraint-based recommenders predominantly exploit predefined knowledge bases that contain explicit rules about how to relate customer requirements with item features.

Knowledge-based systems tend to work better than others at the beginning of their deployment, but if they are not equipped with learning components, they may be surpassed by other, shallower methods that can exploit the logs of the human-computer interaction (as in CF).

Community-based: This type of system recommends items based on the preferences of the user's friends. This technique follows the epigram, "Tell me who your friends are, and I will tell you who you are." Evidence suggests that people tend to rely more on recommendations from their friends than on recommendations from similar but anonymous individuals. This observation, combined with the growing popularity of open social networks, is generating a rising interest in community-based systems or, as they are usually referred to, social recommender systems. This type of RS models and acquires information about the social relations of the users and the preferences of the user's friends. The recommendation is based on ratings that were provided by the user's friends. In fact, these RSs are following the rise of social networks and enable a simple and comprehensive acquisition of data related to the social relations of the users.

Hybrid recommender systems: These RSs are based on the combination of the above-mentioned techniques. A hybrid system combining techniques A and B tries to use the advantages of A to fix the disadvantages of B.

For instance, CF methods suffer from the new-item problem, i.e., they cannot recommend items that have no ratings. This does not limit content-based approaches, since the prediction for new items is based on their description (features), which is typically easily available. Given two (or more) basic RS techniques, several ways have been proposed for combining them to create a new hybrid system. As we have already mentioned, the context of the user when she is seeking a recommendation can be used to better personalize the output of the system. For example, in a temporal context, vacation recommendations in winter should be very different from those provided in summer; or a restaurant recommendation for a Saturday evening with your friends should be different from that suggested for a workday lunch with co-workers.

Application and Evaluation

Recommender system research is being conducted with a strong emphasis on practice and commercial applications, since, aside from its theoretical contribution, it is generally aimed at practically improving commercial RSs. Thus, RS research involves practical aspects that apply to the implementation of these systems. These aspects are relevant to different stages in the life cycle of a RS, namely, the design of the system, its implementation, and its maintenance and enhancement during system operation.

The aspects that apply to the design stage include factors that might affect the choice of the algorithm. The first factor to consider, the application's domain, has a major effect on the algorithmic approach that should be taken. [72] provide a taxonomy of RSs and classify existing RS applications into specific application domains. Based on these specific application domains, we define more general classes of domains for the most common recommender system applications:

• Entertainment: recommendations for movies, music, and IPTV.

• Content: personalized newspapers, recommendation of documents, recommendations of Web pages, e-learning applications, and e-mail filters.

• E-commerce: recommendations for consumers of products to buy, such as books, cameras, PCs, etc.

• Services: recommendations of travel services, recommendation of experts for consultation, recommendation of houses to rent, or matchmaking services.

Data Collection

The logical component in charge of pre-processing the data and generating the input of the recommender algorithm is referred to as the data collector. The data collector gathers data from different sources, such as the EPG for information about the live programs, the content provider for information about the VOD catalog, and the service provider for information about the users.

The Fastweb recommender system does not rely on personal information from the users (e.g., age, gender, occupation). Recommendations are based on the users' past behavior (what they watched) and on any explicit preferences they have expressed (e.g., preferred genres). If the users did not specify any explicit preferences, the system is able to infer them by analyzing the users' past activities.

An important question has been raised in Section 9.2: users interact with the IPTV system by means of the STB, but typically we cannot identify who is actually in front of the TV. Consequently, the STB collects the behavior and the preferences of a set of users (e.g., the members of a family). This represents a considerable problem, since we are limited to generating per-STB recommendations. In order to simplify the notation, in the rest of the paper we will refer to user and STB as the same entity. The user-disambiguation problem has been partially solved by separating the collected information according to the time slot it refers to. For instance, we can roughly assume the following pattern: housewives tend to watch TV during the morning, children during the afternoon, and the whole family in the evening, while only adults watch TV during the night. By means of this simple time-slot distinction, we are able to distinguish among the different potential users of the same STB.
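A minimal sketch of such a per-STB time-slot split (the slot boundaries below are illustrative assumptions, not Fastweb's actual configuration):

```python
def time_slot(hour: int) -> str:
    """Map an hour of day (0-23) to a coarse viewing slot used to split per-STB logs."""
    if 6 <= hour < 12:
        return "morning"    # e.g., daytime viewers
    if 12 <= hour < 18:
        return "afternoon"  # e.g., children
    if 18 <= hour < 23:
        return "evening"    # e.g., the whole family
    return "night"          # e.g., adults only

# Each (STB, slot) pair is then treated as a separate "user" when building the rating matrix.
print(time_slot(21))  # evening
```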

Formally, the available information has been structured into two main matrices, practically stored in a relational database: the item-content matrix (ICM) and the user-rating matrix (URM).

The former describes the principal characteristics (metadata) of each item. In the following, we will refer to the item-content matrix as W, whose elements w_ci represent the relevance of characteristic (metadata) c for item i. The ICM is generated from the analysis of the information given by the content provider (i.e., the EPG). Such information concerns, for instance, the title of a movie, the actors, the director(s), the genre(s), and the plot. Note that in a real environment we can be faced with inaccurate information, especially because of the rate at which new content is added every day. The information provided by the ICM is used to generate content-based recommendations, after being filtered by means of techniques for PoS (part-of-speech) tagging, stop-word removal, and latent semantic analysis. Moreover, the ICM can be used to perform some kinds of processing on the items (e.g., parental control).

The URM represents the ratings (i.e., preferences) of users about items. In the following, we will refer to this matrix as R, whose elements r_pi represent the rating of user p for item i. Such preferences constitute the basic information for any collaborative algorithm. The user rating can be either explicit or implicit, according to whether the ratings are explicitly expressed by users or are implicitly collected by the system, respectively.
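A toy sketch of the two matrices and one simple way they can be combined (the shapes, values, and the profile-building step are illustrative assumptions, not the Fastweb algorithm):

```python
import numpy as np

# ICM W: rows = characteristics (metadata), columns = items; W[c, i] = relevance of c for item i.
W = np.array([[1, 0, 1, 0],    # e.g., genre "comedy"
              [0, 1, 0, 1],    # e.g., genre "thriller"
              [1, 1, 0, 0]])   # e.g., actor "X"

# URM R: rows = users (STB/time-slot pairs), columns = items; R[p, i] = rating, 0 = unknown.
R = np.array([[5, 0, 4, 0],
              [0, 3, 0, 4]])

# A simple content profile per user: accumulate the metadata of the items she rated,
# then score items by how well they match the profile.
profiles = R @ W.T              # users x characteristics
content_scores = profiles @ W   # users x items
print(content_scores)
```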

Explicit ratings confidently represent the user's opinion, even though they can be affected by biases due to user subjectivity, item popularity, or global rating tendencies. The first bias depends on arbitrary interpretations of the rating scale. For instance, on a rating scale between 1 and 5, some users could use the value 3 to indicate an interesting item, while others could use 3 for a not very interesting item. Similarly, popular items tend to be overrated, while unpopular items are usually underrated. Finally, explicit ratings can be affected by global attitudes (e.g., users are more willing to rate movies they like).

On the other hand, implicit ratings are inferred by the system on the basis of the user-system interaction, which might not match the user's opinion. For instance, the system is able to monitor whether a user has watched a live program on a certain channel, or whether the user has watched a movie without interruption. Although explicit ratings are more reliable than implicit ratings in representing the actual user interest in an item, their collection can be annoying from the user's perspective.

Long Tail

In statistics, a long tail of some distribution of numbers is the portion of the distribution having a large number of occurrences far from the head, or central part, of the distribution. The distribution could involve popularities, random numbers of occurrences of events with various probabilities, etc. A probability distribution is said to have a long tail if a larger share of the population rests within its tail than would under a normal distribution. A long-tail distribution will arise with the inclusion of many values unusually far from the mean, which increase the magnitude of the skewness of the distribution. A long-tailed distribution is a particular type of heavy-tailed distribution. The distribution and inventory costs of businesses successfully applying this strategy allow them to realize significant profit out of selling small volumes of hard-to-find items to many customers, instead of only selling large volumes of a reduced number of popular items. The total sales of this large number of non-hit items is called the long tail.

Personalization

Today, personalization is something that occurs separately within each system that one interacts with. Recommender systems are one technique for personalization; in essence, the personalization occurs slowly as each system builds up information about your likes and dislikes, about what interests you and what fails to interest you. There are numerous other personalization techniques; most of these rely either on the collection of system usage history, which is then employed to change the behavior of the system, or on the user taking the time and trouble to explicitly personalize the behavior of the system in various ways, by setting parameters, making selections, or engaging in dialogs with the system.

There are several problems with this model, at least from the user's point of view. Investments in personalizing one system (either through explicit action or just long use) are not transferable to another system. (Of course, from the system operator's point of view this may be very desirable: it increases switching costs for users and thus helps lock in a user base.) Information such as likes and dislikes or usage patterns is scattered across multiple systems and can't be combined to obtain maximum leverage. And the user does not have control of the information bases that define his or her profile. If you want to buy books from multiple online booksellers, this is annoying. But if we are concerned with developing information discovery systems to assist users in a world of information overload, these problems are critical. People obtain information from a multiplicity of sources, and personalization has to happen close to the end user: this is the only place where there is enough information to do personalization effectively, to keep track of what's new and what isn't, what has and has not proven useful. The user needs to become a hub and a switch, moving data to allow accurate personalization from one system to another.

Levels of measurement

What a scale actually means, and what we can do with it, depends on what its numbers represent. Numbers can be grouped into four types, or levels: nominal, ordinal, interval, and ratio. Nominal is the simplest and ratio the most sophisticated. Each level possesses the characteristics of the preceding level, plus an additional quality.

Nominal

Nominal is hardly measurement. It refers to quality more than quantity. A nominal level of measurement is simply a matter of distinguishing by name, e.g., 1 = male, 2 = female. Even though we are using the numbers 1 and 2, they do not denote quantity. The binary categories of 0 and 1 used for computers are a nominal level of measurement. They are categories or classifications. Nominal measurement is like using the categorical levels of variables described in the Doing Scientific Research section of the Introduction module.

Examples: MEAL PREFERENCE: breakfast, lunch, dinner. RELIGIOUS PREFERENCE: 1 = Buddhist, 2 = Muslim, 3 = Christian, 4 = Jewish, 5 = Other. POLITICAL ORIENTATION: Republican, Democratic, Libertarian, Green.

Nominal time of day: categories only, with no additional information.

Ordinal

Ordinal refers to order in measurement. An ordinal scale indicates direction, in addition to providing nominal information. Low/Medium/High or Faster/Slower are examples of ordinal levels of measurement. Ranking an experience as a "nine" on a scale of 1 to 10 tells us that it was higher than an experience ranked as a "six." Many psychological scales or inventories are at the ordinal level of measurement.

Examples: RANK: 1st place, 2nd place, ..., last place. LEVEL OF AGREEMENT: no, maybe, yes. POLITICAL ORIENTATION: left, center, right.

Ordinal time of day: indicates direction or order of occurrence; the spacing between values is uneven.

Interval

Interval scales provide information about order and also possess equal intervals. From the previous example, if we knew that the distance between 1 and 2 was the same as that between 7 and 8 on our 10-point rating scale, then we would have an interval scale. An example of an interval scale is temperature, measured either on a Fahrenheit or a Celsius scale. A degree represents the same underlying amount of heat, regardless of where it occurs on the scale. Measured in Fahrenheit units, the difference between a temperature of 46 and 42 is the same as the difference between 72 and 68. Equal-interval scales of measurement can be devised for opinions and attitudes. Constructing them involves an understanding of mathematical and statistical principles beyond those covered in this course, but it is important to understand the different levels of measurement when using and interpreting scales.

Examples: TIME OF DAY on a 12-hour clock. POLITICAL ORIENTATION: score on a standardized scale of political orientation. OTHER: scales constructed so as to possess equal intervals.

Interval time of day: equal intervals on an analog (12-hr) clock; the difference between 1 and 2 pm is the same as the difference between 11 and 12 am.

Ratio

In addition to possessing the qualities of nominal, ordinal, and interval scales, a ratio scale has an absolute zero (a point where none of the quality being measured exists). Using a ratio scale permits comparisons such as being twice as high, or one-half as much. Reaction time (how long it takes to respond to a signal of some sort) uses a ratio scale of measurement: time. Although an individual's reaction time is always greater than zero, we conceptualize a zero point in time and can state that a response of 24 milliseconds is twice as fast as a response time of 48 milliseconds.

Examples: RULER: inches or centimeters. YEARS of work experience. INCOME: money earned last year. NUMBER of children. GPA: grade point average.

Ratio time of day: 24-hr time has an absolute 0 (midnight); 14 o'clock is twice as long after midnight as 7 o'clock.

Applications

The level of measurement for a particular variable is defined by the highest category that it achieves. For example, categorizing someone as extroverted (outgoing) or introverted (shy) is nominal. If we categorize people as 1 = shy, 2 = neither shy nor outgoing, 3 = outgoing, then we have an ordinal level of measurement. If we use a standardized measure of shyness (and there are such inventories), we would probably assume the shyness variable meets the standards of an interval level of measurement. As to whether or not we might have a ratio scale of shyness: although we might be able to measure zero shyness, it would be difficult to devise a scale where we would be comfortable talking about someone's being three times as shy as someone else.

Measurement at the interval or ratio level is desirable because we can use the more powerful statistical procedures available for means and standard deviations. To have this advantage, ordinal data are often treated as though they were interval, for example, subjective rating scales (1 = terrible, 2 = poor, 3 = fair, 4 = good, 5 = excellent). The scale probably does not meet the requirement of equal intervals: we don't know that the difference between 2 (poor) and 3 (fair) is the same as the difference between 4 (good) and 5 (excellent). In order to take advantage of more powerful statistical techniques, researchers often assume that the intervals are equal.

Data Preprocessing

Data have quality if they satisfy the requirements of the intended use. There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.

Imagine that you are a manager at AllElectronics and have been charged with analyzing the company's data with respect to your branch's sales. You immediately set out to perform this task. You carefully inspect the company's database and data warehouse, identifying and selecting the attributes or dimensions (e.g., item, price, and units sold) to be included in your analysis. Alas! You notice that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions. In other words, the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), inaccurate or noisy (containing errors, or values that deviate from the expected), and inconsistent (e.g., containing discrepancies in the department codes used to categorize items). Welcome to the real world!

This scenario illustrates three of the elements defining data quality: accuracy, completeness, and consistency. Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses. There are many possible reasons for inaccurate data (i.e., data having incorrect attribute values). The data collection instruments used may be faulty. There may have been human or computer errors occurring at data entry. Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (e.g., by choosing the default value "January 1" displayed for birthday). This is known as disguised missing data. Errors in data transmission can also occur. There may be technology limitations, such as limited buffer size for coordinating synchronized data transfer and consumption. Incorrect data may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g., date). Duplicate tuples also require data cleaning.

Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because they were not considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the data history or of modifications may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.

Recall that data quality depends on the intended use of the data. Two different users may have very different assessments of the quality of a given database. For example, a marketing analyst may need to access the database mentioned before for a list of customer addresses. Some of the addresses are outdated or incorrect, yet overall, 80% of the addresses are accurate. The marketing analyst considers this to be a large customer database for target marketing purposes and is pleased with the database's accuracy, although, as sales manager, you found the data inaccurate.

Timeliness also affects data quality. Suppose that you are overseeing the distribution of monthly sales bonuses to the top sales representatives at AllElectronics. Several sales representatives, however, fail to submit their sales records on time at the end of the month. There are also a number of corrections and adjustments that flow in after the month's end. For a period of time following each month, the data stored in the database are incomplete. However, once all of the data are received, they are correct. The fact that the month-end data are not updated in a timely fashion has a negative impact on the data quality.

Two other factors affecting data quality are believability and interpretability. Believability reflects how much the data are trusted by users, while interpretability reflects how easily the data are understood. Suppose that a database, at one point, had several errors, all of which have since been corrected. The past errors, however, had caused many problems for sales department users, and so they no longer trust the data. The data also use many accounting codes, which the sales department does not know how to interpret. Even though the database is now accurate, complete, consistent, and timely, sales department users may regard it as of low quality due to poor believability and interpretability.

Data Cleaning

Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. In this section, you will study basic methods for data cleaning. Section 3.2.1 looks at ways of handling missing values. Section 3.2.2 explains data smoothing techniques. Section 3.2.3 discusses approaches to data cleaning as a process.

3.2.1 Missing Values

Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded value for several attributes, such as customer income. How can you go about filling in the missing values for this attribute? Let's look at the following methods.

1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably. By ignoring the tuple, we do not make use of the remaining attributes' values in the tuple. Such data could have been useful to the task at hand.

2. Fill in the missing value manually: In general, this approach is time consuming and may not be feasible given a large data set with many missing values.

3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown" or a special value such as minus infinity. If missing values are replaced by, say, "Unknown," then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of "Unknown." Hence, although this method is simple, it is not foolproof.

4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value: Chapter 2 discussed measures of central tendency, which indicate the "middle" value of a data distribution. For normal (symmetric) data distributions, the mean can be used, while skewed data distributions should employ the median (Section 2.2). For example, suppose that the data distribution regarding the income of AllElectronics customers is symmetric and that the mean income is $56,000. Use this value to replace the missing value for income.

5. Use the attribute mean or median for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, we may replace the missing value with the mean income value for customers in the same credit risk category as that of the given tuple. If the data distribution for a given class is skewed, the median value is a better choice.

6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income. Decision trees and Bayesian inference are described in detail in Chapters 8 and 9, respectively, while regression is introduced in Section 3.4.5.
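A minimal pandas sketch of methods 4 and 5 above (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical customer data with missing income values.
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [56000, None, 21000, None, 48000],
})

# Method 4: fill with the overall mean (use the median for skewed distributions).
overall = df["income"].fillna(df["income"].mean())

# Method 5: fill with the mean income of tuples in the same class (credit_risk).
per_class = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)
print(per_class)
```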


In z-score normalization (or zero-mean normalization) the values for an attribute A are normalized based on the mean (ie average) and standard deviation of A A value vi of A is normalized to v0 i by computing

Decimal scaling Suppose that the recorded values of A range from 1048576986 to 917 The maximum absolute value of A is 986 To normalize by decimal scaling we therefore divide each value by 1000 (ie j D 3) so that 1048576986 normalizes to 10485760986 and 917 normalizes to 0917

Note that normalization can change the original data quite a bit especially when using z-score normalization or decimal scaling It is also necessary to save the normalization parameters (eg the mean and standard deviation if using z-score normalization) so that future data can be normalized in a uniform manner

Discretization by Binning

Binning is a top-down splitting technique based on a specified number of bins Section 322 discussed binning methods for data smoothing These methods are also used as discretization methods for data reduction and concept hierarchy generation For example attribute values can be discretized by applying equal-width or equal-frequency binning and then replacing each bin value by the bin mean or median as in smoothing by bin means or smoothing by bin medians respectively These techniques can be applied recursively to the resulting partitions to generate concept hierarchies Binning does not use class information and is therefore an unsupervised discretization technique It is sensitive to the user-specified number of bins as well as the presence of outliers

DistanceSimilarity Measures

Similarity measure of how close to each other two instances are The ldquocloserrdquo the instances are to each other the larger is the similarity value

Dissimilarity measure of how different two instances are Dissimilarity is large when instances are very different and is small when they are close

Proximity refers to either similarity or dissimilarity

Distance metric a measure of dissimilarity that obeys the following laws (laws of triangular norm)

Conversion of similarity and dissimilarity measures

Typically given a similarity measure one can ldquorevertrdquo it to serve as the dissimilarity measure and vice versa

Conversions may differ Eg if d is a distance measure one can use

as the corresponding similarity measure If s is the similarity measure that ranges between 0 and 1 (so called degree of similarity) then the corresponding dissimilarity measure can be defined as

In general any monotonically decreasing transformation can be applied to convert similarity measures into dissimilarity measures and any monotonically increasing transformtaion can be applied to convert the measures the other way around

Distance Metrics for Numeric Attributes

When the data set is presented in a standard form each instance can be treated as a vector macrx = (x1 xN) of measures for attributes numbered 1 N

Consider for now only non-nominal scales

Computing User Similarity

A critical design decision in implementing userndashuser CF is the choice of similarity function Several different similarity functions have been proposed and evaluated in the literature Pearson correlation This method computes the statistical correlation (Pearsonrsquos r) between two userrsquos common ratings to determine their similarity GroupLens and BellCore both used this method The correlation is computed by the following

Click here Collaborative Filtering Recommender Systems

Clustering

Click here Applied Multivariate Statistical Analysis ebook

Model Based Clustering ndash Click Here

Expectation Maximization - Click Here

UV-Decomposition ndash Click Here

SVD Method ndash Click Here

Page 3: Recomender System Notes

Other tasks can be seen as more "opportunistic" ways to exploit a RS. As a matter of fact, this task differentiation is very similar to what happens with a search engine: its primary function is to locate documents that are relevant to the user's information need, but it can also be used to check the importance of a Web page (looking at the position of the page in the result list of a query) or to discover the various usages of a word in a collection of documents.

• Find Some Good Items. Recommend to a user some items as a ranked list, along with predictions of how much the user would like them (e.g., on a one- to five-star scale). This is the main recommendation task that many commercial systems address (see, for instance, Chapter 9). Some systems do not show the predicted rating.

• Find all good items. Recommend all the items that can satisfy some user needs. In such cases it is insufficient to just find some good items. This is especially true when the number of items is relatively small or when the RS is mission-critical, such as in medical or financial applications. In these situations, in addition to the benefit derived from carefully examining all the possibilities, the user may also benefit from the RS's ranking of these items or from additional explanations that the RS generates.

• Annotation in context. Given an existing context, e.g., a list of items, emphasize some of them depending on the user's long-term preferences. For example, a TV recommender system might annotate which TV shows displayed in the electronic program guide (EPG) are worth watching (Chapter 18 provides interesting examples of this task).

• Recommend a sequence. Instead of focusing on the generation of a single recommendation, the idea is to recommend a sequence of items that is pleasing as a whole. Typical examples include recommending a TV series, a book on RSs after having recommended a book on data mining, or a compilation of musical tracks.

• Recommend a bundle. Suggest a group of items that fits well together. For instance, a travel plan may be composed of various attractions, destinations, and accommodation services that are located in a delimited area. From the point of view of the user, these various alternatives can be considered and selected as a single travel destination.

• Just browsing. In this task the user browses the catalog without any imminent intention of purchasing an item. The task of the recommender is to help the user browse the items that are more likely to fall within the scope of the user's interests for that specific browsing session. This is a task that has also been supported by adaptive hypermedia techniques.

• Find credible recommender. Some users do not trust recommender systems, so they play with them to see how good they are at making recommendations. Hence, a system may also offer specific functions to let users test its behavior, in addition to those required for obtaining recommendations.

• Improve the profile. This relates to the capability of the user to provide (input) information to the recommender system about what he likes and dislikes. This is a fundamental task that is strictly necessary to provide personalized recommendations. If the system has no specific knowledge about the active user, then it can only provide him with the same recommendations that would be delivered to an "average" user.

• Express self. Some users may not care about the recommendations at all. Rather, what is important to them is to be allowed to contribute their ratings and express their opinions and beliefs. The user satisfaction from that activity can still act as leverage for holding the user tightly to the application (as we mentioned above in discussing the service provider's motivations).

• Help others. Some users are happy to contribute information, e.g., their evaluation of items (ratings), because they believe that the community benefits from their contribution. This could be a major motivation for entering information into a recommender system that is not used routinely. For instance, with a car RS, a user who has already bought her new car is aware that the rating entered in the system is more likely to be useful for other users than for the next time she buys a car.

• Influence others. In Web-based RSs, there are users whose main goal is to explicitly influence other users into purchasing particular products. As a matter of fact, there are also some malicious users that may use the system just to promote or penalize certain items (see Chapter 25).

In any case, as a general classification, data used by RSs refers to three kinds of objects: items, users, and transactions, i.e., relations between users and items.

Items. Items are the objects that are recommended. Items may be characterized by their complexity and their value or utility. The value of an item may be positive if the item is useful for the user, or negative if the item is not appropriate and the user made a wrong decision when selecting it. We note that when a user is acquiring an item she will always incur a cost, which includes the cognitive cost of searching for the item and the real monetary cost eventually paid for the item. For instance, the designer of a news RS must take into account the complexity of a news item, i.e., its structure, its textual representation, and its time-dependent importance. But at the same time, the RS designer must understand that even if the user is not paying for reading news, there is always a cognitive cost associated with searching for and reading news items. If a selected item is relevant for the user, this cost is dominated by the benefit of having acquired useful information, whereas if the item is not relevant, the net value of that item for the user, and of its recommendation, is negative. In other domains, e.g., cars or financial investments, the true monetary cost of the items becomes an important element to consider when selecting the most appropriate recommendation approach. Items with low complexity and value are news, Web pages, books, CDs, and movies. Items with larger complexity and value are digital cameras, mobile phones, PCs, etc. The most complex items that have been considered are insurance policies, financial investments, travels, and jobs. RSs, according to their core technology, can use a range of properties and features of the items. For example, in a movie recommender system, the genre (such as comedy, thriller, etc.), as well as the director and actors, can be used to describe a movie and to learn how the utility of an item depends on its features. Items can be represented using various information and representation approaches, e.g., in a minimalist way as a single id code, in a richer form as a set of attributes, or even as a concept in an ontological representation of the domain.

Users. Users of a RS, as mentioned above, may have very diverse goals and characteristics. In order to personalize the recommendations and the human-computer interaction, RSs exploit a range of information about the users. This information can be structured in various ways, and again the selection of what information to model depends on the recommendation technique. For instance, in collaborative filtering, users are modeled as a simple list containing the ratings provided by the user for some items. In a demographic RS, sociodemographic attributes such as age, gender, profession, and education are used. User data is said to constitute the user model. The user model profiles the user, i.e., encodes her preferences and needs. Various user modeling approaches have been used, and in a certain sense a RS can be viewed as a tool that generates recommendations by building and exploiting user models. Since no personalization is possible without a convenient user model, unless the recommendation is non-personalized as in the top-10 selection, the user model will always play a central role. For instance, considering again a collaborative filtering approach, the user is either profiled directly by her ratings of items or, using these ratings, the system derives a vector of factor values, where users differ in how each factor weighs in their model. Users can also be described by their behavior pattern data, for example, site browsing patterns (in a Web-based recommender system) or travel search patterns (in a travel recommender system). Moreover, user data may include relations between users, such as the trust level of these relations. A RS might utilize this information to recommend items to users that were preferred by similar or trusted users.

Transactions. We generically refer to a transaction as a recorded interaction between a user and the RS. Transactions are log-like data that store important information generated during the human-computer interaction and which is useful for the recommendation generation algorithm that the system is using. For instance, a transaction log may contain a reference to the item selected by the user and a description of the context (e.g., the user goal/query) for that particular recommendation. If available, the transaction may also include explicit feedback the user has provided, such as the rating for the selected item.

In fact, ratings are the most popular form of transaction data that a RS collects. These ratings may be collected explicitly or implicitly. In the explicit collection of ratings, the user is asked to provide her opinion about an item on a rating scale. Ratings can take a variety of forms:

• Numerical ratings, such as the 1-5 stars provided in the book recommender associated with Amazon.com.

• Ordinal ratings, such as "strongly agree, agree, neutral, disagree, strongly disagree," where the user is asked to select the term that best indicates her opinion regarding an item (usually via questionnaire).

• Binary ratings that model choices in which the user is simply asked to decide if a certain item is good or bad.

• Unary ratings that can indicate that a user has observed or purchased an item, or otherwise rated the item positively. In such cases the absence of a rating indicates that we have no information relating the user to the item (perhaps she purchased the item somewhere else).

Another form of user evaluation consists of tags associated by the user with the items the system presents. For instance, in the MovieLens RS (http://movielens.umn.edu), tags represent how MovieLens users feel about a movie, e.g., "too long" or "acting." In transactions collecting implicit ratings, the system aims to infer the user's opinion based on the user's actions. For example, if a user enters the keyword "Yoga" at Amazon.com, she will be provided with a long list of books. In return, the user may click on a certain book on the list in order to receive additional information. At this point the system may infer that the user is somewhat interested in that book.

In conversational systems, i.e., systems that support an interactive process, the transaction model is more refined. In these systems user requests alternate with system actions. That is, the user may request a recommendation and the system may produce a suggestion list, but it can also request additional user preferences to provide the user with better results. Here, in the transaction model, the system collects the various requests and responses, and may eventually learn to modify its interaction strategy by observing the outcome of the recommendation process.

Recommendation Techniques

In order to implement its core function, identifying the useful items for the user, a RS must predict that an item is worth recommending. In order to do this, the system must be able to predict the utility of some items, or at least compare the utility of some items, and then decide what items to recommend based on this comparison. The prediction step may not be explicit in the recommendation algorithm, but we can still apply this unifying model to describe the general role of a RS.

To illustrate the prediction step of a RS, consider for instance a simple, non-personalized recommendation algorithm that recommends just the most popular songs. The rationale for using this approach is that, in the absence of more precise information about the user's preferences, a popular song, i.e., something that is liked (has high utility) for many users, will probably also be liked by a generic user, at least more than another randomly selected song. Hence the utility of these popular songs is predicted to be reasonably high for this generic user.

This view of the core recommendation computation as the prediction of the utility of an item for a user has been suggested in the literature. There, the degree of utility of the user u for the item i is modeled as a (real-valued) function R(u, i), as is normally done in collaborative filtering by considering the ratings of users for items. Then the fundamental task of a collaborative filtering RS is to predict the value of R over pairs of users and items, i.e., to compute R̂(u, i), where we denote by R̂ the estimate, computed by the RS, of the true function R. Consequently, having computed this prediction for the active user u on a set of items, i.e., R̂(u, i_1), ..., R̂(u, i_N), the system will recommend the items i_j1, ..., i_jK (K ≤ N) with the largest predicted utility. K is typically a small number, i.e., much smaller than the cardinality of the item data set or of the items on which a user utility prediction can be computed; that is, RSs "filter" the items that are recommended to users.
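To make the ranking step concrete, the sketch below assumes a hypothetical predict_rating(user, item) function standing in for R̂(u, i) and simply returns the K items with the largest predicted utility; it is an illustrative sketch, not a prescribed implementation.

```python
import heapq
from typing import Callable, Iterable, List, Tuple

def recommend_top_k(
    user: str,
    candidate_items: Iterable[str],
    predict_rating: Callable[[str, str], float],  # stands in for R-hat(u, i)
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every candidate item for the user and keep the K best."""
    scored = ((item, predict_rating(user, item)) for item in candidate_items)
    # heapq.nlargest ranks by the key, here the predicted utility
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Example with a toy prediction function (popularity only, ignores the user)
popularity = {"song_a": 0.9, "song_b": 0.4, "song_c": 0.7}
print(recommend_top_k("u1", popularity, lambda u, i: popularity[i], k=2))
```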

As mentioned above, some recommender systems do not fully estimate the utility before making a recommendation, but they may apply some heuristics to hypothesize that an item is of use to a user. This is typical, for instance, in knowledge-based systems. These utility predictions are computed with specific algorithms (see below) and use various kinds of knowledge about users, items, and the utility function itself. For instance, the system may assume that the utility function is Boolean, and therefore it will just determine whether an item is or is not useful for the user. Consequently, assuming that there is some available knowledge (possibly none) about the user who is requesting the recommendation, as well as knowledge about items and about other users who received recommendations, the system will leverage this knowledge with an appropriate algorithm to generate various utility predictions and hence recommendations.

To provide a first overview of the different types of RSs, we quote a taxonomy that has become a classical way of distinguishing between recommender systems and referring to them. [25] distinguishes between six different classes of recommendation approaches:

Content-based: The system learns to recommend items that are similar to the ones that the user liked in the past. The similarity of items is calculated based on the features associated with the compared items. For example, if a user has positively rated a movie that belongs to the comedy genre, then the system can learn to recommend other movies from this genre. Chapter 3 provides an overview of content-based recommender systems, imposing some order among the extensive and diverse aspects involved in their design and implementation. It presents the basic concepts and terminology of content-based RSs, their high-level architecture, and their main advantages and drawbacks. The chapter then surveys state-of-the-art systems that have been adopted in several application domains. The survey encompasses a thorough description of both classical and advanced techniques for representing items and user profiles. Finally, it discusses trends and future research which might lead towards the next generation of recommender systems.
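As a rough illustration of the content-based idea, the sketch below scores unseen items by cosine similarity between their feature vectors and a user profile built as the average of the feature vectors of the items the user liked; the item features and the liked list are invented for the example.

```python
import numpy as np

# Hypothetical binary item features (e.g., genres): [comedy, thriller, drama]
item_features = {
    "movie_a": np.array([1.0, 0.0, 0.0]),
    "movie_b": np.array([1.0, 0.0, 1.0]),
    "movie_c": np.array([0.0, 1.0, 0.0]),
}
liked_items = ["movie_a"]          # items the user rated positively

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# User profile: average feature vector of the liked items
profile = np.mean([item_features[i] for i in liked_items], axis=0)

# Score every unseen item by similarity to the profile
scores = {
    item: cosine(profile, feats)
    for item, feats in item_features.items()
    if item not in liked_items
}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```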

Collaborative filtering: The simplest and original implementation of this approach [93] recommends to the active user the items that other users with similar tastes liked in the past. The similarity in taste of two users is calculated based on the similarity in the rating history of the users. This is the reason why [94] refers to collaborative filtering as "people-to-people correlation." Collaborative filtering is considered to be the most popular and widely implemented technique in RS.

A comprehensive survey of neighborhood-based methods for collaborative filtering: neighborhood methods focus on relationships between items or, alternatively, between users. An item-item approach models the preference of a user for an item based on the ratings of similar items by the same user. Nearest-neighbor methods enjoy considerable popularity due to their simplicity, their efficiency, and their ability to produce accurate and personalized recommendations. The authors address the essential decisions that are required when implementing a neighborhood-based recommender system and provide practical information on how to make such decisions.

Finally, the chapter deals with problems of data sparsity and limited coverage, often observed in large commercial recommender systems, and presents a few solutions to overcome these problems.
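A minimal sketch of the item-item idea described above: predict a user's rating for a target item as a similarity-weighted average of that user's ratings on similar items. The ratings and the (assumed pre-computed) item-item similarities are made up for illustration.

```python
# Hypothetical data: the user's known ratings and pre-computed item-item similarities
user_ratings = {"item_a": 5.0, "item_b": 3.0, "item_c": 1.0}
similarity_to_target = {"item_a": 0.9, "item_b": 0.4, "item_c": 0.1}  # sim(i, target)

def predict_item_item(user_ratings, similarity_to_target):
    """Similarity-weighted average of the user's ratings on neighbor items."""
    num = sum(similarity_to_target[i] * r for i, r in user_ratings.items())
    den = sum(abs(similarity_to_target[i]) for i in user_ratings)
    return num / den if den else None

print(predict_item_item(user_ratings, similarity_to_target))  # ~4.1
```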

Demographic: This type of system recommends items based on the demographic profile of the user. The assumption is that different recommendations should be generated for different demographic niches. Many Web sites adopt simple and effective personalization solutions based on demographics. For example, users are dispatched to particular Web sites based on their language or country, or suggestions may be customized according to the age of the user. While these approaches have been quite popular in the marketing literature, there has been relatively little proper RS research into demographic systems.

Knowledge-based: Knowledge-based systems recommend items based on specific domain knowledge about how certain item features meet users' needs and preferences and, ultimately, how the item is useful for the user. Notable knowledge-based recommender systems are case-based: in these systems a similarity function estimates how much the user needs (the problem description) match the recommendations (the solutions of the problem). Here the similarity score can be directly interpreted as the utility of the recommendation for the user. Constraint-based systems are another type of knowledge-based RSs.

In terms of the knowledge used, both kinds of systems are similar: user requirements are collected, repairs for inconsistent requirements are automatically proposed in situations where no solutions can be found, and recommendation results are explained. The major difference lies in the way solutions are calculated: case-based recommenders determine recommendations on the basis of similarity metrics, whereas constraint-based recommenders predominantly exploit predefined knowledge bases that contain explicit rules about how to relate customer requirements with item features.

Knowledge-based systems tend to work better than others at the beginning of their deployment, but if they are not equipped with learning components they may be surpassed by other, shallower methods that can exploit the logs of the human/computer interaction (as in CF).

Community-based: This type of system recommends items based on the preferences of the user's friends. This technique follows the epigram "Tell me who your friends are, and I will tell you who you are." Evidence suggests that people tend to rely more on recommendations from their friends than on recommendations from similar but anonymous individuals. This observation, combined with the growing popularity of open social networks, is generating a rising interest in community-based systems or, as they are usually referred to, social recommender systems. This type of RS models and acquires information about the social relations of the users and the preferences of the user's friends. The recommendation is based on ratings that were provided by the user's friends. In fact, these RSs are following the rise of social networks and enable a simple and comprehensive acquisition of data related to the social relations of the users.

Hybrid recommender systems: These RSs are based on the combination of the above-mentioned techniques. A hybrid system combining techniques A and B tries to use the advantages of A to fix the disadvantages of B.

For instance, CF methods suffer from new-item problems, i.e., they cannot recommend items that have no ratings. This does not limit content-based approaches, since the prediction for new items is based on their description (features), which is typically easily available. Given two (or more) basic RS techniques, several ways have been proposed for combining them to create a new hybrid system. As we have already mentioned, the context of the user when she is seeking a recommendation can be used to better personalize the output of the system. For example, in a temporal context, vacation recommendations in winter should be very different from those provided in summer; or a restaurant recommendation for a Saturday evening with your friends should be different from that suggested for a workday lunch with co-workers.
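One common way to combine two techniques is a simple weighted hybrid. The sketch below blends a collaborative and a content-based score with a fixed weight, falling back to the content-based score when the CF score is unavailable (e.g., for a new item). Both scoring functions are hypothetical placeholders, and the weight is an assumption for illustration.

```python
from typing import Callable, Optional

def hybrid_score(
    user: str,
    item: str,
    cf_score: Callable[[str, str], Optional[float]],   # collaborative part
    content_score: Callable[[str, str], float],        # content-based part
    alpha: float = 0.7,                                 # weight on the CF score
) -> float:
    cf = cf_score(user, item)
    cb = content_score(user, item)
    if cf is None:            # new-item case: no ratings, CF cannot score it
        return cb
    return alpha * cf + (1 - alpha) * cb

# Toy usage with constant placeholder scorers
print(hybrid_score("u1", "new_item", lambda u, i: None, lambda u, i: 0.8))   # 0.8
print(hybrid_score("u1", "old_item", lambda u, i: 0.6, lambda u, i: 0.8))    # 0.66
```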

Application and Evaluation

Recommender system research is being conducted with a strong emphasis on practice and commercial applications, since, aside from its theoretical contribution, it is generally aimed at practically improving commercial RSs. Thus RS research involves practical aspects that apply to the implementation of these systems. These aspects are relevant to different stages in the life cycle of a RS, namely the design of the system, its implementation, and its maintenance and enhancement during system operation.

The aspects that apply to the design stage include factors that might affect the choice of the algorithm. The first factor to consider, the application's domain, has a major effect on the algorithmic approach that should be taken. [72] provide a taxonomy of RSs and classify existing RS applications into specific application domains. Based on these specific application domains, we define more general classes of domains for the most common recommender system applications:

• Entertainment: recommendations for movies, music, and IPTV.

• Content: personalized newspapers, recommendation of documents, recommendations of Web pages, e-learning applications, and e-mail filters.

• E-commerce: recommendations for consumers of products to buy, such as books, cameras, PCs, etc.

• Services: recommendations of travel services, recommendation of experts for consultation, recommendation of houses to rent, or matchmaking services.

Data Collection

The logical component in charge of pre-processing the data and generating the input of the recommender algorithm is referred to as the data collector. The data collector gathers data from different sources, such as the EPG for information about the live programs, the content provider for information about the VOD catalog, and the service provider for information about the users.

The Fastweb recommender system does not rely on personal information about the users (e.g., age, gender, occupation). Recommendations are based on the users' past behavior (what they watched) and on any explicit preference they have expressed (e.g., preferred genres). If the users did not specify any explicit preferences, the system is able to infer them by analyzing the users' past activities.

An important question was raised in Section 9.2: users interact with the IPTV system by means of the STB, but typically we cannot identify who is actually in front of the TV. Consequently, the STB collects the behavior and the preferences of a set of users (e.g., the members of a family). This represents a considerable problem, since we are limited to generating per-STB recommendations. In order to simplify the notation, in the rest of the paper we will refer to user and STB as the same entity. The user-disambiguation problem has been partially solved by separating the collected information according to the time slot it refers to. For instance, we can roughly assume the following pattern: housewives tend to watch TV during the morning, children during the afternoon, the whole family in the evening, while only adults watch TV during the night. By means of this simple time-slot distinction we are able to distinguish among the different potential users of the same STB.
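A minimal sketch of the time-slot heuristic just described, mapping the hour of an STB event to a coarse audience segment; the slot boundaries are assumptions chosen only for illustration.

```python
from datetime import datetime

def time_slot(ts: datetime) -> str:
    """Map an event timestamp to a coarse per-STB audience segment."""
    hour = ts.hour
    if 6 <= hour < 12:
        return "morning"      # e.g., mostly housewives in the example pattern
    if 12 <= hour < 18:
        return "afternoon"    # e.g., mostly children
    if 18 <= hour < 23:
        return "evening"      # e.g., the whole family
    return "night"            # e.g., adults only

# Events logged for one STB can then be keyed by (stb_id, slot) instead of stb_id alone
print(time_slot(datetime(2016, 1, 30, 21, 15)))  # evening
```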

Formally, the available information has been structured into two main matrices, practically stored in a relational database: the item-content matrix (ICM) and the user-rating matrix (URM).

The former describes the principal characteristics (metadata) of each item. In the following we will refer to the item-content matrix as W, whose elements w_ci represent the relevance of characteristic (metadata) c for item i. The ICM is generated from the analysis of the set of information given by the content provider (i.e., the EPG). Such information concerns, for instance, the title of a movie, the actors, the director(s), the genre(s), and the plot. Note that in a real environment we can be faced with inaccurate information, especially because of the rate at which new content is added every day. The information provided by the ICM is used to generate content-based recommendations, after being filtered by means of techniques for PoS (part-of-speech) tagging, stop-word removal, and latent semantic analysis. Moreover, the ICM can be used to perform some kind of processing on the items (e.g., parental control).

The URM represents the ratings (i.e., preferences) of users for items. In the following we will refer to this matrix as R, whose elements r_pi represent the rating of user p for item i. Such preferences constitute the basic information for any collaborative algorithm. The user rating can be either explicit or implicit, according to whether the ratings are explicitly expressed by users or are implicitly collected by the system.
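Since most users rate only a few items, the URM R is typically very sparse. Below is a small sketch of building it from (user, item, rating) tuples with SciPy's sparse matrices; the index mappings and sample ratings are invented for illustration.

```python
from scipy.sparse import csr_matrix

# Hypothetical (user, item, rating) tuples collected by the system
ratings = [("u1", "i1", 5.0), ("u1", "i3", 2.0), ("u2", "i2", 4.0)]

users = sorted({u for u, _, _ in ratings})
items = sorted({i for _, i, _ in ratings})
u_idx = {u: k for k, u in enumerate(users)}
i_idx = {i: k for k, i in enumerate(items)}

rows = [u_idx[u] for u, _, _ in ratings]
cols = [i_idx[i] for _, i, _ in ratings]
vals = [r for _, _, r in ratings]

# URM R: element r_pi is the rating of user p for item i (0 = unrated)
R = csr_matrix((vals, (rows, cols)), shape=(len(users), len(items)))
print(R.toarray())
```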

Explicit ratings confidently represent the user's opinion, even though they can be affected by biases due to user subjectivity, item popularity, or global rating tendencies. The first bias depends on arbitrary interpretations of the rating scale: for instance, on a rating scale between 1 and 5, one user could use the value 3 to indicate an interesting item, while someone else could use 3 for a not very interesting item. Similarly, popular items tend to be overrated, while unpopular items are usually underrated. Finally, explicit ratings can be affected by global attitudes (e.g., users are more willing to rate movies they like).

On the other hand, implicit ratings are inferred by the system on the basis of the user-system interaction, which might not match the user's opinion. For instance, the system is able to monitor whether a user has watched a live program on a certain channel, or whether the user has watched a movie without interruption. Although explicit ratings are more reliable than implicit ratings in representing the actual user interest in an item, their collection can be annoying from the user's perspective.
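A small sketch of how implicit feedback of the kind just described might be turned into a pseudo-rating, here by thresholding the fraction of a program actually watched; the thresholds and the mapping onto a 1-5 scale are assumptions for illustration only.

```python
from typing import Optional

def implicit_rating(watched_seconds: float, duration_seconds: float) -> Optional[float]:
    """Map the watched fraction of a program to a pseudo-rating on a 1-5 scale."""
    if duration_seconds <= 0:
        return None
    fraction = min(watched_seconds / duration_seconds, 1.0)
    if fraction < 0.1:
        return None          # too little evidence: treat as unrated
    # Linear mapping of the watched fraction onto the rating scale
    return round(1.0 + 4.0 * fraction, 1)

print(implicit_rating(watched_seconds=5400, duration_seconds=6000))  # 4.6
```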

Long Tail

In statistics, a long tail of some distribution of numbers is the portion of the distribution having a large number of occurrences far from the head or central part of the distribution. The distribution could involve popularities, random numbers of occurrences of events with various probabilities, etc. A probability distribution is said to have a long tail if a larger share of the population rests within its tail than would under a normal distribution. A long-tail distribution will arise with the inclusion of many values unusually far from the mean, which increase the magnitude of the skewness of the distribution. A long-tailed distribution is a particular type of heavy-tailed distribution. The distribution and inventory costs of businesses successfully applying a long-tail strategy allow them to realize significant profit out of selling small volumes of hard-to-find items to many customers, instead of only selling large volumes of a reduced number of popular items. The total sales of this large number of non-hit items is called the long tail.

Personalization

Today, personalization is something that occurs separately within each system that one interacts with. Recommender systems are one technique for personalization; in essence, the personalization occurs slowly as each system builds up information about your likes and dislikes, about what interests you and what fails to interest you. There are numerous other personalization techniques; most of these rely either on the collection of system usage history, which is then employed to change the behavior of the system, or on the user taking the time and trouble to explicitly personalize the behavior of the system in various ways, by setting parameters, making selections, or engaging in dialogs with the system.

There are several problems with this model, at least from the user's point of view. Investments in personalizing one system (either through explicit action or just long use) are not transferable to another system. (Of course, from the system operator's point of view this may be very desirable: it increases switching costs for users and thus helps lock in a user base.) Information such as likes and dislikes or usage patterns is scattered across multiple systems and can't be combined to obtain maximum leverage. And the user does not have control of the information bases that define his or her profile. If you want to buy books from multiple online booksellers, this is annoying. But if we are concerned with developing information discovery systems to assist users in a world of information overload, these problems are critical. People obtain information from a multiplicity of sources, and personalization has to happen close to the end user; this is the only place where there is enough information to do personalization effectively, to keep track of what's new and what isn't, what has and has not proven useful. The user needs to become a hub and a switch, moving data to allow accurate personalization from one system to another.

Levels of measurement

What a scale actually means and what we can do with it depends on what its numbers represent. Numbers can be grouped into four types or levels: nominal, ordinal, interval, and ratio. Nominal is the most simple and ratio the most sophisticated. Each level possesses the characteristics of the preceding level, plus an additional quality.

Nominal

Nominal is hardly measurement. It refers to quality more than quantity. A nominal level of measurement is simply a matter of distinguishing by name, e.g., 1 = male, 2 = female. Even though we are using the numbers 1 and 2, they do not denote quantity. The binary categories of 0 and 1 used for computers are a nominal level of measurement. They are categories or classifications. Nominal measurement is like using the categorical levels of variables described in the Doing Scientific Research section of the Introduction module.

Examples: MEAL PREFERENCE: breakfast, lunch, dinner. RELIGIOUS PREFERENCE: 1 = Buddhist, 2 = Muslim, 3 = Christian, 4 = Jewish, 5 = Other. POLITICAL ORIENTATION: Republican, Democratic, Libertarian, Green.

Nominal time of day: categories, with no additional information.

Ordinal

Ordinal refers to order in measurement. An ordinal scale indicates direction, in addition to providing nominal information. Low/medium/high or faster/slower are examples of ordinal levels of measurement. Ranking an experience as a nine on a scale of 1 to 10 tells us that it was higher than an experience ranked as a six. Many psychological scales or inventories are at the ordinal level of measurement.

Examples: RANK: 1st place, 2nd place, ..., last place. LEVEL OF AGREEMENT: no, maybe, yes. POLITICAL ORIENTATION: left, center, right.

Ordinal time of day: indicates direction or order of occurrence; the spacing between values is uneven.

Interval

Interval scales provide information about order and also possess equal intervals. From the previous example, if we knew that the distance between 1 and 2 was the same as that between 7 and 8 on our 10-point rating scale, then we would have an interval scale. An example of an interval scale is temperature, measured either on a Fahrenheit or a Celsius scale. A degree represents the same underlying amount of heat regardless of where it occurs on the scale. Measured in Fahrenheit units, the difference between a temperature of 46 and 42 is the same as the difference between 72 and 68. Equal-interval scales of measurement can be devised for opinions and attitudes. Constructing them involves an understanding of mathematical and statistical principles beyond those covered in this course, but it is important to understand the different levels of measurement when using and interpreting scales.

Examples: TIME OF DAY on a 12-hour clock. POLITICAL ORIENTATION: score on a standardized scale of political orientation. OTHER: scales constructed so as to possess equal intervals.

Interval time of day: equal intervals on an analog (12-hour) clock; the difference between 1 and 2 pm is the same as the difference between 11 and 12 am.

Ratio

In addition to possessing the qualities of nominal, ordinal, and interval scales, a ratio scale has an absolute zero (a point where none of the quality being measured exists). Using a ratio scale permits comparisons such as being twice as high or one-half as much. Reaction time (how long it takes to respond to a signal of some sort) uses a ratio scale of measurement: time. Although an individual's reaction time is always greater than zero, we conceptualize a zero point in time and can state that a response of 24 milliseconds is twice as fast as a response time of 48 milliseconds.

Examples: RULER: inches or centimeters. YEARS of work experience. INCOME: money earned last year. NUMBER of children. GPA: grade point average.

Ratio time of day: 24-hour time has an absolute zero (midnight); 14 o'clock is twice as far from midnight as 7 o'clock.

Applications

The level of measurement for a particular variable is defined by the highest category that it achieves. For example, categorizing someone as extroverted (outgoing) or introverted (shy) is nominal. If we categorize people as 1 = shy, 2 = neither shy nor outgoing, 3 = outgoing, then we have an ordinal level of measurement. If we use a standardized measure of shyness (and there are such inventories), we would probably assume the shyness variable meets the standards of an interval level of measurement. As to whether or not we might have a ratio scale of shyness: although we might be able to measure zero shyness, it would be difficult to devise a scale where we would be comfortable talking about someone's being three times as shy as someone else.

Measurement at the interval or ratio level is desirable because we can use the more powerful statistical procedures available for means and standard deviations. To have this advantage, ordinal data are often treated as though they were interval, for example, subjective rating scales (1 = terrible, 2 = poor, 3 = fair, 4 = good, 5 = excellent). Such a scale probably does not meet the requirement of equal intervals: we don't know that the difference between 2 (poor) and 3 (fair) is the same as the difference between 4 (good) and 5 (excellent). In order to take advantage of more powerful statistical techniques, researchers often assume that the intervals are equal.

Data Preprocessing

Data have quality if they satisfy the requirements of the intended use. There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.

Imagine that you are a manager at AllElectronics and have been charged with analyzing the company's data with respect to your branch's sales. You immediately set out to perform this task. You carefully inspect the company's database and data warehouse, identifying and selecting the attributes or dimensions (e.g., item, price, and units sold) to be included in your analysis. Alas! You notice that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions. In other words, the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), inaccurate or noisy (containing errors, or values that deviate from the expected), and inconsistent (e.g., containing discrepancies in the department codes used to categorize items). Welcome to the real world!

This scenario illustrates three of the elements defining data quality: accuracy, completeness, and consistency. Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses. There are many possible reasons for inaccurate data (i.e., data having incorrect attribute values). The data collection instruments used may be faulty. There may have been human or computer errors occurring at data entry. Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (e.g., by choosing the default value "January 1" displayed for birthday); this is known as disguised missing data. Errors in data transmission can also occur. There may be technology limitations, such as limited buffer sizes for coordinating synchronized data transfer and consumption. Incorrect data may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g., date). Duplicate tuples also require data cleaning.

Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because they were not considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the data history or modifications may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.

Recall that data quality depends on the intended use of the data. Two different users may have very different assessments of the quality of a given database. For example, a marketing analyst may need to access the database mentioned before for a list of customer addresses. Some of the addresses are outdated or incorrect, yet overall 80% of the addresses are accurate. The marketing analyst considers this to be a large customer database for target marketing purposes and is pleased with the database's accuracy, although, as sales manager, you found the data inaccurate.

Timeliness also affects data quality. Suppose that you are overseeing the distribution of monthly sales bonuses to the top sales representatives at AllElectronics. Several sales representatives, however, fail to submit their sales records on time at the end of the month. There are also a number of corrections and adjustments that flow in after the month's end. For a period of time following each month, the data stored in the database are incomplete. However, once all of the data are received, they are correct. The fact that the month-end data are not updated in a timely fashion has a negative impact on the data quality.

Two other factors affecting data quality are believability and interpretability. Believability reflects how much the data are trusted by users, while interpretability reflects how easily the data are understood. Suppose that a database, at one point, had several errors, all of which have since been corrected. The past errors, however, had caused many problems for sales department users, and so they no longer trust the data. The data also use many accounting codes, which the sales department does not know how to interpret. Even though the database is now accurate, complete, consistent, and timely, sales department users may regard it as of low quality due to poor believability and interpretability.

Data Cleaning

Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. In this section you will study basic methods for data cleaning: Section 3.2.1 looks at ways of handling missing values, Section 3.2.2 explains data smoothing techniques, and Section 3.2.3 discusses approaches to data cleaning as a process.

3.2.1 Missing Values

Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded value for several attributes, such as customer income. How can you go about filling in the missing values for this attribute? Let's look at the following methods:

1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably. By ignoring the tuple, we do not make use of the remaining attributes' values in the tuple. Such data could have been useful to the task at hand.

2. Fill in the missing value manually: In general, this approach is time consuming and may not be feasible given a large data set with many missing values.

3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown" or −∞. If missing values are replaced by, say, "Unknown," then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, namely "Unknown." Hence, although this method is simple, it is not foolproof.

4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value: Chapter 2 discussed measures of central tendency, which indicate the "middle" value of a data distribution. For normal (symmetric) data distributions the mean can be used, while skewed data distributions should employ the median (Section 2.2). For example, suppose that the data distribution regarding the income of AllElectronics customers is symmetric and that the mean income is $56,000. Use this value to replace the missing value for income.

5. Use the attribute mean or median for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, we may replace the missing value with the mean income value for customers in the same credit risk category as that of the given tuple. If the data distribution for a given class is skewed, the median value is a better choice (a small sketch of this and the previous strategy follows this list).

6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income. Decision trees and Bayesian inference are described in detail in Chapters 8 and 9, respectively, while regression is introduced in Section 3.4.5.
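A minimal pandas sketch of strategies 4 and 5, filling missing income with the overall median and, alternatively, with the per-class (credit-risk) mean; the column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "high"],
    "income": [52000, np.nan, 31000, 28000, np.nan],
})

# Strategy 4: overall central tendency (the median is safer for skewed data)
df["income_overall"] = df["income"].fillna(df["income"].median())

# Strategy 5: mean income within the same credit-risk class
df["income_by_class"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)
print(df)
```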

Noisy Data

"What is noise?" Noise is a random error or variance in a measured variable. In Chapter 2, we saw how some basic statistical description techniques (e.g., boxplots and scatter plots) and methods of data visualization can be used to identify outliers, which may represent noise. Given a numeric attribute such as, say, price, how can we "smooth" out the data to remove the noise? Let's look at the following data smoothing techniques:

Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it. The sorted values are distributed into a number of "buckets," or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. Figure 3.2 illustrates some binning techniques. In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9.

Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value. In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equal width, where the interval range of values in each bin is constant. Binning is also used as a discretization technique and is further discussed in Section 3.5.
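The sketch below illustrates equal-frequency binning with smoothing by bin means and by bin boundaries. The first bin reproduces the 4, 8, 15 example from the text; the remaining price values are made up to complete the illustration.

```python
def equal_frequency_bins(sorted_values, bin_size):
    return [sorted_values[i:i + bin_size] for i in range(0, len(sorted_values), bin_size)]

def smooth_by_means(bins):
    # Every value in a bin is replaced by the bin mean
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace each value by the closer of the bin's minimum or maximum
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
bins = equal_frequency_bins(prices, 3)
print(bins)                    # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))   # first bin becomes [9.0, 9.0, 9.0]
print(smooth_by_boundaries(bins))
```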

Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function. Linear regression involves finding the "best" line to fit two attributes (or variables), so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface. Regression is further described in Section 3.4.5.

Outlier analysis: Outliers may be detected by clustering, for example, where similar values are organized into groups, or "clusters." Intuitively, values that fall outside of the set of clusters may be considered outliers (Figure 3.3). Chapter 12 is dedicated to the topic of outlier analysis.

Data Integration

Data mining often requires data integration, the merging of data from multiple data stores. Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help improve the accuracy and speed of the subsequent data mining process.

Data Transformation Strategies Overview

In data transformation, the data are transformed or consolidated into forms appropriate for mining.

Strategies for data transformation include the following:

1. Smoothing, which works to remove noise from the data. Techniques include binning, regression, and clustering.

2. Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.

3. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for data analysis at multiple abstraction levels.

4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as -1.0 to 1.0, or 0.0 to 1.0.

5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0-10, 11-20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute. Figure 3.12 shows a concept hierarchy for the attribute price. More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users.

6. Concept hierarchy generation for nominal data, where attributes such as street can be generalized to higher-level concepts, like city or country. Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level.

Data Transformation by Normalization

The measurement unit used can affect the data analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results. In general, expressing an attribute in smaller units will lead to a larger range for that attribute, and thus tend to give such an attribute greater effect or "weight." To help avoid dependence on the choice of measurement units, the data should be normalized or standardized. This involves transforming the data to fall within a smaller or common range, such as [-1, 1] or [0.0, 1.0]. (The terms standardize and normalize are used interchangeably in data preprocessing, although in statistics the latter term also has other connotations.)

Normalizing the data attempts to give all attributes an equal weight. Normalization is particularly useful for classification algorithms involving neural networks, or for distance measurements such as nearest-neighbor classification and clustering. If using the neural network backpropagation algorithm for classification mining (Chapter 9), normalizing the input values for each attribute measured in the training tuples will help speed up the learning phase. For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes). It is also useful when given no prior knowledge of the data.

There are many methods for data normalization. We study min-max normalization, z-score normalization, and normalization by decimal scaling. For our discussion, let A be a numeric attribute with n observed values v_1, v_2, ..., v_n.

Min-max normalization performs a linear transformation on the original data Suppose that minA and maxA are the minimum and maximum values of an attribute A Min-max normalization maps a value vi of A to v0 i in the range [new minAnew maxA] by computing

Min-max normalization preserves the relationships among the original data values It will encounter an ldquoout-of-boundsrdquo error if a future input case for normalization falls outside of the original data range for A

In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean (i.e., average) and standard deviation of A. A value v_i of A is normalized to v'_i by computing:
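
Again the formula is missing from the extracted text; the usual z-score transformation is

v'_i = (v_i - mean_A) / stddev_A

where mean_A and stddev_A are the mean and standard deviation of the observed values of A.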

Decimal scaling: suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3) so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
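
In general (the formula was lost in extraction), normalization by decimal scaling computes

v'_i = v_i / 10^j

where j is the smallest integer such that max(|v'_i|) < 1.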

Note that normalization can change the original data quite a bit, especially when using z-score normalization or decimal scaling. It is also necessary to save the normalization parameters (e.g., the mean and standard deviation if using z-score normalization) so that future data can be normalized in a uniform manner.
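
A minimal sketch of this advice (the attribute values below are made up for illustration): fit the z-score parameters once on the original data, keep them, and reuse them on future data.

```python
# Fit z-score parameters on the original data, then reuse them on new values.
import statistics

train_income = [31000, 42000, 56000, 73000, 88000]      # hypothetical observed values of A
mean_a = statistics.mean(train_income)
std_a = statistics.stdev(train_income)                  # saved normalization parameters

def z_score(value, mean, std):
    """Normalize a single value with previously fitted parameters."""
    return (value - mean) / std

train_norm = [z_score(v, mean_a, std_a) for v in train_income]
future_norm = z_score(64000, mean_a, std_a)              # new data normalized uniformly
print(train_norm, future_norm)
```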

Discretization by Binning

Binning is a top-down splitting technique based on a specified number of bins. Section 3.2.2 discussed binning methods for data smoothing. These methods are also used as discretization methods for data reduction and concept hierarchy generation. For example, attribute values can be discretized by applying equal-width or equal-frequency binning, and then replacing each bin value by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively. These techniques can be applied recursively to the resulting partitions to generate concept hierarchies. Binning does not use class information and is therefore an unsupervised discretization technique. It is sensitive to the user-specified number of bins, as well as the presence of outliers.
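
A small illustration (the ages below are invented, not part of the original notes) of equal-width binning used for discretization into interval labels:

```python
# Equal-width binning: discretize a numeric attribute into k interval labels.
ages = [13, 15, 16, 19, 20, 21, 25, 25, 30, 33, 35, 40, 45, 52, 70]

def equal_width_bins(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), k - 1)   # clamp the maximum into the last bin
        left, right = lo + idx * width, lo + (idx + 1) * width
        labels.append(f"({left:.0f}, {right:.0f}]")
    return labels

print(equal_width_bins(ages, 3))   # interval labels replacing the raw ages
```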

Distance/Similarity Measures

Similarity: a measure of how close two instances are to each other. The "closer" the instances are to each other, the larger the similarity value.

Dissimilarity: a measure of how different two instances are. Dissimilarity is large when instances are very different, and small when they are close.

Proximity refers to either similarity or dissimilarity.

Distance metric: a measure of dissimilarity that obeys the following laws (the metric axioms, sometimes called the laws of the triangular norm).
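
The laws themselves are missing from the extracted notes; for a distance d and instances x, y, z they are the standard metric axioms:

• d(x, y) >= 0, and d(x, y) = 0 exactly when x = y (non-negativity and identity)
• d(x, y) = d(y, x) (symmetry)
• d(x, z) <= d(x, y) + d(y, z) (triangle inequality)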

Conversion of similarity and dissimilarity measures

Typically, given a similarity measure, one can "invert" it to serve as the dissimilarity measure, and vice versa.

The conversions may differ. E.g., if d is a distance measure, one can use a decreasing function of d as the corresponding similarity measure; if s is a similarity measure that ranges between 0 and 1 (a so-called degree of similarity), then the corresponding dissimilarity measure can be defined as shown below.
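
The concrete formulas were lost in extraction; two common choices, consistent with the surrounding text, are

s = 1 / (1 + d)        (similarity obtained from a distance d)
d = 1 - s              (dissimilarity obtained from a degree of similarity s in [0, 1])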

In general, any monotonically decreasing transformation can be applied to convert similarity measures into dissimilarity measures, and any monotonically increasing transformation can be applied to convert the measures the other way around.

Distance Metrics for Numeric Attributes

When the data set is presented in a standard form, each instance can be treated as a vector x = (x_1, ..., x_N) of measures for attributes numbered 1, ..., N.

Consider, for now, only non-nominal scales.
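
The metrics themselves do not appear in the extracted text; the usual choices for such vectors are the Minkowski family, with Manhattan distance (h = 1) and Euclidean distance (h = 2) as special cases:

d(x, y) = ( Σ_{k=1..N} |x_k - y_k|^h )^(1/h)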

Computing User Similarity

A critical design decision in implementing user-user CF is the choice of similarity function. Several different similarity functions have been proposed and evaluated in the literature. Pearson correlation: this method computes the statistical correlation (Pearson's r) between two users' common ratings to determine their similarity. GroupLens and BellCore both used this method. The correlation is computed by the following:
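
The formula did not survive extraction; the standard Pearson formulation over the set I_uv of items co-rated by users u and v is

sim(u, v) = Σ_{i in I_uv} (r_{u,i} - rbar_u)(r_{v,i} - rbar_v) / ( sqrt(Σ_{i in I_uv} (r_{u,i} - rbar_u)^2) * sqrt(Σ_{i in I_uv} (r_{v,i} - rbar_v)^2) )

where rbar_u and rbar_v are the mean ratings of users u and v.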


• Help others. Some users are happy to contribute information, e.g., their evaluation of items (ratings), because they believe that the community benefits from their contribution. This could be a major motivation for entering information into a recommender system that is not used routinely. For instance, with a car RS, a user who has already bought her new car is aware that the rating entered in the system is more likely to be useful for other users than for the next time she buys a car.

• Influence others. In Web-based RSs, there are users whose main goal is to explicitly influence other users into purchasing particular products. As a matter of fact, there are also some malicious users that may use the system just to promote or penalize certain items (see Chapter 25).

In any case, as a general classification, the data used by RSs refers to three kinds of objects: items, users, and transactions, i.e., relations between users and items.

Items. Items are the objects that are recommended. Items may be characterized by their complexity and their value or utility. The value of an item may be positive if the item is useful for the user, or negative if the item is not appropriate and the user made a wrong decision when selecting it. We note that when a user is acquiring an item she will always incur a cost, which includes the cognitive cost of searching for the item and the real monetary cost eventually paid for the item. For instance, the designer of a news RS must take into account the complexity of a news item, i.e., its structure, the textual representation, and the time-dependent importance of any news item. But at the same time, the RS designer must understand that even if the user is not paying for reading news, there is always a cognitive cost associated with searching and reading news items. If a selected item is relevant for the user, this cost is dominated by the benefit of having acquired useful information, whereas if the item is not relevant, the net value of that item for the user, and of its recommendation, is negative. In other domains, e.g., cars or financial investments, the true monetary cost of the items becomes an important element to consider when selecting the most appropriate recommendation approach. Items with low complexity and value are news, Web pages, books, CDs, and movies. Items with larger complexity and value are digital cameras, mobile phones, PCs, etc. The most complex items that have been considered are insurance policies, financial investments, travels, and jobs. RSs, according to their core technology, can use a range of properties and features of the items. For example, in a movie recommender system, the genre (such as comedy, thriller, etc.), as well as the director and actors, can be used to describe a movie and to learn how the utility of an item depends on its features. Items can be represented using various information and representation approaches, e.g., in a minimalist way as a single id code, or in a richer form as a set of attributes, or even as a concept in an ontological representation of the domain.

Users. Users of a RS, as mentioned above, may have very diverse goals and characteristics. In order to personalize the recommendations and the human-computer interaction, RSs exploit a range of information about the users. This information can be structured in various ways, and again the selection of what information to model depends on the recommendation technique. For instance, in collaborative filtering, users are modeled as a simple list containing the ratings provided by the user for some items. In a demographic RS, sociodemographic attributes such as age, gender, profession, and education are used. User data is said to constitute the user model. The user model profiles the user, i.e., encodes her preferences and needs. Various user modeling approaches have been used, and in a certain sense a RS can be viewed as a tool that generates recommendations by building and exploiting user models. Since no personalization is possible without a convenient user model, unless the recommendation is non-personalized as in the top-10 selection, the user model will always play a central role. For instance, considering again a collaborative filtering approach, the user is either profiled directly by her ratings of items or, using these ratings, the system derives a vector of factor values, where users differ in how much weight each factor has in their model. Users can also be described by their behavior pattern data, for example, site browsing patterns (in a Web-based recommender system) or travel search patterns (in a travel recommender system). Moreover, user data may include relations between users, such as the trust level of these relations between users. A RS might utilize this information to recommend items to users that were preferred by similar or trusted users.

Transactions. We generically refer to a transaction as a recorded interaction between a user and the RS. Transactions are log-like data that store important information generated during the human-computer interaction and which are useful for the recommendation generation algorithm that the system is using. For instance, a transaction log may contain a reference to the item selected by the user and a description of the context (e.g., the user goal/query) for that particular recommendation. If available, the transaction may also include explicit feedback the user has provided, such as the rating for the selected item.

In fact, ratings are the most popular form of transaction data that a RS collects. These ratings may be collected explicitly or implicitly. In the explicit collection of ratings, the user is asked to provide her opinion about an item on a rating scale. Ratings can take on a variety of forms:

• Numerical ratings, such as the 1-5 stars provided in the book recommender associated with Amazon.com.

• Ordinal ratings, such as "strongly agree, agree, neutral, disagree, strongly disagree," where the user is asked to select the term that best indicates her opinion regarding an item (usually via questionnaire).

• Binary ratings, which model choices in which the user is simply asked to decide if a certain item is good or bad.

• Unary ratings, which can indicate that a user has observed or purchased an item, or otherwise rated the item positively. In such cases, the absence of a rating indicates that we have no information relating the user to the item (perhaps she purchased the item somewhere else).

Another form of user evaluation consists of tags associated by the user with the items the system presents. For instance, in the MovieLens RS (http://movielens.umn.edu), tags represent how MovieLens users feel about a movie, e.g., "too long" or "acting." In transactions collecting implicit ratings, the system aims to infer the user's opinion based on the user's actions. For example, if a user enters the keyword "Yoga" at Amazon.com, she will be provided with a long list of books. In return, the user may click on a certain book on the list in order to receive additional information. At this point, the system may infer that the user is somewhat interested in that book.

In conversational systems, i.e., systems that support an interactive process, the transaction model is more refined. In these systems, user requests alternate with system actions. That is, the user may request a recommendation and the system may produce a suggestion list, but it can also request additional user preferences to provide the user with better results. Here, in the transaction model, the system collects the various requests-responses, and may eventually learn to modify its interaction strategy by observing the outcome of the recommendation process.

Recommendation Techniques

In order to implement its core function, identifying the useful items for the user, a RS must predict that an item is worth recommending. In order to do this, the system must be able to predict the utility of some items, or at least compare the utility of some items, and then decide what items to recommend based on this comparison. The prediction step may not be explicit in the recommendation algorithm, but we can still apply this unifying model to describe the general role of a RS.

To illustrate the prediction step of a RS, consider, for instance, a simple non-personalized recommendation algorithm that recommends just the most popular songs. The rationale for using this approach is that, in the absence of more precise information about the user's preferences, a popular song, i.e., something that is liked (high utility) by many users, will probably also be liked by a generic user, at least more than another randomly selected song. Hence the utility of these popular songs is predicted to be reasonably high for this generic user.

This view of the core recommendation computation as the prediction of the utility of an item for a user has been suggested in the literature. The degree of utility of the user u for the item i is modeled as a (real-valued) function R(u, i), as is normally done in collaborative filtering by considering the ratings of users for items. Then the fundamental task of a collaborative filtering RS is to predict the value of R over pairs of users and items, i.e., to compute ˆR(u, i), where we denote with ˆR the estimate, computed by the RS, of the true function R. Consequently, having computed this prediction for the active user u on a set of items, i.e., ˆR(u, i_1), ..., ˆR(u, i_N), the system will recommend the items i_{j_1}, ..., i_{j_K} (K <= N) with the largest predicted utility. K is typically a small number, i.e., much smaller than the cardinality of the item data set or of the items on which a user utility prediction can be computed; i.e., RSs "filter" the items that are recommended to users.
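
A minimal sketch of this "predict, then keep the K largest" step (the predictor and the item scores below are hypothetical, not the notes' own code):

```python
# Rank candidate items by predicted utility and recommend the top K.
def recommend_top_k(user, candidate_items, predict_rating, k=5):
    """predict_rating(user, item) plays the role of the estimate R-hat(u, i)."""
    scored = [(item, predict_rating(user, item)) for item in candidate_items]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in scored[:k]]

# Toy predictor for illustration only.
toy_scores = {"song_a": 4.6, "song_b": 3.1, "song_c": 4.9, "song_d": 2.0}
top = recommend_top_k("u1", toy_scores, lambda u, i: toy_scores[i], k=2)
print(top)   # ['song_c', 'song_a']
```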

As mentioned above, some recommender systems do not fully estimate the utility before making a recommendation, but they may apply some heuristics to hypothesize that an item is of use to a user. This is typical, for instance, in knowledge-based systems. These utility predictions are computed with specific algorithms (see below) and use various kinds of knowledge about users, items, and the utility function itself. For instance, the system may assume that the utility function is Boolean, and therefore it will just determine whether an item is or is not useful for the user. Consequently, assuming that there is some available knowledge (possibly none) about the user who is requesting the recommendation, as well as knowledge about items and about other users who received recommendations, the system will leverage this knowledge with an appropriate algorithm to generate various utility predictions, and hence recommendations.

To provide a first overview of the different types of RSs, we want to quote a taxonomy provided by [25] that has become a classical way of distinguishing between recommender systems and referring to them. It distinguishes between six different classes of recommendation approaches:

Content-based: The system learns to recommend items that are similar to the ones that the user liked in the past. The similarity of items is calculated based on the features associated with the compared items. For example, if a user has positively rated a movie that belongs to the comedy genre, then the system can learn to recommend other movies from this genre. Chapter 3 provides an overview of content-based recommender systems, imposing some order among the extensive and diverse aspects involved in their design and implementation. It presents the basic concepts and terminology of content-based RSs, their high-level architecture, and their main advantages and drawbacks. The chapter then surveys state-of-the-art systems that have been adopted in several application domains. The survey encompasses a thorough description of both classical and advanced techniques for representing items and user profiles. Finally, it discusses trends and future research which might lead towards the next generation of recommender systems.

Collaborative filtering: The simplest and original implementation of this approach [93] recommends to the active user the items that other users with similar tastes liked in the past. The similarity in taste of two users is calculated based on the similarity in the rating history of the users. This is the reason why [94] refers to collaborative filtering as "people-to-people correlation." Collaborative filtering is considered to be the most popular and widely implemented technique in RSs.

A comprehensive survey of neighborhood-based methods for collaborative filtering is given in a dedicated chapter. Neighborhood methods focus on relationships between items or, alternatively, between users. An item-item approach models the preference of a user for an item based on ratings of similar items by the same user. Nearest-neighbor methods enjoy considerable popularity due to their simplicity, their efficiency, and their ability to produce accurate and personalized recommendations. The authors address the essential decisions that are required when implementing a neighborhood-based recommender system and provide practical information on how to make such decisions.
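
Not spelled out in the notes, but useful for reference: a common user-based neighborhood prediction combines the ratings of the set N(u) of users most similar to u as

ˆr_{u,i} = rbar_u + ( Σ_{v in N(u)} sim(u, v) * (r_{v,i} - rbar_v) ) / ( Σ_{v in N(u)} |sim(u, v)| )

where sim(u, v) is, for example, the Pearson similarity given earlier and rbar_u, rbar_v are the users' mean ratings.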

Finally, the chapter deals with problems of data sparsity and limited coverage, often observed in large commercial recommender systems. A few solutions to overcome these problems are presented.

Demographic: This type of system recommends items based on the demographic profile of the user. The assumption is that different recommendations should be generated for different demographic niches. Many Web sites adopt simple and effective personalization solutions based on demographics. For example, users are dispatched to particular Web sites based on their language or country, or suggestions may be customized according to the age of the user. While these approaches have been quite popular in the marketing literature, there has been relatively little proper RS research into demographic systems.

Knowledge-based: Knowledge-based systems recommend items based on specific domain knowledge about how certain item features meet users' needs and preferences, and ultimately how the item is useful for the user. Notable knowledge-based recommender systems are case-based. In these systems, a similarity function estimates how well the user's needs (the problem description) match the recommendations (the solutions of the problem). Here the similarity score can be directly interpreted as the utility of the recommendation for the user. Constraint-based systems are another type of knowledge-based RS.

In terms of the knowledge used, both kinds of systems are similar: user requirements are collected, repairs for inconsistent requirements are automatically proposed in situations where no solutions could be found, and recommendation results are explained. The major difference lies in the way solutions are calculated. Case-based recommenders determine recommendations on the basis of similarity metrics, whereas constraint-based recommenders predominantly exploit predefined knowledge bases that contain explicit rules about how to relate customer requirements with item features.

Knowledge-based systems tend to work better than others at the beginning of their deployment, but if they are not equipped with learning components they may be surpassed by other, shallower methods that can exploit the logs of the human-computer interaction (as in CF).

Community-based: This type of system recommends items based on the preferences of the user's friends. This technique follows the epigram "Tell me who your friends are, and I will tell you who you are." Evidence suggests that people tend to rely more on recommendations from their friends than on recommendations from similar but anonymous individuals. This observation, combined with the growing popularity of open social networks, is generating a rising interest in community-based systems or, as they are usually referred to, social recommender systems. This type of RS models and acquires information about the social relations of the users and the preferences of the user's friends. The recommendation is based on ratings that were provided by the user's friends. In fact, these RSs are following the rise of social networks and enable a simple and comprehensive acquisition of data related to the social relations of the users.

Hybrid recommender systems: These RSs are based on the combination of the above-mentioned techniques. A hybrid system combining techniques A and B tries to use the advantages of A to fix the disadvantages of B.

For instance, CF methods suffer from new-item problems, i.e., they cannot recommend items that have no ratings. This does not limit content-based approaches, since the prediction for new items is based on their description (features), which is typically easily available. Given two (or more) basic RS techniques, several ways have been proposed for combining them to create a new hybrid system. As we have already mentioned, the context of the user when she is seeking a recommendation can be used to better personalize the output of the system. For example, in a temporal context, vacation recommendations in winter should be very different from those provided in summer, and a restaurant recommendation for a Saturday evening with your friends should be different from that suggested for a workday lunch with co-workers.

Application and Evaluation

Recommender system research is being conducted with a strong emphasis on practice and commercial applications, since, aside from its theoretical contribution, it is generally aimed at practically improving commercial RSs. Thus, RS research involves practical aspects that apply to the implementation of these systems. These aspects are relevant to different stages in the life cycle of a RS, namely the design of the system, its implementation, and its maintenance and enhancement during system operation.

The aspects that apply to the design stage include factors that might affect the choice of the algorithm. The first factor to consider, the application's domain, has a major effect on the algorithmic approach that should be taken. [72] provide a taxonomy of RSs and classify existing RS applications into specific application domains. Based on these specific application domains, we define more general classes of domains for the most common recommender system applications:

• Entertainment: recommendations for movies, music, and IPTV.

• Content: personalized newspapers, recommendation of documents, recommendations of Web pages, e-learning applications, and e-mail filters.

• E-commerce: recommendations for consumers of products to buy, such as books, cameras, PCs, etc.

• Services: recommendations of travel services, recommendation of experts for consultation, recommendation of houses to rent, or matchmaking services.

Data Collection

The logical component in charge of pre-processing the data and generating the input of the recommender algorithm is referred to as the data collector. The data collector gathers data from different sources, such as the EPG for information about the live programs, the content provider for information about the VOD catalog, and the service provider for information about the users.

The Fastweb recommender system does not rely on personal information about the users (e.g., age, gender, occupation). Recommendations are based on the users' past behavior (what they watched) and on any explicit preference they have expressed (e.g., preferred genres). If the users did not specify any explicit preferences, the system is able to infer them by analyzing the users' past activities.

An important question has been raised in Section 9.2: users interact with the IPTV system by means of the STB, but typically we cannot identify who is actually in front of the TV. Consequently, the STB collects the behavior and the preferences of a set of users (e.g., the members of a family). This represents a considerable problem, since we are limited to generating per-STB recommendations. In order to simplify the notation, in the rest of the paper we will refer to user and STB as the same entity. The user-disambiguation problem has been partially solved by separating the collected information according to the time slot it refers to. For instance, we can roughly assume the following pattern: housewives tend to watch TV during the morning, children during the afternoon, the whole family in the evening, while only adults watch TV during the night. By means of this simple time-slot distinction, we are able to distinguish among different potential users of the same STB.
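
A toy illustration of this time-slot heuristic (the hour boundaries below are assumptions for the sketch, not Fastweb's actual configuration):

```python
# Split the viewing events of one STB into coarse time slots, so that ratings
# collected in different slots are treated as belonging to different "users".
def time_slot(hour):
    if 6 <= hour < 12:
        return "morning"      # e.g., housewives
    if 12 <= hour < 18:
        return "afternoon"    # e.g., children
    if 18 <= hour < 23:
        return "evening"      # e.g., the whole family
    return "night"            # e.g., adults only

def stb_profile_id(stb_id, hour):
    """Derive a per-slot profile identifier from the STB id and the viewing hour."""
    return f"{stb_id}:{time_slot(hour)}"

print(stb_profile_id("stb_042", 21))   # 'stb_042:evening'
```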

Formally, the available information has been structured into two main matrices, practically stored in a relational database: the item-content matrix (ICM) and the user-rating matrix (URM).

The former describes the principal characteristics (metadata) of each item. In the following we will refer to the item-content matrix as W, whose elements w_{ci} represent the relevance of characteristic (metadata) c for item i. The ICM is generated from the analysis of the set of information given by the content provider (i.e., the EPG). Such information concerns, for instance, the title of a movie, the actors, the director(s), the genre(s), and the plot. Note that in a real environment we may have to deal with inaccurate information, especially because of the rate at which new content is added every day. The information provided by the ICM is used to generate content-based recommendations, after being filtered by means of techniques for PoS (part-of-speech) tagging, stop-word removal, and latent semantic analysis. Moreover, the ICM can be used to perform some kinds of processing on the items (e.g., parental control).

The URM represents the ratings (i.e., preferences) of users for items. In the following we will refer to this matrix as R, whose elements r_{pi} represent the rating of user p for item i. Such preferences constitute the basic information for any collaborative algorithm. A user rating can be either explicit or implicit, depending on whether the ratings are explicitly expressed by users or implicitly collected by the system.

Explicit ratings confidently represent the user's opinion, even though they can be affected by biases due to user subjectivity, item popularity, or global rating tendency. The first bias depends on arbitrary interpretations of the rating scale. For instance, on a rating scale between 1 and 5, some users could use the value 3 to indicate an interesting item, while others could use 3 for a not particularly interesting item. Similarly, popular items tend to be overrated, while unpopular items are usually underrated. Finally, explicit ratings can be affected by global attitudes (e.g., users are more willing to rate movies they like).

On the other hand, implicit ratings are inferred by the system on the basis of the user-system interaction, which might not match the user's opinion. For instance, the system is able to monitor whether a user has watched a live program on a certain channel, or whether the user has watched a movie without interruption. Although explicit ratings are more reliable than implicit ratings in representing the actual user interest in an item, their collection can be annoying from the user's perspective.

Long Tail

In statistics, a long tail of some distributions of numbers is the portion of the distribution having a large number of occurrences far from the head, or central part, of the distribution. The distribution could involve popularities, random numbers of occurrences of events with various probabilities, etc. A probability distribution is said to have a long tail if a larger share of the population rests within its tail than would under a normal distribution. A long-tail distribution will arise with the inclusion of many values unusually far from the mean, which increase the magnitude of the skewness of the distribution. A long-tailed distribution is a particular type of heavy-tailed distribution. The distribution and inventory costs of businesses successfully applying this strategy allow them to realize significant profit by selling small volumes of hard-to-find items to many customers, instead of only selling large volumes of a reduced number of popular items. The total sales of this large number of non-hit items is called the long tail.

Personalization

Today, personalization is something that occurs separately within each system that one interacts with. Recommender systems are one technique for personalization; in essence, the personalization occurs slowly as each system builds up information about your likes and dislikes, about what interests you and what fails to interest you. There are numerous other personalization techniques; most of these rely either on the collection of system usage history, which is then employed to change the behavior of the system, or on the user taking the time and trouble to explicitly personalize the behavior of the system in various ways, by setting parameters, making selections, or engaging in dialogs with the system.

There are several problems with this model, at least from the user's point of view. Investments in personalizing one system (either through explicit action or just long use) are not transferable to another system. (Of course, from the system operator's point of view this may be very desirable: it increases switching costs for users and thus helps lock in a user base.) Information such as likes and dislikes or usage patterns is scattered across multiple systems and can't be combined to obtain maximum leverage. And the user does not have control of the information bases that define his or her profile. If you want to buy books from multiple online booksellers, this is annoying. But if we are concerned with developing information discovery systems to assist users in a world of information overload, these problems are critical. People obtain information from a multiplicity of sources, and personalization has to happen close to the end user; this is the only place where there is enough information to do personalization effectively, to keep track of what's new and what isn't, what has and has not proven useful. The user needs to become a hub and a switch, moving data to allow accurate personalization from one system to another.

Levels of measurement

What a scale actually means, and what we can do with it, depends on what its numbers represent. Numbers can be grouped into four types, or levels: nominal, ordinal, interval, and ratio. Nominal is the simplest and ratio the most sophisticated. Each level possesses the characteristics of the preceding level, plus an additional quality.

Nominal

Nominal is hardly measurement. It refers to quality more than quantity. A nominal level of measurement is simply a matter of distinguishing by name, e.g., 1 = male, 2 = female. Even though we are using the numbers 1 and 2, they do not denote quantity. The binary categories of 0 and 1 used for computers are a nominal level of measurement. They are categories or classifications. Nominal measurement is like using the categorical levels of variables described in the Doing Scientific Research section of the Introduction module.

Examples: MEAL PREFERENCE: Breakfast, Lunch, Dinner. RELIGIOUS PREFERENCE: 1 = Buddhist, 2 = Muslim, 3 = Christian, 4 = Jewish, 5 = Other. POLITICAL ORIENTATION: Republican, Democratic, Libertarian, Green.

Nominal time of day: categories only, with no additional information.

Ordinal

Ordinal refers to order in measurement. An ordinal scale indicates direction, in addition to providing nominal information. Low/Medium/High or Faster/Slower are examples of ordinal levels of measurement. Ranking an experience as a nine on a scale of 1 to 10 tells us that it was higher than an experience ranked as a six. Many psychological scales or inventories are at the ordinal level of measurement.

Examples: RANK: 1st place, 2nd place, ..., last place. LEVEL OF AGREEMENT: No, Maybe, Yes. POLITICAL ORIENTATION: Left, Center, Right.

Ordinal time of day: indicates direction or order of occurrence; the spacing between values is uneven.

Interval

Interval scales provide information about order, and also possess equal intervals. From the previous example, if we knew that the distance between 1 and 2 was the same as that between 7 and 8 on our 10-point rating scale, then we would have an interval scale. An example of an interval scale is temperature, measured on either a Fahrenheit or a Celsius scale. A degree represents the same underlying amount of heat, regardless of where it occurs on the scale. Measured in Fahrenheit units, the difference between a temperature of 46 and 42 is the same as the difference between 72 and 68. Equal-interval scales of measurement can be devised for opinions and attitudes. Constructing them involves an understanding of mathematical and statistical principles beyond those covered in this course. But it is important to understand the different levels of measurement when using and interpreting scales.

Examples: TIME OF DAY on a 12-hour clock. POLITICAL ORIENTATION: score on a standardized scale of political orientation. OTHER: scales constructed so as to possess equal intervals.

Interval time of day: equal intervals on an analog (12-hr) clock; the difference between 1 and 2 pm is the same as the difference between 11 and 12 am.

Ratio

In addition to possessing the qualities of nominal, ordinal, and interval scales, a ratio scale has an absolute zero (a point where none of the quality being measured exists). Using a ratio scale permits comparisons such as being twice as high, or one-half as much. Reaction time (how long it takes to respond to a signal of some sort) uses a ratio scale of measurement: time. Although an individual's reaction time is always greater than zero, we conceptualize a zero point in time and can state that a response of 24 milliseconds is twice as fast as a response time of 48 milliseconds.

Examples: RULER: inches or centimeters. YEARS of work experience. INCOME: money earned last year. NUMBER of children. GPA: grade point average.

Ratio time of day: 24-hr time has an absolute 0 (midnight); 14 o'clock is twice as far from midnight as 7 o'clock.

Applications

The level of measurement for a particular variable is defined by the highest category that it achieves. For example, categorizing someone as extroverted (outgoing) or introverted (shy) is nominal. If we categorize people as 1 = shy, 2 = neither shy nor outgoing, 3 = outgoing, then we have an ordinal level of measurement. If we use a standardized measure of shyness (and there are such inventories), we would probably assume that the shyness variable meets the standards of an interval level of measurement. As to whether or not we might have a ratio scale of shyness: although we might be able to measure zero shyness, it would be difficult to devise a scale where we would be comfortable talking about someone's being three times as shy as someone else.

Measurement at the interval or ratio level is desirable because we can use the more powerful statistical procedures available for means and standard deviations. To have this advantage, ordinal data are often treated as though they were interval, for example, subjective rating scales (1 = terrible, 2 = poor, 3 = fair, 4 = good, 5 = excellent). The scale probably does not meet the requirement of equal intervals -- we don't know that the difference between 2 (poor) and 3 (fair) is the same as the difference between 4 (good) and 5 (excellent). In order to take advantage of more powerful statistical techniques, researchers often assume that the intervals are equal.

Data Preprocessing

Data have quality if they satisfy the requirements of the intended use. There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.

Imagine that you are a manager at AllElectronics and have been charged with analyzing the company's data with respect to your branch's sales. You immediately set out to perform this task. You carefully inspect the company's database and data warehouse, identifying and selecting the attributes or dimensions (e.g., item, price, and units sold) to be included in your analysis. Alas! You notice that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions. In other words, the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), inaccurate or noisy (containing errors, or values that deviate from the expected), and inconsistent (e.g., containing discrepancies in the department codes used to categorize items). Welcome to the real world!

This scenario illustrates three of the elements defining data quality: accuracy, completeness, and consistency. Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses. There are many possible reasons for inaccurate data (i.e., having incorrect attribute values). The data collection instruments used may be faulty. There may have been human or computer errors occurring at data entry. Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (e.g., by choosing the default value "January 1" displayed for birthday). This is known as disguised missing data. Errors in data transmission can also occur. There may be technology limitations, such as limited buffer size for coordinating synchronized data transfer and consumption. Incorrect data may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g., date). Duplicate tuples also require data cleaning.

Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because they were not considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding, or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the data history or modifications may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.

Recall that data quality depends on the intended use of the data. Two different users may have very different assessments of the quality of a given database. For example, a marketing analyst may need to access the database mentioned before for a list of customer addresses. Some of the addresses are outdated or incorrect, yet overall, 80% of the addresses are accurate. The marketing analyst considers this to be a large customer database for target marketing purposes and is pleased with the database's accuracy, although, as sales manager, you found the data inaccurate.

Timeliness also affects data quality. Suppose that you are overseeing the distribution of monthly sales bonuses to the top sales representatives at AllElectronics. Several sales representatives, however, fail to submit their sales records on time at the end of the month. There are also a number of corrections and adjustments that flow in after the month's end. For a period of time following each month, the data stored in the database are incomplete. However, once all of the data are received, they are correct. The fact that the month-end data are not updated in a timely fashion has a negative impact on the data quality.

Two other factors affecting data quality are believability and interpretability. Believability reflects how much the data are trusted by users, while interpretability reflects how easily the data are understood. Suppose that a database at one point had several errors, all of which have since been corrected. The past errors, however, had caused many problems for sales department users, and so they no longer trust the data. The data also use many accounting codes, which the sales department does not know how to interpret. Even though the database is now accurate, complete, consistent, and timely, sales department users may regard it as of low quality due to poor believability and interpretability.

Data Cleaning

Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. In this section, you will study basic methods for data cleaning. Section 3.2.1 looks at ways of handling missing values. Section 3.2.2 explains data smoothing techniques. Section 3.2.3 discusses approaches to data cleaning as a process.

3.2.1 Missing Values

Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded value for several attributes, such as customer income. How can you go about filling in the missing values for this attribute? Let's look at the following methods (a short sketch of methods 4 and 5 follows the list).

1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably. By ignoring the tuple, we do not make use of the remaining attributes' values in the tuple. Such data could have been useful to the task at hand.

2. Fill in the missing value manually: In general, this approach is time consuming and may not be feasible given a large data set with many missing values.

3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown" or -∞. If missing values are replaced by, say, "Unknown," then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common: that of "Unknown." Hence, although this method is simple, it is not foolproof.

4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value: Chapter 2 discussed measures of central tendency, which indicate the "middle" value of a data distribution. For normal (symmetric) data distributions, the mean can be used, while skewed data distributions should employ the median (Section 2.2). For example, suppose that the data distribution regarding the income of AllElectronics customers is symmetric and that the mean income is $56,000. Use this value to replace the missing value for income.

5. Use the attribute mean or median for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, we may replace the missing value with the mean income value for customers in the same credit risk category as that of the given tuple. If the data distribution for a given class is skewed, the median value is a better choice.

6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income. Decision trees and Bayesian inference are described in detail in Chapters 8 and 9, respectively, while regression is introduced in Section 3.4.5.
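
A minimal sketch of methods 4 and 5 (global mean fill versus class-conditional mean fill) on made-up tuples:

```python
# Fill missing 'income' values: method 4 uses the overall mean,
# method 5 uses the mean of tuples in the same credit-risk class.
from statistics import mean

rows = [
    {"income": 40000, "risk": "low"},
    {"income": None,  "risk": "low"},
    {"income": 90000, "risk": "high"},
    {"income": None,  "risk": "high"},
]

global_mean = mean(r["income"] for r in rows if r["income"] is not None)

def class_mean(risk):
    vals = [r["income"] for r in rows if r["risk"] == risk and r["income"] is not None]
    return mean(vals) if vals else global_mean

for r in rows:
    if r["income"] is None:
        r["income"] = class_mean(r["risk"])      # method 5; use global_mean for method 4

print(rows)
```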

Noisy Data

"What is noise?" Noise is a random error or variance in a measured variable. In Chapter 2, we saw how some basic statistical description techniques (e.g., boxplots and scatter plots) and methods of data visualization can be used to identify outliers, which may represent noise. Given a numeric attribute such as, say, price, how can we "smooth" out the data to remove the noise? Let's look at the following data smoothing techniques.

Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it. The sorted values are distributed into a number of "buckets," or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. Figure 3.2 illustrates some binning techniques. In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9.

Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value. In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equal width, where the interval range of values in each bin is constant. Binning is also used as a discretization technique and is further discussed in Section 3.5.
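
A small sketch of equal-frequency binning followed by smoothing by bin means. It reproduces the 4, 8, 15 -> 9 example above; the remaining price values are assumed for illustration.

```python
# Equal-frequency bins of size 3, then replace each value by its bin mean.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])   # assumed example data

def smooth_by_bin_means(values, bin_size):
    smoothed = []
    for start in range(0, len(values), bin_size):
        bin_vals = values[start:start + bin_size]
        bin_mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([bin_mean] * len(bin_vals))
    return smoothed

print(smooth_by_bin_means(prices, 3))   # Bin 1 becomes [9.0, 9.0, 9.0], as in the text
```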

Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function. Linear regression involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface. Regression is further described in Section 3.4.5.

Outlier analysis: Outliers may be detected by clustering, for example, where similar values are organized into groups, or "clusters." Intuitively, values that fall outside of the set of clusters may be considered outliers (Figure 3.3). Chapter 12 is dedicated to the topic of outlier analysis.

Data Integration

Data mining often requires data integration, that is, the merging of data from multiple data stores. Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help improve the accuracy and speed of the subsequent data mining process.


site browsing patterns (in a Web-based recommender system) or travel search patterns (in a travel recommender system) Moreover user data may include relations between users such as the trust level of these relations between users A RS might utilize this information to recommend items to users that were preferred by similar or trusted users

Transactions We generically refer to a transaction as a recorded interaction between a user and the RS Transactions are log-like data that store important information generated during the human-computer interaction and which are useful for the recommendation generation algorithm that the system is using For instance a transaction log may contain a reference to the item selected by the user and a description of the context (eg the user goalquery) for that particular recommendation If available that transaction may also include an explicit feedback the user has provided such as the rating for the selected item

In fact ratings are the most popular form of transaction data that a RS collects These ratings may be collected explicitly or implicitly In the explicit collection of ratings the user is asked to provide her opinion about an item on a rating scale According to ratings can take on a variety of forms

bull Numerical ratings such as the 1-5 stars provided in the book recommender associated with Amazoncom

bull Ordinal ratings such as ldquostrongly agree agree neutral disagree strongly disagreerdquo where the user is asked to select the term that best indicates her opinion regarding an item (usually via questionnaire)

bull Binary ratings that model choices in which the user is simply asked to decide if a certain item is good or bad

bull Unary ratings can indicate that a user has observed or purchased an item or otherwise rated the item positively In such cases the absence of a rating indicates that we have no information relating the user to the item (perhaps she purchased the item somewhere else)

Another form of user evaluation consists of tags associated by the user with the items the system presents For instance in Movielens RS (httpmovielensumnedu) tags represent how MovieLens users feel about a movie eg ldquotoo longrdquo or ldquoactingrdquo In transactions collecting implicit ratings the system aims to infer the users opinion based on the userrsquos actions For example if a user enters the keyword ldquoYogardquo at Amazoncom she will be provided with a long list of books In return the user may click on a certain book on the list in order to receive additional information At this point the system may infer that the user is somewhat interested in that book

In conversational systems ie systems that support an interactive process the transaction model is more refined In these systems user requests alternate with system actions That is the user may request a recommendation and the system may produce a suggestion list But it can also request additional user preferences to provide the user with better results Here in the transaction model the system collects the various requests-responses and may eventually learn to modify its interaction strategy by observing the outcome of the recommendation process

Recommendation Techniques

In order to implement its core function identifying the useful items for the user a RS must predict that an item is worth recommending In order to do this the system must be able to predict the utility of some of them or at least compare the utility of some items and then decide what items to recommend based on this comparison The prediction step may not be explicit in the recommendation algorithm but we can still apply this unifying model to describe the general role of a RS

To illustrate the prediction step of a RS, consider, for instance, a simple non-personalized recommendation algorithm that recommends just the most popular songs. The rationale for using this approach is that, in the absence of more precise information about the user's preferences, a popular song, i.e., something that is liked (has high utility) for many users, will probably also be liked by a generic user, at least more than another randomly selected song. Hence the utility of these popular songs is predicted to be reasonably high for this generic user.
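
A minimal sketch of such a popularity-based recommender; the data layout and the "a rating of 4 or more counts as a like" rule are illustrative assumptions, not part of the original notes:

    from collections import Counter

    def most_popular(ratings, k=10):
        """Recommend the k items with the most positive ratings.

        ratings: iterable of (user_id, item_id, value) tuples; a rating is
        counted as a "like" when its value is at least 4 on a 1-5 scale.
        """
        likes = Counter(item for _, item, value in ratings if value >= 4)
        return [item for item, _ in likes.most_common(k)]

    # Toy usage
    ratings = [("u1", "songA", 5), ("u2", "songA", 4),
               ("u1", "songB", 2), ("u3", "songC", 5)]
    print(most_popular(ratings, k=2))   # ['songA', 'songC']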

This view of the core recommendation computation as the prediction of the utility of an item for a user has been suggested in the RS literature. The degree of utility of the user u for the item i is modeled as a (real-valued) function R(u, i), as is normally done in collaborative filtering by considering the ratings of users for items. Then the fundamental task of a collaborative filtering RS is to predict the value of R over pairs of users and items, i.e., to compute R^(u, i), where we denote by R^ the estimation computed by the RS of the true function R. Consequently, having computed this prediction for the active user u on a set of items, i.e., R^(u, i1), ..., R^(u, iN), the system will recommend the items i_j1, ..., i_jK (K ≤ N) with the largest predicted utility. K is typically a small number, i.e., much smaller than the cardinality of the item data set or of the items on which a user utility prediction can be computed; in other words, RSs "filter" the items that are recommended to users.
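
Once the estimates R^(u, i) are available for a candidate set of items, producing the recommendation list reduces to sorting by predicted utility and keeping the top K. A short sketch under that assumption; predict_utility stands in for whatever rating-prediction model is in use:

    def recommend_top_k(user, candidate_items, predict_utility, k=10):
        """Rank candidate items by the predicted utility R^(u, i) and return the K best."""
        scored = [(item, predict_utility(user, item)) for item in candidate_items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return [item for item, _ in scored[:k]]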

As mentioned above, some recommender systems do not fully estimate the utility before making a recommendation, but they may apply some heuristics to hypothesize that an item is of use to a user. This is typical, for instance, in knowledge-based systems. These utility predictions are computed with specific algorithms (see below) and use various kinds of knowledge about users, items, and the utility function itself. For instance, the system may assume that the utility function is Boolean, and therefore it will just determine whether an item is or is not useful for the user. Consequently, assuming that there is some available knowledge (possibly none) about the user who is requesting the recommendation, knowledge about items, and about other users who received recommendations, the system will leverage this knowledge with an appropriate algorithm to generate various utility predictions and hence recommendations.

To provide a first overview of the different types of RSs, we want to quote a taxonomy that has become a classical way of distinguishing between recommender systems and referring to them. [25] distinguishes between six different classes of recommendation approaches:

Content-based: The system learns to recommend items that are similar to the ones that the user liked in the past. The similarity of items is calculated based on the features associated with the compared items. For example, if a user has positively rated a movie that belongs to the comedy genre, then the system can learn to recommend other movies from this genre. Chapter 3 provides an overview of content-based recommender systems, imposing some order among the extensive and diverse aspects involved in their design and implementation. It presents the basic concepts and terminology of content-based RSs, their high-level architecture, and their main advantages and drawbacks. The chapter then surveys state-of-the-art systems that have been adopted in several application domains. The survey encompasses a thorough description of both classical and advanced techniques for representing items and user profiles. Finally, it discusses trends and future research which might lead towards the next generation of recommender systems.
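
As a hedged illustration of the content-based idea, the sketch below represents items by TF-IDF vectors built from their textual features and scores unseen items by cosine similarity to a profile obtained by averaging the vectors of the items the user liked. scikit-learn is assumed to be available, and the averaging rule is just one simple way of building a user profile:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    item_texts = {
        "m1": "comedy romance light-hearted",
        "m2": "comedy slapstick family",
        "m3": "war drama historical",
    }
    liked = ["m1"]                      # items the user rated positively

    ids = list(item_texts)
    docs = [item_texts[i] for i in ids]
    tfidf = TfidfVectorizer().fit_transform(docs)

    # User profile: mean TF-IDF vector of the liked items
    profile = np.asarray(tfidf[[ids.index(i) for i in liked]].mean(axis=0))
    scores = cosine_similarity(profile, tfidf).ravel()

    unseen = [i for i in ids if i not in liked]
    print(sorted(unseen, key=lambda i: scores[ids.index(i)], reverse=True))  # m2 ranks before m3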

Collaborative filtering: The simplest and original implementation of this approach [93] recommends to the active user the items that other users with similar tastes liked in the past. The similarity in taste of two users is calculated based on the similarity in the rating history of the users. This is the reason why [94] refers to collaborative filtering as "people-to-people correlation". Collaborative filtering is considered to be the most popular and widely implemented technique in RS.

A comprehensive survey of neighborhood-based methods for collaborative filtering is also provided. Neighborhood methods focus on relationships between items or, alternatively, between users. An item-item approach models the preference of a user for an item based on the ratings of similar items by the same user. Nearest-neighbor methods enjoy considerable popularity due to their simplicity, their efficiency, and their ability to produce accurate and personalized recommendations. The authors address the essential decisions that are required when implementing a neighborhood-based recommender system and provide practical information on how to make such decisions.
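
A hedged sketch of the item-item flavor of neighborhood CF: the rating of the active user for a target item is predicted as a similarity-weighted average of her ratings on the k most similar items. The similarity values are assumed to have been precomputed elsewhere (e.g., with an adjusted-cosine measure), and the data structures are illustrative:

    def predict_item_item(user_ratings, sim, target_item, k=20):
        """user_ratings: dict item -> rating given by the active user.
        sim: dict (item_a, item_b) -> precomputed similarity (assumed symmetric).
        Returns the predicted rating for target_item, or None if no rated neighbor exists."""
        neighbors = sorted(
            ((sim.get((target_item, j), 0.0), r)
             for j, r in user_ratings.items() if j != target_item),
            reverse=True)[:k]
        num = sum(s * r for s, r in neighbors if s > 0)
        den = sum(s for s, _ in neighbors if s > 0)
        return num / den if den > 0 else None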

Finally, the chapter deals with problems of data sparsity and limited coverage, often observed in large commercial recommender systems. A few solutions to overcome these problems are presented.

Demographic: This type of system recommends items based on the demographic profile of the user. The assumption is that different recommendations should be generated for different demographic niches. Many Web sites adopt simple and effective personalization solutions based on demographics. For example, users are dispatched to particular Web sites based on their language or country, or suggestions may be customized according to the age of the user. While these approaches have been quite popular in the marketing literature, there has been relatively little proper RS research into demographic systems.

Knowledge-based: Knowledge-based systems recommend items based on specific domain knowledge about how certain item features meet users' needs and preferences and, ultimately, how the item is useful for the user. Notable knowledge-based recommender systems are case-based. In these systems a similarity function estimates how well the user needs (problem description) match the recommendations (solutions of the problem). Here the similarity score can be directly interpreted as the utility of the recommendation for the user. Constraint-based systems are another type of knowledge-based RSs.

In terms of the knowledge used, both systems are similar: user requirements are collected, repairs for inconsistent requirements are automatically proposed in situations where no solutions could be found, and recommendation results are explained. The major difference lies in the way solutions are calculated. Case-based recommenders determine recommendations on the basis of similarity metrics, whereas constraint-based recommenders predominantly exploit predefined knowledge bases that contain explicit rules about how to relate customer requirements with item features.

Knowledge-based systems tend to work better than others at the beginning of their deployment, but if they are not equipped with learning components they may be surpassed by shallower methods that can exploit the logs of the human/computer interaction (as in CF).

Community-based: This type of system recommends items based on the preferences of the user's friends. This technique follows the epigram "Tell me who your friends are, and I will tell you who you are". Evidence suggests that people tend to rely more on recommendations from their friends than on recommendations from similar but anonymous individuals. This observation, combined with the growing popularity of open social networks, is generating a rising interest in community-based systems or, as they are usually referred to, social recommender systems. This type of RS models and acquires information about the social relations of the users and the preferences of the user's friends. The recommendation is based on ratings that were provided by the user's friends. In fact, these RSs are following the rise of social networks and enable a simple and comprehensive acquisition of data related to the social relations of the users.

Hybrid recommender systems: These RSs are based on the combination of the above-mentioned techniques. A hybrid system combining techniques A and B tries to use the advantages of A to fix the disadvantages of B.

For instance, CF methods suffer from the new-item problem, i.e., they cannot recommend items that have no ratings. This does not limit content-based approaches, since the prediction for new items is based on their description (features), which is typically easily available. Given two (or more) basic RS techniques, several ways have been proposed for combining them to create a new hybrid system. As we have already mentioned, the context of the user when she is seeking a recommendation can be used to better personalize the output of the system. For example, in a temporal context, vacation recommendations in winter should be very different from those provided in summer, and a restaurant recommendation for a Saturday evening with your friends should be different from that suggested for a workday lunch with co-workers.
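
One of the simplest combination schemes is a weighted hybrid, in which the final score is a linear blend of the component recommenders' scores. The sketch below assumes both components score on the same scale, and the weight alpha is a tunable parameter rather than a prescribed value; it also shows a natural fallback for the new-item case just mentioned:

    def weighted_hybrid(cf_score, cb_score, alpha=0.7):
        """Blend a collaborative-filtering score with a content-based score.
        When the CF score is unavailable (e.g., a new item with no ratings),
        fall back entirely on the content-based score."""
        if cf_score is None:
            return cb_score
        return alpha * cf_score + (1 - alpha) * cb_score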

Application and Evaluation

Recommender system research is conducted with a strong emphasis on practice and commercial applications, since, aside from its theoretical contribution, it is generally aimed at practically improving commercial RSs. Thus RS research involves practical aspects that apply to the implementation of these systems. These aspects are relevant to different stages in the life cycle of a RS, namely the design of the system, its implementation, and its maintenance and enhancement during system operation.

The aspects that apply to the design stage include factors that might affect the choice of the algorithm. The first factor to consider, the application's domain, has a major effect on the algorithmic approach that should be taken. [72] provide a taxonomy of RSs and classify existing RS applications into specific application domains. Based on these specific application domains, we define more general classes of domains for the most common recommender system applications:

• Entertainment - recommendations for movies, music, and IPTV.

• Content - personalized newspapers, recommendation for documents, recommendations of Web pages, e-learning applications, and e-mail filters.

• E-commerce - recommendations for consumers of products to buy, such as books, cameras, PCs, etc.

• Services - recommendations of travel services, recommendation of experts for consultation, recommendation of houses to rent, or matchmaking services.

Data Collection

The logical component in charge of pre-processing the data and generating the input of the recommender algorithm is referred to as the data collector. The data collector gathers data from different sources, such as the EPG for information about the live programs, the content provider for information about the VOD catalog, and the service provider for information about the users.

The Fastweb recommender system does not rely on personal information from the users (e.g., age, gender, occupation). Recommendations are based on the users' past behavior (what they watched) and on any explicit preference they have expressed (e.g., preferred genres). If the users did not specify any explicit preferences, the system is able to infer them by analyzing the users' past activities.

An important question has been raised in Section 9.2: users interact with the IPTV system by means of the STB, but typically we cannot identify who is actually in front of the TV. Consequently, the STB collects the behavior and the preferences of a set of users (e.g., the members of a family). This represents a considerable problem, since we are limited to generating per-STB recommendations. In order to simplify the notation, in the rest of the paper we will refer to user and STB as the same entity. The user-disambiguation problem has been partially solved by separating the collected information according to the time slot it refers to. For instance, we can roughly assume the following pattern: housewives tend to watch TV during the morning, children during the afternoon, the whole family in the evening, while only adults watch TV during the night. By means of this simple time-slot distinction, we are able to distinguish among different potential users of the same STB.
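
A minimal sketch of this time-slot heuristic; the slot boundaries and the audience labels are illustrative assumptions, not Fastweb's actual configuration. Each viewing event is attributed to a pseudo-user identified by the pair (STB id, time slot):

    def time_slot(hour):
        """Map the hour of a viewing event (0-23) to a coarse time slot."""
        if 6 <= hour < 12:
            return "morning"       # e.g., housewives
        if 12 <= hour < 18:
            return "afternoon"     # e.g., children
        if 18 <= hour < 23:
            return "evening"       # e.g., the whole family
        return "night"             # e.g., adults only

    def pseudo_user(stb_id, hour):
        return (stb_id, time_slot(hour))

    print(pseudo_user("stb-42", 21))   # ('stb-42', 'evening')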

Formally, the available information has been structured into two main matrices, practically stored in a relational database: the item-content matrix (ICM) and the user-rating matrix (URM).

The former describes the principal characteristics (metadata) of each item. In the following we will refer to the item-content matrix as W, whose elements w_ci represent the relevance of characteristic (metadata) c for item i. The ICM is generated from the analysis of the set of information given by the content provider (i.e., the EPG). Such information concerns, for instance, the title of a movie, the actors, the director(s), the genre(s), and the plot. Note that in a real environment we may be faced with inaccurate information, especially because of the rate at which new content is added every day. The information provided by the ICM is used to generate content-based recommendations, after being filtered by means of techniques for PoS (Part-of-Speech) tagging, stop-word removal, and latent semantic analysis. Moreover, the ICM can be used to perform some kind of processing on the items (e.g., parental control).

The URM represents the ratings (i.e., preferences) of users about items. In the following we will refer to this matrix as R, whose elements r_pi represent the rating of user p for item i. Such preferences constitute the basic information for any collaborative algorithm. The user rating can be either explicit or implicit, depending on whether the ratings are explicitly expressed by users or are implicitly collected by the system.
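
Since most user/item pairs have no rating, the URM is very sparse in practice; a hedged sketch of how R could be stored as a sparse matrix with scipy (the toy entries are illustrative):

    from scipy.sparse import csr_matrix

    # (user index p, item index i, rating r_pi)
    entries = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0)]
    rows, cols, vals = zip(*entries)
    R = csr_matrix((vals, (rows, cols)), shape=(2, 3))

    print(R[0, 2])                  # 3.0
    print(R.getrow(0).toarray())    # [[5. 0. 3.]]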

Explicit ratings confidently represent the user opinion, even though they can be affected by biases due to user subjectivity, item popularity, or global rating tendency. The first bias depends on arbitrary interpretations of the rating scale. For instance, on a rating scale between 1 and 5, some user could use the value 3 to indicate an interesting item, while someone else could use 3 for a not very interesting item. Similarly, popular items tend to be overrated, while unpopular items are usually underrated. Finally, explicit ratings can be affected by global attitudes (e.g., users are more willing to rate movies they like).

On the other hand, implicit ratings are inferred by the system on the basis of the user-system interaction, which might not match the user opinion. For instance, the system is able to monitor whether a user has watched a live program on a certain channel or whether the user has watched a movie uninterruptedly. Although explicit ratings are more reliable than implicit ratings in representing the actual user interest in an item, their collection can be annoying from the user's perspective.
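
A hedged sketch of how an implicit rating might be derived from viewing behavior; the watch-fraction thresholds and the mapping onto a 1-5 scale are illustrative assumptions, not the system's actual rule:

    def implicit_rating(watched_seconds, duration_seconds):
        """Infer an implicit 1-5 rating from the fraction of the program actually watched.
        Returns None when there is too little evidence."""
        if duration_seconds <= 0:
            return None
        fraction = watched_seconds / duration_seconds
        if fraction < 0.1:
            return None          # too little evidence
        if fraction < 0.3:
            return 2
        if fraction < 0.7:
            return 3
        if fraction < 0.9:
            return 4
        return 5                 # watched (almost) uninterruptedly

    print(implicit_rating(80 * 60, 90 * 60))   # 4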

Long Tail

In statistics, a long tail of some distributions of numbers is the portion of the distribution having a large number of occurrences far from the head or central part of the distribution. The distribution could involve popularities, random numbers of occurrences of events with various probabilities, etc. A probability distribution is said to have a long tail if a larger share of the population rests within its tail than would under a normal distribution. A long-tail distribution will arise with the inclusion of many values unusually far from the mean, which increase the magnitude of the skewness of the distribution. A long-tailed distribution is a particular type of heavy-tailed distribution. The distribution and inventory costs of businesses successfully applying this strategy allow them to realize significant profit out of selling small volumes of hard-to-find items to many customers, instead of only selling large volumes of a reduced number of popular items. The total sales of this large number of non-hit items is called the long tail.

Personalization

Today, personalization is something that occurs separately within each system that one interacts with. Recommender systems are one technique for personalization; in essence, the personalization occurs slowly as each system builds up information about your likes and dislikes, about what interests you and what fails to interest you. There are numerous other personalization techniques; most of these rely either on a collection of system usage history, which is then employed to change the behavior of the system, or on the user taking the time and trouble to explicitly personalize the behavior of the system in various ways: by setting parameters, making selections, or engaging in dialogs with the system.

There are several problems with this model, at least from the user's point of view. Investments in personalizing one system (either through explicit action or just long use) are not transferable to another system. (Of course, from the system operator's point of view this may be very desirable: it increases switching costs for users and thus helps lock in a user base.) Information such as likes and dislikes or usage patterns is scattered across multiple systems and can't be combined to obtain maximum leverage. And the user does not have control of the information bases that define his or her profile. If you want to buy books from multiple online booksellers, this is annoying. But if we are concerned with developing information discovery systems to assist users in a world of information overload, these problems are critical. People obtain information from a multiplicity of sources, and personalization has to happen close to the end user: this is the only place where there is enough information to do personalization effectively, to keep track of what's new and what isn't, what has and has not proven useful. The user needs to become a hub and a switch, moving data to allow accurate personalization from one system to another.

Levels of measurement

What a scale actually means, and what we can do with it, depends on what its numbers represent. Numbers can be grouped into 4 types or levels: nominal, ordinal, interval, and ratio. Nominal is the simplest and ratio the most sophisticated. Each level possesses the characteristics of the preceding level, plus an additional quality.

Nominal

Nominal is hardly measurement. It refers to quality more than quantity. A nominal level of measurement is simply a matter of distinguishing by name, e.g., 1 = male, 2 = female. Even though we are using the numbers 1 and 2, they do not denote quantity. The binary categories of 0 and 1 used for computers are a nominal level of measurement. They are categories or classifications. Nominal measurement is like using the categorical levels of variables described in the Doing Scientific Research section of the Introduction module.

Examples: MEAL PREFERENCE: Breakfast, Lunch, Dinner. RELIGIOUS PREFERENCE: 1 = Buddhist, 2 = Muslim, 3 = Christian, 4 = Jewish, 5 = Other. POLITICAL ORIENTATION: Republican, Democratic, Libertarian, Green.

Nominal time of day - categories; no additional information.

Ordinal

Ordinal refers to order in measurement. An ordinal scale indicates direction, in addition to providing nominal information. Low/Medium/High or Faster/Slower are examples of ordinal levels of measurement. Ranking an experience as a nine on a scale of 1 to 10 tells us that it was higher than an experience ranked as a six. Many psychological scales or inventories are at the ordinal level of measurement.

Examples: RANK: 1st place, 2nd place, ..., last place. LEVEL OF AGREEMENT: No, Maybe, Yes. POLITICAL ORIENTATION: Left, Center, Right.

Ordinal time of day - indicates direction or order of occurrence; the spacing between values is uneven.

Interval

Interval scales provide information about order and also possess equal intervals. From the previous example, if we knew that the distance between 1 and 2 was the same as that between 7 and 8 on our 10-point rating scale, then we would have an interval scale. An example of an interval scale is temperature, measured on either the Fahrenheit or the Celsius scale. A degree represents the same underlying amount of heat regardless of where it occurs on the scale. Measured in Fahrenheit units, the difference between a temperature of 46 and 42 is the same as the difference between 72 and 68. Equal-interval scales of measurement can be devised for opinions and attitudes. Constructing them involves an understanding of mathematical and statistical principles beyond those covered in this course, but it is important to understand the different levels of measurement when using and interpreting scales.

Examples: TIME OF DAY on a 12-hour clock. POLITICAL ORIENTATION: score on a standardized scale of political orientation. OTHER: scales constructed so as to possess equal intervals.

Interval time of day - equal intervals on an analog (12-hr) clock; the difference between 1 and 2 pm is the same as the difference between 11 and 12 am.

Ratio

In addition to possessing the qualities of nominal, ordinal, and interval scales, a ratio scale has an absolute zero (a point where none of the quality being measured exists). Using a ratio scale permits comparisons such as being twice as high or one-half as much. Reaction time (how long it takes to respond to a signal of some sort) uses a ratio scale of measurement: time. Although an individual's reaction time is always greater than zero, we conceptualize a zero point in time and can state that a response of 24 milliseconds is twice as fast as a response time of 48 milliseconds.

Examples: RULER: inches or centimeters. YEARS of work experience. INCOME: money earned last year. NUMBER of children. GPA: grade point average.

Ratio - 24-hr time has an absolute 0 (midnight); 14 o'clock is twice as far from midnight as 7 o'clock.

Applications

The level of measurement for a particular variable is defined by the highest category that it achieves. For example, categorizing someone as extroverted (outgoing) or introverted (shy) is nominal. If we categorize people as 1 = shy, 2 = neither shy nor outgoing, 3 = outgoing, then we have an ordinal level of measurement. If we use a standardized measure of shyness (and there are such inventories), we would probably assume the shyness variable meets the standards of an interval level of measurement. As to whether or not we might have a ratio scale of shyness: although we might be able to measure zero shyness, it would be difficult to devise a scale where we would be comfortable talking about someone's being three times as shy as someone else.

Measurement at the interval or ratio level is desirable because we can use the more powerful statistical procedures available for means and standard deviations. To have this advantage, ordinal data are often treated as though they were interval; for example, subjective rating scales (1 = terrible, 2 = poor, 3 = fair, 4 = good, 5 = excellent). The scale probably does not meet the requirement of equal intervals -- we don't know that the difference between 2 (poor) and 3 (fair) is the same as the difference between 4 (good) and 5 (excellent). In order to take advantage of more powerful statistical techniques, researchers often assume that the intervals are equal.

Data Preprocessing

Data have quality if they satisfy the requirements of the intended use. There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.

Imagine that you are a manager at AllElectronics and have been charged with analyzing the company's data with respect to your branch's sales. You immediately set out to perform this task. You carefully inspect the company's database and data warehouse, identifying and selecting the attributes or dimensions (e.g., item, price, and units sold) to be included in your analysis. Alas! You notice that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions. In other words, the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), inaccurate or noisy (containing errors, or values that deviate from the expected), and inconsistent (e.g., containing discrepancies in the department codes used to categorize items). Welcome to the real world!

This scenario illustrates three of the elements defining data quality: accuracy, completeness, and consistency. Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses. There are many possible reasons for inaccurate data (i.e., data having incorrect attribute values). The data collection instruments used may be faulty. There may have been human or computer errors occurring at data entry. Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (e.g., by choosing the default value "January 1" displayed for birthday); this is known as disguised missing data. Errors in data transmission can also occur. There may be technology limitations, such as limited buffer size for coordinating synchronized data transfer and consumption. Incorrect data may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g., date). Duplicate tuples also require data cleaning.

Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because they were not considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the data history or modifications may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.

Recall that data quality depends on the intended use of the data. Two different users may have very different assessments of the quality of a given database. For example, a marketing analyst may need to access the database mentioned before for a list of customer addresses. Some of the addresses are outdated or incorrect, yet overall, 80% of the addresses are accurate. The marketing analyst considers this to be a large customer database for target marketing purposes and is pleased with the database's accuracy, although, as sales manager, you found the data inaccurate.

Timeliness also affects data quality. Suppose that you are overseeing the distribution of monthly sales bonuses to the top sales representatives at AllElectronics. Several sales representatives, however, fail to submit their sales records on time at the end of the month. There are also a number of corrections and adjustments that flow in after the month's end. For a period of time following each month, the data stored in the database are incomplete. However, once all of the data are received, they are correct. The fact that the month-end data are not updated in a timely fashion has a negative impact on the data quality.

Two other factors affecting data quality are believability and interpretability. Believability reflects how much the data are trusted by users, while interpretability reflects how easily the data are understood. Suppose that a database, at one point, had several errors, all of which have since been corrected. The past errors, however, had caused many problems for sales department users, and so they no longer trust the data. The data also use many accounting codes, which the sales department does not know how to interpret. Even though the database is now accurate, complete, consistent, and timely, sales department users may regard it as of low quality due to poor believability and interpretability.

Data Cleaning

Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. In this section, you will study basic methods for data cleaning. Section 3.2.1 looks at ways of handling missing values. Section 3.2.2 explains data smoothing techniques. Section 3.2.3 discusses approaches to data cleaning as a process.

3.2.1 Missing Values

Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded value for several attributes, such as customer income. How can you go about filling in the missing values for this attribute? Let's look at the following methods:

1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably. By ignoring the tuple, we do not make use of the remaining attributes' values in the tuple. Such data could have been useful to the task at hand.

2. Fill in the missing value manually: In general, this approach is time consuming and may not be feasible given a large data set with many missing values.

3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown" or −∞. If missing values are replaced by, say, "Unknown", then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of "Unknown". Hence, although this method is simple, it is not foolproof.

4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value: Chapter 2 discussed measures of central tendency, which indicate the "middle" value of a data distribution. For normal (symmetric) data distributions the mean can be used, while skewed data distributions should employ the median (Section 2.2). For example, suppose that the data distribution regarding the income of AllElectronics customers is symmetric and that the mean income is $56,000. Use this value to replace the missing value for income.

5. Use the attribute mean or median for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, we may replace the missing value with the mean income value for customers in the same credit-risk category as that of the given tuple. If the data distribution for a given class is skewed, the median value is a better choice. (A short code sketch of strategies 4 and 5 follows this list.)

6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income. Decision trees and Bayesian inference are described in detail in Chapters 8 and 9, respectively, while regression is introduced in Section 3.4.5.
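
A short pandas sketch of strategies 4 and 5; the column names, the credit_risk grouping, and the income figures are illustrative toy values:

    import pandas as pd

    df = pd.DataFrame({
        "income": [56000, None, 42000, None, 61000],
        "credit_risk": ["low", "low", "high", "high", "low"],
    })

    # Strategy 4: fill with a global measure of central tendency (mean or median).
    df["income_global"] = df["income"].fillna(df["income"].mean())

    # Strategy 5: fill with the mean income of tuples in the same class (credit_risk).
    df["income_by_class"] = df["income"].fillna(
        df.groupby("credit_risk")["income"].transform("mean"))

    print(df)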

Noisy Data

"What is noise?" Noise is a random error or variance in a measured variable. In Chapter 2, we saw how some basic statistical description techniques (e.g., boxplots and scatter plots) and methods of data visualization can be used to identify outliers, which may represent noise. Given a numeric attribute such as, say, price, how can we "smooth" out the data to remove the noise? Let's look at the following data smoothing techniques:

Binning: Binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values around it. The sorted values are distributed into a number of "buckets", or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. Figure 3.2 illustrates some binning techniques. In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9; therefore, each original value in this bin is replaced by the value 9.

Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is then replaced by the closest boundary value. In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equal width, where the interval range of values in each bin is constant. Binning is also used as a discretization technique and is further discussed in Section 3.5.
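
A minimal sketch of smoothing by bin means over equal-frequency bins, using the same price values as the Figure 3.2 example (4, 8, and 15 in Bin 1 average to 9, and so on):

    def smooth_by_bin_means(values, bin_size=3):
        """Equal-frequency binning followed by smoothing by bin means."""
        ordered = sorted(values)
        smoothed = []
        for start in range(0, len(ordered), bin_size):
            bin_vals = ordered[start:start + bin_size]
            mean = sum(bin_vals) / len(bin_vals)
            smoothed.extend([mean] * len(bin_vals))
        return smoothed

    prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
    print(smooth_by_bin_means(prices))
    # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]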

Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function. Linear regression involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface. Regression is further described in Section 3.4.5.

Outlier analysis: Outliers may be detected by clustering, for example, where similar values are organized into groups, or "clusters". Intuitively, values that fall outside of the set of clusters may be considered outliers (Figure 3.3). Chapter 12 is dedicated to the topic of outlier analysis.

Data Integration

Data mining often requires data integration, the merging of data from multiple data stores. Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help improve the accuracy and speed of the subsequent data mining process.

Data Transformation Strategies Overview

In data transformation, the data are transformed or consolidated into forms appropriate for mining.

Strategies for data transformation include the following:

1. Smoothing, which works to remove noise from the data. Techniques include binning, regression, and clustering.

2. Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.

3. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for data analysis at multiple abstraction levels.

4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.

5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute. Figure 3.12 shows a concept hierarchy for the attribute price. More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users.

6. Concept hierarchy generation for nominal data, where attributes such as street can be generalized to higher-level concepts, like city or country. Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level.

Data Transformation by Normalization

The measurement unit used can affect the data analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results. In general, expressing an attribute in smaller units will lead to a larger range for that attribute, and thus tend to give such an attribute greater effect or "weight". To help avoid dependence on the choice of measurement units, the data should be normalized or standardized. This involves transforming the data to fall within a smaller or common range, such as [−1, 1] or [0.0, 1.0]. (The terms standardize and normalize are used interchangeably in data preprocessing, although in statistics the latter term also has other connotations.)

Normalizing the data attempts to give all attributes an equal weight. Normalization is particularly useful for classification algorithms involving neural networks or distance measurements, such as nearest-neighbor classification and clustering. If using the neural network backpropagation algorithm for classification mining (Chapter 9), normalizing the input values for each attribute measured in the training tuples will help speed up the learning phase. For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes). It is also useful when no prior knowledge of the data is available.

There are many methods for data normalization. We study min-max normalization, z-score normalization, and normalization by decimal scaling. For our discussion, let A be a numeric attribute with n observed values v1, v2, ..., vn.

Min-max normalization performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of an attribute A. Min-max normalization maps a value vi of A to v'i in the range [new_minA, new_maxA] by computing

    v'i = ((vi − minA) / (maxA − minA)) (new_maxA − new_minA) + new_minA

Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for A.

In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean (i.e., average) and standard deviation of A. A value vi of A is normalized to v'i by computing

    v'i = (vi − mean(A)) / stddev(A)

Decimal scaling: Suppose that the recorded values of A range from −986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3), so that −986 normalizes to −0.986 and 917 normalizes to 0.917.

Note that normalization can change the original data quite a bit, especially when using z-score normalization or decimal scaling. It is also necessary to save the normalization parameters (e.g., the mean and standard deviation if using z-score normalization) so that future data can be normalized in a uniform manner.
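
A short numpy sketch of min-max normalization (to [0.0, 1.0]) and z-score normalization; the toy values are illustrative, and note that the parameters (min/max, or mean and standard deviation) are computed once so they could be saved and reused on future data, as recommended above:

    import numpy as np

    values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

    # Min-max normalization to the range [0.0, 1.0]
    v_min, v_max = values.min(), values.max()
    minmax = (values - v_min) / (v_max - v_min) * (1.0 - 0.0) + 0.0

    # Z-score normalization
    mean, std = values.mean(), values.std()
    zscore = (values - mean) / std

    print(minmax)   # [0.    0.125 0.25  0.5   1.   ]
    print(zscore)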

Discretization by Binning

Binning is a top-down splitting technique based on a specified number of bins. Section 3.2.2 discussed binning methods for data smoothing. These methods are also used as discretization methods for data reduction and concept hierarchy generation. For example, attribute values can be discretized by applying equal-width or equal-frequency binning, and then replacing each bin value by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively. These techniques can be applied recursively to the resulting partitions to generate concept hierarchies. Binning does not use class information and is therefore an unsupervised discretization technique. It is sensitive to the user-specified number of bins, as well as to the presence of outliers.

Distance/Similarity Measures

Similarity: a measure of how close to each other two instances are. The "closer" the instances are to each other, the larger the similarity value.

Dissimilarity: a measure of how different two instances are. Dissimilarity is large when instances are very different, and is small when they are close.

Proximity refers to either similarity or dissimilarity.

Distance metric: a measure of dissimilarity d that obeys the following laws: d(x, y) ≥ 0; d(x, y) = 0 if and only if x = y; d(x, y) = d(y, x) (symmetry); and d(x, z) ≤ d(x, y) + d(y, z) (the triangle inequality).

Conversion of similarity and dissimilarity measures

Typically, given a similarity measure, one can "revert" it to serve as a dissimilarity measure, and vice versa.

Conversions may differ. E.g., if d is a distance measure, one can use a transformation such as

    s = 1 / (1 + d)

as the corresponding similarity measure. If s is a similarity measure that ranges between 0 and 1 (a so-called degree of similarity), then the corresponding dissimilarity measure can be defined as

    d = 1 − s

In general, any monotonically decreasing transformation can be applied to convert similarity measures into dissimilarity measures, and any monotonically increasing transformation can be applied to convert the measures the other way around.
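
A small sketch of the two conversions just described; d is assumed to be any non-negative distance value, and s a degree of similarity in [0, 1]:

    def distance_to_similarity(d):
        """Monotonically decreasing transformation of a distance into a similarity."""
        return 1.0 / (1.0 + d)

    def similarity_to_dissimilarity(s):
        """Complement of a degree of similarity in [0, 1]."""
        return 1.0 - s

    print(distance_to_similarity(0.0))          # 1.0 (identical instances)
    print(similarity_to_dissimilarity(0.75))    # 0.25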

Distance Metrics for Numeric Attributes

When the data set is presented in standard form, each instance can be treated as a vector x̄ = (x1, ..., xN) of measures for attributes numbered 1, ..., N.

Consider for now only non-nominal scales.
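
For numeric attribute vectors, a common family of choices is the Minkowski distances, of which Manhattan (p = 1) and Euclidean (p = 2) are the best-known members; a brief sketch in plain Python with toy vectors:

    def minkowski(x, y, p=2):
        """Minkowski distance between two numeric vectors; p=1 gives Manhattan, p=2 Euclidean."""
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

    x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
    print(minkowski(x, y, p=1))   # 7.0  (Manhattan)
    print(minkowski(x, y, p=2))   # 5.0  (Euclidean)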

Computing User Similarity

A critical design decision in implementing user-user CF is the choice of similarity function. Several different similarity functions have been proposed and evaluated in the literature. Pearson correlation: this method computes the statistical correlation (Pearson's r) between two users' common ratings to determine their similarity. GroupLens and BellCore both used this method. The correlation is computed as

    s(u, v) = Σ_i (r_u,i − r̄_u)(r_v,i − r̄_v) / ( sqrt(Σ_i (r_u,i − r̄_u)²) · sqrt(Σ_i (r_v,i − r̄_v)²) )

where the sums run over the items i rated by both u and v, and r̄_u denotes user u's mean rating over those items.
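
A hedged sketch of this Pearson user-user similarity over co-rated items; ratings are assumed to be stored in dictionaries keyed by item id, and refinements such as significance weighting are omitted:

    from math import sqrt

    def pearson_similarity(ratings_u, ratings_v):
        """Pearson correlation between two users, computed over their co-rated items."""
        common = set(ratings_u) & set(ratings_v)
        if len(common) < 2:
            return 0.0
        mean_u = sum(ratings_u[i] for i in common) / len(common)
        mean_v = sum(ratings_v[i] for i in common) / len(common)
        num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
        den = sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common)) * \
              sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in common))
        return num / den if den != 0 else 0.0

    u = {"a": 5, "b": 3, "c": 4}
    v = {"a": 4, "b": 2, "c": 5}
    print(round(pearson_similarity(u, v), 3))   # 0.655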

Further reading (linked in the original notes): Collaborative Filtering Recommender Systems.

Clustering

Further reading (linked in the original notes): Applied Multivariate Statistical Analysis (ebook); Model-Based Clustering; Expectation Maximization; UV-Decomposition; SVD Method.

Page 6: Recomender System Notes

To illustrate the prediction step of a RS consider for instance a simple non-personalized recommendation algorithm that recommends just the most popular songs The rationale for using this approach is that in absence of more precise information about the userrsquos preferences a popular song ie something that is liked (high utility) by many users will also be probably liked by a generic user at least more than another randomly selected song Hence the utility of these popular songs is predicted to be reasonably high for this generic user

This view of the core recommendation computation as the prediction of the utility of an item for a user has been suggested in They model this degree of utility of the user u for the item i as a (real valued) function R(u i) as is normally done in collaborative filtering by considering the ratings of users for items Then the fundamental task of a collaborative filtering RS is to predict the value of R over pairs of users and items ie to compute ˆR(u i) where we denote with ˆR the estimation computed by the RS of the true function R Consequently having computed this prediction for the active user u on a set of items ie ˆR(u i1) ˆR(u iN) the system will recommend the items i j1 i jK (K le N) with the largest predicted utility K is typically a small number ie much smaller than the cardinality of the item data set or the items on which a user utility prediction can be computed ie RSs ldquofilterrdquo the items that are recommended to users

As mentioned above some recommender systems do not fully estimate the utility before making a recommendation but they may apply some heuristics to hypothesize that an item is of use to a user This is typical for instance in knowledge-based systems These utility predictions are computed with specific algorithms (see below) and use various kind of knowledge about users items and the utility function itself For instance the system may assume that the utility function is Boolean and therefore it will just determine whether an item is or is not useful for the user Consequently assuming that there is some available knowledge (possibly none) about the user who is requesting the recommendation knowledge about items and other users who received recommendations the system will leverage this knowledge with an appropriate algorithm to generate various utility predictions and hence recommendations

To provide a first overview of the different types of RSs we want to quote a taxonomy provided by that has become a classical way of distinguishing between recommender systems and referring to them [25] distinguishes between six different classes of recommendation approaches

Content-based The system learns to recommend items that are similar to the ones that the user liked in the past The similarity of items is calculated based on the features associated with the compared items For example if a user has positively rated a movie that belongs to the comedy genre then the system can learn to recommend other movies from this genre Chapter 3 provides an overview of content based recommender systems imposing some order among the extensive and diverse aspects involved in their design and implementation It presents the basic concepts and terminology of content-based RSs their high level architecture and their main advantages and drawbacks The chapter then surveys state-of-the-art systems that have been adopted in several application domains The survey encompasses a thorough description of both classical and advanced techniques for representing items and user profiles Finally it discusses trends and future research which might lead towards the next generation of recommender systems

Collaborative filtering The simplest and original implementation of this approach [93] recommends to the active user the items that other users with similar tastes liked in the past The similarity in taste of two users is calculated based on the similarity in the rating history of the users This is the reason why [94] refers to collaborative filtering as ldquopeople-to-people correlationrdquo Collaborative filtering is considered to be the most popular and widely implemented technique in RS

A comprehensive survey of neighborhood-based methods for collaborative filtering Neighborhood methods focus on relationships between items or alternatively between users An item-item approach models the preference of a user to an item based on ratings of similar items by the same user Nearest-neighbors methods enjoy considerable popularity due to their simplicity efficiency and their ability to produce accurate and personalized recommendations The authors will address the essential decisions that are required when implementing a neighborhood based recommender system and provide practical information on how to make such decisions

Finally the chapter deals with problems of data sparsity and limited coverage often observed in large commercial recommender systems A few solutions to overcome these problems are presented

Demographic This type of system recommends items based on the demographic profile of the user The assumption is that different recommendations should be generated for different demographic niches Many Web sites adopt simple and effective personalization solutions based on demographics For example users are dispatched to particular Web sites based on their language or country Or suggestions may be customized according to the age of the user While these approaches have been quite popular in the marketing literature there has been relatively little proper RS research into demographic systems

Knowledge-based Knowledge-based systems recommend items based on specific domain knowledge about how certain item features meet users needs and preferences and ultimately how the item is useful for the user Notable knowledge based recommender systems are case-based In these systems a similarity function estimates how much the user needs (problem description) match the recommendations (solutions of the problem) Here the similarity score can be directly interpreted as the utility of the recommendation for the user Constraint-based systems are another type of knowledge-based RSs

In terms of used knowledge both systems are similar user requirements are collected repairs for inconsistent requirements are automatically proposed in situations where no solutions could be found and recommendation results are explained The major difference lies in the way solutions are calculated Case-based recommenders determine recommendations on the basis of similarity metrics whereas constraint based recommenders predominantly exploit predefined knowledge bases that contain explicit rules about how to relate customer requirements with item features

Knowledge-based systems tend to work better than others at the beginning of their deployment but if they are not equipped with learning components they may be surpassed by other shallow methods that can exploit the logs of the humancomputer interaction (as in CF)

Community-based This type of system recommends items based on the preferences of the users friends This technique follows the epigram ldquoTell me who your friends are and I will tell you who you arerdquo Evidence suggests that people tend to rely more on recommendations from their friends than on recommendations from similar but anonymous individuals This observation combined with the growing popularity of open social networks is generating a rising interest in community-based systems or as or as they usually referred to social recommender systems This type of RSs models and acquires information about the social relations of the users and the preferences of the userrsquos friends The recommendation is based on ratings that were provided by the userrsquos friends In fact these RSs are following the rise of social-networks and enable a simple and comprehensive acquisition of data related to the social relations of the users

Hybrid recommender systems These RSs are based on the combination of the above mentioned techniques A hybrid system combining techniques A and B tries to use the advantages of A to fix the disadvantages of B

For instance CF methods suffer from new-item problems ie they cannot recommend items that have no ratings This does not limit content-based approaches since the prediction for new items is based on their description (features) that are typically easily available Given two (or more) basic RSs techniques several ways have been proposed for combining them to create a new hybrid system As we have already mentioned the context of the user when she is seeking a recommendation can be used to better personalize the output of the system For example in a temporal context vacation recommendations in winter should be very different from those provided in summer Or a restaurant recommendation for a Saturday evening with your friends should be different from that suggested for a workday lunch with co-workers

Application and Evaluation

Recommender system research is being conducted with a strong emphasis on practice and commercial applications since aside from its theoretical contribution is generally aimed at practically improving commercial RSs Thus RS research involves practical aspects that apply to the implementation of these systems These aspects are relevant to different stages in the life cycle of a RS namely the design of the system its implementation and its maintenance and enhancement during system operation

The aspects that apply to the design stage include factors that might affect the choice of the algorithm The first factor to consider the applicationrsquos domain has a major effect on the algorithmic approach that should be taken [72] provide a taxonomy of RSs and classify existing RS applications to specific application domains Based on these specific application domains we define more general classes of domains for the most common recommender systems applications

bull Entertainment - recommendations for movies music and IPTV

bull Content - personalized newspapers recommendation for documents recommendations of Web pages e-learning applications and e-mail filters

bull E-commerce - recommendations for consumers of products to buy such as books cameras PCs etc

bull Services - recommendations of travel services recommendation of experts for consultation recommendation of houses to rent or matchmaking services

Data Collection

The logical component in charge of pre-processing the data and generating the input of the recommender algorithm is referred to as data collector The data collector gathers data from different sources such as the EPG for information about the live programs the content provider for information about the VOD catalog and the service provider for information about the users

The Fastweb recommender system does not rely on personal information from the users (eg age gender occupation) Recommendations are based on the past usersrsquo behavior (what they watched) and on any explicit preference they have expressed (eg preferred genres) If the users did not specify any explicit preferences the system is able to infer them by analyzing the usersrsquo past activities

An important question has been raised in Section 92 users interact with the IPTV system by means of the STB but typically we cannot identify who is actually in front of the TV Consequently the STB collects the behavior and the preferences of a set of users (eg the component of a family) This represents a considerable problem since we are limited to generate per-STB recommendations In order to simplify the notation in the rest of the

paper we will refer to user and STB to identify the same entity The user-disambiguation problem has been partially solved by separating the collected information according to the time slot they refer to For instance we can roughly assume the following pattern housewives use to watch TV during the morning children during the afternoon the whole family at evening while only adults watch TV during the night By means of this simple time slot distinction we are able to distinguish among different potential users of the same STB

Formally the available information has been structured into two main matrices practically stored into a relational database the item-content matrix (ICM) and the user-rating matrix (URM)

The former describes the principal characteristics (metadata) of each item In the following we will refer to the item-content matrix as W whose elements wci represent the relevance of characteristic (metadata) c for item i The ICM is generated from the analysis of the set of information given by the content provider (ie the EPG) Such information concerns for instance the title of a movie the actors the director(s) the genre(s) and the plot Note that in a real environment we can face with inaccurate information especially because of the rate new content is added every day The information provided by the ICM is used to generate a content-based recommendation after being filtered by means of techniques for PoS (Part-of-Speech) tagging stop words removal and latent semantic analysis Moreover the ICM can be used to perform some kind of processing on the items (eg parental control)

The URM represents the ratings (ie preferences) of users about items In the following we will refer to such matrix as R whose elements rpi represent the rating of user p about item i Such preferences constitute the basic information for any collaborative algorithm The user rating can be either explicit or implicit according to the fact that the ratings are explicitly expressed by users or are implicitly collected by the system respectively

Explicit ratings confidently represent the user opinion even though they can be affected by biases due to user subjectivity item popularity or global rating tendency The first bias depends on arbitrary interpretations of the rating scale For instance in a rating scale between 1 and 5 some user could use the value 3 to indicate an interesting item someone else could use 3 for a not much interesting item Similarly popular items tend to be overrated while unpopular items are usually underrated Finally explicit ratings can be affected by global attitudes (eg users are more willing to rate movies they like)

On the other hand implicit ratings are inferred by the system on the basis of the user-system interaction which might not match the user opinion For instance the system is able to monitor whether a user has watched a live program on a certain channel or whether the user has uninterruptedly watched a movie Despite explicit ratings are more reliable than implicit ratings in representing the actual user interest towards an item their collection can be annoying from the userrsquos perspective

Long Tail

In statistics, the long tail of a distribution is the portion of the distribution having a large number of occurrences far from the head or central part of the distribution. The distribution could involve popularities, random numbers of occurrences of events with various probabilities, etc. A probability distribution is said to have a long tail if a larger share of the population rests within its tail than would under a normal distribution. A long-tail distribution arises when many values lie unusually far from the mean, which increases the skewness of the distribution; a long-tailed distribution is a particular type of heavy-tailed distribution. The distribution and inventory costs of businesses that successfully apply this strategy allow them to realize significant profit by selling small volumes of hard-to-find items to many customers, instead of only selling large volumes of a reduced number of popular items. The total sales of this large number of non-hit items is called the long tail.

Personalization

Today, personalization is something that occurs separately within each system one interacts with. Recommender systems are one technique for personalization: in essence, the personalization occurs slowly as each system builds up information about your likes and dislikes, about what interests you and what fails to interest you. There are numerous other personalization techniques; most of these rely either on the collection of system usage history, which is then employed to change the behavior of the system, or on the user taking the time and trouble to explicitly personalize the behavior of the system in various ways, by setting parameters, making selections, or engaging in dialogs with the system.

There are several problems with this model, at least from the user's point of view. Investments in personalizing one system (either through explicit action or just long use) are not transferable to another system. (Of course, from the system operator's point of view this may be very desirable: it increases switching costs for users and thus helps lock in a user base.) Information such as likes and dislikes or usage patterns is scattered across multiple systems and cannot be combined to obtain maximum leverage. And the user does not have control of the information bases that define his or her profile. If you want to buy books from multiple online booksellers this is annoying, but if we are concerned with developing information discovery systems to assist users in a world of information overload, these problems are critical. People obtain information from a multiplicity of sources, and personalization has to happen close to the end user: this is the only place where there is enough information to do personalization effectively, to keep track of what is new and what is not, what has and has not proven useful. The user needs to become a hub and a switch, moving data to allow accurate personalization from one system to another.

Levels of measurement

What a scale actually means and what we can do with it depend on what its numbers represent. Numbers can be grouped into four types or levels: nominal, ordinal, interval, and ratio. Nominal is the simplest and ratio the most sophisticated. Each level possesses the characteristics of the preceding level plus an additional quality.

Nominal

Nominal is hardly measurement; it refers to quality more than quantity. A nominal level of measurement is simply a matter of distinguishing by name, e.g., 1 = male, 2 = female. Even though we are using the numbers 1 and 2, they do not denote quantity. The binary categories of 0 and 1 used for computers are a nominal level of measurement; they are categories or classifications. Nominal measurement is like using the categorical levels of variables described in the Doing Scientific Research section of the Introduction module.

Examples: MEAL PREFERENCE (Breakfast, Lunch, Dinner); RELIGIOUS PREFERENCE (1 = Buddhist, 2 = Muslim, 3 = Christian, 4 = Jewish, 5 = Other); POLITICAL ORIENTATION (Republican, Democratic, Libertarian, Green).

Nominal time of day: categories only, with no additional information.

Ordinal

Ordinal refers to order in measurement. An ordinal scale indicates direction, in addition to providing nominal information. Low/Medium/High or Faster/Slower are examples of ordinal levels of measurement. Ranking an experience as a nine on a scale of 1 to 10 tells us that it was higher than an experience ranked as a six. Many psychological scales or inventories are at the ordinal level of measurement.

Examples: RANK (1st place, 2nd place, ..., last place); LEVEL OF AGREEMENT (No, Maybe, Yes); POLITICAL ORIENTATION (Left, Center, Right).

Ordinal time of day: indicates direction or order of occurrence, but the spacing between values is uneven.

Interval

Interval scales provide information about order and also possess equal intervals. Continuing the previous example, if we knew that the distance between 1 and 2 on our 10-point rating scale was the same as the distance between 7 and 8, then we would have an interval scale. An example of an interval scale is temperature, measured on either the Fahrenheit or the Celsius scale: a degree represents the same underlying amount of heat regardless of where it occurs on the scale. Measured in Fahrenheit units, the difference between temperatures of 46 and 42 is the same as the difference between 72 and 68. Equal-interval scales of measurement can be devised for opinions and attitudes; constructing them involves an understanding of mathematical and statistical principles beyond those covered in this course, but it is important to understand the different levels of measurement when using and interpreting scales.

Examples: TIME OF DAY on a 12-hour clock; POLITICAL ORIENTATION (score on a standardized scale of political orientation); OTHER scales constructed so as to possess equal intervals.

Interval time of day: equal intervals on an analog (12-hour) clock; the difference between 1 and 2 pm is the same as the difference between 11 am and 12 pm.

Ratio

In addition to possessing the qualities of nominal, ordinal, and interval scales, a ratio scale has an absolute zero (a point where none of the quality being measured exists). Using a ratio scale permits comparisons such as being twice as high or one-half as much. Reaction time (how long it takes to respond to a signal of some sort) uses a ratio scale of measurement: time. Although an individual's reaction time is always greater than zero, we conceptualize a zero point in time and can state that a response of 24 milliseconds is twice as fast as a response time of 48 milliseconds.

Examples: RULER (inches or centimeters); YEARS of work experience; INCOME (money earned last year); NUMBER of children; GPA (grade point average).

Ratio time of day: 24-hour time has an absolute zero (midnight); 14 o'clock is twice as far from midnight as 7 o'clock.

Applications

The level of measurement for a particular variable is defined by the highest level that it achieves. For example, categorizing someone as extroverted (outgoing) or introverted (shy) is nominal. If we categorize people as 1 = shy, 2 = neither shy nor outgoing, 3 = outgoing, then we have an ordinal level of measurement. If we use a standardized measure of shyness (and there are such inventories), we would probably assume that the shyness variable meets the standards of an interval level of measurement. As to whether we might have a ratio scale of shyness: although we might be able to measure zero shyness, it would be difficult to devise a scale where we would be comfortable talking about someone being three times as shy as someone else.

Measurement at the interval or ratio level is desirable because we can use the more powerful statistical procedures available for means and standard deviations. To gain this advantage, ordinal data are often treated as though they were interval, for example subjective rating scales (1 = terrible, 2 = poor, 3 = fair, 4 = good, 5 = excellent). Such a scale probably does not meet the requirement of equal intervals: we do not know that the difference between 2 (poor) and 3 (fair) is the same as the difference between 4 (good) and 5 (excellent). In order to take advantage of more powerful statistical techniques, researchers often assume that the intervals are equal.

Data Preprocessing

Data have quality if they satisfy the requirements of the intended use. There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.

Imagine that you are a manager at AllElectronics and have been charged with analyzing the company's data with respect to your branch's sales. You immediately set out to perform this task. You carefully inspect the company's database and data warehouse, identifying and selecting the attributes or dimensions (e.g., item, price, and units sold) to be included in your analysis. Alas! You notice that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions. In other words, the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), inaccurate or noisy (containing errors, or values that deviate from the expected), and inconsistent (e.g., containing discrepancies in the department codes used to categorize items). Welcome to the real world!

This scenario illustrates three of the elements defining data quality: accuracy, completeness, and consistency. Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses. There are many possible reasons for inaccurate data (i.e., data having incorrect attribute values). The data collection instruments used may be faulty. There may have been human or computer errors at data entry. Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (e.g., by choosing the default value "January 1" displayed for birthday); this is known as disguised missing data. Errors in data transmission can also occur. There may be technology limitations, such as limited buffer size for coordinating synchronized data transfer and consumption. Incorrect data may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g., dates). Duplicate tuples also require data cleaning.

Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because they were not considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the data history or of modifications may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.

Recall that data quality depends on the intended use of the data. Two different users may have very different assessments of the quality of a given database. For example, a marketing analyst may need to access the database mentioned before for a list of customer addresses. Some of the addresses are outdated or incorrect, yet overall, 80% of the addresses are accurate. The marketing analyst considers this to be a large customer database for target marketing purposes and is pleased with the database's accuracy, although, as sales manager, you found the data inaccurate.

Timeliness also affects data quality. Suppose that you are overseeing the distribution of monthly sales bonuses to the top sales representatives at AllElectronics. Several sales representatives, however, fail to submit their sales records on time at the end of the month, and a number of corrections and adjustments flow in after the month's end. For a period of time following each month, the data stored in the database are incomplete; however, once all of the data are received, they are correct. The fact that the month-end data are not updated in a timely fashion has a negative impact on data quality.

Two other factors affecting data quality are believability and interpretability. Believability reflects how much the data are trusted by users, while interpretability reflects how easily the data are understood. Suppose that a database, at one point, had several errors, all of which have since been corrected. The past errors, however, had caused many problems for sales department users, and so they no longer trust the data. The data also use many accounting codes, which the sales department does not know how to interpret. Even though the database is now accurate, complete, consistent, and timely, sales department users may regard it as of low quality due to poor believability and interpretability.

Data Cleaning

Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. In this section, you will study basic methods for data cleaning. Section 3.2.1 looks at ways of handling missing values, Section 3.2.2 explains data smoothing techniques, and Section 3.2.3 discusses approaches to data cleaning as a process.

3.2.1 Missing Values

Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded value for several attributes, such as customer income. How can you go about filling in the missing values for this attribute? Let's look at the following methods.

1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably. By ignoring the tuple, we do not make use of the values of the remaining attributes in the tuple, even though such data could have been useful to the task at hand.

2. Fill in the missing value manually: In general, this approach is time consuming and may not be feasible given a large data set with many missing values.

3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown" or -∞. If missing values are replaced by, say, "Unknown", then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of "Unknown". Hence, although this method is simple, it is not foolproof.

4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value: Chapter 2 discussed measures of central tendency, which indicate the "middle" value of a data distribution. For normal (symmetric) data distributions the mean can be used, while skewed data distributions should employ the median (Section 2.2). For example, suppose that the data distribution regarding the income of AllElectronics customers is symmetric and that the mean income is $56,000. Use this value to replace the missing value for income.

5. Use the attribute mean or median for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, we may replace the missing value with the mean income of customers in the same credit-risk category as that of the given tuple. If the data distribution for a given class is skewed, the median value is a better choice.

6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income. Decision trees and Bayesian inference are described in detail in Chapters 8 and 9, respectively, while regression is introduced in Section 3.4.5.
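The sketch below illustrates methods 4 and 5 from the list above using pandas; the column names (income, credit_risk) and the toy values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [56000, None, 30000, None, 58000],
})

# Method 4: fill with the overall mean (use the median for skewed distributions).
df["income_global"] = df["income"].fillna(df["income"].mean())

# Method 5: fill with the mean income of tuples in the same class (credit_risk).
df["income_by_class"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)
print(df)
```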

Noisy Data

"What is noise?" Noise is a random error or variance in a measured variable. In Chapter 2, we saw how some basic statistical description techniques (e.g., boxplots and scatter plots) and methods of data visualization can be used to identify outliers, which may represent noise. Given a numeric attribute such as, say, price, how can we "smooth" out the data to remove the noise? Let's look at the following data smoothing techniques.

Binning: Binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values around it. The sorted values are distributed into a number of "buckets", or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. Figure 3.2 illustrates some binning techniques. In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9; therefore, each original value in this bin is replaced by the value 9.

Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is then replaced by the closest boundary value. In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equal width, where the interval range of values in each bin is constant. Binning is also used as a discretization technique and is further discussed in Section 3.5.
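
A short sketch of smoothing by equal-frequency bin means follows. Only the first bin (4, 8, 15) is given in the text above; the remaining price values are assumed for illustration.

```python
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])  # first three values from the text, rest assumed
bin_size = 3                                          # equal-frequency bins of size 3

smoothed = []
for start in range(0, len(prices), bin_size):
    bin_vals = prices[start:start + bin_size]
    mean = sum(bin_vals) / len(bin_vals)              # smoothing by bin means
    smoothed.extend([mean] * len(bin_vals))

print(smoothed[:3])  # 4, 8 and 15 are all replaced by their bin mean, 9.0
```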

Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function. Linear regression involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression in which more than two attributes are involved and the data are fit to a multidimensional surface. Regression is further described in Section 3.4.5.

Outlier analysis: Outliers may be detected by clustering, for example, where similar values are organized into groups, or "clusters". Intuitively, values that fall outside of the set of clusters may be considered outliers (Figure 3.3). Chapter 12 is dedicated to the topic of outlier analysis.

Data Integration

Data mining often requires data integration, the merging of data from multiple data stores. Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set, which in turn helps improve the accuracy and speed of the subsequent data mining process.

Data Transformation Strategies Overview

In data transformation, the data are transformed or consolidated into forms appropriate for mining.

Strategies for data transformation include the following:

1. Smoothing, which works to remove noise from the data. Techniques include binning, regression, and clustering.

2. Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.

3. Aggregation, where summary or aggregation operations are applied to the data. For example, daily sales data may be aggregated so as to compute monthly and annual totals. This step is typically used in constructing a data cube for data analysis at multiple abstraction levels.

4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as -1.0 to 1.0, or 0.0 to 1.0.

5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute. Figure 3.12 shows a concept hierarchy for the attribute price. More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users.

6. Concept hierarchy generation for nominal data, where attributes such as street can be generalized to higher-level concepts, such as city or country. Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level.

Data Transformation by Normalization

The measurement unit used can affect the data analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results. In general, expressing an attribute in smaller units leads to a larger range for that attribute and thus tends to give such an attribute greater effect, or "weight". To help avoid dependence on the choice of measurement units, the data should be normalized or standardized. This involves transforming the data to fall within a smaller or common range, such as [-1, 1] or [0.0, 1.0]. (The terms standardize and normalize are used interchangeably in data preprocessing, although in statistics the latter term also has other connotations.)

Normalizing the data attempts to give all attributes an equal weight. Normalization is particularly useful for classification algorithms involving neural networks or distance measurements, such as nearest-neighbor classification and clustering. If using the neural network backpropagation algorithm for classification mining (Chapter 9), normalizing the input values for each attribute measured in the training tuples will help speed up the learning phase. For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes). It is also useful when no prior knowledge of the data is available.

There are many methods for data normalization. We study min-max normalization, z-score normalization, and normalization by decimal scaling. For our discussion, let A be a numeric attribute with n observed values v1, v2, ..., vn.

Min-max normalization performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of attribute A. Min-max normalization maps a value vi of A to v'i in the range [new_minA, new_maxA] by computing

v'i = ((vi - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for A.

In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean (i.e., average) Ā and standard deviation σA of A. A value vi of A is normalized to v'i by computing

v'i = (vi - Ā) / σA

Decimal scaling: Suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.

Note that normalization can change the original data quite a bit, especially when using z-score normalization or decimal scaling. It is also necessary to save the normalization parameters (e.g., the mean and standard deviation in the case of z-score normalization) so that future data can be normalized in a uniform manner.
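
A compact sketch of the three normalization methods discussed above, using NumPy. Note how the fitted parameters (min/max, mean/std, scaling power) are kept so that future values can be normalized in the same way; the sample values are illustrative.

```python
import numpy as np

v = np.array([-986.0, 200.0, 500.0, 917.0])

# Min-max normalization to the range [new_min, new_max].
new_min, new_max = 0.0, 1.0
mn, mx = v.min(), v.max()
v_minmax = (v - mn) / (mx - mn) * (new_max - new_min) + new_min

# z-score normalization (save mean and std to normalize future data consistently).
mean, std = v.mean(), v.std()
v_zscore = (v - mean) / std

# Decimal scaling: divide by 10^j, with j the smallest integer such that
# max(|v|) / 10^j < 1 (here j = 3, since max |v| = 986).
j = int(np.ceil(np.log10(np.abs(v).max())))
v_decimal = v / 10 ** j

print(v_minmax, v_zscore, v_decimal, sep="\n")
```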

Discretization by Binning

Binning is a top-down splitting technique based on a specified number of bins. Section 3.2.2 discussed binning methods for data smoothing. These methods are also used as discretization methods for data reduction and concept hierarchy generation. For example, attribute values can be discretized by applying equal-width or equal-frequency binning and then replacing each bin value by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively. These techniques can be applied recursively to the resulting partitions to generate concept hierarchies. Binning does not use class information and is therefore an unsupervised discretization technique. It is sensitive to the user-specified number of bins, as well as to the presence of outliers.
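
As a sketch, pandas provides cut (equal-width) and qcut (equal-frequency) for exactly this kind of unsupervised discretization; the age boundaries and labels below are illustrative assumptions.

```python
import pandas as pd

ages = pd.Series([5, 13, 22, 31, 45, 67, 70])

equal_width = pd.cut(ages, bins=3)        # equal-width binning
equal_freq = pd.qcut(ages, q=3)           # equal-frequency binning
labelled = pd.cut(ages, bins=[0, 17, 64, 120],
                  labels=["youth", "adult", "senior"])  # conceptual labels
print(labelled.tolist())
```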

Distance/Similarity Measures

Similarity: a measure of how close two instances are to each other. The "closer" the instances are to each other, the larger the similarity value.

Dissimilarity: a measure of how different two instances are. Dissimilarity is large when instances are very different, and small when they are close.

Proximity refers to either similarity or dissimilarity.

Distance metric: a measure of dissimilarity that obeys the metric axioms (non-negativity, identity of indiscernibles, symmetry, and the triangle inequality).

Conversion of similarity and dissimilarity measures

Typically, given a similarity measure, one can "invert" it to serve as a dissimilarity measure, and vice versa.

Conversions may differ. For example, if d is a distance measure, one can use s = 1 / (1 + d) as the corresponding similarity measure. If s is a similarity measure that ranges between 0 and 1 (a so-called degree of similarity), then the corresponding dissimilarity measure can be defined as d = 1 - s.

In general, any monotonically decreasing transformation can be applied to convert similarity measures into dissimilarity measures, and any monotonically increasing transformation can be applied to convert the measures the other way around.
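
Two common conversions, consistent with the formulas above (a sketch; any other monotone transformation would work as well):

```python
def distance_to_similarity(d: float) -> float:
    """Map a distance d >= 0 to a similarity in (0, 1]."""
    return 1.0 / (1.0 + d)

def similarity_to_dissimilarity(s: float) -> float:
    """Map a degree of similarity s in [0, 1] to a dissimilarity in [0, 1]."""
    return 1.0 - s

print(distance_to_similarity(0.0), similarity_to_dissimilarity(0.75))  # 1.0 0.25
```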

Distance Metrics for Numeric Attributes

When the data set is presented in standard form, each instance can be treated as a vector x̄ = (x1, ..., xN) of measures for the attributes numbered 1, ..., N.

Consider, for now, only non-nominal scales.

Computing User Similarity

A critical design decision in implementing user-user CF is the choice of the similarity function, and several different similarity functions have been proposed and evaluated in the literature. Pearson correlation: this method computes the statistical correlation (Pearson's r) between two users' common ratings to determine their similarity; both GroupLens and BellCore used this method. The correlation is computed as

s(u, v) = Σi (r_{u,i} - r̄_u)(r_{v,i} - r̄_v) / ( sqrt(Σi (r_{u,i} - r̄_u)²) · sqrt(Σi (r_{v,i} - r̄_v)²) )

where the sums run over the items i rated by both users, and r̄_u and r̄_v are the users' mean ratings.
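
A sketch of the Pearson formulation above, computed over the items the two users have co-rated; the rating dictionaries are illustrative, and the user means are taken over the co-rated items (one common variant).

```python
from math import sqrt

def pearson_similarity(ratings_u: dict, ratings_v: dict) -> float:
    """Pearson correlation between two users over their co-rated items."""
    common = set(ratings_u) & set(ratings_v)
    if len(common) < 2:
        return 0.0                                    # not enough overlap
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den = sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common)) * \
          sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in common))
    return num / den if den else 0.0

u = {"item1": 5, "item2": 3, "item3": 4}
v = {"item1": 4, "item2": 2, "item3": 5}
print(pearson_similarity(u, v))
```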

Reference: Collaborative Filtering Recommender Systems.

Clustering

References: Applied Multivariate Statistical Analysis (ebook); Model-Based Clustering; Expectation Maximization; UV-Decomposition; SVD Method.

A comprehensive survey of neighborhood-based methods for collaborative filtering. Neighborhood methods focus on relationships between items or, alternatively, between users. An item-item approach models the preference of a user for an item based on the ratings given by the same user to similar items. Nearest-neighbor methods enjoy considerable popularity due to their simplicity, their efficiency, and their ability to produce accurate and personalized recommendations. The authors address the essential decisions that are required when implementing a neighborhood-based recommender system and provide practical information on how to make such decisions.

Finally, the chapter deals with the problems of data sparsity and limited coverage, often observed in large commercial recommender systems, and presents a few solutions to overcome them.
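
A minimal sketch of the item-item idea described above: the predicted rating of a user for a target item is a similarity-weighted average of that user's ratings for the k most similar items. The item-to-item similarities are assumed to be precomputed (e.g., with cosine or Pearson between item rating vectors), and the toy values are illustrative.

```python
def predict_item_item(user_ratings: dict, target_item: str,
                      item_sim: dict, k: int = 2) -> float:
    """user_ratings: {item: rating by this user};
    item_sim: {(target, other_item): similarity}, assumed precomputed."""
    neighbours = sorted(
        ((item_sim.get((target_item, j), 0.0), r) for j, r in user_ratings.items()),
        reverse=True,
    )[:k]
    num = sum(s * r for s, r in neighbours if s > 0)
    den = sum(abs(s) for s, r in neighbours if s > 0)
    return num / den if den else 0.0

ratings = {"A": 5.0, "B": 3.0, "C": 4.0}
sims = {("X", "A"): 0.9, ("X", "B"): 0.1, ("X", "C"): 0.6}
print(predict_item_item(ratings, "X", sims))  # weighted mainly by items A and C, ≈ 4.6
```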

Demographic: This type of system recommends items based on the demographic profile of the user. The assumption is that different recommendations should be generated for different demographic niches. Many Web sites adopt simple and effective personalization solutions based on demographics: for example, users are dispatched to particular Web sites based on their language or country, or suggestions may be customized according to the age of the user. While these approaches have been quite popular in the marketing literature, there has been relatively little proper RS research into demographic systems.

Knowledge-based: Knowledge-based systems recommend items based on specific domain knowledge about how certain item features meet user needs and preferences and, ultimately, how the item is useful for the user. Notable knowledge-based recommender systems are case-based: in these systems, a similarity function estimates how well the user needs (the problem description) match the recommendations (the solutions to the problem). Here the similarity score can be directly interpreted as the utility of the recommendation for the user. Constraint-based systems are another type of knowledge-based RS.

In terms of the knowledge used, the two types of system are similar: user requirements are collected, repairs for inconsistent requirements are automatically proposed when no solution can be found, and recommendation results are explained. The major difference lies in how solutions are calculated: case-based recommenders determine recommendations on the basis of similarity metrics, whereas constraint-based recommenders predominantly exploit predefined knowledge bases that contain explicit rules about how to relate customer requirements to item features.

Knowledge-based systems tend to work better than others at the beginning of their deployment, but if they are not equipped with learning components they may be surpassed by shallower methods that can exploit the logs of the human-computer interaction (as CF does).

Community-based: This type of system recommends items based on the preferences of the user's friends. This technique follows the epigram "Tell me who your friends are, and I will tell you who you are." Evidence suggests that people tend to rely more on recommendations from their friends than on recommendations from similar but anonymous individuals. This observation, combined with the growing popularity of open social networks, is generating a rising interest in community-based systems or, as they are usually referred to, social recommender systems. This type of RS models and acquires information about the social relations of the users and the preferences of the user's friends, and the recommendation is based on ratings that were provided by the user's friends. In fact, these RSs follow the rise of social networks and enable a simple and comprehensive acquisition of data related to the social relations of the users.

Hybrid recommender systems: These RSs are based on a combination of the above-mentioned techniques. A hybrid system combining techniques A and B tries to use the advantages of A to fix the disadvantages of B.

For instance, CF methods suffer from the new-item problem, i.e., they cannot recommend items that have no ratings. This does not limit content-based approaches, since the prediction for new items is based on their description (features), which is typically readily available. Given two (or more) basic RS techniques, several ways have been proposed for combining them to create a new hybrid system. As we have already mentioned, the context of the user when she is seeking a recommendation can be used to better personalize the output of the system. For example, in a temporal context, vacation recommendations in winter should be very different from those provided in summer; likewise, a restaurant recommendation for a Saturday evening with friends should be different from one suggested for a workday lunch with co-workers.
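
One simple hybridization, sketched below: a weighted combination of a collaborative score and a content-based score, falling back to the content-based prediction alone for new items that have no ratings yet. The weight alpha is an assumption to be tuned, not a value from the text.

```python
def hybrid_score(cf_score, cb_score, n_ratings: int, alpha: float = 0.7):
    """Weighted hybrid: lean on CF when the item has ratings,
    otherwise fall back to the content-based prediction (new-item case)."""
    if n_ratings == 0 or cf_score is None:
        return cb_score
    return alpha * cf_score + (1 - alpha) * cb_score

print(hybrid_score(4.2, 3.5, n_ratings=120))  # established item -> ~3.99
print(hybrid_score(None, 3.5, n_ratings=0))   # new item -> 3.5
```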

Application and Evaluation

Recommender system research is conducted with a strong emphasis on practice and commercial applications, since, aside from its theoretical contribution, it is generally aimed at practically improving commercial RSs. Thus, RS research involves practical aspects that apply to the implementation of these systems. These aspects are relevant to different stages in the life cycle of a RS, namely the design of the system, its implementation, and its maintenance and enhancement during system operation.

The aspects that apply to the design stage include factors that might affect the choice of the algorithm. The first factor to consider, the application's domain, has a major effect on the algorithmic approach that should be taken. [72] provide a taxonomy of RSs and classify existing RS applications into specific application domains. Based on these specific application domains, we define more general classes of domains for the most common recommender system applications:

• Entertainment: recommendations for movies, music, and IPTV.

• Content: personalized newspapers, recommendation of documents, recommendations of Web pages, e-learning applications, and e-mail filters.

• E-commerce: recommendations for consumers of products to buy, such as books, cameras, or PCs.

• Services: recommendations of travel services, recommendation of experts for consultation, recommendation of houses to rent, or matchmaking services.

Data Collection

The logical component in charge of pre-processing the data and generating the input of the recommender algorithm is referred to as the data collector. The data collector gathers data from different sources, such as the EPG for information about live programs, the content provider for information about the VOD catalog, and the service provider for information about the users.

The Fastweb recommender system does not rely on personal information about the users (e.g., age, gender, occupation). Recommendations are based on the users' past behavior (what they watched) and on any explicit preferences they have expressed (e.g., preferred genres). If the users did not specify any explicit preferences, the system is able to infer them by analyzing the users' past activities.

An important question was raised in Section 9.2: users interact with the IPTV system by means of the STB, but typically we cannot identify who is actually in front of the TV. Consequently, the STB collects the behavior and the preferences of a set of users (e.g., the members of a family). This is a considerable problem, since we are limited to generating per-STB recommendations. In order to simplify the notation, in the rest of the paper the terms user and STB refer to the same entity.

paper we will refer to user and STB to identify the same entity The user-disambiguation problem has been partially solved by separating the collected information according to the time slot they refer to For instance we can roughly assume the following pattern housewives use to watch TV during the morning children during the afternoon the whole family at evening while only adults watch TV during the night By means of this simple time slot distinction we are able to distinguish among different potential users of the same STB

Formally the available information has been structured into two main matrices practically stored into a relational database the item-content matrix (ICM) and the user-rating matrix (URM)

The former describes the principal characteristics (metadata) of each item In the following we will refer to the item-content matrix as W whose elements wci represent the relevance of characteristic (metadata) c for item i The ICM is generated from the analysis of the set of information given by the content provider (ie the EPG) Such information concerns for instance the title of a movie the actors the director(s) the genre(s) and the plot Note that in a real environment we can face with inaccurate information especially because of the rate new content is added every day The information provided by the ICM is used to generate a content-based recommendation after being filtered by means of techniques for PoS (Part-of-Speech) tagging stop words removal and latent semantic analysis Moreover the ICM can be used to perform some kind of processing on the items (eg parental control)

The URM represents the ratings (ie preferences) of users about items In the following we will refer to such matrix as R whose elements rpi represent the rating of user p about item i Such preferences constitute the basic information for any collaborative algorithm The user rating can be either explicit or implicit according to the fact that the ratings are explicitly expressed by users or are implicitly collected by the system respectively

Explicit ratings confidently represent the user opinion even though they can be affected by biases due to user subjectivity item popularity or global rating tendency The first bias depends on arbitrary interpretations of the rating scale For instance in a rating scale between 1 and 5 some user could use the value 3 to indicate an interesting item someone else could use 3 for a not much interesting item Similarly popular items tend to be overrated while unpopular items are usually underrated Finally explicit ratings can be affected by global attitudes (eg users are more willing to rate movies they like)

On the other hand implicit ratings are inferred by the system on the basis of the user-system interaction which might not match the user opinion For instance the system is able to monitor whether a user has watched a live program on a certain channel or whether the user has uninterruptedly watched a movie Despite explicit ratings are more reliable than implicit ratings in representing the actual user interest towards an item their collection can be annoying from the userrsquos perspective

Long Tail

In statistics a long tail of some distributions of numbers is the portion of the distribution having a large number of occurrences far from the head or central part of the distribution The distribution could involve popularities random numbers of occurrences of events with various probabilities etc A probability distribution is said to have a long tail if a larger share of population rests within its tail than would under a normal distribution A long-tail distribution will arise with the inclusion of many values unusually far from the mean which increase the magnitude of the skewness of the distributionA long-tailed distribution is a particular type of heavy-tailed distribution The distribution and inventory costs of businesses successfully applying this strategy allow them to realize significant profit out of selling small volumes of hard-to-find items to many customers instead of only

selling large volumes of a reduced number of popular items The total sales of this large number of non-hit items is called the long tail

Personalization

Today personalization is something that occurs separately within each system that one interacts with Recommender systems are one technique for personalization in essence the personalization occurs slowly as each system builds up information about your likes and dislikes about what interests you and what fails to interest you There are numerous other personalization techniques most of these rely either on collection of system usage history which is then employed to change the behavior of the system or on the user taking the time and trouble to explicitly personalize the behavior of the system in various ways by setting parameters making selections or engaging in dialogs with the system

There are several problems with this model at least from the users point of view Investment in personalizing one system (either through explicit action or just long use) are not transferable to another system (Of course from the system operators point of view this may be very desirable it increases switching costs for users and thus helps lock in a user base) Information such as likes and dislikes or usage patterns are scattered across multiple systems and cant be combined to obtain maximum leverage And the user does not have control of the information bases that define his or her profile If you want to buy books from multiple online booksellers this is annoying But if we are concerned with developing information discovery systems to assist users in a world of information overload these problems are critical People obtain information from a multiplicity of sources and personalization has to happen close to the end user this is the only place where there is enough information to do personalization effectively to keep track of whats new and what isnt what has and has not proven useful The user needs to become anhub and a switch moving data to allow accurate personalization from one system to another

Levels of measurement

What a scale actually means and what we can do with it depends on what its numbers represent Numbers can be grouped into 4 types or levels nominal ordinal interval and ratio Nominal is the most simple and ratio the most sophisticated Each level possesses the characteristics of the preceding level plus an additional quality

Nominal

Nominal is hardly measurement It refers to quality more than quantity A nominal level of measurement is simply a matter of distinguishing by name eg 1 = male 2 = female Even though we are using the numbers 1 and 2 they do not denote quantity The binary category of 0 and 1 used for computers is a nominal level of measurement They are categories or classifications Nominal measurement is like using categorical levels of variables described in the Doing Scientific Research section of the Introduction module

Examples MEAL PREFERENCE Breakfast Lunch Dinner RELIGIOUS PREFERENCE 1 = Buddhist 2 = Muslim 3 = Christian 4 = Jewish 5 =

Other POLITICAL ORIENTATION Republican Democratic Libertarian Green

Nominal time of day - categories no additional information

Ordinal

Ordinal refers to order in measurement An ordinal scale indicates direction in addition to providing nominal information LowMediumHigh or FasterSlower are examples of ordinal levels of measurement Ranking an experience as a nine on a scale of 1 to 10 tells us that it was higher than an experience ranked as a six Many psychological scales or inventories are at the ordinal level of measurement

Examples RANK 1st place 2nd place last place LEVEL OF AGREEMENT No Maybe Yes POLITICAL ORIENTATION Left Center Right

Ordinal time of day - indicates direction or order of occurrence spacing between is uneven

Interval

Interval scales provide information about order and also possess equal intervals From the previous example if we knew that the distance between 1 and 2 was the same as that between 7 and 8 on our 10-point rating scale then we would have an interval scale An example of an interval scale is temperature either measured on a Fahrenheit or Celsius scale A degree represents the same underlying amount of heat regardless of where it occurs on the scale Measured in Fahrenheit units the difference between a temperature of 46 and 42 is the same as the difference between 72 and 68 Equal-interval scales of measurement can be devised for opinions and attitudes Constructing them involves an understanding of mathematical and statistical principles beyond those covered in this course But it is important to understand the different levels of measurement when using and interpreting scales

Examples TIME OF DAY on a 12-hour clock POLITICAL ORIENTATION Score on standardized scale of political orientation OTHER scales constructed so as to possess equal intervals

Interval time of day - equal intervals analog (12-hr) clock difference between 1 and 2 pm is same as difference between 11 and 12 am

Ratio

In addition to possessing the qualities of nominal ordinal and interval scales a ratio scale has an absolute zero (a point where none of the quality being measured exists) Using a ratio scale permits comparisons such as being twice as high or one-half as much Reaction time (how long it takes to respond to a signal of some sort) uses a ratio scale of measurement -- time Although an individuals reaction time is always greater than zero we conceptualize a zero point in time and can state that a response of 24 milliseconds is twice as fast as a response time of 48 milliseconds

Examples RULER inches or centimeters YEARS of work experience INCOME money earned last year NUMBER of children GPA grade point average

Ratio - 24-hr time has an absolute 0 (midnight) 14 oclock is twice as long from midnight as 7 oclock

Applications

The level of measurement for a particular variable is defined by the highest category that it achieves For example categorizing someone as extroverted (outgoing) or introverted (shy) is nominal If we categorize people 1 = shy 2 = neither shy nor outgoing 3 = outgoing then we have an ordinal level of measurement If we use a standardized measure of shyness (and there are such inventories) we would probably assume the shyness variable meets the standards of an interval level of measurement As to whether or not we might have a ratio scale of shyness although we might be able to measure zero shyness it would be difficult to devise a scale where we would be comfortable talking about someones being 3 times as shy as someone else

Measurement at the interval or ratio level is desirable because we can use the more powerful statistical procedures available for Means and Standard Deviations To have this advantage often ordinal data are treated as though they were interval for example subjective ratings scales (1 = terrible 2= poor 3 = fair 4 = good 5 = excellent) The scale probably does not meet the requirement of equal intervals -- we dont know that the difference between 2 (poor) and 3 (fair) is the same as the difference between 4 (good) and 5 (excellent) In order to take advantage of more powerful statistical techniques researchers often assume that the intervals are equal

Data Preprocessing

Data have quality if they satisfy the requirements of the intended use There are many factors comprising data quality including accuracy completeness consistency timeliness believability and interpretability

Imagine that you are a manager at AllElectronics and have been charged with analyzing the companyrsquos data with respect to your branchrsquos sales You immediately set out to perform this task You carefully inspect the companyrsquos database and data warehouse identifying and selecting the attributes or dimensions (eg item price and units sold) to be included in your analysis Alas You notice that several of the attributes for various tuples have no recorded value For your analysis you would like to include information as to whether each item purchased was advertised as on sale yet you discover that this information has not been recorded Furthermore users of your database system have reported errors unusual values and inconsistencies in the data recorded for some transactions In other words the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest or containing only aggregate data) inaccurate or noisy (containing errors or values that deviate from the expected) and inconsistent (eg containing discrepancies in the department codes used to categorize items)Welcome to the real world

This scenario illustrates three of the elements defining data quality accuracy completeness and consistency Inaccurate incomplete and inconsistent data are commonplace properties of large real-world databases and data warehouses There are many possible reasons for inaccurate data (ie having incorrect attribute values) The data collection instruments used may be faulty There may have been human or computer errors occurring at data entry Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (eg by choosing the default value ldquoJanuary 1rdquo displayed for birthday) This is known as disguised missing data Errors in data transmission can also occur There may be technology limitations such as limited buffer size for coordinating synchronized data transfer and consumption Incorrect data may also result from inconsistencies in naming conventions or data codes or inconsistent formats for input fields (eg date) Duplicate tuples also require data cleaning

Incomplete data can occur for a number of reasons Attributes of interest may not always be available such as customer information for sales transaction data Other data may not be included simply because they were not considered important at the time of entry Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions Data that were inconsistent with other recorded data may have been deleted Furthermore the recording of the data history or modifications may have been overlookedMissing data particularly for tuples with missing values for some attributes may need to be inferred

Recall that data quality depends on the intended use of the data Two different users may have very different assessments of the quality of a given database For example a marketing analyst may need to access the database mentioned before for a list of customer addresses Some of the addresses are outdated or incorrect yet overall 80 of the addresses are accurate The marketing analyst considers this to be a large customer database for target marketing purposes and is pleased with the databasersquos accuracy although as sales manager you found the data inaccurate

Timeliness also affects data quality Suppose that you are overseeing the distribution of monthly sales bonuses to the top sales representatives at AllElectronics Several sales representatives however fail to submit their sales records on time at the end of the month There are also a number of corrections and adjustments that flow in after the monthrsquos end For a period of time following each month the data stored in the database are incomplete However once all of the data are received it is correct The fact that the month-end data are not updated in a timely fashion has a negative impact on the data quality

Two other factors affecting data quality are believability and interpretability Believability reflects how much the data are trusted by users while interpretability reflects how easy the data are understood Suppose that a database at one point had several errors all of which have since been corrected The past errors however had caused many problems for sales department users and so they no longer trust the data The data also use many accounting codes which the sales department does not know how to interpret Even though the database is now accurate complete consistent and timely sales department users may regard it as of low quality due to poor believability and interpretability

Data Cleaning

Real-world data tend to be incomplete noisy and inconsistent Data cleaning (or data cleansing) routines attempt to fill in missing values smooth out noise while identifying outliers and correct inconsistencies in the data In this section you will study basic methods for data cleaning Section 321 looks at ways of handling missing values Section 322 explains data smoothing techniques Section 323 discusses approaches to data cleaning as a process

321 Missing Values

Imagine that you need to analyze AllElectronics sales and customer data You note that many tuples have no recorded value for several attributes such as customer income How can you go about filling in the missing values for this attribute Letrsquos look at the following methods

1 Ignore the tuple This is usually done when the class label is missing (assuming the mining task involves classification) This method is not very effective unless the tuple contains several attributes with missing values It is especially poor when the percentage of missing values per attribute varies considerably By ignoring the tuple we do not make use of the remaining attributesrsquo values in the tuple Such data could have been useful to the task at hand

2 Fill in the missing value manually In general this approach is time consuming and may not be feasible given a large data set with many missing values

3 Use a global constant to fill in the missing value Replace all missing attribute values by the same constant such as a label like ldquoUnknownrdquo or 10485761 If missing values are replaced by say ldquoUnknownrdquo then the mining program may mistakenly think that they form an interesting concept since they all have a value in commonmdashthat of ldquoUnknownrdquo Hence although this method is simple it is not foolproof

4 Use a measure of central tendency for the attribute (eg the mean or median) to fill in the missing value Chapter 2 discussed measures of central tendency which indicate the ldquomiddlerdquo value of a data distribution For normal (symmetric) data distributions the mean can be used while skewed data distribution should employ the median (Section 22) For example suppose that the data distribution regarding the income of AllElectronics customers is symmetric and that the mean income is $56000 Use this value to replace the missing value for income

5 Use the attribute mean or median for all samples belonging to the same class as the given tuple For example if classifying customers according to credit risk we may replace the missing value with the mean income value for customers in the same credit risk category as that of the given tuple If the data distribution for a given class is skewed the median value is a better choice

6 Use the most probable value to fill in the missing value This may be determined with regression inference-based tools using a Bayesian formalism or decision tree induction For example using the other customer attributes in your data set you may construct a decision tree to predict the missing values for income Decision trees and Bayesian inference are described in detail in Chapters 8 and 9 respectively while regression is introduced in Section 345

Noisy Data

ldquoWhat is noiserdquo Noise is a random error or variance in a measured variable In Chapter 2 we saw how some basic statistical description techniques (eg boxplots and scatter plots) and methods of data visualization can be used to identify outliers which may represent noise Given a numeric attribute such as say price how can we ldquosmoothrdquo out the data to remove the noise Letrsquos look at the following data smoothing techniques

Binning Binning methods smooth a sorted data value by consulting its ldquoneighborhoodrdquo that is the values around it The sorted values are distributed into a number of ldquobucketsrdquo or bins Because binning methods consult the neighborhood of valuesthey perform local smoothing Figure 32 illustrates some binning techniques In this example the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (ie each bin contains three values) In smoothing by bin means each value in a bin is replaced by the mean value of the bin For example the mean of the values 4 8 and 15 in Bin 1 is 9 Therefore each original value in this bin is replaced by the value 9

Similarly smoothing by bin medians can be employed in which each bin value is replaced by the bin median In smoothing by bin boundaries the minimum and maximum values in a given bin are identified as the bin boundaries Each bin value is then replaced by the closest boundary value In general the larger the width the greater the effect of the smoothing Alternatively bins may be equal width where the interval range of values in each bin is constant Binning is also used as a discretization technique and is further discussed in Section 35

Regression Data smoothing can also be done by regression a technique that conforms data values to a function Linear regression involves finding the ldquobestrdquo line to fit two attributes (or variables) so that one attribute can be used to predict the other Multiple linear regression is an extension of linear regression where more than two attributes are involved and the data are fit to a multidimensional surface Regression is further described in Section 345

Outlier analysis Outliers may be detected by clustering for example where similar values are organized into groups or ldquoclustersrdquo Intuitively values that fall outside of the set of clusters may be considered outliers (Figure 33) Chapter 12 is dedicated to the topic of outlier analysis

Data Integration

Data mining often requires data integrationmdashthe merging of data from multiple data stores Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set This can help improve the accuracy and speed of the subsequent data mining process

Data Transformation Strategies Overview

In data transformation the data are transformed or consolidated into forms appropriate for mining

Strategies for data transformation include the following

1 Smoothing which works to remove noise from the data Techniques include binning regression and clustering

2 Attribute construction (or feature construction) where new attributes are constructed and added from the given set of attributes to help the mining process

3 Aggregation where summary or aggregation operations are applied to the data For example the daily sales data may be aggregated so as to compute monthly and annual total amounts This step is typically used in constructing a data cube for data analysis at multiple abstraction levels

4 Normalization where the attribute data are scaled so as to fall within a smaller range such as 104857610 to 10 or 00 to 10

5 Discretization where the raw values of a numeric attribute (eg age) are replaced by interval labels (eg 0ndash10 11ndash20 etc) or conceptual labels (eg youth adult senior) The labels in turn can be recursively organized into higher-level concepts resulting in a concept hierarchy for the numeric attribute Figure 312 shows a concept hierarchy for the attribute price More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users

6 Concept hierarchy generation for nominal data where attributes such as street can be generalized to higher-level concepts like city or country Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level

Data Transformation by Normalization

The measurement unit used can affect the data analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results. In general, expressing an attribute in smaller units will lead to a larger range for that attribute, and thus tend to give such an attribute greater effect or "weight." To help avoid dependence on the choice of measurement units, the data should be normalized or standardized. This involves transforming the data to fall within a smaller or common range, such as [-1, 1] or [0.0, 1.0]. (The terms standardize and normalize are used interchangeably in data preprocessing, although in statistics the latter term also has other connotations.)

Normalizing the data attempts to give all attributes an equal weight. Normalization is particularly useful for classification algorithms involving neural networks, or for distance measurements such as nearest-neighbor classification and clustering. If using the neural network backpropagation algorithm for classification mining (Chapter 9), normalizing the input values for each attribute measured in the training tuples will help speed up the learning phase. For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes). It is also useful when given no prior knowledge of the data.

There are many methods for data normalization. We study min-max normalization, z-score normalization, and normalization by decimal scaling. For our discussion, let A be a numeric attribute with n observed values v1, v2, ..., vn.

Min-max normalization performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute A. Min-max normalization maps a value v_i of A to v'_i in the range [new_min_A, new_max_A] by computing

v'_i = ((v_i - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for A.

In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean (i.e., average) and standard deviation of A. A value v_i of A is normalized to v'_i by computing

v'_i = (v_i - mean_A) / stddev_A

where mean_A and stddev_A are the mean and standard deviation of the observed values of A.

Decimal scaling: Normalization by decimal scaling moves the decimal point of the values of A, that is, v'_i = v_i / 10^j, where j is the smallest integer such that max(|v'_i|) < 1. Suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling we therefore divide each value by 1,000 (i.e., j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.

Note that normalization can change the original data quite a bit, especially when using z-score normalization or decimal scaling. It is also necessary to save the normalization parameters (e.g., the mean and standard deviation if using z-score normalization) so that future data can be normalized in a uniform manner.
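A compact NumPy sketch of the three schemes. The endpoints -986 and 917 come from the example above; the intermediate values and the chosen new range are illustrative.

import numpy as np

v = np.array([-986.0, -350.0, 12.0, 478.0, 917.0])   # observed values of attribute A

# Min-max normalization to the new range [0.0, 1.0].
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization (zero mean, unit standard deviation).
v_zscore = (v - v.mean()) / v.std()

# Normalization by decimal scaling: divide by 10^j, with the smallest j
# such that every scaled value has absolute value below 1.
j = int(np.ceil(np.log10(np.abs(v).max())))
v_decimal = v / 10 ** j

print(v_minmax)   # -986 maps to 0.0 and 917 maps to 1.0
print(v_decimal)  # -986 maps to -0.986 and 917 maps to 0.917

The min/max (or mean and standard deviation) computed here are exactly the parameters that would be stored so that future data can be normalized in the same way.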

Discretization by Binning

Binning is a top-down splitting technique based on a specified number of bins. Section 3.2.2 discussed binning methods for data smoothing. These methods are also used as discretization methods for data reduction and concept hierarchy generation. For example, attribute values can be discretized by applying equal-width or equal-frequency binning, and then replacing each bin value by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively. These techniques can be applied recursively to the resulting partitions to generate concept hierarchies. Binning does not use class information and is therefore an unsupervised discretization technique. It is sensitive to the user-specified number of bins, as well as to the presence of outliers.
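Equal-width discretization into labelled intervals can be sketched with NumPy; the attribute values, the number of bins, and the youth/adult/senior labels are illustrative.

import numpy as np

age = np.array([3, 7, 15, 22, 36, 41, 58, 63, 70, 88])

# Equal-width binning: split the observed range into 3 intervals of equal width.
n_bins = 3
edges = np.linspace(age.min(), age.max(), n_bins + 1)

# np.digitize assigns each value the index of the interval it falls into.
bin_index = np.digitize(age, edges[1:-1])            # 0, 1 or 2
labels = np.array(["youth", "adult", "senior"])[bin_index]
print(list(zip(age.tolist(), labels.tolist())))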

Distance/Similarity Measures

Similarity: a measure of how close to each other two instances are. The "closer" the instances are to each other, the larger is the similarity value.

Dissimilarity: a measure of how different two instances are. Dissimilarity is large when instances are very different and is small when they are close.

Proximity: refers to either similarity or dissimilarity.

Distance metric: a measure of dissimilarity d that obeys the following laws: d(x, y) >= 0, with d(x, y) = 0 exactly when x = y (non-negativity and identity); d(x, y) = d(y, x) (symmetry); and d(x, z) <= d(x, y) + d(y, z) (the triangle inequality).

Conversion of similarity and dissimilarity measures

Typically, given a similarity measure, one can "revert" it to serve as the dissimilarity measure, and vice versa.

Conversions may differ. E.g., if d is a distance measure, one can use, for example,

s = 1 / (1 + d)

as the corresponding similarity measure. If s is a similarity measure that ranges between 0 and 1 (a so-called degree of similarity), then the corresponding dissimilarity measure can be defined as

d = 1 - s

In general, any monotonically decreasing transformation can be applied to convert similarity measures into dissimilarity measures, and any monotonically increasing transformation can be applied to convert the measures the other way around.
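A tiny sketch of such conversions; the particular formulas are just common choices (the ones shown above), not the only valid ones.

# Convert a distance d >= 0 into a similarity in (0, 1], and a degree of
# similarity s in [0, 1] into a dissimilarity; both conversions are monotone.
def dist_to_sim(d):
    return 1.0 / (1.0 + d)

def sim_to_dissim(s):
    return 1.0 - s

print(dist_to_sim(0.0), dist_to_sim(4.0))   # 1.0 and 0.2
print(sim_to_dissim(0.75))                  # 0.25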

Distance Metrics for Numeric Attributes

When the data set is presented in a standard form, each instance can be treated as a vector x = (x1, ..., xN) of measures for attributes numbered 1, ..., N.

Consider, for now, only non-nominal scales.
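Common choices for numeric vectors are the Minkowski family of metrics (Manhattan for p = 1, Euclidean for p = 2); a sketch with made-up vectors follows.

import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance between two numeric vectors (p=1 Manhattan, p=2 Euclidean)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]
print(minkowski(x, y, p=1))   # 7.0  (Manhattan)
print(minkowski(x, y, p=2))   # 5.0  (Euclidean)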

Computing User Similarity

A critical design decision in implementing user-user CF is the choice of similarity function. Several different similarity functions have been proposed and evaluated in the literature. Pearson correlation: this method computes the statistical correlation (Pearson's r) between two users' common ratings to determine their similarity. GroupLens and BellCore both used this method. The correlation is computed by the following formula, where the sums run over the items i rated by both users u and v, r_u,i denotes u's rating of item i, and mean_u denotes u's mean rating over those common items:

sim(u, v) = sum_i (r_u,i - mean_u)(r_v,i - mean_v) / ( sqrt(sum_i (r_u,i - mean_u)^2) * sqrt(sum_i (r_v,i - mean_v)^2) )
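A sketch of this computation over the items two users have co-rated, in plain Python; the function name, the dictionaries, and the handling of small overlaps are illustrative choices.

from math import sqrt

def pearson_similarity(ratings_u, ratings_v):
    """Pearson correlation between two users' ratings on their co-rated items."""
    common = set(ratings_u) & set(ratings_v)
    if len(common) < 2:
        return 0.0                       # not enough overlap to correlate
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den = sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common)) * \
          sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in common))
    return num / den if den else 0.0

u = {"item1": 5, "item2": 3, "item3": 4}
v = {"item1": 4, "item2": 2, "item3": 5, "item4": 1}
print(pearson_similarity(u, v))          # roughly 0.65 for this toy data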

Further reading (external links in the original notes): Collaborative Filtering Recommender Systems.

Clustering

Further reading (external links in the original notes): Applied Multivariate Statistical Analysis (ebook); Model-Based Clustering; Expectation Maximization; UV-Decomposition; SVD Method.

For instance, CF methods suffer from new-item problems, i.e., they cannot recommend items that have no ratings. This does not limit content-based approaches, since the prediction for new items is based on their description (features), which is typically easily available. Given two (or more) basic RS techniques, several ways have been proposed for combining them to create a new hybrid system. As we have already mentioned, the context of the user when she is seeking a recommendation can be used to better personalize the output of the system. For example, in a temporal context, vacation recommendations in winter should be very different from those provided in summer. Or a restaurant recommendation for a Saturday evening with your friends should be different from that suggested for a workday lunch with co-workers.

Application and Evaluation

Recommender system research is being conducted with a strong emphasis on practice and commercial applications since, aside from its theoretical contribution, it is generally aimed at practically improving commercial RSs. Thus, RS research involves practical aspects that apply to the implementation of these systems. These aspects are relevant to different stages in the life cycle of a RS, namely the design of the system, its implementation, and its maintenance and enhancement during system operation.

The aspects that apply to the design stage include factors that might affect the choice of the algorithm. The first factor to consider, the application's domain, has a major effect on the algorithmic approach that should be taken. [72] provide a taxonomy of RSs and classify existing RS applications into specific application domains. Based on these specific application domains, we define more general classes of domains for the most common recommender system applications:

• Entertainment: recommendations for movies, music, and IPTV.

• Content: personalized newspapers, recommendation for documents, recommendations of Web pages, e-learning applications, and e-mail filters.

• E-commerce: recommendations for consumers of products to buy, such as books, cameras, PCs, etc.

• Services: recommendations of travel services, recommendation of experts for consultation, recommendation of houses to rent, or matchmaking services.

Data Collection

The logical component in charge of pre-processing the data and generating the input of the recommender algorithm is referred to as the data collector. The data collector gathers data from different sources, such as the EPG for information about the live programs, the content provider for information about the VOD catalog, and the service provider for information about the users.

The Fastweb recommender system does not rely on personal information about the users (e.g., age, gender, occupation). Recommendations are based on the users' past behavior (what they watched) and on any explicit preferences they have expressed (e.g., preferred genres). If the users did not specify any explicit preferences, the system is able to infer them by analyzing the users' past activities.

An important question was raised in Section 9.2: users interact with the IPTV system by means of the STB, but typically we cannot identify who is actually in front of the TV. Consequently, the STB collects the behavior and the preferences of a set of users (e.g., the members of a family). This represents a considerable problem, since we are limited to generating per-STB recommendations. In order to simplify the notation, in the rest of the paper we will use user and STB to refer to the same entity. The user-disambiguation problem has been partially solved by separating the collected information according to the time slot it refers to. For instance, we can roughly assume the following pattern: housewives tend to watch TV during the morning, children during the afternoon, and the whole family in the evening, while only adults watch TV during the night. By means of this simple time-slot distinction, we are able to distinguish among different potential users of the same STB.
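The time-slot splitting described above can be sketched as a simple mapping from the viewing hour to an audience slot; the hour boundaries and slot names are illustrative, not taken from the paper.

def time_slot(hour):
    """Map the hour of a viewing event (0-23) to a coarse audience time slot."""
    if 6 <= hour < 12:
        return "morning"      # e.g. housewives
    if 12 <= hour < 18:
        return "afternoon"    # e.g. children
    if 18 <= hour < 23:
        return "evening"      # e.g. whole family
    return "night"            # e.g. adults only

print(time_slot(9), time_slot(15), time_slot(21), time_slot(2))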

Formally, the available information has been structured into two main matrices, practically stored in a relational database: the item-content matrix (ICM) and the user-rating matrix (URM).

The former describes the principal characteristics (metadata) of each item. In the following we will refer to the item-content matrix as W, whose elements w_ci represent the relevance of characteristic (metadata) c for item i. The ICM is generated from the analysis of the set of information given by the content provider (i.e., the EPG). Such information concerns, for instance, the title of a movie, the actors, the director(s), the genre(s), and the plot. Note that in a real environment we may be faced with inaccurate information, especially because of the rate at which new content is added every day. The information provided by the ICM is used to generate content-based recommendations after being filtered by means of techniques for PoS (part-of-speech) tagging, stop-word removal, and latent semantic analysis. Moreover, the ICM can be used to perform some kinds of processing on the items (e.g., parental control).

The URM represents the ratings (i.e., preferences) of users for items. In the following we will refer to this matrix as R, whose elements r_pi represent the rating of user p for item i. Such preferences constitute the basic information for any collaborative algorithm. The user ratings can be either explicit or implicit, depending on whether they are explicitly expressed by users or implicitly collected by the system.
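A minimal sketch of assembling R from (user, item, rating) events; the indices and rating values are made up, and a small dense array is used only to show the idea (in practice R would be stored sparsely, e.g. with scipy.sparse, since most entries are missing).

import numpy as np

# (user p, item i, rating r_pi) triples collected from explicit or implicit feedback.
events = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 0, 1.0), (2, 3, 2.0)]

n_users, n_items = 3, 4
R = np.zeros((n_users, n_items))          # user-rating matrix; 0 = no rating observed
for p, i, r in events:
    R[p, i] = r
print(R)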

Explicit ratings confidently represent the user's opinion, even though they can be affected by biases due to user subjectivity, item popularity, or global rating tendency. The first bias depends on arbitrary interpretations of the rating scale. For instance, on a rating scale between 1 and 5, some users could use the value 3 to indicate an interesting item, while others could use 3 for a not very interesting item. Similarly, popular items tend to be overrated, while unpopular items are usually underrated. Finally, explicit ratings can be affected by global attitudes (e.g., users are more willing to rate movies they like).

On the other hand, implicit ratings are inferred by the system on the basis of the user-system interaction, which might not match the user's opinion. For instance, the system is able to monitor whether a user has watched a live program on a certain channel, or whether the user has watched a movie uninterruptedly. Although explicit ratings are more reliable than implicit ratings in representing the actual user interest in an item, their collection can be annoying from the user's perspective.

Long Tail

In statistics, a long tail of some distributions of numbers is the portion of the distribution having a large number of occurrences far from the head or central part of the distribution. The distribution could involve popularities, random numbers of occurrences of events with various probabilities, etc. A probability distribution is said to have a long tail if a larger share of the population rests within its tail than would under a normal distribution. A long-tail distribution will arise with the inclusion of many values unusually far from the mean, which increase the magnitude of the skewness of the distribution. A long-tailed distribution is a particular type of heavy-tailed distribution. The distribution and inventory costs of businesses successfully applying this strategy allow them to realize significant profit out of selling small volumes of hard-to-find items to many customers, instead of only selling large volumes of a reduced number of popular items. The total sales of this large number of non-hit items is called the long tail.

Personalization

Today, personalization is something that occurs separately within each system that one interacts with. Recommender systems are one technique for personalization; in essence, the personalization occurs slowly as each system builds up information about your likes and dislikes, about what interests you and what fails to interest you. There are numerous other personalization techniques; most of these rely either on the collection of system usage history, which is then employed to change the behavior of the system, or on the user taking the time and trouble to explicitly personalize the behavior of the system in various ways, by setting parameters, making selections, or engaging in dialogs with the system.

There are several problems with this model, at least from the user's point of view. Investments in personalizing one system (either through explicit action or just long use) are not transferable to another system. (Of course, from the system operator's point of view this may be very desirable: it increases switching costs for users and thus helps lock in a user base.) Information such as likes and dislikes or usage patterns is scattered across multiple systems and can't be combined to obtain maximum leverage. And the user does not have control of the information bases that define his or her profile. If you want to buy books from multiple online booksellers, this is annoying. But if we are concerned with developing information discovery systems to assist users in a world of information overload, these problems are critical. People obtain information from a multiplicity of sources, and personalization has to happen close to the end user; this is the only place where there is enough information to do personalization effectively, to keep track of what's new and what isn't, what has and has not proven useful. The user needs to become a hub and a switch, moving data to allow accurate personalization from one system to another.

Levels of measurement

What a scale actually means, and what we can do with it, depends on what its numbers represent. Numbers can be grouped into 4 types or levels: nominal, ordinal, interval, and ratio. Nominal is the most simple and ratio the most sophisticated. Each level possesses the characteristics of the preceding level, plus an additional quality.

Nominal

Nominal is hardly measurement. It refers to quality more than quantity. A nominal level of measurement is simply a matter of distinguishing by name, e.g., 1 = male, 2 = female. Even though we are using the numbers 1 and 2, they do not denote quantity. The binary categories of 0 and 1 used for computers are a nominal level of measurement. They are categories or classifications. Nominal measurement is like using categorical levels of variables, described in the Doing Scientific Research section of the Introduction module.

Examples: MEAL PREFERENCE: Breakfast, Lunch, Dinner. RELIGIOUS PREFERENCE: 1 = Buddhist, 2 = Muslim, 3 = Christian, 4 = Jewish, 5 = Other. POLITICAL ORIENTATION: Republican, Democratic, Libertarian, Green.

Nominal time of day: categories; no additional information.

Ordinal

Ordinal refers to order in measurement. An ordinal scale indicates direction, in addition to providing nominal information. Low/Medium/High or Faster/Slower are examples of ordinal levels of measurement. Ranking an experience as a nine on a scale of 1 to 10 tells us that it was higher than an experience ranked as a six. Many psychological scales or inventories are at the ordinal level of measurement.

Examples: RANK: 1st place, 2nd place, ..., last place. LEVEL OF AGREEMENT: No, Maybe, Yes. POLITICAL ORIENTATION: Left, Center, Right.

Ordinal time of day: indicates direction or order of occurrence; the spacing between values is uneven.

Interval

Interval scales provide information about order and also possess equal intervals. From the previous example, if we knew that the distance between 1 and 2 was the same as that between 7 and 8 on our 10-point rating scale, then we would have an interval scale. An example of an interval scale is temperature, measured either on a Fahrenheit or a Celsius scale. A degree represents the same underlying amount of heat, regardless of where it occurs on the scale. Measured in Fahrenheit units, the difference between a temperature of 46 and 42 is the same as the difference between 72 and 68. Equal-interval scales of measurement can be devised for opinions and attitudes. Constructing them involves an understanding of mathematical and statistical principles beyond those covered in this course. But it is important to understand the different levels of measurement when using and interpreting scales.

Examples: TIME OF DAY on a 12-hour clock. POLITICAL ORIENTATION: score on a standardized scale of political orientation. OTHER: scales constructed so as to possess equal intervals.

Interval time of day: equal intervals on an analog (12-hr) clock; the difference between 1 and 2 pm is the same as the difference between 11 and 12 am.

Ratio

In addition to possessing the qualities of nominal, ordinal, and interval scales, a ratio scale has an absolute zero (a point where none of the quality being measured exists). Using a ratio scale permits comparisons such as being twice as high, or one-half as much. Reaction time (how long it takes to respond to a signal of some sort) uses a ratio scale of measurement: time. Although an individual's reaction time is always greater than zero, we conceptualize a zero point in time and can state that a response of 24 milliseconds is twice as fast as a response time of 48 milliseconds.

Examples: RULER: inches or centimeters. YEARS of work experience. INCOME: money earned last year. NUMBER of children. GPA: grade point average.

Ratio time of day: 24-hr time has an absolute 0 (midnight); 14 o'clock is twice as far from midnight as 7 o'clock.

Applications

The level of measurement for a particular variable is defined by the highest category that it achieves. For example, categorizing someone as extroverted (outgoing) or introverted (shy) is nominal. If we categorize people as 1 = shy, 2 = neither shy nor outgoing, 3 = outgoing, then we have an ordinal level of measurement. If we use a standardized measure of shyness (and there are such inventories), we would probably assume the shyness variable meets the standards of an interval level of measurement. As to whether or not we might have a ratio scale of shyness: although we might be able to measure zero shyness, it would be difficult to devise a scale where we would be comfortable talking about someone's being 3 times as shy as someone else.

Measurement at the interval or ratio level is desirable because we can use the more powerful statistical procedures available for means and standard deviations. To have this advantage, ordinal data are often treated as though they were interval; for example, subjective rating scales (1 = terrible, 2 = poor, 3 = fair, 4 = good, 5 = excellent). The scale probably does not meet the requirement of equal intervals: we don't know that the difference between 2 (poor) and 3 (fair) is the same as the difference between 4 (good) and 5 (excellent). In order to take advantage of more powerful statistical techniques, researchers often assume that the intervals are equal.

Data Preprocessing

Data have quality if they satisfy the requirements of the intended use. There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.

Imagine that you are a manager at AllElectronics and have been charged with analyzing the company's data with respect to your branch's sales. You immediately set out to perform this task. You carefully inspect the company's database and data warehouse, identifying and selecting the attributes or dimensions (e.g., item, price, and units sold) to be included in your analysis. Alas! You notice that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions. In other words, the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), inaccurate or noisy (containing errors, or values that deviate from the expected), and inconsistent (e.g., containing discrepancies in the department codes used to categorize items). Welcome to the real world!

This scenario illustrates three of the elements defining data quality: accuracy, completeness, and consistency. Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses. There are many possible reasons for inaccurate data (i.e., having incorrect attribute values). The data collection instruments used may be faulty. There may have been human or computer errors occurring at data entry. Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (e.g., by choosing the default value "January 1" displayed for birthday); this is known as disguised missing data. Errors in data transmission can also occur. There may be technology limitations, such as limited buffer size for coordinating synchronized data transfer and consumption. Incorrect data may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g., date). Duplicate tuples also require data cleaning.

Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because they were not considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding, or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the data history or modifications may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.

Recall that data quality depends on the intended use of the data. Two different users may have very different assessments of the quality of a given database. For example, a marketing analyst may need to access the database mentioned before for a list of customer addresses. Some of the addresses are outdated or incorrect, yet overall 80% of the addresses are accurate. The marketing analyst considers this to be a large customer database for target marketing purposes and is pleased with the database's accuracy, although, as sales manager, you found the data inaccurate.

Timeliness also affects data quality. Suppose that you are overseeing the distribution of monthly sales bonuses to the top sales representatives at AllElectronics. Several sales representatives, however, fail to submit their sales records on time at the end of the month. There are also a number of corrections and adjustments that flow in after the month's end. For a period of time following each month, the data stored in the database are incomplete. However, once all of the data are received, they are correct. The fact that the month-end data are not updated in a timely fashion has a negative impact on the data quality.

Two other factors affecting data quality are believability and interpretability. Believability reflects how much the data are trusted by users, while interpretability reflects how easily the data are understood. Suppose that a database at one point had several errors, all of which have since been corrected. The past errors, however, had caused many problems for sales department users, and so they no longer trust the data. The data also use many accounting codes, which the sales department does not know how to interpret. Even though the database is now accurate, complete, consistent, and timely, sales department users may regard it as of low quality due to poor believability and interpretability.

Data Cleaning

Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. In this section, you will study basic methods for data cleaning. Section 3.2.1 looks at ways of handling missing values. Section 3.2.2 explains data smoothing techniques. Section 3.2.3 discusses approaches to data cleaning as a process.

3.2.1 Missing Values

Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded value for several attributes, such as customer income. How can you go about filling in the missing values for this attribute? Let's look at the following methods.

1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably. By ignoring the tuple, we do not make use of the remaining attributes' values in the tuple. Such data could have been useful to the task at hand.

2. Fill in the missing value manually: In general, this approach is time consuming and may not be feasible given a large data set with many missing values.

3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown" or -1. If missing values are replaced by, say, "Unknown", then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, namely "Unknown". Hence, although this method is simple, it is not foolproof.

4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value: Chapter 2 discussed measures of central tendency, which indicate the "middle" value of a data distribution. For normal (symmetric) data distributions the mean can be used, while skewed data distributions should employ the median (Section 2.2). For example, suppose that the data distribution regarding the income of AllElectronics customers is symmetric and that the mean income is $56,000. Use this value to replace the missing value for income (a short sketch of this and the next method follows the list).

5. Use the attribute mean or median for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, we may replace the missing value with the mean income value for customers in the same credit risk category as that of the given tuple. If the data distribution for a given class is skewed, the median value is a better choice.

6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income. Decision trees and Bayesian inference are described in detail in Chapters 8 and 9, respectively, while regression is introduced in Section 3.4.5.
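A sketch of methods 4 and 5 with pandas; the column names, the credit-risk classes, and the toy values are illustrative.

import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low", "high"],
    "income":      [56000, 48000, None, 31000, None, 29000],
})

# Method 4: fill missing income with the overall mean (use the median for skewed data).
overall = df["income"].fillna(df["income"].mean())

# Method 5: fill missing income with the mean of the tuple's own class.
by_class = df.groupby("credit_risk")["income"].transform(lambda s: s.fillna(s.mean()))

print(overall.tolist())
print(by_class.tolist())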

Noisy Data

"What is noise?" Noise is a random error or variance in a measured variable. In Chapter 2, we saw how some basic statistical description techniques (e.g., boxplots and scatter plots) and methods of data visualization can be used to identify outliers, which may represent noise. Given a numeric attribute such as, say, price, how can we "smooth" out the data to remove the noise? The main data smoothing techniques, binning, regression, and outlier analysis, are described earlier in these notes.

Binning Binning methods smooth a sorted data value by consulting its ldquoneighborhoodrdquo that is the values around it The sorted values are distributed into a number of ldquobucketsrdquo or bins Because binning methods consult the neighborhood of valuesthey perform local smoothing Figure 32 illustrates some binning techniques In this example the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (ie each bin contains three values) In smoothing by bin means each value in a bin is replaced by the mean value of the bin For example the mean of the values 4 8 and 15 in Bin 1 is 9 Therefore each original value in this bin is replaced by the value 9

Similarly smoothing by bin medians can be employed in which each bin value is replaced by the bin median In smoothing by bin boundaries the minimum and maximum values in a given bin are identified as the bin boundaries Each bin value is then replaced by the closest boundary value In general the larger the width the greater the effect of the smoothing Alternatively bins may be equal width where the interval range of values in each bin is constant Binning is also used as a discretization technique and is further discussed in Section 35

Regression Data smoothing can also be done by regression a technique that conforms data values to a function Linear regression involves finding the ldquobestrdquo line to fit two attributes (or variables) so that one attribute can be used to predict the other Multiple linear regression is an extension of linear regression where more than two attributes are involved and the data are fit to a multidimensional surface Regression is further described in Section 345

Outlier analysis Outliers may be detected by clustering for example where similar values are organized into groups or ldquoclustersrdquo Intuitively values that fall outside of the set of clusters may be considered outliers (Figure 33) Chapter 12 is dedicated to the topic of outlier analysis

Data Integration

Data mining often requires data integrationmdashthe merging of data from multiple data stores Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set This can help improve the accuracy and speed of the subsequent data mining process

Data Transformation Strategies Overview

In data transformation the data are transformed or consolidated into forms appropriate for mining

Strategies for data transformation include the following

1 Smoothing which works to remove noise from the data Techniques include binning regression and clustering

2 Attribute construction (or feature construction) where new attributes are constructed and added from the given set of attributes to help the mining process

3 Aggregation where summary or aggregation operations are applied to the data For example the daily sales data may be aggregated so as to compute monthly and annual total amounts This step is typically used in constructing a data cube for data analysis at multiple abstraction levels

4 Normalization where the attribute data are scaled so as to fall within a smaller range such as 104857610 to 10 or 00 to 10

5 Discretization where the raw values of a numeric attribute (eg age) are replaced by interval labels (eg 0ndash10 11ndash20 etc) or conceptual labels (eg youth adult senior) The labels in turn can be recursively organized into higher-level concepts resulting in a concept hierarchy for the numeric attribute Figure 312 shows a concept hierarchy for the attribute price More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users

6 Concept hierarchy generation for nominal data where attributes such as street can be generalized to higher-level concepts like city or country Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level

Data Transformation by Normalization

The measurement unit used can affect the data analysis For example changing measurement units from meters to inches for height or from kilograms to pounds for weight may lead to very different results In general expressing an attribute in smaller units will lead to a larger range for that attribute and thus tend to give such an attribute greater effect or ldquoweightrdquo To help avoid dependence on the choice of measurement units the data should be normalized or standardized This involves transforming the data to fall within a smaller or common range such as [-1 1] or [00 10] (The terms standardize and normalize are used interchangeably in data preprocessing although in statistics the latter term also has other connotations)

Normalizing the data attempts to give all attributes an equal weight Normalization is particularly useful for classification algorithms involving neural networks or distance measurements such as nearest-neighbor classification and clustering If using the neural network backpropagation algorithm for classification mining (Chapter 9) normalizing the input values for each attribute measured in the training tuples will help speed up the learning phase For distance-based methods normalization helps prevent attributes with initially large ranges (eg income) from outweighing attributes with initially smaller ranges (eg binary attributes) It is also useful when given no prior knowledge of the data

There are many methods for data normalization We study min-max normalization z-score normalization and normalization by decimal scaling For our discussion let A be a numeric attribute with n observed values v1 v2 vn

Min-max normalization performs a linear transformation on the original data Suppose that minA and maxA are the minimum and maximum values of an attribute A Min-max normalization maps a value vi of A to v0 i in the range [new minAnew maxA] by computing

Min-max normalization preserves the relationships among the original data values It will encounter an ldquoout-of-boundsrdquo error if a future input case for normalization falls outside of the original data range for A

In z-score normalization (or zero-mean normalization) the values for an attribute A are normalized based on the mean (ie average) and standard deviation of A A value vi of A is normalized to v0 i by computing

Decimal scaling Suppose that the recorded values of A range from 1048576986 to 917 The maximum absolute value of A is 986 To normalize by decimal scaling we therefore divide each value by 1000 (ie j D 3) so that 1048576986 normalizes to 10485760986 and 917 normalizes to 0917

Note that normalization can change the original data quite a bit especially when using z-score normalization or decimal scaling It is also necessary to save the normalization parameters (eg the mean and standard deviation if using z-score normalization) so that future data can be normalized in a uniform manner

Discretization by Binning

Binning is a top-down splitting technique based on a specified number of bins Section 322 discussed binning methods for data smoothing These methods are also used as discretization methods for data reduction and concept hierarchy generation For example attribute values can be discretized by applying equal-width or equal-frequency binning and then replacing each bin value by the bin mean or median as in smoothing by bin means or smoothing by bin medians respectively These techniques can be applied recursively to the resulting partitions to generate concept hierarchies Binning does not use class information and is therefore an unsupervised discretization technique It is sensitive to the user-specified number of bins as well as the presence of outliers

DistanceSimilarity Measures

Similarity measure of how close to each other two instances are The ldquocloserrdquo the instances are to each other the larger is the similarity value

Dissimilarity measure of how different two instances are Dissimilarity is large when instances are very different and is small when they are close

Proximity refers to either similarity or dissimilarity

Distance metric a measure of dissimilarity that obeys the following laws (laws of triangular norm)

Conversion of similarity and dissimilarity measures

Typically given a similarity measure one can ldquorevertrdquo it to serve as the dissimilarity measure and vice versa

Conversions may differ Eg if d is a distance measure one can use

as the corresponding similarity measure If s is the similarity measure that ranges between 0 and 1 (so called degree of similarity) then the corresponding dissimilarity measure can be defined as

In general any monotonically decreasing transformation can be applied to convert similarity measures into dissimilarity measures and any monotonically increasing transformtaion can be applied to convert the measures the other way around

Distance Metrics for Numeric Attributes

When the data set is presented in a standard form each instance can be treated as a vector macrx = (x1 xN) of measures for attributes numbered 1 N

Consider for now only non-nominal scales

Computing User Similarity

A critical design decision in implementing userndashuser CF is the choice of similarity function Several different similarity functions have been proposed and evaluated in the literature Pearson correlation This method computes the statistical correlation (Pearsonrsquos r) between two userrsquos common ratings to determine their similarity GroupLens and BellCore both used this method The correlation is computed by the following

Click here Collaborative Filtering Recommender Systems

Clustering

Click here Applied Multivariate Statistical Analysis ebook

Model Based Clustering ndash Click Here

Expectation Maximization - Click Here

UV-Decomposition ndash Click Here

SVD Method ndash Click Here

Page 9: Recomender System Notes

paper we will refer to user and STB to identify the same entity The user-disambiguation problem has been partially solved by separating the collected information according to the time slot they refer to For instance we can roughly assume the following pattern housewives use to watch TV during the morning children during the afternoon the whole family at evening while only adults watch TV during the night By means of this simple time slot distinction we are able to distinguish among different potential users of the same STB

Formally the available information has been structured into two main matrices practically stored into a relational database the item-content matrix (ICM) and the user-rating matrix (URM)

The former describes the principal characteristics (metadata) of each item In the following we will refer to the item-content matrix as W whose elements wci represent the relevance of characteristic (metadata) c for item i The ICM is generated from the analysis of the set of information given by the content provider (ie the EPG) Such information concerns for instance the title of a movie the actors the director(s) the genre(s) and the plot Note that in a real environment we can face with inaccurate information especially because of the rate new content is added every day The information provided by the ICM is used to generate a content-based recommendation after being filtered by means of techniques for PoS (Part-of-Speech) tagging stop words removal and latent semantic analysis Moreover the ICM can be used to perform some kind of processing on the items (eg parental control)

The URM represents the ratings (ie preferences) of users about items In the following we will refer to such matrix as R whose elements rpi represent the rating of user p about item i Such preferences constitute the basic information for any collaborative algorithm The user rating can be either explicit or implicit according to the fact that the ratings are explicitly expressed by users or are implicitly collected by the system respectively

Explicit ratings confidently represent the user opinion even though they can be affected by biases due to user subjectivity item popularity or global rating tendency The first bias depends on arbitrary interpretations of the rating scale For instance in a rating scale between 1 and 5 some user could use the value 3 to indicate an interesting item someone else could use 3 for a not much interesting item Similarly popular items tend to be overrated while unpopular items are usually underrated Finally explicit ratings can be affected by global attitudes (eg users are more willing to rate movies they like)

On the other hand implicit ratings are inferred by the system on the basis of the user-system interaction which might not match the user opinion For instance the system is able to monitor whether a user has watched a live program on a certain channel or whether the user has uninterruptedly watched a movie Despite explicit ratings are more reliable than implicit ratings in representing the actual user interest towards an item their collection can be annoying from the userrsquos perspective

Long Tail

In statistics a long tail of some distributions of numbers is the portion of the distribution having a large number of occurrences far from the head or central part of the distribution The distribution could involve popularities random numbers of occurrences of events with various probabilities etc A probability distribution is said to have a long tail if a larger share of population rests within its tail than would under a normal distribution A long-tail distribution will arise with the inclusion of many values unusually far from the mean which increase the magnitude of the skewness of the distributionA long-tailed distribution is a particular type of heavy-tailed distribution The distribution and inventory costs of businesses successfully applying this strategy allow them to realize significant profit out of selling small volumes of hard-to-find items to many customers instead of only

selling large volumes of a reduced number of popular items The total sales of this large number of non-hit items is called the long tail

Personalization

Today personalization is something that occurs separately within each system that one interacts with Recommender systems are one technique for personalization in essence the personalization occurs slowly as each system builds up information about your likes and dislikes about what interests you and what fails to interest you There are numerous other personalization techniques most of these rely either on collection of system usage history which is then employed to change the behavior of the system or on the user taking the time and trouble to explicitly personalize the behavior of the system in various ways by setting parameters making selections or engaging in dialogs with the system

There are several problems with this model at least from the users point of view Investment in personalizing one system (either through explicit action or just long use) are not transferable to another system (Of course from the system operators point of view this may be very desirable it increases switching costs for users and thus helps lock in a user base) Information such as likes and dislikes or usage patterns are scattered across multiple systems and cant be combined to obtain maximum leverage And the user does not have control of the information bases that define his or her profile If you want to buy books from multiple online booksellers this is annoying But if we are concerned with developing information discovery systems to assist users in a world of information overload these problems are critical People obtain information from a multiplicity of sources and personalization has to happen close to the end user this is the only place where there is enough information to do personalization effectively to keep track of whats new and what isnt what has and has not proven useful The user needs to become anhub and a switch moving data to allow accurate personalization from one system to another

Levels of measurement

What a scale actually means and what we can do with it depends on what its numbers represent Numbers can be grouped into 4 types or levels nominal ordinal interval and ratio Nominal is the most simple and ratio the most sophisticated Each level possesses the characteristics of the preceding level plus an additional quality

Nominal

Nominal is hardly measurement It refers to quality more than quantity A nominal level of measurement is simply a matter of distinguishing by name eg 1 = male 2 = female Even though we are using the numbers 1 and 2 they do not denote quantity The binary category of 0 and 1 used for computers is a nominal level of measurement They are categories or classifications Nominal measurement is like using categorical levels of variables described in the Doing Scientific Research section of the Introduction module

Examples MEAL PREFERENCE Breakfast Lunch Dinner RELIGIOUS PREFERENCE 1 = Buddhist 2 = Muslim 3 = Christian 4 = Jewish 5 =

Other POLITICAL ORIENTATION Republican Democratic Libertarian Green

Nominal time of day - categories no additional information

Ordinal

Ordinal refers to order in measurement An ordinal scale indicates direction in addition to providing nominal information LowMediumHigh or FasterSlower are examples of ordinal levels of measurement Ranking an experience as a nine on a scale of 1 to 10 tells us that it was higher than an experience ranked as a six Many psychological scales or inventories are at the ordinal level of measurement

Examples RANK 1st place 2nd place last place LEVEL OF AGREEMENT No Maybe Yes POLITICAL ORIENTATION Left Center Right

Ordinal time of day - indicates direction or order of occurrence spacing between is uneven

Interval

Interval scales provide information about order and also possess equal intervals From the previous example if we knew that the distance between 1 and 2 was the same as that between 7 and 8 on our 10-point rating scale then we would have an interval scale An example of an interval scale is temperature either measured on a Fahrenheit or Celsius scale A degree represents the same underlying amount of heat regardless of where it occurs on the scale Measured in Fahrenheit units the difference between a temperature of 46 and 42 is the same as the difference between 72 and 68 Equal-interval scales of measurement can be devised for opinions and attitudes Constructing them involves an understanding of mathematical and statistical principles beyond those covered in this course But it is important to understand the different levels of measurement when using and interpreting scales

Examples TIME OF DAY on a 12-hour clock POLITICAL ORIENTATION Score on standardized scale of political orientation OTHER scales constructed so as to possess equal intervals

Interval time of day - equal intervals analog (12-hr) clock difference between 1 and 2 pm is same as difference between 11 and 12 am

Ratio

In addition to possessing the qualities of nominal ordinal and interval scales a ratio scale has an absolute zero (a point where none of the quality being measured exists) Using a ratio scale permits comparisons such as being twice as high or one-half as much Reaction time (how long it takes to respond to a signal of some sort) uses a ratio scale of measurement -- time Although an individuals reaction time is always greater than zero we conceptualize a zero point in time and can state that a response of 24 milliseconds is twice as fast as a response time of 48 milliseconds

Examples RULER inches or centimeters YEARS of work experience INCOME money earned last year NUMBER of children GPA grade point average

Ratio - 24-hr time has an absolute 0 (midnight) 14 oclock is twice as long from midnight as 7 oclock

Applications

The level of measurement for a particular variable is defined by the highest category that it achieves For example categorizing someone as extroverted (outgoing) or introverted (shy) is nominal If we categorize people 1 = shy 2 = neither shy nor outgoing 3 = outgoing then we have an ordinal level of measurement If we use a standardized measure of shyness (and there are such inventories) we would probably assume the shyness variable meets the standards of an interval level of measurement As to whether or not we might have a ratio scale of shyness although we might be able to measure zero shyness it would be difficult to devise a scale where we would be comfortable talking about someones being 3 times as shy as someone else

Measurement at the interval or ratio level is desirable because we can use the more powerful statistical procedures available for Means and Standard Deviations To have this advantage often ordinal data are treated as though they were interval for example subjective ratings scales (1 = terrible 2= poor 3 = fair 4 = good 5 = excellent) The scale probably does not meet the requirement of equal intervals -- we dont know that the difference between 2 (poor) and 3 (fair) is the same as the difference between 4 (good) and 5 (excellent) In order to take advantage of more powerful statistical techniques researchers often assume that the intervals are equal

Data Preprocessing

Data have quality if they satisfy the requirements of the intended use There are many factors comprising data quality including accuracy completeness consistency timeliness believability and interpretability

Imagine that you are a manager at AllElectronics and have been charged with analyzing the companyrsquos data with respect to your branchrsquos sales You immediately set out to perform this task You carefully inspect the companyrsquos database and data warehouse identifying and selecting the attributes or dimensions (eg item price and units sold) to be included in your analysis Alas You notice that several of the attributes for various tuples have no recorded value For your analysis you would like to include information as to whether each item purchased was advertised as on sale yet you discover that this information has not been recorded Furthermore users of your database system have reported errors unusual values and inconsistencies in the data recorded for some transactions In other words the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest or containing only aggregate data) inaccurate or noisy (containing errors or values that deviate from the expected) and inconsistent (eg containing discrepancies in the department codes used to categorize items)Welcome to the real world

This scenario illustrates three of the elements defining data quality accuracy completeness and consistency Inaccurate incomplete and inconsistent data are commonplace properties of large real-world databases and data warehouses There are many possible reasons for inaccurate data (ie having incorrect attribute values) The data collection instruments used may be faulty There may have been human or computer errors occurring at data entry Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (eg by choosing the default value ldquoJanuary 1rdquo displayed for birthday) This is known as disguised missing data Errors in data transmission can also occur There may be technology limitations such as limited buffer size for coordinating synchronized data transfer and consumption Incorrect data may also result from inconsistencies in naming conventions or data codes or inconsistent formats for input fields (eg date) Duplicate tuples also require data cleaning

Incomplete data can occur for a number of reasons Attributes of interest may not always be available such as customer information for sales transaction data Other data may not be included simply because they were not considered important at the time of entry Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions Data that were inconsistent with other recorded data may have been deleted Furthermore the recording of the data history or modifications may have been overlookedMissing data particularly for tuples with missing values for some attributes may need to be inferred

Recall that data quality depends on the intended use of the data Two different users may have very different assessments of the quality of a given database For example a marketing analyst may need to access the database mentioned before for a list of customer addresses Some of the addresses are outdated or incorrect yet overall 80 of the addresses are accurate The marketing analyst considers this to be a large customer database for target marketing purposes and is pleased with the databasersquos accuracy although as sales manager you found the data inaccurate

Timeliness also affects data quality Suppose that you are overseeing the distribution of monthly sales bonuses to the top sales representatives at AllElectronics Several sales representatives however fail to submit their sales records on time at the end of the month There are also a number of corrections and adjustments that flow in after the monthrsquos end For a period of time following each month the data stored in the database are incomplete However once all of the data are received it is correct The fact that the month-end data are not updated in a timely fashion has a negative impact on the data quality

Two other factors affecting data quality are believability and interpretability Believability reflects how much the data are trusted by users while interpretability reflects how easy the data are understood Suppose that a database at one point had several errors all of which have since been corrected The past errors however had caused many problems for sales department users and so they no longer trust the data The data also use many accounting codes which the sales department does not know how to interpret Even though the database is now accurate complete consistent and timely sales department users may regard it as of low quality due to poor believability and interpretability

Data Cleaning

Real-world data tend to be incomplete noisy and inconsistent Data cleaning (or data cleansing) routines attempt to fill in missing values smooth out noise while identifying outliers and correct inconsistencies in the data In this section you will study basic methods for data cleaning Section 321 looks at ways of handling missing values Section 322 explains data smoothing techniques Section 323 discusses approaches to data cleaning as a process

321 Missing Values

Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded value for several attributes, such as customer income. How can you go about filling in the missing values for this attribute? Let's look at the following methods (a short code sketch illustrating methods 4 and 5 follows the list).

1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably. By ignoring the tuple, we do not make use of the remaining attributes' values in the tuple. Such data could have been useful to the task at hand.

2. Fill in the missing value manually: In general, this approach is time consuming and may not be feasible given a large data set with many missing values.

3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown" or a special value such as −∞. If missing values are replaced by, say, "Unknown," then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of "Unknown." Hence, although this method is simple, it is not foolproof.

4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value: Chapter 2 discussed measures of central tendency, which indicate the "middle" value of a data distribution. For normal (symmetric) data distributions, the mean can be used, while skewed data distributions should employ the median (Section 2.2). For example, suppose that the data distribution regarding the income of AllElectronics customers is symmetric and that the mean income is $56,000. Use this value to replace the missing value for income.

5. Use the attribute mean or median for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, we may replace the missing value with the mean income value for customers in the same credit risk category as that of the given tuple. If the data distribution for a given class is skewed, the median value is a better choice.

6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income. Decision trees and Bayesian inference are described in detail in Chapters 8 and 9, respectively, while regression is introduced in Section 3.4.5.
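As a minimal sketch of methods 4 and 5 above, the following pandas snippet fills missing income values with an overall mean and with a per-class median; the table, the credit_risk class attribute, and all values are hypothetical, not taken from the notes.

import numpy as np
import pandas as pd

# Hypothetical customer data with missing income values.
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [52000.0, np.nan, 31000.0, np.nan, 60000.0],
})

# Method 4: fill with a global measure of central tendency (here the mean).
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# Method 5: fill with the median income of tuples in the same credit_risk class.
df["income_class_filled"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("median")
)

print(df)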

Noisy Data

"What is noise?" Noise is a random error or variance in a measured variable. In Chapter 2, we saw how some basic statistical description techniques (e.g., boxplots and scatter plots) and methods of data visualization can be used to identify outliers, which may represent noise. Given a numeric attribute such as, say, price, how can we "smooth" out the data to remove the noise? Let's look at the following data smoothing techniques.

Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it. The sorted values are distributed into a number of "buckets," or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. Figure 3.2 illustrates some binning techniques. In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9.

Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value. In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equal width, where the interval range of values in each bin is constant. Binning is also used as a discretization technique and is further discussed in Section 3.5.
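A small Python sketch of equal-frequency binning with smoothing by bin means and by bin boundaries; the sorted price list is illustrative, chosen so that Bin 1 contains the values 4, 8, and 15 mentioned above.

# Illustrative sorted price data, partitioned into equal-frequency bins of size 3.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bin_size = 3
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

# Smoothing by bin means: every value in a bin becomes the bin mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value becomes the closer of min(b) and max(b).
by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins
]

print(bins)           # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]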

Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function. Linear regression involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface. Regression is further described in Section 3.4.5.
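A minimal NumPy sketch of smoothing by simple linear regression; the second attribute ("quantity") and all values are invented for illustration.

import numpy as np

# Fit the "best" straight line predicting price from a second attribute,
# then replace the observed prices by the fitted (smoothed) values.
quantity = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
price = np.array([10.2, 11.9, 14.1, 16.0, 17.8, 20.3])

slope, intercept = np.polyfit(quantity, price, deg=1)
smoothed_price = slope * quantity + intercept

print(np.round(smoothed_price, 2))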

Outlier analysis: Outliers may be detected by clustering, for example, where similar values are organized into groups, or "clusters." Intuitively, values that fall outside of the set of clusters may be considered outliers (Figure 3.3). Chapter 12 is dedicated to the topic of outlier analysis.

Data Integration

Data mining often requires data integration, the merging of data from multiple data stores. Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help improve the accuracy and speed of the subsequent data mining process.

Data Transformation Strategies Overview

In data transformation, the data are transformed or consolidated into forms appropriate for mining.

Strategies for data transformation include the following:

1. Smoothing, which works to remove noise from the data. Techniques include binning, regression, and clustering.

2. Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.

3. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for data analysis at multiple abstraction levels. (A short aggregation sketch follows this list.)

4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.

5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute. Figure 3.12 shows a concept hierarchy for the attribute price. More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users.

6. Concept hierarchy generation for nominal data, where attributes such as street can be generalized to higher-level concepts, like city or country. Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level.
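As a brief illustration of strategy 3 (aggregation), the pandas sketch below rolls hypothetical daily sales up into monthly and annual totals; the data and column names are invented for the example.

import pandas as pd

# Hypothetical daily sales records.
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "sales": range(90),
})

monthly = daily.resample("MS", on="date")["sales"].sum()  # monthly totals
annual = daily.resample("YS", on="date")["sales"].sum()   # annual totals

print(monthly)
print(annual)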

Data Transformation by Normalization

The measurement unit used can affect the data analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results. In general, expressing an attribute in smaller units will lead to a larger range for that attribute, and thus tend to give such an attribute greater effect or "weight." To help avoid dependence on the choice of measurement units, the data should be normalized or standardized. This involves transforming the data to fall within a smaller or common range, such as [−1, 1] or [0.0, 1.0]. (The terms standardize and normalize are used interchangeably in data preprocessing, although in statistics, the latter term also has other connotations.)

Normalizing the data attempts to give all attributes an equal weight. Normalization is particularly useful for classification algorithms involving neural networks or distance measurements, such as nearest-neighbor classification and clustering. If using the neural network backpropagation algorithm for classification mining (Chapter 9), normalizing the input values for each attribute measured in the training tuples will help speed up the learning phase. For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes). It is also useful when given no prior knowledge of the data.

There are many methods for data normalization. We study min-max normalization, z-score normalization, and normalization by decimal scaling. For our discussion, let A be a numeric attribute with n observed values v1, v2, ..., vn.

Min-max normalization performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of an attribute A. Min-max normalization maps a value vi of A to vi′ in the range [new_minA, new_maxA] by computing:
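The formula itself is missing here; written in LaTeX, the standard min-max mapping consistent with the definitions above is

v_i' = \frac{v_i - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A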

Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for A.

In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean (i.e., average) and standard deviation of A. A value vi of A is normalized to vi′ by computing:
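The formula is again missing; the usual z-score mapping is

v_i' = \frac{v_i - \bar{A}}{\sigma_A}

where \bar{A} and \sigma_A denote the mean and standard deviation of the values of A.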

Decimal scaling: Suppose that the recorded values of A range from −986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3), so that −986 normalizes to −0.986 and 917 normalizes to 0.917.

Note that normalization can change the original data quite a bit, especially when using z-score normalization or decimal scaling. It is also necessary to save the normalization parameters (e.g., the mean and standard deviation if using z-score normalization) so that future data can be normalized in a uniform manner.
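A compact NumPy sketch of the three normalization methods just described; the sample values echo the decimal-scaling range above (−986 to 917), and the middle value is invented.

import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    v = np.asarray(v, dtype=float)
    j = int(np.ceil(np.log10(np.abs(v).max())))  # j = 3 for this data, so divide by 1,000
    return v / 10 ** j

values = np.array([-986.0, 200.0, 917.0])
print(np.round(min_max(values), 3))   # rescaled to [0, 1]
print(np.round(z_score(values), 3))   # zero mean, unit standard deviation
print(decimal_scaling(values))        # [-0.986  0.2  0.917]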

Discretization by Binning

Binning is a top-down splitting technique based on a specified number of bins. Section 3.2.2 discussed binning methods for data smoothing. These methods are also used as discretization methods for data reduction and concept hierarchy generation. For example, attribute values can be discretized by applying equal-width or equal-frequency binning, and then replacing each bin value by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively. These techniques can be applied recursively to the resulting partitions to generate concept hierarchies. Binning does not use class information and is therefore an unsupervised discretization technique. It is sensitive to the user-specified number of bins, as well as the presence of outliers.

Distance/Similarity Measures

Similarity: a measure of how close to each other two instances are. The "closer" the instances are to each other, the larger the similarity value.

Dissimilarity: a measure of how different two instances are. Dissimilarity is large when the instances are very different and small when they are close.

Proximity: refers to either similarity or dissimilarity.

Distance metric: a measure of dissimilarity that obeys the following laws (the laws of the triangular norm).
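The laws themselves are not reproduced above; presumably the standard metric axioms are meant. For all instances x, y, and z:

1. d(x, y) >= 0 (non-negativity)
2. d(x, y) = 0 if and only if x = y (identity)
3. d(x, y) = d(y, x) (symmetry)
4. d(x, z) <= d(x, y) + d(y, z) (triangle inequality)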

Conversion of similarity and dissimilarity measures

Typically, given a similarity measure, one can "revert" it to serve as the dissimilarity measure, and vice versa.

Conversions may differ. E.g., if d is a distance measure, one can use

as the corresponding similarity measure. If s is a similarity measure that ranges between 0 and 1 (the so-called degree of similarity), then the corresponding dissimilarity measure can be defined as
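The two conversion formulas are not shown above; common choices are s = 1 / (1 + d) or s = e^{-d} for turning a distance d into a similarity, and d = 1 - s for turning a degree of similarity s in [0, 1] into a dissimilarity.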

In general, any monotonically decreasing transformation can be applied to convert similarity measures into dissimilarity measures, and, likewise, any monotonically decreasing transformation can be applied to convert the measures the other way around.

Distance Metrics for Numeric Attributes

When the data set is presented in a standard form, each instance can be treated as a vector x = (x1, ..., xN) of measures for attributes numbered 1, ..., N.

Consider, for now, only non-nominal scales.
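For such vectors, the Minkowski family of distances is the usual choice (Manhattan distance for p = 1, Euclidean for p = 2); a small self-contained sketch with made-up vectors:

# Minkowski distance between two numeric instances x and y.
def minkowski(x, y, p=2):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x = (1.0, 3.0, 5.0)
y = (4.0, 1.0, 2.0)
print(minkowski(x, y, p=1))  # Manhattan: 3 + 2 + 3 = 8.0
print(minkowski(x, y, p=2))  # Euclidean: sqrt(9 + 4 + 9) ~ 4.69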

Computing User Similarity

A critical design decision in implementing user-user CF is the choice of similarity function. Several different similarity functions have been proposed and evaluated in the literature. Pearson correlation: this method computes the statistical correlation (Pearson's r) between two users' common ratings to determine their similarity. GroupLens and BellCore both used this method. The correlation is computed as follows:
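The formula did not survive extraction; the standard Pearson form over the set I_uv of items rated by both users u and v (with the means usually taken over those co-rated items, though conventions vary) is

sim(u, v) = \frac{\sum_{i \in I_{uv}} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_{i \in I_{uv}} (r_{u,i} - \bar{r}_u)^2}\;\sqrt{\sum_{i \in I_{uv}} (r_{v,i} - \bar{r}_v)^2}}

A short Python sketch of this computation (the rating dictionaries are invented, and this is only an illustration, not the GroupLens or BellCore implementation):

def pearson_similarity(ratings_u, ratings_v):
    # ratings_u, ratings_v: dicts mapping item id -> rating for two users.
    common = set(ratings_u) & set(ratings_v)   # items rated by both users
    if len(common) < 2:
        return 0.0
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den_u = sum((ratings_u[i] - mean_u) ** 2 for i in common) ** 0.5
    den_v = sum((ratings_v[i] - mean_v) ** 2 for i in common) ** 0.5
    if den_u == 0.0 or den_v == 0.0:
        return 0.0
    return num / (den_u * den_v)

alice = {"item1": 5, "item2": 3, "item3": 4, "item4": 4}
bob = {"item1": 3, "item2": 1, "item3": 2, "item4": 3}
print(round(pearson_similarity(alice, bob), 3))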

Click here: Collaborative Filtering Recommender Systems

Clustering

Click here: Applied Multivariate Statistical Analysis ebook

Model Based Clustering - Click Here

Expectation Maximization - Click Here

UV-Decomposition - Click Here

SVD Method - Click Here

The Long Tail

Physical retailers concentrate on selling large volumes of a reduced number of popular items, whereas online vendors can also offer a very large number of less popular, non-hit items. The total sales of this large number of non-hit items is called the long tail.

Personalization

Today, personalization is something that occurs separately within each system that one interacts with. Recommender systems are one technique for personalization; in essence, the personalization occurs slowly as each system builds up information about your likes and dislikes, about what interests you and what fails to interest you. There are numerous other personalization techniques; most of these rely either on the collection of system usage history, which is then employed to change the behavior of the system, or on the user taking the time and trouble to explicitly personalize the behavior of the system in various ways, by setting parameters, making selections, or engaging in dialogs with the system.

There are several problems with this model, at least from the user's point of view. Investments in personalizing one system (either through explicit action or just long use) are not transferable to another system. (Of course, from the system operator's point of view this may be very desirable: it increases switching costs for users and thus helps lock in a user base.) Information such as likes and dislikes or usage patterns is scattered across multiple systems and can't be combined to obtain maximum leverage. And the user does not have control of the information bases that define his or her profile. If you want to buy books from multiple online booksellers, this is annoying. But if we are concerned with developing information discovery systems to assist users in a world of information overload, these problems are critical. People obtain information from a multiplicity of sources, and personalization has to happen close to the end user: this is the only place where there is enough information to do personalization effectively, to keep track of what's new and what isn't, what has and has not proven useful. The user needs to become a hub and a switch, moving data to allow accurate personalization from one system to another.

Levels of measurement

What a scale actually means and what we can do with it depends on what its numbers represent. Numbers can be grouped into four types, or levels: nominal, ordinal, interval, and ratio. Nominal is the simplest and ratio the most sophisticated. Each level possesses the characteristics of the preceding level, plus an additional quality.

Nominal

Nominal is hardly measurement. It refers to quality more than quantity. A nominal level of measurement is simply a matter of distinguishing by name, e.g., 1 = male, 2 = female. Even though we are using the numbers 1 and 2, they do not denote quantity. The binary categories of 0 and 1 used for computers are a nominal level of measurement. They are categories or classifications. Nominal measurement is like using the categorical levels of variables described in the Doing Scientific Research section of the Introduction module.

Examples: MEAL PREFERENCE: Breakfast, Lunch, Dinner. RELIGIOUS PREFERENCE: 1 = Buddhist, 2 = Muslim, 3 = Christian, 4 = Jewish, 5 = Other. POLITICAL ORIENTATION: Republican, Democratic, Libertarian, Green.

Nominal time of day: categories, with no additional information.

Ordinal

Ordinal refers to order in measurement. An ordinal scale indicates direction, in addition to providing nominal information. Low/Medium/High or Faster/Slower are examples of ordinal levels of measurement. Ranking an experience as a nine on a scale of 1 to 10 tells us that it was higher than an experience ranked as a six. Many psychological scales or inventories are at the ordinal level of measurement.

Examples: RANK: 1st place, 2nd place, ..., last place. LEVEL OF AGREEMENT: No, Maybe, Yes. POLITICAL ORIENTATION: Left, Center, Right.

Ordinal time of day: indicates direction or order of occurrence; the spacing between values is uneven.

Interval

Interval scales provide information about order and also possess equal intervals. From the previous example, if we knew that the distance between 1 and 2 was the same as that between 7 and 8 on our 10-point rating scale, then we would have an interval scale. An example of an interval scale is temperature, measured on either a Fahrenheit or a Celsius scale. A degree represents the same underlying amount of heat, regardless of where it occurs on the scale. Measured in Fahrenheit units, the difference between a temperature of 46 and 42 is the same as the difference between 72 and 68. Equal-interval scales of measurement can be devised for opinions and attitudes. Constructing them involves an understanding of mathematical and statistical principles beyond those covered in this course, but it is important to understand the different levels of measurement when using and interpreting scales.

Examples: TIME OF DAY on a 12-hour clock. POLITICAL ORIENTATION: score on a standardized scale of political orientation. OTHER: scales constructed so as to possess equal intervals.

Interval time of day: equal intervals on an analog (12-hr) clock; the difference between 1 and 2 pm is the same as the difference between 11 am and 12 pm.

Ratio

In addition to possessing the qualities of nominal, ordinal, and interval scales, a ratio scale has an absolute zero (a point where none of the quality being measured exists). Using a ratio scale permits comparisons such as being twice as high, or one-half as much. Reaction time (how long it takes to respond to a signal of some sort) uses a ratio scale of measurement: time. Although an individual's reaction time is always greater than zero, we conceptualize a zero point in time and can state that a response of 24 milliseconds is twice as fast as a response time of 48 milliseconds.

Examples: RULER: inches or centimeters. YEARS of work experience. INCOME: money earned last year. NUMBER of children. GPA: grade point average.

Ratio time of day: 24-hr time has an absolute 0 (midnight); 14 o'clock is twice as far from midnight as 7 o'clock.

Applications

The level of measurement for a particular variable is defined by the highest category that it achieves. For example, categorizing someone as extroverted (outgoing) or introverted (shy) is nominal. If we categorize people as 1 = shy, 2 = neither shy nor outgoing, 3 = outgoing, then we have an ordinal level of measurement. If we use a standardized measure of shyness (and there are such inventories), we would probably assume that the shyness variable meets the standards of an interval level of measurement. As to whether or not we might have a ratio scale of shyness: although we might be able to measure zero shyness, it would be difficult to devise a scale where we would be comfortable talking about someone's being three times as shy as someone else.

Measurement at the interval or ratio level is desirable because we can use the more powerful statistical procedures available for means and standard deviations. To have this advantage, ordinal data are often treated as though they were interval, for example, subjective rating scales (1 = terrible, 2 = poor, 3 = fair, 4 = good, 5 = excellent). The scale probably does not meet the requirement of equal intervals: we don't know that the difference between 2 (poor) and 3 (fair) is the same as the difference between 4 (good) and 5 (excellent). In order to take advantage of more powerful statistical techniques, researchers often assume that the intervals are equal.

Data Preprocessing

Data have quality if they satisfy the requirements of the intended use. There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.

Imagine that you are a manager at AllElectronics and have been charged with analyzing the company's data with respect to your branch's sales. You immediately set out to perform this task. You carefully inspect the company's database and data warehouse, identifying and selecting the attributes or dimensions (e.g., item, price, and units sold) to be included in your analysis. Alas! You notice that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions. In other words, the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), inaccurate or noisy (containing errors, or values that deviate from the expected), and inconsistent (e.g., containing discrepancies in the department codes used to categorize items). Welcome to the real world!

This scenario illustrates three of the elements defining data quality: accuracy, completeness, and consistency. Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses. There are many possible reasons for inaccurate data (i.e., having incorrect attribute values). The data collection instruments used may be faulty. There may have been human or computer errors occurring at data entry. Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (e.g., by choosing the default value "January 1" displayed for birthday). This is known as disguised missing data. Errors in data transmission can also occur. There may be technology limitations, such as limited buffer size for coordinating synchronized data transfer and consumption. Incorrect data may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g., date). Duplicate tuples also require data cleaning.

Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because they were not considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the data history or modifications may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.

Recall that data quality depends on the intended use of the data Two different users may have very different assessments of the quality of a given database For example a marketing analyst may need to access the database mentioned before for a list of customer addresses Some of the addresses are outdated or incorrect yet overall 80 of the addresses are accurate The marketing analyst considers this to be a large customer database for target marketing purposes and is pleased with the databasersquos accuracy although as sales manager you found the data inaccurate

Timeliness also affects data quality Suppose that you are overseeing the distribution of monthly sales bonuses to the top sales representatives at AllElectronics Several sales representatives however fail to submit their sales records on time at the end of the month There are also a number of corrections and adjustments that flow in after the monthrsquos end For a period of time following each month the data stored in the database are incomplete However once all of the data are received it is correct The fact that the month-end data are not updated in a timely fashion has a negative impact on the data quality

Two other factors affecting data quality are believability and interpretability Believability reflects how much the data are trusted by users while interpretability reflects how easy the data are understood Suppose that a database at one point had several errors all of which have since been corrected The past errors however had caused many problems for sales department users and so they no longer trust the data The data also use many accounting codes which the sales department does not know how to interpret Even though the database is now accurate complete consistent and timely sales department users may regard it as of low quality due to poor believability and interpretability

Data Cleaning

Real-world data tend to be incomplete noisy and inconsistent Data cleaning (or data cleansing) routines attempt to fill in missing values smooth out noise while identifying outliers and correct inconsistencies in the data In this section you will study basic methods for data cleaning Section 321 looks at ways of handling missing values Section 322 explains data smoothing techniques Section 323 discusses approaches to data cleaning as a process

321 Missing Values

Imagine that you need to analyze AllElectronics sales and customer data You note that many tuples have no recorded value for several attributes such as customer income How can you go about filling in the missing values for this attribute Letrsquos look at the following methods

1 Ignore the tuple This is usually done when the class label is missing (assuming the mining task involves classification) This method is not very effective unless the tuple contains several attributes with missing values It is especially poor when the percentage of missing values per attribute varies considerably By ignoring the tuple we do not make use of the remaining attributesrsquo values in the tuple Such data could have been useful to the task at hand

2 Fill in the missing value manually In general this approach is time consuming and may not be feasible given a large data set with many missing values

3 Use a global constant to fill in the missing value Replace all missing attribute values by the same constant such as a label like ldquoUnknownrdquo or 10485761 If missing values are replaced by say ldquoUnknownrdquo then the mining program may mistakenly think that they form an interesting concept since they all have a value in commonmdashthat of ldquoUnknownrdquo Hence although this method is simple it is not foolproof

4 Use a measure of central tendency for the attribute (eg the mean or median) to fill in the missing value Chapter 2 discussed measures of central tendency which indicate the ldquomiddlerdquo value of a data distribution For normal (symmetric) data distributions the mean can be used while skewed data distribution should employ the median (Section 22) For example suppose that the data distribution regarding the income of AllElectronics customers is symmetric and that the mean income is $56000 Use this value to replace the missing value for income

5 Use the attribute mean or median for all samples belonging to the same class as the given tuple For example if classifying customers according to credit risk we may replace the missing value with the mean income value for customers in the same credit risk category as that of the given tuple If the data distribution for a given class is skewed the median value is a better choice

6 Use the most probable value to fill in the missing value This may be determined with regression inference-based tools using a Bayesian formalism or decision tree induction For example using the other customer attributes in your data set you may construct a decision tree to predict the missing values for income Decision trees and Bayesian inference are described in detail in Chapters 8 and 9 respectively while regression is introduced in Section 345

Noisy Data

ldquoWhat is noiserdquo Noise is a random error or variance in a measured variable In Chapter 2 we saw how some basic statistical description techniques (eg boxplots and scatter plots) and methods of data visualization can be used to identify outliers which may represent noise Given a numeric attribute such as say price how can we ldquosmoothrdquo out the data to remove the noise Letrsquos look at the following data smoothing techniques

Binning Binning methods smooth a sorted data value by consulting its ldquoneighborhoodrdquo that is the values around it The sorted values are distributed into a number of ldquobucketsrdquo or bins Because binning methods consult the neighborhood of valuesthey perform local smoothing Figure 32 illustrates some binning techniques In this example the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (ie each bin contains three values) In smoothing by bin means each value in a bin is replaced by the mean value of the bin For example the mean of the values 4 8 and 15 in Bin 1 is 9 Therefore each original value in this bin is replaced by the value 9

Similarly smoothing by bin medians can be employed in which each bin value is replaced by the bin median In smoothing by bin boundaries the minimum and maximum values in a given bin are identified as the bin boundaries Each bin value is then replaced by the closest boundary value In general the larger the width the greater the effect of the smoothing Alternatively bins may be equal width where the interval range of values in each bin is constant Binning is also used as a discretization technique and is further discussed in Section 35

Regression Data smoothing can also be done by regression a technique that conforms data values to a function Linear regression involves finding the ldquobestrdquo line to fit two attributes (or variables) so that one attribute can be used to predict the other Multiple linear regression is an extension of linear regression where more than two attributes are involved and the data are fit to a multidimensional surface Regression is further described in Section 345

Outlier analysis Outliers may be detected by clustering for example where similar values are organized into groups or ldquoclustersrdquo Intuitively values that fall outside of the set of clusters may be considered outliers (Figure 33) Chapter 12 is dedicated to the topic of outlier analysis

Data Integration

Data mining often requires data integrationmdashthe merging of data from multiple data stores Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set This can help improve the accuracy and speed of the subsequent data mining process

Data Transformation Strategies Overview

In data transformation the data are transformed or consolidated into forms appropriate for mining

Strategies for data transformation include the following

1 Smoothing which works to remove noise from the data Techniques include binning regression and clustering

2 Attribute construction (or feature construction) where new attributes are constructed and added from the given set of attributes to help the mining process

3 Aggregation where summary or aggregation operations are applied to the data For example the daily sales data may be aggregated so as to compute monthly and annual total amounts This step is typically used in constructing a data cube for data analysis at multiple abstraction levels

4 Normalization where the attribute data are scaled so as to fall within a smaller range such as 104857610 to 10 or 00 to 10

5 Discretization where the raw values of a numeric attribute (eg age) are replaced by interval labels (eg 0ndash10 11ndash20 etc) or conceptual labels (eg youth adult senior) The labels in turn can be recursively organized into higher-level concepts resulting in a concept hierarchy for the numeric attribute Figure 312 shows a concept hierarchy for the attribute price More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users

6 Concept hierarchy generation for nominal data where attributes such as street can be generalized to higher-level concepts like city or country Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level

Data Transformation by Normalization

The measurement unit used can affect the data analysis For example changing measurement units from meters to inches for height or from kilograms to pounds for weight may lead to very different results In general expressing an attribute in smaller units will lead to a larger range for that attribute and thus tend to give such an attribute greater effect or ldquoweightrdquo To help avoid dependence on the choice of measurement units the data should be normalized or standardized This involves transforming the data to fall within a smaller or common range such as [-1 1] or [00 10] (The terms standardize and normalize are used interchangeably in data preprocessing although in statistics the latter term also has other connotations)

Normalizing the data attempts to give all attributes an equal weight Normalization is particularly useful for classification algorithms involving neural networks or distance measurements such as nearest-neighbor classification and clustering If using the neural network backpropagation algorithm for classification mining (Chapter 9) normalizing the input values for each attribute measured in the training tuples will help speed up the learning phase For distance-based methods normalization helps prevent attributes with initially large ranges (eg income) from outweighing attributes with initially smaller ranges (eg binary attributes) It is also useful when given no prior knowledge of the data

There are many methods for data normalization We study min-max normalization z-score normalization and normalization by decimal scaling For our discussion let A be a numeric attribute with n observed values v1 v2 vn

Min-max normalization performs a linear transformation on the original data Suppose that minA and maxA are the minimum and maximum values of an attribute A Min-max normalization maps a value vi of A to v0 i in the range [new minAnew maxA] by computing

Min-max normalization preserves the relationships among the original data values It will encounter an ldquoout-of-boundsrdquo error if a future input case for normalization falls outside of the original data range for A

In z-score normalization (or zero-mean normalization) the values for an attribute A are normalized based on the mean (ie average) and standard deviation of A A value vi of A is normalized to v0 i by computing

Decimal scaling Suppose that the recorded values of A range from 1048576986 to 917 The maximum absolute value of A is 986 To normalize by decimal scaling we therefore divide each value by 1000 (ie j D 3) so that 1048576986 normalizes to 10485760986 and 917 normalizes to 0917

Note that normalization can change the original data quite a bit especially when using z-score normalization or decimal scaling It is also necessary to save the normalization parameters (eg the mean and standard deviation if using z-score normalization) so that future data can be normalized in a uniform manner

Discretization by Binning

Binning is a top-down splitting technique based on a specified number of bins Section 322 discussed binning methods for data smoothing These methods are also used as discretization methods for data reduction and concept hierarchy generation For example attribute values can be discretized by applying equal-width or equal-frequency binning and then replacing each bin value by the bin mean or median as in smoothing by bin means or smoothing by bin medians respectively These techniques can be applied recursively to the resulting partitions to generate concept hierarchies Binning does not use class information and is therefore an unsupervised discretization technique It is sensitive to the user-specified number of bins as well as the presence of outliers

DistanceSimilarity Measures

Similarity measure of how close to each other two instances are The ldquocloserrdquo the instances are to each other the larger is the similarity value

Dissimilarity measure of how different two instances are Dissimilarity is large when instances are very different and is small when they are close

Proximity refers to either similarity or dissimilarity

Distance metric a measure of dissimilarity that obeys the following laws (laws of triangular norm)

Conversion of similarity and dissimilarity measures

Typically given a similarity measure one can ldquorevertrdquo it to serve as the dissimilarity measure and vice versa

Conversions may differ Eg if d is a distance measure one can use

as the corresponding similarity measure If s is the similarity measure that ranges between 0 and 1 (so called degree of similarity) then the corresponding dissimilarity measure can be defined as

In general any monotonically decreasing transformation can be applied to convert similarity measures into dissimilarity measures and any monotonically increasing transformtaion can be applied to convert the measures the other way around

Distance Metrics for Numeric Attributes

When the data set is presented in a standard form each instance can be treated as a vector macrx = (x1 xN) of measures for attributes numbered 1 N

Consider for now only non-nominal scales

Computing User Similarity

A critical design decision in implementing userndashuser CF is the choice of similarity function Several different similarity functions have been proposed and evaluated in the literature Pearson correlation This method computes the statistical correlation (Pearsonrsquos r) between two userrsquos common ratings to determine their similarity GroupLens and BellCore both used this method The correlation is computed by the following

Click here Collaborative Filtering Recommender Systems

Clustering

Click here Applied Multivariate Statistical Analysis ebook

Model Based Clustering ndash Click Here

Expectation Maximization - Click Here

UV-Decomposition ndash Click Here

SVD Method ndash Click Here

Page 11: Recomender System Notes

Nominal is hardly measurement It refers to quality more than quantity A nominal level of measurement is simply a matter of distinguishing by name eg 1 = male 2 = female Even though we are using the numbers 1 and 2 they do not denote quantity The binary category of 0 and 1 used for computers is a nominal level of measurement They are categories or classifications Nominal measurement is like using categorical levels of variables described in the Doing Scientific Research section of the Introduction module

Examples MEAL PREFERENCE Breakfast Lunch Dinner RELIGIOUS PREFERENCE 1 = Buddhist 2 = Muslim 3 = Christian 4 = Jewish 5 =

Other POLITICAL ORIENTATION Republican Democratic Libertarian Green

Nominal time of day - categories no additional information

Ordinal

Ordinal refers to order in measurement An ordinal scale indicates direction in addition to providing nominal information LowMediumHigh or FasterSlower are examples of ordinal levels of measurement Ranking an experience as a nine on a scale of 1 to 10 tells us that it was higher than an experience ranked as a six Many psychological scales or inventories are at the ordinal level of measurement

Examples RANK 1st place 2nd place last place LEVEL OF AGREEMENT No Maybe Yes POLITICAL ORIENTATION Left Center Right

Ordinal time of day - indicates direction or order of occurrence spacing between is uneven

Interval

Interval scales provide information about order and also possess equal intervals From the previous example if we knew that the distance between 1 and 2 was the same as that between 7 and 8 on our 10-point rating scale then we would have an interval scale An example of an interval scale is temperature either measured on a Fahrenheit or Celsius scale A degree represents the same underlying amount of heat regardless of where it occurs on the scale Measured in Fahrenheit units the difference between a temperature of 46 and 42 is the same as the difference between 72 and 68 Equal-interval scales of measurement can be devised for opinions and attitudes Constructing them involves an understanding of mathematical and statistical principles beyond those covered in this course But it is important to understand the different levels of measurement when using and interpreting scales

Examples TIME OF DAY on a 12-hour clock POLITICAL ORIENTATION Score on standardized scale of political orientation OTHER scales constructed so as to possess equal intervals

Interval time of day - equal intervals analog (12-hr) clock difference between 1 and 2 pm is same as difference between 11 and 12 am

Ratio

In addition to possessing the qualities of nominal ordinal and interval scales a ratio scale has an absolute zero (a point where none of the quality being measured exists) Using a ratio scale permits comparisons such as being twice as high or one-half as much Reaction time (how long it takes to respond to a signal of some sort) uses a ratio scale of measurement -- time Although an individuals reaction time is always greater than zero we conceptualize a zero point in time and can state that a response of 24 milliseconds is twice as fast as a response time of 48 milliseconds

Examples RULER inches or centimeters YEARS of work experience INCOME money earned last year NUMBER of children GPA grade point average

Ratio - 24-hr time has an absolute 0 (midnight) 14 oclock is twice as long from midnight as 7 oclock

Applications

The level of measurement for a particular variable is defined by the highest category that it achieves For example categorizing someone as extroverted (outgoing) or introverted (shy) is nominal If we categorize people 1 = shy 2 = neither shy nor outgoing 3 = outgoing then we have an ordinal level of measurement If we use a standardized measure of shyness (and there are such inventories) we would probably assume the shyness variable meets the standards of an interval level of measurement As to whether or not we might have a ratio scale of shyness although we might be able to measure zero shyness it would be difficult to devise a scale where we would be comfortable talking about someones being 3 times as shy as someone else

Measurement at the interval or ratio level is desirable because we can use the more powerful statistical procedures available for Means and Standard Deviations To have this advantage often ordinal data are treated as though they were interval for example subjective ratings scales (1 = terrible 2= poor 3 = fair 4 = good 5 = excellent) The scale probably does not meet the requirement of equal intervals -- we dont know that the difference between 2 (poor) and 3 (fair) is the same as the difference between 4 (good) and 5 (excellent) In order to take advantage of more powerful statistical techniques researchers often assume that the intervals are equal

Data Preprocessing

Data have quality if they satisfy the requirements of the intended use There are many factors comprising data quality including accuracy completeness consistency timeliness believability and interpretability

Imagine that you are a manager at AllElectronics and have been charged with analyzing the companyrsquos data with respect to your branchrsquos sales You immediately set out to perform this task You carefully inspect the companyrsquos database and data warehouse identifying and selecting the attributes or dimensions (eg item price and units sold) to be included in your analysis Alas You notice that several of the attributes for various tuples have no recorded value For your analysis you would like to include information as to whether each item purchased was advertised as on sale yet you discover that this information has not been recorded Furthermore users of your database system have reported errors unusual values and inconsistencies in the data recorded for some transactions In other words the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest or containing only aggregate data) inaccurate or noisy (containing errors or values that deviate from the expected) and inconsistent (eg containing discrepancies in the department codes used to categorize items)Welcome to the real world

This scenario illustrates three of the elements defining data quality accuracy completeness and consistency Inaccurate incomplete and inconsistent data are commonplace properties of large real-world databases and data warehouses There are many possible reasons for inaccurate data (ie having incorrect attribute values) The data collection instruments used may be faulty There may have been human or computer errors occurring at data entry Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (eg by choosing the default value ldquoJanuary 1rdquo displayed for birthday) This is known as disguised missing data Errors in data transmission can also occur There may be technology limitations such as limited buffer size for coordinating synchronized data transfer and consumption Incorrect data may also result from inconsistencies in naming conventions or data codes or inconsistent formats for input fields (eg date) Duplicate tuples also require data cleaning

Incomplete data can occur for a number of reasons Attributes of interest may not always be available such as customer information for sales transaction data Other data may not be included simply because they were not considered important at the time of entry Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions Data that were inconsistent with other recorded data may have been deleted Furthermore the recording of the data history or modifications may have been overlookedMissing data particularly for tuples with missing values for some attributes may need to be inferred

Recall that data quality depends on the intended use of the data Two different users may have very different assessments of the quality of a given database For example a marketing analyst may need to access the database mentioned before for a list of customer addresses Some of the addresses are outdated or incorrect yet overall 80 of the addresses are accurate The marketing analyst considers this to be a large customer database for target marketing purposes and is pleased with the databasersquos accuracy although as sales manager you found the data inaccurate

Timeliness also affects data quality Suppose that you are overseeing the distribution of monthly sales bonuses to the top sales representatives at AllElectronics Several sales representatives however fail to submit their sales records on time at the end of the month There are also a number of corrections and adjustments that flow in after the monthrsquos end For a period of time following each month the data stored in the database are incomplete However once all of the data are received it is correct The fact that the month-end data are not updated in a timely fashion has a negative impact on the data quality

Two other factors affecting data quality are believability and interpretability Believability reflects how much the data are trusted by users while interpretability reflects how easy the data are understood Suppose that a database at one point had several errors all of which have since been corrected The past errors however had caused many problems for sales department users and so they no longer trust the data The data also use many accounting codes which the sales department does not know how to interpret Even though the database is now accurate complete consistent and timely sales department users may regard it as of low quality due to poor believability and interpretability

Data Cleaning

Real-world data tend to be incomplete noisy and inconsistent Data cleaning (or data cleansing) routines attempt to fill in missing values smooth out noise while identifying outliers and correct inconsistencies in the data In this section you will study basic methods for data cleaning Section 321 looks at ways of handling missing values Section 322 explains data smoothing techniques Section 323 discusses approaches to data cleaning as a process

321 Missing Values

Imagine that you need to analyze AllElectronics sales and customer data You note that many tuples have no recorded value for several attributes such as customer income How can you go about filling in the missing values for this attribute Letrsquos look at the following methods

1 Ignore the tuple This is usually done when the class label is missing (assuming the mining task involves classification) This method is not very effective unless the tuple contains several attributes with missing values It is especially poor when the percentage of missing values per attribute varies considerably By ignoring the tuple we do not make use of the remaining attributesrsquo values in the tuple Such data could have been useful to the task at hand

2 Fill in the missing value manually In general this approach is time consuming and may not be feasible given a large data set with many missing values

3 Use a global constant to fill in the missing value Replace all missing attribute values by the same constant such as a label like ldquoUnknownrdquo or 10485761 If missing values are replaced by say ldquoUnknownrdquo then the mining program may mistakenly think that they form an interesting concept since they all have a value in commonmdashthat of ldquoUnknownrdquo Hence although this method is simple it is not foolproof

4 Use a measure of central tendency for the attribute (eg the mean or median) to fill in the missing value Chapter 2 discussed measures of central tendency which indicate the ldquomiddlerdquo value of a data distribution For normal (symmetric) data distributions the mean can be used while skewed data distribution should employ the median (Section 22) For example suppose that the data distribution regarding the income of AllElectronics customers is symmetric and that the mean income is $56000 Use this value to replace the missing value for income

5 Use the attribute mean or median for all samples belonging to the same class as the given tuple For example if classifying customers according to credit risk we may replace the missing value with the mean income value for customers in the same credit risk category as that of the given tuple If the data distribution for a given class is skewed the median value is a better choice

6 Use the most probable value to fill in the missing value This may be determined with regression inference-based tools using a Bayesian formalism or decision tree induction For example using the other customer attributes in your data set you may construct a decision tree to predict the missing values for income Decision trees and Bayesian inference are described in detail in Chapters 8 and 9 respectively while regression is introduced in Section 345

Noisy Data

ldquoWhat is noiserdquo Noise is a random error or variance in a measured variable In Chapter 2 we saw how some basic statistical description techniques (eg boxplots and scatter plots) and methods of data visualization can be used to identify outliers which may represent noise Given a numeric attribute such as say price how can we ldquosmoothrdquo out the data to remove the noise Letrsquos look at the following data smoothing techniques

Binning Binning methods smooth a sorted data value by consulting its ldquoneighborhoodrdquo that is the values around it The sorted values are distributed into a number of ldquobucketsrdquo or bins Because binning methods consult the neighborhood of valuesthey perform local smoothing Figure 32 illustrates some binning techniques In this example the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (ie each bin contains three values) In smoothing by bin means each value in a bin is replaced by the mean value of the bin For example the mean of the values 4 8 and 15 in Bin 1 is 9 Therefore each original value in this bin is replaced by the value 9

Similarly smoothing by bin medians can be employed in which each bin value is replaced by the bin median In smoothing by bin boundaries the minimum and maximum values in a given bin are identified as the bin boundaries Each bin value is then replaced by the closest boundary value In general the larger the width the greater the effect of the smoothing Alternatively bins may be equal width where the interval range of values in each bin is constant Binning is also used as a discretization technique and is further discussed in Section 35

Regression Data smoothing can also be done by regression a technique that conforms data values to a function Linear regression involves finding the ldquobestrdquo line to fit two attributes (or variables) so that one attribute can be used to predict the other Multiple linear regression is an extension of linear regression where more than two attributes are involved and the data are fit to a multidimensional surface Regression is further described in Section 345

Outlier analysis Outliers may be detected by clustering for example where similar values are organized into groups or ldquoclustersrdquo Intuitively values that fall outside of the set of clusters may be considered outliers (Figure 33) Chapter 12 is dedicated to the topic of outlier analysis

Data Integration

Data mining often requires data integrationmdashthe merging of data from multiple data stores Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set This can help improve the accuracy and speed of the subsequent data mining process

Data Transformation Strategies Overview

In data transformation the data are transformed or consolidated into forms appropriate for mining

Strategies for data transformation include the following

1 Smoothing which works to remove noise from the data Techniques include binning regression and clustering

2 Attribute construction (or feature construction) where new attributes are constructed and added from the given set of attributes to help the mining process

3 Aggregation where summary or aggregation operations are applied to the data For example the daily sales data may be aggregated so as to compute monthly and annual total amounts This step is typically used in constructing a data cube for data analysis at multiple abstraction levels

4 Normalization where the attribute data are scaled so as to fall within a smaller range such as 104857610 to 10 or 00 to 10

5 Discretization where the raw values of a numeric attribute (eg age) are replaced by interval labels (eg 0ndash10 11ndash20 etc) or conceptual labels (eg youth adult senior) The labels in turn can be recursively organized into higher-level concepts resulting in a concept hierarchy for the numeric attribute Figure 312 shows a concept hierarchy for the attribute price More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users

6 Concept hierarchy generation for nominal data where attributes such as street can be generalized to higher-level concepts like city or country Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level

Data Transformation by Normalization

The measurement unit used can affect the data analysis For example changing measurement units from meters to inches for height or from kilograms to pounds for weight may lead to very different results In general expressing an attribute in smaller units will lead to a larger range for that attribute and thus tend to give such an attribute greater effect or ldquoweightrdquo To help avoid dependence on the choice of measurement units the data should be normalized or standardized This involves transforming the data to fall within a smaller or common range such as [-1 1] or [00 10] (The terms standardize and normalize are used interchangeably in data preprocessing although in statistics the latter term also has other connotations)

Normalizing the data attempts to give all attributes an equal weight Normalization is particularly useful for classification algorithms involving neural networks or distance measurements such as nearest-neighbor classification and clustering If using the neural network backpropagation algorithm for classification mining (Chapter 9) normalizing the input values for each attribute measured in the training tuples will help speed up the learning phase For distance-based methods normalization helps prevent attributes with initially large ranges (eg income) from outweighing attributes with initially smaller ranges (eg binary attributes) It is also useful when given no prior knowledge of the data

There are many methods for data normalization. We study min-max normalization, z-score normalization, and normalization by decimal scaling. For our discussion, let A be a numeric attribute with n observed values, v1, v2, ..., vn.

Min-max normalization performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of an attribute A. Min-max normalization maps a value vi of A to v'i in the range [new_minA, new_maxA] by computing v'i = ((vi - minA) / (maxA - minA)) (new_maxA - new_minA) + new_minA.

Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for A.

In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean (i.e., average) and standard deviation of A. A value vi of A is normalized to v'i by computing v'i = (vi - Ā) / σA, where Ā and σA are the mean and standard deviation, respectively, of A.

Decimal scaling normalizes by moving the decimal point of the values of A: v'i = vi / 10^j, where j is the smallest integer such that max(|v'i|) < 1. Suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.

Note that normalization can change the original data quite a bit, especially when using z-score normalization or decimal scaling. It is also necessary to save the normalization parameters (e.g., the mean and standard deviation if using z-score normalization) so that future data can be normalized in a uniform manner.
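
A minimal NumPy sketch of the three normalization methods, using made-up values for A; the variable names are illustrative.

```python
import numpy as np

# Made-up observed values of a numeric attribute A.
values = np.array([-986.0, 200.0, 450.0, 917.0])

# Min-max normalization to the new range [0.0, 1.0].
new_min, new_max = 0.0, 1.0
minmax = (values - values.min()) / (values.max() - values.min()) * (new_max - new_min) + new_min

# Z-score normalization: subtract the mean of A and divide by its standard deviation.
zscore = (values - values.mean()) / values.std()

# Decimal scaling: divide by 10**j, with j the smallest integer
# such that the largest absolute scaled value is below 1.
j = int(np.floor(np.log10(np.abs(values).max()))) + 1
decimal_scaled = values / 10 ** j

print(minmax)
print(zscore)          # keep the mean and std so future data can be normalized uniformly
print(decimal_scaled)  # -986 -> -0.986, 917 -> 0.917
```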

Discretization by Binning

Binning is a top-down splitting technique based on a specified number of bins. Section 3.2.2 discussed binning methods for data smoothing. These methods are also used as discretization methods for data reduction and concept hierarchy generation. For example, attribute values can be discretized by applying equal-width or equal-frequency binning, and then replacing each bin value by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively. These techniques can be applied recursively to the resulting partitions to generate concept hierarchies. Binning does not use class information and is therefore an unsupervised discretization technique. It is sensitive to the user-specified number of bins, as well as the presence of outliers.
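
A small sketch of unsupervised discretization by binning, assuming a made-up age attribute; the number of bins and the values are illustrative.

```python
import numpy as np

# Made-up raw values of a numeric attribute (age).
age = np.array([13, 15, 16, 19, 20, 21, 25, 30, 33, 35, 40, 45, 52, 70])
k = 3  # user-specified number of bins

# Equal-width binning: split the overall range into k intervals of equal width.
width_edges = np.linspace(age.min(), age.max(), k + 1)
width_bin = np.digitize(age, width_edges[1:-1])          # bin index for each value

# Equal-frequency binning: each bin receives (roughly) the same number of values.
freq_edges = np.quantile(age, np.linspace(0, 1, k + 1))
freq_bin = np.digitize(age, freq_edges[1:-1])

# Replace raw values by interval labels derived from the equal-width edges.
labels = [f"{width_edges[b]:.0f}-{width_edges[b + 1]:.0f}" for b in width_bin]
print(list(zip(age, labels)))
```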

Distance/Similarity Measures

Similarity: a measure of how close to each other two instances are. The "closer" the instances are to each other, the larger the similarity value.

Dissimilarity: a measure of how different two instances are. Dissimilarity is large when instances are very different, and small when they are close.

Proximity refers to either similarity or dissimilarity.

Distance metric: a measure of dissimilarity d that obeys the following laws (the metric axioms): non-negativity, d(x, y) ≥ 0; identity, d(x, y) = 0 if and only if x = y; symmetry, d(x, y) = d(y, x); and the triangle inequality, d(x, z) ≤ d(x, y) + d(y, z).

Conversion of similarity and dissimilarity measures

Typically, given a similarity measure, one can "revert" it to serve as the dissimilarity measure, and vice versa.

Conversions may differ. For example, if d is a distance measure, one can use s = 1 / (1 + d) as the corresponding similarity measure. If s is a similarity measure that ranges between 0 and 1 (the so-called degree of similarity), then the corresponding dissimilarity measure can be defined as d = 1 - s.

In general, any monotonically decreasing transformation can be applied to convert a similarity measure into a dissimilarity measure, and likewise to convert a dissimilarity measure back into a similarity measure.

Distance Metrics for Numeric Attributes

When the data set is presented in a standard form, each instance can be treated as a vector x̄ = (x1, ..., xN) of measures for attributes numbered 1, ..., N.

Consider, for now, only non-nominal scales.
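
The notes do not spell out the metrics at this point; for numeric vectors, the usual choices are Euclidean, Manhattan, and, more generally, Minkowski distance. A small sketch under that assumption:

```python
import numpy as np

def minkowski(x, y, p=2.0):
    """Minkowski distance between two numeric vectors.
    p = 2 gives Euclidean distance, p = 1 gives Manhattan (city-block) distance."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

a = (1.0, 2.0, 3.0)
b = (4.0, 0.0, 3.0)
print(minkowski(a, b, p=2))  # Euclidean
print(minkowski(a, b, p=1))  # Manhattan
```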

Computing User Similarity

A critical design decision in implementing user-user CF is the choice of similarity function. Several different similarity functions have been proposed and evaluated in the literature. Pearson correlation: this method computes the statistical correlation (Pearson's r) between two users' common ratings to determine their similarity. GroupLens and BellCore both used this method. The correlation is computed as follows:
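
In user-user CF, Pearson's r for users u and v is typically computed over the items they have both rated: r(u, v) = Σ(r_u,i - r̄_u)(r_v,i - r̄_v) / (sqrt(Σ(r_u,i - r̄_u)²) · sqrt(Σ(r_v,i - r̄_v)²)), with the sums running over the co-rated items. Below is a minimal sketch; the means are taken over the co-rated items (some formulations use each user's overall mean rating instead), and the function and data names are illustrative.

```python
import numpy as np

def pearson_user_similarity(ratings_u, ratings_v):
    """Pearson correlation between two users over their co-rated items.
    ratings_u and ratings_v are dicts mapping item id -> rating."""
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0  # not enough co-rated items to correlate
    u = np.array([ratings_u[i] for i in common], dtype=float)
    v = np.array([ratings_v[i] for i in common], dtype=float)
    u_c, v_c = u - u.mean(), v - v.mean()
    denom = np.sqrt((u_c ** 2).sum() * (v_c ** 2).sum())
    return float((u_c * v_c).sum() / denom) if denom > 0 else 0.0

alice = {"item1": 5, "item2": 3, "item3": 4, "item4": 4}
bob = {"item1": 3, "item2": 1, "item3": 2, "item4": 3}
print(pearson_user_similarity(alice, bob))
```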

Further reading (linked in the original notes): Collaborative Filtering Recommender Systems.

Clustering

Further reading (linked in the original notes): Applied Multivariate Statistical Analysis (ebook); Model-Based Clustering; Expectation Maximization; UV-Decomposition; SVD Method.

Ratio

In addition to possessing the qualities of nominal, ordinal, and interval scales, a ratio scale has an absolute zero (a point where none of the quality being measured exists). Using a ratio scale permits comparisons such as being twice as high, or one-half as much. Reaction time (how long it takes to respond to a signal of some sort) uses a ratio scale of measurement -- time. Although an individual's reaction time is always greater than zero, we conceptualize a zero point in time and can state that a response of 24 milliseconds is twice as fast as a response time of 48 milliseconds.

Examples: RULER (inches or centimeters), YEARS of work experience, INCOME (money earned last year), NUMBER of children, GPA (grade point average).

Ratio: 24-hr time has an absolute 0 (midnight); 14 o'clock is twice as far from midnight as 7 o'clock.

Applications

The level of measurement for a particular variable is defined by the highest category that it achieves. For example, categorizing someone as extroverted (outgoing) or introverted (shy) is nominal. If we categorize people as 1 = shy, 2 = neither shy nor outgoing, 3 = outgoing, then we have an ordinal level of measurement. If we use a standardized measure of shyness (and there are such inventories), we would probably assume the shyness variable meets the standards of an interval level of measurement. As to whether or not we might have a ratio scale of shyness: although we might be able to measure zero shyness, it would be difficult to devise a scale where we would be comfortable talking about someone's being 3 times as shy as someone else.

Measurement at the interval or ratio level is desirable because we can use the more powerful statistical procedures available for means and standard deviations. To have this advantage, often ordinal data are treated as though they were interval; for example, subjective rating scales (1 = terrible, 2 = poor, 3 = fair, 4 = good, 5 = excellent). The scale probably does not meet the requirement of equal intervals -- we don't know that the difference between 2 (poor) and 3 (fair) is the same as the difference between 4 (good) and 5 (excellent). In order to take advantage of more powerful statistical techniques, researchers often assume that the intervals are equal.

Data Preprocessing

Data have quality if they satisfy the requirements of the intended use. There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.

Imagine that you are a manager at AllElectronics and have been charged with analyzing the company's data with respect to your branch's sales. You immediately set out to perform this task. You carefully inspect the company's database and data warehouse, identifying and selecting the attributes or dimensions (e.g., item, price, and units sold) to be included in your analysis. Alas! You notice that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions. In other words, the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), inaccurate or noisy (containing errors, or values that deviate from the expected), and inconsistent (e.g., containing discrepancies in the department codes used to categorize items). Welcome to the real world!

This scenario illustrates three of the elements defining data quality: accuracy, completeness, and consistency. Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses. There are many possible reasons for inaccurate data (i.e., having incorrect attribute values). The data collection instruments used may be faulty. There may have been human or computer errors occurring at data entry. Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (e.g., by choosing the default value "January 1" displayed for birthday); this is known as disguised missing data. Errors in data transmission can also occur. There may be technology limitations, such as limited buffer size for coordinating synchronized data transfer and consumption. Incorrect data may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g., date). Duplicate tuples also require data cleaning.

Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because they were not considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding, or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the data history or modifications may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.

Recall that data quality depends on the intended use of the data. Two different users may have very different assessments of the quality of a given database. For example, a marketing analyst may need to access the database mentioned before for a list of customer addresses. Some of the addresses are outdated or incorrect, yet overall, 80% of the addresses are accurate. The marketing analyst considers this to be a large customer database for target marketing purposes and is pleased with the database's accuracy, although, as sales manager, you found the data inaccurate.

Timeliness also affects data quality. Suppose that you are overseeing the distribution of monthly sales bonuses to the top sales representatives at AllElectronics. Several sales representatives, however, fail to submit their sales records on time at the end of the month. There are also a number of corrections and adjustments that flow in after the month's end. For a period of time following each month, the data stored in the database are incomplete. However, once all of the data are received, they are correct. The fact that the month-end data are not updated in a timely fashion has a negative impact on the data quality.

Two other factors affecting data quality are believability and interpretability. Believability reflects how much the data are trusted by users, while interpretability reflects how easily the data are understood. Suppose that a database, at one point, had several errors, all of which have since been corrected. The past errors, however, had caused many problems for sales department users, and so they no longer trust the data. The data also use many accounting codes, which the sales department does not know how to interpret. Even though the database is now accurate, complete, consistent, and timely, sales department users may regard it as of low quality due to poor believability and interpretability.

Data Cleaning

Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. In this section, you will study basic methods for data cleaning. Section 3.2.1 looks at ways of handling missing values. Section 3.2.2 explains data smoothing techniques. Section 3.2.3 discusses approaches to data cleaning as a process.

3.2.1 Missing Values

Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded value for several attributes, such as customer income. How can you go about filling in the missing values for this attribute? Let's look at the following methods.

1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably. By ignoring the tuple, we do not make use of the remaining attributes' values in the tuple. Such data could have been useful to the task at hand.

2. Fill in the missing value manually: In general, this approach is time consuming and may not be feasible given a large data set with many missing values.

3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown" or -∞. If missing values are replaced by, say, "Unknown", then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common: that of "Unknown". Hence, although this method is simple, it is not foolproof.

4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value: Chapter 2 discussed measures of central tendency, which indicate the "middle" value of a data distribution. For normal (symmetric) data distributions, the mean can be used, while skewed data distributions should employ the median (Section 2.2). For example, suppose that the data distribution regarding the income of AllElectronics customers is symmetric and that the mean income is $56,000. Use this value to replace the missing value for income (see the sketch after this list).

5. Use the attribute mean or median for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, we may replace the missing value with the mean income value for customers in the same credit risk category as that of the given tuple. If the data distribution for a given class is skewed, the median value is a better choice.

6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income. Decision trees and Bayesian inference are described in detail in Chapters 8 and 9, respectively, while regression is introduced in Section 3.4.5.
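
A small pandas sketch of methods 4 and 5, assuming a hypothetical customer table with a partly missing income column; the column names and values are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical customer tuples with some missing income values.
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [56000.0, np.nan, 30000.0, np.nan, 64000.0],
})

# Method 4: fill with a global measure of central tendency (mean or median).
df["income_filled_mean"] = df["income"].fillna(df["income"].mean())

# Method 5: fill with the mean income of tuples in the same class
# (here, the same credit_risk category).
class_means = df.groupby("credit_risk")["income"].transform("mean")
df["income_filled_by_class"] = df["income"].fillna(class_means)

print(df)
```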

Noisy Data

"What is noise?" Noise is a random error or variance in a measured variable. In Chapter 2, we saw how some basic statistical description techniques (e.g., boxplots and scatter plots) and methods of data visualization can be used to identify outliers, which may represent noise. Given a numeric attribute such as, say, price, how can we "smooth" out the data to remove the noise? Let's look at the following data smoothing techniques.

Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it. The sorted values are distributed into a number of "buckets," or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. Figure 3.2 illustrates some binning techniques. In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9.

Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value. In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equal width, where the interval range of values in each bin is constant. Binning is also used as a discretization technique and is further discussed in Section 3.5.
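
A sketch of smoothing by bin means and by bin boundaries with equal-frequency bins of size 3; only the values 4, 8, and 15 are quoted in the text, so the remaining price values below are illustrative.

```python
import numpy as np

# Sorted price data: 4, 8 and 15 appear in the text, the rest are made up.
price = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)
bins = price.reshape(-1, 3)          # equal-frequency bins of size 3

# Smoothing by bin means: each value becomes its bin's mean.
by_means = np.repeat(bins.mean(axis=1), 3)

# Smoothing by bin boundaries: each value snaps to the nearer of its
# bin's minimum or maximum value.
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_boundaries = np.where(bins - lo <= hi - bins, lo, hi).ravel()

print(by_means)        # the first bin (4, 8, 15) becomes 9, 9, 9
print(by_boundaries)
```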

Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function. Linear regression involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface. Regression is further described in Section 3.4.5.
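
A minimal sketch of smoothing by simple linear regression with NumPy, using made-up paired attribute values: fit the best straight line and replace the noisy values by the fitted ones.

```python
import numpy as np

# Made-up paired attribute values (x predicts y).
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# Fit the "best" line y = a*x + b by least squares.
a, b = np.polyfit(x, y, deg=1)

# Smoothed values: replace each observed y by the value predicted by the line.
y_smoothed = a * x + b
print(a, b)
print(y_smoothed)
```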

Outlier analysis: Outliers may be detected by clustering, for example, where similar values are organized into groups, or "clusters." Intuitively, values that fall outside of the set of clusters may be considered outliers (Figure 3.3). Chapter 12 is dedicated to the topic of outlier analysis.
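
A small sketch of clustering-based outlier detection; DBSCAN is used here as one possible clustering method (the notes do not name a specific one), and the values, eps, and min_samples are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up one-dimensional values with one far-away point.
values = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34, 250], dtype=float).reshape(-1, 1)

# DBSCAN groups nearby values into clusters; points that belong to no
# cluster are labelled -1 and can be flagged as potential outliers.
labels = DBSCAN(eps=10.0, min_samples=2).fit_predict(values)
outliers = values.ravel()[labels == -1]
print(outliers)  # -> [250.]
```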

Data Integration

Data mining often requires data integration: the merging of data from multiple data stores. Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help improve the accuracy and speed of the subsequent data mining process.

Data Transformation Strategies Overview

In data transformation, the data are transformed or consolidated into forms appropriate for mining.

Strategies for data transformation include the following:

1. Smoothing, which works to remove noise from the data. Techniques include binning, regression, and clustering.

2. Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process (see the sketch after this list).

3.–6. Aggregation, normalization, discretization, and concept hierarchy generation for nominal data, as described in detail earlier in these notes.
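
A one-line pandas sketch of strategy 2 (attribute construction), deriving a new attribute from existing ones; the table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical table of plots with width and height attributes.
plots = pd.DataFrame({"width_m": [10.0, 12.5, 8.0], "height_m": [20.0, 16.0, 30.0]})

# Construct a new attribute, area, from the given attributes.
plots["area_sq_m"] = plots["width_m"] * plots["height_m"]
print(plots)
```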

Data Transformation by Normalization

The measurement unit used can affect the data analysis For example changing measurement units from meters to inches for height or from kilograms to pounds for weight may lead to very different results In general expressing an attribute in smaller units will lead to a larger range for that attribute and thus tend to give such an attribute greater effect or ldquoweightrdquo To help avoid dependence on the choice of measurement units the data should be normalized or standardized This involves transforming the data to fall within a smaller or common range such as [-1 1] or [00 10] (The terms standardize and normalize are used interchangeably in data preprocessing although in statistics the latter term also has other connotations)

Normalizing the data attempts to give all attributes an equal weight Normalization is particularly useful for classification algorithms involving neural networks or distance measurements such as nearest-neighbor classification and clustering If using the neural network backpropagation algorithm for classification mining (Chapter 9) normalizing the input values for each attribute measured in the training tuples will help speed up the learning phase For distance-based methods normalization helps prevent attributes with initially large ranges (eg income) from outweighing attributes with initially smaller ranges (eg binary attributes) It is also useful when given no prior knowledge of the data

There are many methods for data normalization We study min-max normalization z-score normalization and normalization by decimal scaling For our discussion let A be a numeric attribute with n observed values v1 v2 vn

Min-max normalization performs a linear transformation on the original data Suppose that minA and maxA are the minimum and maximum values of an attribute A Min-max normalization maps a value vi of A to v0 i in the range [new minAnew maxA] by computing

Min-max normalization preserves the relationships among the original data values It will encounter an ldquoout-of-boundsrdquo error if a future input case for normalization falls outside of the original data range for A

In z-score normalization (or zero-mean normalization) the values for an attribute A are normalized based on the mean (ie average) and standard deviation of A A value vi of A is normalized to v0 i by computing

Decimal scaling Suppose that the recorded values of A range from 1048576986 to 917 The maximum absolute value of A is 986 To normalize by decimal scaling we therefore divide each value by 1000 (ie j D 3) so that 1048576986 normalizes to 10485760986 and 917 normalizes to 0917

Note that normalization can change the original data quite a bit especially when using z-score normalization or decimal scaling It is also necessary to save the normalization parameters (eg the mean and standard deviation if using z-score normalization) so that future data can be normalized in a uniform manner

Discretization by Binning

Binning is a top-down splitting technique based on a specified number of bins Section 322 discussed binning methods for data smoothing These methods are also used as discretization methods for data reduction and concept hierarchy generation For example attribute values can be discretized by applying equal-width or equal-frequency binning and then replacing each bin value by the bin mean or median as in smoothing by bin means or smoothing by bin medians respectively These techniques can be applied recursively to the resulting partitions to generate concept hierarchies Binning does not use class information and is therefore an unsupervised discretization technique It is sensitive to the user-specified number of bins as well as the presence of outliers

DistanceSimilarity Measures

Similarity measure of how close to each other two instances are The ldquocloserrdquo the instances are to each other the larger is the similarity value

Dissimilarity measure of how different two instances are Dissimilarity is large when instances are very different and is small when they are close

Proximity refers to either similarity or dissimilarity

Distance metric a measure of dissimilarity that obeys the following laws (laws of triangular norm)

Conversion of similarity and dissimilarity measures

Typically given a similarity measure one can ldquorevertrdquo it to serve as the dissimilarity measure and vice versa

Conversions may differ Eg if d is a distance measure one can use

as the corresponding similarity measure If s is the similarity measure that ranges between 0 and 1 (so called degree of similarity) then the corresponding dissimilarity measure can be defined as

In general any monotonically decreasing transformation can be applied to convert similarity measures into dissimilarity measures and any monotonically increasing transformtaion can be applied to convert the measures the other way around

Distance Metrics for Numeric Attributes

When the data set is presented in a standard form each instance can be treated as a vector macrx = (x1 xN) of measures for attributes numbered 1 N

Consider for now only non-nominal scales

Computing User Similarity

A critical design decision in implementing userndashuser CF is the choice of similarity function Several different similarity functions have been proposed and evaluated in the literature Pearson correlation This method computes the statistical correlation (Pearsonrsquos r) between two userrsquos common ratings to determine their similarity GroupLens and BellCore both used this method The correlation is computed by the following

Click here Collaborative Filtering Recommender Systems

Clustering

Click here Applied Multivariate Statistical Analysis ebook

Model Based Clustering ndash Click Here

Expectation Maximization - Click Here

UV-Decomposition ndash Click Here

SVD Method ndash Click Here

Page 13: Recomender System Notes

This scenario illustrates three of the elements defining data quality accuracy completeness and consistency Inaccurate incomplete and inconsistent data are commonplace properties of large real-world databases and data warehouses There are many possible reasons for inaccurate data (ie having incorrect attribute values) The data collection instruments used may be faulty There may have been human or computer errors occurring at data entry Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (eg by choosing the default value ldquoJanuary 1rdquo displayed for birthday) This is known as disguised missing data Errors in data transmission can also occur There may be technology limitations such as limited buffer size for coordinating synchronized data transfer and consumption Incorrect data may also result from inconsistencies in naming conventions or data codes or inconsistent formats for input fields (eg date) Duplicate tuples also require data cleaning

Incomplete data can occur for a number of reasons Attributes of interest may not always be available such as customer information for sales transaction data Other data may not be included simply because they were not considered important at the time of entry Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions Data that were inconsistent with other recorded data may have been deleted Furthermore the recording of the data history or modifications may have been overlookedMissing data particularly for tuples with missing values for some attributes may need to be inferred

Recall that data quality depends on the intended use of the data Two different users may have very different assessments of the quality of a given database For example a marketing analyst may need to access the database mentioned before for a list of customer addresses Some of the addresses are outdated or incorrect yet overall 80 of the addresses are accurate The marketing analyst considers this to be a large customer database for target marketing purposes and is pleased with the databasersquos accuracy although as sales manager you found the data inaccurate

Timeliness also affects data quality Suppose that you are overseeing the distribution of monthly sales bonuses to the top sales representatives at AllElectronics Several sales representatives however fail to submit their sales records on time at the end of the month There are also a number of corrections and adjustments that flow in after the monthrsquos end For a period of time following each month the data stored in the database are incomplete However once all of the data are received it is correct The fact that the month-end data are not updated in a timely fashion has a negative impact on the data quality

Two other factors affecting data quality are believability and interpretability Believability reflects how much the data are trusted by users while interpretability reflects how easy the data are understood Suppose that a database at one point had several errors all of which have since been corrected The past errors however had caused many problems for sales department users and so they no longer trust the data The data also use many accounting codes which the sales department does not know how to interpret Even though the database is now accurate complete consistent and timely sales department users may regard it as of low quality due to poor believability and interpretability

Data Cleaning

Real-world data tend to be incomplete noisy and inconsistent Data cleaning (or data cleansing) routines attempt to fill in missing values smooth out noise while identifying outliers and correct inconsistencies in the data In this section you will study basic methods for data cleaning Section 321 looks at ways of handling missing values Section 322 explains data smoothing techniques Section 323 discusses approaches to data cleaning as a process

321 Missing Values

Imagine that you need to analyze AllElectronics sales and customer data You note that many tuples have no recorded value for several attributes such as customer income How can you go about filling in the missing values for this attribute Letrsquos look at the following methods

1 Ignore the tuple This is usually done when the class label is missing (assuming the mining task involves classification) This method is not very effective unless the tuple contains several attributes with missing values It is especially poor when the percentage of missing values per attribute varies considerably By ignoring the tuple we do not make use of the remaining attributesrsquo values in the tuple Such data could have been useful to the task at hand

2 Fill in the missing value manually In general this approach is time consuming and may not be feasible given a large data set with many missing values

3 Use a global constant to fill in the missing value Replace all missing attribute values by the same constant such as a label like ldquoUnknownrdquo or 10485761 If missing values are replaced by say ldquoUnknownrdquo then the mining program may mistakenly think that they form an interesting concept since they all have a value in commonmdashthat of ldquoUnknownrdquo Hence although this method is simple it is not foolproof

4 Use a measure of central tendency for the attribute (eg the mean or median) to fill in the missing value Chapter 2 discussed measures of central tendency which indicate the ldquomiddlerdquo value of a data distribution For normal (symmetric) data distributions the mean can be used while skewed data distribution should employ the median (Section 22) For example suppose that the data distribution regarding the income of AllElectronics customers is symmetric and that the mean income is $56000 Use this value to replace the missing value for income

5 Use the attribute mean or median for all samples belonging to the same class as the given tuple For example if classifying customers according to credit risk we may replace the missing value with the mean income value for customers in the same credit risk category as that of the given tuple If the data distribution for a given class is skewed the median value is a better choice

6 Use the most probable value to fill in the missing value This may be determined with regression inference-based tools using a Bayesian formalism or decision tree induction For example using the other customer attributes in your data set you may construct a decision tree to predict the missing values for income Decision trees and Bayesian inference are described in detail in Chapters 8 and 9 respectively while regression is introduced in Section 345

Noisy Data

ldquoWhat is noiserdquo Noise is a random error or variance in a measured variable In Chapter 2 we saw how some basic statistical description techniques (eg boxplots and scatter plots) and methods of data visualization can be used to identify outliers which may represent noise Given a numeric attribute such as say price how can we ldquosmoothrdquo out the data to remove the noise Letrsquos look at the following data smoothing techniques

Binning Binning methods smooth a sorted data value by consulting its ldquoneighborhoodrdquo that is the values around it The sorted values are distributed into a number of ldquobucketsrdquo or bins Because binning methods consult the neighborhood of valuesthey perform local smoothing Figure 32 illustrates some binning techniques In this example the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (ie each bin contains three values) In smoothing by bin means each value in a bin is replaced by the mean value of the bin For example the mean of the values 4 8 and 15 in Bin 1 is 9 Therefore each original value in this bin is replaced by the value 9

Similarly smoothing by bin medians can be employed in which each bin value is replaced by the bin median In smoothing by bin boundaries the minimum and maximum values in a given bin are identified as the bin boundaries Each bin value is then replaced by the closest boundary value In general the larger the width the greater the effect of the smoothing Alternatively bins may be equal width where the interval range of values in each bin is constant Binning is also used as a discretization technique and is further discussed in Section 35

Regression Data smoothing can also be done by regression a technique that conforms data values to a function Linear regression involves finding the ldquobestrdquo line to fit two attributes (or variables) so that one attribute can be used to predict the other Multiple linear regression is an extension of linear regression where more than two attributes are involved and the data are fit to a multidimensional surface Regression is further described in Section 345

Outlier analysis Outliers may be detected by clustering for example where similar values are organized into groups or ldquoclustersrdquo Intuitively values that fall outside of the set of clusters may be considered outliers (Figure 33) Chapter 12 is dedicated to the topic of outlier analysis

Data Integration

Data mining often requires data integrationmdashthe merging of data from multiple data stores Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set This can help improve the accuracy and speed of the subsequent data mining process

Data Transformation Strategies Overview

In data transformation the data are transformed or consolidated into forms appropriate for mining

Strategies for data transformation include the following

1 Smoothing which works to remove noise from the data Techniques include binning regression and clustering

2 Attribute construction (or feature construction) where new attributes are constructed and added from the given set of attributes to help the mining process

3 Aggregation where summary or aggregation operations are applied to the data For example the daily sales data may be aggregated so as to compute monthly and annual total amounts This step is typically used in constructing a data cube for data analysis at multiple abstraction levels

4 Normalization where the attribute data are scaled so as to fall within a smaller range such as 104857610 to 10 or 00 to 10

5 Discretization where the raw values of a numeric attribute (eg age) are replaced by interval labels (eg 0ndash10 11ndash20 etc) or conceptual labels (eg youth adult senior) The labels in turn can be recursively organized into higher-level concepts resulting in a concept hierarchy for the numeric attribute Figure 312 shows a concept hierarchy for the attribute price More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users

6 Concept hierarchy generation for nominal data where attributes such as street can be generalized to higher-level concepts like city or country Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level

Data Transformation by Normalization

The measurement unit used can affect the data analysis For example changing measurement units from meters to inches for height or from kilograms to pounds for weight may lead to very different results In general expressing an attribute in smaller units will lead to a larger range for that attribute and thus tend to give such an attribute greater effect or ldquoweightrdquo To help avoid dependence on the choice of measurement units the data should be normalized or standardized This involves transforming the data to fall within a smaller or common range such as [-1 1] or [00 10] (The terms standardize and normalize are used interchangeably in data preprocessing although in statistics the latter term also has other connotations)

Normalizing the data attempts to give all attributes an equal weight Normalization is particularly useful for classification algorithms involving neural networks or distance measurements such as nearest-neighbor classification and clustering If using the neural network backpropagation algorithm for classification mining (Chapter 9) normalizing the input values for each attribute measured in the training tuples will help speed up the learning phase For distance-based methods normalization helps prevent attributes with initially large ranges (eg income) from outweighing attributes with initially smaller ranges (eg binary attributes) It is also useful when given no prior knowledge of the data

There are many methods for data normalization We study min-max normalization z-score normalization and normalization by decimal scaling For our discussion let A be a numeric attribute with n observed values v1 v2 vn

Min-max normalization performs a linear transformation on the original data Suppose that minA and maxA are the minimum and maximum values of an attribute A Min-max normalization maps a value vi of A to v0 i in the range [new minAnew maxA] by computing

Min-max normalization preserves the relationships among the original data values It will encounter an ldquoout-of-boundsrdquo error if a future input case for normalization falls outside of the original data range for A

In z-score normalization (or zero-mean normalization) the values for an attribute A are normalized based on the mean (ie average) and standard deviation of A A value vi of A is normalized to v0 i by computing

Decimal scaling Suppose that the recorded values of A range from 1048576986 to 917 The maximum absolute value of A is 986 To normalize by decimal scaling we therefore divide each value by 1000 (ie j D 3) so that 1048576986 normalizes to 10485760986 and 917 normalizes to 0917

Note that normalization can change the original data quite a bit especially when using z-score normalization or decimal scaling It is also necessary to save the normalization parameters (eg the mean and standard deviation if using z-score normalization) so that future data can be normalized in a uniform manner

Discretization by Binning

Binning is a top-down splitting technique based on a specified number of bins Section 322 discussed binning methods for data smoothing These methods are also used as discretization methods for data reduction and concept hierarchy generation For example attribute values can be discretized by applying equal-width or equal-frequency binning and then replacing each bin value by the bin mean or median as in smoothing by bin means or smoothing by bin medians respectively These techniques can be applied recursively to the resulting partitions to generate concept hierarchies Binning does not use class information and is therefore an unsupervised discretization technique It is sensitive to the user-specified number of bins as well as the presence of outliers

DistanceSimilarity Measures

Similarity measure of how close to each other two instances are The ldquocloserrdquo the instances are to each other the larger is the similarity value

Dissimilarity measure of how different two instances are Dissimilarity is large when instances are very different and is small when they are close

Proximity refers to either similarity or dissimilarity

Distance metric a measure of dissimilarity that obeys the following laws (laws of triangular norm)

Conversion of similarity and dissimilarity measures

Typically given a similarity measure one can ldquorevertrdquo it to serve as the dissimilarity measure and vice versa

Conversions may differ Eg if d is a distance measure one can use

as the corresponding similarity measure If s is the similarity measure that ranges between 0 and 1 (so called degree of similarity) then the corresponding dissimilarity measure can be defined as

In general any monotonically decreasing transformation can be applied to convert similarity measures into dissimilarity measures and any monotonically increasing transformtaion can be applied to convert the measures the other way around

Distance Metrics for Numeric Attributes

When the data set is presented in a standard form each instance can be treated as a vector macrx = (x1 xN) of measures for attributes numbered 1 N

Consider for now only non-nominal scales

Computing User Similarity

A critical design decision in implementing userndashuser CF is the choice of similarity function Several different similarity functions have been proposed and evaluated in the literature Pearson correlation This method computes the statistical correlation (Pearsonrsquos r) between two userrsquos common ratings to determine their similarity GroupLens and BellCore both used this method The correlation is computed by the following

Click here Collaborative Filtering Recommender Systems

Clustering

Click here Applied Multivariate Statistical Analysis ebook

Model Based Clustering ndash Click Here

Expectation Maximization - Click Here

UV-Decomposition ndash Click Here

SVD Method ndash Click Here

Page 14: Recomender System Notes

Imagine that you need to analyze AllElectronics sales and customer data You note that many tuples have no recorded value for several attributes such as customer income How can you go about filling in the missing values for this attribute Letrsquos look at the following methods

1 Ignore the tuple This is usually done when the class label is missing (assuming the mining task involves classification) This method is not very effective unless the tuple contains several attributes with missing values It is especially poor when the percentage of missing values per attribute varies considerably By ignoring the tuple we do not make use of the remaining attributesrsquo values in the tuple Such data could have been useful to the task at hand

2 Fill in the missing value manually In general this approach is time consuming and may not be feasible given a large data set with many missing values

3 Use a global constant to fill in the missing value Replace all missing attribute values by the same constant such as a label like ldquoUnknownrdquo or 10485761 If missing values are replaced by say ldquoUnknownrdquo then the mining program may mistakenly think that they form an interesting concept since they all have a value in commonmdashthat of ldquoUnknownrdquo Hence although this method is simple it is not foolproof

4 Use a measure of central tendency for the attribute (eg the mean or median) to fill in the missing value Chapter 2 discussed measures of central tendency which indicate the ldquomiddlerdquo value of a data distribution For normal (symmetric) data distributions the mean can be used while skewed data distribution should employ the median (Section 22) For example suppose that the data distribution regarding the income of AllElectronics customers is symmetric and that the mean income is $56000 Use this value to replace the missing value for income

5 Use the attribute mean or median for all samples belonging to the same class as the given tuple For example if classifying customers according to credit risk we may replace the missing value with the mean income value for customers in the same credit risk category as that of the given tuple If the data distribution for a given class is skewed the median value is a better choice

6 Use the most probable value to fill in the missing value This may be determined with regression inference-based tools using a Bayesian formalism or decision tree induction For example using the other customer attributes in your data set you may construct a decision tree to predict the missing values for income Decision trees and Bayesian inference are described in detail in Chapters 8 and 9 respectively while regression is introduced in Section 345

Noisy Data

ldquoWhat is noiserdquo Noise is a random error or variance in a measured variable In Chapter 2 we saw how some basic statistical description techniques (eg boxplots and scatter plots) and methods of data visualization can be used to identify outliers which may represent noise Given a numeric attribute such as say price how can we ldquosmoothrdquo out the data to remove the noise Letrsquos look at the following data smoothing techniques

Binning Binning methods smooth a sorted data value by consulting its ldquoneighborhoodrdquo that is the values around it The sorted values are distributed into a number of ldquobucketsrdquo or bins Because binning methods consult the neighborhood of valuesthey perform local smoothing Figure 32 illustrates some binning techniques In this example the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (ie each bin contains three values) In smoothing by bin means each value in a bin is replaced by the mean value of the bin For example the mean of the values 4 8 and 15 in Bin 1 is 9 Therefore each original value in this bin is replaced by the value 9

Similarly smoothing by bin medians can be employed in which each bin value is replaced by the bin median In smoothing by bin boundaries the minimum and maximum values in a given bin are identified as the bin boundaries Each bin value is then replaced by the closest boundary value In general the larger the width the greater the effect of the smoothing Alternatively bins may be equal width where the interval range of values in each bin is constant Binning is also used as a discretization technique and is further discussed in Section 35

Regression Data smoothing can also be done by regression a technique that conforms data values to a function Linear regression involves finding the ldquobestrdquo line to fit two attributes (or variables) so that one attribute can be used to predict the other Multiple linear regression is an extension of linear regression where more than two attributes are involved and the data are fit to a multidimensional surface Regression is further described in Section 345

Outlier analysis Outliers may be detected by clustering for example where similar values are organized into groups or ldquoclustersrdquo Intuitively values that fall outside of the set of clusters may be considered outliers (Figure 33) Chapter 12 is dedicated to the topic of outlier analysis

Data Integration

Data mining often requires data integrationmdashthe merging of data from multiple data stores Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set This can help improve the accuracy and speed of the subsequent data mining process

Data Transformation Strategies Overview

In data transformation the data are transformed or consolidated into forms appropriate for mining

Strategies for data transformation include the following

1 Smoothing which works to remove noise from the data Techniques include binning regression and clustering

2 Attribute construction (or feature construction) where new attributes are constructed and added from the given set of attributes to help the mining process

3 Aggregation where summary or aggregation operations are applied to the data For example the daily sales data may be aggregated so as to compute monthly and annual total amounts This step is typically used in constructing a data cube for data analysis at multiple abstraction levels

4 Normalization where the attribute data are scaled so as to fall within a smaller range such as 104857610 to 10 or 00 to 10

5 Discretization where the raw values of a numeric attribute (eg age) are replaced by interval labels (eg 0ndash10 11ndash20 etc) or conceptual labels (eg youth adult senior) The labels in turn can be recursively organized into higher-level concepts resulting in a concept hierarchy for the numeric attribute Figure 312 shows a concept hierarchy for the attribute price More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users

6 Concept hierarchy generation for nominal data where attributes such as street can be generalized to higher-level concepts like city or country Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level

Data Transformation by Normalization

The measurement unit used can affect the data analysis For example changing measurement units from meters to inches for height or from kilograms to pounds for weight may lead to very different results In general expressing an attribute in smaller units will lead to a larger range for that attribute and thus tend to give such an attribute greater effect or ldquoweightrdquo To help avoid dependence on the choice of measurement units the data should be normalized or standardized This involves transforming the data to fall within a smaller or common range such as [-1 1] or [00 10] (The terms standardize and normalize are used interchangeably in data preprocessing although in statistics the latter term also has other connotations)

Normalizing the data attempts to give all attributes an equal weight Normalization is particularly useful for classification algorithms involving neural networks or distance measurements such as nearest-neighbor classification and clustering If using the neural network backpropagation algorithm for classification mining (Chapter 9) normalizing the input values for each attribute measured in the training tuples will help speed up the learning phase For distance-based methods normalization helps prevent attributes with initially large ranges (eg income) from outweighing attributes with initially smaller ranges (eg binary attributes) It is also useful when given no prior knowledge of the data

There are many methods for data normalization We study min-max normalization z-score normalization and normalization by decimal scaling For our discussion let A be a numeric attribute with n observed values v1 v2 vn

Min-max normalization performs a linear transformation on the original data Suppose that minA and maxA are the minimum and maximum values of an attribute A Min-max normalization maps a value vi of A to v0 i in the range [new minAnew maxA] by computing

Min-max normalization preserves the relationships among the original data values It will encounter an ldquoout-of-boundsrdquo error if a future input case for normalization falls outside of the original data range for A

In z-score normalization (or zero-mean normalization) the values for an attribute A are normalized based on the mean (ie average) and standard deviation of A A value vi of A is normalized to v0 i by computing

Decimal scaling Suppose that the recorded values of A range from 1048576986 to 917 The maximum absolute value of A is 986 To normalize by decimal scaling we therefore divide each value by 1000 (ie j D 3) so that 1048576986 normalizes to 10485760986 and 917 normalizes to 0917

Note that normalization can change the original data quite a bit especially when using z-score normalization or decimal scaling It is also necessary to save the normalization parameters (eg the mean and standard deviation if using z-score normalization) so that future data can be normalized in a uniform manner

Discretization by Binning

Binning is a top-down splitting technique based on a specified number of bins Section 322 discussed binning methods for data smoothing These methods are also used as discretization methods for data reduction and concept hierarchy generation For example attribute values can be discretized by applying equal-width or equal-frequency binning and then replacing each bin value by the bin mean or median as in smoothing by bin means or smoothing by bin medians respectively These techniques can be applied recursively to the resulting partitions to generate concept hierarchies Binning does not use class information and is therefore an unsupervised discretization technique It is sensitive to the user-specified number of bins as well as the presence of outliers

DistanceSimilarity Measures

Similarity measure of how close to each other two instances are The ldquocloserrdquo the instances are to each other the larger is the similarity value

Dissimilarity measure of how different two instances are Dissimilarity is large when instances are very different and is small when they are close

Proximity refers to either similarity or dissimilarity

Distance metric a measure of dissimilarity that obeys the following laws (laws of triangular norm)

Conversion of similarity and dissimilarity measures

Typically given a similarity measure one can ldquorevertrdquo it to serve as the dissimilarity measure and vice versa

Conversions may differ Eg if d is a distance measure one can use

as the corresponding similarity measure If s is the similarity measure that ranges between 0 and 1 (so called degree of similarity) then the corresponding dissimilarity measure can be defined as

In general any monotonically decreasing transformation can be applied to convert similarity measures into dissimilarity measures and any monotonically increasing transformtaion can be applied to convert the measures the other way around

Distance Metrics for Numeric Attributes

When the data set is presented in a standard form each instance can be treated as a vector macrx = (x1 xN) of measures for attributes numbered 1 N

Consider for now only non-nominal scales

Computing User Similarity

A critical design decision in implementing userndashuser CF is the choice of similarity function Several different similarity functions have been proposed and evaluated in the literature Pearson correlation This method computes the statistical correlation (Pearsonrsquos r) between two userrsquos common ratings to determine their similarity GroupLens and BellCore both used this method The correlation is computed by the following

Click here Collaborative Filtering Recommender Systems

Clustering

Click here Applied Multivariate Statistical Analysis ebook

Model Based Clustering ndash Click Here

Expectation Maximization - Click Here

UV-Decomposition ndash Click Here

SVD Method ndash Click Here

Page 15: Recomender System Notes

Similarly smoothing by bin medians can be employed in which each bin value is replaced by the bin median In smoothing by bin boundaries the minimum and maximum values in a given bin are identified as the bin boundaries Each bin value is then replaced by the closest boundary value In general the larger the width the greater the effect of the smoothing Alternatively bins may be equal width where the interval range of values in each bin is constant Binning is also used as a discretization technique and is further discussed in Section 35

Regression Data smoothing can also be done by regression a technique that conforms data values to a function Linear regression involves finding the ldquobestrdquo line to fit two attributes (or variables) so that one attribute can be used to predict the other Multiple linear regression is an extension of linear regression where more than two attributes are involved and the data are fit to a multidimensional surface Regression is further described in Section 345

Outlier analysis Outliers may be detected by clustering for example where similar values are organized into groups or ldquoclustersrdquo Intuitively values that fall outside of the set of clusters may be considered outliers (Figure 33) Chapter 12 is dedicated to the topic of outlier analysis

Data Integration

Data mining often requires data integrationmdashthe merging of data from multiple data stores Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set This can help improve the accuracy and speed of the subsequent data mining process

Data Transformation Strategies Overview

In data transformation the data are transformed or consolidated into forms appropriate for mining

Strategies for data transformation include the following

1. Smoothing, which works to remove noise from the data. Techniques include binning, regression, and clustering.

2. Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.

3. Aggregation, where summary or aggregation operations are applied to the data. For example, daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for data analysis at multiple abstraction levels.

4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as -1.0 to 1.0, or 0.0 to 1.0.

5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute. Figure 3.12 shows a concept hierarchy for the attribute price. More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users.

6. Concept hierarchy generation for nominal data, where attributes such as street can be generalized to higher-level concepts, like city or country. Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level.

Data Transformation by Normalization

The measurement unit used can affect the data analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results. In general, expressing an attribute in smaller units will lead to a larger range for that attribute, and thus tend to give such an attribute greater effect or "weight". To help avoid dependence on the choice of measurement units, the data should be normalized or standardized. This involves transforming the data to fall within a smaller or common range, such as [-1, 1] or [0.0, 1.0]. (The terms standardize and normalize are used interchangeably in data preprocessing, although in statistics the latter term also has other connotations.)

Normalizing the data attempts to give all attributes an equal weight. Normalization is particularly useful for classification algorithms involving neural networks or distance measurements, such as nearest-neighbor classification and clustering. If using the neural network backpropagation algorithm for classification mining (Chapter 9), normalizing the input values for each attribute measured in the training tuples will help speed up the learning phase. For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes). It is also useful when given no prior knowledge of the data.

There are many methods for data normalization. We study min-max normalization, z-score normalization, and normalization by decimal scaling. For our discussion, let A be a numeric attribute with n observed values v_1, v_2, ..., v_n.

Min-max normalization performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute A. Min-max normalization maps a value v_i of A to v'_i in the range [new_min_A, new_max_A] by computing

v'_i = ((v_i - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for A.
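
A minimal sketch of min-max normalization; the income values are hypothetical, chosen so that 73,600 maps to about 0.716 on [0.0, 1.0].

import numpy as np

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Map values linearly from [min_A, max_A] onto [new_min, new_max]."""
    v = np.asarray(values, dtype=float)
    min_a, max_a = v.min(), v.max()
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Hypothetical incomes between 12,000 and 98,000 mapped to [0.0, 1.0].
print(min_max_normalize([12000, 73600, 98000]))   # 73,600 -> about 0.716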

In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean (i.e., average) and standard deviation of A. A value v_i of A is normalized to v'_i by computing

v'_i = (v_i - Ā) / σ_A

where Ā and σ_A are the mean and standard deviation of A, respectively.

Decimal scaling: In normalization by decimal scaling, a value v_i of A is normalized to v'_i = v_i / 10^j, where j is the smallest integer such that max(|v'_i|) < 1. Suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
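
A small sketch of z-score normalization and decimal scaling, reusing the -986 / 917 example above; the income values passed to z_score are hypothetical.

import numpy as np

def z_score(values):
    """Zero-mean, unit-variance scaling: (v - mean) / std."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / v.std()

def decimal_scaling(values):
    """Divide by 10**j, where j is the smallest integer making every |value| < 1."""
    v = np.asarray(values, dtype=float)
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / (10 ** j)

print(z_score([54000, 16000, 73600, 98000]))   # hypothetical income values
print(decimal_scaling([-986, 917]))            # -> [-0.986, 0.917]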

Note that normalization can change the original data quite a bit, especially when using z-score normalization or decimal scaling. It is also necessary to save the normalization parameters (e.g., the mean and standard deviation if using z-score normalization) so that future data can be normalized in a uniform manner.

Discretization by Binning

Binning is a top-down splitting technique based on a specified number of bins. Section 3.2.2 discussed binning methods for data smoothing. These methods are also used as discretization methods for data reduction and concept hierarchy generation. For example, attribute values can be discretized by applying equal-width or equal-frequency binning, and then replacing each bin value by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively. These techniques can be applied recursively to the resulting partitions to generate concept hierarchies. Binning does not use class information and is therefore an unsupervised discretization technique. It is sensitive to the user-specified number of bins, as well as the presence of outliers.
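
A short sketch of equal-width versus equal-frequency discretization with interval labels, assuming pandas is available; the ages and labels are hypothetical.

import pandas as pd

ages = pd.Series([5, 12, 19, 23, 31, 44, 52, 60, 67, 75])

# Equal-width bins: each interval spans the same range of values.
equal_width = pd.cut(ages, bins=3, labels=["low", "mid", "high"])

# Equal-frequency bins: each interval holds (roughly) the same number of values.
equal_freq = pd.qcut(ages, q=3, labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))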

Distance/Similarity Measures

Similarity: a measure of how close to each other two instances are. The "closer" the instances are to each other, the larger the similarity value.

Dissimilarity: a measure of how different two instances are. Dissimilarity is large when the instances are very different, and small when they are close.

Proximity refers to either similarity or dissimilarity.

Distance metric: a measure of dissimilarity d that obeys the following laws (the metric axioms, sometimes called the laws of the triangular norm):
d(x, y) >= 0, and d(x, y) = 0 if and only if x = y (non-negativity and identity);
d(x, y) = d(y, x) (symmetry);
d(x, z) <= d(x, y) + d(y, z) (triangle inequality).

Conversion of similarity and dissimilarity measures

Typically, given a similarity measure, one can "revert" it to serve as the dissimilarity measure, and vice versa.

Conversions may differ. E.g., if d is a distance measure, one can use

s = 1 / (1 + d)

as the corresponding similarity measure. If s is a similarity measure that ranges between 0 and 1 (the so-called degree of similarity), then the corresponding dissimilarity measure can be defined as

d = 1 - s

In general, any monotonically decreasing transformation can be applied to convert similarity measures into dissimilarity measures, and any monotonically increasing transformation can be applied to convert the measures the other way around.
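
A tiny sketch of the two conversions mentioned above, using s = 1/(1 + d) and d = 1 - s as the monotone transformations.

def similarity_from_distance(d):
    """Monotonically decreasing map: distance 0 -> similarity 1, large distance -> near 0."""
    return 1.0 / (1.0 + d)

def dissimilarity_from_similarity(s):
    """For a degree of similarity in [0, 1]."""
    return 1.0 - s

print(similarity_from_distance(0.0), similarity_from_distance(4.0))   # 1.0, 0.2
print(dissimilarity_from_similarity(0.75))                            # 0.25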

Distance Metrics for Numeric Attributes

When the data set is presented in a standard form, each instance can be treated as a vector x̄ = (x_1, ..., x_N) of measures for attributes numbered 1, ..., N.

Consider, for now, only non-nominal scales.
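
A minimal sketch of the Minkowski family of distances for such numeric vectors (p = 1 gives Manhattan, p = 2 gives Euclidean); the two instances are hypothetical.

import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]
print(minkowski(a, b, p=1))   # Manhattan: 7.0
print(minkowski(a, b, p=2))   # Euclidean: 5.0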
