
Multimodal Usage Visualization for Large Websites Bibek Bhattarai, Mike Wong, Rahul Singh

Department of Computer Science, San Francisco State University, San Francisco, CA

ABSTRACT Large websites pose the following challenges for comprehension of user behavior: users’ behaviors are complex and diverse, the web log data is very noisy, and the quantity of web log data is of a magnitude that defies direct analysis. In this paper we present an integrated multimodal approach for usability analysis of large websites. Our research combines web content mining and web usage mining techniques in a novel integrated system which visualizes usage patterns and user goals. Furthermore, it compares web usage with the structure of the website’s content, giving a measure of design quality. Bringing usage mining and content mining together allows web designers to discover the root causes of problems in the web design. The system displays multimodal information in a reflective interface. This interface provides a direct interactive visualization and query environment for discovering web usage patterns. For any given usage pattern, our system uncovers the related information goal, using webpage semantic analysis and information foraging techniques. We evaluate our technique and demonstrate our system’s value for improving web design and understanding web usage.

Categories and Subject Descriptors H.3.3 [Information Systems]: Information Storage and Retrieval – clustering, information filtering, query formulation.

Keywords Web usage visualization, information foraging, user sessions, multimodal analysis, information scent, information retrieval, visualization, mining.

1. INTRODUCTION Web engineering today faces the enormous challenge of iteratively improving website design to facilitate use. Web designers must re-engineer websites to manage vast amounts of multimedia information and make this information accessible. The success of any individual website depends on web users finding the information they seek (their information goal). Website usability is a measure of the ease with which users find their information goal. For e-commerce sites, better web usability generates more purchase decisions, which increases revenue. For educational sites, web usability means providing relevant information to the community, thereby increasing the utility and prestige of the educational institution. Improving web usability means increasing the probability that users find their information goal. Web designers would like to discover ways to restructure their website to increase usability. To accomplish this, it is necessary to visualize patterns in how users are using the website (the usage patterns) and to quantify the success of each usage pattern. A user session can be thought of as the series of pages that a user visits within a span of time, and a usage pattern as a pattern common to a group of user sessions (rigorous definitions follow later).

Developing an understanding of users’ usage patterns and information goals is critical in making web content accessible. The usage patterns are captured in the web log data. Web designers need a means to reorganize the log data into sensible views to detect these usage patterns. Our research combines web content mining and web usage mining techniques in a novel integrated system which visualizes usage patterns and user goals. As a prototypical example of a large website featuring multimodal information and sustaining various types of usage patterns, we use SkyServer.

1.1 Challenges in Analyzing Web Usage Some of the significant challenges in analyzing web usage include:

Websites are designed for change. Inactive websites suffer from data decay, that is, the decreasing accuracy and utility of aging data; therefore active websites must continuously change their content to stay relevant. Likewise, websites must regularly restructure to accommodate growth and support new features. Constant change presents a difficulty for content mining. For example, SkyServer incorporates a revision number in its URLs; when a newer revision of the data is released, many of the pages are moved, effectively deleting the content.

Multimodal sites offer multiple interfaces, each having some distinct usage patterns. Analyzing multimodal sites requires a more general definition of a usage pattern. For example, SkyServer offers several modes of user interaction: static content browsing, text-based queries (SQL), and querying by clicking on an image of the sky (receiving textual and visual feedback). To capture usage patterns more accurately, a log analysis system must understand how a user interacts with each of these modes and parse the web logs accordingly.

User Session 1:
http://skyserver.sdss.org/dr2/en/
http://skyserver.sdss.org/dr2/en/sdss/
http://skyserver.sdss.org/dr2/en/sdss/data/data.asp
http://skyserver.sdss.org/dr2/en/sdss/instruments/instruments.asp

User Session 2:
http://skyserver.sdss.org/dr2/en/
http://skyserver.sdss.org/dr2/en/sdss/
http://skyserver.sdss.org/dr2/en/sdss/data/data.asp

Figure 1. User sessions for SkyServer

Usage log mining only provides information about how users are consuming data, not why. Usage log mining shows how users are using the website to fulfill their information goal, but it cannot tell us what those information goals are. For example, consider the two user sessions for the SkyServer [1] website shown in figure 1. Log analysis shows that

SFSU-CS-TR-21

both user sessions followed a similar browsing path. It also shows that user 1 visited instruments.asp, while user 2 left from data.asp. But log analysis cannot explain why the two users followed a similar path yet exited from different pages, or what their respective information goals were. Web content analysis reveals that both users were interested in the SDSS data processing methodology, and that user 1 was also interested in the instruments used in the SDSS project to capture data. Thus, web content analysis helps us uncover differences in the information goals of two user sessions that followed a similar browsing path.
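The log-only view of the two sessions in figure 1 can be illustrated with a short sketch. It is not part of the system described in this paper; the helper name `shared_prefix` is an assumption made for illustration, and the code recovers only what the log alone shows, namely the common path and the differing exit pages.

```python
from itertools import takewhile

BASE = "http://skyserver.sdss.org/dr2/en/"

# The two sessions from figure 1, as ordered lists of requested URLs.
session_1 = [
    BASE,
    BASE + "sdss/",
    BASE + "sdss/data/data.asp",
    BASE + "sdss/instruments/instruments.asp",
]
session_2 = [
    BASE,
    BASE + "sdss/",
    BASE + "sdss/data/data.asp",
]


def shared_prefix(a, b):
    """Return the leading pages that both sessions visited in the same order."""
    pairs = takewhile(lambda pair: pair[0] == pair[1], zip(a, b))
    return [page for page, _ in pairs]


common = shared_prefix(session_1, session_2)
exits = (session_1[-1], session_2[-1])

print("shared path length:", len(common))
print("exit pages:", exits)
```

Nothing in this output says why the sessions diverge; that question is what the content analysis is meant to answer.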

2. PRIOR RESEARCH This section describes the current state of web usage mining and the features of some of the currently available tools. We then discuss how our system offers a completely different experience from existing tools. Web usage mining has been a growing field [16, 9, 10] over the past decade as website developers struggle to understand their users’ usage patterns (and thereby predict the users’ needs). Many current products try to fill this business need and claim to offer the following features: usage pattern discovery, pattern clustering (called user segment discovery), and content similarity (e.g. product co-occurrence). Because the field is fairly mature, there is a well-understood methodology which has been refined over the years. We refer to [22] and repeat the basic steps only to set the stage for discussion of the respective challenges our system overcomes. One common approach in prior research is the extraction of usage patterns from the usage log data [9, 10, 20, 22]. A notable challenge in pattern discovery is clustering user requests into user sessions [16]. Discovering user sessions is known to be a difficult problem, and there are a variety of approaches [2, 14, 9], each with differing degrees of applicability for a given usage log. Other studies have attempted to use machine learning and visualization techniques [5, 14] to extract usage patterns. These approaches perform usage analysis by identifying and clustering user sessions. Research based solely on this approach does not take web content into account, and therefore cannot analyze the user goal as it relates to a usage pattern. Furthermore, after clustering the usage patterns, this approach does not suggest actionable improvements to web design. Another common approach is web content based analysis [3, 4, 7, 20]. In this approach, researchers try to extract the information goal for a browsing pattern. 
They also cluster the patterns based on an extracted information goal and web content. Research works based solely on this approach do not use the information provided by the usage log. For example, they do not analyze the usage pattern in the context of the time the usage pattern occurred. Our research philosophy radically differs from prior research. Many web systems focus on a single interpretation of what it means for a usage pattern to be interesting, since that is the algorithm that is implemented. Instead, we present a novel user interface to let the web designer decide what it means for a usage pattern to be interesting. The proposed system begins with an interactive visualization interface, where the visualization interface and the query interface are integrated. We discuss the details of our interactive visualization interface later in this paper. A web designer can use

this interactive visualization interface to begin a line of querying. Each query has the potential to prompt the web designer to formulate new queries. In this way, our system allows web designers to intuitively drill deeper into the underlying meaning. By itself, usage pattern analysis is insufficient, because analyzing only usage patterns does not give a clear indication of what the users’ goals were. To address this issue, we correlate a usage pattern with users’ goals. This correlation also helps to measure the usability of the website and identify the root causes of web design flaws. Some prior research formulates a web usage mining technique based on content mining of the website [7, 3, 4, 20], extracting the users’ information goal related to a particular usage pattern. Because usage analysis based solely on web content information is incomplete, these techniques cannot provide a global picture of overall web usage. We propose a system which overcomes this limitation by combining the information retrieved from usage mining and content mining.

3. OVERVIEW OF THE PROPOSED APPROACH Our research focuses on bringing together the information provided by the web log data and the information provided by web content. Information such as time of use, location of the user, and what the user accesses is all available from the web logs. Information such as website structure can be directly mined from the web content, and semantic data can be inferred from content mining. Bringing these different aspects of web usage together creates a more complete experience and understanding of the usage patterns. The proposed system starts with web usage mining by means of usage visualization. The web log information is categorized by several aspects (e.g. time, location, session, traffic, which pages are being accessed, etc.). Aspects that have an existing mental model representation (e.g. timelines for time, and maps for location) are displayed using those existing models as components in an integrated interface. Information is presented in the most sensible representation that preserves context. The interface is a reflective interface, which means that making a change in one of the components propagates the change to all other components; that is, changes are reflected in the other components. Using the reflective interface of the system, web designers are able to analyze the usage log and discover usage patterns from relationships between different aspects. This includes spatial and temporal characteristics related to the usage pattern. The system lets web designers perform further mining of a usage pattern to uncover the information goal. The web designer can then mine the content of the website in the context of the usage pattern. If the results of content mining are logically consistent with the web designer’s knowledge of the domain and website structure, then the web designer can use all of this information to draw interesting conclusions, as in the case study experiment described later. 
If the results are logically inconsistent, then the web designer can dismiss the apparent usage pattern as an inlier. This is just one of the advantages of combining web usage and web content mining. First, the usage log data is preprocessed to remove erroneous data and noise. Then usage patterns are uncovered from the usage log data. Finally, with user sessions and other patterns defined, the


usage patterns are analyzed. The patterns can be analyzed in a variety of ways, including searching for clues for web system improvement, website improvement, capturing a larger revenue stream, and user characterization. We perform web content mining based on Natural Language Processing (NLP) and information foraging techniques. First, we analyze web content using a common NLP technique known as Term Frequency Inverse Document Frequency (TFIDF) [23]. This technique estimates the importance of a term in a document from how often the term occurs in that document and how common it is across a given document collection. Then we use a technique based on Information Foraging theory [19] to predict the user information goal and the user flow in a website. This technique is based on the observation that users predict the content of a distal page from the information hint provided by the link pointing to the page [19]. Using Breadth First Search (BFS) we compute the shortest (optimal) path between a user’s start page and final goal page. Then, using a directed graph, we construct a graphical representation of the user’s path, the optimal path, and the user flow prediction related to the usage pattern. Graphical representation of the web structure and usage data can help web designers perform better visual analysis of web usability. However, structural and usage graphs of large websites are overwhelmingly complex and not useful for visualization and understanding. Our system overcomes this problem by presenting a context-based subset of the web structure and usage graphs that is related to the usage pattern being analyzed.
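The shortest-path step mentioned above can be sketched as a plain Breadth First Search over the site's link graph. This is a minimal illustration, not the implementation used in the system: the function name `shortest_path`, the dictionary representation of links, and the page names in the example are all assumptions.

```python
from collections import deque


def shortest_path(links, start, goal):
    """Breadth First Search over a link graph.

    links maps each page to the pages it links to (hypothetical layout).
    Returns the shortest list of pages from start to goal, or None if the
    goal cannot be reached by following links.
    """
    parents = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page == goal:
            path = []
            while page is not None:  # walk back to the start page
                path.append(page)
                page = parents[page]
            return path[::-1]
        for target in links.get(page, ()):
            if target not in parents:
                parents[target] = page
                queue.append(target)
    return None


# Hypothetical link structure, loosely modelled on the pages in figure 1.
site_links = {
    "home": ["sdss", "tools"],
    "sdss": ["data", "instruments"],
    "data": ["instruments"],
    "tools": [],
    "instruments": [],
}

optimal = shortest_path(site_links, "home", "instruments")
print(optimal)
```

Because every link has the same cost, BFS is sufficient here; a weighted graph would need a different algorithm.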

3.1 Benefits of Integrating Web Usage Visualization and Web Content Mining The approach taken in this paper combines two separate technical perspectives: web usage analysis using multimodal visualization, and content mining with structural and user flow visualization. By uniting these two components, we address the challenges listed above (Section 1.1) in the following ways:

Multimodal Visualization is a general technique. Visualization is completely data-dependent; it is unaffected by rapidly changing data and does not make any presumptions beyond the data schema. Outliers in the data are self-identifying. Given that over 92% of surveyed web servers [26] have a common subset of log information, visualizing this subset is general to most websites. The advantage of an interactive, unified visualization/query UI is that a user can drill-down and recursively query non-outliers to differentiate good data from inliers and recognize patterns. Visualization systems which offer multiple modes of information (such as timelines to display temporal information, maps to display spatial information, and charts to discover relationships between numeric and enumerated data) offer a more complete view of the web log information than a simple listing. Because these modes are all associated with the web log data, allowing a user to manipulate any of these dimensions creates a querying environment that leads to deep understanding.

Visualization and content mining are both enhanced by domain knowledge. Research shows that people can identify complex relationships in patterns in a few seconds [27]. Some prior works on usage pattern mining take this for granted in choosing what patterns are “interesting” [8, 13].

Therefore, relying on visualization techniques is at least as good as similar judgment-based works. Most importantly, identifying important usage patterns requires domain knowledge and knowledge of the web site design, especially for multimodal systems. Therefore a system which promotes free pattern recognition through visualization provides a more general solution. Finally, a web designer with domain knowledge can use the results of content mining to validate or dismiss an apparent visualized usage pattern.

Integrating web usage mining and content mining provides the complete picture. Web usage mining tells web designers how web users are using the website to fulfill their information goal. Web content mining tells us what users are looking for, that is, their information goal related to a particular usage pattern.

4. SKY SERVER AS A MODEL LARGE WEBSITE For experimental evaluation of our system we use real usage data collected from SkyServer [1]. The goal of this website is to provide web access to Sloan Digital Sky Survey (SDSS) [26] data through standard web browsers. The website provides various types of interaction between the user and the data [25]:

Simple point-and-click interaction, which allows the user to click on images of various celestial objects and retrieve data related to those objects.

Text and GUI SQL web service interfaces, where users can write their own queries to interact with the SDSS database.

Tools that let the user enter astronomical information related to a particular object and retrieve its images and spectra.

The target audience for this website ranges from children learning astronomy at school to research scientists and astronomers. A well-formulated design is a must to provide easy and optimal access to the SDSS data for this wide range of users. SkyServer is a very large website, offering views and data for over 80 million astronomical phenomena, totaling over one-and-a-half terabytes. Our copy of the usage log data is approximately 35 gigabytes and spans only May 2003 to October 2004. One of the challenges in dealing with freely accessible large websites is that many of the usage patterns are too complex to cleanly distinguish one from another. Another challenge is that the signal-to-noise ratio is very low. That is to say, there are very large numbers of information-poor web requests (e.g. erroneous web requests, web service error conditions, button images, navigational images, logos, cascading style sheets) relative to informative requests (e.g. successful web service requests, web pages and images of astronomical phenomena). Simple aggregate analysis of the web logs reveals:

less than 0.4% of all web requests are informative requests

approximately 38.3% are made by web robots

approximately 17.4% of these web robots are using a single web service

of the requests made by humans, over 73.3% of the user sessions don’t progress past three pages


What this tells us is that web robots comprise the single most common use pattern. Furthermore, the majority of users only visit for short sessions. At this point, simple aggregate analysis cannot reveal deeper insights about usage analysis.

5. SYSTEM DESCRIPTION Our methodology integrates well-researched web usage mining techniques and a novel reflective multimodal user interface. Figure 2 and figure 6 show the flow diagram, including usage pattern discovery, usage pattern analysis, information goal extraction, and the user flow prediction process. We first describe the specifics of our approach to web usage mining and then describe how our user interface offers a unique means of experiencing the data.

5.1 Web Usage Visualization Web usage mining discovers how users are using the website by extracting patterns from preprocessed log information and integrating these patterns with inferred data from other data sources. The overview of our approach for usage visualization is shown in figure 2.

Figure 2. Web Usage Mining Flow (Pattern Analysis integrates with Web Content Mining)

The salient steps in the process include:

Data Preprocessing: We begin by extracting the data from a relational database management system (RDBMS) to text files. This is to preserve the original data and to facilitate data cleaning. We then process the text files, scrubbing the data clean by validating data against the database schema and eliminating image accesses which are associated with the website interface. Because SkyServer is a multimedia site, images of astronomical phenomena are intrinsic to the content; accesses to these images must be preserved. Finally, the data is recreated with the exact same schema in another RDBMS to enable pattern discovery with SQL queries.

Definition of Unique Users: Central to web usage analysis is the idea that users are discrete entities that exhibit categorical behavior in consuming web content. To categorize the behaviors, it is essential to identify each user. Prior work also explores the concerns regarding what constitutes a distinct user [21]. We define a unique user as a distinct (IP address, user-agent) pair.

Definition of User Sessions: Patterns arise from one or more page visits where a user consumes a small chunk of web content. Studies have shown that user sessions are typically delimited by a timeout value of 25 minutes [20]; we round this up and use a timeout threshold of 30 minutes. For each discovered session, we cache the starting time and the duration of the session as a whole.

Usage Pattern Analysis: We use direct information from the web usage logs and correlate these data with one another. This correlation gives the web designer a broad overview of how the web site is being used. From this eagle’s-eye perspective, the web designer can freely formulate hypotheses based on observed patterns, or decide that the current context is too broad for analysis. The web designer can use the visualization system to constrain the context further, following up on their line of reasoning (or search for a train of thought). By zooming in to concrete details, the web designer can then support a hypothesis with concrete data, or reject it and begin a new one. This allows the user to follow deep lines of analysis. This is different from other studies [5, 9, 27], which focus on algorithmic approaches rather than our explore-and-discover approach.
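The unique-user and session definitions above can be sketched as follows. This is a minimal illustration under stated assumptions: log records are taken to be already cleaned and reduced to (IP address, user-agent, timestamp, URL) tuples, and the function name `sessionize`, the record layout, and the sample values are hypothetical rather than the schema used in the system.

```python
from datetime import datetime, timedelta
from itertools import groupby

SESSION_TIMEOUT = timedelta(minutes=30)  # threshold stated in the text


def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Group requests into sessions per unique user.

    requests: iterable of (ip, user_agent, timestamp, url) tuples
    (an assumed record layout). A unique user is an (ip, user_agent) pair;
    a gap longer than `timeout` between consecutive requests of the same
    user starts a new session.
    """
    sessions = []
    ordered = sorted(requests, key=lambda r: (r[0], r[1], r[2]))
    for user, hits in groupby(ordered, key=lambda r: (r[0], r[1])):
        current = None
        for _, _, ts, url in hits:
            if current is None or ts - current["last"] > timeout:
                current = {"user": user, "start": ts, "last": ts, "pages": []}
                sessions.append(current)
            current["pages"].append(url)
            current["last"] = ts
    for session in sessions:
        # Cache the duration alongside the start time.
        session["duration"] = session["last"] - session["start"]
    return sessions


# Illustrative records only; not taken from the SkyServer logs.
sample = [
    ("10.0.0.1", "BrowserA", datetime(2004, 5, 1, 10, 0), "/en/"),
    ("10.0.0.1", "BrowserA", datetime(2004, 5, 1, 10, 10), "/en/sdss/"),
    ("10.0.0.1", "BrowserA", datetime(2004, 5, 1, 11, 0), "/en/tools/"),
    ("10.0.0.2", "BrowserB", datetime(2004, 5, 1, 10, 5), "/en/"),
]
sessions = sessionize(sample)
for s in sessions:
    print(s["user"], s["start"], s["duration"], s["pages"])
```

The same grouping could equally be expressed as SQL over the recreated database; the Python form is used here only to make the timeout rule explicit.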

Figure 3. Web Usage Log Visualization (no query)

Interactive Visualization Interface: Another distinctive feature of our system is its query and selection interface. Our interface has the following four attributes: it employs simple conceptual models (a map, a timeline, and a chart); the query space is the same as the presentation space; it considers state information (through the reflection mechanism); and it promotes perceptual analysis and exploration. We believe that using interactive visualization interfaces is a practical way to engage human cognitive pattern recognition. The interactive visualization interface combines the query interface and visualization interface in a fashion similar to popular spreadsheet applications. That is, the visualization interface has


elements that the user can manipulate to formulate a query. Once the manipulation is done, the visualization interface changes to reflect the result of the query. In the case of the spreadsheet example, changing a few cells automatically causes the relevant sums to update.

Reflective Multimodal User Interface: The proposed system provides a user interface that leverages common and easy-to-use conceptual models (a timeline, a map, and a chart). These elements present the various dimensions of the web log data in a way that a web designer can quickly scan and comprehend. The chart displays log information correlation; the aspects of the log data to be correlated are selected by the web designer. This presentation mode clearly reveals interesting relationships between different aspects of the log data to the web designer, such as hourly periodicity in traffic (e.g. traffic spikes at 10 am), or the seasonal correlation with session duration (e.g. longer sessions during mid-to-late summer).

Log Information Correlation: One of the features that our user interface offers is direct correlation of the information contained within (or directly derived from) the usage log. The correlation domains include: browsers (web agents), page, entry page, exit page, date, user session duration, day of the week, month and hour. The correlation ranges include: user session duration, hits, user sessions, and unique users. The user can choose to chart any domain as the independent value, and any range as the dependent value. Some of the choices are non-informative (e.g. duration per duration) and are not allowed. The user interface is demonstrated in figure 4.
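The domain and range pairing can be sketched as a simple group-and-aggregate step. The sketch below is only an assumption about how such a chart's data might be computed; the function `correlate`, the extractor and aggregate helpers, and the record fields `ip`, `agent` and `time` are hypothetical names, and the sample hits are illustrative.

```python
from collections import defaultdict
from datetime import datetime


def correlate(hits, domain, value_range):
    """Group hits by a domain extractor and reduce each group with a range aggregate."""
    groups = defaultdict(list)
    for hit in hits:
        groups[domain(hit)].append(hit)
    return {key: value_range(group) for key, group in sorted(groups.items())}


# Example domain: hour of the day (independent value).
def by_hour(hit):
    return hit["time"].hour


# Example ranges: hits and unique users (dependent values).
def count_hits(group):
    return len(group)


def count_unique_users(group):
    return len({(hit["ip"], hit["agent"]) for hit in group})


# Illustrative records only.
sample_hits = [
    {"ip": "10.0.0.1", "agent": "BrowserA", "time": datetime(2004, 7, 1, 10, 5)},
    {"ip": "10.0.0.1", "agent": "BrowserA", "time": datetime(2004, 7, 1, 10, 20)},
    {"ip": "10.0.0.2", "agent": "BrowserB", "time": datetime(2004, 7, 1, 10, 45)},
    {"ip": "10.0.0.2", "agent": "BrowserB", "time": datetime(2004, 7, 1, 14, 0)},
]

hits_per_hour = correlate(sample_hits, by_hour, count_hits)
users_per_hour = correlate(sample_hits, by_hour, count_unique_users)
print(hits_per_hour)
print(users_per_hour)
```

Keeping the domain and the range as separate functions mirrors the interface's free pairing of an independent and a dependent value.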

Figure 4. Domain and Range for Log Information Correlation

Spatial Information: Bearing in mind the cautions of [21] with regard to user identity, we search for evidence of the location of the user. We extract information from various trustworthy web references. By plotting this information on a map, we can easily distinguish where the traffic (e.g. hits, sessions, visitors, etc.) is coming from. The size of the dots corresponds to the traffic contribution of the location; bigger dots mean more traffic. Correlating this information with aggregate user goals may give the web designer clues on how to further customize localization for their site. Our interface also supports reflective interaction for spatial data. By clicking on the map, the user is shown a breakdown of the log information correlation by locality. By clicking on a specific city (indicated by red dots on the map), the user can drill down to view information with the appropriate spatial constraints. Likewise, clicking on a specific location in the geographical breakdown for information correlation will highlight that geographical location.

Temporal Information: Log information includes timestamps. We can take slices of time (e.g. 2 weeks, 1 year, or 3 years) and constrain the spatial information and log information accordingly. This gives the web

designer insights on seasonality or recurring events. The user interface is displayed in figure 5.

Usage Case Selection for Content Analysis: A web designer can drag-select a set of locations on the map or drag-select a set of data points on the chart. All sessions matching these criteria are then selected. The web requests for each session are recorded and forwarded for web content analysis.

Figure 5. Timeline Control

5.2 Web Content Mining: The usage pattern visualization stage provides information about:

How users are using the website

However, it does not provide information or insights about:

What users are looking for

Users’ level of success in achieving their information goal

Web content mining techniques are used to extract the information from all the pages visited during a session. The information goal (what users are looking for) is a subset of the information extracted from the pages visited during the session. We also compute the correlation between the information goal and the links present in the web pages of the website. We use the correlation value to measure the usability of the website (users’ level of success in achieving their information goal) with respect to the information goal. Thus, in conjunction with web usage mining, content mining provides a comprehensive perspective on usage patterns. Given a web session identified through web usage mining, we:

perform web content analysis

extract the information goal related to the web session

compute and compare the optimal (shortest) path with the user path from the start page to the information goal

calculate the overall user flow on the site for the extracted information goal

Comparison between the computed shortest path and a specific user path (taken from the web logs) provides information about how successful the user was in reaching the information goal optimally. Additionally, computation of the user flow provides a quantitative measure of how successful other users with a similar goal will be in reaching their destination.

Web Content Analysis: To extract a user’s information goal, we first need to extract the information stored in each page of the website. We then form the user’s information goal as a possible subset of the summary of the extracted information belonging to the pages visited during a session. Using a parser we extract the topology of the given website and then construct a vector of the web pages present in the website and a structural adjacency matrix T. If there exists a link from page i to page j, then

$$ T(i, j) = 1.0 $$
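As a rough illustration of the topology extraction, the sketch below builds a page vector and a structural adjacency matrix from already-fetched HTML, using only the Python standard library. The paper does not say which parser was used; the class `LinkCollector`, the function `adjacency_matrix`, the choice to leave unlinked entries at 0.0, and the example.org pages are all assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href values of anchor tags on one page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def adjacency_matrix(site):
    """site maps page URL -> HTML text (assumed to be fetched already).

    Returns (pages, T): the page vector and a matrix with T[i][j] = 1.0
    when page i links to page j. Links leaving the site are ignored.
    """
    pages = sorted(site)
    index = {url: i for i, url in enumerate(pages)}
    T = [[0.0] * len(pages) for _ in pages]
    for url, html in site.items():
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            target = urljoin(url, href)  # resolve relative links
            if target in index:
                T[index[url]][index[target]] = 1.0
    return pages, T


# Hypothetical two-page site.
example_site = {
    "http://example.org/": '<a href="data.html">Data</a>',
    "http://example.org/data.html": (
        '<a href="/">Home</a> <a href="http://elsewhere.test/">Other</a>'
    ),
}
pages, T = adjacency_matrix(example_site)
print(pages)
print(T)
```

For a site of SkyServer's size a sparse representation would likely be preferable to a dense list of lists.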


We also extract the content of each web page and construct a vector of terms as the collection of all unique terms present in the website. Using the terms vector and the web pages vector we construct the term-page matrix $TP_{TFIDF}$,

$$ TP_{TFIDF}(i, j) = TFIDF(t_i, p_j) \tag{1.0} $$

We calculate the importance of a term t in a page p using TFIDF (explained in section 3). We use the normalized TFIDF value, which also takes the length of the page into consideration when calculating a term’s importance. Normalized TFIDF is calculated as

$$ TFIDF = \left( \frac{tf}{N_{term}} \right) \times \log_2 \left( \frac{N_{page}}{df} \right) \tag{2.0} $$

where $tf$ is the frequency count of the term in a given page, $N_{term}$ is the total number of terms in the page, $N_{page}$ is the total number of documents in the collection, and $df$ is the number of pages in which the term occurs.
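A minimal sketch of equation (2.0) follows, together with the way such per-page weights can be combined with page importance values to rank the terms of a session, which is the step the text goes on to describe. The function names `tfidf_matrix` and `rank_terms`, the nested-dictionary layout of the term-page matrix, and the toy pages are assumptions for illustration; tokenization and stop-word handling are left out.

```python
import math
from collections import Counter


def tfidf_matrix(pages):
    """pages maps page name -> list of terms (an assumed, pre-tokenized input).

    Returns {term: {page: weight}} with
    weight = (tf / N_term) * log2(N_page / df), as in equation (2.0).
    """
    n_page = len(pages)
    df = Counter(term for terms in pages.values() for term in set(terms))
    tp = {}
    for page, terms in pages.items():
        n_term = len(terms)
        for term, tf in Counter(terms).items():
            weight = (tf / n_term) * math.log2(n_page / df[term])
            tp.setdefault(term, {})[page] = weight
    return tp


def rank_terms(tp, importance, top_k=20):
    """Score each term as the importance-weighted sum of its TFIDF values.

    importance maps visited pages to their importance value; pages that
    were not visited contribute 0.0. Terms with a zero score are dropped.
    """
    scores = {
        term: sum(w * importance.get(page, 0.0) for page, w in row.items())
        for term, row in tp.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [term for term, score in ranked[:top_k] if score > 0]


# Toy collection; the terms are illustrative, not taken from SkyServer.
toy_pages = {
    "data": ["sdss", "data", "pipeline", "data"],
    "instruments": ["sdss", "telescope", "camera"],
    "home": ["sdss", "welcome"],
}
tp = tfidf_matrix(toy_pages)

# Session that visited two pages, weighted equally.
goal = rank_terms(tp, {"data": 1.0, "instruments": 1.0})
print(goal)
```

A term that occurs on every page, such as "sdss" in the toy data, receives a weight of zero and therefore never enters the ranked list.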

Figure 6: Web Content Flow Diagram, showing Integrated Web Usage Mining Flow (from Pattern Analysis Step)

Information Goal Extraction: For the extraction of the information goal, we first extract a list of terms from each page that is visited during the given session. Each of the visited pages is assigned an importance value. We calculate the importance of each term in the list as the summation of its TFIDF values corresponding to the visited pages it belongs to. Before summation, the TFIDF value of each term in a visited page is multiplied by the importance value of the page. Thus, the importance of a term in the list also depends on the visited pages it belongs to. For example, if we assign the highest importance value to the final page in the session, then terms appearing in the final page will be given higher importance than terms appearing in other pages. Finally, we sort the list of terms based on their importance and use the 20 most important terms to form the users’ information goal summary. We start by constructing a usage adjacency matrix U, a subset matrix of T. If a user visited the link from page i to page j, then

$$ U(i, j) = \begin{cases} 1.0 & : \text{user visits page } j \text{ from page } i \\ 0.0 & : \text{otherwise} \end{cases} \tag{3.0} $$

We then construct a vector I consisting of the list of web page importance values. If $I_p$ is the importance value assigned to page p visited during the given usage session, then we construct I as

$$ I(p) = \begin{cases} I_p & : p \text{ visited during given session} \\ 0.0 & : \text{otherwise} \end{cases} \tag{4.0} $$

All pages can be weighted equally, or all pages can be weighted in incremental order (progressively increasing the importance value), or the final page can be weighted the highest (with the remaining pages weighted equally). Now we can obtain the list of terms related to the given session by multiplying the $TP_{TFIDF}$ matrix with the vector I.

$$ L = TP_{TFIDF} \times I \tag{5.0} $$

We then sort the list of terms with respect to their importance value. The top 20 terms are used to form the user’s information goal related to the given usage pattern. Experiments show that our approach achieves a more accurate estimation of the information goal than the approach used in [7]. Details of the experiment and its results are provided in section 6.

Shortest Path Computation and Comparison: We compute the shortest path from the start page to the final page of the user session using Breadth First Search (BFS). Our underlying assumption is that the shortest path represents the optimal (most direct) path to the desired information goal. Comparing the actual user path with the optimal shortest path indicates how well the links are organized in the website. To compare a user’s path with the shortest path we use a simple greedy algorithm. We first compare the final page of the shortest path with the pages in the user’s path. We start the comparison from the final page of the user’s path and move towards the start page until we find a match or all pages in the user’s path have been compared. For every mismatch we assign a score of -1; if a matching page is found we assign a score of +1, mark the page as “matched”, and begin the next iteration. We start the next iteration

by taking the page prior to the final page of the shortest path and comparing it with the pages in the user path. We repeat the iteration until all the pages in the shortest path have been sequentially compared with the pages in the user path. At the end we sum the scores for all the pages to obtain the total match/mismatch score for the path. The difference between the total match/mismatch score and the shortest path length gives the measure of similarity between the user path and the shortest path. A score equal to the length of the shortest path means an exact match, while a score less than the length of the shortest path means the user path differs from the shortest path. For example, consider the user's path and the shortest path shown in Figure 7. For explanation purposes, we have labeled the pages of the user's path as A, B, C, D and E, and the pages of the shortest path as I, II and III. We first take page III of the shortest path and compare it with the pages in the user's path, starting from page E and moving towards A. Page III matches page E, so we assign a score of +1 and mark page E as “matched”. In the next iteration we take page II and start the comparison from page D, excluding E, which is already matched. Page II matches page D, so we assign a score of +1 and mark page D as “matched”. We


now take page I and start comparing from page C. We find two mismatches, at page C and page B, so a mismatch score of -1 is assigned to each of these pages. Finally, page I matches page A, so we assign a score of +1 to page A. We then calculate the total score as

$$(+1) + (+1) + (-1) + (-1) + (+1) = 1$$

The user did not follow the direct link from page A to page D, but instead went from page A to page B, then to page C, and then to page D. A similar approach can be used to match the pages starting from the start page. In this paper we begin the comparison from the final page because we weight the final page with the highest importance value for the given web session. By varying the page weights depending on the specific context, variants of this analysis can be conducted.
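The shortest-path search and the backward greedy comparison described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function names and the small link graph are hypothetical, and the sketch assumes that a page of the user path, once compared, is never revisited in a later iteration.

```python
from collections import deque


def shortest_path(links, start, goal):
    """Breadth-first search over {page: [linked pages]}; returns a page list or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page == goal:
            # Walk the parent pointers back to the start page.
            path = []
            while page is not None:
                path.append(page)
                page = parents[page]
            return path[::-1]
        for nxt in links.get(page, []):
            if nxt not in parents:
                parents[nxt] = page
                queue.append(nxt)
    return None


def match_score(user_path, short_path):
    """Backward greedy comparison: +1 per matched page, -1 per mismatched page."""
    score = 0
    pos = len(user_path) - 1  # next uncompared position in the user path
    for target in reversed(short_path):
        while pos >= 0:
            matched = user_path[pos] == target
            score += 1 if matched else -1
            pos -= 1
            if matched:
                break
    return score


# Hypothetical link graph reproducing the A..E example from the text.
links = {"A": ["B", "D"], "B": ["C"], "C": ["D"], "D": ["E"]}
user = ["A", "B", "C", "D", "E"]
best = shortest_path(links, "A", "E")
print(best, match_score(user, best))
```

For this graph the search returns the three-page path A, D, E, and the comparison yields a total of 1, matching the worked example; a user path identical to the shortest path would score 3, the shortest path length.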

Figure 7: Comparison between users’ path and shortest path

User Flow Computation: Using the information goal, we compute the related user flow through the website. User flow computation lets us measure how successfully other users with a similar information goal will be able to reach their destination page. For user flow computation, we use a technique based on information foraging theory [19]. Information foraging theory relates human sense-making behavior to users’ browsing behavior. That is, users try to predict the information stored in a distal page by looking at the text or graphical snippets present on the link pointing to that page. We calculate the flow of users from one page to another based on the correlation between the users’ information goal and the information stored in the link between the two pages.

• Calculation of Information Correlation: We first calculate the correlation between the users’ information goal and the information stored in a link. The total information correlation is the sum of the information in the link that matches the information goal. That is, we sum the TFIDF values of all the terms that are present in both the link text and the information goal. Where the link has no text, we use the text in the title of the distal page instead. For every term t_k present both in the information goal vector and in the link text between page i and page j (or the title of page j), we construct the information correlation matrix, IC, as follows:

$$IC(i, j) = \sum_{k} \mathrm{TF.IDF}(t_k) \qquad (6.0)$$

Our approach is based on the approach used in [7]. The system described in [7] uses an information goal hand-picked by the web designer. In this work we extend the idea by using an information goal computed automatically by the system. We use the information goal extracted from real usage patterns to calculate the user flow.
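The information correlation in equation (6.0) can be sketched for a single link as follows. The function name, the lowercase whitespace tokenization (no stemming), and the sample goal values are assumptions made for illustration only.

```python
def information_correlation(goal_tfidf, link_text, distal_title=""):
    """IC(i, j): sum of TF.IDF values of goal terms that occur in the link text.

    Falls back to the title of the distal page when the link has no text.
    Tokenization here is a plain lowercase whitespace split (an assumption).
    """
    text = link_text if link_text.strip() else distal_title
    terms = set(text.lower().split())
    return sum(value for term, value in goal_tfidf.items() if term in terms)


# Made-up information goal with TF.IDF values.
goal_tfidf = {"wavelength": 0.8, "light": 0.5, "color": 0.4}

print(information_correlation(goal_tfidf, "Light from stars"))
print(information_correlation(goal_tfidf, "", distal_title="What is color"))
print(information_correlation(goal_tfidf, "Download"))
```

A link whose text shares no term with the information goal gets a correlation of zero and therefore attracts no simulated users in the flow computation.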

• User Flow Calculation: The user flow is computed by simulating users through an activation function A. This function simulates users entering through the entry page of the session and browsing through different pages of the website. The percentage of users in a page at a given time depends on the total information correlation value of all the links pointing to the page. The dampening factor α controls the proportion of users who continue browsing to the next step of the simulation.

$$A(t) = \alpha \times IC \times A(t-1) + E \qquad (7.0)$$

E simulates users flowing through the links from the entry (or start) page of the usage pattern. If i is start page of the usage pattern then,

$$E[i] = \begin{cases} 1.0 & \text{if } i \text{ is the start page} \\ 0.0 & \text{otherwise} \end{cases}$$

Users’ Path:
A: http://skyserver.sdss.org/dr2/en/
B: http://skyserver.sdss.org/dr2/en/sdss/
C: http://skyserver.sdss.org/dr2/en/sdss/release/
D: http://skyserver.sdss.org/dr2/en/sdss/pubs/
E: http://skyserver.sdss.org/dr2/en/sdss/dr2paper/

Shortest Path:
I: http://skyserver.sdss.org/dr2/en/
II: http://skyserver.sdss.org/dr2/en/sdss/pubs/
III: http://skyserver.sdss.org/dr2/en/sdss/dr2paper/

The initial activation vector is A(1) = E. The final activation vector A(n) gives the percentage of users in each node of the website after n simulation steps.
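The iteration in equation (7.0) can be sketched as follows. Several details are assumptions made for the sketch and are not stated in the text: each row of IC is normalized so that the outgoing proportions of a page sum to 1, IC[i][j] is read as the link from page i to page j (so users arriving at page j are summed over all source pages i), and the values of α, the step count, and the sample matrix are illustrative.

```python
def simulate_user_flow(ic, start, alpha=0.85, steps=10):
    """Iterate A(t) = alpha * IC * A(t-1) + E, starting from A(1) = E."""
    n = len(ic)
    # Assumption: normalize each row so a page's outgoing proportions sum to 1.
    trans = []
    for row in ic:
        total = sum(row)
        trans.append([v / total if total else 0.0 for v in row])
    # Entry vector E: 1.0 for the start page, 0.0 otherwise.
    entry = [1.0 if i == start else 0.0 for i in range(n)]
    activation = entry[:]  # A(1) = E
    for _ in range(steps - 1):
        # Users arriving at page j come from every page i that links to j.
        activation = [
            alpha * sum(trans[i][j] * activation[i] for i in range(n)) + entry[j]
            for j in range(n)
        ]
    return activation


# Made-up correlation matrix for three pages; page 0 is the start page.
ic = [
    [0.0, 3.0, 1.0],
    [0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],
]
flow = simulate_user_flow(ic, start=0, alpha=0.5, steps=3)
print([round(100 * v, 2) for v in flow])
```

Because E is added at every step, the start page keeps an activation of 1.0 (100%) in this sketch, while pages reached through weakly correlated links receive a smaller share.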

Figure 8. Visualization of user session (green dotted line), related shortest path (blue dashed line), and user flow (orange solid line)

5.3 User session visualization: Figure 8 shows the visualization for a sample user session. We construct a directed graph to represent the subset of the web structure that is related to the user session being explored. For a given user session, we display:

• the path followed by the user
• the shortest path suggested by the system
• the user flow prediction based on the information goal

Orange circles represent the web pages that have some relevance to the user goal. The size of the red rectangular bar represents the total percentage of users that will visit the page. The orange directed lines (solid) linking the circles represent the direction of the user flow. Similarly, the green directed lines (dotted) and blue directed lines (dashed) represent the actual user's path and the system-computed shortest path, respectively. Pages that the user visits during the session, but are not present in


the user flow graph, are represented by green squares. Likewise, blue triangles represent pages that are present only in the computed shortest path. When the mouse pointer is moved over a square in the graph, information about the page (link information and a thumbnail image of the webpage) and the percentage of user flow distribution in the page are displayed as a tool tip. The tool tip for the final page of a session also displays the match/mismatch score between the user path and the shortest path. This graphical representation helps the web designer to visually analyze a user session and its related information. For any user session, the graph shows:

• the difference between the user path and the shortest path
• the page where the user path diverges from the shortest path
• the predicted user flow to related pages for the extracted information goal

This lets the web designer visually identify the pages that have erroneous correlation values, which drive users away from their goal. The system also gives web designers the option to visualize the users’ paths, the user flow graph, and the shortest path separately.

6. RESULTS

Visualization Predictive Usability: We compare the usability of our visualization system with that of a major commercial product using predictive usability engineering, specifically the keystroke-level model [6]. Briefly, this model gives averaged times for typical human-computer interaction (HCI) events. For example, a single mouse click takes 0.20 seconds on average, and pointing to an icon with a mouse takes 1.10 seconds on average. The model predicts the time for a single use case as the simple sum of the average times of its HCI events. The tasks were selected as typical tasks that web designers perform as a regular part of their duties.

Table 1: Predictive Usability Comparison of a Major Commercial Product (WebTrends) to Proposed System

| Task | WebTrends | Our System |
|---|---|---|
| Initial Set-up (using 8-letter username and password) | 14.9s | 1.35s |
| Finding Hits/Hour | 1.35s | 4.15s |
| Finding Hits/Page | 1.35s | 4.15s |
| Finding Sessions/Hour | 2.65s | 5.65s |
| Finding Hits/Browser for a week in December | 6.75s | 7.15s |
| Finding Hits/Hour from the USA | N/A | 2.85s |
| Finding location of most sessions for two months, starting in June | N/A | 7.15s |
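The keystroke-level prediction used for Table 1 is a plain sum of per-event average times. A minimal sketch follows, using only the two operator times quoted in the text; the event names, the function name, and the sample task are hypothetical and do not correspond to any row of Table 1.

```python
# Average times in seconds, as quoted in the text.
OPERATOR_SECONDS = {
    "click": 0.20,  # single mouse click
    "point": 1.10,  # pointing to an icon with the mouse
}


def predict_task_time(events, operator_seconds=OPERATOR_SECONDS):
    """Predicted task time: the simple sum of the average time of each event."""
    return sum(operator_seconds[event] for event in events)


# Hypothetical task: point at a menu, click it, point at an entry, click it.
task = ["point", "click", "point", "click"]
print(f"{predict_task_time(task):.2f}s")
```

Because the prediction is additive, two interfaces can be compared for a task simply by listing the events each one requires.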

6.1 Case Study

Session Analysis: In this case study we analyze the user session shown in Figure 10. First, the information goal related to the user session is extracted. The extracted information goal is (ordered by rank): wavelength, light, wave, spectrum, color, violet, visible, slider, electromagnet, space, angstrom, rays, diagram, travel, peak, longer, radiation.

User Session:
http://skyserver.sdss.org/dr1/en/proj/advanced/color/sdssfilters.asp
http://skyserver.sdss.org/dr1/en/proj/advanced/color/fromstars.asp
http://skyserver.sdss.org/dr1/en/proj/advanced/color/temperature.asp
http://skyserver.sdss.org/dr1/en/proj/advanced/color/thermalrad.asp
http://skyserver.sdss.org/dr1/en/proj/advanced/color/peakwavelength.asp
http://skyserver.sdss.org/dr1/en/proj/advanced/color/amounts.asp
http://skyserver.sdss.org/dr1/en/proj/advanced/color/explore.asp
http://skyserver.sdss.org/dr1/en/help/download
http://skyserver.sdss.org/dr1/en/help/docs/algorithm.asp
http://skyserver.sdss.org/dr1/en/proj/advanced/color/sdssfilters.asp
http://skyserver.sdss.org/dr1/en/proj/advanced/hr
http://skyserver.sdss.org/dr1/en/proj/advanced/color/sdssfilters.asp
http://skyserver.sdss.org/dr1/en/proj/advanced/color/explore.asp
http://skyserver.sdss.org/dr1/en/proj/advanced/color/amounts.asp
http://skyserver.sdss.org/dr1/en/proj/advanced/color/whatis.asp

Figure 10. User session for Case Study

Figure 11. Session visualization of the user session for case study

A closer look at the path followed by the user during this session reveals that the user visits many related pages and also repeatedly returns to the session start page, sdssfilters.asp. A possible hypothesis for this behavior is that the user was confused and was trying to reach the final goal page by visiting every probable page whose link information is related to the goal. Comparing the computed user flow diagram and the shortest path with the user path strengthens this hypothesis. There was a directly linked shortest path between the session's start page (sdssfilters.asp) and final page (whatis.asp), but the user failed to follow the link; the match/mismatch score between the user path and the shortest path is -11. The computed user flow distribution for the final goal page is 12.22%, but numerous other links in the website have an equally strong information correlation (±2%) between the user’s information goal and the link text. The link text of peakwavelength.asp even has a stronger correlation (22.32%) with the information goal than the link text of the final page. This ambiguous, false correlation shown by numerous links may be the main source of the confusion that drove the user away from the final goal page. Figure 11 provides the visualization of the user session, the related shortest path, and the related user flow distribution. Table 2


summarizes the predicted user flow distribution for the information goal related to the given user session (only pages with more than 5% of the distribution are included).

Table 2: Predicted user flow distribution

| Page address | Link text | Predicted user flow (%) |
|---|---|---|
| http://skyserver.sdss.org/dr1/en/proj/advanced/color/sdssfilters.asp | SDSS Filters (Start Page) | 100 |
| http://skyserver.sdss.org/dr1/en/proj/advanced/color/peakwavelength.asp | Peak Wavelengths | 22.32 |
| http://skyserver.sdss.org/dr1/en/proj/advanced/color/whatis.asp | What is color? (Final Page) | 12.22 |
| http://skyserver.sdss.org/dr1/en/proj/advanced/color/fromstars.asp | Light from Stars | 12.21 |
| http://skyserver.sdss.org/dr1/en/proj/advanced/color/ | Color | 12.03 |
| http://skyserver.sdss.org/dr1/en/proj/advanced/color/amounts.asp | Amounts of light | 11.91 |
| http://skyserver.sdss.org/dr1/en/proj/advanced/color/explore.asp | Colors of SDSS starts | 11.74 |
| http://skyserver.sdss.org/dr1/en/proj/advanced/color/research.asp | Color in research | 11.53 |
| http://skyserver.sdss.org/dr1/en/proj/advanced/hr | H-R Diagram | 9.84 |
| http://skyserver.sdss.org/dr1/en/proj/advanced/color/making.asp | Making a Diagram | 8.21 |
| http://skyserver.sdss.org/dr1/en/proj/advanced/color/colorcolor.asp | Color-color Diagrams | 8.22 |
| http://skyserver.sdss.org/dr1/en/proj/advanced/color/sdssstars.asp | Diagram for SDSS | 8.19 |

Evaluation of the information goal extraction algorithm: We compared our approach to information goal prediction with the approach discussed in [7]. We used our approach to compute the information goal related to a user session, and then used the approach discussed in [7] to compute the information goal for the same user session. In both cases we assigned the highest importance to the final page. The terms returned as possible information goals by the approach discussed in [7] were found to be less relevant to the final page the user visited. Table 3 shows the results of our experiments in detail. The terms ranked in the top five by our approach appear frequently in the final page, while the terms ranked in the top five by the approach in [7] occur least frequently in the final page. We achieved better results by using the usage matrix as opposed to the adjacency matrix used by [7]. With the approach proposed in [7], terms that appear in most of the pages infiltrate the results even though they might not be important in the final page. This is because of the noise introduced by the use of the adjacency matrix in a large, highly interlinked website like SkyServer [1].

6.2 Broad Visualization of Large Patterns

Web system engineers are keenly interested in finding out where the majority of the website traffic comes from. Visualizing this information helps to address localization issues and system performance decisions such as scheduling system downtime for minimal interruption, or where to place a redundant server for best workload distribution.

Table 3: Comparison between IUNIS algorithm and our approach

| Terms | Rank using our algorithm | Rank using algorithm in [7] | Frequency count of term in the page |
|---|---|---|---|
| telescop | 1 | 2 | 17 |
| mirror | 2 | 31 | 6 |
| meter | 3 | 30 | 8 |
| dome | 4 | 32 | 4 |
| focus | 5 | - | 4 |
| Data | - | 1 | 3 |
| schema | - | 3 | 1 |
| Imag | - | 4 | 3 |
| object | - | 5 | 2 |
| univers | - | 6 | 1 |

Figure 12. Visualization of query tool usage from the US and Europe

Figure 12 presents a visual representation of the web traffic of a specific substructure of the SkyServer website (the search and query tools) with regard to the location of the clients' internet service providers. In this particular example, the chart clearly shows that one of the search tools on SkyServer, shownearest.asp, is by far the most heavily used tool. The map representation shows that the majority of the usage comes from the US coastal regions (west coast not shown) and from around the Great Lakes region.

7. CONCLUSION

Through experimentation and case studies we have shown that discovering user behavior benefits from a two-tiered approach: high-level usage pattern visualization and detailed content mining. The usage visualization component of the system helps to provide a high-level understanding of website usage. It also offers web designers an exploratory system to discover specific usage patterns down to individual sessions. From the predictive usability model, we have shown that for a given set of tasks, our system has comparable usability to WebTrends, a commercial product. Web usage mining identifies and extracts various interesting usage patterns hidden in large sets of log data. Web content mining lets us further explore the usage patterns and provides more


insight about the pattern. These two techniques together provide a powerful means of inspecting user behavior. While the quantity of our experimental data is preliminary, the case studies show that the approach is usable and provides more insight into the usage patterns.

8. REFERENCES

[1] Sloan Digital Sky Survey project’s website SkyServer: http://skyserver.sdss.org/

[2] Banerjee, A., Ghosh, J. Clickstream Clustering using Weighted Longest Common Subsequences. Proceedings of the 1st SIAM International Conference on Data Mining: Workshop on Web Mining, 2001

[3] Blackmon M. H., Polson P. G., Kitajima M., Lewis C. Cognitive Walkthrough for the Web. ACM Proceedings of Conference on Human Factors in Computing Systems (CHI ’02), 2002.

[4] Blackmon M. H., Polson P. G., Kitajima M. Repairing Usability Problems Identified by the Cognitive Walkthrough for the Web. ACM Proceedings of Conference on Human Factors in Computing Systems (CHI ’03), 2003.

[5] Cadez, I., et al. Visualization of Navigation Patterns on a Website Using Model-Based Clustering. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD 2000), pp. 280-284, 2000.

[6] Card, S. K., and Moran, T. P. The Keystroke-Level Model for User Performance Time with Interactive Systems, Communications of the ACM, pp. 396-410, 1980

[7] Chi E. H., Pirolli P. L., Chen K., Pitkow J. Using Information Scent to Model User Information Needs and Actions on the Web. ACM Proceedings of Conference on Human Factors in Computing Systems (CHI ’01), 2001.

[8] Chi E. H., Rosien A., Supattanasiri G., Williams A., Royer C., Chow C., Robles E., Dalal B., Chen J., Cousins S. The Bloodhound Project: Automating Discovery of Web Usability Issues using the InfoScent Simulator. ACM Proceedings of Conference on Human Factors in Computing Systems (CHI ’03), 2003.

[9] Cooley, R. The Use of Web Structure and Content to Identify Subjectively Interesting Web Usage Patterns. 2003 ACM Transactions on Internet Technology 3(2), pp. 93-116, 2003.

[10] Cooley, R., Mobasher B., Srivastava, J. Web Mining: Information and Pattern Discovery on the World Wide Web. Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI ’97), pp. 558-567, 1997.

[11] Cooley, R., Mobasher B., Srivastava, J. Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems 1(1), 1999.

[12] Heer, J. and Chi, E. H. Identification of Web User Traffic Composition and Multimodal Clustering and Information Scent. Proceedings of the Workshop on Web Mining, SIAM Conference on Data Mining, Chicago, IL, pp. 51-58, April 2001.

[13] Heer, J., Chi, E. H. Separating the Swarm: Categorization Methods for User Sessions on the Web. ACM Proceedings of Conference on Human Factors in Computing Systems (CHI ’02) 4(1), pp. 243-250, 2002.

[14] Jin, X., Zhou, Y., Mobasher, B. Web Usage Mining Based on Probabilistic Latent Semantic Analysis. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD ’04), pp. 197-205, 2004.

[15] Krane D. E. and Raymer M. L. Fundamental Concepts of Bioinformatics. Benjamin Cummings, 1st Edition (2004)

[16] Kosala, R., and Blockeel H. Web Mining Research: A Survey. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD ’00) Explorations 2(1), pp. 1-15, 2000.

[17] Kroger, J. K., et al. Varieties of Sameness: the impact of relational complexity on perceptual comparisons, Cognitive Science 28(2004) pp. 335-358

[18] Netcraft October 2005 Web Server Survey, http://news.netcraft.com/archives/2005/10/04/october_2005_web_server_survey.html

[19] Pirolli P. L., and Card S. K. (1999) Information foraging. Psychological Review. 106: p. 643-675.

[20] Pirolli, P., Pitkow, J. and Rao, R. Silk from a sow’s ear: Extracting usable structures from the web. ACM Proceedings of Conference on Human Factors in Computing Systems (CHI ’96), pp. 118-125, 1996.

[21] Pitkow, J. Characterizing browsing strategies in the World-Wide Web. Computer Networks and ISDN Systems, 27(6) pp. 1065-1073, 1995.

[22] Pitkow, J. and Pirolli, P. Mining longest repeated subsequences to predict World Wide Web surfing. Proceedings of the USENIX Conference on Internet, 1999.

[23] Schütze, H., Manning, C. (1999) Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.

[24] Srivastava, J. et al. Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD ’00) Explorations 1(2), pp. 12-23, 2000.

[25] Szalay A. S., Gray J., Thakar A. R., Kunszt P. Z., Malik T., Raddick J., Stoughton C., vandenBerg J.. The SDSS SkyServer- Public Access to the Sloan Digital Sky Server Data. ACM Proceedings of Special Interest Group on Management of Data (SIGMOD 2002).

[26] Szalay A. S., Kunszt P. Z., Thakar A., Gray J., Slutz D., Brunner R. J. Designing and Mining Multi-Terabyte Astronomy Archives: The Sloan Digital Sky Survey.

[27] Wang, Q., Makaroff, D. J., Edwards, H. K. Characterizing Customer Groups for an E-commerce Website, ACM Conference on Electronic Commerce (ACMEC ’04), pp. 218-227, 2004.

[28] Zaïane, O. R., et al. Mining MultiMedia Data. Proceedings of the conference of the Centre for Advanced Studies on Collaborative research, pp. 24

SFSU-CS-TR-21
