discovery of rare sequential topic patterns …jrret.com/archives/2016/paper/oct/jrret16oct2.pdf ·...

12
Journal of Recent Research in Engineering and Technology, 3(10), OCT 2016, PP.13-24 ISSN (Online): 2349 –2252, ISSN (Print):2349 –2260 © Bonfay Publications, 2016 13 Research Article DISCOVERY OF RARE SEQUENTIAL TOPIC PATTERNS USERAWARE FREQUENT MINING IN DOCUMENT STREAMS 1 T.Karthik, 2 Mrs.V.Bakyalakshi 1 Research Scholar, 2 Associate Professor 1&2 Department of Information Technology 1&2 Sri Jayendra sarswathy maha vidyalaya College of Arts and Science, Coimbatore, Tamil Nadu, India Received 27 AUG 2016; Accepted 24 OCT 2016 Abstract The process of data mining produces various patterns from a given data source. The most recognized data mining tasks are the process of discovering frequent item sets, frequent sequential patterns, frequent sequential rules and frequent association rules. Numerous efficient algorithms have been proposed to do the above processes. In my research work Textual documents created and distributed on the Internet are ever changing in various forms in order to characterize and detect personalized and abnormal behaviors of Internet users, proposeing Sequential Topic Patterns (STPs) and formulate the problem of mining User-aware Rare Sequential Topic Patterns (URSTPs) in document streams on the Internet. They are rare on the whole but relatively frequent for specific users, so can be applied in many real-life scenarios, such as real-time monitoring on abnormal user behaviors. Research work consider three phases: preprocessing to extract probabilistic topics and identify sessions for different users, generating all the STP candidates with (expected) support values for each user by pattern-growth, and selecting URSTPs by making user- aware rarity analysis on derived STPs. Experiments shows that our approach can indeed discover special users and interpretable URSTPs effectively.. Keywords: KDD, User-aware Rare STPs Cross Layer, Failure Recovery, Link Stability. 1. INTRODUCTION Data mining or knowledge discovery in databases (KDD) is a collection of exploration techniques based on advanced analytical methods and tools for handling a large amount of information to extract the hidden information from the databases. These techniques can find novel patterns that may assist an enterprise in understanding the business better and in forecasting. Data mining is a collection of techniques for efficient automated discovery of previously unknown valid, novel, useful and understandable patterns in large databases. The patterns must be actionable so that they may be used indecision making process.”Almost all enterprise collects a variety of information electronically. In the recent years the cost of storage of data has reduced considerably. The enterprises have started to realize that the information accumulated over the years constitute

Upload: others

Post on 19-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: DISCOVERY OF RARE SEQUENTIAL TOPIC PATTERNS …jrret.com/archives/2016/paper/oct/jrret16Oct2.pdf · works analyzed the evolution of individual topics to detect and predict social

Journal of Recent Research in Engineering and Technology, 3(10), OCT 2016, PP.13-24

ISSN (Online): 2349 –2252, ISSN (Print):2349 –2260 © Bonfay Publications, 2016

13

Research Article

DISCOVERY OF RARE SEQUENTIAL TOPIC PATTERNS USERAWARE FREQUENT MINING IN DOCUMENT STREAMS

1T.Karthik, 2Mrs.V.Bakyalakshi 1Research Scholar, 2Associate Professor

1&2Department of Information Technology 1&2 Sri Jayendra sarswathy maha vidyalaya College of Arts and Science, Coimbatore, Tamil

Nadu, India Received 27 AUG 2016; Accepted 24 OCT 2016

Abstract

The process of data mining produces various patterns from a given data source. The most recognized data mining tasks are the process of discovering frequent item sets, frequent sequential patterns, frequent sequential rules and frequent association rules. Numerous efficient algorithms have been proposed to do the above processes. In my research work Textual documents created and distributed on the Internet are ever changing in various forms in order to characterize and detect personalized and abnormal behaviors of Internet users, proposeing Sequential Topic Patterns (STPs) and formulate the problem of mining User-aware Rare Sequential Topic Patterns (URSTPs) in document streams on the Internet. They are rare on the whole but relatively frequent for specific users, so can be applied in many real-life scenarios, such as real-time monitoring on abnormal user behaviors. Research work consider three phases: preprocessing to extract probabilistic topics and identify sessions for different users, generating all the STP candidates with (expected) support values for each user by pattern-growth, and selecting URSTPs by making user-aware rarity analysis on derived STPs. Experiments shows that our approach can indeed discover special users and interpretable URSTPs effectively..

Keywords:

KDD, User-aware Rare STPs Cross Layer, Failure Recovery, Link Stability.

1. INTRODUCTION

Data mining or knowledge discovery in databases (KDD) is a collection of exploration techniques based on advanced analytical methods and tools for handling a large amount of information to extract the hidden information from the databases. These techniques can find novel patterns that may assist an enterprise in understanding the business better and in forecasting. “Data mining is a collection of

techniques for efficient automated discovery of previously unknown valid, novel, useful and understandable patterns in large databases. The patterns must be actionable so that they may be used indecision making process.”Almost all enterprise collects a variety of information electronically. In the recent years the cost of storage of data has reduced considerably. The enterprises have started to realize that the information accumulated over the years constitute

Page 2: DISCOVERY OF RARE SEQUENTIAL TOPIC PATTERNS …jrret.com/archives/2016/paper/oct/jrret16Oct2.pdf · works analyzed the evolution of individual topics to detect and predict social

T.Karthik Journal of Recent Research in Engineering and Technology

14

important strategic asset and there is potential business intelligence hidden in the large volumes of data. This intelligence can be a secret weapon on which the success of business may depend on.

Databases today can range in size into the terabytes more than 1,000,000,000,000 bytes of data. Within these masses of data lies hidden information of strategic importance. But when there are so many trees, how do you draw meaningful conclusions about the forest? The newest answer is data mining, which is being used both to increase revenues and to reduce costs. The potential returns are enormous. Innovative organizations worldwide are already using data mining to locate and appeal to higher-value customers, to reconfigure their product offerings to increase sales, and to minimize losses due to error or fraud.

Document streams are created and distributed in various forms on the Internet, such as news streams, emails, micro-blog articles, chatting messages, research pa- per archives, web forum discussions, and so forth. The contents of these documents generally concentrate on some specific topics, which reflect offline social events and users’ characteristics in real life. To mine these pieces of information, a lot of researches of text mining focused on extracting topics from document collections and document streams through various probabilistic topic models, such as classical and their extensions ,Taking advantage of these extracted topics in document streams, most of existing works analyzed the evolution of individual topics to detect and predict social events as well as user behaviors . to correlations among different topics appearing in successive documents

published by a specific user, so some hidden but significant information to reveal personalized behaviors has been neglected. In order to characterize user behaviors in published document streams, we study on the correlations among topics extracted from these documents, especially the sequential relations

Sequential Topic Patterns (STPs). Each of them records the complete and repeated behavior of a user when she is publishing a series of documents, and are suitable for inferring users’ intrinsic characteristics and psychological statuses. Firstly, compared to individual topics, STPs capture both combinations and orders of topics, so can serve well as discriminative units of semantic association among documents in ambiguous situations. Secondly, compared to document-based patterns, topic-based patterns contain abstract information of document contents and are thus beneficial in clustering similar documents and finding some regularity about Internet users. Thirdly, the probabilistic description of topics helps to maintain and accumulate the uncertainty degree of individual topics, and can thereby reach high confidence level in pattern matching for uncertain data.

For a document stream, some STPs may occur frequently and thus reflect common behaviors of involved users. Be- yond that, there may still exist some other patterns which are globally rare for the general population, but occur relatively often for some specific user or some specific group of users.

User-aware Rare STPs (URSTPs). Compared to frequent ones, discovering them is especially interesting and significant. Theoretically, it defines a new

Page 3: DISCOVERY OF RARE SEQUENTIAL TOPIC PATTERNS …jrret.com/archives/2016/paper/oct/jrret16Oct2.pdf · works analyzed the evolution of individual topics to detect and predict social

Journal of Recent Research in Engineering and Technology T.Karthik

15

kind of patterns for rare event mining, which is able to characterize personalized and abnormal behaviors for special users. Practically, it can be applied in many real-life scenarios of user behavior analysis, of documents, and are suitable for inferring users’ intrinsic characteristics and psychological statuses. Firstly, compared to individual topics, STPs capture both combinations and orders of topics, so can serve well as discriminative units of semantic association among documents in ambiguous situations. Secondly, compared to document-based patterns, topic-based patterns contain abstract information of document contents and are thus beneficial in clustering similar documents and finding some regularities about Internet users. Thirdly, the probabilistic description of topics helps to maintain and accumulate the uncertainty degree of individual topics, and can thereby reach high confidence level in pattern matching for uncertain data host, fault tolerance and electronic commerce. Only with those security guarantees and extensions, mobile agents will be used in a wide range of applications. The focus should be made upon fault tolerance in mobile agent systems.

Sentiment analysis is one of the text mining also known as opinion mining which classifies a text according to the sentimental polarities. There are also many names and quite different tasks, e.g. sentiment analysis, opinion mining, opinion extraction, sentiment mining, affect analysis, emotion analysis, review mining, etc. It helps to identify the mindset of a writer with respect to particular topic. This is very much useful for customers who wants to analyze the reviews of products before purchase or companies that monitors the customer feedback about their products. It involves in

building a system to collect and determine opinions about the product made in blog posts, comments, reviews or tweets. Both sentiment analysis and opinion mining express mutual meaning. Some authors state that Opinion mining and Sentiment Analysis has quite different meaning. Opinion mining selects and analyzes people opinion about a specific topic whereas sentiment analysis identifies the sentiment words expressed in it and then analyze them. A fundamental technology in many current opinion-mining and sentiment-analysis applications is classification.

Sentiment classification is a special task of text classification whose objective is to classify a text according to the sentimental polarities of opinions it contains favorable or unfavorable, positive or negative.. In a typical machine learning approach, a document (text) is modeled as a bag-of-words, i.e. a set of content words without any word order or syntactic relation information. In other words, the underlying assumption is that the sentimental orientation of the whole text depends on the sum of the sentimental polarities of content words. Although this assumption is reasonable and has led to initial success, it is linguistically unsound since many function words and constructions can shift the sentimental polarities of a text.

Sentiment Analysis or opinion mining is a most researched field of study which helps to determine the quality of an event, entity or its attributes, services offered based on the users feedback or opinion. The research began with treating the problem as a text classification problem where the reviews expressed a positive, negative or neutral opinion. This type of classification determined whether a

Page 4: DISCOVERY OF RARE SEQUENTIAL TOPIC PATTERNS …jrret.com/archives/2016/paper/oct/jrret16Oct2.pdf · works analyzed the evolution of individual topics to detect and predict social

T.Karthik Journal of Recent Research in Engineering and Technology

16

sentence is subjective or objective. However it becomes essential to perform a detailed analysis of a review to determine the features of product/service that was praised or criticized by a consumer. Also the consumers needed a rating for comparison with respect to the other providers for quick decision making when they want to buy a product or use a service

2. RELATED WORK

The contributions of various scholars are studied for analyzing the merits and demerits in order to enhance the consequences for making the system work better.

Authors: B. Rhodes and T. Starner Remembrance Agent (RA) is a program which can display list of documents relevant to user context. It runs without continuous user interaction. RA basically used to give suggestions to user which can be pursue or ignored as desired. Computer waits for user interaction, RA uses such wait cycle to perform search information which is relevant to user’s current scenario. RA reminds about relevant data of user task. RA gives suggestions in one line description at bottom of screen. EMACS-19, lisp used as front-end for RA. Front-end displays one-line suggestions and number used to indicate its relevancy. Full text document can be available and can be obtained if requested. A program which is used as back-end for RA, produces suggestions for given text query from pre-indexed documents. SMART is used to decide suggestions for RA, depending upon common word frequency.

M. Habibi and A. Popescu-Belis In our project, we also focus on building concise, diverse and relevant lists of documents which provide information to the users according to their need. From this paper,

we get the concept of merging lists of documents which are retrieved through multiple implicit queries which are formed for short conversations fragments. The goal of the method used in this paper is to obtain a unique and precise list of documents that can be recommended to the users in real time. The obtained list should cover the maximum number of implicit queries and thus topics. The proposed method rewards: a. Topic similarity – It means selecting the most relevant documents to the conversation. b. Topic diversity –It is related with covering of the maximum number of implicit queries and topics in a precise and relevant list of recommendations when more than one topic is discussed in the conversation. ACLD.

P. N. Garner using a just-in time retrieval system for extraction of useful information. We are using ACLD i.e., Automatic Content Linking Device (ACLD), a just-in-time document recommendation system for meetings, for our project. From this paper, we get the concept of our recommendation system ACLD. ACLD is a recommendation system which is used for analysing spoken input from one or more speakers and retrieving relevant documents. b. Content Linking: Scenarios of Use One of the main scenarios of use for the ACLD is for meeting rooms. The ACLD performs search of relevant documents for people in the meeting without interrupting the discussion flow. In other scenarios, ACLD is used for live or recorded lectures. K. J. Hammond the main functions is information retrieval. This retrieved information can be used for obtaining relevant and related documents. It is observed that our interactions with everyday applications, such as, word processors, web browsers etc. gives rich contextual information that can be used to

Page 5: DISCOVERY OF RARE SEQUENTIAL TOPIC PATTERNS …jrret.com/archives/2016/paper/oct/jrret16Oct2.pdf · works analyzed the evolution of individual topics to detect and predict social

Journal of Recent Research in Engineering and Technology T.Karthik

17

support just in time retrieval of relevant documents behavior for all possible patterns for which the author proposes a systematic method for contrasting the performance of online RDT protocols. J. Budzik Diverse Keyword Extraction from Conversation. a new method for keyword extraction that rewards both word similarity, to extract the most representative words, and word diversity, to cover several topics if necessary. They propose to build a topical representation of a conversation fragment, and then to select keywords using topical similarity while also rewarding the diversity of topic coverage, inspired by some summarization methods. Automatic Keyword Extraction from Document Using Conditional Random Field. In keywords extraction based on CRF is proposed and implemented. They collected documents from database of Information Center for Social Sciences of RUC‘, which is available at http://art.zlzx.org/. This method randomly chose 600 academic documents in the field of economics from the database.

Subba Rao described the speech is spontaneous and contains disfluencies and ill-formed sentences. It is thus questionable whether to adopt approaches that have been shown before to perform well in written text for automatic keyword extraction in meeting transcripts. This paper evaluates several different keyword extraction algorithms using the transcripts of the ICSI meeting corpus Starting from the simple TFIDF baseline, they introduce knowledge sources based on POS filtering, word clustering, and sentence salience score. In addition, they also investigate a graph-based algorithm in order to leverage more global information and reinforcement from summary sentences. They have used different performance

measurements: comparing to human annotated keywords using individual F-measures and a weighted score relative to the oracle system performance, and conducting novel human evaluation. This query proceeds in the form of an internal query representation and is sent to preferred information adapters. Each information adapter converts the query into the source-specific query language and performs a search. Watson represents a task context as a collection of words associated with the current document of users. Thus concept of information retrieval using user’s interactions with everyday applications as context.

3. PROPOSED METHODOLOGIES

Based on the investigations from the literature survey of various scholars the solution to the problem is achieved which are classified and presented below.

In order to characterize user behaviors in published document streams, we study on the correlations among topics extracted from these documents, especially the sequential relations, and specify them as Sequential Topic Patterns (STPs).To solve the innovative and significant problem of mining URSTPs in document streams, many new technical challenges are raised and will be tackled in this paper. Firstly, the input of the task is a textual stream, so existing techniques of sequential pattern mining for probabilistic databases cannot be directly applied to solve this problem. A preprocessing phase is necessary and crucial to get abstract and probabilistic descriptions of documents by topic extraction, and then to recognize complete and repeated activities of Internet users by session identification. Secondly, in view of the real-time requirements in many applications, both the accuracy and the

Page 6: DISCOVERY OF RARE SEQUENTIAL TOPIC PATTERNS …jrret.com/archives/2016/paper/oct/jrret16Oct2.pdf · works analyzed the evolution of individual topics to detect and predict social

T.Karthik Journal of Recent Research in Engineering and Technology

18

efficiency of mining algorithms are important and should be taken into account, especially for the probability computation process. Thirdly, different from frequent patterns, the user-aware rare pattern concerned here is a new concept and a formal criterion must be well defined, so that it can effectively characterize most of personalized and abnormal behaviors of Internet users, and can adapt to different application scenarios. And correspondingly, unsupervised mining algorithms for this kind of rare patterns need to be designed in a manner different from existing frequent pattern mining algorithms.

1. To the best of our knowledge, this is the first work that gives formal definitions of STPs as well as their rarity measures, and puts forward the problem of mining URSTPs in document streams, in order to characterize and detect personalized and abnormal behaviors of Internet users.

2. A framework to pragmatically solve this problem, and design corresponding algorithms to support it.

3. Preprocessing procedures with heuristic methods for topic extraction and session identification. Then, borrowing the ideas of pattern-growth in uncertain environment, two alternative algorithms are designed to discover all the STP candidates with support values for each user. That provides a trade-off between accuracy and efficiency. At last, we present a user-aware rarity analysis algorithm according to the formally defined criterion to pick out URSTPs and associated users.

A. SEQUENTIAL TOPIC PATTERNS On the Internet, the documents are created and distributed in a sequential way and thus compose various forms of published document streams for specific websites. In this paper, we abbreviate them as document streams. A document stream is defined as a sequence DS = h(d1 , u1, t1 ), (d2 , u2, t2 ), · · · , (dN , uN , tN )i, where di (i = 1, . . . , N ) is a document published by user ui at time ti on a specific website, and ti ≤ tj for all i ≤ j. one user cannot write two documents simulta- neously, so we can assume that at any time point, for any specific user, at most one document is published. Formally, if ti = tj , then ui = uj always hold.Each document stream can be transformed into a topic-level document stream of the form T DS = h(td1 , u1 , t1 ), (td2 , u2, t2 ), · · · , (tdN , uN , tN )i, by extracting topics for each document.

In this Research provide attention to

the correlations among successive documents published by the same user in a document stream. A kind of fundamental but important correlations is the sequential relation among topics of these documents, which can be defined by sequential topic patterns, and abbreviated as STPs. They are suitable to characterize users’ complete and personalized behaviors when publishing documents in a website.

Sequential Topic Pattern(STP) α is

defined as a topic sequence hz1 , z2 , · · · , zn i, where each zi ∈ T is a learnt topic. n = |α| denotes the number of topics contained in α, and is called the length of α.

Page 7: DISCOVERY OF RARE SEQUENTIAL TOPIC PATTERNS …jrret.com/archives/2016/paper/oct/jrret16Oct2.pdf · works analyzed the evolution of individual topics to detect and predict social

Journal of Recent Research in Engineering and Technology T.Karthik

19

A pattern with length n is called an n-STP. Since STPs reflect users’ characteristics which probably show repeated behaviors, their instances should be discov- ered not in the whole document stream involving different users and a long time period, but in some subsequences re- lated to a specific user during a certain time period. Each of such subsequences, called a session of the document stream, consists of a series of possibly correlated messages posted by a user during a time period on some micro-blog sites or Internet forums.

The topic-level document stream TDS, a session s is defined as a subsequence of T DS associated with the same user, s = h(td1 , u, t1), (td2 , u, t2 ), · · · , (td|s| , u, t|s|)i T DS.

For a specific user, there may exist

multiple sessions in a document stream, which should be disjoint and consecutive. Each ellipse represents a session, and all the sessions in each line constitute a document subsequence for a specific user. Many heuristic techniques can be applied for session identification in different scenarios. When the session set of a topic-level document stream is obtained To find some concrete instances of an STP for each session. This process can be regarded as sequence matching between the ordered topics specified in the STP and the probabilistic topics occurring in the ordered documents belonging to a specific session.

B. USER-AWARE RARE SEQUENTIAL

TOPIC PATTERNS

Sequential pattern mining focused on frequent patterns but for STPs, many infrequent ones are also interesting and should be discovered. Specifically, when

Internet users’ publish documents, the personalized behaviors characterized by STPs are generally not globally frequent but even rare, since they expose special and abnor- mal motivations of individual authors, as well as particular events having occurred to them in real life. Therefore, the STPs we would like to mine for user behavior analysis on the Internet should be distinguishing features of involved users, and thus satisfy the following two conditions:

1) They should be globally rare for all sessions involving all users of a document stream;

2) They should be locally and relatively frequent for the sessions associated with a specific user.

Starting with the classical concept of support to describe the frequency. For deterministic sequential pattern mining, the support of a pattern α is defined as the number or pro- portion of the sequences containing α in the target database but inapplicable for uncertain sequences like topic- level document streams. Instead, the expected support is appropriate to measure the frequency on uncertain sequences, and can be computed by summing up the occurrence probabilities of α in all sequences. In other words, it expresses the expected number of sequences containing α. To measure the frequency of STPs, we modify it a little to record the proportion of sessions where α occurs also in terms of expectation, via dividing the summation by the number of sessions. That is necessary because the session number here is no longer a constant to consider both the global frequency and the local frequency of α for different users. The main processing frame- work consists of three phases. At first, textual documents are crawled from some micro-blog sites or

Page 8: DISCOVERY OF RARE SEQUENTIAL TOPIC PATTERNS …jrret.com/archives/2016/paper/oct/jrret16Oct2.pdf · works analyzed the evolution of individual topics to detect and predict social

T.Karthik Journal of Recent Research in Engineering and Technology

20

forums, and constitute a document stream as the input of our approach. Then, as preprocessing procedures, the original stream is transformed to a topic- level document stream and then divided into many sessions to identify complete user behaviors. Finally and most importantly, we discover all the STP candidates in the document stream for all users, and further pick out significant URSTPs associated to specific users by user-aware rarity analysis. In order to fulfill this task, we design a group of algorithms. To unify the notations, many variables are denoted and stored in the key-value form. For example, User Sess represents the set of user-session pairs, and each of its elements is denoted as hu : Su i, in which the user u is the key of the map and its value Su is a set containing all the sessions associated with u.

The input includes an original document stream

DS = h(d1 , u1, t1 ), (d2 , u2, t2 ), · · · , (dN , uN , tN )i,

A scaled support threshold hss and a relative rarity threshold hrr .

Algorithm 1. Main(DS, hss , hrr )

1: U ser Sess ← Preprocess(DS);

2: U ser ST P ← ∅;

3: for all hu : Su i ∈ U ser Sess do

4: Start a new thread;

5: ST P Suppu ← UpsSTP(∅, Su , ∅, Su );

6: U ser ST P ← U ser ST P ∪ {hu : ST P Suppu i};

7: U ser U RST P ← URSTPMiner(U ser ST P, U ser Sess,hss , hrr );

8: return U ser U RST P ;

After preprocessing to set of user-session pairs. For each of them with a specific user u, a new thread is started and a pattern-growth based subprocedure UpsSTP is recursively invoked to find all the STP can- didates for u, paired with their support values

3.1 PREPROCESSING

Topic Extraction

In order to obtain a topic-level document stream, we at first employ the classical probabilistic topic models like LDA and Twitter-LDA to get the topic proportion of each document and the word distribution of each learnt topic, with a predefined topic number K .

To select some representative topics to get an approximate

topic-level document. The input of this process is the topic proportion of a document d of the form {p(z|d)}z∈T satisfying Pz∈T p(z|d) = 1, while the output is a topic-level

Topic Probability Threshold. It selects all the prob- abilities more than or equal to a predefined thresh- old htp . Formally, for all i = 1, . . . , K ′, pi ≥ htp holds, and for all z ∈ T − {z1, · · · , zK ′ }, p(z|d) < htp holds.

2) Probability Summation Threshold. After sorting the probability values of the K topics in the non- increasing order, it selects them according to the order as many as possible such that their

Page 9: DISCOVERY OF RARE SEQUENTIAL TOPIC PATTERNS …jrret.com/archives/2016/paper/oct/jrret16Oct2.pdf · works analyzed the evolution of individual topics to detect and predict social

Journal of Recent Research in Engineering and Technology T.Karthik

21

summation is less than or equal to a predefined threshold

Session Identification

Since each session should contain a complete publishing behavior of an individual user, we need to at first divide the document stream according to different users, which is an easy job as the author of each document is explicitly given in the input stream. The result for each user u is a subsequence of the topic-level document stream restricted to that user, i.e., T DSu = h(td1 , u, t1 ), (td2 , u, t2), · · · , (tdN , u, tN )i. Af- ter that, we also need to partition the subsequence to identify complete and repeated activities as consecutive and non-overlapped sessions. They constitute a session set Su = {s1 , s2 , · · · , sm } satisfying T DSu = s1 ◦ s2 ◦ · · · ◦ sm , where ◦ is the concatenation operator.

Time Interval Heuristics. It assumes that the time interval of any two adjacent documents in the same session is less than or equal to a predefined thresh- old hti . The algorithm named tiPartition is shown in Algorithm . It examines each document on the input stream orderly to see whether it should be the starting point of a new session, by checking the condition that the time difference between it and its previous document exceeds the threshold

Time Span Heuristics. It assumes that the duration of each session is less than or equal to a predefined threshold hts . The algorithm is named tsPartition, and the only distinction from Algorithm 2 is the condition for a new session in line 3. Specifically, the time point of the previous document ti−1 is replaced by the starting time point of the current session tk , and hts is used as threshold. Due to the page limit, the pseudocode is omitted here.

Algorithm 2. tiPartition(T DSu , hts )

1: j, k ← 1;

2: for i = 1 to N do

3: if ti − ti−1 > hti then

4: sj ← h(tdk , u, tk ), · · · , (tdi−1 , u, ti−1 )i;

5: j ← j + 1;

6: k ← i;

7: sj ← h(tdk , u, tk ), · · · , (tdN , u, tN )i;

8: m ← j;

9: return Su = {s1 , s2 , · · · , sm };

4. PERFORMANCE ANALYSIS

The dataset into two parts: one for training,

the other for testing. For OLDA, a test set is

a collection of unseen documents wd, and

the model is described by the topic matrix

Φ and the hyperparameter α for topic-

distribution of documents. The LDA

parameters Θ is not taken into

consideration as it represents the topic-

distributions for the documents of the

training set, and can therefore be ignored

to compute the likelihood of unseen

documents. Therefore, we need to evaluate

the log-likelihood

L(w)=logp(w|Φ,α)=∑dlogp(wd|Φ,α).L(w)=logp(w|Φ,α)=∑dlogp(wd|Φ,α) (1)

of a set of unseen documents wdwd given the topics Φ and the hyperparameter α for

Page 10: DISCOVERY OF RARE SEQUENTIAL TOPIC PATTERNS …jrret.com/archives/2016/paper/oct/jrret16Oct2.pdf · works analyzed the evolution of individual topics to detect and predict social

T.Karthik Journal of Recent Research in Engineering and Technology

22

topic-distributionθd of documents. Likelihood of unseen documents can be used tocompare models; higher likelihood implying a better model.

The measure traditionally used for topic

models is the textit{perplexity} of held-out

documents wd defined as

perplexity(test set w)=exp{−L(w)count of tokens}

(2)

which is a decreasing function of the log-

likelihood L(w) of the unseen documents

wd; the lower the perplexity, the better the

model.we use the perplexity score that

trained two latent variable models including

Smoothed unidiagram, and our OLDA, on

two corpora, a paper collection in DBLP

and a set of movie documents in IMDB, to

compare the generalization performance of

the three models. In both datasets, we

removed the stop words and conducted

experiments using 5-fold cross-validation.

Figure Shows presents the results on the

DBLP dataset. The perplexity of OLDP is

Lesser than or similar to SMOOTHED

UNIDIAGRAM, when the number of topics

T ≤ 100; when T increases, SMOOTHED

UNIDIAGRAM are running into serious

over-fitting, while the trend of OLDA keeps

going down and the perplexity is

significantly lower than those of the

baselines.

Perplexity is seen as a good measure of performance for LDA. The idea is that you keep a holdout sample, train your LDA on the rest of the data, then calculate the perplexity of the holdout.

The perplexity could be given by the formula:

per(Dtest)=exp{−∑Md=1logp(wd)∑Md=1Nd} (3)

holding out a proportion of the words in each document, and giving using predictive probabilities of these held-out words given the document-topic mixtures as well as the topic-word mixtures. This is obviously not ideal as it doesn't evaluate performance on any held-out documents.

Page 11: DISCOVERY OF RARE SEQUENTIAL TOPIC PATTERNS …jrret.com/archives/2016/paper/oct/jrret16Oct2.pdf · works analyzed the evolution of individual topics to detect and predict social

Journal of Recent Research in Engineering and Technology T.Karthik

23

4. CONCLUSION

Mining URSTPs in published document streams on complex event patterns based on document topics, for real-time monitoring on abnormal behaviors of Internet users. In this Research work the mining problem are formally defined, and a group of algorithms are designed and combined to systematically solve this problem. The experiments conducted on synthetic datasets demonstrate that the proposed approach is very effective and efficient in discovering special users as well as interesting and interpretable URSTPs from Internet document streams, which can well capture users’ personalized and abnormal behaviors and characteristics.

As this innovative research direction on Web data mining, measures the user-aware rarity to accommodate different requirements improve the mining algorithms mainly on the degree of parallelism, and study on-the-fly algorithms aiming at realtime document streams. Moreover, based on STPs, we will try to define more complex event patterns, such as imposing timing constraints on sequential topics, and design corresponding efficient mining algorithms.It provides a trade-off between accuracy and efficiency. And Proposed techniques of sequential pattern mining for probabilistic databases can be directly applied.Most of existing works on sequential pattern mining focused on frequent patterns, but for STPs, many infrequent ones are also interesting and should be discovered.

In future, we have intended to employ asymmetric cryptographic scheme to provide high security and data integrity.

An effective location based packet forwarding protocol can be implemented.

REFERENCES

1. C. C. Aggarwal, Y. Li, J. Wang, and J. Wang, “Frequent pattern mining with uncertain data,” in Proc. ACM SIGKDD’09, 2009, pp. 29–38.

2. R. Agrawal and R. Srikant, “Mining sequential patterns,” in Proc. IEEE ICDE’95, 1995, pp. 3–14.

3. J. Allan, R. Papka, and V. Lavrenko, “On-line new event detection and tracking,” in Proc. ACM SIGIR’98, 1998, pp. 37–45.

4. T. Bernecker, H.-P. Kriegel, M. Renz, F. Verhein, and A. Zuefle, “Probabilistic frequent itemset mining in uncertain databases,” in Proc. ACM SIGKDD’09, 2009, pp. 119–128.

5. D. Blei and J. Lafferty, “Correlated topic models,” Adv. Neural Inf. Process. Syst., vol. 18, pp. 147–154, 2006.

6. D. M. Blei and J. D. Lafferty, “Dynamic topic models,” in Proc. ACM ICML’06, 2006, pp. 113–120.

7. D. Blei, A. Ng, and M. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.

8. J. Chae, D. Thom, H. Bosch, Y. Jang, R. Maciejewski, D. S. Ebert, and T. Ertl, “Spatiotemporal social media analytics for abnormal event detection and examination using seasonal-trend decomposition,” in Proc. IEEE V AST’12, 2012, pp. 143–152.

9. K. Chen, L. Luesukprasert, and S. T. Chou, “Hot topic extraction based on timeline analysis and multidimensional sentence modeling,” IEEE Trans. Knowl. Data Eng., vol. 19, no. 8, pp. 1016–1025,

Page 12: DISCOVERY OF RARE SEQUENTIAL TOPIC PATTERNS …jrret.com/archives/2016/paper/oct/jrret16Oct2.pdf · works analyzed the evolution of individual topics to detect and predict social

T.Karthik Journal of Recent Research in Engineering and Technology

24

2007. 10. C. K. Chui and B. Kao, “A

decremental approach for mining frequent itemsets from uncertain data,” in Proc. PAKDD’08, 2008, pp. 64–75.

11. W. Dou, X. Wang, D. Skau, W. Ribarsky, and M. X. Zhou, “LeadLine: Interactive visual analysis of text data through event identification and exploration,” in Proc. IEEE V AST’12, 2012, pp.93–102.

12. G. P. C. Fung, J. X. Yu, P. S. Yu, and H. Lu, “Parameter free bursty events detection in text streams,” in Proc. VLDB’05, 2005, pp. 181–192.

13. J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. Hsu, “FreeSpan: frequent pattern-projected sequential pattern mining,”in Proc. ACM SIGKDD’00, 2000, pp. 355–359

14. N. Hariri, B. Mobasher, and R. Burke, “Context-aware music recommendation based on latent topic sequential patterns,” in Proc. ACM RecSys’12, 2012, pp. 131–138.

15. T. Hofmann, “Probabilistic latent semantic indexing,” in Proc. ACM SIGIR’99, 1999, pp. 50–57.

16. L. Hong and B. D. Davison, “Empirical study of topic modeling in Twitter,” in Proc. ACM SOMA’10, 2010, pp. 80–88. [17] Z. Hu, H. Wang, J. Zhu, M. Li, Y. Qiao, and C. Deng, “Discovery of rare sequential topic patterns in document stream,” in Proc. SIAM SDM’14, 2014, pp. 533–541.

17. A. Krause, J. Leskovec, and C. Guestrin, “Data association for topic intensity tracking,” in Proc. ACM ICML’06, 2006, pp. 497–504.

18. W. Li and A. McCallum, “Pachinko allocation: DAG-structured mixture

models of topic correlations,” in Proc. ACM ICML’06, vol. 148, 2006, pp. 577–584.