A Survey of Topic Pattern Mining in Text Mining

M Trupthi et al., International Journal of Computer Systems, ISSN 2394-1065, Vol. 03, Issue 05, May 2016


The second category covers the methods that topic modeling evolution provides, such as topics over time (TOT), dynamic topic models (DTM), multiscale topic tomography, dynamic topic correlation detection, detecting topic evolution in scientific literature, and the web of topics. Furthermore, the paper presents an overview of each category, provides examples where available, and notes some limitations and characteristics of each part.

This paper is organized as follows. Section II presents the first category, the methods of topic modeling, with its four methods and their general concepts as subsections. Section III gives an overview of the second category, topic modeling evolution, including its parts. Conclusions follow in Section IV.

    II.  METHODS OF TOPIC MODELING 

     A.   Latent Semantic Analysis

Latent semantic analysis (LSA) is a technique in the area of natural language processing (NLP). The main goal of LSA is to create vector-based representations of texts in order to capture their semantic content. Using these vector representations, LSA computes the similarity between texts to pick the most closely related words. LSA was formerly called latent semantic indexing (LSI), later improved for information retrieval tasks: finding the few documents, out of many, that are closest to a given query. LSA supports several approaches, such as keyword matching, weighted keyword matching, and vector representations based on the occurrences of words in documents. In addition, LSA uses singular value decomposition (SVD) to rearrange the data.

SVD is a method that reconfigures the term-document matrix and computes all the dimensions of the vector space. The dimensions of the vector space are then ordered from most to least important. In LSA, the most significant dimensions are used to find the meaning of the text, while the least important ones are ignored. Words with a high rate of similarity will have similar vectors. The most essential steps in LSA are: first, collect a large set of relevant text and divide it into documents; second, build a co-occurrence matrix of terms and documents, with m-dimensional vectors for terms and n-dimensional vectors for documents; third, weight each cell; finally, SVD plays a big role in computing all the dimensions and factoring the matrix into three matrices.
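The steps above can be condensed into a short sketch. This is a minimal illustration with NumPy and scikit-learn, assuming a toy corpus and k = 2 latent dimensions; the corpus, the value of k, and the variable names are illustrative, not from the paper.

```python
# Minimal LSA sketch: weighted term-document matrix -> SVD -> similarity.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat with another cat",
    "a cat and a dog played on the mat",
    "stocks fell as the market dropped today",
    "investors sold stocks as the market fell",
]

# Steps 1-3: collect documents and build a weighted co-occurrence matrix
# (tf-idf weighting stands in for the cell weighting described above).
X = TfidfVectorizer().fit_transform(corpus).toarray()   # documents x terms

# Final step: SVD factors X into three matrices U, Sigma, V^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k most important dimensions; the rest are ignored.
k = 2
docs_k = U[:, :k] * s[:k]    # documents in the reduced latent space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(docs_k[2], docs_k[3]))   # expected high: both about markets
print(cosine(docs_k[0], docs_k[2]))   # expected low: unrelated documents
```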

     B.   Probabilistic Latent Semantic Analysis

Thomas Hofmann and Jan Puzicha introduced PLSA in 1999. PLSA is a method for automated document indexing based on a statistical latent class model for factor analysis of count data; it tries to improve on latent semantic analysis (LSA) in a probabilistic sense by using a generative model. The main goal of PLSA is to identify and distinguish between different contexts of word usage without recourse to a dictionary or thesaurus. It has two important implications: first, it allows the disambiguation of polysemy, i.e., words with multiple meanings; second, it discloses topical similarities by grouping together words that share a common context [3].

According to Kakkonen, Myller, Sutinen, and Timonen (2008), PLSA is based on a statistical model referred to as an aspect model: "an aspect model is a latent variable model for co-occurrence data, which associates unobserved class variables with each observation" [4]. The PLSA method improves on LSA and also solves some problems that LSA cannot. PLSA has been successful in many real-world applications, including computer vision and recommender systems. However, since the number of parameters grows linearly with the number of documents, PLSA suffers from overfitting problems. Some of these applications are discussed later [5].

On the other hand, PLSA is based on a generative algorithm. The probabilistic model introduces a latent variable zk ∈ {z1, z2, ..., zK}, which corresponds to a potential semantic layer. In the full model, p(di) denotes the probability of document di in the data set; p(wj | zk) denotes the probability of the term (word) wj under the defined semantic class zk; and p(zk | di) represents the distribution of semantic classes over document di. Using these definitions, the generative model produces new data by the following steps [3]:

1) Select a document di with probability P(di);

2) Pick a latent class zk with probability P(zk | di);

3) Generate a word wj with probability P(wj | zk).
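These three sampling steps multiply out to the joint distribution below, written in standard PLSA notation consistent with [3]:

```latex
P(d_i, w_j) \;=\; P(d_i)\, P(w_j \mid d_i), \qquad
P(w_j \mid d_i) \;=\; \sum_{k=1}^{K} P(w_j \mid z_k)\, P(z_k \mid d_i)
```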

    Fig. 1 High-Level View of PLSA

PLSA has two different formulations. The first is the symmetric formulation, in which the word w and the document d are generated from the latent class c in similar ways, using the conditional probabilities P(d | c) and P(w | c). The second is the asymmetric formulation, in which, for each document d, a latent class is chosen conditionally on the document according to P(c | d), and the word is then generated from that class according to P(w | c) [6]. Each of these two formulations has rules and algorithms that can be used for different purposes. Both formulations have since been improved with the release of Recursive Probabilistic Latent Semantic Analysis (RPLSA), an extension of PLSA that improves both the asymmetric and symmetric formulations.
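For comparison, the two parameterizations can be written side by side; this is the standard form of the aspect model, using the class variable c as in [6]:

```latex
% Symmetric: d and w are generated from the latent class c in similar ways.
P(d, w) \;=\; \sum_{c} P(c)\, P(d \mid c)\, P(w \mid c)

% Asymmetric: a class is chosen per document, then the word from the class.
P(d, w) \;=\; P(d)\, \sum_{c} P(c \mid d)\, P(w \mid c)
```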


Fig. 2 A graphical model representation of the aspect model in the asymmetric (a) and symmetric (b) parameterization [3]

PLSA has various applications, such as information retrieval and filtering, NLP, and machine learning from text. Specific examples include automatic essay grading, classification, topic tracking, image retrieval, and automatic question recommendation. Two applications are discussed below:

• Image retrieval: The PLSA model represents each image as a collection of visual words from a discrete and finite visual vocabulary. The occurrences of visual words in an image are counted into a co-occurrence vector. The co-occurrence vectors of all images build the co-occurrence table that is used to train the PLSA model. Once the PLSA model is learned, it can be applied to every image in the database; each image is then represented by a topic vector, whose elements denote the degree to which the image depicts a certain topic [7]. (A small sketch of the co-occurrence table follows this list.)

• Automatic question recommendation: One significant application of PLSA is question recommendation. Once the latent semantics underlying the questions are estimated, recommendations can be made based on similarities in these latent semantics. Wu, Wang, and Cheng (2008) reported that PLSA can therefore be used to model a user's profile (represented by the questions that the user asks or answers) as well as the questions themselves, by estimating the probabilities of the latent topics behind the words. Because the user's profile is represented by all the questions that he/she asks or answers, only the modeling of the questions needs to be considered [8].
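To make the co-occurrence table in the image retrieval application concrete, here is a minimal sketch; the visual-word IDs per image are hypothetical stand-ins for the output of a real feature quantizer:

```python
# Build an image x visual-word co-occurrence table for PLSA training.
# The visual-word IDs below are hypothetical; in practice they come from
# quantizing local image features against a learned visual vocabulary.
import numpy as np

vocab_size = 6  # size of the discrete, finite visual vocabulary
images = [
    [0, 0, 2, 3],        # visual words observed in image 0
    [1, 2, 2, 5, 5, 5],  # visual words observed in image 1
    [0, 3, 3, 4],        # visual words observed in image 2
]

# Each row is one image's co-occurrence vector of visual-word counts.
table = np.zeros((len(images), vocab_size), dtype=int)
for i, words in enumerate(images):
    for w in words:
        table[i, w] += 1

print(table)
# [[2 0 1 1 0 0]
#  [0 1 2 0 0 3]
#  [1 0 0 2 1 0]]
```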

    C.   Latent Dirichlet Allocation

The latent Dirichlet allocation (LDA) model appeared in order to improve the way mixture models capture the exchangeability of both words and documents, compared with the older approach of PLSA and LSA. The classic representation theorem (de Finetti, 1990) lays down that any collection of exchangeable random variables has a representation as a mixture distribution, in general an infinite mixture [9].

Latent Dirichlet allocation (LDA) is a very widely used text mining algorithm based on statistical (Bayesian) topic models. LDA is a generative model, meaning that it tries to mimic the writing process: it tries to generate a document given the topics. It can also be applied to other types of data. There are tens of LDA-based models, including temporal text mining, author-topic analysis, supervised topic models, latent Dirichlet co-clustering, and LDA-based bioinformatics [11], [18].

The basic idea of the process is that each document is modeled as a mixture of topics, and each topic is a discrete probability distribution that defines how likely each word is to appear in that topic. These topic probabilities give a concise representation of a document. Here, a "document" is a "bag of words" with no structure beyond the topic and word statistics.

The huge number of electronic document collections that has appeared in the recent past, such as the web, scientific literature, blogs, and news articles, has posed several new challenges to researchers in the data mining community. In particular, there is a growing need for automatic techniques to visualize, analyze, summarize, and mine these document collections. Latent topic modeling, with models such as LDA, has become very popular as a completely unsupervised technique for topic discovery in large document collections [10].

    Fig.3 A graphical model representation of LDA

LDA models each of D documents as a mixture over K latent topics, each of which describes a multinomial distribution over a W-word vocabulary. Figure 3 shows the graphical model representation of the LDA model. The generative process for basic LDA is as follows:

For each of the Nj words in document j:

1) Choose a topic zij ∼ Mult(θj);

2) Choose a word xij ∼ Mult(φzij);

where the parameters of the multinomials for topics in a document, θj, and for words in a topic, φk, have Dirichlet priors [12].
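Written out with the Dirichlet priors made explicit, the process reads as follows (standard LDA notation, consistent with [9] and [12]):

```latex
% Per-document topic mixture and per-topic word distribution:
\theta_j \sim \mathrm{Dirichlet}(\alpha), \qquad
\phi_k \sim \mathrm{Dirichlet}(\beta)

% For each word position i in document j:
z_{ij} \sim \mathrm{Mult}(\theta_j), \qquad
x_{ij} \sim \mathrm{Mult}(\phi_{z_{ij}})

% Marginal probability that position i of document j shows word w:
P(x_{ij} = w) \;=\; \sum_{k=1}^{K} \theta_{jk}\, \phi_{kw}
```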

Indeed, there are several applications based on the latent Dirichlet allocation (LDA) method, such as:

• Role discovery: Social network analysis (SNA) is the study of mathematical models for interactions among people, organizations, and groups. Because of the emergent connections among the 9/11 hijackers and the huge data sets of human interactions on popular web services like facebook.com and MySpace.com, there has been growing interest in social network analysis. This led to the Author-Recipient-Topic (ART) model for social network analysis. The model combines latent Dirichlet allocation (LDA) and the Author-Topic (AT) model. The idea of ART is to learn topic distributions based on the direction-sensitive messages sent between senders and receivers [13].

• Emotion topic: The Pairwise-Link-LDA model focuses on the problem of jointly modeling text and citations in the topic modeling area. It builds on the ideas of LDA and Mixed Membership Stochastic Block Models (MMSB) and allows modeling arbitrary link structure [14].

• Automatic essay grading: The automatic essay grading problem is closely related to automatic text categorization, which has been researched since the 1960s. LDA has been shown to be a reliable method for solving information retrieval tasks, from information filtering and classification to document retrieval and classification [15].

• Anti-phishing: Phishing emails are a way to steal sensitive information such as account information, credit card numbers, and social security numbers. Email filtering or web site filtering alone is not an effective way to prevent phishing emails. Because latent topics are clusters of words that appear together in email, a user can expect that in a phishing email the words "click" and "account" often appear together. Usual latent topic models do not take into account different classes of documents, e.g., phishing or non-phishing. For that reason, researchers developed a new statistical model, the latent Class-Topic Model (CLTOM), which is an extension of latent Dirichlet allocation (LDA) [16].

Example of LDA: This section provides an illustrative example of the use of an LDA model on real data, using a subset of the TREC AP corpus containing 16,000 documents. First, the stop-words in the TREC AP corpus are removed before running topic modeling. Then the EM algorithm is used to find the Dirichlet and conditional multinomial parameters for a 100-topic LDA model. The top words from some of the resulting multinomial distributions are illustrated in Figure 4. These distributions seem to capture some of the underlying topics in the corpus, and each topic is named according to them [9].
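The same pipeline can be sketched in a few lines of Python with gensim. The toy corpus below stands in for the 16,000-document TREC AP subset, and gensim fits the model with online variational Bayes rather than the batch variational EM of [9]; the corpus contents and parameter values are illustrative assumptions.

```python
# Sketch of the LDA pipeline described above, using gensim.
from gensim import corpora, models
from gensim.parsing.preprocessing import STOPWORDS

raw_docs = [
    "the school board voted to raise teacher salaries this year",
    "oil prices rose sharply on the stock market today",
    "the president addressed congress about the new federal budget",
]

# Step 1: remove stop-words before running topic modeling.
texts = [[w for w in doc.lower().split() if w not in STOPWORDS]
         for doc in raw_docs]

# Build the vocabulary and the bag-of-words corpus.
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]

# Step 2: fit the topic model. The paper fits 100 topics on 16,000
# documents; a toy corpus obviously needs far fewer.
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)

# Step 3: inspect the top words of each resulting multinomial.
for topic_id, top_words in lda.print_topics(num_words=5):
    print(topic_id, top_words)
```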

Fig. 4 Most likely words from four topics in LDA on the AP corpus; the topic titles in quotes are not part of the algorithm

     D.  Correlated Topic Model

The correlated topic model (CTM) is a kind of statistical model used in natural language processing and machine learning. CTM is used to discover the topics that appear in a group of documents. The key to CTM is the logistic normal distribution. CTM builds on LDA.
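A brief sketch of the logistic normal draw that gives CTM its name, in standard notation; the covariance matrix Σ is what encodes the relations among topics:

```latex
% Draw correlated topic proportions for a document:
\eta \sim \mathcal{N}(\mu, \Sigma), \qquad
\theta_k \;=\; \frac{\exp(\eta_k)}{\sum_{l=1}^{K} \exp(\eta_l)}

% Off-diagonal entries of \Sigma capture correlations between topics,
% which a Dirichlet prior cannot represent.
```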

Table 1. The characteristics of topic modeling methods [17]

Latent Semantic Analysis (LSA)
* LSA can recover the topic even when synonymous words are used.
* It lacks a robust statistical background.

Probabilistic Latent Semantic Analysis (PLSA)
* It generates each word from a single topic, even though various words in one document may be generated from different topics.
* PLSA handles polysemy.

Latent Dirichlet Allocation (LDA)
* Stop-words need to be removed manually.
* LDA cannot represent relationships among topics.

Correlated Topic Model (CTM)
* Uses the logistic normal distribution to create relations among topics.
* Allows the occurrence of words in other topics and topic graphs.

Table II. The limitations of topic modeling methods [17]

Latent Semantic Analysis (LSA)
- It is hard to obtain and determine the number of topics.
- It is hard to interpret loading values with a probabilistic meaning.

Probabilistic Latent Semantic Analysis (PLSA)
- At the level of documents, PLSA provides no probabilistic model.

Latent Dirichlet Allocation (LDA)
- It is unable to model relations among topics, which is solved by the CTM method.

Correlated Topic Model (CTM)
- Requires a lot of calculation.
- Produces many general words inside the topics.


III.  TOPIC EVOLUTION MODELS

     A.  Overview of Topic Evolution Models

It is important to model topic evolution, so that people can identify topics within a context (i.e., time) and see how topics evolve over time. There are many applications where topic evolution models can be applied. For example, by checking topic evolution in scientific literature, one can see topic lineage and how research on one topic influences another.

This section reviews several important papers that model topic evolution. These papers use different models, but all of them treat time as an important factor when modeling topics. For example, probabilistic time series models are used in the dynamic topic models paper, while non-homogeneous Poisson processes and multiscale analysis with Haar wavelets are employed in the multiscale topic tomography paper to model topic evolution.

     B.   Non-Markov Continuous-Time Method

Most large data sets have dynamic co-occurrence patterns; that is, word and topic co-occurrence patterns change over time. TOT models therefore consider the changes over time by taking into account both the word co-occurrence pattern and time [19]. In this method, a topic is considered to be associated with a continuous distribution over time.

In TOT, topic discovery is influenced not only by word co-occurrences but also by temporal information. For each document, a multinomial distribution over topics is sampled from a Dirichlet, words are generated from the multinomial of each topic, and the Beta distribution of each topic generates the document's timestamp. If a pattern of strong word co-occurrence exists for a short time, TOT will create a narrow-time-distribution topic. If a pattern of strong word co-occurrence persists for a long while, it will generate a broad-time-distribution topic.
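The per-document generative steps just described can be sketched in standard TOT notation (θ_d for the topic mixture, φ_z for the word multinomials, ψ_z for the per-topic Beta parameters):

```latex
% For each document d:
\theta_d \sim \mathrm{Dirichlet}(\alpha)

% For each word position i in document d:
z_{di} \sim \mathrm{Mult}(\theta_d), \qquad
w_{di} \sim \mathrm{Mult}(\phi_{z_{di}}), \qquad
t_{di} \sim \mathrm{Beta}(\psi_{z_{di}})
```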

The main point of this paper is that it models topic evolution without discretizing time and without making Markov assumptions about the transitions between states at successive times. Applying this method to two centuries of U.S. Presidential State-of-the-Union addresses, TOT discovers time-localized topics and also improves word-clarity over LDA. Another experiment, on 17 years of NIPS conference papers, demonstrates clear topical trends.

C.  Dynamic Topic Models (DTM)

The authors of this paper developed a statistical model of topic evolution, together with approximate posterior inference techniques to determine the evolving topics from a sequential collection of documents [20]. It assumes that the corpus of documents is organized into time slices, that the documents of each time slice are modeled with a K-component topic model, and that the topics associated with time slice t evolve from the topics of time slice t-1.

Dynamic topic models estimate the topic distribution at different epochs. They use a Gaussian prior for the topic parameters instead of a Dirichlet prior, and can capture topic evolution over time slices. With this model, one can predict which words differ from those of previous epochs.
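A sketch of the Gaussian drift that replaces the Dirichlet, in the notation of [1]: each topic's natural parameters perform a random walk across slices, and word probabilities are recovered with a softmax map.

```latex
% Topic k's natural parameters drift from slice t-1 to slice t:
\beta_{t,k} \mid \beta_{t-1,k} \;\sim\; \mathcal{N}(\beta_{t-1,k},\; \sigma^2 I)

% Word probabilities at slice t come from a softmax over the vocabulary:
p(w \mid \beta_{t,k}) \;=\; \frac{\exp(\beta_{t,k,w})}{\sum_{w'} \exp(\beta_{t,k,w'})}
```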

 D.   A Non-parametric Approach to Dynamic Topic Correlation Detection (DCTM)

This method models topic evolution by discretizing time [21]. Each corpus contains sets of documents, each set consisting of documents with the same timestamp. It assumes that all documents in a corpus share the same time-scale, and that each document corpus shares the same vocabulary of size d.

Basically, DCTM maps the high-dimensional space (words) to a lower-dimensional space (topics) and models the dynamic topic evolution in a corpus. A hierarchy over the correlation latent space, called the temporal prior, is constructed; the temporal prior is used to capture the dynamics of topics and correlations.

DCTM works as follows. First, for each document corpus, the latent topics are discovered; this is done by summarizing the contribution of documents at a certain time, aggregating the features of all documents. Then, a Gaussian process latent variable model (GP-LVM) is used to capture the relationship between each pair of document and topic set. Next, a hierarchical Gaussian process latent variable model (HGP-LVM) is employed to model the relationship between each pair of topic sets. Posterior inference of topics and correlations is also used to identify the dynamic changes of topic-related word probabilities, and to predict topic evolution and topic correlations.

An important feature of this paper is that the model is non-parametric, since it can marginalize out the parameters, and it exhibits faster convergence than the generative processes.

Example: Detecting Topic Evolution of Scientific Literature

This method builds on the observation that citations indicate important relationships between topics, and it uses citations to model the topic evolution of scientific literature [22]. Not only the papers in a corpus D(t) are considered for topic detection; the papers they cite are also taken into account. A Bayesian model is used to identify topic evolution.

It works as follows: first, it tries to identify a new topic by detecting significant content changes in a document corpus. If the new content is clearly different from the original content, and the new content is shared by later documents, the new content is identified as a new topic.

The next step is to explore the relationship between the new topics and the original topics. This works by finding the member documents of each topic and examining their relationships. Citation relationships are used to find the member documents: if a paper cites the start paper of a topic, it becomes a member paper of that topic, and papers that are textually close to the start paper are also member papers. The relationship between the original topics and the newly discovered topics is then identified using citation counts. The experimental results demonstrate that citations help in understanding topic evolution better. (A minimal sketch of the member-paper rule follows.)
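The member-paper rule just described can be sketched as follows; the citation lists, the toy texts, and the 0.3 similarity threshold are illustrative assumptions, not values from [22]:

```python
# Sketch: a paper is a "member" of a topic if it cites the topic's start
# paper, or if it is textually close to that start paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

papers = {
    "p0": "latent topic models for large document collections",  # start paper
    "p1": "we extend latent topic models to streaming document collections",
    "p2": "a study of coral reef ecosystems in warm coastal waters",
}
cites = {"p1": ["p0"], "p2": []}   # who cites whom (hypothetical)
start = "p0"

ids = list(papers)
tfidf = TfidfVectorizer().fit_transform(papers.values())
sim = cosine_similarity(tfidf)     # pairwise textual similarity

members = set()
for i, pid in enumerate(ids):
    if pid == start:
        continue
    cites_start = start in cites.get(pid, [])
    textually_close = sim[i, ids.index(start)] > 0.3  # arbitrary threshold
    if cites_start or textually_close:
        members.add(pid)

print(members)   # {'p1'}: cites p0 and shares much of its vocabulary
```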

    Summary of topic evolution models 

This paper summarizes the main characteristics of the topic evolution models discussed in Section III, which are listed in Table III:

TABLE III. The Main Characteristics of Topic Evolution Models

Modeling topic evolution by a continuous-time model:
1) "Topics over time: a non-Markov continuous-time model of topical trends"

Modeling topic evolution by discretizing time:
1) "Dynamic topic models"
2) "Multiscale topic tomography"
3) "A Non-parametric Approach to Pair-wise Dynamic Topic Correlation Detection"

Modeling topic evolution by using citation relationships as well as discretizing time:
1) "Detecting topic evolution in scientific literature: How can citations help?"
2) "The Web of Topics: Discovering the Topology of Topic Evolution in a Corpus"

      Comparison of Two Categories

The main difference between the two categories is that the first category models topics without considering time, and its methods mainly model words.

The methods in the second category model topics while considering time, whether by using continuous time, by discretizing time, or by combining time discretization with citation relationships. Due to the different characteristics of these two categories, the methods in the second category are more accurate in terms of topic discovery.

    IV.  CONCLUSION 

This survey paper presented two categories that fall under the term topic modeling in text mining. For the first category, it discussed the general ideas of four topic modeling methods: latent semantic analysis (LSA), probabilistic latent semantic analysis (PLSA), latent Dirichlet allocation (LDA), and the correlated topic model (CTM). In addition, it explained the differences between these four methods in terms of characteristics, limitations, and theoretical backgrounds. The paper does not go into the specific details of each method; it only describes a high-level view as it relates to topic modeling in text mining. Furthermore, it mentioned some of the applications built on these four methods, and noted that each method improves on its predecessor and remedies some of the disadvantages found in the previous methods. Modeling topics without taking time into account will confound topic discovery. For the second category, the paper discussed topic evolution models, which model topics by considering time. Several papers use different methods to model topic evolution: some discretize time, some use continuous-time models, and some employ citation relationships together with time discretization. All of these papers model topics by considering the important factor of time.

REFERENCES

[1] Blei, D.M., and Lafferty, J.D., "Dynamic Topic Models", Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006.

[2] Steyvers, M., and Griffiths, T., "Probabilistic topic models", in T. Landauer, D. McNamara, S. Dennis, and W. Kintsch (eds.), Latent Semantic Analysis: A Road to Meaning, Laurence Erlbaum, 2007.

[3] Hofmann, T., "Unsupervised learning by probabilistic latent semantic analysis", Machine Learning, 42 (1), 2001, 177-196.

[4] Kakkonen, T., Myller, N., Sutinen, E., and Timonen, J., "Comparison of Dimension Reduction Methods for Automated Essay Grading", Educational Technology & Society, 11 (3), 2008, 275-288.

[5] Liu, S., Xia, C., and Jiang, X., "Efficient Probabilistic Latent Semantic Analysis with Sparsity Control", IEEE International Conference on Data Mining, 2010, 905-910.

[6] Bassiou, N., and Kotropoulos, C., "RPLSA: A novel updating scheme for Probabilistic Latent Semantic Analysis", Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece, 2010.

[7] Romberg, S., Hörster, E., and Lienhart, R., "Multimodal pLSA on visual features and tags", The Institute of Electrical and Electronics Engineers Inc., 2009, 414-417.

[8] Wu, H., Wang, Y., and Cheng, X., "Incremental probabilistic latent semantic analysis for automatic question recommendation", ACM, New York, NY, USA, 2008, 99-106.

[9] Blei, D.M., Ng, A.Y., and Jordan, M.I., "Latent Dirichlet Allocation", Journal of Machine Learning Research, 3, 2003, 993-1022.

[10] Ahmed, A., Xing, E.P., and Cohen, W.W., "Joint Latent Topic Models for Text and Citations", ACM, New York, NY, USA, 2008.

[11] Shen, Z.Y., Sun, J., and Shen, Y.D., "Collective Latent Dirichlet Allocation", Eighth IEEE International Conference on Data Mining, 2008, 1019-1025.

[12] Porteous, I., Newman, D., Ihler, A., Asuncion, A., Smyth, P., and Welling, M., "Fast Collapsed Gibbs Sampling for Latent Dirichlet Allocation", ACM, New York, NY, USA, 2008.

[13] McCallum, A., Wang, X., and Corrada-Emmanuel, A., "Topic and role discovery in social networks with experiments on Enron and academic email", Journal of Artificial Intelligence Research, 30 (1), 2007, 249-272.

[14] Bao, S., Xu, S., Zhang, L., Yan, R., Su, Z., Han, D., and Yu, Y., "Joint Emotion-Topic Modeling for Social Affective Text Mining", Ninth IEEE International Conference on Data Mining (ICDM '09), 2009, 699-704.

[15] Kakkonen, T., Myller, N., and Sutinen, E., "Applying latent Dirichlet allocation to automatic essay grading", Lecture Notes in Computer Science, 4139, 2006, 110-120.

[16] Bergholz, A., Chang, J., Paaß, G., Reichartz, F., and Strobel, S., "Improved phishing detection using model-based features", 2008.

[17] Lee, S., Baker, J., Song, J., and Wetherbe, J.C., "An Empirical Comparison of Four Text Mining Methods", Proceedings of the 43rd Hawaii International Conference on System Sciences, 2010.

[18] Wang, X., and McCallum, A., "Topics over time: a non-Markov continuous-time model of topical trends", International Conference on Knowledge Discovery and Data Mining, 2006, 424-433.

[19] Blei, D.M., and Lafferty, J.D., "Dynamic topic models", International Conference on Machine Learning, 2006, 113-120.

[20] Nallapati, R.M., Ditmore, S., Lafferty, J.D., and Ung, K., "Multiscale topic tomography", Proceedings of KDD '07, 2007, 520-529.

[21] He, Q., Chen, B., Pei, J., Qiu, B., Mitra, P., and Giles, C.L., "Detecting topic evolution in scientific literature: How can citations help?", CIKM, 2009.

[22] Jo, Y., Hopcroft, J.E., and Lagoze, C., "The Web of Topics: Discovering the Topology of Topic Evolution in a Corpus", The 20th International World Wide Web Conference, 2011.