Clustering of Web Documents

Jinfeng Chen

Zhong Su, Qiang Yang, HongJiang Zhang, Xiaowei Xu and Yuhen Hu, Correlation-based Document Clustering using Web Logs, 2001.

Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma and Jinwen Ma, Learning to Cluster Web Search Results.
Correlation-based Document Clustering using Web Logs

Introduction
Using web log data to construct clusters.

Frequent simultaneous visits to two seemingly unrelated documents should indicate that they are in fact closely related.

The basic algorithm is DBSCAN, an algorithm that groups neighboring objects of the database into clusters based on local distance information.
DBSCAN

Does not require the user to pre-specify the number of clusters.

Needs only one scan through the database, a radius value ε, and a value Mpts.

ε – distance measure (radius)

Mpts – the minimum number of points that must occur within the ε-neighborhood of a dense object
DBSCAN algorithm (cont'd)

Algorithm DBSCAN(DB, ε, MinPts)
  for each o in DB do
    if o is not yet assigned to a cluster then
      if o is a core object then
        collect all objects density-reachable from o
          according to ε and MinPts
        assign them to a new cluster;
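The pseudocode above can be sketched in Python roughly as follows (a minimal brute-force version for illustration; the Euclidean neighborhood search over point tuples is an assumption about the distance measure):

```python
from collections import deque

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (brute force)."""
    return [j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

def dbscan(points, eps, min_pts):
    """Assign each point a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:          # already assigned
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:       # not a core object: mark as noise
            labels[i] = -1
            continue
        cluster += 1                       # start a new cluster at core object i
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:                       # collect all density-reachable points
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster        # border point, reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is itself a core object: expand
                queue.extend(j_neighbors)
    return labels
```

Two tight groups of points come out as two clusters and an isolated point as noise, without the number of clusters ever being specified in advance.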
Limitations of DBSCAN in Clustering of Web Documents

It performs clustering using a fixed threshold value to determine "dense" regions in the document space.

Thus the algorithm often cannot distinguish between dense and loose points; often the entire document space is lumped into a single cluster.
[Figure: a "bridge" of points connecting two dense regions]
RDBC algorithm (recursive density-based clustering)

The key difference between RDBC and DBSCAN is that in RDBC, the identification of core points is performed separately from the clustering of each individual data point.

Different values of ε and Mpts are used in RDBC to identify this core point set, Cset.
RDBC algorithm (cont'd)

To avoid connecting too many clusters through a "bridge":

Set initial values ε = ε1 and Mpts = Mpts1; WebPageSet = web_log
RDBC(ε, Mpts, WebPageSet) {
  use ε, Mpts to get the core point set Cset
  if size(Cset) > size(WebPageSet)/2
    { DBSCAN(ε, Mpts, WebPageSet) }
  else {
    ε = ε/2; Mpts = Mpts/4;
    RDBC(ε, Mpts, WebPageSet);
    collect all other points in (WebPageSet - Cset)
    around the clusters found in the last step according to ε2
  }
}
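The recursion can be sketched as a self-contained toy version like the one below. The ε/2 and Mpts/4 halving schedule is taken from the slide; the final step that attaches the remaining points in WebPageSet - Cset around the found clusters according to ε2 is omitted for brevity:

```python
from collections import deque

def neighbors(pts, i, eps):
    """Indices of points within distance eps of pts[i] (brute force)."""
    return [j for j, q in enumerate(pts)
            if sum((a - b) ** 2 for a, b in zip(pts[i], q)) <= eps ** 2]

def dbscan(pts, eps, min_pts):
    """Plain DBSCAN: cluster id per point, -1 for noise."""
    labels, c = [None] * len(pts), -1
    for i in range(len(pts)):
        if labels[i] is not None:
            continue
        nbs = neighbors(pts, i, eps)
        if len(nbs) < min_pts:
            labels[i] = -1
            continue
        c += 1
        labels[i] = c
        queue = deque(nbs)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = c
            if labels[j] is not None:
                continue
            labels[j] = c
            if len(neighbors(pts, j, eps)) >= min_pts:
                queue.extend(neighbors(pts, j, eps))
    return labels

def rdbc(pts, eps, min_pts):
    """Identify core points first; cluster only once they cover more than
    half the data, otherwise relax the parameters and recurse."""
    cset = [i for i in range(len(pts))
            if len(neighbors(pts, i, eps)) >= min_pts]
    if len(cset) > len(pts) / 2 or min_pts <= 1:
        return dbscan(pts, eps, max(min_pts, 1))
    return rdbc(pts, eps / 2, min_pts // 4)   # per the slide: ε = ε/2, Mpts = Mpts/4
```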
Construct WebPageSet from web logs

Step 1
Step 2  Delete visits of image files.
Step 3  Extract sessions from the data.
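Steps 2 and 3 can be sketched as follows; the 30-minute inactivity timeout and the list of image extensions are assumptions, not values taken from the paper:

```python
import re

IMAGE_RE = re.compile(r"\.(gif|jpe?g|png|bmp)$", re.IGNORECASE)
SESSION_GAP = 30 * 60   # assumed: 30 minutes of inactivity closes a session

def extract_sessions(log):
    """log: (user, timestamp_seconds, url) tuples, sorted by user then time.
    Returns a list of sessions, each a list of page URLs."""
    sessions, current, prev = [], [], None
    for user, ts, url in log:
        if IMAGE_RE.search(url):        # Step 2: delete visits of image files
            continue
        if prev and (user != prev[0] or ts - prev[1] > SESSION_GAP):
            sessions.append(current)    # Step 3: new user or long gap -> new session
            current = []
        current.append(url)
        prev = (user, ts)
    if current:
        sessions.append(current)
    return sessions
```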
Construct WebPageSet (cont'd)

Step 4  Create a distance matrix
1) Determine the size of a moving window, within which URL requests will be regarded as co-occurrences.
2) Calculate the co-occurrence count Ni,j, and the counts Ni, Nj of this pair of URLs.
[Figure: a moving window sliding over a session; occurrences of pages Pi and Pj inside the same window are counted toward Ni,j, Ni and Nj]
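Counting with the moving window can be sketched as below. The exact convention used here (two requests co-occur when they are fewer than `window` positions apart in one session) is an assumption:

```python
from collections import Counter

def cooccurrence_counts(sessions, window):
    """Returns (N, Nij): N[url] counts requests of each URL, and
    Nij[(pi, pj)] counts request pairs that fall inside the moving
    window in some session (pair keys stored in sorted order)."""
    n, nij = Counter(), Counter()
    for s in sessions:
        for url in s:
            n[url] += 1
        for a in range(len(s)):
            # pairs less than `window` positions apart count as co-occurrences
            for b in range(a + 1, min(a + window, len(s))):
                if s[a] != s[b]:
                    nij[tuple(sorted((s[a], s[b])))] += 1
    return n, nij
```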
Construct WebPageSet (cont'd)

Step 4  Create a distance matrix (cont'd)
3) P(pi | pj) = Ni,j / Nj
4) Three distance functions
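The three distance functions themselves are not reproduced in the transcript. One plausible symmetric distance built from the conditional probabilities above (an assumption, not necessarily one of the paper's three) is:

```python
def cooccurrence_distance(ni, nj, nij):
    """Distance in [0, 1]: pages often visited together are close.
    Averages P(pi|pj) = Nij/Nj and P(pj|pi) = Nij/Ni; assumed form."""
    if nij == 0:
        return 1.0   # never co-occur: maximally distant
    return 1.0 - min(1.0, (nij / nj + nij / ni) / 2)
```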
Experimental Validation

Conclusions

A new algorithm for clustering web documents based only on the log data.

By changing the parameters intelligently during the recursive process, RDBC can give clustering results superior to those of DBSCAN.
Learning to Cluster Web Search Results

Introduction
This algorithm is based on salient phrases drawn from document contents.

It is fast enough to be used for online calculation by a search engine.
Characteristics of clustering web search results

Existing search engines such as Google, Yahoo and MSN often return long lists of search results.

Clustering of similar search results helps users find relevant results.
Clustered Search results

Conventional Search results
Procedure of the algorithm

Step 1: Search result fetching
Step 2: Document parsing and phrase property calculation
Step 3: Salient phrase ranking
Search result fetching

Input a query to a conventional web search engine.

Get the webpage of results returned by the engine.

Extract the titles and snippets.
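A toy version of the extraction step; the HTML pattern below is hypothetical, since every real engine's markup differs and in practice requires a proper parser:

```python
import re

TITLE_SNIPPET = re.compile(
    r'<a class="title"[^>]*>(.*?)</a>.*?<p class="snippet">(.*?)</p>',
    re.DOTALL)

def extract_results(html):
    """Pull (title, snippet) pairs out of a simplified result page,
    stripping inline tags such as <b>...</b> highlighting."""
    def strip_tags(s):
        return re.sub(r"<[^>]+>", "", s).strip()
    return [(strip_tags(t), strip_tags(s))
            for t, s in TITLE_SNIPPET.findall(html)]
```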
Document parsing

Step 1: Cleaning
• Stemming (using Porter's algorithm)
• Sentence boundary identification

Step 2: Post-processing
• Punctuation elimination
• Filter out stop-words, e.g. 'too', 'are'
• Filter out query words
  Ex: "Microsoft software is available to students." (for the query "Microsoft software")
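The cleaning and post-processing steps can be sketched as follows. The stop-word list is a tiny illustrative subset, and `crude_stem` is only a stand-in for Porter's algorithm, which handles many more suffix rules:

```python
import re

STOP_WORDS = {"too", "are", "is", "to", "the", "a", "of"}  # illustrative subset

def crude_stem(word):
    """Stand-in for Porter's algorithm: strips a few common suffixes only."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text, query):
    """Lowercase, eliminate punctuation, stem, then filter out
    stop-words and the query words themselves."""
    query_stems = {crude_stem(q) for q in query.lower().split()}
    words = re.findall(r"[a-z]+", text.lower())   # punctuation elimination
    return [w for w in map(crude_stem, words)
            if w not in STOP_WORDS and w not in query_stems]
```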
Phrase property calculation

Five properties:
1. Phrase Frequency / Inverted Document Frequency (TFIDF)

2. Phrase Length
LEN = n, ex: LEN("big") = 1
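The first two properties, sketched with the standard TF·IDF formula (the paper's exact weighting and normalization may differ):

```python
import math

def tfidf(phrase, doc, all_docs):
    """doc and all_docs are lists of phrases; standard TF * IDF."""
    tf = doc.count(phrase) / len(doc)
    df = sum(1 for d in all_docs if phrase in d)   # document frequency
    idf = math.log(len(all_docs) / (1 + df))       # smoothed IDF
    return tf * idf

def phrase_len(phrase):
    """LEN is just the number of words: LEN("big") == 1."""
    return len(phrase.split())
```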
Phrase property calculation (cont'd)

3. Intra-Cluster Similarity
o: centroid
Here di = {TFIDF1, TFIDF2, ...};
each component of the vector represents the TFIDF of a phrase.
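ICS can be read as the average cosine similarity between each document vector di and the cluster centroid o; the averaging and normalization below are assumptions about the exact formula:

```python
import math

def intra_cluster_similarity(docs):
    """docs: list of dicts mapping phrase -> TFIDF weight.
    Returns the mean cosine similarity to the centroid o."""
    keys = set().union(*docs)
    o = {k: sum(d.get(k, 0.0) for d in docs) / len(docs) for k in keys}

    def cos(u, v):
        dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in keys)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    return sum(cos(d, o) for d in docs) / len(docs)
```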
Phrase property calculation (cont'd)

4. Cluster Entropy

5. Phrase Independence
Ex: three "vectors" has… with some "vectors" be…
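The intuition behind phrase independence — a phrase like "vectors" that appears with many different surrounding words is a more self-contained unit — can be sketched as the entropy of its left and right context words. This is one plausible reading; the paper's exact definition may differ:

```python
import math
from collections import Counter

def context_entropy(context_words):
    """Shannon entropy (bits) of the word distribution next to a phrase."""
    counts = Counter(context_words)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def independence(left_contexts, right_contexts):
    """IND as the average of left- and right-context entropies (assumed)."""
    return (context_entropy(left_contexts) + context_entropy(right_contexts)) / 2
```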
Learning to rank key phrases

A regression model is used to combine the above five properties, calculating a single salience score for each phrase.

Regression is an algorithm which tries to determine the relationship between random variables X = (x1, x2, ..., xn) and y.

Here x = (TFIDF, LEN, ICS, CE, IND).
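A minimal sketch of the learning step, fitting the weight vector by stochastic gradient descent on labeled examples. The features and weights that come out are illustrative; the paper itself evaluates linear, logistic and support vector regression:

```python
def salience_score(x, w):
    """Linear salience score for a feature vector x = (TFIDF, LEN, ICS, CE, IND)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def fit_linear(X, y, lr=0.01, epochs=2000):
    """Plain stochastic gradient descent; no intercept, for brevity."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = salience_score(xi, w) - yi
            w = [wv - lr * err * xv for wv, xv in zip(w, xi)]
    return w
```

Once fitted, phrases are simply sorted by their scores, highest first.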
Learning to rank key phrases (cont'd)

Three regression models:
• Linear Regression
• Logistic Regression
• Support Vector Regression
Evaluation

Conclusions

The search result clustering problem is recast as a supervised salient phrase ranking problem.

The algorithm generates correct clusters with short names, which could improve the user's browsing efficiency through search results.
Thanks!