Automatic Detection of Social Tag Spams Using a Text Mining Approach
Hsin-Chang Yang
Associate Professor
Department of Information Management
National University of Kaohsiung
Outline
- Introduction
- Association Discovery by SOM
- Tag Spam Detection
- Experimental Results
- Conclusions
Social Bookmarking – Why?
- Social bookmarking services (aka folksonomies) are gaining popularity because they offer the following benefits:
  - Alleviation of effort in Web page annotation
  - Improvement of retrieval precision
  - Simplification of Web page classification
- How does folksonomy work?
  - Simple: a user ($u_i$) annotates a Web page ($o_j$) with a set of tags ($T_{ij}$).
  - Generally represented as a set of tuples $(u_i, o_j, T_{ij})$, where $u_i \in U$, $o_j \in O$, and $T_{ij} \subseteq T$.
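The tuple representation can be sketched as a small data structure. This is only an illustration: the `Post` type, the helper `tags_by_page`, and the sample users, URLs, and tags are all hypothetical and do not come from the dataset.

```python
from collections import defaultdict
from typing import FrozenSet, NamedTuple


class Post(NamedTuple):
    """One folksonomy tuple (u_i, o_j, T_ij)."""
    user: str             # u_i
    page: str             # o_j
    tags: FrozenSet[str]  # T_ij


# Hypothetical posts, for illustration only.
posts = [
    Post("alice", "http://example.com/som", frozenset({"som", "clustering"})),
    Post("bob", "http://example.com/som", frozenset({"neural-network", "som"})),
    Post("bob", "http://example.com/offer", frozenset({"cheap", "pills"})),
]


def tags_by_page(posts):
    """Merge the tag sets that different users attached to the same page."""
    merged = defaultdict(set)
    for post in posts:
        merged[post.page] |= post.tags
    return dict(merged)


page_tags = tags_by_page(posts)
```

Grouping by page is one possible view of the tuples; grouping by user or by tag works the same way.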
Characteristics of Folksonomy
- Collaboration
- Semantic relatedness
- Possibility of spam
Tag Spams
- Tags that are unrelated or improperly related to the content or semantics of the annotated Web pages.
- They arise for advertising or promotional purposes.
- They mislead users and degrade retrieval results.
System Architecture
[Figure: training pipeline. Web pages and tags are preprocessed into Web page vectors and tag vectors; SOM training produces synaptic weight vectors; labeling produces page clusters and tag clusters; association discovery produces page/tag associations.]
Preprocessing
- Bag-of-words approach
  - Each Web page $P_i$ is transformed into a binary vector $\mathbf{P}_i$.
  - $T_i$, the tag list of $P_i$, is transformed into a binary vector $\mathbf{T}_i$.
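A minimal sketch of the binary bag-of-words step follows. The tokenizer, the helper names, and the sample texts are assumptions made for illustration; the slides do not specify stop-word removal, stemming, or any other preprocessing detail. Pages and tag lists get separate vocabularies here, in line with the separate vocabulary sizes reported in the experiments.

```python
import re

import numpy as np


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(texts):
    """Map each distinct word to a vector index."""
    words = sorted({w for t in texts for w in tokenize(t)})
    return {w: i for i, w in enumerate(words)}


def binary_vector(text, vocab):
    """Set component i to 1 if word i occurs in the text, else 0."""
    vec = np.zeros(len(vocab), dtype=np.uint8)
    for w in tokenize(text):
        if w in vocab:
            vec[vocab[w]] = 1
    return vec


# Hypothetical pages and tag lists.
pages = ["Self-organizing maps cluster documents", "Cheap pills online"]
tag_lists = ["som clustering", "pills cheap buy"]

page_vocab = build_vocabulary(pages)
tag_vocab = build_vocabulary(tag_lists)
page_vectors = np.array([binary_vector(p, page_vocab) for p in pages])
tag_vectors = np.array([binary_vector(t, tag_vocab) for t in tag_lists])
```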
SOM Training
- All $\mathbf{P}_i$ and $\mathbf{T}_i$ were trained separately with the self-organizing map (SOM) algorithm.
- Two maps, $M_P$ and $M_T$, were obtained after training.
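The training step can be illustrated with a compact NumPy SOM. This is a generic textbook-style sketch, not the exact procedure used in the talk: the Gaussian neighborhood, the linear decay schedules, the initial learning rate `lr0`, and the random stand-in data are all assumptions.

```python
import numpy as np


def train_som(data, rows=10, cols=10, epochs=100, lr0=0.5, seed=0):
    """Train a rows x cols SOM; returns synaptic weights of shape (rows*cols, dim)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    weights = rng.random((rows * cols, data.shape[1]))
    # Grid coordinates of every neuron, used by the neighborhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # assumed linear decay
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # assumed shrinking radius
        for x in data[rng.permutation(len(data))]:
            # Best-matching unit: the neuron closest to the input vector.
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            grid_dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            # Pull the BMU and its grid neighbors toward the input.
            weights += lr * h[:, None] * (x - weights)
    return weights


# Random binary vectors standing in for page vectors and tag vectors.
rng = np.random.default_rng(1)
M_P = train_som(rng.integers(0, 2, size=(20, 12)), epochs=20)
M_T = train_som(rng.integers(0, 2, size=(20, 6)), epochs=10)
```

Training the two maps with independent calls mirrors the "separately" in the slide: the page map never sees tag vectors and vice versa.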
Labeling
- Each Web page was labeled on $M_P$ by finding its most similar neuron. A page cluster map (PCM) was obtained after all pages had been labeled.
- The same approach was applied to all tag lists on $M_T$ to obtain the tag cluster map (TCM).
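Labeling can be sketched as a nearest-neuron lookup. Euclidean distance is assumed as the similarity measure (the slides only say "most similar"), and the function names and toy weights are hypothetical.

```python
import numpy as np


def label(vectors, weights):
    """Return, for each input vector, the index of its most similar neuron."""
    vectors = np.asarray(vectors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Pairwise Euclidean distances, shape (n_vectors, n_neurons).
    dists = np.linalg.norm(vectors[:, None, :] - weights[None, :, :], axis=2)
    return np.argmin(dists, axis=1)


def cluster_map(labels, n_neurons):
    """Group item indices by the neuron they were labeled to."""
    clusters = {k: [] for k in range(n_neurons)}
    for item, neuron in enumerate(labels):
        clusters[int(neuron)].append(item)
    return clusters


# Toy map with two neurons and three input vectors.
weights = np.array([[0.0, 0.0], [1.0, 1.0]])
vectors = np.array([[0.0, 0.1], [0.9, 1.0], [1.0, 1.0]])
labels = label(vectors, weights)
pcm = cluster_map(labels, n_neurons=len(weights))
```

Each neuron then acts as one cluster: the PCM is the result of running this over page vectors on $M_P$, and the TCM the result over tag vectors on $M_T$.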
Association Discovery
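- Find associations between page clusters and tag clusters.
- A voting scheme was used to find the associations.
[Figure: voting scheme. Page $P_i$ is labeled to page cluster $C_k^{PCM}$ on the PCM and its tag list $T_i$ to tag cluster $C_p^{TCM}$ on the TCM; that cluster pair receives one vote (+1).]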
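The voting scheme can be sketched as a co-occurrence count: every training pair adds one vote to the (page cluster, tag cluster) pair it was labeled to. How votes are turned into the binary correlation matrix $A$ is not stated in the slides, so the `min_votes` cut-off below is an assumption, as are the function name and the toy labels.

```python
import numpy as np


def discover_associations(page_labels, tag_labels, n_page_clusters,
                          n_tag_clusters, min_votes=1):
    """Count votes per (page cluster, tag cluster) pair and binarize them."""
    votes = np.zeros((n_page_clusters, n_tag_clusters), dtype=int)
    for page_cluster, tag_cluster in zip(page_labels, tag_labels):
        votes[page_cluster, tag_cluster] += 1  # the "+1" vote
    # Assumed rule: two clusters are related once they collect enough votes.
    A = (votes >= min_votes).astype(int)
    return votes, A


# Cluster labels of five hypothetical training pairs.
page_labels = [0, 0, 1, 1, 1]
tag_labels = [2, 2, 0, 0, 1]
votes, A = discover_associations(page_labels, tag_labels, 2, 3, min_votes=2)
```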
Architecture of Tag Spam Detection
[Figure: detection pipeline. The incoming Web page and incoming tag list are preprocessed into an incoming page vector and an incoming tag vector; labeling against the PCM and TCM yields the labeled page cluster and the labeled tag cluster; spam detection combines these with the page/tag associations to output tag spams.]
Spam Detection
- Two types of tag spams
  - Document-scope detection (post-level detection): the whole tag list is identified as spam.
  - Tag-scope detection (tag-level detection): individual tags are identified as spams.
- Let $P_I$ and $T_I$ be the incoming Web page and its tag list, respectively.
- Let $P_I$ and $T_I$ be labeled to $C_p^{PCM}$ and $C_q^{TCM}$, respectively.
Document-Scope Detection
- Relatedness between page cluster $C_p^{PCM}$ and tag cluster $C_q^{TCM}$:

$$S(C_p^{PCM}, C_q^{TCM}) = \frac{1}{|Q|} \sum_{C_k^{TCM} \in Q} \frac{a_{pk}}{1 + D(C_q^{TCM}, C_k^{TCM})}$$

  - $Q$: neighborhood of $C_q^{TCM}$
  - $A = [a_{ij}]$ is the correlation matrix between PCM and TCM. $a_{pk} = 1$ if $C_p^{PCM}$ and $C_k^{TCM}$ are related; otherwise $a_{pk} = 0$.
  - $D$: geometric distance between two clusters
- $T_I$ is identified as spam if $S(C_p^{PCM}, C_q^{TCM})$ is below a threshold.
Tag-Scope Detection
- A tag is a spam if it is inconsistent with the other tags in the same tag cluster.
- Let $T_i = \{t_{ij}\}$ be a tag list and

$$W = \bigcup_{T_i \in C_q^{TCM} \setminus \{T_I\}} T_i$$

- An incoming tag $t_{Ij} \in T_I$ is a spam if $t_{Ij} \notin W$.
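Tag-scope detection can be sketched as a set-membership test. The sketch assumes $W$ is the union of the training tag lists labeled to the same tag cluster as the incoming tag list; the function names and sample tags are hypothetical.

```python
def cluster_vocabulary(member_indices, tag_lists):
    """Union of all tags used by the training tag lists in one tag cluster (W)."""
    W = set()
    for i in member_indices:
        W |= set(tag_lists[i])
    return W


def spam_tags(incoming_tags, W):
    """Return the incoming tags that do not appear in W."""
    return [t for t in incoming_tags if t not in W]


# Hypothetical training tag lists and a tag cluster map (cluster -> list indices).
tag_lists = [["som", "clustering"], ["neural", "som"], ["cheap", "pills"]]
tcm = {0: [0, 1], 1: [2]}

incoming = ["som", "clustering", "casino"]  # assume it was labeled to cluster 0
W = cluster_vocabulary(tcm[0], tag_lists)
flagged = spam_tags(incoming, W)
```

Under this assumption a tag is flagged purely because it never co-occurs with the cluster's tags in the training data, so rare but legitimate tags can also be flagged.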
Experimental Results
- Dataset
  - 1500 Web page / tag list pairs collected from www.delicious.com
  - Each pair was inspected manually at both the post level and the tag level
  - 583 distinct Web pages
  - Sizes of vocabularies
    - Web pages: 13437
    - Tag lists: 5157
  - Average number of tags per page: 4.7
- Parameters
  - Map sizes
    - PCM: 10 × 10
    - TCM: 10 × 10
  - Training epochs
    - PCM: 400
    - TCM: 200
: 0.7
- Number of training / test data: 1000 / 500
- Confusion matrix for document-scope detection

                      Actual spam   Actual non-spam
Predicted spam            118             65
Predicted non-spam         44            273

- Accuracy = (118 + 273) / 500 = 78.2%
- Recall = 118 / (118 + 44) = 72.8%
- Precision = 118 / (118 + 65) = 64.5%
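The three figures follow directly from the four cells of the confusion matrix. The short check below recomputes them; the function name is illustrative.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, recall, and precision from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),      # share of actual spam that was caught
        "precision": tp / (tp + fp),   # share of predicted spam that was real
    }


# Document-scope detection on the 500 test pairs.
m = metrics(tp=118, fp=65, fn=44, tn=273)
```

The same function applies to the cross-validated matrices, since it only needs the four cell values.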
Further Result of Document-Scope Detection
- Result after 10-fold cross validation
- Confusion matrix

                      Actual spam   Actual non-spam
Predicted spam           123.1           62.3
Predicted non-spam        43.6          271

- Accuracy = (123.1 + 271) / 500 = 78.8%
- Recall = 123.1 / (123.1 + 43.6) = 73.8%
- Precision = 123.1 / (123.1 + 62.3) = 66.4%
Further Result of Tag-Scope Detection
- Result after 10-fold cross validation
- Confusion matrix

                      Actual spam   Actual non-spam
Predicted spam            1.4*            0.7
Predicted non-spam        0.4             2.2

  * average number of tags per page

- Accuracy = (1.4 + 2.2) / 4.7 = 76.6%
- Recall = 1.4 / (1.4 + 0.4) = 77.8%
- Precision = 1.4 / (1.4 + 0.7) = 66.7%
Conclusions
- A novel scheme for tag spam detection based on text mining.
- Relatedness between Web pages and tags was discovered using self-organizing maps.
- The scheme uses only the content of Web pages instead of user behaviors.
Thanks for your attention.