![Page 1: Automatic Detection of Social Tag Spams Using a Text Mining Approach Hsin-Chang Yang Associate Professor Department of Information Management National](https://reader036.vdocuments.us/reader036/viewer/2022070415/56649f115503460f94c23beb/html5/thumbnails/1.jpg)
Automatic Detection of Social Tag Spams Using a Text Mining Approach
Hsin-Chang Yang
Associate Professor
Department of Information Management
National University of Kaohsiung
Outline
- Introduction
- Association Discovery by SOM
- Tag Spam Detection
- Experimental Results
- Conclusions
Social Bookmarking: Why?
Social bookmarking services (also known as folksonomies) are gaining popularity because they offer the following benefits:
- Alleviation of efforts in Web page annotation
- Improvement of retrieval precision
- Simplification of Web page classification
How Does Folksonomy Work?
- Simple: a user $u_i$ annotates a Web page $o_j$ with a set of tags $T_{ij}$.
- Generally represented as a set of tuples $(u_i, o_j, T_{ij})$, where $u_i \in U$, $o_j \in O$, and $t_{ij} \in T$.
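
The tuple representation can be sketched as a small data structure. This is only an illustration: the `Post` type, the `tags_by_page` helper, and the sample users, URLs, and tags are all hypothetical and do not come from the slides.

```python
from collections import defaultdict
from typing import FrozenSet, NamedTuple


class Post(NamedTuple):
    """One folksonomy tuple (u_i, o_j, T_ij)."""
    user: str              # u_i, an element of U
    page: str              # o_j, an element of O
    tags: FrozenSet[str]   # T_ij, a set of tags drawn from T


# Hypothetical sample posts; the users, URLs, and tags are made up.
posts = [
    Post("alice", "http://example.com/som", frozenset({"som", "clustering"})),
    Post("bob", "http://example.com/som", frozenset({"neural-network", "som"})),
    Post("carol", "http://example.com/news", frozenset({"news"})),
]


def tags_by_page(posts):
    """Collect, for each page o_j, every tag that any user assigned to it."""
    index = defaultdict(set)
    for post in posts:
        index[post.page] |= post.tags
    return dict(index)


page_tags = tags_by_page(posts)
print(sorted(page_tags["http://example.com/som"]))
```

Grouping by page is one convenient view of the tuples; grouping by user or by tag works the same way.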
Characteristics of Folksonomy
- Collaboration
- Semantic relatedness
- Possibility of spam
Tag Spams
- Tags that are unrelated or improperly related to the content or semantics of the annotated Web pages.
- They arise for advertisement or promotional purposes.
- They mislead users and degrade retrieval results.
System Architecture
[Figure: Web pages and tags → Preprocessing → Web page vectors and tag vectors → SOM training → synaptic weight vectors → Labeling → page clusters and tag clusters → Association discovery → page/tag associations]
Preprocessing
- Bag-of-words approach
- Web page $P_i$ is transformed to a binary vector $\mathbf{P}_i$.
- $T_i$, the tag list of $P_i$, is transformed to a binary vector $\mathbf{T}_i$.
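
A minimal sketch of the binary bag-of-words transformation follows. The tokenizer, the helper names, and the sample texts are assumptions made for illustration; the slides do not specify how words are extracted or filtered.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (an assumed tokenizer)."""
    return [w for w in re.split(r"[^a-z0-9]+", text.lower()) if w]


def build_vocabulary(documents):
    """Map each distinct word to a fixed vector index."""
    words = sorted({w for doc in documents for w in tokenize(doc)})
    return {w: i for i, w in enumerate(words)}


def to_binary_vector(text, vocabulary):
    """Return a 0/1 vector with a 1 wherever a vocabulary word occurs in the text."""
    vec = [0] * len(vocabulary)
    for w in tokenize(text):
        if w in vocabulary:
            vec[vocabulary[w]] = 1
    return vec


# Hypothetical page texts and tag lists.
pages = ["Self-organizing maps cluster documents", "Spam tags mislead users"]
tag_lists = ["som clustering", "spam tags"]

# Pages and tag lists get separate vocabularies, as the two vocabulary sizes
# reported in the experiments suggest.
page_vocab = build_vocabulary(pages)
tag_vocab = build_vocabulary(tag_lists)
page_vectors = [to_binary_vector(p, page_vocab) for p in pages]
tag_vectors = [to_binary_vector(t, tag_vocab) for t in tag_lists]
```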
SOM Training
- All page vectors $\mathbf{P}_i$ and tag vectors $\mathbf{T}_i$ were trained separately by the self-organizing map algorithm.
- Two maps, $M_P$ and $M_T$, were obtained after the training.
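
For readers unfamiliar with self-organizing maps, here is a compact sketch of a standard SOM training loop. It is not the author's implementation: the Gaussian neighborhood, the linear decay schedules, the initial learning rate `lr0`, and the small map size in the example are all assumptions.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_som(vectors, rows, cols, epochs, lr0=0.5, seed=0):
    """Train a rows x cols SOM; returns {(row, col): synaptic weight vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                      # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)    # shrinking neighborhood
        for x in rng.sample(vectors, len(vectors)):  # shuffled pass over the data
            # Best-matching unit: the neuron whose weights are closest to x.
            bmu = min(weights, key=lambda n: sq_dist(weights[n], x))
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius * radius))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])   # pull weights toward x
    return weights


# Hypothetical binary tag vectors; one map is trained per vector type.
tag_vectors = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1]]
m_t = train_som(tag_vectors, rows=3, cols=3, epochs=20)
```

Calling `train_som` once on the page vectors and once on the tag vectors yields the two maps $M_P$ and $M_T$.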
Labeling
- We labeled each Web page on $M_P$ by finding its most similar neuron.
- A page cluster map (PCM) was obtained after all pages had been labeled.
- The same approach was applied to all tag lists on $M_T$ to obtain the tag cluster map (TCM).
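
The labeling step can be sketched as a nearest-neuron lookup. The slides do not say which similarity measure is used, so Euclidean distance is an assumption here, and the hand-set weights and vectors are hypothetical.

```python
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def label(vectors, weights):
    """Assign each vector to its most similar neuron.

    Returns a cluster map {neuron: [indices of the vectors labeled to it]}.
    """
    cluster_map = defaultdict(list)
    for idx, x in enumerate(vectors):
        bmu = min(weights, key=lambda n: sq_dist(weights[n], x))
        cluster_map[bmu].append(idx)
    return dict(cluster_map)


# Hypothetical 1 x 2 map with hand-set synaptic weights.
weights = {(0, 0): [0.9, 0.8, 0.1], (0, 1): [0.1, 0.2, 0.9]}
page_vectors = [[1, 1, 0], [0, 0, 1], [1, 0, 0]]
pcm = label(page_vectors, weights)
print(pcm)
```

Each neuron that receives at least one page becomes a page cluster; the same function applied to tag vectors on $M_T$ gives the TCM.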
Association Discovery
- Finding associations between page clusters and tag clusters.
- We used a voting scheme to find the associations.
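[Figure: page $P_i$ is labeled to cluster $C_k^{PCM}$ on the PCM and its tag list $T_i$ to cluster $C_p^{TCM}$ on the TCM; the association between these two clusters receives a vote (+1).]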
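
One way to read the voting scheme is that each training pair casts one vote for the (page cluster, tag cluster) pair it is labeled to. The sketch below follows that reading. How votes become a binary association is not stated in the slides, so the `min_votes` cutoff is a hypothetical choice, as are the sample labels.

```python
from collections import Counter


def vote_associations(page_labels, tag_labels, min_votes=1):
    """Count votes for (page cluster, tag cluster) pairs.

    page_labels[i] is the PCM neuron of page P_i and tag_labels[i] is the
    TCM neuron of its tag list T_i. A pair is treated as related once it
    has collected at least `min_votes` votes (an assumed rule).
    """
    votes = Counter(zip(page_labels, tag_labels))
    related = {pair for pair, n in votes.items() if n >= min_votes}
    return votes, related


# Hypothetical labels for four training pairs.
page_labels = [(0, 0), (0, 0), (0, 1), (0, 0)]
tag_labels = [(1, 1), (1, 1), (2, 0), (2, 0)]
votes, related = vote_associations(page_labels, tag_labels, min_votes=2)
print(votes[((0, 0), (1, 1))], related)
```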
Architecture of Tag Spam Detection
[Figure: incoming Web page and incoming tag list → Preprocessing → incoming page vector and incoming tag vector → Labeling (using PCM and TCM) → labeled page cluster and labeled tag cluster → Spam detection (using page/tag associations) → tag spams]
Spam Detection
Two types of tag spam detection:
- Document-scope detection (post-level detection): the whole tag list is identified as spam.
- Tag-scope detection (tag-level detection): individual tags are identified as spam.

Let $P_I$ and $T_I$ be the incoming Web page and its tag list, respectively.
Let $P_I$ and $T_I$ be labeled to $C_p^{PCM}$ and $C_q^{TCM}$, respectively.
Document-Scope Detection
Relatedness between page cluster $C_p^{PCM}$ and tag cluster $C_q^{TCM}$:

$$S(C_p^{PCM}, C_q^{TCM}) = \frac{1}{|Q|} \sum_{C_k^{TCM} \in Q} \frac{a_{pk}}{1 + D(C_q^{TCM}, C_k^{TCM})}$$

- $Q$: neighborhood of $C_q^{TCM}$.
- $A = [a_{ij}]$ is the correlation matrix between PCM and TCM: $a_{pk} = 1$ if $C_p^{PCM}$ and $C_k^{TCM}$ are related; otherwise $a_{pk} = 0$.
- $D$: geometric distance between two clusters.

$T_I$ is identified as spam if $S(C_p^{PCM}, C_q^{TCM})$ falls below a threshold.
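
The document-scope test can be sketched as follows, assuming the relatedness score is the average, over the neighborhood $Q$ of the tag cluster, of $a_{pk}$ discounted by $1 / (1 + D)$. That functional form, the square neighborhood with `radius=1`, the use of Euclidean grid distance for $D$, and the `threshold` value are all assumptions for illustration.

```python
import math


def neighborhood(center, rows, cols, radius=1):
    """Neurons within `radius` grid steps of `center`, including `center`."""
    r0, c0 = center
    return [(r, c)
            for r in range(max(0, r0 - radius), min(rows, r0 + radius + 1))
            for c in range(max(0, c0 - radius), min(cols, c0 + radius + 1))]


def relatedness(page_cluster, tag_cluster, related, rows, cols, radius=1):
    """Score S between a page cluster and a tag cluster.

    `related` is the set of (page cluster, tag cluster) pairs with a_pk = 1.
    """
    q = neighborhood(tag_cluster, rows, cols, radius)
    total = 0.0
    for k in q:
        a_pk = 1 if (page_cluster, k) in related else 0
        d = math.dist(tag_cluster, k)  # geometric distance on the map grid
        total += a_pk / (1.0 + d)
    return total / len(q)


def is_spam_post(page_cluster, tag_cluster, related, rows, cols, threshold):
    """Flag the whole tag list as spam when the score falls below `threshold`."""
    return relatedness(page_cluster, tag_cluster, related, rows, cols) < threshold


# Hypothetical associations on a 3 x 3 TCM.
related = {((0, 0), (1, 1)), ((0, 0), (1, 2))}
score = relatedness((0, 0), (1, 1), related, rows=3, cols=3)
print(score, is_spam_post((2, 2), (1, 1), related, 3, 3, threshold=0.1))
```

Averaging over a neighborhood rather than checking only $a_{pq}$ lets a tag list pass when it lands next to, but not exactly on, an associated tag cluster.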
Tag-Scope Detection
- A tag is spam if it is inconsistent with the other tags in the same tag cluster.
- Let $T_i = \{t_{ij}\}$ be a tag list and

$$W = \bigcup_{T_i \in C_q^{TCM}} T_i$$

- An incoming tag $t_{Ij} \in T_I$ is spam if $t_{Ij} \notin W$.
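
A sketch of the tag-scope check is given below, under the assumption that $W$ is the union of all training tag lists labeled to the same tag cluster as the incoming tag list. The helper names and sample data are hypothetical.

```python
def cluster_tag_set(tag_lists, tag_labels, cluster):
    """Build W: every tag seen in the training tag lists labeled to `cluster`."""
    w = set()
    for tags, lab in zip(tag_lists, tag_labels):
        if lab == cluster:
            w |= set(tags)
    return w


def spam_tags(incoming_tags, w):
    """Return the incoming tags that do not occur in W."""
    return [t for t in incoming_tags if t not in w]


# Hypothetical training tag lists and their TCM labels.
tag_lists = [["som", "clustering"], ["som", "neural-network"], ["news"]]
tag_labels = [(1, 1), (1, 1), (2, 0)]

w = cluster_tag_set(tag_lists, tag_labels, cluster=(1, 1))
flagged = spam_tags(["som", "cheap-watches", "clustering"], w)
print(flagged)
```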
Experimental Result
Dataset
- 1500 Web page / tag list pairs collected from www.delicious.com
- Each pair was inspected manually at both the post level and the tag level
- 583 distinct Web pages
- Sizes of vocabularies
  - Web pages: 13437
  - Tag lists: 5157
- Average number of tags per page: 4.7
Experimental Result
Parameters
- Map sizes
  - PCM: 10 × 10
  - TCM: 10 × 10
- Training epochs
  - PCM: 400
  - TCM: 200
: 0.7
Experimental Result
Number of training / test data: 1000 / 500

Confusion matrix for document-scope detection:

| Predicted result | Actual: Spam | Actual: Non-spam |
|---|---|---|
| Spam | 118 | 65 |
| Non-spam | 44 | 273 |

- Accuracy = (118 + 273) / 500 = 78.2%
- Recall = 118 / (118 + 44) = 72.8%
- Precision = 118 / (118 + 65) = 64.5%
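
The three figures can be recomputed from the four confusion-matrix cells. The snippet below is a small arithmetic check; the `metrics` helper is a name chosen for illustration.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, recall, and precision from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),      # share of actual spam that was caught
        "precision": tp / (tp + fp),   # share of predicted spam that is spam
    }


# Cells of the document-scope confusion matrix reported on the slide.
m = metrics(tp=118, fp=65, fn=44, tn=273)
for name, value in m.items():
    print(f"{name}: {value:.1%}")
```

The same function applies to the cross-validated matrices, since it only needs the four cell values.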
Further Result of Document-Scope Detection
Result after 10-fold cross validation

Confusion matrix:

| Predicted result | Actual: Spam | Actual: Non-spam |
|---|---|---|
| Spam | 123.1 | 62.3 |
| Non-spam | 43.6 | 271 |

- Accuracy = (123.1 + 271) / 500 = 78.8%
- Recall = 123.1 / (123.1 + 43.6) = 73.8%
- Precision = 123.1 / (123.1 + 62.3) = 66.4%
Further Result of Tag-Scope Detection
Result after 10-fold cross validation

Confusion matrix:

| Predicted result | Actual: Spam | Actual: Non-spam |
|---|---|---|
| Spam | 1.4* | 0.7 |
| Non-spam | 0.4 | 2.2 |

(*) Average number of tags per page.

- Accuracy = (1.4 + 2.2) / 4.7 = 76.6%
- Recall = 1.4 / (1.4 + 0.4) = 77.8%
- Precision = 1.4 / (1.4 + 0.7) = 66.7%
Conclusions
- A novel scheme for tag spam detection based on text mining.
- Relatedness between Web pages and tags was discovered using self-organizing maps.
- Uses only the content of Web pages instead of user behaviors.
Thanks for your attention.