
Automatic Detection of Social Tag Spams Using a Text Mining Approach

Hsin-Chang Yang
Associate Professor
Department of Information Management
National University of Kaohsiung


Outline

Introduction
Association Discovery by SOM
Tag Spam Detection
Experimental Results
Conclusions


Social Bookmarking: Why?

Social bookmarking services (also known as folksonomies) are gaining popularity because they offer the following benefits:

Less effort in Web page annotation
Improved retrieval precision
Simpler Web page classification

How Does Folksonomy Work?

Simple: a user ($u_i$) annotates a Web page ($o_j$) with a set of tags ($T_{ij}$).

A folksonomy is generally represented as a set of tuples $(u_i, o_j, T_{ij})$, where $u_i \in U$, $o_j \in O$, and each tag $t_{ij}$ in $T_{ij}$ belongs to $T$.
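The tuple representation can be sketched in a few lines of Python. The user, page, and tag names below are made-up toy data, and `tags_per_page` is a hypothetical helper, not part of the presented system.

```python
from collections import defaultdict

# Hypothetical toy folksonomy: each post is a tuple (user, page, tag set).
posts = [
    ("u1", "o1", {"python", "tutorial"}),
    ("u2", "o1", {"python", "programming"}),
    ("u1", "o2", {"recipes"}),
]

def tags_per_page(posts):
    """Merge the tag sets that all users assigned to each page."""
    merged = defaultdict(set)
    for _user, page, tags in posts:
        merged[page] |= tags
    return dict(merged)

users = {u for u, _, _ in posts}                      # the set U
pages = {o for _, o, _ in posts}                      # the set O
vocabulary = set().union(*(t for _, _, t in posts))   # the set T
```

Keeping the posts as plain tuples makes it easy to derive U, O, and T on demand.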


Characteristics of Folksonomy

Collaboration
Semantic relatedness
Possibility of spam


Tag Spams

Tag spams are tags that are unrelated or improperly related to the content or semantics of the annotated Web pages.
They arise for advertising or promotional purposes.
They mislead users and degrade retrieval results.

System Architecture

[Figure: system architecture. Web pages and tags pass through preprocessing (Web page vectors, tag vectors), SOM training (synaptic weight vectors), and labeling (page clusters, tag clusters); association discovery then produces the page/tag associations.]

Preprocessing

Bag-of-words approach

Each Web page $P_i$ is transformed into a binary vector $\mathbf{P}_i$.

$T_i$, which is the tag list of $P_i$, is transformed into a binary vector $\mathbf{T}_i$.
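A minimal sketch of the binary bag-of-words transformation follows. The tokenizer, the toy texts, and the function names are assumptions for illustration; the slides do not describe the actual tokenization or any stop-word handling. Pages and tag lists get separate vocabularies here, matching the separate vocabulary sizes reported in the experiments.

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplistic, assumed tokenizer)."""
    return [w for w in re.split(r"[^a-z0-9]+", text.lower()) if w]

def build_vocabulary(documents):
    """Map each distinct word to a fixed vector index (alphabetical order)."""
    words = sorted({w for doc in documents for w in tokenize(doc)})
    return {w: i for i, w in enumerate(words)}

def binary_vector(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in the text."""
    vec = [0] * len(vocabulary)
    for w in tokenize(text):
        if w in vocabulary:
            vec[vocabulary[w]] = 1
    return vec

# Toy data, not taken from the experiments.
pages = ["Python tutorial for beginners", "Cheap flights and hotel deals"]
tag_lists = ["python tutorial", "travel deals"]

page_vocab = build_vocabulary(pages)
tag_vocab = build_vocabulary(tag_lists)
page_vectors = [binary_vector(p, page_vocab) for p in pages]
tag_vectors = [binary_vector(t, tag_vocab) for t in tag_lists]
```

Each vector has one component per vocabulary word, so word frequency and word order are discarded.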


SOM Training

The page vectors $\mathbf{P}_i$ and the tag vectors $\mathbf{T}_i$ were used to train two self-organizing maps separately.

Two maps, $M_P$ and $M_T$, were obtained after training.
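For readers unfamiliar with the algorithm, here is a minimal pure-Python sketch of self-organizing map training. It is a generic textbook variant, not the authors' implementation: the linear decay of the learning rate and radius, the Gaussian neighborhood function, the initial learning rate `lr0`, and the toy vectors are all assumptions.

```python
import math
import random

def train_som(vectors, rows, cols, epochs, lr0=0.5, seed=0):
    """Train a rows x cols self-organizing map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        # Learning rate and neighborhood radius both shrink as training proceeds.
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)
        radius = max(radius0 * (1.0 - frac), 0.5)
        for x in rng.sample(vectors, len(vectors)):  # shuffled pass over the data
            # Best-matching unit: the neuron whose weights are closest to x.
            bmu = min(weights, key=lambda n: sum(
                (w - v) ** 2 for w, v in zip(weights[n], x)))
            for node, w in weights.items():
                grid_dist2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_dist2 / (2.0 * radius ** 2))
                # Pull the neuron toward x, more strongly the closer it is to the BMU.
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights

# Toy binary vectors standing in for page vectors; M_T would be trained the same way.
toy_page_vectors = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
m_p = train_som(toy_page_vectors, rows=3, cols=3, epochs=20)
```

The returned dictionary maps each neuron's grid position to its synaptic weight vector. The same function would be called once on the page vectors and once on the tag vectors to obtain two independent maps.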


Labeling

We labeled each Web page on $M_P$ by finding its most similar neuron.
A page cluster map (PCM) was obtained after all pages had been labeled.
The same approach was applied to all tag lists on $M_T$ to obtain the tag cluster map (TCM).
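The labeling step can be sketched as a nearest-neuron lookup. Squared Euclidean distance is assumed as the similarity measure, since the slides do not name one, and the two-neuron map with hand-picked weights is a stand-in for a trained $M_P$.

```python
from collections import defaultdict

def best_matching_neuron(vector, weights):
    """Return the neuron whose weight vector is closest in squared Euclidean distance."""
    return min(weights, key=lambda n: sum(
        (w - v) ** 2 for w, v in zip(weights[n], vector)))

def label_map(vectors, weights):
    """Build a cluster map: neuron -> indices of the vectors labeled to it."""
    cluster_map = defaultdict(list)
    for index, vector in enumerate(vectors):
        cluster_map[best_matching_neuron(vector, weights)].append(index)
    return dict(cluster_map)

# Hypothetical 1 x 2 map with hand-picked weights (not a trained map).
toy_map = {(0, 0): [0.9, 0.8, 0.1], (0, 1): [0.1, 0.2, 0.9]}
toy_vectors = [[1, 1, 0], [0, 0, 1], [1, 0, 0]]
pcm = label_map(toy_vectors, toy_map)
```

Each neuron that receives at least one page then acts as a page cluster. Applying `label_map` to the tag vectors and the tag map yields the TCM in the same way.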

Association Discovery

The goal is to find associations between page clusters and tag clusters.
We used a voting scheme to find the associations.

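[Figure: voting scheme. Each training pair ($P_i$, $T_i$) is labeled to a page cluster on the PCM and a tag cluster on the TCM, and the association between those two clusters receives one vote (+1).]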
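The voting scheme can be sketched as a co-occurrence count over cluster pairs. The slides do not say how votes are turned into a binary "related" decision, so the `min_votes` cutoff below is an assumption, as are the toy labels.

```python
from collections import Counter

def vote_associations(page_labels, tag_labels, min_votes=1):
    """Count (page cluster, tag cluster) co-occurrences and keep pairs with enough votes."""
    votes = Counter(zip(page_labels, tag_labels))
    related = {pair for pair, count in votes.items() if count >= min_votes}
    return votes, related

# page_labels[i] and tag_labels[i] are the neurons that P_i and T_i were labeled to.
page_labels = [(0, 0), (0, 0), (1, 1), (0, 0)]
tag_labels = [(2, 2), (2, 2), (0, 1), (1, 0)]
votes, related = vote_associations(page_labels, tag_labels, min_votes=2)
```

The `related` set plays the role of a 0/1 association table between page clusters and tag clusters.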


Architecture of Tag Spam Detection

[Figure: architecture of tag spam detection. An incoming Web page and its incoming tag list are preprocessed into an incoming page vector and an incoming tag vector, labeled against the PCM and TCM to obtain a labeled page cluster and a labeled tag cluster, and passed to spam detection, which outputs the tag spams.]


Spam Detection

Two types of tag spam detection:
Document-scope detection (post-level detection): the whole tag list is identified as spam.
Tag-scope detection (tag-level detection): individual tags are identified as spam.

Let $P_I$ and $T_I$ be the incoming Web page and its tag list, respectively.
Let $P_I$ and $T_I$ be labeled to $C_p^{\mathrm{PCM}}$ and $C_q^{\mathrm{TCM}}$, respectively.


Document-Scope Detection

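Relatedness between page cluster $C_p^{\mathrm{PCM}}$ and tag cluster $C_q^{\mathrm{TCM}}$:

$$S\left(C_p^{\mathrm{PCM}}, C_q^{\mathrm{TCM}}\right) = \frac{1}{|Q|} \sum_{C_k^{\mathrm{TCM}} \in Q} \frac{a_{pk}}{1 + D\left(C_k^{\mathrm{TCM}}, C_q^{\mathrm{TCM}}\right)}$$

$Q$: neighborhood of $C_q^{\mathrm{TCM}}$.
$A = [a_{ij}]$ is the correlation matrix between PCM and TCM: $a_{pk} = 1$ if $C_p^{\mathrm{PCM}}$ and $C_k^{\mathrm{TCM}}$ are related; otherwise $a_{pk} = 0$.
$D$: geometric distance between two clusters.

$T_I$ is identified as spam if $S\left(C_p^{\mathrm{PCM}}, C_q^{\mathrm{TCM}}\right)$ is less than a threshold.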
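A rough sketch of document-scope detection follows. It assumes the relatedness is the average, over the neighborhood Q of the tag cluster, of the association indicator discounted by 1 / (1 + D). It further assumes that Q is every neuron within a grid radius of the tag cluster, that D is Euclidean distance on the map grid, and that the tag list is spam when the score is below the threshold. The radius, the threshold value in the example, and all function names are hypothetical.

```python
import math

def grid_distance(a, b):
    """Euclidean distance between two neurons on the map grid (assumed form of D)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def neighborhood(q, rows, cols, radius):
    """All neurons within the given grid radius of q, including q itself (assumed form of Q)."""
    return [(r, c) for r in range(rows) for c in range(cols)
            if grid_distance((r, c), q) <= radius]

def relatedness(p, q, related, rows, cols, radius=1.0):
    """Average distance-discounted association between page cluster p and neighbors of q."""
    neighbors = neighborhood(q, rows, cols, radius)
    total = sum(1.0 / (1.0 + grid_distance(k, q))
                for k in neighbors if (p, k) in related)
    return total / len(neighbors)

def is_spam_tag_list(p, q, related, rows, cols, threshold, radius=1.0):
    """Flag the whole tag list when the relatedness falls below the threshold."""
    return relatedness(p, q, related, rows, cols, radius) < threshold

# Toy association set on a 3 x 3 TCM: page cluster (0, 0) is related to two tag clusters.
related = {((0, 0), (1, 1)), ((0, 0), (1, 2))}
score = relatedness((0, 0), (1, 1), related, rows=3, cols=3)
```

Averaging over a neighborhood rather than checking only the single pair (p, q) lets nearby tag clusters contribute, which follows the idea that neighboring neurons on a self-organizing map hold similar content.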


Tag-Scope Detection

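A tag is spam if it is inconsistent with the other tags in the same tag cluster.

Let $T_i = \{t_{ij}\}$ be a tag list, let $C_q^{\mathrm{TCM}}$ be the tag cluster that $T_I$ is labeled to, and let

$$W = \bigcup_{T_i \in C_q^{\mathrm{TCM}}} T_i$$

An incoming tag $t_{Ij} \in T_I$ is spam if $t_{Ij} \notin W$.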
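Tag-scope detection can be sketched as a set-membership test, assuming W is the union of all training tag lists labeled to the same tag cluster as the incoming tag list. The toy tag lists, cluster labels, and function names are illustrative only.

```python
def cluster_tag_set(training_tag_lists, training_labels, q):
    """W: union of all training tag lists that were labeled to tag cluster q."""
    w = set()
    for tags, label in zip(training_tag_lists, training_labels):
        if label == q:
            w |= set(tags)
    return w

def spam_tags(incoming_tags, w):
    """Incoming tags that do not occur in W, in their original order."""
    return [t for t in incoming_tags if t not in w]

# Toy training data: two tag lists fall in cluster (0, 0), one in cluster (1, 1).
training_tag_lists = [["python", "tutorial"], ["python", "code"], ["travel", "deals"]]
training_labels = [(0, 0), (0, 0), (1, 1)]

w = cluster_tag_set(training_tag_lists, training_labels, (0, 0))
flagged = spam_tags(["python", "casino", "code"], w)
```

A consequence of this rule is that any tag never seen in the cluster during training is flagged, so a legitimate but previously unseen tag would be counted as spam.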


Experimental Result

Dataset
1500 Web page / tag list pairs collected from www.delicious.com
Each pair was inspected manually at both the post level and the tag level
583 distinct Web pages
Sizes of vocabularies
  Web pages: 13437
  Tag lists: 5157
Average number of tags per page: 4.7


Experimental Result

Parameters
Map sizes
  PCM: 10 × 10
  TCM: 10 × 10
Training epochs
  PCM: 400
  TCM: 200

: 0.7


Experimental Result

Number of training / test data: 1000 / 500
Confusion matrix for document-scope detection:

                      Actual spam   Actual non-spam
Predicted spam            118              65
Predicted non-spam         44             273

Accuracy = (118 + 273) / 500 = 78.2%
Recall = 118 / (118 + 44) = 72.8%
Precision = 118 / (118 + 65) = 64.5%

Further Result of Document-Scope Detection

Result after 10-fold cross validation
Confusion matrix:

                      Actual spam   Actual non-spam
Predicted spam           123.1            62.3
Predicted non-spam        43.6           271

Accuracy = (123.1 + 271) / 500 = 78.8%
Recall = 123.1 / (123.1 + 43.6) = 73.8%
Precision = 123.1 / (123.1 + 62.3) = 66.4%

Further Result of Tag-Scope Detection

Result after 10-fold cross validation
Confusion matrix:

                      Actual spam   Actual non-spam
Predicted spam            1.4*            0.7
Predicted non-spam        0.4             2.2

* average number of tags per page

Accuracy = (1.4 + 2.2) / 4.7 = 76.6%
Recall = 1.4 / (1.4 + 0.4) = 77.8%
Precision = 1.4 / (1.4 + 0.7) = 66.7%
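The accuracy, recall, and precision figures reported for the three confusion matrices above can be recomputed with a few lines of Python. The cell values are copied from the slides; the function name and the dictionary labels are only for illustration.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, recall, and precision from the four confusion-matrix cells."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }

# (true positives, false positives, false negatives, true negatives)
matrices = {
    "document-scope, 1000/500 split": (118, 65, 44, 273),
    "document-scope, 10-fold": (123.1, 62.3, 43.6, 271),
    "tag-scope, 10-fold": (1.4, 0.7, 0.4, 2.2),
}

for name, cells in matrices.items():
    percentages = {k: round(100 * v, 1) for k, v in metrics(*cells).items()}
    print(name, percentages)
```

Here the "Predicted spam" row supplies the true and false positives, and the "Predicted non-spam" row supplies the false and true negatives.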


Conclusions

We proposed a novel scheme for tag spam detection based on text mining.
Relatedness between Web pages and tags was discovered using self-organizing maps.
The scheme uses only the content of Web pages rather than user behavior.

Thanks for your attention.
