Automatic Detection of Social Tag Spams Using a Text Mining Approach Hsin-Chang Yang Associate Professor Department of Information Management National University of Kaohsiung

Automatic Detection of Social Tag Spams Using a Text Mining Approach

Hsin-Chang Yang

Associate Professor

Department of Information Management

National University of Kaohsiung

Outline

- Introduction
- Association Discovery by SOM
- Tag Spam Detection
- Experimental Results
- Conclusions

Social Bookmarking: Why?

Social bookmarking services (also known as folksonomies) are gaining popularity because they offer the following benefits:

- Alleviation of effort in Web page annotation
- Improvement of retrieval precision
- Simplification of Web page classification

How Does Folksonomy Work?

- Simple: a user $u_i$ annotates a Web page $o_j$ with a set of tags $T_{ij}$.
- Generally represented as a set of tuples $(u_i, o_j, T_{ij})$, where $u_i \in U$, $o_j \in O$, and $t_{ij} \in T$.

Characteristics of Folksonomy

- Collaboration
- Semantic relatedness
- Possibility of spam

Tag Spams

- Tags that are unrelated or improperly related to the content or semantics of the annotated Web pages.
- They arise for advertising or promotional purposes.
- They mislead users and degrade retrieval results.

System Architecture

[Figure: system architecture. Web pages and tags -> Preprocessing -> Web page vectors and tag vectors -> SOM training -> synaptic weight vectors -> Labeling -> page clusters and tag clusters -> Association discovery -> page/tag associations]

Preprocessing

- Bag-of-words approach
- Web page $P_i$ is transformed to a binary vector $\mathbf{P}_i$.
- $T_i$, which is the tag list of $P_i$, is transformed to a binary vector $\mathbf{T}_i$.
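The binary bag-of-words step can be illustrated with a minimal sketch. The tokenizer, the toy pages and tag lists, and all function names below are assumptions made for illustration; the slides do not specify how the text was tokenized or filtered. Separate vocabularies are kept for page words and for tags.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocabulary(token_lists):
    """Map every distinct token to a fixed vector position (sorted for reproducibility)."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: idx for idx, tok in enumerate(vocab)}

def to_binary_vector(tokens, vocabulary):
    """Return a 0/1 vector with 1 at the position of every vocabulary word in tokens."""
    vec = [0] * len(vocabulary)
    for tok in tokens:
        idx = vocabulary.get(tok)
        if idx is not None:
            vec[idx] = 1
    return vec

# Hypothetical toy data: two pages and their tag lists.
pages = ["Python tutorial for beginners", "Cheap watches, cheap deals"]
tag_lists = [["python", "tutorial"], ["watches", "shopping"]]

page_tokens = [tokenize(p) for p in pages]
page_vocab = build_vocabulary(page_tokens)  # vocabulary of page words
tag_vocab = build_vocabulary(tag_lists)     # separate vocabulary of tags

page_vectors = [to_binary_vector(t, page_vocab) for t in page_tokens]
tag_vectors = [to_binary_vector(t, tag_vocab) for t in tag_lists]
```

A repeated word such as "cheap" still yields a single 1, which is what distinguishes a binary vector from a term-frequency vector.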

SOM Training

- All page vectors $\mathbf{P}_i$ and tag vectors $\mathbf{T}_i$ were used to train two self-organizing maps (SOMs) separately.
- Two maps, $M_P$ and $M_T$, were obtained after training.
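For readers unfamiliar with SOM training, here is a minimal pure-Python sketch of a standard online SOM. The learning-rate schedule, Gaussian neighborhood, initial radius, random initialization, and the toy vectors are all assumptions; the slides give only the map sizes and epoch counts, not these details.

```python
import math
import random

def best_matching_unit(x, weights):
    """Index of the neuron whose weight vector is closest to x (squared Euclidean)."""
    return min(range(len(weights)),
               key=lambda n: sum((a - b) ** 2 for a, b in zip(x, weights[n])))

def train_som(vectors, rows, cols, epochs, lr0=0.5, seed=0):
    """Train a rows x cols map; weights[r * cols + c] belongs to the neuron at (r, c)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # linearly decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)  # shrinking neighborhood radius
        for x in vectors:
            bmu = best_matching_unit(x, weights)
            br, bc = divmod(bmu, cols)
            for n, w in enumerate(weights):
                r, c = divmod(n, cols)
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius * radius))  # Gaussian neighborhood
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])  # pull the weight toward the input
    return weights

# Hypothetical toy vectors; the two maps are trained independently.
toy_page_vectors = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
toy_tag_vectors = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
page_map = train_som(toy_page_vectors, rows=2, cols=2, epochs=50)
tag_map = train_som(toy_tag_vectors, rows=2, cols=2, epochs=50)
```

The returned lists play the role of the synaptic weight vectors of $M_P$ and $M_T$.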

Labeling

- Each Web page was labeled on $M_P$ by finding its most similar neuron.
- A page cluster map (PCM) was obtained after all pages had been labeled.
- The same approach was applied to all tag lists on $M_T$ to obtain a tag cluster map (TCM).
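The labeling step can be sketched as a nearest-neuron assignment. The sketch assumes that "most similar" means smallest Euclidean distance, and the weight values and page vectors are hypothetical stand-ins for a trained map.

```python
def nearest_neuron(x, weights):
    """Index of the neuron whose weight vector has the smallest squared distance to x."""
    return min(range(len(weights)),
               key=lambda n: sum((a - b) ** 2 for a, b in zip(x, weights[n])))

def label_map(vectors, weights):
    """Assign each vector to its most similar neuron; return {neuron: [item indices]}."""
    clusters = {}
    for i, x in enumerate(vectors):
        clusters.setdefault(nearest_neuron(x, weights), []).append(i)
    return clusters

# Hypothetical trained weights for a tiny map with three neurons.
weights = [[0.9, 0.8, 0.1], [0.5, 0.5, 0.5], [0.1, 0.2, 0.9]]
page_vectors = [[1, 1, 0], [0, 0, 1], [1, 0, 0]]

pcm = label_map(page_vectors, weights)  # page cluster map
```

Each key of `pcm` is a neuron, and the items labeled to it form one page cluster; applying the same function to tag vectors and the tag map would give the TCM.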

Association Discovery

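- Finding associations between page clusters and tag clusters.
- We used a voting scheme to find the associations.

[Figure: voting scheme. Page $P_i$ is labeled to page cluster $C_k^{PCM}$ on the PCM and its tag list $T_i$ is labeled to tag cluster $C_p^{TCM}$ on the TCM; the pair of clusters receives one vote (+1).]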
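A voting scheme of this kind might be sketched as follows. Each training pair casts one vote for the (page cluster, tag cluster) pair it is labeled to. How votes are turned into a 0/1 association is not stated in the slides, so the `min_votes` cutoff below is a hypothetical choice, as are the toy labels.

```python
from collections import Counter

def count_votes(page_labels, tag_labels):
    """page_labels[i] is the PCM cluster of page i; tag_labels[i] is the TCM cluster
    of its tag list. Every pair adds +1 to its (page cluster, tag cluster) cell."""
    votes = Counter()
    for page_cluster, tag_cluster in zip(page_labels, tag_labels):
        votes[(page_cluster, tag_cluster)] += 1
    return votes

def association_matrix(votes, n_page_clusters, n_tag_clusters, min_votes=1):
    """Binary matrix A with A[p][k] = 1 when the pair got at least min_votes votes.
    The cutoff rule is an assumption, not taken from the slides."""
    A = [[0] * n_tag_clusters for _ in range(n_page_clusters)]
    for (p, k), v in votes.items():
        if v >= min_votes:
            A[p][k] = 1
    return A

# Hypothetical cluster labels for five page / tag list pairs.
page_labels = [0, 0, 1, 2, 0]
tag_labels = [1, 1, 0, 2, 2]

votes = count_votes(page_labels, tag_labels)
A_loose = association_matrix(votes, 3, 3)                # any vote counts
A_strict = association_matrix(votes, 3, 3, min_votes=2)  # require two votes
```

A stricter cutoff keeps only associations supported by several training pairs, at the cost of leaving sparsely populated clusters unassociated.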

Architecture of Tag Spam Detection

[Figure: tag spam detection architecture. Incoming Web page and incoming tag list -> Preprocessing -> incoming page vector and incoming tag vector -> Labeling (using PCM and TCM) -> labeled page cluster and labeled tag cluster -> Spam detection (using page/tag associations) -> tag spams]

Spam Detection

- Two types of tag spams
  - Document-scope detection (post-level detection): the whole tag list is identified as spam.
  - Tag-scope detection (tag-level detection): individual tags are identified as spams.
- Let $P_I$ and $T_I$ be the incoming Web page and its tag list, respectively.
- Let $P_I$ and $T_I$ be labeled to $C_p^{PCM}$ and $C_q^{TCM}$, respectively.

Document-Scope Detection

- Relatedness between page cluster $C_p^{PCM}$ and tag cluster $C_q^{TCM}$:

  $$S(C_p^{PCM}, C_q^{TCM}) = \frac{1}{|Q|} \sum_{C_k^{TCM} \in Q} \frac{a_{pk}}{1 + D(C_q^{TCM}, C_k^{TCM})}$$

  - $Q$: neighborhood of $C_q^{TCM}$
  - $A = [a_{ij}]$ is the correlation matrix between PCM and TCM. $a_{pk} = 1$ if $C_p^{PCM}$ and $C_k^{TCM}$ are related; otherwise $a_{pk} = 0$.
  - $D$: geometric distance between two clusters
- $T_I$ is identified as spam if $S(C_p^{PCM}, C_q^{TCM})$ falls below a threshold.
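One possible reading of the document-scope test is sketched below. It assumes the relatedness score averages $a_{pk} / (1 + D)$ over the neighborhood $Q$ of the incoming tag cluster, that $D$ is the Euclidean distance between neurons on the TCM grid, that $Q$ is every neuron within a fixed grid radius, and that a tag list is spam when the score is below a threshold. The radius, threshold value, and toy matrix are hypothetical.

```python
import math

def grid_distance(a, b, cols):
    """Euclidean distance between neurons a and b on a map with `cols` columns."""
    ar, ac = divmod(a, cols)
    br, bc = divmod(b, cols)
    return math.hypot(ar - br, ac - bc)

def neighborhood(q, rows, cols, radius):
    """Neurons within `radius` of neuron q on the grid, q itself included."""
    return [k for k in range(rows * cols) if grid_distance(q, k, cols) <= radius]

def relatedness(p, q, A, rows, cols, radius=1.0):
    """Distance-weighted share of tag clusters near q that are associated with p."""
    Q = neighborhood(q, rows, cols, radius)
    total = sum(A[p][k] / (1.0 + grid_distance(q, k, cols)) for k in Q)
    return total / len(Q)

def is_spam_tag_list(p, q, A, rows, cols, threshold, radius=1.0):
    """Flag the whole tag list when page cluster p and tag cluster q are weakly related."""
    return relatedness(p, q, A, rows, cols, radius) < threshold

# Hypothetical association matrix: 2 page clusters x 4 tag clusters (2 x 2 TCM).
A = [[1, 1, 0, 0],
     [0, 0, 0, 1]]

score_related = relatedness(0, 0, A, rows=2, cols=2)    # (1/1 + 1/2 + 0) / 3
score_unrelated = relatedness(1, 0, A, rows=2, cols=2)  # no associated neighbor
flag = is_spam_tag_list(1, 0, A, rows=2, cols=2, threshold=0.3)
```

Summing over a neighborhood rather than testing $a_{pq}$ alone gives partial credit when the tag list lands next to, but not exactly on, an associated tag cluster.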

Tag-Scope Detection

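- A tag is a spam if it is inconsistent with the other tags in the same tag cluster.
- Let $T_i = \{t_{ij}\}$ be a tag list and

  $$W = \bigcup_{T_i \in C_q^{TCM}} T_i$$

  where $C_q^{TCM}$ is the tag cluster to which $T_I$ is labeled.
- An incoming tag $t_{Ij} \in T_I$ is a spam if $t_{Ij} \notin W$.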
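The tag-scope rule amounts to a set-membership test, sketched below under the assumption that $W$ is the union of all training tag lists labeled to the same tag cluster as the incoming list. The function names and toy tags are hypothetical.

```python
def cluster_tag_set(cluster_tag_lists):
    """W: union of all tags appearing in the tag lists of one tag cluster."""
    W = set()
    for tags in cluster_tag_lists:
        W.update(tags)
    return W

def spam_tags(incoming_tags, cluster_tag_lists):
    """Return the incoming tags that never occur in the cluster's tag lists."""
    W = cluster_tag_set(cluster_tag_lists)
    return [t for t in incoming_tags if t not in W]

# Hypothetical tag lists already labeled to the incoming list's tag cluster.
cluster_lists = [["python", "tutorial"], ["python", "programming"]]
incoming = ["python", "casino", "programming"]

flagged = spam_tags(incoming, cluster_lists)
```

Under this rule a legitimate but previously unseen tag is also flagged, so the test is only as good as the coverage of the training tag lists in each cluster.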

Experimental Result

- Dataset
  - 1500 Web page / tag list pairs collected from www.delicious.com
  - Each pair was inspected manually at both post level and tag level
  - 583 distinct Web pages
  - Sizes of vocabularies
    - Web pages: 13437
    - Tag lists: 5157
  - Average number of tags per page: 4.7

Experimental Result

- Parameters
  - Map sizes
    - PCM: 10 × 10
    - TCM: 10 × 10
  - Training epochs
    - PCM: 400
    - TCM: 200

: 0.7

Experimental Result

- Number of training / test data: 1000 / 500
- Confusion matrix for document-scope detection

                        Actual spam    Actual non-spam
  Predicted spam            118              65
  Predicted non-spam         44             273

- Accuracy = (118 + 273) / 500 = 78.2%
- Recall = 118 / (118 + 44) = 72.8%
- Precision = 118 / (118 + 65) = 64.5%
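The three figures follow directly from the confusion matrix, as the short check below shows. The function name and dictionary layout are illustrative only; the counts are the ones reported on this slide.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, recall and precision for the spam class from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }

# Document-scope counts: predicted spam / actual spam = 118, predicted spam /
# actual non-spam = 65, predicted non-spam / actual spam = 44, remainder = 273.
m = metrics(tp=118, fp=65, fn=44, tn=273)
for name, value in m.items():
    print(f"{name}: {value:.1%}")
```

The same function applies unchanged to averaged counts from cross validation, since it only needs the four cell values.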

Further Result of Document-Scope Detection

- Result after 10-fold cross validation
- Confusion matrix

                        Actual spam    Actual non-spam
  Predicted spam           123.1            62.3
  Predicted non-spam        43.6           271

- Accuracy = (123.1 + 271) / 500 = 78.8%
- Recall = 123.1 / (123.1 + 43.6) = 73.8%
- Precision = 123.1 / (123.1 + 62.3) = 66.4%

Further Result of Tag-Scope Detection

- Result after 10-fold cross validation
- Confusion matrix

                        Actual spam    Actual non-spam
  Predicted spam            1.4*             0.7
  Predicted non-spam        0.4              2.2

  * average number of tags per page

- Accuracy = (1.4 + 2.2) / 4.7 = 76.6%
- Recall = 1.4 / (1.4 + 0.4) = 77.8%
- Precision = 1.4 / (1.4 + 0.7) = 66.7%

Conclusions

- A novel scheme for tag spam detection based on text mining.
- Relatedness between Web pages and tags was discovered using self-organizing maps.
- Only the content of Web pages is used, rather than user behavior.

Thanks for your attention.
