Automatic Detection of Social Tag Spams Using a Text Mining Approach

Hsin-Chang Yang

Associate Professor

Department of Information Management

National University of Kaohsiung

Outline

- Introduction
- Association Discovery by SOM
- Tag Spam Detection
- Experimental Results
- Conclusions

Social Bookmarking: Why?

Social bookmarking services (also known as folksonomies) are gaining popularity because they offer the following benefits:

- Alleviation of effort in Web page annotation
- Improvement of retrieval precision
- Simplification of Web page classification

How Does Folksonomy Work?

A user ($u_i$) annotates a Web page ($o_j$) with a set of tags ($T_{ij}$).

This is generally represented as a set of tuples $(u_i, o_j, T_{ij})$, where $u_i \in U$, $o_j \in O$, and $t_{ij} \in T$.

Characteristics of Folksonomy

- Simple
- Collaboration
- Semantic relatedness
- Possibility of spam

Tag Spams

- Tags that are unrelated or improperly related to the content or semantics of the annotated Web pages.
- They arise for advertisement or promotional purposes.
- They mislead users and degrade retrieval results.

System Architecture

[Figure: system architecture. Web pages and tags pass through preprocessing to give Web page vectors and tag vectors; SOM training produces synaptic weight vectors; labeling yields page clusters and tag clusters; association discovery outputs page/tag associations.]

Preprocessing

- Bag-of-words approach
- Web page $P_i$ is transformed to a binary vector $\mathbf{P}_i$.
- $T_i$, which is the tag list of $P_i$, is transformed to a binary vector $\mathbf{T}_i$.
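A minimal sketch of the binary bag-of-words step, assuming whitespace tokenization and separate vocabularies for pages and tag lists. The function names and the toy data are hypothetical and are not taken from the slides.

```python
def build_vocabulary(documents):
    """Collect the distinct lowercase tokens of all documents, sorted."""
    return sorted({token for doc in documents for token in doc.lower().split()})


def to_binary_vector(text, vocabulary):
    """Return a 0/1 list with 1 where the vocabulary word occurs in the text."""
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]


# Toy data (hypothetical): two pages and their tag lists.
pages = ["python tutorial for beginners", "cheap watches online shop"]
tag_lists = ["python programming tutorial", "watches shopping"]

# Assumption: pages and tag lists use separate vocabularies.
page_vocab = build_vocabulary(pages)
tag_vocab = build_vocabulary(tag_lists)

page_vectors = [to_binary_vector(p, page_vocab) for p in pages]
tag_vectors = [to_binary_vector(t, tag_vocab) for t in tag_lists]
```

A real pipeline would likely also strip HTML, remove stop words, and stem; those steps are omitted here.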

SOM Training

- All $\mathbf{P}_i$ and all $\mathbf{T}_i$ were trained separately with the self-organizing map (SOM) algorithm.
- Two maps, $M_P$ and $M_T$, were obtained after training.
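The training step can be illustrated with a small, generic SOM in plain Python. This is a sketch under assumptions: the slides do not give the learning-rate schedule, neighborhood function, or initialization, so the linear decay, Gaussian neighborhood, and random initialization below are illustrative choices, and `train_som` is a hypothetical name.

```python
import math
import random


def best_matching_unit(x, weights):
    """Grid position of the neuron whose weight vector is closest to x."""
    return min(weights,
               key=lambda pos: sum((a - b) ** 2 for a, b in zip(x, weights[pos])))


def train_som(data, rows, cols, epochs, lr0=0.5, seed=0):
    """Train a rows x cols SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # assumed: linear decay
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # assumed: shrinking radius
        samples = list(data)
        rng.shuffle(samples)
        for x in samples:
            winner = best_matching_unit(x, weights)
            for pos, w in weights.items():
                grid_dist2 = (pos[0] - winner[0]) ** 2 + (pos[1] - winner[1]) ** 2
                h = lr * math.exp(-grid_dist2 / (2.0 * sigma ** 2))
                for k in range(dim):
                    w[k] += h * (x[k] - w[k])    # pull neuron toward the sample
    return weights


# Toy binary vectors (hypothetical) on a 2 x 2 map.
toy_vectors = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]
toy_map = train_som(toy_vectors, rows=2, cols=2, epochs=20)
```

Calling the same function once on the page vectors and once on the tag vectors would give the two maps $M_P$ and $M_T$.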

Labeling

- We labeled each Web page on $M_P$ by finding its most similar neuron. A page cluster map (PCM) was obtained after all pages had been labeled.
- The same approach was applied to all tag lists on $M_T$ to obtain the tag cluster map (TCM).
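Labeling can be sketched as a nearest-neuron assignment. The sketch assumes that "most similar" means smallest Euclidean distance to the synaptic weight vector, which the slides do not state; `label_items` and the hand-picked weights are hypothetical.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def label_items(vectors, weights):
    """Map each neuron position to the indices of the vectors it wins."""
    cluster_map = defaultdict(list)
    for index, vector in enumerate(vectors):
        winner = min(weights,
                     key=lambda pos: squared_distance(vector, weights[pos]))
        cluster_map[winner].append(index)
    return dict(cluster_map)


# Hypothetical 1 x 2 map with hand-picked synaptic weight vectors.
weights = {(0, 0): [0.9, 0.1, 0.0], (0, 1): [0.1, 0.8, 0.9]}
vectors = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]

pcm = label_items(vectors, weights)
print(pcm)  # {(0, 0): [0, 2], (0, 1): [1]}
```

Applying the same function to tag vectors and the tag map would give the TCM in the same form.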

Association Discovery

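- Association discovery finds associations between page clusters and tag clusters.
- We used a voting scheme to find the associations.

[Formula residue: the symbols $T_i$, $C_p^{PCM}$, and $C^{TCM}$ survive, but the voting formula itself is not recoverable from the source.]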
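One possible voting scheme is sketched below. It is an assumption, not the formula from the slides: each training page/tag-list pair casts one vote for the pair (its page cluster, its tag cluster), and a page cluster is associated with the tag cluster or clusters that collect the most votes.

```python
from collections import Counter, defaultdict


def discover_associations(page_labels, tag_labels):
    """page_labels[i] and tag_labels[i] are the clusters of the i-th pair."""
    votes = defaultdict(Counter)
    for p, q in zip(page_labels, tag_labels):
        votes[p][q] += 1  # pair i votes for (page cluster, tag cluster)
    associations = {}
    for p, counter in votes.items():
        top = max(counter.values())
        # Keep every tag cluster that ties for the highest vote count.
        associations[p] = sorted(q for q, n in counter.items() if n == top)
    return associations


# Hypothetical cluster labels for four training pairs.
page_labels = [(0, 0), (0, 0), (0, 0), (1, 2)]
tag_labels = [(3, 1), (3, 1), (0, 4), (0, 4)]

associations = discover_associations(page_labels, tag_labels)
print(associations)  # {(0, 0): [(3, 1)], (1, 2): [(0, 4)]}
```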

Architecture of Tag Spam Detection

[Figure: architecture of tag spam detection. An incoming Web page and its incoming tag list are preprocessed into an incoming page vector and an incoming tag vector; labeling against the PCM and TCM gives a labeled page cluster and a labeled tag cluster; spam detection combines these with the page/tag associations to output tag spams.]

Spam Detection

Two types of tag spam detection:

- Document-scope detection (post-level detection): the whole tag list is identified as spam.
- Tag-scope detection (tag-level detection): individual tags are identified as spam.

Let $P_I$ and $T_I$ be the incoming Web page and its tag list, respectively.

Let $P_I$ and $T_I$ be labeled to $C_p^{PCM}$ and $C_q^{TCM}$, respectively.

Document-Scope Detection

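Relatedness between page cluster $C_p^{PCM}$ and tag cluster $C_q^{TCM}$: $S(C_p^{PCM}, C_q^{TCM})$ [defining formula not recoverable from the source], where

- $Q$: neighborhood of $C_q^{TCM}$
- $A = [a_{ij}]$ is the correlation matrix between PCM and TCM; $a_{pk} = 1$ if $C_p^{PCM}$ and $C_k^{TCM}$ are related, otherwise $a_{pk} = 0$
- $D$: geometric distance between two clusters

$T_I$ is identified as spam depending on $S(C_p^{PCM}, C_q^{TCM})$ [exact condition not recoverable from the source].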
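The ingredients named above (a neighborhood, a 0/1 correlation matrix, and a geometric distance) can be combined in many ways. The sketch below shows one hypothetical combination, not the formula from the slides: the score sums $a_{pk} / (1 + D)$ over the tag clusters $k$ near the labeled tag cluster, and the tag list is flagged when the score falls below an assumed threshold. The radius, threshold, and function names are all illustrative.

```python
import math


def neighborhood(q, rows, cols, radius):
    """Grid positions within `radius` (Euclidean) of q, including q itself."""
    return [(r, c) for r in range(rows) for c in range(cols)
            if math.dist((r, c), q) <= radius]


def relatedness(p, q, associated, rows, cols, radius=1.0):
    """Assumed score: sum of a_pk / (1 + D(q, k)) over k in the neighborhood of q."""
    score = 0.0
    for k in neighborhood(q, rows, cols, radius):
        a_pk = 1 if k in associated.get(p, ()) else 0  # correlation matrix entry
        score += a_pk / (1.0 + math.dist(q, k))
    return score


def is_spam_tag_list(p, q, associated, rows, cols, threshold=0.5):
    """Assumed rule: spam when the relatedness score is below the threshold."""
    return relatedness(p, q, associated, rows, cols) < threshold


# Hypothetical associations on 5 x 5 maps: page cluster (0, 0) is related
# to tag cluster (2, 2) only.
associated = {(0, 0): {(2, 2)}}

print(relatedness((0, 0), (2, 2), associated, 5, 5))       # 1.0
print(relatedness((0, 0), (2, 3), associated, 5, 5))       # 0.5
print(is_spam_tag_list((0, 0), (4, 4), associated, 5, 5))  # True
```

With this choice, a tag list labeled to an unrelated but adjacent cluster still earns partial credit, which tolerates small labeling errors on the map.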

Tag-Scope Detection

A tag is a spam if it is inconsistent with the other tags in the same tag cluster.

Let $T_i = \{t_{ij}\}$ be a tag list and

$$W = \bigcup_{T_i \in C_q^{TCM}} T_i.$$

An incoming tag $t_{Ij} \in T_I$ is a spam if $t_{Ij} \notin W$.
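A short sketch of the tag-level check, assuming that W is the set of all tags seen in the training tag lists of the labeled tag cluster and that a tag is flagged when it is not a member of W. The function names and example tags are hypothetical.

```python
def cluster_vocabulary(tag_lists_in_cluster):
    """W: union of all tags from the training tag lists of one tag cluster."""
    vocabulary = set()
    for tags in tag_lists_in_cluster:
        vocabulary.update(tags)
    return vocabulary


def spam_tags(incoming_tags, tag_lists_in_cluster):
    """Return the incoming tags that never occur in the cluster's tag lists."""
    w = cluster_vocabulary(tag_lists_in_cluster)
    return [tag for tag in incoming_tags if tag not in w]


# Hypothetical training tag lists labeled to the same tag cluster.
cluster_tag_lists = [["python", "tutorial"], ["python", "programming"]]
incoming = ["python", "casino", "tutorial"]

print(spam_tags(incoming, cluster_tag_lists))  # ['casino']
```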

Experimental Results

Dataset

- 1500 Web page / tag list pairs collected from www.delicious.com
- Each pair was inspected manually at both the post level and the tag level
- 583 distinct Web pages
- Sizes of vocabularies
  - Web pages: 13437
  - Tag lists: 5157
- Average number of tags per page: 4.7

Experimental Results

Parameters

- Map sizes
  - PCM: 10 × 10
  - TCM: 10 × 10
- Training epochs
  - PCM: 400
  - TCM: 200

Experimental Results

Number of training / test data: 1000 / 500

Confusion matrix for document-scope detection

Predicted \ Actual    Spam     Non-spam
Spam                  118      65
Non-spam              44       273

- Accuracy = (118 + 273) / 500 = 78.2%
- Recall = 118 / (118 + 44) = 72.8%
- Precision = 118 / (118 + 65) = 64.5%
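The three figures follow directly from the four confusion-matrix counts. The helper below is a small sketch (the function name is hypothetical) that recomputes them for the document-scope result.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, recall, and precision from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),      # share of actual spam that was caught
        "precision": tp / (tp + fp),   # share of predicted spam that is spam
    }


# Counts from the document-scope confusion matrix:
# predicted spam / actual spam = 118, predicted spam / actual non-spam = 65,
# predicted non-spam / actual spam = 44, predicted non-spam / actual non-spam = 273.
metrics = classification_metrics(tp=118, fp=65, fn=44, tn=273)
print({name: round(100 * value, 1) for name, value in metrics.items()})
# {'accuracy': 78.2, 'recall': 72.8, 'precision': 64.5}
```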

Further Result of Document-Scope Detection

Result after 10-fold cross validation

Confusion matrix

Predicted \ Actual    Spam     Non-spam
Spam                  123.1    62.3
Non-spam              43.6     271

- Accuracy = (123.1 + 271) / 500 = 78.8%
- Recall = 123.1 / (123.1 + 43.6) = 73.8%
- Precision = 123.1 / (123.1 + 62.3) = 66.4%

Further Result of Tag-Scope Detection

Result after 10-fold cross validation

Confusion matrix (values are average numbers of tags per page)

Predicted \ Actual    Spam     Non-spam
Spam                  1.4      0.7
Non-spam              0.4      2.2

- Accuracy = (1.4 + 2.2) / 4.7 = 76.6%
- Recall = 1.4 / (1.4 + 0.4) = 77.8%
- Precision = 1.4 / (1.4 + 0.7) = 66.7%

Conclusions

- A novel scheme for tag spam detection based on text mining.
- Relatedness between Web pages and tags was discovered using self-organizing maps.
- The scheme uses only the content of Web pages instead of user behaviors.

Thanks for your attention.
