active link inference for case building investigation

1
Active Link Inference for Case Building Investigation Reihaneh Rabbany & David Bayani & Artur W. Dubrawski How can we infer connections between entities, given the evidences of their relations? How can we help an investigator to find salient patterns by connecting the dots more efficiently? > inferring links where the user is able to actively provide feedback and guide the inference Online Human Trafficing ILO: 4.5 million people victimized globally by forced sexual exploitation in 2014; 21% were children MCMEC: 1,432% increase in reports of suspected child sex trafficking in 2014 correlated with the increased use of online sex advertisements Figure 1: A wide-spread issue: reported human trafficking cases. Sample dis- tribution of advertisements posted on Backpage.com, which contain a phone number reported to a human trafficking hotline. Active Link inference for case building to combat human trafficking in online escort advertisements find how different advertisements are linked together and point to orga- nized activities; e.g. advertisements for different potential victims which are linked by phone numbers, catch phrases or text patterns, images with the same background, or other evidences of connection. > an initial lead is treated as the seed query to find connected entities and identify the person of interest. Data Source Millions of escort advertisements scraped from Backpage.com for cities across the US and Canada posted between 2013 to 2017. For each advertisement, we have its unstructured text (title and body), attached images, date and location posted. Problem Definition D = {d 1 ,d 2 ...d n } connected to k different types of evidences (e.g. phone number, image, bi-grams) E = {X 1 , X 2 ... X k } denote the set of indicator matrices for these k modalities i.e. X m R n×c m + for m [1 ...k ] c m is the cardinality (number of unique evidences) of modality m (e.g. number of unique phone numbers) Each column of X m shows a set of datapoints that share the corre- sponding evidence (e.g. datapoints that all share a particular phone number), and each row of X m shows the evidences associated to the corresponding datapoint (all the phone numbers mentioned in a particular datapoint). Active Link Inference Given seed datapoint of interest, i, and assum- ing an unknown label set y = {y 1 ,y 2 ,...,y n } where y j =1 if the user deems node y j related to the node i and zero otherwise, find the maximum number of related entities to i given a fixed query budget, b> 0. Data Representation Each indicator matrix: biadjacency matrix of a 2-partite graph > k-partite graph representation for the data Given which and seed node of interest v i , the Active Link Inference trans- lates to finding positive nodes that are (tightly) connected to node v i with length even paths (i.e. through shared evidences); whereas positive nodes means when we query label of the found endpoint v j , we have y j > 0. u 1 u 1 u 2 { shared evidences AD 1 X 1 X 2 X m AD 2 AD i AD n Figure 2: k-modal evidence graph: ads and evidences associated with them. Our proposed method navigates through this graph to efficiently find re- lated nodes while learning the importance of each modality and each piece of evidence from the labels (user’s feedbacks) obtained while expanding. Method Description Θ =[θ 1 2 ,...θ k ]; θ m denotes the importance of modality m s j =[s j 1 ,s j 2 ,...s j k ]; s j m shows the tie strength of reaching v j from evidences in modality m s j m = v i L + u∈N (v i )∩N (v j ) 1 d 2 u - v i L - u∈N (v i )∩N (v j ) 1 d 2 u ; where L + = {j L| y j > 0}, and L - = {j L| y j < 0} infer the next node: j * arg max j m s j m θ j m the node with highest overall evidence support. adjust the importance of responsible modality : m * = arg max m s j m θ m θ * m = δθ * m if y j < 0 and θ * m = (2 - δ )θ * m if y j > 0; δ (0, 1) is the learning rate One Discovred Case Figure 3: Green advertisement shows the seed advertisement used in this case study. The black nodes are the related advertisements discovered which are con- nected to the seed though shared evidences, plotted as red nodes. Although these two text don’t show high similarity, these two ads are truly related as they share the same phone number.Out method was able to correctly discover it, and this representation explains how these ads are connected through shared evidences. Performance Evaluation No Feedback (NF): max score No Modality (NM): re-weigh the evidences No Evidence Weights (NE): learn the importance of modalities Random Walk (RW): restarts from the seed whenever it reaches a negative node. Figure 4: Comparison with baselines on three datasets: 2.5 million ads posted in DC, Maryland, Virginia area (HT-DMV), and another one of about 4 million ads posted between July to December 2013 (HT-13). Strong indicators (url, email, and phone numbers) as labels; as well as Discogs, a public online music database of 3.5 million different releases: date of the release, artists (primary and extra) involved, record labels, track information, companies involved in the production of the release, etc. Here, we find releases of the same artist as the given seed release based on their shared information.

Upload: others

Post on 09-Jan-2022

2 views

Category:

Documents


0 download

TRANSCRIPT

Active Link Inference for Case Building InvestigationReihaneh Rabbany & David Bayani & Artur W. Dubrawski

How can we infer connections between entities, given the evidences oftheir relations? How can we help an investigator to find salient patternsby connecting the dots more efficiently?

> inferring links where the user is able to actively provide feedbackand guide the inference

Online Human Trafficing

– ILO: 4.5 million people victimized globally by forced sexualexploitation in 2014; 21% were children

– MCMEC: 1,432% increase in reports of suspected child sextrafficking in 2014

– correlated with the increased use of online sex advertisements

100

1K

10K

Figure 1: A wide-spread issue: reported human trafficking cases. Sample dis-tribution of advertisements posted on Backpage.com, which contain a phonenumber reported to a human trafficking hotline.

Active Link inference for case building tocombat human trafficking in online escort advertisements

find how different advertisements are linked together and point to orga-nized activities; e.g. advertisements for different potential victims whichare linked by phone numbers, catch phrases or text patterns, images withthe same background, or other evidences of connection.

> an initial lead is treated as the seed query to find connectedentities and identify the person of interest.

Data Source

Millions of escort advertisements scraped from Backpage.com forcities across the US and Canada posted between 2013 to 2017. Foreach advertisement, we have its unstructured text (title and body),attached images, date and location posted.

Problem Definition

– D = {d1, d2 . . . dn} connected to k different types of evidences(e.g. phone number, image, bi-grams)

– E = {X1,X2 . . .Xk} denote the set of indicator matrices forthese k modalities i.e. Xm ∈ Rn×cm+ for m ∈ [1 . . . k]• cm is the cardinality (number of unique evidences) of modality m (e.g.

number of unique phone numbers)Each column of Xm shows a set of datapoints that share the corre-

sponding evidence (e.g. datapoints that all share a particular phonenumber), and each row of Xm shows the evidences associated tothe corresponding datapoint (all the phone numbers mentioned ina particular datapoint).

Active Link Inference Given seed datapoint of interest, i, and assum-ing an unknown label set y = {y1, y2, . . . , yn} where yj = 1 if the userdeems node yj related to the node i and zero otherwise, find the maximumnumber of related entities to i given a fixed query budget, b > 0.

Data Representation

Each indicator matrix: biadjacency matrix of a 2-partite graph >k-partite graph representation for the data

Given which and seed node of interest vi, the Active Link Inference trans-lates to finding positive nodes that are (tightly) connected to node vi withlength even paths (i.e. through shared evidences); whereas positive nodesmeans when we query label of the found endpoint vj, we have yj > 0.

u1 u1

u2{shared evidences

AD1

X1 X2 Xm

AD2

ADi

ADn

Figure 2: k-modal evidence graph: ads and evidences associated with them.

Our proposed method navigates through this graph to efficiently find re-lated nodes while learning the importance of each modality and each pieceof evidence from the labels (user’s feedbacks) obtained while expanding.

Method Description

– Θ = [θ1, θ2, . . . θk]; θm denotes the importance of modality m– sj = [s

j1, s

j2, . . . s

jk]; s

jm shows the tie strength of reaching vj from

evidences in modality m– sjm =

∑vi∈L+

∑u∈N (vi)∩N (vj)

1d2u−∑vi∈L−

∑u∈N (vi)∩N (vj)

1d2u

;where L+ = {j ∈ L| yj > 0}, and L− = {j ∈ L| yj < 0}

– infer the next node:j∗← argmaxj

∑m s

jmθ

jm

the node with highest overall evidence support.– adjust the importance of responsible modality :– m∗ = argmaxm s

jmθm

– θ∗m = δθ∗m if yj < 0 and θ∗m = (2− δ)θ∗m if yj > 0;– δ ∈ (0, 1) is the learning rate

One Discovred Case

Figure 3: Green advertisement shows the seed advertisement used in this casestudy. The black nodes are the related advertisements discovered which are con-nected to the seed though shared evidences, plotted as red nodes. Although thesetwo text don’t show high similarity, these two ads are truly related as they sharethe same phone number.Out method was able to correctly discover it, and thisrepresentation explains how these ads are connected through shared evidences.

Performance Evaluation

– No Feedback (NF): max score– No Modality (NM): re-weigh the evidences– No Evidence Weights (NE): learn the importance of modalities– Random Walk (RW): restarts from the seed whenever it

reaches a negative node.

0

10

20

30

40

Rele

van

t E

nti

ties

RW

NF

NM NE Method

Discogs

0

10

20

30

40

Rele

van

t E

nti

ties

RW

NF

NM

NE

Method

HT-DMV

0

10

20

30

40

Rele

van

t E

nti

ties

RW

NF

NM

NE

Method

HT-13

Figure 4: Comparison with baselines on three datasets: 2.5 million ads postedin DC, Maryland, Virginia area (HT-DMV), and another one of about 4 millionads posted between July to December 2013 (HT-13). Strong indicators (url, email,and phone numbers) as labels; as well as Discogs, a public online music databaseof 3.5 million different releases: date of the release, artists (primary and extra)involved, record labels, track information, companies involved in the productionof the release, etc. Here, we find releases of the same artist as the given seedrelease based on their shared information.