An Efficient Algorithm for Enumerating Pseudo Cliques

Dec/18/2007 ISAAC, Sendai

Takeaki Uno, National Institute of Informatics & The Graduate University for Advanced Studies


Page 1:

An Efficient Algorithm for Enumerating Pseudo Cliques

Dec/18/2007 ISAAC, Sendai

Takeaki Uno

National Institute of Informatics & The Graduate University for Advanced Studies

Page 2: Introducing Pseudo Cliques

Page 3: Analyzing Large Scale Databases

• Databases are growing rapidly, so we have to analyze them computationally

• Finding cliques in similarity/relation graphs is a popular way to classify the data or to characterize it (a clique is a group of similar or related objects)

• Thanks to good properties such as monotonicity, (maximal) cliques can be enumerated very quickly (up to 1,000,000/sec)

• This motivates us to look for richer objects: dense structures such as pseudo cliques

Page 4: Def. Pseudo Clique

• For a vertex set K, the density of K is

(#edges connecting vertices in K) / ((|K|-1)|K|/2)

where the denominator is the maximum #edges in K; the density is the average ratio of vertices adjacent to a vertex

-- K is a clique ⇔ density is 1
-- K is an independent set ⇔ density is 0
-- if the density is high, K is nearly a clique

• For given θ, K is a pseudo clique ⇔ (density of K) ≧ θ

We want to solve the problem of enumerating all pseudo cliques of the given graph
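The density definition on this slide can be sketched in a few lines of Python. This is only an illustration, not the author's code: the adjacency-set dictionary `adj`, the names `density` and `is_pseudo_clique`, and the convention that a set with fewer than two vertices has density 1 are assumptions made for the example.

```python
from itertools import combinations

def density(adj, K):
    """Density of vertex set K: #edges inside K over the maximum possible #edges."""
    k = len(K)
    if k < 2:
        return 1.0  # assumed convention; the slide does not define this case
    edges = sum(1 for u, v in combinations(sorted(K), 2) if v in adj[u])
    return edges / (k * (k - 1) / 2)

def is_pseudo_clique(adj, K, theta):
    """K is a pseudo clique iff its density is at least the threshold theta."""
    return density(adj, K) >= theta

# A triangle 1-2-3 with a pendant vertex 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
triangle_density = density(adj, {1, 2, 3})   # clique
pair_density = density(adj, {1, 4})          # independent set
```

In this example the triangle has density 1 and the non-adjacent pair has density 0, matching the two extreme cases listed on the slide.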

Page 5: Existing Results

• Easy to find one pseudo clique

two connected vertices always form a pseudo clique

• Finding a pseudo clique of size k is NP-complete

by reduction from the k-clique problem, setting θ = 1

• Approximation algorithms for maximizing the density for size k

-- O(|V|^(1/3-ε)) approximation algorithm
-- O((n/k)^ε) approximation if the optimal solution is dense [Tokuyama et al.]
-- PTAS if there are Ω(n^2) edges [Arora et al.]

• Many heuristic algorithms in data mining, data engineering, natural sciences

• However, no algorithm for "complete" enumeration

Page 6: Hardness for Branch-and-Bound

• A straightforward approach is branch and bound

• In each iteration, divide the problem into two non-empty problems by the inclusion of a vertex

[Figure: branching on v1, then on v1, v2]

Checking the existence of a pseudo clique in a subproblem is NP-complete

Page 7: Proof of the Hardness

Theorem 1: For given graph G, threshold θ, and vertex set U, the problem of checking the existence of a pseudo clique including U is NP-complete

Proof: by reduction from the problem of finding a clique of k vertices

[Figure: input graph G=(V,E); add 2|V|^2 vertices as U; density = (|V|^2 - 1) / |V|^2]

• only (U + clique) is a pseudo clique

• the density increases as the pseudo clique size increases

• set ε s.t. a clique of size at least k induces a pseudo clique, with θ = (|V|^2 - 1) / |V|^2 + ε

Page 8: Is This Really Hard?

• We proved NP-hardness for "very dense graphs"

unclear for graphs of middle density

possibility of polynomial time enumeration

[Figure: range of θ from 1 down to 0, with regions labeled "easy" (twice), "hard", and "????"]

Page 9: Polynomial Time Enumeration

Page 10: Reverse Search Approach

• Introduce an acyclic parent-child relation on all pseudo cliques (the objects to be enumerated)

• Enumerate by traversing the tree induced by the relation

• This needs an algorithm for listing all children

Page 11: Parent of Pseudo Clique

• v*(K): the minimum-degree vertex in G[K] (ties broken by minimum index)

• The parent of pseudo clique K := K \ v*(K)

[Figure: K and the parent of K]

• Density of K = (ave. degree in G[K]) / (|K|-1)

• The parent is obtained by removing the most "sparse" vertex from K, thus it is also a pseudo clique

• The parent is smaller than its child ⇒ the relation is acyclic
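The parent rule above can be written down directly. The following is a minimal sketch, assuming the graph is stored as a dictionary `adj` of adjacency sets and K is a `frozenset`; the names `v_star` and `parent` are chosen for this example and are not taken from the author's implementation.

```python
def v_star(adj, K):
    """Vertex of K with minimum degree in G[K]; ties are broken by minimum index."""
    return min(K, key=lambda v: (len(adj[v] & K), v))

def parent(adj, K):
    """Parent of K: remove the most 'sparse' vertex v*(K)."""
    return K - {v_star(adj, K)}

# A triangle 1-2-3 with a pendant vertex 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
K = frozenset({1, 2, 3, 4})
p = parent(adj, K)  # vertex 4 has the smallest degree in G[K], so it is removed
```

Because the removed vertex has at most the average degree, the density of the parent is never lower than that of K, which is why the parent of a pseudo clique is again a pseudo clique.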

Page 12: Ex. Enumeration Tree

• threshold = 0.7

[Figure: example graph on vertices 1 to 7 and its enumeration tree]

Page 13: Finding Children

• A child is obtained by adding a vertex to the parent

• degK(v): #vertices in K adjacent to v (can be maintained in O(Δ) time for vertex addition)

• K∪v is a child of K ⇔

① K∪v is a pseudo clique ⇒ lower bound for degK(v)
② v*(K∪v) = v ⇒ upper bound for degK(v)

-- degK(v) < min. deg. of K ⇒ K∪v is always a child (given ①)
-- degK(v) > min. deg. of K + 1 ⇒ K∪v is never a child

• degK(v) = min. deg. of K or min. deg. of K + 1 ⇒ next slide…

Page 14: Detailed Condition

• S(K): sequence of vertices in K in the order of (degree, index); the top of S(K) is v*(K)

• K∪v is a child ⇔ v is the top of S(K∪v)

• K∪v is a child only if v is adjacent to all vertices preceding v in S(K)

• For each vertex, find the first "non-adjacent vertex" in S(K)

• This can be done in O(Δ^2) time

Computation time for one iteration is O(Δ^2 + log |V|) (O(Δk + log |V|) if k-degenerate)
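Putting the parent rule and the child conditions together gives a reverse search traversal. The sketch below is a deliberately naive illustration: it tests conditions ① and ② by recomputing degrees from scratch, so it does not achieve the O(Δ^2 + log |V|) bound stated above. The adjacency dictionary `adj`, all function names, the empty set as root, and the convention that single vertices count as pseudo cliques are assumptions of this example.

```python
from itertools import combinations

def edges_in(adj, K):
    """Number of edges of the graph with both endpoints in K."""
    return sum(1 for u, v in combinations(sorted(K), 2) if v in adj[u])

def is_pseudo_clique(adj, K, theta):
    k = len(K)
    # Sets with fewer than two vertices are accepted by convention (assumption).
    return k < 2 or edges_in(adj, K) >= theta * k * (k - 1) / 2

def v_star(adj, K):
    """Minimum (degree in G[K], index) vertex of K."""
    return min(K, key=lambda v: (len(adj[v] & K), v))

def enumerate_pseudo_cliques(adj, theta):
    """Depth-first traversal of the parent-child tree, yielding each pseudo clique once."""
    def visit(K):
        if K:
            yield K
        for v in sorted(set(adj) - K):
            child = K | {v}
            # K ∪ v is a child iff (1) it is a pseudo clique and (2) v*(K ∪ v) = v.
            if is_pseudo_clique(adj, child, theta) and v_star(adj, child) == v:
                yield from visit(child)
    yield from visit(frozenset())

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
found = list(enumerate_pseudo_cliques(adj, 0.7))
```

Since every pseudo clique has exactly one parent, no duplicate check or table of visited sets is needed, which is what keeps the memory use polynomial.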

Page 15: Computational Experiments

Page 16: Implementation

• The code is a simple version

- update degK(vi) at each addition

adding u to K takes O(deg(u)) time

- to find children, look for vi satisfying

θ|K|(|K|+1)/2 - (#edges in K) ≦ degK(vi) ≦ d*(K)+1

O(C d*(K)) = O(|E|) time + O(1) time for each

C := #vertices vi with degK(vi) = d*(K) or d*(K)+1

C seems not to be large relative to #children
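The bookkeeping described on this slide can be sketched as follows. Requiring density(K∪v) ≧ θ under the density definition gives the lower bound degK(v) ≧ θ|K|(|K|+1)/2 - (#edges in K), and d*(K) denotes the minimum degree in G[K]. The class name `PseudoCliqueState`, its methods, and the choice to recompute d*(K) with a plain `min` are assumptions of this example; the author's C code is not shown in the slides.

```python
class PseudoCliqueState:
    """Tracks K, the number of edges inside K, and degK(v) for every vertex v."""

    def __init__(self, adj):
        self.adj = adj
        self.K = set()
        self.edges = 0
        self.deg_K = {v: 0 for v in adj}

    def add(self, u):
        """Add u to K in O(deg(u)) time."""
        self.edges += self.deg_K[u]  # u brings degK(u) new edges into K
        self.K.add(u)
        for w in self.adj[u]:
            self.deg_K[w] += 1

    def candidates(self, theta):
        """Vertices whose degK value lies between the lower and upper bounds."""
        k = len(self.K)
        lower = theta * k * (k + 1) / 2 - self.edges
        # d*(K): minimum degree in G[K]; recomputed here for simplicity.
        d_star = min((self.deg_K[u] for u in self.K), default=0)
        upper = d_star + 1
        return [v for v in self.adj
                if v not in self.K and lower <= self.deg_K[v] <= upper]

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
state = PseudoCliqueState(adj)
state.add(2)
state.add(3)
cands = state.candidates(0.7)
```

The returned list is only a filter: vertices with degK(v) equal to d*(K) or d*(K)+1 still need the detailed (degree, index) check before they can be reported as children.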

Page 17: Problem Instances

• Pentium M 1.1GHz, 256MB memory, Cygwin, C, gcc

• Test instances are:

- random graphs (each edge is made with probability p)

- locally dense random graphs (vertex i is adjacent to each of the vertices from i-k to i+k with probability 1/2)

- graphs generated from real-world data (co-author graph)
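As an illustration of the second instance type, a locally dense random graph could be generated along the following lines. This is a sketch under stated assumptions, not the generator used in the experiments: the function name, the 0-based vertex numbering, the `seed` parameter, and the decision to draw each pair (i, j) once are choices made for this example.

```python
import random

def locally_dense_random_graph(n, k, p=0.5, seed=0):
    """Vertex i is joined to each vertex j with 0 < |i - j| <= k with probability p."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        # Only look forward so that every pair is decided exactly once.
        for j in range(i + 1, min(i + k, n - 1) + 1):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj

g = locally_dense_random_graph(50, 3, seed=1)
```

With p = 1 the result is the deterministic "band" graph in which every vertex is adjacent to all vertices within distance k, which is a convenient sanity check.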

Page 18: Random Graphs

• p = 0.1, #vertices = 200 to 2000, threshold 0.8, 0.9

Computation time increases linearly with the average degree

[Figure: "random graph p=0.1"; log-scale plot of time (sec) & #cliques against #vertices; series: #clique, time per 1M clique, time clique, #p-clique 0.9, time per 1M 0.9, time 0.9, #p-clique 0.8, time per 1M 0.8, time 0.8]

Page 19: Locally Dense Random Graph

• make an edge from a vertex to each of its neighbors with p = 0.5

• #vertices 100 to 25600, threshold 0.8, 0.9

• 10 times slower than clique enumeration

• computation time per clique does not change

[Figure: "locally dense random graph"; log-scale plot of time (sec) & #cliques against #vertices; series: #clique, time per 1M clique, time clique, #p-clique 0.9, time per 1M 0.9, time 0.9, #p-clique 0.8, time per 1M 0.8, time 0.8]

Page 20: Randomly Generated Scale Free Graph

• Starting from a clique of 10 vertices, iteratively add vertices of degree 10

• Vertices to be connected are chosen according to their current degrees

Computation time increases quite slowly

[Figure: log-scale plot of time & #cliques against #vertices; series: #clique, time per 1M clique, time clique, #p-clique 0.9, time per 1M 0.9, time 0.9, #p-clique 0.8, time per 1M 0.8, time 0.8]
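The scale-free construction described on this slide resembles preferential attachment and could be sketched as below. The slide does not give details, so several points are assumptions of this example: the function name and parameters, sampling neighbors in proportion to current degree via a list of edge endpoints, and redrawing until `m` distinct neighbors are found.

```python
import random

def scale_free_graph(n, m=10, seed=0):
    """Start from a clique on m vertices; each new vertex gets m degree-biased neighbors."""
    rng = random.Random(seed)
    adj = {v: set(range(m)) - {v} for v in range(m)}
    # Each vertex appears in `endpoints` once per incident edge, so a uniform
    # draw from this list picks a vertex with probability proportional to its degree.
    endpoints = [v for v in range(m) for _ in range(m - 1)]
    for new in range(m, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        adj[new] = set(targets)
        for t in targets:
            adj[t].add(new)
            endpoints.extend((t, new))
    return adj

g = scale_free_graph(50)
```

The sketch assumes m ≥ 2 and n ≥ m; for smaller n it simply returns the initial clique.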

Page 21: Real-world Instance

• co-author graph of an academic paper database

• #vertices = 30,000, #edges = 125,000, scale free

Computation time for one pseudo clique does not depend on the threshold

[Figure: "real-world data"; log-scale plot of time & #p-cliques against threshold (1 down to 0.83); series: #p-clique, time, time per 1M]

Page 22: Bottom-wideness

• Why is it good in practice?

• The algorithm generates several recursive calls

the recursion tree expands exponentially going down

⇒ computation time is dominated by the lowest levels

• On lower levels, small degree vertices are added fast!

When pseudo cliques are sufficiently large (over 5?), the min. degree is small on average

⇒ computation time is short on average at lower levels

[Figure: recursion tree with labels "Long time" and "Short time"]

Page 23: Conclusion

• First polynomial delay, polynomial space algorithm for enumerating pseudo cliques

• Hardness result for straightforward branch-and-bound

• Evaluated the practical efficiency by computational experiments

Future works:

• Explain the gap between theory and practice

• Introduce maximality and enumerate maximal pseudo cliques

• Apply the technique to other structures (pseudo bla bla bla)

(path, tree, bipartite clique, matching …)

• What is crucial for the computation (enumeration) of structures with ambiguity?