SVD Applications

Singular Value Decomposition
SVD – Example: Users-to-Movies
• A = U Σ Vᵀ - example: Users to Movies
(Slides adapted from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)
A (m users × n movies):

            Matrix  Alien  Serenity  Casablanca  Amelie
  SciFi       1      1        1          0         0
  SciFi       3      3        3          0         0
  SciFi       4      4        4          0         0
  SciFi       5      5        5          0         0
  Romance     0      2        0          4         4
  Romance     0      0        0          5         5
  Romance     0      1        0          2         2

“Concepts”, a.k.a. latent dimensions, a.k.a. latent factors
SVD – Example: Users-to-Movies
• A = U Σ Vᵀ - example: Users to Movies

  A (users × movies)        U                     Σ                  Vᵀ
  1 1 1 0 0          0.13  0.02 -0.01      12.4  0    0      0.56  0.59  0.56  0.09  0.09
  3 3 3 0 0          0.41  0.07 -0.03       0    9.5  0      0.12 -0.02  0.12 -0.69 -0.69
  4 4 4 0 0    =     0.55  0.09 -0.04  x    0    0    1.3 x  0.40 -0.80  0.40  0.09  0.09
  5 5 5 0 0          0.68  0.11 -0.05
  0 2 0 4 4          0.15 -0.59  0.65
  0 0 0 5 5          0.07 -0.73 -0.67
  0 1 0 2 2          0.07 -0.29  0.32
SVD – Example: Users-to-Movies
• A = U Σ Vᵀ (decomposition as above)
• The first column of U, first diagonal entry of Σ, and first row of Vᵀ form the SciFi-concept; the second of each forms the Romance-concept.
SVD – Example: Users-to-Movies
• A = U Σ Vᵀ
• U is the “user-to-concept” similarity matrix: row i scores user i against the SciFi-concept and the Romance-concept.
SVD – Example: Users-to-Movies
• A = U Σ Vᵀ
• The diagonal of Σ gives the “strength” of each concept: σ₁ = 12.4 is the strength of the SciFi-concept.
SVD – Example: Users-to-Movies
• A = U Σ Vᵀ
• V is the “movie-to-concept” similarity matrix: the SciFi-concept row of Vᵀ loads on Matrix, Alien and Serenity.
SVD - Interpretation #1
‘movies’, ‘users’ and ‘concepts’:
• U: user-to-concept similarity matrix
• V: movie-to-concept similarity matrix
• Σ: its diagonal elements give the ‘strength’ of each concept
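The decomposition above can be checked with a few lines of NumPy. This sketch is not part of the original slides, and NumPy may flip the signs of some singular vectors, which does not change the decomposition.

```python
import numpy as np

# Ratings matrix: rows = users, columns = (Matrix, Alien, Serenity, Casablanca, Amelie)
A = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.round(s[:3], 1))      # concept strengths: ~[12.4  9.5  1.3]
print(np.round(U[:, :3], 2))   # user-to-concept similarity matrix U
print(np.round(Vt[:3], 2))     # rows of V^T: concept-to-movie similarities
```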
Case study: How to query?
• Q: Find users that like ‘Matrix’
• A: Map the query into the ‘concept space’. How?
Case study: How to query?

  q = [5 0 0 0 0]   (ratings for Matrix, Alien, Serenity, Casablanca, Amelie)

(Figure: q plotted against the concept axes v₁ and v₂, together with its projection q·v₁ onto v₁.)

Project into concept space: take the inner product of q with each ‘concept’ vector vᵢ.
Case study: How to query?
Compactly, we have: q_concept = q V
E.g., with q = [5 0 0 0 0] (Matrix, Alien, Serenity, Casablanca, Amelie):

                          0.56  0.12
                          0.59 -0.02
  [5 0 0 0 0]     x       0.56  0.12     =     [2.8  0.6]
                          0.09 -0.69
                          0.09 -0.69

  movie-to-concept similarities (V)            SciFi-concept coordinate = 2.8
Case study: How to query?
• How would the user d that rated (‘Alien’, ‘Serenity’) be handled?
  d_concept = d V
E.g., with d = [0 4 5 0 0]:

                          0.56  0.12
                          0.59 -0.02
  [0 4 5 0 0]     x       0.56  0.12     =     [5.2  0.4]
                          0.09 -0.69
                          0.09 -0.69

  movie-to-concept similarities (V)            d also scores high on the SciFi-concept
Case study: How to query?
• Observation: user d, who rated (‘Alien’, ‘Serenity’), will be similar to user q, who rated (‘Matrix’), although d and q have zero ratings in common!

  q = [5 0 0 0 0]  →  q_concept = [2.8  0.6]
  d = [0 4 5 0 0]  →  d_concept = [5.2  0.4]

Zero ratings in common, yet similarity ≠ 0 in concept space: both point along the SciFi-concept.
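A small sketch of this projection, reusing A, U, s, Vt from the NumPy example above; the cosine similarity in concept space comes out close to 1 even though q and d share no rated movie.

```python
V2 = Vt[:2].T                                  # the two strongest concept vectors
q = np.array([5, 0, 0, 0, 0], dtype=float)     # rated only 'Matrix'
d = np.array([0, 4, 5, 0, 0], dtype=float)     # rated 'Alien' and 'Serenity'

q_c, d_c = q @ V2, d @ V2                      # map into concept space
cos = (q_c @ d_c) / (np.linalg.norm(q_c) * np.linalg.norm(d_c))
print(np.round(q_c, 1), np.round(d_c, 1), round(cos, 2))
# In the original rating space, q @ d == 0: zero ratings in common.
```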
Centering the Data
• In some applications it is useful to center the data before applying the SVD
  – Centering consists of subtracting from each row of A the mean of the rows (the centroid of the points represented by the rows of A)

Centering the Data
• Applying the SVD without centering yields the subspace (a hyperplane through the origin) that minimizes the sum of squared perpendicular distances to the data points of A
• Applying the SVD after centering yields the affine space that minimizes the sum of squared perpendicular distances to the data points of A
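A minimal sketch of the two variants, assuming the same ratings matrix A as before:

```python
# Uncentered SVD: best-fit subspace through the origin.
U0, s0, Vt0 = np.linalg.svd(A, full_matrices=False)

# Centered SVD: best-fit affine subspace through the centroid.
centroid = A.mean(axis=0)                  # mean of the rows of A
Uc, sc, Vtc = np.linalg.svd(A - centroid, full_matrices=False)
# A row x is then approximated by centroid + ((x - centroid) @ V_k) @ V_k^T.
```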
Data Compression

Scenario
• 130 images of handwritten ‘3’s
  – each represented as a 16x16 matrix
  – each representation needs 16x16 floats => images live in R^256
Data Compression
• A model with two principal directions
• In total there are 16x16 = 256 possible directions
• The first 12 principal directions account for 63% of the variance, and the first 50 for 90% of it
Data Compression
(Figure: the collection of ‘3’ digits projected onto the space of the first two principal directions; the digits shown correspond to the circled points in the plot on the left.)
• Any interpretation for the components?
  – First component: horizontal movement of the lower part of the ‘3’
  – Second component: stroke thickness
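A sketch of the compression pipeline; the 130 handwritten ‘3’s are not included with these notes, so a random matrix stands in for the 130 × 256 data matrix X.

```python
rng = np.random.default_rng(0)
X = rng.random((130, 256))                   # placeholder for the digit images
mu = X.mean(axis=0)
Ux, sx, Vtx = np.linalg.svd(X - mu, full_matrices=False)

k = 12                                       # keep 12 principal directions
X_k = mu + (Ux[:, :k] * sx[:k]) @ Vtx[:k]    # rank-k reconstruction

explained = (sx[:k] ** 2).sum() / (sx ** 2).sum()
print(f"{explained:.0%} of the variance")    # ~63% on the real digit data
# Storage drops from 130*256 numbers to 130*k + k*256 (+ the mean image).
```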
Approximate Similarity
Input. A matrix A representing a collection of n documents over a vocabulary of d terms.
Output. Given a set of queries (new documents) q(1),...,q(p), efficiently determine the similarity (inner product) between each document and each query.

Approximate Similarity
Approach 1
• The vector of similarities between A and a query q is given by Aq
• It can be computed in O(nd) time
Approximate Similarity
Approach 2
• Use A_k instead of A
  – We have A_k = σ₁u₁v₁ᵀ + … + σ_k u_k v_kᵀ, where uᵢ is a vector in Rⁿ and vᵢ is a vector in R^d
  – Hence each term σᵢuᵢvᵢᵀq can be computed in O(n+d):
    • compute vᵢᵀq in O(d) and then σᵢuᵢ(vᵢᵀq) in O(n)
  – Total complexity: O(k(n+d))
  – Advantageous if k << min{n, d}
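A sketch of Approach 2, reusing U, s, Vt from the first example; the function touches only k(n+d) numbers per query.

```python
def approx_similarities(U, s, Vt, q, k):
    """Approximate A @ q via A_k = sum_{i<=k} sigma_i u_i v_i^T."""
    coeffs = s[:k] * (Vt[:k] @ q)   # k inner products v_i^T q, O(kd)
    return U[:, :k] @ coeffs        # combine k user-side vectors u_i, O(kn)

q = np.array([5, 0, 0, 0, 0], dtype=float)
print(np.round(approx_similarities(U, s, Vt, q, k=2), 1))  # ~ A @ q
```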
Approximate Similarity
Approach 2
• What is the error of using A_k?
  – The maximum over q of |Aq − A_k q| = |(A − A_k)q|
  – The error can be unbounded if q is not restricted in any way
Approximate Similarity
Def. The spectral norm ||A||₂ of A is equal to
  max{ |Ax| : |x| ≤ 1 }
• ||A − A_k||₂ measures the largest possible error for vectors of norm at most 1
• It is possible to show (Theorem 3.9 of Foundations of Data Science) that
  ||A − A_k||₂ ≤ ||A − B||₂
  for every matrix B of rank at most k
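For the truncated SVD itself, the spectral error equals the first discarded singular value, ||A − A_k||₂ = σ_{k+1}; a quick check with the ratings matrix:

```python
k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k]            # rank-k truncation of A
print(np.linalg.norm(A - A_k, 2), s[k])      # equal: both ~1.3 here
```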
Ranking Documents
• A collection of n documents over a vocabulary of d terms
• We can represent the collection by a matrix A with n rows and d columns
How do we rank the documents of the collection according to their intrinsic importance?

Ranking Documents
Ranking
1. Compute the principal direction v₁ of the collection (the best one-dimensional representation of the collection under the Frobenius norm)
2. Project each document onto the direction v₁ and sort from largest to smallest (the coordinates of u₁)
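A sketch of this two-step ranking on a hypothetical term-count matrix (the data here is random, just to show the shapes):

```python
docs = np.random.default_rng(1).random((6, 8))   # hypothetical 6 docs, 8 terms
_, _, Vt_d = np.linalg.svd(docs, full_matrices=False)
scores = docs @ Vt_d[0]            # project each document onto v1
if scores.sum() < 0:               # SVD fixes signs arbitrarily; flip if needed
    scores = -scores
print(np.argsort(-scores))         # document indices, most important first
```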
HITS: Hubs and Authorities
Hubs and Authorities
• HITS (Hypertext-Induced Topic Selection)
  – A measure of importance of pages or documents, similar to PageRank
  – Proposed at around the same time as PageRank (1998)
• Goal: Say we want to find good newspapers
  – Don’t just find newspapers. Find “experts”: people who link in a coordinated way to good newspapers
• Idea: Links as votes
  – A page is more important if it has more links
  – In-coming links? Out-going links?
Finding newspapers
• Hubs and Authorities: each page has 2 scores:
  – Quality as an expert (hub): total sum of the votes of the authorities it points to
  – Quality as content (authority): total sum of the votes coming from experts
• Principle of repeated improvement

(Figure: example authority scores, e.g. NYT: 10, WSJ: 9, CNN: 8, Ebay: 3, Yahoo: 3.)
Hubs and Authorities
Interesting pages fall into two classes:
1. Authorities are pages containing useful information
   – Newspaper home pages
   – Course home pages
   – Home pages of auto manufacturers
2. Hubs are pages that link to authorities
   – Lists of newspapers
   – Course bulletins
   – List of US auto manufacturers
Counting in-links: Authority
Each page starts with hub score 1. Authorities collect their votes.
(Note: this is an idealized example. In reality the graph is not bipartite and each page has both a hub score and an authority score.)
Counting in-links: Authority
The authority score of a page (e.g. NYT) is the sum of the hub scores of the nodes pointing to it.
Expert Quality: Hub
Hubs collect authority scores: each node’s hub score is the sum of the authority scores of the nodes it points to.
Reweighting
Authorities again collect the hub scores, and the process keeps repeating.
Mutually Recursive Definition
• A good hub links to many good authorities
• A good authority is linked from many good hubs
• Model using two scores for each node:
  – Hub score and Authority score
  – Represented as vectors h and a
Hubs and Authorities [Kleinberg ’98]
• Each page i has 2 scores:
  – Authority score: a_i
  – Hub score: h_i

HITS algorithm:
• Initialize: a_j(0) = 1/√N, h_j(0) = 1/√N
• Then keep iterating until convergence:
  – ∀i: Authority: a_i(t+1) = Σ_{j→i} h_j(t)
  – ∀i: Hub: h_i(t+1) = Σ_{i→j} a_j(t)
  – ∀i: Normalize: Σ_i (a_i(t+1))² = 1, Σ_j (h_j(t+1))² = 1
Hubs and Authorities [Kleinberg ’98]
• HITS converges to a single stable point
• Notation:
  – Vectors a = (a_1, …, a_n), h = (h_1, …, h_n)
  – Adjacency matrix A (N×N): A_ij = 1 if i→j, 0 otherwise
• Then h_i = Σ_{i→j} a_j can be rewritten as h_i = Σ_j A_ij · a_j, so h = A · a
• Similarly, a_i = Σ_{j→i} h_j can be rewritten as a_i = Σ_j A_ji · h_j, so a = Aᵀ · h
Hubs and Authorities
• HITS algorithm in vector notation:
  – Set: a_i = h_i = 1/√n
  – Repeat until convergence:
    • h = A · a   (new h)
    • a = Aᵀ · h  (new a)
    • Normalize a and h
  – Convergence criterion: Σ_i (h_i(t) − h_i(t−1))² < ε and Σ_i (a_i(t) − a_i(t−1))² < ε
• Then: a = Aᵀ · (A · a)
  – a is updated (in 2 steps): a = Aᵀ(A a) = (AᵀA) a
  – h is updated (in 2 steps): h = A(Aᵀ h) = (A Aᵀ) h
  – This is repeated matrix powering
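A direct transcription of this iteration into NumPy (a sketch, not an optimized implementation); `Adj` plays the role of the adjacency matrix A:

```python
def hits(Adj, eps=1e-12, max_iter=1000):
    """HITS sketch: h = Adj a, a = Adj^T h, with 2-norm normalization."""
    n = Adj.shape[0]
    a = np.ones(n) / np.sqrt(n)
    h = np.ones(n) / np.sqrt(n)
    for _ in range(max_iter):
        h_new = Adj @ a
        h_new /= np.linalg.norm(h_new)
        a_new = Adj.T @ h_new
        a_new /= np.linalg.norm(a_new)
        done = ((h_new - h) ** 2).sum() < eps and ((a_new - a) ** 2).sum() < eps
        h, a = h_new, a_new
        if done:
            break
    return h, a
```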
Properties
• Under reasonable assumptions about A, HITS converges to vectors h* and a*:
  – a* is the principal eigenvector of the matrix AᵀA
  – h* is the principal eigenvector of the matrix A Aᵀ
Example of HITS
(Graph on three pages: Yahoo, Amazon, M’soft.)

  A  =  1 1 1        Aᵀ =  1 1 0
        1 0 1              1 0 1
        0 1 0              1 1 0

Hub scores (successive iterates):
  h(yahoo)  = .58   .80   .79   …   .788
  h(amazon) = .58   .53   .57   …   .577
  h(m’soft) = .58   .27   .23   …   .211

Authority scores (successive iterates):
  a(yahoo)  = .58   .62   …   .628
  a(amazon) = .58   .49   …   .459
  a(m’soft) = .58   .62   …   .628

Note: Amazon ends up with the lowest authority score. The hubs that point to Amazon are not good!
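Running the `hits` sketch above on this graph reproduces the limiting scores:

```python
Adj = np.array([[1, 1, 1],     # Yahoo  -> Yahoo, Amazon, M'soft
                [1, 0, 1],     # Amazon -> Yahoo, M'soft
                [0, 1, 0]],    # M'soft -> Amazon
               dtype=float)

h, a = hits(Adj)
print(np.round(h, 3))   # ~ [0.788 0.577 0.211]
print(np.round(a, 3))   # ~ [0.628 0.459 0.628]
```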
SVD and Eigenvectors
Definition. An eigenvector of a square matrix B is a vector v that satisfies
  Bv = λv
for some scalar λ, called an eigenvalue.
• Since a matrix B can be viewed as a linear transformation, an eigenvector spans a one-dimensional subspace that is invariant under B
• Eigenvectors show up in a number of important applications (e.g. PageRank, HITS)

SVD and Eigenvectors
Observation. If B = AᵀA then
(i) the right singular vectors of A, v₁, …, v_k, are the eigenvectors of B, with eigenvalues σ₁², …, σ_k²
(ii) the left singular vectors of A, u₁, …, u_k, are the eigenvectors of A Aᵀ, with eigenvalues σ₁², …, σ_k²
(iii) the matrix B is positive semi-definite, i.e., yᵀBy ≥ 0 for every y
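A quick numerical check of (i) and (ii), reusing the ratings matrix A and its SVD U, s, Vt from the first sketch:

```python
B = A.T @ A
v1, u1 = Vt[0], U[:, 0]
print(np.allclose(B @ v1, s[0] ** 2 * v1))          # (i):  B v1 = sigma_1^2 v1
print(np.allclose(A @ A.T @ u1, s[0] ** 2 * u1))    # (ii): A A^T u1 = sigma_1^2 u1
```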
SVD: Limitations
+ Optimal low-rank approximation in terms of the Frobenius norm
– Interpretability:
  • a specific direction is a linear combination of all the rows of the matrix
– Lack of sparsity:
  • singular vectors can be dense even when the matrix is sparse!
CUR Decomposition

CUR Decomposition
• Goal: Express A as a product of matrices C, U, R so that ǁA − C·U·RǁF is small
• “Constraints” on C and R: C consists of actual columns of A, and R of actual rows of A

  Frobenius norm: ǁXǁF = √(Σᵢⱼ Xᵢⱼ²)
CUR: How it Works
• Sampling columns (and similarly for rows): columns are sampled at random, with probability proportional to their squared norm
• Note this is a randomized algorithm: the same column can be sampled more than once
Computing U (DKM 2006)
Notation
• r(i): i-th sampled row of A
• c(j): j-th sampled column of A
• Q[j]: probability of choosing the j-th row of A
• P[j]: probability of choosing the j-th column of A
• r: number of sampled rows, c: number of sampled columns

Define W as the r × c matrix where

  W_{i,j} = A_{r(i),c(j)} / √( r·c · Q[r(i)] · P[c(j)] )
Computing U (DKM 2006)
1. Let CᵀC = σ₁²(v₁v₁ᵀ) + … + σ_k²(v_k v_kᵀ) be the spectral (SVD) decomposition of CᵀC
2. Define W′ = Σᵢ₌₁ᵏ (1/σᵢ²) vᵢvᵢᵀ
3. Define U = W′ Wᵀ
CUR: Provably good approximation to SVD
• For example:
  – Select c = O(k log k / ε²) columns of A using the ColumnSelect algorithm
  – Select r = O(k log k / ε²) rows of A using the ColumnSelect algorithm
  – Set U = W⁺ (the pseudo-inverse of the intersection matrix W)
• Then, with probability 98%:

  ǁA − C·U·RǁF  ≤  (2 + ε) ǁA − A_kǁF
   (CUR error)          (SVD error)

• In practice: pick 4k columns/rows for a “rank-k” approximation
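A heavily simplified sketch of the pipeline: it samples columns and rows with probability proportional to their squared norms and sets U = W⁺ as above, omitting the ColumnSelect details and the DKM rescaling; the matrix M is random stand-in data.

```python
def cur(M, c, r, rng=np.random.default_rng(0)):
    p = (M ** 2).sum(axis=0) / (M ** 2).sum()    # column probabilities P
    q = (M ** 2).sum(axis=1) / (M ** 2).sum()    # row probabilities Q
    cols = rng.choice(M.shape[1], size=c, p=p)   # same column may repeat
    rows = rng.choice(M.shape[0], size=r, p=q)
    C, R = M[:, cols], M[rows, :]
    W = M[np.ix_(rows, cols)]                    # intersection of the samples
    return C, np.linalg.pinv(W), R               # U = W^+

M = np.random.default_rng(1).random((200, 60))
C, U_w, R = cur(M, c=16, r=16)
rel = np.linalg.norm(M - C @ U_w @ R, 'fro') / np.linalg.norm(M, 'fro')
print(f"relative CUR error: {rel:.2f}")
```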
CUR: Pros & Cons
+ Easy interpretation
  • Since the basis vectors are actual columns and rows
+ Sparse basis
  • Since the basis vectors are actual columns and rows (an actual column stays sparse, while a singular vector is typically dense)
- Duplicate columns and rows
  • Columns of large norm will be sampled many times
Solution
• If we want to get rid of the duplicates:
  – Throw them away
  – Scale (multiply) the remaining columns/rows by the square root of the number of duplicates
• From A we get Cd, Rd with duplicates; after de-duplication and scaling we get the smaller Cs, Rs, and then construct a small U
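A small sketch of the de-duplication step, reusing M from the CUR sketch above; the sampled indices are hypothetical.

```python
cols = np.array([3, 7, 3, 3, 9])                 # hypothetical sampled columns
uniq, counts = np.unique(cols, return_counts=True)
Cs = M[:, uniq] * np.sqrt(counts)                # scale by sqrt(#duplicates)
print(Cs.shape)                                  # duplicate-free columns of C
```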
SVD vs. CUR

SVD: A = U Σ Vᵀ
  – A: huge but sparse
  – U, Vᵀ: big and dense
  – Σ: sparse and small

CUR: A = C U R
  – A: huge but sparse
  – C, R: big but sparse
  – U: dense but small
SVD vs. CUR: Simple Experiment
• DBLP bibliographic data
  – Author-to-conference big sparse matrix
  – Aij: number of papers published by author i at conference j
  – 428K authors (rows), 3659 conferences (columns)
  – Very sparse
• Want to reduce dimensionality
  – How much time does it take?
  – What is the reconstruction error?
  – How much space do we need?
Results: DBLP, big sparse matrix
• Accuracy: 1 − relative sum of squared errors
• Space ratio: #output matrix entries / #input matrix entries
• CPU time
(Figure: accuracy vs. space ratio and CPU time for SVD, CUR, and CUR with no duplicates.)

Sun, Faloutsos: Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM ’07.
Bibliography
• Ch. 3 of Foundations of Data Science, Avrim Blum, John Hopcroft and Ravindran Kannan, http://www.cs.cornell.edu/jeh/book.pdf
• Ch. 10 of Mining of Massive Datasets, J. Leskovec, A. Rajaraman, J. Ullman, http://www.mmds.org
• Fast Monte Carlo Algorithms for Matrices III: Computing a Compressed Approximate Matrix Decomposition (DKM, SICOMP 2006)
• Ch. 14 of The Elements of Statistical Learning
• Sun, Faloutsos: Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM ’07, http://www.cs.cmu.edu/~jimeng/papers/SunSDM07.pdf