
Page 1: CS 744: MARIUS

CS 744: MARIUS

Shivaram Venkataraman, Fall 2021

Hi!

Page 2: CS 744: MARIUS

ADMINISTRIVIA

- Midterm grades today! Pick up papers in my office hours
- Office hours on Monday. Don't panic!
- Course Project: check in by Nov 30th
- In class on Tuesday: skip the paper discussion; midterm solution walk-through instead

Page 3: CS 744: MARIUS

Scalable Storage Systems

Datacenter Architecture

Resource Management

Computational Engines

Machine Learning | SQL | Streaming | Graph

Applications

Graph frameworks seen earlier: PowerGraph, GraphX

Page 4: CS 744: MARIUS

EXAMPLE: LINK PREDICTION

Task: Predict potential connections in a social network


[0.25, 0.45, 0.30]

[0.15, 0.85, 0.92]

Find K-nearest neighbors in the embedding space

Notes: one embedding per vertex; the nearest neighbors form a list of recommendations. The embeddings capture the structure of the graph.
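The lookup above can be sketched with NumPy. This is a toy table, not a trained model: the vertex count, the embedding values, and the helper name are all illustrative.

```python
import numpy as np

# Toy embedding table: one d-dimensional row per vertex.
# Values are illustrative; real embeddings come out of training.
emb = np.array([
    [0.25, 0.45, 0.30],   # vertex 0
    [0.15, 0.85, 0.92],   # vertex 1
    [0.20, 0.50, 0.35],   # vertex 2
    [0.90, 0.10, 0.05],   # vertex 3
])

def top_k_neighbors(emb, v, k):
    """Rank the other vertices by cosine similarity to vertex v."""
    norms = np.linalg.norm(emb, axis=1)
    sims = emb @ emb[v] / (norms * norms[v])
    sims[v] = -np.inf                     # exclude the query vertex
    return np.argsort(-sims)[:k]

print(top_k_neighbors(emb, 0, 2))         # nearest vertices to vertex 0
```

The top-K vertices returned here would be the predicted links (e.g. friend recommendations) for the query vertex.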

Page 5: CS 744: MARIUS

BACKGROUND: GRAPH EMBEDDING MODELS

Score function: captures the structure of the graph, given source and destination embeddings

Loss function: maximize the score for edges in the graph; minimize it for others (negative edges)

NEGATIVE SAMPLING

Sample from edges not in the graph!

Two options:
1. According to the data distribution
2. Uniformly

Notes: for a particular edge, the score (e.g. cosine similarity of the source and destination embeddings) is pushed up for the positive edge and pushed down for the sampled negative edges, with a regularization term added.
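One concrete instantiation of the score and loss, as a sketch: the dot-product score, the softmax-style loss, and the function names here are my choices for illustration, not necessarily the exact model on the slide.

```python
import numpy as np

def score(src, dst):
    # Hypothetical score function: dot product of source and
    # destination embeddings; cosine similarity is another option.
    return src @ dst

def edge_loss(src, dst, neg_dsts, lam=1e-4):
    """Push the positive edge's score above the sampled negatives
    (log-softmax over [positive, negatives]) plus L2 regularization."""
    logits = np.concatenate([[score(src, dst)], neg_dsts @ src])
    nll = np.log(np.exp(logits).sum()) - logits[0]
    reg = lam * (src @ src + dst @ dst)
    return nll + reg

rng = np.random.default_rng(0)
src, dst = rng.normal(size=3), rng.normal(size=3)
negs = rng.normal(size=(5, 3))    # 5 uniformly sampled negative destinations
print(edge_loss(src, dst, negs))
```

Minimizing this loss raises the positive edge's score relative to the negatives, which is exactly the maximize/minimize split described above.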

Page 6: CS 744: MARIUS

TRAINING ALGORITHM

SGD/AdaGrad optimizer

Sample positive, negative edges

Access source, dest embeddings for each edge in batch


for i in range(num_batches):
    B = getBatchEdges(i)
    E = getEmbeddingParams(B)
    G = computeGrad(E, B)
    updateEmbeddingParams(G)

Notes: each iteration takes a batch of edges and reads/updates only the embeddings those edges touch, not the full N x d table; over one epoch (= all positive edges) all the weights get used. The updated embedding parameters are written back at the end of each step.
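The four steps above can be made runnable. This is a sketch under my own assumptions (dot-product scores, logistic loss, uniform negative sampling, plain SGD rather than AdaGrad); the batching and the sparse read/update of touched rows are the point.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
N, d, lr = 100, 8, 0.1
emb = rng.normal(scale=0.1, size=(N, d))      # the N x d embedding table
edges = rng.integers(0, N, size=(500, 2))     # toy positive edges

def train_epoch(emb, edges, batch_size=50):
    for i in range(0, len(edges), batch_size):
        B = edges[i:i + batch_size]                      # getBatchEdges(i)
        neg = rng.integers(0, N, size=len(B))            # uniform negatives
        s, t, n = emb[B[:, 0]], emb[B[:, 1]], emb[neg]   # getEmbeddingParams(B)
        # computeGrad(E, B): gradients of the logistic loss on dot-product
        # scores; positive pairs are pulled together, negatives pushed apart
        gp = (sigmoid((s * t).sum(1)) - 1)[:, None]
        gn = sigmoid((s * n).sum(1))[:, None]
        # updateEmbeddingParams(G): sparse SGD on the touched rows only
        np.add.at(emb, B[:, 0], -lr * (gp * t + gn * n))
        np.add.at(emb, B[:, 1], -lr * gp * s)
        np.add.at(emb, neg,     -lr * gn * s)

def mean_pos_score(emb, edges):
    return sigmoid((emb[edges[:, 0]] * emb[edges[:, 1]]).sum(1)).mean()

before = mean_pos_score(emb, edges)
for _ in range(5):
    train_epoch(emb, edges)
print(before, "->", mean_pos_score(emb, edges))
```

`np.add.at` is used so that repeated vertices within a batch accumulate their updates instead of overwriting each other.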

Page 7: CS 744: MARIUS

CHALLENGE: LARGE GRAPHS

Large graphs → large model sizes

Example: 3 billion vertices, d = 400. Model size = 3 billion * 400 * 4 bytes = 4.8 TB!

Need to scale beyond GPU memory, CPU memory!

Notes: real graphs have billions of vertices, and d is typically 100 to 1000.
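The arithmetic on the slide, as a quick check (4 bytes per parameter assumes float32 storage):

```python
vertices = 3_000_000_000           # 3 billion vertices
d = 400                            # embedding dimension
bytes_per_param = 4                # float32
size_bytes = vertices * d * bytes_per_param
print(size_bytes / 1e12, "TB")     # 4.8 TB
```

That is far beyond both GPU memory (tens of GB) and typical CPU memory, which is what forces the out-of-memory design later in the lecture.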

Page 8: CS 744: MARIUS

CHALLENGE: DATA MOVEMENT

(a) Sample edges, embeddings from CPU memory (DGL-KE)

(b) Partition embeddings so that one partition fits on GPU memory. Load sequentially (Pytorch-BigGraph)


One epoch on the Freebase86m knowledge graph

Data movement overheads → low GPU utilization

Notes: roughly 12% GPU utilization; batches of embeddings and gradients shuttle between CPU memory and the GPU.

Page 9: CS 744: MARIUS

MARIUS

I/O efficient system for learning graph embeddings

Marius Design
- Pipelined training
- Partition ordering

Page 10: CS 744: MARIUS

PIPELINED TRAINING

Marius splits the training algorithm into 5 stages and pipelines them to keep both the CPU and the GPU utilized. The queue size controls staleness: if batch b2 shares embeddings with an in-flight batch b1, then b2 may read embeddings that b1 has not yet updated (stale data). Because the access patterns are sparse, there are not too many conflicts in practice.
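The pipelining idea can be sketched with threads and bounded queues. This is illustrative, not the Marius API: Marius uses 5 stages, only three are shown here, and the stage bodies are placeholders. The `maxsize` on each queue caps how many batches are in flight, which is what bounds staleness.

```python
import threading, queue

results = []   # stands in for the updated embedding table

def run_stage(fn, inq, outq):
    """Run one pipeline stage: pull items, process, push downstream."""
    while True:
        item = inq.get()
        if item is None:                  # shutdown marker
            if outq is not None:
                outq.put(None)            # propagate shutdown downstream
            return
        out = fn(item)
        if outq is not None:
            outq.put(out)

q_load, q_compute, q_update = (queue.Queue(maxsize=4) for _ in range(3))
stages = [
    (lambda b: ("embs", b),       q_load,    q_compute),  # load embeddings (CPU)
    (lambda b: ("grads", b),      q_compute, q_update),   # forward/backward (GPU)
    (lambda b: results.append(b), q_update,  None),       # apply updates (CPU)
]
threads = [threading.Thread(target=run_stage, args=s) for s in stages]
for t in threads:
    t.start()
for i in range(8):                        # feed 8 batches into the pipeline
    q_load.put(i)
q_load.put(None)
for t in threads:
    t.join()
print(len(results))                       # all 8 batches processed
```

With the queues bounded at 4, at most a few batches can be in flight at once, so an embedding read can be at most that many updates stale.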

Page 11: CS 744: MARIUS

OUT-OF-MEMORY TRAINING

Key idea: Maintain a cache of partitions in CPU memory

Questions: Order of partition traversal? How to perform eviction?

Notes: the node embedding parameters are split into partitions, which splits the adjacency matrix into a grid of (source partition, destination partition) buckets. A subset of partitions is cached in CPU memory, and one epoch must process all the buckets in the grid.

Page 12: CS 744: MARIUS

ELIMINATION ORDERING

Initialize cache with c partitions

Swap in partition that leads to highest number of unseen pairs

Achieved by fixing c-1 partitions and swapping the remaining ones in, in any order

Notes: the resulting number of swaps is close to the lower bound.
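The round-based scheme above can be sketched as follows. This is my reconstruction from the slide, not the Marius code: each round fixes c-1 partitions in the buffer and rotates every remaining partition through the free slot, so those c-1 partitions meet everyone and can be eliminated; the swap accounting is approximate.

```python
from itertools import combinations

def elimination_ordering(p, c):
    """Cover every (partition, partition) pair with a buffer of c
    out of p partitions, counting swaps. Sketch of an elimination-style
    ordering: fix c-1 partitions per round, rotate the rest through
    the remaining slot, then eliminate the fixed ones."""
    remaining = list(range(p))
    seen = set(combinations(remaining[:c], 2))   # pairs in the initial buffer
    swaps = 0
    while len(remaining) > c:
        fixed, rest = remaining[:c - 1], remaining[c - 1:]
        for q in rest[1:]:             # rotate q through the free slot
            swaps += 1
            seen |= {(f, q) for f in fixed}
        remaining = rest               # `fixed` never need to be reloaded
        if len(remaining) > c:
            swaps += c - 1             # load a fresh buffer for the next round
            seen |= set(combinations(remaining[:c], 2))
    seen |= set(combinations(remaining, 2))      # final buffer fits entirely
    return swaps, seen

swaps, seen = elimination_ordering(10, 4)
print(swaps, len(seen))
```

Every pair is covered: pairs inside a round's buffer are seen when it is loaded, pairs between the fixed set and everyone else are seen during the rotation, and the leftover pairs are handled by later rounds.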

Page 13: CS 744: MARIUS

SUMMARY

Graph Embeddings: Learn embeddings from graph data for ML

Marius: Efficient single-machine training
- Pipelining to use CPU, GPU
- Partition buffer, BETA ordering (minimizes the number of swaps)

Page 14: CS 744: MARIUS

DISCUSSION
https://forms.gle/LtoT8nEmw3oLvXuo9

Page 15: CS 744: MARIUS

If you were going to repeat the COST analysis for knowledge graph embedding training, what would you expect to find and why?

→ Distributed systems may need 4 to 8 GPUs to match a single machine (Marius roughly matches 3 GPUs of DGL)
→ Communication cost grows if we go to many GPUs
→ Is I/O the bigger bottleneck than compute for this kind of training? It is data dependent.

Page 16: CS 744: MARIUS

How does the partitioning scheme used in this paper differ from partitioning schemes used in PowerGraph and why?

→ PowerGraph's workload centers on a vertex and its neighbors (e.g. a PageRank job)
→ Here an edge drives the computation, so the partitioning must bring together the source and destination embeddings for each edge

Page 17: CS 744: MARIUS

NEXT STEPS

Next class: New module!
Project check-ins by Nov 30th