CS 744: MARIUS
TRANSCRIPT
Shivaram Venkataraman, Fall 2021
ADMINISTRIVIA
- Midterm grades today!
- Course Project: check in by Nov 30th
  → Pick up papers in my office hours
- OH on Monday - DON'T PANIC!
- In class on Tuesday: skip the discussion, solution walkthrough
Scalable Storage Systems
Datacenter Architecture
Resource Management
Computational Engines
Machine Learning | SQL | Streaming | Graph
Applications
(graph engines: PowerGraph, GraphX)
EXAMPLE: LINK PREDICTION
Task: Predict potential connections in a social network
One embedding vector per vertex, e.g.:
[0.25, 0.45, 0.30]
[0.15, 0.85, 0.92]
…
Find K-nearest neighbors in embedding space → list of recommendations
The embeddings capture the structure of the graph
BACKGROUND: GRAPH EMBEDDING MODELS
Score function: captures the structure of the graph, given source and destination embeddings
Loss function: maximize the score for edges in the graph, minimize it for others (negative edges)
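A minimal sketch of these two pieces, assuming a plain dot-product score and a logistic loss (Marius supports several score functions, e.g. DistMult or ComplEx; the names and choices here are illustrative only):

```python
import numpy as np

def score(src_emb, dst_emb):
    # Dot-product score: higher when source and destination
    # embeddings "agree" (illustrative choice, not the paper's only one).
    return np.dot(src_emb, dst_emb)

def loss(pos_score, neg_scores):
    # Logistic loss: push the positive edge's score up and the
    # negative edges' scores down.
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    return -np.log(sig(pos_score)) - np.sum(np.log(sig(-np.asarray(neg_scores))))

d = 8
rng = np.random.default_rng(0)
u, v = rng.normal(size=d), rng.normal(size=d)
negs = rng.normal(size=(5, d))
print(loss(score(u, v), negs @ u))  # always > 0; smaller is better
```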
NEGATIVE SAMPLING
Sample from edges not in the graph!
Two options1. According to data distribution
2. Uniformly
→ For a particular edge, compute a score (e.g., cosine similarity)
Loss per edge ≈ -(score for positive edge) + (scores for negative edges) + regularization
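The two sampling options above can be sketched as follows (a hedged toy version; `sample_negatives` and its signature are my own names, and a real implementation would also filter out candidates that form true edges):

```python
import numpy as np

def sample_negatives(num_nodes, degrees, k, mode, rng):
    """Sample k candidate negative nodes.

    mode='uniform' -> every node equally likely
    mode='data'    -> proportional to node degree (data distribution)
    """
    if mode == 'uniform':
        return rng.integers(0, num_nodes, size=k)
    p = degrees / degrees.sum()          # degree-proportional distribution
    return rng.choice(num_nodes, size=k, p=p)

rng = np.random.default_rng(0)
degrees = np.array([1.0, 1.0, 10.0, 100.0])
print(sample_negatives(4, degrees, 5, 'uniform', rng))
print(sample_negatives(4, degrees, 5, 'data', rng))  # skews to node 3
```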
TRAINING ALGORITHM
SGD/AdaGrad optimizer
Sample positive, negative edges
Access source, dest embeddings for each edge in batch
for i in range(num_batches):
    B = getBatchEdges(i)
    E = getEmbeddingParams(B)
    G = computeGrad(E, B)
    updateEmbeddingParams(G)
Inputs: take a batch of edges; over 1 epoch (= all positive edges) all the weights are used
Embedding table is N × d
Return the gradients ⇒ update the embedding params at the end
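A runnable toy version of the loop above, assuming a dot-product score, one uniform negative per edge, and plain SGD (the paper also uses AdaGrad). The slide's `getBatchEdges` / `getEmbeddingParams` / `computeGrad` / `updateEmbeddingParams` steps are marked in comments; everything else is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, lr = 100, 16, 0.1
emb = rng.normal(scale=0.1, size=(N, d))    # N x d embedding table
edges = rng.integers(0, N, size=(1000, 2))  # positive edges

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

num_batches, batch_size = 10, 100
for i in range(num_batches):
    B = edges[i * batch_size:(i + 1) * batch_size]   # getBatchEdges(i)
    src, dst = B[:, 0], B[:, 1]
    neg = rng.integers(0, N, size=len(B))            # 1 uniform negative/edge
    E_s, E_d, E_n = emb[src], emb[dst], emb[neg]     # getEmbeddingParams(B)
    # computeGrad(E, B): gradients of the logistic loss w.r.t. embeddings
    g_pos = sigmoid((E_s * E_d).sum(axis=1)) - 1.0   # push positive score up
    g_neg = sigmoid((E_s * E_n).sum(axis=1))         # push negative score down
    # updateEmbeddingParams(G): sparse update, only touched rows change
    np.add.at(emb, src, -lr * (g_pos[:, None] * E_d + g_neg[:, None] * E_n))
    np.add.at(emb, dst, -lr * (g_pos[:, None] * E_s))
    np.add.at(emb, neg, -lr * (g_neg[:, None] * E_s))
```

Note the sparse access pattern: each batch reads and writes only the embedding rows its edges touch, which is what later makes pipelining with bounded staleness tolerable.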
CHALLENGE: LARGE GRAPHS
Large graphs → large model sizes
Example: 3 billion vertices, d = 400
Model size = 3 billion × 400 × 4 bytes = 4.8 TB!
Need to scale beyond GPU memory, CPU memory!
Billions of vertices, d = 100 to 1000
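The slide's model-size arithmetic checks out directly, assuming float32 (4 bytes) per embedding dimension:

```python
# Back-of-envelope model size, assuming float32 embeddings.
vertices = 3_000_000_000          # 3 billion vertices
d = 400                           # embedding dimension
bytes_per_param = 4               # float32
size_tb = vertices * d * bytes_per_param / 1e12
print(size_tb)                    # → 4.8 (TB), matching the slide
```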
CHALLENGE: DATA MOVEMENT
(a) Sample edges, embeddings from CPU memory (DGL-KE)
(b) Partition embeddings so that one partition fits on GPU memory. Load sequentially (Pytorch-BigGraph)
One epoch on the Freebase86m knowledge graph:
Data movement overheads → low GPU utilization (~12% GPU utilization)
(Figure: batch gradients moving between CPU memory and GPU)
MARIUS
I/O efficient system for learning graph embeddings
Marius Design:
- Pipelined training
- Partition ordering
PIPELINED TRAINING
Splits the algorithm into 5 stages and pipelines the stages to utilize CPU & GPU
Queue size controls staleness!
Stale data: if b2 has shared embeddings with b1, then when b2 updates embeddings, b1's embeddings are stale
Sparse access patterns ⇒ not too many conflicts
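One way to picture the pipelined design: stages connected by bounded queues, so CPU-side work (sampling edges, reading embeddings) overlaps with compute. This is a generic two-stage sketch, not Marius's actual five-stage pipeline; the point is that the queue's capacity is what bounds how stale a batch's embeddings can be:

```python
import queue
import threading

NUM_BATCHES = 8
loaded = queue.Queue(maxsize=2)    # bounded queue -> limits staleness

def load_stage():
    # Stand-in for: sample edges + read embeddings from CPU memory.
    for i in range(NUM_BATCHES):
        loaded.put(('batch', i))   # blocks when the queue is full
    loaded.put(None)               # sentinel: no more batches

results = []
def compute_stage():
    # Stand-in for: gradient compute + embedding update on the GPU.
    while True:
        item = loaded.get()
        if item is None:
            break
        results.append(item[1])

t1 = threading.Thread(target=load_stage)
t2 = threading.Thread(target=compute_stage)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # batches flow through the pipeline in FIFO order
```

With `maxsize=2`, at most two loaded batches can be "in flight" ahead of the updates the compute stage has applied, which is exactly the staleness trade-off on the slide.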
OUT of MEMORY training
Key idea: Maintain a cache of partitions in CPU memory
Questions: Order of partition traversal? How to perform eviction?
(Figure: adjacency matrix of the graph divided into partition-pair squares; an epoch must process all the squares. Node embedding parameters are partitioned, with some partitions cached in CPU memory.)
ELIMINATION ORDERING
Initialize cache with c partitions
Swap in partition that leads to highest number of unseen pairs
Achieved by fixing c-1 partitions and swapping the remaining ones in, in any order
(Worked example on the board: grid of partition pairs, each pair processed once, no conflicts)
Close to the lower bound on the number of swaps
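The fix-c-1-and-cycle idea above can be simulated to count swaps. This is a simplified reconstruction of the slide's idea, not the paper's exact BETA algorithm, and the swap accounting (counting every partition load, including reloads of a fixed group) is approximate:

```python
def beta_like_schedule(n, c):
    """Fix c-1 cached partitions, cycle every remaining partition through
    the one free buffer slot, then move to the next group of c-1.
    Returns (#swaps, whether all partition pairs were covered)."""
    seen, swaps = set(), 0

    def process(buf):
        # Process every edge bucket (partition pair) resident in the buffer.
        for a in buf:
            for b in buf:
                seen.add((min(a, b), max(a, b)))

    groups = [list(range(s, min(s + c - 1, n))) for s in range(0, n, c - 1)]
    for k, fixed in enumerate(groups):
        swaps += len(fixed)                    # load the fixed partitions
        process(fixed)
        for p in (q for g in groups[k + 1:] for q in g):
            swaps += 1                         # swap p into the free slot
            process(fixed + [p])
    covered_all = len(seen) == n * (n + 1) // 2
    return swaps, covered_all

print(beta_like_schedule(8, 4))  # → (15, True): 15 swaps cover all 36 pairs
```

Naively reloading both partitions of every pair would cost on the order of n² loads; keeping c-1 partitions fixed amortizes their cost across many pairs, which is why the ordering gets close to the lower bound on swaps.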
SUMMARY
Graph Embeddings: Learn embeddings from graph data for ML
Marius: efficient single-machine training
- Pipelining to use CPU, GPU
- Partition buffer, BETA ordering → minimize the number of swaps
DISCUSSION
https://forms.gle/LtoT8nEmw3oLvXuo9
If you were going to repeat the COST analysis for knowledge graph embedding training, what would you expect to find and why?
→ Distributed systems may need 4-8 GPUs to match a single-GPU setup ⇒ Marius ≈ 3 GPUs of DGL
→ Communication cost grows if we go to many GPUs
→ GNN training ⇒ compute >> I/O, more compute per edge
→ Is I/O the bigger bottleneck?! ↳ data dependent
How does the partitioning scheme used in this paper differ from partitioning schemes used in PowerGraph and why?
→ PowerGraph's workload was focused on a vertex and its neighbors (e.g., a PageRank job)
→ Here the edge drives the computation ↳ need source, dest embeddings for this edge
NEXT STEPS
Next class: New module!
Project check-ins by Nov 30th