FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue

John Giacomoni, Tipp Moseley, and Manish Vachharajani

University of Colorado at Boulder, Core Research Lab

2008.02.21

Why? Why Pipelines?

• Multicore systems are the future
• Many apps can be pipelined if the granularity is fine enough
  – ≈ < 1 µs
  – ≈ 3.5 x interrupt handler


Fine-Grain Pipelining Examples

• Network processing:
  – Intrusion detection (NID)
  – Traffic filtering (e.g., P2P filtering)
  – Traffic shaping (e.g., packet prioritization)


Network Processing Scenarios

Link       Mbps        fps           ns/frame
T-1        1.5         2,941         340,000
T-3        45.0        90,909        11,000
OC-3       155.0       333,333       3,000
OC-12      622.0       1,219,512     820
GigE       1,000.0     1,488,095     672
OC-48      2,500.0     5,000,000     200
10 GigE    10,000.0    14,925,373    67
OC-192     9,500.0     19,697,843    51
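The ns/frame column is the reciprocal of the fps column. The GigE row is also consistent with back-to-back minimum-size Ethernet frames (64-byte frame, 8-byte preamble, 12-byte inter-frame gap, so 672 bits on the wire). The slides do not state the frame size, so that part is an assumption. A minimal sketch of the arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Bits occupied on the wire by one minimum-size Ethernet frame:
 * 64-byte frame + 8-byte preamble/SFD + 12-byte inter-frame gap.
 * Assumption: the slides do not say which frame size they used. */
enum { MIN_FRAME_WIRE_BITS = (64 + 8 + 12) * 8 };  /* 672 bits */

/* Frames per second on a link of `mbps` megabits per second when every
 * frame occupies `wire_bits` bits on the wire (rounded down). */
static uint64_t frames_per_second(uint64_t mbps, uint64_t wire_bits) {
    assert(wire_bits > 0);
    return (mbps * 1000000ULL) / wire_bits;
}

/* Time available to process one frame, in nanoseconds (rounded down). */
static uint64_t ns_per_frame(uint64_t fps) {
    assert(fps > 0);
    return 1000000000ULL / fps;
}
```

For example, `frames_per_second(1000, MIN_FRAME_WIRE_BITS)` gives the GigE frame rate from the table, and `ns_per_frame` of that rate gives the 672 ns budget each pipeline stage has to meet.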


Core-Placements

4x4 NUMA Organization (ex: AMD Opteron Barcelona)

[Figure: example placements of the pipeline stages (IP, APP, OP, Dec, Enc) onto cores]


Example 3 Stage Pipeline


Communication Overhead

[Figure: per-operation communication overhead, with GigE marked for comparison]

Locks 320ns

Lamport 160ns

Hardware 10ns

FastForward 28ns
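To put these costs next to the GigE budget of 672 ns/frame from the scenarios table, the sketch below computes how much of that budget queue operations alone would consume. It assumes a middle pipeline stage performs two queue operations per frame (one dequeue, one enqueue). That count and the helper name are illustrative assumptions, not figures from the slides.

```c
#include <assert.h>

/* Percentage of a per-frame time budget consumed by `ops` queue operations
 * that cost `op_ns` nanoseconds each (integer percent, rounded down). */
static unsigned budget_percent(unsigned op_ns, unsigned ops, unsigned budget_ns) {
    assert(budget_ns > 0);
    return (op_ns * ops * 100u) / budget_ns;
}
```

Under that assumption, `budget_percent(320, 2, 672)` shows lock-based queues leaving almost nothing for real work at GigE rates, while `budget_percent(28, 2, 672)` stays in single digits.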


More Fine-Grain Pipelining Examples

• Network processing:
  – Intrusion detection (NID)
  – Traffic filtering (e.g., P2P filtering)
  – Traffic shaping (e.g., packet prioritization)

• Signal Processing
  – Media transcoding/encoding/decoding
  – Software Defined Radios

• Encryption
  – Counter-Mode AES

• Other Domains
  – Fine-grain kernels extracted from sequential applications


FastForward

• Cache-optimized point-to-point CLF queue
  1. Fast
  2. Robust against unbalanced stages
  3. Hides die-die communication
  4. Works with strong to weak memory consistency models


Lamport's CLF Queue (1)

lamp_enqueue(data) {
    NH = NEXT(head);
    while (NH == tail) {};     // spin while the queue is full
    buf[head] = data;
    head = NH;
}

lamp_dequeue(*data) {
    while (head == tail) {};   // spin while the queue is empty
    *data = buf[tail];
    tail = NEXT(tail);
}

[Figure: head and tail indices over the ring buffer buf[0] ... buf[n]]
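The pseudocode above can be sketched in C11 as follows. This is a minimal single-producer/single-consumer version under stated assumptions: it uses non-blocking `try` variants instead of the spin loops, a fixed size of 8 slots, and acquire/release atomics for the shared indices. The names `lamp_queue`, `lamp_try_enqueue`, and `lamp_try_dequeue` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define LAMP_SIZE 8                         /* slots; one always stays empty */
#define LAMP_NEXT(i) (((i) + 1) % LAMP_SIZE)
static_assert(LAMP_SIZE >= 2, "need at least two slots");

typedef struct {
    _Atomic size_t head;                    /* written only by the producer */
    _Atomic size_t tail;                    /* written only by the consumer */
    void *buf[LAMP_SIZE];
} lamp_queue;

/* Producer side: returns false when the queue is full. */
static bool lamp_try_enqueue(lamp_queue *q, void *data) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t nh = LAMP_NEXT(head);
    /* The producer must read the consumer's index: the shared comparison. */
    if (nh == atomic_load_explicit(&q->tail, memory_order_acquire))
        return false;
    q->buf[head] = data;
    atomic_store_explicit(&q->head, nh, memory_order_release);
    return true;
}

/* Consumer side: returns false when the queue is empty. */
static bool lamp_try_dequeue(lamp_queue *q, void **data) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    /* The consumer must read the producer's index. */
    if (tail == atomic_load_explicit(&q->head, memory_order_acquire))
        return false;
    *data = q->buf[tail];
    atomic_store_explicit(&q->tail, LAMP_NEXT(tail), memory_order_release);
    return true;
}
```

A zero-initialized `lamp_queue` is an empty queue. Note that every operation reads an index written by the other thread, which is the source of the cache-line traffic discussed on the following slides.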


AMD Opteron Cache Example

[Figure: head and tail cache lines during enqueue and dequeue]

Observe the mandatory cacheline ping-ponging for each enqueue and dequeue operation.

Observe how cachelines will still ping-pong. What if the head/tail comparison was eliminated?
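The second observation appears to describe head and tail placed on separate cache lines (the figure is lost, so this reading is an assumption). A hedged sketch of such a layout in C11, assuming a 64-byte cache line and using the hypothetical type name `padded_queue`:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHELINE 64   /* assumed cache-line size in bytes */

/* Each field starts on its own cache line, so a write to head does not
 * invalidate the line that holds tail (no false sharing). */
typedef struct {
    alignas(CACHELINE) size_t head;      /* producer's index */
    alignas(CACHELINE) size_t tail;      /* consumer's index */
    alignas(CACHELINE) void *buf[8];     /* ring buffer slots */
} padded_queue;

static_assert(offsetof(padded_queue, tail) - offsetof(padded_queue, head)
                  >= CACHELINE,
              "head and tail must not share a cache line");
```

Padding removes false sharing only. With Lamport's algorithm the producer still reads tail and the consumer still reads head, so both lines keep moving between cores, which is why the slide asks about eliminating the comparison altogether.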


FastForward CLF Queue (1)

ff_enqueue(data) {
    while (0 != buf[head]);    // spin while the slot is still occupied
    buf[head] = data;
    head = NEXT(head);
}

[Figure: head, tail, and the buf[0], buf[1] cache lines]

Observe how head/tail cachelines will NOT ping-pong. BUT, "buf" will still cause the cachelines to ping-pong.

Solution: Temporally slip stages by a cacheline. N:1 reduction in coherence misses per stage.
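A minimal C11 sketch of the idea follows. Only the enqueue side survives in the slide text, so the dequeue side here assumes the symmetric form (read the slot, and if it is non-empty, take the value and write the empty marker back). The non-blocking `try` variants, the 8-slot size, the atomics, and all names are likewise assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define FF_SIZE 8
#define FF_NEXT(i) (((i) + 1) % FF_SIZE)

typedef struct {
    size_t head;                    /* private to the producer */
    size_t tail;                    /* private to the consumer */
    void *_Atomic buf[FF_SIZE];     /* NULL marks an empty slot */
} ff_queue;

/* Producer side: returns false when the next slot is still occupied. */
static bool ff_try_enqueue(ff_queue *q, void *data) {
    assert(data != NULL);           /* NULL is reserved as the empty marker */
    if (atomic_load_explicit(&q->buf[q->head], memory_order_acquire) != NULL)
        return false;
    atomic_store_explicit(&q->buf[q->head], data, memory_order_release);
    q->head = FF_NEXT(q->head);     /* never read by the consumer */
    return true;
}

/* Consumer side (assumed symmetric form): returns false when the slot is empty. */
static bool ff_try_dequeue(ff_queue *q, void **data) {
    void *item = atomic_load_explicit(&q->buf[q->tail], memory_order_acquire);
    if (item == NULL)
        return false;
    *data = item;
    atomic_store_explicit(&q->buf[q->tail], NULL, memory_order_release);
    q->tail = FF_NEXT(q->tail);     /* never read by the producer */
    return true;
}
```

Neither side reads the other's index: full and empty are detected from the slot contents alone. The cost is that one pointer value (here NULL) cannot be stored in the queue.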


Slip Timing


Slip Timing Lost


Maintaining Slip (Concepts)

• Use distance as the quality metric
  – Explicitly compare head/tail
  – Causes cache ping-ponging
  – Perform rarely


Maintaining Slip (Method)

adjust_slip() {
    dist = distance(producer, consumer);
    if (dist < DANGER) {
        dist_old = 0;
        do {
            dist_old = dist;
            // wait long enough for the gap to grow back to OK
            spin_wait(avg_stage_time * (OK - dist));
            dist = distance(producer, consumer);
        } while (dist < OK && dist > dist_old);
    }
}
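The method above can be sketched in C with the distance measurement and the wait passed in as callbacks, so the control loop can be exercised without real threads. Everything beyond the loop structure is an assumption for illustration: the `slip_cfg` and `sim_state` types, the callback signatures, and the toy model in which the producer advances one slot per average stage time while the calling stage waits.

```c
#include <assert.h>
#include <stddef.h>

/* Slots between producer and consumer on a ring of `size` slots. */
static size_t ring_distance(size_t head, size_t tail, size_t size) {
    return (head + size - tail) % size;
}

typedef struct {
    size_t danger;                    /* below this distance, slip is lost */
    size_t ok;                        /* target distance to restore */
    unsigned long avg_stage_time_ns;  /* average time per stage iteration */
    size_t (*distance)(void *ctx);    /* measures producer/consumer distance */
    void (*spin_wait)(void *ctx, unsigned long ns);
    void *ctx;
} slip_cfg;

/* Returns the number of waits performed (0 when no adjustment was needed). */
static unsigned adjust_slip(const slip_cfg *c) {
    assert(c->danger <= c->ok);
    unsigned waits = 0;
    size_t dist = c->distance(c->ctx);
    if (dist < c->danger) {
        size_t dist_old;
        do {
            dist_old = dist;
            c->spin_wait(c->ctx, c->avg_stage_time_ns * (c->ok - dist));
            waits++;
            dist = c->distance(c->ctx);
        } while (dist < c->ok && dist > dist_old);  /* stop if no progress */
    }
    return waits;
}

/* Toy model for demonstration only: while the caller waits, the producer
 * advances one slot per `stage_ns` nanoseconds. */
typedef struct { size_t head, tail, size; unsigned long stage_ns; } sim_state;

static size_t sim_distance(void *ctx) {
    sim_state *s = ctx;
    return ring_distance(s->head, s->tail, s->size);
}

static void sim_spin_wait(void *ctx, unsigned long ns) {
    sim_state *s = ctx;
    s->head = (s->head + ns / s->stage_ns) % s->size;
}
```

In the toy model, a stage that finds the distance at 2 with `danger = 4` and `ok = 8` waits once for six stage times and then sees the distance restored to 8. The loop also exits when the distance stops growing, which avoids waiting forever on a stalled producer.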


Comparative Performance

[Figure: two panels, Lamport and FastForward]


Thrashing and Auto-Balancing

[Figure: two panels, FastForward (Thrashing) and FastForward (Balanced)]


Cache Verification

[Figure: two panels, FastForward (Thrashing) and FastForward (Balanced)]


On/Off Die Communications

[Figure: on-die communication and off-die communication]


On/Off-die Performance

[Figure: two panels, FastForward (On-Die) and FastForward (Off-Die)]


Proven Property

• “In the program order of the consumer, the consumer dequeues values in the same order that they were enqueued in the producer's program order.”


Work in Progress

• Operating Systems
  – 27.5 ns/op
    • 3.1 % cost reduction vs. reported 28.5 ns
  – Reduced jitter

• Applications
  – 128bit AES encrypting filter
    • Ethernet layer encryption at 1.45 mfps
    • IP layer encryption at 1.51 mfps
    • ~10 lines of code for each.


Gazing into the Crystal Ball

[Figure: per-operation communication overhead, with GigE marked for comparison]

Locks 320ns

Lamport 160ns

Hardware 10ns

FastForward 28ns


Shared Memory Accelerated Queues

Now Available!

http://ce.colorado.edu/core

