FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue
John Giacomoni, Tipp Moseley, and Manish Vachharajani
University of Colorado at Boulder, Core Research Lab
2008.02.21

Why? Why Pipelines?
• Multicore systems are the future
• Many apps can be pipelined if the granularity is fine enough
  – ≈ < 1 µs
  – ≈ 3.5 × interrupt handler

Fine-Grain Pipelining Examples
• Network processing:
  – Intrusion detection (NID)
  – Traffic filtering (e.g., P2P filtering)
  – Traffic shaping (e.g., packet prioritization)

Network Processing Scenarios
Link       Mbps        fps           ns/frame
T-1        1.5         2,941         340,000
T-3        45.0        90,909        11,000
OC-3       155.0       333,333       3,000
OC-12      622.0       1,219,512     820
GigE       1,000.0     1,488,095     672
OC-48      2,500.0     5,000,000     200
10 GigE    10,000.0    14,925,373    67
OC-192     9,500.0     19,697,843    51
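
The ns/frame column above appears to be the reciprocal of the fps column, rounded. A minimal sketch of that arithmetic in C follows; the function name `ns_per_frame` is illustrative and does not come from the slides.

```c
/* Per-frame time budget in nanoseconds for a given frame rate.
   ns_per_frame is an illustrative name, not part of the talk. */
static double ns_per_frame(double frames_per_second) {
    return 1e9 / frames_per_second;   /* 1 s = 1e9 ns */
}
```

For the GigE row, `ns_per_frame(1488095.0)` evaluates to roughly 672, which agrees with the table; the OC-48 row gives exactly 200.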

Core Placements
4x4 NUMA Organization (ex: AMD Opteron Barcelona)
[Figure labels: IP, Dec, APP, Enc, OP]

Example: 3 Stage Pipeline
[Figure]

Communication Overhead
Mechanism      Overhead
Locks          320 ns
Lamport        160 ns
FastForward    28 ns
Hardware       10 ns
Chart reference: GigE
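
To put these overheads in context, one can compare each against the GigE budget of 672 ns/frame listed in the Network Processing Scenarios table. The sketch below is only that division; `budget_fraction` is a hypothetical helper, and treating the GigE label on the chart as the 672 ns budget is an assumption.

```c
/* Fraction of a per-frame time budget consumed by one queue operation.
   budget_fraction is a hypothetical helper, not part of the talk. */
static double budget_fraction(double overhead_ns, double frame_budget_ns) {
    return overhead_ns / frame_budget_ns;
}
```

Under that assumption, locks consume about 48% of a GigE frame time per operation, Lamport's queue about 24%, FastForward about 4%, and the hardware figure about 1.5%.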

More Fine-Grain Pipelining Examples
• Network processing:
  – Intrusion detection (NID)
  – Traffic filtering (e.g., P2P filtering)
  – Traffic shaping (e.g., packet prioritization)
• Signal Processing
  – Media transcoding/encoding/decoding
  – Software Defined Radios
• Encryption
  – Counter-Mode AES
• Other Domains
  – Fine-grain kernels extracted from sequential applications

FastForward
• Cache-optimized point-to-point concurrent lock-free (CLF) queue
  1. Fast
  2. Robust against unbalanced stages
  3. Hides die-die communication
  4. Works with strong to weak memory consistency models

Lamport’s CLF Queue (1)
lamp_enqueue(data) {
  NH = NEXT(head);
  while (NH == tail) {};     // spin while the queue is full
  buf[head] = data;
  head = NH;
}
lamp_dequeue(*data) {
  while (head == tail) {};   // spin while the queue is empty
  *data = buf[tail];
  tail = NEXT(tail);
}
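
The pseudocode above can be made compilable along the following lines. This is a sketch under stated assumptions, not the authors' code: the capacity `QSIZE`, the struct name `lamport_queue`, the C11 atomics, and the non-blocking `lamp_try_*` variants (which return false instead of spinning) are all choices made here for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QSIZE 8                       /* assumed capacity; one slot stays unused */
#define NEXT(i) (((i) + 1) % QSIZE)

typedef struct {
    _Atomic size_t head;              /* written only by the producer */
    _Atomic size_t tail;              /* written only by the consumer */
    void *buf[QSIZE];
} lamport_queue;

/* Producer side: returns false when the queue is full instead of spinning. */
static bool lamp_try_enqueue(lamport_queue *q, void *data) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t nh = NEXT(head);
    if (nh == atomic_load_explicit(&q->tail, memory_order_acquire))
        return false;                 /* full: the producer must read tail */
    q->buf[head] = data;
    atomic_store_explicit(&q->head, nh, memory_order_release);
    return true;
}

/* Consumer side: returns false when the queue is empty instead of spinning. */
static bool lamp_try_dequeue(lamport_queue *q, void **data) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    if (tail == atomic_load_explicit(&q->head, memory_order_acquire))
        return false;                 /* empty: the consumer must read head */
    *data = q->buf[tail];
    atomic_store_explicit(&q->tail, NEXT(tail), memory_order_release);
    return true;
}
```

A zero-initialized `lamport_queue` starts empty. Note that every operation reads the index owned by the other side, which is the comparison the later slides set out to eliminate.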

AMD Opteron Cache Example
[Figure; the only surviving label is "M"]

Lamport’s CLF Queue (2)
lamp_enqueue(data) {
  NH = NEXT(head);
  while (NH == tail) {};
  buf[head] = data;
  head = NH;
}
Cacheline layout (figure):
  head tail
  buf[0] buf[1] buf[2] buf[3]
  buf[4] buf[5] buf[6] buf[7]
  buf[ ] buf[ ] buf[ ] buf[n]
Observe the mandatory cacheline ping-ponging for each enqueue and dequeue operation.

Lamport’s CLF Queue (3)
lamp_enqueue(data) {
  NH = NEXT(head);
  while (NH == tail) {};
  buf[head] = data;
  head = NH;
}
Cacheline layout (figure):
  head
  buf[0] buf[1] buf[2] buf[3]
  buf[4] buf[5] buf[6] buf[7]
  buf[ ] buf[ ] buf[ ] buf[n]
  tail
Observe how cachelines will still ping-pong.
What if the head/tail comparison were eliminated?
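
The layout on this slide seems to place head and tail on separate cachelines. One way to express such a layout in C11 is sketched below; the 64-byte line size, the struct name `padded_queue`, and the 8-slot buffer are assumptions made for illustration.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHELINE 64   /* assumed cacheline size in bytes */

/* head, tail, and buf each start on their own cacheline. */
typedef struct {
    alignas(CACHELINE) size_t head;       /* producer index */
    alignas(CACHELINE) size_t tail;       /* consumer index */
    alignas(CACHELINE) void *buf[8];      /* queue slots */
} padded_queue;
```

Separating the indices removes false sharing between them, but Lamport's enqueue still reads tail and its dequeue still reads head, so both index lines keep moving between the two cores.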

FastForward CLF Queue (1)
lamp_enqueue(data) {
  NH = NEXT(head);
  while (NH == tail) {};
  buf[head] = data;
  head = NH;
}
ff_enqueue(data) {
  while (0 != buf[head]);
  buf[head] = data;
  head = NEXT(head);
}
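
A compilable sketch of the FastForward idea follows. Only `ff_enqueue` appears on the slide; the dequeue side here is written by symmetry (read the slot, treat NULL as empty, then clear it) and is an assumption, as are the names `ff_queue`, `ff_try_enqueue`, `ff_try_dequeue`, the capacity `QSIZE`, and the use of C11 atomics for the slots.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QSIZE 8                       /* assumed capacity */
#define NEXT(i) (((i) + 1) % QSIZE)

typedef struct {
    size_t head;                      /* private to the producer */
    size_t tail;                      /* private to the consumer */
    _Atomic(void *) buf[QSIZE];       /* NULL marks an empty slot */
} ff_queue;

/* Producer: data must be non-NULL. Returns false if the slot is still occupied. */
static bool ff_try_enqueue(ff_queue *q, void *data) {
    if (atomic_load_explicit(&q->buf[q->head], memory_order_acquire) != NULL)
        return false;                 /* slot not yet consumed: queue is full */
    atomic_store_explicit(&q->buf[q->head], data, memory_order_release);
    q->head = NEXT(q->head);
    return true;
}

/* Consumer: returns false if the slot is empty. */
static bool ff_try_dequeue(ff_queue *q, void **data) {
    void *item = atomic_load_explicit(&q->buf[q->tail], memory_order_acquire);
    if (item == NULL)
        return false;                 /* nothing produced yet */
    *data = item;
    atomic_store_explicit(&q->buf[q->tail], NULL, memory_order_release);
    q->tail = NEXT(q->tail);
    return true;
}
```

Because full and empty are detected from the slot contents, neither side ever reads the other side's index. The cost of this design is that NULL cannot be stored as a value.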

FastForward CLF Queue (2)
ff_enqueue(data) {
  while (0 != buf[head]);
  buf[head] = data;
  head = NEXT(head);
}
Cacheline layout (figure):
  head
  buf[0] buf[1] buf[2] buf[3]
  buf[4] buf[5] buf[6] buf[7]
  buf[ ] buf[ ] buf[ ] buf[n]
  tail
Observe how head/tail cachelines will NOT ping-pong.
BUT, “buf” will still cause the cachelines to ping-pong.

FastForward CLF Queue (3)
ff_enqueue(data) {
  while (0 != buf[head]);
  buf[head] = data;
  head = NEXT(head);
}
Cacheline layout (figure):
  head
  buf[0] buf[1] buf[2] buf[3]
  buf[4] buf[5] buf[6] buf[7]
  buf[ ] buf[ ] buf[ ] buf[n]
  tail
Solution: Temporally slip stages by a cacheline.
N:1 reduction in coherence misses per stage.
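
If N in the N:1 claim is read as the number of queue slots that share one cacheline (an interpretation, not something the slide states), it can be computed as below. The helper name `slots_per_line` is hypothetical.

```c
#include <stddef.h>

/* Number of queue slots that fit in one cacheline.
   slots_per_line is a hypothetical helper, not part of the talk. */
static size_t slots_per_line(size_t cacheline_bytes, size_t slot_bytes) {
    return cacheline_bytes / slot_bytes;
}
```

With an assumed 64-byte cacheline and 8-byte pointer slots this gives 8, so a consumer that trails the producer by at least one full line would take roughly one coherence miss per 8 slots rather than one per slot.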

Slip Timing
[Figure]

Slip Timing Lost
[Figure]

Maintaining Slip (Concepts)
• Use distance as the quality metric
  – Explicitly compare head/tail
  – Causes cache ping-ponging
  – Perform rarely

Maintaining Slip (Method)
adjust_slip() {
  dist = distance(producer, consumer);
  if (dist < Danger) {
    dist_old = 0;
    do {
      dist_old = dist;
      spin_wait(avg_stage_time * (OK - dist));
      dist = distance(producer, consumer);
    } while (dist < OK && dist > dist_old);
  }
}
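
The method above can be exercised with a small single-threaded simulation. Everything concrete in this sketch is an assumption made for illustration: the thresholds `DANGER_DIST` and `OK_DIST`, the ring size, the modular `distance` function, and a `spin_wait` stand-in that simply lets a simulated producer advance one slot per stage time instead of actually busy-waiting.

```c
enum {
    QUEUE_SIZE = 64,       /* assumed ring size */
    DANGER_DIST = 8,       /* assumed "Danger" threshold, in slots */
    OK_DIST = 16,          /* assumed "OK" threshold, in slots */
    AVG_STAGE_TIME = 1     /* assumed ticks per stage */
};

static unsigned producer_idx;   /* simulated producer position */
static unsigned consumer_idx;   /* simulated consumer position */

/* Slots between producer and consumer, measured around the ring. */
static unsigned distance(unsigned producer, unsigned consumer) {
    return (producer + QUEUE_SIZE - consumer) % QUEUE_SIZE;
}

/* Stand-in for a busy-wait: while the caller waits, the simulated
   producer completes one slot per AVG_STAGE_TIME ticks. */
static void spin_wait(unsigned ticks) {
    producer_idx = (producer_idx + ticks / AVG_STAGE_TIME) % QUEUE_SIZE;
}

/* Same control flow as the slide: wait only when the stages are too
   close, and stop once the gap is OK or has stopped growing. */
static void adjust_slip(void) {
    unsigned dist = distance(producer_idx, consumer_idx);
    if (dist < DANGER_DIST) {
        unsigned dist_old;
        do {
            dist_old = dist;
            spin_wait(AVG_STAGE_TIME * (OK_DIST - dist));
            dist = distance(producer_idx, consumer_idx);
        } while (dist < OK_DIST && dist > dist_old);
    }
}
```

Starting from a gap of 3 slots, one call waits for 13 stage times and leaves a gap of 16 in this simulation. A gap already at or above the danger threshold is left untouched, which keeps the explicit head/tail comparison rare.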

Comparative Performance
[Figure panels: Lamport; FastForward]

Thrashing and Auto-Balancing
[Figure panels: FastForward (Thrashing); FastForward (Balanced)]

Cache Verification
[Figure panels: FastForward (Thrashing); FastForward (Balanced)]

On/Off Die Communications
[Figure labels: M; On-die communication; Off-die communication]

On/Off-die Performance
[Figure panels: FastForward (On-Die); FastForward (Off-Die)]

Proven Property
• “In the program order of the consumer, the consumer dequeues values in the same order that they were enqueued in the producer's program order.”

Work in Progress
• Operating Systems
  – 27.5 ns/op
    • 3.1 % cost reduction vs. reported 28.5 ns
  – Reduced jitter
• Applications
  – 128bit AES encrypting filter
    • Ethernet layer encryption at 1.45 mfps
    • IP layer encryption at 1.51 mfps
    • ~10 lines of code for each.

Gazing into the Crystal Ball
Mechanism      Overhead
Locks          320 ns
Lamport        160 ns
FastForward    28 ns
Hardware       10 ns
Chart reference: GigE

Shared Memory Accelerated Queues: Now Available!
http://ce.colorado.edu/core