A High-Speed Inter-Process Communication Architecture for FPGA-based Hardware Acceleration of Molecular Dynamics
Presented by: Chris Comis
September 23, 2005
Supervisor: Professor Paul Chow
2
Outline
1. Motivation
2. System-Level Overview
3. Protocol Development
4. Results
5. Integration into a Programming Model
6. Conclusions/Questions
3
What is Molecular Dynamics?
- A method of calculating the time-evolution of molecular configurations
- Useful in the analysis of protein folding
- Many applications in rational drug design
4
MD is Computationally Challenging
1. Forces (i.e. F=ma) are calculated between an atom and all other atoms in the system
   - An O(n^2) problem across 10,000+ atoms
2. Force calculations are performed at femtosecond timesteps
   - Interesting results may take several μs of simulation (10^9+ timesteps required)
MD simulations are typically run on supercomputers
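The two figures on this slide can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, it counts unordered atom pairs (each pair evaluated once), and it assumes a timestep of exactly 1 fs, which the slide only gives as "femtosecond".

```c
#include <assert.h>
#include <stdint.h>

/* Number of unordered atom pairs in an all-pairs force evaluation: n(n-1)/2. */
uint64_t pair_count(uint64_t n_atoms)
{
    assert(n_atoms < (1ULL << 32)); /* keeps n*(n-1) within 64 bits */
    if (n_atoms == 0)
        return 0;
    return n_atoms * (n_atoms - 1) / 2;
}

/* Timesteps needed to cover sim_time_fs of simulated time.
 * ASSUMPTION: both arguments are expressed in femtoseconds. */
uint64_t timestep_count(uint64_t sim_time_fs, uint64_t timestep_fs)
{
    assert(timestep_fs > 0);
    return sim_time_fs / timestep_fs;
}
```

For 10,000 atoms, `pair_count` gives 49,995,000 pair interactions per timestep, and 1 μs of simulated time at 1 fs per step gives `timestep_count(1000000000, 1)` = 10^9 timesteps.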
5
An FPGA-based MD Accelerator
- An ongoing collaborative project involves the development of an FPGA-based MD accelerator
- Advantages of an FPGA-based approach:
  1. Massive parallel computation
  2. Forces can be parallelized
  3. Force computations can be accelerated ~88x
  4. High-speed serial I/O (SERDES) may be leveraged
6
Area of Focus
- Develop a communication protocol using high-speed SERDES links
- Requirements:
  - Reliability
  - Light-weight
  - Minimal trip-time for small packets
  - Must be abstracted at the hardware and software levels
7
8
A Partial MD Simulator
Diagram legend: Blocks → computation; Arrows → communication
- Computation blocks can be hardware, or software executed on MicroBlaze soft processors
- Software must be written using a programming model
14
System-Level Overview
- The MD simulator is simplified to a Producer/Consumer model
- The model is then adapted for SERDES development:
  1. Producer and consumer hardware blocks are implemented
  2. An FSL (FIFO) is used as an abstracted method of data transport with SERDES logic
  3. An OPB bus interface is added for register access of components
  4. Deep FIFOs are added for logging high-speed data
15
16
Protocol Overview
- A synchronous acknowledgement-based protocol was chosen
  - Simple and predictable
  - An inherent delay in waiting for acknowledgements
- To mask this delay:
  - Multiple producers are connected to the SERDES interface
  - The link is time-multiplexed across multiple producers
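Why time-multiplexing masks the acknowledgement delay can be seen in an idealized model, sketched below. It is not taken from the slides: it assumes every producer sends for a fixed time `t_tx` and then idles for `t_wait` until its acknowledgement arrives, and it ignores scheduling overhead and retransmissions. The function name is hypothetical.

```c
#include <assert.h>

/* Fraction of time the shared link carries data when n_producers each send
 * one packet (t_tx) and then wait (t_wait) for its acknowledgement.
 * ASSUMPTION: idealized model with identical producers and no errors. */
double link_utilization(unsigned n_producers, double t_tx, double t_wait)
{
    assert(t_tx > 0.0 && t_wait >= 0.0);
    double u = n_producers * t_tx / (t_tx + t_wait);
    return u > 1.0 ? 1.0 : u; /* the link cannot be more than fully busy */
}
```

Under this model, a single producer whose wait equals its transmit time keeps the link only half busy, while two or more such producers can fill it.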
17
Protocol Overview
- All data has a word width of 4 bytes
- Data packets:
  - Variable size (between 32 and 2016 bytes)
  - A 32-bit CRC is appended
- Acknowledgements:
  - 8 bytes in size
  - Can interrupt transmission of data packets
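The packet constraints above can be expressed as a small check plus a CRC routine. This is a sketch under stated assumptions: the slides do not name the CRC polynomial, so the common IEEE 802.3 CRC-32 is used here as a stand-in, and it is not stated whether the 32 to 2016 byte range includes the CRC word. All identifiers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { WORD_BYTES = 4, MIN_PACKET_BYTES = 32, MAX_PACKET_BYTES = 2016 };

/* Returns 1 if len is a whole number of 4-byte words within the stated range. */
int packet_length_ok(size_t len)
{
    return len >= MIN_PACKET_BYTES && len <= MAX_PACKET_BYTES
        && len % WORD_BYTES == 0;
}

/* Bitwise CRC-32: reflected polynomial 0xEDB88320, initial value and final
 * XOR 0xFFFFFFFF. ASSUMPTION: the actual design may use another polynomial. */
uint32_t crc32_ieee(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return crc ^ 0xFFFFFFFFu;
}
```

A transmitter would append `crc32_ieee(payload, len)` as the final word, and the receiver would recompute it over the received payload and compare.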
18
Transmit Logic
The transmitter consists mainly of two components:
1. Dual-port buffers:
   - The start address of the packet is kept in case a resend is necessary
2. Scheduler:
   - Schedules ready packets in a round-robin fashion
Figure labels: From Producer via FSL → To Scheduler of SERDES Link
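The round-robin selection performed by the scheduler might look like the following software sketch. It illustrates the policy only, not the hardware: the producer count of three is borrowed from the test setup, and the `ready` flags and function name are hypothetical.

```c
#include <assert.h>

#define NUM_PRODUCERS 3 /* ASSUMPTION: three producers, as in the tests */

/* Pick the next buffer holding a ready packet, starting with the one after
 * the buffer served last. ready[i] is nonzero when producer i's buffer holds
 * a complete packet. Pass last_served = -1 before anything has been served.
 * Returns the chosen index, or -1 if no buffer is ready. */
int rr_next(const int ready[NUM_PRODUCERS], int last_served)
{
    assert(last_served >= -1 && last_served < NUM_PRODUCERS);
    for (int step = 1; step <= NUM_PRODUCERS; step++) {
        int candidate = (last_served + step) % NUM_PRODUCERS;
        if (ready[candidate])
            return candidate;
    }
    return -1;
}
```

Because the search always starts one position past the last buffer served, a producer that is continuously ready cannot starve the others.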
19
Receive Logic
The receiver consists mainly of two components:
1. Dual-port buffers:
   - The start address of the packet is kept in case errors occur
2. Three-stage dataflow pipeline:
   - Stage 1: Determine if incoming data is properly formatted
   - Stage 2: Evaluate incoming data against all possible errors
   - Stage 3: Pass results to acknowledgement handler
Figure labels: From SERDES Link → To Consumer via FSL
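One plausible reading of "the start address of the packet is kept in case errors occur" is that a faulty packet is discarded by rewinding the write pointer. The sketch below models that idea in software. The buffer depth, field names, and functions are all hypothetical, and the buffer-full condition is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RX_WORDS 1024 /* ASSUMPTION: illustrative depth, not from the slides */

typedef struct {
    uint32_t mem[RX_WORDS];
    unsigned wr;        /* next word position to write */
    unsigned pkt_start; /* saved start address of the packet in flight */
    unsigned committed; /* words released to the consumer side */
} rx_buffer;

/* Remember where the incoming packet begins. */
void rx_begin(rx_buffer *b) { assert(b != NULL); b->pkt_start = b->wr; }

/* Store one received 4-byte word (overflow handling omitted). */
void rx_push(rx_buffer *b, uint32_t word)
{
    assert(b != NULL);
    b->mem[b->wr % RX_WORDS] = word;
    b->wr++;
}

/* Packet passed all checks: make it visible to the consumer. */
void rx_commit(rx_buffer *b) { assert(b != NULL); b->committed = b->wr; }

/* Packet failed a check: rewind so the resent copy overwrites it. */
void rx_abort(rx_buffer *b) { assert(b != NULL); b->wr = b->pkt_start; }
```

In this model, the third pipeline stage would call `rx_commit` before a positive acknowledgement and `rx_abort` otherwise.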
20
Design Effort
The majority of the design effort was in error handling:
- Transmitter:
  - Determine which packet combinations corrupt the system
  - Establish a priority among conflicting packet types
- Receiver:
  - Handle all possible combinations of transmission errors
21
22
Test Environment
- All SERDES tests were performed across Xilinx Virtex-II Pro XC2VP7 and XC2VP30 series FPGAs
- Ribbon cables were used to transfer serial data between non-impedance-controlled connectors
23
Reliability and Sustainability
- Verification test environment:
  - Send data concurrently from three producers to three respective consumers
  - Pseudo-random packet length
  - Consumers read from FSL at variable rates
- Reliability:
  - Run this test under extremely poor line conditions
- Sustainability:
  - Run this test under normal line conditions for a long period of time
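Pseudo-random packet lengths of the kind used in this test are often produced with a linear-feedback shift register. The slides do not say how the lengths were generated, so the sketch below is only one possibility: a standard 16-bit Galois LFSR (mask 0xB400) whose state is mapped onto the legal range of 8 to 504 words, that is 32 to 2016 bytes. Function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One step of a 16-bit Galois LFSR (taps 16, 14, 13, 11; mask 0xB400).
 * The state must be nonzero; a nonzero state never becomes zero. */
uint16_t lfsr_step(uint16_t state)
{
    assert(state != 0);
    uint16_t lsb = state & 1u;
    state >>= 1;
    if (lsb)
        state ^= 0xB400u;
    return state;
}

/* Map an LFSR state to a packet length: 8..504 words, so 32..2016 bytes. */
unsigned packet_length_bytes(uint16_t state)
{
    unsigned words = 8u + state % 497u; /* 497 = 504 - 8 + 1 possible sizes */
    return words * 4u;
}
```

Each producer would call `lfsr_step` once per packet and size the next packet with `packet_length_bytes`. The modulo mapping is slightly non-uniform, which is usually acceptable for a stress test.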
24
Reliability
Reliability: 128-second Test Results

Type of Error                  Average # of Errors
Soft Error (x10^6)             1.312
Hard Error                     722977
Frame Error                    22
CRC Error                      18414
Receive Buffer Full (x10^6)    1.804
Lost Acknowledgment            81769
25
Sustainability
Sustainability: 8-hour Test Results

Measurement                                          Result
Resent Packets due to Receive Buffer Full (x10^6)    502.353
Successful Packets (x10^6)                           5666.821
Total Packets (x10^6)                                6169.174
Approximate Bit-Rate (x10^9)                         1.755
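The table's packet counts can be cross-checked with a short calculation. The helper below is a hypothetical convenience function that uses only the numbers in the table.

```c
#include <assert.h>

/* Share of all transmitted packets that were resends, in percent. */
double resend_percent(double resent, double total)
{
    assert(total > 0.0 && resent >= 0.0 && resent <= total);
    return 100.0 * resent / total;
}
```

With the values above (in millions of packets), 502.353 resent plus 5666.821 successful gives the reported total of 6169.174, and `resend_percent(502.353, 6169.174)` comes to roughly 8.14%, all of it attributed to full receive buffers.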
26
Comparison Against Other Communication Mechanisms
- Two configurations are used:
  - Configuration A: Saturate the channel with packets
  - Configuration B: Loop-back test
- Compare against:
  - Simple FPGA-based 100BaseT Ethernet
  - TCP/IP FPGA-based 100BaseT Ethernet
  - TCP/IP cluster-based Gigabit Ethernet
27
Throughput Results
28
One-way Trip Time Results
29
Area Consumption
- Each SERDES interface takes approximately 8% of a Xilinx XC2VP30
- Debug logic substantially increases area consumption:
  - FF usage increases 68%
  - LUT usage increases 43%

Area Measurement            FFs     LUTs
Area with Debug Logic       3486    3218
Area without Debug Logic    2074    2244
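The quoted percentage increases follow directly from the table. The helper below is a hypothetical function that makes the arithmetic explicit.

```c
#include <assert.h>

/* Percentage by which with_extra exceeds base. */
double percent_increase(unsigned base, unsigned with_extra)
{
    assert(base > 0 && with_extra >= base);
    return 100.0 * (double)(with_extra - base) / (double)base;
}
```

`percent_increase(2074, 3486)` is about 68.1% for flip-flops and `percent_increase(2244, 3218)` is about 43.4% for LUTs, matching the rounded figures of 68% and 43%.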
30
31
Integration into a Programming Model
- Hardware abstraction: FSL
- Software abstraction: an MPI-based programming model
- Modified MPI_Send and MPI_Recv function calls

    while (1) {
        /* Send 64 integers to rank 0 with tag 0. */
        MPI_Send(data_outgoing, 64, MPI_INT, 0, 0,
                 MPI_COMM_WORLD);
        /* Receive 64 integers from rank 0 with tag 0. */
        MPI_Recv(data_incoming, 64, MPI_INT, 0, 0,
                 MPI_COMM_WORLD, &status);
    }
32
Integration into a Programming Model
- Replaced producers and consumers with a MicroBlaze processor
- Several communication scenarios were tested

Scenario                                        Bit-Rate (Mbps)
MicroBlaze to MicroBlaze (no traffic)           4.30
MicroBlaze to MicroBlaze (traffic)              4.10
MicroBlaze to Hardware Consumer (no traffic)    7.78
Hardware Producer to MicroBlaze (no traffic)    8.90
33
34
Conclusions
Final Results:
- Reliable and sustainable
- Abstracted at the software and hardware level
- 2074 FFs and 2244 LUTs required for SERDES logic only
- Given a channel rate of 2.5 Gbps, maximum bidirectional throughput of 1.928 Gbps
- Minimum packet trip-time of 1.23 μs
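To put the throughput figure in context, it helps to recall that 8B/10B coding carries 8 data bits in every 10 line bits. The sketch below computes that payload ceiling and compares a measured rate against it. It assumes the 1.928 Gbps figure is comparable to the payload ceiling of one direction of the link, which the slide does not state explicitly. Function names are hypothetical.

```c
#include <assert.h>

/* Payload bit rate left after 8B/10B coding (8 data bits per 10 line bits). */
double payload_rate_8b10b(double line_rate_bps)
{
    assert(line_rate_bps > 0.0);
    return line_rate_bps * 8.0 / 10.0;
}

/* Measured throughput as a percentage of the payload ceiling. */
double efficiency_percent(double throughput_bps, double payload_bps)
{
    assert(payload_bps > 0.0 && throughput_bps >= 0.0);
    return 100.0 * throughput_bps / payload_bps;
}
```

A 2.5 Gbps line rate leaves 2.0 Gbps for payload. Under the assumption above, 1.928 Gbps would correspond to about 96.4% of that ceiling, with the remainder going to framing, CRC words, and acknowledgements.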
35
Acknowledgements
- Professor Régis Pomès, Chris Madill
- Professor Paul Chow, Professor C.Y. Chen, Lesley Shannon, Arun Patel, Manuel Saldaña, David Chui, Sam Lee, Andrew House, Nathalie Chan, Lorne Applebaum, Patrick Akl

References
Y. Gu, T. VanCourt, M. C. Herbordt, FPGA Acceleration of Molecular Dynamics Computations, To appear: Proceedings of Field Programmable Logic and Applications, August 2005.
36
Transmitter Packet Collision Handling
- Packets are enclosed by 8B/10B control characters (K-characters)
- The type of packet is distinguished by the K-characters used
- Certain combinations of control characters cannot be nested:
  - Clock correction has priority over acknowledgement
  - Acknowledgement cannot interrupt the end of a data packet
  - Clock correction must avoid the beginning and end of a data packet
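The three nesting rules can be written as a small decision function. This is an interpretive sketch, not the actual transmitter logic: the position categories and all names are hypothetical, and it assumes that an acknowledgement may still proceed when a pending clock correction is blocked at the start of a data packet.

```c
#include <assert.h>

typedef enum { EV_NONE, EV_ACK, EV_CLOCK_CORRECTION } tx_event;
typedef enum { POS_IDLE, POS_DATA_START, POS_DATA_MIDDLE, POS_DATA_END } tx_position;

/* Returns 1 if the event may be inserted at this point of the outgoing stream. */
int may_insert(tx_event ev, tx_position pos)
{
    switch (ev) {
    case EV_ACK: /* must not interrupt the end of a data packet */
        return pos != POS_DATA_END;
    case EV_CLOCK_CORRECTION: /* must avoid both packet boundaries */
        return pos != POS_DATA_START && pos != POS_DATA_END;
    default:
        return 0;
    }
}

/* Choose what to insert next; clock correction wins when both are allowed. */
tx_event next_event(int ack_pending, int cc_pending, tx_position pos)
{
    assert(pos >= POS_IDLE && pos <= POS_DATA_END);
    if (cc_pending && may_insert(EV_CLOCK_CORRECTION, pos))
        return EV_CLOCK_CORRECTION;
    if (ack_pending && may_insert(EV_ACK, pos))
        return EV_ACK;
    return EV_NONE;
}
```

At the end of a data packet both events are held back, so the closing K-characters always go out intact.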
37
Receiver Error Handling
All combinations of errors at the receiver are handled correctly:
- Data errors (CRC errors)
- Disparity errors or invalid characters (soft errors)
- Errors in framing (frame errors)
- Channel failures (hard errors)
- Lost acknowledgements/repeat packets
- Receiver buffers full
38
Test Configuration A
- Send data concurrently from three producers to three respective consumers
- Producers write to FSL as fast as possible
- Consumers read from FSL as fast as possible
- Analyze best-case throughput results
39
Test Configuration B
- Send data from a producer to a consumer
- Delay a packet write from a producer until a packet has been completely received by the consumer on the same FPGA
- A communication loop results that determines round-trip time (and therefore one-way trip time)