A High-Speed Inter-Process Communication Architecture for FPGA-based Hardware Acceleration of Molecular Dynamics

Presented by: Chris Comis
September 23, 2005

Supervisor: Professor Paul Chow

Outline

1. Motivation
2. System-Level Overview
3. Protocol Development
4. Results
5. Integration into a Programming Model
6. Conclusions/Questions

What is Molecular Dynamics?

- A method of calculating the time-evolution of molecular configurations
- Useful in the analysis of protein folding
- Many applications in rational drug design

MD is Computationally Challenging

1. Forces (i.e. F=ma) are calculated between an atom and all other atoms in the system
   - An O(n^2) problem across 10,000+ atoms
2. Force calculations are performed at femtosecond timesteps
   - Interesting results may take several μs of simulation (10^9+ timesteps required)

MD simulations are typically run on supercomputers
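To make the scale concrete, the sketch below counts the pairwise interactions per timestep and the number of timesteps needed to cover a given simulated time. The function names are hypothetical, the 1 fs step in the usage note is an assumption (the slide says only "femtosecond timesteps"), and counting each pair once assumes the force on one atom of a pair is reused for the other.

```c
#include <assert.h>
#include <stdint.h>

/* Unordered atom pairs evaluated per timestep: n(n-1)/2, i.e. O(n^2). */
uint64_t pair_count(uint64_t n_atoms)
{
    assert(n_atoms >= 1);
    return n_atoms * (n_atoms - 1) / 2;
}

/* Timesteps needed to cover sim_time_fs femtoseconds of simulated time
   when each step advances step_fs femtoseconds (rounded up). */
uint64_t timesteps_required(uint64_t sim_time_fs, uint64_t step_fs)
{
    assert(step_fs > 0);
    return (sim_time_fs + step_fs - 1) / step_fs;
}
```

For 10,000 atoms, pair_count gives 49,995,000 pairs per timestep, and 1 μs of simulated time at an assumed 1 fs step needs 10^9 timesteps, which matches the figure on the slide.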

An FPGA-based MD Accelerator

An ongoing collaborative project involves the development of an FPGA-based MD Accelerator

Advantages to an FPGA-based approach:
1. Massive parallel computation
2. Forces can be parallelized
3. Force computations can be accelerated ~88x
4. High-speed Serial I/O (SERDES) may be leveraged

Area of Focus

Develop communication protocol using high-speed SERDES links

Requirements:
- Reliability
- Light-weight
- Minimal trip-time for small packets
- Must be abstracted at the hardware and software levels



A Partial MD Simulator

- Blocks → computation
- Arrows → communication
- Computation blocks can be hardware, or software executed on MicroBlaze soft processors
- Software must be written using a programming model

System-Level Overview

The MD simulator is simplified to a Producer/Consumer model

The model is then adapted for SERDES development:

1. Producer and consumer hardware blocks are implemented
2. An FSL (FIFO) is used as an abstracted method of data transport with SERDES logic
3. An OPB bus interface is added for register access of components
4. Deep FIFOs are added for logging high-speed data
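The FIFO abstraction between a producer and a consumer can be modelled in software as a small ring buffer. This is only a sketch of the idea: the real FSL is a hardware FIFO, and the type name, function names, and depth below are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 16u  /* assumed depth, chosen for illustration only */

typedef struct {
    uint32_t data[FIFO_DEPTH];
    unsigned head;   /* index of the next word to read */
    unsigned count;  /* number of words currently stored */
} word_fifo;

void fifo_init(word_fifo *f)
{
    assert(f != NULL);
    f->head = 0;
    f->count = 0;
}

/* Producer side: returns false when the FIFO is full (producer must stall). */
bool fifo_write(word_fifo *f, uint32_t word)
{
    assert(f != NULL);
    if (f->count == FIFO_DEPTH)
        return false;
    f->data[(f->head + f->count) % FIFO_DEPTH] = word;
    f->count++;
    return true;
}

/* Consumer side: returns false when the FIFO is empty (consumer must wait). */
bool fifo_read(word_fifo *f, uint32_t *word)
{
    assert(f != NULL && word != NULL);
    if (f->count == 0)
        return false;
    *word = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

The producer and consumer see only fifo_write and fifo_read, so neither side needs to know whether the words cross a SERDES link or stay on the same chip.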



Protocol Overview

- A synchronous acknowledgement-based protocol was chosen
  - Simple and predictable
  - An inherent delay in waiting for acknowledgements
- To mask this delay:
  - Multiple producers are connected to the SERDES interface
  - The link is time-multiplexed across multiple producers

Protocol Overview

- All data has a word width of 4 bytes
- Data packets:
  - Variable size (between 32 and 2016 bytes)
  - A 32-bit CRC is appended
- Acknowledgements:
  - 8 bytes in size
  - Can interrupt transmission of data packets
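The data packet rules above (4-byte words, 32 to 2016 bytes, CRC appended) can be sketched as follows. Several details are assumptions, not taken from the slides: the CRC polynomial (the common Ethernet CRC-32 is used here), the byte order fed to the CRC, the reading of the size range as payload size excluding the CRC word, and every function name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIN_PAYLOAD_BYTES 32u
#define MAX_PAYLOAD_BYTES 2016u

/* Fold one byte into a running CRC-32 (reflected polynomial 0xEDB88320).
   ASSUMPTION: the slides do not say which 32-bit polynomial is used. */
static uint32_t crc32_update(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; bit++)
        crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    return crc;
}

/* CRC-32 of a byte buffer, with the usual initial value and final inversion. */
uint32_t crc32_bytes(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = crc32_update(crc, buf[i]);
    return crc ^ 0xFFFFFFFFu;
}

/* Payload must be a whole number of 4-byte words within the stated range. */
int payload_size_ok(size_t bytes)
{
    return bytes % 4u == 0 && bytes >= MIN_PAYLOAD_BYTES && bytes <= MAX_PAYLOAD_BYTES;
}

/* Copy the payload into out and append its CRC as one extra word.
   out must hold n_words + 1 words. Returns the number of words written,
   or 0 if the payload size is invalid. Bytes are fed to the CRC
   least-significant first within each word (an assumed ordering). */
size_t build_packet(const uint32_t *payload, size_t n_words, uint32_t *out)
{
    assert(payload != NULL && out != NULL);
    if (!payload_size_ok(n_words * 4u))
        return 0;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n_words; i++) {
        out[i] = payload[i];
        for (int b = 0; b < 4; b++)
            crc = crc32_update(crc, (uint8_t)(payload[i] >> (8 * b)));
    }
    out[n_words] = crc ^ 0xFFFFFFFFu;
    return n_words + 1;
}
```

A receiver would recompute the CRC over the payload words and compare it with the trailing word; a mismatch corresponds to the CRC error case handled by the receive logic.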

Transmit Logic

Transmitter consists mainly of two components:
1. Dual-port buffers:
   - The start address of the packet is kept in case a resend is necessary
2. Scheduler:
   - Schedules ready packets in a round-robin fashion

[Figure: transmit datapath, from Producer via FSL to Scheduler of SERDES Link]
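Round-robin selection among transmit buffers can be sketched as below. This is a software illustration only, not the hardware scheduler itself; the function name and the choice of three buffers are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_PRODUCERS 3  /* assumed number of transmit buffers */

/* Returns the index of the next buffer with a packet ready, searching in
   round-robin order starting after `last` (pass -1 on the first call).
   Returns -1 if no buffer has a packet ready. */
int rr_next(const bool ready[NUM_PRODUCERS], int last)
{
    assert(last >= -1 && last < NUM_PRODUCERS);
    for (int step = 1; step <= NUM_PRODUCERS; step++) {
        int candidate = (last + step) % NUM_PRODUCERS;
        if (ready[candidate])
            return candidate;
    }
    return -1;
}
```

Because the search starts one position after the last buffer served, a buffer that always has data ready cannot starve the others, and the link can carry another producer's packet while one producer waits for its acknowledgement.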

Receive Logic

Receiver consists mainly of two components:
1. Dual-port buffers:
   - The start address of the packet is kept in case errors occur
2. Three-stage Dataflow Pipeline:
   - Stage 1: Determine if incoming data is properly formatted
   - Stage 2: Evaluate incoming data against all possible errors
   - Stage 3: Pass results to acknowledgement handler

[Figure: receive datapath, from SERDES Link to Consumer via FSL]
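One plausible reading of "the start address of the packet is kept in case errors occur" is that a faulty packet is dropped by rewinding the write pointer to the saved start. The sketch below shows that idea with a simple linear buffer; the struct, function names, and depth are hypothetical, and the slides do not describe the actual buffer management.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RX_WORDS 64u  /* assumed buffer depth in 32-bit words */

typedef struct {
    uint32_t mem[RX_WORDS];
    size_t packet_start;  /* address where the packet in flight began */
    size_t write_ptr;     /* next free address */
} rx_buffer;

void rx_init(rx_buffer *b)
{
    assert(b != NULL);
    b->packet_start = 0;
    b->write_ptr = 0;
}

/* Store one incoming word; returns false if the buffer is full. */
bool rx_store(rx_buffer *b, uint32_t word)
{
    assert(b != NULL);
    if (b->write_ptr == RX_WORDS)
        return false;
    b->mem[b->write_ptr++] = word;
    return true;
}

/* At end of packet: commit on success, otherwise rewind to the saved start
   so the partial packet is dropped. Returns true if a positive
   acknowledgement should be sent. */
bool rx_end_of_packet(rx_buffer *b, bool packet_ok)
{
    assert(b != NULL);
    if (packet_ok) {
        b->packet_start = b->write_ptr;
        return true;
    }
    b->write_ptr = b->packet_start;
    return false;
}
```

The transmitter's saved start address plays the mirror role: it lets the same packet be read out again when no acknowledgement arrives.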

Design Effort

Majority of design effort was in error handling:
- Transmitter:
  - Determine which packet combinations corrupt the system
  - Establish a priority among conflicting packet types
- Receiver:
  - Handle all possible combinations of transmission errors



Test Environment

- All SERDES tests were performed across Xilinx Virtex-II Pro XC2VP7 and XC2VP30 series FPGAs
- Ribbon cables were used to transfer serial data between non-impedance-controlled connectors

Reliability and Sustainability

Verification test environment:
- Send data concurrently from three producers to three respective consumers
- Pseudo-random packet length
- Consumers read from FSL at variable rates

Reliability:
- Run this test under extremely poor line conditions

Sustainability:
- Run this test under normal line conditions for a long period of time

Reliability

Reliability: 128-second Test Results

Type of Error                  Average # of Errors
Soft Error (x10^6)             1.312
Hard Error                     722977
Frame Error                    22
CRC Error                      18414
Receive Buffer Full (x10^6)    1.804
Lost Acknowledgment            81769

Sustainability

Sustainability: 8-hour Test Results

Measurement                                          Result
Resent Packets due to Receive Buffer Full (x10^6)    502.353
Successful Packets (x10^6)                           5666.821
Total Packets (x10^6)                                6169.174
Approximate Bit-Rate (x10^9)                         1.755
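The table's packet counts give the share of traffic spent on resends. The helper below is a hypothetical name used only to show the arithmetic; the inputs are the values from the table, in millions of packets.

```c
#include <assert.h>

/* Fraction of all transmitted packets that were resends. */
double resend_fraction(double resent_packets, double total_packets)
{
    assert(total_packets > 0.0);
    return resent_packets / total_packets;
}
```

With resent = 502.353 and total = 6169.174, the result is about 0.081, so roughly 8% of the packets sent during the 8-hour run were resends caused by a full receive buffer. The successful and resent counts also sum to the total (5666.821 + 502.353 = 6169.174).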

Comparison Against Other Communication Mechanisms

Two configurations are used:
- Configuration A: Saturate the channel with packets
- Configuration B: Loop-back test

Compare against:
- Simple FPGA-based 100BaseT Ethernet
- TCP/IP FPGA-based 100BaseT Ethernet
- TCP/IP Cluster-based Gigabit Ethernet

Throughput Results

One-way Trip Time Results

Area Consumption

- Each SERDES interface takes approximately 8% of a Xilinx XC2VP30
- Debug logic substantially increases area consumption:
  - FF usage increases 68%
  - LUT usage increases 43%

Area Measurement            FFs     LUTs
Area with Debug Logic       3486    3218
Area without Debug Logic    2074    2244



Integration into a Programming Model

- Hardware abstraction: FSL
- Software abstraction: An MPI-based Programming Model
- Modified MPI_Send and MPI_Recv function calls

    /* Repeatedly send 64 integers to rank 0 (tag 0), then receive 64 back. */
    while (1) {
        MPI_Send(data_outgoing, 64, MPI_INT, 0, 0, MPI_COMM_WORLD);
        MPI_Recv(data_incoming, 64, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }

Integration into a Programming Model

- Replaced producers and consumers with a MicroBlaze processor
- Several communication scenarios were tested

Scenario                                        Bit-Rate (Mbps)
MicroBlaze to MicroBlaze (no traffic)           4.30
MicroBlaze to MicroBlaze (traffic)              4.10
MicroBlaze to Hardware Consumer (no traffic)    7.78
Hardware Producer to MicroBlaze (no traffic)    8.90


Conclusions

Final Results:
- Reliable and sustainable
- Abstracted at the software and hardware level
- 2074 FFs and 2244 LUTs required for SERDES logic only
- Given a channel rate of 2.5 Gbps, maximum bidirectional throughput of 1.928 Gbps
- Minimum packet trip-time of 1.23 μs
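The throughput figure can be put in context with a short calculation. The sketch assumes the 2.5 Gbps channel rate is the raw line rate before 8B/10B encoding, which the slides do not state; under that assumption only 8 of every 10 line bits carry data. Both function names are hypothetical.

```c
#include <assert.h>

/* Data rate left after 8B/10B encoding: 8 data bits per 10 line bits. */
double payload_rate_8b10b(double line_rate_gbps)
{
    assert(line_rate_gbps >= 0.0);
    return line_rate_gbps * 8.0 / 10.0;
}

/* Achieved throughput as a fraction of the available rate. */
double link_efficiency(double achieved_gbps, double available_gbps)
{
    assert(available_gbps > 0.0);
    return achieved_gbps / available_gbps;
}
```

Against the 2.5 Gbps channel rate, 1.928 Gbps is about 77% utilization. If the assumption holds and 8B/10B leaves 2.0 Gbps for data, the protocol reaches about 96% of what the encoding permits.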

Acknowledgements

- Professor Régis Pomès, Chris Madill
- Professor Paul Chow, Professor C.Y. Chen
- Lesley Shannon, Arun Patel, Manuel Saldaña, David Chui, Sam Lee, Andrew House, Nathalie Chan, Lorne Applebaum, Patrick Akl

References

Y. Gu, T. VanCourt, M. C. Herbordt, FPGA Acceleration of Molecular Dynamics Computations, To appear: Proceedings of Field Programmable Logic and Applications, August 2005.

Transmitter Packet Collision Handling

- Packets are enclosed by 8B/10B control characters (K-characters)
- The type of packet is distinguished by the K-characters used
- Certain combinations of control characters cannot be nested:
  - Clock correction has priority over acknowledgement
  - Acknowledgement cannot interrupt the end of a data packet
  - Clock correction must avoid the beginning and end of a data packet
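The three nesting rules can be expressed as a small arbitration function. This is one interpretation of the rules as listed, not the actual transmitter logic; the enum values and function name are hypothetical, and the case where an acknowledgement goes out at the start of a packet while a clock correction waits is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { LINK_IDLE, PKT_BEGIN, PKT_MIDDLE, PKT_END } tx_position;
typedef enum { SEND_DATA_OR_IDLE, SEND_ACK, SEND_CLOCK_CORRECTION } tx_action;

/* Decide what the transmitter emits next, given where it is in the
   current data packet and which control sequences are waiting. */
tx_action tx_arbitrate(tx_position pos, bool clock_corr_pending, bool ack_pending)
{
    assert((int)pos <= (int)PKT_END);

    /* Clock correction must avoid the beginning and end of a data packet. */
    bool cc_allowed = (pos != PKT_BEGIN && pos != PKT_END);
    /* An acknowledgement cannot interrupt the end of a data packet. */
    bool ack_allowed = (pos != PKT_END);

    /* Clock correction has priority over acknowledgement. */
    if (clock_corr_pending && cc_allowed)
        return SEND_CLOCK_CORRECTION;
    if (ack_pending && ack_allowed)
        return SEND_ACK;
    return SEND_DATA_OR_IDLE;
}
```

For example, with both sequences pending in the middle of a packet the clock correction goes first, while at the end of a packet both wait until the closing K-character has been sent.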

Receiver Error Handling

All combinations of errors at the receiver are handled correctly:
- Data errors (CRC errors)
- Disparity errors or invalid characters (soft errors)
- Errors in framing (frame errors)
- Channel failures (hard errors)
- Lost acknowledgements/repeat packets
- Receiver buffers full

Test Configuration A

- Send data concurrently from three producers to three respective consumers
- Producers write to FSL as fast as possible
- Consumers read from FSL as fast as possible
- Analyze best-case throughput results

Test Configuration B

- Send data from a producer to a consumer
- Delay a packet write from a producer until a packet has been completely received by the consumer on the same FPGA
- A communication loop results that determines round-trip time (and therefore one-way trip time)
