A new less memory intensive net model for timing driven analytical placement

Arno Messiaen

Supervisor: Prof. dr. ir. Dirk Stroobandt
Counsellor: Ir. Elias Vansteenkiste

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Electrical Engineering

Department of Electronics and Information Systems
Chairman: Prof. dr. ir. Rik Van de Walle
Faculty of Engineering and Architecture
Academic year 2014-2015

The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the copyright terms have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation.

Arno Messiaen, 10/08/2015


Acknowledgements

I wish to express my sincere gratitude to Professor dr. ir. Dirk Stroobandt for making this research possible as leader of the HES research group within the ELIS department at Ghent University. I sincerely thank Ir. Elias Vansteenkiste for the day-to-day counselling during the course of my thesis, providing me with new ideas and motivation during difficult moments and proofreading this text. I also thank the other thesis students and doctoral students who supported me throughout this journey. Finally, I also wish to thank my family for providing me with a lot of love and support when I came home after a difficult day of working on this thesis.


A new less memory intensive net model for timing driven analytical placement

by

Arno Messiaen

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Electrical Engineering

Academic year 2014–2015

Supervisor: Prof. Dr. Ir. D. Stroobandt

Counsellor: Ir. E. Vansteenkiste

Faculty of Engineering

Ghent University

Department of Electronics and Information Systems

Chairman: Prof. Dr. Ir. R. Van de Walle

Abstract

We introduce a new hybrid net model for timing-driven analytical placement. This new hybrid net model decreases the average critical path delay obtained after global placement by 14% compared to wire-length-driven analytical placement, while the obtained HPWL (Half Perimeter Wire-Length) remains the same. This is a very interesting feature of the hybrid net model. We also introduce a new gradual legalization method, which decreases the HPWL obtained after global placement by 2% on average, and by up to 23% for some individual benchmark circuits, compared to the traditional recursive partitioning based complete legalization algorithm. Although some previous work about analytical placement on FPGAs has already been published, none of these publications have resulted in the release of a publicly available analytical FPGA placement framework. Also, very few implementation details are available in the existing publications on this subject. We will release our code as an open source project. In this way we hope to encourage academic research on analytical placement algorithms for FPGAs.

Keywords

FPGA, placement, analytical, timing driven

A new less memory intensive net model for timing-driven analytical placement

Arno Messiaen

Supervisor: Dirk Stroobandt, counsellor: Elias Vansteenkiste

Abstract: We introduce a new hybrid net model for timing-driven analytical placement. This new hybrid net model decreases the average critical path delay obtained after global placement by 14% compared to wire-length-driven analytical placement, while the obtained HPWL (Half Perimeter Wire-Length) remains the same. This is a very interesting feature of the hybrid net model. We also introduce a new gradual legalization method, which decreases the HPWL obtained after global placement by 2% on average, and by up to 23% for some individual benchmark circuits, compared to the traditional recursive partitioning based complete legalization algorithm. Although some previous work about analytical placement on FPGAs has already been published, none of these publications have resulted in the release of a publicly available analytical FPGA placement framework. Also, very few implementation details are available in the existing publications on this subject. We will release our code as an open source project. In this way we hope to encourage academic research on analytical placement algorithms for FPGAs.

Keywords: FPGA, placement, analytical, timing-driven

I. Introduction

Simulated Annealing (SA) based placement algorithms have been dominant in the FPGA world for many years. Because of the emergence of ever-larger commercial devices, these SA based placement algorithms were replaced by analytical placement algorithms in commercial FPGA tools. This was done because analytical placers need less run-time to obtain a quality of result similar to that of SA based placers. Unfortunately, all academic tools that are publicly available at this moment still use SA based placers. Moreover, the publications on analytical placement algorithms for FPGAs give very few implementation details. This is also illustrated by the fact that none of the placers developed in these works are publicly available, not even as an executable. These issues are addressed in this thesis by the development of an open source framework for analytical placement on FPGAs.

The wire-length-driven analytical placer that was developed as part of this work is strongly based on the Heterogeneous Analytical Placer (HeAP) described in [1]. We propose a new gradual legalization method for use in this placer. This new legalization method legalizes overlapping blocks over multiple iterations of the placement algorithm, with the aim of obtaining a better result at the end of global placement.

Although some previous work about analytical placement on FPGAs has already been published, very few of these publications can be considered timing-driven, as they do not take any timing information into account during global placement. In this work we have evaluated two new net models that allow timing information to be incorporated during global placement in a very natural way: the source-sink net model and the hybrid net model.

II. Analytical placement

Analytical placers solve a system of equations in order to find a global placement. This global placement is then refined using a detailed placement algorithm, for example a low temperature simulated annealing stage. Other refinement methods are also possible [1] [3].

A. Wire-length-driven analytical placement

Our wire-length-driven analytical placer is based on HeAP (Heterogeneous Analytical Placer) [1]. HeAP is a quadratic placer, which means that the system of equations to be solved is linear. HeAP is characterized by a special method for reducing the amount of overlap in the solution obtained by solving the linear system of equations: a pseudo net is added to each block. This pseudo net connects the block to its location after legalization. This method of resolving overlap was first introduced in the analytical ASIC placer SimPL [2].
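To see why a quadratic cost leads to a linear system, the following sketch writes out the cost in the x-direction. The symbols $w_{ij}$ (weight of a two-pin connection), $w_i^{p}$ (weight of the pseudo net of block $i$) and $\hat{x}_i$ (legalized position of block $i$) are notation introduced here for illustration only; they are not taken from [1].

```latex
\begin{align}
\Phi_x &= \sum_{(i,j)} w_{ij}\,(x_i - x_j)^2 \;+\; \sum_{i} w_i^{p}\,(x_i - \hat{x}_i)^2 \\
\frac{\partial \Phi_x}{\partial x_i} &= 2 \sum_{j} w_{ij}\,(x_i - x_j) \;+\; 2\, w_i^{p}\,(x_i - \hat{x}_i) = 0
\end{align}
```

Each movable block contributes one such condition, and every condition is linear in the unknown positions, so together they form a linear system $A x = b$. In this formulation the pseudo net of block $i$ only adds $w_i^{p}$ to the diagonal entry of that block and $w_i^{p}\,\hat{x}_i$ to the right-hand side.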

HeAP uses a recursive partitioning based legalization algorithm that always completely legalizes the solution obtained by solving the linear system of equations. In this work we propose a new gradual legalization method. This gradual legalization method allows some overlap during the first iterations of the analytical placement algorithm. The amount of allowed overlap is gradually reduced, until from a certain iteration onwards the solution is always completely legalized. Our expectation is that gradual legalization will give the analytical placement algorithm more freedom to find a good legalized position for blocks that are part of large clusters of overlapping blocks.

B. Timing-driven analytical placement

Not many analytical placers for FPGAs have incorporated timing information in their system of equations in order to also optimize for critical path delay during global placement. We propose three different methods to add timing information to the linear system of equations of our quadratic placer. Each of these methods uses a different net model. The net model of an analytical placer determines how a net is broken down into two-pin connections that can be added to the linear system of equations.

The first method uses the bound-to-bound net model and adds a timing cost weight factor to the weight of each two-pin connection. This is done on a net-by-net basis. This method is based on the work done in [3]. Note that the bound-to-bound net model is considered to be the current state of the art in wire-length-driven analytical placers.

The second method uses our newly proposed source-sink net model. This net model only adds two-pin connections that involve the source pin of the net to the linear system of equations. This makes it possible to add timing information in a very natural way: each connection weight is multiplied by the criticality of the connection it represents. We can thus add timing information on a connection-by-connection basis instead of on a net-by-net basis, as is the case when using the bound-to-bound net model.

The third method uses a newly proposed hybrid net model. This net model adds all two-pin connections that are part of the bound-to-bound net model to the linear system formulation, using the weights used in the wire-length-driven case. All two-pin connections of the source-sink net model that correspond to a connection with a criticality of 0.8 or higher are also added to the linear system. In this way the total expression that is minimized by solving the linear system of equations is equal to the HPWL cost plus a number of terms representing connections with a criticality of 0.8 or higher. This also makes it mathematically clear that our hybrid net model optimizes for both optimization goals: HPWL cost and critical path delay.
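The construction of the hybrid net model can be sketched for one net in one dimension as follows. This is only an illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, the bound-to-bound weight 2 / ((P - 1) · distance) is the commonly used choice and is assumed here, and the base weight 1 / distance of a source-sink connection is likewise an assumption. Only the criticality threshold of 0.8 is taken from the text.

```python
from typing import Dict, List, Tuple

Connection = Tuple[int, int, float]  # (pin index a, pin index b, weight)

CRITICALITY_THRESHOLD = 0.8  # value named in the text


def bound_to_bound(xs: List[float], min_dist: float = 1e-3) -> List[Connection]:
    """Bound-to-bound connections of one net in one dimension.

    Assumed weight: 2 / ((P - 1) * distance); min_dist avoids division by zero.
    """
    p = len(xs)
    if p < 2:
        return []
    lo = min(range(p), key=lambda i: xs[i])
    # Pick the upper bound among the remaining pins so that lo != hi always holds.
    hi = max((i for i in range(p) if i != lo), key=lambda i: xs[i])

    def weight(a: int, b: int) -> float:
        return 2.0 / ((p - 1) * max(abs(xs[a] - xs[b]), min_dist))

    connections = [(lo, hi, weight(lo, hi))]
    for i in range(p):
        if i not in (lo, hi):
            # Every inner pin is connected to both boundary pins.
            connections.append((lo, i, weight(lo, i)))
            connections.append((i, hi, weight(i, hi)))
    return connections


def critical_source_sink(xs: List[float], source: int,
                         criticality: Dict[int, float],
                         min_dist: float = 1e-3) -> List[Connection]:
    """Source-sink connections whose criticality reaches the threshold.

    Assumed weight: criticality / distance.
    """
    connections = []
    for sink, crit in criticality.items():
        if sink != source and crit >= CRITICALITY_THRESHOLD:
            dist = max(abs(xs[source] - xs[sink]), min_dist)
            connections.append((source, sink, crit / dist))
    return connections


def hybrid(xs: List[float], source: int,
           criticality: Dict[int, float]) -> List[Connection]:
    """All bound-to-bound connections plus the critical source-sink ones."""
    return bound_to_bound(xs) + critical_source_sink(xs, source, criticality)


# Hypothetical net: pin 0 is the source, pins 1..3 are sinks.
xs = [0.0, 4.0, 1.0, 3.0]
criticality = {1: 0.9, 2: 0.5, 3: 0.85}
for a, b, w in hybrid(xs, source=0, criticality=criticality):
    print(a, b, round(w, 3))
```

For this four-pin net the bound-to-bound part contributes five connections, and the two sinks with a criticality of at least 0.8 each add one extra source-sink connection; the sink with criticality 0.5 adds nothing.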

III. Results

The evaluation of our analytical placers was done using the benchmark circuits in the VTR benchmark suite. For detailed placement a low temperature simulated annealing stage was used.

A. Wire-length-driven analytical placement

The results obtained for our wire-length-driven analytical placers are presented in Table I. All of the results in this table are geometric means (over all benchmark designs) of the obtained ratios relative to the results obtained with our wire-length-driven simulated annealing based reference placer. Rows two and three give the average HPWL cost after global placement, rows four and five give the average HPWL cost after detailed placement. Each time the complete and gradual legalization methods are compared with each other.
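The way such a table entry is formed can be sketched as follows. The circuit names and cost values below are hypothetical and only serve to show the computation; they are not measurements from this work.

```python
from statistics import geometric_mean
from typing import Dict


def relative_geomean(results: Dict[str, float],
                     reference: Dict[str, float]) -> float:
    """Geometric mean over all circuits of the ratio result / reference."""
    ratios = [results[name] / reference[name] for name in reference]
    return geometric_mean(ratios)


# Hypothetical HPWL costs of two circuits (not measured data).
reference = {"circuit_a": 100.0, "circuit_b": 400.0}
placer = {"circuit_a": 200.0, "circuit_b": 200.0}

print(relative_geomean(placer, reference))
```

In this example the ratios are 2 and 0.5. Their geometric mean is 1 (up to rounding), whereas their arithmetic mean would be 1.25, which illustrates why the geometric mean is the natural choice for averaging ratios: a circuit that is twice as bad and a circuit that is twice as good cancel out.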

Table I indicates that using gradual legalization results in a 2% lower HPWL cost after global placement compared to the complete legalization method. It has to be mentioned that some of the larger benchmark circuits gave very good results using gradual legalization, while some of the smaller benchmark circuits gave results that were worse than when using complete legalization. This indicates that more research needs to be done on the parameters of the gradual legalization algorithm. The average speed-up that was obtained was close to 5× for both total and gradual legalization.

| Placement stage | Legalization | HPWL |
|---|---|---|
| global | total | 1.38 |
| global | gradual | 1.36 |
| detailed | total | 1.03 |
| detailed | gradual | 1.03 |

TABLE I: Wire-length-driven analytical placement results summary

B. Timing-driven analytical placement

The results obtained for our timing-driven analytical placers (global placement only) are presented in Table II. We also added the results for our wire-length-driven analytical placers for reference. All HPWL costs and critical path delays in this table are geometric means over all VTR benchmark circuits and are relative to the results obtained using the timing-driven simulated annealing based reference placer. B2B, S-S and hybrid refer to the timing-driven analytical placers using the bound-to-bound net model, source-sink net model and hybrid net model respectively.

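| Placer | Variant | HPWL | CP |
|---|---|---|---|
| WLD | total | 1.25 | 1.55 |
| WLD | gradual | 1.24 | 1.54 |
| TD | B2B | 1.44 | 1.46 |
| TD | S-S | 1.54 | 1.51 |
| TD | hybrid | 1.2 | 1.34 |

TABLE II: Timing-driven analytical placement: results after global placement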

It is immediately clear that the results obtained by the hybrid net model are the best by far, both in terms of HPWL cost and in terms of critical path delay. The average HPWL cost obtained using the timing-driven hybrid net model is even slightly better than when using the wire-length-driven bound-to-bound net model. Table III summarizes the results obtained after detailed placement for the placer using the hybrid net model.

After detailed placement the average HPWL cost is 2% higher compared to the results obtained using the timing-driven simulated annealing based reference placer. The average critical path delay on the other hand is 2% lower. The achieved speed-up was 2.86×. Storing the matrices describing the linear system requires on average 8% more memory with the hybrid net model than with the bound-to-bound net model.

| Placement stage | HPWL | CP |
|---|---|---|
| global | 1.2 | 1.34 |
| detailed | 1.02 | 0.98 |

TABLE III: Timing-driven analytical placement using our hybrid net model

IV. Conclusion and future work

This work has fulfilled the need for an open source analytical FPGA placer. A new gradual legalization method was proposed for use in wire-length-driven analytical placement. Finally, a new hybrid net model for use in timing-driven analytical placement was introduced.

Several parts of the analytical placers developed in this work can be improved further by tuning some of the algorithm's parameters. Efforts will also be made to divide the placement problem into several subproblems (clusters), with the aim of obtaining a better placement quality and a higher speed-up (by parallelization). Finally, the amount of time spent in detailed placement will be reduced by tuning its parameters and trying new methods.

References

[1] M. Gort and J.H. Anderson, Analytical placement for heterogeneous FPGAs, Proceedings of the 22nd International Conference on Field Programmable Logic and Applications (FPL), pp. 143–150, 2012

[2] M. Kim, D. Lee and I.L. Markov, SimPL: An effective placement algorithm, International Conference on Computer-Aided Design, pp. 50–60, 2012

[3] T. Lin, P. Banerjee and Y. Chang, An efficient and effective analytical placer for FPGAs, Proceedings of the 50th Annual Design Automation Conference, pp. 10:1–10:6, 2013

Contents

Acknowledgements
Abstract
Extended abstract
Table of Contents
Abbreviations

1 Introduction
  1.1 Problem definition
  1.2 Structure

2 Background
  2.1 FPGAs
  2.2 CAD flow
  2.3 Placement algorithms
    2.3.1 Evaluation of a circuit placement
    2.3.2 Simulated annealing
    2.3.3 Analytical placement
    2.3.4 The evolution of placement algorithms

3 Analytical placement
  3.1 Previous work
  3.2 Analytical placer details
    3.2.1 Building the linear system
    3.2.2 Net models
    3.2.3 Legalization
    3.2.4 Pseudo connections and convergence
    3.2.5 Refinement
    3.2.6 Timing-driven analytical placement
  3.3 Contributions

4 Method and results
  4.1 Introduction
    4.1.1 Benchmark circuits
    4.1.2 Architectures
    4.1.3 Timing information
    4.1.4 Placement evaluation
  4.2 Reference simulated annealing based placers
    4.2.1 Wire-length-driven simulated annealing based placer
    4.2.2 Timing-driven simulated annealing based placer
  4.3 Wire-length-driven analytical placers
  4.4 Timing-driven analytical placers
    4.4.1 Comparing the different net models
    4.4.2 Results after refinement for our hybrid net model

5 Conclusions and future work
  5.1 Conclusions
  5.2 Future work
    5.2.1 Comparison to low effort simulated annealing
    5.2.2 Other future work

Bibliography
List of Figures
List of Tables

List of abbreviations

FPGA Field-Programmable Gate Array

CLB Complex Logic Block

FF FlipFlop

LUT LookUp Table

IO Input/Output

IOB Input/Output Block

RAM Random Access Memory

DSP Digital Signal Processing

MAC Multiply and Accumulate

CAD Computer-Aided Design

HDL Hardware Description Language

VHDL Very high speed integrated circuit Hardware Description Language

HPWL Half Perimeter Wire-Length

BB Bounding Box

ASIC Application Specific Integrated Circuit

SA Simulated Annealing

VPR Versatile Place and Route

VTR Verilog-to-Routing

CRS Compressed Row Storage

QUIP Quartus University Interface Program

QPF Quadratic Placement tool for FPGAs

HeAP Heterogeneous Analytical Placer

CP Critical Path


Chapter 1

Introduction

FPGAs are a specific type of programmable logic device. They basically consist of a two-dimensional array of programmable logic blocks. These programmable logic blocks can be connected using the routing network present in the FPGA in order to implement any combinational or sequential logic circuit (that fits in the device). FPGA applications are developed using an HDL (Hardware Description Language). The FPGA tool flow converts this HDL design into a binary bit stream that can be used to program the device. One of the most important and time consuming stages of this tool flow is placement. The placement algorithm places the logic blocks present in the net list generated during the previous stages of the tool flow on the physical resources in the FPGA. Placement is not a trivial problem, as connections spanning a long distance across the FPGA result in large signal delays. Therefore connected blocks need to be placed close to each other. Unfortunately, the circuit placement problem is NP-complete. This means that no algorithm is known that computes an optimal solution in a time that grows polynomially with the problem size. For realistic problem sizes, good solutions can therefore only be found in reasonable time by using heuristic or approximation techniques. This thesis revolves entirely around FPGA placement algorithms. More precisely, this work focusses on analytical FPGA placement algorithms.

1.1 Problem definition

Simulated annealing based algorithms have been used for a long time on FPGAs. Simulated annealing has the advantage that a good quality of result can be obtained, but the algorithm needs a lot of time to achieve this. Moreover, the run-time of the algorithm increases faster than linearly with the problem size (the number of blocks to place). Experiments conducted as part of this thesis have shown that the run-time is proportional to the number of blocks to place to the power of 1.6. Because today's commercial devices have become very large, almost all commercial FPGA placement tools have migrated towards analytical placement algorithms, which have a shorter run-time while giving the same quality of results. Although some academic research has already been done on analytical placement on FPGAs, all publicly available academic placers still use a simulated annealing based algorithm. Moreover, the publications on analytical placement algorithms for FPGAs give very few implementation details. One of the goals of this thesis is to address this issue by developing an open source framework for analytical placement on FPGAs.
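To give a feeling for what an exponent of 1.6 means in practice, the following small sketch computes by which factor the run-time grows when the number of blocks grows by a given factor. Only the exponent is taken from the text; the function name is hypothetical and constant factors are ignored.

```python
def sa_runtime_factor(size_factor: float, exponent: float = 1.6) -> float:
    """Run-time growth factor when the block count grows by size_factor,
    assuming run-time is proportional to (number of blocks) ** exponent."""
    return size_factor ** exponent


for factor in (2, 10):
    print(f"{factor}x blocks -> {sa_runtime_factor(factor):.1f}x run-time")
# prints:
# 2x blocks -> 3.0x run-time
# 10x blocks -> 39.8x run-time
```

Under this assumption, a design with ten times as many blocks needs roughly forty times as much placement time, which illustrates why the run-time of simulated annealing becomes problematic for very large devices.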


Very few of the existing publications about analytical placement on FPGAs have addressed the issue of timing-driven analytical placement. Therefore another goal of this thesis is to introduce a new method for incorporating timing information in analytical placement algorithms. This is done by searching for new net models that allow this timing information to be introduced in a natural way. Along the way we try to minimize the memory usage of this new net model, as the main downside of analytical placement algorithms compared to simulated annealing based algorithms is their higher memory usage.

1.2 Structure

Chapter 2 clarifies the problem that is tackled in this thesis. We give an overview of FPGAs and their CAD flow in general and also introduce the technical details of placement algorithms for FPGAs.

Chapter 3 introduces the analytical placement algorithms that were built and evaluated in this thesis. We also give an overview of the previous work that has been done on this subject. Finally we give an overview of the contributions made by this thesis.

Chapter 4 summarizes the obtained simulation results and uses these results to discuss the performance of the placers described in Chapter 3.

Finally, Chapter 5 gives the conclusions of this thesis and makes some suggestions for possible future work.

Chapter 2

Background

2.1 FPGAs

Figure 2.1: FPGA architectures. (a) Homogeneous architecture, (b) heterogeneous architecture.

A field programmable gate array (FPGA) is a specific type of programmable logic device. More specifically, an FPGA is field-programmable, which means it can be programmed and reprogrammed in the field (without returning it to the manufacturer), and it basically consists of an array of logic blocks (hence the term gate array). The basic structure of an FPGA is shown in Figure 2.1a. Each logic block in the FPGA looks exactly the same and contains multiple lookup tables (LUTs) and registers. The LUTs implement combinational logic functions, while the registers are memory elements that make it possible to build sequential circuits. An example of a very simple logic block is shown in Figure 2.2. More sophisticated FPGA architectures (such as those found in commercial FPGAs sold today) contain more complex logic blocks, hence the term complex logic block (CLB) is often used in literature and in the industry.

Figure 2.2: A logic block containing one LUT and one register

Next to the CLBs an FPGA contains two more important building blocks, as shown in Figure 2.1a. First, there are the input/output blocks (IOBs). These blocks make it possible to interface with the outside world, as they are connected to the pins of the FPGA package. Second, the FPGA contains a considerable amount of routing infrastructure. This routing infrastructure allows CLBs and IOBs to be interconnected, so that a useful digital circuit can be implemented. The routing infrastructure consists of routing channels and switch blocks. A routing channel is a collection of wires in between two switch blocks. As shown in Figure 2.1a these routing channels can run in horizontal or in vertical direction. The switch blocks (also called switch matrices) make it possible to connect wires in different routing channels. In addition, they allow the neighbouring logic blocks to connect to the routing network. A schematic representation of a switch block is shown in Figure 2.3. This figure was reproduced from the datasheet of a commercial device [17].

Figure 2.3: Switch matrix found in Xilinx XC4000E and XC4000X devices [17]

Note that Figure 2.3 also indicates the types of wires that are connected in the switch matrix. Single wires are wires that connect two neighbouring switch blocks (these can be horizontal or vertical neighbours). Double wires, on the other hand, skip a switch block and in this way connect two switch blocks that are two grid positions away from each other. Modern FPGA architectures also contain other types of wires, such as quads (spanning four grid positions), hex wires (spanning six grid positions), etc. Finally, long wires are wires covering 12 to 18 grid positions.


The FPGA architecture described above is sometimes called a homogeneous architecture, as there is only one type of logic block present in the array. The architectures used in commercial devices nowadays are heterogeneous. Next to the standard logic blocks, other types of blocks are also included in the array. These blocks are sometimes referred to as hard blocks, as they are hard wired to execute a certain function and are therefore less flexible than a standard logic block. Hard blocks are typically grouped in columns in the FPGA array, as shown in Figure 2.1b. An example of a type of hard block is the DSP block. This type of block is optimized to perform multiply and accumulate operations (MAC operations), and can therefore execute this functionality much faster and more area-efficiently than when the same functionality is implemented using standard logic blocks. Figure 2.4 shows a DSP48 slice used by Xilinx in Virtex-4 devices [18] (note that a DSP block typically consists of two DSP slices in Xilinx architectures). Another example of a type of hard block is the RAM block, which contains a certain amount of addressable memory.

Figure 2.4: DSP48 slice as found in Xilinx Virtex-4 FPGAs [18]

2.2 CAD flow

Figure 2.5 shows a schematic representation of a typical FPGA toolflow. An FPGA application designer will typically describe the FPGA's desired functionality using a hardware description language (HDL) such as VHDL or Verilog. Some tools allow different design methods such as schematic design or higher level languages such as OpenCL or C, but HDL design is currently the most used method in the industry. The first step in the toolflow is called synthesis. In this step the HDL description of the circuit is translated into a netlist of basic logic gates (e.g., AND, OR, etc.), registers, and IO blocks. Hard blocks such as multipliers and RAM blocks are also inferred and added to the netlist. Next, the mapping stage maps the basic logic gates into LUTs. Finally, the packer packs the LUTs and registers into CLBs. This means that after packing we end up with a netlist containing only the FPGA building blocks shown in Figure 2.1b.

Figure 2.5: Schematic representation of the FPGA toolflow

The next steps in the toolflow are placement and routing. Placement places the building blocks on physical locations in the FPGA. This is not a trivial problem, as connections spanning a long distance across the FPGA result in large signal delays, which in turn result in poor circuit performance. Because of this, strongly interconnected blocks have to be placed close to each other. After placement the routing stage uses the FPGA's routing infrastructure to make all necessary connections between the circuit blocks. Note that placement and routing are closely related, as a poor placement will result in a lot of congestion. Congestion means that an insufficient number of wires is available to make a connection in the most efficient way, so that a detour needs to be made. This results in poorer circuit performance.

The final step in the toolflow is the bitstream generation. This stage takes the results of all the previous toolflow stages as input and generates the FPGA programming file (also called bitstream). The bitstream is then used by the FPGA programmer device to physically program the FPGA such that it exhibits the desired functionality.

2.3 Placement algorithms

The research done in this thesis focusses entirely on the placement stage of the FPGA toolflow. Therefore we will dive a little bit deeper into this stage of the toolflow in this section.

2.3.1 Evaluation of a circuit placement

As mentioned before, circuit placement is a very important stage in the FPGA toolflow. This is because the quality of the routed design can only be as good as the limit imposed by the quality of the placement generated before the routing step. The quality of a routed design is evaluated by two main criteria:

• Maximal path delay in the circuit

• The amount of wiring necessary

The maximal path delay in the circuit determines the maximal operating frequency of the circuit. As a minimal desired clock frequency is defined for most applications for which an FPGA is used, this is a very important parameter. When less wiring is necessary to route a design, fewer wires have to switch states when there is activity. Because of this, less parasitic capacitance has to be charged and/or discharged, which results in lower power consumption. The amount of wiring that is necessary on the FPGA in order to be able to route as many designs as possible is also important information for the FPGA vendor. If designs can be routed using less wiring, the routing channels on the FPGA can be made smaller. This reduces the area needed to implement the chip, which in turn reduces its cost.

A circuit placement is typically evaluated by routing the design using this placement and evaluating the maximal circuit delay and power consumption. Unfortunately the routing stage takes quite a long time to complete, which means that it is not possible to use these post-routing measures in the kernel of the placement algorithm. If we want to evaluate a placement during the circuit placement stage, we therefore need other criteria to do so. These criteria are often referred to as cost functions.

A very simple but often used cost function is the so-called half perimeter wire-length (HPWL) of the circuit. The HPWL is an estimate of the routing cost of the circuit. The HPWL of a net is defined as the sum of the width and the height of its bounding box. The total HPWL of the circuit is then defined as the sum of the half perimeter wire-lengths of all nets in the circuit. Figure 2.6 shows a net connecting five elements with each other. The red circles indicate circuit IOs, which are always considered to be at fixed positions. The green rectangles indicate movable CLBs. The bounding box of the net is indicated by the blue rectangle.

Figure 2.6: The bounding box of a net

When evaluating different circuit placements using HPWL, the placement with the lowest HPWL is considered the best placement. Note that half perimeter wire-length is a very natural way of evaluating circuit placements. As mentioned before, placement has the goal of placing strongly interconnected circuit elements close to each other. Minimizing the HPWL of the circuit tries to achieve exactly this.
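The definition of the HPWL given above can be sketched in a few lines. The function names and the pin coordinates are hypothetical; the coordinates do not correspond to the positions in Figure 2.6.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) position of a pin on the FPGA grid


def net_hpwl(pins: List[Point]) -> float:
    """Width plus height of the bounding box enclosing all pins of one net."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def circuit_hpwl(nets: List[List[Point]]) -> float:
    """Total HPWL: the sum of the HPWL of every net in the circuit."""
    return sum(net_hpwl(pins) for pins in nets)


# Hypothetical five-pin net: the bounding box is 5 wide and 4 high.
five_pin_net = [(1, 1), (4, 2), (2, 5), (6, 3), (3, 4)]
print(net_hpwl(five_pin_net))  # prints 9
```

Only the outermost pins of a net influence its HPWL; moving a pin that lies strictly inside the bounding box, without leaving it, does not change the cost.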

Looking back at Figure 2.6, it is easy to see that the half perimeter wire-length cost function underestimates the minimal interconnection wire-length needed after routing for nets containing more than three elements. For this reason the HPWL of a net is multiplied by a factor q(n) (≥ 1) which depends on the number of pins n the net connects (n = 5 in Figure 2.6). Table 2.1 gives q(n) as a function of n. Mathematically, the HPWL cost of a circuit can then be written as follows (note that in this equation n is an index running over all nets, so q(n) is evaluated for the number of pins of net n, and bb_x(n) and bb_y(n) denote the width and the height of the bounding box of net n):

\[
\mathrm{Cost}_{\mathrm{HPWL}} = \sum_{n=1}^{N_{\mathrm{nets}}} q(n) \cdot \left( bb_x(n) + bb_y(n) \right) \tag{2.1}
\]

n               q(n)
1 ≤ n ≤ 3       1
4               1.0828
5               1.1536
6               1.2206
7               1.2823
8               1.3385
9               1.3991
10              1.4493
11 ≤ n ≤ 15     (n − 10) · (1.6899 − 1.4493)/5 + 1.4493
16 ≤ n ≤ 20     (n − 15) · (1.8924 − 1.6899)/5 + 1.6899
21 ≤ n ≤ 25     (n − 20) · (2.0743 − 1.8924)/5 + 1.8924
26 ≤ n ≤ 30     (n − 25) · (2.2334 − 2.0743)/5 + 2.0743
31 ≤ n ≤ 35     (n − 30) · (2.3895 − 2.2334)/5 + 2.2334
36 ≤ n ≤ 40     (n − 35) · (2.5356 − 2.3895)/5 + 2.3895
41 ≤ n ≤ 45     (n − 40) · (2.6625 − 2.5356)/5 + 2.5356
46 ≤ n ≤ 50     (n − 45) · (2.7933 − 2.6625)/5 + 2.6625
n > 50          (n − 50) · 0.02616 + 2.7933

Table 2.1: q(n) as a function of the number of pins n connected by the net [4]
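As an illustration, Equation 2.1 and Table 2.1 can be sketched in a few lines of Python. This is only a minimal sketch and not the implementation used in this work: the function names `q` and `hpwl_cost` are hypothetical, a net is assumed to be given as a list of (x, y) pin positions, and the number of pins is assumed to equal the length of that list.

```python
# Tabulated values of Table 2.1 for 4 <= n <= 10.
Q_SMALL = {4: 1.0828, 5: 1.1536, 6: 1.2206, 7: 1.2823,
           8: 1.3385, 9: 1.3991, 10: 1.4493}
# Values of q at n = 10, 15, 20, ..., 50 (interval end points in Table 2.1).
Q_BREAK = [1.4493, 1.6899, 1.8924, 2.0743, 2.2334,
           2.3895, 2.5356, 2.6625, 2.7933]


def q(n):
    """Correction factor q(n) for a net with n pins (Table 2.1)."""
    if n <= 3:
        return 1.0
    if n <= 10:
        return Q_SMALL[n]
    if n > 50:
        return (n - 50) * 0.02616 + 2.7933
    k = (n - 11) // 5              # 0 for 11..15, 1 for 16..20, ...
    lo, hi = Q_BREAK[k], Q_BREAK[k + 1]
    return (n - (10 + 5 * k)) * (hi - lo) / 5 + lo


def hpwl_cost(nets):
    """Equation 2.1; nets is a list of nets, each a list of (x, y) pins."""
    total = 0.0
    for pins in nets:
        xs = [x for x, _ in pins]
        ys = [y for _, y in pins]
        bb_x = max(xs) - min(xs)   # bounding box width
        bb_y = max(ys) - min(ys)   # bounding box height
        total += q(len(pins)) * (bb_x + bb_y)
    return total
```

For a five-pin net with a bounding box of width 4 and height 3, the sketch multiplies the half perimeter of 7 by q(5) = 1.1536.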

Although the half perimeter wire-length cost function is a very natural and easy way of evaluating circuit placements, the optimal placement in terms of this cost function is not necessarily the optimal placement in terms of maximal circuit delay. To explain this further we first need to introduce the timing graph. Figure 2.7 shows the timing graph (right part of the figure) of a very simple example circuit (left part of the figure) [2]. The example circuit consists of two inputs, two LUTs, one register and one output. Note that we consider LUTs and registers instead of CLBs. The corresponding timing graph consists of vertices and directed edges. Every vertex represents an input or output pin of a basic circuit element such as a LUT, register or IO pad. The pins of circuit inputs and register outputs are always inputs to the graph, while pins of a LUT are always internal vertices of the graph. Register input pins and circuit output pins are always outputs of the timing graph. The connections between these circuit elements are represented by directed edges in the graph. Every edge is annotated with the delay of the corresponding connection. Besides the circuit elements already mentioned, hard blocks can also be added to a timing graph. It is important to differentiate between combinational hard blocks and sequential hard blocks. Multiplier hard blocks, for example, are combinational and are therefore treated as LUTs in the timing graph. RAM blocks, on the other hand, are sequential and are therefore treated as registers in the timing graph.

Figure 2.7: The timing graph of a simple circuit [2]

When using a timing graph during circuit placement, we obviously cannot know the exact delay of a connection because the circuit has not been routed yet. A possible solution to this problem is the following. Consider a connection between two circuit elements. We call the distance between these two elements in the x-direction and y-direction ∆x and ∆y respectively. Before building the timing graph we use the router to route a connection between two circuit elements that are ∆x and ∆y apart. We do this for every (∆x, ∆y) pair that can occur, each time on a completely blank FPGA. For every (∆x, ∆y) pair we store the resulting delay in a so-called delay lookup matrix. Calculating the delay between two circuit elements while building or updating the timing graph then reduces to looking up the delay associated with the (∆x, ∆y) pair. Note that we can only do this because of the strongly repetitive and homogeneous nature of an FPGA. Even for a heterogeneous FPGA this reasoning remains valid because the number of hard blocks is always limited compared to the number of CLBs. Moreover, a hard block does not dramatically disturb the FPGA's grid structure and routing network [12]. Different matrices are used depending on the types of the connected blocks. For each type of hard block, for example, new delay lookup matrices need to be calculated.
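The delay lookup idea can be sketched as follows. This is a minimal sketch under stated assumptions: `build_delay_lookup` and `connection_delay` are hypothetical names, the `route_delay` callback stands in for an actual router run on a blank FPGA, and the lookup is assumed to depend only on the absolute offsets |∆x| and |∆y|.

```python
def build_delay_lookup(width, height, route_delay):
    """Precompute lookup[dx][dy] for every offset on a width x height grid.

    route_delay(dx, dy) is a placeholder for routing one connection on an
    otherwise empty FPGA and measuring its delay.
    """
    return [[route_delay(dx, dy) for dy in range(height)]
            for dx in range(width)]


def connection_delay(lookup, src, dst):
    """Estimate the delay between two block positions by table lookup."""
    dx = abs(src[0] - dst[0])   # assumption: only the distance matters
    dy = abs(src[1] - dst[1])
    return lookup[dx][dy]


# Toy delay model (purely illustrative, not measured data).
lookup = build_delay_lookup(4, 4, lambda dx, dy: 0.1 + 0.05 * (dx + dy))
estimate = connection_delay(lookup, (0, 0), (3, 1))
```

The table is built once before placement starts, so each delay estimate during timing graph updates costs only two subtractions and one array access.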

Once the timing graph has been constructed we can look into how to use it during circuit placement. We start by associating an arrival time with every vertex in the timing graph. Input vertices of the timing graph always have an arrival time of zero. For all other vertices in the timing graph the arrival time is defined as follows:

$$T_{\mathrm{arrival}}(i) = \max_{\forall j \in \mathrm{fanin}(i)} \left( T_{\mathrm{arrival}}(j) + \mathrm{delay}(j, i) \right) \qquad (2.2)$$

The maximal delay (Dmax) in the circuit is equal to the maximal arrival time that occurs in the timing graph. Note that the vertex with the maximal arrival time is always an output vertex of the timing graph. Also note that Dmax is only a prediction of the maximal delay that will occur in the circuit after routing. Routing congestion could, for example, result in a higher maximal delay after routing.


After calculating Dmax we can associate a required arrival time with every vertex in the timing graph. The required arrival time of an output vertex of the timing graph is always equal to Dmax. For all other vertices in the timing graph the required arrival time is defined as follows:

$$T_{\mathrm{required}}(i) = \min_{\forall j \in \mathrm{fanout}(i)} \left( T_{\mathrm{required}}(j) - \mathrm{delay}(i, j) \right) \qquad (2.3)$$

Note that arrival times are calculated in the forward direction (from input vertices to output vertices of the timing graph), while required times are calculated in the backward direction (from output vertices to input vertices of the timing graph). Finally, we can define the slack of a connection (i, j). The slack of a connection indicates how much delay may be added to the connection before the path that the connection is on becomes critical. The slack of a connection (i, j) is defined as follows:

$$\mathrm{slack}(i, j) = T_{\mathrm{required}}(j) - T_{\mathrm{arrival}}(i) - \mathrm{delay}(i, j) \qquad (2.4)$$

All connections that are part of the critical path (the slowest path) of the circuit have a slack of 0: no delay may be added to such a connection or the maximal delay of the circuit will increase. It is easy to see that when placing a circuit we will give high priority to reducing the delay of connections that have a slack close to zero.
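Equations 2.2 to 2.4 can be sketched as one forward and one backward pass over the timing graph. The code below is a minimal sketch and not the timing analysis of this work: `analyze_timing` is a hypothetical name, the graph is assumed to be acyclic and given as a dictionary that maps each edge (i, j) to its delay, and for brevity the vertices in the usage example are circuit elements rather than individual pins.

```python
from collections import defaultdict, deque


def analyze_timing(delay):
    """delay maps each directed edge (i, j) to its estimated delay."""
    fanin, fanout = defaultdict(list), defaultdict(list)
    vertices = set()
    for (i, j) in delay:
        fanout[i].append(j)
        fanin[j].append(i)
        vertices.update((i, j))

    # Topological order (Kahn's algorithm); assumes an acyclic graph.
    pending = {v: len(fanin[v]) for v in vertices}
    queue = deque(v for v in vertices if pending[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in fanout[v]:
            pending[w] -= 1
            if pending[w] == 0:
                queue.append(w)

    # Forward pass, Equation 2.2: input vertices arrive at time zero.
    arrival = {}
    for v in order:
        arrival[v] = max((arrival[u] + delay[(u, v)] for u in fanin[v]),
                         default=0.0)
    d_max = max(arrival.values())

    # Backward pass, Equation 2.3: output vertices are required at d_max.
    required = {}
    for v in reversed(order):
        required[v] = min((required[w] - delay[(v, w)] for w in fanout[v]),
                          default=d_max)

    # Equation 2.4.
    slack = {(i, j): required[j] - arrival[i] - d
             for (i, j), d in delay.items()}
    return arrival, required, slack, d_max


example = {("in1", "lut1"): 1.0, ("in2", "lut1"): 2.0,
           ("lut1", "lut2"): 1.0, ("in2", "lut2"): 0.5,
           ("lut2", "out"): 1.0}
arrival, required, slack, d_max = analyze_timing(example)
```

In this toy graph the path in2, lut1, lut2, out is critical with a delay of 4.0, so its three connections have zero slack, while the connection from in2 to lut2 can tolerate 2.5 units of extra delay.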

As a final step we now introduce a new cost function which takes timing information into account better than the half perimeter wire-length cost function does. As a result this cost function does a better job of reducing the maximal circuit delay. We start by introducing the timing cost of a connection (i, j):

$$\mathrm{timingCost}(i, j) = \mathrm{delay}(i, j) \cdot \mathrm{criticality}(i, j)^{\,\mathrm{criticality\ exponent}} \qquad (2.5)$$

The criticality of a connection (i, j) is defined as follows:

$$\mathrm{criticality}(i, j) = 1 - \frac{\mathrm{slack}(i, j)}{D_{max}} \qquad (2.6)$$

Connections on the critical path have a criticality of 1, as their slack is equal to 0. The criticality exponent is used to make only connections with a criticality very close to 1 really important. Its value typically varies between 1 and 8. We can now calculate the total timing cost of the placement by summing the timing cost of all source-sink connections in the circuit:

$$\mathrm{Cost}_{timing} = \sum_{\forall (i,j) \in \mathrm{circuit}} \mathrm{timingCost}(i, j) \qquad (2.7)$$
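Equations 2.5 to 2.7 translate almost directly into code. The following is a minimal sketch, assuming the delays and slacks are available as dictionaries keyed by connection; the function name `timing_cost` and its parameter names are hypothetical.

```python
def timing_cost(delay, slack, d_max, crit_exponent):
    """Total timing cost of a placement (Equations 2.5 to 2.7).

    delay and slack map each connection (i, j) to a number; d_max is the
    maximal arrival time in the timing graph.
    """
    total = 0.0
    for edge, d in delay.items():
        criticality = 1.0 - slack[edge] / d_max       # Equation 2.6
        total += d * criticality ** crit_exponent     # Equation 2.5
    return total
```

With a criticality exponent of 8, a connection with criticality 0.9 is weighted by roughly 0.43, whereas a critical connection keeps its full weight of 1, which shows how the exponent concentrates the cost on near-critical connections.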

It is important to mention that during placement both the maximal delay and the half perimeter wire-length have to be minimized. When wire-length is completely ignored there might be an unacceptable amount of congestion during routing. Completely ignoring the timing cost, on the other hand, might result in an unacceptably high maximal delay in the circuit. This is partly a consequence of the fact that the half perimeter wire-length cost function does not take logic depth into consideration. The logic depth of a path is defined as the number of LUTs that the signal has to travel through from the source flip-flop to the destination flip-flop. Figure 2.8a shows a circuit path with a high logic depth, while Figure 2.8b shows a circuit path with a low logic depth. As paths with a high logic depth are more likely to be critical, it is even more important to minimize interconnection length on these paths than on paths with a lower logic depth. The timing cost function takes this into account, but the half perimeter wire-length function does not. Therefore the two cost functions described above are combined when executing a timing-driven placement [12].

(a) High logic depth

(b) Low logic depth

Figure 2.8: Two circuit paths with a different logic depth

Circuit placement is a computationally complex problem. It is actually an NP-complete problem. This basically means that no algorithm is known that computes an optimal solution efficiently, that is, in a time that grows polynomially with the problem size. For realistic problem sizes, good solutions to NP-complete problems can therefore only be found in reasonable time by using a heuristic or an approximation technique.

In the following two sections two different kinds of placement algorithms will be introduced. We start with simulated annealing, the most widely used placement algorithm for FPGAs in the academic world at the moment of writing. Next we introduce analytical placement, which is already in wide use in commercial FPGA tools, but has not yet found its way into open source academic tools. This is in contrast to ASIC placement (which is closely related to FPGA placement), where both the academic world and commercial vendors have made the transition to analytical placers.

2.3.2 Simulated annealing

In computer science a heuristic is defined as an approach to find an acceptable solution to an optimization problem in an acceptable time frame. Heuristic techniques are typically used when exhaustively searching for the optimal solution would be impossible or impractical (e.g. because of the amount of time needed). Instead of searching for the optimal solution, heuristics aim to find a solution that is sufficient for the application at hand, without guaranteeing optimality. Simulated annealing (SA) is a heuristic used for the global optimization problem of locating a good approximation to the global optimum of a given function. The name and inspiration for the heuristic come from annealing in metallurgy, where a metal is heated and cooled using a specific temperature schedule in order to decrease the number of crystal defects.


When used in the placement stage of the FPGA tool flow, simulated annealing tries to find an acceptable solution to the placement problem by repeatedly swapping the positions of blocks. The algorithm basically has four important parameters:

• The cost function

• The temperature profile

• The number of swap attempts per temperature

• The search range in which we search for a swap

Both the half perimeter wire-length cost function and the timing cost function defined in Section 2.3.1 can be used as a cost function. During wire-length-driven placement only the HPWL is used. During timing-driven placement the HPWL and timing cost functions are combined into one single cost function in order to optimize for both.

The algorithm starts by generating a completely random initial placement. The only condition that has to be fulfilled is that all blocks are placed on valid positions (sites) in the FPGA (e.g. a RAM block should be placed on a RAM site in the FPGA and not on a CLB site).

After generating a random initial placement, random swaps are performed. The process of generating a random swap is illustrated in Figure 2.9. We start by selecting a random block in the circuit. This block is indicated in green in Figure 2.9. Then we search for a valid random site in the FPGA within a square box with side Rlim around the current position of the block on the FPGA. This box is indicated in blue in Figure 2.9. If the randomly selected site already contains a block, the positions of the two blocks are swapped. If the selected site does not contain a block yet, the selected block is simply moved to the selected site.

Figure 2.9: The search range when searching for a random swap site

After generating a random swap we decide whether to accept it. This is done using the cost function and the current temperature. We start by calculating the cost change (∆C) of the swap. If ∆C is smaller than zero the swap results in a cost decrease and we have thus found a better placement. We therefore always accept swaps that have a negative ∆C. Swaps that have a positive ∆C are treated differently: a random experiment is performed to decide whether the swap is accepted. The probability of success of the random experiment (and thus of an accepted swap) depends on the current temperature and the value of ∆C. At a high temperature even swaps that entail a large cost increase have a chance of being accepted, while at a low temperature only swaps that increase the cost slightly can still be accepted.

We accept some cost increasing swaps because only accepting cost decreasing swaps would quickly result in the algorithm getting stuck in a local minimum. This is illustrated in Figure 2.10. Being stuck in a local minimum means that no swap can be found any more that decreases the placement cost, while the optimal placement has not yet been found. Therefore we have to accept some cost increasing swaps. The algorithm always starts at a high temperature. This means that a lot of cost increasing swaps are accepted in the beginning, in order to explore the solution space. During the course of the algorithm the temperature is gradually decreased, until almost no cost increasing swaps are accepted any more at the end of the algorithm.

Figure 2.10: Simulated annealing temperature schedule [9]

Along with the temperature, the swap search range Rlim is also gradually changed. At first the search range spans the entire FPGA, which together with the high initial temperature allows the algorithm to search the complete solution space. The search range is then gradually decreased with decreasing temperature, to eventually become very small at the end of the algorithm. This means that the algorithm naturally evolves from a more global view to a more local view.

A final parameter of the simulated annealing algorithm is the number of swap attempts that are executed per temperature interval. This parameter is often used to change the effort level of the algorithm: a higher number of swap attempts per temperature results in a longer run-time, but also in a better quality of the final solution.

The simulated annealing placement algorithm described above is summarized in pseudocode in Listing 2.1.


Listing 2.1: Simulated annealing pseudocode

S = RandomPlacement();
T = InitialTemperature();
Rlimit = InitialRlimit();
while exitCriterion() == false:
    nSwapsDone = 0; // Restart the swap counter for every temperature
    while nSwapsDone < nSwapsPerTemperature:
        Snew = GenerateSwap(S, Rlimit);
        deltaC = calculateDeltaCost(S, Snew);
        if deltaC < 0:
            S = Snew; // Accept move
        else:
            if randomExperiment(deltaC, T) == true:
                S = Snew; // Accept move
        nSwapsDone++;
    T = UpdateTemperature();
    Rlimit = UpdateRlimit();
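The pseudocode of Listing 2.1 can be turned into a small runnable toy example. The sketch below is not the placer used in this work and makes several assumptions that the text above does not fix: the random experiment accepts a cost increasing swap with probability exp(−∆C/T), the temperature is cooled geometrically, the search window extends Rlimit sites in each direction, the exit criterion is a temperature threshold, and the cost is the plain HPWL without the q(n) correction. All names and constants are hypothetical.

```python
import math
import random


def hpwl(nets, pos):
    """Sum of bounding-box half perimeters (without the q(n) factor)."""
    total = 0
    for net in nets:
        xs = [pos[b][0] for b in net]
        ys = [pos[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def anneal(blocks, nets, size, swaps_per_temp=200, seed=0):
    rng = random.Random(seed)
    sites = [(x, y) for x in range(size) for y in range(size)]
    pos = dict(zip(blocks, rng.sample(sites, len(blocks))))  # random start
    cost = hpwl(nets, pos)
    temperature, r_limit = 10.0, size        # assumed initial values
    while temperature > 0.01:                # assumed exit criterion
        for _ in range(swaps_per_temp):
            block = rng.choice(blocks)
            x, y = pos[block]
            # Random site inside the search window, clipped to the grid.
            nx = rng.randint(max(0, x - r_limit), min(size - 1, x + r_limit))
            ny = rng.randint(max(0, y - r_limit), min(size - 1, y + r_limit))
            other = next((b for b in blocks if pos[b] == (nx, ny)), None)
            # Tentatively apply the swap (or the plain move to a free site).
            pos[block] = (nx, ny)
            if other is not None:
                pos[other] = (x, y)
            delta = hpwl(nets, pos) - cost
            if delta < 0 or rng.random() < math.exp(-delta / temperature):
                cost += delta                # accept move
            else:                            # reject move: undo it
                pos[block] = (x, y)
                if other is not None:
                    pos[other] = (nx, ny)
        temperature *= 0.8                   # assumed geometric cooling
        r_limit = max(1, int(r_limit * 0.9)) # shrink the search range
    return pos, cost


blocks = ["a", "b", "c", "d"]
nets = [["a", "b"], ["b", "c"], ["c", "d"]]
placement, final_cost = anneal(blocks, nets, size=4)
```

Recomputing the full HPWL for every swap keeps the sketch short; a real placer would only update the cost of the nets connected to the moved blocks.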

2.3.3 Analytical placement

In this section we briefly introduce the general concepts behind analytical placement. As this was the main subject of this thesis, the next chapter of this text describes the placer that was developed as part of this work in more detail.

The general principle behind analytical placement is that the placement problem is represented as a linear system. Solving this linear system then results in the optimal x- and y-locations for all movable blocks. Let us consider the (rather trivial) netlist depicted in Figure 2.11. The two red circles indicate fixed blocks. These can be IOs, or other blocks that are considered fixed during circuit placement. We only consider the placement problem in the x-direction here, as the solution in the y-direction is fairly easy to obtain: all blocks should be placed at the same y-location. Therefore only the x-location of the fixed blocks is shown in the figure. The green rectangles indicate movable blocks, such as CLBs.

Figure 2.11: A very simple netlist to be placed

We can now describe the placement problem as follows: "Find the optimal locations of the movable blocks A and B such that the total necessary wire-length is minimized." Note that we are considering wire-length-driven placement here, as no timing information is taken into account. We can express the sum of the squares of the used wire-lengths in the x-direction as follows:

$$\Phi_X = (X_A - 1)^2 + (X_A - X_B)^2 + (X_B - 3)^2 \qquad (2.8)$$

In order to solve the placement problem we should find the values of XA and XB that minimize ΦX. This can be done by taking the partial derivatives of Equation 2.8 with respect to XA and XB and setting the resulting expressions equal to zero. Note that while doing so we are actually minimizing the square wire-length instead of the wire-length itself. The reason for this will be explained later on. Taking the partial derivatives and dividing out the common factor 2 gives the following expressions:

$$\frac{1}{2} \frac{\partial \Phi_X}{\partial X_A} = (X_A - 1) + (X_A - X_B) = 0 \qquad (2.9)$$

$$\frac{1}{2} \frac{\partial \Phi_X}{\partial X_B} = -(X_A - X_B) + (X_B - 3) = 0 \qquad (2.10)$$

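It is clear that we end up with a linear system of equations. This linear system of equations can be represented by a matrix and a vector:

$$A \cdot \vec{x} = \vec{b} \quad \text{with} \quad A = \begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}, \quad \vec{x} = \begin{bmatrix} X_A \\ X_B \end{bmatrix}, \quad \vec{b} = \begin{bmatrix} 1 \\ 3 \end{bmatrix} \qquad (2.11)$$

Solving this linear system using an off-the-shelf linear system solver gives the following solution:

$$\vec{x} = \begin{bmatrix} 5/3 \\ 7/3 \end{bmatrix} \qquad (2.12)$$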
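The construction of the linear system from the netlist can be sketched as follows. This is a minimal sketch for two-pin connections with unit weight in one dimension, not the placer described in Chapter 3: the function name `solve_quadratic_1d`, the block names and the use of exact fractions with Gauss-Jordan elimination are assumptions made for illustration only.

```python
from fractions import Fraction


def solve_quadratic_1d(movable, fixed, connections):
    """Minimise the sum of squared two-pin connection lengths in 1-D.

    movable: list of block names; fixed: {name: coordinate};
    connections: list of (block, block) pairs with unit weight.
    """
    index = {name: k for k, name in enumerate(movable)}
    n = len(movable)
    A = [[Fraction(0)] * n for _ in range(n)]
    b = [Fraction(0)] * n
    for u, v in connections:
        for p, q in ((u, v), (v, u)):
            if p in index:                       # row of a movable block
                A[index[p]][index[p]] += 1
                if q in index:
                    A[index[p]][index[q]] -= 1   # movable neighbour
                else:
                    b[index[p]] += fixed[q]      # fixed neighbour
    # Gauss-Jordan elimination, exact thanks to Fraction arithmetic.
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * c for a, c in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return {name: b[index[name]] / A[index[name]][index[name]]
            for name in movable}


# The netlist of Figure 2.11: fixed blocks at x = 1 and x = 3.
solution = solve_quadratic_1d(
    ["A", "B"], {"left": 1, "right": 3},
    [("left", "A"), ("A", "B"), ("B", "right")])
```

For this netlist the sketch assembles exactly the matrix and vector of Equation 2.11 and returns XA = 5/3 and XB = 7/3, in agreement with Equation 2.12.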

Note that if the problem in the y-direction had not been trivial, it could have been solved in exactly the same way as the problem in the x-direction. Also note that, no matter how complex the netlist to place is, the problems in the x- and y-direction are always completely independent of each other. As a consequence both problems can be solved at the same time, leading to a significant reduction of the computation time.

As mentioned before, we have now found the placement that minimizes the square wire-length, while in reality we have to minimize the wire-length itself. The reason why we use the square wire-length when constructing the linear system is that this always results in a symmetric and positive definite matrix. Very efficient methods exist to solve this kind of system (e.g. the conjugate gradient method), while solving methods for general linear systems require much more computation time.
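To make the reference to the conjugate gradient method concrete, the following minimal sketch implements the textbook (unpreconditioned) conjugate gradient iteration for a dense symmetric positive definite matrix. It is not the solver used in this work; the function name and the tolerance are assumptions, and a practical placer would use a sparse matrix format and typically a preconditioner.

```python
def conjugate_gradient(A, b, tol=1e-12, max_iter=1000):
    """Solve A x = b for a symmetric positive definite matrix A."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                      # residual b - A x for the start x = 0
    p = list(r)                      # first search direction
    rs_old = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs_old < tol:             # squared residual norm small enough
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x


# The system of Equation 2.11.
x = conjugate_gradient([[2.0, -1.0], [-1.0, 2.0]], [1.0, 3.0])
```

In exact arithmetic the method needs at most as many iterations as there are unknowns, so for the two-by-two system of Equation 2.11 it reaches XA = 5/3 and XB = 7/3 after two iterations, up to rounding.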

It is clear from the reasoning above that the solution obtained by solving the linear system can be any collection of rational numbers. We should keep in mind that we are placing circuit blocks on an FPGA, which is inherently discrete. This means that only integer block coordinates correspond to a legal placement. Furthermore, the linear solver might place several blocks on exactly the same location. As a consequence we cannot just take the solution given by the linear solver as a final result: we first have to legalize the solution. After legalizing the solution, however, we do not necessarily end up with a perfect placement. Legalization will most likely deteriorate the circuit's placement. This is especially the case when a lot of overlap has to be resolved by the legalizer. Because of this we have to solve the linear system again, but this time while keeping the previous solution in mind. After a certain number of iterations the point is reached where no further improvement can be obtained. At this point the analytical placer will typically not have reached a sufficiently good placement, although it will be close. Therefore a final refinement stage is typically performed. This can, for example, be a low temperature and low effort simulated annealing stage.

As became clear from the discussion above, analytical placement is not as simple as just solving a system of equations and taking the solution for granted. Therefore we have devoted Chapter 3 entirely to describing the analytical placer that was built as part of this thesis.

2.3.4 The evolution of placement algorithms

FPGA tool vendors have used simulated annealing based placement algorithms for a very long time. Simulated annealing has the advantage that it can be proven that the algorithm always converges to the optimal solution when given enough time [7]. The importance of this statement should not be overestimated, because the time needed to find this optimal solution quickly becomes very large. This brings us to the biggest disadvantage of simulated annealing: poor run-time scaling. To illustrate this disadvantage we have examined the relationship between run-time and problem size for a publicly available academic simulated annealing based placer. This SA placer is part of the Versatile Place and Route (VPR) tool [1]. The benchmarks used to conduct this experiment were part of the Verilog-to-Routing (VTR) benchmark suite [11]. The VPR placer performed a wire-length-driven, high effort (inner_num = 10.0) placement. The results of the experiment are shown in Figure 2.12. The horizontal axis represents the problem size, while the vertical axis represents the placement run-time. Both axes use a logarithmic scale. The blue diamonds indicate the measurement results. The dark red line indicates the power-law fit through these measurement points. The mathematical result of this fit is shown in Expression 2.13. In this expression n denotes the problem size (number of movable blocks), while t represents the run-time in seconds.

$$t = 0.00013366 \cdot n^{1.64} \qquad (2.13)$$

It is clear from Expression 2.13 that the run-time scales faster than linearly with the problem size. This is especially a problem when considering very large problem sizes. When FPGAs were first introduced on the market they were really small compared to today's devices. Because of this the superlinear run-time scaling was not really a problem back then. But as the size of the devices increased with time, the run-time of circuit placement rose to unacceptable levels. Because of this, FPGA tool vendors started to migrate towards analytical placement algorithms.
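The consequence of the exponent 1.64 in Expression 2.13 can be checked with a short calculation. The sketch below only evaluates the fitted expression; the function name and the example problem sizes are hypothetical and the numbers are extrapolations of the fit, not measurements.

```python
def sa_runtime_seconds(n_blocks):
    """Fitted run-time of VPR's SA placer according to Expression 2.13."""
    return 0.00013366 * n_blocks ** 1.64


# Doubling the problem size multiplies the predicted run-time by 2**1.64.
ratio = sa_runtime_seconds(20000) / sa_runtime_seconds(10000)
```

According to the fit, doubling the number of movable blocks therefore increases the predicted run-time by a factor of roughly 3.1 instead of 2.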

Figure 2.12: The run-time of VPR's SA placer versus the problem size (number of movable blocks) on a log-log chart

Analytical placers outperform simulated annealing based placers in terms of run-time. This will be illustrated in Chapter 4. But analytical placers have a different disadvantage: their memory usage is higher than that of simulated annealing. This is especially a problem when very large circuits have to be placed. As already mentioned in Section 2.3.3, analytical placers solve a system of equations in order to find an acceptable placement. In our case this system of equations is linear, and thus can be represented using a matrix and a vector. This matrix is always square and the number of rows (columns) equals the number of movable blocks. Luckily this matrix is sparse, so we only have to store the non-zero entries in memory. When performing analytical placement we always have two such matrices in memory: one storing the problem in the x-direction and one storing the problem in the y-direction. To illustrate the difference in memory usage with simulated annealing we measured the memory used for storing the x- and y-matrices when using analytical placement on the bgm and stereovision2 benchmarks, which are part of the VTR benchmark suite. This was done using the bound-to-bound net model (see Section 3.2.2 for more information about net models). The performed placement was wire-length-driven. The Compressed Row Storage (CRS) format was used to store the sparse matrices in memory. This experiment was conducted using the analytical placer that was developed as part of this thesis. We compare these numbers with the memory usage of the working data during wire-length-driven simulated annealing. The simulated annealing experiment was conducted using our own simulated annealing placer. The results of the experiment are shown in Table 2.2.

Benchmark       Analytical   SA
bgm             12 MB        3.70 MB
stereovision2   7.62 MB      4.64 MB

Table 2.2: Memory usage of matrices in analytical placement versus working data memory usage in simulated annealing placement

When looking at the results presented in Table 2.2 it is immediately clear that the memory usage during analytical placement is higher than during simulated annealing. Another (at first sight rather surprising) fact that immediately stands out is that for simulated annealing the memory usage of the stereovision2 benchmark is highest, while for analytical placement the memory usage of the bgm benchmark is highest. This can easily be explained. For simulated annealing the memory usage only depends on the number of nets that are present in the circuit. For analytical placement, on the other hand, several factors determine memory usage: the number of blocks to place, the number of nets in the circuit and the number of different blocks that are connected by these nets. Especially the number of different blocks connected by a net is an important factor. This explains the difference in memory usage observed in Table 2.2.
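The Compressed Row Storage format mentioned above can be sketched as follows. This is a minimal sketch of the general CRS idea, not the data structure of the placer developed in this work; the function names are hypothetical and the dense input matrix is only used to keep the example short.

```python
def to_crs(dense):
    """Convert a dense matrix (list of rows) into the three CRS arrays."""
    values, col_index, row_ptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:               # only non-zero entries are stored
                values.append(value)
                col_index.append(col)
        row_ptr.append(len(values))      # where the next row starts
    return values, col_index, row_ptr


def crs_matvec(values, col_index, row_ptr, x):
    """Multiply a CRS matrix with a vector x."""
    return [sum(values[k] * x[col_index[k]]
                for k in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(len(row_ptr) - 1)]


values, col_index, row_ptr = to_crs([[2, -1, 0],
                                     [-1, 2, -1],
                                     [0, -1, 2]])
product = crs_matvec(values, col_index, row_ptr, [1, 1, 1])
```

The memory needed grows with the number of non-zero entries plus one row pointer per row, which is why the number of different blocks connected by each net has such a strong influence on the memory usage of the matrices.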

Chapter 3

Analytical placement

After introducing analytical placement in the previous chapter, this chapter describes analytical placement for FPGAs in more detail. More specifically, previous work dealing with analytical placement for FPGAs is discussed and the analytical placer developed as part of this thesis is described in detail.

3.1 Previous work

Analytical placers for ASICs have been extensively researched in the past. These analytical ASIC placers are often divided into two categories, based on the form of the system of equations that is built and solved. These two categories are quadratic placers and non-linear placers. The placer that was developed as part of this work falls under the category of quadratic placers: the function to minimize is a sum of squares and the resulting system of equations that is solved is linear. Non-linear analytical placers build a non-linear system (as the name suggests) and thus need different and more complex tools to solve the system of equations compared to quadratic analytical placers. This non-linear system of equations can contain log-sum-exp functions, for example.

Obviously ASIC placers are not the focus of this work, but we will nevertheless mention two analytical ASIC placers that were of great importance for the FPGA analytical placers on which this thesis is based. Kraftwerk2 by Spindler et al. [14] is a quadratic analytical placer that introduced the bound-to-bound net model (see Section 3.2.2). This net model is considered to be the current state of the art in quadratic analytical placers and was also used in the FPGA analytical placer on which this work is mainly based. SimPL by Kim et al. [8] introduced a new way of solving the problem of circuit element overlap introduced by the linear solver. The placer always keeps track of two solutions: a non-legal solution obtained by solving the linear system of equations and an almost completely legal solution obtained by legalizing the solution of the linear system. The algorithm performs multiple iterations in which new solved (non-legal) and legalized solutions are obtained. The solved solution is gradually spread (the amount of overlap present in it is reduced) over the consecutive iterations of the algorithm by adding so-called pseudo connections to each circuit element. These pseudo connections connect the circuit element with its legalized position (the position of the element in the best legalized solution found so far). By gradually increasing the weights associated with these pseudo connections the linear system solution is gradually spread and steadily approaches the legalized solution. Over the different iterations of the algorithm better legalized solutions are found as the linear system solver takes the information provided by the pseudo connections into account. In this way the placement costs of the solved and legalized solutions gradually converge towards each other. This methodology for resolving circuit element overlap was also used in the FPGA analytical placer on which this thesis is mostly based. More information about pseudo connections can be found in Section 3.2.4.
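The effect of pseudo connections on the linear system can be sketched with the two-block example of Equation 2.11. This is only a minimal sketch under stated assumptions: a pseudo connection of weight w between block i and its legalized position is assumed to add the term w · (X_i − X_i,legal)² to the cost, which adds w to the diagonal entry and w · X_i,legal to the right-hand side. The exact weighting used by SimPL and by the placer of this thesis is described in Section 3.2.4 and may differ; the function names and the legalized positions are hypothetical.

```python
def add_pseudo_connections(A, b, legal, weight):
    """Return a copy of the system with one pseudo connection per block."""
    A2 = [row[:] for row in A]
    b2 = b[:]
    for i, x_legal in enumerate(legal):
        A2[i][i] += weight            # from d/dX_i of w * (X_i - x_legal)^2
        b2[i] += weight * x_legal
    return A2, b2


def solve_2x2(A, b):
    """Cramer's rule, sufficient for this two-block illustration."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


A = [[2.0, -1.0], [-1.0, 2.0]]        # system of Equation 2.11
b = [1.0, 3.0]
legal = [2.0, 3.0]                    # hypothetical legalized x-positions

no_anchor = solve_2x2(*add_pseudo_connections(A, b, legal, 0.0))
strong_anchor = solve_2x2(*add_pseudo_connections(A, b, legal, 1000.0))
```

With weight 0 the original solution (5/3, 7/3) is recovered, while a very large weight pulls the solved positions almost onto the legalized positions, which illustrates how increasing the weights makes the solved solution approach the legalized one.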

As mentioned before, analytical ASIC placers have been widely researched over the past couple of decades. Unfortunately, much less work can be found on analytical FPGA placers.

The oldest work known to the author is QPF (Quadratic Placement tool for FPGAs), which was developed by Xu et al. [20]. QPF is a wire-length-driven quadratic analytical placer. It can only work with homogeneous architectures and results were only published for the rather small MCNC benchmarks. The placer uses a low temperature simulated annealing run to refine the final placement. The published results were promising: an average speed-up of 5.8× was claimed compared to VPR's simulated annealing placer. The average wire-length increase compared to VPR was 2%.

Gopalakrishnan et al. developed a timing-driven analytical placer for FPGAs named CAPRI [5]. The placer was made architecture aware in order to minimize the critical path delay after routing. A low temperature simulated annealing run is used to refine the quality of the final placement. CAPRI achieved a 2× speed-up on average compared to VPR, while the critical path delay decreased by 10% on average. CAPRI is a homogeneous placer and the results mentioned above were obtained using the MCNC benchmarks, which are rather small by today's standards.

Bian et al. have adapted several ASIC placers (mPL, Capo and FastPlace) for use in a more FPGA-like context. More concretely, the placers were used to place a netlist of equally sized cells where each cell mimics a CLB of an FPGA. So the placers were still placing cells on an ASIC, but the circumstances were adapted in such a way that it looked very much like placing CLBs on a homogeneous FPGA. In this way they evaluated the performance of these standard ASIC analytical placers when used in an FPGA-like environment [3]. In addition they developed a new detailed placer based on maximum bipartite matching. All placers considered during the experiments were wire-length-driven. Two different sets of benchmarks were used during the experiments: an ASIC benchmark suite provided by IBM on the one hand, and a collection of benchmarks from the IWLS benchmark suite and Altera's QUIP benchmark suite on the other hand. The study showed that FastPlace clearly outperforms mPL and Capo in terms of run-time in this context. Moreover, FastPlace was an order of magnitude faster than VPR, but the resulting wire-length was rather disappointing with a 20% to 50% increase on average, depending on the exact circumstances. The other two analytical placers were not significantly faster than VPR but, unlike FastPlace, matched VPR in terms of resulting wire-length.

Starplace is a non-linear analytical FPGA placer developed by Xu et al. [19]. The paper in which starplace is presented makes two important contributions to placement on FPGAs. First of all, it introduces a new wire-length function called star+ (as a substitute for HPWL). As already mentioned in Section 2.3.3, quadratic analytical placers minimize the square wire-length instead of the HPWL itself because this results in a linear system that is easier to solve. Unlike HPWL, star+ can be minimized directly, which is a major advantage. It has to be noted that minimizing the star+ wire-length function results in a non-linear system of equations, which is inherently harder to solve than a linear system of equations. Star+ has the additional advantage that, when used in simulated annealing, the cost change of a net caused by a change in position of one of the blocks connected by the net can be calculated in O(1) time, which is not the case when using HPWL. The second contribution by Xu et al. is the development of an analytical placer using the star+ wire-length function. Because of the non-linearity of the star+ wire-length function starplace is also a non-linear analytical placer. No refinement was done to improve the final placement. When compared to VPR run in fast mode, starplace resulted in an average speed-up of slightly more than 4×, while the critical path delay was on average 8.7% less. These results were obtained using the MCNC benchmarks. Like all analytical FPGA placers discussed so far, starplace can only handle homogeneous FPGAs. Starplace is also wire-length-driven, as it does not include timing information in its problem formulation.

The only analytical placer known to the author that can handle heterogeneous FPGAs is HeAP (Heterogeneous Analytical Placer), which was developed by Gort et al. [6]. HeAP is strongly based on the quadratic analytical ASIC placer SimPL and was the biggest inspiration for the analytical placer developed as part of this thesis. Although Gort et al. claim to also have implemented a timing-driven variant of HeAP, only the wire-length-driven version was accurately described in [6]. Because of this, HeAP is generally considered to be wire-length-driven only. HeAP does a final placement refinement using a greedy random block swapping algorithm. Experimental results indicate that HeAP runs 7.4× faster and ends up with 6% less wire-length on average compared to VPR being run in fast mode. These results were obtained using the Quartus University Interface Program (QUIP) and CHStone benchmark suites.

The most recent analytical FPGA placer to our knowledge is the non-linear analytical placer developed by Lin et al. [10]. This placer can only handle homogeneous FPGAs, but is fully timing-driven. It is also a multilevel placer, which means that at various stages the blocks are clustered in order to reduce the problem size temporarily. The final placement is refined using a low temperature simulated annealing run. The authors claim to achieve an average speed-up of 6.91× compared to VPR's timing-driven placer. The average wire-length is comparable to VPR's placement result, but the critical path delay is claimed to be 5% less. These results were obtained by using the MCNC benchmarks, together with two larger benchmark circuits which were generated by the authors.

3.2 Analytical placer details

In this section the analytical placer that was built as part of this thesis will be described in depth. Figure 3.1 shows the top level flowchart of the analytical placer. Note that the black circle indicates the starting point of the work flow, while the smaller encircled black circle indicates the end of the work flow. The different steps of the algorithm will be explained in more detail in the following subsections.


Figure 3.1: Top level flowchart of the analytical placer built as part of this thesis

3.2.1 Building the linear system

In section 2.3.3 we have already briefly discussed how the linear system is built. We will build further upon this in this section. For the simple netlist shown in Figure 2.11 we ended up with expression 2.8 for the square wire-length. As a matter of convenience we repeat this equation here:

$$\Phi_X = (X_A - 1)^2 + (X_A - X_B)^2 + (X_B - 3)^2 \tag{3.1}$$

As we said in section 2.3.3, minimizing equation 3.1 by taking partial derivatives and setting them to zero minimizes the square wire-length instead of the wire-length itself. We have to minimize the square wire-length because minimizing the wire-length itself by taking partial derivatives would not result in a meaningful linear system. This problem can be solved by adding constant weight factors before every term in expression 3.1:

$$\Phi_X = w_1 \cdot (X_A - 1)^2 + w_2 \cdot (X_A - X_B)^2 + w_3 \cdot (X_B - 3)^2 \tag{3.2}$$

The only constraint on these weight factors is that they have to be constant. In timing-driven analytical placement, for example, these weight factors can be used to incorporate timing information in the linear system formulation. But we can also use them as a trick to minimize the wire-length instead of the square wire-length. This can be done by choosing the weight factors as follows:

$$w_1 = \frac{1}{|X'_A - 1|}, \qquad w_2 = \frac{1}{|X'_A - X'_B|}, \qquad w_3 = \frac{1}{|X'_B - 3|} \tag{3.3}$$

It is important to note that because the weight factors need to be constants we have to use constant values for $X'_A$ and $X'_B$ in equation 3.3. To indicate this we have added apostrophes to the names of these constants. The use of the constant values $X'_A$ and $X'_B$ has two important consequences. First of all, this means that we need a random initial placement before solving the linear system, otherwise we have no values to fill in for $X'_A$ and $X'_B$. Secondly, because we use constant weight factors, solving the weighted linear system will not result in the exact solution with minimal wire-length. Instead we will only have an approximation. A more accurate approximation can be obtained by building and solving the linear system a second time, but now using the placement obtained in the first linear system solve as input. Doing this a number of times (typically five to seven times) will eventually result in a fairly accurate approximation of the placement that results in the minimal wire-length.
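The repeated build-and-solve procedure can be illustrated on the two-block example of equation 3.2. The following is only a minimal sketch: the function names, the starting placement and the small constant `eps` (which guards against a division by zero when two positions coincide) are assumptions made for illustration and are not taken from the placer itself.

```python
def solve_weighted(xa_prev, xb_prev, eps=1e-6):
    """One build-and-solve step for the two-block example (equation 3.2)."""
    # Constant weights computed from the previous placement (equation 3.3).
    # eps is an assumption of this sketch; it avoids a division by zero.
    w1 = 1.0 / max(abs(xa_prev - 1.0), eps)
    w2 = 1.0 / max(abs(xa_prev - xb_prev), eps)
    w3 = 1.0 / max(abs(xb_prev - 3.0), eps)
    # Setting the partial derivatives of equation 3.2 to zero gives
    #   (w1 + w2) * XA -        w2 * XB = w1 * 1
    #        -w2  * XA + (w2 + w3) * XB = w3 * 3
    a11, a12, a21, a22 = w1 + w2, -w2, -w2, w2 + w3
    b1, b2 = w1 * 1.0, w3 * 3.0
    # Solve the 2x2 system with Cramer's rule.
    det = a11 * a22 - a12 * a21
    xa = (b1 * a22 - a12 * b2) / det
    xb = (a11 * b2 - a21 * b1) / det
    return xa, xb


def place(xa0, xb0, iterations=7):
    """Repeat the weighted solve, feeding each result into the next weights."""
    xa, xb = xa0, xb0
    for _ in range(iterations):
        xa, xb = solve_weighted(xa, xb)
    return xa, xb


def wire_length(xa, xb):
    """Linear (not squared) wire-length of the example netlist."""
    return abs(xa - 1.0) + abs(xa - xb) + abs(xb - 3.0)


xa, xb = place(0.0, 5.0)  # arbitrary initial placement
print(xa, xb, wire_length(xa, xb))
```

For this tiny chain of connections every placement with both blocks between the fixed blocks has the same linear wire-length, so the sketch mainly shows the mechanics of reusing the previous solution in the weights.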

We will conclude this section by providing a quick way to build the linear system. Until now we have always written down the weighted expression for the square wire-length and then taken the partial derivatives of this equation. It is also possible to build the linear system directly from the netlist. This is done as follows. We start from a zero matrix $A$ and a zero vector $\vec{b}$. Every movable block $B$ gets an index $i_B$ for use in $A$ and $\vec{b}$. Fixed blocks don't get an index. Every net in the circuit is broken up into a collection of two-pin connections (see section 3.2.2 on how this is done). When we have a two-pin connection with connection weight $w_{AB}$ between movable blocks A and B we take the following steps:

• Add $w_{AB}$ to the element with indices $(i_A, i_A)$ in $A$

• Add $w_{AB}$ to the element with indices $(i_B, i_B)$ in $A$

• Subtract $w_{AB}$ from the element with indices $(i_A, i_B)$ in $A$

• Subtract $w_{AB}$ from the element with indices $(i_B, i_A)$ in $A$

When we have a two-pin connection with connection weight $w_{AF}$ between a movable block A and a fixed block F we take the following steps:

• Add $w_{AF}$ to the element with indices $(i_A, i_A)$ in $A$

• Add $w_{AF} \cdot x_F$, with $x_F$ the x-position of the fixed block, to the element with index $i_A$ in $\vec{b}$

When we have a two-pin connection between two fixed blocks this connection is discarded. We can do this because fixed connections cannot be optimized and thus don't add any information to the linear system.
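The three cases above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `build_system`, the use of block names as dictionary keys and the dictionary-based sparse matrix are assumptions, not the data structures of the actual placer.

```python
from collections import defaultdict


def build_system(movable_index, fixed_x, connections):
    """Build A (as a sparse dict) and b for the x-direction.

    movable_index: dict, movable block name -> index in A and b
    fixed_x:       dict, fixed block name -> x-position
    connections:   list of (block_1, block_2, weight) two-pin connections
    """
    A = defaultdict(float)            # only non-zero elements are stored
    b = [0.0] * len(movable_index)
    for p, q, w in connections:
        p_movable = p in movable_index
        q_movable = q in movable_index
        if p_movable and q_movable:
            i, j = movable_index[p], movable_index[q]
            A[(i, i)] += w
            A[(j, j)] += w
            A[(i, j)] -= w
            A[(j, i)] -= w
        elif p_movable or q_movable:
            movable, fixed = (p, q) if p_movable else (q, p)
            i = movable_index[movable]
            A[(i, i)] += w
            b[i] += w * fixed_x[fixed]
        # Connections between two fixed blocks are discarded.
    return dict(A), b


# The netlist of equation 3.1 with all weights equal to one; the block
# names "F1" and "F2" for the fixed blocks are hypothetical.
A, b = build_system(
    {"A": 0, "B": 1},
    {"F1": 1.0, "F2": 3.0},
    [("F1", "A", 1.0), ("A", "B", 1.0), ("B", "F2", 1.0)],
)
print(A, b)
```

With unit weights this reproduces the system obtained from the partial derivatives of equation 3.1: $2X_A - X_B = 1$ and $-X_A + 2X_B = 3$.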


3.2.2 Net models

The simple placement problem that was solved as an example in section 2.3.3 only contained two-pin nets. As became clear in that section, it is very straightforward to represent these kinds of nets in a linear system. When a net contains more than two pins, the net has to be broken down into several two-pin connections. There are several possibilities on how to do this. The method used to break a net down into two-pin connections is referred to as the net model. These net models will be explained in more detail in this section. The net that will be used as reference during this discussion is represented in Figure 3.2. This net contains two pins connected to fixed blocks and three pins connected to movable blocks. We will now consider four different net models: the clique net model, the star net model, the bound-to-bound net model and the source-sink net model.

Figure 3.2: A net containing five pins in total (two fixed and three movable)

Clique net model

The clique net model was the net model that was originally used in the first quadratic analytical placers for ASICs that were described in literature. In the clique net model every two-pin connection that can be considered in the net is added to the linear system. This is graphically represented in Figure 3.3. Every double headed arrow in this figure represents a two-pin connection that is added to the linear system.

When N represents the number of pins in a net, the number of two-pin connections that is used to represent this net using the clique net model is given by the following expression:

$$\#\text{connections} = \frac{N \cdot (N - 1)}{2} \tag{3.4}$$

Figure 3.3: Graphical representation of the clique net model

Note that equation 3.4 also includes connections between two fixed pins. As these kinds of connections are not actually added to the linear system (because they don't add any information), equation 3.4 possibly overestimates the number of connections. When we know the number of fixed pins $N_f$ in the net, a more precise expression can be found for the number of two-pin connections used to represent the net using the clique net model:

$$\#\text{connections} = \begin{cases} \dfrac{N \cdot (N - 1)}{2}, & N_f < 2 \\[2ex] \dfrac{N \cdot (N - 1)}{2} - \dfrac{N_f \cdot (N_f - 1)}{2}, & N_f \geq 2 \end{cases} \tag{3.5}$$

As became clear in section 3.2.1, every two-pin connection that is added to the linear system formulation possibly adds non-zero elements to the matrix. When we consider the net shown in Figure 3.3 to be the complete circuit to place, the matrix $A$ will take the general form shown in equation 3.6. In the matrix $A$, column and row index 1 correspond with element A (see Figure 3.3), index 2 corresponds with element B and index 3 corresponds with element C. An X in the matrix $A$ corresponds with a non-zero element, while a 0 corresponds with a zero element. Note that when using the clique net model, all matrix elements involved in the net are non-zero.

$$A = \begin{bmatrix} X & X & X \\ X & X & X \\ X & X & X \end{bmatrix} \tag{3.6}$$

The more non-zero elements are present in the matrix, the bigger the memory use will be, as typically a sparse matrix format is used to store the matrix in memory. Additionally, more non-zero elements in the matrix will result in a longer computation time when solving the linear system. For these two reasons the clique net model is not ideal, especially for large nets.


Star net model

The star net model was first introduced in the ASIC macro-cell placer described in [13]. In the star net model a virtual star pin is added to the net. It is important to note that this additional star pin also needs to be placed, and thus leads to the introduction of additional variables in the linear system. After introducing the additional star pin to the net, every pin in the net is connected to this star pin. This is graphically represented in Figure 3.4. Again, every double headed arrow represents a two-pin connection that is added to the linear system.

Figure 3.4: Graphical representation of the star net model

When N represents the number of pins of a net, the number of two-pin connections that is used to represent this net using the star net model is given by the following expression:

$$\#\text{connections} = N \tag{3.7}$$

Note that a two-pin connection in the star net model never connects two fixed pins with each other because the star pin is by definition not fixed. As a result, equation 3.7 is always valid, independently of the number of fixed pins present in the net. When we consider the net shown in Figure 3.4 to be the complete circuit to be placed, we can once again write down the general form of the matrix $A$ (see equation 3.8), just as we did for the clique model. We use the same conventions as we did for equation 3.6, with the exception that the row and column with index 4 now represent the virtual star pin. Note that, as opposed to the clique model, the star model results in more zero elements in $A$. The star model introduces additional variables. This immediately becomes clear when looking at equation 3.8: the matrix has four rows and columns as opposed to three rows and columns in equation 3.6 for the clique net model. For every star pin two additional variables are introduced (one for the problem in x-direction and one for the problem in y-direction). This increases the problem complexity, and thus can be considered as a disadvantage of the star net model.

$$A = \begin{bmatrix} X & 0 & 0 & X \\ 0 & X & 0 & X \\ 0 & 0 & X & X \\ X & X & X & X \end{bmatrix} \tag{3.8}$$

When a net connects a high number of pins, it is clear that the number of two-pin connections will be a lot lower when using the star net model than when using the clique net model. For small nets the opposite holds: a two-pin net, for example, is more efficiently represented using the clique net model (one connection instead of two). For three-pin nets the clique net model and star net model result in the same number of two-pin connections (three), but the star net model has the overhead of the additional star pin. Because of this observation, several analytical ASIC placers have adopted a hybrid net model in the past. In this hybrid net model, two-pin and three-pin nets are represented using the clique net model, while all other nets are represented using the star net model [16].

Bound-to-bound net model

The bound-to-bound net model (also referred to as the b2b model) was first introduced in the quadratic analytical ASIC placer Kraftwerk 2 [14]. It is considered to be the current state of the art in quadratic analytical placers. To determine which two-pin connections will be used to represent the net we first identify the boundary pins of the net. The boundary pins are defined as the pins with the highest or lowest coordinate (x-coordinate if we are considering the problem in the x-direction, y-coordinate if we are considering the problem in the y-direction). All other pins in the net are defined as inner pins. In the bound-to-bound net model, we only consider the two-pin connections in the net which involve at least one boundary pin. This is illustrated in Figure 3.5. All inner connections (i.e. connections between two inner pins of the net) are not considered. The reason for doing so is that inner connections do not contribute to the half perimeter wire-length (HPWL), which is one of the main criteria used to evaluate a circuit placement. It can actually be proven that when using the correct weight for every two-pin connection, the total minimized expression is an exact representation of the HPWL [14]. For a connection between circuit elements A and B this weight can be written as follows:

$$w^{B2B}_{AB} = \frac{2}{P - 1} \cdot \frac{1}{|x_A - x_B|} \cdot q(n) \tag{3.9}$$

In this expression P indicates the number of pins in the net. Note that this weight factor already includes the factor 2 coming from the derivation of the square wire-length expression. Next to that, the weight also contains a constant factor that makes the square wire-length expression linear (see section 3.2.1 for more details). Note that this is the weight factor when considering the problem in x-direction. When considering the problem in y-direction the x-coordinates should obviously be substituted by y-coordinates. Finally, q(n) is a factor depending on the number of pins n connected by the net. This factor is a compensation for the underestimation of the necessary wire-length after routing when only considering the bounding box of the net (see section 2.3.1 for more information about this factor).
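As a minimal sketch of how the bound-to-bound connections and the weights of equation 3.9 could be generated for the x-direction, consider the following Python function. The function name, the pin names, the clamping constant `eps` and the treatment of q(n) as a plain number supplied by the caller are all assumptions made for illustration.

```python
def b2b_connections(xs, q=1.0, eps=1e-6):
    """Return the bound-to-bound connections of one net in one direction.

    xs:  dict, pin name -> coordinate in the direction being solved
    q:   value of the correction factor q(n) for this net (assumed to be
         computed elsewhere; 1.0 means no correction)
    eps: assumed lower bound on the distance to avoid dividing by zero
    """
    names = sorted(xs, key=xs.get)          # pins ordered by coordinate
    p = len(names)
    if p < 2:
        return []
    low, high = names[0], names[-1]         # the two boundary pins
    pairs = [(low, high)]
    for inner in names[1:-1]:               # every inner pin connects to both
        pairs.append((low, inner))
        pairs.append((inner, high))
    connections = []
    for a, b in pairs:
        distance = max(abs(xs[a] - xs[b]), eps)
        weight = 2.0 / (p - 1) * q / distance   # equation 3.9
        connections.append((a, b, weight))
    return connections


for connection in b2b_connections({"A": 2.0, "B": 0.0, "C": 5.0, "D": 3.0}):
    print(connection)
```

Because the boundary pins depend on the current coordinates, the list of connections has to be regenerated for every direction and after every solve of the linear system.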


Figure 3.5: Graphical representation of the bound-to-bound net model

When all pins in the net are non-fixed, the number of two-pin connections in the net can be written as follows (N is the number of pins connected by the net):

$$\#\text{connections} = 1 + 2 \cdot (N - 2) \tag{3.10}$$

Note that when two or more of the pins in the net are fixed, some connections in the net might be between two fixed pins. The exact number of fixed connections depends on the type of pins that form the boundaries of the net. For this reason we will not write down an exact expression for the number of non-fixed connections in the bound-to-bound net model.

For a two-pin net the number of two-pin connections is the same as for the clique net model (one two-pin connection). For a three-pin net the number of two-pin connections is the same as for the clique net model and star net model (three two-pin connections), but just as with the clique net model the bound-to-bound net model does not have the overhead of introducing an additional star pin. For nets with four or more pins, the number of two-pin connections lies in between the number of two-pin connections of the star net model and the clique net model. Compared to the star net model the bound-to-bound net model has the additional advantage of exactly representing the HPWL.

We conclude this discussion of the bound-to-bound net model by writing down the general form of the matrix $A$ when we consider the net shown in Figure 3.5 to be the complete circuit to place, just as we did for the other net models discussed so far. To keep things fair we will assume that element B is on the left boundary of the net instead of the fixed element at X = 1. More precisely, we switch the positions of these two elements. We do this because the boundary pins are the most strongly connected pins of the net when using the bound-to-bound net model. Also, connections between a fixed and a non-fixed circuit element result in at most one additional matrix element, while connections between two non-fixed elements result in at most four additional matrix elements. This consideration eventually results in the following general form of $A$:

$$A = \begin{bmatrix} X & X & 0 \\ X & X & X \\ 0 & X & X \end{bmatrix} \tag{3.11}$$

Source-sink net model

The final net model that we discuss is the source-sink net model. This net model has never been used before and thus is a contribution made by this thesis. The use of the source-sink net model has the following goals:

• Reducing memory usage and increasing linear system solve speed by reducing the number of non-zero elements in the matrix $A$ compared to all net models used until now.

• Allowing for a more natural incorporation of timing information in the linear system formulation (more information about timing-driven analytical placement will be given in section 3.2.6).

• Reducing the time needed to construct the linear system compared to the bound-to-bound net model.

In the source-sink net model only the two-pin connections between the source of the net and the sinks of the net are considered. This is illustrated in Figure 3.6. In this figure circuit element B is considered to be the source of the net.

Figure 3.6: Graphical representation of the source-sink net model

Just as with the bound-to-bound net model we associate a weight with every two-pin connection that is added to the linear system of equations. Unfortunately we have not found a way to tune the weights in such a way that we are exactly representing the HPWL. The weight factor shown in expression 3.12 has experimentally proven to work best. This is the weight associated with a connection between circuit elements A and B. Note that we divide by the number of pins P in the net minus one, just as was done in the bound-to-bound net model. In the bound-to-bound net model this factor is one of the tricks to get an exact representation of the HPWL. In the source-sink net model this division is done because the more pins are present in the net, the more internal two-pin connections (two-pin connections that don't involve a pin that is lying on the bounding box) are added to the linear system. These connections don't contribute to the HPWL. The division by the number of pins in the net minus one decreases the influence of these connections.

$$w^{S\text{-}S}_{AB} = \frac{2}{P - 1} \cdot \frac{1}{|x_A - x_B|} \tag{3.12}$$

The number of two-pin connections used to represent a net connecting N non-fixed pins can be written as follows:

$$\#\text{connections} = N - 1 \tag{3.13}$$

When two or more pins in the net are fixed, the exact number of two-pin connections that is actually incorporated in the linear system is less easy to write down. This is because it depends on whether the source element is fixed or not. We can find an exact expression though ($N_f$ represents the number of fixed pins in the net):

$$\#\text{connections} = \begin{cases} N - 1, & \text{source pin not fixed} \\ N - N_f, & \text{source pin fixed} \end{cases} \tag{3.14}$$
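The source-sink net model and the connection counts of equation 3.14 can be sketched as follows. This is only an illustration: the function name, the pin names, the coordinates of the example net and the clamping constant `eps` are assumptions and not part of the placer described here.

```python
def source_sink_connections(xs, source, fixed=frozenset(), eps=1e-6):
    """Return the source-sink connections of one net in one direction.

    xs:     dict, pin name -> coordinate in the direction being solved
    source: name of the pin that drives the net
    fixed:  set of names of fixed pins
    eps:    assumed lower bound on the distance to avoid dividing by zero
    """
    p = len(xs)
    connections = []
    for sink in xs:
        if sink == source:
            continue
        if source in fixed and sink in fixed:
            continue  # connections between two fixed pins are discarded
        distance = max(abs(xs[source] - xs[sink]), eps)
        weight = 2.0 / (p - 1) / distance      # equation 3.12
        connections.append((source, sink, weight))
    return connections


# A five-pin net with two fixed pins, loosely modelled on Figure 3.2;
# the coordinates are made up for the example.
net = {"F1": 1.0, "A": 3.0, "B": 4.0, "C": 6.0, "F2": 9.0}
fixed_pins = {"F1", "F2"}
print(len(source_sink_connections(net, "B", fixed_pins)))   # N - 1
print(len(source_sink_connections(net, "F1", fixed_pins)))  # N - N_f
```

Unlike in the bound-to-bound net model, the set of connections does not depend on the current coordinates, so only the weights have to be recomputed between solves.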

To conclude this section we will write down the general form of the matrix $A$ when we consider the net shown in Figure 3.6 to be the complete circuit to place. As mentioned before, we consider circuit element B to be the source of the net.

$$A = \begin{bmatrix} X & X & 0 \\ X & X & X \\ 0 & X & X \end{bmatrix} \tag{3.15}$$

Note that in this case the general form of $A$ is the same as the one written down while discussing the bound-to-bound net model. This is a pure coincidence. Expression 3.14 clearly indicates that in general the number of two-pin connections used to represent the net using the source-sink net model will be lower than when using the bound-to-bound model (see equation 3.10). Because of this, the number of non-zero elements in $A$ will also typically be lower when using the source-sink net model than when using the bound-to-bound net model. In fact, the number of non-zero elements when using the source-sink net model will in general even be lower than when using the star net model (see equation 3.7).
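To make the comparison between the four net models concrete, the connection counts of equations 3.4, 3.7, 3.10 and 3.13 can be tabulated for a few net sizes. The short sketch below assumes nets without fixed pins; the function names are chosen for illustration only.

```python
def clique_count(n):
    return n * (n - 1) // 2      # equation 3.4


def star_count(n):
    return n                     # equation 3.7


def b2b_count(n):
    return 1 + 2 * (n - 2)       # equation 3.10


def source_sink_count(n):
    return n - 1                 # equation 3.13


print("pins clique star b2b source-sink")
for n in (2, 3, 4, 10, 50):
    print(n, clique_count(n), star_count(n), b2b_count(n), source_sink_count(n))
```

The clique count grows quadratically with the number of pins, while the other three grow linearly; the source-sink count is always one less than the star count and never exceeds the bound-to-bound count.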


A more complex example using the bound-to-bound net model

The example that was solved in section 2.3.3 and further built upon in section 3.2.1 only contained two-pin nets. We will now solve a more complex example using the bound-to-bound net model. This example contains multiple nets which connect more than two pins and thus need a net model in order to be represented. The circuit to be placed is shown in Figure 3.7.

Figure 3.7: More complex example circuit to be placed

When we build and solve the linear system a couple of times in order to approximate the circuit placement resulting in the minimal total wire-length as closely as possible, we end up with the following solution:

$$\vec{x} = \begin{bmatrix} 5.82 \\ 5.82 \\ 5.82 \\ 5.82 \end{bmatrix}, \qquad \vec{y} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \tag{3.16}$$

It is immediately clear that the linear system solution is the same for all movable blocks. This solution is graphically represented with a blue dotted line in Figure 3.8.

Figure 3.8: Linear solution (indicated in blue dotted line) of the more complex example


The solution that was obtained by solving the linear system is clearly not a legal one. This is because the four movable blocks are all placed right on top of each other, but also because the x-position of these blocks is not an integer value. Because of this we will have to legalize the linear system solution. The legalization method that was used in this thesis is explained in the following section.

To conclude this section we note that it is actually quite logical that the linear solver places all blocks on top of each other. When all blocks connected by a certain net are placed on top of each other, the necessary wire-length for this net is zero. This is of course the minimal value possible, and is thus the best possible solution. The linear solver does not know anything of our desire to place blocks on integer locations which do not overlap because we have not included this information in the linear problem formulation.

3.2.3 Legalization

The legalization method that was used in the analytical placer developed as part of this thesis uses a recursive partitioning algorithm. A couple of the existing analytical FPGA placers mentioned in section 3.1 use this algorithm or a close variant of it to legalize the solution generated by solving the system of equations. Among these are HeAP [6], Starplace [19] and the non-linear analytical placer presented in [10]. The method described in this section is most closely related to the legalization method used in HeAP. We will first go over the details of this legalization method for a homogeneous FPGA. After this we will look into the special considerations that need to be taken when using the algorithm on a heterogeneous FPGA.

Legalization for a homogeneous FPGA

When we legalize the linear system solution on a homogeneous FPGA we only have one type of block: CLBs (note that this is only the case if we start from a known and fixed placement of IO blocks; when IO blocks also have to be placed the FPGA should be considered as heterogeneous). This means that the legalized solution only has to fulfil two requirements in order to be considered as legal:

• All blocks need to have an integer location within the boundaries of the FPGA.

• No two blocks should have the same integer x- and y-coordinate.

A flowchart of the legalization algorithm is shown in Figure 3.9. The algorithm starts by picking a random block that has not been legalized yet. The integer x- and y-coordinates (within the boundaries of the FPGA) that are closest to the position of the block given by the linear solver are calculated. Next we identify all other not-yet-legalized blocks that would map to the same discretized integer x- and y-coordinates. If no such blocks are found we can easily legalize the selected block by giving it this integer x- and y-coordinate. If we do find such blocks we have identified an over-utilized area: the area considered only has space for one block, but more than one block would ideally be placed on this location. The next step is then to check if we can cluster more such over-utilized areas which are bordering each other. Figure 3.10a shows such a clustered over-utilized area (blue dotted line). The FPGA's homogeneous grid is drawn in black. The not-yet-legalized blocks are drawn as red squares on the position given by solving the linear system of equations.

Figure 3.9: Flowchart of the legalization algorithm on homogeneous FPGAs
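The detection of over-utilized locations described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name, the block names, the 10 × 10 grid and the use of Python's built-in `round` for finding the closest integer coordinate are all choices made for this example, and the clustering of neighbouring over-utilized locations is left out.

```python
from collections import defaultdict


def find_overutilized(positions, width, height):
    """Group not-yet-legalized blocks by their closest grid location.

    positions: dict, block name -> (x, y) as given by the linear solver
    Returns a dict mapping every grid location that attracts more than
    one block to the list of blocks that map to it.
    """
    cells = defaultdict(list)
    for name, (x, y) in positions.items():
        # Closest integer location, clamped to the boundaries of the FPGA.
        cx = min(max(round(x), 0), width - 1)
        cy = min(max(round(y), 0), height - 1)
        cells[(cx, cy)].append(name)
    return {cell: names for cell, names in cells.items() if len(names) > 1}


# The linear solution of equation 3.16; the block names are hypothetical.
solution = {"A": (5.82, 0.0), "B": (5.82, 0.0), "C": (5.82, 0.0), "D": (5.82, 0.0)}
print(find_overutilized(solution, width=10, height=10))
```

For the solution of equation 3.16 all four blocks map to the same location, so a single over-utilized location with an occupancy of four is reported.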

The clustered over-utilized area shown in Figure 3.10a contains 17 different blocks to place, but only contains 4 valid positions. We define the utilization U of an area as its occupancy divided by its capacity. Obviously a utilization greater than one indicates that the area is over-utilized. The utilization of the area surrounded by the blue dotted line in Figure 3.10a can be calculated as follows:

$$U_{area} = \frac{O_{area}}{C_{area}} = \frac{17}{4} = 4.25 \tag{3.17}$$

(a) Before legalization (b) After legalization

Figure 3.10: A clustered over-utilized area (blue dotted line) and its expanded area (green dotted line)

After identifying a cluster of over-utilized areas we expand the area in which the blocks in the cluster will be placed. The area is extended as equally as possible in all directions. The expanded area should of course not exceed the boundaries of the FPGA. We keep expanding the area until the utilization of the area is smaller than or equal to a certain maximal value. At first sight it might seem logical to choose one for this maximal value, but choosing 0.9 has proven in the past to give the best results [6]. For our example the expanded area is shown with a green dotted line in Figure 3.10a. This expanded area has a capacity of 20 blocks, so that the utilization of the area has dropped to 0.85. Looking at Figure 3.10a again, we see that there is a not-yet-legalized block which was not in the original over-utilized cluster but is in the expanded area of this cluster. Experiments have shown that adding this block to the cluster gives the best results. This means that we now have 18 blocks in the cluster instead of 17, and that the utilization has risen to 0.90, which is by coincidence exactly equal to the maximal utilization.
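The area expansion can be sketched as below. The text only states that the area is extended as equally as possible in all directions; growing one side at a time in the fixed order left, right, bottom, top is an assumption of this sketch, as are the function name and the example coordinates. For simplicity the occupancy is kept constant, so blocks that fall inside the expanded area are not added to the cluster here.

```python
def expand_area(area, occupancy, width, height, max_utilization=0.9):
    """Grow a rectangular area until its utilization is low enough.

    area:      (x_min, y_min, x_max, y_max), inclusive grid coordinates
    occupancy: number of blocks that have to be placed in the area
    width, height: dimensions of the FPGA grid
    """
    x0, y0, x1, y1 = area
    step = 0
    while occupancy / ((x1 - x0 + 1) * (y1 - y0 + 1)) > max_utilization:
        if (x0, y0, x1, y1) == (0, 0, width - 1, height - 1):
            raise ValueError("the FPGA is too small for this cluster")
        # Grow one side per step; sides on the FPGA boundary stay in place.
        side = step % 4
        if side == 0:
            x0 = max(x0 - 1, 0)
        elif side == 1:
            x1 = min(x1 + 1, width - 1)
        elif side == 2:
            y0 = max(y0 - 1, 0)
        else:
            y1 = min(y1 + 1, height - 1)
        step += 1
    return (x0, y0, x1, y1)


# A 2x2 over-utilized area holding 17 blocks on an assumed 10x10 grid.
x0, y0, x1, y1 = expand_area((4, 4, 5, 5), 17, 10, 10)
capacity = (x1 - x0 + 1) * (y1 - y0 + 1)
print((x0, y0, x1, y1), capacity, 17 / capacity)
```

With these assumptions the area grows to a capacity of 20 locations, which gives the utilization of 0.85 mentioned in the text.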

Now that we have enough capacity to give all blocks in the cluster a legal position we can finally start the real legalization process. This is where the actual recursive partitioning happens. The expanded area on which we will place the blocks is called the target area. The set of blocks that we will place on this area (the blocks in the cluster) is called the source set. A flowchart of the recursive partitioning algorithm is shown in Figure 3.11. The algorithm can be broken down into the following four steps:

1. Sort the blocks in the source set in order of x-location (y-location).

2. Choose a vertical (horizontal) cut line in the target area in such a way that the capacities of both sub areas are as close to each other as possible. This cut line is sometimes referred to as the target cut line.

3. Put the source blocks (blocks in the source set) with the smallest x-locations (y-locations) in the left (top) sub area and the blocks with the highest x-locations (y-locations) in the right (bottom) sub area. Do this in such a way that the utilizations of both sub areas are as close to each other as possible. The division of the blocks in the source set into two subsets is sometimes referred to as generating a source cut.

4. Repeat this process for the two sub areas starting with a horizontal (vertical) target cut line. If only a single source block remains in a sub area the process is finished and the block is placed on the location which is closest to the position it was given by the linear solver.

Figure 3.11: Flowchart of the recursive partitioning algorithm


The generation of the source and target cuts for our example is illustrated in Figure 3.12. The utilization of the left sub area is 0.875, while the utilization of the right sub area is 0.833. It is impossible to get these utilizations closer to each other.

Figure 3.12: Visual representation of the source and target cuts

Step 4 of the algorithm is usually implemented by doing a recursive call of the recursive partitioning function, as shown in Figure 3.11.
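The four steps of the recursive partitioning algorithm can be sketched in Python as follows. This is a minimal sketch for a homogeneous target area and not the implementation used in the placer: the function names, the tuple-based block and area representation, the choice of the cut line in the middle of the area and the fallback to the other cut direction when the area is only one location wide or high are all assumptions.

```python
def source_cut(n, cap1, cap2):
    """Number of blocks for the first sub area so that both utilizations
    are as close as possible while neither sub area overflows."""
    candidates = range(max(0, n - cap2), min(n, cap1) + 1)
    return min(candidates, key=lambda k: abs(k / cap1 - (n - k) / cap2))


def partition(blocks, area, vertical=True):
    """Recursively spread blocks over a rectangular target area.

    blocks: list of (name, x, y) with positions from the linear solver
    area:   (x_min, y_min, x_max, y_max), inclusive grid coordinates
    Returns a dict mapping block name -> legal (x, y) location.
    """
    x0, y0, x1, y1 = area
    if not blocks:
        return {}
    if len(blocks) == 1:
        # Step 4: a single block goes to the closest location in the area.
        name, x, y = blocks[0]
        return {name: (min(max(round(x), x0), x1), min(max(round(y), y0), y1))}
    width, height = x1 - x0 + 1, y1 - y0 + 1
    if len(blocks) > width * height:
        raise ValueError("target area too small for the source set")
    # Assumed fallback: cut in the other direction if the area is too thin.
    if vertical and width == 1:
        vertical = False
    elif not vertical and height == 1:
        vertical = True
    if vertical:
        ordered = sorted(blocks, key=lambda blk: blk[1])      # step 1
        mid = x0 + width // 2                                 # step 2
        first, second = (x0, y0, mid - 1, y1), (mid, y0, x1, y1)
        cap1 = (mid - x0) * height
    else:
        ordered = sorted(blocks, key=lambda blk: blk[2])
        mid = y0 + height // 2
        first, second = (x0, y0, x1, mid - 1), (x0, mid, x1, y1)
        cap1 = width * (mid - y0)
    cap2 = width * height - cap1
    k = source_cut(len(ordered), cap1, cap2)                  # step 3
    placement = partition(ordered[:k], first, not vertical)   # step 4
    placement.update(partition(ordered[k:], second, not vertical))
    return placement


# 17 blocks clustered around one spot, placed on a 5x4 target area.
cluster = [("b%d" % i, 4.0 + 0.05 * i, 4.5 - 0.03 * i) for i in range(17)]
legal = partition(cluster, (2, 3, 6, 6))
print(sorted(legal.items()))
```

As a check on the source cut: for 17 blocks and sub areas with capacities 8 and 12, `source_cut` selects 7 blocks for the first sub area, which corresponds to utilizations of 7/8 = 0.875 and 10/12 ≈ 0.833.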

The final position of all blocks in the cluster after the recursive partitioning legalization algorithm has finished is shown in Figure 3.10b. When all blocks in the current cluster have been assigned a legal position we proceed with the legalization process by searching for a new cluster of blocks that have not been assigned a legal position yet. This is done as long as there are not-yet-legalized blocks remaining.

The attentive reader might have discovered one small loophole in the algorithm described above: there is no mechanism in the algorithm that prevents the expanded areas of two different over-utilized clusters from overlapping. Note that this can only happen when the expanded area of the first cluster is not overlapping with the clustered over-utilized area of the second cluster. Because of this the amount of overlap between the two expanded areas will always remain rather small. This overlap is resolved by doing a final legalization run after assigning each block an initial legal position using the process described above. When during this final legalization run two blocks are discovered that have the same legalized position, one of the two blocks is held at this position, while the other is given a new valid position by means of a very simple spiral search algorithm. This spiral search algorithm searches for the empty position that is closest to the original legalized position of the block.
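The final legalization run could look roughly like the sketch below. The exact spiral order used in the placer is not specified here, so this sketch makes its own assumptions: it searches square rings of increasing radius around the original position (that is, it takes the closest location in terms of ring distance) and breaks ties within a ring by Euclidean distance. The function names are hypothetical as well.

```python
def nearest_free(pos, occupied, width, height):
    """Return the free grid location in the closest non-empty ring around
    pos, or None when the whole grid is occupied."""
    px, py = pos
    for radius in range(max(width, height)):
        ring = [
            (x, y)
            for x in range(px - radius, px + radius + 1)
            for y in range(py - radius, py + radius + 1)
            if max(abs(x - px), abs(y - py)) == radius
            and 0 <= x < width and 0 <= y < height
            and (x, y) not in occupied
        ]
        if ring:
            # Assumed tie-break: smallest Euclidean distance within the ring.
            return min(ring, key=lambda c: (c[0] - px) ** 2 + (c[1] - py) ** 2)
    return None


def resolve_overlaps(placement, width, height):
    """Keep the first block on every location and move the others."""
    result, occupied, displaced = {}, set(), []
    for name, pos in placement.items():
        if pos in occupied:
            displaced.append((name, pos))
        else:
            occupied.add(pos)
            result[name] = pos
    for name, pos in displaced:
        free = nearest_free(pos, occupied, width, height)
        if free is None:
            raise ValueError("no free location left")
        occupied.add(free)
        result[name] = free
    return result


print(resolve_overlaps({"a": (1, 1), "b": (1, 1), "c": (2, 1)}, 4, 4))
```

Collecting all undisputed positions first ensures that a displaced block never takes a location that a later block in the list already legally occupies.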

Special considerations for heterogeneous FPGAs

When we have a heterogeneous FPGA there are three instead of two requirements that need to be fulfilled before the placement is considered to be legal:

• All blocks need to have an integer location within the boundaries of the FPGA.

• No two blocks should have the same integer x- and y-coordinate.

• All blocks need to have a position which corresponds to a site of the correct block type

The last requirement means that a CLB should be placed on a position which corresponds with a CLB site on the FPGA instead of with a multiplier hard block site, for example. When legalizing blocks on a heterogeneous FPGA we will always legalize the different types of blocks separately. We will for example first legalize all CLBs, then all hard blocks of type one, then all hard blocks of type two and so on. The basic algorithm described for homogeneous FPGAs is still used on heterogeneous FPGAs. We do have to extend the algorithm with one additional step. This is explained using Figure 3.13a. This figure depicts the same example as shown in Figure 3.10a, but now using a heterogeneous FPGA. The blocks to be placed are considered to be CLBs. We identify a clustered over-utilized area (indicated with a blue dotted line) containing 17 CLBs in total. The column filled in grey is a column where it is illegal to place CLBs. This can be a multiplier hard block column, for example. The expanded area is outlined with a green dotted line. Note that because of the presence of this illegal column we have to expand the area further than in the homogeneous case in order to have a utilization smaller than or equal to the maximal utilization (0.9). Also note that while expanding the area in which to place the CLBs, the isolated CLB on the right side of the FPGA is added to the set of blocks to legalize in this iteration of the algorithm.


Figure 3.13: A clustered over-utilized area (blue dotted line) and its expanded area (green dotted line) on a heterogeneous architecture

In the homogeneous case the next step in the legalization procedure would be to use the recursive partitioning algorithm to spread and legalize the blocks in the source set over the target area. In the heterogeneous case depicted in Figure 3.13a we cannot do this yet. This is because the target area contains some illegal locations (the locations filled in grey in Figure 3.13a). This problem is quite easily solved. When we take a closer look at Figure 3.13a it is clear that the illegal target locations divide the target area into two separate rectangular target areas which are completely legal. This means that our problem is solved by sorting the blocks to place in order of x-coordinate and then dividing them over the two sub areas in such a way that the utilizations of both sub areas are as close to each other as possible. Then we can use the recursive partitioning algorithm to further legalize the solution in both sub areas. Note that because all blocks which are located in the same column of an FPGA are of the same type, we can always use the strategy described above to legalize the solution provided by solving the linear system of equations. The final result of the legalization process described above is shown in Figure 3.13b.


Gradual legalization

When the utilization of a clustered over-utilized area is very high we will have to expand its area considerably when using the legalization algorithms described above. Looking back at Figure 3.10a this means that the area indicated with the green dotted line (expanded area) will be a lot bigger than the area indicated with the blue dotted line (clustered over-utilized area). Because of this some blocks will be moved very far from their original positions (given by the linear solver). This may have a very negative effect on the HPWL of the legalized placement.

In order to counter this phenomenon we propose a new gradual legalization method. This new legalization method will allow some overlap in the first iteration of the analytical placement algorithm (see Figure 3.1). Over the next few iterations of the analytical placement algorithm the amount of allowed overlap is gradually decreased (hence the name gradual legalization), until at some point we start completely legalizing the solution. In this way we expect that the algorithm has more freedom to find a better legal placement for clusters of overlapping blocks, especially when these clusters contain a lot of blocks.

Luckily we don’t have to change the legalization algorithm very much. In fact only two changes are needed. The first change is to stop the area expansion process as soon as the utilization of the expanded area is smaller than the maximal utilization that was passed to the algorithm. Instead of the fixed value of 0.9 in the complete legalization algorithm this maximal value can now be any number. We could for example pass 4.0 in the first iteration, 3.0 in the second iteration, 2.0 in the third iteration, 1.5 in the fourth iteration and 0.9 from the fifth iteration onwards. In fact this was the sequence that was passed to the algorithm during some of the experiments described in Chapter 4. The second change is to stop the recursive partitioning process when only one location remains in the target area instead of when only one block remains in the source set.
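The schedule of maximal utilizations can be sketched as follows. This is a minimal Python illustration using the example sequence given above; the names `max_utilization` and `expansion_done` are hypothetical, and treating a utilization equal to the limit as acceptable is an assumption carried over from the complete legalization case.

```python
# Example sequence from the text; from the fifth iteration onwards the
# complete legalization value of 0.9 is used.
GRADUAL_SCHEDULE = [4.0, 3.0, 2.0, 1.5]
COMPLETE_MAX_UTILIZATION = 0.9


def max_utilization(iteration):
    """Maximal allowed utilization for a 1-based iteration number."""
    if iteration <= len(GRADUAL_SCHEDULE):
        return GRADUAL_SCHEDULE[iteration - 1]
    return COMPLETE_MAX_UTILIZATION


def expansion_done(num_blocks, num_legal_sites, iteration):
    """True when the expanded area no longer needs to grow (assumed <=)."""
    return num_blocks / num_legal_sites <= max_utilization(iteration)


for it in range(1, 7):
    print(it, max_utilization(it), expansion_done(17, 5, it))
```

With 17 blocks on 5 legal sites (utilization 3.4) the expansion would already stop in the first iteration, while later iterations force the area to grow further.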

Figures 3.14a and 3.14b illustrate the legalization process described above for a homogeneous architecture. Note that the same example is used as in Figures 3.10a and 3.10b, except for the one block that is outside of the clustered over-utilized area (blue dotted line) but inside the expanded area (green dotted line). It is immediately clear that the area to place the blocks in needs to be expanded a lot less compared to complete legalization. The numbers in Figure 3.14b indicate the number of blocks which are placed at each site after the gradual legalization process finishes.

The gradual legalization process described above can easily be extended to the heterogeneous case. To do this the same considerations need to be taken into account as described for complete legalization.

3.2.4 Pseudo connections and convergence

In this section we “close the loop of the algorithm” (see Figure 3.1) by formulating and solving the linear system of equations again and in this way (hopefully) obtain a better legalized solution. In order to keep the discussion as simple as possible we start the section with a discussion of pseudo connections and the convergence of the analytical placement algorithm on a homogeneous architecture. After this we will discuss the considerations to keep in mind when moving towards a heterogeneous architecture.



Figure 3.14: Gradual legalization on a homogeneous architecture

Homogeneous architecture

At this point we have two different solutions of the placement problem:

1. The solution obtained by solving the linear system of equations. This solution is illegal because the blocks might be placed on non-integer locations, there can be a lot of overlap between the blocks and some blocks might be placed on the wrong type of FPGA site (e.g. a CLB on an IO site).

2. The solution obtained by legalizing the linear system solution described in item 1 of this enumeration. This solution is completely legal (except when gradual legalization is used, in which case some overlap between blocks can remain, but this does not matter for this discussion) but might not be the best possible legal solution. This is especially the case when the legalization algorithm had to resolve a lot of overlap.

Our goal is now to improve the legalized solution. To do this we would like to formulate the linear system of equations again, but now in such a way that the amount of overlap in the solution is reduced. Different methods exist to achieve this goal. The method used in the analytical placer developed as part of this thesis was first used in the analytical ASIC placer SimPL [8] but was also used in the analytical FPGA placer HeAP [6]. This method adds a two-pin pseudo net to each movable block. This two-pin pseudo net is connected to the considered block on the one hand, and a virtual fixed pin at the location of this block after legalization on the other hand. These virtual fixed pins are often referred to as anchor points. We always use the best legalized solution found so far as anchor points. This is depicted in Figure 3.15 for the more complex example which was first introduced in Figure 3.7. The green blocks can be considered as the locations of blocks A, B, C and D given by the last solution obtained by solving the linear system. The blocks drawn with a red dotted line can be considered as the locations of blocks A, B, C and D given by the best legalized solution found so far.

Figure 3.15: The pseudo connections added to the more complex example

The weights that are used for these pseudo connections are the same as for standard nets, except that they are multiplied with an additional weight factor α · i where i is the iteration count. The weight factor determines how strongly the block will be pulled towards its legalized location. If the weight factor is small overlapping blocks will only be slightly pulled away from each other. A very high weight factor on the other hand will result in the linear system solution being almost equal to the legalized solution used as a reference for the pseudo connections. By gradually increasing this weight factor with the iteration count the solution obtained by solving the linear system is gradually pulled towards the best legalized solution found so far. During this process the linear solver can rearrange some of the blocks in order to improve the legalized solution. In this way the quality of the linear system solution will gradually decrease with iteration count, while the quality of the legalized solution will gradually increase with iteration count. This process is referred to as the convergence of the algorithm. The evolution of the linear system solution’s quality and the legalized solution’s quality as a function of the iteration count is shown in Figure 3.16. This figure was generated using our analytical placer with the sha benchmark, which is part of the VTR benchmark suite [11], as input.
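The pull of a pseudo connection can be made concrete with a one-dimensional toy case: a single movable block connected to one fixed neighbour and to its anchor point. The Python sketch below is only an illustration under simplifying assumptions: the net weight is taken as a constant instead of being recomputed from the block distance, and the values of `alpha` and the coordinates are hypothetical.

```python
def position_with_anchor(x_neighbor, x_anchor, alpha, iteration, net_weight=1.0):
    """Minimize w*(x - x_neighbor)^2 + w*alpha*i*(x - x_anchor)^2 over x.

    Setting the derivative to zero gives a weighted average of the
    neighbour position and the anchor position.
    """
    pseudo_weight = net_weight * alpha * iteration
    return (net_weight * x_neighbor + pseudo_weight * x_anchor) / (
        net_weight + pseudo_weight
    )


# The block starts at its neighbour (x = 0) and drifts towards the
# anchor (x = 10) as the iteration count grows.
for i in range(0, 6):
    print(i, round(position_with_anchor(0.0, 10.0, alpha=0.5, iteration=i), 3))
```

In iteration 0 the pseudo connection has no weight and the block sits on its neighbour; for larger iteration counts the solution approaches, but never quite reaches, the anchor location.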

Figure 3.16: Convergence of the sha benchmark

It is clear from Figure 3.16 that from a certain iteration number onwards the legal solution doesn’t improve much anymore. Therefore it is often not useful to keep iterating until the legal solution and the linear system solution are almost the same. The algorithm is typically stopped when the ratio of the linear system solution’s HPWL to the legal solution’s HPWL reaches a certain number. HeAP [6] proposes to use 0.7 for this number, but we have found that running the algorithm until a ratio of 0.8 is reached is often useful.
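The stopping criterion amounts to a single comparison, sketched here in Python. The function name `should_stop` is hypothetical; the default of 0.8 is the ratio mentioned above, and 0.7 would correspond to the HeAP proposal.

```python
def should_stop(linear_hpwl, legal_hpwl, stop_ratio=0.8):
    """Stop iterating once the linear solution's HPWL has come close
    enough to the HPWL of the best legal solution."""
    return linear_hpwl / legal_hpwl >= stop_ratio


print(should_stop(linear_hpwl=70.0, legal_hpwl=100.0))  # ratio 0.70
print(should_stop(linear_hpwl=85.0, legal_hpwl=100.0))  # ratio 0.85
```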

Considerations for heterogeneous FPGAs

When placing a circuit on a heterogeneous FPGA we need to adapt a few things. First of all we can ask ourselves whether we should add all movable blocks to the linear system formulation each iteration or whether we should only add one type of block at a time (first CLBs, then hard blocks of type 1, then hard blocks of type 2, etc.). This has already been researched in HeAP [6]. Their research has shown that the best solve order is actually a mix of the two proposals: first solve all blocks, then all CLBs, then all hard blocks of type 1, then all hard blocks of type 2, etc. After completing this sequence it is repeated from the start, until the maximal ratio of linear system solution HPWL to legal solution HPWL is reached.

Now that we have identified the best solving order, we can ask ourselves whether we should still increase the pseudo connection weight factor in the same way as we did for a homogeneous architecture. We have found that increasing the pseudo connection weight factor every iteration (also when only hard blocks of type 1 are solved for example) does not give good results. We found that only increasing the pseudo connection weight factor when all blocks or only CLBs are solved gives the best results. If other types of blocks are solved, the pseudo connection weight factor is kept at the same value as in the previous iteration.
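The solve order and the weight factor update rule can be summarized in a short Python sketch. The function names, the block type labels and the choice to start the weight factor at zero and raise it by `alpha` per qualifying step are assumptions made for illustration; the text only fixes the order and which steps increase the factor.

```python
from itertools import cycle


def solve_schedule(hard_block_types):
    """Endless solve order: all blocks, CLBs only, then each hard block type."""
    return cycle(["all", "CLB"] + list(hard_block_types))


def weight_factor_trace(hard_block_types, alpha, num_iterations):
    """Return (solved block set, pseudo weight factor) for each iteration."""
    factor = 0.0
    trace = []
    schedule = solve_schedule(hard_block_types)
    for _ in range(num_iterations):
        target = next(schedule)
        # Only raise the factor when all blocks or only CLBs are solved.
        if target in ("all", "CLB"):
            factor += alpha
        trace.append((target, factor))
    return trace


for step in weight_factor_trace(["multiplier", "memory"], alpha=0.5, num_iterations=6):
    print(step)
```

In a full placer this loop would end as soon as the HPWL ratio stopping criterion is met rather than after a fixed number of iterations.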

3.2.5 Refinement

As mentioned before, the result of the analytical placement algorithm itself is often not good enough yet. Therefore a refinement step is needed. This refinement step resolves local disturbances in the placement. The analytical placement algorithm itself is often referred to as the global placement. The refinement step is often referred to as the detailed placement.

Several algorithms for detailed placement have already been used in the past. These have been mentioned in Section 3.1. We chose to use low temperature simulated annealing for detailed placement. This choice is motivated by the fact that we want to get as close as possible to the results obtained by a complete high effort simulated annealing placement. The other detailed placement algorithms mentioned in Section 3.1 were often used when comparing the final results to a low effort simulated annealing run. We didn’t expect these algorithms to get as close to the results of a high effort simulated annealing run as we wanted.

For complete details about the simulated annealing algorithm we used in this thesis we refer to Section 2.3.2, [1] and [12]. In this section we will only discuss the changes that we have made to this algorithm in order to implement a low temperature simulated annealing step.

The first (and most obvious) change that we have made to the original algorithm is changing the way in which the initial temperature is calculated. As mentioned in Section 2.3.2 the current temperature determines how much a swap may increase the placement cost (and thus decrease the solution quality) to still have a chance of being accepted. The temperature mechanism is used to explore the complete solution space and avoids getting stuck in local minima. But in our case a global placement was already found. Because of this we don’t have to explore the solution space anymore. Therefore the initial temperature can be chosen lower than when starting from a random placement. To determine the initial temperature for a low temperature simulated anneal we perform a number of test block swaps. The number of test swaps is equal to the number of blocks to place. We then determine the standard deviation of the cost decrease of all cost decreasing swaps. The initial temperature is then equal to this standard deviation multiplied by two:

$$T_{\text{initial,low}} = 2 \cdot \sigma_{\Delta C,\,\text{negative}} \tag{3.18}$$

We chose to only consider the cost changes resulting from cost decreasing swaps because many of the cost increasing swaps will have a very high ∆C. This is a consequence of the fact that all blocks have already received a more or less good position during global placement.

When doing a normal simulated annealing placement we calculate the standard deviation of the absolute value of the cost changes of all test swaps. The initial temperature is then equal to twenty times this standard deviation:

$$T_{\text{initial,high}} = 20 \cdot \sigma_{\Delta C,\,\text{all}} \tag{3.19}$$

In this case we consider all cost changes because the random initial placement will result in an almost equal number of cost increasing and cost decreasing swaps. Moreover the absolute value of these cost changes will be of the same order of magnitude, as opposed to the low temperature anneal refinement where the cost increases (bad swaps) will typically be higher (in absolute value) than the cost decreases (good swaps).
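Equations 3.18 and 3.19 can be illustrated with a few lines of Python. This is a sketch only: the function names are hypothetical, the list of cost changes is made-up test data, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not state which estimator is used.

```python
import statistics


def initial_temperature_low(cost_changes):
    """Equation 3.18: twice the standard deviation of the cost decreasing swaps."""
    decreases = [dc for dc in cost_changes if dc < 0]
    if not decreases:
        return 0.0  # no improving test swap found (assumed fallback)
    return 2.0 * statistics.pstdev(decreases)


def initial_temperature_high(cost_changes):
    """Equation 3.19: twenty times the standard deviation of |delta C|."""
    return 20.0 * statistics.pstdev([abs(dc) for dc in cost_changes])


# Hypothetical cost changes of four test swaps.
test_swaps = [-1.0, -3.0, 5.0, 7.0]
print(initial_temperature_low(test_swaps))
print(initial_temperature_high(test_swaps))
```

For the same set of test swaps the low temperature start is far below the high effort start, which is what restricts the refinement to local changes.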

A second change that is made to the simulated annealing algorithm for detailed placement is the initial value of Rlim (see Figure 2.9). In a normal simulated anneal Rlim is initially equal to the FPGA dimension. This is done in order to be able to explore the complete solution space. As this is not necessary anymore when doing detailed placement we can choose a lower initial value for Rlim. We chose the initial value of Rlim equal to the maximal horizontal or vertical FPGA dimension divided by three. Rlim is also capped to this value: it can never become greater over the course of the algorithm, only smaller. This was done to keep the low temperature anneal fine grained, as a good global placement was already found.
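The Rlim initialization and cap can be sketched as follows. The function names are hypothetical, and the lower bound of 1 on Rlim is an assumption borrowed from the usual VPR-style range limiter rather than something stated in the text.

```python
def initial_rlim(width, height, detailed=True):
    """Initial range limit: the largest FPGA dimension, divided by three
    for the low temperature (detailed placement) anneal."""
    largest = max(width, height)
    return largest / 3 if detailed else float(largest)


def clamp_rlim(rlim, width, height):
    """Keep an updated Rlim between 1 (assumed) and the detailed placement cap."""
    return max(1.0, min(rlim, initial_rlim(width, height)))


print(initial_rlim(30, 24))          # detailed placement start value
print(initial_rlim(30, 24, False))   # full anneal start value
print(clamp_rlim(15.0, 30, 24))      # an update above the cap is clipped
```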

The last parameter that was changed is the number of swaps per temperature. In our low temperature simulated anneal the number of swaps per temperature was 60% lower than when doing a complete, high effort simulated anneal.

3.2.6 Timing-driven analytical placement

Until now we have only considered wire-length-driven analytical placement. If we also want to optimize the critical path delay of the placement generated by the analytical placement algorithm we have to incorporate timing information in the algorithm. We have chosen to incorporate timing information only in the linear system formulation, as was done in all timing-driven analytical FPGA placers described in literature up till now (see Section 3.1). The legalization phase was not changed. We do have to mention that we always use the complete legalization algorithm, and not the gradual legalization algorithm. This choice was made because in our opinion it doesn’t make sense to generate timing information from a placement that is not completely legal.

Three different methods to incorporate timing information in the linear system formulation were evaluated. We start by discussing how we added timing information to the bound-to-bound net model. Then we look into adding timing information to the source-sink net model. Finally, we present a method to do timing-driven analytical placement using a hybrid net model which uses connections from both the bound-to-bound net model and the source-sink net model.

Adding timing information to the bound-to-bound net model

In order to add timing information to the linear system formulation using the bound-to-bound net model we based ourselves on the work done in [10]. For convenience we repeat the connection weight used in the bound-to-bound net model when doing wire-length-driven analytical placement (see Section 3.2.2 for more details):

$$w^{\text{B2B}}_{AB} = \frac{2}{P-1} \cdot \frac{1}{|x_A - x_B|} \cdot q(n) \tag{3.20}$$

When using the bound-to-bound net model the two-pin connections added to the linear system can be either between the source pin and a sink pin or between two sink pins of the net. Because of this we can only add timing information on a net-by-net basis, and not on a connection-by-connection basis. More precisely, we sum the timing costs of all source pin to sink pin connections present in the net. For this timing cost we use the same definition as in Expression 2.5. This summation gives us a timing factor wf:

$$w_f = \sum_{\forall (i,j) \subset \text{net}} \text{timingCost}(i,j) \tag{3.21}$$

This eventually gives us a new connection weight to use when doing timing-driven analytical placement using the bound-to-bound net model. This new connection weight is presented in Equation 3.22. Note that we no longer multiply with q(n). Experiments showed that this produces the best results.

$$w^{\text{TD-B2B}}_{AB} = \frac{2}{P-1} \cdot \frac{1}{|x_A - x_B|} \cdot w_f \tag{3.22}$$
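As a small numerical illustration of Equations 3.20 to 3.22, the Python sketch below computes both weights for one connection. The function names are hypothetical, the timing costs are made-up values standing in for Expression 2.5, and the `min_distance` guard against division by zero is an assumption that is not part of the equations.

```python
def b2b_weight(num_pins, x_a, x_b, q_n, min_distance=1e-3):
    """Wire-length-driven bound-to-bound weight (Equation 3.20)."""
    distance = max(abs(x_a - x_b), min_distance)
    return 2.0 / (num_pins - 1) / distance * q_n


def td_b2b_weight(num_pins, x_a, x_b, timing_costs, min_distance=1e-3):
    """Timing-driven bound-to-bound weight (Equation 3.22).

    timing_costs holds one value per source-sink connection of the net;
    their sum is the timing factor w_f of Equation 3.21, which replaces q(n).
    """
    w_f = sum(timing_costs)
    distance = max(abs(x_a - x_b), min_distance)
    return 2.0 / (num_pins - 1) / distance * w_f


# A three-pin net with two source-sink connections.
print(b2b_weight(3, 1.0, 5.0, q_n=1.0))
print(td_b2b_weight(3, 1.0, 5.0, timing_costs=[0.2, 0.3]))
```

Because the timing factor is a property of the whole net, every bound-to-bound connection of that net is scaled by the same value.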

Adding timing information to the source-sink net model

As opposed to the bound-to-bound net model the source-sink net model only adds two-pin connections to the linear system which involve the source pin of the net and a sink pin. Therefore we can add timing information on a connection-by-connection basis. We start by repeating the connection weight used when doing wire-length-driven analytical placement using the source-sink net model (see Section 3.2.2 for more details):

$$w^{\text{S-S}}_{AB} = \frac{2}{P-1} \cdot \frac{1}{|x_A - x_B|} \tag{3.23}$$

We can now add timing information to this connection weight in a very natural way: we just multiply the connection weight with the criticality of the two-pin connection it is representing. The criticality of a connection is defined in Equation 2.6. The newly obtained connection weight is presented in Equation 3.24. Note that as a matter of notational convenience we have used α to represent the criticality of the connection.

$$w^{\text{TD-S-S}}_{AB} = \frac{2}{P-1} \cdot \frac{1}{|x_A - x_B|} \cdot \alpha \tag{3.24}$$

A new hybrid net model for timing-driven analytical placement

The timing-driven bound-to-bound net model and timing-driven source-sink net model didn’t perform as expected, especially in terms of HPWL (see Chapter 4 for more details). Because of this we propose a new hybrid net model. This net model uses connections from both the bound-to-bound net model and the source-sink net model and does a better job in optimizing both goals: minimal HPWL and minimal critical path delay.

Our hybrid net model starts by adding all connections which are present in the bound-to-bound net model to the linear system formulation. For these connections the wire-length-driven connection weight (see Equation 3.20) is used. After adding all bound-to-bound connections we also add all source-sink connections for which the criticality is larger than a certain threshold. We have used 0.8 in our experiments. For these connections the timing-driven source-sink weight (see Equation 3.24) is used.

It is important to note that our new hybrid net model is well motivated mathematically. The total expression that is minimized by taking partial derivatives is equal to a sum of two terms. Because we use all connections present in the bound-to-bound net model with wire-length-driven weights the first term is exactly equal to the HPWL. The second term consists of a sum of contributions corresponding to two-pin connections having a criticality higher than 0.8. As these connections determine the critical path delay, we are also actively minimizing the critical path delay. This notion is mathematically represented in Equation 3.25.

$$\text{minimized expression} = \text{HPWL} + \text{sum of critical connections} \tag{3.25}$$
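The construction of the hybrid net model for a single net in the x-dimension can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis implementation: pin 0 is taken to be the source, criticalities are supplied in a dictionary keyed by sink index, and the `min_distance` guard and all names are hypothetical.

```python
def hybrid_connections(xs, sink_criticality, q_n=1.0, threshold=0.8,
                       min_distance=1e-3):
    """Return (pin_i, pin_j, weight) connections for one net.

    xs: x-coordinates of the pins, xs[0] being the source pin (assumption).
    sink_criticality: maps sink pin index to its criticality.
    """
    num_pins = len(xs)
    scale = 2.0 / (num_pins - 1)

    def distance(i, j):
        return max(abs(xs[i] - xs[j]), min_distance)

    order = sorted(range(num_pins), key=lambda i: xs[i])
    low, high = order[0], order[-1]  # the two boundary pins

    # Bound-to-bound part with wire-length-driven weights (Equation 3.20).
    connections = [(low, high, scale / distance(low, high) * q_n)]
    for pin in order[1:-1]:
        connections.append((low, pin, scale / distance(low, pin) * q_n))
        connections.append((pin, high, scale / distance(pin, high) * q_n))

    # Extra source-sink connections for critical sinks (Equation 3.24).
    for sink, criticality in sorted(sink_criticality.items()):
        if criticality > threshold:
            weight = scale / distance(0, sink) * criticality
            connections.append((0, sink, weight))
    return connections


for connection in hybrid_connections([0.0, 4.0, 2.0], {1: 0.9, 2: 0.5}):
    print(connection)
```

Only the sink with criticality 0.9 receives an additional source-sink connection; the sink with criticality 0.5 is covered by the bound-to-bound connections alone.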

3.3 Contributions

Section 3.1 has shown that some academic work about analytical placement on FPGAs has already been published. Unfortunately none of these publications has resulted in the release of an open source framework for analytical placement on FPGAs on which others can freely continue to build. Moreover, all of our friendly requests for code or a working executable have been declined. This is remarkable, as some of the research groups behind these publications have been contributing to open source FPGA tools for several years. VPR [1], VTR [11] and Torc [15] are the biggest examples of open source FPGA tools. One of the most important contributions of this thesis is to help research groups get started with analytical placement on FPGAs by providing a new open source framework. Moreover, the open source framework focuses on interoperability with VPR: the framework can be used as a replacement for VPR’s simulated annealing based placement.

HeAP [6] has been the biggest inspiration for the analytical placer that was built as part of this thesis. But next to reproducing HeAP we have also made several new contributions to the baseline provided by HeAP. First of all we have introduced a new legalization method for use during wire-length-driven analytical placement. This new legalization method gradually resolves the overlap present in the solution of the linear system over multiple iterations of the algorithm. In this way we expect that the algorithm has more freedom to find a better legal placement for clusters of overlapping blocks, especially when these clusters contain a lot of blocks.

Next to proposing a new legalization method we also made our analytical placer timing-driven. A lot of research was done in order to find a good way to incorporate timing information in the linear system of equations. Along the way we have proposed two new net models: the source-sink net model and a hybrid net model that uses connections of both the bound-to-bound and source-sink net models. Our hybrid net model in particular is novel: to our knowledge it is the first time that timing information is added to the linear system formulation by adding additional connections. Before our work timing information was typically incorporated by including a timing factor in existing connections. Our experiments indicate that our new method outperforms the old method on average and is much more consistent across different benchmark circuits.

Chapter 4

Method and results

4.1 Introduction

This chapter describes the simulations that were conducted in order to evaluate the analytical placers that were developed as part of this thesis. All implementations were done using the Java programming language. Java was chosen because of its ease of development: it allowed us to quickly evaluate new ideas for our analytical placer. In total six placers have been implemented:

• A wire-length-driven simulated annealing based placer

• A timing-driven simulated annealing based placer

• A wire-length-driven analytical placer using the bound-to-bound net model

• A timing-driven analytical placer using the bound-to-bound net model

• A timing-driven analytical placer using the source-sink net model

• A timing-driven analytical placer using our hybrid net model

The analytical placers have already been described in detail in Chapter 3. The simulated annealing based placers were implemented in order to have a fair reference for our analytical placers. Their implementation was strongly based on the placer that is found in the Versatile Place and Route (VPR) tool [1] [12]. Our implementations of these simulated annealing based placers used the same architecture, fixed IO placement and timing information as our analytical placers. This allowed us to make a fair comparison.

All of our placers are able to place CLBs and an unlimited number of different hard block types. The only condition that must be met is that all blocks in the same column of the architecture in use have to be of the same type. The implemented simulated annealing based placers had the additional feature of being able to place IOs. This feature was only used to compare our implementation of these algorithms with VPR’s implementation.


4.1.1 Benchmark circuits

To evaluate our analytical placers we used the Verilog-to-Routing (VTR) benchmark suite. These benchmarks are provided in the blif file format as part of the VTR 7.0 release [11]. The packing algorithm of VPR was used to pack the circuits described in the blif files into CLBs, IOs, memory hard blocks and multiplier hard blocks. VPR saves these packed netlists in files using the .net file extension. These net files were read by our placement tool in order to generate a placement for these circuits on our architectures and using our placers. An overview of all the benchmark circuits that were used to evaluate our placers is given in Table 4.1. This table also gives the number of inputs, outputs, CLBs, memory hard blocks and multiplier hard blocks that remain in these benchmark circuits after packing.

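| Benchmark name | # inputs | # outputs | # CLBs | # memories | # multipliers |
|---|---|---|---|---|---|
| stereovision3 | 11 | 30 | 186 | 0 | 0 |
| ch_intrinsics | 99 | 130 | 425 | 1 | 0 |
| diffeq1 | 162 | 96 | 485 | 0 | 5 |
| diffeq2 | 66 | 96 | 322 | 0 | 5 |
| mkSMAdapter4B | 195 | 205 | 1982 | 5 | 0 |
| sha | 38 | 36 | 2280 | 0 | 0 |
| raygentop | 239 | 305 | 2429 | 1 | 7 |
| or1200 | 385 | 394 | 3054 | 2 | 1 |
| mkPktMerge | 311 | 156 | 232 | 15 | 0 |
| boundtop | 275 | 192 | 3070 | 1 | 0 |
| blob_merge | 36 | 100 | 6019 | 0 | 0 |
| stereovision0 | 157 | 197 | 14779 | 0 | 0 |
| mkDelayWorker32B | 511 | 553 | 5602 | 43 | 0 |
| bgm | 257 | 32 | 36480 | 0 | 11 |
| LU8PEEng | 114 | 102 | 26455 | 45 | 8 |
| stereovision1 | 133 | 145 | 13570 | 0 | 38 |
| stereovision2 | 149 | 182 | 36435 | 0 | 213 |
| LU32PEEng | 114 | 102 | 88787 | 168 | 32 |
| mcml | 36 | 33 | 106350 | 159 | 30 |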

Table 4.1: Overview of the benchmark designs

Because our analytical placers were not able to place IOs these were always considered fixed during placement. The IOs were always placed before the actual placement algorithm started its work. This was done using a deterministic algorithm, so that different placements of the same benchmark always had the same IO placement, independent of the placement algorithm in use.

4.1.2 Architectures

The architectures used to place the benchmark circuits on were always fitted to the circuit at hand. The number of available CLB sites in the architecture was kept as close as possible to 120% of the minimal necessary number of CLB sites (20% white space was kept). We always used a 1:1 aspect ratio. The target architectures only had IOs on the perimeter of the chip. Every IO block had exactly one input pin and one output pin at its disposal. If the architecture did not have enough input or output pins available the architecture was enlarged until this was the case. If no hard block sites were needed none were added. If hard block sites were needed these were added in columns that were distributed over the chip as equally as possible. A hard block always had the same size as a CLB. It has to be mentioned that this is not very realistic: hard blocks are typically bigger than CLBs. This issue will have to be addressed in future versions of our framework. An example of a generated architecture is shown in Figure 4.1. The grey rectangles on the perimeter of the FPGA are input or output pins, the blue squares are CLBs, the green squares are multiplier hard blocks and the red squares are memory hard blocks.
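The sizing rule can be illustrated with a simplified Python sketch. It makes several assumptions that go beyond the text: hard block columns are ignored, the core is a square of `side` by `side` CLB sites, and the perimeter is assumed to offer `4 * side` IO blocks with one input and one output pin each. The function name `square_side` is hypothetical.

```python
import math


def square_side(num_clbs, num_inputs, num_outputs, fill=1.2):
    """Side length of a square CLB array under the stated assumptions."""
    target = fill * num_clbs  # aim for 120% of the required CLB sites
    low = math.floor(math.sqrt(target))
    # Consider the two neighbouring squares that can hold all CLBs and
    # pick the one whose site count is closest to the target.
    candidates = [s for s in (low, low + 1) if s * s >= num_clbs]
    side = min(candidates, key=lambda s: abs(s * s - target))
    # Enlarge until the perimeter offers enough input and output pins.
    while 4 * side < max(num_inputs, num_outputs):
        side += 1
    return side


print(square_side(2280, 38, 36))    # CLB count dominates
print(square_side(232, 311, 156))   # IO count dominates
```

The second call shows a case where the IO requirement, not the CLB count, would determine the size under these assumptions.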

Figure 4.1: An example of a generated architecture containing two different types of hard blocks

4.1.3 Timing information

Because we did not directly base our architecture on an existing FPGA architecture we did not have any realistic timing information at our disposal that could readily be incorporated in our code. Because of this we generated dummy timing information using the following rules:

• The wiring delay between two circuit elements is equal to the Manhattan distance (x-distance plus y-distance) multiplied by 0.005.

• The delay associated with passing through a LUT or non-clocked hard block is 0.01.

The constants used in these two rules are parameters that can be assigned any value by the user. Although these two rules do not represent any real timing information they allowed us to build a timing graph and perform a placement that optimizes two different (but also in some way related) optimization goals: HPWL and critical path delay.
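The two delay rules can be written down directly. The Python sketch below uses the default constants from the rules above; the function names and the representation of a path as a list of `(x, y)` locations are assumptions for illustration.

```python
def wire_delay(src, dst, delay_per_unit=0.005):
    """Wiring delay: Manhattan distance between two (x, y) locations
    multiplied by a constant."""
    return (abs(src[0] - dst[0]) + abs(src[1] - dst[1])) * delay_per_unit


def path_delay(locations, num_combinational_blocks,
               block_delay=0.01, delay_per_unit=0.005):
    """Delay of a path visiting the given locations in order, passing
    through the given number of LUTs or non-clocked hard blocks."""
    wires = sum(wire_delay(a, b, delay_per_unit)
                for a, b in zip(locations, locations[1:]))
    return wires + num_combinational_blocks * block_delay


# Source at (0, 0), one LUT at (3, 4), sink at (3, 6).
print(path_delay([(0, 0), (3, 4), (3, 6)], num_combinational_blocks=1))
```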


4.1.4 Placement evaluation

The circuit placements generated by the different placement algorithms were compared usingmultiple measures. The following measures will be used throughout this chapter:

• Run-time

• Memory usage (analytical placers only)

• HPWL

• Critical path delay

The run-time of the analytical placers will be broken down into two parts: analytical time and refinement time. The analytical time is the time necessary to generate a global placement using the analytical placement algorithm itself. The refinement time is the time needed by the low temperature anneal to obtain a high quality result. Note that placement refinement will often be referred to as detailed placement.

Memory usage will only be taken into account when comparing the bound-to-bound net model with our new hybrid net model for timing-driven analytical placement. This comparison is mainly done because we want to know how much more memory is required by our hybrid net model compared to the existing bound-to-bound net model.

HPWL and critical path delay will be the measures used to compare placement quality. When comparing wire-length-driven placement algorithms with each other only HPWL will be taken into account. When comparing timing-driven placement algorithms with each other both HPWL and critical path delay will be taken into account.

To conclude this section we mention that all of the simulation results presented in this chapter were obtained using the same computer system. This system has a quad-core Intel Core i7-3770 CPU running at 3.40 GHz and with a cache size of 8 MB. There is 16 GB of working memory installed.

4.2 Reference simulated annealing based placers

As mentioned before, we have implemented two simulated annealing based placers. These were implemented in order to have a fair reference for our analytical placers. This is not only important for comparing run-time, but also for comparing solution quality: we know for sure that our simulated annealing based placers and our analytical placers will use the same architecture, fixed IO placement and timing information.

We have implemented two simulated annealing based placers: a wire-length-driven version and a timing-driven version. Both of these placers were based on the placers that are part of the VPR tool [1] [12]. The simulation results obtained using these placers are presented in this section.


4.2.1 Wire-length-driven simulated annealing based placer

The placement results of the VTR benchmark circuits placed by our wire-length-driven simulated annealing based placer are shown in Table 4.2. For each benchmark we give the run-time of the algorithm (in seconds) and the quality of the obtained placement in terms of HPWL and critical path delay. These results will be used as a reference for the wire-length-driven analytical placers further on in this chapter. Note that this table is the only place where we present absolute numbers for the run-time and solution quality of a wire-length-driven placer. From now on all results for wire-length-driven placers will be presented as ratios with respect to these values. Also note that the HPWL numbers shown in Table 4.2 are divided by 1000 in order to make the results more presentable.

| Benchmark name | Run-time (s) | HPWL (×1000) | Critical path delay |
|---|---|---|---|
| stereovision3 | 1.059 | 1.402 | 0.160 |
| ch_intrinsics | 4.616 | 8.109 | 0.335 |
| diffeq1 | 6.874 | 12.81 | 0.850 |
| diffeq2 | 3.641 | 7.210 | 0.640 |
| mkSMAdapter4B | 38.89 | 38.60 | 0.715 |
| sha | 35.58 | 31.32 | 0.950 |
| raygentop | 54.55 | 56.87 | 0.810 |
| or1200 | 82.99 | 104.5 | 2.365 |
| mkPktMerge | 19.55 | 32.45 | 0.445 |
| boundtop | 66.62 | 54.76 | 1.050 |
| blob_merge | 156.3 | 111.1 | 1.035 |
| stereovision0 | 654.1 | 144.2 | 1.005 |
| mkDelayWorker32B | 199.7 | 211.4 | 1.305 |
| bgm | 3576 | 609.6 | 2.700 |
| LU8PEEng | 2047 | 521.3 | 8.805 |
| stereovision1 | 577.2 | 218.1 | 0.760 |
| stereovision2 | 4057 | 961.2 | 2.220 |
| LU32PEEng | 14679 | 2334 | 8.700 |
| mcml | 17759 | 2471 | 7.575 |

Table 4.2: Wire-length-driven simulated annealing results (fixed IOs)

In order to show that our simulated annealing based algorithm delivers high-quality results in a time-efficient way we have compared the solution quality and run-time of our code with the placer that is part of the VPR tool. Because VPR also places IOs we have temporarily adapted our algorithm so that IOs are placed as well. The results of this experiment are shown in Table 4.3. The numbers in the run-time ratio column were obtained by dividing the run-time of our Java wire-length-driven SA placer by the run-time of VPR’s wire-length-driven placer. The HPWL ratio column is obtained in the same way, but now dividing the HPWL costs. Note that we only mention HPWL ratio as a measure for the quality of the generated placement. This is because critical path delay is not meaningful in this comparison, as VPR uses different timing data compared to our placers.

When looking at the results presented in Table 4.3 we see that our wire-length-driven SA placeron average gives exactly the same results as VPR’s placer. The small deviations for some of theindividual benchmark circuits can entirely be devoted to the random nature of the simulated


Benchmark name        Run-time ratio    HPWL ratio

stereovision3         1,17              1,01
ch_intrinsics         1,66              1,02
diffeq1               1,82              1,02
diffeq2               1,25              1,00
mkSMAdapter4B         1,29              0,99
sha                   1,00              0,99
raygentop             1,39              1,01
or1200                1,46              1,00
mkPktMerge            1,85              1,02
boundtop              1,16              0,99
blob_merge            0,90              1,02
stereovision0         1,04              1,01
mkDelayWorker32B      1,55              0,96
bgm                   1,04              1,02
LU8PEEng              1,07              1,01
stereovision1         1,21              1,02
stereovision2         1,02              0,98

Geometric mean        1,26              1,00

Table 4.3: Comparison between our Java-based SA placer and VPR’s SA placer (movable IOs, wire-length-driven)

annealing algorithm. This shows that our wire-length-driven SA placer can be used as a valid reference for evaluating our analytical placers. When comparing the run-time of both placers we see that our Java placer is on average 26% slower than VPR’s placer. A likely explanation is that VPR is written in C, while our placer is written in Java. Java code is managed code that runs in a virtual machine, whereas C is compiled ahead of time to machine code that runs directly on the CPU (unmanaged code), so some run-time overhead for the Java implementation is to be expected.
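
The geometric mean row of Table 4.3 can be reproduced with a few lines of code. The sketch below is only illustrative: it is written in Python rather than in the Java of our placers, the function name is hypothetical, and it starts from the rounded run-time ratios printed in the table, so it can only agree with the table up to rounding.

```python
import math

def geometric_mean(values):
    """Geometric mean: exp of the arithmetic mean of the logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Run-time ratios (our Java SA placer / VPR's SA placer) copied from Table 4.3.
runtime_ratios = [1.17, 1.66, 1.82, 1.25, 1.29, 1.00, 1.39, 1.46, 1.85,
                  1.16, 0.90, 1.04, 1.55, 1.04, 1.07, 1.21, 1.02]

mean_ratio = geometric_mean(runtime_ratios)
print(f"geometric mean run-time ratio: {mean_ratio:.2f}")
```

A geometric mean is the natural average for ratios: a circuit that runs twice as slow (ratio 2) and one that runs twice as fast (ratio 0.5) cancel out to exactly 1, which an arithmetic mean would not give.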

4.2.2 Timing-driven simulated annealing based placer

The placement results of the VTR benchmark circuits placed by our timing-driven simulated annealing based placer are shown in Table 4.4. For each benchmark we give the run-time of the algorithm (in seconds) and the quality of the obtained placement in terms of HPWL and critical path delay. These results will be used as a reference for the timing-driven analytical placers further on in this chapter. Note that this table is the only place where we present absolute numbers for the run-time and solution quality of a timing-driven placer. From now on all results for timing-driven placers will be presented as ratios with respect to these values. Also note that the HPWL numbers shown in Table 4.4 are divided by 1000 to keep the table readable.

In order to verify that our timing-driven simulated annealing based placer actually works, we compared the obtained HPWL and critical path delay with the results obtained using wire-length-driven simulated annealing. The results of this experiment are presented in Table 4.5. The HPWL ratio column gives the ratio of the HPWL obtained using timing-driven SA to


Benchmark name        Run-time (s)    HPWL (×1000)    Critical path delay

stereovision3         2,581           1,464           0,130
ch_intrinsics         8,676           8,379           0,305
diffeq1               12,84           13,40           0,670
diffeq2               7,216           7,261           0,520
mkSMAdapter4B         81,21           41,37           0,525
sha                   79,21           33,69           0,900
raygentop             111,6           59,24           0,765
or1200                199,6           116,8           1,300
mkPktMerge            22,68           34,64           0,565
boundtop              156,2           61,37           0,680
blob_merge            504,3           128,9           0,730
stereovision0         1698            179,3           0,640
mkDelayWorker32B      581,3           226,6           0,955
bgm                   7748            694,6           1,895
LU8PEEng              5107            550,5           6,860
stereovision1         1735            270,0           0,785
stereovision2         9178            1107            1,765
LU32PEEng             30602           2640            7,205
mcml                  39870           2634            6,945

Table 4.4: Timing-driven simulated annealing simulation results

the HPWL obtained using wire-length-driven SA. The results in the critical path delay ratio column are obtained in the same way.

When looking at the results presented in Table 4.5 we see that the critical path delay obtained using our timing-driven SA placer is on average 20% lower than the critical path delay obtained using our wire-length-driven SA placer. This comes at the cost of a 10% higher average HPWL cost. This was to be expected: it is impossible to fully optimize for both objectives at once, so a trade-off has to be made. These observations indicate that our timing-driven SA placer works as expected.
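
The averages quoted above hide some per-circuit variation. As an illustration only, the Python sketch below recomputes both geometric means from the rounded ratios of Table 4.5 and lists the circuits whose critical path delay did not improve. The dictionary and variable names are hypothetical and not part of our placer code.

```python
from statistics import geometric_mean

# (HPWL ratio, critical path delay ratio) of timing-driven SA relative to
# wire-length-driven SA, copied from Table 4.5.
table_4_5 = {
    "stereovision3": (1.04, 0.81), "ch_intrinsics": (1.03, 0.91),
    "diffeq1": (1.05, 0.79), "diffeq2": (1.01, 0.81),
    "mkSMAdapter4B": (1.07, 0.73), "sha": (1.08, 0.95),
    "raygentop": (1.04, 0.94), "or1200": (1.12, 0.55),
    "mkPktMerge": (1.07, 1.27), "boundtop": (1.12, 0.65),
    "blob_merge": (1.16, 0.71), "stereovision0": (1.24, 0.64),
    "mkDelayWorker32B": (1.07, 0.73), "bgm": (1.14, 0.70),
    "LU8PEEng": (1.06, 0.78), "stereovision1": (1.24, 1.03),
    "stereovision2": (1.15, 0.80), "LU32PEEng": (1.13, 0.83),
    "mcml": (1.07, 0.92),
}

hpwl_mean = geometric_mean([hpwl for hpwl, _ in table_4_5.values()])
cp_mean = geometric_mean([cp for _, cp in table_4_5.values()])

# Circuits for which timing-driven SA did not shorten the critical path.
not_improved = sorted(name for name, (_, cp) in table_4_5.items() if cp >= 1.0)

print(f"HPWL: {hpwl_mean:.2f}, critical path delay: {cp_mean:.2f}")
print("no critical path improvement:", not_improved)
```

Only two circuits, mkPktMerge and stereovision1, end up with a longer critical path than in the wire-length-driven case, so the average improvement is not carried by a few outliers.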

4.3 Wire-length-driven analytical placers

After discussing the results obtained for the simulated annealing based reference tools we will now evaluate our wire-length-driven analytical placers. We will only do this for our analytical placers using the bound-to-bound net model, as the placers using the source-sink net model gave inferior results.

Table 4.6 gives an overview of the results obtained for our wire-length-driven analytical placer using the bound-to-bound net model. All of these results are ratios using the wire-length-driven simulated annealing results as a reference (see Table 4.2). As we are discussing a wire-length-driven placer we will only evaluate the resulting HPWL. The second and third columns in Table 4.6 give the HPWL after global placement (the analytical placement itself). The first of these two columns gives the result when using complete legalization, the second gives the result when using gradual legalization. The maximal utilization sequence used in the


Benchmark name        HPWL ratio    Critical path delay ratio

stereovision3         1,04          0,81
ch_intrinsics         1,03          0,91
diffeq1               1,05          0,79
diffeq2               1,01          0,81
mkSMAdapter4B         1,07          0,73
sha                   1,08          0,95
raygentop             1,04          0,94
or1200                1,12          0,55
mkPktMerge            1,07          1,27
boundtop              1,12          0,65
blob_merge            1,16          0,71
stereovision0         1,24          0,64
mkDelayWorker32B      1,07          0,73
bgm                   1,14          0,70
LU8PEEng              1,06          0,78
stereovision1         1,24          1,03
stereovision2         1,15          0,80
LU32PEEng             1,13          0,83
mcml                  1,07          0,92


Table 4.5: Comparison between our wire-length-driven and timing-driven SA placers

gradual legalization algorithm was 4.0, 3.0, 2.0, 1.5 and 0.9 from the fifth iteration onwards (see Section 3.2.2 for more details). Columns four and five in Table 4.6 give the HPWL after detailed placement (refinement). We again differentiate between using complete legalization and gradual legalization. The two final columns in Table 4.6 give the total speed-up that was obtained. This total speed-up is defined as the ratio of the run-time of our wire-length-driven simulated annealing based placer to the total run-time (global + detailed) of the wire-length-driven analytical placer.
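
The two definitions in this paragraph are easy to state in code. The following Python sketch is an illustration under stated assumptions: iterations are numbered from 1, the function and constant names are hypothetical, and how the legalizer actually uses the maximal utilization is described in Section 3.2.2 and not modelled here.

```python
# Maximal utilization per iteration as quoted in the text: 4.0, 3.0, 2.0, 1.5,
# then 0.9 from the fifth iteration onwards.
UTILIZATION_SEQUENCE = (4.0, 3.0, 2.0, 1.5)
FINAL_UTILIZATION = 0.9

def max_utilization(iteration):
    """Return the maximal utilization for a 1-based iteration number."""
    if iteration < 1:
        raise ValueError("iteration numbers start at 1")
    if iteration <= len(UTILIZATION_SEQUENCE):
        return UTILIZATION_SEQUENCE[iteration - 1]
    return FINAL_UTILIZATION

def total_speed_up(sa_runtime, global_runtime, detailed_runtime):
    """Run-time of the SA reference divided by the global + detailed run-time."""
    return sa_runtime / (global_runtime + detailed_runtime)

print([max_utilization(i) for i in range(1, 8)])
print(total_speed_up(sa_runtime=100.0, global_runtime=2.0, detailed_runtime=18.0))
```

Keeping the sequence in one table makes it straightforward to try other maximal utilization sequences without touching the rest of the code.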

When looking at the results presented in Table 4.6 we see that the HPWL cost after global placement is on average 38% higher than for the reference wire-length-driven SA placer when using complete legalization. When using gradual legalization this is only 36%. It has to be noted that only a limited number of benchmark circuits get a better result using gradual legalization; these are mainly some of the bigger benchmark circuits. This is an indication that gradual legalization copes better with overlap involving a large number of blocks. Further research should be conducted to confirm this. Also, more maximal utilization sequences should be evaluated than we could do as part of this thesis. When looking at the largest benchmark designs (the bottom six in Table 4.6) we see that these do not perform as well as the other designs, even with gradual legalization. This is an indication that further research will have to be conducted on resolving large amounts of overlap. When looking at the results after refinement we see that on average a 3% higher HPWL cost is obtained compared to our wire-length-driven SA based reference placer. We consider this acceptable, especially since we get a 5× speed-up in exchange. The speed-up results for the individual benchmark circuits indicate that the parameters of our low temperature simulated annealing refinement stage are not yet optimal. When we look at the or1200 benchmark


                      HPWL global        HPWL detailed      Total speed-up
Benchmark name        total   gradual    total   gradual    total   gradual

stereovision3         1,37    1,38       1,13    1,10       3,77    3,14
ch_intrinsics         1,16    1,16       1,01    1,01       3,60    8,32
diffeq1               1,14    1,16       1,01    1,00       3,81    3,28
diffeq2               1,10    1,11       1,01    1,01       11,49   4,30
mkSMAdapter4B         1,21    1,24       1,00    1,00       5,60    3,92
sha                   1,20    1,20       1,04    1,06       10,58   12,99
raygentop             1,25    1,27       1,07    1,02       14,52   4,93
or1200                1,17    1,15       1,00    1,02       4,36    12,28
mkPktMerge            1,17    1,30       1,00    1,00       9,24    7,51
boundtop              1,23    1,25       1,06    1,04       9,17    5,57
blob_merge            1,30    1,34       0,99    0,99       3,10    5,38
stereovision0         1,39    1,41       1,03    1,04       7,97    7,16
mkDelayWorker32B      1,20    1,25       1,04    1,00       10,95   4,23
bgm                   1,49    1,41       1,00    1,02       3,12    6,98
LU8PEEng              1,78    1,86       1,03    1,02       3,34    2,77
stereovision1         2,16    1,65       1,01    1,10       3,72    5,06
stereovision2         2,01    1,68       1,02    1,02       2,73    2,79
LU32PEEng             1,80    1,86       1,03    1,01       2,42    2,39
mcml                  1,57    1,55       1,03    1,02       1,99    2,76

Geometric mean        1,38    1,36       1,03    1,03       5,08    4,92

Table 4.6: Wire-length-driven analytical placement results (bound-to-bound net model)

for example, both the result after global placement and the result after detailed placement are almost the same for total (complete) legalization and gradual legalization. Nevertheless, the achieved speed-up for total legalization is a lot lower than for gradual legalization. These results are contradictory and thus indicate that further research is needed to find good parameters for our low temperature simulated annealing refinement stage.

Table 4.7 gives the amount of time that is spent in global placement compared to the total run-time of the algorithm (global + detailed placement). This averages 8% for both total legalization and gradual legalization. The differences between total and gradual legalization for individual benchmark circuits are bigger than expected when looking at the global placement HPWL results in Table 4.6. This supports the conclusion that we have not yet found good parameters for the low temperature simulated annealing stage.
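
The 8% figure also tells us where optimization effort pays off. The Python sketch below applies an Amdahl's-law style estimate, which is our own back-of-the-envelope addition and not a measurement: it computes the overall gain when a single stage, taking a given fraction of the run-time, is accelerated by a given factor. The function name is hypothetical.

```python
import math

def overall_speed_up(stage_fraction, stage_speed_up=math.inf):
    """Overall speed-up when one stage of the flow is made faster.

    stage_fraction: share of the total run-time spent in that stage (0..1).
    stage_speed_up: factor by which that stage alone is accelerated.
    """
    if not 0.0 <= stage_fraction <= 1.0:
        raise ValueError("stage_fraction must lie between 0 and 1")
    if stage_speed_up <= 0:
        raise ValueError("stage_speed_up must be positive")
    new_runtime = (1.0 - stage_fraction) + stage_fraction / stage_speed_up
    return math.inf if new_runtime == 0.0 else 1.0 / new_runtime

# Global placement (8% of the run-time) made infinitely fast.
print(f"{overall_speed_up(0.08):.3f}")
# Refinement (the remaining 92%) made twice as fast.
print(f"{overall_speed_up(0.92, 2.0):.3f}")
```

Under these assumptions, even a free global placement step would gain less than 10% overall, while halving the refinement time would gain roughly 1.85×. This is why the refinement parameters matter so much for the total speed-up.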

4.4 Timing-driven analytical placers

After discussing the simulation results obtained for the wire-length-driven analytical placer we will now evaluate the timing-driven analytical placers. We start this section with a comparison of the different net models we can use with timing-driven analytical placement: the bound-to-bound net model, our source-sink net model and our hybrid net model (see Section 3.2.6 for more details). After this we will take a closer look at the results obtained with the best performing net model.


                      Run-time % in global placement
Benchmark name        total    gradual



Table 4.7: Run-time distribution wire-length-driven analytical placement

4.4.1 Comparing the different net models

Table 4.8 shows the obtained HPWL cost after global placement for the wire-length-driven analytical placer using complete legalization (column two), the wire-length-driven analytical placer using gradual legalization (column three), the timing-driven analytical placer using our source-sink net model (column four), the timing-driven analytical placer using the bound-to-bound net model (column five) and the timing-driven analytical placer using our hybrid net model (column six). These results are referenced to the results obtained using the timing-driven simulated annealing based reference placer (see Table 4.4). Note that we always use complete legalization when doing timing-driven analytical placement.

When looking at the results presented in Table 4.8 it is immediately clear that the timing-driven analytical placers using our source-sink net model and the bound-to-bound net model perform very poorly in terms of HPWL. They result in an average HPWL cost which is respectively 44% and 54% higher than that of our reference timing-driven simulated annealing placer (global placement only). Our hybrid net model on the other hand performs very well: the HPWL cost is only 20% higher on average than for the reference SA placer. This is even better than the performance of the wire-length-driven analytical placers. We suspect this to be a coincidence, but it could be worth investigating further.

Table 4.9 shows the obtained critical path delay after global placement for the wire-length-driven analytical placer using complete legalization (column two), the wire-length-driven analytical placer using gradual legalization (column three), the timing-driven analytical placer using our source-sink net model (column four), the timing-driven analytical placer using the


                      Wire-length-driven    Timing-driven
Benchmark name        total    gradual      source-sink    B2B     hybrid

stereovision3         1,31     1,32         1,36           1,33    1,31
ch_intrinsics         1,12     1,12         1,22           1,10    1,11
diffeq1               1,09     1,11         1,16           1,08    1,06
diffeq2               1,09     1,10         1,12           1,17    1,11
mkSMAdapter4B         1,13     1,16         1,32           1,33    1,14
sha                   1,12     1,12         1,50           1,85    1,17
raygentop             1,20     1,22         1,38           1,36    1,15
or1200                1,05     1,03         1,37           1,17    1,01
mkPktMerge            1,09     1,21         1,19           1,23    1,06
boundtop              1,10     1,12         1,40           1,43    1,11
blob_merge            1,12     1,15         1,46           1,66    1,16
stereovision0         1,12     1,13         1,44           2,04    1,06
mkDelayWorker32B      1,12     1,17         1,40           1,29    1,13
bgm                   1,31     1,23         1,65           1,70    1,23
LU8PEEng              1,69     1,76         1,99           2,72    1,65
stereovision1         1,74     1,33         1,43           1,94    1,23
stereovision2         1,75     1,46         1,57           1,61    1,44
LU32PEEng             1,59     1,65         1,71           2,44    1,55
mcml                  1,47     1,45         1,91           1,87    1,41

Geometric mean        1,25     1,24         1,44           1,54    1,20

Table 4.8: Timing-driven analytical placers: wire-length comparison after global placement

bound-to-bound net model (column five) and the timing-driven analytical placer using our hybrid net model (column six). These results are referenced to the results obtained using the timing-driven simulated annealing based reference placer (see Table 4.4).

When looking at the results presented in Table 4.9 we immediately see that the timing-driven analytical placers using the source-sink net model and the bound-to-bound net model do not give good results in terms of critical path delay either. They result in an average critical path delay which is respectively 46% and 51% higher than that obtained with our reference timing-driven simulated annealing based placer. This confirms our findings from Table 4.8: the bound-to-bound net model and, unfortunately, also the source-sink net model do not perform well in timing-driven analytical placement. When looking at the results obtained using our new hybrid net model we can be much more optimistic: the average critical path delay is only 34% higher than that obtained using our reference timing-driven SA placer. This is by far the best of all considered analytical placers. As the hybrid net model also performed best in terms of HPWL, we will only study the results after detailed placement for the timing-driven analytical placer using this net model.

4.4.2 Results after refinement for our hybrid net model

Table 4.10 summarizes all results obtained with the timing-driven analytical placer using our new hybrid net model. All these results are referenced to the results obtained using the timing-driven simulated annealing based reference placer. Columns two and three show the HPWL cost and critical path delay (respectively) obtained after global placement. These results were


                      Wire-length-driven    Timing-driven
Benchmark name        total    gradual      source-sink    B2B     hybrid

Geometric mean        1,55     1,54         1,46           1,51    1,34

Table 4.9: Timing-driven analytical placers: critical path delay comparison after global placement

already given in Tables 4.8 and 4.9 respectively, but are repeated here for convenience. Columns four and five show the HPWL cost and critical path delay (respectively) obtained after detailed placement. Finally, column six gives the total speed-up that is achieved compared to the timing-driven reference SA placer.

When we look at the results presented in Table 4.10 we see that we end up with an HPWL cost after detailed placement which is on average 2% higher than that of the reference timing-driven SA placer. The critical path delay after detailed placement on the other hand is on average 2% lower than that of the reference timing-driven SA placer. This shows that our analytical placer delivers quality results. The average total speed-up is 2.86×. This is less than in the wire-length-driven analytical placement case, but can still be considered acceptable given the obtained quality of result.

Table 4.11 gives the amount of time that is spent in global placement compared to the total run-time of the algorithm (global + detailed placement). This averages 4%. As was to be expected given the lower speed-ups found in Table 4.10 compared to Table 4.6, this percentage is lower than in the wire-length-driven case.

To conclude this chapter we will compare the maximal memory usage of the bound-to-bound net model and our hybrid net model. Our hybrid net model will only be of any practical use when the additional memory that is needed to store the x- and y-matrices remains small. Therefore this experiment is of great importance. The results of the experiment are shown in Table 4.12. The second column shows the maximal amount of memory needed to store the x-


                      Global          Detailed
Benchmark name        HPWL    CP      HPWL    CP      Total speed-up

Geometric mean        1,20    1,34    1,02    0,98    2,86

Table 4.10: Total evaluation of timing-driven analytical placer using hybrid net model

and y-matrices during wire-length-driven analytical placement using the bound-to-bound net model. The third column shows the maximal amount of memory needed to store the x- and y-matrices during timing-driven analytical placement using our hybrid net model. Finally, the fourth column shows the ratio of the third column to the second column.

When looking at the results presented in Table 4.12 we see that our hybrid net model uses 8% more memory on average than the bound-to-bound net model. This additional memory usage is very limited compared to the improvement in the quality of result and is thus acceptable.


Benchmark name        Run-time % in global placement

stereovision3         25,0
ch_intrinsics         8,1
diffeq1               6,2
diffeq2               50,7
mkSMAdapter4B         4,5
sha                   5,0
raygentop             3,6
or1200                2,0
mkPktMerge            0,7
boundtop              3,2
blob_merge            2,8
stereovision0         1,8
mkDelayWorker32B      1,7
bgm                   3,0
LU8PEEng              3,1
stereovision1         2,8
stereovision2         1,7
LU32PEEng             2,9
mcml                  11,8

Geometric mean        4,0

Table 4.11: Run-time distribution timing-driven analytical placement

                      Maximal memory usage (MB)
Benchmark name        WLD B2B    TD hybrid    Ratio

stereovision3         0,035      0,040        1,14
ch_intrinsics         0,085      0,094        1,10
diffeq1               0,120      0,131        1,09
diffeq2               0,079      0,085        1,07
mkSMAdapter4B         0,560      0,604        1,08
sha                   0,646      0,690        1,07
raygentop             0,583      0,633        1,09
or1200                0,916      0,964        1,05
mkPktMerge            0,062      0,063        1,02
boundtop              0,824      0,871        1,06
blob_merge            2,062      2,224        1,08
stereovision0         2,511      2,634        1,05
mkDelayWorker32B      1,449      1,576        1,09
bgm                   11,957     13,383       1,12
LU8PEEng              8,441      9,196        1,09
stereovision1         2,447      2,543        1,04
stereovision2         7,640      8,088        1,06
LU32PEEng             29,146     31,412       1,08
mcml                  25,759     30,274       1,18

Geometric mean                                1,08

Table 4.12: Maximal memory usage comparison

Chapter 5

Conclusions and future work

5.1 Conclusions

One of the two main goals of this thesis was to provide an open source framework for analytical placement on FPGAs. Our wire-length-driven analytical placer (based on [6]) achieves an average speed-up of 5× while increasing the average HPWL cost by only 3% compared to a high effort placement generated by a wire-length-driven simulated annealing based placer. We can therefore state that a working analytical placer was built. Because our code will be made publicly available, other research groups can easily build upon our work. We thus conclude that the first main goal of this thesis was achieved.

While working on the wire-length-driven analytical placement algorithm we have introduced a new gradual legalization method. This new legalization method has been shown to give good results on benchmark circuits where overlap involving a very large number of blocks has to be resolved during the course of the algorithm. On average the HPWL cost after global placement dropped by 2 percentage points (from 38% to 36% above the SA reference) compared to the complete legalization algorithm. On some of the individual benchmark circuits the HPWL cost decrease was as high as 23%. We strongly suspect that even better results could be obtained by optimizing the parameters of the gradual legalization algorithm.

The second main goal of this thesis was to introduce new methods for incorporating timing information in analytical placement algorithms. We have proposed two new net models in order to achieve this goal: the source-sink net model and the hybrid net model. The source-sink net model did not meet our expectations. The hybrid net model on the other hand performed very well. Using this hybrid net model an average speed-up of 2.86× was achieved compared to a high effort placement generated by a timing-driven simulated annealing based placer. This comes with an average HPWL cost increase of only 2% and an average critical path delay decrease of 2%. Unfortunately the use of our hybrid net model increases the memory usage of the algorithm compared to the bound-to-bound net model, albeit by only 8% on average.


5.2 Future work

5.2.1 Comparison to low effort simulated annealing

All of the results presented in this text were referenced to a high effort simulated annealing based placer. This placer uses an inner_num (placement effort) of 10. A low effort simulated anneal is obtained by changing inner_num to 1. Doing so decreases the number of executed swaps per temperature by a factor of 10, and thus the run-time of the low effort simulated anneal will be approximately 10 times shorter than that of a high effort anneal. Table 5.1 shows the speed-up and HPWL of a low effort wire-length-driven simulated anneal referenced to a high effort simulated anneal. It is immediately clear that the predicted speed-up of 10× is confirmed. The HPWL increase averages 7%, which is a surprisingly good result. This explains why the developers of VPR have decreased the default effort level of their placer from 10 to 1. Unfortunately, the literature about VPR that we consulted still mentions a default value of 10. Moreover, the decrease in quality of result is never specified to be as low as 7%. Because of this we have been comparing our analytical placer results to the high effort simulated annealing placer for far too long. The results presented in Table 5.1 show that this was the wrong choice.
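
The link between inner_num and run-time can be made explicit. The Python sketch below assumes the rule used by VPR's annealer, in which the number of swaps attempted per temperature is inner_num × N^(4/3) for N movable blocks. Whether our Java placer uses exactly this expression is an assumption of this sketch, and the function name is hypothetical.

```python
def moves_per_temperature(inner_num, num_blocks):
    """Swaps attempted at one temperature, assuming inner_num * N^(4/3)."""
    if inner_num <= 0 or num_blocks <= 0:
        raise ValueError("inner_num and num_blocks must be positive")
    return round(inner_num * num_blocks ** (4.0 / 3.0))

num_blocks = 1000  # illustrative circuit size, not one of the benchmarks
high_effort = moves_per_temperature(10, num_blocks)
low_effort = moves_per_temperature(1, num_blocks)

print(high_effort, low_effort, high_effort / low_effort)
```

As long as the temperature schedule visits a similar number of temperatures, the total number of swaps, and hence the run-time, scales linearly with inner_num. This is consistent with the predicted 10× speed-up.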

Benchmark name Speed-up HPWL



Table 5.1: Low effort vs high effort wire-length-driven simulated annealing

Table 5.2 shows the speed-up, HPWL and critical path delay of a low effort timing-driven simulated anneal referenced to a high effort simulated anneal. It is clear that the observations made for low effort wire-length-driven simulated annealing are confirmed in the timing-driven case. We observe an average speed-up of 8.5× while the average HPWL only increases by 6%. An interesting result is found for the obtained average critical path delay: a decrease of


5% is observed. This can be explained as follows: during timing-driven simulated annealing the slacks in the timing graph are only updated once per temperature. The number of swaps per temperature is 10× lower for a low effort anneal than for a high effort anneal, so in a low effort anneal far fewer swaps are evaluated against outdated slacks between two updates. The results presented in Table 5.2 indicate that updating the slacks of the timing graph only once per temperature is not enough during high effort timing-driven simulated annealing.
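
To make this argument concrete, the Python sketch below counts slack recalculations for a given annealing schedule. It is a hypothetical illustration, not our placer code: the optional `swaps_between_updates` parameter models a possible remedy (extra recalculations inside a temperature) that was not evaluated in this thesis.

```python
def count_slack_updates(num_temperatures, moves_per_temperature,
                        swaps_between_updates=None):
    """Count timing-graph slack recalculations in one annealing run.

    With swaps_between_updates=None the slacks are recalculated once per
    temperature, as described in the text. A positive integer adds a
    hypothetical extra recalculation after every that many swaps.
    """
    updates = 0
    for _ in range(num_temperatures):
        updates += 1  # recalculation at the start of each temperature
        if swaps_between_updates:
            # extra recalculations strictly inside the temperature
            updates += (moves_per_temperature - 1) // swaps_between_updates
    return updates

temperatures = 5  # illustrative value
for label, moves, interval in [("high effort", 100_000, None),
                               ("low effort", 10_000, None),
                               ("high effort, extra updates", 100_000, 10_000)]:
    updates = count_slack_updates(temperatures, moves, interval)
    print(label, (temperatures * moves) // updates, "swaps per slack update")
```

With one update per temperature, a high effort anneal performs ten times as many swaps per slack update as a low effort anneal. Adding a recalculation every 10 000 swaps brings the high effort run back to the same update density.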

Benchmark name Speed-up HPWL CP


Geometric mean 8,5 1,06 0,95

Table 5.2: Low effort vs high effort timing-driven simulated annealing

Because of the good results obtained with low effort simulated annealing we have compared our analytical placers with these results. To do this, the refinement step of our analytical placer was slightly adapted: we now used an effort level (inner_num) of 1, instead of the effort level of 4 that was used when comparing to high effort simulated annealing. All other parameters remained the same.

Wire-length-driven analytical placement

The results obtained for wire-length-driven analytical placement are presented in Table 5.3. The gradual legalization method was used.

It is immediately clear that the speed-up of our wire-length-driven analytical placer has seriously declined compared to the high effort comparison done before. The new geometric mean is only 1.32×. The final solution quality (HPWL) is 1% better on average. Especially the larger benchmarks do not provide good results in terms of speed-up. This indicates that serious efforts are still needed to increase the performance on big benchmark designs. Most importantly, the legalization of large amounts of overlapping blocks needs to be looked into.


Benchmark name Speed-up HPWL



Table 5.3: Low effort wire-length-driven placement: analytical versus simulated annealing

This is because Table 4.6 has indicated that for large benchmark designs the results obtained after global placement are not good enough yet. Improvements could also be made to the run-time efficiency of our analytical (global) placement code.

When looking back at Table 5.3 we see that the stereovision3 and mcml benchmarks gave particularly bad results in terms of speed-up. For the stereovision3 benchmark this can be attributed to the compilation of the Java byte code. This benchmark was always run first, and because of this all Java byte code had to be compiled as it was executed for the first time. When doing analytical placement (global + detailed placement) more Java byte code needs to be compiled than when doing simulated annealing, which affects the run-time ratio. As the other benchmarks were run when compiled instructions were already available, this effect was not present there. The mcml benchmark had a very high run-time for the global placement phase. This effect was also observed in the LU32PEEng benchmark, albeit less pronounced. We cannot explain this at this point, but suspect that memory was being swapped to the hard disk, or that cache effects were at play.
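
A common way to keep such one-off start-up costs out of a run-time measurement is to execute the workload a few times before the timed runs. The Python sketch below shows only this measurement pattern. It was not used for the results in this text, the function name is hypothetical, and in Java the untimed runs would be the ones absorbing the byte code compilation.

```python
import time

def time_with_warm_up(task, warm_up_runs=1, measured_runs=3):
    """Run task untimed a few times, then return the timed durations in seconds."""
    for _ in range(warm_up_runs):
        task()  # untimed: one-off start-up costs are paid here
    durations = []
    for _ in range(measured_runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations

calls = []

def dummy_placement():
    """Stand-in workload; a real experiment would place a benchmark circuit."""
    calls.append(sum(i * i for i in range(10_000)))

durations = time_with_warm_up(dummy_placement, warm_up_runs=1, measured_runs=3)
print(f"{len(calls)} runs executed, {len(durations)} of them timed")
```

Applied to our experiments, running a small circuit through both placers before the first measured benchmark would be one way to remove the bias observed for stereovision3.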

Our wire-length-driven analytical placer was strongly based on HeAP [6]. The developers of HeAP report a speed-up of 7.4× accompanied by a 6% decrease in final wire-length. This comparison was made with a low effort simulated annealing based placer. We have to conclude that we cannot reach those results yet, even though all available information about HeAP was used during the development of our placers. It has to be mentioned that the authors of HeAP used different benchmark designs and FPGA architectures than we did. This may play a role, although it does not explain the complete difference. Section 5.2.2 makes some proposals for improvements that could help close the gap with HeAP.


Timing-driven analytical placement

The results obtained for timing-driven analytical placement are presented in Table 5.4. The hybrid net model was used.

Benchmark name Speed-up HPWL CP


Geometric mean 1,16 1,01 1,07

Table 5.4: Low effort timing-driven placement: analytical versus simulated annealing

The observations made during the wire-length-driven comparison are confirmed for the timing-driven case. With an average speed-up of 1.16×, an average HPWL increase of 1% and an average critical path delay increase of 7%, the results are even slightly worse. It has to be said that during the timing-driven analytical placement of the first four benchmark circuits listed in Table 5.4 no correct initial temperature could be found by the low temperature annealing algorithm. This resulted in an excessive run-time, as a default initial temperature was used which was far too high.

As in the wire-length-driven case it is clear that some work still needs to be done on improving the results for large benchmark circuits and improving the refinement parameters.

5.2.2 Other future work

We have introduced a new gradual legalization method. This new legalization method has shown some promising results, but some work still needs to be done to make these results more consistent over multiple benchmark circuits. More precisely, we suspect that a close study of how to calculate the maximal utilization sequence could improve the obtained results drastically.


The simulation results presented in Chapter 4 and Section 5.2.1 have indicated that our low temperature simulated annealing refinement step does not always work very well. As experiments have shown that on average the algorithm spends close to 90% of its time in the refinement stage, it would be very useful to study how the low temperature simulated annealing parameters can be improved, or to do research on new refinement methods.

Our new hybrid net model has shown some very promising results, but unfortunately its memory usage is higher than when using the bound-to-bound net model. A study investigating whether some of the connections can be left out could decrease this memory usage. Especially when a pin is connected to both bound-to-bound and source-sink connections, it could be worth investigating whether some of these bound-to-bound connections could be left out without deteriorating the obtained results.

The analytical placers developed as part of this thesis will be made hierarchical. This means that the placement problem is divided into several sub-problems which are solved separately (to some extent). We expect that this will improve both the run-time and the quality of result of the global placement step. Making the placer multilevel and improving the refinement step will be the main focus of the thesis of Seppe Lenders, to be handed in in January 2016.

Bibliography

[1] Vaughn Betz and Jonathan Rose. VPR: A new packing, placement and routing tool for FPGA research. In Proceedings of the 7th International Workshop on Field-Programmable Logic and Applications, FPL ’97, pages 213–222, London, UK, 1997. Springer-Verlag.

[2] Vaughn Betz, Jonathan Rose, and Alexander Marquardt, editors. Architecture and CAD for Deep-Submicron FPGAs. Kluwer Academic Publishers, Norwell, MA, USA, 1999.

[3] Huimin Bian, Andrew C. Ling, Alexander Choong, and Jianwen Zhu. Towards scalable placement for FPGAs. In Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA ’10, pages 147–156, New York, NY, USA, 2010. ACM.

[4] Chih-Liang Eric Cheng. RISA: Accurate and efficient placement routability modeling. In Proceedings of the 1994 IEEE/ACM International Conference on Computer-aided Design, ICCAD ’94, pages 690–695, Los Alamitos, CA, USA, 1994. IEEE Computer Society Press.

[5] Padmini Gopalakrishnan, Xin Li, and Lawrence Pileggi. Architecture-aware FPGA placement using metric embedding. In Proceedings of the 43rd Annual Design Automation Conference, DAC ’06, pages 460–465, New York, NY, USA, 2006. ACM.

[6] Marcel Gort and Jason Helge Anderson. Analytical placement for heterogeneous FPGAs. In Dirk Koch, Satnam Singh, and Jim Tørresen, editors, FPL, pages 143–150. IEEE, 2012.

[7] V. Granville, M. Krivanek, and J.-P. Rasson. Simulated annealing: a proof of convergence. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 16(6):652–656, Jun 1994.

[8] Myung-Chul Kim, Dongjin Lee, and Igor L. Markov. SimPL: An effective placement algorithm. IEEE Trans. on CAD of Integrated Circuits and Systems, 31(1):50–60, 2012.

[9] Guadalupe Garcia Ledesma Sergio, Jose Ruiz. Simulated annealing - advances, applications and hybridizations. August 2012.

[10] Tzu-Hen Lin, Pritha Banerjee, and Yao-Wen Chang. An efficient and effective analytical placer for FPGAs. In Proceedings of the 50th Annual Design Automation Conference, DAC ’13, pages 10:1–10:6, New York, NY, USA, 2013. ACM.

[11] Jason Luu, Jeffrey Goeders, Michael Wainberg, Andrew Somerville, Thien Yu, Konstantin Nasartschuk, Miad Nasr, Sen Wang, Tim Liu, Nooruddin Ahmed, Kenneth B. Kent, Jason Anderson, Jonathan Rose, and Vaughn Betz. VTR 7.0: Next generation architecture and CAD system for FPGAs. ACM Trans. Reconfigurable Technol. Syst., 7(2):6:1–6:30, July 2014.


[12] Alexander Marquardt, Vaughn Betz, and Jonathan Rose. Timing-driven placement forfpgas. In Proceedings of the 2000 ACM/SIGDA Eighth International Symposium on FieldProgrammable Gate Arrays, FPGA ’00, pages 203–213, New York, NY, USA, 2000. ACM.

[13] Fan Mo, Abdallah Tabbara, and Robert K. Brayton. A force-directed macro-cell placer. InProceedings of the 2000 IEEE/ACM International Conference on Computer-aided Design,ICCAD ’00, pages 177–181, Piscataway, NJ, USA, 2000. IEEE Press.

[14] P. Spindler, U. Schlichtmann, and F. M. Johannes. Kraftwerk - a fast force-directedquadratic placement approach using an accurate net model. Trans. Comp.-Aided Des.Integ. Cir. Sys., 27(8):1398–1411, August 2008.

[15] Neil Steiner, Aaron Wood, Hamid Shojaei, Jacob Couch, Peter Athanas, and MatthewFrench. Torc: Towards an open-source tool flow. In Proceedings of the 19th ACM/SIGDAInternational Symposium on Field Programmable Gate Arrays, FPGA ’11, pages 41–44,New York, NY, USA, 2011. ACM.

[16] Natarajan Viswanathan and Chris Chong nuen Chu. Fastplace: Efficient analytical place-ment using cell shifting, iterative local refinement and a hybrid net model. pages 26–33,2004.

[17] Xilinx. XC4000E and XC4000X Series Field Programmable Gate Arrays, May 1999.

[18] Xilinx. XtremeDSP for Virtex-4 FPGAs, May 2008.

[19] M. Xu, G. Grewal, and S. Areibi. Starplace: A new analytic method for {FPGA} place-ment. Integration, the {VLSI} Journal, 44(3):192 – 204, 2011.

[20] Yonghong Xu and Mohammed A. S. Khalid. QPF: Efficient quadratic placement for FPGAs. In Tero Rissa, Steven J. E. Wilton, and Philip Heng Wai Leong, editors, FPL, pages 555–558. IEEE, 2005.

List of Figures

2.1 FPGA architectures

2.2 A logic block containing one LUT and one register

2.3 Switch matrix found in Xilinx XC4000E and XC4000X devices [17]

2.4 DSP48 slice as found in Xilinx Virtex-4 FPGAs [18]

2.5 Schematic representation of the FPGA toolflow

2.6 The bounding box of a net

2.7 The timing graph of a simple circuit [2]

2.8 Two circuit paths with a different logic depth

2.9 The search range when searching for a random swap site

2.10 Simulated annealing temperature schedule [9]

2.11 A very simple netlist to be placed

2.12 The run-time of VPR’s SA placer versus the problem size (number of movable blocks) on a log-log chart

3.1 Top level flowchart of the analytical placer built as part of this thesis

3.2 A net containing five pins in total (two fixed and three movable)

3.3 Graphical representation of the clique net model

3.4 Graphical representation of the star net model

3.5 Graphical representation of the bound-to-bound net model

3.6 Graphical representation of the source-sink net model

3.7 More complex example circuit to be placed

3.8 Linear solution (indicated in blue dotted line) of the more complex example

3.9 Flowchart of the legalization algorithm on homogeneous FPGAs

3.10 A clustered over-utilized area (blue dotted line) and its expanded area (green dotted line)

3.11 Flowchart of the recursive partitioning algorithm

3.12 Visual representation of the source and target cuts

3.13 A clustered over-utilized area (blue dotted line) and its expanded area (green dotted line) on a heterogeneous architecture

3.14 Gradual legalization on a homogeneous architecture

3.15 The pseudo connections added to the more complex example

3.16 Convergence of the sha benchmark

4.1 An example of a generated architecture containing two different types of hard blocks

List of Tables

2.1 q(n) as a function of the number of pins connected by the net n [4]

2.2 Memory usage of matrices in analytical placement versus working data memory usage in simulated annealing placement

4.1 Overview of the benchmark designs

4.2 Wire-length-driven simulated annealing results (fixed IOs)

4.3 Comparison between our Java-based SA placer and VPR’s SA placer (movable IOs, wire-length-driven)

4.4 Timing-driven simulated annealing simulation results

4.5 Comparison between our wire-length-driven and timing-driven SA placers

4.6 Wire-length-driven analytical placement results (bound-to-bound net model)

4.7 Run-time distribution of wire-length-driven analytical placement

4.8 Timing-driven analytical placers: wire-length comparison after global placement

4.9 Timing-driven analytical placers: critical path delay comparison after global placement

4.10 Total evaluation of timing-driven analytical placer using hybrid net model

4.11 Run-time distribution of timing-driven analytical placement

4.12 Maximal memory usage comparison

5.1 Low effort vs high effort wire-length-driven simulated annealing

5.2 Low effort vs high effort timing-driven simulated annealing

5.3 Low effort wire-length-driven placement: analytical versus simulated annealing

5.4 Low effort timing-driven placement: analytical versus simulated annealing
