
    A Fast and Efficient Algorithm for FPGA Detailed Placement

    Aseem Sayal, Ronak Oswal

    Department of Electrical and Computer Engineering, The University of Texas at Austin

    Abstract: This paper addresses detailed placement for FPGAs. The authors propose and implement two heuristic-based detailed placement algorithms for FPGAs. In addition, the authors implement the global swap, vertical swap and local reordering placement algorithms discussed in [1]. These algorithms complement each other and are blended to obtain the best overall performance. To address the runtime vs. performance trade-off, the solution provides four optimization modes, namely High Speed Approach (HSA), High Speed Performance Aware (HSPA), High Performance Speed Aware (HPSA), and High Performance Approach (HPA). The algorithms are designed so that runtimes scale to the larger benchmarks. They are also congestion aware, which preserves the bin utilization. Performance is evaluated in terms of HPWL reduction and bin utilization relative to the global placement stage for all the benchmarks provided in the class. The authors achieve a 12.80% HPWL reduction for the FPGA06 benchmark in high performance mode, and a 3.26% reduction in 12.8 seconds for FPGA01 in high speed mode. Compared with the results presented by the other teams in class, these are among the best results in terms of HPWL reduction and runtime for almost all the benchmarks, with notably lower runtime for the same performance on the larger benchmarks.

    I. INTRODUCTION

    The Field Programmable Gate Array (FPGA) is a type of pre-manufactured integrated circuit designed to be configured by customers or designers. FPGAs are becoming increasingly popular because they can be re-programmed in the field to fix bugs, shorten time to market, and lower non-recurring engineering costs. Historically, FPGAs were used only for fast realization of small digital circuits. However, in recent years the gate count of commercial FPGAs has reached the scale of millions [2], so much more complex digital systems are moving towards FPGA-based design methodologies.

    Traditionally, placement is separated into two stages, global and detailed placement. The main purpose of global placement is to distribute the cells evenly over the placement region and optimize certain objectives such as wirelength. Because a global view has to be maintained, some approximation is made to simplify the problem. Global placement also pays more attention to the relative positions among cells globally, and hence neglects some local problems. Detailed placement works on the legalized placement to further improve the solution quality. It is more constrained than global placement, as it optimizes the objectives by transforming one legal placement solution into another. Because of this, more accurate models such as half-perimeter wirelength are used in detailed placement.

    This work deals with the design and implementation of detailed placement algorithms for FPGAs. The problem statement is as follows: given a legal FPGA placement of a netlist N, detailed placement performs local refinements and seeks legal locations for the circuit elements such that the total wirelength is minimized and the cell density is preserved. The FPGA is composed of four types of cells, namely Configurable Logic Block (CLB), Digital Signal Processor (DSP), Random-Access Memory (RAM) and Input/Output cell (IO). Typically, the majority (99%) of cells are CLBs. This work is mainly the final step of the ISPD 2016 placement contest [3]. Fig. 1 shows the FPGA01 and FPGA05 ISPD 2016 benchmarks, on which detailed placement is performed.

    A. Literature Survey

    The numerous previous works on FPGA placement can be classified into three major categories: (1) simulated-annealing-based approaches, (2) partitioning-based approaches, and (3) analytical approaches. The best-known academic tool, VPR [11], applies simulated annealing as its main tool to optimize objectives such as wirelength and timing. Although it can achieve high-quality results, its running time becomes a major drawback when placing a large circuit. Partitioning-based approaches like [4] shorten the running time by recursively partitioning a design and placing the partitions hierarchically. However, partitioning-based methods may yield poor quality, because the problem is solved locally after partitioning and these methods cannot consider global optimality. Compared to these two methods, analytical approaches are more favorable, especially as the gap between FPGAs and ASICs becomes smaller. Not only have industrial placers migrated from traditional simulated-annealing-based placement to an analytical approach, but several new academic placement tools using an analytical approach have also been shown to produce competitive results with much less running time on FPGAs. In [5], SimPL [6] is applied to FPGA placement, which shows the potential of analytical methods in FPGA placement. In [7], NTUplace is used as the basic framework of the proposed analytical FPGA placer. Besides the placers mentioned above, there are other analytical placers such as LLP [8], StarPlace [9] and QPF [10].

    [11], [12], [13] employed a window-based branch-and-bound method for detailed placement. Domino [14] transformed the placement problem into a transportation problem that was solved using a network flow algorithm. Kahng et al. [15] employed combinatorial techniques to perform legalization and detailed placement based on several different objectives. In [16], the single-row problem was solved optimally using a dynamic programming approach. In [17], Hur and Lillis proposed a technique called optimal interleaving and also incorporated the dynamic clustering technique [18]. Current detailed placement techniques are either not very effective or too slow. The window-based technique is very local if the window size is small; if a big window is used, the runtime is not affordable. Domino is considered a very good detailed placer, but it consumes a lot of runtime.

    Fig. 1: FPGA01 and FPGA05 ISPD 2016 benchmarks.

    B. Our Contribution

    In this paper, the authors propose and implement two heuristic-based detailed placement algorithms for FPGAs. In addition, the authors implement the global swap, vertical swap and local reordering placement algorithms discussed in [1]. All these algorithms are blended to achieve the best performance quickly. The paper is organized as follows. Section II discusses all the implemented existing and proposed algorithms in detail. Section III presents the design methodology, the major optimizations and the trade-off considerations. Section IV describes the results achieved for all the benchmarks in the normal and congestion-aware cases, and compares the performance with that of the others who presented in class. The work is concluded in Section V. Section VI presents the future scope of this work.

    II. DETAILED PLACEMENT TECHNIQUES

    This section discusses all the algorithms implemented for the detailed placement problem. It is divided into four subsections. The first discusses the global swap, vertical swap and local reordering techniques in terms of implementation, optimizations, runtimes, merits and demerits. The second introduces the terminology used by the proposed algorithms. The third and fourth discuss the proposed algorithms in terms of implementation, pros, cons and runtime complexity.

    A. Global and Local X/Y Swap Technique

    The idea is borrowed from the classical detailed placement paper [1], with some optimizations by the authors to achieve better results. Three techniques are used. Global Swap efficiently identifies a good pair of cells to swap globally, based on their optimal positions while all other cells are fixed. Vertical Swap swaps a cell with a nearby cell in the segment above or below so as to move it in the direction of its optimal position. Finally, Local Reordering re-orders consecutive standard cells locally to reduce the wirelength. All these techniques are implemented. The half-perimeter wirelength (HPWL) is evaluated by (1) and (2).

    $$\mathrm{HPWL}_e = |x_{\max} - x_{\min}| + |y_{\max} - y_{\min}| \tag{1}$$

    $$\mathrm{HPWL} = \sum_{e \in E} \mathrm{HPWL}_e \tag{2}$$
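    As a minimal sketch of (1) and (2), the following Python snippet computes the HPWL of a small netlist. The data layout (a dictionary of nets and a dictionary of cell positions) and the function names are illustrative assumptions, not the authors' C++ implementation.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def net_hpwl(pins: List[Point]) -> float:
    """Half-perimeter wirelength of one net, as in (1)."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_hpwl(nets: Dict[str, List[str]], pos: Dict[str, Point]) -> float:
    """Sum of the per-net HPWL over all nets, as in (2)."""
    return sum(net_hpwl([pos[c] for c in cells]) for cells in nets.values())


# Hypothetical three-cell example: net n0 spans 3 + 1, net n1 spans 3 + 4.
pos = {"a": (0, 0), "b": (3, 1), "c": (1, 4)}
nets = {"n0": ["a", "b"], "n1": ["a", "b", "c"]}
print(total_hpwl(nets, pos))  # 11
```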

    The swap and move operations are performed by moving cells into their optimal regions. Given that all other cells in the circuit are fixed, the “optimal region” of a cell is defined as the region in which placing the cell yields the optimal wirelength. This region is determined based on the median idea [1]. The optimal region for cell 1 is shown in Fig. 2. The order in which the algorithm is executed is given in Fig. 3. For every cell, the optimal region is evaluated, and the cell is then moved to a site, or swapped with a cell, in its optimal region if the operation yields a net HPWL improvement. If no good swap pair is found, a vertical swap is performed. The authors use a scanning window of size 3, so a number of sites equal to the scanning window size, in the direction of the optimal region, is considered for move/swap operations. The same action is then performed in the horizontal direction.
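    The median idea of [1] can be sketched as follows: for each net on the cell, take the bounding box of the other cells on that net, collect the lower and upper bounds per axis, sort them, and take the two middle values as the optimal range. The function and variable names below are assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]
Range = Tuple[float, float]


def optimal_region(cell: str,
                   nets_of: Dict[str, List[str]],
                   nets: Dict[str, List[str]],
                   pos: Dict[str, Point]) -> Optional[Tuple[Range, Range]]:
    """Return ((x_lo, x_hi), (y_lo, y_hi)) for `cell`, all other cells fixed."""
    xs: List[float] = []
    ys: List[float] = []
    for net in nets_of[cell]:
        others = [pos[c] for c in nets[net] if c != cell]
        if not others:
            continue  # a net with no other cell does not constrain the region
        ox = [p[0] for p in others]
        oy = [p[1] for p in others]
        xs += [min(ox), max(ox)]
        ys += [min(oy), max(oy)]
    if not xs:
        return None
    xs.sort()
    ys.sort()
    k = len(xs) // 2  # the lists have even length, so k-1 and k are the medians
    return (xs[k - 1], xs[k]), (ys[k - 1], ys[k])


# Hypothetical example: cell "a" sits on a 2-pin net and a 3-pin net.
pos = {"a": (0, 0), "b": (2, 2), "c": (6, 1), "d": (8, 5)}
nets = {"n0": ["a", "b"], "n1": ["a", "c", "d"]}
nets_of = {"a": ["n0", "n1"]}
print(optimal_region("a", nets_of, nets, pos))  # ((2, 6), (2, 2))
```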

    The authors modified this algorithm to make it more efficient in terms of runtime and performance. The conventional approach is very slow, with O(n^2) time complexity. To reduce runtime, the authors skip high-fanout nets while evaluating the HPWL before and after a swap operation. This has very little detrimental effect on performance, but reduces runtime by about 3X. The authors also reduce the runtime of the global swap operation by limiting the size of the search to 20 and swapping cells as soon as an improvement is found, rather than scanning the whole optimal region. This halves the runtime. All the implemented algorithms are linear in time complexity.
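    The two runtime optimizations above (skipping high-fanout nets and accepting the first improving swap among a bounded number of candidates) could look roughly like the sketch below. The fanout threshold of 16, the reading of "size 20" as a candidate count, and all names are assumptions; the paper does not state them.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Point = Tuple[float, float]

MAX_FANOUT = 16       # hypothetical cutoff for "high-fanout" nets
MAX_CANDIDATES = 20   # search limit mentioned in the text (interpretation)


def hpwl_of(net_ids: Iterable[str], nets: Dict[str, List[str]],
            pos: Dict[str, Point]) -> float:
    """HPWL summed over the given nets only."""
    total = 0.0
    for n in net_ids:
        xs = [pos[c][0] for c in nets[n]]
        ys = [pos[c][1] for c in nets[n]]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def first_improving_swap(cell: str, candidates: List[str],
                         nets: Dict[str, List[str]],
                         nets_of: Dict[str, List[str]],
                         pos: Dict[str, Point]) -> Optional[str]:
    """Swap `cell` with the first candidate that reduces HPWL; return it."""
    for other in candidates[:MAX_CANDIDATES]:
        # Only low-fanout nets touching either cell are evaluated.
        affected = {n for c in (cell, other) for n in nets_of[c]
                    if len(nets[n]) <= MAX_FANOUT}
        before = hpwl_of(affected, nets, pos)
        pos[cell], pos[other] = pos[other], pos[cell]
        if hpwl_of(affected, nets, pos) < before:
            return other  # keep the swap and stop scanning
        pos[cell], pos[other] = pos[other], pos[cell]  # undo
    return None


# Hypothetical example: swapping "a" with "c" hurts, swapping with "b" helps.
pos = {"a": (0, 0), "b": (10, 0), "c": (0, 9), "x": (10, 1), "y": (0, 1)}
nets = {"n0": ["a", "x"], "n1": ["b", "y"]}
nets_of = {"a": ["n0"], "b": ["n1"], "c": [], "x": ["n0"], "y": ["n1"]}
chosen = first_improving_swap("a", ["c", "b"], nets, nets_of, pos)
print(chosen, pos["a"])
```

    Stopping at the first improvement trades some solution quality per step for a bounded amount of work per cell, which is consistent with the linear-time behaviour claimed in the text.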


    Fig. 2: Optimal Region description. [1]

    Fig. 3: Global and Local Swap Algorithm.

    In the global and vertical swap technique, all swap modes can be enabled using the switch “enable all direction swap”. If it is not enabled, vertical and horizontal swaps are performed only if a global swap has not been done for that cell. The merit of this approach is that it significantly improves the HPWL, since it has a wider scope with both global and local refinements. However, even after the runtime optimizations, this technique is slower than the proposed techniques. Also, the solution might get trapped in a local optimum. The time complexity of this technique is given by (3).

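    $$\text{Time Complexity} = O((\alpha_1 + \alpha_2)\,n) \tag{3}$$

    where
    $\alpha_1 = \#\text{nets for each cell} \approx 5$
    $\alpha_2 = 2 \times \#\text{critical nets} \times \#\text{critical cells (each net)} \approx 10$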

    B. Proposed Algorithms Terminologies

    Let us assume cell i lies on net 0. The “critical cells” of a net are the cells that lie on the edges of the bounding box of that net. Correspondingly, the nets on whose bounding-box edges a cell lies are the “critical nets” of that cell. The HPWL improvement is calculated in both the horizontal and vertical directions by checking the critical nets of the cell. The number of times the cell appears on a left edge minus the number of times it appears on a right edge defines the ∆xHPWL improvement. In the example shown in Fig. 4, the cell highlighted in red has 3 critical nets: it lies on the right edge of 2 of them and on the left edge of one. Thus the ∆xHPWL improvement is -1, where the sign indicates the “preferred direction”. The ∆yHPWL improvement is calculated similarly. Once the preferred direction is evaluated, the immediate cell in the preferred direction is called the “neighbor cell”. For an effective improvement in HPWL, the edges of the critical nets need to be moved in the preferred direction. Therefore, any other cells that lie on the same critical edge of a critical net are termed “associated cells”, and the neighbors of these associated cells are termed “associated neighbor cells”. Fig. 5 illustrates all these terms.

    Fig. 4: ∆HPWL computation.

    Fig. 5: Proposed Algorithms terminologies.
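    The ∆xHPWL definition above (left-edge count minus right-edge count) can be sketched in a few lines. The example reproduces the situation described for Fig. 4: two nets with the cell on the right edge and one with the cell on the left edge. Reading a negative value as "prefer moving left" is an interpretation of the text, and the names and the handling of zero-width nets are assumptions.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def delta_x_hpwl(cell: str,
                 nets_of: Dict[str, List[str]],
                 nets: Dict[str, List[str]],
                 pos: Dict[str, Point]) -> int:
    """Left-edge count minus right-edge count over the cell's critical nets."""
    left = right = 0
    x = pos[cell][0]
    for n in nets_of[cell]:
        xs = [pos[c][0] for c in nets[n]]
        lo, hi = min(xs), max(xs)
        if lo == hi:
            continue  # zero-width net: ignored in this sketch (assumption)
        if x == lo:
            left += 1
        elif x == hi:
            right += 1
    return left - right


# Cell "r" is on the right edge of n0 and n1 and on the left edge of n2.
pos = {"r": (5, 0), "p": (1, 0), "q": (2, 0), "s": (9, 0)}
nets = {"n0": ["r", "p"], "n1": ["r", "q"], "n2": ["r", "s"]}
nets_of = {"r": ["n0", "n1", "n2"]}
print(delta_x_hpwl("r", nets_of, nets, pos))  # -1
```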

    C. Selective Local Bidirectional 1D Swapping Technique

    To overcome the demerits of the previous global and local swap technique and to recover the solution from local optima, the authors propose the selective local bidirectional 1D swapping technique. This technique moves and swaps selected CLB cells in order to escape local optima. Fig. 6 describes the execution steps of the algorithm. The algorithm involves sequential Y and X swaps, hence the authors call it bidirectional 1D swap/move. The idea is borrowed from the concept of simulated annealing. First, all the cells are ordered by decreasing absolute value of ∆yHPWL. Then, each cell (CLB/RAM/DSP) is checked for being “movable” or “swappable” in its preferred direction. Two cells are swapped only if they possess directional affinity, i.e. if they have opposite preferred directions. If the neighboring site in the preferred direction is empty and valid for the particular cell type, the cell is moved. Once a move/swap operation is performed, the site(s) involved are frozen and remain untouched for the rest of the iteration. Once this MoveY step is complete, all the cells are ordered and swapped/moved in the same way in the X direction.

    Fig. 6: Selective Bidirectional 1D Swapping algorithm.
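    A minimal sketch of one MoveY pass on a single column of sites is given below, under several simplifying assumptions: the column is a Python list with `None` for empty sites, `pref` holds a signed ∆yHPWL per cell that is not recomputed during the pass, and cell-type validity checks are omitted. None of these names come from the paper.

```python
from typing import Dict, List, Optional, Set


def move_y_pass(column: List[Optional[str]],
                pref: Dict[str, int]) -> List[Optional[str]]:
    """One pass of move/swap along a column; mutates and returns `column`."""
    frozen: Set[int] = set()
    index = {c: i for i, c in enumerate(column) if c is not None}
    # Cells with the largest |delta| are considered first.
    order = sorted(index, key=lambda c: -abs(pref[c]))
    for cell in order:
        d = pref[cell]
        if d == 0:
            continue
        i = index[cell]
        j = i + (1 if d > 0 else -1)  # neighboring site in preferred direction
        if j < 0 or j >= len(column) or i in frozen or j in frozen:
            continue
        neighbor = column[j]
        if neighbor is None:
            column[i], column[j] = None, cell          # move into empty site
            index[cell] = j
        elif pref[neighbor] * d < 0:
            column[i], column[j] = neighbor, cell      # directional affinity
            index[cell], index[neighbor] = j, i
        else:
            continue
        frozen.update((i, j))  # both sites stay untouched for this pass
    return column


# Hypothetical column: "c" moves into the gap, then "a" and "b" swap.
result = move_y_pass(["a", "b", None, "c"], {"a": 2, "b": -1, "c": -3})
print(result)  # ['b', 'a', 'c', None]
```

    The same routine applied to rows instead of columns would give the MoveX step.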

    To optimize the performance of the algorithm, the cells are processed in a particular order to maximize HPWL reduction. Also, to reduce HPWL computation time, high-fanout nets are ignored. This has very little effect on the results, but significantly improves runtime. As mentioned earlier, this technique helps the solution recover from local optima. It is very fast in comparison to the previous global/local swap technique. The algorithm is also linear in time complexity, as given by (4). However, this technique does not take into consideration the penalty introduced by the movement of neighbor cells, the non-movement of associated cells, or the movement of associated and associated neighbor cells. Because of this, it may penalize HPWL slightly and needs to be run twice to overcome the detrimental effect. This limitation is overcome in the next proposed technique.

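    $$\text{Time Complexity} = O((\alpha_1 + \alpha_2)\,n) \tag{4}$$

    where
    $\alpha_1 = \#\text{nets for each cell} \approx 5$
    $\alpha_2 = 2 \times \#\text{critical nets} \times \#\text{critical cells (each net)} \times \#\text{non-freeze sites} \approx 2$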

    D. Penalty-Aware Dynamic 2D Swapping Technique

    To overcome the limitation of the previously proposed approach, the authors propose a more cautious penalty-aware dynamic 2D swapping technique. In this technique, the detrimental effects on HPWL improvement due to neighbor, associated and associated neighbor cells are considered before swap/move decisions are made. The improvement function is given by (5). The absolute ∆HPWL is computed, and the penalty of moving the neighbor cell is deducted in the case of a swap operation. From this, either the contribution of the associated cell, Cassoc, is deducted if the associated cell is not movable, or else the penalties of the associated and associated neighbor cells are deducted. If the improvement value exceeds the “improvement threshold”, the cell, the neighbor cell, the associated cells and their neighbors are swapped/moved, and all these sites are frozen. Thus, every accepted movement guarantees an improvement in HPWL.

    Fig. 7: Penalty aware 2D Swapping algorithm.

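    $$\text{Improvement} = |\Delta \mathrm{HPWL}| - p \times P_{\text{neighbor}} - \left( C_{\text{assoc}} \ \text{or} \ \left( p \times P_{\text{assoc}} + p \times P_{\text{assoc neighbor}} \right) \right) \tag{5}$$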
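    The improvement function (5) can be written out as a small function, shown here as a sketch. The argument names, and the use of boolean flags to select between the two bracketed alternatives, are assumptions based on the description in the text; `p` stands for the penalty factor.

```python
def improvement(delta_hpwl: float, p: float, p_neighbor: float,
                is_swap: bool, assoc_movable: bool,
                c_assoc: float = 0.0, p_assoc: float = 0.0,
                p_assoc_neighbor: float = 0.0) -> float:
    """Penalty-aware improvement of a candidate move or swap, following (5)."""
    value = abs(delta_hpwl)
    if is_swap:
        # The neighbor cell is displaced only in a swap.
        value -= p * p_neighbor
    if assoc_movable:
        # Associated cells move too: charge their penalties.
        value -= p * p_assoc + p * p_assoc_neighbor
    else:
        # Associated cell stays put: its contribution is lost.
        value -= c_assoc
    return value


def accept(value: float, improvement_threshold: float) -> bool:
    """A move or swap is taken only above the improvement threshold."""
    return value > improvement_threshold


# Hypothetical numbers: |-3| - 0.5*2 - 1 = 1.0 and |-3| - 0.5*2 - 1.5 = 0.5.
fixed_assoc = improvement(-3, 0.5, 2, is_swap=True, assoc_movable=False,
                          c_assoc=1)
moving_assoc = improvement(-3, 0.5, 2, is_swap=True, assoc_movable=True,
                           p_assoc=1, p_assoc_neighbor=2)
print(fixed_assoc, moving_assoc)  # 1.0 0.5
```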

    The algorithm, shown in Fig. 7, is as follows. First, all the cells are ordered in decreasing order of the sum of the absolute ∆xHPWL and ∆yHPWL values. Then, the checkX and checkY operations are performed. In the checkX operation, the improvement function is calculated for horizontal movement of the cell in its horizontal preferred direction. The checkY operation is performed similarly. Based on the outcomes, 4 different cases are possible, as shown in Fig. 8. If movement/swap is possible in both directions, the cell is first moved in one direction, which the user selects in the settings. Assuming the cell moves in the Y direction, the checkX operation is performed at the new site. If that operation is not possible, the cell is moved back to its original site and then moved in the X direction. At the new site, the checkY operation is performed, and accordingly the cell is frozen at this site or at the new Y site after a moveY operation. All the other cases are handled similarly. Once the operation is finished, all the sites involved are marked as “freeze sites” and remain blocked for the current iteration. This process is shown in Fig. 9. Since each cell is considered in both the X and Y directions at the same time, this technique is designed to reach the global optimum rather than a local one.

    Various optimizations are performed in this technique, each controlled by an optimization switch. The “check direction” switch allows a movement only if the cells being swapped possess directional affinity. Also, since all the penalties are considered, the move/swap operations can be relaxed by 2 parameters, namely “penalty factor” and “improvement threshold”. Both factors can be tuned to relax or constrain swap/move operations.


    Fig. 8: 2D Swapping possible moves.

    Fig. 9: 2D swapping moves steps.

    Only if the improvement exceeds the “improvement threshold” does the checkX or checkY operation return 1, after which the MoveX or MoveY operation is performed. This technique targets the global optimum by considering cell movement in both the X and Y directions at the same time, and it performs swap/move operations for all types of cells. Since the cells are ordered by decreasing ∆HPWL improvement, a significant reduction is achieved. Since the technique is penalty aware, every accepted operation guarantees an improvement in HPWL. This technique is also about 3X faster than the global/local swap technique, and it achieves scalable runtimes since it is linear in time complexity, as given by (6). However, it is about 1.5X slower than the previously proposed technique.

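    $$\text{Time Complexity} = O((\alpha_1 + \alpha_2)\,n) \tag{6}$$

    where
    $\alpha_1 = \#\text{nets for each cell} \approx 5$
    $\alpha_2 = 2 \times \#\text{critical nets} \times \#\text{critical cells (each net)} \times \#\text{non-freeze sites} \approx 3$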

    All these techniques complement each other and overcome each other's disadvantages. Therefore, when blended properly, they achieve optimal performance. The blending methodology is described in the Design Methodology section that follows.

    Fig. 10: Config file format.

    III. DESIGN METHODOLOGY

    This section describes the algorithm design methodology and the approach adopted to perform detailed placement on the FPGA benchmarks. The authors follow a systematic CAD approach to reduce HPWL while preserving bin utilization.

    A. Design Approach

    The authors first generated visualizations of all the benchmarks using a Perl module, as shown in Fig. 1, to get a better understanding of the global placement of each benchmark. The code was also first implemented in Perl for faster prototyping. However, the final code is implemented in C++ to reduce runtime. The authors provide a single binary that runs detailed placement for both congestion-aware and congestion-unaware benchmarks. Further, optimization switches are tuned for optimal performance, along with different run modes to address the performance vs. runtime trade-off. All these features are supported by a config-file-based flow, as shown in Fig. 10.

    B. Optimization Switches

    For optimal performance, the authors provide various design knobs/switches that can be tuned to achieve a better reduction in HPWL. Setting the switch “check direction” to 1 performs a move or swap operation in the proposed penalty-aware dynamic 2D swapping algorithm only if directional affinity exists between the cells under consideration. Otherwise, all cells are considered for move/swap operations. Also, in this algorithm, cells are swapped/moved only if the net HPWL improvement, after all penalty deductions from neighbor and associated cells, is more than the improvement threshold. Therefore, reducing the value of the switch “improvement threshold” relaxes the constraints and allows more cells to swap/move. The penalty deduction is controlled by another switch, “penalty factor”. In the global and vertical swap technique, all swap modes can be enabled using the switch “enable all direction swap”. If it is not enabled, vertical and horizontal swaps are performed only if a global swap has not been done for that cell. Also, a maximum ABU utilization switch is provided to control bin utilization in the congestion-aware case. All these optimization switches are provided in the config file, which is required when performing detailed placement.
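    The actual config file format is the one shown in Fig. 10, which is not reproduced in the text. Purely as an illustration of how such switches might be loaded, the sketch below assumes a hypothetical `key = value` format with `#` comments; the underscore-separated key names and all default values are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Switches:
    # Defaults are placeholders, not the authors' tuned values.
    check_direction: bool = False
    improvement_threshold: float = 0.0
    penalty_factor: float = 1.0
    enable_all_direction_swap: bool = False
    max_abu_utilization: float = 1.0


def parse_switches(text: str) -> Switches:
    """Parse hypothetical 'key = value' lines into a Switches object."""
    switches = Switches()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if not hasattr(switches, key):
            raise KeyError(f"unknown switch: {key}")
        if isinstance(getattr(switches, key), bool):
            setattr(switches, key, value == "1")
        else:
            setattr(switches, key, float(value))
    return switches


config_text = """
check_direction = 1
improvement_threshold = 0.5   # lower values allow more moves
penalty_factor = 0.8
max_abu_utilization = 0.9
"""
switches = parse_switches(config_text)
print(switches)
```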

    C. Congestion Aware Technology

    The authors provide a single binary that performs either congestion-aware or congestion-unaware detailed placement. The mode is selected on the command line when running the binary. When congestion mode is enabled, the sitemap is divided into 6 x 6 bins. The 6 x 6 bin utilization is calculated by (7).

    $$\mathrm{BU} = \frac{\#\text{CLB cells}}{\#\text{CLB sites}} \tag{7}$$

    The scaled HPWL is given by (8).

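    $$\mathrm{sHPWL} = \mathrm{HPWL} \times (1 + \alpha \times \text{penalty}) \tag{8}$$

    where
    $\alpha = 1$
    $\text{penalty} = \max\left( \frac{\mathrm{ABU}_{10}}{\text{target bin utilization}} - 1.0 \right)$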
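    The sketch below evaluates (7) and (8) numerically. Two points are assumptions rather than statements from the paper: ABU10 is taken to be the average utilization of the 10% most utilized bins, and the penalty is clamped at zero so that placements below the target are not rewarded (the extracted formula shows only a single argument to max).

```python
from typing import List


def bin_utilization(clb_cells: int, clb_sites: int) -> float:
    """Utilization of one bin, as in (7)."""
    return clb_cells / clb_sites if clb_sites else 0.0


def abu(utilizations: List[float], top_fraction: float = 0.10) -> float:
    """Average utilization of the most utilized bins (assumed ABU meaning)."""
    ranked = sorted(utilizations, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return sum(ranked[:k]) / k


def scaled_hpwl(hpwl: float, abu_value: float, target: float,
                alpha: float = 1.0) -> float:
    """Scaled HPWL, as in (8), with the penalty clamped at zero (assumption)."""
    penalty = max(0.0, abu_value / target - 1.0)
    return hpwl * (1.0 + alpha * penalty)


# Hypothetical numbers: a 10% overshoot of the target scales HPWL by about 1.1.
print(scaled_hpwl(1000.0, 0.99, 0.9))
print(scaled_hpwl(1000.0, 0.80, 0.9))
```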

    In all the implemented algorithms, swap operations remain unaffected. A move operation is performed only if the bin into which the cell moves has a utilization below the desired value; otherwise the move is not permitted. This preserves the bin utilization while still allowing other operations to achieve HPWL reduction.
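    The move restriction described above can be sketched as a simple gate on the destination bin. The sketch assumes that a bin covers 6 x 6 sites (one possible reading of "6 x 6 bins"), that a move within the same bin is always allowed, and that per-bin cell and site counts are kept in dictionaries; all of these are illustrative assumptions.

```python
from typing import Dict, Tuple

Site = Tuple[int, int]
Bin = Tuple[int, int]

BIN_SIZE = 6  # assumed bin edge length in sites


def bin_of(site: Site, bin_size: int = BIN_SIZE) -> Bin:
    """Map a site coordinate to the bin that contains it."""
    x, y = site
    return (x // bin_size, y // bin_size)


def move_allowed(src: Site, dst: Site,
                 cells_in_bin: Dict[Bin, int],
                 sites_in_bin: Dict[Bin, int],
                 max_util: float) -> bool:
    """Allow a move only if the destination bin is below the utilization cap."""
    src_bin, dst_bin = bin_of(src), bin_of(dst)
    if src_bin == dst_bin:
        return True  # utilization of the bin is unchanged
    utilization = cells_in_bin.get(dst_bin, 0) / sites_in_bin[dst_bin]
    return utilization < max_util


# Hypothetical two-bin example with a cap of 0.9.
sites_in_bin = {(0, 0): 36, (1, 0): 36}
cells_in_bin = {(0, 0): 20, (1, 0): 34}
print(move_allowed((0, 0), (7, 0), cells_in_bin, sites_in_bin, 0.9))  # False
print(move_allowed((7, 0), (2, 3), cells_in_bin, sites_in_bin, 0.9))  # True
```

    Swaps need no such check in this sketch, since exchanging two cells leaves the cell count of every bin unchanged.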

    D. Multi Optimizations Mode Support

    To address the runtime vs. performance trade-off, the authors provide 4 optimization modes, namely High Speed Approach (HSA), High Speed Performance Aware (HSPA), High Performance Speed Aware (HPSA) and High Performance Approach (HPA). The runtime increases from HSA to HPA, while the HPWL reduction generally improves. This allows users to run the code in the mode that gives them a good balance of runtime and improvement.

    E. General Optimizations

    The authors implement various optimization techniques to improve runtime and reduce HPWL. To reduce runtime, the authors skip high-fanout nets while evaluating the HPWL before and after a swap operation. This has very little detrimental effect on performance, but reduces runtime by about 3X. The authors also reduce the runtime of the global swap operation by limiting the size of the search to 20 and swapping cells as soon as an improvement is found, rather than scanning the whole optimal region. This halves the runtime. All the implemented algorithms are linear in time complexity. Also, the code is compiled with the O3 optimization switch to reduce runtime. To achieve the best performance, the authors order the cells by decreasing ∆HPWL improvement. This ensures that the cells that can have the biggest impact on HPWL reduction are considered first for move/swap operations, thus improving performance. Also, the bottlenecks of each algorithm are addressed by blending it with a complementary algorithm, which overcomes the demerits and improves the overall HPWL reduction.

    IV. RESULTS

    This section presents the detailed placement optimization results for all benchmarks. The performance of the algorithms is evaluated in terms of HPWL reduction for the 7 benchmarks provided in the ISPD 2016 contest and in class. The performance is also observed for the congestion-aware benchmarks. As discussed in the previous section, the solution provides 4 optimization modes, namely HSA, HSPA, HPSA and HPA. Table I lists the HPWL reduction and runtime for each benchmark and optimization mode. Congestion-aware detailed placement is performed on 5 separate benchmarks, with results listed in Table II.

    A significant improvement in HPWL is obtained in high speed mode in very little time. Close to 11% reduction in HPWL is observed in 97 seconds for FPGA06, whereas FPGA01 gives a 3.26% improvement in 12.8 seconds. FPGA05, the most complex benchmark, gives a 1.51% reduction in HPWL in just over 2 minutes, which is significantly faster than the results presented by others in class. In the high performance mode, a 4.52% improvement is observed for FPGA01 in under 1 minute, and a 12.8% improvement for FPGA06. In the congestion-aware cases, the authors meet the utilization constraint for all the benchmarks except FPGA05, where it is exceeded by 0.003 (0.903 against a target of 0.9). The authors achieve an 8.57% reduction in HPWL for FPGA01 and 7.59% for FPGA02. Tables I and II present the comparison of HPWL vs. time for FPGA01-07. The authors have also generated visuals of the benchmarks at the global and detailed placement stages using the Perl GD module. Fig. 11 shows the FPGA01 congestion benchmark at both stages. The formation of clusters and the close packing of cells after detailed placement, which reduce HPWL by 8.57%, are clearly visible. The algorithms are linear in time and scalable enough to provide optimal performance on complex benchmarks in approximately 5 minutes.

    V. CONCLUSIONS

    The authors have successfully implemented their proposed detailed placement algorithms along with the global and local swap techniques. These complementary algorithms are blended so as to overcome the demerits of any single algorithm. In doing so, the authors achieve a very high reduction in HPWL for all the benchmarks. Several optimization switches are added to the existing and proposed algorithms to make them run faster and better. The runtime vs. performance trade-off is also addressed by providing 4 different optimization modes. The authors achieve scalable runtimes for the bigger benchmarks as well as optimal results for the congestion-aware benchmarks. Compared with the results presented by the other teams in class, these are among the best results in terms of HPWL reduction and runtime for almost all the benchmarks, with notably lower runtime for the same performance on the larger benchmarks.


    Benchmark HSA ∆HPWL% HSA Runtime HSPA ∆HPWL% HSPA Runtime HPSA ∆HPWL% HPSA Runtime HPA ∆HPWL% HPA RuntimeFPGA01 3.26% 12.8sec 3.69% 21.4sec 4.02% 34.50s 4.52% 56.8secFPGA02 1.88% 21.8sec 3.33% 35.8sec 3.67% 41.20s 4.14% 1m 34secFPGA03 1.90% 1m 15sec 2.10% 2m 03sec 2.28% 3m 18sec 2.56% 5m 03secFPGA04 1.45% 1m 37sec 1.61% 2m 35sec 1.76% 4m 07sec 1.68% 6m 40secFPGA05 1.51% 2m 05sec 1.71% 3m 14sec 1.92% 5m 07s 1.97% 8m 27secFPGA06 10.90% 1m 37sec 11.79% 2m 38sec 12.48% 4m 15sec 12.80% 7m 10secFPGA07 6.74% 2m 06sec 7.38% 3m 21sec 7.94% 5m 20s 8.40% 9m 04sec

    TABLE I: Results for ABU = 1.0 for all provided benchmarks.
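
The runtime vs performance trade-off in Table I can be made concrete with a short calculation. The sketch below uses the FPGA01 and FPGA06 rows transcribed from Table I (runtimes converted to seconds) and computes how many percentage points of HPWL reduction each step to a slower mode buys per additional second. The `marginal_gain` helper and the "gain per second" metric are illustrative assumptions, not something the authors report.

```python
# (HPWL reduction in %, runtime in seconds) per mode, transcribed from Table I.
TABLE_I = {
    "FPGA01": {"HSA": (3.26, 12.8), "HSPA": (3.69, 21.4),
               "HPSA": (4.02, 34.5), "HPA": (4.52, 56.8)},
    "FPGA06": {"HSA": (10.90, 97.0), "HSPA": (11.79, 158.0),
               "HPSA": (12.48, 255.0), "HPA": (12.80, 430.0)},
}
MODES = ["HSA", "HSPA", "HPSA", "HPA"]  # ordered from fastest to slowest


def marginal_gain(rows):
    """For each step to the next mode, return
    (step label, extra HPWL reduction in points, extra seconds, points per second)."""
    steps = []
    for prev, cur in zip(MODES, MODES[1:]):
        d_hpwl = rows[cur][0] - rows[prev][0]
        d_time = rows[cur][1] - rows[prev][1]
        steps.append((f"{prev}->{cur}", d_hpwl, d_time, d_hpwl / d_time))
    return steps


for bench, rows in TABLE_I.items():
    for label, d_hpwl, d_time, rate in marginal_gain(rows):
        print(f"{bench} {label}: +{d_hpwl:.2f} points for +{d_time:.1f} s "
              f"({rate:.4f} points/s)")
```

For both benchmarks the gain per extra second shrinks with every step, which is the diminishing return that motivates offering several modes rather than a single setting.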

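    | Benchmark | ∆HPWL%         | Bin Utilization | Runtime  |
    |-----------|----------------|-----------------|----------|
    | FPGA01    | 8.57%          | 0.889           | 57.6 s   |
    | FPGA02    | 7.59%          | 0.881           | 66 s     |
    | FPGA03    | 4.63%          | 0.886           | 3 m 45 s |
    | FPGA04    | 4.64%          | 0.892           | 4 m 46 s |
    | FPGA05    | (4.02) - 3.70% | 0.903           | 2 m 18 s |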

    TABLE II: Results for ABU = 0.9 for all provided benchmarks.
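
The utilization check behind Table II can be sketched as follows, assuming that ABU = 0.9 acts as an upper bound on the reported bin utilization. The values are transcribed from Table II; the `violations` helper is hypothetical.

```python
ABU_TARGET = 0.9  # assumed to be an upper bound on bin utilization

# Bin utilization per benchmark, transcribed from Table II.
TABLE_II_UTILIZATION = {
    "FPGA01": 0.889,
    "FPGA02": 0.881,
    "FPGA03": 0.886,
    "FPGA04": 0.892,
    "FPGA05": 0.903,
}


def violations(utilization, target):
    """Return {benchmark: overshoot} for every benchmark above the target."""
    return {name: round(u - target, 3)
            for name, u in utilization.items() if u > target}


print(violations(TABLE_II_UTILIZATION, ABU_TARGET))
```

With these values only FPGA05 lies above the target, by 0.003.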

    Fig. 11: Global and Detailed Placement for FPGA01.

    VI. FUTURE WORK

    In this work, the authors have mainly emphasized designing very efficient algorithms, first by providing many optimization switches and then by tuning all of them for the best HPWL reduction and lower runtimes. The authors implemented their solution first in Perl for faster prototyping and then in C++ for the final deliverable. The authors believe that the runtime can be reduced by a further 4-5X through more efficient use of data structures in C++. The authors would also like to extend the scope of this work to include critical path delay reduction, and to extend the congestion-driven detailed placement to the 4 optimization modes, as was done for the normal (non-congestion-aware) case.

    ACKNOWLEDGMENT

    The authors are grateful to Prof. David Pan, The University of Texas at Austin, for the opportunity to improve their algorithm design skills and to work on this challenging problem. The authors would also like to thank Wuxi Li for his helpful remarks on algorithm and C++ code design.

    REFERENCES

    [1] M. Pan, N. Viswanathan, and C. Chu, “An efficient and effective detailed placement algorithm,” in IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 48–55, IEEE, 2005.

    [2] W. Li and S. Dhar, “UTPlaceF: A routability-driven FPGA placer with physical and congestion aware packing,” 2016.

    [3] “ISPD 2016 benchmarks,”

    [4] P. Maidee, C. Ababei, and K. Bazargan, “Fast timing-driven partitioning-based placement for island style FPGAs,” in Proceedings of the 40th Annual Design Automation Conference, DAC ’03, (New York, NY, USA), pp. 598–603, ACM, 2003.

    [5] M. Gort and J. H. Anderson, “Analytical placement for heterogeneous FPGAs,” in IEEE International Conference on Field Programmable Logic and Applications (FPL), pp. 143–150, IEEE, 2012.

    [6] M. Kim, D. Lee, and I. Markov, “SimPL: An effective placement algorithm,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, pp. 50–60, Jan. 2012.

    [7] T.-H. Lin, P. Banerjee, and Y.-W. Chang, “An efficient and effective analytical placer for FPGAs,” in Proceedings of the 50th Annual Design Automation Conference, DAC ’13, (New York, NY, USA), pp. 10:1–10:6, ACM, 2013.

    [8] C.-W. Pui, G. Chen, W.-K. Chow, K.-C. Lam, J. Kuang, P. Tu, H. Zhang, E. F. Y. Young, and B. Yu, “RippleFPGA: A routability-driven placement for large-scale heterogeneous FPGAs,” in Proceedings of the 35th International Conference on Computer-Aided Design, ICCAD ’16, (New York, NY, USA), pp. 67:1–67:8, ACM, 2016.

    [9] M. Xu, G. Grewal, and S. Areibi, “StarPlace: A new analytic method for FPGA placement,” Integr. VLSI J., vol. 44, pp. 192–204, June 2011.

    [10] Y. Xu and M. A. Khalid, “QPF: Efficient quadratic placement for FPGAs,” in IEEE International Conference on Field Programmable Logic and Applications (FPL), pp. 555–558.

    [11] S. N. Adya and I. L. Markov, “Consistent placement of macro-blocks using floorplanning and standard-cell placement,” in Proc. Intl. Symp. on Physical Design, pp. 12–17.


    [12] A. Agnihotri, M. C. Yildiz, A. Khatkhate, A. Mathur, S. Ono, and P. H. Madden, “Fractional cut: Improved recursive bisection placement,” in Proceedings of the 2003 IEEE/ACM International Conference on Computer-aided Design, ICCAD ’03, (Washington, DC, USA), pp. 307–, IEEE Computer Society, 2003.

    [13] A. E. Caldwell, A. B. Kahng, and I. L. Markov, “Optimal partitioners and end-case placers for standard-cell layout,” IEEE Trans. on Computer-Aided Design, pp. 1304–1313.

    [14] K. Doll, F. M. Johannes, and K. J. Antreich, “Iterative placement improvement by network flow methods,” IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, pp. 1189–1200.

    [15] A. B. Kahng, I. L. Markov, and S. Reda, “Legalization of row-based placements,” in Proc. Great Lakes Symp. on VLSI, pp. 214–219.

    [16] A. B. Kahng, P. Tucker, and A. Zelikovsky, “Optimization of linear placements for wirelength minimization with free sites,” in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC ’99), pp. 241–244, IEEE, 1999.

    [17] S.-W. Hur and J. Lillis, “Mongrel: Hybrid techniques for standard cell placement,” in Proceedings of the 2000 IEEE/ACM International Conference on Computer-aided Design, ICCAD ’00, (Piscataway, NJ, USA), pp. 165–170, IEEE Press, 2000.

    [18] S.-W. Hur and J. Lillis, “Relaxation and clustering in a local search framework: Application to linear placement,” in DAC, pp. 360–366, ACM Press, 1999.