[IEEE Computer Society 12th International Workshop on Rapid System Prototyping (RSP 2001), Monterey, CA]


A Dynamically Reconfigurable Architecture for Embedded Systems

Gilles Sassatelli, Gaston Cambon, Jerome Galy, Lionel Torres
LIRMM, 161 Rue ADA, 34392 Montpellier, France
sassate@lirmm.fr, cambon@lirmm.fr, galy@lirmm.fr, torres@lirmm.fr
(33) 04-67-41-85-69, (33) 04-67-41-85-85, (33) 04-67-41-85-67, (33) 04-67-41-85-72

Abstract

Internet is becoming one of the key features of tomorrow's communication world. The evolution of mobile phone networks such as UMTS will soon allow everyone to be connected everywhere. These new network technologies bring the ability to deal not only with classical voice or text messages, but also with richer content: multimedia. At the mobile level, this kind of data-oriented content requires highly efficient architectures, and today's embedded system-on-chip solutions will no longer be able to manage the critical constraints like area, power and data computing efficiency. In this paper we propose a new dynamically reconfigurable network, dedicated to data-oriented applications such as those targeted by third generation networks. Principles, realisations and comparative results are presented for some classical applications, targeted on different architectures.

1. Introduction

Tomorrow's mobile phone networks will definitely be Internet oriented. They will not only be phones, but will also provide numerous functions today considered as belonging to desktop computers or PDAs (Personal Digital Assistants). Agenda, MP3 player, memo and portable drives are only a few of the new features that today's mobile phones are just beginning to provide.

The circuit-switching technology currently used in second generation networks will soon evolve to packet mode, and more precisely to IP (Internet Protocol).

Its efficiency in physical resource sharing, combined with the new data rates brought by third generation network technologies such as CDMA (Code Division Multiple Access), will give mobile phone users direct Internet access. Many new commercial services, such as trading and shopping, but also multimedia file transfer such as audio (MP3), video (MPEG-1, MPEG-2, MPEG-4) and even videoconferencing (H.261 and H.263 protocols), will become reality in the next years.

Figure 1 schematically shows the network architecture of a possible future third generation mobile phone network. There are two main streams: the data stream and the voice stream. Second generation networks are circuit-switched, whereas third generation ones will mainly be packet-switched. Each mobile will send and receive data through a base station connected to a packet data network. According to the content type (voice or data), the data will pass through a circuit gateway directly to a classical telephone network (Public Switched Telephone Network) or will be sent over the Internet through a packet gateway.

Figure 1: Third generation mobile network architecture

All these newly enabled, demanding applications, combined with the management of the network-related technologies (CDMA for channel access, TCP/IP for the network protocol stack), bring a new challenge in mobile phone design: designers now have to deal with computing-intensive applications in a mobile context, that is to say under strong power and cost constraints.

Nowadays mobile phones are mostly based on a SoC (System on Chip) approach (figure 2): heterogeneous IP (Intellectual Property) cores are grouped on the same silicon die.


Figure 2: The three main SoC approaches (2.1, 2.2 and 2.3)

Figure 2.1 presents the SoC approach. The radio-frequency core (which, however, often sits on a separate chip) can be assimilated to the physical layer of a network stack; its function is essentially to amplify, filter and demodulate the incoming RF signal. It transmits the baseband signal to the ADC (Analog-to-Digital Converter), and the resulting digital signal is then sent to a DSP or µP which manages, at the software level, all the digital functions of the phone, including those still related to the channel, like data compression, encoding and logical channel multiplexing/demultiplexing.

There are different ways to face these new problems:
- The easiest, and current, way to deal with this increasing computing power requirement is naturally to use a bigger, more powerful DSP/µP (figure 2.1) than the ones used today. But this will probably not be feasible for the most demanding applications, as the resulting processor would grow to the size of a Pentium (such as the ones found in the most powerful PDAs or pocket PCs), with the corresponding area, cost and power consumption problems.
- Another way is to identify the future application field and use a dedicated core to compute the common parts of the algorithms (figure 2.2). For example, if we target JPEG- and MPEG-based applications, we will choose to implement a hard-wired IDCT (Inverse Discrete Cosine Transform) core, which is known to be the common most demanding part of both algorithms [7][8]. This is an interesting but restrictive solution, as the application field is not extensible.
- Yet another way is reconfigurable computing [3][4][10]: for example, integrating an FPGA core [1][2] on which, depending on the target application, different algorithm/architecture solutions can be synthesised (figure 2.3). If we target JPEG applications, we will choose to synthesise the IDCT core in the FPGA, and also an application-dependent part of the algorithm like Huffman coding or quantization. If, on the other hand, we target MPEG [9] applications, we will still make the choice of a wired IDCT, but this time we will also select the motion estimation [6], which is one of the most demanding parts of the MPEG compression algorithm.

This kind of approach seems quite interesting: we can imagine that, for a given application (video streaming, for example), the mobile could directly connect to the vendor's site to download the corresponding applet, which is nothing else than the configuration file of the considered reconfigurable network.

2. Reconfigurable solutions

A closer look at tomorrow's mobile applications shows a strongly data-oriented, data-intensive trend: multimedia content requires a very high number of arithmetic operations, which would naturally imply synthesising many arithmetic operators when using fine-grained reconfigurable logic (an FPGA, for example).

In a LUT-based FPGA [1][11], two main layers are used:
- The operative layer, which contains all the CLBs, I/O blocks and switch matrices. This CLB-based architecture is designed to manage bit-level data.
- The configuration layer, which can simply be considered as a big SRAM. The configuration is downloaded into this RAM before the operating phase; this is therefore a statically configurable architecture.

In our context the granularity has to be quite different, as we have to compute arithmetic-level (word-level) data. This 'mismatch' of granularity implies an increased total cost, mostly due to the synthesis of arithmetic operators.

3. System overview

Dataflow-oriented applications require the use of a coarse-grained reconfigurable network. In this respect, our architecture follows an original concept, built on three elements:

1. The operative layer is no longer CLB-based but uses a coarse-grained component, the Dnode (Data node). It is a datapath component with an ALU and a few registers, as shown in figure 4. This component is configured by a microinstruction code.
2. The configuration layer follows the same principle as in FPGAs: it is a RAM which contains the configuration of all the components of the operative layer.
3. We also use a custom RISC core [5] with a dedicated instruction set; its task is to dynamically manage the configuration of the network and also to control data injection and data retrieval in the operative layer.

A minimal behavioural sketch of this three-layer organisation is given below.
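To make the three-layer split concrete, the following C sketch models it at a purely behavioural level. The struct layout, opcode set and register-selection encoding are assumptions made for illustration, not the actual Dnode microinstruction format.

```c
/* Behavioural sketch: operative layer of Dnodes, configuration RAM holding
 * one microinstruction per Dnode, and a controller that may rewrite that RAM
 * between cycles. All names and encodings are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define N_DNODES 4
#define N_REGS   7

enum { OP_ADD, OP_SUB, OP_MUL, OP_PASS };

typedef struct {          /* one configuration word for one Dnode          */
    uint8_t opcode;
    uint8_t src_a, src_b; /* 0 = external input, 1..N_REGS = register file */
    uint8_t dst;          /* destination register index                    */
} microinstr_t;

typedef struct { int32_t reg[N_REGS]; } dnode_t;

typedef struct {
    dnode_t      node[N_DNODES];       /* operative layer     */
    microinstr_t config_ram[N_DNODES]; /* configuration layer */
} ring_t;

/* One clock cycle: every Dnode executes the microinstruction currently held
 * for it in the configuration RAM. Dynamic reconfiguration simply means the
 * controller overwrites config_ram[] entries between calls. */
void ring_cycle(ring_t *r, const int32_t in_a[], const int32_t in_b[])
{
    for (int i = 0; i < N_DNODES; i++) {
        const microinstr_t *mi = &r->config_ram[i];
        dnode_t *d = &r->node[i];
        int32_t a = mi->src_a ? d->reg[mi->src_a - 1] : in_a[i];
        int32_t b = mi->src_b ? d->reg[mi->src_b - 1] : in_b[i];
        int32_t res;
        switch (mi->opcode) {
        case OP_ADD: res = a + b; break;
        case OP_SUB: res = a - b; break;
        case OP_MUL: res = a * b; break;
        default:     res = a;     break;
        }
        d->reg[mi->dst] = res;
    }
}

int main(void)
{
    ring_t r = {0};
    r.config_ram[0] = (microinstr_t){ OP_ADD, 0, 0, 0 }; /* Dnode 0 adds its inputs      */
    r.config_ram[1] = (microinstr_t){ OP_MUL, 0, 0, 0 }; /* Dnode 1 multiplies its inputs */
    int32_t a[N_DNODES] = { 3, 3, 0, 0 }, b[N_DNODES] = { 4, 4, 0, 0 };
    ring_cycle(&r, a, b);
    printf("Dnode0: %d, Dnode1: %d\n", (int)r.node[0].reg[0], (int)r.node[1].reg[0]); /* 7, 12 */
    return 0;
}
```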

Figure 3: System overview

This architecture is thus not intended to be used as a stand-alone solution, but rather as an IP core for data-oriented, computation-intensive processing, to be integrated in a SoC. Figure 3 shows our system in a SoC context. The µP can delegate the most demanding part of a given application to our IP core; to do so, it downloads to the RISC memory the corresponding configuration program (which manages the dynamic reconfiguration).

From a functional point of view:
- The host processor first sends the management code to the configuration controller memory (the custom RISC has its own program memory). This is object code, ready to be executed, specially designed to dynamically manage the configuration of the network (the content of the RAM thus changes from one cycle to another), that is to say the functionality of the operating layer. Each clock cycle, the configuration controller is able to change up to the entire content of the RAM, thanks to its dedicated instruction set.
- Once this is done, our core is ready to compute. The host processor sends the data to the operating layer via a specific scheme and then gets back the computed data. As the configuration is managed dynamically, it is possible to multiplex the sent data and to compute them with several sequential (hardware-multiplexed) or concurrent (static) synthesised datapaths. A hedged sketch of this host-side sequence is given below.
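The following C sketch illustrates the host-side sequence under stated assumptions: the functions core_load_program, core_push and core_pull are hypothetical stand-ins for whatever driver interface the PCI-based link of section 5.1 actually exposes, and are stubbed here only so the sketch compiles.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical low-level interface to the reconfigurable core (stubs). */
static void core_load_program(const uint32_t *code, size_t words)
{ (void)code; printf("config program: %zu words loaded\n", words); }

static void core_push(const int32_t *data, size_t n)
{ (void)data; printf("pushed %zu data words\n", n); }

static void core_pull(int32_t *data, size_t n)
{ memset(data, 0, n * sizeof *data); }  /* stub: the real core returns results */

/* Offload one kernel: 1) send the management (object) code to the RISC
 * configuration controller, 2) stream the data through the operating layer
 * while the controller reconfigures the Dnodes cycle by cycle. */
void run_on_ring(const uint32_t *cfg, size_t cfg_words,
                 const int32_t *in, int32_t *out, size_t n)
{
    core_load_program(cfg, cfg_words);
    core_push(in, n);
    core_pull(out, n);
}

int main(void)
{
    uint32_t cfg[8] = {0};
    int32_t in[16] = {0}, out[16];
    run_on_ring(cfg, 8, in, out, 16);
    return 0;
}
```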

4. Operating layer architecture

4.1. The Dnode architecture

The Dnode essentially consists of an ALU-multiplier able to perform all the classical arithmetic and logic operations: addition, multiplication, subtraction, rotate, shift and so on. As shown in figure 4, there is also a register file and two multiplexers. This optimised architecture is able, within a single clock cycle, to perform any of these operations, even between two different registers. Its microinstruction code, the configuration code, comes from a memory location in the configuration layer. As previously said, this code evolves during the computing phase: the functionality can thus be changed from one clock cycle to another, from an addition to a multiplication or a register load, for instance.

Each Dnode has in fact two execution modes:
- Global mode (normal mode), already described: the Dnode executes the microinstruction code which comes from the configuration layer, managed by the RISC configuration controller.
- Local mode, the stand-alone mode: each Dnode has 7 registers, an up-to-6-state counter and a 6-to-1 multiplexer which together form a small controller. Each of the first 6 registers can contain a Dnode microinstruction code; each clock cycle, the counter increments the value on the multiplexer address input, thus sending the content of one register to the datapath part of the Dnode. In this mode the Dnode is able to compute various kernels like MAC or serial digital filters on its own. This architecture, combined with a specific input/output data controller (described later), allows very efficient, high-bandwidth, data-oriented computation (a minimal local-mode sketch follows).
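A minimal model of local mode, assuming a plausible (not the authors') microinstruction encoding: six locally stored microinstructions are replayed by the counter and multiplexer, here used as a single-instruction MAC loop such as an FIR tap.

```c
#include <stdint.h>
#include <stdio.h>

#define N_LOCAL 6   /* up-to-6-state counter / 6-to-1 multiplexer */

typedef enum { OP_NOP, OP_ADD, OP_MUL, OP_MAC } op_t;

typedef struct {
    op_t    local_prog[N_LOCAL]; /* 6 registers holding microinstructions */
    int     prog_len;            /* counter period (<= 6)                 */
    int     pc;                  /* current counter state                 */
    int32_t acc;                 /* accumulator register                  */
} dnode_local_t;

/* One clock cycle in local mode: execute the microinstruction selected by
 * the counter, then advance the counter. */
static void dnode_local_cycle(dnode_local_t *d, int32_t a, int32_t b)
{
    switch (d->local_prog[d->pc]) {
    case OP_ADD: d->acc = a + b;  break;
    case OP_MUL: d->acc = a * b;  break;
    case OP_MAC: d->acc += a * b; break;
    default:                       break;
    }
    d->pc = (d->pc + 1) % d->prog_len;
}

int main(void)
{
    /* single-instruction microprogram: a MAC unit, e.g. one FIR tap */
    dnode_local_t d = { .local_prog = { OP_MAC }, .prog_len = 1, .pc = 0, .acc = 0 };
    int32_t x[4] = { 1, 2, 3, 4 }, h[4] = { 4, 3, 2, 1 };
    for (int i = 0; i < 4; i++)
        dnode_local_cycle(&d, x[i], h[i]);
    printf("MAC result: %d\n", (int)d.acc);  /* 1*4 + 2*3 + 3*2 + 4*1 = 20 */
    return 0;
}
```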

Figure 4: The Dnode architecture (inputs A and B, output S)

4.2. The Ring architecture

A regular array implies interconnection and data-transfer problems, as a dataflow nearly always needs data feedback (figure 5).

Figure 5: A classical datapath, with data feedback

These feedback operations require significant routing capabilities (figure 5), which limit the size or the performance of the array (latency cycles). Our approach to solving this problem is the use of a curled structure.

4.2.1. Forward: The main Dataflow

We use a curled, pipelined, systolic structure, as shown in figure 6. All the Dnodes form a ring whose length (number of Dnode layers) and width (number of Dnodes per layer) can easily be scaled.

Figure 6: The Ring architecture (forward dataflow through Dnode layers n-1, n and n+1)

Dnodes are organised in layers; a Dnode layer is connected to the two adjacent ones by switch components, able to make any interconnection between two stages. A switch also manages data injection and retrieval through direct, dedicated FIFOs, and optional communications with the RISC via a shared bus. In normal mode, each Dnode can be seen as an arithmetic operator of a datapath, computing one data item per clock cycle. In stand-alone mode, each Dnode can be seen as an autonomous CPU. The structure is also flexible in the sense that all Dnodes do not have to run in the same mode: parts of the array can compute in global mode (normal mode) while others run in local mode (stand-alone).

4.2.2. Reverse: The secondary Dataflow

The data feedback problem is addressed by this secondary dataflow: we use dedicated feedback pipelines (figure 7), forming a reverse dataflow, to avoid complex routing structures. The last task accomplished by each switch is to write unconditionally (no control needed) the computed results of the previous Dnode layer into a dedicated pipeline (each switch owns its pipeline), which allows each data item to be fed back to the previous stages (figure 7).

These stages can then choose to fetch these data through the switches, which have direct access to all the pipelines. This technique ensures good scalability of the architecture, as the routing problem is thus removed. A small model of such a feedback pipeline is sketched below.
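A small behavioural model of one such feedback pipeline, with illustrative depth and layer width (the sizes are assumptions; the shift-and-read discipline is the point).

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LAYER_WIDTH    2   /* Dnodes per layer (illustrative)  */
#define FEEDBACK_DEPTH 4   /* pipeline stages per switch       */

typedef struct {
    int32_t slot[FEEDBACK_DEPTH][LAYER_WIDTH];
} feedback_pipe_t;

/* Every cycle the switch shifts its pipeline and unconditionally pushes
 * the fresh results of its Dnode layer into the first stage. */
static void feedback_push(feedback_pipe_t *p, const int32_t result[LAYER_WIDTH])
{
    memmove(&p->slot[1], &p->slot[0], (FEEDBACK_DEPTH - 1) * sizeof p->slot[0]);
    memcpy(&p->slot[0], result, sizeof p->slot[0]);
}

/* An upstream switch reads a value produced 'age' cycles ago on 'lane'. */
static int32_t feedback_read(const feedback_pipe_t *p, int age, int lane)
{
    return p->slot[age][lane];
}

int main(void)
{
    feedback_pipe_t p = {0};
    int32_t first[LAYER_WIDTH]  = { 10, 20 };
    int32_t second[LAYER_WIDTH] = { 30, 40 };
    feedback_push(&p, first);   /* cycle 1 */
    feedback_push(&p, second);  /* cycle 2 */
    printf("%d\n", (int)feedback_read(&p, 1, 1)); /* value from one cycle ago: 20 */
    return 0;
}
```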

Figure 7: The feedback network architecture

5. Comparisons & realisations

5.1. Comparative Results

This version (8 Dnodes) has a maximal computing power of 1600 MIPS (at a clock frequency of 200 MHz), quite impressive compared to the 400 MIPS of a Pentium II 450 MHz processor. The theoretical maximum bandwidth of this version of the structure is about 3 Gbytes/s, limited to 250 Mbytes/s by the communication protocol we implemented (a PCI-based bus) between the host CPU and the core.

To program this structure, we wrote an assembler which parses both RISC-level (control) and Ring-level primitives. It generates machine code ready to be executed on the architecture.

Among the applications targeted by third generation systems, we find many video-related techniques. One of these well-known, computing-intensive algorithms is motion estimation. Widely used in video compression for broadcasting, storage and videoconferencing, its task is to remove the temporal redundancy in video streams, just as the DCT's task is to remove the spatial redundancy.

Block matching, and especially the Full Search Block Matching (FSBM) algorithm, is the most popular implementation, also recommended by several standard committees (MPEG (video) and H.261 (videoconferencing) standards).

The Mean Absolute Difference (MAD) criterion, used to estimate how well the current block matches, can be formulated as follows:

$$\mathrm{MAD}(m,n) = \sum_{i=1}^{N}\sum_{j=1}^{N}\bigl|\,R(i,j) - S(i+m,\,j+n)\,\bigr|$$


Figure 8: The Motion Estimation algorithm (frame at t, frame at t+1, searching region, best match, motion vector (m,n), followed by DCT + quantification and Huffman coding)

R(i,j) is the reference block of size N x N, and S(i+m, j+n) the candidate block within the search area determined by p and q, the maximum horizontal and vertical displacements. The size of this area is (N+p+q)² pixels, and the displacement vector, represented by (m,n), is determined by the least MAD(m,n) among all the (p+q+1)² possible displacements within the search area. Table 1 shows the specifications of a videoconferencing codec; a plain software reference of this full search is sketched after the table.

Image size:         352 x 240 pixels
Frame rate:         15 frames/s
Block size:         8 x 8 pixels
Max displacement:   -8 <= m, n <= 8 pixels

Table 1: Videoconferencing codec specifications
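For reference, here is a plain C implementation of the FSBM/MAD search defined above (the sequential algorithm itself, not its Systolic Ring mapping), using the Table 1 parameters N = 8 and a maximum displacement of 8 pixels. The caller is assumed to keep the candidate blocks inside the frame.

```c
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define N 8   /* block size            */
#define P 8   /* maximum displacement  */

/* MAD(m,n) between the reference block (top-left at (rx,ry) in 'ref') and
 * the candidate block displaced by (m,n) in 'cur'. */
static long mad(const uint8_t *ref, const uint8_t *cur, int width,
                int rx, int ry, int m, int n)
{
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += labs((long)ref[(ry + i) * width + (rx + j)]
                      - (long)cur[(ry + n + i) * width + (rx + m + j)]);
    return sum;
}

/* Full search over the (p+q+1)^2 = 289 candidate displacements. */
static void fsbm(const uint8_t *ref, const uint8_t *cur, int width,
                 int rx, int ry, int *best_m, int *best_n)
{
    long best = LONG_MAX;
    for (int m = -P; m <= P; m++)
        for (int n = -P; n <= P; n++) {
            long d = mad(ref, cur, width, rx, ry, m, n);
            if (d < best) { best = d; *best_m = m; *best_n = n; }
        }
}

int main(void)
{
    enum { W = 24, H = 24 };  /* N + 2P = 24: one block plus its search area */
    uint8_t ref[W * H], cur[W * H];
    for (int i = 0; i < W * H; i++) { ref[i] = (uint8_t)i; cur[i] = 0; }
    /* shift the reference content by (+2, +1) to fabricate known motion */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            cur[y * W + x] = ref[((y - 1 + H) % H) * W + ((x - 2 + W) % W)];
    int m, n;
    fsbm(ref, cur, W, 8, 8, &m, &n);
    printf("motion vector: (%d, %d)\n", m, n);  /* prints (2, 1) */
    return 0;
}
```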

For each candidate block, the first (inner) summation (i = 1 to 8) requires N operations and the accumulation N-1 operations, thus a total of 2N-1 operations. The second summation requires computing N times the previous amount of operations, plus again N-1 operations for the accumulation of the partial sums. The total number of arithmetic operations is therefore 2N²-1. The first (2N-1)N operations can be achieved within (2N-1)N / (0.75 Nx) clock cycles in an Nx-node version of our structure, as there are no dependencies on these data and one node out of four is in a wait state (layer n: 2 nodes computing two R(.)-S(.) operations; layer n+1: 1 node accumulating the two previously computed results). The last N-1 operations (accumulation) are achieved in int(ln(N))+1 clock cycles for N <= Nx. In a 16-node version of our structure and with the previously specified codec (N = 8), the computation of the MAD for a candidate block requires 13 clock cycles. Each reference block requires the evaluation of 289 candidate blocks, and there are 1320 reference blocks in each frame. The total processing time of an entire image frame is therefore 1320 x 289 x 13 = 4,959,240 cycles. At a 200 MHz clock frequency, the computation time would be about 24.8 ms, which is more than two times smaller than the frame period (1/15 s). This cycle budget is checked numerically in the short program below.
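The cycle budget above can be verified with a few lines of arithmetic:

```c
/* Quick check of the cycle budget (N = 8, 16-node ring, 200 MHz clock):
 * 13 cycles per candidate MAD, 289 candidates per reference block,
 * 1320 reference blocks per 352x240 frame. */
#include <stdio.h>

int main(void)
{
    const long cycles_per_mad   = 13;
    const long candidates       = 17 * 17;               /* (p+q+1)^2 = 289 */
    const long blocks_per_frame = (352 / 8) * (240 / 8); /* 1320            */
    const double f_clk = 200e6;                          /* 200 MHz         */

    long cycles = cycles_per_mad * candidates * blocks_per_frame;
    printf("cycles per frame : %ld\n", cycles);                  /* 4,959,240 */
    printf("time per frame   : %.1f ms\n", 1e3 * cycles / f_clk); /* ~24.8 ms */
    printf("frame period     : %.1f ms\n", 1e3 / 15.0);           /* ~66.7 ms */
    return 0;
}
```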

Table 2 shows the performance of the Systolic Ring compared with the ASIC architecture implemented in [12] and with Intel MMX instructions [13], using as criterion the number of cycles needed to match an 8x8 reference block against its search area of 8 pixels displacement.

          Systolic Ring   ASIC [12]   MMX [13]
Cycles         3757           581       28900

Table 2: Motion Estimation performance comparison (cycles)

Our structure again shows its efficiency in such a computing-intensive context. The ASIC implementation is much faster than our solution, but at the price of flexibility; the Systolic Ring provides the advantage of hardware reuse and is still almost 8 times faster than the MMX solution.

5.2. Design Implementation

Table 3 shows the comparative implementation results in both technologies. A core area of 0.7 mm² could be achieved with the latest 0.18 µm CMOS process, with a maximum estimated frequency of 200 MHz.

                       0.25 µm     0.18 µm
Dnode area             0.06 mm²    0.04 mm²
Core area              0.9 mm²     0.7 mm²
Estimated frequency       -        200 MHz

Table 3: Synthesis results

The low area of each Dnode, combined with the specific architecture exposed above, shows that this structure could easily be scaled to larger realisations. Figure 9 shows a foreseeable 0.18 µm technology, 12 mm² die area SoC for highly constrained embedded solutions. Our architecture allows the integration of a powerful 64-Dnode version of our Ring (3.4 mm² on-die area) with a widely used ARM7 CPU able to run various operating systems like Windows CE, EPOC32 or Linux. This kind of solution could provide a great computation power/cost ratio, combining the flexibility of a CPU/reconfigurable-architecture couple with the efficiency of a dedicated core.


Figure 9: A foreseeable 12 mm² SoC. ARM7TDMI (0.54 mm²): 32-bit ARM RISC core running Windows CE, EPOC32 or Linux, wide choice of development tools. Ring 64 (3.4 mm²): 64-Dnode Systolic Ring, fast computation of data-oriented applications.

6. Conclusion

We have proposed a new coarse-grained, arithmetic-block-based, dynamically reconfigurable architecture which proves its efficiency in data-oriented processing. Its scalability shows that its field of application is not limited to highly constrained embedded applications: it can also prove its worth in other contexts where data-oriented, high-bandwidth processing remains critical. A small 8-Dnode version of this structure already provides up to 1600 MIPS of raw power for data-dominated applications, with a sustained data rate of 3 Gbytes/s at 200 MHz, either in global or local mode. Our future work lies in the extension of the structure to floating point, and also in the writing of an efficient compiling/profiling tool whose main task would be to maximise the computation density in the architecture (the objective being no Dnode in a wait state).

7. References

[1] S. Brown and J. Rose, "Architecture of FPGAs and CPLDs: A Tutorial", IEEE Design and Test of Computers, vol. 13, no. 2, pp. 42-57, 1996.

[2] M. Gokhale et al., "Building and Using a Highly Parallel Programmable Logic Array", IEEE Computer, pp. 81-89, Jan. 1991.

[3] W. H. Mangione-Smith et al., "Seeking Solutions in Configurable Computing", IEEE Computer, pp. 38-43, Dec. 1997.

[4] J. R. Hauser and J. Wawrzynek, "Garp: A MIPS Processor with a Reconfigurable Coprocessor", Proc. of the IEEE Symposium on FPGAs for Custom Computing Machines, 1997.

[5] A. Abnous, C. Christensen, J. Gray, J. Lenell, A. Naylor and N. Bagherzadeh, "Design and Implementation of the Tiny RISC Microprocessor", Microprocessors and Microsystems, vol. 16, no. 4, pp. 187-194, 1992.

[6] C. Hsieh and T. Lin, "VLSI Architecture For Block-Matching Motion Estimation Algorithm", IEEE Trans. on Circuits and Systems for Video Technology, vol. 2, pp. 169-175, June 1992.

[7] N. Ahmed, T. Natarajan and K. R. Rao, "Discrete Cosine Transform", IEEE Trans. on Computers, vol. C-23, pp. 90-93, Jan. 1974.

[8] ISO/IEC JTC1 CD 10918, "Digital compression and coding of continuous-tone still images - part 1: requirements and guidelines", ISO, 1993 (JPEG).

[9] ISO/IEC JTC1 CD 13818, "Generic coding of moving pictures and associated audio", ISO, 1994 (MPEG-2 standard).

[10] "Challenges for Adaptive Computing Systems", Defense Advanced Research Projects Agency, www.darpa.mil/ito/research/acs/challenges.html

[11] Xilinx, The Programmable Logic Data Book, 2000.

[12] A. Bugeja and W. Yang, "A Re-configurable VLSI Coprocessing System for the Block Matching Algorithm", IEEE Trans. on VLSI Systems, vol. 5, September 1997.

[13] Intel, Application Notes for Pentium MMX, http://developer.intel.com

37

21 22 23

Figure 2 The three main SoC approaches

Figure 21 presents the SoC approach the radio frequency core (this one however takes often place on a different chip) can be assimilated to the physical layer of a network stack Its function is essentially to amplify filter and demodulate the incoming RF signal It transmits the baseband signal to the ADC (Analog Digital Converter) and the resulting digital signal is than sent to a DSP or pP which manages at the software level all digital functions of the phone including the ones which are still relative to the channel like data compression encoding and logical channels multiplexingdemultiplexing

There are different ways to face these new problems - The easiest and actual way to deal with this increasing computing power is naturally to use a bigger more powerful DSPpP (figure 21) than the ones used today but it will probably not be feasible for the most demanding applications as the resulting processor will grow until the size of a Pentium (such as the ones taking place in the most powerful PDA or pocket PCs) with the corresponding area cost and consumption problems

- Another way is to try to identify the future application field and to use a dedicated core to compute the common parts of the algorithms (Figure 22) For example if we target JPEG and MPEG based applications we will make the choice of implementing a wired IDCT (Inverse Discrete Cosine Transform) core which is known to be the common most demanding part of both algorithms [ 7 ] [ 8 ] An interesting but restrictive solution as the application field is thus not extensible

- Yet another way is the reconfigurable computing [3][4][10] For example integrate a FPGA core [1][2] where depending on the target application different algorithdarchitecture solutions could be synthesised (figure 23) If here we target JPEG applications we will choose to synthesise the IDCT core in the FPGA and also an application dependant part of the algorithm like Huffman coding or quantization But in the other way if we target MPEG [9] applications we will still make the choice of a wired IDCT but this time we will also select the motion estimation [6] which is one of the most demanding part of

the MPEG compression algorithm

This kind of approach seems to be quite interesting we can imagine depending on a given application a video streaming one for example that the mobile could directly connect to the vendorrsquos site to download the corresponding applet which is nothing else than the configuration file of the considered reconfigurable network

2 Reconfigurable solutions

A closer look to the kind of tomorrowrsquos mobile applications shows a very data oriented data intensive trend the multimedia content needs a very high count of arithmetic operations which would naturally imply to synthesise numbers of arithmetic operators in case of using fine grained reconfigurable logic (FPGA for example)

In a LUT-based FPGA [ 1][ 1 11 two main layers are used - The operative layer in which takes place all the CLB

IO blocks and switch matrix This kind of CLB-based architecture is designed to manage bit-level data

- The configuration layer which can simply be considered as a big SRAM The configuration is downloaded before the operating phase in this RAM this is therefore a statically configurable architecture

In our context the granularity has to be quite different as we have to compute arithmetic-level data This lsquomismatchrsquo of granularity implies a majored total cost mostly due to arithmetic operators synthesis

3 System overview

Dataflow oriented applications require the use of coarse grained reconfigurable network In this way our architecture follows an original concept

1

2

3

33

The operative layer is no longer CLB based but use a coarse-grained granularity component the Dnode (Data node) It is a datapath component with an ALU and a few registers as shown in figure 4 This component is configured by a microinstruction code

The configuration layer follows the same principle as FPGAs it is a RAM which contains the configuration of all the components of the operative layer

We also use a custom RISC core [ 5 ] with a dedicated instruction set its task is to manage dynamically the configuration of the network and also to control the data injection and data recuperation in the operative layer

_ lt I Fd Dnode

-- C O N F I G U A A 7

II

ION

Figure 3 System overview

This architecture is thus not intended to be used as a stand- alone solution rather an IP core for data oriented intensive computing which would take place in a SOC Figure 3 shows our system in a SoC context The pP can thus confide the most demanding part of a given application to our IP core So it downloads to the RISC memory the corresponding configuration program (manages the dynamic reconfiguration)

From a functional point of view - The host processor first sends the management code to

the configuration controller memory (the custom RISC has its own program memory) This is a object code ready to be executed and specially designed to manage dynamically the configuration of the network (the content of the RAM thus changes from one cycle to another) as to say the functionality of the operating layer Each clock cycle the configuration controller is able to change up to the entire content of the RAM thanks to its dedicated instruction set

Once done our core is ready to compute The host processor sends the data to the operating layer via a specific scheme and then get back the computed data As the configuration is managed dynamically it is possible to multiplex the sent data and to compute them by several sequential (hardware multiplexing) or concurrent(static) synthesised datapaths

-

4 Operating layer architecture

41 The Dnode architecture It essentially consists in an ALU-Multiplier able to make all the classical arithmetic and logic operations addition multiplication subtraction roll shift and so on As shown in figure 4 there is also a register file and two multiplexers This optimised architecture is able in the same clock cycle to make all possible operations even between two different registers Its corresponding microinstruction code the

configuration code comes from a memory location in the configuration layer As previously said this code evolves during the computing phase the functionality can thus be changed from one clock cycle to another from an addition to a multiplication load to register

Each Dnode has in fact two execution modes - Global mode (normal mode) already exposed the

Dnode executes the microinstruction code which comes from the configuration layer managed by the RISC configuration controller

Local mode The stand-alone mode Each Dnode has 7 registers an up to 6-states counter and a 6 to I multiplexer which forms a small controller Each one of the 6 first registers can contain a Dnode microinstruction code and each clock cycle the counter increases the value on the multiplexer address input thus sending the content of a register to the datapath part of the Dnode

In this mode the Dnode is able to compute various algorithms like MAC or serial digital filters This architecture joint to a specific inputloutput Data controller (exposed later) allows very efficient high bandwidth data oriented computation

-

A B

s Figure 4 The Dnode architecture

42 The Ring architecture A regular array implies interconnection and data transfer problems as a dataflow often nearly all the time needs data feedback (figure 5)

Data feedback

Figure 5 A classical Datapath

34

These operations require important routing capabilities (figure 5) which limit the size or the performance of the array (latency cycles) Our approach to solve this problem takes place in the use of a curled structure

421

We use a curled pipelined systolic structure as shown in figure 6 All the Dnodes form a ring his length (Dnodes layers number) and width (Dnodes per-layer number) can easily be scaled

Forward The main Dataflow

n L a y e r n-1

Dataf low

Layer n

Layer n + l

Figure 6 The Ring architecture Dnodes are organised in layers a Dnode layer is connected to the two adjacent ones by switch components able to make any interconnection between two stages It also manages data injection and recuperation by direct dedicated FIFOs and optional RISC communications via a shared bus In normal mode each Dnode can be seen as an arithmetic operator of a datapath which computes a data each clock cycle In stand-alone mode each Dnode can be seen as a autonomous CPU The structure is also flexible in the way that all Dnodes do not have to run in the same mode allowing to compute either in global mode (normal mode) or local mode (stand-alone)

422 Reverse The secondary Dataflow

The data feedback problem is addressed by this one we use special feedback pipelines (figure 7) forming a reverse Dataflow to avoid complex routing structures The last task that accomplishes each switch is to write unconditionally (no control needed) the computed result of the previous Dnode layer in a dedicated pipeline (each switch owns its pipeline) which allows the feedback of each data to the previous stages (figure 7)

These ones can then choose to get these data through the switches which have direct access to all the pipelines This technique ensures a good scalability of the architecture as the routing problem is thus removed

5

feedback

-

I I

I I

I

Figure 7 The feedback network architecture

Comparisons amp realisations

51 Comparative Results

This version has a maximal computing power of 1600 MIPS (at a clock frequency of 200 MHz) quite impressive compared to the 400 MIPS of a Pentium I1 450 MHz processor The theoretical maximum bandwidth of this version of the structure is about 3 Gbytesh limited to 250 Mbytess in our implemented communication protocol (a PCI based bus) between the host CPU and the core

To program this structure we wrote an assembler which parse both RISC level (for the control) and Ring level primitives It generates the machine code ready to be executed in the architecture

In the application field targeted by third generation systems we can find lots of video-relative techniques One of these well known computing intensive algorithms is the motion estimation Widely used in video compression techniques for broadcasting storing and videoconferencing his task is to remove the temporal redundancy in video streams as the DCTs is to remove the spatial redundancy

Block matching and specially Full Search Block Matching (FSBM) algorithm is the most popular implementation also recommended by several standard committees (MPEG (video) and H26 1 (videoconferencing) standards)

The Mean Absolute Difference (MAD) criterion used to estimate the matching of the current block can be formulated as follows

N N n ) = y Y[R(i j ) - S( i + m j + n)(

r=l j = l

35

Frameat t Ev Minimization

Frame rate

Block size

--- 1

15 Framesls

8 x 8 Pixels

Frame at t+l

Dnode area

r

025ym 018pm

006 mm2 004 m2

Best match

Motion vector (mn)

Estimated Frequency

DCT + Quantification

1 Huffman coding

200 MHz MHz

Searching region

Figure 8 The Motion Estimation algorithm

R(ij ) is the reference block of size N x N and S(i+mj+n) the candidate block within the search area determined by p and q which are the maximum horizontal and vertical displacements The size of this area is (N+p+q)2 pixels and the displacement vector represented by (mn) is determined by the least MAD(mn) among all the (p+q+1)2 possible displacement within the search area Table 1 shows the specifications of a videoconferencing codec

I Imagesize I 352~240Pixels I

1 Max Displacement I lt= mrsquo lt= I Pixels

Table 1 Videoconferencing codec specifications

For each candidate block the first summation (i=1 to 8) requires N operations and the accumulation N- 1 operations thus a total of 2N-I operations The second summation requires to compute N times the previous one account of operations and again N- 1 operations for the accumulation of the partial sums The total amount of arithmetic operations to compute is so 2N2 - 1 The (2N-I)N first operations can be achieved within (2N- l)N (075Nx) clock cycles in a Nx Nodes version of our structure as there are no dependencies on these data and one node over four is in wait state (layer n 2 nodes computing two R()-S() operations layer n+l 1 node accumulating of the two previous computed results) The last N-1 operations (accumulation) are achieved in int(ln(N))+l clock cycles for N lt= Nx In a 16 Nodes version of our structure and with the previous specified codec (N=8) the computation of the MAD for a candidate block requires 13 clock cycles Each reference block requires the computation of 289 candidate blocks and there are 1320 reference blocks in each frame The total processing time of an entire image frame is

1320x289~13=4959240 cycles At a 200 MHz clock frequency the computation time would be 24ms which is more than two times smaller than the frame period (115s) Table 2 shows the performances of the Systolic Ring compared with the ASIC architecture implemented in [ 121 and Intel MMX instructions [13] using the criterion of the number of cycles needed for matching a 8x8 reference block against its search area of 8 pixels displacement

1 Systolic Ring 1 ASIC[12] I MMX[13] I 3757 I 581 I 28900 1

Table 2 Motion Estimation performance comparison (cycles)

Our structure shows again its efficiency in a such computing intensive context The ASIC implementation is much faster than our solution at the price of flexibility the Systolic Ring provides the advantage of hardware reuse and is also almost 8 times faster than a MMX solution

62 Design Implementation

Table 3 shows the comparative implementation results in both technologies An area of 07 mm2 could be achieved with the last CMOS 018pm process with a maximum estimated frequency of 200 MHz

The IOU

Core area 1 09 mmrsquo 1 07 rnmrsquo 1

Table 3 Table of synthesis results

area of each Dnode joint to the exposed specific architecture shows that this one could easily be scaled to larger realizations Figure 8 shows a foreseeable 018pm technology 12 mm2 die area SoC for high constrained embedded solution Our specific architecture allows the integration of powerful 64 Dnodes version of our Ring (34 mm2 on-die area) with a widely used ARM7 CPU able to run various operating systems like windows CE Epoc32 or Linux This kind of solution could provide a great computation powedcost ratio which combines the flexibility of a CPU reconfigurable architecture couple with the efficiency of dedicated core

36

ARM7TDMI 0 54 mm - 32 bits ARM RISC core - Running iJmdowsCE EPOC32 i l n u x Wide choice of development to015

Ring 64 3 4 mm2 64 Dnodea Systolic Ring

Fast computation of data oriented applications

Figure 8 The Motion Estimation algorithm

6 Conclusion

We have proposed a new coarse grained arithmetic block based dynamically reconfigurable architecture which proves its efficiency in data oriented processing His scalability shows that its field of applications can not only be limited to embedded high-constrained applications but can also make be worth its faculties in other contexts where data oriented high data bandwidth processing remains critical A small 8-Dnodes version of this structure already provides up to 1600 MIPS of raw power for data dominated application with a sustained data rate of 3 Gbytesls at 200 MHz either in global or local mode Our future work takes place in the translation of the structure to floating point and also in the writing of an efficient compilingprofiling tool which main task would be to maximize the computation density (the objective is no Dnode in wait state) in the architecture

7 References

[I] Stephen Brown and J Rose Architecture of FPGAs and CPLDs A Tutorial IEEE Design and Test of Computers

[2] M Gokhale et ai Building and Using a Highly Parallel Programmable Logic Array IEEE Computer pp 81 -89 Jan 1991

[3] W H Mangione-Smith et al Seeking Solutions in Configurable Computing IEEE Computer pp 38-43 December 1997

[4] J R Hauser and J Wawrzynek Carp A MIPS Processor with a Re-configurable CO-processor Proc of the IEEE Symposium on FPGAs for Custom Computing Machines I997

[ 5 ] A Abnous C Christensen J Gray J Lenell A Naylor and N Bagherzadeh Design and Implementation of the Tiny RISC microprocessor Microprocessors and Microsystems

Vol 13 NO 2 pp 42-57 1996

Vol 16 NO 4 pp 187-94 1992

[6] C Hsieh and T Lin VLSI Architecture For Block- Matching Motion Estimation Algorithm IEEE Trans on Circuits and Systems for Video Technology vol 2 pp 169- 175 June 1992

[7] N Ahmed T Natarajan and KR Rao Discrete cosine transform IEEE Trans On Computers vol C-23 pp 90- 93 Jan 1974

[8] ISOflEC JTCl CD 10918 Digital compression and coding of continuous-tone still images - part 1 requirements and guidelines ISO 1993 (JPEG)

[9] ISOIIEC JTCl CD 13818 Generic coding of moving pictures and associated audio video ISO 1994 (MPEG-2 standard)

[ IO] Challenges for Adaptive Computing Systems Defense and Advanced Research Projects Agency at wwwdarpamiIitolresearchIacschallenges html

[ 1 I ] 25Xilinx the Programmable Logic Data Book 2000 [12] ABugeja and W Yang A Re-configurable VLSI

Coprocessing System for the Block Matching Algorithm IEEE Trans On VLSI systems vol 5 September 1997

[ 131 Intel Application Notes for Pentium MMX httpIdeveloperintelcom

37

_ lt I Fd Dnode

-- C O N F I G U A A 7

II

ION

Figure 3 System overview

This architecture is thus not intended to be used as a stand- alone solution rather an IP core for data oriented intensive computing which would take place in a SOC Figure 3 shows our system in a SoC context The pP can thus confide the most demanding part of a given application to our IP core So it downloads to the RISC memory the corresponding configuration program (manages the dynamic reconfiguration)

From a functional point of view - The host processor first sends the management code to

the configuration controller memory (the custom RISC has its own program memory) This is a object code ready to be executed and specially designed to manage dynamically the configuration of the network (the content of the RAM thus changes from one cycle to another) as to say the functionality of the operating layer Each clock cycle the configuration controller is able to change up to the entire content of the RAM thanks to its dedicated instruction set

Once done our core is ready to compute The host processor sends the data to the operating layer via a specific scheme and then get back the computed data As the configuration is managed dynamically it is possible to multiplex the sent data and to compute them by several sequential (hardware multiplexing) or concurrent(static) synthesised datapaths

-

4 Operating layer architecture

41 The Dnode architecture It essentially consists in an ALU-Multiplier able to make all the classical arithmetic and logic operations addition multiplication subtraction roll shift and so on As shown in figure 4 there is also a register file and two multiplexers This optimised architecture is able in the same clock cycle to make all possible operations even between two different registers Its corresponding microinstruction code the

configuration code comes from a memory location in the configuration layer As previously said this code evolves during the computing phase the functionality can thus be changed from one clock cycle to another from an addition to a multiplication load to register

Each Dnode has in fact two execution modes - Global mode (normal mode) already exposed the

Dnode executes the microinstruction code which comes from the configuration layer managed by the RISC configuration controller

Local mode The stand-alone mode Each Dnode has 7 registers an up to 6-states counter and a 6 to I multiplexer which forms a small controller Each one of the 6 first registers can contain a Dnode microinstruction code and each clock cycle the counter increases the value on the multiplexer address input thus sending the content of a register to the datapath part of the Dnode

In this mode the Dnode is able to compute various algorithms like MAC or serial digital filters This architecture joint to a specific inputloutput Data controller (exposed later) allows very efficient high bandwidth data oriented computation

-

A B

s Figure 4 The Dnode architecture

42 The Ring architecture A regular array implies interconnection and data transfer problems as a dataflow often nearly all the time needs data feedback (figure 5)

Data feedback

Figure 5 A classical Datapath

34

These operations require important routing capabilities (figure 5) which limit the size or the performance of the array (latency cycles) Our approach to solve this problem takes place in the use of a curled structure

421

We use a curled pipelined systolic structure as shown in figure 6 All the Dnodes form a ring his length (Dnodes layers number) and width (Dnodes per-layer number) can easily be scaled

Forward The main Dataflow

n L a y e r n-1

Dataf low

Layer n

Layer n + l

Figure 6 The Ring architecture Dnodes are organised in layers a Dnode layer is connected to the two adjacent ones by switch components able to make any interconnection between two stages It also manages data injection and recuperation by direct dedicated FIFOs and optional RISC communications via a shared bus In normal mode each Dnode can be seen as an arithmetic operator of a datapath which computes a data each clock cycle In stand-alone mode each Dnode can be seen as a autonomous CPU The structure is also flexible in the way that all Dnodes do not have to run in the same mode allowing to compute either in global mode (normal mode) or local mode (stand-alone)

422 Reverse The secondary Dataflow

The data feedback problem is addressed by this one we use special feedback pipelines (figure 7) forming a reverse Dataflow to avoid complex routing structures The last task that accomplishes each switch is to write unconditionally (no control needed) the computed result of the previous Dnode layer in a dedicated pipeline (each switch owns its pipeline) which allows the feedback of each data to the previous stages (figure 7)

These ones can then choose to get these data through the switches which have direct access to all the pipelines This technique ensures a good scalability of the architecture as the routing problem is thus removed

5

feedback

-

I I

I I

I

Figure 7 The feedback network architecture

Comparisons amp realisations

51 Comparative Results

This version has a maximal computing power of 1600 MIPS (at a clock frequency of 200 MHz) quite impressive compared to the 400 MIPS of a Pentium I1 450 MHz processor The theoretical maximum bandwidth of this version of the structure is about 3 Gbytesh limited to 250 Mbytess in our implemented communication protocol (a PCI based bus) between the host CPU and the core

To program this structure we wrote an assembler which parse both RISC level (for the control) and Ring level primitives It generates the machine code ready to be executed in the architecture

In the application field targeted by third generation systems we can find lots of video-relative techniques One of these well known computing intensive algorithms is the motion estimation Widely used in video compression techniques for broadcasting storing and videoconferencing his task is to remove the temporal redundancy in video streams as the DCTs is to remove the spatial redundancy

Block matching and specially Full Search Block Matching (FSBM) algorithm is the most popular implementation also recommended by several standard committees (MPEG (video) and H26 1 (videoconferencing) standards)

The Mean Absolute Difference (MAD) criterion used to estimate the matching of the current block can be formulated as follows

N N n ) = y Y[R(i j ) - S( i + m j + n)(

r=l j = l

35

Frameat t Ev Minimization

Frame rate

Block size

--- 1

15 Framesls

8 x 8 Pixels

Frame at t+l

Dnode area

r

025ym 018pm

006 mm2 004 m2

Best match

Motion vector (mn)

Estimated Frequency

DCT + Quantification

1 Huffman coding

200 MHz MHz

Searching region

Figure 8 The Motion Estimation algorithm

R(ij ) is the reference block of size N x N and S(i+mj+n) the candidate block within the search area determined by p and q which are the maximum horizontal and vertical displacements The size of this area is (N+p+q)2 pixels and the displacement vector represented by (mn) is determined by the least MAD(mn) among all the (p+q+1)2 possible displacement within the search area Table 1 shows the specifications of a videoconferencing codec

I Imagesize I 352~240Pixels I

1 Max Displacement I lt= mrsquo lt= I Pixels

Table 1 Videoconferencing codec specifications

For each candidate block the first summation (i=1 to 8) requires N operations and the accumulation N- 1 operations thus a total of 2N-I operations The second summation requires to compute N times the previous one account of operations and again N- 1 operations for the accumulation of the partial sums The total amount of arithmetic operations to compute is so 2N2 - 1 The (2N-I)N first operations can be achieved within (2N- l)N (075Nx) clock cycles in a Nx Nodes version of our structure as there are no dependencies on these data and one node over four is in wait state (layer n 2 nodes computing two R()-S() operations layer n+l 1 node accumulating of the two previous computed results) The last N-1 operations (accumulation) are achieved in int(ln(N))+l clock cycles for N lt= Nx In a 16 Nodes version of our structure and with the previous specified codec (N=8) the computation of the MAD for a candidate block requires 13 clock cycles Each reference block requires the computation of 289 candidate blocks and there are 1320 reference blocks in each frame The total processing time of an entire image frame is

1320x289~13=4959240 cycles At a 200 MHz clock frequency the computation time would be 24ms which is more than two times smaller than the frame period (115s) Table 2 shows the performances of the Systolic Ring compared with the ASIC architecture implemented in [ 121 and Intel MMX instructions [13] using the criterion of the number of cycles needed for matching a 8x8 reference block against its search area of 8 pixels displacement

1 Systolic Ring 1 ASIC[12] I MMX[13] I 3757 I 581 I 28900 1

Table 2 Motion Estimation performance comparison (cycles)

Our structure shows again its efficiency in a such computing intensive context The ASIC implementation is much faster than our solution at the price of flexibility the Systolic Ring provides the advantage of hardware reuse and is also almost 8 times faster than a MMX solution

62 Design Implementation

Table 3 shows the comparative implementation results in both technologies An area of 07 mm2 could be achieved with the last CMOS 018pm process with a maximum estimated frequency of 200 MHz

The IOU

Core area 1 09 mmrsquo 1 07 rnmrsquo 1

Table 3 Table of synthesis results

area of each Dnode joint to the exposed specific architecture shows that this one could easily be scaled to larger realizations Figure 8 shows a foreseeable 018pm technology 12 mm2 die area SoC for high constrained embedded solution Our specific architecture allows the integration of powerful 64 Dnodes version of our Ring (34 mm2 on-die area) with a widely used ARM7 CPU able to run various operating systems like windows CE Epoc32 or Linux This kind of solution could provide a great computation powedcost ratio which combines the flexibility of a CPU reconfigurable architecture couple with the efficiency of dedicated core

36

ARM7TDMI 0 54 mm - 32 bits ARM RISC core - Running iJmdowsCE EPOC32 i l n u x Wide choice of development to015

Ring 64 3 4 mm2 64 Dnodea Systolic Ring

Fast computation of data oriented applications

Figure 8 The Motion Estimation algorithm

6 Conclusion

We have proposed a new coarse grained arithmetic block based dynamically reconfigurable architecture which proves its efficiency in data oriented processing His scalability shows that its field of applications can not only be limited to embedded high-constrained applications but can also make be worth its faculties in other contexts where data oriented high data bandwidth processing remains critical A small 8-Dnodes version of this structure already provides up to 1600 MIPS of raw power for data dominated application with a sustained data rate of 3 Gbytesls at 200 MHz either in global or local mode Our future work takes place in the translation of the structure to floating point and also in the writing of an efficient compilingprofiling tool which main task would be to maximize the computation density (the objective is no Dnode in wait state) in the architecture

7 References

[I] Stephen Brown and J Rose Architecture of FPGAs and CPLDs A Tutorial IEEE Design and Test of Computers

[2] M Gokhale et ai Building and Using a Highly Parallel Programmable Logic Array IEEE Computer pp 81 -89 Jan 1991

[3] W H Mangione-Smith et al Seeking Solutions in Configurable Computing IEEE Computer pp 38-43 December 1997

[4] J R Hauser and J Wawrzynek Carp A MIPS Processor with a Re-configurable CO-processor Proc of the IEEE Symposium on FPGAs for Custom Computing Machines I997

[ 5 ] A Abnous C Christensen J Gray J Lenell A Naylor and N Bagherzadeh Design and Implementation of the Tiny RISC microprocessor Microprocessors and Microsystems

Vol 13 NO 2 pp 42-57 1996

Vol 16 NO 4 pp 187-94 1992

[6] C Hsieh and T Lin VLSI Architecture For Block- Matching Motion Estimation Algorithm IEEE Trans on Circuits and Systems for Video Technology vol 2 pp 169- 175 June 1992

[7] N Ahmed T Natarajan and KR Rao Discrete cosine transform IEEE Trans On Computers vol C-23 pp 90- 93 Jan 1974

[8] ISOflEC JTCl CD 10918 Digital compression and coding of continuous-tone still images - part 1 requirements and guidelines ISO 1993 (JPEG)

[9] ISOIIEC JTCl CD 13818 Generic coding of moving pictures and associated audio video ISO 1994 (MPEG-2 standard)

[ IO] Challenges for Adaptive Computing Systems Defense and Advanced Research Projects Agency at wwwdarpamiIitolresearchIacschallenges html

[ 1 I ] 25Xilinx the Programmable Logic Data Book 2000 [12] ABugeja and W Yang A Re-configurable VLSI

Coprocessing System for the Block Matching Algorithm IEEE Trans On VLSI systems vol 5 September 1997

[ 131 Intel Application Notes for Pentium MMX httpIdeveloperintelcom

37

These operations require important routing capabilities (figure 5) which limit the size or the performance of the array (latency cycles) Our approach to solve this problem takes place in the use of a curled structure

421

We use a curled pipelined systolic structure as shown in figure 6 All the Dnodes form a ring his length (Dnodes layers number) and width (Dnodes per-layer number) can easily be scaled

Forward The main Dataflow

n L a y e r n-1

Dataf low

Layer n

Layer n + l

Figure 6 The Ring architecture Dnodes are organised in layers a Dnode layer is connected to the two adjacent ones by switch components able to make any interconnection between two stages It also manages data injection and recuperation by direct dedicated FIFOs and optional RISC communications via a shared bus In normal mode each Dnode can be seen as an arithmetic operator of a datapath which computes a data each clock cycle In stand-alone mode each Dnode can be seen as a autonomous CPU The structure is also flexible in the way that all Dnodes do not have to run in the same mode allowing to compute either in global mode (normal mode) or local mode (stand-alone)

422 Reverse The secondary Dataflow

The data feedback problem is addressed by special feedback pipelines (figure 7) that form a reverse dataflow and avoid complex routing structures. The last task each switch performs is to write, unconditionally (no control needed), the result computed by the previous Dnode layer into a dedicated pipeline (each switch owns its own pipeline), which feeds every data item back to the earlier stages (figure 7).

These stages can then choose to retrieve the data through the switches, which have direct access to all the pipelines. This technique ensures good scalability of the architecture, since the routing problem is removed.

Figure 7: The feedback network architecture
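The paper describes the Ring only at block-diagram level. Purely as a minimal sketch of the mechanism just described (layers of Dnodes, inter-layer switches, and the per-switch feedback pipelines of the reverse dataflow), the following Python model may help fix the ideas; the class names, the pipeline depth and the single-operation Dnodes are illustrative assumptions, not the actual design.

```python
from collections import deque

class Dnode:
    """Simplified Dnode: applies one arithmetic operation per clock cycle (normal mode)."""
    def __init__(self, op):
        self.op = op                         # e.g. lambda a, b: a + b

    def compute(self, a, b):
        return self.op(a, b)

class Switch:
    """Connects two adjacent Dnode layers and owns a feedback pipeline (reverse dataflow)."""
    def __init__(self, depth=4):
        self.feedback = deque(maxlen=depth)  # dedicated feedback pipeline (assumed depth)

    def forward(self, layer_results):
        # Unconditionally push the previous layer's results into the feedback pipeline,
        # then pass them on to the next layer (any inter-stage permutation could go here).
        self.feedback.append(tuple(layer_results))
        return layer_results

class Ring:
    """Curled pipelined structure: 'layers' Dnode layers of 'width' Dnodes, closed in a ring."""
    def __init__(self, layers, width, op):
        self.layers = [[Dnode(op) for _ in range(width)] for _ in range(layers)]
        self.switches = [Switch() for _ in range(layers)]

    def step(self, layer_index, operands):
        """One clock cycle of one layer: compute, then route through its switch."""
        layer = self.layers[layer_index]
        results = [d.compute(a, b) for d, (a, b) in zip(layer, operands)]
        return self.switches[layer_index].forward(results)

# Tiny usage example: a 4-layer, 2-Dnode-wide ring of adders.
ring = Ring(layers=4, width=2, op=lambda a, b: a + b)
print(ring.step(0, [(1, 2), (3, 4)]))      # [3, 7]
print(ring.switches[0].feedback)           # same results, available for reuse by earlier stages
```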

5. Comparisons & realisations

5.1 Comparative Results

This version has a maximum computing power of 1600 MIPS (at a clock frequency of 200 MHz), which is quite impressive compared to the 400 MIPS of a 450 MHz Pentium II processor. The theoretical maximum bandwidth of this version of the structure is about 3 Gbytes/s, limited to 250 Mbytes/s by the communication protocol we implemented (a PCI-based bus) between the host CPU and the core.

To program this structure we wrote an assembler that parses both RISC-level primitives (for the control) and Ring-level primitives, and generates machine code ready to be executed on the architecture.
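The primitive set and instruction encodings of this assembler are not given here. Purely as a sketch, with hypothetical mnemonics and opcode values, a two-level assembler could dispatch each source line to a RISC-level or a Ring-level encoder as follows.

```python
# Hypothetical two-level assembler front end: RISC-level primitives drive the control,
# Ring-level primitives configure the Dnodes. Mnemonics and opcode values are invented.
RISC_OPCODES = {"LOOP": 0x10, "WAIT": 0x11, "CFG": 0x12}
RING_OPCODES = {"ADD": 0x80, "SUB": 0x81, "MAC": 0x82, "NOP": 0x8F}

def assemble(source: str) -> list[int]:
    machine_code = []
    for line in source.splitlines():
        line = line.split(";")[0].strip()            # strip comments and whitespace
        if not line:
            continue
        mnemonic, *operands = line.split()
        if mnemonic in RISC_OPCODES:                 # control-level primitive
            word = RISC_OPCODES[mnemonic] << 8
        elif mnemonic in RING_OPCODES:               # Dnode/Ring-level primitive
            word = RING_OPCODES[mnemonic] << 8
        else:
            raise ValueError(f"unknown mnemonic: {mnemonic}")
        word |= int(operands[0]) if operands else 0  # single immediate field, for illustration
        machine_code.append(word)
    return machine_code

print(assemble("CFG 3\nMAC 1   ; multiply-accumulate on Dnode 1\nWAIT 0"))
```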

In the application field targeted by third-generation systems we find many video-related techniques. One of these well-known, computation-intensive algorithms is motion estimation. Widely used in video compression for broadcasting, storage and videoconferencing, its task is to remove the temporal redundancy in video streams, just as the DCT removes the spatial redundancy.

Block matching, and especially the Full Search Block Matching (FSBM) algorithm, is the most popular implementation; it is also recommended by several standardisation committees (the MPEG video standards and the H.261 videoconferencing standard).

The Mean Absolute Difference (MAD) criterion used to estimate the matching of the current block can be formulated as follows:

$$\mathrm{MAD}(m,n) = \sum_{i=1}^{N}\sum_{j=1}^{N} \bigl|\, R(i,j) - S(i+m,\, j+n) \,\bigr|$$


Figure 8: The Motion Estimation algorithm (reference block in the frame at t, searching region and best match in the frame at t+1, motion vector (m,n), followed by DCT + quantization and Huffman coding)

R(i,j) is the reference block of size N x N, and S(i+m, j+n) is the candidate block within the search area determined by p and q, the maximum horizontal and vertical displacements. The size of this area is (N+p+q)^2 pixels, and the displacement vector (m,n) is the one giving the least MAD(m,n) among all (p+q+1)^2 possible displacements within the search area. Table 1 shows the specifications of a videoconferencing codec.

Image size        | 352 x 240 pixels
Frame rate        | 15 frames/s
Block size        | 8 x 8 pixels
Max. displacement | -8 <= m, n <= +8 pixels

Table 1: Videoconferencing codec specifications
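To make the MAD criterion and the full search concrete, the following NumPy sketch evaluates all (p+q+1)^2 candidate positions for one reference block. It is a plain software reference, not the mapping onto the Ring; the function names and the synthetic test frames are illustrative only.

```python
import numpy as np

def mad(ref, cand):
    """Sum of absolute differences between two N x N blocks (the MAD criterion above,
    without normalisation)."""
    return np.abs(ref.astype(int) - cand.astype(int)).sum()

def full_search(frame_t, frame_t1, top, left, N=8, p=8):
    """FSBM: match the N x N reference block at (top, left) of frame_t against all
    (2p+1)^2 candidate positions in frame_t1; return the best motion vector (m, n)."""
    ref = frame_t[top:top + N, left:left + N]
    best, best_mv = None, (0, 0)
    for m in range(-p, p + 1):
        for n in range(-p, p + 1):
            r, c = top + m, left + n
            if 0 <= r and 0 <= c and r + N <= frame_t1.shape[0] and c + N <= frame_t1.shape[1]:
                score = mad(ref, frame_t1[r:r + N, c:c + N])
                if best is None or score < best:
                    best, best_mv = score, (m, n)
    return best_mv, best

# Illustration on synthetic 352 x 240 frames (the Table 1 format).
rng = np.random.default_rng(0)
f0 = rng.integers(0, 256, size=(240, 352), dtype=np.uint8)
f1 = np.roll(f0, shift=(2, -3), axis=(0, 1))       # synthetic global motion of (+2, -3)
print(full_search(f0, f1, top=64, left=128))       # expected motion vector (2, -3)
```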

For each candidate block, the inner summation requires N operations (the absolute differences) and N-1 operations for their accumulation, i.e. a total of 2N-1 operations. The outer summation requires computing the inner one N times, hence N(2N-1) operations, plus again N-1 operations to accumulate the partial sums. The total number of arithmetic operations is therefore 2N^2 - 1. The first (2N-1)N operations can be completed within (2N-1)N / (0.75 Nx) clock cycles on an Nx-Dnode version of our structure, since there are no data dependencies among them and one node out of four is in a wait state (layer n: two nodes each computing an |R(.) - S(.)| operation; layer n+1: one node accumulating the two results just computed). The last N-1 operations (the final accumulation) are completed in int(ln(N)) + 1 clock cycles for N <= Nx. In a 16-Dnode version of our structure, with the codec specified above (N = 8), computing the MAD of one candidate block therefore requires 13 clock cycles. Each reference block requires the evaluation of 289 candidate blocks, and there are 1320 reference blocks in each frame. The total processing time of an entire image frame is therefore

1320 × 289 × 13 = 4,959,240 cycles. At a 200 MHz clock frequency, the computation time is about 24.8 ms, more than two times smaller than the frame period (1/15 s). Table 2 compares the performance of the Systolic Ring with the ASIC architecture implemented in [12] and with Intel MMX instructions [13], using as criterion the number of cycles needed to match an 8x8 reference block against its search area with a displacement of 8 pixels.

Systolic Ring | ASIC [12] | MMX [13]
3757          | 581       | 28900

Table 2: Motion Estimation performance comparison (cycles)

Our structure again shows its efficiency in such a computation-intensive context. The ASIC implementation is much faster than our solution, but at the price of flexibility; the Systolic Ring offers the advantage of hardware reuse and is also almost 8 times faster than the MMX solution.
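As a quick check of the figures above, the per-block cycle count, the per-frame total and the Table 2 speedups can be reproduced directly from the constants quoted in the text:

```python
import math

# Constants taken from the text and from Tables 1-2.
N, Nx = 8, 16                                  # block size, number of Dnodes
first = (2 * N - 1) * N / (0.75 * Nx)          # (2N-1)N / (0.75 Nx) = 10 cycles
last = int(math.log(N)) + 1                    # int(ln(N)) + 1 = 3 cycles
cycles_per_candidate = first + last            # 13 cycles per candidate block
candidates = (8 + 8 + 1) ** 2                  # 289 candidate positions (p = q = 8)
blocks = (352 // 8) * (240 // 8)               # 1320 reference blocks per frame
frame_cycles = blocks * candidates * cycles_per_candidate
print(frame_cycles)                            # 4 959 240 cycles
print(frame_cycles / 200e6 * 1e3, "ms")        # ~24.8 ms, well below the 66.7 ms frame period
print(28900 / 3757, 3757 / 581)                # ~7.7x faster than MMX, ~6.5x slower than the ASIC
```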

5.2 Design Implementation

Table 3 shows comparative implementation results in both technologies. An area of 0.7 mm² can be achieved with the latest 0.18 µm CMOS process, with a maximum estimated frequency of 200 MHz.

                    | 0.25 µm   | 0.18 µm
Dnode area          | 0.06 mm²  | 0.04 mm²
Core area           | 0.9 mm²   | 0.7 mm²
Estimated frequency | -         | 200 MHz

Table 3: Synthesis results

The low area of each Dnode, combined with the specific architecture exposed above, shows that the structure can easily be scaled up to larger realisations. Figure 9 shows a foreseeable SoC in 0.18 µm technology with a 12 mm² die area for highly constrained embedded solutions. Our specific architecture allows the integration of a powerful 64-Dnode version of our Ring (3.4 mm² on-die area) together with a widely used ARM7 CPU able to run various operating systems such as Windows CE, EPOC32 or Linux. This kind of solution could provide a great computing-power/cost ratio, combining the flexibility of a CPU/reconfigurable-architecture couple with the efficiency of a dedicated core.


Figure 9: A foreseeable SoC floorplan: ARM7TDMI core (0.54 mm², 32-bit ARM RISC core running Windows CE, EPOC32 or Linux, with a wide choice of development tools) and Ring 64 (3.4 mm², 64-Dnode Systolic Ring for fast computation of data-oriented applications)

6. Conclusion

We have proposed a new, coarse-grained, arithmetic-block-based, dynamically reconfigurable architecture that proves its efficiency in data-oriented processing. Its scalability shows that its field of application is not limited to highly constrained embedded applications: it can also be worthwhile in other contexts where data-oriented, high-bandwidth processing remains critical. A small 8-Dnode version of this structure already provides up to 1600 MIPS of raw power for data-dominated applications, with a sustained data rate of 3 Gbytes/s at 200 MHz in either global or local mode. Our future work concerns the extension of the structure to floating point, and the writing of an efficient compiling/profiling tool whose main task will be to maximise the computation density in the architecture (the objective being no Dnode in a wait state).

7. References

[1] S. Brown and J. Rose, "Architecture of FPGAs and CPLDs: A Tutorial," IEEE Design and Test of Computers, vol. 13, no. 2, pp. 42-57, 1996.

[2] M. Gokhale et al., "Building and Using a Highly Parallel Programmable Logic Array," IEEE Computer, pp. 81-89, Jan. 1991.

[3] W. H. Mangione-Smith et al., "Seeking Solutions in Configurable Computing," IEEE Computer, pp. 38-43, Dec. 1997.

[4] J. R. Hauser and J. Wawrzynek, "Garp: A MIPS Processor with a Reconfigurable Coprocessor," Proc. IEEE Symposium on FPGAs for Custom Computing Machines, 1997.

[5] A. Abnous, C. Christensen, J. Gray, J. Lenell, A. Naylor and N. Bagherzadeh, "Design and Implementation of the Tiny RISC Microprocessor," Microprocessors and Microsystems, vol. 16, no. 4, pp. 187-194, 1992.

[6] C. Hsieh and T. Lin, "VLSI Architecture for Block-Matching Motion Estimation Algorithm," IEEE Trans. on Circuits and Systems for Video Technology, vol. 2, pp. 169-175, June 1992.

[7] N. Ahmed, T. Natarajan and K. R. Rao, "Discrete Cosine Transform," IEEE Trans. on Computers, vol. C-23, pp. 90-93, Jan. 1974.

[8] ISO/IEC JTC1 CD 10918, "Digital compression and coding of continuous-tone still images - Part 1: Requirements and guidelines," ISO, 1993 (JPEG).

[9] ISO/IEC JTC1 CD 13818, "Generic coding of moving pictures and associated audio (video)," ISO, 1994 (MPEG-2 standard).

[10] "Challenges for Adaptive Computing Systems," Defense Advanced Research Projects Agency, www.darpa.mil/ito/research/acs/challenges.html.

[11] Xilinx, The Programmable Logic Data Book, 2000.

[12] A. Bugeja and W. Yang, "A Reconfigurable VLSI Coprocessing System for the Block Matching Algorithm," IEEE Trans. on VLSI Systems, vol. 5, Sept. 1997.

[13] Intel, Application Notes for Pentium MMX, http://developer.intel.com.

