circular buffering-5 (1)

8/3/2019 Circular Buffering-5 (1)


Alignment Issue in Circular Buffering:

This document targets Texas Instruments microcontrollers for signal
processing, specifically the TMS320VC5416. The issue is explained in detail in the following paragraphs.

While implementing FIR filtering in C on the TMS320VC5416, some function calls
in the DSPLIB require that the coefficients and the data be aligned in memory.
The alignment of the data is understood, but what is not understood is the
alignment required for the filter coefficients. This document explains why
alignment is required for the filter coefficients in memory for the DSPLIB
functions to work. The question can be rephrased as: why was it chosen to
put the filter coefficients in a circular buffer instead of a linear buffer? Is it more efficient?

First I clarify what exactly we mean by alignment. Alignment in memory here
means that, for a given filter length nh, the coefficients are put into memory at a
starting address whose K lower bits are zero, where K = log2(L) and L is a power
of two with L >= nh. E.g. for a filter length of 5 (L = 8, K = 3) the starting
address in memory for the filter coefficients must be:

x x x x x x x x x x x x x 0 0 0
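
The alignment rule above can be sketched in C as follows. This is only an illustration of the arithmetic, not DSPLIB code: the function names align_bits and is_aligned are hypothetical, and addresses are assumed to be 16 bits wide as in the bit pattern shown.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest K such that 2^K >= nh (the rule as stated in the text). */
unsigned align_bits(unsigned nh)
{
    unsigned k = 0;

    assert(nh <= 0x8000u);          /* assumption: 16-bit address space */
    while ((1u << k) < nh)
        k++;
    return k;
}

/* Nonzero when the K low-order bits of addr are all zero. */
int is_aligned(uint16_t addr, unsigned k)
{
    uint16_t mask;

    assert(k < 16);
    mask = (uint16_t)((1u << k) - 1u);
    return (addr & mask) == 0;
}
```

For nh = 5, align_bits returns 3, so an address such as 0x1238 (low three bits 000) qualifies, while 0x123C (low three bits 100) does not.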

The issues are as follows:

1. Why do filter coefficients need circular buffering for the DSPLIB functions to work?

2. Why do the filter coefficients need memory alignment for the DSPLIB functions to work?

Issues:

The approach is as follows. We answer the following questions.

a) Why is circular buffering more efficient than linear buffering for the data?

a.1) How is FIR filtering implemented?

a.2) How is it implemented in the hardware?



b) Why is it efficient for the filter coefficients?

c) Why is alignment needed for DSPLIB functions to work?

a.1 :

I first explain FIR processing algorithms, together with the flaws and problems that
occur in practical implementations; these will shed light on our issues.

For an FIR filter the impulse response is of the form:

$$h = [h_0, h_1, \dots, h_M]$$

Here M is the filter order.

The length of the filter is nh = M+1.

The output can be obtained by the equation:

$$y(n) = \sum_{m=0}^{M} h_m \, x(n-m)$$    (Equation 1)

length of output = ny

length of input = nx

Now, to implement the above equation there are two types of methods:

1. Sample Processing: We take the input samples one by one, and with each
input coming in we have one sample of the output y. Here the filter is implemented
as a state system. Some sample processing techniques are as follows.

- Direct form 1
- Direct form 2
- Canonical form
- Cascade form

2. Block Processing: Here we take a block of input samples and produce many
output samples. Some block processing methods are as follows.

- Convolution
- Matrix form
- LTI form
- Overlap-add block convolution method

We start with the sample processing methods as they are much easier in real-time
applications. We focus on Direct form 1.

The structure that implements the filter in direct form is the following.

fig 1 : direct form 1

The above structure implements an Mth-order filter.

To understand at a higher level what is happening, we first look at the C
implementation of the above structure and then look at the processor-level
implementation.

The following code would implement the above structure:

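    double fir(int M, const double *h, double *w, double x)   /* usage: y = fir(M, h, w, x) */
    {
        int i;
        double y = 0;                 /* output sample */

        /* Assumption: the listing is cut off here in the source. The lines
           below are the standard direct-form steps and may differ from it. */
        w[0] = x;                     /* current input sample enters the delay line */
        for (i = 0; i <= M; i++)
            y += h[i] * w[i];         /* Equation 1 */
        for (i = M; i >= 1; i--)
            w[i] = w[i - 1];          /* move the stored samples by one location */
        return y;
    }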




1. One call of the above function returns just one sample of the output, i.e. the first
call gives y0, the second call gives y1, and so on. However, with every call
we must give the function the proper input sample for it to give the correct output.

2. Here the picture suggests the input samples sitting still and the filter sliding
along. This implements Equation 1 but with a change of variables. If we
picture the opposite, i.e. the filter sitting still and the input samples sliding through, then
it replicates Equation 1 and the function.
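
The two comments above can be illustrated with a small C sketch of linear-buffer sample processing. It assumes a delay line w of M+1 states that the caller has set to zero; the names fir_sample and fir_block are hypothetical and are not DSPLIB functions.

```c
#include <assert.h>
#include <stddef.h>

/* One output sample of an order-M FIR filter in direct form.
 * h: M+1 coefficients, w: M+1 delay-line states (zeroed by the caller). */
double fir_sample(int M, const double *h, double *w, double x)
{
    double y = 0.0;
    int i;

    assert(M >= 0);
    w[0] = x;                       /* newest input enters the delay line */
    for (i = 0; i <= M; i++)
        y += h[i] * w[i];           /* y(n) = sum over i of h[i] * x(n - i) */
    for (i = M; i >= 1; i--)
        w[i] = w[i - 1];            /* shift every stored sample by one place */
    return y;
}

/* One call per input sample: call n produces output sample y[n]. */
void fir_block(int M, const double *h, double *w,
               const double *x, double *y, size_t nx)
{
    size_t n;

    for (n = 0; n < nx; n++)
        y[n] = fir_sample(M, h, w, x[n]);
}
```

Feeding a unit impulse through an order-3 filter with coefficients 1, 2, 3, 4 returns the coefficients themselves, one per call. The second loop in fir_sample is the data movement whose cost the rest of this note is concerned with.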

a.2 :

Now we look more closely at how this can be implemented in the
hardware (controller). Hardware implementation is closely related to the assembly
language, or the instruction set, available to us for a particular processor. Here we
use the instruction set of the TMS320VC to understand the implementation.

All the filter coefficients will be located at a particular location in memory.
Here we assume a filter of order 3, but this can be generalized to a filter
of any order M.

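[Figure: memory map. Say the filter coefficients h0, h1, h2, h3 are at addresses 0000 to 0003 and the input samples x0, x1, ..., xn are in data memory, with the window being filtered at addresses 1000 to 1003. The outputs y0, y1, ..., ym are produced as the filter slides along the samples.]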



Above is a pictorial representation of the convolution procedure. Every output
sample y is the result of one fir function call.

Now in assembly we can use the MAC and RPT instructions.

*********

Here we look at specific locations in memory, in our case 0000 to
0003 for the filter and 1000 to 1003 for the data, and perform the filtering over those memory
locations. We therefore have to keep moving our data into those memory locations.
So we need continuous movement of the input data.

This is clearly an overhead. The following mechanism can be used to do the same
thing more efficiently.

[Figure: pointer p for the data and pointer p1 for the filter coefficients]

A pseudo code is as follows:

cfir:

repeat : n



fig 3 : Contents of circular buffer at successive time instants

    Address    n=0    n=1
    1001       .      p
    1000       p
    999        0      0
    998        0      0
    997        0      0

fig4: updating of pointer p for one cfir call (n=0, n=1 etc.) and successive cfir
calls



In this case the data is assumed to stay at a fixed location in memory. The
pointer p, as shown in fig 3, points at the location of x0. For the first call of cfir
(n=0 in fig 3), p points at x0. The function decrements the pointer successively
(so that it points to memory locations 999, 998 and 997), and each of these steps
results in a multiplication by 0. When the repeat is executed for the last time (n=nh),
the if condition becomes true: the pointer p (pointing to the data) is first reset to the
position it started from, in this case x0 (memory location 1000), and then
incremented, so that it now points to x1 (memory location 1001). For the next
call (n=1 in fig 3) the same thing happens, but now, when n equals nh and the
function enters the if, the pointer p is reset to its starting position for this call, x1,
and then incremented. Thus the pointer wraps around, emulating a circular buffer. This is
clearly more efficient than moving all the data every time a filtering operation is
done. Putting the data in the circular buffer means that we do not need to
implement statement 2 and statement 3. We still have to implement statement 1 because of the pointer p1.

This clarifies all parts of a.
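
The pointer behaviour described above can be sketched in C with the wrap-around done in software. This is a minimal sketch under stated assumptions, not the cfir pseudo code of this note: the structure cbuf_t, its fields and the function cfir_sample are hypothetical names, and the buffer holds exactly nh samples that the caller has set to zero.

```c
#include <assert.h>

/* Hypothetical state: nh stored samples and the index of the newest one. */
typedef struct {
    double *buf;    /* nh samples, zeroed by the caller */
    int nh;         /* filter length, also the buffer length */
    int newest;     /* index of the most recent sample */
} cbuf_t;

/* One output sample. No stored sample is moved; only the index wraps. */
double cfir_sample(cbuf_t *c, const double *h, double x)
{
    double y = 0.0;
    int i, p;

    assert(c->nh > 0);
    c->newest = (c->newest + 1) % c->nh;    /* advance once per call, wrapping */
    c->buf[c->newest] = x;                  /* overwrite the oldest sample */

    p = c->newest;
    for (i = 0; i < c->nh; i++) {
        y += h[i] * c->buf[p];              /* h[i] * x(n - i) */
        p = (p == 0) ? c->nh - 1 : p - 1;   /* step back one sample, wrapping */
    }
    return y;
}
```

As in the description, the index moves backwards through the samples within one call and ends up one position further on for the next call. On a processor with circular addressing hardware the two wrap tests would not be written in software at all.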

Some comments:

1. Although we started with sample processing, calling the above cfir
function more than once effectively amounts to block processing.

2. For TMS320VCXXXX controllers, resetting the data pointer and incrementing
the pointer for the next filtering operation are done in the hardware.

3. The function is called ny times to get all the output
samples. After ny calls the pointer to the data values is reset to the first location.

This explains the need for circular buffering for the data samples. Now we look at
part b.

b :

In the above discussion we only looked at what happens to the data samples. Now
we concentrate on the filter coefficients. For the first iteration (i.e. n=0) the pointer
p1 points at h0. In the successive iterations the pointer is moved on
and points to h1, h2 and h3 in turn. This completes one
filtering operation (i.e. one call of cfir). For the next call of
the function, i.e. the next filtering operation for the next output sample, the pointer p1
must point to h0 again. Thus it can be seen that the filter coefficients would also
benefit from being put in a circular buffer. What we mean by putting them in a circular
buffer is that we will not have to check for statement 4. So if we put both the data
and the filter coefficients in circular buffers, then we do not need to execute
statements 1, 2, 3 and 4 in software.

This clarifies b.

c :

To understand why alignment in memory would be required
specifically for the filter coefficients when circular buffering is used, we take a
closer look at how circular buffering would be implemented in the
hardware. We assume that the filter coefficients are stored in order in contiguous
locations in memory.

Before we look at the two ways of doing this, we note what is required for a
hardware implementation of circular buffering for the filter coefficients to be
successful.

- The filter buffer pointer must reset, i.e. come back to its original position
(the starting address, or the location of the first filter coefficient), once it
has gone through nh iterations.

The information available to the hardware is:

1. nh, i.e. the length of the filter.

2. The starting address of the filter coefficients.

Now there are two ways we can implement the requirement stated above.

1. The starting address of the buffer is stored in a register, say R1, and is
added to the length of the filter stored in R2. The result is stored in
register R3. The contents of R1 are copied to R4, which acts as the pointer
and is incremented in cfir. With every increment the contents of
R4 are XORed with R3. If a match is found (the result is zero) then R1 is copied to R4.
Thus a circular buffer is implemented.

2. For the second method we must ensure the following:

a. The size of the buffer must be a power of two (2^n > nh). The filter
length can be any size.

b. However, the buffer must be aligned so that the starting address of
the buffer has its n LSBs equal to zero.

In this case the register L1 contains the length of the filter. After every
iteration of the loop in cfir the register is decremented. As soon as
it reaches zero, the n LSBs of the pointer are set to zero.

The TMS320 processors implement the circular buffer in the above-mentioned

way. Hence we need alignment in the memory.
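
The difference between the two methods can be sketched in C. This is a software model written for illustration, assuming 16-bit addresses; it is not a description of the actual TMS320 addressing logic, and the function names wrap_compare and wrap_aligned are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Method 1: after each increment, compare with the end address start + nh.
 * The start address has to be kept in a register of its own. */
uint16_t wrap_compare(uint16_t ptr, uint16_t start, uint16_t nh)
{
    uint16_t end = (uint16_t)(start + nh);

    ptr = (uint16_t)(ptr + 1);
    if ((ptr ^ end) == 0)           /* XOR is zero exactly when ptr == end */
        ptr = start;
    return ptr;
}

/* Method 2: the buffer starts at an address whose n LSBs are zero, so the
 * n LSBs of the pointer are the position inside the buffer. Wrapping only
 * needs those bits cleared; no start address has to be stored. */
uint16_t wrap_aligned(uint16_t ptr, unsigned n, uint16_t nh)
{
    uint16_t mask, index;

    assert(n < 16 && nh >= 1 && (1u << n) > nh);
    mask = (uint16_t)((1u << n) - 1u);
    index = (uint16_t)((ptr & mask) + 1);   /* position after the increment */
    if (index >= nh)
        return (uint16_t)(ptr & (uint16_t)~mask);   /* clear the n LSBs */
    return (uint16_t)(ptr + 1);
}
```

With a buffer of nh = 5 starting at 0x1238 (n = 3), both functions step through 0x1239 to 0x123C and then return to 0x1238. The second one recovers the start address from the pointer itself, which is only possible because the buffer is aligned.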

Here we do not need to move the data in memory. The data stays in one
place; it is only the pointer pointing to it that changes. This is definitely
more efficient than the previous method. The former method is linear buffering, in which
the data was assumed to be in a linear buffer, and the latter method is circular
buffering, in which the data is put in a circular buffer.

As we see, one filtering operation needs only as many input
samples as the length of the filter.

So every time fir is called we first have to move the input data by one memory
location and then perform the filtering.

In order that the above code can be implemented in assembly the following
things need to be done.

This clarifies c.

Arthur Butz.

Rushi Desai.
