# 7.4. Realizing finite precision filters – Digital Filters Design for Signal and Image Processing

## 7.4. Realizing finite precision filters

### 7.4.1. Introduction

In Chapters 5 and 6, we discussed FIR and IIR filter synthesis. The methods presented there allow us to obtain the transfer function of the filter, generally in the direct canonic form. However, in practice, and especially when a filter is implemented on a processor dedicated to digital signal processing (DSP), the coefficients and sample values are coded on a finite number of bits. Quantizing these values imposes several constraints that must be taken into account.

For IIR filters, we will show the influence that quantization errors on the filter coefficients can have on the frequency response, first for a direct canonic structure and then for a cascade structure. This observation allows us to deduce the most suitable structure for implementing digital filters. Then, we will look at other problems related to implementing filters, such as saturation.

Note that this section does not cover parallel structures; however, a similar study can be made to assess the influence of coefficient quantization on their frequency response.

### 7.4.2. Examples of FIR filters

Let the following formula be the transfer function of an FIR filter:

During implementation, the coefficients of the impulse response h(n) will generally not take their exact theoretical values; they are rounded to the value [h(n)]r, the closest multiple of the quantization step q. The change to finite precision thus introduces a quantization error e(n) written as follows:

This error is bounded and satisfies the following inequality:

So we have:

The error introduced in the transfer function by the quantization of the coefficients of the impulse response is:

The corresponding error on the frequency response, evaluated on the unit circle, then satisfies:

The right-hand side of this inequality constitutes the upper limit of the error. It is easy to evaluate because the coefficients appear linearly in the expression of the frequency response.
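The bound can be checked numerically. The sketch below is only an illustration: the 21-coefficient windowed-sinc filter, the step q = 2^−8 and the helper names `quantize` and `max_response_error` are assumptions, not taken from the text. It relies on the fact that each coefficient error is at most q/2 in magnitude, so for N coefficients the error on the frequency response cannot exceed N·q/2.

```python
import cmath
import math

def quantize(value, q):
    """Round value to the nearest multiple of the quantization step q."""
    return q * round(value / q)

def max_response_error(h, q, n_freq=512):
    """Largest |H(f) - Hq(f)| over a grid of normalized frequencies in [0, 0.5]."""
    e = [c - quantize(c, q) for c in h]  # coefficient errors e(n)
    worst = 0.0
    for k in range(n_freq + 1):
        w = 2 * math.pi * 0.5 * k / n_freq
        err = sum(e_n * cmath.exp(-1j * w * n) for n, e_n in enumerate(e))
        worst = max(worst, abs(err))
    return worst

# Hypothetical example: 21-coefficient low-pass (Hamming-windowed sinc, cut-off 0.15)
N = 21
fc = 0.15
h = []
for n in range(N):
    m = n - (N - 1) / 2
    ideal = 2 * fc if m == 0 else math.sin(2 * math.pi * fc * m) / (math.pi * m)
    window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1))
    h.append(ideal * window)

q = 2 ** -8
bound = N * q / 2
worst = max_response_error(h, q)
print(f"max error on the grid = {worst:.3e}, bound N*q/2 = {bound:.3e}")
```

In practice the measured error is usually well below the bound, since the worst case requires all the coefficient errors to add up in phase.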

### 7.4.3. IIR filters

#### 7.4.3.1 Introduction

First, let us again look at the transfer function expressions for direct and cascade structures:

and

In this section, we will look at the influence of the quantization of the filter coefficients on the frequency behavior of the filter. To do this, we observe the filter attenuation Af(f), defined as:

The expression of the filter attenuation Af(f) given in equation (7.46) is difficult to use in our study. We will therefore use another expression that does not directly involve the modulus of the frequency response. Since the transfer function Hz(z) is a complex quantity, we can express it as a function of its modulus and phase, as follows:

From here, by taking the complex logarithm of equation (7.47), we obtain:

Consequently,

and:

Using equations (7.46) and (7.50), the filter attenuation Af(f) can be written as:

If we consider a direct structure, the change to finite precision introduces an error on each coefficient ap of the denominator and bp of the numerator of the transfer function Hz(z), written respectively Δap and Δbp. This quantization generates an error on the filter attenuation. To a first-order approximation, the global error ΔAf(f) equals:

In equation (7.52), Sai(f) designates the sensitivity of the filter attenuation to the coefficient ai. By using equation (7.50), we can express it as follows:

or:

With a cascade structure, the error occurring on the filter attenuation equals, to the first order:

In this section, we compare the attenuation variations ΔAf(f) due to quantization for the direct canonic and cascade structures.

We begin with the direct structure and, more particularly, with the calculation of the sensitivities to its coefficients. We use the transfer function in equation (7.1), since:

For p between 1 and M−1, we obtain:

In addition, given equation (7.1), since:

for p between 0 and N−1, we have:

From here, the error occurring on the filter attenuation equals, to the first order:

Equation (7.60) expresses the error as a sum of rational functions whose denominators are of degree N−1 or M−1. We will see that this quantity becomes more harmful as the filter's order increases. To see this, we introduce the poles {Pi}i=1, …, M of the transfer function.

So we can express z − Pi as follows:

where ψi is represented in Figure 7.20.

Figure 7.20. Geometric representation of the position of the poles

From here, given equation (7.62), equation (7.61) can be written as:

The higher the number M of poles, the more likely it is that z, which describes the unit circle in the complex plane, comes close to one or more of them, and therefore the more likely the distance between z and a pole is to be small. So the higher the filter's order, the larger the first-order error on the filter attenuation. For a second-order direct structure, the sensitivity of the filter attenuation to the coefficients ap of the denominator of the transfer function Hz(z) is the lowest. Lastly, the closer the poles are to the unit circle, the larger the errors at certain frequencies, since this distance becomes smaller.

Now we will look at the cascade structure of the same filter characterized by equation (7.37). As before, we calculate the sensitivity in relation to the coefficients of the denominators and the numerators of rational fractions of the transfer function, and in relation to the gain K.

So now we have:

from which we have:

As before, we can demonstrate that:

Unlike the case of direct form structures, the global error now depends only on second-order rational functions, one per cell of the cascade. By following the same reasoning based on equations (7.62) and (7.63), we see that the cascade structure is the most appropriate; it minimizes the first-order error on the filter attenuation.

So we choose a structure of second-order cells in cascade to realize an even order filter. For an odd order filter, a first-order cell must also be used.

In order to illustrate the difference in sensitivity between the two structures, we present the reduced precision implementation (to 1/256 then about 1/64) of an IIR type II Chebyshev filter (expected attenuation of 30 dB, with normalized cut-off frequency of 0.15) with a direct and cascade structure. The obtained result is close to the theoretical model.

Figure 7.21. Squared amplitude of a type II Chebyshev filter (expected attenuation of 30 dB, normalized cut-off frequency of 0.15) in finite precision with two types of structures: cascade and direct
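A comparison of this kind can be sketched with a few lines of code. The example below does not reproduce the Chebyshev filter of Figure 7.21: it uses a hypothetical all-pole filter with three conjugate pole pairs of radius 0.95, a step of 1/256, and helper names chosen for illustration only. It rounds the denominator coefficients once in direct form and once cell by cell, then measures the largest deviation of the gain (in dB) from the unquantized filter.

```python
import cmath
import math

def quantize(value, q):
    """Round value to the nearest multiple of the step q."""
    return q * round(value / q)

def poly_mul(a, b):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def poly_eval(coeffs, w):
    """Evaluate sum_p coeffs[p] * exp(-j*w*p) on the unit circle."""
    return sum(c * cmath.exp(-1j * w * p) for p, c in enumerate(coeffs))

def gain_db(denominators, w):
    """Gain in dB of 1 / prod_i A_i(e^{jw}) for a list of denominators A_i."""
    mag = 1.0
    for a in denominators:
        mag *= abs(poly_eval(a, w))
    return -20 * math.log10(max(mag, 1e-300))  # guard against an exact zero

def max_deviation_db(exact, quantized, n_freq=400):
    """Largest gain deviation over a grid of w in [0, pi]."""
    grid = [math.pi * k / n_freq for k in range(n_freq + 1)]
    return max(abs(gain_db(exact, w) - gain_db(quantized, w)) for w in grid)

# Hypothetical filter: three conjugate pole pairs, radius 0.95 (assumption)
radius = 0.95
pole_angles = [1.0, 1.1, 1.2]
sections = [[1.0, -2 * radius * math.cos(t), radius ** 2] for t in pole_angles]

direct = [1.0]
for s in sections:
    direct = poly_mul(direct, s)  # expanded sixth-order denominator

q = 1 / 256
cascade_q = [[quantize(c, q) for c in s] for s in sections]
direct_q = [quantize(c, q) for c in direct]

cascade_dev = max_deviation_db(sections, cascade_q)
direct_dev = max_deviation_db([direct], [direct_q])
print(f"cascade: {cascade_dev:.2f} dB, direct: {direct_dev:.2f} dB")
```

For poles this close to the unit circle, one would generally expect the direct form to show the larger deviation, in line with the sensitivity analysis above, although the exact figures depend on how the rounding errors happen to fall.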

#### 7.4.3.2 The influence of quantification on filter stability

Quantization errors influence not only the frequency response of a filter but also the position of the transfer function poles and, consequently, the stability of the filter.

Let us look again at the expression of the transfer function in the case of a direct structure:

We then bring out the poles of the transfer function in equation (7.1):

When the filter is realized with one of the direct forms, the coefficients appear directly in the difference equation. When this type of filter is implemented, the change to finite precision influences the position of the poles and, consequently, the stability of the filter. In this section, we estimate the precision with which the coefficients must be represented to guarantee the filter's stability.

We assume that Hz(z) is the transfer function of a low-pass filter. The poles of Hz(z) are thus all inside the unit disk and close to the point z = 1. We can write that:

where the error ei is complex and small in modulus.

Let us assume that a single coefficient ar is modified and takes a new, quantized value, so that:

where δ represents the gap in relation to the exact value.

The denominator of the transfer function Hz(z) then is expressed in the following form:

So for z = 1, equation (7.74) becomes:

The transfer function then admits a pole at z = 1 if Aq(1) = 0, that is, taking equation (7.75) into account:

By introducing the poles of Hz(z), from equation (7.76) we obtain the following condition:

By taking into account equation (7.72), the condition in equation (7.77) becomes:

The filter is then unstable because the stability criterion, which requires all the poles to be strictly inside the unit disk in the complex plane, is no longer respected. So, we obtain the pole z = 1 for a relatively small perturbation δ, because:

NUMERICAL EXAMPLE.- let us consider a digital filter with the following transfer function:

Using equation (7.77), if we change one of the coefficients of the denominator by −A(1) = −10^−6, the filter implemented in the direct form structure will be unstable.

Moreover, since 2^−20 < 10^−6 < 2^−19, a perturbation of this order means that the coefficients must be coded on at least 19 bits. However, if we choose to cascade three first-order filters with identical transfer functions, equal to:

the filter will be less sensitive to errors. In this situation, instability occurs if we change one of the coefficients by 10^−2, which corresponds to the quantization error brought about by a coding on 6 bits.
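The figures of this example can be reproduced with a short script. The sketch below assumes that the filter has a triple pole at z = 0.99, that is, a denominator (1 − 0.99 z^−1)^3; this assumption is consistent with the values A(1) = 10^−6 and 10^−2 quoted above, but the transfer function itself is not reproduced here. The helper names are illustrative.

```python
import math

def poly_mul(a, b):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def bits_for(delta):
    """Largest b such that 2^-b is still greater than delta."""
    return math.floor(-math.log2(delta))

first_order = [1.0, -0.99]  # assumed first-order cell: 1 - 0.99 z^-1
direct = [1.0]
for _ in range(3):
    direct = poly_mul(direct, first_order)  # third-order direct-form denominator

# At z = 1 every power of z^-1 equals 1, so A(1) is the sum of the coefficients.
A1_direct = sum(direct)
A1_cell = sum(first_order)

# Perturbing any single coefficient by -A(1) places a root of Aq(z) at z = 1.
perturbed = list(direct)
perturbed[1] -= A1_direct
Aq1 = sum(perturbed)

print(f"direct form: A(1) = {A1_direct:.3e}, Aq(1) = {Aq1:.1e}, bits = {bits_for(A1_direct)}")
print(f"first-order cell: A(1) = {A1_cell:.3e}, bits = {bits_for(A1_cell)}")
```

Under this assumption, the critical perturbation is about 10^−6 for the direct form and 10^−2 for each first-order cell, which gives the 19-bit and 6-bit figures of the text.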

Sections 7.4.3.1 and 7.4.3.2 showed the importance of choosing the correct structure when implementing IIR filters. Although the direct form structure seems a priori the simplest and most natural, it is also highly sensitive to quantization errors and can lead to a frequency response that does not meet the specifications; the filter can even become unstable. For this reason we choose a cascade structure. However, several questions arise at this point:

– Must we introduce scale factors to avoid saturations in second-order cells?

– How can we obtain, then sequence, second-order cells that make up this structure?

– What criterion should we use for representing the necessary cells? The answer to this question will be the topic of section 7.4.3.3.

#### 7.4.3.3 Introduction to scale factors

Before we discuss cell sequencing, we must take into account the saturation problems that can occur at each node of the second-order cell. Because of the recursion, saturation can propagate and make the filtering algorithm unusable. To alleviate this problem, we introduce scale factors inside each second-order cell.

Consider the second-order cell with the transfer function:

the first step consists of avoiding the saturation of the intermediate output x1(k) (see Figure 7.22).

To do that, we determine the impulse response fi(k) of the system whose input is x(k) and output is x1(k). It corresponds to the inverse transform of the associated transfer function. By introducing a normalization factor at the input of the second-order cell, written αi and defined by:

the intermediate output does not undergo saturation because it corresponds to the input x(k) before normalization. The resulting transfer function equals:

Figure 7.22. First part of the structure of a second-order cell with input normalization

We must then compensate for this normalization by 1/αi in the second part of the cell in order to recover the transfer function of the i-th second-order cell; that is:

This operation multiplies the transfer function of the system whose input is x1(k) and output is y(k) by αi. So we have:

Figure 7.23. Structure of a second-order cell resulting from input normalization
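As an illustration of input normalization, the sketch below assumes one common choice of scale factor, the sum of the absolute values of the impulse response of the recursive part (the exact definition used in the text is not reproduced here). The denominator convention 1 + a1 z^−1 + a2 z^−2, the coefficient values and the helper names are also assumptions. With this choice, an input bounded by 1 in magnitude and divided by α yields an intermediate signal that is also bounded by 1.

```python
import random

def impulse_response(a1, a2, length):
    """Impulse response f(k) of 1 / (1 + a1 z^-1 + a2 z^-2), truncated to `length` samples."""
    f = []
    for k in range(length):
        x = 1.0 if k == 0 else 0.0
        f1 = f[k - 1] if k >= 1 else 0.0
        f2 = f[k - 2] if k >= 2 else 0.0
        f.append(x - a1 * f1 - a2 * f2)
    return f

def recursive_part(x, a1, a2):
    """Intermediate signal x1(k) = x(k) - a1*x1(k-1) - a2*x1(k-2)."""
    x1 = []
    for k, xk in enumerate(x):
        p1 = x1[k - 1] if k >= 1 else 0.0
        p2 = x1[k - 2] if k >= 2 else 0.0
        x1.append(xk - a1 * p1 - a2 * p2)
    return x1

# Hypothetical cell with a double pole at z = 0.9 (assumption)
a1, a2 = -1.8, 0.81
f = impulse_response(a1, a2, 2000)
alpha = sum(abs(v) for v in f)  # assumed scale factor: sum of |f(k)|

random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(5000)]
peak_raw = max(abs(v) for v in recursive_part(x, a1, a2))
peak_scaled = max(abs(v) for v in recursive_part([v / alpha for v in x], a1, a2))
print(f"alpha = {alpha:.3f}, peak without scaling = {peak_raw:.2f}, with scaling = {peak_scaled:.4f}")
```

This choice guarantees the absence of saturation for any bounded input, at the price of a strong attenuation of the signal inside the cell, which is one reason why less conservative factors are also used.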

Two alternative approaches consist in taking the following normalization factors:

or

#### 7.4.3.4 Decomposing the transfer function into first- and second-order cells

As we have seen in section 7.4.3.1, we cascade first- and second-order cells instead of using an order M direct structure.

When M is even, we recall that we have:

However, depending on how the numerators and denominators are associated, the resulting filters do not behave the same way with respect to the quantization of the filter coefficients and of the input signal samples. The first question we must resolve is how best to pair the poles and zeros of the transfer function. Then we must define the order in which these cells are arranged.

Quantizing the filter coefficients on a finite number of bits modifies the positions of the filter's poles and zeros.

So the poles and zeros of the second-order cell represented by the transfer function:

respectively equal:

and

Now we will look more closely at the poles; depending on the values of the coefficients aci1 and aci2, the nature of the poles will differ.

Figure 7.24. Stability triangle

So when the discriminant of the denominator polynomial is negative, the poles are complex conjugates; otherwise they are real. To ensure stability, we must have:

This constraint implies that the two following inequalities must be satisfied:

and

From there, when working in infinite precision, the positions that the poles can occupy in the plane are inscribed in the triangle shown in Figure 7.24. When working in finite precision, aci1 and aci2 take discrete values, respectively between −1 and 1 and between −2 and 2. This means we must take care that the corresponding poles stay within the stability triangle.
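The stability check after quantization can be sketched as follows. The code assumes a cell denominator written 1 + a1 z^−1 + a2 z^−2, for which the classical triangle conditions are |a2| < 1 and |a1| < 1 + a2; the naming and ordering of the coefficients in the text may follow a different convention. The coefficient values and quantization steps are hypothetical.

```python
import cmath

def quantize(value, q):
    """Round value to the nearest multiple of the step q."""
    return q * round(value / q)

def in_stability_triangle(a1, a2):
    """Triangle test for the assumed denominator 1 + a1 z^-1 + a2 z^-2."""
    return abs(a2) < 1 and abs(a1) < 1 + a2

def cell_poles(a1, a2):
    """Roots of z^2 + a1 z + a2, i.e. the poles of the cell."""
    disc = cmath.sqrt(a1 * a1 - 4 * a2)
    return (-a1 + disc) / 2, (-a1 - disc) / 2

# Hypothetical cell whose poles lie close to the unit circle
a1, a2 = -1.97, 0.972

for q in (None, 1 / 256, 1 / 16):
    qa1 = a1 if q is None else quantize(a1, q)
    qa2 = a2 if q is None else quantize(a2, q)
    radius = max(abs(p) for p in cell_poles(qa1, qa2))
    print(f"q = {q}: a1 = {qa1}, a2 = {qa2}, "
          f"max |pole| = {radius:.4f}, stable = {in_stability_triangle(qa1, qa2)}")
```

With the coarse step 1/16, the coefficients round to −2 and 1, which lies on the boundary of the triangle and places a double pole at z = 1; with the step 1/256 the cell remains stable.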

The pairing of the transfer function poles and zeros is chosen so as to minimize the power of the output noise arising from quantization errors on the samples and on the results of operations. We write this quantity PNtotaloutput.

The quantization of the input signal introduces a noise ex at the input whose level equals q²/12, where q designates the quantization step. It produces, at the output of the first cell, a noise whose level can be calculated in the frequency domain, and equals:

At the output of the cascaded structure of N/2 cells (we assume N is even), it thus generates an error whose level equals:

However, this quantization error of the input signal is not the only element to consider; inside each cell, the result of each multiplication of two M-bit operands is normally coded on 2M bits. If several multiplications are chained, the number of bits needed to code each result keeps growing, which is not feasible in practice. For this reason, we truncate or round off the result of each multiplication. In Figure 7.25, this reduction from 2M to M bits is modeled by the addition of an error written eji(n).
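The additive-error model can be illustrated by simulation. The sketch below makes several assumptions that are not in the text: operands are random fixed-point numbers with 8 fractional bits, the product is rounded (not truncated) back to 8 fractional bits, and the function name is illustrative. It compares the measured power of the rounding error with the usual uniform-noise model q²/12.

```python
import random

def rounding_noise_power(frac_bits, n_products, seed=0):
    """Mean squared error made when products are rounded back to frac_bits fractional bits."""
    rng = random.Random(seed)
    scale = 1 << frac_bits                  # q = 1 / scale
    total = 0.0
    for _ in range(n_products):
        a = rng.randrange(-scale, scale)    # operand a * q
        b = rng.randrange(-scale, scale)    # operand b * q
        exact = (a * b) / scale             # exact product, in units of q
        kept = round(exact)                 # product after rounding, in units of q
        total += (kept - exact) ** 2
    q = 1.0 / scale
    return q * q * total / n_products

frac_bits = 8
q = 2.0 ** -frac_bits
measured = rounding_noise_power(frac_bits, 50000)
model = q * q / 12
print(f"measured = {measured:.3e}, model q^2/12 = {model:.3e}")
```

With truncation instead of rounding, the error would no longer be centered, and a mean term would have to be added to the model.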

This operation introduces a global error ej whose level is .

Figure 7.25. Modelization of truncation errors for a second-order structure

From there, if we consider the error generated in the j-th cell, the resulting noise at the filter's output equals:

Using equations (7.93) and (7.94), the total noise at the filter's output thus equals:

In order to minimize Ptotaloutput, we must minimize each factor. The most delicate problem is the value taken by the frequency response of a cell at the frequency associated with one of its poles. To avoid an overly high amplitude response at this frequency, we must choose a zero close to the pole to best neutralize the pole's influence.

Once the poles and zeros of the transfer function have been paired, the cells are ordered so as to minimize the level of the output noise.
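One simple way to apply the pairing rule is a greedy heuristic: take the poles in order of proximity to the unit circle and give each one the closest zero still available. The sketch below illustrates this heuristic only; it is not necessarily the procedure used by the authors, it does not address the ordering of the cells, and the pole and zero values (one representative per conjugate pair) are hypothetical.

```python
import cmath

def pair_poles_and_zeros(poles, zeros):
    """Greedy pairing: poles nearest the unit circle first, each with its nearest unused zero."""
    remaining = list(zeros)
    pairs = []
    for p in sorted(poles, key=lambda p: 1 - abs(p)):
        z = min(remaining, key=lambda z: abs(z - p))
        remaining.remove(z)
        pairs.append((p, z))
    return pairs

# Hypothetical upper-half-plane representatives of the conjugate pairs
poles = [cmath.rect(0.6, 1.0), cmath.rect(0.95, 0.5)]
zeros = [cmath.rect(1.0, 2.0), cmath.rect(1.0, 0.7)]

pairs = pair_poles_and_zeros(poles, zeros)
for p, z in pairs:
    print(f"pole |p| = {abs(p):.2f}, arg = {cmath.phase(p):.2f}  <->  "
          f"zero arg = {cmath.phase(z):.2f}")
```

Here the pole of radius 0.95 is served first and receives the zero at angle 0.7, the one closest to it; the remaining pole takes the remaining zero.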

We should keep in mind that much work is being done to improve this procedure so as to reduce noise levels.