The popularity of the C programming language and its derivatives in signal
processing applications is increasing. Two books specifically about signal
processing in C have recently become available [Reid, David]. A broad collection
of numerical algorithms with implementations in C, Fortran and Pascal has been
available for some years [Press]. Efforts are underway to develop an ANSI
standard for Digital Signal Processing extensions to C [Leary]. A discussion of
the advantages and disadvantages of C for numerical computing may be found in
[MacDonald].
When a new reduced instruction set computer (RISC) or digital signal processor
(DSP) is introduced, a C compiler is one of the first high level development
tools available. The wide availability of compilers and the ease with which
portable applications may be written have prompted the use of C as a target
language for other programming languages, such as C++, Pascal and Fortran, and
for higher level tools, such as the Ptolemy system [Buck].
C is expressive enough to allow the careful programmer to obtain the numerical
performance potential of a processor so important for many signal processing
applications. This paper presents recommendations for efficiently implementing
signal processing algorithms in C without sacrificing clarity or portability.
These recommendations are derived from the author's experience implementing
signal processing functions for musical applications [Freed 1992] with the MIPS
R4000 [Kane] C compiler on a Silicon Graphics Indigo workstation. There is
therefore a bias in the example code towards audio signal processing, but the
recommendations apply to numerical computing in general.
Most of the recommendations can be interpreted for traditional C [Kernighan],
ANSI C [ANSI] and C++ [Ellis]. Although the ANSI standardization of C did not
greatly change the language, some of these changes directly concern
implementation of numerical algorithms. These and issues relating to C++ will be
discussed throughout.
Recommendations will look like this:
If you are in a hurry, just read the recommendations in italics.
The choice of data types for variables in signal processing applications
relates directly to algorithm performance. The following sections discuss the
basic data types available in C.
Integers in C come in four potentially different sizes: char,
short, int and long, and two flavors:
signed and unsigned. Long integers are at least 32 bits wide
on most modern machines and are a good choice for most integer calculations in
signal processing. Short integers are often 16 bits wide and are commonly used to
represent sound samples for storage on disks and for D/A and A/D conversions.
Although int is usually 32 bits, it has been customary for many
compilers to provide 16-bit wide int's for the Motorola 68000 Family
of processors. Many low-cost digital signal processors provide 16 or 24-bit
operations and addressing.
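Since these widths vary between compilers and processors, portable code can
verify its assumptions directly. A minimal sketch (check_int_widths is a
hypothetical helper, not a library function):

```c
#include <limits.h>

/* Hypothetical helper: returns nonzero if this compiler provides the
   widths assumed in the text - a long of at least 32 bits and a short
   of at least 16 bits. */
int check_int_widths(void)
{
    return sizeof(long) * CHAR_BIT >= 32 &&
           sizeof(short) * CHAR_BIT >= 16;
}
```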
Floating point numbers in C can be single (float), double (double)
or extended precision (long double). Most processors
that provide floating point operations support a 32-bit single precision format
with a 24-bit signed magnitude mantissa. Most use the IEEE 754 standard for such
numbers, although popular low-cost DSP chips, the TMS320c3x series for example,
do not. Most RISC machines and floating point coprocessors provide direct
support for double precision floating point numbers with a 53-bit signed
magnitude mantissa. Long double's are the same as
double's in the current generation of processors.
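The mantissa widths can be observed directly: 2^24 is the first integer whose
successor is lost to single precision rounding, while double precision retains
it. A small sketch assuming IEEE 754 formats:

```c
/* With a 24-bit single precision mantissa, 2^24 + 1 rounds back to 2^24;
   the 53-bit double precision mantissa preserves the addition. */
int float_loses_unit(void)
{
    float f = 16777216.0f;   /* 2^24 */
    float g = f + 1.0f;      /* assignment forces rounding to float */
    return g == f;
}

int double_keeps_unit(void)
{
    double d = 16777216.0;
    double e = d + 1.0;
    return e != d;
}
```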
The following table is for the MIPS C compiler:

Name            Type                  Size  Range
Character       char                  8     0 to 255
Integer         int (long)            32    -2147483648 to 2147483647
                short                 16    -32768 to 32767
                unsigned              32    0 to 4294967295
Floating Point  float                 32    approx. +/-10^+/-38
                double (long double)  64    approx. +/-10^+/-308
Audio signals for musical applications will serve as an example of how to
choose appropriate data types.
Most RISC processors have faster floating point multiply operations than
integer ones. Integer additions on the other hand are usually as fast as or
faster than floating point additions, because of the normalization step required
in floating point arithmetic. Though the choice of data type for signals appears
at first glance to depend on the instruction timing of a particular processor
and the particular mix of multiplications and additions of a given algorithm,
many other factors support the choice of floating point:
Floating point operations allow much greater dynamic range.*
Applications driving future processor designs require good floating point
performance.*
Use float for calculations on sound samples
*Many processors implement integer arithmetic using the floating point ALU, e.g.
PA-RISC [Hansen]
typedef can be used to parameterize types to allow for future changes,
e.g.
typedef long Stype; /* sample */
typedef float Ftype; /* frequency */
typedef double Ttype; /* time */
In practice this parameterization is not as useful as might be hoped.
Consider the filter:
#ifdef FLOAT
typedef float Stype;
#else
typedef long Stype; /* sample */
#endif
Stype simple_filter(Stype input, float k, float j)
{
    static Stype output = 0;
    return output = output*k + input*j;
}
This code will be slower when Stype is long than when it is
float, because variables of integer type in the same expression as variables of
floating point types are promoted to floating point. Not only will these
conversions slow the filter down, but the critical arithmetic operations will be
done in floating point anyway.
Here is the filter using integers:
#define SCALE 65536
long integer_filter(long input, long k, long j)
{
    static long output = 0;
    return output = (output*k + input*j)/SCALE;
}
Allowing for 16 bits of dynamic range in the input and output and to avoid
integer overflow, the above function only allows for 15 bits of precision in the
specification of the filter coefficients.
The above example illustrates a major difficulty for signal processing with C
integers: there is no fixed point arithmetic interpretation of those integers
such as that of fixed point signal processor chips like the DSP56001 [Lee].
Although multiplication of two 32-bit integers on RISC processors such as the
R3000 yields the expected 64-bit resolution result, the multiplication operator
in C yields an integer the same size as its operands: 32 bits. The top 32 bits
are unavailable to the C programmer and it is therefore impossible to maintain
higher resolution for subsequent additions within an expression. Although there
are important applications where 16-bit signals and filter coefficients suffice,
most audio applications demand at least 24-bit signal resolution and better than
48-bit resolution for additions. This difficulty is addressed in DSP/C by
adding a fract data type to C [Leary].
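On compilers that provide a 64-bit integer type - a common extension, later
standardized as long long - the high-order bits of the product can be recovered
at the cost of portability. A sketch assuming such a type and a 16.16 fixed
point interpretation of the operands:

```c
/* Fixed point multiply keeping the high-order bits of the full product,
   as a fixed point DSP would.  Assumes a 64-bit "long long" extension
   and a 16.16 fixed point interpretation of the 32-bit operands. */
#define FIX_SHIFT 16

long fixmul(long a, long b)
{
    return (long)(((long long)a * (long long)b) >> FIX_SHIFT);
}
```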
How many bits are needed to represent time to within a sample at a 48 kHz
sample rate over the closed interval of one day? All the integers between 0 and
24*60*60*48000 = 4147200000 have to be represented. 32-bit floating point
numbers and 24-bit or smaller integers can be eliminated as candidates since
2^23 = 8388608. These types cannot represent time to the required precision over
the interval of one hour. This leaves the 32-bit integer or double precision
floating point types. The floating point representation has the advantage that
time could be easily represented in seconds and be independent of sample rate.
The addition operation on 32-bit integers is faster than double precision
addition on many processors and may be implemented concurrently with other
instructions.
Use long for counting samples
Use double or long for measuring elapsed time
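The argument above can be checked numerically: at one hour of samples
(3600 * 48000 = 172800000) single precision can no longer distinguish adjacent
sample counts, while double precision can. A small sketch assuming IEEE 754
formats:

```c
/* One hour of samples at 48 kHz: 3600 * 48000 = 172800000.
   A 24-bit single precision mantissa cannot distinguish this count
   from the next; a 53-bit double precision mantissa can. */
int float_confuses_adjacent_samples(void)
{
    float t0 = 172800000.0f;
    float t1 = 172800001.0f;   /* one sample later */
    return t0 == t1;
}

int double_distinguishes_adjacent_samples(void)
{
    double t0 = 172800000.0;
    double t1 = 172800001.0;
    return t0 != t1;
}
```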
Delay line lengths for waveguide filters, for example, [Laakso et al] need
to be specified to a better precision than the sample interval. A 100th of a
sample interval precision over the range 0 to a maximum delay of a few minutes
covers the broad range of needs from, for example, accurately tuning high
frequencies for a simulated flute to the long delays typical of large
reverberant spaces.
Use float or long integers for fractional sample delays
For most applications the range 0 to 48 kHz suffices. However, it can be
useful to represent negative frequencies and it is now commonplace to sample at
a rate higher than required by the Nyquist criterion and the range of human
hearing. Let's allow for -1 MHz to 1 MHz. Although human pitch perception has
limited precision, it is a common mistake to use this to decide on the required
frequency precision. In additive synthesis, for example, sounds are synthesized
for more than their pitch percept. Their timbre and loudness are also important
and is a function of the phase and amplitude of a set of sinusoidal
components. Two sinusoidal components close in frequency are perceived as one
component with an amplitude modulation whose rate is a function of the
difference in frequency. Differences between these beating sounds are
perceptible when the frequency difference is less than 1 Hz.
A usable range of values is easily covered by a single precision floating point
number or 32-bit integer. The 24-bit fractional format of the DSP56000 family
suffices, but 16 bits is not enough.
Use float or an integer type with more than 16 bits for
frequency parameters
An important use of the frequency parameter is in the representation of a
phase increment for synthesis of periodic signals using linear oscillators. If
the signal to be synthesized is read from a table that has a length which is a
power of 2, a very efficient oscillator implementation is possible by
accumulating integer phase increments [Hartmann].
Use an unsigned integer type with more than 16 bits for phase increments
and accumulators.
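A minimal sketch of such an oscillator, assuming a 32-bit unsigned type and a
256-entry table (both assumptions, and the names are hypothetical). The table
index is taken from the top bits of the accumulator, and wraparound is free
because unsigned arithmetic is defined modulo 2^32:

```c
#define TABLE_BITS 8
#define TABLE_SIZE (1 << TABLE_BITS)   /* power-of-2 table length */

/* One oscillator tick: advance the phase accumulator and index the
   table with its top bits.  Assumes a 32-bit unsigned type; the
   accumulator wraps modulo 2^32, exactly the arithmetic the
   oscillator needs. */
float osc_tick(const float *table, unsigned *phase, unsigned incr)
{
    *phase += incr;
    return table[(*phase >> (32 - TABLE_BITS)) & (TABLE_SIZE - 1)];
}
```

The phase increment for a frequency f at sample rate R is f/R scaled by 2^32.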
16-bit PCM (the standard used in Compact Discs) provides enough dynamic
range for most listening contexts. However, when recording it is common to
encounter situations where more dynamic range is needed. It is also useful to
have more resolution to avoid the effects of round off errors during operations
on signals. Although most high-quality audio is stored on tape and disk as
16-bit integers, many vendors now offer the option of 24-bits. Most already do
at least 24-bit processing and many offer 20-bit D/A conversion [Sony Hi-bit
reference].
Use signed and unsigned char for low-quality sound samples
Use short (16-bit) integers for sound samples on tape and disk
Use float or better than 16-bit integers for calculations on sounds
Consider the following vector function from UDI [Depalle]:
sun_vaddosb(vect_in1, incr_in1, vect_in2, incr_in2,
            vect_out, incr_out, vect_length)
float *vect_in1, *vect_in2, *vect_out;
unsigned long vect_length, incr_in1, incr_in2, incr_out;
{
    register float tmp, tmp_1, tmp_2;

    if ((incr_in1 == 1) && (incr_in2 == 1) && (incr_out == 1))
    {
        while (vect_length--)
        {
            tmp_1 = *vect_in1++;
            tmp_2 = *vect_in2++;
            tmp = tmp_1 - tmp_2;
            *vect_out++ = (tmp_1 + tmp_2) / tmp;
        }
    }
    else
    {
        while (vect_length--)
        {
            tmp_1 = *vect_in1;
            tmp_2 = *vect_in2;
            tmp = tmp_1 - tmp_2;
            *vect_out = (tmp_1 + tmp_2) / tmp;
            vect_in1 += incr_in1;
            vect_in2 += incr_in2;
            vect_out += incr_out;
        }
    }
}
This is typical of coding styles which have evolved since the earliest
applications of C. Programmers have taken advantage of the ease with which C can
be used to express low-level operations to optimize the performance of the code
using the compiler for the machine at hand.
The following version might have been the starting point for these
optimizations:
void vaddosb(vect_in1, incr_in1, vect_in2, incr_in2,
             vect_out, incr_out, vect_length)
float *vect_in1, *vect_in2, *vect_out;
unsigned long vect_length, incr_in1, incr_in2, incr_out;
{
    unsigned int i;

    for (i = 0; i < vect_length; ++i)
        vect_out[i*incr_out] =
            (vect_in1[i*incr_in1] + vect_in2[i*incr_in2])
          / (vect_in1[i*incr_in1] - vect_in2[i*incr_in2]);
}
No register variables, pointers or temporary variables are used and no
special case is made for vectors of contiguous elements. Will this smaller,
clearer function be much slower because of the 5 multiplies introduced per
iteration? Will it be slower without the special case which allows compilers to
optimize the post increment? With the MIPS R3000 processor on the SGI Indigo,
the above code is faster than the original hand optimized code! The following
measurements are the sum of the user time of ten runs of the UNIX time(1) command
executing a program which called the aforementioned functions one million times
with 32 element vectors:
                 sun_vaddosb   vaddosb
vector stride 1     225.9        225
vector stride 2     225.4        225
The MIPS C compiler did all the optimizations the original programmer did by
hand and was able to generate more efficient loop testing code, which accounts
for the small advantage over the original function. A conclusion from these
measurements is that it is better to write simple, clear programs and examine
their performance in a particular application than to write in a machine
specific optimized style. This section is devoted to recommendations which are
likely to yield faster code with most machines and compilers without leading to
hard to understand code. The last section will discuss machine and compiler
specific optimizations. But first here are some general recommendations on
changing code to make it more efficient.
It can be difficult to predict performance by inspecting C code. Now that many
processors have optimizing assemblers it is not easy to predict performance by
looking at the output of the C compiler either. The only way to tell that
changes to code are improving performance is to measure performance. Most modern
compilers come with tools for execution time profiling. Even when these are not
available, the UNIX time(1) command can be used to good effect.
Measure the real performance effects of changes to code
It is very easy to break working code while changing it to make it more
efficient. Most programmers have experienced the punishment for not following
these recommendations from [Kernighan] :
Make it right before you make it faster
Keep it right when you make it faster
Make it clear before you make it faster
Arguments to functions defined with "old style" declarations are
promoted to larger types. Char's and short's are promoted to
int's. Float's are promoted to double's. So
consider this version of the simple filter:
typedef float Stype;
Stype simple_filter(input, k, j)
Stype input; /* will be promoted to double */
float k;     /* will be promoted to double */
float j;     /* will be promoted to double */
{
    static Stype output = 0;
    return output = output*k + input*j;
}
On machines where the float and double types are
different sizes this version will be considerably slower than one with the new
declaration style introduced with ANSI C and C++. Not only will all the
arithmetic be performed in double precision, but two conversions between
float and double will be required to use and assign to the
output variable. Most RISC processors have efficient double precision ALU
operations. However the additional costs of conversion instructions cannot be
avoided. The bus bandwidth and storage requirements of double's are usually
twice those of float's. The size penalty may have an unpredictable and
negative impact on performance by wasting precious cache memory or floating
point registers.
Use new style declarations
Use conversion utilities to change old-style declarations in traditional C code
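For comparison, a sketch of the same filter with a new style (prototype)
definition; the float arguments now stay single precision throughout:

```c
/* New style (prototype) definition: the float arguments are passed
   as float, so no float/double conversions are generated. */
float simple_filter(float input, float k, float j)
{
    static float output = 0.0f;
    return output = output*k + input*j;
}
```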
Note that the constant 3.141 is of type double and its use in an
expression will cause other types to be promoted to double.
Use the ANSI C f or F suffix for single precision
floating point constants, e.g. 3.141f, or when the suffix is
unavailable use a cast, e.g. (float) 3.141
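The difference in the constants' types can be demonstrated with sizeof:

```c
/* An unsuffixed floating constant has type double; the f suffix
   makes it a single precision float. */
int constant_types_differ(void)
{
    return sizeof(3.141) == sizeof(double) &&
           sizeof(3.141f) == sizeof(float);
}
```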
In this example, most compilers will generate code to multiply k by t,
PI, and 2.
#define PI 3.14159265358979
a = r*sin(2*PI*k*t);
So this constant folding will be more efficient:
#define TWOPI 6.2831853072 /* 2*pi */
a = r*sin(TWOPI*k*t);
Fold constant expressions into single constants
Division is usually considerably more expensive than multiplication and
compilers often do not do strength reduction, i.e. x/2.0 can be
written as x*0.5. Strength reduction of 2.0*x to x+x
is often not done by RISC compilers because in some instruction schedules
the addition may be a slower choice.
Consider strength reduction when it is measurably faster
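Division by a power of two is one strength reduction that is always safe in
binary floating point, because 0.5 is exactly representable; the two forms give
bit-identical results:

```c
/* x/2.0 and x*0.5 are bit-identical for any finite x, because 0.5 is
   an exact power of two in binary floating point; this particular
   strength reduction never changes results. */
double half_by_division(double x)       { return x / 2.0; }
double half_by_multiplication(double x) { return x * 0.5; }
```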
Standard C library functions such as sin() and cos() return
double precision results for a double precision argument. ANSI C standardized
single precision floating point versions of these routines.
When single precision suffices use faster library routines, e.g. fsin()
and fcos() instead of sin() and cos()
Mathematical library routines are designed to achieve good performance without
compromising accuracy for a wide range of input values. This means that they
work over a much larger domain than an application may require. They are also
required to maintain the errno variable for exception handling.
Consider restrictive and faster implementations of C library functions
Remember though that considerable effort may have been expended on efficient
implementations of the C library functions. The hard work to do better ones for
specific applications may be difficult to justify.
Certain arithmetic operations such as division by zero may require special
additional exception processing. The C standard does not define this additional
work. It is usually processor specific but may be well defined in the case of
IEEE 754 floating point exceptions such as overflow and underflow. In any event
many of these exceptions result in a potentially expensive processor context
switch to handle the exception.
Avoid computations which may lead to expensive exception processing
Conversions from one data type to another may be expensive. Conversions from
floating point numbers to signed and unsigned integers are commonly slow
operations. On the R3000 it is cheaper to convert a floating point number to a
signed integer and then that integer to an unsigned integer than directly from
floating point to an unsigned integer. The results are not necessarily the
same.
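A sketch of the two-step conversion described above (the function name is
hypothetical). It is only valid when the value fits in the signed intermediate
type, and for negative values its result differs from what a direct conversion
would give:

```c
/* Hypothetical helper: convert to unsigned via a signed intermediate,
   as described above for the R3000.  Valid only when f fits in a
   signed long; for negative f the result differs from a direct
   conversion. */
unsigned long float_to_unsigned_via_signed(float f)
{
    return (unsigned long)(long)f;
}
```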
Avoid expensive conversions from one type to another
In traditional C, compilers are completely free to change the order of
evaluation of commutative and associative operators. ANSI C requires that
compilers honor the order of expression evaluation specified by parentheses.
Unless numerical precision requirements dictate otherwise (since floating point
addition is not associative) it is therefore best to minimize the use of
parentheses and give compilers the most options for optimization. This
unfortunately may compromise the readability of programs.
With ANSI C, consider parentheses to trade optimization opportunity for
numerical precision
Modern computer systems are designed to take advantage of locality of memory
references, whether these systems be based on RISC, CISC, or DSP architecture.
Register variables and data cache memory are used to take advantage of temporal
locality - the tendency for most variable references to be to variables which
were most recently referenced. Instruction caches and queues are designed to
take advantage of temporal and spatial locality of instruction streams - the
tendency for instruction streams to contain contiguous instructions or repeats
of short sequences from loops. Cache lines and page mode memory chips are used
to take advantage of spatial locality of data and instructions.
It is important therefore to consider algorithms and implementations which
enhance the temporal and spatial locality of instruction and data references.
This will become even more significant as systems evolve to include more than
one processor, because of the high cost of references to variables shared
between processors. Although the recommendations to enhance locality are machine
independent, the payoff in their application will vary considerably among
processors and memory subsystem designs.
The cost of access to a variable may vary considerably according to its address
in memory.
Use a memory allocator which aligns data to memory page and cache
line boundaries - the vendor supplied library function for memory allocation may
do this automatically
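Where no vendor function is available, alignment can be arranged by
over-allocating, as in this sketch (a real allocator must also remember the
original pointer so the block can be freed; that bookkeeping is omitted here):

```c
#include <stdlib.h>

/* Sketch of an aligning allocator: over-allocate, then round the
   pointer up to the requested power-of-2 alignment.  A real
   implementation must also keep the original pointer so the block
   can be freed; that bookkeeping is omitted here. */
void *aligned_alloc_sketch(size_t size, size_t align)
{
    char *p = malloc(size + align);
    if (p == 0)
        return 0;
    return (void *)(((size_t)p + align - 1) & ~(align - 1));
}
```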
Register files and cache memory are much smaller in size than main memory, so it
is important to be economical in the use of space. For numerical applications
this means not unnecessarily using large data types such as
double when float will do.
double variables usually use twice the register and cache space of
float variables; use them only when the precision they offer is required
In C it is usually easy to control spatial locality of variables, since arrays
are implemented in contiguous memory locations and successive elements in
structures are required to be implemented contiguously. For example, on the SGI
Indigo, a reference to the real element of a complex number consisting of two
float structure elements results in the complex component being
brought into cache memory concurrently, speeding up a subsequent access to the
complex element of the structure.
Place variables that are commonly used together into contiguous memory
locations, e.g. typedef struct { float r, c; } complex ;
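A complex multiply illustrates the benefit: both components of each operand are
referenced together, so keeping them adjacent means a single cache line fill
typically brings in the pair. A sketch using the structure above:

```c
typedef struct { float r, c; } complex;   /* components adjacent in memory */

/* Complex multiply: both components of each operand are referenced,
   so one cache line fill typically brings in the pair. */
complex cmul(complex a, complex b)
{
    complex p;
    p.r = a.r*b.r - a.c*b.c;
    p.c = a.r*b.c + a.c*b.r;
    return p;
}
```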
Variables declared with local scope are not necessarily implemented on the
stack contiguously, so it may pay to create a structure to group together such
variables, unless of course the compiler is expected to place the variables in
registers anyway.
Modern C compilers offer a range of optimization and other controls over the
code generation process. They are usually described concisely, perhaps in the
UNIX manual format.
Consult compiler manual for control over code generation
The MIPS C compiler for SGI workstations, like many ANSI C compliant compilers,
has an option to revert to traditional C definition (-cckr). This option is
useful for compiling old code, but it should be used with the -float option,
which overrides the old type promotion rules and allows the compiler to generate
faster single precision floating point operations for expressions with no types
larger than float.
Use cc -cckr -float for old code
Use cc -O2 -c or cc -O3 for ANSI code
The -g[0-3] argument, which is used to specify generation of information
for symbolic debugging, suppresses optimization.
Use compiler optimization after code has been debugged
Profiling code is another potential source of inefficiency.
Turn off profiling options if it affects performance
In addition to documentation on the compiler itself, a reference manual for
the language it implements is usually provided. It is now standard practice to
include in such a manual a description of implementation-defined behavior. One
of these implementation choices is the interpretation of the register
storage-class specifier. The MIPS Compiler uses up to 6 register storage-class
specifiers when not optimizing and ignores them when optimizing. It always
ignores register storage-class specifiers in new style function definitions and
declarations.
The MIPS compiler, like most modern C compilers, makes very good choices
automatically of which variables to put into registers. On the rare occasions
when it can be shown, by examining the output of the compiler and measuring
performance, that its use of registers can be improved on, a better way is
needed to guide the compiler. Unfortunately it is better to compile most modules
with optimization so register storage-class specifications will be
ignored. We can however use the volatile type qualifier, introduced
with ANSI C. Variables qualified this way will not usually be placed in
registers.
Use volatile to defeat a compiler's unfruitful register assignments
The MIPS processors have an overflow exception on signed integer additions
and subtractions but not on unsigned integer operations. When using integers for
phase accumulators in digital oscillators it is common practice to allow the
accumulator to wrap by simply overflowing.
Use unsigned integers for phase accumulators to avoid exceptions
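This wraparound is guaranteed by the language: unsigned arithmetic is defined
modulo 2^N, so the accumulator may overflow freely without raising an
exception. For example:

```c
/* Unsigned arithmetic is defined modulo 2^N, so a phase accumulator
   at the top of its range wraps without raising an exception.
   Assumes a 32-bit unsigned type. */
unsigned wrap_add(unsigned phase, unsigned incr)
{
    return phase + incr;
}
```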
Recursive digital filters (IIR) with impulsive inputs may cause floating point
underflows. MIPS floating point has an unmaskable and expensive exception on
underflow. This data dependent instruction execution time is disastrous for hard
real-time applications such as sound synthesis. One solution to this is to clamp
the state variables of the recursive filter to 0 as they approach 0. Recent
designs such as the Motorola 88110 allow for exceptions to be completely
disabled.
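A sketch of the clamping solution for a one-pole recursive filter; the
threshold value is an assumption and should be chosen below audibility but well
above the underflow range:

```c
#include <math.h>

#define DENORMAL_GUARD 1.0e-15f   /* assumed threshold, far below audibility */

/* One-pole recursive filter with its state clamped to zero as it
   decays, so the state never enters the expensive underflow range. */
float one_pole(float input, float k, float *state)
{
    *state = input + k * *state;
    if (fabs(*state) < DENORMAL_GUARD)
        *state = 0.0f;
    return *state;
}
```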
The following functions all perform the addition of two vectors of non-zero
length:
void add(int count, float *a, float *b, float *result) /* fortran */
{
    int i;
    for (i = 0; i < count; ++i)
        result[i] = a[i] + b[i];
}

void add(int count, float *a, float *b, float *result) /* UDI */
{
    while (count--)
        *result++ = *a++ + *b++;
}

void add(int count, float *a, float *b, float *result) /* Csound */
{
    do
    {
        *result++ = *a++ + *b++;
    } while (--count);
}

void add(int count, float *a, float *b, float *result) /* Resound */
{
    float *end = result + count;
    while (result < end)
    {
        *result++ = *a++ + *b++;
    }
}

void add(int count, float *a, float *b, float *result) /* Ptolemy */
{
    int i;
    for (i = count-1; i >= 0; --i)
        result[i] = a[i] + b[i];
}
For the interesting values of count greater than 0, these
functions all perform the same operation. For values less than or equal to 0
they do not all perform the same way. It is unreasonable to expect a compiler to
generate the same code for each function. Which function will be the fastest?
For many compilers for RISC machines the first example will be fastest. Compiler
developers cannot optimize every way to express loops. Perhaps because of a
tradition of optimizing Fortran do loops, the MIPS compiler generates
good code for loops with array indices which are functions of loop iteration
count variables. It unrolls loops thereby increasing the chance for scheduling
optimizations, and minimizing loop control between blocks. It can also generate
fast test instructions for iteration variables with an upper and lower bound.
Use for loops and an iteration variable instead of while
Reference arrays with iteration variables instead of pointers
Compilers do not generate code to place the variable count in the
following function into a register:
int count;

loopindex()
{
    int i;
    for (i = 0; i < count; ++i)
        x(i*i);
}
Since the function x(int i) may modify count the
compiler cannot generate code to put a copy of count in a register at
the start of the loop. If it is known that there are no such side effects it is
possible to help the compiler to come to the same conclusion by assigning all
the variables in the loop to registers:
int count;

loopindex()
{
    int i;
    int n = count;
    for (i = 0; i < n; ++i)
        x(i*i);
}
Increase the probability of fruitful register assignments by avoiding
side effects and narrowing scope of variables so side effects may be checked for
by the compiler
The recommendations in this paper have been used by the author to create an
efficient vector and function library for real-time audio signal processing on
MIPS processors. Preliminary experience porting this work to other processors
is very encouraging.
Silicon Graphics PacBell
ANSI C language specification, American National Standard
X3.159-1989
Buck, J., Ha, S., Lee, E. A., Messerschmitt, D. G., Ptolemy: A Platform for
Heterogeneous Simulation and Prototyping, 1991, Proc. of the 1991 European
Simulation Conference, Copenhagen, Denmark, June 17-19
David, Ruth A., Stearns, Samuel D., Signal Processing Algorithms Using
Fortran and C, 1992, Prentice-Hall
Depalle, Philippe, Rodet, Xavier, UDI: A Unified DSP Interface for Sound
Signal Analysis and Synthesis, Proceedings of ICMC, 1990, Computer Music
Association
Ellis, Margaret A., Stroustrup, Bjarne, The Annotated C++ Reference Manual
, 1990, Addison-Wesley
Freed, A., MacMix: Mixing Music with a Mouse, Proceedings of the 12th
International Computer Music Conference, The Hague, 1986, Computer Music
Association
Freed A., New Tools for Rapid Prototyping of Music Sound Synthesis
Algorithms and Control Strategies, Proc. International Computer Music
Conference, San Jose, Oct. 1992.
Freed, A., Recording, Mixing, and Signal Processing on a Personal Computer
, Proceedings of the AES 5th International Conference on Music and Digital
Technology, 1987
Hansen, Robert C., 1992, New Optimizations for PA-RISC Compilers,
Hewlett-Packard Journal, June 1992.
Hartmann, W. M., Digital Waveform Generation by Fractional Addressing,
J. Acoust. Soc. Amer. 82, 1987, p. 1883-1891
Kane, Gerry, Heinrich, Joe, MIPS RISC Architecture, 1992, Prentice Hall
Kastens, U., Compilation for Instruction Parallel Processors, 1990,
Proceedings 3rd Compiler Compilers Conference 1990, Springer-Verlag
Kernighan, B. W., Ritchie, D. M., The C Programming Language, 1978,
1988, Prentice-Hall
Kernighan, B. W., Plauger, P. J., The Elements of Programming Style,
1978, McGraw-Hill
Laakso, T. I., Välimäki, V., Karjalainen, M., Laine, U. K.,
Real-Time Implementation Techniques for a Continuously Variable Digital
Delay in Modeling Musical Instruments, Proceedings of ICMC 1992, CMA, San
Francisco.
Leary, Kevin W., Waddington, William, DSP/C: A Standard High Level Language
for DSP and Numeric Processing, Proceedings of the ICASSP, 1990, p.
1065-1068
Lee, Edward A., Programmable DSP Architectures, October 1988, IEEE ASSP
Magazine
Moore, F. Richard, Elements of Computer Music, 1990, Prentice-Hall
Press, William H. et al, Numerical Recipes in C - 2nd Edition, 1992,
Cambridge University Press
Reid, Christopher E., Passin, Thomas B., Signal Processing in C, 1992,
John Wiley & Sons
MacDonald, T., 1991, C for Numeric Computing, The Journal of
Supercomputing, 5, 31-48, Kluwer Academic Publishers, Boston.