Analog Devices AN-393 Application Notes

AN-393
ONE TECHNOLOGY WAY • P.O. BOX 9106 • NORWOOD, MASSACHUSETTS 02062-9106 • 617/329-4700
Considerations for Selecting a DSP Processor
(ADSP-2115 vs. TMS320C5x)
INTRODUCTION
Digital signal processing systems demand high performance processors. But high performance cannot be measured by a processor’s multiplication/accumulation speed or MIPS (millions of instructions per second) rating alone. A DSP processor is often characterized mainly by its MIPS rate. Since an instruction on one DSP device is not necessarily equivalent to an instruction on another, a MIPS rating can be misleading. Other architectural and performance requirements relating to a DSP processor’s capabilities in areas such as arithmetic, addressing and program sequencing may be more important. What distinguishes DSPs from other types of microprocessor and microcontroller architectures is how well they perform in each of the following areas.
1. Fast and flexible arithmetic
   A DSP processor must provide single-cycle computation for multiplication, multiplication with accumulation, arbitrary amounts of shifting, and standard arithmetic and logical operations. In addition, the arithmetic units should allow for any sequence of computation so that a given DSP algorithm can be executed without being reformulated.
2. Extended dynamic range on multiplication/accumulation
   Extended sums-of-products are fundamental to DSP algorithms. Protection against overflow in successive accumulations ensures that no loss of data or range occurs.
3. Single-cycle fetch of two operands (from either on- or off-chip)
   Again, in extended sums-of-products calculations, two operands are always needed to feed the calculation. A processor must be able to sustain two-operand data throughput. Also, flexible addressing capabilities for multiple data memories are important.
4. Hardware circular buffering (both on- and off-chip)
   A large class of DSP algorithms, including most filters, require circular buffers. Hardware to handle address pointer wraparound or modulo addressing reduces overhead (increasing performance) and simplifies implementation.
5. Zero overhead looping and branching
   DSP algorithms are naturally repetitive and can easily be expressed as loops. Program sequencing that supports looped code with zero overhead provides the best performance and the easiest programming implementation. Likewise, overhead penalties for conditional program flow are unacceptable in signal processing applications.
Not all processors currently used for DSP and DSP-like functions meet these architectural and performance requirements equally well. This article examines these considerations for selecting a DSP processor, comparing two 16-bit fixed-point processors, the ADSP-2115 from Analog Devices and the TMS320C5x from Texas Instruments.
The three sections that follow discuss the five points above. The arithmetic section discusses items one and two, the addressing capabilities section discusses items three and four, and the program sequencing section discusses item five.
Program examples and benchmarks can be found at the end of this article.
ARITHMETIC CAPABILITIES
The basis of a successful DSP implementation is the ability to perform fast math. Arithmetic capabilities are the foundation of DSP performance.
General Purpose Math
One indicator of a good arithmetic architecture is the ability to perform a wide range of arithmetic computations. These computations should be handled in a flexible manner so that the algorithm can be implemented without rearranging the order of the arithmetic operations or operands. If the arithmetic architecture is fixed, too special-purpose, or too limited, the algorithm must be rearranged, which creates extra work for the DSP designer or programmer and delays getting a system running. Algorithm development frequently turns out to be much of the work of implementing a DSP system. If an algorithm can be used “as is” with no extra work, the design can be finished sooner and with less chance of error.
Figure 1. Block Diagram of Arithmetic Section of the ADSP-2115
Arithmetic Architecture
Figure 1 shows a block diagram of the arithmetic section of the ADSP-2115 while Figure 2 shows that of the TMS320C50. Both of these devices utilize a modified Harvard architecture which can feed data operands from both program memory and data memory to the arithmetic section. Both of these devices work with 16-bit numbers.
ADSP-2115 Arithmetic Architecture Overview
The ADSP-2115 has three independent computational units: an ALU, a multiplier/accumulator (MAC), and a barrel shifter. They are connected (via the Result bus) so that the output register of any arithmetic unit may be operated on directly as an input by any other unit. In addition, the ALU and MAC are directly connected to both the program and data memory buses. Operands for ALU and MAC operations can come from both memories or any combination of off-chip memory and other data registers in the processor. All arithmetic operations are register based and a group of registers surrounds each arithmetic unit. A primary and secondary bank of registers is available to provide for fast context switching. All arithmetic registers can also be used as general purpose data registers.
TMS320C5x Arithmetic Architecture Overview
Figure 2 shows the block diagram of the arithmetic section of the TMS320C50. The TMS320C50 contains a multiplier, an ALU, a Parallel Logic Unit (PLU), a 16-bit scaling shifter and additional shifters at the outputs of both the accumulator and multiplier. The multiplier has an input register, TREG0, and an output register, PREG. The multiplier has direct input connections to both the program and data bus only for one operand or input. The ALU has direct access to only the data bus, not the program bus. Results are always sent to either the data bus or the accumulator registers. In some cases, the result must first be stored back in data memory before it can be used as an input for another calculation. Operations such as adding two data values from memory or multiply/accumulating with a data value can require multiple cycles.
With the TMS320C50, there is no dedicated multiplier/accumulator (MAC), which is required in many DSP algorithms. Instead the ALU must be used in conjunction with the multiplier for MAC operations. This may require some rearrangement of the algorithm or the temporary storage of intermediate results in data memory if the algorithm requires MAC operations interleaved with ALU operations. Also, there are arithmetic pipeline delays that are required to achieve sustained MAC operations. Basic multiply and ALU operations require multiple cycles as opposed to the single cycle operation of the arithmetic units in the ADSP-2115.
The availability of general purpose data registers and the flexibility of data movement in the TMS320C50 are limited. This may result in data bottlenecks and in extra cycles being required to move data into the right position prior to an arithmetic operation.
Figure 2. Block Diagram of Arithmetic Section of the TMS320C50
ADSP-2115 ALU
The ALU has two X and two Y input registers: AX0, AX1, and AY0, AY1. ALU operations are performed on any X-Y assortment of these input registers. They may be loaded from any combination of program and data memory or other data registers in the processor. The result of the operation appears in the ALU result (AR) or ALU feedback (AF) register. AR and AF can also be used as the X and Y operands (respectively) in any ALU calculation. The result registers of the MAC and barrel shifter can also be used directly as X inputs to the ALU (and vice versa).
ALU instructions are coded in a register transfer, algebraic syntax. An example of addition is shown below. This example is a multifunction instruction. The first “clause” of the instruction (up to the first comma) is the addition operation. The second clause loads the X input register from data memory (“DM”) and the third clause loads the Y input from program memory. An addition (or any other ALU operation) can be executed on a sustained, single-cycle basis. (These operand fetching clauses of the instruction may be omitted, if they are not needed.)
AR = AX0 + AY1, AX0 = DM(I0,M0), AY1 = PM(I4,M4);
All ALU operations complete in a single 50 ns cycle. (All references to cycles for the ADSP-2115 assume a 20 MHz device.) The ADSP-2115 runs at full speed even with an off-chip memory access.
TMS320C50 ALU
ALU operations require that one operand must come from the accumulator while the other comes from either the multiplier output, the accumulator buffer, or from the data bus or accumulator through a shifter. To add two numbers, the accumulator must be loaded with the first data value. After the accumulator is loaded, a second number can be added to the accumulator. The instructions for the ALU are specified with a mnemonic. The two instructions required to add two numbers are shown below.
ZALR <data memory address>
ADD <data memory address>
For the result to be used as an input value for anything other than another ALU operation, the data must first be stored back into data memory from the accumulator. Not all ALU operations can be performed in a single 35 ns cycle; an add as shown above can be accomplished every two cycles. All references to TMS320C50 cycles assume a 28.57 MHz device with a 35 ns cycle time. Not all ALU instructions (i.e., ADD #k, SUB #k, ADD #lk, SUB #lk, ADRK) can be used with the repeat feature.
ADSP-2115 MAC
As shown in Figure 1, the ADSP-2115 multiplier/accumulator (MAC) sits next to the ALU. Like the ALU, it has two X and two Y input registers, MX0, MX1 and MY0, MY1. The unit performs both multiplications and MACs independent of the ALU. This is a key difference from the architecture of the TMS320C50.
MAC operations are performed on any X-Y assortment of input registers. They may be loaded from any combination of program and data memory or other data registers in the processor. The result of the operation appears in the MAC result register (MR) or the MAC feedback register (MF). Like the ALU, the feedback and result registers can also serve as the X and Y inputs for any multiplication or MAC operation. The result registers of the barrel shifter and ALU can also be used directly as X inputs to the MAC (and vice versa).
The instructions for the MAC are specified in a register transfer, algebraic syntax. An example is shown below. The first line shows multiplication of two signed operands and the second line shows multiplication with accumulation of one signed and one unsigned operand. (Signed and unsigned operands can be mixed in any combination.)
The second example is a multifunction instruction. The first “clause” of the instruction (up to the first comma) is the MAC operation. The second clause loads the X input register from data memory (DM) and the third clause loads the Y input from program memory. Any MAC operation can be executed on a sustained, single-cycle basis. (These operand fetching clauses of the instruction may be omitted, if they are not needed, as in the first example.)
MR = MX0 * MY0 (SS);
MR = MR + MX1 * MY1 (SU), MX1 = DM(I0,M0), MY1 = PM(I4,M4);
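The overlap of computation and operand fetching in a multifunction MAC instruction can be modeled in a few lines of Python. This is only an illustrative sketch, not ADSP-2115 code: the function name `sum_of_products` and the list-based "memories" are hypothetical, and it assumes that the computation clause uses the register values present at the start of the cycle while the two loads take effect at the end of it.

```python
def sum_of_products(data_mem, prog_mem):
    """Model a sustained one-MAC-per-cycle loop (illustrative only).

    Returns (result, cycles), where cycles counts one priming fetch
    plus one modeled multifunction instruction per operand pair.
    """
    assert len(data_mem) == len(prog_mem)
    n = len(data_mem)
    if n == 0:
        return 0, 0
    mr = 0
    # Prime the input registers: one dual fetch, no MAC yet.
    mx, my = data_mem[0], prog_mem[0]
    cycles = 1
    for i in range(1, n + 1):
        # One modeled instruction: the MAC consumes the old mx/my ...
        mr += mx * my
        # ... while the next operand pair is fetched in the same cycle.
        if i < n:
            mx, my = data_mem[i], prog_mem[i]
        cycles += 1
    return mr, cycles
```

Under these assumptions, n operand pairs cost n + 1 modeled cycles, because every fetch after the first is hidden behind a MAC.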
The MR (MAC result) register is actually a 40-bit accumulator. It is divided into two 16-bit pieces (MR0 and MR1) and an 8-bit overflow register (MR2). DSP applications frequently deal with numbers over a large dynamic range. The eight “overflow” bits of MR2 allow for 256 MAC overflows before a loss of data can occur. The MAC also supports multiprecision operations as well as automatic unbiased rounding.
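The benefit of the extra accumulator bits can be illustrated with a small Python sketch. It is a generic two's-complement model, not a bit-exact model of the MR register: the helper names `wrap_signed` and `accumulate` are hypothetical, products are treated as plain integers (no fractional-mode adjustment), and the 32-bit case stands in for an accumulator with no guard bits.

```python
def wrap_signed(value, bits):
    """Interpret value modulo 2**bits as a two's-complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value

def accumulate(products, bits):
    """Sum products in an accumulator that is `bits` wide, wrapping on overflow."""
    acc = 0
    for p in products:
        acc = wrap_signed(acc + p, bits)
    return acc

# Largest product of two 16-bit two's-complement values: (-32768) * (-32768) = 2**30.
products = [(-32768) * (-32768)] * 4   # true sum is 2**32

narrow = accumulate(products, 32)   # no guard bits: the sum wraps
wide = accumulate(products, 40)     # eight extra bits: the sum is preserved
```

Here `wide` holds the true sum while `narrow` has wrapped around, which is the loss of data that the overflow bits are meant to prevent.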
All multiplication and MAC operations execute in a single 50 ns cycle. (Please consult an ADSP-21xx Data Sheet for the most recent specifications.) Two new operands can be loaded into the input registers in parallel with the computation so that a new MAC operation with new operands can be started every cycle. The ADSP-2115 runs at full speed even with an off-chip memory access.
TMS320C50 MAC Operation
There is no dedicated multiplier/accumulator hardware in the TMS320C50. The TMS320C50 requires the use of both the multiplier and the ALU to perform a complete multiplication/accumulation operation. A multiplication is performed by loading the TREG0 register with the first operand. Once this data is loaded, a value from the data bus can be multiplied with the value in the TREG0 register. The instructions for the multiplier are specified with a mnemonic. The instructions for a multiplication are shown below.
LT <data memory address>
MPY <data memory address>
A product is obtained every two cycles.
A full multiplication/accumulation requires the use of the ALU as well as the multiplier. The instruction required to perform a MAC operation is shown below. This instruction requires two words of program memory storage.
MAC <prog. mem. address> <data mem. address>
With both operands in on-chip memory, the MAC instruction takes three 35 ns cycles in non-repeat mode. In repeat mode, it will require 2 + n cycles, where n is the number of repeats.
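The cycle counts quoted above can be turned into a cost per MAC with a short calculation. The Python sketch below uses only the figures given in this note (2 + n cycles in repeat mode, a 35 ns cycle); the function names are hypothetical, and it assumes that n counts the MAC operations carried out by the repeated instruction.

```python
TMS_CYCLE_NS = 35  # 28.57 MHz TMS320C50, as assumed throughout this note

def tms_repeat_mac_cycles(n):
    """Total cycles for n MACs in repeat mode, using the 2 + n figure."""
    return 2 + n

def tms_cycles_per_mac(n):
    """Average cycles per MAC when n MACs share the two overhead cycles."""
    return tms_repeat_mac_cycles(n) / n

def tms_repeat_mac_time_ns(n):
    """Elapsed time in nanoseconds for n MACs in repeat mode."""
    return tms_repeat_mac_cycles(n) * TMS_CYCLE_NS
```

With n = 1 the formula gives three cycles, which agrees with the non-repeat figure; as n grows, the average approaches one cycle per MAC because the two overhead cycles are spread over more operations.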
There are four different mnemonics used for the multiply/accumulate function: MAC, MACD, MADD, MADS. The specific use of each of these depends upon the source of the data. For a dual operand fetch, such as that needed for a digital filter, the MADD instruction should be used. The DMOV portion of the MADD instruction will not function with external memory. All data must reside on chip.
The TMS320C50 provides one bit of extension in the accumulator (a 31-bit accumulator with an overflow bit compared to the 40-bit accumulator of the ADSP-2115). After more than one overflow, the calculation of the TMS320C50 is corrupted. Automatic rounding is not supported in the multiplier. This is unlike the ADSP-2115, where up to 256 overflows can occur with no lost data and automatic rounding is performed in the same cycle as the multiply operation.
ADSP-2115 Shifter
The barrel shifter in the ADSP-2115 has an input register, SI, and accepts as inputs any result registers in the processor (e.g., MR1, AR) including its own result register, SR. Like the MAC result register set, the 32-bit SR is divided into two 16-bit registers, SR0 and SR1. The shifter also has an exponent register, SE, which is set automatically by the exponent adjust instructions and used for normalization instructions.

Table I. Summary of Arithmetic Capabilities

DSP Requirement                                  ADSP-2115                 TMS320C50
All ALU Operations in a Single Cycle             ✓                         No
Single-Cycle Multiplication                      ✓                         No
Single-Cycle MAC Operations                      ✓                         ✓*
Single-Cycle Shifting                            0–32 Bits Left or Right   0–16 Bits Left or Right
                                                                           0–7 Bits Left
                                                                           1 or 4 Bits Left, 6 Bits Right
Accumulator Overflow Protection                  8 Bits                    1 Bit
Signed, Unsigned or Mixed-Mode Multiplications   ✓                         No Mixed Mode
Single-Cycle Normalization                       ✓                         No

*Approaches single-cycle efficiency when using repeat mode.
The shifter can place a 16-bit input value anywhere within a 32-bit field in a single cycle. The input can be shifted any number of bits from off-scale left to off-scale right with either an arithmetic or logical shift. Other functions such as exponent detection, normalization, denormalization, block floating-point exponent maintenance, and pattern merging can also be performed with this shifter. All shifter operations are performed in a single cycle. Numbers can be normalized, regardless of the number of bits to be shifted, in a single cycle.
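Exponent detection and normalization can be sketched in Python for readers unfamiliar with the operation. This is a generic illustration of the technique on 16-bit two's-complement values, not a bit-exact model of the ADSP-2115 shifter instructions; the function names are hypothetical. The software loop below only shows what is computed, whereas the hardware does the equivalent work in a single cycle.

```python
def to_signed16(x):
    """Interpret the low 16 bits of x as a two's-complement number."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def redundant_sign_bits(x):
    """Count the bits below the sign bit that merely repeat the sign."""
    x &= 0xFFFF
    sign = (x >> 15) & 1
    count = 0
    for bit in range(14, -1, -1):
        if (x >> bit) & 1 != sign:
            break
        count += 1
    return count

def normalize16(x):
    """Return (mantissa, exponent) such that x == mantissa * 2**exponent.

    The mantissa is x shifted left until no redundant sign bits remain;
    the exponent is the negative of that shift count.
    """
    shift = redundant_sign_bits(x)
    mantissa = to_signed16((x & 0xFFFF) << shift)
    return mantissa, -shift
```

For example, `normalize16(3)` yields a mantissa of 24576 (0x6000) with an exponent of -13, since 24576 × 2⁻¹³ = 3.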
TMS320C50 Shifter
The TMS320C50 has three scaling shifters. The P-scaler shifts the product 0, 1, or 4 bits to the left or 6 bits to the right. The prescaler at the input of the ALU shifts data to the left or right from 0 to 16 bits. The post-scaler at the output of the ALU can shift data coming from the accumulator left from 0 to 7 bits. These shifters add the advantage of being able to scale data during the data move instead of requiring an additional shifter operation but limit the flexibility for general purpose shifting operations.
Arithmetic Summary
Table I summarizes the comparison of arithmetic capabilities of these processors.
The side-by-side arithmetic architecture of the ADSP-2115 results in easier implementation of many DSP algorithms as compared to the fixed sequence, end-to-end architecture of the TMS320C50. Due to the dependency of the ALU on the multiplier for multiplication/accumulations in the TMS320C50, MAC operations cannot be easily intermingled with ALU operations. This may require changing the order of calculations in an algorithm so that the interdependency of ALU and multiplier does not cause a problem. The local storage registers found in the ADSP-2115 make data movement for calculations easy. If data is to be used many times, it can reside in a register to eliminate the need of fetching it from memory each time. With local registers and the open architecture, it is easy to perform arithmetic operations in any order and to guarantee that input operands and results remain intact until explicitly overwritten or moved.
DATA ADDRESSING CAPABILITIES
A digital signal processor’s ability to perform fast arithmetic is wasted if the required data cannot be fetched at sustained speed equal to the processing rate. Addressing hardware must support the dual operand fetches required to fully utilize the Harvard architecture found in most DSPs. A good DSP must have the ability to store two types of data operands, typically a coefficient and a data word. Maximum efficiency can be obtained if two different memory spaces are provided for the data operands so that two operands can be fetched in the same single cycle. Using both data memory and program memory to store data will allow maximum efficiency. Circular buffers are frequently useful in implementing DSP algorithms; hardware support of address pointer wraparound is another feature distinguishing a signal processor from other types of high-performance processors.
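Address pointer wraparound can be illustrated with a small Python model of a circular delay line feeding an FIR filter. It is a sketch of the general technique only: the class `CircularPointer`, the function `fir_sample`, and the use of Python lists as "data memory" and "program memory" are all hypothetical and do not describe either processor's address generators. In software the wraparound costs a modulo operation on every access, which is the overhead that hardware circular buffering removes.

```python
class CircularPointer:
    """Address pointer that wraps within [base, base + length)."""

    def __init__(self, base, length):
        self.base = base
        self.length = length
        self.offset = 0

    def post_modify(self, step):
        """Return the current address, then advance by step with wraparound."""
        address = self.base + self.offset
        self.offset = (self.offset + step) % self.length
        return address

def fir_sample(data_mem, ptr, coeffs, new_sample):
    """Overwrite the oldest sample, then form the sum of products.

    coeffs[k] multiplies the sample that is k steps old. The delay line
    is read oldest-first, so the coefficients are walked in reverse.
    """
    data_mem[ptr.post_modify(1)] = new_sample  # pointer now at oldest sample
    acc = 0
    for k in range(len(coeffs) - 1, -1, -1):
        # One data operand and one coefficient per step (dual operand fetch).
        acc += coeffs[k] * data_mem[ptr.post_modify(1)]
    return acc  # pointer has wrapped back to the oldest sample

# Usage: a 3-tap filter whose delay line occupies addresses 2..4.
data_mem = [0] * 8
ptr = CircularPointer(base=2, length=3)
coeffs = [1, 2, 3]
outputs = [fir_sample(data_mem, ptr, coeffs, x) for x in [1, 0, 0, 0]]
```

Feeding a unit impulse through the filter returns the coefficients in order, followed by zero, and no address outside the buffer is touched.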
Figure 3 shows the address generation circuitry of the ADSP-2115 while Figure 4 shows that of the TMS320C50. The addressing capabilities of the TMS320C50 are basically the same as those of the TMS320C25 with the addition of some circular buffering logic. Flexibility is still limited since there is only one modify register (AR0) and only two simultaneous circular buffers are supported compared to the eight modify registers and eight simultaneous circular buffers of the ADSP-2115. Also, due to instruction pipelining of the TMS320C50, the auxiliary registers cannot be used for as many as two cycles after certain register load instructions. These addressing