Digital signal processing systems demand high performance processors. But high performance cannot be
measured by a processor’s multiplication/accumulation
speed or MIPS (Millions of instructions per second) rating alone. Many times a DSP processor is characterized
mainly by its MIPS rate. Since the instruction of one DSP
device is not necessarily equivalent to that of another
DSP device, a MIPS rating can be misleading. Other
architectural and performance requirements relating to
a DSP processor’s capabilities in areas such as arithmetic, addressing and program sequencing may be
more important. What distinguishes DSPs from other
types of microprocessor and microcontroller architectures is how well they perform in each of the following
areas.
1. Fast and flexible arithmetic
A DSP processor must provide single-cycle computation for multiplication, multiplication with accumulation, arbitrary amounts of shifting, and
standard arithmetic and logical operations. In addition, the arithmetic units should allow for any
sequence of computation so that a given DSP algorithm can be executed without being reformulated.
2. Extended dynamic range on multiplication/
accumulation
Extended sums-of-products are fundamental to DSP
algorithms. Protection against overflow in successive accumulations ensures that no loss of data or
range occurs.
3. Single-cycle fetch of two operands (from either on- or off-chip)
Again, in extended sums-of-products calculations, two operands are always needed to feed the
calculation. A processor must be able to sustain
two operand data throughput. Also, flexible addressing capabilities for multiple data memories
are important.
APPLICATION NOTE
NORWOOD, MASSACHUSETTS 02062-9106
4. Hardware circular buffering (both on- and off-chip)
A large class of DSP algorithms, including most filters, requires circular buffers. Hardware to handle
address pointer wraparound or modulo addressing
reduces overhead (increasing performance) and
simplifies implementation.
5. Zero overhead looping and branching
DSP algorithms are naturally repetitive and can easily be expressed as loops. Program sequencing that
supports looped code with zero overhead provides
the best performance and the easiest programming
implementation. Likewise, overhead penalties for
conditional program flow are unacceptable in signal
processing applications.
Not all processors currently used for DSP and DSP-like
functions meet these architectural and performance
requirements equally well. This article examines these
considerations for selecting a DSP processor, comparing two 16-bit fixed-point processors, the ADSP-2115
from Analog Devices and the TMS320C5x from Texas
Instruments.
The three sections that follow discuss the five points
above. The arithmetic section discusses items one and
two, the addressing capabilities section discusses
items three and four, and the program sequencing section discusses item five.
Program examples and benchmarks can be found at the
end of this article.
ARITHMETIC CAPABILITIES
The basis of a successful DSP implementation is the
ability to perform fast math. Arithmetic capabilities are
the foundation of DSP performance.
General Purpose Math
One indicator of a good arithmetic architecture is the
ability to perform a wide range of arithmetic computations. These computations should be handled in a
flexible manner so that the algorithm can be implemented without rearranging the order of the arithmetic
operations or operands. If the arithmetic architecture is
fixed, too special-purpose, or limited, the algorithm
must be rearranged, which creates extra work for the DSP
designer or programmer and delays getting a system
running. Algorithm development frequently turns out to
be much of the work of implementing a DSP system. If
an algorithm can be used “as is” with no extra work, the
design can be finished sooner and with less chance of
error.
Figure 1. Block Diagram of Arithmetic Section of the ADSP-2115
(ALU, MAC and shifter, each with input and output registers, connected to the PMD and DMD buses, the bus exchange and a common result bus)
Arithmetic Architecture
Figure 1 shows a block diagram of the arithmetic section
of the ADSP-2115 while Figure 2 shows that of the
TMS320C50. Both of these devices utilize a modified
Harvard architecture which can feed data operands from
both program memory and data memory to the arithmetic section. Both of these devices work with 16-bit
numbers.
ADSP-2115 Arithmetic Architecture Overview
The ADSP-2115 has three independent computational
units: an ALU, a multiplier/accumulator (MAC), and a
barrel shifter. They are connected (via the Result bus) so
that the output register of any arithmetic unit may be
operated on directly as an input by any other unit. In addition, the ALU and MAC are directly connected to both
the program and data memory buses. Operands for ALU
and MAC operations can come from both memories or
any combination of off-chip memory and other data registers in the processor. All arithmetic operations are register based and a group of registers surrounds each
arithmetic unit. A primary and secondary bank of registers is available to provide for fast context switching. All
arithmetic registers can also be used as general purpose
data registers.
TMS320C5x Arithmetic Architecture Overview
Figure 2 shows the block diagram of the arithmetic section of the TMS320C50. The TMS320C50 contains a multiplier, an ALU, a Parallel Logic Unit (PLU), a 16-bit
scaling shifter and additional shifters at the outputs of
both the accumulator and multiplier. The multiplier has
an input register, TREG0, and an output register, PREG.
The multiplier has direct input connections to both the
program and data bus only for one operand or input.
The ALU has direct access to only the data bus, not the
program bus. Results are always sent to either the data
bus or the accumulator registers. In some cases, the result must first be stored back in data memory before it
can be used as an input for another calculation. Operations such as adding two data values from memory or
multiply/accumulating with a data value can require
multiple cycles.
With the TMS320C50, there is no dedicated multiplier/
accumulator (MAC), which is required in many DSP algorithms. Instead the ALU must be used in conjunction
with the multiplier for MAC operations. This may require
some rearrangement of the algorithm or the temporary
storage of intermediate results in data memory if the algorithm requires MAC operations interleaved with ALU
operations. Also, there are arithmetic pipeline delays
that are required to achieve sustained MAC operations.
Basic multiply and ALU operations require multiple
cycles as opposed to the single cycle operation of the
arithmetic units in the ADSP-2115.
The availability of general purpose data registers and
the flexibility of data movement in the TMS320C50 is
limited. This may result in data bottlenecks and in extra
cycles being required to move data into the right position prior to an arithmetic operation.
Figure 2. Block Diagram of Arithmetic Section of the TMS320C50
(multiplier with TREG0 input and 32-bit PREG output, P-scaler, prescaler, 32-bit ALU, 32-bit ACC and ACCB accumulators, and postscaler, fed from the program and data buses)
ADSP-2115 ALU
The ALU has two X and two Y input registers: AX0, AX1,
and AY0, AY1. ALU operations are performed on any
X-Y assortment of these input registers. They may be
loaded from any combination of program and data
memory or other data registers in the processor. The result of the operation appears in the ALU result (AR) or
ALU feedback (AF) register. AR and AF can also be used
as the X and Y operands (respectively) in any ALU calculation. The result registers of the MAC and barrel
shifter can also be used directly as X inputs to the ALU
(and vice versa).
ALU instructions are coded in a register transfer, algebraic syntax. An example of addition is shown below.
    AR = AX0 + AY1, AX0 = DM(I0,M0), AY1 = PM(I4,M4);
This example is a multifunction instruction. The first
“clause” of the instruction (up to the first comma) is the
addition operation. The second clause loads the X input
register from data memory (“DM”) and the third clause
loads the Y input from program memory (“PM”). An addition
(or any other ALU operation) can be executed on a sustained,
single-cycle basis. (These operand fetching
clauses of the instruction may be omitted if they are not
needed.)
All ALU operations complete in a single 50 ns cycle. (All
references to cycles for the ADSP-2115 assume a
20 MHz device.) The ADSP-2115 runs at full speed even
with an off-chip memory access.
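The behavior of the multifunction add can be illustrated with a small software model. The following Python sketch is not ADSP-2115 code; it assumes that source registers are read at the start of a cycle and destinations written at the end, that the DM(I0,M0) and PM(I4,M4) fetches post-modify their index registers, and that the ALU result wraps at 16 bits (overflow and saturation modes are ignored). The function names and the dictionary used as a register file are hypothetical.

```python
def multifunction_add(regs, dm, pm, fetch=True):
    """Model one cycle of: AR = AX0 + AY1, AX0 = DM(I0,M0), AY1 = PM(I4,M4)."""
    # The add uses the register values present at the start of the cycle.
    ar = (regs["AX0"] + regs["AY1"]) & 0xFFFF  # assumed 16-bit wraparound
    if fetch:
        # Both operand fetches happen in the same modeled cycle.
        new_ax0 = dm[regs["I0"]]
        new_ay1 = pm[regs["I4"]]
        # Post-modify: the index is updated after the access.
        regs["I0"] += regs["M0"]
        regs["I4"] += regs["M4"]
        regs["AX0"], regs["AY1"] = new_ax0, new_ay1
    regs["AR"] = ar
    return ar


def add_vectors(dm, pm):
    """Add two equal-length lists element by element, one modeled cycle per sum."""
    if not dm:
        return []
    # Priming loads: fetch the first operand pair before the loop starts.
    regs = {"AX0": dm[0], "AY1": pm[0], "AR": 0,
            "I0": 1, "I4": 1, "M0": 1, "M4": 1}
    sums = []
    for k in range(len(dm)):
        # The last iteration omits the fetch clauses, as the text allows.
        sums.append(multifunction_add(regs, dm, pm, fetch=k < len(dm) - 1))
    return sums
```

For example, `add_vectors([1, 2, 3], [10, 20, 30])` yields one sum per modeled cycle after the priming loads, because each cycle both consumes the current operands and fetches the next pair.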
TMS320C50 ALU
ALU operations require that one operand come
from the accumulator while the other comes from either
the multiplier output, the accumulator buffer, or from
the data bus or accumulator through a shifter. To add
two numbers, the accumulator must first be loaded with the
first data value. After the accumulator is loaded, a second number can be added to the accumulator. The instructions for the ALU are specified with a mnemonic.
Adding two numbers therefore takes two instructions:
one to load the accumulator and one to perform the addition.
For the result to be used as an input value for anything
other than another ALU operation, the data must first be
stored back into data memory from the accumulator.
Not all ALU operations can be performed in a single
35 ns cycle; an add as described above can be accomplished every two cycles. All references to TMS320C50
cycles assume a 28.57 MHz device with a 35 ns cycle
time. Some ALU instructions (ADD #k, SUB #k,
ADD #lk, SUB #lk, ADRK) cannot be used with the repeat
feature.
ADSP-2115 MAC
As shown in Figure 1, the ADSP-2115 multiplier/accumulator (MAC) sits next to the ALU. Like the ALU, it has two
X and two Y input registers, MX0, MX1 and MY0, MY1.
The unit performs both multiplications and MACs independent of the ALU. This is a key difference from the
architecture of the TMS320C50.
MAC operations are performed on any X-Y assortment
of input registers. They may be loaded from any combination of program and data memory or other data registers in the processor. The result of the operation appears
in the MAC result register (MR) or the MAC feedback
register (MF). Like the ALU, the feedback and result registers can also serve as the X and Y inputs for any multiplication or MAC operation. The result registers of the
barrel shifter and ALU can also be used directly as X inputs to the MAC (and vice versa).
The instructions for the MAC are specified in a register
transfer, algebraic syntax. An example is shown below.
The first line shows multiplication of two signed operands and the second example shows multiplication with
accumulation of one signed and one unsigned operand.
(Signed and unsigned operands can be mixed in any
combination.)
The second example is a multifunction instruction. The
first “clause” of the instruction (up to the first comma) is
the MAC operation. The second clause loads the X input
register from data memory (DM) and the third clause
loads the Y input from program memory. Any MAC
operation can be executed on a sustained, single-cycle
basis. (These operand fetching clauses of the instruction
may be omitted, if they are not needed, as in the first
example.)
The MR (MAC result) register is actually a 40-bit accumulator. It is divided into two 16-bit pieces (MR0 and
MR1) and an 8-bit overflow register (MR2). DSP applications frequently deal with numbers over a large dynamic
range. The eight “overflow” bits of MR2 allow for 256
MAC overflows before a loss of data can occur. The
MAC also supports multiprecision operations as well as
automatic unbiased rounding.
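The value of the eight extension bits can be seen in a short numerical model. The Python sketch below is an illustration only: it assumes integer-mode products (no fractional-format shift), two's-complement wraparound on overflow and no saturation or rounding. The function names are hypothetical. It accumulates the same products in a 40-bit and a 32-bit accumulator and splits the 40-bit value into fields corresponding to MR2, MR1 and MR0.

```python
def wrap_signed(value, bits):
    """Wrap an integer into a two's-complement field of the given width."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def accumulate(pairs, acc_bits):
    """Sum of products, wrapping to acc_bits after every accumulation."""
    acc = 0
    for x, y in pairs:
        acc = wrap_signed(acc + x * y, acc_bits)
    return acc


def split_mr(acc40):
    """Split a 40-bit accumulator value into (MR2, MR1, MR0) bit fields."""
    raw = acc40 & ((1 << 40) - 1)
    return (raw >> 32) & 0xFF, (raw >> 16) & 0xFFFF, raw & 0xFFFF


# 200 near-full-scale products: far more than a 32-bit accumulator can hold.
pairs = [(32767, 32767)] * 200
exact = sum(x * y for x, y in pairs)
acc40 = accumulate(pairs, 40)   # equals the exact sum
acc32 = accumulate(pairs, 32)   # wraps and no longer equals the exact sum
```

In this model the 40-bit result matches the exact sum, with the excess range held in the MR2 field, while the 32-bit accumulator has wrapped.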
All multiplication and MAC operations execute in a
single 50 ns cycle. (Please consult an ADSP-21xx Data
Sheet for the most recent specifications.) Two new
operands can be loaded into the input registers in parallel with the computation so that a new MAC operation
with new operands can be started every cycle. The
ADSP-2115 runs at full speed even with an off-chip
memory access.
TMS320C50 MAC Operation
There is no dedicated multiplier/accumulator hardware
in the TMS320C50. The TMS320C50 requires the use of
both the multiplier and the ALU to perform a complete
multiplication/accumulation operation. A multiplication
is performed by loading the TREG0 register with the first
operand. Once this data is loaded, a value from the data
bus can be multiplied with the value in the TREG0 register. The instructions for the multiplier are specified with
a mnemonic. The instructions for a multiplication are
shown below.
LT <data memory address>
MPY <data memory address>
A product is obtained every two cycles.
A full multiplication/accumulation requires the use of
the ALU as well as the multiplier. The instruction
required to perform a MAC operation is shown below.
This instruction requires two words of program memory
storage.
MAC <prog. mem. address>, <data mem. address>
With both operands in on-chip memory, the MAC
instruction takes three 35 ns cycles in non-repeat mode.
In repeat mode, it will require 2 + n cycles, where n is the
number of repeats.
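The cycle counts quoted in this section can be turned into execution times with simple arithmetic. The Python sketch below is a rough estimate rather than a benchmark. It uses the 50 ns and 35 ns cycle times stated in this article, three cycles per MAC in non-repeat mode, and a repeat-mode cost of 2 + n cycles, and it assumes that n equals the number of MAC operations executed and that all TMS320C50 operands are on chip. The function names are hypothetical.

```python
ADSP2115_CYCLE_NS = 50    # 20 MHz device, as stated in the text
TMS320C50_CYCLE_NS = 35   # 28.57 MHz device, as stated in the text


def adsp2115_mac_time_ns(n_macs):
    """One cycle per MAC, sustained."""
    return n_macs * ADSP2115_CYCLE_NS


def tms320c50_mac_time_ns(n_macs, repeat_mode):
    """Three cycles per MAC in non-repeat mode, 2 + n cycles in repeat mode.

    Assumption: n is taken to be the number of MAC operations executed.
    """
    cycles = 2 + n_macs if repeat_mode else 3 * n_macs
    return cycles * TMS320C50_CYCLE_NS
```

Under these assumptions a single MAC outside repeat mode costs 105 ns on the TMS320C50 against 50 ns on the ADSP-2115, even though the TMS320C50 has the shorter cycle, while a long repeated run spreads the two startup cycles over many operations. This is one way to see why a MIPS rating alone says little about MAC throughput.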
There are four different mnemonics used for the multiply/accumulate function: MAC, MACD, MADD, MADS.
The specific use of each of these depends upon the
source of the data. For a dual operand fetch, such as that
needed for a digital filter, the MADD instruction should
be used. The DMOV portion of the MADD instruction will
not function with external memory. All data must reside
on chip.
The TMS320C50 provides one bit of extension in the
accumulator (a 31-bit accumulator with an overflow bit
compared to the 40-bit accumulator of the ADSP-2115).
After more than one overflow, the calculation of the
TMS320C50 is corrupted. Automatic rounding is not
supported in the multiplier. This is unlike the ADSP-2115, where up to 256 overflows can occur with no lost
data and automatic rounding is performed in the same
cycle as the multiply operation.
ADSP-2115 Shifter
The barrel shifter in the ADSP-2115 has an input register,
SI, and accepts as inputs any result registers in the processor (e.g., MR1, AR) including its own result register,
SR. Like the MAC result register set, the 32-bit SR is divided into two 16-bit registers, SR0 and SR1. The shifter
also has an exponent register, SE, which is set automatically by the exponent adjust instructions and used for
normalization instructions.
The shifter can place a 16-bit input value anywhere
within a 32-bit field in a single cycle. The input can be
shifted any number of bits from off-scale left to off-scale
right with either an arithmetic or logical shift. Other
functions such as exponent detection, normalization,
denormalization, block floating-point exponent maintenance, and pattern merging can also be performed with
this shifter. All shifter operations are performed in a
single cycle. Numbers can be normalized, regardless of
the number of bits to be shifted, in a single cycle.
Table I. Summary of Arithmetic Capabilities
DSP Requirement                                  ADSP-2115                  TMS320C50
All ALU Operations in a Single Cycle             ✓                          No
Single-Cycle Multiplication                      ✓                          No
Single-Cycle MAC Operations                      ✓                          ✓*
Single-Cycle Shifting                            0–32 Bits Left or Right    0–16 Bits Left or Right;
                                                                            0–7 Bits Left;
                                                                            1 or 4 Bits Left;
                                                                            6 Bits Right
Accumulator Overflow Protection                  8 Bits                     1 Bit
Signed, Unsigned or Mixed-Mode Multiplications   ✓                          No Mixed Mode
Single-Cycle Normalization                       ✓                          No
*Approaches single-cycle efficiency when using repeat mode.
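The two ADSP-2115 shifter functions described in this section, placing a 16-bit value within a 32-bit field and normalizing a number, can be sketched in software. The Python model below is a simplification and does not reproduce the exact instruction semantics of the device: shift amounts are taken relative to the low half of the field, the SE register is not modeled, and the function names are hypothetical.

```python
def place_in_field(value16, shift, arithmetic=True):
    """Place a 16-bit input in a 32-bit field.

    Positive shift moves left, negative moves right. An arithmetic shift
    sign-extends the input, a logical shift fills with zeros.
    """
    value16 &= 0xFFFF
    if arithmetic and value16 & 0x8000:
        value = value16 - 0x10000      # interpret as a negative number
    else:
        value = value16
    if shift >= 0:
        result = value << shift
    else:
        result = value >> -shift       # Python's >> keeps the sign
    return result & 0xFFFFFFFF         # off-scale bits are discarded


def redundant_sign_bits(value16):
    """Count the sign bits beyond the first in a 16-bit two's-complement value."""
    value16 &= 0xFFFF
    sign = value16 >> 15
    count = 0
    for bit in range(14, -1, -1):
        if (value16 >> bit) & 1 != sign:
            break
        count += 1
    return count


def normalize(value16):
    """Shift out redundant sign bits; return (normalized value, shift count)."""
    n = redundant_sign_bits(value16)
    return (value16 << n) & 0xFFFF, n
```

On the ADSP-2115 each of these operations completes in a single cycle whatever the shift distance. The loop in `redundant_sign_bits` is only a convenience of the software model.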
TMS320C50 Shifter
The TMS320C50 has three scaling shifters. The P-scaler shifts the product 0, 1, or 4 bits to the left or 6
bits to the right. The prescaler at the input of the ALU
shifts data to the left or right from 0 to 16 bits. The
post-scaler at the output of the ALU can shift data
coming from the accumulator left from 0 to 7 bits.
These shifters add the advantage of being able to
scale data during the data move instead of requiring
an additional shifter operation but limit the flexibility
for general purpose shifting operations.
Arithmetic Summary
Table I summarizes the comparison of arithmetic capabilities of these processors.
The side-by-side arithmetic architecture of the ADSP-2115 results in easier implementation of many DSP
algorithms as compared to the fixed sequence, end-to-end architecture of the TMS320C50. Due to the dependency of the ALU on the multiplier for multiplication/
accumulations in the TMS320C50, MAC operations cannot be easily intermingled with ALU operations. This
may require changing the order of calculations in an
algorithm so that the interdependency of ALU and multiplier does not cause a problem. The local storage registers
found in the ADSP-2115 make data movement for
calculations easy. If data is to be used many times, it can
reside in a register to eliminate the need of fetching it
from memory each time. With local registers and the
open architecture, it is easy to perform arithmetic operations in any order and to guarantee that input operands
and results remain intact until explicitly overwritten or
moved.
DATA ADDRESSING CAPABILITIES
A digital signal processor’s ability to perform fast
arithmetic is wasted if the required data cannot be
fetched at sustained speed equal to the processing
rate. Addressing hardware must support the dual
operand fetches required to fully utilize the Harvard
architecture found in most DSPs. A good DSP must
have the ability to store two types of data operands,
typically a coefficient and a data word. Maximum efficiency can be obtained if two different memory
spaces are provided for the data operands so that two
operands can be fetched in the same single cycle.
Using both data memory and program memory to
store data will allow maximum efficiency. Circular
buffers are frequently useful in implementing DSP
algorithms; hardware support of address pointer
wraparound is another feature distinguishing a signal
processor from other types of high-performance
processors.
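The role of dual operand fetches and circular buffers can be illustrated with a software FIR filter. The Python sketch below is a model, not DSP code: the explicit modulo operations stand in for the pointer wraparound that addressing hardware performs at no cost, and each pass of the inner loop corresponds to one MAC fed by a coefficient fetch and a data fetch. The function name and variable names are hypothetical.

```python
def fir_filter(samples, coeffs):
    """FIR filter using a circular delay line with modulo pointer wraparound."""
    n = len(coeffs)
    if n == 0:
        raise ValueError("at least one coefficient is required")
    delay = [0] * n          # circular buffer holding the last n samples
    newest = 0               # pointer to the most recent sample
    outputs = []
    for x in samples:
        delay[newest] = x    # overwrite the oldest sample
        acc = 0
        idx = newest
        for c in coeffs:
            # One MAC: a coefficient and a data word are both needed here.
            acc += c * delay[idx]
            idx = (idx - 1) % n      # step back with wraparound
        outputs.append(acc)
        newest = (newest + 1) % n    # advance with wraparound
    return outputs
```

Feeding a unit impulse through the filter returns the coefficients in order, which is a quick check that the wraparound is correct. Without hardware support, every modulo step in this loop would cost additional instructions on each sample.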
Figure 3 shows the address generation circuitry of the
ADSP-2115 while Figure 4 shows that of the TMS320C50.
The addressing capabilities of the TMS320C50 are basically the same as those of the TMS320C25 with the addition of some circular buffering logic. Flexibility is still
limited since there is only one modify register (AR0) and
only two simultaneous circular buffers are supported
compared to the eight modify registers and eight simultaneous circular buffers of the ADSP-2115. Also, due to
instruction pipelining of the TMS320C50, the auxiliary
registers cannot be used for as many as two cycles after
certain register load instructions. These addressing