Analog Devices, Inc. reserves the right to change this product without
prior notice. Information furnished by Analog Devices is believed to be
accurate and reliable. However, no responsibility is assumed by Analog
Devices for its use; nor for any infringement of patents or other rights of
third parties which may result from its use. No license is granted by implication or otherwise under the patent rights of Analog Devices, Inc.
Trademark and Service Mark Notice
The Analog Devices logo, Blackfin, EZ-ICE, EZ-KIT Lite, SHARC,
TigerSHARC, the TigerSHARC logo, and VisualDSP++ are registered
trademarks of Analog Devices, Inc.
SuperScalar is a trademark of Analog Devices, Inc.
All other brand and product names are trademarks or service marks of
their respective owners.
CONTENTS
PREFACE
Purpose of This Manual
Intended Audience
Manual Contents
What’s New in This Manual
Technical or Customer Support
Supported Processors
Product Information
Thank you for purchasing and developing systems using TigerSHARC®
processors from Analog Devices.
Purpose of This Manual
The ADSP-TS101 TigerSHARC Processor Programming Reference contains
information about the DSP architecture and DSP assembly language for
TigerSHARC processors. These are 32-bit, fixed- and floating-point digital signal processors from Analog Devices for use in computing,
communications, and consumer applications.
The manual provides information on how assembly instructions execute
on the TigerSHARC processor’s architecture along with reference information about DSP operations.
Intended Audience
The primary audience for this manual is a programmer who is familiar
with Analog Devices processors. This manual assumes that the audience
has a working knowledge of the appropriate processor architecture and
instruction set. Programmers who are unfamiliar with Analog Devices
processors can use this manual, but should supplement it with other texts
(such as the appropriate hardware reference manuals and data sheets) that
describe their target architecture.
•Chapter 1, “Introduction”
Provides a general description of the DSP architecture, instruction
slot/line syntax, and instruction parallelism rules.
•Chapter 2, “Compute Block Registers”
Provides a description of the compute block register file, register
naming syntax, and numeric formats.
•Chapter 3, “ALU”
Provides a description of the arithmetic logic unit (ALU) and communications logic unit (CLU) operation, includes ALU/CLU
instruction examples, and provides the ALU instruction summary.
•Chapter 4, “Multiplier”
Provides a description of the multiply-accumulator (multiplier)
operation, includes multiplier instruction examples, and provides
the multiplier instruction summary.
•Chapter 5, “Shifter”
Provides a description of the bit wise, barrel shifter (shifter) operation, includes shifter instruction examples, and provides the shifter
instruction summary.
•Chapter 6, “IALU”
Provides a description of the integer arithmetic logic unit (IALU)
and data alignment buffer (DAB) operation, includes IALU
instruction examples, and provides the IALU instruction summary.
•Chapter 7, “Program Sequencer”
Provides a description of the program sequencer operation, the
instruction alignment buffer (IAB), the branch target buffer
(BTB), and the instruction pipeline. This chapter also includes a
program sequencer instruction summary.
•Chapter 8, “Instruction Set”
Describes the ADSP-TS101 processor instruction set in detail,
starting with an overview of the instruction line and instruction
types.
•Appendix A, “Quick Reference”
Contains a concise description of the ADSP-TS101 processor
assembly language. It is intended to be used as an assembly programming reference.
•Appendix B, “Register/Bit Definitions”
Provides register and bit name definitions to be used in
ADSP-TS101 processor programs.
•Appendix C, “Instruction Decode”
Identifies operation codes (opcodes) for instructions. Use this
appendix to learn how to construct opcodes.
This programming reference is a companion document to the
ADSP-TS101 TigerSHARC Processor Hardware Reference.
What’s New in This Manual
Revision 1.1 of the ADSP-TS101 TigerSHARC Processor Programming Reference corrects and closes all open Tool Anomaly Reports (TARs) against
this manual, adds figure titles that were missing, and updates Web site and
contact numbers. These changes affect the preface, various chapters,
appendices, and the index.
The name “TigerSHARC” refers to a family of floating-point and
fixed-point [8-bit, 16-bit, and 32-bit] processors. VisualDSP++ currently
supports the following TigerSHARC processors:
ADSP-TS101, ADSP-TS201, ADSP-TS202, and ADSP-TS203
SHARC® (ADSP-21xxx) Processors
The name “SHARC” refers to a family of high-performance, 32-bit,
floating-point processors that can be used in speech, sound, graphics, and
imaging applications. VisualDSP++ currently supports the following
SHARC processors:
You can obtain product information from the Analog Devices Web site,
from the product CD-ROM, or from the printed publications (manuals).
Analog Devices is online at www.analog.com. Our Web site provides information about a broad range of products—analog integrated circuits,
amplifiers, converters, and digital signal processors.
MyAnalog.com is a free feature of the Analog Devices Web site that allows
customization of a Web page to display only the latest information on
products you are interested in. You can also choose to receive weekly
e-mail notifications containing updates to the Web pages that meet your
interests. MyAnalog.com provides access to books, application notes, data
sheets, code examples, and more.
Registration
Visit www.myanalog.com to sign up. Click Register to use MyAnalog.com.
Registration takes about five minutes and serves as a means to select the
information you want to receive.
If you are already a registered user, just log on. Your user name is your
e-mail address.
Processor Product Information
For information on embedded processors and DSPs, visit the Analog
Devices Web site at www.analog.com/processors, which provides access
to technical publications, data sheets, application notes, product overviews, and product announcements.
The following publications that describe the ADSP-TS101 TigerSHARC
processor (and related processors) can be ordered from any Analog Devices
sales office:
•ADSP-TS101S TigerSHARC Embedded Processor Data Sheet
Online documentation comprises the VisualDSP++ Help system, software
tools manuals, hardware tools manuals, processor manuals, the Dinkum
Abridged C++ library, and Flexible License Manager (FlexLM) network
license manager software documentation. You can easily search across the
entire VisualDSP++ documentation set for any topic of interest. For easy
printing, supplementary .PDF files of most manuals are also provided.
Each documentation file type is described as follows.
File            Description
.CHM            Help system files and manuals in Help format.
.HTM or .HTML   Dinkum Abridged C++ library and FlexLM network license
                manager software documentation. Viewing and printing the
                .HTML files requires a browser, such as Internet Explorer
                4.0 (or higher).
.PDF            VisualDSP++ and processor manuals in Portable Documentation
                Format (PDF). Viewing and printing the .PDF files requires
                a PDF reader, such as Adobe Acrobat Reader (4.0 or higher).
If documentation is not installed on your system as part of the software
installation, you can add it from the VisualDSP++ CD-ROM at any time
by running the Tools installation. Access the online documentation from
the VisualDSP++ environment, Windows® Explorer, or the Analog
Devices Web site.
•Access VisualDSP++ online Help from the Help menu’s Contents, Search, and Index commands.
•Open online Help from context-sensitive user interface items (toolbar buttons, menu commands, and windows).
Accessing Documentation From Windows
In addition to any shortcuts you may have constructed, there are many
ways to open VisualDSP++ online Help or the supplementary documentation from Windows.
Help system files (.CHM) are located in the Help folder, and .PDF files are
located in the Docs folder of your VisualDSP++ installation CD-ROM.
The Docs folder also contains the Dinkum Abridged C++ library and the
FlexLM network license manager software documentation.
Using Windows Explorer
•Double-click the vdsp-help.chm file, which is the master Help system, to access all the other .CHM files.
•Double-click any file that is part of the VisualDSP++ documentation set.
Using the Windows Start Button
•Access VisualDSP++ online Help by clicking the Start button and
choosing Programs, Analog Devices, VisualDSP++, and
VisualDSP++ Documentation.
•Access the .PDF files by clicking the Start button and choosing
Programs, Analog Devices, VisualDSP++, Documentation for
Printing, and the name of the book.
Select a processor family and book title. Download archive (.ZIP) files, one
for each manual. Use any archive management software, such as WinZip,
to decompress downloaded files.
Printed Manuals
For general questions regarding literature ordering, call the Literature
Center at 1-800-ANALOGD (1-800-262-5643) and follow the prompts.
VisualDSP++ Documentation Set
To purchase VisualDSP++ manuals, call 1-603-883-2430. The manuals
may be purchased only as a kit.
If you do not have an account with Analog Devices, you are referred to
Analog Devices distributors. For information on our distributors, log onto
http://www.analog.com/salesdir.
Hardware Tools Manuals
To purchase EZ-KIT Lite® and In-Circuit Emulator (ICE) manuals, call
1-603-883-2430. The manuals may be ordered by title or by product
number located on the back cover of each manual.
Processor Manuals
Hardware reference and instruction set reference manuals may be ordered
through the Literature Center at 1-800-ANALOGD (1-800-262-5643),
or downloaded from the Analog Devices Web site. Manuals may be
ordered by title or by product number located on the back cover of each
manual.
All data sheets (preliminary and production) may be downloaded from the
Analog Devices Web site. Only production (final) data sheets (Rev. 0, A,
B, C, and so on) can be obtained from the Literature Center at
1-800-ANALOGD (1-800-262-5643); they also can be downloaded from
the Web site.
To have a data sheet faxed to you, call the Analog Devices Faxback System
at 1-800-446-6212. Follow the prompts and a list of data sheet code
numbers will be faxed to you. If the data sheet you want is not listed,
check for it on the Web site.
Text conventions used in this manual are identified and described as
follows.
Example                     Description
Close command (File menu)   Titles in reference sections indicate the location of
                            an item within the VisualDSP++ environment’s menu
                            system (for example, the Close command appears on the
                            File menu).
{this | that}               Alternative items in syntax descriptions appear within
                            curly brackets and separated by vertical bars; read the
                            example as this or that. One or the other is required.
[this | that]               Optional items in syntax descriptions appear within
                            brackets and separated by vertical bars; read the
                            example as an optional this or that.
[this,…]                    Optional item lists in syntax descriptions appear
                            within brackets delimited by commas and terminated
                            with an ellipsis; read the example as an optional
                            comma-separated list of this.
.SECTION                    Commands, directives, keywords, and feature names are
                            in text with letter gothic font.
filename                    Non-keyword placeholders appear in text with italic
                            style format.
Note: For correct operation, ...
                            A Note: provides supplementary information on a
                            related topic. In the online version of this book, the
                            word Note appears instead of this symbol.
Caution: Incorrect device operation may result if ...
Caution: Device damage may result if ...
                            A Caution: identifies conditions or inappropriate
                            usage of the product that could lead to undesirable
                            results or product damage. In the online version of
                            this book, the word Caution appears instead of this
                            symbol.
Warning: Injury to device users may result if ...
                            A Warning: identifies conditions or inappropriate
                            usage of the product that could lead to conditions
                            that are potentially hazardous for device users. In
                            the online version of this book, the word Warning
                            appears instead of this symbol.
The ADSP-TS101 TigerSHARC Processor Programming Reference
describes the Digital Signal Processor (DSP) architecture and instruction
set. These descriptions provide the information required for programming
TigerSHARC processor systems. This chapter introduces programming
concepts for the DSP with the following information:
•“DSP Architecture” on page 1-6
•“Instruction Line Syntax and Structure” on page 1-20
•“Instruction Parallelism Rules” on page 1-24
The TigerSHARC processor is a 128-bit, high performance, next generation version of the ADSP-2106x SHARC DSP. The TigerSHARC
processor sets a new standard of performance for digital signal processors,
combining multiple computation units for floating-point and fixed-point
processing as well as very wide word widths. The TigerSHARC processor
maintains a ‘system-on-a-chip’ scalable computing design philosophy,
including 6M bit of on-chip SRAM, integrated I/O peripherals, a host
processor interface, DMA controllers, link ports, and shared bus connectivity for glueless MDSP (Multi Digital Signal Processing).
In addition to providing unprecedented performance in DSP applications
in raw MFLOPS and MIPS, the TigerSHARC processor boosts performance measures such as MFLOPS/Watt and MFLOPS/square inch in
multiprocessing applications.
•Program sequencer—Controls the program flow and contains an
instruction alignment buffer (IAB) and a branch target buffer
(BTB)
•Three 128-bit buses providing high bandwidth connectivity
between all blocks
•External port interface including the host interface, SDRAM controller, static pipelined interface, four DMA channels, four link
ports (each with two DMA channels), and multiprocessing support
•6M bits of internal memory organized as three blocks (M0, M1,
and M2), each containing 16K rows of 128 bits (2M bits per
block)
•Debug features
•JTAG Test Access Port
The TigerSHARC processor external port provides an interface to external
memory, to memory-mapped I/O, to a host processor, and to additional
TigerSHARC processors. The external port performs external bus arbitration and supplies control signals to shared, global memory and I/O
devices.
Figure 1-3 illustrates a typical single-processor system. A multiprocessor
system is illustrated in Figure 1-4 on page 1-6 and is discussed later in
“Scalability and Multiprocessing” on page 1-19.
The TigerSHARC processor includes several features that simplify system
development. The features lie in three key areas:
•Support of IEEE floating-point formats
•IEEE 1149.1 JTAG serial scan path and on-chip emulation
features
•Architectural features supporting high-level languages and operating systems
The features of the TigerSHARC processor architecture that directly support high-level language compilers and operating systems include:
•Simple, orthogonal instructions that allow the compiler to use the
multi-instruction slots efficiently
•General-purpose data and IALU register files
•32- and 40-bit floating-point and 8-, 16-, 32-, and 64-bit fixed-point native data types
As shown in Figure 1-1 on page 1-2 and Figure 1-2 on page 1-3, the DSP
architecture consists of two divisions: the DSP core (where instructions
execute) and the I/O peripherals (where data is stored and off-chip I/O is
processed). The following discussion provides a high-level description of
the DSP core and peripherals architecture. More detail on the core appears
in other sections of this reference. For more information on I/O peripherals, see the ADSP-TS101 TigerSHARC Processor Hardware Reference.
High performance is facilitated by the ability to execute up to four 32-bit
wide instructions per cycle. The TigerSHARC processor uses a variation
of a Static Superscalar™ architecture to allow the programmer to specify
which instructions are executed in parallel in each cycle. Instructions
do not have to be aligned in memory, so program memory is not
wasted.
The 6M bit internal memory is divided into three 128-bit wide memory
blocks. Each of the three internal address/data bus pairs connects to one of
the three memory blocks. The three memory blocks can be used for triple
accesses every cycle, where each memory block can access up to four 32-bit
words in a cycle.
The external port cluster bus is 64 bits wide. The high I/O bandwidth
complements the high processing speeds of the core. To facilitate the high
clock rate, the TigerSHARC processor uses a pipelined external bus with
programmable pipeline depth for interprocessor communications and for
Synchronous SRAM and DRAM (SSRAM and SDRAM).
The four link ports support point-to-point high bandwidth data transfers.
Link ports have hardware-supported two-way communication.
The processor operates with a two cycle arithmetic pipeline. The branch
pipeline is two to six cycles. A branch target buffer (BTB) is implemented
to reduce branch delay. The two identical computation units support
floating-point as well as fixed-point arithmetic.
During compute intensive operations, one or both integer ALUs compute
or generate addresses for fetching up to two quad operands from two
memory blocks, while the program sequencer simultaneously fetches the
next quad instruction from the third memory block. In parallel, the computation units can operate on previously fetched operands while the
sequencer prepares for a branch.
While the core processor is doing the above, the DMA channels can be
replenishing the internal memories in the background with quad data
from either the external port or the link ports.
The processing core of the TigerSHARC processor reaches exceptionally
high DSP performance through using these features:
•Computation pipeline
•Dual computation units
•Execution of up to four instructions per cycle
•Access of up to eight words per cycle from memory
The two computation units (compute blocks) perform up to 6 floating-point or 24 fixed-point operations per cycle.
Each multiplier and ALU unit can execute four 16-bit fixed-point operations per cycle, using Single-Instruction, Multiple-Data (SIMD)
operation. This operation boosts performance of critical imaging and signal processing applications that use fixed-point data.
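The SIMD idea described above can be illustrated with a minimal Python sketch that applies one addition to four 16-bit lanes packed into a single 64-bit value. The function names (`pack16x4`, `unpack16x4`, `add16x4`) are hypothetical, and the wraparound (non-saturating) lane behavior is an assumption made for illustration; the sketch does not reproduce the processor's instruction semantics.

```python
MASK16 = 0xFFFF  # one 16-bit lane

def pack16x4(lanes):
    """Pack four 16-bit values into one 64-bit integer, lane 0 in the low bits."""
    assert len(lanes) == 4
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & MASK16) << (16 * i)
    return word

def unpack16x4(word):
    """Split a 64-bit integer into its four 16-bit lanes (lane 0 first)."""
    return [(word >> (16 * i)) & MASK16 for i in range(4)]

def add16x4(a, b):
    """Lane-wise add: each lane wraps on overflow (assumption) and
    carries never cross from one lane into the next."""
    sums = [(x + y) & MASK16 for x, y in zip(unpack16x4(a), unpack16x4(b))]
    return pack16x4(sums)

a = pack16x4([1, 2, 3, 0xFFFF])
b = pack16x4([10, 20, 30, 1])
lanes = unpack16x4(add16x4(a, b))  # the last lane wraps to 0
```

A single call to `add16x4` stands in for one instruction that produces four independent 16-bit results.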
Compute Blocks
The TigerSHARC processor core contains two computation units called
compute blocks. Each compute block contains a register file and three independent computation units—an ALU, a multiplier, and a shifter. For
meeting a wide variety of processing needs, the computation units process
data in several fixed- and floating-point formats listed here and shown in
Figure 1-5:
•Fixed-point format
These include 64-bit long word, 32-bit normal word, 16-bit short
word, and 8-bit byte word. For short word fixed-point arithmetic,
quad parallel operations on quad-aligned data allow fast processing
of array data. Byte operations are also supported for octal-aligned
data.
•Floating-point format
These include 32-bit normal word and 40-bit extended word.
Floating-point operations are single or extended precision. The
normal word floating-point format is the standard IEEE format,
and the 40-bit extended-precision format occupies a double word
(64 bits) with eight additional LSBs of mantissa for greater
accuracy.
Each compute block has a general-purpose, multi-port, 32-word data register file for transferring data between the computation units and the data
buses and storing intermediate results. All of these registers can be
accessed as single-, dual-, or quad-aligned registers. For more information
on the register file, see “Compute Block Registers” on page 2-1.
Arithmetic Logic Unit (ALU)
The ALU performs arithmetic operations on fixed-point and floating-point data and logical operations on fixed-point data. The source and destination of most ALU operations is the compute block register file.
On the ADSP-TS101 processor, the ALU includes a special sub-block,
which is referred to as the communications logic unit (CLU). The CLU
instructions are designed to support different algorithms used for communications applications. The algorithms that are supported by the CLU
instructions are:
•Viterbi Decoding
•Turbo-code Decoding
•Despreading for code-division multiple access (CDMA) systems
The multiplier performs fixed-point or floating-point multiplication and
fixed-point multiply/accumulate operations. The multiplier supports several data types in fixed- and floating-point. The floating-point formats are
float and float-extended, as in the ALU. The source and destination of
most operations is the compute block register file.
The TigerSHARC processor’s multiplier supports complex multiply-accumulate operations. Complex numbers are represented by a pair of 16-bit
short words within a 32-bit word. The least significant bits (LSBs) of the
input operand represent the real part, and the most significant bits
(MSBs) of the input operand represent the imaginary part.
For more information on the multiplier, see “Multiplier” on page 4-1.
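The packed complex layout described above (real part in the 16 LSBs, imaginary part in the 16 MSBs of a 32-bit word) can be sketched in Python as follows. The helper names are hypothetical, and the product is kept as full-precision Python integers; rounding, saturation, and accumulator width are not modeled.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def pack_complex(re, im):
    """Real part in the 16 LSBs, imaginary part in the 16 MSBs."""
    return ((im & 0xFFFF) << 16) | (re & 0xFFFF)

def unpack_complex(word):
    """Return (re, im) as signed 16-bit values."""
    return to_signed16(word), to_signed16(word >> 16)

def complex_mul(a, b):
    """Multiply two packed complex words; returns (re, im) as plain integers."""
    ar, ai = unpack_complex(a)
    br, bi = unpack_complex(b)
    return ar * br - ai * bi, ar * bi + ai * br

a = pack_complex(3, -2)          # 3 - 2i
b = pack_complex(1, 4)           # 1 + 4i
product = complex_mul(a, b)      # (3 - 2i)(1 + 4i) = 11 + 10i
```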
Bit Wise Barrel Shifter (Shifter)
The shifter performs logical and arithmetic shifts, bit manipulation, field
deposit, and field extraction. The shifter operates on one 64-bit, one or
two 32-bit, two or four 16-bit, and four or eight 8-bit fixed-point operands. Shifter operations include:
•Shifts and rotates from off-scale left to off-scale right
•Bit manipulation operations, including bit set, clear, toggle and
test
•Bit field manipulation operations, including field extract and
deposit, using register BFOTMP (which is internal to the shifter)
•Bit FIFO operations to support bit streams with fields of varying
length
•Support for ADSP-2100 family compatible fixed-point/floating-point conversion operations (such as exponent extract, number of
leading 1s or 0s)
For more information on the shifter, see “Shifter” on page 5-1.
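The bit and field operations listed above can be illustrated with a generic Python sketch. The function names and the (position, length) parameterization are assumptions for illustration; they do not reflect the operand encoding of the shifter instructions, which is described in the Shifter chapter.

```python
def field_extract(value, pos, length):
    """Return `length` bits of `value` starting at bit `pos` (bit 0 is the LSB)."""
    return (value >> pos) & ((1 << length) - 1)

def field_deposit(target, field, pos, length, width=32):
    """Overwrite `length` bits of `target` at bit `pos` with the low bits of `field`."""
    mask = ((1 << length) - 1) << pos
    return ((target & ~mask) | ((field << pos) & mask)) & ((1 << width) - 1)

def bit_set(value, n):
    return value | (1 << n)

def bit_clear(value, n):
    return value & ~(1 << n)

def bit_toggle(value, n):
    return value ^ (1 << n)

def bit_test(value, n):
    return bool((value >> n) & 1)

byte1 = field_extract(0xABCD1234, 8, 8)            # second byte of the word
patched = field_deposit(0xFFFFFFFF, 0x5, 4, 4)     # replace bits 7..4 with 0101
```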
Integer Arithmetic Logic Unit (IALU)
The IALUs can execute standard standalone ALU operations on IALU
register files. The IALUs also provide memory addresses when data is
transferred between memory and registers. The DSP has dual IALUs (the
J-IALU and the K-IALU) that enable simultaneous addresses for multiple
operand reads or writes. The IALUs allow computational operations to
execute with maximum efficiency because the computation units can be
devoted exclusively to processing data.
Each IALU has a multiport, 32-word register file. Operations in the IALU
are not pipelined. The IALUs support pre-modify with no update and
post-modify with update address generation. Circular data buffers are
implemented in hardware. The IALUs support the following types of
instructions:
•Regular IALU instructions
•Move Data instructions
•Load Data instructions
•Load/Store instructions with register update
•Load/Store instructions with immediate update
For indirect addressing (instructions with update), one of the registers in
the register file can be modified by another register in the file or by an
immediate 8- or 32-bit value, either before (pre-modify) or after (post-modify) the access. For circular buffer addressing, a length value can be
associated with the first four registers to perform automatic modulo
addressing for circular data buffers; the circular buffers can be located at
arbitrary boundaries in memory. Circular buffers allow efficient implementation of delay lines and other data structures, which are commonly
used in digital filters and Fourier transformations. The TigerSHARC processor circular buffers automatically handle address pointer wraparounds,
reducing overhead and simplifying implementation.
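The wraparound behavior of a circular buffer pointer can be sketched in Python as a post-modify update with modulo arithmetic. The function name and the `base`/`length` parameters are hypothetical, and Python's `%` operator stands in for whatever mechanism the hardware uses; this is an illustration of the addressing pattern, not of the IALU implementation.

```python
def post_modify_circular(index, modifier, base, length):
    """Return (address_used, updated_index).

    The access uses the current index; the index is then advanced by
    `modifier` and wrapped so that it stays in [base, base + length).
    """
    address = index
    updated = base + (index - base + modifier) % length
    return address, updated

# Walk a 5-word buffer at an arbitrary base address with a stride of 2.
base, length, stride = 0x1000, 5, 2
index = base
offsets = []
for _ in range(6):
    address, index = post_modify_circular(index, stride, base, length)
    offsets.append(address - base)
```

Because the wrap is applied relative to `base`, the buffer does not need to start on any particular address boundary, and a negative `modifier` wraps in the other direction.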
The IALUs also support bit reverse addressing, which is useful for the FFT
algorithm. Bit reverse addressing is implemented using a reverse carry
addition that is similar to regular additions, but the carry is taken from the
upper bits and is driven into lower bits.
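Reverse carry addition can be modeled in Python by reversing the bits of both operands, adding them normally, and reversing the result, which is equivalent to propagating the carry from the upper bits toward the lower bits. The function names and the explicit `width` parameter are assumptions for illustration; the sketch shows the arithmetic only, not the IALU instruction syntax.

```python
def reverse_bits(value, width):
    """Reverse the low `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def reverse_carry_add(a, b, width):
    """Add a and b with the carry running from the MSB down to the LSB."""
    mask = (1 << width) - 1
    total = (reverse_bits(a, width) + reverse_bits(b, width)) & mask
    return reverse_bits(total, width)

def bit_reversed_sequence(n_bits):
    """Generate the FFT bit-reversed index order for 2**n_bits points by
    repeatedly adding N/2 with reverse carry."""
    step = 1 << (n_bits - 1)
    index = 0
    sequence = []
    for _ in range(1 << n_bits):
        sequence.append(index)
        index = reverse_carry_add(index, step, n_bits)
    return sequence

order = bit_reversed_sequence(3)
```

For an 8-point FFT, stepping the index by 4 (N/2) with reverse carry visits the samples in the order 0, 4, 2, 6, 1, 5, 3, 7.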
The IALU provides flexibility in moving data as single-, dual-, or quad-words. Every instruction can execute with a throughput of one per cycle.
IALU instructions execute with a single cycle of latency while computation units have two cycles of latency. Normally, there are no dependency
delays between IALU instructions, but if there are, three or four cycles of
latency can occur.
For more information on the IALUs, see “IALU” on page 6-1.
Program Sequencer
The program sequencer supplies instruction addresses to memory and,
together with the IALUs, allows computational operations to execute with
maximum efficiency. The sequencer supports efficient branching using
the branch target buffer (BTB), which reduces branch delays for conditional and unconditional instructions. The sequencer and IALU’s control
flow instructions divide into two types:
•Control flow instructions. These instructions are used to direct pro-
gram execution by means of jumps and to execute individual
instructions conditionally.
•Immediate extension instructions. These instructions are used to
extend the numeric fields used in immediate operands for the
sequencer and the IALU.
•Direct jumps and calls based on an immediate address operand
specified in the instruction encoding. For example: ‘if <cond> jump 100;’
always jumps to address 100, if the <cond> evaluates as true.
•Indirect jumps based on an address supplied by a register. The
instructions used for specifying conditional execution of a line are a
subcategory of indirect jumps. For example: ‘if <cond> cjmp;’ is a
jump to the address pointed to by the CJMP register.
The TigerSHARC processor achieves its fast execution rate by means of an
eight-cycle pipeline.
Two stages of the sequencer’s pipeline actually execute in the computation
units. The computation units perform single-cycle operations with a two-cycle computation pipeline, meaning that results are available for use two
cycles after the operation is begun. Hardware causes a stall if a result is not
available in a given cycle (register dependency check). Up to two computation instructions per compute block can be issued in each cycle,
instructing the ALU, multiplier or shifter to perform independent, simultaneous operations.
The TigerSHARC processor has four general-purpose external interrupts,
IRQ3-0. The processor also has internally generated interrupts for the two
timers, DMA channels, link ports, arithmetic exceptions, multiprocessor
vector interrupts, and user-defined software interrupts. Interrupts can be
nested through instruction commands. Interrupts have a short latency and
do not abort currently executing instructions. Interrupts vector directly to
a user-supplied address in the interrupt table register file, removing the
overhead of a second branch.
Note: The control flow instruction must use the first instruction slot in
the instruction line.
The branch penalty in a deeply pipelined processor such as the TigerSHARC processor can be compensated for by the use of a branch target
buffer (BTB) and branch prediction. The branch target address is stored
in the BTB. When the address of a jump instruction, which is predicted
by the user to be taken in most cases, is recognized (the tag address), the
corresponding jump address is read from the BTB and is used as the jump
address on the next cycle. Thus the latency of a jump is reduced from
three to six wasted cycles to zero wasted cycles. If this address is not stored
in the BTB, the instruction must be fetched from memory.
Other instructions also use the BTB to speed up these types of branches.
These instructions are interrupt return, call return, and computed jump
instructions.
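The tag lookup described above can be pictured with a small Python model. Everything here is a simplifying assumption: the class and function names are hypothetical, the buffer is an unbounded dictionary (a real BTB has a fixed number of entries and a replacement policy), and the miss penalty is supplied by the caller from the three-to-six-cycle range quoted in the text.

```python
class BranchTargetBuffer:
    """Toy BTB: maps the address of a jump (the tag) to its target address."""

    def __init__(self):
        self._entries = {}

    def lookup(self, branch_address):
        """Return the stored target, or None if the tag is not present."""
        return self._entries.get(branch_address)

    def record(self, branch_address, target_address):
        self._entries[branch_address] = target_address

def wasted_cycles(btb, branch_address, target_address, miss_penalty):
    """Cycles lost on a predicted-taken jump: zero on a BTB hit; otherwise
    the miss penalty is paid and the target is stored for next time."""
    if btb.lookup(branch_address) == target_address:
        return 0
    btb.record(branch_address, target_address)
    return miss_penalty

btb = BranchTargetBuffer()
first = wasted_cycles(btb, 0x100, 0x200, miss_penalty=3)   # miss: target not yet stored
second = wasted_cycles(btb, 0x100, 0x200, miss_penalty=3)  # hit: no wasted cycles
```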
Immediate extensions are associated with IALU or sequencer (control
flow) instructions. These instructions are not specified by the programmer, but are implied by the size of the immediate data used in the
instructions. The programmer must place the instruction that requires an
immediate extension in the first instruction slot and leave an empty
instruction slot in the line (use only three slots), so the assembler can place
the immediate extension in the second instruction slot of the instruction
line.
Note: Only one immediate extension may be in a single instruction line.
For more information on the sequencer, BTB, and immediate extensions,
see “Program Sequencer” on page 7-1.
Quad Instruction Execution
The TigerSHARC processor can execute up to four instructions per cycle
from a single memory block, due to the 128-bit wide access per cycle. The
ability to execute several instructions in a single cycle derives from a Static Superscalar architectural concept. This is not strictly a superscalar architecture because the instructions executed in each cycle are specified in the
instruction by the programmer or by the compiler, and not by the chip
hardware. There is also no instruction reordering. Register dependencies
are, however, examined by the hardware and stalls are generated where
appropriate. Code is fully compacted in memory and there are no alignment restrictions for instruction lines.
Relative Addresses for Relocation
Most instructions in the TigerSHARC processor support PC relative
branches to allow code to be relocated easily. Also, most data references
are register relative, which means they allow programs to access data blocks
relative to a base register.
Nested Call and Interrupt
Nested call and interrupt return addresses (along with other registers as
needed) are saved by specific instructions onto the on-chip memory stack,
allowing more generality when used with high-level languages. Non-nested calls and interrupts do not need to save the return address in internal memory, making these more efficient for short, non-nested routines.
Context Switching
The TigerSHARC processor provides the ability to save and restore up to
eight registers per cycle onto a stack in two internal memory blocks when
using load/store instructions. This fast save/restore capability permits efficient interrupts and fast context switching. It also allows the TigerSHARC
processor to dispense with on-chip PC stack or alternate registers for register files or status registers.
Internal Memory and Other Internal Peripherals
The on-chip memory consists of three blocks of 2M bits each. Each block
is 128 bits (four words) wide, thus providing high bandwidth sufficient to
support both computation units, the instruction stream and external I/O,
even in very intensive operations. The TigerSHARC processor provides
access to program and two data operands without memory or bus constraints. The memory blocks can store instructions and data
interchangeably.
Each memory block is organized as 64K words of 32 bits each. The
accesses are pipelined to meet one clock cycle access time needed by the
core, DMA, or by the external bus. Each access can be up to four words.
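The memory figures quoted in this reference are mutually consistent, as the following short Python check shows. The constant names are illustrative; the values (three blocks, 16K rows per block, 128-bit rows, 32-bit words) are taken from the text.

```python
BITS_PER_WORD = 32
BLOCK_WIDTH_BITS = 128
ROWS_PER_BLOCK = 16 * 1024
NUM_BLOCKS = 3
M = 1024 * 1024  # "M" as used in "6M bits"
K = 1024         # "K" as used in "64K words"

words_per_row = BLOCK_WIDTH_BITS // BITS_PER_WORD    # words per access
bits_per_block = ROWS_PER_BLOCK * BLOCK_WIDTH_BITS   # capacity of one block
words_per_block = ROWS_PER_BLOCK * words_per_row     # 32-bit words per block
total_bits = NUM_BLOCKS * bits_per_block             # whole on-chip memory

summary = (words_per_row, bits_per_block // M, words_per_block // K, total_bits // M)
```

The summary tuple reads as four words per access, 2M bits per block, 64K words per block, and 6M bits in total.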
Memories (and their associated buses) are a resource that must be shared
between the compute blocks, the IALUs, the sequencer, the external port,
and the link ports. In general, if during a particular cycle more than one
unit in the processor attempts to access the same memory, one of the competing units is granted access, while the other is held off for further
arbitration until the following cycle—see “Bus Arbitration Protocol” in
the ADSP-TS101 TigerSHARC Processor Hardware Reference. This type of
conflict only has a small impact on performance due to the very high
bandwidth afforded by the internal buses.
An important benefit of large on-chip memory is that by managing the
movement of data on and off chip with DMA, a system designer can realize high levels of determinism in execution time. Predictable and
deterministic execution time is a central requirement in DSP and real-time systems.
Internal Buses
The processor core has three buses, each one connected to one of the
internal memories. These buses are 128 bits wide to allow up to four
instructions, or four aligned data words, to be transferred in each cycle on
each bus. On-chip system elements also use these buses to access memory.
Only one access to each memory block is allowed in each cycle, so DMA
or external port transfers must compete with core accesses on the same
block. Because of the large bandwidth available from each memory block,
not all the memory bandwidth can be used by the core units, which leaves
some memory bandwidth available for use by the DSP’s DMA processes
or by the bus interface to serve bus master transfers from other DSPs to the
DSP’s memory.
Internal Transfer
Most registers of the TigerSHARC processor are classified as universal registers (Uregs). Instructions are provided for transferring data between any
two Uregs, between a Ureg and memory, or for the immediate load of a
Ureg. This includes control registers and status registers, as well as the
data registers in the register files. These transfers occur with the same timing as internal memory load/store.
Data Accesses
Each move instruction specifies the number of words accessed from each
memory block. Two memory blocks can be accessed on each cycle because
of the two IALUs. For a discussion of data and register widths and the
syntax that specifies these accesses, see “Register File Registers” on
page 2-5.
Quad Data Access
Instructions specify whether one, two or four words are to be loaded or
stored. Quad words (a memory quad word comprises four 32-bit words, or 128 bits of data) can be aligned on a quad-word boundary and long
words aligned on a long-word boundary. This, however, is not necessary
when loading data to computation units because a data alignment buffer
(DAB) automatically aligns quad words that are not aligned in memory.
Up to four data words from each memory block can be supplied to each
computation unit, meaning that new data is not required on every cycle
and leaving alternate cycles for I/O to the memories. This is beneficial in
applications with high I/O requirements since it allows the I/O to occur
without degrading core processor performance.
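The effect of the data alignment buffer can be pictured with a small Python model. This is an illustrative sketch only, not a description of the hardware: memory is represented as a hypothetical list of 32-bit words, and the helper names `aligned_quad` and `dab_quad_read` are invented for the example. It shows how four consecutive words starting at an unaligned address can be assembled from the two aligned quad words that contain them.

```python
QUAD = 4  # words per quad word (4 x 32 bits = 128 bits)

def aligned_quad(memory, address):
    """Return the aligned quad word that contains `address` (model only)."""
    base = address - (address % QUAD)
    return memory[base:base + QUAD]

def dab_quad_read(memory, address):
    """Assemble four consecutive words starting at any word address.

    Models the idea of a two-quad-word buffer: fetch the aligned quad that
    holds the first word and the following aligned quad, then select four
    words across the pair.
    """
    offset = address % QUAD
    first = aligned_quad(memory, address)
    if offset == 0:
        return first                      # already aligned, one fetch suffices
    second = aligned_quad(memory, address + QUAD)
    return (first + second)[offset:offset + QUAD]

memory = list(range(100, 116))            # 16 words with recognisable values
print(dab_quad_read(memory, 4))           # access aligned in memory
print(dab_quad_read(memory, 6))           # access not aligned in memory
```

In this model an aligned access needs one aligned fetch, while an unaligned access draws on two, which is why a buffer holding two quad words is enough to cover every offset.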
Booting
The internal memory of the TigerSHARC processor can be loaded from
an 8-bit EPROM using a boot mechanism at system powerup. The DSP
can also be booted using another master or through one of the link ports.
Selection of the boot source is controlled by external pins. For information on booting the DSP, see the ADSP-TS101 TigerSHARC Processor Hardware Reference.
Scalability and Multiprocessing
The TigerSHARC processor, like the related Analog Devices product the
SHARC DSP, is designed for multiprocessing applications. The primary
multiprocessing architecture supported is a cluster of up to eight TigerSHARC processors that share a common bus, a global memory, and an
interface to either a host processor or to other clusters. In large multiprocessing systems, this cluster can be considered an element and connected
in configurations such as toroid, mesh, tree, crossbar, or others. The user
can provide a personal interconnect method or use the on-chip communication ports.
The TigerSHARC processor improves on most of the multiprocessing
capabilities of the SHARC DSP and enhances the data transfer bandwidth. These capabilities include:
•On-chip bus arbitration for glueless multiprocessing
•Globally accessible internal memory and registers
The TigerSHARC processor supports the IEEE P1149.1 Joint
Test Action Group (JTAG) standard for system test. This standard defines
a method for serially scanning the I/O status of each component in a system. The JTAG serial port is also used by the TigerSHARC processor
EZ-ICE® to gain access to the processor’s on-chip emulation features.
Instruction Line Syntax and Structure
The TigerSHARC processor is a static superscalar DSP that executes
from one to four 32-bit instruction slots in an instruction line. With few
exceptions, an instruction line executes with a throughput of one cycle in
an eight-deep pipeline. Figure 1-6 shows the instruction slot and line
structure.
There are some important things to note about the instruction slot and
instruction line structure and how this structure relates to instruction
execution.
•Each instruction line consists of up to four 32-bit instruction slots.
•Instruction slots are delimited with one semicolon “;”.
•Instruction lines are terminated with two semicolons “;;”.
•The up to four instructions on an instruction line are executed in
parallel.
•Every instruction slot consists of a 32-bit opcode.
•Some instructions (such as immediate extensions) require two 32-bit opcodes (instruction slots) to execute.
•Some instructions (program sequencer, conditional, and immediate
extension) require specific instruction slots. If used, a conditional (if-do, if-else) or sequencer (jump or other) instruction must use slot 1, and an immediate extension must use slot 2.
Figure 1-6. Instruction Line and Slot Structure
An instruction is a 32-bit word that activates one or more of the TigerSHARC processor’s execution units to carry out an operation. The DSP executes or stalls the instructions in the same instruction line together.
Although the DSP fetches quad words from memory, instruction lines do
not have to be aligned to quad-word boundaries. Regardless of size (one to
four instructions), instruction lines follow one after the other in memory
with a new instruction line beginning one word from where the previous
instruction line ended. The end of an instruction line is identified by the
most significant bit (MSB) in the instruction word.
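The delimiter rules above can be illustrated with a short Python sketch. It is not part of any Analog Devices tool: `split_instruction_lines` is a hypothetical helper that only looks at the “;” and “;;” delimiters, ignores comments, and does not account for instructions that occupy two slots. The sample instructions are there to show the delimiters and are not checked against the parallelism rules.

```python
def split_instruction_lines(source):
    """Split assembly text into instruction lines, each a list of slots.

    ';;' terminates an instruction line; ';' delimits the slots inside it.
    """
    lines = []
    for chunk in source.split(";;"):
        slots = [slot.strip() for slot in chunk.split(";") if slot.strip()]
        if not slots:
            continue                      # whitespace after the last ';;'
        if len(slots) > 4:
            raise ValueError(f"more than four slots in one line: {slots}")
        lines.append(slots)
    return lines

program = """
XR0 = R1 + R2 ; YR3 = R4 * R5 ;;
R0 = [J0 += 1] ; R1 = [K0 += 1] ; XR6 = R2 + R3 ; YR7 = R4 * R5 ;;
XR3 = MIN (R2, R1) ;;
"""

for slots in split_instruction_lines(program):
    print(len(slots), slots)
```

Splitting on the double semicolon first and the single semicolon second mirrors the hierarchy in the text: lines contain slots, never the other way around.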
Instruction Notation Conventions
The TigerSHARC processor assembly language is based on an algebraic
syntax for ease of coding and readability. The syntax for TigerSHARC
processor instructions selects the operation that the DSP executes and the
mode in which the DSP executes the operation. Operations include computations, data movements, and program flow controls. Modes include
Single-Instruction, Single-Data (SISD) versus Single-Instruction, Multiple-Data (SIMD) selection, data format selection, word size selection,
enabling saturation, and enabling truncation. All controls on instruction
execution are included in the DSP’s instruction syntax—there are no
mode bits to set in control registers for this DSP.
This book presents instructions in summary format. This format presents
all the selectable items and optional items available for an instruction. The
conventions for these are:
this|that|other Lists of items delimited with a vertical bar “|” indicate that syntax permits selection of one of the
items. One item from the list must be selected. The
vertical bar is not part of instruction syntax.
{option} An item or a list of items enclosed within curly
braces “{}” indicates an optional item. The item may
be included or omitted. The curly braces are not
part of instruction syntax.
( ) Parentheses and other punctuation must appear
where shown in summary syntax with one exception.
Empty parentheses (no options selected) may not
appear in an instruction.
Rm Rmd Rmq Register names are replaceable items in the summary
syntax and appear in italics. Register names
indicate that the syntax requires a single (Rm), double (Rmd), or quad (Rmq) register. For more
information on register name syntax, compute
block selection, and data format selection, see “Register
File Registers” on page 2-5.
<imm#> Immediate data (literal values) in the summary syntax
appears as <imm#> with # indicating the bit
width of the value.
For example, the following instruction in summary format:
{X|Y|XY}{S|B}Rs = MIN|MAX (Rm, Rn) {({U}{Z})} ;
could be coded as any of the following instructions:
XR3 = MIN (R2, R1) ;
YBR2 = MAX (R1, R0) (UZ);
XYSR2 = MAX (R3, R4) (U);
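One way to read the summary format is to translate it by hand into a pattern and test candidate instructions against it. The Python sketch below does this for the MIN|MAX summary line only. The regular expression and the helper `matches_min_max` are assumptions made for illustration, not the assembler's actual grammar, and the check that register numbers lie in 0 to 31 is likewise an assumption based on the 32-register file.

```python
import re

MIN_MAX = re.compile(
    r"^(XY|X|Y)?"                         # {X|Y|XY}: optional compute block
    r"(S|B)?"                             # {S|B}: optional operand size
    r"R(\d+)\s*=\s*"                      # Rs: result register
    r"(MIN|MAX)\s*"                       # MIN|MAX: one of the two is required
    r"\(\s*R(\d+)\s*,\s*R(\d+)\s*\)\s*"   # (Rm, Rn): literal punctuation
    r"(\((U|Z|UZ)\))?\s*"                 # {({U}{Z})}: optional, never "()"
    r";$"
)

def matches_min_max(instruction):
    """Return True if `instruction` fits the MIN|MAX summary syntax."""
    match = MIN_MAX.match(instruction.strip())
    if not match:
        return False
    numbers = (match.group(3), match.group(5), match.group(6))
    return all(int(number) < 32 for number in numbers)

for text in ("XR3 = MIN (R2, R1) ;",
             "YBR2 = MAX (R1, R0) (UZ);",
             "XYSR2 = MAX (R3, R4) (U);",
             "XR3 = MIN (R2, R1) () ;"):
    print(text, matches_min_max(text))
```

The last candidate is rejected because the option group, when present, must contain at least one option, which matches the rule that empty parentheses may not appear in an instruction.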
Unconditional Execution Support
The DSP supports unconditional execution of up to four instructions in
parallel. This support lets programmers use simultaneous computations
with data transfers and branching or looping. These operations can be
combined with few restrictions. The following example code shows three
instruction lines containing 2, 4, and 1 instruction slots each, respectively:
It is important to note that the above instructions execute unconditionally. Their execution does not depend on computation-based conditions.
For a description of condition dependent (conditional) execution, see
“Conditional Execution Support” on page 1-24.
Conditional Execution Support
All instructions can be executed conditionally (a mechanism also known as
predicated execution). The condition field exists in one instruction slot in
an instruction line, and all the remaining instructions in that line either
execute or not, depending on the outcome of the condition.
In a conditional computational instruction, the execution of the entire
instruction line can depend on the specified condition at the beginning of
the instruction line. Conditional instructions take one of the following
forms:
This syntax permits up to three instructions to be controlled by a condition. For more information, see “Conditional Execution” on page 7-12.
Instruction Parallelism Rules
The TigerSHARC processor executes from one to four 32-bit instructions
per line. The compiler or programmer determines which instructions may
execute in parallel in the same line prior to runtime (leading to the name
Static Superscalar). The DSP architecture places several constraints on the
application of different instructions and various instruction combinations.
Note that all the restrictions refer to combinations of instructions within
the same line. There is no restriction of combinations between lines.
There are, however, cases in which certain combinations between lines
may cause stall cycles (see “Conditional Branch Effects on Pipeline” on
page 7-44), mostly because of data conflicts (an operand of an instruction in
line n+1 is the result of an instruction in line n, which is not ready when
fetched).
Table 1-1 on page 1-29 and Table 1-2 on page 1-34 identify instruction
parallelism rules for the TigerSHARC processor. The following sections
provide more details on each type of constraint and accompany the details
with examples:
•“General Restriction” on page 1-36
•“IALU Instruction Restrictions” on page 1-39
•“Compute Block Instruction Restrictions” on page 1-37
•“Sequencer Instruction Restrictions” on page 1-45
The instruction parallelism rules in Table 1-1 and Table 1-2 present the
resource usage constraints for instructions that occupy instruction slots in
the same instruction line. The horizontal axis lists resources—portions of
the DSP architecture that are active during an instruction—and lists the
number of resources that are available. The vertical axis lists instruction types—descriptive names for classes of instructions. For resources, a ‘1’
indicates that a particular instruction uses one unit of the resource, and a
‘2’ indicates that the instruction uses two units of the resource. Typical
instructions of most classes are listed with the descriptive name for the
instruction type.
It is important to note that Table 1-1 and Table 1-2 identify static restrictions for the TigerSHARC processor. Static restrictions are distinguished
from dynamic restrictions, in that static restrictions can be resolved by the
assembler. For example, the assembler flags the instruction
XR3:0 = Q[J0 += 3];; because the modifier is not a multiple of 4—this is
a static violation.
Dynamic restrictions cannot be resolved by the assembler because these
restrictions represent runtime conditions, such as stray pointers. When the
processor encounters a dynamic (runtime) violation, an exception is issued
when the violation runs through the core. Whatever the case, the processor does not arrive at a deadlock situation, although unpredictable results
may be written into registers or memory.
As a dynamic restriction example, examine the instruction
xr3:0 = Q[J0 += 4];;. Although this instruction looks correct to the
assembler, it may violate hardware restrictions if J0 is not quad aligned.
Because the assembler cannot predict what the code will do to J0 up to the
point of this instruction, this violation is dynamic, since it occurs at
runtime.
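The difference between the two kinds of restriction can be sketched in Python. The functions below are hypothetical and model only the quad-load examples from the text: `static_check` sees nothing but the source text, as an assembler would, while `dynamic_check` also needs the pointer value, which exists only at runtime. The register values passed in are invented for the example.

```python
import re

# Matches a post-modify quad access such as Q[J0 += 4]
QUAD_ACCESS = re.compile(r"Q\[\s*([JK]\d+)\s*\+=\s*(-?\d+)\s*\]", re.IGNORECASE)

def static_check(instruction):
    """Assembler-style check: the immediate modifier is visible in the source."""
    match = QUAD_ACCESS.search(instruction)
    if match and int(match.group(2)) % 4 != 0:
        return f"static violation: modifier {match.group(2)} is not a multiple of 4"
    return None

def dynamic_check(instruction, registers):
    """Runtime-style check: depends on the value held in the index register."""
    match = QUAD_ACCESS.search(instruction)
    if match:
        name = match.group(1).upper()
        if registers[name] % 4 != 0:
            return f"dynamic violation: {name} = {registers[name]:#x} is not quad aligned"
    return None

print(static_check("XR3:0 = Q[J0 += 3];;"))
print(static_check("xr3:0 = Q[J0 += 4];;"))
print(dynamic_check("xr3:0 = Q[J0 += 4];;", {"J0": 0x1002}))
print(dynamic_check("xr3:0 = Q[J0 += 4];;", {"J0": 0x1000}))
```

The second instruction passes the static check in every case, yet it is acceptable at runtime only for some values of J0, which is exactly why the assembler cannot flag it.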
Further, Table 1-1 and Table 1-2 cover restrictions that arise from the
interaction of instructions that share a line, but mostly omit restrictions
of single instructions. An example of the former occurs when two instructions attempt to use the same unit in the same line. An example of an
individual instruction restriction is an attempt to use a register that is not
valid for the instruction. For example, the instruction XR0 = CB[J5+=1];;
is illegal because circular buffer accesses can only use IALU registers J0
through J3.
For most instruction types, you can locate the instruction in Table 1-1 or
Table 1-2 and read across to find out the resources it uses. Resource usage
for data movement instructions is more complicated to analyze. Resource
usage for these instructions is calculated by adding together base, source, and destination resources.
Base resources are determined by the type of move instruction.
Move instructions are Ureg transfer (register to register), immediate load
(immediate values to register), memory load (memory to register), and
memory store (register to memory). Source resources are determined by
the source register and are only applicable when the source itself is a register (Ureg transfer and stores). Destination resources may be of two types:
•Address pointer in post-modify (for example,
XR0 = [J0 += 2];;)
•Destination register—only applicable when the destination is a register (Ureg transfer, memory loads and immediate loads)
If a particular combination of base, source, and destination uses more
resources than are available, that combination is illegal. Consider, for
example, the following instruction:
XR3:0 = Q[K31+0x40000];;
This is a memory load instruction, or specifically, a K-IALU load using a
32-bit offset. Reading across the table, the base resources used by the
instruction are two slots in the line—the K-IALU instruction and the second instruction slot (for the immediate extension). The destination is
XR3:0, which are X-compute block registers. The ‘X-Register File,
Dreg = XR31–0’ line under ‘Ureg transfer and Store (Source Register)
Resources’ in the table indicates that the instruction also uses an X-compute block port and an X-compute block input port.
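The bookkeeping described here, adding base, source, and destination resources and comparing the total with what is available, can be sketched in Python. The resource names and the availability counts below are assumptions chosen for illustration and are not transcribed from Table 1-1; only the resources that the text attributes to the example load are taken from the document.

```python
from collections import Counter

# Illustrative per-line availability. Names and counts are assumptions made
# for this sketch; Table 1-1 holds the actual figures.
AVAILABLE = Counter({
    "inst_slots": 4,
    "second_slot": 1,
    "j_ialu": 1,
    "k_ialu": 1,
    "x_ports": 2,
    "x_input_ports": 2,
})

def total_usage(*parts):
    """Add base, source, and destination resource counts together."""
    total = Counter()
    for part in parts:
        total.update(part)
    return total

def exceeded(usage, available=AVAILABLE):
    """Return {resource: (used, available)} for each over-subscribed resource."""
    return {name: (used, available[name])
            for name, used in usage.items() if used > available[name]}

# XR3:0 = Q[K31+0x40000];;  (K-IALU load with a 32-bit offset)
base = {"inst_slots": 2, "second_slot": 1, "k_ialu": 1}
destination = {"x_ports": 1, "x_input_ports": 1}
one_load = total_usage(base, destination)
print(exceeded(one_load))

# Two such loads in one line would both need the K-IALU and the second slot.
two_loads = total_usage(one_load, one_load)
print(exceeded(two_loads))
```

A combination is legal in this model only when `exceeded` returns an empty dictionary, which corresponds to reading across the table and finding no column over its limit.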
The following Ureg transfer instruction provides another example:
XYR0=CJMP;;
This example uses the following resources:
•One instruction slot
•Base resources—an IALU instruction (no matter whether J-IALU
or K-IALU) and the Ureg transfer resource (base resources) for the
IALU instruction
Table 1-1. Parallelism Rules for Register File, DAB, J/K-IALU, and Port
Access Instructions (Cont’d)
Resources:
Inst. slots used
First inst. slot¹
Second inst. slot²
IALU inst.
Imm. load or Ureg xfer
J-IALU
K-IALU
J-IALU-port I/O
K-IALU-port I/O
X-ports I/O³
X-ports input
X-ports output
X-DAB
Y-ports I/O³
Y-ports input
Y-ports output
Y-DAB
Seq.-port I/O
Ext. Port I/O
IOP-port I/O
Link Port I/O
Resources Available:
4112111112 2112 21111 3 3
Instruction Types:
Memory Load Ureg (Destination Register) Resources
X-Register File DAB/SDAB
XDreg = DAB q[addr]
XDreg = XR31–0
Y-Register File DAB/SDAB
YDreg = DAB q[addr]
YDreg = YR31–0
XY-Register Files DAB/SDAB
XYDreg = DAB q[addr]
XYDreg = XYR31–0
11 1
11 1
11 111 1
3
3
¹ If a conditional instruction is present on the instruction line, it must use the first instruction slot.
² If an immediate extension is present on the instruction line, it must use the second instruction slot.
³ These resources are listed for informational purposes only. These constraints cannot be exceeded
within the core.
⁴ Complete list is all registers in register groups 0x1A, 0x38, and 0x39: CJMP, RETI, RETIB, RETS,
Table 1-2. Parallelism Rules for Compute Block and Sequencer Instructions
Resources:
Inst. slots used
First inst. slot¹
Second inst. slot²
X-Comp Block Inst.
X-ALU
X-Multiplier
X-Shifter
Y-Comp Block Inst.
Y-ALU
Y-Multiplier
Y-Shifter
Resources Available:
41121112111
Instruction Types:
Y Compute Block Operations
Y-ALU instruction, except quad output (YDreg = Dreg + Dreg) 111
Y-Multiplier instruction, except quad output (YDreg = Dreg * Dreg) 111
Y-Shifter instruction, except MASK, FDEP, STAT 111
Y-ALU instruction with quad output (add_sub, EXPAND, MERGE) 1111
Y-Multiplier instruction with quad output 111 1
Y-Shifter instructions MASK, FDEP, YSTAT 12
X and Y Compute Block Operations (SIMD)
XY-ALU instruction, except quad output (XYDreg = Dreg + Dreg) 11111
XY-Multiplier instruction, except quad output (XYDreg = Dreg * Dreg) 11111
XY-Shifter instruction, except MASK, FDEP, STAT 11111
XY-ALU instruction with quad output (add_sub, EXPAND, MERGE) 1111111
XY-Multiplier instruction with quad output 111 1 11 1
XY-Shifter instructions MASK, FDEP, X/YSTAT 122
¹ If a conditional instruction is present on the instruction line, it must use the first instruction slot.
² If an immediate extension is present on the instruction line, it must use the second instruction slot.
General Restriction
There is a general restriction that applies to all types of instructions: Two
instructions may not write to the same register. This restriction is checked
statically by the assembler. For example:
XR0 = R1 + R2 ; XR0 = R5 * R6 ;;
/* Invalid; these instructions cannot be on the same instruction
line */
XR1 = R2 + R3 , XR1 = R2 - R3 ;;
/* Invalid; add-subtract to the same register */
Consequently, a load instruction may not be targeted to a register that is
updated in the same line by another instruction. For example:
XR0 = [J17 + 1] ; R0 = R3 * R8 ;; /* Invalid */
A load/store instruction that uses post-modify and update addressing
cannot load the same register that is used as the index Jm/Km (pointer to
memory). For example:
J0 = [J0 += 1] ;;
/* Invalid; J0 cannot be used as both destination (Js) and index
(Jm) in a post-modify (+=) load or store */
No instruction can write to the CJMP register in the same line as a CALL
instruction (which also updates the CJMP register).
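The rule that two instructions may not write the same register lends itself to a simple static check. The Python sketch below is a simplified model, not the assembler's implementation: the helper names are hypothetical, only single registers are recognised (ranges such as XR3:0 are ignored), and the CJMP/CALL case is not modelled. It treats an unprefixed or XY-prefixed data register as a write to both compute blocks, and a post-modify access as a write to its index register.

```python
import re

def expand(register):
    """Unprefixed and XY-prefixed data registers exist in both compute blocks."""
    if register.startswith("XYR"):
        return {"XR" + register[3:], "YR" + register[3:]}
    if register.startswith("R"):
        return {"XR" + register[1:], "YR" + register[1:]}
    return {register}

def written_registers(slot):
    """Registers written by one instruction slot (duplicates are kept)."""
    written = []
    for part in slot.split(","):                  # halves of a dual add-subtract
        dest = re.match(r"\s*([A-Z]+\d+)\s*=", part)
        if dest:
            written.extend(sorted(expand(dest.group(1))))
    for index in re.findall(r"\[\s*([JK]\d+)\s*\+=", slot):
        written.append(index)                     # post-modify updates the pointer
    return written

def duplicate_writes(line):
    """Return the set of registers written more than once in an instruction line."""
    seen, clashes = set(), set()
    for slot in line.rstrip("; ").split(";"):
        for register in written_registers(slot):
            (clashes if register in seen else seen).add(register)
    return clashes

for line in ("XR0 = R1 + R2 ; XR0 = R5 * R6 ;;",
             "XR1 = R2 + R3 , XR1 = R2 - R3 ;;",
             "XR0 = [J17 + 1] ; R0 = R3 * R8 ;;",
             "J0 = [J0 += 1] ;;",
             "XR0 = R1 + R2 ; YR0 = R5 * R6 ;;"):
    print(line, sorted(duplicate_writes(line)))
```

Only the last line produces an empty result, because XR0 and YR0 are different registers even though they share a number.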
Compute Block Instruction Restrictions
There are two compute blocks, and instructions can be issued to either or
both.
•Instructions in the format XRs = Rm op Rn are issued to the X-compute block
•Instructions in the format YRs = Rm op Rn are issued to the Y-compute block
•Instructions in the format Rs = Rm op Rn or XYRs = Rm op Rn are
issued to both the X- and Y-compute blocks
The following conditions apply when issuing instructions to the compute
blocks. Note that the assembler statically checks all of these restrictions.
•Up to two instructions can be issued to each compute block (making that a maximum of four compute block instructions in one
line). Note, however, that for this rule, the instructions of type
Rs = Rm op Rn count as one instruction for each compute block.
For example:
R0 = R1 + R2 ; R3 = R4 * R5 ;;
/* Valid; a total of four instructions */
XR0 = R1 + R2 ; XR3 = R4 * R5 ; XR6 = LSHIFT R1 BY R7 ;;
/* Invalid; three instructions to compute block X */
•Only one instruction can be issued to each unit (ALU, multiplier,
or shifter) in a cycle. Each of the two instructions must be issued to
a different unit (ALU, multiplier or shifter). For example:
•When one of the shifter instructions listed below is executed, it
must be the only instruction in that line for the particular compute
block. The instructions are:
access to XSTAT/YSTAT registers. For example:
XR0 += MASK R1 BY R2 ; XR6 = R1 + R2 ;;
/* Invalid; three operand shifter instruction in same line
with an ALU operation; both issued to compute block X */
•Only one unit (ALU or multiplier) can use two result buses. A unit
uses two result buses either when the result is quad word or when
there are two results (dual ADD and SUB instructions—R0 = R1+R2,
R5 = R1-R2;). Another instruction is allowed in the same line, as
long as it is not a shifter instruction. For example:
•Communications Logic Unit (CLU) register load instructions have
the same restrictions as shifter instructions, with one exception—a
CLU register load instruction can be executed in the same instruction line with another compute instruction that has a quad result.
•All CLU instructions, except for load of CLU registers, refer to the
same rules as compute ALU instructions.
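The two counting rules in this list (at most two instructions per compute block, and at most one per unit) can be modelled in Python. This is a sketch under stated assumptions: each instruction is described by hand as a pair of compute block prefix and unit name, the helper `check_compute_line` is hypothetical, and the shifter, result bus, and CLU rules are not modelled.

```python
from collections import Counter

def check_compute_line(instructions):
    """Check (blocks, unit) pairs such as ("X", "ALU") or ("XY", "MUL").

    Returns a list of messages, one per violated counting rule.
    """
    per_block = Counter()
    per_unit = Counter()
    for blocks, unit in instructions:
        for block in blocks:              # "XY" counts once for X and once for Y
            per_block[block] += 1
            per_unit[(block, unit)] += 1
    problems = []
    for block, count in sorted(per_block.items()):
        if count > 2:
            problems.append(f"{count} instructions issued to compute block {block}")
    for (block, unit), count in sorted(per_unit.items()):
        if count > 1:
            problems.append(f"{count} instructions issued to {block} {unit}")
    return problems

# R0 = R1 + R2 ; R3 = R4 * R5 ;;
print(check_compute_line([("XY", "ALU"), ("XY", "MUL")]))
# XR0 = R1 + R2 ; XR3 = R4 * R5 ; XR6 = LSHIFT R1 BY R7 ;;
print(check_compute_line([("X", "ALU"), ("X", "MUL"), ("X", "SHIFT")]))
# Two ALU instructions issued to compute block X
print(check_compute_line([("X", "ALU"), ("X", "ALU")]))
```

Counting an unprefixed instruction once per block reproduces the statement that `Rs = Rm op Rn` counts as one instruction for each compute block.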
IALU Instruction Restrictions
There are four types of IALU instructions:
•Memory load/store—for example: R0 = [J0 + 1] ;
•IALU operations—for example: J0 = J1 + J2 ;
•Load data—for example: R1 = 0xABCD ;
•Ureg transfer—for example: XR0 = YR0 ;
These restrictions apply when issuing instructions to the IALU. Except for
the load data restriction, the assembler flags all of these restrictions.
•Up to one J-IALU and up to one K-IALU instruction can be issued
in the same instruction line. For example:
R0 = [J0 += 1] ; R1 = [K0 += 1] ;;
/* It’s recommended that J0 and K0 point to different memory blocks to avoid stall */
[J0 += 1] = XR0 ; [K0 += 1] = YR0;;
J0 = [J5 + 1] ; XR0 = [K6 + 1] ;;
R1 = 0xABCD ; R0 = [J0 += 1] ;;
/* One load data instruction (in K-IALU) and one J-IALU
operation */
•There can be up to two load instructions to the same compute
block register file or up to one load to and one store from the same
compute block register file. (A compute block register file has one
input port and one input/output port.) If two store instructions are
issued, none of them will be executed. For example:
[J0 + 1] = XR0 ; [K0 + 1] = XR1 ;;
/* Invalid; attempts to use two output ports */
R0 = [J0 + 1] ; R1 = [K1 + 1] ;;
/* Valid; uses two input ports in compute block X and Y */
R0 = [J0 + 1] ; [K1 + 1] = XR1 ;; /* Valid */
•A Ureg transfer within the same compute register file cannot be
used with any other store to that register file. For example:
•Only one DAB load per Compute Block is allowed. For example:
XR3:0 = DAB Q[J0 += 4] ; XR7:4 = DAB Q[K0 += 4] ;;
/* Invalid */
XR3:0 = DAB Q[J0 += 4] ; YR7:4 = DAB Q[K0 += 4] ;; /* Valid */
•Only one memory load/store to and from the same single port register files is allowed. The single port register files are:
•J-IALU registers: groups 0xC and 0xE
•K-IALU registers: groups 0xD and 0xF
•Bus Control registers: groups 0x24 and 0x3A
•Sequencer, Interrupt and BTB registers: groups 0x1A, 0x30–
0x39, and 0x3B
•Debug logic registers: groups 0x1B, 0x3D–0x3F
For example:
J0 = [J5 + 1] ; K0 = [K6 + 1] ;; /* Valid */
J0 = [J5 + 1] ; [K6 + 1] = K0 ;; /* Valid */
J0 = [J5 + 1] ; [K6 + 1] = J1 ;;
/* Invalid; one load to J-IALU register file and one store
from J-IALU register file */
•Access to memory must be aligned to its size. For example, quad
word access must be quad-word aligned. The long access must be
aligned to an even address. This excludes load to compute block via
DAB. In addition, the immediate address modifier must be a multiple of four in quad accesses and of two in long accesses. For
example:
XR3:0 = Q[J0 += 3] ;; /* Invalid */
XR3:0 = Q[J0 += 4] ;; /* Valid */
•A Ureg store instruction and an instruction that updates the same
Ureg may not be issued in the same instruction line, because the
store instruction may be stalled and by the time it progresses, the
contents may have been modified by the update instruction. For
example:
•On load or store instructions the memory address may not be a register. That is, the address may not be a memory-mapped
register address in the range of 0x180000 to 0x1FFFFF. For example:
Q[J2 + 0] = XR3:0 ;;
/* Invalid if J2 is in the range of 0x180000 to 0x1FFFFF */
•If one IALU is used to access the other IALU register, there may
not be an immediate load instruction in the same line. For
example:
Q[J2 + 0] = K3:0 ; XR0 = 100 ;; /* Invalid */
Q[K2 + 0] = K3:0 ; XR0 = 100 ;; /* Valid */
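The register file port rule in the list above (up to two loads, or one load and one store, per compute block register file) can be expressed as a small Python predicate. The function `port_usage_ok` and its input format are assumptions made for illustration. Each access is written by hand as a (kind, block) pair, and a load to an unprefixed register is entered once for X and once for Y.

```python
def port_usage_ok(accesses):
    """Check (kind, block) pairs, kind being "load" or "store", block "X" or "Y".

    Models one input port plus one input/output port per register file:
    at most two accesses in total, and at most one of them a store.
    """
    for block in "XY":
        loads = sum(1 for kind, b in accesses if b == block and kind == "load")
        stores = sum(1 for kind, b in accesses if b == block and kind == "store")
        if stores > 1 or loads + stores > 2:
            return False
    return True

# [J0 + 1] = XR0 ; [K0 + 1] = XR1 ;;
print(port_usage_ok([("store", "X"), ("store", "X")]))
# R0 = [J0 + 1] ; R1 = [K1 + 1] ;;
print(port_usage_ok([("load", "X"), ("load", "Y"), ("load", "X"), ("load", "Y")]))
# R0 = [J0 + 1] ; [K1 + 1] = XR1 ;;
print(port_usage_ok([("load", "X"), ("load", "Y"), ("store", "X")]))
```

The first case fails because both stores need the single port that can act as an output, which is the situation the text describes as attempting to use two output ports.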
Sequencer Instruction Restrictions
There can be one sequencer instruction and one immediate extension per
line, where the sequencer instruction can be jump, indirect jump, and
other instructions. The assembler statically checks all of these restrictions:
•The sequencer instruction must be the first instruction in the four-slot instruction line.
•The immediate extension must be the second instruction in the
four-slot instruction line.
•The immediate extension is counted as one of the four instructions
in the line.
•There cannot be two instructions that end in the same quad-word
boundary, and where both have branch instructions with a predicted bit set. For example:
IF MLE, JUMP + 100 ;; /* begin address 100 */
IF NALE JUMP -50 ;
XR0 = R5 + R6 ; J0 = J2 + J3 ; YR4 = [K3 + 40] ;;
/* Valid; first instruction line ends on 1001; second
instruction line ends on 1005 */
IF MLE, JUMP + 100 ;; /* begin address 100 */
IF NALE JUMP - 50 ;;
/* Invalid; both lines within the same quad word */
•For instruction SCFx += op Cond, there can be no operation
between compute block static flags (XSF0/1, YSF0/1, and XYSF0/1)
and non-compute block conditions.
The TigerSHARC processor core contains two compute blocks. Each
compute block contains a register file and three independent computation
units—an ALU, a multiplier, and a shifter. Because the execution of all
computational instructions in the TigerSHARC DSP depends on the
input and output data formats and depends on whether the instruction is
executed on one computational block or both, it is important to understand how to use the TigerSHARC DSP’s compute block registers. This
chapter describes the registers in the compute blocks, shows how the register name syntax controls data format and execution location, and defines
the available data formats.
The DSP has two compute blocks—compute block X and compute
block Y. Each block contains a register file and three independent computation units. The units are the ALU, multiplier, and shifter.
A general-purpose, multiport, 32-word data register file in each compute
block serves for transferring data between the computation units and the
data buses and stores intermediate results. Figure 2-1 shows how each of
the register files provide the interface between the internal buses and the
computational units within the compute blocks.
As shown in Figure 2-1, data input to the register file passes through the
data alignment buffer (DAB). The DAB is a two quad-word FIFO that
provides aligned data for registers when dual- or quad-register loads
receive misaligned data from memory. For more information on using the
DAB, see “IALU” on page 6-1.
Figure 2-1. Data Register Files in Compute Block X and Y
Within the compute block, there are two types of registers—memory-mapped registers and non-memory-mapped registers. The memory-mapped
registers in each of the compute blocks are the general-purpose
data register file registers XR31–0 and YR31–0. Because these registers are
memory mapped, they are accessible to external bus devices.
For operations within a single DSP, the distinction between memory-mapped and non-memory-mapped compute block registers is
important because the memory-mapped registers are Universal registers
(Ureg). The Ureg group of registers is available for many types of operations working with portions of the DSP’s core other than the portion of
the core where the Ureg resides. The compute block Ureg registers can be
used for additional operations unavailable to other Ureg registers. To distinguish
the compute block register file registers from other Ureg registers,
the XR31–0 and YR31–0 registers are also referred to as Data registers (Dreg).
For operations in a multiprocessing DSP system, it is very useful that 90%
of the registers in the TigerSHARC processor are memory-mapped registers. The memory-mapped registers have absolute addresses associated
with them, meaning that they can be accessed by other processors through
multiprocessor space or accessed by any other bus masters in the system.
A DSP can access its own registers by using the multiprocessor
memory space, but the DSP would have to tie up the external bus
to access its own registers this way.
The compute blocks have a few registers that are non-memory mapped.
These registers do not have absolute addresses associated with them. The
non-memory-mapped registers are special registers that are dedicated for
special instructions in each compute block. The unmapped registers in the
compute blocks include:
•Compute block status (XSTAT and YSTAT) registers
•Parallel Result (XPR1–0 and YPR1–0) registers—ALU
•Multiplier Result (XMR3–0 and YMR3–0) registers—Multiplier
•Multiplier Result Overflow (XMR4 and YMR4) registers—Multiplier
Figure 2-2. XSTAT/YSTAT (Upper) Register Bit Descriptions
The non-memory-mapped registers serve special purposes in each compute block. The
X/YSTAT registers (shown in Figure 2-2 and Figure 2-3)
hold the status flags for each compute block. These flags are set or reset to
indicate the status of an instruction’s execution in a compute block’s ALU,
multiplier, and shifter. The X/YPR1–0 registers hold parallel results from
the ALU’s SUM, ABS, VMAX, and VMIN instructions. The X/YMR3–0 registers
optionally hold results from fixed-point multiply operations, and the
X/YMR4 register holds overflow from those operations. The X/YBFOTMP reg-
Figure 2-3. XSTAT/YSTAT (Lower) Register Bit Descriptions
Register File Registers
The compute block X and Y register files contain thirty-two 32-bit registers, which serve as a compute block’s interface between DSP internal bus
and the computational units. The register file registers—XR31–0 and
YR31–0—are both universal registers (Ureg) and data registers (Dreg).
All inputs for computations come from the register file and all results are
sent to the register file, except for fixed-point multiplies which can
optionally be sent to the MR3–0 registers.
It is important to note that a register may be used once in an
instruction slot, but the assembly syntax permits using registers multiple times within an instruction line (which contains up to four
instruction slots). The register file registers are hardware interlocked, meaning that there is dependency checking during each
computation to make sure the correct values are being used. When
a computation accesses a register, the DSP performs a register
check to make sure there are no other dependencies on that register. For more information on instruction lines and dependencies,
see “Instruction Line Syntax and Structure” on page 1-20 and
“Instruction Parallelism Rules” on page 1-24.
There are many ways to name registers in the TigerSHARC DSP’s assembly syntax. The register name syntax provides selection of many features of
computational instructions. Using the register name syntax in an instruction, you can specify:
Figure 2-4 shows the parts of the register name syntax and the features
that the syntax selects.
___R_
Register name
Register width selection (# or #:#)
Fixed- or floating-point data format selection (none or F)
Operand size selection (none, L, S, or B)
Compute block selection (none, X, Y, or XY)
{for result registers only}
Figure 2-4. Register File Register Name Syntax
The DSP’s assembly syntax also supports selection of integer or
fractional and real or complex data types. These selections are provided as options to instructions and are not part of register file
register name syntax.
Compute Block Selection
As shown in Figure 2-4, the assembly syntax for naming registers lets you
select the compute block of the register with which you are working.
The X and Y register-name prefixes denote in which compute block the
register resides: X = compute block X only, Y = compute block Y only, and
XY (or no prefix) = both. The following ALU instructions provide some
register name syntax examples.
XR0 = R1 + R2 ;; /* This instruction executes in block X */
This instruction uses registers XR0, XR1, and XR2.
YR1 = R5 + R6 ;; /* This instruction executes in block Y */
This instruction uses registers YR1, YR5, and YR6.
XYR0 = R0 + R2 ;; /* This instruction executes in block X & Y */
This instruction uses registers XR0, XR2, YR0, and YR2.
R0 = R22 + R3 ;; /* This instruction executes in block X & Y */
This instruction uses registers XR0, XR22, XR3, YR0, YR22, and YR3.
Because the compute block prefix lets you select between executing the
instruction in one or both compute blocks, this prefix provides the selection between Single-Instruction, Single-Data (SISD) execution and
Single-Instruction, Multiple-Data (SIMD) execution. Using SIMD execution is a powerful way to optimize execution if the same algorithm is being
used to process multiple channels of data.
It is important to note that SISD and SIMD are not modes that are turned
on or off with some latency in the change. SISD and SIMD execution are
always available as execution options simply through register name
selection.
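As an illustration only, the prefix rule can be modeled in a few lines of Python (a modeling language here, not DSP code). The helper name registers_used is hypothetical, and the sketch handles only the single-register form Rs = Rm + Rn used in the examples above; it does not check that register numbers fall in the 31–0 range.

```python
import re

def registers_used(instruction):
    """Expand a simple 'Rs = Rm + Rn ;;' ALU instruction into the
    registers it touches.

    Illustrative model only: handles single registers and the X, Y, XY,
    or omitted compute block prefix on the result register.
    """
    match = re.fullmatch(
        r"\s*(XY|X|Y)?R(\d+)\s*=\s*R(\d+)\s*\+\s*R(\d+)\s*;;\s*", instruction
    )
    if match is None:
        raise ValueError("unsupported instruction form: " + instruction)
    prefix, result, in_m, in_n = match.groups()
    # No prefix means the same as XY: both compute blocks execute (SIMD).
    blocks = "XY" if prefix in (None, "XY") else prefix
    names = [block + "R" + number
             for block in blocks
             for number in (result, in_m, in_n)]
    # Drop repeats while keeping order (a register may appear twice).
    return list(dict.fromkeys(names))

print(registers_used("XR0 = R1 + R2 ;;"))   # ['XR0', 'XR1', 'XR2']
print(registers_used("XYR0 = R0 + R2 ;;"))  # ['XR0', 'XR2', 'YR0', 'YR2']
```

The two printed lists match the register sets given for the corresponding example instructions above.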
To represent optional items, instruction syntax definitions use curly
braces { } around the item. To represent choices between items, instruction syntax definitions place a vertical bar | between items. The following
syntax definition example and comparable instruction indicate the difference for compute block selection:
{X|Y|XY}Rs = Rm + Rn ;;
/* the curly braces enclose options */
/* the vertical bars separate choices */
XYR0 = R1 + R0 ;;
/* code, no curly braces — no vertical bars */
Register Width Selection
As shown in Figure 2-4 on page 2-7, the assembly syntax for naming registers lets you select the width of the register with which you are working.
To support data sizes larger than a 32-bit word, the DSP’s assembly syntax
lets you combine registers to hold larger words. The register name syntax
for register width works as follows:
•Rs, Rm, or Rn indicates a Single register containing a 32-bit word (or
smaller).
For example, these are register names such as R1, XR2, and so on.
•Rsd, Rmd, or Rnd indicates a Double register containing a 64-bit word
(or smaller).
For example, these are register names such as R1:0, XR3:2, and so
on. The lower register must be evenly divisible by two.
•Rsq, Rmq, or Rnq indicates a Quad register containing a 128-bit word
(or smaller).
For example, these are register names such as R3:0, XR7:4, and so
on. The lowest register must be evenly divisible by 4.
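The alignment rules in this list can be checked mechanically. The following Python sketch is a hypothetical helper (register_width is not part of any Analog Devices tool) that returns the width of a register name or rejects a misaligned group. It assumes register numbers 31–0 and ignores the operand size and F prefixes.

```python
import re

def register_width(name):
    """Return the width in bits (32, 64, or 128) of a register name such
    as 'R1', 'XR3:2', or 'R7:4'. Raise ValueError if the name breaks the
    alignment rules for double and quad registers.
    """
    match = re.fullmatch(r"(XY|X|Y)?R(\d+)(?::(\d+))?", name)
    if match is None:
        raise ValueError("not a register name: " + name)
    high = int(match.group(2))
    low = high if match.group(3) is None else int(match.group(3))
    if high > 31:
        raise ValueError("register number out of range: " + name)
    count = high - low + 1
    if count == 1 and match.group(3) is None:
        return 32                      # single register, Rs
    if count == 2 and low % 2 == 0:
        return 64                      # double register, Rsd
    if count == 4 and low % 4 == 0:
        return 128                     # quad register, Rsq
    raise ValueError("misaligned or unsupported register group: " + name)
```

For example, register_width("XR3:2") returns 64, while register_width("R2:1") raises ValueError because the lower register number is odd.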
The combination of italic and code font in the register name syntax above
indicates a user-substitutable value. Instruction syntax definitions use this
convention to represent multiple register names. The following syntax
definition example and comparable instruction indicate the difference for
register width selection.
As shown in Figure 2-4 on page 2-7, the assembly syntax for naming registers lets you select the operand size and fixed- or floating-point format of
the data placed within the register with which you are working.
Single, double, and quad register file registers hold operands
(inputs and outputs) for instructions. Depending on the operand size and
fixed- or floating-point format, there may be more than one operand in a
register.
To select the operand size within a register file register, a register name
prefix selects a size that is equal to or less than the size of the register. These
operand size prefixes for fixed-point data work as follows.
•B — Indicates Byte (8-bit) word data. The data in a single 32-bit
register is treated as four 8-bit words. Example register names with
byte word operands include XBR3 and XBR3:2.
•S — Indicates Short (16-bit) word data. The data in a single 32-bit
register is treated as two 16-bit words. Example register names with
short word operands are SR1, SR1:0, and SR3:0.
•None — Indicates Normal (32-bit) word data. Example register
names with normal word operands are R0, R1:0, and R3:0.
•L — Indicates Long (64-bit) word data. An example register name
with a long word operand is LR1:0.
The B, S, and L options apply for ALU and Shifter operations.
Operand size selection differs slightly for the multiplier. For more
information, see “Multiplier Operations” on page 4-4.
To distinguish between fixed- and floating-point data, the register name
prefix F indicates that the register contains floating-point data. The DSP
supports the following floating-point data formats.
•None — Indicates fixed-point data
•FRs, FRm, or FRn (floating-point data in a single register) — Indicates
normal (IEEE format, 32-bit) word data. An example register
name with a normal word, floating-point operand is FR3.
•FRsd, FRmd, or FRnd (floating-point data in a double register) —
Indicates extended (40-bit) word data. An example register name
with an extended word, floating-point operand is FR1:0.
It is important to note that the operand size influences the execution of
the instruction. For example,
SRsd = Rmd + Rnd;; is an addition of four
short data operands, stored in two register pairs. An example of this type
of instruction follows and has the results shown in Figure 2-5.
SR1:0 = R31:30 + R25:24;;
Result R1:0     High Register (R1)          Low Register (R0)
Bits [31:16]    R31[31:16] + R25[31:16]     R30[31:16] + R24[31:16]
Bits [15:0]     R31[15:0] + R25[15:0]       R30[15:0] + R24[15:0]
Figure 2-5. Addition of Four Short Word Operands in Double Registers
As shown in Figure 2-5, this instruction executes the operation on all 64
bits in this example. The operation is executed on every group of 16 bits
separately.
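The lane-by-lane behavior can be sketched in Python as follows. This is a model for illustration, not DSP code: the function name add_short_lanes and the sample operand values are made up, and the sketch assumes each 16-bit lane wraps around on overflow (the overflow or saturation behavior of the real instruction is not modeled).

```python
def add_short_lanes(a, b, width_bits=64):
    """Add two packed registers lane by lane, treating each 16-bit field
    as an independent operand. Carries do not cross lane boundaries.

    Assumption: each lane wraps modulo 2**16 on overflow.
    """
    result = 0
    for shift in range(0, width_bits, 16):
        lane_a = (a >> shift) & 0xFFFF
        lane_b = (b >> shift) & 0xFFFF
        result |= ((lane_a + lane_b) & 0xFFFF) << shift
    return result

# Sample contents for R31:30 and R25:24 (high register in bits 63-32).
r31_30 = 0x0001_0002_0003_FFFF
r25_24 = 0x0010_0020_0030_0001
r1_0 = add_short_lanes(r31_30, r25_24)
print(f"{r1_0:016X}")   # prints 0011002200330000
```

In the lowest lane, 0xFFFF + 0x0001 wraps to 0x0000 and the carry is discarded. A plain 64-bit addition of the same values would instead carry into the next lane and give 0x0011002200340000.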
Data register file registers are used in computational instructions and
memory load/store instructions. The syntax for those instructions is
described in:
•“ALU” on page 3-1
•“Multiplier” on page 4-1
•“Shifter” on page 5-1
The following ALU instruction syntax description shows the conventions
that all syntax descriptions use for data register file names:
{X|Y|XY}{F}Rsd = Rmd + Rnd ;;
Where:
•{X|Y|XY} — The X, Y, or XY (none is same as XY) prefix on the
register name selects the compute block or blocks to execute the
instruction. The curly braces around these items indicate they are
optional, and the vertical bars indicate that only one may be
chosen.
•{F} — The F prefix on the register name selects floating-point format for the operation. Omitting the prefix selects fixed-point
format.
•Rsd — The result is a double register as indicated by the d. The register
name takes the form R#:#, where the lower number is evenly
divisible by two (as in R1:0).
•Rmd, Rnd — The inputs are double registers. The m and n indicate
that these must be different registers.
Here are some examples of register naming. In Figure 2-6, the register
name
XBR3 indicates the operation uses four fixed-point 8-bit words in the
X compute block R3 data register. In Figure 2-7, the register name XSR3
indicates the operation uses two fixed-point 16-bit words in the X compute block R3 data register. In Figure 2-8, the register name XR3 indicates
the operation uses one fixed-point 32-bit word in the X compute block R3
data register. In Figure 2-8, the register name XFR3 indicates floating-point
data.
XBR3 (Byte)
Bits 31–24   Bits 23–16   Bits 15–8   Bits 7–0
8 bits       8 bits       8 bits      8 bits
Figure 2-6. Register R3 in Compute Block X, Treated as Byte Data
XSR3 (Short)
Bits 31–16   Bits 15–0
16 bits      16 bits
Figure 2-7. Register R3 in Compute Block X, Treated as Short Data
XR3 or XFR3 (Normal)
Bits 31–0
32 bits
Figure 2-8. Register R3 in Compute Block X, Treated as Normal Data
Here are additional examples of register naming. Figure 2-9, Figure 2-10,
and Figure 2-11 show examples of operand size in double registers, which
are similar to the examples in Figure 2-6, Figure 2-7, and Figure 2-8.
XBR3:2 (Byte)
Bits 63–56   55–48    47–40    39–32    31–24    23–16    15–8     7–0
8 bits       8 bits   8 bits   8 bits   8 bits   8 bits   8 bits   8 bits
Figure 2-9. Register R3:2 in Compute Block X, Treated as Byte Data
XSR3:2 (Short)
Bits 63–48   Bits 47–32   Bits 31–16   Bits 15–0
16 bits      16 bits      16 bits      16 bits
Figure 2-10. Register R3:2 in Compute Block X, Treated as Short Data
XR3:2 (Normal)
Bits 63–32   Bits 31–0
32 bits      32 bits
Figure 2-11. Register R3:2 in Compute Block X, Treated as Normal Data
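The field layouts in Figure 2-6 through Figure 2-11 amount to slicing a register value into equal-width operands. A minimal Python sketch of that slicing follows; split_operands is a hypothetical helper name, and the hexadecimal sample values are arbitrary.

```python
def split_operands(value, register_bits, operand_bits):
    """Split a register value into its packed operands, most significant
    field first, matching the left-to-right order of the figures."""
    if register_bits % operand_bits != 0:
        raise ValueError("operand size must divide the register width")
    mask = (1 << operand_bits) - 1
    shifts = range(register_bits - operand_bits, -1, -operand_bits)
    return [(value >> shift) & mask for shift in shifts]

# A single register viewed as byte, short, and normal word data.
print([hex(x) for x in split_operands(0x11223344, 32, 8)])
# ['0x11', '0x22', '0x33', '0x44']
print([hex(x) for x in split_operands(0x11223344, 32, 16)])
# ['0x1122', '0x3344']
print([hex(x) for x in split_operands(0x11223344, 32, 32)])
# ['0x11223344']
```

Passing register_bits=64 gives the double-register views in the same way, for example four short words from a 64-bit value.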
The examples in Figure 2-12 and Figure 2-13 refer to two registers, but
hold a single data word.
XFR3:2 (Extended)
Bits 63–40   Bits 39–0
not used     40 bits
Figure 2-12. Register R3:2 in Compute Block X, Treated as Extended
(Floating-Point) Data
Figure 2-13. Register R3:2 in Compute Block X, Treated as Long Data
Numeric Formats
The DSP supports the 32-bit single-precision floating-point data format
defined in the IEEE Standard 754/854. In addition, the DSP supports a
40-bit extended-precision version of the same format with eight additional
bits in the mantissa. The DSP also supports 8-, 16-, 32-, and 64-bit
fixed-point formats—fractional and integer—which can be signed
(two’s-complement) or unsigned.
IEEE Single-Precision Floating-Point Data Format
IEEE Standard 754/854 specifies a 32-bit single-precision floating-point
format, shown in Figure 2-14. A number in this format consists of a sign
bit s, a 24-bit significand, and an 8-bit unsigned-magnitude exponent e.
For normalized numbers, the significand consists of a 23-bit fraction f and
a hidden bit of 1 that is implicitly presumed to precede f22 in the significand. The binary point is presumed to lie between this hidden bit and f22.
The least significant bit (LSB) of the fraction is f0; the LSB of the exponent is e0.
The hidden bit effectively increases the precision of the floating-point significand to 24 bits from the 23 bits actually stored in the data format.
This bit also ensures that the significand of any number in the IEEE normalized number format is always greater than or equal to 1 and less
than 2.
The unsigned exponent e ranges from 1 to 254 (1 ≤ e ≤ 254) for normal numbers
in the single-precision format. This exponent is biased by
+127 (254/2). To calculate the true unbiased exponent, 127 must be subtracted from e.
Figure 2-14. IEEE 32-Bit Single-Precision Floating-Point Format
(Normal Word)
The IEEE standard also provides for several special data types in the single-precision floating-point format:
•An exponent value of 255 (all ones) with a nonzero fraction is a
Not-A-Number (NAN). NANs are usually used as flags for data
flow control, for the values of uninitialized variables, and for the
results of invalid operations such as 0 ∗ ∞.
•Infinity is represented as an exponent of 255 and a zero fraction.
Note that because the format includes a sign bit, both positive and negative
Infinity can be represented.
•Zero is represented by a zero exponent and a zero fraction. As with
Infinity, both positive zero and negative zero can be represented.
The IEEE single-precision floating-point data types supported by the DSP
and their interpretations are summarized in Table 2-1.
Table 2-1. IEEE Single-Precision Floating-Point Data Types
Type       Exponent       Fraction   Value
NAN        255            Nonzero    Undefined
Infinity   255            0          $(-1)^{s}$ Infinity
Normal     1 ≤ e ≤ 254    Any        $(-1)^{s}\,(1.f_{22-0})\,2^{e-127}$
Zero       0              0          $(-1)^{s}$ Zero
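The classification in Table 2-1 can be worked through in Python as a quick check of the field layout and the value formula. This is a sketch for illustration: the function names are made up, and the "Denormal" case is added only because the table does not list that encoding (a zero exponent with a nonzero fraction).

```python
import struct

def decode_single(bits):
    """Classify a 32-bit word per Table 2-1 and return (type, value).

    value is None for a NAN. Denormals (e == 0, nonzero fraction) are
    reported separately because Table 2-1 does not list them.
    """
    sign = (bits >> 31) & 0x1
    e = (bits >> 23) & 0xFF          # 8-bit biased exponent
    f = bits & 0x7FFFFF              # 23-bit fraction f22..f0
    sign_factor = -1.0 if sign else 1.0
    if e == 255:
        if f != 0:
            return ("NAN", None)
        return ("Infinity", sign_factor * float("inf"))
    if e == 0:
        if f == 0:
            return ("Zero", sign_factor * 0.0)
        return ("Denormal", sign_factor * f * 2.0 ** -149)
    significand = 1.0 + f / 2.0 ** 23    # hidden bit plus fraction
    return ("Normal", sign_factor * significand * 2.0 ** (e - 127))

def float_to_bits(x):
    """Reinterpret a Python float as an IEEE single-precision bit pattern."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

print(decode_single(0x3F800000))            # ('Normal', 1.0)
print(decode_single(float_to_bits(-2.0)))   # ('Normal', -2.0)
print(decode_single(0x7FC00000))            # ('NAN', None)
```

For 0x3F800000 the exponent field is 127 and the fraction is zero, so the value is 1.0 × 2^0 = 1.0.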
The TigerSHARC processor is compatible with the IEEE single-precision
floating-point data format in all respects, except for:
•The TigerSHARC processor does not provide inexact flags.
•NAN inputs generate an invalid exception and return a quiet
NAN.
•Denormal operands are flushed to zero when input to a computation unit and do not generate an underflow exception. Any
denormal or underflow result from an arithmetic operation is
flushed to zero and an underflow exception is generated.
•Round-to-nearest and round-towards-zero are supported.
Rounding to ±infinity is not supported.
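The flush-to-zero rule for denormal inputs can be modeled as a filter on the 32-bit operand. The Python sketch below is hypothetical and makes one assumption that the text does not state: the sign bit is kept, so a negative denormal becomes negative zero. Exception flags are not modeled.

```python
def flush_denormal_input(bits):
    """Return the 32-bit word a computation would see for an input operand.

    A denormal (exponent field 0, nonzero fraction) is replaced by zero.
    Assumption: the sign bit is kept; the source text does not say so.
    """
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0 and fraction != 0:
        return bits & 0x80000000     # keep only the sign bit
    return bits
```

Normal numbers, zeros, infinities, and NANs pass through unchanged; only the denormal encodings are altered.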
The extended precision floating-point format is 40 bits wide, with the
same 8-bit exponent as in the standard format but with a 32-bit significand. This format is shown in Figure 2-15. In all other respects, the
extended floating-point format is the same as the IEEE standard format.
Figure 2-15. 40-Bit Extended-Precision Floating-Point Format
(Extended Word)
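A decoding sketch for the extended format follows, in Python. The bit layout is an assumption inferred from the description above, not taken from Figure 2-15: sign in bit 39, the 8-bit exponent in bits 38–31, and a 31-bit fraction in bits 30–0, which together with the hidden bit gives the 32-bit significand. The function names are made up.

```python
def decode_extended(bits):
    """Decode a normalized 40-bit extended-precision word into a float.

    Assumed layout: sign in bit 39, exponent in bits 38-31, and a
    31-bit fraction in bits 30-0 (hidden bit not stored).
    """
    sign = (bits >> 39) & 0x1
    e = (bits >> 31) & 0xFF
    f = bits & 0x7FFFFFFF
    if not 1 <= e <= 254:
        raise ValueError("only normal numbers are handled in this sketch")
    significand = 1.0 + f / 2.0 ** 31
    return (-1.0 if sign else 1.0) * significand * 2.0 ** (e - 127)

def single_to_extended(bits32):
    """Widen a 32-bit single-precision word by appending eight zero
    fraction bits (valid under the layout assumed above)."""
    return bits32 << 8

print(decode_extended(single_to_extended(0x3F800000)))   # 1.0
```

Under this layout, the smallest step above 1.0 is 2^-31 rather than the 2^-23 of the 32-bit format, which reflects the eight additional mantissa bits.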
Fixed-Point Formats
The DSP supports fixed-point fractional and integer formats for 16-, 32-,
and 64-bit data. In these formats, numbers can be signed (two’s-complement) or unsigned. The possible combinations are shown in Figure 2-20
through Figure 2-27. In the fractional format, there is an implied binary
point to the left of the most significant magnitude bit. In integer format,
the binary point is understood to be to the right of the LSB. Note that the
sign bit is negatively weighted in a two’s-complement format.
The DSP supports a fixed-point, signed, integer format for
8-bit data. Data in the 8- and 16-bit formats is always packed in
32-bit registers as follows: a single register holds four 8-bit or two
16-bit words, a dual register holds eight 8-bit or four 16-bit words,
and a quad register holds sixteen 8-bit or eight 16-bit words.
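The difference between the integer and fractional interpretations of the same bit pattern can be sketched in Python. The helper fixed_point_value is hypothetical; it applies the binary point placement described above and the negative weight of the sign bit in two's-complement formats.

```python
def fixed_point_value(bits, width, signed=True, fractional=False):
    """Interpret a width-bit word as a fixed-point number.

    Integer format: binary point to the right of the LSB.
    Fractional format: binary point to the left of the most significant
    magnitude bit, so signed values lie in [-1, 1) and unsigned values
    lie in [0, 1).
    """
    bits &= (1 << width) - 1
    if signed and bits & (1 << (width - 1)):
        # The sign bit is negatively weighted in two's complement.
        bits -= 1 << width
    if not fractional:
        return bits
    magnitude_bits = width - 1 if signed else width
    return bits / float(1 << magnitude_bits)

print(fixed_point_value(0x8000, 16))                    # -32768
print(fixed_point_value(0x8000, 16, signed=False))      # 32768
print(fixed_point_value(0x4000, 16, fractional=True))   # 0.5
print(fixed_point_value(0x8000, 16, fractional=True))   # -1.0
```

The same pattern 0x8000 reads as -32768, 32768, -1.0, or 0.5 depending on the signed and fractional selections.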
ALU outputs always have the same width and data format as the inputs.
The multiplier, however, produces a 64-bit product from two 32-bit
inputs. If both operands are unsigned integers, the result is a 64-bit
unsigned integer. If both operands are unsigned fractions, the result is a
64-bit unsigned fraction. These formats are shown in Figure 2-30 and
Figure 2-31.
If one operand is signed and the other unsigned, the result is signed. If
both inputs are signed, the result is signed and automatically shifted left
one bit. The LSB becomes zero and bit 62 moves into the sign bit position. Normally bit 63 and bit 62 are identical when both operands are
signed. (The only exception is full-scale negative multiplied by itself.)
Thus, the left shift normally removes a redundant sign bit, increasing the
precision of the most significant product. Also, if the data format is fractional, a single bit left shift renormalizes the MSB to a fractional format.
The signed formats with and without left shifting are shown in
Figure 2-28 and Figure 2-29.
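The redundant sign bit and the one exception can be checked numerically. The following Python sketch models a signed 32-bit by 32-bit multiply and the one-bit left shift; the function names are made up, and overflow handling (for example, saturation) in the exceptional case is not modeled.

```python
def to_signed(bits, width):
    """Interpret the low `width` bits as a two's-complement integer."""
    bits &= (1 << width) - 1
    return bits - (1 << width) if bits & (1 << (width - 1)) else bits

def signed_multiply(a_bits, b_bits):
    """Multiply two signed 32-bit words and return the 64-bit product
    pattern before and after the one-bit left shift."""
    product = to_signed(a_bits, 32) * to_signed(b_bits, 32)
    unshifted = product & 0xFFFFFFFFFFFFFFFF
    shifted = (unshifted << 1) & 0xFFFFFFFFFFFFFFFF   # LSB becomes zero
    return unshifted, shifted

def top_two_bits_match(word64):
    """True when bit 63 and bit 62 are identical (redundant sign bit)."""
    return ((word64 >> 63) & 1) == ((word64 >> 62) & 1)

# 0.5 * 0.5 in signed fractional format: the shifted result reads as 0.25.
unshifted, shifted = signed_multiply(0x40000000, 0x40000000)
print(f"{unshifted:016X} {shifted:016X}")
# prints 1000000000000000 2000000000000000

# Full-scale negative times itself: bits 63 and 62 differ.
unshifted, shifted = signed_multiply(0x80000000, 0x80000000)
print(top_two_bits_match(unshifted))   # prints False
```

In the first case the shifted pattern 0x2000000000000000, read with the binary point after the sign bit, is 2^61 / 2^63 = 0.25. In the second case the unshifted product is 2^62, so the left shift moves a 1 into the sign bit position, which is why this input pair is the exception noted above.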
The multiplier has an 80-bit accumulator to allow the accumulation of
64-bit products. For more information on the multiplier and accumulator, see “Multiplier” on page 4-1.
BRs, Signed Integer
Bit      7 (sign bit)   6     5     . . .   2     1     0
Weight   –2^7           2^6   2^5   . . .   2^2   2^1   2^0   . (binary point)
Figure 2-16. 8-Bit Fixed-Point Format, Signed Integer
(Byte Word)