Texas Instruments Incorporated and its subsidiaries (TI) reserve the right to make corrections,
modifications, enhancements, improvements, and other changes to its products and services at any
time and to discontinue any product or service without notice. Customers should obtain the latest
relevant information before placing orders and should verify that such information is current and
complete. All products are sold subject to TI’s terms and conditions of sale supplied at the time of order
acknowledgment.
TI warrants performance of its hardware products to the specifications applicable at the time of sale
in accordance with TI’s standard warranty. Testing and other quality control techniques are used to the
extent TI deems necessary to support this warranty. Except where mandated by government
requirements, testing of all parameters of each product is not necessarily performed.
TI assumes no liability for applications assistance or customer product design. Customers are
responsible for their products and applications using TI components. To minimize the risks associated
with customer products and applications, customers should provide adequate design and operating
safeguards.
TI does not warrant or represent that any license, either express or implied, is granted under any TI
patent right, copyright, mask work right, or other TI intellectual property right relating to any
combination, machine, or process in which TI products or services are used. Information published by
TI regarding third-party products or services does not constitute a license from TI to use such products
or services or a warranty or endorsement thereof. Use of such information may require a license from
a third party under the patents or other intellectual property of the third party, or a license from TI under
the patents or other intellectual property of TI.
Reproduction of information in TI data books or data sheets is permissible only if reproduction is without
alteration and is accompanied by all associated warranties, conditions, limitations, and notices.
Reproduction of this information with alteration is an unfair and deceptive business practice. TI is not
responsible or liable for such altered documentation.
Resale of TI products or services with statements different from or beyond the parameters stated by
TI for that product or service voids all express and any implied warranties for the associated TI product
or service and is an unfair and deceptive business practice. TI is not responsible or liable for any such
statements.
Following are URLs where you can obtain information on other Texas Instruments products and
application solutions:
Products                                   Applications
Amplifiers        amplifier.ti.com         Audio         www.ti.com/audio
Data Converters   dataconverter.ti.com     Automotive    www.ti.com/automotive
Preface
Read This First
The TMS320C6000™ digital signal processor (DSP) platform is part of the
TMS320™ DSP family. The TMS320C62x™ DSP generation and the
TMS320C64x™ DSP generation comprise fixed-point devices in the
C6000™ DSP platform, and the TMS320C67x™ DSP generation comprises
floating-point devices in the C6000 DSP platform.
The TMS320C67x+™ DSP is an enhancement of the C67x™ DSP with added
functionality and an expanded instruction set. This document describes the
CPU architecture, pipeline, instruction set, and interrupts of the C67x and
C67x+™ DSPs.
Notational Conventions
This document uses the following conventions.
Any reference to the C67x DSP or C67x CPU also applies, unless otherwise
noted, to the C67x+ DSP and C67x+ CPU, respectively.
Hexadecimal numbers are shown with the suffix h. For example, the
following number is 40 hexadecimal (decimal 64): 40h.
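The h-suffix convention can be checked mechanically. The following is a minimal sketch, not part of the TI tools; the function names parse_h_suffix and format_h_suffix are hypothetical and chosen only for illustration.

```python
def parse_h_suffix(text):
    """Convert a number written with a trailing 'h' (for example '40h') to an int."""
    text = text.strip()
    if not text.lower().endswith("h"):
        raise ValueError("expected a trailing 'h': %r" % text)
    # Everything before the suffix is interpreted as base 16.
    return int(text[:-1], 16)


def format_h_suffix(value):
    """Format a non-negative int in the same style, for example 64 -> '40h'."""
    if value < 0:
        raise ValueError("value must be non-negative")
    return "%Xh" % value
```

For instance, parse_h_suffix("40h") yields decimal 64, matching the example in the text.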
Related Documentation From Texas Instruments
The following documents describe the C6000™ devices and related support
tools. Copies of these documents are available on the Internet at www.ti.com.
Tip: Enter the literature number in the search box provided at www.ti.com.
The current documentation that describes the C6000 devices, related peripherals, and other technical collateral, is available in the C6000 DSP product
folder at: www.ti.com/c6000.
number SPRU723) describes the peripherals available on the
TMS320C672x DSPs.
TMS320C6000 Technical Brief (literature number SPRU197) gives an
introduction to the TMS320C62x and TMS320C67x DSPs, development
tools, and third-party support.
TMS320C6000 Programmer’s Guide (literature number SPRU198)
describes ways to optimize C and assembly code for the TMS320C6000
DSPs and includes application program examples.
TMS320C6000 Code Composer Studio Tutorial (literature number
SPRU301) introduces the Code Composer Studio integrated development environment and software tools.
Code Composer Studio Application Programming Interface Reference
Guide (literature number SPRU321) describes the Code Composer
Studio application programming interface (API), which allows you to program custom plug-ins for Code Composer.
TMS320C6x Peripheral Support Library Programmer’s Reference
(literature number SPRU273) describes the contents of the
TMS320C6000 peripheral support library of functions and macros. It lists
functions and macros both by header file and alphabetically, provides a
complete description of each, and gives code examples to show how
they are used.
TMS320C6000 Chip Support Library API Reference Guide (literature
number SPRU401) describes a set of application programming interfaces
(APIs) used to configure and control the on-chip peripherals.
Trademarks
Code Composer Studio, C6000, C64x, C67x, C67x+, TMS320C2000,
TMS320C5000, TMS320C6000, TMS320C62x, TMS320C64x,
TMS320C67x, TMS320C67x+, TMS320C672x, and VelociTI are trademarks
of Texas Instruments.
Trademarks are the property of their respective owners.
The TMS320C6000™ digital signal processor (DSP) platform is part of the
TMS320™ DSP family. The TMS320C62x™ DSP generation and the
TMS320C64x™ DSP generation comprise fixed-point devices in the C6000™
DSP platform, and the TMS320C67x™ DSP generation comprises floating-point devices in the C6000 DSP platform. All three DSP generations use the
VelociTI™ architecture, a high-performance, advanced very long instruction
word (VLIW) architecture, making these DSPs excellent choices for multichannel and multifunction applications.
The TMS320C67x+ DSP is an enhancement of the C67x DSP with added
functionality and an expanded instruction set.
Any reference to the C67x DSP or C67x CPU also applies, unless otherwise
noted, to the C67x+ DSP and C67x+ CPU, respectively.
1.1 TMS320 DSP Family Overview
The TMS320™ DSP family consists of fixed-point, floating-point, and multiprocessor digital signal processors (DSPs). TMS320™ DSPs have an architecture
designed specifically for real-time signal processing.
Table 1−1 lists some typical applications for the TMS320™ family of DSPs. The
TMS320™ DSPs offer adaptable approaches to traditional signal-processing
problems. They also support complex applications that often require multiple
operations to be performed simultaneously.
1.2 TMS320C6000 DSP Family Overview
With a performance of up to 6000 million instructions per second (MIPS) and
an efficient C compiler, the TMS320C6000 DSPs give system architects
unlimited possibilities to differentiate their products. High performance, ease
of use, and affordable pricing make the C6000 generation the ideal solution
for multichannel, multifunction applications, such as:
Pooled modems
Wireless local loop base stations
Remote access servers (RAS)
Digital subscriber loop (DSL) systems
Cable modems
Multichannel telephony systems
The C6000 generation is also an ideal solution for exciting new applications;
for example:
Personalized home security with face and hand/fingerprint recognition
Advanced cruise control with global positioning systems (GPS) navigation
and accident avoidance
Remote medical diagnostics
Beam-forming base stations
Virtual reality 3-D graphics
Speech recognition
Audio
Radar
Atmospheric modeling
Finite element analysis
Imaging (examples: fingerprint recognition, ultrasound, and MRI)
Table 1−1. Typical Applications for the TMS320 DSPs

Categories: Automotive, Consumer, Control, General-Purpose, Graphics/Imaging,
Industrial, Instrumentation, Medical, Military, Telecommunications, Voice/Speech

Automotive
    Adaptive ride control
    Antiskid brakes
    Cellular telephones
    Digital radios
    Engine control
    Global positioning
    Navigation
    Vibration analysis
    Voice commands

Consumer
    Digital radios/TVs
    Educational toys
    Music synthesizers
    Pagers
    Power tools
    Radar detectors
    Solid-state answering machines

Control
    Disk drive control
    Engine control
    Laser printer control
    Motor control
    Robotics control
    Servo control

General-Purpose
    Adaptive filtering
    Convolution
    Correlation
    Digital filtering
    Fast Fourier transforms
    Hilbert transforms
    Waveform generation
    Windowing

Industrial
    Numeric control
    Power-line monitoring
    Robotics
    Security access

Instrumentation
    Digital filtering
    Function generation
    Pattern matching
    Phase-locked loops
    Seismic processing
    Spectrum analysis
    Transient analysis

Military
    Image processing
    Missile guidance
    Navigation
    Radar processing
    Radio frequency modems
    Secure communications
    Sonar processing

Telecommunications
    1200- to 56600-bps modems
    Adaptive equalizers
    ADPCM transcoders
    Base stations
    Cellular telephones
    Channel multiplexing
    Data encryption
    Digital PBXs
    Digital speech interpolation (DSI)
    DTMF encoding/decoding
    Echo cancellation
    Faxing
    Future terminals
    Line repeaters
    Personal communications systems (PCS)
    Personal digital assistants (PDA)
    Speaker phones
    Spread spectrum communications
    Digital subscriber loop (xDSL)
    Video conferencing
    X.25 packet switching
1.3 TMS320C67x DSP Features and Options
The C6000 devices execute up to eight 32-bit instructions per cycle. The C67x
CPU consists of 32 general-purpose 32-bit registers and eight functional units.
These eight functional units contain:
Two multipliers
Six ALUs
The C6000 generation has a complete set of optimized development tools,
including an efficient C compiler, an assembly optimizer for simplified
assembly-language programming and scheduling, and a Windows™ based
debugger interface for visibility into source code execution characteristics. A
hardware emulation board, compatible with the TI XDS510™ and XDS560™
emulator interface, is also available. This tool complies with IEEE Standard
1149.1−1990, IEEE Standard Test Access Port and Boundary-Scan
Architecture.
Features of the C6000 devices include:
Advanced VLIW CPU with eight functional units, including two multipliers
and six arithmetic units
Executes up to eight instructions per cycle for up to ten times the
performance of typical DSPs
Allows designers to develop highly effective RISC-like code for fast
development time
Instruction packing
Gives code size equivalence for eight instructions executed serially or
in parallel
Reduces code size, program fetches, and power consumption
Conditional execution of all instructions
Reduces costly branching
Increases parallelism for higher sustained performance
Efficient code execution on independent functional units
Industry’s most efficient C compiler on DSP benchmark suite
Industry’s first assembly optimizer for fast development and improved
parallelization
8/16/32-bit data support, providing efficient memory support for a variety
of applications
40-bit arithmetic options add extra precision for vocoders and other
computationally intensive applications
Saturation and normalization provide support for key arithmetic
operations
Field manipulation instructions (extract, set, and clear) and bit counting
support common operations found in control and data manipulation
applications.
The C67x devices include these additional features:
Hardware support for single-precision (32-bit) and double-precision
(64-bit) IEEE floating-point operations.
32 × 32-bit integer multiply with 32-bit or 64-bit result.
In addition to the features of the C67x device, the C67x+ device is enhanced
for code size improvement and floating-point performance. These additional
features include:
Execute packets can span fetch packets.
Register file size is increased to 64 registers (32 in each datapath).
Floating-point addition and subtraction capability in the .S unit.
Mixed-precision multiply instructions.
32-KByte instruction cache that supports execution from both on-chip
RAM and ROM as well as from external memory through a VBUSP-based
external memory interface (EMIF).
Unified memory controller features support for flat on-chip data RAM and
ROM organizations for zero wait-state accesses from both load store units
of the CPU. The memory controller supports different banking organizations for RAM and ROM arrays. The memory controller also supports
VBUSP interfaces (two master and one slave) for transfer of data from the
system peripherals to and from the CPU and internal memory. A VBUSP-based DMA controller can interface to the CPU for programmable bulk
transfers through the VBUSP slave port.
The VelociTI architecture of the C6000 platform of devices makes them the first
off-the-shelf DSPs to use advanced VLIW to achieve high performance
through increased instruction-level parallelism. A traditional VLIW architecture
consists of multiple execution units running in parallel, performing multiple
instructions during a single clock cycle. Parallelism is the key to extremely high
performance, taking these DSPs well beyond the performance capabilities of
traditional superscalar designs. VelociTI is a highly deterministic architecture,
having few restrictions on how or when instructions are fetched, executed, or
stored. It is this architectural flexibility that is key to the breakthrough efficiency
levels of the TMS320C6000 Optimizing C compiler. VelociTI’s advanced
features include:
Instruction packing: reduced code size
All instructions can operate conditionally: flexibility of code
Variable-width instructions: flexibility of data types
Fully pipelined branches: zero-overhead branching.
1.4 TMS320C67x DSP Architecture
Figure 1−1 is the block diagram for the C67x DSP. The C6000 devices come
with program memory, which, on some devices, can be used as a program
cache. The devices also have varying sizes of data memory. Peripherals such
as a direct memory access (DMA) controller, power-down logic, and external
memory interface (EMIF) usually come with the CPU, while peripherals such
as serial ports and host ports are on only certain devices. Check the data sheet
for your device to determine the specific peripheral configurations you have.
Figure 1−1. TMS320C67x DSP Block Diagram
[Block diagram. Program cache/program memory (32-bit address, 256-bit data);
C6000 CPU containing program fetch, instruction dispatch, instruction decode,
data path A (register file A; .L1, .S1, .M1, .D1), data path B (register file B;
.L2, .S2, .M2, .D2), control registers, control logic, test, emulation, and
interrupts; data cache/data memory (32-bit address; 8-, 16-, 32-bit data);
DMA, EMIF; power down; additional peripherals (timers, serial ports, etc.).]
1.4.1 Central Processing Unit (CPU)
The C67x CPU, in Figure 1−1, is common to all the C62x/C64x/C67x devices.
The CPU contains:
Program fetch unit
Instruction dispatch unit
Instruction decode unit
Two data paths, each with four functional units
32 32-bit registers
Control registers
Control logic
Test, emulation, and interrupt logic
The program fetch, instruction dispatch, and instruction decode units can
deliver up to eight 32-bit instructions to the functional units every CPU clock
cycle. The processing of instructions occurs in each of the two data paths (A
and B), each of which contains four functional units (.L, .S, .M, and .D) and 16
32-bit general-purpose registers. The data paths are described in more detail
in Chapter 2. A control register file provides the means to configure and control
various processor operations. To understand how instructions are fetched,
dispatched, decoded, and executed in the data path, see Chapter 4.
1.4.2 Internal Memory
The C67x DSP has a 32-bit, byte-addressable address space. Internal
(on-chip) memory is organized in separate data and program spaces. When
off-chip memory is used, these spaces are unified on most devices to a single
memory space via the external memory interface (EMIF).
The C67x DSP has two 32-bit internal ports to access internal data memory.
The C67x DSP has a single internal port to access internal program memory,
with an instruction-fetch width of 256 bits.
1.4.3 Memory and Peripheral Options
A variety of memory and peripheral options are available for the C6000
platform:
and other asynchronous memories for a broad range of external memory
requirements and maximum system performance.
DMA Controller (C6701 DSP only) transfers data between address ranges
in the memory map without intervention by the CPU. The DMA controller
has four programmable channels and a fifth auxiliary channel.
EDMA Controller performs the same functions as the DMA controller. The
EDMA has 16 programmable channels, as well as a RAM space to hold
multiple configurations for future transfers.
HPI is a parallel port through which a host processor can directly access
the CPU’s memory space. The host device has ease of access because
it is the master of the interface. The host and the CPU can exchange information via internal or external memory. In addition, the host has direct
access to memory-mapped peripherals.
Expansion bus is a replacement for the HPI, as well as an expansion of
the EMIF. The expansion provides two distinct areas of functionality (host
port and I/O port) which can co-exist in a system. The host port of the
expansion bus can operate in either asynchronous slave mode, similar to
the HPI, or in synchronous master/slave mode. This allows the device to
interface to a variety of host bus protocols. Synchronous FIFOs and
asynchronous peripheral I/O devices may interface to the expansion bus.
McBSP (multichannel buffered serial port) is based on the standard serial
port interface found on the TMS320C2000™ and TMS320C5000™
devices. In addition, the port can buffer serial samples in memory automatically with the aid of the DMA/EDMA controller. It also has multichannel
capability compatible with the T1, E1, SCSA, and MVIP networking
standards.
Timers in the C6000 devices are two 32-bit general-purpose timers used
for these functions:
Time events
Count events
Generate pulses
Interrupt the CPU
Send synchronization events to the DMA/EDMA controller.
Power-down logic allows reduced clocking to reduce power consumption.
Most of the operating power of CMOS logic dissipates during circuit
switching from one logic state to another. By preventing some or all of the
chip’s logic from switching, you can realize significant power savings without losing any data or operational context.
For an overview of the peripherals available on the C6000 DSP, refer to the
TMS320C6000 DSP Peripherals Overview Reference Guide (SPRU190).
Chapter 2
CPU Data Paths and Control
This chapter focuses on the CPU, providing information about the data paths and
control registers. The two register files and the data cross paths are described.
The components of the data path for the TMS320C67x CPU are shown in
Figure 2−1. These components consist of:
Two general-purpose register files (A and B)
Eight functional units (.L1, .L2, .S1, .S2, .M1, .M2, .D1, and .D2)
Two load-from-memory data paths (LD1 and LD2)
Two store-to-memory data paths (ST1 and ST2)
Two data address paths (DA1 and DA2)
Two register file data cross paths (1X and 2X)
2.2 General-Purpose Register Files
There are two general-purpose register files (A and B) in the C6000 data paths.
For the C67x DSP, each of these files contains 16 32-bit registers (A0–A15 for
file A and B0–B15 for file B), as shown in Table 2−1. For the C67x+ DSP, the
register file size is doubled to 32 32-bit registers (A0–A31 for file A and B0–B31
for file B), as shown in Table 2−1. The general-purpose registers can be used
for data, data address pointers, or condition registers.
The C67x DSP general-purpose register files support data ranging in size from
packed 16-bit data through 40-bit fixed-point and 64-bit floating point data.
Values larger than 32 bits, such as 40-bit long and 64-bit float quantities, are
stored in register pairs. In each pair, the 32 LSBs of data are placed in an even-numbered register and the remaining 8 or 32 MSBs in the next upper register
(that is always an odd-numbered register). Packed data types store either four
8-bit values or two 16-bit values in a single 32-bit register, or four 16-bit values
in a 64-bit register pair.
There are 16 valid register pairs for 40-bit and 64-bit data in the C67x DSP
cores. In assembly language syntax, a colon between the register names
denotes the register pairs, and the odd-numbered register is specified first.
On the C67x+ DSP, the additional registers are addressed by using the previously unused fifth
(most significant) bit of the source and destination register specifiers. All 64-bit register writes and
reads are performed over 2 cycles, as on the current C67x devices.
Figure 2−2 shows the register storage scheme for 40-bit long data. Operations
requiring a long input ignore the 24 MSBs of the odd-numbered register.
Operations producing a long result zero-fill the 24 MSBs of the odd-numbered
register. The even-numbered register is encoded in the opcode.
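The pair naming and the 40-bit storage scheme described above can be sketched in a few lines. This is an illustrative model only, not TI code; the helper names pair_name, write_long, and read_long are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK40 = 0xFFFFFFFFFF


def pair_name(reg_file, even):
    """Return the assembly name of a register pair, odd register first."""
    if reg_file not in ("A", "B") or even % 2 != 0 or not 0 <= even <= 30:
        raise ValueError("pair must start at an even register, 0 through 30")
    return "%s%d:%s%d" % (reg_file, even + 1, reg_file, even)


def write_long(value):
    """Split a 40-bit value into (odd, even) 32-bit register contents.

    The 8 MSBs go to the low byte of the odd register; the remaining
    24 MSBs of the odd register are zero-filled.
    """
    value &= MASK40
    even = value & MASK32
    odd = (value >> 32) & 0xFF
    return odd, even


def read_long(odd, even):
    """Rebuild a 40-bit value; the 24 MSBs of the odd register are ignored."""
    return ((odd & 0xFF) << 32) | (even & MASK32)
```

Under these assumptions, writing 123456789Ah places 12h in the odd register and 3456789Ah in the even register, and reading the pair back ignores whatever is in bits 31−8 of the odd register.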
Figure 2−1. TMS320C67x CPU Data Paths
[Diagram. Data path A: register file A (A0−A15) with functional units .L1, .S1,
.M1, and .D1. Data path B: register file B (B0−B15) with functional units .L2,
.S2, .M2, and .D2. Each unit has src1, src2, and dst ports; the .L and .S units
also have 8-bit long src and long dst ports. Also shown: load paths LD1 and LD2
(32 MSB and 32 LSB each), store paths ST1 and ST2, data address paths DA1 and
DA2, cross paths 1X and 2X, and the control register file.]
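Table 2−1. 40-Bit/64-Bit Register Pairs

Register File A    Register File B    Devices
A1:A0              B1:B0              C67x DSP
A3:A2              B3:B2              C67x DSP
A5:A4              B5:B4              C67x DSP
A7:A6              B7:B6              C67x DSP
A9:A8              B9:B8              C67x DSP
A11:A10            B11:B10            C67x DSP
A13:A12            B13:B12            C67x DSP
A15:A14            B15:B14            C67x DSP
A17:A16            B17:B16            C67x+ DSP only
A19:A18            B19:B18            C67x+ DSP only
A21:A20            B21:B20            C67x+ DSP only
A23:A22            B23:B22            C67x+ DSP only
A25:A24            B25:B24            C67x+ DSP only
A27:A26            B27:B26            C67x+ DSP only
A29:A28            B29:B28            C67x+ DSP only
A31:A30            B31:B30            C67x+ DSP only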
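Figure 2−2. Storage Scheme for 40-Bit Data in a Register Pair
[Diagram. Read from registers: bits 39−32 of the 40-bit data come from bits 7−0
of the odd register and bits 31−0 come from the even register; bits 31−8 of the
odd register are ignored. Write to registers: bits 39−32 of the 40-bit data go
to bits 7−0 of the odd register and bits 31−0 go to the even register;
bits 31−8 of the odd register are zero-filled.]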
2.3 Functional Units
The eight functional units in the C6000 data paths can be divided into two
groups of four; each functional unit in one data path is almost identical to the
corresponding unit in the other data path. The functional units are described
in Table 2−2.
Most data lines in the CPU support 32-bit operands, and some support long
(40-bit) and double word (64-bit) operands. Each functional unit has its own
32-bit write port into a general-purpose register file (Refer to Figure 2−1). All
units ending in 1 (for example, .L1) write to register file A, and all units ending
in 2 write to register file B. Each functional unit has two 32-bit read ports for
source operands src1 and src2. Four units (.L1, .L2, .S1, and .S2) have an
extra 8-bit-wide port for 40-bit long writes, as well as an 8-bit input for 40-bit
long reads. Because each unit has its own 32-bit write port, when performing
32-bit operations all eight units can be used in parallel every cycle.
See Appendix B for a list of the instructions that execute on each functional
unit.
Table 2−2. Functional Units and Operations Performed

.L unit (.L1, .L2)
    Fixed-point operations:
        32/40-bit arithmetic and compare operations
        32-bit logical operations
        Leftmost 1 or 0 counting for 32 bits
        Normalization count for 32 and 40 bits
    Floating-point operations:
        Arithmetic operations
        DP → SP, INT → DP, INT → SP conversion operations

.S unit (.S1, .S2)
    Fixed-point operations:
        32-bit arithmetic operations
        32/40-bit shifts and 32-bit bit-field operations
        32-bit logical operations
        Branches
        Constant generation
        Register transfers to/from control register file (.S2 only)
    Floating-point operations:
        Compare
        Reciprocal and reciprocal square-root operations
        Absolute value operations
        SP → DP conversion operations
        SP and DP adds and subtracts
        SP and DP reverse subtracts (src2 − src1)

.M unit (.M1, .M2)
    Fixed-point operations:
        16 × 16-bit multiply operations
        32 × 32-bit multiply operations
    Floating-point operations:
        Floating-point multiply operations
        Mixed-precision multiply operations

.D unit (.D1, .D2)
    Fixed-point operations:
        32-bit add, subtract, linear and circular address calculation
        Loads and stores with 5-bit constant offset
        Loads and stores with 15-bit constant offset (.D2 only)
    Floating-point operations:
        Load doubleword with 5-bit constant offset
2.4 Register File Cross Paths
Each functional unit reads directly from and writes directly to the register file
within its own data path. That is, the .L1, .S1, .D1, and .M1 units write to register
file A and the .L2, .S2, .D2, and .M2 units write to register file B. The register
files are connected to the opposite-side register file’s functional units via the
1X and 2X cross paths. These cross paths allow functional units from one data
path to access a 32-bit operand from the opposite side register file. The 1X
cross path allows the functional units of data path A to read their source from
register file B, and the 2X cross path allows the functional units of data path
B to read their source from register file A.
On the C67x DSP, six of the eight functional units have access to the register
file on the opposite side, via a cross path. The src2 inputs of the .M1, .M2, .S1, and .S2
units are selectable between the cross path and the same-side register file. In
the case of the .L1 and .L2, both src1 and src2 inputs are also selectable
between the cross path and the same-side register file.
Only two cross paths, 1X and 2X, exist in the C6000 architecture. Thus, the
limit is one source read from each data path’s opposite register file per cycle,
or a total of two cross path source reads per cycle. In the C67x DSP, only one
functional unit per data path, per execute packet, can get an operand from the
opposite register file.
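The cross path limits above lend themselves to a simple check. The following is a minimal sketch, assuming an execute packet is represented as a list of (unit, uses_cross_path) tuples; that representation and the name check_cross_paths are hypothetical and are not part of any TI tool.

```python
# Units that can read an operand over a cross path on the C67x DSP
# (six of the eight functional units; the .D units are excluded).
CROSS_PATH_UNITS = {".L1", ".L2", ".S1", ".S2", ".M1", ".M2"}
ALL_UNITS = CROSS_PATH_UNITS | {".D1", ".D2"}


def check_cross_paths(packet):
    """Return a list of error strings for one execute packet.

    packet is a list of (unit, uses_cross_path) tuples.
    """
    errors = []
    users = {"1X": [], "2X": []}
    for unit, uses_cross in packet:
        if unit not in ALL_UNITS:
            errors.append("unknown functional unit %s" % unit)
            continue
        if not uses_cross:
            continue
        if unit not in CROSS_PATH_UNITS:
            errors.append("%s has no cross path access" % unit)
            continue
        # Units in data path A read register file B over 1X;
        # units in data path B read register file A over 2X.
        path = "1X" if unit.endswith("1") else "2X"
        users[path].append(unit)
    for path in ("1X", "2X"):
        if len(users[path]) > 1:
            errors.append("%s used by more than one unit: %s"
                          % (path, ", ".join(users[path])))
    return errors
```

A packet in which .L1 and .S2 each use a cross path passes, because one read goes over 1X and the other over 2X. A packet in which .L1 and .M1 both use a cross path is flagged, because both would need 1X in the same execute packet.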
2.5 Memory, Load, and Store Paths
The C67x DSP has two 32-bit paths for loading data from memory to the register file: LD1 for register file A, and LD2 for register file B. The C67x DSP also
has a second 32-bit load path for both register files A and B. This allows the
LDDW instruction to simultaneously load two 32-bit values into register file A
and two 32-bit values into register file B. For side A, LD1a is the load path for
the 32 LSBs and LD1b is the load path for the 32 MSBs. For side B, LD2a is
the load path for the 32 LSBs and LD2b is the load path for the 32 MSBs. There
are also two 32-bit paths, ST1 and ST2, for storing register values to memory
from each register file.
On the C6000 architecture, some of the ports for long and doubleword operands are shared between functional units. This places a constraint on which
long or doubleword operations can be scheduled on a data path in the same
execute packet. See section 3.7.5.
2.6 Data Address Paths
The data address paths (DA1 and DA2) are each connected to the .D units in
both data paths. This allows data addresses generated by any one path to
access data to or from any register.
The DA1 and DA2 resources and their associated data paths are specified as
T1 and T2, respectively. T1 consists of the DA1 address path and the LD1 and
ST1 data paths. For the C67x DSP, LD1 is comprised of LD1a and LD1b to
support 64-bit loads. Similarly, T2 consists of the DA2 address path and the
LD2 and ST2 data paths. For the C67x DSP, LD2 is comprised of LD2a and
LD2b to support 64-bit loads.
The T1 and T2 designations appear in the functional unit fields for load and
store instructions. For example, the following load instruction uses the .D1 unit
to generate the address but is using the LD2 path resource from DA2 to place
the data in the B register file. The use of the DA2 resource is indicated with the
T2 designation.
LDW .D1T2 *A0[3],B1
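The unit field and operand of the load instruction above can be unpacked with a short sketch. The helper names are hypothetical. The scaling of the bracketed offset by the 4-byte word size for LDW is an assumption taken from the C6000 assembly addressing conventions, which this section does not state.

```python
import re


def parse_ls_unit(field):
    """Split a load/store unit field such as '.D1T2' into (address unit, data path)."""
    match = re.fullmatch(r"\.D([12])T([12])", field)
    if match is None:
        raise ValueError("not a load/store unit field: %r" % field)
    return ".D" + match.group(1), "T" + match.group(2)


def dest_register_file(data_path):
    """T1 uses LD1/ST1 (register file A); T2 uses LD2/ST2 (register file B)."""
    return {"T1": "A", "T2": "B"}[data_path]


def ldw_scaled_address(base, index):
    """Byte address for LDW *base[index].

    Assumption: a bracketed offset counts 32-bit words, so it is
    multiplied by 4 before being added to the base address.
    """
    return (base + 4 * index) & 0xFFFFFFFF
```

For the example instruction, parse_ls_unit(".D1T2") gives the .D1 unit for address generation and the T2 path, so the data lands in register file B. If A0 held 1000h, the load would read the word at 100Ch under the scaling assumption.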
2.7 Control Register File
Table 2−3 lists the control registers contained in the control register file.
Table 2−3. Control Registers

Acronym   Register Name                              Section
AMR       Addressing mode register                   2.7.3
CSR       Control status register                    2.7.4
ICR       Interrupt clear register                   2.7.5
IER       Interrupt enable register                  2.7.6
IFR       Interrupt flag register                    2.7.7
IRP       Interrupt return pointer register          2.7.8
ISR       Interrupt set register                     2.7.9
ISTP      Interrupt service table pointer register   2.7.10
2.7.1 Register Addresses for Accessing the Control Registers
Table 2−4 lists the register addresses for accessing the control register file.
One unit (.S2) can read from and write to the control register file. Each control
register is accessed by the MVC instruction. See the MVC instruction description, page 3-180, for information on how to use this instruction.
Additionally, some of the control register bits are specially accessed in other
ways. For example, arrival of a maskable interrupt on an external interrupt pin,
INTm, triggers the setting of flag bit IFRm. Subsequently, when that interrupt
is processed, this triggers the clearing of IFRm and the clearing of the global
interrupt enable bit, GIE. Finally, when that interrupt processing is complete,
the B IRP instruction in the interrupt service routine restores the pre-interrupt
value of the GIE. Similarly, saturating instructions like SADD set the SAT
(saturation) bit in the control status register (CSR).
Table 2−4. Register Addresses for Accessing the Control Registers

Acronym   Register Name                              Address   Read/Write
AMR       Addressing mode register                   00000     R, W
CSR       Control status register                    00001     R, W
FADCR     Floating-point adder configuration         10010     R, W
FAUCR     Floating-point auxiliary configuration     10011     R, W
FMCR      Floating-point multiplier configuration    10100     R, W
ICR       Interrupt clear register                   00011     W
IER       Interrupt enable register                  00100     R, W
IFR       Interrupt flag register                    00010     R
IRP       Interrupt return pointer                   00110     R, W
ISR       Interrupt set register                     00010     W
ISTP      Interrupt service table pointer            00101     R, W
NRP       Nonmaskable interrupt return pointer       00111     R, W
PCE1      Program counter, E1 phase                  10000     R

Legend: R = Readable by the MVC instruction; W = Writeable by the MVC instruction
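The address and access columns of Table 2−4 can be captured as a lookup table. This is an illustrative sketch only; the dictionary layout and the helper names mvc_allowed and register_at are hypothetical.

```python
# name: (5-bit register address, access by MVC), transcribed from Table 2-4
CONTROL_REGISTERS = {
    "AMR":   (0b00000, "RW"),
    "CSR":   (0b00001, "RW"),
    "FADCR": (0b10010, "RW"),
    "FAUCR": (0b10011, "RW"),
    "FMCR":  (0b10100, "RW"),
    "ICR":   (0b00011, "W"),
    "IER":   (0b00100, "RW"),
    "IFR":   (0b00010, "R"),
    "IRP":   (0b00110, "RW"),
    "ISR":   (0b00010, "W"),
    "ISTP":  (0b00101, "RW"),
    "NRP":   (0b00111, "RW"),
    "PCE1":  (0b10000, "R"),
}


def mvc_allowed(name, direction):
    """Return True if MVC may read ('R') or write ('W') the named register."""
    _, access = CONTROL_REGISTERS[name]
    return direction in access


def register_at(address, direction):
    """Return the register selected by an address for a read or a write."""
    for name, (addr, access) in CONTROL_REGISTERS.items():
        if addr == address and direction in access:
            return name
    return None
```

The lookup makes one detail of the table explicit: IFR and ISR share address 00010, and the direction of the access selects between them (a read returns IFR, a write goes to ISR).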
2.7.2 Pipeline/Timing of Control Register Accesses
All MVC instructions are single-cycle instructions that complete their access
of the explicitly named registers in the E1 pipeline phase. This is true whether
MVC is moving a general register to a control register, or conversely. In all
cases, the source register content is read, moved through the .S2 unit, and
written to the destination register in the E1 pipeline phase.
Pipeline Stage    E1
Read              src2
Written           dst
Unit in use       .S2
Even though MVC modifies the particular target control register in a single
cycle, it can take extra clocks to complete modification of the non-explicitly
named register. For example, the MVC cannot modify bits in the IFR directly.
Instead, MVC can only write 1’s into the ISR or the ICR to specify setting or
clearing, respectively, of the IFR bits. MVC completes this ISR/ICR write in a
single (E1) cycle but the modification of the IFR bits occurs one clock later. For
more information on the manipulation of ISR, ICR, and IFR, see section 2.7.9,
section 2.7.5, and section 2.7.7.
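The indirect update of the IFR can be modeled as a small state machine. This is a simplified sketch under stated assumptions: the class and method names are hypothetical, every bit is treated as settable and clearable (reserved or restricted bits are not modeled), and one call to clock() stands for one CPU clock.

```python
class InterruptFlagModel:
    """Toy model: MVC writes to ISR/ICR change IFR one clock later."""

    def __init__(self):
        self.ifr = 0
        self._pending = []  # updates issued by MVC, applied on the next clock

    def mvc_write_isr(self, mask):
        """Writing 1s to ISR requests that the matching IFR bits be set."""
        self._pending.append(("set", mask & 0xFFFFFFFF))

    def mvc_write_icr(self, mask):
        """Writing 1s to ICR requests that the matching IFR bits be cleared."""
        self._pending.append(("clear", mask & 0xFFFFFFFF))

    def clock(self):
        """Advance one clock: apply the updates requested earlier."""
        for kind, mask in self._pending:
            if kind == "set":
                self.ifr |= mask
            else:
                self.ifr &= ~mask & 0xFFFFFFFF
        self._pending = []
```

In this model the MVC write itself completes immediately, but reading ifr before the next clock() still returns the old value, which mirrors the one-clock delay described above.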
Saturating instructions, such as SADD, set the saturation flag bit (SAT) in CSR
indirectly. As a result, several of these instructions update the SAT bit one full
clock cycle after their primary results are written to the register file. For example, the SMPY instruction writes its result at the end of pipeline stage E2; its
primary result is available after one delay slot. In contrast, the SAT bit in CSR
is updated one cycle later than the result is written; this update occurs after two
delay slots. (For the specific behavior of an instruction, refer to the description
of that individual instruction).
The B IRP and B NRP instructions directly update the GIE and NMIE,
respectively. Because these branches directly modify CSR and IER,
respectively, there are no delay slots between when the branch is issued and
when the control register updates take effect.
2.7.3 Addressing Mode Register (AMR)
For each of the eight registers (A4–A7, B4–B7) that can perform linear or circular addressing, the addressing mode register (AMR) specifies the addressing
mode. A 2-bit field for each register selects the address modification mode:
linear (the default) or circular mode. With circular addressing, the field also
specifies which BK (block size) field to use for a circular buffer. In addition, the
buffer must be aligned on a byte boundary equal to the block size. The mode
select fields and block size fields are shown in Figure 2−3 and described in
Table 2−5.
Figure 2−3. Addressing Mode Register (AMR)
Bits     Field      Access
31−26    Reserved   R-0
25−21    BK1        R/W-0
20−16    BK0        R/W-0
15−14    B7 MODE    R/W-0
13−12    B6 MODE    R/W-0
11−10    B5 MODE    R/W-0
9−8      B4 MODE    R/W-0
7−6      A7 MODE    R/W-0
5−4      A6 MODE    R/W-0
3−2      A5 MODE    R/W-0
1−0      A4 MODE    R/W-0
Legend: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; -n = value after reset
Table 2−5. Addressing Mode Register (AMR) Field Descriptions
Bit      Field      Value    Description
31−26    Reserved   0        Reserved. The reserved bit location is always read as 0. A value written to
                             this field has no effect.
25−21    BK1        0−1Fh    Block size field 1. A 5-bit value used in calculating block sizes for circular
                             addressing. Table 2−6 shows block size calculations for all 32 possibilities.
                             Block size (in bytes) = 2^(N+1), where N is the 5-bit value in BK1
20−16    BK0        0−1Fh    Block size field 0. A 5-bit value used in calculating block sizes for circular
                             addressing. Table 2−6 shows block size calculations for all 32 possibilities.
                             Block size (in bytes) = 2^(N+1), where N is the 5-bit value in BK0
15−14    B7 MODE    0−3h     Address mode selection for register file B7.
                    0        Linear modification (default at reset)
                    1h       Circular addressing using the BK0 field
                    2h       Circular addressing using the BK1 field
                    3h       Reserved
13−12    B6 MODE    0−3h     Address mode selection for register file B6.
                    0        Linear modification (default at reset)
                    1h       Circular addressing using the BK0 field
                    2h       Circular addressing using the BK1 field
                    3h       Reserved
11−10    B5 MODE    0−3h     Address mode selection for register file B5.
                    0        Linear modification (default at reset)
                    1h       Circular addressing using the BK0 field
                    2h       Circular addressing using the BK1 field
                    3h       Reserved
9−8      B4 MODE    0−3h     Address mode selection for register file B4.
                    0        Linear modification (default at reset)
                    1h       Circular addressing using the BK0 field
                    2h       Circular addressing using the BK1 field
                    3h       Reserved
7−6      A7 MODE    0−3h     Address mode selection for register file A7.
                    0        Linear modification (default at reset)
                    1h       Circular addressing using the BK0 field
                    2h       Circular addressing using the BK1 field
                    3h       Reserved
5−4      A6 MODE    0−3h     Address mode selection for register file A6.
                    0        Linear modification (default at reset)
                    1h       Circular addressing using the BK0 field
                    2h       Circular addressing using the BK1 field
                    3h       Reserved
3−2      A5 MODE    0−3h     Address mode selection for register file A5.
                    0        Linear modification (default at reset)
                    1h       Circular addressing using the BK0 field
                    2h       Circular addressing using the BK1 field
                    3h       Reserved
1−0      A4 MODE    0−3h     Address mode selection for register file A4.
                    0        Linear modification (default at reset)
                    1h       Circular addressing using the BK0 field
                    2h       Circular addressing using the BK1 field
                    3h       Reserved
Table 2−6. Block Size Calculations
BKn Value    Block Size    BKn Value    Block Size
00000        2             10000        131 072
00001        4             10001        262 144
00010        8             10010        524 288
00011        16            10011        1 048 576
00100        32            10100        2 097 152
00101        64            10101        4 194 304
00110        128           10110        8 388 608
00111        256           10111        16 777 216
01000        512           11000        33 554 432
01001        1 024         11001        67 108 864
01010        2 048         11010        134 217 728
01011        4 096         11011        268 435 456
01100        8 192         11100        536 870 912
01101        16 384        11101        1 073 741 824
01110        32 768        11110        2 147 483 648
01111        65 536        11111        4 294 967 296
Note: When n is 11111, the behavior is identical to linear addressing.
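The AMR field layout and the block size formula can be checked with a short host-side sketch. This is not TI code: the helper names (block_size, amr_value, is_aligned) are hypothetical, and only the bit positions, the mode encodings, and the 2^(N+1) formula are taken from Table 2−5 and Table 2−6.

```python
# Illustrative model of the AMR encoding; helper names are hypothetical.
MODE_LINEAR = 0b00        # linear modification (default at reset)
MODE_CIRCULAR_BK0 = 0b01  # circular addressing using the BK0 field
MODE_CIRCULAR_BK1 = 0b10  # circular addressing using the BK1 field

# Position of the least-significant bit of each 2-bit MODE field (Table 2-5).
MODE_SHIFT = {"A4": 0, "A5": 2, "A6": 4, "A7": 6,
              "B4": 8, "B5": 10, "B6": 12, "B7": 14}


def block_size(n):
    """Block size in bytes for a 5-bit BKn value N: 2**(N + 1)."""
    if not 0 <= n <= 0x1F:
        raise ValueError("BKn is a 5-bit field")
    return 1 << (n + 1)


def amr_value(modes, bk0=0, bk1=0):
    """Pack MODE fields (register name -> mode) and BK0/BK1 into a 32-bit value."""
    if not (0 <= bk0 <= 0x1F and 0 <= bk1 <= 0x1F):
        raise ValueError("BK0 and BK1 are 5-bit fields")
    value = (bk1 << 21) | (bk0 << 16)
    for reg, mode in modes.items():
        if mode not in (MODE_LINEAR, MODE_CIRCULAR_BK0, MODE_CIRCULAR_BK1):
            raise ValueError("mode 3h is reserved")
        value |= mode << MODE_SHIFT[reg]
    return value


def is_aligned(base_address, n):
    """Check that a buffer base sits on a boundary equal to the block size."""
    return base_address % block_size(n) == 0
```

For example, selecting circular addressing on A4 with a 16-byte block in BK0 (N = 3) gives amr_value({"A4": MODE_CIRCULAR_BK0}, bk0=3), which evaluates to 0x00030001.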
2.7.4 Control Status Register (CSR)
The control status register (CSR) contains control and status bits. The CSR
is shown in Figure 2−4 and described in Table 2−7. For the PWRD, EN, PCC,
and DCC fields, see the device-specific data manual to determine whether the
device supports the options that these fields control.
The power-down modes and their wake-up methods are programmed by the
PWRD field (bits 15−10) of CSR. The PWRD field of CSR is shown in
Figure 2−5. When writing to CSR, all bits of the PWRD field should be
configured at the same time. A logic 0 should be used when writing to the
reserved bit (bit 15) of the PWRD field.
Figure 2−4. Control Status Register (CSR)
Bits     Field           Access
31−24    CPU ID          R-0
23−16    REVISION ID†    R-x
15−10    PWRD            R/W-0
9        SAT             R/WC-0
8        EN              R-x
7−5      PCC             R/W-0
4−2      DCC             R/W-0
1        PGIE            R/W-0
0        GIE             R/W-0
Legend: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; WC = Bit is cleared on write; -n = value
after reset; -x = value is indeterminate after reset
† See the device-specific data manual for the default value of this field.
Figure 2−5. PWRD Field of Control Status Register (CSR)
Bit    Field                                   Access
15     Reserved                                R/W-0
14     Enabled or nonenabled interrupt wake    R/W-0
13     Enabled interrupt wake                  R/W-0
12     PD3                                     R/W-0
11     PD2                                     R/W-0
10     PD1                                     R/W-0
Legend: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; -n = value after reset
Table 2−7. Control Status Register (CSR) Field Descriptions
Bit      Field          Value      Description
31−24    CPU ID         0−FFh      Identifies the CPU of the device. Not writable by the MVC instruction.
                        0−1h       Reserved
                        2h         C67x CPU
                        3h         C67x+ CPU
                        4h−FFh     Reserved
23−16    REVISION ID    0−FFh      Identifies silicon revision of the CPU. For the most current silicon
                                   revision information, see the device-specific data manual. Not writable
                                   by the MVC instruction.
15−10    PWRD           0−3Fh      Power-down mode field. See Figure 2−5. Writable by the MVC instruction.
                        0          No power-down.
                        1h−8h      Reserved
                        9h         Power-down mode PD1; wake by an enabled interrupt.
                        Ah−10h     Reserved
                        11h        Power-down mode PD1; wake by an enabled or nonenabled interrupt.
                        12h−19h    Reserved
                        1Ah        Power-down mode PD2; wake by a device reset.
                        1Bh        Reserved
                        1Ch        Power-down mode PD3; wake by a device reset.
                        1Dh−3Fh    Reserved
9        SAT                       Saturate bit. Can be cleared only by the MVC instruction and can be set
                                   only by a functional unit. The set by a functional unit has priority over a
                                   clear (by the MVC instruction), if they occur on the same cycle. The SAT
                                   bit is set one full cycle (one delay slot) after a saturate occurs. The SAT
                                   bit will not be modified by a conditional instruction whose condition is false.
                        0          Any unit does not perform a saturate.
                        1          Any unit performs a saturate.
8        EN                        Endian mode. Not writable by the MVC instruction.
                        0          Big endian
                        1          Little endian
7−5      PCC            0−7h       Program cache control mode. Writable by the MVC instruction. See the
1        PGIE                      Previous GIE (global interrupt enable). Copy of GIE bit at point when
                                   interrupt is taken. Physically the same bit as SGIE bit in the interrupt task
                                   state register (ITSR). Writeable by the MVC instruction.
                        0          Disables saving GIE bit when an interrupt is taken.
                        1          Enables saving GIE bit when an interrupt is taken.
0        GIE                       Global interrupt enable. Physically the same bit as GIE bit in the task state
                                   register (TSR). Writable by the MVC instruction.
                        0          Disables all interrupts, except the reset interrupt and NMI (nonmaskable
                                   interrupt).
                        1          Enables all interrupts.
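As an illustration of the CSR layout, the following host-side sketch splits a 32-bit CSR value into its fields. The function names and the dictionary keys are hypothetical; the bit positions are taken from Figure 2−4 and the PWRD encodings from Table 2−7. It is a decoding aid under those assumptions, not a model of CPU behavior.

```python
# Illustrative CSR decoder; names are hypothetical, bit positions follow Figure 2-4.
PWRD_MODES = {
    0x00: "no power-down",
    0x09: "PD1, wake by an enabled interrupt",
    0x11: "PD1, wake by an enabled or nonenabled interrupt",
    0x1A: "PD2, wake by a device reset",
    0x1C: "PD3, wake by a device reset",
}


def bits(value, hi, lo):
    """Extract bits hi..lo (inclusive) of value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_csr(csr):
    """Return the CSR fields of a 32-bit value as a dictionary."""
    pwrd = bits(csr, 15, 10)
    return {
        "CPU_ID": bits(csr, 31, 24),
        "REVISION_ID": bits(csr, 23, 16),
        "PWRD": pwrd,
        "PWRD_MODE": PWRD_MODES.get(pwrd, "reserved"),
        "SAT": bits(csr, 9, 9),
        "EN": bits(csr, 8, 8),
        "PCC": bits(csr, 7, 5),
        "DCC": bits(csr, 4, 2),
        "PGIE": bits(csr, 1, 1),
        "GIE": bits(csr, 0, 0),
    }
```

For instance, decode_csr(0x02002503) reports CPU ID 2h (C67x CPU), PWRD 9h (PD1, wake by an enabled interrupt), little-endian mode, and both PGIE and GIE set.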
2.7.5 Interrupt Clear Register (ICR)
The interrupt clear register (ICR) allows you to manually clear the maskable
interrupts (INT15−INT4) in the interrupt flag register (IFR). Writing a 1 to any
of the bits in ICR causes the corresponding interrupt flag (IFn) to be cleared
in IFR. Writing a 0 to any bit in ICR has no effect. Incoming interrupts have
priority and override any write to ICR. You cannot set any bit in ICR to affect
NMI or reset. The ICR is shown in Figure 2−6 and described in Table 2−8.
Note:
Any write to ICR (by the MVC instruction) effectively has one delay slot
because the results cannot be read (by the MVC instruction) in IFR until two
cycles after the write to ICR.
A write to a bit in ICR is ignored if the same bit in the interrupt set
register (ISR) is written simultaneously.
Legend: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; -n = value after reset
Table 2−9. Interrupt Enable Register (IER) Field Descriptions
Bit      Field      Value    Description
31−16    Reserved   0        Reserved. The reserved bit location is always read as 0. A value written to this
                             field has no effect.
15−4     IEn                 Interrupt enable. An interrupt triggers interrupt processing only if the
                             corresponding bit is set to 1.
                    0        Interrupt is disabled.
                    1        Interrupt is enabled.
3−2      Reserved   0        Reserved. The reserved bit location is always read as 0. A value written to this
                             field has no effect.
1        NMIE                Nonmaskable interrupt enable. An interrupt triggers interrupt processing only if
                             the bit is set to 1.
                             The NMIE bit is cleared at reset. After reset, you must set the NMIE bit to
                             enable the NMI and to allow INT15−INT4 to be enabled by the GIE bit in CSR
                             and the corresponding IER bit. You cannot manually clear the NMIE bit; a write
                             of 0 has no effect. The NMIE bit is also cleared by the occurrence of an NMI.
                    0        All nonreset interrupts are disabled.
                    1        All nonreset interrupts are enabled. The NMIE bit is set only by completing a
                             B NRP instruction or by a write of 1 to the NMIE bit.
0        1          1        Reset interrupt enable. You cannot disable the reset interrupt.
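The enable conditions described in Table 2−9 can be summarized in a small sketch. The function name is hypothetical. The rule it encodes, that a maskable interrupt INTn needs GIE in CSR, NMIE in IER, and its own IEn bit all set, is read from the NMIE description above; any other conditions that the interrupt chapter may impose are not modeled.

```python
def maskable_interrupt_enabled(n, ier, csr):
    """Return True if INTn (n = 4..15) is enabled by GIE, NMIE, and IEn.

    Hypothetical helper: ier and csr are 32-bit register values.
    """
    if not 4 <= n <= 15:
        raise ValueError("maskable interrupts are INT4 through INT15")
    gie = csr & 1            # CSR bit 0
    nmie = (ier >> 1) & 1    # IER bit 1
    ien = (ier >> n) & 1     # IER bit n
    return bool(gie and nmie and ien)
```

With IER = 0013h (IE4, NMIE, and the reset enable bit set) and GIE = 1, INT4 is enabled; clearing either NMIE or GIE disables it.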
2.7.7 Interrupt Flag Register (IFR)
The interrupt flag register (IFR) contains the status of INT4−INT15 and NMI
interrupt. Each corresponding bit in the IFR is set to 1 when that interrupt
occurs; otherwise, the bits are cleared to 0. If you want to check the status of
interrupts, use the MVC instruction to read the IFR. (See the MVC instruction
description, page 3-180, for information on how to use this instruction.) The
IFR is shown in Figure 2−8 and described in Table 2−10.
Legend: R = Readable by the MVC instruction; -n = value after reset
Table 2−10. Interrupt Flag Register (IFR) Field Descriptions
Bit      Field      Value    Description
31−16    Reserved   0        Reserved. The reserved bit location is always read as 0. A value written to this
                             field has no effect.
15−4     IFn                 Interrupt flag. Indicates the status of the corresponding maskable interrupt. An
                             interrupt flag may be manually set by setting the corresponding bit (ISn) in the
                             interrupt set register (ISR) or manually cleared by setting the corresponding bit
                             (ICn) in the interrupt clear register (ICR).
                    0        Interrupt has not occurred.
                    1        Interrupt has occurred.
3−2      Reserved   0        Reserved. The reserved bit location is always read as 0. A value written to this
                             field has no effect.
1        NMIF                Nonmaskable interrupt flag.
                    0        Interrupt has not occurred.
                    1        Interrupt has occurred.
0        0          0        Reset interrupt flag.
2.7.8 Interrupt Return Pointer Register (IRP)
The interrupt return pointer register (IRP) contains the return pointer that
directs the CPU to the proper location to continue program execution after
processing a maskable interrupt. A branch using the address in IRP (B IRP)
in your interrupt service routine returns to the program flow when interrupt
servicing is complete. The IRP is shown in Figure 2−9.
The IRP contains the 32-bit address of the first execute packet in the program
flow that was not executed because of a maskable interrupt. Although you can
write a value to IRP, any subsequent interrupt processing may overwrite that
value.
Legend: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; -x = value is indeterminate after reset
2.7.9 Interrupt Set Register (ISR)
The interrupt set register (ISR) allows you to manually set the maskable interrupts (INT15−INT4) in the interrupt flag register (IFR). Writing a 1 to any of the
bits in ISR causes the corresponding interrupt flag (IFn) to be set in IFR. Writing a 0 to any bit in ISR has no effect. You cannot set any bit in ISR to affect
NMI or reset. The ISR is shown in Figure 2−10 and described in Table 2−11.
Note:
Any write to ISR (by the MVC instruction) effectively has one delay slot
because the results cannot be read (by the MVC instruction) in IFR until two
cycles after the write to ISR.
A write to a bit in the interrupt clear register (ICR) is ignored if the
same bit in ISR is written simultaneously.
Legend: R = Read only; W = Writeable by the MVC instruction; -n = value after reset
Table 2−11. Interrupt Set Register (ISR) Field Descriptions
Bit      Field      Value    Description
31−16    Reserved   0        Reserved. The reserved bit location is always read as 0. A value written to this
                             field has no effect.
15−4     ISn                 Interrupt set.
                    0        Corresponding interrupt flag (IFn) in IFR is not set.
                    1        Corresponding interrupt flag (IFn) in IFR is set.
3−0      Reserved   0        Reserved. The reserved bit location is always read as 0. A value written to this
                             field has no effect.
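The interaction of ISR writes, ICR writes, and incoming interrupts on the IFR can be sketched as a bit-level model. The function and parameter names are hypothetical, and the model covers only the priority rules stated in sections 2.7.5 and 2.7.9: an ISR write wins over a simultaneous ICR write to the same bit, incoming interrupts override an ICR write, and neither register can affect NMI or reset. The one-clock delay before the IFR changes is not modeled.

```python
MASKABLE = 0xFFF0  # IFR bits 15-4 (INT15-INT4)


def next_ifr(ifr, isr_write=0, icr_write=0, incoming=0):
    """Return the IFR value after simultaneous ISR/ICR writes and incoming interrupts.

    Hypothetical model: all arguments are bit masks in IFR bit positions.
    """
    set_bits = isr_write & MASKABLE
    # A clear is dropped where the same bit is being set or an interrupt arrives.
    clear_bits = icr_write & MASKABLE & ~set_bits & ~incoming
    return ((ifr & ~clear_bits) | set_bits | incoming) & 0xFFFF
```

For example, writing 0010h to both ISR and ICR in the same cycle leaves IF4 set, because the ICR write to that bit is ignored.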
2.7.10 Interrupt Service Table Pointer Register (ISTP)
The interrupt service table pointer register (ISTP) is used to locate the interrupt
service routine (ISR). The ISTB field identifies the base portion of the address
of the interrupt service table (IST) and the HPEINT field identifies the specific
interrupt and locates the specific fetch packet within the IST. The ISTP is
shown in Figure 2−11 and described in Table 2−12. See section 5.1.2.2 on
page 5-9 for a discussion of the use of the ISTP.
Figure 2−11. Interrupt Service Table Pointer Register (ISTP)
Bits     Field     Access
31−10    ISTB      R/W-0
9−5      HPEINT    R-0
4−0      00000     R-0
Legend: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; -n = value after reset
Table 2−12. Interrupt Service Table Pointer Register (ISTP) Field Descriptions
Bit      Field     Value         Description
31−10    ISTB      0−3F FFFFh    Interrupt service table base portion of the IST address. This field is cleared
                                 to 0 on reset; therefore, upon startup the IST must reside at address 0. After
                                 reset, you can relocate the IST by writing a new value to ISTB. If relocated,
                                 the first ISFP (corresponding to RESET) is never executed via interrupt
                                 processing, because reset clears the ISTB to 0. See Example 5−1.
9−5      HPEINT    0−1Fh         Highest priority enabled interrupt that is currently pending. This field indicates
                                 the number (related bit position in the IFR) of the highest priority interrupt (as
                                 defined in Table 5−1 on page 5-3) that is enabled by its bit in the IER. Thus,
                                 the ISTP can be used for manual branches to the highest priority enabled
                                 interrupt. If no interrupt is pending and enabled, HPEINT contains the value 0.
                                 The corresponding interrupt need not be enabled by NMIE (unless it is NMI)
                                 or by GIE.
4−0      −         0             Cleared to 0 (fetch packets must be aligned on 8-word (32-byte) boundaries).
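Because bits 4−0 are always 0 and HPEINT occupies bits 9−5, the ISTP value as a whole reads as a 32-byte-aligned address inside the IST. The sketch below packs and unpacks the fields of Table 2−12; the function names are hypothetical, and the code only illustrates the bit arithmetic.

```python
def istp_value(istb, hpeint):
    """Combine the 22-bit ISTB and 5-bit HPEINT fields into an ISTP value."""
    if not 0 <= istb <= 0x3FFFFF:
        raise ValueError("ISTB is a 22-bit field")
    if not 0 <= hpeint <= 0x1F:
        raise ValueError("HPEINT is a 5-bit field")
    return (istb << 10) | (hpeint << 5)  # bits 4-0 stay 0


def decode_istp(istp):
    """Split an ISTP value into its ISTB and HPEINT fields."""
    return {"ISTB": (istp >> 10) & 0x3FFFFF, "HPEINT": (istp >> 5) & 0x1F}
```

For example, with the IST relocated to address 800h (ISTB = 2) and INT4 as the highest priority enabled pending interrupt, istp_value(2, 4) gives 880h, which is 4 × 32 bytes past the table base.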
2.7.11 NMI Return Pointer Register (NRP)
The NMI return pointer register (NRP) contains the return pointer that directs
the CPU to the proper location to continue program execution after NMI
processing. A branch using the address in NRP (B NRP) in your interrupt
service routine returns to the program flow when NMI servicing is complete.
The NRP is shown in Figure 2−12.
The NRP contains the 32-bit address of the first execute packet in the program
flow that was not executed because of a nonmaskable interrupt. Although you
can write a value to NRP, any subsequent interrupt processing may overwrite
that value.
Figure 2−12. NMI Return Pointer Register (NRP)
Bits     Field    Access
31−0     NRP      R/W-x
Legend: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; -x = value is indeterminate after reset
2.7.12 E1 Phase Program Counter (PCE1)
The E1 phase program counter (PCE1), shown in Figure 2−13, contains the
32-bit address of the fetch packet in the E1 pipeline phase.
Figure 2−13. E1 Phase Program Counter (PCE1)
Bits     Field    Access
31−0     PCE1     R-x
Legend: R = Readable by the MVC instruction; -x = value is indeterminate after reset
2.8 Control Register File Extensions
The C67x DSP has three additional configuration registers to support floating-point operations. The registers specify the desired floating-point rounding
mode for the .L and .M units. They also contain fields to warn if src1 and src2
are NaN or denormalized numbers, and if the result overflows, underflows, is
inexact, infinite, or invalid. There are also fields to warn if a divide by 0 was
performed, or if a compare was attempted with a NaN source. Table 2−13 lists
the additional registers used. The OVER, UNDER, INEX, INVAL, DENn,
NANn, INFO, UNORD and DIV0 bits within these registers will not be modified
by a conditional instruction whose condition is false.
The floating-point adder configuration register (FADCR) contains fields that
specify underflow or overflow, the rounding mode, NaNs, denormalized
numbers, and inexact results for instructions that use the .L functional units.
FADCR has a set of fields specific to each of the .L units: .L2 uses bits 31−16
and .L1 uses bits 15−0. FADCR is shown in Figure 2−14 and described in
Table 2−14.
Note:
For the C67x+ DSP, the ADDSP, ADDDP, SUBSP, and SUBDP instructions
executing in the .S functional unit use the rounding mode from and set the
warning bits in FADCR. The warning bits in FADCR are the logical-OR of the
warnings produced on the .L functional unit and the warnings produced by
the ADDSP/ADDDP/SUBSP/SUBDP instructions on the .S functional unit
(but not other instructions executing on the .S functional unit).
The floating-point auxiliary register (FAUCR) contains fields that specify
underflow or overflow, the rounding mode, NaNs, denormalized numbers, and
inexact results for instructions that use the .S functional units. FAUCR has a
set of fields specific to each of the .S units: .S2 uses bits 31−16 and .S1 uses
bits 15−0. FAUCR is shown in Figure 2−15 and described in Table 2−15.
Note:
For the C67x+ DSP, the ADDSP, ADDDP, SUBSP, and SUBDP instructions
executing in the .S functional unit use the rounding mode from and set the
warning bits in the floating-point adder configuration register (FADCR). The
warning bits in FADCR are the logical-OR of the warnings produced on the
.L functional unit and the warnings produced by the ADDSP/ADDDP/
SUBSP/SUBDP instructions on the .S functional unit (but not other instructions executing on the .S functional unit).
The floating-point multiplier configuration register (FMCR) contains fields that
specify underflow or overflow, the rounding mode, NaNs, denormalized
numbers, and inexact results for instructions that use the .M functional units.
FMCR has a set of fields specific to each of the .M units: .M2 uses bits 31−16
and .M1 uses bits 15−0. FMCR is shown in Figure 2−16 and described in
Table 2−16.
This chapter describes the assembly language instructions of the
TMS320C67x DSP. Also described are parallel operations, conditional
operations, resource constraints, and addressing modes.
The C67x floating-point DSP uses all of the instructions available to the
TMS320C62x™ DSP but it also uses other instructions that are specific to the
C67x DSP. These specific instructions are for 32-bit integer multiply, doubleword load, and floating-point operations, including addition, subtraction, and
multiplication.
Table 3−1 explains the symbols used in the instruction descriptions.
Table 3−1. Instruction Operation and Execution Notations
Symbol            Meaning
abs(x)            Absolute value of x
and               Bitwise AND
−a                Perform 2s-complement subtraction using the addressing mode defined by the AMR
+a                Perform 2s-complement addition using the addressing mode defined by the AMR
b_i               Select bit i of source/destination b
bit_count         Count the number of bits that are 1 in a specified byte
bit_reverse       Reverse the order of bits in a 32-bit register
byte0             8-bit value in the least-significant byte position in 32-bit register (bits 0−7)
byte1             8-bit value in the next to least-significant byte position in 32-bit register (bits 8−15)
byte2             8-bit value in the next to most-significant byte position in 32-bit register (bits 16−23)
byte3             8-bit value in the most-significant byte position in 32-bit register (bits 24−31)
bv2               Bit vector of two flags for s2 or u2 data type
bv4               Bit vector of four flags for s4 or u4 data type
b_y..z            Selection of bits y through z of bit string b
cond              Check for either creg equal to 0 or creg not equal to 0
creg              3-bit field specifying a conditional register, see section 3.6
cstn              n-bit constant field (for example, cst5)
dint              64-bit integer value (two registers)
dp                Double-precision floating-point register value
dp(x)             Convert x to dp
dst_h or dst_o    msb32 of dst (placed in odd-numbered register of 64-bit register pair)
dst_l or dst_e    lsb32 of dst (placed in even-numbered register of a 64-bit register pair)
dws4              Four packed signed 16-bit integers in a 64-bit register pair
dwu4              Four packed unsigned 16-bit integers in a 64-bit register pair
gmpy              Galois Field Multiply
i2                Two packed 16-bit integers in a single 32-bit register
i4                Four packed 8-bit integers in a single 32-bit register
int               32-bit integer value
int(x)            Convert x to integer
lmb0(x)           Leftmost 0 bit search of x
lmb1(x)           Leftmost 1 bit search of x
long              40-bit integer value
lsbn or LSBn      n least-significant bits (for example, lsb16)
msbn or MSBn      n most-significant bits (for example, msb16)
nop               No operation
norm(x)           Leftmost nonredundant sign bit of x
not               Bitwise logical complement
op                Opfields
or                Bitwise OR
R                 Any general-purpose register
rcp(x)            Reciprocal approximation of x
ROTL              Rotate left
sat               Saturate
sbyte0            Signed 8-bit value in the least-significant byte position in 32-bit register (bits 0−7)
sbyte1            Signed 8-bit value in the next to least-significant byte position in 32-bit register (bits 8−15)
sbyte2            Signed 8-bit value in the next to most-significant byte position in 32-bit register (bits 16−23)
sbyte3            Signed 8-bit value in the most-significant byte position in 32-bit register (bits 24−31)
scstn             n-bit signed constant field
sdint             Signed 64-bit integer value (two registers)
se                Sign-extend
sint              Signed 32-bit integer value
slong             Signed 40-bit integer value
sllong            Signed 64-bit integer value
slsb16            Signed 16-bit integer value in lower half of 32-bit register
smsb16            Signed 16-bit integer value in upper half of 32-bit register
sp                Single-precision floating-point register value that can optionally use cross path
sp(x)             Convert x to sp
sqrcp(x)          Square root of reciprocal approximation of x
src1_h            msb32 of src1
src1_l            lsb32 of src1
src2_h            msb32 of src2
src2_l            lsb32 of src2
s2                Two packed signed 16-bit integers in a single 32-bit register
s4                Four packed signed 8-bit integers in a single 32-bit register
−s                Perform 2s-complement subtraction and saturate the result to the result size, if an overflow occurs
+s                Perform 2s-complement addition and saturate the result to the result size, if an overflow occurs
ubyte0            Unsigned 8-bit value in the least-significant byte position in 32-bit register (bits 0−7)
ubyte1            Unsigned 8-bit value in the next to least-significant byte position in 32-bit register (bits 8−15)
ubyte2            Unsigned 8-bit value in the next to most-significant byte position in 32-bit register (bits 16−23)
ubyte3            Unsigned 8-bit value in the most-significant byte position in 32-bit register (bits 24−31)
ucstn             n-bit unsigned constant field (for example, ucst5)
uint              Unsigned 32-bit integer value
ulong             Unsigned 40-bit integer value
ullong            Unsigned 64-bit integer value
ulsb16            Unsigned 16-bit integer value in lower half of 32-bit register
umsb16            Unsigned 16-bit integer value in upper half of 32-bit register
u2                Two packed unsigned 16-bit integers in a single 32-bit register
u4                Four packed unsigned 8-bit integers in a single 32-bit register
x clear b,e       Clear a field in x, specified by b (beginning bit) and e (ending bit)
x ext l,r         Extract and sign-extend a field in x, specified by l (shift left value) and r (shift right value)
x extu l,r        Extract an unsigned field in x, specified by l (shift left value) and r (shift right value)
x set b,e         Set field in x to all 1s, specified by b (beginning bit) and e (ending bit)
xint              32-bit integer value that can optionally use cross path
xor               Bitwise exclusive-OR
xsint             Signed 32-bit integer value that can optionally use cross path
xslsb16           Signed 16 LSB of register that can optionally use cross path
xsmsb16           Signed 16 MSB of register that can optionally use cross path
xsp               Single-precision floating-point register value that can optionally use cross path
xs2               Two packed signed 16-bit integers in a single 32-bit register that can optionally use cross path
xs4               Four packed signed 8-bit integers in a single 32-bit register that can optionally use cross path
xuint             Unsigned 32-bit integer value that can optionally use cross path
xulsb16           Unsigned 16 LSB of register that can optionally use cross path
xumsb16           Unsigned 16 MSB of register that can optionally use cross path
xu2               Two packed unsigned 16-bit integers in a single 32-bit register that can optionally use cross path
xu4               Four packed unsigned 8-bit integers in a single 32-bit register that can optionally use cross path
→                 Assignment
+                 Addition
++                Increment by 1
×                 Multiplication
−                 Subtraction
==                Equal to
>                 Greater than
>=                Greater than or equal to
<                 Less than
<=                Less than or equal to
<<                Shift left
>>                Shift right
>>s               Shift right with sign extension
>>z               Shift right with a zero fill
~                 Logical inverse
&                 Logical AND
3.2 Instruction Syntax and Opcode Notations
Table 3−2 explains the syntaxes and opcode fields used in the instruction
descriptions.
The C67x CPU 32-bit opcodes are mapped in Appendix C through Appendix G.
Table 3−2. Instruction Syntax and Opcode Notations
Symbol     Meaning
baseR      base address register
CC
creg       3-bit field specifying a conditional register, see section 3.6
cst        constant
csta       constant a
cstb       constant b
cstn       n-bit constant field
dst        destination
dstms
dw         doubleword; 0 = word, 1 = doubleword
ii_n       bit n of the constant ii
ld/st      load or store; 0 = store, 1 = load
mode       addressing mode, see section 3.8
offsetR    register offset
op         opfield; field within opcode that specifies a unique instruction
op_n       bit n of the opfield
p          parallel execution; 0 = next instruction is not executed in parallel, 1 = next instruction is
           executed in parallel
r          LDDW instruction
rsv        reserved
s          side A or B for destination; 0 = side A, 1 = side B.
sc         scaling mode; 0 = nonscaled, offsetR/ucst5 is not shifted; 1 = scaled, offsetR/ucst5 is shifted
scstn      n-bit signed constant field
scst_n     bit n of the signed constant field
sn         sign
src        source
src1       source 1
src2       source 2
srcms
stg_n      bit n of the constant stg
t          side of source/destination (src/dst) register; 0 = side A, 1 = side B
ucstn      n-bit unsigned constant field
ucst_n     bit n of the unsigned constant field
unit       unit decode
x          cross path for src2; 0 = do not use cross path, 1 = use cross path
y          .D1 or .D2 unit; 0 = .D1 unit, 1 = .D2 unit
z          test for equality with zero or nonzero
3.3 Overview of IEEE Standard Single- and Double-Precision Formats
Floating-point operands are classified as single-precision (SP) and double-precision (DP). Single-precision floating-point values are 32-bit values stored
in a single register. Double-precision floating-point values are 64-bit values
stored in a register pair. The register pair consists of consecutive even and odd
registers from the same register file. The 32 least-significant-bits are loaded
into the even register; the 32 most-significant-bits containing the sign bit and
exponent are loaded into the next register (that is always the odd register). The
register pair syntax places the odd register first, followed by a colon, then the
even register (that is, A1:A0, B1:B0, A3:A2, B3:B2, etc.).
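The split of a double-precision value across a register pair can be illustrated on a host machine. In this sketch the function names are hypothetical, and Python's struct module stands in for the register file: the high word is what the text places in the odd register, and the low word is what it places in the even register.

```python
import struct


def dp_to_register_pair(x):
    """Return (odd_register_word, even_register_word) for a double-precision value."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    high = raw >> 32          # sign, exponent, upper fraction bits: odd register
    low = raw & 0xFFFFFFFF    # 32 least-significant bits: even register
    return high, low


def register_pair_to_dp(high, low):
    """Rebuild the double-precision value from the odd:even register words."""
    (x,) = struct.unpack(">d", struct.pack(">Q", (high << 32) | low))
    return x
```

For example, 1.0 held in A1:A0 corresponds to dp_to_register_pair(1.0), that is 3FF0 0000h in A1 and 0000 0000h in A0.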
Instructions that use DP sources fall in two categories: instructions that read
the upper and lower 32-bit words on separate cycles, and instructions that
read both 32-bit words on the same cycle. All instructions that produce a
double-precision result write the low 32-bit word one cycle before writing the
high 32-bit word. If an instruction that writes a DP result is followed by an
instruction that uses the result as its DP source and it reads the upper and lower words on separate cycles, then the second instruction can be scheduled on
the same cycle that the high 32-bit word of the result is written. The lower result
is written on the previous cycle. This is because the second instruction reads
the low word of the DP source one cycle before the high word of the DP source.
IEEE floating-point numbers consist of normal numbers, denormalized
numbers, NaNs (not a number), and infinity numbers. Denormalized numbers
are nonzero numbers that are smaller than the smallest nonzero normal
number. Infinity is a value that represents an infinite floating-point number.
NaN values represent results for invalid operations, such as (+infinity +
(−infinity)).
Normal single-precision values are always accurate to at least six decimal
places, sometimes up to nine decimal places. Normal double-precision values
are always accurate to at least 15 decimal places, sometimes up to 17 decimal
places.
Table 3−3 shows notations used in discussing floating-point numbers.
Table 3−3. IEEE Floating-Point Notations
Symbol            Meaning
s                 Sign bit
e                 Exponent field
f                 Fraction (mantissa) field
x                 Can have value of 0 or 1 (don’t care)
NaN               Not-a-Number (SNaN or QNaN)
SNaN              Signal NaN
QNaN              Quiet NaN
NaN_out           QNaN with all bits in the f field = 1
Inf               Infinity
LFPN              Largest floating-point number
SFPN              Smallest floating-point number
LDFPN             Largest denormalized floating-point number
SDFPN             Smallest denormalized floating-point number
signed Inf        +infinity or −infinity
signed NaN_out    NaN_out with s = 0 or 1
Figure 3−1 shows the fields of a single-precision floating-point number represented within a 32-bit register.
The floating-point fields represent floating-point numbers within two ranges:
normalized (e is between 0 and 255) and denormalized (e is 0). The following
formulas define how to translate the s, e, and f fields into a single-precision
floating-point number.
Normalized:
$(-1)^{s} \times 2^{(e-127)} \times 1.f, \quad 0 < e < 255$
Denormalized (Subnormal):
$(-1)^{s} \times 2^{-126} \times 0.f, \quad e = 0;\ f \text{ nonzero}$
Table 3−4 shows the s, e, and f values for special single-precision floating-point numbers.

Table 3−4. Special Single-Precision Values

Symbol    Sign (s)    Exponent (e)    Fraction (f)
+0        0           0               0
−0        1           0               0
+Inf      0           255             0
−Inf      1           255             0
NaN       x           255             nonzero
QNaN      x           255             1xx..x
SNaN      x           255             0xx..x and nonzero
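The normalized and denormalized formulas and the special values in Table 3−4 can be checked with a short host-side script. The following is a minimal Python sketch, not TI code; the function name decode_single and the returned tuple layout are illustrative assumptions.

```python
def decode_single(word):
    """Split a 32-bit word into s, e, f and classify it (hypothetical helper).

    Returns (s, e, f, kind, value). The field positions follow the
    single-precision layout: s = bit 31, e = bits 30-23, f = bits 22-0.
    """
    s = (word >> 31) & 0x1
    e = (word >> 23) & 0xFF
    f = word & 0x7FFFFF
    sign = -1.0 if s else 1.0
    if e == 255:
        if f == 0:
            return s, e, f, "Inf", sign * float("inf")
        # The most significant fraction bit separates QNaN from SNaN.
        kind = "QNaN" if f & 0x400000 else "SNaN"
        return s, e, f, kind, float("nan")
    if e == 0:
        if f == 0:
            return s, e, f, "zero", sign * 0.0
        # Denormalized: (-1)^s * 2^-126 * 0.f
        return s, e, f, "denormalized", sign * 2.0 ** -126 * (f / 2.0 ** 23)
    # Normalized: (-1)^s * 2^(e-127) * 1.f
    return s, e, f, "normalized", sign * 2.0 ** (e - 127) * (1 + f / 2.0 ** 23)
```

For example, decode_single(0x3F800000) classifies the word as normalized with value 1.0, and decode_single(0x7FFFFFFF) reports a QNaN.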
Table 3−5 shows hexadecimal and decimal values for some single-precision
floating-point numbers.
Table 3−5. Hexadecimal and Decimal Representation for Selected Single-Precision Values
Figure 3−2 shows the fields of a double-precision floating-point number represented within a pair of 32-bit registers.
The floating-point fields represent floating-point numbers within two ranges:
normalized (e is between 0 and 2047) and denormalized (e is 0). The following
formulas define how to translate the s, e, and f fields into a double-precision
floating-point number.
Normalized:
    $(-1)^{s} \times 2^{(e-1023)} \times 1.f$,    $0 < e < 2047$
Denormalized (Subnormal):
    $(-1)^{s} \times 2^{-1022} \times 0.f$,    $e = 0$; $f$ nonzero
Table 3−6 shows the s, e, and f values for special double-precision floating-point numbers.

Table 3−6. Special Double-Precision Values

Symbol    Sign (s)    Exponent (e)    Fraction (f)
+0        0           0               0
−0        1           0               0
+Inf      0           2047            0
−Inf      1           2047            0
NaN       x           2047            nonzero
QNaN      x           2047            1xx..x
SNaN      x           2047            0xx..x and nonzero
Table 3−7 shows hexadecimal and decimal values for some double-precision
floating-point numbers.
Table 3−7. Hexadecimal and Decimal Representation for Selected Double-Precision Values

Symbol     Hex Value               Decimal Value
NaN_out    7FFF FFFF FFFF FFFF     QNaN
0          0000 0000 0000 0000     0.0
−0         8000 0000 0000 0000     −0.0
1          3FF0 0000 0000 0000     1.0
2          4000 0000 0000 0000     2.0
LFPN       7FEF FFFF FFFF FFFF     1.7976931348623157e+308
SFPN       0010 0000 0000 0000     2.2250738585072014e−308
LDFPN      000F FFFF FFFF FFFF     2.2250738585072009e−308
SDFPN      0000 0000 0000 0001     4.9406564584124654e−324
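The hexadecimal patterns in Table 3−7 can be reproduced on any host whose floating-point type is IEEE double precision. This is a small Python sketch using the standard struct module; the helper names double_from_hex and double_fields are assumptions made for illustration.

```python
import struct


def double_from_hex(hex_digits):
    """Interpret 16 hex digits (spaces allowed) as an IEEE double."""
    word = int(hex_digits.replace(" ", ""), 16)
    return struct.unpack(">d", struct.pack(">Q", word))[0]


def double_fields(hex_digits):
    """Return the (s, e, f) fields: 1 sign bit, 11 exponent bits, 52 fraction bits."""
    word = int(hex_digits.replace(" ", ""), 16)
    s = word >> 63
    e = (word >> 52) & 0x7FF
    f = word & ((1 << 52) - 1)
    return s, e, f
```

For instance, double_from_hex("3FF0 0000 0000 0000") yields 1.0, and double_fields("7FEF FFFF FFFF FFFF") shows that LFPN has e = 2046 with every fraction bit set.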
3.4 Delay Slots
The execution of floating-point instructions can be defined in terms of delay
slots and functional unit latency. The number of delay slots is equivalent to the
number of additional cycles required after the source operands are read for the
result to be available for reading. For a single-cycle type instruction, operands
are read on cycle i and produce a result that can be read on cycle i + 1. For
a 4-cycle instruction, operands are read on cycle i and produce a result that
can be read on cycle i + 4. Table 3−8 shows the number of delay slots
associated with each type of instruction.
The double-precision floating-point addition, subtraction, multiplication,
compare, and the 32-bit integer multiply instructions also have a functional unit
latency that is greater than 1. The functional unit latency is equivalent to the
number of cycles that the instruction uses the functional unit read ports. For
example, the ADDDP instruction has a functional unit latency of 2. Operands
are read on cycle i and cycle i + 1. Therefore, a new instruction cannot begin
until cycle i + 2, rather than i + 1. ADDDP produces a result that can be read
on cycle i + 7, because it has six delay slots.
Delay slots are equivalent to an execution or result latency. Except for the
instructions identified above and in Table 3−8 as having a longer functional unit
latency, instructions on the C67x DSP have a functional unit latency of 1. This
means that a new instruction can be started on the functional unit each cycle.
Single-cycle throughput is another term for single-cycle functional unit latency.
Table 3−8. Delay Slot and Functional Unit Latency

Instruction     Delay    Functional
Type            Slots    Unit Latency    Read Cycles†              Write Cycles†
Single cycle    0        1               i                         i
2-cycle DP      1        1               i                         i, i + 1
DP compare      1        2               i, i + 1                  i + 1
4-cycle         3        1               i                         i + 3
INTDP           4        1               i                         i + 3, i + 4
Load            4        1               i                         i, i + 4‡
MPYSP2DP        4        2               i                         i + 3, i + 4
ADDDP/SUBDP     6        2               i, i + 1                  i + 5, i + 6
MPYSPDP         6        3               i, i + 1                  i + 5, i + 6
MPYI            8        4               i, i + 1, i + 2, i + 3    i + 8
MPYID           9        4               i, i + 1, i + 2, i + 3    i + 8, i + 9
MPYDP           9        4               i, i + 1, i + 2, i + 3    i + 8, i + 9

† Cycle i is in the E1 pipeline phase.
‡ A write on cycle i + 4 uses a separate write port from other .D unit instructions.
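The relationship between delay slots, functional unit latency, and cycle numbers can be expressed as two small formulas. The Python sketch below copies a few rows of Table 3−8 into a dictionary; the names TIMING, result_ready_cycle, and next_issue_cycle are illustrative assumptions, not part of any TI tool.

```python
# (delay slots, functional unit latency) for selected rows of Table 3-8
TIMING = {
    "single-cycle": (0, 1),
    "DP compare": (1, 2),
    "4-cycle": (3, 1),
    "ADDDP": (6, 2),
    "MPYI": (8, 4),
    "MPYDP": (9, 4),
}


def result_ready_cycle(kind, issue_cycle):
    """First cycle on which the complete result can be read."""
    delay_slots, _latency = TIMING[kind]
    return issue_cycle + delay_slots + 1


def next_issue_cycle(kind, issue_cycle):
    """First cycle on which the same functional unit can accept a new instruction."""
    _delay_slots, latency = TIMING[kind]
    return issue_cycle + latency
```

With an ADDDP issued on cycle i = 0, result_ready_cycle returns 7 and next_issue_cycle returns 2, matching the ADDDP description in this section.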
3.5 Parallel Operations
Instructions are always fetched eight at a time. This constitutes a fetch packet.
The basic format of a fetch packet is shown in Figure 3−3. Fetch packets are
aligned on 256-bit (8-word) boundaries.
Figure 3−3. Basic Format of a Fetch Packet

Each instruction occupies bits 31 through 0 of one word; the p-bit is bit 0 of each instruction.

Instruction                 A         B         C         D         E         F         G         H
LSBs of the byte address    00000b    00100b    01000b    01100b    10000b    10100b    11000b    11100b
The execution of the individual instructions is partially controlled by a bit in
each instruction, the p-bit. The p-bit (bit 0) determines whether the instruction
executes in parallel with another instruction. The p-bits are scanned from left
to right (lower to higher address). If the p-bit of instruction i is 1, then instruction
i + 1 is to be executed in parallel with (in the same cycle as) instruction i.
If the p-bit of instruction i is 0, then instruction i + 1 is executed in the cycle after
instruction i. All instructions executing in parallel constitute an execute packet.
An execute packet can contain up to eight instructions. Each instruction in an
execute packet must use a different functional unit.
On the C67x DSP, an execute packet cannot cross an 8-word boundary;
therefore, the last p-bit in a fetch packet is always cleared to 0, and each fetch
packet starts a new execute packet. On the C67x+ DSP, an execute packet
can cross an 8-word boundary.
There are three types of p-bit patterns for fetch packets. These three p-bit
patterns result in the following execution sequences for the eight instructions:
    Fully serial
    Fully parallel
    Partially serial
Example 3−1 through Example 3−3 show the conversion of a p-bit sequence
into a cycle-by-cycle execution stream of instructions.
Example 3−1. Fully Serial p-Bit Pattern in a Fetch Packet
Note: Instructions C, D, and E do not use any of the same functional units, cross paths, or
other data path resources. This is also true for instructions F, G, and H.
3.5.1 Example Parallel Code
The vertical bars || signify that an instruction is to execute in parallel with the
previous instruction. The code for the fetch packet in Example 3−3 would be
represented as this:
instruction A
instruction B
instruction C
|| instruction D
|| instruction E
instruction F
|| instruction G
|| instruction H
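The p-bit rule described in section 3.5 can be sketched as a short routine that groups a fetch packet into execute packets. This Python version is illustrative only; the function name execute_packets is an assumption, and the p-bit pattern used for the partially serial case is inferred from the code listing above rather than taken from Example 3−3 itself.

```python
def execute_packets(instructions, p_bits):
    """Group instructions into execute packets.

    A p-bit of 1 means the next instruction runs in the same cycle as this
    one; a p-bit of 0 closes the current execute packet.
    """
    packets = [[]]
    for name, p in zip(instructions, p_bits):
        packets[-1].append(name)
        if p == 0:
            packets.append([])
    if not packets[-1]:
        packets.pop()  # drop the empty packet opened after the final p = 0
    return packets
```

Calling execute_packets("ABCDEFGH", [0, 0, 1, 1, 0, 1, 1, 0]) gives four execute packets: A, then B, then C, D, and E together, then F, G, and H together.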
3.5.2 Branching Into the Middle of an Execute Packet
If a branch into the middle of an execute packet occurs, all instructions at lower
addresses are ignored. In Example 3−3, if a branch to the address containing
instruction D occurs, then only D and E execute. Even though instruction C is
in the same execute packet, it is ignored. Instructions A and B are also ignored
because they are in earlier execute packets. If your result depends on executing A, B, or C, the branch to the middle of the execute packet will produce an
erroneous result.
3.6 Conditional Operations
Most instructions can be conditional. The condition is controlled by a 3-bit
opcode field (creg) that specifies the condition register tested, and a 1-bit field
(z) that specifies a test for zero or nonzero. The four MSBs of every opcode
are creg and z. The specified condition register is tested at the beginning of
the E1 pipeline stage for all instructions. For more information on the pipeline,
see Chapter 4. If z = 1, the test is for equality with zero; if z = 0, the test is for
nonzero. The case of creg = 0 and z = 0 is treated as always true to allow
instructions to be executed unconditionally. The creg field is encoded in the
instruction opcode as shown in Table 3−9.
Table 3−9. Registers That Can Be Tested by Conditional Operations

Specified Conditional Register    creg (bits 31 30 29)    z (bit 28)
Unconditional                     0 0 0                   0
Reserved†                         0 0 0                   1
B0                                0 0 1                   z
B1                                0 1 0                   z
B2                                0 1 1                   z
A1                                1 0 0                   z
A2                                1 0 1                   z
Reserved                          1 1 x‡                  x‡

† This value is reserved for software breakpoints that are used for emulation purposes.
‡ x can be any value.
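The creg and z encoding in Table 3−9 can be decoded from the four most significant bits of an opcode. The following Python sketch is a hypothetical helper (decode_condition is not a TI function); it returns the condition in the bracket notation used in assembly source.

```python
# creg value -> condition register, per Table 3-9
CREG_REGISTERS = {0b001: "B0", 0b010: "B1", 0b011: "B2", 0b100: "A1", 0b101: "A2"}


def decode_condition(opcode):
    """Decode bits 31-28 of a 32-bit opcode into a condition string."""
    creg = (opcode >> 29) & 0x7
    z = (opcode >> 28) & 0x1
    if creg == 0:
        return "unconditional" if z == 0 else "reserved"
    if creg in CREG_REGISTERS:
        reg = CREG_REGISTERS[creg]
        # z = 1 tests for zero, written [!reg]; z = 0 tests for nonzero.
        return "[!%s]" % reg if z else "[%s]" % reg
    return "reserved"
```

For example, an opcode whose top four bits are 0011b decodes to [!B0], meaning the instruction executes only when B0 is zero.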
Conditional instructions are represented in code by using square brackets, [ ],
surrounding the condition register name. The following execute packet
contains two ADD instructions in parallel. The first ADD is conditional on B0
being nonzero. The second ADD is conditional on B0 being zero. The character ! indicates the inverse of the condition.
[B0] ADD .L1 A1,A2,A3
|| [!B0] ADD .L2 B1,B2,B3
The above instructions are mutually exclusive; only one will execute. If they
are scheduled in parallel, mutually exclusive instructions are constrained as
described in section 3.7. If mutually exclusive instructions share any resources
as described in section 3.7, they cannot be scheduled in parallel (put in the
same execute packet), even though only one will execute.
3.7 Resource Constraints
No two instructions within the same execute packet can use the same
resources. Also, no two instructions can write to the same register during the
same cycle. The following sections describe how an instruction can use each
of the resources.
3.7.1 Constraints on Instructions Using the Same Functional Unit
Two instructions using the same functional unit cannot be issued in the same
execute packet.
The following execute packet is invalid:
ADD .S1 A0, A1, A2 ;.S1 is used for
|| SHR .S1 A3, 15, A4 ;...both instructions
The following execute packet is valid:
ADD .L1 A0, A1, A2 ;Two different functional
|| SHR .S1 A3, 15, A4 ;...units are used
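A scheduling script could check this rule by looking for a functional unit that appears twice in one execute packet. The sketch below is a hypothetical Python helper (duplicated_units is an assumed name); it treats an execute packet as a list of (mnemonic, unit) pairs and ignores any cross-path or data-path suffix after the unit name.

```python
def duplicated_units(packet):
    """Return the set of functional units used by more than one instruction.

    packet: list of (mnemonic, unit) pairs such as ("ADD", ".S1").
    Only the first three characters of the unit are compared, so .S1X and
    .D1T1 count as .S1 and .D1.
    """
    seen = set()
    duplicates = set()
    for _mnemonic, unit in packet:
        name = unit.upper()[:3]
        if name in seen:
            duplicates.add(name)
        seen.add(name)
    return duplicates
```

An empty result means the packet satisfies this particular constraint; the other constraints in section 3.7 still have to be checked separately.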
3.7.2 Constraints on the Same Functional Unit Writing in the Same Instruction Cycle
Two instructions using the same functional unit cannot write their results in the
same instruction cycle.
3.7.3 Constraints on Cross Paths (1X and 2X)
One unit (either a .S, .L, or .M unit) per data path, per execute packet, can read
a source operand from its opposite register file via the cross paths (1X and 2X).
For example, the .S1 unit can read both its operands from the A register file; or
it can read an operand from the B register file using the 1X cross path and the
other from the A register file. The use of a cross path is denoted by an X following
the functional unit name in the instruction syntax (as in .S1X).
The following execute packet is invalid because the 1X cross path is being
used for two different B register operands:
MV .S1X B0, A0 ; \ Invalid. Instructions are using the 1X cross path
|| MV .L1X B1, A1 ; / with different B registers
The following execute packet is valid because all uses of the 1X cross path are
for the same B register operand, and all uses of the 2X cross path are for the
same A register operand:
ADD .L1X A0,B1,A1 ; \ 1X cross path using B1
|| SUB .S1X A2,B1,A2 ; / 1X cross path using B1
|| AND .D1 A4,A1,A3 ;
|| MPY .M1 A6,A1,A4 ;
|| ADD .L2 B0,B4,B2 ;
|| SUB .S2X B4,A4,B3 ; \ 2X cross path using A4
|| AND .D2X B5,A4,B4 ; / 2X cross path using A4
|| MPY .M2 B6,B4,B5 ;
The operand comes from a register file opposite of the destination, if the x bit
in the instruction field is set.
3.7.4 Constraints on Loads and Stores
Load and store instructions can use an address pointer from one register file
while loading to or storing from the other register file. Two load and store
instructions using a destination/source from the same register file cannot be
issued in the same execute packet. The address register must be on the same
side as the .D unit used.
The following execute packet is invalid:
LDW .D1 *A0,A1 ; \ .D2 unit must use the address
|| LDW .D2 *A2,B2 ; / register from the B register file
Two loads and/or stores loading to and/or storing from the same register file
cannot be issued in the same execute packet.
The following execute packet is invalid:
LDW .D1 *A4,A5 ; \ Loading to and storing from the
|| STW .D2 A6,*B4 ; / same register file
The following execute packets are valid:
LDW .D1 *A4,B5 ; \ Loading to, and storing from
|| STW .D2 A6,*B4 ; / different register files

LDW .D1 *A0,B2 ; \ Loading to
|| LDW .D2 *B0,A1 ; / different register files
3.7.5 Constraints on Long (40-Bit) Data
Because the .S and .L units share a read register port for long source operands
and a write register port for long results, only one long result may be issued
per register file in an execute packet. All instructions with a long result on the
.S and .L units have zero delay slots. See section 2.2 for the order for long
pairs.
The following execute packet is invalid:
ADD .L1 A5:A4,A1,A3:A2 ; \ Two long writes
|| SHL .S1 A8,A9,A7:A6 ; / on A register file
The following execute packet is valid:
ADD .L1 A5:A4,A1,A3:A2 ; \ One long write for
|| SHL .S2 B8,B9,B7:B6 ; / each register file
Because the .L and .S units share their long read port with the store port,
operations that read a long value cannot be issued on the .L and/or .S units
in the same execute packet as a store.
The following execute packet is invalid:
ADD .L1 A5:A4,A1,A3:A2 ; \ Long read operation and a
|| STW .D1 A8,*A9 ; / store
The following execute packet is valid:
ADD .L1 A4,A1,A3:A2 ; \ No long read with
|| STW .D1 A8,*A9 ; / the store
On the C67x DSP, doubleword load instructions conflict with long results from
the .S units. All stores conflict with a long source on the .S unit. The following
execute packet is invalid, because the .D unit store on the T1 path conflicts with
the long source on the .S1 unit:
ADD .S1 A1,A5:A4, A3:A2 ; \ Long source on .S unit and a store
|| STW .D1T1 A8,*A9 ; / on the T1 path of the .D unit
The following code sequence is invalid:
LDDW .D1T1 *A16,A11:A10 ; \ Double word load written to
; A11:A10 on .D1
NOP 3 ; conflicts after 3 cycles
SHL .S1 A8,A9,A7:A6 ; / with write to A7:A6 on .S1
The following execute packets are valid:
ADD .L1 A1,A5:A4,A3:A2 ; \ One long write for
|| SHL .S2 B8,B9,B7:B6 ; / each register file

ADD .L1 A4,A1,A3:A2 ; \ No long read with
|| STW .D1T1 A8,*A9 ; / the store on T1 path of .D1
3.7.6 Constraints on Register Reads
More than four reads of the same register cannot occur on the same cycle.
Conditional registers are not included in this count.
The following execute packets are invalid:
MPY .M1 A1, A1, A4 ; five reads of register A1
|| ADD .L1 A1, A1, A5
|| SUB .D1 A1, A2, A3

MPY .M1 A1, A1, A4 ; five reads of register A1
|| ADD .L1 A1, A1, A5
|| SUB .D2X A1, B2, B3
The following execute packet is valid:
MPY .M1 A1, A1, A4 ; only four reads of A1
|| [A1] ADD .L1 A0, A1, A5
|| SUB .D1 A1, A2, A3
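The four-read limit can be checked by counting source operands per register while leaving condition registers out of the count. This Python sketch is illustrative; the data layout (a condition register or None, followed by a list of source registers) and the function names are assumptions.

```python
from collections import Counter


def register_reads(packet):
    """Count source-operand reads per register in one execute packet.

    packet: list of (condition_register_or_None, [source registers]).
    Condition registers are deliberately not counted.
    """
    reads = Counter()
    for _condition, sources in packet:
        reads.update(sources)
    return reads


def reads_within_limit(packet, limit=4):
    """True if no register is read more than `limit` times in the packet."""
    return all(count <= limit for count in register_reads(packet).values())
```

Applied to the packets above, the invalid examples count five reads of A1, while the valid example counts four because the [A1] condition is excluded.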
3.7.7 Constraints on Register Writes
Two instructions cannot write to the same register on the same cycle. Two
instructions with the same destination can be scheduled in parallel as long as
they do not write to the destination register on the same cycle. For example,
an MPY issued on cycle i followed by an ADD on cycle i + 1 cannot write to the
same register because both instructions write a result on cycle i + 1. Therefore,
the following code sequence is invalid unless a branch occurs after the MPY,
causing the ADD not to be issued.
MPY .M1 A0, A1, A2
ADD .L1 A4, A5, A2
However, this code sequence is valid:
MPY .M1 A0, A1, A2
|| ADD .L1 A4, A5, A2
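The difference between the two sequences comes down to the cycle on which each instruction writes its destination. The Python sketch below models that with a write cycle of issue cycle plus delay slots; the delay-slot values (1 for MPY, 0 for ADD) are inferred from the description above, and the function names are hypothetical.

```python
# Delay slots inferred from the text: MPY writes one cycle after issue,
# ADD writes on its issue cycle.
DELAY_SLOTS = {"MPY": 1, "ADD": 0}


def write_cycle(mnemonic, issue_cycle):
    """Cycle on which the instruction writes its destination register."""
    return issue_cycle + DELAY_SLOTS[mnemonic]


def has_write_conflict(schedule):
    """schedule: list of (mnemonic, issue_cycle, destination register)."""
    seen = set()
    for mnemonic, issue_cycle, dst in schedule:
        key = (dst, write_cycle(mnemonic, issue_cycle))
        if key in seen:
            return True
        seen.add(key)
    return False
```

The serial MPY/ADD pair both write A2 on cycle i + 1 and so conflict; the parallel pair writes A2 on cycles i and i + 1 and does not.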
Figure 3−4 shows different multiple-write conflicts. For example, ADD and
SUB in execute packet L1 write to the same register. This conflict is easily
detectable.
MPY in packet L2 and ADD in packet L3 might both write to B2 simultaneously;
however, if a branch instruction causes the execute packet after L2 to be
something other than L3, a conflict would not occur. Thus, the potential conflict
in L2 and L3 might not be detected by the assembler. The instructions in L4
do not constitute a write conflict because they are mutually exclusive. In
contrast, because the instructions in L5 may or may not be mutually exclusive,
the assembler cannot determine a conflict. If the pipeline does receive
commands to perform multiple writes to the same register, the result is
undefined.
Figure 3−4. Examples of the Detectability of Write Conflicts by the Assembler
If an instruction has a multicycle functional unit latency, it locks the functional
unit for the necessary number of cycles. Any new instruction dispatched to that
functional unit during this locking period causes undefined results. If an
instruction with a multicycle functional unit latency has a condition that is evaluated as false during E1, it still locks the functional unit for subsequent cycles.
An instruction of the following types scheduled on cycle i has the following
constraints:
DP compare      No other instruction can use the functional unit on cycles i and i + 1.
ADDDP/SUBDP     No other instruction can use the functional unit on cycles i and i + 1.
MPYI            No other instruction can use the functional unit on cycles i, i + 1, i + 2, and i + 3.
MPYID           No other instruction can use the functional unit on cycles i, i + 1, i + 2, and i + 3.
MPYDP           No other instruction can use the functional unit on cycles i, i + 1, i + 2, and i + 3.
MPYSPDP         No other instruction can use the functional unit on cycles i and i + 1.
MPYSP2DP        No other instruction can use the functional unit on cycles i and i + 1.
If a cross path is used to read a source in an instruction with a multicycle
functional unit latency, you must ensure that no other instruction executing on
the same side uses the cross path.
An instruction of the following types scheduled on cycle i using a cross path
to read a source, has the following constraints:
DP compare      No other instruction on the same side can use the cross path on cycles i and i + 1.
ADDDP/SUBDP     No other instruction on the same side can use the cross path on cycles i and i + 1.
MPYI            No other instruction on the same side can use the cross path on cycles i, i + 1, i + 2, and i + 3.
MPYID           No other instruction on the same side can use the cross path on cycles i, i + 1, i + 2, and i + 3.
MPYDP           No other instruction on the same side can use the cross path on cycles i, i + 1, i + 2, and i + 3.
MPYSPDP         No other instruction on the same side can use the cross path on cycles i and i + 1.
Other hazards exist because instructions have varying numbers of delay slots,
and need the functional unit read and write ports for varying numbers of cycles.
A read or write hazard exists when two instructions on the same functional unit
attempt to read or write, respectively, to the register file on the same cycle.
An instruction of the following types scheduled on cycle i has the following
constraints:
2-cycle DP      A single-cycle instruction cannot be scheduled on that functional unit on cycle i + 1 due to a write hazard on cycle i + 1.
                Another 2-cycle DP instruction cannot be scheduled on that functional unit on cycle i + 1 due to a write hazard on cycle i + 1.
4-cycle         A single-cycle instruction cannot be scheduled on that functional unit on cycle i + 3 due to a write hazard on cycle i + 3.
                A multiply (16 × 16-bit) instruction cannot be scheduled on that functional unit on cycle i + 2 due to a write hazard on cycle i + 3.
ADDDP/SUBDP     A single-cycle instruction cannot be scheduled on that functional unit on cycle i + 5 or i + 6 due to a write hazard on cycle i + 5 or i + 6, respectively.
                A 4-cycle instruction cannot be scheduled on that functional unit on cycle i + 2 or i + 3 due to a write hazard on cycle i + 5 or i + 6, respectively.
                An INTDP instruction cannot be scheduled on that functional unit on cycle i + 2 or i + 3 due to a write hazard on cycle i + 5 or i + 6, respectively.
INTDP           A single-cycle instruction cannot be scheduled on that functional unit on cycle i + 3 or i + 4 due to a write hazard on cycle i + 3 or i + 4, respectively.
                An INTDP instruction cannot be scheduled on that functional unit on cycle i + 1 due to a write hazard on cycle i + 1.
                A 4-cycle instruction cannot be scheduled on that functional unit on cycle i + 1 due to a write hazard on cycle i + 1.
MPYI            A 4-cycle instruction cannot be scheduled on that functional unit on cycle i + 4, i + 5, or i + 6.
                An MPYDP instruction cannot be scheduled on that functional unit on cycle i + 4, i + 5, or i + 6.
                An MPYSPDP instruction cannot be scheduled on that functional unit on cycle i + 4, i + 5, or i + 6.
                An MPYSP2DP instruction cannot be scheduled on that functional unit on cycle i + 4, i + 5, or i + 6.
                A multiply (16 × 16-bit) instruction cannot be scheduled on that functional unit on cycle i + 6 due to a write hazard on cycle i + 7.
MPYID           A 4-cycle instruction cannot be scheduled on that functional unit on cycle i + 4, i + 5, or i + 6.
                An MPYDP instruction cannot be scheduled on that functional unit on cycle i + 4, i + 5, or i + 6.
                An MPYSPDP instruction cannot be scheduled on that functional unit on cycle i + 4, i + 5, or i + 6.
                An MPYSP2DP instruction cannot be scheduled on that functional unit on cycle i + 4, i + 5, or i + 6.
                A multiply (16 × 16-bit) instruction cannot be scheduled on that functional unit on cycle i + 7 or i + 8 due to a write hazard on cycle i + 8 or i + 9, respectively.
MPYDP           A 4-cycle instruction cannot be scheduled on that functional unit on cycle i + 4, i + 5, or i + 6.
                An MPYI instruction cannot be scheduled on that functional unit on cycle i + 4, i + 5, or i + 6.
                An MPYID instruction cannot be scheduled on that functional unit on cycle i + 4, i + 5, or i + 6.
                A multiply (16 × 16-bit) instruction cannot be scheduled on that functional unit on cycle i + 7 or i + 8 due to a write hazard on cycle i + 8 or i + 9, respectively.
MPYSPDP         A 4-cycle instruction cannot be scheduled on that functional unit on cycle i + 2 or i + 3.
                An MPYI instruction cannot be scheduled on that functional unit on cycle i + 2 or i + 3.
                An MPYID instruction cannot be scheduled on that functional unit on cycle i + 2 or i + 3.
                An MPYDP instruction cannot be scheduled on that functional unit on cycle i + 2 or i + 3.
                An MPYSP2DP instruction cannot be scheduled on that functional unit on cycle i + 2 or i + 3.
                A multiply (16 × 16-bit) instruction cannot be scheduled on that functional unit on cycle i + 4 or i + 5 due to a write hazard on cycle i + 5 or i + 6, respectively.
MPYSP2DP        A multiply (16 × 16-bit) instruction cannot be scheduled on that functional unit on cycle i + 2 or i + 3 due to a write hazard on cycle i + 3 or i + 4, respectively.
All of the above cases deal with double-precision floating-point instructions or
the MPYI or MPYID instructions except for the 4-cycle case. The 4-cycle
instruction type includes both single- and double-precision floating-point
instructions. Therefore, the 4-cycle case is important for the following
single-precision floating-point instructions:
ADDSP
SUBSP
SPINT
SPTRUNC
INTSP
MPYSP
The .S and .L units share their long write port with the load port for the 32 most
significant bits of an LDDW load. Therefore, the LDDW instruction and the .S
or .L unit writing a long result cannot write to the same register file on the same
cycle. The LDDW writes to the register file on pipeline phase E5. Instructions
that use a long result and use the .L and .S unit write to the register file on pipeline phase E1. Therefore, the instruction with the long result must be scheduled later than four cycles following the LDDW instruction if both instructions
use the same side.
3.8 Addressing Modes
The addressing modes on the C67x DSP are linear, circular using BK0, and
circular using BK1. The addressing mode is specified by the addressing mode
register (AMR), described in section 2.7.3.
All registers can perform linear addressing. Only eight registers can perform
circular addressing: A4−A7 are used by the .D1 unit and B4−B7 are used by
the .D2 unit. No other units can perform circular addressing.
LDB(U)/LDH(U)/LDW, STB/STH/STW, ADDAB/ADDAH/ADDAW/ADDAD,
and SUBAB/SUBAH/SUBAW instructions all use AMR to determine what
type of address calculations are performed for these registers.
3.8.1 Linear Addressing Mode
3.8.1.1 LD and ST Instructions
For load and store instructions, linear mode simply shifts the offsetR/cst
operand to the left by 3, 2, 1, or 0 for doubleword, word, halfword, or byte
access, respectively; and then performs an add or a subtract to baseR
(depending on the operation specified).
For the preincrement, predecrement, positive offset, and negative offset
address generation options, the result of the calculation is the address to be
accessed in memory. For postincrement or postdecrement addressing, the
value of baseR before the addition or subtraction is the address to be accessed
from memory.
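The linear address calculation for loads and stores can be sketched as a function that returns both the address accessed and the updated base register. This Python model is an illustration under stated assumptions: the function name linear_access, the mode names, and the 32-bit wraparound mask are choices made here, not TI definitions.

```python
# Left shift applied to offsetR/cst for each access size
SHIFT = {"doubleword": 3, "word": 2, "halfword": 1, "byte": 0}

OFFSET_MODES = ("positive_offset", "negative_offset")
PRE_MODES = ("preincrement", "predecrement")
POST_MODES = ("postincrement", "postdecrement")
SUBTRACTING = ("negative_offset", "predecrement", "postdecrement")


def linear_access(base, offset, size, mode):
    """Return (address accessed, new baseR value) in linear mode."""
    if mode not in OFFSET_MODES + PRE_MODES + POST_MODES:
        raise ValueError("unknown mode: %s" % mode)
    scaled = offset << SHIFT[size]
    if mode in SUBTRACTING:
        scaled = -scaled
    target = (base + scaled) & 0xFFFFFFFF
    if mode in OFFSET_MODES:
        return target, base      # baseR is not modified
    if mode in PRE_MODES:
        return target, target    # access uses the updated baseR
    return base, target          # post modes access the old baseR
```

For a word access with base 100h and offset 9, preincrement accesses 124h, while postincrement accesses 100h and leaves 124h in the base register.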
3.8.1.2 ADDA and SUBA Instructions
For integer addition and subtraction instructions, linear mode simply shifts the
src1/cst operand to the left by 3, 2, 1, or 0 for doubleword, word, halfword, or
byte data sizes, respectively, and then performs the add or subtract specified.
3.8.2 Circular Addressing Mode
The BK0 and BK1 fields in AMR specify the block sizes for circular addressing;
see section 2.7.3.
3.8.2.1 LD and ST Instructions
As with linear address arithmetic, offsetR/cst is shifted left by 3, 2, 1, or 0
according to the data size, and is then added to or subtracted from baseR to
produce the final address. Circular addressing modifies this slightly by only
allowing bits N through 0 of the result to be updated, leaving bits 31 through
N + 1 unchanged after address arithmetic. The resulting address is bounded
to a $2^{(N+1)}$ range, regardless of the size of the offsetR/cst.
The circular buffer size in AMR is not scaled; for example, a block size of 8 is
8 bytes, not 8 times the data size (byte, halfword, word). So, to perform circular
addressing on an array of 8 words, a size of 32 should be specified, or N = 4.
Example 3−4 shows an LDW performed with register A4 in circular mode and
BK0 = 4, so the buffer size is 32 bytes, 16 halfwords, or 8 words. The value in
AMR for this example is 0004 0001h.
Example 3−4. LDW Instruction in Circular Mode

LDW .D1 *++A4[9],A1

            Before LDW     1 cycle after LDW    5 cycles after LDW
A4          0000 0100h     0000 0104h           0000 0104h
A1          XXXX XXXXh     XXXX XXXXh           1234 5678h
mem 104h    1234 5678h     1234 5678h           1234 5678h

Note: 9h words is 24h bytes. 24h bytes is 4 bytes beyond the 32-byte (20h) boundary 100h−11Fh; thus, it is wrapped around to
(124h − 20h = 104h).
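The wraparound described for circular mode amounts to updating only bits N through 0 of the address. A minimal Python sketch of that masking follows; the function name circular_add and its byte-offset argument are assumptions for illustration, and the offset is expected to be already scaled by the data size.

```python
def circular_add(base, offset_bytes, n):
    """Add a byte offset to base, updating only bits n..0 (block size 2^(n+1)).

    Bits 31 through n + 1 of base are preserved; offset_bytes may be negative.
    """
    mask = (1 << (n + 1)) - 1
    upper = base & ~mask & 0xFFFFFFFF
    lower = (base + offset_bytes) & mask
    return upper | lower
```

With base 100h, an offset of 9 words (24h bytes), and N = 4, circular_add returns 104h, the address accessed in Example 3−4.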
3.8.2.2 ADDA and SUBA Instructions
As with linear address arithmetic, offsetR/cst is shifted left by 3, 2, 1, or 0
according to the data size, and is then added to or subtracted from baseR to
produce the final address. Circular addressing modifies this slightly by only
allowing bits N through 0 of the result to be updated, leaving bits 31 through
N + 1 unchanged after address arithmetic. The resulting address is bounded
to a $2^{(N+1)}$ range, regardless of the size of the offsetR/cst.
The circular buffer size in AMR is not scaled; for example, a block size of 8 is
8 bytes, not 8 times the data size (byte, halfword, word). So, to perform circular
addressing on an array of 8 words, a size of 32 should be specified, or N = 4.
Example 3−5 shows an ADDAH performed with register A4 in circular mode
and BK0 = 4, so the buffer size is 32 bytes, 16 halfwords, or 8 words. The value
in AMR for this example is 0004 0001h.
Example 3−5. ADDAH Instruction in Circular Mode

ADDAH .D1 A4,A1,A4

      Before ADDAH    1 cycle after ADDAH
A4    0000 0100h      0000 0106h
A1    0000 0013h      0000 0013h

Note: 13h halfwords is 26h bytes. 26h bytes is 6 bytes beyond the 32-byte (20h) boundary 100h−11Fh; thus, it is wrapped
around to (126h − 20h = 106h).
3.8.3 Syntax for Load/Store Address Generation
The C67x DSP has a load/store architecture, which means that the only way
to access data in memory is with a load or store instruction. Table 3−10 shows
the syntax of an indirect address to a memory location. Sometimes a large
offset is required for a load/store. In this case, you can use the B14 or B15
register as the base register, and use a 15-bit constant (ucst15) as the offset.
Table 3−11 describes the addressing generator options. The memory address
is formed from a base address register (baseR) and an optional offset that is
either a register (offsetR) or a 5-bit unsigned constant (ucst5).
Table 3−10. Indirect Address Generation for Load/Store

                          No Modification of    Preincrement or       Postincrement or
                          Address Register      Predecrement of       Postdecrement of
Addressing Type                                 Address Register      Address Register
Register indirect         *R                    *++R                  *R++
                                                *−−R                  *R−−
Register relative         *+R[ucst5]            *++R[ucst5]           *R++[ucst5]
                          *−R[ucst5]            *−−R[ucst5]           *R−−[ucst5]
Register relative with    *+B14/B15[ucst15]     not supported         not supported
15-bit constant offset
Base + index              *+R[offsetR]          *++R[offsetR]         *R++[offsetR]
                          *−R[offsetR]          *−−R[offsetR]         *R−−[offsetR]

Table 3−11. Address Generator Options for Load/Store

Mode Field    Syntax        Modification Performed
0000          *−R[ucst5]    Negative offset
0001          *+R[ucst5]    Positive offset
The C62x, C64x, and C67x DSPs share an instruction set. All of the instructions valid for the C62x DSP are also valid for the C67x DSP. See Appendix A
for a list of the instructions that are common to the C62x, C64x, and C67x
DSPs.
3.10 Instruction Descriptions
This section gives detailed information on the instruction set. Each instruction
may present the following information:
Assembler syntax
Functional units
Compatibility
Operands
Opcode
Description
Execution
Pipeline
Instruction type
Delay slots
Functional Unit Latency
Examples
The ADD instruction is used as an example to familiarize you with the way
each instruction is described. The example describes the kind of information
you will find in each part of the individual instruction description and where to
obtain more information.
Example         The way each instruction is described

Syntax          EXAMPLE (.unit) src, dst

                .unit = .L1, .L2, .S1, .S2, .D1, .D2
src and dst indicate source and destination, respectively. The (.unit) dictates
which functional unit the instruction is mapped to (.L1, .L2, .S1, .S2, .M1, .M2,
.D1, or .D2).
A table is provided for each instruction that gives the opcode map fields, units
the instruction is mapped to, types of operands, and the opcode.
The opcode shows the various fields that make up each instruction. These
fields are described in Table 3−2 on page 3-7.
There are instructions that can be executed on more than one functional unit.
Table 3−12 shows how this is documented for the ADD instruction. This
instruction has three opcode map fields: src1, src2, and dst. In the fifth
group, the operands have the types cst5, long, and long for src1, src2, and dst,
respectively. The ordering of these fields implies cst5 + long → long, where +
represents the operation being performed by the ADD. This operation can be
done on .L1 or .L2 (both are specified in the unit column). The s in front of each
operand signifies that src1 (scst5), src2 (slong), and dst (slong) are all signed
values.
In the second group, src1, src2, and dst are int, int, and long, respectively. A
u in front of an operand (as in ucst5) signifies that the operand is unsigned. Any
operand that begins with x can be read from a register file that is different from
the destination register file. The operand comes from the register file opposite
the destination, if the x bit in the instruction is set (shown in the opcode map).
3-35 Instruction SetSPRU733
ExampleThe way each instruction is described
Table 3−12. Relationships Between Operands, Operand Size, Signed/Unsigned,
Functional Units, and Opfields for Example Instruction (ADD)

Opcode map field used...   For operand type...   Unit        Opfield
src1                       sint                  .L1, .L2    000 0011
src2                       xsint
dst                        sint
src1                       sint                  .L1, .L2    010 0011
src2                       xsint
dst                        slong
src1                       xsint                 .L1, .L2    010 0001
src2                       slong
dst                        slong
src1                       scst5                 .L1, .L2    000 0010
src2                       xsint
dst                        sint
src1                       scst5                 .L1, .L2    010 0000
src2                       slong
dst                        slong
src1                       sint                  .S1, .S2    00 0111
src2                       xsint
dst                        sint
src1                       scst5                 .S1, .S2    00 0110
src2                       xsint
dst                        sint
src2                       sint                  .D1, .D2    01 0000
src1                       sint
dst                        sint
src2                       sint                  .D1, .D2    01 0010
src1                       ucst5
dst                        sint
Compatibility  The C62x, C64x, and C67x DSPs share an instruction set. All of the
instructions valid for the C62x DSP are also valid for the C67x DSP. This
section identifies the DSP families for which the instruction is valid.

Description  Instruction execution and its effect on the rest of the processor or memory
contents are described. Any constraints on the operands imposed by the
processor or the assembler are discussed. The description parallels and
supplements the information given by the execution block.
Execution for .L1, .L2 and .S1, .S2 Opcodes
if (cond)  src1 + src2 → dst
else       nop

Execution for .D1, .D2 Opcodes
if (cond)  src2 + src1 → dst
else       nop
The execution describes the processing that takes place when the instruction
is executed. The symbols are defined in Table 3−1 (page 3-2).
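
As an informal reading of the execution block for the .L and .S opcodes, the following C sketch models "if (cond) src1 + src2 → dst, else nop" for 32-bit operands. The function name `add_if` is hypothetical. The sketch assumes that the plain ADD wraps on overflow rather than saturating, and it models only the register result, not pipeline timing.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of: if (cond) src1 + src2 -> dst, else nop. */
void add_if(bool cond, int32_t src1, int32_t src2, int32_t *dst)
{
    if (cond) {
        /* Assumption: the add wraps on overflow. The sum is formed in
           unsigned arithmetic so that wraparound is well defined in C. */
        *dst = (int32_t)((uint32_t)src1 + (uint32_t)src2);
    }
    /* else: nop, so dst keeps its previous value */
}
```

When `cond` is false, the destination is left untouched, which is the behavior the "else nop" line describes.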
Pipeline  This section contains a table that shows the sources read from, the
destinations written to, and the functional unit used during each execution
cycle of the instruction.

Instruction Type  This section gives the type of instruction. See section 4.2 (page 4-12) for
information about the pipeline execution of this type of instruction.

Delay Slots  This section gives the number of delay slots the instruction takes to execute.
See section 3.4 (page 3-14) for an explanation of delay slots.
Functional Unit Latency
This section gives the number of cycles that the functional unit is in use during
the execution of the instruction.
Example  Examples of instruction execution are shown. If applicable, register and
memory values are given before and after instruction execution.
ABS  Absolute Value With Saturation
Syntax  ABS (.unit) src2, dst
        .unit = .L1 or .L2

Compatibility  C62x, C64x, C67x, and C67x+ CPU
Opcode
Bit     31-29  28  27-23  22-18  17-13      12  11-5  4-2    1  0
Field   creg   z   dst    src2   0 0 0 0 0  x   op    1 1 0  s  p
Width   3      1   5      5                 1   7            1  1
Opcode map field used...   For operand type...   Unit        Opfield
src2                       xsint                 .L1, .L2    001 1010
dst                        sint
src2                       slong                 .L1, .L2    011 1000
dst                        slong
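
To make the opcode layout concrete, the following C sketch extracts the named fields from a 32-bit instruction word. It assumes the bit positions implied by the opcode diagram (creg in bits 31-29, z in bit 28, dst in bits 27-23, src2 in bits 22-18, x in bit 12, op in bits 11-5, s in bit 1, p in bit 0). The `abs_fields` type and the `bits` and `decode_abs` functions are hypothetical names used only for illustration, and the sketch makes no claim about how TI tools decode instructions.

```c
#include <stdint.h>

/* Hypothetical container for the named fields of the ABS opcode. */
typedef struct {
    unsigned creg, z, dst, src2, x, op, s, p;
} abs_fields;

/* Return bits hi..lo (inclusive) of word. Intended for fields narrower than 32 bits. */
unsigned bits(uint32_t word, unsigned hi, unsigned lo)
{
    return (unsigned)((word >> lo) & ((1u << (hi - lo + 1u)) - 1u));
}

/* Split an instruction word into fields, assuming the layout shown above. */
abs_fields decode_abs(uint32_t word)
{
    abs_fields f;
    f.creg = bits(word, 31, 29);  /* condition register selector */
    f.z    = bits(word, 28, 28);  /* condition sense */
    f.dst  = bits(word, 27, 23);  /* destination register */
    f.src2 = bits(word, 22, 18);  /* source register */
    f.x    = bits(word, 12, 12);  /* cross-path bit */
    f.op   = bits(word, 11, 5);   /* opfield: 001 1010 (sint) or 011 1000 (slong) */
    f.s    = bits(word, 1, 1);    /* side select */
    f.p    = bits(word, 0, 0);    /* parallel bit */
    return f;
}
```

As a usage example, a word built with dst = 5, src2 = 1, op = 001 1010, the fixed pattern 1 1 0 in bits 4-2, and all other bits clear is `0x02840358`; decoding it with this sketch returns those same field values.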
Description  The absolute value of src2 is placed in dst.

Execution
if (cond)  abs(src2) → dst
else       nop
The absolute value of src2 when src2 is an sint is determined as follows:
1) If src2 ≥ 0, then src2 → dst
2) If src2 < 0 and src2 ≠ −2^31, then −src2 → dst
3) If src2 = −2^31, then 2^31 − 1 → dst
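
The three sint cases can be sketched in C as follows. The function name `abs_sat32` is hypothetical, and the sketch models only the arithmetic result for a 32-bit signed operand, not the instruction's timing or any status flags.

```c
#include <stdint.h>

/* Hypothetical model of the sint form of ABS: absolute value with saturation. */
int32_t abs_sat32(int32_t src2)
{
    if (src2 >= 0) {
        return src2;          /* case 1: non-negative values pass through */
    }
    if (src2 == INT32_MIN) {
        return INT32_MAX;     /* case 3: -2^31 saturates to 2^31 - 1 */
    }
    return -src2;             /* case 2: any other negative value is negated */
}
```

The saturating case exists because +2^31 is not representable in a 32-bit signed integer, so the result is clamped to the largest positive value.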
The absolute value of src2 when src2 is an slong is determined as follows: