This publication is provided “AS IS.” Cadence Design Systems, Inc. (hereafter “Cadence") does not make any warranty of any
kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a
particular purpose. Information in this document is provided solely to enable system and software developers to use our
processors. Unless specifically set forth herein, there are no express or implied patent, copyright or any other intellectual
property rights or licenses granted hereunder to design or fabricate Cadence integrated circuits or integrated circuits based on
the information in this document. Cadence does not warrant that the contents of this publication, whether individually or as one
or more groups, meets your requirements or that the publication is error-free. This publication could include technical
inaccuracies or typographical errors. Changes may be made to the information herein, and these changes may be incorporated
in new editions of this publication.
Cadence, the Cadence logo, Allegro, Assura, Broadband Spice, CDNLIVE!, Celtic, Chipestimate.com, Conformal, Connections,
Denali, Diva, Dracula, Encounter, Flashpoint, FLIX, First Encounter, Incisive, Incyte, InstallScape, NanoRoute, NC-Verilog,
OrCAD, OSKit, Palladium, PowerForward, PowerSI, PSpice, Purespec, Puresuite, Quickcycles, SignalStorm, Sigrity, SKILL,
SoC Encounter, SourceLink, Spectre, Specman, Specman-Elite, SpeedBridge, Stars & Strikes, Tensilica, TripleCheck,
TurboXim, Vectra, Virtuoso, VoltageStorm, Xplorer, Xtensa, and Xtreme are either trademarks or registered trademarks of
Cadence Design Systems, Inc. in the United States and/or other jurisdictions.
OSCI, SystemC, Open SystemC, Open SystemC Initiative, and SystemC Initiative are registered trademarks of Open SystemC
Initiative, Inc. in the United States and other countries and are used with permission. All other trademarks are the property of
their respective holders.
The following changes (denoted with change bars) were made to this
document for the Cadence Tensilica RG-2017.7 release of the HiFi3 DSP:
Added information in section 2.4.2 that CBEGIN need not be less
than CEND
Updated information in Chapter 6 HiFi 3 NatureDSP Signal Library
RG.5
The following changes (denoted with change bars) were made to this
document for the Cadence Tensilica RG-2017.5 release of the HiFi3 DSP:
Enhanced the description of floating point operations
Corrected the intrinsic for AE_L16_IP
Corrected round instructions that were documented as returning
integer types instead of fractional types
Corrected AE_PKSR24 as returning a fractional type instead of an
integer type
RG.4
The following changes were made to this document for the Cadence Tensilica
RG-2016.4 release of the HiFi3 DSP:
Minor corrections to the (NaN) description
Enhanced conversions between variables of different types in section
3.1.
Clarified the usage restrictions for AE_DIV64
Clarified for converting HiFi 2 legacy types to and from HiFi 3 vector
types
Clarified rules on conversions from float to and from ae_int32x2
Corrected an inaccuracy in output type name for AE_MOVPA24x2
RG.3
Amended several instructions in Chapter 2, including:
Included required alignments
Amended the note for AE_MOVDA32X2 in Section2.11.
Updated Section 2.16.1.
Added information about conversions being applied to intrinsic
invocations in Section 3.1.
Updated Table 3-3 HiFi 3 C/C++ Operators.
Document Change History
vi CADENCE DESIGN SYSTEMS,INC.
HiFi 3 DSP User's Guide
1.2
Minor clarifications re AE_SA16x4, as well as in Table 3-2 HiFi 3 Display
Types.
1.1
Added information regarding HiFi Mini in Section 3.8.
Corrected HiFi 3 Coprocessor number to 1 in Section 7.1
1.0
Initial customer release.
CADENCE DESIGN SYSTEMS,INC. vii
HiFi 3 DSP User's Guide
viii CADENCE DESIGN SYSTEMS,INC.
HiFi3 DSP User's Guide
1. Introduction
Cadence’s HiFi 3 DSP is a highly optimized audio processor geared for efficient execution of
audio and voice codecs, and pre- and post-processing modules. It goes beyond the two MAC,
two issue, HiFi 2/EP architecture with four multipliers, three VLIW slots, good support for
32x16-bit and 32x32-bit multiplication, a true 64-bit data path and native support for ITUT/ETSI intrinsics. There is an optional floating point unit available, providing for a
2-way SIMD, single-precision IEEE floating point MAC or ALU operation every cycle. The
extra resources provide for significant performance improvements compared to HiFi 2/EP,
particularly on pre/post-processing algorithms as well as voice codecs. The support for 32bit audio as well as ITU-T/ETSI intrinsics, including automatic vectorization, provides much
better performance on out-of-the-box C programs and voice algorithms.
HiFi 3 is backward compatible at the C/C++ source level with HiFi 2/EP. Any algorithm written
in C/C++, including all HiFi 2/EP packages from Cadence, can simply be recompiled on HiFi
3 to get modest performance improvements. For maximum performance, key kernels may
need to be retuned for the HiFi 3 architecture.
The HiFi 3 DSP is a configuration option that can be included with the Xtensa LX 4 (and later
versions) processor. All HiFi 3 DSP operations can be used as intrinsics in standard C/C++
applications. In addition, when compiling with automatic vectorization or with the –mcoproc
option, the compiler will automatically use HiFi 3 operations when compiling standard C code.
Cadence’s HiFi 3 DSP consists of two main components: a DSP subsystem and a subsystem
to assist with bitstream access, and variable-length (Huffman) encoding and decoding. These
are covered in detail in Chapter 2.
1.1 Purpose of this User Guide
This guide provides an overview of the HiFi 3 architecture and its instruction set. It will help
programmers using HiFi 3 by identifying some of the techniques that are commonly used to
optimize algorithms. It provides guidelines to achieve improved performance by using HiFi
3’s instructions, intrinsics, protos, and primitives. This guide also serves as a C/C++ usage
reference for the appropriate way to use HiFi 3 features in a C/C++ software development.
This guide will also assist Xtensa HiFi 3 users who wish to add additional instructions to the
HiFi 3 architecture.
To use this guide most effectively, a basic level of familiarity with the Xtensa software
development flow is highly recommended. For more details, see the Xtensa Software Development Toolkit User’s Guide.
CADENCE DESIGN SYSTEMS,INC. 1
HiFi 3 DSP User's Guide
1.1.1 Conventions
Throughout this guide, the symbol <xtensa_root> refers to the installation directory of a
user's Xtensa configuration. For example, <xtensa_root> might refer to the directory
\usr\xtensa\XtDevTools\install\builds\RF-2015.2-win32\<s1> if <s1> is the
name of your Xtensa configuration. In the examples in this guide, replace <xtensa_root>
with the installation directory of your Xtensa distribution.
1.2 Installation Overview
To install a HiFi 3 configuration, follow the same procedures described in the Xtensa
Development Tools Installation Guide. The HiFi 3 include files are in the following directories
The HiFi 3 DSP, a SIMD (single-instruction/multiple-data) processor, can work in parallel on
two 24/32-bit data items or four 16-bit data items. For example, it allows for one operation to
perform two 32-bit additions in parallel, with each addition occupying half of a 64-bit AE_DR
register. The HiFi 3 multipliers support multiplication of four 24-bit, or four 32x16-bit, or four
16x16-bit operands per cycle. They support two 32x32-bit multiplies per cycle. There are
operations for single, dual, and quad multiplication. Single or dual multiply operations can be
dual issued using VLIW instructions. Quad multiply operations cannot be issued together
with other multiplies. The HiFi 3 DSP can only be configured to use a little-endian byte
ordering.
With the optional floating point unit, HiFi 3 supports two IEEE-754 floating point MACs per
cycle.
In general, 16-bit support is geared towards efficient support of the ITU-T/ETSI intrinsic
model, while 32x16-bit and 24-bit support is provided for both integer and fixed-point
computation.
HiFi 3 is a VLIW architecture, supporting the execution of three operations in parallel. DSP
loads and stores, bitstream and Huffman operations, and core operations are available in
slot 0 of a VLIW instruction. DSP MAC and ALU operations are typically available in slot 1
2 CADENCE DESIGN SYSTEMS,INC.
HiFi 3 DSP User's Guide
AR Base
Register File
Slot2
Misc
Function
Slot 2
ALUMAC
Slot2
Misc
Function
Slot 1
ALU
MAC
Slot2
Slot 0
Variable
Length
Enc/Dec &
Bitstream
Misc
Function
Load /
Store
Unit
ALU
Register MUX
AE_DR
Register File
16 x 64 bits
32 bits
32 bits
Optional
FPU
and slot 2. The optional floating point operations are generally available in slot 1 of a twoslot format.
HiFi 3 supports either caches or local memories with the full flexibility provided by Xtensa
configurations. You can have either or both, and can make different choices for instruction
and data. Audio packages supplied by Cadence do not use DMA. Hence, most customers
either use caches or make local memories sufficiently large to cover desired applications.
The following diagram illustrates the main custom state, register file, and execution units
added to an Xtensa LX processor by the HiFi 3 DSP.
Figure 1-1 HiFi 3 DSP Components
The main hardware resources in the DSP subsystem are two 2-multiplier multiply/accumulate
units, an option for a 2-way SIMD single-precision IEEE floating point unit, a 16-entry register
file AE_DR to hold 64-bit, pairs of 32-bit or quads of 16-bit data items, an arithmetic/logic
unit, and a shift unit to operate on the AE_DR values. The multiplier units support two 32x32bit or 24x32-bit MACs or four 24x24, 16x32, or 16x16-bit MACs per cycle. The multiplies are
supported through single-instruction quad multiplies or through parallel-issued dual or single
multiply instructions.
The load/store unit is capable of loading or storing up to two 24-bit or 32-bit SIMD elements,
four 16-bit SIMD elements, or single elements up to 64 bits in size. 24-bit data can either be
contained inside 32-bit envelopes, or can be packed together into 24 bits of memory. Eight
packed elements can be loaded or stored in three instructions. The load/store unit supports
unaligned accesses, whereby a stream is first primed and afterwards 64 unaligned bits can
be loaded or stored in every cycle.
The DSP subsystem can be issued in several VLIW formats—two 3-slot VLIW formats
(ae_format and ae_format1), one 2-slot format (ae_format2), and one 2-slot mini format
(ae_mini0). The operations for the 3-slot VLIW formats can be issued in one of the three
slots. In each execution cycle, zero or one operation from each slot can be executed
independently according to the static bundling expressed in the machine code. For example,
load operations can execute concurrently with multiply/accumulate operations because loads
CADENCE DESIGN SYSTEMS,INC. 3
HiFi 3 DSP User's Guide
are in ae_slot0 and multiply/accumulate operations are in ae_slot1 or ae_slot2_0. The two
slot format contains multiply instructions that produce more than one register result or use
extra operands as well as some legacy operations. For better code size, many operations
(not including integer or fixed point multiplies) are also available in single issue 16- and 24bit formats. Most floating point operations are available in the 24-bit formats.
1.4 Prefetching
HiFi 3 includes a prefetch option geared for systems with long memory latency. When the
HiFi 3 processor detects a positive stride-1 stream of cache misses (either data or
instruction), it can speculatively prefetch ahead up to four cache lines and place them in a
buffer close to the processor, or on the data side optionally into the L1 data cache (there is
no support for prefetching directly into the L1 instruction cache). In addition, you can manually
issue prefetch instructions.
Hardware prefetching is enabled by default in the reset code provided by Cadence with a low
setting. By default, on configurations that support it, data prefetches are placed into the L1
data cache. You can use the following HAL calls to explicitly disable prefetching or to increase
its aggressiveness in different sections of your code. With more aggressive prefetching, the
hardware will prefetch earlier when detecting a stream and will prefetch more lines ahead.
Assuming sufficient bus bandwidth, performance will improve with more aggressive prefetch,
but the system will require more bandwidth. Prefetching instructions and data can be
controlled separately.
#include <xtensa/hal.h>
int xthal_set_cache_prefetch(unsigned long long mode);
The value returned is not meant for direct use or interpretation; however, it is suitable for
passing to a subsequent call to xthal
_
set_cache
_
prefetch().
The mode parameter can be one of the following:
The value returned from a previous call to xthal
xthal_get_cache
One of the following constants, which apply to both instruction and data caches:
XTHAL
XTHAL
A bit-wise OR of two cache prefetch mode constants, one for the instruction cache:
XTHAL
XTHAL
XTHAL
_
prefetch()
_
PREFETCH
_
PREFETCH
_
ICACHE_PREFETCH
_
ICACHE_PREFETCH
_
ICACHE_PREFETCH
_
ENABLE(enable cache prefetch)
_
DISABLE(disable cache prefetch)
_
OFF(disable instruction cache prefetch)
_
LOW(enable, less aggressive prefetch)
_
MEDIUM(enable, midway aggressive
_
set_cache
_
prefetch() or
prefetch)
XTHAL
_
ICACHE_PREFETCH
_
HIGH(enable, more aggressive prefetch)
4 CADENCE DESIGN SYSTEMS,INC.
HiFi 3 DSP User's Guide
XTHAL
_
ICACHE
_
PREFETCH(n) (explicitly set the InstCtl field of the
PREFCTL register to 0..15. See the Prefetch Architectural Additions
section of the Prefetch Unit Option chapter in the Xtensa Microprocessor Data Book for details.
A bit-wise OR of two cache prefetch mode constants, one for the data cache:
XTHAL
XTHAL
XTHAL
_
DCACHE_PREFETCH
_
DCACHE_PREFETCH
_
DCACHE_PREFETCH
_
OFF (disable data cache prefetch)
_
LOW (enable, less aggressive prefetch)
_
MEDIUM (enable, midway aggressive
prefetch)
XTHAL
XTHAL
_
DCACHE_PREFETCH
_
DCACHE
_
PREFETCH(n) (explicitly set the DataCtl field of the
_
HIGH (enable, more aggressive prefetch)
PREFCTL register to 0..15. See the Prefetch Architectural Additions
section of the Prefetch Unit Option chapter in the Xtensa Microprocessor Data Book for details.
XTHAL
_
DCACHE_PREFETCH_L1
_
OFF (prefetch data to prefetch buffers
only)
XTHAL
_
DCACHE_PREFETCH
_
L1 (on configurations that support it,
prefetch directly to L1 data cache)
For easier simulation, prefetching can also be disabled in the simulator using the xt-run -
-prefetch=0 flag. Disabling prefetching from the simulation command line will override any
HAL calls.
1.4.1 Software Prefetching
Prefetching can also be individually controlled via software using the following GCC
extension:
__builtin_prefetch(addr);
Software prefetches can be used for either data or instructions. They can be used in addition
to, or instead of hardware prefetching. If hardware prefetching is disabled, the software
prefetches are still enabled.
For configurations that do not prefetch into the cache, but instead use a small, 8- to 16-entry
buffer outside of the cache, you must be careful not to prefetch too far ahead. Otherwise, the
data will be overwritten before it is needed by the processor.
Consider a simple example that does an energy calculation. You might choose to place a
few explicit prefetch instructions before the loop to seed the hardware prefetcher. Otherwise,
depending on the mode, the hardware prefetch might delay prefetching until after the second
miss.
You might also want to put prefetch instructions directly inside the loop. Doing so allows you
to prefetch more aggressively than the hardware prefetcher and allows you to prefetch
patterns other than the stride-1 references that are detected by the hardware prefetcher. On
the other hand, placing prefetch instructions inside the loop incurs instruction overhead
whether or not the loop actually suffers from cache misses.
In general, given the effectiveness of the hardware prefetcher, software prefetches should
be used judiciously. Carefully compare performance between using and not using software
prefetching on a loop-by-loop basis.
1.5 HiFi 3 Instruction Set Overview
The HiFi 3 DSP is built on the baseline Xtensa RISC architecture, which implements a rich
set of generic instructions optimized for efficient embedded processing. The power of HiFi 3
comes from a comprehensive DSP and audio instruction set. A wide variety of load/store
operations support multiple addressing modes, with support for 16/24/32-bit scalar and
vector data types together with 56/64-bit scalar. Vector data management is supported with
select operations and shifting.
Multiply operations include 32x32-bit, 32x24-bit, 24x24-bit, 32x16-bit and 16x16-bit. Multiply
operations come in fixed-point and integer variants. They come in high precision and low
precision variants. High-precision multiplies utilize a 64-bit accumulator. Since an
accumulator can only hold one result, HiFi 3 supports dual multiplies where the results of two
multiplies are added or subtracted together before being added into the accumulator. For
example, a single operation might compute the following operation where H and L refer to
the high bits or low bits respectively of an operand.
acc = acc – d0.L*d1.L + d0.H*d1.H.
Low-precision multiplies accumulate in 32-bits. Since each register can hold two 32-bit
accumulators, these instructions can perform two independent SIMD multiplies.
A set of bitstream and variable length instructions allow for efficient access of serial
bitstreams, including Huffman encode and decode.
The optional floating point unit supports 2-way SIMD units of IEEE-754 single precision
floating point operations. Refer to the Xtensa Instruction Set Architecture (ISA) Reference Manual for more details about the core single precision floating point support, on which the
2-way SIMD units are based.
6 CADENCE DESIGN SYSTEMS,INC.
HiFi 3 DSP User's Guide
H
L
2
1
0
31 … 0
31 … 0
15… 0
15 … 0
15 … 0
15 … 0
63 … 0
3
2. HiFi 3 Features
The HiFi 3 DSP contains a 16-entry, 64-bit register file, AE_DR. Each register can hold one
or two, 24 or 32-bit operands, one or four 16-bit operands or one 56- or 64-bit operand as
shown in Figure 2-1. 24-bit and 56-bit operands are sign extended to fill their 32- or 64-bit
container. The separate halves or quarters of the register are always separate data items.
For example, if you shift a 32-bit element to the left, the L element does not spill over into the
high element.
Figure 2-1 AE_DR Register
When a register is stored to memory, the high half of the register is always stored in the
lower memory address. This enables the same source code to work on all configurations,
including big-endian HiFi2 cores. Operations that access individual 24- or 32-bit elements
of AE_DR registers refer to the elements with selectors L and H in the mnemonics.
Operations that access individual 16-bit elements refer to the elements with sectors 3, 2, 1
and 0 in the mnemonics.
For legacy HiFi 2/EP instructions, a 32-bit data item might occupy the middle of an entire
AE_DR register and a 16-bit data item might occupy the middle of a 32-bit half register.
When using such legacy instructions, a register holds half as many elements, hence the
instruction exploits less parallelism. Such instructions should only be used in legacy code.
CADENCE DESIGN SYSTEMS,INC. 7
HiFi 3 DSP User's Guide
State Register
Bit Size
Description
AE_OVERFLOW
1
Indicates whether any arithmetic operation has saturated
since the time when AE_OVERFLOW was last reset to zero.
AE_SAR
7
Contains the shift amount for various DSP shift operations.
State Register
Bit
Size
Description
AE_BITHEAD
32
Contains the bits at the head of the bitstream. The high half
has the current 16 bits and the low half has the next 16 bits.
Only the high half is used for output bitstreams.
AE_BITPTR
4
Offset within the 16 most-significant bits of the bitstream
head. For an input bitstream, this value signifies the number
of most significant bits of AE_BITHEAD that have been
consumed already by the application. For an output
bitstream, this value signifies the number of most significant
bits of AE_BITHEAD that have already been initialized.
AE_BITSUSED
4
Contains the number of bits consumed or produced in the
last table lookup by a variable-length encode/decode
instruction. This value is coded in binary, with the exception
that all zeroes are interpreted as the value 16.
AE_TABLESIZE
4
Contains one less than the base-2 logarithm of the current
decoding table size for variable-length decode. 0
corresponds to a 2-entry table; 15 corresponds to a 65536entry table.
AE_FIRST_TS
4
Contains the correct value of AE_TABLESIZE for the first
level in the lookup-table hierarchy. This state is an
optimization so that no AE_VLDSHT instruction is needed
between consecutive decoding operations using the same
codebook.
AE_NEXTOFFSET
27
This state is used for three different things.
In variable-length decode: Before an AE_VLDL16T or
AE_VLDL32T instruction, AE_NEXTOFSET is the index
HiFi 3 supports a 4-entry, 64-bit alignment register, AE_VALIGN. Using this register allows
the hardware to load or store a SIMD stream that is not 64-bit aligned at a rate of 64-bits
per cycle. It also allows 24-bit data to be packed densely into 24-bit containers. These
mechanisms are described in more detail In Section 2.4.1.
Table 2-1 lists the the TIE state registers in the HiFi 3 DSP.
Table 2-1 DSP Subsystem State Registers
The state registers listed in Table 2-2 pertain to the bitstream and variable-length
encode/decode support subsystem of the HiFi 3 DSP. Programmers generally will not need
to concern themselves with the details of how each of these state registers is used by the
instructions. However, the state registers (understandable for those familiar with the variablelength encode/decode instructions) are documented here for completeness.
Table 2-2 Bitstream and Variable-length Encode/Decode Support Subsystem State Registers
8 CADENCE DESIGN SYSTEMS,INC.
HiFi 3 DSP User's Guide
State Register
Bit
Size
Description
of the table entry corresponding to the current bitstream
prefix to look up.
After an AE_VLDL16T or AE_VLDL32T instruction,
AE_NEXTOFFSET is the offset of the base of the next
decoding lookup table.
In variable-length encode: After an AE_VLEL16T or
AE_VLEL32T instruction, the low bits of
AE_NEXTOFFSET hold the codeword bits produced by
the most recent lookup.
AE_SEARCHDONE
1
This state tells the AE_VLDL16C instruction to prepare
AE_NEXTOFFSET (using AE_FIRST_TS) for a fresh
decoding search starting with the first table in the decoding
hierarchy. This state is an optimization so that no
AE_VLDSHT instruction is needed between consecutive
decoding operations using the same codebook.
State Register
Bit
Size
Description
AE_CBEGIN0
32
Contains the start address of the circular buffer.
AE_CEND0
32
Contains the end address of the circular buffer.
AE_CWRAP
1
Indicates whether any circular buffer operation has wrapped
around since the time when AE_CWRAP was last reset to zero.
State Register
Bit Size
Description
RoundMode
2
Control the rounding mode of floating point operations. A
value of 0 rounds to nearest, a value of 1 rounds toward
0, a value of 2 rounds towards infinite and a value of 3
rounds toward negative infinite.
InvalidFlag
1
Invalid exception flag.
DivZeroFlag
1
Divide-by-zero flag.
OverflowFlag
1
Overflow exception flag.
UnderflowFlag
1
Underflow exception flag.
InexactFlag
1
Inexact exception flag.
The following state registers pertain to the circular buffer support and are shared between
the DSP subsystem and the bitstream and variable-length encode/decode support
subsystem of the DSP.
Table 2-3 Circular Buffer Support State Registers
The following state registers pertain to the optional floating point support.
Table 2-4 Floating Point Support State Registers
CADENCE DESIGN SYSTEMS,INC. 9
HiFi 3 DSP User's Guide
Instruction
Intrinsic
Description
RUR.AE_OVERFLOW
RUR_AE_OVERFLOW,
RAE_OVERFLOW
Read state register
AE_OVERFLOW
RUR.AE_SAR
RUR_AE_SAR,
RAE_SAR
Read state register
AE_SAR
RUR.AE_TABLESIZE
RUR_AE_TABLESIZE,
RAE_TABLESIZE
Read state register
AE_TABLESIZE
RUR.AE_FIRST_TS
RUR_AE_FIRST_TS,
RAE_FIRST_TS
Read state register
AE_FIRST_TS
RUR.AE_BITHEAD
RUR_AE_BITHEAD,
RAE_BITHEAD
Read state register
AE_BITHEAD
RUR.AE_BITSUSED
RUR_AE_BITSUSED,
RAE_BITSUSED
Read state register
AE_BITSUSED
The TIE state registers are grouped as follows into six user registers for the purposes of
efficient save and restore operations:
With the floating point option, use the following user register to control and detect rounding
and exception behavior. Refer to Chapter 4 of the Xtensa Instruction Set Architecture (ISA) Reference Manual for more details.
In addition to specialized instructions sequences used to save and restore entire user
registers efficiently from memory, instructions are provided to read and write individual
state registers. Both types are listed in Table 2-5.
Table 2-5 State Register Access Instructions
10 CADENCE DESIGN SYSTEMS,INC.
HiFi 3 DSP User's Guide
Instruction
Intrinsic
Description
RUR.AE_BITPTR
RUR_AE_BITPTR,
RAE_BITPTR
Read state register
AE_BITPTR
RUR.AE_SEARCHDONE
RUR_AE_SEARCHDONE,
RAE_SEARCHDONE
Read state register
AE_SEARCHDONE
RUR.AE_NEXTOFFSET
RUR_AE_NEXTOFFSET,
RAE_NEXTOFFSET
Read state register
AE_NEXTOFFSET
RUR.AE_CBEGIN0
RUR_AE_CBEGIN0,
RAE_CBEGIN0,
AE_GETBEGIN
Read state register
AE_CBEGIN0
.
AE_GETCBEGIN0
returns a
void *
value.
RUR.AE_CEND0
RUR_AE_CEND0,
RAE_CEND0,
AE_GETCEND0
Read state register
AE_CEND0. AE_GETCEND0
returns a
void *
value.
RUR.AE_CWRAP
RUR_AE_CWRAP,
RAE_CWRAP
Read state register
AE_CWRAP
RUR.FCR
RUR_FCR
Read register FCR containing state RoundMode
RUR.FSR
RUR_FSR
Read register FSR corresponding to state
registers InvalidFlag, DivZeroFlag,
OverflowFlag, and UnderflowFlag
AE_MOVVFCRFSR
AE_MOVVFCRFSR
Copy user register FCR_FSR into a vector
register which can be stored to memory.
WUR.AE_OVERFLOW
WUR_AE_OVERFLOW,
WAE_OVERFLOW
Write state register
AE_OVERFLOW
WUR.AE_SAR
WUR_AE_SAR,
WAE_SAR
Write state register
AE_SAR
WUR.AE_TABLESIZE
WUR_AE_TABLESIZE,
WAE_TABLESIZE
Write state register
AE_TABLESIZE
WUR.AE_FIRST_TS
WUR_AE_FIRST_TS,
WAE_FIRST_TS
Write state register
AE_FIRST_TS
WUR.AE_BITHEAD
WUR_AE_BITHEAD,
WAE_BITHEAD
Write state register
AE_BITHEAD
WUR.AE_BITSUSED
WUR_AE_BITSUSED,
WAE_BITSUSED
Write state register
AE_BITSUSED
WUR.AE_BITPTR
WUR_AE_BITPTR,
WAE_BITPTR
Write state register
AE_BITPTR
WUR.AE_SEARCHDONE
WUR_AE_SEARCHDONE,
WAE_SEARCHDONE
Write state register
AE_SEARCHDONE
WUR.AE_NEXTOFFSET
WUR_AE_NEXTOFFSET,
WAE_NEXTOFFSET
Write state register
AE_NEXTOFFSET
WUR.AE_CBEGIN0
WUR_AE_CBEGIN0,
WAE_CBEGIN0,
AE_SETCBEGIN0
Write state register
AE_CBEGIN0
.
AE_SETCBEGIN0
take a
void *
value.
CADENCE DESIGN SYSTEMS,INC. 11
HiFi 3 DSP User's Guide
Instruction
Intrinsic
Description
WUR.AE_CEND0
WUR_AE_CEND0,
WAE_CEND0,
AE_SETCEND0
Write state register
AE_CEND0 AE_SETCEND0
take a
void *
value.
WUR.AE_CWRAP
WUR_AE_CWRAP,
WAE_CWRAP
Write state register
AE_CWRAP
WUR.FCR
WUR_FCR
Write register FCR containing state RoundMode
WUR.FSR
WUR_FSR
Write register FSR corresponding to state
registers InvalidFlag, DivZeroFlag,
OverflowFlag, and UnderflowFlag
AE_MOVFCRFSRV
AE_MOVFCRFSRV
Set user register FCR_FSR from a vector
register which can be loaded from memory.
Placeholder
Register file
Legal values
Example
A, ah, al, a0, a1, ax
AR
a0 – a15
a3
q, q0, q1, d, d0, d1, dh, dl
AE_DR
aed0 – aed15
aed2
b
BR
b0 – b15
b3
bhl
BR2
b0 – b14 (even)
b0
b3210
BR4
b0-b16 (multiple of 4)
b0
u
AE_VALIGN
u0-u3
u0
Placeholder
Value Range
Stride
i16
-16..14
2
i16pos
0..14
2
i32
-32..28
4
i32pos
0..28
4
i64
-64..56
8
i64pos
0..56
8
i
Operation-dependent
1
In the operation descriptions in Sections 2.4 through 2.17, each mnemonic is listed with
assembly syntax showing placeholders for its operands. The register files of the operands
are implied by the placeholders, as in Table 2-6.
Table 2-6 Operand Register Types
Each operation description is annotated with the name(s) of the slot(s) where that operation
can be issued. Each operation description is also annotated with the C syntax showing the
intrinsic name and prototype for the operation. A discussion of using C data types and
intrinsics to program the HiFi 3 DSP is included in Chapter 3.
Denotes fractional arithmetic (e.g., AE_MULZAAFD24.HH.LL) or the value False in
a conditional move (e.g., AE_MOVF64).
H
and L
Combinations of H and L are used to refer to halves of registers (e.g.,
AE_MULZAAFD24.HH.LL).
0,1,2,3
Combinations of 0, 1, 2 and 3 are used to refer to quarters of registers (e.g.
AE_MULF32X16.L0)
I
Denotes use of an immediate operand (e.g., AE_SRAIP32)
S
Denotes saturating arithmetic (e.g., AE_MULF32S.LL) or the use of the AE_SAR
state register as a shift amount (e.g., AE_SRASP32), depending on the context
Denotes the value True in a conditional move (e.g., AE_MOVT64)
U
Denotes unsigned arithmetic (e.g., AE_MULS32U.LL)
X
Denotes use of an index register in an address computation (e.g., AE_L64.XP)
X2
Denotes a 2-way SIMD operation in contexts (e.g., AE_L32X2.I) where scalar
operations are also available
X4
Denotes a four-way SIMD operation (e.g., AE_L16X4.XC)
All HiFi 2 C types and intrinsics are available in HiFi 3 to ensure C/C++ source code
portability. Notes on HiFi 2 code portability and matching intrinsics are included in the
operation description for the relevant operations as well as in Chapter 3.
2.1 Instruction Naming Conventions
All HiFi 3 DSP operation mnemonics begin with the string AE_ to avoid colliding with any
other space of names. The optional floating point instructions use the standard Xtensa
floating point intrinsic names that add an XT_ prefix to the operation name and replace the
.S with _S.
Following the AE_ prefix, each mnemonic has a string of one or more characters signifying
the type of operation such as load, shift, add, etc. For example, AE_L is the prefix denoting
DSP loads.
The remaining portion of each operation mnemonic typically includes reminders of various
aspects of the operation’s details. Multiplies and loads and stores have more regular naming
conventions that are described in their respective sections.
The HiFi 3 DSP contains instructions for implementing fixed point arithmetic. This section
describes the representation and interpretation of fixed point values as well as some
operations on fixed-point values.
2.2.1 Representation of Fixed Point Values
A fixed point data type m.n contains a sign bit, some number of bits m-1, to the left of the
decimal and some number of bits n, to the right of the decimal. When expressed as a binary
value and stored into a register file, the least significant n bits are the fractional part, and
the most significant m bits are the integer part expressed as a signed 2’s complement
number. If the binary value is interpreted as a 2’s complement signed integer, converting
from the binary value to a fixed point number requires dividing the integer by 2n.
Thus, for example, the 24-bit 1.23 number 0.5 is represented as 0x400000.
And the 64-bit 17.47 number -1.5 is represented as (-2 + 0.5 = 0xff 4000 0000 0000)
14 CADENCE DESIGN SYSTEMS,INC.
HiFi 3 DSP User's Guide
HiFi 3 fractional instructions use fractional operations on 1.15, 1.23, 9.23, 1.31, 17.47, and
1.63, described in more details as follows.
1.15 16-bit fixed point data type with 1 sign bit and 15 bits to the right of the
decimal. The largest positive value 0x7fff is interpreted as 1.0 – 2
-15
. The smallest
negative value 0x8000 is interpreted as -1.0. The value 0 is interpreted as 0.0.
9.23 32-bit fixed point data type with a 9-bit integer and 23 bits to the right of the
decimal. The largest positive value 0x7fffffff is interpreted as 256.0 – 2
-23
. The
smallest negative value 0x80000000 is interpreted as -256.0. The value 0 is
interpreted as 0.0.
1.23 24-bit fixed point data type with 1 sign bit and 23 bits to the right of the
decimal. The largest positive value 0x7fffff is interpreted as 1.0 – 2
-23
. The smallest
negative value 0x800000 is interpreted as -1.0. The value 0 is interpreted as 0.0.
Since register halves hold 32 bits, not 24 bits, typical 24-bit fractional variables are
9.23. However, 24-bit fixed-point multiply instructions ignore the upper 8-bits,
thereby treating them as 1.23.
1.31 32-bit fixed point data type with 1 sign bit and 31 bits to the right of the
decimal. The largest positive value 0x7fffffff is interpreted as 1.0 – 2
-31
. The
smallest negative value 0x80000000 is interpreted as -1.0. The value 0 is
interpreted as 0.0.
17.47 64-bit fixed point data type with a 17-bit integer and 47 bits to the right of
the decimal. The largest positive value 0x7fff ffff ffff ffff is interpreted as 65536.0 –
-47
2
. The smallest negative value 0x8000 0000 0000 0000 is interpreted as -
65536.0. The value 0 is interpreted as 0.0.
1.63 64-bit fixed point data type with 1 sign bit and 63 bits to the right of the
decimal. The largest positive value 0x7fff ffff is interpreted as 1.0 – 2
-63
. The
smallest negative value 0x8000 0000 0000 0000 is interpreted as -1.0. The value 0
is interpreted as 0.0.
2.2.2 Arithmetic with Fixed Point Values
When multiplying fixed point numbers m.n0 * m.n1, with a standard signed integer multiplier,
the natural result of the multiple will be an m.n data type where n = n0+n1 and m = m0+m1.
For example, multiplying a 1.23 typed variable by a 1.23 typed variable generates a 2.46
typed variable. Since HiFi 3 supports the 17.47 data type, the fixed point multiply instructions
shift the 2.46 result to the left by 1 bit and then sign extends it by 15 bits. In general, highprecision fixed-point multiplications shift their results to the left by 1 bit.
HiFi 3 contains both saturating and non-saturating instructions. Overflowing the supplied
guard bits with a non-saturating instruction is a program error that will cause the result to
wrap around. For saturating operations, the processor also sets the overflow state, which
can later be checked programmatically. In the instruction descriptions that follow, it is
explicitly stated if an operation saturates.
CADENCE DESIGN SYSTEMS,INC. 15
HiFi 3 DSP User's Guide
2.2.3 Other Fixed Point Representations
Programmers are free to use fixed-point representations other than the ones listed in Section
2.2.2. Most HiFi 3 operations are independent of fixed-point representation; e.g., a fixed point
add is equivalent to an integer one. Even for multiplies, the multiply instructions are
compatible with any representations that expect the result to be shifted left by one bit. So, if
the input data is actually a 2.22 data type rather than a 1.23 data type, the 24-bit fixed point
multiply instructions will correctly produce an 11.45 typed variable. The programmer is
responsible for knowing what type of data is in what variables, and if manual conversions are
needed, you can always use shift instructions.
2.3 VLIW Slots and Formats
HiFi 3 can issue up to three operations in a single 64-bit instruction bundle using Xtensa LX
FLIX (VLIW) technology. HiFi 3 supports four different formats ae_format, ae_format1,
ae_format2 and ae_mini0. Every instruction belongs to one format, but different formats may
pack different numbers of operations in a single instruction.
Formats ae_format and ae_format1 both support three parallel operations. The two formats
are logically equivalent, allowing the exact same operations in the first two slots and disjoint
operations in the third. The reason for splitting the format in two is for encoding space. The
first format is 60 bits while the second is 59 bits, allowing a total size of 60.5 bits, something
that is not possible to attain with a single format. The rest of this guide treats the format as a
single format containing the slots ae_slot0, ae_slot1, and ae_slot2.
Format ae_format2 is a 2-slot format with slots ae2_slot0 and ae2_slot1. Using this format
allows for individual operations with more operands or larger immediates than can be used
in the 3768 slot format.
Format ae_mini0, with slots ae_minislot0 and ae_minislot2, is a specialized format that
allows operations that read an AR register operand to execute in parallel with operations that
have at most one AR read and one AR write operand. In particular, this format allows the
parallel execution of some simple core operations such as MOVI and ADDI together with
immediate loads and stores.
For the 3-slot format, the first slot contains all of the HiFi 3 load/store instructions and some
miscellaneous operations. The second slot contains all of the regular multiply and DSP ALU
operations. The third slot contains all the shifts and DSP ALU operations as well as a subset
of the multiply operations.
A subset of the operations as well as all the bitstream operations are available in a single
issue, 24-bit format called Inst. The compiler will automatically use the 24-bit format when
it is not possible (or beneficial) to bundle a relevant operation together with an operation that
can go in another slot. A subset of the core Xtensa operations is also available in the first
VLIW slot, allowing some parallelism between DSP operations and core Xtensa operations.
For ae_format2, the first slot contains all of the load and store operations as well as many
operations from the third slot of the 3-slot format. This slot also contains variants of the core
16 CADENCE DESIGN SYSTEMS,INC.
HiFi 3 DSP User's Guide
branch instructions with large immediates that do not fit into other formats. The second slot
contains all of the multipliers including specialized multipliers that do not fit into the 3-slot
format.
For the optional floating point unit, most floating point operations are available in the second
slot of ae_format2, allowing the machine to issue, for example, one SIMD floating-point load
in parallel with one, 2-way SIMD, multiply-accumulation operation.
Understanding the slotting is important when optimizing code for HiFi 3. Often a loop is limited
by operations that can only go in one slot or another. For example, it is never possible to
issue more than one (possible SIMD) load or store per cycle. If a loop is limited by the
operations in one slot, there is no point in trying to optimize the operations in another slot.
All HiFi 2 instructions available in the Inst slot share opcode space (but do not overlap)
with the MAC16 Option.
The available slotting for the different operations are listed next to the operation descriptions
in the remainder of this chapter.
2.4 Load and Store Operations
HiFi 3 supports loading and storing scalars or vectors of 16, 24, 32, and 64 bits. Each scalar
load/store accesses 16, 24, 32, or 64 bits. Each vector accesses 64 bits or 48 bits for packed
24-bit data. For vector loads and stores, the high address in memory is always stored in the
least significant bits in the register. This enables the same source code to work on both little
and big endian systems. Reverse vector loads and stores reverse the elements in a register
so that the low address in memory is stored in the least significant bits in the register. This
way, whether accessing data in a stride one or stride negative one fashion, the earliest data
to be accessed is always in the same position in the register.
Special support is provided for retaining full throughput when vectors of data are not aligned
to 64-bits. HiFi 3 also supports a single circular buffer that can be used with either aligned or
unaligned data.
2.4.1 Aligning Loads and Stores
HiFi 3 has support for loading or storing vector streams of data 64 bits at a time even if the
data is not aligned to 64 bits. Note that while the vector variables need not be aligned to 64
bits, they must still be aligned according to the requirements of each scalar element, i.e., 32
bits for vectors of ints.
Such loads and stores are called aligning loads and stores. Support is available for 16-, 24-,
and 32-bit data. The aligning vector load and store instructions use the HiFi 3 alignment
register file to provide a throughput of one aligning load or store operation per instruction.
A special priming instruction, AE
of unaligned data. This instruction loads the alignment register with data from the start of the
stream. The subsequent aligning load instruction loads from the next location in memory,
CADENCE DESIGN SYSTEMS,INC. 17
_
LA64.PP, is used to begin the process of loading an array
HiFi 3 DSP User's Guide
merging it with the data already in the alignment register. The exact details of how the aligning
instructions work are not relevant to the programmer. Simply invoke the AE
priming intrinsic with the first address (aligned or not) to be loaded and continue loading with
the appropriate aligning loads to achieve a subsequent throughput of one aligning load per
instruction.
The design of the priming load and aligning load instructions is such that they can be used
in situations where the alignment of the address is unknown. The load sequence works
whether the starting address is aligned or not.
Consider a simple example that adds up the 32-bit elements in an array.
void add(int * a, int n)
{
ae_int32x2 *ap=(ae_int32x2 *) &a[0];
ae_int32x2 tmp;
ae_valign align;
int i;
align = AE_LA64_PP(ap); // prime the stream
for(i = 0; i < n; i = i + 2)
{
AE_LA32X2_IP(tmp,align,ap); // load the next element
V = V + tmp;
}
}
_
LA64
_
PP
Similarly, when accessing the data with a stride of negative one, prime the stream by passing
in the address of the first scalar element to be loaded (a[n-1]), as follows.
void add(int * a, int n)
{
ae_int32x2 *ap=(ae_int32x2 *) &a[n-1];
ae_int32x2 tmp;
ae_valign align;
int i;
align = AE_LA64_PP(ap); // prime the stream
for(i = 0; i < n; i = i + 2)
{
AE_LA32X2_RIP(tmp,align,ap); // load the next element
V = V + tmp;
}
}
Note that in the negative stride case, the start of the stream is handled differently in the
aligned versus the non-aligned case. With aligned loads, one passes in the address of
a[n-2] because that is the address of the first 64-bit word being loaded. With aligning
loads, one passes in the address of the first 32-bit scalar being loaded, a[n-1], because
18 CADENCE DESIGN SYSTEMS,INC.
HiFi 3 DSP User's Guide
the priming load loads from memory the aligned 64-bit envelope containing its argument and
a[n-2]might not be in the same 64-bit envelope as a[n-1].
HiFi 3 supports storing 24-bit data in a packed format that requires only 24-bits per data
element. Using this support can potentially save 25% of the memory required for a 24-bit
variable. Support for this packed data is implemented using the alignment mechanism. In the
examples above, simply use AE_LA24X2 intrinsics instead of AE_LA32X2 as shown below.
Note that we have used char * for the pointer type. While not strictly necessary, it is helpful
to indicate that the packed stream is an unaligned byte stream.
void add(int * a, int n)
{
char *ap=(char *) &a[0];
ae_int24x2 tmp;
ae_valign align;
int i;
align = AE_LA64_PP(ap); // prime the stream
for(i = 0; i < n; i = i + 2)
{
AE_LA24X2_IP(tmp,align,ap); // load the next element
V = V + tmp;
}
}
For packed data, even scalar streams are unaligned, so support is also available for
_
AE
LA24 intrinsics. Because the memory format for packed data is different, packed data
can only be used in cases where all loads and stores of a stream are done using the packing
loads and stores. While the packing loads and stores can be used on any 24-bit variable,
since a priming load and a finalizing store is required for every stream, it is often only efficient
to use them on stride one or stride negative one streams. Similarly, since there are only four
alignment registers, it is only efficient to use them on loops that have at most four streams.
Aligning stores operate in a slightly different manner. Before starting a stream, the alignment
variable needs to be zeroed using the AE_ZALIGN64() intrinsic. On an unaligned store,
each aligning store instruction merges some of the data with data already in the alignment
register and writes the result to memory. The remaining data is written into the alignment
register for use in the next aligning store. If the data happens to be aligned, each aligning
store simply writes its data to memory. After completing the stream, you must finalize the
stream using a finalization instruction. If the data happens to be unaligned, that finalization
instruction writes out the remaining data from the alignment register. The finalization
instruction also zeroes the alignment register so that a follow-on stream can skip the use of
the AE_ZALIGN64() intrinsic.
Following is a simple example that zeroes an n element array of ints named a.
ae_int32x2 V_con = (ae_int32x2)(0);
ae_int32x2 *addr = (ae_int32x2 *) a;
ae_valign align = AE_ZALIGN64(); // zero alignment reg
for(i = 0; i <= n; i = i + 2)
CADENCE DESIGN SYSTEMS,INC. 19
HiFi 3 DSP User's Guide
State
Description
AE_CBEGIN0
The start address of the circular buffer.
AE_CEND0
The end address of circular buffer, i.e., the start address plus the byte
size of the buffer.
{
AE_SA32X2_IP(V_con, align, addr); // store
}
AE_SA64POS_FP(align, addr); // finalize the stream
Negative strided streams work analogously to the case of loads, with the use of RIP
intrinsics. Note that there are separate flush instructions for the positive stride and negative
stride streams.
2.4.2 Circular Buffer
HiFi 3 has support for a single circular buffer, which can be accessed in either the forward or
the backward direction.
The circular buffer boundaries are specified through two 32-bit states:
Table 2-9 Circular Buffer States
Use the following intrinsic functions to read from the circular buffer states in C:
All circular buffer operations follow a “post-increment” convention; that is, in every case the
effective address is the base address while the updated base address is formed by adding
the register offset to the base address with circular wrap-around.
The address increment is specified in terms of number of bytes and must be less than or
equal to the buffer byte size. The increment can be either positive (wrap-around at the end
of the buffer), or negative (wrap-around at the beginning of the buffer).
Both aligned and unaligned accesses are supported. However, for unaligned accesses
_
AE
CBEGIN0 and AE_CEND0 must be aligned to 64 bits. For aligned accesses,
_
AE
CBEGIN0 and AE_CEND0 must be aligned to the size of the data being loaded or
stored. Unaligned accesses use the alignment mechanism described in Section 2.4.1.
Priming loads use the PC suffix with separate instructions for positive and negative stride.
For unaligned references, only stride one and stride negative one are supported. Packed 24bit loads are supported.
20 CADENCE DESIGN SYSTEMS,INC.
HiFi 3 DSP User's Guide
_
AE
CBEGIN0 need not be smaller than AE_CEND0. If an instruction accesses data past
_
the AE
whether it is before or after AE
Circular buffer support is available for DSP loads and stores to the AE_DR register file, as
well as bitstream loads and stores to the AR register file.
Following is an example C code snippet demonstrating how to initialize and use the circular
buffer. The buffer is used to store 24-bit vector data in the 24 MSBs of each 32-bit word with
a negative stride starting from the last element of the buffer.
CEND0 boundary, data will continue to be accessed at AE_CBEGIN0 regardless of
_
CEND0.
/* Allocate the buffers. */
void *buf = malloc(buf_size);
This operation accesses an aligned 16 bit quantity.
24
24-bit scalar
This operation accesses a 24-bit quantity that is
packed into memory so as to occupy only 24 bits in
memory.
32
32-bit scalar
This operation accesses an aligned 32-bit quantity.
This size is also used for legacy 24-bit integers which
are stored in a 32-bit memory location right-justified
and with 8 bits of sign extension.
32F24
Left-justified 24bit fraction
This operation accesses a 24-bit fraction, which is
stored left-justified in a 32-bit memory location. It shifts
the value right by 8 bits and sign extends on the left by
8 bits. The address must be 32-bit aligned.
64
64-bit scalar
This operation accesses an aligned 64-bit quantity.
24X2
Vector of 24-bit
This operation accesses two of the size “24” above,
occupying 48 bits in memory.
32X2
Vector of 32-bit
This operation accesses two of the size “32” above.
Some instructions need the pair to be 64-bit aligned
while others do not.
32X2F24
Vector of leftjustified 24-bit
fraction
This operation accesses two of the size “32F24”
above. Some instructions need the pair to be 64-bit
aligned, while others do not.
16X4
Vector of 16 bit
This operation accesses four of the size “16” above.
Some instructions need the quartet to be 64-bit
aligned, while others do not.
2.4.3 Load and Store Naming Scheme
The mnemonic of most load and store operations contains a size indicating the size of
operands it will load or store. The sizes are listed in the following table.
Table 2-10 Load/Store Operation Sizes
The mnemonic of most load and store operations contains a suffix indicating how the effective
address is computed and whether the base address register is updated. The suffixes are
listed in the following table.
Operations with suffix IP, XP, IC, or XCfollow a “post-increment” convention where the
effective address is the base AR register, and the base address register is updated by adding
an immediate, constant or register offset. Operations with suffix IU or XUfollow a “pre-
increment” convention where the effective address is the result of adding the immediate or
register offset to the base address register’s contents and the base address register is
updated with the effective address. Operations with suffix I or X do not increment, but create
an effective address which is the sum of the base address register and an immediate or offset
register.
22 CADENCE DESIGN SYSTEMS,INC.
Loading...
+ 189 hidden pages
You need points to download manuals.
1 point = 1 manual.
You can buy points or you can get point for every manual you upload.