This publication is provided “AS IS.” Cadence Design Systems, Inc. (hereafter “Cadence”) does not make any warranty of any kind,
either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose.
Information in this document is provided solely to enable system and software developers to use our processors. Unless specifically
set forth herein, there are no express or implied patent, copyright or any other intellectual property rights or licenses granted
hereunder to design or fabricate Cadence integrated circuits or integrated circuits based on the information in this document.
Cadence does not warrant that the contents of this publication, whether individually or as one or more groups, meet your
requirements or that the publication is error-free. This publication could include technical inaccuracies or typographical errors.
Changes may be made to the information herein, and these changes may be incorporated in new editions of this publication.
© 2017. Cadence, the Cadence logo, Allegro, Assura, Broadband Spice, CDNLIVE!, Celtic, Chipestimate.com, Conformal,
Connections, Denali, Diva, Dracula, Encounter, Flashpoint, FLIX, First Encounter, Incisive, Incyte, InstallScape, NanoRoute, NCVerilog, OrCAD, OSKit, Palladium, PowerForward, PowerSI, PSpice, Purespec, Puresuite, Quickcycles, SignalStorm, Sigrity, SKILL,
SoC Encounter, SourceLink, Spectre, Specman, Specman-Elite, SpeedBridge, Stars & Strikes, Tensilica, TripleCheck, TurboXim,
Vectra, Virtuoso, VoltageStorm, Xplorer, Xtensa, and Xtreme are either trademarks or registered trademarks of Cadence Design
Systems, Inc. in the United States and/or other jurisdictions.
OSCI, SystemC, Open SystemC, Open SystemC Initiative, and SystemC Initiative are registered trademarks of Open SystemC
Initiative, Inc. in the United States and other countries and are used with permission. All other trademarks are the property of their
respective holders.
Xtensa Release: RG-2017.7
Issue Date: 08/2017
Modification: 422358
Cadence Design Systems, Inc.
2655 Seely Ave.
San Jose, CA 95134
www.cadence.com
Contents
List of Tables............................................................................................................................vii
List of Figures........................................................................................................................... ix
1 Changes from the Previous Version ...................................................................................11
In this release, Xtensa LX7 architecture and hardware are
provided for a limited set of products. Xtensa software and
tools are available to all users for software upgrade from
previous releases.
The following changes (denoted with green change bars)
were made to this document for the Cadence RG-2017.7
release of Xtensa processors. Subsequent releases may
contain updates for features in this release or additional
features may be added.
• Section Extending a ConnX BBE32EP DSP with User TIE on page 150 was modified to document the addition of a restriction, reported as a TIE Compiler error.
The following changes (denoted with orange change bars)
were made to this document for the Cadence RG-2017.5
release of Xtensa processors. Subsequent releases may
contain updates for features in this release or additional
features may be added.
• Section Symmetric FIR on page 73 was modified to explain the use of symmetric FIR operations with complex data in the ConnX BBE32EP DSP.
• Added a new section, Implementing the Floating Point FFT/IFFT on ConnX BBE32EP DSP Vector Floating Point Unit (VFPU), on page 124.
The following changes (denoted with purple change bars)
were made to this document for the Cadence RG-2016.4
release of Xtensa processors. Subsequent releases may
contain updates for features in this release or additional
features may be added.
• Added new chapter on "Single-precision Vector Floating-point Option"
• Updated table of FLIX formats
• Updated list of DSP options
• General cleanup and clarifications
2. Introduction
Topics:
• Purpose of this User's Guide
• Installation Overview
• ConnX BBE32EP DSP Architecture Overview
• ConnX BBE32EP DSP Instruction Set Overview
• Programming Model and XCC Vectorization
The Cadence® Tensilica® ConnX BBE32EP DSP (32-MAC
Baseband Engine) is based on an ultra-high-performance
DSP architecture designed for use in next-generation
baseband processors for LTE Advanced, other 4G cellular
radios, and multi-standard broadcast receivers. The high
computational requirements of such applications demand
new and innovative architectures with a high degree of
parallelism and efficient I/O. The ConnX BBE32EP DSP
meets these needs by combining a 16-way Single
Instruction, Multiple Data (SIMD), 32 multiplier-accumulator (MAC), up-to-5-issue Very Long
Instruction Word (VLIW) processing pipeline with a rich
and extensible set of interfaces.
The ConnX BBE family natively supports both real and
complex arithmetic operations. For digital signal
processing developers, this greatly simplifies development
of algorithms dominated by complex arithmetic. In addition
to having the SIMD/VLIW DSP core, the ConnX BBE32EP
DSP contains a 32-bit scalar processor, ideal for efficient
execution of control code. This combined SIMD/VLIW/
Scalar design makes the ConnX BBE32EP DSP ideal for
building real systems where high computational
throughput is combined with complex decision making.
The ConnX BBE32EP DSP is built around a core vector
pipeline consisting of thirty-two 16bx16b MACs along with
a set of versatile pipelined execution units. These units
support flexible precision real and complex multiply-add;
bit manipulation; data shift and normalization; data select,
shuffle and interleave. The ConnX BBE32EP DSP
multipliers and its associated adder and multiplexer trees
enable execution of complex multiply operations and
signal processing filter structures in parallel. The results of
these operations can be extended up to a precision of 40 bits per element, then truncated/rounded/saturated or shifted/packed to meet the needs of different algorithms and
implementations. The ConnX BBE32EP DSP instruction
set is optimized for several DSP kernel operations and
matrix multiplies with added acceleration for a wide range
of key wireless functions. In addition, the instruction set
supports signed*unsigned multiplies for emulation of 32-bit wide multiplication operations.
The ConnX BBE32EP DSP supports programming in
C/C++ with a vectorizing compiler. Automatic vectorization of
scalar C and full support for vector data types allow
software development of algorithms without the need for
programming at the assembly level. Native C operator
overloading is supported for natural programming with
standard C operators on real and complex vector data types. The ConnX BBE32EP DSP has a Boolean
predication architecture that supports a large number of
predicated operations. This enables the ConnX BBE32EP
DSP compiler to achieve a high vectorization throughput
even with complicated functions that have conditional
operations embedded in their inner loops.
The BBE32EP and its larger cousin, the BBE64EP, share
a common architecture, providing a high degree of code
portability between the two cores. Both cores share a
common set of check-box options that allow capabilities to
be added to or removed from the core. This permits the
system designer to optimize the core for a particular
application space, reducing both area and power
consumption.
2.1 Purpose of this User's Guide
The ConnX BBE32EP DSP User’s Guide provides an overview of the ConnX BBE32EP DSP
architecture and its instruction set. It will help ConnX BBE32EP DSP programmers identify
commonly used techniques to vectorize algorithms. It provides guidelines to improve
software performance through the use of appropriate ConnX BBE32EP DSP instructions,
intrinsics, protos and primitives. It also serves as a reference for programming the ConnX
BBE32EP DSP in a C/C++ software development environment using the Xtensa Xplorer (XX)
Integrated Development Environment (IDE). Additionally, this guide will assist those ConnX
BBE32EP DSP users who wish to add custom operations (more hardware) to the ConnX
BBE32EP DSP instruction set using Tensilica® Instruction Extension (TIE) language.
To use this guide most effectively, a basic level of familiarity with the Xtensa software
development flow is highly recommended. For more details, refer to the Xtensa Software
Development Toolkit User’s Guide.
Throughout this guide, the symbol <xtensa_root> refers to the installation directory of the
user's Xtensa configuration. For example, <xtensa_root> might refer to the directory
/usr/xtensa/<user>/<s1> if <user> is the username and <s1> is the name of the user’s Xtensa
configuration. For all examples in this guide, replace <xtensa_root> with the path to the
installation directory of the user’s Xtensa distribution.
2.2 Installation Overview
To install a ConnX BBE32EP DSP configuration, follow the same procedures described in the
Xtensa Development Tools Installation Guide. The ConnX BBE32EP DSP comes with a
library of examples provided in the XX workspace called
bbe32ep_examples_re_v<version_num>.xws.
The ConnX BBE32EP DSP include-files are in the following directories and files:
Note: There is an additional include header file for TI C6x code compatibility, which is
discussed in TI C6x Intrinsics Porting Assistance Library on page 66. This include
file maps TI C6x intrinsics into standard C code and is meant to assist porting only.
2.3 ConnX BBE32EP DSP Architecture Overview
ConnX BBE32EP DSP, a 16-way SIMD processor, has the ability to work on several data
elements in parallel. The ConnX BBE32EP DSP executes a single
operation simultaneously across a stream of data elements by means of vector processing.
For example, it allows for vector additions through a narrow vector ADD (sum of two
16-element 16-bits/element vectors) or a wide vector ADD (sum of two 16-element 40-bits/element
vectors), in parallel. These operations include optimized instructions for complex
multiplication and multiply-accumulation, matrix computation, vector division (optional), vector
reciprocal and reciprocal square root (optional), and other performance-critical kernels.
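The narrow and wide vector ADDs described above can be modeled in portable scalar C. The sketch below is illustrative only, not the DSP's intrinsic API: the 16-bit narrow lanes wrap like ordinary C arithmetic, and the 40-bit wide lanes are modeled inside int64_t.

```c
#include <stdint.h>

#define N 16  /* SIMD width of the ConnX BBE32EP DSP */

/* Narrow vector ADD: sixteen 16-bit lanes. */
static void vec_add16(const int16_t a[N], const int16_t b[N], int16_t r[N]) {
    for (int i = 0; i < N; i++)
        r[i] = (int16_t)(a[i] + b[i]);
}

/* Keep the low 40 bits of a value, sign-extended, to model a 40-bit lane. */
static int64_t wrap40(int64_t x) {
    return (int64_t)((uint64_t)x << 24) >> 24;
}

/* Wide vector ADD: sixteen 40-bit lanes, modeled in int64_t. */
static void vec_add40(const int64_t a[N], const int64_t b[N], int64_t r[N]) {
    for (int i = 0; i < N; i++)
        r[i] = wrap40(a[i] + b[i]);
}
```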
ConnX BBE32EP DSP has a 5-slot VLIW architecture, in which up to five operations can be
scheduled and dispatched in parallel every cycle. This allows the processor to support
multiply-accumulate operations on two narrow vectors of eight 16-bit complex (32-bit
real-imaginary pair) elements in parallel, or equivalently thirty-two 16-bit real elements in total, with a
load of eight complex operands and a store of eight complex results in every cycle. To
sustain such high memory bandwidth requirements, the ConnX BBE32EP DSP has two
asymmetric Load/Store Units (LSUs) which can independently communicate with two local
data memories.
For higher efficiency, the ConnX BBE32EP DSP fetches instructions through a 128-bit wide
interface to local instruction memory (IRAM). The instruction fetch interface supports a mix of
16/24-bit single instructions and 48/96-bit FLIX (up to 5-way) instructions. The processor can
also read from and write to system memory and devices attached to the standard system
buses. Other processors or DMA engines can transfer data in and out of the local memories
in parallel with the processor pipeline. The processor can also have an instruction cache. The
ConnX BBE32EP DSP can also be supplemented with any number of wide, high-speed I/O
interfaces (data cache, TIE ports and queues) to directly control devices or hardware blocks,
and to move data directly into and out of the processor register files.
Figure 1: ConnX BBE32EP DSP Architecture
The ConnX BBE32EP DSP architecture uses variable-length instructions, with encodings of
16/24 bits for its baseline Xtensa RISC instructions and 48/96 bits for up to five operations in
VLIW that may be issued in parallel. The Xtensa compiler schedules different operations into
the up to five available VLIW slots. VLIW Slot-0 is mostly used to issue load
and/or store operations. Slot-1 is used primarily for load operations using the
second LSU, while Slot-4 schedules most move operations. Slot-2 mostly carries
ALU and multiply operations, while Slot-3 is purely for ALU operations present in the
ConnX BBE32EP DSP instruction set. However, it is important
to note that the positions of these slots are interleaved in an actual instruction word
quite differently from the software view.
The table below illustrates all the basic instruction formats supported by the different
operation slots available in the ConnX BBE32EP DSP VLIW architecture. The Xtensa C
Compiler (XCC) automatically picks an instruction format that offers the best schedule for an
application. When possible, XCC will attempt to pick a 48-bit FLIX format or 16/24-bit
standard instruction format to reduce code size.
Note:
• Users may optionally add additional 48/96-bit instruction formats as user TIE; see
Extending a ConnX BBE32EP DSP with User TIE on page 150.
• Format 8 is available only when the Advanced Precision Multiply/Add option is
present in a ConnX BBE32EP DSP configuration.
• Format 13 and Format 14 are available only when the Single-precision Vector
Floating-point option is present in a ConnX BBE32EP DSP configuration.
2.4 ConnX BBE32EP DSP Instruction Set Overview
The ConnX BBE32EP DSP is built around the baseline Xtensa RISC architecture which
implements a rich set of generic instructions optimized for efficient embedded processing.
The power of the ConnX BBE32EP DSP comes from a comprehensive set of over 500 DSP
and baseband optimized operations excluding the baseline Xtensa RISC operations. A
variety of load/store operations support five basic and two special addressing modes for
16/32-bit scalar and 16-bit narrow vector data-types; see Load & Store Operations on page
139. A special addressing mode for circular addressing is also available. Additionally, the
ConnX BBE32EP DSP supports aligning load/store operations to deliver high bandwidth
loads and stores for unaligned data.
Vector data management in the ConnX BBE32EP DSP is supported through operations
designed for element-level data selection, shuffle, or shift. Further, to easily manage precision
in vector data, there are packing operations specific to each supported data-type. Additionally,
there is enhanced ISA support for predicated vector operations. Vector-level predication
allows a vectorizing compiler to exploit deeper levels of inherent parallelism in a program;
see Predicated Vector Operations on page 135.
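Vector-level predication can be pictured in scalar C as computing per-lane condition flags and then selecting per lane. The sketch below is a portable model, not an actual ConnX proto; the function name and the bool-array stand-in for a vbool register are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define N 16  /* SIMD width */

/* A conditional inner loop like "if (x[i] > limit) x[i] = limit;" can be
 * vectorized by computing a 16-lane predicate (models a vbool register)
 * and then performing a predicated move/select in every lane. */
static void clip_predicated(const int16_t x[N], int16_t limit, int16_t r[N]) {
    bool gt[N];                        /* per-lane predicate */
    for (int i = 0; i < N; i++)
        gt[i] = x[i] > limit;
    for (int i = 0; i < N; i++)
        r[i] = gt[i] ? limit : x[i];   /* predicated select */
}
```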
Multiply operations supported by the ConnX BBE32EP DSP include real and complex
16bx16b multiply, multiply-round, and multiply-add operations. Multiply operations for complex
data provide support for conjugate arithmetic, full-precision arithmetic, and magnitude
computation with saturated/rounded outputs. The ConnX BBE32EP DSP is capable of eight
complex multiplies per cycle, where each complex product involves four real multiplies. The
architecture supports extended precision with guard bits on all 40-bit wide vector register
data, full support for double-precision data, and 40-bit accumulation on all MAC operations
without any performance penalty. A wide variety of arithmetic, logical, and shift operations are
supported for up to sixteen 40-bit data-words per cycle. The architecture provides special
operations that assist matrix-multiply operations.
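The four-real-multiplies-per-complex-product arithmetic above can be written out in plain C. This is a reference sketch of the math, not a DSP intrinsic; the type and function names are illustrative.

```c
#include <stdint.h>

typedef struct { int16_t re, im; } c16;    /* one 16-bit complex element */
typedef struct { int64_t re, im; } acc40;  /* one guarded 40-bit accumulator lane */

/* Complex multiply-accumulate: (a.re*b.re - a.im*b.im) + j(a.re*b.im + a.im*b.re).
 * Each complex product costs four real 16bx16b multiplies; each product is at
 * most 32 bits, so accumulating in a 40-bit lane leaves 8 guard bits. */
static void cmac(acc40 *acc, c16 a, c16 b) {
    acc->re += (int32_t)a.re * b.re - (int32_t)a.im * b.im;
    acc->im += (int32_t)a.re * b.im + (int32_t)a.im * b.re;
}
```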
For algorithm and application specific acceleration, the ConnX BBE32EP DSP can be
configured with a number of options:
• Single-precision vector floating-point (see note below)
Restriction: The following pair of options is mutually exclusive:
• Advanced precision multiply/add
• Single-precision vector floating-point
As an example, by configuring a ConnX BBE32EP DSP with the symmetric FIR option and
using pairwise real multiply operations, the configured core offers very high performance for a
wide range of FIR kernels. The performance, in terms of MACs/cycle, on a ConnX BBE32EP
DSP configured for FIR operations is highlighted in the table below.
Table 2: ConnX BBE32EP DSP Multiply Performance

Data    | Coefficients | Type       | MACs/cycle
Complex | Real         | Symmetric  | 64
Complex | Real         | Asymmetric | 32
Real    | Real         | Symmetric  | 64
Real    | Real         | Asymmetric | 32
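The doubling of effective MACs/cycle for symmetric filters comes from coefficient folding: when h[k] == h[taps-1-k], the two mirrored data samples can be added first so each coefficient is used in a single multiply. The scalar reference below sketches this folding; it models the arithmetic only, not the DSP's pairwise multiply operations.

```c
#include <stdint.h>

/* Reference symmetric FIR output sample.  For h[k] == h[taps-1-k], the
 * mirrored samples are folded with one addition per coefficient, halving
 * the multiply count -- the reason the symmetric rows of Table 2 show
 * twice the effective MACs/cycle of the asymmetric rows. */
static int64_t sym_fir_sample(const int16_t *x, const int16_t *h, int taps) {
    int64_t acc = 0;
    for (int k = 0; k < taps / 2; k++)
        acc += (int32_t)h[k] * ((int32_t)x[k] + x[taps - 1 - k]);
    if (taps & 1)  /* center tap of an odd-length filter */
        acc += (int32_t)h[taps / 2] * x[taps / 2];
    return acc;
}
```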
The ConnX BBE32EP DSP instruction set is described in further detail later in ConnX
BBE32EP DSP Features on page 21.
2.5 Programming Model and XCC Vectorization
The ConnX BBE32EP DSP supports a number of programming models, including standard
C/C++; ConnX BBE32EP DSP-specific integer and fixed-point data types with operator
overloads and a level of automated vectorization; scalar intrinsics; and vector intrinsics.
The ConnX BBE32EP DSP contains integer and fixed-point data types that can be used
explicitly by a programmer to write code. These data types can be used with built-in C/C++
operators or with protos (also called intrinsics) as described later in Programming a ConnX
BBE32EP DSP on page 49.
Vectorization, which can be manual or automatic, analyzes an application program for
possible vector parallelism and restructures it to run efficiently on a given ConnX BBE32EP
DSP configuration. Manual vectorization using the ConnX BBE32EP DSP data types and
protos is discussed later in Programming a ConnX BBE32EP DSP on page 49. The Xtensa
C and C++ compiler (XCC) contains a feature to perform automatic vectorization on many
ConnX BBE32EP DSP supported data types. The compiler analyzes and vectorizes a
program with little or no user intervention. It generates code for loops by using the ConnX
BBE32EP DSP operations. It also provides compiler flags and pragmas for users to guide
this process. This feature and its related flags and pragmas are documented in the Xtensa C
and C++ Compiler User’s Guide.
The current generation ConnX cores introduce a new N-way programming model for vector
processing using data-types in memory and registers. The N-way model consists of
N-element data groups to facilitate portable vector programming. N-way refers to the natural
SIMD size of a ConnX machine; the ConnX BBE32EP DSP supports N=16. A ConnX
BBE32EP DSP FLIX instruction can bundle up to five SIMD operations in parallel, and each
operation is capable of producing up to ‘N’ results in an independent FLIX lane. Programmers
can adopt the N-way abstraction model by using Ctypes, operations and protos in their code
that are either N-way or ‘N/2’-way (denoted by ‘N_2’ in Ctype, operation and proto names).
The N-way model on the ConnX BBE32EP DSP supports N and N_2 as abstract
representations of 16 and 8, respectively. This makes code written using the N-way model
easy to port to other architectures with a different SIMD size, for example, other SIMD variants
in the BBE EP core family such as the BBE64EP. Although not recommended, users can also use
equivalent ctypes and protos whose names explicitly have ‘16’ or ‘8’ in place of ‘N’ or ‘N_2’
respectively.
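The portability benefit of the N-way model can be sketched in plain C by abstracting the SIMD width behind a single constant. The struct and function names below merely model the flavor of the real xb_vecNx16 / xb_vecN_2xc16 ctypes; they are illustrative assumptions, not the actual ConnX type system.

```c
#include <stdint.h>

/* Natural SIMD width: 16 on the ConnX BBE32EP DSP, different on other
 * ConnX EP cores.  Kernels written against N (and N/2 for complex data)
 * instead of a hard-coded 16 retarget across SIMD widths. */
#ifndef N
#define N 16
#endif

typedef struct { int16_t lane[N]; } vecNx16;              /* models xb_vecNx16 */
typedef struct { int16_t re[N/2], im[N/2]; } vecN_2xc16;  /* models xb_vecN_2xc16 */

static vecNx16 addNx16(vecNx16 a, vecNx16 b) {
    vecNx16 r;
    for (int i = 0; i < N; i++)
        r.lane[i] = (int16_t)(a.lane[i] + b.lane[i]);
    return r;
}

static vecN_2xc16 addN_2xc16(vecN_2xc16 a, vecN_2xc16 b) {
    vecN_2xc16 r;
    for (int i = 0; i < N / 2; i++) {
        r.re[i] = (int16_t)(a.re[i] + b.re[i]);
        r.im[i] = (int16_t)(a.im[i] + b.im[i]);
    }
    return r;
}
```

Recompiling the same source with a different N is the portable path the text describes; hard-coding '16' ties the kernel to one core.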
For automatic vectorization in an N-way environment, however, all vector
types inside a loop must have the same SIMD width. In the ConnX BBE32EP DSP, this exposes
an important distinction between real and complex vector types. While the real vector types
are supported at a single SIMD width of N by the core, the complex vector types are supported at
two natural SIMD widths, N (16) and N_2 (8). For instance, xb_vecN_2xc16 is an 8-way
complex vector type stored in a single register, while xb_vecNxc16 is a 16-way complex
vector type stored in a pair of registers. It is recommended to use the xb_vecN_2xc16 type
when programming only complex vector types; when programming real and complex
vector types together, the xb_vecNxc16 vector type is suggested for complex data.
3. ConnX BBE32EP DSP Features
Topics:
• ConnX BBE32EP DSP Register Files
• ConnX BBE32EP DSP Architecture Behavior
• Operation Naming Conventions
• Fixed Point Values and Fixed Point Arithmetic
• Data Types Mapped to the Vector Register File
• Data Typing
• Multiplication Operation
• Vector Select Operations
• Vector Shuffle Operations
• Block Floating Point
• Complex Conjugate Operations
• FLIX Slots and Formats
3.1 ConnX BBE32EP DSP Register Files
The ConnX BBE32EP DSP has a partitioned set of register files to provide high bandwidth
with less register bloat. Larger register files permit deeper software pipelining and reduced
memory traffic. The first partition consists of a set of sixteen 256-bit general purpose narrow
vector registers (vec) that can hold operands and results of SIMD operations. Each vec
register can hold either sixteen 16-bit real or eight 32-bit complex elements, depending on
how the register is used by the software. The second partition consists of a set of four 640-bit
wide vector registers (wvec) each of which can hold sixteen 40-bit elements. The wvec
registers can hold either 32-bit elements with eight guard bits or 40-bit elements. The
interface to each data memory is 256 bits wide; as a result, all loads/stores for wide wvec
registers take place through intermediate moves into the narrow 256-bit vec registers.
The ConnX BBE32EP DSP register file organization also has four 256-bit alignment registers
(4x16N, where N = SIMD width; for ConnX BBE32EP DSP, N = 16) and eight 112-bit
specialized variable shift/select registers (8x7N) for use with the select category of
operations that can manipulate the contents of the vec register file. There are also eight
16-bit Boolean registers (16xN) for flexible SIMD and VLIW predication.
Lastly, an optional register file, a two-entry mvec, is added when the Advanced Precision
Multiply/Add (advprec) option is configured in a ConnX BBE32EP DSP core. The 16-way 32-
bit mvec registers are used to hold results of only those operations belonging to the advprec
option. Furthermore, floating-point operands (23-bit element vectors) for advanced precision
operations are held as vectors of 7-bit exponents in vsa registers paired with 16-bit mantissas
in narrow vec registers.
Figure 2: ConnX BBE32EP DSP Register Files
Figure 3: Base Xtensa ISA Register Files
On configuring a ConnX BBE32EP DSP core with certain options, additional state registers
are built into the processor core. These special state registers can also be used to hold
contents of the vector registers temporarily without going to and from memory. The
RUR_<state_name> and WUR_<state_name> operations are used to read from and write to
these state registers. Once these states are initialized, users are advised not to write to
them, so as not to inadvertently overwrite their contents. Following is a list of fixed and
configuration-dependent states in the ConnX BBE32EP DSP:
• BBE_STATE<A|B|C|D> - 256-bit states added by the FFT or symmetric FIR options; they
are shared when both options are present in a ConnX BBE32EP DSP configuration.
• BBE_RANGE - A 4-bit state added by the FFT option.
• BBE_MODE - A 5-bit state added by the FFT option.
• BBE_BMUL_STATE - A 1024-bit state added by the LFSR & Convolutional Encoding option.
• BBE_BMUL_ACC - A 32-bit state added by the LFSR & Convolutional Encoding option.
• BBE_PQUO<0|1> & BBE_PREM<0|1> - 128-bit states added by the Vector Divide option.
• BBE_FLUSH_TO_ZERO - A 1-bit state added by the Advanced Precision Multiply/Add option.
• CBEGIN & CEND - These two states are present in the base ConnX BBE32EP DSP core to
support circular addressing; they are used to initialize the begin and end addresses of a
circular buffer. More details about circular addressing are provided in Load & Store
Operations on page 139.
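The CBEGIN/CEND circular-addressing behavior can be modeled in C as a pointer update that wraps within the [cbegin, cend) window. The helper below is a hypothetical sketch of that behavior, not an actual ConnX operation; on the DSP the wrap is performed by the hardware using the two state registers.

```c
#include <stdint.h>
#include <stddef.h>

/* Models circular addressing with CBEGIN/CEND states: a post-increment
 * that steps past cend (or before cbegin) wraps around the buffer.
 * Assumes |step| is no larger than the buffer length. */
static int16_t *circ_advance(int16_t *p, ptrdiff_t step,
                             int16_t *cbegin, int16_t *cend) {
    p += step;
    if (p >= cend)
        p -= (cend - cbegin);   /* wrapped off the end */
    else if (p < cbegin)
        p += (cend - cbegin);   /* wrapped off the beginning */
    return p;
}
```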
More details on the configurable options for ConnX BBE32EP DSP can be found in
Configurable Options on page 71 and in the ISA HTML of the corresponding operations.
Since the ConnX BBE32EP DSP supports 256-bit loads and stores between registers and
memory on each of its two load/store units, at 800 MHz this provides over 50 GB/s of
combined data memory bandwidth (2 LSUs x 32 bytes/cycle x 800 MHz = 51.2 GB/s).
For data transfers from memory to registers during loads:
• 16-bit elements in memory are loaded as 16-bit elements in narrow vec registers
• 32-bit elements in memory are loaded into narrow vec registers first and then expanded to
40 bits in wide wvec registers after a move
• Signed loads sign-extend the elements, while unsigned loads zero-fill inside vector
registers
Conversely, for data transfers from registers to memory during stores:
• 16-bit elements in registers are stored as 16-bit elements in memory
• 40-bit elements in registers are saturated to 32-bit elements first and then moved to the
narrow vec registers for a store
• Signed stores saturate to the signed 16/32-bit MAX and MIN, while unsigned stores
saturate to the unsigned 16/32-bit MAX.
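The sign-extension and saturation rules above can be captured in a few lines of portable C. This is a behavioral sketch under the rules just listed, not the DSP's load/store hardware; the function names are illustrative.

```c
#include <stdint.h>

/* Load behavior: signed loads sign-extend into the wider lane,
 * unsigned loads zero-fill. */
static int64_t load16_signed(int16_t m)    { return m; }          /* sign-extends */
static int64_t load16_unsigned(uint16_t m) { return m; }          /* zero-fills  */

/* Store behavior: a 40-bit element is saturated to the signed 32-bit
 * range before being moved to a narrow vec register and stored. */
static int32_t store_sat32(int64_t v) {
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}
```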
The ConnX BBE32EP DSP provides protos, also called intrinsics, to store and restore
register spills. The names of protos that store data element(s) to memory on register spills
have a ‘_storei’ suffix, for example xb_vecN_2xc40_storei, xb_vecNxcq9_30_storei,
xb_c16_storei, etc. These protos use either appropriate store operations in the case of 16-bit
data spills from narrow vec registers or use an appropriate combination of move and store
operations in the case of 40-bit data spills from wide wvec registers or other special registers.
The 40-bit data element(s) in the wvec registers are first zero-extended to 64-bit data
element(s) before a move to the narrow vec register, followed by four 16-bit stores per
data element from the narrow vec register to memory.
Conversely, the ConnX BBE32EP DSP has complementary protos to restore data elements
from memory back into the corresponding registers. The names of these protos have a ‘_loadi’
suffix, for example, uint32_loadi, xb_vecN_2xcq19_20_loadi, xb_vecNx32U_loadi, etc. To
restore spills back into the narrow vec register, the restoring spill protos use appropriate load
operations for the 16-bit data-types. However, to restore spills back into the wide wvec
register, the restoring spill protos use an appropriate combination of load and move
operations for the 40-bit data-types. In the latter case, the restoring process is through four
loads of narrow vectors. Once loaded, these are concatenated and, for each element, 40 bits
out of 64 bits are written into the wide wvec vector registers. Notice that the previously
zero-extended 40-bit elements are now truncated back from 64 bits prior to the move
from the narrow vec register to the wide wvec register. In general, these special
protos are meant to be used by the compiler; regular programmers should not need them.
3.2 ConnX BBE32EP DSP Architecture Behavior
The ConnX BBE32EP DSP architecture provides guard bits in its data path and wide register
file to avoid overflow on ALU and MAC operations. The typical data flow on this machine is:
1. Load data into a narrow unguarded vector register file.
2. Compute and accumulate into the wide vector register file with guard bits.
3. Store data in narrow format by packing down with truncation or saturation (via the vec
register file).
The ConnX BBE32EP DSP has thirty-two 16bx16b multipliers, each of which produces a
32-bit result to be stored in the guarded 40-bit register elements. This allows up to eight guard
bits for accumulation in the 40-bit register elements. For real multiplication, only 16 multipliers
are used, while for complex multiplication, all 32 are used. For pairwise multiplies, all 32 are
used even for real data.
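The value of the eight guard bits can be checked with a little arithmetic: a 16bx16b product fits in 32 bits, so up to 2^8 = 256 worst-case products can accumulate in a signed 40-bit lane without overflow. A minimal sketch:

```c
#include <stdint.h>

/* True if v fits in a signed 40-bit lane.  Eight guard bits above a
 * 32-bit product mean up to 256 worst-case 16bx16b products can be
 * accumulated before the signed 40-bit range can be exceeded. */
static int in_range40(int64_t v) {
    return v >= -(1LL << 39) && v < (1LL << 39);
}
```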
The first load/store unit supports all load and store operations present in the ConnX
BBE32EP DSP ISA and the baseline Xtensa RISC ISA. The second load unit supports a
limited set of commonly used vector load and aligning vector load operations present in the
ConnX BBE32EP DSP ISA. This second load unit does not support any store operations.
Additionally, the ConnX BBE32EP DSP offers operations that pack results. For instance, the
‘PACK’ category of operations extracts 16 bits from 40-bit wide data in wvec register elements
and saturates results to the range [-2^15 .. 2^15-1]. Most ‘MOVE’ operations that move 16-bit or
32-bit results from high-precision to low-precision data widths saturate.
Unlike some architectures, the ConnX BBE32EP DSP does not set flags or take exceptions
when operations overflow their 40-bit range, or if saturation or truncation occurs on packing
down from a 40-bit wide vector to a 16-bit vector in registers.
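The silent saturation described above can be modeled directly: the value is clamped to the 16-bit range, and no flag or exception results. A minimal sketch of the saturating pack-down:

```c
#include <stdint.h>

/* Models saturating 'PACK' behavior: a 40-bit element (held here in
 * int64_t) is reduced to 16 bits, clamped to [-2^15, 2^15 - 1].
 * As the text notes, no flag is set and no exception is taken. */
static int16_t pack_sat16(int64_t v) {
    if (v >  32767) return  32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}
```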
3.3 Operation Naming Conventions
The ConnX BBE32EP DSP uses certain naming conventions for greater consistency and
predictability in determining the names of Ctypes, operations and protos. In general, ConnX
BBE32EP DSP operation names are composed of the following components:
• op_class := one or more letters identifying a simple or compound class of operation(s)
• num_SIMD := SIMD width of the core, here 'N' (=16) and 'N_2' (=8)
• element_width := bits in each SIMD element of an operand, here 8/16/32/40 bits
• other_identifiers := one or more letters specifying a sub-class of operation(s) in the
context of the op_class
• C := Complex or circular
• R := Real
• U := Unsigned
• S := Signed or saturation
• I := Immediate
• X := Indexed
• P := Post-increment or pairwise
• T := True (predicate)
• F := False (predicate)
• J := Conjugate
• H := High
• L := Low
• B := Boolean
• U := Unsigned
• A := Address register ar
• V := Narrow vector register vec
• W := Wide vector register wvec
• BR := Boolean register br
• BV := Boolean vector register vbool
• VS := Vector shift/select register vsa
• SF := Spread Factor
• CS := Code sets
• INT := Integer
• ALIGN := Vector alignment register valign
Following are a few specific examples to illustrate the naming conventions listed above.

BBE_LVNX16_I
Loads a vector of sixteen 16-bit signed elements from memory into a narrow vector
register. The base address used for the load is contained in address register ars. This
base address is added to an offset. The _I extension represents an immediate offset
such that the memory address is a multiple of 32 bytes.

BBE_MOVVA16C
This operation performs a single replicating move of a 32-bit complex element,
consisting of two 16-bit data elements, from the address register AR to a narrow vector
register VEC. The destination register, source register, size and nature (complex) of the
data are identified by V, A, 16 and C respectively.

BBE_LSNX16_IP
This operation performs a 1-way signed scalar load of a 16-bit element from memory
into a narrow vector register. The rest of the output narrow vector register is zero-filled.
An address register (AR) holds the base address used for the load, and this base
address is updated using an immediate as an offset after the load is done. The _IP
extension indicates a post-operation update of the address register with the sum of the
base address and the immediate offset.

BBE_MULNX16PACKL
16-way signed real multiply of two narrow 16-bit vectors to produce a 256-bit combined
narrow vector product. The PACKL variant of operations packs the sixteen full-precision
32-bit results, which would otherwise be stored in a 16x40-bit wide vector register, back
into sixteen 16-bit integer results stored in a 256-bit narrow vector register. PACKL
grabs the low-order bits of each result (thus a presumed integer) and truncates the
high-order bits without using a vector shift/select (vsa) register. This kind of pack
extracts the lowest 16 bits of each of the sixteen intermediate result elements
and writes them out without shifting, rounding, or saturation.
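The per-element behavior of the PACKL and PACKQ variants can be modeled in scalar C. This is a sketch of the semantics described above (PACKL: low 16 bits, truncating; PACKQ: Q15 result from the high-order bits, saturating); the rounding behavior of the real hardware may differ, and the function names are illustrative.

```c
#include <stdint.h>

/* PACKL: keep the low-order 16 bits of each full-precision product --
 * integer semantics, no shift, no rounding, no saturation. */
static int16_t packl(int32_t product) {
    return (int16_t)(product & 0xFFFF);
}

/* PACKQ (sketch): a Q15 x Q15 multiply yields a Q30 product; drop 15
 * fractional bits to return to Q15 and saturate the one overflow case
 * (-1.0 * -1.0).  Truncating model; hardware may round. */
static int16_t packq(int16_t a, int16_t b) {
    int32_t p = (int32_t)a * b;   /* Q30 product */
    int32_t q = p >> 15;          /* renormalize to Q15 */
    if (q > 32767) q = 32767;     /* only -1.0 * -1.0 overflows */
    return (int16_t)q;
}
```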
The following table broadly categorizes a set of commonly used ConnX BBE32EP DSP
operations. It also provides a brief description to show the specific naming conventions used
in each of the categories.
Table 3: Sample Categories of Operations

Category | Mnemonic | Type | Description
LOAD | BBE_LV | | Load vector of 16b elements
 | | P | Load complex pair of 16b elements
 | | S | Load scalar 16b element
 | | A | Load unaligned vector of 16b elements
 | | B | Load vector of 1b elements
MOVE | BBE_MOV | | Move between core, vector and state registers. Predicated versions available for some.
 | | A | Move to AR register
 | | BR | Move to BR (core boolean register)
 | | BV | Move to vbool (vector boolean register)
 | | IDX | Indexed move to vector register
 | | PA | Move from AR to vector register as fractional Q5.10 with saturation and replication
 | | QA | Move from AR to vector register as fractional Q15 with saturation and replication
 | | PINT | Move immediate to vector register as fractional Q5.10 with saturation and replication
 | | QINT | Move immediate to vector register as fractional Q15 with saturation and replication
 | | QUO | Move quotients from input vector register to vector-divide quotient state-register
 | | REM | Move remainders from input vector register to vector-divide remainder state-register
 | | S | Move to state-register
 | | SV | Move to vec (narrow vector register) with saturation
 | | SW | Move to wvec (wide vector register) with sign-extension
 | | V | Move to vec (narrow vector register)
 | | VS | Move to vector shift/select register (vsa)
 | | W | Move to wvec (wide vector register)
 | BBE_MALIGN | | Move alignment register
 | BBE_MB | | Move between two vbool (vector boolean) registers
MULTIPLY | BBE_MUL | | Multiply operation. Predicated versions available for some.
 | | A | Multiply-accumulate operation
 | | C | Multiply complex, high precision
 | | J | Multiply complex conjugate, high precision
 | | PACKQ | Multiply signed real, Q15 fractional results with high-order 16-bits, with saturation
 | | PACKP | Multiply signed real, results converted to Q5.10 fractional form, with saturation
 | | PACKL | Multiply signed real, integer results with lower-order 16-bits
 | | CPACKQ, CPACKP & CPACKL | Similar to PACKQ, PACKP & PACKL respectively, but with complex signed operands
 | | JCPACKQ, JCPACKP & JCPACKL | Similar to PACKQ, PACKP & PACKL respectively, but with complex and complex-conjugate signed operands
 | | PR | Multiply with pairwise sum and accumulation
 | | R | Multiply with variable round producing wide (40-bit) results with sign-extension
 | | SGN | Multiply multiplicand by the sign of the multiplier element-wise
 | | S | Multiply producing wide (40-bit) results with sign-extension
 | | UU | Multiply (unsigned*unsigned) producing wide (40-bit) results with sign-extension
 | | US | Multiply (unsigned(first input)*signed(second input)) producing wide (40-bit) results with sign-extension
PACK | BBE_PACK | L | Packs 40-bit signed real vector to 16-bit elements with truncation to keep lower bits
 | | P | Packs 40-bit Q19.20 signed fractional vector to 16-bit Q5.10 elements
 | | S | Packs low-precision integer vector
 | | Q | Packs vector of Q9.30 in wvec into Q15 results in vec
 | | V | Packs 40-bit wvec to low-precision vec based on signed shift amount in vector shift/select register (vsa)
 | BBE_UNPK | P, Q, S, U | 16-way unpack of 16-bit data in vec to 40-bit data in wvec
SELECT | BBE_SEL | | Select sixteen 16-bit elements from two input vectors using vector shift/select register (vsa)
 | | I | Select sixteen 16/40-bit vector from two input vectors using an immediate value
 | | PR | Real element-wise right shift with shift count using an immediate value
 | | PC | Complex element-wise right shift with shift count using an immediate value
 | BBE_SELS | | Single real element select from vec/wvec register into element-0 of vec/wvec register
 | | C | Single complex element select from vec/wvec register into complex element-0 of vec/wvec register
 | BBE_DSELI | | Interleave or de-interleave real/complex elements from two input narrow vectors into two output narrow vectors using an immediate value to specify a select pattern
SHUFFLE | BBE_SHFL | | Shuffle 16-element narrow (16-bit) vector using vector selection register
 | | I | Shuffle 16-element 16/40-bit vector using an immediate value to specify a shuffle pattern
 | | VS | Shuffle elements from a vector shift/select (vsa) register to an output vsa register using an immediate value to specify a shuffle pattern
STORE | BBE_SV | | Store vector of 16b elements
 | | P | Store pair (complex) of 16b elements
 | | S | Store scalar 16b element
 | | A | Store and align vector of 16b elements
 | | B | Store vector of 1b Boolean elements
ABSOLUTE | BBE_ABS | | Absolute value
ADD | BBE_ADD | | Vector add
AND | BBE_AND | | Vector bitwise Boolean AND
CONJUGATE | BBE_CONJ | | Complex conjugate
DIVISION (optional) | BBE_DIV | U | Unsigned vector divide
 | | S | Signed vector divide
SOFT-BIT DEMAP (optional) | BBE_SDMAP | | 3GPP and IEEE constellation soft-bit demap
EQUALITY | BBE_EQ | | Vector equality check
 | BBE_NEQ | | Vector inequality check
EXTRACT | BBE_EXTR | | Extract one real/complex element into AR (address register)
 | | B | Extract one real/complex element into BR (core Boolean register)
 | BBE_EXTRACTB | | Extract elements of a single vbool register into two vbool registers
FFT (optional) | BBE_FFT | | FFT type operations
FLOATING-POINT RECIPROCAL (optional) | BBE_FPRECIP | | Signed 16-bit mantissa + 7-bit exponent pseudo-floating point reciprocal approximation
FLOATING-POINT RECIPROCAL SQUARE-ROOT (optional) | BBE_FPRSQRT | | 16-bit mantissa + 7-bit exponent pseudo-floating point reciprocal square-root approximation
NEGATE | BBE_NEG | | Signed negate
 | | S | Signed saturating negate
INTERLEAVE | BBE_ITLV | | Bit-by-bit interleave of two narrow vectors into one narrow vector
JOIN | BBE_JOIN | | Join boolean vectors
MAGNITUDE | BBE_MAGI | | Interleaved magnitude of complex vectors, high precision
 | | PACKQ | Interleaved magnitude of complex fractional vectors, Q15 results with the high-order 16-bits
 | | PACKL | Interleaved magnitude of complex integer vectors, integer results with the lower-order 16-bits
 | | PACKP | Interleaved magnitude of complex integer vectors, results packed to 16-bits in Q5.10 format after shifting
 | BBE_MAGIA | | Interleaved magnitude of complex integer vectors, result multiply-accumulated
 | BBE_MAGIR | | Interleaved magnitude of complex integer vectors, variable rounded results
MAX | BBE_MAX | | Max of vector
 | BBE_MAXU | | Max of unsigned vector
 | BBE_BMAX | | Max of vector generating Boolean mask
MIN | BBE_MIN | | Min of vector
 | BBE_BMIN | | Min of vector generating Boolean mask
NAND | BBE_NAND | | NAND of vec/wvec vectors
NSA | BBE_NSA | | Normalize shift amount
 | | E | Normalize shift amount truncated to even value
 | | C | Complex normalize shift amount
 | | U | Unsigned normalize shift amount
OR | BBE_OR | | OR of vec/wvec/vbool vectors
POLYNOMIAL | BBE_POLY | | Polynomial evaluation
RECIPROCAL (optional) | BBE_RECIP | | 16-bit vector reciprocal approximation
RECIPROCAL SQUARE-ROOT (optional) | BBE_RSQRT | | Compute normalization and table lookup factors for advanced reciprocal square root
REDUCTION | BBE_RADD | | Vector sum reduction
 | BBE_RMAX, MIN | | Vector signed reduction maximum/minimum
 | BBE_RBMAX, MIN | | Vector signed reduction maximum/minimum along with boolean vector indicating location of maximum/minimum
REPLICATE | BBE_REP | | Replicate elements
ROUND | BBE_RND | ADJ | Rounding add, using variable round amounts from vector shift/select register (vsa)
 | | SADJ | Symmetric rounding add, using variable round amounts from vector shift/select register (vsa)
SATURATE | BBE_SAT | S | Saturate signed vector
 | | U | Saturate unsigned vector
SEQUENCE | BBE_SEQ | | Create sequence of integer values from 0 to (32-1) in output vec/wvec register
SHIFT LEFT/RIGHT | BBE_SLL | | Logical left shift, amount of signed shift as input from vsa register
 | | I | Logical left shift, amount of unsigned shift as immediate input
 | BBE_SLS | | Saturating shift, amount of signed shift as input from vsa register
 | | I | Saturating shift, amount of unsigned shift as immediate input
 | BBE_SLA | | Arithmetic left shift, amount of signed shift as input from vsa register
 | BBE_SRA | | Arithmetic right shift, amount of signed shift as input from vsa register
 | | I | Arithmetic right shift, amount of unsigned shift as immediate input
Within each class, sub-conventions are used to describe types of operations, data type
layouts, signed/unsigned, data formatting, addressing modes, etc., depending on the
operation or its class. Table 4: Types of Load/Store Operations on page 34 has an
abbreviated list of this information; for detailed information about the operations, see the
HTML Instruction Set Architecture page. The easiest method to access this page is from the
configuration overview in Xtensa Xplorer:
1. Double-click on the ConnX BBE32EP DSP configuration in the System Overview to open
the Configuration Summary window.
2. Click the View Details button in the rightmost column for the installed build to open a
Configuration Overview window.
3. Select All Instructions to open a complete list of instruction descriptions.
In the ConnX BBE32EP DSP, all the operation and proto names (excluding the baseline
Xtensa RISC ISA) begin with the prefix "BBE_". The full names are then constructed using a
number of logical fields separated by underscores ("_"). Syntactically, operation names
having either underscores ("_") or periods (“.”) may be used in an assembly-level program.
However, the C language does not allow the use of periods (".") in the names of operation
identifiers. Therefore, in referring to protos for base Xtensa operations in C, periods in names
are substituted by underscores. This document uses the assembly-correct names for protos,
i.e., with underscores after a prefix and periods in the body of a proto name. However, note
that all the programming examples in Programming a ConnX BBE32EP DSP on page 49
list protos using the C-correct form with only underscores and no periods. Thus, in the online
ISA HTML documentation (see On-Line ISA, Protos and Configuration Information on page
167), when searching for information about an operation or a proto having period(s) in its
name, it is useful to search for a variation of the name with underscore(s) replacing period(s).
Note: Only some base Xtensa operations have periods (".") in their names. ConnX
BBE32EP DSP operation names do not have any periods.
3.4 Fixed Point Values and Fixed Point Arithmetic
The ConnX BBE32EP DSP contains operations for implementing fixed point arithmetic. This
section describes the representation and interpretation of fixed point values as well as some
operations on fixed point values.
Representation of Fixed Point Values
A fixed point data type Qm.n contains a sign bit, some number of bits m to the left of the
binary point, and some number of bits n to the right of the binary point. When expressed as
a binary value and stored into a register file, the least significant n bits are the fractional part,
and the most significant m+1 bits are the integer part expressed as a signed 2s complement
number. If the binary value is interpreted as a 2s complement signed integer, converting from
the binary value to a fixed point number requires dividing the integer by 2^n.
Thus, for example, the 40-bit Q9.30 number 1.5 is represented as 0x00 6000 0000:

  Bit Range (Field)   39 (Sign)   38-30 (Integer)   29-0 (Fraction)

and the 16-bit Q15 number -0.5 is represented as 0xC000:

  Bit Range (Field)   15 (Sign)   14-0 (Fraction)
  Size                (1 bit)     (15 bits)
  Binary              1           100 0000 0000 0000
  Hex                 0x1         0x4000
When m = 0, we write Qn. When n = 0, the data type is just a signed integer and we call the
data type a signed (m+1)-bit integer, or int(m+1).
The ConnX BBE32EP DSP operations use Q15, Q5.10, Q1.30, Q11.20, Q19.20 and Q9.30
data types, described in more detail, as follows:
•Q15 - 16-bit fixed point data type with 1 sign bit and 15 bits of fraction, to the right of the
binary point. The largest positive value 0x7fff is interpreted as (1.0 - 2^-15). The smallest
negative value 0x8000 is interpreted as (-1.0). The value 0 is interpreted as (0.0).
•Q5.10 - 16-bit fixed point data type with 1 sign bit, 5 integer bits to the left of the binary
point and 10 bits of fraction to the right of the binary point.
•Q1.30 - 32-bit fixed point data type with 1 sign bit, 1 integer bit to the left of the binary
point and 30 bits of fraction to the right of the binary point.
•Q11.20 - 32-bit fixed point data type with 1 sign bit, 11 integer bits to the left of the binary
point and 20 bits of fraction to the right of the binary point.
•Q19.20 - 40-bit fixed point data type with 1 sign bit, 19 integer bits to the left of the binary
point and 20 bits of fraction to the right of the binary point.
•Q9.30 - 40-bit fixed point data type with 1 sign bit, 9 integer bits to the left of the binary
point and 30 bits of fraction to the right of the binary point.
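The interpretations above can be checked with a couple of illustrative helpers, not part of the DSP API, that simply divide the two's complement integer by 2^n:

```c
#include <assert.h>
#include <stdint.h>

/* Interpret a two's complement integer as a Qm.n value by dividing by 2^n.
   These helpers are for illustration only; the DSP ctypes carry this
   interpretation implicitly. */
static double q15_to_double(int16_t q)   { return (double)q / (1 << 15); }
static double q9_30_to_double(int64_t q) { return (double)q / (1LL << 30); }
```

For example, `q15_to_double((int16_t)0xC000)` recovers -0.5, and `q9_30_to_double(0x0060000000LL)` recovers 1.5, matching the bit layouts shown earlier.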
Arithmetic with Fixed Point Values
When multiplying fixed point numbers Qm0.n0 * Qm1.n1 with a standard signed integer
multiplier, the natural result of the multiply is a Qm.n data type where n = n0+n1 and m =
m0+m1+1. So multiplying a Q15 by a Q15 generates a Q1.30. Since the ConnX BBE32EP
DSP has 16-bit x 16-bit multipliers, it multiplies two Q15 values to produce a Q1.30 result
sign extended to Q9.30 in a 40-bit register element. Converting Q1.30 to Q9.30 requires a
sign extension that fills the 40-bit register. Similarly, a multiplication between two Q5.10 types
produces a Q11.20 result that is sign extended to Q19.20 to fill a 40-bit register.
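This data flow can be sketched in scalar C. The sketch models one element pair; the DSP performs the same step across sixteen elements at once, and an int64_t stands in for the 40-bit register element.

```c
#include <assert.h>
#include <stdint.h>

/* Q15 * Q15 yields Q1.30 (n = 15+15, m = 0+0+1); sign extension into a
   40-bit element gives Q9.30. */
static int64_t q15_mul_to_q9_30(int16_t a, int16_t b)
{
    int32_t q1_30 = (int32_t)a * (int32_t)b;  /* exact Q1.30 product */
    return (int64_t)q1_30;  /* sign extension fills the 8 guard bits */
}
```

Note that (-1.0)*(-1.0) produces +1.0, which is representable in Q1.30 precisely because of the extra integer bit m = 1.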
3.5 Data Types Mapped to the Vector Register File
A number of different data types are defined for the vector register files. These data types are
also referred to as ctypes after the name of the TIE construct that creates them.
Scalar Data Types Mapped to the Vector Register File
The signed integer data types are:
•xb_int16 - A 16-bit signed integer stored in the 16-bit vector register element.
•xb_int32 - A 32-bit signed integer stored in the least significant 32 bits of a 40-bit vector
register element. The upper 8 bits are sign extended from bit 31.
•xb_int40 - A 40-bit signed integer stored in a vector register element.
The complex integer data types are:
•xb_c16 - A signed complex integer value with 16-bit imaginary and 16-bit real parts. The
real and imaginary pair is stored in two 16-bit elements of a vector register file, with the
real part in the lower significant element.
•xb_c32 - A signed complex integer value with 32-bit imaginary and 32-bit real parts. The
real and imaginary pair is stored in two 40-bit elements of a vector register file, with the
real part in the less significant element. The values are sign extended from bit 31.
•xb_c40 - A signed complex integer value with 40-bit imaginary and 40-bit real parts. The
real and imaginary pair occupies two 40-bit elements of a vector register file, with the real
part in the lower significant element.
In addition to the integer data types, the ConnX BBE32EP DSP also supports a programming
model with explicit fixed-point (fractional) data types. The software programming model
provides C intrinsics and operator overloading that use these data types. All the scalar ones
that fit in a single vector register are listed below.
The six real fixed-point data types are:
•xb_q15 - A signed Q15 data type stored in a 16-bit vector register element.
•xb_q5_10 - A signed Q5.10 data type that occupies a 16-bit vector register element.
•xb_q1_30 - A signed Q1.30 data type is stored in the least significant 32-bits of a 40-bit
vector register element. The rest of the bits are sign extended from bit 31.
•xb_q11_20 - A signed Q11.20 data type is stored in the least significant 32-bits of a 40-bit
vector register element. The rest of the bits are sign extended from bit 31.
•xb_q19_20 - A signed Q19.20 data type that uses all 40 bits of a vector register element.
•xb_q9_30 - A signed Q9.30 data type that uses all 40 bits of a vector register element.
The six complex fixed-point data types are:
•xb_cq15 - A signed complex fixed-point value with Q15 imaginary and Q15 real parts. The
real and imaginary pair is stored in two 16-bit elements of a vector register file, with the
real part in the lower significant element.
•xb_cq5_10 - A signed complex fixed-point value with Q5.10 imaginary and Q5.10 real
parts. The real and imaginary pair occupies two 16-bit elements of a vector register file, with
the real part in the lower significant element.
•xb_cq1_30 - A signed complex fixed-point value with Q1.30 imaginary and Q1.30 real
parts. The real and imaginary pair is stored in two 40-bit elements of a vector register file,
with the real part in the lower significant element. The values are sign extended from bit
31.
•xb_cq11_20 - A signed complex fixed-point value with Q11.20 imaginary and Q11.20 real
parts. The real and imaginary pair occupies two 40-bit elements of a vector register file, with
the real part in the lower significant element. The values are sign extended from bit 31.
•xb_cq19_20 - A signed complex fixed-point value with Q19.20 imaginary and Q19.20 real
parts. The real and imaginary pair occupies two 40-bit elements of a vector register file,
with the real part in the lower significant element.
•xb_cq9_30 - A signed complex fixed-point value with Q9.30 imaginary and Q9.30 real
parts. The real and imaginary pair occupies two 40-bit elements of a vector register file, with
the real part in the lower significant element.
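The element ordering shared by all of these complex types can be illustrated with a small C sketch. The struct and function names here are hypothetical, not from the DSP headers; only the layout rule is taken from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* A complex Q15 value occupies two consecutive 16-bit elements, with the
   real part in the less significant (lower-indexed) element. */
typedef struct { int16_t elem[2]; } cq15_layout;

static cq15_layout make_cq15(int16_t re, int16_t im)
{
    cq15_layout v;
    v.elem[0] = re;  /* real part in the lower element */
    v.elem[1] = im;  /* imaginary part in the upper element */
    return v;
}

static int16_t cq15_real(cq15_layout v) { return v.elem[0]; }
static int16_t cq15_imag(cq15_layout v) { return v.elem[1]; }
```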
Vector Data Types Mapped to the Vector Register File
Following is a list of the vector register files and their corresponding vector data types. The
double vector data types are physically stored in a pair of registers.
Table 5: Vector Data Types Mapped to Vector Register Files
The following tables (Table 4 through Table 7) list the complete data type naming
convention for both operations and protos. These basic data types are used to understand
the naming convention. All operations can be accessed using intrinsics with the same name
as the operation using one of the base data types xb_vecNx for full vector and xb_vecN_2x
for half-vector (for use with complex types). Alternatively, intrinsics are provided to give the
same functionality of each operation mapped appropriately to the set of data types.
The ConnX BBE32EP DSP supports standard C data-types, referred to as memory
data-types, for 16/32-bit scalar and vector data elements in memory. These scalar and
vector data-elements are mapped to the ConnX BBE32EP DSP register file as register
data-types. While the 16-bit memory types are mapped to 16-bit register types, the 32-bit
memory types are mapped to 40-bit register types along with 8 guard bits.
Following are scalar, mem vector, and vector register data type details defined for C.
The ConnX BBE32EP DSP supports a variety of multiplication operations that use a set of
thirty-two 16bx16b SIMD multipliers and associated adders provided as computational
resources. These multiplier-enabled operations can be broadly classified as multiply,
multiply-accumulate, multiply and round, multiply-subtract, unsigned multiply, and signwise
multiply operations.
The table below lists the various flavors of multiplication supported by the ConnX BBE32EP
DSP. Each of these operations may have multiple protos, each of which supports a specific
data-type related to the operation. Refer to the ISA HTML of an operation to find all the
protos using that operation.
Table 10: Types of ConnX BBE32EP DSP Multiplication Operations

Type | Operation | Description
Real signed multiply | BBE_MULNX16 | 16 16-bit operands; 40-bit sign-extended results
Real signed multiply with low-precision results | BBE_MULNX16PACKL | 16 16-bit operands; 16-bit low-precision integer results
 | BBE_MULNX16PACKP | 16 16-bit operands; 16-bit Q5.10 fractional results
 | BBE_MULNX16PACKQ | 16 16-bit operands; 16-bit Q15 fractional results
Complex signed multiply | BBE_MULNX16C | 8 16-bit complex operands; 40-bit sign-extended results
Complex signed multiply with low-precision results | BBE_MULNX16CPACKL | 8 16-bit complex operands; 16-bit low-precision integer results
 | BBE_MULNX16CPACKP | 8 16-bit complex operands; 16-bit Q5.10 fractional results
 | BBE_MULNX16CPACKQ | 8 16-bit complex operands; 16-bit Q15 fractional results
Complex conjugate signed multiply | BBE_MULNX16J | 8 16-bit complex operands; 40-bit sign-extended results
Complex conjugate signed multiply with low-precision results | BBE_MULNX16JPACKL | 8 16-bit complex operands; 16-bit low-precision integer results
 | BBE_MULNX16JPACKP | 8 16-bit complex operands; 16-bit Q5.10 fractional results
 | BBE_MULNX16JPACKQ | 8 16-bit complex operands; 16-bit Q15 fractional results
Complex and complex conjugate signed multiply with low-precision results | BBE_MULNX16JCPACKL | 8 16-bit complex operands; 16-bit low-precision integer results
 | BBE_MULNX16JCPACKP | 8 16-bit complex operands; 16-bit Q5.10 fractional results
 | BBE_MULNX16JCPACKQ | 8 16-bit complex operands; 16-bit Q15 fractional results
Special complex signed multiply with pairwise reduction add | BBE_MULNX16PC_0, BBE_MULNX16PC_1 | Used to accelerate matrix multiplication. Refer to ISA HTML.
Real signed multiply-accumulate | BBE_MULANX16 | 16 16-bit operands; 40-bit sign-extended results
Complex signed multiply-accumulate | BBE_MULANX16C | 8 16-bit complex operands; 40-bit sign-extended results
Complex conjugate signed multiply-accumulate | BBE_MULANX16J | 8 16-bit complex operands; 40-bit sign-extended results
Special complex signed multiply-accumulate with pairwise reduction add | BBE_MULANX16PC_0, BBE_MULANX16PC_1 | Used to accelerate matrix multiplication. Refer to ISA HTML.
Real signed multiply and variable round | BBE_MULRNX16 | 16 16-bit operands; 40-bit sign-extended results
Complex signed multiply and variable round | BBE_MULRNX16C | 8 16-bit complex operands; 40-bit sign-extended results
Complex conjugate signed multiply and variable round | BBE_MULRNX16J | 8 16-bit complex operands; 40-bit sign-extended results
Special complex signed multiply with pairwise reduction add and rounding | BBE_MULRNX16PC_0, BBE_MULRNX16PC_1 | Used to accelerate matrix multiplication. Refer to ISA HTML.
Real signed multiply-subtract | BBE_MULSNX16 | 16 16-bit operands; 40-bit sign-extended results
Complex signed multiply-subtract | BBE_MULSNX16C | 8 16-bit complex operands; 40-bit sign-extended results
Complex conjugate signed multiply-subtract | BBE_MULSNX16J | 8 16-bit complex operands; 40-bit sign-extended results
Unsigned with signed multiply | BBE_MULUSNX16 | 16 16-bit operands with first input unsigned and second input signed; 40-bit sign-extended results
Unsigned with signed multiply-accumulate | BBE_MULUSANX16 |
Unsigned with signed multiply with rounding | BBE_MULUSRNX16 |
Unsigned with unsigned multiply | BBE_MULUUNX16 |
Unsigned with unsigned multiply-accumulate | BBE_MULUUANX16 |
Unsigned with unsigned multiply with rounding | BBE_MULUURNX16 |
Special signwise multiply | BBE_MULSGNNX16 | Refer to ISA HTML.
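As a concrete illustration of the PACK variants in the table, here is a scalar C reference model of the PACKQ idea for a single element pair. It is a sketch only; the exact bit selection and saturation behavior of the real operation should be confirmed in the ISA HTML.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a Q15 "PACKQ" result: multiply two Q15 operands, keep the
   high-order bits of the Q1.30 product as Q15, and saturate. */
static int16_t mul_packq_ref(int16_t a, int16_t b)
{
    int32_t q1_30 = (int32_t)a * (int32_t)b;  /* Q1.30 product */
    int32_t q15   = q1_30 >> 15;              /* high-order bits as Q15 */
    if (q15 >  32767) q15 =  32767;           /* saturate into Q15 range */
    if (q15 < -32768) q15 = -32768;
    return (int16_t)q15;
}
```

Saturation matters in exactly one Q15 case: (-1.0)*(-1.0) yields +1.0, which is not representable in Q15 and so clamps to the largest positive value.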
3.8 Vector Select Operations
The vector select operations (BBE_SEL) allow elements from two source vectors to be
selectively copied into a destination vector. With this general definition of selective element
transfer, it is easy to implement replication, rotation, shift, extraction and interleaving with the
same basic BBE_SEL type of operation.
To set up a selective transfer, the BBE_SELNX16 operation takes two 16x16-bit narrow
source vectors and one 16x16-bit narrow target vector; along with the vector select register
(vsa). The vector select register contains a user defined 16-valued pattern to define the
required transfer. The 16 values in the vector select register can be [0,…,31] referring to the
indices of the combination of the two source vectors.
On the other hand, the BBE_SELNX16I operation allows common selection patterns without
having to set the vector select register that is required by the BBE_SELNX16 operation. The
BBE_SELNX16I operation selects sixteen 16-bit elements from a pair of narrow vector
registers and produces a single narrow vector output. The nature of the selection pattern is
specified through an immediate value iSel. To use selection on 40-bit wide vector data-types,
the BBE_SELNX40I operation provides support for a limited set of pre-defined selections.
Eight preset selection patterns can be chosen with an appropriate immediate iSel to rotate or
interleave wide vector data-types. Refer to the ISA HTML of BBE_SELNX40I for a list of all
the available types and permitted immediate values iSel.
Note: Select patterns for the immediate operations BBE_SELNX16I and BBE_SELNX40I
are not necessarily the same across the range of BBE-EP cores (BBE16EP/32EP/64EP).
Programmers should study the patterns available for the particular core they are using in
the ISA HTML in order to determine the best one.
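The selection rule can be modeled in plain C as follows. This is a sketch, not the intrinsic; in particular, the ordering of the two sources in the 0..31 index space is an assumption to be checked against the ISA HTML.

```c
#include <assert.h>
#include <stdint.h>

#define NSEL 16  /* sixteen 16-bit elements per narrow vector */

/* Each pattern value in [0,31] indexes the concatenation {src0, src1}:
   0..15 pick from src0, 16..31 pick from src1 (assumed ordering). */
static void sel_ref(int16_t out[NSEL], const int16_t src0[NSEL],
                    const int16_t src1[NSEL], const uint8_t pat[NSEL])
{
    for (int i = 0; i < NSEL; ++i)
        out[i] = (pat[i] < NSEL) ? src0[pat[i]] : src1[pat[i] - NSEL];
}
```

With a pattern of 8,9,...,23 this model implements a rotation across the vector pair, which is one of the common uses the text mentions.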
Vector Initialization and Some Additional Select Patterns
The ConnX BBE32EP DSP also contains a special move operation, BBE_MOVVINX16,
which is used for vector initialization based on an immediate value. The table below
illustrates what is possible.
3.9 Vector Shuffle Operations
The vector shuffle operations (BBE_SHFL) in the ConnX BBE32EP DSP can shuffle any 16
elements from a source vector register into a target vector register. These shuffle operations
are often useful in performing vector compression, expansion, reordering and in preparation
for matrix multiplication.
For shuffle-based operations, some examples of the support present in the ConnX BBE32EP
DSP are the ability to:
•perform a full SELECT/SHUFFLE and a specialized SELECT/SHUFFLE (pattern specified
by an immediate value) in a single cycle
•perform one 2x2 interleave per cycle with two vector inputs and two vector outputs
•perform a 3x3 interleave in multiple steps
•support specialized shuffles for implementing FIR filters with two vector inputs and two
vector outputs
The BBE_SHFLNX16 operation shuffles elements in a narrow source vector based on a
shuffle pattern specified by the vector selection register vsa. The shuffle pattern is set by
storing desired indices [0,…,N-1] in the N values in the vsa register.
Similar to the select operations, the BBE_SHFLNX16I and BBE_SHFLNX40I operations
use an immediate iSel to specify a shuffle pattern without using the vsa register; for wide
vector types, support is limited to a set of predefined shuffle patterns. Refer to the ISA HTML
for a list of all available patterns and corresponding immediate values iSel.
Note: Shuffle patterns for the immediate operations BBE_SHFLNX16I and
BBE_SHFLNX40I are not necessarily the same across the range of BBE-EP cores
(BBE16EP/32EP/64EP). Programmers should study the patterns available for the
particular core they are using in the ISA HTML in order to determine the best one.
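A matching scalar model of the vsa-driven shuffle, again a sketch rather than the intrinsic:

```c
#include <assert.h>
#include <stdint.h>

#define NSHFL 16  /* sixteen 16-bit elements per narrow vector */

/* Each of the 16 pattern values in [0,15] names the source element that
   lands in that output position. Unlike select, there is one source. */
static void shfl_ref(int16_t out[NSHFL], const int16_t src[NSHFL],
                     const uint8_t pat[NSHFL])
{
    for (int i = 0; i < NSHFL; ++i)
        out[i] = src[pat[i]];
}
```

A pattern of 15,14,...,0, for example, reverses the vector; repeated indices replicate elements, which is how shuffles support expansion.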
3.10 Block Floating Point
When applications can operate across a wide numerical range, the programmer may wish to
implement "block floating point", the adjustment of an entire data set based on determining
the actual range of values, normalizing the data-set to maximize the number of bits of
precision, and readjusting the range later in the computation. The operations
BBE_NSANX16, BBE_NSANX16C, BBE_NSANX40, BBE_NSANX40C, BBE_NSAUNX16
and BBE_NSAUNX40 facilitate block floating point. These operations calculate the left shift
amount required to normalize each element to a 16-bit or 40-bit value. The result is returned
in another register, and can be used with the operations BBE_SLLNX16 and BBE_SLLNX40
to normalize the elements of a single vector.
The corresponding operations for xb_vecN_2x40 data are BBE_NSANX40 and
BBE_NSAUNX40. To implement block floating point, BBE_NSANX16 or BBE_NSAUNX16 is
applied to the entire data set, and the minimum normalization shift is calculated using
BBE_MINNX16. Then the same shift is applied to the entire data set, typically using
BBE_SLLNX16. This shift amount can be used at a later stage of the computation, with
BBE_SRANX16 to return the data to non-normalized form.
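The whole scheme can be sketched on a plain C array. This is illustrative only: on the DSP, the per-element shift amounts, the minimum, and the common shift would come from BBE_NSANX16, BBE_MINNX16 and BBE_SLLNX16 working on whole vectors, and the helper names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* nsa16() mimics a per-element normalize shift amount: the left shift
   that brings a 16-bit value to full precision (15 for zero). */
static int nsa16(int16_t v)
{
    uint32_t m = (v < 0) ? (uint32_t)~(int32_t)v : (uint32_t)v;
    int n = 15;
    while (m) { m >>= 1; --n; }  /* count significant magnitude bits */
    return n;
}

/* Normalize a block by the minimum shift over all elements; the returned
   shift is kept so the data can be denormalized later (cf. BBE_SRANX16). */
static int block_normalize(int16_t *x, int len)
{
    int shift = 15;
    for (int i = 0; i < len; ++i) {
        int s = nsa16(x[i]);
        if (s < shift) shift = s;   /* minimum over the whole block */
    }
    for (int i = 0; i < len; ++i)
        x[i] = (int16_t)(x[i] * (1 << shift));  /* common left shift */
    return shift;
}
```

Using the minimum shift guarantees that the largest-magnitude element just fills 16 bits, so no element overflows while the block gains the maximum common precision.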
3.11 Complex Conjugate Operations
A complex conjugate multiply of two complex numbers a and b is the multiplication of a by
the complex conjugate of b and is defined as: result = (a.real + j a.imag) * (b.real - j b.imag),
where b = (b.real + j * b.imag) and the complex conjugate of b = (b.real - j * b.imag). We use
"J" in instruction names to indicate complex conjugate operations, and "JC" for a regular
multiply and a complex conjugate multiply as a pair.
There are many variants of complex-conjugate multiply operations:
For example, BBE_MULNX16J takes as input two 16x16-bit vectors, in which the real and
imaginary portions of eight complex numbers are interleaved. It produces eight complex
results in one 16x40-bit register where the real and imaginary parts of the complex results
are interleaved.
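The arithmetic for one element pair is just the following, with plain int math standing in for the Q15 element arithmetic the DSP performs eight pairs at a time (illustrative only):

```c
#include <assert.h>

/* result = a * conj(b) = (a.re + j a.im) * (b.re - j b.im) */
typedef struct { int re, im; } cint;

static cint cmul_conj(cint a, cint b)
{
    cint r;
    r.re = a.re * b.re + a.im * b.im;  /* real part of a * conj(b) */
    r.im = a.im * b.re - a.re * b.im;  /* imaginary part */
    return r;
}
```

Conjugate multiplies of this form are the core of correlation and channel-estimation kernels, which is why the ISA carries dedicated "J" variants.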
3.12 FLIX Slots and Formats
The ConnX BBE32EP DSP can issue up to five operations in a single instruction bundle
using Xtensa LX FLIX (VLIW) technology. A bundle can contain scalar and vector SIMD operations. The
ConnX BBE32EP DSP is implemented with a number of 48/96-bit formats in addition to the
standard 24-bit and optional 16-bit instruction formats in the Xtensa LX architecture. Each
basic 48/96-bit format can bundle five operations into its five separate FLIX slots respectively.
Instruction List – Showing Slot Assignments
The instruction slot assignment list, showing operations assigned to the various slots of the
various formats, is automatically generated and available for use in the on-line configuration
documentation for the ConnX BBE32EP DSP. Consult the ISA HTML for details on accessing
this on-line information. Since this list is automatically generated from the machine
description, it is comprehensive and up to date, including the user's choice of configuration
options.
4. Programming a ConnX BBE32EP DSP
Topics:
•Programming in
Prototypes
•Xtensa Xplorer Display
Format Support
•Operator Overloading
and Vectorization
•Programming Styles
•Conditional Code
•Using the Two Local Data
RAMs and Two Load/
Store Units
•Other Compiler Switches
•TI C6x Intrinsics Porting
Assistance Library
Cadence® recommends two important Xtensa manuals to
read and become familiar with before attempting to obtain
optimal results by programming the ConnX BBE32EP
DSP:
•Xtensa® C Application Programmer’s Guide
•Xtensa® C and C++ Compiler User’s Guide
Note that this chapter does not attempt to duplicate
material in either of these guides.
The ConnX BBE32EP DSP is based on SIMD (Single
Instruction/Multiple Data) techniques for parallel
processing. It is typical for programmers to do some work
to fully exploit the available performance. It may only
require recognizing that an existing implementation of an
application is already in essentially the right form for
vectorization, or it may require completely reordering the
algorithm’s computations to bring together those that can
be done in parallel.
This chapter describes several approaches to
programming the ConnX BBE32EP DSP and explores the
capabilities of automated instruction inference and
vectorization, and cases where the use of intrinsic-based
programming is appropriate.
To use the ConnX BBE32EP DSP data types and
intrinsics in C, please include the appropriate top-level
header file by using the following preprocessor directive in
the source file:
#include <xtensa/tie/xt_bben.h>
The xt_bben.h include file is auto-generated through the
ConnX BBE32EP DSP core build process and contains
#defines for programmers to conditionalize code (see
Conditional Code on page 63). Furthermore, xt_bben.h
includes the lower-level include file (xt_bbe32.h) which
has more core-specific ISA information. It is worth noting
that the header file xt_bben_verification.h contains
additional #defines that are used only for ISA verification
purposes, and thus are not documented for programmers'
use.
The ConnX BBE32EP DSP processes both fixed-point
and integer data. The basic data element, xb_vecNx16, is
16-bits wide, and a vector consists of sixteen such 16-bit
elements. Thus, an input vector in memory is 256-bits
wide. The ConnX BBE32EP DSP also supports
xb_vecNx32 and xb_vecNx32U, a wider 32-bit data
element with eight such elements stored in a 256-bit
vector. This double-width data type is typically generated
as the result of a multiply or multiply/accumulate operation
on 16-bit data elements.
As described earlier, during a load operation, the ConnX
BBE32EP DSP loads 16-bit scalar values to 16-bits and it
expands 32-bit scalar values to 40-bits. The ConnX
BBE32EP DSP supports both signed and unsigned data
types in memory and provides a symmetrical set of load
and store operations for both these data types. For signed
data, a load results in sign extension from 32-bits to 40-bits. For unsigned data, loads result in zero extension.
The ConnX BBE32EP DSP architecture philosophy
follows the data flow of loading data at a lower precision
into a narrow unguarded vec register file, computing and
accumulating into the 40-bit/element guarded wide wvec
register file (8-guard bits), and storing results back at a
lower precision via the narrow vec register file. The ConnX
BBE32EP DSP ALU/MAC operations wrap on overflow,
and results either saturate when going from 40 bits to
32 bits or are packed (PACK) when going from 32 bits to 16 bits.
Note: There are no direct loads into or stores from
the wide wvec register file. All wvec loads/stores
take place through narrow vec registers.
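The load-extension and narrowing behavior described above can be modeled in scalar C using 64-bit host arithmetic (the DSP's guarded lanes are 40 bits wide; these helper names are illustrative, not actual ConnX BBE32EP protos):

```c
#include <stdint.h>

/* Signed 32-bit loads sign-extend into the wide (40-bit) lane. */
static int64_t load_extend_s32(int32_t v)  { return (int64_t)v; }

/* Unsigned 32-bit loads zero-extend into the wide lane. */
static int64_t load_extend_u32(uint32_t v) { return (int64_t)v; }

/* Narrowing from the wide lane back to 32 bits saturates. */
static int32_t sat40_to_32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}
```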
Automatic type conversion is also supported between
vector types and the associated scalar types, for example
between xb_vecNx16 and short and between xb_vecNx40
and int. Converting from a scalar to a vector type
replicates the scalar into each element of the vector.
Converting from a vector to a scalar extracts the first
element of the vector.
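These conversion rules can be sketched in plain C, modeling a vector as a 16-element array (the helper names are illustrative, not real protos):

```c
#define NWAY 16   /* stands in for the 16-way SIMD width */

/* Scalar-to-vector conversion: replicate the scalar into each element. */
static void splat_s16(short v, short vec[NWAY])
{
    for (int i = 0; i < NWAY; i++)
        vec[i] = v;
}

/* Vector-to-scalar conversion: extract the first element. */
static short first_s16(const short vec[NWAY])
{
    return vec[0];
}
```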
The C compiler also supports the xb_vecNx40 type, a
vector of sixteen 40-bit elements. This data type is useful
because logically the multiplication of two variables of
type xb_vecNx16 returns a variable of type xb_vecNx40.
The xb_vecNx40 type is implemented using the 640-bit
wide register file.
In addition to the basic types discussed above, there are a
number of other types, such as complex integer,
fractional, and complex fractional. As discussed in ConnX
BBE32EP DSP Features on page 21, there are alternative
prototypes (protos) for the ConnX BBE32EP DSP
operations that deal with these types. This allows both a
more intuitive programming style and richer capabilities
for more automated compiler inference and vectorization
using these additional data types.
Some examples of the various types are:
•xb_cq1_30: Memory scalar, one complex pair of Q1.30
type, 1-bit for integer and 30-bit fractional, sign bit
implied, total 64 bits
•xb_vecN_2xcq9_30: Register vector, eight complex
elements of Q9.30 type, each 9-bit integer and 30-bit
fractional, sign bit implied, total 640 bits
•xb_vecNx32U: Memory vector, sixteen unsigned real
elements of 32-bit integer type, total 512 bits
•xb_vecN_2xcq15: Memory vector, eight complex
elements of Q15 type, total 256 bits
4.1 Programming in Prototypes
As part of its programming model, the ConnX BBE32EP DSP defines a number of operation
protos (prototypes or intrinsics) for use by the compiler in code generation, for use by
programmers with data types other than the generic operation support, and to provide
compatibility with related programming models and DSPs. For programmers, the most
important use of protos is to provide alternative operation mappings for various data types. If
the data types are compatible, one operation can support several variations without any extra
hardware cost. For example, protos allow vector operations to support all types of real and
complex integer and fixed-point data types.
The ConnX BBE32EP DSP contains several thousand protos. Some protos are meant for
compiler usage only, although advanced programmers may find a use for them in certain
algorithms. Many of these protos are called BBE_OPERATOR, and in general should not be
used as manual intrinsics; look for the simpler protos for the function instead. The complete
list of protos can be accessed via the ISA HTML.
In the ISA HTML package contained within the ConnX BBE32EP DSP configuration, the
proto_list.html page categorizes all the protos available for programmer use. The following
categories of protos are listed in the easy-to-navigate HTML page:
•BBE PRIMARY PROTOS
•COMPILER USE PROTOS
•DATA MANAGEMENT PROTOS
•OPERATOR OVERLOAD PROTOS
•TYPE CONVERSION PROTOS
•TYPE-CASTING PROTOS
•USER REGISTER PROTOS
•ZERO ASSIGNMENT PROTOS
Extract Protos
Extract protos are used to convert between types in the ConnX BBE32EP DSP programming
model. The extract protos are used to produce new vectors from existing vectors. Note that
their naming and prototype argument list follows certain conventions. The destination type
follows “BBE_EXTRACT”, which is followed by the source type. The destination argument is
first in the argument list, followed by the source argument, and then any special arguments
such as immediates.
There are some special destination-type names used that are not normal ConnX BBE32EP
DSP types. For example, “R” means real (extract the real parts from a vector of complex
variables), and “I” means imaginary (extract the imaginary parts from a vector of complex
variables).
The ConnX BBE32EP DSP also provides protos to extract the real and imaginary parts of a
complex vector - BBE_EXTRACTI_FROMC16 and BBE_EXTRACTR_FROMC16.
Example: The following code shows the use of an Extract proto; in this case, to extract the
real values from a complex vector.
Note that the ConnX BBE32EP DSP scalar types are used here, as the compiler does
automatic vectorization and inference.
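In scalar C, the de-interleave pattern that the extract protos perform can be sketched as follows ('offset' selects which element of each pair is kept; which slot is real versus imaginary depends on the data layout, and the guide notes that pairs display as (imaginary, real)):

```c
/* De-interleave one half of a complex (stride-2) array, as the
 * BBE_EXTRACTR/BBE_EXTRACTI protos do for vectors.
 * offset = 0 or 1 selects which element of each pair to keep. */
static void extract_stride2(const short *src, int npairs, int offset,
                            short *dst)
{
    for (int i = 0; i < npairs; i++)
        dst[i] = src[2 * i + offset];
}
```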
Combine Protos
Combine protos are also used to convert between types in the ConnX BBE32EP DSP
programming model. Note that their naming and prototype argument list follows certain
conventions. The destination type follows “BBE_COMBINE”, which is followed by the source
type. The destination arguments are first in the argument list, followed by the source
arguments, and then any special arguments such as immediates. The "BBE_COMBINE"
protos are used to produce a single integer or fixed-point complex vector from two
non-complex input vectors of the same type.
There are some special source type names used that are not normal ConnX BBE32EP DSP
types. For example, “I” means imaginary, “R” means real, and “Z” means zero. Thus, “IR”
means imaginary-real (combine a vector of imaginary parts and a vector of real parts into a
vector of interleaved complex numbers). “ZR” means zero-real — that is, all imaginary parts
are zero, thus this will generate a vector of interleaved complex numbers whose imaginary
parts are all zero.
Complementary to the BBE_EXTRACT set of protos, the ConnX BBE32EP DSP provides a
set of “BBE_JOIN” protos that can be used to produce a wide integer or fixed-point complex
vector by joining two narrow complex vectors of the same type. While the BBE_JOIN protos
require narrow complex vectors as inputs, the BBE_COMBINE protos combine non-complex
vectors into a complex vector.
Example: In the following code, a combine from a vector of real values and zeroes for the
imaginaries is used to create two complex vectors.
// N-way inverse for scaling, note divide is integer division
// Also convert to complex multiplicand to use in later steps
xb_vecNx16 hinv = BBE_DIVNX32(numinv, numinv, hsabs);
xb_vecNxc16 hcxhi, hcxlo;
BBE_COMBINENXC16_FROMZR(hcxhi, hcxlo, hinv);
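A scalar-C model of the ZR combine (real values interleaved with zero imaginary parts) might look like the following; the slot chosen for the real part of each pair is a layout assumption:

```c
/* Interleave real values with zero imaginary parts, producing npairs
 * complex pairs, as a BBE_COMBINE..._FROMZR proto does for a vector. */
static void combine_zr(const short *re, int npairs, short *dst)
{
    for (int i = 0; i < npairs; i++) {
        dst[2 * i]     = re[i];  /* real value (layout assumption)   */
        dst[2 * i + 1] = 0;      /* imaginary part forced to zero    */
    }
}
```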
Move Protos
The ConnX BBE32EP DSP additionally provides “BBE_MOV” category of protos used to cast
between data-types. For vector inputs, the BBE_MOV protos cast from one data-type to
another related data-type without changing the size of the bit-stream. These protos don’t
have a cycle cost and are helpful in multi-type programming environments. While the data is
interpreted differently, the length of the vector remains the same within the register file. This
category of protos usually follows the format
BBE_MOV<after_type>_FROM<before_type>
Examples: BBE_MOVNX16_FROMNXQ5_10, BBE_MOVNX40_FROMN_2XCQ9_30, etc.
Operator Overload Protos
The proto HTML documentation also includes the Operator protos for compiler usage. Note
that Cadence® recommends usage by advanced programmers only.
When using intrinsics, you may pass variables of different types than expected as long as
there is a defined conversion from the variable type to the type expected by the intrinsic.
Operator overloading is only supported if there is an intrinsic with types that exactly match
the variables. Implicit conversion is not allowed since with operator overloading you are not
specifying the intrinsic name, and the compiler does not guess which intrinsics might match.
The matching intrinsic must exist for operator overloading to work, but there is no advantage
in calling it directly.
4.2 Xtensa Xplorer Display Format Support
Xtensa Xplorer provides support for a wide variety of display formats, which makes the
varied data types easier to use and to debug. These formats allow memory and
vector register data contents to be displayed in a variety of ways. In addition, users can
define their own display formats. Variables are displayed by default in a format matching their
vector data types. Registers are by default always displayed as xb_vecNx16 types, but you
can change the format to any other format.
For example, for a 256-bit variable, the xb_vecNx16m display format shows hex and
decimal values for each real element of the vector.
Note that the complex numbers are displayed as they are laid out in memories and registers,
since the ordering of each pair is (imaginary, real).
4.3 Operator Overloading and Vectorization
Common ConnX BBE32EP DSP operations can be accessed in C or C++ by applying
standard C operators to the ConnX BBE32EP DSP data types. There are many operator
overloads defined for the various types, and the compiler will infer the correct ConnX
BBE32EP DSP vector operation in many cases depending upon the data-types used.
In addition, if the scalar types are used in loops, the compiler will often be able to
automatically vectorize and infer or overload. That is, a loop using the ConnX BBE32EP DSP
scalar types may turn into a loop of the ConnX BBE32EP DSP vector operations that is as
tightly packed and efficient as manual code using ConnX BBE32EP DSP intrinsics.
For operations that do not map to standard operators, intrinsics can be used. Several
intrinsics can map to the same underlying operation, with different intrinsics for different
argument types. Intrinsics are particularly useful for special operations such as selects and
polynomials, which cannot be easily inferred by the compiler. It is often best to use intrinsics
to select a specific load/store flavor where the compiler may not always pick the right
load/store operation for the best schedule.
When using intrinsics, the compiler still handles type checking and data movement, and
schedules operations into available slots of a FLIX format.
To understand the limits of compiler automatic inferencing and overloading and vectorization,
the rest of this chapter discusses the various programming styles and provides several
examples showing how the ConnX BBE32EP DSP can be programmed, including the
example results.
32-bit memory ctypes are treated equivalently to corresponding 40-bit register ctypes. Even
though operator overload protos are not explicitly defined for 32-bit ctypes, the equivalence
lets programmers use the corresponding 40-bit ctype protos. This means xb_vecNx32 =
xb_vecNx32 + xb_vecNx32 works exactly the same way as xb_vecNx40 = xb_vecNx40 +
xb_vecNx40. The following table shows the correspondence of 32/40-bit ctypes for the
purpose of operator overloading.
32-bit Ctype          40-bit Ctype
xb_q11_20             xb_q19_20
xb_vecNxq11_20        xb_vecNxq19_20
xb_cq11_20            xb_cq19_20
xb_vecN_2xcq11_20     xb_vecN_2xcq19_20
xb_vecNxcq11_20       xb_vecNxcq19_20
xb_c32                xb_c40
xb_cq1_30             xb_cq9_30
xb_vecNxc32           xb_vecNxc40
xb_vecNxcq1_30        xb_vecNxcq9_30
xb_vecN_2xc32         xb_vecN_2xc40
xb_vecN_2xcq1_30      xb_vecN_2xcq9_30
xb_vecNx32            xb_vecNx40
xb_vecNx32U           xb_vecNx40
xb_vecNxq1_30         xb_vecNxq9_30
xb_int32              xb_int40
xb_q1_30              xb_q9_30
4.4 Programming Styles
It is typical for programmers to have to put in some effort on their code, especially legacy
code, to make it run efficiently on a vectorized DSP. For example, there may be changes
required for automatic vectorization, or the algorithm may need some work to expose
concurrency so vector instructions can be used manually as intrinsics. For efficient access to
data items in parallel, or to avoid unaligned loads and stores, which are less efficient than
aligned load/stores, some amount of data reorganization (data marshalling) may be
necessary.
Four basic programming styles that can be used, in increasing order of manual effort, are:
•Auto-vectorizing scalar C code
•C code with vector data types (manually vectorized)
•Use of C intrinsic functions along with vector data types and manual vectorization
•Use of DSP Nature Library
One strategy is to start with legacy C code or to write the algorithm in a natural style using
scalar types (possibly using the ConnX BBE32EP DSP special scalars - xb_int16, xb_c16,
xb_int40, xb_c40; e.g., for Q15, complex, or complex Q15 data). Once the correctness of the
fixed-point code using the ConnX BBE32EP DSP scalar data types has been established, the
limits of what automatic/manual vectorization with operator overloading achieves can be
investigated. By profiling the code, computationally intensive regions of code can be
identified and the limits of automated vectorization determined.
Those parts of the code that could be vectorized further can then be modified manually to
improve performance. Finally, the most computationally intensive parts of the code can be
improved in performance through the use of C intrinsic functions.
At any point, if the performance goals for the code have been met, the optimization can
cease. By starting with what automation can do and refining only the most computationally
intensive portions of code manually, the engineering effort can be directed to where it has the
most effect, as discussed in the next sections.
Auto-Vectorization
Auto-vectorization of scalar C code using ConnX BBE32EP DSP types can produce effective
results on simple loop nests, but has its limits. It can be improved through the use of compiler
pragmas and options, and effective data marshalling to make data accesses (loads and
stores) regular and aligned.
The xt-xcc compiler provides several options and methods of analysis to assist in
vectorization. These are discussed in more detail in the Xtensa C and C++ Compiler User’s
Guide, in particular in the SIMD Vectorization section. Tensilica® recommends studying this
guide in detail; however, following are some guidelines in summary form:
•Vectorization is triggered with the compiler options -O3 and -LNO:simd, or by selecting the
Enable Automatic Vectorization option in Xplorer. The -LNO:simd_v and -keep options
give feedback on vectorization issues and keep intermediate results, respectively.
•Data should be aligned to 32-byte boundaries because of the 256-bit load/store interface.
The XCC compiler will naturally align arrays to start on 32-byte boundaries. But the
compiler cannot assume that pointer arguments are aligned. The compiler needs to be
told that data is aligned by one of the following methods:
•Using global or local arrays rather than pointers
•Using #pragma aligned(<pointer>, n)
•Compiling with -LNO:aligned_pointers=on
•Pointer aliasing causes problems with vectorization. The __restrict attribute for pointer
declarations (e.g. short * __restrict cp;) tells the compiler that the pointer does not alias.
•Compiler alignment options, such as -LNO:aligned_pointers=on, tell the compiler that it
can assume data is always aligned.
There are global compiler aliasing options, but these can sometimes be dangerous.
•Subtle C/C++ semantics in loops may make them impossible to vectorize. The
-LNO:simd_v feedback can assist in identifying small changes that allow effective
vectorization.
•Irregular or non-unity strides in data array accessing can be a problem for vectorization.
Changing data array accesses to regular unity strides can improve results, even if some
"unnecessary computation" is necessary.
•Outer loops can be simplified wherever possible to allow inner loops to be more easily
vectorized. Sometimes trading outer and inner loops can improve results.
•Loops containing function calls and conditionals may prevent vectorization. It may be
better to duplicate code and perform a little "unnecessary computation" to produce better
results.
•Array references, rather than pointer dereferencing, can make code (especially
mathematical algorithms) both easier to understand and easier to vectorize.
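Put together, a loop shaped to the guidelines above might look like the following: global (compiler-aligned) arrays, unit stride, array references rather than pointers, and no calls or conditionals in the body. The array names are illustrative.

```c
#define VEC_SIZE 1024
short sa[VEC_SIZE], sb[VEC_SIZE], sc[VEC_SIZE];

/* A vectorization-friendly loop: xt-xcc at -O3 -LNO:simd can turn this
 * into 16-way vector adds with aligned loads and stores. */
void vec_add_short(void)
{
    int i;
    for (i = 0; i < VEC_SIZE; i++)
        sc[i] = sa[i] + sb[i];
}
```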
Operator Overloading and Inferencing
Many basic C operators work in conjunction with both automatic and manual vectorization to
infer the right intrinsic:
•+ addition
•- subtraction: both unary (additive inverse) and binary
•* multiplication: real and complex
•& bitwise AND
•^ bitwise XOR
•| bitwise OR
•<< bitwise left shift
•>> bitwise right shift
•~ for ConnX BBE32EP DSP complex types, a complex conjugate operation is inferred;
otherwise, bitwise NOT or one’s complement operator
•< less than
•<= less than or equal to
•> greater than
•>= greater than or equal to
•== equal to
The next section illustrates how they work in conjunction.
Vectorization and Inferencing Examples
The examples provided in this section for the ConnX BBE32EP DSP include code snippets
which illustrate the extent of and limits to automatic compiler capabilities. This includes three
very simple algorithms: vector add, vector dot product and matrix multiply, and several scalar
data types: int, short, etc.
In all the following examples, VEC_SIZE is 1024 and ARRAY_SIZE is 16.
Almost all of these examples vectorize. For example, int vector add:
int ai[VEC_SIZE], bi[VEC_SIZE], ci[VEC_SIZE];
void vec_add_int()
{
int i;
for (i = 0; i < VEC_SIZE; i++)
{
ci[i] = ai[i] + bi[i];
}
}
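The short dot-product variant discussed in this section can be written in plain scalar C along the following lines (a sketch; this exact listing and its array names are not from the guide):

```c
#define VEC_SIZE 1024
short da[VEC_SIZE], db[VEC_SIZE];

short dot_short(void)
{
    int i;
    int sum = 0;               /* products accumulate at wider precision */
    for (i = 0; i < VEC_SIZE; i++)
        sum += da[i] * db[i];
    return (short)sum;         /* reduced back to a short result */
}
```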
Note the automatic selection of the BBE_MULANX16 for the short, as well as appropriate
loads and stores. In addition, the correct reduction-add is selected to return the result as a
short.
Two variants that do not vectorize in these examples are vector dot product and matrix
multiply of ints. This is for the simple reason that the ConnX BBE32EP DSP offers 16-bit
matrix multiplication (via its 16 16x16 bit multipliers), but not 32-bit matrix multiplication. In
this case, ordinary scalar code and scalar instructions are used.
One interesting example of complex vectorization and inference starts with the following
source code:
Note the automatic inference of a complex conjugate multiply, BBE_MULANX16J, from the
C construct X[i] * ~X[i], where X is defined as a complex Q15 type.
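In scalar terms, multiplying a complex value by its own conjugate yields the squared magnitude, which is why the construct maps onto a conjugate multiply; a plain-C model (illustrative helper, not a proto):

```c
/* X * conj(X) = (re + j*im)(re - j*im) = re*re + im*im, imaginary part 0. */
static long cmul_conj_sq(short re, short im)
{
    return (long)re * re + (long)im * im;
}
```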
Manual Vectorization
Of course, even the best compiler cannot automatically vectorize all code and loop nests
even if all the guidelines have been followed. In this case, the next step is to move from
scalar ConnX BBE32EP DSP types to vector types, and manually vectorize the loops. The
compiler may still be able to infer the use of vector intrinsics by using standard C operators.
When you manually vectorize, you divide the loop count by the number of elements in the
vector instructions and replace scalar types with the corresponding vector type; for
example, replacing shorts with xb_vecNx16 (memory variables or vector register variables).
The compiler deals with the ConnX BBE32EP DSP vector data types as it does with any
other type.
Below is an example of manually vectorized code for a vector add function:
void vector_add (const short a[ ], const short b[ ],
short c[ ], unsigned len)
{
int i;
// Cast pointers to short into pointers to vectors
const xb_vecNx16 *va = (xb_vecNx16*)a;
const xb_vecNx16 *vb = (xb_vecNx16*)b;
// Assume no pointer aliasing
xb_vecNx16 * __restrict vc = (xb_vecNx16*)c;
// Change loop count to work on vector types
for (i = 0; i < len/XCHAL_BBEN_SIMD_WIDTH; i += 1)
{
vc[i] = va[i] + vb[i];
}
}
Here we see the loop count divided by XCHAL_BBEN_SIMD_WIDTH, which is equal to 16 for
ConnX BBE32EP DSP, and the use of xb_vecNx16 vector variables. Also note the use of the
__restrict attribute to allow efficient compilation by telling the compiler there is no pointer
aliasing to array c.
The programmer casts short pointers (in this case, array references) to vector pointers. The
compiler will automatically generate the correct loads and stores. Note that this example
assumes that "len" is a multiple of XCHAL_BBEN_SIMD_WIDTH, which is sixteen. If it is not, the
programmer needs to write extra code for the more general situation. However, if the data is
arranged to always be a size multiple of the normal vector size, then the result can be more
efficient even if a few unnecessary computations are included. Padding a data structure with
a few zeroes to make it a multiple (of sixteen in the case above) is also often easy to do.
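Rounding a buffer length up to the next multiple of the vector width, as suggested above, can be done with a small helper (SIMD_WIDTH stands in for XCHAL_BBEN_SIMD_WIDTH):

```c
#define SIMD_WIDTH 16

/* Round len up to a multiple of SIMD_WIDTH; the mask trick works
 * because SIMD_WIDTH is a power of two. */
static unsigned round_up_vec(unsigned len)
{
    return (len + SIMD_WIDTH - 1) & ~(unsigned)(SIMD_WIDTH - 1);
}
```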
C-Intrinsic-based Programming
The final programming style is to use explicit intrinsics. Interestingly, it may not be necessary
to use intrinsics everywhere, as the compiler may, for example, infer the right vector loads
and stores. Sometimes adding just a few strategic intrinsics may be sufficient to achieve
maximum efficiency. The compiler can still be counted on for efficient scheduling and
optimization.
Here is a simple example adding up a vector:
short addemup(short a[ ], unsigned int n)
{
int i;
short sum = 0;
for (i = 0; i < n; i += 1)
{
sum += a[i];
}
return sum;
}
Here is an optimized intrinsic-based version:
short addemup_v(short a[ ], unsigned int n)
{
int i;
// Set a vector pointer to array a
xb_vecNx16 *pa = ((xb_vecNx16 *) a);
// Declare sum as a vector type and initialize to zero
xb_vecNx16 sum = 0;
xb_vecNx16 avec;
for (i = 0; i < n; i += XCHAL_BBEN_SIMD_WIDTH)
{
sum += (*pa++);
}
// Add vector of intermediate sums and return short result
return BBE_RADDNX16(sum);
}
Following are several interesting points:
•There is no need to use explicit vector loads.
•Similarly, the efficient vector adds are inferred from the code, which is still "C-like".
•The only explicit intrinsic necessary is the BBE_RADDNX16.
•This is a simple evolution from a manually vectorized version of this code.
•"sum" is initialized by casting it to a short 0, which initializes the vector "sum" to 0 in each
element.
•Note that intrinsics are not assembly operations. They need not be manually scheduled
into FLIX bundles; the compiler takes care of all that. And the code still remains quite "C-like".
•Intrinsic-based programming can make use of the rich set of ConnX BBE32EP DSP
data types, and the right proto can be chosen that maps the data type into the underlying
base instruction. Protos are listed in detail in the ISA HTML.
•The compiler will automatically select load/store instructions, but programmers may be
able to optimize results using their own selection, by using the correct intrinsic instead of
leaving it to the compiler.
4.5 Conditional Code
Programmers are encouraged to use the N-way programming model in ConnX BBE32EP
DSP. N-way programming is a convenient way to write code that offers ease of portability
across cores in the BBE family. However, the assembler still uses operations with specific
SIMD number native to the core used, in this case N=16. N-way programming is supported
by inclusion of the primary header file - <xtensa/tie/xt_bben.h>. This header file
xt_bben.h includes a supplementary include file specific to the BBE machine being
programmed, here <xtensa/tie/xt_bbe32.h>. Within these include files, there are a
number of #defines that define "XCHAL" variables describing the machine characteristics,
such as XCHAL_BBEN_SIMD_WIDTH, the number of 16-bit elements per vector (16 for the
ConnX BBE32EP DSP).
In addition to N-way programming, to make code robust to the wide configurability in ConnX
BBE32EP DSP, it is useful to be able to write conditional code so that if an optional package
is present (say the FFT option), the ISA support from the package can be explicitly used.
Otherwise, the code may use a slower emulation for that operation. This is particularly useful
in a complicated codebase shared between many cores with no source differences: one can
use an appropriate #define to write conditional code specific to the SIMD size or to the
configuration options included, so that all variants can be located in a single source file. The
#defines for external, user-configurable options are separately listed as a section inside the
primary header file (<xtensa/tie/xt_bben.h>).
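Such conditional code can be sketched as follows; HAVE_FFT_OPTION is an illustrative stand-in for one of the option #defines listed in xt_bben.h, not a real define:

```c
/* Select an implementation based on a configuration #define. */
#ifdef HAVE_FFT_OPTION
static const char *fft_impl(void) { return "hardware"; }  /* use the option's ISA  */
#else
static const char *fft_impl(void) { return "emulated"; }  /* slower fallback path  */
#endif
```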
The following example illustrates the usage of vector types, XCHAL_BBEN_SIMD_WIDTH, and
an intrinsic call based on the N-way programming model:
xb_vecNx16 *vin;   // vin points to an array of N input samples
xb_vecNx40 vout = 0;
for (i=0; i < N/XCHAL_BBEN_SIMD_WIDTH; i++)
vout += vin[i] * vin[i];
*out_p = BBE_RADDNX40(vout);
4.6 Using the Two Local Data RAMs and Two Load/Store Units
The ConnX BBE32EP DSP has two load/store units, which are generally used with two local
data RAMs. Effective use of these local memories and obtaining the best performance results
may require experimentation with several options and pragmas in your program.
In addition, to correctly analyze your specific code performance, it is important to carry out
profiling and performance analysis using the right ISS options.
Your ConnX BBE32EP DSP configuration may have a "CBox" (Xtensa Connection Box)
configured in order to access the two local data RAMs; for example, the two default
ConnX BBE32EP DSP templates described in Implementation Methodology on page 147
include one.
Using the CBox, if a single instruction issues two loads to the same local memory, the
processor will stall for one cycle. Stores are buffered by the hardware, so a store can often
sneak into a cycle that does not access the same memory. As a result, use of a CBox with two
local data RAMs may cause occasional access contention, depending on the data usage and
access patterns of the code. This access contention is not modeled by the ISS unless you
select the --mem_model simulation parameter. Thus, if your code uses the two local data
RAMs and your configuration has a CBox, it is important to select memory modeling when
studying the code performance.
If you are using the standard set of LSPs (Linker Support Packages) provided with your
ConnX BBE32EP DSP configuration, and do not have your own LSP, use of the "sim-local"
LSP will automatically place compiled code and data into local instruction and data memories
to the extent that this is possible. Thus, Tensilica® recommends the use of sim-local LSP or
your own LSP for finer grained control.
Finer-grained control over the placement of data arrays and items into local memories and
assigning specific items to specific data memories can be achieved through using attributes
on data definitions. For example, the following declaration might be used in your source
code:
short ar[NSAMPLES][ARRAY_SIZE][ARRAY_SIZE] __attribute__((section(".dram1.data")));
This code declares a short 3-dimensional array ar, and places it in data RAM 1. The compiler
automatically aligns it to a 32-byte boundary.
Once you have placed arrays into the specific data RAM you wish, there are two further
things to control. The first is to tell the compiler that data items are distributed into the two
data RAMs, which can be thought of as "X" and "Y" memory as is often discussed with DSPs.
The second one is to tell the compiler you are using a CBox to access the two data RAMs.
There are two controls, currently not documented in the Xtensa® C Application Programmer’s
Guide or Xtensa C and C++ Compiler User’s Guide, that provide this further level of control.
These two controls are a compiler flag, -mcbox, and a compiler pragma (placed in your
source code) called "ymemory".
The -mcbox compiler flag tells the compiler to never bundle two loads of "x" memory into the
same instruction or two loads of "y" memory into the same instruction (stores are exempt as
the hardware buffers them until a free slot into the appropriate memory bank is available).
Anything marked with the ymemory will be viewed by the compiler as "y" memory. Everything
else will be viewed as "x" memory.
There are some subtleties in using these two controls — when they should be used and how.
Here are some guidelines:
•If your configuration does not have a CBox, you should not use -mcbox, as you are
constraining the compiler to avoid an effect that does not apply.
•If you are simulating without --mem_model, -mcbox might seem to degrade performance
because the simulator will not account for the bank stalls.
•If you have partitioned your memory into the two data RAMs, but you have not marked
half the memory using the ymemory pragma, use of -mcbox may give worse performance.
Without it, randomness will avoid half of all load-load stalls. With the flag, you will never
get to issue two loads in the same instruction.
•However, also note that there are scenarios where -mcbox will help. If, for example, there
are not many loads in the loop, it might be possible to go full speed without ever issuing
two loads in one instruction. In that case, -mcbox will give perfect performance, while not
having -mcbox might lead to random collisions.
•If you properly mark your dataram1 memory using ymemory, or if all your memory is in
one of the data rams, -mcbox should always be used.
•Without any -mcbox flag, but with the ymemory pragma, the compiler will never bundle
two "y" loads together but might still bundle together two "x" loads. With the -mcbox flag,
it will also not bundle together two "x" loads.
Thus, in general, the most effective strategy for optimal performance is to always analyze ISS
results that have used memory modeling; to assign data items to the two local data memories
using attributes when declaring them; to mark this using the ymemory pragma; and to use
-mcbox, assuming your configuration has a CBox.
Use of the ymemory pragma is illustrated in the following code:
Note: In this code, we place input array a in data RAM 1, b in data RAM 0, and
the output array in data RAM 0. We tell the compiler with the ymemory pragma that
the a array is in y memory, which effectively tells it the other two arrays are in x
memory. Finally, since this is run on a configuration with a CBox, we compile with the
-mcbox option and run with memory modeling enabled in the ISS. The combination
of the ymemory pragma and the -mcbox compiler option produces better results than
if only one were used.
4.7 Other Compiler Switches
The following two other compiler switches are important:
•-mcoproc: Discussed in the Xtensa® C Application Programmer's Guide and the Xtensa C
and C++ Compiler User's Guide; it may give better results for certain program code.
•-O3 and SIMD vectorization: If you use intrinsic-based code and manually vectorize it, it
may not be necessary to use the -O3 and SIMD options. In fact, this may produce code that
takes longer to execute than using -O2 (without SIMD, which only has an effect at -O3).
However, if you are relying on the compiler to automatically vectorize, it is essential to use
-O3 and SIMD to see this happen. As is the case with all compiler controls and switches,
experimenting with them is recommended. In general, -O3 (without SIMD) will still be
better than -O2.
4.8 TI C6x Intrinsics Porting Assistance Library
Tensilica® provides an include header file, c6x-compat.h, for the RG-2016.4 release.
This file is included to help in porting code that uses TI C6x intrinsics to any Tensilica® Xtensa
processor as it maps these intrinsics to standard C. Because it maps TI C6x intrinsics to
standard C, the performance of the code is not optimized for the ConnX BBE32EP DSP; to
optimize the code further, you need to manually modify it using ConnX BBE32EP DSP
data types that the compiler can vectorize and infer from, and/or ConnX BBE32EP DSP
intrinsics.
Note: The path to this header file has to be modified accordingly for a different
release.
Thus, this header file is intended as a porting aid only. One recommended methodology is:
•Include this code in your source files that use TI C6x intrinsics and move them to ConnX
BBE32EP DSP. As it handles most intrinsics, the code should, with little manual effort,
compile and execute successfully on ConnX BBE32EP DSP.
•Using the command line or Xtensa Xplorer profiling capabilities, profile the code to
determine those functions, loops and loop nests which take most of the cycles.
•Rewrite those computationally-intensive functions, loops or loop nests to use ConnX
BBE32EP DSP data types and compiler automatic vectorization, or ConnX BBE32EP
DSP intrinsics, to maximize application performance. You could substitute calls to the
ConnX BBE32EP DSP library functions in these places.
The porting header implements 122 of the 131 TI C6x intrinsic functions in standard C. Those not
implemented are: _gmpy, _gmpy4, _xormpy, _lssub, _cmpy, _cmpyr, _cmpyr1, _ddotpl2r, and
_ddotph2r.
Porting TI C6X Code Examples
This section contains some simple examples for using the intrinsic porting assistance file to
port TI C6X code.
The first example uses the TI C6X _mpy intrinsic.
Example 1: _mpy
int a[VEC_SIZE], b[VEC_SIZE], c[VEC_SIZE];

void test_mpy()
{
    int i;
    for (i = 0; i < VEC_SIZE; i++) {
        c[i] = _mpy(a[i], b[i]);
    }
}

int main()
{
    test_mpy();
}
The _mpy intrinsic, which is used when you include c6x-compat.h, is:
static inline int _mpy(int src1, int src2)
{
    return (short) src1 * (short) src2;
}
Note that the TI _mpy intrinsic is simply mapped to a standard C multiply of two short
values. The inner-loop disassembly is:
However, for a few TI instructions, the compiler can do better with ConnX BBE32EP DSP
vectorization and automatic inference of intrinsics. The inner-loop disassembly is:
Note that the compiler automatically uses 16-way multiplies, 16-way loads (for ints that are cast
to shorts), packs (to convert 40-bit results to shorts), 16-way stores (for ints), and so on - all from
standard C code.
Example 2: _add2
Some intrinsics may need manual code modification in order to make use of compiler
automatic vectorization. TI provides a number of intrinsics that unpack integers into two shorts
and repack the shorts into integers after computation, such as _add2:
void test_add2()
{
    int i;
    for (i = 0; i < VEC_SIZE; i++) {
        e[i] = _add2(a[i], b[i]);
    }
}
One approach is to do the unpacking and packing in separate loops and then transform
_add2 into two routines, calc and merge, as follows:
void test_add2_transform_calc()
{
    int i;
    for (i = 0; i < VEC_SIZE; i++) {
        g1[i] = a[i] & 0xffff;
        g2[i] = a[i] >> 16;
        h1[i] = b[i] & 0xffff;
        h2[i] = b[i] >> 16;
    }
    for (i = 0; i < VEC_SIZE; i++) {
        r[i] = g1[i] + h1[i];
        s[i] = g2[i] + h2[i];
    }
}

void test_add2_transform_merge()
{
    int i;
    for (i = 0; i < VEC_SIZE; i++) {
        e[i] = ((unsigned int) s[i] << 16) | ((unsigned int) r[i] & 0xffff);
    }
}
The calc routine unpacks the operands and does the computation using ordinary C code.
The merge routine packs the two results back together in the form the TI intrinsic produces.
The calc disassembly vectorizes, but the merge disassembly does not. Therefore, try to avoid
transforming data to and from packed intrinsic forms in the loops that must be optimized.
Example 3: _add2 and _sub2
Suppose you have a short sequence of a TI _add2 followed by a TI _sub2 intrinsic. To begin
optimizing this sequence, we want to avoid transforming intermediates back into
packed form. This can be tried by creating an _add2_sub2_calc routine, followed by a merge
routine that packs the results back into the TI merged form.
If there is a long sequence of calculations, avoiding the repacking allows the xt-xcc
compiler to vectorize and infer naturally, which saves considerable cycles.
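A sketch of such a split follows. The array names and VEC_SIZE are hypothetical; the point is that the calc routine keeps intermediates unpacked and only the final merge repacks:

```c
#define VEC_SIZE 64

/* hypothetical working arrays: a, b hold packed inputs; r1/s1 and r2/s2
   hold unpacked low/high intermediates; e holds the final packed output */
int a[VEC_SIZE], b[VEC_SIZE], e[VEC_SIZE];
short r1[VEC_SIZE], s1[VEC_SIZE], r2[VEC_SIZE], s2[VEC_SIZE];

/* _add2 followed by _sub2, computed entirely on unpacked halves */
void test_add2_sub2_calc(void)
{
    int i;
    for (i = 0; i < VEC_SIZE; i++) {
        short alo = (short)a[i], ahi = (short)(a[i] >> 16);
        short blo = (short)b[i], bhi = (short)(b[i] >> 16);
        r1[i] = alo + blo;   /* _add2, low half  */
        s1[i] = ahi + bhi;   /* _add2, high half */
        r2[i] = r1[i] - blo; /* _sub2, low half  */
        s2[i] = s1[i] - bhi; /* _sub2, high half */
    }
}

/* repack into the TI merged form only once, after the whole sequence */
void test_add2_sub2_merge(void)
{
    int i;
    for (i = 0; i < VEC_SIZE; i++)
        e[i] = ((unsigned int)(unsigned short)s2[i] << 16)
             | (unsigned short)r2[i];
}
```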
Manual Vectorization
There may be times when manual vectorization and intrinsics are necessary to achieve
optimal results. You will also need to decide what to do about saturating operations. The
ConnX BBE32EP DSP does not saturate on most operations, nor does it have an overflow
register. Instead, it uses 40-bit vector registers that include 8 guard bits, and 16-bit vector
registers without guard bits. For 32/40-bit data contained in 40-bit vector registers, the data is
either packed and saturated to 16 bits or saturated to 32 bits. In either case, after saturation
the data is moved into one or two narrow vec registers, respectively, before being stored as
16-bit elements in memory. Keep this different model in mind when converting code.
If the intrinsic-based code lies in parts of the program that do not consume many cycles (for
example, control code rather than loop-intensive data code), you may simply leave the
intrinsics converted to standard C. Divide the code into one of the following categories:
•Rarely executed, low cycle-count code (that is, code you do not optimize)
•Heavily executed, high cycle-count code (optimize manually where the compiler does not)
•"Middle ground" code, which you must decide whether to optimize based on time and
performance goals
5. Configurable Options
Topics:
•FFT
•Symmetric FIR
•Packed Complex Matrix Multiply
•LFSR and Convolutional Encoding
•Linear Block Decoder
•1D Despread
•Soft-bit Demapping
•Comparison of Divide Related Options
•Advanced Precision Multiply/Add
•Inverse Log-likelihood Ratio (LLR)
•Single and Dual Peak Search
•Single-precision Vector Floating-Point
5.1 FFT
This option provides FFT and DFT support offering significant performance improvements on
FFT operations of any size.
The ConnX BBE32EP DSP FFT package provides special FFT instructions optimized to
perform a range of DFT computations. These FFT instructions implement a generic Discrete
Fourier Transform in a series of steps. Each step is a pass over the whole input.
The basic sequence for a decimation-in-frequency FFT operation is as follows:
1. Load a vector of inputs.
2. Multiply by constants (also called "rotations"). These are constant across all FFT pass
sizes and depend on the radix type (1, j, -1 and -j in Figure 4: Radix4 FFT Pass on page
72). Effectively, these rotation constants only modify the sign of the inputs (real or
imaginary) and are thus paired with the next step.
3. Perform a radix add. The radix add instructions add the inputs to a radix block while
applying the correct sign change coming from the rotation constants.
4. Complex multiply by constants (also called "twiddle factors"). These depend on the FFT
size and each individual value going through the transform. They are marked as Tw in
Figure 4: Radix4 FFT Pass on page 72.
Figure 4: Radix4 FFT Pass
In the general case, the ConnX BBE32EP DSP architecture allows a vector/state load, a
vector store, a radix4 add, a vector multiply, and a shuffle to execute all in parallel. This
produces a full vector of radix4 FFT results as output per cycle. As long as the inputs
for the butterfly come from a distance of at least one vector length away, the following
sequence of basic instructions is used:
1. Load four input vectors.
2. Use the four appropriate FFT add instructions.
3. Apply the twiddle factors. The twiddle factor for one of the vectors is '1'; the other three
require a multiply (see Figure 4: Radix4 FFT Pass on page 72).
4. Store four output vectors.
Note: In some stages, an additional interleave or shuffle step may be required.
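The per-butterfly arithmetic behind these instructions can be illustrated with a scalar 4-point DFT in C. This is a reference model only; the names and types here are illustrative, not BBE operations:

```c
typedef struct { int re, im; } cplx;

/* One radix-4 butterfly (a 4-point DFT). The rotation constants
   1, -j, -1, j appear as sign swaps on the real/imaginary parts,
   matching steps 2-3 above; twiddle multiplies (step 4) would
   follow in a full FFT pass. */
static void radix4_butterfly(const cplx x[4], cplx y[4])
{
    /* X0 = x0 + x1 + x2 + x3 */
    y[0].re = x[0].re + x[1].re + x[2].re + x[3].re;
    y[0].im = x[0].im + x[1].im + x[2].im + x[3].im;
    /* X1 = x0 - j*x1 - x2 + j*x3 */
    y[1].re = x[0].re + x[1].im - x[2].re - x[3].im;
    y[1].im = x[0].im - x[1].re - x[2].im + x[3].re;
    /* X2 = x0 - x1 + x2 - x3 */
    y[2].re = x[0].re - x[1].re + x[2].re - x[3].re;
    y[2].im = x[0].im - x[1].im + x[2].im - x[3].im;
    /* X3 = x0 + j*x1 - x2 - j*x3 */
    y[3].re = x[0].re - x[1].im - x[2].re + x[3].im;
    y[3].im = x[0].im + x[1].re - x[2].im - x[3].re;
}
```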
Some FFT operations require control or monitoring of an instruction for, say, normalization
and/or scaling. The BBE_MODE control register state is used to set the appropriate mode for
such operations.
Depending on the FFT size (N), the usage of FFT instructions in the ConnX BBE32EP DSP
can differ slightly. Consequently, the package has built-in support for radix-2, radix-3, radix-4,
and radix-5 FFT implementations.
•When N is a power of 2:
•Even powers of 2 - radix4
•Odd powers of 2 - radix4 and radix2 (radix2 for only the last pass, for example when N=8)
•When N is not a power of 2:
•Radix3 and radix5
When performing an FFT computation, the results are usually needed in natural order, unless
they feed into an inverse FFT (with some processing applied in between). For best performance,
it is recommended to use auto-sorting FFT algorithms, for example the Stockham FFT.
5.2 Symmetric FIR
This option inserts a pre-adder function in front of the multiplier tree, effectively doubling the
number of taps that can be handled by the core MACs.
Optimized ISA support to accelerate symmetric FIR operations is available as a separate
configurable option. The symmetric FIR operations double the effective MACs per cycle.
Symmetric FIR functions are supported by the following ConnX BBE32EP DSP special
operations:
•For real coefficients, real data
•BBE_ADDPNX16RRU - State-Based Add with Shift Right for Real Data Symmetric FIR
with Updates of States by Shifting Elements
•BBE_ADDPNX16RRUMBC - State-Based Add with Shift Right for Real Data
Symmetric FIR with Update of States Partially by Shifting Elements and Partially by
using Input Vectors
•BBE_ADDPNX16RRUMBCIAD - State-Based Add with Shift Right for Real Data
Symmetric FIR with Update of States B and C by using Input Vectors
•For real coefficients, complex data
•BBE_ADDPNX16RCU - State-Based Add with Shift Right for Complex Data Symmetric
FIR by Shifting Elements
•BBE_ADDPNX16RCUMBC - State-Based Add with Shift Right for Complex Data
Symmetric FIR with Update of States Partially by Shifting Elements and Partially by
using Input Vectors
•BBE_ADDPNX16RCUMBCIAD - State-Based Add with Shift Right for Complex Data
Symmetric FIR with Update of States B and C by using Input Vectors
This package supports symmetric FIR filters with complex data and real coefficients; it does
not support complex data with complex coefficients.
In the ConnX BBE32EP DSP, the effective acceleration of FIR filtering computations doubles
when the real coefficients have a symmetric impulse response; the data can be real or
complex. For real data, the operation BBE_ADDPNX16RRU is first used for data summation to
produce two results, X and Y. Four special machine states BBE_STATE{A,B,C,D} are
pre-loaded with Nx16 real data:
1. StateA and StateD are first added into the Nx16 result X.
2. {StateA, StateB} concatenated elements (1:16) are added to {StateD, StateC}
concatenated elements (15:30) for the Nx16 result Y.
3. Concatenated {StateA, StateB} are updated by shifting down 2 real elements and
setting 0s at the top. {StateD, StateC} data is updated by shifting up 2 real elements
and filling the bottom with the top 2 elements of StateD.
4. Use the BBE_ADDPNX16RRU{MBC, MBCIAD} variants to input new vector data into the states
while rotating and shifting.
For complex data, BBE_ADDPNX16RCU is first used for data summation to produce two results,
X and Y. Four special states BBE_STATE{A,B,C,D} are pre-loaded with N/2x16 complex data (see
Figure 5: 16-tap Real Symmetric FIR with Complex Data on page 76):
1. StateA and StateD are first added into the N/2x16 complex result X.
2. {StateA, StateB} concatenated complex elements (1:8) are added to {StateD, StateC}
concatenated complex elements (7:14) for the complex N/2x16 result Y.
3. {StateA, StateB} are updated by shifting down 2 complex elements and setting 0s at the
top; {StateD, StateC} are shifted up 2 complex elements, filling the bottom with the top 2
elements of StateD.
4. Use the BBE_ADDPNX16RCU{MBC, MBCIAD} variants to input new vector data into the states
while rotating and shifting.
Use the BBE_EXTRNX16C or BBE_L32XP operations to prepare a pair of real 16-bit coefficients.
Results X and Y are then multiplied with the coefficients and pairwise added using the special
operation BBE_MUL(A)NX16PR as covered earlier:
•out = X * p + Y * q, where p and q are real 16-bit coefficients, and X, Y are the results
found above.
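The arithmetic advantage of the pre-add, multiplying each shared coefficient only once, can be modeled in scalar C. This is an illustrative reference only, not the vector implementation:

```c
/* Scalar model of a T-tap FIR with a symmetric impulse response
   (c[k] == c[T-1-k]): pre-adding x[k] + x[T-1-k] lets each coefficient
   be multiplied once, halving the MAC count, as this option does in
   hardware via the pre-adder in front of the multiplier tree. */
static int sym_fir_sample(const short *x, const short *c, int T)
{
    int acc = 0;
    for (int k = 0; k < T / 2; k++)
        acc += c[k] * (x[k] + x[T - 1 - k]);  /* pre-add, then one MAC */
    return acc;
}
```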
An example of symmetric FIR operations with complex data is shown below.
For symmetric FIR operations with complex data, there are 4 special states used to help the
calculation. Follow this procedure:
1. Load data into {StateA, StateB, StateC, StateD}. There are 8 elements in each state.
Each element is a 16-bit complex value, so each state holds a total of 256 bits. For example:
•data[0]-data[7] -> stateA
•data[8]-data[15] -> stateB
•data[7]-data[14] -> stateC
•data[15]-data[22] -> stateD
2. Use the operation BBE_ADDPNX16RCU. For example:
xb_vecNx16 sel_2, sel_1;
BBE_ADDPNX16RCU(sel_1, sel_2);
As seen in Figure 5, X=sel_1 and Y=sel_2. After the BBE_ADDPNX16RCU operation
completes, {StateA, StateB, StateC, StateD} are updated as shown in Figure 5.
3. Use the operation BBE_EXTRNX16C to load a 32-bit coefficient from memory into a register.
As shown in Figure 5, p and q are 16-bit.
4. Use the operation BBE_MUL(A)NX16PR to do the MUL/MAC for X/Y and p/q.
Figure 5: 16-tap Real Symmetric FIR with Complex Data
Refer to Implementation Methodology on page 147 on how to configure a ConnX BBE32EP
DSP with the symmetric FIR option. The ISA HTML for symmetric FIR instructions describes
each operation’s implementation and how the different operands are set up for a symmetric
FIR operation.
Note:
•The symmetric FIR option adds four states - BBE_STATE{A,B,C,D} - to a ConnX
BBE32EP DSP configuration; these states are otherwise not included.
•The FFT and symmetric FIR options share significant amounts of hardware. When
configuring one of these options into your ConnX BBE32EP DSP core, it may be
appropriate to add the other, as the incremental cost is small.
•As a reference, several code examples are packaged with standard ConnX
BBE32EP DSP configurations in Xtensa Xplorer.
5.3 Packed Complex Matrix Multiply
This option adds support for vectors where the matrix elements are ordered by matrix rather
than grouped by element.
ConnX BBE32EP DSP has support for efficient computation of small complex packed matrix
multiplies - 2x2, 4x4, and variants.
Complex packed matrix multiplies in ConnX BBE32EP DSP are accomplished by
decomposing the problem into smaller tasks:
1. Load matrix data into machine vector registers
2. Use special shuffle instructions in multi-step execution to reorganize order of data
depending on matrix size
3. Perform multistep MAC operations and store result
•For square matrices, use regular multiplies after shuffle
•For rectangular matrices use special multiplies with replication
The special shuffle instructions (patterns for matrix element reordering) are available only as
a configuration option; they enhance the regular shuffle operation BBE_SHFLNX16I and the
select operation BBE_SELNX16I. The shuffle/select patterns reorder N complex elements of
two source vectors into N/2 complex elements of the destination output vector for processing.
The immediate value argument in these operations selects one of many patterns to select/
shuffle from inputs to outputs. As for notation, even though the operations target complex
values, elements are designated as real and imaginary pairs, so elements (0, 1, 2, 3) mean
the 0th real, 0th imaginary, 1st real, and 1st imaginary elements for the first two complex
numbers, and so on.
Here is an example (excerpt) of a 2x2 complex packed matrix multiply routine in ConnX
BBE32EP DSP:

// Using special data shuffles for a complex 2x2 * 2x2 matrix multiply
    vout += (select_vec1 * select_vec2);      // multiply, complete result
    out_p[i] = BBE_PACKVN_2XC40(vout, shft);  // variable pack and store
  }
  return;
}
Note: Streaming order matrix multiplies can be done without shuffling with regular
machine multiplies for real or complex.
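For reference, the computation that the shuffles and vector MACs implement is an ordinary complex matrix product. The following scalar model (illustrative names; row-major "packed" storage per matrix is an assumption) shows the 2x2 case:

```c
typedef struct { int re, im; } cplx32;

/* Scalar reference for C += A * B with 2x2 complex matrices stored
   row-major. Illustrative only; the DSP achieves the same result with
   special shuffles followed by multistep vector MAC operations. */
static void cmatmul2x2(const cplx32 A[4], const cplx32 B[4], cplx32 C[4])
{
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            for (int k = 0; k < 2; k++) {
                const cplx32 *a = &A[2 * i + k], *b = &B[2 * k + j];
                C[2 * i + j].re += a->re * b->re - a->im * b->im;
                C[2 * i + j].im += a->re * b->im + a->im * b->re;
            }
}
```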
5.4 LFSR and Convolutional Encoding
This option adds support for channel coding operations that involve LSFR code generation,
such as scrambling and channel spreading. This option also provides support for
convolutional encoding.
The LFSR option adds special operations to accelerate LFSR sequence generation at up to
32 bits per cycle. This is done by emulating a 32x32-bit matrix by 32x1-bit multiplication. All the
multiplication and reduction-add operations in this option are in the Galois field GF(2). ConnX
BBE32EP DSP LFSR generation is executed in steps:
1. Initialize the 32x32-bit matrix state to hold the polynomial variants for 32 shifts. Use the
BBE_MOVBMULSTATEV operation to load the state BBE_BMUL_STATE.
2. Initialize a 32-bit vector to hold the initial shift register value. Use the BBE_MOVBMULACCA
operation to load the state BBE_BMUL_ACC. The state is updated with the result for the next issue.
3. Output a 32-bit result at each issue and shift right the old results using the BBE_BMUL32A
operation. Set the second 32-bit register input to 0; it is not used for LFSR.
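The GF(2) matrix-vector product that one such issue computes can be modeled in scalar C. This is a hypothetical reference (function and parameter names are illustrative), not the BBE operation itself:

```c
#include <stdint.h>

/* GF(2) 32x32 matrix times 32x1 vector: multiplication is AND, the
   reduction add is XOR, so bit i of the result is the parity of
   (rows[i] & v). This models the 32-bits-per-cycle LFSR step. */
static uint32_t gf2_matvec32(const uint32_t rows[32], uint32_t v)
{
    uint32_t out = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t t = rows[i] & v;      /* GF(2) multiply */
        t ^= t >> 16; t ^= t >> 8;     /* fold down to one parity bit */
        t ^= t >> 4;  t ^= t >> 2;
        t ^= t >> 1;
        out |= (t & 1u) << i;          /* GF(2) reduction add */
    }
    return out;
}
```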
The above sequence can be modified to efficiently implement various variants - for example,
generating the Gold31 (3GPP) standard PRBS sequence (see the following code sample). This
can be done by using two sequence generators and XOR'ing the resulting sequences. You may
also need to interleave parallel processed results to avoid interlocks. A similar methodology can
be used for CRC processing - use the second input 32-bit register to update the CRC state by
XOR'ing input bits with the output.
/* First block LFSR bits generated outside loop for x and y polynomials
   along with the current initial states */
// Inner loop, generating the next nbits_256 - 1 blocks
for (i = 0; i < nbits_256; ++i) {
    out_ptr[i] = v1 ^ v2;            // scrambler output as (x XOR y)
    BBE_MOVBMULACCA(x1);             // initial condition for x
    // Compute next 256 LFSR bits, preparing x states
    S0 = x_mat_ptr[0];
    S1 = x_mat_ptr[1];
    BBE_MOVBMULSTATEV(S1, S0, 0);
    S0 = x_mat_ptr[2];
    S1 = x_mat_ptr[3];
    BBE_MOVBMULSTATEV(S1, S0, 1);
    // Two-vector processing to avoid interlocks
    v1 = BBE_BMUL32A(v1, 0);         // generate 32 bits x
    v3 = BBE_BMUL32A(v3, 0);
    // Pick top half vectors
    v1 = BBE_SELNX16I(v3, v1, BBE_SELI_INTERLEAVE_2_HI);
    x1 = BBE_MOVABMULACC();          // keep x state
    // Repeat for vector y sequence
    // Set up y states
    BBE_MOVBMULACCA(x2);
    S0 = y_mat_ptr[0];
    S1 = y_mat_ptr[1];
    BBE_MOVBMULSTATEV(S1, S0, 0);
    S0 = y_mat_ptr[2];
    S1 = y_mat_ptr[3];
    BBE_MOVBMULSTATEV(S1, S0, 1);
Convolutional coding is a form of forward error correction (FEC) coding based on finite state
machines - the input bit stream is augmented by adding patterns of redundant data. The ConnX
BBE32EP DSP version supports up to a 16-bit polynomial encoder computing 64 encoded bits
at each issue.
The optional BBE_CC64 convolutional coding operation accepts the following inputs:
•An input Nx16-bit vector; 79 LSBs are used at each cycle (taken from the 5 LSB
elements). A shuffle is needed to place 64 new bits at the LSB positions.
•A polynomial definition (16 LSB bits of an AR register) to form the register state definition
The operation returns 64 encoded bits placed in the 4x16 MSB elements of the inout register
and shifts right the old 64 processed bits to prepare for the next cycle.
5.5 Linear Block Decoder
The support from this option is used for decoding linear block error-correction codes such as
Hamming codes. In general, it can correlate a set of binary code vectors against a vector of
real quantities.
The support to accelerate linear block decoding in the ConnX BBE32EP DSP comes in the
form of two optimized operations:
•BBE_DSPRMCNRNX16CS8 - 8-Codeset 16-Way 16-bit Real Coded Multiply and Reduce
for Linear Block Decoding with No Rotation
•BBE_DSPRMCANX16CS8 - 8-Codeset 16-Way 16-bit Real Coded Multiply, Reduction
and Accumulate for Linear Block Decoding with No Rotation
Both operations operate on 16-bit real data and perform a 16-way multiplication based on a
vector of 1-bit codes. The operations support up to eight such sets of 1-bit code vectors. The
ISA HTML for these two operations has more information on how a linear block decode
operation is set up.
During a linear block decoding operation, reduction-adds between intermediate results are
performed in full precision (32 bits wide) for each of the eight code words. This design reflects
both algorithm needs and hardware optimization. The result after a reduction-add is then
sign-extended to 40 bits.
For the accumulating version of a linear block decoding operation, the accumulation is
performed only after first truncating the 40-bit results to 32 bits; after the accumulation, the
32-bit sums are sign-extended back to 40-bit results.
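The correlation that one code word performs can be modeled in scalar C. The 0 -> -1, 1 -> +1 mapping of the code bits is an assumption here (the ISA HTML defines the actual convention), and the 16-way width is illustrative:

```c
#include <stdint.h>

/* Scalar model of one code-word correlation: each 1-bit code selects
   +data or -data (bit mapping 0 -> -1, 1 -> +1 is an assumption), and
   the products are reduction-added at full 32-bit precision. */
static int32_t block_correlate16(const int16_t data[16], uint16_t code)
{
    int32_t acc = 0;
    for (int i = 0; i < 16; i++)
        acc += ((code >> i) & 1u) ? data[i] : -data[i];
    return acc;
}
```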
5.6 1D Despread
This option provides support to correlate a complex-binary vector against a complex 16i+16q
or 8i+8q vector. This is particularly useful in despreading operations in 3G standards.
The despread and descramble functions are needed at different sections of the 3G
(WCDMA) receiver chain; these are listed below:
•1D/2D single/multi-code 16-bit complex despread functions are used for S-SCH inner/
outer code correlations, for coarse frequency-offset estimation (based on P-SCH
correlation), for coarse channel estimation (based on CPICH correlation), and for
CCPCH channel despreading (through the 1D single-code 16-bit complex despread function)
Scrambling by itself, in 3G, only involves complex multiplication, and since we usually need to
work with a single (primary) scrambling code (as most channels are scrambled using a single
scrambling code), a 1D single-code type function is required for the descrambling operation.
The despread function also includes a reduction-add following the multiplication.
ConnX BBE32EP DSP supports 1D despreading of 16-bit real or 8/16-bit complex input data.
The 1D despread operations perform an N-way element-wise signed vector multiplication of
real/complex inputs with a vector of real/complex codes at full precision. If n is the spreading
factor (SF), n consecutive products are added together (n = 2, 4, 8, or 16). A vector of N/n
outputs, correspondingly real/complex, is produced every cycle. For example, for SF n=4 with
N-way real data and real codes:

FOR i in {0 ... N/n-1}
    re_res[i] = SUM(j = 0 ... n-1, re_data[i*4+j] * re_code[i*4+j])
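In C, the real-data case above corresponds to this scalar reference (illustrative only; the codes here are assumed to be already decoded to +1/-1 values):

```c
/* Scalar 1D despread reference: N inputs, spreading factor n, N/n
   outputs. Each output is the reduction-add of n consecutive
   data*code products, as in the pseudocode above. */
static void despread_real(const short *data, const short *code,
                          int N, int n, int *res)
{
    for (int i = 0; i < N / n; i++) {
        int acc = 0;
        for (int j = 0; j < n; j++)
            acc += data[i * n + j] * code[i * n + j];  /* code in {+1,-1} */
        res[i] = acc;
    }
}
```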
Inputs to the despreading operation are an Nx16-bit vector of real data to be multiplied by the
codes, an Nx16-bit vector of code sets containing sixteen Nx1b codes, and immediates that
select the code sets to be processed each issue and the type of code. The accumulating
versions of the despreading operations allow accumulating the reduction-add results with
previous results into the output wide vector at 40-bit precision to obtain an SF beyond 16.
Note: Types of codes can be {1, -1} for real or {+/-1, +/-j, +/-1+/-j} for complex
The ISA support to accelerate applications performing 1D despreading comes in the form of
the following special operations:
•BBE_DSPR1DANX16CSF8 - 8-way 16-bit Complex Coded Multiply, Reduction and
Accumulate for Despreading with spreading factor 8
•BBE_DSPR1DANX16SF16 - 16-way 16-bit Real Coded Multiply, Reduction and
Accumulate for Despreading with spreading factor 16
•BBE_DSPR1DANX8CSF16 - 16-way 8-bit Complex Coded Multiply, Reduction and
Accumulate for Despreading with spreading factor 16
•BBE_DSPR1DNX16CSF4 - 8-way 16-bit Complex Coded Multiply and Reduction for
Despreading with spreading factor 4
•BBE_DSPR1DNX16CSF8 - 8-way 16-bit Complex Coded Multiply and Reduction for
Despreading with spreading factor 8
•BBE_DSPR1DNX16SF16 - 16-way 16-bit Real Coded Multiply and Reduction for
Despreading with spreading factor 16
•BBE_DSPR1DNX16SF4 - 16-way 16-bit Real Coded Multiply and Reduction for
Despreading with spreading factor 4
•BBE_DSPR1DNX16SF8 - 16-way 16-bit Real Coded Multiply and Reduction for
Despreading with spreading factor 8
•BBE_DSPR1DNX8CSF16 - 16-way 8-bit Complex Coded Multiply and Reduction for
Despreading with spreading factor 16
•BBE_DSPR1DNX8CSF4 - 16-way 8-bit Complex Coded Multiply and Reduction for
Despreading with spreading factor 4
•BBE_DSPR1DNX8CSF8 - 16-way 8-bit Complex Coded Multiply and Reduction for
Despreading with spreading factor 8
These operations are a part of the 1D despread option configurable in the ConnX BBE32EP
DSP. Refer to the operation's ISA HTML page for more details on its implementation and
setup.
Table 12: Decoding Complex Codes for Despreading

Encoding (2b)  Code-type (1b immediate)  Decoded value  re_code (1b)  im_code (1b)
00             0                         1+j            1             1
01             0                         -1+j           -1            1
10             0                         1-j            1             -1
11             0                         -1-j           -1            -1
00             1                         1              1             0
01             1                         -j             0             -1
10             1                         -1             -1            0
11             1                         j              0             1
5.7 Soft-bit Demapping
This option supports up to 256 QAM soft-bit demapping.
The soft-bit demapping operations are used to convert soft-symbol estimates, outputs of an
equalizer, into soft bit estimates, or log-likelihood ratios (LLRs), later to be processed by a
soft channel decoder for error correction and detection. The soft-bit demapper typically sits at
the interface between complex and soft-bit domains.
The soft-bit demapper accepts as inputs complex-valued soft-symbol estimates x in addition
to a scaling factor. Given these inputs, for each bit bi, it calculates the log-likelihood ratio
according to the mapping of bits to a constellation S. The LLR calculation uses a Max-Log
approximation and assumes an unbiased symbol estimate with zero-mean additive white
Gaussian noise (AWGN), i.e. x=s+w where s belongs to S and w is AWGN. Therefore, the
SDMAP output is given by
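The published equation did not survive extraction here; a Max-Log LLR consistent with the description above (the noise-scaling factor s is an assumption) has the form:

```latex
\mathrm{LLR}(b_i) \approx s \left( \min_{c \in S,\; b_i(c)=1} \lvert x - c \rvert^2 \;-\; \min_{c \in S,\; b_i(c)=0} \lvert x - c \rvert^2 \right)
```

where b_i(c) denotes the value of bit i in the label of constellation point c.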
The scaling factor is used to account for signal-to-noise ratio and any other desired weighting
adjustments. Users can negate the LLR values with an additional sign option.
Supported constellations and mappings are summarized in Table 13: Set of Symbol
Constellations Supported on page 83. Symbol mappings for 3GPP and WiFi use different
Gray Encoding formats, both supported by the soft-bit demapper operations.
Inputs to the ConnX BBE32EP DSP soft-bit demap operations are
•Complex constellation points
•Assumed Q5.10 (16 bit resolution), not normalized
•Scale factors per point: 4-bit mantissa and 4-bit exponent in paired vector elements
•Used for SNR and channel weighting adjustments
•Three immediate values to select between various modes
•Pick upper or lower half of input complex vector to operate
•Optionally negate soft-bit LLRs
•Optionally interleave output soft-bit LLRs for real/imaginary parts (IEEE vs. 3GPP
standard)
As for the outputs of the ConnX BBE32EP DSP soft-bit demap operations:
•Up to 2N soft-bits per half complex vector are computed each cycle, scaled by the scaling
factors, with rounding and saturation to 8-bit integer resolution at the output.
Scaling before the soft demapper is needed to place values onto an integer grid (the hardware
implementation assumes Q5.10 format). Scaling after the soft demodulation is optionally
applied by the operations.
5.8 Comparison of Divide Related Options
ISA support for divide-related functions in the ConnX BBE32EP DSP is offered as three
configurable options. Each option provides multiple operations (by adding more hardware) to
the ConnX BBE32EP DSP ISA to accelerate a specific divide-related function. Table 14:
Comparison of Divide Related Options on page 84 briefly outlines the intended use of each
operation along with any functional overlap with other operations.
For some insight into the usage of the above operations, refer to the packaged code example
vector_divide, which computes a 16b by 16b signed vector division in several different ways.
Table 14: Comparison of Divide Related Options

Class: Advanced Vector Reciprocal
•BBE_RECIPLUNX40_0, BBE_RECIPLUNX40_1
  Function: High precision reciprocal approximation: ~23b mantissa accuracy
  Overlaps: 16b/16b signed divide

Class: Advanced Vector Reciprocal Square Root
•BBE_RSQRTLUNX40_0, BBE_RSQRTLUNX40_1
  Function: High precision reciprocal square root approximation: ~24b mantissa accuracy

Class: Fast Vector Reciprocal
•BBE_RECIPUNX16_0, BBE_RECIPUNX16_1
  Function: Unsigned integer reciprocal approximation: results with approximately 15b precision
  Overlaps: 16b/16b signed divide, 32b/16b divide and advanced precision reciprocal
•BBE_RECIPNX16_0, BBE_RECIPNX16_1
  Function: Signed integer reciprocal approximation: results with approximately 15b precision
  Overlaps: 16b/16b signed divide
•BBE_FPRECIPNX16_0, BBE_FPRECIPNX16_1
  Function: Signed floating point reciprocal approximation: results with approximately 10b
  precision in the typical case, 7.5b in the worst case
•BBE_DIVADJNX16
  Function: Signed divide adjust to allow C-exact 16b/16b integer divide - works with BBE_RECIPNX16*
  Overlaps: 16b/16b signed divide
•BBE_DIVUADJNX16
  Function: Unsigned divide adjust to allow C-exact 16b/16b integer divide
  Overlaps: 16b/16b unsigned divide
•BBE_FPRSQRTNX16_0, BBE_FPRSQRTNX16_1
  Function: Signed floating point reciprocal square root approximation: results with
  approximately 9.7b precision in the typical case, 7.5b in the worst case
  Overlaps: High precision reciprocal square root

Class: Vector Divide First Step
•BBE_DIVNX16S_5STEP0_0, BBE_DIVNX16S_5STEP0_1
  Function: Signed 16b/16b vector divide
  Overlaps: BBE_RECIPNX16
•BBE_DIVNX16U_4STEP0_0, BBE_DIVNX16U_4STEP0_1
  Function: Unsigned 16b/16b vector divide
  Overlaps: BBE_RECIPUNX16
•BBE_DIVNX16Q_4STEP0_0, BBE_DIVNX16Q_4STEP0_1
  Function: Unsigned fractional 16b/16b vector divide
  Overlaps: BBE_FPRECIPNX16

Class: Vector Divide Other Steps
•BBE_DIVNX16S_3STEPN_0, BBE_DIVNX16S_3STEPN_1
  Function: Last step of 32b/16b and 16b/16b signed vector divide
  Overlaps: BBE_RECIPNX16
•BBE_DIVNX16S_4STEP_0, BBE_DIVNX16S_4STEP_1
  Function: Middle step of 32b/16b and 16b/16b signed vector divide
  Overlaps: BBE_RECIPUNX16, high precision reciprocal approximation
•BBE_DIVNX16U_4STEP_0, BBE_DIVNX16U_4STEP_1
  Function: Middle step of 32b/16b and 16b/16b unsigned vector divide
  Overlaps: BBE_RECIPUNX16, high precision reciprocal approximation
•BBE_DIVNX16U_4STEPN_0, BBE_DIVNX16U_4STEPN_1
  Function: Last step of 32b/16b and 16b/16b unsigned vector divide
  Overlaps: BBE_RECIPUNX16, high precision reciprocal approximation

Class: 32-bit Vector Divide First Step
•BBE_DIVNX32S_5STEP0_0, BBE_DIVNX32S_5STEP0_1
  Function: Signed 32b/16b vector divide
  Overlaps: High precision reciprocal approximation
•BBE_DIVNX32U_4STEP0_0, BBE_DIVNX32U_4STEP0_1
  Function: Unsigned 32b/16b vector divide
  Overlaps: BBE_RECIPUNX16
Note: It is recommended to use protos that combine operation sequences appropriate
for the type. The packaged code examples may be used as a reference to identify
such protos.
5.8.1 Vector Divide
For 16-bit/16-bit and 32-bit/16-bit vector division with 16-bit outputs.
The support for 32-bit integer or scalar divide is an option available to supplement the base
Xtensa ISA; turning it on when configuring a ConnX BBE32EP DSP provides users with the
following protos:
•Signed integer divide – QUOS() and REMS()
•Unsigned integer divide – QUOU() and REMU()
More significantly, the ConnX BBE32EP DSP can be configured with a vector divide option.
Low-precision vector division operations in the ConnX BBE32EP DSP are multiple-cycle
operations executed stepwise - four-step operations for N/2 results of 16-bit precision, with
each step taking one cycle. These steps are pipelined by interleaving the even and odd
elements of the N-element vector. For convenience of programming, two sets of these four
steps, one set each for the even and odd elements, are bundled into an intrinsic that
produces N results of vector division in 8 cycles.
This option also supports high-precision division, often useful for dividing arbitrary 16-bit
fixed-point formats by appropriately shifting high-precision dividends. These high-precision
vector division operations differ in that the dividend vector is 32 bits/element, although stored
in a 40-bit/element guarded wvec register.
On inclusion of this option in a ConnX BBE32EP DSP configuration, the core supports the
following types of vector division:
•16-bit by 16-bit unsigned vector divide
•It takes a set of sixteen 16-bit unsigned dividends and sixteen 16-bit unsigned divisors
from the vec register file and produces sixteen unsigned quotients and sixteen
unsigned remainders, both with 16-bit precision and zero-extended to 16-bits per SIMD
element.
•Overflow and divide-by-zero conditions return 0x7FFF.
•Protos used - BBE_DIVNX16U() / BBE_DIVNX16U()
•16-bit by 16-bit signed vector divide
•It takes a set of sixteen 16-bit signed dividends and sixteen 16-bit signed divisors from
the vec register file and produces sixteen signed quotients and sixteen signed
remainders, both with 16-bit precision and sign-extended to 16-bits per SIMD element.
•Overflow returns 0x7FFF, and negative overflow returns 0x8000. Similarly, for
divide-by-zero, the operation returns either 0x7FFF or 0x8000 as appropriate.
•Protos used - BBE_DIVNX16() / BBE_DIVNX16()
•32-bit by 16-bit unsigned vector divide
•It takes a set of sixteen 32-bit dividends from the wvec register file and sixteen 16-bit
unsigned divisors from the vec register file and produces sixteen unsigned quotients
and sixteen unsigned remainders, both with 16-bit precision and zero-extended to
16 bits per SIMD element.
•Overflow and divide-by-zero conditions return 0x7FFF.
•Protos used - BBE_DIVNX32U() / BBE_DIVNX32U()
•32-bit by 16-bit signed vector divide
•It takes a set of sixteen 32-bit signed dividends from the wvec register file and sixteen
16-bit signed divisors from the vec register file and produces sixteen signed quotients
and sixteen signed remainders, both with 16-bit precision and sign-extended to 16 bits
per SIMD element.
•Overflow returns 0x7FFF, and negative overflow returns 0x8000. Similarly, for
divide-by-zero, the operation returns either 0x7FFF or 0x8000 as appropriate.
•Protos used - BBE_DIVNX32() / BBE_DIVNX32()
If users need to write scalar code with the use of scalar data-types like xb_int16, xb_int16U,
xb_int40, xb_c16 and xb_c40, the Xtensa compiler can auto-vectorize the divide operations
in the code by using the following set of ‘compiler-assist’ protos:
•Unsigned scalar divide – BBE_DIV16U() and BBE_DIV32U()
•Signed scalar divide – BBE_DIV16() and BBE_DIV32()
Refer to the description section of the various divide operations in the ISA HTML for further
understanding of stepwise divide operations. Additionally, the vector_divide example included
in the ConnX BBE32EP DSP installation illustrates a sample usage of this option using a
16-bit by 16-bit signed vector divide example.
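The saturation rules described above can be captured in a small scalar reference model. This is a hedged sketch, not the hardware implementation: the function name div16_ref is hypothetical, it models a single SIMD lane of the 16-bit signed divide, and the remainder returned in the divide-by-zero and overflow cases is an assumption (the text above only specifies the quotient).

```c
#include <stdint.h>
#include <limits.h>

/* Hypothetical scalar reference model of one SIMD lane of the 16-bit signed
 * vector divide, following the saturation rules described above. */
static void div16_ref(int16_t num, int16_t den, int16_t *quo, int16_t *rem)
{
    if (den == 0) {
        /* divide-by-zero returns 0x7FFF or 0x8000 according to the sign */
        *quo = (num < 0) ? INT16_MIN : INT16_MAX;
        *rem = 0;   /* assumption: the text does not specify the remainder here */
        return;
    }
    int32_t q = (int32_t)num / den;    /* only INT16_MIN / -1 can overflow */
    if (q > INT16_MAX) {
        *quo = INT16_MAX;              /* positive overflow saturates to 0x7FFF */
        *rem = 0;                      /* assumption, as above */
        return;
    }
    *quo = (int16_t)q;
    *rem = (int16_t)(num - q * den);   /* remainder with the sign of the dividend */
}
```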
5.8.2 Fast Vector Reciprocal & Reciprocal Square Root
This option provides reduced cycle counts for 16-bit operations compared to the vector divide option.
Refer to the ISA HTML of the following operations for more details on implementation and
setup.
•Fast vector reciprocal
•BBE_RECIPUNX16_0 - Unsigned 16-bit reciprocal approximation on even elements
•BBE_RECIPUNX16_1 - Unsigned 16-bit reciprocal approximation on odd elements
•BBE_RECIPNX16_0 - Signed 16-bit reciprocal approximation on even elements
•BBE_RECIPNX16_1 - Signed 16-bit reciprocal approximation on odd elements
•BBE_FPRECIPNX16_0 - Signed 16-bit + 7-bit pseudo-floating point reciprocal
approximation on even elements
•BBE_FPRECIPNX16_1 - Signed 16-bit + 7-bit pseudo-floating point reciprocal
approximation on odd elements
•BBE_DIVADJNX16 - Compute adjustment for correction of 16b signed divide based on
fast reciprocal
•BBE_DIVUADJNX16 - Compute adjustment for correction of 16b unsigned divide
based on fast reciprocal
•Fast vector reciprocal square root
•BBE_FPRSQRTNX16_0 - 16-bit mantissa + 7-bit exponent pseudo-floating point
reciprocal square-root approximation on even elements
•BBE_FPRSQRTNX16_1 - 16-bit mantissa + 7-bit exponent pseudo-floating point
reciprocal square-root approximation on odd elements
For more insight into the usage of the above operations, refer to the packaged code examples
vector_recip_fast (computes the reciprocal of an input vector using the BBE_FPRECIP operations)
and vector_rsqrt_fast (computes the reciprocal square root of an input vector using the
BBE_FPRSQRT operations).
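The BBE_DIVADJNX16/BBE_DIVUADJNX16 operations above correct a quotient formed from a fast reciprocal approximation. The general technique can be sketched in plain C; everything here (the function names, and a truncated 16-bit reciprocal standing in for the hardware approximation) is a hypothetical illustration, not the DSP sequence:

```c
#include <stdint.h>

/* Stand-in for the table-based reciprocal: floor(2^16 / den).  den must be nonzero. */
static uint32_t recip_approx(uint16_t den)
{
    return (uint32_t)65536u / den;
}

/* Unsigned 16-bit divide via reciprocal estimate plus a correction step:
 * the truncated reciprocal makes the initial quotient undershoot slightly,
 * so a small adjustment loop restores the exact result. */
static uint16_t div_via_recip(uint16_t num, uint16_t den)
{
    uint32_t q = ((uint32_t)num * recip_approx(den)) >> 16;  /* initial estimate */
    while ((uint32_t)(q + 1) * den <= num)                   /* correct the undershoot */
        q++;
    return (uint16_t)q;
}
```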
5.8.3 Advanced Precision Vector Reciprocal & Reciprocal Square Root
This option provides increased precision compared to the Fast Vector Reciprocal & Reciprocal
Square Root option. Inputs and outputs are generally 16-bit fixed point, although intermediate
results can be stored at higher precision with extra effort.
The advanced-precision (40-bit mantissa, 7-bit exponent) reciprocal (RECIP) and reciprocal
square root (RSQRT) options provide operations that compute lookup-table-based terms to
support a Taylor series expansion. The actual expansion may be performed using the core
MAC operations. These options are recommended when more than 16-bit precision is
required, particularly for inversions of ill-conditioned matrices. The throughput of this option
set is about two 32-bit results per cycle, including the relevant MAC and
normalization operations needed in the sequence.
The input to these operations must be normalized first. The operations compute
either even or odd elements within Nx16 or Nx40 vectors; to hide their three-cycle latency,
they may be issued in pairs and software pipelined. Overflow or saturation conditions
resulting from zero-valued input elements are designated with the most negative
representable number at the 40-bit wide output.
The Taylor series approximation of y(x) is given as:
y = f(x0) + (x-x0)*f'(x0) + (x-x0)^2*f''(x0)/2 = A + (x-x0)*(B + (x-x0)*C)
Lookup tables are used to approximate A, B, and C, with x0 being the low end of the segment
range. The RECIP and RSQRT functions compute N-way 32-bit outputs in multiple steps.
1. Start by finding normalization amounts and then normalize wide precision inputs. Use
BBE_NSANX40 for RECIP, and BBE_NSAENX40 for RSQRT, and apply BBE_SLLNX40 to both.
2. Next, use BBE_RECIPLUNX40_{0,1} / BBE_RSQRTLUNX40_{0,1} even-odd pairs
appropriately. These generate a wide output approximation for term A along with scaled
(x-x0) inputs and an adjusted slope term B=(x-x0)*C. Use the signed MAC operation
BBE_MULUSANX16 for RECIP and the unsigned MAC operation BBE_MULUUSNX16 for
RSQRT to complete the rest of the Taylor series expansion as shown above.
3. Now, to obtain a floating-point output format, use BBE_PACKVNX40 to pack the Nx40 values
into Nx16, and use the normalization amounts to find the appropriate output exponent.
4. Or, to obtain fixed-point output format with desired denormalization into a wide vector
register, use BBE_SRSNX40 with an appropriate shift amount.
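As a plain-C illustration of the table-based scheme, here is a per-segment second-order expansion of 1/x over [0.5, 1.0). The segment count, function name, and double arithmetic (standing in for the fixed-point lookup and MAC sequence) are all hypothetical; only the A/B/C structure follows the text above.

```c
#include <math.h>

#define SEGS 32   /* illustrative table size, not the hardware's */

/* Segmented second-order Taylor approximation of 1/x for x in [0.5, 1.0):
 * per segment, A = f(x0), B = f'(x0), C = f''(x0)/2, evaluated in Horner
 * form y = A + d*(B + d*C), which maps naturally onto MAC operations. */
static double recip_taylor(double x)
{
    int s = (int)((x - 0.5) * 2.0 * SEGS);   /* segment index */
    double x0 = 0.5 + s / (2.0 * SEGS);      /* segment low end */
    double A = 1.0 / x0;                     /* lookup term A = f(x0) */
    double B = -1.0 / (x0 * x0);             /* lookup term B = f'(x0) */
    double C = 1.0 / (x0 * x0 * x0);         /* lookup term C = f''(x0)/2 */
    double d = x - x0;
    return A + d * (B + d * C);
}
```

With 32 segments the third-order remainder keeps the error below about 1e-4 across the interval, which illustrates why the hardware sequence can reach better-than-16-bit precision from modest tables.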
In the default use case, the input is assumed to be Q39 (in the range between 0.5 and 1.0),
and the output after the MAC sequence, without any shifts, will be Q1.38 (greater than or
equal to 1.0 but under 2.0). Here's a code sequence illustrating this use case of the ConnX
BBE32EP DSP RSQRT operations, along with the said data formats of its inputs and outputs:
xb_vecNx40 inpw, inpnormw;  // inputs
xb_vecNx40 a;               // to hold the a's from lookups
xb_vecNx16 inp_scaled;      // to hold scaled input
xb_vecNx16 b;               // to hold b's from lookup
vsaN norm, renorm;          // normalization amounts
// a contains mantissa of rsqrt output in Q1.38 format
// Q1.38 format => there is one sign bit, one integer bit, and 38 fractional bits
renorm = BBE_SUBSR1SAVSN(0, norm);
// Output exp = -norm/2
// BBE_SUBSR1SAVSN(0, norm) calculates 0 - (norm/2)
If the input data is not in Q39 format, it first needs to be converted to Q39 since the operation
requires it. The same shift amount may be used to compute the output exponent.
For ConnX BBE32EP DSP RECIP, the input format is assumed to be normalized Q39. The
sequence of operations implementing the Taylor series approximation outputs a result in
Q2.37 format. Here's a similar example illustrating the code sequence:
xb_vecNx40 inpw, inpnormw;  // inputs
xb_vecNx40 a;               // to hold the a's from lookups
xb_vecNx16 inp_scaled;      // to hold scaled input
xb_vecNx16 b;               // to hold b's from lookup
vsaN norm, renorm;          // normalization amounts
•BBE_RSQRTLUNX40_0 - Compute normalization and table lookup factors for advanced
reciprocal square root approximation - even elements (...,4,2,0)
•BBE_RSQRTLUNX40_1 - Compute normalization and table lookup factors for advanced
reciprocal square root approximation - odd elements (...,5,3,1)
5.9 Advanced Precision Multiply/Add
This option supports 23-bit floating point (16-bit mantissa, 7-bit exponent) and 32-bit fixed
point multiplication and addition. Inputs and outputs are 16-bit/32-bit fixed precision,
although intermediate results can be stored at a higher precision.
The ConnX BBE32EP DSP advanced precision multiply/add is a configurable option that
enables two new classes of data representation and related computations:
•23-bit floating point - 16-bit mantissa and 7-bit exponent
•32-bit high precision fixed point
The expected MAC performance of the 23-bit floating point is 1.5 times the cycle cost of 16-bit
fixed point, and the expected MAC performance of 32-bit fixed point is 3 times the cycle cost
of 16-bit fixed point.
23-bit Floating Point
The mantissas are held in Nx16-bit vector registers (vec) and represent normalized signed
Q15 values. For complex numbers, at least one element of the real-imaginary pair is
normalized. The exponents are held in Nx7-bit vsa registers (vsa) and denote the amount of
right shift needed to convert the mantissa back to a fixed Q15 value. The legitimate range of
values is -64 < exponent < 63. The special values 63 and -64 represent zero and BIG NaN,
respectively. The BBE_FLUSH_TO_ZERO state is set when a resulting exponent is 63
(designating a zero). By convention, complex numbers have a common exponent.
32-bit Fixed Point
All the 32-bit fixed point data is held in wide vector registers (wvec), or in some cases in a
pair of narrow vector registers (vec) as low and high 16-bit parts. The related operations use
16-bit multiply hardware and helper operations to emulate 32-bit multiplication with
appropriate shifting and component gathering/accumulation. The intermediate multiply results
are stored into a new 2-entry Nx32-bit register file (mvec), then shifted appropriately and
accumulated into the regular Nx40-bit wide vector registers (wvec).
Decoupling of the advanced precision multiply/add operations allows multiply-shift-accumulate
sequences to be folded into parallel slots with a pipeline depth of 2-3 stages.
Table 15: Advanced Precision Multiply/Add Operations Overview on page 92 presents an
overview of the various classes of ConnX BBE32EP DSP operations added by the advanced
precision multiply/add option.
The following brief examples show the usage of some of the advanced precision multiply/add
operations.
Real floating point addition
Two floating point vectors are added. Operation BBE_FPADDNX16 appropriately aligns
mantissas and adjusts exponents to produce a proper sum. The result, in a 40-bit wvec, is then
packed to an output 16-bit mantissa and new exponent using operation BBE_FPPACKNX40.
Real floating point multiply-accumulate
Two sets of products of floating point vectors are accumulated. Intermediate products go into
the 32-bit mvec, accumulation into the 40-bit wvec using operation BBE_FPADDMNX40C,
followed by a pack with operation BBE_FPPACKNX40.
Complex high precision fixed point multiply-accumulate (keep top 32 bits of full
precision 64-bit result)
Move each 32-bit number into low/high 16-bit parts. Use appropriate signed/unsigned
multiplies to compute the low/high, high/low, and high/high products, and accumulate with
operation BBE_SRAIMNX40 using appropriate shifts.
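The partial-product scheme just described can be sketched in portable C. This is a hedged scalar analogue (hypothetical function name), not the mvec/wvec sequence. Note that the description above omits the low/low product, which only affects bits below the kept 32; the sketch includes it for exactness.

```c
#include <stdint.h>

/* 32x32 -> top-32 multiply built from 16-bit partial products.  Assumes
 * arithmetic right shift of signed values (true on mainstream compilers). */
static int32_t mul32_hi(int32_t a, int32_t b)
{
    uint16_t al = (uint16_t)(a & 0xFFFF);
    int16_t  ah = (int16_t)(a >> 16);
    uint16_t bl = (uint16_t)(b & 0xFFFF);
    int16_t  bh = (int16_t)(b >> 16);

    uint32_t ll = (uint32_t)al * bl;   /* unsigned x unsigned */
    int32_t  hl = (int32_t)ah * bl;    /* signed   x unsigned */
    int32_t  lh = (int32_t)al * bh;    /* unsigned x signed   */
    int32_t  hh = (int32_t)ah * bh;    /* signed   x signed   */

    /* gather the partial products with the appropriate shifts; the ll term
     * sits entirely below the kept 32 bits */
    int64_t acc = ((int64_t)hh << 32) + (((int64_t)hl + lh) << 16) + ll;
    return (int32_t)(acc >> 32);
}
```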
Inverse LLR Calculation
This option implements the inverse of the soft-bit demapping operation. It converts a set of LLR
values to the mean and variance of the corresponding complex N-QAM symbol. This is useful
in SIC and turbo equalization types of operations.
Turbo Equalization: A turbo equalizer is based on the turbo principle and approaches the
performance of a MAP (maximum a posteriori) receiver via iterative message passing between
a soft-in soft-out (SISO) equalizer and a SISO decoder. This technique is used in MIMO-OFDM
and CDMA systems including 3GPP LTE, WCDMA/HSPA, DVB, etc. The decoder
used is the SISO turbo decoder, whereas the equalizer algorithm can be an LMMSE equalizer
using a priori information based on feedback from the turbo decoder. This technique can also be
used in conjunction with SIC (sequential interference cancellation) and PIC (parallel
interference cancellation).
The inverse LLR calculation converts the extrinsic LLR values output by the Turbo decoder to
mean and variance values which can be used by the next iteration of the MMSE algorithm as
shown below. The mean and variance calculation needs to be performed per symbol.
Figure 6: Inverse LLR Calculation
ConnX BBE32EP DSP includes optional operations to facilitate efficient estimation of soft
complex QAM symbols from log-likelihood ratios (LLR). This process uses a lookup table for
the computation of the non-linear hyperbolic tangent (tanh). Also, different operations are needed
for each size of supported constellation (4/16/64/256-QAM).
The inputs to the BBE_INVLLRNX16C operation represent LLR values for a vector of N/2
complex QAM symbols. For example, six such LLRs are needed for each 64-QAM complex
symbol: three for the real part and three for the imaginary part.
•Immediate-value inputs select whether the sequence of LLR values, per complex point,
is to be considered interleaved (3GPP) or not (IEEE), and whether or not the values should be
negated before further processing
•Input LLR values are packed in bytes; however, only the 6 LSBs are used, in a Q2.3
format
The output of the operation is a vector of N bit-probabilities, in Q5.10 format, computed after an
11-bit lookup-table approximation of the tanh nonlinearity. The bit probabilities for each complex
element are interleaved as real-imaginary pairs. A sequence of operations is needed to
obtain all bit probabilities for any given QAM case; for example, 256-QAM requires four issues
to obtain the four pairs of bit probabilities. An immediate allows obtaining the value (2 - bit
probability) for a certain bit, to accelerate the estimation process.
Operation BBE_INVLLRNX16C computes bit probabilities as the tanh of input LLRs for different
QAM sizes. Once the bit probabilities are available, the mean and variance of the complex
constellation point can be estimated using regular ConnX BBE32EP DSP MAC operations.
1. Denote by pi the bit probabilities at the output. Then the complex mean is given by
•256-QAM mean - real = p0*(8-p1*(4-p2*(2-p3))) and imag = p4*(8-p5*(4-p6*(2-p7)))
•64-QAM mean - real = p0*(4-p1*(2-p2)) and imag = p3*(4-p4*(2-p5))
•16-QAM mean - real = p0*(2-p1) and imag = p2*(2-p3)
•4-QAM mean - real = p0 and imag = p1
2. The real-imaginary parts for each bit probability pi are interleaved at the output, which
helps in computing the quantities above SIMD-wise for N/2-complex-element vectors
3. Start with the (2-pi) quantity in the high-QAM cases and successively real multiply-add the
next bit probability N/2-way to compute the complex mean as the equations dictate
•For (2-pi), use the special immediate setting to enable the factor 2; otherwise disable it
to produce only the pi value
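As a numeric sanity check of the nested formulas above, here is a plain-C evaluation of one 64-QAM mean component. The function name is hypothetical, and the convention that each soft bit pi lies in [-1, 1] (so that hard decisions pi = ±1 reproduce the Gray-mapped amplitudes ±1, ±3, ±5, ±7) is an assumption consistent with the tanh-based bit probabilities:

```c
/* 64-QAM soft-symbol mean, one component (real or imaginary):
 * mean = p0 * (4 - p1 * (2 - p2)), per the list above. */
static double qam64_mean_component(double p0, double p1, double p2)
{
    return p0 * (4.0 - p1 * (2.0 - p2));
}
```

Sweeping the three hard-decision bits through ±1 yields the eight amplitudes of one 64-QAM axis, which confirms the nesting order of the formula.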
Some iterative equalization methods (i.e., turbo equalization) require both the mean and the
variance of the complex constellation point. Variance is estimated by applying a similar
approach using real MAC operations, the mean, and the bit probabilities already
computed above. For example, for 64-QAM, with the mean denoted as mean64,
Peak Search
This option provides single and dual peak (value and its index) search of 16/32-bit,
signed/unsigned, real/complex inputs. In baseband, this option accelerates peak search of
sequences resulting from real and complex correlation processes, for example to detect frame
alignment during acquisition, or to extract frequency/phase corrections.
The peak search configuration option is based on operations that make optimized use of 32-bit
SIMD comparators in conjunction with some helper operations to extract up to two peaks and
their indices. The 16-bit peak search operations internally sign extend inputs to 32 bits for
use with the 32-bit comparators. In both 32-bit single peak search and 16/32-bit dual peak
search, the operations process N input elements per issue; with 16-bit single peak
search, 2N elements are processed per issue.
Several state registers are added when configuring a ConnX BBE32EP DSP with the peak
search option, so that vector register usage is not increased:
•BBE_MAX and BBE_MAX2 are 256-bit [SIMD] states added to hold first and second peak
values respectively.
•BBE_MAXIDX and BBE_MAXIDX2 are 96-bit [SIMD] states added to hold first and second
peak indices respectively. The indices here are stored with relative SIMD lane offsets and
are converted into actual index values at the final stage.
•BBE_IDX is an 11-bit state added to hold the count of the number of peak search passes. It
is used internally by BBE_MAXIDX/BBE_MAXIDX2 to compute relative offsets to the indices
of peaks.
The use model of these peak search operations is a multi-step process involving a setup
(initialization), followed by operations that compare, aggregate, and extract peaks and values
from intermediate vectors. Let's go through the flow:
1. Initialize all index-states with zero and value-states with -infinity using
BBE_SETDUALMAX().
2. Use one of the following operations to process the loaded vector input, and appropriately
save any single/dual peaks and their respective indices into the aforementioned states.
•For 16-bit signed/unsigned single peak search, use BBE_DMAX[U]NX16()
•For 16-bit signed/unsigned dual peak search, use BBE_DUALMAX[U]NX16()
•For 32-bit signed/unsigned single/dual peak search, use BBE_DUALMAX[U]WNX32()
3. Next, for 16-bit single peak search use BBE_GTMAXNX16() and BBE_MOVDUALMAXT(), and
for all other cases use only BBE_MOVDUALMAXT() to aggregate peaks and indices held in
two states into a single vector.
4. Finally, BBE_RBDUALMAXR() and BBE_SELMAXIDX() are used to extract the single/dual
peaks and their indices from the single vector obtained in the previous step.
Typically, in implementing complex correlation functions, BBE_MAGINX16C() is used to
compute magnitudes of complex values. The output data of this operation is interleaved. To
account for this data reordering, a 1-bit immediate passed to BBE_SELMAXIDX() controls the
index value computation of the absolute maximum value; see the ISA HTML for more details.
Optionally, a right shift may be performed when using BBE_DUALMAX[U]WNX32() to avoid any
potential input overflow. Example code for a 32-bit complex dual peak search is provided
below:
// Initialize peak value/index states (see step 1 above)
BBE_SETDUALMAX();
// Find power of complex data and search for dual peaks/indices
for (j = 0; j < size_N; j = j + 1) {
    xb_vecNx16 vec0, vec1;
    xb_vecNx40 wvec0;
    /* process two complex vectors */
    vec0 = data_ptr[2*j];
    vec1 = data_ptr[2*j+1];
    // BBE_MAGINX16C interleaves data; maxindex has to deinterleave below
    wvec0 = BBE_MAGINX16C(vec1, vec0);
    BBE_DUALMAXUWNX32(wvec0, 0);
}
// get the maximum and lane flag for max
BBE_RBDUALMAXUR(maxpeak, selector);
// get corresponding index, note deinterleaving index mode
int32_t maxindex = BBE_SELMAXIDX(selector, 1);
// replace max with max2 for max lane
BBE_MOVDUALMAXT(selector);
// get the maximum and lane flag for max2
BBE_RBDUALMAXUR(secondpeak, selector2);
// get corresponding index for max2, deinterleaving mode
int32_t secondindex = BBE_SELMAXIDX(selector2, 1);