Analog Devices EE197 Application Notes

Engineer To Engineer Note EE-197
Technical Notes on using Analog Devices' DSP components and development tools
Contact our technical support by phone: (800) ANALOG-D or e-mail: dsp.support@analog.com. Or visit our on-line resources http://www.analog.com/dsp and http://www.analog.com/dsp/EZAnswers
ADSP-BF531/532/533 Blackfin® Processor Multi-cycle Instructions and Latencies
Contributed by Tom L. September 24, 2003

Introduction

This document contains a description of the multicycle instructions and latencies specific to the ADSP-BF531/532/533 Blackfin® Processor devices. Multicycle instructions take more than one cycle to complete. The specific cycle count cannot be reduced without removing the instruction that caused it. A latency condition can occur when two instructions require extra cycles to complete because they are close to each other in the assembly program. The programmer can avoid this cycle penalty by separating the two instructions. Other causes for latencies are memory stalls and store buffer hazards; for these conditions, a discussion of how to improve performance is provided.
The Pipeline Viewer within the VisualDSP++™ simulator shows how instructions move through the processor’s pipeline. While the causes of conditions such as stalls can be discovered interactively, this document contains more detail about the nature of execution latencies.
All of the cycle counts described in this document are based on the assumption that code is executed from L1 memory.

Multicycle Instructions

This section describes the instructions that take more than one cycle to complete. All instructions not mentioned in this discussion are single-cycle instructions, provided they are executed from L1 memory.
Multicycle instructions include the following categories: Push Multiple/Pop Multiple, 32-bit Multiply, Call, Jump, Conditional Branch, Returns from Events, Core and System Synchronization, Linkage, Interrupts and Emulation, and Testset. In the following examples, the total number of cycles needed to complete a certain instruction is shown next to the corresponding instruction. A full description of each instruction’s functionality is provided in the Blackfin Processor Instruction Set Reference.

Push Multiple/Pop Multiple

The Push Multiple and Pop Multiple instructions take n core cycles to complete, where n is the number of registers pushed or popped, assuming the stack is located in L1 data memory.
Copyright 2003, Analog Devices, Inc. All rights reserved. Analog Devices assumes no responsibility for customer product design or the use or application of customers’ products or for any infringements of patents or rights of others which may result from Analog Devices assistance. All trademarks and logos are property of their respective holders. Information furnished by Analog Devices Applications and Development Tools Engineers is believed to be accurate and reliable, however no responsibility is assumed by Analog Devices regarding technical accuracy and topicality of the content provided in Analog Devices’ Engineer-to-Engineer Notes.
Example                    Number of Cycles
[--SP] = (R7:0, P5:0);     14 cycles
(R7:0, P5:3) = [SP++];     11 cycles
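The register-count rule can be checked against the two examples with a small helper. This is only a sketch: the function name and parameters are hypothetical, and it assumes the register ranges always run down from R7 and from P5, as they do in the examples above, with the stack in L1 data memory.

```python
def multi_reg_cycles(dreg_low=None, preg_low=None):
    """Estimate cycles for a Push/Pop Multiple of R7:dreg_low and/or P5:preg_low.

    Hypothetical helper: one cycle per register moved, per the rule above.
    Pass None to leave a register group out of the instruction.
    """
    count = 0
    if dreg_low is not None:
        if not 0 <= dreg_low <= 7:
            raise ValueError("dreg_low must be in 0..7")
        count += 7 - dreg_low + 1   # R7 down to R<dreg_low>, inclusive
    if preg_low is not None:
        if not 0 <= preg_low <= 5:
            raise ValueError("preg_low must be in 0..5")
        count += 5 - preg_low + 1   # P5 down to P<preg_low>, inclusive
    return count


print(multi_reg_cycles(0, 0))  # [--SP] = (R7:0, P5:0);
print(multi_reg_cycles(0, 3))  # (R7:0, P5:3) = [SP++];
```

The two calls reproduce the 14-cycle and 11-cycle figures listed for the examples.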

32-bit Multiply (modulo 2^32)

The 32-bit by 32-bit integer multiply instruction always takes 3 cycles to complete.
Example       Number of Cycles
R0 *= R1;     3 cycles

Call, Jump

All call and jump instructions take 5 cycles to complete, provided the target address is an aligned location (see Instruction Alignment Unit Empty Latencies later in this document).
Example              Number of Cycles
CALL 0x22;           5 cycles
CALL (PC + P0);      5 cycles
CALL (P0);           5 cycles
JUMP 0x22;           5 cycles
JUMP (PC + P0);      5 cycles
JUMP (P0);           5 cycles

Conditional Branch

The number of cycles a branch takes depends on the prediction as well as the actual outcome.
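As a rough illustration of how prediction and outcome combine, the sketch below estimates the average cost of a branch from the cycle counts this note gives for the four cases (5 cycles when predicted taken and taken, 9 cycles for either misprediction, 1 cycle when predicted not taken and not taken). The names `BRANCH_CYCLES` and `expected_branch_cycles` are hypothetical, and the probability model is an assumption added for illustration, not part of the original note.

```python
# (predicted_taken, actually_taken) -> cycles, from this note's branch figures
BRANCH_CYCLES = {
    (True, True): 5,
    (True, False): 9,
    (False, True): 9,
    (False, False): 1,
}


def expected_branch_cycles(predicted_taken, p_taken):
    """Average cycles for a branch that is taken with probability p_taken."""
    if not 0.0 <= p_taken <= 1.0:
        raise ValueError("p_taken must be between 0 and 1")
    taken_cost = BRANCH_CYCLES[(predicted_taken, True)]
    not_taken_cost = BRANCH_CYCLES[(predicted_taken, False)]
    return p_taken * taken_cost + (1.0 - p_taken) * not_taken_cost


# Compare both prediction settings for a branch taken half of the time.
print(expected_branch_cycles(True, 0.5))
print(expected_branch_cycles(False, 0.5))
```

Under this model, predicting "taken" only pays off when the branch really is taken most of the time, because a correctly predicted not-taken branch costs a single cycle.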
Prediction    Outcome       Number of Cycles
Taken         Taken         5 cycles
Taken         Not taken     9 cycles
Not taken     Taken         9 cycles
Not taken     Not taken     1 cycle

Returns from Events

Examples                               Number of Cycles
RTX; // return from an exception       5 cycles
RTE; // return from emulation          5 cycles
RTN; // return from an NMI             5 cycles
RTI; // return from an interrupt       5 cycles
RTS; // return from a subroutine       5 cycles

Core and System Synchronization

Examples     Number of Cycles
CSYNC;       10 cycles
SSYNC;       >10 cycles

Linkage

Examples     Number of Cycles
LINK 4;      3 cycles
UNLINK;      2 cycles

Interrupts and Emulation

Examples      Number of Cycles
RAISE 10;     3 cycles (if interrupt branch is not taken)
EXCPT 3;      3 cycles (if exception branch is not taken)
STI R4;       3 cycles

Testset

The TESTSET instruction is a multicycle instruction that executes in a variable number of cycles. It is dependent on the cycles needed for a read acknowledge from off-core memory. It is also dependent on whether the address being tested is both in the cache and dirty. The number of cycles can be determined as follows:
cycles = 1 (instruction) + 1 (stall) + x (read acknowledge) + y (cache latency)

Instruction Latencies

Unlike multicycle instructions, instruction latencies are contingent on the placement of specific instruction pairs relative to one another. They can be avoided by separating them by as many instructions as there are cycles incurred between them. For example, if a pair of instructions incurs a two cycle latency, separating them by two instructions will eliminate that latency.
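The separation rule can be sketched as follows. The note states only that separating a pair by as many instructions as it has latency cycles removes the latency; the assumption here, labeled as such, is that each intervening single-cycle, independent instruction hides exactly one stall cycle. The function name is hypothetical.

```python
def remaining_stall(latency, separation):
    """Stall cycles left after placing `separation` instructions between a pair.

    Assumption (not stated in the note): every intervening instruction is
    single-cycle, unrelated to the pair, and hides one stall cycle.
    """
    if latency < 0 or separation < 0:
        raise ValueError("latency and separation cannot be negative")
    return max(0, latency - separation)


print(remaining_stall(2, 0))  # adjacent pair: the full two-cycle latency applies
print(remaining_stall(2, 2))  # separated by two instructions: latency eliminated
```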
Bold blue type is used to identify register dependencies within the instruction pairs. A dependency occurs if a register is accessed in the instruction immediately following an instruction that modifies the register. The lack of the color blue in an entry indicates that the latency condition will occur regardless of what registers are used. Italicized red type is used to highlight the stall consequences.
Instruction latencies are separated into these groups: Accumulator to Data Register Latencies, Register Move Latencies, Move Conditional and Move CC Latencies, Loop Setup Latencies, Hardware Loop Latencies, Instruction Alignment Unit Latencies, and Miscellaneous Latencies. The total cycle time of each entry can be calculated by adding the cycles taken by each instruction to the number of stall cycles for the instruction pair.
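To illustrate the total-cycle calculation, the sketch below sums an entry written in the `<Cycles + Stalls>` notation used in the latency tables, for example `<1> <1+4>` for a pair whose second instruction stalls for four cycles. The function name and the parsing approach are hypothetical conveniences, not part of the original note.

```python
import re


def total_cycles(entry):
    """Sum all cycle and stall counts in an entry such as '<1> <1+4>'."""
    groups = re.findall(r"<([^<>]*)>", entry)
    if not groups:
        raise ValueError("no <cycles + stalls> groups found")
    total = 0
    for group in groups:
        # Each group is 'cycles' or 'cycles+stalls'; add every term.
        total += sum(int(term) for term in group.split("+"))
    return total


print(total_cycles("<1> <1+4>"))  # one cycle, then one cycle plus four stalls
```

For the entry `<1> <1+4>`, this gives 1 + 1 + 4 = 6 cycles for the instruction pair.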
Refer to the Appendix for abbreviations, instruction group descriptions, as well as register groupings.

Accumulator to Data Register Latencies

Description                      Example                         <Cycles + Stalls>
- dreg = Areg2Dreg op            R1 = R6.L * R4.H (IS);          <1>
- video op using dreg as src     R5 = BYTEOP1P (R3:2, R1:0);     <1+1>

Register Move Latencies

In each of the following cases, the stall condition occurs when the same register is used in both instructions.
Description                                Example                                  <Cycles + Stalls>
- dreg = sysreg                            R0 = LC0;                                <1>
- multiply/video op with dreg as src       R2.H = R1.L * R0.H;                      <1+1>

- preg = dreg                              P0 = R3;                                 <1>
- any op using preg                        R0 = P0;                                 <1+4>

- dagreg = dreg                            I3 = R3;                                 <1>
- any op using dagreg                      R0 = I3;                                 <1+4>

- POP to dagreg                            I3 = [SP++];                             <1>
- any op using dagreg                      R0 = I3;                                 <1+3>

- LOAD/POP to preg                         P3 = [SP++];                             <1>
- any op using preg                        R0 = P3;                                 <1+3>

- dreg = seqreg                            R0 = RETS;                               <1>
- any ALU op using dreg                    R1 = R0 + R3;                            <1+1>

- dreg = MMR register                      R3 = [P0]; // P0 points to an MMR        <1>
- any ALU op using dreg                    R0 = R3 - R0;                            <1+1>

Move Conditional and Move CC Latencies

In each of the following cases, the stall condition occurs when the same register is used in both instructions.