INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS
AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS
OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO
FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER
INTELLECTUAL PROPERTY RIGHT.
Intel Corporation may have patents or pending patent applications, trademarks, copyrights, or other intellectual property rights that relate to the
presented subject matter. The furnishing of documents and other materials and information does not provide any license, express or implied, by
estoppel or otherwise, to any such patents, trademarks, copyrights, or other intellectual property rights.
Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications.
Intel may make changes to specifications and product descriptions at any time, without notice.
Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Intel reserves these for
future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.
The IXP2800 Network Processor may contain design defects or errors known as errata which may cause the product to deviate from published
specifications. Current characterized errata are available on request.
Except as permitted by such license, no part of this document may be reproduced, stored in a retrieval system, or transmitted in any form or by any
means without the express written consent of Intel Corporation.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an ordering number and are referenced in this document, or other Intel literature may be obtained by calling
1-800-548-4725 or by visiting Intel's website at http://www.intel.com.
Intel and XScale are registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
*Other names and brands may be claimed as the property of others.
160 ME00 PMU Event List .............................................................................................................. 409
161 ME01 PMU Event List .............................................................................................................. 410
162 ME02 PMU Event List .............................................................................................................. 411
163 ME03 PMU Event List .............................................................................................................. 411
164 ME04 PMU Event List .............................................................................................................. 412
165 ME05 PMU Event List .............................................................................................................. 412
166 ME06 PMU Event List .............................................................................................................. 413
167 ME07 PMU Event List .............................................................................................................. 413
168 ME10 PMU Event List .............................................................................................................. 414
169 ME11 PMU Event List .............................................................................................................. 414
170 ME12 PMU Event List .............................................................................................................. 415
171 ME13 PMU Event List .............................................................................................................. 415
172 ME14 PMU Event List .............................................................................................................. 416
173 ME15 PMU Event List .............................................................................................................. 416
174 ME16 PMU Event List .............................................................................................................. 417
175 ME17 PMU Event List .............................................................................................................. 417
176 SRAM DP1 PMU Event List ..................................................................................................... 418
177 SRAM DP0 PMU Event List ..................................................................................................... 418
178 SRAM CH3 PMU Event List .....................................................................................................420
179 SRAM CH2 PMU Event List .....................................................................................................421
180 SRAM CH1 PMU Event List .....................................................................................................421
181 SRAM CH0 PMU Event List .....................................................................................................422
182 IXP2800 Network Processor Dram DPLA PMU Event List....................................................... 423
183 IXP2800 Network Processor Dram DPSA PMU Event List ......................................................424
184 IXP2800 Network Processor Dram CH2 PMU Event List.........................................................425
185 IXP2800 Network Processor Dram CH1 PMU Event List.........................................................429
186 IXP2800 Network Processor Dram CH0 PMU Event List.........................................................429
Intel® IXP2800 Network Processor
Introduction
1 Introduction
1.1 About This Document
This document is the hardware reference manual for the Intel® IXP2800 Network Processor.
This information is intended for use by developers and is organized as follows:
Section 2, “Technical Description” contains a hardware overview.
Section 6, “SRAM Interface” describes the SRAM Unit.
Section 7, “SHaC — Unit Expansion” describes the Scratchpad, Hash Unit, and CSRs (SHaC).
The “Intel XScale® Core” section describes the embedded core.
Section 8, “Media and Switch Fabric Interface” describes the Media and Switch Fabric (MSF)
Interface used to connect the network processor to a physical layer device.
Section 9, “PCI Unit” describes the PCI Unit.
Section 10, “Clocks and Reset” describes the clocks, reset, and initialization sequence.
Section 11, “Performance Monitor Unit” describes the PMU.
1.2 Related Documentation
Further information on the IXP2800 is available in the following documents:
IXP2800 Network Processor Datasheet – Contains summary information on the IXP2800 Network
Processor including a functional description, signal descriptions, electrical specifications, and
mechanical specifications.
IXP2400 and IXP2800 Network Processor Programmer’s Reference Manual – Contains detailed
programming information for designers.
IXP2400/IXP2800 Network Processor Development Tools User’s Guide – Describes the Developer
Workbench and the development tools you can access through the use of the Workbench GUI.
1.3 Terminology
Table 1 and Table 2 list the terminology used in this manual.
Table 1. Data Terminology

Term       Words   Bytes   Bits
Byte       ½       1       8
Word       1       2       16
Longword   2       4       32
Quadword   4       8       64
Table 2. Longword Formats

Endian Type     32-Bit Data 0x12345678       64-Bit Data 0x12345678 9ABCDE56
Little-Endian   arranged as {78 56 34 12}    arranged as {78 56 34 12, 56 DE BC 9A}
Big-Endian      arranged as {12 34 56 78}    arranged as {12 34 56 78 9A BC DE 56}
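The byte arrangements in Table 2 can be checked with Python's struct module (an illustrative aside, not part of the original manual):

```python
import struct

# Big-endian: most significant byte at the lowest address
assert struct.pack(">I", 0x12345678) == bytes([0x12, 0x34, 0x56, 0x78])

# Little-endian: least significant byte at the lowest address
assert struct.pack("<I", 0x12345678) == bytes([0x78, 0x56, 0x34, 0x12])

# 64-bit little-endian arrangement from Table 2, longword by longword
assert struct.pack("<II", 0x12345678, 0x9ABCDE56) == bytes(
    [0x78, 0x56, 0x34, 0x12, 0x56, 0xDE, 0xBC, 0x9A])
```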
Intel® IXP2800 Network Processor
2 Technical Description
2.1 Overview
This section provides a brief overview of the IXP2800 Network Processor internal hardware, and is
intended as an overall hardware introduction to the network processor.
The major blocks are:
• Intel XScale® core (ARM* Version 5 Architecture compliant), used to initialize and manage
the network processor, and can be used for higher layer network processing tasks.
• Intel XScale® core peripherals, including General Purpose I/O (GPIO) and an interface to
low-speed off-chip peripherals (such as the maintenance port of network devices) and
Flash ROM.

(Figure 1 shows the major blocks, their connections to the S_Push/Pull and D_Push/Pull buses,
and the chassis components.)
2.2 Intel XScale® Core Microarchitecture
The Intel XScale® microarchitecture consists of a 32-bit general purpose RISC processor that
incorporates an extensive list of architecture features that allow it to achieve high performance.
2.2.1 ARM* Compatibility
The Intel XScale® microarchitecture is ARM* Version 5 (V5) Architecture compliant. It
implements the integer instruction set of ARM* V5, but does not provide hardware support for the
floating-point instructions. The Intel XScale® microarchitecture provides the Thumb instruction
set (ARM V5T) and the ARM V5E DSP extensions.

Backward compatibility with the first generation of StrongARM* products is maintained for
user-mode applications. Operating systems may require modifications to match the specific
hardware features of the Intel XScale® microarchitecture and to take advantage of the performance
enhancements added to the Intel XScale® core.
2.2.2 Features
2.2.2.1 Multiply/Accumulate (MAC)
The MAC unit supports early termination of multiplies/accumulates in two cycles and can sustain a
throughput of a MAC operation every cycle. Several architectural enhancements were made to the
MAC to support audio coding algorithms, which include a 40-bit accumulator and support for
16-bit packed values.
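As an illustrative sketch (not Intel-provided code), one multiply-accumulate step into a 40-bit accumulator can be modeled as follows; only the 40-bit wraparound is shown, not operand packing or early termination:

```python
def mac40(acc, a, b):
    """One multiply-accumulate step into a 40-bit accumulator.

    Models only the 40-bit width of the accumulator: the running sum
    wraps modulo 2**40.
    """
    return (acc + a * b) & ((1 << 40) - 1)
```

For example, mac40(0, 0x7FFF, 0x7FFF) yields 0x3FFF0001, and an accumulator at its maximum value wraps to 0 when 1 is added.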
2.2.2.2 Memory Management
The Intel XScale® microarchitecture implements the Memory Management Unit (MMU)
Architecture specified in the ARM Architecture Reference Manual. The MMU provides access
protection and virtual to physical address translation.
The MMU Architecture also specifies the caching policies for the instruction cache and data
memory. These policies are specified as page attributes and include:
• identifying code as cacheable or non-cacheable
• selecting between the mini-data cache or data cache
• write-back or write-through data caching
• enabling data write allocation policy
• enabling the write buffer to coalesce stores to external memory
2.2.2.3 Instruction Cache
The Intel XScale® microarchitecture implements a 32-Kbyte, 32-way set associative instruction
cache with a line size of 32 bytes. All requests that “miss” the instruction cache generate a 32-byte
read request to external memory. A mechanism to lock critical code within the cache is also
provided.
2.2.2.4 Branch Target Buffer
The Intel XScale® microarchitecture provides a Branch Target Buffer (BTB) to predict the
outcome of branch type instructions. It provides storage for the target address of branch type
instructions and predicts the next address to present to the instruction cache when the current
instruction address is that of a branch.
The BTB holds 128 entries.
2.2.2.5 Data Cache
The Intel XScale® microarchitecture implements a 32-Kbyte, 32-way set associative data cache
and a 2-Kbyte, 2-way set associative mini-data cache. Each cache has a line size of 32 bytes, and
supports write-through or write-back caching.
The data/mini-data cache is controlled by page attributes defined in the MMU Architecture and by
coprocessor 15.
The Intel XScale® microarchitecture allows applications to reconfigure a portion of the data cache
as data RAM. Software may place special tables or frequently used variables in this RAM.
2.2.2.6 Interrupt Controller
The Intel XScale® microarchitecture provides two levels of interrupt, IRQ and FIQ. They can be
masked via coprocessor 13. Note that there is also a memory-mapped interrupt controller described
with the Intel XScale® technology peripherals (see Section 3.12), which is used to mask and steer
many chip-wide interrupt sources.
2.2.2.7 Address Map
Figure 3 shows the partitioning of the Intel XScale® core microarchitecture 4-Gbyte address space.
2.3 Microengines

The Microengines do most of the programmable per-packet processing in the IXP2800 Network
Processor. There are 16 Microengines, connected as shown in Figure 1. The Microengines have
access to all shared resources (SRAM, DRAM, MSF, etc.) as well as private connections between
adjacent Microengines (referred to as “next neighbors”).
The block diagram in Figure 4 is used in the Microengine description. Note that this block diagram
is simplified for clarity; some blocks and connectivity have been omitted to make the diagram
more readable. Also, this block diagram does not show any pipeline stages, rather it shows the
logical flow of information.
Microengines provide support for software-controlled multi-threaded operation. Given the
disparity in processor cycle times versus external memory times, a single thread of execution often
blocks, waiting for external memory operations to complete. Multiple threads allow for
thread-interleaved operation, as there is often at least one thread ready to run while others are blocked.
The IXP2800 Network Processor supports a single D_Push/D_Pull bus, and both Microengine
clusters interface to the same bus. Also, it supports two command buses, and two sets of
S_Push/S_Pull buses connected as shown in Table 3, which also shows the next neighbor
relationship between the Microengines.
Table 3. IXP2800 Network Processor Microengine Bus Arrangement

Microengine   Microengine   Next       Previous   Command   S_Push and
Cluster       Number        Neighbor   Neighbor   Bus       S_Pull Bus
0             0x00          0x01       NA         0         0
0             0x01          0x02       0x00       0         0
0             0x02          0x03       0x01       0         0
0             0x03          0x04       0x02       0         0
0             0x04          0x05       0x03       0         0
0             0x05          0x06       0x04       0         0
0             0x06          0x07       0x05       0         0
0             0x07          0x10       0x06       0         0
1             0x10          0x11       0x07       1         1
1             0x11          0x12       0x10       1         1
1             0x12          0x13       0x11       1         1
1             0x13          0x14       0x12       1         1
1             0x14          0x15       0x13       1         1
1             0x15          0x16       0x14       1         1
1             0x16          0x17       0x15       1         1
1             0x17          NA         0x16       1         1

2.3.2 Control Store
The Control Store is a RAM that holds the program that is executed by the Microengine. It holds
8192 instructions, each of which is 40 bits wide. It is initialized by the Intel XScale® core, which
writes to the USTORE_ADDR and USTORE_DATA Local CSRs.

The Control Store is protected by parity against soft errors. Parity checking is enabled by
CTX_ENABLE[CONTROL STORE PARITY ENABLE]. A parity error on an instruction read
will halt the Microengine and assert an interrupt to the Intel XScale® core.
2.3.3 Contexts
There are eight hardware Contexts available in the Microengine. To allow for efficient context
swapping, each Context has its own register set, Program Counter, and Context specific Local
registers. Having a copy per Context eliminates the need to move Context specific information to/
from shared memory and Microengine registers for each Context swap. Fast context swapping
allows a Context to do computation while other Contexts wait for I/O (typically external memory
accesses) to complete or for a signal from another Context or hardware unit. (A context swap is
similar to a taken branch in timing.)
Each of the eight Contexts is in one of four states.
1. Inactive — Some applications may not require all eight contexts. A Context is in the Inactive
state when its CTX_ENABLE CSR enable bit is a 0.
2. Executing — A Context is in Executing state when its context number is in
ACTIVE_CTX_STS CSR. The executing Context’s PC is used to fetch instructions from the
Control Store. A Context will stay in this state until it executes an instruction that causes it to
go to Sleep state (there is no hardware interrupt or preemption; Context swapping is
completely under software control). At most one Context can be in Executing state at any time.
3. Ready — In this state, a Context is ready to execute, but is not executing because a different Context is
executing. When the Executing Context goes to the Sleep state, the Microengine’s context
arbiter selects the next Context to go to the Executing state from among all the Contexts in the
Ready state. The arbitration is round robin.
4. Sleep — Context is waiting for external event(s) specified in the
INDIRECT_WAKEUP_EVENTS CSR to occur (typically, but not limited to, an I/O access).
In this state the Context does not arbitrate to enter the Executing state.
The state diagram in Figure 5 illustrates the Context state transitions. Each of the eight Contexts
will be in one of these states. At most one Context can be in Executing state at a time; any number
of Contexts can be in any of the other states.
Figure 5. Context State Transition Diagram

(Figure 5 shows the four states and their transitions: setting a Context’s CTX_ENABLE bit moves
it from Inactive to Ready; when the Executing Context goes to the Sleep state, the Ready Context
with the highest round-robin priority becomes Executing; a Context enters the Sleep state by
executing a CTX Arbitration instruction and returns to Ready when an external signal arrives;
clearing the CTX_ENABLE bit returns a Context to Inactive.)

Note: After reset, the Intel XScale® core must load the starting address into CTX_PC, load
CTX_WAKEUP_EVENTS with 0x1 (voluntary), and then set the appropriate CTX_ENABLE bits
to begin executing Context(s).

The Microengine is in the Idle state whenever no Context is running (all Contexts are in either the
Inactive or Sleep states). This state is entered:

1. After reset (the CTX_ENABLE Local CSR is clear, putting all Contexts into the Inactive state).
2. When a context swap is executed, but no Context is ready to wake up.
3. When a ctx_arb[bpt] instruction is executed by the Microengine (this is a special case of
condition 2 above, since ctx_arb[bpt] clears CTX_ENABLE, putting all Contexts into the
Inactive state).
The Microengine provides the following functionality during the Idle state:
1. The Microengine continuously checks if a Context is in Ready state. If so, a new Context
begins to execute. If no Context is Ready, the Microengine remains in the Idle state.
2. Only the ALU instructions are supported. They are used for debug via special hardware
defined in number 3 below.
3. A write to the USTORE_ADDR Local CSR with the USTORE_ADDR[ECS] bit set causes
the Microengine to repeatedly execute the instruction pointed to by the address specified in the
USTORE_ADDR CSR. Only the ALU instructions are supported in this mode. Also, the result
of the execution is written to the ALU_OUT Local CSR rather than to a destination register.
4. A write to the USTORE_ADDR Local CSR with the USTORE_ADDR[ECS] bit set, followed
by a write to the USTORE_DATA Local CSR loads an instruction into the Control Store. After
the Control Store is loaded, execution proceeds as described in number 3 above.
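The state rules above — at most one Executing Context, round-robin selection among Ready Contexts, and the Idle state when none is runnable — can be sketched as a small model (an illustrative simplification; the class and method names are ours, and the hardware arbiter may differ in detail):

```python
# Context states from Section 2.3.3
INACTIVE, READY, EXECUTING, SLEEP = "Inactive", "Ready", "Executing", "Sleep"

class MicroengineModel:
    """Eight Contexts; at most one may be Executing at any time."""

    def __init__(self):
        self.state = [INACTIVE] * 8

    def enable(self, ctx):
        # Setting a Context's CTX_ENABLE bit moves it from Inactive to Ready.
        self.state[ctx] = READY

    def swap_out(self, ctx):
        """The Executing Context goes to Sleep; the arbiter then selects the
        next Ready Context round-robin. Returns the new Executing Context
        number, or None if no Context is Ready (Microengine is Idle)."""
        self.state[ctx] = SLEEP
        for i in range(1, 9):          # scan round-robin starting after ctx
            cand = (ctx + i) % 8
            if self.state[cand] == READY:
                self.state[cand] = EXECUTING
                return cand
        return None
```

For example, if Contexts 0 and 3 are enabled and Context 0 swaps out, the arbiter selects Context 3; when Context 3 then swaps out with nothing Ready, the model reports the Idle state.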
2.3.4 Datapath Registers
As shown in the block diagram in Figure 4, each Microengine contains four types of 32-bit
datapath registers:
1. 256 General Purpose registers
2. 512 Transfer registers
3. 128 Next Neighbor registers
4. 640 32-bit words of Local Memory
2.3.4.1 General-Purpose Registers (GPRs)
GPRs are used for general programming purposes. They are read and written exclusively under
program control. GPRs, when used as a source in an instruction, supply operands to the execution
datapath. When used as a destination in an instruction, they are written with the result of the
execution datapath. The specific GPRs selected are encoded in the instruction.
The GPRs are physically and logically contained in two banks, GPR A and GPR B, as defined in
Table 5.
2.3.4.2 Transfer Registers
Transfer (abbreviated as Xfer) registers are used for transferring data between the Microengine
and locations external to the Microengine (for example, DRAM, SRAM, etc.). There are four
types of transfer registers.
• S_TRANSFER_IN
• S_TRANSFER_OUT
• D_TRANSFER_IN
• D_TRANSFER_OUT
TRANSFER_IN registers, when used as a source in an instruction, supply operands to the
execution datapath. The specific register selected is either encoded in the instruction, or selected
indirectly via T_INDEX. TRANSFER_IN registers are written by external units. (A typical case is
when the external unit returns data in response to a read instruction. However, there are other
methods to write TRANSFER_IN registers; for example, a read instruction executed by one
Microengine may cause the data to be returned to a different Microengine. Details are covered in
the instruction set descriptions.)
TRANSFER_OUT registers, when used as a destination in an instruction, are written with the
result from the execution datapath. The specific register selected is encoded in the instruction, or
selected indirectly via T_INDEX. TRANSFER_OUT registers supply data to external units
(for example, write data for an SRAM write).
The S_TRANSFER_IN and S_TRANSFER_OUT registers connect to the S_PUSH and S_PULL
buses, respectively.
The D_TRANSFER_IN and D_TRANSFER_OUT Transfer registers connect to the D_PUSH and
D_PULL buses, respectively.
Typically, the external units access the Transfer registers in response to instructions executed by the
Microengines. However, it is possible for an external unit to access a given Microengine’s Transfer
registers either autonomously, or under control of a different Microengine, the Intel XScale® core,
etc. The Microengine interface signals controlling writing/reading of the TRANSFER_IN
and TRANSFER_OUT registers are independent of the operation of the rest of the Microengine;
therefore the data movement does not stall or impact other instruction processing
(it is the responsibility of software to synchronize usage of read data).
2.3.4.3 Next Neighbor Registers
Next Neighbor registers, when used as a source in an instruction, supply operands to the execution
datapath. They are written in two different ways:
1. By an adjacent Microengine (the “Previous Neighbor”).
2. By the same Microengine they are in, as controlled by CTX_ENABLE[NN_MODE].
The specific register is selected in one of two ways:
1. Context-relative, the register number is encoded in the instruction.
2. As a Ring, selected via NN_GET and NN_PUT CSR registers.
The usage is configured in CTX_ENABLE[NN_MODE].
• When CTX_ENABLE[NN_MODE] is ‘0’ — when Next Neighbor is a destination in an
instruction, the result is sent out of the Microengine, to the Next Neighbor Microengine.
• When CTX_ENABLE[NN_MODE] is ‘1’ — when Next Neighbor is used as a destination in
an instruction, the instruction result data is written to the selected Next Neighbor register in the
same Microengine. Note that there is a 5-instruction latency until the newly written data may
be read. The data is not sent out of the Microengine as it would be when
CTX_ENABLE[NN_MODE] is ‘0’.
Table 4. Next Neighbor Write as a Function of CTX_ENABLE[NN_MODE]

                       Where the Write Goes
NN_MODE     External?     NN Register in this Microengine?
0           Yes           No
1           No            Yes
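The Ring usage of the Next Neighbor registers (selected via the NN_GET and NN_PUT CSRs, as described above) can be sketched as a 128-entry circular buffer. This is an illustrative model only; the class name is ours, and the hardware behavior on overflow is not modeled:

```python
class NNRing:
    """Illustrative model of Next Neighbor registers used as a ring."""
    SIZE = 128  # 128 Next Neighbor registers

    def __init__(self):
        self.regs = [0] * self.SIZE
        self.nn_put = 0   # models the NN_PUT CSR (producer index)
        self.nn_get = 0   # models the NN_GET CSR (consumer index)

    def put(self, value):
        # Write at the NN_PUT index, then advance it, wrapping at 128.
        self.regs[self.nn_put] = value
        self.nn_put = (self.nn_put + 1) % self.SIZE

    def get(self):
        # Read at the NN_GET index, then advance it, wrapping at 128.
        value = self.regs[self.nn_get]
        self.nn_get = (self.nn_get + 1) % self.SIZE
        return value
```

Values are consumed in the order they were produced, as in a FIFO ring.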
2.3.4.4 Local Memory
Local Memory is addressable storage within the Microengine. Local Memory is read and written
exclusively under program control. Local Memory supplies operands to the execution datapath as a
source, and receives results as a destination. The specific Local Memory location selected is based
on the value in one of the LM_ADDR registers, which are written by local_csr_wr instructions.
There are two LM_ADDR registers per Context and a working copy of each. When a Context goes
to the Sleep state, the value of the working copies is put into the Context’s copy of LM_ADDR.
When the Context goes to the Executing state, the value in its copy of LM_ADDR is put into the
working copies. The choice of LM_ADDR_0 or LM_ADDR_1 is selected in the instruction.
It is also possible to make one or both LM_ADDR registers global by setting
CTX_ENABLE[LM_ADDR_0_GLOBAL] and/or CTX_ENABLE[LM_ADDR_1_GLOBAL].
When used globally, all Contexts use the working copy of LM_ADDR in place of their own
Context specific one; the Context specific ones are unused. There is a three-instruction latency
when writing a new value to the LM_ADDR, as shown in Example 1.
Example 1. Three-Cycle Latency when Writing a New Value to LM_ADDR
;some instruction to compute the address into gpr_m
local_csr_wr[INDIRECT_LM_ADDR_0, gpr_m]; put gpr_m into lm_addr
;unrelated instruction 1
;unrelated instruction 2
;unrelated instruction 3
alu[dest_reg, *l$index0, op, src_reg]
;dest_reg can be used as a source in next instruction
LM_ADDR can also be incremented or decremented in parallel with use as a source and/or
destination (using the notation *l$index#++ and *l$index#--), as shown in Example 2, where three
consecutive Local Memory locations are used in three consecutive instructions.
Example 2. Using LM_ADDR in Consecutive Instructions
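The listing for Example 2 is missing from this copy. A plausible sketch of using post-increment in three consecutive instructions follows; the destination register names are illustrative, not from the original:

```asm
alu[dest1, --, B, *l$index0++] ; read Local Memory at LM_ADDR_0, then increment
alu[dest2, --, B, *l$index0++] ; reads the next sequential location
alu[dest3, --, B, *l$index0++] ; and the next
```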
Local Memory is written by selecting it as a destination. Example 3 shows copying a section of
Local Memory to another section. Each instruction accesses the next sequential Local Memory
location from the previous instruction.
Example 3. Copying One Section of Local Memory to Another Section
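The listing for Example 3 is also missing from this copy. A plausible sketch, assuming LM_ADDR_0 points at the source section and LM_ADDR_1 at the destination section:

```asm
alu[*l$index1++, --, B, *l$index0++] ; copy one longword, advance both pointers
alu[*l$index1++, --, B, *l$index0++] ; each instruction accesses the next
alu[*l$index1++, --, B, *l$index0++] ; sequential location of both sections
```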
As shown in Example 1, there is a latency in loading LM_ADDR. Until the new value is loaded,
the old value is still usable. Example 5 shows the maximum pipelined usage of LM_ADDR.
LM_ADDR can also be used as the base of a 16 32-bit word region of memory, with the instruction
specifying the offset from that base, as shown in Example 6. The source and destination can use
different offsets.
Example 6. LM_ADDR Used as Base of a 16 32-Bit Word Region of Local Memory
alu[*l$index0[3], *l$index0[4], +, 1]
Note: Local Memory has 640 32-bit words. The local memory pointers (LM_ADDR) have an addressing
range of up to 1K longwords. However, only 640 longwords are currently populated with RAM.
Therefore:
0 – 639 (0x0 – 0x27F) are addressable as local memory.
640 – 1023 (0x280 – 0x3FF) are addressable, but not populated with RAM.
To the programmer, all instructions using Local Memory act as follows, including
read/modify/write instructions like immed_w0, ld_field, etc.
1. Read LM_ADDR location (if LM_ADDR is specified as source).
2. Execute logic function.
3. Write LM_ADDR location (if LM_ADDR is specified as destination).
4. If specified, increment or decrement LM_ADDR.
5. Proceed to next instruction.
Example 7 is legal because lm_addr_0[2] does not post-modify LM_ADDR.
Example 7. LM_ADDR Use as Source and Destination
alu[*l$index0[2], --, ~B, *l$index0]
In Example 7, the programmer sees:
1. Read Local Memory memory location pointed to by LM_ADDR.
2. Invert the data.
3. Write the data into the address pointed to by LM_ADDR with the value of 2 that is OR’ed into
the lower bits.
4. LM_ADDR is not incremented (offset addressing does not post-modify LM_ADDR).
5. Proceed to next instruction.
In Example 8, the second instruction will access the Local Memory location one past the source/
destination of the first.
2.3.5 Addressing Modes

GPRs can be accessed in either a context-relative or an absolute addressing mode. Some
instructions can specify either mode; other instructions can specify only Context-Relative mode.
Transfer and Next Neighbor registers can be accessed in Context-Relative and Indexed modes, and
Local Memory is accessed in Indexed mode. The addressing mode in use is encoded directly into
each instruction, for each source and destination specifier.
2.3.5.1 Context-Relative Addressing Mode
The GPRs are logically subdivided into equal regions such that each Context has relative access to
one of the regions. The number of regions is configured in the CTX_ENABLE CSR, and can be
either 4 or 8. Thus a Context-Relative register number is actually associated with multiple different
physical registers. The actual register to be accessed is determined by the Context making the
access request (the Context number is concatenated with the register number specified in the
instruction). Context-Relative addressing is a powerful feature that enables eight (or four) different
contexts to share the same code image, yet maintain separate data.
Table 5 shows how the Context number is used in selecting the register number in relative mode.
The register number in Table 5 is the Absolute GPR address, or Transfer or Next Neighbor Index
number to use to access the specific Context-Relative register. For example, with eight active
Contexts, Context-Relative Register 0 for Context 2 is Absolute Register Number 32.
Table 5. Registers Used By Contexts in Context-Relative Addressing Mode

The same absolute register number ranges apply to all four columns of the original table:
the GPR A Port, the GPR B Port, the S_Transfer or Neighbor Index Number, and the
D_Transfer Index Number.

Number of Active Contexts            Active Context Number   Absolute Register Numbers
8 (instruction always specifies      0                       0 – 15
registers in range 0 – 15)           1                       16 – 31
                                     2                       32 – 47
                                     3                       48 – 63
                                     4                       64 – 79
                                     5                       80 – 95
                                     6                       96 – 111
                                     7                       112 – 127
4 (instruction always specifies      0                       0 – 31
registers in range 0 – 31)           2                       32 – 63
                                     4                       64 – 95
                                     6                       96 – 127
2.3.5.2 Absolute Addressing Mode
With Absolute addressing, any GPR can be read or written by any of the eight Contexts in a
Microengine. Absolute addressing enables register data to be shared among all of the Contexts,
e.g., for global variables or for parameter passing. All 256 GPRs can be read by Absolute address.
2.3.5.3 Indexed Addressing Mode
With Indexed addressing, any Transfer or Next Neighbor register can be read or written by any one
of the eight Contexts in a Microengine. Indexed addressing enables register data to be shared
among all of the Contexts. For indexed addressing the register number comes from the T_INDEX
register for Transfer registers or NN_PUT and NN_GET registers (for Next Neighbor registers).
Example 9 shows the Index Mode usage. Assume that the numbered bytes have been moved into
the S_TRANSFER_IN registers as shown.
Example 9. Use of Indexed Addressing Mode
Transfer
Register
Number
00x000x010x020x03
10x040x050x060x07
20x080x090x0a0x0b
30x0c0x0d0x0e0x0f
40x100x110x120x013
50x140x150x160x17
60x180x190x1a0x1b
70x1c0x1d0x1e0x1f
31:2423:1615:87:0
Data
If the software wants to access a specific byte that is known at compile-time, it will normally use
context-relative addressing. For example to access the word in transfer register 3:
alu[dest, --, B, $xfer3] ; move the data from s_transfer 3 to gpr dest
If the location of the data is found at run-time, indexed mode can be used, e.g., if the start of an
encapsulated header depends on an outer header value (the outer header byte is in a fixed location).
; Check byte 2 of transfer 0
; If value==5 header starts on byte 0x9, else byte 0x14
br=byte[$0, 2, 0x5, L1#], defer[1]
local_csr_wr[t_index_byte_index, 0x09]
local_csr_wr[t_index_byte_index, 0x14]
L1#:
nop ; wait for index registers to be loaded
nop ;
; Move bytes right justified into destination registers
byte_align_be[dest1, *$index++]
byte_align_be[dest2, *$index++] ;etc.
; Note: the t_index and byte_index registers are loaded by the same instruction.
2.3.6 Local CSRs
Local Control and Status registers (CSRs) are external to the Execution Datapath, and hold specific
data. They can be read and written by special instructions (local_csr_rd and local_csr_wr) and are
accessed less frequently than datapath registers.
Because Local CSRs are not built in the datapath, there is a write-to-use delay of three instructions,
and a read-to-consume penalty of two instructions.
2.3.7 Execution Datapath
The Execution Datapath can take one or two operands, perform an operation, and optionally write
back a result. The sources and destinations can be GPRs, Transfer registers, Next Neighbor
registers, and Local Memory. The operations are shifts, add/subtract, logicals, multiply, byte align,
and find first one bit.
2.3.7.1 Byte Align
The datapath provides a mechanism to move data from source register(s) to any destination
register(s) with byte aligning. Byte aligning takes four consecutive bytes from two concatenated
values (8 bytes), starting at any of four byte boundaries (0, 1, 2, 3), and based on the endian-type
(which is defined in the instruction opcode), as shown in Example 5. The four bytes are taken from
two concatenated values: four of the bytes come from a temporary register that holds the A or B
operand of the previous cycle, and the other four come from the B or A operand of the Byte Align
instruction itself.
Intel® IXP2800 Network Processor
Technical Description
The operation is described below, using the block diagram in Figure 6. The alignment is controlled
by the two LSBs of the BYTE_INDEX Local CSR.
Table 6. Align Value and Shift Amount

Align Value            Right Shift Amount (Number of Bits, Decimal)
(in Byte_Index[1:0])   Little-Endian    Big-Endian
        0                    0              32
        1                    8              24
        2                   16              16
        3                   24               8
Figure 6. Byte-Align Block Diagram
[Figure: the byte-align shifter takes the current operand together with the Prev_B (or Prev_A)
temporary register and right-shifts the 8-byte concatenation by the amount selected by
BYTE_INDEX[1:0], per Table 6.]
Example 10 shows a big-endian align sequence of instructions and the value of the various
operands. Table 7 shows the data in the registers for this example. The value in
BYTE_INDEX[1:0] CSR (which controls the shift amount) for this example is 2.
NOTE: A Operand comes from Prev_B register during byte_align_be instructions.
Example 11 shows a little-endian sequence of instructions and the value of the various operands.
Table 8 shows the data in the registers for this example. The value in BYTE_INDEX[1:0] CSR
(which controls the shift amount) for this example is 2.
NOTE: B Operand comes from Prev_A register during byte_align_le instructions.
As the examples show, byte aligning “n” words takes “n+1” cycles due to the first instruction
needed to start the operation.
Another mode of operation is to use the T_INDEX register with post-increment, to select the
source registers. T_INDEX operation is described later in this chapter.
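The big-endian alignment described above can be sketched in Python. This is a simplified model of `byte_align_be` using the shift amounts from Table 6 (shift = 32 − 8 × align), not a cycle-accurate description of the datapath:

```python
def byte_align_be(prev, cur, align):
    """Big-endian byte align: take 4 bytes from the 8-byte concatenation of
    prev and cur, starting at byte offset 'align' (0..3).
    Per Table 6, the big-endian right-shift amount is 32 - 8*align bits
    (align 0 -> 32, 1 -> 24, 2 -> 16, 3 -> 8)."""
    concat = (prev << 32) | cur          # 8 bytes; prev is most significant
    shift = 32 - 8 * align
    return (concat >> shift) & 0xFFFFFFFF

# Align value 2: the result is prev bytes 2,3 followed by cur bytes 0,1.
assert byte_align_be(0x00010203, 0x04050607, 2) == 0x02030405
```

With align value 0 the result is simply the previous operand's four bytes, which is why the first instruction of a sequence only primes the temporary register.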
2.3.7.2 CAM
The block diagram in Figure 7 is used to explain the CAM operation.
The CAM has 16 entries. Each entry stores a 32-bit value, which can be compared against a source
operand by instruction:
CAM_Lookup[dest_reg, source_reg]
All entries are compared in parallel, and the result of the lookup is a 9-bit value that is written into
the specified destination register in bits 11:3, with all other bits of the register 0 (the choice of bits
11:3 is explained below). The result can also optionally be written into either of the LM_Addr
registers (see below in this section for details).
The 9-bit result consists of four State bits (dest_reg[11:8]), concatenated with a 1-bit Hit/Miss
indication (dest_reg[7]), concatenated with 4-bit entry number (dest_reg[6:3]). All other bits of
dest_reg are written with 0. Possible results of the lookup are:
• miss (0) — lookup value is not in CAM, entry number is Least Recently Used entry (which
can be used as a suggested entry to replace), and State bits are 0000.
• hit (1) — lookup value is in CAM, entry number is entry that has matched; State bits are the
value from the entry that has matched.
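The lookup-result packing and LRU behavior described above can be modeled in Python. This is an illustrative sketch (class and method names are hypothetical), not a description of the actual CAM circuitry:

```python
class MicroengineCAM:
    """Sketch of the 16-entry CAM. Lookup returns the 9-bit status packed
    into bits [11:3] of the destination: state[11:8], hit[7], entry[6:3]."""
    def __init__(self):
        self.tags = [0] * 16          # after CAM_clear, all tags are 0
        self.state = [0] * 16
        self.lru = list(range(16))    # lru[0] is LRU, lru[-1] is MRU

    def _touch(self, entry):          # mark an entry Most Recently Used
        self.lru.remove(entry)
        self.lru.append(entry)

    def cam_write(self, entry, tag, state):
        self.tags[entry] = tag
        self.state[entry] = state
        self._touch(entry)            # a loaded entry becomes MRU

    def cam_lookup(self, value):
        if value in self.tags:
            entry = self.tags.index(value)
            self._touch(entry)        # a hit becomes MRU
            return (self.state[entry] << 8) | (1 << 7) | (entry << 3)
        entry = self.lru[0]           # miss: report the LRU entry number,
        return entry << 3             # hit bit 0, State 0000; LRU unchanged

cam = MicroengineCAM()
cam.cam_write(3, 0xDEAD, state=0b0101)
assert cam.cam_lookup(0xDEAD) == (0b0101 << 8) | (1 << 7) | (3 << 3)
```

A lookup of a value not present returns the LRU entry number with the hit bit clear, which software can use as the suggested entry to replace.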
Figure 7. CAM Block Diagram
[Figure: the lookup value (from the A port) is compared in parallel against the Tag of all 16
Tag/State entries; the per-entry Match signals feed the Status and LRU Logic, which drives the
Lookup Status (to the destination register):

    Status    State    Entry Number
    Miss 0    0000     LRU Entry
    Hit  1    State    Hit Entry                          (A9354-01)]
Note: The State bits are data associated with the entry; their use is determined entirely by software, and
there is no implication of ownership of the entry by any Context. The hardware function of the State bits is:
• the value is set by software (at the time the entry is loaded, or changed in an already loaded
entry).
• its value is read out on a lookup that hits, and used as part of the status written into the
destination register.
• its value can be read out separately (normally only used for diagnostic or debug).
The LRU (Least Recently Used) Logic maintains a time-ordered list of CAM entry usage. When an
entry is loaded, or matches on a lookup, it is marked as MRU (Most Recently Used). Note that a
lookup that misses does not modify the LRU list.
The CAM is loaded by instruction:
CAM_Write[entry_reg, source_reg, state_value]
The value in the register specified by source_reg is put into the Tag field of the entry specified by
entry_reg. The value for the State bits of the entry is specified in the instruction as state_value.
The value in the State bits for an entry can be written, without modifying the Tag, by instruction:
CAM_Write_State[entry_reg, state_value]
Note: CAM_Write_State does not modify the LRU list.
One possible way to use the result of a lookup is to dispatch to the proper code using instruction:
jump[register, label#],defer [3]
where the register holds the result of the lookup. The State bits can be used to differentiate cases
where the data associated with the CAM entry is in flight, or is pending a change, etc. Because the
lookup result was loaded into bits[11:3] of the destination register, the jump destinations are spaced
eight instructions apart. This is a balance between giving enough space for many applications to
complete their task without having to jump to another region, versus consuming too much Control
Store. Another way to use the lookup result is to branch on just the hit miss bit, and use the entry
number as a base pointer into a block of Local Memory.
When enabled, the CAM lookup result is loaded into Local_Addr as follows:
LM_Addr[5:0] = 0 ([1:0] are read-only bits)
LM_Addr[9:6] = lookup result [6:3] (entry number)
LM_Addr[11:10] = constant specified in instruction
This function is useful when the CAM is used as a cache, and each entry is associated with a block
of data in Local Memory. Note that the latency from when CAM_Lookup executes until the
LM_Addr is loaded is the same as when LM_Addr is written by a Local_CSR_Wr instruction.
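The address formation listed above is simple bit packing. A sketch in Python (the function name is hypothetical) shows how an entry number selects a 64-byte block within one of four 1-Kbyte regions chosen by the instruction constant:

```python
def lm_addr_from_lookup(lookup_result, instr_const):
    """Form LM_Addr from a CAM lookup result per the listing above:
    LM_Addr[5:0] = 0, LM_Addr[9:6] = entry number, LM_Addr[11:10] = constant."""
    entry = (lookup_result >> 3) & 0xF        # bits [6:3] of the lookup result
    return (instr_const << 10) | (entry << 6)

# Entry 5 with constant 2 selects the 64-byte block at 0x940:
assert lm_addr_from_lookup(5 << 3, 2) == 0x940
```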
The Tag and State bits for a given entry can be read by the instructions:
CAM_Read_Tag[dest_reg, entry_reg]
CAM_Read_State[dest_reg, entry_reg]
The Tag value and the State bits value for the specified entry are written into the destination register,
respectively, for the two instructions (the State bits are placed into bits [11:8] of dest_reg, with all
other bits 0). Reading the tag is useful in the case where an entry needs to be evicted to make room
for a new value—the lookup of the new value results in a miss, with the LRU entry number
returned as a result of the miss. The CAM_Read_Tag instruction can then be used to find the value
that was stored in that entry. An alternative would be to keep the tag value in a GPR. These two
instructions can also be used by debug and diagnostic software. Neither of these modify the state of
the LRU pointer.
Note: The following rules must be adhered to when using the CAM.
• CAM is not reset by Microengine reset. Software must either do a CAM_clear prior to using
the CAM to initialize the LRU and clear the tags to 0, or explicitly write all entries with
CAM_write.
• No two tags can be written to have the same value. If this rule is violated, the result of a lookup
that matches that value is unpredictable, and the LRU state becomes unpredictable.
The value 0x00000000 can be used as a valid lookup value. However, note that the CAM_clear
instruction puts 0x00000000 into all tags. To avoid violating rule 2 after doing CAM_clear, it is
necessary to write all entries to unique values prior to doing a lookup of 0x00000000.
An algorithm for debug software to find out the contents of the CAM is shown in Example 12.
Example 12. Algorithm for Debug Software to Find out the Contents of the CAM
; First read each of the tag entries. Note that these reads
; don’t modify the LRU list or any other CAM state.
tag[0] = CAM_Read_Tag(entry_0);
......
tag[15] = CAM_Read_Tag(entry_15);
; Now read each of the state bits
state[0] = CAM_Read_State(entry_0);
......
state[15] = CAM_Read_State(entry_15);
; Knowing what tags are in the CAM makes it possible to
; create a value that is not in any tag, and will therefore
; miss on a lookup.
; Next loop through a sequence of 16 lookups, each of which will
; miss, to obtain the LRU values of the CAM.
for (i = 0; i < 16; i++)
BEGIN_LOOP
; Do a lookup with a tag not present in the CAM. On a
; miss, the LRU entry will be returned. Since this lookup
; missed the LRU state is not modified.
LRU[i] = CAM_Lookup(some_tag_not_in_cam);
; Now do a lookup using the tag of the LRU entry. This
; lookup will hit, which makes that entry MRU.
; This is necessary to allow the next lookup miss to
; see the next LRU entry.
junk = CAM_Lookup(tag[LRU[i]]);
END_LOOP
; Because all entries were hit in the same order as they were
; LRU, the LRU list is now back to where it started before the
; loop executed.
; LRU[0] through LRU[15] holds the LRU list.
The CAM can be cleared with the CAM_Clear instruction. This instruction simultaneously writes
0x00000000 to all tags, clears all the state bits, and puts the LRU into an initial state
(where entry 0 is LRU, ..., entry 15 is MRU).
2.3.8 CRC Unit
The CRC Unit operates in parallel with the Execution Datapath. It takes two operands, performs a
CRC operation, and writes back a result. CRC-CCITT, CRC-32, CRC-10, CRC-5, and iSCSI
polynomials are supported. One of the operands is the CRC_Remainder Local CSR, and the other
is a GPR, Transfer_In register, Next Neighbor, or Local Memory, specified in the instruction and
passed through the Execution Datapath to the CRC Unit.
The instruction specifies the CRC operation type, whether to swap bytes and/or bits, and which
bytes of the operand to include in the operation. The result of the CRC operation is written back
into CRC_Remainder. The source operand can also be written into a destination register (however
the byte/bit swapping and masking do not affect the destination register; they only affect the CRC
computation). This allows moving data, for example, from S_TRANSFER_IN registers to
S_TRANSFER_OUT registers at the same time as computing the CRC.
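The kind of computation the CRC Unit performs can be sketched in software. The following is a bit-serial CRC-32 (the reflected IEEE 802.3 polynomial, one of the polynomials listed above), checked against Python's zlib; the hardware implementation is of course parallel and byte-masked, so this is an illustration of the arithmetic only:

```python
import zlib

def crc32_bitwise(data, crc=0xFFFFFFFF):
    """Bit-serial CRC-32 over 'data' using the reflected polynomial
    0xEDB88320, with the standard initial value and final inversion."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# The standard CRC-32 check value for the ASCII string "123456789":
assert crc32_bitwise(b"123456789") == 0xCBF43926
assert crc32_bitwise(b"123456789") == zlib.crc32(b"123456789")
```

In the Microengine, the running remainder lives in the CRC_Remainder Local CSR rather than in a software variable, so successive crc instructions chain automatically.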
2.3.9 Event Signals
Event Signals are used to coordinate a program with completion of external events. For example,
when a Microengine executes an instruction to an external unit to read data (which will be written
into a Transfer_In register), the program must ensure that it does not try to use the data until the
external unit has written it. This time is not deterministic due to queuing delays and other
uncertainty in the external units (for example, DRAM refresh). There is no hardware mechanism to
flag that a register write is pending, and then prevent the program from using it. Instead the
coordination is under software control, with hardware support.
In the instructions that use external units (i.e., SRAM, DRAM, etc.) there are fields that direct the
external unit to supply an indication (called an Event Signal) that the command has been
completed. There are 15 Event Signals per Context that can be used, and Local CSRs per Context
to track which Event Signals are pending and which have been returned. The Event Signals can be
used to move a Context from Sleep state to Ready state, or alternatively, the program can test and
branch on the status of Event Signals.
Event Signals can be set in nine different ways.
1. When data is written into S_TRANSFER_IN registers
2. When data is written into D_TRANSFER_IN registers
3. When data is taken from S_TRANSFER_OUT registers
4. When data is taken from D_TRANSFER_OUT registers
5. By a write to INTERTHREAD_SIGNAL register
6. By a write from Previous Neighbor Microengine to NEXT_NEIGHBOR_SIGNAL
7. By a write from Next Neighbor Microengine to PREVIOUS_NEIGHBOR_SIGNAL
8. By a write to SAME_ME_SIGNAL Local CSR
9. By Internal Timer
Any or all Event Signals can be set by any of the above sources.
When a Context goes to the Sleep state (executes a ctx_arb instruction, or an instruction with a
ctx_swap token), it specifies which Event Signal(s) it requires to be put in Ready state. The
ctx_arb instruction also specifies whether the logical AND or logical OR of the Event Signal(s) is
needed to put the Context into Ready state.
When all of the Context’s Event Signals arrive, the Context goes to Ready state, and then
eventually to Executing state. In the case where the Event Signal is linked to moving data into or
out of Transfer registers (numbers 1 through 4 in the list above), the code can safely use the
Transfer register as the first instruction (for example, using a Transfer_In register as a source
operand will get the new read data). The same is true when the Event Signal is tested for branches
(br_=signal or br_!signal instructions).
The ctx_arb instruction, CTX_SIG_EVENTS, and ACTIVE_CTX_WAKEUP_#_EVENTS
Local CSR descriptions provide details.
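The AND/OR wakeup test described above reduces to two mask comparisons. A hedged Python sketch of the assumed semantics (names are illustrative, not CSR fields):

```python
def context_ready(returned, wanted, mode):
    """Sketch of the wakeup test: 'returned' and 'wanted' are 15-bit Event
    Signal masks. In AND mode the Context needs all of the wanted signals;
    in OR mode any one of them is sufficient."""
    if mode == "AND":
        return (returned & wanted) == wanted
    return (returned & wanted) != 0          # "OR" mode

assert not context_ready(0b0001, 0b0011, "AND")   # one of two signals: wait
assert context_ready(0b0011, 0b0011, "AND")       # both arrived: wake
assert context_ready(0b0001, 0b0011, "OR")        # any one suffices
```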
2.4 DRAM
The IXP2800 Network Processor has controllers for three Rambus* DRAM (RDRAM) channels.
Each of the controllers independently accesses its own RDRAMs, and can operate concurrently
with the other controllers (i.e., they are not operating as a single, wider memory). DRAM provides
high-density, high-bandwidth storage and is typically used for data buffers.
• RDRAM sizes of 64, 128, 256, or 512 Mbytes, and 1 Gbyte are supported; however, each of
the channels must have the same number, size, and speed of RDRAMs populated. Refer to
Section 5.2 for supported size and loading configurations.
• Up to two Gbytes of DRAM is supported. If less than two Gbytes of memory is present, the
upper part of the address space is not used. It is also possible, for system cost and area savings,
to have Channels 0 and 1 populated with Channel 2 empty, or Channel 0 populated with
Channels 1 and 2 empty.
• Reads and writes to RDRAM are generated by the Microengines, the Intel XScale® core, and
PCI (external Bus Masters and DMA Channels). The controllers also do refresh and calibration
cycles to the RDRAMs, transparently to software.
• RDRAM Powerdown and Nap modes are not supported.
• Hardware interleaving (also known as striping) of addresses is done to provide balanced
access to all populated channels. The interleave size is 128 bytes. Interleaving helps to
maintain utilization of available bandwidth by spreading consecutive accesses across multiple
channels. The interleaving is done in hardware in such a way that the three channels appear
to software as a single contiguous memory space.
• ECC (Error Correcting Code) is supported, but can be disabled. Enabling ECC requires that
x18 RDRAMs be used. If ECC is disabled x16 RDRAMs can be used. ECC can detect and
correct all single-bit errors, and detect all double-bit errors. When ECC is enabled, partial
writes (writes of less than 8 bytes) must be done as read-modify-writes.
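The 128-byte striping above can be illustrated with a deliberately simplified address-to-channel model. The real hardware mapping may be more involved; this sketch only shows how consecutive blocks rotate across the populated channels:

```python
def dram_channel(addr, channels=3, interleave=128):
    """Simplified model: consecutive 128-byte blocks rotate across the
    populated RDRAM channels, so software sees one contiguous space."""
    return (addr // interleave) % channels

# Four consecutive 128-byte blocks land on channels 0, 1, 2, 0:
assert [dram_channel(a) for a in (0, 128, 256, 384)] == [0, 1, 2, 0]
```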
2.4.1 Size Configuration
Each channel can be populated with anywhere from one-to-four RDRAMs (Short Channel Mode).
Refer to Section 5.2 for supported size and loading configurations. The RAM technology used will
determine the increment size and maximum memory per channel as shown in Table 9.
Table 9. RDRAM Sizes

RDRAM Technology (1)   Increment Size   Maximum per Channel (2)
    64/72 MB                8 MB               256 MB
   128/144 MB              16 MB               512 MB
   256/288 MB              32 MB                 1 GB
   512/576 MB              64 MB                 2 GB

NOTES:
1. The two numbers shown for each technology indicate x16 parts and x18 parts.
2. The maximum memory that can be addressed across all channels is 2 GB. This limitation is based on the
partitioning of the 4-GB address space (32-bit addresses). Therefore, if all three channels are used, each
can be populated up to a maximum of 768 MB. Two channels can be populated to a maximum of
1 GB each. A single channel can be populated to a maximum of 2 GB.

RDRAMs with 1 x 16 or 2 x 16 dependent banks, and 4 independent banks, are supported.
2.4.2 Read and Write Access
The minimum DRAM physical access length is 16 bytes. Software (and PCI) can read or write as
little as a single byte, however the time (and bandwidth) taken at the DRAMs is the same as for an
access of 16 bytes. Therefore, the best utilization of DRAM bandwidth will be for accesses that are
multiples of 16 bytes.
If ECC is enabled, writes of less than 8 bytes must do read-modify-writes, which take two 16-byte
time accesses (one for the read and one for the write).
2.5 SRAM
The IXP2800 Network Processor has four independent SRAM controllers, each of which supports
pipelined QDR synchronous static RAM (SRAM) and/or a coprocessor that adheres to QDR
signaling. Any or all controllers can be left unpopulated if the application does not need to use
them. SRAM is accessible by the Microengines, the Intel XScale® core, and the PCI Unit
(external bus masters and DMA).
The memory is logically four bytes (32 bits) wide; physically the data pins are two bytes wide and
are double clocked. Byte parity is supported. Each of the four bytes has a parity bit, which is
written when the byte is written and checked when the data is read. Byte enables select which
bytes to write for writes of less than 32 bits.
Each of the four QDR ports is QDR- and QDRII-compatible. Each port implements the “_K” and
“_C” output clocks and “_CQ” as an input, together with their inversions. (Note: the “_C” and
“_CQ” clocks are optional.) The IXP2800 Network Processor provides extensive impedance
controls for processor-initiated signals driving QDR parts; a clean signaling environment is
critical to achieving 200 – 250 MHz QDRII data transfers.
The configuration assumptions for the IXP2800 Network Processor I/O driver/receiver
development include four QDR loads plus the IXP2800 Network Processor itself. The IXP2800
Network Processor supports burst-of-two SRAMs, but does not support burst-of-four SRAMs.
Each SRAM controller can also be configured to interface to an external coprocessor that adheres
to the QDR electrical specification and protocol, through its standard QDR interface. This allows
SRAM devices and coprocessors to coexist on the same bus; the coprocessor behaves as a
memory-mapped device on the SRAM bus.
2.5.1 QDR Clocking Scheme
The controller drives out two pairs of K clocks (K and K#) and two pairs of C clocks (C and C#).
Both C/C# clocks are returned externally to the controller for reading data. Figure 8 shows the
clocking scheme for a QDR interface driving four (clam-shelled) SRAM chips.
Figure 8. Echo Clock Configuration
[Figure: the IXP2800 Network Processor drives QDRn_K[1:0] and QDRn_C[1:0] from its package
balls to the terminated, clam-shelled SRAMs; the SRAMs' echo clocks CQ/CQ# return to
QDRn_CIN[0]. The CIN[1] pin is not used internally to capture the READ data; however, the I/O
pad can be used to terminate the signal. (B3664-01)]
2.5.2 SRAM Controller Configurations
Each channel has enough address pins (24) to support up to 64 Mbytes of SRAM. The SRAM
controllers can directly generate multiple port enables (up to four pairs) to allow for depth
expansion. Two pairs of pins are dedicated for port enables. Smaller RAMs use fewer address
signals than the number provided to accommodate the largest RAMs, so some address pins (23:20)
are configurable as either address or port enable based on CSR setting as shown in Table 10.
Note that all of the SRAMs on a given channel must be the same size.
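The relation between channel size and index address bits shown in Table 10 is straightforward to compute. A small Python sketch (for longword-indexed, power-of-two sizes; the helper name is illustrative):

```python
def index_address_bits(sram_bytes):
    """Number of address bits needed to index a channel of 32-bit longwords
    (the memory is logically 4 bytes wide), as tabulated in Table 10."""
    longwords = sram_bytes // 4
    return longwords.bit_length() - 1       # log2 for power-of-two sizes

# 512K x 18 parts give a 1-MB channel indexed by addresses 17:0 (18 bits):
assert index_address_bits(1 << 20) == 18
# 32M x 18 parts give a 64-MB channel indexed by addresses 23:0 (24 bits):
assert index_address_bits(64 << 20) == 24
```

Address pins beyond those needed to index the populated parts are what become available as port enables in Table 10.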
Table 10. SRAM Controller Configurations

SRAM            SRAM    Addresses Needed   Addresses Used    Total Number of Port
Configuration   Size    to Index SRAM      as Port Enables   Select Pairs Available
 512K x 18       1 MB        17:0          23:22, 21:20              4
   1M x 18       2 MB        18:0          23:22, 21:20              4
   2M x 18       4 MB        19:0          23:22, 21:20              4
   4M x 18       8 MB        20:0          23:22                     3
   8M x 18      16 MB        21:0          23:22                     3
  16M x 18      32 MB        22:0          None                      2
  32M x 18      64 MB        23:0          None                      2
Each channel can be expanded by depth according to the number of port enables available. If
external decoding is used, then the number of SRAMs used is not limited by the number of port
enables generated by the SRAM controller.
Note: Doing external decoding may require external pipeline registers to account for the decode time,
depending on the desired frequency.
Maximum SRAM system sizes are shown in Table 11. Shaded entries require external decoding,
because they use more port enables than the SRAM controller can supply directly.
Table 11. Total Memory per Channel

SRAM Size                       Number of SRAMs on Channel
               1       2       3       4       5       6       7       8
 512K x 18    1 MB    2 MB    3 MB    4 MB    5 MB    6 MB    7 MB    8 MB
   1M x 18    2 MB    4 MB    6 MB    8 MB   10 MB   12 MB   14 MB   16 MB
   2M x 18    4 MB    8 MB   12 MB   16 MB   20 MB   24 MB   28 MB   32 MB
   4M x 18    8 MB   16 MB   24 MB   32 MB   64 MB    NA      NA      NA
   8M x 18   16 MB   32 MB   48 MB   64 MB    NA      NA      NA      NA
  16M x 18   32 MB   64 MB    NA      NA      NA      NA      NA      NA
  32M x 18   64 MB    NA      NA      NA      NA      NA      NA      NA
2.5.3 SRAM Atomic Operations
In addition to normal reads and writes, SRAM supports the following atomic operations.
Microengines have specific instructions to do each atomic operation; the Intel XScale®
microarchitecture uses aliased address regions to do atomic operations.
• bit set
• bit clear
• increment
• decrement
• add
• swap
The SRAM controller does read-modify-writes for the atomic operations; the pre-modified data
can also be returned if desired. The atomic operations operate on a single 32-bit word.
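The read-modify-write semantics, including the optional return of the pre-modified data, can be sketched as follows. This is an illustrative software model (class and method names are hypothetical), not the controller's implementation:

```python
class AtomicWord:
    """Sketch of the SRAM controller's read-modify-write behavior on one
    32-bit word; each method returns the pre-modified value, as the
    hardware can optionally do."""
    MASK = 0xFFFFFFFF

    def __init__(self, value=0):
        self.value = value & self.MASK

    def _rmw(self, new):
        old, self.value = self.value, new & self.MASK
        return old

    def bit_set(self, mask):   return self._rmw(self.value | mask)
    def bit_clear(self, mask): return self._rmw(self.value & ~mask)
    def increment(self):       return self._rmw(self.value + 1)
    def decrement(self):       return self._rmw(self.value - 1)
    def add(self, n):          return self._rmw(self.value + n)
    def swap(self, n):         return self._rmw(n)

w = AtomicWord(0x0F)
assert w.bit_set(0xF0) == 0x0F and w.value == 0xFF
assert w.swap(0) == 0xFF and w.value == 0
```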
2.5.4 Queue Data Structure Commands
The ability to enqueue and dequeue data buffers at a fast rate is key to meeting line-rate
performance. This is a difficult problem as it involves dependent memory references that must be
turned around very quickly. The SRAM controller includes a data structure (called the Q_array)
and associated control logic to perform efficient enqueue and dequeue operations. The Q_array has
64 entries, each of which can be used in one of four ways.
• Linked-list queue descriptor (resident queues)
• Cache of recently used linked-list queue descriptors (backing store for the cache is in SRAM)
• Ring descriptor
• Journal
Commands are provided for linked-list queues and for the cache of recently used linked-list
queue descriptors.
Note: The queue-descriptor commands (Read_Q_Descriptor_Head, Read_Q_Descriptor_Tail, etc.) are
used to initialize the rings and journals, but are not used to perform the ring and journal functions.
2.5.5 Reference Ordering
This section covers the ordering between accesses to any one SRAM controller.
2.5.5.1 Reference Order Tables
Table 12 shows the architectural guarantees of ordering for accesses to the SAME SRAM address
between a reference of any given type (shown in the column labels) and a subsequent reference of
any given type (shown in the row labels). First and second are defined by the order in which the
references are received by the SRAM controller.
Note: A given Network Processor version may implement a superset of these order guarantees. However,
that superset may not be supported in future implementations.
Verification is required to test only the order rules shown in Table 12 and Table 13.
Note: A blank entry in Table 12 means that no order is enforced.
Table 12. Address Reference Order

                          1st ref
2nd ref            Memory  CSR    Memory  CSR    Memory  Queue / Ring /
                   Read    Read   Write   Write  RMW     Q_Descr Commands
Memory Read        Order
CSR Read                   Order
Memory Write                      Order
CSR Write                                 Order
Memory RMW                                       Order
Queue / Ring /
Q_Descr Commands                                         See Table 13
Table 13 shows the architectural guarantees of ordering for accesses to the SAME SRAM Q_array
entry between a reference of any given type (shown in the column labels) and a subsequent
reference of any given type (shown in the row labels). First and second are defined by the order in
which the references are received by the SRAM controller. The same caveats apply as for Table 12.
Table 13. Q_array Entry Reference Order

                            1st ref
2nd ref          Read_Q_   Read_Q_  Write_Q_  Enqueue  Dequeue  Put    Get    Journal
                 Descr     Descr    Descr
                 head,tail other
Read_Q_Descr
head, tail                          Order
Read_Q_Descr
other
Write_Q_Descr    Order
Enqueue                             Order     Order    Order
Dequeue                             Order     Order    Order
Put                                                             Order
Get                                                                    Order
Journal                                                                       Order
2.5.5.2 Microengine Software Restrictions to Maintain Ordering
It is the Microengine programmer’s job to enforce ordering wherever the program requires it and
the architecture does not guarantee it. The signaling mechanism can be used to do this. For
example, suppose microcode needs to update several locations in a table, and a location in SRAM
is used to “lock” access to the table. Example 13 shows the code for the table update.
Example 13. Table Update Code

IMMED [$xfer0, 1]
SRAM [write, $xfer0, flag_address, 0, 1], ctx_swap [SIG_DONE_2]
; At this point, the write to flag_address has passed the point of
; coherency. Do the table updates.
SRAM [write, $xfer1, table_base, offset1, 2], sig_done [SIG_DONE_3]
SRAM [write, $xfer3, table_base, offset2, 2], sig_done [SIG_DONE_4]
CTX_ARB [SIG_DONE_3, SIG_DONE_4]
; At this point, the table writes have passed the point of coherency.
; Clear the flag to allow access by other threads.
IMMED [$xfer0, 0]
SRAM [write, $xfer0, flag_address, 0, 1], ctx_swap [SIG_DONE_2]
Other rules:
• All accesses to atomic variables should be via read-modify-write instructions.
• If the flow must know that a write is completed (actually in the SRAM itself), follow the write
with a read to the same address. The write is guaranteed to be complete when the read data has
been returned to the Microengine.
• With the exception of initialization, never do WRITE commands to the first three longwords
of a queue_descriptor data structure (these are the longwords that hold head, tail, and count,
etc.). All accesses to this data must be via the Q commands.
• To initialize the Q_array registers, perform a memory write of at least three longwords,
followed by a memory read to the same address (to guarantee that the write completed).
Then, for each entry in the Q_array, perform a read_q_descriptor_head followed by a
read_q_descriptor_other using the address of the same three longwords.
2.6 Scratchpad Memory
The IXP2800 Network Processor contains 16 Kbytes of Scratchpad Memory, organized as 4K
32-bit words, that is accessible by the Microengines and the Intel XScale® core. The Scratchpad
Memory provides the following operations:
• Normal reads and writes. 1–16 32-bit words can be read/written with a single Microengine
instruction. Note that Scratchpad is not byte-writable (each write must write all four bytes).
• Atomic read-modify-write (RMW) operations: bit set, bit clear, increment, decrement, add,
subtract, and swap. The RMW operations can also optionally return the pre-modified data.
• Sixteen Hardware Assisted Rings for interprocess communication. (A ring is a FIFO that uses
a head and tail pointer to store/read information in Scratchpad memory.)
Scratchpad Memory is provided as a third memory resource (in addition to SRAM and DRAM)
that is shared by the Microengines and the Intel XScale® core. The Microengines and the Intel
XScale® core can distribute memory accesses between these three types of memory resources to
provide a greater number of memory accesses occurring in parallel.
2.6.1 Scratchpad Atomic Operations
In addition to normal reads and writes, the Scratchpad Memory supports the following atomic
operations. Microengines have specific instructions to do each atomic operation; the Intel XScale®
microarchitecture uses aliased address regions to do atomic operations.
• bit set
• bit clear
• increment
• decrement
• add
• subtract
• swap
The Scratchpad Memory does read-modify-writes for the atomic operations; the pre-modified data
can also be returned if desired. The atomic operations operate on a single 32-bit word.
2.6.2 Ring Commands
The Scratchpad Memory provides sixteen Rings used for interprocess communication. The rings
provide two operations.
• Get(ring, length)
• Put(ring, length)
Ring is the number of the ring (0 through 15) to get from or put to, and length specifies the
number of 32-bit words to transfer. A logical view of one of the rings is shown in Figure 9.
Figure 9. Logical View of Rings
[Figure: one of 16 register sets (Head, Count, Size, Tail) feeds an address decoder into the
Scratchpad RAM, which also serves normal read/write/atomic addresses; a Full signal is derived
from the Count. (A9355-01)]
Head, Tail, and Size are registers in the Scratchpad Unit. Head and Tail point to the actual ring data,
which is stored in the Scratchpad RAM. The count of how many entries are on the Ring is
determined by hardware using the Head and Tail. For each Ring in use, a region of Scratchpad
RAM must be reserved for the ring data.
Note: The reservation is by software convention. The hardware does not prevent other accesses to the
region of Scratchpad Memory used by the Ring. Also the regions of Scratchpad Memory allocated
to different Rings must not overlap.
Head points to the next address to be read on a get, and Tail points to the next address to be written
on a put. The size of each Ring is selectable from the following choices: 128, 256, 512, or 1024
32-bit words.
Note: The region of Scratchpad used for a Ring is naturally aligned to its size.
When the Ring is near full, it asserts an output signal, which is used as a state input to the
Microengines. They must use that signal to test (by doing Branch on Input State) for room on the
Ring before putting data onto it. There is a lag in time from a put instruction executing to the Full
signal being updated to reflect that put. To guarantee that a put will not overfill the ring there is a
bound on the number of Contexts and the number of 32-bit words per write based on the size of the
ring, as shown in Table 14. Each Context should test the Full signal, then do the put if not Full, and
then wait until the Context has been signaled that the data has been pulled before testing the Full
signal again.
An alternate usage method is to have Contexts allocate and deallocate entries from a shared count
variable, using the atomic subtract to allocate and atomic add to deallocate. In this case the
Full signal is not used.
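The head/tail/count mechanics described above can be sketched with a simple software ring. This is an illustrative model only (the real Count and Full signal are maintained by hardware, and the Full signal asserts "near full" with the lag discussed above):

```python
class ScratchRing:
    """Sketch of one hardware-assisted ring: Head and Tail index a reserved
    region of Scratchpad RAM; the Count is derived from puts and gets."""
    def __init__(self, size=128):
        assert size in (128, 256, 512, 1024)   # selectable ring sizes
        self.size = size
        self.data = [0] * size
        self.head = 0            # next address to be read on a get
        self.tail = 0            # next address to be written on a put
        self.count = 0

    def put(self, words):
        for w in words:
            self.data[self.tail] = w
            self.tail = (self.tail + 1) % self.size
        self.count += len(words)

    def get(self, length):
        out = []
        for _ in range(min(length, self.count)):
            out.append(self.data[self.head])
            self.head = (self.head + 1) % self.size
            self.count -= 1
        return out

r = ScratchRing(128)
r.put([1, 2, 3])
assert r.count == 3
assert r.get(2) == [1, 2] and r.count == 1
```

Table 14 then bounds how large each put may be, per ring size and number of Contexts, so that puts issued before the Full signal updates cannot overfill the ring.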
Table 14. Ring Full Signal Use – Number of Contexts and Length versus Ring Size

Number of                   Ring Size
Contexts        128      256      512      1024
     1           16       16       16       16
     2           16       16       16       16
     4            8       16       16       16
     8            4       12       16       16
    16            2        6       14       16
    24            1        4        9       16
    32            1        3        7       15
    40       Illegal       2        5       12
    48       Illegal       2        4       10
    64       Illegal       1        3        7
   128       Illegal  Illegal      1        3

NOTES:
1. The number in each table entry is the largest length that should be put. 16 is the largest length that a single
put instruction can generate.
2. Illegal -- With that number of Contexts, even a length of one could cause the Ring to overfill.
2.7 Media and Switch Fabric Interface
The Media and Switch Fabric (MSF) Interface is used to connect the IXP2800 Network Processor
to a physical layer device (PHY) and/or to a Switch Fabric. The MSF consists of separate receive
and transmit interfaces, each of which can be separately configured for either SPI-4 Phase 2
(System Packet Interface) for PHY devices or the CSIX-L1 protocol for Switch Fabric interfaces.
The receive and transmit ports are unidirectional and independent of each other. Each port has 16 data signals, a clock, a control signal, and a parity signal, all of which use LVDS (differential) signaling and are sampled on both edges of the clock. There is also a flow control port, consisting of a clock, data, and ready status bits, which is used to communicate between two IXP2800 Network Processors, or between the IXP2800 Network Processor and a Switch Fabric Interface. These signals are also LVDS with dual-edge data transfer. All of the high-speed LVDS interfaces support dynamic deskew training.
The block diagram in Figure 10 shows a typical configuration.
Figure 10. Example System Block Diagram
[Figure: The Ingress Intel® IXP2800 Network Processor receives SPI-4 protocol (RDAT, RSTAT) from a Framing/MAC Device (PHY) and transmits CSIX protocol (TDAT) toward the Switch Fabric through an optional gasket (Note 1). The Egress IXP2800 Network Processor receives CSIX protocol from the Switch Fabric and transmits SPI-4 (TDAT, TSTAT) back toward the PHY. A Flow Control path connects the two processors. Ingress: receive protocol is SPI-4, transmit mode is CSIX. Egress: receive protocol is CSIX, transmit mode is SPI-4.]
Notes:
1. The gasket is used to convert the 16-bit, dual-data IXP2800 signals to the wider single-edge CWord signals used by the Switch Fabric, if required.
2. Per the CSIX specification, the terms "egress" and "ingress" are with respect to the Switch Fabric. So the egress processor handles traffic received from the Switch Fabric and the ingress processor handles traffic sent to the Switch Fabric.
An alternate system configuration is shown in the block diagram in Figure 11. In this case, a single IXP2800 Network Processor is used for both Ingress and Egress. The bit rate supported would be less than in Figure 10. A hypothetical Bus Converter chip, external to the IXP2800 Network Processor, is used. The block diagram in Figure 11 is only an illustrative example.
Figure 11. Full-Duplex Block Diagram
[Figure: A single Intel® IXP2800 Network Processor (TDAT/RDAT) connects to a Bus Converter chip. The Bus Converter connects to a Framing/MAC Device (PHY) over UTOPIA-3 or IXBUS protocol (Rx/Tx) and to a Switch Fabric over CSIX protocol (Rx/Tx). The receive and transmit protocol is SPI-4 and CSIX on a transfer-by-transfer basis.]
Note: The Bus Converter chip receives and transmits both SPI-4 and CSIX protocols from/to the Intel IXP2800 Network Processor. It steers the data, based on protocol, to either the PHY device or the Switch Fabric. The PHY interface can be UTOPIA-3, IXBUS, or any other required protocol.

2.7.1 SPI-4

SPI-4 is an interface for packet and cell transfer between a physical layer (PHY) device and a link layer device (the IXP2800 Network Processor), for aggregate bandwidths of OC-192 ATM and Packet over SONET/SDH (POS), as well as 10 Gb/s Ethernet applications.
The Optical Internetworking Forum (OIF), www.oiforum.com, controls the SPI-4 Implementation Agreement document.
SPI-4 protocol transfers data in variable length bursts. Associated with each burst is information
such as Port number (for a multi-port device such as a 10 x 1 GbE), SOP, and EOP. This
information is collected by the MSF and passed to the Microengines.
2.7.2 CSIX

CSIX-L1 (Common Switch Interface) defines an interface between a Traffic Manager (TM) and a Switch Fabric (SF) for ATM, IP, MPLS, Ethernet, and similar data communications applications. The Network Processor Forum (NPF), www.npforum.org, controls the CSIX-L1 specification.
The basic unit of information transferred between Traffic Managers and Switch Fabrics is called a
CFrame. There are three categories of CFrames:
• Data
• Control
• Flow Control
Associated with each CFrame is information such as length, type, and address. This information is collected by the MSF and passed to the Microengines.
The MSF also contains a number of hardware features related to flow control.
2.7.3 Receive

Figure 12 is a simplified block diagram of the MSF receive section.

Figure 12. MSF Receive Section Block Diagram
[Figure: Receive buffers and control logic; CSIX CFrames are mapped by the RX_Port_Map CSR (normally Flow Control CFrames are mapped here); Receive Thread Freelists deliver status to the Microengines; data paths (32- and 64-bit) lead to the MEs and to DRAM; CSRs are updated by CSR Write.]
2.7.3.1 RBUF

RBUF is a RAM that holds received data. It stores received data in sub-blocks (referred to as elements), and is accessed by Microengines or the Intel XScale® core reading the received information. Details of how RBUF elements are allocated and filled are based on the receive data protocol. When data is received, the associated status is put into the FULL_ELEMENT_LIST FIFO and subsequently sent to Microengines to process. The FULL_ELEMENT_LIST ensures that received elements are sent to Microengines in the order that the data was received.
RBUF contains a total of 8 KB of data. The element size is programmable as 64, 128, or 256 bytes per element. In addition, RBUF can be programmed to be split into one, two, or three partitions, depending on the application. For receiving SPI-4 only, one partition is used. For receiving CSIX only, two partitions are used (Control CFrames and Data CFrames). When both protocols are being used, the RBUF is split into three partitions.
Microengines can read data from the RBUF to Microengine S_TRANSFER_IN registers using the
msf[read] instruction where they specify the starting byte number (which must be aligned to 4
bytes), and number of 32-bit words to read. The number in the instruction can be either the number
of 32-bit words, or number of 32-bit word pairs, using the single and double instruction modifiers,
respectively.
Microengines can move data from RBUF to DRAM using the
dram instruction where they specify
the starting byte number (which must be aligned to 4 bytes), the number of 32-bit words to read,
and the address in DRAM to write the data.
For both types of RBUF read, reading an element does not modify any RBUF data, and does not
free the element, so buffered data can be read as many times as desired. This allows, for example, a
processing pipeline to have different Microengines handle different protocol layers, with each
Microengine reading only the specific header information it requires.
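The non-destructive read behavior can be modeled as follows. This is a software sketch under stated assumptions: RbufElement and its read method are hypothetical stand-ins for an RBUF element and msf[read]-style access; they are not actual hardware or microcode interfaces.

```python
class RbufElement:
    """Model: an RBUF element as an immutable byte buffer. Reads return
    copies and never free the element, so successive pipeline stages can
    each fetch just the header bytes they need."""
    def __init__(self, data):
        self.data = bytes(data)

    def read(self, start_byte, n_words):
        # msf[read]-style access: the start must be 4-byte aligned and
        # the length is given in 32-bit words.
        assert start_byte % 4 == 0, "start must be 4-byte aligned"
        return self.data[start_byte:start_byte + 4 * n_words]

elem = RbufElement(bytes(range(64)))
l2_hdr = elem.read(0, 4)    # one ME reads 16 bytes of L2 header
l3_hdr = elem.read(12, 5)   # a later ME reads 20 bytes from offset 12
assert elem.read(0, 4) == l2_hdr   # element unchanged by earlier reads
```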
2.7.3.1.1 SPI-4 and the RBUF

SPI-4 data is placed into the RBUF, with each SPI-4 burst allocating an element. If a SPI-4 burst is larger than the element size, another element is allocated. The status information associated with each element is defined in Table 90, "RBUF SPI-4 Status Definition" on page 252.
2.7.3.1.2 CSIX and RBUF

CSIX CFrames are placed into the RBUF, with each CFrame allocating an element. Unlike SPI-4, a single CFrame must not spill over into another element. Since the CSIX specification defines a maximum CFrame size of 256 bytes, this can be guaranteed by programming the element size to 256 bytes. However, if the Switch Fabric uses a smaller CFrame size, then a smaller RBUF element size can be used.
Flow Control CFrames are put into the FCEFIFO, to be sent to the Ingress IXP2800 Network Processor, where a Microengine will read them to manage flow control information to the Switch Fabric.
The status information associated with each element is defined in Table 91, "RBUF CSIX Status Definition" on page 254.
2.7.3.2 Full Element List

Receive control hardware maintains the FULL_ELEMENT_LIST to hold the status of valid RBUF elements, in the order in which they were received. When an RBUF element is filled, its status is added to the tail of the FULL_ELEMENT_LIST. When a Microengine is notified of an element's arrival (by having the status written to its S_Transfer register), that status is removed from the head of the FULL_ELEMENT_LIST.
2.7.3.3 RX_THREAD_FREELIST

RX_THREAD_FREELIST is a FIFO that indicates Microengine Contexts that are awaiting an RBUF element to process. This allows the Contexts to indicate their ready status prior to the reception of the data, as a way to eliminate latency. Each entry added to a Freelist also has an associated S_TRANSFER register and signal number. There are three RX_THREAD_FREELISTs that correspond to the RBUF partitions.
To be added as ready to receive an element, a Microengine does an msf[write] or an msf[fast_write] to the RX_THREAD_FREELIST address; the write data is the Microengine/CONTEXT/S_TRANSFER register number to add to the Freelist.
When there is valid status at the head of the Full Element List, it will be pushed to a Microengine.
The receive control logic pushes the status information (which includes the element number) to the
Microengine in the head entry of RX_THREAD_FREELIST, and sends an Event Signal to the
Microengine. It then removes that entry from the RX_THREAD_FREELIST, and removes the
status from Full Element List.
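The interaction between the two queues can be modeled in a few lines of software. This is an illustrative sketch only; the function names and the pair-based "autopush" record are assumptions, not hardware interfaces.

```python
from collections import deque

# Software model of receive dispatch: statuses of filled RBUF elements
# queue in FULL_ELEMENT_LIST; waiting contexts queue in RX_THREAD_FREELIST.
# Whenever both are non-empty, the head status is pushed to the head context.
full_element_list = deque()
rx_thread_freelist = deque()
pushed = []   # (context, status) pairs, standing in for autopush + signal

def element_filled(status):
    full_element_list.append(status)   # added to the tail on element fill
    dispatch()

def context_ready(ctx):
    rx_thread_freelist.append(ctx)     # msf[fast_write] adds the context
    dispatch()

def dispatch():
    while full_element_list and rx_thread_freelist:
        status = full_element_list.popleft()   # removed from Full Element List
        ctx = rx_thread_freelist.popleft()     # entry removed from freelist
        pushed.append((ctx, status))           # autopush status, send signal

context_ready("ctx0")
element_filled({"element": 5})
assert pushed == [("ctx0", {"element": 5})]
```

Note the model also captures the fall-behind case: if contexts are slow, statuses simply accumulate in full_element_list until a new freelist entry arrives.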
Hardware Reference Manual63
Intel® IXP2800 Network Processor
Technical Description
Each RX_THREAD_FREELIST has an associated countdown timer. If the timer expires and no
new receive data is available yet, the receive logic will autopush a Null Receive Status Word to the
next thread on the RX_THREAD_FREELIST. A Null Receive Status Word has the “Null” bit set,
and does not have any data or RBUF entry associated with it.
The RX_THREAD_FREELIST timer is useful for certain applications. Its primary purpose is to
keep the receive processing pipeline (implemented as code running on the Microengines) moving
even when the line has gone idle.
It is especially useful if the pipeline is structured to handle mpackets in groups, e.g., eight mpackets at a time. If seven mpackets are received and the line then goes idle, the timeout will trigger the autopush of a null Receive Status Word, filling the eighth slot and allowing the pipeline to advance.
Another example is if one valid mpacket is received before the line goes idle for a long period;
seven null Receive Status Words will be autopushed, allowing the pipeline to proceed. Typically
the timeout interval is programmed to be slightly larger than the minimum arrival time of the
incoming cells or packets.
The timer is controlled using the RX_THREAD_FREELIST_TIMEOUT_# CSR. The timer may
be enabled or disabled, and the timeout value specified using this CSR.
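The padding behavior can be illustrated with a small model. The group size of eight and the dict-based status words are assumptions for illustration; the hardware pushes one Null Receive Status Word per timer expiry rather than padding a whole batch at once.

```python
# Model of the freelist timer's effect: if the line goes idle while a
# group of (say) eight status slots is partially filled, null status
# words pad out the group so the pipeline can advance.
GROUP = 8

def fill_group(received_statuses):
    """Pad a partially filled group with Null Receive Status Words, as
    repeated timer-driven autopushes would on an idle line."""
    group = list(received_statuses)
    while len(group) < GROUP:
        group.append({"null": True})   # Null bit set, no RBUF element
    return group

group = fill_group([{"null": False, "element": i} for i in range(7)])
assert len(group) == GROUP and group[7]["null"]
```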
2.7.3.4 Receive Operation Summary

During receive processing, received CFrames and SPI-4 cells and packets (which in this context are all called mpackets) are placed into the RBUF, and then handed off to a Microengine to process.
Normally, by application design, some number of Microengine Contexts will be assigned to receive processing. Those Contexts will have their number added to the proper RX_THREAD_FREELIST (via msf[write] or msf[fast_write]), and then will go to sleep to wait for arrival of an mpacket (or alternatively poll waiting for arrival of an mpacket).
When an mpacket arrives, MSF receive control logic will autopush eight bytes of information for the element to the Microengine/CONTEXT/S_TRANSFER registers at the head of RX_THREAD_FREELIST. The information pushed is:
• Status Word (SPI-4) or Header Status (CSIX) — see Table 90, "RBUF SPI-4 Status Definition" on page 252 for more information.
• Checksum (SPI-4) or Extension Header (CSIX) — see Table 91, "RBUF CSIX Status Definition" on page 254 for more information.
To handle the case where the receive Contexts temporarily fall behind and
RX_THREAD_FREELIST is empty, all received element numbers are held in the
FULL_ELEMENT_LIST. In that case, as soon as an RX_THREAD_FREELIST entry is entered,
the status of the head element of FULL_ELEMENT_LIST will be pushed to it.
The Microengines may read part of (or the entire) RBUF element to their S_TRANSFER registers (via an msf[read] instruction) for header processing, etc., and may also move the element data to DRAM (via a dram[rbuf_rd] instruction).
When a Context is done with an element, it does an msf[write] or msf[fast_write] to the RBUF_ELEMENT_DONE address; the write data is the element number. This marks the element as free and available to be re-used. There is no restriction on the order in which elements are freed; Contexts can do different amounts of processing per element based on the contents of the element — therefore elements can be returned in a different order than they were handed to Contexts.
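One pass of a receive Context can be sketched in software as follows. The MsfStub class and handle_status function are hypothetical models of the msf[read] and RBUF_ELEMENT_DONE operations described above; real code would be Microengine instructions, not Python.

```python
class MsfStub:
    """Illustrative stand-in for the MSF: holds element data and records
    which elements have been freed via RBUF_ELEMENT_DONE."""
    def __init__(self, elements):
        self.elements = elements      # element number -> bytes
        self.freed = []

    def read(self, elem, start_byte, n_words):
        # msf[read]-style access, length in 32-bit words
        return self.elements[elem][start_byte:start_byte + 4 * n_words]

    def element_done(self, elem):
        self.freed.append(elem)       # msf[fast_write] to RBUF_ELEMENT_DONE

def handle_status(status, msf):
    """Process one autopushed status word."""
    if status.get("null"):
        return None                   # timer padding: no element to handle
    elem = status["element"]
    header = msf.read(elem, 0, 4)     # read 16 header bytes for processing
    msf.element_done(elem)            # free element; freeing order is free
    return header

msf = MsfStub({3: bytes(64)})
assert handle_status({"element": 3}, msf) == bytes(16)
assert msf.freed == [3]
```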
2.7.4 Transmit

Figure 13 is a simplified block diagram of the MSF transmit section.
Figure 13. MSF Transmit Section Block Diagram
[Figure: TBUF elements are filled from the MEs and from DRAM, gated by per-element Valid bits and control logic, then pass through SPI-4 or CSIX protocol logic and byte alignment, driven by the internal transmit clock, onto the TDAT, TCTL, TPAR, and TCLK/TCLK REF pins. Incoming flow control (RXCDAT, RXCFC asserted when the FCIFIFO is full, RXCSRB ready bits) fills the FCIFIFO, which MEs read over the S_Push_Bus.]

2.7.4.1 TBUF

TBUF is a RAM that holds data and status to be transmitted. The data is written into sub-blocks, referred to as elements, by Microengines or the Intel XScale® core.
TBUF contains a total of 8 Kbytes of data. The element size is programmable as either 64 bytes,
128 bytes, or 256 bytes per element. In addition, TBUF can be programmed to be split into one,
two, or three partitions depending on application. For transmitting SPI-4, one partition would be
used. For transmitting CSIX, two partitions are used (Control CFrames and Data CFrames). For
both SPI-4 and CSIX, three partitions are used.
Microengines can write data from Microengine S_TRANSFER_OUT registers to the TBUF using
the
msf[write] instruction where they specify the starting byte number (which must be aligned to
4 bytes), and number of 32-bit words to write. The number in the instruction can be either the
number of 32-bit words, or number of 32-bit word pairs, using the single and double instruction
modifiers, respectively.
Microengines can move data from DRAM to TBUF using the
dram instruction where they specify
the starting byte number (which must be aligned to 4 bytes), the number of 32-bit words to write,
and the address in DRAM of the data.
All elements within a TBUF partition are transmitted in order. Control information associated with the element defines which bytes are valid. The data from the TBUF will be shifted and byte aligned as required for transmission.
2.7.4.1.1 SPI-4 and TBUF

For SPI-4, data is put into the data portion of the element, and information for the SPI-4 Control Word that will precede the data is put into the Element Control Word. When the Element Control Word is written, the fields supplied are defined in Table 15.
Table 15. TBUF SPI-4 Control Definition

Payload Length: Indicates the number of Payload bytes, from 1 to 256, in the element. The value of 0x00 means 256 bytes. The sum of Prepend Length and Payload Length will be sent. That value will also control the EOPS field (1 or 2 bytes valid indicated) of the Control Word that will succeed the data transfer. See Note 1.
Prepend Offset: Indicates the first valid byte of Prepend, from 0 to 7.
Prepend Length: Indicates the number of bytes in Prepend, from 0 to 31.
Payload Offset: Indicates the first valid byte of Payload, from 0 to 7.
Skip: Allows software to allocate a TBUF element and then not transmit any data from it. 0 = transmit data according to other fields of the Control Word; 1 = free the element without transmitting any data.
SOP: Indicates if the element is the start of a packet. This field will be sent in the SOPC field of the Control Word that will precede the data transfer.
EOP: Indicates if the element is the end of a packet. This field will be sent in the EOPS field of the Control Word that will succeed the data transfer. See Note 1.
ADR: The port number to which the data is directed. This field will be sent in the ADR field of the Control Word that will precede the data transfer.
NOTE:
1. Normally EOPS is sent on the next Control Word (along with ADR and SOP) to start the next element. If there is no valid element pending at the end of sending the data, the transmit logic will insert an Idle Control Word with the EOPS information.
2.7.4.1.2 CSIX and TBUF

For CSIX, payload information is put into the data area of the element, and Base and Extension Header information is put into the Element Control Word. When the Element Control Word is written, the fields supplied are defined in Table 16.
Table 16. TBUF CSIX Control Definition

Payload Length: Indicates the number of Payload bytes, from 1 to 256, in the element. The value of 0x00 means 256 bytes. The sum of Prepend Length and Payload Length will be sent, and also put into the CSIX Base Header Payload Length field. Note that this length does not include any padding that may be required. Padding is inserted by transmit hardware as needed.
Prepend Offset: Indicates the first valid byte of Prepend, from 0 to 7.
Prepend Length: Indicates the number of bytes in Prepend, from 0 to 31.
Payload Offset: Indicates the first valid byte of Payload, from 0 to 7.
Skip: Allows software to allocate a TBUF element and then not transmit any data from it. 0 = transmit data according to other fields of the Control Word; 1 = free the element without transmitting any data.
CR: CR (CSIX Reserved) bit to put into the CSIX Base Header.
P: P (Private) bit to put into the CSIX Base Header.
Type: Type field to put into the CSIX Base Header. Idle type is not legal here.
Extension Header: The Extension Header to be sent with the CFrame. The bytes are sent in big-endian order; byte 0 is in bits 63:56, byte 1 is in bits 55:48, byte 2 is in bits 47:40, and byte 3 is in bits 39:32.
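The Extension Header byte placement described in Table 16 can be checked with a short helper (the function name is illustrative, not an actual API):

```python
def pack_extension_header(b0, b1, b2, b3):
    """Place the four Extension Header bytes into bits [63:32] of a 64-bit
    word, big-endian as Table 16 describes: byte 0 -> bits 63:56,
    byte 1 -> bits 55:48, byte 2 -> bits 47:40, byte 3 -> bits 39:32."""
    return (b0 << 56) | (b1 << 48) | (b2 << 40) | (b3 << 32)

assert pack_extension_header(0xAA, 0xBB, 0xCC, 0xDD) == 0xAABBCCDD00000000
```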
2.7.4.2 Transmit Operation Summary

During transmit processing, data to be transmitted is placed into the TBUF under Microengine control. The Microengine allocates an element in software; the transmit hardware processes TBUF elements within a partition in strict sequential order, so the software can track which element to allocate next.
Microengines may write directly into an element by an msf[write] instruction, or have data from DRAM written into the element by a dram[tbuf_wr] instruction. Data can be merged into the element by doing both.
There is a Transmit Valid bit per element that marks the element as ready to be transmitted. Microengines move all data into the element, by either or both of msf[write] and dram[tbuf_wr] instructions to the TBUF. Microengines also write the element Transmit Control Word with information about the element. When all of the data movement is complete, the Microengine sets the element valid bit:
1. Move data into the TBUF by either or both of msf[write] and dram[tbuf_wr] instructions.
2. Wait for step 1 to complete.
3. Write the Transmit Control Word at the TBUF_ELEMENT_CONTROL_# address. Using this address sets the Transmit Valid bit.
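The in-order allocation and Valid-bit handshake can be modeled in software. TbufPartition is a hypothetical model for illustration; the cursor fields and method names are assumptions, not actual hardware interfaces.

```python
class TbufPartition:
    """Model of transmit bookkeeping: elements in a partition are consumed
    strictly in order, so software tracks the next element to allocate with
    a simple counter; hardware transmits an element only once its Valid bit
    is set and all earlier elements have been sent."""
    def __init__(self, n_elements):
        self.n = n_elements
        self.next_alloc = 0               # software allocation cursor
        self.valid = [False] * n_elements
        self.next_tx = 0                  # hardware transmit cursor

    def alloc(self):
        elem = self.next_alloc
        self.next_alloc = (elem + 1) % self.n
        return elem

    def set_valid(self, elem):
        # Writing TBUF_ELEMENT_CONTROL_# sets the Transmit Valid bit.
        self.valid[elem] = True

    def transmit_ready(self):
        # Hardware only advances past the in-order next element.
        sent = []
        while self.valid[self.next_tx]:
            self.valid[self.next_tx] = False
            sent.append(self.next_tx)
            self.next_tx = (self.next_tx + 1) % self.n
        return sent

p = TbufPartition(4)
a, b = p.alloc(), p.alloc()
p.set_valid(b)                    # element b finishes first...
assert p.transmit_ready() == []   # ...but hardware waits for element a
p.set_valid(a)
assert p.transmit_ready() == [0, 1]
```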
2.7.5 The Flow Control Interface

The MSF provides flow control support for SPI-4 and CSIX.

2.7.5.1 SPI-4

SPI-4 uses a FIFO Status Channel to provide flow control information. The MSF receives the information from the PHY device and stores it so that Microengines can read the information on a per-port basis. A Microengine can then use that information to determine when to transmit data to a given port.
The MSF also sends status to the PHY based on the amount of available space in the RBUF — i.e., this is done by hardware, without Microengine involvement.
2.7.5.2 CSIX

CSIX provides two types of flow control — link level and per queue.
• The link level flow control is handled by hardware. The MSF will stop transmission in response to link level flow control received from the Switch Fabric. The MSF will assert link level flow control based on the amount of available space in the RBUF.
• Per queue flow control information is put into the FCIFIFO and handled by Microengine software. Also, if required, Microengines can send Flow Control CFrames to the Switch Fabric under software control.
In both cases, for a full-duplex configuration, information is passed from the Switch Fabric to the
Egress IXP2800 Network Processor, which then passes it to the Ingress IXP2800 Network
Processor over a proprietary flow control interface.
2.8 Hash Unit

The IXP2800 Network Processor contains a Hash Unit that can take 48-, 64-, or 128-bit data and produce a 48-, 64-, or a 128-bit hash index, respectively. The Hash Unit is accessible by the Microengines and the Intel XScale® core, and is useful in doing table searches with large keys, for example L2 addresses. Figure 14 is a block diagram of the Hash Unit.
Up to three hash indexes can be created using a single Microengine instruction. This helps to minimize command overhead. The Intel XScale® core can only do a single hash at a time.
A Microengine initiates a hash operation by writing the hash operands into a contiguous set of S_TRANSFER_OUT registers and then executing the hash instruction. The Intel XScale® core initiates a hash operation by writing a set of memory-mapped HASH_OP registers, which are built in the Intel XScale® core gasket, with the data to be used to generate the hash index. There are separate registers for 48-, 64-, and 128-bit hashes. The data is written from MSB to LSB, with the write to the LSB triggering the Hash Operation. In both cases, the Hash Unit reads the operand into an input buffer, performs the hash operation, and returns the result.
The Hash Unit uses a hard-wired polynomial algorithm and a programmable hash multiplier to
create hash indexes. Three separate multipliers are supported, one for 48-bit hash operations, one
for 64-bit hash operations and one for 128-bit hash operations. The multiplier is programmed
through Control registers in the Hash Unit.
The multiplicand is shifted into the hash array, 16 bits at a time. The hash array performs a
1’s-complement multiply and polynomial divide, using the multiplier and 16 bits of the
multiplicand. The result is placed into an output buffer register and also feeds back into the array.
This process is repeated three times for a 48-bit hash (16 bits x 3 = 48), four times for a 64-bit hash
(16 bits x 4 = 64), and eight times for a 128-bit hash (16 x 8 = 128). After the multiplicand has been
passed through the hash array, the resulting hash index is placed into a two-stage output buffer.
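The multiply-and-divide can be sketched as a carry-less (GF(2)) multiply followed by a polynomial remainder, fed 16 bits per pass. This is an illustrative model only: the IXP2800's polynomial is hard-wired and not given here, so the poly value used below is a placeholder assumption, as is the interpretation of the 1's-complement multiply as a GF(2) operation.

```python
def gf2_hash(multiplicand, multiplier, poly, width):
    """Carry-less multiply of multiplicand by multiplier, feeding 16 bits
    of the multiplicand per pass as the hash array does, then a polynomial
    divide keeping the remainder modulo poly (degree == width)."""
    acc = 0
    for i in range(width // 16):                 # 3, 4, or 8 passes
        chunk = (multiplicand >> (16 * i)) & 0xFFFF
        for bit in range(16):                    # carry-less multiply step
            if (chunk >> bit) & 1:
                acc ^= multiplier << (16 * i + bit)
    while acc.bit_length() > width:              # polynomial divide
        acc ^= poly << (acc.bit_length() - poly.bit_length())
    return acc

# With multiplier 1 the reduced product is the key itself: a sanity check.
assert gf2_hash(0x1234, 1, (1 << 48) | 0b11, 48) == 0x1234
```

The quality of the hash distribution depends on the programmed multiplier, which is why the real hardware makes it programmable per hash width.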
After each hash index is completed, the Hash Unit returns the hash index to the Microengines' S_TRANSFER_IN registers, or the Intel XScale® core HASH_OP registers. For Microengine initiated hash operations, the Microengine is signaled after all the hashes specified in the instruction have been completed.
For the Intel XScale® core initiated hash operations, the Intel XScale® core reads the results from the memory-mapped HASH_OP registers. The addresses of the Hash Results are the same as the HASH_OP registers. Because of queuing delays at the Hash Unit, the time to complete an operation is not fixed. The Intel XScale® core can do one of two operations to get the hash results:
• Poll the HASH_DONE register. This register is cleared when the HASH_OP registers are written. Bit [0] of the HASH_DONE register is set when the HASH_OP registers get the return result from the Hash Unit (when the last word of the result is returned). The Intel XScale® core software can poll on HASH_DONE, and read HASH_OP when HASH_DONE is equal to 0x00000001.
• Read HASH_OP directly. The interface hardware will acknowledge the read only when the result is valid. This method will result in the Intel XScale® core stalling if the result is not valid when the read happens.
The number of clock cycles required to perform a single hash operation is the sum of: two or four cycles through the input buffers; three, four, or eight cycles through the hash array (for 48-, 64-, or 128-bit hashes, respectively); and two or four cycles through the output buffers. Because of the pipeline characteristics of the Hash Unit, performance is improved if multiple hash operations are initiated with a single instruction rather than with a separate hash instruction for each operation.
Figure 14. Hash Unit Block Diagram
[Figure: Data used to create the hash index arrives from the S_Transfer_Out registers into a 2-stage input buffer holding up to three 128-bit multiplicands. The hash array, fed 16 bits per shift and driven by Hash_Multiplier_48, Hash_Multiplier_64, or Hash_Multiplier_128 (48-bit, 64-bit, or 128-bit hash select), produces the hashed multiplicands into a 2-stage output buffer, which returns the hash indexes to the S_Transfer_In registers.]
2.9 PCI Controller

The PCI Controller provides a 64-bit, 66 MHz capable PCI Local Bus Revision 2.2 interface, and is compatible with 32-bit or 33 MHz PCI devices. The PCI controller provides the following functions:
• Target Access (external Bus Master access to SRAM, DRAM, and CSRs)
• Master Access (the Intel XScale® core access to PCI Target devices)
• Two DMA Channels
• Mailbox and Doorbell registers for the Intel XScale® core to Host communication
• PCI arbiter
The IXP2800 Network Processor can be configured to act as PCI central function (for use in a stand-alone system), where it provides the PCI reset signal, or as an add-in device, where it uses the PCI reset signal as the chip reset input. The choice is made by connecting the cfg_rst_dir input pin low or high.

2.9.1 Target Access
There are three Base Address Registers (BARs) to allow PCI Bus Masters to access SRAM, DRAM, and CSRs, respectively. Examples of PCI Bus Masters include a Host Processor (for example, a Pentium® processor), or an I/O device such as an Ethernet controller, SCSI controller, or encryption coprocessor.
The SRAM BAR can be programmed to sizes of 16, 32, 64, 128, or 256 Mbytes, or no access.
The DRAM BAR can be programmed to sizes of 128, 256, or 512 Mbytes or 1 Gbyte, or no access.
The CSR BAR is 8 KB.
PCI Boot Mode is supported, in which the Host downloads the Intel XScale® core boot image into DRAM, while holding the Intel XScale® core in reset. Once the boot image has been loaded, the Intel XScale® core reset is deasserted. The alternative is to provide the boot image in a Flash ROM attached to the Slowport.

2.9.2 Master Access

The Intel XScale® core and Microengines can directly access the PCI bus. The Intel XScale® core can do loads and stores to specific address regions to generate all PCI command types. Microengines use the PCI instruction, and also use address regions to generate different PCI commands.

2.9.3 DMA Channels

There are two DMA Channels, each of which can move blocks of data from DRAM to the PCI or from the PCI to DRAM. The DMA channels read parameters from a list of descriptors in SRAM, perform the data movement to or from DRAM, and stop when the list is exhausted. The descriptors are loaded from predefined SRAM entries or may be set directly by CSR writes to DMA Channel registers. There is no restriction on byte alignment of the source address or the destination address.
For PCI to DRAM transfers, the PCI command is Memory Read, Memory Read line, or Memory
Read Multiple. For DRAM to PCI transfers, the PCI command is Memory Write. Memory Write
Invalidate is not supported.
Up to two DMA channels are running at a time with three descriptors outstanding. Effectively, the
active channels interleave bursts to or from the PCI Bus.
Interrupts are generated at the end of a DMA operation for the Intel XScale® core. However, Microengines do not provide an interrupt mechanism. The DMA Channel will instead use an Event Signal to notify the particular Microengine on completion of DMA.
2.9.3.1 DMA Descriptor

Each descriptor uses four 32-bit words in SRAM, aligned on a 16-byte boundary. The DMA channels read the descriptors from SRAM into working registers once the control register has been set to initiate the transaction. This control must be set explicitly; this starts the DMA transfer. Register names for the DMA channels are listed in Figure 15, and Table 17 lists the descriptor contents.

Figure 15. DMA Channel Register Names (X can be 1, 2, or 3)
[Figure: Each DMA Channel has a Control Register, CHAN_X_CONTROL, and the channel registers CHAN_X_BYTE_COUNT, CHAN_X_PCI_ADDR, CHAN_X_DRAM_ADDR, and CHAN_X_DESC_PTR.]
After a descriptor is processed, the next descriptor is loaded in the working registers. This process
repeats until the chain of descriptors is terminated (i.e., the End of Chain bit is set).
Table 17. DMA Descriptor Format

Offset from Descriptor Pointer    Description
0x0                               Byte Count
0x4                               PCI Address
0x8                               DRAM Address
0xC                               Next Descriptor Address
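The four-word layout of Table 17 can be expressed with a small helper. The End of Chain position in bit 31 of the byte count follows the channel-operation text below; the little-endian packing and the function names are assumptions for illustration.

```python
import struct

# Hypothetical helper laying out one DMA descriptor as the four 32-bit
# words of Table 17 (16-byte aligned in SRAM). Bit 31 of the byte count
# is the End of Chain bit used by the channel-operation procedure.
END_OF_CHAIN = 1 << 31

def pack_descriptor(byte_count, pci_addr, dram_addr, next_desc, last=False):
    if last:
        byte_count |= END_OF_CHAIN
    return struct.pack("<4I", byte_count, pci_addr, dram_addr, next_desc)

def unpack_descriptor(raw):
    bc, pci, dram, nxt = struct.unpack("<4I", raw)
    return {"byte_count": bc & 0x7FFFFFFF,
            "end_of_chain": bool(bc & END_OF_CHAIN),
            "pci_addr": pci, "dram_addr": dram, "next": nxt}

raw = pack_descriptor(0x100, 0x80000000, 0x2000, 0, last=True)
d = unpack_descriptor(raw)
assert d["byte_count"] == 0x100 and d["end_of_chain"] and d["next"] == 0
```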
2.9.3.2 DMA Channel Operation
The DMA channel can be set up to read the first descriptor in SRAM, or with the first descriptor
written directly to the DMA channel registers. When descriptors and the descriptor list are in
SRAM, the procedure is as follows:
1. The DMA channel owner writes the address of the first descriptor into the DMA Channel
Descriptor Pointer register (DESC_PTR).
2. The DMA channel owner writes the DMA Channel Control register (CONTROL) with
miscellaneous control information and also sets the channel enable bit (bit 0). The channel
initial descriptor bit (bit 4) in the CONTROL register must also be cleared to indicate that the
first descriptor is in SRAM.
3. Depending on the DMA channel number, the DMA channel reads the descriptor block into the
corresponding DMA registers, BYTE_COUNT, PCI_ADDR, DRAM_ADDR, and
DESC_PTR.
4. The DMA channel transfers the data until the byte count is exhausted, and then sets the
channel transfer done bit in the CONTROL register.
5. If the end of chain bit (bit 31) in the BYTE_COUNT register is clear, the channel checks the Chain Pointer value. If the Chain Pointer value is not equal to 0, it reads the next descriptor and transfers the data (steps 3 and 4 above). If the Chain Pointer value is equal to 0, it waits for the Descriptor Added bit of the Channel Control register to be set before reading the next descriptor and transferring the data (steps 3 and 4 above). If bit 31 is set, the channel sets the channel chain done bit in the CONTROL register and then stops.
6. Proceed to the Channel End Operation.
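Steps 3 through 5 amount to a simple chain walk, sketched below as a software model (the dict-based descriptor representation is an illustration, not the hardware format):

```python
# Software model of the chain walk: the channel follows descriptors until
# one has End of Chain set; a zero Chain Pointer with End of Chain clear
# leaves the chain unterminated, so the channel pauses and waits for the
# Descriptor Added bit.
def run_channel(descriptors, start):
    """descriptors: address -> dict with byte_count, end_of_chain, next."""
    transferred = []
    addr = start
    while True:
        d = descriptors[addr]
        transferred.append(d["byte_count"])   # data moved, done bit set
        if d["end_of_chain"]:
            return transferred, "chain_done"
        if d["next"] == 0:
            return transferred, "paused"      # waits for Descriptor Added
        addr = d["next"]

descs = {
    0x10: {"byte_count": 64, "end_of_chain": False, "next": 0x20},
    0x20: {"byte_count": 32, "end_of_chain": True,  "next": 0},
}
assert run_channel(descs, 0x10) == ([64, 32], "chain_done")
```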
When single descriptors are written into the DMA channel registers, the procedure is as follows:
1. The DMA channel owner writes the descriptor values directly into the DMA channel registers.
The end of chain bit (bit 31) in the BYTE_COUNT register must be set, and the value in the
DESC_PTR register is not used.
2. The DMA channel owner writes the base address of the DMA transfer into the PCI_ADDR to
specify the PCI starting address.
3. When the first descriptor is in the BYTE_COUNT register, the DRAM_ADDR register must
be written with the address of the data to be moved.
4. The DMA channel owner writes the CONTROL register with miscellaneous control
information, along with setting the channel enable bit (bit 0). The channel initial descriptor in
register bit (bit 4) in the CONTROL register must also be set to indicate that the first descriptor
is already in the channel descriptor registers.
5. The DMA channel transfers the data until the byte count is exhausted, and then sets the
channel transfer done bit (bit 2) in the CONTROL register.
6. Since the end of chain bit (bit 31) in the BYTE_COUNT register is set, the channel sets the channel chain done bit (bit 7) in the CONTROL register and then stops.
7. Proceed to the Channel End Operation.
2.9.3.3DMA Channel End Operation
1. Channel owned by PCI:
If not masked via the PCI Outbound Interrupt Mask register, the DMA channel interrupts the
PCI host after the setting of the DMA done bit in the CHAN_X_CONTROL register, which is
readable in the PCI Outbound Interrupt Status register.
2. Channel owned by the Intel XScale® core:
If enabled via the Intel XScale® core Interrupt Enable registers, the DMA channel interrupts
the Intel XScale® core by setting the DMA channel done bit in the CHAN_X_CONTROL
register, which is readable in the Intel XScale® core Interrupt Status register.
3. Channel owned by Microengine:
If enabled via the Microengine Auto-Push Enable registers, the DMA channel signals the
Microengine after setting the DMA channel done bit in the CHAN_X_CONTROL register,
which is readable in the Microengine Auto-Push Status register.
2.9.3.4 Adding Descriptors to an Unterminated Chain
It is possible to add a descriptor to a chain while a channel is running. To do so, the chain should be
left unterminated, i.e., the last descriptor should have End of Chain clear, and the Chain Pointer
value equal to 0. A new descriptor (or linked list of descriptors) can be added to the chain by
overwriting the Chain Pointer value of the unterminated descriptor (in SRAM) with the Local
Memory address of the (first) added descriptor (the added descriptor must actually be valid in
Local Memory prior to that). After updating the Chain Pointer field, the software must write a 1 to
the Descriptor Added bit of the Channel Control register; this is necessary to reactivate the channel
in the case where it was paused. However, software need not check the state of the channel before
writing that bit; there is no side effect of writing that bit in the case where the channel had not yet
read the unlinked descriptor.
If the channel was paused or had read an unlinked Pointer, it will re-read the last descriptor
processed (i.e., the one that originally had the 0 value for Chain Pointer) to get the address of the
newly added descriptor.
A descriptor cannot be added to a descriptor that has End of Chain set.
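The add-to-unterminated-chain rule can be sketched as follows. The descriptor fields mirror the text, but the in-memory record layout and the Descriptor Added bit position are illustrative assumptions; only the ordering (descriptor valid in memory first, then the Chain Pointer update, then Descriptor Added) comes from the text.

```python
# Sketch of appending to an unterminated DMA descriptor chain.
# Field names follow the text; the memory layout is illustrative.

DESCRIPTOR_ADDED = 1 << 5   # illustrative bit position in Channel Control

def append_descriptor(memory, tail_addr, new_addr, new_desc, channel):
    tail = memory[tail_addr]
    if tail["end_of_chain"]:
        raise ValueError("cannot add to a descriptor with End of Chain set")
    # The added descriptor must be valid in memory *before* it is linked.
    memory[new_addr] = new_desc
    # Overwrite the 0-valued Chain Pointer of the unterminated descriptor.
    tail["chain_ptr"] = new_addr
    # Always set Descriptor Added: required if the channel was paused,
    # harmless if it has not yet read the unlinked descriptor.
    channel["CONTROL"] = channel.get("CONTROL", 0) | DESCRIPTOR_ADDED

memory = {0x100: {"chain_ptr": 0, "end_of_chain": False}}
channel = {}
append_descriptor(memory, 0x100, 0x140,
                  {"chain_ptr": 0, "end_of_chain": False}, channel)
```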
2.9.4 Mailbox and Message Registers

Mailbox and Doorbell registers provide hardware support for communication between the Intel
XScale® core and a device on the PCI Bus.

Four 32-bit mailbox registers are provided so that messages can be passed between the Intel
XScale® core and a PCI device. All four registers can be read and written with byte resolution from
both the Intel XScale® core and PCI. How the registers are used is application dependent and the
messages are not used internally by the PCI Unit in any way. The mailbox registers are often used
with the Doorbell interrupts.

Doorbell interrupts provide an efficient method of generating an interrupt as well as encoding the
purpose of the interrupt. The PCI Unit supports a 32-bit Intel XScale® core DOORBELL register
that is used by a PCI device to generate an Intel XScale® core interrupt, and a separate 32-bit PCI
DOORBELL register that is used by the Intel XScale® core to generate a PCI interrupt.
A source generating the Doorbell interrupt can write a software-defined bitmap to the register to
indicate a specific purpose. This bitmap is translated into a single interrupt signal to the destination
(either a PCI interrupt or an Intel XScale® core interrupt). When an interrupt is received, the
DOORBELL registers can be read and the bit mask can be interpreted. If a larger bit mask is
required than that provided by the DOORBELL register, the MAILBOX registers can be used to
pass up to 16 bytes of data.
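The doorbell-plus-mailbox exchange can be sketched as a behavioral model. The bit meanings and the helper names here are illustrative (the text says the bitmap is software-defined); only the pattern of writing a bitmap to a DOORBELL register and staging up to 16 bytes in the four MAILBOX registers comes from the text.

```python
# Sketch of a doorbell-plus-mailbox exchange. Bit meanings are
# software-defined, as the text notes; helper names are illustrative.

MSG_AVAILABLE = 1 << 0   # software-defined doorbell bit (illustrative)
ERROR_EVENT = 1 << 1

def ring_doorbell(regs, bitmap, payload=b""):
    assert len(payload) <= 16, "four 32-bit mailboxes hold at most 16 bytes"
    # Up to 16 bytes of context go in the four MAILBOX registers.
    for i in range(4):
        word = payload[4 * i:4 * i + 4].ljust(4, b"\0")
        regs["MAILBOX"][i] = int.from_bytes(word, "little")
    # Writing the bitmap generates a single interrupt to the destination.
    regs["PCI_DOORBELL"] |= bitmap

def service_doorbell(regs):
    # The interrupted side reads the register and interprets the bit mask.
    bitmap = regs["PCI_DOORBELL"]
    regs["PCI_DOORBELL"] = 0
    return bitmap

regs = {"PCI_DOORBELL": 0, "MAILBOX": [0, 0, 0, 0]}
ring_doorbell(regs, MSG_AVAILABLE, payload=b"\x2a\x00\x00\x00")
events = service_doorbell(regs)
```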
The doorbell interrupts are controlled through the registers shown in Table 18.
Table 18. Doorbell Interrupt Registers
Register NameDescription
XSCALE DOORBELLUsed to generate the Intel XScale
XSCALE DOORBELL
SETUP
PCI DOORBELLUsed to generate the PCI Doorbell interrupts.
PCI DOORBELL SETUPUsed to initialize the PCI Doorbell register and for diagnostics.
2.9.5PCI Arbiter
The PCI unit contains a PCI bus arbiter that supports two external masters in addition to the PCI
Unit’s initiator interface. If more than two external masters are used in the system, the aribter can
be disabled and an external (to the IXP2800 Network Processor used. In that case, the IXP2800
Network Processor will provide its PCI request signal to the external aribter, and use that arbiters
grant signal.
Used to initialize the Intel XScale
Intel® IXP2800 Network Processor
Technical Description
®
core Doorbell interrupts.
®
core Doorbell register and for diagnostics.
The arbiter uses a simple round-robin priority algorithm; it asserts the grant signal corresponding to
the next request in the round-robin during the current executing transaction on the PCI bus (this is
also called hidden arbitration). If the arbiter detects that an initiator has failed to assert frame_l
after 16 cycles of both grant assertion and PCI bus idle condition, the arbiter deasserts the grant.
That master does not receive any more grants until it deasserts its request for at least one PCI clock
cycle. Bus parking is implemented in that the last bus grant will stay asserted if no request is
pending.
To prevent bus contention, if the PCI bus is idle, the arbiter never asserts one grant signal in the
same PCI cycle in which it deasserts another, It deasserts one grant, and then asserts the next grant
after one full PCI clock cycle has elapsed to provide for bus driver turnaround.
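The round-robin selection with bus parking described above can be sketched as follows. Grant timing, the 16-cycle frame_l timeout, and the turnaround cycle are omitted; the three agents stand for the two external masters plus the PCI Unit's own initiator.

```python
# Sketch of the PCI arbiter's round-robin grant selection with bus parking.

def next_grant(requests, last_grant, num_agents=3):
    """Pick the next agent to grant, scanning round-robin from the agent
    after last_grant. With no requests pending, the bus stays parked on
    the last grant."""
    for offset in range(1, num_agents + 1):
        agent = (last_grant + offset) % num_agents
        if requests[agent]:
            return agent
    return last_grant  # bus parking: keep the last grant asserted

# Agents 0 and 2 are requesting; the last grant went to agent 0.
g = next_grant([True, False, True], last_grant=0)  # agent 2 is next
```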
2.10 Control and Status Register Access Proxy

The Control and Status Register Access Proxy (CAP) contains a number of chip-wide control and
status registers. Some provide miscellaneous control and status, while others are used for
inter-Microengine or Microengine to Intel XScale® core communication (note that rings in
Scratchpad Memory and SRAM can also be used for inter-process communication). These include:
• INTERTHREAD SIGNAL — Each thread (or context) on a Microengine can send a signal to
any other thread by writing to the InterThread_Signal register. This allows a thread to go to sleep
waiting for completion of a task by a different thread.
• THREAD MESSAGE — Each thread has a message register where it can post a software-
specific message. Other Microengine threads, or the Intel XScale® core, can poll for
availability of messages by reading the THREAD_MESSAGE_SUMMARY register. Both the
THREAD_MESSAGE and corresponding THREAD_MESSAGE_SUMMARY clear upon a
read of the message; this eliminates a race condition when there are multiple message readers.
Only one reader will get the message.
• SELF DESTRUCT — This register provides another type of communication. Microengine
software can atomically set individual bits in the SELF_DESTRUCT registers; the registers
clear upon read. The meaning of each bit is software-specific. Clearing the register upon read
eliminates a race condition when there are multiple readers.
• THREAD INTERRUPT — Each thread can interrupt the Intel XScale® core on two different
interrupts; the usage is software-specific. Having two interrupts allows for flexibility, for
example, one can be assigned to normal service requests and one can be assigned to error
conditions. If more information needs to be associated with the interrupt, mailboxes or Rings
in Scratchpad Memory or SRAM could be used.
• REFLECTOR — CAP provides a function (called “reflector”) where any Microengine thread
can move data between its registers and those of any other thread. In response to a single write
or read instruction (with the address in the specific reflector range) CAP will get data from the
source Microengine and put it into the destination Microengine. Both the sending and
receiving threads can optionally be signaled upon completion of the data movement.
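The clear-on-read behavior of THREAD_MESSAGE and SELF_DESTRUCT that eliminates the multiple-reader race can be sketched as an atomic read-and-clear. The class below is a software model only; the lock stands in for the atomicity the hardware provides.

```python
import threading

# Sketch of CAP's clear-on-read semantics: because the read and the clear
# are one atomic operation, exactly one of several competing readers
# observes the posted value.

class SelfDestructRegister:
    def __init__(self):
        self._lock = threading.Lock()   # stands in for hardware atomicity
        self._bits = 0

    def set_bits(self, mask):
        with self._lock:                # atomic bit set by Microengine software
            self._bits |= mask

    def read(self):
        with self._lock:                # register clears upon read
            bits, self._bits = self._bits, 0
            return bits

reg = SelfDestructRegister()
reg.set_bits(0b1010)
winners = [reg.read() for _ in range(3)]  # only the first read sees the bits
```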
2.11 Intel XScale® Core Peripherals

2.11.1 Interrupt Controller

The Interrupt Controller provides the ability to enable or mask interrupts from a number of
chip-wide sources, for example:
• Timers (normally used by a Real-Time Operating System).
• Interrupts generated by Microengine software to request services from the Intel XScale® core.
• External agents such as PCI devices.
• Error conditions, such as DRAM ECC error, or SPI-4 parity error.
Interrupt status is read as memory-mapped registers; the state of an interrupt signal can be read
even if it is masked from interrupting. Enabling and masking of interrupts is done as writes to
memory-mapped registers.
2.11.2 Timers
The IXP2800 Network Processor contains four programmable 32-bit timers, which can be used for
software support. Each timer can be clocked by the internal clock, by a divided version of the
clock, or by a signal on an external GPIO pin. Each timer can be programmed to generate a
periodic interrupt after a programmed number of clocks. The range is from several ns to several
minutes depending on the clock frequency.
In addition, timer 4 can be used as a watchdog timer. In this use, software must periodically reload
the timer value; if it fails to do so and the timer counts to 0, it will reset the chip. This can be used
to detect if software “hangs” or for some other reason fails to reload the timer.
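The period arithmetic implied above can be made concrete. The helper below is an illustrative sketch only (the actual control-register programming is not shown), assuming the 1.4 GHz internal clock used elsewhere in this chapter.

```python
# Illustrative timer-period arithmetic for a 32-bit down-counter.

def timer_count(period_seconds, clock_hz):
    """Number of timer clocks for the requested period; must fit in 32 bits."""
    count = round(period_seconds * clock_hz)
    if not 1 <= count <= 0xFFFFFFFF:
        raise ValueError("period out of range for a 32-bit timer")
    return count

# At a 1.4 GHz internal clock, a 32-bit timer spans sub-nanosecond
# resolution up to about 3 seconds; a divided clock extends the range
# to minutes, as the text describes.
ticks_1ms = timer_count(1e-3, 1_400_000_000)   # 1,400,000 clocks
```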
2.11.3 General Purpose I/O

The IXP2800 Network Processor contains eight General Purpose I/O (GPIO) pins. These can be
programmed as either input or output and can be used for slow-speed I/O such as LEDs or input
switches. They can also be used as interrupts to the Intel XScale® core, or to clock the
programmable timers.
2.11.4 Universal Asynchronous Receiver/Transmitter
The IXP2800 Network Processor contains a standard RS-232 compatible Universal Asynchronous
Receiver/Transmitter (UART), which can be used for communication with a debugger or
maintenance console. Modem controls are not supported; if they are needed, GPIO pins can be
used for that purpose.
The UART performs serial-to-parallel conversion on data characters received from a peripheral
device and parallel-to-serial conversion on data characters received from the processor. The
processor can read the complete status of the UART at any time during operation. Available status
information includes the type and condition of the transfer operations being performed by the
UART and any error conditions (parity, overrun, framing or break interrupt).
The serial ports can operate in either FIFO or non-FIFO mode. In FIFO mode, a 64-byte transmit
FIFO holds data from the processor to be transmitted on the serial link and a 64-byte receive FIFO
buffers data from the serial link until read by the processor.
The UART includes a programmable baud rate generator that is capable of dividing the internal
clock input by divisors of 1 to 2^16 - 1 and produces a 16X clock to drive the internal transmitter
logic. It also drives the receive logic. The UART can be operated in polled or in interrupt-driven
mode as selected by software.

2.11.5 Slowport

The Slowport is an external interface to the IXP2800 Network Processor, used for Flash ROM
access and 8-, 16-, or 32-bit asynchronous device access. It allows the Intel XScale® core to do
read/write data transfers to these slave devices.
The address bus and data bus are multiplexed to reduce the pin count. In addition, 24 bits of
address are shifted out on three clock cycles. Therefore, an external set of buffers is needed to latch
the address. Two chip selects are provided.
The access is asynchronous. Insertion of delay cycles for both data setup and hold time is
programmable via internal Control registers. The transfer can also wait for a handshake
acknowledge signal from the external device.
2.12 I/O Latency

Table 19 shows the latencies for transferring data between the Microengine and the other sub-
system components. The latency is measured in 1.4 GHz cycles.

Table 19. I/O Latency

Sub-system   Transfer Size                 Average Read Latency                  Average Write Latency
DRAM (RDR)   8 bytes – 16 bytes (note 2)   ~295 cycles (note 3)                  ~53 cycles
SRAM (QDR)   4 bytes                       100 (light load) – 160 (heavy load)   ~53 cycles
Scratch      4 bytes                       ~100 cycles (range 53 – 152)          ~40 cycles
MSF          8 bytes                       range 53 – 120 (RBUF)                 ~48 cycles (TBUF)

Note 1: RDR, QDR, MSF, and Scratch values are extracted from a simulation model.
Note 2: Minimum DRAM burst size on pins is 16 bytes. Transfers of less than 16 bytes incur the same latency as a
16-byte transfer.
Note 3: At 1016 MHz, read latency should be ~ 240 cycles.

2.13 Performance Monitor

The Intel XScale® core hardware provides two 32-bit performance counters that allow two unique
events to be monitored simultaneously. In addition, the Intel XScale® core implements a 32-bit
clock counter that can be used in conjunction with the performance counters; its sole purpose is to
count the number of core clock cycles, which is useful in measuring total execution time.
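For a feel for the absolute numbers in Table 19, cycles convert to time by dividing by the 1.4 GHz Microengine clock; the helper below is a simple worked conversion, not part of the specification.

```python
# Converting Table 19 latencies from 1.4 GHz cycles to nanoseconds.

def cycles_to_ns(cycles, clock_hz=1_400_000_000):
    return cycles / clock_hz * 1e9

dram_read_ns = cycles_to_ns(295)   # ~211 ns average DRAM read latency
sram_write_ns = cycles_to_ns(53)   # ~38 ns average SRAM write latency
```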
3 Intel XScale® Core

This section contains information describing the Intel XScale® core, Intel XScale® core gasket, and
Intel XScale® core Peripherals (XPI).
For additional information about the Intel XScale® architecture refer to the Intel XScale® Core
Developers Manual available on Intel’s Developers web site (http://www.developer.intel.com).

3.1 Introduction

The Intel XScale® core is an ARM* V5TE compliant microprocessor. It has been designed for high
performance and low power, leading the industry in mW/MIPS. The Intel XScale® core
incorporates an extensive list of architecture features that allows it to achieve high performance.
Many of the architectural features added to the Intel XScale® core help hide memory latency, which
often is a serious impediment to high-performance processors.
This includes:
• The ability to continue instruction execution even while the data cache is retrieving data from
external memory.
• A write buffer.
• Write-back caching.
• Various data cache allocation policies that can be configured differently for each application.
• Cache locking.
All these features improve the efficiency of the memory bus external to the core.

ARM* Version 5 (V5) Architecture added floating-point instructions to ARM* Version 4. The Intel
XScale® core implements the integer instruction set architecture of ARM* V5, but does not
provide hardware support of the floating-point instructions.
The Intel XScale® core provides the Thumb instruction set (ARM* V5T) and the ARM* V5E DSP
extensions.
3.2 Features
Figure 16 shows the major functional blocks of the Intel XScale® core.
Figure 16. Intel XScale® Core Architecture Features
[Block diagram: Instruction Cache — 32 Kbytes, 32 ways, lockable by line; Branch Target Buffer — 128 entries; IMMU and DMMU — 32-entry TLBs, fully associative, lockable by entry; Data Cache — max 32 Kbytes, 32 ways, write-back or write-through, hit under miss; Mini-Data Cache — 2 Kbytes, 2 ways; Data RAM — max 28 Kbytes, re-map of data cache; MAC — single-cycle throughput (16*32), 16-bit SIMD, 40-bit accumulator; Fill Buffer — 4 - 8 entries; Write Buffer — 8 entries, full coalescing; Performance Monitoring; Power Management — idle, drowsy, sleep; Debug — hardware breakpoint, branch history table; JTAG. (A9642-01)]

3.2.1 Multiply/Accumulate (MAC)

The MAC unit supports early termination of multiplies/accumulates in two cycles and can sustain a
throughput of a MAC operation every cycle. Architectural enhancements to the MAC support
audio coding algorithms, including a 40-bit accumulator and support for 16-bit packed data.
3.2.2 Memory Management
The Intel XScale® core implements the Memory Management Unit (MMU) Architecture specified
in the ARM* Architecture Reference Manual (see the ARM* website at http://www.arm.com).
The MMU provides access protection and virtual to physical address translation. The MMU
Architecture also specifies the caching policies for the instruction cache and data memory.
These policies are specified as page attributes and include:
• identifying code as cacheable or non-cacheable
• selecting between the mini-data cache or data cache
• write-back or write-through data caching
• enabling data write allocation policy
• and enabling the write buffer to coalesce stores to external memory
3.2.3 Instruction Cache
The Intel XScale® core implements a 32-Kbyte, 32-way set associative instruction cache with a
line size of 32 bytes. All requests that “miss” the instruction cache generate a 32-byte read request
to external memory. A mechanism to lock critical code within the cache is also provided.
3.2.4 Branch Target Buffer (BTB)
The Intel XScale® core provides a Branch Target Buffer to predict the outcome of branch type
instructions. It provides storage for the target address of branch type instructions and predicts the
next address to present to the instruction cache when the current instruction address is that of a
branch.
The BTB holds 128 entries.
3.2.5 Data Cache

The Intel XScale® core implements a 32-Kbyte, 32-way set associative data cache and a 2-Kbyte,
2-way set associative mini-data cache. Each cache has a line size of 32 bytes, and supports
write-through or write-back caching.
The data/mini-data cache is controlled by page attributes defined in the MMU Architecture and by
coprocessor 15. The Intel XScale® core allows applications to reconfigure a portion of the data
cache as data RAM. Software may place special tables or frequently used variables in this RAM.
3.2.6 Performance Monitoring
Two performance monitoring counters have been added to the Intel XScale® core that can be
configured to monitor various events. These events allow a software developer to measure cache
efficiency, detect system bottlenecks, and reduce the overall latency of programs.
3.2.7 Power Management
The Intel XScale® core incorporates a power and clock management unit that can assist in
controlling clocking and managing power.
3.2.8 Debugging
The Intel XScale® core supports software debugging through two instruction address breakpoint
registers, one data-address breakpoint register, one data-address/mask breakpoint register, and a
trace buffer.
3.2.9 JTAG

Testability is supported on the Intel XScale® core through the Test Access Port (TAP) Controller
implementation, which is based on the IEEE 1149.1 (JTAG) Standard Test Access Port and
Boundary-Scan Architecture. The purpose of the TAP controller is to support test logic internal and
external to the Intel XScale® core such as built-in self-test, boundary-scan, and scan.
3.3 Memory Management

The Intel XScale® core implements the Memory Management Unit (MMU) Architecture specified
in the ARM Architecture Reference Manual. To accelerate virtual to physical address translation,
the Intel XScale® core uses both an instruction Translation Look-aside Buffer (TLB) and a data
TLB to cache the latest translations. Each TLB holds 32 entries and is fully-associative. Not only
do the TLBs contain the translated addresses, but also the access rights for memory references.

If an instruction or data TLB miss occurs, a hardware translation-table-walking mechanism is
invoked to translate the virtual address to a physical address. Once translated, the physical address
is placed in the TLB along with the access rights and attributes of the page or section. These
translations can also be locked down in either TLB to guarantee the performance of critical
routines.

The Intel XScale® core allows system software to associate various attributes with regions of
memory:
• cacheable
• bufferable
• line allocate policy
• write policy
• I/O
• mini Data Cache
• Coalescing
• P bit

Note: The virtual address with which the TLBs are accessed may be remapped by the PID register.
3.3.1 Architecture Model

3.3.1.1 Version 4 versus Version 5
ARM* MMU Version 5 Architecture introduces the support of tiny pages, which are 1 Kbyte in
size. The reserved field in the first-level descriptor (encoding 0b11) is used as the fine page table
base address.
3.3.1.2 Memory Attributes

The attributes associated with a particular region of memory are configured in the memory
management page table and control the behavior of accesses to the instruction cache, data cache,
mini-data cache and the write buffer. These attributes are ignored when the MMU is disabled.
To allow compatibility with older system software, the new Intel XScale® core attributes take
advantage of encoding space in the descriptors that was formerly reserved.
3.3.1.2.1 Page (P) Attribute Bit

The P bit assigns a page attribute to a memory region. Refer to the Intel® IXP2400 and IXP2800
Network Processor Programmer’s Reference Manual for details about the P bit.
3.3.1.2.2 Instruction Cache
When examining these bits in a descriptor, the Instruction Cache only utilizes the C bit. If the C bit
is clear, the Instruction Cache considers a code fetch from that memory to be non-cacheable, and
will not fill a cache entry. If the C bit is set, then fetches from the associated memory region will be
cached.
3.3.1.2.3 Data Cache and Write Buffer

All of these descriptor bits affect the behavior of the Data Cache and the Write Buffer.
If the X bit for a descriptor is 0 (see Table 20), the C and B bits operate as mandated by the ARM*
architecture. If the X bit for a descriptor is 1, the meaning of the C and B bits is extended, as
detailed in Table 21.
Table 20. Data Cache and Buffer Behavior when X = 0

C B   Cacheable?   Bufferable?   Write Policy    Line Allocation Policy   Notes
0 0   N            N             —               —                        Stall until complete (note 1)
0 1   N            Y             —               —
1 0   Y            Y             Write Through   Read Allocate
1 1   Y            Y             Write Back      Read Allocate

Note 1: Normally, the processor will continue executing after a data access if no dependency on that access is
encountered. With this setting, the processor will stall execution until the data access completes. This guarantees
to software that the data access has taken effect by the time execution of the data access instruction completes.
External data aborts from such accesses will be imprecise.
Table 21. Data Cache and Buffer Behavior when X = 1

C B   Cacheable?          Bufferable?   Write Policy   Line Allocation Policy   Notes
0 0   —                   —             —              —                        Unpredictable; do not use
0 1   N                   Y             —              —                        Writes will not coalesce into buffers (note 1)
1 0   (Mini-Data Cache)   —             —              —                        Cache policy is determined by MD field of Auxiliary Control register
1 1   Y                   Y             Write Back     Read/Write Allocate

Note 1: Normally, bufferable writes can coalesce with previously buffered data in the same address range.
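The X, C, and B attribute encodings of Tables 20 and 21 can be condensed into a lookup. The policy strings below are shorthand labels for the table rows, not architectural terms.

```python
# Lookup of data-cache/write-buffer behavior from the X, C, B attribute
# bits, condensing Tables 20 and 21. Strings are shorthand labels.

def dcache_policy(x, c, b):
    if x == 0:
        return {
            (0, 0): "non-cacheable, non-bufferable, stall until complete",
            (0, 1): "non-cacheable, bufferable",
            (1, 0): "write-through, read allocate",
            (1, 1): "write-back, read allocate",
        }[(c, b)]
    return {
        (0, 0): "unpredictable; do not use",
        (0, 1): "non-cacheable, bufferable, writes do not coalesce",
        (1, 0): "mini-data cache; policy from Auxiliary Control MD field",
        (1, 1): "write-back, read/write allocate",
    }[(c, b)]
```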
3.3.1.2.4 Details on Data Cache and Write Buffer Behavior
If the MMU is disabled all data accesses will be non-cacheable and non-bufferable. This is the
same behavior as when the MMU is enabled, and a data access uses a descriptor with X, C, and B
all set to 0.
The X, C, and B bits determine when the processor should place new data into the Data Cache. The
cache places data into the cache in lines (also called blocks). Thus, the basis for making a decision
about placing new data into the cache is called a “Line Allocation Policy.”
If the Line Allocation Policy is read-allocate, all load operations that miss the cache request a
32-byte cache line from external memory and allocate it into either the data cache or mini-data
cache (this is assuming the cache is enabled). Store operations that miss the cache will not cause a
line to be allocated.
If read/write-allocate is in effect, load or store operations that miss the cache will request a 32-byte
cache line from external memory if the cache is enabled.
The other policy determined by the X, C, and B bits is the Write Policy. A write-through policy
instructs the Data Cache to keep external memory coherent by performing stores to both external
memory and the cache. A write-back policy only updates external memory when a line in the cache
is cleaned or needs to be replaced with a new line. Generally, write-back provides higher
performance because it generates less data traffic to external memory.
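The two allocation policies can be contrasted with a small model: under read-allocate only load misses fill a 32-byte line, while under read/write-allocate store misses fill one too. This is a behavioral sketch, not a description of the cache's internal structure.

```python
# Minimal model of the two line-allocation policies for 32-byte lines.

LINE = 32

def access(cache_lines, addr, is_store, policy):
    """Return True if the access allocates a new line into the cache."""
    line_addr = addr & ~(LINE - 1)
    if line_addr in cache_lines:
        return False                     # hit: no allocation needed
    if is_store and policy == "read-allocate":
        return False                     # store miss does not allocate
    cache_lines.add(line_addr)           # miss requests a 32-byte line
    return True

lines = set()
access(lines, 0x1004, is_store=True, policy="read-allocate")    # no fill
access(lines, 0x1004, is_store=False, policy="read-allocate")   # fills line
```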
3.3.1.2.5 Memory Operation Ordering

A fence memory operation (memop) is one that guarantees all memops issued prior to the fence
will execute before any memop issued after the fence. Thus software may issue a fence to impose a
partial ordering on memory accesses.
Table 22 shows the circumstances in which memops act as fences.
Any swap (SWP or SWPB) to a page that would create a fence on a load or store is a fence.

Table 22. Memory Operations that Impose a Fence

operation       X   C   B
load            —   0   —
store           1   0   1
load or store   0   0   0
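Table 22 reduces to a small predicate, where a "—" entry means the bit does not matter; the function form is illustrative.

```python
# Predicate form of Table 22: which memory operations impose a fence.

def is_fence(op, x, c, b):
    if op == "load" and c == 0:
        return True                     # load: X and B are don't-care, C = 0
    if op == "store" and (x, c, b) == (1, 0, 1):
        return True
    if op in ("load", "store") and (x, c, b) == (0, 0, 0):
        return True
    # A SWP/SWPB to a page that would fence a load or store is a fence.
    if op == "swap":
        return is_fence("load", x, c, b) or is_fence("store", x, c, b)
    return False
```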
3.3.2 Exceptions
The MMU may generate prefetch aborts for instruction accesses and data aborts for data memory
accesses.
Data address alignment checking is enabled by setting bit 1 of the Control register (CP15,
register 1). Alignment faults are still reported even if the MMU is disabled. All other MMU
exceptions are disabled when the MMU is disabled.
3.3.3 Interaction of the MMU, Instruction Cache, and Data Cache
The MMU, instruction cache, and data/mini-data cache may be enabled/disabled independently.
The instruction cache can be enabled with the MMU enabled or disabled. However, the data cache
can only be enabled when the MMU is enabled. Therefore only three of the four combinations of
the MMU and data/mini-data cache enables are valid (see Table 23). The invalid combination will
cause undefined results.
Table 23. Valid MMU and Data/Mini-Data Cache Combinations

MMU   Data/Mini-data Cache
Off   Off
On    Off
On    On
3.3.4 Control

3.3.4.1 Invalidate (Flush) Operation
The entire instruction and data TLB can be invalidated at the same time with one command or they
can be invalidated separately. An individual entry in the data or instruction TLB can also be
invalidated.
Globally invalidating a TLB will not affect locked TLB entries. However, the invalidate-entry
operations can invalidate individual locked entries. In this case, the locked entry remains in the
TLB, but will never “hit” on an address translation. Effectively, a hole exists in the TLB. This
situation may be rectified by unlocking the TLB.
3.3.4.2 Enabling/Disabling
The MMU is enabled by setting bit 0 in coprocessor 15, register 1 (Control register). When the
MMU is disabled, accesses to the instruction cache default to cacheable and all accesses to data
memory are made non-cacheable. A recommended code sequence for enabling the MMU is shown
in Example 14.
Example 14. Enabling the MMU
; This routine provides software with a predictable way of enabling the MMU.
; After the CPWAIT, the MMU is guaranteed to be enabled. Be aware
; that the MMU will be enabled sometime after MCR and before the instruction
; that executes after the CPWAIT.
; Programming Note: This code sequence requires a one-to-one virtual to
; physical address mapping on this code since
; the MMU may be enabled part way through. This would allow the instructions
; after MCR to execute properly regardless of the state of the MMU.
MRC P15,0,R0,C1,C0,0 ; Read CP15, register 1
ORR R0, R0, #0x1     ; Turn on the MMU
MCR P15,0,R0,C1,C0,0 ; Write to CP15, register 1
CPWAIT               ; Wait for the MMU enable to take effect
; The MMU is guaranteed to be enabled at this point; the next instruction or
; data address will be translated.
3.3.4.3 Locking Entries
Individual entries can be locked into the instruction and data TLBs. If a lock operation finds the
virtual address translation already resident in the TLB, the results are unpredictable. An invalidate
by entry command before the lock command will ensure proper operation. Software can also
accomplish this by invalidating all entries, as shown in Example 15.
Locking entries into either the instruction TLB or data TLB reduces the available number of entries
(by the number that was locked down) for hardware to cache other virtual to physical address
translations.
A procedure for locking entries into the instruction TLB is shown in Example 15.
If an MMU abort is generated during an instruction or data TLB lock operation, the Fault Status
register is updated to indicate a Lock Abort, and the exception is reported as a data abort.
Example 15. Locking Entries into the Instruction TLB
; R1, R2 and R3 contain the virtual addresses to translate and lock into
; the instruction TLB.
; The value in R0 is ignored in the following instruction.
; Hardware guarantees that accesses to CP15 occur in program order
MCR P15,0,R0,C8,C5,0; Invalidate the entire instruction TLB
MCR P15,0,R1,C10,C4,0 ; Translate virtual address (R1) and lock into
; instruction TLB
MCR P15,0,R2,C10,C4,0 ; Translate virtual address (R2) and lock into
; instruction TLB
MCR P15,0,R3,C10,C4,0 ; Translate virtual address (R3) and lock into
; instruction TLB
CPWAIT
; The MMU is guaranteed to be updated at this point; the next instruction will
; see the locked instruction TLB entries.
Note: If exceptions are allowed to occur in the middle of this routine, the TLB may end up caching a
translation that is about to be locked. For example, if R1 is the virtual address of an interrupt
service routine and that interrupt occurs immediately after the TLB has been invalidated, the lock
operation will be ignored when the interrupt service routine returns back to this code sequence.
Software should disable interrupts (FIQ or IRQ) in this case.
As a general rule, software should avoid locking in all other exception types.
The proper procedure for locking entries into the data TLB is shown in Example 16.
Example 16. Locking Entries into the Data TLB
; R1, and R2 contain the virtual addresses to translate and lock into the data TLB
MCR P15,0,R1,C8,C6,1; Invalidate the data TLB entry specified by the
; virtual address in R1
MCR P15,0,R1,C10,C8,0; Translate virtual address (R1) and lock into
; data TLB
; Repeat sequence for virtual address in R2
MCR P15,0,R2,C8,C6,1; Invalidate the data TLB entry specified by the
; virtual address in R2
MCR P15,0,R2,C10,C8,0; Translate virtual address (R2) and lock into
; data TLB
CPWAIT; wait for locks to complete
; The MMU is guaranteed to be updated at this point; the next instruction will
; see the locked data TLB entries.
Note: Care must be exercised when allowing exceptions to occur during this routine if a handler’s
data lies in a page that is being locked into the TLB.
3.3.4.4 Round-Robin Replacement Algorithm
The line replacement algorithm for the TLBs is round-robin; there is a round-robin pointer that
keeps track of the next entry to replace. The next entry to replace is the one sequentially after the
last entry that was written. For example, if the last virtual to physical address translation was
written into entry 5, the next entry to replace is entry 6.
At reset, the round-robin pointer is set to entry 31. Once a translation is written into entry 31, the
round-robin pointer gets set to the next available entry, beginning with entry 0 if no entries have
been locked down. Subsequent translations move the round-robin pointer to the next sequential
entry until entry 31 is reached, where it will wrap back to entry 0 upon the next translation.
A lock pointer is used for locking entries into the TLB and is set to entry 0 at reset. A TLB lock
operation places the specified translation at the entry designated by the lock pointer, moves the
lock pointer to the next sequential entry, and resets the round-robin pointer to entry 31. Locking
entries into either TLB effectively reduces the available entries for updating. For example, if the
first three entries were locked down, the round-robin pointer would be entry 3 after it rolled over
from entry 31.
Only entries 0 through 30 can be locked in either TLB; entry 31can never be locked. If the lock
pointer is at entry 31, a lock operation will update the TLB entry with the translation and ignore the
lock. In this case, the round-robin pointer will stay at entry 31.
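As an illustration (not part of the manual's examples), the pointer behavior described above can be modeled in a few lines of C. The structure and function names are invented for this sketch; only the round-robin and lock pointer arithmetic is modeled, not the translation contents.

```c
#include <assert.h>

#define TLB_ENTRIES 32

typedef struct {
    int rr;     /* round-robin (replacement) pointer */
    int lock;   /* lock pointer: next entry to lock  */
} tlb_ptrs;

void tlb_reset(tlb_ptrs *t) { t->rr = TLB_ENTRIES - 1; t->lock = 0; }

/* Ordinary translation write: fills the entry at the round-robin
 * pointer, then advances it, wrapping past the locked region.
 * Returns the entry index that was written. */
int tlb_write(tlb_ptrs *t)
{
    int used = t->rr;
    if (t->rr == TLB_ENTRIES - 1)
        t->rr = t->lock;        /* wrap to first unlocked entry */
    else
        t->rr++;
    return used;
}

/* Lock operation: fills the entry at the lock pointer, advances it,
 * and resets the round-robin pointer to entry 31.  Entry 31 can
 * never be locked: the entry is still updated, but the lock is
 * ignored and the round-robin pointer stays at 31. */
int tlb_lock(tlb_ptrs *t)
{
    int used = t->lock;
    if (t->lock < TLB_ENTRIES - 1) {
        t->lock++;
        t->rr = TLB_ENTRIES - 1;
    }
    return used;
}
```

Note how locking the first three entries makes the round-robin pointer roll over from entry 31 to entry 3, matching the example in the text.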
Hardware Reference Manual
Intel® IXP2800 Network Processor
Intel XScale® Core
Figure 17 illustrates locked entries in the TLB.
Figure 17. Example of Locked Entries in TLB
[Figure: a 32-entry TLB with entries 0–7 locked; entries 8–31 remain available for round-robin replacement.]
Note: 8 entries locked, 24 entries available for round-robin replacement
3.4 Instruction Cache
The Intel XScale® core instruction cache enhances performance by reducing the number of
instruction fetches from external memory. The cache provides fast execution of cached code. Code
can also be locked down when guaranteed or fast access time is required.
Figure 18 shows the cache organization and how the instruction address is used to access the cache.
The instruction cache is a 32-Kbyte, 32-way set associative cache; this means there are 32 sets with
each set containing 32 ways. Each way of a set contains eight 32-bit words and one valid bit, which
is referred to as a line. The replacement policy is a round-robin algorithm, and the cache also
supports locking code at line granularity.
Figure 18. Instruction Cache Organization
[Figure: 32-set, 32-way instruction cache. The virtual instruction address splits into Tag (bits [31:10]), Set Index (bits [9:5]), and Word (bits [4:2]). The Set Index selects one of 32 sets; the Tag is compared (CAM) against all 32 ways of that set; each way holds an 8-word cache line, from which the Word field selects the 4-byte instruction word.]
Note: CAM = Content Addressable Memory
The instruction cache is virtually addressed and virtually tagged. The virtual address presented to
the instruction cache may be remapped by the PID register.
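For illustration, the address decomposition implied by the cache geometry (32 sets of 32-byte, 8-word lines) can be written as a few C helpers. The function names are hypothetical and not part of any Intel API:

```c
#include <assert.h>
#include <stdint.h>

/* Field extraction for the 32-Kbyte, 32-set instruction cache:
 * bits [4:2] select one of the 8 words in a line, bits [9:5]
 * select one of the 32 sets, and bits [31:10] form the tag that
 * is compared against all 32 ways of the selected set. */
static uint32_t icache_word(uint32_t va) { return (va >> 2) & 0x7;  }
static uint32_t icache_set(uint32_t va)  { return (va >> 5) & 0x1f; }
static uint32_t icache_tag(uint32_t va)  { return  va >> 10;        }
```

Two addresses that differ only above bit 9 map to the same set and are distinguished solely by the tag comparison.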
3.4.1 Instruction Cache Operation
3.4.1.1 Operation when Instruction Cache is Enabled
When the cache is enabled, it compares every instruction request address to the addresses of
instructions that it is holding in cache. If the requested instruction is found, the access “hits” the
cache, which returns the requested instruction. If the instruction is not found, the access “misses”
the cache, which requests a fetch from external memory of the 8-word line (32 bytes) that contains
the instruction (using the fetch policy). As the fetch returns instructions to the cache, they are put in
one of two fetch buffers and the requested instruction is delivered to the instruction decoder. A
fetched line is written into the cache if it is cacheable (code is cacheable if the MMU is disabled or
if the MMU is enabled and the cacheable (C) bit is set to 1 in its corresponding page).
Note: An instruction fetch may “miss” the cache but “hit” one of the fetch buffers. If this happens, the
requested instruction is delivered to the instruction decoder in the same manner as a cache “hit.”
3.4.1.2 Operation when Instruction Cache is Disabled
Disabling the cache prevents any lines from being written into the instruction cache. Although the
cache is disabled, it is still accessed and may generate a “hit” if the data is already in the cache.
Disabling the instruction cache does not disable instruction buffering that may occur within the
instruction fetch buffers. The two 8-word instruction fetch buffers remain enabled in cache-disabled
mode. As instruction fetches continue to “hit” within either buffer (even in the presence of
forward and backward branches), no external fetches for instructions are generated. A miss causes
one or the other buffer to be filled from external memory using the fill policy.
3.4.1.3 Fetch Policy
An instruction-cache “miss” occurs when the requested instruction is not found in the instruction
fetch buffers or instruction cache; a fetch request is then made to external memory. The instruction
cache can handle up to two “misses.” Each external fetch request uses a fetch buffer that holds
32-bytes and eight valid bits, one for each word. A miss causes the following:
1. A fetch buffer is allocated.
2. The instruction cache sends a fetch request to the external bus. This request is for a 32-byte line.
3. Instruction words are returned from the external bus, at a maximum rate of one word per
core cycle. As each word returns, the corresponding valid bit is set for the word in the fetch
buffer.
4. As soon as the fetch buffer receives the requested instruction, it forwards the instruction to the
instruction decoder for execution.
5. When all words have returned, the fetched line will be written into the instruction cache if it is
cacheable and if the instruction cache is enabled. The line chosen for update in the cache is
controlled by the round-robin replacement algorithm. This update may evict a valid line at that
location.
6. Once the cache is updated, the eight valid bits of the fetch buffer are invalidated.
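The fetch-buffer behavior in steps 1 through 6 can be sketched in C. This is an illustrative model with invented names; it tracks only the per-word valid bits, not the bus protocol:

```c
#include <assert.h>
#include <stdint.h>

#define LINE_WORDS 8

typedef struct {
    uint32_t word[LINE_WORDS];
    uint8_t  valid;             /* one valid bit per word */
} fetch_buffer;

/* Step 1: a fetch buffer is allocated with all valid bits clear. */
void fb_alloc(fetch_buffer *fb) { fb->valid = 0; }

/* Step 3: a word returns from the external bus; store it and set
 * its valid bit. */
void fb_return(fetch_buffer *fb, int idx, uint32_t data)
{
    fb->word[idx] = data;
    fb->valid |= (uint8_t)(1u << idx);
}

/* Step 4: the requested word can be forwarded to the decoder as
 * soon as its own valid bit is set -- no need to wait for the
 * rest of the line. */
int fb_ready(const fetch_buffer *fb, int idx)
{
    return (fb->valid >> idx) & 1;
}

/* Steps 5-6: once all eight words are valid, the line can be
 * written into the cache (if cacheable) and the valid bits are
 * then invalidated. */
int fb_line_complete(const fetch_buffer *fb)
{
    return fb->valid == 0xFF;
}
```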
3.4.1.4 Round-Robin Replacement Algorithm
The line replacement algorithm for the instruction cache is round-robin. Each set in the instruction
cache has a round-robin pointer that keeps track of the next line (in that set) to replace. The next
line to replace in a set is the one after the last line that was written. For example, if the line for the
last external instruction fetch was written into way 5-set 2, the next line to replace for that set
would be way 6. None of the other round-robin pointers for the other sets are affected in this case.
After reset, way 31 is pointed to by the round-robin pointer for all the sets. Once a line is written
into way 31, the round-robin pointer points to the first available way of a set, beginning with way 0
if no lines have been locked into that particular set. Locking lines into the instruction cache
effectively reduces the available lines for cache updating. For example, if the first three lines of a
set were locked down, the round-robin pointer would point to the line at way 3 after it rolled over
from way 31.
3.4.1.5 Parity Protection
The instruction cache is protected by parity to ensure data integrity. Each instruction cache word
has 1 parity bit. (The instruction cache tag is not parity protected.) When a parity error is detected
on an instruction cache access, a prefetch abort exception occurs if the Intel XScale® core attempts
to execute the instruction. Before servicing the exception, hardware places a notification of the error
in the Fault Status register (Coprocessor 15, register 5).
A software exception handler can recover from an instruction cache parity error. This can be
accomplished by invalidating the instruction cache and the branch target buffer and then returning
to the instruction that caused the prefetch abort exception. A simplified code example is shown in
Example 17. A more complex handler might choose to invalidate the specific line that caused the
exception and then invalidate the BTB.
Example 17. Recovering from an Instruction Cache Parity Error
; Prefetch abort handler
MCR P15,0,R0,C7,C5,0; Invalidate the instruction cache and branch target
; buffer
CPWAIT; wait for effect
;
SUBS PC,R14,#4; Returns to the instruction that generated the
; parity error
; The Instruction Cache is guaranteed to be invalidated at this point
If a parity error occurs on an instruction that is locked in the cache, the software exception handler
needs to unlock the instruction cache, invalidate the cache, and then re-lock the code before
returning to the faulting instruction.
3.4.1.6 Instruction Cache Coherency
The instruction cache does not detect modification to program memory by loads, stores or actions
of other bus masters. Several situations may require program memory modification, such as
uploading code from disk.
The application program is responsible for synchronizing code modification and invalidating the
cache. In general, software must ensure that modified code space is not accessed until modification
and invalidating are completed.
To achieve cache coherence, instruction cache contents can be invalidated after code modification
in external memory is complete.
If the instruction cache is not enabled, or code is being written to a non-cacheable region, software
must still invalidate the instruction cache before using the newly-written code. This precaution
ensures that state associated with the new code is not buffered elsewhere in the processor, such as
the fetch buffers or the BTB.
Naturally, when writing code as data, care must be taken to force it completely out of the processor
into external memory before attempting to execute it. If writing into a non-cacheable region,
flushing the write buffers is sufficient precaution. If writing to a cacheable region, then the data
cache should be submitted to a Clean/Invalidate operation to ensure coherency.
3.4.2 Instruction Cache Control
3.4.2.1 Instruction Cache State at Reset
After reset, the instruction cache is always disabled, unlocked, and invalidated (flushed).
3.4.2.2 Enabling/Disabling
The instruction cache is enabled by setting bit 12 in coprocessor 15, register 1 (Control register).
This process is illustrated in Example 18.
Example 18. Enabling the Instruction Cache
; Enable the ICache
MRC P15, 0, R0, C1, C0, 0; Get the control register
ORR R0, R0, #0x1000; set bit 12 -- the I bit
MCR P15, 0, R0, C1, C0, 0; Set the control register
CPWAIT
3.4.2.3 Invalidating the Instruction Cache
The entire instruction cache, along with the fetch buffers, is invalidated by writing to
coprocessor 15, register 7. This command does not unlock any lines that were locked in the
instruction cache nor does it invalidate those locked lines. To invalidate the entire cache including
locked lines, the unlock instruction cache command needs to be executed before the invalidate
command.
There is an inherent delay between execution of the instruction cache invalidate command and the
point at which the next instruction sees the result of the invalidate. The routine in Example 19 can be
used to guarantee proper synchronization.
Example 19. Invalidating the Instruction Cache
MCR P15,0,R1,C7,C5,0; Invalidate the instruction cache and branch
; target buffer
CPWAIT
; The instruction cache is guaranteed to be invalidated at this point; the next
; instruction sees the result of the invalidate command.
The Intel XScale® core also supports invalidating an individual line from the instruction cache.
3.4.2.4 Locking Instructions in the Instruction Cache
Software has the ability to lock performance critical routines into the instruction cache. Up to
28 lines in each set can be locked; hardware will ignore the lock command if software is trying to
lock all the lines in a particular set (i.e., ways 28 – 31 can never be locked). When this happens, the
line is still allocated into the cache, but the lock will be ignored. The round-robin pointer will stay
at way 31 for that set.
Lines can be locked into the instruction cache by initiating a write to coprocessor 15. Register Rd
contains the virtual address of the line to be locked into the cache.
There are several requirements for locking down code:
1. The routine used to lock lines down in the cache must be placed in non-cacheable memory,
which means the MMU is enabled. As a corollary: no fetches of cacheable code should occur
while locking instructions into the cache.
2. The code being locked into the cache must be cacheable.
3. The instruction cache must be enabled and invalidated prior to locking down lines.
Failure to follow these requirements will produce unpredictable results when accessing the
instruction cache.
System programmers should ensure that the code to lock instructions into the cache does not reside
closer than 128 bytes to a non-cacheable/cacheable page boundary. If the processor fetches ahead
into a cacheable page, then the first requirement noted above could be violated.
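A build-time or debug check for this placement rule might look like the following C sketch, assuming small (4-Kbyte) pages. The helper name and the use of an exclusive end address are illustrative, not part of the manual:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u   /* assumption: 4-Kbyte page boundary */
#define GUARD     128u    /* the 128-byte margin from the text  */

/* Returns nonzero if a locking routine ending at 'end' (exclusive)
 * keeps at least 128 bytes of distance to the next page boundary,
 * so that prefetch-ahead cannot cross into a cacheable page. */
int lock_routine_placement_ok(uint32_t end)
{
    uint32_t off = end & (PAGE_SIZE - 1);
    uint32_t to_boundary = off ? PAGE_SIZE - off : 0;
    return to_boundary >= GUARD;
}
```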
Lines are locked into a set starting at way 0 and may progress up to way 27; which set a line gets
locked into depends on the set index of the virtual address. Figure 19 is an example of where lines
of code may be locked into the cache along with how the round-robin pointer is affected.
Figure 19. Locked Line Effect on Round Robin Replacement
[Figure: example sets of the 32-way instruction cache with different numbers of locked ways.]
Notes:
set 0: 8 ways locked, 24 ways available for round-robin replacement
set 1: 23 ways locked, 9 ways available for round-robin replacement
set 2: 28 ways locked, only ways 28–31 available for replacement
set 31: all 32 ways available for round-robin replacement
Software can lock down several different routines located at different memory locations. This may
cause some sets to have more locked lines than others as shown in Figure 19.
Example 20 shows how a routine, called “lockMe” in this example, might be locked into the
instruction cache. Note that it is possible to receive an exception while locking code.
Example 20. Locking Code into the Cache
lockMe:; This is the code that will be locked into the cache
mov r0, #5
add r5, r1, r2
. . .
lockMeEnd:
. . .
codeLock:; here is the code to lock the “lockMe” routine
ldr r0, =(lockMe AND NOT 31); r0 gets a pointer to the first line we should lock
ldr r1, =(lockMeEnd AND NOT 31); r1 contains a pointer to the last line we should lock
lockLoop:
mcr p15, 0, r0, c9, c1, 0; lock next line of code into ICache
cmp r0, r1; are we done yet?
add r0, r0, #32; advance pointer to next line
bne lockLoop; if not done, do the next line
3.4.2.5 Unlocking Instructions in the Instruction Cache
The Intel XScale® core provides a global unlock command for the instruction cache. Writing to
coprocessor 15, register 9 unlocks all the locked lines in the instruction cache and leaves them
valid. These lines then become available for the round-robin replacement algorithm.
3.5 Branch Target Buffer (BTB)
The Intel XScale® core uses dynamic branch prediction to reduce the penalties associated with
changing the flow of program execution. The Intel XScale® core features a branch target buffer
that provides the instruction cache with the target address of branch type instructions. The branch
target buffer is implemented as a 128-entry, direct mapped cache.
3.5.1 Branch Target Buffer Operation
The BTB stores the history of branches that have executed along with their targets. Figure 20
shows an entry in the BTB, where the tag is the instruction address of a previously executed branch
and the data contains the target address of the previously executed branch along with two bits of
history information.
Figure 20. BTB Entry
[Figure: each BTB entry holds a TAG, the Branch Address[31:9,1], and DATA, the Target Address[31:1] plus History Bits[1:0].]
The BTB takes the current instruction address and checks to see if this address is a branch that was
previously seen. It uses bits [8:2] of the current address to read out the tag and then compares this
tag to bits [31:9,1] of the current instruction address. If the current instruction address matches the
tag in the cache and the history bits indicate that this branch has usually been taken in the past, the BTB
uses the data (target address) as the next instruction address to send to the instruction cache.
Bit[1] of the instruction address is included in the tag comparison to support Thumb execution.
This organization means that two consecutive Thumb branch (B) instructions, with instruction
address bits[8:2] the same, will contend for the same BTB entry. Thumb also requires 31 bits for
the branch target address. In ARM* mode, bit[1] is 0.
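For illustration, the BTB index and tag extraction described above can be expressed as C helpers. The names are hypothetical; the tag concatenates bits [31:9] with bit [1]:

```c
#include <assert.h>
#include <stdint.h>

/* Bits [8:2] of the instruction address index one of the 128
 * direct-mapped BTB entries. */
static uint32_t btb_index(uint32_t addr) { return (addr >> 2) & 0x7F; }

/* The tag is bits [31:9] of the address concatenated with bit [1],
 * so Thumb branches at consecutive halfword addresses are
 * distinguished even when they index the same entry. */
static uint32_t btb_tag(uint32_t addr)
{
    return (addr >> 9 << 1) | ((addr >> 1) & 1);
}
```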
The history bits represent four possible prediction states for a branch entry in the BTB. Figure 21
shows these states along with the possible transitions. The initial state for branches stored in the
BTB is Weakly-Taken (WT). Every time a branch that exists in the BTB is executed, the history
bits are updated to reflect the latest outcome of the branch, either taken or not-taken.
The BTB does not have to be managed explicitly by software; it is disabled by default after reset
and is invalidated when the instruction cache is invalidated.
Figure 21. Branch History
[Figure: four-state branch history state machine (SN, WN, WT, ST). Each Taken outcome moves the state one step toward ST; each Not Taken outcome moves it one step toward SN; ST and SN saturate.]
Notes:
SN: Strongly Not Taken    ST: Strongly Taken
WN: Weakly Not Taken    WT: Weakly Taken
3.5.1.1 Reset
After Processor Reset, the BTB is disabled and all entries are invalidated.
3.5.2 Update Policy
A new entry is stored into the BTB when the following conditions are met:
• The branch instruction has executed
• The branch was taken
• The branch is not currently in the BTB
The entry is then marked valid and the history bits are set to WT. If another valid branch exists at
the same entry in the BTB, it will be evicted by the new branch.
Once a branch is stored in the BTB, the history bits are updated upon every execution of the branch
as shown in Figure 21.
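The history-bit behavior can be modeled as a two-bit saturating state machine in C. This sketch assumes the standard transitions shown in Figure 21 (taken moves toward Strongly Taken, not-taken toward Strongly Not Taken); the names are illustrative:

```c
#include <assert.h>

/* strongly/weakly not-taken, weakly/strongly taken */
typedef enum { SN, WN, WT, ST } history;

/* New BTB entries start in the Weakly-Taken state. */
history btb_initial(void) { return WT; }

/* Saturating two-bit update on each execution of the branch. */
history btb_update(history h, int taken)
{
    if (taken)
        return h == ST ? ST : (history)(h + 1);
    else
        return h == SN ? SN : (history)(h - 1);
}

/* The BTB redirects fetch only when history predicts taken. */
int btb_predict_taken(history h) { return h >= WT; }
```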
3.5.3 BTB Control
3.5.3.1 Disabling/Enabling
The BTB is always disabled with Reset. Software can enable the BTB through a bit in a
coprocessor register.
Before enabling or disabling the BTB, software must invalidate it (described in the following
section). This action will ensure correct operation in case stale data is in the BTB. Software should
not place any branch instruction between the code that invalidates the BTB and the code that
enables/disables it.
3.5.3.2 Invalidation
There are four ways the contents of the BTB can be invalidated.
1. Reset.
2. Software can directly invalidate the BTB via a CP15, register 7 function.
3. The BTB is invalidated when the Process ID register is written.
4. The BTB is invalidated when the instruction cache is invalidated via CP15, register 7
functions.
3.6 Data Cache
The Intel XScale® core data cache enhances performance by reducing the number of data accesses
to and from external memory. There are two data cache structures in the Intel XScale® core: a
32-Kbyte data cache and a 2-Kbyte mini-data cache. An eight-entry write buffer and a four-entry fill
buffer are also implemented to decouple the Intel XScale® core instruction execution from external
memory accesses, which increases overall system performance.
3.6.1 Overviews
3.6.1.1 Data Cache Overview
The data cache is a 32-Kbyte, 32-way set associative cache, i.e., there are 32 sets and each set has
32 ways. Each way of a set contains 32 bytes (one cache line) and one valid bit. There also exist
two dirty bits for every line, one for the lower 16 bytes and the other one for the upper 16 bytes.
When a store hits the cache, the dirty bit associated with it is set. The replacement policy is a
round-robin algorithm and the cache also supports the ability to reconfigure each line as data RAM.
Figure 22 shows the cache organization and how the data address is used to access the cache.
Cache policies may be adjusted for particular regions of memory by altering page attribute bits in
the MMU descriptor that controls that memory.
The data cache is virtually addressed and virtually tagged. It supports write-back and write-through
caching policies. The data cache always allocates a line in the cache when a cacheable read miss
occurs and will allocate a line into the cache on a cacheable write miss when write allocate is
specified by its page attribute. Page attribute bits determine whether a line gets allocated into the
data cache or mini-data cache.
Figure 22. Data Cache Organization
[Figure: 32-set, 32-way data cache. The virtual data address splits into Tag (bits [31:10]), Set Index (bits [9:5]), Word (bits [4:2]), and Byte (bits [1:0]). The Set Index selects one of 32 sets; the Tag is compared (CAM) against all 32 ways of that set; each way holds a 32-byte cache line, from which the Word and Byte fields select the data, which is byte-aligned and sign-extended (4 bytes to the destination register).]
Note: CAM = Content Addressable Memory
3.6.1.2 Mini-Data Cache Overview
The mini-data cache is a 2-Kbyte, 2-way set associative cache; this means there are 32 sets with
each set containing 2 ways. Each way of a set contains 32 bytes (one cache line) and one valid bit.
There also exist 2 dirty bits for every line, one for the lower 16 bytes and the other one for the
upper 16 bytes. When a store hits the cache, the dirty bit associated with it is set. The replacement
policy is a round-robin algorithm.
Figure 23 shows the cache organization and how the data address is used to access the cache.
The mini-data cache is virtually addressed and virtually tagged and supports the same caching
policies as the data cache. However, lines cannot be locked into the mini-data cache.
Figure 23. Mini-Data Cache Organization
[Figure: 32-set, 2-way mini-data cache. The virtual data address splits into Tag (bits [31:10]), Set Index (bits [9:5]), Word (bits [4:2]), and Byte (bits [1:0]). The Set Index selects one of 32 sets; the Tag is compared (CAM) against both ways of that set; each way holds a 32-byte cache line, from which the Word and Byte fields select the data, which is byte-aligned and sign-extended (4 bytes to the destination register).]
Note: CAM = Content Addressable Memory
3.6.1.3 Write Buffer and Fill Buffer Overview
The Intel XScale® core employs an eight entry write buffer, each entry containing 16 bytes. Stores
to external memory are first placed in the write buffer and subsequently taken out when the bus is
available. The write buffer supports the coalescing of multiple store requests to external memory.
An incoming store may coalesce with any of the eight entries.
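A simplified model of write-buffer coalescing is sketched below in C. It matches stores by 16-byte block address only; the real hardware's coalescing conditions also depend on the memory region's attributes, and all names here are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define WB_ENTRIES 8
#define WB_BYTES   16

typedef struct {
    uint32_t base;      /* 16-byte-aligned block address */
    int      in_use;
} wb_entry;

typedef struct { wb_entry e[WB_ENTRIES]; } write_buffer;

/* Place a store in the write buffer, coalescing with any existing
 * entry that covers the same 16-byte block.  Returns the entry
 * index used, or -1 if the buffer is full (the core would stall
 * until an entry drains to the bus). */
int wb_store(write_buffer *wb, uint32_t addr)
{
    uint32_t base = addr & ~(uint32_t)(WB_BYTES - 1);
    int i;
    for (i = 0; i < WB_ENTRIES; i++)        /* try to coalesce */
        if (wb->e[i].in_use && wb->e[i].base == base)
            return i;
    for (i = 0; i < WB_ENTRIES; i++)        /* else allocate */
        if (!wb->e[i].in_use) {
            wb->e[i].in_use = 1;
            wb->e[i].base   = base;
            return i;
        }
    return -1;                              /* full: stall */
}
```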
The fill buffer holds the external memory request information for a data cache or mini-data cache
fill or non-cacheable read request. Up to four 32-byte read request operations can be outstanding in
the fill buffer before the Intel XScale® core needs to stall.
The fill buffer has been augmented with a four-entry pend buffer that captures data memory
requests to outstanding fill operations. Each entry in the pend buffer contains enough data storage
to hold one 32-bit word, specifically for store operations. Cacheable load or store operations that
hit an entry in the fill buffer get placed in the pend buffer and are completed when the associated
fill completes. Any entry in the pend buffer can be pended against any of the entries in the fill
buffer; multiple entries in the pend buffer can be pended against a single entry in the fill buffer.
Pended operations complete in program order.
3.6.2 Data Cache and Mini-Data Cache Operation
The following discussions refer to the data cache and mini-data cache as one cache (data/mini-data) since their behavior is the same when accessed.
3.6.2.1 Operation when Caching is Enabled
When the data/mini-data cache is enabled for an access, the data/mini-data cache compares the
address of the request against the addresses of data that it is currently holding. If the line containing
the address of the request is resident in the cache, the access “hits” the cache. For a load operation
the cache returns the requested data to the destination register and for a store operation the data is
stored into the cache. The data associated with the store may also be written to external memory if
write-through caching is specified for that area of memory. If the cache does not contain the
requested data, the access “misses” the cache, and the sequence of events that follows depends on
the configuration of the cache, the configuration of the MMU and the page attributes.
3.6.2.2 Operation when Data Caching is Disabled
The data/mini-data cache is still accessed even though it is disabled. If a load hits the cache it will
return the requested data to the destination register. If a store hits the cache, the data is written into
the cache. Any access that misses the cache will not allocate a line in the cache when it is disabled,
even if the MMU is enabled and the memory region’s cacheability attribute is set.
3.6.2.3 Cache Policies
3.6.2.3.1 Cacheability
Data at a specified address is cacheable given the following:
• The MMU is enabled
• The cacheable attribute is set in the descriptor for the accessed address
• The data/mini-data cache is enabled
3.6.2.3.2 Read Miss Policy
The following sequence of events occurs when a cacheable load operation misses the cache:
1. The fill buffer is checked to see if an outstanding fill request already exists for that line.
— If so, the current request is placed in the pending buffer and waits until the previously
requested fill completes, after which it accesses the cache again to obtain the requested data
and return it to the destination register.
— If there is no outstanding fill request for that line, the current load request is placed in the
fill buffer and a 32-byte external memory read request is made. If the pending buffer or fill
buffer is full, the Intel XScale® core will stall until an entry is available.
2. A line is allocated in the cache to receive the 32 bytes of fill data. The line selected is
determined by the round-robin pointer (see Section 3.6.2.4). The line chosen may contain a
valid line previously allocated in the cache. In this case, both dirty bits are examined and, if set,
the four words associated with an asserted dirty bit will be written back to external
memory as a 4-word burst operation.
3. When the data requested by the load is returned from external memory, it is immediately sent
to the destination register specified by the load. A system that returns the requested data
first, relative to the other words of the line, will obtain the best performance.
4. As data returns from external memory, it is written into the cache in the previously allocated
line.
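The write-back cost in step 2 follows directly from the two per-half dirty bits. As an illustrative C sketch (names invented):

```c
#include <assert.h>

/* Eviction of a line during read-miss allocation: each 16-byte half
 * whose dirty bit is set is written back as a 4-word burst.
 * Returns the number of words written back to external memory
 * (0, 4, or 8). */
int evict_writeback_words(int valid, int dirty_lo, int dirty_hi)
{
    if (!valid)
        return 0;                   /* nothing to write back */
    return 4 * (dirty_lo + dirty_hi);
}
```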
A load operation that misses the cache and is not cacheable makes a request from external memory
for the exact data size of the original load request. For example, LDRH requests exactly two bytes
from external memory, LDR requests four bytes from external memory, etc. This request is placed
in the fill buffer until the data is returned from external memory, and is then forwarded back to
the destination register(s).