INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the
absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future
definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The
information here is subject to change without notice. Do not finalize a design with this information.
The Intel® Xeon® Processor E5-1600/E5-2600/E5-4600 Product Families, Intel® C600 series chipset, and the Intel® Xeon®
Processor E5-1600/E5-2600/E5-4600 Product Families-based Platform described in this document may contain design defects or
errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are
available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained
by calling 1-800-548-4725, or go to: http://www.intel.com/#/en_US_01
Hyper-Threading Technology requires a computer system with a processor supporting HT Technology and an HT Technology
enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. For
more information including details on which processors support HT Technology, see
Enabling Execute Disable Bit functionality requires a PC with a processor with Execute Disable Bit capability and a supporting
operating system. Check with your PC manufacturer on whether your system delivers Execute Disable Bit functionality.
Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor
(VMM) and, for some uses, certain computer system software enabled for it. Functionality, performance or other benefits will vary
depending on hardware and software configurations and may require a BIOS update. Software applications may not be compatible
with all operating systems. Please check with your application vendor.
Intel® Turbo Boost Technology requires a PC with a processor with Intel Turbo Boost Technology capability. Intel Turbo Boost
Technology performance varies depending on hardware, software and overall system configuration. Check with your PC
manufacturer on whether your system delivers Intel Turbo Boost Technology. For more information, see
http://www.intel.com/technology/turboboost/.
64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device
drivers and applications enabled for Intel® 64 architecture. Performance will vary depending on your hardware and software
configurations. Consult with your system vendor for more information.
Δ Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor
family, not across different processor families. See http://www.intel.com/products/processor%5Fnumber/ for details.
I2C is a two-wire communications bus/protocol developed by Philips. SMBus is a subset of the I2C bus/protocol and was developed
by Intel. Implementations of the I2C bus/protocol may require licenses from various entities, including Philips Electronics N.V. and
North American Philips Corporation.
The Intel® Xeon® Processor E5-1600/E5-2600/E5-4600 Product Families Datasheet Volume One provides DC specifications, signal integrity, differential signaling
specifications, land and signal definitions, and an overview of additional processor
feature interfaces.
The Intel® Xeon® processor E5-1600/E5-2600/E5-4600 product families are the next
generation of 64-bit, multi-core enterprise processors built on 32-nanometer process
technology. Throughout this document, the Intel® Xeon® processor E5-1600/E5-2600/E5-4600 product families may be referred to as simply the processor. Where
information differs between the EP and EP 4S SKUs, this document uses specific Intel®
Xeon® processor E5-1600 product family, Intel® Xeon® processor E5-2600 product
family, and Intel® Xeon® processor E5-4600 product family notation.
Based on the low-power/high performance 2nd Generation Intel® Core™ Processor Family
microarchitecture, the processor is designed for a two chip platform consisting of a
processor and a Platform Controller Hub (PCH) enabling higher performance, easier
validation, and improved x-y footprint. The Intel® Xeon® processor E5-1600 product
family and the Intel® Xeon® processor E5-2600 product family are designed for
Efficient Performance server, workstation and HPC platforms. The Intel® Xeon®
processor E5-4600 product family supports scalable server and HPC
platforms of two or more processors, including “glueless” 4-way platforms. Note: some
processor features are not available on all platforms.
Each socket provides two Intel® QuickPath Interconnect point-to-point
links capable of up to 8.0 GT/s, up to 40 lanes of PCI Express* 3.0 links capable of
8.0 GT/s, and 4 lanes of DMI2/PCI Express* 2.0 interface with a peak transfer rate of
5.0 GT/s. The processor supports up to 46 bits of physical address space and 48 bits of
virtual address space.
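As a quick check on what these address widths mean in practice, the following Python sketch computes the size of each address space. Only the 46-bit and 48-bit widths come from the text above; the helper name `fits_physical` is hypothetical and used purely for illustration.

```python
# Address widths stated in this section.
PHYS_ADDR_BITS = 46
VIRT_ADDR_BITS = 48

TIB = 1 << 40  # bytes per tebibyte


def address_space_tib(bits: int) -> int:
    """Return the size, in TiB, of an address space with the given width."""
    return (1 << bits) // TIB


def fits_physical(addr: int) -> bool:
    """Hypothetical helper: True if addr is representable in 46 bits."""
    return 0 <= addr < (1 << PHYS_ADDR_BITS)


print(address_space_tib(PHYS_ADDR_BITS))  # 64
print(address_space_tib(VIRT_ADDR_BITS))  # 256
```

In other words, 46 physical address bits cover 64 TiB and 48 virtual address bits cover 256 TiB.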
Included in this family of processors are an integrated memory controller (IMC) and
integrated I/O (IIO) (such as PCI Express* and DMI2) on a single silicon die. This single
die solution is known as a monolithic processor.
Figure 1-1 and Figure 1-2 show the processor 2-socket and 4-socket platform
configurations. The “Legacy CPU” is the boot processor that is connected to the PCH
component; this socket is set to NodeID[0]. In the 4-socket configuration, the “Remote
CPU” is the processor which is not connected to the Legacy CPU.
Intel® Xeon® Processor E5-1600/E5-2600/E5-4600 Product Families13
Datasheet Volume One
Figure 1-1. Intel® Xeon® Processor E5-2600 Product Family on the 2 Socket
Platform
Overview
Figure 1-2. Intel® Xeon® Processor E5-4600 Product Family on the 4 Socket
Platform
1.1.1 Processor Feature Details
• Up to 8 execution cores
• Each core supports two threads (Intel® Hyper-Threading Technology), up to 16
threads per socket
• Up to 8 ranks supported per memory channel, 1, 2 or 4 ranks per DIMM
• Open with adaptive idle page close timer or closed page policy
• Per channel memory test and initialization engine can initialize DRAM to all logical
zeros with valid ECC (with or without data scrambler) or a predefined test pattern
• Isochronous access support for Quality of Service (QoS), native 1 and 2 socket
platforms - Intel® Xeon® processor E5-1600 and E5-2600 product families only
• Minimum memory configuration: independent channel support with 1 DIMM
populated
• Integrated dual SMBus master controllers
• Command launch modes of 1n/2n
• RAS Support (including but not limited to):
— Rank Level Sparing and Device Tagging
— Demand and Patrol Scrubbing
— DRAM Single Device Data Correction (SDDC) for any single x4 or x8 DRAM
— Lockstep mode where channels 0 & 1 and channels 2 & 3 are operated in
lockstep mode
— The combination of memory channel pair lockstep and memory mirroring is not
supported
— Data scrambling with address to ease detection of write errors to an incorrect
address.
— Error reporting via Machine Check Architecture
— Read Retry during CRC error handling checks by iMC
— Channel mirroring within a socket. Channel Mirroring mode is supported on
memory channels 0 & 1 and channels 2 & 3
— Corrupt Data Containment
— MCA Recovery
• Memory thermal monitoring support for DIMM temperature via two memory
signals, MEM_HOT_C{01/23}_N
1.2.2 PCI Express*
• The PCI Express* port(s) are fully compliant with the PCI Express* Base
Specification, Revision 3.0 (PCIe* 3.0)
• Support for PCI Express* 3.0 (8.0 GT/s), 2.0 (5.0 GT/s), and 1.0 (2.5 GT/s)
• Up to 40 lanes of PCI Express* interconnect for general purpose PCI Express*
devices at PCIe* 3.0 speeds that are configurable for up to 10 independent ports
• 4 lanes of PCI Express* at PCIe* 2.0 speeds when not using DMI2 port (Port 0),
also can be downgraded to x2 or x1
• Negotiating down to narrower widths is supported, see Figure 1-3:
— x16 port (Port 2 & Port 3) may negotiate down to x8, x4, x2, or x1.
— x8 port (Port 1) may negotiate down to x4, x2, or x1.
— x4 port (Port 0) may negotiate down to x2, or x1.
— When negotiating down to narrower widths, there are caveats as to how lane
reversal is supported.
• Non-Transparent Bridge (NTB) is supported by PCIe* Port3a/IOU1. For more details
on NTB mode operation refer to PCI Express Base Specification - Revision 3.0:
— x4 or x8 widths and at PCIe* 1.0, 2.0, 3.0 speeds
— Two usage models; NTB attached to a Root Port or NTB attached to another
NTB
— Supports three 64-bit BARs
— Supports posted writes and non-posted memory read transactions across the
NTB
— Supports INTx, MSI and MSI-X mechanisms for interrupts on both sides of the NTB
in upstream direction only
• Address Translation Services (ATS) 1.0 support
• Hierarchical PCI-compliant configuration mechanism for downstream devices.
• Traditional PCI style traffic (asynchronous snooped, PCI ordering).
• PCI Express* extended configuration space. The first 256 bytes of configuration
space aliases directly to the PCI compatibility configuration space. The remaining
portion of the fixed 4-KB block of memory-mapped space above that (starting at
100h) is known as extended configuration space.
• PCI Express* Enhanced Access Mechanism. Accessing the device configuration
space in a flat memory mapped fashion.
• Automatic discovery, negotiation, and training of link out of reset.
• Supports receiving and decoding 64 bits of address from PCI Express*.
— Memory transactions received from PCI Express* that go above the top of
physical address space (when Intel VT-d is enabled, the check would be against
the translated HPA (Host Physical Address) address) are reported as errors by
the processor.
— Outbound access to PCI Express* will always have address bits 63 to 46
cleared.
• Re-issues Configuration cycles that have been previously completed with the
Configuration Retry status.
• Power Management Event (PME) functions.
• Message Signaled Interrupt (MSI and MSI-X) messages
• Degraded Mode support and Lane Reversal support
• Static lane numbering reversal and polarity inversion support
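The transfer rates and negotiable widths listed above can be turned into approximate per-direction bandwidth figures. The sketch below is illustrative only: the line-encoding efficiencies (8b/10b for PCIe 1.0 and 2.0, 128b/130b for PCIe 3.0) come from the PCI Express Base Specification rather than from this datasheet, and the function and table names are hypothetical.

```python
# Raw transfer rates from the list above, in GT/s per lane.
RATE_GTPS = {"1.0": 2.5, "2.0": 5.0, "3.0": 8.0}

# Line-encoding efficiency (payload bits / transmitted bits). These values
# are taken from the PCI Express Base Specification, not from this document.
ENCODING = {"1.0": 8 / 10, "2.0": 8 / 10, "3.0": 128 / 130}

# Widths each port type may negotiate down to, per the list above.
NEGOTIABLE_WIDTHS = {
    "x16": (16, 8, 4, 2, 1),
    "x8": (8, 4, 2, 1),
    "x4": (4, 2, 1),
}


def lane_bandwidth_mbytes(gen: str) -> float:
    """Per-lane, per-direction bandwidth in MB/s after line encoding."""
    return RATE_GTPS[gen] * 1e9 * ENCODING[gen] / 8 / 1e6


def link_bandwidth_gbytes(gen: str, width: int) -> float:
    """Per-direction bandwidth in GB/s for a link of the given width."""
    return lane_bandwidth_mbytes(gen) * width / 1000


# Show how bandwidth scales as an x16 PCIe 3.0 port negotiates down.
for width in NEGOTIABLE_WIDTHS["x16"]:
    print(f"x{width}: {link_bandwidth_gbytes('3.0', width):.2f} GB/s")
```

These numbers ignore packet and protocol overhead, so achievable throughput is lower than the values printed.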
[Figure 1-3 diagram. Each port is shown with Transaction, Link, and Physical layers and the following lane partitioning:
Port 0 (DMI / PCIe), lanes 0…3: one x4 link (DMI).
Port 1 (IOU2, PCIe), lanes 0…7: one x8 link (Port 1a), or two x4 links (Port 1a on lanes 0…3, Port 1b on lanes 4…7).
Port 2 (IOU0, PCIe), lanes 0…15: one x16 link (Port 2a), two x8 links (Ports 2a and 2c), or four x4 links (Ports 2a, 2b, 2c, and 2d).
Port 3 (IOU1, PCIe), lanes 0…15: one x16 link (Port 3a), two x8 links (Ports 3a and 3c), or four x4 links (Ports 3a, 3b, 3c, and 3d).]
Figure 1-3. PCI Express* Lane Partitioning and Direct Media Interface Gen 2 (DMI2)
1.2.3 Direct Media Interface Gen 2 (DMI2)
• Serves as the chip-to-chip interface to the Intel® C600 Chipset
• The DMI2 port supports x4 link width and only operates in a x4 mode when in DMI2
• Operates at PCI Express* 1.0 or 2.0 speeds
• Transparent to software
• Processor and peer-to-peer writes and reads with 64-bit address support
• APIC and Message Signaled Interrupt (MSI) support. Will send Intel-defined “End of
Interrupt” broadcast message when initiated by the processor.
• System Management Interrupt (SMI), SCI, and SERR error indication
• Static lane numbering reversal support
• Supports DMI2 virtual channels VC0, VC1, VCm, and VCp
1.2.4 Intel® QuickPath Interconnect (Intel® QPI)
• Compliant with Intel QuickPath Interconnect v1.1 standard packet formats
• Implements two full width Intel QPI ports
• Full width port includes 20 data lanes and 1 clock lane
• 64 byte cache-lines
• Isochronous access support for Quality of Service (QoS), native 1 and 2 socket
platforms - Intel® Xeon® processor E5-1600 and E5-2600 product families only
• No Intel QuickPath Interconnect bifurcation support
• Differential signaling
• Forwarded clocking
• Up to 8.0 GT/s data rate (up to 16 GB/s peak bandwidth per direction, per port)
— All ports run at same operational frequency
— Reference Clock is 100 MHz
— Slow boot speed initialization at 50 MT/s
• Common reference clocking (same clock generator for both sender and receiver)
• Intel® Interconnect Built-In-Self-Test (Intel® IBIST) for high-speed testability
• Polarity and Lane reversal (Rx side only)
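The 16 GB/s figure can be reproduced from the 8.0 GT/s rate with a short calculation. The sketch below assumes, based on general descriptions of Intel QPI and not on anything stated in this document, that 64 of the 80 bits in each flit carry data payload. The function name is hypothetical.

```python
from fractions import Fraction

TRANSFER_RATE_GTPS = 8       # up to 8.0 GT/s, from the list above
PHIT_BITS_FULL_WIDTH = 20    # 20 data lanes in a full-width port
FLIT_BITS = 80               # 1 flit = 80 bits (see Terminology)
PAYLOAD_BITS_PER_FLIT = 64   # assumption: 64 of the 80 flit bits are data


def qpi_peak_gbytes_per_direction(rate_gtps: int = TRANSFER_RATE_GTPS) -> Fraction:
    """Peak payload bandwidth in GB/s for one direction of one port."""
    raw_gbits = rate_gtps * PHIT_BITS_FULL_WIDTH
    payload_gbits = raw_gbits * Fraction(PAYLOAD_BITS_PER_FLIT, FLIT_BITS)
    return payload_gbits / 8


print(FLIT_BITS // PHIT_BITS_FULL_WIDTH)   # 4 phits per flit at full width
print(qpi_peak_gbytes_per_direction())     # 16
```

Exact fractions are used instead of floating point so that the result prints as a whole number.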
1.2.5 Platform Environment Control Interface (PECI)
The PECI is a one-wire interface that provides a communication channel between a
PECI client (the processor) and a PECI master (the PCH).
• Supports operation at up to 2 Mbps data transfers
• Link layer improvements to support additional services and higher efficiency over
PECI 2.0 generation
• Services include CPU thermal and estimated power information, control functions
for power limiting, P-state and T-state control, and access for Machine Check
Architecture registers and PCI configuration space (both within the processor
package and downstream devices)
• PECI address determined by SOCKET_ID configuration
• Single domain (Domain 0) is supported
1.3 Power Management Support
1.3.1 Processor Package and Core States
• ACPI C-states as implemented by the following processor C-states:
— Package: PC0, PC1/PC1E, PC2, PC3, PC6 (Package C7 is not supported)
— Core: CC0, CC1, CC1E, CC3, CC6, CC7
• Enhanced Intel SpeedStep® Technology
1.3.2 System States Support
• S0, S1, S3, S4, S5
1.3.3 Memory Controller
• Multiple CKE power down modes
• Multiple self-refresh modes
• Memory thermal monitoring via MEM_HOT_C01_N and MEM_HOT_C23_N Signals
1.3.4 PCI Express
• L0s is not supported
• L1 ASPM power management capability
1.3.5 Intel QuickPath Interconnect
• L0s is not supported
• L0p and L1 power management capabilities
1.4 Thermal Management Support
• Digital Thermal Sensor with multiple on-die temperature zones
• Adaptive Thermal Monitor
• THERMTRIP_N and PROCHOT_N signal support
• On-Demand mode clock modulation
• Open and Closed Loop Thermal Throttling (OLTT/CLTT) support for system memory
in addition to Hybrid OLTT/CLTT mode
• Fan speed control with DTS
• Two integrated SMBus masters for accessing thermal data from DIMMs
• New Memory Thermal Throttling features via MEM_HOT_C{01/23}_N signals
• Running Average Power Limit (RAPL), Processor and DRAM Thermal and Power
Optimization Capabilities
1.5 Package Summary
The processor socket is a 52.5 x 45 mm FCLGA package (LGA2011-0 land FCLGA10).
1.6 Terminology
Term: Description
ASPM: Active State Power Management
BMC: Baseboard Management Controllers
Cbo: Cache and Core Box. It is a term used for internal logic providing ring interface to
DDR3: Third generation Double Data Rate SDRAM memory technology that is the
DMA: Direct Memory Access
DMI: Direct Media Interface
DMI2: Direct Media Interface Gen 2
DTS: Digital Thermal Sensor
ECC: Error Correction Code
Enhanced Intel SpeedStep® Technology: Allows the operating system to reduce power consumption when performance is not needed.
Execute Disable Bit: The Execute Disable bit allows memory to be marked as executable or non-executable, when combined with a supporting operating system. If code attempts to run in non-executable memory the processor raises an error to the operating system. This feature can prevent some classes of viruses or worms that exploit buffer overrun vulnerabilities and can thus help improve the overall security of the system. See the Intel® 64 and IA-32 Architectures Software Developer's Manuals for more detailed information.
Flit: Flow Control Unit. The Intel QPI Link layer’s unit of transfer; 1 Flit = 80 bits.
Functional Operation: Refers to the normal operating conditions in which all processor specifications, including DC, AC, system bus, signal quality, mechanical, and thermal, are satisfied.
IMC: The Integrated Memory Controller. A Memory Controller that is integrated in the processor die.
IIO: The Integrated I/O Controller. An I/O controller that is integrated in the processor die.
Intel® QuickData Technology: Intel QuickData Technology is a platform solution designed to maximize the throughput of server data traffic across a broader range of configurations and server environments to achieve faster, scalable, and more reliable I/O.
Intel® QuickPath Interconnect (Intel® QPI): A cache-coherent, link-based Interconnect specification for Intel processors, chipsets, and I/O bridge components.
Intel® 64 Technology: ... architecture and programming model can be found at http://developer.intel.com/technology/intel64/.
Intel® Turbo Boost Technology: Intel® Turbo Boost Technology is a way to automatically run the processor core faster than the marked frequency if the part is operating under power, temperature, and current specifications limits of the Thermal Design Power (TDP). This results in increased performance of both single and multi-threaded applications.
Intel® Virtualization Technology (Intel® VT): Processor virtualization which, when used in conjunction with Virtual Machine Monitor software, enables multiple, robust independent software environments inside a single platform.
Intel® VT-d: Intel® Virtualization Technology (Intel® VT) for Directed I/O. Intel VT-d is a hardware assist, under system software (Virtual Machine Manager or OS) control, for enabling I/O device virtualization. Intel VT-d also brings robust security by providing protection from errant DMAs by using DMA remapping, a key feature of Intel VT-d.
Intel® Xeon® processor E5-1600 product family and Intel® Xeon® processor E5-2600 product family: Intel’s 32-nm processor design, follow-on to the 32-nm 2nd Generation Intel® Core™ Processor Family design. It is the first processor for use in Intel® Xeon® processor E5-1600 and E5-2600 product families-based platforms. Intel® Xeon® processor E5-1600 product family and Intel® Xeon® processor E5-2600 product family supports Efficient Performance server, workstation and HPC platforms.
Intel® Xeon® processor E5-4600 product family: Intel’s 32-nm processor design, follow-on to the 32-nm processor design. It is the first processor for use in Intel® Xeon® processor E5-4600 product family-based platforms. Intel® Xeon® processor E5-4600 product family supports scalable server and HPC platforms for two or more processors, including glueless four-way platforms.
Integrated Heat Spreader (IHS): A component of the processor package used to enhance the thermal performance of the package. Component thermal solutions interface with the processor at the IHS surface.
Jitter: Any timing variation of a transition edge or edges from the defined Unit Interval (UI).
IOV: I/O Virtualization
LGA2011-0 land FCLGA10 Socket: The processor mates with the system board through this surface mount, LGA2011-0 land FCLGA10 contact socket, for the Intel® Xeon® processor E5 product family-based platform.
Term: Description
LLC: Last Level Cache
LRDIMM: Load Reduced Dual In-line Memory Module
NCTF: Non-Critical to Function: NCTF locations are typically redundant ground or non-critical reserved, so the loss of the solder joint continuity at end of life conditions will not affect the overall product functionality.
NEBS: Network Equipment Building System. NEBS is the most common set of environmental design guidelines applied to telecommunications equipment in the United States.
PCH: Platform Controller Hub (Intel® C600 Chipset). The next generation chipset with centralized platform capabilities including the main I/O interfaces along with display connectivity, audio features, power management, manageability, security and storage features.
PCU: Power Control Unit
PCI Express* 3.0: The third generation PCI Express* specification that operates at twice the speed of PCI Express* 2.0 (8 Gb/s); however, PCI Express* 3.0 is completely backward compatible with PCI Express* 1.0 and 2.0.
PCI Express* 3: PCI Express* Generation 3.0
PCI Express* 2: PCI Express* Generation 2.0
PCI Express*: PCI Express* Generation 2.0/3.0
PECI: Platform Environment Control Interface
Phit: Physical Unit. An Intel® QPI terminology defining units of transfer at the physical layer. 1 Phit is equal to 20 bits in ‘full width mode’ and 10 bits in ‘half width mode’.
Processor: The 64-bit, single-core or multi-core component (package)
Processor Core: The term “processor core” refers to silicon die itself which can contain multiple execution cores. Each execution core has an instruction cache, data cache, and 256-KB L2 cache. All execution cores share the L3 cache. All DC and signal integrity specifications are measured at the processor die (pads), unless otherwise noted.
RDIMM: Registered Dual In-line Memory Module
Rank: A unit of DRAM corresponding to four to eight devices in parallel, ignoring ECC. These devices are usually, but not always, mounted on a single side of a DDR3 DIMM.
Scalable-2S: Intel® Xeon® processor E5 product family-based platform targeted for scalable designs using third party Node Controller chip. In these designs, Node Controller is used to scale the design beyond one/two/four sockets.
SCI: System Control Interrupt. Used in ACPI protocol.
SSE: Intel® Streaming SIMD Extensions (Intel® SSE)
SKU: A processor Stock Keeping Unit (SKU) to be installed in either server or workstation platforms. Electrical, power and thermal specifications for these SKUs are based on specific use condition assumptions. Server processors may be further categorized as Efficient Performance server, workstation and HPC SKUs. For further details on use condition assumptions, please refer to the latest Product Release Qualification (PRQ) Report available via your Customer Quality Engineer (CQE) contact.
SMBus: System Management Bus. A two-wire interface through which simple system and power management related devices can communicate with the rest of the system. It is based on the principles of the operation of the I2C* two-wire serial bus from Philips Semiconductor.
Storage Conditions: A non-operational state. The processor may be installed in a platform, in a tray, or loose. Processors may be sealed in packaging or exposed to free air. Under these conditions, processor landings should not be connected to any supply voltages, have any I/Os biased or receive any clocks. Upon exposure to “free air” (i.e., unsealed packaging or a device removed from packaging material) the processor must be handled in accordance with moisture sensitivity labeling (MSL) as indicated on the packaging material.
TAC: Thermal Averaging Constant
TDP: Thermal Design Power
TSOD: Thermal Sensor on DIMM
UDIMM: Unbuffered Dual In-line Module
Uncore: The portion of the processor comprising the shared cache, IMC, HA, PCU, UBox, and Intel QPI link interface.
Unit Interval: Signaling convention that is binary and unidirectional. In this binary signaling, one bit is sent for every edge of the forwarded clock, whether it be a rising edge or a falling edge. If a number of edges are collected at instances $t_1, t_2, \ldots, t_n, \ldots, t_k$, then the UI at instance “n” is defined as: $UI_n = t_n - t_{n-1}$
VCC: Processor core power supply
VSS: Processor ground
VCCD_01, VCCD_23: Variable power supply for the processor system memory interface. VCCD is the generic term for VCCD_01, VCCD_23.
x1: Refers to a Link or Port with one Physical Lane
x4: Refers to a Link or Port with four Physical Lanes
x8: Refers to a Link or Port with eight Physical Lanes
x16: Refers to a Link or Port with sixteen Physical Lanes
1.7 Related Documents
Refer to the following documents for additional information.
Table 1-1. Referenced Documents
Document: Location
Intel® Xeon® Processor E5 Product Family Datasheet Volume Two: http://www.intel.com
Intel® Xeon® Processor E5-1600/E5-2600/E5-4600 Product Families – BSDL (Boundary Scan Description Language)
Intel® C600 Series Chipset Data Sheet: http://www.intel.com
Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM) Volumes 1, 2, and 3
Advanced Configuration and Power Interface Specification 3.0: http://www.acpi.info
PCI Local Bus Specification 3.0: http://www.pcisig.com/specifications
PCI Express Base Specification - Revision 2.1 and 1.1
PCI Express Base Specification - Revision 3.0
System Management Bus (SMBus) Specification: http://smbus.org/
DDR3 SDRAM Specification: http://www.jedec.org
Low (JESD22-A119) and High (JESD-A103) Temperature Storage Life Specifications
Intel 64 and IA-32 Architectures Software Developer's Manuals
• Volume 1: Basic Architecture
• Volume 2A: Instruction Set Reference, A-M
• Volume 2B: Instruction Set Reference, N-Z
• Volume 3A: System Programming Guide
• Volume 3B: System Programming Guide
Intel® 64 and IA-32 Architectures Optimization Reference Manual
This chapter describes the interfaces supported by the processor.
2.1 System Memory Interface
2.1.1 System Memory Technology Support
The Integrated Memory Controller (IMC) supports DDR3 protocols with four
independent 64-bit memory channels with 8 bits of ECC for each channel (total of
72-bits) and supports 1 to 3 DIMMs per channel depending on the type of memory
installed. The type of memory supported by the processor is dependent on the target
platform:
• Intel® Xeon® processor E5 product family-based platforms support:
— ECC registered DIMMs: with a maximum of three DIMMs per channel allowing
up to eight device ranks per channel.
— ECC and non-ECC unbuffered DIMMs: with a maximum of two DIMMs per
channel thus allowing up to four device ranks per channel. Support for mixed
non-ECC with ECC unbuffered DIMM configurations.
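The population limits above lend themselves to a simple per-channel check. The following Python sketch is one possible way to encode them; the per-DIMM rank counts (1, 2 or 4) come from the processor feature list, while the function name, the `RDIMM`/`UDIMM` keys, and the message wording are hypothetical choices made for illustration.

```python
from typing import List

# DIMM type -> (max DIMMs per channel, max device ranks per channel),
# as stated in the list above.
LIMITS = {
    "RDIMM": (3, 8),  # ECC registered DIMMs
    "UDIMM": (2, 4),  # ECC or non-ECC unbuffered DIMMs
}

# Rank counts per DIMM given in the processor feature list.
VALID_RANKS_PER_DIMM = {1, 2, 4}


def check_channel(dimm_type: str, ranks_per_dimm: List[int]) -> List[str]:
    """Return rule violations for one memory channel (empty list if none)."""
    max_dimms, max_ranks = LIMITS[dimm_type]
    problems = []
    if len(ranks_per_dimm) > max_dimms:
        problems.append(f"{len(ranks_per_dimm)} DIMMs exceeds the limit of {max_dimms}")
    if any(r not in VALID_RANKS_PER_DIMM for r in ranks_per_dimm):
        problems.append("each DIMM must have 1, 2 or 4 ranks")
    if sum(ranks_per_dimm) > max_ranks:
        problems.append(f"{sum(ranks_per_dimm)} ranks exceeds the limit of {max_ranks}")
    return problems


print(check_channel("RDIMM", [4, 4]))     # []
print(check_channel("UDIMM", [2, 2, 2]))  # two violations reported
```

A real platform may impose further restrictions (for example on speed or DIMM mixing) that this sketch does not model.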
2.1.2 System Memory Timing Support
The IMC supports the following DDR3 Speed Bin, CAS Write Latency (CWL), and
command signal mode timings on the main memory interface:
• tCL = CAS Latency
• tRCD = Activate Command to READ or WRITE Command delay
• tRP = PRECHARGE Command Period
• CWL = CAS Write Latency
• Command Signal modes = 1n indicates a new command may be issued every clock
and 2n indicates a new command may be issued every 2 clocks. Command launch
mode programming depends on the transfer rate and memory configuration.
2.2 PCI Express* Interface
[Figure 2-1. PCI Express* Layering Diagram: each component implements Transaction, Data Link, and Physical layers; the Physical layer comprises a Logical Sub-Block and an Electrical Sub-Block with RX and TX paths.]
Interfaces
This section describes the PCI Express* 3.0 interface capabilities of the processor. See
the PCI Express* Base Specification for details of PCI Express*.
2.2.1 PCI Express* Architecture
Compatibility with the PCI addressing model is maintained to ensure that all existing
applications and drivers operate unchanged. The PCI Express* configuration uses
standard mechanisms as defined in the PCI Plug-and-Play specification.
The PCI Express* architecture is specified in three layers: Transaction Layer, Data Link
Layer, and Physical Layer. The partitioning in the component is not necessarily along
these same boundaries. Refer to Figure 2-1 for the PCI Express* Layering Diagram.
PCI Express* uses packets to communicate information between components. Packets
are formed in the Transaction and Data Link Layers to carry the information from the
transmitting component to the receiving component. As the transmitted packets flow
through the other layers, they are extended with additional information necessary to
handle packets at those layers. At the receiving side, the reverse process occurs and
packets get transformed from their Physical Layer representation to the Data Link
Layer representation and finally (for Transaction Layer Packets) to the form that can be
processed by the Transaction Layer of the receiving device.
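[Figure 2-2 diagram. A packet on the link consists of: Framing, Sequence Number, Header, Data, ECRC, LCRC, Framing. The Transaction Layer contributes the Header, Data, and ECRC; the Data Link Layer adds the Sequence Number and LCRC; the Physical Layer adds the Framing.]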
Figure 2-2. Packet Flow through the Layers
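The nesting of fields added by each layer can be illustrated with a toy model. The Python sketch below is not a PCI Express* implementation: the framing markers are placeholder byte values, `zlib.crc32` merely stands in for the LCRC, the optional ECRC is omitted, and all function names are hypothetical. It only shows how each layer wraps the packet on transmit and how the receiver unwraps it in reverse order.

```python
import struct
import zlib

# Placeholder one-byte framing markers. Real links use physical-layer
# framing symbols or tokens, not these byte values.
START_MARK = b"\xfb"
END_MARK = b"\xfd"


def transaction_layer(header: bytes, data: bytes) -> bytes:
    """Assemble a TLP from header and data (optional ECRC omitted)."""
    return header + data


def data_link_layer(tlp: bytes, seq: int) -> bytes:
    """Prefix a 12-bit sequence number and append a 32-bit check value."""
    body = struct.pack(">H", seq & 0x0FFF) + tlp
    check = zlib.crc32(body)  # stand-in for the real LCRC computation
    return body + struct.pack(">I", check)


def physical_layer(packet: bytes) -> bytes:
    """Add framing around the Data Link Layer packet."""
    return START_MARK + packet + END_MARK


def receive(frame: bytes) -> bytes:
    """Reverse process: strip framing, verify the check value, return the TLP."""
    packet = frame[1:-1]
    body, check = packet[:-4], struct.unpack(">I", packet[-4:])[0]
    if zlib.crc32(body) != check:
        raise ValueError("check value mismatch; a real link would request a retry")
    return body[2:]  # drop the two-byte sequence number field


tlp = transaction_layer(b"\x00" * 12, b"abcd")
frame = physical_layer(data_link_layer(tlp, seq=5))
print(receive(frame) == tlp)  # True
```

Keeping each layer as a separate function mirrors the layering in the figure: a layer only adds or removes its own fields and treats the rest of the packet as opaque bytes.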
2.2.1.1 Transaction Layer
The upper layer of the PCI Express* architecture is the Transaction Layer. The
Transaction Layer's primary responsibility is the assembly and disassembly of
Transaction Layer Packets (TLPs). TLPs are used to communicate transactions, such as
read and write, as well as certain types of events. The Transaction Layer also manages
flow control of TLPs.
2.2.1.2 Data Link Layer
The middle layer in the PCI Express* stack, the Data Link Layer, serves as an
intermediate stage between the Transaction Layer and the Physical Layer.
Responsibilities of Data Link Layer include link management, error detection, and error
correction.
The transmission side of the Data Link Layer accepts TLPs assembled by the
Transaction Layer, calculates and applies data protection code and TLP sequence
number, and submits them to Physical Layer for transmission across the Link. The
receiving Data Link Layer is responsible for checking the integrity of received TLPs and
for submitting them to the Transaction Layer for further processing. On detection of TLP
error(s), this layer is responsible for requesting retransmission of TLPs until information
is correctly received, or the Link is determined to have failed. The Data Link Layer also
generates and consumes packets which are used for Link management functions.
2.2.1.3 Physical Layer
The Physical Layer includes all circuitry for interface operation, including driver and
input buffers, parallel-to-serial and serial-to-parallel conversion, PLL(s), and impedance
matching circuitry. It also includes logical functions related to interface initialization and
maintenance. The Physical Layer exchanges data with the Data Link Layer in an
implementation-specific format, and is responsible for converting this to an appropriate
serialized format and transmitting it across the PCI Express* Link at a frequency and
width compatible with the remote device.
2.2.2 PCI Express* Configuration Mechanism
The PCI Express* link is mapped through a PCI-to-PCI bridge structure.
PCI Express* extends the configuration space to 4096 bytes per-device/function, as
compared to 256 bytes allowed by the Conventional PCI Specification. PCI Express*
configuration space is divided into a PCI-compatible region (which consists of the first
256 bytes of a logical device's configuration space) and an extended PCI Express*
region (which consists of the remaining configuration space). The PCI-compatible
region can be accessed using either the mechanisms defined in the PCI specification or
using the enhanced PCI Express* configuration access mechanism described in the PCI
Express* Enhanced Configuration Mechanism section.
The PCI Express* Host Bridge is required to translate the memory-mapped PCI
Express* configuration space accesses from the host processor to PCI Express*
configuration cycles. To maintain compatibility with PCI configuration addressing
mechanisms, it is recommended that system software access the enhanced
configuration space using 32-bit operations (32-bit aligned) only.
See the PCI Express* Base Specification for details of both the PCI-compatible and PCI
Express* Enhanced configuration mechanisms and transaction rules.
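As a rough illustration of the enhanced (memory-mapped) configuration mechanism, the sketch below computes the address of a configuration register. It assumes the bus/device/function/offset bit layout defined by the PCI Express* Base Specification, which this section does not restate, and the base address used in the usage note is purely hypothetical. The alignment check reflects the 32-bit access recommendation above.

```python
CONFIG_SPACE_BYTES = 4096      # per device/function, as stated above
PCI_COMPATIBLE_BYTES = 256     # first 256 bytes form the PCI-compatible region


def ecam_address(base: int, bus: int, device: int, function: int, offset: int) -> int:
    """Return the memory-mapped address of one 32-bit configuration register.

    Assumed layout: bus in bits [27:20], device in [19:15], function in [14:12],
    register offset in [11:0], added to a platform-specific base address.
    """
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("bus/device/function out of range")
    if not 0 <= offset < CONFIG_SPACE_BYTES:
        raise ValueError("offset outside the 4096-byte configuration space")
    if offset % 4:
        raise ValueError("offset must be 32-bit aligned")
    return base + ((bus << 20) | (device << 15) | (function << 12) | offset)


def in_pci_compatible_region(offset: int) -> bool:
    """True if the offset falls in the first 256 bytes of configuration space."""
    return 0 <= offset < PCI_COMPATIBLE_BYTES
```

For example, with a hypothetical base of 0x80000000, `ecam_address(0x80000000, 1, 2, 3, 0x100)` addresses the first register of the extended region of bus 1, device 2, function 3.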
2.3 DMI2/PCI Express* Interface
Direct Media Interface 2 (DMI2) connects the processor to the Platform Controller Hub
(PCH). DMI2 is similar to a four-lane PCI Express* link supporting a speed of 5 GT/s per
lane. On processors not connected to a PCH, this interface can be configured at
power-on to serve as a x4 PCI Express* link, based on the setting of the
SOCKET_ID[1:0] and FRMAGENT signals.
Note: Only DMI2 x4 configuration is supported.
2.3.1 DMI2 Error Flow
DMI2 can only generate SERR in response to errors, never SCI, SMI, MSI, PCI INT, or
GPE. Any DMI2 related SERR activity is associated with Device 0.
2.3.2 Processor/PCH Compatibility Assumptions
The processor is compatible with the PCH and is not compatible with any previous MCH
or ICH products.
2.3.3 DMI2 Link Down
The DMI2 link going down is a fatal, unrecoverable error. If the DMI2 data link goes
down after the link was up, the DMI2 link hangs the system by not allowing the link
to retrain, which prevents data corruption. This is controlled by the PCH.
Downstream transactions that had been successfully transmitted across the link prior
to the link going down may be processed as normal. No completions from downstream,
non-posted transactions are returned upstream over the DMI2 link after a link down
event.
2.4 Intel QuickPath Interconnect
The Intel QuickPath Interconnect is a high speed, packetized, point-to-point
interconnect used in the 2nd Generation Intel® Core™ Processor Family. The
narrow high-speed links stitch together processors in distributed shared memory and
integrated I/O platform architecture. It offers much higher bandwidth with low latency.
The Intel QuickPath Interconnect has an efficient architecture allowing more
interconnect performance to be achieved in real systems. It has a snoop protocol
optimized for low latency and high scalability, as well as packet and lane structures
enabling quick completions of transactions. Reliability, availability, and serviceability
features (RAS) are built into the architecture.
The physical connectivity of each interconnect link is made up of twenty differential
signal pairs plus a differential forwarded clock. Each port supports a link pair consisting
of two uni-directional links to complete the connection between two components. This
supports traffic in both directions simultaneously. To facilitate flexibility and longevity,
the interconnect is defined as having five layers: Physical, Link, Routing, Transport, and
Protocol.
• The Physical layer consists of the actual wires carrying the signals, as well as
circuitry and logic to support ancillary features required in the transmission and
receipt of the 1s and 0s. The unit of transfer at the Physical layer is 20-bits, which
is called a Phit (for Physical unit).
• The Link layer is responsible for reliable transmission and flow control. The Link
layer’s unit of transfer is 80-bits, which is called a Flit (for Flow control unit).
• The Routing layer provides the framework for directing packets through the
fabric.
• The Transport layer is an architecturally defined layer (not implemented in the
initial products) providing advanced routing capability for reliable end-to-end
transmission.
• The Protocol layer is the high-level set of rules for exchanging packets of data
between devices. A packet is comprised of an integral number of Flits.
The Intel QuickPath Interconnect includes a cache coherency protocol to keep the
distributed memory and caching structures coherent during system operation. It
supports both low-latency source snooping and a scalable home snoop behavior. The
coherency protocol provides for direct cache-to-cache transfers for optimal latency.
2.5 Platform Environment Control Interface (PECI)
The Platform Environment Control Interface (PECI) uses a single wire for self-clocking
and data transfer. The bus requires no additional control lines. The physical layer is a
self-clocked one-wire bus that begins each bit with a driven, rising edge from an idle
level near zero volts. The duration of the signal driven high depends on whether the bit
value is a logic ‘0’ or logic ‘1’. PECI also includes variable data transfer rate established
with every message. In this way, it is highly flexible even though underlying logic is
simple.
The interface design was optimized for interfacing to Intel processor and chipset
components in both single processor and multiple processor environments. The single
wire interface provides low board routing overhead for the multiple load connections in
the congested routing area near the processor and chipset components. Bus speed,
error checking, and low protocol overhead provide adequate link bandwidth and
reliability to transfer critical device operating conditions and configuration information.
The PECI bus offers:
• A wide speed range from 2 Kbps to 2 Mbps
• CRC check byte used to efficiently and atomically confirm accurate data delivery
• Synchronization at the beginning of every message minimizes device timing
accuracy requirements
Note: The PECI commands described in this document apply primarily to the Intel® Xeon®
processor E5-1600/E5-2600/E5-4600 product families. The processors utilize the
capabilities described in this document to indicate support for four memory channels.
Refer to Table 2-1 for the list of PECI commands supported by the processors.
Table 2-1. Summary of Processor-specific PECI Commands
Command | Supported on the Processor
Ping() | Yes
GetDIB() | Yes
GetTemp() | Yes
RdPkgConfig() | Yes
WrPkgConfig() | Yes
RdIAMSR() | Yes
WrIAMSR() | No
RdPCIConfig() | Yes
WrPCIConfig() | No
RdPCIConfigLocal() | Yes
WrPCIConfigLocal() | Yes
2.5.1 PECI Client Capabilities
The processor PECI client is designed to support the following sideband functions:
• Processor and DRAM thermal management
• Platform manageability functions including thermal, power, and error monitoring
— The platform ‘power’ management includes monitoring and control for both the
processor and DRAM subsystem to assist with data center power limiting.
• Processor interface tuning and diagnostics capabilities (Intel® Interconnect BIST).
Processor fan speed control is managed by comparing Digital Thermal Sensor (DTS)
thermal readings acquired via PECI against the processor-specific fan speed control
reference point, or TCONTROL. Both TCONTROL and DTS thermal readings are accessible
via the processor PECI client. These variables are referenced to a common
temperature, the TCC activation point, and are both defined as negative offsets from
that reference.
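Because both quantities are negative offsets from the same TCC activation point, they can be compared directly without knowing the absolute temperature. The sketch below illustrates that comparison only; the function names and the simple "ramp when hotter than TCONTROL" policy are assumptions made for illustration, not a fan control algorithm defined by this document.

```python
def thermal_margin(dts_reading: float, t_control: float) -> float:
    """Degrees by which the die is cooler than the fan speed control reference.

    Both arguments are negative offsets (in degrees C) from the TCC activation
    point, so a more negative DTS reading means a cooler die.
    """
    return t_control - dts_reading


def fan_should_ramp(dts_reading: float, t_control: float) -> bool:
    """Hypothetical policy: ramp the fan once the die is hotter than TCONTROL."""
    return dts_reading > t_control
```

With TCONTROL at -10, a DTS reading of -25 leaves 15 degrees of margin, while a reading of -5 is above the reference point and would trigger the hypothetical ramp.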
PECI-based access to the processor package configuration space provides a means for
Baseboard Management Controllers (BMCs) or other platform management devices to
actively manage the processor and memory power and thermal features. Details on the
list of available power and thermal optimization services can be found in
Section 2.5.2.6.
2.5.1.2 Platform Manageability
PECI allows read access to certain error registers in the processor MSR space and
status monitoring registers in the PCI configuration space within the processor and
downstream devices. Details are covered in subsequent sections.
PECI permits writes to certain Memory Controller RAS-related registers in the processor
PCI configuration space. Details are covered in Section 2.5.2.10.
2.5.1.3 Processor Interface Tuning and Diagnostics
The processor Intel® Interconnect Built In Self Test (Intel® IBIST) allows for in-field
diagnostic capabilities in the Intel® QPI and memory controller interfaces. PECI
provides a port to execute these diagnostics via its PCI Configuration read and write
capabilities in the BMC INIT mode. Refer to Section 2.5.3.7 for more details.
2.5.2 Client Command Suite
Each PECI command requires at least one frame check sequence (FCS) byte to ensure
reliable data exchange between originator and client. The PECI message protocol
defines two FCS bytes that are returned by the client to the message originator. The
first FCS byte covers the client address byte, the Read and Write Length bytes, and all
bytes in the write data block. The second FCS byte covers the read response data
returned by the PECI client. The FCS byte is the result of a cyclic redundancy check
(CRC) of each data block.
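The FCS computation can be sketched as a bitwise CRC-8. The polynomial is not stated in this section; the sketch assumes x^8 + x^2 + x + 1 (0x07) with a zero initial value, an assumption that happens to reproduce the FCS bytes shown in this section's Ping() example (0xe1) and GetTemp() example (0xef and 0x4b). The function name is illustrative.

```python
def peci_fcs(data: bytes) -> int:
    """Compute a CRC-8 over `data` (assumed polynomial 0x07, initial value 0)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ 0x07) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


# Write FCS of the Ping() example: client address 0x30, WL 0x00, RL 0x00.
ping_fcs = peci_fcs(bytes([0x30, 0x00, 0x00]))
```

The first FCS is computed over the client address, the Write Length and Read Length bytes, and the write data block; the second is computed over the read response data only.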
2.5.2.1 Ping()
Ping() is a required message for all PECI devices. This message is used to enumerate
devices or determine if a device has been removed, been powered-off, etc. A Ping()
sent to a device address always returns a non-zero Write FCS if the device at the
targeted address is able to respond.
2.5.2.1.1 Command Format
The Ping() format is as follows:
Write Length: 0x00
Read Length: 0x00
Figure 2-3. Ping()
Byte # | Byte Definition
0 | Client Address
1 | Write Length (0x00)
2 | Read Length (0x00)
3 | FCS
An example Ping() command to PECI device address 0x30 is shown below.
Figure 2-4. Ping() Example
Byte # | Value
0 | 0x30
1 | 0x00
2 | 0x00
3 | 0xe1
2.5.2.2 GetDIB()
The processor PECI client implementation of GetDIB() includes an 8-byte response and
provides information regarding client revision number and the number of supported
domains. All processor PECI clients support the GetDIB() command.
Figure 2-5. GetDIB()
Byte # | Byte Definition
0 | Client Address
1 | Write Length (0x01)
2 | Read Length (0x08)
3 | Cmd Code (0xf7)
4 | FCS
5 | Device Info
6 | Revision Number
7-12 | Reserved
13 | FCS
The Device Info byte gives details regarding the PECI client configuration. At a
minimum, all clients supporting GetDIB will return the number of domains inside the
package via this field. With any client, at least one domain (Domain 0) must exist.
Therefore, the Number of Domains reported is defined as the number of domains in
addition to Domain 0. For example, if bit 2 of the Device Info byte returns a ‘1’, that
would indicate that the PECI client supports two domains.
Figure 2-6. Device Info Field Definition
2.5.2.2.3 Revision Number
All clients that support the GetDIB command also support Revision Number reporting.
The revision number may be used by a host or originator to manage different command
suites or response codes from the client. Revision Number is always reported in the
second byte of the GetDIB() response. The ‘Major Revision’ number in Figure 2-7
always maps to the revision number of the PECI specification that the PECI client
processor is designed to. The ‘Minor Revision’ number value depends on the exact
command suite supported by the PECI client as defined in Table 2-2.
For the processor PECI client the Revision Number will return ‘0011 0100b’.
2.5.2.3 GetTemp()
The GetTemp() command is used to retrieve the maximum die temperature from a
target PECI address. The temperature is used by the external thermal management
system to regulate the temperature on the die. The data is returned as a negative
value representing the number of degrees centigrade below the maximum processor
junction temperature (Tjmax). The maximum PECI temperature value of zero
corresponds to the processor Tjmax. This also represents the default temperature at
which the processor Thermal Control Circuit activates. The actual value that the
thermal management system uses as a control set point (TCONTROL) is also defined as a
negative number below Tjmax. TCONTROL may be extracted from the processor by
issuing a PECI RdPkgConfig() command as described in Section 2.5.2.4 or using a
RDMSR instruction. TCONTROL application to fan speed control management is defined in
the Intel® Xeon® Processor E5-1600/E5-2600/E5-4600 Product Families Thermal/Mechanical Design Guide.
Please refer to Section 2.5.7 for details regarding PECI temperature data formatting.
2.5.2.3.1 Command Format
The GetTemp() format is as follows:
Write Length: 0x01
Read Length: 0x02
Command: 0x01
Description: Returns the highest die temperature for addressed processor PECI client.
Example bus transaction for a thermal sensor device located at address 0x30 returning
a value of negative 10 counts is shown in Figure 2-9.
Figure 2-9. GetTemp() Example
Byte # | Value
0 | 0x30
1 | 0x01
2 | 0x02
3 | 0x01
4 | 0xef
5 | 0x80
6 | 0xfd
7 | 0x4b
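The two temperature bytes in the GetTemp() example (0x80, then 0xfd) can be decoded as sketched below. The sketch assumes the data is a 16-bit two's complement value sent LSB first with a resolution of 1/64 degree per count; the authoritative format is the one given in Section 2.5.7, which is not reproduced here. Under that assumption the example bytes decode to the negative 10 stated above.

```python
def decode_gettemp(lsb: int, msb: int) -> float:
    """Decode GetTemp() data bytes into degrees C relative to Tjmax.

    Assumes a 16-bit two's complement value, LSB first, 1/64 degree per count.
    """
    raw = ((msb & 0xFF) << 8) | (lsb & 0xFF)
    if raw & 0x8000:          # negative value: undo two's complement
        raw -= 0x10000
    return raw / 64.0


# Bytes 5 and 6 of the example transaction.
example_offset = decode_gettemp(0x80, 0xFD)
```

A result of 0.0 corresponds to Tjmax; any valid reading is zero or negative.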
2.5.2.3.2 Supported Responses
The typical client response is a passing FCS and valid thermal data. Under some
conditions, the client’s response will indicate a failure. GetTemp() response definitions
are listed in Table 2-3. Refer to Section 2.5.7.4 for more details on sensor errors.
Table 2-3. GetTemp() Response Definition
Response | Meaning
General Sensor Error (GSE) (1) | Thermal scan did not complete in time. Retry is appropriate.
Bad Write FCS | Electrical error
Abort FCS | Illegal command formatting (mismatched RL/WL/Command Code)
0x0000 (1) | Processor is running at its maximum temperature or is currently being reset.
All other data | Valid temperature reading, reported as a negative offset from the processor Tjmax.
Notes:
1. This response will be reflected in Bytes 5 & 6 in Figure 2-9.
2.5.2.4 RdPkgConfig()
The RdPkgConfig() command provides read access to the package configuration space
(PCS) within the processor, including various power and thermal management
functions. Typical PCS read services supported by the processor may include access to
temperature data, energy status, run time information, DIMM temperatures and so on.
Refer to Section 2.5.2.6 for more details on processor-specific services supported
through this command.
2.5.2.4.1 Command Format
The RdPkgConfig() format is as follows:
Write Length: 0x05
Read Length: 0x05 (dword)
Command: 0xa1
Description: Returns the data maintained in the processor package configuration
space for the PCS entry as specified by the ‘index’ and ‘parameter’ fields. The ‘index’
field contains the encoding for the requested service and is used in conjunction with the
‘parameter’ field to specify the exact data being requested. The Read Length dictates
the desired data return size. This command supports only dword responses on the
processor PECI clients. All command responses are prepended with a completion code
that contains additional pass/fail status information. Refer to Section 2.5.5.2 for details
regarding completion codes.
Figure 2-10. RdPkgConfig()
Note: The 2-byte parameter field and 4-byte read data field defined in Figure 2-10 are sent in standard PECI ordering with LSB
first and MSB last.
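A RdPkgConfig() request could be assembled roughly as follows. The byte order is an assumption, since the contents of Figure 2-10 are not reproduced in this text: client address, Write Length, Read Length, command code, a "Host ID & Retry" byte (assumed here so that the five write bytes add up to the stated Write Length), the index, the parameter LSB first, and the write FCS. The FCS helper assumes a CRC-8 with polynomial 0x07; all function names are illustrative.

```python
def peci_fcs(data: bytes) -> int:
    """CRC-8 with assumed polynomial 0x07 and initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ 0x07) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc


def build_rdpkgconfig(client_addr: int, index: int, parameter: int,
                      host_id_retry: int = 0x00) -> bytes:
    """Assemble a RdPkgConfig() request frame (assumed layout, see lead-in)."""
    if not 0 <= index <= 0xFF or not 0 <= parameter <= 0xFFFF:
        raise ValueError("index must fit in a byte, parameter in a word")
    body = bytes([
        client_addr,
        0x05,                      # Write Length
        0x05,                      # Read Length (completion code + dword)
        0xA1,                      # RdPkgConfig() command code
        host_id_retry,             # assumed Host ID & Retry byte
        index,
        parameter & 0xFF,          # parameter LSB first
        (parameter >> 8) & 0xFF,   # parameter MSB last
    ])
    return body + bytes([peci_fcs(body)])
```

For instance, `build_rdpkgconfig(0x30, index=4, parameter=0x00FF)` would request the accumulated DRAM energy for all channels from the client at address 0x30.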
2.5.2.4.2 Supported Responses
The typical client response is a passing FCS, a passing Completion Code and valid data.
Under some conditions, the client’s response will indicate a failure.
Response | Meaning
CC: 0x40 | Command passed, data is valid.
CC: 0x80 | Response timeout. The processor is not able to generate the required response in a timely fashion. Retry is appropriate.
CC: 0x81 | Response timeout. The processor is not able to allocate resources for servicing this command at this time. Retry is appropriate.
CC: 0x90 | Unknown/Invalid/Illegal Request
CC: 0x91 | PECI control hardware, firmware or associated logic error. The processor is unable to process the request.
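A host might act on these completion codes along the lines of the sketch below. The code values come from the response definitions above; the category names and the choice to treat any unlisted code as "unknown" are assumptions for illustration.

```python
PASS_CODES = {0x40}
RETRY_CODES = {0x80, 0x81}      # response timeouts: retry is appropriate
FAIL_CODES = {0x90, 0x91}       # illegal request or PECI logic error


def classify_completion_code(cc: int) -> str:
    """Map a completion code to 'pass', 'retry', 'fail' or 'unknown'."""
    if cc in PASS_CODES:
        return "pass"
    if cc in RETRY_CODES:
        return "retry"
    if cc in FAIL_CODES:
        return "fail"
    return "unknown"
```

A caller would typically reissue the command on "retry" and report an error on "fail".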
2.5.2.5 WrPkgConfig()
The WrPkgConfig() command provides write access to the package configuration space
(PCS) within the processor, including various power and thermal management
functions. Typical PCS write services supported by the processor may include power
limiting, thermal averaging constant programming and so on. Refer to Section 2.5.2.6
for more details on processor-specific services supported through this command.
2.5.2.5.1 Command Format
The WrPkgConfig() format is as follows:
Write Length: 0x0a (dword)
Read Length: 0x01
Command: 0xa5
AW FCS Support: Yes
Description: Writes data to the processor PCS entry as specified by the ‘index’ and
‘parameter’ fields. This command supports only dword data writes on the processor
PECI clients. All command responses include a completion code that provides additional
pass/fail status information. Refer to Section 2.5.5.2 for details regarding completion
codes.
The Assured Write FCS (AW FCS) support provides the processor client a high degree of
confidence that the data it received from the host is correct. This is especially critical
where the consumption of bad data might result in improper or non-recoverable
operation.
Figure 2-11. WrPkgConfig()
Note: The 2-byte parameter field and 4-byte write data field defined in Figure 2-11 are sent in standard PECI
ordering with LSB first and MSB last.
2.5.2.5.2 Supported Responses
The typical client response is a passing FCS, a passing Completion Code and valid data.
Under some conditions, the client’s response will indicate a failure.
Table 2-5. WrPkgConfig() Response Definition
Response | Meaning
CC: 0x80 | Response timeout. The processor was not able to generate the required response in a timely fashion. Retry is appropriate.
CC: 0x81 | Response timeout. The processor is not able to allocate resources for servicing this command at this time. Retry is appropriate.
CC: 0x90 | Unknown/Invalid/Illegal Request
CC: 0x91 | PECI control hardware, firmware or associated logic error. The processor is unable to process the request.
2.5.2.6 Package Configuration Capabilities
Table 2-6 combines both read and write services. Any service listed as a “read” would
use the RdPkgConfig() command and a service listed as a “write” would use the
WrPkgConfig() command. PECI requests for memory temperature or other data
generated outside the processor package do not trigger special polling cycles on the
processor memory or SMBus interfaces to procure the required information.
2.5.2.6.1 DRAM Thermal and Power Optimization Capabilities
DRAM thermal and power optimization (also known as RAPL or “Running Average
Power Limit”) services provide a way for platform thermal management solutions to
program and access DRAM power, energy and temperature parameters. Memory
temperature information is typically used to regulate fan speeds, tune refresh rates and
throttle the memory subsystem as appropriate. Memory temperature data may be
derived from a variety of sources including on-die or on-board DIMM sensors, DRAM
activity information or a combination of the two. Though memory temperature data is
one byte long, the range of actual temperature values is determined by the DIMM
specifications and operating range.
Note: DRAM related PECI services described in this section apply only to the memory
connected to the specific processor PECI client in question and not the overall platform
memory in general. For estimating DRAM thermal information in closed loop throttling
mode, a dedicated SMBus is required between the CPU and the DIMMs. The processor
PCU requires access to the VR12 voltage regulator for reading average output current
information through the SVID bus for initial DRAM RAPL related power tuning.
Table 2-6 provides a summary of the DRAM power and thermal optimization capabilities
that can be accessed over PECI on the processor. The Index values referenced in
Table 2-6 are in decimal format.
Table 2-6 also provides information on alternate inband mechanisms to access similar
or equivalent information through register reads and writes where applicable. The user
should consult the Intel® 64 and IA-32 Architectures Software Developer’s Manual
(SDM) Volumes 1, 2, and 3 or Intel® Xeon® Processor E5 Product Family Datasheet
Volume Two for details on MSR and CSR register contents.
Table 2-6. RdPkgConfig() & WrPkgConfig() DRAM Thermal and Power Optimization Services Summary
Service | Index Value (decimal) | Parameter Value (word) | RdPkgConfig() Data (dword) | WrPkgConfig() Data (dword) | Description | Alternate Inband MSR or CSR Access
DRAM Rank Temperature Write | 18 | Channel Index & DIMM Index | N/A | Absolute temperature in Degrees Celsius for ranks 0, 1, 2 & 3 | Write temperature for each rank within a single DIMM. | N/A
DIMM Temperature Read | 14 | Channel Index | Absolute temperature in Degrees Celsius for DIMMs 0, 1, & 2 | N/A | Read temperature of each DIMM within a channel. | CSR: DIMMTEMPSTAT_[0:2]
DIMM Ambient Temperature Write / Read | 19 | 0x0000 | N/A | Absolute temperature in Degrees C to be used as ambient temperature reference | Write ambient temperature reference for activity-based rank temperature estimation. | N/A
DIMM Ambient Temperature Write / Read | 19 | 0x0000 | Absolute temperature in Degrees C to be used as ambient temperature reference | N/A | Read ambient temperature reference for activity-based rank temperature estimation. | N/A
DRAM Channel Temperature Read | 22 | 0x0000 | Maximum of all rank temperatures for each channel in Degrees Celsius | N/A | Read the maximum DRAM channel temperature. | N/A
Accumulated DRAM Energy Read | 04 | Channel Index; 0x00FF - All Channels | DRAM energy consumed by the DIMMs | N/A | Read the DRAM energy consumed by all the DIMMs in all the channels or all the DIMMs within a specified channel. | MSR 619h: DRAM_ENERGY_STATUS; CSR: DRAM_ENERGY_STATUS; CSR: DRAM_ENERGY_STATUS_CH[0:3]
DRAM Power Info Read | 35 | 0x0000 | Typical and minimum DRAM power settings | N/A | Read DRAM power settings info to be used by power limiting entity. | MSR 61Ch: DRAM_POWER_INFO; CSR: DRAM_POWER_INFO
DRAM Power Info Read | 36 | 0x0000 | Maximum DRAM power settings & maximum time window | N/A | Read DRAM power settings info to be used by power limiting entity. | MSR 61Ch: DRAM_POWER_INFO; CSR: DRAM_POWER_INFO
DRAM Power Limit Data Write / Read | 34 | 0x0000 | N/A | DRAM Plane Power Limit Data | Write DRAM Power Limit Data | MSR 618h: DRAM_POWER_LIMIT; CSR: DRAM_PLANE_POWER_LIMIT
DRAM Power Limit Data Write / Read | 34 | 0x0000 | DRAM Plane Power Limit Data | N/A | Read DRAM Power Limit Data | MSR 618h: DRAM_POWER_LIMIT; CSR: DRAM_PLANE_POWER_LIMIT
DRAM Power Limit Performance Status Read | 38 | 0x0000 | Accumulated DRAM throttle time | N/A | Read sum of all time durations for which each DIMM has been throttled | CSR: DRAM_RAPL_PERF_STATUS
Notes:
1. Time, energy and power units should be assumed, where applicable, to be based on values returned by a read of the
PACKAGE_POWER_SKU_UNIT MSR or through the Package Power SKU Unit PCS read service.
2.5.2.6.2 DRAM Thermal Estimation Configuration Data Read/Write
This feature is relevant only when activity-based DRAM temperature estimation
methods are being utilized and would apply to all the DIMMs on all the memory
channels. The write allows the PECI host to configure the ‘β’ and ‘θ’ variables in
Figure 2-12 for DRAM channel temperature filtering as per the equation below:
$T_N = \beta \cdot T_{N-1} + \theta \cdot \Delta\text{Energy}$
$T_N$ and $T_{N-1}$ are the current and previous DRAM temperature estimates respectively in
degrees Celsius, ‘β’ is the DRAM temperature decay factor, ‘ΔEnergy’ is the energy
difference between the current and previous memory transactions as determined by
the processor power control unit and ‘θ’ is the DRAM energy-to-temperature translation
coefficient. The default value of ‘β’ is 0x3FF. ‘θ’ is defined by the equation:
The ‘Thermal Resistance’ serves as a multiplier for translation of DRAM energy changes
to corresponding temperature changes and may be derived from actual platform
characterization data. The ‘Scaling Factor’ is used to convert memory transaction
information to energy units in Joules and can be derived from system/memory
configuration information. Refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM) Volumes 1, 2, and 3 for methods to program and access
‘Scaling Factor’ information.
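Figure 2-12. DRAM Thermal Estimation Configuration Data
Memory Thermal Estimation Configuration Data
Bits | Field
[31:20] | RESERVED
[19:10] | THETA VARIABLE
[9:0] | BETA VARIABLE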
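The filtering equation and the configuration dword can be sketched as below. The field positions (β in bits [9:0], θ in bits [19:10]) follow the Figure 2-12 layout; how the 10-bit register values map to the real-valued coefficients is not specified in this section, so `next_temperature` simply takes the coefficients as floating-point numbers. Function names are illustrative.

```python
def pack_thermal_config(beta: int, theta: int) -> int:
    """Pack 10-bit beta and theta register values into the configuration dword."""
    if not 0 <= beta <= 0x3FF or not 0 <= theta <= 0x3FF:
        raise ValueError("beta and theta are 10-bit fields")
    return (theta << 10) | beta


def unpack_thermal_config(dword: int) -> tuple:
    """Return (beta, theta) register values from the configuration dword."""
    return dword & 0x3FF, (dword >> 10) & 0x3FF


def next_temperature(t_prev: float, beta: float, theta: float,
                     delta_energy: float) -> float:
    """One filter step: T_N = beta * T_(N-1) + theta * delta_energy."""
    return beta * t_prev + theta * delta_energy
```

For example, with an assumed decay factor of 0.5 and an energy-to-temperature coefficient of 2.0, a previous estimate of 40.0 and an energy difference of 3.0 give a new estimate of 26.0.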
2.5.2.6.3 DRAM Rank Temperature Write
This feature allows the PECI host to program into the processor the temperature for all
the ranks within a DIMM up to a maximum of four ranks as shown in Figure 2-13. The
DIMM index and Channel index are specified through the parameter field as shown in
Table 2-7. This write is relevant in platforms that do not have on-die or on-board
DIMM thermal sensors to provide memory temperature information or if the processor
does not have direct access to the DIMM thermal sensors. This temperature
information is used by the processor in conjunction with the activity-based DRAM
temperature estimations.
Table 2-7. Channel & DIMM Index Decoding
Index Encoding | Physical Channel# | Physical DIMM#
000 | 0 | 0
001 | 1 | 1
010 | 2 | 2
011 | 3 | Reserved
Figure 2-13. DRAM Rank Temperature Write Data
2.5.2.6.4 DIMM Temperature Read
This feature allows the PECI host to read the temperature of all the DIMMs within a
channel up to a maximum of three DIMMs. This read is not limited to platforms using a
particular memory temperature source or temperature estimation method. For
platforms using DRAM thermal estimation, the PCU will provide the estimated
temperatures. Otherwise, the data represents the latest DIMM temperature provided
by the TSOD or on-board DIMM sensor and requires that CLTT (closed loop throttling
mode) be enabled and OLTT (open loop throttling mode) be disabled. Refer to Table 2-7
for channel index encodings.
Figure 2-14. The Processor DIMM Temperature Read / Write
DIMM Temperature Data
Bits | Field
[31:24] | Reserved
[23:16] | DIMM# 2 Absolute Temp (in Degrees C)
[15:8] | DIMM# 1 Absolute Temp (in Degrees C)
[7:0] | DIMM# 0 Absolute Temp (in Degrees C)
Parameter format
Bits | Field
[15:3] | Reserved
[2:0] | Channel Index
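Unpacking the DIMM Temperature Read data could look like the sketch below, which assumes the Figure 2-14 layout of one byte per DIMM (DIMM# 0 in bits [7:0], DIMM# 1 in [15:8], DIMM# 2 in [23:16]) and a channel index in bits [2:0] of the parameter word, limited to the channels 0 to 3 encoded in Table 2-7. Function names are illustrative.

```python
def dimm_temp_parameter(channel_index: int) -> int:
    """Build the parameter word for a DIMM Temperature Read (channel in bits [2:0])."""
    if not 0 <= channel_index <= 3:
        raise ValueError("Table 2-7 defines channel encodings 0 to 3 only")
    return channel_index & 0x7


def decode_dimm_temperatures(dword: int) -> tuple:
    """Return absolute temperatures (degrees C) for DIMM# 0, 1 and 2."""
    return tuple((dword >> (8 * dimm)) & 0xFF for dimm in range(3))


# Example dword: DIMM# 0 at 40 C, DIMM# 1 at 45 C, DIMM# 2 at 50 C.
temps = decode_dimm_temperatures(0x00322D28)
```

The reserved byte in bits [31:24] is ignored by the decoder.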
2.5.2.6.5 DIMM Ambient Temperature Write / Read
This feature allows the PECI host to provide an ambient temperature reference to be
used by the processor for activity-based DRAM temperature estimation. This write is
used only when no DIMM temperature information is available from on-board or on-die
DIMM thermal sensors. It is also possible for the PECI host controller to read back the
DIMM ambient reference temperature.
Since the ambient temperature may vary over time within a system, it is recommended
that systems monitoring and updating the ambient temperature at a fast rate use the
‘maximum’ temperature value while those updating the ambient temperature at a slow
rate use an ‘average’ value. The ambient temperature assumes a single value for all
memory channel/DIMM locations and does not account for possible temperature
variations based on DIMM location.
Figure 2-15. Ambient Temperature Reference Data
Bits | Field
[31:8] | Reserved
[7:0] | Ambient Temperature (in Degrees C)
2.5.2.6.6 DRAM Channel Temperature Read
This feature enables a PECI host read of the maximum temperature of each channel.
This would include all the DIMMs within the channel and all the ranks within each of the
DIMMs. Channels that are not populated will return the ‘ambient temperature’ on
systems using activity-based temperature estimations or alternatively return a ‘zero’
for systems using sensor-based temperatures.
2.5.2.6.7 Accumulated DRAM Energy Read
This feature allows the PECI host to read the DRAM energy consumed by all the DIMMs
within all the channels or all the DIMMs within just a specified channel. The parameter
field is used to specify the channel index. Units used are defined as per the Package
Power SKU Unit read described in Section 2.5.2.6.11. This information is tracked by a
32-bit counter that wraps around. The channel index in Figure 2-17 is specified as per
the index encoding described in Table 2-7. A channel index of 0x00FF is used to specify
the “all channels” case. While Intel requires reading the accumulated energy data at
least once every 16 seconds to ensure functional correctness, a more realistic polling
rate recommendation is once every 100 ms for better accuracy. This feature assumes a
200W memory capacity. In general, as the power capability decreases, so will the
minimum polling rate requirement.
When determining energy changes by subtracting energy values between successive
reads, Intel advocates using the 2’s complement method to account for counter wraparounds. Alternatively, adding all ‘F’s (‘0xFFFFFFFF’) to a negative result from the
subtraction will accomplish the same goal.
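The wraparound handling can be sketched as a modulo-2^32 subtraction, which is the 2's complement method described above. Note that adding 0xFFFFFFFF to a negative difference yields a result one count lower than the modulo-2^32 value, a difference that is negligible for energy accounting. The sketch returns raw counter units; conversion to Joules depends on the Package Power SKU Unit read and is not shown. The function name is illustrative.

```python
COUNTER_MASK = 0xFFFFFFFF  # the energy counter is 32 bits wide and wraps around


def energy_delta(previous: int, current: int) -> int:
    """Energy consumed between two reads, in raw counter units.

    Valid as long as the counter wrapped at most once between the reads,
    which is why the counter must be polled often enough.
    """
    return (current - previous) & COUNTER_MASK
```

For example, a previous read of 0xFFFFFFF0 followed by a read of 0x00000010 corresponds to 0x20 counts consumed across the wraparound.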
Figure 2-17. Accumulated DRAM Energy Data
2.5.2.6.8 DRAM Power Info Read
This read returns the minimum, typical and maximum DRAM power settings and the
maximum time window over which the power can be sustained for the entire DRAM
domain and is inclusive of all the DIMMs within all the memory channels. Any power
values specified by the power limiting entity that are outside of the range specified
through these settings cannot be guaranteed. Since this data is 64 bits wide, PECI
facilitates access to this register by allowing two requests to read the lower 32 bits and
upper 32 bits separately as shown in Table 2-6. Power and time units for this read are
defined as per the Package Power SKU Unit settings described in Section 2.5.2.6.11.
The minimum DRAM power in Figure 2-18 corresponds to a minimum bandwidth
setting of the memory interface. It does ‘not’ correspond to a processor IDLE or
memory self-refresh state. The ‘time window’ in Figure 2-18 is representative of the
rate at which the power control unit (PCU) samples the DRAM energy consumption
information and reactively takes the necessary measures to meet the imposed power
limits. Programming too small a time window may not give the PCU enough time to
sample energy information and enforce the limit while too large a time window runs the
risk of the PCU not being able to monitor and take timely action on energy excursions.
While the DRAM power setting in Figure 2-18 provides a maximum value for the ‘time
window’ (typically a few seconds), the minimum value may be assumed to be
~100 ms.
The PCU programs the DRAM power settings described in Figure 2-18 when DRAM
characterization has been completed by the memory reference code (MRC) during boot
as indicated by the setting of the RST_CPL bit of the BIOS_RESET_CPL register. The
DRAM power settings will be programmed during boot independent of the ‘DRAM Power
Limit Enable’ bit setting. Please refer to the Intel® Xeon® Processor E5 Product Family Datasheet Volume Two for information on memory energy estimation methods and
energy tuning options used by BIOS and other utilities for determining the range
specified in the DRAM power settings. In general, any tuning of the power settings is
done by polling the voltage regulators supplying the DIMMs.
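Figure 2-18. DRAM Power Info Read Data
DRAM_POWER_INFO (lower bits)
Bits | Field
[31] | Reserved
[30:16] | Minimum DRAM Power
[15] | Reserved
[14:0] | TDP DRAM Power (Typical Value)
DRAM_POWER_INFO (upper bits)
Bits | Field
[63:55] | Reserved
[54:48] | Maximum Time Window
[47] | Reserved
[46:32] | Maximum DRAM Power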
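Since the 64-bit DRAM_POWER_INFO data is read as two separate dwords, a host could combine and decode them as sketched below. The field positions follow the Figure 2-18 layout; the values returned are raw register fields, and converting them to Watts or seconds requires the Package Power SKU Unit settings, which are not covered here. The function and key names are illustrative.

```python
def decode_dram_power_info(lower: int, upper: int) -> dict:
    """Extract raw fields from the lower and upper DRAM_POWER_INFO dwords."""
    return {
        "tdp_power": lower & 0x7FFF,               # bits [14:0]
        "min_power": (lower >> 16) & 0x7FFF,       # bits [30:16]
        "max_power": upper & 0x7FFF,               # bits [46:32]
        "max_time_window": (upper >> 16) & 0x7F,   # bits [54:48]
    }


# Example with arbitrary field values, for illustration only.
info = decode_dram_power_info((0x0123 << 16) | 0x0456, (0x2A << 16) | 0x0789)
```

The reserved bits ([15], [31], [47] and [63:55]) are masked out by the decoder.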
2.5.2.6.9 DRAM Power Limit Data Write / Read
This feature allows the PECI host to program the power limit over a specified time or
control window for the entire DRAM domain covering all the DIMMs within all the
memory channels. Actual values are chosen based on DRAM power consumption
characteristics. The units for the DRAM Power Limit and Control Time Window are
determined as per the Package Power SKU Unit settings described in
Section 2.5.2.6.11. The DRAM Power Limit Enable bit in Figure 2-19 should be set to
activate this feature. Exact DRAM power limit values are largely determined by platform
memory configuration. As such, this feature is disabled by default and there are no
defaults associated with the DRAM power limit values. The PECI host may be used to
enable and initialize the power limit fields for the purposes of DRAM power budgeting.
Alternatively, this can also be accomplished through inband writes to the appropriate
registers. Both power limit enabling and initialization of power limit values can be done
in the same command cycle. All RAPL parameter values including the power limit value,
control time window, and enable bit will have to be specified correctly even if the intent
is to change just one parameter value when programming over PECI.
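Because every field has to be supplied on each write, it can help to assemble the complete DRAM power limit dword in one place. The following is a minimal sketch, assuming the field positions of Figure 2-19 (power limit in bits [14:0], enable in bit 15, control time window in bits [23:17]) and the default power unit of 1/8 W. The function name and its parameters are hypothetical and are not part of any PECI library.

```python
def build_dram_power_limit(limit_w: float, time_window_field: int,
                           enable: bool, power_unit_w: float = 0.125) -> int:
    """Pack a complete DRAM power limit dword (hypothetical helper).

    limit_w            power limit in watts
    time_window_field  already-encoded 7-bit Control Time Window value
    enable             state of the DRAM Power Limit Enable bit
    power_unit_w       power unit in watts (1/8 W is the documented default)
    """
    raw_limit = round(limit_w / power_unit_w)
    if not 0 <= raw_limit <= 0x7FFF:
        raise ValueError("power limit does not fit in bits [14:0]")
    if not 0 <= time_window_field <= 0x7F:
        raise ValueError("time window does not fit in bits [23:17]")
    # Bits [16] and [31:24] are reserved and left at zero.
    return raw_limit | (int(enable) << 15) | (time_window_field << 17)


# 40 W limit, control time value 0x0A, feature enabled.
print(hex(build_dram_power_limit(40.0, 0x0A, True)))
```

Passing all three values on every call mirrors the requirement that the power limit, control time window and enable bit are all specified together, even when only one of them changes.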
The following conversion formula should be used for encoding or programming the
‘Control Time Window’ in bits [23:17].
Control Time Window (in seconds) = ([1 + 0.25 * ‘x’] * 2^‘y’) * ‘z’ where
‘x’ = integer value of bits[23:22]
‘y’ = integer value of bits[21:17]
‘z’ = Package Power SKU Time Unit[19:16] (see Section 2.5.2.6.13 for details on
Package Power SKU Unit)
For example, using this formula, a control time value of 0x0A will correspond to a
‘1-second’ time window. A valid range for the value of the ‘Control Time Window’ in
Figure 2-19 that can be programmed into bits [23:17] is 250 mS - 40 seconds.
From a DRAM power management standpoint, all post-boot DRAM power management
activities (also referred to as ‘DRAM RAPL’ or ‘DRAM Running Average Power Limit’)
should be managed exclusively through a single interface like PECI or alternatively an
inband mechanism. If PECI is being used to manage DRAM power budgeting activities,
BIOS should lock out all subsequent inband DRAM power limiting accesses by setting
bit 31 of the DRAM_POWER_LIMIT MSR or DRAM_PLANE_POWER_LIMIT CSR to ‘1’.
Figure 2-19. DRAM Power Limit Data
  Bits [14:0]   DRAM Power Limit
  Bit  [15]     DRAM Power Limit Enable
  Bit  [16]     Reserved
  Bits [23:17]  Control Time Window
  Bits [31:24]  Reserved
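The ‘Control Time Window’ conversion in this section can be sketched in Python as follows. The sketch assumes that ‘z’ is the time unit in seconds, that is 1 s / 2^(Time Unit field), which is consistent with the 0x0A example above when the default time unit of 1010b is used. Function and parameter names are hypothetical.

```python
def decode_time_window(field: int, time_unit_exp: int = 10) -> float:
    """Convert the 7-bit Control Time Window field to seconds."""
    x = (field >> 5) & 0x3          # bits [23:22] of the dword
    y = field & 0x1F                # bits [21:17] of the dword
    z = 1.0 / (1 << time_unit_exp)  # assumed: time unit expressed in seconds
    return (1 + 0.25 * x) * (1 << y) * z


def encode_time_window(seconds: float, time_unit_exp: int = 10) -> int:
    """Return the 7-bit field whose decoded value is closest to `seconds`."""
    return min(range(128),
               key=lambda f: abs(decode_time_window(f, time_unit_exp) - seconds))


print(decode_time_window(0x0A))        # control time value from the example
print(hex(encode_time_window(40.0)))   # upper end of the documented range
```

A brute-force search over all 128 encodings is used for the encoder because the field is small and this avoids rounding decisions about ‘x’ and ‘y’.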
2.5.2.6.10 DRAM Power Limit Performance Status Read
This service allows the PECI host to assess the performance impact of the currently
active DRAM power limiting modes. The read return data contains the sum of all the
time durations for which each of the DIMMs has been operating in a low power state.
This information is tracked by a 32-bit counter that wraps around. The unit for time is
determined as per the Package Power SKU Unit settings described in
Section 2.5.2.6.11. The DRAM performance data does not account for stalls on the
memory interface.
In general, for the purposes of DRAM RAPL, the DRAM power management entity
should use PECI accesses to DRAM energy and performance status in conjunction with
the power limiting feature to budget power between the various memory sub-systems
in the server system.
Figure 2-20. DRAM Power Limit Performance Data
  Bits [31:0]   Accumulated DRAM Throttle Time
2.5.2.6.11 CPU Thermal and Power Optimization Capabilities
Table 2-8 provides a summary of the processor power and thermal optimization
capabilities that can be accessed over PECI.
Note: The Index values referenced in Table 2-8 are in decimal format.
Table 2-8 also provides information on alternate inband mechanisms to access similar
or equivalent information for register reads and writes where applicable. The user
should consult the appropriate Intel® 64 and IA-32 Architectures Software Developer’s
Manual (SDM) Volumes 1, 2, and 3 or Intel® Xeon® Processor E5 Product Family
Datasheet Volume Two for exact details on MSR or CSR register content.
Table 2-8. RdPkgConfig() & WrPkgConfig() CPU Thermal and Power Optimization Services Summary (Sheet 1 of 3)
Service | Index Value (decimal) | Parameter Value (word) | RdPkgConfig() Data (dword) | WrPkgConfig() Data (dword) | Description | Alternate Inband MSR or CSR Access
Package Identifier Read | 00 | 0x0000 | CPUID Information | N/A | Returns processor-specific information including CPU family, model and stepping information. | Execute CPUID instruction to get processor signature
Package Identifier Read | 00 | 0x0001 | Platform ID | N/A | Used to ensure microcode update compatibility with processor. |
Package Identifier Read | 00 | 0x0002 | PCU Device ID | N/A | Returns the Device ID information for the processor Power Control Unit. |
Package Identifier Read | 00 | 0x0003 | Max Thread ID | N/A | Returns the maximum ‘Thread ID’ value supported by the processor. |
Package Identifier Read | 00 | 0x0004 | CPU Microcode Update Revision | N/A | Returns processor microcode and PCU firmware revision information. |
Package Identifier Read | 00 | 0x0005 | MCA Error Source Log | N/A | Returns the MCA Error Source Log. |
Package Power SKU Unit Read | 30 | 0x0000 | Time, Energy and Power Units | N/A | Read units for power, energy and time used in power control registers. |
Package Power SKU Read | 28 | 0x0000 | Package Power SKU[31:0] | N/A | Returns Thermal Design Power and minimum package power values for the processor SKU. |
Package Power SKU Read | 29 | 0x0000 | Package Power SKU[63:32] | N/A | Returns the maximum package power value for the processor SKU and the maximum time interval for which it can be sustained. |
“Wake on PECI” Mode Bit Write / Read | 05 | 0x0001 - Set, 0x0000 - Reset | N/A | “Wake on PECI” mode bit | Enables package pop-up to C2 to service PECI PCIConfig() accesses if appropriate. |
2.5.2.6.12 Package Identifier Read
This feature enables the PECI host to uniquely identify the PECI client processor. The
parameter field encodings shown in Table 2-8 allow the PECI host to access the
relevant processor information as described below.
• CPUID data: This is the equivalent of data that can be accessed through the
CPUID instruction execution. It contains processor type, stepping, model and
family ID information as shown in Figure 2-21.
Figure 2-21. CPUID Data
• Platform ID data: The Platform ID data can be used to ensure processor
microcode updates are compatible with the processor. The value of the Platform ID
or Processor Flag[2:0] as shown in Figure 2-22 is typically unique to the platform
type and processor stepping. Refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM) Volumes 1, 2, and 3 for more information.
Figure 2-22. Platform ID Data
• PCU Device ID: This information can be used to uniquely identify the processor
power control unit (PCU) device when combined with the Vendor Identification
register content and remains constant across all SKUs. Refer to the appropriate
register description for the exact processor PCU Device ID value.
Figure 2-23. PCU Device ID
• Max Thread ID: The maximum Thread ID data provides the number of supported
processor threads. This value is dependent on the number of cores within the
processor as determined by the processor SKU and is independent of whether
certain cores or corresponding threads are enabled or disabled.
Figure 2-24. Maximum Thread ID
Maximum Thread ID Data: bits [3:0] Max Thread ID; bits [31:4] Reserved.
CPU microcode and PCU firmware revision data (Figure 2-25): bits [31:0] CPU code patch revision.
MCA Error Source Log data (Figure 2-26): bits [28:0] Reserved; bit [29] MCERR; bit [30] IERR; bit [31] CATERR.
Package Power SKU Unit data (Figure 2-27): bits [3:0] Power Unit; bits [7:4] Reserved; bits [12:8] Energy Unit; bits [15:13] Reserved; bits [19:16] Time Unit; bits [31:20] Reserved.
• CPU Microcode Update Revision: Reflects the revision number for the microcode
update and power control unit firmware updates on the processor sample. The
revision data is a unique 32-bit identifier that reflects a combination of specific
versions of the processor microcode and PCU control firmware.
Figure 2-25. Processor Microcode Revision
• Machine Check Status: Returns error information as logged by the MCA Error
Source Log register. See Figure 2-26 for details. The power control unit will assert
the relevant bit when the error condition represented by the bit occurs. For
example, bit 29 will be set if the package asserted MCERR, bit 30 is set if the
package asserted IERR and bit 31 is set if the package asserted CAT_ERR_N. The
CAT_ERR_N may be used to signal the occurrence of a MCERR or IERR.
Figure 2-26. Machine Check Status
2.5.2.6.13 Package Power SKU Unit Read
This feature enables the PECI host to read the units of time, energy and power used in
the processor and DRAM power control registers for calculating power and timing
parameters. In Figure 2-27, the default value of the power unit field [3:0] is 0011b,
energy unit [12:8] is 10000b and the time unit [19:16] is 1010b. Actual unit values are
calculated as shown in Table 2-9.
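As a worked illustration of the unit calculation, the sketch below extracts the three unit fields from the Package Power SKU Unit dword and converts them to seconds, joules and watts using the 1 / 2^field rule of Table 2-9. The function name and dictionary keys are hypothetical; the default dword 0x000A1003 is assembled here from the default field values quoted above and is not a value read from hardware.

```python
def decode_power_sku_unit(dword: int) -> dict:
    """Decode the Package Power SKU Unit dword into physical units."""
    power_exp = dword & 0xF             # Power Unit, bits [3:0]
    energy_exp = (dword >> 8) & 0x1F    # Energy Unit, bits [12:8]
    time_exp = (dword >> 16) & 0xF      # Time Unit, bits [19:16]
    return {
        "power_w": 1.0 / (1 << power_exp),
        "energy_j": 1.0 / (1 << energy_exp),
        "time_s": 1.0 / (1 << time_exp),
    }


# Defaults: time unit 1010b, energy unit 10000b, power unit 0011b.
units = decode_power_sku_unit(0x000A1003)
print(units)
```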
Table 2-9. Power Control Register Unit Calculations
Unit Field | Value Calculation       | Default Value
Time       | 1 s / 2^(TIME UNIT)     | 1 s / 2^10 = 976 µs
Energy     | 1 J / 2^(ENERGY UNIT)   | 1 J / 2^16 = 15.3 µJ
Power      | 1 W / 2^(POWER UNIT)    | 1 W / 2^3 = 1/8 W
2.5.2.6.14 Package Power SKU Read
This read allows the PECI host to access the minimum, Thermal Design Power and
maximum power settings for the processor package SKU. It also returns the maximum
time interval or window over which the power can be sustained. If the power limiting
entity specifies a power limit value outside of the range specified through these
settings, power regulation cannot be guaranteed. Since this data is 64 bits wide, PECI
facilitates access to this register by allowing two requests to read the lower 32 bits and
upper 32 bits separately as shown in Table 2-8. Power units for this read are
determined as per the Package Power SKU Unit settings described in
Section 2.5.2.6.13.
‘Package Power SKU data’ is programmed by the PCU firmware during boot time based
on SKU dependent power-on default values set during manufacturing. The TDP
package power specified through bits [14:0] in Figure 2-28 is the maximum value of
the ‘Power Limit1’ field in Section 2.5.2.6.26 while the maximum package power in bits
[46:32] is the maximum value of the ‘Power Limit2’ field.
The minimum package power in bits [30:16] is applicable to both the ‘Power Limit1’ &
‘Power Limit2’ fields and corresponds to a mode when all the cores are operational and
in their lowest frequency mode. Attempts to program the power limit below the
minimum power value may not be effective since BIOS/OS, and not the PCU, controls
disabling of cores and core activity.
The ‘maximum time window’ in bits [54:48] is representative of the maximum rate at
which the power control unit (PCU) can sample the package energy consumption and
reactively take the necessary measures to meet the imposed power limits.
Programming too large a time window runs the risk of the PCU not being able to
monitor and take timely action on package energy excursions. On the other hand,
programming too small a time window may not give the PCU enough time to sample
energy information and enforce the limit. The minimum value of the ‘time window’ can
be obtained by reading bits [21:15] of the PWR_LIMIT_MISC_INFO CSR using the PECI
RdPCIConfigLocal() command.
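The two 32-bit reads of the Package Power SKU data can be combined and split into fields as sketched below. This assumes the bit positions given in this section (TDP in [14:0], minimum power in [30:16], maximum power in [46:32], maximum time window in [54:48]) and the default power unit of 1/8 W. The function name is hypothetical, the sample dwords are made-up illustration values, and the time window is returned raw because this section does not state its encoding.

```python
def decode_package_power_sku(lower: int, upper: int,
                             power_unit_w: float = 0.125) -> dict:
    """Combine the lower and upper dwords and extract the SKU power fields."""
    raw = ((upper & 0xFFFFFFFF) << 32) | (lower & 0xFFFFFFFF)
    return {
        "tdp_w": (raw & 0x7FFF) * power_unit_w,          # bits [14:0]
        "min_w": ((raw >> 16) & 0x7FFF) * power_unit_w,  # bits [30:16]
        "max_w": ((raw >> 32) & 0x7FFF) * power_unit_w,  # bits [46:32]
        "max_time_window_raw": (raw >> 48) & 0x7F,       # bits [54:48]
    }


# Illustrative values only: 130 W TDP, 50 W minimum, 200 W maximum.
sample_lower = (400 << 16) | 1040
sample_upper = (0x12 << 16) | 1600
print(decode_package_power_sku(sample_lower, sample_upper))
```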
Figure 2-28. Package Power SKU Data
Package Power SKU (lower bits):
  Bits [14:0]   TDP Package Power
  Bit  [15]     Reserved
  Bits [30:16]  Minimum Package Power
  Bit  [31]     Reserved
Package Power SKU (upper bits):
  Bits [46:32]  Maximum Package Power
  Bit  [47]     Reserved
  Bits [54:48]  Maximum Time Window
  Bits [63:55]  Reserved
PECI temperature read data fields:
  Bits [5:0]    PECI Temperature (Fractional Value)
  Bits [14:6]   PECI Temperature (Integer Value)
  Bit  [15]     Sign Bit
  Bits [31:16]  Reserved
2.5.2.6.15 “Wake on PECI” Mode Bit Write / Read
Setting the “Wake on PECI” mode bit enables successful completion of the
WrPCIConfigLocal(), RdPCIConfigLocal(), WrPCIConfig() and RdPCIConfig() PECI
commands by forcing a package ‘pop-up’ to the C2 state to service these commands if
the processor is in a low-power state. The exact power impact of such a ‘pop-up’ is
determined by the product SKU, the C-state from which the pop-up is initiated and the
negotiated PECI bit rate. A ‘reset’ or ‘clear’ of this bit or simply not setting the “Wake
on PECI” mode bit could result in a “timeout” response (completion code of 0x82) from
the processor indicating that the resources required to service the command are in a
low power state.
Alternatively, this mode bit can also be read to determine PECI behavior in package
states C3 or deeper.
2.5.2.6.16 Accumulated Run Time Read
This read returns the total time for which the processor has been executing with a
resolution of 1 mS per count. This is tracked by a 32-bit counter that rolls over on
reaching the maximum value. This counter activates and starts counting for the first
time at RESET_N de-assertion.
2.5.2.6.17 Package Temperature Read
This read returns the maximum processor die temperature in 16-bit PECI format. The
upper 16 bits of the response data are reserved. The PECI temperature data returned
by this read is the ‘instantaneous’ value and not the ‘average’ value as returned by the
PECI GetTemp() described in Section 2.5.2.3.
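A possible way to convert the returned dword to degrees is sketched below. It assumes that the 16-bit PECI format is a two's complement value with six fractional bits (sign in bit 15, integer part in bits [14:6], fraction in bits [5:0]); the two's complement interpretation is an assumption that this section does not spell out. The function name is hypothetical.

```python
def decode_peci_temperature(dword: int) -> float:
    """Convert a PECI temperature dword to degrees C relative to the reference."""
    raw = dword & 0xFFFF      # upper 16 bits of the response are reserved
    if raw & 0x8000:          # sign bit set: treat as a negative value
        raw -= 0x10000        # assumed two's complement encoding
    return raw / 64.0         # six fractional bits give 1/64 degree steps


print(decode_peci_temperature(0xFD80))
```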
This feature enables the PECI host to read the maximum value of the DTS temperature
for any specific core within the processor. Alternatively, this service can be used to read
the System Agent temperature. Temperature is returned in the same format as the
Package Temperature Read described in Section 2.5.2.6.17. Data is returned in relative
PECI temperature format.
Reads to a parameter value outside the supported range will return an error as
indicated by a completion code of 0x90. The supported range of parameter values can
vary depending on the number of cores within the processor. The temperature data
returned through this feature is the instantaneous value and not an averaged value. It
is updated once every 1 mS.
2.5.2.6.19 Temperature Target Read
The Temperature Target Read allows the PECI host to access the maximum processor
junction temperature (Tjmax) in degrees Celsius. This is also the default temperature
value at which the processor thermal control circuit activates. The Tjmax value may vary
from processor part to part to reflect manufacturing process variations. The
Temperature Target read also returns the processor TCONTROL value. TCONTROL is
returned in standard PECI temperature format and represents the threshold
temperature used by the thermal management system for fan speed control.
Figure 2-30. Temperature Target Read
2.5.2.6.20 Package Thermal Status Read / Clear
The Thermal Status Read provides information on package level thermal status. Data
includes:
• Thermal Control Circuit (TCC) activation
• Bidirectional PROCHOT_N signal assertion
• Critical Temperature
Both status and sticky log bits are managed in this status word. All sticky log bits are
set upon a rising edge of the associated status bit and the log bits are cleared only by
Thermal Status reads or a processor reset. A read of the Thermal Status word always
includes a log bit clear mask that allows the host to clear any or all of the log bits that
it is interested in tracking.
A bit set to ‘0’ in the log bit clear mask will result in clearing the associated log bit. If a
mask bit is set to ‘0’ and that bit is not a legal mask, a failing completion code will be
returned. A bit set to ‘1’ is ignored and results in no change to any sticky log bits. For
example, to clear the TCC Activation Log bit and retain all other log bits, the Thermal
Status Read should send a mask of 0xFFFFFFFD.
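The log bit clear mask can be derived mechanically from the log bits to be cleared. The sketch below assumes the bit positions of Figure 2-31 (TCC Activation Log in bit 1, Bidirectional PROCHOT# Log in bit 3, Critical Temperature Log in bit 5); the constant and function names are hypothetical.

```python
TCC_ACTIVATION_LOG = 1 << 1
PROCHOT_LOG = 1 << 3
CRITICAL_TEMP_LOG = 1 << 5


def log_clear_mask(*log_bits: int) -> int:
    """Build a 32-bit mask with '0' in each log bit position to be cleared."""
    mask = 0xFFFFFFFF              # all '1': no sticky log bit is changed
    for bit in log_bits:
        mask &= ~bit               # '0' requests a clear of that log bit
    return mask & 0xFFFFFFFF


# Clear only the TCC Activation Log bit, as in the example above.
print(hex(log_clear_mask(TCC_ACTIVATION_LOG)))
```

Only log bit positions should be passed in, since a ‘0’ in a position that is not a legal mask bit results in a failing completion code.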
Figure 2-31. Thermal Status Word
  Bit  [0]      TCC Activation Status
  Bit  [1]      TCC Activation Log
  Bit  [2]      Bidirectional PROCHOT# Status
  Bit  [3]      Bidirectional PROCHOT# Log
  Bit  [4]      Critical Temperature Status
  Bit  [5]      Critical Temperature Log
  Bits [31:6]   Reserved
Thermal Averaging Constant data fields:
  Bits [3:0]    PECI Temperature Averaging Constant
  Bits [31:4]   Reserved
2.5.2.6.21 Thermal Averaging Constant Write / Read
This feature allows the PECI host to control the window over which the estimated
processor PECI temperature is filtered. The host may configure this window as a power
of two. For example, programming a value of 5 results in a filtering window of 2^5 or 32
samples. The maximum programmable value is 8 or 256 samples. Programming a
value of zero would disable the PECI temperature averaging feature. The default value
of the thermal averaging constant is 4 which translates to an averaging window size of
2^4 or 16 samples. More details on the PECI temperature filtering function can be found
This feature allows the PECI host to access the total time for which the processor has
been operating in a lowered power state due to TCC activation. The returned data
includes the time required to ramp back up to the original P-state target after TCC
activation expires. This timer does not include TCC activation as a result of an external
assertion of PROCHOT_N. This is tracked by a 32-bit counter with a resolution of 1mS
per count that rolls over or wraps around. On the processor PECI clients, the only logic
that can be thermally constrained is that supplied by VCC.
2.5.2.6.23 Current Limit Read
This read returns the current limit for the processor VCC power plane in 1/8A
increments. Actual current limit data is contained only in the lower 13 bits of the
response data. The default return value of 0x438 corresponds to a current limit value
of 135A.
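The conversion from the returned dword to amperes is a mask and a divide, as sketched below. The function name is hypothetical; the 13-bit mask and the 1/8 A step come from the description above.

```python
def decode_current_limit(dword: int) -> float:
    """Return the VCC current limit in amperes from the response dword."""
    raw = dword & 0x1FFF   # only the lower 13 bits carry the limit
    return raw / 8.0       # 1/8 A per count


print(decode_current_limit(0x438))  # default return value quoted above
```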
This service can return the value of the total energy consumed by the entire processor
package or just the logic supplied by the VCC power plane as specified through the
parameter field in Table 2-8. This information is tracked by a 32-bit counter that wraps
around and continues counting on reaching its limit. Energy units for this read are
determined as per the Package Power SKU Unit settings described in
Section 2.5.2.6.13.
While Intel requires reading the accumulated energy data at least once every 16
seconds to ensure functional correctness, a more realistic polling rate recommendation
is once every 100mS for better accuracy. This feature assumes a 150W processor. In
general, as the power capability decreases, so will the minimum polling rate
requirement.
When determining energy changes by subtracting energy values between successive
reads, Intel advocates using the 2’s complement method to account for counter wraparounds. Alternatively, adding all ‘F’s (‘0xFFFFFFFF’) to a negative result from the
subtraction will accomplish the same goal.
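A minimal sketch of the wraparound-safe subtraction follows. It takes the difference modulo 2^32, which is one way to apply the 2's complement method for a 32-bit counter, and assumes the default energy unit of 1 J / 2^16. Function names are hypothetical, and at most one wraparound between reads is assumed.

```python
def energy_delta_joules(prev: int, curr: int,
                        energy_unit_j: float = 1.0 / 65536) -> float:
    """Energy consumed between two reads of the 32-bit accumulated counter."""
    delta = (curr - prev) & 0xFFFFFFFF   # modulo 2^32 handles one wraparound
    return delta * energy_unit_j


def average_power_watts(prev: int, curr: int, interval_s: float,
                        energy_unit_j: float = 1.0 / 65536) -> float:
    """Average power over the polling interval between the two reads."""
    return energy_delta_joules(prev, curr, energy_unit_j) / interval_s


# The counter wrapped between the two samples.
print(energy_delta_joules(0xFFFFFF00, 0x00000100))
```

Polling more often than the counter can wrap keeps the single-wraparound assumption valid.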
Figure 2-34. Accumulated Energy Read Data
2.5.2.6.25 Power Limit for the VCC Power Plane Write / Read
This feature allows the PECI host to program the power limit over a specified time or
control window for the processor logic supplied by the VCC power plane. This typically
includes all the cores, home agent and last level cache. The processor does not support
power limiting on a per-core basis. Actual power limit values are chosen based on the
external VR (voltage regulator) capabilities. The units for the Power Limit and Control
Time Window are determined as per the Package Power SKU Unit settings described in
Section 2.5.2.6.13.
Since the exact VCC plane power limit value is a function of the platform VR, this
feature is not enabled by default and there are no default values associated with the
power limit value or the control time window. The Power Limit Enable bit in Figure 2-35
should be set to activate this feature. The Clamp Mode bit is also required to be set to
allow the cores to go into power states below what the operating system originally
requested. In general, this feature provides an improved mechanism for VR protection
compared to the input PROCHOT_N signal assertion method. Both power limit enabling
and initialization of power limit values can be done in the same command cycle. Setting
a power limit for the VCC plane enables turbo modes for associated logic. External VR
protection is guaranteed during boot through operation at safe voltage and frequency.
All RAPL parameter values including the power limit value, control time window, clamp
mode and enable bit will have to be specified correctly even if the intent is to change
just one parameter value when programming over PECI.
VCC Power Plane Power Limit Data fields (Figure 2-35):
  Bits [14:0]   VCC Plane Power Limit
  Bit  [15]     Power Limit Enable
  Bit  [16]     Clamp Mode
  Bits [23:17]  Control Time Window
  Bits [31:24]  Reserved
The usefulness of the VCC power plane RAPL may be somewhat limited if the platform
has a fully compliant external voltage regulator. However, platforms using lower cost
voltage regulators may find this feature useful. The VCC RAPL value is generally
expected to be a static value after initialization and there may not be any use cases for
dynamic control of VCC plane power limit values during run time. BIOS may be ideally
used to read the VR (and associated heat sink) capabilities and program the PCU with
the power limit information during boot. No matter what the method is, Intel
recommends exclusive use of just one entity or interface, PECI for instance, to manage
VCC plane power limiting needs. If PECI is being used to manage VCC plane power
limiting activities, BIOS should lock out all subsequent inband VCC plane power limiting
accesses by setting bit 31 of the PP0_POWER_LIMIT MSR and CSR to ‘1’.
The same conversion formula used for DRAM Power Limiting (see Section 2.5.2.6.9)
should be applied for encoding or programming the ‘Control Time Window’ in bits
[23:17].
Figure 2-35. Power Limit Data for VCC Power Plane
2.5.2.6.26 Package Power Limits For Multiple Turbo Modes
This feature allows the PECI host to program two power limit values to support multiple
turbo modes. The operating systems and drivers can balance the power budget using
these two limits. Two separate PECI requests are available to program the lower and
upper 32 bits of the power limit data shown in Figure 2-36. The units for the Power
Limit and Control Time Window are determined as per the Package Power SKU Unit
settings described in Section 2.5.2.6.13 while the valid range for power limit values are
determined by the Package Power SKU settings described in Section 2.5.2.6.14. Setting
the Clamp Mode bits is required to allow the cores to go into power states below what
the operating system originally requested. The Power Limit Enable bits should be set to
enable the power limiting function. Power limit values, enable and clamp mode bits can
all be set in the same command cycle. All RAPL parameter values including the power
limit value, control time window, clamp mode and enable bit will have to be specified
correctly even if the intent is to change just one parameter value when programming
over PECI.
Intel recommends exclusive use of just one entity or interface, PECI for instance, to
manage all processor package power limiting and budgeting needs. If PECI is being
used to manage package power limiting activities, BIOS should lock out all subsequent
inband package power limiting accesses by setting bit 31 of the
PACKAGE_POWER_LIMIT MSR and CSR to ‘1’. The ‘power limit 1’ is intended to limit
processor power consumption to any reasonable value below TDP and defaults to TDP.
‘Power Limit 1’ values may be impacted by the processor heat sinks and system air
flow. Processor ‘power limit 2’ can be used as appropriate to limit the current drawn by
the processor to prevent any external power supply unit issues. The ‘Power Limit 2’
should always be programmed to a value (typically 20%) higher than ‘Power Limit 1’
and has no default value associated with it.
Package Power Limit 1 fields (Figure 2-36):
  Bits [14:0]   Power Limit #1
  Bit  [15]     Power Limit Enable #1
  Bit  [16]     Clamp Mode #1
  Bits [23:17]  Control Time Window #1
  Bits [31:24]  Reserved
Package Power Limit 2 fields (Figure 2-36):
  Bits [46:32]  Power Limit #2
  Bit  [47]     Power Limit Enable #2
  Bit  [48]     Clamp Mode #2
  Bits [55:49]  Control Time Window #2
  Bits [63:56]  Reserved
Though this feature is disabled by default and external programming is required to
enable, initialize and control package power limit values and time windows, the
processor package will still turbo to TDP if ‘Power Limit 1’ is not enabled or initialized.
‘Control Time Window#1’ (Power_Limit_1_Time also known as Tau) values may be
programmed to be within a range of 250 mS-40 seconds. ‘Control Time Window#2’
(Power_Limit_2_Time) values should be in the range 3 mS-10 mS.
The same conversion formula used for the DRAM Power Limiting feature (see
Section 2.5.2.6.9) should be applied when programming the ‘Control Time Window’ bits
[23:17] for ‘power limit 1’ in Figure 2-36. The ‘Control Time Window’ for ‘power limit 2’
can be directly programmed into bits [55:49] in units of mS without the aid of any
conversion formulas.
Figure 2-36. Package Turbo Power Limit Data
2.5.2.6.27 Package Power Limit Performance Status Read
This service allows the PECI host to assess the performance impact of the currently
active power limiting modes. The read return data contains the total amount of time for
which the entire processor package has been operating in a power state that is lower
than what the operating system originally requested. This information is tracked by a
32-bit counter that wraps around. The unit for time is determined as per the Package
Power SKU Unit settings described in Section 2.5.2.6.13.
Figure 2-37. Package Power Limit Performance Data
  Bits [31:0]   Accumulated CPU Throttle Time
2.5.2.6.28 Efficient Performance Indicator Read
Efficient Performance Indicator Data fields (Figure 2-38):
  Bits [31:0]   Efficient Performance Cycles
ACPI P-T Notify Data fields:
  Bits [7:0]    New P1 state
  Bits [31:8]   Reserved
The Efficient Performance Indicator (EPI) Read provides an indication of the total
number of productive cycles. Specifically, these are the cycles when the processor is
engaged in any activity to retire instructions and as a result, consuming energy. Any
power management entity monitoring this indicator should sample it at least once
every 4 seconds to enable detection of wraparounds. Refer to the processor Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM) Volumes 1, 2, and 3, for
details on programming the Energy/Performance Bias (MSR_MISC_PWR_MGMT)
register to set the ‘Energy Efficiency’ policy of the processor.
Figure 2-38. Efficient Performance Indicator Read
2.5.2.6.29 ACPI P-T Notify Write & Read
This feature enables the processor turbo capability when used in conjunction with the
PECI package RAPL or power limit. When the BMC sets the package power limit to a
value below TDP, it also determines a new corresponding turbo frequency and notifies
the OS using the ‘ACPI Notify’ mechanism as supported by the _PPC or performance
present capabilities object. The BMC then notifies the processor PCU using the PECI
‘ACPI P-T Notify’ service by programming a new state that is one p-state below the
turbo frequency sent to the OS via the _PPC method.
When the OS requests a p-state higher than what is specified in bits [7:0] of the PECI
ACPI P-T Notify data field, the CPU will treat it as request for P0 or turbo. The PCU will
use the IA32_ENERGY_PERFORMANCE_BIAS register settings to determine the exact
extent of turbo. Any OS p-state request that is equal to or below what is specified in
the PECI ACPI P-T Notify will be granted as long as the RAPL power limit does not
impose a lower p-state. However, turbo will not be enabled in this instance even if there
is headroom between the processor energy consumption and the RAPL power limit.
This feature does not affect the Thermal Monitor behavior of the processor nor is it
impacted by the setting of the power limit clamp mode bit.
This feature allows the PECI host to read the Caching Agent (Cbo) Table of Requests
(TOR). This information is useful for debug in the event of a 3-strike timeout that
results in a processor IERR assertion. The 16-bit parameter field is used to specify the
Cbo index, TOR array index and bank number according to the following bit
assignments.
• Bits [1:0] - Bank Number - legal values from 0 to 2
• Bits [6:2] - TOR Array Index - legal values from 0 to 19
• Bits [10:7] - Cbo Index - legal values from 0 to 7
• Bit [11] - Read Mode - should be set to ‘0’ for TOR reads
• Bits [15:12] - Reserved
Bit[11] is the Read Mode bit and should be set to ‘0’ for TOR reads. The Read Mode bit
can alternatively be set to ‘1’ to read the ‘Core ID’ (with associated valid bit as shown in
Figure 2-40) that points to the first core that asserted the IERR. In this case bits [10:0]
of the parameter field are ignored. The ‘Core ID’ read may not return valid data until at
least 1 mS after the IERR assertion.
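The 16-bit parameter field for a TOR read can be assembled from the bit assignments listed above, as in the following sketch. The function and constant names are hypothetical; the legal ranges are the ones given in the list.

```python
def tor_read_parameter(cbo_index: int, tor_index: int, bank: int) -> int:
    """Build the 16-bit parameter for a Cbo TOR read (Read Mode bit = 0)."""
    if not 0 <= bank <= 2:
        raise ValueError("bank number must be 0 to 2")
    if not 0 <= tor_index <= 19:
        raise ValueError("TOR array index must be 0 to 19")
    if not 0 <= cbo_index <= 7:
        raise ValueError("Cbo index must be 0 to 7")
    # Bits [1:0] bank, bits [6:2] TOR index, bits [10:7] Cbo index.
    return (cbo_index << 7) | (tor_index << 2) | bank


# Read Mode bit (bit 11) set: requests the 'Core ID' instead of TOR data.
# Bits [10:0] are ignored in this mode, so they are left at zero.
CORE_ID_READ_PARAMETER = 1 << 11

print(hex(tor_read_parameter(7, 19, 2)))
```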
Figure 2-40. Caching Agent TOR Read Data
Note: Reads to caching agents that are not enabled will return all zeroes. Refer to the debug handbook for
details on methods to interpret the crash dump results using the Cbo TOR data shown in Figure 2-40.
2.5.2.6.31 Thermal Margin Read
This service allows the PECI host to read the margin to the processor thermal profile or
load line. Thermal margin data is returned in the format shown in Figure 2-41 with a
sign bit, an integer part and a fractional part. A negative thermal margin value implies
that the processor is operating in violation of its thermal load line and may be indicative
of a need for more aggressive cooling mechanisms through a fan speed increase or
other means. This PECI service will continue to return valid margin values even when
the processor die temperature exceeds Tjmax.
Figure 2-41. DTS Thermal Margin Read
2.5.2.7 RdIAMSR()
The RdIAMSR() PECI command provides read access to Model Specific Registers
(MSRs) defined in the processor’s Intel® Architecture (IA). MSR definitions may be
found in the Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM) Volumes 1, 2, and 3. Refer to Table 2-11 for the exact listing of processor registers
accessible through this command.
2.5.2.7.1 Command Format
The RdIAMSR() format is as follows:
Write Length: 0x05
Read Length: 0x09 (qword)
Command: 0xb1
Description: Returns the data maintained in the processor IA MSR space as specified
by the ‘Processor ID’ and ‘MSR Address’ fields. The Read Length dictates the desired
data return size. This command supports only qword responses. All command
responses are prepended with a completion code that contains additional pass/fail
status information. Refer to Section 2.5.5.2 for details regarding completion codes.
2.5.2.7.2 Processor ID Enumeration
The ‘Processor ID’ field that is used to address the IA MSR space refers to a specific
logical processor within the CPU. The ‘Processor ID’ always refers to the same physical
location in the processor silicon regardless of configuration as shown in the example in
Figure 2-42. For example, if certain logical processors are disabled by BIOS, the
Processor ID mapping will not change. The total number of Processor IDs on a CPU is
product-specific.
‘Processor ID’ enumeration involves discovering the logical processors enabled within
the CPU package. This can be accomplished by reading the ‘Max Thread ID’ value
through the RdPkgConfig() command (Index 0, Parameter 3) described in
Section 2.5.2.6.12 and subsequently querying each of the supported processor
threads. Unavailable processor threads will return a completion code of 0x90.
Alternatively, this information may be obtained from the RESOLVED_CORES_MASK
register readable through the RdPCIConfigLocal() PECI command described in
Section 2.5.2.9 or other means. Bits [7:0] and [9:8] of this register contain the ‘Core
Mask’ and ‘Thread Mask’ information respectively. The ‘Thread Mask’ applies to all the
enabled cores within the processor package as indicated by the ‘Core Mask’. For
the processor PECI clients, the ‘Processor ID’ may take on values in the range 0
through 15.
Note: The 2-byte MSR Address field and read data field defined in Figure 2-43 are sent in standard PECI ordering with LSB first
and MSB last.
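As an illustration of the LSB-first ordering, the sketch below assembles the header and payload bytes of a RdIAMSR() request and decodes a qword response. The byte order of the fields (client address, Write Length, Read Length, command code, Host ID and Retry byte, Processor ID, MSR Address) is an assumption modeled on the other command figures in this chapter, the function names are hypothetical, and FCS bytes are deliberately left out.

```python
from typing import Tuple


def build_rdiamsr_request(client_addr: int, processor_id: int, msr_addr: int,
                          host_id: int = 0, retry: bool = False) -> bytes:
    """Assemble RdIAMSR() request bytes (FCS not included)."""
    host_byte = ((host_id & 0x7F) << 1) | int(retry)  # Host ID[7:1] & Retry[0]
    header = bytes([client_addr, 0x05, 0x09, 0xB1, host_byte, processor_id & 0xFF])
    return header + msr_addr.to_bytes(2, "little")    # MSR Address, LSB first


def parse_rdiamsr_response(payload: bytes) -> Tuple[int, int]:
    """Split a 9-byte response into (completion_code, qword_value)."""
    if len(payload) != 9:
        raise ValueError("expected completion code plus 8 data bytes")
    return payload[0], int.from_bytes(payload[1:], "little")
```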
2.5.2.7.3 Supported Responses
The typical client response is a passing FCS, a passing Completion Code and valid data.
Under some conditions, the client’s response will indicate a failure.
Table 2-10. RdIAMSR() Response Definition
Response    Meaning
Bad FCS     Electrical error
Abort FCS   Illegal command formatting (mismatched RL/WL/Command Code)
CC: 0x40    Command passed, data is valid.
CC: 0x80    Response timeout. The processor was not able to generate the required response in a timely fashion. Retry is appropriate.
CC: 0x81    Response timeout. The processor is not able to allocate resources for servicing this command at this time. Retry is appropriate.
CC: 0x82    The processor hardware resources required to service this command are in a low power state. Retry may be appropriate after modification of PECI wake mode behavior if appropriate.
CC: 0x90    Unknown/Invalid/Illegal Request
CC: 0x91    PECI control hardware, firmware or associated logic error. The processor is unable to process the request.
2.5.2.7.4 RdIAMSR() Capabilities
The processor PECI client allows PECI RdIAMSR() access to the registers listed in
Table 2-11. These registers pertain to the processor core and uncore error banks
(machine check banks 0 through 19). Information on the exact number of accessible
banks for the processor device may be obtained by reading the IA32_MCG_CAP[7:0]
MSR (0x0179). This register may be alternatively read using a RDMSR BIOS
instruction. Please consult the Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM) Volumes 1, 2, and 3 for more information on the exact number of cores
supported by a particular processor SKU. Any attempt to read processor MSRs that are
not accessible over PECI or simply not implemented will result in a completion code of
0x90.
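The bank count lookup described above can be sketched as below. The bank count field IA32_MCG_CAP[7:0] comes from the text; the per-bank MSR numbering (IA32_MCi_CTL at 0x400 + 4*i, followed by STATUS, ADDR and MISC) is taken from the architectural definition in the Intel SDM rather than from this datasheet, and the function names are hypothetical.

```python
from typing import Dict

IA32_MCG_CAP = 0x0179


def mc_bank_count(mcg_cap_value: int) -> int:
    """Number of machine check banks reported in IA32_MCG_CAP[7:0]."""
    return mcg_cap_value & 0xFF


def mc_bank_msrs(bank: int) -> Dict[str, int]:
    """MSR addresses of one machine check bank (SDM architectural layout)."""
    if bank < 0:
        raise ValueError("bank must be non-negative")
    base = 0x400 + 4 * bank
    return {"CTL": base, "STATUS": base + 1, "ADDR": base + 2, "MISC": base + 3}
```

Not every register computed this way is necessarily implemented or accessible over PECI; reads to such registers are expected to return a completion code of 0x90.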
PECI access to these registers is expected only when in-band access mechanisms are
not available.
Table 2-11. RdIAMSR() Services Summary
Notes:
1. The IA32_MC0_MISC register details will be available upon implementation in a future processor stepping.
2. The MCi_ADDR and MCi_MISC registers for machine check banks 2 & 4 are not implemented on the processors. The MCi_CTL register for machine check bank 2 is also not implemented.
3. The PECI host must determine the total number of machine check banks and the validity of the MCi_ADDR and MCi_MISC register contents prior to issuing a read to the machine check bank similar to standard machine check architecture enumeration and accesses.
4. The information presented in Table 2-11 is applicable to the processor only. No association between bank numbers and logical functions should be assumed for any other processor devices (past, present or future) based on the information presented in Table 2-11.
5. The processor machine check banks 4 through 19 reside in the processor uncore and hence will return the same value independent of the processor ID used to access these banks.
6. The IA32_MCG_STATUS, IA32_MCG_CONTAIN and IA32_MCG_CAP are located in the uncore and will return the same value independent of the processor ID used to access them.
7. The processor machine check banks 0 through 3 are core-specific. Since the processor ID is thread-specific and not core-specific, machine check banks 0 through 3 will return the same value for a particular core independent of the thread referenced by the processor ID.
8. PECI accesses to the machine check banks may not be possible in the event of a core hang. A warm reset of the processor may be required to read any sticky machine check banks.
9. Valid processor ID values may be obtained by using the enumeration methods described in Section 2.5.2.7.2.
10. Reads to a machine check bank within a core or thread that is disabled will return all zeroes with a completion code of 0x90.
11. For SKUs where Intel QPI is disabled or absent, reads to the corresponding machine check banks will return all zeros with a completion code of 0x40.
2.5.2.8 RdPCIConfig()
PCI configuration address bit fields (Figure 2-44): [31:28] Reserved, [27:20] Bus, [19:15] Device, [14:12] Function, [11:0] Register.
The RdPCIConfig() command provides sideband read access to the PCI configuration
space maintained in downstream devices external to the processor. PECI originators
may conduct a device/function/register enumeration sweep of this space by issuing
reads in the same manner that the BIOS would. A response of all 1’s may indicate that
the device/function/register is unimplemented even with a ‘passing’ completion code.
Alternatively, reads to unimplemented registers may return a completion code of 0x90
indicating an invalid request. Responses will follow normal PCI protocol.
PCI configuration addresses are constructed as shown in Figure 2-44. Under normal in-band procedures, the Bus number would be used to direct a read or write to the proper
device. Actual PCI bus numbers for all PCI devices including the PCH are programmable
by BIOS. The bus number for PCH devices may be obtained by reading the CPUBUSNO
CSR. Refer to the Intel® Xeon® Processor E5 Product Family Datasheet Volume Two
document for details on this register.
Figure 2-44. PCI Configuration Address
PCI configuration reads may be issued in byte, word or dword granularities.
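A minimal sketch of the address construction follows. The field positions ([27:20] Bus, [19:15] Device, [14:12] Function, [11:0] Register) are read from Figure 2-44, the 4-byte LSB-first serialization follows the standard PECI ordering used by this command, and the function names are hypothetical.

```python
def pci_config_address(bus: int, device: int, function: int, register: int) -> int:
    """Pack bus/device/function/register into a PCI configuration address."""
    fields = (("bus", bus, 0xFF), ("device", device, 0x1F),
              ("function", function, 0x7), ("register", register, 0xFFF))
    for name, value, limit in fields:
        if not 0 <= value <= limit:
            raise ValueError(f"{name} out of range: {value}")
    return (bus << 20) | (device << 15) | (function << 12) | register


def address_to_peci_bytes(address: int) -> bytes:
    """Serialize the 4-byte address in PECI order (LSB first, MSB last)."""
    return address.to_bytes(4, "little")
```

For example, bus 1, device 2, function 3, register 0x10 packs to 0x00113010 and is sent as the bytes 0x10, 0x30, 0x11, 0x00.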
Description: Returns the data maintained in the PCI configuration space at the
requested PCI configuration address. The Read Length dictates the desired data return
size. This command supports only dword responses with a completion code on the
processor PECI clients. All command responses are prepended with a completion code
that includes additional pass/fail status information. Refer to Section 2.5.5.2 for details
regarding completion codes.
Figure 2-45. RdPCIConfig()
Note: The 4-byte PCI configuration address and read data field defined in Figure 2-45 are sent in standard PECI ordering with LSB first and MSB last.
The typical client response is a passing FCS, a passing Completion Code and valid data.
Under some conditions, the client’s response will indicate a failure.
The PECI client response can also vary depending on the address and data. It will
respond with a passing completion code if it successfully submits the request to the
appropriate location and gets a response.
Response    Meaning
CC: 0x40    Command passed, data is valid.
CC: 0x80    Response timeout. The processor was not able to generate the required response in a timely fashion. Retry is appropriate.
CC: 0x81    Response timeout. The processor is not able to allocate resources for servicing this command at this time. Retry is appropriate.
CC: 0x82    The processor hardware resources required to service this command are in a low power state. Retry may be appropriate after modification of PECI wake mode behavior if appropriate.
CC: 0x90    Unknown/Invalid/Illegal Request
CC: 0x91    PECI control hardware, firmware or associated logic error. The processor is unable to process the request.
2.5.2.9 RdPCIConfigLocal()
The RdPCIConfigLocal() command provides sideband read access to the PCI
configuration space that resides within the processor. This includes all processor IIO
and uncore registers within the PCI configuration space as described in the Intel® Xeon® Processor E5 Product Family Datasheet Volume Two document.
PECI originators may conduct a device/function enumeration sweep of this space by
issuing reads in the same manner that the BIOS would. A response of all 1’s may
indicate that the device/function/register is unimplemented even with a ‘passing’
completion code. Alternatively, reads to unimplemented or hidden registers may return
a completion code of 0x90 indicating an invalid request. It is also possible that reads to
function 0 of non-existent IIO devices issued prior to BIOS POST may return all ‘0’s
with a passing completion code. PECI originators can access this space even prior to
BIOS enumeration of the system buses. There is no read restriction on accesses to
locked registers.
PCI configuration addresses are constructed as shown in Figure 2-46. Under normal in-band procedures, the Bus number would be used to direct a read or write to the proper
device. PECI reads to the processor IIO devices should specify a bus number of ‘0000’
and reads to the rest of the processor uncore should specify a bus number of ‘0001’ for
bits [23:20] in Figure 2-46. Any request made with a bad Bus number is ignored and
the client will respond with all ‘0’s and a ‘passing’ completion code.
Figure 2-46. PCI Configuration Address for local accesses
2.5.2.9.1 Command Format
RdPCIConfigLocal() byte definition (Figure 2-47):
Byte 0        Client Address
Byte 1        Write Length (0x05)
Byte 2        Read Length ({0x02, 0x03, 0x05})
Byte 3        Cmd Code (0xe1)
Byte 4        Host ID[7:1] & Retry[0]
Bytes 5-7     PCI Configuration Address (LSB first, MSB last)
Byte 8        FCS
Byte 9        Completion Code
Bytes 10-13   Data (1, 2 or 4 bytes; LSB first, MSB last)
Byte 14       FCS
The RdPCIConfigLocal() format is as follows:
Write Length: 0x05
Read Length: 0x02 (byte), 0x03 (word), 0x05 (dword)
Command: 0xe1
Description: Returns the data maintained in the PCI configuration space within the
processor at the requested PCI configuration address. The Read Length dictates the
desired data return size. This command supports byte, word and dword responses as
well as a completion code. All command responses are prepended with a completion
code that includes additional pass/fail status information. Refer to Section 2.5.5.2 for
details regarding completion codes.
Figure 2-47. RdPCIConfigLocal()
Note: The 3-byte PCI configuration address and read data field defined in Figure 2-47 are sent in standard PECI ordering with LSB first and MSB last.
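The command format above can be sketched as follows. This is only an illustration: the function names are hypothetical, FCS bytes are not computed, the Host ID and Retry byte is assumed to follow the command code as in Figure 2-47, and the device, function and register field positions are assumed to match the general PCI configuration address layout, with the bus number restricted to the two values the text allows for bits [23:20].

```python
from typing import Tuple

READ_LENGTH = {1: 0x02, 2: 0x03, 4: 0x05}  # data size in bytes -> Read Length


def build_rdpciconfiglocal(client_addr: int, bus: int, device: int, function: int,
                           register: int, size: int = 4,
                           host_id: int = 0, retry: bool = False) -> bytes:
    """Assemble RdPCIConfigLocal() request bytes (FCS not included)."""
    if size not in READ_LENGTH:
        raise ValueError("size must be 1, 2 or 4 bytes")
    if bus not in (0, 1):  # 0000b = IIO devices, 0001b = rest of the uncore
        raise ValueError("bus must be 0 (IIO) or 1 (uncore)")
    if not (0 <= device <= 0x1F and 0 <= function <= 0x7 and 0 <= register <= 0xFFF):
        raise ValueError("device/function/register out of range")
    address = (bus << 20) | (device << 15) | (function << 12) | register
    host_byte = ((host_id & 0x7F) << 1) | int(retry)
    header = bytes([client_addr, 0x05, READ_LENGTH[size], 0xE1, host_byte])
    return header + address.to_bytes(3, "little")  # 3-byte address, LSB first


def parse_config_read(payload: bytes, size: int) -> Tuple[int, int]:
    """Split a response into (completion_code, value); data is LSB first."""
    if len(payload) != size + 1:
        raise ValueError("payload must be completion code plus data bytes")
    return payload[0], int.from_bytes(payload[1:], "little")
```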
2.5.2.9.2 Supported Responses
The typical client response is a passing FCS, a passing Completion Code and valid data.
Under some conditions, the client’s response will indicate a failure.
The PECI client response can also vary depending on the address and data. It will
respond with a passing completion code if it successfully submits the request to the
appropriate location and gets a response.
Table 2-13. RdPCIConfigLocal() Response Definition
Response    Meaning
CC: 0x40    Command passed, data is valid.
CC: 0x80    Response timeout. The processor was not able to generate the required response in a timely fashion. Retry is appropriate.
CC: 0x81    Response timeout. The processor is not able to allocate resources for servicing this command at this time. Retry is appropriate.
CC: 0x82    The processor hardware resources required to service this command are in a low power state. Retry may be appropriate after modification of PECI wake mode behavior if appropriate.
CC: 0x90    Unknown/Invalid/Illegal Request
CC: 0x91    PECI control hardware, firmware or associated logic error. The processor is unable to process the request.
2.5.2.10 WrPCIConfigLocal()
The WrPCIConfigLocal() command provides sideband write access to the PCI
configuration space that resides within the processor. PECI originators can access this
space even before BIOS enumeration of the system buses. The exact listing of
supported devices and functions for writes using this command on the processor is
defined in Table 2-19. The write accesses to registers that are locked will not take effect
but will still return a completion code of 0x40. However, write accesses to registers that
are hidden will return a completion code of 0x90.
Because a WrPCIConfigLocal() command results in an update to potentially critical
registers inside the processor, it includes an Assured Write FCS (AW FCS) byte as
part of the write data payload. In the event that the AW FCS mismatches with the
client-calculated FCS, the client will abort the write and will always respond with a bad
write FCS.
PCI Configuration addresses are constructed as shown in Figure 2-46. The write
command is subject to the same address configuration rules as defined in
Section 2.5.2.9. PCI configuration writes may be issued in byte, word or dword
granularity.
2.5.2.10.1 Command Format
The WrPCIConfigLocal() format is as follows:
Write Length: 0x07 (byte), 0x08 (word), 0x0a (dword)
Read Length: 0x01
Command: 0xe5
AW FCS Support: Yes
Description: Writes the data sent to the requested register address. Write Length
dictates the desired write granularity. The command always returns a completion code
indicating pass/fail status. Refer to Section 2.5.5.2 for details on completion codes.
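The relationship between write granularity and Write Length can be sketched as below. The Write Length values come from the format above; the breakdown (command code, Host ID and Retry byte, 3-byte address, data, AW FCS) is an assumption consistent with those values, the function names are hypothetical, and neither the AW FCS nor the FCS is computed here.

```python
def wr_write_length(size: int) -> int:
    """Write Length for a 1-, 2- or 4-byte write: 0x07, 0x08 or 0x0a."""
    if size not in (1, 2, 4):
        raise ValueError("size must be 1, 2 or 4 bytes")
    # command code + host byte + 3 address bytes + AW FCS = 6, plus the data
    return 6 + size


def wr_payload_without_awfcs(address: int, value: int, size: int,
                             host_byte: int = 0) -> bytes:
    """Bytes from the command code through the data; AW FCS is appended later."""
    wr_write_length(size)  # validates size
    return (bytes([0xE5, host_byte])
            + address.to_bytes(3, "little")   # address, LSB first
            + value.to_bytes(size, "little")) # data, LSB first
```

The payload returned here is one byte shorter than the Write Length because the AW FCS byte still has to be appended by the originator.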
Figure 2-48. WrPCIConfigLocal()
WrPCIConfigLocal() byte definition:
Byte 0        Client Address
Byte 1        Write Length ({0x07, 0x08, 0x0a})
Byte 2        Read Length (0x01)
Byte 3        Cmd Code (0xe5)
Byte 4        Host ID[7:1] & Retry[0]
Bytes 5-7     PCI Configuration Address (LSB first, MSB last)
Bytes 8-11    Data (1, 2 or 4 bytes; LSB first, MSB last)
Byte 12       AW FCS
Byte 13       FCS
Byte 14       Completion Code
Byte 15       FCS
Note: The 3-byte PCI configuration address and write data field defined in Figure 2-48 are sent in standard PECI ordering with
LSB first and MSB last.
2.5.2.10.2 Supported Responses
The typical client response is a passing FCS, a passing Completion Code and valid data.
Under some conditions, the client’s response will indicate a failure.
The PECI client response can also vary depending on the address and data. It will
respond with a passing completion code if it successfully submits the request to the
appropriate location and gets a response.
On the processor PECI clients, the PECI WrPCIConfigLocal() command provides a
method for programming certain integrated memory controller and IIO functions as
described in Table 2-15. Refer to the Intel® Xeon® Processor E5 Product Family Datasheet Volume Two for more details on specific register definitions. It also enables
writing to processor REUT (Robust Electrical Unified Test) registers associated with the
Intel QPI, PCIe* and DDR3 functions.
Table 2-15. WrPCIConfigLocal() Memory Controller and IIO Device/Function Support
Integrated Memory Controller Thermal Control Registers
2.5.3 Client Management
2.5.3.1 Power-up Sequencing
The PECI client will not be available when the PWRGOOD signal is de-asserted. Any
transactions on the bus during this time will be completely ignored, and the host will
read the response from the client as all zeroes. PECI client initialization is completed
approximately 100 µs after the PWRGOOD assertion. This is represented by the start of
the PECI Client “Data Not Ready” (DNR) phase in Figure 2-49. While in this phase, the
PECI client will respond normally to the Ping() and GetDIB() commands and return the
highest processor die temperature of 0x0000 to the GetTemp() command. All other
commands will get a ‘Response Timeout’ completion in the DNR phase as shown in
Table 2-16. All PECI services with the exception of core MSR space accesses become
available ~500 µs after RESET_N de-assertion as shown in Figure 2-49. PECI will be
fully functional with all services including core accesses being available when the core
comes out of reset upon completion of the RESET microcode execution.
In the event of the occurrence of a fatal or catastrophic error, all PECI services with the
exception of core MSR space accesses will be available during the DNR phase to
facilitate debug through configuration space accesses.
Table 2-16. PECI Client Response During Power-Up
Command              Response During ‘Data Not Ready’                          Response During ‘Available Except Core Services’
Ping()               Fully functional                                          Fully functional
GetDIB()             Fully functional                                          Fully functional
GetTemp()            Client responds with a ‘hot’ reading or 0x0000            Fully functional
RdPkgConfig()        Client responds with a timeout completion code of 0x81    Fully functional
WrPkgConfig()        Client responds with a timeout completion code of 0x81    Fully functional
RdIAMSR()            Client responds with a timeout completion code of 0x81    Client responds with a timeout completion code of 0x81
RdPCIConfigLocal()   Client responds with a timeout completion code of 0x81    Fully functional
WrPCIConfigLocal()   Client responds with a timeout completion code of 0x81    Fully functional
RdPCIConfig()        Client responds with a timeout completion code of 0x81    Fully functional
Figure 2-49 (timing diagram): PWRGOOD, RESET_N, core execution (idle, Reset uCode, Boot BIOS, running), PECI Client Status (In Reset, Data Not Ready, Available except core services, Fully Operational) and SOCKET_ID[1:0] (X, SOCKET ID Valid).
In the event that the processor is tri-stated using power-on-configuration controls, the
PECI client will also be tri-stated. Processor tri-state controls are described in
The PECI client is available on all processors. The presence of a PECI enabled processor
in a CPU socket can be confirmed by using the Ping() command described in
Section 2.5.2.1. Positive identification of the PECI revision number can be achieved by
issuing the GetDIB() command. The revision number acts as a reference to the PECI
specification document applicable to the processor client definition. Please refer to
Section 2.5.2.2 for details on GetDIB response formatting.
The PECI client assumes a default address of 0x30. The PECI client address for the
processor is configured through the settings of the SOCKET_ID[1:0] signals. Each
processor socket in the system requires that the two SOCKET_ID signals be configured
to a different PECI address. Strapping the SOCKET_ID[1:0] pins results in the client
addresses shown in Table 2-17. These package strap(s) are evaluated at the assertion
of PWRGOOD (as depicted in Figure 2-49). Refer to the appropriate Platform Design
Guide (PDG) for recommended resistor values for establishing non-default SOCKET_ID
settings.
The client address may not be changed after PWRGOOD assertion, until the next power
cycle on the processor. Removal of a processor from its socket or tri-stating a processor
will have no impact to the remaining non-tri-stated PECI client addresses. Since each
socket in the system should have a unique PECI address, the SOCKET_ID strapping is
required to be unique for each socket.
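As a rough sketch of the strap-to-address relationship, the helper below assumes that the client address is the default 0x30 plus the 2-bit value formed by SOCKET_ID[1:0], with a strap to VTT read as 1 and a strap to Ground read as 0. That arithmetic is an assumption for illustration and the function name is hypothetical; Table 2-17 remains the authoritative mapping.

```python
PECI_DEFAULT_ADDRESS = 0x30


def peci_client_address(socket_id1_high: bool, socket_id0_high: bool) -> int:
    """Assumed mapping: address = 0x30 + SOCKET_ID[1:0] (True = VTT)."""
    socket_id = (int(socket_id1_high) << 1) | int(socket_id0_high)
    return PECI_DEFAULT_ADDRESS + socket_id
```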
The processor PECI client may be fully functional in most core and package C-states.
• The Ping(), GetDIB(), GetTemp(), RdPkgConfig() and WrPkgConfig() commands
have no measurable impact on CPU power in any of the core or package C-states.
• The RdIAMSR() command will complete normally unless the targeted core is in a C-state that is C3 or deeper. The PECI client will respond with a completion code of
0x82 (see Table 2-22 for definition) for RdIAMSR() accesses in core C-states that
are C3 or deeper.
• The RdPCIConfigLocal(), WrPCIConfigLocal(), and RdPCIConfig() commands will
not impact the core C-states but may have a measurable impact on the package C-state. The PECI client will successfully return data without impacting package C-state if the resources needed to service the command are not in a low power state.
— If the resources required to service the command are in a low power state, the
PECI client will respond with a completion code of 0x82 (see Table 2-22 for
definition). If this is the case, setting the “Wake on PECI” mode bit as described
in Section 2.5.2.6 can cause a package ‘pop-up’ to the C2 state and enable
successful completion of the command. The exact power impact of a pop-up to
C2 will vary by product SKU, the C-state from which the pop-up is initiated and
the negotiated PECI bit rate.
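One possible originator reaction to the 0x82 completion code is sketched below. Both callables are hypothetical placeholders: `issue` sends the configuration access and returns a (completion code, data) pair, and `enable_wake_on_peci` sets the “Wake on PECI” mode bit by whatever means the platform provides. Whether the power cost of a pop-up is acceptable is left to the caller.

```python
from typing import Callable, Tuple

CC_LOW_POWER = 0x82  # resources needed by the command are in a low power state


def issue_with_wake(issue: Callable[[], Tuple[int, int]],
                    enable_wake_on_peci: Callable[[], None]) -> Tuple[int, int]:
    """Issue a command once; on 0x82, enable wake mode and try once more."""
    cc, data = issue()
    if cc == CC_LOW_POWER:
        enable_wake_on_peci()
        cc, data = issue()
    return cc, data
```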
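Table 2-17. SOCKET_ID Strapping
SOCKET_ID[1] Strap   SOCKET_ID[0] Strap   PECI Client Address
Ground               Ground               0x30
Ground               VTT                  0x31
VTT                  Ground               0x32
VTT                  VTT                  0x33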
Table 2-18. Power Impact of PECI Commands vs. C-states
Command              Power Impact
Ping()               Not measurable
GetDIB()             Not measurable
GetTemp()            Not measurable
RdPkgConfig()        Not measurable
WrPkgConfig()        Not measurable
RdIAMSR()            Not measurable. PECI client will not return valid data in core C-state that is C3 or deeper
RdPCIConfigLocal()   May require package ‘pop-up’ to C2 state
WrPCIConfigLocal()   May require package ‘pop-up’ to C2 state
RdPCIConfig()        May require package ‘pop-up’ to C2 state
2.5.3.5 S-states
The processor PECI client is always guaranteed to be operational in the S0 sleep state.
• The Ping(), GetDIB(), GetTemp(), RdPkgConfig(), WrPkgConfig(),
RdPCIConfigLocal() and WrPCIConfigLocal() will be fully operational in S0 and S1.
Responses in S3 or deeper states are dependent on POWERGOOD assertion status.
• The RdPCIConfig() and RdIAMSR() responses are guaranteed in S0 only. Behavior
in S1 or deeper states is indeterminate.
• PECI behavior is indeterminate in the S3, S4 and S5 states and responses to PECI
originator requests when the PECI client is in these states cannot be guaranteed.
2.5.3.6 Processor Reset
The processor PECI client is fully reset on all RESET_N assertions. Upon deassertion of
RESET_N where power is maintained to the processor (otherwise known as a ‘warm
reset’), the following are true:
• The PECI client assumes a bus Idle state.
• The Thermal Filtering Constant is retained.
• PECI SOCKET_ID is retained.
• GetTemp() reading resets to 0x0000.
• Any transaction in progress is aborted by the client (as measured by the client no
longer participating in the response).
• The processor client is otherwise reset to a default configuration.
The assertion of the CPU_ONLY_RESET signal does not reset the processor PECI client.
As such, it will have no impact on the basic PECI commands, namely the Ping(),
GetTemp() and GetDIB(). However, it is likely that other PECI commands that utilize
processor resources being reset will receive a ‘resource unavailable’ response till the
reset sequence is completed.
2.5.3.7 System Service Processor (SSP) Mode Support
Sockets in SSP mode have limited PECI command support. Only the following PECI
commands will be supported while in SSP mode. Other PECI commands are not
guaranteed to complete in this mode.
• Ping
• RdPCIConfigLocal
• WrPCIConfigLocal (all uncore and IIO CSRs within the processor PCI configuration
space will be accessible)
• RdPkgConfig (Index 0 only)
Sockets remain in SSP mode until the "Go" handshake is received. This is applicable to
the following SSP modes.
2.5.3.7.1 BMC INIT Mode
The BMC INIT boot mode is used to provide a quick and efficient means to transfer
responsibility for uncore configuration to a service processor like the BMC. In this
mode, the socket performs a minimal amount of internal configuration and then waits
for the BMC or service processor to complete the initialization.
In cases where the socket is not one Intel QPI hop away from the Firmware Agent
socket, or a working link to the Firmware Agent socket cannot be resolved, the socket
is placed in Link Init mode. The socket performs a minimal amount of internal
configuration and waits for complete configuration by BIOS.
2.5.3.8 Processor Error Handling
Availability of PECI services may be affected by the processor PECI client error status.
Server manageability requirements place a strong emphasis on continued availability of
PECI services to facilitate logging and debug of the error condition.
• Most processor PECI client services are available in the event of a CAT_ERR_N
assertion though they cannot be guaranteed.
• The Ping(), GetDIB(), GetTemp(), RdPkgConfig() and WrPkgConfig() commands will
be serviced if the source of the CAT_ERR_N assertion is not in the processor power
control unit hardware, firmware or associated register logic. Additionally, the
RdPCIConfigLocal() and WrPCIConfigLocal() commands may also be serviced in this
case.
• It is recommended that the PECI originator read Index 0/Parameter 5 using the
RdPkgConfig() command to debug the CAT_ERR_N assertion.
— The PECI client will return the 0x91 completion code if the CAT_ERR_N
assertion is caused by the PCU hardware, firmware or associated logic errors.
In such an event, only the Ping(), GetTemp() and GetDIB() PECI commands
may be serviced. All other processor PECI services will be unavailable and
further debug of the processor error status will not be possible.
— If the PECI client returns a passing completion code, the originator should use
the response data to determine the cause of the CAT_ERR_N assertion. In such
an event, it is also recommended that the PECI originator determine the exact
suite of available PECI client services by issuing each of the PECI commands.
The processor will issue ‘timeout’ responses for those services that may not be
available.
— If the PECI client continues to return the 0x81 completion code in response to
multiple retries of the RdPkgConfig() command, no PECI services, with the
exception of the Ping(), GetTemp() and GetDIB(), will be guaranteed.
• The RdIAMSR() command may be serviced during a CAT_ERR_N assertion though it
cannot be guaranteed.
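The recommended debug flow for a CAT_ERR_N assertion can be sketched as a small triage routine. The callable `read_index0_param5` is a hypothetical stand-in for a RdPkgConfig() read of Index 0/Parameter 5 that returns a (completion code, data) pair; the retry limit and the outcome labels are illustrative choices, not values from this datasheet.

```python
from typing import Callable, Optional, Tuple

CC_PCU_ERROR = 0x91


def triage_cat_err(read_index0_param5: Callable[[], Tuple[int, int]],
                   max_retries: int = 3) -> Tuple[str, Optional[int]]:
    """Classify the client state after CAT_ERR_N from the RdPkgConfig() result."""
    for _ in range(max_retries):
        cc, data = read_index0_param5()
        if cc == CC_PCU_ERROR:
            # Only Ping(), GetTemp() and GetDIB() may still be serviced.
            return ("pcu_error", None)
        if (cc & 0x80) == 0:
            # Passing code: use the data, then probe each command for availability.
            return ("inspect_data", data)
        # Any other failing code (for example 0x81): retry.
    # Repeated failures: nothing beyond Ping(), GetTemp(), GetDIB() is guaranteed.
    return ("basic_commands_only", None)
```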
2.5.3.9 Originator Retry and Timeout Policy
The PECI originator may need to retry a command if the processor PECI client responds
with a ‘response timeout’ completion code or a bad Read FCS. In each instance, the
processor PECI client may have started the operation but not completed it yet. When
the 'retry' bit is set, the PECI client will ignore a new request if it exactly matches a
previous valid request.
The processor PECI client will not clear the semaphore that was acquired to service the
request until the originator sends the ‘retry’ request in a timely fashion to successfully
retrieve the response data. In the absence of any automatic timeouts, this could tie up
shared resources and result in artificial bandwidth conflicts.
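Because a retry has to match the previous request exactly apart from the retry indication, one way to produce it is to copy the original request bytes and set only the retry bit. The sketch below assumes the Host ID[7:1] & Retry[0] byte sits at index 4 of the request (after client address, Write Length, Read Length and command code) and that the FCS is recomputed afterwards by the transport; the function name is hypothetical.

```python
HOST_ID_RETRY_INDEX = 4  # assumed position of the Host ID[7:1] & Retry[0] byte


def with_retry_bit(request: bytes) -> bytes:
    """Return a copy of the request with Retry[0] set and all else unchanged."""
    if len(request) <= HOST_ID_RETRY_INDEX:
        raise ValueError("request too short to contain the Host ID/Retry byte")
    out = bytearray(request)
    out[HOST_ID_RETRY_INDEX] |= 0x01
    return bytes(out)
```

Issuing the retry promptly matters, since the client holds the resources acquired for the original request until the response is retrieved.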
2.5.3.10 Enumerating PECI Client Capabilities
The PECI host originator should be designed to support all optional but desirable
features from all processors of interest. Each feature has a discovery method and
response code that indicates availability on the destination PECI client.
The first step in the enumeration process would be for the PECI host to confirm the
Revision Number through the use of the GetDIB() command. The revision number
returned by the PECI client processor always maps to the revision number of the PECI
specification that it is designed to. The Minor Revision Number as described in Table 2-2
may be used to identify the subset of PECI commands that the processor in question
supports for any major PECI revision.
The next step in the enumeration process is to utilize the desired command suite in a
real execution context. If the Write FCS response is an Abort FCS or if the data
returned includes an “Unknown/Invalid/Illegal Request” completion code (0x90), then
the command is unsupported.
Enumerating known commands without real, execution context data, or attempting
undefined commands, is dangerous because a write command could result in
unexpected behavior if the data is not properly formatted. Methods for enumerating
write commands using carefully constructed and innocuous data are possible, but are
not guaranteed by the PECI client definition.
This enumeration procedure is not robust enough to detect differences in bit definitions
or data interpretation in the message payload or client response. Instead, it is only
designed to enumerate discrete features.
2.5.4 Multi-Domain Commands
The processor does not support multiple domains, but it is possible that future products
will, and the following tables are included as a reference for domain-specific definitions.
The Client responds with an Abort FCS under the following conditions:
• The decoded command is not understood or not supported on this processor (this
includes good command codes with bad Read Length or Write Length bytes).
• Assured Write FCS (AW FCS) failure. Under most circumstances, an Assured Write
failure will appear as a bad FCS. However, when an originator issues a poorly
formatted command with a miscalculated AW FCS, the client will intentionally abort
the FCS in order to guarantee originator notification.
2.5.5.2 Completion Codes
Some PECI commands respond with a completion code byte. These codes are designed
to communicate the pass/fail status of the command and may also provide more
detailed information regarding the class of pass or fail. For all commands listed in
Section 2.5.2 that support completion codes, the definition in the following table
applies. Throughout this document, a completion code reference may be abbreviated
with ‘CC’.
An originator that is decoding these commands can apply a simple mask as shown in
Table 2-21 to determine a pass or fail. Bit 7 is always set on a command that did not
complete successfully and is cleared on a passing command.
Table 2-21. Completion Code Pass/Fail Mask
0xxx xxxxb   Command passed
1xxx xxxxb   Command failed
Table 2-22. Device Specific Completion Code (CC) Definition
Completion Code   Description
CC: 0x40          Command Passed
CC: 0x80          Response timeout. The processor was not able to generate the required response in a timely fashion. Retry is appropriate.
CC: 0x81          Response timeout. The processor was not able to allocate resources for servicing this command. Retry is appropriate.
CC: 0x82          The processor hardware resources required to service this command are in a low power state. Retry may be appropriate after modification of PECI wake mode behavior if appropriate.
CC: 0x83-8F       Reserved
CC: 0x90          Unknown/Invalid/Illegal Request
CC: 0x91          PECI control hardware, firmware or associated logic error. The processor is unable to process the request.
CC: 0x92-9F       Reserved
Note:The codes explicitly defined in Table 2-22 may be useful in PECI originator response
algorithms. Reserved or undefined codes may also be generated by a PECI client
device, and the originating agent must be capable of tolerating any code. The Pass/Fail
mask defined in Table 2-21 applies to all codes, and general response policies may be
based on this information. Refer to Section 2.5.6 for originator response policies and
recommendations.
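The pass/fail mask and the defined codes can be combined in a small decoder such as the sketch below. The code values and their meanings come from Tables 2-21 and 2-22; the function names and the wording of the fallback strings are illustrative. Undefined codes are tolerated and classified by bit 7 alone, in line with the note above.

```python
CC_DESCRIPTIONS = {
    0x40: "Command passed",
    0x80: "Response timeout (response not generated in time); retry",
    0x81: "Response timeout (resources not allocated); retry",
    0x82: "Resources in a low power state",
    0x90: "Unknown/Invalid/Illegal Request",
    0x91: "PECI control hardware, firmware or associated logic error",
}


def cc_passed(cc: int) -> bool:
    """Bit 7 is cleared on a passing command and set on a failing one."""
    return (cc & 0x80) == 0


def describe_cc(cc: int) -> str:
    """Describe a completion code, tolerating reserved or undefined values."""
    if cc in CC_DESCRIPTIONS:
        return CC_DESCRIPTIONS[cc]
    return "Undefined passing code" if cc_passed(cc) else "Reserved or undefined failing code"
```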
2.5.6 Originator Responses
The simplest policy that an originator may employ in response to receipt of a failing
completion code is to retry the request. However, certain completion codes or FCS
responses are indicative of an error in command encoding and a retry will not result in
a different response from the client. Furthermore, the message originator must have a
response policy in the event of successive failure responses. Refer to Table 2-23 for
originator response guidelines.
Refer to the definition of each command in Section 2.5.2 for a specific definition of
possible command codes or FCS responses for a given command. The following
response policy definition is generic, and more advanced response policies may be
employed at the discretion of the originator developer.
Table 2-23. Originator Response Guidelines
Response         After 1 Attempt                                            After 3 Attempts
Bad FCS          Retry                                                      Fail with PECI client device error.
Abort FCS        Retry                                                      Fail with PECI client device error if command was not illegal or malformed.
CC: 0x8x         Retry                                                      The PECI client has failed in its attempts to generate a response. Notify application layer.
CC: 0x9x         Abandon any further attempts and notify application layer  N/A
None (all 0’s)   Force bus idle (drive low) for 1 ms and retry              Fail with PECI client device error. Client may not be alive or may be otherwise unresponsive (for example, it could be in RESET).
CC: 0x4x         Pass                                                       N/A
Good FCS         Pass                                                       N/A
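The generic policy in Table 2-23 can be sketched as a decision function. The response encoding (the strings "bad_fcs", "abort_fcs" and "none", or an integer completion code received with a good FCS), the `Action` names and the handling of failing codes outside 0x8x and 0x9x are illustrative choices rather than part of the guidelines.

```python
from enum import Enum


class Action(Enum):
    PASS = "pass"
    RETRY = "retry"
    IDLE_THEN_RETRY = "force bus idle for 1 ms, then retry"
    NOTIFY = "notify application layer"
    ABANDON_AND_NOTIFY = "abandon and notify application layer"
    FAIL_DEVICE_ERROR = "fail with PECI client device error"


MAX_ATTEMPTS = 3


def next_action(response, attempts: int) -> Action:
    """Choose the originator action after `attempts` tries of one request."""
    exhausted = attempts >= MAX_ATTEMPTS
    if response in ("bad_fcs", "abort_fcs"):
        # For an Abort FCS, the device error applies only if the command
        # itself was not illegal or malformed.
        return Action.FAIL_DEVICE_ERROR if exhausted else Action.RETRY
    if response == "none":  # all zeroes: client absent, in reset or unresponsive
        return Action.FAIL_DEVICE_ERROR if exhausted else Action.IDLE_THEN_RETRY
    if not isinstance(response, int):
        raise ValueError(f"unrecognized response: {response!r}")
    if (response & 0x80) == 0:
        return Action.PASS
    if (response & 0xF0) == 0x80:
        return Action.NOTIFY if exhausted else Action.RETRY
    # 0x9x, and (as a conservative choice here) any other failing code.
    return Action.ABANDON_AND_NOTIFY
```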
2.5.7 DTS Temperature Data
2.5.7.1 Format
The temperature is formatted in a 16-bit, 2’s complement value representing a number
of 1/64 degrees centigrade. This format allows temperatures in a range of ±512° C to
be reported to approximately a 0.016° C resolution.
Figure 2-50. Temperature Sensor Data Format
Bit layout, MSB first: bit [15] Sign (S), bits [14:6] Integer Value (0-511), bits [5:0] Fractional Value (~0.016).
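A minimal decoding sketch for this format follows. It takes the 16-bit value as an unsigned integer (already assembled from the response bytes) and returns degrees centigrade; the function name is hypothetical.

```python
def decode_peci_temperature(raw16: int) -> float:
    """Convert a 16-bit 2's complement count of 1/64 degrees C to degrees C."""
    if not 0 <= raw16 <= 0xFFFF:
        raise ValueError("raw16 must fit in 16 bits")
    signed = raw16 - 0x10000 if raw16 & 0x8000 else raw16
    return signed / 64.0
```

For example, 0xFD80 is -640 as a signed 16-bit value, which is -10.0 degrees.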
2.5.7.2 Interpretation
The resolution of the processor’s Digital Thermal Sensor (DTS) is approximately 1°C,
which can be confirmed by a RDMSR from the IA32_THERM_STATUS MSR where it is
architecturally defined. The MSR read will return only bits [13:6] of the PECI
temperature sensor data defined in Figure 2-50. PECI temperatures are sent through a
configurable low-pass filter prior to delivery in the GetTemp() response data. The
output of this filter produces temperatures at the full 1/64°C resolution even though
the DTS itself is not this accurate.
Temperature readings from the processor are always negative in a 2’s complement
format, and imply an offset from the processor Tjmax (PECI = 0). For example, if the
processor Tjmax is 100°C, a PECI thermal reading of -10 implies that the processor is
running at approximately 10°C below Tjmax, or at 90°C. PECI temperature readings are
not reliable at temperatures above Tjmax since the processor is outside its operating
range and hence, PECI temperature readings are never positive.
The changes in PECI data counts are approximately linear in relation to changes in
temperature in degrees centigrade. A change of ‘1’ in the PECI count represents
roughly a temperature change of 1 degree centigrade. This linearity is approximate and
cannot be guaranteed over the entire range of PECI temperatures, especially as the
offset from the maximum PECI temperature (zero) increases.
2.5.7.3 Temperature Filtering
The processor digital thermal sensor (DTS) provides an improved capability to monitor
device hot spots, which inherently leads to more varying temperature readings over
short time intervals. Coupled with the fact that typical fan speed controllers may only
read temperatures at 4 Hz, it is necessary for the thermal readings to reflect thermal
trends and not instantaneous readings. Therefore, PECI supports a configurable
low-pass temperature filtering function that is expressed by the equation:
$$T_N = (1 - \alpha) \cdot T_{N-1} + \alpha \cdot T_{SAMPLE}$$
where $T_N$ and $T_{N-1}$ are the current and previous averaged PECI temperature values
respectively, $T_{SAMPLE}$ is the current PECI temperature sample value and the variable
$\alpha = 1/2^X$, where ‘X’ is the ‘Thermal Averaging Constant’ that is programmable as
described in Section 2.5.2.6.21.
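The filter equation can be exercised numerically with a short sketch. The function and parameter names are hypothetical; only the relationship between the previous average, the sample, and the Thermal Averaging Constant X comes from the text above.

```python
def filter_temperature(previous_average, sample, averaging_constant):
    """Apply one step of the low-pass filter T_N = (1 - a) * T_(N-1) + a * T_SAMPLE.

    `averaging_constant` is X, so a = 1 / 2**X.
    """
    alpha = 1.0 / (2 ** averaging_constant)
    return (1.0 - alpha) * previous_average + alpha * sample
```

With X = 0 the filter passes each sample through unchanged; larger values of X weight the history more heavily, so a step change in the sample is reflected more slowly in the reported value.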
2.5.7.4 Reserved Values
Several values well out of the operational range are reserved to signal temperature
sensor errors. These are summarized in Table 2-24.
Table 2-24. Error Codes and Descriptions
Error Code | Description
0x8000 | General Sensor Error (GSE)
0x8001 | Reserved
0x8002 | Sensor is operational, but has detected a temperature below its operational range (underflow)
0x8003-0x81ff | Reserved
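A reader of the temperature data would typically screen for the reserved values before treating a word as a temperature. The following is a minimal sketch under stated assumptions: the function name, the result tuples, and the `tjmax_c` parameter are hypothetical, and Tjmax must be obtained separately by the caller.

```python
GENERAL_SENSOR_ERROR = 0x8000
UNDERFLOW = 0x8002
RESERVED_FIRST = 0x8000   # start of the reserved error-code range
RESERVED_LAST = 0x81FF    # end of the reserved error-code range


def interpret_reading(raw, tjmax_c):
    """Classify a raw 16-bit PECI word.

    Returns ("error", reason) for reserved codes, otherwise
    ("ok", absolute temperature in degrees centigrade), computed as
    Tjmax plus the (negative) PECI offset in 1/64 degree counts.
    """
    if raw == GENERAL_SENSOR_ERROR:
        return ("error", "general sensor error")
    if raw == UNDERFLOW:
        return ("error", "underflow")
    if RESERVED_FIRST <= raw <= RESERVED_LAST:
        return ("error", "reserved")
    signed = raw - 0x10000 if raw & 0x8000 else raw
    return ("ok", tjmax_c + signed / 64.0)
```

With a Tjmax of 100°C, a reading of -10 (raw 0xFD80) is reported as 90°C, matching the example in Section 2.5.7.2.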
§
Intel® Xeon® Processor E5-1600/E5-2600/E5-4600 Product Families77
Datasheet Volume One
Intel® Virtualization Technology (Intel® VT) makes a single system appear as multiple
independent systems to software. This allows multiple, independent operating systems
to run simultaneously on a single system. Intel VT comprises technology components
to support virtualization of platforms based on Intel architecture microprocessors and
chipsets.
• Intel® Virtualization Technology (Intel® VT) for Intel® 64 and IA-32 Intel® Architecture (Intel® VT-x) adds hardware support in the processor to
improve the virtualization performance and robustness. Intel VT-x specifications
and functional descriptions are included in the Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 3B, which is available at
http://www.intel.com/products/processor/manuals/index.htm
• Intel® Virtualization Technology (Intel® VT) for Directed I/O (Intel® VT-d) adds processor and uncore implementations to support and
improve I/O virtualization performance and robustness. The Intel VT-d specification and
other Intel VT documents can be referenced at
http://www.intel.com/technology/virtualization/index.htm.
3.1.1 Intel VT-x Objectives
Intel VT-x provides hardware acceleration for virtualization of IA platforms. A Virtual
Machine Monitor (VMM) can use Intel VT-x features to provide an improved, reliable
virtualized platform. By using Intel VT-x, a VMM is:
• Robust: VMMs no longer need to use para-virtualization or binary translation. This
means that they will be able to run off-the-shelf OS’s and applications without any
special steps.
• Enhanced: Intel VT enables VMMs to run 64-bit guest operating systems on IA x86
processors.
• More reliable: Due to the hardware support, VMMs can now be smaller, less
complex, and more efficient. This improves reliability and availability and reduces
the potential for software conflicts.
• More secure: The use of hardware transitions in the VMM strengthens the isolation
of VMs and further prevents corruption of one VM from affecting others on the
same system.
3.1.2 Intel VT-x Features
The processor core supports the following Intel VT-x features:
• Extended Page Tables (EPT)
— Eliminates VM exits from guest OS to the VMM for shadow page-table
maintenance
• Virtual Processor IDs (VPID)
— Ability to assign a VM ID to tag processor core hardware structures (for
example, TLBs)
— This avoids flushes on VM transitions to give a lower-cost VM transition time
and an overall reduction in virtualization overhead.
• Guest Preemption Timer
— Mechanism for a VMM to preempt the execution of a guest OS after an amount
of time specified by the VMM. The VMM sets a timer value before entering a
guest
— The feature aids VMM developers in flexibility and Quality of Service (QoS)
guarantees
• Descriptor-Table Exiting
— Descriptor-table exiting allows a VMM to protect a guest OS from internal
(malicious software based) attack by preventing relocation of key system data
structures like IDT (interrupt descriptor table), GDT (global descriptor table),
LDT (local descriptor table), and TSS (task state segment).
— A VMM using this feature can intercept (by a VM exit) attempts to relocate
these data structures and prevent them from being tampered with by malicious
software.
• Pause Loop Exiting (PLE)
— PLE aims to improve virtualization performance and enhance the scaling of
virtual machines with multiple virtual processors
— PLE attempts to detect lock-holder preemption in a VM and helps the VMM to
make better scheduling decisions
Technologies
3.1.3 Intel VT-d Objectives
The key Intel VT-d objectives are domain-based isolation and hardware-based
virtualization. A domain can be abstractly defined as an isolated environment in a
platform to which a subset of host physical memory is allocated. Virtualization allows
for the creation of one or more partitions on a single system. This could be multiple
partitions in the same operating system, or there can be multiple operating system
instances running on the same system – offering benefits such as system
consolidation, legacy migration, activity partitioning or security.
3.1.3.1 Intel VT-d Features Supported
The processor supports the following Intel VT-d features:
• Root entry, context entry, and default context
• Support for 4-K page sizes only
• Support for register-based fault recording only (for single entry only) and support
The processor supports the following Intel VT Processor Extensions features:
• Large Intel VT-d Pages
— Adds 2 MB and 1 GB page sizes to Intel VT-d implementations
— Matches current support for Extended Page Tables (EPT)
— Ability to share CPU's EPT page-table (with super-pages) with Intel VT-d
— Benefits:
• Less memory foot-print for I/O page-tables when using super-pages
• Potential for improved performance - Due to shorter page-walks, allows
hardware optimization for IOTLB
• Transition latency reductions are expected to improve virtualization performance
without the need for VMM enabling. This reduces the VMM overheads further and
increases virtualization performance.
3.2 Security Technologies
3.2.1 Intel® Trusted Execution Technology
Intel® Trusted Execution Technology (Intel® TXT) defines platform-level
enhancements that provide the building blocks for creating trusted platforms.
The Intel TXT platform helps to provide the authenticity of the controlling environment
such that those wishing to rely on the platform can make an appropriate trust decision.
The Intel TXT platform determines the identity of the controlling environment by
accurately measuring and verifying the controlling software.
Another aspect of the trust decision is the ability of the platform to resist attempts to
change the controlling environment. The Intel TXT platform will resist attempts by
software processes to change the controlling environment or bypass the bounds set by
the controlling environment.
Intel TXT is a set of extensions designed to provide a measured and controlled launch
of system software that will then establish a protected environment for itself and any
additional software that it may execute.
These extensions enhance two areas:
• The launching of the Measured Launched Environment (MLE).
• The protection of the MLE from potential corruption.
The enhanced platform provides these launch and control interfaces using Safer Mode
Extensions (SMX).
The SMX interface includes the following functions:
• Measured/Verified launch of the MLE.
• Mechanisms to ensure the above measurement is protected and stored in a secure
location.
• Protection mechanisms that allow the MLE to control attempts to modify itself.
For more information, refer to the Intel® Trusted Execution Technology Software
Development Guide.
For more information on Intel Trusted Execution Technology, see
http://www.intel.com/technology/security/
3.2.2 Intel Trusted Execution Technology – Server Extensions
• Software binary compatible with Intel Trusted Execution Technology Server
Extensions
• Provides measurement of runtime firmware, including SMM
• Enables run-time firmware in trusted session: BIOS and SSP
• Covers support for existing and expected future Server RAS features
• Only requires portions of BIOS to be trusted, for example, Option ROMs need not
be trusted
• Supports S3 State without teardown: Since BIOS is part of the trust chain
3.2.3 Intel® Advanced Encryption Standard Instructions (Intel® AES-NI)
These instructions enable fast and secure data encryption and decryption using the
Advanced Encryption Standard (AES), which is defined by FIPS Publication
number 197. Since AES is the dominant block cipher, and it is deployed in
various protocols, the new instructions will be valuable for a wide range of applications.
The architecture consists of six instructions that offer full hardware support for
AES. Four instructions support AES encryption and decryption, and the
other two instructions support AES key expansion. Together, they offer a
significant increase in performance compared to pure software implementations.
The Intel AES-NI instructions have the flexibility to support all three standard
AES key lengths, all standard modes of operation, and even some nonstandard or
future variants.
Beyond improving performance, the Intel AES-NI instructions provide important
security benefits. Since the instructions run in data-independent time and do not use
lookup tables, they help in eliminating the major timing and cache-based attacks that
threaten table-based software implementations of AES. In addition, these
instructions make AES simple to implement, with reduced code size. This helps
reduce the risk of inadvertent introduction of security flaws, such as
difficult-to-detect side channel leaks.
Intel's Execute Disable Bit functionality can help prevent certain classes of malicious
buffer overflow attacks when combined with a supporting operating system.
• Allows the processor to classify areas in memory by where application code can
execute and where it cannot.
• When a malicious worm attempts to insert code in the buffer, the processor
disables code execution, preventing damage and worm propagation.
3.3 Intel® Hyper-Threading Technology
The processor supports Intel® Hyper-Threading Technology (Intel® HT Technology),
which allows an execution core to function as two logical processors. While some
execution resources such as caches, execution units, and buses are shared, each
logical processor has its own architectural state with its own set of general-purpose
registers and control registers. This feature must be enabled via the BIOS and requires
operating system support. For more information on Intel Hyper-Threading Technology,
see http://www.intel.com/products/ht/hyperthreading_more.htm.
3.4 Intel® Turbo Boost Technology
Intel® Turbo Boost Technology is a feature that allows the processor to
opportunistically and automatically run faster than its rated operating frequency if it is
operating below power, temperature, and current limits. The result is increased
performance in multi-threaded and single threaded workloads. It should be enabled in
the BIOS for the processor to operate with maximum performance.
3.4.1 Intel® Turbo Boost Operating Frequency
The processor’s rated frequency assumes that all execution cores are running an
application at the thermal design power (TDP). However, under typical operation, not
all cores are active. Therefore most applications are consuming less than the TDP at the
rated frequency. To take advantage of the available TDP headroom, the active cores can
increase their operating frequency.
To determine the highest performance frequency amongst active cores, the processor
takes the following into consideration:
• The number of cores operating in the C0 state.
• The estimated current consumption.
• The estimated power consumption.
• The die temperature.
Any of these factors can affect the maximum frequency for a given workload. If the
power, current, or thermal limit is reached, the processor will automatically reduce the
frequency to stay within its TDP limit.
Note: Intel Turbo Boost Technology is only active if the operating system is requesting the P0
state. For more information on P-states and C-states, refer to Section 4, “Power
Management”.
3.5 Enhanced Intel SpeedStep® Technology
The processor supports Enhanced Intel SpeedStep® Technology as an advanced means
of enabling very high performance while also meeting the power-conservation needs of
the platform.
Enhanced Intel SpeedStep Technology builds upon that architecture using design
strategies that include the following:
• Separation between Voltage and Frequency Changes. By stepping voltage up
and down in small increments separately from frequency changes, the processor is
able to reduce periods of system unavailability (which occur during frequency
change). Thus, the system is able to transition between voltage and frequency
states more often, providing improved power/performance balance.
• Clock Partitioning and Recovery. The bus clock continues running during state
transition, even when the core clock and Phase-Locked Loop are stopped, which
allows logic to remain active. The core clock is also able to restart more quickly
under Enhanced Intel SpeedStep Technology.
For additional information on Enhanced Intel SpeedStep Technology see Section 4.2.1.
3.6 Intel® Intelligent Power Technology
Intel® Intelligent Power Technology conserves power while delivering advanced
power-management capabilities at the rack, group, and data center level. It provides the
highest system-level performance per watt with “Automated Low Power States” and
“Integrated Power Gates”. Improvements to this processor generation are:
• Intel Network Power Management Technology
• Intel Power Tuning Technology
For more information on Intel Intelligent Power Technology, see
http://www.intel.com/technology/intelligentpower/.
3.7 Intel® Advanced Vector Extensions (Intel® AVX)
Intel® Advanced Vector Extensions (Intel® AVX) is a new 256-bit vector SIMD
extension of Intel Architecture. The introduction of Intel AVX starts with the 2nd
Generation Intel® Core™ Processor Family. Intel AVX accelerates the trend of
parallel computation in general purpose applications like image, video, and audio
processing, engineering applications such as 3D modeling and analysis, scientific
simulation, and financial analytics.
Intel AVX is a comprehensive ISA extension of the Intel® 64 Architecture. The main
elements of Intel AVX are:
• Support for wider vector data (up to 256-bit) for floating-point computation.
• Efficient instruction encoding scheme that supports 3 operand syntax and
headroom for future extensions.
• Flexibility in programming environment, ranging from branch handling to relaxed
memory alignment requirements.
• New data manipulation and arithmetic compute primitives, including broadcast,
permute, fused-multiply-add, and so forth.
• Performance - Intel AVX can accelerate application performance via data
parallelism and scalable hardware infrastructure across existing and new
application domains:
— 256-bit vector data sets can be processed up to twice the throughput of 128-bit
data sets.
— Application performance can scale up with number of hardware threads and
number of cores.
— Application domain can scale out with advanced platform interconnect fabrics,
such as Intel QPI.
• Power Efficiency - Intel AVX is extremely power efficient. Incremental power is
insignificant when the instructions are unused or scarcely used. Combined with the
high performance that it can deliver, applications that lend themselves heavily to
using Intel AVX can be much more energy efficient and realize a higher
performance-per-watt.
• Extensibility - Intel AVX has built-in extensibility for future vector extensions:
— OS context management for vector-widths beyond 256 bits is streamlined.
— Efficient instruction encoding allows unlimited functional enhancements:
• Vector width support beyond 256 bits
• 256-bit Vector Integer processing
• Additional computational and/or data manipulation primitives.
• Compatibility - Intel AVX is backward compatible with previous ISA extensions
including Intel® SSE4:
— Existing Intel SSE applications/libraries can:
• Run unmodified and benefit from processor enhancements
• Be recompiled from existing Intel SSE intrinsics using compilers that generate
Intel AVX code
• Inter-operate with libraries ported to Intel AVX
— Applications compiled with Intel AVX can inter-operate with existing Intel SSE
libraries.
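Because Intel AVX depends on operating system context management for the wider vector state, software normally confirms both processor and OS support before using it. The sketch below assumes the caller has already obtained the CPUID leaf 01H ECX value and the XCR0 register value through some platform-specific means (not shown); the function name is hypothetical. The bit positions used (AVX at ECX bit 28, OSXSAVE at ECX bit 27, SSE and AVX state at XCR0 bits 1 and 2) are those given in the Intel 64 and IA-32 Architectures Software Developer's Manual.

```python
CPUID_1_ECX_OSXSAVE = 1 << 27    # OS has enabled XSAVE/XGETBV
CPUID_1_ECX_AVX = 1 << 28        # processor supports Intel AVX
XCR0_SSE_AND_AVX_STATE = 0b110   # XCR0 bits 1 (SSE state) and 2 (AVX state)


def avx_usable(cpuid_1_ecx, xcr0):
    """Return True if both the processor and the OS allow Intel AVX use.

    `cpuid_1_ecx` and `xcr0` are raw register values supplied by the caller.
    """
    has_avx = bool(cpuid_1_ecx & CPUID_1_ECX_AVX)
    has_osxsave = bool(cpuid_1_ecx & CPUID_1_ECX_OSXSAVE)
    if not (has_avx and has_osxsave):
        return False
    # The OS must have enabled saving of both SSE and AVX register state.
    return (xcr0 & XCR0_SSE_AND_AVX_STATE) == XCR0_SSE_AND_AVX_STATE
```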
3.8 Intel® Dynamic Power Technology (Intel® DPT)
Intel® Dynamic Power Technology (Intel® DPT) (Memory Power Management) is a
platform feature with the ability to transition memory components into various low
power states based on workload requirements. The Intel® Xeon® processor E5-1600/
E5-2600/E5-4600 product families platform supports Dynamic CKE (hardware assisted)
and Memory Self Refresh (software assisted). For further details refer to the
Specifications for Memory Power Management
document.
§
This chapter provides information on the following power management topics:
• ACPI States
• System States
• Processor Core/Package States
• Integrated Memory Controller (IMC) and System Memory States
• Direct Media Interface Gen 2 (DMI2)/PCI Express* Link States
• Intel QuickPath Interconnect States
4.1 ACPI States Supported
The ACPI states supported by the processor are described in this section.
4.1.1 System States
Table 4-1. System States
State | Description
G0/S0 | Full On
G1/S3-Cold | Suspend-to-RAM (STR). Context saved to memory.
G1/S4 | Suspend-to-Disk (STD). All power lost (except wakeup on PCH).
G2/S5 | Soft off. All power lost (except wakeup on PCH). Total reboot.
G3 | Mechanical off. All power removed from system.
4.1.2 Processor Package and Core States
Table 4-2 lists the package C-state support as: 1) the shallowest core C-state that
allows entry into the package C-state, 2) the additional factors that will restrict the
state from going any deeper, and 3) the actions taken with respect to the Ring Vcc, PLL
state and LLC.
Table 4-3 lists the processor core C-states support.
Table 4-2. Package C-State Support
Package C-State | Core States | Limiting Factors | Retention and PLL-Off | LLC Fully Flushed | Notes (1)
PC0 - Active | CC0 | N/A | No | No | 2
PC2 - Snoopable Idle | CC3-CC7 | PCIe/PCH and Remote Socket Snoops; PCIe/PCH and Remote Socket Accesses; Interrupt response time requirement; DMI Sidebands; Configuration Constraints | VccMin; Freq = MinFreq; PLL = ON | No | 2
PC3 - Light Retention | | Core C-state; Snoop Response Time; Interrupt Response Time; Non Snoop Response Time | Vcc = retention; PLL = OFF | No | 2,3,4
PC6 - Deeper Retention | | LLC ways open; Snoop Response Time; Non Snoop Response Time; Interrupt Response Time | Vcc = retention; PLL = OFF | No | 2,3,4
Notes:
1. Package C7 is not supported.
2. All package states are defined to be "E" states - such that they always exit back into the LFM point upon
execution resume.
3. The mapping of actions for PC3 and PC6 are suggestions - microcode will dynamically determine which
actions should be taken based on the desired exit latency parameters.
4. CC3/CC6 will all use a voltage below the VccMin operational point. The exact voltage selected will be a
function of the snoop and interrupt response time requirements made by the devices (PCIe* and DMI) and
the operating system.
Table 4-3. Processor Core C-States Support
CC0 | Running | On | Coherent | Active | Maintained
CC1 | Stopped | On | Coherent | Active | Maintained
CC1E | Stopped | On | Coherent | Request LFM | Maintained
CC3 | Stopped | On | Flushed to LLC | Request Retention | Maintained
CC6 | Stopped | Off | Flushed to LLC | Power Gate | Flushed to LLC
CC7 | Stopped | Off | Flushed to LLC | Power Gate | Flushed to LLC
4.1.3 Integrated Memory Controller States
Table 4-4. System Memory Power States
State | Description
Power Up/Normal Operation | CKE asserted. Active Mode, highest power consumption.
CKE Power Down | Opportunistic, per rank control after idle time:
• Active Power Down (APD) (default mode)
— CKE de-asserted. Power savings in this mode, relative to active idle
state is about 55% of the memory power. Exiting this mode takes 3 – 5 DCLK cycles.
• Pre-charge Power Down Fast Exit (PPDF)
— CKE de-asserted. DLL - On. Also known as Fast CKE. Power savings in
this mode, relative to active idle state is about 60% of the memory
power. Exiting this mode takes 3 – 5 DCLK cycles.
• Pre-charge Power Down Slow Exit (PPDS)
— CKE de-asserted. DLL - Off. Also known as Slow CKE. Power savings in
this mode, relative to active idle state is about 87% of the memory
power. Exiting this mode takes 3 – 5 DCLK cycles until the first
command is allowed and 16 cycles until first data is allowed.
• Register CKE Power Down:
— IBT-ON mode: Both CKE’s are de-asserted, the Input Buffer
Terminators (IBTs) are left “on”.
— IBT-OFF mode: Both CKE’s are de-asserted, the Input Buffer
Terminators (IBTs) are turned “off”.
Self-Refresh | CKE de-asserted. In this mode, no transactions are executed and the system
memory consumes the minimum possible power. Self refresh modes apply to
all memory channels for the processor.
• IO-MDLL Off: Option that sets the IO master DLL off when self refresh
occurs.
• PLL Off: Option that sets the PLL off when self refresh occurs.
In addition, the register component found on registered DIMMs (RDIMMs) is
complemented with the following power down states:
— Clock Stopped Power Down with IBT-On
— Clock Stopped Power Down with IBT-Off
4.1.4 DMI2/PCI Express Link States
Table 4-5. DMI2/PCI Express* Link States
State | Description
L0 | Full on – Active transfer state.
L1 | Lowest Active State Power Management (ASPM) - Longer exit latency.
Note: L1 is only supported when the DMI2/PCI Express* port is operating as a PCI Express* port.
4.1.5 Intel QuickPath Interconnect States
Table 4-6. Intel QPI States
State | Description
L0 | Link on. This is the power on active working state.
L0p | A lower power state from L0 that reduces the link from full width to half width.
L1 | A low power state with longer latency and lower power than L0s and is
activated in conjunction with package C-states below C0.
4.1.6 G, S, and C State Combinations
Table 4-7. G, S and C State Combinations
Global (G) State | Sleep (S) State | Processor Core (C) State | Processor State | System Clocks | Description
G0 | S0 | C0 | Full On | On | Full On
G0 | S0 | C1/C1E | Auto-Halt | On | Auto-Halt
G0 | S0 | C3 | Deep Sleep | On | Deep Sleep
G0 | S0 | C6/C7 | Deep Power Down | On | Deep Power Down
G1 | S3 | Power off | | Off, except RTC | Suspend to RAM
G1 | S4 | Power off | | Off, except RTC | Suspend to Disk
G2 | S5 | Power off | | Off, except RTC | Soft Off
G3 | N/A | Power off | | Power off | Hard off
Power Management
4.2 Processor Core/Package Power Management
While executing code, Enhanced Intel SpeedStep Technology optimizes the processor’s
frequency and core voltage based on workload. Each frequency and voltage operating
point is defined by ACPI as a P-state. When the processor is not executing code, it is
idle. A low-power idle state is defined by ACPI as a C-state. In general, lower power
C-states have longer entry and exit latencies.
4.2.1 Enhanced Intel SpeedStep® Technology
The following are the key features of Enhanced Intel SpeedStep Technology:
• Multiple frequency and voltage points for optimal performance and power
efficiency. These operating points are known as P-states.
• Frequency selection is software controlled by writing to processor MSRs. The
voltage is optimized based on temperature, leakage, power delivery loadline and
dynamic capacitance.
— If the target frequency is higher than the current frequency, Vcc is ramped up
to an optimized voltage. This voltage is signaled by the SVID Bus to the voltage
regulator. Once the voltage is established, the PLL locks on to the target
frequency.
— If the target frequency is lower than the current frequency, the PLL locks to the
target frequency, then transitions to a lower voltage by signaling the target
voltage on the SVID Bus.
— All active processor cores share the same frequency and voltage. In a multi-core
processor, the highest frequency P-state requested amongst all active
cores is selected.
— Software-requested transitions are accepted at any time. New in this processor
generation, the processor can preempt the previous transition and complete the
new request without waiting for the earlier transition to complete.
• The processor controls voltage ramp rates internally to ensure glitch-free
transitions.
• Because there is low transition latency between P-states, a significant number of
transitions per second are possible.
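The P-state rules above can be summarized in a small model. This is an illustrative sketch only: the function names and step labels are hypothetical, frequencies are plain numbers in MHz, and the real sequencing is performed by the processor and voltage regulator, not by software.

```python
def resolve_frequency(requested_mhz):
    """Select the shared operating frequency for all active cores.

    `requested_mhz` holds one requested frequency per active core; the
    highest request wins because active cores share frequency and voltage.
    """
    if not requested_mhz:
        raise ValueError("at least one active core is required")
    return max(requested_mhz)


def transition_steps(current_mhz, target_mhz):
    """Return the order of voltage and PLL actions for a frequency change."""
    if target_mhz > current_mhz:
        # Raise the voltage first, then lock the PLL to the higher frequency.
        return ["ramp_voltage_up", "lock_pll_to_target"]
    if target_mhz < current_mhz:
        # Lower the frequency first, then signal the lower voltage.
        return ["lock_pll_to_target", "lower_voltage"]
    return []
```

The ordering reflects that the voltage must always be sufficient for the frequency in use: voltage leads on the way up and trails on the way down.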
4.2.2 Low-Power Idle States
When the processor is idle, low-power idle states (C-states) are used to save power.
More power savings actions are taken for numerically higher C-states. However, higher
C-states have longer exit and entry latencies. Resolution of C-states occurs at the
thread, processor core, and processor package level. Thread level C-states are
available if Hyper-Threading Technology is enabled. Entry and exit of the C-States at
the thread and core level are shown in Figure 4-2.
Figure 4-1. Idle Power Management Breakdown of the Processor Cores
Figure 4-2. Thread and Core C-State Entry and Exit
While individual threads can request low power C-states, power saving actions only
take place once the core C-state is resolved. Core C-states are automatically resolved
by the processor. For thread and core C-states, a transition to and from C0 is required
before entering any other C-state.
4.2.3 Requesting Low-Power Idle States
The core C-state will be C1E if all active cores have also resolved a core C1 state
or higher.
The primary software interfaces for requesting low power idle states are through the
MWAIT instruction with sub-state hints and the HLT instruction (for C1 and C1E).
However, software may make C-state requests using the legacy method of I/O reads
from the ACPI-defined processor clock control registers, referred to as P_LVLx. This
method of requesting C-states provides legacy support for operating systems that
initiate C-state transitions via I/O reads.
For legacy operating systems, P_LVLx I/O reads are converted within the processor to
the equivalent MWAIT C-state request. Therefore, P_LVLx reads do not directly result in
I/O reads to the system. The feature, known as I/O MWAIT redirection, must be
enabled in the BIOS. To enable it, refer to the Intel® 64 and IA-32 Architectures
Software Developer’s Manual (SDM) Volumes 1, 2, and 3.
Note: The P_LVLx I/O Monitor address needs to be set up before using the P_LVLx I/O read
interface. Each P_LVLx is mapped to the supported MWAIT(Cx) instruction as follows.
Table 4-8. P_LVLx to MWAIT Conversion
P_LVLx | MWAIT(Cx) | Notes
P_LVL2 | MWAIT(C3) | The P_LVL2 base address is defined in the PMG_IO_CAPTURE MSR, described in the Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM) Volumes 1, 2, and 3.
P_LVL3 | MWAIT(C6) | C6. No sub-states allowed.
P_LVL4 | MWAIT(C7) | C7. No sub-states allowed.
The BIOS can write to the C-state range field of the PMG_IO_CAPTURE MSR to restrict
the range of I/O addresses that are trapped and emulate MWAIT-like functionality. Any
P_LVLx reads outside of this range do not cause an I/O redirection to an MWAIT(Cx)-like
request. They fall through like a normal I/O instruction.
Note: When P_LVLx I/O instructions are used, MWAIT substates cannot be defined. The
MWAIT substate is always zero if I/O MWAIT redirection is used. By default, P_LVLx I/O
redirections enable the MWAIT 'break on EFLAGS.IF’ feature which triggers a wakeup
on an interrupt even if interrupts are masked by EFLAGS.IF.
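The redirection behavior described above can be modeled as a lookup. This sketch makes two assumptions that are not stated in this section: that P_LVL2, P_LVL3, and P_LVL4 occupy consecutive I/O ports starting at the P_LVL2 base address, and that the C-state range restriction can be represented as a simple count of redirected levels. The function name, parameters, and result dictionary are hypothetical.

```python
# Mapping from Table 4-8.
P_LVL_TO_MWAIT = {"P_LVL2": "C3", "P_LVL3": "C6", "P_LVL4": "C7"}
P_LVL_ORDER = ["P_LVL2", "P_LVL3", "P_LVL4"]


def redirect_io_read(port, lvl2_base, levels_in_range):
    """Model I/O MWAIT redirection for a P_LVLx read.

    Returns a description of the equivalent MWAIT request, or None when
    the read falls outside the trapped range and behaves as a normal I/O
    instruction. Port layout and range encoding are assumptions.
    """
    offset = port - lvl2_base
    if 0 <= offset < min(levels_in_range, len(P_LVL_ORDER)):
        return {
            "mwait_target": P_LVL_TO_MWAIT[P_LVL_ORDER[offset]],
            "substate": 0,               # substates cannot be defined via I/O
            "break_on_eflags_if": True,  # enabled by default for redirections
        }
    return None
```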
4.2.4 Core C-states
The following are general rules for all core C-states, unless specified otherwise:
• A core C-State is determined by the lowest numerical thread state (for example,
Thread 0 requests C1E while Thread 1 requests C3, resulting in a core C1E state).
See Table 4-7.
• A core transitions to C0 state when:
— an interrupt occurs.
— there is an access to the monitored address if the state was entered via an
MWAIT instruction.
• For core C1/C1E, and core C3, an interrupt directed toward a single thread wakes
only that thread. However, since both threads are no longer at the same core
C-state, the core resolves to C0.
• An interrupt only wakes the target thread for both C3 and C6 states. Any interrupt
coming into the processor package may wake any core.
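The first rule above, that a core resolves to the lowest numerical thread state, can be illustrated with a minimal sketch. The function name and the state ordering list are hypothetical; in particular, ranking C1E as deeper than C1 is an assumption made for the illustration.

```python
# Shallowest to deepest; placing C1E after C1 is an assumption.
C_STATE_ORDER = ["C0", "C1", "C1E", "C3", "C6", "C7"]


def resolve_core_c_state(thread_requests):
    """Return the core C-state for the given per-thread C-state requests.

    The core takes the shallowest (lowest numerical) state requested by
    any of its threads.
    """
    if not thread_requests:
        raise ValueError("a core has at least one thread")
    return min(thread_requests, key=C_STATE_ORDER.index)
```

For the example in the text, Thread 0 requesting C1E and Thread 1 requesting C3 resolves to a core C1E state. A thread that wakes to C0 on an interrupt therefore pulls the whole core back to C0.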
4.2.4.1 Core C0 State
The normal operating state of a core where code is being executed.
4.2.4.2Core C1/C1E State
C1/C1E is a low power state entered when all threads within a core execute a HLT or
MWAIT(C1/C1E) instruction.
A System Management Interrupt (SMI) handler returns execution to either Normal
state or the C1/C1E state. See the
Developer’s Manual (SDM) Volumes 1, 2, and 3
While a core is in C1/C1E state, it processes bus snoops and snoops from other
threads. For more information on C1E, see Section 4.2.5.2, “Package C1/C1E”.
4.2.4.3Core C3 State
Individual threads of a core can enter the C3 state by initiating a P_LVL2 I/O read to
the P_BLK or an MWAIT(C3) instruction. A core in C3 state flushes the contents of its
L1 instruction cache, L1 data cache, and L2 cache to the shared L3 cache, while
maintaining its architectural state. All core clocks are stopped at this point. Because the
core’s caches are flushed, the processor does not wake any core that is in the C3 state
when either a snoop is detected or when another core accesses cacheable memory.
4.2.4.4Core C6 State
Individual threads of a core can enter the C6 state by initiating a P_LVL3 I/O read or an
MWAIT(C6) instruction. Before entering core C6, the core will save its architectural
state to a dedicated SRAM. Once complete, a core will have its voltage reduced to zero
volts. In addition to flushing core caches core architecture state is saved to the uncore.
Once the core state save is completed, core voltage is reduced to zero. During exit, the
core is powered on and its architectural state is restored.
4.2.4.5 Core C7 State
Individual threads of a core can enter the C7 state by initiating a P_LVL4 I/O read to
the P_BLK or by an MWAIT(C7) instruction. Core C7 and core C7 substate are the same
as Core C6. The processor does not support LLC flush under any condition.
4.2.4.6 C-State Auto-Demotion
In general, deeper C-states such as C6 or C7 have long latencies and have higher
energy entry/exit costs. The resulting performance and energy penalties become
significant when the entry/exit frequency of a deeper C-state is high. In order to
increase residency in deeper C-states, the processor supports C-state auto-demotion.
There are two C-State auto-demotion options:
• C6/C7 to C3
• C3/C6/C7 to C1
The decision to demote a core from C6/C7 to C3 or C3/C6/C7 to C1 is based on each
core’s immediate residency history. Upon each core C6/C7 request, the core C-state is
demoted to C3 or C1 until a sufficient amount of residency has been established. At
that point, a core is allowed to go into C3/C6 or C7. Each option can be run
concurrently or individually.
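The demotion decision described above can be pictured with a small model. This is only a sketch under stated assumptions: the residency threshold, the single residency counter, and the rule that the C1 option takes priority when both options are enabled are all hypothetical and are not specified in this document.

```python
from dataclasses import dataclass

# Hypothetical threshold: the real residency heuristic is not documented here.
RESIDENCY_THRESHOLD_US = 1000


@dataclass
class CoreHistory:
    """Recent idle residency accumulated by one core, in microseconds."""
    recent_residency_us: int = 0


def resolve_core_request(requested: str, history: CoreHistory,
                         demote_to_c3: bool, demote_to_c1: bool) -> str:
    """Return the core C-state granted for a request under auto-demotion."""
    # Enough residency has been established: grant the request unchanged.
    if history.recent_residency_us >= RESIDENCY_THRESHOLD_US:
        return requested
    # Assumption: when both options are enabled, C1 demotion wins.
    if demote_to_c1 and requested in ("C3", "C6", "C7"):
        return "C1"
    if demote_to_c3 and requested in ("C6", "C7"):
        return "C3"
    return requested
```

For example, a C6 request from a core with no residency history is granted as C3 when only the C6/C7 to C3 option is enabled, and as C6 once the history exceeds the assumed threshold.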
This feature is disabled by default. BIOS must enable it in the
PMG_CST_CONFIG_CONTROL register. The auto-demotion policy is also configured by
this register. See the Intel® 64 and IA-32 Architectures Software Developer’s Manual
(SDM) Volumes 1, 2, and 3 for C-state configurations.
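As an illustration of how software might inspect this configuration, the sketch below decodes a raw 64-bit register value. The bit positions used (package C-state limit in bits 2:0, configuration lock in bit 15, C3 auto-demotion enable in bit 25, C1 auto-demotion enable in bit 26) are assumptions recalled from the SDM MSR listings for this processor generation and should be verified there before use. The function name and the returned keys are hypothetical.

```python
def decode_cst_config(value: int) -> dict:
    """Decode selected fields of a raw PMG_CST_CONFIG_CONTROL value.

    Bit positions are assumptions to be checked against the SDM.
    """
    return {
        "package_cstate_limit": value & 0x7,          # bits 2:0
        "cfg_locked": bool((value >> 15) & 1),        # bit 15
        "c3_auto_demotion": bool((value >> 25) & 1),  # bit 25
        "c1_auto_demotion": bool((value >> 26) & 1),  # bit 26
    }


if __name__ == "__main__":
    # Example raw value built by hand, not read from hardware.
    raw = (1 << 25) | (1 << 15) | 0x2
    print(decode_cst_config(raw))
```

The function only interprets a value that has already been obtained. How the register is read (for example through an operating system MSR interface) is outside the scope of this sketch.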
4.2.5 Package C-States
The processor supports C0, C1/C1E, C2, C3, and C6 power states. The following is a
summary of the general rules for package C-state entry. These apply to all package
C-states unless specified otherwise:
• A package C-state request is determined by the lowest numerical core C-state
amongst all cores.
• A package C-state is automatically resolved by the processor depending on the
core idle power states and the status of the platform components.
— Each core can be at a lower idle power state than the package if the platform
does not grant the processor permission to enter a requested package C-state.
— The platform may allow additional power savings to be realized in the
processor.
• For package C-states, the processor is not required to enter C0 before entering any
other C-state.
The processor exits a package C-state when a break event is detected. Depending on
the type of break event, the processor does the following:
• If a core break event is received, the target core is activated and the break event
message is forwarded to the target core.
— If the break event is not masked, the target core enters the core C0 state and
the processor enters package C0.
— If the break event is masked, the processor attempts to re-enter its previous
package state.
• If the break event was due to a memory access or snoop request:
— If the platform did not request to keep the processor in a higher power package
C-state, the package returns to its previous C-state.
— If the platform requests a higher power C-state, the memory access or snoop
request is serviced and the package remains in the higher power C-state.
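The break event rules above can be summarized as a small decision function. This is a sketch, not the processor's actual logic: the function name, the event kind strings, and the `service_state` default of "C2" (taken from the later statement that snoops are serviced in package C2) are assumptions made for illustration.

```python
def handle_break_event(kind: str, previous_state: str, masked: bool = False,
                       platform_holds_higher_power: bool = False,
                       service_state: str = "C2") -> str:
    """Return the package C-state that results from a break event."""
    if kind == "core":
        # An unmasked core break event brings the package to C0.
        # A masked event lets the package try to re-enter its previous state.
        return previous_state if masked else "C0"
    if kind in ("memory", "snoop"):
        # The request is serviced either way. The platform decides whether
        # the package stays in the higher power state afterwards.
        if platform_holds_higher_power:
            return service_state
        return previous_state
    raise ValueError(f"unknown break event kind: {kind!r}")
```

For instance, a snoop that arrives in package C6 with no platform request to stay awake returns the package to C6, while an unmasked core interrupt resolves to package C0.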
The package C-states fall into two categories: independent and coordinated. C0/C1/
C1E are independent, while C2/C3/C6 are coordinated.
Starting with the 2nd Generation Intel® Core™ Processor Family, package C-states
are based on exit latency requirements which are accumulated from the PCIe* devices,
PCH, and software sources. The level of power savings that can be achieved is a
function of the exit latency requirement from the platform. As a result, there is no fixed
relationship between the coordinated C-state of a package, and the power savings that
will be obtained from the state. Coordinated package C-states offer a range of power
savings which is a function of the guaranteed exit latency requirement from the
platform.
There is also a concept of Execution Allowed (EA). When EA status is 0, the cores in a
socket are in C3 or a deeper state, and the socket initiates a request to enter a
coordinated package C-state. The coordination is across all sockets and the PCH.
Table 4-9 shows an example of a dual-core processor package C-state resolution.
Figure 4-3 summarizes package C-state transitions with package C2 as the interim
Table 4-9. Coordination of Core Power States at the Package Level

Package C-State              Core 1
                     C0      C1      C3      C6
Core 0     C0        C0      C0      C0      C0
           C1        C0      C1(1)   C1(1)   C1(1)
           C3        C0      C1(1)   C3      C3
           C6        C0      C1(1)   C3      C6

Note:
1. The package C-state will be C1E if all active cores have resolved a core C1 state or higher.

Figure 4-3. Package C-State Entry and Exit
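The resolution in Table 4-9 follows the rule stated earlier: the package request is the lowest numerical core C-state among all cores. A minimal sketch of that rule, with hypothetical function and variable names, reproduces the table rows. It models only the request. The final package state also depends on platform permission, and the C1E note of the table is not modeled.

```python
CORE_STATES = ["C0", "C1", "C3", "C6"]

# Numeric depth of each core C-state, used only for comparison.
DEPTH = {"C0": 0, "C1": 1, "C3": 3, "C6": 6}


def resolve_package_request(core_states):
    """Return the package C-state request for a list of core C-states."""
    if not core_states:
        raise ValueError("at least one core state is required")
    # The shallowest (lowest numbered) core state bounds the package state.
    return min(core_states, key=lambda state: DEPTH[state])


if __name__ == "__main__":
    # Print one table row per Core 0 state, with Core 1 across the columns.
    for core0 in CORE_STATES:
        row = [resolve_package_request([core0, core1]) for core1 in CORE_STATES]
        print(core0, row)
```

The same function extends to more than two cores, since it takes a list of any length.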
4.2.5.1 Package C0
The normal operating state for the processor. The processor remains in the normal
state when at least one of its cores is in the C0 or C1 state or when the platform has
not granted permission to the processor to go into a low power state. Individual cores
may be in lower power idle states while the package is in C0.
4.2.5.2 Package C1/C1E
No additional power reduction actions are taken in the package C1 state. However, if
the C1E substate is enabled, the processor automatically transitions to the lowest
supported core clock frequency, followed by a reduction in voltage. Autonomous power
reduction actions, which are based on idle timers, can trigger depending on the activity
in the system.
The package enters the C1 low power state when:
• At least one core is in the C1 state.
• The other cores are in a C1 or lower power state.
The package enters the C1E state when:
• All cores have directly requested C1E via MWAIT(C1) with a C1E sub-state hint.
• All cores are in a power state lower than C1/C1E but the package low power state is
limited to C1/C1E via the PMG_CST_CONFIG_CONTROL MSR.
• All cores have requested C1 using HLT or MWAIT(C1) and C1E auto-promotion is
enabled in POWER_CTL.
No notification to the system occurs upon entry to C1/C1E.
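The three C1E entry conditions listed above can be written as a predicate. This sketch applies the conditions literally and makes several assumptions: each core request is represented by a string ("C1", "C1E", "C3", "C6", or "C7"), the two flags stand in for the POWER_CTL auto-promotion setting and the PMG_CST_CONFIG_CONTROL package limit, and mixed combinations not covered by the list are treated as not entering C1E.

```python
# Core states that are lower power than C1/C1E.
DEEPER_THAN_C1 = {"C3", "C6", "C7"}


def package_enters_c1e(core_requests, c1e_auto_promotion=False,
                       package_limited_to_c1=False):
    """Return True if the listed conditions place the package in C1E."""
    if not core_requests:
        return False
    # Condition 1: every core requested C1E directly via the MWAIT hint.
    if all(req == "C1E" for req in core_requests):
        return True
    # Condition 2: every core is deeper than C1/C1E, but the package
    # low power state is limited to C1/C1E.
    if package_limited_to_c1 and all(req in DEEPER_THAN_C1 for req in core_requests):
        return True
    # Condition 3: every core requested plain C1 and auto-promotion is on.
    if c1e_auto_promotion and all(req == "C1" for req in core_requests):
        return True
    return False
```

With auto-promotion disabled, two cores halted in C1 leave the package in C1. Enabling the flag makes the same requests resolve to C1E.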
4.2.5.3 Package C2 State
Package C2 state is an intermediate state which represents the point at which the
system level coordination is in progress. The package cannot reach this state unless all
cores are in at least C3.
The package will remain in C2 when:
• It is waiting for a coordinated response.
• The coordinated exit latency requirements are too stringent for the package to take
any power saving actions.
If the exit latency requirements are high enough, the package will transition to C3 or C6
depending on the state of the cores.
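The role of package C2 as a waiting point can be sketched as follows. The exit latency figures in the table are hypothetical placeholders, not specification values, and the function and parameter names are assumptions made for illustration.

```python
# Hypothetical package exit latencies in microseconds (not datasheet values).
EXIT_LATENCY_US = {"C3": 50, "C6": 100}


def resolve_from_c2(core_target: str, latency_tolerance_us: int,
                    response_received: bool = True) -> str:
    """Return the package state reached from C2.

    core_target is the state implied by the cores ("C3" or "C6").
    latency_tolerance_us is the exit latency the platform can tolerate.
    """
    if core_target not in EXIT_LATENCY_US:
        raise ValueError(f"unsupported target: {core_target!r}")
    # Still waiting for the coordinated response: stay in C2.
    if not response_received:
        return "C2"
    # The requirement is too stringent for the target state: stay in C2.
    if latency_tolerance_us < EXIT_LATENCY_US[core_target]:
        return "C2"
    return core_target
```

With the assumed figures, a tolerance of 200 microseconds allows package C6, while a tolerance of 10 microseconds keeps the package in C2.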
4.2.5.4 Package C3 State
A processor enters the package C3 low power state when:
• At least one core is in the C3 state.
• The other cores are in a C3 or lower power state, and the processor has been
granted permission by the platform.
• L3 shared cache retains context and becomes inaccessible in this state.
• Additional power savings actions, as allowed by the exit latency requirements,
include putting Intel QPI and PCIe* links in L1. The uncore is not available, and
further voltage reduction can be taken.
In package C3, the ring will be off and as a result no accesses to the LLC are possible.
The content of the LLC is preserved.
4.2.5.5 Package C6 State
A processor enters the package C6 low power state when:
• At least one core is in the C6 state.
• The other cores are in a C6 or lower power state, and the processor has been
granted permission by the platform.
• L3 shared cache retains context and becomes inaccessible in this state.
• Additional power savings actions, as allowed by the exit latency requirements,
include putting Intel QPI and PCIe* links in L1. The uncore is not available, and
further voltage reduction can be taken.
In package C6 state, all cores have saved their architectural state and have had their
core voltages reduced to zero volts. The LLC retains context, but no accesses can be
made to the LLC in this state. The cores must break out to the internal package C2
state for snoops to occur.
4.2.6 Package C-State Power Specifications
The table below lists the processor package C-state power specifications for various
processor SKUs.
The DDR3 power states can be summarized as the following:
• Normal operation (highest power consumption).
• CKE Power-Down: Opportunistic, per rank control after idle time. There may be
different levels.
— Active Power-Down.
— Precharge Power-Down with Fast Exit.
— Precharge Power-Down with Slow Exit.
• Self Refresh: In this mode no transaction is executed. The DDR consumes the
minimum possible power.
4.3.1 CKE Power-Down
The CKE input land is used to enter and exit different power-down modes. The memory
controller has a configurable activity timeout for each rank. Whenever no reads are
present to a given rank for the configured interval, the memory controller will transition
the rank to power-down mode.
The memory controller transitions the DRAM to power-down by de-asserting CKE and
driving a NOP command. The memory controller will tri-state all DDR interface lands
except CKE (de-asserted) and ODT while in power-down. The memory controller will
transition the DRAM out of power-down state by synchronously asserting CKE and
driving a NOP command.
When CKE is off, the internal DDR clock is disabled and the DDR power is significantly
reduced.
The DDR defines three levels of power-down:
• Active power-down.
• Precharge power-down fast exit.
• Precharge power-down slow exit.
4.3.2 Self Refresh
The Power Control Unit (PCU) may request the memory controller to place the DRAMs
in self refresh state. Self refresh per channel is supported. The BIOS can put the
channel in self refresh if software remaps memory to use a subset of all channels. Also,
processor channels can enter self refresh autonomously, without PCU instruction, when
the package is in a package C0 state.
4.3.2.1 Self Refresh Entry
Self refresh entrance can be either disabled or triggered by an idle counter. The idle
counter always clears with any access to the memory controller and remains clear as
long as the memory controller is not drained. As soon as the memory controller is
drained, the counter starts counting, and when it reaches the idle-count, the memory
controller will place the DRAMs in self refresh state.
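The idle counter behavior described above can be modeled with a small class. This is a behavioral sketch only: the class name, the tick-based time base, and the choice to leave self refresh on any access (following the exit description in this section) are assumptions, and the idle-count value is a configuration input rather than a documented constant.

```python
class SelfRefreshIdleCounter:
    """Behavioral model of the self refresh entry counter."""

    def __init__(self, idle_count: int, enabled: bool = True):
        self.idle_count = idle_count  # configured idle threshold, in ticks
        self.enabled = enabled        # self refresh entry can be disabled
        self.count = 0
        self.in_self_refresh = False

    def access(self) -> None:
        """Any access to the memory controller clears the counter."""
        self.count = 0
        # Assumption: an incoming transaction also triggers self refresh exit.
        self.in_self_refresh = False

    def tick(self, drained: bool) -> bool:
        """Advance one time unit and return True while in self refresh."""
        if not self.enabled or self.in_self_refresh:
            return self.in_self_refresh
        if not drained:
            # The counter stays clear until the controller is drained.
            self.count = 0
            return False
        self.count += 1
        if self.count >= self.idle_count:
            self.in_self_refresh = True
        return self.in_self_refresh
```

With an idle-count of 3, three consecutive drained ticks place the DRAMs in self refresh, and an access at any point restarts the count.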
Power may be removed from the memory controller core at this point. But VCCD supply
(1.5 V or 1.35 V) to the DDR IO must be maintained.
4.3.2.2 Self Refresh Exit
Self refresh exit can be triggered either by a message from an external unit or as a
reaction to an incoming transaction.
Self refresh, according to configuration, may be a trigger for master DLL shut-down
and PLL shut-down. The master DLL shut-down is issued by the memory controller
after the DRAMs have entered self refresh.
The PLL shut-down and wake-up are issued by the PCU. The memory controller gets a
signal from the PLL indicating that the memory controller can start working again.
4.3.3 DRAM I/O Power Management
Unused signals are tristated to save power. This includes all signals associated with an
unused memory channel.
The I/O buffer for an unused signal should be tristated (output driver disabled), and the
input receiver (differential sense-amp) should be disabled. The input path must be
gated to prevent spurious results due to noise on the unused signals (typically handled
automatically when the input receiver is disabled).
4.4 DMI2/PCI Express* Power Management
Active State Power Management (ASPM) is supported using the L1 state. L0s is not
supported.
§