PCIe

[Figure: PCIe layers]


Unlike the true buses of previous PCI versions, PCI-X included, PCIe is essentially a packet network that emulates the traditional PCI bus: each card is connected to a switch through its own dedicated set of wires, much like hosts on a local Ethernet network each have their own physical connection to the switch fabric. Communication takes the form of packets transmitted over these dedicated wires, with flow control, error detection and retransmission. There are no MAC addresses; instead, PCIe addresses each card by its physical (geographic) position in the bus topology. And even though it is a packet-based network, the traffic still revolves around the familiar PCI concepts: addresses, reads, writes and interrupts.


A minimal (x1) PCIe connection consists of just four data wires plus a clock pair:

  • one differential pair in each direction for data transmission (four wires in total)
  • another differential pair supplying the card with a reference clock

PCIe bandwidth

PCIe bandwidth by generation[1]
PCIe generation     | PCIe 1.0 | PCIe 2.0 | PCIe 3.0 | PCIe 4.0 | PCIe 5.0 | PCIe 6.0 | PCIe 7.0
Year of release     | 2003     | 2007     | 2010     | 2017     | 2019     | 2022     | 2025 (planned)
Data transfer rate  | 2.5 GT/s | 5.0 GT/s | 8.0 GT/s | 16 GT/s  | 32 GT/s  | 64 GT/s  | 128 GT/s
Total BW, x1 lane   | 250 MB/s | 500 MB/s | 1 GB/s   | 2 GB/s   | 4 GB/s   | 8 GB/s   | 15 GB/s
Total BW, x2 lanes  | 500 MB/s | 1 GB/s   | 2 GB/s   | 4 GB/s   | 8 GB/s   | 16 GB/s  | 30 GB/s
Total BW, x4 lanes  | 1 GB/s   | 2 GB/s   | 4 GB/s   | 8 GB/s   | 16 GB/s  | 32 GB/s  | 60 GB/s
Total BW, x8 lanes  | 2 GB/s   | 4 GB/s   | 8 GB/s   | 16 GB/s  | 32 GB/s  | 64 GB/s  | 121 GB/s
Total BW, x16 lanes | 4 GB/s   | 8 GB/s   | 16 GB/s  | 32 GB/s  | 64 GB/s  | 128 GB/s | 242 GB/s
The data transfer rate is quoted in gigatransfers per second (GT/s), which on the wire corresponds to gigabits (Gb) per second per lane. Bandwidth, on the other hand, is measured in gigabytes (GB) per second: 8 gigabits = 1 gigabyte.

PCIe Gen 3, 4 and 5 use 128b/130b encoding, so the encoding overhead is roughly 1.54% (2 bits out of every 130). The maximum theoretical bandwidth of a single PCIe Gen 5.0 lane is therefore: 32 GT/s - (32 GT/s x 1.54%) ≈ 31.51 Gb/s, or ~3.94 GB/s per lane.

For example, a GPU typically uses 16 PCIe lanes (a short calculation sketch follows this list):

  • Peak theoretical raw transfer rate over PCIe 4.0:
    • 16 lanes x 16 GT/s per lane = 256 GT/s
  • Peak theoretical raw transfer rate over PCIe 5.0:
    • 16 lanes x 32 GT/s per lane = 512 GT/s
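
The throughput figures above can be reproduced with a small calculation. Below is a minimal Python sketch, assuming the published per-generation transfer rates and line encodings (8b/10b for Gen 1/2, 128b/130b for Gen 3/4/5); everything else is illustrative:

# Rough PCIe throughput estimate: raw transfer rate minus line-encoding overhead.
# Gen 1/2 use 8b/10b encoding (20% overhead); Gen 3/4/5 use 128b/130b (~1.54%).
GEN_RATE_GT_S = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0, 5: 32.0}
GEN_ENCODING  = {1: 8 / 10, 2: 8 / 10, 3: 128 / 130, 4: 128 / 130, 5: 128 / 130}

def effective_bandwidth_gb_s(gen: int, lanes: int) -> float:
    """Approximate usable bandwidth in GB/s for a link of `lanes` lanes."""
    usable_gbits_per_lane = GEN_RATE_GT_S[gen] * GEN_ENCODING[gen]   # Gb/s per lane after encoding
    return usable_gbits_per_lane * lanes / 8                         # 8 bits = 1 byte

print(round(effective_bandwidth_gb_s(5, 1), 2))    # ~3.94 GB/s per Gen 5 lane
print(round(effective_bandwidth_gb_s(4, 16), 2))   # ~31.51 GB/s for a x16 Gen 4 link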

PCIe power[2]

Power from PCIe bus

  • All PCI Express cards may consume up to 3 A at +3.3 V (9.9 W).
  • x1 cards are limited to 0.5 A at +12 V (6 W) and 10 W combined.
  • x4 and wider cards are limited to 2.1 A at +12 V (25 W) and 25 W combined.
  • A full-sized x1 card may draw up to the 25 W limit after initialization and software configuration as a high-power device.
  • A full-sized x16 graphics card may draw up to 5.5 A at +12 V (66 W) and 75 W combined after initialization and software configuration as a high-power device.

Optional connectors

Optional auxiliary connectors add 75 W (6-pin) or 150 W (8-pin) of +12 V power, for up to 300 W in total (2 × 75 W + 1 × 150 W) delivered directly from the power supply to the device.

  • The Sense0 pin is connected to ground by the cable or power supply, or left floating on the board if the cable is not connected.
  • The Sense1 pin is connected to ground by the cable or power supply, or left floating on the board if the cable is not connected.

Some cards use two 8-pin connectors, but as of 2018 this had not been standardized, so such cards must not carry the official PCI Express logo. This configuration allows 375 W in total (1 × 75 W + 2 × 150 W) and was expected to be standardized by PCI-SIG with the PCI Express 4.0 standard. The 8-pin PCI Express connector could be confused with the EPS12V connector, which is mainly used for powering SMP and multi-core systems. The power connectors are variants of the Molex Mini-Fit Jr. series connectors.
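
To keep these totals straight, here is a minimal Python sketch that adds up the power budget available to a high-power card from the slot plus optional auxiliary connectors; the wattage constants are the limits quoted above, and the helper itself is purely illustrative:

# Power budget of a high-power card: 75 W from the slot, plus 75 W per 6-pin
# and 150 W per 8-pin auxiliary connector.
SLOT_W, SIX_PIN_W, EIGHT_PIN_W = 75, 75, 150

def card_power_budget(six_pin: int = 0, eight_pin: int = 0) -> int:
    """Total watts available to the card from the slot plus auxiliary connectors."""
    return SLOT_W + six_pin * SIX_PIN_W + eight_pin * EIGHT_PIN_W

print(card_power_budget(six_pin=1, eight_pin=1))   # 300 W, the standard maximum
print(card_power_budget(eight_pin=2))              # 375 W, the non-standard dual 8-pin case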

A simple bus transaction

Simple PCIe write operation

  1. The chipset (which in PCIe terms functions as the Root Complex, and may be integrated into the CPU, sit inside a PCIe switch, or be an independent device) generates a Memory Write packet for transmission over the bus.
    • This packet consists of a header, which is either 3 or 4 32-bit words long (depending on whether 32- or 64-bit addressing is used), plus one 32-bit word containing the data to be written. The packet simply says “write this data to this address”. (A sketch of such a packet follows this list.)
  2. The packet is then transmitted on the chipset’s PCIe port (or on one of them, if there are several). The target peripheral may be connected directly to the chipset, or there may be a switch network between them. One way or another, the packet is routed to the peripheral, decoded, and executed by performing the desired write operation.
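
The sketch below packs such a Memory Write TLP, with a 3 DW header and a single data DW, to make the header layout concrete. This is a simplified Python illustration, not a complete implementation: only the Fmt, Type, Length, Requester ID, Tag, byte-enable and address fields are filled in, and all other header bits are left at zero.

import struct

def memory_write_tlp_32(addr: int, data_dw: int, requester_id: int = 0, tag: int = 0) -> bytes:
    """Build a simplified Memory Write TLP: 3 DW header (32-bit address) + one data DW."""
    fmt, tlp_type, length = 0b010, 0b00000, 1       # 3 DW header with data, MWr, 1 DW payload
    dw0 = (fmt << 29) | (tlp_type << 24) | length   # TC, attributes etc. left at zero
    dw1 = (requester_id << 16) | (tag << 8) | 0xF   # last DW BE = 0000 (single DW), first DW BE = 1111
    dw2 = addr & ~0x3                               # address DW; the two lowest bits are reserved
    return struct.pack(">4I", dw0, dw1, dw2, data_dw)

tlp = memory_write_tlp_32(0xFE000000, 0xDEADBEEF)
print(tlp.hex())                                    # 12 header bytes followed by 4 data bytes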

Simple PCIe read operation

  1. The CPU wants to read from a peripheral, so one TLP goes from the CPU to the peripheral, asking the latter to perform a read operation (in PCIe terms, the side issuing this Read Request is the Requester).
  2. A TLP then goes back from the peripheral carrying the requested data (in PCIe terms, the responding side is the Completer).
  3. When the peripheral receives a Read Request TLP, it must respond with some sort of Completion TLP, even if it cannot fulfill the requested action. (A sketch of the requester-side bookkeeping follows this list.)
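
Because the Completion arrives some time after the Request, the Requester has to remember what it asked for until the answer comes back; the Tag field in the Request is what ties the two together. Below is a minimal Python sketch of that bookkeeping; the class and method names are purely illustrative and not part of any real driver API.

# Illustrative requester-side bookkeeping: each outstanding Read Request is
# remembered by its Tag until the matching Completion brings the data back.
class ReadRequester:
    def __init__(self):
        self.outstanding = {}                        # tag -> address being read
        self.next_tag = 0

    def send_read(self, addr: int) -> int:
        tag = self.next_tag
        self.next_tag = (self.next_tag + 1) % 256    # the classic tag field is 8 bits wide
        self.outstanding[tag] = addr                 # remember what this tag asked for
        # ... here a Read Request TLP carrying `tag` would be transmitted ...
        return tag

    def on_completion(self, tag: int, data: bytes) -> None:
        addr = self.outstanding.pop(tag)             # match the Completion back to its Request
        print(f"read of 0x{addr:08x} completed with {len(data)} bytes")

req = ReadRequester()
t = req.send_read(0xFE000000)
req.on_completion(t, b"\x00" * 4)                    # simulated Completion carrying 4 bytes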

PCIe communication layers

The communication mechanism consists of three layers: the Transaction Layer, the Data Link Layer, and the Physical Layer.

  1. The Transaction Layer is PCIe’s uppermost layer; its units of exchange are Transaction Layer Packets (TLPs).
  2. The Data Link Layer is responsible for making sure that every TLP arrives at its destination correctly. It wraps each TLP with its own header and with a Link CRC (LCRC) so that the TLP’s integrity is assured.
    • An acknowledge-retransmit mechanism makes sure no TLPs are lost on the way (a conceptual sketch appears after this list).
    • A flow control mechanism makes sure a packet is sent only when the link partner is ready to receive it.
    • The TLP size limit is set at the peripheral’s configuration stage; typical maximum payload sizes are 128, 256 or 512 bytes per TLP.
    • Data Link Layer Packets (DLLPs) are used for maintaining reliable transmission:
      • Ack DLLPs for acknowledging successfully received TLPs.
      • Nak DLLPs for indicating that a TLP arrived corrupted and that a retransmit is due. Note that there is also a timeout mechanism in case nothing that looks like a TLP arrives.
      • Flow Control DLLPs: InitFC1, InitFC2 and UpdateFC, used to announce credits, as described below.
      • Power Management DLLPs.
  3. The Physical Layer.
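
The acknowledge-retransmit mechanism mentioned under the Data Link Layer can be pictured as a replay buffer on the transmitting side: every TLP gets a sequence number and is kept until the link partner acknowledges it. The following is a conceptual Python sketch of that idea, with simplified Ack/Nak handling; it is not the actual state machine defined by the spec.

from collections import OrderedDict

# Conceptual Data Link Layer transmit side: outgoing TLPs are held in a replay
# buffer, released by Ack DLLPs and resent from the buffer on Nak DLLPs.
class DllTransmitter:
    def __init__(self, send):
        self.send = send                  # callable handing (seq, tlp) to the physical layer
        self.next_seq = 0
        self.replay = OrderedDict()       # seq -> TLP bytes, in transmission order

    def transmit_tlp(self, tlp: bytes) -> None:
        seq = self.next_seq
        self.next_seq = (self.next_seq + 1) % 4096   # sequence numbers are 12 bits wide
        self.replay[seq] = tlp                       # keep a copy until it is acknowledged
        self.send(seq, tlp)

    def on_ack(self, acked_seq: int) -> None:
        # An Ack covers everything up to and including acked_seq
        # (sequence-number wrap-around is ignored in this sketch).
        for seq in [s for s in self.replay if s <= acked_seq]:
            del self.replay[seq]

    def on_nak(self, nacked_seq: int) -> None:
        # Everything still unacknowledged is retransmitted, in the original order.
        for seq, tlp in self.replay.items():
            self.send(seq, tlp)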

Flow control

The data link layer has a Flow Control (FC) mechanism, which makes sure that a TLP is transmitted only when the link partner has enough buffer space to accept it.

The term “link partner”, rather than “destination”, is used deliberately: when a peripheral is connected to the Root Complex through a switch, it runs its flow control mechanism against the switch and not against the final destination. In other words, once the TLP has been transmitted from the peripheral, it is still subject to the flow control mechanism between the switch and the Root Complex. If there are more switches on the way, each leg has its own flow control.

The flow control mechanism keeps independent accounting for six distinct buffer consumers (a sketch of the credit accounting follows this list):

  1. Posted Request TLP headers
  2. Posted Request TLP data
  3. Non-Posted Request TLP headers
  4. Non-Posted Request TLP data
  5. Completion TLP headers
  6. Completion TLP data
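
A minimal Python sketch of the credit idea for one of these buffer types: the transmitter keeps separate header and data credit counters, consumes them when it sends a TLP, and tops them up when an UpdateFC DLLP arrives. The initial credit values are illustrative, and real flow control uses cumulative counters in spec-defined units (one data credit covers 16 bytes of payload).

import math

# Simplified credit-based flow control for one buffer type (e.g. Posted Requests):
# separate counters for header credits and data credits (one data credit = 16 bytes).
class CreditGate:
    def __init__(self, hdr_credits: int, data_credits: int):
        self.hdr = hdr_credits
        self.data = data_credits

    def can_send(self, payload_bytes: int) -> bool:
        needed = math.ceil(payload_bytes / 16)       # data credits this TLP would consume
        return self.hdr >= 1 and self.data >= needed

    def consume(self, payload_bytes: int) -> None:
        self.hdr -= 1
        self.data -= math.ceil(payload_bytes / 16)

    def on_update_fc(self, hdr: int, data: int) -> None:
        # An UpdateFC DLLP from the link partner returns freed-up credits.
        self.hdr += hdr
        self.data += data

posted = CreditGate(hdr_credits=32, data_credits=256)   # illustrative initial credit advertisement
if posted.can_send(128):                                # a 128-byte posted write needs 8 data credits
    posted.consume(128)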

Virtual channels

TC (Traffic Class) in the TLP is an identifier used to create Virtual Channels. These Virtual Channels are merely separate sets of data buffers, each with its own flow control credits and counters. So by choosing a TC other than zero (and setting up the bus entities accordingly) one can have TLPs subject to independent flow control systems, preventing TLPs belonging to one channel from blocking the traffic of TLPs belonging to another.


Packet reordering

One of the issues that comes to mind in a packet network is to what extent TLPs may arrive in an order different from the one in which they were sent. The Internet Protocol (IP, as in TCP/IP), for example, allows any reshuffling of packets on the way. The PCIe spec, by contrast, defines the allowed reordering rules in full detail.

Identification and routing

Since PCIe is essentially a packet network, with the possibility of switches on the way, these switches need to know where to send each TLP.

[Figure: Formation of the PCIe ID]

There are three routing methods:

  1. Routing by address is applied for Memory and I/O Requests (reads and writes).
  2. Implicit routing is used only for certain message TLPs, such as broadcasts from the Root Complex and messages that always go to the Root Complex.
  3. Routing by ID is used for all other TLPs.

The ID is a 16-bit word formed from the well-known triplet of Bus number, Device number and Function number; their meaning is exactly the same as on legacy PCI buses. As shown in the figure above, the ID is formed from 8 bits of Bus number, 5 bits of Device number and 3 bits of Function number.
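
A minimal sketch of packing and unpacking that 16-bit ID in Python (the standard bus/device/function bit layout; the example values are arbitrary):

# Pack and unpack the 16-bit PCIe ID: bus[15:8], device[7:3], function[2:0].
def pack_bdf(bus: int, device: int, function: int) -> int:
    return (bus << 8) | (device << 3) | function

def unpack_bdf(bdf: int) -> tuple:
    return bdf >> 8, (bdf >> 3) & 0x1F, bdf & 0x7

bdf = pack_bdf(0x02, 0x00, 0x1)                  # the device lspci would list as 02:00.1
print(f"{bdf:#06x}", unpack_bdf(bdf))            # 0x0201 (2, 0, 1)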

Posted and non-Posted operations

Posted operations (Write)

A posted operation, such as a Memory Write, is a fire-and-forget transaction: once the packet has been formed and handed over to the Data Link Layer, there is no need to worry about it anymore.

non-Posted operations (Read)

A non-posted operation, such as a read, consists of a Request and a Completion, and requires the Requester to wait for the Completion. Until the Completion packet arrives, the Requester must retain information about what the Request was, and sometimes even hold the CPU’s bus: if the CPU’s bus started a read cycle, it must be held in wait states until the value of the desired read operation is available on the bus’ data lines. This can slow the bus down horribly, which is rightfully avoided in recent systems.

32 vs. 64 bit addressing

The address given in read and write requests can be either 32 or 64 bits wide, making the header either 3 or 4 DWs long. However, section 2.2.4.1 of the PCIe spec states that the 4 DW header format must be used only when necessary.

For addresses below 4 GB, Requesters must use the 32-bit format. The behavior of the receiver is not specified if a 64-bit format request addressing below 4 GB (i.e., with the upper 32 address bits all zero) is received.

In reality, it is rare for a peripheral's registers to be mapped above the 4 GB boundary, but DMA buffers may very well lie beyond it. As a result, read and write TLPs with 64-bit addressing should be supported when designing a new device.
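
Following that rule, the choice of header format can be sketched in one small Python function; the Fmt encodings match the Memory Write builder sketched earlier, and the addresses in the asserts are arbitrary examples:

# Pick the TLP header format for a Memory Write: 3 DW (32-bit address) below 4 GB,
# 4 DW (64-bit address) only when the upper address bits are actually needed.
def mwr_fmt(addr: int) -> int:
    if addr < 1 << 32:
        return 0b010      # 3 DW header, with data
    return 0b011          # 4 DW header, with data

assert mwr_fmt(0xFE00_0000) == 0b010
assert mwr_fmt(0x1_0000_0000) == 0b011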

I/O Requests

The PCIe bus supports I/O operations only for the sake of backward compatibility, and the spec strongly recommends against using I/O TLPs in new designs. One reason is that both read and write requests in I/O space are non-posted, so the Requester is forced to wait for a Completion on write operations as well. Another issue is that I/O operations only take 32-bit addresses, while the PCIe spec endorses 64-bit support in general.

Interrupts

PCIe supports two kinds of interrupts: Legacy INTx and MSI.

  1. INTx interrupts are supported for the sake of compatibility with legacy software, and also in order to allow bridging between classic PCI buses and PCIe. Since INTx interrupts are level triggered (i.e. the interrupt request is active as long as the physical INTx wire is at low voltage), there is one TLP for saying that the line has been asserted and another for saying that it has been deasserted. Not only is this quirky in itself, but the old problems with INTx interrupts remain, such as interrupt sharing and the need for each interrupt handler to check whether the interrupt was really meant for it.
  2. To improve on INTx, a new form of interrupt, MSI, was introduced in (conventional) PCI 2.2. The idea was that, since virtually all PCI peripherals have bus-master capability, the peripheral can signal an interrupt simply by writing to a certain address. Signaling an interrupt then merely consists of sending a TLP over the bus: a posted Write Request to a special address that the host wrote into the peripheral’s configuration space during initialization. Any modern operating system (Linux included, of course) can then call the correct interrupt routine without having to guess who generated the interrupt. Nor is it really necessary to “clear” the interrupt if the peripheral doesn’t need the acknowledgment. (A sketch follows this list.)
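
Since an MSI is just a posted Memory Write, it can be pictured as yet another TLP. Below is a minimal Python sketch assuming 32-bit MSI addressing; the address and data values are illustrative stand-ins for whatever the host programmed into the peripheral's MSI capability registers during initialization.

import struct

# An MSI is nothing more than a posted Memory Write TLP: the peripheral writes the
# host-assigned MSI data value to the host-assigned MSI address.
def msi_write_tlp(msi_address: int, msi_data: int) -> bytes:
    dw0 = (0b010 << 29) | 1            # 3 DW Memory Write header, 1 data DW, other fields zero
    dw1 = 0x0000000F                   # requester ID and tag 0, first DW byte enables = 1111
    dw2 = msi_address & ~0x3
    return struct.pack(">4I", dw0, dw1, dw2, msi_data)

print(msi_write_tlp(0xFEE00000, 0x00000031).hex())   # illustrative address and data values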

Bus Mastering (DMA)

On PCIe, it boils down to the simple notion that any device on the bus can send read and write TLPs, exactly like the Root Complex. This allows a peripheral to access the CPU’s memory directly (DMA) or to exchange TLPs with peer peripherals.

To use DMA, two things need to be set up (a sketch follows this list):

  1. First, as with any PCI device, the peripheral needs to be granted bus mastering by setting the “Bus Master Enable” bit in one of the standard configuration registers.
  2. Second, the driver software needs to inform the peripheral of the relevant buffer’s physical address, most probably by writing it to a BAR-mapped register.
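
A driver-side sketch of those two steps in Python. This is entirely illustrative: the three callables stand in for whatever configuration-space and BAR access mechanism the platform provides, and DMA_ADDR_REG is a made-up device register offset; only PCI_COMMAND and the Bus Master Enable bit are standard.

PCI_COMMAND       = 0x04    # standard configuration-space Command register offset
BUS_MASTER_ENABLE = 1 << 2  # Bus Master Enable is bit 2 of the Command register
DMA_ADDR_REG      = 0x10    # hypothetical BAR-mapped register for the buffer address

def setup_dma(config_read16, config_write16, bar_write64, buffer_phys_addr: int) -> None:
    """Driver-side DMA setup; the three callables wrap platform-specific accesses."""
    # 1. Grant the peripheral bus mastering, as with any PCI device.
    cmd = config_read16(PCI_COMMAND)
    config_write16(PCI_COMMAND, cmd | BUS_MASTER_ENABLE)
    # 2. Tell the peripheral the physical address of the DMA buffer it should use.
    bar_write64(DMA_ADDR_REG, buffer_phys_addr)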

References

http://xillybus.com/tutorials/pci-express-tlp-pcie-primer-tutorial-guide-1

http://xillybus.com/tutorials/pci-express-tlp-pcie-primer-tutorial-guide-2