Lesson 10: The PCI Bus

Date: 06.05.2025

Introduction

PCI (Peripheral Component Interconnect) is a bus commonly used to connect hardware devices to the processor. The original PCI is based on a 32-bit parallel bus shared among multiple devices, operating at a frequency of 33 MHz. Later, many other hardware interfaces were introduced that are programmatically compatible with PCI, although they have completely different electrical interfaces and low-level protocols. Interfaces seen programmatically as PCI include, among others, the original PCI, PCI-X (an older server variant of PCI with larger slots), AGP, CardBus (used in laptops), PCI Express (an interface with a completely new physical layer and protocol, based on serial unidirectional point-to-point links instead of a shared bus), ExpressCard (electrically compatible with PCIE), HyperTransport (an interface between AMD processors and the chipset), as well as the internal interfaces of many chipsets.

The PCI bus is organized hierarchically - a single PCI "domain" consists of one or more PCI buses. PCI buses have 8-bit identifiers (bus id). Each bus contains up to 32 physical devices (identified by a 5-bit device id), and each physical device can have up to 8 "functions" (identified by a 3-bit function id). The main PCI bus, with identifier 0, is connected to the processor (called the host) through a so-called host bridge. Other buses are connected to higher-level buses in the hierarchy through so-called PCI-to-PCI bridges. Sometimes (especially in multi-processor systems) there are multiple main PCI buses.

In the era of original PCI, one bus was often sufficient in the entire computer, and a physical device (a single device id) corresponded to a single PCI slot (or a chip soldered onto the motherboard), with a few numbers reserved for the functions of the chipset itself. Over time, the AGP slot was added (as a separate bus), and the interface between the north and south bridges (e.g., HyperTransport) was also separated as a separate bus. Currently, due to its peer-to-peer architecture, the PCIE bus treats each slot (as well as the interiors of so-called PCIE switches) as a separate bus.

PCI-to-PCI bridges are mostly transparent - any pair of devices located in the same PCI domain can communicate with each other, without even needing to know if they are on the same bus. In the vast majority of cases, current computers contain only one PCI domain, but sometimes (most often with architectures other than x86 and high-end hardware) you can find machines in which each PCI slot is in its own domain. This allows, among other things, to protect devices from the effects of failures of others and to control access.

Each PCI device in the system can therefore be uniquely identified by combining the domain number, bus, physical device, and function. Due to the rare use of multiple domains, the domain number is often omitted. It should be noted that "PCI device" usually refers to a function of a physical device, not the physical device itself.

To display all PCI devices in the system, you can use the lspci command; lspci -t displays a hierarchical tree of devices. An example PCI bus (from a nettop-class computer with PCIE) might look like this:

[bus:physical_device.function Device Class: Device Description]
00:00.0 Host bridge: NVIDIA Corporation MCP79 Host Bridge (rev b1)
00:00.1 RAM memory: NVIDIA Corporation MCP79 Memory Controller (rev b1)
00:03.0 ISA bridge: NVIDIA Corporation MCP79 LPC Bridge (rev b2)
00:03.1 RAM memory: NVIDIA Corporation MCP79 Memory Controller (rev b1)
00:03.2 SMBus: NVIDIA Corporation MCP79 SMBus (rev b1)
00:03.3 RAM memory: NVIDIA Corporation MCP79 Memory Controller (rev b1)
00:03.5 Co-processor:
00:04.0 USB controller: NVIDIA Corporation MCP79 OHCI USB 1.1 Controller (rev b1)
00:04.1 USB controller: NVIDIA Corporation MCP79 EHCI USB 2.0 Controller (rev b1)
00:08.0 Audio device: NVIDIA Corporation MCP79 High Definition Audio (rev b1)
00:09.0 PCI bridge: NVIDIA Corporation MCP79 PCI Bridge (rev b1)
00:0a.0 Ethernet controller: NVIDIA Corporation MCP79 Ethernet (rev b1)
00:0b.0 SATA controller: NVIDIA Corporation MCP79 AHCI Controller (rev b1)
00:10.0 PCI bridge: NVIDIA Corporation MCP79 PCI Express Bridge
00:15.0 PCI bridge: NVIDIA Corporation MCP79 PCI Express Bridge (rev b1)
01:00.0 VGA compatible controller: NVIDIA Corporation ION VGA (rev b1)
02:00.0 Network controller: Atheros Communications Inc. AR9285 Wireless Network Adapter (PCI-Express) (rev 01)

We have 4 PCI buses here:

  • 00: internal chipset bus, connected by 00:00.0 to the processor

  • 01: internal chipset bus for communication with the integrated graphics processor, connected by 00:10.0 to bus 00.

  • 02: PCIE bus (with a mini-PCIE slot) for connecting a wifi card, connected by 00:15.0 to bus 00.

  • 03: a regular PCI bus with no connected devices, connected by 00:09.0 to bus 00.

In larger machines, you can often find even a dozen or so PCI buses.

The PCI bus allows devices to perform so-called transactions among themselves. Any device (including the host) can perform a transaction on any device (including the host). Available transaction types are:

  • memory read and write - addressed by providing the target memory address

  • IO port read and write - addressed by providing the target port address

  • configuration space read and write - addressed by providing the target device identifier and offset in the configuration space

  • interrupt - always addressed to the host, a device can signal up to 4 different interrupts

Configuration Space

The most important innovation of PCI compared to older buses such as ISA or VESA local bus was the introduction of automatic device detection and configuration - in older buses, the user had to manually tell the operating system which devices were present in the system and ensure no resource conflicts between them. This often required solutions such as setting jumpers or DIP switches. Drivers also had to be configured manually. On the PCI bus, these problems were solved by introducing configuration space.

Each PCI device has a so-called configuration space, with a size from 64 to 256 bytes (original PCI, AGP, ...) or up to 4096 bytes (PCI-X, PCIE). This space is used to detect the type, functionality, and requirements of the PCI device and to prepare it for operation by the BIOS or operating system. It is not ordinary memory - its cells, depending on their purpose, can be writable, read-only, have only some bits writable, not exist at all, or exhibit even more complex behavior. It contains, among other things:

  • 16-bit vendor and device model identifiers (vendor id, device id) - allow identifying a specific model

  • 16-bit subsystem vendor and model identifiers (subsystem ids) - allow identifying a specific card or motherboard using a given device

  • 24-bit device class identifier

  • information about interrupts signaled by the device

  • registers for configuring memory and IO port ranges used by the device

The content of the configuration space can be displayed in raw form using lspci -xxxx, or in a user-friendly format using lspci -vvv. Access to most of the configuration space requires root privileges.
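
From within a driver (covered in the next section), individual configuration-space registers can also be read with the kernel's pci_read_config_* helpers. A minimal sketch, assuming we already hold a struct pci_dev pointer (e.g. in probe()); the demo_dump_config name is purely illustrative:

#include <linux/pci.h>

/* Sketch: dump a few standard configuration space registers of a device
 * for which we already hold a struct pci_dev pointer. */
static void demo_dump_config(struct pci_dev *pdev)
{
    u16 vendor, device, subvendor, subdevice;
    u32 class_rev;

    pci_read_config_word(pdev, PCI_VENDOR_ID, &vendor);
    pci_read_config_word(pdev, PCI_DEVICE_ID, &device);
    pci_read_config_word(pdev, PCI_SUBSYSTEM_VENDOR_ID, &subvendor);
    pci_read_config_word(pdev, PCI_SUBSYSTEM_ID, &subdevice);
    pci_read_config_dword(pdev, PCI_CLASS_REVISION, &class_rev);

    /* the class code occupies the upper 24 bits of PCI_CLASS_REVISION */
    dev_info(&pdev->dev, "device %04x:%04x, subsystem %04x:%04x, class %06x\n",
             vendor, device, subvendor, subdevice, class_rev >> 8);
}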

Registering a PCI Driver in Linux

The Linux PCI subsystem itself handles assigning the correct driver to a device - the driver should not itself search for devices present in the system, but merely register with the PCI subsystem, providing it with a list of supported device types. This is done by creating a pci_driver structure:

struct pci_driver {
    const char *name;
    const struct pci_device_id *id_table;
    int  (*probe)  (struct pci_dev *dev, const struct pci_device_id *id);
    void (*remove) (struct pci_dev *dev);
    int  (*suspend) (struct pci_dev *dev, pm_message_t state);
    int  (*resume) (struct pci_dev *dev);
    void (*shutdown) (struct pci_dev *dev);
/* ... */
};

The id_table field should be set to a pointer to an array of supported devices, composed of the following structures:

struct pci_device_id {
    __u32 vendor, device;
    __u32 subvendor, subdevice;
    __u32 class, class_mask;
    kernel_ulong_t driver_data;
};

The vendor and device fields correspond to the vendor and device identifiers that the driver wants to support. The subvendor and subdevice fields correspond to the subsystem identifiers. Each of these fields can be replaced by PCI_ANY_ID to ignore that field when matching devices to the driver. The class and class_mask fields allow narrowing the match down to devices of a given class or subclass - a device will be shown to the driver only when (device_class & class_mask) == class. If we do not want to filter devices by class, both fields are simply set to 0. For most drivers it is sufficient to create such structures using the PCI_DEVICE() macro, which takes only the vendor and device identifiers (subsystem and class will be ignored during matching), as in the sketch below. The array should be terminated with a structure containing all zeros.
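
For illustration, a minimal id_table might look as follows; the DEMO_* identifiers and the demo_ prefix are hypothetical, not taken from any real device:

#include <linux/module.h>
#include <linux/pci.h>

/* Hypothetical identifiers, used only for illustration. */
#define DEMO_VENDOR_ID 0x1234
#define DEMO_DEVICE_ID 0x5678

static const struct pci_device_id demo_id_table[] = {
    { PCI_DEVICE(DEMO_VENDOR_ID, DEMO_DEVICE_ID) },
    { 0 }   /* terminating all-zero entry */
};
MODULE_DEVICE_TABLE(pci, demo_id_table);

The MODULE_DEVICE_TABLE() macro additionally exports the table from the module, so that user-space tools can load the module automatically when a matching device appears in the system.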

The pci_driver structure should be registered by calling pci_register_driver, and unregistered (when unloading the module) by pci_unregister_driver:

int pci_register_driver(struct pci_driver *);
void pci_unregister_driver(struct pci_driver *dev);

When the PCI subsystem finds a device that meets our criteria, it will call the probe() function from the provided structure on it. This function should initialize the device and prepare it for use by the user. Other functions are:

  • remove: executed when the driver should be detached from the device - when unregistering the driver or during a hot-unplug operation.

  • suspend: executed when the device should be put into a low-power state - e.g., when the computer enters a sleep state. The device's state should be saved in this function, as it will be lost when the device is powered down.

  • resume: executed when returning from a low-power state - should reverse the suspend operation.

  • shutdown: executed when shutting down the system.

A PCI device is represented by the pci_dev structure. The driver can attach a pointer to its private data to this structure:

void pci_set_drvdata(struct pci_dev *pdev, void *data);
void *pci_get_drvdata(struct pci_dev *pdev);
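
A minimal probe()/remove() sketch tying the above together might look like this; the demo_* names, the demo_priv structure, and demo_id_table (from the previous sketch) are illustrative assumptions, not a fixed convention:

#include <linux/pci.h>
#include <linux/slab.h>

/* Hypothetical private data of our driver. */
struct demo_priv {
    struct pci_dev *pdev;
    /* ... */
};

static int demo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    struct demo_priv *priv;

    priv = kzalloc(sizeof(*priv), GFP_KERNEL);
    if (!priv)
        return -ENOMEM;
    priv->pdev = pdev;
    pci_set_drvdata(pdev, priv);
    /* device initialization (enabling, mapping BARs, ...) goes here */
    return 0;
}

static void demo_remove(struct pci_dev *pdev)
{
    struct demo_priv *priv = pci_get_drvdata(pdev);

    /* undo everything done in probe() */
    kfree(priv);
}

static struct pci_driver demo_driver = {
    .name     = "demo_pci",
    .id_table = demo_id_table,
    .probe    = demo_probe,
    .remove   = demo_remove,
};

The structure is then registered with pci_register_driver(&demo_driver) in the module's init function and unregistered with pci_unregister_driver(&demo_driver) in its exit function.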

Memory and IO Port Regions

The most important and most common type of transaction on the PCI bus are memory accesses. The PCI bus address space is 64-bit (although not all devices are capable of using such large addresses - devices are often limited, e.g., to the lower 4GB of address space). Ranges of this space are assigned to devices through the configuration space - each device can have up to 6 memory regions, so-called BARs (from Base Address Register - cells in the configuration space used to set them). The size of the region is always a power of two and is determined by the device, while its beginning is movable and is set by the BIOS or operating system.

The PCI bus allows reads and writes of different sizes - 8-bit, 16-bit, or 32-bit. Most often, memory address space regions are used by devices as so-called memory-mapped IO - addresses in such a region have special meaning, and accesses to them are used to control the device. Such special memory cells are called input/output registers. The semantics of such accesses are often very different from accesses to ordinary memory - in particular, reads can have side effects. For example, there are hardware FIFO queues where writing to a given address adds a value to the queue, and reading takes the value and removes it from the queue. Care should also be taken to use read/write operations of a size matching the register size - writing a single byte to a 32-bit register can have unpredictable consequences. It is also always important to be aware of the byte order used by the device - PCI devices are almost always little-endian, even when used in machines using big-endian order.

In Linux, the PCI subsystem handles the mapping of device memory regions to the kernel's virtual address space for us. To use them, we must first enable support for these regions on the given device (and disable it when detaching from the device):

int pci_enable_device(struct pci_dev *dev);
void pci_disable_device(struct pci_dev *dev);

We must also reserve access to the device so that no other driver conflicts with us:

int pci_request_regions(struct pci_dev *pdev, const char *drv_name);
void pci_release_regions(struct pci_dev *);

Then we need to map a specific BAR to our address space:

void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long maxlen);
void pci_iounmap(struct pci_dev *dev, void __iomem *p);

The second parameter of pci_iomap is the BAR (region) number, and the third is a limit on the size of the mapped area - if it is non-zero and less than the size of the given region, only its beginning will be mapped. The returned pointer is of a special type __iomem and should not be used directly - accessing it may require special operations on some architectures supported by Linux. Instead, special functions should be used:

unsigned int ioread8(void __iomem *);
unsigned int ioread16(void __iomem *);
unsigned int ioread16be(void __iomem *);
unsigned int ioread32(void __iomem *);
unsigned int ioread32be(void __iomem *);

void iowrite8(u8, void __iomem *);
void iowrite16(u16, void __iomem *);
void iowrite16be(u16, void __iomem *);
void iowrite32(u32, void __iomem *);
void iowrite32be(u32, void __iomem *);

Functions without be in the name perform the access in little-endian order; functions with be perform it in big-endian order.
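
Putting the above calls together, the MMIO setup in probe() might look roughly like this. The sketch assumes the hypothetical demo_priv structure from the earlier sketch, extended with a void __iomem *bar0 field, and a purely illustrative 32-bit status register at offset 0x00 of BAR 0:

/* Sketch of the MMIO setup in probe(). */
static int demo_setup_mmio(struct pci_dev *pdev, struct demo_priv *priv)
{
    int err;

    err = pci_enable_device(pdev);
    if (err)
        return err;

    err = pci_request_regions(pdev, "demo_pci");
    if (err)
        goto err_disable;

    priv->bar0 = pci_iomap(pdev, 0, 0);   /* maxlen == 0: map the whole BAR */
    if (!priv->bar0) {
        err = -ENOMEM;
        goto err_release;
    }

    /* read the (hypothetical) status register at offset 0x00 */
    dev_info(&pdev->dev, "status: %08x\n", ioread32(priv->bar0));
    return 0;

err_release:
    pci_release_regions(pdev);
err_disable:
    pci_disable_device(pdev);
    return err;
}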

In addition to memory regions, PCI also supports IO port regions. This is a remnant from the early days of the IBM PC architecture and the ISA bus. IO ports are not used in new devices and will not be discussed here.

DMA

DMA (Direct Memory Access) occurs when a device itself performs memory access transactions at addresses mapped to the computer's RAM. This is the opposite approach to so-called PIO (Programmed Input/Output), which is the manual transfer of data to/from the device by the processor performing operations on its IO registers. When transferring data via DMA, the processor only provides the device with the address of the data buffer, and the device handles the rest itself.

DMA on the PCI bus is supported natively - any device can perform a memory access transaction. However, it should be remembered that although the addresses used on the PCI bus are usually equivalent to the physical addresses used by the processor, this is not true on all systems. There are systems where memory accesses coming from the PCI bus pass through a so-called IOMMU - a paging unit for devices. It performs address translation similar to the translation of virtual addresses by the processor. Therefore, a distinction should be made between the physical address of a given memory area and the so-called DMA address (also called bus address), which is the address at which the device will see it. IOMMU units are often used on server-class systems and on architectures other than x86.

Another important consideration when using DMA is the maximum address size supported by the device - many devices can generate, for example, only 32-bit addresses, which may prevent access to part of the computer's memory (unless we have an IOMMU that can address this problem).

In Linux, most things are handled for us by the DMA subsystem. For a device to perform DMA, you must first enable its ability to initiate transactions (so-called bus mastering):

void pci_set_master(struct pci_dev *dev);
void pci_clear_master(struct pci_dev *dev);

You should also inform the DMA subsystem about the address size supported by the device by executing both of the following functions:

int pci_set_dma_mask(struct pci_dev *dev, u64 mask);
int pci_set_consistent_dma_mask(struct pci_dev *dev, u64 mask);

The mask parameter should be set to DMA_BIT_MASK(number of address bits).

There are many ways to allocate memory for DMA, depending on the needs. The simplest is dma_alloc_coherent:

void *dma_alloc_coherent(struct device *dev, size_t size,
                        dma_addr_t *dma_handle, gfp_t flag);
void dma_free_coherent(struct device *dev, size_t size, void *cpu_addr,
                        dma_addr_t dma_handle);

The dma_alloc_coherent function returns the address of the allocated buffer in the kernel's virtual address space, while the DMA address of the buffer, in the address space visible to our device, is returned through the dma_handle parameter. As the first parameter of these functions, pass &pdev->dev, where pdev is a pointer to the pci_dev structure. Before freeing DMA memory, always make sure that the device has finished using it.
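
A minimal sketch of the DMA setup, using the helpers described above (on newer kernels the two pci_set_*_dma_mask calls are replaced by a single dma_set_mask_and_coherent(&pdev->dev, ...)). The buffer size and the buf_cpu/buf_dma fields of the hypothetical demo_priv structure are illustrative:

#include <linux/dma-mapping.h>

#define DEMO_BUF_SIZE 4096   /* illustrative buffer size */

/* Sketch: enable bus mastering, declare a 32-bit DMA mask and allocate one
 * coherent buffer; buf_cpu (void *) and buf_dma (dma_addr_t) are assumed
 * fields of the hypothetical demo_priv structure. */
static int demo_setup_dma(struct pci_dev *pdev, struct demo_priv *priv)
{
    pci_set_master(pdev);

    if (pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) ||
        pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32)))
        return -EIO;

    priv->buf_cpu = dma_alloc_coherent(&pdev->dev, DEMO_BUF_SIZE,
                                       &priv->buf_dma, GFP_KERNEL);
    if (!priv->buf_cpu)
        return -ENOMEM;

    /* priv->buf_dma is the address to be written into the device's registers */
    return 0;
}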

Interrupts

Interrupts are a mechanism that allows devices to inform the processor about interesting events. A PCI device can be connected to a so-called interrupt line. The device can request an interrupt on the line to which it is connected. When this happens, the processor interrupts its current task (unless it is already busy handling another, sufficiently important interrupt), saves its state, and executes the function assigned to the given interrupt line. This function should determine which device signaled the interrupt (due to the small number of interrupt lines, PCI allows them to be shared among different devices), handle the event, and return to the previously executed code. The function must ensure that upon return, the device stops signaling the interrupt - otherwise, the processor may execute this function endlessly.

In Linux, interrupt handling is managed by the architecture code. To have our driver's function executed upon an interrupt, we must register it using request_irq:

typedef irqreturn_t (*irq_handler_t)(int irq, void *dev);
int request_irq(unsigned int irq, irq_handler_t handler, unsigned long flags,
                const char *name, void *dev);
void free_irq(unsigned int irq, void *dev);

In the functions above, irq is the number of the interrupt line in the system. For PCI devices, it can be obtained from the pdev->irq field. The flags parameter should be set to IRQF_SHARED. The dev parameter should be set to some unique pointer specific to our device - it will be passed to our function.

Since interrupt lines can be shared, there can be multiple interrupt handling functions registered on one line, and calling the function does not necessarily mean that its device caused the interrupt. The function should check if the interrupt comes from its device and inform the system by returning IRQ_HANDLED if the interrupt was successfully handled, or IRQ_NONE if the device did not signal an interrupt.
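
A sketch of a handler for a shared interrupt line might look as follows. The DEMO_INTR_STATUS register, its write-to-acknowledge semantics, and the demo_priv/bar0 fields are illustrative assumptions about a hypothetical device:

#include <linux/interrupt.h>

/* Hypothetical 32-bit interrupt status register in BAR 0; writing back the
 * read value acknowledges (clears) the reported interrupts. */
#define DEMO_INTR_STATUS 0x04

static irqreturn_t demo_irq_handler(int irq, void *dev)
{
    struct demo_priv *priv = dev;
    u32 status = ioread32(priv->bar0 + DEMO_INTR_STATUS);

    if (!status)
        return IRQ_NONE;    /* our device did not signal an interrupt */

    iowrite32(status, priv->bar0 + DEMO_INTR_STATUS);   /* acknowledge */
    /* handle the event(s) reported in status ... */
    return IRQ_HANDLED;
}

/* In probe():  request_irq(pdev->irq, demo_irq_handler, IRQF_SHARED,
 *                          "demo_pci", priv);
 * In remove(): free_irq(pdev->irq, priv); */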

In interrupt handling functions, remember the usual limitations - above all, they must not block (sleep). Critical sections shared with an interrupt handler must be based on spinlocks (taken with spin_lock_irqsave).

Before removing an interrupt handling function, ensure that the device will no longer signal interrupts. Neglecting this can cause the system to crash - the kernel will endlessly try to handle the interrupt line without any active function to do so.

Newer PCI devices (including all PCIE) support an alternative mechanism for delivering interrupts, called MSI (Message Signalled Interrupts). In MSI, instead of interrupt lines, ordinary memory write transactions to the interrupt controller address are used. The signaled interrupt depends on the value written by the device, meaning MSI interrupts are not shared - the interrupt handling function can be sure that its device signaled it. However, we will not describe MSI handling in detail here.

Documentation

Many of the kernel interfaces mentioned above are documented in the Documentation directory. In particular, it is worth looking at:

  • Documentation/PCI/pci.txt

  • Documentation/DMA-API.txt

  • Documentation/DMA-API-HOWTO.txt

  • Documentation/PCI/MSI-HOWTO.txt - for newer hardware with MSI

  • A. Rubini, J. Corbet, G. Kroah-Hartman, Linux Device Drivers, 3rd edition, O'Reilly, 2005. ([http://lwn.net/Kernel/LDD3/](http://lwn.net/Kernel/LDD3/)) - chapters 9, 10, 12.