Acceldev Device¶
The Acceldev device is attached to the computer via the PCI bus. You will find
the necessary definitions in acceldev.h.
The device does not have memory of its own and uses the main memory of the computer with Direct Memory Access (DMA). To overcome memory fragmentation, it uses virtual addresses and page tables in its specific format.
The device supports ACCELDEV_MAX_CONTEXTS (255) independent contexts, which
do not share memory. These contexts should be tied to the driver contexts.
The device is designed to allow user commands to run without additional kernel mode validation, while ensuring that users cannot access system memory or other contexts on the device. If user commands are invalid or trigger an error, the device marks the context as errored and raises an interrupt to notify the driver.
Fair sharing of compute time across contexts is not guaranteed; one context may occupy the device and prevent others from running. Dealing with this issue is outside the scope of this assignment.
The device is controlled using MMIO registers. It has only one BAR (BAR0), used for these registers, and uses a single PCI interrupt line.
The MMIO area is 64 KiB in size, but only some of this range is used for registers. All documented registers are 32-bit and little-endian, and should be accessed only through aligned 32-bit reads and writes.
Buffers and Paging¶
Data and commands used by the device are stored in paged buffers. Each context
can have ACCELDEV_NUM_BUFFERS (16) buffers bound. Bindings are configured either
through ACCELDEV_CONTEXTS_CONFIGS, which points to an array of
acceldev_context_on_device_config structures, or with the
ACCELDEV_DEVICE_CMD_TYPE_BIND_SLOT device command.
Except for ACCELDEV_CONTEXTS_CONFIGS, which refers to a contiguous memory area
through a 64-bit address, all buffers and page tables use 40-bit physical
addresses, 22-bit virtual addresses in the buffer, and pages of size
ACCELDEV_PAGE_SIZE (4 KiB). The page tables are single-level.
The kernel passes a 64-bit physical address of the buffer's page table to the device.
Bits 12–21 of the virtual address select the page table entry, which contains the physical address of the page.
Bits 0–11 of the virtual address represent the offset within the page.
Page tables are 4 KiB in size and contain 1024 entries, each being a 32-bit little-endian word. Each page table entry has the following format:
Bit 0: PRESENT. If set, the entry is valid. If not set, using the entry raises a MEM_ERROR.
Bits 4–31: PA. Holds bits 12–39 of the page's physical address. Bits 0–11 of the address are always zero; pages must be aligned.
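The entry format and the address split described above can be sketched in C. This is a minimal illustration, not the driver's actual code: all macro and function names are hypothetical, and it assumes a little-endian host (a kernel driver would likely convert with cpu_to_le32() when writing entries into DMA memory).

```c
#include <stdint.h>

/* Hypothetical names; the values follow the format described above. */
#define SKETCH_PAGE_SHIFT   12u     /* 4 KiB pages */
#define SKETCH_PTE_PRESENT  0x1u    /* bit 0 */
#define SKETCH_PTE_PA_SHIFT 4u      /* PA field starts at bit 4 */

/* Build an entry for a page-aligned physical address below 2^40. */
static uint32_t make_pte(uint64_t page_phys)
{
    /* Bits 12-39 of the address become bits 4-31 of the entry. */
    uint32_t pa_field = (uint32_t)(page_phys >> SKETCH_PAGE_SHIFT);
    return (pa_field << SKETCH_PTE_PA_SHIFT) | SKETCH_PTE_PRESENT;
}

/* Bits 12-21 of a 22-bit virtual address select the entry. */
static uint32_t va_pte_index(uint32_t va)
{
    return (va >> SKETCH_PAGE_SHIFT) & 0x3ffu;
}

/* Bits 0-11 are the offset within the page. */
static uint32_t va_page_offset(uint32_t va)
{
    return va & 0xfffu;
}

/*
 * Software model of the lookup the device performs. Returns UINT64_MAX
 * where the device would raise MEM_ERROR (entry not present).
 */
static uint64_t translate(const uint32_t *page_table, uint32_t va)
{
    uint32_t pte = page_table[va_pte_index(va)];

    if (!(pte & SKETCH_PTE_PRESENT))
        return UINT64_MAX;
    return ((uint64_t)(pte >> SKETCH_PTE_PA_SHIFT) << SKETCH_PAGE_SHIFT)
           | va_page_offset(va);
}
```

The translate() helper exists only to check the encoding in software; the real lookup happens inside the device.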
Sending Commands¶
The device supports two types of commands:
Device commands, sent and validated by the driver.
User commands (also called context commands), sent by running a code buffer via the ACCELDEV_DEVICE_CMD_TYPE_RUN command.
Device Commands¶
Device commands consist of ACCELDEV_DEVICE_CMD_WORDS (5) 32-bit little-endian
words. They are sent via the CMD_MANUAL_FEED registers:
BAR0 + 0x008c: CMD_MANUAL_FREE
Read-only register. Shows how many full commands may be queued before a FEED_ERROR occurs. The queue holds CMDS_BUFFER_SIZE (255) commands. Assume the queue is empty after a device reset.
BAR0 + 0x008c – BAR0 + 0x009c: CMD_MANUAL_FEED
Five write-only registers for writing command words. Writing the last word (word 4, counting from 0) at BAR0 + 0x009c submits the command. Submitting when the queue is full (CMD_MANUAL_FREE == 0) raises a FEED_ERROR interrupt.
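A submission routine following this description might look like the sketch below. It is an illustration under stated assumptions: bar0 is a hypothetical word-indexed pointer standing in for the mapped BAR0 region (a Linux driver would more likely hold a void __iomem pointer and use ioread32()/iowrite32()), the macro and function names are made up, and locking against concurrent submitters is left out.

```c
#include <stdint.h>

/* Offsets as listed in the text above; the names are hypothetical. */
#define SKETCH_CMD_MANUAL_FREE  0x008cu
#define SKETCH_CMD_MANUAL_FEED  0x008cu   /* first of five registers */
#define SKETCH_DEVICE_CMD_WORDS 5u

/*
 * Submit one device command. Returns 0 on success, -1 if the queue
 * reports no free entries (submitting anyway would raise FEED_ERROR).
 */
static int submit_device_cmd(volatile uint32_t *bar0,
                             const uint32_t words[SKETCH_DEVICE_CMD_WORDS])
{
    if (bar0[SKETCH_CMD_MANUAL_FREE / 4] == 0)
        return -1;

    /* Words go out in order; word 4 lands at BAR0 + 0x009c and submits. */
    for (uint32_t i = 0; i < SKETCH_DEVICE_CMD_WORDS; i++)
        bar0[SKETCH_CMD_MANUAL_FEED / 4 + i] = words[i];
    return 0;
}
```

A caller that gets -1 back could, for example, keep the command in a software queue and retry later; how to handle a full queue is a driver design decision that this sketch does not make.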
NOP Command¶
Does nothing. Can be used to fill the queue if you feel like it.
0th word: header - Command type: 0x0
The other words are unused. To submit the command you only need to write the 0th and 4th words.
FENCE Command¶
Signals that all commands submitted before it have been processed.
0th word: header - Command type: 0x3
1st word: 32-bit value VAL
Behavior:
Waits for completion of all previous commands.
Sets CMD_FENCE_LAST to VAL.
If VAL == CMD_FENCE_WAIT, triggers the FENCE_WAIT interrupt.
Registers:
BAR0 + 0x00a0: CMD_FENCE_LAST
32-bit read/write register. Set to VAL while processing FENCE.
BAR0 + 0x00a4: CMD_FENCE_WAIT
32-bit read/write register. Used to schedule the FENCE_WAIT interrupt.
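One possible way to use FENCE is sketched below: encode the command with a running counter as VAL, then compare the value read from CMD_FENCE_LAST against the value being waited for. The names are hypothetical, and the comparison assumes the driver issues fence values in increasing order (with wraparound), which the device itself does not require.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_DEVICE_CMD_TYPE_FENCE 0x3u   /* command type from the text */

/* Fill the five command words for a device FENCE carrying val. */
static void encode_device_fence(uint32_t words[5], uint32_t val)
{
    words[0] = SKETCH_DEVICE_CMD_TYPE_FENCE;
    words[1] = val;
    words[2] = 0;
    words[3] = 0;
    words[4] = 0;   /* must still be written: writing word 4 submits */
}

/*
 * True if the fence value `wanted` has already been processed, given the
 * value read from CMD_FENCE_LAST. The signed difference keeps the test
 * correct across 32-bit wraparound, assuming fewer than 2^31 fences are
 * outstanding at once.
 */
static bool fence_reached(uint32_t fence_last, uint32_t wanted)
{
    return (int32_t)(fence_last - wanted) >= 0;
}
```

With this scheme, a waiter would write its wanted value to CMD_FENCE_WAIT and re-check fence_reached() afterwards, since the fence may have been processed before the register write took effect.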
RUN Command¶
Schedules user commands on a context.
0th word: header - Bits 0–3: command type 0x1 - Bits 4–31: context ID
1st–2nd words: lower/upper 32 bits of code buffer page table address
3rd word: offset (in bytes) of the first command
4th word: size (in bytes) of commands to process (n_commands * ACCELDEV_USER_CMD_WORDS * sizeof(uint32_t))
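The RUN layout above can be packed as in the following sketch. The function and macro names are hypothetical, and the number of words per user command is passed in as a parameter rather than hard-coded; a driver would presumably use ACCELDEV_USER_CMD_WORDS from acceldev.h.

```c
#include <stdint.h>

#define SKETCH_DEVICE_CMD_TYPE_RUN 0x1u   /* command type from the text */

/* Fill the five command words for a RUN device command. */
static void encode_run(uint32_t words[5], uint32_t ctx_id,
                       uint64_t code_pt_addr, uint32_t first_cmd_offset,
                       uint32_t n_commands, uint32_t user_cmd_words)
{
    /* Header: type in bits 0-3, context ID in bits 4-31. */
    words[0] = SKETCH_DEVICE_CMD_TYPE_RUN | (ctx_id << 4);
    /* Page table address of the code buffer, split into two halves. */
    words[1] = (uint32_t)(code_pt_addr & 0xffffffffu);
    words[2] = (uint32_t)(code_pt_addr >> 32);
    /* Byte offset of the first command inside the code buffer. */
    words[3] = first_cmd_offset;
    /* Size in bytes of the commands to process. */
    words[4] = n_commands * user_cmd_words * (uint32_t)sizeof(uint32_t);
}
```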
BIND_SLOT Command¶
Binds or unbinds a data buffer to a slot for a given context.
Binding and unbinding buffers can also be done through
ACCELDEV_CONTEXTS_CONFIGS, but that method is unsafe to use
while the device may be using the bound buffers.
Therefore, BIND_SLOT is preferred when the context is already running.
0th word: header - Bits 0–3: command type 0x2 - Bits 4–31: context ID
1st word: slot number
2nd–3rd words: lower/upper 32 bits of data buffer page table address
To unbind a buffer, either bind a new buffer to the same slot or submit 0 as the buffer page table address.
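As an illustration of the layout above, a BIND_SLOT command could be packed as follows. The names are hypothetical. The text does not describe the 4th word for this command, so the sketch assumes it is unused and writes 0, which is still needed because writing the 4th word is what submits a device command.

```c
#include <stdint.h>

#define SKETCH_DEVICE_CMD_TYPE_BIND_SLOT 0x2u   /* command type from the text */

/*
 * Fill the five command words for BIND_SLOT. Passing data_pt_addr == 0
 * unbinds whatever buffer is currently in the slot.
 */
static void encode_bind_slot(uint32_t words[5], uint32_t ctx_id,
                             uint32_t slot, uint64_t data_pt_addr)
{
    /* Header: type in bits 0-3, context ID in bits 4-31. */
    words[0] = SKETCH_DEVICE_CMD_TYPE_BIND_SLOT | (ctx_id << 4);
    words[1] = slot;
    /* Page table address of the data buffer, split into two halves. */
    words[2] = (uint32_t)(data_pt_addr & 0xffffffffu);
    words[3] = (uint32_t)(data_pt_addr >> 32);
    words[4] = 0;   /* assumed unused */
}
```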
User Commands¶
Each user command consists of ACCELDEV_USER_CMD_WORDS aligned 32-bit
little-endian words stored in a DMA buffer. Always write the full number of
words, even if the command is shorter.
Supported commands:
NOP (0x0)
FENCE (0x1)
FILL (0x2)
FENCE Command¶
Waits for previous user commands to finish.
Increments fence_counter in the context config.
Triggers the USER_FENCE_WAIT interrupt.
0th word: header - Command type: 0x1
FILL Command¶
Fills part of a buffer with a value.
0th word: header - Command type: 0x2
1st word: 32-bit value
2nd word: buffer slot
3rd word: start offset (bytes)
4th word: length (bytes)
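Writing user commands into a code buffer could look like the sketch below, which appends a FILL followed by a FENCE. The helper names are hypothetical, and the value 5 for the words-per-command constant is an assumption inferred from the FILL layout (words 0 to 4); a driver or user program should take the real value from ACCELDEV_USER_CMD_WORDS. The sketch also assumes a little-endian host.

```c
#include <stdint.h>

#define SKETCH_USER_CMD_WORDS      5u     /* assumption, see lead-in */
#define SKETCH_USER_CMD_TYPE_FENCE 0x1u   /* command types from the text */
#define SKETCH_USER_CMD_TYPE_FILL  0x2u

/* Write a FILL command at cmd; returns the position of the next command. */
static uint32_t *put_user_fill(uint32_t *cmd, uint32_t value, uint32_t slot,
                               uint32_t offset, uint32_t length)
{
    cmd[0] = SKETCH_USER_CMD_TYPE_FILL;
    cmd[1] = value;    /* 32-bit fill value */
    cmd[2] = slot;     /* buffer slot to fill */
    cmd[3] = offset;   /* start offset in bytes */
    cmd[4] = length;   /* length in bytes */
    return cmd + SKETCH_USER_CMD_WORDS;
}

/* Write a FENCE command at cmd, zero-padding the unused words. */
static uint32_t *put_user_fence(uint32_t *cmd)
{
    cmd[0] = SKETCH_USER_CMD_TYPE_FENCE;
    for (uint32_t i = 1; i < SKETCH_USER_CMD_WORDS; i++)
        cmd[i] = 0;
    return cmd + SKETCH_USER_CMD_WORDS;
}
```

The byte distance between the start of the buffer and the returned pointer is the size that a subsequent RUN command would pass as its 4th word.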
A real accelerator would support more interesting commands, but since the goal here is systems programming, these will suffice. If you are interested in this topic, refer to Extras – ONNX Runtime.
Control Registers¶
BAR0 + 0x0008: ENABLE
Controls whether the device processes commands.
If 0, then the commands are not processed.
BAR0 + 0x000c – BAR0 + 0x0010: CONTEXTS_CONFIGS
Attaches the contexts' configuration memory.
The configuration is stored in contiguous DMA memory
containing an array of acceldev_context_on_device_config
structures, one for each of the ACCELDEV_MAX_CONTEXTS contexts.
BAR0 + 0x000c: lower 32 bits of address
BAR0 + 0x0010: upper 32 bits of address
Interrupts¶
The device uses six internal interrupts, all multiplexed into one PCI interrupt.
FENCE_WAIT — completion of a FENCE command
FEED_ERROR — command queue full
CMD_ERROR — invalid device command
MEM_ERROR — invalid memory access
SLOT_ERROR — request for inactive slot
USER_FENCE_WAIT — user FENCE command triggered
An interrupt becomes active when its event occurs and inactive when cleared by
writing 1 to its bit in the INTR register.
Independently, each of the above interrupts can also be either enabled or disabled at any given time. The driver can set an enabled subset of interrupts by writing an appropriate mask to the INTR_ENABLE register. The device will signal an interrupt on its PCI interrupt line if and only if there is an interrupt that is both enabled and active.
BAR0 + 0x0000: INTR
Interrupt status register. Each bit corresponds to an interrupt. Reading returns
1 for active, 0 for inactive. Writing resets (sets to inactive) all interrupts
for which 1s were written.
BAR0 + 0x0004: INTR_ENABLE
Interrupt enable register. Same bit layout as INTR. A bit value of 1 enables
the interrupt. Upon machine reset, the register is set to 0, blocking the
device from signaling a PCI interrupt until the driver is loaded.
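The interplay of the two registers can be captured in a small software model. This is only a sketch of the semantics described above (events set bits, writing 1 clears them, the PCI line follows active AND enabled); the structure and function names are hypothetical, and the text does not specify which bit belongs to which interrupt, so none is assumed here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the device's interrupt state, one bit per internal interrupt. */
struct intr_model {
    uint32_t active;    /* what a read of INTR would return */
    uint32_t enabled;   /* contents of INTR_ENABLE */
};

/* An event on the device makes the corresponding interrupt active. */
static void intr_event(struct intr_model *m, uint32_t bits)
{
    m->active |= bits;
}

/* Writing to INTR clears every interrupt whose bit is 1 in the value. */
static void intr_write(struct intr_model *m, uint32_t val)
{
    m->active &= ~val;
}

/* The PCI line is raised iff some interrupt is both active and enabled. */
static bool pci_line_asserted(const struct intr_model *m)
{
    return (m->active & m->enabled) != 0;
}
```

A handler built on these semantics would typically read INTR, write the same value back to acknowledge exactly the interrupts it saw, and then process them, so that events arriving in between are not lost.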
Starting the Device¶
To start the device, follow these steps:
Clear INTR by writing all 1s.
Enable required interrupts via INTR_ENABLE.
Attach context configs in ACCELDEV_CONTEXTS_CONFIGS.
Enable all device blocks via ENABLE.
Optionally set CMD_FENCE_LAST and CMD_FENCE_WAIT.
To shut down, write 0 to both ENABLE and INTR_ENABLE.
If the device reports an error, reset it by repeating the start-up procedure.
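The start-up and shutdown sequences can be sketched as below, using the register offsets given in this document. Everything else is an assumption: bar0 is a hypothetical word-indexed pointer standing in for the mapped BAR0 region (a Linux driver would likely use iowrite32() on a void __iomem pointer), the function names are made up, and the interrupt mask and the ENABLE value are passed in by the caller because their bit layouts are not given here.

```c
#include <stdint.h>

/* Register offsets from this document; the macro names are hypothetical. */
#define SKETCH_INTR                0x0000u
#define SKETCH_INTR_ENABLE         0x0004u
#define SKETCH_ENABLE              0x0008u
#define SKETCH_CONTEXTS_CONFIGS_LO 0x000cu
#define SKETCH_CONTEXTS_CONFIGS_HI 0x0010u

/* Start the device, following the steps in the order listed above. */
static void sketch_start_device(volatile uint32_t *bar0, uint32_t intr_mask,
                                uint64_t configs_dma_addr,
                                uint32_t enable_bits)
{
    /* 1. Clear any stale interrupts by writing all 1s. */
    bar0[SKETCH_INTR / 4] = 0xffffffffu;
    /* 2. Enable the interrupts the driver wants to receive. */
    bar0[SKETCH_INTR_ENABLE / 4] = intr_mask;
    /* 3. Attach the context configuration array (64-bit address). */
    bar0[SKETCH_CONTEXTS_CONFIGS_LO / 4] =
        (uint32_t)(configs_dma_addr & 0xffffffffu);
    bar0[SKETCH_CONTEXTS_CONFIGS_HI / 4] =
        (uint32_t)(configs_dma_addr >> 32);
    /* 4. Enable the device blocks. */
    bar0[SKETCH_ENABLE / 4] = enable_bits;
}

/* Shut the device down: stop processing, then silence interrupts. */
static void sketch_stop_device(volatile uint32_t *bar0)
{
    bar0[SKETCH_ENABLE / 4] = 0;
    bar0[SKETCH_INTR_ENABLE / 4] = 0;
}
```

The optional CMD_FENCE_LAST and CMD_FENCE_WAIT initialization is omitted; it would be two more register writes after step 4.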