Assignment 3: Accelerator Device

Announcement date: 06.05.2025

Due date: 10.06.2025 (final due date 24.06.2025)

Additional materials

Introduction

CPUs are well suited for general-purpose programs, but various kinds of computations, e.g. computer graphics, scientific computing, and machine learning, can benefit significantly from using more specialized devices like GPUs and machine learning accelerators.

These devices are often realized as PCI cards. To expose their functionality to user-space processes, higher-level abstractions such as CUDA, OpenCL, various graphics APIs (Vulkan, DirectX, OpenGL), and ONNX are typically used. Using those APIs with a specific device requires a device driver, which translates the higher-level operations into device instructions.

A device driver consists of both kernel mode and user mode code. Depending on the specific software and hardware constraints, the kernel mode part can be either large or relatively small. With devices and APIs becoming increasingly complex, a pattern that has recently emerged is:

  • the device includes a complex built-in chip with its own device OS

  • complex user mode code handles APIs

  • a relatively lightweight kernel mode driver connects the user mode code to the device and exposes OS functionality. Some operations traditionally implemented in the kernel are moved to user mode and the device itself.

In this task, you will implement a Linux PCI driver for a simplified imaginary device providing ONNX acceleration, called Acceldev.

The kernel driver should expose the device as a character device. For every Acceldev device attached, it should create a /dev/acceldevX character device, where X is the number of the attached device, starting from 0.
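The probe path for this kind of driver typically allocates a minor number per attached device and registers a character device for it. The sketch below shows one possible shape; everything except the /dev/acceldevX naming convention (the fops struct, the class/major/IDA bookkeeping, ACCELDEV_MAX_DEVICES) is an illustrative assumption, and error handling is abbreviated:

```c
#include <linux/pci.h>
#include <linux/cdev.h>
#include <linux/idr.h>

/* Sketch of per-device setup in the PCI probe callback.
 * All names except "acceldev%d" are illustrative assumptions. */
static int acceldev_probe(struct pci_dev *pdev,
                          const struct pci_device_id *id)
{
	struct acceldev_device *dev;
	int minor, err;

	err = pcim_enable_device(pdev);
	if (err)
		return err;

	dev = devm_kzalloc(&pdev->dev, sizeof(*dev), GFP_KERNEL);
	if (!dev)
		return -ENOMEM;

	/* Number devices 0, 1, ... in attach order, reusing freed slots. */
	minor = ida_alloc_max(&acceldev_minor_ida,
			      ACCELDEV_MAX_DEVICES - 1, GFP_KERNEL);
	if (minor < 0)
		return minor;

	cdev_init(&dev->cdev, &acceldev_fops);
	err = cdev_add(&dev->cdev, MKDEV(acceldev_major, minor), 1);
	if (err) {
		ida_free(&acceldev_minor_ida, minor);
		return err;
	}

	/* Makes udev create /dev/acceldevX, where X is the minor. */
	device_create(acceldev_class, &pdev->dev,
		      MKDEV(acceldev_major, minor), NULL,
		      "acceldev%d", minor);
	pci_set_drvdata(pdev, dev);
	return 0;
}
```

The remove callback would undo these steps in reverse order (device_destroy, cdev_del, ida_free).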

Character device interface

The /dev/acceldevX device should support the following operations:

  • open: allocates a new device context, which will be used for sending commands. Support up to ACCELDEV_MAX_CONTEXTS open contexts, as each context must be registered with the device.

  • close: closes the context, unregisters it from the device, and frees resources.

  • ioctl(ACCELDEV_IOCTL_CREATE_BUFFER): creates a code buffer (for submitting user commands) or a data buffer (for sharing memory with the device). Use struct acceldev_ioctl_create_buffer. A data buffer should be bound to a context slot; a code buffer should not be bound to any slot. close on the buffer should wait until all previously scheduled runs on the context are completed, then unbind the buffer from the device and free allocated DMA memory.

    The buffer should support mmap to allow reading and writing its contents in user mode and support close. No other operations (read, write, ioctl) are required.

    Validate arguments and return EINVAL on failure, e.g. if the passed size exceeds ACCELDEV_BUFFER_MAX_SIZE (4 MiB).

  • ioctl(ACCELDEV_IOCTL_RUN): schedules the execution of user commands from a code buffer on a given context. See struct acceldev_ioctl_run and examples.

    Submit runs to the device using ACCELDEV_DEVICE_CMD_TYPE_RUN. Do not store extra run information in the driver unless absolutely necessary. If you need to wait for a specific run or device instruction, use ACCELDEV_DEVICE_CMD_TYPE_FENCE and interrupts.

    If there's insufficient space in CMD_MANUAL_FEED, queue the run in the driver until it can be submitted.

    Validate the arguments, including size and memory alignment. Return EINVAL on error. If the context previously encountered an error, return EIO.

  • ioctl(ACCELDEV_IOCTL_WAIT): waits for the completion of a specific ACCELDEV_USER_CMD_TYPE_FENCE submitted on a given context.

    fence_wait is the number of the fence command to wait for (across all submitted runs on this context) modulo 2^32, starting from 1. The user mode driver is responsible for tracking the number of submitted fences.

    Return 0 on success, EIO on a context error, and EINTR if the wait is interrupted by a signal.
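One way to implement the "queue the run until it can be submitted" rule for ACCELDEV_IOCTL_RUN is a per-device pending list that is drained both from the ioctl path and from the interrupt handler once the device reports free space. Everything below (struct layout, the free-slot query, the MMIO write helper) is an illustrative assumption, not part of acceldev.h:

```c
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/lockdep.h>

/* Illustrative sketch: RUN commands that do not fit into
 * CMD_MANUAL_FEED are kept on a device-wide list and flushed whenever
 * space becomes available. All names here are assumptions. */
struct acceldev_pending_run {
	struct list_head node;
	struct acceldev_device_cmd cmd;	/* prebuilt RUN command */
};

static void acceldev_flush_pending(struct acceldev_device *dev)
{
	struct acceldev_pending_run *run, *tmp;

	lockdep_assert_held(&dev->lock);
	list_for_each_entry_safe(run, tmp, &dev->pending, node) {
		if (!acceldev_feed_has_space(dev))	/* hypothetical helper */
			break;
		acceldev_feed_write(dev, &run->cmd);	/* hypothetical MMIO write */
		list_del(&run->node);
		kfree(run);
	}
}
```

Calling this from the interrupt handler (under the same lock) keeps the driver free of per-run bookkeeping beyond the queue itself, in line with the "do not store extra run information" guideline.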

Do not validate the user-submitted commands in code buffers. They are validated by the device. If an error occurs, the device sets the error status flag in acceldev_context_on_device_config for the given context and raises an interrupt. However, do validate arguments where it makes sense, e.g. ioctl calls.

The interface is more strictly defined by the provided examples and acceldev.h. When in doubt, ask.
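Because fence numbers wrap modulo 2^32, "has fence F completed yet" should not be a plain `>=` comparison on the raw counters. A minimal, self-contained sketch of a wrap-safe check (the helper name is illustrative, not part of acceldev.h):

```c
#include <stdint.h>
#include <stdbool.h>

/* Wrap-safe check whether the last fence counter reported by the
 * device has reached the fence the user is waiting for. Fence numbers
 * start at 1 and wrap modulo 2^32; interpreting the difference as a
 * signed 32-bit value gives the right answer as long as fewer than
 * 2^31 fences are outstanding at once. */
static bool fence_reached(uint32_t last_completed, uint32_t wanted)
{
	return (int32_t)(last_completed - wanted) >= 0;
}
```

In the driver, ACCELDEV_IOCTL_WAIT could then be built on wait_event_interruptible over a per-context wait queue, waking it from the interrupt handler and mapping -ERESTARTSYS to the required EINTR.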

Solution format

The device driver should be implemented in C as a Linux kernel module, working with the lab's kernel version. The compiled module should be called acceldev.ko.

Submit an archive named ab123456.tar.gz (where ab123456 is your student login). After unpacking, the package should create an ab123456 directory with the following contents:

  • the module source files

  • Makefile and Kbuild files — running make should build the acceldev.ko module

  • a README file with a brief description of your solution, including driver design choices (e.g. regarding locking, fences) and code structure

Grading

You can obtain up to 10 points. The assignment is graded based on automated tests and code review. The tests include the provided examples as well as undisclosed tests that are variations of them.

For the code review, points may be deducted for:

  • detected errors, e.g. regarding locking or memory leaks

  • issues like unclear or convoluted code structure (minor deductions)

The driver may consist of a single source file if it's well-structured. However, modular and well-documented code is preferable.

QEMU

Acceldev is implemented as a PCI device in QEMU.

To use the Acceldev device, a modified version of QEMU is required. It is available in source code form.

To compile it:

  • Clone the repository: https://gitlab.uw.edu.pl/zso/2025l-public/zad3-public.git

  • Run: git checkout acceldev-public

  • Ensure that the following dependencies are installed: ncurses, libsdl, curl, and in some distributions also ncurses-dev, libsdl-dev, curl-dev (package names may vary).

  • Run ./configure with the desired options. Suggested flags:

    --target-list=x86_64-softmmu --enable-virtfs --enable-gtk
    
  • Change into the build directory:

    cd build
    
  • Run make (or ninja if installed).

  • Install with make install or run the binary directly (build/qemu-system-x86_64).

To emulate Acceldev:

  • Pass the option -device acceldev to QEMU. Repeat it to emulate multiple devices.

To add the Acceldev device live (while QEMU is running):

  • Enter QEMU monitor mode (Ctrl+Alt+2 inside the window)

  • Type: device_add acceldev

  • Return to the main screen (Ctrl+Alt+1)

  • Run: echo 1 > /sys/bus/pci/rescan to detect the device in Linux

To simulate device removal:

  • Run: echo 1 > /sys/bus/pci/devices/0000:<device_id>/remove

Hints

To create buffer files, use anon_inode_getfile or anon_inode_getfd. To obtain a file struct from a file descriptor, use fdget and fdput. To check if the passed file structure is valid, verify its file_operations.
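Putting those hints together, resolving a user-passed buffer file descriptor to the driver's own buffer object might look like this. The struct and fops names, and the refcount helper, are illustrative assumptions:

```c
#include <linux/file.h>
#include <linux/err.h>

/* Sketch: accept only files created by ACCELDEV_IOCTL_CREATE_BUFFER.
 * Comparing f_op against our own file_operations reliably identifies
 * the file type. All names here are assumptions. */
static struct acceldev_buffer *acceldev_buffer_get(int fd)
{
	struct fd f = fdget(fd);
	struct acceldev_buffer *buf;

	if (!f.file)
		return ERR_PTR(-EBADF);
	if (f.file->f_op != &acceldev_buffer_fops) {
		fdput(f);
		return ERR_PTR(-EINVAL);
	}
	buf = f.file->private_data;
	/* Hypothetical refcount helper: keep the buffer alive after we
	 * drop the fd reference below. */
	acceldev_buffer_ref(buf);
	fdput(f);
	return buf;
}
```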

mmap implementation

  1. Implement the mmap callback in file_operations to set vm_ops in the specified vma to your callbacks struct.

  2. In your vm_operations_struct, fill in the fault callback.

  3. In the fault callback:

    1. Verify that pgoff lies within the buffer size; otherwise return VM_FAULT_SIGBUS.

    2. Get the virtual address (in kernel space) of the appropriate buffer page and translate it with virt_to_page to struct page *.

    3. Increase the page refcount with get_page.

    4. Set the page field in vm_fault.

    5. Return 0.
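The steps above can be sketched as follows. The buffer layout (an array of per-page kernel virtual addresses, e.g. obtained page by page from dma_alloc_coherent) is an assumption:

```c
#include <linux/mm.h>

/* Sketch of the mmap path described above; struct acceldev_buffer
 * and its page_kva[] array are illustrative assumptions. */
static vm_fault_t acceldev_buffer_fault(struct vm_fault *vmf)
{
	struct acceldev_buffer *buf = vmf->vma->vm_private_data;

	/* 1. pgoff counts pages from the start of the mapping. */
	if (vmf->pgoff >= buf->size >> PAGE_SHIFT)
		return VM_FAULT_SIGBUS;

	/* 2. Kernel virtual address of the page -> struct page *. */
	vmf->page = virt_to_page(buf->page_kva[vmf->pgoff]);

	/* 3. Take a reference; the core MM drops it when the
	 * mapping goes away. */
	get_page(vmf->page);

	/* 4.-5. vmf->page is set; returning 0 maps it. */
	return 0;
}

static const struct vm_operations_struct acceldev_buffer_vm_ops = {
	.fault = acceldev_buffer_fault,
};

static int acceldev_buffer_mmap(struct file *filp,
				struct vm_area_struct *vma)
{
	vma->vm_ops = &acceldev_buffer_vm_ops;
	vma->vm_private_data = filp->private_data;
	return 0;
}
```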

Extras – ONNX Runtime

For real applications, the kernel mode driver would be part of a larger package with a user mode driver.

For ML accelerators, a popular choice is ONNX, which provides tools for converting machine learning models (e.g. created using scikit-learn, PyTorch, or TensorFlow) into the ONNX format. This format saves models as graphs of ONNX operators, including both simple operators (e.g. Abs, Add) and complex ones such as Transformer attention.

To accelerate such a model, the accelerator must support some ONNX operators. This integration can be accomplished using ONNX Runtime by registering a new Execution Provider for the device. This provider informs ONNX Runtime which operations are supported and converts them to device instructions using APIs such as NVIDIA CUDA, AMD ROCm, or the kernel driver.