Assignment 3: Accelerator Device¶
Announcement date: 06.05.2025
Due date: 10.06.2025 (final due date 24.06.2025)
Additional materials¶
For your driver, use the acceldev.h file from the simulator. Do not modify it, as tests will be run with the official version.
z3-tests-2025.tar.xz: public tests for the driver
Introduction¶
CPUs are well suited for general-purpose programs, but various kinds of computations, e.g. computer graphics, scientific computing, and machine learning, can benefit significantly from using more specialized devices like GPUs and machine learning accelerators.
These devices are often realized as PCI cards. To expose their functionalities to user-space processes, higher-level abstractions such as CUDA, OpenCL, various graphics APIs (Vulkan, DirectX, OpenGL), and ONNX are typically used. To use those APIs with a specific device, device drivers are needed. These drivers translate higher-level operations into device instructions.
A device driver consists of both kernel mode and user mode code. Depending on specific software and hardware constraints, the real kernel mode driver can be either large or relatively small. With devices and APIs becoming increasingly complex, a pattern that has emerged recently is:
the device includes a complex built-in chip with its own device OS
complex user mode code handles APIs
a relatively lightweight kernel mode driver connects the user mode code to the device and exposes OS functionality. Some operations traditionally implemented in the kernel are moved to user mode and the device itself.
In this task, you will implement a Linux PCI driver for a simplified imaginary device providing ONNX acceleration, called Acceldev.
The kernel driver should expose the device as a character device. For every Acceldev device attached, it should create a /dev/acceldevX character device, where X is the number of the attached device, starting from 0.
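For illustration, below is a minimal sketch of how the PCI probe path could create /dev/acceldevX. The names acceldev_device, acceldev_fops, and ACCELDEV_MAX_DEVICES are hypothetical, and a real probe also has to map BARs, set up DMA and interrupts, and configure the device; this only shows the character device plumbing.

#include <linux/cdev.h>
#include <linux/fs.h>
#include <linux/idr.h>
#include <linux/pci.h>

#define ACCELDEV_MAX_DEVICES 256        /* hypothetical upper bound on X */

static dev_t acceldev_devt;             /* region reserved in module init with alloc_chrdev_region() */
static struct class *acceldev_class;    /* created in module init with class_create() */
static DEFINE_IDA(acceldev_ida);        /* hands out device numbers 0, 1, 2, ... */

static const struct file_operations acceldev_fops = {
	.owner = THIS_MODULE,
	/* .open, .release, .unlocked_ioctl, ... */
};

struct acceldev_device {
	struct pci_dev *pdev;
	struct cdev cdev;
	int idx;                        /* the X in /dev/acceldevX */
};

static int acceldev_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	struct acceldev_device *adev;
	dev_t devt;
	int err;

	adev = devm_kzalloc(&pdev->dev, sizeof(*adev), GFP_KERNEL);
	if (!adev)
		return -ENOMEM;
	adev->pdev = pdev;

	/* Lowest free index becomes the minor number and the X in the name. */
	adev->idx = ida_alloc_max(&acceldev_ida, ACCELDEV_MAX_DEVICES - 1, GFP_KERNEL);
	if (adev->idx < 0)
		return adev->idx;
	devt = MKDEV(MAJOR(acceldev_devt), adev->idx);

	cdev_init(&adev->cdev, &acceldev_fops);
	adev->cdev.owner = THIS_MODULE;
	err = cdev_add(&adev->cdev, devt, 1);
	if (err)
		goto err_ida;

	/* Makes devtmpfs/udev create the /dev/acceldevX node. */
	if (IS_ERR(device_create(acceldev_class, &pdev->dev, devt, NULL,
	                         "acceldev%d", adev->idx))) {
		err = -ENODEV;
		goto err_cdev;
	}

	pci_set_drvdata(pdev, adev);
	return 0;

err_cdev:
	cdev_del(&adev->cdev);
err_ida:
	ida_free(&acceldev_ida, adev->idx);
	return err;
}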
Character device interface¶
The /dev/acceldevX device should support the following operations (a user-space usage sketch follows this list):

open: allocates a new device context. This context will be used for sending commands. Support up to ACCELDEV_MAX_CONTEXTS open contexts, as each context should be registered in the device.

close: closes the context, unregisters it from the device, and frees resources.

ioctl(ACCELDEV_IOCTL_CREATE_BUFFER): creates a code buffer (for submitting user commands) or a data buffer (for sharing memory with the device). Use struct acceldev_ioctl_create_buffer. A data buffer should be bound to a context slot; a code buffer should not be bound to any slot. close on the buffer should wait until all previously scheduled runs on the context are completed, then unbind the buffer from the device and free allocated DMA memory. The buffer should support mmap to allow reading and writing its contents in user mode, and support close. No other operations (read, write, ioctl) are required. Validate arguments (return EINVAL), e.g. if the passed size exceeds ACCELDEV_BUFFER_MAX_SIZE (4 MiB).

ioctl(ACCELDEV_IOCTL_RUN): schedules the execution of user commands from a code buffer on a given context. See struct acceldev_ioctl_run and the examples. Submit runs to the device using ACCELDEV_DEVICE_CMD_TYPE_RUN. Do not store extra run information in the driver unless absolutely necessary. If you need to wait for a specific run or device instruction, use ACCELDEV_DEVICE_CMD_TYPE_FENCE and interrupts. If there is insufficient space in CMD_MANUAL_FEED, queue the run in the driver until it can be submitted. Validate the arguments, including size and memory alignment, and return EINVAL on error. If the context previously encountered an error, return EIO.

ioctl(ACCELDEV_IOCTL_WAIT): waits for the completion of a specific ACCELDEV_USER_CMD_TYPE_FENCE submitted on a given context. fence_wait is the number of the fence command to wait for (across all submitted runs on this context) modulo 2^32, starting from 1. The user mode driver is responsible for tracking the number of submitted fences. Return 0 on success, EIO on context error, and EINTR on interrupt.
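As a rough illustration of the intended flow from user space, here is a sketch. The struct field names (size, bind_slot, cfd, offset) and the WAIT argument type are placeholders invented for this example; the real layout is defined in acceldev.h and the provided examples, and error checking is omitted.

/* Hypothetical user-space flow; field names below are placeholders,
 * not the real acceldev.h layout. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include "acceldev.h"

int main(void)
{
	int ctx = open("/dev/acceldev0", O_RDWR);       /* new device context */

	struct acceldev_ioctl_create_buffer cb = {
		.size = 4096,                           /* hypothetical fields */
		.bind_slot = -1,                        /* code buffer: no context slot */
	};
	/* Assumed here: the ioctl returns the fd of the new buffer file. */
	int code_fd = ioctl(ctx, ACCELDEV_IOCTL_CREATE_BUFFER, &cb);

	uint32_t *code = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
	                      MAP_SHARED, code_fd, 0);
	/* ... fill code[] with user commands, ending with a user fence ... */

	struct acceldev_ioctl_run run = {
		.cfd = code_fd,                         /* which code buffer to run */
		.offset = 0,
		.size = 64,                             /* bytes of commands written above */
	};
	ioctl(ctx, ACCELDEV_IOCTL_RUN, &run);           /* 0, or -1 with errno EINVAL/EIO */

	/* Argument type as defined in acceldev.h; assumed here to be the fence number. */
	uint32_t fence_wait = 1;                        /* first fence submitted on this context */
	ioctl(ctx, ACCELDEV_IOCTL_WAIT, &fence_wait);   /* 0, or -1 with errno EIO/EINTR */

	munmap(code, 4096);
	close(code_fd);                                 /* waits for pending runs */
	close(ctx);
	return 0;
}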
Do not validate the user-submitted commands in code buffers. They are validated
by the device. If an error occurs, the device sets the error
status flag in
acceldev_context_on_device_config
for the given context and raises an
interrupt. However, do validate arguments where it makes sense, e.g. in ioctl calls.
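One possible shape for the waiting side of ACCELDEV_IOCTL_WAIT, assuming the interrupt handler advances a per-context completed-fence counter, wakes a wait queue, and sets an error flag when the device reports one (all field and function names here are made up for the sketch):

#include <linux/wait.h>

/* Hypothetical per-context state, updated from the interrupt handler. */
struct acceldev_context {
	wait_queue_head_t fence_wq;   /* wake_up() from the IRQ handler */
	u32 fences_done;              /* last completed fence number, mod 2^32 */
	bool error;                   /* device reported an error on this context */
	/* ... */
};

/* True once fence `wanted` has completed; the signed difference handles
 * the modulo-2^32 wraparound of the fence counter. */
static bool acceldev_fence_done(struct acceldev_context *ctx, u32 wanted)
{
	return (s32)(READ_ONCE(ctx->fences_done) - wanted) >= 0;
}

static long acceldev_wait_for_fence(struct acceldev_context *ctx, u32 fence_wait)
{
	if (wait_event_interruptible(ctx->fence_wq,
	                             acceldev_fence_done(ctx, fence_wait) ||
	                             READ_ONCE(ctx->error)))
		return -EINTR;        /* a signal interrupted the wait */
	if (READ_ONCE(ctx->error))
		return -EIO;          /* context hit an error */
	return 0;
}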
The interface is defined more precisely by the provided examples and acceldev.h. When in doubt, ask.
Solution format¶
The device driver should be implemented in C as a Linux kernel module, working with the lab's kernel version. The compiled module should be called acceldev.ko.
Submit an archive named ab123456.tar.gz (where ab123456 is your student login). After unpacking, the archive should create an ab123456 directory with the following contents:
the module source files
Makefile and Kbuild files — running make should build the acceldev.ko module
a README file with a brief description of your solution, including driver design choices (e.g. regarding locking, fences) and code structure
Grading¶
You can obtain up to 10 points. The assignment is graded based on automated tests and code review. The tests include the provided examples as well as additional undisclosed tests that are variations of them.
For the code review, points may be deducted for:
detected errors, e.g. regarding locking or memory leaks
issues like unclear or convoluted code structure (minor deductions)
The driver may consist of a single source file if it's well-structured. However, modular and well-documented code is preferable.
QEMU¶
Acceldev is implemented as a PCI device in QEMU.
To use the Acceldev device, a modified version of QEMU is required. It is available in source code form.
To compile it:
Clone the repository: https://gitlab.uw.edu.pl/zso/2025l-public/zad3-public.git
Run:
git checkout acceldev-public
Ensure that the following dependencies are installed: ncurses, libsdl, curl, and in some distributions also ncurses-dev, libsdl-dev, curl-dev (package names may vary).
Run ./configure with the desired options. Suggested flags: --target-list=x86_64-softmmu --enable-virtfs --enable-gtk
Change into the build directory: cd build
Run make (or ninja if installed).
Install with make install or run the binary directly (build/qemu-system-x86_64).
To emulate Acceldev:
Pass the option -device acceldev to QEMU. Repeat it to emulate multiple devices.
To add the Acceldev device live (while QEMU is running):
Enter QEMU monitor mode (Ctrl+Alt+2 inside the window)
Type:
device_add acceldev
Return to the main screen (Ctrl+Alt+1)
Run:
echo 1 > /sys/bus/pci/rescan
to detect the device in Linux
To simulate device removal:
Run:
echo 1 > /sys/bus/pci/devices/0000:<device_id>/remove
Hints¶
To create buffer files, use anon_inode_getfile or anon_inode_getfd.
To obtain a file struct from a file descriptor, use fdget and fdput.
To check if the passed file structure is valid, verify its file_operations.
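A sketch of how these hints fit together; acceldev_buffer and the function names are placeholders, not part of acceldev.h:

#include <linux/anon_inodes.h>
#include <linux/file.h>
#include <linux/fs.h>

struct acceldev_buffer;                         /* hypothetical per-buffer state */

static int acceldev_buffer_mmap(struct file *file, struct vm_area_struct *vma);
static int acceldev_buffer_release(struct inode *inode, struct file *file);

static const struct file_operations acceldev_buffer_fops = {
	.owner   = THIS_MODULE,
	.mmap    = acceldev_buffer_mmap,        /* see the mmap sketch below */
	.release = acceldev_buffer_release,
};

/* CREATE_BUFFER: back the buffer with an anonymous inode and return the
 * new fd to user space; buf ends up in file->private_data. */
static int acceldev_buffer_install(struct acceldev_buffer *buf)
{
	return anon_inode_getfd("[acceldev_buffer]", &acceldev_buffer_fops,
				buf, O_RDWR | O_CLOEXEC);
}

/* RUN: resolve a user-supplied fd and verify that it really is one of our
 * buffer files before trusting file->private_data. */
static struct acceldev_buffer *acceldev_buffer_get(int fd, struct file **filep)
{
	struct file *f = fget(fd);              /* fdget()/fdput() works too */

	if (!f)
		return ERR_PTR(-EBADF);
	if (f->f_op != &acceldev_buffer_fops) { /* not an acceldev buffer */
		fput(f);
		return ERR_PTR(-EINVAL);
	}
	*filep = f;                             /* caller calls fput() when done */
	return f->private_data;
}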
mmap implementation¶
Implement the mmap callback in file_operations to set vm_ops in the specified vma to your callbacks struct.
In your vm_operations_struct, fill in the fault callback.
In the fault callback (a sketch follows this list):
Verify that pgoff is within the buffer size or return VM_FAULT_SIGBUS.
Get the virtual address (in kernel space) of the appropriate buffer page and translate it with virt_to_page to struct page *.
Increase the page refcount with get_page.
Set the page field in vm_fault.
Return 0.
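A minimal sketch of these steps, assuming each buffer keeps an array with the kernel virtual address of each of its page-sized DMA chunks; the acceldev_buffer fields are hypothetical:

#include <linux/mm.h>

struct acceldev_buffer {
	void **pages_cpu;        /* hypothetical: kernel address of each page */
	unsigned long num_pages;
	/* ... DMA addresses, bound context slot, ... */
};

static vm_fault_t acceldev_buffer_fault(struct vm_fault *vmf)
{
	struct acceldev_buffer *buf = vmf->vma->vm_private_data;

	/* Faults past the end of the buffer must not be satisfied. */
	if (vmf->pgoff >= buf->num_pages)
		return VM_FAULT_SIGBUS;

	/* Translate the kernel virtual address of the backing page into
	 * its struct page and hand it to the core MM. */
	vmf->page = virt_to_page(buf->pages_cpu[vmf->pgoff]);
	get_page(vmf->page);     /* the MM drops this reference later */
	return 0;
}

static const struct vm_operations_struct acceldev_buffer_vm_ops = {
	.fault = acceldev_buffer_fault,
};

static int acceldev_buffer_mmap(struct file *file, struct vm_area_struct *vma)
{
	vma->vm_private_data = file->private_data;   /* the acceldev_buffer */
	vma->vm_ops = &acceldev_buffer_vm_ops;
	return 0;
}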
Extras – ONNX Runtime¶
For real applications, the kernel mode driver would be part of a larger package with a user mode driver.
For ML accelerators, a popular choice is ONNX, which provides tools for converting machine learning models (e.g. created using scikit-learn, PyTorch, or TensorFlow) into the ONNX format. This format stores models as graphs of ONNX operators, ranging from simple operations (e.g. Abs, vector addition) to complex ones like Transformer attention.
To accelerate such a model, the accelerator must support some ONNX operators. This integration can be accomplished using ONNX Runtime by registering a new Execution Provider for the device. This provider informs ONNX Runtime which operations are supported and converts them to device instructions using APIs such as NVIDIA CUDA, AMD ROCm, or the kernel driver.