.. _z3-acceldev:

==========================================
Assignment 3: Accelerator Device
==========================================

Announcement date: 06.05.2025

Due date: 10.06.2025 (final due date 24.06.2025)

.. toctree::
   :hidden:

   device

Additional materials
====================

- :ref:`z3-device`
- `Device simulator in QEMU `_

  - For your driver, use the ``acceldev.h`` file from the simulator. Do not modify it, as tests will be run with the official version.

- :download:`z3-tests-2025.tar.xz` – public tests for the driver

Introduction
============

CPUs are well suited for general-purpose programs, but various kinds of computations, e.g. computer graphics, scientific computing, and machine learning, can benefit significantly from more specialized devices such as GPUs and machine learning accelerators. These devices are often realized as PCI cards.

To expose their functionality to user-space processes, higher-level abstractions such as CUDA, OpenCL, various graphics APIs (Vulkan, DirectX, OpenGL), and ONNX are typically used. To use those APIs with a specific device, device drivers are needed; they translate higher-level operations into device instructions. A device driver consists of both kernel mode and user mode code. Depending on specific software and hardware constraints, the *real* kernel mode driver can be either large or relatively small.

With devices and APIs becoming increasingly complex, a pattern that has emerged recently is:

- the device includes a complex built-in chip with its own *device OS*,
- complex user mode code handles the APIs,
- a relatively lightweight kernel mode driver connects the user mode code to the device and exposes OS functionality.

Some operations traditionally implemented in the kernel are moved to user mode and to the device itself.

In this task, you will implement a Linux PCI driver for a simplified imaginary device providing ONNX acceleration, called **Acceldev**. The kernel driver should expose the device as a character device: for every **Acceldev** device attached, it should create a ``/dev/acceldevX`` character device, where ``X`` is the number of the attached device, starting from 0.

Character device interface
==========================

The ``/dev/acceldevX`` device should support the following operations:

- ``open``: allocates a new device context, which will be used for sending commands. Support up to ``ACCELDEV_MAX_CONTEXTS`` open contexts, as each context should be registered in the device.

- ``close``: closes the context, unregisters it from the device, and frees its resources.

- ``ioctl(ACCELDEV_IOCTL_CREATE_BUFFER)``: creates a code buffer (for submitting user commands) or a data buffer (for sharing memory with the device). Use ``struct acceldev_ioctl_create_buffer``. A data buffer should be bound to a context slot; a code buffer should not be bound to any slot. ``close`` on the buffer should wait until all previously scheduled runs on the context are completed, then unbind the buffer from the device and free the allocated DMA memory. The buffer should support ``mmap`` (so that its contents can be read and written from user mode) and ``close``; no other operations (``read``, ``write``, ``ioctl``) are required. Validate the arguments and return ``EINVAL`` on error, e.g. when the passed size exceeds ``ACCELDEV_BUFFER_MAX_SIZE`` (4 MiB).

- ``ioctl(ACCELDEV_IOCTL_RUN)``: schedules the execution of user commands from a code buffer on a given context. See ``struct acceldev_ioctl_run`` and the examples. Submit runs to the device using ``ACCELDEV_DEVICE_CMD_TYPE_RUN``. Do not store extra run information in the driver unless absolutely necessary. If you need to wait for a specific run or device instruction, use ``ACCELDEV_DEVICE_CMD_TYPE_FENCE`` and interrupts. If there is insufficient space in ``CMD_MANUAL_FEED``, queue the run in the driver until it can be submitted. Validate the arguments, including size and memory alignment, and return ``EINVAL`` on error. If the context previously encountered an error, return ``EIO``.

- ``ioctl(ACCELDEV_IOCTL_WAIT)``: waits for the completion of a specific ``ACCELDEV_USER_CMD_TYPE_FENCE`` command submitted on a given context. ``fence_wait`` is the number of the fence command to wait for (counted across all runs submitted on this context) modulo ``2^32``, starting from 1. The user mode driver is responsible for tracking the number of submitted fences. Return 0 on success, ``EIO`` on a context error, and ``EINTR`` when interrupted.

Do not validate the user-submitted commands in code buffers; they are validated by the device. If an error occurs, the device sets the ``error`` status flag in ``acceldev_context_on_device_config`` for the given context and raises an interrupt. However, do validate arguments where it makes sense, e.g. in ioctl calls.

The interface is defined more precisely by the provided examples and ``acceldev.h``. When in doubt, ask.
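For orientation, a user-space program could exercise this interface roughly as follows. This is only a sketch: the exact structure layouts are defined by ``acceldev.h`` and the provided examples, the name ``struct acceldev_ioctl_wait`` is a placeholder, and the sketch assumes that ``ACCELDEV_IOCTL_CREATE_BUFFER`` returns the new buffer's file descriptor::

    /* Illustrative sketch only -- error handling omitted; structure fields
     * must be filled in according to acceldev.h. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #include "acceldev.h"

    #define BUF_SIZE 4096

    int main(void)
    {
        int ctx = open("/dev/acceldev0", O_RDWR);   /* allocates a new context */

        /* Create a code buffer; set size/type (and, for data buffers, the
         * context slot) as defined in acceldev.h. */
        struct acceldev_ioctl_create_buffer cb_args = {0};
        int code_fd = ioctl(ctx, ACCELDEV_IOCTL_CREATE_BUFFER, &cb_args);

        /* Map the buffer and fill it with user commands, including an
         * ACCELDEV_USER_CMD_TYPE_FENCE command. */
        uint32_t *code = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                              MAP_SHARED, code_fd, 0);
        /* ... write commands into code[] ... */

        /* Schedule the commands on this context. */
        struct acceldev_ioctl_run run_args = {0};   /* code buffer, offset, size */
        ioctl(ctx, ACCELDEV_IOCTL_RUN, &run_args);

        /* Wait for the first fence submitted on this context
         * (struct name is a placeholder -- use the one from acceldev.h). */
        struct acceldev_ioctl_wait wait_args = {0}; /* fence_wait = 1 */
        ioctl(ctx, ACCELDEV_IOCTL_WAIT, &wait_args);

        munmap(code, BUF_SIZE);
        close(code_fd);
        close(ctx);
        return 0;
    }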
Solution format
===============

The device driver should be implemented in C as a Linux kernel module working with the lab's kernel version. The compiled module should be called ``acceldev.ko``.

Submit an archive named ``ab123456.tar.gz`` (where ``ab123456`` is your student login). After unpacking, the package should create an ``ab123456`` directory with the following contents:

- the module source files,
- ``Makefile`` and ``Kbuild`` files (running ``make`` should build the ``acceldev.ko`` module),
- a README file with a brief description of your solution, including driver design choices (e.g. regarding locking and fences) and code structure.

Grading
=======

You can obtain up to 10 points. The assignment is graded based on automated tests and a code review. The tests include the provided examples as well as some undisclosed tests which are variations of the provided examples.

In the code review, points may be deducted for:

- detected errors, e.g. regarding locking or memory leaks,
- minor issues such as unclear or convoluted code structure (minor deductions).

The driver may consist of a single source file if it is well structured; however, modular and well-documented code is preferable.

QEMU
====

**Acceldev** is implemented as a PCI device in QEMU. To use the **Acceldev** device, a modified version of QEMU is required. It is available in source code form. To compile it:

- Clone the repository: https://gitlab.uw.edu.pl/zso/2025l-public/zad3-public.git
- Run ``git checkout acceldev-public``.
- Ensure that the following dependencies are installed: **ncurses**, **libsdl**, **curl**, and in some distributions also **ncurses-dev**, **libsdl-dev**, **curl-dev** (package names may vary).
- Run ``./configure`` with the desired options. Suggested flags::

    --target-list=x86_64-softmmu --enable-virtfs --enable-gtk

- Change into the build directory::

    cd build

- Run ``make`` (or ``ninja`` if installed).
- Install with ``make install`` or run the binary directly (``build/qemu-system-x86_64``).

To emulate **Acceldev**:

- Pass the option ``-device acceldev`` to QEMU. Repeat it to emulate multiple devices.
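For example, the relevant part of an invocation might look like this (the kernel image, disk image, and remaining options are placeholders for your own setup)::

    ./build/qemu-system-x86_64 \
        -kernel bzImage -append "root=/dev/sda console=ttyS0" \
        -drive file=rootfs.img,format=raw \
        -device acceldev -device acceldev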
To add the **Acceldev** device live (while QEMU is running):

- Enter the QEMU monitor (Ctrl+Alt+2 inside the window).
- Type ``device_add acceldev``.
- Return to the main screen (Ctrl+Alt+1).
- Run ``echo 1 > /sys/bus/pci/rescan`` so that Linux detects the device.

To simulate device removal:

- Run ``echo 1 > /sys/bus/pci/devices/0000:/remove``.

Hints
=====

To create buffer files, use ``anon_inode_getfile`` or ``anon_inode_getfd``. To obtain a file struct from a file descriptor, use ``fdget`` and ``fdput``. To check whether a file structure passed by user space is valid (i.e. one of your buffers), verify its ``file_operations``.

``mmap`` implementation
-----------------------

#. Implement the ``mmap`` callback in ``file_operations`` so that it sets ``vm_ops`` in the given ``vma`` to your own callback struct.
#. In your ``vm_operations_struct``, fill in the ``fault`` callback.
#. In the ``fault`` callback:

   #. Verify that ``pgoff`` is within the buffer size; otherwise return ``VM_FAULT_SIGBUS``.
   #. Get the kernel virtual address of the appropriate buffer page and translate it to a ``struct page *`` with ``virt_to_page``.
   #. Increase the page refcount with ``get_page``.
   #. Set the ``page`` field in ``vm_fault``.
   #. Return 0.

An example sketch of these hints is given in the appendix at the end of this document.

.. _z3-driver-onnx:

Extras – ONNX Runtime
=====================

For real applications, the kernel mode driver would be part of a larger package with a user mode driver. For ML accelerators, a popular choice is `ONNX `_, which provides tools for converting machine learning models (e.g. created with scikit-learn, PyTorch, or TensorFlow) into the ONNX format. This format stores models as graphs of `ONNX operators `_, ranging from simple ones (e.g. *Abs*, *Vector Addition*) to complex operations like *Transformer Attention*. To accelerate such a model, the accelerator must support some ONNX operators. The integration can be accomplished using `ONNX Runtime `_ by registering a new Execution Provider for the device. This provider informs ONNX Runtime which operations are supported and converts them into device instructions using APIs such as NVIDIA CUDA, AMD ROCm, or the kernel driver.
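Appendix: example buffer sketch
===============================

The following is a minimal, non-authoritative sketch of the buffer-related hints above (``anon_inode_getfd``, the ``mmap`` callback, and the ``fault`` callback). ``struct acceldev_buffer`` and all helper names are hypothetical placeholders, not part of the assignment interface; adapt them to your own driver design::

    #include <linux/anon_inodes.h>
    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/module.h>

    /* Hypothetical per-buffer bookkeeping kept by the driver. */
    struct acceldev_buffer {
        size_t num_pages;   /* buffer size in pages */
        void **page_kva;    /* kernel virtual address of each buffer page */
    };

    static vm_fault_t acceldev_buffer_fault(struct vm_fault *vmf)
    {
        struct acceldev_buffer *buf = vmf->vma->vm_private_data;
        struct page *page;

        /* Reject faults outside the buffer. */
        if (vmf->pgoff >= buf->num_pages)
            return VM_FAULT_SIGBUS;

        /* Translate the page's kernel address, pin the page, hand it back. */
        page = virt_to_page(buf->page_kva[vmf->pgoff]);
        get_page(page);
        vmf->page = page;
        return 0;
    }

    static const struct vm_operations_struct acceldev_buffer_vm_ops = {
        .fault = acceldev_buffer_fault,
    };

    static int acceldev_buffer_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        /* The buffer was stored in ->private_data when its file was created. */
        vma->vm_private_data = filp->private_data;
        vma->vm_ops = &acceldev_buffer_vm_ops;
        return 0;
    }

    static const struct file_operations acceldev_buffer_fops = {
        .owner = THIS_MODULE,
        .mmap  = acceldev_buffer_mmap,
        /* .release: wait for pending runs, unbind the buffer, free DMA memory */
    };

    /* Called from the CREATE_BUFFER ioctl handler: wrap the buffer in an
     * anonymous-inode file and return its descriptor to user space. */
    static int acceldev_buffer_install_fd(struct acceldev_buffer *buf)
    {
        return anon_inode_getfd("[acceldev_buffer]", &acceldev_buffer_fops,
                                buf, O_RDWR | O_CLOEXEC);
    }

A file descriptor received from user space (e.g. in ``ACCELDEV_IOCTL_RUN``) can later be recognised as one of the driver's buffers by looking it up with ``fdget`` and comparing the file's ``f_op`` pointer with ``&acceldev_buffer_fops``.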