.. _z3-acceldev:

==========================================
Assignment 3: Accelerator Device
==========================================

Announcement date: 06.05.2025

Due date: 10.06.2025 (final due date 24.06.2025)

.. toctree::
   :hidden:

   device

Additional materials
====================

- :ref:`z3-device`
- `Device simulator in QEMU `_

  - For your driver, use the ``acceldev.h`` file from the simulator. Do not modify it, as tests will be run with the official version.

- :download:`z3-tests-2025.tar.xz` – public tests for the driver

Introduction
============

CPUs are well suited for general-purpose programs, but various kinds of computations, e.g. computer graphics, scientific computing, and machine learning, can benefit significantly from more specialized devices such as GPUs and machine learning accelerators. These devices are often realized as PCI cards.

To expose their functionality to user-space processes, higher-level abstractions such as CUDA, OpenCL, various graphics APIs (Vulkan, DirectX, OpenGL), and ONNX are typically used. To use those APIs with a specific device, device drivers are needed; they translate higher-level operations into device instructions. A device driver consists of both kernel mode and user mode code. Depending on specific software and hardware constraints, the *real* kernel mode driver can be either large or relatively small.

With devices and APIs becoming increasingly complex, a pattern that has emerged recently is:

- the device includes a complex built-in chip with its own *device OS*,
- complex user mode code handles the APIs,
- a relatively lightweight kernel mode driver connects the user mode code to the device and exposes OS functionality.

Some operations traditionally implemented in the kernel are moved to user mode and to the device itself.

In this task, you will implement a Linux PCI driver for a simplified imaginary device providing ONNX acceleration, called **Acceldev**. The kernel driver should expose the device as a character device: for every **Acceldev** device attached, it should create a ``/dev/acceldevX`` character device, where ``X`` is the number of the attached device, starting from 0.

Character device interface
==========================

The ``/dev/acceldevX`` device should support the following operations:

- ``open``: allocates a new device context, which will be used for sending commands. Support up to ``ACCELDEV_MAX_CONTEXTS`` open contexts, as each context should be registered in the device.

- ``close``: closes the context, unregisters it from the device, and frees its resources.

- ``ioctl(ACCELDEV_IOCTL_CREATE_BUFFER)``: creates a code buffer (for submitting user commands) or a data buffer (for sharing memory with the device). Use ``struct acceldev_ioctl_create_buffer``. A data buffer should be bound to a context slot; a code buffer should not be bound to any slot. ``close`` on the buffer should wait until all previously scheduled runs on the context are completed, then unbind the buffer from the device and free the allocated DMA memory. The buffer should support ``mmap`` (so that its contents can be read and written from user mode) and ``close``; no other operations (``read``, ``write``, ``ioctl``) are required. Validate the arguments and return ``EINVAL`` on error, e.g. when the passed size exceeds ``ACCELDEV_BUFFER_MAX_SIZE`` (4 MiB).

- ``ioctl(ACCELDEV_IOCTL_RUN)``: schedules the execution of user commands from a code buffer on a given context. See ``struct acceldev_ioctl_run`` and the examples. Submit runs to the device using ``ACCELDEV_DEVICE_CMD_TYPE_RUN``. Do not store extra run information in the driver unless absolutely necessary. If you need to wait for a specific run or device instruction, use ``ACCELDEV_DEVICE_CMD_TYPE_FENCE`` and interrupts. If there is insufficient space in ``CMD_MANUAL_FEED``, queue the run in the driver until it can be submitted. Validate the arguments, including size and memory alignment, and return ``EINVAL`` on error. If the context previously encountered an error, return ``EIO``.

- ``ioctl(ACCELDEV_IOCTL_WAIT)``: waits for the completion of a specific ``ACCELDEV_USER_CMD_TYPE_FENCE`` command submitted on a given context. ``fence_wait`` is the number of the fence command to wait for (counted across all runs submitted on this context) modulo ``2^32``, starting from 1. The user mode driver is responsible for tracking the number of submitted fences. Return 0 on success, ``EIO`` on a context error, and ``EINTR`` when interrupted.

Do not validate the user-submitted commands in code buffers; they are validated by the device. If an error occurs, the device sets the ``error`` status flag in ``acceldev_context_on_device_config`` for the given context and raises an interrupt. However, do validate arguments where it makes sense, e.g. in ioctl calls.

The interface is defined more precisely by the provided examples and ``acceldev.h``. When in doubt, ask.
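For orientation, a user-space program could exercise this interface roughly as follows. This is only a sketch: the exact structure layouts are defined by ``acceldev.h`` and the provided examples, the name ``struct acceldev_ioctl_wait`` is a placeholder, and the sketch assumes that ``ACCELDEV_IOCTL_CREATE_BUFFER`` returns the new buffer's file descriptor::

    /* Illustrative sketch only -- error handling omitted; structure fields
     * must be filled in according to acceldev.h. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #include "acceldev.h"

    #define BUF_SIZE 4096

    int main(void)
    {
        int ctx = open("/dev/acceldev0", O_RDWR);   /* allocates a new context */

        /* Create a code buffer; set size/type (and, for data buffers, the
         * context slot) as defined in acceldev.h. */
        struct acceldev_ioctl_create_buffer cb_args = {0};
        int code_fd = ioctl(ctx, ACCELDEV_IOCTL_CREATE_BUFFER, &cb_args);

        /* Map the buffer and fill it with user commands, including an
         * ACCELDEV_USER_CMD_TYPE_FENCE command. */
        uint32_t *code = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                              MAP_SHARED, code_fd, 0);
        /* ... write commands into code[] ... */

        /* Schedule the commands on this context. */
        struct acceldev_ioctl_run run_args = {0};   /* code buffer, offset, size */
        ioctl(ctx, ACCELDEV_IOCTL_RUN, &run_args);

        /* Wait for the first fence submitted on this context
         * (struct name is a placeholder -- use the one from acceldev.h). */
        struct acceldev_ioctl_wait wait_args = {0}; /* fence_wait = 1 */
        ioctl(ctx, ACCELDEV_IOCTL_WAIT, &wait_args);

        munmap(code, BUF_SIZE);
        close(code_fd);
        close(ctx);
        return 0;
    }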
Solution format
===============

The device driver should be implemented in C as a Linux kernel module working with the lab's kernel version. The compiled module should be called ``acceldev.ko``.

Submit an archive named ``ab123456.tar.gz`` (where ``ab123456`` is your student login). After unpacking, the package should create an ``ab123456`` directory with the following contents:

- the module source files,
- ``Makefile`` and ``Kbuild`` files (running ``make`` should build the ``acceldev.ko`` module),
- a README file with a brief description of your solution, including driver design choices (e.g. regarding locking and fences) and code structure.

Grading
=======

You can obtain up to 10 points. The assignment is graded based on automated tests and a code review. The tests include the provided examples as well as some undisclosed tests which are variations of the provided examples.

In the code review, points may be deducted for:

- detected errors, e.g. regarding locking or memory leaks,
- minor issues such as unclear or convoluted code structure (minor deductions).

The driver may consist of a single source file if it is well structured; however, modular and well-documented code is preferable.

QEMU
====

**Acceldev** is implemented as a PCI device in QEMU. To use the **Acceldev** device, a modified version of QEMU is required. It is available in source code form. To compile it:

- Clone the repository: https://gitlab.uw.edu.pl/zso/2025l-public/zad3-public.git
- Run ``git checkout acceldev-public``.
- Ensure that the following dependencies are installed: **ncurses**, **libsdl**, **curl**, and in some distributions also **ncurses-dev**, **libsdl-dev**, **curl-dev** (package names may vary).
- Run ``./configure`` with the desired options. Suggested flags::

    --target-list=x86_64-softmmu --enable-virtfs --enable-gtk

- Change into the build directory::

    cd build

- Run ``make`` (or ``ninja`` if installed).
- Install with ``make install`` or run the binary directly (``build/qemu-system-x86_64``).

To emulate **Acceldev**:

- Pass the option ``-device acceldev`` to QEMU. Repeat it to emulate multiple devices.
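For example, the relevant part of an invocation might look like this (the kernel image, disk image, and remaining options are placeholders for your own setup)::

    ./build/qemu-system-x86_64 \
        -kernel bzImage -append "root=/dev/sda console=ttyS0" \
        -drive file=rootfs.img,format=raw \
        -device acceldev -device acceldev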
To add the **Acceldev** device live (while QEMU is running):

- Enter the QEMU monitor (Ctrl+Alt+2 inside the window).
- Type ``device_add acceldev``.
- Return to the main screen (Ctrl+Alt+1).
- Run ``echo 1 > /sys/bus/pci/rescan`` so that Linux detects the device.

To simulate device removal:

- Run ``echo 1 > /sys/bus/pci/devices/0000:/remove``.

Hints
=====

To create buffer files, use ``anon_inode_getfile`` or ``anon_inode_getfd``. To obtain a file struct from a file descriptor, use ``fdget`` and ``fdput``. To check whether a file structure passed by user space is valid (i.e. one of your buffers), verify its ``file_operations``.

``mmap`` implementation
-----------------------

#. Implement the ``mmap`` callback in ``file_operations`` so that it sets ``vm_ops`` in the given ``vma`` to your own callback struct.
#. In your ``vm_operations_struct``, fill in the ``fault`` callback.
#. In the ``fault`` callback:

   #. Verify that ``pgoff`` is within the buffer size; otherwise return ``VM_FAULT_SIGBUS``.
   #. Get the kernel virtual address of the appropriate buffer page and translate it to a ``struct page *`` with ``virt_to_page``.
   #. Increase the page refcount with ``get_page``.
   #. Set the ``page`` field in ``vm_fault``.
   #. Return 0.

An example sketch of these hints is given in the appendix at the end of this document.

.. _z3-driver-onnx:

Extras – ONNX Runtime
=====================

For real applications, the kernel mode driver would be part of a larger package with a user mode driver. For ML accelerators, a popular choice is `ONNX `_, which provides tools for converting machine learning models (e.g. created with scikit-learn, PyTorch, or TensorFlow) into the ONNX format. This format stores models as graphs of `ONNX operators `_, ranging from simple ones (e.g. *Abs*, *Vector Addition*) to complex operations like *Transformer Attention*. To accelerate such a model, the accelerator must support some ONNX operators. The integration can be accomplished using `ONNX Runtime `_ by registering a new Execution Provider for the device. This provider informs ONNX Runtime which operations are supported and converts them into device instructions using APIs such as NVIDIA CUDA, AMD ROCm, or the kernel driver.
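Appendix: example buffer sketch
===============================

The following is a minimal, non-authoritative sketch of the buffer-related hints above (``anon_inode_getfd``, the ``mmap`` callback, and the ``fault`` callback). ``struct acceldev_buffer`` and all helper names are hypothetical placeholders, not part of the assignment interface; adapt them to your own driver design::

    #include <linux/anon_inodes.h>
    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/module.h>

    /* Hypothetical per-buffer bookkeeping kept by the driver. */
    struct acceldev_buffer {
        size_t num_pages;   /* buffer size in pages */
        void **page_kva;    /* kernel virtual address of each buffer page */
    };

    static vm_fault_t acceldev_buffer_fault(struct vm_fault *vmf)
    {
        struct acceldev_buffer *buf = vmf->vma->vm_private_data;
        struct page *page;

        /* Reject faults outside the buffer. */
        if (vmf->pgoff >= buf->num_pages)
            return VM_FAULT_SIGBUS;

        /* Translate the page's kernel address, pin the page, hand it back. */
        page = virt_to_page(buf->page_kva[vmf->pgoff]);
        get_page(page);
        vmf->page = page;
        return 0;
    }

    static const struct vm_operations_struct acceldev_buffer_vm_ops = {
        .fault = acceldev_buffer_fault,
    };

    static int acceldev_buffer_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        /* The buffer was stored in ->private_data when its file was created. */
        vma->vm_private_data = filp->private_data;
        vma->vm_ops = &acceldev_buffer_vm_ops;
        return 0;
    }

    static const struct file_operations acceldev_buffer_fops = {
        .owner = THIS_MODULE,
        .mmap  = acceldev_buffer_mmap,
        /* .release: wait for pending runs, unbind the buffer, free DMA memory */
    };

    /* Called from the CREATE_BUFFER ioctl handler: wrap the buffer in an
     * anonymous-inode file and return its descriptor to user space. */
    static int acceldev_buffer_install_fd(struct acceldev_buffer *buf)
    {
        return anon_inode_getfd("[acceldev_buffer]", &acceldev_buffer_fops,
                                buf, O_RDWR | O_CLOEXEC);
    }

A file descriptor received from user space (e.g. in ``ACCELDEV_IOCTL_RUN``) can later be recognised as one of the driver's buffers by looking it up with ``fdget`` and comparing the file's ``f_op`` pointer with ``&acceldev_buffer_fops``.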