Kernel preemption, power consumption

On uniprocessor systems, spinlocks are defined as empty operations because critical sections cannot be entered by several CPUs at the same time. However, this does not hold if kernel preemption is enabled. If the kernel is interrupted in a critical region and another process then enters the same region, the effect is exactly the same as if two processors were executing the region on an SMP system. This is prevented by a simple trick: kernel preemption is disabled while the kernel is in a critical region protected by a spinlock. When a uniprocessor kernel is compiled with kernel preemption enabled, spin_lock is (basically) equivalent to preempt_disable and spin_unlock to preempt_enable.

Without kernel preemption, the scheduler is invoked only before returning to user mode after system calls or at certain designated points in the kernel. This ensures that the kernel, unlike user processes, cannot be preempted unless it explicitly wants to be. The result can be poor system latency, which users experience as "sluggish" response.

These problems can be resolved by compiling the kernel with support for kernel preemption. This allows not only userspace applications but also the kernel itself to be preempted when a higher-priority process needs to run.

The kernel cannot be interrupted at arbitrary points, however. Fortunately, most of the points where it must not be interrupted have already been identified by the SMP implementation, and this information can be reused to implement kernel preemption.

If the kernel can be preempted, even uniprocessor systems behave like SMP systems. Suppose the kernel is working inside a critical region when it is preempted. The next task also operates in kernel mode and, unfortunately, also wants to enter the same critical region. This is effectively equivalent to two processors working in the critical region at the same time and must be prevented. Kernel preemption must therefore be disabled whenever the kernel is inside a critical region.

How does the kernel keep track of whether it can be preempted? There is a per-CPU preemption counter:


DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;

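The value of this counter determines whether the kernel is currently at a position where it may be preempted. If preempt_count is zero, the kernel can be preempted, otherwise not. The value must not be manipulated directly, but only through the auxiliary functions dec_preempt_count and inc_preempt_count (named preempt_count_dec and preempt_count_inc in more recent kernels). Because it is a counter rather than a flag, regions that disable preemption can be nested.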
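The counting behaviour can be illustrated with a small user-space model. This is only a sketch, not kernel code: every model_* name is hypothetical, a plain global variable stands in for the per-CPU counter, and the rescheduling check that the real preempt_enable performs when the counter drops to zero is left out. It also shows the uniprocessor spinlock mapping mentioned at the beginning of this section.

```c
/* User-space model of the preemption counter; not kernel code. */
#include <assert.h>
#include <stdbool.h>

/* Stands in for the per-CPU counter of a single CPU (assumption). */
int model_preempt_count = 0;

/* Entering a region that must not be preempted: increase the nesting depth. */
void model_preempt_disable(void)
{
    model_preempt_count++;
}

/* Leaving such a region: decrease the nesting depth again. */
void model_preempt_enable(void)
{
    assert(model_preempt_count > 0);  /* an unbalanced enable is a bug */
    model_preempt_count--;
}

/* Preemption is allowed only when no such region is active. */
bool model_preemptible(void)
{
    return model_preempt_count == 0;
}

/* Uniprocessor spinlock in this model: the lock itself carries no state. */
typedef struct { char unused; } model_spinlock_t;

void model_spin_lock(model_spinlock_t *lock)
{
    (void)lock;                       /* nothing to spin on with one CPU */
    model_preempt_disable();
}

void model_spin_unlock(model_spinlock_t *lock)
{
    (void)lock;
    model_preempt_enable();
}
```

Taking two model locks in a row raises the counter to 2. Releasing the inner one leaves it at 1, so the model still reports "not preemptible" until the outer lock is released as well.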

The central function for the preemption mechanism is preempt_schedule. The simple desire that the kernel be preempted as indicated by TIF_NEED_RESCHED does not yet guarantee that this is possible - recall that the kernel could currently still be inside a critical region, and must not be disturbed. This is checked by preempt_schedule.


asmlinkage __visible void __sched notrace preempt_schedule(void)
{
	/*
	 * If there is a non-zero preempt_count or interrupts are disabled,
	 * we do not want to preempt the current task. Just return..
	 */
	if (likely(!preemptible()))
		return;

	preempt_schedule_common();
}

#define preemptible()	(preempt_count() == 0 && !irqs_disabled())

Preemption can also be triggered after a hardware IRQ has been serviced. If the processor returns to kernel mode after handling the IRQ (return to user mode is not affected), the architecture-specific assembler routine checks whether the value of the preemption counter is 0, that is, whether preemption is allowed, and whether the reschedule flag is set, exactly as in preempt_schedule. If both conditions are satisfied, the scheduler is invoked, this time via preempt_schedule_irq to indicate that the preemption request originated from IRQ context. The essential difference between this function and preempt_schedule is that preempt_schedule_irq is called with IRQs disabled to prevent recursive calls for simultaneous IRQs.
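The two checks described in this section can be written down side by side in a user-space model. This is a rough sketch only: struct model_cpu_state and both sketch_* functions are hypothetical names, and the real check on IRQ return is made in architecture-specific entry code, not in a C function like this.

```c
/* User-space model of the two preemption checks; not kernel code. */
#include <assert.h>
#include <stdbool.h>

struct model_cpu_state {
    int  preempt_count;   /* nesting depth of non-preemptible regions */
    bool need_resched;    /* models the TIF_NEED_RESCHED flag */
    bool irqs_disabled;   /* models the interrupt state of the CPU */
};

/* Condition tested by preempt_schedule: mirrors the preemptible() macro. */
bool sketch_preemptible(const struct model_cpu_state *s)
{
    assert(s->preempt_count >= 0);
    return s->preempt_count == 0 && !s->irqs_disabled;
}

/* Condition tested when an IRQ handler returns to kernel mode:
 * preemption must be allowed and a reschedule must have been requested. */
bool sketch_irq_return_preempts(const struct model_cpu_state *s)
{
    assert(s->preempt_count >= 0);
    return s->preempt_count == 0 && s->need_resched;
}
```

In this model, a state with preempt_count equal to 1 never leads to preemption on IRQ return, even if need_resched is set. The request stays pending until the counter has dropped back to zero.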

Power consumption - the answer given by dr Marcin Peczarski

Question
Does the power consumption of an Intel processor depend on which machine instruction it executes?

Answer
To a first approximation, power consumption is proportional to the clock frequency and to the square of the supply voltage, and the proportionality factor depends on the technology and the number of elements (transistors). On closer inspection, however, Intel uses certain tricks to reduce power consumption, for example switching off parts of the processor that are not in use. As a result, a reduction in power consumption can be observed while certain instructions are executed (or rather long sequences of certain instructions, because the effect is not noticeable for a single instruction).
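The first-order relation from the answer can be written as a formula. The symbol k for the proportionality factor and the numbers in the example are illustrative assumptions, not measurements of any particular processor.

```latex
P \approx k \, f \, V^{2}
\qquad\Longrightarrow\qquad
\frac{P'}{P} = \frac{f'}{f} \left( \frac{V'}{V} \right)^{2}

% Example: clock lowered by 20 %, supply voltage lowered by 10 %
\frac{P'}{P} = 0.8 \cdot 0.9^{2} = 0.8 \cdot 0.81 = 0.648
```

Under these assumptions the power drops to about 65 % of its original value. Because the voltage enters squared, lowering it together with the frequency saves more than lowering the frequency alone.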

Question
When Linux is running the idle process, can the power consumption be lower during that time?

Answer
Yes. This results from several overlapping effects: the entire code of this process probably sits in the L1 cache, so there are no accesses to other memories; idle presumably performs no complicated calculations, so some parts of the arithmetic units are inactive; and the graphics card probably has less work to do as well, which yields significant savings.

Question
What exactly happens when a laptop goes to sleep? How much of it is done by hardware and how much by software?

Answer
The basic mechanism is to switch off the clock of as many components as possible (processor, memory). The hardware provides some "interface" for switching them off and an interrupt that wakes them up. The software must ensure that the system state is saved. While freezing the system state in RAM requires little on the software side in principle (it could be done purely in hardware), saving the state to disk does require software intervention. The matter would be simple if it were only about freezing the state of all processes: it would be enough to preserve the contents of all memory cells and processor registers. Unfortunately, in my experience the whole business is complicated by the peripherals: waking up USB or a network card is a highly nontrivial problem that has not been fully solved.
