SSE Instructions (Streaming SIMD Extensions)

Introduction

SSE: a SIMD extension introduced in the Intel Pentium III and AMD Athlon XP processors, also known as Katmai New Instructions (KNI). Supported from Windows 98 and by Linux from kernel 2.2.

Based on eight new 128-bit registers named xmm0-xmm7, each divided into four 32-bit single-precision floating-point elements (64-bit mode adds xmm8-xmm15). There is also an additional control register, mxcsr.

Further extensions were added later: SSE2, SSE3, etc.

Instructions

Many new instructions operate on XMM registers (some also on MMX and general-purpose registers). The selection presented below is far from complete.

Transfers (load/store)

Moving full 128-bit contents:

Moving 64-bit values:

Moving 32-bit values:

Shuffling

Shuffling changes the order of elements in a single XMM register or mixes values from two such registers. The shufps instruction takes two XMM registers (destination and source) and an 8-bit mask. The first two elements of the destination register are overwritten with any two elements of the destination itself. The third and fourth elements are overwritten with any two elements from the source register. Each element is selected by a pair of mask bits, read as a number in the range 0-3: bits 1-0 select the first element, bits 3-2 the second, bits 5-4 the third and bits 7-6 the fourth. Both operands may be the same register.

Examples:

  shufps xmm0,xmm0,0x1b  ;reverse the order of elements (0x1B=00 01 10 11)

  shufps xmm0,xmm0,0xaa  ;four copies of third element (0xAA=10 10 10 10)
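
The way the mask is decoded can be sketched as a scalar reference model in C. This is only an illustration of the selection rule, not the instruction itself; the type vec4 and the function shufps_model are hypothetical names introduced here, and element 0 is assumed to be the lowest 32 bits of the register.

```c
#include <stdint.h>

/* Hypothetical stand-in for one XMM register holding four floats;
   e[0] is the lowest element. */
typedef struct { float e[4]; } vec4;

/* Scalar model of "shufps dst,src,imm8": returns the new contents of dst. */
vec4 shufps_model(vec4 dst, vec4 src, uint8_t imm8)
{
    vec4 r;
    r.e[0] = dst.e[imm8 & 3];          /* bits 1-0: taken from the destination */
    r.e[1] = dst.e[(imm8 >> 2) & 3];   /* bits 3-2: taken from the destination */
    r.e[2] = src.e[(imm8 >> 4) & 3];   /* bits 5-4: taken from the source */
    r.e[3] = src.e[(imm8 >> 6) & 3];   /* bits 7-6: taken from the source */
    return r;
}
```

With v = (1, 2, 3, 4), shufps_model(v, v, 0x1b) gives (4, 3, 2, 1) and shufps_model(v, v, 0xaa) gives (3, 3, 3, 3), which matches the two examples above.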

The shufps and shufpd instructions operate on packed floating-point numbers (shufpd, for double precision, comes from SSE2). For integer values there are the pshufw (SSE, operating on MMX registers), pshufd (SSE2) and pshufb (SSSE3) instructions.

Example: computing the cross product to find a vector perpendicular to two given vectors (32-bit version for now).

;; float *cross_product (float V1[4], float V2[4], float W[4])

;; Find the cross product of two constant vectors and return it.

;; W.x = V1.y * V2.z - V1.z * V2.y
;; W.y = V1.z * V2.x - V1.x * V2.z
;; W.z = V1.x * V2.y - V1.y * V2.x

        global cross_product
        section .text

cross_product:
        push ebp
        mov ebp,esp
        mov eax,[ebp+8]          ;Load argument addresses into registers
        mov ecx,[ebp+12]         ;ecx is caller-saved in cdecl (ebx is not)

        movups xmm0,[eax]        ;If aligned then use movaps
        movups xmm1,[ecx]

        movaps xmm2,xmm0         ;Copies
        movaps xmm3,xmm1

        shufps xmm0,xmm0,0xc9    ;V1 -> (y, z, x, w)
        shufps xmm1,xmm1,0xd2    ;V2 -> (z, x, y, w)
        mulps  xmm0,xmm1         ;(V1.y*V2.z, V1.z*V2.x, V1.x*V2.y, w*w)

        shufps xmm2,xmm2,0xd2    ;V1 -> (z, x, y, w)
        shufps xmm3,xmm3,0xc9    ;V2 -> (y, z, x, w)
        mulps  xmm2,xmm3         ;(V1.z*V2.y, V1.x*V2.z, V1.y*V2.x, w*w)
              
        subps  xmm0,xmm2

        mov eax,[ebp+16]
        movups [eax],xmm0        ;Result
        pop ebp
        ret
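
To check the output of the routine, a scalar reference written directly from the three formulas in its header comment can be useful. The sketch below is in C; the name cross_product_ref is hypothetical, and setting the unused fourth element to 0 is an assumption of this sketch.

```c
/* Scalar reference for the cross product W = V1 x V2.
   Index 0 = x, 1 = y, 2 = z; element 3 is padding and is set to 0 here. */
float *cross_product_ref(const float V1[4], const float V2[4], float W[4])
{
    W[0] = V1[1] * V2[2] - V1[2] * V2[1];   /* W.x */
    W[1] = V1[2] * V2[0] - V1[0] * V2[2];   /* W.y */
    W[2] = V1[0] * V2[1] - V1[1] * V2[0];   /* W.z */
    W[3] = 0.0f;
    return W;
}
```

For V1 = (1, 0, 0) and V2 = (0, 1, 0) the reference yields (0, 0, 1), so any ordering mistake in the shuffle masks of a SIMD version shows up immediately as swapped components.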

Floating-point arithmetic

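Basic arithmetic operations have two variants, distinguished by suffix: ss (scalar single) operates only on the lowest element and leaves the other three elements of the destination unchanged; ps (packed single) operates on all four elements in parallel.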

Addition: addss, addps.
Subtraction: subss, subps.
Multiplication: mulss, mulps.
Division: divss, divps.

Reciprocal (1/x): rcpss, rcpps. These return a fast approximation (about 12 bits of precision). Two-operand instructions (destination, source), e.g.

   rcpps xmm1,xmm2

Square root: sqrtss, sqrtps. Two-operand instructions (destination, source), e.g.

   sqrtss xmm1,xmm2

Reciprocal of square root: rsqrtss, rsqrtps (a fast approximation, like rcpss and rcpps).

Maximum: maxss, maxps.
Minimum: minss, minps.
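
The ss forms operate only on the lowest element and the ps forms on all four. A minimal scalar sketch in C, using addition as the representative operation, may make the difference concrete. The names vec4, addps_model and addss_model are hypothetical and serve only as illustration.

```c
/* Hypothetical stand-in for one XMM register; e[0] is the lowest element. */
typedef struct { float e[4]; } vec4;

/* Model of "addps dst,src": all four elements are added. */
vec4 addps_model(vec4 dst, vec4 src)
{
    for (int i = 0; i < 4; i++)
        dst.e[i] += src.e[i];
    return dst;
}

/* Model of "addss dst,src": only the lowest element is added;
   elements 1-3 of the destination are left unchanged. */
vec4 addss_model(vec4 dst, vec4 src)
{
    dst.e[0] += src.e[0];
    return dst;
}
```

With dst = (1, 2, 3, 4) and src = (10, 20, 30, 40), the packed model returns (11, 22, 33, 44) and the scalar model returns (11, 2, 3, 4).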

Example: adding 4-element vectors

;; float *vector_add (float V1[4], float V2[4], float W[4])

        global vector_add
        section .text

vector_add:
        push ebp
        mov ebp,esp
        mov eax,[ebp+8]          ;Load argument addresses into registers
        mov ecx,[ebp+12]         ;ecx is caller-saved in cdecl (ebx is not)

        movups xmm0,[eax]        ;Fetch arguments
        movups xmm1,[ecx]

        addps xmm0,xmm1

        mov eax,[ebp+16]
        movups [eax],xmm0        ;Result
        pop ebp
        ret

The instruction

        movmskps rax,xmm1

puts the sign bits of the four 32-bit floating-point numbers into the given general-purpose register (in the 4 lowest bits; the other bits are cleared).
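
The effect of movmskps can be modelled in portable C as follows. This is a sketch that assumes float is a 32-bit IEEE 754 value; the function name movmskps_model is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of movmskps: bit i of the result is the sign bit of v[i];
   all higher bits of the result are zero. */
unsigned movmskps_model(const float v[4])
{
    unsigned mask = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t bits;
        memcpy(&bits, &v[i], sizeof bits);   /* read the raw 32-bit encoding */
        mask |= (unsigned)(bits >> 31) << i; /* sign bit of element i -> bit i */
    }
    return mask;
}
```

For (-1, 2, -3, 4) the model returns 0x5. Because only the sign bit is inspected, negative zero also contributes a 1 bit. A typical use is to run a packed comparison first and then branch on the resulting 4-bit mask.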

Integer arithmetic

Arithmetic mean (average):

Maxima and minima

Exotica:

Logical operations

Comparisons

Conversions

Single:

Parallel (packed):

With saturation:

State

Saving:

The 32-bit control register mxcsr contains flags that report the results of computations and flags that control the execution of SSE instructions. Bits 0-15 are defined as follows:

Name  Bit number  Description
FZ    15          Flush To Zero
R+    14          Round Positive
R-    13          Round Negative
PM    12          Precision Mask
UM    11          Underflow Mask
OM    10          Overflow Mask
ZM    9           Divide By Zero Mask
DM    8           Denormal Mask
IM    7           Invalid Operation Mask
DAZ   6           Denormals Are Zero
PE    5           Precision Flag
UE    4           Underflow Flag
OE    3           Overflow Flag
ZE    2           Divide By Zero Flag
DE    1           Denormal Flag
IE    0           Invalid Operation Flag

Setting the FZ flag makes an operation return zero when its result underflows. This speeds up computation, but can reduce precision.

Flags R+ and R- control the rounding direction of the lowest bit. If only R+ is set, rounding is toward plus infinity; if only R- is set, toward minus infinity. If both are set (sometimes denoted RZ: Round To Zero), rounding is toward zero. Usually both are cleared (RN: Round To Nearest), and the result is rounded to the nearest representable value.
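
Decoding the two rounding bits from a saved mxcsr value can be sketched in C as below. The enum and function names are hypothetical; the bit positions follow the table above (R- is bit 13, R+ is bit 14).

```c
#include <stdint.h>

/* The enumerator values equal the 2-bit field made of R+ (high) and R- (low). */
enum rounding {
    ROUND_NEAREST = 0,   /* both bits clear (RN) */
    ROUND_DOWN    = 1,   /* only R- set: toward minus infinity */
    ROUND_UP      = 2,   /* only R+ set: toward plus infinity */
    ROUND_ZERO    = 3    /* both bits set (RZ) */
};

/* Extract bits 14-13 of an mxcsr value. */
enum rounding mxcsr_rounding(uint32_t mxcsr)
{
    return (enum rounding)((mxcsr >> 13) & 3u);
}
```

The power-on default value of mxcsr is 0x1F80 (all exceptions masked, round to nearest), so mxcsr_rounding(0x1F80) yields ROUND_NEAREST, while 0x7F80 yields ROUND_ZERO.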

Flags PM, UM, OM, ZM, DM and IM mask the corresponding exceptions.

The DAZ flag causes denormal operands (very small numbers that cannot be normalized) to be treated as zero. This flag exists only in some processors; before setting it, make sure the processor supports it.

Flags PE, UE, OE, ZE, DE and IE are set when the corresponding exception occurs. They are ``sticky'', i.e. once set they stay set and must be cleared explicitly. This allows checking for errors only once, at the end of an instruction sequence, but information about multiple occurrences of the same exception is lost.
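
The check-once pattern can be sketched in C on a plain 32-bit copy of the register. In real code the value would be read with the stmxcsr instruction and written back with ldmxcsr; the helper names below are hypothetical. The sketch relies on the layout in the table above, where the mask bit of each exception sits 7 positions above its flag bit.

```c
#include <stdint.h>

#define MXCSR_FLAGS 0x3Fu   /* bits 0-5: IE, DE, ZE, OE, UE, PE */

/* Sticky exception flags currently set in an mxcsr value. */
uint32_t mxcsr_pending(uint32_t mxcsr)
{
    return mxcsr & MXCSR_FLAGS;
}

/* The same value with all six sticky flags cleared, e.g. before starting
   a new instruction sequence. */
uint32_t mxcsr_clear_flags(uint32_t mxcsr)
{
    return mxcsr & ~MXCSR_FLAGS;
}

/* 1 if the exception whose flag is bit flag_bit (0-5) is masked. */
int mxcsr_is_masked(uint32_t mxcsr, int flag_bit)
{
    return (int)((mxcsr >> (flag_bit + 7)) & 1u);
}
```

A typical sequence would clear the flags, run the computation, and then test mxcsr_pending once. A nonzero result says that some exception occurred, but not how many times.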

Controlling the cache memory

Specific instructions