SSE Instructions (Streaming SIMD Extensions)

Introduction

SSE: a SIMD extension introduced in the Intel Pentium III and AMD Athlon XP processors, also known as Katmai New Instructions (KNI). Supported by operating systems since Windows 98 and Linux kernel 2.2.

SSE is based on eight new 128-bit registers named xmm0-xmm7 (sixteen, xmm0-xmm15, in 64-bit mode), each divided into four 32-bit single-precision floating-point elements. There is also an additional control register, mxcsr.

Later more extensions were added, called SSE2, SSE3 etc.

Instructions

Many new instructions operate on XMM registers (some also on MMX and general-purpose registers). The selection presented below is far from complete.

Transfers (load/store)

Moving full 128-bit contents:

movaps - the address must be aligned to a 16-byte boundary;
movups - no need for alignment.
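
The alignment rule can be illustrated from C, assuming a compiler that provides <xmmintrin.h>. There _mm_load_ps and _mm_store_ps typically compile to movaps, and _mm_loadu_ps and _mm_storeu_ps to movups. The helper name copy4 is made up for this sketch.

```c
#include <stdint.h>
#include <xmmintrin.h>

/* Copies four floats, choosing the aligned or the unaligned
   move depending on the two addresses (hypothetical helper). */
void copy4(float *dst, const float *src)
{
    int aligned = (((uintptr_t)src | (uintptr_t)dst) & 15u) == 0;

    if (aligned)
        _mm_store_ps(dst, _mm_load_ps(src));     /* movaps: needs 16-byte alignment */
    else
        _mm_storeu_ps(dst, _mm_loadu_ps(src));   /* movups: any address */
}
```

Using the aligned form on an unaligned address raises a general protection fault, which is why the check comes first.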

Moving 64-bit values:

movhps - to/from the higher part of xmm register.
movlps - to/from the lower part of xmm.
movhlps - from higher part of source register to lower part of destination register.
movlhps - from lower part of source register to higher part of destination register.

Moving 32-bit values:

movss - moves 32-bit long floating-point number between XMM register and memory or other XMM registers.

Shuffling

Shuffling changes the order of elements within a single XMM register or mixes values from two such registers. The arguments of the shufps instruction are two XMM registers (destination, then source) and an 8-bit mask. The first two elements of the destination register are overwritten with any two elements of the destination register itself; the third and fourth elements are overwritten with any two elements from the source register. Each selection is controlled by a pair of bits from the mask, interpreted as a number in the range 0-3: bits 0-1 select the first element, bits 2-3 the second, bits 4-5 the third and bits 6-7 the fourth. Both arguments may be the same register.

Examples:

  shufps xmm0,xmm0,0x1b  ;reverse the order of elements (0x1B=00 01 10 11)

  shufps xmm0,xmm0,0xaa  ;four copies of third element (0xAA=10 10 10 10)
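
The mask decoding can be checked from C. This is a minimal sketch, assuming <xmmintrin.h> is available; _mm_shuffle_ps is the intrinsic for shufps, while the function names reverse4, broadcast_third and shufps_model are invented here. The last function is a plain scalar model of how the four bit pairs are interpreted.

```c
#include <xmmintrin.h>

/* Same as shufps xmm0,xmm0,0x1b: reverses the four elements. */
void reverse4(const float in[4], float out[4])
{
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, 0x1b));
}

/* Same as shufps xmm0,xmm0,0xaa: four copies of the third element. */
void broadcast_third(const float in[4], float out[4])
{
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, 0xaa));
}

/* Scalar model of shufps dst,src,mask; out must not overlap dst or src. */
void shufps_model(const float dst[4], const float src[4],
                  unsigned mask, float out[4])
{
    out[0] = dst[mask & 3];          /* bits 0-1 pick from the destination */
    out[1] = dst[(mask >> 2) & 3];   /* bits 2-3 pick from the destination */
    out[2] = src[(mask >> 4) & 3];   /* bits 4-5 pick from the source */
    out[3] = src[(mask >> 6) & 3];   /* bits 6-7 pick from the source */
}
```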

shufps and shufpd (SSE2) operate on packed floating-point numbers. For integer values there are pshufw (SSE, on MMX registers), pshufd (SSE2, on XMM registers) and pshufb (SSSE3).

pshufw - moves 16-bit words from the source register (or memory) to the target register, shuffling them according to the mask.

Example: computing the cross product to find a vector perpendicular to two given vectors (32-bit version for now).

;; float *cross_product (float V1[4], float V2[4], float W[4])

;; Find the cross product of two constant vectors and return it.

;; W.x = V1.y * V2.z - V1.z * V2.y
;; W.y = V1.z * V2.x - V1.x * V2.z
;; W.z = V1.x * V2.y - V1.y * V2.x

        global cross_product
        section .text

cross_product:
        push ebp
        mov ebp,esp
        mov eax,[ebp+8]          ;Put argument addresses in registers
        mov ecx,[ebp+12]         ;ecx, because ebx must be preserved (cdecl)

        movups xmm0,[eax]        ;If aligned then use movaps
        movups xmm1,[ecx]

        movaps xmm2,xmm0         ;Copies
        movaps xmm3,xmm1

        shufps xmm0,xmm0,0xc9    ;V1 -> (y,z,x,w)
        shufps xmm1,xmm1,0xd2    ;V2 -> (z,x,y,w)
        mulps  xmm0,xmm1         ;(V1.y*V2.z, V1.z*V2.x, V1.x*V2.y, ...)

        shufps xmm2,xmm2,0xd2    ;V1 -> (z,x,y,w)
        shufps xmm3,xmm3,0xc9    ;V2 -> (y,z,x,w)
        mulps  xmm2,xmm3         ;(V1.z*V2.y, V1.x*V2.z, V1.y*V2.x, ...)

        subps  xmm0,xmm2         ;(W.x, W.y, W.z, 0)

        mov eax,[ebp+16]
        movups [eax],xmm0        ;Result
        pop ebp
        ret
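
For comparison, the cross product formulas can be sketched with C intrinsics, assuming <xmmintrin.h>. This version uses one possible choice of masks: 0xc9 rotates a vector to (y,z,x,w) and 0xd2 to (z,x,y,w), so that the difference of the two products lands directly in (W.x, W.y, W.z) order. The name cross_product_c is hypothetical.

```c
#include <xmmintrin.h>

/* W = V1 x V2 for vectors stored as (x, y, z, w); returns w.
   The fourth element of the result is 0 for finite inputs. */
float *cross_product_c(const float v1[4], const float v2[4], float w[4])
{
    __m128 a = _mm_loadu_ps(v1);
    __m128 b = _mm_loadu_ps(v2);

    __m128 a_yzx = _mm_shuffle_ps(a, a, 0xc9);   /* (a.y, a.z, a.x, a.w) */
    __m128 a_zxy = _mm_shuffle_ps(a, a, 0xd2);   /* (a.z, a.x, a.y, a.w) */
    __m128 b_yzx = _mm_shuffle_ps(b, b, 0xc9);
    __m128 b_zxy = _mm_shuffle_ps(b, b, 0xd2);

    /* (a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x, 0) */
    __m128 r = _mm_sub_ps(_mm_mul_ps(a_yzx, b_zxy),
                          _mm_mul_ps(a_zxy, b_yzx));
    _mm_storeu_ps(w, r);
    return w;
}
```

A quick sanity check is that the unit vectors along x and y give the unit vector along z.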

Floating-point arithmetic

Basic arithmetic operations have two variants, recognized from suffixes:

ss: the instruction operates only on the lowest element, the others do not change;
ps: the instruction operates on all elements in parallel.

Addition: addss, addps.
Subtraction: subss, subps.
Multiplication: mulss, mulps.
Division: divss, divps.

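Reciprocal (1/x): rcpss, rcpps. These take two operands (destination, source) and return a fast approximation (relative error at most 1.5*2^-12), e.g.

   rcpps xmm1,xmm2

Square root: sqrtss, sqrtps. These also take two operands (destination, source), e.g.

   sqrtss xmm1,xmm2

Reciprocal of square root: rsqrtss, rsqrtps (approximate, like rcpss and rcpps).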

Maximum: maxss, maxps.
Minimum: minss, minps.
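
The difference between the ss and ps variants can be seen in a small C sketch, assuming <xmmintrin.h>; _mm_add_ps and _mm_add_ss are the intrinsics for addps and addss, and the wrapper names add_ps and add_ss are chosen only for this illustration.

```c
#include <xmmintrin.h>

/* addps: all four elements are added in parallel. */
void add_ps(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* addss: only element 0 is added; elements 1-3 are taken from a. */
void add_ss(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

For a = (1, 2, 3, 4) and b = (10, 20, 30, 40) the first function gives (11, 22, 33, 44) and the second (11, 2, 3, 4).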

Example: adding 4-element vectors

;; float *vector_add (float V1[4], float V2[4], float W[4])

        global vector_add
        section .text

vector_add:
        push ebp
        mov ebp,esp
        mov eax,[ebp+8]          ;Move argument addresses to registers
        mov ecx,[ebp+12]         ;ecx, because ebx must be preserved (cdecl)

        movups xmm0,[eax]        ;Fetch arguments
        movups xmm1,[ecx]

        addps xmm0,xmm1

        mov eax,[ebp+16]
        movups [eax],xmm0        ;Result
        pop ebp
        ret

The instruction

        movmskps rax,xmm1

puts the sign bits of four 32-bit floating-point numbers in the given general-purpose register (in the 4 lowest bits; the other bits are cleared).
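
A minimal C sketch of the same operation, assuming <xmmintrin.h>, where _mm_movemask_ps is the intrinsic for movmskps. The function name sign_mask is made up here.

```c
#include <xmmintrin.h>

/* Returns a 4-bit mask: bit i is set when v[i] has its sign bit set. */
int sign_mask(const float v[4])
{
    return _mm_movemask_ps(_mm_loadu_ps(v));
}
```

For (1, -2, 3, -4) bits 1 and 3 are set, so the result is 10. Note that -0.0 also has its sign bit set.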

Integer arithmetic

Arithmetic mean (average) of unsigned integers, rounded up, i.e. (a+b+1)/2:

pavgb - for each pair of bytes;
pavgw - for each pair of 16-bit words.

Maxima and minima

pmaxsw - Returns the maxima for each pair of 16-bit words considered as signed integers.
pmaxub - Returns the maxima for each pair of bytes considered as unsigned integers.
pminsw - Returns the minima for each pair of 16-bit words (signed integers).
pminub - Returns the minima for each pair of bytes (unsigned integers).
The remaining size and signedness combinations (pmaxsb, pmaxsd, pmaxuw, pmaxud and the corresponding pmin variants) were added later, in SSE4.1.

Exotica:

The instruction psadbw returns the sum of absolute differences of all pairs of bytes (unsigned numbers). In the 64-bit version (MMX registers) the sum is placed in the two lowest bytes of the destination register and the other bytes are cleared. The 128-bit version (XMM registers, SSE2) returns two sums, one for the lower and one for the upper 64-bit half, each in the two lowest bytes of its half.
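
The 128-bit form can be tried from C. This is a sketch assuming an SSE2-capable compiler with <emmintrin.h>; _mm_sad_epu8 is the intrinsic for psadbw on XMM registers, and sad16 is a hypothetical helper that adds the two partial sums.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Sum of |a[i] - b[i]| over 16 unsigned bytes. */
unsigned sad16(const uint8_t a[16], const uint8_t b[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i s  = _mm_sad_epu8(va, vb);                /* psadbw: two partial sums */

    unsigned lo = (unsigned)_mm_cvtsi128_si32(s);     /* sum for bytes 0-7 */
    unsigned hi = (unsigned)_mm_extract_epi16(s, 4);  /* sum for bytes 8-15 */
    return lo + hi;
}
```

This kind of sum is typically used as a cheap distance measure between two blocks of pixels.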

Logic

andnps - bitwise conjunction with the first (destination) argument negated: dest = (NOT dest) AND src
andps - bitwise conjunction (AND)
orps - bitwise disjunction (OR)
xorps - bitwise exclusive disjunction (XOR)
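
A common use of these instructions is manipulating the sign bit. The sketch below, assuming <xmmintrin.h>, computes absolute values with andnps; _mm_andnot_ps(a, b) computes (NOT a) AND b, so it is the first operand that gets negated. The name abs4 is invented for the example.

```c
#include <xmmintrin.h>

/* out[i] = |in[i]|: clears bit 31 of every element with andnps. */
void abs4(const float in[4], float out[4])
{
    __m128 sign = _mm_set1_ps(-0.0f);             /* only the sign bit set in each element */
    __m128 v    = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_andnot_ps(sign, v));   /* (NOT sign) AND v */
}
```

Replacing the last operation with _mm_xor_ps(sign, v) would flip the signs instead of clearing them.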

Comparisons

cmpps, cmpss - compare the arguments according to a predicate given as a third, immediate argument and return all zeros (false) or all ones (true) in each compared element of the first argument.
cmpwwps - compares 4 pairs of 32-bit floating-point numbers,
cmpwwss - compares two 32-bit floating-point numbers (the lowest elements).
The letters ww select the method of comparison (in machine code it is encoded as an additional byte, so there is only one instruction opcode):
- eq - equal
- lt - less
- le - less or equal
- neq - not equal
- nlt - not less
- nle - neither less nor equal
- ord - ordered: neither argument is NaN
- unord - unordered: at least one argument is NaN
Results as above.
comiss - Compares the two 32-bit floating-point numbers at the lowest positions in the classical way, through EFLAGS: sets ZF, PF and CF according to the result and clears OF, SF and AF.
ucomiss - Same as comiss, but does not raise an exception for QNaNs.
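
The all-zeros / all-ones results are meant to be combined with the logical instructions. A minimal sketch, assuming <xmmintrin.h>: _mm_cmplt_ps corresponds to cmpltps, and the mask it returns selects between two vectors without branching. The names select_max4 and lt_mask are hypothetical.

```c
#include <xmmintrin.h>

/* out[i] = (a[i] < b[i]) ? b[i] : a[i], without branches. */
void select_max4(const float a[4], const float b[4], float out[4])
{
    __m128 va   = _mm_loadu_ps(a);
    __m128 vb   = _mm_loadu_ps(b);
    __m128 mask = _mm_cmplt_ps(va, vb);                /* all ones where a < b */
    __m128 r    = _mm_or_ps(_mm_and_ps(mask, vb),      /* b where a < b */
                            _mm_andnot_ps(mask, va));  /* a elsewhere */
    _mm_storeu_ps(out, r);
}

/* Bit i of the result is set when a[i] < b[i]. */
int lt_mask(const float a[4], const float b[4])
{
    return _mm_movemask_ps(_mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

In practice maxps does this particular job in one instruction; the and/andn/or pattern is useful when the selected values are not the compared ones.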

Conversions

Single:

cvtsi2ss - Converts a 32-bit integer (from a general-purpose register or memory) to a floating-point number at the lowest position of an XMM register. The other 3 positions do not change.
cvtss2si - Inverse, converts the 32-bit floating-point number at the lowest position to an integer in a general-purpose register, rounding as selected in mxcsr.

Parallel (packed):

cvtpi2ps - Converts two 32-bit integers (from an MMX register or memory) to floating-point numbers at the two lower positions. The two upper positions do not change.
cvtps2pi - Inverse, converts the two lower 32-bit floating-point numbers to integers in an MMX register.

With truncation:

cvttss2si - Converts the 32-bit floating-point number at the lowest position to an integer, always rounding toward zero regardless of mxcsr. If the result does not fit in 32 bits, the value 0x80000000 is returned and the invalid operation exception is signalled.
cvttps2pi - Same for the two lowest 32-bit floating-point numbers.
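
The difference between cvtss2si and cvttss2si is easy to observe from C: the former rounds as configured in mxcsr (to nearest by default), the latter always truncates toward zero. The sketch assumes <xmmintrin.h> and the default rounding mode; the wrapper names are made up.

```c
#include <xmmintrin.h>

/* cvtss2si: rounds according to mxcsr (round to nearest by default). */
int to_int_round(float x)
{
    return _mm_cvtss_si32(_mm_set_ss(x));
}

/* cvttss2si: always truncates toward zero. */
int to_int_truncate(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x));
}
```

For 2.7 the first function returns 3 and the second 2; for -2.7 they return -3 and -2. A C cast such as (int)x has truncating semantics, which is why compilers usually implement it with cvttss2si.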

State

Saving:

fxsave - Saves the state of the SSE and FPU units (so also MMX) in a 512-byte region of memory aligned to a 16-byte boundary.
fxrstor - Restores the saved state of the SSE and FPU units.
stmxcsr - Saves the mxcsr control register.
ldmxcsr - Loads (saved) value to mxcsr control register.

The 32-bit control register mxcsr contains flags with information about computation results and flags controlling the execution of SSE instructions. Bits 0-15 are defined as follows:

Name	Bit number	Description
FZ	15	Flush To Zero
R+	14	Round Positive
R-	13	Round Negative
PM	12	Precision Mask
UM	11	Underflow Mask
OM	10	Overflow Mask
ZM	9	Divide By Zero Mask
DM	8	Denormal Mask
IM	7	Invalid Operation Mask
DAZ	6	Denormals Are Zero
PE	5	Precision Flag
UE	4	Underflow Flag
OE	3	Overflow Flag
ZE	2	Divide By Zero Flag
DE	1	Denormal Flag
IE	0	Invalid Operation Flag

Setting the FZ flag causes zero to be returned in case of underflow. This speeds up the computation, but sometimes decreases the precision.

Flags R+ and R- control the direction of rounding. If only R+ is set, rounding is towards plus infinity; if only R- is set, towards minus infinity. If both are set (sometimes denoted as RZ: Round To Zero), rounding is towards zero. Usually both are cleared (RN: Round To Nearest) and rounding is towards the nearest representable value.
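
The effect of the rounding flags can be sketched in C, assuming <xmmintrin.h>, where _mm_getcsr and _mm_setcsr read and write mxcsr (stmxcsr and ldmxcsr). The constants and the function name convert_with_rc are defined here for illustration only; the bit positions follow the table above.

```c
#include <xmmintrin.h>

#define MXCSR_R_MINUS 0x2000u   /* bit 13 */
#define MXCSR_R_PLUS  0x4000u   /* bit 14 */
#define MXCSR_RC_MASK 0x6000u   /* both rounding bits */

/* Converts x with cvtss2si under the given rounding bits,
   then restores the previous mxcsr value. */
int convert_with_rc(float x, unsigned rc_bits)
{
    unsigned saved = _mm_getcsr();
    volatile float vx = x;       /* volatile keeps the conversion between the two writes */
    volatile int result;

    _mm_setcsr((saved & ~MXCSR_RC_MASK) | (rc_bits & MXCSR_RC_MASK));
    result = _mm_cvtss_si32(_mm_set_ss(vx));
    _mm_setcsr(saved);
    return result;
}
```

With x = 2.5 this should give 2 for no bits set (nearest, ties to even), 2 for MXCSR_R_MINUS, 3 for MXCSR_R_PLUS and 2 for both bits set.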

Flags PM, UM, OM, ZM, DM and IM are used to mask the corresponding exceptions.

Flag DAZ causes denormal source operands (very small numbers which cannot be normalized) to be treated as zero. This flag exists only in some processors; before setting it, make sure that it is supported.

Flags PE, UE, OE, ZE, DE and IE are set after the occurrence of the corresponding exception. They are ``sticky'', i.e. once set they stay set and must be cleared by hand. This permits checking correctness only once, at the end of an instruction sequence, but information about multiple occurrences of the same exception is lost.
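
The sticky behaviour can be observed with a short C sketch, assuming <xmmintrin.h> and the default state in which all exceptions are masked. The constants and the function name division_sets_ze are assumptions of this example; the bit numbers come from the table above.

```c
#include <xmmintrin.h>

#define MXCSR_FLAGS 0x003fu   /* bits 0-5: IE, DE, ZE, OE, UE, PE */
#define MXCSR_ZE    0x0004u   /* bit 2: Divide By Zero Flag */

/* Returns 1 if dividing num by den with divss sets the ZE flag. */
int division_sets_ze(float num, float den)
{
    volatile float n = num, d = den;   /* volatile: force the division at run time */
    volatile float q;
    float a, b;

    _mm_setcsr(_mm_getcsr() & ~MXCSR_FLAGS);   /* clear the sticky flags by hand */
    a = n;
    b = d;
    q = _mm_cvtss_f32(_mm_div_ss(_mm_set_ss(a), _mm_set_ss(b)));
    (void)q;
    return (_mm_getcsr() & MXCSR_ZE) != 0;
}
```

Because the flags are sticky, the clearing step matters: without it a division by zero that happened earlier in the program would still be reported.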

Controlling the cache memory

Specific instructions

sfence - All memory writes initiated before this instruction have to be performed before any writes initiated after this instruction.
Memory write instructions bypassing cache.
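
As an illustration of how sfence is typically paired with cache-bypassing writes, here is a hedged C sketch. It assumes <xmmintrin.h>, where _mm_stream_ps corresponds to the non-temporal store movntps and _mm_sfence to sfence; the function name fill_stream and its preconditions are made up for the example.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Fills n floats with value using non-temporal stores.
   Assumes dst is 16-byte aligned and n is a multiple of 4. */
void fill_stream(float *dst, size_t n, float value)
{
    __m128 v = _mm_set1_ps(value);

    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, v);   /* movntps: write bypassing the cache */

    _mm_sfence();                    /* streamed writes become visible before later writes */
}
```

Such stores make sense for large buffers that will not be read again soon, since they avoid evicting useful data from the cache.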