SSE (Streaming SIMD Extensions) is a SIMD extension introduced in the Intel Pentium III and AMD Athlon XP processors, also known as *Katmai New Instructions* (KNI). It is supported since Windows 98 and Linux kernel 2.2.

It is based on eight new 128-bit registers named `xmm0`-`xmm7`, each divided into four 32-bit single-precision floating-point elements, plus an additional control register, `mxcsr`.

Later more extensions were added, called SSE2, SSE3 etc.

SSE adds many new instructions operating on XMM registers (some also on MMX and general-purpose registers). The selection presented below is far from complete.

Moving full 128-bit contents:

- `movaps`: the memory address must be aligned to a 16-byte boundary;
- `movups`: no alignment required.

Moving 64-bit values:

- `movhps`: between memory and the upper half of an `xmm` register;
- `movlps`: between memory and the lower half of an `xmm` register;
- `movhlps`: from the upper half of the source register to the lower half of the destination register;
- `movlhps`: from the lower half of the source register to the upper half of the destination register.

Moving 32-bit values:

- `movss`: moves a 32-bit floating-point number between an XMM register and memory, or between two XMM registers.

*Shuffling* is used to change the order of elements within a single XMM register or to mix values from two such registers. The `shufps` instruction takes two XMM registers and an 8-bit mask. The first two elements of the destination register are overwritten with any two elements of the destination itself; the third and fourth elements are overwritten with any two elements of the source register. The selection is controlled by four pairs of bits from the mask, each interpreted as an element number in the range 0-3, with the lowest pair of bits selecting the first element. Both arguments may be the same register.

Examples:

    shufps xmm0,xmm0,0x1b   ; reverse the order of elements (0x1B = 00 01 10 11)
    shufps xmm0,xmm0,0xaa   ; four copies of the third element (0xAA = 10 10 10 10)
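The way the mask is decoded may be easier to follow in a scalar model. The following is only a sketch in C, not the instruction itself; the name `shufps_model` is made up for this illustration, and element 0 is assumed to be the lowest 32 bits of the register.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of `shufps dst,src,mask` (hypothetical helper).
   Element 0 is the lowest 32 bits of the register. */
void shufps_model(float dst[4], const float src[4], uint8_t mask)
{
    assert(dst != 0 && src != 0);
    float out[4];
    out[0] = dst[(mask >> 0) & 3];  /* bits 1:0 pick from the destination */
    out[1] = dst[(mask >> 2) & 3];  /* bits 3:2 pick from the destination */
    out[2] = src[(mask >> 4) & 3];  /* bits 5:4 pick from the source      */
    out[3] = src[(mask >> 6) & 3];  /* bits 7:6 pick from the source      */
    for (int i = 0; i < 4; i++)
        dst[i] = out[i];            /* write back only after all reads    */
}
```

Calling `shufps_model(v, v, 0x1b)` on `v = {1, 2, 3, 4}` leaves `{4, 3, 2, 1}` in `v`, which matches the first example above. The temporary array is there so that the model still works when both arguments are the same register.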

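The `shufps` and `shufpd` instructions operate on packed floating-point numbers. For integer values there are the `pshufb`, `pshufw` and `pshufd` instructions. Of these, only `pshufw` (which works on MMX registers) arrived together with SSE; `shufpd` and `pshufd` came with SSE2 and `pshufb` with SSSE3.

- `pshufw`: moves 16-bit words from the source register (or memory) to the destination register, shuffling them according to the mask.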

Example: computing the *cross product* to find a vector perpendicular to two given vectors (32-bit version for now).

    ;; float *cross_product (float V1[4], float V2[4], float W[4])
    ;; Find the cross product of two constant vectors and return it.
    ;;   W.x = V1.y * V2.z - V1.z * V2.y
    ;;   W.y = V1.z * V2.x - V1.x * V2.z
    ;;   W.z = V1.x * V2.y - V1.y * V2.x
    global cross_product
    section .text
    cross_product:
        push    ebp
        mov     ebp,esp
        mov     eax,[ebp+8]        ; Put argument addresses in registers
        mov     edx,[ebp+12]       ; edx is caller-saved, so it need not be preserved
        movups  xmm0,[eax]         ; If aligned then use movaps
        movups  xmm1,[edx]
        movaps  xmm2,xmm0          ; Copies
        movaps  xmm3,xmm1
        shufps  xmm0,xmm0,0xc9     ; V1 -> (y, z, x, w)
        shufps  xmm1,xmm1,0xd2     ; V2 -> (z, x, y, w)
        mulps   xmm0,xmm1          ; (V1.y*V2.z, V1.z*V2.x, V1.x*V2.y, ...)
        shufps  xmm2,xmm2,0xd2     ; V1 -> (z, x, y, w)
        shufps  xmm3,xmm3,0xc9     ; V2 -> (y, z, x, w)
        mulps   xmm2,xmm3          ; (V1.z*V2.y, V1.x*V2.z, V1.y*V2.x, ...)
        subps   xmm0,xmm2          ; (W.x, W.y, W.z, ...)
        mov     eax,[ebp+16]
        movups  [eax],xmm0         ; Result (eax = W is also the return value)
        pop     ebp
        ret
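When testing a routine like this, a plain scalar reference is handy for comparison. The sketch below is such a reference in C; `cross_ref` is a name invented here. It follows the same scheme of two element-wise products and a subtraction, using the rotations (y, z, x) and (z, x, y), which is one choice of shuffles that leaves every component of the result in its own slot.

```c
#include <assert.h>

/* Scalar reference for W = V1 x V2 (hypothetical helper).
   Mirrors the SIMD scheme: two element-wise products of rotated
   vectors, then an element-wise subtraction. The fourth element
   is ignored on input and set to 0 on output. */
void cross_ref(const float v1[4], const float v2[4], float w[4])
{
    assert(v1 != 0 && v2 != 0 && w != 0);
    /* rotated copies of the first three elements */
    float a_yzx[3] = { v1[1], v1[2], v1[0] };
    float b_zxy[3] = { v2[2], v2[0], v2[1] };
    float a_zxy[3] = { v1[2], v1[0], v1[1] };
    float b_yzx[3] = { v2[1], v2[2], v2[0] };
    for (int i = 0; i < 3; i++)
        w[i] = a_yzx[i] * b_zxy[i] - a_zxy[i] * b_yzx[i];
    w[3] = 0.0f;
}
```

For the unit vectors (1, 0, 0) and (0, 1, 0) this gives (0, 0, 1), so a swapped or negated component in the assembly version shows up immediately.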

Basic arithmetic operations come in two variants, distinguished by their suffixes:

- `ss` (scalar single): the instruction operates only on the lowest element, the others do not change;
- `ps` (packed single): the instruction operates on all elements in parallel.
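The difference between the two suffixes can be shown with a pair of scalar models. This is just a sketch in C with made-up names (`addps_model`, `addss_model`), using addition as the example operation.

```c
#include <assert.h>

/* Model of `addps dst,src`: all four elements are added. */
void addps_model(float dst[4], const float src[4])
{
    assert(dst != 0 && src != 0);
    for (int i = 0; i < 4; i++)
        dst[i] += src[i];
}

/* Model of `addss dst,src`: only element 0 is added,
   elements 1..3 of dst keep their old values. */
void addss_model(float dst[4], const float src[4])
{
    assert(dst != 0 && src != 0);
    dst[0] += src[0];
}
```

With `dst = {1, 2, 3, 4}` and `src = {10, 20, 30, 40}`, the `ss` model gives `{11, 2, 3, 4}` while the `ps` model gives `{11, 22, 33, 44}`.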

- Addition: `addss`, `addps`.
- Subtraction: `subss`, `subps`.
- Multiplication: `mulss`, `mulps`.
- Division: `divss`, `divps`.
- *Reciprocal* (1/x): `rcpss`, `rcpps`. The result is a fast approximation. Although the operation is unary, the instructions take two operands (destination, source), e.g. `rcpps xmm1,xmm2`.
- Square root: `sqrtss`, `sqrtps`. Also two-operand instructions, e.g. `sqrtss xmm1,xmm2`.
- Reciprocal of square root (also an approximation): `rsqrtss`, `rsqrtps`.
- Maximum: `maxss`, `maxps`.
- Minimum: `minss`, `minps`.

Example: adding 4-element vectors

    ;; float *vector_add (float V1[4], float V2[4], float W[4])
    global vector_add
    section .text
    vector_add:
        push    ebp
        mov     ebp,esp
        mov     eax,[ebp+8]        ; Move argument addresses to registers
        mov     edx,[ebp+12]       ; edx is caller-saved, so it need not be preserved
        movups  xmm0,[eax]         ; Fetch arguments
        movups  xmm1,[edx]
        addps   xmm0,xmm1
        mov     eax,[ebp+16]
        movups  [eax],xmm0         ; Result
        pop     ebp
        ret

The instruction `movmskps rax,xmm1` puts the sign bits of the four 32-bit floating-point numbers into the given general-purpose register (in its 4 lowest bits; the other bits are cleared).
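What ends up in the general-purpose register can be modelled as follows. This is a sketch in C under the assumption that `float` is the 32-bit IEEE 754 type; `movmskps_model` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of `movmskps` (hypothetical helper): bit i of the
   result is the sign bit of element i, all higher bits are zero. */
unsigned movmskps_model(const float v[4])
{
    assert(v != 0);
    unsigned mask = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t bits;
        memcpy(&bits, &v[i], sizeof bits);   /* read the raw IEEE 754 bits */
        mask |= (bits >> 31) << i;           /* sign bit -> bit i          */
    }
    return mask;
}
```

For `{-1.0f, 2.0f, -0.0f, 3.0f}` the result is 5 (binary 0101). Note that negative zero counts as negative here, because only the sign bit is inspected.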

Arithmetic mean (average) of unsigned integers:

- `pavgb`: for each pair of bytes;
- `pavgw`: for each pair of 16-bit words.
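The averaging instructions add 1 before halving, so an odd sum is rounded up. A scalar sketch of `pavgb` in C (the name `pavgb_model` and the lane-count parameter `n` are assumptions of this example):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of `pavgb` on n byte lanes (8 for MMX, 16 for XMM):
   unsigned average, rounded up when the sum is odd. */
void pavgb_model(uint8_t dst[], const uint8_t src[], int n)
{
    assert(n == 8 || n == 16);
    for (int i = 0; i < n; i++)
        dst[i] = (uint8_t)(((unsigned)dst[i] + src[i] + 1u) >> 1);
}
```

The sum is computed in a wider type, so averaging 255 and 255 gives 255 rather than overflowing; averaging 1 and 2 gives 2.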

Maxima and minima:

- `pmaxsb`: returns the maxima for each pair of bytes treated as 8-bit signed integers. Changing the last letter to `w` or `d` gives the variants for pairs of 16-bit and 32-bit signed integers.
- `pmaxub`: same as above, but for unsigned integers.
- `pminsb`: returns the minima for each pair of bytes (signed integers); other variants as above.
- `pminub`: same for unsigned integers.

Of these, only `pmaxsw`, `pminsw`, `pmaxub` and `pminub` belong to the original SSE; the remaining variants were added in SSE4.1.

Exotica:

- `psadbw`: returns the sum of absolute differences of all pairs of bytes (treated as unsigned numbers). In the 64-bit version (MMX registers) the sum is placed in the two lowest bytes of the destination register and the other bytes are cleared. The 128-bit version (XMM registers) returns two sums, one for the lower and one for the upper 64-bit half.
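The 128-bit form can be sketched in C as below. The function name `psadbw_model` and the two-element output array are choices made for this illustration only; the real instruction writes each sum into the low 16 bits of the corresponding 64-bit half of the destination.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of the 128-bit `psadbw` (hypothetical helper): for each
   64-bit half, sum the absolute differences of the 8 unsigned byte pairs.
   sums[0] belongs to the lower half, sums[1] to the upper half. */
void psadbw_model(const uint8_t a[16], const uint8_t b[16], uint16_t sums[2])
{
    assert(a != 0 && b != 0 && sums != 0);
    for (int half = 0; half < 2; half++) {
        unsigned total = 0;
        for (int i = 0; i < 8; i++)
            total += (unsigned)abs((int)a[8 * half + i] - (int)b[8 * half + i]);
        sums[half] = (uint16_t)total;   /* at most 8 * 255 = 2040 */
    }
}
```

Since each sum is at most 2040, it always fits in 16 bits, which is why the remaining bytes of each half can simply be cleared.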

- `andnps`: bitwise AND with the first (destination) argument negated (AND NOT);
- `andps`: bitwise AND;
- `orps`: bitwise OR;
- `xorps`: bitwise exclusive OR (XOR).

- `cmpps`, `cmpss`: compare the arguments and return all zeros or all ones (in the first argument).
- `cmp`*ww*`ps`: compares 4 pairs of 32-bit floating-point numbers.
- `cmp`*ww*`ss`: compares two 32-bit floating-point numbers (the lowest elements).

The letters *ww* select the method of comparison (in the binary encoding it is given as an additional byte, so there is only one instruction opcode):

- `eq`: equal
- `lt`: less
- `le`: less or equal
- `neq`: not equal
- `nlt`: not less
- `nle`: neither less nor equal
- `ord`: *ordered*, neither argument is a NaN
- `unord`: *unordered*, at least one argument is a NaN
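The all-ones/all-zeros result of a comparison is typically combined with the bitwise instructions to select values without branching. The sketch below models this in C for a per-element maximum; the helper names are invented, and `float` is assumed to be the 32-bit IEEE 754 type. (SSE has `maxps` for this particular case; the point here is only the mask technique.)

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t f2u(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float u2f(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Model of `cmpltps`: all ones where a[i] < b[i], all zeros otherwise
   (also zeros when either value is a NaN). */
void cmpltps_model(const float a[4], const float b[4], uint32_t mask[4])
{
    for (int i = 0; i < 4; i++)
        mask[i] = (a[i] < b[i]) ? 0xFFFFFFFFu : 0u;
}

/* Branchless per-element maximum built from the mask:
   out = (mask AND b) OR ((NOT mask) AND a),
   which corresponds to andps, andnps and orps. */
void select_max(const float a[4], const float b[4], float out[4])
{
    uint32_t mask[4];
    assert(a != 0 && b != 0 && out != 0);
    cmpltps_model(a, b, mask);
    for (int i = 0; i < 4; i++)
        out[i] = u2f((mask[i] & f2u(b[i])) | (~mask[i] & f2u(a[i])));
}
```

For `a = {1, 5, -2, 0}` and `b = {3, 4, -1, 0}` the result is `{3, 5, -1, 0}`.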

- `comiss`: compares the two 32-bit floating-point numbers in the lowest elements in the classical way, i.e. the result goes to the flags register (sets ZF, PF and CF; clears OF, SF and AF).
- `ucomiss`: same as `comiss`, but does not raise an exception for QNaNs.

Single:

- `cvtsi2ss`: converts a 32-bit integer (from a general-purpose register or memory) to a floating-point number at the lowest position of the XMM register. The other 3 positions do not change.
- `cvtss2si`: the inverse; converts the 32-bit floating-point number at the lowest position to an integer in a general-purpose register.

Parallel (packed):

- `cvtpi2ps`: converts two 32-bit integers (from an MMX register or memory) to floating-point numbers at the two lower positions. The two upper positions do not change.
- `cvtps2pi`: the inverse; converts the two lower 32-bit floating-point numbers to integers in an MMX register.

With truncation:

- `cvttss2si`: converts the 32-bit floating-point number at the lowest position to an integer, always rounding toward zero regardless of the rounding mode in `mxcsr`. The result is not saturated: if it does not fit in the destination, the "integer indefinite" value 0x80000000 is returned.
- `cvttps2pi`: same for the two lowest 32-bit floating-point numbers.
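The extra `t` in `cvttss2si` stands for truncation: the fraction is dropped (rounding toward zero), and a value that does not fit in the destination yields the "integer indefinite" pattern 0x80000000 rather than the largest representable integer. A scalar sketch in C for a 32-bit destination (`cvttss2si_model` is a made-up name, and `float` is assumed to be the 32-bit IEEE 754 type):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of `cvttss2si` with a 32-bit destination
   (hypothetical helper). */
int32_t cvttss2si_model(float x)
{
    assert(sizeof(float) == 4);
    /* Only values whose truncation fits in int32_t are converted;
       a NaN fails both comparisons and falls through. */
    if (x >= -2147483648.0f && x < 2147483648.0f)
        return (int32_t)x;          /* a C cast truncates toward zero */
    return INT32_MIN;               /* "integer indefinite", 0x80000000 */
}
```

So 2.9 becomes 2, -2.9 becomes -2, and 3e9 becomes 0x80000000 (which reads as the most negative 32-bit integer).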

- `fxsave`: saves the state of the SSE and FPU units (and thus also MMX) in a 512-byte region of memory.
- `fxrstor`: restores the saved state of the SSE and FPU units.
- `stmxcsr`: stores the `mxcsr` control register in memory.
- `ldmxcsr`: loads a (saved) value from memory into the `mxcsr` control register.

The 32-bit control register `mxcsr` contains flags with information about computation results and flags controlling the execution of SSE instructions. Bits 0-15 are defined as follows:

Name | Bit number | Description |
---|---|---|
FZ | 15 | Flush To Zero |
R+ | 14 | Round Positive |
R- | 13 | Round Negative |
PM | 12 | Precision Mask |
UM | 11 | Underflow Mask |
OM | 10 | Overflow Mask |
ZM | 9 | Divide By Zero Mask |
DM | 8 | Denormal Mask |
IM | 7 | Invalid Operation Mask |
DAZ | 6 | Denormals Are Zero |
PE | 5 | Precision Flag |
UE | 4 | Underflow Flag |
OE | 3 | Overflow Flag |
ZE | 2 | Divide By Zero Flag |
DE | 1 | Denormal Flag |
IE | 0 | Invalid Operation Flag |

Setting the FZ flag makes an operation return zero on underflow. This speeds up the computation, but sometimes decreases the precision.

Flags `R+` and `R-` control the direction of rounding of the lowest bit. With only `R+` set, rounding is towards positive infinity; with only `R-` set, towards negative infinity. If both are set (sometimes denoted RZ: *Round To Zero*), rounding is towards zero. Usually both are cleared (RN: *Round To Nearest*) and rounding is towards the closer value.
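Bits 13 and 14 can be treated as a single 2-bit field. The sketch below shows in C how a value read with `stmxcsr` might be decoded and how a new value for `ldmxcsr` might be prepared; the enum and function names are inventions of this example, and no actual register access is performed.

```c
#include <assert.h>
#include <stdint.h>

/* Values of the 2-bit field formed by R- (bit 13) and R+ (bit 14). */
enum rounding { ROUND_NEAREST = 0, ROUND_DOWN = 1, ROUND_UP = 2, ROUND_TO_ZERO = 3 };

/* Extract the rounding-control field from an mxcsr value. */
enum rounding mxcsr_rounding(uint32_t mxcsr)
{
    return (enum rounding)((mxcsr >> 13) & 3u);
}

/* Return a copy of mxcsr with the rounding-control field replaced. */
uint32_t mxcsr_set_rounding(uint32_t mxcsr, enum rounding mode)
{
    assert((unsigned)mode <= 3u);
    return (mxcsr & ~(3u << 13)) | ((uint32_t)mode << 13);
}
```

For example, starting from 0x1F80 (the power-on default: all exceptions masked, round to nearest), selecting `ROUND_TO_ZERO` gives 0x7F80.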

Flags `PM`, `UM`, `OM`, `ZM`, `DM` and `IM` are used to mask the corresponding exceptions.

Flag `DAZ` causes *denormals* (very small numbers that cannot be normalized) in source operands to be automatically treated as zero. This flag exists only in some processors; before changing it, make sure that it is supported.

Flags `PE`, `UE`, `OE`, `ZE`, `DE` and `IE` are set when the corresponding exception occurs. They are "sticky", i.e. once set they stay set and must be cleared by hand. This permits checking correctness only once, at the end of an instruction sequence, but we lose the information about multiple occurrences of an exception.
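Checking once at the end of a sequence then comes down to two bit operations on the `mxcsr` value: clearing bits 0-5 before the sequence and testing them afterwards. A sketch in C that works on plain integers (the macro and function names are hypothetical; reading and writing the register itself would be done with `stmxcsr` and `ldmxcsr`):

```c
#include <assert.h>
#include <stdint.h>

#define MXCSR_FLAGS 0x3Fu   /* IE, DE, ZE, OE, UE, PE: bits 0..5 */

/* Value to load before the sequence: sticky flags cleared,
   masks and control bits unchanged. */
uint32_t mxcsr_clear_flags(uint32_t mxcsr)
{
    return mxcsr & ~MXCSR_FLAGS;
}

/* Test one sticky flag in a value stored after the sequence;
   bit is 0 (IE) .. 5 (PE), as in the table. */
int mxcsr_flag(uint32_t mxcsr, int bit)
{
    assert(bit >= 0 && bit <= 5);
    return (int)((mxcsr >> bit) & 1u);
}
```

For a stored value of 0x1F84, `mxcsr_flag(value, 2)` returns 1: a division by zero happened at least once somewhere in the sequence, but not where or how often.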

- `sfence`: all memory writes initiated before this instruction have to be performed before any writes initiated after this instruction.
- Memory write instructions bypassing the cache.