SSE: a SIMD extension introduced in the Intel Pentium III and AMD Athlon XP processors, also known as Katmai New Instructions (KNI). Supported since Windows 98 and Linux kernel 2.2.
Based on eight new 128-bit registers named xmm0-xmm7, each divided into four 32-bit single-precision floating-point elements.
Additional control register: mxcsr.
Later more extensions were added, called SSE2, SSE3 etc.
SSE adds many new instructions operating on XMM registers (some also on MMX and general-purpose registers). The selection presented below is far from complete.
Moving full 128-bit contents:
movaps - the address must be aligned to a 16-byte boundary;
movups - no alignment required.
Moving 64-bit values:
movhps - to/from the higher part of an xmm register.
movlps - to/from the lower part of an xmm register.
movhlps - from the higher part of the source register to the lower part of the destination register.
movlhps - from the lower part of the source register to the higher part of the destination register.
Moving 32-bit values:
movss - moves a 32-bit floating-point number between an XMM register and memory or another XMM register.
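As an illustration, the moves above can be sketched in C with the compiler intrinsics from <xmmintrin.h> (an assumption of this sketch, since the surrounding text uses assembly). The intrinsics _mm_load_ps/_mm_store_ps, _mm_storeu_ps, _mm_movehl_ps and _mm_movelh_ps usually compile to movaps, movups, movhlps and movlhps; the helper names lane, aligned_roundtrip, high_to_low and low_to_high are made up for this example.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: returns element i (0..3) of an XMM value. */
static float lane(__m128 v, int i)
{
    float tmp[4];
    _mm_storeu_ps(tmp, v);          /* movups: no alignment requirement */
    return tmp[i];
}

/* Stores and reloads v through a 16-byte aligned buffer (movaps). */
static float aligned_roundtrip(__m128 v, int i)
{
    union { __m128 vec; float f[4]; } buf;  /* the __m128 member forces 16-byte alignment */
    _mm_store_ps(buf.f, v);                 /* aligned store */
    return lane(_mm_load_ps(buf.f), i);     /* aligned load */
}

/* movhlps: the higher half of src goes to the lower half of dst. */
static __m128 high_to_low(__m128 dst, __m128 src)
{
    return _mm_movehl_ps(dst, src);
}

/* movlhps: the lower half of src goes to the higher half of dst. */
static __m128 low_to_high(__m128 dst, __m128 src)
{
    return _mm_movelh_ps(dst, src);
}
```

With dst = (1, 2, 3, 4) and src = (5, 6, 7, 8), high_to_low gives (7, 8, 3, 4) and low_to_high gives (1, 2, 5, 6).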
Shuffling is used to change the order of elements in a single XMM register or to mix values from two such registers. The arguments of the shufps instruction are two XMM registers and an 8-bit mask.
The first two elements of the destination register are overwritten with any two elements of that same register. The third and fourth elements are overwritten with two elements from the source register. The selection is controlled by pairs of bits from the mask, each interpreted as a number from the range 0-3; the lowest pair of bits selects the first element.
Both arguments may be the same register.
Examples:
    shufps xmm0,xmm0,0x1b ;reverse the order of elements (0x1B=00 01 10 11)
    shufps xmm0,xmm0,0xaa ;four copies of the third element (0xAA=10 10 10 10)
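The two masks can be checked with a small C sketch. It assumes the _mm_shuffle_ps intrinsic from <xmmintrin.h>, which normally compiles to shufps; the helper names (lane, reverse, broadcast_third, low_halves) are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: returns element i (0..3) of an XMM value. */
static float lane(__m128 v, int i)
{
    float tmp[4];
    _mm_storeu_ps(tmp, v);
    return tmp[i];
}

/* shufps xmm0,xmm0,0x1b: mask pairs 11,10,01,00 read from the lowest pair up. */
static __m128 reverse(__m128 v)
{
    return _mm_shuffle_ps(v, v, 0x1B);
}

/* shufps xmm0,xmm0,0xaa: every pair is 10, so element 2 is copied everywhere. */
static __m128 broadcast_third(__m128 v)
{
    return _mm_shuffle_ps(v, v, 0xAA);
}

/* Two different registers: elements 0,1 come from a, elements 2,3 from b. */
static __m128 low_halves(__m128 a, __m128 b)
{
    return _mm_shuffle_ps(a, b, 0x44);   /* 0x44 = 01 00 01 00 */
}
```

For the input (1, 2, 3, 4), reverse yields (4, 3, 2, 1) and broadcast_third yields (3, 3, 3, 3).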
The shufps and shufpd instructions operate on packed floating-point numbers. For integer values there are the pshufb, pshufw and pshufd instructions.
pshufw (operates on MMX registers) - moves values from the source register (or memory) to the target register, shuffling them according to the mask.
Example: computing the cross product to find a vector perpendicular to two given vectors (32-bit version for now).
    ;; float *cross_product (float V1[4], float V2[4], float W[4])
    ;; Find the cross product of two constant vectors and return it.
    ;; W.x = V1.y * V2.z - V1.z * V2.y
    ;; W.y = V1.z * V2.x - V1.x * V2.z
    ;; W.z = V1.x * V2.y - V1.y * V2.x
    global cross_product
    section .text
    cross_product:
      push ebp
      mov ebp,esp
      mov eax,[ebp+8]        ;Put argument addresses in registers
      mov ecx,[ebp+12]       ;(ecx may be clobbered freely, ebx must be preserved)
      movups xmm0,[eax]      ;If aligned then use movaps
      movups xmm1,[ecx]
      movaps xmm2,xmm0       ;Copies
      movaps xmm3,xmm1
      shufps xmm0,xmm0,0xc9  ;V1 as (y,z,x,w)
      shufps xmm1,xmm1,0xd2  ;V2 as (z,x,y,w)
      mulps xmm0,xmm1        ;(V1.y*V2.z, V1.z*V2.x, V1.x*V2.y, ...)
      shufps xmm2,xmm2,0xd2  ;V1 as (z,x,y,w)
      shufps xmm3,xmm3,0xc9  ;V2 as (y,z,x,w)
      mulps xmm2,xmm3        ;(V1.z*V2.y, V1.x*V2.z, V1.y*V2.x, ...)
      subps xmm0,xmm2        ;(W.x, W.y, W.z, 0)
      mov eax,[ebp+16]
      movups [eax],xmm0      ;Result; eax returns the address of W
      pop ebp
      ret
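The formulas for W.x, W.y and W.z can also be sketched in C with intrinsics, which makes the computation easy to test. This is only a sketch: it assumes <xmmintrin.h>, the shuffle masks 0xC9 and 0xD2 are chosen here so that the result lands in (x, y, z) order, and the names lane and cross_product_ps are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: returns element i (0..3) of an XMM value. */
static float lane(__m128 v, int i)
{
    float tmp[4];
    _mm_storeu_ps(tmp, v);
    return tmp[i];
}

/* Cross product on the three lowest elements; the fourth element becomes 0. */
static __m128 cross_product_ps(__m128 v1, __m128 v2)
{
    __m128 v1_yzx = _mm_shuffle_ps(v1, v1, 0xC9);  /* (y, z, x, w) */
    __m128 v1_zxy = _mm_shuffle_ps(v1, v1, 0xD2);  /* (z, x, y, w) */
    __m128 v2_yzx = _mm_shuffle_ps(v2, v2, 0xC9);
    __m128 v2_zxy = _mm_shuffle_ps(v2, v2, 0xD2);

    /* (V1.y*V2.z, V1.z*V2.x, V1.x*V2.y) - (V1.z*V2.y, V1.x*V2.z, V1.y*V2.x) */
    return _mm_sub_ps(_mm_mul_ps(v1_yzx, v2_zxy),
                      _mm_mul_ps(v1_zxy, v2_yzx));
}
```

A quick sanity check is the unit vectors: (1, 0, 0) crossed with (0, 1, 0) should give (0, 0, 1).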
Basic arithmetic operations have two variants, recognized by their suffixes:
ss: the instruction operates only on the lowest element; the others do not change;
ps: the instruction operates on all elements in parallel.
Addition: addss, addps.
Subtraction: subss, subps.
Multiplication: mulss, mulps.
Division: divss, divps.
Reciprocal (1/x): rcpss, rcpps. These take two operands (destination and source), e.g. rcpps xmm1,xmm2. The result is a fast approximation, not an exactly rounded value.
Square root: sqrtss, sqrtps. These also take two operands, e.g. sqrtss xmm1,xmm2
Reciprocal of square root: rsqrtss, rsqrtps (also an approximation).
Maximum: maxss, maxps.
Minimum: minss, minps.
Example: adding 4-element vectors
    ;; float *vector_add (float V1[4], float V2[4], float W[4])
    global vector_add
    section .text
    vector_add:
      push ebp
      mov ebp,esp
      mov eax,[ebp+8]        ;Move argument addresses to registers
      mov ecx,[ebp+12]       ;(ecx may be clobbered freely, ebx must be preserved)
      movups xmm0,[eax]      ;Fetch arguments
      movups xmm1,[ecx]
      addps xmm0,xmm1
      mov eax,[ebp+16]
      movups [eax],xmm0      ;Result; eax returns the address of W
      pop ebp
      ret
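The difference between the ss and ps suffixes is easy to see side by side. The following C sketch assumes the <xmmintrin.h> intrinsics _mm_add_ps, _mm_add_ss, _mm_sqrt_ps and _mm_sqrt_ss, which correspond to addps, addss, sqrtps and sqrtss; the wrapper names are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: returns element i (0..3) of an XMM value. */
static float lane(__m128 v, int i)
{
    float tmp[4];
    _mm_storeu_ps(tmp, v);
    return tmp[i];
}

/* addps: all four elements are added in parallel. */
static __m128 add_all(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}

/* addss: only the lowest element is added; the other three come from a. */
static __m128 add_lowest(__m128 a, __m128 b)
{
    return _mm_add_ss(a, b);
}

/* sqrtps: square root of every element. */
static __m128 sqrt_all(__m128 v)
{
    return _mm_sqrt_ps(v);
}

/* sqrtss: square root of the lowest element only. */
static __m128 sqrt_lowest(__m128 v)
{
    return _mm_sqrt_ss(v);
}
```

For a = (1, 2, 3, 4) and b = (10, 20, 30, 40), add_all gives (11, 22, 33, 44) while add_lowest gives (11, 2, 3, 4).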
The instruction movmskps rax,xmm1 puts the sign bits of four 32-bit floating-point numbers into the given general-purpose register (in the 4 lowest bits; the other bits are cleared).
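A short C sketch of the same operation, assuming the _mm_movemask_ps intrinsic from <xmmintrin.h> (which normally compiles to movmskps); the function names are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Bit i of the result is the sign bit of element i; higher bits are zero. */
static int sign_bits(__m128 v)
{
    return _mm_movemask_ps(v);
}

/* Typical use: test all four elements with one integer comparison. */
static int any_negative(__m128 v)
{
    return sign_bits(v) != 0;
}
```

For (-1, 2, -3, 4) the mask is 0x5: bits 0 and 2 are set because elements 0 and 2 are negative.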
Arithmetic mean (average):
pavgb - for each pair of bytes;
pavgw - for each pair of 16-bit words.
Maxima and minima:
pmaxsb - returns the maxima for each pair of bytes treated as 8-bit signed integers. Changing the last letter to w or d gives variants operating on pairs of 16-bit and 32-bit signed integers.
pmaxub - same as above, but for unsigned integers.
pminsb - returns the minima for each pair of bytes (signed integers); other variants as above.
pminub - same for unsigned integers.
Exotica:
psadbw - returns the sum of absolute differences of all pairs of bytes (unsigned numbers). In the 64-bit version (MMX registers) the sum is placed in the two lowest bytes of the destination register and the other bytes are cleared. The 128-bit version (XMM registers) returns two sums, one for the lower and one for the upper 64-bit half.
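The integer instructions can be tried out in their 128-bit form with a C sketch. Note the assumptions: the XMM versions need SSE2, so the sketch uses <emmintrin.h> and the intrinsics _mm_avg_epu8, _mm_max_epu8 and _mm_sad_epu8 (pavgb, pmaxub, psadbw); the helper names are hypothetical. The average is computed as (a + b + 1) >> 1, i.e. rounded up.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (128-bit integer operations) */

/* Hypothetical helper: returns byte i (0..15) of an XMM integer value. */
static unsigned byte_at(__m128i v, int i)
{
    unsigned char tmp[16];
    _mm_storeu_si128((__m128i *)tmp, v);
    return tmp[i];
}

/* pavgb: rounded average of each pair of unsigned bytes. */
static __m128i average_bytes(__m128i a, __m128i b)
{
    return _mm_avg_epu8(a, b);
}

/* pmaxub: maximum of each pair of unsigned bytes. */
static __m128i max_unsigned_bytes(__m128i a, __m128i b)
{
    return _mm_max_epu8(a, b);
}

/* psadbw: sum of absolute differences over the lower 8 byte pairs. */
static int sad_low(__m128i a, __m128i b)
{
    return _mm_cvtsi128_si32(_mm_sad_epu8(a, b));
}

/* psadbw: the second sum, stored at the start of the upper 64-bit half. */
static int sad_high(__m128i a, __m128i b)
{
    return _mm_extract_epi16(_mm_sad_epu8(a, b), 4);
}
```

With every byte of a equal to 10 and every byte of b equal to 3, each average is 7 and each of the two sums is 8 * 7 = 56.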
andnps - bitwise AND of the logically negated first (destination) argument with the second argument (AND NOT)
andps - bitwise AND
orps - bitwise OR
xorps - bitwise exclusive OR (XOR)
cmpps, cmpss - compare the arguments and return all zeros or all ones (in the first argument).
cmpWWps - compares 4 pairs of 32-bit floating-point numbers,
cmpWWss - compares two 32-bit floating-point numbers,
where WW is one of the following conditions:
eq - equal
lt - less
le - less or equal
neq - not equal
nlt - not less
nle - not less or equal
ord - ordered (neither argument is a NaN)
unord - unordered (at least one argument is a NaN)
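Because a comparison returns all ones or all zeros per element, its result is typically used as a mask for the logical instructions. The C sketch below illustrates this under the assumption that <xmmintrin.h> is available; _mm_cmplt_ps, _mm_and_ps, _mm_andnot_ps and _mm_or_ps correspond to cmpltps, andps, andnps and orps, and the names lane, less_than_bits and select_smaller are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: returns element i (0..3) of an XMM value. */
static float lane(__m128 v, int i)
{
    float tmp[4];
    _mm_storeu_ps(tmp, v);
    return tmp[i];
}

/* One bit per element: bit i is set when a[i] < b[i]. */
static int less_than_bits(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmplt_ps(a, b));
}

/* Branch-free selection: a[i] where a[i] < b[i], otherwise b[i]. */
static __m128 select_smaller(__m128 a, __m128 b)
{
    __m128 mask = _mm_cmplt_ps(a, b);           /* all ones or all zeros */
    __m128 from_a = _mm_and_ps(mask, a);        /* keep a where mask is set */
    __m128 from_b = _mm_andnot_ps(mask, b);     /* (NOT mask) AND b */
    return _mm_or_ps(from_a, from_b);
}
```

For a = (1, 5, 3, 8) and b = (4, 2, 3, 7) only the first comparison is true, so less_than_bits returns 1 and select_smaller returns (1, 2, 3, 7). In practice minps does the same job in one instruction; the point here is only the mask technique.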
comiss - compares the two 32-bit floating-point numbers in the lowest positions in the classical way, i.e. through the flags register: sets ZF, PF and CF according to the result and clears OF, SF and AF.
ucomiss - same as comiss, but does not raise an exception for QNaNs.
Single:
cvtsi2ss - converts a 32-bit integer (from a general-purpose register or memory) to a floating-point number in the lowest position of an XMM register. The other 3 positions do not change.
cvtss2si - the inverse: converts the 32-bit floating-point number in the lowest position to an integer in a general-purpose register.
Parallel (packed):
cvtpi2ps - converts two 32-bit integers (from an MMX register or memory) to floating-point numbers in the two lower positions. The two upper positions do not change.
cvtps2pi - the inverse: converts the two lower 32-bit floating-point numbers to integers in an MMX register.
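With truncation:
cvttss2si - converts the 32-bit floating-point number in the lowest position to an integer, always rounding toward zero regardless of the rounding mode in mxcsr. The result does not saturate: if it does not fit in the destination, the indefinite integer value 0x80000000 is returned.
cvttps2pi - same for the two lowest 32-bit floating-point numbers.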
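The scalar conversions can be compared in a small C sketch. It assumes <xmmintrin.h>, where _mm_cvtsi32_ss, _mm_cvtss_si32 and _mm_cvttss_si32 correspond to cvtsi2ss, cvtss2si and cvttss2si, and it assumes the default rounding mode (round to nearest, ties to even) in mxcsr. The wrapper names are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* cvtsi2ss: integer to float in the lowest position. */
static float int_to_float(int n)
{
    return _mm_cvtss_f32(_mm_cvtsi32_ss(_mm_setzero_ps(), n));
}

/* cvtss2si: float to integer, rounded according to mxcsr. */
static int round_to_int(float x)
{
    return _mm_cvtss_si32(_mm_set_ss(x));
}

/* cvttss2si: float to integer, always rounded toward zero. */
static int truncate_to_int(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x));
}
```

With the default rounding mode, 2.7 becomes 3 through round_to_int but 2 through truncate_to_int, and -2.7 becomes -3 and -2 respectively.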
fxsave - saves the state of the SSE and FPU units (so also MMX) in a 512-byte region of memory.
fxrstor - restores the saved state of the SSE and FPU units.
stmxcsr - saves the mxcsr control register.
ldmxcsr - loads a (saved) value into the mxcsr control register.
The 32-bit control register mxcsr contains flags with information about computation results and flags controlling the execution of SSE instructions. Bits 0-15 are defined as follows:
Name | Bit number | Description |
---|---|---|
FZ | 15 | Flush To Zero |
R+ | 14 | Round Positive |
R- | 13 | Round Negative |
PM | 12 | Precision Mask |
UM | 11 | Underflow Mask |
OM | 10 | Overflow Mask |
ZM | 9 | Divide By Zero Mask |
DM | 8 | Denormal Mask |
IM | 7 | Invalid Operation Mask |
DAZ | 6 | Denormals Are Zero |
PE | 5 | Precision Flag |
UE | 4 | Underflow Flag |
OE | 3 | Overflow Flag |
ZE | 2 | Divide By Zero Flag |
DE | 1 | Denormal Flag |
IE | 0 | Invalid Operation Flag |
Setting the FZ flag causes zero to be returned in case of underflow. This speeds up the computation, but sometimes decreases the precision.
Flags R+ and R- control the direction of rounding for the lowest bit. If only R+ is set, rounding is towards plus infinity; if only R- is set, rounding is towards minus infinity. If both are set (sometimes denoted as RZ: Round To Zero), then rounding is towards zero. Usually both are cleared (RN: Round To Nearest) and then rounding is towards the closer value.
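The effect of the rounding bits can be observed with a C sketch that changes mxcsr around a single conversion. It assumes <xmmintrin.h>, where _mm_getcsr and _mm_setcsr correspond to stmxcsr and ldmxcsr. The constant names and the function convert_with_rounding are hypothetical, and the volatile variables are there only to keep the compiler from evaluating or moving the conversion.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

#define MXCSR_R_MINUS (1u << 13)   /* R-: Round Negative */
#define MXCSR_R_PLUS  (1u << 14)   /* R+: Round Positive */
#define MXCSR_ROUND_BITS (MXCSR_R_MINUS | MXCSR_R_PLUS)

/* Converts x to an integer (cvtss2si) with the given rounding bits set
   in mxcsr, then restores the previous mxcsr value. */
static int convert_with_rounding(float x, unsigned int rounding_bits)
{
    unsigned int saved = _mm_getcsr();                     /* stmxcsr */
    volatile float input = x;                              /* read back after mxcsr is changed */

    _mm_setcsr((saved & ~MXCSR_ROUND_BITS) | rounding_bits);   /* ldmxcsr */
    volatile int result = _mm_cvtss_si32(_mm_set_ss(input));
    _mm_setcsr(saved);                                     /* restore the caller's mode */

    return result;
}
```

For example, 2.1 converts to 3 with only R+ set, 2.9 converts to 2 with only R- set, and 2.7 converts to 2 with both bits set but to 3 with both cleared.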
Flags PM, UM, OM, ZM, DM and IM are used to mask the corresponding exceptions.
Flag DAZ causes denormals (very small numbers which cannot be normalized) used as operands to be treated as zero. This flag exists only in some processors; before changing it, make sure that it is supported.
Flags PE, UE, OE, ZE, DE and IE are set after the corresponding exception occurs. They are "sticky", i.e. once set they stay set and must be cleared by hand. This permits checking correctness only once, at the end of an instruction sequence, but we lose the information about multiple occurrences of an exception.
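The sticky behaviour can be demonstrated with the Divide By Zero flag (ZE, bit 2). The C sketch below assumes <xmmintrin.h> and assumes that the divide-by-zero exception is masked (ZM set), which is the usual state at program start, so the division only sets the flag instead of trapping. All function and constant names are hypothetical; the volatile variables keep the compiler from folding or moving the division.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

#define MXCSR_FLAG_BITS 0x003Fu    /* bits 0-5: IE, DE, ZE, OE, UE, PE */
#define MXCSR_ZE        (1u << 2)  /* Divide By Zero Flag */

/* Clears all sticky exception flags "by hand". */
static void clear_flags(void)
{
    _mm_setcsr(_mm_getcsr() & ~MXCSR_FLAG_BITS);
}

/* Performs a real divss at run time. */
static float divide(float num, float den)
{
    volatile float n = num, d = den;
    volatile float q = _mm_cvtss_f32(_mm_div_ss(_mm_set_ss(n), _mm_set_ss(d)));
    return q;
}

/* Returns 1 if the ZE flag is currently set. */
static int ze_flag(void)
{
    return (_mm_getcsr() & MXCSR_ZE) != 0;
}

/* Flag state after a single division. */
static int ze_after(float num, float den)
{
    clear_flags();
    divide(num, den);
    return ze_flag();
}

/* The flag set by 1/0 survives a later, harmless division. */
static int ze_is_sticky(void)
{
    clear_flags();
    divide(1.0f, 0.0f);
    divide(1.0f, 2.0f);
    return ze_flag();
}
```

Checking ze_flag once after a whole sequence of divisions therefore tells whether any of them divided by zero, but not which one or how many.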
sfence - all memory writes initiated before this instruction have to be performed before any writes initiated after this instruction.