simd
faer provides a common interface for generic and composable simd, using the pulp crate as a backend. pulp's high level api abstracts away the differences between various instruction sets and provides a common api that's generic over them (but not the scalar type). this allows users to write a generic implementation that gets turned into several functions, one for each possible instruction set among a predetermined subset. finally, the generic implementation can be used along with an Arch structure that determines the best implementation at runtime.
here's an example of how pulp could be used to compute the expression \(x^2 + 2y - |z|\), and store it into an output vector.
use core::iter::zip;

fn compute_expr(out: &mut [f64], x: &[f64], y: &[f64], z: &[f64]) {
    struct Impl<'a> {
        out: &'a mut [f64],
        x: &'a [f64],
        y: &'a [f64],
        z: &'a [f64],
    }

    impl pulp::WithSimd for Impl<'_> {
        type Output = ();

        #[inline(always)]
        fn with_simd<S: pulp::Simd>(self, simd: S) {
            let Self { out, x, y, z } = self;

            // split each slice into a vectorizable head and a scalar tail
            let (out_head, out_tail) = S::as_mut_simd_f64s(out);
            let (x_head, x_tail) = S::as_simd_f64s(x);
            let (y_head, y_tail) = S::as_simd_f64s(y);
            let (z_head, z_tail) = S::as_simd_f64s(z);

            let two = simd.splat_f64s(2.0);

            // vectorized loop: x^2 + 2y - |z|, one register at a time
            for (out, (&x, (&y, &z))) in zip(
                out_head,
                zip(x_head, zip(y_head, z_head)),
            ) {
                *out = simd.add_f64s(
                    simd.mul_f64s(x, x),
                    simd.sub_f64s(simd.mul_f64s(two, y), simd.abs_f64s(z)),
                );
            }

            // scalar loop for the remainder
            for (out, (&x, (&y, &z))) in zip(
                out_tail,
                zip(x_tail, zip(y_tail, z_tail)),
            ) {
                *out = x * x + 2.0 * y - z.abs();
            }
        }
    }

    pulp::Arch::new().dispatch(Impl { out, x, y, z });
}
there's a lot going on at the same time in this code example. let's go over it step by step.
pulp's generic simd implementation happens through the WithSimd trait, which takes self by value to pass in the function parameters. it additionally provides another parameter to with_simd describing the instruction set being used.
WithSimd::with_simd must be marked with the #[inline(always)] attribute. forgetting to do so could lead to a significant performance drop.
inside the body of the function, we split up each of out, x, y and z into two parts using S::as[_mut]_simd_f64s. the first part (head) is a slice of S::f64s, representing the vectorizable part of the original slice. the second part (tail) contains the remainder that doesn't fit into a vector register.
handling the head section is done using vectorized operations. currently, these need to take simd as a parameter in order to guarantee its availability in a sound way. this is what allows the api to be safe. the tail section is handled using scalar operations.
the final step is actually calling into our simd implementation. this is done by creating an instance of pulp::Arch that performs the runtime detection (and caches the result, so that future invocations are as fast as possible), then calling Arch::dispatch, which takes a type that implements WithSimd and chooses the best simd implementation for it.
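as a quick usage sketch (the values and the main function here are made up for illustration), calling compute_expr from the outside looks like any other function call; the runtime dispatch is hidden inside:

fn main() {
    let n = 1000;
    let x = vec![1.0_f64; n];
    let y = vec![2.0_f64; n];
    let z = vec![-3.0_f64; n];
    let mut out = vec![0.0_f64; n];

    // runtime dispatch happens inside compute_expr; the caller just sees a plain function
    compute_expr(&mut out, &x, &y, &z);

    // x^2 + 2y - |z| = 1 + 4 - 3
    assert_eq!(out[0], 2.0);
}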
memory alignment
instead of splitting the input and output slices into two sections (vectorizable head + non-vectorizable tail), an alternative approach would be to split them up into three sections instead (vectorizable head + vectorizable body + vectorizable tail). this can be accomplished using masked loads and stores, which can speed things up if the slices are similarly aligned.
similarly aligned slices are slices which have the same base address modulo the byte size of the cpu's vector registers. the simplest way to guarantee this is to allocate the slices in aligned memory (such that the base address is a multiple of the register size in bytes), in which case the slices are similarly aligned, and any subslices of them (with a shared offset and size) will also be similarly aligned. aligned allocation is done automatically for matrices in faer, which helps uphold these guarantees for maximum performance.
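for illustration, the definition of similar alignment can be written out as a small check (this is not a faer api, just a sketch; the function name and the register_bytes parameter are made up):

// not a faer api: two slices are similarly aligned if their base addresses
// are congruent modulo the register size in bytes.
fn similarly_aligned<T>(a: &[T], b: &[T], register_bytes: usize) -> bool {
    (a.as_ptr() as usize) % register_bytes == (b.as_ptr() as usize) % register_bytes
}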
pulp itself doesn't currently provide a safe memory alignment api, but it provides an unsafe abstraction that one can build upon. the faer implementation is built on top of Simd::mask_between_xxx, Simd::mask_load_ptr_xxx and Simd::mask_store_ptr_xxx, as well as the lifetime branded indexing api.
example using faer simd:
use faer::ComplexField;
use faer::ColRef;
use faer::ContiguousFwd;
use faer::utils::simd::SimdCtx;
use faer::utils::bound::Dim;
use pulp::Simd;

#[inline(always)]
pub fn dot_product_f32<'N, S: Simd>(
    simd: S,
    lhs: ColRef<'_, f32, Dim<'N>, ContiguousFwd>,
    rhs: ColRef<'_, f32, Dim<'N>, ContiguousFwd>,
) -> f32 {
    let simd = f32::simd_ctx(simd);
    let N = lhs.nrows();
    let align = size_of::<S::f32s>();

    // aligned simd context, based on the alignment offset of `lhs`
    let simd = SimdCtx::<'N, f32, S>::new_align(
        simd,
        N,
        lhs.as_ptr().align_offset(align),
    );

    let mut acc0 = simd.zero();
    let mut acc1 = simd.zero();
    let mut acc2 = simd.zero();
    let mut acc3 = simd.zero();

    // masked head, batches of four full registers, single registers, masked tail
    let (head, idx4, idx1, tail) = simd.batch_indices::<4>();

    if let Some(i0) = head {
        let l0 = simd.read(lhs, i0);
        let r0 = simd.read(rhs, i0);
        acc0 = simd.mul_add(l0, r0, acc0);
    }
    for [i0, i1, i2, i3] in idx4 {
        let l0 = simd.read(lhs, i0);
        let l1 = simd.read(lhs, i1);
        let l2 = simd.read(lhs, i2);
        let l3 = simd.read(lhs, i3);
        let r0 = simd.read(rhs, i0);
        let r1 = simd.read(rhs, i1);
        let r2 = simd.read(rhs, i2);
        let r3 = simd.read(rhs, i3);

        acc0 = simd.mul_add(l0, r0, acc0);
        acc1 = simd.mul_add(l1, r1, acc1);
        acc2 = simd.mul_add(l2, r2, acc2);
        acc3 = simd.mul_add(l3, r3, acc3);
    }
    for i0 in idx1 {
        let l0 = simd.read(lhs, i0);
        let r0 = simd.read(rhs, i0);
        acc0 = simd.mul_add(l0, r0, acc0);
    }
    if let Some(i0) = tail {
        let l0 = simd.read(lhs, i0);
        let r0 = simd.read(rhs, i0);
        acc0 = simd.mul_add(l0, r0, acc0);
    }

    acc0 = simd.add(acc0, acc1);
    acc2 = simd.add(acc2, acc3);
    acc0 = simd.add(acc0, acc2);

    simd.reduce_sum(acc0)
}
assuming the alignment offset manages to align both lhs and rhs, this version can be much more efficient than the unaligned one.
however, it has a subtle bug that we will explain in the next section.
floating point determinism
when performing reductions on floating point values, it's important to keep in mind that these operations are only approximately associative, so the rounding error arising from the limited float precision depends on the order in which the operations are performed.
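here's a minimal demonstration of this non-associativity:

fn main() {
    let (a, b, c) = (1.0e8_f32, -1.0e8_f32, 1.0e-3_f32);
    // (a + b) + c == 1.0e-3, but a + (b + c) == 0.0, since b + c rounds back to b
    assert_ne!((a + b) + c, a + (b + c));
}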
there is, however, a useful loophole we can exploit. assume we have a register size of 4 and we want to sum up 23 elements, with a batch size of two registers.
[
x0, x1, x2, x3,
x4, x5, x6, x7,
x8, x9, x10, x11,
x12, x13, x14, x15,
x16, x17, x18, x19,
x20, x21, x22,
]
if we use something similar to the previous algorithm, we get the following intermediate results in the case where the alignment offset is 0.
// after the batch 2 loop
acc0 = f32x4(
x0 + x8,
x1 + x9,
x2 + x10,
x3 + x11,
);
acc1 = f32x4(
x4 + x12,
x5 + x13,
x6 + x14,
x7 + x15,
);
// after the batch 1 loop
acc0 = f32x4(
(x0 + x8) + x16,
(x1 + x9) + x17,
(x2 + x10) + x18,
(x3 + x11) + x19,
);
// after the tail
acc0 = f32x4(
((x0 + x8) + x16) + x20,
((x1 + x9) + x17) + x21,
((x2 + x10) + x18) + x22,
((x3 + x11) + x19) + 0.0,
);
// we add up `acc0` and `acc1`
acc = f32x4(
(((x0 + x8) + x16) + x20) + (x4 + x12),
(((x1 + x9) + x17) + x21) + (x5 + x13),
(((x2 + x10) + x18) + x22) + (x6 + x14),
(((x3 + x11) + x19) + 0.0) + (x7 + x15),
);
// reduce sum: x => (x.0 + x.2) + (x.1 + x.3)
acc = (((((x0 + x8) + x16) + x20) + (x4 + x12)) + ((((x2 + x10) + x18) + x22) + (x6 + x14)))
    + (((((x1 + x9) + x17) + x21) + (x5 + x13)) + ((((x3 + x11) + x19) + 0.0) + (x7 + x15)))
now let us take the case where the alignment offset is 3 instead. in this case, we pad the array by one element on the left before proceeding, so it's as if we were working with
[
0.0, x0, x1, x2,
x3, x4, x5, x6,
x7, x8, x9, x10,
x11, x12, x13, x14,
x15, x16, x17, x18,
x19, x20, x21, x22,
]
skipping the intermediate computations, we get the final result
acc = (((((0.0 + x7) + x15) + x19) + (x3 + x11)) + ((((x1 + x9) + x17) + x21) + (x5 + x13)))
    + (((((x0 + x8) + x16) + x20) + (x4 + x12)) + ((((x2 + x10) + x18) + x22) + (x6 + x14)))
note that this version of the result contains the term ((0.0 + x7) + x15) + x19, which doesn't appear as a term in the original sum unless we assume exact associativity. (commutativity is assumed to be exact for the operations we care about.)
this results in us getting a slightly different result than before, which could lead to hard-to-reproduce bugs if the user expects consistent behavior between different runs on the same machine, and the alignment happens to differ due to a difference in the memory allocator or the os behavior.
however, if we sum up our elements like this
// after the batch 2 loop
acc0 = f32x4(
x0 + x8,
x1 + x9,
x2 + x10,
x3 + x11,
);
acc1 = f32x4(
x4 + x12,
x5 + x13,
x6 + x14,
x7 + x15,
);
// after the batch 1 loop
acc0 = f32x4(
(x0 + x8) + x16,
(x1 + x9) + x17,
(x2 + x10) + x18,
(x3 + x11) + x19,
);
// after the tail
acc1 = f32x4(
(x4 + x12) + x20,
(x5 + x13) + x21,
(x6 + x14) + x22,
(x7 + x15) + 0.0,
);
// we add up `acc0` and `acc1`
acc = f32x4(
((x0 + x8) + x16) + ((x4 + x12) + x20),
((x1 + x9) + x17) + ((x5 + x13) + x21),
((x2 + x10) + x18) + ((x6 + x14) + x22),
((x3 + x11) + x19) + ((x7 + x15) + 0.0),
);
// reduce sum: x => (x.0 + x.2) + (x.1 + x.3)
acc = ((((x0 + x8) + x16) + ((x4 + x12) + x20)) + (((x2 + x10) + x18) + ((x6 + x14) + x22)))
    + ((((x1 + x9) + x17) + ((x5 + x13) + x21)) + (((x3 + x11) + x19) + ((x7 + x15) + 0.0)))
then the offset version outputs:
acc = ((((0.0 + x7) + x15) + ((x3 + x11) + x19)) + (((x1 + x9) + x17) + ((x5 + x13) + x21)))
    + ((((x0 + x8) + x16) + ((x4 + x12) + x20)) + (((x2 + x10) + x18) + ((x6 + x14) + x22)))
close inspection will reveal that this is exactly equal to the previous sum, assuming that x + 0.0 == x (which is true modulo signed zero) and x + y == y + x.
to generalize this strategy, it can be shown that reducing the outputs by rotating the accumulators
acc0 -> acc1 -> acc2 -> acc3 -> acc0 -> acc1 -> acc2 -> acc3 -> acc0 -> ...
and performing a tree reduction at the end
acc = (acc0 + acc2) + (acc1 + acc3);
// similarly reduce `acc`
acc = (acc.0 + acc.2) + (acc.1 + acc.3);
will always produce the same results.
faer aims to guarantee this behavior, and not doing so is considered a bug.
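to see the whole strategy end to end, here is a scalar sketch (not faer's implementation: the register is simulated with a 4-lane array, two accumulators stand in for the batch, and rotated_sum, LANES and pad are made-up names) showing that the result doesn't depend on the simulated alignment offset:

const LANES: usize = 4;

fn rotated_sum(data: &[f32], pad: usize) -> f32 {
    // simulated vector accumulators acc0 and acc1
    let mut acc = [[0.0_f32; LANES]; 2];
    let mut which = 0;

    // pretend the data is preceded by `pad` zeros and followed by enough
    // zeros to fill a whole number of registers
    let padded_len = (pad + data.len() + LANES - 1) / LANES * LANES;
    let get = |i: usize| {
        if i < pad || i - pad >= data.len() {
            0.0
        } else {
            data[i - pad]
        }
    };

    for reg in 0..padded_len / LANES {
        for lane in 0..LANES {
            acc[which][lane] += get(reg * LANES + lane);
        }
        which = 1 - which; // rotate acc0 -> acc1 -> acc0 -> ...
    }

    // lane-wise sum of the accumulators, then the tree reduction from the text
    let v: [f32; LANES] = core::array::from_fn(|l| acc[0][l] + acc[1][l]);
    (v[0] + v[2]) + (v[1] + v[3])
}

fn main() {
    let data: Vec<f32> = (0..23).map(|i| 1.0 / (i as f32 + 1.0)).collect();
    let reference = rotated_sum(&data, 0);
    for pad in 1..LANES {
        // the result does not depend on the simulated alignment offset
        assert_eq!(reference.to_bits(), rotated_sum(&data, pad).to_bits());
    }
}

the assertion passes because, as shown above, changing the padding only reorders additions in ways that commutativity and additions of 0.0 can account for.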
universal floating point determinism
the determinism we mentioned above still only guarantees the same results between different runs on the same machine.
if the register size changes, for example, then the logic falls apart. if we want to guarantee that the results will be the same on any machine, other conditions must be met.
the first one is that the register size must be the same everywhere. this means that we must fix a register size that's available with one instruction set and emulate it on the ones where it's not available. the other condition is that the number of threads executing the code must be the same, since the reduction parameters depend on the thread count if we use multithreaded reductions.
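as a small illustration of the thread count issue (chunked_sum is a made-up helper, not faer code), splitting a reduction into a different number of chunks changes the order of the additions, and thus generally the rounding:

fn chunked_sum(data: &[f32], chunks: usize) -> f32 {
    // one partial sum per simulated thread, then a sum of the partial sums
    let chunk_len = (data.len() + chunks - 1) / chunks;
    data.chunks(chunk_len).map(|c| c.iter().sum::<f32>()).sum()
}

fn main() {
    let data: Vec<f32> = (0..1000).map(|i| 1.0 / (i as f32 + 1.0)).collect();
    // the two results usually differ in the last few bits
    println!("{} {}", chunked_sum(&data, 1), chunked_sum(&data, 4));
}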
currently, faer doesn't provide universal floating point determinism as a feature. since simd emulation comes with overhead, this must be an opt-in feature, which we want to provide in a future release. the matrix multiplication backend used by faer must also satisfy the same criteria in that case, which is also still a work in progress.