Efficient SIMD dot product in Rust

Question

I'm trying to write an efficient SIMD dot product for i16 values, so that I can implement 2D convolution for an FIR filter.

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
unsafe fn dot_product(a: &[i16], b: &[i16]) {
    let a = a.as_ptr() as *const [i16; 16];
    let b = b.as_ptr() as *const [i16; 16];
    let a = std::mem::transmute(*a);
    let b = std::mem::transmute(*b);
    let ms_256 = _mm256_mullo_epi16(a, b);
    dbg!(std::mem::transmute::<_, [i16; 16]>(ms_256));
    let hi_128 = _mm256_castsi256_si128(ms_256); // despite the name, this is the low 128 bits
    let lo_128 = _mm256_extracti128_si256(ms_256, 1); // despite the name, this is the high 128 bits
    dbg!(std::mem::transmute::<_, [i16; 8]>(hi_128));
    dbg!(std::mem::transmute::<_, [i16; 8]>(lo_128));
    let temp = _mm_add_epi16(hi_128, lo_128);
}

fn main() {
    let a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
    let b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
    unsafe {
        dot_product(&a, &b);
    }
}
I] ~/c/simd (master|…) $ env RUSTFLAGS="-C target-cpu=native" cargo run --release | wl-copy
warning: unused variable: `temp`
  --> src/main.rs:16:9
   |
16 |     let temp = _mm_add_epi16(hi_128, lo_128);
   |         ^^^^ help: if this is intentional, prefix it with an underscore: `_temp`
   |
   = note: `#[warn(unused_variables)]` on by default

warning: 1 warning emitted

    Finished release [optimized] target(s) in 0.00s
     Running `target/release/simd`
[src/main.rs:11] std::mem::transmute::<_, [i16; 16]>(ms_256) = [
    0,
    1,
    4,
    9,
    16,
    25,
    36,
    49,
    64,
    81,
    100,
    121,
    144,
    169,
    196,
    225,
]
[src/main.rs:14] std::mem::transmute::<_, [i16; 8]>(hi_128) = [
    0,
    1,
    4,
    9,
    16,
    25,
    36,
    49,
]
[src/main.rs:15] std::mem::transmute::<_, [i16; 8]>(lo_128) = [
    64,
    81,
    100,
    121,
    144,
    169,
    196,
    225,
]

While I understand SIMD conceptually, I'm not familiar with the exact instructions and intrinsics.

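I know that I need to multiply the two vectors and then sum the result horizontally: split the vector in half, add the two halves vertically using instructions for the smaller register size, and repeat.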

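I've found the madd instruction, which is supposed to perform one such summation step right after the multiplication, but I'm not sure what to do with its result.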
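One way the madd result could be handled is sketched below, as an illustration rather than a definitive implementation. `_mm256_madd_epi16` multiplies sixteen i16 pairs and adds adjacent products, leaving eight i32 partial sums; these can be accumulated with `_mm256_add_epi32` and reduced once at the end by repeated halving. The function names `dot_scalar`, `dot_avx2` and `dot_product`, the runtime feature check, and the scalar fallback are assumptions made for this sketch and are not part of the question. The i32 accumulator wraps on overflow for very long or very large inputs.

```rust
/// Scalar reference: widens to i32 before multiplying, wraps on overflow.
pub fn dot_scalar(a: &[i16], b: &[i16]) -> i32 {
    a.iter()
        .zip(b)
        .fold(0i32, |acc, (&x, &y)| acc.wrapping_add(i32::from(x) * i32::from(y)))
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_avx2(a: &[i16], b: &[i16]) -> i32 {
    use std::arch::x86_64::*;

    let n = a.len().min(b.len());
    let chunks = n / 16;
    let mut acc = _mm256_setzero_si256(); // eight i32 partial sums

    for i in 0..chunks {
        // Unaligned loads of 16 x i16 from each input.
        let va = _mm256_loadu_si256(a.as_ptr().add(i * 16) as *const __m256i);
        let vb = _mm256_loadu_si256(b.as_ptr().add(i * 16) as *const __m256i);
        // madd: a0*b0 + a1*b1, a2*b2 + a3*b3, ... as eight i32 lanes.
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
    }

    // Horizontal sum by halving: 8 lanes -> 4 -> 2 -> 1.
    let lo = _mm256_castsi256_si128(acc);
    let hi = _mm256_extracti128_si256::<1>(acc);
    let s = _mm_add_epi32(lo, hi);
    let s = _mm_add_epi32(s, _mm_unpackhi_epi64(s, s));
    let s = _mm_add_epi32(s, _mm_shuffle_epi32::<0b01>(s));
    let mut total = _mm_cvtsi128_si32(s);

    // Scalar tail for the elements that do not fill a whole vector.
    for i in chunks * 16..n {
        total = total.wrapping_add(i32::from(a[i]) * i32::from(b[i]));
    }
    total
}

/// Safe entry point: uses AVX2 when the CPU reports it, otherwise the scalar loop.
pub fn dot_product(a: &[i16], b: &[i16]) -> i32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just checked, and dot_avx2 only reads
            // indices below min(a.len(), b.len()).
            return unsafe { dot_avx2(a, b) };
        }
    }
    dot_scalar(a, b)
}
```

With the inputs from the question (both arrays holding 0 through 15), `dot_product(&a, &b)` returns 1240, the sum of the squares printed above.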

If I use mul instead of madd, I'm not sure which instructions to use to reduce the result further.
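For the mul route, a possible continuation of the reduction is sketched here under stated assumptions. `_mm256_mullo_epi16` keeps only the low 16 bits of each product, so every step in this version wraps at 16 bits; that is fine for the small test values above but not for general filter data, where widening to i32 is the safer choice. The names `dot16_scalar`, `dot16_avx2` and `dot16` are hypothetical, as is the fixed `[i16; 16]` input size.

```rust
/// Scalar reference with the same 16-bit wrapping behaviour as the SIMD path.
pub fn dot16_scalar(a: &[i16; 16], b: &[i16; 16]) -> i16 {
    a.iter()
        .zip(b)
        .fold(0i16, |acc, (&x, &y)| acc.wrapping_add(x.wrapping_mul(y)))
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot16_avx2(a: &[i16; 16], b: &[i16; 16]) -> i16 {
    use std::arch::x86_64::*;

    let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
    let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
    // Low 16 bits of each of the 16 products.
    let prod = _mm256_mullo_epi16(va, vb);

    // 16 lanes -> 8: add the low and high 128-bit halves.
    let lo = _mm256_castsi256_si128(prod);
    let hi = _mm256_extracti128_si256::<1>(prod);
    let s = _mm_add_epi16(lo, hi);
    // 8 lanes -> 4: add the upper 64 bits onto the lower 64 bits.
    let s = _mm_add_epi16(s, _mm_unpackhi_epi64(s, s));
    // 4 lanes -> 2: bring 32-bit element 1 down to element 0 and add.
    let s = _mm_add_epi16(s, _mm_shuffle_epi32::<0b01>(s));
    // 2 lanes -> 1: swap the two lowest 16-bit lanes and add.
    let s = _mm_add_epi16(s, _mm_shufflelo_epi16::<0b01>(s));
    // The total is in the lowest 16-bit lane.
    _mm_cvtsi128_si32(s) as i16
}

/// Safe entry point: uses AVX2 when the CPU reports it, otherwise the scalar loop.
pub fn dot16(a: &[i16; 16], b: &[i16; 16]) -> i16 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just checked, and both inputs are
            // exactly 16 elements, which is what the two 256-bit loads read.
            return unsafe { dot16_avx2(a, b) };
        }
    }
    dot16_scalar(a, b)
}
```

Compared with madd, this version needs one extra reduction step and loses the upper bits of every product, which is the main reason madd with i32 accumulation tends to be the better fit for i16 dot products.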

Any help is welcome!

P.S. I've already tried packed_simd, but it doesn't seem to work on stable Rust.

Tags: rust, simd, convolution, avx, dot-product

Solution

