经典设计结构之脉动阵列OpenCL实现

脉动阵列，本身的核心概念就是让数据在运算单元的阵列中进行流动，减少访存的次数。
脉动阵列使得结构更加规整，布线更加统一，提高频率。

脉动阵列架构

上图中上半部分是传统的计算系统的模型。一个处理单元（PE）从存储器（memory）读取数据，进行处理，然后再写回到存储器。这个系统的最大问题是：数据存取的速度往往大大低于数据处理的速度。因此，整个系统的处理能力（MOPS，每秒完成的操作）很大程度受限于访存的能力。脉动阵列架构用了一个很简单的方法：让数据尽量在处理单元中多流动一会儿。正如上图的下半部分所描述的，第一个数据首先进入第一个PE，经过处理以后被传递到下一个PE，同时第二个数据进入第一个PE。以此类推，当第一个数据到达最后一个PE，它已经被处理了多次。所以，脉动架构实际上是多次重用了输入数据。因此，它可以在消耗较小的memory带宽的情况下实现较高的运算吞吐率。当然，脉动架构还有其它一些好处，比如模块化的设计容易扩展，简单和规则的数据和控制流程，使用简单并且均匀的单元（cell），避免了全局广播和扇入（fan-in），以及快速的响应时间等等。
总结起来，脉动阵列架构有几个特征：

由多个同构的PE构成，可以是一维或二维，串行、阵列或树的结构（现在我们看到的更多的是阵列形式）；
PE功能相对简单，系统通过实现大量PE并行来提高运算的效率；
PE只能向相邻的PE发送数据（在一些二维结构中，也可能有对角线方向的数据通道）。数据采用流水线的方式向“下游”流动，直到流出最后的PE。

因此，脉动阵列架构是一种很特殊的设计，结构简单，实现成本低。但它灵活性较差，只适合特定运算。特别适合于卷积运算与矩阵运算。

脉动阵列实现矩阵乘法

脉动阵列矩阵乘法示意图

源码实现


//Maximum Array Size
#define MAX_SIZE 16

__kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void mmult(
            __global int *a,    // Read-Only Matrix A
            __global int *b,    // Read-Only Matrix B
            __global int *c,    // Output Result
                  int a_row,    // Matrix A Row Size
                  int a_col,    // Matrix A Col Size
                  int b_col     // Matrix B Col Size
        )
{

    int b_row = a_col;
    int c_row = a_row;
    int c_col = b_col;

    // Local memory to store input and output matrices

    int localA[MAX_SIZE][MAX_SIZE]  __attribute__((xcl_array_partition(complete, 1)));;
    int localB[MAX_SIZE][MAX_SIZE]  __attribute__((xcl_array_partition(complete, 2)));;
    int localC[MAX_SIZE][MAX_SIZE]  __attribute__((xcl_array_partition(complete, 0)));;

    // Burst reads on input matrices from global memory
    // Read Input A
    __attribute__((xcl_pipeline_loop))
    readA: for(int loc = 0, i = 0, j = 0; loc < a_row*a_col; loc++, j++) {
        if(j == a_col) { i++; j = 0;}
        localA[i][j] = a[loc];
    }

    // Read Input B
    __attribute__((xcl_pipeline_loop))
    readB: for(int loc = 0, i = 0, j = 0; loc < b_row*b_col; loc++, j++) {
        if(j == b_col) { i++; j = 0; }
        localB[i][j] = b[loc];
    }

    // Perform systolic matrix multiply
    // local matrices localA and localB have been partitioned in dimensions
    // 1 and 2 respectively. local matrix C has been partitioned completely

    // This partitioning enables to access MAX_SIZE elements in parallel in
    // the local matrices. Because of the mode of access of array elements,
    // we are able to perform MAX_SIZE*MAX_SIZE operations in parallel.

    // Note : i, j and k loops are interchanged.

    // The top loop systolic1 runs only for a_col iterations instead of
    // MAX_SIZE like the inner loops. The inner loops have fixed loop
    // iteration counts to enable complete unroll

    // The following diagram explains how the matrix multiply happens
    //
    //        B_0        B_1        B_2        B_3
    //         |          |          |          |
    //         v          v          v          v
    //        ___        ___        ___        ___
    //       |   |      |   |      |   |      |   |
    //  A0_->|C00| ---- |C01| ---- |C02| ---- |C03|
    //       |___|      |___|      |___|      |___|
    //         |          |          |          |
    //        ___        ___        ___        ___
    //       |   |      |   |      |   |      |   |
    //  A1_->|C10| ---- |C11| ---- |C12| ---- |C13|
    //       |___|      |___|      |___|      |___|
    //         |          |          |          |
    //        ___        ___        ___        ___
    //       |   |      |   |      |   |      |   |
    //  A2_->|C20| ---- |C21| ---- |C22| ---- |C23|
    //       |___|      |___|      |___|      |___|
    //         |          |          |          |
    //        ___        ___        ___        ___
    //       |   |      |   |      |   |      |   |
    //  A3_->|C30| ---- |C31| ---- |C32| ---- |C33|
    //       |___|      |___|      |___|      |___|

    __attribute__((xcl_pipeline_loop))
    systolic1: for(int k = 0; k < a_col; k++) {
        systolic2: for(int i = 0; i < MAX_SIZE; i++) {
            systolic3: for(int j = 0; j < MAX_SIZE; j++) {

                // Get previous sum
                int last = (k==0) ? 0 : localC[i][j];

                // Update current sum
                // Handle boundary conditions
                int a_val = (i < a_row && k < a_col)? localA[i][k] : 0;
                int b_val = (k < b_row && j < b_col)? localB[k][j] : 0;
                int result = last + a_val*b_val;

                // Write back results
                localC[i][j] = result;
            }

    }

    // Burst write from output matrices to global memory
    // Burst write from matrix C
    __attribute__((xcl_pipeline_loop))
    writeC: for(int loc = 0, i = 0, j = 0; loc < c_row*c_col; loc++, j++) {
        if(j == c_col) { i++; j = 0; }
        c[loc] = localC[i][j];
    }
}

经典设计结构之脉动阵列OpenCL实现

脉动阵列架构

脉动阵列实现矩阵乘法

TODO

脉动阵列实现卷积运算

Google TPU脉动阵列架构分析

参考