SDAccel Matrix Multiplication Optimization (Part 2)

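This series walks through functional design and performance optimization step by step, using matrix multiplication as the running example.
SDAccel Matrix Multiplication Optimization (Part 1)
SDAccel Matrix Multiplication Optimization (Part 2)
SDAccel Matrix Multiplication Optimization (Part 3)
SDAccel Matrix Multiplication Optimization (Part 4)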

mmult implementation and optimization steps

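Step	Function	Key Concepts / Keywords
1. CPU implementation	A simple matrix multiplication on the `host`, used as the reference for checking results and comparing performance
2. OpenCL implementation	An OpenCL-based FPGA matrix multiplication hardware design on the `device`	Key Concepts: OpenCL APIs
3. Add `Local Memory`	Use `Local Memory` to reduce the number of memory accesses	Key Concepts: Kernel Optimization, Local Memory
4. Burst read/write	Use burst transfers to move data more efficiently between `DDR` and `Local Memory`	Key Concepts: Kernel Optimization, Burst Read/Write
5. Array partitioning	Use loop unrolling and array partitioning to achieve better compute performance	Key Concepts: Array Partition, Loop Unroll. Keywords: xcl_pipeline_loop, xcl_array_partition(complete, dim), opencl_unroll_hint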

Analysis and optimization approach 1 (Local Memory)

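We start by optimizing memory access. The original matrix multiplication is simple, but it exchanges data with DDR constantly during the computation, and every DDR access from the FPGA is costly in both time and power. We therefore allocate local storage on the FPGA: first copy the input data from DDR into on-chip memory, then run the computation with all operands indexed from on-chip memory, and finally copy the finished results back to DDR in one pass. This way the on-chip computation is no longer repeatedly held back by slow DDR accesses.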
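To make the access-count argument concrete, here is a minimal host-side C sketch that counts simulated DDR accesses for both approaches. The `ddr_read` and `ddr_write` helpers and the `ddr_accesses` counter are hypothetical stand-ins for global-memory traffic, and `mmult_naive` is only one assumed form of an unbuffered kernel; neither is taken from the SDAccel example itself.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SIZE 64

/* Hypothetical counter that stands in for DDR (global memory) traffic. */
static size_t ddr_accesses = 0;

static int ddr_read(const int *p, size_t idx) { ddr_accesses++; return p[idx]; }
static void ddr_write(int *p, size_t idx, int v) { ddr_accesses++; p[idx] = v; }

/* Unbuffered version: every operand is fetched from "DDR" in the inner loop. */
void mmult_naive(const int *in1, const int *in2, int *out, int dim)
{
    for (int i = 0; i < dim; i++) {
        for (int j = 0; j < dim; j++) {
            int acc = 0;
            for (int k = 0; k < dim; k++)
                acc += ddr_read(in1, i * dim + k) * ddr_read(in2, k * dim + j);
            ddr_write(out, i * dim + j, acc);
        }
    }
}

/* Buffered version: copy in once, compute on local arrays, copy out once. */
void mmult_local(const int *in1, const int *in2, int *out, int dim)
{
    static int a[MAX_SIZE][MAX_SIZE], b[MAX_SIZE][MAX_SIZE], c[MAX_SIZE][MAX_SIZE];
    assert(dim > 0 && dim <= MAX_SIZE);

    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++) {
            a[i][j] = ddr_read(in1, i * dim + j);
            b[i][j] = ddr_read(in2, i * dim + j);
        }

    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++) {
            c[i][j] = 0;
            for (int k = 0; k < dim; k++)
                c[i][j] += a[i][k] * b[k][j];   /* no DDR traffic here */
        }

    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++)
            ddr_write(out, i * dim + j, c[i][j]);
}
```

Under these assumptions the unbuffered version performs 2·dim³ + dim² accesses and the buffered one 3·dim². For dim = 64 that is 528,384 versus 12,288 simulated DDR accesses.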

Code implementation


#define MAX_SIZE 64

kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void mmult( __global int* in1,  //Read-only input matrix1
            __global int* in2,  //Read-only input matrix2
            __global int* out,  //Output matrix
            int dim             //One dimension of the matrix
          )
{
    //Local memory to store input matrices
    //Local memory is implemented as BRAM memory blocks
    //A MAX_SIZE * MAX_SIZE buffer is created because the size
    //needs to be known at compile time
    __local int local_in1[MAX_SIZE][MAX_SIZE];
    __local int local_in2[MAX_SIZE][MAX_SIZE];
    __local int local_out[MAX_SIZE][MAX_SIZE];

    //Read the input data from DDR memory to local memory
    read_in1: for(int iter = 0, i = 0, j = 0; iter < dim * dim; iter++, j++){
        if(j == dim){ j = 0; i++; }
        local_in1[i][j] = in1[iter];
        local_in2[i][j] = in2[iter];
    }

    //Reads the input_data from local memory, performs the computations
    //and writes the data to local memory
    for(int i = 0; i < dim; i++){
        for(int j = 0; j < dim; j++){
            local_out[i][j] = 0;
            write_data: for(int k = 0; k < dim; k++){
                local_out[i][j] += local_in1[i][k] * local_in2[k][j];
            }
        }
    }

    //Write the data from local memory to DDR memory
    write_out: for(int iter = 0, i = 0, j = 0; iter < dim * dim; iter++, j++){
        if(j == dim){ j = 0; i++; }
        out[iter] = local_out[i][j];
    }
}
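The read_in1 and write_out loops walk a single linear index `iter` and track the row and column counters by hand, instead of nesting two loops or computing `iter / dim` and `iter % dim`. The host-side C sketch below isolates that index pattern so it can be checked on its own; `copy_linear_to_2d` is a hypothetical helper name used only for illustration, not part of the kernel.

```c
#include <assert.h>

#define MAX_SIZE 64

/* Hypothetical helper: copies a dim x dim row-major array into a fixed-size
 * 2D buffer using the same single-loop index pattern as read_in1. */
void copy_linear_to_2d(const int *src, int dst[MAX_SIZE][MAX_SIZE], int dim)
{
    assert(dim > 0 && dim <= MAX_SIZE);
    for (int iter = 0, i = 0, j = 0; iter < dim * dim; iter++, j++) {
        if (j == dim) { j = 0; i++; }   /* wrap to the start of the next row */
        /* At this point i == iter / dim and j == iter % dim. */
        dst[i][j] = src[iter];
    }
}
```

Because the source is read at consecutive addresses in one loop, this form gives the tool a single loop to pipeline, with no division or modulo in the loop body.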

Analysis of experimental results

Vivado HLS log analysis


WARNING: [XFORM 203-542] Cannot flatten a loop nest 'Loop-2.1' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:86:22) in function 'mmult' :
WARNING: [XFORM 203-542] the outer loop is not a perfect loop.
INFO: [XFORM 203-541] Flattening a loop nest 'Loop-2' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:85:18) in function 'mmult'.
INFO: [XFORM 203-811] Inferring bus burst write of variable length on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:97:9).
INFO: [HLS 200-111] Finished Architecture Synthesis Time (s): cpu = 00:00:00.79 ; elapsed = 00:00:00.88 . Memory (MB): peak = 494.316 ; gain = 156.758 ; free physical = 19901 ; free virtual = 45272
INFO: [HLS 200-10] Starting hardware synthesis ...
INFO: [HLS 200-10] Synthesizing 'mmult' ...
WARNING: [SYN 201-107] Renaming port name 'mmult/out' to 'mmult/out_r' to avoid the conflict with HDL keywords or other object names.
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [HLS 200-42] -- Implementing module 'mmult'
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [SCHED 204-11] Starting scheduling ...
INFO: [SCHED 204-61] Pipelining loop 'read_in1'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 1, distance = 1, offset = 1)
   between bus request on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:80) and bus request on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:79).
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 2, Depth: 138.
INFO: [SCHED 204-61] Pipelining loop 'write_data'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 8.
INFO: [SCHED 204-61] Pipelining loop 'write_out'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 4.
INFO: [SCHED 204-11] Finished scheduling.

HLS Report
Synthesis result analysis

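* First, the hardware code contains no optimization directives, so there is no need to check whether any directive was applied.
* Next, compared with the original matrix multiplication, the Local Memory version restructures the code into three consecutive for-loop stages. In terms of pipelining: the first stage (read_in1) is pipelined successfully. In the second stage only the innermost write_data loop is pipelined. The two outermost loops are flattened successfully, but write_data cannot be flattened into the second-level loop because that loop has a statement of its own in its body (local_out[i][j] = 0), so it is not a perfect loop nest and the stage as a whole cannot be pipelined. The third stage (write_out) is pipelined successfully.
* In terms of the II achieved after pipelining: the first stage ends up with II = 2, and the cause is again the carried dependence on gmem (the SCHED 204-68 warning reports two bus requests on the same 'gmem' port). The second and third stages achieve II = 1.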
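To see what the difference between II = 2 and II = 1 means in cycles, a common approximation for a pipelined loop is depth + II × (trip count − 1). The C sketch below applies it; `pipelined_loop_cycles` is a hypothetical helper, and the result is only an estimate that may differ from the HLS report by a few cycles.

```c
#include <assert.h>

/* Approximate cycle count of a pipelined loop: the first iteration takes
 * `depth` cycles, and each later iteration starts `ii` cycles after the
 * previous one. */
long pipelined_loop_cycles(long trip_count, long ii, long depth)
{
    assert(trip_count > 0 && ii > 0 && depth > 0);
    return depth + ii * (trip_count - 1);
}
```

Assuming dim = 64, so each copy loop runs 4096 iterations, the values from the log give roughly 8328 cycles for read_in1 (II = 2, depth 138) against 4233 if it reached II = 1, and roughly 4099 cycles for write_out (II = 1, depth 4).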

Hardware emulation results

Hardware implementation results

References

Xilinx GitHub: Xilinx/SDAccel_Examples/cpu_to_fpga
UG1253, SDx Pragma Reference Guide, 2017.2
UG1207, SDAccel Environment Optimization Guide