SDAccel Matrix Multiplication Optimization (Part 3)



Starting from a matrix multiplication example, this series walks step by step through functional design and performance optimization.
SDAccel Matrix Multiplication Optimization (Part 1)
SDAccel Matrix Multiplication Optimization (Part 2)
SDAccel Matrix Multiplication Optimization (Part 3)
SDAccel Matrix Multiplication Optimization (Part 4)

mmult Implementation and Optimization Steps

The steps, the function each implements, and the key concepts/keywords involved:

1. CPU implementation: a simple matrix multiplication on the host side, used as a reference for data verification and performance comparison.
2. OpenCL implementation: an FPGA matrix-multiplication hardware design on the device side, based on OpenCL.
   Key Concepts
   - OpenCL APIs
3. Add Local Memory: use Local Memory to reduce the number of global memory accesses.
   Key Concepts
   - Kernel Optimization
   - Local Memory
4. Burst read/write: use burst transfers for more efficient read/write access between DDR and Local Memory.
   Key Concepts
   - Kernel Optimization
   - Burst Read/Write
5. Array partition: use loop unrolling and array partitioning to achieve better compute performance.
   Key Concepts
   - Array Partition
   - Loop Unroll
   Keywords
   - xcl_pipeline_loop
   - xcl_array_partition(complete, dim)
   - opencl_unroll_hint
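
For reference, the SDAccel OpenCL attribute syntax behind these keywords (documented in ug1253) looks roughly like the minimal sketch below; the tiny demo kernel, its array, and its bounds are placeholders for illustration, not code from this series.

//Minimal demo kernel (hypothetical, for syntax only) showing the
//attributes behind the keywords listed above; see ug1253 for details.
kernel void attr_demo(__global int* out)
{
    //Partition the local array completely along dimension 1.
    __local int buf[64] __attribute__((xcl_array_partition(complete, 1)));

    //Fully unroll the initialization loop.
    __attribute__((opencl_unroll_hint))
    for(int k = 0; k < 64; k++)
        buf[k] = k;

    //Pipeline the accumulation loop (target II = 1).
    int acc = 0;
    __attribute__((xcl_pipeline_loop))
    for(int k = 0; k < 64; k++)
        acc += buf[k];

    out[0] = acc;
}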

Design Analysis and Optimization Approach 2 (Burst Read/Write)

Building on the Local Memory implementation from Part 2, this step optimizes the kernel further, mainly to resolve the gmem carry dependency problem. We deliberately avoid the Max Memory Ports approach here: multiple memory interfaces would consume a large amount of LUT resources and severely limit the achievable clock frequency. As analyzed previously, the gmem carry dependency arises because both input matrices are read in the same for loop, so the pipelined loop has to issue reads to two different buffers through the one gmem port in every iteration. In matrix multiplication nothing forces the two inputs to be read together; by splitting them into two separate loops, each loop streams from a single contiguous region and the tool can infer burst transfers, as the sketch below illustrates.
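
For contrast, the read loop of the Local Memory version from Part 2 looked roughly like this (a simplified sketch, not the verbatim code):

//Sketch of the previous (Local Memory) version's read loop: both
//inputs are read through the same gmem interface in one pipelined
//loop, which is what caused the carry dependency on 'gmem'.
read_both: for(int iter = 0, i = 0, j = 0; iter < dim * dim; iter++, j++){
    if(j == dim){ j = 0; i++; }
    local_in1[i][j] = in1[iter];
    local_in2[i][j] = in2[iter];
}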

Code Implementation


#define MAX_SIZE 64

kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void mmult( __global int* in1, //Read-only input matrix1
            __global int* in2, //Read-only input matrix2
            __global int* out, //Output matrix
            int dim            //One dimension of the matrix
          )
{
    //Local memory to store input matrices
    //Local memory is implemented as BRAM memory blocks
    __local int local_in1[MAX_SIZE][MAX_SIZE];
    __local int local_in2[MAX_SIZE][MAX_SIZE];
    __local int local_out[MAX_SIZE][MAX_SIZE];

    //Burst reads on input matrices from DDR memory
    //Burst read for matrix local_in1 and local_in2
    read_in1: for(int iter = 0, i = 0, j = 0; iter < dim * dim; iter++, j++){
        if(j == dim){ j = 0; i++; }
        local_in1[i][j] = in1[iter];
    }
    read_in2: for(int iter = 0, i = 0, j = 0; iter < dim * dim; iter++, j++){
        if(j == dim){ j = 0; i++; }
        local_in2[i][j] = in2[iter];
    }

    //Reads the input data from local memory, performs the computation
    //and writes the result to local memory
    for(int i = 0; i < dim; i++){
        for(int j = 0; j < dim; j++){
            local_out[i][j] = 0;
            write_data: for(int k = 0; k < dim; k++){
                local_out[i][j] += local_in1[i][k] * local_in2[k][j];
            }
        }
    }

    //Burst write from local_out to DDR memory
    write_out: for(int iter = 0, i = 0, j = 0; iter < dim * dim; iter++, j++){
        if(j == dim){ j = 0; i++; }
        out[iter] = local_out[i][j];
    }
}
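
For completeness, a minimal host-side sketch of how this kernel could be launched through the standard OpenCL APIs. The context, queue, and program (built from the .xclbin) are assumed to exist already; h_in1/h_in2/h_out are hypothetical host buffers of dim*dim ints, with dim <= MAX_SIZE.

#include <CL/cl.h>

//Minimal host-side sketch (assumptions noted above); error checks omitted.
void run_mmult(cl_context context, cl_command_queue queue, cl_program program,
               const int* h_in1, const int* h_in2, int* h_out, int dim)
{
    cl_int err;
    size_t bytes = (size_t)dim * dim * sizeof(int);

    cl_mem d_in1 = clCreateBuffer(context, CL_MEM_READ_ONLY,  bytes, NULL, &err);
    cl_mem d_in2 = clCreateBuffer(context, CL_MEM_READ_ONLY,  bytes, NULL, &err);
    cl_mem d_out = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, &err);

    clEnqueueWriteBuffer(queue, d_in1, CL_TRUE, 0, bytes, h_in1, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, d_in2, CL_TRUE, 0, bytes, h_in2, 0, NULL, NULL);

    cl_kernel kernel = clCreateKernel(program, "mmult", &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in1);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_in2);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_out);
    clSetKernelArg(kernel, 3, sizeof(int),    &dim);

    //reqd_work_group_size(1,1,1): the kernel runs as a single work-item task.
    clEnqueueTask(queue, kernel, 0, NULL, NULL);

    clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, bytes, h_out, 0, NULL, NULL);
    clFinish(queue);

    clReleaseKernel(kernel);
    clReleaseMemObject(d_in1);
    clReleaseMemObject(d_in2);
    clReleaseMemObject(d_out);
}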

Analysis of Experimental Results

  • Vivado HLS log file analysis

WARNING: [XFORM 203-542] Cannot flatten a loop nest 'Loop-3.1' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:132:22) in function 'mmult' :
WARNING: [XFORM 203-542] the outer loop is not a perfect loop.
INFO: [XFORM 203-541] Flattening a loop nest 'Loop-3' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:131:18) in function 'mmult'.
INFO: [XFORM 203-811] Inferring bus burst read of variable length on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:122:9).
INFO: [XFORM 203-811] Inferring bus burst read of variable length on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:126:9).
INFO: [XFORM 203-811] Inferring bus burst write of variable length on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:143:9).
INFO: [HLS 200-111] Finished Architecture Synthesis Time (s): cpu = 00:00:00.79 ; elapsed = 00:00:00.89 . Memory (MB): peak = 494.320 ; gain = 156.758 ; free physical = 19833 ; free virtual = 45208
INFO: [HLS 200-10] Starting hardware synthesis ...
INFO: [HLS 200-10] Synthesizing 'mmult' ...
WARNING: [SYN 201-107] Renaming port name 'mmult/out' to 'mmult/out_r' to avoid the conflict with HDL keywords or other object names.
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [HLS 200-42] -- Implementing module 'mmult'
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [SCHED 204-11] Starting scheduling ...
INFO: [SCHED 204-61] Pipelining loop 'read_in1'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 3.
INFO: [SCHED 204-61] Pipelining loop 'read_in2'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 3.
INFO: [SCHED 204-61] Pipelining loop 'write_data'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 8.
INFO: [SCHED 204-61] Pipelining loop 'write_out'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 4.
INFO: [SCHED 204-11] Finished scheduling.
  • HLS Report


  • Analysis of the synthesis results
    * First, this version of the kernel contains no optimization attributes, so there are no directives whose application needs to be verified.
    * Second, compared with the Local Memory version, the Burst Read/Write version simply splits the two input reads, which previously shared one loop body, into two separate for loops. From the pipelining point of view: the two read loops pipeline successfully; in the compute section only the innermost write_data loop pipelines, and while the two outermost loops flatten successfully ('Loop-3'), write_data and the loop one level above it cannot be flattened ('Loop-3.1') because the initialization of local_out[i][j] sits in the loop body between them, so the nest as a whole cannot be pipelined (a perfect-nest variant is sketched after this list); the final write_out loop pipelines successfully.
    * In terms of the achieved II: the read loops reach II = 1 after pipelining, which resolves the earlier gmem carry dependency; the write_data and write_out loops also reach II = 1.
  • Hardware emulation results

  • Hardware implementation results
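
If one wanted the whole compute nest to flatten as well, a common trick is to make the nest perfect by folding the initialization of local_out into the innermost loop. A minimal sketch, not the code used in this series:

//Sketch: guarding the initialization on k == 0 leaves no statement
//between the j and k loop levels, so the nest becomes perfect and
//HLS could flatten and pipeline it as a whole.
for(int i = 0; i < dim; i++){
    for(int j = 0; j < dim; j++){
        for(int k = 0; k < dim; k++){
            int prod = local_in1[i][k] * local_in2[k][j];
            local_out[i][j] = (k == 0) ? prod : local_out[i][j] + prod;
        }
    }
}

Whether the flattened pipeline then reaches II = 1 still depends on the read-modify-write of local_out; this series instead attacks the compute loops with array partitioning and loop unrolling in Part 4.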

References

Xilinx GitHub: Xilinx/SDAccel_Examples, cpu_to_fpga
UG1253: SDx Pragma Reference Guide, 2017.2
UG1207: SDAccel Environment Optimization Guide
