SDAccel Matrix Multiplication Optimization (Part 4)



This series walks through a matrix multiplication example step by step, from functional design to performance optimization.
SDAccel Matrix Multiplication Optimization (Part 1)
SDAccel Matrix Multiplication Optimization (Part 2)
SDAccel Matrix Multiplication Optimization (Part 3)
SDAccel Matrix Multiplication Optimization (Part 4)

mmult Implementation and Optimization Steps

Step — what it implements — key concepts / keywords:

1. CPU implementation: a simple matrix multiplication on the host side, used as the reference for result verification and performance comparison.
2. OpenCL implementation: the FPGA matrix multiplication hardware design on the device side, based on OpenCL.
   Key Concepts: OpenCL APIs
3. Adding Local Memory: use Local Memory to reduce the number of external memory accesses.
   Key Concepts: Kernel Optimization, Local Memory
4. Burst read/write: use burst transfers to move data between DDR and Local Memory more efficiently.
   Key Concepts: Kernel Optimization, Burst Read/Write
5. Array partitioning: achieve better compute performance through loop unrolling and array partitioning.
   Key Concepts: Array Partition, Loop Unroll
   Keywords: xcl_pipeline_loop, xcl_array_partition(complete, dim), opencl_unroll_hint

Design Analysis and Optimization Approach 3 (Directive Optimization)

After the previous two optimizations, the overall structure of the code is in good shape. The remaining problem is that in the nested for loops of the matrix computation only the innermost level is pipelined: the outer for loops cannot be flattened into the inner one, so the two outer loop levels are not pipelined. The most direct way to solve this is to fully unroll the innermost for loop, which further increases the parallelism of the computation. However, to unroll that loop, the arrays it accesses must be partitioned; otherwise the unroll cannot take effect. Therefore three directives are used: the innermost for loop is unrolled, the arrays used in the computation are partitioned (array partition), and the second-outermost for loop is pipelined.

Code Implementation


#define MAX_SIZE 64

kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void mmult(__global int* in1,  // Read-only input matrix 1
           __global int* in2,  // Read-only input matrix 2
           __global int* out,  // Output matrix
           int dim             // One dimension of the matrix
          )
{
    // Local memory to store the input and output matrices;
    // local memory is implemented as BRAM memory blocks.
    // Complete partition on dim 2 for in1, dim 1 for in2 and dim 2 for out.
    __local int local_in1[MAX_SIZE][MAX_SIZE] __attribute__((xcl_array_partition(complete, 2)));
    __local int local_in2[MAX_SIZE][MAX_SIZE] __attribute__((xcl_array_partition(complete, 1)));
    __local int local_out[MAX_SIZE][MAX_SIZE] __attribute__((xcl_array_partition(complete, 2)));

    // Burst reads of the input matrices from DDR memory into local_in1 and local_in2
    read_in1: for (int iter = 0, i = 0, j = 0; iter < dim * dim; iter++, j++) {
        if (j == dim) { j = 0; i++; }
        local_in1[i][j] = in1[iter];
    }
    read_in2: for (int iter = 0, i = 0, j = 0; iter < dim * dim; iter++, j++) {
        if (j == dim) { j = 0; i++; }
        local_in2[i][j] = in2[iter];
    }

    // Functionally, "loop_3" only needs "dim" iterations, but for "loop_2"
    // to pipeline, "loop_3" must be fully unrolled, and an unrolled loop
    // cannot have a dynamic trip count, so it runs MAX_SIZE iterations.
    // This gives better throughput at the cost of additional resources.
    loop_1: for (int i = 0; i < dim; i++) {
        __attribute__((xcl_pipeline_loop))
        loop_2: for (int j = 0; j < dim; j++) {
            local_out[i][j] = 0;
            __attribute__((opencl_unroll_hint))
            loop_3: for (int k = 0; k < MAX_SIZE; k++) {
                local_out[i][j] += local_in1[i][k] * local_in2[k][j];
            }
        }
    }

    // Burst write from local_out back to DDR memory
    write_out: for (int iter = 0, i = 0, j = 0; iter < dim * dim; iter++, j++) {
        if (j == dim) { j = 0; i++; }
        out[iter] = local_out[i][j];
    }
}
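For reference, a minimal host-side launch sketch follows. It assumes the cl_kernel, cl_command_queue and the three cl_mem buffers have already been created and populated as in Part 2 of this series; the names queue, kernel, d_in1, d_in2, d_out and the helper run_mmult are illustrative, not taken from the original host code.

#include <CL/cl.h>

// Illustrative helper: set the four kernel arguments and run the single
// work-item mmult kernel (reqd_work_group_size(1,1,1)), then wait for it.
void run_mmult(cl_command_queue queue, cl_kernel kernel,
               cl_mem d_in1, cl_mem d_in2, cl_mem d_out, int dim)
{
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in1);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_in2);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_out);
    clSetKernelArg(kernel, 3, sizeof(int),    &dim);   // dim must not exceed MAX_SIZE (64)
    clEnqueueTask(queue, kernel, 0, NULL, NULL);        // enqueue as a task (single work-item)
    clFinish(queue);                                    // block until the kernel completes
}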
  • xcl_pipeline_loop

Loop pipelining allows operations from different loop iterations to execute in parallel. In part A of the referenced figure (default sequential execution), successive read operations are 3 clock cycles apart (II = 3) and the last write completes 8 clock cycles after the first read. Part B shows the pipelined version: successive reads are only 1 cycle apart (II = 1) and the last write completes 4 cycles after the first read, so with the same resources both the initiation interval and the latency of the loop are reduced.
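A minimal usage sketch (a hypothetical copy kernel, not part of the mmult design above): the attribute is written immediately before the loop that should be pipelined.

// Hypothetical example: pipeline a simple copy loop so that a new iteration
// starts every clock cycle (II = 1) instead of waiting for the previous one.
kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void vcopy(__global const int* src, __global int* dst, int size)
{
    __attribute__((xcl_pipeline_loop))
    copy: for (int i = 0; i < size; i++) {
        dst[i] = src[i];   // the read for iteration i+1 overlaps with the write of iteration i
    }
}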

  • xcl_array_partition

Arrays can be partitioned in three ways: block, cyclic, and complete. For block and cyclic, the factor argument determines how many independent RAM banks the array is split into (e.g. factor = 2 in the original figure). block splits the array into factor banks of consecutive elements; cyclic deals the elements round-robin (interleaved) into factor banks; complete breaks the array down into individual elements (registers).
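As a sketch, the three forms can be written as follows on a hypothetical 1-D local buffer (not taken from the mmult kernel); the attribute arguments follow the (type, factor, dimension) form described in UG1253:

// Hypothetical 1-D examples of the three partition types (factor = 4, dimension = 1),
// declared inside a kernel body just like local_in1 above.
__local int buf_block [16] __attribute__((xcl_array_partition(block, 4, 1)));    // 4 banks of 4 consecutive elements
__local int buf_cyclic[16] __attribute__((xcl_array_partition(cyclic, 4, 1)));   // banks hold elements 0,4,8,12 / 1,5,9,13 / ...
__local int buf_reg   [16] __attribute__((xcl_array_partition(complete, 1)));    // 16 individual registers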

For a multi-dimensional array, the partition is applied along a chosen dimension of the array.
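This is exactly how the kernel above uses the dimension argument: local_in1 is partitioned on dimension 2 and local_in2 on dimension 1, so that when loop_3 is unrolled a whole row of local_in1 and a whole column of local_in2 can be read in the same cycle.

// From the kernel above: partitioning dim 2 of local_in1 makes a full row
// local_in1[i][0..MAX_SIZE-1] readable in parallel; partitioning dim 1 of
// local_in2 makes a full column local_in2[0..MAX_SIZE-1][j] readable in parallel.
__local int local_in1[MAX_SIZE][MAX_SIZE] __attribute__((xcl_array_partition(complete, 2)));
__local int local_in2[MAX_SIZE][MAX_SIZE] __attribute__((xcl_array_partition(complete, 1)));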

Experimental Results Analysis

  • Vivado HLS log file analysis

WARNING: [XFORM 203-104] Completely partitioning array 'local_in1' accessed through non-constant indices on dimension 2 (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:170:9), which may result in long runtime and suboptimal QoR due to large multiplexers. Please consider wrapping the array access into a function or using a register file core instead.
INFO: [XFORM 203-101] Partitioning array 'local_in1' in dimension 2 completely.
WARNING: [XFORM 203-104] Completely partitioning array 'local_in2' accessed through non-constant indices on dimension 1 (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:174:9), which may result in long runtime and suboptimal QoR due to large multiplexers. Please consider wrapping the array access into a function or using a register file core instead.
INFO: [XFORM 203-101] Partitioning array 'local_in2' in dimension 1 completely.
WARNING: [XFORM 203-104] Completely partitioning array 'local_out' accessed through non-constant indices on dimension 2 (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:196:9), which may result in long runtime and suboptimal QoR due to large multiplexers. Please consider wrapping the array access into a function or using a register file core instead.
INFO: [XFORM 203-101] Partitioning array 'local_out' in dimension 2 completely.
INFO: [XFORM 203-11] Balancing expressions in function 'mmult' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:44:41)...125 expression(s) balanced.
INFO: [HLS 200-111] Finished Pre-synthesis Time (s): cpu = 00:00:06 ; elapsed = 00:00:06 . Memory (MB): peak = 494.324 ; gain = 156.758 ; free physical = 19781 ; free virtual = 45173
INFO: [XFORM 203-541] Flattening a loop nest 'loop_1' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:182:42) in function 'mmult'.
INFO: [XFORM 203-811] Inferring bus burst read of variable length on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:170:9).
INFO: [XFORM 203-811] Inferring bus burst read of variable length on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:174:9).
INFO: [XFORM 203-811] Inferring bus burst write of variable length on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:196:9).
INFO: [HLS 200-111] Finished Architecture Synthesis Time (s): cpu = 00:00:07 ; elapsed = 00:00:07 . Memory (MB): peak = 494.324 ; gain = 156.758 ; free physical = 19781 ; free virtual = 45174
INFO: [HLS 200-10] Starting hardware synthesis ...
INFO: [HLS 200-10] Synthesizing 'mmult' ...
WARNING: [SYN 201-107] Renaming port name 'mmult/out' to 'mmult/out_r' to avoid the conflict with HDL keywords or other object names.
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [HLS 200-42] -- Implementing module 'mmult'
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [SCHED 204-11] Starting scheduling ...
INFO: [SCHED 204-61] Pipelining loop 'read_in1'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 3.
INFO: [SCHED 204-61] Pipelining loop 'read_in2'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 3.
INFO: [SCHED 204-61] Pipelining loop 'loop_1_loop_2'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 12.
INFO: [SCHED 204-61] Pipelining loop 'write_out'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 5.
INFO: [SCHED 204-11] Finished scheduling.
  • HLS Report


  • Synthesis result analysis
    * First, the optimization directives in the kernel code are handled without problems: the log shows that the three arrays are each completely partitioned along the specified dimension, and the innermost loop is then successfully unrolled.
    * Second, compared with the Burst Read/Write version of the matrix multiplication, this version mainly adds the two optimization directives that parallelize the inner loops. From the pipelining point of view: the first group of for loops (read_in1, read_in2) pipelines successfully; in the second loop nest, the innermost loop_3 is fully unrolled, loop_1 is then flattened into loop_2 (reported as loop_1_loop_2), and the flattened nest is pipelined; the third for loop (write_out) also pipelines successfully.
    * In terms of the achieved initiation interval, every pipelined for loop reaches II = 1 (see the rough cycle estimate after this list).
  • Hardware emulation results
  • Hardware implementation results
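From the II and Depth values reported in the log, a rough cycle count for the flattened compute loop can be estimated using the standard pipelined-loop latency formula. This is an estimate under the assumption dim = 64 (= MAX_SIZE), not a number reported by the tools:

  trip count of loop_1_loop_2 = dim × dim = 64 × 64 = 4096
  pipelined latency ≈ (trip count − 1) × II + Depth = 4095 × 1 + 12 = 4107 cycles

Without pipelining, the same nest would need roughly trip count × iteration latency cycles, which is why achieving II = 1 across all loops matters.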

Summary

At this point we have analyzed a complete SDAccel example, from the CPU implementation to the FPGA implementation, optimizing the design step by step. FPGA optimization falls into two main directions: one targets bandwidth and data throughput, the other targets compute performance. During design, newcomers often consider only compute performance and keep raising the degree of parallelism, but doing so frequently leaves the design starved for bandwidth and with low data throughput, so the parallel hardware is throttled by memory access and the expected speedup never materializes. In this matrix multiplication example we first built the basic functionality; the Local Memory and Burst Read/Write optimizations then improved memory access efficiency and thus data throughput; only the final directive-level optimization targets compute performance itself.

References

Xilinx GitHub: Xilinx/SDAccel_Examples, cpu_to_fpga
UG1253: SDx Pragma Reference Guide, 2017.2
UG1207: SDAccel Environment Optimization Guide
