SDAccel Matrix Multiplication Optimization (1)



Starting from a matrix multiplication example, this series walks through functional design and performance optimization step by step.
SDAccel Matrix Multiplication Optimization (1)
SDAccel Matrix Multiplication Optimization (2)
SDAccel Matrix Multiplication Optimization (3)
SDAccel Matrix Multiplication Optimization (4)

mmult Implementation and Optimization Steps

The steps, what each implements, and the key concepts/keywords involved:
Step 1: CPU implementation. A simple matrix multiply on the host side, used as the reference for checking results and comparing performance.
Step 2: OpenCL implementation. A device-side FPGA matrix-multiply design written in OpenCL. Key concepts: OpenCL APIs.
Step 3: Add local memory. Use local memory to reduce the number of accesses to external memory. Key concepts: kernel optimization, local memory.
Step 4: Burst read/write. Use burst transfers for more efficient data movement between DDR and local memory. Key concepts: kernel optimization, burst read/write.
Step 5: Array partitioning. Combine loop unrolling with array partitioning for better compute performance. Key concepts: array partition, loop unroll.
Keywords: xcl_pipeline_loop, xcl_array_partition(complete, dim), opencl_unroll_hint (their syntax is sketched right after this list)
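
For reference, here is a minimal sketch of how these three attributes are written in SDx OpenCL kernel code, based on ug1253. This toy kernel is not part of the project; the later parts of this series apply the attributes to mmult itself.

#define SIZE 16

kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void attr_demo(__global const int *in, __global int *out)
{
    // Local arrays split into individual registers so all elements can be
    // accessed in the same clock cycle.
    int buf[SIZE] __attribute__((xcl_array_partition(complete, 1)));
    int res[SIZE] __attribute__((xcl_array_partition(complete, 1)));

    // Pipelined read loop (one new iteration per cycle), which also lets the
    // tools infer a burst read from global memory.
    __attribute__((xcl_pipeline_loop))
    READ: for (int i = 0; i < SIZE; i++)
        buf[i] = in[i];

    // Fully unrolled compute loop: all SIZE iterations run in parallel,
    // which the partitioned arrays make possible.
    __attribute__((opencl_unroll_hint(SIZE)))
    CALC: for (int i = 0; i < SIZE; i++)
        res[i] = buf[i] * 2;

    // Pipelined write loop, inferred as a burst write back to global memory.
    __attribute__((xcl_pipeline_loop))
    WRITE: for (int i = 0; i < SIZE; i++)
        out[i] = res[i];
}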

Implementing mmult on the CPU

void mmult_cpu(int *in1,  // Input matrix 1
               int *in2,  // Input matrix 2
               int *out,  // Output matrix (out = A x B)
               int dim)   // Matrix size of one dimension
{
    //Performs matrix multiplication out = in1 x in2
    for (int i = 0; i < dim; i++) {
        for (int j = 0; j < dim; j++) {
            for (int k = 0; k < dim; k++) {
                out[i * dim + j] += in1[i * dim + k] * in2[k * dim + j];
            }
        }
    }
}

Implementing mmult on the FPGA

OpenCL host-side initialization flow

(Figure: OpenCL initialization flow)

Host-side code

//OpenCL utility layer include
#include "xcl2.hpp"
#include <iostream>
#include <vector>

//Array size to access
#define DATA_SIZE 64

uint64_t get_duration_ns(const cl::Event &event) {
    uint64_t nstimestart, nstimeend;
    event.getProfilingInfo<uint64_t>(CL_PROFILING_COMMAND_START, &nstimestart);
    event.getProfilingInfo<uint64_t>(CL_PROFILING_COMMAND_END, &nstimeend);
    return (nstimeend - nstimestart);
}

//CPU implementation of matrix multiplication
//The inputs are of the size (DATA_SIZE x DATA_SIZE)
void mmult_cpu(
    int *in1, //Input matrix 1
    int *in2, //Input matrix 2
    int *out, //Output matrix
    int dim   //One dimension of the matrix
)
{
    //Performs matrix multiply out = in1 x in2
    for (int i = 0; i < dim; i++) {
        for (int j = 0; j < dim; j++) {
            for (int k = 0; k < dim; k++) {
                out[i * dim + j] += in1[i * dim + k] * in2[k * dim + j];
            }
        }
    }
}

//Functionality to set up the OpenCL context and trigger the kernel
uint64_t mmult_fpga(
    std::vector<int, aligned_allocator<int>> &source_in1,          //Input matrix 1
    std::vector<int, aligned_allocator<int>> &source_in2,          //Input matrix 2
    std::vector<int, aligned_allocator<int>> &source_fpga_results, //Output matrix
    int dim                                                        //One dimension of the matrix
)
{
    int size = dim;
    size_t matrix_size_bytes = sizeof(int) * size * size;

    //get_xil_devices() returns the vector of Xilinx devices
    std::vector<cl::Device> devices = xcl::get_xil_devices();
    cl::Device device = devices[0];

    //Creating Context and Command Queue for the selected device
    cl::Context context(device);
    cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE);
    std::string device_name = device.getInfo<CL_DEVICE_NAME>();

    //find_binary_file() locates the OpenCL binary created by the xocc compiler
    //and import_binary_file() loads it as cl::Program::Binaries. A binary can
    //contain many kernels that can be executed on the device.
    std::string binaryFile = xcl::find_binary_file(device_name, "mmult");
    cl::Program::Binaries bins = xcl::import_binary_file(binaryFile);
    devices.resize(1);
    cl::Program program(context, devices, bins);

    //This call extracts a kernel out of the program we loaded in the
    //previous line. A kernel is an OpenCL function that is executed on the
    //FPGA. This function is defined in the src/mmult.cl file.
    cl::Kernel kernel(program, "mmult");

    //These commands allocate memory on the FPGA. The cl::Buffer objects can
    //be used to reference the memory locations on the device. The cl::Buffer
    //object cannot be referenced directly and must be passed to other
    //OpenCL functions.
    cl::Buffer buffer_in1(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY,
                          matrix_size_bytes, source_in1.data());
    cl::Buffer buffer_in2(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY,
                          matrix_size_bytes, source_in2.data());
    cl::Buffer buffer_output(context, CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY,
                             matrix_size_bytes, source_fpga_results.data());

    //These commands load the source_in1 and source_in2 vectors from the host
    //application into the buffer_in1 and buffer_in2 cl::Buffer objects. The
    //data will be transferred from system memory over PCIe to the FPGA
    //on-board DDR memory.
    q.enqueueMigrateMemObjects({buffer_in1, buffer_in2}, 0 /* 0 means from host */);

    //Set the kernel arguments
    int narg = 0;
    kernel.setArg(narg++, buffer_in1);
    kernel.setArg(narg++, buffer_in2);
    kernel.setArg(narg++, buffer_output);
    kernel.setArg(narg++, size);

    cl::Event event;
    uint64_t kernel_duration = 0;

    //Launch the kernel
    q.enqueueTask(kernel, NULL, &event);

    //The result of the previous kernel execution needs to be retrieved in
    //order to view the results. This call writes the data from the
    //buffer_output cl::Buffer object back into the source_fpga_results vector.
    q.enqueueMigrateMemObjects({buffer_output}, CL_MIGRATE_MEM_OBJECT_HOST);
    q.finish();

    kernel_duration = get_duration_ns(event);

    return kernel_duration;
}

int main(int argc, char **argv)
{
    //Allocate memory in host memory
    int size = DATA_SIZE;
    size_t matrix_size_bytes = sizeof(int) * size * size;

    //When creating a buffer with a user pointer, under the hood the user
    //pointer is used if and only if it is properly aligned (page aligned).
    //When it is not aligned, the runtime has no choice but to create its own
    //host-side buffer that backs the user pointer. This in turn implies that
    //every operation that moves data to/from the device incurs an extra
    //memcpy between the runtime's host buffer and the user pointer. So it is
    //recommended to use this aligned_allocator when creating buffer/memory
    //objects, which aligns the user buffer to the page boundary and ensures
    //the user buffer is used directly.
    std::vector<int, aligned_allocator<int>> source_in1(matrix_size_bytes);
    std::vector<int, aligned_allocator<int>> source_in2(matrix_size_bytes);
    std::vector<int, aligned_allocator<int>> source_fpga_results(matrix_size_bytes);
    std::vector<int, aligned_allocator<int>> source_cpu_results(matrix_size_bytes);

    //Create the test data
    for (int i = 0; i < DATA_SIZE * DATA_SIZE; i++) {
        source_in1[i] = i;
        source_in2[i] = i * i;
        source_cpu_results[i] = 0;
        source_fpga_results[i] = 0;
    }

    uint64_t kernel_duration = 0;

    //Compute CPU results
    mmult_cpu(source_in1.data(), source_in2.data(), source_cpu_results.data(), size);

    //Compute FPGA results
    kernel_duration = mmult_fpga(source_in1, source_in2, source_fpga_results, size);

    //Compare the FPGA results with the CPU results
    bool match = true;
    for (int i = 0; i < size * size; i++) {
        if (source_fpga_results[i] != source_cpu_results[i]) {
            std::cout << "Error: Result mismatch" << std::endl;
            std::cout << "i = " << i << " CPU result = " << source_cpu_results[i]
                      << " FPGA result = " << source_fpga_results[i] << std::endl;
            match = false;
            break;
        }
    }

    std::cout << "TEST " << (match ? "PASSED" : "FAILED") << std::endl;

    std::cout << "Wall Clock Time (Kernel execution): " << kernel_duration << std::endl;
    std::cout << "Note: Wall Clock Time is meaningful for real hardware execution only, "
              << "not for emulation." << std::endl;

    return (match ? EXIT_SUCCESS : EXIT_FAILURE);
}

Device-side code (a naive mmult implementation)


kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void mmult(__global int *in1, //Read-only input matrix 1
           __global int *in2, //Read-only input matrix 2
           __global int *out, //Output matrix
           int dim            //One dimension of the matrix
)
{
    //Reads the data from DDR, performs the computation,
    //and writes the result back to DDR.
    LOOP1: for (int i = 0; i < dim; i++) {
        LOOP2: for (int j = 0; j < dim; j++) {
            out[i * dim + j] = 0;
            LOOP3: for (int k = 0; k < dim; k++) {
                out[i * dim + j] += in1[i * dim + k] * in2[k * dim + j];
            }
        }
    }
}

Analysis of the experimental results

  • Vivado HLS log file analysis (pay close attention to the WARNING messages)
WARNING: [XFORM 203-542] Cannot flatten a loop nest 'LOOP2' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:47:44) in function 'mmult' :
WARNING: [XFORM 203-542] the outer loop is not a perfect loop.
INFO: [XFORM 203-541] Flattening a loop nest 'LOOP1' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:45:43) in function 'mmult'.
INFO: [HLS 200-111] Finished Architecture Synthesis Time (s): cpu = 00:00:00.77 ; elapsed = 00:00:00.88 . Memory (MB): peak = 494.320 ; gain = 156.758 ; free physical = 19872 ; free virtual = 45217
INFO: [HLS 200-10] Starting hardware synthesis ...
INFO: [HLS 200-10] Synthesizing 'mmult' ...
WARNING: [SYN 201-107] Renaming port name 'mmult/out' to 'mmult/out_r' to avoid the conflict with HDL keywords or other object names.
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [HLS 200-42] -- Implementing module 'mmult'
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [SCHED 204-11] Starting scheduling ...
INFO: [SCHED 204-61] Pipelining loop 'LOOP3'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 1, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 2, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 3, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 4, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 130, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 193, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 225, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 241, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 249, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 253, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 255, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 256, distance = 1, offset = 0)
between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
INFO: [SCHED 204-61] Unable to satisfy pipeline directive: Unable to pipeline the region.
INFO: [SCHED 204-11] Finished scheduling.
  • HLS Report


  • Synthesis result analysis

How to read the synthesis results:
* First, check whether the optimization directives you added were actually implemented by synthesis; if not, find out why.
* Next, look at how the code was pipelined. For nested for loops, SDAccel fully unrolls every loop nested inside a pipelined loop, and tries to flatten the loops outside the pipelined one; if flattening succeeds, they are merged into a single pipeline (see the sketch right after this list).
* Finally, for each pipelined loop, check its II (initiation interval) and how far it could theoretically be reduced.
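
To illustrate the flattening behavior, here is a stand-alone sketch (not this project's kernel): the inner loop carries the pipeline attribute, and because the nest is perfect HLS can flatten the outer loop into the same pipeline; had the attribute been placed on the outer loop instead, the inner loop would have been fully unrolled.

#define N 16

kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void copy2d(__global const int *in, __global int *out)
{
    // Perfect nest: no statements between the two loop headers and constant
    // bounds, so HLS can flatten OUTER/INNER into a single counted loop and
    // keep the pipeline running across all N*N iterations.
    OUTER: for (int i = 0; i < N; i++) {
        __attribute__((xcl_pipeline_loop))
        INNER: for (int j = 0; j < N; j++) {
            out[i * N + j] = in[i * N + j];
        }
    }
}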

From the log above, the synthesized hardware has several problems:
* First, the kernel code contains no optimization directives yet, so there is nothing to check about whether directives were honored.
* Second, of the three nested for loops, only the innermost loop LOOP3 gets a pipeline. The middle level is not flattened, and the reason given is "the outer loop is not a perfect loop". LOOP1 and LOOP2 are then still candidates for flattening, and when that succeeds they are merged into LOOP1_LOOP2. In general, flattening fails for one of two reasons: either the outer loop body contains statements other than the inner loop, or the inner loop bounds are variables (the loop-nest types are shown in the figure below). In this example LOOP2 cannot be flattened with LOOP3 for the first reason: the body of LOOP2 contains the statement out[i * dim + j] = 0;, and the out array is also accessed inside LOOP3. Put differently, if the compiler did flatten LOOP2 and LOOP3, it would not know how to merge the out[i * dim + j] = 0 statement with the inner loop body inside a single loop.
(Figure: loop nest classes)
* Finally, for LOOP3, the loop HLS tries to pipeline, the log shows the achievable II is far too large, so the pipeline cannot be realized; the cause is a carried dependence on the gmem interface. As a result, none of the loops ends up pipelined. A sketch of how a private accumulator removes this dependence follows after this list.
For more on the carried dependence on gmem, see my other article "gmem carry dependency analysis".
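
To make the problem concrete, here is a minimal sketch of the direction the following parts take (it is not yet the optimized kernel developed in parts two to four): accumulating into a private register instead of reading and writing out[] in every LOOP3 iteration removes the carried dependence on the gmem port, so LOOP3 becomes pipelineable; sharing one memory port between in1 and in2 may still limit the achievable II until local buffers are introduced.

kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void mmult_acc(__global int *in1, __global int *in2,
               __global int *out, int dim)
{
    LOOP1: for (int i = 0; i < dim; i++) {
        LOOP2: for (int j = 0; j < dim; j++) {
            int acc = 0;                    // private accumulator instead of out[]
            LOOP3: for (int k = 0; k < dim; k++) {
                acc += in1[i * dim + k] * in2[k * dim + j];
            }
            out[i * dim + j] = acc;         // single write per result element
        }
    }
}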

  • Hardware emulation results

  • Hardware implementation results

References

Xilinx GitHub: Xilinx/SDAccel_Examples, cpu_to_fpga
UG1253: SDx Pragma Reference Guide, 2017.2
UG1207: SDAccel Environment Optimization Guide
