host overlap

Example Description

This example demonstrates how, within a single application, computation on the host (CPU) can be overlapped with computation on the FPGA, so that data transfers are hidden behind computation and overall transfer efficiency improves. It covers asynchronous operations and event objects.

Key Learning Points

  • Key Concepts
    • OpenCL API
    • Synchronize Host and FPGA
    • Asynchronous Processing
    • Events
    • Asynchronous memcpy
    • Double Buffering (ping-pong)
    • Burst Transfer
  • Keywords
    • cl_event
    • CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
    • clEnqueueMigrateMemObjects
    • clEnqueueMapBuffer
  • clEnqueueReadBuffer vs. clEnqueueWriteBuffer vs. clEnqueueMapBuffer
    • clEnqueueReadBuffer
      • Reads data from a cl_mem object back into host memory. Even if the cl_mem is backed directly by host memory, its contents still have to be read back this way; a cl_mem can be viewed as a higher-level wrapper around the underlying storage.
    • clEnqueueWriteBuffer
      • Writes the contents of host memory into a cl_mem object.
    • clEnqueueMapBuffer
      • Maps a region of a cl_mem object into host memory.
      • This call is special. When a buffer is created with CL_MEM_USE_HOST_PTR, the device directly uses an existing block of host memory (p1) as the buffer's backing store. After computation, the resulting cl_mem (p2) holds new values, but p1 is usually not updated automatically: although the two share the same physical storage, the cl_mem is a higher-level abstraction and is not the same thing as the host memory. To synchronize p1 with the latest contents of p2, clEnqueueMapBuffer must be called.

Host Code Analysis

/*
Overlap Host Code
There are many applications where all of the data cannot reside in an FPGA.
For example, the data may be too big to fit in the FPGA, or the data may be
streamed from a sensor or the network. In these situations data must be
transferred from the host memory to the FPGA before the computation can be
performed.
Because PCIe is a full-duplex interconnect, you can transfer data to and from
the FPGA simultaneously. Xilinx FPGAs can also perform computations during
these data transfers. Performing all three of these operations at the same
time allows you to keep the FPGA busy and take full advantage of all of the
hardware on your system.
In this example, we will demonstrate how to do this using an out-of-order
command queue.
+---------+---------+---------+---------+---------+---------+---------
| WriteA1 | WriteB1 | WriteA2 | WriteB2 | WriteA3 | WriteB3 | Wri...
+---------+---------+---------+---------+---------+---------+---------
                    | Compute1          | Compute2          | Compu...
                    +-------------------+-------------------+--------+
                                        | ReadC1 |          | ReadC2 |
                                        +--------+          +--------+
Many OpenCL commands are asynchronous. This means that whenever you call an
OpenCL function, the function may return before the operation has completed.
The asynchronous nature of OpenCL allows you to perform tasks on the host CPU
and on the FPGA simultaneously.
Memory transfer operations are asynchronous when the blocking_read or
blocking_write parameter is set to CL_FALSE. These operations act on host
memory, so it is important to make sure that the command has completed before
that memory is used.
You can make sure an operation has completed by querying the events returned
by these commands. Events are OpenCL objects that track the status of
operations. Event objects are created by kernel execution commands; by read,
write, map, and copy commands on memory objects; or by user events created
with clCreateUserEvent.
Events can be used to synchronize operations between the host thread and the
device, or between two operations in the same context. You can also use events
to time a particular operation if the command queue was created with the
CL_QUEUE_PROFILING_ENABLE flag.
Most enqueuing commands return events by accepting a cl_event pointer as their
last argument. These events can be queried using the clGetEventInfo function
to get the status of a particular operation.
Many functions also accept event lists that can be used to enforce ordering in
an OpenCL context. These event lists are especially important in the context
of out-of-order command queues, as they are the only way to specify
dependencies. Normal in-order command queues do not need this because ordering
is enforced by the order in which operations were enqueued. See the concurrent
execution example for additional details on how to create and use these types
of command queues.
*/
#include "CL/cl.h"
#include "xcl.h"

#include <algorithm>
#include <array>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <new>
#include <random>
#include <vector>

using std::array;
using std::chrono::duration;
using std::chrono::nanoseconds;
using std::chrono::seconds;
using std::default_random_engine;
using std::generate;
using std::uniform_int_distribution;
using std::vector;

// Allocator template that aligns buffers to a page boundary for faster data
// transfer
template <typename T>
struct aligned_allocator {
    using value_type = T;
    T *allocate(std::size_t num) {
        void *ptr = nullptr;
        if (posix_memalign(&ptr, 4096, num * sizeof(T)))
            throw std::bad_alloc();
        return reinterpret_cast<T *>(ptr);
    }
    void deallocate(T *p, std::size_t num) { free(p); }
};

const int ARRAY_SIZE = 1 << 14; // ARRAY_SIZE = 2^14
static const char *error_message =
    "Error: Result mismatch:\n"
    "i = %d CPU result = %d Device result = %d\n";

// Wrap any OpenCL API call that returns an error code (cl_int) with the macro
// below to quickly check for an error
#define OCL_CHECK(call)                                                      \
    do {                                                                     \
        cl_int err = call;                                                   \
        if (err != CL_SUCCESS) {                                             \
            printf("Error calling " #call ", error code is: %d\n", err);     \
            exit(EXIT_FAILURE);                                              \
        }                                                                    \
    } while (0)

int gen_random() {
    static default_random_engine e;
    static uniform_int_distribution<int> dist(0, 100);

    return dist(e);
}

// An event callback function that prints the operations performed by the
// OpenCL runtime.
void event_cb(cl_event event, cl_int cmd_status, void *data) {
    cl_command_type command;
    clGetEventInfo(event, CL_EVENT_COMMAND_TYPE, sizeof(cl_command_type),
                   &command, nullptr);
    cl_int status;
    clGetEventInfo(event, CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(cl_int),
                   &status, nullptr);
    const char *command_str;
    const char *status_str;
    switch (command) {
    case CL_COMMAND_READ_BUFFER:
        command_str = "buffer read";
        break;
    case CL_COMMAND_WRITE_BUFFER:
        command_str = "buffer write";
        break;
    case CL_COMMAND_NDRANGE_KERNEL:
        command_str = "kernel";
        break;
    case CL_COMMAND_MAP_BUFFER:
        command_str = "buffer map";
        break;
    case CL_COMMAND_COPY_BUFFER:
        command_str = "buffer copy";
        break;
    case CL_COMMAND_MIGRATE_MEM_OBJECTS:
        command_str = "buffer migrate";
        break;
    default:
        command_str = "unknown";
    }
    switch (status) {
    case CL_QUEUED:
        status_str = "Queued";
        break;
    case CL_SUBMITTED:
        status_str = "Submitted";
        break;
    case CL_RUNNING:
        status_str = "Executing";
        break;
    case CL_COMPLETE:
        status_str = "Completed";
        break;
    default:
        status_str = "unknown";
    }
    printf("[%s]: %s %s\n", reinterpret_cast<char *>(data), status_str,
           command_str);
    fflush(stdout);
}

// Sets the callback for a particular event
void set_callback(cl_event event, const char *queue_name) {
    OCL_CHECK(
        clSetEventCallback(event, CL_COMPLETE, event_cb, (void *)queue_name));
}

int main(int argc, char **argv) {
    cl_int err;

    xcl_world world = xcl_world_single();
    cl_program program = xcl_import_binary(world, "vector_addition");

    // We will break down our problem into multiple iterations. Each iteration
    // will perform computation on a subset of the entire data set.
    size_t elements_per_iteration = 2048;
    size_t bytes_per_iteration = elements_per_iteration * sizeof(int);
    size_t num_iterations = ARRAY_SIZE / elements_per_iteration; // num_iterations = 8

    // This example will use an out-of-order command queue. The default command
    // queue created by xcl_world_single is an in-order command queue. Here we
    // release the original queue and replace it with an out-of-order queue.
    clReleaseCommandQueue(world.command_queue);
    world.command_queue =
        clCreateCommandQueue(world.context, world.device_id,
                             CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

    // Allocate memory on the host and fill it with random data.
    // Generate random initial values for A and B.
    vector<int, aligned_allocator<int>> A(ARRAY_SIZE);
    vector<int, aligned_allocator<int>> B(ARRAY_SIZE);
    generate(begin(A), end(A), gen_random);
    generate(begin(B), end(B), gen_random);
    vector<int, aligned_allocator<int>> device_result(ARRAY_SIZE);

    cl_kernel kernel = xcl_get_kernel(program, "vadd");

    // These pairs of events will be used to track when a kernel is finished
    // with the input buffers. Once the kernel is finished processing the data,
    // a new set of elements will be written into the buffer.
    array<cl_event, 2> kernel_events;
    array<cl_event, 2> read_events;
    array<cl_event, 2> map_events;
    cl_mem buffer_a[2], buffer_b[2], buffer_c[2]; // double-buffer definitions
    size_t global = 1, local = 1;
    for (size_t iteration_idx = 0; iteration_idx < num_iterations; iteration_idx++) {
        int flag = iteration_idx % 2; // double-buffer (ping-pong) index

        if (iteration_idx >= 2) {
            // Wait until the iteration that last used this slot has fully
            // completed before reusing its buffers.
            clWaitForEvents(1, &map_events[flag]);
            OCL_CHECK(clReleaseMemObject(buffer_a[flag]));
            OCL_CHECK(clReleaseMemObject(buffer_b[flag]));
            OCL_CHECK(clReleaseMemObject(buffer_c[flag]));
            OCL_CHECK(clReleaseEvent(read_events[flag]));
            OCL_CHECK(clReleaseEvent(kernel_events[flag]));
        }
        // Each iteration allocates read-only buffers of 2048 ints, backed by
        // the current slices of A and B.
        buffer_a[flag] = clCreateBuffer(world.context,
            CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
            bytes_per_iteration, &A[iteration_idx * elements_per_iteration], NULL);
        buffer_b[flag] = clCreateBuffer(world.context,
            CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
            bytes_per_iteration, &B[iteration_idx * elements_per_iteration], NULL);
        // buffer_c is a write-only buffer of 2048 ints, backed by the current
        // slice of device_result.
        buffer_c[flag] = clCreateBuffer(world.context,
            CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
            bytes_per_iteration, &device_result[iteration_idx * elements_per_iteration], NULL);

        array<cl_event, 2> write_events;
        printf("Enqueueing Migrate Mem Object (Host to Device) calls\n");
        // These calls are asynchronous with respect to the main thread.
        // Reusing each buffer slot is safe because of the clWaitForEvents call
        // above, which waits for the map event of the iteration that last used
        // this slot.
        // clEnqueueMigrateMemObjects replaces clEnqueueWriteBuffer
        OCL_CHECK(clEnqueueMigrateMemObjects(
            world.command_queue, 1, &buffer_a[flag],
            0 /* flags, 0 means from host */,
            0, NULL,
            &write_events[0]));
        set_callback(write_events[0], "ooo_queue_write_events[0]");
        // clEnqueueMigrateMemObjects replaces clEnqueueWriteBuffer
        OCL_CHECK(clEnqueueMigrateMemObjects(
            world.command_queue, 1, &buffer_b[flag],
            0 /* flags, 0 means from host */,
            0, NULL,
            &write_events[1]));
        set_callback(write_events[1], "ooo_queue_write_events[1]");

        // The kernel argument is a 4-byte int, so pass an int copy rather
        // than the address of the size_t.
        int elements = static_cast<int>(elements_per_iteration);
        xcl_set_kernel_arg(kernel, 0, sizeof(cl_mem), &buffer_c[flag]);
        xcl_set_kernel_arg(kernel, 1, sizeof(cl_mem), &buffer_a[flag]);
        xcl_set_kernel_arg(kernel, 2, sizeof(cl_mem), &buffer_b[flag]);
        xcl_set_kernel_arg(kernel, 3, sizeof(int), &elements);

        printf("Enqueueing NDRange kernel.\n");
        // This event needs to wait for the write buffer operations to complete
        // before executing. We are sending the write_events into its wait list
        // to ensure that the order of operations is correct.
        OCL_CHECK(clEnqueueNDRangeKernel(world.command_queue, kernel, 1, nullptr,
                                         &global, &local, 2, write_events.data(),
                                         &kernel_events[flag]));
        set_callback(kernel_events[flag], "ooo_queue_kernel_events");

        printf("Enqueueing Migrate Mem Object (Device to Host) calls\n");
        // This operation only needs to wait for the kernel call. It can
        // overlap the next kernel call as well as the next write operations.
        // clEnqueueMigrateMemObjects replaces clEnqueueReadBuffer
        OCL_CHECK(clEnqueueMigrateMemObjects(world.command_queue, 1, &buffer_c[flag],
            CL_MIGRATE_MEM_OBJECT_HOST, 1, &kernel_events[flag], &read_events[flag]));
        set_callback(read_events[flag], "ooo_queue_read_events");

        // Map buffer_c back so the host pointer (device_result) sees the
        // latest values.
        clEnqueueMapBuffer(world.command_queue, buffer_c[flag], CL_FALSE, CL_MAP_READ, 0,
                           bytes_per_iteration, 1, &read_events[flag], &map_events[flag], NULL);
        set_callback(map_events[flag], "ooo_queue_map_events");

        OCL_CHECK(clReleaseEvent(write_events[0]));
        OCL_CHECK(clReleaseEvent(write_events[1]));
    }
    // Wait for all of the OpenCL operations to complete
    printf("Waiting...\n");
    clFlush(world.command_queue);
    clFinish(world.command_queue);

    // Release memory objects and events
    for (int i = 0; i < 2; i++) {
        OCL_CHECK(clWaitForEvents(1, &map_events[i]));
        OCL_CHECK(clReleaseMemObject(buffer_a[i]));
        OCL_CHECK(clReleaseMemObject(buffer_b[i]));
        OCL_CHECK(clReleaseMemObject(buffer_c[i]));
        OCL_CHECK(clReleaseEvent(read_events[i]));
        OCL_CHECK(clReleaseEvent(kernel_events[i]));
    }

    int match = 0;
    // Verify the results
    for (int i = 0; i < ARRAY_SIZE; i++) {
        int host_result = A[i] + B[i];
        if (device_result[i] != host_result) {
            printf(error_message, i, host_result, device_result[i]);
            match = 1;
            // break;
        }
    }

    OCL_CHECK(clReleaseKernel(kernel));
    OCL_CHECK(clReleaseProgram(program));
    xcl_release_world(world);

    printf("TEST %s\n", (match ? "FAILED" : "PASSED"));
    return (match ? EXIT_FAILURE : EXIT_SUCCESS);
}

Kernel Code Analysis

Kernel Source

#define BUFFER_SIZE 256
kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void vadd(global int* c,
          global const int* a,
          global const int* b,
          const int elements) {
    int arrayA[BUFFER_SIZE];
    int arrayB[BUFFER_SIZE];
    for (int i = 0; i < elements; i += BUFFER_SIZE) {
        int size = BUFFER_SIZE;
        // Shrink the final chunk so the burst never reads past 'elements'
        if (i + size > elements) size = elements - i;
        readA: for (int j = 0; j < size; j++) arrayA[j] = a[i + j];
        readB: for (int j = 0; j < size; j++) arrayB[j] = b[i + j];
        vadd_writeC: for (int j = 0; j < size; j++) c[i + j] = arrayA[j] + arrayB[j];
    }
}

Kernel source analysis:

The vector-addition kernel uses burst transfers, with a burst length of BUFFER_SIZE.
The key point to learn here is that the burst length must be checked against the number of remaining elements, so that the final chunk does not read past the end of the data.
Each kernel invocation processes elements values in total.

Synthesis Report

Performance Chart

Key Concepts Explained

Experimental Results

INFO: Importing xclbin/vector_addition.hw.xilinx_adm-pcie-7v3_1ddr.xclbin
INFO: Loaded file
INFO: Created Binary
INFO: Built Program
Enqueueing Migrate Mem Object (Host to Device) calls
Enqueueing NDRange kernel.
Enqueueing Migrate Mem Object (Device to Host) calls
Enqueueing Migrate Mem Object (Host to Device) calls
[ooo_queue_write_events[0]]: Completed buffer migrate ---> Wa1
[ooo_queue_write_events[1]]: Completed buffer migrate ---> Wb1
[ooo_queue_write_events[0]]: Completed buffer migrate ---> Wa2
[ooo_queue_write_events[1]]: Completed buffer migrate ---> Wb2
[ooo_queue_kernel_events]: Completed kernel ---> CU1
[ooo_queue_read_events]: Completed buffer migrate ---> Rc1(1)
Enqueueing NDRange kernel.
[ooo_queue_map_events]: Completed kernel ---> Rc1(2)
Enqueueing Migrate Mem Object (Device to Host) calls
Enqueueing Migrate Mem Object (Host to Device) calls
[ooo_queue_kernel_events]: Completed kernel ---> CU2
Enqueueing NDRange kernel.
[ooo_queue_write_events[0]]: Completed buffer migrate ---> Wa3
[ooo_queue_read_events]: Completed buffer migrate ---> Rc2(1)
Enqueueing Migrate Mem Object (Device to Host) calls
[ooo_queue_write_events[1]]: Completed buffer migrate ---> Wb3
[ooo_queue_map_events]: Completed kernel ---> Rc2(2)
[ooo_queue_kernel_events]: Completed kernel ---> CU3
Enqueueing Migrate Mem Object (Host to Device) calls
[ooo_queue_read_events]: Completed buffer migrate ---> Rc3(1)
[ooo_queue_map_events]: Completed kernel ---> Rc3(2)
[ooo_queue_write_events[0]]: Completed buffer migrate ---> Wa4
Enqueueing NDRange kernel.
[ooo_queue_write_events[1]]: Completed buffer migrate ---> Wb4
Enqueueing Migrate Mem Object (Device to Host) calls
Enqueueing Migrate Mem Object (Host to Device) calls
[ooo_queue_kernel_events]: Completed kernel ---> CU4
Enqueueing NDRange kernel.
[ooo_queue_write_events[0]]: Completed buffer migrate ---> Wa5
[ooo_queue_write_events[1]]: Completed buffer migrate ---> Wb5
[ooo_queue_read_events]: Completed buffer migrate ---> Rc4(1)
[ooo_queue_map_events]: Completed kernel ---> Rc4(2)
Enqueueing Migrate Mem Object (Device to Host) calls
[ooo_queue_kernel_events]: Completed kernel ---> CU5
Enqueueing Migrate Mem Object (Host to Device) calls
[ooo_queue_read_events]: Completed buffer migrate ---> Rc5(1)
[ooo_queue_write_events[0]]: Completed buffer migrate
[ooo_queue_map_events]: Completed kernel ---> Rc5(2)
[ooo_queue_write_events[1]]: Completed buffer migrate
Enqueueing NDRange kernel.
Enqueueing Migrate Mem Object (Device to Host) calls
[ooo_queue_kernel_events]: Completed kernel
Enqueueing Migrate Mem Object (Host to Device) calls
[ooo_queue_read_events]: Completed buffer migrate
Enqueueing NDRange kernel.
[ooo_queue_map_events]: Completed kernel
[ooo_queue_write_events[0]]: Completed buffer migrate
[ooo_queue_write_events[1]]: Completed buffer migrate
Enqueueing Migrate Mem Object (Device to Host) calls
Enqueueing Migrate Mem Object (Host to Device) calls
[ooo_queue_kernel_events]: Completed kernel
Enqueueing NDRange kernel.
Enqueueing Migrate Mem Object (Device to Host) calls
[ooo_queue_write_events[0]]: Completed buffer migrate
[ooo_queue_read_events]: Completed buffer migrate
[ooo_queue_write_events[1]]: Completed buffer migrate
[ooo_queue_map_events]: Completed kernel
Waiting...
[ooo_queue_kernel_events]: Completed kernel
[ooo_queue_read_events]: Completed buffer migrate
[ooo_queue_map_events]: Completed kernel
TEST PASSED