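Performance comparison on the RK3588: CPU vs. GPU compute throughput and OpenCV resize
- I. Background
- II. Summary
- III. Related links
- IV. Procedure
  - 1. Environment setup
    - A. Install dependencies
    - B. Set the GPU to performance mode
    - C. Query GPU information
    - D. Query CPU information
  - 2. Query GPU information through the OpenCL SDK
  - 3. Matrix multiplication with the OpenCL API
  - 4. Measure GPU performance with clpeak
  - 5. Measure CPU compute throughput with OpenBLAS
  - 6. Benchmark OpenCV resize on the CPU and through OpenCL
    - A. Build OpenCV with OpenCL support
    - B. Run the OpenCV test program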
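I. Background
- Goal: compare the performance of the CPU and the Mali GPU on the RK3588.
- Mali GPU compute throughput is measured with clpeak.
- CPU FP32 performance is measured with OpenBLAS (NEON optimizations enabled).
- OpenCV resize is benchmarked on the CPU and through OpenCL for each interpolation method: a 32x32 image is enlarged to 8192x8192 and then shrunk back to 32x32, repeated 100 times.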
II. Summary
- GPU model: Mali-LODX r0p0 (Mali-G610, 4 cores, r0p0, 0xA867)
- GPU FP32 (clpeak): 441.95 GFLOPS
- CPU FP32 (OpenBLAS + NEON): 53.68 GFLOPS
- OpenCV resize, total time in seconds for 100 enlarge/shrink iterations:
- INTER_NEAREST: CPU 3.01526, GPU 0.0672681
- INTER_LINEAR: CPU 5.3227, GPU 0.0189366
- INTER_CUBIC: CPU 8.22734, GPU 11.6337
- INTER_AREA: CPU 20.4999, GPU 27.3197
- INTER_LANCZOS4: CPU 29.3602, GPU 43.9484
III. Related links
- Building OpenCV
IV. Procedure
1. Environment setup
A. Install dependencies
mv /lib/aarch64-linux-gnu/libOpenCL.so.1 /lib/aarch64-linux-gnu/libOpenCL.so.1.bk
ln -s /usr/lib/aarch64-linux-gnu/libmali.so /lib/aarch64-linux-gnu/libOpenCL.so.1
sudo apt install opencl-headers
sudo apt install ocl-icd-libopencl1
sudo apt install ocl-icd-opencl-dev
sudo apt install clinfo
B. Set the GPU to performance mode
echo performance> /sys/class/devfreq/fb000000.gpu/governor
echo performance> /sys/class/devfreq/fdab0000.npu/governor
C. Query GPU information
cat /sys/class/misc/mali0/device/gpuinfo
clinfo
Output
Mali-G610 4 cores r0p0 0xA867
Number of platforms  1
  Platform Name  ARM Platform
  Platform Vendor  ARM
  Platform Version  OpenCL 2.1 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Platform Profile  FULL_PROFILE
  Platform Extensions  cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_subgroups cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_il_program cl_khr_priority_hints cl_khr_create_command_queue cl_khr_spirv_no_integer_wrap_decoration cl_khr_extended_versioning cl_khr_device_uuid cl_arm_core_id cl_arm_printf cl_arm_non_uniform_work_group_size cl_arm_import_memory cl_arm_import_memory_dma_buf cl_arm_import_memory_host cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_arm_scheduling_controls cl_arm_controlled_kernel_termination cl_ext_cxx_for_opencl
  Platform Host timer resolution  1ns
  Platform Extensions function suffix  ARM

  Platform Name  ARM Platform
Number of devices  1
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
  Device Name  Mali-LODX r0p0
  Device Vendor  ARM
  Device Vendor ID  0xa8670000
  Device Version  OpenCL 2.1 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Driver Version  2.1
  Device OpenCL C Version  OpenCL C 2.0 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Device Type  GPU
  Device Profile  FULL_PROFILE
  Device Available  Yes
  Compiler Available  Yes
  Linker Available  Yes
  Max compute units  4
  Max clock frequency  1000MHz
  Device Partition  (core)
    Max number of sub-devices  0
    Supported partition types  None
    Supported affinity domains  (n/a)
  Max work item dimensions  3
  Max work item sizes  1024x1024x1024
  Max work group size  1024
  Preferred work group size multiple  16
  Max sub-groups per work group  64
  Preferred / native vector sizes
    char  16 / 4
    short  8 / 2
    int  4 / 1
    long  2 / 1
    half  8 / 2 (cl_khr_fp16)
    float  4 / 1
    double  0 / 0 (n/a)
  Half-precision Floating-point support  (cl_khr_fp16)
    Denormals  Yes
    Infinity and NANs  Yes
    Round to nearest  Yes
    Round to zero  Yes
    Round to infinity  Yes
    IEEE754-2008 fused multiply-add  Yes
    Support is emulated in software  No
  Single-precision Floating-point support  (core)
    Denormals  Yes
    Infinity and NANs  Yes
    Round to nearest  Yes
    Round to zero  Yes
    Round to infinity  Yes
    IEEE754-2008 fused multiply-add  Yes
    Support is emulated in software  No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support  (n/a)
  Address bits  64, Little-Endian
  Global memory size  16643870720 (15.5GiB)
  Error Correction support  No
  Max memory allocation  16643870720 (15.5GiB)
  Unified memory for Host and Device  Yes
  Shared Virtual Memory (SVM) capabilities  (core)
    Coarse-grained buffer sharing  Yes
    Fine-grained buffer sharing  No
    Fine-grained system sharing  No
    Atomics  No
  Minimum alignment for any data type  128 bytes
  Alignment of base address  1024 bits (128 bytes)
  Preferred alignment for atomics
    SVM  0 bytes
    Global  0 bytes
    Local  0 bytes
  Max size for global variable  65536 (64KiB)
  Preferred total size of global vars  0
  Global Memory cache type  Read/Write
  Global Memory cache size  1048576 (1024KiB)
  Global Memory cache line size  64 bytes
  Image support  Yes
    Max number of samplers per kernel  16
    Max size for 1D images from buffer  65536 pixels
    Max 1D or 2D image array size  2048 images
    Base address alignment for 2D image buffers  32 bytes
    Pitch alignment for 2D image buffers  64 pixels
    Max 2D image size  65536x65536 pixels
    Max 3D image size  65536x65536x65536 pixels
    Max number of read image args  128
    Max number of write image args  64
    Max number of read/write image args  64
  Max number of pipe args  16
  Max active pipe reservations  1
  Max pipe packet size  1024
  Local memory type  Global
  Local memory size  32768 (32KiB)
  Max number of constant args  128
  Max constant buffer size  16643870720 (15.5GiB)
  Max size of kernel argument  1024
  Queue properties (on host)
    Out-of-order execution  Yes
    Profiling  Yes
  Queue properties (on device)
    Out-of-order execution  Yes
    Profiling  Yes
    Preferred size  2097152 (2MiB)
    Max size  16777216 (16MiB)
  Max queues on device  1
  Max events on device  1024
  Prefer user sync for interop  No
  Profiling timer resolution  1000ns
  Execution capabilities
    Run OpenCL kernels  Yes
    Run native kernels  No
    Sub-group independent forward progress  Yes
    IL version  SPIR-V_1.0
    SPIR versions  <printDeviceInfo:161: get CL_DEVICE_SPIR_VERSIONS size : error -30>
  printf() buffer size  1048576 (1024KiB)
  Built-in kernels  (n/a)
  Device Extensions  cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_subgroups cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_il_program cl_khr_priority_hints cl_khr_create_command_queue cl_khr_spirv_no_integer_wrap_decoration cl_khr_extended_versioning cl_khr_device_uuid cl_arm_core_id cl_arm_printf cl_arm_non_uniform_work_group_size cl_arm_import_memory cl_arm_import_memory_dma_buf cl_arm_import_memory_host cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_arm_scheduling_controls cl_arm_controlled_kernel_termination cl_ext_cxx_for_opencl

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  ARM Platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)  Success [ARM]
  clCreateContext(NULL, ...) [default]  Success [ARM]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name  ARM Platform
    Device Name  Mali-LODX r0p0
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name  ARM Platform
    Device Name  Mali-LODX r0p0
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name  ARM Platform
    Device Name  Mali-LODX r0p0
D. Query CPU information
lscpu
Output
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 3
Vendor ID: ARM
Model: 0
Model name: Cortex-A55
Stepping: r2p0
CPU max MHz: 2208.0000
CPU min MHz: 408.0000
BogoMIPS: 48.00
L1d cache: 256 KiB
L1i cache: 256 KiB
L2 cache: 1 MiB
L3 cache: 3 MiB
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
2. Query GPU information through the OpenCL SDK
cat > cl_query.c <<-'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main() {
    cl_platform_id *platforms = NULL;
    cl_uint num_platforms = 0;

    // Get the number of available platforms
    cl_int clStatus = clGetPlatformIDs(0, NULL, &num_platforms);
    if (clStatus != CL_SUCCESS || num_platforms == 0) {
        fprintf(stderr, "No OpenCL platform found (status %d)\n", clStatus);
        return 1;
    }
    platforms = (cl_platform_id*) malloc(sizeof(cl_platform_id) * num_platforms);

    // Get all platform IDs
    clStatus = clGetPlatformIDs(num_platforms, platforms, NULL);
    printf("Number of OpenCL platforms: %u\n", num_platforms);

    // Iterate over the platforms
    for (cl_uint i = 0; i < num_platforms; ++i) {
        char buffer[10240];
        printf("\nPlatform %u:\n", i+1);

        // Platform name
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(buffer), buffer, NULL);
        printf("  Name: %s\n", buffer);

        // Platform vendor
        clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR, sizeof(buffer), buffer, NULL);
        printf("  Vendor: %s\n", buffer);

        // Platform version
        clGetPlatformInfo(platforms[i], CL_PLATFORM_VERSION, sizeof(buffer), buffer, NULL);
        printf("  Version: %s\n", buffer);

        // Number of devices
        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);
        cl_device_id *devices = (cl_device_id*) malloc(sizeof(cl_device_id) * num_devices);
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, num_devices, devices, NULL);

        // Iterate over the devices
        for (cl_uint j = 0; j < num_devices; ++j) {
            printf("  Device %u:\n", j+1);

            // Device name
            clGetDeviceInfo(devices[j], CL_DEVICE_NAME, sizeof(buffer), buffer, NULL);
            printf("    Name: %s\n", buffer);

            // Device type
            cl_device_type device_type;
            clGetDeviceInfo(devices[j], CL_DEVICE_TYPE, sizeof(device_type), &device_type, NULL);
            if (device_type & CL_DEVICE_TYPE_CPU)
                printf("    Type: CPU\n");
            if (device_type & CL_DEVICE_TYPE_GPU)
                printf("    Type: GPU\n");
            if (device_type & CL_DEVICE_TYPE_ACCELERATOR)
                printf("    Type: Accelerator\n");

            // Number of compute units
            cl_uint compute_units;
            clGetDeviceInfo(devices[j], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(compute_units), &compute_units, NULL);
            printf("    Compute units: %u\n", compute_units);

            // Global memory size
            cl_ulong global_mem;
            clGetDeviceInfo(devices[j], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);
            printf("    Global memory size: %llu MB\n", (unsigned long long)(global_mem / (1024 * 1024)));
        }
        free(devices);
    }
    free(platforms);
    return 0;
}
EOF
gcc -o cl_query cl_query.c -lOpenCL
./cl_query
Output
Number of OpenCL platforms: 1

Platform 1:
  Name: ARM Platform
  Vendor: ARM
  Version: OpenCL 2.1 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Device 1:
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
    Name: Mali-LODX r0p0
    Type: GPU
    Compute units: 4
    Global memory size: 15872 MB
3. Matrix multiplication with the OpenCL API
cat > matmul.c <<-'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>
#include <time.h>
#include <sys/time.h>

#define MATRIX_SIZE 8192
#define TILE_SIZE 32

// Current wall-clock time in seconds, used to measure elapsed time
double get_current_time() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)(tp.tv_sec) + (double)(tp.tv_usec) / 1e6;
}

#define xstr(s) str(s)
#define str(s) #s

const char *kernelSource = " \n" \
"__kernel void mat_mul_optimized(const int N, \n" \
"                                __global float* A, \n" \
"                                __global float* B, \n" \
"                                __global float* C) { \n" \
"    const int TILE_SIZE = " xstr(TILE_SIZE) "; \n" \
"    __local float Asub[TILE_SIZE][TILE_SIZE]; \n" \
"    __local float Bsub[TILE_SIZE][TILE_SIZE]; \n" \
"    int global_row = get_global_id(1); \n" \
"    int global_col = get_global_id(0); \n" \
"    int local_row = get_local_id(1); \n" \
"    int local_col = get_local_id(0); \n" \
"    float sum = 0.0f; \n" \
"    int numTiles = (N + TILE_SIZE - 1) / TILE_SIZE; \n" \
"    for (int t = 0; t < numTiles; ++t) { \n" \
"        int tiled_row = global_row; \n" \
"        int tiled_col = t * TILE_SIZE + local_col; \n" \
"        if (tiled_row < N && tiled_col < N) \n" \
"            Asub[local_row][local_col] = A[tiled_row * N + tiled_col];\n" \
"        else \n" \
"            Asub[local_row][local_col] = 0.0f; \n" \
"        tiled_row = t * TILE_SIZE + local_row; \n" \
"        tiled_col = global_col; \n" \
"        if (tiled_row < N && tiled_col < N) \n" \
"            Bsub[local_row][local_col] = B[tiled_row * N + tiled_col];\n" \
"        else \n" \
"            Bsub[local_row][local_col] = 0.0f; \n" \
"        barrier(CLK_LOCAL_MEM_FENCE); \n" \
"        for (int k = 0; k < TILE_SIZE; ++k) { \n" \
"            sum += Asub[local_row][k] * Bsub[k][local_col]; \n" \
"        } \n" \
"        barrier(CLK_LOCAL_MEM_FENCE); \n" \
"    } \n" \
"    if (global_row < N && global_col < N) \n" \
"        C[global_row * N + global_col] = sum; \n" \
"} \n";

int main() {
    int N = MATRIX_SIZE;
    size_t bytes = (size_t)N * N * sizeof(float);

    // Allocate host memory
    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);

    // Initialize the matrices
    for (int i = 0; i < N*N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 1.0f;
    }

    // Get platform and device information
    cl_platform_id platformId = NULL;
    cl_device_id deviceID = NULL;
    cl_uint retNumDevices;
    cl_uint retNumPlatforms;
    cl_int ret = clGetPlatformIDs(1, &platformId, &retNumPlatforms);
    ret = clGetDeviceIDs(platformId, CL_DEVICE_TYPE_DEFAULT, 1, &deviceID, &retNumDevices);

    // Create the OpenCL context
    cl_context context = clCreateContext(NULL, 1, &deviceID, NULL, NULL, &ret);

    // Create the command queue
    cl_command_queue commandQueue = clCreateCommandQueue(context, deviceID, 0, &ret);

    // Create the device buffers
    cl_mem d_A = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, &ret);
    cl_mem d_B = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, &ret);
    cl_mem d_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, &ret);

    // Copy the input data into the buffers
    ret = clEnqueueWriteBuffer(commandQueue, d_A, CL_TRUE, 0, bytes, h_A, 0, NULL, NULL);
    ret = clEnqueueWriteBuffer(commandQueue, d_B, CL_TRUE, 0, bytes, h_B, 0, NULL, NULL);

    // Start of kernel compilation
    double compile_start = get_current_time();

    // Create the program object
    cl_program program = clCreateProgramWithSource(context, 1, (const char**)&kernelSource, NULL, &ret);

    // Build the kernel program
    ret = clBuildProgram(program, 1, &deviceID, NULL, NULL, NULL);

    // Report build errors
    if (ret != CL_SUCCESS) {
        size_t log_size;
        clGetProgramBuildInfo(program, deviceID, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
        char *log = (char *)malloc(log_size);
        clGetProgramBuildInfo(program, deviceID, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
        printf("CL Compilation failed:\n%s\n", log);
        free(log);
        return 1;
    }

    // End of kernel compilation
    double compile_end = get_current_time();
    double compile_time = compile_end - compile_start;

    // Create the OpenCL kernel
    cl_kernel kernel = clCreateKernel(program, "mat_mul_optimized", &ret);

    // Set the kernel arguments
    ret = clSetKernelArg(kernel, 0, sizeof(int), (void*)&N);
    ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*)&d_A);
    ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void*)&d_B);
    ret = clSetKernelArg(kernel, 3, sizeof(cl_mem), (void*)&d_C);

    // Global and local work sizes (global size rounded up to a multiple of the tile size)
    size_t local[2] = {TILE_SIZE, TILE_SIZE};
    size_t global[2] = {(size_t)((N + TILE_SIZE - 1) / TILE_SIZE) * TILE_SIZE,
                        (size_t)((N + TILE_SIZE - 1) / TILE_SIZE) * TILE_SIZE};

    // Start of the first kernel execution
    double launch_start = get_current_time();

    // Run the kernel
    ret = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, global, local, 0, NULL, NULL);
    printf("clEnqueueNDRangeKernel:%d\n", ret);

    // Wait for the command queue to drain
    clFinish(commandQueue);

    // End of the first kernel execution
    double launch_end = get_current_time();
    double launch_time = launch_end - launch_start;

    // Read back the result
    ret = clEnqueueReadBuffer(commandQueue, d_C, CL_TRUE, 0, bytes, h_C, 0, NULL, NULL);

    // Compute GFLOPS: 2*N^3 operations (one multiply and one add per term)
    double total_ops = 2.0 * N * N * N;
    double gflops = (total_ops / 1e9) / launch_time;

    // Print the results
    printf("Build time: %f s\n", compile_time);
    printf("First kernel execution time: %f s\n", launch_time);
    printf("Throughput: %f GFLOPS\n", gflops);

    // Release resources
    ret = clFlush(commandQueue);
    ret = clFinish(commandQueue);
    ret = clReleaseKernel(kernel);
    ret = clReleaseProgram(program);
    ret = clReleaseMemObject(d_A);
    ret = clReleaseMemObject(d_B);
    ret = clReleaseMemObject(d_C);
    ret = clReleaseCommandQueue(commandQueue);
    ret = clReleaseContext(context);
    free(h_A);
    free(h_B);
    free(h_C);
    return 0;
}
EOF
gcc -o matmul matmul.c -lOpenCL
./matmul
Output
Build time: 0.031085 s
First kernel execution time: 62.258528 s
Throughput: 17.660418 GFLOPS
4. Measure GPU performance with clpeak
git clone https://gitcode.com/gh_mirrors/cl/clpeak.git
cd clpeak
git submodule update --init --recursive --remote
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
./clpeak
Output
Platform: ARM Platform
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
  Device: Mali-LODX r0p0
    Driver version  : 2.1 (Linux ARM64)
    Compute units   : 4
    Clock frequency : 1000 MHz

    Global memory bandwidth (GBPS)
      float   : 25.71
      float2  : 24.45
      float4  : 23.70
      float8  : 12.05
      float16 : 12.01

    Single-precision compute (GFLOPS)
      float   : 441.77
      float2  : 470.27
      float4  : 466.52
      float8  : 435.65
      float16 : 411.38

    Half-precision compute (GFLOPS)
      half   : 441.96
      half2  : 878.25
      half4  : 911.51
      half8  : 886.19
      half16 : 846.44

    No double precision support! Skipped

    Integer compute (GIOPS)
      int   : 124.96
      int2  : 125.71
      int4  : 125.16
      int8  : 123.82
      int16 : 124.24

    Integer compute Fast 24bit (GIOPS)
      int   : 125.16
      int2  : 125.63
      int4  : 125.20
      int8  : 123.73
      int16 : 124.33

    Integer char (8bit) compute (GIOPS)
      char   : 126.47
      char2  : 251.55
      char4  : 498.03
      char8  : 497.37
      char16 : 491.94

    Integer short (16bit) compute (GIOPS)
      short   : 126.31
      short2  : 250.90
      short4  : 249.47
      short8  : 248.51
      short16 : 245.30

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 8.54
      enqueueReadBuffer               : 9.97
      enqueueWriteBuffer non-blocking : 8.55
      enqueueReadBuffer non-blocking  : 9.99
      enqueueMapBuffer(for read)      : 61.66
        memcpy from mapped ptr        : 11.95
      enqueueUnmap(after write)       : 62.02
        memcpy to mapped ptr          : 11.89

    Kernel launch latency : 26.81 us
5. Measure CPU compute throughput with OpenBLAS
git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
make TARGET=ARMV8
make install
cd benchmark
make TARGET=ARMV8 sgemm
cc sgemm.o -o sgemm /opt/OpenBLAS/lib/libopenblas.so -Wl,-rpath=/opt/OpenBLAS/lib/
export OPENBLAS_NUM_THREADS=8
export OPENBLAS_LOOPS=10
export OPENBLAS_PARAM_M=8192
export OPENBLAS_PARAM_N=8192
export OPENBLAS_PARAM_K=8192
./sgemm
Output
From : 1 To : 200 Step=1 : Transa=N : Transb=N
          SIZE                  Flops           Time
M=8192, N=8192, K=8192 :  53485.68 MFlops  205.571220 sec
6. Benchmark OpenCV resize on the CPU and through OpenCL
A. Build OpenCV with OpenCL support
- OpenCV change (link against libmali.so)
diff --git a/cmake/OpenCVDetectOpenCL.cmake b/cmake/OpenCVDetectOpenCL.cmake
index 6ab2cae070..c3cf235e45 100644
--- a/cmake/OpenCVDetectOpenCL.cmake
+++ b/cmake/OpenCVDetectOpenCL.cmake
@@ -3,9 +3,8 @@ if(APPLE)
   set(OPENCL_LIBRARY "-framework OpenCL" CACHE STRING "OpenCL library")
   set(OPENCL_INCLUDE_DIR "" CACHE PATH "OpenCL include directory")
 else()
-  set(OPENCL_LIBRARY "" CACHE STRING "OpenCL library")
-  set(OPENCL_INCLUDE_DIR "${OpenCV_SOURCE_DIR}/3rdparty/include/opencl/1.2" CACHE PATH "OpenCL include directory")
-  ocv_install_3rdparty_licenses(opencl-headers "${OpenCV_SOURCE_DIR}/3rdparty/include/opencl/LICENSE.txt")
+  set(OPENCL_LIBRARY "/usr/lib/aarch64-linux-gnu/libmali.so")
+  set(OPENCL_INCLUDE_DIR "/usr/include")
 endif()
 mark_as_advanced(OPENCL_INCLUDE_DIR OPENCL_LIBRARY)
- Build OpenCV
git clone https://github.com/opencv/opencv.git
cd opencv
git checkout bdb6a968ce69a2bf7c34724f9052c20e941ab47b
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX=`pwd`/_install \
    -DWITH_OPENCL=ON -DWITH_NEON=ON \
    -DBUILD_SHARED_LIBS=ON \
    -D BUILD_opencv_world=ON -DBUILD_TESTS=OFF -DBUILD_EXAMPLES=OFF -DBUILD_opencv_apps=OFF \
    -DBUILD_opencv_dnn=OFF -DBUILD_opencv_calib3d=OFF \
    -DBUILD_opencv_imgproc=ON -DBUILD_opencv_imgcodecs=ON ..
make -j4
make install
B. Run the OpenCV test program
cat > opencv_resize.cpp <<-'EOF'
#include <opencv2/opencv.hpp>
#include <opencv2/core/ocl.hpp>
#include <iostream>
#include <string>
#include <vector>

void run(int resize_mode)
{
    // Create a random 32x32 image
    cv::Mat src = cv::Mat::zeros(32, 32, CV_8UC3);
    cv::randu(src, cv::Scalar::all(0), cv::Scalar::all(255));

    // ------------------------------------
    // Run on the CPU
    // ------------------------------------
    cv::ocl::setUseOpenCL(false);
    cv::Mat enlarged_cpu, resized_back_cpu;

    // Start time of the CPU loop
    int64 start_time_cpu = cv::getTickCount();
    for (int i = 0; i < 100; i++)
    {
        // Enlarge to 8192x8192
        cv::resize(src, enlarged_cpu, cv::Size(8192, 8192), 0, 0, resize_mode);
        // Shrink back to 32x32
        cv::resize(enlarged_cpu, resized_back_cpu, cv::Size(32, 32), 0, 0, resize_mode);
    }
    // End time of the CPU loop
    int64 end_time_cpu = cv::getTickCount();
    // Total CPU time for the 100 iterations
    double time_resize_cpu = (end_time_cpu - start_time_cpu) / cv::getTickFrequency();

    // ------------------------------------
    // Run on the GPU (OpenCL)
    // ------------------------------------
    cv::ocl::setUseOpenCL(true);
    cv::UMat src_umat;
    src.copyTo(src_umat);
    cv::UMat enlarged_gpu, resized_back_gpu;

    // Start time of the GPU loop
    int64 start_time_gpu = cv::getTickCount();
    for (int i = 0; i < 100; i++)
    {
        // Enlarge to 8192x8192
        cv::resize(src_umat, enlarged_gpu, cv::Size(8192, 8192), 0, 0, resize_mode);
        // Shrink back to 32x32
        cv::resize(enlarged_gpu, resized_back_gpu, cv::Size(32, 32), 0, 0, resize_mode);
    }
    // Note: UMat operations may be queued asynchronously by the OpenCL runtime.
    // Calling cv::ocl::finish() here would make sure all queued work is included
    // in the measurement; the results below were taken without it.
    // End time of the GPU loop
    int64 end_time_gpu = cv::getTickCount();
    // Total GPU time for the 100 iterations
    double time_resize_gpu = (end_time_gpu - start_time_gpu) / cv::getTickFrequency();

    std::cout << "CPU time (s): " << time_resize_cpu << " "
              << "GPU time (s): " << time_resize_gpu << std::endl;
}

int main() {
    // Check whether OpenCL is available
    if (!cv::ocl::haveOpenCL()) {
        std::cout << "OpenCL is not available on this system." << std::endl;
        return -1;
    }

    // Print the OpenCL device in use
    cv::ocl::Context context;
    if (!context.create(cv::ocl::Device::TYPE_GPU)) {
        std::cout << "No usable GPU device found, running on the CPU." << std::endl;
    } else {
        cv::ocl::Device device = cv::ocl::Device::getDefault();
        std::cout << "OpenCL device in use: " << device.name() << std::endl;
    }

    // Interpolation methods to test
    std::vector<int> interpolation_methods = {
        cv::INTER_NEAREST,
        cv::INTER_LINEAR,
        cv::INTER_CUBIC,
        cv::INTER_AREA,
        cv::INTER_LANCZOS4
    };

    // Names of the interpolation methods, used in the output
    std::vector<std::string> interpolation_names = {
        "INTER_NEAREST",
        "INTER_LINEAR",
        "INTER_CUBIC",
        "INTER_AREA",
        "INTER_LANCZOS4"
    };

    for (size_t i = 0; i < interpolation_methods.size(); ++i) {
        int interpolation = interpolation_methods[i];
        std::string method_name = interpolation_names[i];
        std::cout << "Interpolation: " << method_name << " ";
        run(interpolation);
    }
    return 0;
}
EOF
g++ -o opencv_resize opencv_resize.cpp -I _install/include/opencv4 \
    _install/lib/libopencv_world.so -Wl,-rpath=_install/lib
export OPENBLAS_NUM_THREADS=8
./opencv_resize
Output
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
OpenCL device in use: Mali-LODX r0p0
Interpolation: INTER_NEAREST CPU time (s): 3.01526 GPU time (s): 0.0672681
Interpolation: INTER_LINEAR CPU time (s): 5.3227 GPU time (s): 0.0189366
Interpolation: INTER_CUBIC CPU time (s): 8.22734 GPU time (s): 11.6337
Interpolation: INTER_AREA CPU time (s): 20.4999 GPU time (s): 27.3197
Interpolation: INTER_LANCZOS4 CPU time (s): 29.3602 GPU time (s): 43.9484