DPC++ for Intel Processor Graphics Architecture
How to Offload Compute-Intensive Code to Intel® GPUs
Rama Malladi, Graphics Performance Modeling Engineer, Intel Corporation
Intel® Processor Graphics Architecture is an Intel technology that provides graphics, compute, media, and display capabilities for many of Intel's system-on-a-chip (SoC) products. The architecture is informally called "Gen," shorthand for generation. Each release of the architecture has a corresponding version indicated after the word "Gen." For instance, the latest release of Intel Graphics Architecture is Gen11. Over the years, these GPUs have evolved to deliver impressive graphics (3D rendering and media performance) and general-purpose compute capabilities of up to 1 TFLOPS (trillion floating-point operations per second).
In this article, we'll explore the general-purpose compute capabilities of the Intel Processor Graphics Gen9 and Gen11 architectures and how to program them using Data Parallel C++ (DPC++) in the Intel® oneAPI Base Toolkit. Specifically, we'll look at a case study that shows programming and performance aspects of the two Gen architectures using DPC++.
Intel Processor Graphics: Architecture Overview by Gen
Intel Processor Graphics is a power-efficient, high-performance graphics and media accelerator integrated on-die with the Intel® CPU. The integrated GPU shares the last-level cache (LLC) with the CPU, which enables fine-grained, coherent data sharing at low latency and high bandwidth. Figure 1 shows the SoC with Gen11 graphics. On-die integration enables much lower power consumption than a discrete graphics card.
Figure 1: Intel Processor Graphics Gen11 SoC (part of the larger CPU SoC)
Figure 2 shows the architecture block diagram of a Gen9 GPU. The GPU contains many execution units (EUs), each capable of single instruction, multiple data (SIMD) computation. A group of eight EUs forms a subslice. Each subslice has:
- An instruction cache
- L1/L2 sampler caches
- Memory load/store unit ports
These subslices are aggregated to form a slice, which also contains a shared L3 cache (coherent with the CPU) and a banked shared local memory (SLM). An Intel integrated GPU may have one or more such slices. In such a configuration, the L3 is connected across the slices via an interconnect fabric.
Figure 2: Intel® Gen9 GPU Architecture
Figure 3 shows some details of the EU in the Gen9 architecture. The EU supports multithreading with up to seven threads per EU, with each thread having 128 SIMD-8 32-bit registers. An EU can issue up to four instructions per cycle. (Read more about architecture details and benchmarking of Intel GPUs here.) For instance, the peak theoretical GFLOPS for the hardware can be calculated as (EUs) × (SIMD units/EU) × (FLOPS per cycle per SIMD unit) × (frequency in GHz).
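To make the formula concrete, here is a small sketch of the calculation. The specific numbers plugged in below are illustrative assumptions for a 48-EU Gen9 part (2 SIMD FPUs per EU, 8 single-precision FLOPS per cycle per FPU via SIMD-4 FMA, and a 1.15 GHz clock); check your SKU's specifications for the actual values.

```cpp
#include <cmath>

// Peak theoretical GFLOPS =
//   EUs x (SIMD units/EU) x (FLOPS per cycle per SIMD unit) x frequency (GHz)
double peak_gflops(int eus, int simd_units_per_eu,
                   int flops_per_cycle_per_unit, double freq_ghz) {
    return static_cast<double>(eus) * simd_units_per_eu *
           flops_per_cycle_per_unit * freq_ghz;
}
```

With the assumed numbers above, peak_gflops(48, 2, 8, 1.15) gives 48 × 2 × 8 × 1.15 ≈ 883 peak single-precision GFLOPS.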
Figure 3: Subslice and EU architecture details
When you're programming a device such as a GPU, getting the best performance requires language constructs that map well to the available hardware features. Several APIs are available, but let's take a deep dive into oneAPI.
oneAPI and DPC++
oneAPI is an open, free, and standards-based programming model that provides portability and performance across accelerators and generations of hardware. oneAPI includes DPC++, the core programming language for code reuse across diverse hardware targets. (You can find more details in my previous article, Heterogeneous Programming Using oneAPI, in The Parallel Universe, issue 39.) DPC++ includes:
- A Unified Shared Memory feature for straightforward host-device memory management
- OpenCL-style NDRange subgroups to aid vectorization
- Support for generic/function pointers
- And a variety of other features
This article presents a case study that converts a CUDA code to DPC++.
Case Study: Compute Kernel Execution on Intel Processor Graphics
Let's look at the Hogbom Clean imaging algorithm, widely used in processing radio astronomy images. This imaging algorithm has two hotspots:
- Find Peak
- Subtract PSF
For brevity, we'll focus on the performance aspects of Find Peak. The original implementation was in C++, OpenMP, CUDA, and OpenCL. The host CPU offloads the CUDA and OpenCL kernels onto the GPU when one is available. (CUDA is a proprietary approach that offloads computations only to NVIDIA GPUs.) Figures 4 and 5 show snippets of the host and device code, respectively.
Figure 4: Find Peak host code: C++, CUDA
Figure 5: Find Peak device code: CUDA
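The logic that the Find Peak kernel parallelizes is a simple reduction: scan the image for the sample with the largest absolute value and record its value and position. As a minimal serial C++ sketch of that logic (the names findPeak and Peak here are illustrative, not the application's actual API):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative stand-in for the Find Peak hotspot: locate the sample
// with the largest absolute value in the image.
struct Peak {
    float value = 0.0f;   // signed value at the peak
    std::size_t pos = 0;  // flat index of the peak
};

Peak findPeak(const std::vector<float>& image) {
    Peak peak;
    float maxAbs = -1.0f;
    for (std::size_t i = 0; i < image.size(); ++i) {
        const float a = std::fabs(image[i]);
        if (a > maxAbs) {  // track the running maximum
            maxAbs = a;
            peak.value = image[i];
            peak.pos = i;
        }
    }
    return peak;
}
```

The CUDA version in Figure 5 splits this scan across thread blocks, each finding a local peak, followed by a final reduction; the DPC++ migration discussed below preserves that structure.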
We can manually replace the CUDA code with DPC++, or we can use the DPC++ Compatibility Tool (DPCT). DPCT assists in migrating CUDA programs to DPC++ (Figures 6 and 7). It requires only the Intel oneAPI Base Toolkit and the NVIDIA CUDA headers. Invoking the DPCT tool to migrate an example.cu file is as simple as running dpct example.cu.
For migrating applications with many CUDA files, we can use the DPCT options --in-root to specify the location of the program sources and --out-root to specify where the migrated code is written. If the application uses make or cmake, it's recommended that migration be done using intercept-build. This creates a compilation database (a .json file) with the compiler invocations (the input file names for both the host C++ code and the device CUDA code, plus the associated compiler options).
Specifically, to migrate the Hogbom Clean CUDA code to DPC++, we can either invoke the DPCT tool directly on the HogbomCuda.cu file, which contains the CUDA kernels, or use intercept-build. By default, the migrated code gets the file name extension .dp.cpp.
Let's review the migrated DPC++ code (Figures 6-9) and compare it with the original CUDA code (Figures 4 and 5).
Figure 6: Find Peak DPC++ host code migrated using DPCT
Figure 7: Comparison of CUDA host code versus migrated DPC++ host code
Figure 8: Find Peak DPC++ device code migrated using DPCT
Figure 9: Comparison of the Find Peak CUDA kernel versus the migrated DPC++ device kernel
Some key aspects of DPC++ code include the invocation of device code using SYCL queues, a lambda function handler for executing the device code, and, optionally, a parallel_for construct for multithreaded execution. The migrated DPC++ code here uses the unified shared memory (USM) programming model and allocates memory on the device for the data being read/written by the device kernels. Since this is a device allocation, explicit data copies must be performed from host to device and vice versa. Alternatively, we can allocate the memory as shared, so that it can be accessed and updated by both the host and the device. Not shown here is non-USM code, in which data transfers are performed using SYCL buffers and accessors.
The DPCT-migrated code determines the current device and creates a queue for that device (calls to get_current_device() and get_default_queue()). To offload DPC++ code to the GPU, we need to create a queue with the parameter sycl::gpu_selector. The data to be processed must be made available on the device and to the kernel that executes on the GPU. The dimensions and size of the data being copied into and out of the GPU are specified by sycl::range and sycl::nd_range. When using DPCT, each source line in the CUDA code is migrated to equivalent DPC++ code. For the Find Peak device kernel code (d_findPeak), the DPC++ code generated from the CUDA code is almost a one-to-one migration. DPCT is therefore a very useful tool for fast porting and prototyping. The migrated DPC++ code is compared with the original CUDA code in Figures 7 and 9.
Having migrated the code to DPC++ using DPCT, our next task is to check correctness and efficiency. In some cases, the DPCT tool may replace preprocessor directive variables with their values; we may need a manual fix to undo this substitution. We may also get compilation errors in the migrated code that indicate a fix (e.g., replacing the CUDA threadIdx.x with the equivalent nd_item query). The Hogbom Clean application code has a correctness checker that helped us validate the results produced by the migrated DPC++ code. The correctness check was done by comparing the results of the DPC++ code executing on the GPU against a baseline C++ implementation running on the host CPU.
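The essence of such a checker is an element-wise comparison of the device results against the host baseline within a small tolerance. A hedged sketch (the application's actual checker, names, and tolerances may differ):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative correctness check: compare device-computed results against
// a host (baseline C++) reference, element by element, within a tolerance.
bool resultsMatch(const std::vector<float>& device,
                  const std::vector<float>& host,
                  float tol = 1e-5f) {
    if (device.size() != host.size()) return false;  // shape must agree
    for (std::size_t i = 0; i < device.size(); ++i) {
        if (std::fabs(device[i] - host[i]) > tol) return false;
    }
    return true;
}
```

A tolerance (rather than exact equality) matters here because GPU and CPU floating-point results can differ slightly in rounding and reduction order.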
Now we can assess the efficiency of the migrated DPC++ code on a GPU by analyzing its utilization (EU occupancy, use of caches, SP or DP FLOPS) and the data transfer between host and device. Among the parameters that affect GPU utilization are the work-group sizes/range dimensions. In the Hogbom Clean application, for Find Peak, these are nBlocks and findPeakWidth.
To illustrate the performance impact and tuning opportunity, Figure 10 shows performance profiles collected with nBlocks set to 24 and to 4; findPeakWidth was set to 256. The profiles were collected with Intel® VTune™ Profiler, which supports GPU profiling. Tuning is especially necessary when using DPCT, because parameters that are efficient for an NVIDIA GPU running CUDA may not be the best for an Intel GPU executing DPC++ code. Table 1 shows the statistics collected on Gen9 (48 EUs).
Figure 10: Hogbom Clean profile on Gen9 for two values of nBlocks: (a) 24 and (b) 4
Table 1. Performance metrics on the Gen9 GPU for the Find Peak hotspot
In addition to GPU utilization and efficiency optimizations, the data transfer between host and device should also be tuned. The Hogbom Clean application makes multiple calls to the Find Peak and Subtract PSF kernels, and the data used by these kernels can remain resident on the device. It therefore doesn't require reallocation or copying from host to device, or vice versa, on every call. (We'll discuss these kinds of data transfer and USM optimizations in future articles.)
Writing Better Algorithms
Understanding the Intel Processor Graphics Architecture and DPC++ features can help you write better algorithms and portable implementations. In this article, we reviewed some details of the architecture and explored a case study using DPC++ constructs and DPCT. It's important to tune kernel parameters to get the best performance on Intel GPUs, especially when using DPCT. We encourage you to try the Intel® DevCloud to develop, test, and run applications on the latest Intel hardware and software.