
NVIDIA CUDA Technology

The nvcc compiler driver translates host code (the main, controlling code) and device code (the GPU code in files with the .cu extension) into object files suitable for building the final program or library in any development environment, for example NetBeans.

The CUDA architecture uses a grid memory model, clustered threading, and SIMD instructions. It is applicable not only to high-performance graphics computing, but also to various scientific computations using nVidia graphics cards. Scientists and researchers use CUDA extensively in a variety of fields, including astrophysics, computational biology and chemistry, fluid dynamics simulation, electromagnetic interactions, computed tomography, seismic analysis, and more. CUDA programs can interoperate with applications that use OpenGL and Direct3D. CUDA is cross-platform software for operating systems such as Linux, Mac OS X and Windows.

On March 22, 2010, nVidia released CUDA Toolkit 3.0, which included OpenCL support.

Hardware

The CUDA platform first appeared on the market with the release of the eighth-generation NVIDIA G80 chip and is present in all subsequent series of graphics chips used in the GeForce, Quadro and NVIDIA Tesla accelerator families.

The first series of hardware to support the CUDA SDK, the G8x, had a 32-bit single-precision vector processor programmed through the CUDA SDK as an API (CUDA accepts the C double type, but on this hardware its precision is demoted to 32-bit floating point). The later GT200 processors added support for 64-bit precision, but its performance is significantly worse than for 32-bit precision, because there is only one dedicated double-precision unit per streaming multiprocessor alongside eight scalar processors. The GPU organizes hardware multithreading, which allows all of the GPU's resources to be used. This opens up the prospect of shifting the work of a physics accelerator onto the graphics accelerator (an example of such an implementation is nVidia PhysX), as well as wide opportunities for using a computer's graphics hardware for complex non-graphics computations, for example in computational biology and other branches of science.

Advantages

Compared with the traditional approach of organizing general-purpose computation through graphics APIs, the CUDA architecture offers a number of advantages in this area; they are listed in the section "Benefits and Limitations of CUDA" below.

Restrictions

  • Functions executed on the device do not support recursion (recursion and function pointers were added in CUDA Toolkit 3.1, on devices of compute capability 2.x) and have some other limitations.

Supported GPUs and Graphics Accelerators

The list of devices from the equipment manufacturer Nvidia with the declared full support for CUDA technology is given on the official Nvidia website: CUDA-Enabled GPU Products.

In fact, in today's PC hardware market, CUDA technology is supported by the following devices:

Compute capability (version) | GPU chips | Video cards
1.0 | G80, G92, G92b, G94, G94b | GeForce 8800 GTX/Ultra, 9400 GT, 9600 GT, 9800 GT, Tesla C/D/S870, FX 4/5600, 360M, GT 420
1.1 | G86, G84, G98, G96, G96b, G94, G94b, G92, G92b | GeForce 8400 GS/GT, 8600 GT/GTS, 8800 GT/GTS, 9600 GSO, 9800 GTX/GX2, GTS 250, GT 120/30/40, FX 4/570, 3/580, 17/18/3700, 4700x2, 1xxM, 32/370M, 3/5/770M, 16/17/27/28/36/37/3800M, NVS 420/50
1.2 | GT218, GT216, GT215 | GeForce 210, GT 220/40, FX 380 LP, 1800M, 370/380M, NVS 2/3100M
1.3 | GT200, GT200b | GeForce GTX 260, GTX 275, GTX 280, GTX 285, GTX 295, Tesla C/M1060, S1070, Quadro CX, FX 3/4/5800
2.0 | GF100, GF110 | GeForce (GF100) GTX 465, GTX 470, GTX 480, Tesla C2050, C2070, S/M2050/70, Quadro Plex 7000, Quadro 4000, 5000, 6000; GeForce (GF110) GTX 560 Ti 448, GTX 570, GTX 580, GTX 590
2.1 | GF104, GF114, GF116, GF108, GF106 | GeForce 610M, GT 430, GT 440, GTS 450, GTX 460, GTX 550 Ti, GTX 560, GTX 560 Ti, 500M, Quadro 600, 2000
3.0 | GK104, GK106, GK107 | GeForce GTX 690, GTX 680, GTX 670, GTX 660 Ti, GTX 660, GTX 650 Ti, GTX 650, GT 640, GeForce GTX 680MX, GTX 680M, GTX 675MX, GTX 670MX, GTX 660M, GT 650M, GT 645M, GT 640M
3.5 | GK110 |
Nvidia GeForce Desktop
GeForce GTX 590
GeForce GTX 580
GeForce GTX 570
GeForce GTX 560 Ti
GeForce GTX 560
GeForce GTX 550 Ti
GeForce GTX 520
GeForce GTX 480
GeForce GTX 470
GeForce GTX 465
GeForce GTX 460
GeForce GTS 450
GeForce GTX 295
GeForce GTX 285
GeForce GTX 280
GeForce GTX 275
GeForce GTX 260
GeForce GTS 250
GeForce GT 240
GeForce GT 220
GeForce 210
GeForce GTS 150
GeForce GT 130
GeForce GT 120
GeForce G100
GeForce 9800 GX2
GeForce 9800 GTX +
GeForce 9800 GTX
GeForce 9800 GT
GeForce 9600 GSO
GeForce 9600 GT
GeForce 9500 GT
GeForce 9400 GT
GeForce 9400 mGPU
GeForce 9300 mGPU
GeForce 8800 GTS 512
GeForce 8800 GT
GeForce 8600 GTS
GeForce 8600 GT
GeForce 8500 GT
GeForce 8400 GS
Nvidia GeForce Mobile
GeForce GTX 580M
GeForce GTX 570M
GeForce GTX 560M
GeForce GT 555M
GeForce GT 540M
GeForce GT 525M
GeForce GT 520M
GeForce GTX 485M
GeForce GTX 480M
GeForce GTX 470M
GeForce GTX 460M
GeForce GT 445M
GeForce GT 435M
GeForce GT 425M
GeForce GT 420M
GeForce GT 415M
GeForce GTX 285M
GeForce GTX 280M
GeForce GTX 260M
GeForce GTS 360M
GeForce GTS 350M
GeForce GTS 160M
GeForce GTS 150M
GeForce GT 335M
GeForce GT 330M
GeForce GT 325M
GeForce GT 240M
GeForce GT 130M
GeForce G210M
GeForce G110M
GeForce G105M
GeForce 310M
GeForce 305M
GeForce 9800M GTX
GeForce 9800M GT
GeForce 9800M GTS
GeForce 9700M GTS
GeForce 9700M GT
GeForce 9650M GS
GeForce 9600M GT
GeForce 9600M GS
GeForce 9500M GS
GeForce 9500M G
GeForce 9300M GS
GeForce 9300M G
GeForce 9200M GS
GeForce 9100M G
GeForce 8800M GTS
GeForce 8700M GT
GeForce 8600M GT
GeForce 8600M GS
GeForce 8400M GT
GeForce 8400M GS
Nvidia Tesla *
Tesla C2050 / C2070
Tesla M2050 / M2070 / M2090
Tesla S2050
Tesla S1070
Tesla M1060
Tesla C1060
Tesla C870
Tesla D870
Tesla S870
Nvidia Quadro Desktop
Quadro 6000
Quadro 5000
Quadro 4000
Quadro 2000
Quadro 600
Quadro FX 5800
Quadro FX 5600
Quadro FX 4800
Quadro FX 4700 X2
Quadro FX 4600
Quadro FX 3700
Quadro FX 1700
Quadro FX 570
Quadro FX 470
Quadro FX 380 Low Profile
Quadro FX 370
Quadro FX 370 Low Profile
Quadro CX
Quadro NVS 450
Quadro NVS 420
Quadro NVS 290
Quadro Plex 2100 D4
Quadro Plex 2200 D2
Quadro Plex 2100 S4
Quadro Plex 1000 Model IV
Nvidia Quadro for mobile computers
Quadro 5010M
Quadro 5000M
Quadro 4000M
Quadro 3000M
Quadro 2000M
Quadro 1000M
Quadro FX 3800M
Quadro FX 3700M
Quadro FX 3600M
Quadro FX 2800M
Quadro FX 2700M
Quadro FX 1800M
Quadro FX 1700M
Quadro FX 1600M
Quadro FX 880M
Quadro FX 770M
Quadro FX 570M
Quadro FX 380M
Quadro FX 370M
Quadro FX 360M
Quadro NVS 5100M
Quadro NVS 4200M
Quadro NVS 3100M
Quadro NVS 2100M
Quadro NVS 320M
Quadro NVS 160M
Quadro NVS 150M
Quadro NVS 140M
Quadro NVS 135M
Quadro NVS 130M
  • * The Tesla C1060, Tesla S1070, Tesla C2050/C2070, Tesla M2050/M2070 and Tesla S2050 models support double-precision computation on the GPU.

Features and specifications of different versions

Feature support by compute capability (version); features not listed here are supported on all compute capabilities:

  • Integer atomic functions operating on 32-bit words in global memory: no in 1.0, yes from 1.1
  • atomicExch() operating on 32-bit floating-point values in global memory: no in 1.0, yes from 1.1
  • Integer atomic functions operating on 32-bit words in shared memory: no in 1.0-1.1, yes from 1.2
  • atomicExch() operating on 32-bit floating-point values in shared memory: no in 1.0-1.1, yes from 1.2
  • Integer atomic functions operating on 64-bit words in global memory: no in 1.0-1.1, yes from 1.2
  • Warp vote functions: no in 1.0-1.1, yes from 1.2
  • Double-precision floating-point operations: no in 1.0-1.2, yes from 1.3
  • Atomic functions operating on 64-bit integer values in shared memory: 2.x only
  • Floating-point atomic addition operating on 32-bit words in global and shared memory: 2.x only
  • __ballot(): 2.x only
  • __threadfence_system(): 2.x only
  • __syncthreads_count(), __syncthreads_and(), __syncthreads_or(): 2.x only
  • Surface functions: 2.x only
  • 3D grid of thread blocks: 2.x only

Technical specifications by compute capability (version):

  • Maximum dimensionality of a grid of thread blocks: 2 (1.0-1.3), 3 (2.x)
  • Maximum x-, y- or z-dimension of a grid of thread blocks: 65535
  • Maximum dimensionality of a thread block: 3
  • Maximum x- or y-dimension of a block: 512 (1.0-1.3), 1024 (2.x)
  • Maximum z-dimension of a block: 64
  • Maximum number of threads per block: 512 (1.0-1.3), 1024 (2.x)
  • Warp size: 32
  • Maximum number of resident blocks per multiprocessor: 8
  • Maximum number of resident warps per multiprocessor: 24 (1.0-1.1), 32 (1.2-1.3), 48 (2.x)
  • Maximum number of resident threads per multiprocessor: 768 (1.0-1.1), 1024 (1.2-1.3), 1536 (2.x)
  • Number of 32-bit registers per multiprocessor: 8 K (1.0-1.1), 16 K (1.2-1.3), 32 K (2.x)
  • Maximum amount of shared memory per multiprocessor: 16 KB (1.0-1.3), 48 KB (2.x)
  • Number of shared memory banks: 16 (1.0-1.3), 32 (2.x)
  • Amount of local memory per thread: 16 KB (1.0-1.3), 512 KB (2.x)
  • Constant memory size: 64 KB
  • Cache working set per multiprocessor for constant memory: 8 KB
  • Cache working set per multiprocessor for texture memory: device dependent, between 6 KB and 8 KB
  • Maximum width for a 1D texture reference bound to a CUDA array: 8192 (1.0-1.3), 32768 (2.x)
  • Maximum width for a 1D texture reference bound to linear memory: 2^27
  • Maximum width and number of layers for a 1D layered texture reference: 8192 x 512 (1.0-1.3), 16384 x 2048 (2.x)
  • Maximum width and height for a 2D texture reference bound to linear memory or a CUDA array: 65536 x 32768 (1.0-1.3), 65536 x 65535 (2.x)
  • Maximum width, height and number of layers for a 2D layered texture reference: 8192 x 8192 x 512 (1.0-1.3), 16384 x 16384 x 2048 (2.x)
  • Maximum width, height and depth for a 3D texture reference bound to linear memory or a CUDA array: 2048 x 2048 x 2048
  • Maximum number of textures that can be bound to a kernel: 128
  • Maximum width for a 1D surface reference bound to a CUDA array: not supported (1.0-1.3), 8192 (2.x)
  • Maximum width and height for a 2D surface reference bound to a CUDA array: not supported (1.0-1.3), 8192 x 8192 (2.x)
  • Maximum number of surfaces that can be bound to a kernel: not supported (1.0-1.3), 8 (2.x)
  • Maximum number of instructions per kernel: 2 million

Example

// width, height, image and d_odata are assumed to be declared and initialized elsewhere
cudaArray *cu_array;
texture<float, 2> tex;

// Allocate the array
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaMallocArray(&cu_array, &desc, width, height);

// Copy image data to the array
cudaMemcpyToArray(cu_array, 0, 0, image, width * height * sizeof(float), cudaMemcpyHostToDevice);

// Bind the array to the texture
cudaBindTextureToArray(tex, cu_array);

// Run the kernel
dim3 blockDim(16, 16, 1);
dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);
kernel<<<gridDim, blockDim, 0>>>(d_odata, width, height);
cudaUnbindTexture(tex);

__global__ void kernel(float *odata, int height, int width)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    float c = tex2D(tex, x, y);
    odata[y * width + x] = c;
}

import pycuda.driver as drv
import numpy

drv.init()
dev = drv.Device(0)
ctx = dev.make_context()

# In newer PyCUDA releases SourceModule lives in pycuda.compiler
mod = drv.SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)

multiply_them(drv.Out(dest), drv.In(a), drv.In(b), block=(400, 1, 1))

print(dest - a * b)

CUDA as a subject in universities

As of December 2009, the CUDA programming model was taught at 269 universities around the world. In Russia, CUDA training courses are taught at St. Petersburg Polytechnic University, P. G. Demidov Yaroslavl State University, the Moscow, Nizhny Novgorod, St. Petersburg, Tver, Kazan, Novosibirsk (including Novosibirsk State Technical University), Omsk and Perm state universities, the International University of Nature, Society and Man "Dubna", Ivanovo State Power Engineering University, Belgorod State University, Bauman Moscow State Technical University, the Mendeleev Russian University of Chemical Technology, and the Interregional Supercomputer Center of the Russian Academy of Sciences. In addition, in December 2009 it was announced that the first Russian scientific and educational center "Parallel Computing", located in Dubna, had been launched; its tasks include training and consulting on solving complex computing problems on GPUs.

In Ukraine, CUDA courses are taught at the Kiev Institute for Systems Analysis.

Links

Official Resources

  • CUDA Zone (Russian) - the official site of CUDA
  • CUDA GPU Computing - Official web forums dedicated to CUDA computing

Unofficial resources

Tom "s Hardware
  • Dmitry Chekanov. nVidia CUDA: GPU Computing or CPU Death? ... Tom "s Hardware (June 22, 2008). Archived
  • Dmitry Chekanov. nVidia CUDA: GPU benchmarks for the mainstream market. Tom "s Hardware (May 19, 2009). Archived March 4, 2012. Retrieved May 19, 2009.
iXBT.com
  • Alexey Berillo. NVIDIA CUDA - Non-Graphics Computing on GPUs. Part 1 . iXBT.com (September 23, 2008). Archived from the original on March 4, 2012. Retrieved January 20, 2009.
  • Alexey Berillo. NVIDIA CUDA - Non-Graphics Computing on GPUs. Part 2 . iXBT.com (October 22, 2008). - Examples of NVIDIA CUDA implementation. Archived from the original on March 4, 2012. Retrieved January 20, 2009.
Other Resources
  • Boreskov Alexey Viktorovich. CUDA Fundamentals (January 20, 2009). Archived from the original on March 4, 2012. Retrieved January 20, 2009.
  • Vladimir Frolov. Introduction to CUDA technology. Network magazine "Computer graphics and multimedia" (December 19, 2008). Archived from the original on March 4, 2012. Retrieved October 28, 2009.
  • Igor Oskolkov. NVIDIA CUDA is the affordable ticket to the world of big computing. Computerra (April 30, 2009). Retrieved May 3, 2009.
  • Vladimir Frolov. Introduction to CUDA Technology (August 1, 2009). Archived from the original on March 4, 2012. Retrieved April 3, 2010.
  • GPGPU.ru. Using graphics cards for computing
  • ... Center for Parallel Computing


Background

Devices for turning personal computers into small supercomputers have been around for a long time. Back in the 1980s, so-called transputers were offered on the market; they plugged into the then-widespread ISA expansion slots. At first their performance in suitable tasks was impressive, but then the growth in the speed of general-purpose processors accelerated, these strengthened their position in parallel computing, and transputers lost their point. Although such devices still exist today as various specialized accelerators, their field of application is usually narrow and they have not seen wide distribution.

But recently the torch of parallel computing has passed to the mass market, one way or another connected with three-dimensional games. The general-purpose, multi-core, parallel vector devices used for 3D graphics reach peak performance that general-purpose processors cannot match. Of course, the maximum speed is achieved only in a number of convenient tasks and has some limitations, but such devices have already begun to be used in areas for which they were not originally intended. An excellent example of such a parallel processor is the Cell processor, developed by the Sony-Toshiba-IBM alliance and used in the Sony PlayStation 3, as well as all modern video cards from the market leaders, Nvidia and AMD.

We will not touch on Cell today, even though it appeared earlier and is a general-purpose processor with additional vector capabilities. For 3D video accelerators, the first technologies for non-graphics general-purpose computation, GPGPU (General-Purpose computation on GPUs), appeared a few years ago. After all, modern video chips contain hundreds of mathematical execution units, and this power can be used to significantly accelerate many computationally intensive applications. And current generations of GPUs have a flexible enough architecture that, together with high-level programming languages and hardware-software architectures like the one discussed in this article, exposes these capabilities and makes them much more accessible.

The development of GPGPU was prompted by the emergence of sufficiently fast and flexible shader programs that modern video chips can execute. The developers decided to have the GPU calculate not only the image in 3D applications but other parallel computations as well. In GPGPU, the graphics APIs OpenGL and Direct3D were used for this: data was passed to the video chip in the form of textures, and compute programs were loaded in the form of shaders. The disadvantages of this method are the relatively high complexity of programming, the low speed of data exchange between the CPU and GPU, and other limitations discussed below.

GPU computing has evolved, and is evolving, very quickly. Subsequently, the two main video chip manufacturers, Nvidia and AMD, developed and announced their respective platforms, called CUDA (Compute Unified Device Architecture) and CTM (Close To Metal, or AMD Stream Computing). Unlike previous GPU programming models, these provide direct access to the hardware capabilities of the video cards. The platforms are not compatible with each other: CUDA is an extension of the C programming language, while CTM is a virtual machine that executes assembly code. Both platforms removed some of the important limitations of earlier GPGPU models that used the traditional graphics pipeline and the corresponding Direct3D or OpenGL interfaces.

Of course, open standards based on OpenGL look the most portable and universal; they allow the same code to be used with video chips from different manufacturers. But such methods have many disadvantages: they are much less flexible and not as convenient to use. In addition, they do not allow the use of specific capabilities of certain video cards, such as the fast shared memory found in modern GPUs.

That is why Nvidia released the CUDA platform, a C-like programming language with its own compiler and libraries for GPU computing. Of course, writing optimal code for video chips is not at all simple and requires lengthy manual work, but CUDA does expose all the capabilities and gives the programmer more control over the hardware of the GPU. Importantly, Nvidia CUDA is supported by the G8x, G9x and GT2xx chips used in the very widespread Geforce 8, 9 and 200 series graphics cards. The final version of CUDA 2.0 has now been released, which brings some new features, for example support for double-precision calculations. CUDA is available on 32-bit and 64-bit Linux, Windows and MacOS X operating systems.

Difference between CPU and GPU in parallel calculations

The growth in clock frequency of general-purpose processors has run up against physical limits and high power consumption, so their performance is increasingly raised by placing several cores in a single chip. The processors sold today contain only up to four cores (further growth will not be rapid), they are designed for ordinary applications, and they use MIMD: multiple instruction streams over multiple data streams. Each core works separately from the others, executing different instructions for different processes.

Specialized vector capabilities (SSE2 and SSE3) for four-component (single-precision floating-point) and two-component (double-precision) vectors appeared in general-purpose processors primarily because of the increased demands of graphics applications. That is why for certain tasks the use of GPUs is more profitable: they were built for such workloads from the start.

For example, in Nvidia video chips, the main unit is a multiprocessor with eight to ten cores and hundreds of ALUs in total, several thousand registers and a small amount of shared memory. In addition, the video card contains fast global memory with access to it by all multiprocessors, local memory in each multiprocessor, and special memory for constants.

Most importantly, these multiprocessor cores in the GPU are SIMD (single instruction stream, multiple data streams) cores, and they execute the same instructions at the same time. This programming style is common for graphics algorithms and many scientific problems, but it requires specific programming. In return, this approach allows the number of execution units to be increased by simplifying them.

So, let's list the main differences between CPU and GPU architectures. CPU cores are designed to execute a single stream of sequential instructions at maximum performance, while GPUs are designed to quickly execute large numbers of concurrent instruction streams. General Purpose Processors are optimized for high performance from a single instruction stream handling both integers and floating point numbers. In this case, access to memory is random.

CPU designers try to get as many instructions to run in parallel as possible to improve performance. For this purpose, starting with the Intel Pentium processors, superscalar execution has appeared, which ensures the execution of two instructions per clock, and the Pentium Pro has distinguished itself by out-of-order execution of instructions. But the parallel execution of a sequential stream of instructions has certain basic limitations and by increasing the number of execution units, a multiple increase in speed cannot be achieved.

For video chips, the work is simple and initially parallelized. The video chip takes a group of polygons as input, performs all the necessary operations, and outputs pixels at the output. The processing of polygons and pixels is independent, they can be processed in parallel, separately from each other. Therefore, due to the initially parallel organization of work in the GPU, a large number of execution units are used, which are easy to load, in contrast to the sequential flow of instructions for the CPU. In addition, modern GPUs can also execute more than one instruction per clock cycle (dual issue). Thus, the Tesla architecture, under some conditions, launches MAD + MUL or MAD + SFU operations simultaneously.

The GPU also differs from the CPU in its memory access pattern. On the GPU it is coherent and easily predictable: if a texel of a texture is read from memory, then after a while the neighbouring texels will be needed as well. The same applies to writes: a pixel is written to the framebuffer, and a few clock cycles later the one located next to it will be written. Therefore the memory organization differs from that used in the CPU. A video chip, unlike a general-purpose processor, simply does not need a large cache, and textures require only a few kilobytes (up to 128-256 in current GPUs).

And in itself, the work with memory for the GPU and CPU is somewhat different. So, not all central processors have built-in memory controllers, and all GPUs usually have several controllers, up to eight 64-bit channels in the Nvidia GT200 chip. In addition, faster memory is used on video cards, and as a result, video chips have many times more memory bandwidth available, which is also very important for parallel calculations operating with huge data streams.

In general-purpose processors, large amounts of transistors and chip area go to instruction buffers, hardware branch prediction, and huge amounts of on-chip cache memory. All of these hardware blocks are needed to speed up the execution of a small number of instruction streams. Video chips spend transistors on arrays of execution units, flow control units, small shared memory and memory controllers for several channels. The above does not speed up the execution of individual threads; it allows the chip to process several thousand threads simultaneously executing on the chip and requiring high memory bandwidth.

There are also differences in caching. General-purpose CPUs use cache memory to increase performance by reducing memory access latency, while GPUs use cache or shared memory to increase bandwidth. CPUs reduce memory access latency with large caches and branch prediction; these hardware blocks take up most of the chip area and consume a lot of power. Video chips get around the problem of memory access latency by executing thousands of threads concurrently: while one thread is waiting for data from memory, the video chip can perform calculations for another thread without waiting or stalling.

There are many differences in multithreading support as well. The CPU performs 1-2 computation threads per processor core, and video chips can support up to 1024 threads per multiprocessor, of which there are several in the chip. And if switching from one thread to another for the CPU costs hundreds of cycles, then the GPU switches several threads in one cycle.

In addition, CPUs use SIMD (one instruction is executed on multiple data) blocks for vector computation, and video chips use SIMT (one instruction and multiple threads) for scalar processing of streams. SIMT does not require the developer to convert data to vectors, and allows arbitrary branching in streams.
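As a small, purely illustrative sketch of this point (the kernel name is hypothetical): with SIMT each thread is scalar and may follow its own branch, without any manual packing of data into vectors.

__global__ void clamp_negative(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (data[i] < 0.0f)   // threads of one warp may diverge here;
            data[i] = 0.0f;   // the hardware serializes the two paths
    }
}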

In short, unlike modern general-purpose CPUs, video chips are designed for parallel computation with a large number of arithmetic operations. A much larger share of GPU transistors works on the actual purpose, processing data arrays, rather than on controlling the execution (flow control) of a few sequential computational threads. A die diagram comparing how much area various logic occupies in a CPU and in a GPU illustrates this point.

As a result, the basis for the effective use of the power of the GPU in scientific and other non-graphical calculations is the parallelization of algorithms into hundreds of execution units available in video chips. For example, many molecular modeling applications are perfectly suited for computing on video chips, they require a lot of computing power and are therefore convenient for parallel computing. And the use of multiple GPUs gives even more computing power to solve such problems.

Performing calculations on a GPU shows excellent results in algorithms that use parallel data processing. That is, when the same sequence of mathematical operations is applied to a large amount of data. In this case, the best results are achieved if the ratio of the number of arithmetic instructions to the number of memory accesses is large enough. This places less demand on flow control, and the high density of mathematics and a large amount of data obviates the need for large caches like on the CPU.
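A minimal sketch of such a data-parallel pattern (illustrative only): one thread per element, with the same chain of arithmetic applied to every item, so arithmetic dominates memory traffic.

__global__ void evaluate_poly(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        // Horner's scheme for a fixed polynomial: several FLOPs per load/store
        y[i] = ((2.0f * v + 3.0f) * v + 0.5f) * v + 1.0f;
    }
}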

As a result of all the differences described above, the theoretical performance of video chips significantly exceeds that of CPUs. Nvidia illustrates this with a graph of CPU and GPU performance growth over the past few years.

Naturally, these data are not without a dose of guile. Indeed, on a CPU it is much easier to achieve theoretical figures in practice, and the figures are given for single precision in the case of a GPU, and for double precision in the case of a CPU. In any case, for some parallel tasks, single precision is enough, and the difference in speed between general-purpose and graphics processors is very large, and therefore the game is worth the effort.

The first attempts at using GPU computing

Attempts to use video chips for parallel mathematical calculations go back quite a long way. The earliest attempts were extremely primitive and limited to using certain hardware features, such as rasterization and Z-buffering. But in this century, with the advent of shaders, matrix calculations began to be accelerated. In 2003, SIGGRAPH allocated a separate section for GPU computing, and it was named GPGPU (General-Purpose computation on GPU).

The best-known early tool is BrookGPU, a compiler for the Brook streaming programming language, designed to perform non-graphics computations on the GPU. Before it appeared, developers using the capabilities of video chips for computing chose one of two common APIs: Direct3D or OpenGL. This seriously limited the use of GPUs, because 3D graphics uses shaders and textures, which parallel-programming specialists are not required to know about; they work with threads and cores. Brook was able to make their task easier. These streaming extensions to the C language, developed at Stanford University, hid the three-dimensional API from programmers and presented the video chip as a parallel coprocessor. The compiler parsed the .br file with C++ code and extensions, producing code linked to a DirectX, OpenGL, or x86 library.

Naturally, Brook had many flaws, which we will return to in more detail below. But even its mere appearance drew significant attention from Nvidia and ATI to the initiative of computing on GPUs, since the development of these capabilities would seriously change the market, opening a whole new sector of it: parallel computers based on video chips.

Subsequently, some researchers from the Brook project joined the Nvidia development team to introduce a hardware-software parallel computing strategy, opening up a new market segment. The main advantage of this Nvidia initiative is that the developers know all the capabilities of their GPUs down to the smallest detail, there is no need to use the graphics API, and the hardware can be worked with directly through the driver. The result of this team's efforts is Nvidia CUDA (Compute Unified Device Architecture), the new hardware and software architecture for parallel computing on Nvidia GPUs that this article is about.

Applications of parallel GPU computing

To understand the benefits of moving calculations to video chips, here are averaged figures obtained by researchers around the world. On average, when computation is moved to the GPU, many tasks see a 5-30x speedup compared to fast general-purpose processors. The largest numbers (around 100x speedups and even more!) are achieved on code that is not very well suited to calculation with SSE blocks but is quite convenient for the GPU.

These are just some examples of accelerating synthetic code on a GPU versus SSE-vectorized code on a CPU (according to Nvidia):

  • Fluorescence microscopy: 12x;
  • Molecular dynamics (non-bonded force calc): 8-16x;
  • Electrostatics (direct and multilevel Coulomb summation): 40-120x and 7x.

And here is a table that Nvidia is very fond of showing at all its presentations, and which we will examine in more detail in the second part of the article, devoted to specific examples of practical CUDA applications.

As you can see, the numbers are very attractive, the 100-150x gains are especially impressive. In the next article on CUDA, we'll break down some of these numbers in detail. Now let's list the main applications in which GPU computing is currently used: image and signal analysis and processing, physics simulation, computational mathematics, computational biology, financial calculations, databases, gas and liquid dynamics, cryptography, adaptive radiation therapy, astronomy, processing sound, bioinformatics, biological simulations, computer vision, data mining, digital cinema and television, electromagnetic simulations, geographic information systems, military applications, mining planning, molecular dynamics, magnetic resonance imaging (MRI), neural networks, oceanographic research, particle physics, protein folding simulation, quantum chemistry, ray tracing, visualization, radar, reservoir simulation, artificial intelligence, satellite data analysis, seismic exploration, surgery, ultrasound, video conferencing.

Details on many of the applications can be found on the Nvidia website in the software section. As you can see, the list is quite long, but that's not all! It can be continued, and we can certainly assume that in the future other areas of application of parallel calculations on video chips will be found, which we have not yet guessed about.

Nvidia CUDA Features

CUDA technology is Nvidia's hardware-software computing architecture, based on an extension of the C language, which provides access to the instruction set of the graphics accelerator and control over its memory for organizing parallel computations. CUDA helps implement algorithms that are executable on the GPUs of eighth-generation and later Geforce video accelerators (the Geforce 8, Geforce 9 and Geforce 200 series), as well as on Quadro and Tesla.

Although the complexity of GPU programming with CUDA is quite high, it is lower than with earlier GPGPU solutions. Such programs require splitting the application between several multiprocessors similar to MPI programming, but without dividing the data that is stored in the shared video memory. And since CUDA programming for each multiprocessor is similar to OpenMP programming, it requires a good understanding of memory organization. But, of course, the complexity of development and porting to CUDA is highly application dependent.

The developer kit contains many code examples and is well documented. The learning process will take about two to four weeks for those already familiar with OpenMP and MPI. The API is based on the extended C language, and to translate code from this language, the CUDA SDK includes the nvcc command line compiler based on the Open64 open source compiler.

Let's list the main characteristics of CUDA:

  • unified hardware and software solution for parallel computing on Nvidia video chips;
  • a wide range of supported solutions, from mobile to multichip
  • standard C programming language;
  • Standard Numerical Analysis Libraries FFT (Fast Fourier Transform) and BLAS (Linear Algebra);
  • optimized data exchange between CPU and GPU;
  • interaction with graphics APIs OpenGL and DirectX;
  • support for 32- and 64-bit operating systems: Windows XP, Windows Vista, Linux and MacOS X;
  • the ability to develop at a low level.

Regarding operating system support, it should be added that all major Linux distributions are officially supported (Red Hat Enterprise Linux 3.x/4.x/5.x, SUSE Linux 10.x), but judging by enthusiasts' reports, CUDA also works fine on other distributions: Fedora Core, Ubuntu, Gentoo, etc.

The CUDA Development Environment (CUDA Toolkit) includes:

  • nvcc compiler;
  • FFT and BLAS libraries;
  • profiler;
  • gdb debugger for GPU;
  • CUDA runtime driver included with standard Nvidia drivers
  • programming guide;
  • CUDA Developer SDK (source, utilities and documentation).

Among the source code examples: parallel bitonic sort, matrix transposition, parallel prefix summation of large arrays, image convolution, discrete wavelet transform, an example of interaction with OpenGL and Direct3D, use of the CUBLAS and CUFFT libraries, option price calculation (the Black-Scholes formula, the binomial model, the Monte Carlo method), the Mersenne Twister parallel random number generator, large-array histogram computation, noise reduction, and the Sobel filter (edge detection).

Benefits and Limitations of CUDA

From a programmer's point of view, the graphics pipeline is a collection of processing stages. The geometry block generates triangles, and the rasterization block generates the pixels displayed on the monitor. The traditional GPGPU programming model is built on top of this pipeline.

To transfer computations to GPUs within the framework of such a model, a special approach is needed. Even element-wise addition of two vectors will require drawing the shape to the screen or to an offscreen buffer. The shape is rasterized, the color of each pixel is calculated according to the specified program (pixel shader). The program reads input data from textures for each pixel, adds them and writes to the output buffer. And all these numerous operations are needed for what in an ordinary programming language is written in one operator!
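For contrast, the same element-wise vector addition written as a CUDA kernel (a sketch, with illustrative names) needs no textures, shaders or off-screen buffer; the whole computation is the one operator mentioned above.

__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];   // one statement per element, as in ordinary C
}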

Therefore, the use of GPGPU for general-purpose computing is limited by the steep learning curve it imposes on developers. And there are plenty of other restrictions, because a pixel shader is just a formula for the dependence of the final pixel colour on its coordinates, and the pixel shader language is a language for writing these formulas with a C-like syntax. Early GPGPU methods are a clever trick for exploiting the power of the GPU, but without any convenience: data is represented by images (textures), and the algorithm by the rasterization process. A very specific model of memory and execution must also be kept in mind.

Nvidia's hardware and software architecture for GPU computing differs from previous GPGPU models in that it allows you to write GPU programs in real C with a standard syntax, pointers, and the need for a minimum of extensions to access the computing resources of video chips. CUDA is independent of graphics APIs, and has some features specifically designed for general computing.

Advantages of CUDA over the traditional approach to GPGPU computing:

  • the CUDA application programming interface is based on the standard C programming language with extensions, which simplifies the process of learning and implementing the CUDA architecture;
  • CUDA provides access to 16KB shared memory per multiprocessor, which can be used to create a cache with a wider bandwidth than texture samples;
  • more efficient data transfer between system and video memory
  • no need for graphical APIs with redundancy and overhead;
  • linear memory addressing, and gather and scatter, the ability to write to arbitrary addresses;
  • hardware support for integer and bit operations.

The main limitations of CUDA:

  • lack of recursion support for functions executed on the device;
  • minimum block width of 32 threads;
  • Closed architecture CUDA, owned by Nvidia.

The weaknesses of programming using the previous GPGPU methods are that these methods do not use vertex shader execution units in previous non-unified architectures, data is stored in textures, but output to an off-screen buffer, and multi-pass algorithms use pixel shader units. GPGPU limitations can include: insufficient use of hardware capabilities, memory bandwidth limitations, no scatter operation (gather only), mandatory use of the graphics API.

The main advantages of CUDA over the previous GPGPU methods stem from the fact that this architecture is designed for efficient use of non-graphical computing on the GPU and uses the C programming language, without requiring the transfer of algorithms into a form convenient for the concept of a graphics pipeline. CUDA offers a new way of GPU computing that does not use graphics APIs, offering random access to memory (scatter or gather). This architecture is devoid of the disadvantages of GPGPU and uses all execution units, and also expands capabilities due to integer math and bit shift operations.

In addition, CUDA opens up some hardware capabilities not available from graphics APIs, such as shared memory. This is a small amount of memory (16 kilobytes per multiprocessor), which can be accessed by thread blocks. It allows you to cache the most frequently accessed data and can provide a higher speed than using texture fetches for this task. This, in turn, reduces the bandwidth sensitivity of parallel algorithms in many applications. For example, it is useful for linear algebra, FFT, and image processing filters.
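A hedged sketch of using shared memory as a software-managed cache (a block size of 256 threads is assumed; names are illustrative): each block stages a tile of the input once, and neighbouring threads then reuse it instead of re-reading global memory, as in this simple three-point smoothing filter.

__global__ void smooth3(const float *in, float *out, int n)
{
    __shared__ float tile[258];                    // blockDim.x == 256 assumed
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;

    tile[t] = (i < n) ? in[i] : 0.0f;              // each thread stages one element
    if (threadIdx.x == 0)
        tile[0] = (i > 0 && i - 1 < n) ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[t + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;
    __syncthreads();                               // the whole tile is now visible to the block

    if (i < n)
        out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
}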

More convenient in CUDA and memory access. The program code in the graphics API outputs data in the form of 32 single-precision floating-point values ​​(RGBA values ​​simultaneously in eight render targets) to predefined areas, and CUDA supports scatter recording - an unlimited number of records at any address. These advantages make it possible to execute on the GPU some algorithms that cannot be efficiently implemented using GPGPU methods based on graphics APIs.

Also, graphics APIs necessarily store data in textures, which requires preliminary packing of large arrays into textures, which complicates the algorithm and forces the use of special addressing. And CUDA allows you to read data at any address. Another advantage of CUDA is the optimized data exchange between the CPU and GPU. And for developers looking to get low-level access (for example, when writing another programming language), CUDA offers the possibility of low-level assembly language programming.

CUDA development history

CUDA development was announced with the G80 chip in November 2006, and the public beta version of the CUDA SDK was released in February 2007. Version 1.0 was released in June 2007 for the launch of Tesla solutions based on the G80 chip and intended for the high-performance computing market. Then, at the end of the year, the beta version of CUDA 1.1 was released, which, despite the slight increase in the version number, introduced quite a lot of new things.

What was new in CUDA 1.1 was the inclusion of CUDA functionality in the regular Nvidia video drivers. This meant that the requirements for any CUDA program could simply specify a Geforce 8 or higher video card and a minimum driver version of 169.xx. This is very important for developers: if these conditions are met, CUDA programs will work for any user. It also added asynchronous execution along with data copying (only for G84, G86, G92 and higher chips), asynchronous data transfer to video memory, atomic memory access operations, support for 64-bit versions of Windows, and the ability to run CUDA on multiple chips in SLI mode.

At the moment, the current version for GT200-based solutions is CUDA 2.0, released together with the Geforce GTX 200 line (a beta version was released back in the spring of 2008). The second version introduced support for double-precision calculations (hardware support only on GT200); Windows Vista (32- and 64-bit versions) and Mac OS X are finally supported; debugging and profiling tools were added; 3D textures are supported; and data transfer was optimized.

As for double-precision calculations, their speed on the current hardware generation is several times lower than single precision (the reasons were discussed in our earlier article). The implementation of this support in GT200 is not based on running the FP32 units at a quarter rate; instead, Nvidia decided to add dedicated computational units, and in GT200 there are ten times fewer of them than FP32 units (one double-precision unit per multiprocessor).

In reality the performance may be even lower, since the architecture is optimized for 32-bit reads from memory and registers; besides, double precision is not needed in graphics applications, and in GT200 it was added more for completeness than for speed. Modern quad-core processors show not much lower real double-precision performance. But even being 10 times slower than single precision, this support is useful for mixed-precision schemes. One common technique is to obtain an initial approximation in single precision and then refine it in double precision. Now this can be done directly on the video card, without sending intermediate data to the CPU.
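A small illustrative sketch of this mixed-precision idea (names are hypothetical; a GPU with compute capability 1.3 or higher is assumed): a cheap single-precision estimate of 1/a is refined by one Newton step carried out in double precision, entirely on the GPU.

__global__ void refine_reciprocal(const float *a, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float  x0 = 1.0f / a[i];        // fast FP32 approximation
        double x  = (double)x0;
        double ad = (double)a[i];
        x = x * (2.0 - ad * x);         // one Newton iteration in FP64
        out[i] = x;
    }
}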

Another useful feature of CUDA 2.0 is not related to the GPU, oddly enough. It is just now possible to compile CUDA code into highly efficient multithreaded SSE code for fast execution on the CPU. That is, now this feature is suitable not only for debugging, but also for real use on systems without an Nvidia video card. After all, the use of CUDA in regular code is constrained by the fact that Nvidia video cards, although the most popular among the dedicated video solutions, are not available in all systems. And before version 2.0, in such cases, you would have to make two different codes: for CUDA and separately for the CPU. And now you can execute any CUDA program on a CPU with high efficiency, albeit at a lower speed than on video chips.

Solutions with Nvidia CUDA support

All CUDA-enabled graphics cards can help accelerate most demanding tasks, from audio and video processing to medicine and scientific research. The only real limitation is that many CUDA programs require at least 256 megabytes of video memory, which is one of the most important technical specifications for CUDA applications.

For an up-to-date list of CUDA-enabled products, see the official Nvidia website. At the time of writing, CUDA calculations were supported by all products of the Geforce 200, Geforce 9 and Geforce 8 series, including mobile products starting with the Geforce 8400M, as well as the Geforce 8100, 8200 and 8300 chipsets. Modern Quadro products and all Tesla products also support CUDA: S1070, C1060, C870, D870 and S870.

We especially note that together with the new Geforce GTX 260 and 280 graphics cards, corresponding solutions for high-performance computing were announced: the Tesla C1060 and S1070 (shown in the photo above), which will be available for purchase this fall. The GPU in them is the same GT200; the C1060 has one, the S1070 has four. Unlike the gaming solutions, they use four gigabytes of memory per chip. On the downside, the memory clock and memory bandwidth are lower than on the gaming cards, providing 102 GB/s per chip.

Components of Nvidia CUDA

CUDA includes two APIs: a high-level one (the CUDA Runtime API) and a low-level one (the CUDA Driver API). The two cannot be used at the same time in one program; you use one or the other. The high-level API works "on top of" the low-level one: all runtime calls are translated into simple instructions processed by the low-level Driver API. But even the "high-level" API assumes knowledge of how Nvidia video chips are organized and work; the level of abstraction is not all that high.

There is one more level, even higher - two libraries:

CUBLAS: the CUDA version of BLAS (Basic Linear Algebra Subprograms), intended for computing linear algebra problems and using direct access to GPU resources;

CUFFT: the CUDA version of the Fast Fourier Transform library, which is widely used in signal processing. The following transform types are supported: complex-complex (C2C), real-complex (R2C), and complex-real (C2R).

Let's take a closer look at these libraries. CUBLAS consists of standard linear algebra algorithms translated into CUDA; currently only a certain subset of the basic CUBLAS functions is supported. The library is very easy to use: you create matrix and vector objects in the video card's memory, fill them with data, call the required CUBLAS functions, and load the results from video memory back into system memory. CUBLAS contains special functions for creating and destroying objects in GPU memory, and for reading and writing data to that memory. Supported BLAS functions: levels 1, 2 and 3 for real numbers, and level 1 CGEMM for complex numbers. Level 1 is vector-vector operations, level 2 is matrix-vector operations, and level 3 is matrix-matrix operations.
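As a minimal sketch of this workflow (using the legacy cublas.h API of that period; the wrapper function and variable names are illustrative): device vectors are created and filled, a level-1 BLAS routine is called, and the result is read back.

#include <cublas.h>

void saxpy_on_gpu(int n, float alpha, const float *x, float *y)
{
    float *d_x = 0, *d_y = 0;
    cublasInit();                                   // initialize the library
    cublasAlloc(n, sizeof(float), (void **)&d_x);   // objects in video memory
    cublasAlloc(n, sizeof(float), (void **)&d_y);
    cublasSetVector(n, sizeof(float), x, 1, d_x, 1);
    cublasSetVector(n, sizeof(float), y, 1, d_y, 1);
    cublasSaxpy(n, alpha, d_x, 1, d_y, 1);          // y = alpha*x + y on the GPU
    cublasGetVector(n, sizeof(float), d_y, 1, y, 1);
    cublasFree(d_x);
    cublasFree(d_y);
    cublasShutdown();
}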

CUFFT is the CUDA variant of the Fast Fourier Transform, which is widely used and very important in signal analysis, filtering, and so on. CUFFT provides a simple interface for efficiently computing FFTs on Nvidia GPUs without having to develop your own GPU FFT implementation. The CUDA FFT variant supports 1D, 2D, and 3D transforms of complex and real data, batch execution of several 1D transforms in parallel, a range of sizes for 2D and 3D transforms, and up to 8 million elements for 1D transforms.
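A sketch under the same assumptions (names illustrative; d_signal is presumed to already hold NX cufftComplex values in video memory) of a 1D complex-to-complex transform with CUFFT:

#include <cufft.h>

void forward_fft(cufftComplex *d_signal, int NX)
{
    cufftHandle plan;
    cufftPlan1d(&plan, NX, CUFFT_C2C, 1);                    // one batch of size NX
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);   // in-place forward transform
    cufftDestroy(plan);
}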

CUDA Programming Basics

To follow the rest of the text, you should understand the basic architectural features of Nvidia video chips. A GPU consists of several texture processing clusters. Each cluster consists of an enlarged texture fetch block and two or three streaming multiprocessors, each of which consists of eight computing units and two super function units (SFU). Instructions are executed according to the SIMD principle: one instruction is applied to all threads in a warp (a term from the textile industry; in CUDA a warp is a group of 32 threads, the minimum amount of data processed by a multiprocessor). Nvidia calls this execution style SIMT (single instruction, multiple threads).

Each multiprocessor has certain resources. There is a dedicated shared memory of 16 kilobytes per multiprocessor, but it is not a cache: the programmer can use it for any purpose, like the Local Store in the SPUs of Cell processors. This shared memory allows information to be exchanged between threads of the same block. Importantly, all threads of one block are always executed on the same multiprocessor, while threads from different blocks cannot exchange data, and this limitation must be kept in mind. Shared memory is often useful, except when several threads access the same memory bank. Multiprocessors can also access video memory, but with higher latency and lower bandwidth. To speed up access and reduce how often video memory is touched, each multiprocessor has 8 kilobytes of cache for constants and for texture data.

The multiprocessor uses 8192-16384 (for G8x / G9x and GT2xx, respectively) registers common to all streams of all blocks executed on it. The maximum number of blocks per multiprocessor for the G8x / G9x is eight and the number of warp is 24 (768 threads per multiprocessor). In total, the top-end video cards of the Geforce 8 and 9 series can process up to 12288 streams at a time. Geforce GTX 280 based on GT200 offers up to 1024 threads per multiprocessor, it has 10 clusters of three multiprocessors, processing up to 30720 threads. Knowing these constraints allows you to optimize the algorithms for the available resources.

The first step in porting an existing application to CUDA is to profile it and identify the bottlenecks in the code that hinder performance. If such sections are suitable for fast parallel execution, these functions are ported to C and CUDA extensions for execution on the GPU. The program is compiled using a compiler supplied by Nvidia, which generates code for both the CPU and GPU. When executing a program, the central processor executes its portions of the code, and the GPU executes the CUDA code with the heaviest parallel computations. This part dedicated to the GPU is called the kernel. The kernel defines the operations that will be performed on the data.

The video chip receives a kernel and creates copies of it for each data element. These copies are called threads. A thread contains a program counter, registers and state. For large amounts of data, such as image processing, millions of threads are launched. Threads execute in groups of 32 called warps, and warps are assigned for execution on specific streaming multiprocessors. Each multiprocessor consists of eight cores (stream processors) that execute one MAD instruction per clock cycle. Executing one 32-thread warp therefore takes four clock cycles of the multiprocessor (at the shader domain frequency, which is 1.5 GHz and higher).

A multiprocessor is not a traditional multi-core processor; it is built for multithreading and supports up to 32 warps at a time. Each clock cycle the hardware chooses which warp to execute and switches from one to another with no lost cycles. By analogy with a central processor, this is like running 32 programs simultaneously and switching between them every clock cycle with no context-switch cost. Real CPU cores support the simultaneous execution of one program and switch to others with a delay of hundreds of clock cycles.

CUDA programming model

Again, CUDA uses a parallel computation model, when each of the SIMD processors executes the same instruction on different data items in parallel. A GPU is a computing device, a coprocessor (device) for a central processor (host), which has its own memory and processes a large number of threads in parallel. The kernel (kernel) is a function for the GPU, executed by threads (analogy from 3D graphics - a shader).

As noted above, a video chip differs from a CPU in that it can process tens of thousands of threads simultaneously, which is usual for graphics workloads, which parallelize well. Each thread is scalar and does not require packing data into 4-component vectors, which is more convenient for most tasks. The number of logical threads and thread blocks exceeds the number of physical execution units, which gives good scalability across the company's entire range of solutions.

The CUDA programming model groups threads. Threads are combined into thread blocks, which interact with each other via shared memory and synchronization points. The program (kernel) runs over a grid of thread blocks; one grid is executed at a time. Each block can be one-, two-, or three-dimensional in shape and can contain up to 512 threads on current hardware.

Thread blocks are executed in small groups of 32 threads called warps; this is the minimum amount of data that can be processed by a multiprocessor. Since this is not always convenient, CUDA lets you work with blocks containing from 64 to 512 threads.

Grouping blocks into grids allows you to get away from restrictions and apply the kernel to more threads in a single call. It helps with scaling too. If the GPU does not have enough resources, it will execute blocks sequentially. Otherwise, the blocks can be executed in parallel, which is important for optimal distribution of work on video chips of different levels, starting from mobile and integrated ones.
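A small sketch of how a launch configuration is typically chosen for N independent elements (the kernel and its name are placeholders):

__global__ void my_kernel(const float *in, float *out, int n)   // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

void launch_for(int N, const float *d_in, float *d_out)
{
    dim3 block(256);                           // 256 threads per block, a common choice
    dim3 grid((N + block.x - 1) / block.x);    // enough blocks to cover all N elements
    my_kernel<<<grid, block>>>(d_in, d_out, N);
}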

CUDA memory model

The memory model in CUDA is notable for byte addressing and support for both gather and scatter. A fairly large number of registers is available per stream processor, up to 1024. Access to them is very fast, and they can store 32-bit integers or floating-point numbers.

Each thread has access to the following types of memory:

Global memory: the largest amount of memory, available to all multiprocessors on the video chip; its size ranges from 256 megabytes to 1.5 gigabytes on current solutions (and up to 4 GB on Tesla). It has high bandwidth, over 100 gigabytes/s for top Nvidia solutions, but very high latency of several hundred clock cycles. It is not cached, and it supports generic load and store instructions and ordinary pointers to memory.

Local memory: a small amount of memory that only one stream processor has access to. It is relatively slow, just like global memory.

Shared memory: a 16-kilobyte (on video chips of the current architecture) block of memory shared by all stream processors in a multiprocessor. This memory is quite fast, like registers. It enables interaction between threads, is directly controlled by the developer, and has low latency. Advantages of shared memory: its use as a programmer-controlled first-level cache, reduced latency for execution units (ALUs) accessing data, and a reduced number of accesses to global memory.

Constant memory: a 64-kilobyte memory area (the same on current GPUs), read-only for all multiprocessors. It is cached at 8 kilobytes per multiprocessor. It is quite slow: a latency of several hundred clock cycles when the required data is not in the cache.

Texture memory: a block of memory available for reading by all multiprocessors. Data is fetched through the texture units of the video chip, so linear interpolation of the data comes at no extra cost. 8 kilobytes are cached per multiprocessor. It is as slow as global memory: hundreds of clock cycles of latency when the data is not in the cache.

Naturally, global, local, texture and constant memory are physically the same memory, known as local video memory of a video card. Their differences are in various caching algorithms and access models. The central processor can update and request only external memory: global, constant and texture.

From the above it is clear that CUDA assumes a special approach to development, not quite the same as for CPU programs. You have to remember about the different types of memory, and that local and global memory are not cached and have much higher access latency than registers or shared memory, since they are physically located in separate chips.
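A schematic illustration (not a complete program; names are illustrative, and the texture is assumed to be bound on the host side) of how these memory types appear in CUDA C source code:

__constant__ float coeff[16];            // constant memory: cached, read-only in kernels
texture<float, 1> tex_in;                // texture reference, read through the texture units
__device__ float global_table[1024];     // a global-memory array visible to all threads

__global__ void example_kernel(float *out)            // out points into global memory
{
    __shared__ float buf[256];                        // shared memory of one thread block
    int i = threadIdx.x;                              // i normally lives in a register
    buf[i] = tex1Dfetch(tex_in, i) * coeff[i % 16];   // texture read and constant read
    __syncthreads();
    out[i] = buf[i] + global_table[i];                // write the result back to global memory
}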

A typical, though not mandatory, problem-solving pattern (a sketch follows the list):

  • the task is broken down into subtasks;
  • input data is divided into blocks that fit into shared memory;
  • each block is processed by a block of threads;
  • the subblock is loaded into shared memory from the global one;
  • the corresponding calculations are performed on the data in the shared memory;
  • the results are copied from shared memory back to global memory.
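A hedged sketch of this pattern (the block size is assumed to be 256 threads, a power of two; names are illustrative): each block loads its sub-array into shared memory, reduces it there, and writes one partial sum back to global memory.

__global__ void block_sum(const float *in, float *partial, int n)
{
    __shared__ float cache[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    cache[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // load the sub-block into shared memory
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // reduce inside shared memory
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];              // copy the block's result back
}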

Programming environment

CUDA includes runtime libraries:

  • a common part providing built-in vector types and subsets of RTL calls supported on CPU and GPU;
  • CPU component for controlling one or more GPUs;
  • GPU component that provides GPU-specific functionality.

The main process of the CUDA application runs on a universal processor (host), it starts several copies of the kernel processes on the video card. The code for the CPU does the following: initializes the GPU, allocates memory on the video card and the system, copies the constants to the video card memory, launches several copies of the kernel processes on the video card, copies the result from the video memory, frees the memory and exits.

As an example, here is the CPU-side code for adding two vectors in CUDA:
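A minimal sketch of such code, with an illustrative VecAdd kernel, array size and block size (error checking omitted for brevity):

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void VecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t size = n * sizeof(float);

        float *h_a = (float *)malloc(size), *h_b = (float *)malloc(size), *h_c = (float *)malloc(size);
        for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

        float *d_a, *d_b, *d_c;                                   // allocate memory on the video card
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);       // copy the input data to the GPU
        cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

        int threads = 256;
        VecAdd<<<(n + threads - 1) / threads, threads>>>(d_a, d_b, d_c, n);   // launch the kernel

        cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);       // copy the result from video memory
        printf("c[1] = %f\n", h_c[1]);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);              // free video memory and exit
        free(h_a); free(h_b); free(h_c);
        return 0;
    }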

The functions executed by the video chip have the following restrictions: no recursion, no static variables inside functions, and no variable number of arguments. Two types of memory management are supported: linear memory accessed through 32-bit pointers, and CUDA arrays accessed only through texture fetch functions.

CUDA programs can interact with graphics APIs: to render data generated by the program, or to read rendering results back and process them with CUDA (for example, when implementing post-processing filters). For this, graphics API resources can be mapped (obtaining the resource address) into the CUDA global memory space. The following graphics API resource types are supported: buffer objects (PBO/VBO) in OpenGL, and, in Direct3D9, vertex buffers and textures (2D, 3D and cube maps).

Stages of compiling a CUDA application:

CUDA C source files are compiled with the NVCC program, which is a wrapper over other tools and calls them: cudacc, g++, cl, and so on. NVCC generates pure C code for the central processor, which is compiled together with the rest of the application, and PTX object code for the video chip. Executables with CUDA code require the CUDA runtime library (cudart) and the CUDA core library (cuda).

Optimization of CUDA programs

Naturally, a review article cannot cover the serious questions of optimization in CUDA programming, so let's just briefly go over the basics. To use CUDA's capabilities effectively, you need to forget the usual ways of writing CPU programs and use algorithms that parallelize well across thousands of threads. It is also important to choose the best place to store data (registers, shared memory, and so on), to minimize data transfers between the CPU and GPU, and to use buffering.

In general terms, when optimizing a CUDA program you should try to strike an optimal balance between the size and the number of blocks. More threads per block reduces the impact of memory latency, but also reduces the number of registers available. In addition, a block of 512 threads is inefficient; Nvidia itself recommends blocks of 128 or 256 threads as a compromise between latency hiding and the number of registers.

The main optimization points for CUDA programs are: use shared memory as actively as possible, since it is much faster than the card's global video memory; and make reads and writes to global memory coalesced whenever possible. To do this, use special data types that read and write 32/64/128 bits of data in a single operation. If the reads are hard to coalesce, you can try texture fetches.
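As a hedged illustration of the idea, reading a float4 (128 bits) per thread instead of four separate floats lets the threads of a half-warp issue one wide, combined transaction; the kernel below is purely illustrative and assumes the array length is a multiple of four:

    __global__ void scale4(const float4 *in, float4 *out, int n4, float k)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4) {
            float4 v = in[i];            // one 128-bit coalesced load per thread
            v.x *= k; v.y *= k; v.z *= k; v.w *= k;
            out[i] = v;                  // one 128-bit coalesced store per thread
        }
    }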

Conclusions

The hardware and software architecture for computing on video chips that Nvidia has presented, CUDA, is well suited to solving a wide range of highly parallel tasks. CUDA runs on a large number of Nvidia GPUs and improves the GPU programming model, making it much simpler and adding many features such as shared memory, thread synchronization, double-precision and integer operations.

CUDA is a technology available to every software developer; it can be used by any programmer who knows the C language. You just have to get used to the different programming paradigm inherent in parallel computing. But if the algorithm parallelizes well in principle, the effort and time spent on learning CUDA programming will pay back many times over.

It is likely that, given how widespread video cards are, the development of parallel computing on the GPU will strongly affect the high-performance computing industry. These possibilities have already aroused great interest in scientific circles, and not only there. After all, speedups of dozens of times at once, on readily available hardware (which is no less important), do not come along that often.

General-purpose processors are developing rather slowly and show no such performance leaps. Essentially, even if it sounds too grand, anyone in need of fast computing can now have an inexpensive personal supercomputer on their desk, sometimes without even spending extra money, since Nvidia graphics cards are so widespread. Not to mention the efficiency gains in GFLOPS/$ and GFLOPS/W that GPU makers are so fond of citing.

The future of many computations clearly lies in parallel algorithms; almost all new solutions and initiatives point in this direction. So far, however, the development of new paradigms is at an early stage: you have to create threads manually and schedule memory accesses by hand, which makes the task harder than conventional programming. But CUDA is a step in the right direction, and a successful solution is clearly visible in it, especially if Nvidia manages to convince as many developers as possible of its benefits and prospects.

But, of course, GPUs will not replace CPUs; in their current form they are not intended for that. While video chips are gradually moving toward the CPU, becoming more and more universal (single- and double-precision floating-point calculations, integer calculations), CPUs are becoming more and more "parallel", acquiring many cores and multithreading technologies, not to mention wider SIMD units and heterogeneous-processor projects. Most likely, GPU and CPU will simply merge in the future. It is known that many companies, including Intel and AMD, are working on such projects. And it does not matter whether the GPU is swallowed up by the CPU or vice versa.

In this article we have mainly talked about the benefits of CUDA. But there is also a fly in the ointment. One of the few disadvantages of CUDA is its poor portability. The architecture works only on video chips from this one company, and not even all of them, but only starting with the GeForce 8 and 9 series and the corresponding Quadro and Tesla models. Yes, there are a lot of such solutions in the world; Nvidia cites a figure of 90 million CUDA-compatible video chips. That is just fine, but competitors offer solutions of their own: AMD has Stream Computing, and Intel will have Ct in the future.

Which of these technologies will win, become widespread and outlive the others, only time will tell. But CUDA has good chances, since compared to Stream Computing, for example, it offers a more mature and easier-to-use programming environment based on the familiar C language. Perhaps a third party will help settle the matter by releasing some common solution. For example, Microsoft has promised compute shaders in the upcoming DirectX 11, which could become a kind of middle-ground solution that suits everyone, or almost everyone.

Judging by preliminary data, this new type of shader borrows a lot from the CUDA model. By programming in this environment now, you gain benefits immediately, as well as the skills you will need later. From an HPC perspective, DirectX also has the clear disadvantage of poor portability, since the API is limited to the Windows platform. However, another standard is being developed: the open, multi-platform OpenCL initiative, supported by most companies, including Nvidia, AMD, Intel, IBM and many others.

In the next CUDA article we will look at specific practical applications of scientific and other non-graphics computing that developers around the world have built with Nvidia CUDA.

For decades, Moore's Law has held: the number of transistors on a chip doubles roughly every two years. However, it was formulated back in 1965, and over the past five years the idea of physically multicore consumer-class processors has developed rapidly: in 2005 Intel introduced the Pentium D and AMD the Athlon X2. Back then, applications that used two cores could be counted on the fingers of one hand. Yet the next generation of Intel processors, which made a revolution, had exactly two physical cores. Moreover, the Quad series appeared in January 2007, and around the same time Moore himself admitted that his law would soon cease to hold.

What now? Dual-core processors are found even in budget office systems, and four physical cores became the norm in just two or three years. Processor frequencies are not growing; instead, the architecture improves and the number of physical and virtual cores increases. Meanwhile, the idea of using video adapters equipped with dozens or even hundreds of computing "blocks" has been around for a long time.

And although the prospects for GPU computing are huge, and the most popular solution, Nvidia CUDA, is free, well documented and generally quite simple to adopt, there are not many applications using this technology. Mostly these are all kinds of specialized calculations that the ordinary user in most cases does not care about. But there are also programs aimed at the mass user, and we will talk about them in this article.

To begin with, a few words about the technology itself and what it is all about. Since this article is written for a wide range of readers, I will try to explain it in accessible language, briefly and without complex terms.

CUDA(English Compute Unified Device Architecture) is a software and hardware architecture that allows computing using NVIDIA graphics processors that support GPGPU technology (arbitrary computing on video cards). The CUDA architecture first appeared on the market with the release of the eighth generation NVIDIA G80 chip and is present in all subsequent series of graphics chips used in the GeForce, Quadro and Tesla accelerator families. (c) Wikipedia.org

Incoming threads are processed independently of each other, i.e. in parallel.

Moreover, there is a division into 3 levels:

Grid: the kernel. It contains a one-, two- or three-dimensional array of blocks.

Block: contains many threads. Threads of different blocks cannot interact with each other. Why introduce blocks at all? Each block is essentially responsible for its own subtask. For example, a large image (which is a matrix) can be split into several smaller parts (matrices), with each part of the image processed in parallel.

Thread: a single thread of execution. Threads within one block can interact either through shared memory (which, by the way, is much faster than global memory) or through thread synchronization primitives.

Warp: a group of threads executed together; for all modern GPUs the warp size is 32. There is also the half-warp, half of a warp, since memory accesses are usually serviced separately for the first and second halves of a warp.
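A small sketch of how this hierarchy looks in code: a two-dimensional grid of two-dimensional blocks, with every thread computing which pixel it owns (the kernel, image layout and sizes are illustrative):

    __global__ void fillGradient(unsigned char *image, int width, int height)
    {
        // Each thread works out which pixel it owns from its block and thread indices.
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            image[y * width + x] = (unsigned char)((x + y) & 0xFF);
    }

    // Host-side launch: 16x16-thread blocks, and a grid big enough to cover the whole image.
    //   dim3 block(16, 16);
    //   dim3 grid((width + 15) / 16, (height + 15) / 16);
    //   fillGradient<<<grid, block>>>(d_image, width, height);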

As you can see, this architecture lends itself well to parallelizing tasks. Still, although programming is done in the C language with some restrictions, in reality it is not so simple, since not everything can be parallelized. There are also no standard functions for generating random numbers (or for initialization); all of this has to be implemented separately, and although there are plenty of ready-made options, none of it is much fun. The ability to use recursion appeared only relatively recently.

For clarity, a small console program (console, to keep the code minimal) was written that performs operations on two float arrays, i.e. on non-integer values. For the reasons above, initialization (filling the arrays with arbitrary values) was done by the CPU. Then 25 different operations were performed on the corresponding elements of each array, with intermediate results written into a third array. The array size was varied; the results are given below.
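Purely for illustration, such a test kernel could look roughly like this, with a made-up chain of arithmetic standing in for the 25 operations of the real test (this is not the actual code used):

    __global__ void stress(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = a[i], y = b[i], r = 0.0f;
            for (int k = 0; k < 25; ++k)                       // stand-in for the 25 operations
                r = r * 0.5f + sqrtf(x * x + y * y) * 0.1f + k;
            c[i] = r;                                          // intermediate results go into the third array
        }
    }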

In total, 4 tests were carried out:

1024 elements in each array:

It is clear that with such a small number of elements there is little point in parallel computation, since the computation itself takes far less time than its preparation.

4096 elements in each array:

And now you can see that the video card performs the array operations 3 times faster than the processor. Moreover, the execution time of this test on the video card did not increase (the slight decrease can be attributed to measurement error).

Now there are 12288 elements in each array:

The gap between the video card and the processor has doubled. Again, note that the execution time on the video card increased only insignificantly, while on the processor it grew more than threefold, i.e. in proportion to the complexity of the task.

And the last test - 36864 elements in each array:

In this case the speedup reaches impressive values: almost 22 times faster on the video card. And again, the execution time on the video card increased only insignificantly, while on the processor it grew the expected threefold, once more in proportion to the complexity of the task.

If you keep making the calculations more complex, the video card wins by a larger and larger margin. Although the example is somewhat exaggerated, on the whole it illustrates the situation clearly. But, as mentioned above, not everything can be parallelized. Take computing pi: the only examples around use the Monte Carlo method, and their accuracy is 7 decimal places, i.e. ordinary float. To increase the accuracy, arbitrary-precision (long) arithmetic is needed, and here problems arise, because it is very, very difficult to implement efficiently. On the Internet I could not find any examples that use CUDA to compute pi to 1 million decimal places. There have been attempts to write such an application, but the simplest and most efficient methods for computing pi are the Brent-Salamin algorithm and the Gauss formula. The well-known SuperPI most likely uses the Gauss formula (judging by its speed and the number of iterations). And judging by the fact that SuperPI is single-threaded, by the lack of CUDA examples and by the failure of my own attempts, it is not feasible to parallelize the calculation of pi effectively.

By the way, you can watch the GPU load, as well as the memory usage, grow while the calculations are running.

Now let's move on to the more practical benefits of CUDA, namely the currently existing programs using this technology. For the most part, these are all kinds of audio / video converters and editors.

We used 3 different video files for testing:

      * The history of the movie Avatar - 1920x1080, MPEG4, h.264.
      * "Lie to me" series - 1280x720, MPEG4, h.264.
      * Series "It's Always Sunny in Philadelphia" - 624x464, xvid.

The container and size of the first two files were MKV and 1.55 GB; the last one was AVI and 272 MB.

Let's start with a much-hyped and popular product, Badaboom. The version used was 1.2.1.74; the program costs $29.90.

The program interface is simple and intuitive: on the left we select the source file or disc, and on the right the target device for which we will encode. There is also a user mode where parameters are set manually, and that is the mode we used.

To begin with, let's see how quickly and how well video is encoded "into itself", i.e. at the same resolution and approximately the same size. Speed is measured in fps rather than elapsed time: that makes it easier both to compare and to estimate how long an arbitrary-length video will take to compress. And since today we are looking at the "green" company's technology, the graphs come in matching colors :)

Encoding speed obviously depends directly on quality. A light resolution (let's call it SD, as is traditional) is no problem for Badaboom: the encoding speed is 5.5 times the frame rate of the original (24 fps) video. Even heavy 1080p video is converted in real time. The quality of the final video is very close to the original; in other words, Badaboom encodes at very high quality.

But usually video is transcoded to a lower resolution, so let's see how things look in this mode. As the resolution was lowered, the bitrate was lowered as well: 9500 kbps for 1080p output, 4100 kbps for 720p and 2400 kbps for 720x404. The values were chosen for a reasonable size/quality ratio.

Comments are superfluous. If you rip from 720p to normal SD quality, transcoding a two-hour movie takes about 30 minutes. And the processor load stays insignificant, so you can go about your business without any discomfort.

But what if you convert the video to a format for a mobile device? To do this, select the iPhone profile (bitrate 1 Mbps, 480x320) and look at the encoding speed:

Need I say more? A two-hour movie in normal iPhone quality is transcoded in less than 15 minutes. With an HD source it is harder, but still quite fast. The main thing is that the quality of the output video remains quite high when viewed on the phone's display.

Overall, the impressions of Badaboom are positive: the speed is pleasing, and the interface is simple and straightforward. All the bugs of earlier versions (I used a beta back in 2008) have been fixed, except one: the path to the source file, as well as to the folder where the finished video is saved, must not contain Russian letters. Against the background of the program's merits, this drawback is insignificant.

Next in line is Super LoiLoScope. The regular version costs 3,280 rubles, and the touch version, which supports touch control in Windows 7, costs as much as 4,440 rubles. Let's try to figure out why the developer wants that kind of money and why a video editor needs multitouch support. The latest version, 1.8.3.3, was used.

It is rather difficult to describe the program's interface in words, so I decided to shoot a short video. I must say right away that, as with all CUDA video converters, GPU acceleration is supported only for output to MPEG4 with the h.264 codec.

During encoding, the processor load is 100%, but this does not cause discomfort. The browser and other non-heavy applications do not slow down.

Now let's move on to performance. To begin with, everything is the same as with Badaboom - video transcoding into the same quality.

The results are much better than Badaboom's. The quality is also excellent: the difference from the original can be seen only by comparing frames side by side under a magnifying glass.

Wow, here LoiLoScope outpaces Badaboom by a factor of 2.5. At the same time, you can easily cut and encode another video in parallel, read the news and even watch a movie, and even FullHD plays back without problems, although the processor load is at its maximum.

Now let's try making a video for a mobile device; we will name the profile the same as in Badaboom: iPhone (480x320, 1 Mbps):

There is no mistake: everything was checked several times, and the result was the same each time. Most likely this happens simply because the SD file is encoded with a different codec and in a different container. During transcoding the video is first decoded, split into matrices of a certain size, and then compressed. The ASP decoder used for xvid is slower than the AVC decoder (for h.264) at parallel decoding. Still, 192 fps is 8 times the original frame rate: a 23-minute episode is compressed in less than 4 minutes. The same was observed with other files compressed with xvid/DivX.

LoiLoScope left only pleasant impressions: the interface, despite being unusual, is convenient and functional, and the speed is beyond praise. The relatively limited functionality is somewhat frustrating, but simple editing often just means adjusting colors a little, making smooth transitions and overlaying text, and LoiLoScope handles that superbly. The price is also somewhat off-putting: more than $100 for the regular version is normal abroad, but such figures still seem rather wild to us. Although, I confess, if I often shot and edited home video I might have considered buying it. By the way, I also checked editing HD (or rather AVCHD) content straight from the camcorder without converting it to another format first, and LoiLoScope showed no problems with mts files.

Let us turn to history, back to 2003, when Intel and AMD were locked in a race for the most powerful processor. In just a few years this race produced a significant increase in clock speeds, especially after the release of the Intel Pentium 4.

But the race was rapidly approaching its limit. After a wave of huge increases in clock speeds (between 2001 and 2003 the clock speed of the Pentium 4 doubled from 1.5 to 3 GHz), users had to settle for the tenths of a gigahertz that manufacturers managed to squeeze out (from 2003 to 2005 clock speeds grew only from 3 to 3.8 GHz).

Even architectures optimized for high clock speeds, such as Prescott, began to run into difficulties, and this time not only manufacturing ones. Chip makers had simply run into the laws of physics. Some analysts even predicted that Moore's Law would cease to hold, but it did not. Its original meaning is often distorted: it concerns the number of transistors on a silicon die. For a long time the growth in transistor count in CPUs was accompanied by a corresponding growth in performance, which is what led to the distortion. But then things got harder. CPU architects ran into the law of diminishing returns: the number of transistors that had to be added for a given performance gain kept growing, leading to a dead end.



While CPU makers were tearing their hair out trying to solve their problems, GPU makers continued to benefit handsomely from Moore's Law.

Why haven't they hit the same dead end as the CPU architects? The reason is very simple: CPUs are designed to get maximum performance out of a single instruction stream that processes different kinds of data (integers and floating-point numbers), performs random memory accesses, and so on. To this day, developers keep trying to extract more instruction-level parallelism, i.e. to execute as many instructions in parallel as possible. Thus, the Pentium introduced superscalar execution, which under certain conditions allowed two instructions per clock cycle; the Pentium Pro added out-of-order execution, which let the computational units be used more efficiently. The problem is that parallel execution of a sequential instruction stream has obvious limits, so blindly increasing the number of computational units gives no benefit: they would still sit idle most of the time.

In contrast, the GPU's job is relatively straightforward: it accepts a group of polygons on one side and produces a group of pixels on the other. Polygons and pixels are independent of each other, so they can be processed in parallel. Thus, in a GPU, a large part of the die can be devoted to computational units which, unlike in the CPU, will actually be used.


The GPU differs from the CPU in other ways too. Memory access in the GPU is highly coherent: if a texel is read, then a few clock cycles later the adjacent texel will be read; when a pixel is written, a few clock cycles later the adjacent one will be written. By organizing memory judiciously, you can get performance close to the theoretical bandwidth. This means that the GPU, unlike the CPU, does not need a huge cache; its cache only has to speed up texturing operations, so a few kilobytes holding the texels used by bilinear and trilinear filters are enough.


Long live GeForce FX!

The two worlds remained separate for a long time. We used a CPU (or even several CPUs) for office tasks and internet applications, and GPUs were only good for speeding up rendering. But one thing changed everything: the advent of programmable GPUs. At first, central processors had nothing to fear. The first so-called programmable GPUs (NV20 and R200) were hardly a threat: the number of instructions in a program was limited to about 10, and they worked on very exotic data types such as 9- or 12-bit fixed-point numbers.


But Moore's Law again showed its best side. The growing number of transistors not only increased the number of computing units but also improved their flexibility. The appearance of the NV30 can be considered a significant step forward for several reasons. Gamers, of course, did not much like the NV30 cards, but the new GPUs brought two features designed to change the perception of GPUs as nothing more than graphics accelerators:

  • Support for single precision floating point calculations (even if it did not comply with the IEEE754 standard);
  • support for more than a thousand instructions.

So all the conditions were in place to attract pioneering researchers, who are always on the lookout for additional computing power.

The idea of using graphics accelerators for mathematical calculations is not new; the first attempts were made back in the 1990s. Of course, they were very primitive, limited for the most part to using certain hardware features such as rasterization and Z-buffers to speed up tasks like route finding or rendering Voronoi diagrams.


In 2003, with the advent of more evolved shaders, a new milestone was reached: this time, matrix calculations. That was the year an entire section of SIGGRAPH ("Computations on GPUs") was dedicated to this new area of IT. The early initiative was named GPGPU (General-Purpose computation on GPU). And an early turning point was the appearance of BrookGPU.

To understand the role of BrookGPU, you need to understand how things worked before it appeared. The only way to tap GPU resources in 2003 was to use one of the two graphics APIs, Direct3D or OpenGL. Consequently, developers who wanted GPU power for their computations had to rely on those two APIs. The problem is that they were not always experts in video card programming, which made access to the technology seriously difficult. While 3D programmers think in shaders, textures and fragments, specialists in parallel programming think in threads, kernels, scatter operations and so on. So the first step was to draw analogies between the two worlds.

  • Stream: a stream of elements of the same type; on the GPU it can be represented by a texture. Its closest analogue in classical programming is an array.
  • Kernel: a function applied independently to each element of the stream; the equivalent of a pixel shader. In classical programming the closest analogy is a loop body applied to a large number of elements.
  • To read back the results of applying a kernel to a stream, a texture must be created. There is no equivalent on the CPU, which has full access to memory.
  • The memory location where a write will land (in scatter operations) is controlled through the vertex shader, since the pixel shader cannot change the coordinates of the pixel it is processing.

As you can see, even with these analogies in hand, the task does not look simple. And Brook came to the rescue. The name refers to a set of extensions to the C language ("C with streams"), as the developers at Stanford called them. At its core, Brook's job was to hide all the components of the 3D API from the programmer, which made it possible to present the GPU as a coprocessor for parallel computation. To do this, the Brook compiler processed a .br file containing C++ code and extensions and generated C++ code that was linked to a library with support for different back ends (DirectX, OpenGL ARB, OpenGL NV3x, x86).


Brook has several merits, the first of which is bringing GPGPU out of the shadows so that the general public could get to know the technology. After the project was announced, though, a number of IT sites were over-optimistic, reporting that the release of Brook cast doubt on the very existence of CPUs, which would soon be replaced by more powerful GPUs. As we can see, that has not happened even five years later, and to be honest we don't think it ever will. On the other hand, looking at the evolution of CPUs, which are increasingly oriented towards parallelism (more cores, SMT multithreading, wider SIMD units), and of GPUs, which, on the contrary, are becoming more and more universal (single-precision floating-point, integer computation, double-precision support), it looks as if GPU and CPU will simply merge. What will happen then? Will GPUs be swallowed up by the CPU, as happened with math coprocessors? Quite possibly. Intel and AMD are currently working on such projects. But a lot can still change.

But back to our topic. Brook's contribution was to popularize the GPGPU concept: it significantly simplified access to GPU resources, which allowed more and more users to master the new programming model. On the other hand, for all of Brook's qualities, there was still a long way to go before GPU resources could be used routinely for computation.

One problem is the different levels of abstraction and, in particular, the excessive overhead added by the 3D API, which can be quite noticeable. A more serious problem is compatibility, about which the Brook developers could do nothing. Competition among GPU manufacturers is fierce, so they optimize their drivers frequently; while these optimizations are mostly good for gamers, they could break Brook compatibility overnight. It is therefore hard to imagine using this API in production code meant to run anywhere, and for a long time Brook remained the preserve of hobbyist researchers and programmers.

However, Brook's success was enough to attract the attention of ATI and nVidia, who took an interest in the initiative, since it could expand the market and open up a significant new sector for both companies.

Researchers originally involved with the Brook project quickly joined the development teams in Santa Clara to present a global strategy for developing the new market. The idea was to create a combination of hardware and software suited to GPGPU tasks. Since nVidia's developers know all the secrets of their GPUs, there was no need to rely on the graphics API; instead, they could talk to the GPU directly through the driver. Of course, that has problems of its own. In the end, the CUDA (Compute Unified Device Architecture) development team created a set of software layers for working with the GPU.


As you can see in the diagram, CUDA provides two APIs:

  • High-level API: CUDA Runtime API;
  • low-level API: CUDA Driver API.

Since the high-level API is implemented on top of the low-level one, each runtime function call is broken down into simpler instructions processed by the Driver API. Note that the two APIs are mutually exclusive: a programmer uses one or the other, but mixing calls to both will not work. The term "high-level API" is relative here. Even the Runtime API is one that many would consider low-level, although it does provide functions that are quite handy for initialization or context management. Do not expect a particularly high level of abstraction: you still need a good knowledge of nVidia GPUs and how they work.
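To give a feel for the difference in verbosity, here is a hedged sketch of the same kernel launch through the two APIs of that time; the module and kernel names are illustrative, the kernel is assumed to have been compiled to a cubin with an extern "C" entry point, and error checking and context setup (cuInit, cuCtxCreate) are omitted:

    #include <cuda.h>

    // Runtime API equivalent: VecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    void launchWithDriverApi(CUdeviceptr d_a, CUdeviceptr d_b, CUdeviceptr d_c, int n)
    {
        CUmodule   mod;
        CUfunction func;
        cuModuleLoad(&mod, "vecadd.cubin");             // load the compiled GPU code
        cuModuleGetFunction(&func, mod, "VecAdd");      // look the kernel up by name

        int offset = 0;                                 // pass the arguments by hand, at explicit offsets
        cuParamSetv(func, offset, &d_a, sizeof(d_a)); offset += sizeof(d_a);
        cuParamSetv(func, offset, &d_b, sizeof(d_b)); offset += sizeof(d_b);
        cuParamSetv(func, offset, &d_c, sizeof(d_c)); offset += sizeof(d_c);
        cuParamSeti(func, offset, n);                 offset += sizeof(int);
        cuParamSetSize(func, offset);

        cuFuncSetBlockShape(func, 256, 1, 1);           // 256 threads per block
        cuLaunchGrid(func, (n + 255) / 256, 1);         // grid size in blocks
    }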

The Driver API is even more demanding; it takes more effort just to get GPU processing running. On the other hand, the low-level API is more flexible, giving the programmer additional control when needed. Both APIs can work with OpenGL or Direct3D resources (only version 9 for now). The benefit is obvious: CUDA can be used to create resources (geometry, procedural textures and so on) that are then passed to the graphics API, or, conversely, the 3D API can hand its rendering results to the CUDA program, which then performs post-processing. There are many examples of such interactions, and the advantage is that the resources stay in GPU memory without having to cross the PCI Express bus, which is still a bottleneck.

It should be noted, however, that sharing resources in video memory is not always smooth and can cause some headaches. For example, when the resolution or color depth changes, graphics data takes priority, so if the framebuffer needs more resources, the driver will happily take them from CUDA applications, which will simply crash with an error. Not very elegant, of course, but it should not happen very often. And while we are on the subject of drawbacks: if you want to use multiple GPUs for CUDA applications, you first need to disable SLI mode, otherwise CUDA applications will only be able to "see" one GPU.

Finally, the third software layer is devoted to libraries, two of them, to be precise (a short usage sketch follows the list):

  • CUBLAS, which provides the building blocks for linear algebra computations on the GPU;
  • CUFFT, which supports the computation of Fourier transforms, an algorithm widely used in signal processing.
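For example, a minimal CUFFT call sequence could look like this, assuming the input signal already sits in video memory as an array of cufftComplex values:

    #include <cufft.h>

    // Forward FFT of nx complex points already resident in GPU memory (d_signal).
    void forwardFft(cufftComplex *d_signal, int nx)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, nx, CUFFT_C2C, 1);                   // one 1D complex-to-complex transform
        cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in-place forward transform
        cufftDestroy(plan);                                     // release the plan's resources
    }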

Before we dive into CUDA, let me define a number of terms scattered throughout the nVidia documentation. The company has chosen very specific terminology that takes getting used to. First of all, note that a thread in CUDA has far from the same meaning as a CPU thread, nor is it the same as a thread in our GPU articles. A GPU thread in this case is the elementary set of data to be processed. Unlike CPU threads, CUDA threads are very "light": switching context between two threads is by no means a costly operation.

The second term often found in the CUDA documentation is warp. There is no risk of confusion here, since it has no direct Russian analogue (unless you are a fan of Star Trek or the Warhammer game). The term actually comes from the textile industry, where weft yarn is pulled through the warp yarn stretched on a loom. A warp in CUDA is a group of 32 threads and is the minimum amount of data processed in SIMD fashion by a CUDA multiprocessor.

But this granularity is not always convenient for the programmer, so in CUDA, instead of working with warps directly, you work with blocks containing from 64 to 512 threads.

Finally, these blocks are gathered into grids. The advantage of this grouping is that the number of blocks the GPU processes at once is tied closely to the hardware resources, as we will see below. Grouping blocks into grids lets you abstract away from this limitation entirely and apply the kernel to more threads in a single call without worrying about fixed resources; the CUDA runtime takes care of that. The model also scales well: if the GPU has few resources, it executes the blocks sequentially; if it has many computational processors, the blocks can run in parallel. The same code can thus run on entry-level GPUs, top-end ones and even future models.

A couple of other terms in the CUDA API denote the CPU (host) and the GPU (device). If this little introduction hasn't scared you off, it is time to take a closer look at CUDA.

If you read Tom's Hardware Guide regularly, the architecture of the latest nVidia GPUs is familiar to you. If not, we recommend the article "nVidia GeForce GTX 260 and 280: the next generation of graphics cards". In CUDA terms, nVidia presents the architecture in a slightly different way, exposing some details that were previously hidden.

As you can see from the illustration above, the nVidia shader core is made up of several Texture Processor Clusters (TPC). The 8800 GTX, for example, uses eight clusters, the 8800 GTS six, and so on. Each cluster essentially consists of a texture unit and two streaming multiprocessors. The latter include the front end of the pipeline, which reads, decodes and dispatches instructions, and the back end, which consists of eight computational units and two Super Function Units (SFU), where instructions are executed in SIMD fashion: one instruction applies to all threads of a warp. nVidia calls this way of working SIMT (single instruction, multiple threads). It is important to note that the back end runs at twice the frequency of the front end. In practice, this means the back end looks twice as "wide" as it actually is (i.e. like a 16-way SIMD unit instead of an eight-way one). Streaming multiprocessors work as follows: each clock cycle, the front end picks a warp ready for execution and starts executing an instruction. For the instruction to apply to all 32 threads of the warp would take four clock cycles, but since the back end runs at twice the frequency, it takes only two (in terms of front-end cycles). Therefore, so that the front end does not idle for a cycle and the hardware stays as busy as possible, the ideal case is to alternate instructions every cycle: a classic instruction in one cycle and an SFU instruction in the next.

Each multiprocessor has a specific set of resources that are worth understanding. There is a small area of memory called shared memory, 16 KB per multiprocessor. This is by no means a cache: the programmer is free to use it however they like. In other words, it is something close to the Local Store of the SPUs in the Cell processor. This detail is curious, because it emphasizes that CUDA is a combination of software and hardware technologies. This memory area is not used by pixel shaders, as nVidia wittily puts it: "we don't like it when pixels talk to each other."

This memory area allows threads within one block to exchange information. It is important to stress this limitation: all threads of a block are guaranteed to run on one multiprocessor. Conversely, nothing at all is guaranteed about which multiprocessor a block is assigned to, and two threads from different blocks cannot exchange information at runtime. So using shared memory is not always simple. Still, shared memory pays off, except when several threads try to access the same memory bank and cause a conflict; in other situations, access to shared memory is as fast as access to registers.
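A hedged sketch of such intra-block interaction: every thread publishes its value in shared memory and then reads its neighbour's, which only works because all the threads involved belong to the same block (the kernel is illustrative and assumes blocks of at most 256 threads):

    __global__ void neighbourSum(const float *in, float *out, int n)
    {
        __shared__ float buf[256];                      // visible only to the threads of this block
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;      // each thread publishes its value
        __syncthreads();                                // wait until the whole block has written

        if (i < n) {
            int next = (threadIdx.x + 1) % blockDim.x;  // neighbour within the same block
            out[i] = buf[threadIdx.x] + buf[next];
        }
    }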

Shared memory is not the only memory multiprocessors can access. They can also use video memory, but with lower bandwidth and higher latency. Therefore, to reduce how often this memory is accessed, nVidia equipped the multiprocessors with a cache (about 8 KB per multiprocessor) for constants and textures.

Each multiprocessor has 8,192 registers, shared by all threads of all blocks active on that multiprocessor. The number of active blocks per multiprocessor cannot exceed eight, and the number of active warps is limited to 24 (768 threads). The 8800 GTX can therefore handle up to 12,288 threads at a time. All these limits are worth knowing, since they allow an algorithm to be tuned to the available resources.

Optimizing a CUDA program therefore comes down to finding the optimal balance between the number of blocks and their size. More threads per block helps hide memory latency, but it also reduces the number of registers available per thread. Moreover, a block of 512 threads would be inefficient, since only one such block could be active on a multiprocessor, wasting the potential of 256 threads. nVidia therefore recommends blocks of 128 or 256 threads, which gives the best compromise between latency hiding and register count for most kernels.

From the programming point of view, CUDA consists of a set of C extensions reminiscent of BrookGPU, plus a few specific API calls. The extensions include qualifiers applied to functions and variables. The key word to remember is __global__, which, placed before a function, marks it as a kernel: the function is called by the CPU and executed on the GPU. The __device__ prefix indicates that a function runs on the GPU (which CUDA, by the way, calls the "device") but can only be called from the GPU (in other words, from another __device__ function or from a __global__ function). Finally, the optional __host__ prefix denotes a function that is both called and executed by the CPU, i.e. a regular function.

__device__ and __global__ functions have a number of limitations: they cannot be recursive (i.e. call themselves), and they cannot have a variable number of arguments. Also, since __device__ functions live in the GPU's memory space, it follows that you cannot take their address. Variables likewise have qualifiers indicating the memory area where they will be stored: a variable prefixed with __shared__ is placed in the streaming multiprocessor's shared memory. Calling a __global__ function is slightly different: at the call site you must specify the execution configuration, namely the size of the grid to which the kernel will be applied and the size of each block. Take, for example, a kernel with the following signature:

__global__ void Func(float *parameter);

It will be called as

Func<<<Dg, Db>>>(parameter);

where Dg is the grid size and Db is the block size. These two variables are of a new vector type (dim3) introduced with CUDA.

The CUDA API contains functions for working with memory in VRAM: cudaMalloc to allocate memory, cudaFree to free it, and cudaMemcpy to copy data between RAM and VRAM in either direction.
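Each of these calls returns a cudaError_t, so a minimal, hedged sketch of a round trip with error checking might look like this (the buffer size is illustrative):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int roundTrip(void)
    {
        float h_buf[1024];                     // host (RAM) buffer
        float *d_buf = NULL;                   // device (VRAM) pointer
        size_t bytes = sizeof(h_buf);

        cudaError_t err = cudaMalloc((void **)&d_buf, bytes);      // allocate VRAM
        if (err != cudaSuccess) { printf("%s\n", cudaGetErrorString(err)); return -1; }

        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);   // RAM -> VRAM
        /* ... kernels that use d_buf would be launched here ... */
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);   // VRAM -> RAM

        cudaFree(d_buf);                                           // release VRAM
        return 0;
    }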

We will end this overview with the rather interesting way a CUDA program is compiled: compilation proceeds in several steps. First, the CPU code is extracted and passed to the standard compiler. The GPU-specific code is first converted to the PTX intermediate language, which resembles assembly and lets you examine the code for potential inefficiencies. In the final phase, the intermediate language is translated into GPU-specific instructions and a binary file is created.

After looking through the nVidia documentation, you just want to try CUDA out. Indeed, what better way to evaluate an API than to write your own program? That is when most problems surface, even if everything looks perfect on paper. Besides, practice shows best how well you have absorbed the principles laid out in the CUDA documentation.

It's pretty easy to dive into a project like this. Today, a large number of free but high-quality tools are available for download. For our test we used Visual C++ Express 2005, which has everything needed. The hardest part was finding a program that would not take weeks to port to the GPU, yet would be interesting enough to make the effort worthwhile. In the end we chose a chunk of code that takes a heightmap and computes the corresponding normal map. We will not go into the function in detail, since it is hardly interesting here. In short, the program deals with the curvature of areas: for each pixel of the initial image, we apply a matrix that determines the color of the resulting pixel in the generated image from the adjacent pixels, using a more or less complex formula. The advantage of this function is that it is very easy to parallelize, so this test demonstrates the capabilities of CUDA perfectly.


Another advantage is that we already have an implementation on the CPU, so we can compare its result with the CUDA version - and not reinvent the wheel.
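To give an idea of what we mean, a hypothetical kernel of this kind (not the actual code used in the test) could compute a normal from the neighbouring height values like this:

    __global__ void heightToNormal(const float *height, float *normal, int w, int h)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1)
            return;                                   // borders handled separately (or by wrapping)

        // Central differences over neighbouring pixels give the surface gradient.
        float dx = height[y * w + (x + 1)] - height[y * w + (x - 1)];
        float dy = height[(y + 1) * w + x] - height[(y - 1) * w + x];

        float len = sqrtf(dx * dx + dy * dy + 1.0f);  // normalize the (-dx, -dy, 1) vector
        normal[3 * (y * w + x) + 0] = -dx / len;
        normal[3 * (y * w + x) + 1] = -dy / len;
        normal[3 * (y * w + x) + 2] = 1.0f / len;
    }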

We repeat once again that the purpose of the test was to get familiar with the CUDA SDK tools, not to run a comparative benchmark of the CPU and GPU versions. Since this was our first attempt at a CUDA program, we did not expect high performance. And since this part of the code is not critical, the CPU version was not optimized, so a direct comparison of the results is of limited interest.

Performance

Still, we measured the execution time to see whether CUDA offers an advantage even with the crudest implementation, or whether long and tedious practice is needed to get any GPU gain. The test machine came from our development lab: a laptop with a Core 2 Duo T5450 processor and a GeForce 8600M GT graphics card running Vista. That is far from a supercomputer, but the results are all the more interesting because the test is not "tailored" to the GPU. It is always nice to see nVidia show off huge gains on systems with monstrous GPUs and plenty of bandwidth, but in practice many of the 70 million CUDA-capable GPUs on today's PC market are nowhere near as powerful, which is why our test has a right to exist.

For a 2,048 x 2,048 pixel image, we got the following results.

  • CPU 1 thread: 1 419 ms;
  • CPU 2 threads: 749 ms;
  • CPU 4 threads: 593 ms;
  • GPU (8600M GT) blocks of 256 threads: 109 ms;
  • GPU (8600M GT) blocks of 128 threads: 94 ms;
  • GPU (8800 GTX) blocks of 128 threads / 256 threads: 31 ms.

Several conclusions can be drawn from the results. Let's start with the fact that, despite all the talk about programmers' legendary laziness, we did modify the initial CPU version to use multiple threads. As mentioned, the code is ideal for this: all that is required is to split the initial image into as many zones as there are threads. Note that the speedup from going from one thread to two on our dual-core CPU turned out to be almost linear, which also points to the parallel nature of the test program. Quite unexpectedly, the four-thread version was also faster, although on our processor one would rather expect a drop in efficiency from the overhead of managing the extra threads. How can this be explained? It is hard to say; the Windows thread scheduler may be to blame, but in any case the result is repeatable. With smaller textures (512x512) the gain from threading was less pronounced (about 35% versus 100%), and the four-thread version behaved more logically, showing no gain over the two-thread version. The GPU was still faster, but less dramatically so (the 8600M GT was three times faster than the two-thread version).


The second important observation is that even the slowest GPU implementation turned out to be almost six times faster than the best-performing CPU version. For a first program and an unoptimized version of the algorithm, that is very encouraging. Note also that we got a noticeably better result with small blocks, although intuition might suggest the opposite. The explanation is simple: our program uses 14 registers per thread, so 256-thread blocks need 3,584 registers per block, while 768 threads are needed to fully load a multiprocessor, as we showed. In our case that would be three blocks, or 10,752 registers, but a multiprocessor has only 8,192 registers, so it can keep only two such blocks active. With 128-thread blocks, by contrast, we need 1,792 registers per block; dividing 8,192 by 1,792 and rounding down gives four blocks. In practice the number of threads is the same (512 per multiprocessor, although theoretically 768 are needed for a full load), but the larger number of blocks gives the GPU extra flexibility in memory access: when an operation with high latency is in flight, execution of instructions from another block can begin while the results are awaited. Four blocks clearly hide the latency better, especially since our program makes multiple memory accesses.

Analysis

Finally, despite what we said above, we could not resist temptation and ran the program on an 8800 GTX, which turned out to be three times faster than the 8600 regardless of block size. You might expect that, given the two architectures, the difference would be four times or more: 128 ALUs/shader processors versus 32, and a higher clock speed (1.35 GHz versus 950 MHz), but in practice it was not. The most likely limiting factor is memory access. To be more precise, the initial image is accessed as a multidimensional CUDA array, a rather elaborate term for what is nothing more than a texture. This brings several benefits:

  • accesses benefit from texture cache;
  • we use wrapping mode, so there is no need to handle image borders, unlike in the CPU version.

In addition, we could take advantage of "free" filtering with normalized addressing (texture coordinates in the 0 to 1 range instead of pixel coordinates), but in our case this is hardly useful. As you know, the 8600 has 16 texture units versus 32 on the 8800 GTX, so on that count the ratio between the two architectures is only two to one. Add the difference in frequencies and we get (32 x 0.575) / (16 x 0.475) = 2.4, close to the "three to one" we actually measured. This theory also explains why the block size changes little on the G80: the ALUs are still limited by the texture units.
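For illustration, with the texture-reference API of that era the wrap-mode access could be sketched roughly as follows (names and setup details are assumptions, not the actual test code):

    // File-scope texture reference over a 2D CUDA array of floats.
    texture<float, 2, cudaReadModeElementType> texHeight;

    __global__ void readNeighbour(float *out, int w, int h)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;

        // Normalized coordinates; wrap addressing handles reads past the border for us.
        float u = (x + 0.5f) / w, v = (y + 0.5f) / h;
        out[y * w + x] = tex2D(texHeight, u + 1.0f / w, v);   // fetch the right-hand neighbour through the texture cache
    }

    // Host-side setup (sketch):
    //   texHeight.normalized     = true;
    //   texHeight.addressMode[0] = cudaAddressModeWrap;
    //   texHeight.addressMode[1] = cudaAddressModeWrap;
    //   texHeight.filterMode     = cudaFilterModePoint;
    //   cudaBindTextureToArray(texHeight, cuArray);          // cuArray: a cudaArray holding the image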


Promising results aside, our first encounter with CUDA went very well, especially given the not-so-favorable conditions chosen: developing on a Vista laptop means using CUDA SDK 2.0, still in beta, with driver 174.55, also a beta. Despite that, we cannot report any unpleasant surprises, only the usual errors during the first debugging sessions, when our still quite buggy program tried to address memory outside the allocated space.

The monitor flickered wildly, then the screen went black... until Vista launched the driver recovery service and everything was fine again. It is still somewhat startling to watch if you are used to seeing an ordinary Segmentation Fault from a standard program like ours. Finally, a small criticism of nVidia: in all the documentation available for CUDA there is no short guide walking you step by step through setting up a development environment for Visual Studio. It is not a big problem, since the SDK contains a full set of examples you can study to understand the skeleton of a CUDA application, but a beginner's guide would be welcome.


Nvidia introduced CUDA together with the GeForce 8800. The promises were tempting at the time, but we held back our enthusiasm until the real test. Back then it looked more like staking out territory to stay on the GPGPU wave; without an available SDK, it was hard to say it was not just another marketing dummy that would come to nothing. It would not have been the first time a good initiative was announced too early and never saw the light of day for lack of support, especially in such a competitive sector. Now, a year and a half after the announcement, we can confidently say that nVidia has kept its word.

The SDK appeared fairly quickly in beta form in early 2007 and has been updated rapidly since then, which shows how much this project matters to nVidia. Today CUDA is developing quite nicely: the SDK is already available in beta 2.0 for the main operating systems (Windows XP and Vista, Linux, and 1.1 for Mac OS X), and nVidia has dedicated a whole section of its site to developers.

On a more practical level, the impression from our first steps with CUDA is very positive. Even if you are not already familiar with GPU architecture, it is easy to figure out. When an API looks straightforward at first glance, you immediately begin to believe you will get convincing results. But won't the computation time be eaten up by the numerous CPU-to-GPU transfers? And how do you use these thousands of threads with almost no synchronization primitives? We began our experiments with all these fears in mind, but they quickly dissipated when the first version of our algorithm, trivial as it was, turned out to be significantly faster than on the CPU.

So CUDA is not just a pretext for researchers who want to convince their university to buy them a GeForce. CUDA is an already fully available technology that any C programmer can use, provided they are willing to invest the time and effort to get used to the new programming paradigm. The effort will not be wasted if your algorithms parallelize well. We would also like to thank nVidia for providing complete, high-quality documentation in which aspiring CUDA programmers will find answers.

What does CUDA need to become a recognized API? In one word: portability. We know that the future of IT lies in parallel computing; everyone is preparing for the change, and all initiatives, software and hardware alike, point in that direction. At the moment, though, the development of programming paradigms is still at an early stage: we create threads by hand and try to plan access to shared resources, which is manageable as long as the number of cores can be counted on the fingers of one hand. But in a few years, when processors number in the hundreds, that will no longer be possible. With the release of CUDA, nVidia took the first step towards solving this problem, but of course the solution only works with this company's GPUs, and not even all of them: only the GeForce 8 and 9 series (and their Quadro/Tesla derivatives) can run CUDA programs today. And the new 260/280 range, of course.


Nvidia can boast of having sold 70 million CUDA-compatible GPUs worldwide, but that is still not enough to become the de facto standard, especially since the competition is not sitting idle. AMD offers its own SDK (Stream Computing), and Intel has announced a solution (Ct), although it is not yet available. A standards war is coming, and there is clearly no room for three competitors in the market, unless another player, such as Microsoft, comes up with a common API, which would of course make developers' lives easier.

So nVidia faces plenty of difficulties in getting CUDA accepted. Technologically it is, without a doubt, a successful solution, but it still remains to convince developers of its prospects, and that will not be easy. However, judging by the many recent announcements and news around the API, the future looks far from bleak.

DirectX is a set of low-level programming interfaces (APIs) for creating games and other high-performance multimedia applications. It includes support for high-performance 2D and 3D graphics, sound and input devices.

Direct3D (D3D) is an interface for rendering three-dimensional primitives (geometric bodies). It is part of DirectX.

OpenGL (Open Graphics Library) is a specification defining a cross-platform, language-independent programming interface for writing applications that use two-dimensional and three-dimensional computer graphics. It includes over 250 functions for drawing complex 3D scenes from simple primitives. It is used in video games, virtual reality and scientific visualization. On the Windows platform it competes with DirectX.

OpenCL (Open Computing Language) is a framework for writing computer programs that involve parallel computation on various graphics processors (GPU) and central processors (CPU). The OpenCL framework includes a programming language and an application programming interface (API). OpenCL provides parallelism at the instruction and data levels and is an implementation of the GPGPU technique.

GPGPU (General-Purpose computation on Graphics Processing Units) is the technique of using the graphics processor of a video card for general-purpose computations that are normally performed by the CPU.

Shader (English shader) - a program used in three-dimensional graphics to determine the final parameters of an object or image during the shading of synthesized images. It typically includes routines of arbitrary complexity describing light absorption and scattering, texture mapping, reflection and refraction, shading, surface displacement and post-processing effects. Complex surfaces can be rendered using simple geometric shapes.

Rendering (English rendering) - visualization; in computer graphics, the process of producing an image from a model by means of software.

SDK (abbreviated from the English Software Development Kit) - a set of software development tools.

CPU (abbreviated from the English Central Processing Unit, literally "central/main processing device") - the central processor (microprocessor); the device that executes machine instructions; the piece of hardware responsible for performing the computational operations specified by the operating system and application software and for coordinating the work of all other devices.

GPU (abbreviated from the English Graphics Processing Unit, literally "graphics processing device") - a graphics processor; a separate device in a computer or game console that performs graphics rendering. Modern GPUs process and render computer graphics very efficiently. In modern video adapters the graphics processor is used as an accelerator for three-dimensional graphics, but in some cases it can also be used for general-purpose computations (GPGPU).

CPU problems

For a long time the performance of conventional processors grew mainly through a steady increase in clock frequency (about 80% of performance was determined by the clock frequency), together with an increase in the number of transistors on a single die. However, further increases in clock frequency (above roughly 3.8 GHz the chips simply overheat!) run into a number of fundamental physical barriers, since the manufacturing process is approaching the scale of the atom itself (the size of a silicon atom is about 0.543 nm):

First, as transistor dimensions shrink and the clock frequency rises, transistor leakage current increases, which leads to higher power consumption and greater heat dissipation;

Second, the benefits of higher clock speeds are partly offset by memory latency, since memory access times do not keep pace with rising clock speeds;

Third, for some applications traditional sequential architectures become inefficient as the clock rate grows because of the so-called "von Neumann bottleneck", a performance limitation that results from the sequential flow of computation. At the same time, resistive-capacitive signal propagation delays increase, an additional bottleneck associated with raising the clock frequency.

GPU development

Meanwhile, CPUs and GPUs continued to develop in parallel:

November 2008 - Intel introduced the 4-core Intel Core i7 line, based on the new-generation Nehalem microarchitecture. The processors run at clock speeds of 2.6-3.2 GHz and are manufactured on a 45 nm process.

December 2008 - shipments began of the 4-core AMD Phenom II 940 (codename Deneb). It runs at 3 GHz and is manufactured on a 45 nm process.

May 2009 - AMD introduced a version of the ATI Radeon HD 4890 GPU with the core clock raised from 850 MHz to 1 GHz, making it the first 1 GHz graphics processor. Thanks to the higher frequency, the chip's computing power increased from 1.36 to 1.6 teraflops. The processor contains 800 (!) compute cores and supports GDDR5 video memory, DirectX 10.1, ATI CrossFireX and all the other technologies found in modern video cards. The chip is manufactured on a 55 nm process.

The main differences of the GPU

The distinctive features of the GPU (compared with the CPU) are:

- an architecture aimed as much as possible at increasing the speed of computing textures and complex graphics objects;

- the peak computing power of a typical GPU is much higher than that of a CPU;

- thanks to its specialized pipeline architecture, the GPU is far more efficient at processing graphics information than the CPU.

"Crisis of the genre"

"Crisis of the genre" for matured by 2005 - it was then that they appeared. But, despite the development of technology, the increase in the productivity of conventional decreased markedly. At the same time performance GPU continues to grow. So, by 2003, this revolutionary idea crystallized - use the computing power of the graphic... GPUs have become actively used for "non-graphical" computing (physics simulation, signal processing, computational mathematics / geometry, database operations, computational biology, computational economics, computer vision, etc.).

The main problem was the lack of a standard interface for programming the GPU. Developers used OpenGL or Direct3D, but this was far from convenient. NVIDIA (one of the largest manufacturers of graphics, media and communications processors, as well as wireless media processors, founded in 1993) took on the development of a unified and convenient standard and introduced the CUDA technology.

How it started

2006 - NVIDIA demonstrates CUDA™, the beginning of a revolution in GPU computing.

2007 - NVIDIA releases the CUDA architecture (the original version of the CUDA SDK appeared on February 15, 2007); the technology receives a "Best Novelty" nomination from Popular Science magazine and a "Readers' Choice" award from HPCWire.

2008 - NVIDIA CUDA technology wins PC Magazine's "Technical Excellence" award.

What is CUDA

CUDA (abbreviated from the English Compute Unified Device Architecture, literally a "unified device computing architecture") - an architecture (a set of software and hardware) that makes it possible to perform general-purpose computations on the GPU, with the GPU effectively acting as a powerful coprocessor.

NVIDIA CUDA™ technology is the only C-language development environment that lets developers create software for solving complex computational problems in less time by harnessing the processing power of GPUs. Millions of CUDA-capable GPUs are already in use around the world, and thousands of programmers are already using the (free!) CUDA tools to accelerate applications and tackle the most complex, resource-intensive tasks, from video and audio encoding to oil and gas exploration, product modeling, medical imaging and scientific research.

CUDA gives developers the ability, at their discretion, to access the instruction set of the graphics accelerator, manage its memory and organize complex parallel computations on it. A CUDA-capable graphics accelerator becomes a powerful programmable open architecture, much like today's CPUs. All of this gives the developer low-level, distributed and high-speed access to the hardware, making CUDA a necessary foundation for building serious high-level tools such as compilers, debuggers, math libraries and software platforms.
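To give a feel for this programming model, here is a minimal sketch of a CUDA C program (not taken from any official sample; the kernel name, array size and launch configuration are illustrative assumptions). The host code explicitly allocates device memory, copies data to the GPU, launches a kernel that adds two vectors in parallel, and copies the result back:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel: each GPU thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;                    // one million elements (illustrative size)
    const size_t bytes = n * sizeof(float);

    // Host (CPU) buffers.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device (GPU) buffers, managed explicitly by the programmer.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);            // expected: 3.000000

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

Such a .cu file is compiled with the nvcc compiler from the CUDA Toolkit (for example, nvcc vecadd.cu -o vecadd, where the file name is assumed), which separates the host code destined for the CPU from the device code destined for the GPU.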

Uralsky, a leading NVIDIA technology specialist, compares the GPU and the CPU this way: "A CPU is an SUV. It will get you there anytime, anywhere, but not very fast. A GPU is a sports car: on a bad road it simply won't go anywhere, but give it good pavement and it will show a speed the SUV never dreamed of!"

CUDA technology capabilities


