Today, it seems, only the laziest have not heard of graphics processors being used for general-purpose computing. NVIDIA CUDA has thundered around the world, AMD FireStream has gathered its audience, OpenCL and DirectX 11 are catching up, and Intel has weighed in as well. In short, nobody has stood aside: one way or another, everyone has had a hand in the technology, even Apple, with its operating-system-level support for general-purpose GPU computing. It's time to figure out what "general purpose" actually means, why video cards are needed for more than games, and what good the era of multifunctional graphics processors brings us.
Swedes, proteins and parallel calculations
It all started in 2003. A group of students at Gotland University (Visby, Sweden) ran into an unpleasant problem: they needed to perform some rather large-scale parallel matrix computations. So large-scale that the central processors of the computers the students had access to would have spent about six months on them, and the thesis deadline was looming.
At first the Swedes decided to turn to the world community for help (programs for voluntary participation in distributed computing were gaining popularity at the time), but they got bogged down in the paperwork required to register the project. Then one of them had a thought: why not use a graphics processor instead of the CPU? By 2003 the computing power of GPUs had already surpassed that of CPUs; they were built for parallel calculations and suited the problem perfectly. True, at that moment there was no tooling at all for running "non-graphics" code on a GPU, but that could not stop the enthusiasts, and a few weeks later the calculations began. They took two and a half months.
Thus was born the concept of GPGPU: General-Purpose computing on Graphics Processing Units. Like any enthusiasm from the world of technology, within a couple of years the idea captured the minds of programmers around the world. Communities of adherents of the new concept sprang up; international forums and round tables were organized. The most famous project was probably BionicFX, on top of which the social network AVEX (Audio Video Exchange) was built. Using a GeForce 6800 video card as its processor, BionicFX quickly and painlessly converted audio files from one format to another, resized them, and could even process streaming data as a file was being uploaded to the server.
Another sensational project is Folding@home, Stanford University's distributed computing program. It, too, began on central processors, but with the advent of GPGPU it switched to graphics processors, and its performance tripled.
Of course, none of this could go unnoticed by the two whales of the graphics business, NVIDIA and ATI. The "reds" were first off the mark: at the beginning of 2006, ATI announced the imminent release of a hardware platform based on the Radeon R580, capable of replacing the makeshift rigs of GPGPU enthusiasts. True, the platform came out only a year later and under an entirely different name: the finished product bore the proud title of AMD FireStream and was designed to work with the new generation of video cards built on a 55 nm process. NVIDIA was somewhat late with its GPGPU announcement: the company first spoke of its general-purpose computing platform only in the fall of 2006. But by the beginning of 2007 the first version of the CUDA hardware and software system (Compute Unified Device Architecture) had appeared on its official website, and in just six months it gained a popularity that the AMD/ATI pair still cannot match with their FireStream.
It's all about magic bubbles
Before delving into the intricacies of the different computing platforms, let's talk about what, exactly, makes a graphics processor so superior to a central one. The CPU was created to work through a stream of sequential instructions, the GPU for parallel calculations. To oversimplify: the central processor is designed for programs in which A is computed first, then B, and finally C (because the result of C somehow depends on the results of A and B), while the graphics processor is designed for programs in which all three components are computed simultaneously and the results at each stage do not affect one another.
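To make the difference concrete, here is a minimal sketch of our own (an illustration, not code from any of the projects mentioned) in CUDA C. The first function is a CPU-style task: each result depends on the previous one, so the steps must run in order. The second is a GPU kernel: every element is independent, so thousands of threads can compute them all at once.

    // Sequential, CPU-style: c[i] cannot be computed until c[i-1] is known,
    // so the loop iterations must execute one after another.
    void running_sum(const float *a, float *c, int n) {
        c[0] = a[0];
        for (int i = 1; i < n; ++i)
            c[i] = c[i - 1] + a[i];
    }

    // Parallel, GPU-style: no element depends on any other, so every
    // thread computes its own c[i] simultaneously with all the rest.
    __global__ void add_vectors(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }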
However, to say that central processors cannot handle parallel calculations at all would be a sin against the truth. They can; just not very well. The Pentium, for example, could execute two instructions per clock cycle, and the Pentium Pro could dynamically reorder instructions and thus execute them almost in parallel.
With the arrival of multi-core processors and fast memory the situation changed for the better, and today a CPU can even be entrusted with rendering graphics. But why bother? GPUs cope with such tasks far better: they have more cores, and they use memory more wisely. Where the central processor addresses its cache randomly and writes data to arbitrary locations, on a GPU everything proceeds in good order: blocks are read and written strictly in turn, one after another. The result is throughput that comes as close as possible to the theoretical maximum. A GPU can get by with far less cache than a CPU and still achieve better results.
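As an illustration of that "strictly in turn" access pattern, here is a hedged sketch of our own: in the first kernel, neighboring threads read neighboring addresses, so the hardware merges the reads into a few wide memory transactions; in the second, neighboring threads touch addresses far apart, and most of the bandwidth advantage is lost.

    // Coalesced: thread 0 reads in[0], thread 1 reads in[1], and so on;
    // the hardware merges these adjacent reads into one wide transaction.
    __global__ void copy_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Scattered: adjacent threads jump 'stride' elements apart, so their
    // reads cannot be merged and effective bandwidth drops sharply.
    __global__ void copy_strided(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = (int)((long long)i * stride % n);
        if (i < n) out[j] = in[j];
    }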
True, the rule works in the opposite direction too. As soon as a GPU runs into the need to perform sequential calculations, its performance drops sharply. In theory, a graphics processor could be taught hybrid computing (sequential and parallel, in any combination), but so far nobody wants to do this: why, the argument goes, build one device for everything in the world when you can use components specially designed for each task? The position is debatable, but for now it dominates the market. That is why, contrary to the predictions of many IT analysts, we will not see in the near future omnipotent graphics processors capable of replacing CPUs.
NVIDIA: and yet it spins!
The largest and best-supported platform created for general-purpose calculations is, without a doubt, NVIDIA CUDA. It has been in development for two years, has gone through three versions and many patches, and is used, according to NVIDIA, on more than six million computers daily. Considering that CUDA was not the first to appear and that its competition is far from weak, the result is impressive.
The secret of its success is simple: balance. The closed architecture is balanced by the availability of a free developer kit; the strict binding to particular hardware, by a wide selection of that hardware. The system already supports several dozen different video cards, from weak mobile solutions and the budget Ion to the most powerful top-end cards (the full list can be viewed here: www.NVIDIA.ru/object/cuda_learn_products_ru.HTML). Moreover, the same program is written for any of them and runs in exactly the same way; on a weak card it will take several times longer, but it will still work without any modifications.
Another plus of CUDA as a platform is its accessibility for programmers. NVIDIA decided not to reinvent the wheel and instead relies on languages people have been actively using for years: C with small extensions, Fortran, Python, Delphi, OpenGL. The only thing a novice CUDA programmer really has to learn is how to write programs with parallel calculations. The trouble is that classical programming education is focused on working with the CPU, and from the school bench everyone is taught to steer clear of parallelism... However, retraining courses already exist at many of the world's universities; in Russia, parallel programming can be studied at Moscow State University, St. Petersburg State University, Yekaterinburg State University, and Kazan State University.
In addition, NVIDIA periodically holds training courses in various Russian cities, so we advise everyone interested to keep an eye on the announcement boards of their local technical institutes.
The CUDA developer kit consists of three components: a set of tools (downloadable libraries), a compiler supporting several programming languages, and documentation. All of it can be downloaded completely free of charge at www.NVIDIA.ru/object/cuda_get_ru.HTML or from the "Soft" section of our disc.
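To give a feel for what "C with small extensions" means in practice, here is a minimal complete CUDA program of our own devising, a sketch rather than anything from the kit's documentation. The only non-standard elements are the __global__ qualifier, the <<<blocks, threads>>> launch syntax, and the cudaMalloc/cudaMemcpy runtime calls; everything else is ordinary C.

    #include <stdio.h>
    #include <cuda_runtime.h>

    // Each of the n threads squares exactly one element of the array.
    __global__ void square(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= data[i];
    }

    int main(void) {
        const int n = 1024;
        float host[1024];
        for (int i = 0; i < n; ++i) host[i] = (float)i;

        float *dev;
        cudaMalloc(&dev, n * sizeof(float));                              // GPU memory
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice); // copy in

        square<<<n / 256, 256>>>(dev, n);                                 // 4 blocks x 256 threads

        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost); // copy out
        cudaFree(dev);
        printf("%f\n", host[10]);                                         // prints 100.000000
        return 0;
    }

Saved as square.cu, this compiles with the kit's nvcc compiler (nvcc square.cu) and runs unchanged on any supported card.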
AMD: Peace, Friendship, Development
AMD approached the problem from the other side. Instead of turning its video cards into competitors for its own processors, the developers decided to join forces and build a new hardware platform out of the central and graphics processors together, one that would cope equally easily with both sequential and parallel calculations. It is far from finished, but the foundation can already be judged: on paper, AMD's GPGPU system is ahead of its competitors. Testing, though, shows that the quality of execution is not yet so impressive.
The principle of operation of FireStream differs slightly from CUDA's: the graphics processor still takes on the parallel tasks, gathering its stream processors into clusters and computing pieces of the task simultaneously. But where the NVIDIA chip copes by itself with sequential tasks that suddenly appear in the code, the ATI chip hands those tasks over to the central processor. At first glance this approach should slow the whole system down, but in practice it yields higher speed: the CPU, even counting the time spent shuttling data back and forth, handles sequential computations faster than the GPU does.
That, however, is where FireStream's advantages end. Unlike CUDA, the system uses not a generally accepted programming language but its own, so it is much harder to learn. Moreover, the language itself is still unfinished and the compiler is far from perfect, so the program code often comes out unoptimized and bulky.
Nobody has yet managed to truly compare the performance of CUDA and FireStream: the only program that officially works with both systems, CyberLink PowerDirector, is not itself noted for optimized code, so there is no point talking about the accuracy of the results. Looking at raw numbers, AMD's technology is ahead of NVIDIA's brainchild at video transcoding; yet the quality of the picture CUDA produces at the output is higher than that of the same picture from AMD. Perhaps the unfinished technology is to blame, perhaps the difference in approaches to the task. Be that as it may, comparing a finished product with an unfinished one is pointless, so we had better wait for FireStream to be brought to completion. Most likely the wait will be long: AMD, unlike NVIDIA, does not rank its GPGPU system among its highest priorities (nor, incidentally, its own graphics engine) and therefore names no deadlines.
Anyone who wants to test FireStream's effectiveness can download the developer kit (http://developer.amd.com/gpu/atistreamsdk/Pages/Default.aspx).
Not competitors
Debate continues over whether all these hardware accelerators from NVIDIA and AMD are needed at all, given that the long-awaited OpenCL framework, destined to become the cornerstone of the entire graphics computing industry, is about to be born. Most of the debaters believe that in this situation neither CUDA nor FireStream is needed by anyone. But there are nuances.
The fact is that OpenCL (just like OpenGL and OpenAL before it) is merely a programming language. One in which, yes, it is very convenient to compute all kinds of graphics, from games and cartoons to the most complex models of proteins and meteorological processes. But OpenCL by itself does not compute anything and cannot: one way or another, it rests on the hardware it works with. And the support for general-purpose GPU computing discussed in the new framework's documentation is nothing more than support for the technologies with which those very calculations are performed.
Another matter is that thanks to OpenCL, parallel calculations can now be used not only for drawing graphics but for all sorts of other things: working with physics engines, for example, or computing AI.
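A physics engine's integration step, for instance, is exactly this kind of workload: every particle's update is independent of every other's. Here is a sketch of the idea (our own illustration, written in CUDA, though the same structure applies under OpenCL):

    // One thread per particle: thousands of positions advance in a
    // single parallel step because no update depends on another.
    __global__ void step_particles(float *x, float *v, const float *force,
                                   float mass, float dt, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            v[i] += force[i] / mass * dt;   // integrate velocity
            x[i] += v[i] * dt;              // integrate position
        }
    }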
The same goes for the future Direct3D 11. In Windows 7, for example, this framework makes it possible to use hardware for transcoding video files: given the appropriate hardware and drivers, the job is done through NVIDIA CUDA; in their absence, by the central processor.
Thus, neither OpenCL nor Direct3D 11 is a competitor to the NVIDIA and AMD products. On the contrary, they are created to interact with them effectively. It is already possible today to bolt CUDA support onto existing games, but so far it is not very effective: switching between the video card's standard operating mode and CUDA mode takes too much time, and precious FPS are lost. This problem should be solved by compute shaders, which split the work into threads at the "upper" level of the program, so the video card receives data of a uniform type; performance rises, and the load on the video card falls.
* * *
The use of GPUs for general-purpose calculations is as inevitable a future as multi-core processors and multi-platter hard drives. The technology has long since moved from the hobbyist category to the essential one: entire computing clusters are built on it, operating systems and software are written for it, games are targeting it.
It was CUDA technology that made possible the Ion concept, in which the central processor plays an auxiliary role and therefore need not boast any particular core count or insane performance. Thanks to the GPGPU concept, computers are becoming cheaper and more accessible while learning to solve ever more complex problems. Still, it is far too early to throw the CPU off the steamship of history: it still has room to grow.
We count together
The idea of distributed computing is extremely simple: if we lack the capacity to solve a problem and lack the money to buy that capacity, why not ask someone for help? That, or something like it, is what the scientists at Stanford University thought when their laboratory computer displayed how many hundreds of years it would take to calculate protein structures under a given program. The figure was so impressive that there could be no talk of asking the dean's office for a new computer: solving such problems would require a whole fleet of supercomputers, and the university administration was not about to go for that.
So the scientists went looking for helpers on the Internet. They created a site, wrote the software, and invited everyone to donate a little computer time to science. To help the researchers, it was enough to install the appropriate software on your computer and go online. After that the computer's owner could calmly go about his business while his machine downloaded the necessary data from the Internet and crunched it for three or four hours. Despite the evident lack of benefit to the participants, the program's popularity was enormous: several hundred new participants signed up on the site every day.
You can help science today, too. True, for that you had better have a video card with CUDA support, since most of the computing programs are now built on this technology. The World Community Grid community (www.WorldCommunitygrid.Org) is looking for a cure for cancer; the Seventeen or Bust project (www.Seventeenorbust.Com) is working on the Sierpiński problem; SETI@home (http://setiathome.berkeley.edu) deciphers signals from radio telescopes in the hope of finding brothers in mind out in space... Choose a project to your liking and join in. It will give you nothing, of course, but thanks to your efforts science will advance at least a little further.
240 in parallel
A multi-core central processor consists of several cores plus cache memory. Each core can execute only one operation per unit of time, so to speed things up any task is broken into many small pieces that can be computed almost simultaneously, each half a step behind the last.
Having finished a calculation, each core writes its results to the cache, where another core immediately picks them up. The system works almost continuously. The only drawback of this architecture is that the processor's memory is used rather inefficiently. So that data is never written to the same region of memory twice, each core works with its own segment of the cache: one writes, roughly speaking, to blocks 0 through 25, the second to 25 through 50, and so on. Read speed suffers somewhat from this partitioning, of course, since the processor is constantly (and simultaneously) addressing different physical areas of memory.
Graphics processors are arranged differently. In them quality is replaced by quantity: instead of two or four cores, each running at upwards of 3 GHz, graphics chips use whole clusters of 240 streaming processors at 1-1.5 GHz apiece. The beauty of such a system is that besides the physical clusters you can use logical ones: for each specific task, the data is divided into exactly as many groups as the computation requires. Thanks to this, performance on parallel calculations (when all parts of a program are computed at once) goes through the roof.
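In CUDA terms, those "logical clusters" are the thread blocks of a launch configuration: the programmer divides the data into exactly as many blocks as the task needs, and the hardware schedules the blocks onto the physical streaming processors. A hedged sketch of our own:

    #include <cuda_runtime.h>

    __global__ void process(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;   // any independent per-element work
    }

    void launch(float *device_data, int n) {
        // Logical division: exactly as many 256-thread blocks as n requires.
        const int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // round up
        process<<<blocks, threadsPerBlock>>>(device_data, n);
    }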
GPU memory is also built differently: since all the cores compute at the same time and the results of their calculations are unrelated, random memory access is not needed. Instead, a special group of control processors periodically polls all the clusters and exchanges information with them, writing it to memory or reading it back strictly in sequence. When computing graphics, such a system delivers insane performance numbers. But load a GPU with some traditional computing task, like running an operating system or processing the text you type in Word, and the processor immediately starts to stall: the organization of GPU memory is extremely ill-suited to sequential problems.
A new-generation supercomputer
At the beginning of the year, NVIDIA decided once again to remind everyone that general-purpose computing on graphics processors is not just a toy for hobbyists. The NVIDIA Tesla supercomputer, built on this technology, can replace a whole fleet of supercomputers based on CPU computing. It consists of many clusters, each with its own graphics board carrying 240 streaming processors. Even the "home" Tesla, a simple system unit with one such cluster on board, delivers a 250-fold performance increase over systems that run similar calculations on top-end Core i7 models. What can one say, then, about the cluster solutions for data centers?!
Computing in practice
The new GPGPU technologies won the hearts of scientists first of all. The Max Planck Society for the Advancement of Science (an independent non-profit research organization headquartered in Germany) was among the first to embrace the progress, paying for the installation of a Tesla supercomputer based on CUDA technology in the laboratory of Professor Holger Stark and his group at their institute in Göttingen, Germany.
The professor works in the field of 3D electron cryo-microscopy, studying the structure and spatial motion of the tiniest nanomolecular structures. Stark's team used an electron microscope to obtain detailed 3D models of molecules. In principle, the resolution of modern microscopes is so high that even the distances between atoms can be made out; the trouble is that the biological structures the scientists worked with were quickly destroyed by an intense electron beam. So the structures are first deep-frozen, and then a very small dose of electrons is used so as not to damage them. The resulting image is very noisy, but special processing solves that problem. And here the scientists ran up against the imperfection of modern computing systems. It took Stark's team 7 days to process 15,000 images on a 48-core CPU cluster. At that rate, processing a million files (the minimum needed for the research) would take a little under a year and a half: a million images at roughly 2,000 a day is about 470 days, and that for each molecule. The installation of a CUDA-based computer literally saved the scientists: processing a million images on it takes 14 hours (more than 800 times faster than the CPU cluster), and that, essentially, in the minimum configuration. Adding another GPU cluster cut the processing time to 9 hours.