Hyper-Threading: Intel's Two-in-One, or The Hidden Capabilities of Xeon. Multi-core processors: how they work

I was looking up the number of cores in my machine and found a few posts, but I'm confused, because some of them mention that you get logical cores and physical cores, etc.
So what is the difference between logical and physical cores, and is there a way to get the number of physical cores? Or does it make sense to count logical cores too?

4 solutions collected from the web for "So what are logical processor cores (as opposed to physical processor cores)?"

Physical cores are just that: physical cores inside the processor. Logical cores are the ability of a single core to do two or more things at once. This grew out of early Pentium 4 processors being able to do what was called Hyper-Threading (HTT).

It was a trick that exploited the fact that parts of the core sat idle during certain types of instructions, so other lengthy work could be done on them in the meantime. Thus the CPU could work on two things at the same time.

Newer cores are more fully functional processors, so they really do work on multiple things at the same time, but they are not true processors the way physical cores are. You can read more about the limitations of the hyperthreading feature versus the physical capabilities of a core on Tom's Hardware, in the article titled "Intel Core i5 and Core i7: Intel's Mainstream Magnum Opus".

You can see the breakdown of your CPU using the lscpu command:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
CPU(s):                4
Thread(s) per core:    2
Core(s) per socket:    2
CPU socket(s):         1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 37
Stepping:              5
CPU MHz:               2667.000
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K
NUMA node0 CPU(s):     0-3

My Intel i5 laptop above has 4 "processors" in total

CPU(s): 4

of which there are 2 physical cores

Core(s) per socket: 2

each of which can run up to 2 threads

Thread(s) per core: 2

at the same time. These threads are the core's logical capabilities.

Physical cores are the number of physical cores, real hardware components.

Logical cores are the number of physical cores multiplied by the number of threads that can run on each core using hyperthreading.

For example, my 4-core processor runs 2 threads per core, so I have 8 logical processors.
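That multiplication can be checked with a line of shell arithmetic (a tiny sketch; the numbers are the ones reported on this machine, substitute your own lscpu values):

```shell
# logical processors = physical cores x threads per core
cores=4
threads_per_core=2
echo "$((cores * threads_per_core)) logical processors"
# prints "8 logical processors"
```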

$ sudo dmidecode | egrep "Socket Designation: Proc|((Thread|Core) Count)"
Socket Designation: Proc 1
Core Count: 14
Thread Count: 28
Socket Designation: Proc 2
Core Count: 14
Thread Count: 28

Two sockets. Each socket has 14 physical cores. Each core runs two threads (28/14). The total number of logical "CPUs", or logical processors, is 56 (this is what top and some other commands show you as the number of "CPUs").
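The same total can be recomputed mechanically from the dmidecode output; here is a small awk sketch (the printf lines simply replay the "Thread Count" fields quoted above):

```shell
# Sum the per-socket "Thread Count" fields to get the total logical CPU count.
printf '%s\n' 'Thread Count: 28' 'Thread Count: 28' |
awk '/^Thread Count/ {sum += $3} END {print sum " logical CPUs"}'
# prints "56 logical CPUs"
```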

Hyperthreading technology allows one physical processor core to behave like two logical processors.

Thus, one processor core can simultaneously execute two independent threads.

Intel refers to a physical processor as a socket.

Hyperthreading allows one physical processor to behave as if it were two processors, called logical processors. Why?

While hyperthreading does not double system performance, it can improve performance by keeping otherwise idle resources busy, yielding greater throughput for certain important types of workloads. An application running on one logical processor of a busy core can expect slightly more than half the throughput it would get running alone on a non-hyperthreaded processor.

Summary

A physical processor is something we can see and touch.

A logical processor is a physical core acting as two processors.

An Intel Pentium 4 processor with a clock frequency of 3.06 GHz, which uses Hyper-Threading (HT) technology, has appeared on the St. Petersburg market.
Previously used only in server systems, the technology launched a new class of high-performance desktop personal computers, manufacturers say.
With HT technology, one physical processor is perceived by the PC's operating system and applications as two logical processors. According to Alexei Navolokin, head of Intel's representative office in Russia and the CIS countries, preliminary data show that the new processor with HT technology provides an average performance gain of 25%.

Out of turn
HT technology lets users improve PC performance in two ways: when working with software that uses multithreaded data processing, and when working in multitasking environments. Applications written to take advantage of the new processor's ability to work on several fragments of code (so-called threads) simultaneously will "see" one physical 3.06 GHz Intel Pentium 4 processor with HT technology as two logical processors. HT technology allows the processor to handle two independent instruction streams simultaneously rather than in turn.

For business
With the help of HT technology, you can, for example, start playing a music album and at the same time exchange messages in a chat without compromising the sound quality. By downloading an MP3 file from the Internet to your music archive, you can run an anti-virus program in parallel, which will protect your PC from the penetration of unwanted programs from the outside.
HT provides ample opportunities in the world of business - the head of the enterprise can simultaneously view stock reports and indices, track the indicators of the automated enterprise management system, and be in touch with contractors. Engineers and scientists using a PC based on the Intel Pentium 4 processor with Hyper-Threading Technology will be able to work with information sources most efficiently, while downloading it from the Internet and receiving it from colleagues in the form of files of various formats - from PDF to XLS.
St. Petersburg integrator firms ("Svega+", "Computer Service 320-80-80", "Computer-Center KEY" and "Computer World") plan to sell at least 15-20 computers per month based on the 3.06 GHz Intel Pentium 4 processor with HT technology.

We wrote earlier that single-processor Xeon systems make no sense, since at a higher price their performance is the same as that of a Pentium 4 of the same frequency. Now, after closer study, a small amendment will probably have to be made to that statement. The Hyper-Threading technology implemented in the Intel Xeon with the Prestonia core really does work and gives quite a noticeable effect. Although many questions arise when using it...

Give performance

"Faster, even faster...". The race for performance has been going on for years, and sometimes it is even hard to say which component of the computer is accelerating fastest. Ever newer ways of achieving this are being invented, and the further it goes, the more skilled labor and high-quality brains are invested in this avalanche-like process.

A constant increase in performance is certainly needed. At least, this is a profitable business, and there will always be a beautiful way to encourage users to upgrade yesterday's "super-performance CPU" to tomorrow's "even more super ..." For example, simultaneous speech recognition and simultaneous translation into another language is not everyone's dream? Or unusually realistic games of almost "cinematic" quality (completely absorbing attention and sometimes leading to serious changes in the psyche) - is this not the aspiration of many gamers, young and old?

But let us leave the marketing aspects aside here and focus on the technical ones. Moreover, not everything is so gloomy: there are pressing tasks (server applications, scientific computing, modeling, etc.) where ever higher performance, in particular of the central processors, really is necessary.

So, what are the ways to increase their performance?

Overclocking... It is possible to further "thin" the process technology and raise the frequency. But, as is well known, this is not easy and is fraught with all sorts of side effects, such as heat dissipation problems.

Increasing processor resources... For example, enlarging the cache or adding new execution units (Execution Units). All this entails a growing transistor count, greater processor complexity, a larger die area and, consequently, higher cost.

Besides, the previous two methods give, as a rule, a less than linear increase in performance. This is well known from the example of the Pentium 4: branch mispredictions and interrupts force its long pipeline to be flushed, which greatly hurts overall performance.

Multiprocessing... Installing multiple CPUs and distributing work among them is often quite efficient. But this approach is not cheap: each additional processor increases the cost of the system, and a dual-socket motherboard is much more expensive than a regular one (not to mention boards supporting four or more CPUs). In addition, not all applications benefit from multiple processors enough to justify the cost.

In addition to "pure" multiprocessing, there are several "intermediate" options to speed up the execution of applications:

Chip Multiprocessing (CMP)... Two processor cores are placed physically on one die, with a shared or separate cache. Naturally, the die turns out to be quite large, and this cannot help but affect cost. Note that several such "dual" CPUs can also work in a multiprocessor system.

Time-Slice Multithreading... The processor switches between program threads at fixed intervals. The overhead can be quite substantial at times, especially if one of the processes is waiting.

Switch-on-Event Multithreading... Switching tasks when long pauses occur, such as "cache misses", a large number of which are typical for server applications. In this case, a process waiting to load data from the relatively slow memory into the cache is suspended, freeing up CPU resources for other processes. However, Switch-on-Event Multithreading, like Time-Slice Multithreading, does not always allow achieving optimal use of processor resources, in particular, due to errors in branch prediction, instruction dependencies, etc.

Simultaneous Multithreading... In this case, program threads are executed on one processor "simultaneously", that is, without switching between them. CPU resources are allocated dynamically, according to the principle "if you don't use it, give it to someone else." It is this approach that underlies Intel Hyper-Threading technology, which we now turn to.

How Hyper-Threading Works

As you know, the current "computing paradigm" assumes multithreaded computing. This applies not only to servers, where the concept has existed from the start, but also to workstations and desktop systems. Threads may belong to the same application or to different ones, but there is almost always more than one active thread (to convince yourself of this, it is enough to open Task Manager in Windows 2000/XP and turn on the display of thread counts). At the same time, a conventional processor can execute only one thread at a time and is forced to switch between them constantly.
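On Linux, the same check can be made from the shell; a sketch that sums the per-process Threads: field the kernel exposes under /proc (assumes a Linux /proc filesystem):

```shell
# Sum the "Threads:" field of every /proc/<pid>/status to count all active threads.
awk '/^Threads:/ {sum += $2} END {print sum " threads in the system"}' /proc/[0-9]*/status
```

On any running system the total comfortably exceeds the number of logical CPUs, which is exactly why a single-threaded processor must keep switching.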

For the first time, the Hyper-Threading technology was implemented in the Intel Xeon MP (Foster MP) processor, on which it was tested. Recall that the Xeon MP, officially presented at IDF Spring 2002, uses the Pentium 4 Willamette core, contains 256 KB L2 cache and 512 KB / 1 MB L3 cache, and supports 4-processor configurations. Also, support for Hyper-Threading is present in the processor for workstations - Intel Xeon (Prestonia core, 512 KB L2 cache), which came to the market a little earlier than the Xeon MP. Our readers are already familiar with dual-processor configurations on Intel Xeon, so we will consider the capabilities of Hyper-Threading using these CPUs as an example - both theoretically and practically. Anyway, a "simple" Xeon is a more mundane and digestible thing than the Xeon MP in 4-processor systems ...

The principle of Hyper-Threading is based on the fact that at any given time, only a portion of the processor's resources are used while executing program code. Unused resources can also be loaded with work - for example, they can be used for parallel execution of another application (or another thread of the same application). In one physical Intel Xeon processor, two logical processors (LP - Logical Processor) are formed, which share the computing resources of the CPU. The operating system and applications "see" exactly two CPUs and can distribute work between them, as in the case of a full-fledged dual-processor system.

One of the goals of implementing Hyper-Threading was to let a single active thread run at the same speed as on a regular CPU. For this, the processor has two main operating modes: Single-Task (ST) and Multi-Task (MT). In ST mode, only one logical processor is active and has full use of the available resources (there are two such modes, ST0 and ST1, depending on which LP is active); the other LP is halted by the HALT instruction. When a second program thread appears, the idle logical processor is activated (via an interrupt) and the physical CPU is placed in MT mode. Stopping unused LPs with the HALT instruction is the responsibility of the operating system, which is ultimately what guarantees that a single thread runs just as fast as it would without Hyper-Threading.

For each of the two LPs, a so-called Architecture State (AS) is maintained, which includes the state of various register types: general-purpose, control, APIC and machine-specific. Each LP has its own APIC (interrupt controller) and its own set of registers; to work with them correctly, the Register Alias Table (RAT) is introduced, which tracks the correspondence between the eight IA-32 general-purpose registers and the 128 physical CPU registers (one RAT per LP).

When operating with two threads, two corresponding sets of Next Instruction Pointers are maintained. Most instructions are fetched from the Trace Cache (TC), where they are stored in decoded form, and the two active LPs access the TC alternately, every other clock cycle. When only one LP is active, it gets exclusive access to the TC without interleaving. The Microcode ROM is accessed in the same way. The ITLB (Instruction Translation Look-aside Buffer) blocks, used when the required instructions are absent from the instruction cache, are duplicated, each delivering instructions for its own thread. The IA-32 Instruction Decode unit is shared and, when instructions for both threads need decoding, serves them in turn (again, every other clock cycle). The Uop Queue and Allocator blocks are split in two, with half of the entries allocated to each LP. The five schedulers process the queues of decoded commands (uops) regardless of which LP they belong to, and dispatch commands to the needed Execution Units depending on the readiness of the former and the availability of the latter. Caches of all levels (L1/L2 for Xeon, plus L3 for Xeon MP) are fully shared between the two LPs; however, to ensure data integrity, entries in the DTLB (Data Translation Look-aside Buffer) are tagged with logical-processor IDs.

Thus, instructions of both logical CPUs can be executed simultaneously on the resources of one physical processor, which are divided into four classes:

  • duplicated (Duplicated);
  • fully shared (Fully Shared);
  • with element descriptors (Entry Tagged);
  • dynamically partitioned depending on the ST0 / ST1 or MT operating mode.

At the same time, most applications that speed up on multiprocessor systems can also speed up on a CPU with Hyper-Threading enabled without any modification. But there are problems too: for example, if one process sits in a wait loop, it can tie up all the resources of the physical CPU, preventing the second LP from working. Thus, performance with Hyper-Threading can sometimes drop (by up to 20%). To prevent this, Intel recommends using the PAUSE instruction (introduced into IA-32 with the Pentium 4) instead of empty wait loops. Serious work is also underway on automatic and semi-automatic code optimization at compile time; for example, Intel's OpenMP C++/Fortran compilers have made significant progress in this regard.

Another goal of the first implementation of Hyper-Threading, according to Intel, was to minimize the increase in the number of transistors, die area and power consumption with a noticeable increase in performance. The first part of this commitment has already been fulfilled: the addition of Hyper-Threading support to the Xeon / Xeon MP has increased die area and power consumption by less than 5%. What happened with the second part (performance), we still have to check.

Practical part

For obvious reasons, we did not test 4-processor server systems on a Xeon MP with Hyper-Threading enabled. First, it is quite time consuming. And secondly, if we decide on such a feat - anyway, now, less than a month after the official announcement, it is absolutely unrealistic to get this expensive equipment. Therefore, it was decided to restrict ourselves to the same system with two Intel Xeon 2.2 GHz, on which the first testing of these processors was carried out (see the link at the beginning of the article). The system was based on a Supermicro P4DC6 + motherboard (Intel i860 chipset), contained 512 MB of RDRAM, a video card based on a GeForce3 chip (64 MB DDR, Detonator 21.85 drivers), a Western Digital WD300BB hard drive and 6X DVD-ROM; Windows 2000 Professional SP2 was used as an OS.

First, a few general impressions. When one Xeon with the Prestonia core is installed, the BIOS reports the presence of two CPUs at system startup; with two processors installed, the user sees a message about four CPUs. The operating system recognizes "both processors" normally, but only if two conditions are met.

First, the CMOS Setup of recent BIOS versions for Supermicro P4DCxx boards has an Enable Hyper-Threading item, without which the OS recognizes only the physical processor(s). Second, ACPI facilities are used to inform the OS about the presence of the additional logical processors. Therefore, to enable Hyper-Threading, the ACPI option must be enabled in CMOS Setup, and a HAL (Hardware Abstraction Layer) with ACPI support must be installed for the OS itself. Fortunately, in Windows 2000 changing the HAL from Standard PC (or MPS Uni-/Multiprocessor PC) to ACPI Uni-/Multiprocessor PC is easy: just replace the "computer driver" in Device Manager. For Windows XP, however, the only legal way to migrate to an ACPI HAL is to reinstall the system on top of the existing installation.

But now all the preparations have been made, and our Windows 2000 Pro already firmly believes that it works on a dual-processor system (although in fact there is only one processor installed). Now, traditionally, it's time to decide on the goals of testing. So we want:

  • Evaluate the impact of Hyper-Threading on the performance of applications of various classes.
  • Compare this effect with the effect of installing a second processor.
  • Check how "fairly" resources are given to the active logical processor while the second LP is idle.

To evaluate the performance, we took a set of applications already familiar to our readers and used in testing workstation systems. Let's start from the end and check the "fairness" of the logical CPUs. Everything is extremely simple: first we run tests on one processor with Hyper-Threading disabled, and then we repeat the process, enabling Hyper-Threading and using only one of the two logical CPUs (using Task Manager). Since in this case we are only interested in relative values, the results of all tests are reduced to "bigger is better" and normalized (the indicators of a uniprocessor system without Hyper-Threading are taken as a unit).
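The normalization step itself is a one-liner; a hypothetical sketch with made-up raw scores (the baseline, a uniprocessor system without Hyper-Threading, is taken as unity):

```shell
# Divide each raw "bigger is better" score by the no-HT uniprocessor baseline.
baseline=250
for score in 250 285 340; do
    awk -v b="$baseline" -v s="$score" 'BEGIN {printf "%.2f\n", s / b}'
done
# prints 1.00, 1.14, 1.36
```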

Well, as you can see, Intel's promises have been fulfilled here: with only one active thread, the performance of each of the two LPs is exactly equal to the speed of a physical CPU without Hyper-Threading. An idle LP (both LP0 and LP1) is actually suspended, and the shared resources, as far as we can judge from the results obtained, are completely transferred to the use of the active LP.

Therefore, we draw the first conclusion: two logical processors are actually equal, and enabling Hyper-Threading does not "interfere" with the work of one thread (which is not bad in itself). Now let's see if this inclusion "helps", and if so, where and how?

Rendering... The results of four tests in the 3D modeling packages 3D Studio MAX 4.26, Lightwave 7b and A|W Maya 4.0.1 are combined into one diagram because of their similarity.

In all four cases (for Lightwave, two different scenes), the CPU load on one processor with Hyper-Threading disabled is almost always held at 100%. Nevertheless, when Hyper-Threading is enabled the scene calculation speeds up (which even gave rise to our joke about CPU load above 100%). In three tests we see a 14-18% performance increase from Hyper-Threading: on the one hand not much compared to a second CPU, but on the other hand quite good, given that the effect comes "for free". In one of the two Lightwave tests the gain is practically zero (apparently due to the specifics of this application, which is full of oddities). But there is no negative result anywhere, and the noticeable gain in the other three cases is encouraging. And this despite the fact that parallel rendering processes do similar work and are probably not the best way to use the resources of one physical CPU simultaneously.

Photoshop and MP3 encoding... The GOGO-no-coda 2.39c codec is one of the few that supports SMP, and it shows a 34% performance gain from the second processor. At the same time, the effect of Hyper-Threading here is zero (we do not consider a 3% difference significant). But in the Photoshop 6.0.1 test (a script consisting of a large set of commands and filters), you can see a slowdown with Hyper-Threading enabled, even though the second physical CPU adds 12% performance in this case. This is, in fact, the first case where Hyper-Threading causes a performance drop...

Professional OpenGL... It has long been known that SPEC ViewPerf and many other OpenGL applications often slow down on SMP systems.

OpenGL and Dual Processor: Why They Are Not Friends

Many times in our articles, we have drawn the readers' attention to the fact that dual-processor platforms, when performing professional OpenGL tests, very rarely show any significant advantage over single-processor ones. Moreover, there are often cases when installing a second processor, on the contrary, degrades the system performance when rendering dynamic three-dimensional scenes.

Naturally, we were not the only ones to notice this oddity. Some testers simply passed over the fact in silence, for example citing SPEC ViewPerf results only for dual-processor configurations and thus avoiding having to explain why a dual-processor system is slower. Others offered all sorts of fantastic conjectures about cache coherence, the need to maintain it, and the resulting overhead. And for some reason nobody was surprised that the processors should be so eager to maintain coherence precisely during windowed OpenGL rendering (which, in its computational essence, is not much different from any other computational task).

In fact, the explanation, in our opinion, is much simpler. As you know, an application can run on two processors faster than on one if:

  • two or more threads are running at the same time;
  • these threads do not interfere with the execution of one another — for example, they do not compete for a shared resource such as an external storage device or a network interface.

Now let's take a simplified look at what OpenGL rendering looks like when performed by two threads. If an application, "seeing" two processors, creates two threads of OpenGL rendering, then for each of them, according to the rules of OpenGL, its own gl-context is created. Accordingly, each thread renders to its own gl-context. But the problem is that for the window into which the image is displayed, only one gl-context can be current at a time. Accordingly, the threads in this case simply "in turn" output the generated image to the window, alternately making their context current. Needless to say, this "alternation of contexts" can be very expensive in terms of overhead?

As an illustration, here are graphs of the utilization of the two CPUs in several applications displaying OpenGL scenes. All measurements were taken on a platform with the following configuration:

  • one or two Intel Xeon 2.2 GHz (Hyper-Threading disabled);
  • 512 MB RDRAM memory;
  • Supermicro P4DC6 + motherboard;
  • ASUS V8200 Deluxe video card (NVidia GeForce3, 64 MB DDR SDRAM, Detonator 21.85 drivers);
  • Windows 2000 Professional SP2
  • video mode 1280x1024x32 bpp, 85 Hz, Vsync disabled.

Blue and red show the graphs of the utilization of CPU 0 and CPU 1, respectively. The line in the middle is the final CPU Usage graph. The three graphs correspond to two scenes from 3D Studio MAX 4.26 and part of the SPEC ViewPerf benchmark (AWadvs-04).


CPU Usage: Animation 3D Studio MAX 4.26 - Anibal (with manipulators) .max


CPU Usage: Animation 3D Studio MAX 4.26 - Rabbit.max


CPU Usage: SPEC ViewPerf 6.1.2 - AWadvs-04

The same pattern repeats in many other applications that use OpenGL. The two processors are not exactly overworked, and total CPU Usage ends up at 50-60%. At the same time, on a uniprocessor system CPU Usage in all these cases is confidently held at 100%.

Therefore, it is not surprising that so many OpenGL applications do not speed up too much on dual systems. Well, the fact that they sometimes even slow down has, in our opinion, a completely logical explanation.

We can state that with two logical CPUs the performance drop is even more significant, which is quite understandable: two logical processors interfere with each other in the same way as two physical ones. But their overall performance, naturally, turns out to be lower, so when Hyper-Threading is enabled, it decreases even more than when two physical CPUs are running. The result is predictable and the conclusion is simple: Hyper-Threading, like "real" SMP, is sometimes contraindicated for OpenGL.

CAD applications... The previous conclusion is confirmed by the results of two CAD tests - SPECapc for SolidEdge V10 and SPECapc for SolidWorks. The graphics performance of these tests for Hyper-Threading is similar (although in the case of the SMP system for the SolidEdge V10, the result is slightly higher). But the results of CPU_Score tests loading the processor make you think: 5-10% gain from SMP and 14-19% slowdown from Hyper-Threading.

Then again, Intel honestly admits the possibility of performance degradation with Hyper-Threading in certain cases, for example when empty wait loops are used. We can only assume that this is the reason here (a detailed study of the SolidEdge and SolidWorks code is beyond the scope of this article). After all, the conservatism of CAD developers is well known: they prefer proven reliability and are in no particular hurry to rewrite their code to follow new trends in programming.

Summing up, or "Attention, the right question"

Hyper-Threading works, there is no doubt about that. Of course, the technology is not universal: there are applications that fare "worse" with Hyper-Threading, and if the technology spreads, it would be desirable to modify them. But didn't the same happen in its time with MMX and SSE, and isn't it still happening with SSE2?..

However, this raises the question of how applicable the technology is to our realities. We can discard the single-processor Xeon system with Hyper-Threading right away (or treat it only as temporary, pending the purchase of a second processor): even a 30% performance gain does not justify the price; better then to buy a regular Pentium 4. That leaves systems with two or more CPUs.

Now let's imagine buying a dual-processor Xeon system (say, with Windows 2000/XP Professional). Two CPUs are installed, Hyper-Threading is on, the BIOS finds no fewer than four logical processors, off we go... Stop. How many processors will our operating system see? Right, two. Only two, since it is simply not designed for more. These will be the two physical processors, so everything will work exactly as with Hyper-Threading disabled: no slower (the two "extra" logical CPUs will simply halt), but no faster either (verified by additional tests; the results are not given because they are entirely self-evident). Hmm, not very pleasant...

What is left? Surely not to install Advanced Server or .NET Server on our workstation? No, the system would install, recognize all four logical processors and function. But a server OS looks a little strange on a workstation, to put it mildly (not to mention the financial aspects). The only reasonable case is when our dual-processor Xeon system serves as a server (some assemblers have already, without hesitation, started producing servers based on workstation Xeon processors). But for dual workstations with their corresponding operating systems, the applicability of Hyper-Threading remains questionable. Intel is now actively advocating OS licensing based on the number of physical rather than logical CPUs. Discussions are still ongoing, and much depends on whether we will see a workstation OS with support for four processors.

Well, with servers, everything comes out quite simply. For example, Windows 2000 Advanced Server, installed on a dual-processor Xeon system with Hyper-Threading enabled, will "see" four logical processors and run smoothly on it. To assess the benefits of Hyper-Threading in server systems, we present results from Intel Microprocessor Software Labs for dual-processor Xeon MP systems and several Microsoft server applications.

A 20-30% increase in performance for a two-processor server "for free" is more than tempting (especially compared to buying a "real" 4-processor system).

So it turns out that at the moment the practical applicability of Hyper-Threading is limited to servers. The workstation question depends on how OS licensing is resolved. One more application of Hyper-Threading is quite realistic, though: if desktop processors also receive support for this technology. For example (let's fantasize), why not a system with a Hyper-Threading-capable Pentium 4 and an SMP-aware Windows 2000/XP Professional? Then the technology would spread from servers all the way to desktop and mobile systems.

  • Tutorial

In this article I will try to describe the terminology used for systems capable of executing multiple programs in parallel: multicore, multiprocessor, multithreaded. The different kinds of parallelism in IA-32 CPUs appeared at different times and in a somewhat inconsistent order. It is quite easy to get confused by all this, especially since operating systems carefully hide the details from insufficiently sophisticated applications.

The purpose of the article is to show that with all the variety of possible configurations of multiprocessor, multicore and multithreaded systems for programs running on them, opportunities are created both for abstraction (ignoring differences) and for taking into account the specifics (the ability to programmatically find out the configuration).
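As a taste of that programmatic ability, here is how the configuration can be queried from a Linux shell (a sketch; getconf is POSIX, while the /proc/cpuinfo parsing is Linux-specific):

```shell
# Logical CPUs visible to the OS:
getconf _NPROCESSORS_ONLN

# Physical cores: count distinct (physical id, core id) pairs in /proc/cpuinfo.
awk -F': ' '/^physical id/ {p = $2} /^core id/ {print p "-" $2}' /proc/cpuinfo | sort -u | wc -l
```

On a hyperthreaded machine the first number is twice the second; on a machine without SMT they coincide.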

Warning about marks ®, ™, in the article

My comment explains why company employees should use trademark symbols (®, ™) in public communications. In this article I had to use them quite often.

CPU

Of course, the oldest, most often used and controversial term is "processor".

In the modern world, a processor is what we buy in a beautiful retail box or a not-so-beautiful OEM package: an indivisible entity that plugs into a socket on the motherboard. Even when there is no socket and the chip cannot be removed, that is, when it is soldered onto the board, it is still a single chip.

Mobile systems (phones, tablets, laptops) and most desktops have a single processor. Workstations and servers sometimes boast two or more processors on a single motherboard.

Supporting multiple CPUs in one system requires numerous design changes. At a minimum, it is necessary to ensure their physical connection (provide several sockets on the motherboard), resolve issues of processor identification (see later in this article, as well as my previous note), coordinate memory accesses, and arrange interrupt delivery (the interrupt controller must be able to route interrupts to multiple processors), and, of course, there must be support from the operating system. Unfortunately, I could not find a documentary mention of the first multiprocessor system built on Intel processors; however, Wikipedia claims that Sequent Computer Systems was already shipping them in 1987, using Intel 80386 processors. Widespread support for multiple chips in one system became available starting with the Intel® Pentium.

If there are several processors, each of them has its own socket on the board. Each has complete, independent copies of all resources: registers, execution units, caches. What they share is the common memory, the RAM. Memory can be connected to them in various and rather non-trivial ways, but that is a separate story beyond the scope of this article. What matters is that in any scenario, the executing programs must be given the illusion of a uniform shared memory accessible from every processor in the system.


Ready for takeoff! Intel® Desktop Board D5400XS

Core

Historically, multi-core in Intel IA-32 appeared later than Intel® Hyper-Threading, but in the logical hierarchy it comes next.

It would seem that if the system has more processors, its performance is higher (on tasks that can use all the resources). However, if the cost of communication between them is too high, all the gain from parallelism is killed by long delays in the transfer of shared data. This is exactly what is observed in multiprocessor systems: both physically and logically, the processors are very far from each other. For effective communication in such conditions, specialized buses such as Intel® QuickPath Interconnect had to be invented. Energy consumption, size, and price of the final solution, of course, do not decrease from all this. High integration of components should come to the rescue: the circuits executing the parts of a parallel program should be moved closer to each other, preferably onto one die. In other words, several cores, identical to each other in everything but working independently, should be organized within a single processor.

Intel's first multi-core IA-32 processors were introduced in 2005. Since then, the average number of cores in server, desktop, and now mobile platforms has been steadily growing.

Unlike two single-core processors in the same system, which share only memory, two cores can also share caches and the other resources responsible for interacting with memory. Most often, first-level caches remain private (each core has its own), while the second and third levels can be either shared or separate. This organization reduces delays in data delivery between neighboring cores, especially if they are working on a common task.


A micrograph of a quad-core Intel processor, codenamed Nehalem. Visible are the individual cores, the shared L3 cache, the QPI links to other processors, and the common memory controller.

Hyperthreading

Until about 2002, the only way to get an IA-32 system capable of executing two or more programs in parallel was to use a multiprocessor system. The Intel® Pentium® 4, as well as the Xeon line codenamed Foster (Netburst), introduced a new technology: hyperthreading, or Intel® Hyper-Threading (hereinafter HT).

There is nothing new under the sun. HT is a special case of what the literature calls simultaneous multithreading (SMT). Unlike "real" cores, which are full and independent copies, with HT only a part of the internal nodes is duplicated within a single core, primarily those responsible for storing the architectural state: the registers. The execution nodes responsible for organizing and processing data exist in a single copy, and at any given time are used by at most one of the threads. Like cores, hyperthreads share caches, but the level from which the sharing starts depends on the specific system.

I will not try to explain all the pros and cons of designs with SMT in general and with HT in particular. The interested reader can find a fairly detailed discussion of the technology in many sources, and of course on Wikipedia. However, I will note the following important point, which explains the current limits on the number of hyperthreads in real products.

Thread limits
When is the presence of "dishonest" multicore in the form of HT justified? If one application thread is unable to load all the execution nodes inside the core, then they can be "lent" to another thread. This is typical for applications whose bottleneck is not computation but data access, that is, they frequently generate cache misses and have to wait for data to be delivered from memory. During this time, a core without HT would be forced to idle. The presence of HT allows the free execution nodes to be quickly switched to a different architectural state (since that is exactly what is duplicated) and to execute its instructions. This is a special case of a technique called latency hiding, when one long operation, during which useful resources would sit idle, is masked by the parallel execution of other tasks. If an application already utilizes the core's resources heavily, the presence of hyperthreads will not let it accelerate: "honest" cores are needed here.

Typical desktop and server application scenarios for general-purpose machine architectures have the potential for concurrency enabled by HT. However, this potential is quickly “used up”. Perhaps for this reason, on almost all IA-32 processors, the number of hardware hyperthreads does not exceed two. In typical scenarios, the gain from using three or more hyperthreads would be small, but the loss in crystal size, power consumption and cost is significant.

A different situation is observed in the tasks typically executed on video accelerators. That is why these architectures are characterized by SMT with a large number of threads. Since the Intel® Xeon Phi coprocessors (introduced in 2010) are ideologically and genealogically quite close to video cards, they have four hyperthreads on each core, a configuration unique to IA-32.

Logical processor

Of the three described "levels" of parallelism (processors, cores, hyperthreads), some or all may be missing in a particular system. This is affected by BIOS settings (multi-core and multithreading are disabled independently), by microarchitectural features (for example, HT was absent in the Intel® Core™ Duo but was brought back with the release of Nehalem), and by system events (multiprocessor servers can shut down failed processors in the event of a malfunction and continue to "fly" on the remaining ones). How is this multi-tier zoo of concurrency visible to the operating system and, ultimately, to applications?

Further, for convenience, we denote the number of processors, cores, and hyperthreads in some system by the triple (x, y, z), where x is the number of processors, y is the number of cores in each processor, and z is the number of hyperthreads in each core. Hereinafter I will call this triple the topology, a well-established term that has little to do with the branch of mathematics of the same name. The product p = xyz is the number of entities called the system's logical processors. It is the total number of independent, concurrently executing application process contexts in a shared-memory system that the operating system is forced to consider. I say "forced" because it cannot control the order of execution of two processes on different logical processors. This also applies to hyperthreads: although they execute on the same core, the specific order is dictated by the hardware and is not available for programs to observe or control.

Most often, the operating system hides from end applications the details of the physical topology of the system on which it runs. For example, the OS will present the three topologies (2, 1, 1), (1, 2, 1) and (1, 1, 2) identically, as two logical processors, even though the first has two processors, the second two cores, and the third just two hyperthreads.
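The arithmetic of the triple is trivial but worth making explicit. A minimal sketch (the function name is our own, not any OS API):

```python
# Sketch: a topology triple (x, y, z) and the number of logical
# processors p = x * y * z that the OS will report. Illustrative only.
def logical_processors(x, y, z):
    """x processors, y cores per processor, z hyperthreads per core."""
    return x * y * z

# The three topologies from the text all look identical to the OS:
for topo in [(2, 1, 1), (1, 2, 1), (1, 1, 2)]:
    print(topo, "->", logical_processors(*topo), "logical processors")
```

Running it shows each triple collapsing to the same count of 2, which is exactly the abstraction the OS presents.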


Windows Task Manager shows 8 logical processors; but how much is it in processors, cores and hyperthreads?


Linux top shows 4 logical processors.

This is quite convenient for application developers - they do not have to deal with the hardware features that are often irrelevant to them.

Topology definition programmatically

Of course, abstraction of topology into a single number of logical processors in some cases creates enough grounds for confusion and misunderstandings (in heated Internet disputes). Computing applications that want to squeeze the maximum performance out of hardware require detailed control over where their threads will be placed: closer to each other on neighboring hyperthreads, or, conversely, farther away on different processors. The speed of communication between logical processors in a single core or processor is much higher than the speed of data transfer between processors. The possibility of heterogeneity in the organization of RAM also complicates the picture.

Information about the topology of the system as a whole, as well as the position of each logical processor within it, is available on IA-32 via the CPUID instruction. Since the appearance of the first multiprocessor systems, the logical processor identification scheme has been extended several times. To date, parts of it are contained in leaves 1, 4 and 11 of CPUID. Which leaf to consult can be determined from the following flowchart, taken from the article:
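The gist of that decision can be sketched as a plain function (a simplification, and the parameter names are our own; real code would first execute CPUID itself to obtain these values):

```python
def topology_leaf(max_basic_leaf, leaf_b_ebx, htt_flag):
    """Pick which CPUID leaf to use for topology enumeration (simplified).

    max_basic_leaf -- CPUID.0.EAX, the highest supported basic leaf
    leaf_b_ebx     -- EBX returned by CPUID.0xB subleaf 0 (zero means
                      leaf 0xB is not implemented on this processor)
    htt_flag       -- CPUID.1.EDX bit 28, the multi-threading capability bit
    """
    if max_basic_leaf >= 0xB and leaf_b_ebx != 0:
        return 0xB   # the modern leaf with full 32-bit x2APIC IDs
    if htt_flag:
        return 4     # older scheme combining data from leaves 1 and 4
    return 1         # single-threaded package: leaf 1 suffices
```

The point is only the order of the checks: prefer leaf 0xB when it exists, fall back to the older leaves otherwise.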

I will not bore you here with all the details of the individual parts of this algorithm; if there is interest, the next part of this article can be devoted to them. For the most detailed treatment of the question, I refer the interested reader to the literature. Here I will first briefly describe what the APIC is and how it relates to topology, and then consider working with leaf 0xB (eleven in decimal), which at the moment is the last word in APIC-based topology enumeration.

APIC ID
Local APIC (advanced programmable interrupt controller) is a device (now part of the processor) responsible for working with interrupts coming to a specific logical processor. Each logical processor has its own APIC. And each of them in the system must have a unique APIC ID value. This number is used by interrupt controllers for addressing when delivering messages, and by everyone else (for example, the operating system) to identify logical processors. The specification for this interrupt controller has evolved from the Intel 8259 PIC through Dual PIC, APIC and xAPIC to x2APIC.

Currently, the width of the number stored in the APIC ID has reached the full 32 bits, although in the past it was limited to 16, and even earlier - only 8 bits. Nowadays, the remnants of the old days are scattered all over the CPUID, but all 32 bits of the APIC ID are returned in CPUID.0xB.EDX. Each logical processor, independently executing the CPUID instruction, will return its own value.

Clarification of family ties
The APIC ID value by itself says nothing about the topology. To find out which two logical processors sit inside one core (i.e., are hyperthread "siblings"), which two are inside the same processor, and which are in different processors altogether, you need to compare their APIC ID values. Depending on the degree of kinship, some of their bits will coincide. This information is contained in the subleaves of CPUID.0xB, which are selected using the ECX register. Each subleaf describes in EAX the position of the bit field of one topology level (more precisely, the number of bits by which the APIC ID must be shifted right to remove the lower topology levels), and in ECX the type of that level: hyperthread, core, or processor.

Logical processors located inside the same core will have identical APIC ID bits except for those belonging to the SMT field. For logical processors inside the same processor, all bits except those of the Core and SMT fields will coincide. Since the number of subleaves of CPUID.0xB can grow, this scheme will allow describing topologies with more levels if the need arises in the future. Moreover, it will be possible to introduce intermediate levels between the existing ones.

An important consequence of this scheme is that the set of all APIC IDs of all logical processors in the system may contain "holes"; the IDs will not necessarily be consecutive. For example, in a multi-core processor with HT turned off, all APIC IDs may turn out to be even, since the least significant bit, responsible for encoding the hyperthread number, will always be zero.
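To make the bit arithmetic concrete, here is a small sketch of splitting an APIC ID into its fields. The field widths are taken as inputs, the way CPUID.0xB subleaves would report them; the function itself is our illustration, not a standard API:

```python
def split_apic_id(apic_id, smt_bits, core_bits):
    """Split an APIC ID into (package, core, smt) components.

    smt_bits  -- width of the SMT field (the shift reported by
                 CPUID.0xB subleaf 0)
    core_bits -- width of the core field on top of the SMT field
    """
    smt = apic_id & ((1 << smt_bits) - 1)
    core = (apic_id >> smt_bits) & ((1 << core_bits) - 1)
    package = apic_id >> (smt_bits + core_bits)
    return package, core, smt

# Example layout: 1 bit for SMT, 2 bits for the core number.
print(split_apic_id(0b0101, 1, 2))  # -> (0, 2, 1): package 0, core 2, thread 1

# With HT off only even APIC IDs occur, so the SMT field is always 0:
assert all(split_apic_id(a, 1, 2)[2] == 0 for a in (0, 2, 4, 6))
```

Two logical processors are siblings at some level exactly when the corresponding components (and everything above them) match.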

Note that CPUID.0xB is not the only source of information about logical processors available to the operating system. A list of all processors available to it, along with their APIC ID values, is encoded in the MADT ACPI table.

Operating systems and topology

Operating systems provide logical processor topology information to applications through their own interfaces.

On Linux, topology information is contained in the /proc/cpuinfo pseudo-file and in the output of the dmidecode command. In the example below, I filter the contents of cpuinfo on a system with four logical processors, leaving only the topology-related entries:

Hidden text

[email protected]:~$ cat /proc/cpuinfo | grep "processor\|physical\ id\|siblings\|core\|cores\|apicid"
processor       : 0
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
processor       : 1
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
apicid          : 1
initial apicid  : 1
processor       : 2
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2
apicid          : 2
initial apicid  : 2
processor       : 3
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2
apicid          : 3
initial apicid  : 3
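Such output can be reduced to a topology summary by counting unique physical id values (sockets) and unique (physical id, core id) pairs (cores). A hedged sketch (the field names follow /proc/cpuinfo; the parsing logic is our own):

```python
def summarize_cpuinfo(text):
    """Count (sockets, physical cores, logical CPUs) from /proc/cpuinfo text."""
    cpus, cur = [], {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, val = line.partition(":")
        key, val = key.strip(), val.strip()
        if key == "processor" and cur:   # a new logical CPU record begins
            cpus.append(cur)
            cur = {}
        cur[key] = val
    if cur:
        cpus.append(cur)
    sockets = {c.get("physical id") for c in cpus}
    cores = {(c.get("physical id"), c.get("core id")) for c in cpus}
    return len(sockets), len(cores), len(cpus)

sample = """\
processor : 0
physical id : 0
core id : 0
processor : 1
physical id : 0
core id : 0
processor : 2
physical id : 0
core id : 1
processor : 3
physical id : 0
core id : 1
"""
print(summarize_cpuinfo(sample))  # -> (1, 2, 4): 1 socket, 2 cores, HT on
```

For the system shown above this yields one socket, two cores, and four logical CPUs, i.e. two hyperthreads per core.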

In FreeBSD, the topology is reported via the sysctl mechanism in the kern.sched.topology_spec variable as XML:

Hidden text

[email protected]:~$ sysctl kern.sched.topology_spec
kern.sched.topology_spec:
 0, 1, 2, 3, 4, 5, 6, 7
  0, 1, 2, 3, 4, 5, 6, 7
   0, 1  (THREAD group, SMT group)
   2, 3  (THREAD group, SMT group)
   4, 5  (THREAD group, SMT group)
   6, 7  (THREAD group, SMT group)

In MS Windows 8, topology information can be seen in the Task Manager.
