
Intel Sandy Bridge processors: all the secrets. Five generations of Core i7: from Sandy Bridge to Skylake

"published about a year ago, we talked about the Nehalem microarchitecture, which replaced Core at the end of 2008. This review will focus on the Sandy Bridge architecture, which should completely replace Nehalem in the very near future.

Today, chips based on Sandy Bridge are present in all of Intel's processor lines, including the server Xeon, the desktop and mobile Core i3/i5/i7, Pentium and Celeron, and the "extreme" Core i7 Extreme. Shortly before this article was published, on May 22, 2011, seven more new Sandy Bridge processors were introduced.

What are the fundamental differences between Sandy Bridge and Nehalem, and what are the features and advantages of the new Intel microarchitecture? In short: the updated graphics core and the "system agent" now sit on the same die as the compute cores; there is a new L0 micro-instruction buffer, a shared L3 cache, upgraded Turbo Boost technology, the extended AVX SIMD instruction set, and a redesigned dual-channel DDR3-1333 memory controller. Along with the new architecture came a new processor socket, LGA 1155.

One of the main design differences between Sandy Bridge and Nehalem is the placement of the compute cores and the north bridge (system agent) on one die. Recall that in Nehalem the CPU proper and the north bridge sat under a common heat spreader but were in fact separate chips, made, moreover, on different process nodes: the CPU at 32 nm and the north bridge at 45 nm. In Sandy Bridge it is a single die, made on a 32 nm process, housing the compute cores, the graphics core, the memory and PCI Express controllers, the Power Control Unit (PCU) and a video output unit.

The new set of SIMD instructions in Sandy Bridge chips is called AVX: Advanced Vector Extensions. In fact, this is the next generation of SIMD (Single Instruction, Multiple Data, "single instruction stream, multiple data stream") instructions after SSE, with support for four-operand commands. The chips also support Advanced Encryption Standard (AES) hardware encryption and the Virtual Machine Extensions (VMX) virtualization system.
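Since AVX code has to coexist with older CPUs, programs are expected to probe for the extension at run time. Below is a minimal sketch of such a probe using GCC's `__builtin_cpu_supports` (our illustration, not code from Intel's materials); it also covers the OS-support side, since AVX requires the operating system to save the 256-bit registers:

```c
/* Minimal run-time AVX probe (GCC/Clang). __builtin_cpu_supports
   consults CPUID and XGETBV, so it also verifies that the OS
   preserves the 256-bit YMM registers. */
#include <stdio.h>

int main(void)
{
    if (__builtin_cpu_supports("avx"))
        puts("AVX available: a 256-bit code path can be used");
    else
        puts("No AVX: fall back to SSE");
    return 0;
}
```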

Despite the similar design, Sandy Bridge chips have more execution units than Nehalem: 15 versus 12 (see the block diagram). Each execution unit is connected to the instruction scheduler via a 128-bit channel. Two execution units work together to execute the new AVX instructions operating on 256-bit data.

Sandy Bridge chips are capable of processing up to four instructions per clock cycle thanks to four decoders built into the instruction fetch blocks. These decoders convert x86 instructions into simple RISC-like microinstructions.

The most important innovation in Sandy Bridge processors is the so-called "level zero" cache, L0, which was absent in previous-generation processors. This cache can store up to 1,536 decoded micro-instructions. The idea is that when a program enters a tight loop, repeatedly executing the same instructions, there is no need to decode those instructions again. This scheme can significantly improve performance: according to Intel, L0 is used about 80% of the execution time, that is, in the overwhelming majority of cases. In addition, while L0 is in use, the decoders and the L1 instruction cache are turned off, so the chip draws less power and generates less heat.

The appearance of a "level zero cache" in Sandy Bridge chips often brings to mind the trace cache of the "gigahertz race veterans", the NetBurst-based Pentium 4 processors. However, these buffers work differently: the trace cache stored instructions in exactly the order they were executed, so the same instruction could appear in it several times; L0 stores each instruction once, which is, of course, more rational.

The branch prediction unit has also changed noticeably: its branch target buffer has doubled in size, and a special data compression algorithm is now used in the buffer, allowing the unit to prepare larger volumes of instructions and thereby raise computational performance.

The memory subsystem in Sandy Bridge has also been optimized to handle 256-bit AVX instructions. Recall that Nehalem used dedicated ports for loads, store addresses and store data, tied to separate dispatch ports, and could load 128 bits of data from the L1 cache per clock. In Sandy Bridge the load and store ports can be reassigned as needed and act together as a pair of load (or store) ports, allowing 256 bits of data per clock.
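As an illustration of what that widened path feeds, here is a hedged sketch of a 256-bit AVX loop: one `_mm256_load_ps` pulls 256 bits where two 128-bit SSE loads were needed before (the function name `scale` and the alignment assumption are ours):

```c
/* Scales an array by k using 256-bit AVX operations.
   Assumes 32-byte-aligned buffers; compile with -mavx. */
#include <immintrin.h>
#include <stddef.h>

void scale(float *dst, const float *src, float k, size_t n)
{
    __m256 vk = _mm256_set1_ps(k);
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256 v = _mm256_load_ps(src + i);              /* one 256-bit load  */
        _mm256_store_ps(dst + i, _mm256_mul_ps(v, vk));  /* one 256-bit store */
    }
}
```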

Sandy Bridge uses a ring interconnect for communication between the chip's components: the compute cores, L3 cache, graphics core and system agent (memory, PCI Express, power and display controllers). It grew out of the high-speed QPI bus (Quick Path Interconnect, 6.4 GT/s at 3.2 GHz), first implemented in the Nehalem Bloomfield chips (Core i7 9xx for Socket LGA1366) addressed to enthusiasts.

In essence, the Sandy Bridge ring bus is made up of four rings: a 32-byte data ring plus request, acknowledge and snoop rings. Requests are processed at the frequency of the compute cores, so at a 3 GHz clock the bus bandwidth reaches 96 GB per second. The system automatically chooses the shortest transmission path, ensuring minimal latency.
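The quoted figure is easy to verify with back-of-the-envelope arithmetic: the 32-byte data ring clocked at the core frequency moves 32 B x 3 GHz per ring stop:

```c
/* Sanity check of the stated ring-bus bandwidth. */
#include <stdio.h>

int main(void)
{
    const double ring_width_bytes = 32.0;   /* data-ring width   */
    const double core_clock_hz    = 3.0e9;  /* 3 GHz, as quoted  */
    printf("%.0f GB/s\n", ring_width_bytes * core_clock_hz / 1e9); /* 96 */
    return 0;
}
```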

The ring bus also allowed a different implementation of the level-three cache, which in Sandy Bridge is called LLC (Last Level Cache). Unlike in Nehalem, the LLC here is not simply common to all cores; rather, it can be distributed among the cores, the graphics and the system agent as needed. It is important to note that although each compute core has its own LLC segment, that segment is not rigidly tied to "its" core: its capacity can be shared with other components over the ring bus.

With the move to Sandy Bridge, Intel gave all the parts of the processor that do not belong to the compute cores themselves the general name System Agent. In effect these are the components of the classic "north bridge" of the chipset, but that name better suits a separate chip. For Nehalem the strange and clearly unfortunate name "Uncore" was used, so "system agent" sounds much more appropriate.

The main elements of the system agent are an upgraded dual-channel DDR3 memory controller (up to 1333 MHz) and a PCI Express 2.0 controller supporting one x16 bus, two x8 buses, or one x8 and two x4 buses. The chip has a dedicated power control unit, on which the new generation of Turbo Boost automatic overclocking is built. Thanks to this technology, which takes into account the state of both the compute and graphics cores, the chip can, when needed, significantly exceed its thermal envelope for up to 25 seconds without damaging the processor.

Sandy Bridge uses the next-generation Intel HD Graphics 2000 and HD Graphics 3000 GPUs, composed of six or twelve execution units (EU) depending on the processor model. The nominal graphics clock is 650 or 850 MHz, and it can rise to 1100, 1250 or 1350 MHz in Turbo Boost mode, which now covers the video accelerator as well. The graphics support the DirectX 10.1 API: the developers considered DirectX 11 support unnecessary, reasoning that fans of computer games, where that API is really in demand, will prefer much more powerful discrete graphics in any case.

The labeling of Sandy Bridge processors is quite simple and logical. As before, it consists of numeric indices, which in some cases are followed by letters. Sandy Bridge can be distinguished from Nehalem by the name: the index of new chips is four-digit and starts with a two ("second generation"), and the old ones are three-digit. For example, we have an Intel Core i5-2500K processor. Here, "Intel Core" is the brand, "i5" is the series, "2" is the generation, "500" is the model index, and "K" is the letter index.
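To make the scheme concrete, here is a toy parser for such names (purely illustrative: the format string encodes our reading of the naming rules, not any official Intel specification):

```c
/* Decomposing a model name like "i5-2500K" into its parts. */
#include <stdio.h>

int main(void)
{
    const char *name = "i5-2500K";
    char series[4] = "", suffix = '-';
    int generation = 0, model = 0;

    /* "i5" series, "2" generation, "500" model, optional "K" suffix */
    if (sscanf(name, "%3[i357]-%1d%3d%c",
               series, &generation, &model, &suffix) >= 3)
        printf("series %s, generation %d, model %03d, suffix %c\n",
               series, generation, model, suffix);
    return 0;
}
```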

As for the letter indices, one of them is known from the chips with the Nehalem microarchitecture - "S" (i5-750S and i7-860S processors). It is assigned to chips targeted at home multimedia machines. Processors with the same numeric index differ in that the models with the letter index "S" operate at a slightly lower nominal clock frequency, but the "turbo frequency" achieved with automatic Turbo Boost is the same for them. In other words, in normal operation they are more economical and their cooling system is quieter than that of the "standard" models. All new desktop Cores of the second generation without indexes consume 95 watts, and with the "S" index - 65 watts.

Modifications with the "T" index operate at an even lower clock frequency than the "base" ones, while their "turbo frequency" is also lower. The thermal package of such processors is only 35 or 45 W, which is quite comparable to the TDP of modern mobile chips.

And finally, the "K" index stands for an unlocked multiplier, which allows you to overclock the processor without hindrance, increasing its clock speed.

We got acquainted with the general technical solutions implemented in "desktop" processors with Sandy Bridge architecture. Next, we will talk about the features of different series, study the current model range and give recommendations on which specific models can be considered the best purchases in their class.

We are opening a series of articles about the new Intel Sandy Bridge processor microarchitecture. In the first article, we will touch on the theory - we will talk about changes and innovations. In the near future, the results of tests of the new platform and a lot of interesting things will appear on the blog pages.

The Tick-Tock concept devised at Intel continues to work: every year the manufacturer introduces a modified processor microarchitecture. The "Tick" phase means refining previous developments (a process shrink, the introduction of not-too-revolutionary new technologies, and so on). About a year after a "Tick" comes a "Tock": the release of processors based on a substantially new microarchitecture.

In early 2010, Intel introduced the line of chips codenamed Westmere / Clarkdale, a process-technology advancement of the very first Core i3/i5/i7 (Nehalem) models. That was a "Tick"; now it is "Tock's" turn. Meet the revolutionary Sandy Bridge microarchitecture, the basis of the processors collectively called Core 2011: completely new Core i3, Core i5 and Core i7 models, as well as budget Pentium and Celeron models.

This time, the manufacturer decided not to waste time on trifles and immediately announced many models for mobile and desktop computers in all price ranges. True, only a few, far from the most affordable versions went on sale, but more on that later.

The press calls Sandy Bridge one of the most significant Intel microarchitectures in recent years: the manufacturer has done everything possible to bring its processors to a new level of performance, polished the technologies presented earlier, and offered an unprecedented degree of integration of compute units and controllers. Next to Sandy Bridge, the earlier models look almost naive. Let's take a closer look at the changes in Core 2011.

Features of the new microarchitecture

A block diagram of the Sandy Bridge microarchitecture is unlikely to say much by itself about the technologies implemented and the overall changes. It is worth knowing, however, that all the components of the new processors differ significantly from those of Westmere / Clarkdale. The main thing to understand before exploring the features of Sandy Bridge is that the architectural improvements let the new processors run 10-50% faster than the Core 2010 generation.

Intel engineers redesigned the branch prediction unit, changed the front end, implemented an advanced decoded-instruction cache and a high-speed ring bus, added the AVX advanced vector extensions unit, reworked the integrated RAM controller and the PCI Express links, changed the integrated graphics beyond recognition, introduced a fixed-function block for hardware-accelerated video transcoding, refined the Turbo Boost auto-overclocking technology, and so on. Now do you believe there really are a lot of changes? We will briefly go over each of them to form a general picture before full testing appears on our blogs.

To begin with: the 4-core Sandy Bridge models are built from 995 million transistors on a well-tuned 32 nm process. About 114 million go to the graphics chip, each core takes 55 million, and the rest goes to the additional controllers. For comparison, a full 4-core AMD Phenom II X4 processor contains 758 million transistors, and a 4-core Nehalem 731 million. With all this, a full Sandy Bridge die occupies 216 square millimeters; the die of one of the first 4-core Intel processors (Core 2 Quad) occupied the same area with far fewer transistors and, accordingly, offered incomparably lower performance.

Now, let me tell you about the key innovations in microarchitecture in order.

Decoded instruction cache (micro-op cache). Introduced in Sandy Bridge, the micro-op cache stores instructions as they are decoded. During execution the processor checks whether the next instruction is already in this cache; if so, the front end and the decode pipeline are powered down, saving energy. The 1.5K-entry (1,536 micro-ops) decoded cache works in close conjunction with the first-level (L1) instruction cache.

The redesigned branch predictor boasts increased prediction accuracy, made possible by several significant design innovations.

Ring Bus. Sandy Bridge processors use an advanced and very fast ring bus to tie the architecture's blocks together. The interface owes its appearance to the integrated graphics core and the video transcoder: their need to communicate with the third-level cache made the previous connection scheme (roughly a thousand wires per core) impractical. All the important processor components are connected to the redesigned bus: the graphics, the x86 cores, the transcoder, the System Agent and the L3 cache.

The block called "System Agent", previously known as the uncore, unites the controllers that used to live in the north bridge on the motherboard. The agent includes 16 lanes for the PCI Express 2.0 bus, a dual-channel DDR3 memory controller, the DMI interface to the system bus, a power control unit and the display unit responsible for image output.

One of the most important innovations of Sandy Bridge is the graphics core, redesigned from scratch. To begin with, the graphics are now integrated with the other blocks on a single die (previously, two separate dies were hidden under the metal cover of Clarkdale processors). Intel engineers boast of double the throughput compared to the previous generation of Intel HD Graphics, thanks to redesigned unified shader processors, access to the L3 cache, and other enhancements. The new processors come with two noticeably different graphics cores: HD Graphics 2000 and HD Graphics 3000. The first offers six unified shader processors, the second twelve. Intel and the industry press say the new graphics make the cheapest discrete cards redundant, but we will have to verify that in a separate review. We almost forgot to mention that the new HD Graphics models support DirectX 10; the transition to more modern graphics technologies will happen in the next generations of processors.

In addition, the new graphics chip provides a separate Media Engine block, consisting of two parts for video transcoding and decoding. Intel engineers decided not to tempt fate - before, decoding and encoding video was done by unified shader processors and, in part, low-power fixed units. According to eyewitnesses, the Fixed Media Engine does its job faster and better than even monstrous high-end graphics cards.

The revised Turbo Boost auto-overclocking algorithms now let the processor briefly exceed its prescribed power consumption limits - in practice this means it can sprint over short distances. Of course, the automation will not let it cross the line of reliability. Recall that Turbo Boost automatically raises the frequency of one, two, three or four cores as needed. Thus, the most powerful model, the Intel Core i7-2600, can raise the frequency of one core up to 3.8 GHz when running applications not optimized for multi-core architectures.
Locked overclocking

Back in the days of the Pentium II, Intel started selling processors with locked multipliers to keep users from playing with clock speeds, which let the company sell essentially the same silicon at different price points. But overclockers always retained the ability to adjust the FSB frequency. With the arrival of Sandy Bridge everything changes again: the multiplier is firmly locked in most models, and the base clock generator is integrated into the single bridge of the 6-series chipsets and fixed at 100 MHz.
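The arithmetic behind this lockdown is simple: core frequency = base clock x multiplier, and with the base clock pinned at 100 MHz only the multiplier is left to play with. A small sketch (the multiplier values are illustrative, not the spec of any particular model):

```c
/* Why a fixed 100 MHz BCLK kills bus overclocking. */
#include <stdio.h>

int main(void)
{
    const double bclk_mhz = 100.0;  /* fixed by the 6-series chipset  */
    int mult_stock = 34;            /* e.g. a 3.4 GHz part            */
    int mult_oc    = 45;            /* reachable only on a "K" chip   */
    printf("stock: %.1f GHz\n", bclk_mhz * mult_stock / 1000.0);
    printf("oc:    %.1f GHz\n", bclk_mhz * mult_oc    / 1000.0);
    return 0;
}
```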

Modifications with unlocked multipliers remain the only outlet for overclocking; there are not many such models in the new line, but they exist and cost quite reasonable money.

Lineup

It's time to talk about the processors that were presented first - to make sense of the new names and figure out which processor to choose for your purposes.
At the launch of Sandy Bridge, Intel introduced 29 (twenty-nine!) new Core iX models: fourteen for desktops and fifteen for mobile computers.

The manufacturer has switched to a new, even more obscure processor naming scheme, which takes some effort to decipher.
So, the name of each new desktop processor consists of a brand (Intel Core), the name of a specific line (i3, i5, i7), an index (2600) and a suffix (K). There are only three suffixes in the desktop line: K (unlocked multiplier), S (65 W power consumption) and T (34-45 W power consumption). The strangest thing is that the powerful HD Graphics 3000 chip is included only in the models with an unlocked multiplier (K); the rest make do with the noticeably weaker HD Graphics 2000.

The initial Core 2011 desktop line breaks down neatly by line name. Core i7 processors are quad-core chips with Hyper-Threading (4 cores, 8 threads); Core i3 are simple dual-core chips without Turbo Boost but with Hyper-Threading (2 cores, 4 threads); Core i5, for now, are quad-core models with Turbo Boost but without Hyper-Threading. Unfortunately, dual-core models will later appear within the Core i5 line as well, though primarily for builders of ready-made systems.

Another point of differentiation is the auto-overclocking of the integrated graphics core. Both graphics models initially run at 850 MHz, but Core i5 and Core i3 processors can overclock theirs up to 1100 MHz, while the top Core i7 goes up to 1350 MHz. Judge for yourself how this will affect final performance.

With the mobile modifications of Sandy Bridge, things are a little more complicated. For a start, absolutely all mobile processors in the new line use the powerful HD Graphics 3000 chip (even the most economical models). For some reason Intel decided to break the unspoken laws of marketing and let the indices wander: we still have not worked out how to read models with indices 2657, 2537, 2410 and 2720. As for suffixes, there are XM, QM and M designations indicating laptops for different tasks: XM are extreme models for gaming systems, M are dual-core processors for economical laptops, and QM are quad-core processors for mainstream laptops.

Of course, these are not all the models for the coming year: Intel will keep experimenting and will periodically delight fans with new modifications. The main thing is that it does not break the logic of the lineups it has itself devised.

Platform

Together with Sandy Bridge, the 6-series chipsets with the required LGA1155 processor socket were presented; the first arrivals were Intel P67 and Intel H67. Telling the two modifications apart is easy. Intel P67 suits configurations with a discrete graphics card, and the platform supports overclocking; in addition, P67 boards offer 2x8 PCI Express 2.0 lanes for multi-GPU configurations in AMD CrossFire or NVIDIA SLI mode. Intel H67, on the other hand, is of little use for overclocking and supports only one PCI Express x16 port, but it can output the processor's video signal.

All those who dream of getting all the features on one board will have to wait a bit - sometime in the second quarter of 2011, the developers will present the Intel Z68 chipset. Motherboards based on this chipset will support the graphics core built into the processor, as well as all the features of the Intel P67.

A few words about the new processor socket: Intel has changed the socket's wiring and structure, so the older Core 2010 models for LGA 1156 will not work in it. Fortunately, the socket's dimensions have stayed the same, so the numerous coolers for LGA 1156 fit here, and there is no need to hunt for the newest models.

The chipsets still lack native USB 3.0 support, although the market seems quite ready for such "innovations". Those who want the very best will have to look at higher-end motherboards, where manufacturers add third-party USB 3.0 controllers.

Fortunately, Intel has not forgotten the new version of the SATA interface: the new platforms support SATA3 with a bandwidth of up to 6 Gb/s. Clearly, classic spindle hard drives do not need all this extra speed, but flash-based drives will appreciate the headroom. For example, one of the SSDs presented at CES reveals its speed potential only when paired with SATA3; within SATA2 it is cramped (we are talking about the Crucial RealSSD C300). Importantly, SATA3 ports coexist with SATA2 ports on the new motherboards, and although the new interface is fully backward compatible with the previous generation, be careful where you plug in your expensive SSD.
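For reference, the raw line rates translate into usable throughput as follows (SATA uses 8b/10b encoding, so only 8 of every 10 bits carry payload); the sketch below shows only this arithmetic, nothing platform-specific:

```c
/* Payload bandwidth of SATA2 and SATA3 after 8b/10b coding. */
#include <stdio.h>

int main(void)
{
    const double line_rate_gbps[] = { 3.0, 6.0 };   /* SATA2, SATA3 */
    for (int i = 0; i < 2; i++)
        printf("%.0f Gb/s -> ~%.0f MB/s payload\n",
               line_rate_gbps[i],
               line_rate_gbps[i] * 1e9 * 0.8 / 8.0 / 1e6); /* 300, 600 */
    return 0;
}
```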

With the new chipsets, manufacturers are finally starting to shed the last major archaism: the BIOS interface. UEFI replaces the clumsy blue screen of the past; the new shell supports mouse (or touchpad) control and offers a noticeably more modern, user-friendly interface. Other UEFI features include native support for hard drives over 2.2 TB.

What do we end up with?

It is widely believed among experts that Sandy Bridge is just an evolution of previous microarchitectures and that the company has presented nothing fundamentally new. We side with the other camp of analysts. Even though the new line offers no truly revolutionary features, the work Intel has done deserves every praise. The manufacturer has brought all its undertakings to maturity: full integration of all components, a graphics chip improved to an acceptable level, a completed ring bus, a redesigned front end, revised Turbo Boost auto-overclocking, a fixed-function video processing block, and so on. As a result, we have before us completely new processors that are head and shoulders above the previous generations in technical characteristics.

In the near future, DNS blogs will feature testing of the new processor in games and popular programs, an overview of overclocking options with air cooling, and a test of the graphics chip against budget discrete video cards. Don't miss it.

Is the superiority of the first Core i generations (Nehalem and, since 2009, Westmere) over the rival's CPUs final? The situation somewhat resembles the first year after the Pentium II was released: resting on its laurels and making record profits, it would make sense to continue the successful architecture without changing its name much, adding new instructions whose use will significantly improve performance, and not forgetting other innovations that speed up today's versions of programs. True, unlike ten years ago, one must also mind the now-fashionable topic of energy efficiency, played up with the ambiguous adjective Cool ("cool" and "cold"), and the no less fashionable desire to build into the processor everything that still exists as separate chips. This is the sauce under which the novelty is served.

"The day before yesterday", "yesterday" and "today" of Intel processors.


The front end of the pipeline. The colors represent different types of information and the blocks that process or store it.

Prediction

Let's start with Intel's announcement of a completely reworked branch prediction unit (BPU). As in Nehalem, every clock cycle (well ahead of actual execution) it predicts the address of the next 32-byte portion of code based on the expected behavior of the jump instructions in the portion just predicted - apparently regardless of the number and type of branches. More precisely, if the current portion contains a branch presumed taken, its own address and its target address are issued; otherwise, the next sequential portion is fetched. The predictions themselves have become even more accurate thanks to a doubled branch target buffer (BTB), a lengthened global branch history register (GBHR) and an optimized hash function for accessing the branch history table (BHT). True, actual tests have shown that in some cases prediction efficiency is still slightly worse than in Nehalem. Could it be that higher performance at lower power does not mix with good branch prediction? Let's try to figure it out.

In Nehalem (as in other modern architectures) the BTB is a two-level hierarchy: a small, "fast" L1 and a large, "slow" L2. This is done for the same reason there are multiple cache levels: a single-level solution would be too much of a compromise in every parameter (size, response time, power consumption, etc.). But in SB the architects decided on a single level, twice the size of Nehalem's L2 BTB - probably at least 4096 cells, which is how many Atom has. (Note that the size of the most frequently executed code is slowly growing and fits less and less into the cache, whose size has been the same in all Intel CPUs since the first Pentium M.) In theory this should increase the area occupied by the BTB, and since the total area must not grow (one of the initial tenets of the design), something has to be taken away from some other structure. There is also speed to consider. Given that SB is meant to run at slightly higher clocks on the same process, one might expect this large structure to become the bottleneck of the whole pipeline - unless it is pipelined (two stages would already suffice). True, the total number of transistors switching per cycle in the BTB would then double, which does not help energy savings at all. A dead end again? Intel replies that the new BTB stores addresses in a compressed form, which allows twice as many cells in the same area and power budget. But there is no way to verify this yet.

Let's look from the other side. SB received not new prediction algorithms but optimized old ones: the general one, plus those for indirect branches, loops and returns. Nehalem has an 18-bit GBHR and a BHT of unknown size. One can guarantee, however, that the number of cells in the table is less than 2^18, otherwise it would occupy most of the core. Hence there is a special hash function that folds the 18 bits of branch history together with bits of the instruction address into a shorter index. Most likely there are at least two hashes: one for all GBHR bits and one for the bits reflecting the behavior of the most difficult branches. The success of the general predictor then depends on how evenly the indices of the various behavior patterns are scattered across the BHT cells. Although it is not stated explicitly, Intel has certainly improved the hashes, which made longer GBHRs usable with no loss of indexing efficiency. The size of the BHT, though, remains guesswork - as does how the predictor's power consumption actually changed overall. As for the return stack buffer (RSB), it still holds 16 addresses, but a new restriction has been placed on calls themselves: no more than four per 16 bytes of code.
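Intel's actual hash is undisclosed, but the principle is that of the classic gshare predictor. Below is a sketch of such an index function (the BHT size and the exact folding are our assumptions, given purely to show how a long history register can address a much smaller table):

```c
/* gshare-style folding of a global history register (GBHR) and a
   branch address into a short BHT index. All sizes are assumed. */
#include <stdint.h>

#define BHT_BITS 12   /* assume a 2^12-entry BHT */

static uint32_t bht_index(uint64_t gbhr, uint64_t branch_addr)
{
    /* Fold the long history down to BHT_BITS by XOR... */
    uint32_t h = (uint32_t)(gbhr ^ (gbhr >> BHT_BITS));
    /* ...and mix in low address bits so different branches with
       the same history land in different cells. */
    return (h ^ (uint32_t)(branch_addr >> 2)) & ((1u << BHT_BITS) - 1);
}
```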

Before going further, let us note a slight discrepancy between the declared theory and observed practice: testing suggests that the loop predictor in SB has been removed, so the prediction of the final jump back to the start of a loop is handled by the general algorithm, i.e., worse. An Intel representative assured us that nothing should be "worse", however...

Decoding and IDQ

The addresses of instructions predicted in advance (alternating between threads when Hyper-Threading is enabled) are checked for presence in the instruction cache (L1I) and the uop cache (L0m); we will keep silent about the latter for now and describe the rest of the front end first. Oddly enough, Intel kept the size of the instruction portion read from L1I at 16 bytes. Until now this has been an obstacle for code whose average instruction length exceeds 4 bytes, because the 4 instructions per cycle desired for execution then no longer fit into 16 bytes. AMD solved this problem in the K10 architecture by widening the fetch portion to 32 bytes, although its CPUs so far have no more than 3 pipelines. In SB, the mismatch of the two sizes causes a side effect: the predictor issues the address of the next 32-byte block, and if a (presumably) taken branch is found in its first half, the second half need not be fetched and decoded - yet it will be anyway.

From L1I the portion goes to the pre-decoder and from there to the length decoder itself, which processes up to 7 or 6 instructions per cycle (with and without macro-fusion; Nehalem managed at most 6), depending on their total length and complexity. Immediately after a jump, processing starts from the instruction at the target address; otherwise, from the byte before which the length decoder stopped a cycle earlier. The same goes for the end point: either it is a (probably) taken branch whose last byte address came from the BTB, or the last byte of the portion itself - unless the limit of 7 instructions per cycle is reached or an "inconvenient" instruction is encountered. Most likely the length decoder's buffer holds only 2-4 portions, but the length decoder can accept any 16 consecutive bytes. For example, if 7 two-byte instructions are recognized at the start of a portion, the next cycle can process another 16 bytes starting from the 15th.

The length decoder, among other things, detects pairs of macro-fusible instructions. We will discuss the pairs themselves a little later; for now note that, as in Nehalem, no more than one such pair can be detected per clock, although up to 3 pairs (plus one more single instruction) could in principle be marked. However, measuring instruction lengths is a partly sequential process, so identifying several macro-fused pairs in one cycle would not be feasible.

The marked instructions fall into one of two instruction queues (IQ) - one per thread, 20 instructions each (2 more than in Nehalem). The decoder alternately reads instructions from the queues and translates them into uops. It has 3 simple translators (1 instruction into 1 uop, or, with macro-fusion, 2 instructions into 1 uop), a complex translator (1 instruction into 1-4 uops, or 2 instructions into 1 uop) and a micro-sequencer for the most complex instructions requiring 5 or more uops. The latter stores only the "tails" of each sequence, from the 5th uop onward, because the first 4 are produced by the complex translator. Moreover, if the number of uops in a microprogram is not divisible by 4, its last four will be incomplete, and inserting another 1-3 uops from the translators in the same cycle will not work. The decoded results go into two uop queues (one per thread). These (officially called IDQ: instruction decode queue) still hold 28 uops each and keep the ability to lock a loop if its executed part fits inside.

All this (except for the uop cache) was already in Nehalem. What are the differences? First of all, obviously, the decoder has been taught to handle the new AVX commands. Support for all the numbered SSE sets no longer surprises anyone, and AES acceleration commands (including PCLMULQDQ) had already been added in Westmere (the 32 nm version of Nehalem). There is a pitfall here: micro-fusion does not work for instructions that have both a constant and RIP-relative addressing (an address relative to the instruction pointer - the usual way of accessing data in 64-bit code). Such instructions require 2 uops (a separate load and operation), which means the decoder processes them no faster than one per cycle, using only the complex translator. Intel claims this sacrifice saves energy, but it is unclear how: dual placement, dispatch and execution of uops will obviously take more resources, and hence energy, than one.

Macro-fusion has been optimized: previously, only an arithmetic or logical comparison (CMP or TEST) could serve as the first instruction of the pair; now simple arithmetic additions and subtractions (ADD, SUB, INC, DEC) and the logical AND are also allowed, and the set of permitted conditions for the jump (the second instruction of the pair) has changed as well. This makes it possible to compress the last 2 instructions of almost any loop into 1 uop. Of course, restrictions on the fused instructions remain, but they are not critical, since the following conditions are almost always met for a pair of instructions:

  • the first operand of the first instruction must be a register;
  • if the second operand of the first instruction is in memory, RIP-relative addressing is not allowed;
  • the second instruction cannot be at the beginning of a cache line or cross a line boundary.

The rules for the jump itself are as follows (a small illustration follows the list):

  • only TEST and AND are compatible with any condition;
  • comparisons for (in)equality and any signed comparisons are compatible with any permitted first instruction;
  • comparisons for (no-)carry and any unsigned comparisons are not compatible with INC and DEC;
  • the remaining conditions (sign, overflow, parity and their negations) are valid only for TEST and AND.
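In compiled code the fusible pattern is ubiquitous. A hedged illustration (the C source is ours; the comments show the kind of instruction pair a typical compiler emits, though exact code generation varies):

```c
/* A loop whose exit test the decoders can macro-fuse into one uop. */
int sum(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)  /* CMP i,n + JL -> 1 fused uop; on SB  */
        s += a[i];               /* ADD/SUB/INC/DEC + Jcc may fuse too, */
    return s;                    /* e.g. DEC n + JNZ at a loop bottom.  */
}
```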

The main change in the uop queues concerns fused uops in which the memory access requires reading an index register: these (and a few other rare types) are split into pairs when written to the IDQ. Even if there are 4 such uops, all 8 resulting uops will be written to the IDQ. This is done because the uop queues (IDQ), the reorder buffer (ROB) and the reservation station now use a shortened uop format without the 6-bit index field (to save on moving uops around, of course). It is assumed that such cases will be rare and will not greatly affect speed.

We will tell the story of this buffer's loop-lock mode below; here we only point out one detail: the jump to the start of the loop previously took 1 extra cycle, forming a "bubble" between reading the end and the beginning of the loop; now it is gone. Nevertheless, the four uops read per cycle cannot contain both the last uop of the current iteration and the first of the next, so ideally the number of uops in the loop should divide evenly by 4. The criteria for locking a loop have hardly changed:

  • the loop's uops must be generated by no more than 8 32-byte portions of source code;
  • these portions must be cached in L0m (in Nehalem, naturally, in L1I);
  • up to 8 jumps predicted taken are allowed (including the final one);
  • calls and returns are not allowed;
  • unpaired stack accesses are unacceptable (most often an unequal number of PUSH and POP instructions) - more on that below.

Stack engine

There is one more mechanism whose work we did not consider in previous articles: the stack pointer tracker, located in front of the IDQ. It appeared in the Pentium M and has not changed since. Its essence is that modification of the stack pointer (the ESP/RSP register in 32/64-bit mode) by the stack instructions (PUSH, POP, CALL and RET) is done in a separate adder; the result is kept in a special register and inserted into the uop as a constant, instead of modifying the pointer in the ALU after every such instruction, as was required, and done, in Intel CPUs before the Pentium M.

This goes on until some instruction accesses the pointer directly (and in a few other rare cases): the stack engine then compares the shadow offset with zero and, if it is nonzero, inserts a synchronization uop into the uop stream ahead of the instruction touching the pointer, writing the actual value from the special register into the pointer (the register itself is then cleared). Since this is rarely required, most stack accesses, which modify the pointer only implicitly, use its shadow copy, updated in parallel with the other operations. That is, from the point of view of the pipeline blocks, such instructions are encoded by a single fused uop and are no different from ordinary memory accesses, requiring no ALU processing.
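A toy model of this bookkeeping (our interpretation of the published description, not Intel's circuitry) makes the logic clear:

```c
/* Stack engine sketch: PUSH/POP/CALL/RET only update a shadow
   delta in a dedicated adder; a sync uop is needed only when an
   instruction reads the real pointer while the delta is nonzero. */
#include <stdio.h>

static long long rsp   = 0x7fff0000; /* architectural stack pointer */
static int       delta = 0;          /* shadow offset register      */

static void stack_op(int bytes)      /* PUSH: -8, POP: +8, ...      */
{
    delta += bytes;                  /* no ALU uop issued           */
}

static void explicit_rsp_read(void)  /* e.g. MOV RAX, RSP           */
{
    if (delta != 0) {                /* insert a synchronization uop */
        rsp  += delta;               /* commit the delta to RSP      */
        delta = 0;                   /* and clear the shadow register */
        puts("sync uop inserted");
    }
}

int main(void)
{
    stack_op(-8); stack_op(-8);      /* two PUSHes                   */
    explicit_rsp_read();
    printf("rsp = %#llx\n", (unsigned long long)rsp);
    return 0;
}
```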

An attentive Reader (good afternoon!) will notice a connection: when a uop loop is locked, unpaired stack accesses are unacceptable precisely because the stack engine sits in the pipeline before the IDQ. If, after the next iteration, the shadow offset turns out to be nonzero, a sync uop would have to be inserted into the new one, which is impossible in loop mode (uops are only read from the IDQ). Moreover, in this mode the stack engine is switched off altogether to save energy, like all the other parts of the front end.

The secret life of the nop

Another change affected the length decoder, but this case stands out. First, let's recall what nops are and why they are needed. Once upon a time the x86 nop was only 1 byte long. When code had to be shifted by more than 1 byte, or instructions longer than 1 byte replaced, it was simply inserted several times. Even though this instruction does nothing, time is still spent decoding it, in proportion to the number of nops. To keep the performance of a "patched" program from sagging, the nop can be lengthened with prefixes. However, in the CPUs of the 90s, the decode rate of instructions with more than a certain number of prefixes (far fewer than the maximum allowed x86 instruction length of 15 bytes) dropped sharply. Moreover, for a nop the prefix used is, as a rule, of one type repeated many times, which is allowed only as an undesirable exception that complicates the length decoder.

To resolve these problems, starting with the Pentium Pro and Athlon, processors understand a "long nop" with a modR/M byte, which "officially" lengthens the instruction using registers and an address offset. Naturally, no operations on memory or registers take place, but when determining the length, the same length-decoder circuits are used as for ordinary multi-byte instructions. Today, using long nops is officially recommended by both Intel's and AMD's low-level software optimization manuals. Incidentally, the SB pre-decoder has halved (from 6 to 3 clocks) the penalty for prefixes 66 and 67, which change the length of a constant or an address offset; as in Nehalem, though, no penalty is imposed on instructions where these prefixes do not actually change the length (for example, when prefix 66 is applied to an instruction without an immediate operand) or are a mandatory part of the opcode (common in vector code).
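For reference, the recommended multi-byte nop encodings (these specific byte sequences are the ones given in Intel's optimization manual) can be written down as plain byte arrays:

```c
/* Canonical 1- to 9-byte x86 NOP encodings (Intel-recommended). */
static const unsigned char nop1[] = { 0x90 };
static const unsigned char nop2[] = { 0x66, 0x90 };
static const unsigned char nop3[] = { 0x0F, 0x1F, 0x00 };
static const unsigned char nop4[] = { 0x0F, 0x1F, 0x40, 0x00 };
static const unsigned char nop5[] = { 0x0F, 0x1F, 0x44, 0x00, 0x00 };
static const unsigned char nop6[] = { 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00 };
static const unsigned char nop7[] = { 0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00 };
static const unsigned char nop8[] = { 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };
static const unsigned char nop9[] = { 0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };
```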

The maximum length of a well-formed long nop is 9 bytes for Intel and 11 for AMD, so alignment to 16 or 32 bytes may still require several nops. Still, since this instruction is simple, its decoding and "execution" take no more resources than the simplest working instructions. That is why testing with long nops has for many years been the standard way to probe the parameters of the pipeline front end, in particular the length decoder and the decoder. And here Sandy Bridge threw a very strange surprise: performance testing of ordinary programs revealed no delays or slowdowns, but a routine synthetic check of the decoder parameters unexpectedly showed a throughput of one instruction per clock! Intel, meanwhile, had made no official announcements about any such radical change in the decoder.

The measurement procedure had worked fine on Nehalem, showing the correct 4. One could blame the new, "overly" active Turbo Boost 2.0 for spoiling the measured clock rates, but it was disabled for the tests. Thermal throttling was ruled out as well. And when the reason was finally found, things got even stranger: it turns out that long nops on SB are processed only by the first simple translator, whereas 1-byte nops with any number of prefixes, and similar "do-nothing" instructions (for example, copying a register to itself), are happily accepted by all four. Why this was done is unclear, but at least one drawback of the decision has already shown itself: it took our research team ten days to track down the reasons for the decoder's mysterious slowness... In retaliation, we invite the furious fans of the Opposite Camp to come up with some conspiracy theory about the insidious plans of a certain company I. to bamboozle naive, valiant processor researchers. :)

Incidentally, translator #1 turned out to have been "more equal" than the others before. In Nehalem, rotate instructions (ROL and ROR) with an explicit constant operand were also decoded only in the first translator, and in the same cycle the fourth one was turned off, so the IPC dropped to 3. Why bring up such a rare case? Because precisely this quirk meant that achieving peak speed in hashing algorithms like SHA-1 required very precise instruction layout, which compilers could not manage. In SB such instructions simply became 2-uop ones, so that, while occupying the complex translator (of which there is only one), they behave almost the same for the CPU but more predictably for humans and compilers. With nops, the opposite happened.

Uop cache

Goals and predecessors

We deliberately separated this chapter from the rest of the front-end description: the addition of the uop cache clearly shows the path Intel has chosen for all its processors since Core 2. There, for the first time (for Intel), a block was added that served two seemingly conflicting goals at once: raising speed and saving energy. We mean the instruction queue (IQ) between the pre-decoder and the decoder, which back then held up to 18 instructions totalling up to 64 bytes. If it merely smoothed out the difference between the rates of instruction preparation and decoding (like an ordinary buffer), the benefit would be small. But Intel thought of attaching a small block called LSD to the IQ (hardly a sign that the guys were "on" something - they just have that kind of humor): the Loop Stream Detector. When a loop fitting into 18 instructions is detected, the LSD switches off all the preceding stages (predictor, L1I cache and pre-decoder) and spools the loop's instructions from the queue to the decoder until the loop completes or a jump outside its bounds occurs (calls and returns are not allowed). Thus energy is saved by shutting down temporarily idle blocks, and performance rises thanks to a guaranteed feed of 4 instructions per clock to the decoder, even if they are "garnished" with the most inconvenient prefixes.

Intel clearly liked the idea, so in Nehalem the scheme was refined: the IQ was duplicated (one per thread), two IDQ queues of 28 uops each were placed between the decoder and the dispatcher (that is, exactly on the boundary between the front and back ends), and the LSD block was moved to them. Now, when a loop is locked, the decoder is also turned off, and performance rises further thanks to a guaranteed influx of 4 uops (rather than 4 instructions) per cycle, even if they were generated at the minimum (for Core 2/i) rate of 2 uops per cycle. The furious fans of the Opposite Camp, looking up from their favorite pastime for a second, will immediately needle us: if the LSD is such a good thing, why wasn't it built into Atom? And the needling is fair: despite having a 32-uop queue after the decoder, Atom cannot lock a loop in it, which would be very useful for saving precious milliwatts. Nevertheless, Intel was not about to abandon the idea and prepared an update for the new CPUs - and what an update!

Intel's official internal name for it is DSB (decode stream buffer), though that is less apt than the recommended term: decoded instruction cache (DIC). Oddly enough, it does not replace but complements the IDQ queues, which now connect either to the decoder or to the uop cache. On each branch prediction, the target address is looked up simultaneously in the instruction cache and the uop cache. If the latter hits, further fetching proceeds from it, and the rest of the front end is switched off. That is why the uop cache is a level-zero uop cache: L0m.

Curiously, this idea could be extended by calling the IDQs "level minus one" caches. :) But isn't such a layered hierarchy excessive within not even a whole core, but a single front end? Even if Intel, for once, did not skimp on area, will the IDQ pair bring significant additional savings, given that while they run, only the uop cache is now switched off, the rest of the front end (except the predictor) being asleep already? Nor will there be much extra speed, since the uop cache is also geared to deliver 4 uops per cycle. Apparently, Intel's engineers decided the three-level game was worth the milliwatt candles.

Besides the savings, the uop cache boosts performance by, among other things, reducing the branch misprediction penalty: in Nehalem, when the correct code was found in L1I, the penalty was 17 cycles; in SB it is 19, but only 14 if the code is found in L0m. Moreover, these are maximum figures: after a misprediction the scheduler still has to issue and complete the preceding uops in program order, and during that time L0m may manage to pump in the correct uops, so that the scheduler can launch them immediately after the instructions before the branch retire. In Nehalem this trick worked with the IDQ and the front end, but in the first case the probability that the correct target address also lies inside a 28-uop loop is very small, and in the second the slowness of the front end usually prevented reducing the latency to zero. SB has better chances.

Organization

Topologically, L0m consists of 32 sets of 8 lines each (8-way set-associative). Each line holds 6 uops (1,536 in the whole cache, i.e., "one and a half kilo-uops"), and the cache can write and read one line per cycle. The predictor issues addresses of 32-byte blocks, and it is this size that L0m works with; below, "portion" therefore means an aligned 32-byte block of code predicted for execution (not a 16-byte one, as for the decoder). During decoding, the L0m controller waits for the portion to be processed to its end or to the first taken branch in it (presumably taken, of course - from here on we assume predictions are always correct), accumulating uops while they are also sent to the back end. It then fixes the portion's entry and exit points according to the behavior of the branches. Usually the entry point is the target address of the branch taken in the previous portion (more precisely, the low 5 bits of that address), and the exit point is the address of the taken branch within this portion. In the extreme case, when neither the previous nor the current portion has a taken branch (that is, the portions are not only executed but also stored sequentially), both are executed in full: the entry falls on uop zero and the first byte of the first instruction fitting entirely in the portion, and the exit on the last uop of the last fully fitting instruction and its first byte.
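The stated geometry is easy to check against the capacity quoted earlier:

```c
/* 32 sets x 8 ways x 6 uops per line = the 1,536-uop L0m capacity. */
#include <stdio.h>

int main(void)
{
    const int sets = 32, ways = 8, uops_per_line = 6;
    printf("%d uops\n", sets * ways * uops_per_line); /* 1536 */
    return 0;
}
```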

If a portion produces more than 18 uops, it is not cached. This sets the minimum average instruction size (within a portion) at 1.8 bytes, which will not be a serious limitation for most programs. Recall the second item of the IDQ restrictions: if a loop fits into one portion but takes 19 to 28 uops, neither the L0m cache nor the IDQ queue will lock it, even though by size it would fit in both. However, in that case the average instruction length would have to be 1.1-1.7 bytes, which is extremely unlikely for two dozen instructions in a row.

Most likely, a portion's uops are written to the cache simultaneously, occupying 1-3 lines of one set, so L0m violates one of the basic principles of a set-associative cache: normally, only one line of a set can hit at a time. Here, the tags of up to three lines may hold the address of the same portion, differing only in sequence number. When a predicted address hits L0m, reading proceeds the same way: 1, 2 or 3 ways of the required set respond. True, this scheme carries a drawback.

If every portion of the running program decodes into 13-18 uops, taking 3 L0m lines per portion, the following happens: when the current set is already occupied by two 3-line portions and a third tries to squeeze in (one free line is not enough for it), one of the old ones must be evicted - and, given the associativity scheme, all 3 of its lines. Thus no more than two portions of such "small-instruction" code can fit in a set. Testing this assumption in practice gave the following: portions of large instructions, requiring fewer than 7 uops, fit into L0m 255 at a time (for some reason one more would not go in), covering almost 8 KB of code. Medium portions (7-12 uops) occupied all 128 possible positions (2 lines each), caching exactly 4 KB. And portions of small instructions fit in 66 portions, two more than the expected figure (2,112 bytes versus 2,048), apparently due to boundary effects in our test code. The shortfall is obvious: if all 256 6-uop lines could be filled completely, they would suffice for 85 full triples with a total code size of 2,720 bytes.

Perhaps Intel does not expect code to contain so many short, simple instructions that more than 2/3 of it lands in 3-line portions, which would evict each other from L0m sooner than necessary. And even if such code occurs, then, given how easy it is to decode, the remaining front-end blocks will cope with feeding the back end the required 4 uops per cycle (albeit without the promised savings in watts and in misprediction penalty cycles). Curiously, if L0m had 6 ways, the problem would not arise. Intel evidently decided that a cache one third larger, precisely thanks to associativity, matters more...

Dimensions

Recall that the idea of caching large numbers of uops instead of x86 instructions is not new. It first appeared in the Pentium 4 as the trace cache, which held sequences of uops after loop unrolling. Moreover, the trace cache did not supplement but replaced the missing L1I: instructions for the decoder were read straight from L2. Despite the oblivion of the NetBurst architecture, it is reasonable to assume Intel's engineers drew on past experience, albeit without loop unrolling and a dedicated predictor for the cache. Let's compare the old and new solutions (the new CPUs are here called Core i 2, since the numbers of almost all SB models start with a two):

[Comparison table: Pentium 4 trace cache vs. Core i 2 uop cache; values marked * are presumed.]

An explanation is needed here. First, the throughput figures for L0m reflect the overall pipeline width limit of 4 uops. Above, we assumed that L0m can write and read 18 uops per clock; when reading, however, all 18 (if the decoded portion produced that many) cannot be dispatched in one cycle, so dispatch takes several cycles.

Next, the size of a uop in bits is generally very delicate information that manufacturers either do not disclose at all, or only when pinned to the wall ("well, you've figured it all out anyway, so be it - we confirm"). For Intel CPUs the last known figure is 118 bits, for the Pentium Pro. Clearly the size has grown since then, but here the guesswork begins. 118 bits for a 32-bit x86 CPU can be obtained if a uop has fields for the address of the originating instruction (32 bits), an immediate operand (32 bits), an address offset (32 bits), register operands (3 x 3 bits, plus 2 bits for the index register's scale) and an opcode (11 bits, encoding the specific variant of the x86 instruction, prefixes included). With the addition of SSE and SSE2 the opcode field probably grew by 1 bit, whence the figure 119.

After the switch to x86-64 (Prescott onward), all the 32-bit fields should in theory have grown to 64 bits. But there are subtleties: 64-bit constants in x86-64 are allowed only one per instruction (so both constants of an instruction will never take more than 8 bytes), and the virtual address both then and now is 48 bits. So the uop only needs to grow by 16 address bits and 3 extra register-number bits (there are now 16 registers), giving (approximately) 138 bits. In SB the uop has apparently grown by another bit to cover the several hundred instructions added since the last P4, and by 8 more due to the increase in the maximum number of explicitly specified registers per instruction to 5 (with AVX). The latter, though, is doubtful: since the days of, imagine, the i386, not a single new instruction requiring at least 4 bytes of constants has been added to the x86 architecture (with the sole recent and extremely obscure exception of SSE4.a from AMD, which even most programmers do not know about). Since Intel's AVX and AMD's extensions updated the encoding of vector instructions only, the bits of the additional register numbers can live in the upper half of the partially unused (for these instructions) 32-bit immediate field. Indeed, in the x86 instruction itself the 4th or 5th register is encoded with just four bits of the constant.
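The arithmetic behind these estimates can be reproduced directly (the field widths are, as said above, educated guesses rather than disclosed figures):

```c
/* Reconstructing the estimated uop widths: 118 -> 119 -> 138 bits. */
#include <stdio.h>

int main(void)
{
    int p6   = 32 + 32 + 32 + 3*3 + 2 + 11; /* addr+imm+disp+regs+scale+opcode */
    int sse2 = p6 + 1;                      /* opcode field grew by 1 bit      */
    int x64  = sse2 + 16 + 3;               /* 48-bit addresses, 16 registers  */
    printf("P6: %d, +SSE2: %d, x86-64: %d bits\n", p6, sse2, x64);
    return 0;
}
```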

Obviously, storing and moving such "monsters" in any quantity is very expensive. Hence, back in the P4, Intel devised an abbreviated uop format with a single field for both constants; if they do not fit, the missing bits are placed in the same field of a neighboring uop. And if that neighbor already keeps its own constants there, a nop has to be inserted as the neighbor, as a donor of spare bits. This scheme's continuity shows in SB too: extra nops are not inserted, but instructions with 8-byte constants (or with a constant plus address offset totalling 5-8 bytes) take double space in L0m. Given the length of such instructions, though, no more than 4 of them fit in a portion anyway, so the cap on occupied uops is clearly not critical. Nevertheless, let us state it: SB, unlike previous CPUs, has as many as 3 uop formats - decoded (the fullest), stored in the uop cache (with shortened constants) and the basic one (without the index-register field), used further down the pipeline. Most uops, though, travel from decode to retirement untouched.

Restrictions

"Rules for using the cache" on the special format of mops does not end there. Obviously, such a convenient block as L0m could not be completely without restrictions of one degree or another, which we were not told about in the promotional materials. :) Let's start with the fact that all the mops of the translated command must fit in one line, otherwise they are carried over to the next. This is explained by the fact that the addresses of the line mops are stored separately (to save 48 bits in each mop), and all the mops generated by the command must correspond to the address of its first byte stored in the tag of only one line. To restore the original addresses, the lengths of the commands that generated the mops are stored in the tags. The "intolerance" of mops somewhat spoils the efficiency of using L0m, since occasional commands that generate several mops have a significant chance of not being able to fit into the next line.

Moreover, the uops of the most complex instructions are still stored in the microcode ROM; only the first 4 uops of the sequence, plus a link to the continuation, get into L0m, and together they occupy a whole line. It follows that a portion can contain no more than three microcoded instructions, and given average instruction sizes, two is the more realistic limit. In practice they occur far less often anyway.
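Putting the last two restrictions together, here is a toy packer in C: a line holds 6 uops, an instruction's uops may not straddle lines, and a microcoded instruction (more than 4 uops) gets a dedicated line. The instruction mix is invented; only the rules come from the text:

    #include <stdio.h>

    #define UOPS_PER_LINE 6

    int main(void) {
        /* uop counts of a hypothetical portion's instructions */
        int uops[] = {1, 2, 1, 5, 1, 1, 3};
        int n = sizeof uops / sizeof *uops;
        int lines = 0;
        int fill = UOPS_PER_LINE;        /* forces opening a line on first uop */

        for (int i = 0; i < n; i++) {
            if (uops[i] > 4) {           /* microcoded: 4 uops + ROM link,   */
                lines++;                 /* a whole dedicated line           */
                fill = UOPS_PER_LINE;    /* whatever follows opens a new one */
                continue;
            }
            if (fill + uops[i] > UOPS_PER_LINE) {  /* no straddling allowed */
                lines++;
                fill = 0;
            }
            fill += uops[i];
        }
        printf("lines used: %d\n", lines);   /* 3 for this mix */
        return 0;
    }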

Another important point: L0m has no TLB of its own. At first glance this should speed up address checks (the addresses here are purely virtual) and save power. But everything is much more interesting - it is not for nothing that all modern caches carry physical tags. The virtual address spaces of programs running under an OS can overlap, so on a task context switch a virtually addressed cache must be flushed, lest stale data or code be read at the same addresses (this is exactly what happened to the P4 trace cache). Its efficiency, of course, suffers badly. Some architectures use so-called ASIDs (address space identifiers) - unique numbers assigned by the OS to each thread. But x86 does not support ASIDs, deeming them unnecessary given the physical tags on all caches. Then L0m arrived and broke the picture. Remember, too, that the uop cache, like most core resources, is shared between two threads, so it holds uops from different programs. Add switching between virtual machines in the appropriate mode, and the uops of two programs may coincide in address. What to do?

The thread problem is solved simply: L0m is halved by sets, the thread number supplying the most significant bit of the set number. In addition, L1I maintains an inclusion policy relative to L0m: when code is evicted from L1I, its uops are removed from L0m, which requires checking two adjacent portions (the line size of all caches in modern CPUs, except L0m itself, is 64 bytes). Thus the virtual address of cached uops can always be verified against the L1I tags via its TLB; although L0m is virtually addressed, it effectively borrows physical tags for the code from L1I. There remain, however, situations in which L0m is flushed completely: a replacement in the L1I TLB, or a full reset of that TLB (including on CPU mode switches). In addition, L0m is disabled outright if the base address of the code selector (CS) is nonzero (which is extremely rare in modern operating systems).

Operation

The main secret of the uop cache is the algorithm that substitutes reads from L0m for the front end's work of turning instructions into uops. It starts like this: on each jump, bits 5-9 of the jump target's address select the L0m set (or bits 5-8 plus the thread number when 2 threads are active). The set's tags indicate the entry point into the portion whose uops are written in the line matching the tag, and the ordinal number of that line within the portion. 1-3 lines can match, and they are (most likely) read simultaneously into an 18-uop buffer. From there the uops are sent in fours to the IDQ until the exit point is reached - and everything repeats from the beginning. Moreover, when the last 1-3 uops of a portion remain unsent, they are dispatched together with the first 3-1 uops of the new portion, making up the usual four. That is, from the point of view of the IDQ receiving the uops, all jumps are smoothed into a uniform code stream - as in the P4, but without a trace cache.
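The set-selection step can be written down in a few lines. A sketch under exactly the assumptions just stated (32 sets, bits 5-9 of the virtual target address, top set bit replaced by the thread number under 2-threading); this is our reading of the description, not verified hardware logic:

    #include <stdint.h>

    unsigned l0m_set(uint64_t target, int two_threads, unsigned thread) {
        if (!two_threads)
            return (unsigned)(target >> 5) & 0x1F;   /* bits 5-9: 32 sets */
        /* bits 5-8 plus the thread number as the most significant bit */
        return (thread << 4) | ((unsigned)(target >> 5) & 0x0F);
    }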

And now an interesting point: no more than two branches are allowed per line, and if one of them is unconditional, it must be the last in the line. Our Attentive Reader will work out that a whole portion may thus hold up to 6 conditional branches (each of which may fire without being the exit point), or 5 conditional plus 1 unconditional as the portion's last instruction. The branch predictor in Intel CPUs is designed so that it does not notice a conditional branch until it fires at least once; only after that is its behavior predicted. But even "always-taken" branches are subject to the limit. In practice this means the execution of a portion's uops may legitimately end before its exit point.
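For clarity, the per-line branch rules as a checker function; the encoding of uop kinds is invented for the example:

    #include <stdbool.h>

    enum uop_kind { PLAIN, COND_BRANCH, UNCOND_BRANCH };

    /* Checks a line of n uops against the limits above: at most two
       branches, and an unconditional branch only in last position. */
    bool line_branches_ok(const enum uop_kind *kind, int n) {
        int branches = 0;
        for (int i = 0; i < n; i++) {
            if (kind[i] == PLAIN)
                continue;
            if (++branches > 2)
                return false;            /* third branch: illegal        */
            if (kind[i] == UNCOND_BRANCH && i != n - 1)
                return false;            /* unconditional must come last */
        }
        return true;
    }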

A similar trick with multiple entries, however, will not work: on a jump into an already cached portion but at a different offset (for example, when more than one unconditional branch leads into it), L0m registers a miss, wakes up the front end and writes the resulting uops into a new portion. That is, the cache permits copies of a portion with different entry points and the same, precisely known exit (plus several more possible ones). And when code is evicted from L1I, L0m deletes all lines whose entry points fall into any of the 64 bytes of the two affected portions. Copies, by the way, were also possible in the P4 trace cache, and they noticeably reduced its code-storage efficiency...

Such restrictions reduce the usable capacity of L0m. Let us try to calculate how much of it remains in practice. The average x86-64 instruction is 4 bytes long, and the average instruction produces 1.1 uops. A portion is therefore likely to consume 8-10 uops, i.e. 2 lines. As computed earlier, L0m can hold 128 such pairs, enough for 4 KB of code. Allowing for imperfect line utilization, the real figure is probably 3-3.5 KB. How does this fit into the overall balance of cache subsystem volumes?

  • L3 (actually the per-core share of the common cache, on average) - 2 MB;
  • L2 - 256 KB, 8 times less;
  • both L1 - 32 KB each, 8 times less;
  • the cached volume in L0m is about 10 times less.

Curiously, if we look for another structure in the core that stores many instructions or uops, it turns out to be the dispatcher's ROB queue, which holds 168 uops - generated by roughly 650-700 bytes of code, i.e. 5 times less than the effective equivalent volume of L0m (3-3.5 KB) and 9 times less than the full one (6 KB). The uop cache thus completes a neat hierarchy of code stores with different but well-balanced parameters. Intel claims that on average 80% of fetches hit L0m. That is noticeably lower than the 98-99% of the 32 KB L1I, but still: in four cases out of five, the uop cache justifies its presence.
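The whole chain of estimates in the last two paragraphs fits into a dozen lines of C; all inputs are the averages assumed above, so the outputs are the same rough figures, not measurements:

    #include <stdio.h>

    int main(void) {
        double insn_bytes = 4.0;     /* average x86-64 instruction length */
        double uops_per_insn = 1.1;  /* average uops per instruction      */
        int portion = 32;            /* bytes per portion                 */
        int line_uops = 6, lines = 256;

        double portion_uops = portion / insn_bytes * uops_per_insn;  /* 8.8 */
        int lines_per_portion =
            (int)(portion_uops + line_uops - 1) / line_uops;         /* 2   */

        double code_kb = (double)(lines / lines_per_portion) * portion / 1024;
        printf("ideal cached code: %.1f KB\n", code_kb);             /* 4.0 */

        /* ROB: 168 uops converted back to code bytes. This crude average
           gives ~611; the article's 650-700 allows for fused uops. */
        printf("ROB covers: ~%.0f bytes\n", 168 / uops_per_insn * insn_bytes);
        return 0;
    }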

Power management was examined in detail in our earlier review (note, though, that support for the C6 deep-sleep state and LV-DDR3 low-voltage memory appeared only in Westmere). What's new in SB?

First, a second type of temperature sensor. The familiar thermal diode, whose readings are "seen" by the BIOS and utilities, measures temperature to adjust fan speed and protect against overheating (frequency throttling and, if that fails, emergency CPU shutdown). But it occupies a lot of area, so there is only one per core (including the GPU) and one in the system agent. SB adds to them several compact analog circuits with thermal transistors in each large block. They have a narrower measurement range (80-100 °C), but they refine the thermal diode's data and build an accurate heat map of the die, without which the new TB 2.0 functions could not be realized. What's more, the power controller can even use an external sensor, if the motherboard maker places and connects one - though how exactly that would help is unclear.

A C-state demotion function has been added, for which the history of transitions between states is tracked per core. The deeper the "sleep number" a core enters, the longer the transitions in and out take. The controller judges whether putting the core to sleep makes sense, given the likelihood of its "awakening". If a wakeup is expected soon, then instead of the state requested by the OS the core is placed into C3 or C1, respectively - a more active state from which it returns to work faster. Oddly enough, despite the higher power draw in such a sleep, the overall savings need not suffer, since both transition periods, during which the processor does not sleep at all, are shortened.
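In pseudo-C, the demotion heuristic might look like this; the state values, thresholds and names are invented for illustration, since Intel does not publish the PCU's actual algorithm:

    typedef enum { C1 = 1, C3 = 3, C6 = 6 } cstate;

    /* The deeper the requested state, the more it costs to enter and
       leave, so a core expected to wake soon gets a shallower state
       than the OS asked for. */
    cstate grant_cstate(cstate requested, unsigned predicted_idle_us) {
        if (requested < C6)
            return requested;          /* shallow request: grant as-is   */
        if (predicted_idle_us < 50)
            return C1;                 /* wake imminent: stay responsive */
        if (predicted_idle_us < 500)
            return C3;                 /* wake soon: middle ground       */
        return C6;                     /* long sleep pays for itself     */
    }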

In mobile models, putting all cores into C6 causes the L3 cache to be flushed and powered off via power gates shared per bank. This further cuts idle consumption, but costs extra latency on wakeup, since the cores will take several hundred or thousand misses in L3 while the needed data and code are pumped back in. Obviously, in combination with the previous function, this happens only when the controller is quite sure the CPU is falling asleep for a long time (by processor-time standards).

The previous-generation Core i3/i5 were record holders of sorts in the complexity of the CPU power circuitry on the motherboard, requiring as many as 6 voltages - more precisely, all 6 existed before as well, but not all of them led to the processor. In SB they changed not in number but in usage:

  • x86 cores and L3 - 0.65-1.05 V (in Nehalem, L3 was powered separately);
  • GPU - the same (in Nehalem, almost the entire north bridge - which, recall, was the second CPU die there - was fed from a common rail);
  • system agent, whose frequency is fixed - a constant 0.8, 0.9 or 0.925 V (the first two options are for mobile models), or dynamically adjustable 0.879-0.971 V;
  • - constant 1.8 V or adjustable 1.71-1.89 V;
  • memory bus driver - 1.5 V or 1.425-1.575 V;
  • PCIe driver - 1.05V.

The adjustable versions of the power rails are used in the unlocked SB models with the K suffix. In desktop models the idle frequency of the x86 cores has been raised from 1.3 to 1.6 GHz, apparently without sacrificing savings: a 4-core CPU at full idle draws 3.5-4 W. Mobile versions idle at 800 MHz and ask for even less.

Models and Chipsets

Performance

What is this chapter doing in a theoretical overview of a microarchitecture? Well, there is one generally recognized test that has been used for 20 years (in various versions) to assess not the theoretical but the programmatically achievable speed of computers - SPEC CPU. It evaluates processor performance comprehensively, and under the best conditions for it - the test sources are compiled and optimized for the system under test (so the compiler and its libraries are checked in passing, too). Real programs could then run faster only with hand-written assembly inserts, which these days come from the rare daredevil programmer with plenty of time to spare. SPEC can be classed as semi-synthetic: it computes nothing useful and yields no specific figures (IPC, flops, timings, etc.) - the "parrots" of one CPU are meaningful only in comparison with others.

Intel usually publishes results for its CPUs almost simultaneously with their release. But SB suffered an inexplicable 3-month delay, and the figures obtained in March are still preliminary. What exactly is holding them back is unclear, but this is still better than the situation at AMD, which has not released official results for its latest CPUs at all. The Opteron figures below come from server manufacturers using an Intel compiler, so they may be under-optimized: one can only guess what Intel's toolchain does to code running on a "foreign" CPU. ;)


Comparison of systems in SPEC CPU2006 tests. Table compiled by David Kanter, March 2011.

Compared to previous CPUs, SB shows excellent results in absolute terms and utterly record-breaking ones per core and per gigahertz. Enabling HT and adding 2 MB of L3 gives +3% to floating-point speed and +15% to integer. Yet the 2-core model has the highest specific speed, and that is an instructive observation: Intel evidently used AVX, but since no integer gain can come from it yet, only a sharp acceleration of the floating-point figures could be expected. Even there no jump appears, as the comparison of 4-core models shows - and the i3-2120 results reveal why: with the same 2 IMC channels, each of its cores gets twice the bandwidth, reflected in a 34% rise in specific floating-point speed. Apparently the 6-8 MB L3 is too small, and scaling its internal bandwidth via the ring bus does not help. It is now clear why Intel plans to equip server Xeons with 3- and even 4-channel IMCs. Except that the 8 cores there will hardly be enough to exploit them fully...

Update: the final SB results have appeared - the numbers (as expected) have grown slightly, but the qualitative conclusions stand.

Prospects and results

Quite a lot is already known about Ivy Bridge, the 22 nm successor to Sandy Bridge due in spring 2012. The general-purpose cores will support a slightly updated AES-NI instruction set; "free" register copying at the rename stage is also quite possible. No Turbo Boost improvements are expected, but the GPU (which, by the way, will be enabled with every chipset version) will raise its maximum number of execution units to 16, support three displays instead of two, finally gain proper OpenCL 1.1 support (along with DirectX 11 and OpenGL 3.1), and improve its hardware video processing. Most likely, the IMC will support 1600 MHz memory even in desktop and mobile models, and the PCIe controller will support bus version 3.0. The main technological novelty is that the L3 cache will use (for the first time in mass microelectronic production!) transistors with a vertical multi-sided fin gate (FinFET), with radically improved electrical characteristics (details in one of the upcoming articles). Rumor has it that versions with a GPU will again become multi-die - only this time one or more fast video-memory dies will be added to the processor.

Ivy Bridge will pair with new 70-series chipsets (i.e. south bridges): Z77, Z75 and H77 for home (replacing Z68/P67/H67) and Q77, Q75 and B75 for office (replacing Q67/Q65/B65). The chipset (that is, the same physical chip under different names) will still have no more than two SATA 3.0 ports, but USB 3.0 support will finally appear - a year later than at the competitor. Native PCI support will disappear (after 19 years, the bus has earned its rest), while the disk controller in the Z77 and Q77 gains Smart Response technology, which boosts performance by caching drives on an SSD. The most exciting news, however, is that, contrary to good old tradition, desktop Ivy Bridge will not only use the same LGA1155 socket as SB but will also be backward compatible with it - that is, today's boards will accept the new CPUs as well.

As for enthusiasts, a much more powerful X79 chipset will be ready as early as Q4 of this year (for the 4-8-core SB-E in the "server-extreme" LGA2011 socket). It will not yet have USB 3.0, but 10 of its 14 SATA ports will support SATA 3.0 (plus 4 types of RAID), and 4 of its 8 PCIe lanes can be connected to the CPU in parallel with DMI, doubling the CPU-chipset bandwidth. Unfortunately, the X79 will not be paired with an 8-core Ivy Bridge.

As an exception (and perhaps a new rule), we will not give a list of what we would like to improve and fix in Sandy Bridge. It is already obvious that any change is a complex compromise - strictly by the law of conservation of matter (in Lomonosov's formulation): if something is added somewhere, the same amount is taken away somewhere else. If Intel rushed to correct old mistakes in every new architecture, the amount of felled timber and flying chips might well exceed the benefit. Instead of extremes and an unattainable ideal, it is economically wiser to seek a balance between constantly changing and sometimes contradictory requirements.

Despite a few blemishes, the new architecture should not merely shine brightly (which, judging by the tests, it does) but outshine all previous ones - its own and its rival's alike. The declared goals for performance and economy have been met, except for optimization for the AVX set, which is about to appear in new versions of popular programs. And Gordon Moore will once again marvel at his own foresight. Intel, it seems, is fully armed for the Epic Battle of architectures that we will witness this year.

Our thanks go to:

  • Maxim Loktyukhin, that very "Intel representative", an engineer in the software-and-hardware optimization department - for answering numerous clarifying questions.
  • Mark Buxton, lead software engineer and head of optimization - for his answers, and for the very possibility of getting any kind of official response.
  • Agner Fog, programmer and processor researcher - for his independent low-level testing of SB, which revealed much that is new and mysterious.
  • The Attentive Reader - for attentiveness, perseverance and loud snoring.
  • The furious fans of the Opposite Camp - for good measure.

Finally, Intel has officially announced new processors built on the new Sandy Bridge microarchitecture. For most people, "the Sandy Bridge announcement" is just words, but by and large the second-generation Intel Core is, if not a new era, then at least a refresh of almost the entire processor market.


Initially the launch of only seven processors was reported, but information on all the new products has already appeared on the ever-useful page ark.intel.com. There turned out to be a few more processors, or rather modifications of them (in parentheses I give the approximate price - what each processor costs in a batch of 1000):

Mobile:

Intel Core i5-2510E (~ $ 266)
Intel Core i5-2520M
Intel Core i5-2537M
Intel Core i5-2540M

A side-by-side, detailed comparison of the second generation Intel Core i5 mobile processors.

Intel Core i7-2617M
Intel Core i7-2620M
Intel Core i7-2629M
Intel Core i7-2649M
Intel Core i7-2657M
Intel Core i7-2710QE (~ $ 378)
Intel Core i7-2720QM
Intel Core i7-2820QM
Intel Core i7-2920XM Extreme Edition

A side-by-side, detailed comparison of the second generation Intel Core i7 mobile processors.

Desktop:

Intel Core i3-2100 (~ $ 117)
Intel Core i3-2100T
Intel Core i3-2120 ($ 138)

A side-by-side detailed comparison of second generation Intel Core i3 desktop processors.

Intel Core i5-2300 (~ $ 177)
Intel Core i5-2390T
Intel Core i5-2400S
Intel Core i5-2400 (~ $ 184)
Intel Core i5-2500K (~ $ 216)
Intel Core i5-2500T
Intel Core i5-2500S
Intel Core i5-2500 (~ $ 205)

A side-by-side, detailed comparison of second generation Intel Core i5 desktop processors.

Intel Core i7-2600K (~ $ 317)
Intel Core i7-2600S
Intel Core i7-2600 (~ $ 294)

A side-by-side detailed comparison of second generation Intel Core i7 desktop processors.

As you can see, model names now carry four digits - done to avoid confusion with previous-generation processors. The lineup turned out quite complete and logical: the most interesting i7 series is clearly separated from the i5 by the presence of Hyper-Threading and a larger cache, while the i3 family differs from the i5 not only in having fewer cores but also in lacking Turbo Boost.

You have probably also noticed the letters in the processor names, without which the lineup would thin out considerably. The letters S and T indicate lower power consumption, and K an unlocked multiplier.

Visual structure of new processors:

As you can see, in addition to the graphics and compute cores, cache memory and memory controller, there is the so-called System Agent - a catch-all for, among other things, the DDR3 memory and PCI Express 2.0 controllers, the power management unit, and the blocks responsible at the hardware level for running the integrated GPU and for display output when it is used.

All "core" components (including the graphics processor) are interconnected by a high-speed ring bus with full access to the L3 cache, which increases the overall data exchange rate in the processor itself; interestingly, this approach allows you to increase performance in the future, simply by increasing the number of cores added to the bus. Although even now everything promises to be at its best - compared to the previous generation processors, the performance of the new ones is more adaptive and, according to the manufacturer, in many tasks it is able to demonstrate a 30-50% increase in the speed of task execution!

If you want to learn more about the new architecture, there are three good articles on it in Russian that I can recommend.

The new processors are manufactured entirely on the 32 nm process and are the first with a "visibly smart" microarchitecture combining best-in-class computing power and 3D graphics processing on a single chip. There really are many innovations in Sandy Bridge graphics, aimed mainly at higher performance in 3D. One can argue at length about having an integrated video system "imposed" on you, but no alternative as such exists. And there is this slide from the official presentation, which claims plausibility, including for mobile products (laptops):

I have already covered the new technologies of the second-generation Intel Core processors, so I will not repeat myself. I will dwell only on Intel Insider, whose appearance surprised many. As I understand it, this will be a kind of store giving computer owners access to high-definition films straight from their creators - content that previously appeared only some time after the DVD or Blu-ray release. To demonstrate the feature, Intel VP Mooly Eden invited Kevin Tsujihara, President of Warner Home Entertainment Group, on stage. I quote:

"Warner Bros. considers personal systems the most versatile and widespread platform for delivering high-quality entertainment content, and Intel is now making that platform even more reliable and secure. From now on, through the WBShop store and partners such as CinemaNow, we will be able to offer PC users new releases and catalog films in true HD quality." Mooly Eden then demonstrated the technology on the film "Inception". In collaboration with the industry's leading studios and media giants (Best Buy CinemaNow, Hungama Digital Media Entertainment, Image Entertainment, Sonic Solutions, Warner Bros. Digital Distribution and others), Intel is building a secure, hardware-protected, piracy-free ecosystem for the distribution, storage and playback of high-quality video.

This technology will work alongside two equally interesting developments also present in all models of the new-generation processors: Intel Wireless Display (Intel WiDi 2.0) and Intel InTru 3D. The first is designed for wireless transmission of HD video (with resolutions up to 1080p), the second for displaying stereo content on high-definition monitors or TVs over HDMI 1.4.

Two more features for which I found no better place in the article. The first is Intel Advanced Vector Extensions (AVX): the processors support these instructions to speed up data-intensive applications such as audio editors and professional photo-editing software.

... and the second is Intel Quick Sync Video: through collaboration with software companies such as CyberLink, Corel and ArcSoft, the processor giant has raised performance in this task (transcoding between H.264 and MPEG-2) to 17 times that of previous-generation integrated graphics.

Suppose the processors exist - what do you run them in? Right: alongside them, new chipsets (logic sets) of the "sixties" series were announced. For ordinary consumers there are apparently just two, Intel H67 and Intel P67, on which most new motherboards will be built. The H67 can work with the video core integrated into the processor, while the P67 offers the Performance Tuning function for overclocking. All processors use the new socket, LGA1155.


It is encouraging that the new processors apparently retain socket compatibility with Intel's next-generation architecture. This is a plus both for ordinary users and for manufacturers, who will not have to redesign and build new boards.

In total, Intel unveiled more than 20 chips, chipsets and wireless adapters, including the new Intel Core i7, i5 and i3 processors, Intel 6 Series chipsets, and Intel Centrino Wi-Fi and WiMAX adapters. Besides those mentioned above, the following "badges" may appear on the market:

More than 500 models of desktop computers and laptops from the world's leading brands are expected on the new processors this year.

And finally, once more, the awesome video, in case anyone has not seen it yet:
