
Physical and logical organization of computer systems memory. Principles of memory management of a computing system

Memory organization of the MPS. Memory segmentation. Address calculation. Internal cache memory.

The memory of a microprocessor system performs the function of temporary or permanent storage of data and commands. The amount of memory determines the permissible complexity of the algorithms executed by the system and, to some extent, the speed of the system as a whole. Memory modules are built on memory chips (RAM or ROM). Increasingly, flash memory is used in microprocessor systems: a non-volatile memory whose contents can be rewritten many times.

To connect the memory module to the system bus, interface blocks are used, which include an address decoder (selector), a bus control signal processing circuit, and data buffers (Figure 7.4.1).

Figure 7.4.1. Memory module connection diagram.

In the memory space of a microprocessor system, there are usually several areas that perform dedicated functions. These include:

– boot program memory, implemented in ROM or flash memory;

– stack memory (Stack) – a part of RAM intended for temporary data storage;

– a table of interrupt vectors containing the start addresses of interrupt handling routines;

– memory of devices connected to the system bus.

All other parts of the memory space, as a rule, have a universal purpose. They can contain both data and programs (in the case of a single-bus architecture, of course).

Often the memory space is divided into segments with a programmatically changeable segment start address and a fixed segment size. For example, in the Intel 8086 processor, memory segmentation is organized as follows.

The entire memory of the system is represented not as a continuous space but as several pieces – segments of a given size (64 KB each), whose position in the memory space can be changed programmatically.

To store memory addresses, pairs of registers are used rather than single registers:

The segment register determines the address of the beginning of the segment (that is, the position of the segment in memory);

The pointer register (offset register) determines the position of the working address within the segment.

In this case, the physical 20-bit memory address placed on the external address bus is formed as shown in Figure 7.4.2, that is, by adding the offset and the segment address shifted left by 4 bits.

Figure 7.4.2. Formation of the physical memory address from the segment address and offset.

The position of this address in memory is shown in Figure 7.4.3.

Figure 7.4.3. Location of physical address in memory

A segment can only start at a 16-byte memory boundary (since the segment start address effectively has four low-order zero bits, as seen in Figure 7.4.2), that is, at an address that is a multiple of 16. These allowed segment boundaries are called paragraph boundaries.

Note that segmentation was introduced primarily because the internal registers of the processor are 16-bit while the physical memory address is 20-bit (a 16-bit address allows addressing only 64 KB of memory, which is clearly not enough).
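As an illustration of this address arithmetic (a minimal sketch; the segment and offset values are made up for the example), the following C fragment reproduces the rule of Figure 7.4.2: shift the 16-bit segment value left by 4 bits, add the 16-bit offset, and keep a 20-bit result.

#include <stdio.h>
#include <stdint.h>

/* 8086 real-mode address: physical = (segment << 4) + offset, truncated to 20 bits. */
static uint32_t phys_addr(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + offset) & 0xFFFFF;
}

int main(void)
{
    uint16_t seg = 0x1234, off = 0x0010;                          /* illustrative values */
    printf("%04X:%04X -> %05X\n", seg, off, phys_addr(seg, off)); /* 1234:0010 -> 12350  */
    return 0;
}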

The cache is located between the main memory (RAM, also referred to below as the OP) and the central processing unit (CPU) to reduce the time the CPU spends accessing the OP.

The idea of cache memory is based on predicting the most likely CPU accesses to the RAM. The most "probable" data and instructions are copied into the fast cache, whose speed matches that of the CPU, before they are actually used, so that the data and instructions currently in use can be accessed quickly, without going to the RAM. This approach relies on the principle of program locality or, as it is also called, the nested nature of accesses, meaning that the addresses of successive accesses to the OP form, as a rule, a compact group. When the OP is accessed, it is not individual data items that are copied into the cache but blocks of information, including the data most likely to be used by the CPU in the following steps of its work. As a result, subsequent instructions are fetched by the CPU not from the RAM but from the fast cache memory. When the CPU needs to read or write some data in the RAM, it first checks for its presence in the cache. The efficiency of the cache system depends on the block size and on the program's algorithm.
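A rough model of this lookup is sketched below: a direct-mapped cache checks whether the block containing the requested address is already present before going to the OP. The line size, number of lines and function names are assumptions made only for the illustration.

#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 64                  /* bytes per cache line (assumed)          */
#define NUM_LINES 256                 /* number of lines in the cache (assumed)  */

typedef struct {
    bool     valid;
    uint32_t tag;                     /* identifies which memory block is stored */
} cache_line_t;

static cache_line_t cache[NUM_LINES];

/* Returns true on a cache hit; on a miss the whole block would be copied
   from main memory into the line before the access completes. */
static bool cache_lookup(uint32_t addr)
{
    uint32_t block = addr / LINE_SIZE;       /* block number in main memory  */
    uint32_t index = block % NUM_LINES;      /* cache line the block maps to */
    uint32_t tag   = block / NUM_LINES;      /* remaining high-order bits    */

    if (cache[index].valid && cache[index].tag == tag)
        return true;                          /* data already in the cache   */

    cache[index].valid = true;                /* simulate filling the line   */
    cache[index].tag   = tag;
    return false;                             /* miss: main memory accessed  */
}

Because whole blocks are copied, a miss at one address makes subsequent accesses to neighbouring addresses of the same 64-byte block hit, which is exactly the locality effect described above.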

Main memory

Main memory is a storage device directly connected to the processor and designed to store executable programs and the data directly involved in operations. It has sufficient speed but limited capacity. Main memory is divided into various types, the main ones being random access memory (RAM) and read-only memory (ROM) (Fig. 1).

RAM is designed to store information (programs and data) directly involved in the computing process at the current stage of operation.

RAM is used for receiving, storing and issuing information. It is from RAM that the processor "takes" programs and initial data for processing, and it is to RAM that it writes the results. The name "RAM" is given to this memory because it works very quickly, so that the processor hardly has to wait when reading data from memory or writing to it. However, the data it contains is preserved only while the computer is turned on. When the computer is turned off, the contents of RAM are erased, so RAM is volatile memory.

Fig. 1. Main types of main memory

The designation RAM (random access memory) is often used for this memory. Random access means the ability to access any (arbitrary) memory cell directly, with the same access time for every cell.

RAM is built on large integrated circuits containing matrices of semiconductor storage elements (flip-flops). The storage elements are located at the intersections of the vertical and horizontal lines of the matrix; information is written and read by applying electrical pulses through those lines of the matrix that are connected to the elements belonging to the selected memory cell.

The amount of RAM installed in the computer determines not only the ability to work with resource-intensive programs but also the computer's performance, since when memory runs short the hard disk is used as its logical extension, and its access time is incomparably longer. Besides the amount of RAM, computer performance is also affected by its speed and by the method of data exchange between the microprocessor and memory.

The OP is implemented on DRAM chips (dynamic RAM), which, compared with other memory types, is characterized by low cost and high specific capacity, but also by high power consumption and lower speed. Each information bit (0 or 1) in DRAM is stored as a charge on a capacitor. Because of leakage currents, the capacitor charge must be refreshed at certain intervals; due to this continuous need for refreshing, such memory is called dynamic. Refreshing the contents of the memory requires extra time, and writing information into the memory during refresh is not allowed.

The cost of RAM has recently fallen sharply (from the summer of 1995 to the summer of 1996, by more than a factor of four), so the large RAM demands of many programs and operating systems have become less burdensome from a financial point of view.

To speed up access to RAM on high-speed computers, an ultra-fast cache memory is used, which sits, as it were, "between" the microprocessor and RAM and stores copies of the most frequently used areas of RAM. When the microprocessor accesses memory, the required data is first looked for in the cache. Since the access time of the cache is several times shorter than that of ordinary memory, and in most cases the data needed by the microprocessor is already in the cache, the average memory access time decreases. Cache memory is implemented on SRAM (static RAM) chips.
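The gain can be estimated with the usual average-access-time formula; the hit ratio and latencies below are illustrative assumptions, not measured values.

#include <stdio.h>

int main(void)
{
    double t_cache = 10.0;   /* cache access time, ns (assumed)          */
    double t_ram   = 70.0;   /* main memory access time, ns (assumed)    */
    double h       = 0.9;    /* fraction of accesses served by the cache */

    /* average access time = h * t_cache + (1 - h) * t_ram */
    double t_avg = h * t_cache + (1.0 - h) * t_ram;
    printf("average access time = %.1f ns\n", t_avg);   /* 16.0 ns */
    return 0;
}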

For computers based on the Intel 386DX or 80386SX processors, a cache size of 64 KB is sufficient, and 128 KB is better. Computers based on the Intel 80486DX, DX2, DX4 and Pentium are typically equipped with 256 KB of cache memory.

The 486 and Pentium series microprocessors contain a small amount of internal cache memory, so for terminological clarity the cache memory located on the system board is sometimes referred to in the technical literature as second-level cache.

In the Pentium Pro microprocessor, the second-level cache memory is contained in a single package with the processor itself (it can be said that it is built into the microprocessor).

Not all of the memory needs to hold information that changes. Some of the most important information is better kept permanently in the computer's memory. Such memory is called permanent memory. Its data is written in during manufacture; as a rule, it cannot be changed, and programs running on the computer can only read it. This type of memory is usually called ROM (read-only memory).

An IBM PC-compatible computer keeps in permanent memory the programs for checking the computer's hardware, initiating the loading of the operating system (OS), and performing basic functions for servicing the computer's devices. Since most of these programs deal with input/output services, the contents of permanent memory are often referred to as the BIOS (Basic Input/Output System).

Many computers have a BIOS based on FLASH memory. Such memory can be modified by programs, which makes it possible to update the BIOS with special utilities without replacing the motherboard or the BIOS chip.

All but very old computers also include a configuration program (SETUP) in the BIOS. It allows you to set some characteristics of the computer's devices (types of video controller, hard disks and floppy disk drives, sometimes also RAM operating modes, a password prompt at boot, etc.). Typically, the configuration program is invoked by pressing a certain key or key combination (most commonly the Del key) during boot-up.

FLASH memory has a capacity from 32 KB to 2 MB, a read access time of 0.06 µs and a write time of approximately 10 µs per byte; FLASH memory is non-volatile.

In addition to regular RAM and permanent memory, the computer has a small area of memory for storing the computer's configuration settings. It is often referred to as CMOS memory, because this memory is usually implemented using CMOS (complementary metal-oxide semiconductor) technology, which has low power consumption. The contents of the CMOS memory are not lost when the computer is powered off, because it is powered by a dedicated battery.

Thus, the main memory consists of millions of individual memory cells with a capacity of 1 byte each. The total capacity of the main memory of modern PCs usually ranges from 1 to 4 GB. The capacity of RAM is orders of magnitude greater than the capacity of ROM: ROM occupies up to 2 MB on new motherboards, and the rest is RAM.

Topic 3.1 Organization of computations in computing systems

Purpose and characteristics of computing systems (CS). Organization of computations in computing systems. Parallel-action computers; the concepts of instruction stream and data stream. Associative systems. Matrix systems. The computing pipeline. Instruction pipeline, data pipeline. Superscalar processing.

The student must

know:

The concept of command flow;

The concept of data flow;

Types of computing systems;

Architectural features of computing systems

Computing systems

A computing system (CS) is a set of interconnected and interacting processors or computers, peripheral equipment and software, designed to collect, store, process and distribute information.

The creation of computing systems pursues the following main goals:

Improving system performance by speeding up data processing;

Improving the reliability and validity of computations;

Providing the user with additional services, etc.

Topic 3.2

Classification of computing systems depending on the number of instruction and data streams: OKOD (SISD), OKMD (SIMD), MKOD (MISD), MKMD (MIMD).

Classification of multiprocessor computer systems with different ways of implementing shared memory: UMA, NUMA, COMA. Comparative characteristics, hardware and software features.

Classification of multi-machine computing systems: MPP, NOW and COW. Purpose, characteristics, features.

Examples of computing systems of various types. Advantages and disadvantages of various types of computing systems.

Classification of computing systems

A distinctive feature of a CS compared with classical computers is the presence of several computing units that carry out parallel processing.

The parallel execution of operations significantly increases the speed of the system; it can also significantly increase reliability (if one component of the system fails, another can take over its function) as well as the validity of the results, if operations are duplicated and the results compared.

Computing systems can be divided into two groups:

· multi-machine ;

· multiprocessor .

A multi-machine computing system consists of several individual computers. Each computer in a multi-machine system has a classical architecture, and such systems are used quite widely. However, the benefit of such a computing system can only be obtained when solving a problem with a special structure: it must be divisible into as many loosely coupled subtasks as there are computers in the system.

Multiprocessor architecture implies the presence of several processors in the computer, so many data streams and many instruction streams can be organized in parallel. Thus, several fragments of one task can be executed simultaneously. The speed advantage of multiprocessor computing systems over single-processor ones is obvious.

The disadvantage is the possibility of conflict situations when multiple processors access the same memory area.

A feature of multiprocessor computing systems is the presence of a common RAM as a shared resource (Figure 11).

Figure 11 - Architecture of a multiprocessor computing system

Flynn classification

Among all the proposed CS classification schemes, the one put forward in 1966 by M. Flynn is the most widely used. It is based on the concept of a stream, understood as a sequence of instruction or data elements processed by the processor. Depending on the number of instruction streams and data streams, Flynn distinguishes four classes of architectures:

· OKOD (SISD) – single instruction stream, single data stream. This class includes classical von Neumann machines. Pipelining is irrelevant here, so both the CDC 6600 with scalar functional units and the CDC 7600 with pipelined ones fall into the SISD class.

· MKOD (MISD) – multiple instruction streams, single data stream. In this architecture, multiple processors process the same data stream. An example would be a CS whose processors receive the same distorted signal, each processing it with its own filtering algorithm. However, neither Flynn nor other computer architects have yet been able to point to a real-life CS built on this principle. A number of researchers assign pipeline systems to this class, but this has not found final recognition. An empty class should not be considered a shortcoming of Flynn's classification: such classes can become useful when new concepts are developed in the theory and practice of building computing systems.

· OKMD (SIMD, single instruction – multiple data) – one instruction stream, many data streams: instructions are issued by one control processor and executed simultaneously on all processing processors, each working on its own local data.

· MKMD (MIMD, multiple instruction – multiple data) – many instruction streams, many data streams: a set of computers, each working on its own program with its own initial data.

Flynn's classification scheme is the most common starting point for assessing a CS, since it immediately reveals the basic operating principle of the system. However, it also has obvious drawbacks: for example, the impossibility of unambiguously assigning some architectures to a particular class. Another drawback is the excessive crowding of the MIMD class.

Existing computing systems of the MIMD class form three subclasses: symmetric multiprocessors (SMP), clusters, and massively parallel systems (MPP). This classification is based on a structural-functional approach.

Symmetric multiprocessors consist of a set of processors that have identical access to memory and external devices and operate under the same operating system (OS). A special case of SMP is a single-processor computer. All SMP processors share a common memory with a single address space.

Using SMP provides the following possibilities:

• application scaling at low initial cost, by running applications without modification on new, more productive hardware;

• creation of applications in familiar software environments;

• the same access time to all of memory;

• the ability to exchange messages with high bandwidth;

• support for coherence of caches and main memory blocks, and for indivisible synchronization and locking operations.

A cluster system is formed from modules connected by a communication system or by shared external memory devices, such as disk arrays.

The cluster size varies from a few modules to several tens of modules.

Within both shared and distributed memory, several models of memory system architectures are implemented. Figure 12 shows the classification of such models used in computing systems of the MIMD class (it is also true for the SIMD class).

Figure 12 - Classification of memory architecture models of computing systems

In shared-memory systems, all processors have equal access to the same address space. The shared memory can be built as a single block or as a set of modules, but the second option is the usual practice.

Computing systems with shared memory in which any processor's access to the memory is performed uniformly and takes the same time are called systems with uniform memory access, abbreviated UMA (Uniform Memory Access). This is the most common memory architecture of parallel shared-memory CS.

Technically, UMA systems assume the presence of a node connecting each of the n processors with each of the m memory modules. The simplest way to build such a CS is to combine several processors (Pi) with a single memory (MP) by means of a common bus, as shown in Figure 13a. In this case, however, only one of the processors can use the bus at any time, so the processors must compete for access to it. When processor Pi fetches an instruction from memory, the other processors Pj (i ≠ j) must wait until the bus is free. If there are only two processors in the system, they are able to operate at close to maximum performance, because their bus accesses can be interleaved: while one processor is decoding and executing an instruction, the other is free to use the bus to fetch its next instruction from memory. However, when a third processor is added, performance begins to drop. With ten processors on the bus, the throughput curve (Figure 13b) becomes horizontal, so adding an eleventh processor no longer improves performance. The lower curve in this figure illustrates the fact that the memory and the bus have a fixed bandwidth, determined by the memory cycle time and the bus protocol, and in a shared-bus multiprocessor system this bandwidth is divided among several processors. If the processor cycle were longer than the memory cycle, many processors could be connected to the bus. In fact, however, the processor is usually much faster than the memory, so this scheme is not widely used.
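The flattening of the curve in Figure 13b can be reproduced with a back-of-the-envelope model: the bus delivers a fixed number of words per second, so total throughput grows with the number of processors only until that ceiling is reached. The bandwidth figures below are assumptions chosen purely for illustration.

#include <stdio.h>

int main(void)
{
    double bus_words_per_s  = 100e6;  /* fixed bus/memory bandwidth (assumed) */
    double proc_words_per_s = 10e6;   /* demand of one processor (assumed)    */

    for (int n = 1; n <= 16; n++) {
        /* total throughput is the smaller of the combined demand of n
           processors and the fixed bus ceiling */
        double demand     = n * proc_words_per_s;
        double throughput = demand < bus_words_per_s ? demand : bus_words_per_s;
        printf("processors = %2d  throughput = %5.1f Mword/s\n", n, throughput / 1e6);
    }
    return 0;   /* beyond 10 processors the printed throughput stops growing */
}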

An alternative way of building a multiprocessor shared-memory CS based on UMA is shown in Figure 13c. Here the bus has been replaced by a switch that routes processor requests to one of several memory modules. Even though there are multiple memory modules, they all belong to a single virtual address space. The advantage of this approach is that the switch can serve several requests in parallel. Each processor can be connected to its own memory module and access it at the maximum allowed speed. Contention between processors can occur when they try to access the same memory module at the same time; in that case only one processor gets access and the others are blocked.

Unfortunately, the UMA architecture does not scale well. The most common systems contain 4-8 processors, much more rarely 32-64 processors. In addition, such systems cannot be considered fault-tolerant, since the failure of one processor or memory module brings down the entire system.

Figure 13 - Shared memory:

a) combining processors using a bus and a system with local caches;

b) system performance as a function of the number of processors on the bus;

c) a multiprocessor CS with a shared memory consisting of individual modules

Another approach to building a shared-memory CS is non-uniform memory access, referred to as NUMA (Non-Uniform Memory Access). Here there is still a single address space, but each processor has local memory. The processor accesses its own local memory directly, which is much faster than accessing remote memory through a switch or network. Such a system can be supplemented with global memory, in which case the local storage devices play the role of a fast cache for the global memory. Such a scheme can improve the performance of the CS, but it cannot postpone indefinitely the flattening of the performance curve. If each processor has a local cache (Figure 13a), there is a high probability (p > 0.9) that the required instruction or data item is already in local memory. A reasonable probability of hitting local memory significantly reduces the number of processor accesses to global memory and thus increases efficiency. The knee of the performance curve (the upper curve in Figure 13b), the point up to which adding processors is still effective, now moves to the region of 20 processors, and the point where the curve becomes horizontal to the region of 30 processors.
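The effect of the local-hit probability p mentioned above can be quantified in the same way as for an ordinary cache; the two latencies are assumptions chosen only to show the trend.

#include <stdio.h>

int main(void)
{
    double t_local  = 50.0;     /* access to local memory, ns (assumed)             */
    double t_remote = 500.0;    /* access through the switch/network, ns (assumed)  */

    for (double p = 0.5; p < 0.95; p += 0.1) {
        /* mean access time = p * t_local + (1 - p) * t_remote */
        double t_avg = p * t_local + (1.0 - p) * t_remote;
        printf("p = %.1f  ->  t_avg = %.0f ns\n", p, t_avg);
    }
    return 0;   /* raising p from 0.5 to 0.9 cuts the mean time from 275 ns to 95 ns */
}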

Within the NUMA concept, several different approaches are implemented, denoted by the abbreviations COMA, CC-NUMA and NCC-NUMA.

In the cache-only memory architecture (COMA, Cache Only Memory Architecture), the local memory of each processor is built as a large cache memory for fast access from "its" processor. The caches of all processors are collectively treated as the global memory of the system; there is no actual global memory. The principal feature of the COMA concept is its dynamic nature: data is not statically bound to a specific memory module and does not have a unique address that remains unchanged during the entire lifetime of the variable. In the COMA architecture, data is transferred to the cache memory of the processor that last requested it, while the variable is not fixed to a unique address and can be placed in any physical cell at any time. Transferring data from one local cache to another does not require the participation of the operating system, but it does involve complex and expensive memory management hardware. To organize this regime, so-called cache directories are used. Note also that the last copy of a data item is never removed from the cache.

Since data in the COMA architecture is moved to the local cache memory of the owning processor, such CS have a significant performance advantage over other NUMA architectures. On the other hand, if a single variable, or two different variables stored in the same line of the same cache, are required by two processors, that cache line must be moved back and forth between the processors on each data access. Such effects may depend on the details of memory allocation and lead to unpredictable situations.

The cache-coherent non-uniform memory access model (CC-NUMA, Cache Coherent Non-Uniform Memory Architecture) is fundamentally different from the COMA model. In a CC-NUMA system, the distributed memory is not organized as cache memory but as ordinary physically distributed memory. There is no copying of pages or data between memory locations and no software-implemented message passing; there is simply one memory map, with the parts physically connected by copper cable, plus "smart" hardware. Hardware-implemented cache coherence means that no software is required for storing multiple copies of updated data or for transferring them: the hardware level handles all of this. Access to local memory modules in the various nodes of the system can proceed simultaneously and is faster than access to remote memory modules.

How the cache-non-coherent non-uniform memory access model (NCC-NUMA, Non-Cache Coherent Non-Uniform Memory Architecture) differs from CC-NUMA is obvious from the name. This memory architecture assumes a single address space but does not provide global data consistency at the hardware level; managing such data rests entirely on the software (applications or compilers). Although this may seem to be a drawback of the architecture, it turns out to be very useful for improving the performance of computing systems with a DSM-type memory architecture, considered in the section "Models of distributed memory architectures".

In general, shared-memory CS of the NUMA type are called virtual shared memory architectures. This type of architecture, in particular CC-NUMA, has recently been considered an independent and rather promising type of MIMD-class computing system.

Models of distributed memory architectures. In a distributed-memory system, each processor has its own memory and can address only it. Some authors call this type of system a multi-machine CS or multicomputer, emphasizing the fact that the blocks from which the system is built are themselves small computing systems with a processor and memory. Models of distributed memory architectures are commonly referred to as architectures without direct access to remote memory (NORMA, No Remote Memory Access). The name reflects the fact that each processor has access only to its own local memory; access to remote memory (the local memory of another processor) is possible only by exchanging messages with the processor that owns that memory.

Such an organization has a number of advantages. First, when accessing data there is no contention for a bus or switches: each processor can fully use the bandwidth of the path to its own local memory. Second, the absence of a common bus removes the associated restrictions on the number of processors: the size of the system is limited only by the network connecting the processors. Third, the cache coherence problem disappears: each processor may change its data independently, without worrying about keeping copies of the data in its own local cache consistent with the caches of other processors.

The student must

know:

classification of computing systems;

Examples of computing systems of various types.

be able to:

- choose the type of computing system in accordance with the problem being solved.



Chapter 11

Organization of memory in computing systems

In computing systems that combine many parallel processors or machines, the problem of proper memory organization is one of the most important. The difference between CPU and memory speed has always been a stumbling block in single processor VMs. The multiprocessor nature of the CS leads to another problem - the problem of simultaneous access to memory by several processors.

Depending on how the memory of multiprocessor (multi-machine) systems is organized, computing systems are divided into those with shared memory and those with distributed memory. In shared-memory systems (the memory is often also called jointly used memory), the memory of the CS is treated as a common resource, and each of the processors has full access to the entire address space. Systems with shared memory are called strongly coupled (closely coupled systems). Such a construction of computing systems occurs both in the SIMD class and in the MIMD class; sometimes, to emphasize this circumstance, special subclasses are introduced, designated SM-SIMD (Shared Memory SIMD) and SM-MIMD (Shared Memory MIMD).

In the distributed-memory variant, each of the processors has its own memory. The processors are combined into a network and can, if necessary, exchange data stored in their memories by sending each other so-called messages. This type of CS is called loosely coupled (loosely coupled systems). Loosely coupled systems also occur both in the SIMD class and in the MIMD class, and sometimes, to emphasize this feature, the subclasses DM-SIMD (Distributed Memory SIMD) and DM-MIMD (Distributed Memory MIMD) are introduced.

In some cases, shared memory computing systems are called multiprocessors, and systems with distributed memory - multicomputers.

The difference between shared and distributed memory is a difference in the structure of the virtual memory, that is, in how the memory looks from the processor side. Physically, almost every memory system is divided into autonomous components that can be accessed independently. What separates shared memory from distributed memory is how the memory subsystem interprets a cell address received from the processor. For example, suppose a processor executes the instruction load R0, i, meaning "load register R0 with the contents of cell i". In the case of shared memory, i is a global address and points to the same location for any processor. In a distributed-memory system, i is a local address: if two processors execute the instruction load R0, i, each of them accesses the i-th cell of its own local memory, that is, different cells, and different values may end up in their R0 registers.

The programmer must take the difference between the two memory systems into account, since it determines how the parts of a parallelized program interact. In the shared-memory variant, it is enough to create a data structure in memory and pass references to it to the subroutines running in parallel. In a distributed-memory system, a copy of the shared data must exist in each local memory; these copies are created by packing the shared data into messages sent to the other processors.

Memory with address interleaving

Physically, the memory of a computing system consists of several modules (banks), and an essential question is how the address space (the set of all addresses the processor can form) is distributed among them. One way to assign virtual addresses to memory modules is to divide the address space into consecutive blocks: if the memory consists of n banks of B cells each, then under block splitting the cell with address i is in the bank with number i/B. In an interleaved memory, consecutive addresses are placed in different banks: the cell with address i is in the bank with number i mod n. Let, for example, the memory consist of four banks of 256 bytes each. In the block addressing scheme, the first bank is allocated virtual addresses 0-255, the second 256-511, and so on. In the interleaved scheme, consecutive cells in the first bank have virtual addresses 0, 4, 8, ..., in the second bank 1, 5, 9, and so on (Fig. 11.1, a).

The distribution of the address space over the modules makes it possible to process memory access requests simultaneously if the corresponding addresses belong to different banks: the processor can request access to cell i in one cycle and to cell j in the next. If i and j are in different banks, the information will be transferred in successive cycles. Here a cycle means a processor cycle, whereas a full memory cycle takes several processor cycles. Thus, the processor does not have to wait until the full cycle of accessing cell i is completed. This technique increases throughput: if the memory system consists of a sufficient number of banks, data can be exchanged between the processor and memory at a rate of one word per processor cycle, regardless of the duration of the memory cycle.

Many computations involve accessing a sequence of elements separated by a constant interval. The interval between such elements is called the index step, or stride. One interesting application of this property is access to matrices. If the index step is one greater than the number of rows in the matrix, a single memory access request will return all of the diagonal elements of the matrix (Fig. 11.1, b). It is the programmer's responsibility to ensure that all of the extracted matrix elements are located in different banks.
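A minimal sketch of the two mappings and of the diagonal-stride case (the bank count and bank size follow the 4 x 256-byte example above; the matrix size is made up):

#include <stdio.h>

#define NUM_BANKS 4
#define BANK_SIZE 256                 /* bytes per bank, as in the example above */

/* Block splitting: consecutive addresses stay in the same bank. */
static int bank_block(int addr)       { return addr / BANK_SIZE; }

/* Interleaving: consecutive addresses go to consecutive banks. */
static int bank_interleaved(int addr) { return addr % NUM_BANKS; }

int main(void)
{
    int rows = 4, stride = rows + 1;  /* 4 x 4 matrix: diagonal elements are stride 5 apart */
    for (int k = 0; k < 4; k++) {
        int addr = k * stride;        /* addresses 0, 5, 10, 15 */
        printf("addr %2d: block bank %d, interleaved bank %d\n",
               addr, bank_block(addr), bank_interleaved(addr));
    }
    /* With interleaving the four diagonal elements land in four different banks
       and can be fetched in parallel; with block splitting they all sit in bank 0. */
    return 0;
}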

Memory architecture models of computing systems

Within both shared and distributed memory, several models of memory system architectures are implemented.


Fig. 11.3. Shared memory: a – combining processors using a bus; b – system with local caches; c – system performance as a function of the number of processors on the bus; d – multiprocessor CS with shared memory consisting of individual modules

An alternative way to build a multiprocessor CS with shared memory based on UMA is shown in Fig. 11.3, d. Here the bus has been replaced by a switch that routes processor requests to one of several memory modules. Although there are multiple memory modules, they all belong to a single virtual address space. The advantage of this approach is that the switch can serve several requests in parallel. Each processor can be connected to its own memory module and access it at the maximum allowed speed. Contention between processors can arise when they attempt to access the same memory module at the same time; in that case only one processor gets access and the others are blocked.

Unfortunately, the UMA architecture does not scale well. The most common systems contain 4-8 processors, much more rarely 32-64 processors. In addition, such systems cannot be considered fault-tolerant, since the failure of one processor or memory module brings down the entire system.

Another approach to building a shared-memory CS is non-uniform memory access, referred to as NUMA (Non-Uniform Memory Access). Here there is still a single address space, but each processor has local memory. The processor accesses its own local memory directly, which is much faster than accessing remote memory through a switch or network. Such a system can be supplemented with global memory, in which case the local storage devices play the role of a fast cache for the global memory. Such a scheme can improve the performance of the CS, but it cannot postpone indefinitely the flattening of the performance curve. If each processor has a local cache (Fig. 11.3, b), there is a high probability (p > 0.9) that the required instruction or data item is already in local memory. A reasonable probability of hitting local memory significantly reduces the number of processor accesses to global memory and thus increases efficiency. The knee of the performance curve (the upper curve in Fig. 11.3, c), the point up to which adding processors is still effective, now moves to the region of 20 processors, and the point where the curve becomes horizontal to the region of 30 processors.

Within the NUMA concept, several different approaches are implemented, denoted by the abbreviations COMA, CC-NUMA and NCC-NUMA.

In the cache-only memory architecture (COMA, Cache Only Memory Architecture), the local memory of each processor is built as a large cache memory for fast access from "its" processor. The caches of all processors are collectively treated as the global memory of the system; there is no actual global memory. The principal feature of the COMA concept is its dynamic nature: data is not statically bound to a specific memory module and does not have a unique address that remains unchanged during the entire lifetime of the variable. In the COMA architecture, data is transferred to the cache memory of the processor that last requested it, while the variable is not fixed to a unique address and can be placed in any physical cell at any time. Transferring data from one local cache to another does not require the participation of the operating system, but it does involve complex and expensive memory management hardware. To organize this regime, so-called cache directories are used. Note also that the last copy of a data item is never removed from the cache.

Because the COMA architecture moves data to the local cache memory of the owning processor, such CS have a significant performance advantage over other NUMA architectures. On the other hand, if a single variable, or two different variables stored in the same line of the same cache, are required by two processors, that cache line must be moved back and forth between the processors on each data access. Such effects may depend on the details of memory allocation and lead to unpredictable situations.

The cache-coherent non-uniform memory access model (CC-NUMA, Cache Coherent Non-Uniform Memory Architecture) is fundamentally different from the COMA model. In a CC-NUMA system, the distributed memory is not organized as cache memory but as ordinary physically distributed memory. There is no copying of pages or data between memory locations and no software-implemented message passing; there is simply one memory map, with the parts physically connected by copper cable, plus "smart" hardware. Hardware-implemented cache coherence means that no software is required for storing multiple copies of updated data or for transferring them: the hardware level handles all of this. Access to local memory modules in the various nodes of the system can proceed simultaneously and is faster than access to remote memory modules.

How the cache-non-coherent non-uniform memory access model (NCC-NUMA, Non-Cache Coherent Non-Uniform Memory Architecture) differs from CC-NUMA is obvious from the name. This memory architecture assumes a single address space but does not provide global data consistency at the hardware level; managing such data rests entirely on the software (applications or compilers). Although this may seem to be a drawback of the architecture, it turns out to be very useful for improving the performance of computing systems with a DSM-type memory architecture, considered in the section "Models of distributed memory architectures".

In general, shared-memory CS of the NUMA type are called virtual shared memory architectures. This type of architecture, in particular CC-NUMA, has recently been considered an independent and rather promising type of MIMD-class computing system, so such CS are discussed in more detail below.

Models of distributed memory architectures

In a distributed-memory system, each processor has its own memory and can address only it. Some authors call this type of system a multi-machine CS or multicomputer, emphasizing the fact that the blocks from which the system is built are themselves small computing systems with a processor and memory. Models of distributed memory architectures are commonly referred to as architectures without direct access to remote memory (NORMA, No Remote Memory Access). The name reflects the fact that each processor has access only to its own local memory; access to remote memory (the local memory of another processor) is possible only by exchanging messages with the processor that owns that memory.

Such an organization has a number of advantages. First, when accessing data there is no contention for a bus or switches: each processor can fully use the bandwidth of the path to its own local memory. Second, the absence of a common bus removes the associated restrictions on the number of processors: the size of the system is limited only by the network connecting the processors. Third, the cache coherence problem disappears: each processor may change its data independently, without worrying about keeping copies of the data in its own local cache consistent with the caches of other processors.

The main disadvantage of distributed memory CS is the complexity of information exchange between processors. If one of the processors needs data from the memory of another processor, it must exchange messages with this processor. This results in two types of costs:

· It takes time to form and forward a message from one processor to another;

· To respond to messages from other processors, the receiving processor must receive an interrupt request and execute the interrupt handling routine.

The structure of a system with distributed memory is shown in Fig. 11.4. The left part (Fig. 11.4, a) shows one processing element (PE). It includes the processor itself (P), local memory (M) and two input/output controllers (Ko and Ki). The right part (Fig. 11.4, b) shows a four-processor system and illustrates how messages are sent from one processor to another. With respect to each PE, all other processing elements can be regarded simply as input/output devices. To send a message to another PE, the processor forms a data block in its local memory and notifies its local controller of the need to transfer information to an external device. The interconnection network forwards this message to the receiving I/O controller of the destination PE. The latter finds a place for the message in its own local memory and notifies the source processor that the message has been received.


An interesting variant of a distributed-memory system is the distributed shared memory model (DSM, Distributed Shared Memory), also known as the architecture with non-uniform memory access and software coherence (SC-NUMA, Software-Coherent Non-Uniform Memory Architecture). The idea of this model is that the CS, while physically a distributed-memory system, is presented to the user as a shared-memory system thanks to the operating system. The operating system offers the user a single address space, even though actual access to the memory of a "foreign" computer of the CS is still provided by message exchange.

Multiprocessor cache memory coherence

A shared-memory multiprocessor system consists of two or more independent processors, each of which executes either part of a large program or an independent program. All processors access instructions and data stored in the common main memory. Since memory is a shared resource, contention between processors arises when it is accessed, which increases the average memory access latency. To reduce this latency, each processor is given a local cache which, by serving local memory accesses, in many cases removes the need to access the shared main memory. In turn, equipping each processor with a local cache leads to the so-called coherence problem, or the problem of keeping the cache memories consistent. A system is coherent if every read operation at any address, performed by any of the processors, returns the value written by the most recent write operation to that address, regardless of which processor wrote last.

In its simplest form, the cache coherence problem can be explained as follows (Figure 11.5). Let two processors P1 and P2 be connected to the shared memory via a bus. First, both processors read the variable X; copies of the blocks containing this variable are transferred from the main memory to the local caches of both processors (Fig. 11.5, a). Next, processor P1 increments the value of the variable X by one. Since a copy of the variable is already in that processor's cache, a cache hit occurs and the value is changed only in cache 1. If processor P2 now performs another read of X, a cache hit also occurs, and P2 gets the "old" value of X stored in its cache (Fig. 11.5, b).
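The scenario of Fig. 11.5 can be mimicked with two variables standing in for the private caches; this single-threaded sketch only models the bookkeeping, not real concurrent execution.

#include <stdio.h>

int main(void)
{
    int main_memory_x = 5;          /* the shared variable X in main memory    */

    /* Both processors read X: a copy of the block lands in each local cache. */
    int cache1_x = main_memory_x;   /* copy in the cache of processor P1       */
    int cache2_x = main_memory_x;   /* copy in the cache of processor P2       */

    /* P1 increments X; with a write-back cache the change stays in cache 1 only. */
    cache1_x = cache1_x + 1;

    /* P2 reads X again: a cache hit, so it sees the stale value. */
    printf("P1 sees X = %d, P2 sees X = %d, memory holds X = %d\n",
           cache1_x, cache2_x, main_memory_x);   /* 6, 5, 5 - the copies disagree */
    return 0;
}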

Maintaining consistency requires that when a data item is changed by one of the processors, the corresponding changes be made in the caches of the other processors holding a copy of the changed item, as well as in the shared memory. A similar problem arises, incidentally, in single-processor systems with several levels of cache memory: there, the contents of the caches at different levels must be kept consistent.

There are two approaches to solving the problem of coherence: software and hardware. Some systems use strategies that combine both approaches.

Software ways of solving the coherence problem

Software techniques for solving the coherence problem make it possible to do without additional equipment or minimize it.

The Berkeley protocol. The Berkeley protocol was used in the Berkeley multiprocessor system built on RISC processors.

The overhead caused by cache misses is reduced by the idea of cache line ownership implemented in this protocol. Normally, main memory is considered the owner of the rights to all blocks of data. Before modifying the contents of a line in its cache, a processor must acquire ownership of that line; these rights are acquired through special read and write operations. If a miss occurs when accessing a block whose current owner is not main memory, the processor that owns the line prevents the read from main memory and itself supplies the requesting processor with the data from its local cache.

Another improvement is the introduction of the shared state. When a processor writes to one of its local cache lines, it typically generates a signal to invalidate copies of the block being modified in other caches. In the Berkeley protocol, a revocation signal is generated only if other caches have such copies. This can significantly reduce overhead traffic on the bus. The following scenarios are possible.

First of all, every time a processor writes to its cache, the modified line is put into the "modified, private" state (PD, Private Dirty). Furthermore, if the line is shared, an invalidation signal is sent to the bus, and in all local caches holding a copy of this block of data those copies are put into the "invalid" state (I, Invalid). If a write miss occurs, the processor obtains a copy of the block from the cache of its current owner; only after these actions does the processor write to its cache.

On a read miss, the processor sends a request to the owner of the block to obtain its most recent version and puts its new copy into the "read only" state (RO, Read Only). If the line was owned by another processor, that processor marks its copy of the block as "shared dirty" (SD, Shared Dirty).
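A minimal sketch of the write-hit rule just described, with the four line states named in the text; the bus is reduced to a single flag, and a line in the RO or SD state is treated as potentially shared, so a write to it triggers the invalidation broadcast.

#include <stdio.h>
#include <stdbool.h>

/* Cache-line states of the Berkeley protocol as named in the text. */
typedef enum { INVALID, READ_ONLY, SHARED_DIRTY, PRIVATE_DIRTY } berkeley_state_t;

/* Write hit in the local cache: acquire ownership, broadcast an invalidation
   only when other caches may hold copies, then mark the line PD. */
static berkeley_state_t write_hit(berkeley_state_t state, bool *send_invalidate)
{
    *send_invalidate = (state == READ_ONLY || state == SHARED_DIRTY);
    return PRIVATE_DIRTY;
}

int main(void)
{
    bool inv;
    berkeley_state_t s = READ_ONLY;         /* line previously read by this processor */
    s = write_hit(s, &inv);
    printf("new state = %d, invalidate broadcast = %s\n", s, inv ? "yes" : "no");
    return 0;
}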

The state diagram of the Berkeley protocol is shown in Fig. 11.10.

Comparing the write-once and Berkeley protocols, the following can be noted. Both protocols use a write-back strategy, in which modified blocks are kept in the cache for as long as possible. Main memory is updated only when a line is evicted from the cache. The upper bound on the total number of write transactions on the bus is set by the part of the write-once protocol that implements write-through, since that strategy generates a bus write operation for every change initiated by a processor. Because the first write in the write-once protocol is a write-through, it is performed even if the data is not shared. This entails additional bus traffic, which grows as cache capacity increases. The write-once protocol has been shown to produce more bus traffic than the Berkeley protocol.








Fig. 11.10. The Berkeley protocol

A line that is constantly read and updated under the write-once protocol must be read into the cache, modified locally in the cache, and written back to memory. The whole procedure requires two bus operations: a read from main memory (the OP) and a write back to the OP. The Berkeley protocol, on the other hand, starts by taking ownership of the line, after which the block is modified in the cache. If the line is not accessed again before it is evicted from the cache, the number of bus cycles is the same as in the write-once protocol. However, it is more likely that the line will be requested again; in that case, from the point of view of a single cache, updating a cache line needs only one read operation on the bus. Thus, the Berkeley protocol transfers lines directly between caches, whereas the write-once protocol transfers the block from the source cache to main memory and then from the OP to the requesting caches, which increases the overall latency of the memory system.

The Illinois protocol. The Illinois protocol, proposed by Mark Papamarkos, also aims to reduce bus traffic and thus the time processors spend waiting for the bus. Here, as in the Berkeley protocol, the idea of block ownership dominates, but in a slightly modified form. In the Illinois protocol, every cache that holds a valid copy of a data block is an owner; the same block can therefore have several owners. When this happens, each processor is assigned a specific priority, and the owner with the higher priority becomes the source of the information.

As in the previous case, the invalidation signal is generated only when there are copies of this block in other caches. Possible scenarios for the Illinois protocol are shown in Fig. 11.11.

Fig. 11.11. The Illinois protocol

Each time a processor writes to its cache, the modified line is put into the PD (Private Dirty) state. If the data block is shared, an invalidation signal is sent to the bus, and in all local caches holding a copy of this block those copies are put into the "invalid" state (I, Invalid). If a write miss occurs, the processor obtains a copy from the cache of the current owner of the requested block; only after these actions does the processor write to its cache. As can be seen, this part coincides completely with the Berkeley protocol.

On a read cache miss, the processor sends a request to the owner of the block to get the latest version of the block, and puts its new copy in the "exclusive" (E, Exclusive) state, provided that it is the sole owner of the row. Otherwise, the status changes to "shared" (S, Shared).

It is essential that the protocol is extensible and is closely tied both to the cache miss rate and to the amount of data that is the common property of a multiprocessor system.

The Firefly protocol. The protocol was proposed by Tucker et al. and implemented in the Firefly multiprocessor workstation developed at the Digital Equipment Corporation research center.

The Firefly protocol uses write-update. The possible states of a cache line are the same as in the Illinois protocol (Fig. 11.12). The difference is that the write-back policy applies only to lines in the PD or E state, while lines in the S state are written through. Snooping caches use the write-through operations to update their copies. In addition, snooping caches that find a copy of the line in their possession assert a special "shared" bus line, so that the writing controller can decide which state to assign to the line that has been written. On a read miss, the "shared" line also tells the local cache controller where the copy of the line came from: from main memory or from another cache. Thus, the S state applies only to data that is actually shared.

Fig. 11.12. The Firefly protocol

Fig. 11.13. The Dragon protocol

The MESI protocol. Undoubtedly the best known of the snooping protocols is MESI (Modified/Exclusive/Shared/Invalid). The MESI protocol is widely used in commercial microprocessor systems, for example those based on the Pentium and PowerPC microprocessors. It can be found in the internal cache and the external cache controller i82490 of the Pentium microprocessor, in the i860 processor and in the Motorola MC88200 cache controller.

The protocol was designed for write-back caches. One of the main goals of the MESI protocol is to postpone as long as possible the write-back of cached data to the main memory of the CS. This improves system performance by minimizing the transfers of information between the caches and main memory. The MESI protocol assigns each cache line one of four states, which are controlled by two MESI status bits in the line's tag. The state of a cache line can be changed both by the processor for which this cache is local and by the other processors of the multiprocessor system. Management of the cache line states can also be assigned to external logic devices. One version of the protocol provides for the use of the write-once scheme considered earlier.

Modified (M) – the line in the cache has been modified and differs from the corresponding line in main memory; it is present only in this cache.

Exclusive (E) – the line in the cache matches the corresponding line in main memory (the data is valid) and is not present in any other cache.

Shared (S) – the line in the cache matches the corresponding line in main memory (the data is valid) and may be present in one or more of the other caches.

Invalid (I) – a cache line marked as invalid does not contain valid data and becomes logically inaccessible.

Fig. 11.15. The sequence of state changes in the MESI protocol: a – processor 1 reads x; b – processor 2 reads x; c – processor 1 performs the first write to x; d – processor 1 writes to x

The order in which a cache line passes from one state to another depends on the current state of the line, the operation being performed (read or write), the result of the cache access (hit or miss), and, finally, on whether the line is shared or not. Fig. 11.14 shows a diagram of the main transitions, without taking the write-once mode into account.

Suppose one of the processors requests a read from a line that is not currently in its local cache (a read miss). The request is broadcast on the bus. If none of the caches holds a copy of the desired line, there is no response from the snooping controllers of the other processors, the line is read into the cache of the requesting processor from main memory, and the copy is assigned the state E. If any of the local caches contains the sought copy, a response is received from the corresponding snooping controller indicating access to a shared line. All copies of the line in question, in all caches, are put into the S state, regardless of which state they were in before (I, E or S).

When a processor makes a write request to a line that is not in its local cache (a write miss), the line must be read from main memory (the OP) and modified before being loaded into the cache. Before the processor can load the line, it must make sure that main memory really holds a valid version of the data, that is, that no modified copy of the line exists in the other caches. The sequence of operations formed in this case is called a read with intent to modify (RWITM, Read With Intent To Modify). If a copy of the desired line is found in one of the caches, and in state M, the processor holding this copy interrupts the RWITM sequence and writes the line back to the OP, after which it changes the state of the line in its cache to I. The RWITM sequence is then resumed and main memory is accessed again to read the updated line. The final state of the line is M, in which neither the OP nor any other cache holds another valid copy of it. If a copy of the line existed in another cache but was not in state M, that copy is invalidated and the access to main memory proceeds immediately.

A cache hit on a read does not change the status of the line being read. If the processor performs a write access to an existing row that is in state S, it broadcasts a request to the bus to inform other caches, updates the row in its cache, and sets it to status M. All other copies of the row are put in state I. If the processor performs a write access to a row in state E, the only thing it has to do is write to the row and change its state to M, since there are no other copies of the row in the system.
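The transition rules just listed can be condensed into one small function; only processor-side read and write events are modelled, and the bus action is returned as text. This is a sketch of the diagram of Fig. 11.14 (without the write-once variant), not of any particular hardware implementation.

#include <stdio.h>

typedef enum { M, E, S, I } mesi_t;
typedef enum { READ, WRITE } op_t;

/* Next state of a line in the local cache after a processor read or write.
   other_copy != 0 means some other cache also holds the line. */
static mesi_t mesi_next(mesi_t st, op_t op, int other_copy, const char **bus)
{
    *bus = "no bus activity";
    if (op == READ) {
        if (st == I) {                      /* read miss */
            *bus = "fetch line from memory or owning cache";
            return other_copy ? S : E;      /* E only if this is the sole copy */
        }
        return st;                          /* read hit: state unchanged       */
    }
    switch (st) {                           /* write */
    case I: *bus = "RWITM (read with intent to modify)"; return M;
    case S: *bus = "invalidate other copies";            return M;
    case E: /* sole copy: silent transition */           return M;
    case M: /* already modified */                       return M;
    }
    return st;
}

int main(void)
{
    const char *bus;
    mesi_t st = mesi_next(S, WRITE, 1, &bus);   /* write hit on a shared line */
    printf("S + write -> state %d (%s)\n", st, bus);
    return 0;
}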

Fig. 11.15 shows a typical sequence of events in a two-processor system requesting access to cell x. An access to any cell of a cache line is treated as an access to the entire line.

Let us illustrate the steps when processor 2 tries to read the contents of cell x' (Fig. 11.16). First a read miss occurs and the processor tries to access main memory. Processor 1 monitors the bus, detects an access to a cell whose copy is in its cache memory and is in




Fig. 11.16. Transition from state E to state S in the MESI protocol: a – processor 2 reads x'; b – processor 1 writes x' back to main memory; c – processor 2 reads x' from main memory

state M, so it blocks the read operation from processor 2. Processor 1 then rewrites the line containing x" into the OP and releases processor 2 so that it can repeat the access to main memory. Processor 2 now receives the line containing x", and loads it to your cache. Both copies are marked as S.

So far, we have considered the variant of the MESI protocol without the write-once mode. When write-once is taken into account, the state diagram shown in Fig. 11.14 is slightly modified. All read misses cause a transition to the S state. The first write hit is followed by a transition to the E state (the so-called write-once transition). The next write hit changes the state of the line to M.

Directory-based protocols

Directory-based coherence protocols are typical of complex multiprocessor systems with shared memory, in which the processors are connected by a multistage hierarchical interconnection network. The complexity of the topology makes snooping protocols, with their broadcast mechanism, costly and inefficient.

Directory-based protocols collect and track information about the contents of all local caches. Such protocols are usually implemented with a centralized controller that is physically part of the main memory controller. The directory itself is stored in main memory. When a local cache controller makes a request, the directory controller detects it and generates the commands needed to transfer the data from main memory or from another local cache that holds the latest version of the requested data. The central controller is responsible for keeping the state information about the local caches up to date, so it must be notified of any local action that could affect the state of a data block.

The directory contains a set of entries describing each cached main-memory location that can be shared between the processors of the system. The directory is accessed whenever one of the processors modifies its copy of such a location in its local cache. In this case, the information from the directory is needed in order to invalidate or update the copies of the changed location (or of the entire line containing it) in the other local caches that hold such copies.

For each shared line that can be cached, a directory entry is allocated to store pointers to the copies of that line. In addition, each entry contains one modification bit (D), indicating whether the copy is "dirty" (D = 1) or "clean" (D = 0), that is, whether the contents of the line in the cache have been changed after it was loaded there. This bit indicates whether the processor is allowed to write to the given line.

There are currently three ways to implement directory-based cache coherence protocols: full directory, limited directories, and chained directories.

In the full-directory protocol, a single centralized directory maintains information about all the caches. The directory is stored in main memory.


Fig. 11.17. Full-directory cache coherence protocol

In a system of N processors, each directory entry contains N one-bit pointers. If a copy of the data is present in the corresponding local cache, the bit pointer is set to 1; otherwise it is set to 0. This situation is shown in Fig. 11.17, where it is assumed that a copy of the line exists in every cache. Each line is given two state indicators: a validity bit (V, Valid) and an ownership bit (P, Private). If the information in the line is correct, its V-bit is set to 1. A P-bit equal to one indicates that the given processor has been granted the right to write to the corresponding line of its local cache.

Suppose processor 2 writes to location x. At the initial moment it has not yet received permission for such a write. It sends a request to the directory controller and waits for permission to continue the operation. In response to the request, a signal to invalidate the existing copies is sent to all caches that hold copies of the line containing cell x. Each cache that receives this signal resets the validity bit (V-bit) of the invalidated line to 0 and returns an acknowledgment to the directory controller. After all acknowledgments have been received, the directory controller sets the modification bit (D-bit) of the corresponding directory entry to one and sends processor 2 a signal permitting the write to location x. From this point on, processor 2 can continue writing to its own copy of location x, as well as to main memory if the cache implements write-through.
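The write sequence just described can be sketched as follows. The classes, method names, and the way acknowledgments are collapsed into method calls are assumptions made only for illustration, not the interface of a real directory controller.

```python
class Cache:
    """Illustrative local cache: only the validity (V) bits are modelled."""
    def __init__(self):
        self.valid = {}
    def invalidate(self, line):
        self.valid[line] = False   # reset V-bit and (implicitly) acknowledge
    def grant_write(self, line):
        self.valid[line] = True

class DirectoryEntry:
    """One full-directory entry: N one-bit presence pointers plus the D-bit."""
    def __init__(self, n_processors):
        self.present = [False] * n_processors
        self.dirty = False

def write_request(entry, line, writer_id, caches):
    """Processor writer_id asks to write the line: invalidate the other copies,
    mark the entry dirty, then grant the write."""
    for cpu, has_copy in enumerate(entry.present):
        if has_copy and cpu != writer_id:
            caches[cpu].invalidate(line)
            entry.present[cpu] = False
    entry.dirty = True
    entry.present[writer_id] = True
    caches[writer_id].grant_write(line)

# Example: three caches hold a copy of "x"; processor 2 requests a write.
caches = [Cache() for _ in range(3)]
entry = DirectoryEntry(3)
entry.present = [True, True, True]
write_request(entry, "x", writer_id=2, caches=caches)
print(entry.present, entry.dirty)   # -> [False, False, True] True
```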

The main problems of the full-directory protocol are related to the large number of entries. For each location, the directory of a system of N processors requires N + 1 bits, so the complexity grows linearly with the number of processors. The full-directory protocol allows every local cache to hold copies of all shared locations. In practice this possibility is rarely needed in full: at any given moment only one or a few copies are usually relevant. In the limited-directory protocol, copies of a single line may reside only in a limited number of caches: at any time there can be no more than n copies of the line, and the number of pointers in a directory entry is reduced to n (n < N). To uniquely identify the cache holding a copy, a pointer must consist of log2 N bits instead of one bit, so the total length of the pointers in each directory entry is n·log2 N bits instead of N bits. For a constant n, the complexity of a limited directory grows more slowly with system size than the linear growth of the full directory.
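A quick calculation, under assumed values of N and n, shows how the per-entry storage compares for the two schemes.

```python
from math import ceil, log2

# Per-entry directory storage, in bits, for illustration only.
# The values of N and n below are assumptions.
N = 64   # number of processors in the system
n = 4    # maximum number of simultaneous copies in a limited directory

full_directory_bits = N + 1                     # N presence bits + the D-bit
limited_directory_bits = n * ceil(log2(N))      # n pointers of log2(N) bits each

print(full_directory_bits, limited_directory_bits)   # -> 65 24
```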

When more than n copies are needed, the controller decides which copies to keep and which to invalidate, after which the corresponding changes are made to the pointers of the directory entry.

The chained-directory method also aims to reduce the size of the directory. It stores entries in a linked list, which can be singly linked (unidirectional) or doubly linked (bidirectional).

Fig. 11.18. Chained-directory cache coherence protocol

In a singly linked list (Fig. 11.18), each directory entry contains a pointer to a copy of the line in one of the local caches. The copies of the same line in the different caches of the system form a unidirectional chain. For this purpose, their tags have a special field into which a pointer to the cache containing the next copy in the chain is written. A special delimiter is placed in the tag of the last copy of the chain. The chained directory admits chains of length N, that is, it supports N copies of a location. When yet another copy is created, the chain must be destroyed and a new one formed in its place. Suppose, for example, that processor 5 does not have a copy of location x and requests it from main memory. The pointer in the directory is changed to point to cache number 5, and the pointer in cache 5 is made to point to cache 2. For this, the main memory controller must pass to cache 5, along with the requested data, the pointer to cache number 2. Only after the entire chain structure has been formed does processor 5 receive permission to access location x. If a processor writes to the location, an invalidation signal is sent down the path defined by the corresponding chain of pointers. The chain must also be updated when a copy is removed from any cache.
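A minimal sketch of such a chain is given below: the directory holds one pointer to the head cache, each cached copy's tag points to the next copy, and a write walks the chain to invalidate it. All data structures and names here are illustrative assumptions.

```python
# Sketch of a singly linked (chained) directory.

TERMINATOR = None   # the "delimiter" in the tag of the last copy of the chain

directory = {}      # line -> id of the cache holding the head of the chain
next_copy = {}      # (cache_id, line) -> id of the next cache in the chain

def add_copy(line, new_cache_id):
    """A cache loads a new copy: it becomes the new head of the chain."""
    old_head = directory.get(line, TERMINATOR)
    next_copy[(new_cache_id, line)] = old_head   # pointer passed with the data
    directory[line] = new_cache_id

def invalidate_chain(line):
    """A write to the line: send the invalidation down the chain of pointers."""
    cache_id = directory.get(line, TERMINATOR)
    while cache_id is not TERMINATOR:
        nxt = next_copy.pop((cache_id, line))
        # ... here the copy in cache `cache_id` would be invalidated ...
        cache_id = nxt
    directory[line] = TERMINATOR

# Example from the text: cache 2 already holds x, then cache 5 requests it.
add_copy("x", 2)
add_copy("x", 5)            # directory -> cache 5, cache 5 -> cache 2
print(directory["x"])       # -> 5
invalidate_chain("x")
print(directory["x"])       # -> None
```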

A doubly linked list supports both forward and backward pointers. This allows you to more efficiently insert new pointers into the chain or remove unnecessary ones from it, but requires storing a larger number of pointers.

Directory-based schemes suffer from "congestion" at the centralized controller, as well as from the communication overhead on the paths between the local cache controllers and the central controller. Nevertheless, they are very effective in multiprocessor systems with a complex interconnection topology, where snooping protocols cannot be implemented.

Below is a brief description of the directory-based cache coherence protocols currently in use. For a detailed treatment of these protocols, see the cited literature.

Tang protocol. A centralized global directory contains a full copy of all the information held in the directories of each local cache. This creates a bottleneck and also requires searching for the matching entries.

Censier protocol. The Censier directory scheme uses a bit vector of pointers to indicate which processors hold a local copy of a given memory block. Such a vector exists for each memory block. The disadvantages of the method are its inefficiency with a large number of processors and, in addition, the need to access main memory to update cache lines.

Archibald protocol. The Archibald directory scheme is a pair of coupled schemes for hierarchically organized processor networks. A detailed description of this protocol can be found in .

Stenstrom protocol. The Stenstrom directory provides six valid states for each data block. This protocol is relatively simple and is suitable for any processor interconnection topology. The directory is stored in main memory. On a read miss, main memory is accessed and sends a message to the cache owning the block, if there is one. Upon receiving this message, the owning cache sends the requested data and also sends messages to all other processors sharing the data so that they update their bit vectors. The scheme is not very efficient with a large number of processors, but it is currently the most developed and widely used directory protocol.

Control questions

1. Analyze how the features of shared-memory and distributed-memory computing systems affect software development. Why are these systems called strongly coupled and weakly coupled, respectively?

2. Explain the idea of memory address interleaving. What considerations govern the choice of an address allocation mechanism? How is it related to the architecture class of the computing system?

3. Give a comparative characterization of uniform (homogeneous) and non-uniform (heterogeneous) memory access.

4. What are the advantages of the COMA architecture?

5. Carry out a comparative analysis of models with cache-coherent and cache-incoherent access to non-uniform memory.

6. Formulate the advantages and disadvantages of an architecture without direct access to remote memory.

7. Explain the meaning of distributed and shared memory.

8. Develop your own example to illustrate the cache coherence problem.

9. Describe the features of software approaches to solving the coherence problem; highlight their advantages and disadvantages.

10. Compare the write-invalidate and write-update approaches to memory writes, emphasizing their advantages and disadvantages.

11. Give a comparative description of methods for maintaining coherence in multiprocessor systems.

12. Perform a comparative analysis of the snooping protocols known to you.

13. What is the most popular snooping protocol? Explain the reasons for the heightened interest in it.

14. Give a detailed description of directory-based coherence protocols and how they are implemented. How do these protocols differ from snooping protocols?

Table 9.1. Hierarchy of the PC memory subsystem

1. Super-fast memory (registers):
   1985 - access time 0.2-5 ns, typical size 16/32 bits, price $3-100 per byte;
   2000 - access time 0.01-1 ns, typical size 32/64/128 bits, price $0.1-10 per byte.
2. Fast buffer memory (cache):
   1985 - access time 20-100 ns, typical size 8 KB - 64 KB, price ~$10 per byte;
   2000 - access time 0.5-2 ns, typical size 32 KB - 1 MB, price $0.1-0.5 per byte.
3. Operational (main) memory:
   1985 - access time ~0.5 ms, typical size 1 MB - 256 MB, price $0.02-1 per byte;
   2000 - access time 2-20 ns, typical size 128 MB - 4 GB, price $0.01-0.1 per byte.
4. External storage (mass storage):
   1985 - access time 10-100 ms, typical size 1 MB - 1 GB, price $0.002-0.04 per byte;
   2000 - access time 5-20 ms, typical size 1 GB - 0.5 TB, price $0.001-0.01 per byte.

Processor registers make up the processor context and store the data used by the instructions currently being executed. Registers are referred to, as a rule, by their mnemonic names in processor instructions.

The cache is used to match the speed of the CPU and of main memory. Computing systems use a multi-level cache: a level I cache (L1), a level II cache (L2), and so on. Desktop systems typically use a two-level cache, while servers use a three-level cache. The cache stores instructions or data that are likely to be needed by the processor in the near future. The operation of the cache is transparent to software, so the cache is usually not directly accessible from programs.

RAM stores, as a rule, functionally complete software modules (the operating system kernel, the programs being executed and their libraries, drivers of the devices in use, etc.) and the data directly involved in the operation of those programs; it is also used to hold the results of calculations or other data processing before they are transferred to external storage, to a data output device, or to communication interfaces.

Each cell of random access memory is assigned a unique address. Memory organization methods give programmers the ability to use the entire computer system efficiently. These methods include the flat ("solid") memory model and the segmented memory model. With the flat model, the program operates with a single continuous, linear address space in which memory cells are numbered sequentially from 0 to 2^n - 1, where n is the width of the CPU address. With the segmented model, memory is presented to the program as a group of independent address blocks called segments. To address a byte of memory, the program must use a logical address consisting of a segment selector and an offset. The segment selector selects a specific segment, and the offset points to a specific cell in the address space of the selected segment.
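As a simple illustration of the two models, the sketch below computes an address in each. It is a minimal sketch: the segmented variant follows only the simple scheme in which the segment base is the selector shifted left by four bits; protected-mode descriptor tables are not modelled, and the function names are assumptions.

```python
def flat_address(linear: int, address_bits: int = 32) -> int:
    """Flat model: the address is already a linear offset in 0 .. 2**n - 1."""
    assert 0 <= linear < 2 ** address_bits
    return linear

def segmented_address(selector: int, offset: int) -> int:
    """Segmented model: segment selector plus offset form the physical address
    (segment base = selector * 16, giving a 20-bit address)."""
    return (selector << 4) + offset

print(hex(flat_address(0x0001_2350)))            # -> 0x12350
print(hex(segmented_address(0x1234, 0x0010)))    # -> 0x12350
```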
