
The dynamic range is compressed or standard. Synthesis and speech recognition

Dynamic range compression (DRC) is the narrowing (or, in the case of an expander, the widening) of the dynamic range of a recording. Dynamic range is the difference between the quietest and loudest sounds: typically the quietest sound in a recording sits just above the noise floor, and the loudest just below the maximum level the medium can carry without distortion. Hardware devices and software that perform dynamic compression are called compressors; four main groups are distinguished among them: compressors proper, limiters, expanders, and gates.

Vacuum tube analog compressor DBX 566

Up and down compression

Downward compression decreases the volume of a sound once it exceeds a certain threshold, leaving quieter sounds unchanged; its extreme case is the limiter. Upward compression, on the contrary, increases the volume of a sound that is below the threshold, without affecting louder sounds. Both types of compression narrow the dynamic range of the audio signal.

Down compression

Up-compression

Expander and Gate

If the compressor decreases the dynamic range, the expander increases it. When the signal level rises above the threshold, the expander raises it even further, increasing the difference between loud and quiet sounds. Such devices are often used when recording drum kits to separate the sound of one drum from the others.

The type of expander used not to amplify loud sounds but to attenuate quiet sounds that do not reach the threshold (for example, background noise) is called a noise gate. In such a device, as soon as the sound level falls below the threshold, the signal flow stops. The gate is usually used to suppress noise during pauses. On some models you can make the sound fade out gradually rather than stop abruptly when the level crosses the threshold; in that case the decay rate is set with the Decay knob.

A gate, like other types of compressors, can be frequency-dependent (i.e. treat certain frequency bands differently) and can operate from a side-chain (see below).

Compressor working principle

The signal entering the compressor is split into two copies. One copy is sent to an amplifier whose gain is controlled by an external signal; the second copy is used to form that control signal. It goes into a circuit called the side-chain, where its level is measured, and from these measurements an envelope is built that describes the change in its volume.
This is how most modern compressors are arranged; this is the so-called feed-forward type. In older (feedback-type) devices, the signal level is measured after the amplifier.
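The feed-forward scheme can be sketched in a few lines of Python. This is an illustrative model only, not any particular device: the side-chain measures the level of one copy of the signal and computes a gain that is applied to the other copy. For simplicity it reacts instantaneously, without the attack/release smoothing described later.

```python
import numpy as np

def feed_forward_compress(signal, threshold_db=-20.0, ratio=4.0):
    """Illustrative feed-forward compressor: the side-chain measures the
    input level in dB and derives a gain that is applied to the main copy.
    No attack/release smoothing is modelled here."""
    eps = 1e-10                                          # avoid log of zero
    level_db = 20 * np.log10(np.abs(signal) + eps)       # side-chain: measure level
    over_db = np.maximum(level_db - threshold_db, 0.0)   # overshoot above threshold
    gain_db = -over_db * (1.0 - 1.0 / ratio)             # gain reduction in dB
    return signal * 10 ** (gain_db / 20)                 # apply gain to main copy
```

A full-scale sample (0 dB) with a −20 dB threshold and 4:1 ratio is 20 dB over the threshold, so it receives 15 dB of gain reduction, while samples below the threshold pass unchanged.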

There are various analogue variable-gain amplification technologies, each with its own advantages and disadvantages: tube, optical (using photoresistors), and transistor. When working with digital audio (in a sound editor or DAW), purely mathematical algorithms can be used, or the behaviour of analogue circuits can be emulated.

Basic parameters of compressors

Threshold

The compressor reduces the level of the audio signal if its amplitude exceeds a certain threshold value. The threshold is usually specified in decibels; a lower threshold (e.g. −60 dB) means that more of the signal will be processed than with a higher threshold (e.g. −5 dB).

Ratio

The amount of level reduction is determined by the ratio parameter: a ratio of 4:1 means that if the input level is 4 dB above the threshold, the output level will be 1 dB above the threshold.
For instance:
Threshold = −10 dB
Input signal = −6 dB (4 dB above the threshold)
Output signal = −9 dB (1 dB above the threshold)
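The arithmetic of the example can be expressed as a small function. This is a sketch of the ideal static curve; the function name is ours, not a standard API.

```python
def compressed_level(input_db, threshold_db, ratio):
    """Output level (dB) of an ideal compressor: above the threshold the
    overshoot is divided by the ratio; below it the signal is untouched."""
    if input_db <= threshold_db:
        return input_db
    return threshold_db + (input_db - threshold_db) / ratio

# The example from the text: threshold -10 dB, ratio 4:1, input -6 dB
print(compressed_level(-6, -10, 4))  # → -9.0
```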

It is important to keep in mind that gain reduction continues for some time after the signal falls below the threshold; this time is determined by the release parameter.

Compression with a maximum ratio of ∞:1 is called limiting. It means that any signal above the threshold is attenuated down to the threshold level (except for a short period after a sudden increase in input volume). See Brick wall limiting below for details.

Examples of different Ratio values

Attack and Release

The compressor provides some control over how quickly it responds to changes in signal dynamics. The Attack parameter determines the time it takes for the compressor to reduce the gain to the level determined by the Ratio parameter. Release determines the time over which the compressor restores the gain, or returns it to normal once the input level falls below the threshold.

Attack and Release phases

These parameters indicate the time (usually in milliseconds) required to change the gain by a certain number of decibels, usually 10 dB. For example, if Attack is set to 1 ms, it will take 1 ms to decrease the gain by 10 dB and 2 ms to decrease it by 20 dB.

In many compressors the Attack and Release parameters are adjustable, but in some they are preset and cannot be changed. They are then sometimes described as "automatic" or "program-dependent", i.e. varying with the input signal.
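Assuming, as described above, that the gain moves at a rate of 10 dB per attack time (when reducing gain) or per release time (when restoring it), one smoothing step could be sketched like this. Real compressors typically use exponential smoothing; this linear model is only meant to illustrate the timing convention.

```python
def smooth_gain(target_db, current_db, attack_ms, release_ms, dt_ms):
    """One smoothing step: move the current gain toward the target at
    10 dB per attack time (falling) or per release time (rising)."""
    if target_db < current_db:                    # gain must fall: attack phase
        step = 10.0 * dt_ms / attack_ms
        return max(target_db, current_db - step)
    step = 10.0 * dt_ms / release_ms              # gain recovers: release phase
    return min(target_db, current_db + step)

# Attack = 1 ms: after 1 ms the gain has fallen by 10 dB, as in the text
print(smooth_gain(-20.0, 0.0, 1.0, 100.0, 1.0))  # → -10.0
```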

Knee

One more compressor parameter is the hard/soft knee. It determines whether compression sets in abruptly or gradually. A soft knee makes the transition from dry to compressed signal less noticeable, especially at high ratios and on sudden increases in volume.

Hard Knee and Soft Knee Compression

Peak and RMS

The compressor can respond either to peak (short-term maximum) values or to the average input level. Reacting to peaks can lead to sharp fluctuations in the amount of gain reduction, and even to distortion. Therefore compressors usually apply an averaging function (most often RMS) to the input signal when comparing it with the threshold. This gives more comfortable compression, closer to the human perception of loudness.

RMS is a parameter that reflects the average loudness of a recording. Mathematically, RMS (Root Mean Square) is the square root of the mean of the squared amplitudes of a given number of samples:

RMS = √((x₁² + x₂² + … + xN²) / N)
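The formula can be computed directly; a minimal sketch:

```python
import math

def rms(samples):
    """Root mean square of a block of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

# A full-scale square wave has an RMS of 1.0
print(rms([1.0, -1.0, 1.0, -1.0]))  # → 1.0
```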

Stereo linking

A compressor in stereo-linked mode applies the same gain to both stereo channels. This avoids shifts of the stereo image that can result from processing the left and right channels independently; such a shift occurs, for example, when a loud element is panned off-centre.

Makeup gain

Since the compressor reduces the overall signal level, a fixed output gain control (makeup gain) is usually added to bring the signal back to the optimum level.

Look-ahead

The look-ahead feature is designed to solve the problems of both too-slow and too-fast Attack and Release settings. An attack time that is too long fails to catch transients effectively, while one that is too short can sound uncomfortable to the listener. With look-ahead, the main signal is delayed relative to the control signal, which allows compression to begin early, even before the signal reaches the threshold.
The only drawback of this method is the delay it introduces, which is undesirable in some cases.

Using dynamic compression

Compression is used everywhere: not only in music recordings, but wherever the overall volume needs to be raised without increasing peak levels, or wherever inexpensive reproduction equipment or a limited transmission channel is used (public-address and communication systems, amateur radio, and so on).

Compression is used when playing background music (in shops, restaurants, etc.) where any noticeable changes in volume are undesirable.

But the most important application for dynamic compression is in music production and broadcasting. Compression is used to give the sound "density" and "drive", for better combination of instruments with each other, and especially when processing vocals.

Vocals in rock and pop music are usually compressed to make them stand out from the accompaniment and to add clarity. A special kind of compressor tuned to specific frequencies, the de-esser, is used to suppress sibilant phonemes.

In instrumental parts, compression is also used for effects that are not directly related to volume, for example, quickly decaying drum sounds can become longer.

Side-chaining (see below) is often used in electronic dance music (EDM): for example, the bass line can be keyed by the kick drum to keep the bass and kick from clashing and to create a rhythmic pumping effect.

Compression is widely used in broadcasting (radio, television, webcasting) to increase perceived loudness while reducing the dynamic range of the source audio (usually a CD). Most countries have legal limits on the instantaneous peak level that may be broadcast, and these limits are usually enforced by permanent hardware compressors in the broadcast chain. In addition, raising the perceived loudness improves the "quality" of the sound in the opinion of most listeners.

see also Loudness war.

Sequentially increasing the volume of the same song remastered for CD from 1983 to 2000.

Side-chaining

Another common compressor mode is the side-chain. In this mode the sound is compressed not according to its own level, but according to the level of the signal fed into a connector that is usually called the side-chain input.

There are several uses for this. For example, a vocalist lisps, and all the "s" sounds jump out of the mix. You pass the voice through the compressor, and into the side-chain connector you feed the same signal passed through an equalizer, on which you remove all frequencies except those the vocalist uses when pronouncing "s" (usually around 5 kHz, though anywhere from 3 kHz to 8 kHz). If you then put the compressor into side-chain mode, the voice is compressed only at the moments when an "s" is pronounced. This is how the device known as the de-esser came about; this way of working is called frequency-dependent.

Another use of this feature is called a "ducker". For example, at a radio station the music goes through the compressor while the DJ's voice goes through the side-chain: when the DJ starts talking, the music volume is automatically reduced. The same effect can be put to good use in recording, for example to lower the volume of keyboard parts while the vocalist is singing.
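A crude illustration of the ducking idea follows. The threshold and attenuation values are arbitrary, and a real ducker smooths the gain over time rather than switching it per sample; this sketch only shows the side-chain controlling the gain applied to a different signal.

```python
import numpy as np

def ducker(music, voice, threshold=0.05, duck_gain=0.3):
    """When the side-chain signal (voice) exceeds the threshold, the
    music is attenuated by duck_gain; otherwise it passes unchanged."""
    gains = np.where(np.abs(voice) > threshold, duck_gain, 1.0)
    return music * gains
```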

Brick wall limiting

The compressor and the limiter work in roughly the same way; one can say that a limiter is a compressor with a high ratio (10:1 or more) and, usually, a short attack time.

There is also the concept of brick-wall limiting: limiting with a very high ratio (20:1 and above) and a very fast attack. Ideally it does not let the signal exceed the threshold at all. The result can be unpleasant to the ear, but it prevents damage to reproduction equipment and keeps the signal within the channel's capacity. Many manufacturers build limiters into their devices for exactly this purpose.
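An idealized brick-wall limiter with instantaneous attack can be sketched as follows. Real limiters use look-ahead and gain smoothing to avoid the harsh distortion this naive per-sample version produces; with zero attack time it effectively degenerates into the clipping discussed next.

```python
def hard_limit(samples, ceiling=0.9):
    """Idealized brick-wall limiter with instantaneous attack: for each
    sample the gain is reduced just enough that the output never exceeds
    the ceiling."""
    out = []
    for x in samples:
        gain = min(1.0, ceiling / abs(x)) if x != 0 else 1.0
        out.append(x * gain)
    return out
```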

Clipper vs. Limiter, soft and hard clipping

The second part of the series is devoted to image dynamic range optimization functions. In it we explain why such solutions are needed and examine various implementations, along with their advantages and disadvantages.

Embrace the immensity

Ideally, the camera should capture the image of the surrounding world the way a person perceives it. However, because the "vision" mechanisms of the camera and the human eye differ significantly, a number of limitations prevent this.

One of the problems that film-camera users faced before and digital-camera owners face now is the impossibility of adequately capturing scenes with a large difference in illumination without special devices and/or special shooting techniques. The human visual system perceives detail equally well in both the brightly lit and the dark areas of high-contrast scenes. Unfortunately, the camera sensor is not always able to capture the image as we see it.

The greater the difference in brightness in the photographed scene, the higher the likelihood of loss of detail in highlights and / or shadows. As a result, instead of a blue sky with lush clouds, only a whitish spot appears in the picture, and objects located in the shadows turn into indistinct dark silhouettes or completely merge with the surrounding environment.

In classical photography, the ability of a camera (or of the film in film cameras) to render a certain range of brightness is assessed using the concept of photographic latitude (see the sidebar for details). Theoretically, the photographic latitude of a digital camera is determined by the bit depth of its analog-to-digital converter (ADC). For example, with an 8-bit ADC, allowing for quantization error, the theoretically achievable photographic latitude is 7 EV; for a 12-bit ADC it is 11 EV, and so on. However, in real devices the dynamic range of images turns out to be below this theoretical maximum because of various kinds of noise and other factors.
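The bit-depth arithmetic used in the examples above (one EV per ADC bit, minus one EV for quantization error) is trivial to express; the function name is ours:

```python
def theoretical_latitude_ev(adc_bits):
    """Theoretical photographic latitude in EV: one EV per ADC bit,
    minus one EV allowing for quantization error, matching the
    examples in the text."""
    return adc_bits - 1

print(theoretical_latitude_ev(8))   # → 7
print(theoretical_latitude_ev(12))  # → 11
```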

Large differences in brightness pose a serious problem when shooting. Here the camera's capabilities were not enough to adequately render the brightest areas of the scene, and instead of the blue sky (marked with hatching) the picture shows a white "patch".

The maximum brightness a light-sensitive sensor can record is determined by the saturation level of its cells. The minimum depends on several factors, including the sensor's thermal noise, charge-transfer noise, and ADC error.

It is also worth noting that the photographic latitude of the same digital camera can vary with the sensitivity set in its settings. The maximum dynamic range is achieved at the so-called base sensitivity (the lowest available numerical value). As this parameter is increased, the dynamic range shrinks because of the rising noise level.

The photographic latitude of modern digital cameras equipped with large sensors and 14- or 16-bit ADCs ranges from 9 to 11 EV, which is significantly more than that of 35 mm colour negative film (on average 4 to 5 EV). Thus even relatively inexpensive digital cameras have enough photographic latitude to adequately render most typical amateur subjects.

However, there is a problem of a different kind, connected with the restrictions imposed by existing standards for recording digital images. With the JPEG format at 8 bits per colour channel (now the de facto standard for storing digital images in the computer industry and consumer electronics), it is not even theoretically possible to save a picture with a photographic latitude of more than 8 EV.

Suppose the camera's ADC produces a 12- or 14-bit image containing distinguishable detail in both highlights and shadows. If the photographic latitude of this image exceeds 8 EV, then converting it to the standard 8-bit format without any additional processing (that is, simply discarding the "extra" bits) will lose some of the information recorded by the sensor.

Dynamic range and photographic latitude

To put it simply, the dynamic range is defined as the ratio of the maximum value of the image brightness to its minimum value. In classical photography, the term photographic latitude is traditionally used, which essentially means the same thing.

The width of the dynamic range can be expressed as a ratio (for example, 1000:1, 2500:1, etc.), but most often a logarithmic scale is used. In this case the decimal logarithm of the ratio of maximum to minimum brightness is calculated, and the number is followed by a capital D (from "density") or, less often, by the abbreviation OD (optical density). For example, if the ratio of the maximum brightness to the minimum for some device is 1000:1, its dynamic range is log₁₀ 1000 = 3.0 D.

Photographic latitude is traditionally measured in exposure units, denoted EV (exposure values; professionals often call them "stops" or "steps"). It is in these units that exposure compensation is usually set in camera menus. Increasing the photographic latitude by 1 EV is equivalent to doubling the difference between the maximum and minimum brightness levels. The EV scale is therefore also logarithmic, but the base-2 logarithm is used to calculate its values. For example, if a device can capture images in which the ratio of maximum to minimum brightness reaches 256:1, its photographic latitude is log₂ 256 = 8 EV.
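Both logarithmic scales from this sidebar can be computed directly; a minimal sketch (the function names are ours):

```python
import math

def dynamic_range_d(contrast_ratio):
    """Dynamic range in density units: decimal log of max/min brightness."""
    return math.log10(contrast_ratio)

def latitude_ev(contrast_ratio):
    """Photographic latitude in EV: base-2 log of max/min brightness."""
    return math.log2(contrast_ratio)

print(dynamic_range_d(1000))  # → 3.0
print(latitude_ev(256))       # → 8.0
```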

Compression is a smart compromise

The most efficient way to preserve all the image information captured by the camera's sensor is to record in RAW format. However, not all cameras offer this function, and not every amateur photographer is prepared to do the painstaking work of choosing individual settings for every shot.

To reduce the likelihood of losing detail in high-contrast images converted inside the camera to 8-bit JPEG, many manufacturers (of both compact and SLR cameras) have introduced special functions that compress the dynamic range of saved pictures without user intervention. At the cost of reduced overall contrast and the loss of an insignificant part of the original information, such solutions preserve in the 8-bit JPEG the highlight and shadow detail recorded by the sensor, even when the dynamic range of the original image is wider than 8 EV.

One of the pioneers in this field was HP. The HP Photosmart 945 digital camera, launched in 2003, introduced the world's first Adaptive Lighting technology, which automatically compensates for the lack of light in dark areas of a picture and thus preserves shadow detail without the risk of overexposure (especially useful when shooting high-contrast scenes). The Adaptive Lighting algorithm is based on principles set out by the scientist Edwin Land in his Retinex theory of human visual perception.

HP Adaptive Lighting menu

How does Adaptive Lighting work? After a 12-bit image of the scene is captured, an auxiliary monochrome image is derived from it, which is in effect a lightness map. During processing this map is used as a mask controlling the strength of a rather complex digital filter applied to the picture: in areas corresponding to the darkest points of the map the filter's effect is minimal, and vice versa. This approach brings out detail in the shadows by selectively brightening those areas and, accordingly, lowering the overall contrast of the resulting image.

Note that when Adaptive Lighting is enabled, the captured image is processed as described above before the final file is written. All these operations are performed automatically; the user can only choose one of two Adaptive Lighting levels (low or high) in the camera menu, or disable the function.

Generally speaking, many specialized functions of modern digital cameras (including the face-detection systems discussed in the previous article) are by-products or conversions of research projects originally carried out for military customers. Where image dynamic range optimization is concerned, one of the best-known providers of such solutions is Apical. Its algorithms underlie, in particular, the SAT (Shadow Adjustment Technology) function implemented in a number of Olympus digital cameras. Briefly, SAT works as follows: a mask corresponding to the darkest areas is created from the original image, and the exposure value for those areas is then corrected automatically.

Sony has also licensed Apical's technology. Many Cyber-shot compact cameras and Alpha-series DSLRs have a so-called Dynamic Range Optimizer (DRO) feature.

Photos taken with the HP Photosmart R927 with Adaptive Lighting disabled (top) and enabled

When DRO is activated, the picture is corrected during the initial processing of the image (that is, before the finished JPEG file is written). In the basic version DRO has a two-step setting (standard or advanced mode can be selected in the menu). In standard mode, the exposure value is corrected based on an analysis of the captured image, and a tone curve is then applied to even out the overall balance. Advanced mode uses a more sophisticated algorithm that can make corrections in both shadows and highlights.

Sony's developers are constantly improving the DRO algorithm. In the a700 SLR, for example, advanced DRO offers a choice of five correction options, and it is possible to save three variants of one shot at once (a kind of bracketing) with different DRO settings.

Many Nikon digital cameras are equipped with D-Lighting, also based on Apical algorithms. Unlike the solutions described above, however, D-Lighting is implemented as a filter for processing already-saved pictures, using a tone curve whose shape lightens the shadows while keeping the rest of the image unchanged. But since it operates on finished 8-bit images (rather than on the original frame data with its higher bit depth and, accordingly, wider dynamic range), the possibilities of D-Lighting are very limited; the user could get the same result by processing the picture in an image editor.

When the enlarged fragments are compared, it is clearly visible that the dark areas of the original image (left) become lighter when Adaptive Lighting is enabled

There are also solutions based on other principles. Many cameras in Panasonic's Lumix family (in particular the DMC-FX35, DMC-TZ4, DMC-TZ5, DMC-FS20, DMC-FZ18, etc.) implement the Intelligent Exposure function, an integral part of the iA (Intelligent Auto) shooting control system. Intelligent Exposure relies on automatic analysis of the frame and correction of its dark areas to avoid loss of shadow detail, as well as, if necessary, compression of the dynamic range of high-contrast scenes.

In some cases the dynamic range optimization function involves not only processing the captured image but also adjusting the shooting settings. For example, new Fujifilm digital cameras (in particular the FinePix S100FS) implement the Wide Dynamic Range (WDR) function, which, according to the developers, increases photographic latitude by one or two stops (200% and 400% in the terminology of the settings).

When WDR is activated, the camera shoots with exposure compensation of −1 or −2 EV (depending on the setting selected). The frame is thus underexposed, which is necessary to preserve maximum detail in the highlights. The resulting image is then processed with a tone curve that evens out the overall balance and adjusts the black level, after which it is converted to 8-bit form and saved as a JPEG file.
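The WDR idea (underexpose to protect the highlights, then lift the rest with a tone curve) can be sketched very roughly. The gamma-style curve below is chosen purely for illustration; Fujifilm's actual curve is not published, and the function name is ours.

```python
import numpy as np

def wdr_simulation(scene_linear, stops_under=1):
    """Rough sketch of the WDR idea: capture with negative exposure
    compensation so highlights do not clip, then brighten the result
    with a gamma-like tone curve (illustrative, not the real curve)."""
    # Underexposed capture: halve the exposure per stop, clip at sensor full scale
    captured = np.clip(scene_linear / 2 ** stops_under, 0.0, 1.0)
    # Tone curve lifting shadows and mid-tones back up
    return captured ** (1 / (1 + stops_under))
```

A scene value of 2.0, which would clip at normal exposure, is captured at exactly full scale with one stop of underexposure and survives in the output.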

Dynamic range compression preserves more detail in highlights and shadows, but an inevitable consequence is a reduction in overall contrast. In the bottom image the texture of the clouds is rendered much better, but because of the lower contrast this version of the picture looks less natural

A similar function called Dynamic Range Enlargement is implemented in a number of compact and SLR cameras from Pentax (Optio S12, K200D, etc.). According to the manufacturer, the use of the Dynamic Range Enlargement function allows you to increase the photographic latitude by 1 EV without losing detail in highlights and shadows.

A similar function called Highlight tone priority (HTP) is implemented in a number of Canon DSLR models (EOS 40D, EOS 450D, etc.). According to the information in the user manual, activating HTP can improve the detail in highlights (more specifically, in the range of levels from 0 to 18% gray).

Conclusion

Let's summarize. The built-in dynamic range compression function makes it possible to convert an original image with a wide dynamic range into an 8-bit JPEG file with minimal losses. In the absence of RAW recording, it allows the photographer to exploit the camera's potential more fully when shooting high-contrast scenes.

Of course, keep in mind that dynamic range compression is not a magic bullet but a compromise: preserving detail in highlights and/or shadows is paid for by increased noise in the dark areas of the image, reduced contrast, and somewhat coarsened smooth tonal transitions.

Like any automatic function, the dynamic range compression algorithm is not a universal solution for improving absolutely any picture, so it makes sense to enable it only when really necessary. For example, in order to capture a silhouette against a well-rendered background, the dynamic range compression function must be turned off; otherwise the striking scene will be hopelessly ruined.

In concluding this topic, it should be noted that dynamic range compression cannot "stretch" out of the final image details that the camera's sensor never recorded. Getting a satisfactory result when shooting high-contrast scenes requires additional equipment (for example, graduated filters for landscapes) or special techniques (such as shooting several exposure-bracketed frames and then merging them into one image using tone mapping).

The next article will focus on the burst function.

To be continued


Records, especially older ones recorded and produced before 1982, were much less often compressed and made louder. They reproduce music with a natural dynamic range that is preserved on vinyl and lost in most standard digital or high-definition releases.

There are, of course, exceptions: listen to the recently released Steven Wilson album, or recordings from MA Recordings or Reference Recordings, and you will hear how good digital sound can be. But this is rare; most modern recordings are loud and compressed.

Compression of music has come in for serious criticism lately, but I would argue that almost all of your favourite recordings are compressed. Some less, some more, but compressed all the same. Dynamic range compression is a scapegoat blamed for poor-sounding music, yet heavily compressed music is not a new trend: listen to Motown albums from the '60s. The same can be said of the Led Zeppelin classics or recent albums by Wilco and Radiohead. Dynamic range compression reduces the natural difference between the loudest and quietest sounds in a recording, so a whisper can be as loud as a scream. It is quite hard to find pop music from the past 50 years that has not been compressed.

I recently had a pleasant chat with Tape Op founder and editor Larry Crane about the good, the bad, and the ugly sides of compression. Larry Crane has worked with artists such as Stephen Malkmus, Cat Power, Sleater-Kinney, Jenny Lewis, M. Ward, The Go-Betweens, Jason Lytle, Elliott Smith, Quasi, and Richmond Fontaine. He also runs Jackpot! Recording Studio in Portland, Oregon, which has hosted The Breeders, The Decemberists, Eddie Vedder, Pavement, R.E.M., She & Him, and many, many more.

As an example of surprisingly unnatural-sounding but still great songs, I cite Spoon's album They Want My Soul, released in 2014. Crane laughs and says he listens to it in the car because it sounds great there. Which brings us to another answer to the question of why music is compressed: compression and the extra "clarity" make it easier to hear in noisy places.

Larry Crane at work. Photo by Jason Quigley

When people say they like the sound of a recording, I believe they like the music, as if sound and music were inseparable. For myself, though, I separate the two concepts. From a music lover's point of view, the sound can be rough and raw, but that won't matter to most listeners.

Many are quick to accuse mastering engineers of overusing compression, but compression is applied during recording, during mixing, and only then during mastering. Unless you were present at each of these stages, you cannot tell how the instruments and vocals sounded at the very beginning of the process.

Crane is emphatic: "If a musician deliberately wants an insane, distorted sound like the Guided by Voices recordings, there is nothing wrong with that; the intent always outweighs the sound quality." A singer's voice is almost always compressed, and the same goes for bass, drums, guitars, and synthesizers. Compression keeps the vocal at the desired level throughout the song, or lets it stand out slightly from the rest of the mix.

The right compression can make a drum sound livelier, or intentionally strange. For music to sound great, you have to know how to use the tools it requires, which is why it takes years to learn how to use compression without overdoing it. If the mix engineer has compressed a guitar part too heavily, the mastering engineer can no longer fully restore the lost frequencies.

If musicians wanted you to hear music that had not been through mixing and mastering, they would ship it to the stores straight from the studio. Crane says the people who record, edit, mix, and master music aren't there to get in the artists' way; they have been helping artists from the start, for over a century.

These people are part of the creative process that produces amazing works of art. Crane adds: "You don't want a version of Dark Side of the Moon that hasn't been mixed and mastered." Pink Floyd released the album the way they wanted it to be heard.


Dynamic range, or the photographic latitude of a photographic material, is the ratio between the maximum and minimum exposure values that can be correctly captured in a picture. Applied to digital photography, dynamic range is essentially equivalent to the ratio of the maximum and minimum possible values of the useful electrical signal generated by the photosensor during exposure.

Dynamic range is measured in exposure stops (EV). Each stop corresponds to a doubling of the amount of light. So, for example, if a camera has a dynamic range of 8 EV, the maximum possible useful signal of its sensor relates to the minimum as 2⁸:1, meaning that within one frame the camera can capture objects differing in brightness by no more than 256 times. More precisely, it can capture objects of any brightness, but objects brighter than the maximum allowable value will come out dazzling white in the picture, and objects darker than the minimum value will come out coal black. Details and texture will be discernible only in objects whose brightness falls within the camera's dynamic range.

To describe the relationship between the brightness of the lightest and the darkest of the objects being shot, the incorrect term "scene dynamic range" is often used. It would be more correct to talk about the brightness range or the contrast level, since the dynamic range is usually a characteristic of the measuring device (in this case, the matrix of a digital camera).

Unfortunately, the brightness range of many of the beautiful scenes we encounter in real life can significantly exceed the dynamic range of a digital camera. In such cases, the photographer is forced to decide which subjects need to be worked out in full detail and which can be left out of the dynamic range without compromising the creative intent. In order to make the most of the dynamic range of your camera, you may sometimes need to develop an artistic flair rather than a thorough understanding of how a photosensor works.

Dynamic range limiting factors

The lower limit of the dynamic range is set by the noise level of the photosensor. Even an unlit matrix generates a background electrical signal called dark noise. Interference also arises when the charge is transferred to the analog-to-digital converter, and the ADC itself introduces a certain error into the digitized signal, the so-called sampling noise.

If you take a picture in complete darkness or with a lens cap, the camera will only record this meaningless noise. If you allow a minimal amount of light to hit the sensor, the photodiodes will begin to build up an electrical charge. The amount of charge, and hence the intensity of the useful signal, will be proportional to the number of captured photons. In order to show at least some meaningful detail in the image, it is necessary that the level of the useful signal exceeds the level of the background noise.

Thus, the lower limit of the dynamic range, or, in other words, the sensitivity threshold of the sensor, can formally be defined as the level of the output signal at which the signal-to-noise ratio is greater than unity.

The upper limit of the dynamic range is determined by the capacitance of an individual photodiode. If, during exposure, any photodiode accumulates an electric charge of a maximum value for itself, then the image pixel corresponding to the overloaded photodiode will turn out to be absolutely white, and further irradiation will not affect its brightness in any way. This phenomenon is called clipping. The higher the overload capacity of the photodiode, the more signal it can give at the output before it reaches saturation.
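Taken together, the two limits just described determine the dynamic range directly. A sketch with purely illustrative numbers (not taken from any real sensor):

```python
import math

def dynamic_range_ev(full_well: float, noise_floor: float) -> float:
    """Technical dynamic range in stops: log2 of the saturation level over the noise floor."""
    return math.log2(full_well / noise_floor)

# Illustrative values only: saturation at 32768 signal units, noise floor at 4 units.
print(dynamic_range_ev(32768, 4))  # 13.0
```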

For clarity, let's turn to the characteristic curve, which is a graph of the dependence of the output signal on exposure. The horizontal axis represents the binary logarithm of the radiation received by the sensor, and the vertical axis represents the binary logarithm of the electrical signal generated by the sensor in response to this radiation. My drawing is largely arbitrary and for illustrative purposes only. The characteristic curve of a real photosensor has a slightly more complex shape, and the noise level is rarely so high.

The graph clearly shows two critical breaking points: in the first of them, the level of the useful signal crosses the noise threshold, and in the second, the photodiodes reach saturation. Exposure values that lie between these two points make up the dynamic range. In this abstract example it is, as is easy to see, equal to 5 EV, i.e. the camera can handle five doublings of exposure, which is equivalent to a 32-fold (2⁵ = 32) difference in brightness.

The exposure zones that make up the dynamic range are not equal. The upper zones have a higher signal-to-noise ratio and therefore look cleaner and more detailed than the lower ones. As a result, the upper limit of the dynamic range is very real and tangible: clipping cuts off the highlights at the slightest overexposure, while the lower limit imperceptibly drowns in noise, and the transition to black is far from being as sharp as the transition to white.

The linear dependence of the signal on the exposure, as well as the abrupt plateau at saturation, are unique features of the digital photographic process. For comparison, take a look at a conventional characteristic curve of traditional photographic film.

The shape of the curve, and especially the angle of its slope, depend strongly on the type of film and on its development procedure, but the main, striking difference between the film curve and the digital one remains unchanged: the nonlinear dependence of the film's optical density on the exposure value.

The lower limit of the photographic latitude of a negative film is determined by the density of the fog, and the upper limit by the maximum achievable optical density of the photographic layer; for reversal films, the opposite is true. Both in the shadows and in the highlights the characteristic curve bends smoothly, indicating a drop in contrast near the boundaries of the dynamic range, since the slope of the curve is proportional to the contrast of the image. Thus, the exposure zones in the middle of the graph have maximum contrast, while contrast in the highlights and shadows is reduced. In practice, the difference between film and a digital matrix is especially noticeable in the highlights: where in a digital image the lights are burnt out by clipping, on film details are still visible, albeit with low contrast, and the transition to pure white looks smooth and natural.

In sensitometry two independent terms are even used: photographic latitude proper, limited to the relatively linear portion of the characteristic curve, and useful photographic latitude, which in addition to the linear section also includes the toe and shoulder of the curve.

It is noteworthy that when processing digital photographs, as a rule, a more or less pronounced S-shaped curve is applied to them, increasing the contrast in half-tones at the cost of reducing it in shadows and highlights, which gives the digital image a more natural and pleasing look to the eye.

Bit depth

Unlike the matrix of a digital camera, human vision perceives the world, so to speak, logarithmically. Successive doublings of the amount of light are perceived by us as equal changes in brightness. Exposure stops can even be compared to musical octaves, since two-fold changes in the frequency of a sound are perceived by ear as the same musical interval. Our other senses work on the same principle. This nonlinearity of perception greatly expands the range of human sensitivity to stimuli of varying intensity.

When a RAW file containing linear data is converted (whether in-camera or in a RAW converter), a so-called gamma curve is applied to it. This curve non-linearly increases the brightness of the digital image, bringing it into line with the characteristics of human vision.

With linear conversion, the image is too dark.

After gamma correction, the brightness returns to normal.

The gamma curve stretches the dark tones and compresses the light ones, making the distribution of gradations more even. As a result, the image looks natural, but noise and sampling artifacts in the shadows inevitably become more noticeable, which is only exacerbated by the small number of brightness levels in the lower zones.

Linear distribution of brightness gradations.
Even distribution after applying the gamma curve.
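A minimal sketch of such a gamma curve (the exponent 1/2.2 is a common convention; real converters use more elaborate tone curves):

```python
def apply_gamma(linear: float, gamma: float = 2.2) -> float:
    """Map a linear sensor value in [0, 1] to display space.
    Dark tones are stretched, light tones are compressed."""
    return linear ** (1.0 / gamma)

for v in (0.01, 0.18, 0.5, 1.0):
    print(f"{v:.2f} -> {apply_gamma(v):.3f}")
```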

ISO and dynamic range

Despite the fact that digital photography uses the same concept of photosensitivity of photographic material as in film photography, it should be understood that this happens solely by virtue of tradition, since approaches to changing photosensitivity in digital and film photography differ fundamentally.

Increasing ISO sensitivity in traditional photography means replacing one film with another with a larger grain, i.e. there is an objective change in the properties of the photographic material itself. In a digital camera, the light sensitivity of the sensor is rigidly set by its physical characteristics and cannot be literally changed. When the ISO is raised, the camera does not change the real sensitivity of the sensor, but only amplifies the electrical signal generated by the sensor in response to radiation and adjusts the algorithm for digitizing this signal accordingly.

An important consequence of this is a decrease in the effective dynamic range in proportion to the increase in ISO, because noise is amplified along with the useful signal. If at ISO 100 the entire range of signal values is digitized, from zero to the saturation point, then at ISO 200 only half of the photodiode capacity is taken as the maximum. With each doubling of ISO sensitivity, the top stop of the dynamic range is cut off, and the remaining stops are pulled up in its place. This is why using ultra-high ISO values is meaningless: you could just as well lighten the photo in a RAW converter and get a comparable noise level. The difference between raising the ISO and artificially brightening the image is that when the ISO is raised, the signal is amplified before it enters the ADC, so quantization noise is not amplified along with it, whereas in a RAW converter everything is amplified, including the ADC errors. In addition, reducing the digitized range means more accurate quantization of the remaining input signal values.
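In idealized form, this effect can be written down in a couple of lines (a model only; real sensors do not follow it exactly):

```python
import math

def dr_at_iso(base_dr_ev: float, iso: int, base_iso: int = 100) -> float:
    """Each doubling of ISO above the base value trims roughly one stop off the top of the range."""
    return base_dr_ev - math.log2(iso / base_iso)

print(dr_at_iso(12.0, 100))  # 12.0
print(dr_at_iso(12.0, 800))  # 9.0 (three doublings, three stops lost)
```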

By the way, lowering the ISO below the base value (for example, to ISO 50), which is available on some devices, does not expand the dynamic range at all, but simply attenuates the signal by half, which is equivalent to darkening a picture in a RAW converter. This feature can even be viewed as detrimental, since using a sub-minimum ISO setting causes the camera to increase exposure, which, if the sensor saturation threshold remains unchanged, increases the risk of clipping in highlights.

True dynamic range value

There are a number of programs (DxO Analyzer, Imatest, RawDigger, etc.) that allow you to measure the dynamic range of a digital camera at home. In principle, this is not really necessary, since data for most cameras can be freely found on the Internet, for example, on the DxOMark.com website.

Should we believe the results of such tests? Generally, yes, with the one proviso that all these tests determine the effective or, if I may say so, technical dynamic range, i.e. the ratio between the saturation level and the noise level of the matrix. What matters first of all to the photographer is the useful dynamic range, i.e. the number of exposure zones that really allow you to capture useful information.

As you remember, the lower threshold of the dynamic range is set by the noise level of the photosensor. The problem is that in practice the lower zones, which are formally already included in the dynamic range, still contain too much noise to be meaningfully used. Much depends on individual tolerance: everyone determines the acceptable noise level for himself.

My subjective opinion is that shadow details start to look more or less decent with a signal-to-noise ratio of at least eight. On this basis, I define my useful dynamic range as the technical dynamic range minus about three stops.

For example, if a DSLR camera has a dynamic range of 13 EV according to reliable tests, which is very good by today's standards, then its useful dynamic range will be about 10 EV, which, in general, is also quite good. Of course, we are talking about shooting in RAW, with a minimum ISO and maximum bit depth. When shooting in JPEG, the dynamic range is highly dependent on the contrast settings, but on average, two or three stops should be dropped.
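This rule of thumb is easy to express numerically: an acceptable signal-to-noise ratio of 8 costs log2(8) = 3 stops. A sketch of the author's estimate:

```python
import math

def useful_dr_ev(technical_dr_ev: float, min_snr: float = 8.0) -> float:
    """Useful dynamic range: technical range minus the stops whose SNR falls below min_snr."""
    return technical_dr_ev - math.log2(min_snr)

print(useful_dr_ev(13.0))  # 10.0, matching the example in the text
```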

For comparison: color reversible photographic films have a useful photographic latitude of 5-6 stops; black-and-white negative films give 9-10 steps with standard development and printing procedures, and with certain manipulations - up to 16-18 steps.

Summing up the above, let's try to formulate a few simple rules, the observance of which will help you squeeze the maximum performance out of your camera's sensor:

  • The dynamic range of a digital camera is fully available only when shooting in RAW.
  • Dynamic range decreases with increasing sensitivity, so avoid high ISO values ​​unless absolutely necessary.
  • Using a higher bit depth for RAW files does not increase the true dynamic range, but it does improve tonal separation in the shadows thanks to the larger number of brightness levels.
  • Expose to the right. The upper exposure zones always contain the maximum of useful information with the minimum of noise and should be used most effectively. At the same time, do not forget about the danger of clipping: pixels that have reached saturation are absolutely useless.

Best of all, don't worry too much about the dynamic range of your camera. Its dynamic range is fine. Your ability to see light and manage exposure is much more important. A good photographer will not complain about the lack of photographic latitude, but will try to wait for more comfortable lighting, or change the angle, or use the flash; in a word, he will act according to the circumstances. I'll tell you more: some scenes only benefit from not fitting into the dynamic range of the camera. Often an unnecessary abundance of detail simply needs to be hidden in a semi-abstract black silhouette, which makes the photograph both more laconic and richer at the same time.

High contrast is not always a bad thing - you just need to know how to work with it. Learn to exploit the disadvantages of the equipment as well as its merits, and you will be amazed at how much your creativity will expand.

Thank you for your attention!

Vasily A.

Post scriptum

If the article turned out to be useful and informative for you, you can kindly support the project by contributing to its development. If you don't like the article, but you have thoughts on how to make it better, your criticism will be accepted with no less gratitude.

Please be aware that this article is subject to copyright. Reprinting and quoting are permissible provided there is a valid link to the source, and the text used should not be distorted or modified in any way.

At a time when researchers were just starting to solve the problem of creating a speech interface for computers, they often had to make their own equipment that allowed them to enter sound information into a computer, as well as output it from a computer. Today, such devices may only be of historical interest, since modern computers can be easily equipped with audio input and output devices such as sound adapters, microphones, headphones, and speakers.

We will not delve into the details of the internal structure of these devices, but we will talk about how they work and give some recommendations for choosing sound computer devices to work with speech recognition and synthesis systems.

As we said in the previous chapter, sound is nothing more than vibrations of air, the frequency of which lies in the range of frequencies perceived by a person. The exact boundaries of the audible frequency range may vary from person to person, but it is believed that sound vibrations lie in the range of 16-20,000 Hz.

The task of the microphone is to convert sound vibrations into electrical vibrations, which can be further amplified, filtered to remove interference and digitized for inputting sound information into a computer.

According to the principle of operation, the most common microphones are divided into carbon, electrodynamic, condenser and electret. Some of these microphones require an external current source for their operation (for example, carbon and condenser microphones), while others, under the influence of sound vibrations, are able to independently generate an alternating electrical voltage (these are electrodynamic and electret microphones).

You can also separate the microphones according to their purpose. There are studio microphones that you can hold in your hand or clip to a stand, there are radio microphones that you can clip to your clothes, and so on.

There are also microphones designed specifically for computers. These microphones are usually mounted on a stand that sits on top of the table. Computer microphones can be combined with headsets, as shown in Fig. 2-1.

Fig. 2-1. Headphones with microphone

So how do you choose from a variety of microphones the one that is best suited for speech recognition systems?

Basically, you can experiment with any microphone you have, as long as it can be connected to your computer's sound adapter. However, speech recognition systems developers recommend purchasing a microphone that, during operation, will be at a constant distance from the speaker's mouth.

If the distance between the microphone and the mouth does not change, then the average level of the electrical signal from the microphone will also not change too much. This will have a positive impact on the quality of modern speech recognition systems.

What is the problem here?

A person is able to successfully recognize speech, the volume of which varies over a very wide range. The human brain is able to filter out quiet speech from interference, such as the noise of cars passing along the street, extraneous conversations and music.

As for modern speech recognition systems, their abilities in this area leave much to be desired. If the microphone is on a table, then when you turn your head or change the position of your body, the distance between the mouth and the microphone will change. This will change the output level of the microphone, which in turn will impair the reliability of speech recognition.

Therefore, when working with speech recognition systems, the best results will be achieved if you use a microphone attached to the headphones, as shown in Fig. 2-1. When using such a microphone, the distance between the mouth and the microphone will be constant.

Please also note that all experiments with speech recognition systems are best done in a quiet room. In this case, the influence of interference will be minimal. Of course, if you need to choose a speech recognition system that can work in a strong interference environment, then the tests need to be done differently. However, as far as the authors of the book know, the noise immunity of speech recognition systems is still very, very low.

The microphone performs for us the transformation of sound vibrations into vibrations of electric current. These fluctuations can be seen on the oscilloscope screen, but do not rush to the store to purchase this expensive device. We can carry out all oscillographic studies using a regular computer equipped with a sound adapter, for example, a Sound Blaster adapter. We will tell you how to do this later.

In fig. 2-2 we have shown the oscillogram of a sound signal obtained by pronouncing the sustained sound "a". This waveform was obtained using the GoldWave software, which we will discuss later in this chapter, as well as a Sound Blaster sound adapter and a microphone similar to the one shown in Fig. 2-1.

Fig. 2-2. Oscillogram of a sound signal

GoldWave software allows you to stretch the waveform along the time axis, which allows you to see the smallest details. In fig. 2-3 we have shown a stretched fragment of the above-mentioned oscillogram of sound a.

Fig. 2-3. Fragment of an oscillogram of an audio signal

Note that the magnitude of the input signal from the microphone changes periodically and takes on both positive and negative values.

If there was only one frequency in the input signal (that is, if the sound was "clean"), the waveform received from the microphone would be sinusoidal. However, as we have already said, the spectrum of human speech sounds consists of a set of frequencies, as a result of which the shape of the speech signal oscillogram is far from sinusoidal.

A signal whose magnitude changes continuously with time is called an analog signal. This is the kind of signal that comes from the microphone. Unlike an analog signal, a digital signal is a set of numerical values that change discretely over time.

In order for a computer to process an audio signal, it must be converted from analog to digital form, that is, presented as a set of numerical values. This process is called digitizing an analog signal.

Digitization of an audio (or any other analog) signal is performed using a special device called an analog-to-digital converter (ADC). This device is located on the sound adapter board and is an ordinary-looking microcircuit.

How does an analog to digital converter work?

It periodically measures the level of the input signal, and outputs a numerical value of the measurement result at the output. This process is illustrated in Fig. 2-4. Here, gray rectangles mark the input signal values ​​measured with a certain constant time interval. The set of such values ​​is the digitized representation of the input analog signal.

Fig. 2-4. Measurements of the signal amplitude versus time

In fig. 2-5 we have shown how to connect an analog-to-digital converter to a microphone. In this case, an analog signal is supplied to input x1, and the digital signal is taken from outputs u1-un.

Fig. 2-5. Analog-to-digital converter

Analog-to-digital converters are characterized by two important parameters - the conversion frequency and the number of levels of quantization of the input signal. Choosing these parameters correctly is critical to achieving adequate digital representation of the analog signal.

How often do you need to measure the value of the amplitude of an input analog signal so that as a result of digitization, information about changes in the input analog signal is not lost?

It would seem that the answer is simple - the input signal should be measured as often as possible. Indeed, the more often the analog-to-digital converter makes such measurements, the better the smallest changes in the amplitude of the input analog signal will be tracked.

However, excessively frequent measurements can lead to an unjustified increase in the flow of digital data and a waste of computer resources in signal processing.

Fortunately, choosing the correct conversion rate (sampling rate) is easy enough. To do this, it is enough to turn to the Kotelnikov theorem (known outside Russia as the Nyquist-Shannon sampling theorem), familiar to specialists in digital signal processing. The theorem states that the conversion frequency must be at least twice the maximum frequency in the spectrum of the converted signal. Therefore, to digitize without loss of quality an audio signal whose frequencies lie in the 16-20,000 Hz range, a conversion frequency of no less than 40,000 Hz must be selected.
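The theorem translates into a one-line calculation:

```python
def min_sampling_rate(max_signal_hz: float) -> float:
    """Kotelnikov (Nyquist-Shannon) criterion: sample at no less than twice the highest frequency."""
    return 2.0 * max_signal_hz

print(min_sampling_rate(20_000))  # 40000.0, the full audible band
print(min_sampling_rate(4_000))   # 8000.0, enough for telephone-band speech
```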

Note, however, that in professional sound equipment the conversion frequency is chosen several times higher than this value in order to achieve very high quality of the digitized sound. For speech recognition systems such quality is not essential, so we will not dwell on this choice.

And what frequency of conversion is needed to digitize the sound of human speech?

Since the sounds of human speech lie in the 300-4000 Hz frequency range, the minimum required conversion frequency is 8000 Hz. However, many computer speech recognition programs use the 44,100 Hz conversion rate that is standard for conventional sound adapters. On the one hand, such a conversion rate does not lead to an excessive increase in the digital data stream, and on the other hand, it ensures the digitization of speech with sufficient quality.

Back in school, we were taught that any measurements give rise to errors that cannot be completely eliminated. Such errors arise due to the limited resolution of measuring instruments, as well as due to the fact that the measurement process itself can introduce some changes in the measured value.

The analog-to-digital converter represents the analog input signal as a stream of limited-length numbers. Typical audio adapters contain 16-bit ADC blocks that can represent the amplitude of the input signal as 2¹⁶ = 65,536 different values. ADC devices in high-end audio equipment can be 20-bit, providing greater accuracy in representing the amplitude of the audio signal.
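One simple way to model such an ADC in software (a sketch only; real converters also add the noise sources discussed above):

```python
def quantize(sample: float, bits: int) -> int:
    """Map a sample in [-1.0, 1.0] to one of 2**bits signed integer codes."""
    half = 2 ** (bits - 1)
    code = round(sample * (half - 1))
    return max(-half, min(half - 1, code))

print(2 ** 16)             # 65536 distinct levels for a 16-bit ADC
print(quantize(1.0, 16))   # 32767, the largest positive code
print(quantize(-1.0, 16))  # -32767
```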

Modern speech recognition systems and programs were created for ordinary computers equipped with ordinary sound adapters. Therefore, you do not need to purchase a professional sound adapter to experiment with speech recognition. An adapter such as Sound Blaster is quite suitable for digitizing speech for the purpose of its further recognition.

Along with the useful signal, various noises usually get into the microphone - noise from the street, wind noise, extraneous conversations, etc. Noise has a negative impact on the performance of speech recognition systems and therefore has to be dealt with. One of the ways we have already mentioned is that today's speech recognition systems are best used in a quiet room, being alone with the computer.

However, it is far from always possible to create ideal conditions, so you have to use special methods to get rid of interference. To reduce the noise level, special tricks are used in the design of microphones and special filters that remove frequencies that do not carry useful information from the spectrum of the analog signal. In addition, a technique such as compression of the dynamic range of the input signal levels is used.

Let's talk about all this in order.

A frequency filter is a device that transforms the frequency spectrum of an analog signal. In the process of transformation, oscillations of certain frequencies are selected (or absorbed).

You can imagine this device as a kind of black box with one input and one output. As applied to our situation, a microphone will be connected to the input of the frequency filter, and an analog-to-digital converter will be connected to the output.

Frequency filters come in several types:

· Low-pass filters;

· High-pass filters;

· Band-pass filters;

· Band-stop (notch) filters.

High-pass filters (high-pass filter) remove from the spectrum of the input signal all frequencies below a certain cutoff frequency, which depends on the filter setting.

Since audio signals lie in the 16-20,000 Hz range, all frequencies below 16 Hz can be cut without degrading sound quality. For speech recognition, the 300-4000 Hz frequency range is what matters, so frequencies below 300 Hz can be cut as well. In this case, all interference with a frequency spectrum below 300 Hz will be removed from the input signal and will not hinder the speech recognition process.

Likewise, low-pass filters (low-pass filter) cut from the input signal spectrum all frequencies above a certain cutoff frequency.

A person does not hear sounds with a frequency of 20,000 Hz and above, so they can be cut from the spectrum without noticeable degradation of sound quality. As for speech recognition, here you can cut out all frequencies above 4000 Hz, which will lead to a significant reduction in the level of high-frequency interference.

A band-pass filter (band-pass filter) can be thought of as a combination of a low-pass and a high-pass filter. Such a filter blocks all frequencies below the so-called lower cutoff frequency, as well as those above the upper cutoff frequency.

Thus, for a speech recognition system a band-pass filter is convenient that blocks all frequencies except those in the 300-4000 Hz range.

As for the band-stop filters, they allow you to cut out from the spectrum of the input signal all frequencies that lie in a given range. Such a filter is convenient, for example, for suppressing interference that occupies a certain continuous part of the signal spectrum.

In fig. 2-6 we have shown the connection of a passband filter.

Fig. 2-6. Filtering the audio signal before digitizing

I must say that ordinary sound adapters installed in a computer include a band-pass filter through which the analog signal passes before digitization. The bandwidth of such a filter usually corresponds to the range of audio signals, namely 16-20,000 Hz (in different audio adapters, the values ​​of the upper and lower frequencies may vary within small limits).

And how can you obtain the narrower 300-4000 Hz band that corresponds to the most informative part of the human speech spectrum?

Of course, if you have a penchant for designing electronic equipment, you can make your filter from an operational amplifier chip, resistors and capacitors. This is approximately what the first creators of speech recognition systems did.

However, industrial speech recognition systems must be operable on standard computer equipment, so the way of making a special bandpass filter is not suitable here.

Instead, modern speech processing systems use so-called digital frequency filters implemented in software. This became possible after the computer's central processing unit became powerful enough.

A digital frequency filter, implemented in software, converts an input digital signal to an output digital signal. In the process of conversion, the program processes in a special way the stream of numerical values ​​of the signal amplitude coming from the analog-to-digital converter. The conversion result will also be a stream of numbers, but this stream will correspond to the already filtered signal.
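As a sketch of the idea (not production code; real systems use properly designed FIR/IIR filters), here is a crude software band-pass built from two single-pole stages:

```python
import math

def band_pass(samples, rate_hz, low_hz=300.0, high_hz=4000.0):
    """Crude digital band-pass: a single-pole high-pass (removes frequencies
    below low_hz) cascaded with a single-pole low-pass (removes those above high_hz)."""
    dt = 1.0 / rate_hz
    rc_hp = 1.0 / (2 * math.pi * low_hz)
    rc_lp = 1.0 / (2 * math.pi * high_hz)
    a_hp = rc_hp / (rc_hp + dt)
    a_lp = dt / (rc_lp + dt)
    out = []
    hp_prev_in = hp_prev_out = lp = 0.0
    for x in samples:
        hp = a_hp * (hp_prev_out + x - hp_prev_in)  # high-pass stage
        hp_prev_in, hp_prev_out = x, hp
        lp = lp + a_lp * (hp - lp)                  # low-pass stage
        out.append(lp)
    return out

# A constant (0 Hz) input is blocked, as a 300 Hz high-pass stage requires:
print(abs(band_pass([1.0] * 2000, rate_hz=8000)[-1]) < 0.01)  # True
```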

Talking about the analog-to-digital converter, we noted such an important characteristic of it as the number of quantization levels. If a 16-bit analog-to-digital converter is installed in the audio adapter, then after digitization the audio signal levels can be represented as 2¹⁶ = 65,536 different values.

If there are too few quantization levels, so-called quantization noise appears. To reduce this noise, high-quality audio digitization systems should use analog-to-digital converters with as many quantization levels as possible.

However, there is another technique used in digital audio recording systems to reduce the effect of quantization noise on the quality of the audio signal. With this technique, the signal is passed through a non-linear amplifier before digitizing, which emphasizes low-amplitude signals. This device amplifies weak signals more than strong ones.

This is illustrated by the plot of the dependence of the output signal amplitude on the input signal amplitude, shown in Fig. 2-7.

Fig. 2-7. Non-linear gain before digitizing

At the stage of converting the digitized audio back to analog (we will discuss this stage later in this chapter), the analog signal is passed through a non-linear amplifier before being output to the speakers. This time, a different amplifier is used, which emphasizes signals with a large amplitude and has a transfer characteristic (the dependence of the amplitude of the output signal on the amplitude of the input signal), the opposite of that used during digitization.
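The book does not name a specific transfer characteristic, so as an illustration we take μ-law companding, a standard nonlinearity of exactly this kind used in digital telephony:

```python
import math

MU = 255.0  # the mu-law parameter used in G.711 telephony

def compress(x: float) -> float:
    """Compression before digitization: small amplitudes are boosted."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def expand(y: float) -> float:
    """The inverse characteristic, applied on playback."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

x = 0.01
print(compress(x) > 0.2)                     # True: a weak signal is lifted well above 0.01
print(abs(expand(compress(x)) - x) < 1e-12)  # True: the round trip restores the input
```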

How can all this help the creators of speech recognition systems?

As you know, a person recognizes quite well speech pronounced in a soft whisper or in a fairly loud voice. We can say that the dynamic range of loudness levels over which a person successfully recognizes speech is quite wide.

Unfortunately, today's computer speech recognition systems cannot yet boast of this. However, in order to slightly expand the specified dynamic range, before digitizing, you can pass the signal from the microphone through a nonlinear amplifier, the transfer characteristic of which is shown in Fig. 2-7. This will reduce the level of quantization noise when digitizing weak signals.

Developers of speech recognition systems, again, are forced to focus primarily on commercially available sound adapters. They do not provide for the non-linear signal conversion described above.

However, it is possible to create a software equivalent of a nonlinear amplifier that converts the digitized signal before passing it on to the speech recognition engine. And although such a software amplifier will not be able to reduce the quantization noise, it can be used to emphasize those signal levels that carry the most speech information. For example, you can reduce the amplitude of weak signals, thus removing noise from the signal.
