Mediascape: Digitizing Sound and Images

A quick ‘n’ dirty overview of sampling

From time to time, we will explicate the digital technologies that underlie the consumer products and industry trends that Digital Media covers. Our purpose is twofold: to build a common glossary of terms and concepts, and to help the reader identify the possibilities and the limitations that are inherent in each technology.

Most devices that detect sound and images, such as microphones and video cameras, are analog devices. They output an electrical current that varies continuously in proportion to the changes in air pressure (which we hear as sound) or light intensity.

Digitizing these signals, so they can be processed and manipulated by computers, requires converting this continuously varying current into a sequence of numbers. This process is called sampling. At the other end of the digitizing process, we have to turn the numbers back into smoothly varying currents to drive speakers and picture tubes.

In any Radio Shack you can find blister-packed chips to perform the mechanics of analog-to-digital and digital-to-analog conversion. They work fine, provided you can live with their limitations on speed and precision. Faster or more precise parts are readily available (though not from Tandy) at a correspondingly steeper price.

How much speed and how much precision do you need? The rest of this article will sketch out the basic considerations. We’ll begin with the simplest case, monophonic audio, and work up to motion video.

AUDIO: ONE-DIMENSIONAL SAMPLING

If you plot a graph of the current coming out of a microphone for some period of time, you get a wiggly line. The line is fairly smooth; there are no sudden jumps from one level to another. (See Fig. 1.) The electrical current (vertical axis) is analogous to the air pressure on the microphone’s diaphragm. Higher-pitched music corresponds to a wigglier line; louder sound gives you a larger wave from peak to trough.

To make a digital representation, we sample the waveform. We divide the total possible range of current into some number of zones (the horizontal slices through the graph) and number the zones from lowest to highest.

At regular intervals (the vertical slices in Figure 1), we note which zone the wave is in. The number of the zone is what’s stored in the computer as the value for that sample. We do not take any account of whether it is high or low within a zone; each sample is pigeonholed with no further shades of meaning. The result is a stream of numbers, one value per sample, coming out of our digitizing circuitry in a fixed cadence.
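
The procedure just described can be sketched in a few lines of Python. This is only an illustration: the sine wave standing in for the microphone current, the function names, and the choice of four zones and eight samples per second are all assumptions made for the example.

```python
import math

def quantize(value, lo, hi, zones):
    """Return the zone number (0 .. zones-1) that `value` falls into."""
    width = (hi - lo) / zones
    zone = int((value - lo) / width)
    # Clamp so that a value at the very top of the range still gets a valid zone.
    return max(0, min(zones - 1, zone))

def sample(signal, duration, rate, lo, hi, zones):
    """Read `signal(t)` at `rate` samples per second and quantize each reading."""
    count = int(duration * rate)
    return [quantize(signal(n / rate), lo, hi, zones) for n in range(count)]

def wave(t):
    """A 1 Hz sine wave, a hypothetical stand-in for the microphone current."""
    return math.sin(2 * math.pi * t)

samples = sample(wave, duration=1.0, rate=8, lo=-1.0, hi=1.0, zones=4)
print(samples)  # one zone number per sample, in a fixed cadence
```

Note that `quantize` throws away the position of the reading within its zone, which is exactly the pigeonholing described above.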

So far, we have been deliberately vague about how many zones to divide the total range into, or how often to take the samples. The reason is that sampling is a very general technique for making digital measurements of continuous processes. It works not only for sound, but also for factory instrumentation, automotive diagnostics and so on. Each application has different sampling requirements. Here’s how they are determined.

Beautiful to ugly and back again. As you can see from Figure 2, the sampling process radically alters the wave. No longer continuous, it jumps suddenly from one level to another, and there are no values in between the levels. Yet this jagged collection of numbers can indeed be turned back into sound; simple averaging and smoothing (called “low-pass filtering” in the audio world or “unsharpening” in the computer graphics business) restore it to a semblance of the original, as Figure 3 shows.
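
As a rough sketch of the “averaging and smoothing” step, the Python below turns zone numbers back into levels (using the middle of each zone) and runs a three-point moving average over them. The moving average is the crudest possible low-pass filter and is chosen here only for simplicity; the function names and the sample data are hypothetical.

```python
def zone_midpoint(zone, lo, hi, zones):
    """Convert a zone number back into a level at the middle of that zone."""
    width = (hi - lo) / zones
    return lo + (zone + 0.5) * width

def smooth(values, window=3):
    """Moving average; near the edges, average only the part of the window that fits."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Hypothetical zone numbers from a 4-zone digitizer covering the range -1.0 to 1.0.
zones_seen = [2, 3, 3, 3, 2, 0, 0, 0]
levels = [zone_midpoint(z, -1.0, 1.0, 4) for z in zones_seen]
restored = smooth(levels)
print(levels)
print(restored)
```

The smoothed list has values that fall between the original levels, which is what softens the staircase.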

Figure 3 shows something else. In region A, the original wave was changing slowly compared to the spacing of our samples, and here the restored version is a pretty close match. But in region B, the original was changing rapidly and the restoration is nothing like the original. To pick up rapid changes, we need to sample the analog signal very often.

Even if we had taken many more samples, the restoration wouldn’t have all the detail the original had, because much of the variation falls within a single zone. To pick up small variations, we need to divide the total range into many small zones rather than a few big ones.

The moral is clear: for better reproduction, it pays to sample the signal very frequently and divide the signal range into a large number of zones.

The Nyquist limit. In 1928, AT&T mathematician Harry Nyquist proved that under ideal conditions, you must sample a signal at a rate at least twice the highest frequency that is present in the signal. For example, the upper limit of hearing for most adults is somewhere below 16 kHz; to render sounds accurately up to that frequency, you must sample the signal at 32 kHz (32,000 times per second) or more. Actually, because physical devices are never ideal, you should go well beyond that. The sample rate for CD audio is 44.1 kHz, while studios regularly use 48 kHz.

If you fail to sample often enough, the signal will be distorted. It’s not that you merely fail to reproduce the highest tones, which would sound no worse than turning down the treble control. Rather, those high tones are transformed into sounds of lower frequency called “aliases.” And while there is a mathematical relationship among the original signal, the sampling rate and the alias’s frequency, there is no musical relationship. Aliasing makes music sound awful.
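
The mathematical relationship mentioned above can be written down directly for the simple case of a single pure tone. The sketch below assumes ideal sampling with no pre-filter; the function name is ours.

```python
def alias_frequency(tone_hz, sample_rate_hz):
    """Frequency at which a pure tone appears after ideal sampling.

    Tones at or below half the sample rate come through unchanged;
    higher tones "fold" back into that band.
    """
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

print(alias_frequency(10_000, 32_000))  # below the limit: unchanged
print(alias_frequency(20_000, 32_000))  # above the limit: folds down
```

With a 32 kHz sample rate, a 10 kHz tone is reproduced at 10 kHz, but a 20 kHz tone comes back as 12 kHz, a pitch with no musical relation to the original.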

To avoid aliasing, either you must make sure that you sample often enough for the highest frequency that you will ever encounter, or you must pre-filter the input signal so that you will never digitize a frequency that is too high for your sampling rate. Good filters — those that pass all the frequencies you want and reject all others — are hard to make. Sampling more frequently gives better reproduction, but at the cost of handling more data.

Dynamic range. “Quantization” is the process by which the audio signal is divided into zones, and the precision of that process — i.e., the number of zones you divide the signal into — governs the dynamic range (the ratio of the greatest component to the smallest component) of the restored signal. That little A-to-D converter chip from Radio Shack carves the total input range into 256 zones, so it can give you a 256:1 ratio of the loudest signal to the quietest. In audio terms, that’s a dynamic range of 48 dB.

(The abbreviation dB stands for decibel, a unit named for Alexander Graham Bell. It denotes the logarithm of the ratio between two intensities. In the audio world, one dB represents a just-noticeable difference between two sound levels. Any 2:1 ratio of signal amplitudes is about 6 dB, while any 10:1 ratio is 20 dB. Because the scale is logarithmic, multiplying the ratios means adding the dB numbers.)

In computer terms, it takes eight bits to represent 256 values, so that particular chip can be called an 8-bit converter. We could also say that the conversion is done with eight bits of precision. Similarly, a 12-bit converter gives you a 4,096:1 ratio (72 dB). CD audio uses 16-bit samples — each sample is a 16-bit number — giving a 65,536:1 ratio (96 dB), which is greater than most human ears can distinguish.
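
The arithmetic behind these figures is easy to check. The short Python sketch below assumes the usual convention for amplitude ratios, 20 times the base-10 logarithm; the function name is hypothetical.

```python
import math

def dynamic_range(bits):
    """Return (number of zones, dynamic range in dB) for a converter of `bits` bits."""
    zones = 2 ** bits
    decibels = 20 * math.log10(zones)
    return zones, decibels

for bits in (8, 12, 16):
    zones, decibels = dynamic_range(bits)
    print(f"{bits:2d} bits: {zones:6,d}:1 ratio, about {decibels:.0f} dB")
```

Each extra bit doubles the number of zones and so adds about 6 dB.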

Practical concerns. A real-world digital sound system does a little bit more than the simple sampling and quantization we have just described. First of all, we really want stereo sound, so we will digitize each channel separately; the output data stream will alternately have a sample from the left channel, then from the right channel. For quadraphonic sound, we’d need to digitize four channels, and the data stream would alternate channels 1-2-3-4.
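
Alternating the channels in one stream is usually called interleaving. A minimal Python sketch, with made-up sample values and hypothetical function names:

```python
def interleave(*channels):
    """Merge equal-length channels into one stream: ch1, ch2, ..., ch1, ch2, ..."""
    return [sample for frame in zip(*channels) for sample in frame]

def deinterleave(stream, channel_count):
    """Split an interleaved stream back into its separate channels."""
    return [stream[i::channel_count] for i in range(channel_count)]

left = [1, 2, 3]
right = [10, 20, 30]
stream = interleave(left, right)
print(stream)                   # left, right, left, right, ...
print(deinterleave(stream, 2))  # the two channels again
```

The same two functions handle the quadraphonic case if you pass four channels and a channel count of 4.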

Second, we will probably want to encode the stream of numbers to provide error detection and correction. This involves computing check-digits and weaving them into the data stream. Just before converting the data back to analog form, we recompute the check-digits; if they don’t match the originals, we have detected an error. Clever coding methods may even tell us which bits are wrong. Third, we’ll want to add control codes so we can skip to the beginning or end of a piece, show the elapsed time and so forth. (We’ll address these issues in a future article on CD data formats.)
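
To make the check-digit idea concrete, here is a deliberately simple Python sketch that appends a one-byte checksum (the sum of the samples, modulo 256) to each block. This scheme is our own choice for illustration; it can only detect some errors, not locate or correct them, and real formats use much stronger codes.

```python
def add_check(block):
    """Append a one-byte checksum to a block of 8-bit samples."""
    return block + [sum(block) % 256]

def verify(block_with_check):
    """Recompute the checksum and compare it with the stored one."""
    *data, check = block_with_check
    return sum(data) % 256 == check

sent = add_check([10, 20, 30])
print(verify(sent))       # True: the block arrived intact

damaged = sent.copy()
damaged[1] = 21           # simulate a corrupted sample
print(verify(damaged))    # False: the error is detected
```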

IMAGE SAMPLING: SLICE, THEN DICE

We talked first about sound because the rules that apply there are quite general and work for any one-dimensional waveform: we chop a signal at regular intervals and quantize each sliver. This concept can be extended to a two-dimensional signal such as a still image. All we have to do is slice the image into strips, then process the strips one after another. We handle each strip as if it were one-dimensional. (Right! Slice, then dice.)

The clearest illustration is the television “raster,” the scanning lines that form the onscreen image. Every kid has looked closely at a TV screen and noticed that it is made of several hundred discrete, closely spaced lines. Along each line, the brightness varies more or less continuously. (See Fig. 4.) You can digitize this signal exactly the way you would an audio signal. A waveform is a waveform; the sampling rate and the dynamic range are still the main factors to consider.
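
In code, “slice, then dice” amounts to reading a two-dimensional grid one row at a time so that it becomes a single stream of samples. The Python sketch below uses a tiny made-up image and hypothetical function names.

```python
def raster_scan(image):
    """Flatten a 2D image (a list of rows) into one stream, top row first."""
    return [pixel for row in image for pixel in row]

def rebuild(stream, width):
    """Cut the stream back into rows of `width` samples each."""
    return [stream[i:i + width] for i in range(0, len(stream), width)]

image = [
    [0, 1, 2],
    [3, 4, 5],
]
stream = raster_scan(image)
print(stream)
print(rebuild(stream, width=3) == image)
```

The receiver must know the line width to put the picture back together, which is one reason a real video signal carries synchronizing information along with the picture.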

Color images are analogous to stereo sound, except that you have to do everything by threes (for the red, green and blue primary colors) instead of left-right pairs. Each color is sampled separately by placing a filter between the image and the light detector; all the detector measures is the intensity of the light that reaches it. But what if you’re digitizing prerecorded analog video that is spooling off a VCR rather than coming directly from a camcorder?

If we’re getting the data from a VCR, the light has already been detected by the camcorder. So we’re just looking at a varying voltage waveform, which we can digitize pretty much exactly like audio. Red, green and blue are multiplexed together into a single (very complex) waveform; for compression, we have to then demultiplex. Thus, for each color component, the intensity of a sample can be represented as a single number.

Representing intensity. As with sound, the number of bits you use to represent intensity can vary for different applications. Practically speaking, the human eye has a hard time seeing differences that are less than one percent, but to be on the safe side we will want to render half-percent intervals, for a total of 200 different intensity levels. Since seven bits can only represent 128 levels while an 8-bit number can represent 256 levels, that calls for digitizing with 8-bit precision. Mind you, that’s eight bits per color; each sample is thus 24 bits in all.
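
The bit counting can be checked in Python, along with one way of combining three 8-bit color values into a single 24-bit sample. Putting red in the high byte is an assumption made for this sketch, as are the function names; actual formats differ in the order they use.

```python
def bits_needed(levels):
    """Smallest number of bits that can represent `levels` distinct values."""
    return (levels - 1).bit_length()

def pack_rgb(red, green, blue):
    """Combine three 8-bit intensities into one 24-bit number (red in the high byte)."""
    return (red << 16) | (green << 8) | blue

def unpack_rgb(sample):
    """Split a 24-bit sample back into its red, green and blue intensities."""
    return (sample >> 16) & 0xFF, (sample >> 8) & 0xFF, sample & 0xFF

print(bits_needed(200))              # 8, since seven bits stop at 128 levels
sample = pack_rgb(255, 128, 0)
print(hex(sample), unpack_rgb(sample))
```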

This 24-bit color is the standard for photorealistic reproduction. (With 32-bit computers, each sample often carries an extra eight bits that can be used for special effects information such as transparency.)

There are some changes in terminology between sounds and images. For example, the sampling rate for sound is related to the highest frequency (pitch) in the audio signal, and is expressed in samples per second (hertz, or Hz) because sound is a time-based signal. The equivalent for images is called the “scanning resolution,” and is expressed in dots per inch or per millimeter.

The Nyquist rule now comes into play twice: once when you decide how many scan lines to slice the image into, and again when you dice each line. The highest “frequency” in the image is spatial and is set by the smallest object and the sharpest edge. The aliases due to undersampling are jaggies and picket-fence effects, which are most visible in near-vertical and almost-horizontal thin lines.

MOTION VIDEO: ONWARD TO 3D

From still images to full-motion video is a very short step. A moving picture is a sequence of stills, now called frames. As before, it is possible to undersample the scene by capturing too few frames per second. The grossest undersampling causes jerky motion and flicker; to prevent this, the frame rate ought to be at least 20 frames per second, even for slowly moving objects such as talking heads in a videoconference. A less obnoxious form of aliasing is the wagon wheels that seem to spin backward; to really prevent that effect, you’d have to shoot several hundred frames per second.

However, by reintroducing the time element, we have added a host of practical problems. The first is cost. There is very little time for quantizing each picture element. Take U.S. standard NTSC video as an example: each frame lasts 1/30 of a second and has 525 scan lines. Each scan line thus lasts a mere 64 microseconds. There’s not much use in digitizing more than 500 samples per line, but even so, that only gives 128 nanoseconds (billionths of a second) per sample. Circuits that can react that fast are pricey; we ain’t talking Radio Shack here.
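
The timing budget can be worked out in a few lines of Python. The sketch assumes the nominal NTSC rate of 30 frames per second and ignores the time spent on retrace between lines; the function name is hypothetical. The exact results come out slightly under the rounded figures quoted above.

```python
def ntsc_timing(frames_per_second=30, lines_per_frame=525, samples_per_line=500):
    """Return (microseconds per scan line, nanoseconds per sample)."""
    frame_seconds = 1 / frames_per_second
    line_seconds = frame_seconds / lines_per_frame
    sample_seconds = line_seconds / samples_per_line
    return line_seconds * 1e6, sample_seconds * 1e9

line_us, sample_ns = ntsc_timing()
print(f"{line_us:.1f} microseconds per line")
print(f"{sample_ns:.0f} nanoseconds per sample")
```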

The second problem is that all this sampling and quantizing generates a phenomenal amount of digital data — one minute of digitized motion video takes up approximately 1,300 MB of hard disk space. However, a fair amount of this data contains redundant information that can be squeezed out for storage and reconstituted later. High-quality compression is the single most important technological feat to be achieved for the future success of digital video. Next month’s Mediascape will discuss various existing still and motion picture compression techniques, as well as those still on the drawing boards.
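
A back-of-the-envelope calculation in Python shows where a figure of this size comes from. The sketch assumes 500 samples per line, 24 bits (three bytes) per sample, 30 frames per second, and a megabyte of one million bytes. The count of roughly 480 visible lines out of the 525 is our assumption, not a figure from this article.

```python
def bytes_per_minute(lines, samples_per_line=500, bytes_per_sample=3,
                     frames_per_second=30):
    """Uncompressed video data generated in one minute, in bytes."""
    bytes_per_frame = lines * samples_per_line * bytes_per_sample
    return bytes_per_frame * frames_per_second * 60

for lines in (525, 480):
    megabytes = bytes_per_minute(lines) / 1_000_000
    print(f"{lines} lines: about {megabytes:,.0f} MB per minute")
```

Digitizing all 525 lines gives a little over 1,400 MB per minute; counting only about 480 visible lines gives just under 1,300 MB.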

Peter Dyson