Mediascape: Video Compression, One Step Beyond

International standard wants NTSC quality on a CD-ROM

Last month, we began a discussion of video compression technology by examining the Joint Photographic Experts Group (JPEG) standard for still images. For video production systems where it may be necessary to edit individual frames of a movie, JPEG compression on a frame-by-frame basis is currently the best approach. For most other applications, we can achieve much higher compression by going beyond JPEG. The Moving Picture Experts Group (MPEG) has proposed an international standard that it hopes will eventually pack NTSC-quality video into the bandwidth available from an ordinary CD-ROM.

As we noted last month, there are really only two tricks for compressing data. First, whenever the data has statistically predictable regularities, it is possible to devise a coding scheme that takes optimal advantage of them.

Examples include run-length coding (sending one example of a repeating pattern plus the number of times to repeat it) and Huffman coding (assigning the shortest codes to the patterns that occur most frequently). Although such codes are perfectly lossless — that is, the reconstructed pattern will be bit-for-bit identical to the original — they can only squeeze the data by a limited amount. For images, compression ratios between 3:1 and 10:1 are about all you can expect.
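To make the first trick concrete, here is a minimal sketch of run-length coding in Python. The function names are ours, chosen for illustration; real image formats pack the runs into bytes rather than Python tuples.

```python
def rle_encode(data):
    """Collapse each run of a repeated symbol into a (symbol, count) pair."""
    runs = []
    for symbol in data:
        if runs and runs[-1][0] == symbol:
            # Same symbol as the current run: extend the count.
            runs[-1] = (symbol, runs[-1][1] + 1)
        else:
            # A new symbol starts a new run.
            runs.append((symbol, 1))
    return runs


def rle_decode(runs):
    """Rebuild the original string from (symbol, count) pairs."""
    return "".join(symbol * count for symbol, count in runs)


scanline = "WWWWWWWWBBBWWWWWWWWWWWWB"
encoded = rle_encode(scanline)
print(encoded)  # [('W', 8), ('B', 3), ('W', 12), ('B', 1)]

# Lossless: decoding gives back exactly what went in.
assert rle_decode(encoded) == scanline
```

The scheme pays off only when the data really does contain long runs; on noisy data the (symbol, count) pairs can take more room than the original.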

Selectivity is key. Second, in many applications a reasonable facsimile will work as well as an identical reproduction. By selectively throwing away data during the compression process, an algorithm can achieve much higher compression ratios; reasonable values range from 20:1 to 80:1. The key word here is selectively.

The trick is to take advantage of the peculiarities of the human eye and brain and to throw away only the data that won’t be missed. The JPEG standard for still images uses some fancy mathematical techniques to tell the difference between essential and nonessential data. It also allows a tradeoff: the amount of loss in image fidelity against the degree of data compression.

Such higher compression ratios exact a cost. It takes lots of number-crunching to compress and decompress the data. Running the computations on today’s general-purpose desktop computers takes a long time. However, the impatient (and well-heeled) user can elect to trade cash for time; several vendors offer special-purpose chips and boards to accelerate the computations.

JPEG video. JPEG was designed and optimized for compressing still images. Nonetheless, it is possible to apply JPEG compression to each frame of a movie. This makes sense for video producers, who need to save disk space without giving up the ability to edit individual frames. It might also be useful for distributing a few copies of a video — perhaps for beta testing — where turnaround time is more important than the cost of storage media.

In fact, because the viewer is never going to see any single frame by itself, it is safe to apply very high compression and, correspondingly, sacrifice much of the detail. The viewer’s eye and brain will be able to reconstruct a satisfying motion picture, and that’s all that counts. But even high amounts of JPEG compression — in the neighborhood of 60- or 80-fold — aren’t enough to meet the goal of packing NTSC-quality video onto a CD-ROM. For that, we’re going to need something on the order of a 200- or 300-fold compression factor.
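A back-of-the-envelope calculation shows where a figure of that size comes from. The frame size, color depth and frame rate below are our assumptions for "NTSC-quality" digital video, not numbers from any standard; the 1.15 Mbps CD-ROM rate is the one quoted for MPEG elsewhere in this article.

```python
# Assumed raw-video parameters (illustrative only).
WIDTH, HEIGHT = 640, 480      # pixels per frame
BITS_PER_PIXEL = 24           # 8 bits each for three color components
FRAMES_PER_SECOND = 30

# CD-ROM data rate in bits per second, as quoted in this article.
CD_ROM_BITS_PER_SECOND = 1.15e6


def raw_bits_per_second(width, height, bits_per_pixel, fps):
    """Data rate of uncompressed video."""
    return width * height * bits_per_pixel * fps


def required_compression(raw_bps, channel_bps):
    """How many times the raw stream must shrink to fit the channel."""
    return raw_bps / channel_bps


raw = raw_bits_per_second(WIDTH, HEIGHT, BITS_PER_PIXEL, FRAMES_PER_SECOND)
factor = required_compression(raw, CD_ROM_BITS_PER_SECOND)
print(f"raw rate: {raw / 1e6:.1f} Mbps")
print(f"required compression: about {factor:.0f}:1")
```

Under these assumptions the video alone needs a factor a little under 200:1. Setting aside part of the channel for sound, or assuming a larger frame, pushes the requirement higher.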

TEMPORAL REDUNDANCY

In a moving picture, there is usually only a little change from one frame to the next. The background is static and the foreground action takes place in small increments. Many kinds of change occur in mathematically well-defined ways, such as by panning (linear displacement of elements) or zooming.

Getting the hints. We can exploit these facts for compression purposes by suppressing some of the frames (say, actually recording only every third or fifth frame) and letting the receiver synthesize (that is, fake) the intermediate frames by interpolation. For better quality, there might be “hints,” small packets of data that adjust the interpolation calculation for each frame.

This technique is a way of trading off the high data rate of raw video for the vast amount of computation needed to create hints and interpolate frames. (No free lunch, remember?) There’s another tradeoff, as well. The gain in compression is directly related to the loss of image fidelity.

The limit of what is acceptable will depend on the application. (In other words, you wouldn’t use this for medical imaging.) Fortunately, there are many applications where the viewer can be counted on to see what he expects to see and not to notice small discrepancies.

NTSC ON A CD-ROM

The Moving Picture Experts Group (MPEG) has sought to achieve NTSC-quality video from a CD-ROM, which would be usable for industrial training films, video games and other commercial information products. There are no MPEG-based commercial products yet, but laboratory demonstrations suggest that its proposed standard will meet that goal.

One reason we began this discussion of video compression with an overview of the JPEG still-image technology is that many of JPEG’s techniques are incorporated into the MPEG proposal. Essentially, MPEG combines a variation of JPEG’s discrete cosine transform approach (which strips certain small details out of an image) with a set of methods for removing the frame-to-frame temporal redundancy.

JPEG and then some. Ideally, it should be possible to use JPEG-like compression on the first frame of a scene, and thereafter send only information about parts of the image that have changed in subsequent frames. With luck, it wouldn’t be necessary to send another full frame until the entire scene changed — due to a cut to a different camera angle, for instance. The compression ratio could be quite high.

That probably would not work so neatly in practice. Small errors would accumulate, and any loss of synchronization would be disastrous. Moreover, it would make random seeks (e.g., when reading from a CD-ROM in an interactive application) very difficult; the only places you could start viewing would be scene breaks. Both of these factors make it desirable to send a full frame at fairly regular intervals — at least every half-second.

That, in a nutshell, is how MPEG compression works. The data stream contains occasional full frames (compressed using a JPEG-style algorithm) interspersed with tiny packets of hint information. The overall compression depends on what kind of images the movie contains (some compress better than others) and on the frequency of full frames, along with the number of hints.
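The effect of the full-frame interval on the data rate can be sketched in a few lines. The byte counts here are invented for illustration, and the half-second interval is the rule of thumb mentioned above; nothing in this sketch is taken from the MPEG proposal itself.

```python
def frame_schedule(total_frames, frames_per_second=30,
                   max_seconds_between_full=0.5):
    """Label each frame 'full' or 'hint', forcing a full frame at a
    fixed interval so errors are flushed and random seeks stay possible."""
    interval = max(1, int(frames_per_second * max_seconds_between_full))
    return ["full" if i % interval == 0 else "hint"
            for i in range(total_frames)]


def average_frame_bytes(schedule, full_bytes, hint_bytes):
    """Average compressed size per frame for a given schedule."""
    sizes = [full_bytes if kind == "full" else hint_bytes
             for kind in schedule]
    return sum(sizes) / len(sizes)


# One second of video at 30 frames/second: full frames at 0 and 15.
schedule = frame_schedule(30)
print(schedule.count("full"), "full frames,",
      schedule.count("hint"), "hint packets")

# Hypothetical sizes: 20,000 bytes per full frame, 2,000 per hint packet.
print(average_frame_bytes(schedule, 20000, 2000))  # 3200.0
```

Halving the interval doubles the number of full frames and raises the average accordingly, which is the tradeoff between compression and seek granularity in miniature.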

FLEXIBLE YET COMPATIBLE

The MPEG standard is very flexible. It allows the video producer to decide the relative proportion of full frames (which MPEG calls Intra-coded frames; see sidebar for more details) and smaller packets of hints (the Predictive and Bidirectional frames) and thus to control the tradeoff between compression ratios and image artifacts. It supports a wide range of image sizes and frame rates, though it assumes at least 24 frames/second. It allows a wide range of compressed data rates, from the 1.15 megabits per second of a CD-ROM up to the 40 Mbps or so needed for good-quality compressed HDTV.

To assure some interoperability among various MPEG-based products, the standard does define a “constrained parameter bitstream” that all products will support. These parameters include an image size up to 720 x 576 pixels, a data rate up to 1.86 Mbps and a decoder memory of less than 48 KB.

Like JPEG, the standard is very specific about how to decode a signal and says nothing about how to encode one. Thus, there is plenty of room for product differentiation without wild incompatibility.

How about audio? As we go to press, the MPEG standard is nearly complete. The video compression standard is widely agreed on, but there is still some wrangling about how to handle the audio portion of a movie.

This is not a trivial issue. One of the MPEG committee’s original goals was to get NTSC-quality video and sound off a CD-ROM, a device whose data rate was originally designed to be sufficient for stereo sound alone. Currently, the best compression algorithms for audio yield only about 4:1 or 5:1 ratios, which leaves only 60 to 80 percent of the bandwidth for the video. It is not clear how this will be worked out.

Until there are commercial products in the field — or at least prototypes in beta testing — we won’t know how good MPEG really is. Already, some observers have suggested that the current standard proposal is not optimal at higher data rates and image resolutions. The MPEG committee has already begun a new phase of study focused on digital media in the range of 5-10 Mbps, which may lead to an MPEG-II standard in a year or two.

Peter Dyson

MPEG: THE GORY DETAILS

Three types of framing. The Moving Picture Experts Group, or MPEG, standard defines three main types of information about the frames of a moving picture. The first is called an Intra-coded frame (I-frame for short). It contains explicit information about every pixel in the image and is quite similar in format to a JPEG-compressed still image. That is, the picture is broken up into tiles that are eight pixels high and eight pixels wide, and each tile is run through a discrete cosine transformation. The results are quantized, then encoded using run-length and Huffman codes. (See Vol. 1, No. 8 for the gory details of JPEG compression.)
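For readers who want to see the tile step in action, here is a plain-Python sketch of an 8 x 8 discrete cosine transform followed by a crude quantizer. It uses the textbook orthonormal DCT and a single step size for every coefficient, which is our simplification; JPEG and MPEG use tables of step sizes, and real coders use fast algorithms rather than four nested loops.

```python
import math

N = 8  # tile edge length in pixels


def dct_2d(tile):
    """Orthonormal 2-D DCT-II of an N x N tile given as a list of rows."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (tile[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs


def quantize(coeffs, step):
    """Divide by a step size and round; small coefficients become zero."""
    return [[round(c / step) for c in row] for row in coeffs]


# A perfectly flat gray tile: all the energy lands in one coefficient.
flat_tile = [[100] * N for _ in range(N)]
quantized = quantize(dct_2d(flat_tile), step=16)  # step of 16 is arbitrary
print(quantized[0])  # [50, 0, 0, 0, 0, 0, 0, 0]

nonzero = sum(1 for row in quantized for q in row if q != 0)
print(nonzero, "nonzero coefficient out of", N * N)
```

The long strings of zeros that quantization leaves behind are exactly what the run-length and Huffman stages then squeeze out.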

However, unlike JPEG, the MPEG standard explicitly calls for adapting the color/detail tradeoff on the fly, tile by tile, based on the content of the image. (As we discussed in last month’s Mediascape, JPEG permits, but does not require, on-the-wing adaptation.) This factor alone can improve the compression ratio by 30 percent over a “dumb JPEG” approach.

Because it describes every pixel, an I-frame serves two purposes. First, it erases any accumulated errors that have built up from previous frames. Second, it can be used as the start of a video sequence in interactive applications; a program that jumps around in the movie can begin playing at any I-frame, not just at scene changes.

But by the same token, an I-frame requires a lot of data. For high compression ratios, the objective is to send as few I-frames as possible. The receiver is expected to fill in the gaps between I-frames by interpolation. That is, knowing that a certain pixel was some color in the previous I-frame and is some other color in the current I-frame, the decoder guesses that at each point in time between those frames, the pixel was part way between those colors. Often that is a good guess and the picture doesn’t come out too badly. For the parts of a picture where that is the wrong approach, the MPEG standard also provides for two other types of frames.
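The guess described above amounts to linear interpolation, pixel by pixel. A minimal sketch, with frames represented as lists of rows of brightness values (our representation, chosen for brevity):

```python
def interpolate_frame(prev_frame, next_frame, fraction):
    """Blend two frames; fraction 0.0 gives prev_frame, 1.0 gives next_frame."""
    return [[(1 - fraction) * a + fraction * b
             for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(prev_frame, next_frame)]


def in_between_frames(prev_frame, next_frame, count):
    """Synthesize `count` evenly spaced frames between two known frames."""
    return [interpolate_frame(prev_frame, next_frame, k / (count + 1))
            for k in range(1, count + 1)]


# Two tiny one-row "frames" and the three frames guessed between them.
first = [[0, 100]]
last = [[100, 200]]
for frame in in_between_frames(first, last, 3):
    print(frame)
# [[25.0, 125.0]]
# [[50.0, 150.0]]
# [[75.0, 175.0]]
```

This works nicely for a slow fade. It fails for motion, where a moving edge turns into two ghostly half-strength edges, which is why the other frame types exist.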

Prediction fills in the gaps. The second type is called a Predictive frame (P-frame for short). It makes no effort to describe the whole picture; it only contains information about changes since the previous I-frame. The picture is broken into tiles 16 x 16 pixels in size, and each tile is compared with the same tile in a previous frame. The system then blithely assumes that all pixels in a given 16 x 16 tile are moving together, and it generates a set of numbers that specify the speed and direction of the motion.

That assumption is often true — panning and scrolling are common image shifts — but not always. Thus, a P-frame is really just a “hint” to the decompression system; it reduces, but cannot eliminate, the errors that would be generated by blind interpolation between I-frames.
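Because the standard does not say how an encoder should find those motion numbers, the following is only one plausible approach: an exhaustive search that slides the tile around the previous frame and keeps the offset with the smallest sum of absolute differences. The function names, the search range and the test pattern are all ours.

```python
def sad(frame_a, frame_b, ax, ay, bx, by, size):
    """Sum of absolute differences between two size x size tiles."""
    return sum(abs(frame_a[ay + j][ax + i] - frame_b[by + j][bx + i])
               for j in range(size) for i in range(size))


def find_motion_vector(prev, curr, x, y, size=16, search=4):
    """Return ((dx, dy), cost): the tile at (x, y) in curr best matches
    the tile at (x + dx, y + dy) in prev."""
    height, width = len(prev), len(prev[0])
    best = (0, 0)
    best_cost = sad(curr, prev, x, y, x, y, size)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            px, py = x + dx, y + dy
            # Skip candidate tiles that fall outside the previous frame.
            if px < 0 or py < 0 or px + size > width or py + size > height:
                continue
            cost = sad(curr, prev, x, y, px, py, size)
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best, best_cost


# Synthetic 32 x 32 frame, then the same picture panned 2 pixels right.
prev = [[(x * 7 + y * 13) % 256 for x in range(32)] for y in range(32)]
curr = [[prev[y][x - 2] if x >= 2 else 0 for x in range(32)]
        for y in range(32)]

vector, cost = find_motion_vector(prev, curr, 8, 8)
print(vector, cost)  # (-2, 0) 0
```

The vector (-2, 0) says the tile's contents came from two pixels to the left in the previous frame. A cost of zero means the match is perfect; in real footage it rarely is, which is why the result is a hint rather than a guarantee.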

Back to the future. The third type is a Bidirectional-interpolation or B-frame. Like a P-frame, it contains hints to guide the decoder. However, in using it to build up a displayable image, the decoder will combine its contents with both a past and a future frame. Obviously, the future frame must have been already received, or at least already predicted.

Bidirectional interpolation has the advantage that it deals with areas of the background that are uncovered when a foreground object moves. An area that becomes uncovered can’t be predicted from a past frame, but it can be “predicted” from a future frame. Its principal drawback is that it requires buffer memory in the receiver, which has to store at least one earlier and one later version of the displayed image in addition to the version it is currently reconstructing.
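One consequence is that frames cannot travel in the order they are shown. A sketch of the reordering, assuming (our simplification) that every B-frame depends on the nearest I- or P-frame on either side of it:

```python
def transmission_order(display_types):
    """Given frame types in display order, return the display indices in
    an order where each B-frame follows both frames it depends on."""
    order = []
    pending_b = []
    for index, kind in enumerate(display_types):
        if kind == "B":
            # Hold B-frames back until their future reference is sent.
            pending_b.append(index)
        else:
            # An I- or P-frame goes out first, then the B-frames before it.
            order.append(index)
            order.extend(pending_b)
            pending_b = []
    order.extend(pending_b)  # trailing B-frames with no later reference
    return order


display = "IBBPBBP"
sent = transmission_order(display)
print(sent)                            # [0, 3, 1, 2, 6, 4, 5]
print([display[i] for i in sent])      # ['I', 'P', 'B', 'B', 'P', 'B', 'B']
```

The receiver must hold frame 3 in memory while it rebuilds frames 1 and 2, which is the buffer cost described above.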

Peter Dyson

WHAT ABOUT DVI?

When it was introduced to the world by the General Electric/RCA Sarnoff Labs in 1986, Digital Video Interactive offered the only way to obtain good-quality video and sound from a CD-ROM. But despite the marketing efforts of such industry heavyweights as Intel (which acquired the technology in 1987) and IBM, DVI has had an uphill struggle for acceptance.

Now that MPEG’s adoption as the international standard seems inevitable, Intel has decided to reposition DVI as a multipurpose video compression engine — the i750 chip set — rather than as a specific encoding technique.

Thus, although the i750 can compress and decompress DVI-coded images, it can also handle JPEG stills and PhotoCD stills. (The next generation of chips will handle MPEG video.) With suitable microcode libraries, it can perform special effects such as merges, fades and wipes.

Two-chip set. There are two chips in the current product. The 82750PB pixel processor takes a compressed data stream and expands it into a YUV image frame (YUV is a color space that separates luminance from two color-difference, or chrominance, components). It can also compress such a frame into a DVI data stream.

The architecture of this chip resembles a RISC processor — on-chip data registers, pipelined execution, very long instruction word format, multiple loads and stores on every clock cycle — with the addition of specialized function units such as the pixel interpolator to handle zooming and anti-aliasing. A standard microprocessor, says Intel, would have to run at 150 MIPS to match its performance.

The 82750DB display processor takes the YUV frame and massages it for physical display. It handles YUV-RGB conversion; it also contains a digital-to-analog converter and a cursor-shape bitmap. It generates sync pulses and can be programmed for various display resolutions and refresh rates. Current versions of the chip support VGA, 8514/A, XGA and super-VGA resolutions.
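The YUV-RGB conversion itself is a small piece of arithmetic. The sketch below uses the commonly published coefficients for analog YUV and treats U and V as signed values centered on zero; we have no information on the exact coefficients or number formats the 82750DB uses, so take this as an illustration of the idea only.

```python
def clamp(value):
    """Round to the nearest integer and limit to the 0..255 range."""
    return max(0, min(255, int(round(value))))


def yuv_to_rgb(y, u, v):
    """Convert one pixel. y is 0..255; u and v are signed color
    differences (assumed convention, not taken from Intel documentation)."""
    r = y + 1.140 * v
    g = y - 0.395 * u - 0.581 * v
    b = y + 2.032 * u
    return clamp(r), clamp(g), clamp(b)


print(yuv_to_rgb(128, 0, 0))    # (128, 128, 128): no color difference, so gray
print(yuv_to_rgb(200, 0, 100))  # (255, 142, 200): red channel clipped at 255
```

Doing this for every pixel of every frame is exactly the kind of repetitive work that is cheap in dedicated silicon and expensive on a general-purpose processor.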

These two chips must be supplemented by video RAM and an interface to the host computer to make a complete compressed video product. Intel also supports the chip set by offering software libraries and complete compression-board products.

Peter Dyson

AND QUICKTIME, TOO

QuickTime is widely perceived as the technology that allows users to play back choppy audio and video in a small window. It is, however, much more complex than that. It is a system-level manager of dynamic data types (data over time, such as audio, video, animations, etc.), hardware peripherals (digitizing boards, VCRs, etc.) and compression algorithms. By putting these tools at the system level, instead of making them particular to each application, QuickTime enables any program, running on any Macintosh with a 68020 processor or better, to take advantage of the new capabilities. In addition, it makes application development significantly simpler and quicker.

The initial release of QuickTime includes four basic components: the system software; a new, dynamic data file format called a “movie”; three Apple data compressors (a JPEG photo compressor for still images, an animation compressor, and a video compressor); and interface standards.

Apple understood well the need for a modular system, one that can easily adapt to the widely varying needs of its customers. The compression manager, for example, allows developers to work with whichever algorithms are appropriate, without worrying about adding individual drivers to the application. In addition to the software compression algorithms that Apple includes as part of the package, add-in boards and software packages are being developed that support JPEG, MPEG, DVI, P*64, Group 3 fax and PhotoCD, to name just a few ingredients of the current compression alphabet soup. QuickTime doesn’t care which compression scheme you use; the compression manager will automatically use the algorithm requested by the application, assuming it is available.
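The design idea, a central registry that hands each request to whichever compressor was asked for, can be shown in miniature. To be clear, this is not the QuickTime programming interface: the class, its method names and the two stand-in codecs are hypothetical, invented purely to illustrate the pattern.

```python
import zlib


class CompressionManager:
    """Hypothetical codec registry; NOT the QuickTime API."""

    def __init__(self):
        self._codecs = {}

    def register(self, name, compress, decompress):
        """Add a codec, whether built in or supplied by a third party."""
        self._codecs[name] = (compress, decompress)

    def available(self, name):
        return name in self._codecs

    def compress(self, name, data):
        if name not in self._codecs:
            raise LookupError(f"no compressor registered for {name!r}")
        return self._codecs[name][0](data)

    def decompress(self, name, data):
        if name not in self._codecs:
            raise LookupError(f"no compressor registered for {name!r}")
        return self._codecs[name][1](data)


manager = CompressionManager()
# Stand-in codecs: a pass-through and Python's built-in zlib.
manager.register("none", lambda d: d, lambda d: d)
manager.register("deflate", zlib.compress, zlib.decompress)

frame = b"\x00" * 1000
packed = manager.compress("deflate", frame)
print(len(frame), "bytes in,", len(packed), "bytes out")
assert manager.decompress("deflate", packed) == frame
```

The application names the scheme it wants and never touches the codec directly, so a new board or software package only has to register itself to become usable by every program.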

The important thing to remember about QuickTime is that future generations of software will include significantly improved capabilities. Where one might be disappointed in the small window of bad video, before long the Mac will have full-screen (640 x 480 resolution) video running without additional hardware. In addition, the extensibility of the Macintosh system will provide the groundwork for applications that have not yet even been contemplated.

David Baron