18-899 Group Project:
H.320 System
K. Bradley
X. Chen
Y. Yuan

1. Introduction

Our group project focuses on the implementation of an H.320 videoconferencing system. We constructed a portion of the H.261 system for the midterm project demo.

H.261 consists of three main parts: the encoder, the decoder, and the transport mechanism. As we have three group members, we have divided the work among the group in this fashion. Since the encoder itself uses the decoder in its feedback loop (to minimize error accumulation), the decoder was developed at the same time as the encoder.

First, fast versions of the relevant algorithms in H.261 were developed. Motion estimation and compensation is the most time-consuming algorithm in the system. An efficient DCT and inverse DCT for an 8x8 block were also developed. The following subsections will discuss the fast algorithm implementations.

1.1 Motion estimation

Motion estimation and compensation is to estimate the displacement between the corresponding blocks in the previous and current frames, then calculate the residue under this displacement. The displacements are usually represented as 2-D vectors. Actually, motion estimation is the bottleneck of a video encoder. So we should apply some efficient algorithms to speed up this process without significantly degrade the overall video quality.

1.1.1 Matching criteria

In order to estimate block motions, we need compare the block in the current frame to the shifted corresponding block in the previous frame. Two criteria are commonly used to quantify this matching: minimum mean square error (MSE) and minimum mean absolute difference (MAD). MSE is superior in performance; however, it is not widely used for VLSI implementation since the difficulty to run the square operation in hardware. We choose MAD, also because absolution operation is faster than square operation in software. MAD can be shown as:
where MAD(d1,d2) is mean absolute difference for the various displacements (d1,d2). N1 x N2 is the size of each block B. n1 and n2 are the pixel indicies within the block. sp and sf are the current and previous frame respectively. Then the motion estimate is given by:

1.1.2 Search procedure

Theoretically, we need to go over all the possible (d1, d2) to find the best match and get motion vector. However, this kind of exhaustive (full) search is very timing consuming and inefficient. Some suboptimal methods have been promoted and among them is the three step search which can greatly reduce the pool of candidates (d1,d2).

Look at the above figure where part of a block was shown. In the first step, the candidates are nine vectors corresponding to "1" and "0" points. Calculating motion estimation via eq.(1) and eq.(2). If "0" point is the estimated motion vector, stop searching. Otherwise, we begin the second step where the candidates are another eight points ("2") centered around the estimated vector (circled "1") in the first step. If the estimated vector is circled "1", stop searching. Otherwise, start the third step where the candidates are updated as eight "3" points centered around the estimated vector (circled "2") in the second step.

Compared to full search which would take 16x16=256 searches, three step procedure requires at most 9+8+8=25 searches. Although it may miss the optimum candidate, three step scheme speeds up the searching process by 10 times.

1.1.3 Types of numbers in the codes:

Operation of integer numbers is much faster than that of floating points. We strictly keep most of the numbers in the codes as integer.

1.2 Efficient DCT and IDCT

DCT directly implemented by matrix multiplication for an 8x8 block would require 1024 multiplications and 896 additions. We choose a fast DCT based on the 16-point fast DFT, which requires only 208 multiplications and 464 additions for an 8x8 block. Figure 1 shows the flowgraph for 1-D DFT. The output of the flowgraph needs to be scaled to get the DCT coefficient.
Fig. 1 Flowgraph of 1-D DCT (from "JPEG still image data compression standard", written by William Pennebaker and Joal Mitchell)

The inverse DCT can be implemented in a similar way by reversing the flowgraph of the corresponding forward DCT.

2. Encoder and Decoder

The encoder and decoder are design to receive a data stream and process it. The encoder receives QCIF YUV video frames and produces a "compressed" data stream. The decoder receives the "compressed" data stream and generates QCIF YUV video frames.

We constructed the main elements of the encoder and decoder during the efficient algorithm development stage. Only a few key decisions remained.

For videos containing fast moving objects, motion estimation may lead to large error residues. A decision between inter-block coding (using the motion estimation and a coded residue) and intra-block coding (using the uncompensated macroblock) is necessary. At this time, intra mode is preferred where coefficients of current frame instead of those of residue frame are transmitted. The choice between intra and inter modes is based on RM8. Motion estimation information can be saved if using intra mode.

The decoder takes the data streams from the entropy decoder and reconstructs the video sequence. The decoder has two buffers: the input buffer holds the input data associated with the current frame to be decoded; the output buffer holds two frames, one for the previous frame that is used in motion compensation and the other for the current frame as output. After one frame has been decoded, the pointers to the previous frame and the current frame are swapped and the current frame is sent to the display unit. The decoding is done macroblock by macroblock. For each macroblock, header information and block data are read from the input buffer. Depending on the encoding mode of the macroblock which is specified in the header, the macroblock is decoded in various INTRA or INTER mode. All of the ten coding mode can be processed. The data of the six blocks in one macroblock is dequantized first and a fast inverse DCT algorithm is implemented to reconstruct the block data from the DCT coefficient. The decoder is also embedded into the encoder as a part of motion prediction loop.

3. Accessing Video Data

Two Linux machines, running kernel 2.0.18, TCP/IP over Ethernet, and X-Windows were used for the display of video data as well as running the encoder and decoder programs.

One machine has a 150 MHz Intel Pentium Pro and 64 MB of RAM; the other machine has a 66 MHz Intel 486/DX2 with 16 MB of RAM. A miroMedia PCTV PCI framegrabber card was used in the Pentium Pro machine, and a VideoLabs CCD camera (producing NTSC-quality video) was attached to the framegrabber.

X-Windows was run with 8 bits per pixel.

The Linux driver for the framegrabber was obtained from the BTTV homepage http://www.thp.Uni-Koeln.DE/~rjkm/linux/bttv.html

The driver supports two modes: a direct view mode, in which the framegrabber communicates over the PCI bus to a section of memory on the video card, and DMA mode, in which the framegrabber communicates with the main memory of the system. Direct view mode supports the fastest frame rates, as it interacts directly with the display device. The driver is responsible for maintaining the memory correctly. Under X-Windows, a direct mapping of the X display to the frame buffer is required.

The driver also came with a small program that used an explicit call to the card to obtain a frame of video. The driver exports this call through an IO control call (ioctl), and takes as an argument the memory to copy the frame into. The card can be configured to grab in a variety of modes; we chose 24 bpp RGB, as that was the default. The card will also scale the image in hardware to the required size, so we were able to directly obtain QCIF images (176 x 144 pixels).

A small section of code was written to convert the RGB data to YUV data, and to decimate the U and V planes by two in both X and Y so that 4:2:0 YUV frames were generated.

The code for interacting with the driver and converting the images was placed into a C++ class call yuv_video. When a new frame was requested with the Get_Y, Get_U, and Get_V methods, the driver was invoked and the color conversion performed.

4. Network Transport

Network communication in Unix involves Berkeley sockets, which are a standard across many platforms. Sockets are treated like files, in that they support "read" and "write" operations.

A small socket class was created to handle the communication between two pieces of software. The socket class supports one of two modes: "receive", which waits for communication on a specified network port, and then establishes a socket; and "send", which communicates with the "receiver".

The sockets were configured to use TCP/IP, which guarantees that all packets sent will be received properly. Another option, which is slightly faster, is to use UDP/IP, although the guarantee of proper reception is lost. Since the two machines share an Ethernet hub, using TCP/IP is roughly equivalent to UDP as there are few collisions and dropped packets (dropped packets cause TCP to resend the missing information).

To ease portability, the socket file descriptor was then used to create a buffer I/O stream (using the C++ fstream class). This allows for C++ file stream operators to use the socket like any other file, and allowed for easy testing: to convert the code from reading a disk file to reading a socket, only the "open" statement was changed.

The operating system then handles the generation of packets, transmission, reception, and so forth. Our code just reads bytes or writes bytes to the file stream.

5. Video Display

Several window functions were created to display the data in an X-Window. X is not terribly efficient at displaying bitmapped images, as it endeavors to be portable across different display architectures. It uses an abstract "pixel" definition, where pixel is dependent on the depth of the display, number of colors allocated to the window, and so forth.

By using 8-bit displays, we were guaranteed that a "pixel" was a single character, and could then directly map pixel values to R/G/B values through the color allocations.

A routine for color allocation and dithering was borrowed from the MPEG-2 decoder, available at ftp://ftp.cstv.to.cnr.it/pub/MPEG2/conformance-bitstreams/video/verifier/.

This decoder supports 8-bit displays and dithers by quantization of the Y/U/V space, allocating 4 bits to Y and 2 bits to U and V. The quantized value is then used to look up an allocated colormap entry.

The routines for display were also placed in the yuv_video class. When a complete frame was written to the class using the Put_Y, Put_U, and Put_V methods, the frame was dithered and displayed.

6. Encoder and Decoder Performance

Since we have not yet implemented the true H.261 bitstreams, and are sending raw internal data over the network, network performance is particularly bad. We expect this figure to improve.

The code was instrumented to obtain timing measurements for different aspects of the encoder and decoder. The encoder was measured for the motion estimation algorithm time per frame (as well as the number of MAD values calculated) and the total time to perform the DCTs and quantization stages per frame. The decoder was instrumented for its entire time to decode the frame, as it involved only inverse quantization, inverse DCT, and motion compensation. The yuv_video class was instrumented for framegrab time and frame display time.
Code Measured Description Time
Read Video FrameGrabber IO 50.64 ms
  Color Conversion 8.9 ms
  Decimation 2.5 ms
  Read Total 61.5 ms
Encoder Motion Estimation 207.85 ms
  DCT, Quantization, Scan 46.9 ms
  Encoder Total 254.8 ms
Decoder IDCT, Inverse Quantize, etc. 41.7 ms
  Display 18.25 ms
Network Write time 443.1 ms
  Write bandwidth 2788.3 kbps
Total Total encoder + write time 833 ms
  Framerate 1.2 fps

The encoder spends most of its time doing motion estimation. The fast motion estimation algorithm is in fact doing its job: only an average of 1045.7 MAD calculation were required per frame, which is 4.13% of the number of MAD calculations needed for exhaustive search, and implies that a MAD calculation (16x16 subtraction and sum, with memory lookups) requires 200 ms.

7. Future Plan

We will implement the entropy decoder as a front end to the current decoder. The decoding will take the standard bit stream and generate the output that meets the input format for the current decoder. The decoding of bit stream involves searching in the code book. An efficiently organized code book will improve the performance of the entropy decoder.

Since the audio bandwidth in H.320 is of little concern running on 10 Mbps Ethernet, it has been suggested that we send raw audio data and not concern ourselves with compression or decompression. In that case, only synchronization of the audio stream with the video stream will be necessary.

The majority of the remaining work, therefore, is in the generation of entropy-encoded bitstreams, bitstream management, and synchronization.

8. Summary and Conclusions

We’ve successfully demonstrated a video compression and decompression system. It does not fully implement H.261, as we have not yet built the entropy encoder. However, it does do motion estimation, compensation, and macroblock-level encoding and quantization. In addition, it works in real-time with an aggregate rate of 1.2 frames per second, although it does require almost 3 Mbps of network bandwith. Once the entropy encoding is included, the network utilization will drop significantly, and as network access time is a major part of the overall frame processing time, the average frame rate should increase.

We have discovered several things while implementing this portion of our project.

(1) Project is not pure research and efficiency is vital. Although MAD and three step search may not give the optimum results, they are effective and reduce a lot of computation.

(2) Encoder and decoder should be developed in parallel or by one person. Some minor inconsistency between these two codes would arise a lot of problems during the stage of integration.

(3) We have learned the implementation of fast DCT and inverse DCT. FDCT can run five times faster than a conventional DCT. The performance of the decoder is greatly improved by using the fast DCT. We have fully understood the process of motion compensation in video compression.

(4) We now have an understanding of different color spaces, color dithering, and some idea about algorithms for manipulating image formats.

(5) We also have an understanding of some of the behaviors of hardware used in video systems.

(6) Finally, we’ve learned that group projects work best if there are specific goals that can be reached. In addition, having a common working environment is almost necessary, for issues in portability and development environments can overwhelm the overall goals of the project.