Implementation of the Lucas-Kanade
Optical Flow Algorithm on Graphics Processors using NVIDIA CUDA
Tuesday September 25, 2007
Hamerschlag Hall D-210
4:30 pm
Kevin Biswas
Carnegie Mellon University
General purpose computing with programmable graphics processing units
(GPUs) exhibits great potential to solve a wide variety of numerically
intensive problems significantly faster than conventional CPUs due to
the GPU's massively-parallel architecture and superior memory
bandwidth. The primary challenge, however, has been to overcome the
constraints of working within a graphics programming environment and
to harness the GPU's available computational power. In early 2007,
NVIDIA released its Compute Unified Device Architecture (CUDA)
platform, a co-designed hardware and software architecture for issuing
and managing computations on the G80 GPU without the need to map them
to a graphics API.
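For readers unfamiliar with the model, the sketch below shows the
flavor of a CUDA program: a kernel is launched over a grid of threads
directly from C code, with no graphics API involved. It is an
illustrative example only, not code from the talk; the kernel and all
names and sizes in it are hypothetical.

    #include <cuda_runtime.h>

    // Each thread scales one array element -- no graphics API involved.
    __global__ void scale(float *data, float alpha, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= alpha;
    }

    int main(void) {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));  // placeholder input
        // Launch enough 256-thread blocks to cover all n elements.
        scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }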
As GPU programmability and the associated programming tools improve,
more attention is being paid to accelerating compute-intensive
computer vision algorithms.
One of the most demanding of these algorithms is optical flow, a
method for estimating the motion of objects across a sequence of
images. The technique is often used in real-time applications such as
robot navigation and collision avoidance, but is generally limited to
low frame rates and small image sizes because these systems rely on
CPU implementations. The Lucas-Kanade method is a popular variant of
the algorithm that calculates the motion between two successive image
frames at every pixel position. The algorithm is distinctive in that
its structure is non-iterative; this characteristic, along with its
inherent data parallelism, makes it particularly attractive for
real-time implementation on GPUs.
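To make the per-pixel structure concrete, the sketch below shows what
such a kernel might look like: each thread solves the 2x2 Lucas-Kanade
normal equations A [u v]^T = -b at its own pixel, where A and b
accumulate products of the image derivatives over a small window. This
is an assumed illustration, not the speaker's implementation; the
derivative images Ix, Iy, It and the window radius r are taken as
given.

    // Hypothetical per-pixel Lucas-Kanade kernel: one thread computes
    // the flow vector (u, v) at one pixel. Ix, Iy are spatial image
    // derivatives and It is the temporal derivative, all precomputed.
    __global__ void lucasKanade(const float *Ix, const float *Iy,
                                const float *It, float *u, float *v,
                                int width, int height, int r) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < r || y < r || x >= width - r || y >= height - r) return;

        // Accumulate structure-tensor sums over the (2r+1)^2 window.
        float sxx = 0, sxy = 0, syy = 0, sxt = 0, syt = 0;
        for (int dy = -r; dy <= r; ++dy)
            for (int dx = -r; dx <= r; ++dx) {
                int i = (y + dy) * width + (x + dx);
                sxx += Ix[i] * Ix[i];  sxy += Ix[i] * Iy[i];
                syy += Iy[i] * Iy[i];
                sxt += Ix[i] * It[i];  syt += Iy[i] * It[i];
            }

        float det = sxx * syy - sxy * sxy;
        int p = y * width + x;
        if (fabsf(det) > 1e-6f) {       // invert the 2x2 system
            u[p] = (sxy * syt - syy * sxt) / det;
            v[p] = (sxy * sxt - sxx * syt) / det;
        } else {
            u[p] = v[p] = 0.0f;         // ill-conditioned: report no flow
        }
    }

Because every pixel is handled independently in a single pass, the
kernel maps one thread to one output value, exactly the pattern the
G80's thread hierarchy is built for.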
In this talk, we first describe the NVIDIA CUDA computing framework
and then illustrate that the per-pixel nature of the Lucas-Kanade
optical flow algorithm allows an efficient mapping onto the
multi-threaded architecture of the G80 GPU. To achieve optimal
performance, images are divided into sub-frames and a tiled
computation method is used to make efficient use of the available
memory bandwidth. In addition, threads are organized so that
multiprocessor occupancy is maximized, avoiding unnecessary hardware
idling. For large image sizes we ultimately achieve a speedup of 397X
over the CPU implementation.
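As an illustration of the tiling idea (an assumed pattern, not
necessarily the talk's exact scheme), each thread block can stage its
sub-frame plus a surrounding halo of one gradient image into shared
memory, so that the overlapping window sums are served from fast
on-chip storage rather than by repeated global-memory reads:

    #define TILE 16
    #define R    2   // hypothetical window radius

    // Illustrative tiled pattern: cooperatively load a padded tile,
    // then compute window sums (here, sum of Ix^2) from shared memory.
    __global__ void tiledSums(const float *Ix, float *out,
                              int width, int height) {
        __shared__ float s[TILE + 2 * R][TILE + 2 * R];
        int tx0 = threadIdx.x, ty0 = threadIdx.y;
        int bx0 = blockIdx.x * TILE, by0 = blockIdx.y * TILE;
        int gx = bx0 + tx0, gy = by0 + ty0;

        // Threads sweep the padded tile in TILE-sized steps, clamping
        // reads at the image border.
        for (int ty = ty0; ty < TILE + 2 * R; ty += TILE)
            for (int tx = tx0; tx < TILE + 2 * R; tx += TILE) {
                int ix = min(max(bx0 + tx - R, 0), width - 1);
                int iy = min(max(by0 + ty - R, 0), height - 1);
                s[ty][tx] = Ix[iy * width + ix];
            }
        __syncthreads();

        // Window sums now touch only shared memory.
        if (gx < width && gy < height) {
            float sxx = 0.0f;
            for (int dy = -R; dy <= R; ++dy)
                for (int dx = -R; dx <= R; ++dx) {
                    float v = s[ty0 + R + dy][tx0 + R + dx];
                    sxx += v * v;
                }
            out[gy * width + gx] = sxx;
        }
    }

A 16x16 block of 256 threads, as assumed here, is a common choice on
G80-class hardware that tends to keep multiprocessor occupancy high.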
Kevin Biswas is a Ph.D. student advised by Professor Ken Mai in
Electrical and Computer Engineering at Carnegie Mellon University. He
received the B.A.Sc. and M.A.Sc. degrees in Electrical and Computer
Engineering from the University of Windsor. His primary research
interest is computer architecture.