TY - JOUR
T1 - Parallel H.264/AVC fast rate-distortion optimized motion estimation by using a graphics processing unit and dedicated hardware
AU - Shahid, Muhammad Usman
AU - Ahmed, Ashfaq
AU - Martina, Maurizio
AU - Masera, Guido
AU - Magli, Enrico
N1 - Publisher Copyright: © 2014 IEEE.
PY - 2015/4/1
Y1 - 2015/4/1
AB - Heterogeneous systems on a single chip composed of a central processing unit, a graphics processing unit (GPU), and a field-programmable gate array (FPGA) are expected to emerge in the near future. In this context, the system on chip can be dynamically adapted to employ different architectures for the execution of data-intensive applications. Motion estimation (ME) is one such task that can be accelerated using an FPGA and a GPU for high-performance H.264/Advanced Video Coding encoder implementation. This paper presents an inherently parallel, low-complexity, rate-distortion (RD) optimized fast ME algorithm that is well suited to parallel implementations because it eliminates various data dependencies caused by a reliance on spatial predictions. In addition, this paper provides details of the GPU and FPGA implementations of the parallel algorithm using OpenCL and Very High Speed Integrated Circuit (VHSIC) Hardware Description Language (VHDL), respectively, and presents a practical performance comparison between the two implementations. The experimental results show that the proposed scheme achieves significant speedup on both GPU and FPGA, with RD performance comparable to that of a sequential fast ME algorithm.
KW - Field-programmable gate array (FPGA)
KW - graphics processing unit (GPU)
KW - H.264/Advanced Video Coding (AVC)
KW - OpenCL
KW - parallel fast motion estimation (ME)
UR - http://www.scopus.com/inward/record.url?scp=84926468557&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2014.2351111
DO - 10.1109/TCSVT.2014.2351111
M3 - Article
AN - SCOPUS:84926468557
SN - 1051-8215
VL - 25
SP - 701
EP - 715
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 4
M1 - 6882206
ER -