A new architecture to compute the discrete cosine transform using the quadratic residue number system

of 4
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Information Report



Views: 18 | Pages: 4

Extension: PDF | Download: 0

A new methodology to compute the N-point DCT (Discrete Cosine Transform) in the QRNS (Quadratic Residue Number System) is presented, with a significant improvement in complexity and speed compared to the corresponding binary version. This reduction
  A NEW ARCHITECTURE TO COMPUTE THE DISCRETE COSINE TRANSFORM USING THE QUADRATIC RESIDUE NUMBER SYSTEM    J. Ramírez    (1)  , A. García   (1)  , P. G. Fernández    (2)  , L. Parrilla   (1)  , A. Lloris   (1)   (1) Department of Electronics and Computer Technology (2) Department of Electrical Engineering University of Granada, Spain University of Jaén, Spain ABSTRACT A new methodology to compute the  N  -point DCT (Discrete Cosine Transform) in the QRNS (Quadratic Residue Number System) is  presented, with a significant improvement in complexity and speed compared to the corresponding binary version. This reduction in the total number of arithmetic adders and multipliers is up to 21% and 26% for an 8 −  point and a 16 −  point DCT, respectively. In addition, large and slow binary adders and multipliers with a long carry propagation delay are replaced by high-speed small word-length modular adders and LUT (Look  − Up Table) multipliers. When a Field Programmable Logic (FPL) implementation using Altera FLEX10KE devices of the proposed architecture for the 8 −  point DCT is considered, throughput is three times better than that obtained with the corresponding fixed point 2’s complement  binary implementation. 1. INTRODUCTION The introduction of the DCT (Discrete Cosine Transform) in 1974  by Ahmed [1] has created great interest in its applications in digital image and speech coding. This interest derives from the fact that the DCT approaches the statistically optimal Karhunen-Loeve Transform (KLT) while providing a much more efficient computational framework. For this reason, it has been used in most of the coding standards for image compression, including CCITT H.261/H.263, JPEG (Joint Photographic Expert Group) and MPEG (Motion Pictures Experts Group). The DCT of an  N  −   point sequence {  x (0),  x (1), ...,  x (  N  − 1)} is defined by: ( ) ( ) ( ) 1.... 211-...,1,0,  212 cos2  1210 10 ===== =      += −−= ∑  N  N nm  K  K  K  K   N m N m !  nn x K  N m X    (1)  The existence of fast algorithms to compute the DFT (Discrete Fourier Transform) is the motivation to compute the DCT through the DFT, thus obtaining a reduction in hardware complexity and an increase in speed. Ahmed [1] introduced the first fast algorithm to compute the  N  -point DCT through the computation of the real part of a 2  N  -point DFT scaled by a complex exponential constant. The  N  -length input sequence is padded with zeros to 2  N   points and, thus, the  N  -point DCT of the srcinal sequence may be computed through the calculation of a 2  N-  point DFT with a fast transform algorithm, as derived from: 12...,,1, 0)(  )(Re2)( 22221202 −+=== = π−−=π− ∑  N  N  N nn xeW  W n xe K   N m X   N  j N mn N  N n N m jm   (2)  After this first proposal many new algorithms [2] have been developed for the efficient computation of the DCT with a reduced number of real multiplications and additions. Another possible 2  N  - point sequence to compute the DCT through a scaled FFT is the specular reflection of the srcinal  N  -point sequence. Thus, the  N  - point DCT can be computed with 2  N  ·log 2 (2  N  ) complex MAC operations. Haralick [3] shows that the  N  -point DCT can be computed with two  N  -point FFTs obtaining a reduction of 2  N   complex MACs. The procedure is based on the fact that the FFT of a real even data sequence of length 2  N   can be obtained by computing two FFTs of  N  -point data sequences. Tseng and Miller [4] have shown an algorithm with only  N  /2·(log 2  N  +1) complex operations which calculate only the real part of the FFT of a 2  N  - point real specular-reflection sequence. Narasimha and Peterson [5] have proposed a more efficient algorithm to compute the  N  -point DCT with a rearrangement of the input sequence and a radix-2 algorithm for the DFT. With this algorithm, the resulting number of complex multiplications is (  N· log 2  N  −  N  +1)/4, which leads to an implementation that is twice as fast as the previous proposals. This latter algorithm is described in detail in section 3 and is the objective of the subsequent binary and QRNS implementations. The use of complex arithmetic to compute the DCT makes use of the QRNS (Quadratic Residue Number System) [6] as an enabling tool due to its high efficiency in computing complex multiplications with only two modular multiplications and no additions at all. On the other hand, RNS [7] has been shown to be an efficient arithmetic system for the implementation of DSP applications [6]. The advent of new FPL (Field Programmable Logic) device families, such as Altera FLEX10K [8], has introduced new  possibilities for the fast implementation of RNS-based systems. The synergy between FPL devices and RNS has been shown with the realization of a special class of digital filters [9]. Moreover, a fast implementation of the DCT using RNS and FPL devices has recently been reported [10], with improved results over its binary counterpart. In this paper, a regular architecture composed of parallel and independent QRNS channels for the computation of the DCT is  presented, showing a significant reduction in hardware complexity over the traditional binary approach. This is materialized in a reduction in the number of adders and multipliers required. In addition, an implementation on FPL devices (previously modelled at structural level using VHDL) of the 8-point transform is V-321 0-7803-5482-6/99/$10.00 ©2000 IEEE ISCAS 2000 - IEEE International Symposium on Circuits and Systems, May 28-31, 2000, Geneva, Switzerland  included both for the QRNS-based system and the binary alternative. The throughput for the QRNS-based system is three times better than that of the corresponding fixed-point 2’s complement binary implementation. Different techniques have been examined to solve the performance degradation in the conversion stage. The RNS-to-binary conversion is usually the main throughput limitation of RNS-based systems. As in image coding the DCT need to be coded, the use of a special class of auto-scaling RNS to binary converters [11] is examined in this paper to maintain the high performance of the proposed QRNS-based DCT architecture. 2. QRNS BACKGROUND An RNS (Residue Number System) [6, 7] is defined by a set of  positive pairwise relatively prime integers { m 1 , m 2 , ..., m  L } called moduli. Any integer  X    ∈  [0,  M  − 1] is represented by the  L − tuple [  x 1 ,  x 2 , ...,  x  L ], where i mi  x x  = (i= 1, 2, ...,  L ) is the residue of  X   modulo m i , while  M  , the dynamic range of the RNS system, is given by: ∏ = =  Li im M  1   (3)  Arithmetic in the RNS is defined over the ring of integers modulo  M  . For 0 ≤  X  , Y  ,  Z  <    M:    L "  m L Lm M   y x y xY  X  Z   ...,, 11  ◊◊=◊=   (4)  where ◊  represents either addition, subtraction or multiplication. An extension of RNS is the QRNS [6], which is defined by an isomorphic mapping between the ring of complex integers modulo m  and the set of pairs of elements in the ring of integers modulo m . Complex integer multiplication is efficiently computed in the QRNS with only two parallel modular multiplications. In this way, algorithms involving intensive complex multiplications, such as the FFT, are efficiently implemented. QRNS exists if the prime factor decomposition of m  has only  primes of the form 4  K  +1 or, equivalently, if the equation  x 2 +1 can  be factorized as (  x − r  )(  x + r  ) with r  ∈ {0, 1, ..., m − 1}. Let a +j b  be a complex number with 0 ≤   a , b  < m  and r   a root of the equation  x 2 +1= 0 in the ring of integers modulo m . Thus, QRNS is defined  by the isomorphic mapping: mm rbabrbaa ba jba  ,  QRNS −=+=          →  + −+−+   (5)  Arithmetic in the QRNS is defined as follows: ( ) ( )      ◊◊=◊ −−++ −+−+  , ,, mm d bca d cba   (6)  where ◊  represents addition, subtraction or multiplication. The inverse QRNS mapping is given by: ( )  ( )  ( ) mm bar bbaa  jbaba  2 ; 2  ),( 11QRNS -1 −+−−+−−+ −=+= +      →     (7)  According to (6), addition and multiplication in the QRNS are computed with only two modular operations. In this way, DSP algorithms involving a great number of complex multiplications, such as the DFT, are more efficiently computed over a QRNS system. 3. QRNS-BASED DCT COMPUTATION The procedure for the QRNS-based DCT computation is as follows. Initially, the  N  -point input sequence {  x (0),  x (1), ...,  x (  N  − 1)} is reordered in the sequence {  y (0),  y (1), ...,  y (  N  − 1)} defined by: ( ) ( )( ) ( ) 12112 ...,0, 2 +=−−  −== n xn N  y  N nn xn y   (8)  Let { Y  (0), Y  (1), ..., Y  (  N  − 1)} be the DFT of the sequence {  y (0),  y (1), ...,  y (  N  − 1)}. It can be shown that the DCT {  X  (0),  X  (1), ...,  X  (  N  − 1)} of the srcinal sequence is obtained [5] through the real  part of  Z  ( n ), defined as follows: ( ) ( ) ( )  N  j N n N n n  W nY W  K   N nY  H n Z  42-44 e 2  π ===   (9)  If the following property: ( ) ( )( ) [ ]  ( ) [ ] n Z n N  Z  n jZ n N  Z  Im-Re  * −=⇔−=−   (10) is used, it is only necessary to compute  N  /2+1 values of  Z  ( n ). Thus, the  N  /2+1 set of points {  Z  (0),  Z  (1), ...,  Z  (  N  /4),  Z  (  N  /2), Z(  N  /2+1), ..., Z(3  N  /4 − 1)} is proposed to minimize the number of operations in the  N  -point DCT computation. With this set,  N  − 3 fewer adders + −   ∗   m + m + m −   m − −+ ba  , ( ) −+ d c  , −+  f e  ,   + a   − b   + c   − d    m e + ×   m  f   − ×   −+ h g   , ( ) −+  ji  , +  g    − h   + i   −  j   Figure.1 . A QRNS butterfly for a radix-2 FFT. V-322  and  N  /2 − 2 fewer multipliers are required. In this way, using this set and taking into account the property given in equation (10), the  N  - point DCT of the sequence {  x (0),  x (1), ...,  x (  N  − 1)} is given by {Re[  Z  (0)], Re[  Z  (1)], ..., Re[  Z  (  N  /4)], − Im[  Z  (3  N  /4 − 1)], − Im[  Z  (3  N  /4 − 2)], ..., − Im[  Z  (  N  /2+1)], Re[  Z  (  N  /2)], Re[  Z  (  N  /2+1)], ..., Re[  Z  (3  N  /4 − 1)], − Im[  Z  (  N  /4)], − Im[  Z  (  N  /4 − 1)], ..., − Im[  Z  (1)]}. In a QRNS implementation of the previous procedure, complex  binary adders and multipliers are replaced by QRNS adders and multipliers in the computation of the scaled radix-2 DIF (Decimation In Frequency) FFT given in (9). According to (6), each QRNS adder and each QRNS multiplier consists of two modular adders and two modular multipliers, respectively. However, the fact that the input sequence is real reduces each QRNS adder with real inputs to only one modular adder. A normal QRNS butterfly for the computation of a radix-2 DIF DFT is shown in Figure 1. It consists of a QRNS adder (two modular adders), a QRNS subtractor (two modular subtractors) and a QRNS multiplier (two LUT modular multipliers). The architecture for a prearranged dynamic range is composed of a sufficient number of parallel channels with pairwise relatively  prime moduli satisfying the QRNS mapping requirements. For an  N  -point DCT-QRNS implementation, the number of real input QRNS adders required is: ∑ −= = 1log0 2 2  N i i N  A   (11)  Thus, a direct calculation of the modular adders (  MA ) and LUT modular multipliers (  MM  ) required leads to: ( )( ) 51log61log2 22 +−=+−−=  N  N  MM  A N  N  MA   (12)   Note, in the case of an 8-point sequence, only the set {  Z  (0),  Z  (1),  Z  (2),  Z  (4),  Z  (5)} is required for DCT computation. In this way, the 8-point radix-2 DIF QRNS FFT is simplified and five QRNS adders and two QRNS multipliers are saved. Figure 2 shows the algorithm used to compute the 8-point DCT. Each arithmetic instance is a QRNS component consisting of two modular adders, two modular subtractors or two LUT modular multipliers. The two modular adders normally present in each of the 14 real input QRNS adders are reduced to only one. 4. COMPARISON WITH THE BINARY IMPLEMENTATION Table 1 shows the binary adders and multipliers required for the  binary 8-point and 16-point DCT, as well as the modular adders and multipliers required for the QRNS alternative. Reduction is up to 21% and 26% in the total number of arithmetic adders and multipliers for the computation of an 8 −  point and a 16 −  point QRNS DCT, respectively. Furthermore, a speed increase is obtained, since the QRNS alternative does not require large multipliers or adders with long carry propagation delay, as does the  binary solution, and a LUT is used for each modular multiplier and high speed parallel modular adders replace binary adders in each QRNS channel. Binary and QRNS implementations of the 8-point DCT were carried out on FPL devices in order to compare hardware complexity and speed performance. Altera FLEX10KE [8] devices were used for the practical implementation of both alternatives. First of all, a 2’s complement binary implementation of the radix-2 DIF FFT-based DCT was carried out. The 8-point input sequence {  x (0),  x (1),  x (2),  x (3),  x (4),  x (5),  x (6),  x (7)} is initially represented with 8-bit precision, while each complex “twiddle” factor and each scale factor are represented with 10-bit precision. The precision is extended as required in each stage, obtaining a 32-bit dynamic range for the DCT computation. The necessary registers between arithmetic instances for a pipelined implementation are added. The resulting 2’s complement binary architecture to compute the DCT has six latency cycles.  N Binary adders Binary multipliers  MA MM Averaged reduction 8 30 27 24 21 21% 16 94 75 72 53 26% Table 1 . Number of arithmetic modules for  N  -point DCT binary and QRNS implementations. For the DCT-QRNS implementation, the 32-bit dynamic range is obtained with four 8-bit moduli {221, 229, 233, 241}, while the roots for the isomorphic mapping are {47, 107, 89, 177}, respectively. 8-bit modulus parallel channels are chosen in order to ensure fast internal QRNS processing. On the other hand, it is  previously necessary to convert the 8-bit real inputs {  x (0),  x (1),  x (2),  x (3),  x (4),  x (5),  x (6),  x (7)} to the QRNS. According to (5), conversion of a real integer to the QRNS is easily simplified to just a residue computation only requiring an 8-bit comparator and an 8- bit subtractor. Thus, conversion to QRNS with an 8-bit modulus set is easily accomplished with few resources. Altera FLEX10KE [8] devices were used to implement the QRNS and binary DCTs. These devices contain an embedded array to implement memory and specialized logic functions, and a logic array to implement general logic. The embedded array consists of a series of EABs (Embedded Array Blocks). Each EAB provides a 4K-bit memory. The logic array consists of LABs (Logic Array + + + + −   −   −   −   ∗   ∗   ∗   ∗  + + + + −   −  + −  + + −   ∗   ∗   ∗   ∗   ∗    y (0)  Z  (0) ∗   ∗    Z  (4)  Z  (2)  Z  (1)  Z  (5) W  80  W  81  W  82  W  83  W  82  W  80   H  0  H  4  H  2  H  1  H  5  y (1)  y (2)  y (3)  y (4)  y (5)  y (6)  y (7) Figure. 2. Pipelined QRNS DCT implementation.   V-323  Blocks). Each LAB includes eight LEs (Logic Elements). A LE  provides a four input LUT, a programmable flip-flop and dedicated signal paths for carry and cascade functions. The results for both architectures are shown in Table 2, with throughput measured in millions of DCTs per second (MDCTPS). Several facts should be noted: • Approximately the same number of LEs are required for both implementations, 2497 for the binary and 2308 for the four-channel QRNS system. The limited number of LEs required  by the binary implementation, despite the large number of  binary multipliers, is due to the constant input to each binary multiplier. • In the QRNS implementation, the memory available in the FPL devices is exploited to simplify and accelerate the task of multiplying two complex numbers. Thus, 14 EABs are required for each channel. • QRNS implementation provides a higher throughput due to the efficiency of the QRNS multiplication through LUTs. Specifically, for grade − 1, − 2, and − 3 speed FLEX10KE devices, QRNS performance is 3.42, 3.38 and 3.31 times  better than binary, respectively. Binary implementation QRNS implementation # LE 2497 4 × 577 # EAB 0 4 × 14 − 1 27.02 92.59 − 2 23.64 80.00 Throughput for several device speed grades (MDCTPS) − 3 17.85 59.17 Table 2 . Throughput and resource usage for a QRNS DCT implementation over FLEX10KE devices. The practical implementation of RNS-based systems encounters a difficulty in the conversion stage. Different solutions were addressed to overcome the conversion drawback. Binary-to-RNS and RNS-to-QRNS conversions are fast operations since the input and the selected moduli are 8 bit width. RNS-to-QRNS and QRNS-to-RNS conversions only require one adder, one substractor and one LUT. However, conversion from RNS to binary implies the use of 32-bit word-length modular adders and large multipliers. Thus, a direct implementation of the Chinese Remainder Theorem (CRT) results in excessive hardware usage and slow performance. The auto-scaling RNS to binary converter ( ε -CRT) proposed by Griffin et al   in [11] allows to overcome these drawbacks using only LUTs and conventional binary adders. For a scaled n -bit binary output and a k  -bit modulus set, this converter needs one 2 k  × n  LUT  per modulus and a binary n -bit adder tree to add the LUT outputs. Using pipeline in the binary adder tree the performance of the ε -CRT converter with 24, 16 and 8 bits output improves the  performance of the QRNS architecture for several speed grade devices and particularly, this can be maintained with three pipeline stages in the adders for a 24-bit output. Thus, simulations showed that the high performance of the proposed QRNS-based DCT architecture is not degraded when converters are added. 5. SUMMARY RNS has been revealed as a very useful arithmetic system for the development of high-performance DSP applications over FPL devices. In particular, residue arithmetic has been used to compute the Discrete Cosine Transform with a reduction in hardware complexity and with a significant increase in performance. This  paper shows that the QRNS is an efficient tool for the development of high-performance DCT processors based on complex arithmetic DFTs. The computational algorithm considered leads to a substantial reduction in the number of arithmetic modules required for the DCT computation and, thus, a reduction in hardware complexity. It has been shown that the reduction in the total number of adders and multipliers is in the range 21-26% for 8-point and 16-point QRNS DCT implementations. The 8-point 1D-DCT  processor and its binary counterpart were implemented on Altera FLEX10KE FPL devices, resulting in a throughput improvement for the QRNS approach of up to 242%. 6. ACKNOWLEDGEMENTS The authors were supported by the Dirección General de Enseñanza Superior (Spain) under project PB98-1354. 7. REFERENCES [1]  N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete Cosine Transform”,  IEEE Trans. on Computers , vol. C-23, pp. 90-93, Jan. 1974. [2] K. R. Rao, P. Yip,  Discrete Cosine Transform. Algorithms,  Advantages, Applications.  Academic Press. Inc, 1990. [3] R. M. Haralick, “A Storage Efficient Way to Implement the Discrete Cosine Transform”,  IEEE Trans. on Computers , vol. C-25, pp. 764-765, June 1976. [4] B. D. Tseng and W. C. Miller, “On Computing the Discrete Cosine Transform”,  IEEE Trans. on Computers , vol. C-27,  pp. 966-968, Oct. 1978. [5] M. J. Narasimha and A. M. Peterson, “On the Computation of the Discrete Cosine Transform”,  IEEE Trans. on Communications , vol. COM-26, pp. 934-946, Jun. 1978. [6] M. Soderstrand, W. Jenkins, G. A. Jullien, and F. J. Taylor,   Residue Number System Arithmetic: Modern Applications in  Digital Signal Processing  . IEEE Press Reprint Series. IEEE Press, 1986. [7]  N. Szabo and R. Tanaka,  Residue Arithmetic and its  Applications to Computer Technology . McGraw-Hill, 1967. [8] Altera Corporation, “1998 Data Book”, Jan. 1998. [9] A. García, U. Meyer-Bäse and F. J. Taylor, “Pipelined  Hogenauer CIC Filters Using Field-Programmable Logic and  Residue Number System”,  Proc. IEEE Int. Conf. on  Acoustics, Speech and Signal Processing  , Seattle, WA, vol.5,  pp. 3085-3088, May 1998. [10] P. G. Fernández, A. García, J. Ramírez, L. Parrilla and A. Lloris, “A New Implementation of the Discrete Cosine Transform in the Residue Number System”, in 33 rd    Asilomar  Conf. on Signals Syst. and Comp. , CA, 1999. [11] M. Griffin, F. J. Taylor, and M. Sousa, “New scaling algorithms for the Chinese Remainder Theorem.”, in 22 nd    Asilomar Conf. on Signals, Syst .and Comp. , CA, 1988.  V-324
View more...
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks

We need your sign to support Project to invent "SMART AND CONTROLLABLE REFLECTIVE BALLOONS" to cover the Sun and Save Our Earth.

More details...

Sign Now!

We are very appreciated for your Prompt Action!