## Diyala Journal of Engineering Sciences

Second Engineering Scientific Conference, College of Engineering, University of Diyala, 16-17 December 2015, pp. 493-500

# LOW COMPLEXITY MULTILEVEL 2-D DHWT ARCHITECTURE

Saad Mohammed Saleh <sup>(1)</sup> and Ammar Ebdelmelik Abdelkareem<sup>(2)</sup>

<sup>(1)</sup>College of Engineering, Diyala University, Baquba, Diyala, Iraq
 <sup>(2)</sup>College of Information Engineering, Al-Nahrain University, Baghdad, Iraq
 <sup>(1)</sup>saad\_alazawi@engineering.uodiyala.edu.iq
 <sup>(2)</sup>a.abdelkareem@coie.nahrainuniv.edu.iq

**ABSTRACT:** In this paper, an efficient multilevel 2-D Discrete Haar Wavelet Transform (DHWT) architecture is designed and implemented to compute the multilevel 2-D DHWT for image processing applications. The key points of the proposed architecture are its low memory requirements and low complexity.

It is composed of identical units that can easily be cascaded to decompose the input signal to any required level. The architecture uses 4L adders (L is the number of decomposition levels) and 8M register stages (M is the number of columns of the input image) to perform a three-level decomposition, with an initial latency of M+3 clock cycles.

The proposed architecture is implemented on a Xilinx Virtex 5 FPGA platform. The implementation results show that it can operate at a clock frequency of up to 110 MHz. High output accuracy is also achieved: a PSNR of 63-77 dB is obtained for the three-level 2-D DHWT decomposition.

Keywords: DWT, Haar, Image processing, DWT decomposition.

## 1. INTRODUCTION

Transforms used in image processing applications have attracted considerable interest, both in software computation and in hardware architectures. Two important transforms used in many image processing applications are the Discrete Cosine Transform (DCT) [1] and the Discrete Wavelet Transform (DWT). The latter is considered an efficient transform for image compression, watermarking and denoising [2]. Taking image compression as an example, the main advantage of the DWT is that it avoids the blocking artifacts that arise from applying the DCT, especially in low bit rate applications.

The DWT has a wide range of filter families; one of these is the Daubechies family. Applying the DWT to an input signal takes two steps: the first computes the low pass filter components and the second computes the high pass filter components. Applying two filters to the input signal increases the computational complexity of the entire image processing system [3]. The computational complexity, together with the implementation complexity and the memory requirements, affects the performance of the entire system. The development of new DWT architectures therefore remains an important, challenging and valuable task [3], and numerous architectures have been introduced in the literature to simplify and improve the DWT computation scheme. Another important issue, particularly for image compression applications, is that the input signal needs to be decomposed into several levels, which increases the computational complexity of any DWT filter even further [2]. One of the important wavelet filters is the Haar filter, which is used in image and video processing applications either on its own [4-6] or in conjunction with other algorithms such as block truncation coding (BTC) [7, 8].

Several architectures have been designed and implemented for the DHWT filter, among them those in [9-11]. In [9], a high performance DHWT architecture is designed and implemented in VLSI for multilevel decomposition. The latency of this architecture is N+7 clock cycles, and a three-level decomposition requires 12 adders and 3 multipliers. In [11], a 2-D DHWT VLSI architecture is implemented in Verilog HDL and synthesized on a Virtex 6 FPGA using the Xilinx ISE software. Four block RAMs are used to perform the computation. The architecture is a parallel pipelined structure that computes each 8×8 image block in parallel. In [10], a VLSI 1-D DHWT architecture is implemented in 0.18 $\mu$m CMOS technology for eight samples; it can be generalized to N samples by adding log<sub>2</sub>N processing blocks.

In this paper, the 2-D DHWT is selected for design and implementation with multilevel decomposition. The proposed architecture is implemented using Xilinx System Generator and the ISE design suite from Xilinx. The key points of the proposed architecture are:

- Low complexity.
- Easy to implement for any required 2-D DHWT decomposition level.
- No block memory is required.
- Low initial latency.
- Parameterized for both image size and wordlength.

The rest of this paper is organized as follows: Section 2 introduces the DHWT filtering algorithm and Section 3 introduces the proposed 2-D DHWT architecture. The hardware implementation results are listed and discussed in Section 4, and conclusions are given in Section 5.

## 2. THE DHWT FILTERING ALGORITHM

The computation of 1-D DWT filter can be performed using the following relations [10]:

where F is the length of the wavelet filter, and LPF and HPF are the low and high pass 1-D DHWT decomposition filter coefficients, respectively. The LPF and HPF filter coefficients are:

$$LPF = 0.707 \times [1 \ 1] \qquad \dots \dots (3)$$

$$HPF = 0.707 \times [-1 \ 1] \qquad \dots \dots (4)$$

where LPF and HPF are the low and high pass filters, respectively. The coefficients of the Haar reconstruction filters are:

$$ILPF = 0.707 \times [1 \ 1] \qquad \dots \dots (5)$$

$$IHPF = 0.707 \times [1 \ -1] \qquad \dots \dots (6)$$

where ILPF and IHPF are the inverse low and high pass filters, respectively.

Applying (3) and (4) to the input signal produces low and high pass filtered versions of it. This constitutes a single-level DWT decomposition. A multilevel decomposition is computed by successively applying (3) and (4) to the down-sampled low pass output of the preceding level; in other words, by decomposing the low pass filtered data of the preceding level. This is illustrated for a two-level decomposition in Figure (1) [12].
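The multilevel 1-D procedure can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware design: the function names are hypothetical, the input length is assumed to be divisible by 2 at every level, and the sign of the high pass output follows the usual convolution convention for HPF = 0.707 × [-1 1] (first sample minus second), which is an assumption about the indexing used by the authors.

```python
import math

S = 1 / math.sqrt(2)  # approximately 0.707, the scale factor in (3) and (4)

def haar_1d(x):
    """One decomposition level: return the down-sampled (low, high) bands."""
    low = [S * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    # Assumed sign convention: first sample of each pair minus the second.
    high = [S * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return low, high

def haar_1d_multilevel(x, levels):
    """Return [high_1, ..., high_L, low_L]; only the low band is decomposed again."""
    bands = []
    low = list(x)
    for _ in range(levels):
        low, high = haar_1d(low)
        bands.append(high)
    bands.append(low)
    return bands

if __name__ == "__main__":
    for band in haar_1d_multilevel([4, 6, 10, 12, 8, 6, 5, 5], 2):
        print([round(v, 3) for v in band])
```

Each level halves the length of the low pass band, so a two-level decomposition of eight samples yields bands of length 4, 2 and 2.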

In the same manner, the 2-D DWT can be computed using the row-column approach: (3) and (4) are first applied to all image rows and then reapplied to the results column by column. For a 2-D signal such as an image, the output is composed of four bands: the Low-Low, High-Low, Low-High and High-High filtered signals. In the rest of this paper these four bands are abbreviated as LL, HL, LH and HH, respectively. The LL band represents the coarse signal, i.e. the signal that is low pass filtered both row-wise and column-wise.

The above procedure represents a single-level 2-D DWT decomposition. The multilevel DWT is computed by successively applying the same procedure to the LL band of the preceding DWT level.
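As an illustration of the row-column approach, the following Python sketch computes a multilevel 2-D Haar decomposition on an image stored as a list of rows. It is a reference model under stated assumptions rather than the proposed architecture: the helper names are hypothetical, the image dimensions are assumed to be divisible by 2 at every level, and the first letter of each band name is taken to refer to the row filter (band naming conventions differ between texts).

```python
import math

S = 1 / math.sqrt(2)  # approximately 0.707

def haar_rows(img):
    """Filter and down-sample every row; returns (low, high) half-width images."""
    low = [[S * (r[j] + r[j + 1]) for j in range(0, len(r), 2)] for r in img]
    high = [[S * (r[j] - r[j + 1]) for j in range(0, len(r), 2)] for r in img]
    return low, high

def transpose(img):
    return [list(col) for col in zip(*img)]

def haar_2d(img):
    """Single-level row-column transform; returns the bands LL, HL, LH, HH."""
    row_low, row_high = haar_rows(img)
    # Column filtering is row filtering applied to the transposed image.
    ll, lh = (transpose(b) for b in haar_rows(transpose(row_low)))
    hl, hh = (transpose(b) for b in haar_rows(transpose(row_high)))
    return ll, hl, lh, hh

def haar_2d_multilevel(img, levels):
    """Return ([(HL, LH, HH) per level], final LL); only LL is decomposed again."""
    details = []
    ll = img
    for _ in range(levels):
        ll, hl, lh, hh = haar_2d(ll)
        details.append((hl, lh, hh))
    return details, ll
```

For an $M \times M$ input, level $l$ produces bands of size $(M/2^{l}) \times (M/2^{l})$, so three levels applied to an 8×8 image leave a single LL coefficient.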

## 3. THE PROPOSED 2-D DHWT ARCHITECTURE

The proposed architecture is designed to be simple and easy to implement for any required DWT decomposition level. The 2-D DHWT filter is composed of two sub-stages for each decomposition level (l) of the total required decomposition levels (L).

The first sub-stage computes the 1-D DHWT of each row of the input signal and the second sub-stage performs the column filtering. The proposed single-level architecture with its two sub-stages is shown in Figure (2).

As shown in Figure (2), two add/subtract elements are required per sub-stage, i.e. four elements for the whole computation. A number of registers, marked P1, P2 and P3 in Figure (2), are also used in the computation. The size of each register is given below.

Following the procedure described in Section 2, the proposed multilevel 2-D DHWT architecture is shown in the multilevel architecture figure.

For an input image of $M \times M$ pixels, for example, level 1 ($l_1$) requires 2M+2 registers and four add/subtract elements, as shown in Figure (2) and the multilevel architecture figure. The size of each register can be determined from the same figures.

A two-level 2-D DWT decomposition requires 8 add/subtract elements and a total register size of 4M+7. In total, a three-level (L=3) decomposition requires 12 add/subtract elements and a register size of 8M+12.

The proposed architecture operates serially, i.e. as soon as the first output element of the first level appears at the output port, the second-level stage begins its computation.

As shown in Figure (2) and the multilevel architecture figure, each level produces two outputs. The first is the LL band output, which is used for the next level of decomposition. The second delivers all DWT bands of that level to the output port. Thus, the total number of output ports is equal to the number of required decomposition levels.

Importantly, the input image is converted to a 1-D array by row-by-row scanning, and a single pixel is supplied to the input port of the proposed architecture on each clock cycle. The first output of the first decomposition level appears at the output port after M+3 clock cycles, and outputs then follow at a rate of one per clock cycle.
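The serial, row-scanned operation of one level can be mimicked by a small behavioural model. The sketch below is an assumption-laden illustration and not the register-level design of Figure (2): the generator name is hypothetical, the sign and band-naming conventions are assumed, pipeline registers are not modelled, and the four bands of each 2×2 block are emitted together rather than serialized at one output per clock. It does show why only about one line of storage is needed: the row-filtered values of an even row are held until the matching pixels of the next row arrive.

```python
import math

S = 1 / math.sqrt(2)  # approximately 0.707

def stream_haar_2d(pixels, m):
    """Yield (clock, LL, HL, LH, HH) for a row-scanned stream, one pixel per clock.

    `m` is the (even) number of image columns. Storage is two buffers of m/2
    values each, i.e. one image line in total, independent of the row count.
    """
    line_low = [0.0] * (m // 2)
    line_high = [0.0] * (m // 2)
    prev = 0.0
    for clock, p in enumerate(pixels):
        row, col = divmod(clock, m)
        if col % 2 == 0:
            prev = p  # first pixel of a horizontal pair: hold it
            continue
        # Row sub-stage: one add and one subtract per pixel pair.
        low, high = S * (prev + p), S * (prev - p)
        k = col // 2
        if row % 2 == 0:
            # Even row: store the row-filtered values for the next row.
            line_low[k], line_high[k] = low, high
        else:
            # Column sub-stage: combine with the stored values of the row above.
            yield (clock,
                   S * (line_low[k] + low),    # LL
                   S * (line_high[k] + high),  # HL
                   S * (line_low[k] - low),    # LH
                   S * (line_high[k] - high))  # HH
```

In this model the first result appears when the pixel with zero-based index M+1 arrives, that is, once M+2 pixels have entered, which is of the same order as the M+3 clock cycles reported for the hardware.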

## 4. IMPLEMENTATION PERFORMANCE EVALUATION

The proposed architecture is synthesised using a vlx50t-3ff1136 Virtex 5 Xilinx FPGA platform. The MRI1, MRI2 and Lena images are used for performance evaluation. The sizes of these images are $128 \times 128$, $256 \times 256$ and $512 \times 512$ pixels, respectively. The wordlength is set to 15 bits and three decomposition levels are used in the performance evaluation. However, any other wordlength can be selected through a dedicated input parameter.

#### A. Output Accuracy

The output accuracy is computed according to the PSNR rate-distortion criterion between the original images and the reconstructed images. The difference between the Matlab results of the three-level decomposition and the LL band obtained from the proposed architecture is also computed. The maximum absolute error for the LL band is computed according to the following relation:

$$LL\text{-}Error = Max(ABS(LL_{Matlab}-LL_{Architecture})) \qquad \dots \dots (7)$$

The PSNR is computed using [12]:

$$RMSE = \sqrt{\frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} (I_{in}(i,j) - I_{out}(i,j))^{2}} \qquad \dots \dots (8)$$

$$PSNR = 10 \log_{10} \left( \frac{(max(I_{in}))^{2}}{(RMSE)^{2}} \right) \qquad \dots \dots (9)$$
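The error measures above translate directly into code. The following is a minimal sketch in plain Python, assuming both images are given as equally sized lists of rows; the function names are illustrative and the handling of identical images (infinite PSNR) is an added assumption.

```python
import math

def rmse(ref, out):
    """Root mean square error between two N x M images, as in (8)."""
    n = len(ref) * len(ref[0])
    sq = sum((a - b) ** 2 for r, o in zip(ref, out) for a, b in zip(r, o))
    return math.sqrt(sq / n)

def psnr(ref, out):
    """PSNR in dB, using the maximum value of the input image as the peak."""
    e = rmse(ref, out)
    if e == 0:
        return math.inf  # identical images: no distortion
    peak = max(max(row) for row in ref)
    return 10 * math.log10(peak ** 2 / e ** 2)

def max_abs_error(a, b):
    """Maximum absolute element-wise difference, as in (7)."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

Note that the peak is taken from the input image itself, as in the PSNR expression, rather than being fixed at 255.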

The calculated accuracy results are tabulated in Table (1). Results are listed for three selected images: the $(128 \times 128)$-pixel MRI1, the $(256 \times 256)$-pixel MRI2 and the $(512 \times 512)$-pixel Lena gray scale image. The wavelet bands of the three-level DWT decomposition of the test images are shown in the corresponding figures; the Lena result is shown in Figure (3). The reconstructed images are almost identical to the originals, as the lowest PSNR is about 63 dB, as listed in Table (1).

#### B. Hardware Utilization Rate

The hardware usage of the proposed architecture for the three test images is given in Table (2), together with the maximum operating frequency. About one sixth of the available slices (1,165 of 7,200) is occupied by the proposed architecture for an image size of $512\times512$ pixels, and fewer slices are used for smaller image sizes. Further, the maximum operating frequency of the three-level 2-D DWT decomposition is up to 110 MHz, which gives a 2-D DWT decomposition time of less than 0.6 ms for an image of $256\times256$ pixels.
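The reported computation times are consistent with a simple model in which one pixel enters the architecture per clock cycle, so that an $M \times M$ image takes roughly $M^2$ cycles. The sketch below illustrates this under that assumption; the function name and the optional latency term are hypothetical and are not taken from the original design.

```python
def computation_time_ms(m, f_mhz, latency_cycles=0):
    """Time in milliseconds to stream an m x m image at one pixel per clock."""
    cycles = m * m + latency_cycles
    return cycles / (f_mhz * 1e3)  # f_mhz * 1e3 is the number of cycles per ms

if __name__ == "__main__":
    # Image sizes paired with the maximum operating frequencies of Table (2).
    for m, f in [(128, 109), (256, 110), (512, 100)]:
        print(m, round(computation_time_ms(m, f), 2))  # 0.15, 0.6 and 2.62 ms
```

The initial latency of a few hundred cycles is negligible next to $M^2$, which is why it can be left out of the estimate.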

#### C. Power Consumption

The dynamic power consumption of the proposed architecture is computed using the Xilinx XPower Analyzer for selected image sizes and operating frequencies, as shown in Figure (4). Three frame sizes are considered in this test: $(128 \times 128)$, $(256 \times 256)$ and $(512 \times 512)$ pixels, using a wordlength of 15 bits. Figure (4) shows that the dynamic power consumption at a 50 MHz operating frequency is about 105, 130 and 184 mW for the three images, respectively. Lower power consumption is obtained when the operating frequency is reduced to 33.3 MHz, which is still sufficient for high speed three-level DWT computation of the three selected images. The required time at this operating frequency is 0.5, 2 and 8 ms, respectively.

## 5. CONCLUSIONS

In this paper, a new multilevel DHWT architecture is designed and implemented on an FPGA for image processing applications. The proposed architecture consists of identical low-complexity units, which makes higher DHWT decomposition levels easy to implement. It is parameterized for image size and wordlength. According to the obtained results, the output accuracy is very high, with negligible error. No block memory is used in the implementation and only 4 add/subtract elements are used per decomposition level. The implementation results show that the dynamic power consumption at a 50 MHz operating frequency is about 105, 130 and 184 mW for 128×128, 256×256 and 512×512-pixel image sizes, respectively. The computation times for the three-level DHWT decomposition of these image sizes at a 33.3 MHz operating frequency are 0.5, 2 and 8 ms, respectively.

## REFERENCES

- 1) S. Al-Azawi, S. Boussakta, and A. Yakovlev, "High precision and low power DCT architectures for image compression applications," in *IET Conference on Image Processing (IPR)*, 2012, pp. 1-6.
- 2) S. G. Mallat, "Theory for multiresolution signal decomposition: the wavelet representation," *IEEE Transactions on Pattern Analysis and Machine Intelligence*, vol. 11, pp. 674-693, 1989.
- 3) S. Al-Azawi, Y. A. Abbas, and R. Jidin, "Low complexity multidimensional CDF 5/3 DWT architecture," in *Communication Systems, Networks & Digital Signal Processing (CSNDSP), 2014 9th International Symposium on,* 2014, pp. 804-808.
- 4) A. Ahmad, B. Krill, A. Amira, and H. Rabah, "3D Haar wavelet transform with dynamic partial reconfiguration for 3D medical image compression," in *IEEE Biomedical Circuits and Systems Conference (BioCAS 2009)*, 2009, pp. 137-140.
- 5) A. Ahmad, B. Krill, A. Amira, and H. Rabah, "Efficient architectures for 3D HWT using dynamic partial reconfiguration," *Journal of Systems Architecture*, vol. 56, pp. 305-316, 2010.
- 6) N. Sriraam and R. Shyamsunder, "3-D medical image compression using 3-D wavelet coders," *Digital Signal Processing*, vol. 21, pp. 100-109, 2011.
- 7) S. Al-Azawi, S. Boussakta, and A. Yakovlev, "Performance improvement algorithms for colour image compression using DWT and multilevel block truncation coding," in *Communication Systems Networks and Digital Signal Processing (CSNDSP), 2010 7th International Symposium on*, 2010, pp. 811-815.
- 8) S. Al-Azawi, S. Boussakta, and A. Yakovlev, "Image compression algorithms using intensity based adaptive quantization coding," *American Journal of Engineering and Applied Sciences*, vol. 4, pp. 504-512, 2011.
- 9) J. Altermann, E. Costa, and S. Almeida, "High performance Haar Wavelet transform architecture," in *Circuit Theory and Design (ECCTD), 2011 20th European Conference on*, 2011, pp. 596-599.
- 10) A. Reddy and A. Dhar, "A discrete time continuous level VLSI architecture in current mode to implement Discrete Haar Wavelet Transform," *Analog Integrated Circuits and Signal Processing*, vol. 73, pp. 353-362, 2012.
- 11) J. Kidav, P. A. Ajeesh, D. Vasudev, V. S. Deepak, and A. Menon, "A VLSI Architecture for Wavelet Based Image Compression," in *Advances in Computing and Information Technology*, vol. 178, N. Meghanathan, D. Nagamalai, and N. Chaki, Eds. Springer Berlin Heidelberg, 2013, pp. 603-614.
- 12) S. M. S. Al-Azawi, "Efficient Architectures for Multidimensional Discrete Transforms in Image and Video Processing Applications," 2013.

**TABLE (1):** PSNR and LL-band error between the Matlab computation and the proposed architecture for three-level decomposition

| Image | Size (pixel) | PSNR (dB) | LL-Maximum Absolute Error |
|-------|--------------|-----------|---------------------------|
| MRI1  | 128×128      | 77.2      | 0.58                      |
| MRI2  | 256×256      | 72.563    | 1.79                      |
| Lena  | 512×512      | 63.57     | 1.8                       |


**TABLE (2):** Hardware utilization rate and maximum operating frequency for the proposed three-level 2-D DHWT architecture

| Slice Logic Utilization        | Available | Used (128×128-pixel) | Used (256×256-pixel) | Used (512×512-pixel) |
|--------------------------------|-----------|----------------------|----------------------|----------------------|
| Slice Registers                | 28,800    | 1,023                | 1,894                | 3,605                |
| Slice LUTs                     | 28,800    | 1,199                | 2,054                | 3,795                |
| Occupied Slices                | 7,200     | 382                  | 637                  | 1,165                |
| LUT Flip Flop                  |           | 1,223                | 2,086                | 3,803                |
| Bonded IOBs                    | 220       | 54                   | 54                   | 54                   |
| Max. operating frequency (MHz) |           | 109                  | 110                  | 100                  |
| Computation time (ms)          |           | 0.15                 | 0.60                 | 2.62                 |



Figure (1): Two-Level 1-D DWT decomposition.



Figure (2): The proposed single-level 2-D DHWT architecture.



Figure (3): Three-level DHWT decomposition of Lena image and its reconstructed version.



Figure (4): Dynamic power consumption using 15-bit wordlength size for various clock frequencies.

## An Efficient Low-Complexity Architecture for Computing the Multilevel 2-D DHWT

Saad Mohammed Saleh<sup>(1)</sup> and Ammar Ebdelmelik Abdelkareem<sup>(2)</sup> (1) College of Engineering, Diyala University, Baquba, Diyala, Iraq (2) College of Information Engineering, Al-Nahrain University, Baghdad, Iraq (1) saad\_alazawi@engineering.uodiyala.edu.iq (2) a.abdelkareem@coie.nahrainuniv.edu.iq

#### Abstract:

In this paper, an efficient circuit is designed and implemented to compute the multilevel two-dimensional Haar wavelet decomposition. The proposed design is tailored to the requirements of image processing systems. The key points of the proposed design are its low complexity and the elimination of buffer memory. The proposed design consists of similar units, which makes it easy to configure them to compute any wavelet decomposition level. The circuit requires 4L adders, where L is the number of required levels, and 8M registers, where M is the number of columns of the input image. The initial latency is M+3 clock cycles. The proposed architecture is implemented on ...