 
Bollampalli Akshaya* and Pankaj Hivraj Rangare
Department of ECE, Vaagdevi College of Engineering, India
*Corresponding author: Bollampalli Akshaya, Department of ECE, Vaagdevi College of Engineering, Warangal, India
Submission: May 30, 2025; Published: June 30, 2025
ISSN: 2694-4421, Volume 4, Issue 1
Multiplication is one of the most common arithmetic operations in applications such as image/video processing and machine learning. FPGA vendors offer high-performance multipliers in the form of DSP blocks, but these multipliers are inefficient for smaller bit-width multiplications and have fixed locations on the FPGA, which can cause additional routing delays. For this reason, FPGA vendors also offer optimized soft IP cores for multiplication. In this work, we argue that these soft multiplier IP cores still leave room for designs with higher performance and better resource efficiency. We present area-optimized, low-latency softcore multiplier architectures that use FPGA architectural features such as look-up tables (LUTs) and fast carry chains to reduce critical path delay and resource utilization. Across multiplier sizes, our proposed unsigned and signed accurate architectures reduce LUT usage by up to 25% and 53%, respectively, compared with the Xilinx multiplier LogiCORE IP. Furthermore, our unsigned approximate multiplier architectures reduce the critical path delay by up to 51% relative to the LogiCORE IP, with negligible loss in output accuracy. As an example, we have implemented the proposed multiplier architectures in image and video accelerators and assessed the performance and area improvements. We provide an open-source collection of accurate and approximate multipliers.
Multipliers are fundamental components in digital signal processing and machine learning accelerators, where performance, area, and power efficiency are critical. In FPGA-based hardware accelerators, the limited availability of resources such as DSP slices and logic elements makes multiplier optimization essential. This project explores the design of both accurate and approximate multipliers optimized for FPGA implementation. Accurate designs aim to maximize computational correctness and performance, while approximate multipliers intentionally trade off some accuracy for improvements in area, speed, and power consumption, making them well suited to error-tolerant applications such as image processing and neural networks. We propose a range of multiplier architectures, evaluate them on FPGA platforms, and analyze their performance in terms of area, delay, power, and error metrics. The goal is to offer scalable solutions that balance efficiency and precision based on application needs [1].
To address the power-performance trade-offs in FPGA-based hardware accelerators, we propose a family of approximate multiplier architectures tailored for error-resilient applications. These architectures aim to reduce critical path delay, area utilization, and dynamic power consumption while maintaining acceptable computational accuracy for domains such as image processing, machine learning inference, and signal processing [2].
Architectural overview
Our approximate multiplier architecture is based on modifying traditional multiplier designs, such as array multipliers, Wallace tree multipliers, and Booth multipliers, by strategically truncating partial products, replacing exact adders with approximate adders, and utilizing logic simplification techniques to reduce complexity. The architecture consists of the following key blocks:
Partial Product Generator: Generates all necessary bitwise AND operations between multiplicand and multiplier bits.
Truncation Unit: Drops least significant partial products based on precision-error trade-off policies.
Approximate Compressor Tree: Implements Wallace or Dadda tree reduction with approximate compressors (e.g., 4:2 compressors with carry elimination).
Approximate Accumulator: Performs the final summation using low-overhead approximate adders (e.g., Lower-part OR Adder, Error-Tolerant Adder).
Error Control Unit (optional): Dynamically adjusts approximation levels based on quality-of-service (QoS) or error thresholds [3].
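The interaction between the partial product generator and an approximate accumulator can be illustrated with a small behavioral model. The sketch below is a simplification under stated assumptions: it sums the partial-product rows one after another with a Lower-part OR Adder instead of a compressor tree, and the function names and the `low_bits` parameter are hypothetical, not taken from the proposed design.

```python
def partial_products(a: int, b: int, n: int) -> list[int]:
    """Return the n shifted partial-product rows of an unsigned n x n multiply."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(n)]


def loa_add(x: int, y: int, low_bits: int) -> int:
    """Lower-part OR Adder: OR the low_bits LSBs, add the upper bits exactly."""
    if low_bits == 0:
        return x + y
    low = (x | y) & ((1 << low_bits) - 1)
    # The carry into the exact upper part is approximated by ANDing
    # the most significant bits of the two lower parts.
    carry = ((x & y) >> (low_bits - 1)) & 1
    high = (x >> low_bits) + (y >> low_bits) + carry
    return (high << low_bits) | low


def approx_multiply(a: int, b: int, n: int, low_bits: int) -> int:
    """Accumulate the partial-product rows with the approximate adder (assumed flow)."""
    acc = 0
    for row in partial_products(a, b, n):
        acc = loa_add(acc, row, low_bits)
    return acc


if __name__ == "__main__":
    print(approx_multiply(15, 15, 4, 0))  # exact accumulation
    print(approx_multiply(15, 15, 4, 2))  # two LSBs handled by OR gates
```

With `low_bits=0` the model reduces to an exact multiplier, which makes it convenient as a reference when sweeping the approximation width.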
Design variants
We introduce three levels of approximation:
a. Low Approximation (LA)
Partial product truncation: none or minimal
Uses exact compressor tree and exact adders
Targeted for near-accurate operations with modest gains in power and area
b. Medium Approximation (MA)
Truncates up to 25% of partial products
Uses approximate compressors with limited carry propagation
Approximate accumulation with tunable adder stages
c. High Approximation (HA)
Truncates over 50% of partial products
Aggressive use of approximate compressors and accumulators
Suitable for applications tolerant to high error margins (e.g., deep learning inference)
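The effect of the three truncation levels can be sketched in software. The model below assumes that truncation removes whole partial-product rows starting from the least significant one, and it picks 0%, 25%, and 62.5% as representative fractions for LA, MA, and HA; both choices are illustrative assumptions, and the approximate compressors and adders of the MA and HA variants are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    drop_fraction: float  # assumed fraction of partial-product rows discarded


# Representative fractions only; the text gives "none or minimal",
# "up to 25%", and "over 50%".
VARIANTS = [Variant("LA", 0.0), Variant("MA", 0.25), Variant("HA", 0.625)]


def multiply_variant(a: int, b: int, n: int, variant: Variant) -> int:
    """Unsigned n x n multiply that skips the least significant rows."""
    dropped = int(n * variant.drop_fraction)
    return sum(a << i for i in range(dropped, n) if (b >> i) & 1)


if __name__ == "__main__":
    a, b, n = 200, 255, 8
    for v in VARIANTS:
        result = multiply_variant(a, b, n, v)
        print(f"{v.name}: {result} (exact {a * b})")
```

Dropping a row is equivalent to clearing the corresponding bit of the multiplier operand, so the error of this model depends on the value of `b` but is always an underestimate.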
FPGA-Aware optimizations
The architecture has been optimized for FPGA implementation with the following techniques:
LUT-level mapping: Approximate logic is mapped directly to FPGA LUTs to minimize depth and maximize parallelism.
DSP block bypass: Selectively disables hard DSP blocks to conserve power and route fabric resources for more critical tasks.
Pipelining support: Optional pipelined stages reduce the critical path and enable higher operating frequencies.
Partial reconfiguration: Allows switching between exact and approximate modes based on application needs [4-6].
Error and performance metrics
Each approximate multiplier design is characterized using the
following metrics:
Mean Relative Error (MRE)
Normalized Mean Square Error (NMSE)
Peak Signal-to-Noise Ratio (PSNR) (for image-based tasks)
Power-Delay Product (PDP)
FPGA resource utilization (LUTs, FFs, DSPs)
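These metrics can be computed exhaustively for small operand widths. The following sketch uses commonly seen definitions (NMSE as squared error normalized by the energy of the exact outputs, PSNR with a configurable peak value); the source does not spell out its exact formulas, so these are assumptions, and `approx_mul` is a hypothetical stand-in rather than one of the proposed designs.

```python
import math
from itertools import product


def approx_mul(a: int, b: int, drop_bits: int = 2) -> int:
    """Stand-in approximate multiplier: clears the low bits of b (assumption)."""
    return a * (b & ~((1 << drop_bits) - 1))


def error_metrics(n: int = 4, approx=approx_mul) -> tuple[float, float]:
    """Return (MRE, NMSE) over all unsigned n-bit operand pairs."""
    pairs = list(product(range(1 << n), repeat=2))
    exact = [a * b for a, b in pairs]
    apx = [approx(a, b) for a, b in pairs]
    # Relative error is undefined for a zero exact product, so skip those pairs.
    rel = [abs(e - x) / e for e, x in zip(exact, apx) if e != 0]
    mre = sum(rel) / len(rel)
    nmse = sum((e - x) ** 2 for e, x in zip(exact, apx)) / sum(e * e for e in exact)
    return mre, nmse


def psnr(ref: list[int], test: list[int], peak: int = 255) -> float:
    """PSNR in dB between two equally sized pixel sequences."""
    mse = sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


def power_delay_product(power_w: float, delay_s: float) -> float:
    """PDP in joules from power in watts and delay in seconds."""
    return power_w * delay_s


if __name__ == "__main__":
    mre, nmse = error_metrics()
    print(f"MRE={mre:.4f} NMSE={nmse:.4f}")
```

For wider operands, exhaustive enumeration becomes impractical and random sampling of operand pairs is the usual substitute.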
In FPGA-based hardware accelerators, the direct implementation of high-order multipliers (e.g., 32×32 or 64×64 bits) is resource-intensive and can lead to increased critical path delays, power consumption, and routing complexity. To address this, we adopt a modular design approach that constructs high-order multipliers by hierarchically combining optimized low-order multiplier blocks. This method improves scalability and resource efficiency, and it supports the integration of approximate computation where applicable [7-10].
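Modular construction strategy
The design approach decomposes a high-order multiplication operation into multiple low-order multiplications, accumulation stages, and appropriate bit-shifting operations. The general method is as follows. Let A and B be two n-bit operands, where n = 2k. Split A and B into k-bit high and low halves as:
$$A = A_H \cdot 2^{k} + A_L, \qquad B = B_H \cdot 2^{k} + B_L$$
The product then expands to:
$$A \times B = A_H B_H \cdot 2^{2k} + (A_H B_L + A_L B_H) \cdot 2^{k} + A_L B_L$$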
This decomposition requires:
Four low-order (k × k) multipliers
Two k-bit adders
Bit-shifting logic and final accumulation
Each of the four products can be mapped to accurate or approximate multipliers based on positional significance.
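A one-level version of this decomposition can be sketched as follows. Each n-bit operand (n = 2k) is split into k-bit high and low halves, the four k × k products are formed by pluggable multiplier functions, and the results are shifted and summed. The function names and `coarse_mul` are hypothetical; assigning the approximate block only to the low × low product is one plausible reading of "positional significance", not a mapping confirmed by the source.

```python
from typing import Callable

Mul = Callable[[int, int], int]


def exact_mul(x: int, y: int) -> int:
    """Reference k x k multiplier."""
    return x * y


def coarse_mul(x: int, y: int) -> int:
    """Hypothetical approximate block: ignores the least significant bit of y."""
    return x * (y & ~1)


def compose_multiplier(
    a: int,
    b: int,
    k: int,
    mul_hh: Mul = exact_mul,
    mul_hl: Mul = exact_mul,
    mul_lh: Mul = exact_mul,
    mul_ll: Mul = exact_mul,
) -> int:
    """Build a 2k x 2k unsigned multiply from four k x k multiplier blocks."""
    mask = (1 << k) - 1
    a_h, a_l = a >> k, a & mask
    b_h, b_l = b >> k, b & mask
    hh = mul_hh(a_h, b_h)                        # weight 2^(2k)
    cross = mul_hl(a_h, b_l) + mul_lh(a_l, b_h)  # weight 2^k
    ll = mul_ll(a_l, b_l)                        # weight 2^0
    return (hh << (2 * k)) + (cross << k) + ll


if __name__ == "__main__":
    a, b = 0xAB, 0xCD
    print(compose_multiplier(a, b, 4), a * b)
    print(compose_multiplier(a, b, 4, mul_ll=coarse_mul))
```

Because the low × low product carries the smallest weight, an error introduced there is bounded by the size of that product alone, which is why it is the natural first candidate for approximation.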
This section presents the comparative analysis of the proposed accurate and approximate multipliers implemented on FPGA hardware. The multipliers were evaluated on area utilization (LUTs, registers), power consumption, critical path delay, and accuracy (for approximate designs). Synthesis and simulation were performed using the Xilinx Vivado design tool.
These results demonstrate that approximate multipliers offer significant benefits for FPGA-based accelerators in terms of efficiency and performance, particularly in domains where exact precision is not critical.
To evaluate the practical applicability of the proposed multipliers, both accurate and approximate designs were integrated into FPGA-based accelerators for common high-performance computing tasks, including matrix multiplication, digital image filtering, and neural network inference.
Matrix multiplication
Using the multipliers in a matrix multiplication engine showed that the approximate variants:
Reduced computation latency by up to 38%
Enabled higher parallelism due to lower area usage
Maintained result fidelity with less than 3% average numerical error
This demonstrates suitability for scientific computing and data analytics where speed and resource efficiency are critical, and slight inaccuracies are tolerable.
Image processing (Edge detection filter)
When applied to a Sobel edge detection module, the approximate multipliers:
Resulted in 27% lower energy consumption
Kept output image quality (measured in PSNR) within acceptable visual limits (>30 dB)
Achieved a 35% throughput improvement
These gains make the designs ideal for real-time embedded vision systems such as drones, surveillance, and IoT devices.
Neural network inference
Integrated into a fixed-point neural network accelerator (e.g., for digit classification using MNIST), the approximate multipliers:
Led to 22% faster inference times
Incurred a <1% drop in classification accuracy
Reduced power by up to 30%
Accurate multiplier architectures:
Existing accurate multiplier designs on FPGAs (e.g., Booth multipliers, Wallace trees, array multipliers) differ in their trade-offs between area and delay. Existing soft IP cores from FPGA vendors (e.g., Xilinx LogiCORE IP) serve as benchmarks.
Approximate multiplier techniques:
Current approximate multiplier approaches for FPGAs can be categorized by their approximation strategies (e.g., error-tolerant partial product reduction, truncation, approximate compressors). Approximate multipliers are most beneficial in applications with inherent error resilience (e.g., image processing, neural networks).
Performance metrics:
The key metrics used for evaluation are defined below.
Area:
Measured in LUTs, Flip-Flops (FFs), or slices.
Delay:
Critical Path Delay (CPD) or maximum operating frequency.
Power consumption:
Static and dynamic power.
Accuracy (for approximate multipliers):
Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Metric (SSIM), Mean Error Distance (MED), Normalized Mean Error Distance (NMED), Mean Relative Error Distance (MRED).
Power-Delay-Area Product (PDAP):
A combined metric for overall efficiency.
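For reference, the error-distance metrics listed above are commonly defined as follows in the approximate-arithmetic literature. The source does not state which definitions it uses, so the formulas below are an assumption; here $P_i$ is the exact product, $\hat{P}_i$ the approximate product for the i-th of N input pairs, and n the operand width of an unsigned multiplier.

```latex
\begin{align}
  \mathrm{ED}_i &= \lvert P_i - \hat{P}_i \rvert \\
  \mathrm{MED}  &= \frac{1}{N} \sum_{i=1}^{N} \mathrm{ED}_i \\
  \mathrm{NMED} &= \frac{\mathrm{MED}}{P_{\max}}, \qquad P_{\max} = (2^{n} - 1)^{2} \\
  \mathrm{MRED} &= \frac{1}{N} \sum_{i=1}^{N} \frac{\mathrm{ED}_i}{P_i}, \qquad P_i \neq 0 \\
  \mathrm{PDAP} &= \text{Power} \times \text{Delay} \times \text{Area}
\end{align}
```

NMED normalizes by the largest possible exact output, which makes designs of different bit widths comparable, while MRED weights errors on small products more heavily.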
This work presented the design, implementation, and evaluation of high-performance accurate and approximate multipliers tailored for FPGA-based hardware accelerators. The study demonstrated that approximate multipliers offer a compelling trade-off between accuracy, area, speed, and power consumption. Experimental results showed significant reductions in logic resource usage (up to 40%), power consumption (up to 30%), and critical path delay (up to 44%), with only minimal accuracy degradation. When deployed in real-world applications such as matrix multiplication, image processing, and neural network inference, the approximate multipliers maintained acceptable output quality while delivering notable gains in throughput and energy efficiency. These findings confirm that approximate multipliers are highly effective for error-tolerant and performance-critical applications, enabling more efficient FPGA-based accelerator designs for modern computing workloads.
Future work will explore adaptive approximation techniques and dynamic accuracy scaling to further enhance flexibility and efficiency in reconfigurable hardware systems.
© 2025 Bollampalli Akshaya. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and build upon your work non-commercially.