Hardware Implementation of Deep Convolutional Neural Networks for Computer Vision


2025-06-28 16:27:32 - Adil Khan

Project Title

Hardware Implementation of Deep Convolutional Neural Networks for Computer Vision

Project Area of Specialization: Artificial Intelligence

Project Summary

Our project focuses on the hardware implementation of deep convolutional neural networks for computer vision. We aim to deploy a state-of-the-art image segmentation model, e.g., Mask R-CNN, on an FPGA.

This prototype can be used to solve real-world problems in multiple domains. Our prototype will include an FPGA connected with a camera and a display screen. It will perform real-time processing to detect objects and generate a high-quality segmentation mask for each instance.

A hardware-based implementation of Mask R-CNN will deliver performance comparable to commercially available edge-device solutions, such as Nvidia's Jetson Nano kits, while consuming less energy.

Our prototype can serve multiple applications, including low-cost security solutions, autonomous driving, healthcare, agriculture, and retail.

Project Objectives

Our main objective is to build a low-cost, energy-efficient solution for deploying real-time embedded object detection systems. Current solutions, such as GPU-equipped edge devices, carry a high cost and consume considerably more power than a dedicated hardware-based solution.

Ideally, our prototype will match or improve on the accuracy of traditional software-based methods while consuming less power and costing less when manufactured on silicon at scale. These benefits, however, come at the cost of reduced flexibility, an inherent limitation of a hardware-based approach to implementing any algorithm.

Another major objective of our project is to complete our literature review. This is required to give the reader a better understanding of the work already done in this domain and to accurately convey our results.

To achieve these objectives, we will implement both solutions: a hardware-based solution on an FPGA and a software-based solution on an edge device. Afterward, we will compare both approaches and share our findings on their pros, cons, and potential pitfalls that may occur during implementation.

Project Implementation Method

The main goal of our project is to optimize the Mask R-CNN algorithm for real-time object detection, reducing the computational cost and increasing the power efficiency while keeping the accuracy of the network at a suitable level.

The Mask R-CNN model predicts the class label, bounding box, and segmentation mask for each object in an image. Our design flow will comprise the following parts:

  1. Prepare the model configuration parameters.
  2. Build the Mask R-CNN model architecture.
  3. Read the input image from the integrated camera.
  4. Select the CNN (convolutional neural network) model and structure.
  5. Design and optimize the OpenCL kernel.
  6. Deploy the design on the FPGA.
  7. Detect the objects in the image.
  8. Visualize the results.
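As a sketch of steps 1 and 2, the model configuration can be prepared as a small parameter set that is validated before the network is built. The parameter names below follow common Mask R-CNN configurations, and the values are illustrative assumptions, not our final configuration:

```python
# Step 1 (sketch): model configuration parameters. Names follow common
# Mask R-CNN configurations; the values are illustrative assumptions.
config = {
    "num_classes": 81,                 # e.g. 80 COCO classes + background
    "image_shape": (1024, 1024, 3),    # input resolution fed to the network
    "backbone": "resnet50",            # feature-extraction CNN
    "detection_min_confidence": 0.7,   # discard low-confidence detections
}

def validate_config(cfg):
    """Basic sanity checks before building the model architecture (step 2)."""
    if cfg["num_classes"] < 2:
        raise ValueError("need at least one foreground class plus background")
    h, w, _ = cfg["image_shape"]
    if h % 64 or w % 64:
        raise ValueError("image sides should be divisible by 64 for an FPN backbone")
    return cfg
```

Validating the configuration up front keeps mistakes out of the much slower hardware build steps that follow.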

The proposed hardware architecture comprises an Advanced RISC Machines (ARM)-centric processing system (PS) and programmable logic (PL). Data from the camera enters the FPGA and undergoes input processing (deserialization, etc.); if needed, the resolution can then be reduced to enable larger sliding windows or to lower memory-bandwidth requirements.
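The optional resolution reduction can be modeled as a 2×2 averaging downsample, which quarters both the pixel count and the memory-bandwidth requirement. This is a behavioural software sketch of what the input-processing logic would do, not the RTL itself:

```python
def downsample_2x2(grey):
    """Average each 2x2 block of a greyscale image (a list of equal-length
    rows with even dimensions), producing a half-resolution image."""
    rows, cols = len(grey), len(grey[0])
    return [
        [(grey[r][c] + grey[r][c + 1] + grey[r + 1][c] + grey[r + 1][c + 1]) // 4
         for c in range(0, cols, 2)]
        for r in range(0, rows, 2)
    ]
```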

Only the greyscale component of the camera signal is sent to the external memory, where the entire image is stored before being processed by the FPGA. The ARM (Advanced RISC Machines) processor will run the parts of the code that do not require hardware acceleration; the remainder will run on the programmable logic.
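The greyscale extraction can be sketched with the standard BT.601 luma weights in integer-only arithmetic, which maps directly onto fixed-point hardware. This is an illustrative software model, not our final RTL:

```python
def rgb_to_grey(r, g, b):
    """BT.601 luma approximation using integer arithmetic only, so the
    same expression can be realized with constant multipliers and a
    divide in the programmable logic (no floating point needed)."""
    return (299 * r + 587 * g + 114 * b) // 1000
```

Because the weights sum to 1000, an all-white pixel (255, 255, 255) maps back to exactly 255.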

The input data reordering module rearranges the pixels and feeds them to the processing array. The CNN logic is clocked at a higher frequency than the memory reads to allow multiple sliding-window sizes. We will compare the results with an edge device to show that the FPGA is more power efficient and speeds up image recognition.
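A software model of the reordering module's output order might look like the following: patches are emitted in raster order for a given window size and stride. The function name and interface are assumptions for illustration only:

```python
def sliding_windows(image, win, stride):
    """Yield (row, col, patch) tuples in raster order -- the order in
    which the reordering module would stream windows of the image into
    the processing array."""
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - win + 1, stride):
        for c in range(0, cols - win + 1, stride):
            patch = [row[c:c + win] for row in image[r:r + win]]
            yield r, c, patch
```

Supporting several (win, stride) pairs is why the CNN side runs at a higher clock than the memory reads: the same stored pixels are revisited for each window configuration.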

Benefits of the Project

Many edge devices are available today for implementing computer vision solutions, the best known of which is the Nvidia Jetson Nano. These devices are costly and also consume more energy than a dedicated hardware-based approach.

Object detection is one of the most well-known and thoroughly researched problems in computer vision. Our goal is to provide a hardware solution to this problem, ensuring a lower manufacturing cost than GPU-based edge devices at the cost of some flexibility.

The hardware approach will require less energy, allowing battery-powered solutions to last longer while matching or exceeding the accuracy achieved on edge devices. There are many applications for an embedded, real-time, high-efficiency object detection system, including low-cost security solutions, autonomous driving, healthcare, agriculture, and retail.

All these solutions can be brought to market faster and more cheaply once our hardware prototype matches or improves on the accuracy of traditional software-based methods while consuming less power.

Technical Details of Final Deliverable

Our final deliverable will be a prototype consisting of an FPGA, a camera, and an HDMI screen. We will also compare the results of our prototype against a traditional edge device.

Our first step is image segmentation: the image captured by the camera is segmented based on the color range of a reference object. We receive 8 bits each for the red, green, and blue channels but use only the upper 4 bits of each channel.

The object is highlighted by the mask built from the captured image. In the second step, a counter counts the number of pixels that fall within the color range of the reference object. In the final stage, this count is compared with a threshold value.
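The three stages (masking on the upper 4 bits of each channel, counting in-range pixels, and comparing the count against a threshold) can be modeled in a few lines. This is a behavioural sketch of the pipeline, with function names chosen for illustration:

```python
def in_range_4bit(pixel, lo, hi):
    """Compare only the upper 4 bits of each 8-bit RGB channel, which
    keeps the hardware comparators half the width of full 8-bit ones."""
    return all((lo[i] >> 4) <= (pixel[i] >> 4) <= (hi[i] >> 4) for i in range(3))

def detect(image, lo, hi, threshold):
    """Stage 1: build the mask. Stage 2: count in-range pixels.
    Stage 3: compare the count against the threshold."""
    mask = [[1 if in_range_4bit(p, lo, hi) else 0 for p in row] for row in image]
    count = sum(sum(row) for row in mask)
    return mask, count, count >= threshold
```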

The image captured by the camera arrives in Bayer format and is passed to the FPGA, which converts it into the proper RGB format to compute the value of each pixel. Camera and HDMI controllers interface the VmodCAM camera module and the LCD with the FPGA.
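A minimal model of the Bayer-to-RGB conversion, assuming an RGGB mosaic pattern and nearest-neighbour reconstruction. Real camera controllers interpolate across neighbouring cells for full resolution, but the per-cell arithmetic illustrates the idea:

```python
def debayer_rggb(raw):
    """Collapse each 2x2 RGGB cell [R G / G B] of the raw sensor data
    into one RGB pixel, averaging the two green samples (this yields a
    half-resolution output)."""
    out = []
    for r in range(0, len(raw), 2):
        row_px = []
        for c in range(0, len(raw[0]), 2):
            red = raw[r][c]
            green = (raw[r][c + 1] + raw[r + 1][c]) // 2
            blue = raw[r + 1][c + 1]
            row_px.append((red, green, blue))
        out.append(row_px)
    return out
```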

An FPGA slide switch is reserved for switching the display between the raw video capture and the segmented video. Most processing is done in parallel on the FPGA, so this model performs better than platforms such as a DSP or PC. The output supports resolutions up to 1600 × 1200 over a 24-bit parallel bus in processed RGB.

We will implement the same algorithm on an edge device and compare the efficiency of both models (FPGA and edge device) to show that the FPGA is more efficient in terms of power, speed, and cost.
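The efficiency comparison can be reduced to a frames-per-second-per-watt figure of merit. The numbers below are placeholders for illustration only; the real values will come from our measurements:

```python
def perf_per_watt(fps, watts):
    """Figure of merit for the comparison: throughput per unit power."""
    return fps / watts

# Placeholder numbers for illustration only, not measured results.
fpga = perf_per_watt(fps=15.0, watts=2.5)    # hypothetical FPGA figures
edge = perf_per_watt(fps=20.0, watts=10.0)   # hypothetical edge-GPU figures
more_efficient = "FPGA" if fpga > edge else "Edge device"
```

Note that a device with lower raw FPS can still win on this metric, which is exactly the argument for a battery-powered deployment.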

Final Deliverable of the Project: HW/SW integrated system
Core Industry: IT
Other Industries: Education, Medical, Energy, Security, Telecommunication
Core Technology: Artificial Intelligence (AI)
Other Technologies: Others
Sustainable Development Goals: Industry, Innovation and Infrastructure; Sustainable Cities and Communities; Peace, Justice and Strong Institutions

Required Resources

Item Name | Type | No. of Units | Per Unit Cost (in Rs) | Total (in Rs)
Artix-7™ Field Programmable Gate Array (FPGA) from Xilinx® | Equipment | 1 | 45000 | 45000
Display: BenQ GW2270H | Equipment | 1 | 15000 | 15000
FPGA Interfaceable Camera | Equipment | 1 | 4000 | 4000
Cables and Extensions | Equipment | 1 | 3000 | 3000
Logistics Expenses | Miscellaneous | 1 | 3000 | 3000
Printing and Stationery Expenses | Miscellaneous | 1 | 4000 | 4000
Overhead Expenses | Miscellaneous | 1 | 2000 | 2000
Total (in Rs) | | | | 76000
