Application Acceleration For ARM and NEON Based Architectures

2025-06-28 16:30:16 - Adil Khan

Project Title

Project Area of Specialization: Robotics

Project Summary

Demand for processing power is growing in the embedded domain. Meeting it depends on executing applications efficiently, which in turn requires efficient use of the underlying embedded architecture. The current trend in embedded architectures is a growing number of cores and the availability of vector units. Programmers therefore have to parallelize and vectorize their applications to benefit from the architecture at hand.

This project accelerates a sequential application on an ARM+NEON based platform to meet real-time performance requirements. We used a mean-shift video tracking application as a case study to demonstrate the approach on a real application and a real target architecture. The original application ran at 11 FPS. After optimizing it to use the available cores and vector units, we obtained 50 FPS.

Project Objectives

Main Objective

Attain real-time application performance on an ARM+NEON based platform by utilizing multiple cores and vector units.

Sub-Objectives

Port a sequential application to an ARM+NEON based platform.
Profile the target application to find out its time and space complexity.
Parallelize and vectorize the code, using multiple cores to achieve task-level parallelism and SIMD operations to achieve data-level parallelism.
Compare experimental results to report performance gains relative to the initial port.

Project Implementation Method

The first step, before implementing any optimization, was to profile the mean-shift application using gprof. Based on the application profile, we identified the parts that needed to be accelerated. We then applied the following optimizations:

1) Implementation of thread-level parallelism with OpenMP.

2) Implementation of data-level parallelism with Neon intrinsics.

After these optimizations, we also performed some memory optimizations, which further increased the FPS.

The following block diagram shows the project flow.

[Figure: block diagram of the project flow]

Benefits of the Project

The project provides the following benefits.

Efficient mapping of the mean-shift application onto the target platform.
Reduction of the application's overall track time.
Improved application performance through the use of parallel architectures.
Increased frames per second (FPS) in the application.
Real-time use of the application in time-critical scenarios.

The figure below depicts how parallelizing the application on a multicore system can achieve greater performance than running it on a single core.

[Figure: performance of the parallelized application on a multicore system compared with a single core]

Technical Details of Final Deliverable

gprof Call Graph

[Figure: gprof call graph]

The profile shows that most of the program time is spent in the following four functions, which are therefore the candidates for optimization:

• CalWeight

• Pdf representation

• Track

• Epanechnikov kernel

Implementation of Thread Level Parallelism with OpenMP

As the call graph shows, four major functions account for most of the application's run time. Analyzing these functions shows that they contain parallelizable loops. We used the OpenMP API to parallelize the loops in the four functions listed above, which resulted in an overall speedup of 2.2 times.

[Figure: results for the OpenMP-parallelized version]

Version_2 is the parallelized version of the baseline application.
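
As an illustration of the kind of loop-level parallelism described above, here is a minimal sketch in C. It is not the project's code: the function name `epanechnikov_kernel`, the row-major kernel layout, and the unnormalized profile k = 1 - r^2 are assumptions chosen for the example. Each row is independent, so the outer loop can be split across cores, and the running sum is combined with an OpenMP reduction.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original application): fills an h x w
   row-major kernel with the Epanechnikov profile k = 1 - r^2 for r^2 < 1,
   else 0, where r is the distance from the window centre scaled by the
   half-width and half-height. Returns the sum of all weights so the caller
   can normalize. */
float epanechnikov_kernel(float *kernel, int h, int w)
{
    assert(kernel != NULL && h > 0 && w > 0);

    const float cy = 0.5f * (float)(h - 1);   /* centre row */
    const float cx = 0.5f * (float)(w - 1);   /* centre column */
    const float ry = 0.5f * (float)h;         /* half-height */
    const float rx = 0.5f * (float)w;         /* half-width */
    float sum = 0.0f;
    int y;

    /* Rows are independent, so OpenMP can hand them to different threads.
       The reduction clause gives each thread a private partial sum and
       adds the partial sums together at the end. */
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            const float dy = ((float)y - cy) / ry;
            const float dx = ((float)x - cx) / rx;
            const float r2 = dx * dx + dy * dy;
            const float k = (r2 < 1.0f) ? 1.0f - r2 : 0.0f;
            kernel[y * w + x] = k;
            sum += k;
        }
    }
    return sum;
}
```

Built with an OpenMP-enabled compiler flag (for example `-fopenmp` with GCC), the rows are distributed across the available cores; without it the pragma is ignored and the same code runs sequentially, which makes it easy to compare the two versions.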

Memory optimizations

Parallelizing the application did not give us the expected FPS. Further analysis showed that one of the functions we had parallelized made redundant memory accesses, and that our method of accessing pixel values was inefficient. In the next two versions of the application we therefore used a more efficient method to access pixel values and removed the redundant accesses. After these optimizations, our FPS count increased by almost 20.

[Figure: results for the memory-optimized versions]
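
The summary does not say which access method was replaced, so the following is only a sketch of one common form of this optimization, under assumed names and layout: an 8-bit grayscale image with rows `stride` bytes apart, a hypothetical `weighted_histogram` function, and 16 intensity bins. The row base address is computed once per row instead of once per pixel, and each pixel is loaded once and reused.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BINS = 16 };   /* assumption: 16 intensity bins over 8-bit values */

/* Hypothetical example (not the project's code): accumulates a
   kernel-weighted intensity histogram over the w x h region whose top-left
   corner is (x0, y0). `kernel` is a w x h row-major array of weights. */
void weighted_histogram(const uint8_t *img, size_t stride,
                        int x0, int y0, int w, int h,
                        const float *kernel, float hist[BINS])
{
    assert(img != NULL && kernel != NULL && hist != NULL);

    for (int b = 0; b < BINS; ++b) {
        hist[b] = 0.0f;
    }

    for (int y = 0; y < h; ++y) {
        /* Resolve the row pointers once per row rather than recomputing
           the full index for every pixel. */
        const uint8_t *row = img + (size_t)(y0 + y) * stride + (size_t)x0;
        const float *krow = kernel + (size_t)y * (size_t)w;

        for (int x = 0; x < w; ++x) {
            const uint8_t pixel = row[x];   /* single load, reused below */
            const int bin = pixel >> 4;     /* 256 values / 16 bins */
            hist[bin] += krow[x];
        }
    }
}
```

The same idea applies to multi-channel images: fetch the row pointer once, step through it by the number of channels, and keep each channel value in a local variable instead of reading it from the image again.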

Vectorization with Neon Intrinsics

Neon is the built-in SIMD vector unit in the ARMv8 architecture. We used it to implement data-level parallelism in our application. We targeted one of the loops that contributes significantly to the total track time and vectorized it using Neon intrinsics. This gave us an additional 8 FPS on top of the memory-optimized version of the application.

[Figure: results for the Neon-vectorized version]
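
The summary does not identify which loop was vectorized, so the sketch below uses a generic multiply-accumulate loop as a stand-in. The function name `weighted_sum` is hypothetical. The Neon path processes four floats per iteration with standard `arm_neon.h` intrinsics, and a scalar loop handles the remaining elements (and serves as a fallback when Neon is not available at compile time).

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical example (not the project's loop): returns the sum of
   w[i] * x[i] for i in [0, n). */
float weighted_sum(const float *w, const float *x, size_t n)
{
    assert(w != NULL && x != NULL);

    size_t i = 0;
    float total = 0.0f;

#if HAVE_NEON
    /* Four partial sums, one per vector lane. */
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        const float32x4_t vw = vld1q_f32(w + i);   /* load 4 weights */
        const float32x4_t vx = vld1q_f32(x + i);   /* load 4 values */
        acc = vmlaq_f32(acc, vw, vx);              /* acc += vw * vx */
    }
    /* Horizontal add of the four lanes, written so that it compiles for
       both 32-bit and 64-bit ARM targets. */
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    half = vpadd_f32(half, half);
    total = vget_lane_f32(half, 0);
#endif

    /* Scalar tail (or the whole loop when Neon is unavailable). */
    for (; i < n; ++i) {
        total += w[i] * x[i];
    }
    return total;
}
```

Because the vector path adds the products in a different order than a plain scalar loop, results can differ in the last bits for general floating-point inputs; this is usually acceptable for tracking weights but worth checking when comparing versions.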

Conclusion: We used different parallelism techniques to speed up the mean-shift application on a Raspberry Pi board and obtained real-time performance.
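
As a rough cross-check of the figures reported in this summary, the stage-by-stage gains can be tallied as follows. The intermediate frame rates are derived here, not reported measurements, and the tally assumes that the 2.2 times speedup applies directly to the frame rate.

```latex
\begin{align*}
S_{\text{total}} &= \frac{50\ \text{FPS}}{11\ \text{FPS}} \approx 4.5 \\
\text{OpenMP:}\quad 11 \times 2.2 &\approx 24\ \text{FPS} \\
\text{Memory optimizations:}\quad 24 + 20 &\approx 44\ \text{FPS} \\
\text{Neon vectorization:}\quad 44 + 8 &\approx 52\ \text{FPS}
\end{align*}
```

Since the memory optimizations added "almost" 20 FPS, the tally is in line with the reported final rate of about 50 FPS.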

Final Deliverable of the Project: HW/SW integrated system

Type of Industry: IT Technologies Robotics

Sustainable Development Goals: Partnerships to achieve the Goal

Required Resources

Item Name	Type	No. of Units	Per Unit Cost (in Rs)	Total (in Rs)
Raspberry Pi board	Equipment	1	6500	6500
Monitor for Raspberry Pi	Equipment	1	3500	3500
SD card	Equipment	1	600	600
Keyboard/mouse	Equipment	1	800	800
Power supply	Equipment	1	1000	1000
			Total (in Rs)	12400
