Application Acceleration For ARM and NEON Based Architectures
2025-06-28 16:30:16 - Adil Khan
Project Area of Specialization: Robotics
Project Summary
Demand for processing power is growing in the embedded domain. These requirements are met mainly by executing applications efficiently, which in turn requires efficient use of the underlying embedded architecture. The current trend in embedded architectures is a growing number of cores together with the availability of vector units. Programmers therefore have to parallelize and vectorize their applications to benefit from the architecture at hand.
This project accelerates a sequential application on an ARM+NEON based platform to meet real-time performance requirements. We used a mean-shift video tracking application as a case study to demonstrate the approach on a real application and a real target architecture. The original application ran at 11 FPS. After optimizing it to use the available cores and vector units, we obtained 50 FPS.
- Main Objective
Attain real-time application performance on an ARM+NEON based platform by utilizing multiple cores and vector units.
- Sub-Objectives
- Port a sequential application to an ARM+NEON based platform.
- Profile the application to determine its time and space complexity.
- Parallelize and vectorize the code, using multiple cores to achieve task-level parallelism and SIMD operations to achieve data-level parallelism.
- Compare experimental results to report the performance gains over the initial port.
The first step, before implementing any optimization, was to profile the mean-shift application using gprof. Based on the application profile, we identified the parts that needed to be accelerated. We then applied the following optimizations:
1) Implementation of Thread Level Parallelism with OpenMP.
2) Implementation of Data Level Parallelism with NEON Intrinsics.
After these optimizations we also applied some memory optimizations, which increased the FPS further.
The following block diagram shows the project flow.

The project provides the following benefits.
- Efficient mapping of the mean-shift application onto the target platform.
- Reduction of the overall track time of the application.
- Improved application performance by utilizing parallel architectures.
- Increased Frames Per Second in the application.
- Real-time use of the application in time-critical scenarios.
Figure below depicts how parallelizing the application on a multicore system can have greater performance than running it on a single core.

gprof Call Graph

The profile shows that most of the program time is spent in the following four functions, which are therefore the candidates for optimization:
• CalWeight
• Pdf representation
• Track
• Epanechnikov kernel
Implementation of Thread Level Parallelism with OpenMP
As the call graph shows, four major functions account for most of the application's run time. Analyzing these functions shows that they contain parallelizable loops. We used the OpenMP API to parallelize the loops in these four functions, which resulted in an overall speedup of 2.2 times.

Version_2 is the parallelized version of the baseline application.
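As an illustration of the kind of loop parallelization described above, here is a minimal sketch in C. It is not the project's code: the function `weighted_center`, the `Center` type, and the row-major weight layout are assumptions made for the example. It computes the weighted mean of pixel coordinates, the core update of a mean-shift step, and splits the rows of the window across threads with an OpenMP reduction.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch. */
typedef struct {
    float x;
    float y;
} Center;

/* Hypothetical helper: weighted mean of pixel coordinates over a
 * width x height window whose weights are stored row by row. */
Center weighted_center(const float *weights, int width, int height)
{
    assert(weights != NULL && width > 0 && height > 0);

    float sum_w = 0.0f, sum_wx = 0.0f, sum_wy = 0.0f;

    /* Rows are independent. Each thread accumulates private partial sums,
     * and the reduction clause combines them when the loop ends. */
    #pragma omp parallel for reduction(+ : sum_w, sum_wx, sum_wy)
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            float w = weights[(size_t)y * (size_t)width + (size_t)x];
            sum_w += w;
            sum_wx += w * (float)x;
            sum_wy += w * (float)y;
        }
    }

    Center c = {0.0f, 0.0f};
    if (sum_w > 0.0f) {          /* avoid dividing by zero on an empty window */
        c.x = sum_wx / sum_w;
        c.y = sum_wy / sum_w;
    }
    return c;
}
```

With GCC the pragma takes effect when the file is compiled with `-fopenmp`; without that flag the pragma is ignored and the loop runs sequentially with the same result for inputs like the ones tested here.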
Memory optimizations
Parallelizing the application did not give us the expected FPS. Further analysis showed that one of the parallelized functions made redundant memory accesses and that our method of accessing pixel values was inefficient. In the next two versions of the application we therefore switched to a more efficient pixel access method and removed the redundant accesses. After these optimizations, the frame rate increased by almost 20 FPS.
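The text does not say which access pattern was replaced, so the following is only a sketch of one common fix of this kind, under stated assumptions: an 8-bit single-channel image stored row by row, a hypothetical function `roi_histogram`, and an assumed bin count of 16. The base address of each row is computed once and the pixels are then read sequentially, instead of recomputing the full address (or calling an accessor) for every pixel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BINS 16   /* assumed bin count for this sketch */

/* Hypothetical helper: histogram of a w x h region of interest whose
 * top-left corner is (x0, y0). stride is the number of bytes per row. */
void roi_histogram(const uint8_t *image, size_t stride,
                   int x0, int y0, int w, int h,
                   unsigned hist[NUM_BINS])
{
    assert(image != NULL && hist != NULL);

    for (int b = 0; b < NUM_BINS; b++) {
        hist[b] = 0;
    }

    for (int y = 0; y < h; y++) {
        /* Row base is computed once per row, not once per pixel. */
        const uint8_t *row = image + (size_t)(y0 + y) * stride + (size_t)x0;
        for (int x = 0; x < w; x++) {
            /* 256 intensity levels / 16 bins = 16 levels per bin. */
            hist[row[x] >> 4]++;
        }
    }
}
```

Walking each row through a single pointer keeps the reads contiguous, which tends to suit the cache and also gives the compiler a simpler inner loop to optimize.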

Vectorization with NEON Intrinsics
NEON is the built-in SIMD vector unit of the ARMv8 architecture. We used it to implement data-level parallelism in our application. We targeted one of the loops that accounts for a significant share of the total track time and vectorized it using NEON intrinsics. This gave us an additional 8 FPS on top of the memory-optimized version of the application.
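The vectorized loop itself is not shown here, so the sketch below uses a hypothetical kernel, `weighted_sum`, to illustrate the technique: a multiply-accumulate over two float arrays that processes four elements per iteration with NEON intrinsics. The function name and the choice of loop are assumptions. The NEON path is guarded by the compiler's `__ARM_NEON` macro so that the same file also builds, as plain scalar code, on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical kernel: returns the sum of weights[i] * coords[i]
 * for i in [0, n). */
float weighted_sum(const float *weights, const float *coords, size_t n)
{
    assert(weights != NULL && coords != NULL);

    size_t i = 0;
    float total = 0.0f;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t acc = vdupq_n_f32(0.0f);            /* four partial sums */
    for (; i + 4 <= n; i += 4) {
        float32x4_t w = vld1q_f32(weights + i);     /* load 4 weights */
        float32x4_t c = vld1q_f32(coords + i);      /* load 4 coordinates */
        acc = vmlaq_f32(acc, w, c);                 /* acc += w * c per lane */
    }
    /* Fold the four lanes into a single scalar. */
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    half = vpadd_f32(half, half);
    total = vget_lane_f32(half, 0);
#endif

    /* Scalar tail; this is the whole loop when NEON is unavailable. */
    for (; i < n; i++) {
        total += weights[i] * coords[i];
    }
    return total;
}
```

Because the vector path sums in a different order than a scalar loop, results on general floating-point data can differ slightly in the last bits between the two paths.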

Conclusion: We used several parallelism techniques to speed up the mean-shift application on a Raspberry Pi board and obtained real-time performance.
Final Deliverable of the Project: HW/SW integrated system
Type of Industry: IT
Technologies: Robotics
Sustainable Development Goals: Partnerships to achieve the Goal
Required Resources

| Item Name | Type | No. of Units | Per Unit Cost (in Rs) | Total (in Rs) |
|---|---|---|---|---|
| Raspberry Pi board | Equipment | 1 | 6500 | 6500 |
| Monitor for Raspberry Pi | Equipment | 1 | 3500 | 3500 |
| SD card | Equipment | 1 | 600 | 600 |
| Keyboard/mouse | Equipment | 1 | 800 | 800 |
| Power supply | Equipment | 1 | 1000 | 1000 |
| Total (in Rs) | | | | 12400 |