PERCEPTRON - Proteoform Identification Pipeline for Top-Down Proteomics
2025-06-28 16:28:46 - Adil Khan
Project Area of Specialization Biomedical Engineering
Project Summary
Cells have billions of proteins that regulate crucial physiological functions in our bodies and any aberration in protein structure, quantity or function can lead to the development of pathologies such as cancer and diabetes. A major challenge for scientists, therefore, is to identify proteins and characterize their function besides investigating how defects at the protein level affect the overall function of the cell.
Highly sensitive mass spectrometry (MS) instruments can assist in identifying proteins and their products as well as in characterizing their functions. However, the complexity of the protein data generated by MS requires sophisticated algorithms and resource-hungry software for its analysis. Moreover, rapidly advancing MS instrumentation and the resulting spectral complexity necessitate continued enhancement of data processing algorithms and software to maximize insights from data. Existing software cannot fully extract this information from high-resolution MS spectra, which calls for further development of protein search tools.
In this project, we propose to develop a next-generation protein identification search engine that leverages high-resolution spectral data for top-down (whole protein) proteomics research. The envisaged search engine will be freely available on the web for use by experimental and in silico biologists and translational researchers. The middleware will be accelerated on graphics processing unit (GPU) hardware. We have developed the CPU version of PERCEPTRON to date, and our aim now is to accelerate its algorithms by implementing them on GPUs for parallel computing. In this project, we will work on the de novo peptide sequence tag (PST) generation algorithm and on filtering candidate proteins based on those tags. We propose to use the mass-to-charge ratio (m/z) data from the mass spectrometer to deduce the amino acids that could be present, and then chain the corresponding amino acids to generate PSTs of multiple lengths. Each PST is scored based on its abundance, and candidate proteins containing these PSTs are then scored and shortlisted based on their PST scores. The CPU version of the algorithm computes these PSTs sequentially, whereas the GPU version aims to compute them in parallel for speed. This open-source, high-performance software will be a bridge between academia and industry: industrial partners can use it to analyze their data and provide feedback towards further development of algorithms for improved protein identification and characterization.
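The PST idea described above can be sketched in a few lines. This is a minimal illustration in Python, not the PERCEPTRON implementation (which the proposal targets at C# and CUDA). It assumes the input is a list of deconvoluted `(mass, intensity)` peaks, that "abundance" means the mean intensity of the peaks supporting a tag, and that a candidate protein's score is the sum of the scores of the tags it contains. The function names `extract_tags` and `score_candidates` and the tolerance value are hypothetical.

```python
from itertools import combinations

# Monoisotopic residue masses in Da. Leucine and isoleucine are isobaric,
# so "L" stands for both here.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def extract_tags(peaks, min_len=2, max_len=4, tol=0.01):
    """Return {tag: score} for residue chains found between peaks."""
    peaks = sorted(peaks)  # sort by mass
    # Step 1: a "hop" links two peaks whose mass gap matches one residue.
    hops = {}
    for i, j in combinations(range(len(peaks)), 2):
        delta = peaks[j][0] - peaks[i][0]
        for aa, mass in RESIDUE_MASS.items():
            if abs(delta - mass) <= tol:
                hops.setdefault(i, []).append((j, aa))

    # Step 2: chain hops into tags of every length in [min_len, max_len].
    tags = {}

    def walk(node, seq, intensity_sum):
        if len(seq) >= min_len:
            # Assumed abundance score: mean intensity of supporting peaks.
            score = intensity_sum / (len(seq) + 1)
            tags[seq] = max(tags.get(seq, 0.0), score)
        if len(seq) == max_len:
            return
        for nxt, aa in hops.get(node, []):
            walk(nxt, seq + aa, intensity_sum + peaks[nxt][1])

    # Each starting peak is explored independently of the others.
    for start in range(len(peaks)):
        walk(start, "", peaks[start][1])
    return tags


def score_candidates(tags, proteins):
    """Rank proteins by the summed score of the tags they contain."""
    ranked = []
    for name, sequence in proteins.items():
        seq = sequence.replace("I", "L")  # match the L/I convention above
        # The ion series direction is unknown, so try both orientations.
        total = sum(s for t, s in tags.items() if t in seq or t[::-1] in seq)
        if total > 0:
            ranked.append((name, total))
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

Because the walk from each starting peak does not depend on any other starting peak, that outer loop is one plausible unit of work to distribute across GPU threads; whether PERCEPTRON partitions the work this way is not stated in the proposal.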
Project Objectives
- To design and develop an open-source next-generation protein sequence search engine (software) for application in top-down proteomics towards identification and characterization of:
  - Proteins,
  - Unique proteoforms,
  - Disease biomarkers, and
  - Post-translational modifications responsible for the abnormal behavior of proteins leading to a pathological state
- To create a public code base for development and testing of novel and more efficient top-down proteomics algorithms.
- To devise novel GPU-based algorithms for identifying, quantifying and characterizing proteins from top-down proteomics data.
Expected Outcomes:
- Parallel programming using GPU-based algorithms will help reduce the processing time.
- Biologists will be able to use the proposed search engine to perform proteome analysis.
- Hospitals and clinics will be able to deploy the software to analyze protein biomarker data towards personalized therapeutics.
- Pharmaceutical companies involved in the characterization and quantitation of active drug compounds can use the proposed software for identifying protein targets.
Project Implementation Method
The main goal of this project is to develop a protein sequence search engine, which takes spectral data from high-resolution top-down mass spectra as input (Figure 1 - Input Data Preprocessing) and identifies proteoforms. The output will be a list of proteoforms present in the experimental data ranked by their score (Figure 1 - PERCEPTRON Search Pipeline).

Our proposed search pipeline takes formatted top-down proteomics data and passes it to the first algorithm, which performs mass-based filtering of the user-specified protein database. The resulting candidate protein list (CPL) is passed to the second algorithm, a peptide sequence tag (PST) extractor, which filters the list using a sequence-based approach and updates the CPL. In the pilot software, the updated CPL is then subjected to in silico fragmentation as part of the third algorithm, which compares theoretical spectra generated with user-specified fragmentation methods against the experimental spectra. The resulting filtered candidate protein list is then ranked by the fourth algorithm, which applies user-defined weights to the score from each algorithm. Below, we provide a detailed description of this workflow.
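The third and fourth stages can be illustrated with a short Python sketch. It is an assumption-laden example, not the pipeline's code: it generates only singly charged b- and y-ions (one common choice for collision-based fragmentation; the proposal leaves the fragmentation method to the user), scores a candidate by the fraction of theoretical ions matched, and combines per-algorithm scores as a weighted average. The names `b_y_ions`, `match_fraction`, `rank_candidates` and the score keys are hypothetical.

```python
WATER = 18.010565   # monoisotopic mass of H2O in Da
PROTON = 1.007276   # mass of a proton in Da

# Monoisotopic residue masses in Da ("L" also covers isobaric "I").
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def b_y_ions(sequence):
    """Singly charged b- and y-ion m/z values for each backbone cleavage."""
    masses = [RESIDUE_MASS[aa] for aa in sequence.replace("I", "L")]
    total = sum(masses)
    ions = []
    prefix = 0.0
    for mass in masses[:-1]:
        prefix += mass
        ions.append(prefix + PROTON)                   # b-ion (N-terminal)
        ions.append(total - prefix + WATER + PROTON)   # y-ion (C-terminal)
    return ions


def match_fraction(theoretical, experimental, tol=0.02):
    """Fraction of theoretical ions found among the experimental peaks."""
    if not theoretical:
        return 0.0
    hits = sum(
        any(abs(t - e) <= tol for e in experimental) for t in theoretical
    )
    return hits / len(theoretical)


def rank_candidates(scores, weights):
    """Weighted average of per-algorithm scores, highest first.

    scores:  {protein: {"mass": s, "pst": s, "spectral": s}}, each in [0, 1]
    weights: {"mass": w, "pst": w, "spectral": w}, user-defined
    """
    total_weight = sum(weights.values())
    combined = {
        name: sum(w * s.get(key, 0.0) for key, w in weights.items())
        / total_weight
        for name, s in scores.items()
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

For example, `match_fraction(b_y_ions("GAS"), [58.03, 106.05])` matches the b1 and y1 ions out of four theoretical ions. Dividing by the total weight keeps the combined score in the same range as the inputs, so changing one weight shifts the ranking without rescaling it.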
The search engine will be developed on .NET Framework 4.8, with the core programmed in Visual C#. Open-source mathematical and statistical libraries will be used wherever available; where no freely available library exists, in-house functions and classes will be developed. SJU and LUMS will work concurrently on the front end and back end, respectively.
During deployment, search jobs will be created, priority-queued (in case of user overload), and executed on a general-purpose graphics processing unit (GPGPU) array. The GPU array will be programmed using CUDA .NET and used for high-throughput data analysis. One benefit of such off-the-shelf GPGPU hardware is that it is far cheaper than a dedicated cluster or supercomputer; it can also be upgraded incrementally by adding newer GPGPU cards as research funding becomes available.
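A priority queue for search jobs might look like the following Python sketch. The proposal does not say how priorities are assigned, so the numeric priority (lower runs first), the first-come-first-served tie-break, and the `JobQueue` class itself are assumptions made for illustration.

```python
import heapq
import itertools


class JobQueue:
    """Search jobs ordered by priority, then by submission order."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker for equal priorities

    def submit(self, job_id, priority=10):
        # Lower priority values are served first (an assumed convention).
        heapq.heappush(self._heap, (priority, next(self._order), job_id))

    def next_job(self):
        """Pop the next job to run, or return None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

A worker attached to the GPU array would call `next_job()` whenever a device becomes free; the submission counter ensures that jobs with the same priority run in the order they arrived.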
Protein sequence databases (e.g. UniProt) will be stored and pre-indexed on a standalone NAS server and updated regularly to keep the protein search process optimized. Support for other protein databases will be added in a phased manner.
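The proposal does not describe what the pre-index contains. One plausible form, sketched here in Python purely as an assumption, is a list of entries sorted by intact protein mass, so that a mass-window lookup becomes a binary search rather than a full scan. The `MassIndex` class and the accessions in the usage note are hypothetical.

```python
from bisect import bisect_left, bisect_right


class MassIndex:
    """Protein entries sorted by intact mass for fast window lookups."""

    def __init__(self, entries):
        # entries: iterable of (accession, intact_mass_da) pairs
        self._entries = sorted(entries, key=lambda entry: entry[1])
        self._masses = [mass for _, mass in self._entries]

    def within(self, observed_mass, tol_da):
        """Accessions whose mass lies in [observed - tol, observed + tol]."""
        lo = bisect_left(self._masses, observed_mass - tol_da)
        hi = bisect_right(self._masses, observed_mass + tol_da)
        return [accession for accession, _ in self._entries[lo:hi]]
```

With made-up entries `[("PROT_A", 8564.8), ("PROT_B", 8565.1), ("PROT_C", 14305.0)]`, calling `within(8565.0, 0.5)` returns `["PROT_A", "PROT_B"]`. The index is built once per database update, and each query then costs two binary searches plus the size of the result.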
Benefits of the Project
In this proposal, we aim to develop a next-generation, freely available, web-based proteoform identification and characterization platform for top-down proteomics (TDP). The PERCEPTRON search pipeline will bring together algorithms for: (i) intact protein mass tuning, (ii) de novo sequence tag-based filtering, (iii) characterization of terminal as well as post-translational modifications, (iv) identification of truncated proteoforms, (v) in silico spectral comparison, and (vi) weight-based candidate protein scoring. High-throughput performance will be achieved by executing optimized code across multiple parallel threads on graphics processing units (GPUs) using the Compute Unified Device Architecture (CUDA) framework. An intuitive graphical web interface will allow users to set up search parameters and visualize results.
In summary, this interdisciplinary project offers several benefits. First, it will create a software test bed for developing and testing better algorithms for TDP data. Second, it will indigenize cutting-edge computational proteomics research and so support future research and development efforts. Third, highly specialized manpower can be trained at LUMS during the course of the project, which will go a long way towards promoting computational proteomics in Pakistan.
Technical Details of Final Deliverable
| No. | Deliverable | Timeline |
|---|---|---|
| 1 | GPU-based PST Extractor | 6 months |
| 2 | Online deployment (24x7) | 2 months |
| 3 | Open Source Code | 1 month |
| 4 | User Manual | 2 weeks |
| 5 | Video Tutorials | 1 week |
| 6 | Test datasets | 1 week |
| 7 | Issues database on GitHub | 1 day |
Final Deliverable of the Project HW/SW integrated system
Core Industry Medical
Other Industries
Core Technology Big Data
Other Technologies
Sustainable Development Goals
Required Resources
| Elapsed time (days, weeks, months or quarters) since start of the project | Milestone | Deliverable |
|---|---|---|
| Month 1 | GPU-based PST Extractor | GPU-based PST Extractor Code |
| Month 2 | GPU-based PST Extractor | GPU-based PST Extractor Code |
| Month 3 | GPU-based PST Extractor | GPU-based PST Extractor Code |
| Month 4 | GPU-based PST Extractor | GPU-based PST Extractor Code |
| Month 5 | GPU-based PST Extractor | GPU-based PST Extractor Code |
| Month 6 | GPU-based PST Extractor | GPU-based PST Extractor Code |
| Month 7 | Online deployment (24x7) | Online deployment (24x7) Production Server |
| Month 8 | Open Source Code, User Manual | Open Source Code, User Manual |
| Month 9 | Test datasets, Issues database on GitHub | Test datasets, Issues database on GitHub |