ShortStop: Prioritizing Microproteins with Machine Learning

Written by Jason Armstrong | Aug 11, 2025 8:14:42 AM

In a new paper, Miller et al. (2025)¹ present ShortStop, a machine learning framework for separating potentially functional microproteins encoded in regions annotated as non-coding from likely non-functional ones. It’s a problem with clear relevance in clinical genomics, where data volume continues to outpace interpretation capacity.

ShortStop is designed to triage thousands of small open reading frames (smORFs) now known to be translated in the human genome. Most don’t resemble canonical proteins and may only have a regulatory role, if any function at all. The authors propose a method to classify smORFs based on their biochemical similarity to known, experimentally validated proteins.

A Two-class System

The ShortStop approach introduces two reference groups:

SAMs (Swiss-Prot Analog Microproteins): short, validated, evolutionarily conserved proteins in the Swiss-Prot database.
PRISMs (Physicochemically Resembling In Silico Microproteins): synthetic smORFs that mimic the composition of real ones but lack evolutionary or functional structure.
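
To make the idea of a composition-matched decoy concrete, here is a minimal Python sketch. It shuffles the residues of a real sequence, which keeps the amino acid composition identical while destroying the residue order. This is an assumption chosen for illustration, not necessarily how the authors generate PRISMs, and the function name `shuffled_decoy` and the example sequence are hypothetical.

```python
import random


def shuffled_decoy(seq: str, seed: int = 0) -> str:
    """Return a decoy with the same residues as `seq` in random order.

    Illustrative only: shuffling is one simple way to preserve composition
    while removing sequence order. It is not claimed to be the PRISM
    generation procedure used in the paper.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


if __name__ == "__main__":
    real = "MKTLLVAGSWERDK"  # made-up peptide, not a real microprotein
    print(real, "->", shuffled_decoy(real, seed=1))
```

A decoy built this way has exactly the same length and residue counts as the input, so any feature that depends only on composition cannot tell the two apart. Features that depend on residue order, such as motifs, can.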

By defining two reference classes and training a model on physicochemical features (e.g., hydrophobicity, amino acid motifs, charge), the authors built a classifier that achieved high precision and recall. Extreme Gradient Boosting (XGBoost) outperformed other models with an AUC of 0.97.
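
For readers unfamiliar with physicochemical features, the sketch below shows how a peptide sequence could be turned into a small numeric feature vector of the kind a classifier such as XGBoost takes as input. The feature set here (length, mean Kyte-Doolittle hydropathy, a crude net charge, and amino acid fractions) is an assumption for illustration, not ShortStop's actual feature set, and the function name `physicochemical_features` is hypothetical.

```python
from collections import Counter

# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def physicochemical_features(seq: str) -> dict:
    """Compute a few simple sequence-derived features (illustrative only)."""
    seq = seq.upper()
    if not seq or any(aa not in KYTE_DOOLITTLE for aa in seq):
        raise ValueError("sequence must be non-empty and use the 20 standard residues")
    counts = Counter(seq)
    n = len(seq)
    features = {
        "length": float(n),
        # Mean hydropathy per residue (often called GRAVY).
        "gravy": sum(KYTE_DOOLITTLE[aa] for aa in seq) / n,
        # Crude net charge: +1 for K/R, -1 for D/E; ignores His and termini.
        "net_charge": float(counts["K"] + counts["R"] - counts["D"] - counts["E"]),
    }
    # Fraction of each amino acid, in a fixed alphabetical order.
    for aa in sorted(KYTE_DOOLITTLE):
        features[f"frac_{aa}"] = counts[aa] / n
    return features


if __name__ == "__main__":
    feats = physicochemical_features("MKKLLE")  # made-up peptide
    print(feats["length"], feats["net_charge"], round(feats["gravy"], 3))
```

One row of such features per sequence, with a label for its reference class, is the general shape of input a gradient-boosted tree classifier would be trained on.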

When applied to 7,264 smORFs from Mudge et al. (2022)², ShortStop classified only 8% as SAMs, suggesting most translated smORFs resemble non-canonical or likely non-functional sequences.

Why This Matters

As in variant interpretation, the challenge isn’t finding candidates but narrowing them down. The value of a method like ShortStop lies in prioritization. By flagging those smORFs that share biochemical properties with known proteins, researchers can focus follow-up efforts where they are most likely to yield meaningful results, reducing time spent on low-likelihood candidates and streamlining decision-making.

These goals are becoming increasingly important as data volumes grow and analysis becomes more complex. The infrastructure now exists to support more automation in research and clinical settings. What is still limited is time and expert capacity. Manual review remains essential, especially for edge cases or unexpected results, but it needs to be used more efficiently. Tools like ShortStop can help by guiding attention to the most promising signals and away from low-likelihood noise.

Case Example: StARump

The authors highlight a previously overlooked smORF in the StAR gene as an example. Later named StARump, this microprotein had not been identified by standard ribosome profiling or by TIS transformers, likely because it lies in a region of overlapping coding sequence with poor mappability. ShortStop nevertheless classified it as a SAM, and mass spectrometry (MS) confirmed the microprotein's presence. StARump was found to be highly expressed in the testis, ovary, and CSF, despite low transcript levels of the canonical StAR ORF in those tissues.

StARump’s detection demonstrates how pairing a prediction tool such as ShortStop’s classifier with a complementary input such as MS data can reveal overlooked biology.

Clinical Relevance: Lung Cancer

To show translational relevance, the authors applied ShortStop to RNA-seq and immunopeptidome data from non-smoking lung cancer patients. Several SAMs were differentially expressed between tumor and normal tissues. In addition, 210 SAM-derived peptides were detected on MHC-I, including one from an alternative COL1A1 transcript that was strongly upregulated. This supports the idea that some of these overlooked microproteins may play biological roles, or even become clinically relevant.

Final Thoughts

ShortStop fills a specific gap in current discovery tools by classifying smORFs according to their similarity to known microproteins. It cannot detect translation, but helps prioritize which translated smORFs are worth further study. This kind of triage supports the same goals as many diagnostic workflows: reducing manual review, improving consistency, and focusing attention on the most informative candidates.
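
The triage described above can be sketched in a few lines of Python. The sketch assumes translation evidence comes from elsewhere (for example ribosome profiling or MS) and that a classifier supplies a per-candidate score. The `Candidate` fields, the `triage` function, and the 0.5 threshold are all hypothetical and are not part of ShortStop's interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    smorf_id: str
    translated: bool        # translation evidence obtained independently
    sam_probability: float  # hypothetical classifier score in [0, 1]


def triage(candidates, threshold=0.5):
    """Keep translated candidates scoring at or above `threshold`, best first."""
    kept = [
        c for c in candidates
        if c.translated and c.sam_probability >= threshold
    ]
    return sorted(kept, key=lambda c: c.sam_probability, reverse=True)


if __name__ == "__main__":
    pool = [
        Candidate("smORF_a", True, 0.90),
        Candidate("smORF_b", False, 0.95),  # high score, no translation evidence
        Candidate("smORF_c", True, 0.20),
        Candidate("smORF_d", True, 0.70),
    ]
    for c in triage(pool):
        print(c.smorf_id, c.sam_probability)
```

The point of the design is that the classifier score ranks candidates but never substitutes for translation evidence: a high-scoring smORF without that evidence is still filtered out.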

There is a growing push to integrate multimodal data into clinical workflows. Combining genomics, transcriptomics, proteomics, and immunopeptidomics can reveal patterns invisible to any single data type. Machine learning tools that complement or incorporate this kind of information will become more important in diagnostic pipelines. As automation becomes more embedded, the aim is not to replace human expertise but to support it, so interpretation efforts can focus where they are most likely to be impactful.

The ShortStop framework is freely available on GitHub: https://github.com/brendan-miller-salk/ShortStop

References
