The Mutation Signal We Missed at the Start of Our Genes
Germline mutations shape evolution and influence the risk of inherited disease. Most models treat mutation rates around genes as stable or smoothly varying, partly because they are built from de novo datasets that exclude mosaic mutations. That assumption matters because promoters sit at the center of gene regulation and are routinely interpreted in clinical and evolutionary analyses. If mutational pressure around these regions is mischaracterized, measures of constraint and pathogenicity predictions can be skewed.
Cortés Guzmán et al. (2025) examined large cohort datasets and found that transcription start sites (TSSs) carry a concentrated mutational signal that has been missed by standard approaches. The authors show that early embryonic mosaic mutations accumulate in a narrow window around promoters, but these mutations are filtered out in de novo datasets, which has obscured the signal. Their findings suggest that transcriptional activity and repair timing during early development play an important role in shaping local mutation rates.
This work challenges the idea of uniform mutability around regulatory elements. It shows that promoter architecture has a measurable impact on germline mutation rates, with direct consequences for how researchers interpret constraint, assess variant pathogenicity, and model regulatory evolution.
Methods & Findings
The authors analyzed extremely rare variants (ERVs) from gnomAD v3 and the UK Biobank. They restricted these variants to genomic windows surrounding transcription start and termination sites to measure fine-scale mutational patterns. This window-based approach allowed them to compare upstream and downstream regions across thousands of protein-coding genes.
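To make the window-based idea concrete, here is a minimal sketch of how variant positions might be binned by their strand-aware distance from a TSS. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the 1,000 bp flank, the 100 bp bin size, and the toy coordinates are all hypothetical, and a real analysis would use an interval index rather than the nested loop shown here.

```python
from collections import Counter

def relative_position(variant_pos, tss_pos, strand):
    """Signed distance to the TSS; positive means downstream in the
    direction of transcription, so minus-strand genes are flipped."""
    offset = variant_pos - tss_pos
    return offset if strand == "+" else -offset

def bin_counts(variants, genes, flank=1000, bin_size=100):
    """Count variants per bin around each TSS.

    variants: list of (chrom, pos) tuples
    genes:    list of (chrom, tss_pos, strand) tuples
    flank and bin_size are illustrative defaults, not values from the study.
    Returns a Counter keyed by the left edge of each bin (in bp from the TSS).
    """
    counts = Counter()
    for chrom, tss_pos, strand in genes:
        for v_chrom, v_pos in variants:
            if v_chrom != chrom:
                continue
            rel = relative_position(v_pos, tss_pos, strand)
            if -flank <= rel < flank:
                # Floor division keeps negative offsets in the correct bin,
                # e.g. -50 falls in the bin that starts at -100.
                counts[(rel // bin_size) * bin_size] += 1
    return counts

# Toy coordinates, invented purely for illustration.
toy_genes = [("chr1", 5000, "+"), ("chr1", 20000, "-")]
toy_variants = [("chr1", 5010), ("chr1", 4950), ("chr1", 19990), ("chr1", 20350)]
toy_counts = bin_counts(toy_variants, toy_genes)
```

Aggregating these per-bin counts across thousands of genes, and dividing by the number of sites in each bin, would give the kind of density profile in which a peak at the TSS can be seen.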
A pronounced mutational hotspot emerged. ERVs showed a sharp increase in mutation density centered on the TSS, peaking within the first 100 base pairs of the start site and extending a few hundred base pairs in both directions. This pattern was consistent across both datasets. In contrast, the same region showed no equivalent peak when the authors examined de novo mutations from family sequencing studies.
To resolve this discrepancy, the authors assembled early and late mosaic mutations, as well as confirmed de novo mutations, from thirteen published family sequencing studies. Early embryonic mosaic variants produced a clear hotspot at the TSS, matching the pattern seen in the population cohorts. When these variants were removed, the signal disappeared. This suggests that the TSS hotspot is driven by early embryonic mutations that arise shortly after fertilization, which are systematically filtered out of standard de novo datasets.
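The filtering effect described above can be sketched in a few lines. The example below assumes that mosaic candidates are flagged by a low variant allele fraction (VAF); the 0.35 cutoff, the field names, and the toy calls are hypothetical choices for illustration only. Published pipelines typically rely on statistical tests and read-level evidence rather than a single threshold.

```python
def variant_allele_fraction(alt_reads, total_reads):
    """Fraction of reads supporting the alternate allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads

def is_mosaic_candidate(call, vaf_cutoff=0.35):
    """Flag calls whose VAF falls well below the ~0.5 expected for a
    heterozygous variant. The cutoff is an illustrative assumption."""
    return variant_allele_fraction(call["alt"], call["total"]) < vaf_cutoff

def count_near_tss(calls, window=100):
    """Number of calls within +/- window bp of the TSS."""
    return sum(1 for call in calls if abs(call["rel_pos"]) <= window)

# Toy calls, invented for illustration: rel_pos is the distance to the TSS in bp.
toy_calls = [
    {"rel_pos": 20, "alt": 6, "total": 40},     # VAF 0.15, mosaic candidate
    {"rel_pos": -60, "alt": 9, "total": 45},    # VAF 0.20, mosaic candidate
    {"rel_pos": 35, "alt": 21, "total": 42},    # VAF 0.50
    {"rel_pos": 800, "alt": 19, "total": 40},   # VAF 0.475
    {"rel_pos": -700, "alt": 24, "total": 50},  # VAF 0.48
]

# A de novo style call set that discards low-VAF calls.
filtered_calls = [c for c in toy_calls if not is_mosaic_candidate(c)]

near_tss_all = count_near_tss(toy_calls)
near_tss_filtered = count_near_tss(filtered_calls)
```

In this toy set, most of the calls close to the TSS are low-VAF, so discarding them removes most of the local excess while leaving the distal calls untouched. That is the qualitative pattern the authors report, shown here with made-up numbers.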
The authors then examined genomic features associated with this hotspot. Divergent transcription around promoters, RNA polymerase II stalling, and R-loop-prone regions all aligned with the interval of elevated mutability. Regression models indicated that these transcription-associated processes, along with mitotic rather than meiotic double-strand breaks, contribute to the mutation peak centered on the TSS. Mutational signature analysis supported this conclusion, highlighting signatures of alternative double-strand break repair and transcription-associated mutagenesis.
Expression level shaped the downstream landscape. Highly expressed genes showed a broader region of reduced mutation density downstream of the TSS, consistent with more effective transcription-coupled repair in gene bodies. The size of this protective region scaled with expression strength, indicating that transcription suppresses mutation downstream of the TSS, while the promoter remains a site of elevated risk.
Somatic mutation data from PCAWG provided further context. While tumor genomes show strong transcription-associated mutational patterns, the germline hotspot at the TSS reflected processes tied specifically to early embryogenesis rather than somatic mutagenesis.
Together, these findings suggest that promoters experience a concentrated influx of germline mutations driven by early mosaic events. Standard de novo datasets fail to capture this signal, but population-level rare variant data reveal a narrow, transcription-linked hotspot that clarifies how mutational processes at gene start sites operate.
What This Tells Us
The findings suggest that TSSs are not passive regulatory elements but active sources of germline mutagenesis. The sharp mutation peak at the TSS, together with its dependence on early embryonic mosaic variants, indicates that a substantial share of mutations near the TSS arise during the earliest cell divisions. This timing explains why de novo datasets, which remove variants with low allele fractions, have missed the signal.
The study also shows that transcriptional architecture shapes local mutation rates. Divergent transcription, polymerase stalling, and R-loop formation all map onto the region of elevated mutability. These processes create structural and biochemical conditions that expose DNA to damage or repair delays, producing a focused increase in mutation density at promoter boundaries. At the same time, high expression suppresses mutation in the gene body through transcription-coupled repair, producing a region of reduced mutation density downstream of the TSS.
These combined effects mean that promoters experience a different mutational environment from sequences immediately downstream. This distinction matters because many analytical frameworks assume smooth variation in mutability across regulatory regions. If promoter-adjacent mutation rates are higher than expected, models that rely on these assumptions may misestimate constraint or pathogenicity for variants near the TSS. This work suggests that the local mutation processes tied to transcription play a measurable role in shaping patterns of genetic variation at TSSs.
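A small worked example shows why the baseline matters for frameworks that compare observed with expected variant counts. The sketch below is not taken from the study: the counts and the 1.5-fold rate multiplier are hypothetical, chosen only to show how the ratio moves when the expected count is revised.

```python
def rescale_expected(expected, rate_multiplier):
    """Expected variant count after scaling the local mutation rate.
    rate_multiplier is a hypothetical correction factor for a hotspot."""
    return expected * rate_multiplier

def obs_exp_ratio(observed, expected):
    """Observed-to-expected ratio; values below 1 indicate fewer variants
    than the mutational model predicts."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return observed / expected

# Hypothetical numbers for one promoter-adjacent window.
observed = 12
expected_smooth = 10.0                                     # smooth-background model
expected_hotspot = rescale_expected(expected_smooth, 1.5)  # hotspot-aware model

ratio_smooth = obs_exp_ratio(observed, expected_smooth)    # 12 / 10
ratio_hotspot = obs_exp_ratio(observed, expected_hotspot)  # 12 / 15
```

With the smooth baseline, the window appears to hold slightly more variants than expected. With the higher expected count, the same observations fall below expectation. The variant data are identical in both cases; only the assumed mutation rate changes the interpretation.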
Early embryonic mosaicism is already recognised in rare disease diagnostics. Low-level parental mosaicism can account for presumed de novo mutations in neurodevelopmental and epileptic disorders, and post-zygotic events underlie conditions such as PIK3CA-related overgrowth. Although this study focuses on mosaicism as a source of background mutational signal rather than disease burden, it aligns with broader observations that early developmental mutations can have lasting consequences for inherited variation and clinical analysis.
Outlook
This study highlights how gaps in current de novo call sets and modelling approaches can shape our view of mutational processes. Because de novo call sets systematically exclude early embryonic mosaic variants, they miss a class of mutations that influence promoter regions. This creates a mismatch between the mutational models used in many analytical frameworks and the underlying biology revealed by the population-level data.
The work also points to broader challenges in interpreting variation in regulatory regions. Promoters are central to gene control, yet most functional and evolutionary annotations still rely on models that assume smooth mutational backgrounds. If promoter boundaries carry a higher mutational load than these models predict, constraint may be misestimated, most plausibly underestimated, because expected variant counts would be set too low and observed variation would then look less depleted than it really is. Regulatory variants in these regions may therefore be interpreted without an accurate baseline for expected variation.
This study is part of a wider effort to understand how transcriptional processes interact with DNA repair and how these interactions influence germline mutation. Much of what is known about transcription-associated mutagenesis comes from somatic studies, where damage and repair dynamics differ from those in early development. The findings here show that early embryogenesis may be a critical period for shaping mutation landscapes and that these processes leave a signature that persists in the inherited genome.
Incorporating early mosaic mutations into mutational models may improve assessments of selective pressure and refine predictions of regulatory variant impact. It may also prompt a re-examination of how promoter regions are treated in evolutionary analyses, particularly in methods that compare observed with expected variant counts. As more whole-genome data accumulate and early developmental mutations are better catalogued, models may be able to account for promoter-specific mutational processes with greater accuracy.