4. Noise filtering
Noise filtering methodology
Noise filtering differs slightly for data with HARR/LTM treatment. The following step is applied both to data with HARR/LTM treatment and to data without it.
For every sORF, the corresponding transcript information is fetched from Ensembl. Using the spliced transcript coordinates and the CHX/snap-freezing ribosome profiling data, the transcript is reconstructed as a bit array in which 1’s represent positions covered by ribosome A-sites and 0’s represent positions not covered by ribosome A-sites. For sORFs starting in an intron, the whole intron is treated as an exon. For intergenic sORFs, the transcript is taken to be the region extending from 1000 nt upstream of the start codon to 1000 nt downstream of the stop codon. Using this array, the in-frame coverage of the sORF is calculated. To simulate a random distribution, each ‘transcript array’ is shuffled 10,000 times, and the in-frame coverage of the sORF is recalculated at each iteration. Next, using the non-parametric method for prediction interval computation on this ‘random distribution’, the probability of finding an in-frame coverage in the ‘random distribution’ greater than or equal to the in-frame coverage of the sORF on the ‘un-shuffled’ transcript is calculated. The Benjamini-Hochberg procedure is applied to control the FDR at α=0.05, and sORFs passing this threshold are stored. A graphical representation of the local FDR can be found in Figure 1.
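The shuffling step and the FDR correction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the function names are hypothetical, coordinates are assumed to be 0-based and half-open, in-frame coverage is assumed to mean the fraction of in-frame positions (start, start+3, ...) with an A-site, and the +1 correction in the probability is one common non-parametric convention that the text does not specify.

```python
import random


def in_frame_coverage(transcript, start, end):
    """Fraction of in-frame positions in [start, end) that are covered.

    Assumption: 'in-frame' means positions start, start+3, start+6, ...
    """
    frame_positions = range(start, end, 3)
    covered = sum(transcript[i] for i in frame_positions)
    return covered / len(frame_positions)


def empirical_p_value(transcript, start, end, n_shuffles=10000, seed=0):
    """Probability of a shuffled in-frame coverage >= the observed one."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = in_frame_coverage(transcript, start, end)
    shuffled = list(transcript)  # work on a copy, keep the input intact
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if in_frame_coverage(shuffled, start, end) >= observed:
            hits += 1
    # Assumption: +1 correction so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if it passes BH at FDR alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    passed = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            passed[idx] = True
    return passed


# Toy transcript: 60 nt, A-sites only on the in-frame positions of a
# hypothetical sORF spanning positions 30 to 59.
toy_transcript = [0] * 60
for pos in range(30, 60, 3):
    toy_transcript[pos] = 1

p_sorf = empirical_p_value(toy_transcript, 30, 60, n_shuffles=1000)
```

In this toy case the sORF is the only covered segment, so shuffled arrays almost never reach the observed coverage and the resulting probability is small. In practice one probability would be computed per sORF and the full list passed to benjamini_hochberg.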
Remarks
The random distribution is highly sORF dependent, varying from close to 0 up to 0.6-0.7 (Figure 2-3). Transcripts can be densely covered with ribosome profiling data: protein-coding regions, for instance, are densely populated with ribosomes when they are translated. A sORF in such a region should have a higher cutoff for in-frame coverage than a sORF where only a small segment (the sORF itself) is populated with ribosomes. Figure 3 represents the random distribution and in-frame coverage of a sORF residing in a lncRNA, where the majority of the transcript is not covered with ribosomes, resulting in a low mean of the random distribution. Figure 4 is an extreme example, where the mean of the random distribution is extremely low for a sORF located in an intergenic region with sparse ribosome occupancy.
Cutoff determination for the untreated pipeline
The untreated pipeline lacks the additional evidence for the TIS that is available in the conventional pipeline, where both CHX/flash-freeze RPF data and HARR/LTM RPF data are used. As a result, this pipeline is more prone to false positives. To reduce the number of false positives, a global FDR is used as an additional cutoff next to the local FDR explained above. Here the random distribution consists of all the calculated ‘random in-frame coverages’ of all the transcripts. Again, using the non-parametric method for prediction intervals, a cutoff is determined at α=0.05.
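A minimal sketch of how such a global cutoff could be derived is given below. It assumes that the cutoff is the upper non-parametric prediction limit of the pooled values, i.e. the k-th smallest of the n pooled random coverages with k = ceil((1 - α)(n + 1)). The function name and the input layout (one list of random in-frame coverages per transcript) are hypothetical, and whether a sORF must exceed or merely equal the cutoff is not stated in the text.

```python
import math


def global_cutoff(random_coverages_per_transcript, alpha=0.05):
    """Upper non-parametric prediction limit of the pooled random coverages."""
    # Pool the random in-frame coverages of all transcripts into one list.
    pooled = sorted(
        cov for covs in random_coverages_per_transcript for cov in covs
    )
    n = len(pooled)
    # Assumption: the k-th order statistic with k = ceil((1 - alpha)(n + 1))
    # bounds a new random value from above with probability >= 1 - alpha.
    k = math.ceil((1 - alpha) * (n + 1))
    if k > n:
        raise ValueError("too few random coverages for the requested alpha")
    return pooled[k - 1]


# Toy input: two transcripts with evenly spread random coverages.
toy_random = [
    [i / 100 for i in range(1, 50)],
    [i / 100 for i in range(50, 100)],
]
cutoff = global_cutoff(toy_random)
```

With the 99 pooled toy values, the cutoff falls near the 95th smallest value, so only sORFs whose observed in-frame coverage lies in roughly the top 5% of the pooled random distribution would be retained.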