Cleanup: supplementary tables for our genome paper
We submitted a genome paper to Genome Biology over the weekend. I'm going through our repository and tidying things up, and I realized I have not had a chance to record my notes on several of the supplemental tables providing a comparison of genome features across seven species. I'll go ahead and do so here.
The data files were retrieved from HymHub, and the mlr
program was obtained from johnkerl/miller (installed with Homebrew).
# Define an alias for mlr command settings
# --csv: comma-separated values
# --fs tab: lol j/k they're separated by tabs (fs=field separator)
# --rs lf: use line feed as record separator (instead of default carriage return + line feed)
# --ofmt %.2lf: print only 2 decimal points for real numbers
alias mlrcmd='mlr --csv --fs tab --rs lf --ofmt %.2lf'
Table S6: gene model characteristics¶
Here we're looking at gene models across the seven species, and in particular the distribution of gene length, exon count, and nucleotide composition. First we report the length distribution: the mean, median, min, and max gene lengths (in base pairs).
for species in Amel Bter Hsal Cflo Pdom Pcan Nvit
do
mlrcmd stats1 -a mean,p50,min,max -f Length \
scratch/species/${species}/${species}.pre-mrnas.tsv
done
The header is duplicated for each file, so let's get rid of that with the tail command.
for species in Amel Bter Hsal Cflo Pdom Pcan Nvit
do
mlrcmd stats1 -a mean,p50,min,max -f Length \
scratch/species/${species}/${species}.pre-mrnas.tsv \
| tail -n 1
done
Now we can add back in our own header, and include the species label on each row.
echo $'Species\tMean\tMedian\tMin\tMax'
for species in Amel Bter Hsal Cflo Pdom Pcan Nvit
do
echo -n ${species}$'\t'
mlrcmd stats1 -a mean,p50,min,max -f Length \
scratch/species/${species}/${species}.pre-mrnas.tsv \
| tail -n 1
done
Ok, this looks much better.
We've been using -f Length
to specify which field from our data tables to compute these statistics for.
In addition to length, we also want to compute these statistics for exon count and nucleotide composition (ExonCount
and GCContent
in the data tables, respectively).
Rather than retyping the command two more times, let's just wrap the command in another loop and iterate over each field.
for field in Length ExonCount GCContent
do
echo
echo $field
echo $'Species\tMean\tMedian\tMin\tMax'
for species in Amel Bter Hsal Cflo Pdom Pcan Nvit
do
echo -n ${species}$'\t'
mlrcmd stats1 -a mean,p50,min,max -f $field \
scratch/species/${species}/${species}.pre-mrnas.tsv \
| tail -n 1
done
done
Great. This doesn't get the data into exactly the same format as the table, but it's close enough that it requires only a few minutes of copy/paste work to compose the final table. Had we been using LaTeX to typeset the supplement, I might have taken a few more minutes to completely automate the construction of the table, but since we're using Word anyway :( I figured this would be sufficient.
Table S7-S8: exon and intron characteristics¶
We'll use the same basic approach for computing summary stats for exon and intron characteristics that we did for entire gene models. There are two differences, however.
- First, exon count only makes sense at the gene level, so we don't compute this for exons and introns.
- Second, some of the annotated exons and introns are extremely short, which really distorts the summary statistics for nucleotide composition. Although outliers may be interesting (if indeed they are annotated correctly), here we are simply trying to get a feel for how different, on average, these genomic features are across species.
With respect to this second point, we first compute the 5% and 95% length quantiles over a large number of related species, and then use these values to filter exons and introns before computing summary statistics for nucleotide composition. We also make sure that exons and introns contain no more than 25% ambiguous nucleotide calls.
mlrcmd stats1 -a p5,p95 -f Length scratch/data/exons.tsv
Let's capture these values and store them in variables so we can use them for subsequent commands.
p5=$(mlrcmd stats1 -a p5,p95 -f Length scratch/data/exons.tsv | tail -n 1 | cut -f 1)
p95=$(mlrcmd stats1 -a p5,p95 -f Length scratch/data/exons.tsv | tail -n 1 | cut -f 2)
echo " 5% quantile: $p5"
echo "95% quantile: $p95"
Now let's feed these in to the filter command. We have to be careful to escape field names and shell variables, since both begin with the dollar sign prefix.
for species in Amel Bter Hsal Cflo Pdom Pcan Nvit
do
mlrcmd filter \
"\$Length >= $p5 && \$Length <= $p95 && \$NContent < 0.25" \
scratch/species/${species}/${species}.exons.tsv \
| mlrcmd stats1 -a mean,p50,min,max -f GCContent
done
We need a bit of cleanup, similar to what we did before.
echo $'Species\tMean\tMedian\tMin\tMax'
for species in Amel Bter Hsal Cflo Pdom Pcan Nvit
do
echo -n ${species}$'\t'
mlrcmd filter \
"\$Length >= $p5 && \$Length <= $p95 && \$NContent < 0.25" \
scratch/species/${species}/${species}.exons.tsv \
| mlrcmd stats1 -a mean,p50,min,max -f GCContent \
| tail -n 1
done
Now let's wrap it all up and bring it together to produce the final versions of Table S7-S8.
for featuretype in exons introns
do
echo
echo
title=$(echo $featuretype | tr '[:lower:]' '[:upper:]')
echo "=====${title}====="
echo
echo "Length"
echo $'Species\tMean\tMedian\tMin\tMax'
for species in Amel Bter Hsal Cflo Pdom Pcan Nvit
do
echo -n ${species}$'\t'
mlrcmd stats1 -a mean,p50,min,max -f Length \
scratch/species/${species}/${species}.${featuretype}.tsv \
| tail -n 1
done
echo
echo "GCContent"
echo $'Species\tMean\tMedian\tMin\tMax'
p5=$(mlrcmd stats1 -a p5,p95 -f Length scratch/data/${featuretype}.tsv | tail -n 1 | cut -f 1)
p95=$(mlrcmd stats1 -a p5,p95 -f Length scratch/data/${featuretype}.tsv | tail -n 1 | cut -f 2)
for species in Amel Bter Hsal Cflo Pdom Pcan Nvit
do
echo -n ${species}$'\t'
mlrcmd filter \
"\$Length >= $p5 && \$Length <= $p95 && \$NContent < 0.25" \
scratch/species/${species}/${species}.${featuretype}.tsv \
| mlrcmd stats1 -a mean,p50,min,max -f GCContent \
| tail -n 1
done
done
That's it!