We submitted a genome paper to Genome Biology over the weekend. I'm going through our repository and tidying things up, and I realized I have not had a chance to record my notes on several of the supplemental tables providing a comparison of genome features across seven species. I'll go ahead and do so here.

The data files were retrieved from HymHub, and the mlr program was obtained from johnkerl/miller (installed with Homebrew).

In [1]:
# Define an alias for mlr command settings
#  --csv: comma-separated values
#  --fs tab: lol j/k they're separated by tabs (fs=field separator)
#  --rs lf: use line feed as record separator (instead of default carriage return + line feed)
#  --ofmt %.2lf: print only 2 decimal points for real numbers
alias mlrcmd='mlr --csv --fs tab --rs lf --ofmt %.2lf'

Table S6: gene model characteristics

Here we're looking at gene models across the seven species, and in particular the distribution of gene length, exon count, and nucleotide composition. First we report the length distribution: the mean, median, min, and max gene lengths (in base pairs).

In [2]:
for species in Amel Bter Hsal Cflo Pdom Pcan Nvit
do
    mlrcmd stats1 -a mean,p50,min,max -f Length \
           scratch/species/${species}/${species}.pre-mrnas.tsv
done
Length_mean	Length_p50	Length_min	Length_max
13139.40	3557	105	904366
Length_mean	Length_p50	Length_min	Length_max
15409.72	3774	276	674837
Length_mean	Length_p50	Length_min	Length_max
12699.43	3843	210	760915
Length_mean	Length_p50	Length_min	Length_max
11682.57	3739	205	580610
Length_mean	Length_p50	Length_min	Length_max
4796.88	2778	204	86971
Length_mean	Length_p50	Length_min	Length_max
11460.33	3535	228	549523
Length_mean	Length_p50	Length_min	Length_max
9917.48	2979	105	600051

The header is duplicated for each file, so let's get rid of that with the tail command.

In [3]:
for species in Amel Bter Hsal Cflo Pdom Pcan Nvit
do
    mlrcmd stats1 -a mean,p50,min,max -f Length \
           scratch/species/${species}/${species}.pre-mrnas.tsv \
        | tail -n 1
done
13139.40	3557	105	904366
15409.72	3774	276	674837
12699.43	3843	210	760915
11682.57	3739	205	580610
4796.88	2778	204	86971
11460.33	3535	228	549523
9917.48	2979	105	600051

Now we can add back in our own header, and include the species label on each row.

In [4]:
echo $'Species\tMean\tMedian\tMin\tMax'
for species in Amel Bter Hsal Cflo Pdom Pcan Nvit
do
    echo -n ${species}$'\t'
    mlrcmd stats1 -a mean,p50,min,max -f Length \
           scratch/species/${species}/${species}.pre-mrnas.tsv \
        | tail -n 1
done
Species	Mean	Median	Min	Max
Amel	13139.40	3557	105	904366
Bter	15409.72	3774	276	674837
Hsal	12699.43	3843	210	760915
Cflo	11682.57	3739	205	580610
Pdom	4796.88	2778	204	86971
Pcan	11460.33	3535	228	549523
Nvit	9917.48	2979	105	600051

Ok, this looks much better.

We've been using -f Length to specify which field from our data tables to compute these statistics for. In addition to length, we also want to compute these statistics for exon count and nucleotide composition (ExonCount and GCContent in the data tables, respectively). Rather than retyping the command two more times, let's just wrap the command in another loop and iterate over each field.

In [5]:
for field in Length ExonCount GCContent
do
    echo
    echo $field
    echo $'Species\tMean\tMedian\tMin\tMax'
    for species in Amel Bter Hsal Cflo Pdom Pcan Nvit
    do
        echo -n ${species}$'\t'
        mlrcmd stats1 -a mean,p50,min,max -f $field \
               scratch/species/${species}/${species}.pre-mrnas.tsv \
            | tail -n 1
    done
done
Length
Species	Mean	Median	Min	Max
Amel	13139.40	3557	105	904366
Bter	15409.72	3774	276	674837
Hsal	12699.43	3843	210	760915
Cflo	11682.57	3739	205	580610
Pdom	4796.88	2778	204	86971
Pcan	11460.33	3535	228	549523
Nvit	9917.48	2979	105	600051

ExonCount
Species	Mean	Median	Min	Max
Amel	6.92	5	1	134
Bter	7.06	6	1	165
Hsal	6.89	5	1	76
Cflo	6.99	6	1	121
Pdom	6.59	5	1	138
Pcan	7.38	6	1	175
Nvit	6.51	5	1	73

GCContent
Species	Mean	Median	Min	Max
Amel	0.29	0.27	0.13	0.71
Bter	0.34	0.33	0.21	0.64
Hsal	0.41	0.40	0.22	0.70
Cflo	0.34	0.33	0.19	0.70
Pdom	0.31	0.30	0.17	0.62
Pcan	0.31	0.30	0.19	0.58
Nvit	0.40	0.40	0.20	0.74

Great. This doesn't get the data into exactly the same format as the table, but it's close enough that it requires only a few minutes of copy/paste work to compose the final table. Had we been using LaTeX to typeset the supplement, I might have taken a few more minutes to completely automate the construction of the table, but since we're using Word anyway :( I figured this would be sufficient.

Table S7-S8: exon and intron characteristics

We'll use the same basic approach for computing summary stats for exon and intron characteristics that we did for entire gene models. There are two differences, however.

  • First, exon count only makes sense at the gene level, so we don't compute this for exons and introns.
  • Second, some of the annotated exons and introns are extremely short, which really distorts the summary statistics for nucleotide composition. Although outliers may be interesting (if indeed they are annotated correctly), here we are simply trying to get a feel for how different, on average, these genomic features are across species.

With respect to this second point, we first compute the 5% and 95% length quantiles over a large number of related species, and then use these values to filter exons and introns before computing summary statistics for nucleotide composition. We also make sure that exons and introns contain no more than 25% ambiguous nucleotide calls.

In [6]:
mlrcmd stats1 -a p5,p95 -f Length scratch/data/exons.tsv
Length_p5	Length_p95
66	1101

Let's capture these values and store them in variables so we can use them for subsequent commands.

In [7]:
p5=$(mlrcmd stats1 -a p5,p95 -f Length scratch/data/exons.tsv | tail -n 1 | cut -f 1)
p95=$(mlrcmd stats1 -a p5,p95 -f Length scratch/data/exons.tsv | tail -n 1 | cut -f 2)

echo " 5% quantile: $p5"
echo "95% quantile: $p95"
 5% quantile: 66
95% quantile: 1101

Now let's feed these in to the filter command. We have to be careful to escape field names and shell variables, since both begin with the dollar sign prefix.

In [8]:
for species in Amel Bter Hsal Cflo Pdom Pcan Nvit
do
    mlrcmd filter \
           "\$Length >= $p5 && \$Length <= $p95 && \$NContent < 0.25" \
           scratch/species/${species}/${species}.exons.tsv \
        | mlrcmd stats1 -a mean,p50,min,max -f GCContent
done
GCContent_mean	GCContent_p50	GCContent_min	GCContent_max
0.37	0.35	0.07	0.80
GCContent_mean	GCContent_p50	GCContent_min	GCContent_max
0.40	0.39	0.14	0.73
GCContent_mean	GCContent_p50	GCContent_min	GCContent_max
0.46	0.46	0.13	0.80
GCContent_mean	GCContent_p50	GCContent_min	GCContent_max
0.42	0.41	0.16	0.75
GCContent_mean	GCContent_p50	GCContent_min	GCContent_max
0.37	0.36	0.04	0.73
GCContent_mean	GCContent_p50	GCContent_min	GCContent_max
0.37	0.37	0.12	0.75
GCContent_mean	GCContent_p50	GCContent_min	GCContent_max
0.46	0.45	0.15	0.79

We need a bit of cleanup, similar to what we did before.

In [9]:
echo $'Species\tMean\tMedian\tMin\tMax'
for species in Amel Bter Hsal Cflo Pdom Pcan Nvit
do
    echo -n ${species}$'\t'
    mlrcmd filter \
           "\$Length >= $p5 && \$Length <= $p95 && \$NContent < 0.25" \
           scratch/species/${species}/${species}.exons.tsv \
        | mlrcmd stats1 -a mean,p50,min,max -f GCContent \
        | tail -n 1
done
Species	Mean	Median	Min	Max
Amel	0.37	0.35	0.07	0.80
Bter	0.40	0.39	0.14	0.73
Hsal	0.46	0.46	0.13	0.80
Cflo	0.42	0.41	0.16	0.75
Pdom	0.37	0.36	0.04	0.73
Pcan	0.37	0.37	0.12	0.75
Nvit	0.46	0.45	0.15	0.79

Now let's wrap it all up and bring it together to produce the final versions of Table S7-S8.

In [10]:
for featuretype in exons introns
do
    echo
    echo
    title=$(echo $featuretype | tr '[:lower:]' '[:upper:]')
    echo "=====${title}====="

    echo
    echo "Length"
    echo $'Species\tMean\tMedian\tMin\tMax'
    for species in Amel Bter Hsal Cflo Pdom Pcan Nvit
    do
        echo -n ${species}$'\t'
        mlrcmd stats1 -a mean,p50,min,max -f Length \
               scratch/species/${species}/${species}.${featuretype}.tsv \
            | tail -n 1
    done
    
    echo
    echo "GCContent"
    echo $'Species\tMean\tMedian\tMin\tMax'
    p5=$(mlrcmd stats1 -a p5,p95 -f Length scratch/data/${featuretype}.tsv | tail -n 1 | cut -f 1)
    p95=$(mlrcmd stats1 -a p5,p95 -f Length scratch/data/${featuretype}.tsv | tail -n 1 | cut -f 2)
    for species in Amel Bter Hsal Cflo Pdom Pcan Nvit
    do
        echo -n ${species}$'\t'
        mlrcmd filter \
               "\$Length >= $p5 && \$Length <= $p95 && \$NContent < 0.25" \
               scratch/species/${species}/${species}.${featuretype}.tsv \
            | mlrcmd stats1 -a mean,p50,min,max -f GCContent \
            | tail -n 1
done
done

=====EXONS=====

Length
Species	Mean	Median	Min	Max
Amel	360.83	210	1	15394
Bter	389.78	210	2	14946
Hsal	339.73	206	2	18463
Cflo	324.61	206	2	12042
Pdom	276.92	189	1	10835
Pcan	350.70	212	3	11754
Nvit	352.79	216	1	16428

GCContent
Species	Mean	Median	Min	Max
Amel	0.37	0.35	0.07	0.80
Bter	0.40	0.39	0.14	0.73
Hsal	0.46	0.46	0.13	0.80
Cflo	0.42	0.41	0.16	0.75
Pdom	0.37	0.36	0.04	0.73
Pcan	0.37	0.37	0.12	0.75
Nvit	0.46	0.45	0.15	0.79


=====INTRONS=====

Length
Species	Mean	Median	Min	Max
Amel	1803.65	126	29	557153
Bter	2089.90	129	1	564389
Hsal	1758.45	193	30	564989
Cflo	1571.71	201	30	465607
Pdom	531.18	109	4	48311
Pcan	1390.85	109	30	398425
Nvit	1410.00	89	30	431041

GCContent
Species	Mean	Median	Min	Max
Amel	0.19	0.17	0.02	0.73
Bter	0.26	0.25	0.04	0.69
Hsal	0.35	0.34	0.03	0.94
Cflo	0.24	0.23	0.01	0.77
Pdom	0.22	0.21	0.01	0.74
Pcan	0.22	0.21	0.00	0.87
Nvit	0.32	0.30	0.06	0.81

That's it!

Comments