Permalink: 2016-03-01 by Daniel S. Standage in notebook tags: bash iLoci

Tonight I created the following table for my iLocus paper.

iLocus summary table

The purpose of this notebook entry is to document the commands used to populate the table while everything is still fresh in my head. The annotation files were created using GenHub v0.3.1.

genhub-build.py --workdir=species \
                --genome=Am10,Am32,Att6,Atha \
                --numprocs=4 \
                download format prepare

In [1]:

# The input files
annotfiles="species/Att6/Att6.iloci.gff3 species/Atha/Atha.iloci.gff3
            species/Am10/Am10.iloci.gff3 species/Am32/Am32.iloci.gff3"

In [2]:

# First row of the table: genome size
for file in $annotfiles; do
    grep '^##sequence-region' $file | awk '{ sum += $4 } END { print sum }'
done

In [3]:

# Second row: sequence count
for file in $annotfiles; do
    grep -c '^##sequence-region' $file
done

In [4]:

# Third row: gene count
for file in $annotfiles; do
     grep $'\tmRNA\t' $file \
         | perl -ne 'm/Parent=([^;\n]+)/ and print "$1\n"' \
         | sort -u \
         | wc -l \
         | tr -d ' '
done

In [5]:

# Determine all annotated feature types so we can test for all
# ncRNA types.
cat $annotfiles | grep -v '^#' | cut -f 3 | sort -u

CDS
exon
five_prime_UTR
gene
locus
mRNA
miRNA
ncRNA
rRNA
snRNA
snoRNA
tRNA
three_prime_UTR
transcript

In [6]:

# Fourth row: non-coding gene count
for file in $annotfiles; do
     grep -e $'\tmiRNA\t' -e $'\tncRNA\t' -e $'\trRNA\t' -e $'\tsnRNA\t' \
          -e $'\tsnoRNA\t' -e $'\ttRNA\t' -e $'\ttranscript\t' $file \
         | perl -ne 'm/Parent=([^;\n]+)/ and print "$1\n"' \
         | sort -u \
         | wc -l \
         | tr -d ' '
done

In [7]:

# Fifth row: total iLocus count
for file in $annotfiles; do
    grep -c $'\tlocus\t' $file
done

In [8]:

# Final rows: iLocus count breakdown
for file in $annotfiles; do
    echo ""
    echo $file
    for type in piLocus iiLocus ciLocus fiLocus niLocus; do
        echo -n "$type: "
        grep -c iLocus_type=$type $file
    done
done

species/Att6/Att6.iloci.gff3
piLocus: 26507
iiLocus: 9997
ciLocus: 17
fiLocus: 9
niLocus: 769

species/Atha/Atha.iloci.gff3
piLocus: 27003
iiLocus: 9276
ciLocus: 96
fiLocus: 8
niLocus: 1286

species/Am10/Am10.iloci.gff3
piLocus: 9898
iiLocus: 4790
ciLocus: 106
fiLocus: 7990
niLocus: 0

species/Am32/Am32.iloci.gff3
piLocus: 14531
iiLocus: 6160
ciLocus: 292
fiLocus: 5268
niLocus: 0

Daniel S. Standage

Summary table for iLocus paper

Comments