Is RefSeq's Arabidopsis annotation TAIR10?
On my slate of work this week is a comparison of two versions of the Arabidopsis thaliana genome: the TAIR6 version (2005ish) and the latest TAIR10 version (2010). I figure using GenHub is the best way to make the process of retrieving and pre-processing the data reproducible. I already have a recipe for the Arabidopsis genome in GenHub, but it's sourced from RefSeq and not TAIR. I need to figure out if this corresponds to TAIR10. If so, I can save myself some work. If not, I will need to add two new recipes to GenHub (TAIR6 and TAIR10) instead of one.
A quick look at the RefSeq assembly report shows that the NCBI assembly is in fact the TAIR10 assembly. But what about the annotation? I assume it's also the TAIR10 annotation, but the documenation in that directory doesn't confirm this. Perhaps it's an annotation by NCBI's GNOMON pipeline, or perhaps NCBI made some improvements to the TAIR10 annotation?
I was about to search for some more documentation on NCBI, but after a moment's hesitation I decided it would best to settle the question here and now by comparing the RefSeq annotation and the TAIR10 annotation. My first project as a graduate student was comparing different sources of annotation for the same genome, and I later built on this work to publish the ParsEval program. Using this tool, it should be fairly easy to compare the RefSeq GFF3 with the latest GFF3 from TAIR and determine if they're the same.
TAIR¶
First, let's take a look at the latest annotation from TAIR.
I followed the Download --> FTP Archive
link on the TAIR homepage to their FTP site, and poked around until I found the URL of the appropriate GFF3 file.
wget ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff
The file is uncompressed, so a simple head
command will give us a look at the first few lines.
head TAIR10_GFF3_genes.gff
It looks like TAIR doesn't use ##sequence-region
pragmas or any other metadata, just basic feature entries with 9 tab-delimited values.
Let's take a look at how the sequences are labeled by cutting out the first column of each entry and then pulling out unique values.
cut -f 1 TAIR10_GFF3_genes.gff | sort -u
I'm not really interested in organellar genomes, and the way they're annotated (especially by NCBI) is often problematic. Let's go ahead and remove the chloroplast genome and the mitochondrial genome from the GFF3 file, leaving the 5 chromosomes of the nuclear genome.
grep -v -e '^ChrM' -e '^ChrC' TAIR10_GFF3_genes.gff > tair10.gff3
NCBI/RefSeq¶
The NCBI FTP site is a bit easier to navigate.
Once you locate the /genomes/refseq/
directory, finding the Arabidopsis data is trivial: plant/ --> Arabidopsis_thaliana --> all_assembly_versions
.
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Arabidopsis_thaliana/all_assembly_versions/GCF_000001735.3_TAIR10/GCF_000001735.3_TAIR10_genomic.gff.gz
This file is compressed, so we combine head
with gunzip
to take a quick peak.
gunzip -c GCF_000001735.3_TAIR10_genomic.gff.gz | head
It looks like the RefSeq GFF3 file refers to the chromosome sequences by their accession number rather than the simple Chr1-style labels used by TAIR.
If we want to compare this data to the TAIR data, we'll need to convert these accession numbers to chromosome numbers.
It looks like region
entries with a chromosome
attribute give us the mapping we need to do the conversion.
gunzip -c GCF_000001735.3_TAIR10_genomic.gff.gz \
| grep $'\tregion\t' \
| grep 'chromosome='
I also know (based on personal experience) that trans-spliced genes can cause issues with ParsEval and the GenomeTools library it links to. Let's check whether there are any trans-spliced genes annotated in this file.
gunzip -c GCF_000001735.3_TAIR10_genomic.gff.gz \
| grep $'\tgene\t' \
| grep 'trans-splicing'
There are indeed a few. So, in addition to converting sequence accession numbers to chromosome numbers, we need to filter out these trans-spliced genes so that they don't cause trouble for our comparison.
gunzip -c GCF_000001735.3_TAIR10_genomic.gff.gz \
| sed -e 's/NC_003070.9/Chr1/' \
-e 's/NC_003071.7/Chr2/' \
-e 's/NC_003074.8/Chr3/' \
-e 's/NC_003075.7/Chr4/' \
-e 's/NC_003076.8/Chr5/' \
| grep -v -e 'GeneID:1466250' \
-e 'GeneID:3371312' \
-e 'GeneID:3371313' \
-e 'GeneID:3890477' \
-e 'GeneID:844801' \
> ncbi.gff3
Now we should be ready to compare the two annotations using ParsEval.
Let's use the --summary
flag to suppress the locus-level reports and print out only the aggregate report.
parseval --summary tair10.gff3 ncbi.gff3 2> stderr.txt
Well, other than a couple hundred UTRs that RefSeq appears to have trimmed, the annotations are identical. That's good enough for me! I'll add a recipe for TAIR6 to GenHub, and use the existing RefSeq recipe for TAIR10.