On my slate of work this week is a comparison of two versions of the Arabidopsis thaliana genome: the TAIR6 version (2005ish) and the latest TAIR10 version (2010). I figure using GenHub is the best way to make the process of retrieving and pre-processing the data reproducible. I already have a recipe for the Arabidopsis genome in GenHub, but it's sourced from RefSeq and not TAIR. I need to figure out if this corresponds to TAIR10. If so, I can save myself some work. If not, I will need to add two new recipes to GenHub (TAIR6 and TAIR10) instead of one.

A quick look at the RefSeq assembly report shows that the NCBI assembly is in fact the TAIR10 assembly. But what about the annotation? I assume it's also the TAIR10 annotation, but the documenation in that directory doesn't confirm this. Perhaps it's an annotation by NCBI's GNOMON pipeline, or perhaps NCBI made some improvements to the TAIR10 annotation?

I was about to search for some more documentation on NCBI, but after a moment's hesitation I decided it would best to settle the question here and now by comparing the RefSeq annotation and the TAIR10 annotation. My first project as a graduate student was comparing different sources of annotation for the same genome, and I later built on this work to publish the ParsEval program. Using this tool, it should be fairly easy to compare the RefSeq GFF3 with the latest GFF3 from TAIR and determine if they're the same.

TAIR

First, let's take a look at the latest annotation from TAIR. I followed the Download --> FTP Archive link on the TAIR homepage to their FTP site, and poked around until I found the URL of the appropriate GFF3 file.

In [1]:
wget ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff
--2016-02-04 21:15:05--  ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff
           => 'TAIR10_GFF3_genes.gff'
Resolving ftp.arabidopsis.org... 129.114.60.67
Connecting to ftp.arabidopsis.org|129.114.60.67|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /home/tair/Genes/TAIR10_genome_release/TAIR10_gff3 ... done.
==> SIZE TAIR10_GFF3_genes.gff ... 44139005
==> PASV ... done.    ==> RETR TAIR10_GFF3_genes.gff ... done.
Length: 44139005 (42M) (unauthoritative)

TAIR10_GFF3_genes.g 100%[===================>]  42.09M  3.55MB/s    in 14s     

2016-02-04 21:15:24 (3.08 MB/s) - 'TAIR10_GFF3_genes.gff' saved [44139005]

The file is uncompressed, so a simple head command will give us a look at the first few lines.

In [2]:
head TAIR10_GFF3_genes.gff
Chr1	TAIR10	chromosome	1	30427671	.	.	.	ID=Chr1;Name=Chr1
Chr1	TAIR10	gene	3631	5899	.	+	.	ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1	TAIR10	mRNA	3631	5899	.	+	.	ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1	TAIR10	protein	3760	5630	.	+	.	ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1	TAIR10	exon	3631	3913	.	+	.	Parent=AT1G01010.1
Chr1	TAIR10	five_prime_UTR	3631	3759	.	+	.	Parent=AT1G01010.1
Chr1	TAIR10	CDS	3760	3913	.	+	0	Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1	TAIR10	exon	3996	4276	.	+	.	Parent=AT1G01010.1
Chr1	TAIR10	CDS	3996	4276	.	+	2	Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1	TAIR10	exon	4486	4605	.	+	.	Parent=AT1G01010.1

It looks like TAIR doesn't use ##sequence-region pragmas or any other metadata, just basic feature entries with 9 tab-delimited values. Let's take a look at how the sequences are labeled by cutting out the first column of each entry and then pulling out unique values.

In [3]:
cut -f 1 TAIR10_GFF3_genes.gff | sort -u
Chr1
Chr2
Chr3
Chr4
Chr5
ChrC
ChrM

I'm not really interested in organellar genomes, and the way they're annotated (especially by NCBI) is often problematic. Let's go ahead and remove the chloroplast genome and the mitochondrial genome from the GFF3 file, leaving the 5 chromosomes of the nuclear genome.

In [4]:
grep -v -e '^ChrM' -e '^ChrC' TAIR10_GFF3_genes.gff > tair10.gff3

NCBI/RefSeq

The NCBI FTP site is a bit easier to navigate. Once you locate the /genomes/refseq/ directory, finding the Arabidopsis data is trivial: plant/ --> Arabidopsis_thaliana --> all_assembly_versions.

In [5]:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Arabidopsis_thaliana/all_assembly_versions/GCF_000001735.3_TAIR10/GCF_000001735.3_TAIR10_genomic.gff.gz
--2016-02-04 21:15:32--  ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Arabidopsis_thaliana/all_assembly_versions/GCF_000001735.3_TAIR10/GCF_000001735.3_TAIR10_genomic.gff.gz
           => 'GCF_000001735.3_TAIR10_genomic.gff.gz'
Resolving ftp.ncbi.nlm.nih.gov... 130.14.250.13, 2607:f220:41e:250::13
Connecting to ftp.ncbi.nlm.nih.gov|130.14.250.13|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /genomes/refseq/plant/Arabidopsis_thaliana/all_assembly_versions/GCF_000001735.3_TAIR10 ... done.
==> SIZE GCF_000001735.3_TAIR10_genomic.gff.gz ... 12647493
==> PASV ... done.    ==> RETR GCF_000001735.3_TAIR10_genomic.gff.gz ... done.
Length: 12647493 (12M) (unauthoritative)

GCF_000001735.3_TAI 100%[===================>]  12.06M  3.56MB/s    in 3.6s    

2016-02-04 21:15:37 (3.34 MB/s) - 'GCF_000001735.3_TAIR10_genomic.gff.gz' saved [12647493]

This file is compressed, so we combine head with gunzip to take a quick peak.

In [6]:
gunzip -c GCF_000001735.3_TAIR10_genomic.gff.gz | head
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build TAIR10
#!genome-build-accession NCBI_Assembly:GCF_000001735.3
##sequence-region NC_003070.9 1 30427671
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3702
NC_003070.9	RefSeq	region	1	30427671	.	+	.	ID=id0;Dbxref=taxon:3702;Name=1;chromosome=1;ecotype=Columbia;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_003070.9	RefSeq	gene	3631	5899	.	+	.	ID=gene0;Dbxref=GeneID:839580,TAIR:AT1G01010;Name=NAC001;gbkey=Gene;gene=NAC001;gene_biotype=protein_coding;gene_synonym=ANAC001,NAC domain containing protein 1,NAC001,T25K16.1,T25K16_1;locus_tag=AT1G01010
NC_003070.9	RefSeq	mRNA	3631	5899	.	+	.	ID=rna0;Parent=gene0;Dbxref=Genbank:NM_099983.2,GeneID:839580,TAIR:AT1G01010;Name=NM_099983.2;gbkey=mRNA;gene=NAC001;product=NAC domain-containing protein 1;transcript_id=NM_099983.2

It looks like the RefSeq GFF3 file refers to the chromosome sequences by their accession number rather than the simple Chr1-style labels used by TAIR. If we want to compare this data to the TAIR data, we'll need to convert these accession numbers to chromosome numbers. It looks like region entries with a chromosome attribute give us the mapping we need to do the conversion.

In [7]:
gunzip -c GCF_000001735.3_TAIR10_genomic.gff.gz \
    | grep $'\tregion\t' \
    | grep 'chromosome='
NC_003070.9	RefSeq	region	1	30427671	.	+	.	ID=id0;Dbxref=taxon:3702;Name=1;chromosome=1;ecotype=Columbia;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_003071.7	RefSeq	region	1	19698289	.	+	.	ID=id57764;Dbxref=taxon:3702;Name=2;chromosome=2;ecotype=Columbia;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_003074.8	RefSeq	region	1	23459830	.	+	.	ID=id90805;Dbxref=taxon:3702;Name=3;chromosome=3;ecotype=Columbia;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_003075.7	RefSeq	region	1	18585056	.	+	.	ID=id131785;Dbxref=taxon:3702;Name=4;chromosome=4;ecotype=Columbia;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_003076.8	RefSeq	region	1	26975502	.	+	.	ID=id164622;Dbxref=taxon:3702;Name=5;chromosome=5;ecotype=Columbia;gbkey=Src;genome=chromosome;mol_type=genomic DNA

I also know (based on personal experience) that trans-spliced genes can cause issues with ParsEval and the GenomeTools library it links to. Let's check whether there are any trans-spliced genes annotated in this file.

In [8]:
gunzip -c GCF_000001735.3_TAIR10_genomic.gff.gz \
    | grep $'\tgene\t' \
    | grep 'trans-splicing'
NC_001284.2	RefSeq	gene	79740	81297	.	-	.	ID=gene33323;Dbxref=GeneID:3371312;Name=nad2;exception=trans-splicing;gbkey=Gene;gene=nad2;gene_biotype=protein_coding;locus_tag=ArthMp026;part=1/2
NC_001284.2	RefSeq	gene	327890	333105	.	-	.	ID=gene33323;Dbxref=GeneID:3371312;Name=nad2;exception=trans-splicing;gbkey=Gene;gene=nad2;gene_biotype=protein_coding;locus_tag=ArthMp026;part=2/2
NC_001284.2	RefSeq	gene	20571	190761	.	-	.	ID=gene33329;Dbxref=GeneID:3371313;Name=nad5;exception=trans-splicing;gbkey=Gene;gene=nad5;gene_biotype=protein_coding;locus_tag=ArthMp006
NC_001284.2	RefSeq	gene	143219	318390	.	-	.	ID=gene33375;Dbxref=GeneID:3890477;Name=nad1;exception=trans-splicing;gbkey=Gene;gene=nad1;gene_biotype=protein_coding;locus_tag=ArthMp044
NC_000932.1	RefSeq	gene	69611	69724	.	-	.	ID=gene33454;Dbxref=GeneID:1466250;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=ArthCp001;part=1/2
NC_000932.1	RefSeq	gene	97999	98793	.	-	.	ID=gene33454;Dbxref=GeneID:1466250;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=ArthCp001;part=2/2
NC_000932.1	RefSeq	gene	69611	69724	.	-	.	ID=gene33521;Dbxref=GeneID:844801;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=ArthCp047;part=1/2
NC_000932.1	RefSeq	gene	139856	140650	.	+	.	ID=gene33521;Dbxref=GeneID:844801;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=ArthCp047;part=2/2

There are indeed a few. So, in addition to converting sequence accession numbers to chromosome numbers, we need to filter out these trans-spliced genes so that they don't cause trouble for our comparison.

In [9]:
gunzip -c GCF_000001735.3_TAIR10_genomic.gff.gz \
    | sed -e 's/NC_003070.9/Chr1/' \
          -e 's/NC_003071.7/Chr2/' \
          -e 's/NC_003074.8/Chr3/' \
          -e 's/NC_003075.7/Chr4/' \
          -e 's/NC_003076.8/Chr5/' \
    | grep -v -e 'GeneID:1466250' \
              -e 'GeneID:3371312' \
              -e 'GeneID:3371313' \
              -e 'GeneID:3890477' \
              -e 'GeneID:844801' \
    > ncbi.gff3

Now we should be ready to compare the two annotations using ParsEval. Let's use the --summary flag to suppress the locus-level reports and print out only the aggregate report.

In [10]:
parseval --summary tair10.gff3 ncbi.gff3 2> stderr.txt
============================================================
========== ParsEval Summary
============================================================

Started:                04 Feb 2016, 09:16PM
Reference annotations:  tair10.gff3
Prediction annotations: ncbi.gff3
Executing command:      parseval --summary tair10.gff3 ncbi.gff3 

  Sequences compared
    Chr1
    Chr2
    Chr3
    Chr4
    Chr5
    NC_000932.1
    NC_001284.2

  Gene loci................................25468
    shared.................................25468
    unique to reference....................0
    unique to prediction...................0

  Reference annotations
    genes..................................27206
      average per locus....................1.068
    transcripts............................35176
      average per locus....................1.381
      average per gene.....................1.293

  Prediction annotations
    genes..................................27202
      average per locus....................1.068
    transcripts............................35173
      average per locus....................1.381
      average per gene.....................1.293

  Total comparisons........................35078
    perfect matches........................35031 (99.9%)
      avg. length..........................2683.72 bp
      avg. # refr exons....................5.91
      avg. # pred exons....................5.91
      avg. refr CDS length.................412.90 aa
      avg. pred CDS length.................412.90 aa
    perfect matches with mislabeled UTRs...0 (0.0%)
    CDS structure matches..................47 (0.1%)
      avg. length..........................2981.28 bp
      avg. # refr exons....................4.34
      avg. # pred exons....................1.17
      avg. refr CDS length.................39.26 aa
      avg. pred CDS length.................39.26 aa
    exon structure matches.................0 (0.0%)
    UTR structure matches..................0 (0.0%)
    non-matches............................0 (0.0%)

  CDS structure comparison
    reference CDS segments.................196913
      match prediction.....................196913 (100.0%)
      don't match prediction...............0 (0.0%)
    prediction CDS segments................196913
      match reference......................196913 (100.0%)
      don't match reference................0 (0.0%)
    Sensitivity............................1.000
    Specificity............................1.000
    F1 Score...............................1.000
    Annotation edit distance...............0.000

  Exon structure comparison
    reference exons........................207206
      match prediction.....................207004 (99.9%)
      don't match prediction...............202 (0.1%)
    prediction exons.......................207057
      match reference......................207004 (100.0%)
      don't match reference................53 (0.0%)
    Sensitivity............................0.999
    Specificity............................1.000
    F1 Score...............................0.999
    Annotation edit distance...............0.001

  UTR structure comparison
    reference UTR segments.................65233
      match prediction.....................64991 (99.6%)
      don't match prediction...............242 (0.4%)
    prediction UTR segments................64992
      match reference......................64991 (100.0%)
      don't match reference................1 (0.0%)
    Sensitivity............................0.996
    Specificity............................1.000
    F1 Score...............................0.998
    Annotation edit distance...............0.002

  Nucleotide-level comparison      CDS          UTRs         Overall   
    Matching coefficient:          1.000        0.999        0.999
    Correlation coefficient:       1.000        0.996        --        
    Sensitivity:                   1.000        0.992        --        
    Specificity:                   1.000        1.000        --        
    F1 Score:                      1.000        0.996        --        
    Annotation edit distance:      0.000        0.004        --        

Well, other than a couple hundred UTRs that RefSeq appears to have trimmed, the annotations are identical. That's good enough for me! I'll add a recipe for TAIR6 to GenHub, and use the existing RefSeq recipe for TAIR10.

Comments