
Carnegie Institution for Science
Phone: (650) 325-1521 x 310
Fax: (650) 325-6857
Email: huala@acoma.stanford.edu
Overview
As director of The Arabidopsis Information Resource (TAIR), my main focus is on providing broad access to plant biology data and tools to researchers, educators, students and the general public. TAIR was launched in September 1999 and has grown into a vital resource for plant biology researchers around the world, providing access to the Arabidopsis genome sequence along with gene structures and splice forms, gene products, metabolic pathways, DNA and seed stocks, genome maps, genetic and physical markers, publications, and information about the Arabidopsis research community. The work of the TAIR group falls into three main areas: 1) manual and computational curation of Arabidopsis gene function, 2) improvements to genome annotation, including addition of newly discovered Arabidopsis genes, genome assembly corrections and periodic genome releases to NCBI and other sites, and 3) maintenance and improvement of the TAIR website, which provides public access to data detail pages along with search, display and data download tools for Arabidopsis data. Based on unique IP addresses and other usage tracking methods, approximately 35,000 people worldwide use TAIR each month (Figure 1).

Figure 1. Worldwide TAIR usage over one year (11/18/07 - 11/17/08). A total of 1.4 million visits originated from 183 countries.
Gene Function Curation
The main focus of TAIR’s gene function curation is extraction of experimentally verified Arabidopsis gene function data from the published literature. TAIR curators manually extract a variety of data types from published research articles, including gene symbols, phenotypes, expression patterns, and the molecular function, biological process and subcellular location of gene products. Gene Ontology (GO) and Plant Ontology (PO) controlled vocabulary annotations added between September 2004 and June 2008 include over 23,200 manual annotations to 4697 genes with molecular function, biological process, cellular compartment, plant structure or plant growth stage terms based on published experimental results. For genes lacking experimental evidence, a computational annotation is carried out as part of each new TAIR genome release. These computational annotations assign GO terms based on protein domain matches identified with InterProScan and on TargetP analysis results. New directions we are exploring in this area include collaborations with journals to gather gene function data directly from submitting authors at the time of publication, and text mining and natural language processing (NLP) approaches to increase curation efficiency on the existing literature corpus.
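The computational annotation step described above can be pictured with a minimal Python sketch. It is an illustration only, not TAIR's actual pipeline: the lookup tables, the function name `computational_annotations`, the gene identifiers and the single-letter TargetP codes are all hypothetical placeholders, and the rule that any experimentally annotated gene is skipped entirely is an assumption made for simplicity.

```python
from collections import defaultdict

# Illustrative lookup tables (assumed for this sketch, not TAIR's real mappings).
# Keys are InterPro domain accessions; values are GO terms implied by a match.
DOMAIN_TO_GO = {
    "IPR000719": ["GO:0004672", "GO:0006468"],
}
# Keys are hypothetical single-letter targeting predictions; values are
# GO cellular component terms (chloroplast, mitochondrion).
TARGETP_TO_GO = {
    "C": "GO:0009507",
    "M": "GO:0005739",
}


def computational_annotations(domain_hits, targetp_calls, experimentally_annotated):
    """Assign GO terms to genes that lack experimental annotations.

    domain_hits: dict of gene ID -> list of matched domain accessions
    targetp_calls: dict of gene ID -> predicted localization code
    experimentally_annotated: set of gene IDs to skip (assumed rule)
    """
    annotations = defaultdict(set)

    # GO terms inferred from protein domain matches.
    for gene, domains in domain_hits.items():
        if gene in experimentally_annotated:
            continue
        for domain in domains:
            annotations[gene].update(DOMAIN_TO_GO.get(domain, []))

    # GO terms inferred from predicted subcellular targeting.
    for gene, code in targetp_calls.items():
        if gene in experimentally_annotated:
            continue
        term = TARGETP_TO_GO.get(code)
        if term is not None:
            annotations[gene].add(term)

    # Drop genes that ended up with no terms; sort terms for stable output.
    return {gene: sorted(terms) for gene, terms in annotations.items() if terms}
```

Called with domain hits for three placeholder genes, one of which is already experimentally annotated and one of which matches no mapped domain, the function returns terms only for the remaining gene.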

Figure 2. Completeness of Arabidopsis gene function annotation with controlled vocabulary gene function terms (Gene Ontology and Plant Ontology) as of November 2008.
Genome Annotation
TAIR assumed responsibility for updating the Arabidopsis genome annotation following TIGR’s final genome release in January 2004. The three genome releases produced by TAIR since then have added a total of 2409 new genes, made 15,576 updates to gene structures (including 88 gene splits, 82 gene merges and 2253 updates to coding regions), and added 4211 variant splice forms, increasing the total number of genes with variant splicing to 4330. In addition to improved gene structures, the most recent release (TAIR8) contained updates to the underlying chromosome sequences, including removal of 14 regions of sequence contamination and 1425 single nucleotide corrections based on re-sequencing data. The TAIR8 release also contained a full annotation of native transposable elements and associated genes based on community submissions and previous TIGR annotations. Future work will focus on a more consistent and complete annotation of pseudogenes, more extensive error correction for the Arabidopsis genome sequence, and comparative genomic approaches to improve the accuracy of gene structures. Incorporation of new ecotype sequences and the genomes of Arabidopsis relatives is also a high priority.
Data Dissemination through the TAIR Website
The TAIR website serves as a convenient access point to data within the TAIR database as well as tools for analyzing data, well-organized lists of useful resources hosted at other sites, whole genome datasets for downloading, and information of interest to the community such as upcoming conferences and job listings. We provide extensive online and email help, custom dataset services upon request, and frequent workshops on using TAIR and other online resources. The TAIR database houses information on the complete genome sequence along with gene structure, gene product information, metabolic pathways, gene expression, DNA and seed stocks, genome maps, genetic and physical markers, publications, and researcher contact information. Tools available at TAIR include GBrowse (Figure 3) and other genome browsers, BLAST configured with a large number of custom plant-related datasets, the AraCyc metabolic pathways database, and other tools for sequence analysis, data visualization and gene name registration.

Figure 3. A small sample of the types of data currently accessible through the TAIR genome browser.