My goal is to build an infrastructure that allows researchers to share information
and knowledge in order to identify new insights and facilitate the process of
generating new paradigms in biology. A long-term goal is to systematically delineate
what is known and unknown in order to mobilize the research community to uncover
the rules that underlie the workings of an organism.
One of the most efficient ways of solving problems in biology lies in the use
of model organisms or systems in which the basic rules are uncovered and applied
to more diverse sets of organisms and problems. For higher plants, Arabidopsis
thaliana has been adopted as a model organism due to its small genome size,
self-compatibility, and short generation time. Since its adoption as a model
organism, many tools have been developed for this plant, including facile and
efficient methods of transformation, complete genome sequence, and high-density
genetic maps. Capturing and representing biological knowledge from studies using
Arabidopsis thaliana is the subject of my research. More specifically,
my group has developed a computer-based infrastructure to capture the research
community information and the knowledge generated in the research literature
and developed a query/analysis/visualization system to allow researchers to
identify correlations in the information. In the future, we would like to develop
a knowledge-capture system to bring the research findings directly into the
computer infrastructure, and develop a simulation system that can accurately
predict the outcome of any scenario that may occur in the plant.
Most knowledge resides in the minds of individual researchers
and their laboratories. Some of this knowledge is refined into publications.
With approximately 11,000 researchers and 4,000 laboratories around the world,
the Arabidopsis research community is arguably the largest model organism
research community to date, with the possible exception of the human biology research
community. The community for Drosophila melanogaster, an insect that has been the subject
of genetic research for almost 100 years (a history more than five times as long as
that of Arabidopsis), is about half the size of the Arabidopsis community,
at about 5,000 researchers.
In order to capture the knowledge from this large research community,
we need to develop an infrastructure that allows researchers to find and share
the information and knowledge generated. Advances in computer science and
communications technology have established the internet as the most efficient
medium for exchanging knowledge. In addition, advances in high-throughput
technologies such as sequencing and microarray methods have allowed biologists
to produce large quantities of data. Developing an infrastructure to house and
make accessible these large quantities of data has been a problem for many research
communities. In collaboration with information technology scientists at the
National Center for Genome Resources in Santa Fe, New Mexico, my group has been
engaged in developing an infrastructure to house the vast quantities of information
for Arabidopsis. The infrastructure is called The Arabidopsis Information Resource
(TAIR, http://arabidopsis.org). It is accessible
via commonly used web browsers, and its data can be searched and downloaded in a number
of ways. For example, researchers can identify genes or proteins of interest
based on many parameters (e.g. subcellular localization, expression patterns,
or mutant phenotypes) from the text-based search forms, sequence analysis tools,
or bulk query forms. SeqViewer (http://arabidopsis.org/servlets/sv)
allows visualization of these genes on the genome decorated with clones, transcripts,
genetic markers and polymorphisms. The SeqViewer interactively displays the
genome from the whole chromosome down to 10 kb of nucleotide sequence. Alternatively,
researchers can visualize these genes mapped on metabolic pathways from the
whole cell level down to individual reactions along with metabolic compound
structures using AraCyc (http://arabidopsis.org/tools/aracyc).
Upon finding relevant information about genes, researchers can order associated
DNA or seed stocks from the Arabidopsis Biological Resource Center (ABRC, http://arabidopsis.org/arbrc). Detailed
and up-to-date information about the database content as well as its usage statistics
can be found online (http://arabidopsis.org/about).
TAIR uses an object-oriented approach to data representation and software architecture.
The underlying database is implemented in a relational database management system
(Sybase version 11.9.2). The data is organized in a hierarchical structure where
a parent table groups a set of child tables with similar attributes and each
node can be linked to other nodes and tables. At the top of the data hierarchy
is the TairObject class, which is linked to other top parent classes such as
Attribution (source of the data), Reference (experimental evidence source),
and Annotation (descriptive information). Thus, the Attribution, Reference and
Annotation classes constitute the metadata of all TAIR objects. This design
makes it easy to add new data types, keeps the schema flexible, and minimizes
the number of linking tables. More detailed information about the database
schemas and documentation can be found online (http://arabidopsis.org/search/schemas.html).
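The hierarchy described above can be illustrated with a minimal Java sketch. It is not TAIR's actual schema or code: only the names TairObject, Attribution, Reference, and Annotation come from the description, while the fields, the methods, and the Gene subclass are hypothetical, and the syntax is modern Java rather than the version in use at the time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical metadata classes; each carries one illustrative field.
class Attribution {
    final String source;            // source of the data
    Attribution(String source) { this.source = source; }
}

class Reference {
    final String evidenceSource;    // experimental evidence source
    Reference(String evidenceSource) { this.evidenceSource = evidenceSource; }
}

class Annotation {
    final String description;       // descriptive information
    Annotation(String description) { this.description = description; }
}

// Top of the hierarchy: every data type inherits the three kinds of metadata.
abstract class TairObject {
    private final String name;
    private final List<Attribution> attributions = new ArrayList<>();
    private final List<Reference> references = new ArrayList<>();
    private final List<Annotation> annotations = new ArrayList<>();

    TairObject(String name) { this.name = name; }

    String getName() { return name; }
    void addAttribution(Attribution a) { attributions.add(a); }
    void addReference(Reference r) { references.add(r); }
    void addAnnotation(Annotation a) { annotations.add(a); }
    List<Attribution> getAttributions() { return Collections.unmodifiableList(attributions); }
    List<Reference> getReferences() { return Collections.unmodifiableList(references); }
    List<Annotation> getAnnotations() { return Collections.unmodifiableList(annotations); }
}

// A new data type only needs to extend TairObject (hypothetical subclass).
class Gene extends TairObject {
    Gene(String name) { super(name); }
}

class TairObjectDemo {
    public static void main(String[] args) {
        Gene gene = new Gene("exampleGene");   // placeholder name, not a real locus
        gene.addAttribution(new Attribution("example laboratory"));
        gene.addReference(new Reference("example article"));
        gene.addAnnotation(new Annotation("example description"));
        System.out.println(gene.getName() + " has "
                + gene.getAnnotations().size() + " annotation(s)");
    }
}
```

Because the metadata lives on the parent class, a new data type added this way picks up attribution, reference, and annotation handling without any extra linking code of its own.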
TAIR software is developed in a client-server mode using Java Servlet technology.
All applications are accessible to users through common web browsers to accommodate
the widest possible range of user platforms and operating systems. Software for
accessing the database is developed using an object-oriented architecture. A
set of Java classes called TAIR Foundation Classes serves a number of functions
to the front-end applications that use Java Server Pages. Documentation of the
TAIR Application Program Interface can be found in the 'About TAIR' section of the
home page. A set of bulk download tools based on flat files uses CGI scripts
written in Perl. Finally, a number of static HTML pages, updated weekly, serve
relevant Arabidopsis and external link information to the community.
This project, in its third year, is accessed from about 20,000 unique internet
addresses per month. Researchers around the globe generate approximately 2.5 million
hits and access 500,000 web pages every month. TAIR is currently the
most visible Arabidopsis project. For example, a search for the word 'Arabidopsis'
on Google (http://google.com) returns TAIR at the top
of the list.
II. PubSearch: A Comprehensive Literature Extraction and Curation System
Peer-reviewed research articles remain the best medium for representing and
disseminating the refinement of scientific knowledge. For any model organism
database (MOD), the literature is one of the main data sources, and significant
resources are devoted to capturing this information. Our long-term goal is to
develop a set of systematic procedures and tools for integrating knowledge from
the confined context of a research article into the dynamic, broad context of
a model organism database.
We have developed a literature curation tool called PubSearch, which stores
literature, gene, functional annotation, and keyword data in a stand-alone database
and allows curators to establish associations between these data types using
a web browser. In PubSearch, first-pass associations between terms (gene names
and keywords) and articles are made automatically by a string matching program
that indexes terms to articles. Commonly occurring words such as AND, THE, IF
(stop words) are filtered out to keep meaningless associations from being
stored. For terms with a higher signal-to-noise ratio, curators verify the matches
via the web browser user interface.
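The first-pass matching step can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not PubSearch's actual matcher: the class and method names are hypothetical, matching is simplified to case-insensitive whole-word lookup of single-token terms, the stop-word list holds just the three words named above, and GENEX is a made-up gene name.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class TermIndexerSketch {
    // Stop words named in the text; a real list would be much longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("and", "the", "if"));

    /** Returns the terms that occur as whole words in the article text, skipping stop words. */
    static Set<String> matchTerms(String articleText, Collection<String> terms) {
        // Tokenize the article once into lower-case words.
        Set<String> words = new HashSet<>();
        for (String word : articleText.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        Set<String> matched = new TreeSet<>();
        for (String term : terms) {
            String key = term.toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(key)) {
                continue;   // never store an association for a stop word
            }
            if (words.contains(key)) {
                matched.add(term);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        String text = "The GENEX gene and the meristem";   // made-up sentence
        System.out.println(matchTerms(text, Arrays.asList("GENEX", "meristem", "the", "root")));
    }
}
```

Matches produced this way are only candidates; in the workflow described above, a curator would still confirm or reject each one.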
PubSearch uses a simple database schema in a MySQL database management system
(DBMS) (version 3.21), which can be queried and updated over the internet from
a web browser through a password-protected login mechanism. The middleware is written
in Java (version 1.3) and uses Java Servlet and Java Server Page (JSP) technology.
The system is currently running on a Red Hat Linux 7.2 system with Tomcat (version
4.0) as the servlet engine. A demo of the current version of this tool and its
documentation can be accessed from:
http://tesuque.stanford.edu:9999/pub/index.jsp
Username: demo Password: demo
The tool has been used and refined for the past 6 months by 7 curators at TAIR
and 5 Arabidopsis curators at The Institute for Genomic Research (TIGR) to curate
over 12,000 articles. The tool is much more convenient and user-friendly than
our old system, which relied on flat files, and our curation work has become much more
efficient as a result.
In addition to providing curators with a sophisticated tool to facilitate literature
curation, this project significantly benefits three groups within the research community.
First, the Arabidopsis research community benefits from access to accurate and
consistent annotations of data objects from the literature, which are produced
in a fast, efficient manner. Second, researchers engaged in high throughput
genomic projects benefit by having access to reliable, high quality annotations
that can be used to enhance automated annotations. Often sequence comparison
is used to predict the potential function of genes and gene products in a newly
sequenced organism; accurate and detailed descriptions of a model genome and
its complements will improve the accuracy of the newly sequenced organism’s
annotation. Third, members of the computer science research community can use
the rules, methods and curated data to develop more sophisticated and accurate
algorithms to extract and analyze data from the literature. The set of human-curated
data along with explicit rules used for the annotations will provide much-needed
test data sets for developing and improving algorithms based on methods such
as natural language processing and machine learning. This final application
of the tool raises the possibility that manual curation of literature can be
greatly reduced, giving our curation teams the freedom to use their scientific
training to explore and question the data collected in MODs, leading to new hypotheses
and potential discoveries.
III. Gene Ontology Consortium and Plant Ontology Consortium: Establishing systematic ways of describing biology
for all organisms in both human and machine-readable forms
Although biology is one of the complex systems where large bodies of knowledge
exist, descriptions of rules underlying the knowledge reside in a thick semantic
soup. Attempts to standardize nomenclature across organisms have essentially
failed and remain a difficult task even within a single organism research community.
Recently, a few model organism databases (yeast, mouse, and Drosophila) have
joined forces to standardize the semantics with which to describe the roles
of genes and gene products (Gene Ontology (GO) Consortium, http://www.geneontology.org)
and my group has been an integral part of this effort since 2000. GO attempts
to describe the roles of genes and gene products in three broad aspects: molecular
function, biological process, and cellular component. Controlled vocabularies
within each of these three aspects are structured as directed acyclic graphs
(DAGs), which allow each term to have multiple parents.
Two types of parent-child relationship, 'is a' and 'part of', currently exist
in GO. Since joining this group, we have added over 500 terms relevant to plants
and restructured about 400 terms within the ontologies to better reflect
plant biology. Collectively, the consortium has developed over 12,000 terms. This project
has been well-received by the biology community and is currently used by over
10 large databases around the world, including SWISS-PROT and TIGR, and is being
implemented into MEDLINE.
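To make the graph structure concrete, the following Java sketch models a term that can have several parents, each reached through an 'is a' or 'part of' relationship, and collects all of its ancestors. It is an illustration only: the class and method names are hypothetical, the four terms are a toy example chosen for readability rather than an excerpt of the GO files, and the sketch does not check that the graph is acyclic.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class OntologyTerm {
    enum Relation { IS_A, PART_OF }

    final String name;
    // A term may have several parents, which is what makes this a DAG and not a tree.
    private final Map<OntologyTerm, Relation> parents = new LinkedHashMap<>();

    OntologyTerm(String name) { this.name = name; }

    void addParent(OntologyTerm parent, Relation relation) {
        parents.put(parent, relation);
    }

    /** Collects the names of all ancestors reachable through either relation type. */
    Set<String> ancestorNames() {
        Set<String> seen = new LinkedHashSet<>();
        Deque<OntologyTerm> toVisit = new ArrayDeque<>();
        toVisit.push(this);
        while (!toVisit.isEmpty()) {
            OntologyTerm current = toVisit.pop();
            for (OntologyTerm parent : current.parents.keySet()) {
                if (seen.add(parent.name)) {   // visit each ancestor only once
                    toVisit.push(parent);
                }
            }
        }
        return seen;
    }
}

class OntologyDemo {
    public static void main(String[] args) {
        OntologyTerm plastid = new OntologyTerm("plastid");
        OntologyTerm thylakoid = new OntologyTerm("thylakoid");
        OntologyTerm chloroplast = new OntologyTerm("chloroplast");
        OntologyTerm chloroplastThylakoid = new OntologyTerm("chloroplast thylakoid");

        chloroplast.addParent(plastid, OntologyTerm.Relation.IS_A);
        chloroplastThylakoid.addParent(thylakoid, OntologyTerm.Relation.IS_A);
        chloroplastThylakoid.addParent(chloroplast, OntologyTerm.Relation.PART_OF);

        System.out.println(chloroplastThylakoid.ancestorNames());
    }
}
```

Because ancestors are followed through both relation types, a gene annotated to the most specific term in this toy graph would also be found by a query for any of the three broader terms.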
Although the use of GO is becoming a standard, it has some limitations. For
example, it does not accommodate anatomical parts or developmental stages of
a multicellular organism. Furthermore, it does not attempt to describe traits
or phenotypes. In order to accommodate the description of genes and gene products
in Arabidopsis, we developed orthogonal vocabulary systems for anatomical parts
and developmental stages, in collaboration with Jonathan Clarke at the John Innes
Centre, UK. In addition, we have established a collaboration with other plant
model organism databases such as MaizeDB, Gramene, and IRRI, in a project called
the Plant Ontology Consortium, to develop shared anatomy and developmental stage
ontologies. In this project, Arabidopsis vocabularies have been used as the
baseline onto which terms from other plants have been added and the structures
modified with a goal to accommodate the description of all plant genes and gene
products.
The establishment and usage of these shared, controlled vocabularies will allow
researchers to query across all organisms for knowledge and begin to address
correlations between structure and function in explicit, systematic ways.
FUTURE PLANS IN THE NEXT FEW YEARS
I. Enhancement of TAIR schema and content
Currently the information in TAIR is heavily focused on the finished genome
and its gene complements. In the next few years, we would like to enhance the
structure of the TAIR database to represent more information about gene products.
This includes genetic, physical, and regulatory relationships between genes and
gene products. In addition, the relationship between genotype (polymorphism
in a sequence) and phenotype (of a germplasm harboring the polymorphism(s)) will
be established. Finally, more derived relationships of genes and gene products
will be stored; these include gene family information based on phylogenetic
analysis, expression clusters based on microarray data analysis, and metabolic
pathway groupings based on enzymatic assays.
II. Enhancement of TAIR’s query and data input systems
Most of the initial efforts on the TAIR project went into developing a database
structure to store complex data types and relationships to represent Arabidopsis
biology. In addition, a set of sophisticated query and data retrieval software
has been implemented. However, the current set of query tools does not reflect the
underlying complexity of the database structure. In the next few years, we will
focus on developing a comprehensive set of query tools that allow researchers
to retrieve any combination of, and correlation among, the data stored
in TAIR. In effect, we will be developing a user interface for researchers to
design and execute Structured Query Language (SQL) queries against the TAIR database.
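One way such an interface could turn a researcher's selections into SQL is sketched below in Java. This is an assumption-laden illustration, not a description of planned TAIR code: the class name, the table name gene, and the column names subcellular_location and expression_pattern are all hypothetical. The sketch emits a parameterized statement so that user-supplied values are bound separately instead of being pasted into the SQL text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class QueryBuilderSketch {
    // Accept only plain SQL identifiers so user input cannot alter the statement.
    private static void requireIdentifier(String name) {
        if (!name.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Not a valid identifier: " + name);
        }
    }

    /**
     * Builds a SELECT statement with one "column = ?" condition per filter.
     * The filter values are not embedded; they are meant to be bound later.
     */
    static String buildQuery(String table, Map<String, String> filters) {
        requireIdentifier(table);
        StringBuilder sql = new StringBuilder("SELECT * FROM ").append(table);
        String separator = " WHERE ";
        for (String column : filters.keySet()) {
            requireIdentifier(column);
            sql.append(separator).append(column).append(" = ?");
            separator = " AND ";
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        // Hypothetical selections made in a search form.
        Map<String, String> filters = new LinkedHashMap<>();
        filters.put("subcellular_location", "nucleus");
        filters.put("expression_pattern", "root");
        System.out.println(buildQuery("gene", filters));
        // prints: SELECT * FROM gene WHERE subcellular_location = ? AND expression_pattern = ?
    }
}
```

The resulting string would typically be passed to a JDBC PreparedStatement, with each filter value supplied through setString in the same order as the conditions.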
In addition, we will develop a set of data entry and update tools to allow
researchers to add and update any information in the database. Currently, we
have an interactive data entry system only for person or organization profile
information. We plan on expanding this to allow researchers to add information
about genetic markers, genes, proteins, microarray experiments, etc. In addition,
we will implement a system to allow a researcher to attach his or her own comments
to any information at TAIR. Our long-term goal is to establish TAIR as an essential
communication and research tool, the first place a researcher
goes to find out about any aspect of Arabidopsis biology. Some in-house
curation will always be essential, but we hope to distribute some of the curation
responsibilities to the researchers who generated the data and thus
create a cooperative resource.
III. Expansion of TAIR for plant researchers
Because the value of Arabidopsis derives from its utility in understanding
other plants, our goal is to build an infrastructure that permits facile high
resolution linking of specific information about Arabidopsis to similar information
in all other plants (and vice versa).
Ultimately, our goal is to provide the common vocabulary, visualization tools,
and information retrieval mechanisms that permit integration of all knowledge
about Arabidopsis into a seamless whole that can be queried from any perspective.
Of equal importance for plant biologists, the ideal TAIR will permit a user
to use information about one organism to develop hypotheses about less well-studied
organisms. In the next few years, we hope to develop user-friendly tools that
permit an individual working outside this model species to formulate a query
based on their organism of interest, have that query directed to the relevant
knowledge in Arabidopsis, and have the information presented in a way that can be understood
by any plant biologist. We will be making efforts to cross-link information
in TAIR with information about other plants and organisms in other databases.
In addition, we will develop a more comprehensive help system to allow researchers
not familiar with Arabidopsis to use the information in TAIR more effectively.
IV. Dissecting the unknown in Arabidopsis
Sequencing the genome revealed the extent of the gaps in our knowledge about Arabidopsis.
Approximately 27,000 genes (and 2,000 pseudogenes) have been predicted based on
gene prediction programs and sequence comparisons. Of these, approximately 30%
have evidence of transcription (e.g. ESTs available) but are not similar to
any genes of known function. About 10-15% of the genes do not have any
evidence of transcription (termed 'hypothetical'). In addition, only approximately
1% of the genes have experimental evidence for subcellular location.
In an effort to systematically characterize the unknown, we are collaborating
with four cell biology labs (David Jackson at Cold Spring Harbor Laboratory,
David Ehrhardt at Carnegie Institution, Vitaly Citovsky at SUNY Stony Brook,
and Natasha Raikhel at UC Riverside) to identify the subcellular localization of
approximately 800 genes that have no known function, are not similar to any known
genes, and have no localization information. The selected genes with their 5’
and 3’ intergenic regions will be PCR-amplified and fused to GFP, and the transgenic
plants harboring the clones will be examined for subcellular localization. Our
role will be to develop a Laboratory Information Management System (LIMS) to
store and prioritize the candidate genes for cloning based on a number of criteria
(including annotation downloaded from TAIR, existence of full-length cDNA, etc.),
track the status of the cloning, upload the preliminary results for internal
discussions, and export the data to TAIR and other public repositories. In addition,
the experimental results from this study will be used to identify potential
novel signal peptides and improve subcellular localization prediction algorithms.
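The prioritization and tracking role of the LIMS can be illustrated with a small Java sketch. Everything in it is hypothetical: the class names, the status values, and the single ranking rule (candidates with a full-length cDNA first, then alphabetical by identifier) are assumptions chosen for illustration, and the real system would weigh more criteria.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class CandidateGene {
    // Hypothetical cloning pipeline stages, loosely following the steps in the text.
    enum CloningStatus { SELECTED, AMPLIFIED, FUSED_TO_GFP, TRANSFORMED, IMAGED }

    final String id;
    final boolean hasFullLengthCdna;
    CloningStatus status = CloningStatus.SELECTED;

    CandidateGene(String id, boolean hasFullLengthCdna) {
        this.id = id;
        this.hasFullLengthCdna = hasFullLengthCdna;
    }
}

class CandidateQueue {
    /** Returns a new list with full-length-cDNA candidates first, then ordered by id. */
    static List<CandidateGene> prioritize(List<CandidateGene> candidates) {
        List<CandidateGene> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator
                // false sorts before true, so negate to put cDNA-backed genes first
                .comparing((CandidateGene g) -> !g.hasFullLengthCdna)
                .thenComparing(g -> g.id));
        return sorted;
    }

    public static void main(String[] args) {
        List<CandidateGene> candidates = Arrays.asList(
                new CandidateGene("geneB", false),   // placeholder identifiers
                new CandidateGene("geneC", true),
                new CandidateGene("geneA", true));
        for (CandidateGene gene : prioritize(candidates)) {
            System.out.println(gene.id + " " + gene.status);
        }
    }
}
```

Keeping the status as an explicit field on each candidate is one simple way to track progress through cloning before results are exported.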
V. Education and outreach to scientists, educators, and general public
We plan on expanding the resources at TAIR for education and outreach. First,
we will provide educational resources for high school and undergraduate-level
teachers (e.g. curricula, protocols, professional development materials) engaged
and interested in teaching plant courses and laboratories. In addition to gathering
these materials ourselves, we will implement an online submission form for teachers
and scientists to submit useful, classroom-tested protocols. Second, we will
establish a community of teachers and scientists by setting up a mailing list
and actively recruiting members from the scientific community to be involved
as advisors for the teachers. Third, we are developing an extensive set of help
pages, a glossary, and tutorials for the resources available at TAIR, to help
high school and undergraduate-level teachers and students use TAIR for
their projects. This aspect of the project will be enhanced by collaborations
with teachers who are interested in developing courses that use TAIR. We are
currently in discussion with a couple of local high school and community college
teachers.