Tetrahymena Whole-Genome Sequencing Project: A Concept Paper

November 5, 2001

 

Preface. To facilitate access to a large and diverse body of information, this concept paper is organized into two components: Main and Appendix. The Main document contains the introductory overview of the case for sequencing the Tetrahymena genome, a concise description of the sequencing and annotation project, and the specific answers to the Non-Mammalian Models Committee questionnaire. The Appendix expands on the coverage of two topics in the main document: advanced molecular genetic tools and unique or very special studies that would be enabled by the availability of the genome sequence. The latter are grouped by topic into numbered tables for easy reference from the main document. All the cited references are listed in the Appendix, organized so that related references are generally clustered together. Page numbers in tables of contents (main and appendix) may be sensitive to software print settings and are therefore only approximate.

 

Table of contents

 

Section | Topic | Page
1. | Introduction: Why sequence the Tetrahymena genome? | 1
2. | Aims and outline of the project | 4
3. | Answers to the Non-Mammalian Models Committee questionnaire | 5
  a) | Community process | 5
  b) | Other sources of support. | 6
  c) | Advantages and limitations of the model organism for research purposes. | 6
  d) | Justification for needing the genomic resources now. | 9
  e) | Existence or plans to develop the proposed resources outside the U.S. | 10
  f)&g) | Unique advantages of having the genomic information of this organism and scientific advances that will be made possible. | 10
  h) | Cost of the project. | 11
  i) | Duration of the project. | 11
  j) | Support of resources after the completion of the project. | 12
  k) | Availability of data and resources generated by this project to the research community. | 12
  l) | Genomic resources currently existing. | 13
  m) | Size of the research community. | 14
  n)&o) | Who will benefit from the improved genomic resources and how? | 14
  p) | Material transfer agreements. | 15
Table 1 | Tetrahymena Genomic Resources and Database Needs | 16

 

Appendix

 

------------------

 

1. Introduction: Why sequence the Tetrahymena genome?

 

Tetrahymena is a fresh-water protozoan that is highly successful ecologically. It has been used as a microbial animal model for more than 75 years -- ever since Nobel Laureate Andre Lwoff [1] in 1923 succeeded in growing this unicell under axenic conditions, i.e., in pure culture. Tetrahymena has typical eukaryotic biology. Although unicellular, Tetrahymena displays a degree of cellular structural and functional complexity fully comparable to that of humans and other metazoans. Its ultrastructural morphology, cell physiology, development, biochemistry, genetics, and molecular biology have been comprehensively investigated [2-6]. Certain eukaryotic mechanisms are uniquely or especially well developed in Tetrahymena, and have facilitated discoveries that have generated major fields of fundamental research:

 

Advanced molecular and genetic tools developed in Tetrahymena have maintained this organism at the forefront of fundamental research. This is particularly the case in areas that are less accessible to in vivo experimental investigation in other model organisms, such as regulated secretion, cell motility, phagocytosis, telomere function, function of post-translational modifications of histones and tubulins, and developmental DNA rearrangements. Sustained extramural grant support of Tetrahymena research and published statements by leading researchers doing related work on other organisms support this self-assessment: telomerase [21], tubulins [22-24], regulated secretion of stored proteins [25] and development [26].

 

The advances in Tetrahymena knowledge and technology have resulted from the very productive and highly collaborative efforts of the ciliate community, which is the largest genetic model organism community without a genome project. The juncture has now been reached where the enormous potential of Tetrahymena for research in various areas -- fundamental science, genomic, biomedical, public health, bioagricultural, environmental and biotechnological -- will be wasted unless its genome sequence is quickly determined. The ciliate molecular biology research community has chosen Tetrahymena as the ciliate whose genome should be sequenced first because it has the most advantageous combination of biological features, the only genetic and physical mapping and other important accumulated genomic resources, and the most powerful array of molecular genetic tools for post-genomic in vivo experimental functional genomics. In this document, the ciliate community proposes a project to sequence, assemble, annotate and make publicly available the entire Tetrahymena expressed (macronuclear) genome, under a plan described later in this concept paper.

 

There are at least five major, unique or special reasons (described in more detail later) why the sequencing of the Tetrahymena genome would be an important contribution to science.

 

1) Evolutionary genomics: key phylogenetic position for comparative genomics. The ciliate Tetrahymena occupies a key position in the third major, independent branch of eukaryotic evolution, the Alveolata [27; 28]. All of the model organisms that have "completed" or on-going genome projects belong to the two other major clades: the opisthokonts and their relatives (metazoa, fungi, Dictyostelium) or the Viridiplantae (plants and Chlamydomonas). The Alveolata also include the Dinoflagellates and the Apicomplexa -- a group exclusively composed of medically or agriculturally important parasites of metazoa. Several Apicomplexans, e.g., Plasmodium (the human malarial parasite), have ongoing genome projects, but their genomes are small (10-20% of the Tetrahymena genome). This genome simplification likely results mainly from the loss of functions supplied by their hosts. No free-living member of the entire Alveolate clade -- let alone an experimentally tractable genetic model organism -- has an ongoing genome project.

 

2) Investigating the unknown functions of important human genes. Humans share a higher degree of functional conservation with ciliates than with other microbial model organisms. This is evidenced by the better matches (i.e., lower probability of a chance match) of Tetrahymena ESTs [29] and Paramecium coding sequences [30] to human sequences than to those of non-ciliate microbial model organisms. Significant Tetrahymena EST matches to human proteins are not limited to housekeeping genes [29]. Examples include an opioid-regulated protein with previously unknown function (recently elucidated in Tetrahymena [31]), a protein required for stem cell maintenance [32], a brain NMDA-receptor glutamate-binding protein, and several human brain-expressed genes of unknown function sequenced by the Japanese KIAA project. Some of those proteins are not found in yeast. Tetrahymena is thus an excellent unicellular animal model.

 

Sequence conservation over more than a billion years of independent evolution predicts that the function of the genes is important, that mutations disrupting them are likely to cause human hereditary disease, and that the proteins have retained their basic, ancestral biochemistry and molecular biology. Thousands of human genes of unknown function are predicted by the human genome sequence [33; 34]. Sequence conservation, coupled with the advanced and powerful experimental tools available in Tetrahymena, would thus give the biomedical research community an enormous opportunity to use Tetrahymena in the experimental elucidation of the in vivo function of many important human genes at the cell and molecular level. The results of this work would complement investigations of human gene function at more integrative levels using multicellular animal models.

 

3) Experimental functional genomics: advanced molecular genetic tools. An impressive array of robust and novel molecular genetic tools has placed Tetrahymena at the forefront of experimental, in vivo functional genomics research [35]. Two unique genetic features, heterokaryons and assortment genetics, are used in combination with a battery of DNA-mediated transformation techniques in novel, powerful and versatile ways. We anticipate an increased use of these methods by the general scientific community once the genome sequence becomes available.

 

4) Exploiting unique or special biological features. Tetrahymena is a unicell that can be grown rapidly (down to a 1.5-hr doubling time) and inexpensively, as a genetically and physiologically homogeneous culture, in a totally defined chemical and physical environment. These and other especially favorable features not only facilitate important on-going experimental investigations of fundamental biological mechanisms, but also make it a very useful model organism for pharmacological and drug screening purposes and important biotechnological applications, and a favorite organism for environmental toxicology and monitoring.

 

5) Advancing genome-sequencing technology. The drive to sequence the scientifically important Tetrahymena genome creates an opportunity to develop and test methods that meet the challenge of completely and efficiently finishing the intergenic segments of a larger A+T-rich genome. The sequencing of other important A+T rich genomes (e.g., other model ciliates, apicomplexan parasites, some bacteria) could well benefit from advances in the science of genomic sequencing accomplished in the context of the Tetrahymena genome.

 


2. Aims and outline of the project

 

We propose to sequence the expressed (somatic or macronuclear) genome because, during its programmed differentiation, it retains all the genes and other DNA elements required for vegetative life, while eliminating most of the repeated sequence in the germline genome. Furthermore, we propose to seek the finished sequence of the genome for several scientifically important reasons:

 

The size of the Tetrahymena macronuclear (MAC) genome (~180 Mb) precludes a timely, cost-effective sequencing effort distributed across the Tetrahymena research community. The plan that follows is based on careful consideration of interest, feasibility assessment, sequencing approaches and preliminary cost estimates provided by five major sequencing facilities (The Institute for Genomic Research (TIGR), Whitehead Institute-MIT, University of Oklahoma, University of Washington, and Integrated Genomics Company). Three centers (including the Joint Genome Institute) have obtained Tetrahymena DNA from us and intend to start genomic test-sequencing.

 

a) Whole-genome shotgun (WGS) sequencing. We propose to first sequence randomly sheared DNA from purified macronuclei to a depth of 8-fold coverage, using a mix of inserts from 4-kb and 10-kb libraries and from a 50-kb jumping (or linking) library. This will exploit automation and high throughput technology currently available at major sequencing centers. This level of sequencing will allow the assembly of most, if not all, of the genome into contigs and scaffolds. We expect that this depth of WGS will yield the finished sequence and the opportunity for annotation of virtually every Tetrahymena gene, because of the following advantages for protein coding sequence cloning, sequencing and gene prediction:

 

b) Closure of the genomic sequence. Most, if not all, of the Tetrahymena macronuclear genome can be closed with high throughput technology already in use in other genomic projects. Higher priority will be given to the closure of protein coding segments and their flanking sequence. Sequencing the macronuclear genome will avoid obstacles that have prevented closure of the other eukaryotic genomes, e.g., centromeric DNA, repeated DNA and extended GC tracts. Closure of those (protein non-coding) regions that have the highest AT composition may present a challenge and an opportunity. Closure of the sequence of two entire ~1 Mb chromosomes from the malarial parasite Plasmodium [39; 40] shows that the challenge can be overcome, even when their average A+T composition (83%) is significantly higher than that of Tetrahymena genome (75%). The opportunity is to use the Tetrahymena sequencing project to develop and test technology to facilitate the cloning and sequencing of larger AT-rich-DNA genomes.
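The shotgun depths discussed in this section (8-fold for the full WGS stage, 5-fold as the point at which electronic annotation can begin) can be put in rough numbers with the idealized Lander-Waterman model. The sketch below is illustrative only: the 500-bp average read length is an assumption that does not come from this document, and the model assumes uniformly random reads, which AT-rich regions and cloning bias will violate, so real gap counts are likely to be higher.

```python
import math

GENOME_SIZE_BP = 180_000_000  # ~180 Mb MAC genome, as stated in the text
READ_LENGTH_BP = 500          # ASSUMPTION: average usable read length


def shotgun_estimates(coverage, genome_size=GENOME_SIZE_BP,
                      read_length=READ_LENGTH_BP):
    """Idealized Lander-Waterman estimates for a uniform random shotgun."""
    n_reads = coverage * genome_size / read_length
    # Probability that a given base is covered by no read at all.
    frac_uncovered = math.exp(-coverage)
    return {
        "reads": n_reads,
        "uncovered_bp": genome_size * frac_uncovered,
        # Expected number of contigs (and hence gaps to close), ignoring
        # the minimum overlap needed to join two reads.
        "expected_gaps": n_reads * frac_uncovered,
    }


for depth in (5, 8):
    est = shotgun_estimates(depth)
    print(f"{depth}x: {est['reads']:,.0f} reads, "
          f"{est['uncovered_bp']:,.0f} bp uncovered, "
          f"{est['expected_gaps']:,.0f} expected gaps")
```

Under these assumptions the step from 5-fold to 8-fold coverage reduces the expected number of gaps by about a factor of e^3 per read, which is the quantitative basis for treating 8-fold WGS as the starting point for closure.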

 

c) Annotation of the Tetrahymena genomic information. The value of the Tetrahymena genomic sequence will be greatly enhanced by its high quality annotation, which will be done according to the following stages:

1.      Electronic annotation by the sequencing center, including the prediction of coding sequence, genes, cell compartment targeting, domains, motifs, etc. This stage is expected to identify the vast majority of genes. This stage can profitably begin from assembled contigs once the WGS effort has reached a depth of 5-fold coverage.

2.      Manual, gene-by-gene annotation by sequencing center experts on different cell processes and protein families. This process can begin once the coding region sequence has been declared finished.

3.      Annotation "jamboree" by experts in the ciliate research community, assisted by bioinformatics resources of the sequencing center. This will provide the most advanced and ciliate-specific level of annotation.

4.      Ownership of the sequence and its annotation will then be transferred to the Tetrahymena database. Subsequent maintenance, extension and refinement will become the responsibility of its curators (see section 3k).

 

3. Answers to the Non-Mammalian Models Committee questionnaire (http://www.nih.gov/science/models/process/index.html)

 

a. By what process did the community obtain input and reach a consensus about the priority for the proposed project?

 

b. What other sources of support, including non-U.S. sources, exist?

·         Project under the direction of William Nierman (TIGR) to make a Tetrahymena BAC library of ~50-kb inserts. This library would supplement or replace the linking library for the assembly and scaffolding of the Tetrahymena MAC genome.

·         We are currently exploring additional sources of partial support for the genome-sequencing project.

Table 1 contains a more systematic listing of funding, already awarded, for completed or in-progress genome-wide projects.

 

c. What are the advantages and limitations of the model organism for research purposes, including genome size, tractability for genetic studies, ease of use, generation time, storage of organism or gametes, etc.?

 

c1. Genome size and other genomic features

Advantages:

Potential Limitation:

 

c2. Tractability for genetic studies. Because some of the most powerful molecular genetic tools depend on unique Tetrahymena features (germline/soma differentiation, allelic assortment in the polyploid MAC, amplification of the major rRNA genes as a minichromosome all their own), the concise summary below is supplemented with more detailed explanation in Appendix Section 1. A recently published volume of Methods in Cell Biology [6] contains detailed protocols for the use of these tools.

 

a) "Conventional genetics" tools

·         Conjugation is readily induced and experimentally manipulated, allowing crossing, genetic analysis, mapping, and manipulation of replaced genes.

·         Readily inducible self-fertilization, leading to whole-genome homozygotes in a single step.

 

b) Advanced molecular genetic tools. These tools have been described at greater length (Appendix section 1 and [35]).

 

c3. Ease of use [reviewed in 44]

·         Dual, self-sufficient nutritional modes: particle (bacteria) phagocytosis and small-molecule uptake by active transport. This gives complete control of Tetrahymena's chemical and physical growth environment, as well as making phagocytosis essential or not according to experimental conditions.

·         Large cell size (50 x 30 µm): facile injection, cytology, immunocytology and FISH, electrophysiological recording and large-scale cell fractionation (micronuclei, mature and developmental stage-specific macronuclei, nucleoli, mitochondria, cilia, phagosomes, lysosomes, protein storage secretory granules, cell cortex, etc.).

·         Growth under wide range of volume conditions - microdrops to large bioreactors: industrial-scale production of valuable small molecules and macromolecules.

·         Vast temperature range for growth (18°C-41°C): great latitude in experimental investigation.

·         Readily visualized and quantifiable physiological endpoints: growth rate, phagocytosis rate, induced exocytosis, swimming speed and direction, chemotaxis, active water expulsion rate, cytokinesis, conjugation, meiosis induction, nuclear differentiation: useful for fundamental studies and for determining quantitative structure/activity relationships (e.g., drug design, environmental toxicants)

 

c4. Generation time

·         Fastest growing microbial animal model (down to a 1.5-hr doubling time): quick results, low maintenance costs, compact space requirements, noncontroversial animal model.

 

 

c5. Storage of the organism or gametes

Cells are readily frozen alive at liquid nitrogen temperature [45]. This allows long term maintenance and germline protection of valuable strains.

 

c6. Other advantages

·         Mitosis and meiosis restricted to a germline nucleus that is non-essential for growth, and gene transcription restricted to a non-mitotic nucleus: facilitates mutational and other experimental analyses of these processes.

·         Developmentally programmed, site-specific DNA rearrangements during MAC differentiation: precise germline-determined chromosome breakage; formation of physically and genetically identifiable MAC chromosomes; extensive, suppressible chromosome diminution.

·         Developmentally-regulated nuclear apoptosis at the "early development" stage of conjugation, resulting in the selective elimination of the parental macronucleus.

·         Single germline copy of ribosomal RNA genes. This is a unique feature of Tetrahymena that has allowed conventional rRNA genetics, ribosomal antisense repression and mutagenesis technology (see section c2 and Appendix section 1).

·         Abundance of sibling species with well-characterized phylogeny: useful for decryption of regulatory DNA elements and functional RNA domains.

 

c7. A challenge: the Tetrahymena variant genetic code (UAR=Q)

The ciliates constitute a remarkable natural laboratory for investigating the late evolution of a diversity of variant genetic codes. In Tetrahymena [46] and many other ciliates, UAR (UAA and UAG, stop codons in the "universal code") are additional glutamine codons, leaving UGA as the only stop codon. This phenomenon poses no significant problem for the expression of foreign genes in Tetrahymena, which has already begun and is likely to become a major use of this organism. Genes already ending in UGA (from other ciliates or from universal code organisms) are directly expressible in Tetrahymena. For the rest, it would be sufficient to include a UGA codon flanking the insertion site in a universal expression cassette.

 

The variant code does become an important consideration when expressing Tetrahymena genes in universal-code cells. This challenge has been addressed by different approaches: 1) a general approach based on expression in a host carrying a UAR nonsense suppressor [47] or 2) UAR to CAR codon replacement in the Tetrahymena gene -- either by targeted in vitro mutagenesis (if the codons are few) or by efficient protocols for de novo synthesis of the entire gene [48-50].
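The second approach, UAR-to-CAR codon replacement, amounts to a simple in-frame substitution at the DNA level (TAA to CAA, TAG to CAG). The following is a minimal sketch of that substitution, not a published protocol; the function name and the requirement that the input end in TGA are illustrative assumptions.

```python
# Codons that are stops in the universal code but glutamine in Tetrahymena,
# mapped to the universal-code glutamine codon with the same third base.
UAR_TO_CAR = {"TAA": "CAA", "TAG": "CAG"}


def recode_uar_to_car(cds):
    """Replace in-frame TAA/TAG codons with CAA/CAG.

    `cds` is a DNA coding sequence that starts at the first codon and ends
    with the Tetrahymena stop codon TGA (an assumption of this sketch).
    Returns the recoded sequence and the number of codons changed.
    """
    cds = cds.upper()
    if not cds or len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a non-zero multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons[-1] != "TGA":
        raise ValueError("expected the coding sequence to end with TGA")
    # Only whole in-frame codons are examined, so TAA/TAG triplets that
    # span two codons are left untouched.
    recoded = [UAR_TO_CAR.get(codon, codon) for codon in codons]
    n_changed = sum(1 for old, new in zip(codons, recoded) if old != new)
    return "".join(recoded), n_changed


recoded, n_changed = recode_uar_to_car("ATGTAACAATAGAAATGA")
print(recoded, n_changed)
```

The count of changed codons indicates which route is practical: a handful can be handled by targeted in vitro mutagenesis, while a large number argues for de novo synthesis of the gene.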

 

d. What is the justification for needing the genomic resources now, rather than later, when costs are likely to be lower?

 

e. Do the proposed resources exist, or are there plans to develop such resources, outside the U.S.?

A significant number of the resources needed to support this proposed project already exist (see section 3l).

None of the resources proposed in this project exist outside the U.S. A project is being explored with Genome Canada to contribute sequence and closure in conjunction with (i.e., as matching funds for) this proposed project (section 3b).

 

f & g. What are the unique advantages of having the genomic information of this organism? What scientific advances will be made possible that otherwise would not, given the current state of the genomic tools?

Tetrahymena has conserved virtually all the ancestral cellular processes and structures shared by humans and other currently living eukaryotes. In addition, this organism exhibits insightful elaborations of basic eukaryotic mechanisms, which highlight the functional versatility and diversity of these mechanisms, and render them particularly accessible to investigation. We expect that thousands of Tetrahymena proteins will have sequence homology with important human proteins, the mutational dysfunction of which is expected to cause hereditary disease (section 1b). Thus the availability of the Tetrahymena genome sequence, favorable biological features (Section 3c) and advanced experimental tools (section 3c2 and Appendix section 1) should significantly contribute to the elucidation of the molecular basis of diseases that fall under the mission of every medically-oriented NIH institute. For the moment, we list concrete areas of general research interest in which on-going investigations in Tetrahymena, in combination with the availability of the genome sequence, present unique opportunities to contribute to the mission of a number of NIH Institutes and other federal granting agencies. These research areas are listed in the table below, with specific pointers to the more detailed information given in the appropriate (and referenced) tables of Appendix Section 2.

 

Area and Potential funding source | Table
In vivo telomere function and telomerase function (NCI, NIA, NIGMS, NSF) | 1
Chromosome copy number homeostasis (NCI, NIGMS, NSF) | 2
Developmentally regulated chromosome breakage and telomere formation (chromosome healing) (NCI, NIA, NIGMS, NSF) | 2
Developmentally regulated ribosomal gene amplification (NCI, NIGMS, NSF) | 2
Developmentally regulated, immunoglobulin-gene-like chromosome breakage-rejoining (chromatin diminution) (NCI, NHLBI, NIAID, NIGMS, NSF) | 2
Function of chromatin in mitosis, meiosis, transcription and developmentally-programmed DNA rearrangement (NCI, NIGMS, NSF) | 3
Function of histone post-translational modifications (NCI, NIGMS, NSF) | 3
Epigenetic inheritance (NCI, NICHD, NIGMS, NSF, USDA) | 4
Developmentally regulated apoptosis (NCI, NHLBI, NIA, NIAID, NIAMS, NICHD, NIGMS, NINDS, NSF) | 5
Genetic analysis of the functions of the large rRNAs (NIGMS, NSF) | 6
Microtubule diversity: 17 distinct systems including cilia, centriolar structures and mitotic spindles (NCI, NEI, NHLBI, NICHD, NIGMS, NSF) | 7
Function of tubulin post-translational modifications (NIGMS, NSF) | 7
Cytoskeletal motors (NCI, NHLBI, NIAMS, NICHD, NIGMS, NINDS, NSF) | 8
Regulated secretion of protein storage granules (NIDDK, NINDS) | 9
Phagocytosis and phagosome-mediated bacterial pathogenesis (NEI, NHLBI, NIAID) | 10
Chemoreception and signal transduction (NCRR, NHLBI, NIDA) | 11
Cell-cycle dependent regulation of cytoskeletal proteins (NCI) | 12
Stem cell maintenance (NCI, NICHD) | 12
Differential control of DNA replication and division in germinal and somatic nuclei (NCI) | 12
Highly organized and polarized cell cortex; cellular handedness; intracellular positional information (NICHD, NSF) | 13
Developmentally-controlled interactions between nuclei and the cell cortex (NICHD, NSF) | 13
Biotechnology (NCI, NIAID, NIDDK, USDA) | 14
Environmental adaptation and monitoring (DOE, EPA, NIEHS, NSF) | 15
Eukaryotic evolution (NHGRI, NIGMS, NSF) | 16
Science education opportunities (NHGRI, NSF) | 17
Additional intriguing phenomena in Tetrahymena, not yet investigated molecularly: sexual maturation, resistance to viral infection, and conserved signaling elements | 18

 

h. With as great precision as possible, what is the cost of the project?

The estimated total cost is $20.9M, distributed as follows:

 

Stage | Year | Category | Cost
1 | 1 | 8X-coverage WGS sequencing and assembly* | $7.5M
  |   | Database establishment** | $0.55M
2 | 2 | Sequence closure and electronic annotation | $9.5M
  |   | Database maintenance | $0.35M
3 | 3 | Manual annotation by sequencing center experts | $2.5M
  |   | Database expansion to deal with the finished sequence | $0.45M
Total | (3-year) | Direct and indirect costs*** | $20.9M

 

* If funding availability becomes a limiting factor, it would be very important to at least accomplish a 5-fold coverage WGS sequencing during the first year, in order to be able to quickly initiate the electronic annotation. This 5-fold level of WGS sequencing would already bring major project benefits for comparative and experimental functional genomics to the ciliate and general research community, because it should identify virtually all the Tetrahymena genes -- and already provide finished sequence for many -- that will match those in other organisms. While cloning some genes from partial sequence would still involve some gene-by-gene labor, this would already enormously accelerate the pace of research and discovery.

** The needs are described in detail in section k below. Estimated costs are based on current salary scales and are reported on the basis of establishing an independent database. Some savings are expected (mainly in salaries) if, as we intend, we can affiliate with an existing model organism database.

*** This total would be reduced by any matching contribution negotiated with Genome Canada.
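As a cross-check on the cost table, the six line items can be summed by project year. The sketch below copies the figures from the table; the variable and function names are illustrative. The items add up to $20.85M, which matches the stated total of $20.9M once rounded to one decimal place.

```python
from decimal import Decimal, ROUND_HALF_UP

# (year, category, cost in millions of dollars), copied from the cost table
BUDGET = [
    (1, "8X-coverage WGS sequencing and assembly", Decimal("7.5")),
    (1, "Database establishment", Decimal("0.55")),
    (2, "Sequence closure and electronic annotation", Decimal("9.5")),
    (2, "Database maintenance", Decimal("0.35")),
    (3, "Manual annotation by sequencing center experts", Decimal("2.5")),
    (3, "Database expansion to deal with the finished sequence", Decimal("0.45")),
]


def totals_by_year(budget):
    """Sum the line items for each project year."""
    totals = {}
    for year, _category, cost in budget:
        totals[year] = totals.get(year, Decimal("0")) + cost
    return totals


yearly = totals_by_year(BUDGET)
grand_total = sum(yearly.values(), Decimal("0"))
# Decimal arithmetic avoids binary floating-point surprises when rounding.
rounded_total = grand_total.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

for year in sorted(yearly):
    print(f"Year {year}: ${yearly[year]}M")
print(f"Total: ${grand_total}M (rounds to ${rounded_total}M)")
```

Any matching contribution from Genome Canada would be subtracted from the grand total, as the third footnote indicates.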

 

i. What is the duration of the project?

Three years, as follows:

Year 1:

Year 2:

·         Complete any unfinished WGS and start closure of the genome sequence

·         Complete the establishment of the database and begin incorporating and curating data generated by the genome project.

Year 3:

·         Complete the closure of the sequence

·         Manual annotation of the genomic sequence

·         Tetrahymena database: complete the incorporation of the annotated information generated at the completion of the sequencing project.

 

j. How will resources, such as databases and repositories, be supported after the completion of the project?

After the completion of this project, support for database maintenance will be sought among private sources. It seems likely that pharmaceutical and biotech industries will be willing to contribute such support once they recognize the value of the Tetrahymena genomic information.

 

k. How will data and resources generated by this project be made available rapidly and efficiently to the research community?

Once the genome sequence is available, we expect that Tetrahymena's experimental tractability and the completeness of its eukaryotic genome will make it an excellent complement to yeast among the unicellular eukaryotes and to Drosophila and C. elegans and model vertebrates among the animal models. Consequently, we plan to provide both the general and the ciliate scientific research communities with prompt, free, unrestricted and user-friendly access to the Tetrahymena genomic information. This includes:

·         Posting partial sequence and assembled contigs in a public, freely available and user-friendly sequencing center database as they become available.

 

The Tetrahymena database would be initially staffed as follows, taking into account the significant genome size and gene number:

1) A full-time Database Administrator, with strong bioinformatics experience.

2) A full-time Programmer, to customize software for use with Tetrahymena. In years two and three, this position would be reduced to 50%.

3) Two full-time Curators, having responsibility for:

The Curators should have substantial experience with Tetrahymena/ciliate experimental biology and genomics. Two well-qualified members of the ciliate research community have already expressed interest in filling those positions. A third Curator would be added in year 3, to deal with the influx of finished genomic sequence.

 

In addition, we would need hardware with sufficient power to deal with the computational needs of a 180 Mb genome and 30,000 genes.

 

We have considered two general strategies for establishing the Tetrahymena database.

1) Affiliation with a well-established genetic model organism database -- such as FlyBase, SGD or WormBase -- which would be immediately capable of providing administration, server and informatics support.

2) Establishing our own independent database, with the staffing proposed above.

We strongly favor the first alternative as a time- and cost-effective way to establish our database. We have had exploratory discussions with senior members of the above databases and we have been greeted with a supportive attitude. We have also learned that packages are under development that would facilitate and generalize the establishment and maintenance of independent genetic model organism databases. Thus we are confident that we can successfully establish a database that will be responsive to the needs of the scientific community. Once the Tetrahymena database is functioning successfully, we look forward to expanding its services to become a general ciliate database, including genomic data from Paramecium and other experimental ciliates that possess their own valuable biological and experimental features.

 

l. What genomic resources, including databases and repositories, currently exist?

1) Tetrahymena resources already available, in progress or already funded. They are listed in Table 1. Below we highlight those that will most directly facilitate the genome project.

·         Physically mapped sequence-tagged sites (STS): Cbs, cloned DNA polymorphisms and other STS. These will anchor the DNA sequence to the physical map.

 

In addition, sequence from a random genomic sequencing pilot project on the related ciliate Paramecium [30] is likely to contribute to the quality of Tetrahymena gene prediction and annotation.

 

2) Existing repositories:

·         Genetic strains of T. thermophila are maintained frozen in several research labs, chiefly the Orias and Bruns labs.

·         Type strains of Tetrahymena thermophila and other Tetrahymena species of known phylogeny are deposited in the American Type Culture Collection (ATCC).

·         Data on all Tetrahymena EST sequences obtained to date (including sequence and results of blastx searches) are posted at http://www.cbr.nrc.ca/reith/tetra/tetra.html

 

m. What is the size of the research community for the organism?

 

n & o. Who will benefit from the improved genomic resources? The immediate community? The broader biomedical research community? What will be the benefits?

The most immediate beneficiaries of the Tetrahymena genome sequence and annotation will be the many on-going Tetrahymena research programs funded by U.S. (mainly), Canadian, European and Asian granting agencies. The genome sequence:

 

The sequence will also illuminate the biology of, and facilitate research on, other ciliates and alveolates (including apicomplexan human and animal parasites) by providing the sequence of protein homologs that are evolutionarily much closer than those in animals, fungi or green plants. This will facilitate the cloning of genes of interest using labeled probes or degenerate PCR primers.

 

Finally, through collaborative research, the sequence will put the very complete animal proteome and the advanced experimental tools of Tetrahymena, documented in this concept paper, in the service of fundamental, biomedical and applied research by the general scientific community. A list of postgenomic resources that will facilitate this research is included in Appendix section 3, although we are not requesting funding for these resources now. In the long term, the benefits to the general research community may well become one of the most important contributions derived from the Tetrahymena genome sequence.

 

p. Are there any material transfer agreements that would affect the availability of data or resources produced by this project?
No material transfer agreements are anticipated (or desired) that would affect the availability of data or resources produced by this project. The centers that have so far shown the greatest interest and intellectual engagement in this project are non-profit organizations. Regarding existing genomic resources, all sequences are posted in GenBank and/or in freely accessible public databases. All mutants and other useful Tetrahymena laboratory strains are made freely available without restriction. The highly inducible MTT promoter is subject to a very friendly MTA designed to promote its academic use.


Table 1. Tetrahymena Genomic Resources and Database Needs

 

Funding sources are indicated in parentheses.

Resource: DNA sequence maps

Already available:
- 1-2 kb surrounding an estimated 15% of the chromosome breakage sites already sequenced (NCRR)
- ~75 kb in 53 contigs from randomly cloned inserts from the 75-kb MAC chromosome fraction

Expected when:
- In 1 year: the rest of chromosome breakage sites (~300) (already funded by NCRR)
* In 2-3 years: 180 Mb of finished MAC DNA sequence (requested funds for this project)

Wanted in the Database:
* Location of genes (characterized or predicted), ESTs, physically and genetically mapped STS, other landmarks


Resource: Proteins and ESTs

Already available:
- About 500 non-redundant ESTs from exponentially growing cells (University of Chicago, Tetrahymena community, and a private contribution)
- More than 150 genes in GenBank from T. thermophila (mainly) and T. pyriformis; some are genomic, others are mRNA sequences (funded over the years mainly by NIGMS and NSF, American Cancer Society)

Expected when:
- In one year: 70,000 ESTs from several libraries (Genome Canada mainly, University of Chicago, and Tetrahymena community)
In 2-3 years: Roughly 30,000 predicted genes. Based on pilot studies, about 1/2 are expected to show matches to genes in public databases, of which a very large majority is expected to match human proteins (requested funds for this project)

Wanted in the Database:
* Gene names
* Sequence map coordinates
- Function: experimentally determined or *predicted
* Predicted protein domains/motifs
* Links to similar genes/proteins, especially in humans, other model organisms and other Alveolate species
- Regulation pattern
- Knockout phenotype
- Posttranslational modification
- Genetic, physical & functional interactions


Resource: Physical maps

Already available:
- Physical size of MAC chromosomes flanking ~15% of chromosome breakage sites (NCRR, NSF)
- Physical size of different MAC chromosomes carrying ~65 RAPD DNA markers (NCRR)

Expected when:
- In 1 year: Nearly all of the ~300 Cbs junctions characterized (already funded by NCRR)
- In 3 years: 2,000-4,000 physically mapped sequence-tagged sites (mainly ESTs) (already funded by NCRR)

Wanted in the Database: MIC and MAC molecular distance maps, anchored to the sequence map


Resource: Genetic maps

Already available:
- Germline linkage maps and macronuclear coassortment maps linking an estimated 2/3 of the genome (NCRR)

Expected when: (no entry)

Wanted in the Database:
- Conventional germline genetic maps and MAC coassortment groups, both anchored to physical and sequence maps
- Germline deletion maps (nullisomics, unisomics, partial chromosome deletions)

 


Table 1 (continued)

 

Resource: Genetic markers & large deletions

Already available:
- More than 400 genetically mapped markers (conventional mutant genes, RAPDs, RFLPs, Cbs-associated polymorphisms) (mainly NCRR, also NSF)
- More than 50 mutant genes and other genetic features assigned only to chromosome arm (NIGMS & NSF, over the years)
- More than 100 mapped partial deletions of germline chromosomes (NSF)

Expected when: (no entry)

Wanted in the Database: How to test for, diagnostic phenotypes and map coordinates


Resource: References

Already available: Thousands of references on the biology of cloned genes and mutants

Expected when: An increase in Tetrahymena's share of the current literature, currently nearly 300 papers per year

Wanted in the Database: Linked to database entries

 

Database elements preceded by an asterisk and their annotation would be generated as part of the whole-genome sequencing project. Annotations for the remaining data sets would come primarily from Tetrahymena research experts.