A central problem for 21st-century science will be the analysis and
understanding of the human genome. My talk will be concerned with
topics within this area, in particular annotating pseudogenes (protein
fossils) in the genome. I will discuss a comprehensive pseudogene
identification pipeline and storage database we have built. This has
enabled us to identify >10K pseudogenes in the human and mouse
genomes and analyze their distribution with respect to age, protein
family, and chromosomal location. One interesting finding is the large
number of ribosomal pseudogenes in the human genome, with 80
functional ribosomal proteins giving rise to ~2,000 ribosomal protein
pseudogenes.
I will try to inter-relate our studies on pseudogenes with those on
tiling arrays, which enable one to comprehensively probe the activity
of intergenic regions. Through this work we have been able to annotate
regulatory sites and regions of unannotated transcription in the
genome.
At the end I will bring these together, trying to assess the
biochemical activity of pseudogenes.
Throughout I will try to introduce some of the computational
algorithms and approaches that are required for genome annotation and
tiling arrays -- i.e., constructing annotation pipelines,
developing algorithms for optimal tiling, and refining approaches for
References (4 most relevant are starred with "*")
Millions of years of evolution preserved: a comprehensive catalog of
the processed pseudogenes in the human genome. Z Zhang, PM Harrison,
Y Liu, M Gerstein (2003) Genome Res 13: 2541-58.
Patterns of nucleotide substitution, insertion and deletion in the
human genome inferred from pseudogenes. Z Zhang, M Gerstein (2003)
Nucleic Acids Res 31: 5338-48.
The ambiguous boundary between genes and pseudogenes: the dead rise
up, or do they? D Zheng, MB Gerstein (2007) Trends Genet
Pseudogene.org: a comprehensive database and comparison platform for
pseudogene annotation. JE Karro, Y Yan, D Zheng, Z Zhang, N Carriero,
P Cayting, P Harrison, M Gerstein (2007) Nucleic Acids Res 35:
A computational approach for identifying pseudogenes in the ENCODE
regions. D Zheng, MB Gerstein (2006) Genome Biol 7 Suppl 1: S13.1-10.
* The real life of pseudogenes. M Gerstein, D Zheng (2006) Sci Am 295:
PseudoPipe: an automated pseudogene identification pipeline. Z Zhang,
N Carriero, D Zheng, J Karro, PM Harrison, M Gerstein (2006)
Bioinformatics 22: 1437-9.
Toward a universal microarray: prediction of gene expression through
nearest-neighbor probe sequence identification. TE Royce, JS
Rozowsky, MB Gerstein (2007) Nucleic Acids Res
* Pseudogenes in the ENCODE regions: consensus annotation, analysis of
transcription, and evolution. D Zheng, A Frankish, R Baertsch, P
Kapranov, A Reymond, SW Choo, Y Lu, F Denoeud, SE Antonarakis, M
Snyder, Y Ruan, CL Wei, TR Gingeras, R Guigo, J Harrow, MB Gerstein
(2007) Genome Res 17: 839-51.
* Statistical analysis of the genomic distribution and correlation of
regulatory elements in the ENCODE regions. ZD Zhang, A Paccanaro, Y
Fu, S Weissman, Z Weng, J Chang, M Snyder, MB Gerstein (2007) Genome
Res 17: 787-97.
The DART classification of unannotated transcription within the ENCODE
regions: associating transcription with known and novel loci. JS
Rozowsky, D Newburger, F Sayward, J Wu, G Jordan, JO Korbel, U
Nagalakshmi, J Yang, D Zheng, R Guigo, TR Gingeras, S Weissman, P
Miller, M Snyder, MB Gerstein (2007) Genome Res 17: 732-45.
* What is a gene, post-ENCODE? History and updated definition. MB
Gerstein, C Bruce, JS Rozowsky, D Zheng, J Du, JO Korbel, O
Emanuelsson, ZD Zhang, S Weissman, M Snyder (2007) Genome Res 17: