A central problem for 21st-century science will be the annotation and
understanding of the human genome. My talk will be concerned with
topics within this area, in particular annotating pseudogenes (protein
fossils), binding sites, CNVs, and novel transcribed regions in the
genome. Much of this work has been carried out in the framework of the
ENCODE and modENCODE projects.
In particular, I will discuss how we identify regulatory regions and
novel, non-genic transcribed regions in the genome based on processing
of tiling array and next-generation sequencing experiments. I will
further discuss how we cluster together groups of binding sites and
novel transcribed regions.
Next, I will discuss a comprehensive pseudogene identification
pipeline and storage database we have built. This has enabled us to
identify >10K pseudogenes in the human and mouse genomes and analyze
their distribution with respect to age, protein family, and
chromosomal location. I will try to inter-relate our studies on
pseudogenes with those on transcribed regions. At the end I will bring
these together, trying to assess the transcriptional activity of
pseudogenes.
Throughout I will try to introduce some of the computational
algorithms and approaches that are required for genome annotation --
e.g., the construction of annotation pipelines, developing algorithms
for optimal tiling, and refining approaches for scoring microarrays.
Toward a universal microarray: prediction of gene expression through
nearest-neighbor probe sequence identification.
TE Royce, JS Rozowsky, MB Gerstein (2007) Nucleic Acids Res 35: e99.
Pseudogenes in the ENCODE regions: consensus annotation, analysis of
transcription, and evolution.
D Zheng, A Frankish, R Baertsch, P Kapranov, A Reymond, SW Choo, Y Lu, F
Denoeud, SE Antonarakis, M Snyder, Y Ruan, CL Wei, TR Gingeras, R Guigo, J
Harrow, MB Gerstein (2007) Genome Res 17: 839-51.
Statistical analysis of the genomic distribution and correlation of
elements in the ENCODE regions.
ZD Zhang, A Paccanaro, Y Fu, S Weissman, Z Weng, J Chang, M Snyder, MB
Gerstein (2007) Genome Res 17: 787-97.
What is a gene, post-ENCODE? History and updated definition.
MB Gerstein, C Bruce, JS Rozowsky, D Zheng, J Du, JO Korbel, O
Emanuelsson, ZD Zhang, S Weissman, M Snyder (2007) Genome Res 17: 669-81.
PeakSeq enables systematic scoring of ChIP-seq experiments relative to
controls.
J Rozowsky, G Euskirchen, RK Auerbach, ZD Zhang, T Gibson, R
Bjornson, N Carriero, M Snyder, MB Gerstein (2009) Nat Biotechnol
27: 66-75.
MSB: A mean-shift-based approach for the analysis of structural
variation in the genome.
LY Wang, A Abyzov, JO Korbel, M Snyder, M Gerstein (2009) Genome
Res 19: 106-17.
Pseudofam: the pseudogene families database.
HY Lam, E Khurana, G Fang, P Cayting, N Carriero, KH Cheung, MB
Gerstein (2009) Nucleic Acids Res 37: D738-43.
Analysis of copy number variants and segmental duplications in the human
genome: Evidence for a change in the process of formation in recent
evolutionary history.
PM Kim, HY Lam, AE Urban, JO Korbel, J Affourtit, F Grubert, X
Chen, S Weissman, M Snyder, MB Gerstein (2008) Genome Res 18: 1865-74.