Human Genome Annotation
Mark Gerstein, Yale University
ABSTRACT:
A central problem for 21st-century science is annotating the human genome and making this annotation
useful for the interpretation of personal genomes. My talk will focus on annotating the 99% of the
genome that does not code for canonical genes, concentrating on intergenic features such as
structural variants (SVs), pseudogenes (protein fossils), binding sites, and novel transcribed
non-coding RNAs (ncRNAs). In particular, I will describe how we identify regulatory sites and
variable blocks (SVs) by processing data from next-generation sequencing experiments. I will further
explain how we cluster
together groups of sites to create larger annotations. Next, I will discuss a comprehensive
pseudogene identification pipeline, which has enabled us to identify >10K pseudogenes in the genome
and analyze their distribution with respect to age, protein family, and chromosomal location.
Throughout, I will try to introduce some of the computational algorithms and approaches that are
required for genome annotation. Much of this work has been carried out in the framework of the
ENCODE, modENCODE, and 1000 Genomes projects.
URLS:
http://pseudogene.org
http://GenomeTECH.Gersteinlab.org
RELEVANT PAPERS:
Comparative analysis of processed ribosomal protein pseudogenes in four
mammalian genomes.
S Balasubramanian, D Zheng, YJ Liu, G Fang, A Frankish, N Carriero, R
Robilotto, P Cayting, M Gerstein (2009) Genome Biol 10: R2.
PeakSeq enables systematic scoring of ChIP-seq experiments relative to
controls.
J Rozowsky, G Euskirchen, RK Auerbach, ZD Zhang, T Gibson, R
Bjornson, N Carriero, M Snyder, MB Gerstein (2009) Nat Biotechnol 27: 66-75.
MSB: A mean-shift-based approach for the analysis of structural
variation in the genome.
LY Wang, A Abyzov, JO Korbel, M Snyder, M Gerstein (2009) Genome
Res 19: 106-17.
Pseudofam: the pseudogene families database.
HY Lam, E Khurana, G Fang, P Cayting, N Carriero, KH Cheung, MB
Gerstein (2009) Nucleic Acids Res 37: D738-43.
Analysis of copy number variants and segmental duplications in the human
genome: Evidence for a change in the process of formation in recent
evolutionary history.
PM Kim, HY Lam, AE Urban, JO Korbel, J Affourtit, F Grubert, X
Chen, S Weissman, M Snyder, MB Gerstein (2008) Genome Res 18: 1865-74.
Integrating sequencing technologies in personal genomics: optimal low cost reconstruction of
structural variants.
J Du, RD Bjornson, ZD Zhang, Y Kong, M Snyder, MB Gerstein (2009) PLoS Comput Biol 5: e1000432.
Personal phenotypes to go with personal genomes.
M Snyder, S Weissman, M Gerstein (2009) Mol Syst Biol 5: 273.
PEMer: a computational framework with simulation-based error models for inferring genomic structural
variants from massive paired-end sequencing data.
JO Korbel, A Abyzov, XJ Mu, N Carriero, P Cayting, Z Zhang, M Snyder, MB Gerstein (2009) Genome Biol
10: R23.
Pseudogenes in the ENCODE regions: consensus annotation, analysis of
transcription, and evolution.
D Zheng, A Frankish, R Baertsch, P Kapranov, A Reymond, SW Choo, Y Lu, F
Denoeud, SE Antonarakis, M Snyder, Y Ruan, CL Wei, TR Gingeras, R Guigo,
J Harrow, MB Gerstein (2007) Genome Res 17: 839-51.
Statistical analysis of the genomic distribution and correlation of
regulatory elements in the ENCODE regions.
ZD Zhang, A Paccanaro, Y Fu, S Weissman, Z Weng, J Chang, M Snyder, MB
Gerstein (2007) Genome Res 17: 787-97.