Human Genome Annotation
Mark Gerstein, Yale University
A central problem for 21st century science is annotating the human
genome and making this annotation useful for the interpretation of
personal genomes. My talk will focus on annotating the 99% of the
genome that does not code for canonical genes, concentrating on
intergenic features such as structural variants (SVs), pseudogenes
(protein fossils), binding sites, and novel transcribed RNAs (ncRNAs).
In particular, I will describe how we identify regulatory sites and
variable blocks (SVs) based on processing next-generation sequencing
experiments. I will further explain how we cluster groups of
sites to create larger annotations. Next, I will discuss a
comprehensive pseudogene identification pipeline, which has enabled us
to identify more than 10,000 pseudogenes in the genome and analyze their
distribution with respect to age, protein family, and chromosomal
location. Throughout, I will try to introduce some of the
computational algorithms and approaches that are required for genome
annotation. Much of this work has been carried out in the framework of
the ENCODE, modENCODE, and 1000 Genomes projects.
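
As a rough illustration of what clustering nearby sites into larger
annotations can look like, the sketch below merges sites on the same
chromosome whenever the gap between them is at most a chosen distance.
It is a generic gap-based merge, not the specific method used in this
work; the function name cluster_sites, the max_gap parameter, and the
example coordinates are all hypothetical.

```python
from typing import List, Tuple

# A site is (chromosome, start, end) with start < end.
Site = Tuple[str, int, int]
# A cluster is (chromosome, start, end, number_of_sites_merged).
Cluster = Tuple[str, int, int, int]


def cluster_sites(sites: List[Site], max_gap: int = 1000) -> List[Cluster]:
    """Merge sites on the same chromosome separated by at most max_gap bases.

    max_gap is an illustrative assumption, not a value from the talk.
    """
    clusters: List[Cluster] = []
    # Sorting by (chromosome, start, end) lets us merge in a single pass.
    for chrom, start, end in sorted(sites):
        if clusters:
            prev_chrom, prev_start, prev_end, count = clusters[-1]
            if prev_chrom == chrom and start - prev_end <= max_gap:
                # Extend the current cluster to cover this site.
                clusters[-1] = (chrom, prev_start, max(prev_end, end), count + 1)
                continue
        # Otherwise start a new cluster containing just this site.
        clusters.append((chrom, start, end, 1))
    return clusters


# Made-up coordinates, for illustration only.
example_sites = [
    ("chr1", 100, 200),
    ("chr1", 500, 600),
    ("chr1", 5000, 5100),
    ("chr2", 150, 250),
]
example_clusters = cluster_sites(example_sites, max_gap=1000)
```

With these made-up inputs, the first two chr1 sites fall within the gap
threshold and form one larger block, while the remaining sites stay
separate.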
Comparative analysis of processed ribosomal protein pseudogenes in four
S Balasubramanian, D Zheng, YJ Liu, G Fang, A Frankish, N Carriero, R
Robilotto, P Cayting, M Gerstein (2009) Genome Biol 10: R2.
PeakSeq enables systematic scoring of ChIP-seq experiments relative to
J Rozowsky, G Euskirchen, RK Auerbach, ZD Zhang, T Gibson, R
Bjornson, N Carriero, M Snyder, MB Gerstein (2009) Nat Biotechnol 27: 66-75.
MSB: A mean-shift-based approach for the analysis of structural
variation in the genome.
LY Wang, A Abyzov, JO Korbel, M Snyder, M Gerstein (2009) Genome
Res 19: 106-17.
Pseudofam: the pseudogene families database.
HY Lam, E Khurana, G Fang, P Cayting, N Carriero, KH Cheung, MB
Gerstein (2009) Nucleic Acids Res 37: D738-43.
Analysis of copy number variants and segmental duplications in the human
genome: Evidence for a change in the process of formation in recent
PM Kim, HY Lam, AE Urban, JO Korbel, J Affourtit, F Grubert, X
Chen, S Weissman, M Snyder, MB Gerstein (2008) Genome Res 18: 1865-74.
Integrating sequencing technologies in personal genomics: optimal low
cost reconstruction of structural variants.
J Du, RD Bjornson, ZD Zhang, Y Kong, M Snyder, MB Gerstein (2009) PLoS
Comput Biol 5: e1000432.
Personal phenotypes to go with personal genomes.
M Snyder, S Weissman, M Gerstein (2009) Mol Syst Biol 5: 273.
PEMer: a computational framework with simulation-based error models
for inferring genomic structural variants from massive paired-end
JO Korbel, A Abyzov, XJ Mu, N Carriero, P Cayting, Z Zhang, M Snyder,
MB Gerstein (2009) Genome Biol 10: R23.
Pseudogenes in the ENCODE regions: consensus annotation, analysis of
transcription, and evolution.
D Zheng, A Frankish, R Baertsch, P Kapranov, A Reymond, SW Choo, Y Lu, F
Denoeud, SE Antonarakis, M Snyder, Y Ruan, CL Wei, TR Gingeras, R Guigo,
J Harrow, MB Gerstein (2007) Genome Res 17: 839-51.
Statistical analysis of the genomic distribution and correlation of
regulatory elements in the ENCODE regions.
ZD Zhang, A Paccanaro, Y Fu, S Weissman, Z Weng, J Chang, M Snyder, MB
Gerstein (2007) Genome Res 17: 787-97.
Nucleotide-resolution analysis of structural variants using BreakSeq
and a breakpoint library.
HY Lam, XJ Mu, AM Stütz, A Tanzer, PD Cayting, M Snyder, PM Kim, JO
Korbel, MB Gerstein (2010) Nat Biotechnol 28: 47-55.