Monday, November 9, 2009

abstract for talk at LMB 1-Dec-2009 [I:LMB]


Human Genome Annotation

Mark Gerstein, Yale University


A central problem for 21st-century science is annotating the human genome and making this annotation
useful for the interpretation of personal genomes. My talk will focus on annotating the 99% of the
genome that does not code for canonical genes, concentrating on intergenic features such as
structural variants (SVs), pseudogenes (protein fossils), binding sites, and novel transcribed RNAs
(ncRNAs). In particular, I will describe how we identify regulatory sites and variable blocks (SVs)
based on processing next-generation sequencing experiments. I will further explain how we cluster
together groups of sites to create larger annotations. Next, I will discuss a comprehensive
pseudogene identification pipeline, which has enabled us to identify >10K pseudogenes in the genome
and analyze their distribution with respect to age, protein family, and chromosomal location.
Throughout, I will try to introduce some of the computational algorithms and approaches that are
required for genome annotation. Much of this work has been carried out in the framework of the
ENCODE, modENCODE, and 1000 Genomes projects.
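
As a rough illustration of the idea of clustering groups of sites into larger annotations, the sketch below merges sites on the same chromosome whenever the gap between them falls under a threshold. It is not the method used in the work described in this abstract, only a minimal single-linkage example. The function name `cluster_sites`, the `max_gap` parameter, its default of 1000 bp, and the sample coordinates are all assumptions made for illustration.

```python
from collections import defaultdict


def cluster_sites(sites, max_gap=1000):
    """Merge (chrom, start, end) sites into larger blocks.

    Hypothetical helper: two sites on the same chromosome join the same
    block when the gap between them is at most `max_gap` bases.
    Returns a list of (chrom, block_start, block_end, n_sites).
    """
    # Group sites by chromosome so each one is processed independently.
    by_chrom = defaultdict(list)
    for chrom, start, end in sites:
        by_chrom[chrom].append((start, end))

    clusters = []
    for chrom in sorted(by_chrom):
        block_start = block_end = None
        count = 0
        # Sorting by start lets a single left-to-right sweep build blocks.
        for start, end in sorted(by_chrom[chrom]):
            if block_start is None:
                block_start, block_end, count = start, end, 1
            elif start - block_end <= max_gap:
                # Close enough to the current block: extend it.
                block_end = max(block_end, end)
                count += 1
            else:
                # Gap too large: close the current block and start a new one.
                clusters.append((chrom, block_start, block_end, count))
                block_start, block_end, count = start, end, 1
        if block_start is not None:
            clusters.append((chrom, block_start, block_end, count))
    return clusters


# Made-up coordinates, for illustration only.
example_sites = [
    ("chr1", 100, 200),
    ("chr1", 500, 600),
    ("chr1", 5000, 5100),
    ("chr2", 10, 20),
]
example_clusters = cluster_sites(example_sites, max_gap=1000)
```

The choice of `max_gap` controls how aggressively sites are merged; in practice such a threshold would have to be tuned to the kind of site being annotated.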



