Lecture 1: The Human Genome Project

(version 31 July 2002)

Automated sequencing machines at the Center for Genome Research in the Whitehead Institute

This material is copyrighted and MAY NOT be used for commercial purposes


The Human Genome Project

Two competing groups

Timeline to reach the sequence.

Some Details from the Sequence


Protein Coding Sequences

  • There appear to be about 30,000 to 40,000 protein-coding genes in the human genome. Very surprisingly, this is only about twice as many as in the worm or fly. However, human genes are more complex, with more alternative splicing generating a larger number of protein products.

  • Only a very small fraction of the genome (3%) codes for proteins.

  • The typical length of a coding sequence is similar across the three species (1,311 bp for worm, 1,497 bp for fly and 1,340 bp for human). However, the worm and fly exon distributions have a fatter tail, resulting in a larger mean size for internal exons (218 bp for worm versus 145 bp for human).

  • Intron size distributions differ substantially among species. The worm and fly each have a reasonably tight distribution, with most introns near the preferred minimum intron length (47 bp for worm, 59 bp for fly) and an extended tail (overall average length of 267 bp for worm and 487 bp for fly). Intron size is much more variable in humans, with a peak at 87 bp but a very long tail resulting in a mean of more than 3,300 bp.


  • The full set of proteins (the proteome) encoded by the human genome is more complex than those of invertebrates. Vertebrates appear to have arranged pre-existing components into a richer collection of domain architectures.



  • What fraction of genes are shared with other animals, eukaryotes and bacteria?
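
The gap between the peak and the mean of a long-tailed size distribution, as described for intron sizes above, can be illustrated with a minimal Python sketch. The length lists are invented for illustration (they are not genome data) and were chosen only so that the mode and mean echo the worm and human figures quoted above. The function name `summarize` is likewise hypothetical.

```python
from statistics import mean, median, mode

def summarize(lengths):
    """Return the mode (peak), median and mean of a list of lengths in bp."""
    return {
        "mode": mode(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
    }

# Invented intron lengths in bp (NOT real genome data).
# Most values sit near the peak; a few long introns form the tail.
tight_tail = [47, 47, 47, 50, 52, 55, 60, 300, 900, 1112]
long_tail = [87, 87, 87, 90, 95, 110, 150, 2000, 10000, 20294]

for label, lengths in (("tight", tight_tail), ("long-tailed", long_tail)):
    print(label, summarize(lengths))
```

In both lists the peak and the median stay close together, while a handful of very long values is enough to pull the mean far above the peak. This is why a peak of 87 bp and a mean above 3,300 bp can describe the same distribution.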


Gene Regulation

  • There is a significant increase in the number of transcription factors in humans.


Noncoding RNAs

  • Currently known distribution of noncoding RNAs (very likely a significant underestimate)


Repeated Sequence Composition


  • About half of the human genome derives from transposable elements.

  • There has been a marked decline in the overall activity of such elements in the hominid lineage. DNA transposons appear to have become completely inactive and long-terminal repeat (LTR) retroposons may also have done so.

  • Twenty human genes have been recognized as probably derived from transposons.


  • Simple sequence repeats (SSRs) account for 3% of the human genome.
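
To make the notion of a simple sequence repeat concrete, here is a minimal Python sketch that scans a DNA string for a short unit repeated in tandem. The function name `find_ssrs`, the thresholds `max_unit=6` and `min_copies=4`, and the example sequence are all assumptions made for illustration. Real SSR annotation uses more careful definitions and tolerates imperfect repeats.

```python
import re

def find_ssrs(seq, max_unit=6, min_copies=4):
    """Return (start, unit, copies) for perfect tandem repeats of short units.

    max_unit and min_copies are illustrative thresholds, not a standard.
    """
    # Group 1 is the repeat unit (shortest unit tried first); the
    # backreference \1 then requires at least min_copies - 1 further copies.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        copies = len(m.group(0)) // len(unit)
        hits.append((m.start(), unit, copies))
    return hits

# Made-up example sequence: five CA copies and four AGAT copies
# embedded in short flanking sequence.
example = "GGT" + "CA" * 5 + "CGG" + "AGAT" * 4 + "CC"
print(find_ssrs(example))
```

Summing `len(unit) * copies` over the hits and dividing by the sequence length would give the fraction of that sequence covered by such repeats, under these assumed thresholds.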


Gene Order

  • Large blocks of DNA have remained intact between the mouse and human genomes.