Spring 2006 - Stat 992: Statistical Methods for Biological Sequence Analysis
Time: 1-2:15 MW
Location: 5295 Medical Sciences Center

Instructor: Sunduz Keles
                 keles at stat wisc edu
Office hours:  W 5-6pm and by appointment.

Objectives: This course will cover sequence analysis topics from the field of computational biology.  One of the aims of the course is to give a concise review of  relevant background biology  and an introduction to  the statistical problems arising recently.  A major portion of the  course will  be  dedicated to rigorous overview of the statistical  methods utilized in this field. Particular inference topics will include cross-validation both with observed and censored data, multiple hypothesis testing,  mixture models, HMMs and tree based regression techniques. The plan is to make the topics self contained so that 1st year level background on statistical estimation and inference is sufficient.  This entitles to Stat 609-610 for Statistics students and Stat 571-572 for Biological Sciences students.

Suggested textbooks: I will provide lecture notes and references for most of the topics to be covered. 
Biological Sequence Analysis : Probabilistic Models of Proteins and Nucleic Acids by Richard Durbin,  Sean R. Eddy, Anders Krogh, Graeme Mitchison.  (Main text)
The elements of statistical learning by T. Hastie, R. Tibshirani, J. Friedman. (Supplemental text)

Course format: We  will  alternate  between lectures by me and paper presentations in the form of discussions  by you. In the first lecture, we will form student teams and each student team will be presenting one to  two papers throughout the semester. Each student  (enrolled or auditing) is required to participate in this.   For each paper to be presented, there will be a primary team presenting and leading the discussion and 2 or 3 supporting  teams who will be mainly responsible for  helping out with the discussion.  The supporting teams are required to email their questions to the  presenting team and me one day  before the presentation.  We will try to mainly schedule the presentations on wednesday so that by Monday night, you can send out your questions.

Grading: Grading will be based on 3 homework assignments (30 %), class participation in paper discussions (10%), paper presentation(s) (30%) and a final project written report and presentation (30%). If you are taking the course for one credit you are only required to do the paper presentation(s), return in 1 homework assignment and participate in class discussions.


Tentative Syllabus



Assigned Reading

Additional Reading


Lecture Notes


Jan 18

Course logistics
Introduction to the field of computationalbiology
Introduction to Genome Biology

Introduction to molecular biology

[1] [4]

Jan 23

Introduction to sequence similarity, pairwise alignment

Durbin et al. Chapter 2

Gibbs, A.J. and  McIntyre, G.A. (1970) The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. European Journal of  Biochemistry. Sep;16(1):1-11. [DOT plots]


Jan 25,30
Feb 1, 6, 8

Sequence alignment: dynamic programming, local versus global alignment

Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of  Molecular  Biology 48, 443-453. [Global alignment]

Temple F. Smith and Michael S. Waterman (1981). Identification of Common Molecular Subsequences.  Journal of Molecular Biology, 147:195-197. [Local alignment]

Substitution matrices:
Dayhoff, M. O., Schwartz, R. M.  and Orcutt, B. C. (1978).   A model of evolutionary change in proteins. matrices for detecting distant relationships. In Atlas of protein sequence and structure, (Dayhoff, M. O., ed.), vol. 5, pp. 345-358. National biomedical research foundation Washington DC. [PAM derivation]

Henikoff, S. & Henikoff, J. G. (1992).  Amino acid substitution matrices from protein blocks. PNAS, 89, 10915-10919. [BLOSUM derivation]   

Henikoff, S. and Henikoff, J. G. (1993). Performance evaluation of amino acid substitution matrices. Proteins: Struct., Funct., Genet. 17, 49-61.

Altschul, S.F. (1991). Aminoacid substitution matrices from an information theoretic perspective. Journal of Molecular Biology 219:555-565. [Comparing substitutions matrices]

[1] (Jan 30)

Notes on the Jukes-Cantor model (Feb 1)

[1] [4] (Feb 1, 6)
[1] [4] (Feb 8)

Feb 13, 15

Heuristic alignment algorithms, significance of scores, extreme value distribution

Karlin, S and Altschul, SF (1990) Methods for Assessing the Statistical Significance of Molecular Sequence Features by Using General Scoring Schemes. PNAS, Vol 87, 2264-2268. [Derivation of the distribution of S under random matching model]

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) Basic local alignment search tool. Journal of Molecular Biology, 215:403-410. [BLASTP]

Pearson, W. R. and Lipman, D. J. (1988) Improved tools for biological sequence comparison Proc. Natl. Acad. Sci. US 85:2444-2448.  [FASTA]

Mott , R. (1992) Maximum Likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores Bull. Math. Biol 54: 59-75.[MLE for fitting extreme value distribution to gapped local alignment scores]

Pearson, W. R. (1998) Empirical statistical estimates for sequence similarity searches. J. Mol. Biol. 276:71-84. [correcting length bias in gapped local alignment scores]

Altschul, S.; Madden, T.; Schaffer, A.; Zhang, J.; Zhang, Z.; Miller, W.; and Lipman, D. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 25, 3389-3402.  [PSI-BLAST]

Zhang Z, Schaffer AA, Miller W, Madden TL, Lipman DJ, Koonin EV, Altschul SF. (1998). Protein sequence similarity searches using patterns as seeds. (PHI-BLAST) Nucleic Acids Res. 26, 3986-90. [PHI-BLAST]

Waterman M. (1994) Estimating statistical significance of sequence alignments.
Philos Trans R Soc Lond B Biol Sci.344(1310):383-90.
[Application of the Azuma Hoeffding Lemma, includes a likelihood formulation]

S. Karlin, A. Dembo, and T. Kawabata. Statistical composition of high-scoring segments from molecular sequences. The Annals of Statistics, 2:571--581, 1990. [Derivation of the distribution of S under random matching model]

BLAST server  at NCBI

[1] [4] (Feb 13)
[1] [4] (Feb 15)

Feb 20

A brief introduction to Perl

Perl RegExp Quick Reference Card

Perl tutorial  WikiBooks

[1] [4] (Feb 20)

  Hmw #1
Due March 8
gap open = 11
gap extend = 1

Feb 22, 27

Hidden Markov Models: Estimation and Inference

Durbin et al. Chapter 3

A tutorial on Hidden Markov Models by L.R. Rabiner

Mar 1, 6, 8

Hidden Markov Models: Estimation and Inference

Durbin et al. Chapter 3

Dempster, A.P.,  Laird, N.M., Rubin, D. B. (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. JRSSB, 39 (1): 1-38. [EM algorithm]

Mar 20

Hidden Markov Models for Sequence Alignment

Generalization of HMMs

A. Krogh, M. Brown, I.S. Mian, K. Sjolander, D. Haussler. (1994) Hidden Markov models in computational biology: applications to protein modeling. JMB 235:1501-1531.

Presenters: Hui Wang, Deyuan Jiang.

Slides (Mar 20)

Mar 22

Hidden Markov Models for Gene Prediction

Generalizations of HMMs

C. Burge, S. Karlin. (1997) Prediction of complete gene structures in human genomic DNA. JMB 268:78-94.

Presenters: Hyonho Chun and Qin Ren .

Mar 27

Hidden Markov Models for gene prediction

Alexandersson, M., Cawley, S. and Pachter, L. (2003)
SLAM: cross-species gene finding and alignment with a generalized pair hidden Markov model.
Genome Research. 13(3):496-502.

Rogic, S.,   Mackworth, A.K.,  and  Ouellette, F.B.F.  (2001)
Evaluation of Gene-Finding Programs on Mammalian Sequences. Genome Research Vol. 11, Issue 5, 817-832.

S.R. Eddy. (1998) Profile hidden Markov models. Bioinformatics. 14, 755-763.

Slides (Mar 27)

Hmw #1 solution

Hmw #2
Due April 10

PWMmix_seq.txt (paremeters to consider W = 6, 7, 8, K = 1, 2, 3)

Mar 29,
Apr 3

Motif finding  based on single species sequence data

Lawrence, C. E. and Reilly, A. A. (1990) An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 7:41-51.

Bailey, T.L.  and Elkan C. (1995) Unsupervised Learning of Multiple Motifs in Biopolymers using EM.  Machine Learning, 21(1-2):51-80.

Lawrence, C.E., Altschul, S.F., Bogouski, M.S., Liu, J.S., Neuwald, A.F., and Wooten, J.C. (1993) Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment. Science, 262, 208-214.

Presenter: Xiwen Ma

Liu, J.S.,  Neuwald, A.F.  and  Lawrence, C. E. (1995) Bayesian Models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies. JASA. 90: 1156-1170.

Liu, J.S. and Lawrence, C.E. (1999) Bayesian Inference on Biopolymer Models. Bioinformatics,15(1):  38-52.

Boys, R.J.  and Henderson, D.A. (2005) A Bayesian Approach to DNA Sequence Segmentation. Biometrics. 61(2): 573:588.


Gibbs Motif Sample

Slides (Apr 3)

Mar 29,
Apr 3

Motif finding  based on single species sequence data

S. Keles, M.J. van der Laan, S. Dudoit, B. Xing,  and M.B. Eisen (2003). Supervised detection of regulatory motifs in DNA sequences. Statistical Applications in Genetics and  Molecular Biology, Vol. 2,  No. 1, Article 5.

Slides (Mar 29)

Apr 5

From motifs to modules

Zhou, Q.  and Wong, W. H. (2004) CisModule: De Novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proc. Natl. Acad. Sci. 101(33): 12114-12119

Presenters: Jee Young Moon and Anglinia Washington

Slides (Apr 5)

Apr 10

Incorporating phylogenetic information into motif finding

Li, X. and Wong, W.H. (2005) Sampling motifs on phylogenetic trees. PNAS. 102 (27): 9481-9486.

Moses, A.M., Chiang, D.Y., Eisen, M.B. (2004). Phylogenetic Motif Detection by Expectation Maximization of Evolutionary Mixtures. Pacific Symposium on Biocomputing

Li, X., Zhong, S. and  Wong, W.H.  (2005) Reliable prediction of transcription factor binding sites by phylogenetic verification PNAS. 102 (47): 16945-16950.

Presenters: Tim Grant & Oscar Perez

Slides (Apr 10)

Apr 12

Incorporating phylogenetic information into motif finding

Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W., Richards, S., Weinstock, G.M., Wilson, R.K., Gibbs, R.A., Kent, W.J., Miller, W. and Haussler, D. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15:1034-1050.

Siepel, A. and Haussler, D. (2004) Combining phylogenetic and hidden Markov models in biosequence analysis. Journal of Comput Biolology 11:413-428.

Ludwig, M.Z., Palsson, A., Alekseeva, E., Bergman, C.M., Nathan, J., Kreitman, M. (2005) Functional evolution of a cis-regulatory module. PLoS Biology. 3(4):e93.

Apr 17

Dictionary models for motif finding

Sabatti, C., Rohlin, R., Lange, K., Liao, J.C. (2005) Vocabulon: a dictionary model approach for reconstruction and localization of transcription factor binding sites. Bioinformatics 21(7): 922-931.

Presenters: Yuan Jiang & Tao Yu

Slides (Apr 17)

Apr 19 Beyond independent site  models for motif finding  Qing Zhou, Jun S. Liu: Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics 20(6): 909-916 (2004)

Barash, Y,  Elidan, G, Friedman, N, and Kaplan T (2003).  Modeling Dependencies in Protein-DNA Binding Sites.  RECOMB 2003.

OD King and FP Roth. A non-parametric model for transcription factor binding sites. Nucleic Acids Research 31(19):e116 (2003).

Irad E. Ben-Gal, Ayala Shani, Andre Gohr, Jan Grau, S. Arviv, Armin Shmilovici, Stefan Posch, Ivo Grosse: Identification of transcription factor binding sites with variable-order Bayesian networks. Bioinformatics 21(11): 2657-2666 (2005)

Slides (Apr 17)

Apr 24

Introduction to tiling arrays - Analysis of ChIP-chip experiments

Buck, M. J. and Lieb, J.D. (2004) ChIP-chip: Considerations for the Design, Analysis, and Application of Genome-wide Chromatin Immunoprecipitation Experiments. Genomics. 83(3):349-360.

Cawley, S., Bekiranov, S., Ng, H.H., Kapranov, P., Sekinger, E.A., Kampa, D., Piccolboni, A., Sementchenko, V.I., Cheng, J., Williams, A.J., Wheeler, R., Wong, B., Drenkow, J., Yamanaka, M., Patel, S., Brubaker, S., Tammana, H., Helt, G., Struhl, K. and Gingeras, T.R. (2004) Unbiased Mapping of Transcription Factor Binding Sites Along Human Chromosomes 21 and 22 Points to Widespread Regulation of Non-Coding RNAs ,Cell 116(4):499-511.

S. Keles, M. J. van der Laan,  Sandrine Dudoit, and Simon E. Cawley (2004). Multiple Testing Methods for ChIP-Chip High Density Oligonucleotide Array Data. To appear in the Journal of Computational Biology.

Slides (Apr 24)

Apr 24, 26

Analysis of ChIP-chip experiments

Ji, H. and Wong, W.H. (2005) TileMap: create chromosomal map of tiling array hybridizations. Bioinformatics. 21(18): 3629-3636.

Keles, S. (2005) Mixture modeling for genome-wide localization of transcription factors.

May 1 May 3 Final project presentations