# Statistical modeling and machine learning

# Julien Gagneur

Lecture, Tuesdays, 14:00-17:00. Starting 12.04.16

Room 01.11.018

In-class exercises, Fridays, 12:30-15:30, starting 15.04.16

Rooms 02.07.014 and 01.09.034

at the TUM-Inf, Boltzmannstr. 3, 85748 Garching

This lecture for students in bioinformatics, physics, and computer science will lay the theoretical and practical foundations for statistical data analysis, statistical modeling and machine learning from a Bayesian probabilistic perspective. It will train you in "thinking probabilistically", which is extremely useful in many areas of quantitative sciences where noisy data need to be analyzed (and hence modeled!). These techniques are also used extensively for "Big data" applications and in engineering, e.g. data mining, pattern recognition, or speech recognition.

The class will be based on Christopher Bishop's book "Pattern Recognition and Machine Learning". The lecture will be held in inverted classroom style: each week, we will give a ~30 min overview of the next reading assignment, a section of the book, pointing out the essential messages and thus making the reading at home easier. Exercises to solve by the next lecture will be given, including mathematical derivations of some of the book's results. In the next lecture, the exercises will be discussed (~30 min) and questions and difficulties with the material will be addressed (~20 min). Then, practical exercises using the newly acquired material will be solved in teams, using the R statistics framework (100 min). Further exercises will be done in smaller groups during the Friday classes. In our experience, the inverted classroom style is better suited than conventional lecturing to quantitative topics that require students to think through or retrace mathematical derivations at their own speed.

The course ends with a final competition among students on a machine learning task. Last year, for example, the task was to predict the outcome of the next Bundesliga games.

Credit: 8 ECTS. Official agreement pending for the bioinformatics master.

## Required background

**Math**

Basic linear algebra (matrix and vector algebra, inverse and transposed matrices, determinants, eigenvalue decomposition) and one-dimensional calculus (e.g. chain and product rules, Taylor expansion). You can find a concise summary of the required math background in appendices A.1-A.2 of the freely available book by Barber: http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=Brml.Online. (Appendices A.3 to A.6 may also become helpful later.)

Please make sure you're able to solve exercises of sections A, FC, SS1-6, V, D1-3 in PDF.

We will have a small test at the beginning of the lecture to gauge your maths skills.

**R programming**

Get familiar with R basics. We recommend the (very) short introduction to R by Paul Torfs & Claudia Brauer.

## Getting the book

"Pattern Recognition and Machine Learning" by Christopher Bishop can be borrowed from the Munich university libraries (TUM or LMU).

## Computer

If you can, bring a laptop with RStudio installed. RStudio is a free programming interface for the R language.

## Topics

0. Univariate and simple multivariate calculus and summary of linear algebra with intuitive explanations

1. Concepts in machine learning: supervised vs. unsupervised learning, classification vs. regression, overfitting, curse of dimensionality

2. Probability theory, Bayes' theorem, conditional independence, distributions (multinomial, Poisson, Gaussian, gamma, beta, ...), central limit theorem, entropy, mutual information

3. Generative models for discrete data: likelihood, prior, posterior, Dirichlet-multinomial model, naive Bayes classifiers

4. Gaussian models: maximum likelihood estimation, linear discriminant analysis, linear Gaussian systems

5. Bayesian statistics: maximum a posteriori (MAP) estimation, model selection, uninformative and robust priors, hierarchical and empirical Bayes, Bayesian decision theory

6. Frequentist statistics: bootstrap, statistical testing

7. Linear regression: ordinary least squares, robust linear regression, ridge regression, Bayesian linear regression

8. Logistic regression and optimization: (Bayesian) logistic regression, optimization, L2-regularization, Laplace approximation, Bayesian information criterion

9. Generalized linear models: the exponential family, probit regression

10. Expectation-maximization (EM) algorithm with applications

11. Latent linear models: principal component analysis (PCA), Bayesian PCA

## Evaluation

The final exam is a two-hour written exam that includes a bit of R programming. The final mark will be the exam mark plus bonus points from the modeling competition.