CMU Probabilistic Graphical Models 10-708 Spring 2019 materials.
This page is intended as a resource for those wishing to self-study the graduate-level course “Probabilistic Graphical Models”, taught by Eric Xing to MS/MSc and PhD students in machine learning at Carnegie Mellon University in the spring of 2019. It contains all publicly available course materials from the official course page, work I completed as part of my own self-study; as well as my own comments and materials for those who wish to similarly benefit.
At the time of publishing, some uploads are still pending, to be completed.
The original source from which all publicly available course materials have been extracted is the following course page:
I have archived these course materials in my GitHub repo below:
https://github.com/cyber-rhythms/cmu-10-708-probabilistic-graphical-models-spring-2019
The above repo can be cloned in its entirety if one wishes to download all the materials linked from this page at once.
For those wishing to use the materials from this course for self-study, the utility of this particular spring 2019 iteration is that there is a complete set of publicly available recorded video lectures, and also a complete set of homework assignments.
In order to facilitate the use of the course materials for those self-studying, I have based this page on the original course page as faithfully as possible. I have endeavoured to keep my own comments to a minimum, at least in the first part of the page.
In the second part of the page under “Self-study support”, I include additional comments specific to my own experiences with self-studying this course. I also include additional materials such as my scanned notes, lecture summaries, homework assignment write-ups and also supplementary reference texts that I personally found useful.
Disclosure: I am an affiliate with Bookshop.org. This page contains affiliate links, so if you purchase a reference textbook through one of these links, you will not pay anything in addition to the listed price, but I will receive a small commission. If you wish to find out more about Bookshop, please visit their webpage.
- Why should I study this course?
- Course description.
- Textbooks.
- Course grading.
- Course schedule.
- Homework Assignments.
- Projects.
- Subsequent iterations of this course.
- Self study support.
- Additional comments on course prerequisites.
- Additional comments on video lectures.
- Additional comments on course content.
- Additional comments on homework assignments.
- Additional comments on reference textbooks.
- Lecture summaries, session, notes, review notes.
- Homework assignments: my write-ups.
- Supplementary texts.
- Electronic copies of reference textbooks.
Why should I study this course?
As someone undertaking a programme of self-study in machine learning using CMU publicly available materials, this course was one of the more technically challenging, but also more rewarding courses.
One might think from the title of the course that it solely concerns methods related to an esoteric subfield of machine learning, i.e. probabilistic graphical models, like I did prior to studying it. However, the educational value of this course extends far beyond the seemingly limited realm of graphical models per se.
Firstly, on the relevance of graphical models for machine learning, the following is a quotation by Michael Jordan, from this page on Kevin Murphy’s website, both of whom are authors of textbooks used on the course:
Graphical models are a marriage between probability theory and graph theory. They provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering – uncertainty and complexity – and in particular they are playing an increasingly important role in the design and analysis of machine learning algorithms. Fundamental to the idea of a graphical model is the notion of modularity – a complex system is built by combining simpler parts. Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data. The graph theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly-interacting sets of variables as well as a data structure that lends itself naturally to the design of efficient general-purpose algorithms.
Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, pattern recognition and statistical mechanics are special cases of the general graphical model formalism – examples include mixture models, factor analysis, hidden Markov models, Kalman filters and Ising models. The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism. This view has many advantages – in particular, specialized techniques that have been developed in one field can be transferred between research communities and exploited more widely. Moreover, the graphical model formalism provides a natural framework for the design of new systems.
- Michael Jordan (1998).
This course will definitely get one up to scratch on the foundational topics in graphical models, but some elaboration is required on how it goes beyond that.
In particular, having studied this course, I can attest to the above quotation; that graphical models provide an elegant underlying framework from which to better understand not only the history of existing methods in probabilistic machine learning, but also how this history informs contemporary approaches e.g. deep learning.
Furthermore, I found that the selection of content on this course has supplied me with both the introductory literacy and consequently, the confidence, to be able to attempt to read machine learning research papers using more modern techniques not covered in introductory machine learning courses. Examples include variational inference, Markov chain Monte Carlo methods and their respective extensions; and also Bayesian nonparametrics and deep generative models.
Course description.
Many of the problems in artificial intelligence, statistics, computer systems, computer vision, natural language processing, and computational biology, among many other fields, can be viewed as the search for a coherent global conclusion from local information. The probabilistic graphical models framework provides an unified view for this wide range of problems, enables efficient inference, decision-making and learning in problems with a very large number of attributes and huge datasets. This graduate-level course will provide you with a strong foundation for both applying graphical models to complex problems and for addressing core research topics in graphical models.
Textbooks.
The required textbook for this class is (note that the material of the class goes beyond this book):
We will also be using excerpts from the following work, which you do not need to purchase:
Jordan, M. (draft). An Introduction to Probabilistic Graphical Models.
Optional:
Murphy, K. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
Chapters from Michael Jordan’s draft textbook are freely available at his homepage, and individual chapters prescribed to accompany each lecture are in course schedule below. All the chapters of his textbook can also be found and downloaded in their entirety in this subdirectory of my GitHub repo.
Course grading.
The class requirements include brief reading summaries, scribe notes for 1 lecture, 4 problem sets, and a project. This is a PhD level course, and by the end of this class you should have a good understanding of the basic methodologies in probabilistic graphical models, and be able to use them to solve real problems of modest complexity. The grading breakdown is as follows:
- Participation (4%)
- Scribe Duties (10%)
- Homework Assignments (40%)
- Final Project (46%)
Note that this class does not have any tests or exams.
Course schedule.
The course is a 15 week course, including a one-week break. The lecture schedule was as follows.
Date. | Lectures. | Readings. |
---|---|---|
1/14 | Lecture #1 (Eric): Introduction to GM. [slides (annotated) | video | notes] |
- Jordan, M. I. (2004). Graphical models. - Airoldi, E. (2007). Getting started in probabilistic graphical models. |
1/16 | Lecture #2 (Eric): Representation: Directed GMs (BNs). [slides (annotated) | video | notes] |
- Jordan textbook, Ch. 2. (Sec. 2.1). - Koller and Friedman textbook, Ch. 3. |
1/21 | No lectures (MLK day). | |
1/23 | Lecture #3 (Eric): Representation: Undirected GMs (MRFs). [slides (annotated) | video | notes] |
- Jordan textbook, Ch. 2. (Sec. 2.2 - end). - Koller and Friedman textbook, Ch. 4. - Fischer, A. & Igel, C. (2012). An introduction to restricted Boltzmann machines. |
1/28 HW1 out. |
Lecture #4 (Eric): Exact inference. - Elimination. - Message passing. - Sum product algorithm. [slides (annotated) | video | notes] |
- Jordan textbook, Ch. 3 and Ch. 4 - Koller and Friedman textbook, Ch. 9, Ch. 10. - Minka, T. (2005). Divergence measures and message passing. |
1/30 | Lecture #5 (skipped): Parameter learning in fully observable Bayesian Networks. - Generalized Linear Models (GLIMs) - Maximum Likelihood Estimation (MLE) - Markov Models. [slides | notes] |
- Jordan textbook, Ch. 8, Ch. 9 (Sec. 9.1 - 9.2). - Koller and Friedman textbook, Ch. 17 (Sections 17.1 - 17.4). - Geiger, D. & Heckerman, D. (2002). Parameter priors for directed acyclic graphical models and the characterization of several probability distributions. - Geiger, D. & Heckerman, D. (1997). A characterization of the Dirichlet distribution through global and local parameter independence. |
2/4 | Lecture #6 (Maruan): Parameter Learning of partially observed BNs. - Mixture models. - Hidden Markov Models (HMMs). - The EM algorithm. [slides (annotated) | video | notes] |
- Jordan textbook, Ch. 11. - Koller and Friedman textbook, Ch. 19.1-19.4. - Borman, S. (2006). The EM algorithm (A short tutorial). - Neal, R. & Hinton, G. (1998). Some interesting aspects of the EM algorithm. |
2/6 | Lecture #7 (Eric): Maximum likelihood learning of undirected GM. [slides (annotated) | video | notes] |
- Jordan textbook, Ch. 9 (Sec. 9.3-9.5), Ch. 20. - Friedman, J. et al. (2008). Sparse inverse covariance estimation with the graphical lasso. - Lafferty, J. et al. (2001). Conditional random fields - Probabilistic models for segmenting and labeling sequence data. |
2/11 | Lecture #8 (guest lecture, Kun Zhang @ Department of Philosophy): Causal discovery and inference. [slides | video | notes] |
- Pearl, J. et al. (2016). Causal Inference in Statistics: A Primer. Concepts related to causality and identification of causal effects. - Sprites, P. et al. (2001). Causation, Prediction, and Search. Pages 80-89 and 123-136. Traditional methods for causal discovery, including the PC and FCI algorithms. - Zhang, K. et al. (2017). Learning causality and causality-related learning. Brief review of recent methods for causal discovery. |
2/13 HW1 due. |
Lecture #9 (Eric): Modeling networks. - Gaussian graphical models. - Ising models. [slides (annotated) | video | notes] |
- Meinshausen, N. & Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. - Kolar, M. et al. (2010). Estimating time-varying networks. - Dempster, A. (1972). Covariance selection. |
2/18 | Lecture #10 (Eric): Sequential models. - Discrete Hidden State (HMM vs. CRF). - Continuous Hidden State (Kalman Filter). [slides (annotated) | video | notes] |
- Jordan textbook, Ch. 14, Ch. 15. - Ng, A. Lecture notes on factor analysis. - Welch, G. & Bishop, G. (2006). An introduction to the Kalman filter. - Wallach, H. (2004). Conditional random fields - An introduction. |
2/20 HW2 out. |
Lecture #11 (Eric): Approximate Inference: Mean Field (MF) and loopy Belief Propagation (BP) approximations. [slides (annotated) | video | notes] |
- Yedidia, J. et al. (2000). Generalized belief propagation. - Xing, E. et al. (2003). A generalized mean field algorithm for variational inference in exponential families. - Mohamed, S. et al. (2016). Variational inference tutorial (NIPS 2016). |
2/25 | Lecture #12 (Eric): Theory of Variational Inference: Inner and Outer Approximations. [slides (annotated) | video | notes] |
- Wainwright, M. & Jordan, M. (2003). Variational inference in graphical models - The view from the marginal polytope. - Wainwright, M. & Jordan, M. (2008). Graphical models, exponential families, and variational inference (Sec. 3 and 4). |
2/27 | Lecture #13 (Eric): Approximate Inference: Monte Carlo and Sequential Monte Carlo methods. [slides (annotated) | video | notes] |
- Jordan textbook, Ch. 21. - Mackay, D. (2003). Information Theory, Inference and Learning Algorithms, Ch. 29 (Sec. 29.1-29.3). |
3/4 | Lecture #14 (Eric): Markov Chain Monte Carlo. - Metropolis-Hastings. - Hamiltonian Monte Carlo. - Langevin Dynamics. [slides (annotated) | video | notes] |
- Mackay, D. (2003). Information Theory, Inference and Learning Algorithms, Ch. 29 (Sec. 29.4-29.10). - Neal, R. (2011). MCMC using Hamiltonian dynamics. - Patterson, S. & Teh, Y. (2013). Stochastic gradient Riemannian Langevin dynamics on the probability simplex. |
3/6 HW2 due. |
Lecture #15 (Eric): Statistical and Algorithmic Foundations of Deep Learning. - Insight into DL. - Connections to GM. [slides (annotated) | video | notes] |
- Goodfellow, I. et al. (2016). Deep Learning, Ch. 6.2-5, 20.3-4. - Salakhutdinov, R. & Hinton, G. (2009). Deep Boltzmann machines. - Belanger, D. & McCallum, A. (2016). Structured prediction energy networks. - Ranganath, R. et al. (2015). Deep exponential families. |
3/11 | No class (Spring Break). | |
3/13 | No class (Spring Break). | |
3/18 HW3 out. |
Lecture #16 (guest lecture, Zhiting Hu): Building blocks of DL. - RNN and LSTM. - CNN, Transformers. - Attention mechanisms. - (Case studies in NLP). [slides | video | notes] |
- Pascanu, R. et al. (2013). On the difficulty of training recurrent neural networks. - Vaswani, A. et al. (2017). Attention is all you need. - Devlin et al. (2019). BERT - Pre-training of deep bidirectional transformers for language understanding. |
3/20 | Lecture #17 (Eric): Deep generative models (part 1): Overview of the theoretical basis and connections of deep generative models. - Wake sleep algorithm. - Variational autoencoders. - Generative adversarial networks. - A unified view of DGM. [slides | video | notes] |
- Goodfellow, I. et al. (2016). Deep Learning, Ch. 20.9-10. - Kingma, D. & Welling, M. (2014). Auto-encoding variational Bayes.. - Goodfellow, I. et al. (2014). Generative adversarial nets. - Sanjeev, A. GANs, some open questions (blog). |
3/25 | Lecture #18 (guest lecture, Zhiting Hu): Deep generative models (part 2). - GANs and their variations. - Normalizing Flows. - Integrating domain knowledge in DL. [slides | video | notes] |
- Arjovsky, M. & Bottou, L. (2017). Towards principled methods for training generative adversarial networks. - Kingma, D. & Dhariwal, P. (2018). Glow - Generative flow with invertible 1x1 convolutions. - Hu, Z. et al. (2016). Harnessing deep neural Networks with logic rules. - Hu, Z. et al. (2018). Deep generative models with learnable knowledge constraints. |
3/27 Midway report due. |
Lecture #19 (guest lecture, Zhiting Hu): Case Study: Text Generation. - The encoder-decoder framework - Machine translation as conditional generation. - Unifying MLE and RL objectives for text generation. [slides | video | notes] |
- Ranzato, M. et al. (2016). Sequence level training with recurrent neural networks. - Hu, Z. et al. (2017). Toward controlled generation of text. - Tan, B. et al. (2019). Connecting the dots Between MLE and RL for sequence generation. |
4/1 | Lecture #20 (Maruan): Sequential decision making (part 1): The framework. - Brief introduction to reinforcement learning (RL). - Connections to GM: RL and control as inference. - Control via Variational Inference. [slides (annotated) | video | notes] |
- Sutton, R. & Barto, A. (2018). Reinforcement Learning: An Introduction, Ch. 3, Ch. 4. - Lilian Weng. A peek into RL (blog). - Levine, S. (2018). Reinforcement learning and control as probabilistic inference, Sec. 1-3. - Ziebart, B. (2010). Modeling purposeful adaptive behavior with the principle of maximum causal entropy, Ch. 5.1-2, 6.1-2. - Todorov, E. (2008). General duality between optimal control and estimation. - Koller and Friedman textbook, Ch. 20.3. |
4/3 HW3 due. |
Lecture #21 (Maruan): Sequential decision making (part 2): The algorithms. - Intro to RL algorithms: policy gradients and Q-learning. - Max-entropy policy gradient. - Soft Q-learning. [slides (annotated) | video | notes] |
- Sutton, R. & Barto, A. (2018). Reinforcement Learning: An Introduction, Ch. 13. - Levine, S. (2018). Reinforcement learning and control as probabilistic inference, Sec. 4. - Haarnoja, T. et al. (2017). Reinforcement learning with deep energy-based policies. - Haarnoja, T. et al. (2018). Soft actor-critic. |
4/8 HW4 out. |
Lecture #22 (Eric): Bayesian nonparametrics. - Dirichlet Process (DP). - Indian Buffet Process (IBP). [slides (annotated) | video | notes] |
- Teh, Y. (2011). The Dirichlet process. - Griffiths, T. & Ghahramani, Z. (2011). The Indian buffet process. |
4/10 | Lecture #23 (Maruan): Bayesian non-parametrics (continued). - Inference in Dirichlet Process (DP). - Hierarchical Dirichlet Process (HDP). - Indian Buffet Process (IBP). [slides | video | notes] |
- Hopper, T. Notes on Dirichlet processes (webpage). - Ishwaran, H. & James, L. (2001). Gibbs sampling methods for stick-breaking priors. - Teh, Y. et al. (2006). Hierarchical Dirichlet process. |
4/15 | Lecture #24 (Eric): Integrative Paradigms of GM: Regularized Bayesian Methods. [slides (annotated) | video | notes] |
- Ganchev, K. et al. (2010). Posterior regularization for structured latent variable models. - Zhu et al. (2014). Bayesian inference with posterior regularization and applications to infinite latent SVMs. |
4/17 | Lecture #25 (Eric): Elements of Spectral & Kernel GMs. [slides (annotated) | video | notes] |
- Song, L. (2008). Learning via Hilbert space embedding of distributions, Sec. 2.1, 2.2, 3.1, 3.2. - Hsu, D. et al. (2012). A spectral algorithm for learning hidden Markov models. - Song, L. et al. (2012). Nonparametric latent tree graphical models. - Parikh, A et al. (2011). A spectral algorithm for latent tree graphical models. |
4/22 | Lecture #26 (Maruan): Gaussian processes (GPs) and elements of meta-learning. - Gaussian Processes (GPs) and kernel functions. - (Deep) kernel learning and approximations. - Neural Processes (NPs) as an approximation to GPs. - Elements of meta-learning. [slides | video | notes] |
- Rasmussen, C. & Williams, C. (2003). Gaussian Processes for Machine Learning, Ch. 2.2-2.4, Ch. 4.1-4.2. - Görtler, J. et al. (2019). A visual exploration of Gaussian processes (blog). - Wilson et al. (2015). Deep kernel learning. - Garnelo, M. et al. (2018). Neural processes. |
4/24 HW4 due. |
Lecture #27 (guest lecture, Qirong Ho): Scalable algorithms and systems for learning, inference, and prediction. [slides | video | notes] |
- Bradley, J. et al. (2011). Parallel coordinate descent for L1-regularized loss minimization. - Yuan, J. et al. (2015). LightLDA - Big topic models modest computer clusters. - Ho, Q. et al. (2013). More effective distributed ML via a stale synchronous parallel parameter server. - Kim, J. et al. (2016). STRADS - A distributed framework for scheduled model parallel machine learning. |
4/29 | Lecture #28 (Eric): A civil engineering perspective on AI. [slides (annotated) | video | notes] |
- Xing, E. et al. (2015). Petuum - A new platform for distributed machine learning on Big Data. - Xing, E. et al. (2016). Strategies and principles of distributed machine learning on Big Data. |
4/30 | Project presentations. |
A complete set of lecture videos were uploaded to YouTube, and are individually linked above, except for lecture 5, “Parameter learning in fully observable Bayesian networks”, which was skipped in the course itself.
These recorded lectures contain instructor-led exposition of the lecture slides, on which annotations are made.
Lecture slides for all sessions are individually linked above, but can be downloaded in their entirety from this subdirectory of my GitHub repo.
Instructor annotated lecture slides are also individually linked above, but not every lecture has a set of annotated slides. All annotated lecture slides can be downloaded in their entirety from this subdirectory.
All prescribed readings are individually linked, as are notes in the form of distill.pub style blog articles scribed by CMU students taking the course. All the prescribed readings can be downloaded in their entirety from this subdirectory.
I have also included my own additional materials that I developed whilst watching these lectures. These additional materials can be found further down in the 2nd part of this page under “Lecture summaries, session notes, review notes.”
Homework Assignments.
There will be 4 homework assignments over the course of the semester. These assignments may contain material that has been covered by published papers and webpages. It is a graduate class and we expect students to solve the problems themselves rather than search for answers.
Students are required to typeset homework solutions using LaTeX and the provided template.
Collaboration Policy
Homework assignments must be done individually: each student must hand in their own answers. However, it is acceptable to collaborate when figuring out answers and to help each other solve the problems. We will be assuming that, as participants in a graduate course, you will be taking the responsibility to make sure you personally understand the solution arising from such collaboration. You also must indicate on each homework with whom you have collaborated.
Homework 1.
- Assignment PDF handout.
- Assignment LaTeX template.
- Starter code.
- Supplementary note on hidden Markov models.
Homework 2.
Homework 3.
Homework 4.
All homework assignment materials can be downloaded in their entirety in this subdirectory of my GitHub repo.
There are no official, publicly available course solutions for the assignments. For those self-studying, please see the section “Additional comments on homework assignments” in the 2nd part of this page.
Projects.
I have located this part of the course page in another location at:
insert link, pending construction.
Subsequent iterations of this course.
At the time of writing, there is an additional spring 2020 iteration of the course, also taught by Eric Xing, and the coursepage is below:
And there is also a later spring 2021 iteration of the course, taught by Matt Gormley, and the coursepage is below:
On the spring 2020 iteration of this course, the content and the content archived on this page are almost entirely identical, with the minor exception of at most 2 later lecture slide sets. All lecture slides and videos are available for this iteration, and prescribed readings are the same, but it does not possess a complete set of homework assignments due to the disruption caused by COVID-19. I will endeavour to list some of these minor differences when I have had a chance to audit it more comprehensively.
On the spring 2021 iteration by Matt Gormley, I cannot comment as I have only had a cursory glance at course page. However, it would seem that the lecture content has become more geared towards said instructor’s research interests, inevitably.
Self-study support.
In what follows, I include some additional comments from my own experiences self-studying the course. This may be helpful to those wishing to make use of this course in a similar fashion.
I also include additional materials that I wrote when self-studying this course. These include my own handwritten lecture notes and high-level summaries for each recorded video lecture; write-ups of my own attempts at the homework assignments with accompanying code for which solutions are not available; and supplementary reference textbooks that I found useful.
Additional comments on course prerequisites.
The official course page does not list any course prerequisites.
However, at times, aspects of the course are uncompromising in terms of the level of the technical material covered and readings suggested, which is why it is stated under “Course grading” that this is a PhD level course (although I suspect on the ‘easier’ spectrum of PhD level).
Therefore, the course can be technically demanding at times. First, a little needs to be said on what these technical demands are not. The technical demands of the course do not encompass being fluent in any sophisticated formal mathematical machinery e.g. in measure theory, stochastic processes, functional analysis etc. which would require substantial prior training.
Mathematical prerequisites.
In respect of formal mathematical fluency, only calculus, linear algebra, probability and statistics are required. The following are some minimum requirement pointers:
- Linear algebra - “computational” fluency is required. That is, one should be fluent with elementary operations and manipulations with vectors and matrices, to the extent that one can read and reconstruct proofs of basic results and identities. Fluency with the theory of linear algebra, in the sense of the study of linear maps on finite-dimensional vector spaces, while desirable in general, is not strictly necessary for the course.
- Calculus - “computational” fluency is required. That is, one should be fluent with computing derivatives for optimisation problems, and in particular multivariate generalisations such as gradients, Jacobians and Hessians.
- Probability - basic fluency is required. Fluency with manipulation of basic probability statements via application of probability rules is absolutely essential both to the exact inference and approximate inference parts of the course.
- Statistics - basic fluency is required. In particular, familiarity in dealing with multivariate objects in a statistical setting e.g. computing expectations; and also familiarity with the notion of a statistic, maximum likelihood estimation. Familiarity with Bayesian statistics is essential.
Prerequisites for homework assignment completion.
Programming skills in Python and also NumPy are essential to be able to complete the four homework assignments, because some of these will require one to write implementations of algorithms from scratch without recourse to off-the-shelf packages.
Background knowledge.
With respect to keeping up with the pace of the lectures, and in being able to derive significant value from the suggested readings, most of the technical demands lie in the requirement for a previous acquaintance with standard machine learning methods covered in an introductory graduate machine learning course. That is because many of these methods will be reinterpreted through the framework of probabilistic graphical models in the lectures.
I therefore strongly recommend that those wishing to use this course self-study already have acquaintance with the following: perceptrons, Bayesian linear regression, logistic regression, LASSO or \(l_1\)-regularised regression, Gaussian mixture models, k-means algorithm, parameter estimation and inference in hidden Markov models, expectation-maximisation (EM) algorithm, latent Dirichlet allocation (LDA), simple multi-layer perceptrons and the backpropagation algorithm.
In my view, attempting to derive significant value from the graphical model reinterpretation of the above methods without existing underlying knowledge of these methods would be difficult. Attempting to take self-study detours to learn about the above methods in parallel with assimilating the already technical material of the lectures is not impossible, but in my view, is not advised.
Lastly, I recommend spending some time reading “Chapter 8: Graphical Models” of Pattern Recognition and Machine Learning by Bishop (2006) to get a sense of the foundational topics the course covers, before undertaking self-study. The course however goes much further than this. As the chapter is around 80 pages, there is no need to read it entirely, rather dip into it. I have included specific suggestions on what to read from this chapter, as well as a link to it under “Supplementary texts.”
On my own background knowledge before undertaking self-study in this course, I had already completed a graduate-level machine learning course, and had already gone through most of the chapters of Pattern Recognition and Machine Learning by Bishop (2006) quite comprehensively.
Additional comments on video lectures.
The exposition in the video lectures is on the whole very clear. The utility of watching these lectures over and above just reading the slides and doing the recommended readings cannot be overstated.
Viewing the content of the lecture slides before having watched the lecture recordings might make one believe that the material is intimidatingly technical. However, one is clearly guided through the more difficult derivations by the instructor Eric Xing, and other guest lecturers, and any derivations not covered are generally routine.
However, where I would place the greatest educational value would be in the motivations and bird’s eye view supplied by the experienced instructors on the material covered. With Eric Xing’s instruction in particular, it was emphasised to me that techniques in and of themselves are not what counts, rather, what counts is the manner in which it is deployed to achieve a vision, or to solve a particular aspect of a problem. The strong sense I took away from this is that there is a scientifically informed artistry in the selection of what machine learning techniques one wishes to use to solve or flexibly adapt to a particular aspect of a problem.
A consequence of this perspective is that for the more advanced topics, such as the use of gradient information in Hamiltonian Monte Carlo methods to address the inefficient random walk behaviour of Markov chain Monte Carlo methods; or the use of stochastic approximation in stochastic variational inference; these become, in hindsight, natural and almost obvious extensions to existing methods.
For me, it is this viewpoint that was the singularly most valuable take away from the course, one which only comes from an experienced instructor, and which is difficult to holistically communicate in monographs or tutorials.
On the technical side, there are no issues with audio discernibility nor with video clarity.
There is an alternative set of links for the video lectures, whose video quality is clearer, but not overly noticeably so, and these are hosted on a publicly available part of CMU Panopto here:
YouTube links instead of the CMU Panopto links were included as it was decided that the former might be more ‘permanent’ than the latter.
Additional comments on course content.
The course seems to have been running at CMU for many years, and I suspect that its technical demands are by deliberate design. That is, the motivation is to supply MS and PhD students with literacy in state-of-the-art techniques regularly encountered in research, rather than presenting content which has been watered down for accessibility.
The course is divided into 5 mini-modules:
- Module 1: Introduction, representation and exact inference.
- Module 2: Approximate inference.
- Module 3: Deep learning and generative models.
- Module 4: Reinforcement learning and control through inference in graphical models.
- Module 5: Nonparametric methods.
Module 1 is an extremely fast-paced introduction to foundational topics in probabilistic graphical models, such as representation, parameter estimation, and exact inference. Some additional topics such as causal inference, network structure learning, and sequential models are also included.
Module 2 introduces both deterministic and stochastic approximate inference techniques. The former covers mean-field variational inference, and the theory of variational inference. The latter introduces Monte Carlo, sequential Monte Carlo, Markov chain Monte Carlo methods and further extensions.
Module 3 looks at the interplay between graphical models and deep learning, mainly through the recent development of deep generative models.
Module 4 and Module 5: tbc
Additional comments on homework assignments.
These comprise a mix of theoretical questions involving mathematical proof or the derivation of an algorithm; and programming exercises where an algorithm needs to be implemented from scratch on a pre-processed data set.
For CMU students, these derivations would be marked by a teaching assistant, and the code implementation would be marked by a scripted automatic grading system, or a teaching assistant.
As part of an audit almost a year ago on whether the assignments could be completed as part of self-study, outside of the support structure of formal education, I compiled the following summary of what each of the four assignments consisted of.
-
Homework assignment 1 - A mix of exercises and derivations taken from Koller and Friedman (2009). Programming element consists of implementing a junction tree algorithm and an EM algorithm for inference in a hidden Markov model. Script skeletons are provided and one has to fill in most of the code for submission to Gradescope.
-
Homework assignment 2 - Derivations of algorithms for parameter estimation and inference in a hidden Markov model and in a linear-chain conditional random field; of algorithms for variational inference in a sequence-augmented latent Dirichlet allocation model, and of the Metropolis algorithm for posterior inference in a hierarchical Bayesian model. Programming element consists of implementing all derived algorithms from scratch in code on provided data sets for parts-of speech tagging (Penn Treebank corpus), topic modelling (Wikipedia dataset), and sports analytics (English Premier League data), without recourse to off-the-shelf packages. Results are expected as a PDF write-up, and algorithm implementations as Python scripts.
-
Homework assignment 3 - A series of derivations and code implementations for various algorithms related to the variational autoencoder, and generative adversarial networks. Code is used create results that are written up. Results are submitted as a Jupyter/Colab notebook.
-
Homework assignment 4 - Derivations and code implementations for a reinforcement learning problem; and some derivations on Bayesian nonparametrics. Results are submitted as Jupyter/Colab notebook.
For those who may wish to complete these assignments as part of self-study, it might seem a daunting prospect as to whether these technical assignments can be completed, especially without the support structure of a formal educational environment.
The following is some general advice from my own experience with the course homework assignments during self-study.
For simpler theoretical derivation exercises, these tend to be doable, and success here will depend on how well one has assimilated the contents of the lectures. Whilst the derivations at times concern technically demanding content, they are not so impenetrable so as not be made easier by further reading around the subject.
For the more involved derivations where one has to use a procedure to derive an algorithm, these are based on or adapted from existing tutorial papers. My advice here would be that one really needs to completely deconstruct the relevant aspects of the tutorial papers with surgical precision and intellectual honesty. That is, one needs to fully understand all aspects of the method being applied to derive the algorithm. In the case that one is struggling to understand, one needs the strength to be able to admit this. My advice in this case would be to seek further supplementary materials, such as other tutorial papers or reputable references. Because any deficiencies in understanding at the pre-algorithmic, mathematical derivation stage will be revealed or compounded during the code implementation, where one has to additionally worry about coding bugs.
For code implementations that would normally be marked by the automated grading system or teaching assistants, not available to those self-studying, one will have to aim for algorithm correctness as far as possible within the parameters defined by the assignment. However those self-studying are not working completely blind.
For the more elementary code implementations, there are often very concrete mathematical constraints or behaviours that the implementation needs to exhibit that can be used to ascertain whether one is getting correct results. There are also other ways to check one’s code implementation, e.g. in parameter estimation, one can generate synthetic data with known parameters, and see if these known parameters are recovered.
Similar principles apply to the more involved algorithm implementations, there will be mathematical constraints or characteristic behaviours that the implementation will need to abide by or exhibit.
More generally, standard software engineering best practices apply. Write simple code to begin with, test it on simple examples, and slowly build up the complexity piece by piece. Doing so allows one to diagnose where bugs might be, or provide information to better guesstimate where the bugs are. Debugging code is often made easier when one has security that the modular pieces are working correctly individually.
The special nature of self-study outside of a formal educational environment allows for some leeway with the use of off-the-shelf packages. Whilst the objectives of the coding assignments are to teach one a technique by implementing it from scratch, and not use an off-the-shelf package, the lack of concrete, specific assistance from teaching assistants and peers means that if one is truly stuck a source of assistance might be to learn to use an off-the-shelf package on the data set, but only as a gold standard to measure one’s own implementation against. This would also mean developing literacy in that package as an added bonus.
On my own experiences with the homework assignments, some of them took much longer than I anticipated, and there was a great deal of ‘feeling around in the dark’, which at points became demoralising. Meaning that I had to spend a lot of time deconstructing the relevant papers, or searching for other references or tutorials when I wasn’t able to resolve a particular point. On most occasions, I was able to make it through.
For this reason, I have compiled a specific list of hints and tips; as well as additional references I found useful whilst doing the assignments.
insert PDF.
If one chooses to do these assignments in this way, then the added bonus is that one will develop skills to be able to reproduce a machine learning paper in code from scratch. While this will not be production-ready, it does induce confidence that one has understood a method enough to be able to engineer it from the ground up. Because ultimately, pre-production algorithms are instantiations of mathematical ideas.
Additional comments on reference textbooks.
Throughout the course, I consulted all three recommended course textbooks equally frequently. Here are some comments on each textbook, and how I used them while self-studying the course.
I found that the draft textbook Introduction to Probabilistic Graphical Models by Jordan to excel in its clarity of exposition, and I relied on this textbook most heavily when it came to developing a clear understanding of the foundational topics in graphical models, including but not limited to, representation; graphical model interpretations of sequential models; parameter estimation in fully and partially observed graphical models; and exact inference procedures such as message-passing and the junction tree algorithm.
Generally, I found that if one of the topics on the course overlapped with any of book chapters, this book would be the first place I would start. That this book is so remarkably clear is no surprise, Michael Jordan was the mentor of the course instructor, Eric Xing, and also had strong associations with the course textbook authors, Kevin Murphy, and Daphne Koller.
The textbook Probabilistic Graphical Models: Principles and Techniques by Koller and Friedman is extremely detailed. I used aspects of this text to supplement the readings from Jordan’s textbook, when I needed additional detail to understand the content of the lecture slides, particularly on representation. My only reservations about this text however is that it can be overwhelming to those who are relatively new to graphical models, and some of the notation can feel unconventional (many of the course slides adapt material and therefore notation from Jordan’s textbook).
Due to the accelerated pace at which the course presents foundational topics, if one is not careful, in my view, it is possible to get ‘lost in the details’ of Koller and Friedman’s comprehensiveness, and lose sight of the big picture. However, I have no doubt that their textbook remains a solid reference for researchers.
Finally, I found Machine Learning: A Probabilistic Perspective by Murphy also to be an excellent all round reference textbook for both the foundational and more esoteric topics covered in this course. While I’m not sure it is a book that could be exclusively used for self-study, I found this book to be excellent as a go-to reference for more conventional machine learning techniques.
In particular, where Murphy’s book excelled for me was that it contains many sections on more modern techniques in machine learning papers which have become assimilated into the ‘canon’ of required knowledge e.g. latent Dirichlet allocation, conditional random fields etc. Furthermore, I found myself using this as a precursor reference for approximate inference techniques in the course, such as variational inference, Markov chain Monte Carlo methods; and for topics such as loopy belief propagation and the theory of variational inference. This was useful because the papers which discuss those topics can get theoretically demanding quite quickly e.g. the entire 308 page Wainwright and Jordan (2008) paper, of which chapters 3 and 4 are prescribed as reading in lecture 12, is described in Murphy as the ‘monster’ for this reason.
In terms of the course textbooks that I bought hard-copies of, I already had a copy of Machine Learning: A Probabilistic Perspective. The draft textbook Introduction to Probabilistic Graphical Models is freely available online, and is also available to download in its entirety from this subdirectory of my GitHub repo. The sample chapters of Probabilistic Graphical Models: Principles and Techniques are sufficient to get by in the course. However, I did later purchase a copy of this textbook as a general reference.
Lecture summaries, session notes, review notes.
As part of self-study and to assist in my understanding, I kept a bullet-pointed high-level summary of the content of each recorded lecture in a single PDF document. This document is available below:
The recorded video lectures contains valuable information on the motivation behind some of the techniques introduced, as well as further exposition and annotations of selected parts of the lecture slides. In order to aid my understanding, I made handwritten notes on the points of emphasis communicated orally by Eric Xing. These notes will also contain core details on models or techniques copied from the slides, and while this may be repetitious, it aided my understanding, and gave me pointers on what I needed to pick up during review. They are listed as session notes below.
Some parts of the lecture slides are given less emphasis in the recordings, with the expectation that they will be reviewed in one’s own time. It was also the case that some of the slides were too technical to be able to fluently assimilate during the lecture, or queries emerged that required review and supplementary reading. On top of that, there are also fairly technical paper readings that need to be completed. These notes are listed as review notes below.
Lecture 1 - Introduction to GMs. [Session notes. | Review notes.]
Lecture 2 - Representation: Directed GMs (BNs). [Session notes. | Review notes.]
Lecture 3 - Representation: Undirected GMs (MRFs). [Session notes. | Review notes.]
Lecture 4 - Exact inference. [Session notes. | Review notes.]
Lecture 5 - Parameter learning in fully observable BNs. [Session notes. | Review notes.]
Lecture 6 - Parameter learning of partially observed BNs. [Session notes. | Review notes.]
Lecture 7 - Maximum likelihood learning of undirected GMs. [Session notes. | Review notes.]
Lecture 8 - Causal discovery and inference. [Session notes. ]
Lecture 9 - Modelling networks. [Session notes. | Review notes.]
Lecture 10 - Sequential models. [Session notes. | Review notes.]
Lecture 11 - Approximate inference: Mean-field (MF) and loopy belief propagation (BP) approximations. [Session notes. | Review notes.]
Lecture 12 - Theory of variational inference: Inner and outer approximations. [Session notes. | Review notes.]
Lecture 13 - Approximate inference: Monte Carlo and sequential Monte Carlo methods. [Session notes. | Review notes.]
Lecture 14 - Markov chain Monte Carlo. [Session notes. | Review notes.]
Lecture 15 - Statistical and algorithmic foundations of deep learning. [Session notes. | Review notes.]
Lecture 16 - Building blocks of deep learning (DL).
Lecture 17 - Deep generative models (part 1).
Lecture 18 - Deep generative models (part 2).
Lecture 19 - Case study: Text generation.
Lecture 20 - Sequential decision making (part 1).
Lecture 21 - Sequential decision making (part 2).
Lecture 22 - Bayesian nonparametrics (part 1).
Lecture 23 - Bayesian nonparametrics (part 2).
Lecture 24 - Integrative paradigms of GMs: Regularized Bayesian methods.
Lecture 25 - Elements of spectral and kernel GMs.
Lecture 26 - Gaussian processes (GPs) and elements of meta-learning.
Lecture 27 - Scalable algorithms and systems.
Lecture 28 - A civil engineering perspective on AI.
All scanned notes can be downloaded in their entirety from this subdirectory of my GitHub repo.
Pending complete upload of notes.
Homework assignments: my write-ups.
My write-ups of each homework assignment are listed below.
- Homework 1 write-up.
- Homework 2 write-up.
- Homework 3 write-up.
- Homework 4 write-up.
To insert links after upload.
For questions involving the hand-coded implementations of algorithms, the accompanying code repositories are listed below.
To insert links after upload.
Supplementary texts.
The following are supplementary reference textbooks not already listed that I found useful for introductory reading, or for remedying gaps in my knowledge, or as references to supplement the course materials.
As previously mentioned under “Additional comments on course prerequisites”, listed below is a link to “Chapter 8: Graphical Models” of Pattern Recognition and Machine Learning by Bishop (2006). I would recommend reading the introduction, the preliminary part of section 8.1 on Bayesian networks up to but not including the first example on polynomial regression in section 8.1.1; all of the parts of section 8.2 on conditional independence up to and including the definition and example of d-separation in 8.2.2; and all the parts of section 8.3 on Markov random fields up to but not including section 8.3.2 on factorization properties.
As previously mentioned under “Additional comments on course prerequisites”, familiarity with the fundamentals of Bayesian statistics is essential.
In spite of the fact that probabilistic graphical models are inherently neither frequentist nor Bayesian tools, they are most often encountered in Bayesian settings. A basic literacy with Bayesian statistics is therefore required to understand both the foundational topics, as well as the more modern methods and models in the course, such as latent Dirichlet allocation, variational inference, Markov chain Monte Carlo methods, Gaussian processes, and Bayesian nonparametrics. In particular, some of the assignments require developing familiarity with hierarchical Bayesian models. For these reasons, I both used and would recommend the well-known reference textbook below on Bayesian statistics, which also has some sections on the more modern topics covered in the course:
Gelman, A., Carlin, J., Stern, H., Dunson., D., Vehtari, A., Rubin, D. (2013). Bayesian Data Analysis (3rd ed.). Chapman and Hall.
There are many excellent introductory tutorial papers written on the class of simulation-based methods, to which sampling, Monte Carlo and Markov chain Monte Carlo methods belong. However, given that Markov chain Monte Carlo methods arguably revolutionised Bayesian statistics, I wanted an authoritative reference textbook on these methods to supplement the introductory presentation from the course. For an in-depth graduate-level perspective into the theory of all the above methods, I used and would recommend the following:
Robert, C., Casella, G. (2010). Monte Carlo Statistical Methods (2nd ed.). Springer-Verlag New York Inc.
Electronic copies of reference textbooks.
Whilst in an ideal world one would have hard copies of all the books here, I recognise that there are many who are financially constrained and not able to purchase the books for themselves.
Almost all of the reference textbooks, supplementary textbooks and textbooks referenced in the readings are freely available online, often kindly by choice of the authors themselves.
Jordan, M. (draft). An Introduction to Probabilistic Graphical Models.
Goodfellow, I., Bengio, Y. Courville, A. (2016). Deep Learning. MIT Press.
Sutton, R., Barto A. (2018). Reinforcement Learning: An Introduction. (2nd ed.). MIT Press.
Rasmussen, C., Williams, C. (2006). Gaussian Processes for Machine Learning. MIT Press.
At the time of writing, these links were all working. I will not actively maintain these links. Furthermore, some of the links above are not the most up-to-date editions, which may or may not be appropriate for one’s purposes.
In the event that a link goes dead, and for the reference textbooks that are not included in the list above, if one really needs access to an electronic copy, the following is a possible solution:
- Visit the Genesis Library, a link to which is contained in the following Wikipedia article.