ArtificialIntelligenceArticles
For those who have a passion for:
1. #ArtificialIntelligence
2. Machine Learning
3. Deep Learning
4. #DataScience
5. #Neuroscience
6. #ResearchPapers
7. Related Courses and Ebooks
Learning to Discover Novel Visual Categories via Deep Transfer Clustering

Han et al.: https://arxiv.org/abs/1908.09884

#ArtificialIntelligence #DeepLearning #NeuralNetworks
The top 20 free online CS courses of all time, via Class Central

Full list: bit.ly/CC100MOOCs #learntocode #MondayMotivation
Learning to Learn with Probabilistic Task Embeddings
https://bair.berkeley.edu/blog/2019/06/10/pearl/
Multi-Level Bottom-Top and Top-Bottom Feature Fusion for Crowd Counting. https://arxiv.org/abs/1908.10937
Saccader: Improving Accuracy of Hard Attention Models for Vision
Gamaleldin F. Elsayed, Simon Kornblith, Quoc V. Le: https://arxiv.org/abs/1908.07644
#ArtificialIntelligence #DeepLearning #MachineLearning
ACL 2019 Thoughts and Notes
By Vinit Ravishankar, Daniel Hershcovich; edited by Artur Kulmizev, Mostafa Abdou: https://supernlp.github.io/2019/08/16/acl-2019/
#naturallanguageprocessing #machinelearning #deeplearning
Most common libraries for Natural Language Processing:

CoreNLP from Stanford group:
https://stanfordnlp.github.io/CoreNLP/index.html

NLTK, the most widely-mentioned NLP library for Python:
https://www.nltk.org/

TextBlob, a user-friendly and intuitive NLTK interface:
https://textblob.readthedocs.io/en/dev/index.html

Gensim, a library for document similarity analysis:
https://radimrehurek.com/gensim/

SpaCy, an industrial-strength NLP library built for performance:
https://spacy.io/docs/

Source: https://itsvit.com/blog/5-heroic-tools-natural-language-processing/
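
As a quick illustration of how differently these libraries feel in use, here is a minimal tokenization sketch with NLTK and spaCy (a sketch only: it assumes nltk and spacy are installed along with spaCy's en_core_web_sm model, and the sample sentence is mine, not from the article above):

import nltk
import spacy

nltk.download("punkt", quiet=True)   # NLTK ships its tokenizer models separately

text = "Natural language processing turns raw text into structured data."

# NLTK: call individual functions for each processing step
nltk_tokens = nltk.word_tokenize(text)

# spaCy: one pipeline object runs tokenization, tagging, parsing and NER in a single call
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
spacy_tokens = [token.text for token in doc]

print(nltk_tokens)
print(spacy_tokens)

The difference in API surface is a large part of why NLTK shows up so often in teaching material while spaCy is pitched at production pipelines, consistent with the descriptions above.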

#nlp #digest #libs
Bayesian neural nets: The Saga.

Yarin Gal, a PhD student working with Zoubin Ghahramani in Cambridge, posted his PhD dissertation on Bayesian deep learning (blog post: https://mlg.eng.cam.ac.uk/yarin/blog_2248.html ).

Section 2.2 is an exceptionally well-informed history of Bayesian neural nets: https://mlg.eng.cam.ac.uk/yarin/thesis/2_language_of_uncertainty.pdf#page=4

In this historical section, Yarin starts with a 1987 paper entitled "Large Automatic Learning, Rule Extraction, and Generalization", by Denker, Schwartz, Wittner, Solla, Howard, Jackel and Hopfield ( https://pdfs.semanticscholar.org/33fd/c91c520b54e097f5e09fae1cfc94793fbfcf.pdf ). This paper proposes a version of the very Bayesian idea that each training sample has the effect of modulating a probability distribution over parameters. The authors were all at Bell Labs in Holmdel, NJ, in the same group that I joined one year after this paper appeared. John Denker and I became close friends and collaborators, Sara Solla and I were office neighbors, Larry Jackel was my boss, and Rich Howard became his boss. There are two interesting tidbits about this paper: (1) the main concepts were inspired by statistical physics, not by Bayesian statistics; AFAIK, the authors were unaware of the Bayesian inference literature of the time. (2) the word "Large" in the title was a typesetting mistake: they had sent a LaTeX file to the publisher with a {\Large } command for the title, and the \ somehow disappeared, making "Large" part of the title.

The second paper Yarin mentions is "Consistent inference of probabilities in layered networks: Predictions and generalizations" by Naftali Tishby, Esther Levin, and Sara A. Solla, published at IJCNN 1989 (a longer version appeared in Proc. IEEE in 1990: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=jmERpL4AAAAJ&cstart=20&pagesize=80&sortby=pubdate&citation_for_view=jmERpL4AAAAJ:d1gkVwhDpl0C ). Even more than the Denker paper, the framework is very Bayesian: there is a distribution over parameters (starting with a prior) which each training sample modulates with a Gibbs term. Again, this all came from statistical physics, not Bayesian statistics (Tishby and Solla are both statistical physicists). I participated in many discussions leading to this paper and I thought this model of learning and generalization was incredibly enlightening. They presented the paper at COLT 1989 in Santa Cruz. This is where they (and John Denker) met Vladimir Vapnik, who joined our group one year later.

The next paper Yarin mentions is one John Denker and I co-authored: "Transforming neural-net output levels to probability distributions", by Denker and LeCun (NIPS'90, vol 3. 1991) ( https://scholar.google.com/citations?view_op=view_citation&hl=en&user=WLN3QrAAAAAJ&cstart=300&pagesize=100&sortby=pubdate&citation_for_view=WLN3QrAAAAAJ:aqlVkmm33-oC ). Using the idea of a distribution over parameters, we figured we could estimate the uncertainty of the prediction of a neural net by approximating the bottom of the loss function by a quadratic form, which implies a Gaussian posterior over parameters, and by marginalizing over this Gaussian (linearizing the network). We used a diagonal approximation of the Hessian for that. Again, the idea was inspired by statistical physics, and not Bayesian statistics (which we were woefully unaware of). Little did we know that what we were doing could be called variational Bayesian inference with a Laplace approximation.
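
For readers who want the mechanics spelled out, here is a rough sketch of that construction in modern notation (the symbols below are mine, not taken from the paper): expand the loss around its minimum, read off a Gaussian posterior, then linearize the network and marginalize.

% quadratic expansion of the training loss around its minimum w^*,
% with the diagonal Hessian approximation mentioned above
L(w) \approx L(w^*) + \tfrac{1}{2}\,(w - w^*)^\top H\,(w - w^*),
\qquad H \approx \operatorname{diag}\!\left(\frac{\partial^2 L}{\partial w_i^2}\right)
% which corresponds to a Gaussian posterior over the parameters
p(w \mid \mathcal{D}) \approx \mathcal{N}\!\big(w^*,\; H^{-1}\big)
% linearizing the network output f(x, w) around w^* and marginalizing
% over this Gaussian yields error bars on the prediction
f(x, w) \approx f(x, w^*) + J(x)^\top (w - w^*),
\qquad \operatorname{Var}\big[f(x)\big] \approx J(x)^\top H^{-1} J(x)

Here J(x) is the gradient of the network output with respect to the parameters, evaluated at w^*.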

16 months later, at the annual "Snowbird workshop" (the real name was "neural networks for computing"), I met David MacKay for the first time. He was finishing his PhD at Caltech under John Hopfield (last co-author of the Denker et al. paper). In his talk at Snowbird, he presented an extension of our idea which he formulated in a proper Bayesian framework, and in which he used a "square Jacobian" approximation of the Hessian instead of a diagonal approximation. I thought this was very cool and complimented him after his talk. Here is what he told me: "your thing really works!"

Yarin then talks about Radford Neal's Bayesian neural nets using Hamiltonian Monte Carlo and such. I was very impressed by the fact that Radford managed to win a competition (one of the first ones organized by Isabelle Guyon) using these Bayesian NNs. But this came at a time when the community had already moved away from neural nets and towards SVMs and graphical models.

In late 1989, Vapnik had joined our group at Bell Labs, and our work on fundamental questions in learning and generalization moved toward his "statistical learning theory" framework, and away from Bayesian and statistical physics methods.

I was a Bayesian before I even knew what being Bayesian meant. With Vapnik's proximity, I was also a frequentist before I knew what being a frequentist meant.

Zoubin Ghahramani's invited talk at the NIPS 2016 workshop on Bayesian deep learning retraces this whole history. Slides here: https://bayesiandeeplearning.org/slides/nips16bayesdeep.pdf
Wonderful position paper by Max Welling entitled
"Are Machine Learning and Statistics Complementary?"
presented at the Roundtable Discussion at the 6th IMS-ISBA meeting on "Data Science in the next 50 years".

Max makes excellent points about the relationship between ML and statistics. Until the deep learning tsunami, one of the dominant paradigms in ML was rooted in Bayesian statistics (graphical models, non-parametric Bayesian methods, etc.).

Max points out that the emergence of deep learning is moving ML away from statistics. As many people (me included) have pointed out, he says that the success of deep learning can be explained by 3 factors:
1. faster computers
2. larger datasets
3. the use of very large models with very large numbers of parameters.

I would add:
4. the use of models with non-trivial representational power.

The criteria (1,2,3) without (4) would apply to things like massive logistic regression on sparse data, which has been a dominant approach to problems like ad ranking.

Here is my favorite quote: "Increasingly, the paradigm in deep learning seems to be: collect a (massive) dataset, determine the cost function you want to minimize, design a (massive) neural network architecture such that gradients can propagate “end-to-end”, and finally apply some version of stochastic gradient descent to minimize the cost until time runs out. Like it or not, the surprising fact is that nothing out there seems to beat this paradigm in terms of prediction."
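
The recipe in that quote fits in a few lines of code. Here is a toy sketch of it (the data, network size, and hyperparameters are invented for illustration, and PyTorch is just one convenient way to write it; none of this comes from Max's paper):

import torch
from torch import nn, optim

# toy data standing in for the "(massive) dataset"
X = torch.randn(512, 10)
y = (X.sum(dim=1, keepdim=True) > 0).float()

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))   # the "(massive) architecture"
loss_fn = nn.BCEWithLogitsLoss()                                        # the cost function to minimize
opt = optim.SGD(model.parameters(), lr=0.1)                             # some version of SGD

for step in range(1000):        # "until time runs out"
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()             # gradients propagate end-to-end
    opt.step()

print(float(loss))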

But he correctly points out that there are many applications for which one still needs to: (1) deal with small sample sizes, (2) marginalize over latent variables, (3) compute error bars, (4) establish causal relationships, and (5) produce "explanations" for decisions. Perhaps one of the best demonstrations of this is Léon Bottou's keynote talk at ICML 2015, in which he said that many large-scale ML problems break the classical paradigm of "training set, validation set, test set with IID samples" (video here: https://videolectures.net/icml2015_bottou_machine_learning/ ).

Despite my deep involvement in deep learning (haha), I agree wholeheartedly with Max and Léon.

Also, I'm a big fan of the conceptual framework of factor graphs as a way to describe learning and inference models. But I think in their simplest/classical form (variable nodes, and factor nodes), factor graphs are insufficient to capture the computational issues. There are factors, say between two variables X and Y, that allow you to easily compute Y from X, but not X from Y, or vice versa. Imagine that X is an image, Y a description of the image, and the factor contains a giant convolutional net that computes a description of the image and measures how well Y matches the computed description. It's easy to infer Y from X, but essentially impossible to infer X from Y. In the real world, where variables are high dimensional and continuous, and where dependencies are complicated, factors are directional.
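
Here is a toy numerical illustration of that asymmetry (everything below, including the tiny stand-in "network", is invented for the sketch; it is not code from any factor-graph library):

import numpy as np

rng = np.random.default_rng(0)
D_X, D_Y = 256, 16                                    # X is high dimensional, Y is a short "description"
W = rng.standard_normal((D_Y, D_X)) / np.sqrt(D_X)    # stands in for the giant convolutional net

def describe(x):
    # the "computed description" of X
    return np.tanh(W @ x)

def factor(x, y):
    # factor value: how badly Y matches the computed description of X (0 = perfect match)
    return float(np.sum((describe(x) - y) ** 2))

x_true = rng.standard_normal(D_X)                     # think: an image

# Direction 1: inferring Y from X is a single forward pass.
y = describe(x_true)

# Direction 2: inferring X from Y means searching a 256-dimensional space,
# e.g. by gradient descent on the factor, and the problem is ill-posed
# (many different X produce the same description).
x_hat = rng.standard_normal(D_X)
lr = 0.3
for _ in range(5000):
    h = np.tanh(W @ x_hat)
    grad = 2.0 * W.T @ ((h - y) * (1.0 - h ** 2))     # analytic gradient of the factor w.r.t. x_hat
    x_hat -= lr * grad

print("factor(x_true, y):", factor(x_true, y))        # exactly 0 by construction
print("factor(x_hat,  y):", factor(x_hat, y))         # typically driven close to 0
print("||x_hat - x_true||:", float(np.linalg.norm(x_hat - x_true)))   # but x_hat is nowhere near x_true

The forward direction is one function evaluation; the reverse direction needs an iterative search and, even when it matches the description, it does not recover the original X. That is the directionality a plain factor node does not express.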

Hey Michael Jordan, it would be interesting to hear your comments on Max's piece.

Link to Max's original post: https://www.facebook.com/max.welling.10/posts/972944266111442?pnref=story
Stéphane Mallat's tutorial at the "Statistical Physics and Machine Learning back Together" summer school in Cargese, Corsica.

There is a long history of theoretical physicists (particularly condensed matter physicists) bringing ideas and mathematical methods to machine learning, neural networks, probabilistic inference, SAT problems, etc.

In fact, the wave of interest in neural networks in the 1980s and early 1990s was in part caused by the connection between spin glasses and recurrent nets popularized by John Hopfield. While this caused some physicists to morph into neuroscientists and machine learners, most of them left the field when interest in neural networks waned in the late 1990s.

With the prevalence of deep learning and all the theoretical questions that surround it, physicists are coming back!

Many young physicists (and mathematicians) are now working on trying to explain why deep learning works so well. This summer school is for them.

We need to find ways to connect this emerging community with the ML/AI community. It's not easy because (1) papers submitted by physicists to ML conferences rarely make it because of a lack of qualified reviewers; (2) conference papers don't count in a physicist's CV.

https://cargese.krzakala.org