A variational approach to the information bottleneck principle, providing a tractable, deep-learning-compatible framework for learning compressed, relevant representations.
Joshua V. Dillon
I am a machine learning researcher based in Mountain View, CA.
My work spans generative AI and information theory and includes training foundational models at Google and Luma as well as academically notable research in probabilistic modeling, variational inference, and uncertainty estimation. I created TensorFlow Probability (2017; 4.4k GitHub stars and still on the order of 1M downloads per month). I co-created the prototype that would become Veo (2024). I contributed to Gemini (2024) and VideoPoet (2023; ICML Best Paper). I designed and wrote the auction mechanism used by ContentAds (2013–2019). My best-known paper is the Deep Variational Information Bottleneck (2017), for which I co-developed the idea and math with my dear friend Alex Alemi.
I was a Staff Research Scientist in Google Research and Google DeepMind for a combined total of 13 years. Most recently, I led the foundational model pre-training team at Luma.
I received my Ph.D. from the Georgia Institute of Technology, advised by Professor Guy Lebanon. My thesis, Stochastic m-Estimators for Controlling Accuracy-Cost Tradeoffs in Machine Learning, proposed and proved statistical properties of what would much later become known as the BERT loss. I was awarded the DHS Fellowship in Data Analysis and Visual Analytics (2010–2012). I was also awarded the Marshall Sherfield Fellowship for American researchers visiting the UK (accepted to Cambridge, UK under Zoubin Ghahramani) but ultimately chose sunny California instead.
Selected Publications
My research focuses on probabilistic machine learning, variational inference, and uncertainty estimation. For a complete list, see my Google Scholar (11,600+ citations, h-index 25).
A large-scale empirical study of predictive uncertainty methods under dataset shift, finding that ensembles are the most reliable approach.
Google DeepMind's frontier multimodal model with advanced reasoning, long context, and agentic capabilities.
Using likelihood ratios to correct for background statistics, enabling reliable out-of-distribution detection with deep generative models.
A library of probability distributions and bijectors for TensorFlow, forming the foundation of TensorFlow Probability. Enables efficient, composable probabilistic computation on accelerators.
An information-theoretic analysis revealing that the standard ELBO objective is broken, and proposing principled fixes via rate-distortion theory.
A large language model capable of zero-shot video generation across a variety of video generation tasks.
Using normalizing flows to reparameterize Hamiltonian Monte Carlo, neutralizing pathological posterior geometries.
Using density of states estimation from statistical physics for reliable out-of-distribution detection.
Exploring the uncertainty properties that naturally arise from the variational information bottleneck framework.
A framework for representing sequential text using locally weighted bag of words, applied to classification, segmentation, summarization, and visualization.
A compact parameterization for Gaussian mean field posteriors that scales Bayesian neural networks efficiently.
A method for distilling ensembles into a single model while preserving the diversity of the ensemble.
Methods for visualizing the sequential structure of documents using techniques from information geometry.
The MCMC library within TensorFlow Probability, designed for modern hardware with automatic batching and XLA support.
Extending automatic differentiation variational inference to mixture posteriors for improved approximation.
A PAC-Bayes framework that narrows the empirical risk gap when the Bayesian model is misspecified.
A family of point estimators resolving the computation-accuracy tradeoff in maximum likelihood, with consistency proofs and asymptotic variance formulas.
Combining ensemble methods with self-supervised learning through weighted aggregation.
A flexible optimization framework for constraining probabilistic models with imprecise domain knowledge, applied to robust pseudo-relevance feedback for information retrieval.
Machine translation and diffusion kernels for unsupervised metric learning of text.
A flexible API for specifying joint distributions in TensorFlow Probability, enabling compositional probabilistic modeling.
Quantifying the asymptotic accuracy of generative semi-supervised learning based on stochastic composite likelihood.
Automatic methods for computing tighter bounds on the Taylor remainder series, with applications.
Examining statistical and computational tradeoffs in stochastic composite likelihood estimation.
A framework connecting sampling and compression in generative modeling.
Showing the variational information bottleneck is equivalent to a half-Bayesian treatment.
Sharp polynomial enclosures for bounding univariate functions via Taylor series.
Exploring the relationship between speed and confidence in machine learning systems.
Proposes and proves statistical properties of what would later become known as the BERT loss.