A multimodal large language model for zero-shot video and audio generation from a wide variety of conditioning signals including images, videos, text, and audio.
Dan Kondratyuk
I am a machine learning researcher who has built large-scale multimodal systems, spanning LLMs and diffusion models for joint image, video, audio, and text generation.
My work encompasses training foundation models at Luma AI and Google Research, as well as exploring efficient training and inference techniques for video models. I co-developed the generative video models powering Dream Machine, Luma AI's core product, and led a team building realtime interactive video models. I am the first author and lead contributor of VideoPoet (2023; ICML Best Paper), a multimodal LLM for video and audio generation. I created MoViNets (2021), a family of efficient mobile video networks enabling realtime video action recognition on mobile devices. Earlier, I built UDify (2019), a single multilingual model capable of parsing the syntax of 75 languages at once.
At Luma AI I also led the World Models team, which focused on realtime interactive GenAI video, including joint video-audio generation, speech-to-video avatar generation, and realtime player and camera control. Prior to Luma, I spent five years at Google Research, starting as an AI Resident and advancing to Senior Machine Learning Engineer. There I also worked on Integrated Multimodal Perception (NeurIPS 2023), a Mixture-of-Experts model that efficiently combines image, video, audio, and text modalities, as well as research on efficient model ensembling.
I hold an M.S. in Computational Linguistics from Charles University in Prague, earned through an Erasmus Mundus scholarship. My thesis on multilingual dependency parsing received the Best Master’s Thesis award from the Mathematics & Physics department. I completed my B.S. at Boise State University, graduating Summa Cum Laude and receiving the Outstanding Student Award in Computer Science.
Selected Publications
My research spans multimodal generative models, efficient video understanding, and multilingual NLP. For a complete list, see my Google Scholar (1,500+ citations, h-index 9).
A family of computation- and memory-efficient video networks that operate on streaming video for online inference, achieving state-of-the-art accuracy on Kinetics while requiring 80% fewer FLOPs.
A multilingual multi-task model that accurately predicts universal part-of-speech, morphological features, lemmas, and dependency trees for all 124 Universal Dependencies treebanks across 75 languages using a single fine-tuned multilingual BERT.
An overlooked approach showing that committees of models can achieve faster and more accurate predictions than single large models.
Demonstrating that ensembles of smaller models can be more computationally efficient while matching or exceeding the accuracy of a single large model.
A featureless neural network architecture that jointly generates part-of-speech tags and lemmas using bidirectional RNNs, surpassing the state of the art in Czech, German, and Arabic.
Leveraging multilingual BERT with two-stage fine-tuning for cross-lingual morphology tagging and lemmatization, achieving the highest average accuracy in the SIGMORPHON 2019 shared task.
A scalable multimodal multi-task approach combining alternating gradient descent and mixture-of-experts, achieving a new state of the art in zero-shot video classification on Kinetics-400/600/700.
Jointly pre-training transformers on unpaired images and text towards a single unified foundation model.
Memory-augmented latent transformers enabling consistent, high-quality video generation at arbitrary lengths.
Camera-aware image-to-video generation using multimodal transformers for controllable viewpoint synthesis.
Highlighting the serious need for trivial baselines when evaluating multi-task neural machine translation systems.
Accelerating diffusion model training by reducing the miscibility of the noise and data distributions.
A multilingual multi-task framework for syntactic parsing across 75 languages, awarded Best Master's Thesis by the Mathematics & Physics department.