Dan Kondratyuk

Email · Google Scholar · LinkedIn · GitHub

I am a machine learning researcher who has built large-scale multimodal systems, spanning LLMs and diffusion models for joint image, video, audio, and text generation.

My work encompasses training foundation models at Luma AI and Google Research, as well as exploring efficient training and inference techniques for video models. I co-developed the generative video models powering Dream Machine, Luma AI's core product, and led a team building realtime interactive video models. I am the first author and lead contributor of VideoPoet (2023; ICML 2024 Best Paper), a multimodal LLM for video and audio generation. I created MoViNets (2021), a family of efficient video networks enabling real-time video action recognition on mobile devices. Earlier, I built UDify (2019), a single multilingual model capable of parsing the syntax of 75 languages at once.

At Luma AI I led the World Models team, which focused on realtime interactive generative video, including joint video-audio generation, speech-to-video generation for video avatars, and realtime player and camera control. Prior to Luma, I spent five years at Google Research, starting as an AI Resident and advancing to Senior Machine Learning Engineer. There I also worked on Integrated Multimodal Perception (NeurIPS 2023), a Mixture-of-Experts model that efficiently combines image, video, audio, and text modalities, as well as research on efficient model ensembling.

I hold an M.S. in Computational Linguistics from Charles University in Prague, supported by an Erasmus Mundus scholarship. My thesis on multilingual dependency parsing received the Best Master’s Thesis award from the Mathematics & Physics department. I completed my B.S. at Boise State University, graduating Summa Cum Laude and receiving the Outstanding Student Award in Computer Science.

Selected Publications

My research spans multimodal generative models, efficient video understanding, and multilingual NLP. For a complete list, see my Google Scholar (1,500+ citations, h-index 9).

VideoPoet: A Large Language Model for Zero-Shot Video Generation

D. Kondratyuk, L. Yu, X. Gu, J. Lezama, J. Huang, R. Hornung, H. Adam, H. Akbari, Y. Alon, V. Birodkar, et al.

ICML, 2024 · 448 citations · Best Paper

A multimodal large language model for zero-shot video and audio generation from a wide variety of conditioning signals including images, videos, text, and audio.

MoViNets: Mobile Video Networks for Efficient Video Recognition

D. Kondratyuk, L. Yuan, Y. Li, L. Zhang, M. Tan, M. Brown, B. Gong

CVPR, 2021 · 396 citations

A family of computation- and memory-efficient video networks that operate on streaming video for online inference, achieving state-of-the-art accuracy on Kinetics while requiring 80% fewer FLOPs.

75 Languages, 1 Model: Parsing Universal Dependencies Universally

D. Kondratyuk, M. Straka

EMNLP, 2019 · 355 citations

A multilingual multi-task model that accurately predicts universal part-of-speech, morphological features, lemmas, and dependency trees for all 124 Universal Dependencies treebanks across 75 languages using a single fine-tuned multilingual BERT.

Wisdom of Committees: An Overlooked Approach To Faster and More Accurate Models

X. Wang, D. Kondratyuk, KM Kitani, Y. Movshovitz-Attias, E. Eban

arXiv, 2021 · 92 citations

Shows that committees of models can achieve faster and more accurate predictions than single large models.

When Ensembling Smaller Models is More Efficient than Single Large Models

D. Kondratyuk, M. Tan, M. Brown, B. Gong

arXiv, 2020 · 52 citations

Demonstrating that ensembles of smaller models can be more computationally efficient while matching or exceeding the accuracy of a single large model.

LemmaTag: Jointly Tagging and Lemmatizing for Morphologically Rich Languages with BRNNs

D. Kondratyuk, T. Gavenčiak, M. Straka, J. Hajič

EMNLP, 2018 · 48 citations

A featureless neural network architecture that jointly generates part-of-speech tags and lemmas using bidirectional RNNs, surpassing the state of the art in Czech, German, and Arabic.

Cross-Lingual Lemmatization and Morphology Tagging with Two-Stage Multilingual BERT Fine-Tuning

D. Kondratyuk

SIGMORPHON, 2019 · 44 citations

Leveraging multilingual BERT with two-stage fine-tuning for cross-lingual morphology tagging and lemmatization, achieving the highest average accuracy in the SIGMORPHON 2019 shared task.

Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception

H. Akbari, D. Kondratyuk, Y. Cui, R. Hornung, H. Wang, H. Adam

NeurIPS, 2023 · 33 citations

A scalable multimodal multi-task approach combining alternating gradient descent and mixture-of-experts, achieving a new state of the art in zero-shot video classification on Kinetics-400/600/700.

Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text

Q. Li, B. Gong, Y. Cui, D. Kondratyuk, X. Du, MH Yang, M. Brown

arXiv, 2021 · 28 citations

Jointly pre-training transformers on unpaired images and text towards a single unified foundation model.

MALT Diffusion: Memory-Augmented Latent Transformers for Any-Length Video Generation

S. Yu, M. Hahn, D. Kondratyuk, J. Shin, A. Gupta, J. Lezama, I. Essa, D. Ross, et al.

arXiv, 2025 · 7 citations

Memory-augmented latent transformers enabling consistent, high-quality video generation at arbitrary lengths.

CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers

A. Marmon, G. Schindler, J. Lezama, D. Kondratyuk, B. Seybold, I. Essa

arXiv, 2024 · 6 citations

Camera-aware image-to-video generation using multimodal transformers for controllable viewpoint synthesis.

Replacing Linguists with Dummies: A Serious Need for Trivial Baselines in Multi-Task Neural Machine Translation

D. Kondratyuk, R. Cardenas, O. Bojar

PBML, 2019 · 3 citations

Highlighting the serious need for trivial baselines when evaluating multi-task neural machine translation systems.

Improved Immiscible Diffusion: Accelerate Diffusion Training by Reducing Its Miscibility

Y. Li, F. Liang, D. Kondratyuk, M. Tomizuka, K. Keutzer, C. Xu

arXiv, 2025 · 1 citation

Accelerating diffusion model training by reducing miscibility of noise and data distributions.

Multilingual Learning Using Syntactic Multi-Task Training

D. Kondratyuk

M.S. Thesis, Charles University, 2019 · Best Thesis Award

A multilingual multi-task framework for syntactic parsing across 75 languages, awarded Best Master's Thesis by the Mathematics & Physics department.