Semantic World Modeling

1University of Washington 2Sony AI
Teaser

Abstract

Planning with world models offers a powerful paradigm for robotic control. Conventional approaches train a model to predict future frames conditioned on current frames and actions, which can then be used for planning. However, the objective of predicting future pixels is often at odds with the actual planning objective; strong pixel reconstruction does not always correlate with good planning decisions. We posit that instead of reconstructing future frames as pixels, world models only need to predict task-relevant semantic information about the future. To do this, we pose world modeling as a visual question answering problem, about semantic information in future frames. This perspective allows world modeling to be approached with the same tools underlying vision language models. We show how vision language models can be trained as "semantic world models" through a supervised finetuning process on image-action-text data, enabling planning for decision-making while inheriting many of the generalization and robustness properties from the pretrained vision-language models. We demonstrate how such a semantic world model can be used for policy improvement on open-ended robotics tasks, leading to significant generalization improvements over typical paradigms of reconstruction-based action-conditional world modeling.

Try out Semantic World Modeling yourself!

  1. Load an example: Click the "Load Random Example" button to sample a random initial state.
  2. Draw actions: Click and drag on the initial state image to draw a trajectory. Each point is converted into an action.
  3. Submit actions: Click "Submit Actions" to see what the final state will look like after executing your drawn actions.
  4. Ask questions: Type a question about the future state (e.g., "Is the red pentagon touching the red star?") or click one of the suggested sample questions.
  5. View predictions: The model will predict the answer with a confidence visualization showing how the answer changes from initial to final state.
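Step 2's point-to-action conversion can be sketched as follows. The delta-action encoding and the `max_step` clipping bound are illustrative assumptions, not the demo's actual implementation:

```python
# Hypothetical sketch of converting a drawn trajectory into actions.
import numpy as np

def points_to_actions(points, max_step=0.1):
    """Convert a list of (x, y) image points into 2-D delta actions.

    Each consecutive pair of points becomes one displacement,
    clipped to a per-step magnitude bound.
    """
    pts = np.asarray(points, dtype=float)
    deltas = np.diff(pts, axis=0)                # displacement between clicks
    return np.clip(deltas, -max_step, max_step)  # bound each action component

# Example: a three-point drag produces two actions.
actions = points_to_actions([(0.0, 0.0), (0.05, 0.0), (0.3, 0.1)])
```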

Method Overview

Pipeline Overview

Overview of Semantic World Models: Unlike traditional world models, Semantic World Models (SWMs) answer questions about the future given the current observation (an image) and a sequence of actions. Just as vision language models are trained by aligning image tokens with a language model, we align action tokens with a vision language model so that it can answer questions about the future.


Our key insights are:

  • Future question answering offers a framework for training world models that can be adapted to planning. It also provides a simple, generalizable way to specify goals for planning.
  • Adapting vision language models (VLMs) lets this framework leverage the internet-scale pretraining of VLMs. Because VLMs are pretrained on question answering, they transfer naturally to future question answering.

Results

Policy Improvement

Improvement Results

For more complicated tasks, we consider a scenario in which a base policy generates a candidate trajectory that is then refined using the SWM and gradient-based optimization. As shown below, our method both refines the candidate trajectory to improve base-policy performance and outperforms the baselines.
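A minimal sketch of this refinement loop, assuming access to a scalar SWM score such as log p("yes" | observation, actions, goal question). The quadratic stand-in score, finite-difference gradient, and step size are illustrative assumptions (in practice an autodiff framework would supply the gradient):

```python
# Sketch of SWM-guided trajectory refinement via gradient ascent.
import numpy as np

def swm_score(actions, target):
    # Stand-in for the SWM's log-probability of answering "yes";
    # peaks when the action sequence matches `target`.
    return -np.sum((actions - target) ** 2)

def refine(actions, target, lr=0.1, steps=100, eps=1e-4):
    """Gradient-ascent refinement of a base policy's candidate actions."""
    actions = actions.copy()
    for _ in range(steps):
        grad = np.zeros_like(actions)
        # Finite-difference gradient of the score w.r.t. each action component.
        for i in np.ndindex(actions.shape):
            bumped = actions.copy()
            bumped[i] += eps
            grad[i] = (swm_score(bumped, target) - swm_score(actions, target)) / eps
        actions += lr * grad
    return actions

base = np.zeros((4, 2))         # candidate trajectory from the base policy
target = np.full((4, 2), 0.5)   # actions the goal question implicitly rewards
refined = refine(base, target)
```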

Sampling Based Planning

We evaluate our Semantic World Model as a world model for control using the Model Predictive Path Integral (MPPI) planner across LangTable and OGBench tasks. SWM enables direct planning in semantic space—without pixel-level reconstruction—achieving near-perfect success rates on reaching and block separation tasks. While MPPI planning is computationally intensive for complex domains, these results demonstrate that SWM can effectively guide low-level control through high-level semantic reasoning.

LangTable Reaching (100% success rate)

LangTable Separation (100% success rate). Goal: separate the star blocks

OGBench Reaching (97% success rate)
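A toy version of the MPPI loop above, assuming the SWM exposes a scalar reward for a candidate action sequence (e.g. the probability of answering "yes" to the goal question). The dynamics-free scoring function and all hyperparameters below are stand-ins for illustration:

```python
# Toy MPPI planner scoring action sequences with a stand-in SWM reward.
import numpy as np

rng = np.random.default_rng(0)

def swm_reward(actions, goal):
    # Stand-in: reward is high when the summed displacement reaches `goal`.
    return -np.linalg.norm(actions.sum(axis=0) - goal)

def mppi(goal, horizon=5, act_dim=2, samples=256, iters=20,
         noise=0.2, temperature=0.2):
    mean = np.zeros((horizon, act_dim))
    for _ in range(iters):
        # Sample perturbed action sequences around the current mean plan.
        eps = rng.normal(scale=noise, size=(samples, horizon, act_dim))
        cand = mean + eps
        rewards = np.array([swm_reward(c, goal) for c in cand])
        # Softmax-weighted average of the candidates (the MPPI update).
        w = np.exp((rewards - rewards.max()) / temperature)
        mean = (w[:, None, None] * cand).sum(axis=0) / w.sum()
    return mean

goal = np.array([1.0, -0.5])
plan = mppi(goal)
```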

Multi Step Tasks

Multi Step Results

We extend planning to long-horizon problems by chaining short subgoals whose completion is verified by the SWM. For each subgoal, we ask a simple yes/no question (e.g., "Is the peg touching the red cube?") without action conditioning, and transition once the model predicts the subgoal is complete. Below are example multi-step tasks:


Red pentagon and green cube to blue moon, and yellow star to blue cube

Yellow star to blue cube, yellow pentagon to red moon
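The subgoal-chaining loop described above can be sketched as follows. The verifier `swm_answer` and planner `plan_step` are illustrative stubs, not the paper's implementation:

```python
# Sketch of multi-step planning: advance when the SWM verifies each subgoal.

def run_subgoals(obs, subgoals, swm_answer, plan_step, max_steps=50):
    """Chain short-horizon plans; move on when the SWM answers "yes"."""
    for question in subgoals:
        for _ in range(max_steps):
            if swm_answer(obs, question):   # e.g. "Is the peg touching the red cube?"
                break                       # subgoal done; go to the next one
            obs = plan_step(obs, question)  # plan and execute one short segment
        else:
            return False                    # subgoal never verified
    return True

# Toy rollout: the observation is a counter; each subgoal is "reached" at a threshold.
done = run_subgoals(
    obs=0,
    subgoals=[3, 5],                        # stand-ins for yes/no questions
    swm_answer=lambda obs, q: obs >= q,
    plan_step=lambda obs, q: obs + 1,
)
```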

Model Architecture

Architecture Figure

Semantic World Model Architecture: We design our model to answer questions about future events conditioned on actions. Since this is fundamentally a visual question-answering task with action conditioning, we bootstrap from a large pretrained VLM to transfer its generalization capabilities to robotics tasks. Our architecture is based on PaliGemma (3B parameters), which contains three core pretrained components: a Gemma language model, a SigLIP vision encoder, and projection matrices. To condition the model on actions, we introduce a new projection matrix that maps actions into the language model's token embedding space, much as image tokens are projected. This lets the model capture environment dynamics in language space without pixel-level reconstruction, making it efficient for planning: candidate action sequences can be evaluated directly against desired outcomes.
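The action-conditioning pathway can be sketched as a learned linear projection from actions to the language model's token-embedding space, mirroring how image patches become tokens. All dimensions and the random stand-in encoders below are illustrative, not PaliGemma's exact configuration:

```python
# Sketch of projecting actions into a VLM's token-embedding space.
import numpy as np

rng = np.random.default_rng(0)

act_dim, embed_dim = 2, 2048                 # illustrative widths
W_act = rng.normal(scale=0.02, size=(act_dim, embed_dim))  # new projection matrix
b_act = np.zeros(embed_dim)

def embed_actions(actions):
    """Project a (horizon, act_dim) action sequence to token embeddings."""
    return actions @ W_act + b_act           # one token per action step

image_tokens = rng.normal(size=(256, embed_dim))   # stand-in for vision-encoder output
action_tokens = embed_actions(rng.normal(size=(5, act_dim)))
text_tokens = rng.normal(size=(12, embed_dim))     # stand-in for the tokenized question

# The language model then attends over [image, action, question] tokens jointly.
sequence = np.concatenate([image_tokens, action_tokens, text_tokens], axis=0)
```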

Ablations

Out of Distribution Performance

OOD Performance Results
  • Compositional Generalization: Introduce new colored blocks and modify color-shape pairs in LangTable
    • 28% improvement over base policies
    • Retains pretraining knowledge for compositional generalization
  • Background Robustness: Change OGBench background to novel color combinations
    • 15% improvement over base policy
    • Successfully generalizes to new visual conditions

OOD Language Table environment: Yellow moon to purple cube.

Learning from Negative Data


| Dataset Type | LangTable (Expert Data) | LangTable (Expert Data OOD) | OGBench (Expert Data) | OGBench (Expert Data OOD) |
|---|---|---|---|---|
| Sub-Optimal | 85.98 ± 0.33 | 81.99 ± 1.46 | 90.83 ± 0.39 | 85.56 ± 1.10 |
| Expert | 91.27 ± 0.79 | 86.49 ± 0.39 | 96.53 ± 0.13 | 87.33 ± 2.13 |
| Combined | 92.92 ± 0.34 | 88.32 ± 2.10 | 96.86 ± 0.13 | 88.16 ± 1.54 |

Training on combined sub-optimal and expert data yields the best performance in both in-domain and out-of-distribution settings, demonstrating the value of learning from data of diverse quality. The sub-optimal LangTable dataset consists of trajectories generated by sampling uniformly from the action space; for OGBench we used its noisy-expert data.

Visualization of Attention Maps


Visualization of attention maps from an intermediate layer of the SWM, overlaid on the current observation. The model is prompted with the question "Is the red moon touching the blue cube?".


BibTeX

@misc{berg2025semanticworldmodels,
  title={Semantic World Models}, 
  author={Jacob Berg and Chuning Zhu and Yanda Bao and Ishan Durugkar and Abhishek Gupta},
  year={2025},
  eprint={2510.19818},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.19818}, 
}