TMRL

Diffusion Timestep-Modulated Pre-training Enables Exploration for Efficient Policy Fine-tuning

Matthew M. Hong¹ · Jesse Zhang¹ · Anusha Nagabandi² · Abhishek Gupta¹
¹University of Washington · ²Amazon FAR
Robotics: Science and Systems (RSS) 2026

TL;DR

A unified framework bridging behavior-cloning pre-training and reinforcement-learning fine-tuning.

Our method

Context-Smoothed Pre-training (CSP) applies a diffusion-style forward-noising schedule to policy inputs, producing a continuous spectrum from precise imitation to broad coverage. Timestep-Modulated RL (TMRL) then trains a high-level policy πHL(z, σ | s) to select σ per action chunk.

Results

TMRL outperforms baselines in simulation and fine-tunes π0 to near-perfect success on three real-world tasks: under an hour of training on WidowX, under four hours on Franka.

Motivation

The need for action distribution interpolation.

Robots trained by imitation copy demonstrations narrowly — in sparsely covered contexts the conditional support collapses to near-zero probability on optimal actions, so online rollouts yield no reward signal and RL fine-tuning stalls. We propose injecting forward-diffusion noise into policy inputs during pre-training, so the policy learns the full spectrum of distributions between the conditional p(a | c) and marginal p(a).

On a 2D pointmaze from OGBench, we train two policies — one approximating p(a | c), the other p(a) — where c is the goal position. Both train on pointmaze-large and evaluate on the larger pointmaze-giant environment, with an unseen goal location (agent in yellow, goal in pink).

Panels: p(a | c) (BC, left) · p(a) (CSP, σ = T, right)

On the left, the conditional p(a | c) collapses and drifts the agent away from the goal. On the right, the marginal p(a) covers broadly enough to reach it, but assigns only low likelihood to goal-reaching behaviors.

This motivates a policy that can interpolate between the two — and a mechanism to learn where on that interpolation to operate.

Method

Context smoothing & timestep modulation.

Figure: CSP pre-training and TMRL fine-tuning. Left, CSP (pre-training): one policy pθ is trained across all σ by corrupting the context c with qσ, the forward-diffusion noising kernel; precise imitation at σ = 0, broad coverage at σ = T. Right, TMRL (fine-tuning): the high-level policy πHL(z, σ | s) learns to modulate σ per action chunk to maximize reward.
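To make the pre-training loop concrete, here is a minimal sketch of one CSP update, assuming a variance-preserving Gaussian kernel for qσ and a generic likelihood-based BC policy; `policy.log_prob`, the batch layout, and the uniform σ sampling are illustrative assumptions, not the paper's code.

import torch

def csp_training_step(policy, batch, optimizer, T=1.0):
    """One CSP pre-training update (illustrative sketch).

    Assumes batch["context"] is the context c with shape (B, D),
    batch["state"] is proprioception, batch["actions"] is the action
    chunk, and q_sigma is a variance-preserving Gaussian kernel.
    """
    c, state, actions = batch["context"], batch["state"], batch["actions"]

    # Sample a noise level per example so a single policy is trained
    # across the whole conditional <-> marginal spectrum.
    sigma = torch.rand(c.shape[0], 1, device=c.device) * T
    s = sigma / T

    # Forward-noise the context: sigma = 0 leaves c intact (precise
    # imitation); sigma = T replaces it with pure noise (broad coverage).
    c_noised = torch.sqrt(1.0 - s**2) * c + s * torch.randn_like(c)

    # Standard BC loss, conditioned on the corrupted context and on
    # sigma itself so the policy knows how much to trust c.
    loss = -policy.log_prob(actions, context=c_noised, state=state,
                            sigma=sigma).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()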

Diffusion noise on pointcloud contexts

On the dexterous-grasping suite, the context c is the object pointcloud. As σ rises, qσ progressively corrupts c: the pointcloud slowly loses its original structure, aliasing the out-of-distribution object with nearby in-distribution pointclouds seen during training. The policy can then borrow coherent grasps from related training contexts rather than failing on an unfamiliar shape.
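Concretely, a minimal numpy sketch of the corruption, assuming qσ applies the same variance-preserving Gaussian kernel i.i.d. to each point's xyz coordinates (an illustrative choice; the paper's exact kernel and units may differ):

import numpy as np

def noise_pointcloud(points, sigma, T=1.0, rng=None):
    """Corrupt an (N, 3) pointcloud context with q_sigma.

    sigma = 0 returns the clean cloud; as sigma -> T the object's
    structure washes out, aliasing an OOD shape toward generic
    in-distribution shapes. Kernel choice is an illustrative assumption.
    """
    rng = rng if rng is not None else np.random.default_rng()
    s = sigma / T
    return np.sqrt(1.0 - s**2) * points + s * rng.standard_normal(points.shape)

# The figure's sweep below, on a placeholder cloud: progressively
# heavier corruption at sigma in {0, 0.001, 0.01, 0.1}.
clouds = {sig: noise_pointcloud(np.zeros((1024, 3)), sig)
          for sig in (0.0, 0.001, 0.01, 0.1)}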

Figure: (1) a rollout frame; noising c with qσ yields (2) the context-smoothed pointcloud, shown at σ = 0 (clean), σ = 0.001, σ = 0.01, and σ = 0.1.

Top: TMRL on a held-out drill (context c = pointcloud). Bottom: qσ removes structure as σ rises — aliasing the OOD object back to in-distribution shapes the policy has seen during training.

σ Slider

The conditional ↔ marginal spectrum.

CSP trains one policy across all σ by corrupting the context c (here, the cube goal position) with the noising kernel qσ(c̃ | c). The endpoints anchor the spectrum: σ = 0 recovers p(a | c), σ = T recovers p(a).
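In symbols, here is why the endpoints anchor the spectrum, written for a variance-preserving Gaussian kernel (an illustrative assumption; the paper's exact schedule may differ):

\begin{aligned}
q_\sigma(\tilde{c} \mid c) &= \mathcal{N}\!\Big(\tilde{c};\ \sqrt{1-(\sigma/T)^2}\,c,\ (\sigma/T)^2 I\Big),\\
\sigma = 0:&\quad \tilde{c} = c \;\Rightarrow\; p_\theta(a \mid \tilde{c}, \sigma) \approx p(a \mid c),\\
\sigma = T:&\quad \tilde{c} \sim \mathcal{N}(0, I) \text{ independent of } c
  \;\Rightarrow\; p_\theta(a \mid \tilde{c}, \sigma) \approx \int p(a \mid c)\, p(c)\, \mathrm{d}c = p(a).
\end{aligned}

At σ = T the noised context carries no information about c, so the maximum-likelihood fit averages the demonstrations over all contexts, which is exactly the marginal.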

Slider: σ = 0 (conditional) · σ = T/4 · σ = T/2 · σ = 3T/4 · σ = T (marginal). Moving the slider from σ = 0 toward σ = T shows the policy's rollouts spreading as the context (goal position) becomes more corrupted.

Simulation Results

Coverage and RL sample efficiency.

Action coverage before RL: Success@K

Success@K — the fraction of out-of-distribution states where at least one of K base-policy rollouts succeeds. On two tasks, CSP outperforms PostBC and BC for all K, with the gap widest on cube-single where both baselines remain at zero.
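As a sanity check on the metric, a minimal estimator, assuming a boolean success matrix with one row per held-out out-of-distribution state and one column per base-policy rollout (names illustrative):

import numpy as np

def success_at_k(successes: np.ndarray, k: int) -> float:
    """Success@K from a (num_states, num_rollouts) boolean matrix.

    successes[i, j] is True iff rollout j from OOD state i succeeded.
    Returns the fraction of states where at least one of the first K
    rollouts succeeds.
    """
    return float(successes[:, :k].any(axis=1).mean())

# A broad-coverage policy (CSP) sees this rise with K; a collapsed
# BC policy with near-zero support on optimal actions stays flat.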

Figure: CSP widens action coverage before RL. On cube-single, BC and PostBC stay near zero while CSP rises with K.

RL sample efficiency

We compare against several state-of-the-art baselines. TMRL outperforms or matches all baselines across four simulation tasks.

Figure: TMRL outperforms or matches baselines on four simulation tasks, approaching 100% success on all four; only TMRL reaches non-trivial success on libero-90.

Real-World RL

Context-Smoothed π0 in the real world.

< 1 hr · TMRL reaches near-perfect success in under an hour of real-robot fine-tuning on both WidowX tasks (sausage-in-pot, shrimp-in-drawer); DSRL never converges on either.
Applying CSP to a VLA: noising the VLM embedding context with qσ lets π0 sample across the conditional → marginal spectrum.

We fine-tune a context-smoothed π0 on BridgeData-v2 (WidowX 250) and DROID (Franka).
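A minimal sketch of how the smoothing could attach to the VLA, assuming the context is the VLM embedding that conditions π0's action head; `smooth_vlm_context` and the kernel form are illustrative assumptions, not π0's actual interface:

import torch

@torch.no_grad()
def smooth_vlm_context(vlm_embed: torch.Tensor, sigma: float,
                       T: float = 1.0) -> torch.Tensor:
    """Noise the VLM embedding context with q_sigma before it reaches
    the action head, so a context-smoothed pi0 can sample anywhere on
    the conditional -> marginal spectrum at rollout time.

    vlm_embed: (B, L, D) token embeddings; the kernel is an assumed
    variance-preserving Gaussian, matching the pre-training sketch.
    """
    s = sigma / T
    return (1.0 - s**2) ** 0.5 * vlm_embed + s * torch.randn_like(vlm_embed)

At deployment σ is not fixed: the high-level policy chooses it per action chunk, as the Button-Press section below shows.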

Figure: TMRL reaches near-perfect success on real-robot tasks in under an hour. Base π0 stalls: its action support is too narrow for any rollout to succeed. CSP widens that support, and TMRL leverages it to reach near-perfect success on all three tasks.

Per-task rollouts

Base π0 versus context-smoothed π0, side by side. The learning curve below traces the full RL training run.

Base π0
Context-Smoothed π0 (ours)

Figure: real-world RL learning curve for the selected task (e.g., Button-Press).

TMRL learns to be precise when it matters.

πHL treats the diffusion timestep σ as an adaptive coverage control, choosing it per action chunk. Below, the converged TMRL policy on shrimp-in-drawer keeps σ low through the precision-critical pick and raises it during reaching and placing.

“pick up the shrimp and put it into the white drawer”

Figure: rollout frames with the per-chunk σ readout (e.g., action chunk 1/10, σ = 0.01).
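Putting the hierarchy together, a minimal rollout sketch, assuming πHL emits a latent z and a timestep σ once per chunk and the context-smoothed low-level policy decodes an action chunk under that σ; `high_level.act`, `low_level.decode_chunk`, and the gym-style env interface are all illustrative assumptions:

def tmrl_rollout(env, high_level, low_level, num_chunks=10):
    """Roll out the TMRL hierarchy (illustrative interfaces).

    pi_HL picks (z, sigma) once per action chunk: low sigma through
    precision-critical phases (the pick), higher sigma where broad
    coverage helps (reaching, placing).
    """
    obs = env.reset()
    total_reward = 0.0
    for _ in range(num_chunks):
        z, sigma = high_level.act(obs)  # per-chunk coverage control
        # Decode one action chunk conditioned on the q_sigma-noised
        # context and on sigma itself, then execute it step by step.
        for action in low_level.decode_chunk(obs, latent=z, sigma=sigma):
            obs, reward, done, info = env.step(action)
            total_reward += reward
            if done:
                return total_reward
    return total_reward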

BibTeX

@inproceedings{hong2026tmrl,
  title     = {TMRL: Diffusion Timestep-Modulated Pre-training Enables Exploration for Efficient Policy Fine-tuning},
  author    = {Hong, Matthew M. and Zhang, Jesse and Nagabandi, Anusha and Gupta, Abhishek},
  booktitle = {Robotics: Science and Systems (RSS)},
  year      = {2026}
}