SCORE: Support-Constrained Off-Domain Real-world Exploration

The Problem

Policy improvement in simulation doesn't always transfer to the real world

Learning from real data. Left: collecting real-world demonstrations is slow and expensive. Right: the policy inherits the demonstrator’s flaws.

Improving policies in simulation. Optimizing freely in simulation with RL exploits the simulator and transfers unsafely to the real world.

Regularization tradeoff. Too much regularization toward real-world demos keeps the base policy’s failures. Too little lets the policy exploit the simulator.

The key insight

Policy improvement in simulation should be constrained to the support of the real-world base policy.

Method

We learn to steer the base policy in simulation, improving within its support.

SCORE in three steps: pretrain a base policy on real demos, steer it in simulation within the prior’s support using only a **sparse task reward**, then deploy directly on the robot. Panels: base policy rollout in the real world; flow steering in massively parallel simulation.

Try it on any task

Watch a single policy go from real demonstrations, to steering in simulation, to deployment on the robot.

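[Interactive demo. Diagram: the steering policy π_steer(z|o) provides latent conditioning; its latent z is passed to π_base, which outputs action a from the real-world prior. Legend for base-policy actions: off-target, high-reward, slow. Chart: real-world success rate, Base vs. SCORE, per task.]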

Beyond the Benchmark

SCORE handles more than the eight tasks.

Continuous operation

Running continuously, the policy picks up cubes one by one and drops them into the basket. The base policy misses most of them and leaves the basket nearly empty, while SCORE grasps reliably and fills it.

5× speed

Fast to iterate on new tasks

Adding a new task is fast: under half a day from collecting demonstrations to deploying a SCORE policy. In both examples below, the base policy cannot recover once a grasp fails, while SCORE retries until it succeeds.

Open a Clorox bottle

✗ Base (1× speed): can’t recover once the grasp slips

✓ SCORE (1× speed): retries until the cap comes off

Take a book off the shelf

✗ Base (2.5× speed): pulls the book too far, so it falls

✓ SCORE (2× speed): hooks the book and pulls it out cleanly

Why You Shouldn’t Optimize Freely in Simulation

It works in sim, then breaks on hardware.

With unconstrained RL, the policy maximizes reward by exploiting the simulator. The resulting grasps are contorted and high-force: they achieve high reward in simulation but become erratic or dangerous on the real robot.

Push Soccer Ball. Sim: high reward → Real: snaps the thumb into the table

Grasp Bottle. Sim: high reward → Real: excessive, unstable force

Screw Lightbulb. Sim: high reward → Real: nearly topples the light stand

Pick Credit Card. Sim: high reward → Real: large force, fails completely

It can damage the hardware

Repeated at this force, these grasps eventually broke one of the fingers on our robot hand.

The Distributional Constraint Tradeoff

Constraining a policy toward the base trades improvement for transferability.

A common way to keep a simulation policy deployable is to regularize it toward the base policy with a behavior-cloning (BC) loss, but this introduces a tradeoff between improvement and deployment. In our paper, we show that this is a provable limitation of algorithms that limit deviation from the base policy’s distribution, such as BC-PPO (RialTo) or residual RL.
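The tradeoff can be seen in a toy one-dimensional problem. The sketch below is illustrative only and is not the BC-PPO objective from RialTo or the paper: it assumes a quadratic simulated reward peaked at a hypothetical exploit action `A_SIM_OPT`, a quadratic BC penalty toward a hypothetical base action `A_BASE`, and a coefficient `bc_coef`. Under those assumptions the regularized optimum has a closed form, and sweeping the coefficient shows that every gain in simulated reward is paid for with drift away from the base policy.

```python
# Toy 1-D illustration; every number and name here is an assumption of this sketch.
A_BASE = 0.0     # action the base policy prefers (assumed to transfer to hardware)
A_SIM_OPT = 1.0  # action the simulator rewards most (assumed to be an exploit)


def sim_reward(a: float) -> float:
    """Quadratic simulated reward, maximal at A_SIM_OPT."""
    return -(a - A_SIM_OPT) ** 2


def regularized_optimum(bc_coef: float) -> float:
    """Maximizer of sim_reward(a) - bc_coef * (a - A_BASE)**2.

    Setting the derivative to zero gives
    a = (A_SIM_OPT + bc_coef * A_BASE) / (1 + bc_coef).
    """
    return (A_SIM_OPT + bc_coef * A_BASE) / (1.0 + bc_coef)


def drift_from_base(a: float) -> float:
    """Distance between the chosen action and the base action."""
    return abs(a - A_BASE)


for coef in [0.0, 0.1, 1.0, 10.0, 100.0]:
    a = regularized_optimum(coef)
    print(
        f"bc_coef={coef:6.1f}  action={a:.3f}  "
        f"sim_reward={sim_reward(a):.3f}  drift={drift_from_base(a):.3f}"
    )
```

In this toy, no coefficient gives both maximal simulated reward and zero drift, which is the shape of the tradeoff described above.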

[Figure: success rate of the RialTo baseline under varying BC-PPO loss coefficients. x-axis: BC loss coefficient. Series: sim success, real-world success. Annotated region: training collapses.]

Too loose to learn anything: the policy collapses in simulation.

On real hardware

Even with the BC constraint in place, the policy settles on behavior that is unsafe or unreliable once deployed.

The policy learns to drag its fingers across the table to earn reward, a motion that would be hazardous on the real robot.

A swooping reach that ends in an unstable pinch.

A clear case of reward-hacking: rewarded whenever the centroid lies inside the pegs’ box, the policy balances the plate on the edge and leaves it there instead of seating it.
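To make the loophole concrete, here is a minimal sketch of a centroid-only sparse reward of the kind described. The names `Box2D`, `centroid_only_reward`, and `seated_reward`, along with the box dimensions and tilt threshold, are hypothetical and not taken from the paper. The second function is one assumed way to tighten the check; it is shown only to contrast what the centroid test fails to see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box2D:
    """Axis-aligned box in the table plane (hypothetical helper)."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def centroid_only_reward(centroid_xy, pegs_box: Box2D) -> float:
    """Sparse reward that looks only at where the plate centroid is."""
    x, y = centroid_xy
    return 1.0 if pegs_box.contains(x, y) else 0.0


def seated_reward(centroid_xy, tilt_rad, pegs_box: Box2D, max_tilt_rad=0.1) -> float:
    """Assumed stricter variant: the plate must also lie nearly flat."""
    inside = centroid_only_reward(centroid_xy, pegs_box) == 1.0
    flat = abs(tilt_rad) <= max_tilt_rad
    return 1.0 if inside and flat else 0.0


# A plate balanced on its edge: centroid inside the box, but steeply tilted.
pegs_box = Box2D(-0.05, 0.05, -0.05, 0.05)
balanced_on_edge = {"centroid_xy": (0.0, 0.01), "tilt_rad": 1.2}

print(centroid_only_reward(balanced_on_edge["centroid_xy"], pegs_box))  # 1.0
print(seated_reward(**balanced_on_edge, pegs_box=pegs_box))             # 0.0
```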

Flow Steering as Support-Constrained RL

SCORE optimizes for reward while staying within the support of realistic base-policy behaviors.

The support of π_base is the model-induced set of actions the base policy can produce for a given latent z. A support constraint keeps optimization within that set. Using flow steering, SCORE optimizes over z and climbs toward higher reward, but stays within the prior’s support, allowing for better sim-to-real transfer.
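The idea of optimizing over z with the base policy frozen can be sketched in a few lines. This is a toy under stated assumptions, not SCORE’s implementation: `pi_base` is a hypothetical fixed map standing in for the pretrained flow policy, `sparse_reward` is a made-up success check, and a grid search over z stands in for the learned steering policy π_steer(z|o), which the page describes as trained with RL in simulation. The point it shows is structural: every action the search can return is an output of the frozen base policy.

```python
import numpy as np

# Fixed weights of the toy base policy; they are never updated during steering.
W_OBS = np.array([[0.5, -0.3, 0.2],
                  [0.1, 0.4, -0.6]])


def pi_base(obs: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Hypothetical frozen base policy mapping (observation, latent) to an action."""
    return np.tanh(W_OBS @ obs + z)


def sparse_reward(action: np.ndarray, target: np.ndarray, tol: float = 0.2) -> float:
    """Made-up sparse task reward: 1 if the action is close enough to the target."""
    return 1.0 if np.linalg.norm(action - target) < tol else 0.0


def steer(obs: np.ndarray, target: np.ndarray, z_bound: float = 2.0, n_per_axis: int = 21):
    """Search over latents z only; grid search stands in for a learned pi_steer(z|o)."""
    grid = np.linspace(-z_bound, z_bound, n_per_axis)
    zs = np.array([[z0, z1] for z0 in grid for z1 in grid])
    actions = np.array([pi_base(obs, z) for z in zs])
    rewards = np.array([sparse_reward(a, target) for a in actions])
    best = int(np.argmax(rewards))
    return zs[best], actions[best], float(rewards[best])


obs = np.array([0.1, -0.2, 0.3])
# A rewarded behavior that the base policy can produce, but only for a specific latent.
target = pi_base(obs, np.array([0.73, -1.31]))

z, action, reward = steer(obs, target)
print("chosen latent:", z)
print("steered action:", action, "reward:", reward)
```

Because the search variable is z rather than the action itself, the steered action is always `pi_base(obs, z)` for some z, so it cannot leave the set of actions the base policy can generate.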

Each point is an action that π_base can generate. A distributional (BC) constraint trades off transferability for reward, drifting out of support and eventually collapsing to unconstrained RL. SCORE learns high-reward actions inside the support.

Why it works

Behaviors in the support of the base policy can be transferred to the real world. Flow steering optimizes performance within the base policy’s support, so improvement in simulation transfers to the real robot.
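One way to write the contrast between the two kinds of constraint is sketched below. The notation is generic and assumed for illustration; it is not the paper’s exact formulation. Here $J_{\mathrm{sim}}$ denotes expected return in simulation, $D$ is any divergence used as a BC-style penalty, and $\lambda$ is its coefficient.

```latex
\begin{align*}
\text{distributional constraint:}\quad
  & \max_{\pi}\; J_{\mathrm{sim}}(\pi)
    - \lambda\, \mathbb{E}_{o}\!\left[ D\big(\pi(\cdot \mid o)\,\|\,\pi_{\mathrm{base}}(\cdot \mid o)\big) \right] \\
\text{support constraint:}\quad
  & \max_{\pi}\; J_{\mathrm{sim}}(\pi)
    \quad \text{s.t.}\quad
    \operatorname{supp}\,\pi(\cdot \mid o) \subseteq \operatorname{supp}\,\pi_{\mathrm{base}}(\cdot \mid o)
    \;\; \forall o \\
\text{steering parameterization:}\quad
  & a = \pi_{\mathrm{base}}(o, z), \qquad z \sim \pi_{\mathrm{steer}}(z \mid o)
\end{align*}
```

Under the steering parameterization, every action is an output of $\pi_{\mathrm{base}}$ for some latent $z$, so the support constraint holds by construction rather than through a penalty weight.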

How does the base policy affect improvement?

SCORE goes a long way, as long as the behavior lies in the support of the base policy.

One policy across tasks

One policy, trained with SCORE on three tasks: credit card, cube, and bottle. It picks the correct grasp for each object, and even learns to reuse behaviors across tasks.

One multi-task policy, borrowing behaviors across tasks

credit card pinch behavior

bottle grasp behavior

cube placed beyond training randomization

The same cube is grasped two different ways: a credit-card pinch and a bottle grasp, both behaviors the policy learned on other objects. By reusing skills across tasks, one policy can still grasp the cube from placements outside its own training range.

The base policy mixes them up

✗ Credit card: uses bottle grasp mode instead

✗ Bottle: uses card pinch mode instead

The base policy swaps the two grasps: each object gets the grip meant for the other.

A new object: bottle → carrot

We take a frozen bottle-grasp policy and use SCORE in simulation to grasp a carrot, an object it never trained on. The carrot is thinner and needs a precise pinch that the bottle base policy produces only rarely.

SCORE (bottle)

Reference: trained with SCORE on the bottle and then run on the carrot, the policy still reaches as if the carrot were a bottle (22% success).

SCORE (carrot)

✓ SCORE behavior: trained with SCORE on the carrot, the policy surfaces the rare in-support pinch and grasps the carrot (67% success).

Adding distractors

The bottle policy was trained with no distractor objects. We add two distractor cubes and apply SCORE to grasp around them, recovering a working grasp on hardware, but only when the bottle sits on one side of the workspace.

Base (with distractor)

✗ Base policy: with a distractor, the base bottle policy collapses (0% success).

SCORE (with distractor)

~ SCORE behavior: retrained in sim with the distractors, SCORE recovers a grasp over part of the workspace (56% success). Because it was trained without distractors, the base policy never learned the visual features that separate the cubes from the bottle, which makes it much harder to steer in the new setting.

The limit

What limits SCORE is the base policy itself. The broader its coverage, the further it can go.

Takeaway

Broader priors lead to better improvement.

Given a sparse reward function and half a day of training, SCORE learns fast, precise, and robust policies that transfer to the real world. This is made possible by constraining policy improvement to the support of the real-world prior. A natural next step is to build broader behavior priors and datasets designed for steering.

BibTeX

@misc{yu2026score,
    title  = {SCORE: Support-Constrained RL Enables Real-World Policy Improvement without Real-World Experience},
    author = {Yu, Raymond and Huey, William and Mukadam, Mustafa and Nagabandi, Anusha and Gupta, Abhishek},
    year   = {2026}
}