SCORE

Support-Constrained RL Enables Real-World Policy Improvement without Real-World Experience

Raymond Yu*¹, William Huey*¹, Mustafa Mukadam¹, Anusha Nagabandi², Abhishek Gupta¹
¹University of Washington, ²Amazon FAR, *Equal contribution

The Problem

The same simulation that improves a policy can break it.

Optimized freely in simulation, the policy solves the task, but through motions that exploit the simulator. Executed at high force on the real hand, those same motions eventually snapped a finger.

A policy learned from a handful of demonstrations can be fragile: it may grasp imprecisely, act slowly, and fail as soon as the object is disturbed. Improving policies in the real world is slow and expensive, which makes simulation an appealing alternative. However, optimizing freely in simulation lets the policy exploit any inaccuracy in the simulator to maximize reward, producing grasps that fail on, or damage, the real hardware. A distributional constraint is often used to combat this, but regularizing the policy too closely to the real-world demonstrations makes it inherit their failures. SCORE instead constrains improvement to the support of the real-world prior, moving the policy only toward behaviors that are realizable in the real world.

The key insight

Policy improvement in simulation should be constrained to the support of the real-world base policy.
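
One way to state the difference between the two kinds of constraint, as a sketch in notation assumed here rather than taken from the paper: write π_base(a ∣ o) for the real-world base policy, π for the policy improved in simulation, D for some divergence, and ε for a tolerance.

```latex
\text{distributional:}\quad
D\big(\pi(\cdot \mid o)\,\big\|\,\pi_{\text{base}}(\cdot \mid o)\big) \le \epsilon
\qquad\qquad
\text{support:}\quad
\operatorname{supp}\,\pi(\cdot \mid o) \subseteq \operatorname{supp}\,\pi_{\text{base}}(\cdot \mid o)
```

Under the support condition, π may concentrate on an action the base policy takes only rarely, as long as the base policy can take it at all. A tight divergence bound would penalize exactly that kind of concentration.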

Method

Steer through the support, don’t escape it.

89.9% average success across 8 real-world multi-fingered tasks, 2.4× over the base policy, with minimal reward engineering.

[Interactive figure: real-world success rate of the base policy vs. SCORE for each task. Diagram: πsteer(z ∣ o) outputs a latent z that conditions πbase, which outputs the action a. The real-world prior over πbase actions includes off-target, slow, and high-reward actions.]

1. Pretrain πbase in the real world. Base policy rollout in the real world, with z ∼ 𝒩(0,1) and πbase(a ∣ s, z). The base policy spreads its mass over suboptimal and slow actions, so it often fails.
2. Support-constrained RL in simulation. Support-constrained RL in massively parallel simulation, with πsteer(z ∣ o) and πbase(a ∣ s, z). Massively parallel simulation steers the policy toward high-value actions, within support.
3. Deploy directly in the real world. Steered policy deployed in the real world, with πsteer(z ∣ o) and πbase(a ∣ s, z). The steered policy is fast, precise, and robust.

SCORE in three steps: pretrain a base policy on real demos, steer it in simulation within the prior’s support using only a sparse task reward, then deploy directly on the robot.
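
The three steps can be illustrated with a toy sketch. It is not the authors' implementation: `base_policy`, `sparse_reward`, `success_rate`, and `steer` are hypothetical names, the base policy is a fixed 1-D decoder rather than a network trained on demonstrations, and a simple cross-entropy search over the latent stands in for the RL step. The point it illustrates is that steering only changes which latent z is fed to a frozen base policy, so every resulting action is one the base policy could already produce.

```python
import math
import random
import statistics


def base_policy(obs: float, z: float) -> float:
    """Hypothetical frozen base policy: decodes a latent z into a 1-D action.

    tanh bounds the output, so every action lies in [obs - 1, obs + 1].
    That interval plays the role of the prior's support in this toy.
    """
    return obs + math.tanh(z)


def sparse_reward(action: float, target: float, tol: float = 0.05) -> float:
    """Sparse task reward: 1 on success, 0 otherwise."""
    return 1.0 if abs(action - target) <= tol else 0.0


def success_rate(obs, target, mu, sigma, n=2000, seed=1):
    """Fraction of rollouts that succeed when z ~ N(mu, sigma)."""
    rng = random.Random(seed)
    hits = sum(
        sparse_reward(base_policy(obs, rng.gauss(mu, sigma)), target)
        for _ in range(n)
    )
    return hits / n


def steer(obs, target, iters=30, pop=64, seed=0):
    """Cross-entropy search over the latent (an assumed stand-in for RL).

    Only the latent distribution (mu, sigma) changes; base_policy is frozen.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # start from the prior z ~ N(0, 1)
    for _ in range(iters):
        zs = [rng.gauss(mu, sigma) for _ in range(pop)]
        elites = [z for z in zs if sparse_reward(base_policy(obs, z), target) > 0]
        if elites:  # refit to the latents that earned reward
            mu = statistics.fmean(elites)
            sigma = max(0.02, statistics.pstdev(elites))
    return mu, sigma


obs = 0.0
reachable, unreachable = 0.6, 1.5  # 1.5 lies outside [obs - 1, obs + 1]

base_rate = success_rate(obs, reachable, 0.0, 1.0)
steered_rate = success_rate(obs, reachable, *steer(obs, reachable))
blocked_rate = success_rate(obs, unreachable, *steer(obs, unreachable))
print(base_rate, steered_rate, blocked_rate)
```

In this toy, the reachable target is hit only occasionally under the prior z ∼ 𝒩(0,1) and much more often after steering, while the unreachable target is never hit: no latent maps to it, so the search receives no reward and the latent distribution never moves.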

Beyond the Benchmark

SCORE handles more than the eight benchmark tasks.

Continuous operation

Running continuously, the policy picks up cubes one by one and drops them into the basket. The base policy misses most of them and leaves the basket nearly empty, while SCORE grasps reliably and fills it.

Fast to iterate on new tasks

Adding a new task is fast: under half a day from collecting demonstrations to deploying a steered policy. In both examples below, the base policy cannot recover once a grasp fails, while SCORE retries until it succeeds.

Open a Clorox bottle
✗ Base: can’t recover once the grasp slips
✓ SCORE: retries until the cap comes off
Take a book off the shelf
✗ Base: pulls the book too far, so it falls
✓ SCORE: hooks the book and pulls it out cleanly

Why You Shouldn’t Optimize Freely in Simulation

It looks great in sim, then breaks on hardware.

With unconstrained RL, the policy maximizes reward by exploiting the simulator. The resulting grasps are contorted and high-force: they achieve high reward in simulation but become erratic or dangerous on the real robot. Each pair below shows one such unconstrained-RL policy in simulation, then the same behavior on hardware.

Push Soccer Ball
sim: high reward
real: snaps the thumb into the table
Grasp Bottle
sim: high reward
real: excessive, unstable force
Screw Lightbulb
sim: high reward
real: nearly topples the light stand
Pick Credit Card
sim: high reward
real: large force, fails completely
[Figure: FPO policy broke the robot finger]

It can damage the hardware

Repeated at this force, these grasps eventually broke one of the fingers on our robot hand.

The Distributional Constraint Tradeoff

Constraining a policy toward the base trades improvement for transferability.

A common way to keep a simulation policy deployable is to regularize it toward the base policy with a behavior-cloning (BC) loss, then tune the strength of that regularization. Across every coefficient we tried, the same tradeoff appeared. Weak regularization lets the policy drift into dangerous, non-transferable behavior, while strong regularization drags it back into the base policy’s failures. Nothing in between recovers SCORE’s performance. In our paper, we show that this is a provable limitation of algorithms that limit deviation from the base policy’s distribution, such as BC-PPO or residual RL.
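
A generic form of such a regularized objective is sketched below. The exact loss used in the experiments is not given on this page, so the symbols are assumptions: L_RL is the simulation RL loss, L_BC is a behavior-cloning loss that scores the learned policy π_θ on actions drawn from the base policy, d is whatever state distribution the BC term is evaluated on, and λ is the BC loss coefficient.

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{RL}}(\theta) \;+\; \lambda\,\mathcal{L}_{\text{BC}}(\theta),
\qquad
\mathcal{L}_{\text{BC}}(\theta) \;=\; \mathbb{E}_{o \sim d,\; a \sim \pi_{\text{base}}(\cdot \mid o)}
\big[-\log \pi_\theta(a \mid o)\big]
```

With λ = 0 this reduces to unconstrained RL in simulation, and as λ grows the BC term dominates and the policy is pulled back to imitating the base policy. Tuning λ moves between those two ends.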

[Figure: success rate under varying levels of BC regularization to the base policy. Curves: real-world success and sim success. Horizontal axis: BC loss coefficient, from weaker constraint (left) to stronger constraint (right). Annotation for a loose constraint: too loose to learn anything, the policy collapses in simulation.]

The same tension on hardware

The same thing happens on real hardware: even with the BC constraint in place, the policy settles on behavior that is unsafe or unreliable once deployed.

How Far Can Steering Go?

Steering goes a long way, as long as the behavior already lives in the prior.

One policy across tasks

One steered policy is trained on three tasks: credit card, cube, and bottle. It picks the right grasp for each object, and even reuses behaviors across them.

Same cube, two borrowed behaviors
↳ borrows its credit-card pinch
↳ borrows its bottle grasp (cube placed beyond training randomization)

The same cube is grasped two different ways depending on where it sits. Each behavior already lives inside the policy’s support.

For each object, the steered policy reaches with the right grasp. The base policy mixes them up, using one object’s grasp on another.

Credit Card
✓ SCORE: precise pinch
✗ Base: reaches with the bottle grasp
Bottle
✓ SCORE: stable grasp
✗ Base: pinches as if it were the card

Steering to a new object: bottle → carrot

We take a frozen bottle-grasp policy and steer it in simulation onto a carrot, an object it never trained on. A carrot is much thinner and demands a precise pinch, a behavior that appears only rarely under the bottle prior. Steering still recovers that pinch: on hardware the carrot-steered policy reaches 67%, far above the same policy steered to a bottle (22%).

SCORE (steered to bottle)
Reference: Steered to a bottle, then run on a carrot, it still reaches as if the carrot were a bottle.
SCORE (steered to carrot)
✓ SCORE behavior: Steered to the carrot, it surfaces the rare in-support pinch and grasps it: 67% on hardware.
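
A short calculation suggests why a behavior that is rare under the prior can still be found in massively parallel simulation. It rests on a simplifying assumption that is not from the paper: each rollout independently exhibits the behavior with the same probability `p`. The function name and the numbers are hypothetical.

```python
def p_discover(p: float, n_rollouts: int) -> float:
    """Chance that at least one of n independent rollouts shows a behavior
    that each rollout exhibits with probability p."""
    return 1.0 - (1.0 - p) ** n_rollouts


# Hypothetical numbers: a pinch that appears in 1% of prior rollouts.
rare = 0.01
single_env = p_discover(rare, 1)      # one rollout at a time
many_envs = p_discover(rare, 4096)    # one batch of parallel environments
never = p_discover(0.0, 4096)         # behavior outside the prior's support
print(single_env, many_envs, never)
```

Under this assumption a single rollout rarely shows the pinch, a large parallel batch almost surely contains it, and a behavior with zero probability under the prior is never seen at any batch size. As p shrinks, as it does when the scene drifts away from what the base policy was trained on, the same batch finds the behavior less often.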

Where it runs thin

The same bottle policy was trained with no distractor objects in the scene. When we add two distractor cubes and steer the policy to grasp around them, it recovers a working grasp on hardware (56%, up from 0% base), but only when the bottle sits on one side of the workspace. Even though SCORE trains with the distractors in simulation, exploration gets harder the farther the scene drifts from where the base policy was trained, so a robust grasp is harder to discover through steering.

Base (with distractor)
✗ Base policy: With a distractor, the base bottle policy collapses.
SCORE (steered, with distractor)
~ SCORE behavior: SCORE occasionally achieves a working grasp (56%), but performance is limited because the scene is far outside the distribution the prior was trained on.

The limit

What limits steering is the prior itself. The broader its coverage, the further steering can go.

Takeaway

Improve the policy you already have.

Overall, SCORE shows that simulation does not have to mean training a new policy from scratch. It can also be used to improve an existing real-world policy, so long as that improvement stays within the support of the real-world prior. With sparse rewards and a simple training pipeline, SCORE reaches robust, fast, and precise manipulation with minimal effort. A new task takes under half a day to add, from data collection to training to deployment, with large gains over the base policy. However, its reach is still bounded by coverage, since improvement falls off in scenes that drift far from the prior. A natural next step is to build broader behavior priors and datasets designed for steering, where coverage is measured by what simulation can later improve.

BibTeX

@misc{yu2026score,
    title  = {SCORE: Support-Constrained RL Enables Real-World Policy Improvement without Real-World Experience},
    author = {Yu, Raymond and Huey, William and Mukadam, Mustafa and Nagabandi, Anusha and Gupta, Abhishek},
    year   = {2026}
}