
ASID: Active Exploration for System Identification in Robotic Manipulation

University of Washington
ICLR 2024 (oral)

ASID is a generic pipeline for Sim2Real transfer that solves dynamic tasks zero-shot!

Abstract

Model-free control strategies such as reinforcement learning have shown the ability to learn control strategies without requiring an accurate model or simulator of the world. While this is appealing due to the lack of modeling requirements, such methods can be sample inefficient, making them impractical in many real-world domains. On the other hand, model-based control techniques leveraging accurate simulators can circumvent these challenges and use a large amount of cheap simulation data to learn controllers that can effectively transfer to the real world. The challenge with such model-based techniques is the requirement for an extremely accurate simulation, requiring both the specification of appropriate simulation assets and physical parameters. This requires considerable human effort to design for every environment being considered. In this work, we propose a learning system that can leverage a small amount of real-world data to autonomously refine a simulation model and then plan an accurate control strategy that can be deployed in the real world. Our approach critically relies on utilizing an initial (possibly inaccurate) simulator to design effective exploration policies that, when deployed in the real world, collect high-quality data. We demonstrate the efficacy of this paradigm in identifying articulation, mass, and other physical parameters in several challenging robotic manipulation tasks, and illustrate that only a small amount of real-world data can allow for effective sim-to-real transfer.

Video

Pipeline Overview



Overview of ASID: (1) Train an exploration policy \(\pi_{exp}\) that maximizes the Fisher information, leveraging the vast amount of cheap simulation data. (2) Roll out \(\pi_{exp}\) in the real world to collect informative data that can be used to (3) run system identification to identify physics parameters and reconstruct, e.g., geometric, collision, and kinematic properties. (4) Train a task-specific policy \(\pi_{task}\) in the updated simulator and (5) zero-shot transfer \(\pi_{task}\) to the real world.
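
To make the data flow between these five stages concrete, here is a minimal Python sketch of the pipeline glue. The stage implementations are passed in as callables because their interfaces are not specified on this page; the function names and signatures below are illustrative assumptions, not the actual ASID code.

    from typing import Any, Callable

    def asid_pipeline(
        train_exploration_policy: Callable[[Any], Any],  # (1) maximize Fisher information in sim
        rollout_in_real: Callable[[Any], Any],           # (2) collect one informative real trajectory
        identify_parameters: Callable[[Any, Any], Any],  # (3) system identification on the real data
        train_task_policy: Callable[[Any], Any],         # (4) train the task policy in the updated sim
        sim_params_prior: Any,                           # initial (possibly inaccurate) simulator parameters
    ) -> Any:
        """Glue code for the five ASID stages; every callable here is an assumed interface."""
        pi_exp = train_exploration_policy(sim_params_prior)
        real_trajectory = rollout_in_real(pi_exp)
        theta_hat = identify_parameters(real_trajectory, sim_params_prior)
        pi_task = train_task_policy(theta_hat)
        return pi_task  # (5) deployed zero-shot in the real world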

Real-World Experiments


Rod Balancing

Balancing, or dynamically stacking, objects critically depends on an accurate estimate of their inertial parameters. In this task, the agent can interact with a rod to identify its physical parameters, here the position of the center of mass along the rod. Successfully balancing the rod on the tower requires an accurate estimate of these parameters.

Shuffleboard

In shuffleboard, the goal is to strike a puck so that it stops in one of the target regions. We closely follow the original game and pour wax (sand) on the board. This modification makes the task especially difficult: the surface friction changes slightly after each shot because the puck displaces the wax.

Fisher Information Exploration


Intuition

Simulation

To identify the underlying physics parameters that govern the dynamics of the system, we need to collect trajectories that are strongly affected by these parameters. In the case of the rod balancing task, the center of mass affects how the rod rotates when pushed by the robot during the exploration stage. As shown above, the same actions can lead to drastically different rod poses after interaction. Trajectories are uninformative if the rod is not forced to rotate around its center of mass or doesn't move at all. We indicate the center of mass by a blue marker on top of the rod in both sim and real (not visible to the policy).


Fisher Information Maximization

The Fisher information matrix plays a key role in the choice of our exploration policy \(\pi_{exp}\). Recall that for the distribution \(p_{\theta}(\cdot;\pi_{exp})\) over trajectories induced by rolling out \(\pi_{exp}\) under parameters \(\theta\), the Fisher information is defined as:

\(\mathrm{I}(\theta,\pi_{exp}) := \mathrm{E}_{\tau \sim p_{\theta}} \left [ \nabla_{\theta} \log p_{\theta}(\tau; \pi_{exp}) \cdot \nabla_{\theta} \log p_{\theta}(\tau; \pi_{exp})^\top \right ]\)
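
As a concrete special case (a standard derivation, stated here under a Gaussian-noise assumption rather than taken from the text above): if transitions follow \(s_{t+1} = f_{\theta}(s_t, a_t) + w_t\) with \(w_t \sim \mathcal{N}(0, \sigma^{2}\mathrm{I})\), the log-likelihood factorizes over time steps and the Fisher information reduces to

\(\mathrm{I}(\theta,\pi_{exp}) = \frac{1}{\sigma^{2}}\, \mathrm{E}_{\tau \sim p_{\theta}} \left[ \sum_{t} \nabla_{\theta} f_{\theta}(s_t, a_t)\, \nabla_{\theta} f_{\theta}(s_t, a_t)^\top \right]\)

i.e., trajectories are informative exactly when the simulated dynamics change noticeably as the physical parameters change.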


The Fisher information matrix, therefore, captures the sensitivity of the trajectory distribution to the parameter \(\theta\). Since the distribution over trajectories \(\tau\) depends on the exploration policy, we can obtain a distribution with higher Fisher information by changing that policy. To find the exploration policy whose trajectories are most informative about the true parameter \(\theta^*\), we formulate the optimization problem:

\(\mathrm{argmin}_{\pi_{exp}}\quad\mathrm{tr}(\mathrm{I}(\theta^*, \pi_{exp})^{-1})\)


Intuitively, a policy that makes the Fisher information "large" will make \(\mathrm{tr}(\mathrm{I}(\theta^*, \pi_{exp})^{-1})\) "small", suggesting that the induced trajectories are very sensitive to the unknown parameters and are good candidates for system identification. Because the true parameters \(\theta^*\) are unknown, we solve this optimization problem in simulation by randomizing over the parameters, and we then roll out the resulting exploration policy in the real world to collect a trajectory.
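
In simulation, this objective can be approximated directly because the simulator lets us perturb \(\theta\) and re-simulate. Below is a minimal sketch, assuming a hypothetical seeded simulator interface `simulate(theta, policy, seed)` that returns a flattened trajectory, Gaussian observation noise of known scale, and finite differences for the parameter sensitivities; it illustrates how the objective can be evaluated, not the training code used by ASID.

    import numpy as np

    def fisher_information(simulate, policy, theta, noise_std=0.05, eps=1e-3, n_rollouts=8):
        """Monte-Carlo estimate of I(theta, policy) from finite-difference trajectory sensitivities.

        Assumes `simulate(theta, policy, seed)` is deterministic given `seed`, so the central
        differences below isolate the effect of perturbing the physical parameters.
        """
        theta = np.asarray(theta, dtype=float)
        info = np.zeros((theta.size, theta.size))
        for seed in range(n_rollouts):
            # d(trajectory)/d(theta_i) by central differences, one column per parameter
            jac = np.stack(
                [(simulate(theta + eps * e, policy, seed) - simulate(theta - eps * e, policy, seed)) / (2 * eps)
                 for e in np.eye(theta.size)],
                axis=1,
            )  # shape: (trajectory_dim, n_params)
            info += jac.T @ jac / noise_std**2  # Gaussian-noise Fisher information for this rollout
        return info / n_rollouts

    def exploration_score(simulate, policy, theta, reg=1e-6):
        """A-optimal design score: smaller tr(I^{-1}) means a more informative exploration policy."""
        info = fisher_information(simulate, policy, theta)
        return np.trace(np.linalg.inv(info + reg * np.eye(info.shape[0])))

In ASID, this quantity (or a surrogate of it, averaged over randomized parameters since \(\theta^*\) is unknown) would serve as the reward signal when training \(\pi_{exp}\) with reinforcement learning in simulation.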


Exploration Behavior

Simulation

Real World

We train the exploration policy \(\pi_{exp}\) in simulation (left) and roll it out in the real world (right) to collect trajectories for system identification. Observe that the exploration policy does not transfer perfectly but still collects informative data about the rod's center of mass. Note that even though the video shows multiple real-world rollouts, we only collect a single one in practice.
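
For completeness, the system-identification step that consumes this single real rollout can be as simple as searching for the simulator parameters that best reproduce it. The sketch below assumes a hypothetical `replay_in_sim(theta, actions)` interface that resets the simulator to the recorded initial state and replays the real action sequence; ASID's actual identification procedure is not reproduced here, this only illustrates the trajectory-matching idea.

    import numpy as np

    def identify_parameters(replay_in_sim, real_states, real_actions, theta_candidates):
        """Return the candidate parameters whose replayed sim trajectory best matches the real one."""
        def discrepancy(theta):
            sim_states = replay_in_sim(theta, real_actions)
            return float(np.mean((np.asarray(sim_states) - np.asarray(real_states)) ** 2))
        # e.g., theta_candidates drawn from the same randomization range used during exploration training
        return min(theta_candidates, key=discrepancy)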

Baseline Comparisons


Rod Balancing

ASID correctly identifies the rod's center of mass and successfully balances it on the tower. The baseline trained with domain randomization, i.e., trained over a distribution of inertia parameters, fails catastrophically: it converges to picking up the rod and placing it at a random location. These results show that identifying the correct parameters is crucial for solving dynamic tasks.

Shuffleboard

Due to the changing surface friction caused by previous shot attempts, the domain randomization baseline struggles to shoot the puck to the desired zone. With its dedicated exploration phase, ASID can adapt the simulation to the current conditions and land the puck in the desired zone. Again, a correct parameter estimate is crucial for solving the task accurately.


Double Stacking

Me against the Machine

BibTeX

@inproceedings{memmel2024asid,
  title={ASID: Active Exploration for System Identification in Robotic Manipulation},
  author={Memmel, Marius and Wagenmaker, Andrew and Zhu, Chuning and Fox, Dieter and Gupta, Abhishek},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}