
Robot Sub-Trajectory Retrieval for Augmented Policy Learning

1University of Washington 2Bosch Center for Artificial Intelligence
*Equal contribution. Equal advising.

Abstract

tl;dr Retrieval for robust few-shot imitation learning by encoding trajectories with vision foundation models and retrieving sub-trajectories with subsequence dynamic time warping.

Robot learning increasingly relies on large, diverse, and complex datasets, mirroring trends in NLP and computer vision. Generalist policies trained on these datasets can perform well across multiple tasks but often underperform on individual tasks due to negative data transfer. This work proposes training policies at deployment time, adapting them to the specific scenario rather than relying on pre-trained, zero-shot models. Our approach, \(\texttt{STRAP}\), retrieves relevant data at the sub-trajectory level and trains on it, enhancing robustness. Results show that \(\texttt{STRAP}\) surpasses existing methods in both simulated and real-world experiments, achieving robust control from only a handful of real-world demonstrations.

Method Overview



Overview of \(\texttt{STRAP}\): 1) demonstrations \(\mathcal{D}_{target}\) and the offline dataset \(\mathcal{D}_{prior}\) are encoded into a shared embedding space using a vision foundation model; 2) automatic slicing generates sub-trajectories, which 3) Subsequence Dynamic Time Warping (S-DTW) matches to corresponding sub-trajectories in \(\mathcal{D}_{prior}\), creating \(\mathcal{D}_{retrieved}\); 4) training a policy on the union of \(\mathcal{D}_{retrieved}\) and \(\mathcal{D}_{target}\) yields better performance and robustness.


Our key insights are:

  • Vision foundation models offer powerful out-of-the-box representations for trajectory retrieval. They sufficiently encode scene semantics and offer visual robustness, in contrast to the brittle in-domain feature extractors of prior work.
  • Sub-trajectory retrieval enables maximal re-use of prior data, capturing the temporality of tasks and their dynamics.
  • Subsequence Dynamic Time Warping finds optimal sub-trajectory matches in offline datasets, agnostic to length, horizon, or demonstration frequency.
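The slicing step that produces sub-trajectories can be sketched in a few lines. The heuristic below is our simplified illustration, not the paper's exact procedure: the function name, the speed threshold, and the choice of end-effector speed as the cut signal are all assumptions. It splits a demonstration wherever the end-effector slows down, which tends to coincide with sub-task boundaries such as grasps:

```python
import numpy as np

def slice_trajectory(ee_positions, min_len=20, speed_quantile=0.2):
    """Split a demonstration into contiguous sub-trajectories at low-speed steps.

    ee_positions: (T, 3) array of end-effector positions.
    Returns a list of (start, end) index pairs covering the whole trajectory.
    """
    # per-step speed as the norm of consecutive position differences
    speed = np.linalg.norm(np.diff(ee_positions, axis=0), axis=1)
    thresh = np.quantile(speed, speed_quantile)
    cuts = [0]
    for t in range(1, len(speed)):
        # cut at slow timesteps, but keep each segment at least min_len long
        if speed[t] < thresh and t - cuts[-1] >= min_len:
            cuts.append(t)
    cuts.append(len(ee_positions))
    return list(zip(cuts[:-1], cuts[1:]))
```

The resulting segments partition the demonstration, so every frame still appears in exactly one query sub-trajectory.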

Kitchen Experiments

Videos: Table · Sink · Stove

Results (Kitchen)


\(\texttt{STRAP}\) solves several pick-and-place tasks in a realistic kitchen environment. \(\texttt{STRAP}\) is surprisingly robust to object poses unseen in the demonstrations, while baselines fail to adapt. The policy also exhibits recovery behavior, completing the task even when the initial grasp fails and perturbs the object's pose.



Results (Kitchen-DROID)


To investigate scalability to larger datasets, we construct an additional offline dataset of 5k demonstrations from the DROID dataset plus 50 demonstrations collected in the same environment. DTW scales linearly with the number of offline demonstrations, and \(\texttt{STRAP}\)'s policy-training stage depends only on the amount of data retrieved, not on the offline dataset's size; \(\texttt{STRAP}\) therefore scales naturally to larger datasets like DROID while maintaining performance.


Datasets

Demonstration Dataset \(\mathcal{D}_{target}\)

Offline Dataset \(\mathcal{D}_{prior}\)



Pen-in-Cup Experiments

Videos: Base · OOD

Results


\(\texttt{STRAP}\) can leverage data from seemingly unrelated tasks to learn robust behavior. While behavior cloning merely replays the demonstrations, \(\texttt{STRAP}\) adapts to unseen object poses and appearances.


Datasets

Demonstration Dataset \(\mathcal{D}_{target}\)

Offline Dataset \(\mathcal{D}_{prior}\)

LIBERO-10 Experiments

Results


\(\texttt{STRAP}\) outperforms the retrieval baselines BehaviorRetrieval (BR) [Du et al. 2023] and FlowRetrieval (FR) [Lin et al. 2024] by an average of \(+24.7\%\) and \(+25.0\%\), respectively, across all LIBERO-10 tasks. These results demonstrate the policy's robustness to unseen object poses. Both DINOv2 and CLIP are viable representations for \(\texttt{STRAP}\), differing by only \(0.73\%\) across all LIBERO-10 tasks.


Datasets

Demonstration Dataset \(\mathcal{D}_{target}\)

Offline Dataset \(\mathcal{D}_{prior}\)

Qualitative Results


Does \(\texttt{STRAP}\) retrieval scale to DROID?

Demonstration Dataset \(\mathcal{D}_{target}\)

Data Retrieved from DROID \(\mathcal{D}_{retrieved}\)



We encode 5k random demonstrations from the DROID dataset and retrieve relevant sub-trajectories. \(\texttt{STRAP}\) finds semantically relevant tasks from other labs with similar environment and camera setups. Note that our environment is not part of DROID!




What types of matches are identified by S-DTW?

Kitchen-DROID:
Below, the red box denotes the demonstration segments (\(\mathcal{D}_{target}\)), shown alongside the retrieved sub-trajectories (from \(\mathcal{D}_{prior}\)). \(\texttt{STRAP}\) retrieves relevant segments from the multi-task trajectories even when object appearance, object pose, or the task itself differs.


Use the buttons to explore all tasks!


Task Instruction: "pick up the pepper 🫑 and put it in the sink 🚰"



LIBERO-10:
Below, the red box denotes the demonstration segments (\(\mathcal{D}_{target}\)), shown alongside the retrieved sub-trajectories (from \(\mathcal{D}_{prior}\)). \(\texttt{STRAP}\) retrieves relevant segments from the multi-task trajectories that match the demonstrations.


Use the buttons to explore all tasks!


Task Instruction: "pick up the book and place it in the back compartment of the caddy"




How does \(\texttt{STRAP}\) compare to other retrieval mechanisms?


We prompt all retrieval methods with demonstrations for "put the black bowl in the bottom drawer of the cabinet and close it" and investigate which tasks they retrieve data from. The figure shows the top-5 tasks and aggregates the remaining ones.



\(\texttt{STRAP}\) retrieves only semantically relevant sub-trajectories: each retrieved task shares at least one sub-task with the target task, e.g., "put the black bowl in the bottom drawer of the cabinet" and "close the bottom drawer of the cabinet ...". \(\texttt{STRAP}\)'s retrieval is also sparse, selecting data from only 5/90 semantically relevant tasks and ignoring irrelevant ones (no "others")!

Ablations

Does sub-trajectory retrieval improve performance in few-shot imitation learning?


We compare sub-trajectory retrieval with S-DTW (\(\texttt{STRAP}\)) to retrieving full trajectories (D-T) and individual states (D-S). We find sub-trajectory retrieval preferable to both. Full trajectories can contain segments irrelevant to the task, hurting downstream performance and reducing the accuracy of the S-DTW matches.
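For contrast, state-level retrieval (the D-S baseline) can be sketched as a per-frame nearest-neighbor lookup that ignores temporal structure entirely; the function and variable names below are illustrative, not the baseline's actual implementation:

```python
import numpy as np

def retrieve_states(target_emb, prior_emb, k):
    """Return indices of the k prior frames closest to any target frame.

    target_emb: (num_target, d) frame embeddings of the demonstrations.
    prior_emb:  (num_prior, d) frame embeddings of the offline dataset.
    """
    # (num_prior, num_target) pairwise L2 distances between frame embeddings
    D = np.linalg.norm(prior_emb[:, None, :] - target_emb[None, :, :], axis=-1)
    scores = D.min(axis=1)          # each prior frame's distance to its closest target frame
    return np.argsort(scores)[:k]   # the k most similar prior frames, ranked
```

Because every frame is scored independently, temporally coherent segments are not guaranteed, which is precisely what sub-trajectory retrieval with S-DTW addresses.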



How effective are the representations from vision-foundation models for retrieval?


We replace the in-domain feature extractors of BehaviorRetrieval (BR) [Du et al. 2023] and FlowRetrieval (FR) [Lin et al. 2024], which are trained on \(\mathcal{D}_{prior}\), with an off-the-shelf DINOv2 encoder (D-S). The best choice of representation depends on the task, with no method outperforming the others on all tasks. We want to highlight that vision foundation models do not have to be trained on \(\mathcal{D}_{prior}\), and they scale much better with increasing amounts of trajectory data and to unseen tasks.

Subsequence Dynamic Time Warping



Subsequence Dynamic Time Warping: To retrieve sequences of variable length, we build on Dynamic Time Warping (DTW). DTW computes the similarity between two sequences that may vary in length or frequency. The algorithm aligns the sequences by warping their time axes with a set of admissible step sizes, minimizing the distance between corresponding points while obeying boundary conditions. Subsequence Dynamic Time Warping (S-DTW) relaxes these boundary conditions to allow for subsequence matching.
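A minimal numpy sketch of S-DTW is shown below. It uses the standard step sizes and an L2 cost between frame embeddings; this is our illustration of the technique, not the authors' exact implementation:

```python
import numpy as np

def subsequence_dtw(query, target):
    """Match `query` (n, d) to the best-matching subsequence of `target` (m, d).

    Standard DTW pins both endpoints; S-DTW frees the start and end on the
    target axis, so the query may begin and finish anywhere inside `target`.
    Returns (cost, start, end): the matching cost and the 0-based inclusive
    index range of the matched target subsequence.
    """
    n, m = len(query), len(target)
    # pairwise L2 cost between frame embeddings
    D = np.linalg.norm(query[:, None, :] - target[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, :] = 0.0  # free start: the match may begin at any target index
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = D[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # diagonal step
                acc[i - 1, j],      # advance query only
                acc[i, j - 1],      # advance target only
            )
    end = int(np.argmin(acc[n, 1:]))  # free end: best final target index
    cost = float(acc[n, end + 1])
    # backtrack to row acc[0, :] to find where the match starts
    i, j = n, end + 1
    while i > 0:
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost, j, end
```

Retrieval then runs this routine for each query sub-trajectory against every offline trajectory and keeps the lowest-cost matches.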




DTW computes a distance matrix containing the distance between each pair of points in the two sequences. We use vision foundation models, e.g., DINOv2, to encode the image observations and compute their distances with the L2 norm. The optimal matching is represented by the shortest path through the matrix, which dynamic programming finds by minimizing the total distance between the two sequences. The figure above shows matching a demonstration from \(\mathcal{D}_{target}\) (y-axis) to a sub-sequence in the offline dataset \(\mathcal{D}_{prior}\) (x-axis). Brighter colors indicate higher cost and darker colors lower cost. The red line indicates the optimal path through the matrix, i.e., the optimal match between the sequences.

BibTeX

@inproceedings{memmelstrap,
  title={STRAP: Robot Sub-Trajectory Retrieval for Augmented Policy Learning},
  author={Memmel, Marius and Berg, Jacob and Chen, Bingqing and Gupta, Abhishek and Francis, Jonathan},
  booktitle={1st Workshop on X-Embodiment Robot Learning}
}