
Robot Sub-Trajectory Retrieval for Augmented Policy Learning

1University of Washington 2Bosch Center for Artificial Intelligence
*Equal contribution, equal advising

\(\texttt{STRAP}\) is a framework for retrieval augmented policy learning based on vision foundation models and dynamic time warping.

Abstract

TL;DR: retrieval for robust few-shot imitation learning by 1) encoding trajectories with vision foundation models and 2) retrieving sub-trajectories with dynamic time warping.

Robot learning increasingly relies on large, diverse, and complex datasets, mirroring trends in NLP and computer vision. Generalist policies trained on these datasets can perform well across multiple tasks but often underperform on individual tasks due to negative data transfer. This work proposes training policies at deployment time, adapting them to specific scenarios rather than relying on pre-trained, zero-shot models. Our approach, \(\texttt{STRAP}\), retrieves and trains on relevant data at the sub-trajectory level, improving generalization and robustness. Results show that \(\texttt{STRAP}\) surpasses existing methods in both simulated and real experiments, achieving robust control with minimal real-world demonstrations.

Video

Method Overview


Pipeline Overview

Overview of \(\texttt{STRAP}\): 1) demonstrations \(\mathcal{D}_{target}\) and the offline dataset \(\mathcal{D}_{prior}\) are encoded into a shared embedding space using a vision foundation model, 2) automatic slicing generates sub-trajectories, which 3) Subsequence Dynamic Time Warping (S-DTW) matches to corresponding sub-trajectories in \(\mathcal{D}_{prior}\), creating \(\mathcal{D}_{retrieved}\), and 4) training a policy on the union of \(\mathcal{D}_{retrieved}\) and \(\mathcal{D}_{target}\) yields better performance and robustness.
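The four steps in the caption can be sketched as a retrieval loop. This is a minimal illustration, not the paper's code: `encode` (the vision-foundation-model feature extractor) and `match` (the S-DTW matcher) are placeholder callables standing in for the components described in the text, and for brevity each demonstration is matched whole rather than first auto-sliced.

```python
import numpy as np

def strap_retrieve(target_demos, prior_trajs, encode, match, k=100):
    """Sketch of the STRAP retrieval loop (steps 1-3 of the figure).

    encode: maps a trajectory of image frames to per-frame embeddings.
    match:  subsequence matcher returning (start, end, cost) for a query
            against one prior trajectory.
    Returns the k cheapest matches per demo as (cost, traj_idx, start, end).
    """
    # 1) Encode every prior trajectory into the shared embedding space.
    prior_emb = [encode(t) for t in prior_trajs]
    retrieved = []
    for demo in target_demos:
        # 2) The real pipeline first slices demos into sub-trajectories;
        #    here we match the whole demo for brevity.
        q = encode(demo)
        # 3) Match against every prior trajectory and keep the top-k.
        scored = []
        for idx, emb in enumerate(prior_emb):
            start, end, cost = match(q, emb)
            scored.append((cost, idx, start, end))
        scored.sort()
        retrieved.extend(scored[:k])
    return retrieved
```

Step 4 then trains the policy on \(\mathcal{D}_{retrieved} \cup \mathcal{D}_{target}\), which is standard imitation learning and omitted here.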


Our key insights are:

  • Vision foundation models offer powerful out-of-the-box representations for trajectory retrieval. They sufficiently encode scene semantics and offer visual robustness, in contrast to the brittle in-domain feature extractors of prior work.
  • Sub-trajectory retrieval enables maximal re-use of prior data, capturing the temporality of tasks and their dynamics.
  • Subsequence Dynamic Time Warping finds optimal sub-trajectory matches in offline datasets, agnostic to length, horizon, or demonstration frequency.
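The automatic slicing that produces sub-trajectories can be sketched with a simple proprioceptive heuristic (hypothetical, and not necessarily the exact rule STRAP uses): split a demonstration where end-effector speed drops below a threshold, subject to a minimum segment length.

```python
import numpy as np

def slice_subtrajectories(eef_pos, speed_thresh=0.01, min_len=20):
    """Split a demonstration into sub-trajectories at low-speed frames.

    eef_pos: (T, 3) end-effector positions over time.
    Returns a list of (start, end) index pairs into the trajectory.
    A heuristic sketch, not the slicing rule from the paper.
    """
    # Per-step end-effector speed (distance between consecutive frames).
    speed = np.linalg.norm(np.diff(eef_pos, axis=0), axis=-1)
    cuts = [0]
    for t in range(1, len(speed)):
        # Cut at pauses, but never produce a segment shorter than min_len.
        if speed[t] < speed_thresh and t - cuts[-1] >= min_len:
            cuts.append(t)
    cuts.append(len(eef_pos))
    return [(a, b) for a, b in zip(cuts[:-1], cuts[1:]) if b - a >= min_len]
```

A trajectory that moves, pauses, and moves again would be split into two segments at the pause.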

Real-World Experiments


Pen-in-Cup


Pipeline Overview

While both Behavior Cloning (BC) and \(\texttt{STRAP}\) solve the Franka-Pen-in-Cup task demonstrated in \(\mathcal{D}_{target}\) (base), BC lacks robustness to out-of-distribution (OOD) scenarios: the policy merely replays the trajectories observed in \(\mathcal{D}_{target}\). \(\texttt{STRAP}\) retrieves relevant sub-trajectories from \(\mathcal{D}_{prior}\), e.g., the robot putting a screwdriver in the cup or picking up pens in various poses. Augmented policy learning then distills this knowledge into a policy that generalizes to an OOD scenario.


In-distribution (base)



Out-of-distribution (OOD)



Datasets


Demonstration Dataset \(\mathcal{D}_{target}\)

Offline Dataset \(\mathcal{D}_{prior}\)

Simulation Experiments


LIBERO-10


Pipeline Overview

\(\texttt{STRAP}\) outperforms the retrieval baselines BehaviorRetrieval (BR) [Du et al. 2023] and FlowRetrieval (FR) [Lin et al. 2024] by \(+12.20\%\) and \(+12.47\%\) on average across all LIBERO-10 tasks. These results demonstrate the policy's robustness to variations in object poses. Both DINOv2 and CLIP are viable representations for \(\texttt{STRAP}\), with only a \(0.73\%\) difference across all LIBERO-10 tasks.


Datasets


Demonstration Dataset \(\mathcal{D}_{target}\)

Offline Dataset \(\mathcal{D}_{prior}\)

Qualitative Results


Does \(\texttt{STRAP}\) retrieval scale to DROID?


Demonstration Dataset \(\mathcal{D}_{target}\)

Data Retrieved from DROID \(\mathcal{D}_{retrieved}\)



We encode 5k random demonstrations from the DROID dataset and retrieve relevant sub-trajectories. \(\texttt{STRAP}\) finds semantically relevant tasks from other labs with similar environment and camera setups. Note that our environment is not part of DROID!




What types of matches are identified by S-DTW?

The figure above shows demonstration segments from \(\mathcal{D}_{target}\), denoted by the red box, and the retrieved sub-trajectories from the offline dataset \(\mathcal{D}_{prior}\). Use the buttons to explore all LIBERO-10 tasks!


Task Instruction: "pick up the book and place it in the back compartment of the caddy"




How does \(\texttt{STRAP}\) compare to other retrieval mechanisms?

Pipeline Overview

We visualize the top five retrieved tasks and accumulate the rest into the "others" category. \(\texttt{STRAP}\) only retrieves semantically relevant sub-trajectories; each retrieved task shares at least one sub-task with the target task! For example, "put the black bowl in the bottom drawer of the cabinet", "close the bottom drawer of the cabinet ...". \(\texttt{STRAP}\) retrieval is sparse, selecting data from only 5 of 90 tasks, all semantically relevant, and ignoring the irrelevant ones (no "others")!

Ablations

Does sub-trajectory retrieval improve performance in few-shot imitation learning?



Pipeline Overview

We compare sub-trajectory retrieval with S-DTW (\(\texttt{STRAP}\)) to retrieving full trajectories (D-T) and individual states (D-S). We find sub-trajectory retrieval preferable to both: full trajectories can contain segments irrelevant to the task, hurting performance and reducing the accuracy of S-DTW matches.



How effective are the representations from vision-foundation models for retrieval?



Pipeline Overview

We replace the in-domain feature extractors from BehaviorRetrieval (BR) [Du et al. 2023] and FlowRetrieval (FR) [Lin et al. 2024], which are trained on \(\mathcal{D}_{prior}\), with an off-the-shelf DINOv2 encoder (D-S). The choice of representation depends on the task, with no method outperforming the others on all tasks. We highlight that vision foundation models don't have to be trained on \(\mathcal{D}_{prior}\), and they scale much better with increasing amounts of trajectory data and to unseen tasks.

Subsequence Dynamic Time Warping


Pipeline Overview

Subsequence Dynamic Time Warping: To retrieve sequences of variable length, we build on Dynamic Time Warping (DTW). DTW methods compute the similarity between two sequences that may vary in length or frequency. The algorithm aligns the sequences by warping their time axes using a set of admissible step sizes, minimizing the distance between corresponding points while obeying boundary conditions. Subsequence Dynamic Time Warping (S-DTW) loosens these boundary conditions to allow for subsequence matching.
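As a concrete illustration, here is a minimal NumPy sketch of S-DTW with the relaxed boundary conditions described above: the match may start and end anywhere in the longer sequence. It is a simplified reference implementation with the standard unit step sizes, not the paper's code.

```python
import numpy as np

def subsequence_dtw(query, seq):
    """Match `query` (n, d) to its best-aligned subsequence of `seq` (m, d).

    Returns (start, end, cost): inclusive frame indices into `seq` and the
    accumulated alignment cost of the optimal warping path.
    """
    # Pairwise L2 cost between query frames and sequence frames.
    c = np.linalg.norm(query[:, None, :] - seq[None, :, :], axis=-1)
    n, m = c.shape

    # Accumulated-cost matrix. The first row is just the local cost:
    # relaxed boundary condition, the match may start at any column.
    D = np.empty((n, m))
    D[0] = c[0]
    for i in range(1, n):
        D[i, 0] = D[i - 1, 0] + c[i, 0]
        for j in range(1, m):
            D[i, j] = c[i, j] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])

    # Relaxed end condition: the match may end at any column.
    end = int(np.argmin(D[-1]))

    # Backtrack through the matrix to recover the start index.
    i, j = n - 1, end
    while i > 0:
        if j == 0:
            i -= 1
        else:
            step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
    return j, end, float(D[-1, end])
```

If the query is an exact slice of the longer sequence, the matcher recovers that slice with zero cost.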



Pipeline Overview

DTW computes a distance matrix containing the distance between each pair of points in the two sequences. We use vision foundation models, e.g., DINOv2, to encode the image observations and compute their distance with the L2 norm. The optimal matching corresponds to the shortest path through the matrix, which dynamic programming finds by minimizing the total distance between the two sequences. The figure above shows matching a demonstration from \(\mathcal{D}_{target}\) (y-axis) to a sub-sequence in the offline dataset \(\mathcal{D}_{prior}\) (x-axis). Brighter colors indicate higher cost, darker colors lower cost. The red line indicates the optimal path through the matrix, i.e., the optimal match between the sequences.
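The distance matrix itself is one broadcasted NumPy expression. The snippet below uses random arrays as stand-ins for per-frame embeddings (384 is the feature dimension of DINOv2 ViT-S/14; other variants differ), since loading the actual encoder is beside the point here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for per-frame embeddings (384 = DINOv2 ViT-S/14 feature dim).
target = rng.standard_normal((50, 384))   # demo frames from D_target (y-axis)
prior = rng.standard_normal((400, 384))   # offline trajectory frames (x-axis)

# Entry (i, j) is the L2 distance between target frame i and prior frame j.
cost = np.linalg.norm(target[:, None, :] - prior[None, :, :], axis=-1)
print(cost.shape)  # (50, 400)
```

This is the matrix the figure visualizes; S-DTW then finds the cheapest path through it.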

BibTeX

@article{2025strap,
  title={???},
  author={???},
  journal={???},
  year={2025}
}