Abstract
tl;dr: Retrieval for robust few-shot imitation learning: encode trajectories with vision foundation models and retrieve relevant sub-trajectories with subsequence dynamic time warping.
Robot learning increasingly relies on large, diverse, and complex datasets, mirroring trends in NLP and computer vision. Generalist policies trained on these datasets can perform well across many tasks but often underperform on any individual task due to negative data transfer. This work instead proposes training policies at deployment time, adapting them to the specific scenario rather than relying on pre-trained, zero-shot models. Our approach, \(\texttt{STRAP}\), retrieves and trains on relevant data at the sub-trajectory level, improving robustness. Results show that \(\texttt{STRAP}\) surpasses existing methods in both simulated and real-world experiments, achieving robust control from only a handful of real-world demonstrations.
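The tl;dr names two ingredients: encoding trajectories with a vision foundation model and matching sub-trajectories with subsequence dynamic time warping. Below is a minimal sketch of the second ingredient, assuming the frames have already been encoded into per-frame feature arrays; the function and variable names are illustrative and not the authors' implementation.

```python
import numpy as np


def subsequence_dtw(query: np.ndarray, reference: np.ndarray):
    """Find the sub-trajectory of `reference` that best matches `query`.

    Both inputs are (time, feature_dim) arrays of per-frame embeddings,
    e.g. features from a vision foundation model. Returns the matched
    window [start, end) in `reference` and its alignment cost.
    """
    n, m = len(query), len(reference)
    # Pairwise frame distances (Euclidean here; cosine distance also works).
    dist = np.linalg.norm(query[:, None, :] - reference[None, :, :], axis=-1)

    # Accumulated-cost matrix. Row 0 is all zeros so the match may start
    # anywhere in the reference; column 0 is +inf so the query must be
    # matched in full.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, :] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # advance in the query only
                acc[i, j - 1],      # advance in the reference only
                acc[i - 1, j - 1],  # advance in both
            )

    # Best end point: the column with the smallest total cost in the last row.
    end = int(np.argmin(acc[n, 1:])) + 1
    cost = float(acc[n, end])

    # Backtrack from (n, end) to row 0 to recover the matched start index.
    i, j = n, end
    while i > 0:
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    start = j  # 0-based index of the first matched reference frame

    return start, end, cost
```

Running this over each trajectory in the offline dataset and keeping the lowest-cost windows gives the retrieved sub-trajectories on which the deployment-time policy is then trained.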