LocoVLM

Versatile Legged Locomotion Adaptation through Vision-Language Grounding

1KAIST, 2URobotics Corp.

Spotlight, ICRA 2025 SafeVLMs Workshop

LocoVLM is a versatile legged locomotion adaptation framework that enables quadrupedal robots to follow given vision and language instructions in real time while adhering to locomotion safety constraints.


Overview

Style Tracking with Compliance

Following gait styles is fancy. But following them too strictly can be dangerous. What if we track these gait styles while allowing some room for mistakes? Enter compliant gait tracking, which gives us the best of both worlds.
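One way to picture this idea is a tracking reward with a tolerance band: deviations within the band are free, and only the excess error is penalized. The sketch below is a hypothetical illustration of that principle, not the paper's actual reward; the function name, `tol`, and `sigma` are assumptions.

```python
import numpy as np

def compliant_tracking_reward(style_cmd, style_obs, tol=0.1, sigma=0.25):
    """Hypothetical compliant style-tracking reward (names and constants are assumptions).

    Deviations within `tol` of the commanded gait style are free
    (the "room for mistakes"); only the excess error is penalized,
    so the policy can trade strict style imitation for stability.
    """
    err = np.linalg.norm(np.asarray(style_obs) - np.asarray(style_cmd))
    excess = max(err - tol, 0.0)  # dead-zone: small deviations cost nothing
    return float(np.exp(-(excess ** 2) / sigma ** 2))
```

With this shaping, the reward stays at 1.0 while the observed style remains inside the tolerance band and decays smoothly outside it, so the policy may momentarily break style to stay upright.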

Scaling-up Instructions

Data collection and labeling are always expensive and tedious. But what if we could ask our friendly LLM to do it for us?

LocoVLM first constructs a skillset database that bridges language instructions with robot commands. We prompted a large language model (LLM) to generate instructions, whether they are mimicking tasks, scene descriptions, or simple direct commands. Then we asked the LLM to map these instructions to the appropriate robot commands. And voilà, we have a skillset database that can be used to instruct our robot.
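A minimal sketch of this two-stage pipeline is shown below. The prompts, the JSON schema, and the command keys (`velocity`, `gait`, `body_height`) are illustrative assumptions, and `query_llm` stands in for whichever LLM API you use; this is not the exact prompting setup from the paper.

```python
import json

def query_llm(prompt: str) -> str:
    """Placeholder for a call to your LLM provider's chat/completions API."""
    raise NotImplementedError

def build_skill_database(n_skills: int = 50) -> list:
    """Two-stage generation: (1) ask the LLM for diverse instructions
    (mimicking tasks, scene descriptions, direct commands), then
    (2) ask it to map each instruction to a robot command."""
    instructions = json.loads(query_llm(
        f"Generate {n_skills} short locomotion instructions for a quadruped robot "
        "(mimicking tasks, scene descriptions, or direct commands) as a JSON list of strings."))
    database = []
    for text in instructions:
        command = json.loads(query_llm(
            "Map this instruction to a locomotion command as JSON with keys "
            f"'velocity', 'gait', 'body_height': \"{text}\""))
        database.append({"instruction": text, "command": command})
    return database

if __name__ == "__main__":
    with open("skill_database.json", "w") as f:
        json.dump(build_skill_database(), f, indent=2)
```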


Real-time Vision-Language Grounding

Now that we have a skillset database, how can we make use of it?

We used BLIP-2, a vision-language model, to ground the language (or image) instructions and retrieve the corresponding robot commands from the skillset database. We also proposed a novel mixed-precision retrieval method that balances retrieval speed and accuracy.
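The page doesn't spell out the retrieval details, so the sketch below is one plausible reading of "mixed-precision retrieval": score the whole database cheaply in half precision, then re-rank a small shortlist in full precision. The function names, `top_k`, and the precision split are assumptions, and the BLIP-2 embeddings are assumed to be precomputed offline.

```python
import numpy as np

def mixed_precision_retrieve(query_emb, skill_embs, top_k=8):
    """Two-stage retrieval sketch over BLIP-2 embeddings.

    1) Coarse pass: cosine similarity over the whole skillset database in float16.
    2) Fine pass: re-rank only the top_k candidates in float32.
    """
    # Normalize so a dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    db = skill_embs / np.linalg.norm(skill_embs, axis=1, keepdims=True)

    # Coarse pass in half precision: fast, slightly lossy scoring.
    coarse = db.astype(np.float16) @ q.astype(np.float16)
    k = min(top_k, len(db) - 1)
    candidates = np.argpartition(-coarse, k)[:k]

    # Fine pass in full precision on the shortlist only.
    fine = db[candidates].astype(np.float32) @ q.astype(np.float32)
    return int(candidates[np.argmax(fine)])  # index of the best-matching skill
```

The retrieved index then selects the stored robot command, which is sent to the compliant style-tracking policy.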


How Hard Is It to Apply to Other Embodiments?

As easy as 1, 2, 3! Simply train the compliant style tracking policy and you can reuse the same skillset database with other embodiments. (Of course, you can build a more embodiment-specific skillset database too!)