LocoVLM: Grounding Vision and Language for Adapting Versatile Legged Locomotion Policies

Abstract

Recent advances in legged locomotion learning are still dominated by geometric representations of the environment, which limits the robot's ability to respond to higher-level semantics such as human instructions. To address this limitation, we propose a novel approach that integrates high-level commonsense reasoning from foundation models into legged locomotion adaptation. Specifically, our method uses a pre-trained large language model to synthesize an instruction-grounded skill database tailored for legged robots. A pre-trained vision-language model extracts high-level environmental semantics and grounds them within the skill database, enabling real-time skill advisories for the robot. To facilitate versatile skill control, we train a style-conditioned policy capable of generating diverse and robust locomotion skills with high fidelity to specified styles. To the best of our knowledge, this is the first work to demonstrate real-time adaptation of legged locomotion using high-level reasoning from environmental semantics and instructions, achieving instruction-following accuracy of up to 87% without online queries to cloud-hosted foundation models.

87%: Instruction-following accuracy without online LLM

<100 ms: VLM inference time for real-time deployment

50 Hz: Locomotion controller frequency on onboard compute

2: Robot platforms validated (quadruped + humanoid)

Overview

How It Works

1. Train: Style-conditioned locomotion policy with compliant contact tracking for robust and diverse gaits

2. Generate: LLM-powered skill database mapping language instructions to executable motion descriptors

3. Retrieve: Real-time VLM-based mixed-precision retrieval from vision or language inputs

Style-Conditioned Locomotion with Compliant Tracking

Following gait styles is great, but enforcing them too rigidly can compromise the robot's stability on rough terrain. Our compliant contact tracking method introduces a compliance threshold that allows the robot to momentarily deviate from the desired gait pattern when it encounters disturbances, achieving the best of both worlds: accurate style execution and robust locomotion.

The locomotion policy is parameterized by a gait cycle duration, gait phase offsets, and a velocity limit—a minimal but expressive set of parameters that can represent diverse locomotion styles including pronk, trot, pace, bound, and rotary gallop.
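To make the parameterization concrete, here is a minimal sketch of how a cycle duration and per-leg phase offsets can be turned into desired foot contacts. Everything named here is an assumption for illustration: the `MotionDescriptor` class, the leg order (FL, FR, RL, RR), the 50% stance fraction, and the offset values, which follow common textbook gait conventions and are not taken from the paper. Rotary gallop is left out because its offsets vary between references.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MotionDescriptor:
    """Hypothetical container for the three style parameters."""
    cycle_duration: float   # seconds per gait cycle
    phase_offsets: tuple    # per-leg offsets in [0, 1), assumed order FL, FR, RL, RR
    velocity_limit: float   # metres per second


# Illustrative offsets as fractions of a cycle (not values from the paper).
GAIT_OFFSETS = {
    "pronk": (0.0, 0.0, 0.0, 0.0),  # all legs move together
    "trot": (0.0, 0.5, 0.5, 0.0),   # diagonal pairs move together
    "pace": (0.0, 0.5, 0.0, 0.5),   # lateral pairs move together
    "bound": (0.0, 0.0, 0.5, 0.5),  # front pair, then rear pair
}


def leg_phases(desc, t):
    """Phase of each leg in [0, 1) at time t (seconds)."""
    base = (t / desc.cycle_duration) % 1.0
    return tuple((base + off) % 1.0 for off in desc.phase_offsets)


def desired_contacts(desc, t, duty=0.5):
    """True where a leg should be in stance; `duty` is an assumed stance fraction."""
    return tuple(phase < duty for phase in leg_phases(desc, t))
```

With `MotionDescriptor(0.5, GAIT_OFFSETS["trot"], 1.0)`, the desired contacts alternate between the two diagonal leg pairs every half cycle.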

Compliant gait tracking on the Unitree Go1. The robot accurately follows five different gait styles while maintaining stability over challenging terrain.

Gait Phase Encoding — Gait phase encoding with compliance zone (green). Within this zone, the policy is not penalized for deviating from the target gait phase.
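The compliance zone can be illustrated with a small per-leg penalty function. This is a sketch under stated assumptions, not the paper's reward. It assumes the zone spans plus or minus `compliance` around the two stance/swing switch points (phase 0 and phase `duty`), and it uses a binary mismatch penalty. The function names and default values are hypothetical.

```python
def circular_distance(a, b):
    """Shortest distance between two phases on the unit circle [0, 1)."""
    d = abs(a - b) % 1.0
    return min(d, 1.0 - d)


def contact_penalty(phase, in_contact, duty=0.5, compliance=0.1):
    """Penalty for one leg: 1.0 on a contact mismatch, 0.0 otherwise.

    Assumption: inside the compliance zone (near a stance/swing switch
    point) no penalty is applied, whatever the actual contact state is.
    """
    near_switch = min(
        circular_distance(phase, 0.0),
        circular_distance(phase, duty),
    ) <= compliance
    if near_switch:
        return 0.0
    desired = phase < duty  # stance is assumed to occupy [0, duty)
    return 0.0 if in_contact == desired else 1.0
```

Under this sketch, a foot that lifts off slightly early or lands slightly late costs nothing, while a mismatch in the middle of stance or swing is still penalized.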

Gait Tracking Performance — Foot contact states (top) and robot snapshots (bottom) for five gaits: (a) pronk, (b) trot, (c) pace, (d) bound, and (e) rotary gallop.

Scaling Up Instructions with LLM

Data collection and labeling are expensive and tedious. Instead, we leverage a large language model (GPT-4o) to automatically generate a skill database that bridges natural language instructions with executable robot motion descriptors. Our two-stage pipeline first generates diverse instructions across three categories (mimicking behaviors, scene responses, and direct instructions), then maps each to a structured motion descriptor through prompted reasoning.

This approach is roughly 5x cheaper than generating entries one by one ($0.25 vs. $1.16 per 300 entries, a ratio of about 4.6) while producing more diverse and structured motion descriptors.

LLM Data Generation Pipeline — Offline skill database generation pipeline. The LLM first generates instructions categorized into mimicking behaviors, scene responses, and direct instructions. These are then passed to a meta-prompt to generate motion descriptors with reasoning.
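The two-stage structure can be sketched as follows. This is an illustrative outline only: the prompt wording, the JSON keys (`reasoning`, `gait`, `cycle_duration`, `velocity_limit`), and the function names are all assumptions, and `fake_llm` is an offline stand-in so the example runs without any API access. In practice `llm` would be any callable that sends a prompt to a language model and returns its text reply.

```python
import json

CATEGORIES = ("mimicking behaviors", "scene responses", "direct instructions")


def generate_instructions(llm, per_category):
    """Stage 1: ask the LLM for instructions in each category."""
    instructions = []
    for category in CATEGORIES:
        prompt = f"List {per_category} short instructions of type: {category}. One per line."
        lines = [ln.strip() for ln in llm(prompt).splitlines() if ln.strip()]
        instructions.extend((category, ln) for ln in lines[:per_category])
    return instructions


def describe_motion(llm, instruction):
    """Stage 2: ask for reasoning plus a JSON motion descriptor (hypothetical schema)."""
    prompt = (
        "Reason step by step, then answer with one JSON object with keys "
        "'reasoning', 'gait', 'cycle_duration', 'velocity_limit'.\n"
        f"Instruction: {instruction}"
    )
    return json.loads(llm(prompt))


def build_skill_database(llm, per_category=2):
    """Run both stages and collect one entry per generated instruction."""
    return [
        {"category": cat, "instruction": text, "descriptor": describe_motion(llm, text)}
        for cat, text in generate_instructions(llm, per_category)
    ]


def fake_llm(prompt):
    """Offline stand-in for a real model; returns canned replies."""
    if prompt.startswith("List"):
        return "move quietly\ngo quickly!"
    return json.dumps({"reasoning": "placeholder", "gait": "trot",
                       "cycle_duration": 0.5, "velocity_limit": 0.5})
```

Keeping the reasoning field next to the descriptor makes each database entry auditable after generation.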

Skill Database Retrieval — Skill database retrieval process. Instructions (text or images) are encoded by the VLM into a shared embedding space, and the closest match in the database provides the corresponding motion descriptor.

Prompted Reasoning Improves Data Quality

Comparing 300-entry databases generated by different methods, our prompted reasoning approach produces a more balanced gait distribution and reduces unstructured gaits. The baseline (SayTap-style) produces 45.7% unstructured gaits, while LocoVLM with prompted reasoning reduces this to only 5.3%, promoting an even spread across standard gaits.

Skill Database Statistics — Statistics of skill databases: (a) categorical gait distribution, (b) gait cycle period histograms, (c) velocity limit histograms. LocoVLM with prompted reasoning yields the most balanced and structured distribution.

Real-time Vision-Language Grounding

We use BLIP-2, a pre-trained vision-language model, to ground text or image inputs to the skill database in real time. A key challenge is that retrieval accuracy with naive cosine similarity degrades as the database grows. We address this with our mixed-precision retrieval: first narrowing candidates via cosine similarity (fast), then re-ranking with BLIP-2's image-text matching (ITM) head (accurate).
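The coarse-to-fine idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's retrieval code: `rerank_score` is a hypothetical stand-in for a slower, more accurate matcher such as an ITM head, the embeddings are toy vectors, and the sketch does not capture whatever distinguishes the full mixed-precision method from the plain "Top-K to ITM" baseline in the table below.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, query, database, rerank_score, k=3):
    """Return the best-matching instruction from `database`.

    database: list of (instruction, embedding) pairs.
    rerank_score: callable (query, instruction) -> float, assumed to be
        more accurate but more expensive than cosine similarity.
    """
    # Stage 1: cheap shortlist of the k nearest entries by cosine similarity.
    shortlist = sorted(
        database, key=lambda entry: cosine(query_vec, entry[1]), reverse=True
    )[:k]
    # Stage 2: expensive re-ranking, applied to the shortlist only.
    best = max(shortlist, key=lambda entry: rerank_score(query, entry[0]))
    return best[0]
```

Because the expensive scorer only sees `k` candidates, its cost stays constant as the database grows, while the shortlist step remains a simple similarity search.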

We also discovered that rendering text as an image (text-as-image) and feeding it to the VLM's image encoder significantly improves retrieval, leveraging the VLM's strength in image-text matching.

Retrieval Accuracy

Retrieval Method	Text as String	Text as Image	Average
Cosine Similarity	21%	30%	20.5%
Top-K Similarity	27%	48%	37.5%
Top-K to ITM	51%	57%	54.0%
Mixed-Precision (Ours)	72%	87%	79.5%

Retrieval accuracy on 100 manually annotated instructions. Our mixed-precision retrieval with text-as-image achieves 87%, roughly a 4x improvement over baseline cosine similarity.

Semantic Reasoning Beyond the Database

LocoVLM can interpret instructions that are not in the database by leveraging the VLM's semantic understanding. This allows natural, intuitive interaction without being constrained by pre-defined queries.

"you are a kangaroo"

→ Retrieved: "let's jump like a rabbit" — pronk gait with moderate velocity

"this is a library!"

→ Retrieved: "move quietly" — slow trot with low velocity limit

"run! there's a danger!"

→ Retrieved: "there's a predator behind you, move fast!" — fast gallop at max velocity

Real-World Scene Interpretation

We evaluated LocoVLM outdoors in a campus environment where the robot transitioned from pavement to snow-covered terrain. Using onboard camera images, LocoVLM automatically adapts the robot's behavior: on pavement it chooses a moderate trot ("traipse lightly like a deer"), while on snow it selects slower, cautious gaits ("skulk with stealth like a lynx"), a particularly fitting analogy given the lynx's natural habitat in snowy environments.

Outdoor Scene Interpretation — LocoVLM interprets RGB images from the robot's camera and generates context-aware motion descriptors. On pavement: moderate trot at 0.6 m/s. On snow: cautious gaits at 0.2–0.3 m/s with longer gait periods.

Zero-Shot Generalization to Humanoid

How hard is it to apply LocoVLM to a completely different robot? As easy as 1-2-3! We demonstrate zero-shot generalization to the Unitree H1 humanoid by simply training a new style-conditioned locomotion policy (using only the first two gait phase offsets for two legs) and reusing the same skill database generated for the quadruped. No re-training of the VLM or re-generation of the database is needed.
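Reusing a quadruped descriptor on a biped then amounts to dropping the unused offsets. The helper below is a hypothetical illustration of that step. It assumes the offsets are stored as a sequence whose first two entries are the ones the humanoid policy consumes.

```python
def to_biped_offsets(quadruped_offsets):
    """Keep only the first two phase offsets, one per leg of the biped."""
    if len(quadruped_offsets) < 2:
        raise ValueError("need at least two phase offsets")
    return tuple(quadruped_offsets[:2])
```

Under an assumed leg order of FL, FR, RL, RR, a trot-style descriptor `(0.0, 0.5, 0.5, 0.0)` becomes `(0.0, 0.5)`, an alternating walk, while a pronk-style `(0.0, 0.0, 0.0, 0.0)` becomes `(0.0, 0.0)`, a two-legged hop.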

LocoVLM on the Unitree H1 humanoid in MuJoCo simulation. The same skill database is used to interpret instructions like "go quickly!", "shh! the baby is sleeping", and "you are a kangaroo".

Humanoid Experiment Results — Humanoid locomotion results showing (a) fast walking, (b) quiet walking, and (c) kangaroo-like hopping with corresponding foot contact states and velocity profiles.

Citation

@inproceedings{nahrendra2025locovlm,
  title     = {LocoVLM: Grounding Vision and Language for
               Adapting Versatile Legged Locomotion Policies},
  author    = {Nahrendra, I Made Aswin and Lee, Seunghyun
               and Lee, Dongkyu and Myung, Hyun},
  booktitle = {ICRA 2025 Workshop on Safe Vision-Language
               Models (SafeVLMs)},
  year      = {2025}
}
