PedestrianQA

Abstract

A unified, explainable framework for pedestrian behavior.

Pedestrian intention and trajectory prediction are critical for the safe deployment of autonomous driving systems, directly influencing navigation decisions in complex traffic environments. Recent advances in large vision–language models offer a powerful new paradigm for these tasks by combining high-capacity visual understanding with flexible natural-language reasoning.

We introduce PedestrianQA, a large-scale video-based dataset that formulates pedestrian intention and trajectory prediction as question–answering tasks augmented with structured rationales. PedestrianQA expresses richly annotated pedestrian sequences in natural language, enabling VLMs to learn from visual dynamics, contextual cues, and interactions among traffic agents while generating concise explanations of their predictions — without needing specialized architectures tailored for each task.

Empirical evaluations across PIE, JAAD, TITAN, and IDD-PeD show that finetuning state-of-the-art VLMs on PedestrianQA significantly improves intention classification, trajectory forecasting, and the quality of explanatory rationales — demonstrating the strong potential of VLMs as a unified, explainable framework for safety-critical pedestrian behavior modeling.

At a glance

What makes PedestrianQA different.

14,310 question–answer–rationale samples

4 source datasets, structured + unstructured

5+2 rationale categories and conclusions

3B-parameter model surpasses 7–9B baselines

Intention Prediction

Binary classification of whether a target pedestrian will cross the road, justified with multi-aspect reasoning rather than a black-box label.

Trajectory Forecasting

Future bounding-box prediction in image coordinates with paired rationales explaining the predicted spatio-temporal path.

Structured Rationales

Spatial, temporal, mathematical, ego-vehicle, and scene-context explanations — plus a final destination prediction and a plain-language conclusion.

Unified Benchmark

Bridges PIE, JAAD, TITAN, and unstructured IDD-PeD into a single QA schema — enabling cross-dataset generalization and rich evaluation.

PedestrianQA Dataset

Every sample comes with a question, an answer, and five rationales.

Each pedestrian sequence pairs a short observation video with a question, a binary or trajectory answer, and structured rationales spanning spatial, temporal, mathematical, ego-vehicle, and scene-context reasoning — followed by a final destination prediction and a concise conclusion.

Figure 2. PedestrianQA sample for pedestrian intention prediction (PIP): 15 observation frames, the question, the answer, and five structured rationales plus a conclusion. The model must justify its Yes/No answer using spatial, temporal, mathematical, ego-vehicle, and scene-context reasoning, with a concise conclusion suitable for non-expert users.
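As a rough illustration of the sample structure described above, the sketch below models one record as a Python dataclass. The class name, field names, and example strings are hypothetical and are not taken from the released annotation files; only the list of components (question, answer, five rationales, destination, conclusion) comes from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The five rationale categories named in the dataset description.
RATIONALE_CATEGORIES = (
    "spatial", "temporal", "mathematical", "ego_vehicle", "scene_context",
)


@dataclass
class PedestrianQASample:
    """Hypothetical container for one question-answer-rationale sample."""
    task: str                     # "PIP" (intention) or "PTP" (trajectory)
    frame_paths: List[str]        # observation frames in temporal order
    question: str
    answer: str                   # "Yes"/"No" for PIP, box text for PTP
    rationales: Dict[str, str] = field(default_factory=dict)
    destination: str = ""
    conclusion: str = ""

    def missing_rationales(self) -> List[str]:
        """Return the categories that have no non-empty rationale text."""
        return [
            c for c in RATIONALE_CATEGORIES
            if not self.rationales.get(c, "").strip()
        ]


# Placeholder content, for illustration only.
sample = PedestrianQASample(
    task="PIP",
    frame_paths=[f"frame_{i:02d}.jpg" for i in range(15)],
    question="Will the pedestrian cross the road?",
    answer="Yes",
    rationales={c: "placeholder text" for c in RATIONALE_CATEGORIES},
    destination="opposite curb",
    conclusion="The pedestrian is likely to cross.",
)
print(sample.missing_rationales())  # []
```

A helper like `missing_rationales` is one simple way to confirm that a record carries all five categories before it is used for training.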

Spatial

Pose, body orientation, and physical placement (e.g., on the curb, in a lane, perpendicular to the road).

Temporal

How motion evolves over the observation window — acceleration, deceleration, stops, frame-indexed transitions.

Mathematical

Quantitative cues: pedestrian–vehicle distance, displacement, velocity estimates, and trajectory angles.

Ego-vehicle

Braking, decelerating, or yielding behavior of the ego-vehicle that enables or discourages crossing.

Scene-context

Traffic lights, crosswalks, road infrastructure, illumination, and interactions with surrounding agents.

Destination

Predicted endpoint within the scene — opposite curb, mid-road stop, or continuing along the same side.
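To make the "mathematical" category more concrete, here is a minimal sketch of how displacement, speed, and heading could be derived from a sequence of bounding boxes. It is not the authors' procedure (the rationales are synthesized by an LLM from metadata); the `(x1, y1, x2, y2)` box format, the function names, and the example boxes are assumptions for illustration.

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) in pixels


def box_center(box: Box) -> Tuple[float, float]:
    """Centre point of an axis-aligned box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def motion_cues(boxes: List[Box], fps: float) -> Dict[str, float]:
    """Displacement, mean speed, and heading of the box centre over a window."""
    (x0, y0), (xn, yn) = box_center(boxes[0]), box_center(boxes[-1])
    dx, dy = xn - x0, yn - y0
    displacement = math.hypot(dx, dy)
    duration_s = (len(boxes) - 1) / fps
    return {
        "displacement_px": displacement,
        "speed_px_per_s": displacement / duration_s if duration_s > 0 else 0.0,
        # Degrees from the image x-axis; image y grows downward.
        "heading_deg": math.degrees(math.atan2(dy, dx)),
    }


# Three made-up boxes drifting 3 px to the right per frame.
boxes = [(100, 200, 140, 300), (103, 200, 143, 300), (106, 200, 146, 300)]
cues = motion_cues(boxes, fps=30.0)
print(cues["displacement_px"])  # 6.0
```

All quantities are in image coordinates, so they describe apparent motion in the frame rather than metric motion in the world.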

Generation Pipeline

Grounded metadata + VLM motion captions + LLM rationale synthesis.

The pipeline aggregates human-annotated ground truth across constituent datasets into a unified metadata schema, enriches each sequence with fine-grained pedestrian motion captions from a VLM, and feeds a structured instruction package — including a compliance checklist and in-context exemplars — to claude-sonnet-4 for question–answer–rationale generation.

Figure 3. Data generation pipeline. Annotations covering target pedestrians, interacting agents, scene context, and ego-vehicle telemetry are unified, augmented with VLM motion captions, and turned into Q–A–rationale triplets by a single LLM call.
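The instruction package described above can be pictured as a simple concatenation of labelled sections. The sketch below is an assumption about the layout, not the authors' actual prompt: the function name, section titles, and placeholder strings are all hypothetical, and motion captions are assumed to be merged into the metadata already.

```python
from typing import List


def build_instruction_package(
    system_prompt: str,
    task_definitions: str,
    rationale_instructions: str,
    exemplars: List[str],
    compliance_checklist: List[str],
    metadata: str,
) -> str:
    """Join the prompt components into one text block with labelled sections."""
    sections = [
        ("SYSTEM", system_prompt),
        ("TASK DEFINITIONS", task_definitions),
        ("RATIONALE INSTRUCTIONS", rationale_instructions),
        ("EXEMPLARS", "\n\n".join(exemplars)),
        ("COMPLIANCE CHECKLIST",
         "\n".join(f"- {item}" for item in compliance_checklist)),
        ("METADATA", metadata),
    ]
    return "\n\n".join(f"### {title}\n{body}" for title, body in sections)


# Placeholder strings only; the real prompt text is not reproduced here.
prompt = build_instruction_package(
    system_prompt="You write question-answer-rationale triplets.",
    task_definitions="PIP: will the pedestrian cross? PTP: future boxes.",
    rationale_instructions="Give one rationale per category.",
    exemplars=["Example 1 ...", "Example 2 ..."],
    compliance_checklist=["Use cues from every annotation source."],
    metadata="frame\tx1\ty1\tx2\ty2\n0\t100\t200\t140\t300",
)
print(prompt.splitlines()[0])  # ### SYSTEM
```

Keeping the metadata last means the fixed instructions form a shared prefix across sequences, which is a common layout choice but only one of several possible orderings.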

01

Metadata Construction

Aggregate target/interacting pedestrian boxes, vehicle trajectories, scene objects, and ego telemetry into a unified TSV schema.

02

Motion Captioning

Crop each frame around the target, overlay a red box, and prompt a VLM with 13 motion-specific questions for fine-grained captions.

03

LLM QA Synthesis

System prompt + task definitions + rationale instructions + in-context exemplars + compliance checklist + metadata → Q–A–rationales.

04

Compliance Checking

Rationales must integrate cues from all annotation sources without parroting raw attributes — non-compliant outputs are regenerated.
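The regenerate-on-failure step can be sketched as a bounded retry loop. Everything below is illustrative: `generate_until_compliant`, the attempt limit, and the keyword-based check are hypothetical stand-ins, and the real compliance criteria (integrating cues without repeating raw attributes) are richer than a keyword test.

```python
from typing import Callable, Optional

REQUIRED = ("spatial", "temporal", "mathematical", "ego-vehicle", "scene-context")


def mentions_all_categories(text: str) -> bool:
    """Toy compliance check: every rationale category is named in the output."""
    lowered = text.lower()
    return all(name in lowered for name in REQUIRED)


def generate_until_compliant(
    generate: Callable[[str], str],
    is_compliant: Callable[[str], bool],
    prompt: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Call `generate` until `is_compliant` accepts the output or attempts run out."""
    for _ in range(max_attempts):
        output = generate(prompt)
        if is_compliant(output):
            return output
    return None  # the caller decides what to do with a sequence that never passes


# Stub generator standing in for the LLM call: first draft fails, second passes.
drafts = iter([
    "Spatial: on the curb.",
    "Spatial ... Temporal ... Mathematical ... Ego-vehicle ... Scene-context ...",
])
result = generate_until_compliant(
    lambda _prompt: next(drafts), mentions_all_categories, "prompt text"
)
print(result is not None)  # True
```

Passing the generator and the checker as callables keeps the loop independent of any particular LLM client.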

Baseline

A 3B-param VLM, LoRA-finetuned on PedestrianQA.

We finetune Qwen2.5-VL-3B-Instruct on PedestrianQA using LoRA adapters (rank=8, α=16). Both PIP and PTP are formulated as Q–A pairs — no architectural changes required. The same model handles intention prediction, trajectory forecasting, and rationale generation, jointly trained across all four source datasets.

Model: Qwen2.5-VL-3B-Instruct

Frames: 15 @ 30 fps / 10 @ 10 fps

Epochs: 3

Hardware: 2× RTX-A6000

Evaluator: Claude-Sonnet-4 (CLAIR)
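For readers unfamiliar with the LoRA hyperparameters, the arithmetic below shows what rank 8 and α = 16 imply under the standard LoRA formulation, where the update to a weight matrix is (α / r) · B · A. The layer size used here is illustrative and is not read from the model's configuration; the helper names are hypothetical.

```python
def lora_scaling(rank: int, alpha: float) -> float:
    """Scaling applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


RANK, ALPHA = 8, 16      # values stated in the baseline description
D_IN = D_OUT = 2048      # illustrative layer size (assumption)

print(lora_scaling(RANK, ALPHA))                 # 2.0
print(lora_trainable_params(D_IN, D_OUT, RANK))  # 32768
print(D_IN * D_OUT)                              # 4194304
```

For a square layer of this size, the adapter trains under 1% of the parameters in the frozen weight matrix, which is what makes finetuning on two GPUs practical.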

Results

Compact, finetuned, and ahead on every front.

Our model finetuned on all four datasets, "Ours (All Datasets)", outperforms zero-shot baselines across PIP, PTP, and rationale quality, including the much larger Qwen2.5-VL-7B, InternVL3-8B, Kwai-Keye-8B, and LLaVA-NeXT-Video-7B.

PIP Accuracy: 78.3% overall, best in class

PIP F₁: 0.542, +25.1% vs. Qwen2.5-VL-7B

PTP ADE: 37 px, lowest overall displacement error

Rationale: best in 6 of 7 categories on the combined set

Pedestrian Intention Prediction

Binary crossing classification. Accuracy and F₁ on PIE, JAAD, TITAN, IDD-PeD, and the combined set.

Table III: PIP results comparing baselines and our finetuned models across four datasets.
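As a reference for how the two PIP metrics are defined, the following sketch computes accuracy and F₁ for binary labels. It assumes "crossing" (a Yes answer) is the positive class and is encoded as 1; the function name and example labels are made up for illustration.

```python
from typing import List, Tuple


def accuracy_and_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Accuracy and F1 for binary labels, with 1 as the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
    return accuracy, f1


# Made-up labels: 1 = will cross, 0 = will not cross.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(round(acc, 3), round(f1, 3))  # 0.667 0.667
```

Because crossing events are often the minority class, F₁ can stay low even when accuracy is high, which is why both numbers are reported.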

Pedestrian Trajectory Prediction

ADE and FDE in image-coordinate pixels. Lower is better.

Table IV: PTP results showing ADE and FDE across datasets.
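ADE is the mean Euclidean error over all predicted time steps and FDE is the error at the final step. The sketch below computes both on box centres in pixels; whether the benchmark scores centres or full box coordinates is not stated here, so treat the centre-based choice, the function name, and the example points as assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) box centre in image pixels


def ade_fde(pred: List[Point], gt: List[Point]) -> Tuple[float, float]:
    """Average and final displacement error between two equal-length tracks."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("tracks must be non-empty and of equal length")
    errors = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(pred, gt)
    ]
    return sum(errors) / len(errors), errors[-1]


# Made-up tracks with per-step errors of 3, 4, and 5 pixels.
gt = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
pred = [(0.0, 3.0), (10.0, 4.0), (20.0, 5.0)]
ade, fde = ade_fde(pred, gt)
print(ade, fde)  # 4.0 5.0
```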

Rationale Quality (CLAIR / Claude-Sonnet-4)

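Categories: spatial (SR), temporal (TR), mathematical (MR), ego-vehicle (EVR), and scene-context (SCR) rationales, final destination prediction (FDP), and conclusion (C). Scores range from 0 to 100; higher is better.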

Table V: rationale evaluation across seven categories on the combined dataset.

+54.4% overall PIP accuracy vs. base Qwen2.5-VL-3B-Instruct (zero-shot).

+23.4% ego-vehicle reasoning quality vs. Qwen2.5-VL-7B-Instruct.

IDD-PeD: finetuning on unstructured-only data nearly matches all-dataset training.

Access

Get the dataset, code, and models.

The PedestrianQA repository ships CSV indexes and Q–A annotations. Raw source videos are not redistributed; obtain the required licenses from the original source datasets before downloading them.

Repository: https://github.com/botmahn/PedestrianQA

Source datasets: PIE, JAAD, TITAN, IDD-PeD

Citation

Cite PedestrianQA

If you use PedestrianQA in your research, please cite our paper.

@inproceedings{mishra2026pedestrianqa,
  title     = {PedestrianQA: A Benchmark for Vision-Language Models
               on Pedestrian Intention and Trajectory Prediction},
  author    = {Mishra, Naman and Gangisetty, Shankar and Jawahar, C. V.},
  booktitle = {Proceedings of the IEEE International Conference on
               Robotics and Automation (ICRA)},
  year      = {2026},
  url       = {https://github.com/botmahn/PedestrianQA}
}

Acknowledgment

This project was supported by iHub-Data and Mobility at IIIT Hyderabad.