Enterprise AI Analysis: From Scene to Object: Text-Guided Dual-Gaze Prediction

From Scene to Object: Text-Guided Dual-Gaze Prediction

Revolutionizing Driver Attention for Human-Like Autonomous Systems

Authors: Zehong Ke, Yanbo Jiang, Jinhao Li, Zhiyuan Liu, Yiqian Tu, Qingwen Meng+, Heye Huang, Jianqiang Wang

Interpretable driver attention prediction is crucial for human-like autonomous driving. However, existing datasets provide only scene-level global gaze rather than fine-grained object-level annotations, inherently failing to support text-grounded cognitive modeling. Consequently, while Vision-Language Models (VLMs) hold great potential for semantic reasoning, this critical data limitation leads to severe text-vision decoupling and visual-bias hallucinations. To break this bottleneck and achieve precise object-level attention prediction, this paper proposes a novel dual-branch gaze prediction framework, establishing a complete paradigm from data construction to model architecture. First, we construct G-W3DA, an object-level driver attention dataset. By integrating a multimodal large language model with the Segment Anything Model 3 (SAM3), we decouple macroscopic heatmaps into object-level masks under rigorous cross-validation, fundamentally eliminating annotation hallucinations. Building on this high-quality data foundation, we propose the DualGaze-VLM architecture, which extracts the hidden states of semantic queries and dynamically modulates visual features via a Condition-Aware SE-Gate, achieving intent-driven, precise spatial anchoring. Extensive experiments on the W3DA benchmark demonstrate that DualGaze-VLM consistently surpasses existing state-of-the-art (SOTA) models on spatial alignment metrics, notably achieving up to a 17.8% improvement in Similarity (SIM) under safety-critical scenarios. Furthermore, a human evaluation reveals that 88.22% of evaluators perceive the attention heatmaps generated by DualGaze-VLM as authentic, demonstrating its capability to generate rational cognitive priors.

Executive Impact & Key Findings

This research introduces a paradigm shift in driver attention prediction, moving from scene-level global gaze to text-guided object-level attention. By developing the G-W3DA dataset and DualGaze-VLM architecture, it tackles issues of visual-bias hallucination and text-vision decoupling. The approach leverages multimodal large language models and advanced segmentation to create precise, object-level attention masks, achieving significant improvements in safety-critical scenarios and high human authenticity ratings for generated attention maps.

17.8% SIM Improvement in Safety-Critical Scenarios
88.22% Human Authenticity Rating
Hallucination-Free, Decoupled Object-Level Annotations

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

The paper identifies a critical limitation: existing datasets lack fine-grained, object-level attention annotations, leading to text-vision decoupling and visual-bias hallucinations in VLM-based models. To address this, it introduces G-W3DA, a novel object-level driver attention dataset constructed via an automated pipeline that integrates Qwen3.5-Plus with SAM3, decoupling macroscopic heatmaps into precise object-level masks through rigorous cross-validation. This methodology eliminates annotation hallucinations by enforcing strict compliance with recorded human gaze, discarding objects the model would otherwise hallucinate but the driver never actually attended to. The resulting dataset provides physically grounded supervision for text-vision cognitive alignment, raising the mean attention intensity within identified regions by 65% across driving scenarios.
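For illustration, the gaze-compliance filtering at the core of this pipeline can be sketched as follows. `propose_objects`, `segment`, and the attention threshold are hypothetical placeholders standing in for the Qwen3.5-Plus and SAM3 calls, not the paper's actual interfaces:

```python
import numpy as np

# Hypothetical stand-ins for the pipeline components described above:
# object proposals would come from the multimodal LLM, masks from SAM3.
def propose_objects(image) -> list:
    ...  # MLLM call: candidate object descriptions for the scene

def segment(image, text_prompt: str) -> np.ndarray:
    ...  # SAM3 call: binary HxW mask for the prompted object

def build_object_level_labels(image, gaze_heatmap: np.ndarray,
                              min_mean_attention: float = 0.35) -> list:
    """Decouple a scene-level gaze heatmap into object-level masks.

    The cross-validation step keeps a candidate mask only if human gaze is
    genuinely concentrated inside it, discarding objects the MLLM names but
    the driver never attended to. The threshold value is an assumption.
    """
    labels = []
    for name in propose_objects(image):
        mask = segment(image, name).astype(bool)
        if not mask.any():
            continue
        mean_attention = float(gaze_heatmap[mask].mean())
        if mean_attention >= min_mean_attention:  # gaze-compliance check
            labels.append({"object": name, "mask": mask,
                           "attention": mean_attention})
    return labels
```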

The core innovation is the DualGaze-VLM architecture, a dual-branch predictor designed to exploit the decoupled data paradigm. Unlike traditional models that output a single holistic heatmap, DualGaze-VLM predicts macroscopic scene-level gaze and microscopic object-level attention in parallel. It employs a Query-Conditioned SE-Gate Modulation mechanism that extracts semantic queries from the reasoning context (e.g., the [ATTN] token for global gaze, or the first token of a regional description for object-specific gaze) and dynamically modulates visual features. This enables intent-driven, precise spatial anchoring: the visual backbone adaptively generates either global or region-specific representations conditioned on different linguistic queries. A shared progressive cognitive decoder then translates the modulated features into continuous spatial heatmaps, standardized to 256×256.
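A minimal sketch of such a query-conditioned SE-gate follows; the layer sizes and the concat-then-MLP fusion are assumptions, since the paper describes the mechanism rather than these hyperparameters:

```python
import torch
import torch.nn as nn

class QueryConditionedSEGate(nn.Module):
    """Channel-wise gating of visual features, conditioned on a semantic query."""

    def __init__(self, vis_dim: int, query_dim: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(vis_dim + query_dim, vis_dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(vis_dim // reduction, vis_dim),
            nn.Sigmoid(),
        )

    def forward(self, vis_feats: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # vis_feats: (B, C, H, W) backbone features
        # query:     (B, Dq) hidden state of the semantic query token,
        #            e.g. [ATTN] for global gaze or the first token of a
        #            regional description for object-level gaze
        squeezed = vis_feats.mean(dim=(2, 3))                   # (B, C) squeeze
        gate = self.gate(torch.cat([squeezed, query], dim=-1))  # (B, C) excite
        return vis_feats * gate[:, :, None, None]               # modulate channels
```

Because the gate is conditioned on the query, the same backbone features can be steered toward a global gaze map or a specific object simply by swapping in a different query embedding.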

Extensive experiments on the W3DA benchmark demonstrate that DualGaze-VLM consistently surpasses existing state-of-the-art (SOTA) models on spatial alignment metrics. Notably, it achieves up to a 17.8% improvement in Similarity (SIM) under safety-critical scenarios over previous best models such as FSDAM, and a 4.9% improvement in Pearson's Correlation Coefficient (CC). Qualitative analyses show stronger consistency with ground truth, better coverage of hazard-relevant regions, heightened sensitivity to vulnerable road users, and reduced center-bias. A human evaluation inspired by the Turing Test confirms that 88.22% of generated attention heatmaps are perceived as authentic by evaluators, validating the model's capability to generate rational cognitive priors.
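For reference, SIM and CC are standard saliency-evaluation metrics; their conventional definitions are shown below (this is illustrative, not code from the paper):

```python
import numpy as np

def sim(pred: np.ndarray, gt: np.ndarray) -> float:
    """Similarity (histogram intersection): both maps are normalized to sum
    to 1, then SIM is the sum of elementwise minima; 1.0 means identical."""
    p = pred / (pred.sum() + 1e-8)
    g = gt / (gt.sum() + 1e-8)
    return float(np.minimum(p, g).sum())

def cc(pred: np.ndarray, gt: np.ndarray) -> float:
    """Pearson's Correlation Coefficient between two attention maps."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    g = (gt - gt.mean()) / (gt.std() + 1e-8)
    return float((p * g).mean())
```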

Enterprise Process Flow

RGB Image & Query → VLM Semantic Query Extraction → Query-Conditioned SE-Gate → Shared Cognitive Decoder → Dual-Gaze Prediction (Global & Object)
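To make the flow concrete, here is an illustrative end-to-end forward pass; `model` and its attributes are hypothetical names for the stages above, not the paper's API:

```python
def dual_gaze_forward(image, text_query, model):
    """Illustrative forward pass mirroring the process flow (hypothetical API)."""
    vis = model.backbone(image)                         # RGB image -> visual features
    q_global, q_object = model.vlm_queries(text_query)  # [ATTN] + regional token states
    heat_global = model.decoder(model.se_gate(vis, q_global))  # scene-level gaze
    heat_object = model.decoder(model.se_gate(vis, q_object))  # object-level gaze
    return heat_global, heat_object                     # both 256x256 heatmaps
```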
17.8% SIM Improvement in Safety-Critical Scenarios

DualGaze-VLM vs. SOTA Baselines

Attention Granularity
  • SOTA baselines (e.g., FSDAM): scene-level global heatmaps; weak semantic grounding
  • DualGaze-VLM (Ours): object-level attention masks; precise spatial anchoring

Text-Vision Alignment
  • SOTA baselines: severe text-vision decoupling; visual-bias hallucinations
  • DualGaze-VLM: rigorous cross-validation, hallucination-free; strong semantic-spatial alignment

Performance (SIM, Safety-Critical)
  • SOTA baselines: ~0.467 (previous best)
  • DualGaze-VLM: 0.550 (+17.8% improvement)

Human Plausibility
  • SOTA baselines: not explicitly measured at object level
  • DualGaze-VLM: 88.22% perceived as authentic
88.22% of Human Evaluators Perceived the Attention Maps as Authentic

Qualitative Advantages in Diverse Scenarios

Multi-target Scenarios: Our method better captures distributed attention patterns and covers multiple relevant regions more completely, as shown in Figure 7 (a) and (b).

Vulnerable Road Users: Enhanced sensitivity to pedestrians while preserving attention to the forward vehicle, critical for urban driving (Figure 7 (c)).

Reduced Center-Bias: Effectively shifts attention away from the image center towards truly relevant peripheral hazards, mitigating common biases of traditional models (Figure 7 (d) and (e)).

Compact & Semantically Focused: Generates more compact and semantically focused attention maps, avoiding unnecessary diffuse responses.


Your AI Transformation Roadmap

Embark on a structured journey to integrate cutting-edge AI. Our phased approach ensures a smooth, efficient, and impactful implementation tailored to your enterprise needs.

Phase 1: Foundation & Data

Establish G-W3DA dataset construction, ensuring hallucination-free, object-level annotations. Integrate multimodal LLMs and SAM3 for precise mask generation and cross-validation.

Phase 2: Architecture Integration

Implement the DualGaze-VLM dual-branch architecture, integrating Query-Conditioned SE-Gate Modulation and the Shared Progressive Cognitive Decoder.

Phase 3: Training & Optimization

Train the model using joint optimization with a mixed loss strategy (Causal Language Modeling, KL Divergence, Spatial-Weighted BCE) across diverse driving scenarios.
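A sketch of how such a mixed objective could be combined, assuming a sigmoid-activated heatmap head; the spatial weighting scheme and loss weights are assumptions, not the paper's values:

```python
import torch.nn.functional as F

def mixed_loss(lm_logits, lm_targets, pred_heat, gt_heat,
               w_lm=1.0, w_kl=1.0, w_bce=1.0):
    # Causal language modeling loss over the reasoning text
    l_lm = F.cross_entropy(lm_logits.transpose(1, 2), lm_targets,
                           ignore_index=-100)
    # KL divergence between predicted and ground-truth gaze distributions
    p = F.log_softmax(pred_heat.flatten(1), dim=-1)
    g = F.softmax(gt_heat.flatten(1), dim=-1)
    l_kl = F.kl_div(p, g, reduction="batchmean")
    # Spatial-weighted BCE: up-weight pixels the driver actually fixated
    weights = 1.0 + gt_heat
    l_bce = F.binary_cross_entropy(pred_heat.clamp(1e-6, 1 - 1e-6),
                                   gt_heat, weight=weights)
    return w_lm * l_lm + w_kl * l_kl + w_bce * l_bce
```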

Phase 4: Deployment & Iteration

Deploy the attention prediction system, gather feedback, and continuously update the G-W3DA dataset for ongoing model refinement and adaptation to new driving conditions.

Ready to Enhance Your Autonomous Systems?

Unlock the full potential of human-like driver attention prediction. Connect with our experts to explore how DualGaze-VLM can be tailored to your specific enterprise challenges.
