Video action recognition meets vision-language models exploring human factors in scene interaction: a review
Affiliation:

1. Shenyang Ligong University; 2. Zhejiang University; 3. University of Adelaide

Fund Project:

the National Natural Science Foundation of China under Grant 62406280; the Zhejiang Provincial Natural Science Foundation of China under Grant LQ23F030001; the Autism Research Special Fund of the Zhejiang Foundation for Disabled Persons under Grant 2023008; the Liaoning Province Higher Education Innovative Talents Program Support Project under Grant LR2019058; the Liaoning Province Joint Open Fund for Key Scientific and Technological Innovation Bases under Grant 2021-KF-12-05; and the Central Guidance on Local Science and Technology Development Fund of Liaoning Province under Grant 2023JH6/100100066

    Abstract:

    Video Action Recognition (VAR) aims to analyze dynamic behaviors in videos and achieve semantic understanding. VAR faces challenges such as temporal dynamics, action-scene coupling, and the complexity of human interactions. By spatiotemporal granularity, existing methods can be categorized as motion-level, event-level, or story-level. However, single-modal approaches struggle to capture complex behavioral semantics and human factors, so Vision-Language Models (VLMs) have in recent years been introduced into this field, providing new research perspectives for VAR. In this paper, we systematically review spatiotemporal hierarchical methods in VAR and explore how the introduction of large models has advanced the field. Additionally, we propose the concept of the "Factor" to identify and integrate key information from both the visual and textual modalities, thereby enhancing multimodal alignment. We also summarize existing multimodal alignment methods and offer an in-depth analysis of, and insights into, future research directions.
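    To make the surveyed paradigm concrete, the sketch below illustrates, in deliberately simplified form, how a CLIP-style vision-language model can score a video clip against textual action prompts: per-frame features are pooled over time, both modalities are projected into a shared embedding space, and cosine similarity yields per-action logits. This is a minimal sketch under our own assumptions (toy encoder modules, dimensions, and the conventional 0.07 temperature), not the alignment method proposed in the paper.

        # Minimal sketch (not the paper's method): CLIP-style video-text
        # alignment for zero-shot action recognition. All module names and
        # dimensions here are illustrative assumptions.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class ToyVideoEncoder(nn.Module):
            """Stand-in visual encoder: embeds each frame, then mean-pools over time."""
            def __init__(self, frame_dim=512, embed_dim=256):
                super().__init__()
                self.proj = nn.Linear(frame_dim, embed_dim)

            def forward(self, frames):            # frames: (B, T, frame_dim)
                per_frame = self.proj(frames)     # (B, T, embed_dim)
                return per_frame.mean(dim=1)      # temporal pooling -> (B, embed_dim)

        class ToyTextEncoder(nn.Module):
            """Stand-in text encoder: embeds token ids of each action prompt."""
            def __init__(self, vocab=1000, embed_dim=256):
                super().__init__()
                self.emb = nn.Embedding(vocab, embed_dim)

            def forward(self, token_ids):         # token_ids: (C, L) for C class prompts
                return self.emb(token_ids).mean(dim=1)   # (C, embed_dim)

        def action_logits(video_emb, text_emb, temperature=0.07):
            """Cosine-similarity logits between video clips and action prompts."""
            v = F.normalize(video_emb, dim=-1)
            t = F.normalize(text_emb, dim=-1)
            return (v @ t.T) / temperature        # (B, C)

        if __name__ == "__main__":
            frames = torch.randn(2, 8, 512)              # 2 clips, 8 frames each
            prompts = torch.randint(0, 1000, (3, 6))     # 3 action prompts ("a video of ...")
            logits = action_logits(ToyVideoEncoder()(frames), ToyTextEncoder()(prompts))
            print(logits.argmax(dim=-1))                 # predicted action class per clip

    The "Factor" concept discussed in the paper can be read against this template: rather than pooling all frame and token features uniformly, the goal is to identify the key visual and textual information before alignment.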

History
  • Received: March 16, 2025
  • Revised: April 20, 2025
  • Accepted: May 21, 2025
  • Online:
  • Published: