R&D result detail

Original title

Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking

English title

Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking

Type

Proceedings paper indexed in WoS or Scopus

Original abstract

Probing methods are widely used to evaluate the multimodal representations of vision-language models (VLMs), with dominant approaches relying on zero-shot performance in image-text matching tasks. These methods typically assess models on curated datasets focusing on linguistic aspects such as counting, relations, or attributes. This work uses a complementary probing strategy called guided masking. This approach selectively masks different modalities and evaluates the model’s ability to predict the masked word. We specifically focus on probing verbs, as their comprehension is crucial for understanding actions and relationships in images, and it is a more challenging task than comprehending subjects, objects, or attributes. Our analysis targets VLMs that use region-of-interest (ROI) features obtained from object detectors as input tokens. Our experiments demonstrate that selected models can accurately predict the correct verb, challenging previous conclusions based on image-text matching methods, which suggested VLMs fail in situations requiring verb understanding. The code for the experiments will be available at https://github.com/ivana-13/guided_masking.

English abstract

Probing methods are widely used to evaluate the multimodal representations of vision-language models (VLMs), with dominant approaches relying on zero-shot performance in image-text matching tasks. These methods typically assess models on curated datasets focusing on linguistic aspects such as counting, relations, or attributes. This work uses a complementary probing strategy called guided masking. This approach selectively masks different modalities and evaluates the model’s ability to predict the masked word. We specifically focus on probing verbs, as their comprehension is crucial for understanding actions and relationships in images, and it is a more challenging task than comprehending subjects, objects, or attributes. Our analysis targets VLMs that use region-of-interest (ROI) features obtained from object detectors as input tokens. Our experiments demonstrate that selected models can accurately predict the correct verb, challenging previous conclusions based on image-text matching methods, which suggested VLMs fail in situations requiring verb understanding. The code for the experiments will be available at https://github.com/ivana-13/guided_masking.

Keywords

multimodal models, probing, understanding, verb phrases, foundational models, image-text matching, guided masking

Keywords in English

multimodal models, probing, understanding, verb phrases, foundational models, image-text matching, guided masking

Authors

Ivana Beňová, Jana Košecká, Michal Gregor, Martin Tamajka, Marcel Veselý, Marián Šimko

RIV year

2026

Published

01.01.2025

Publisher

Springer Nature

Place

CHAM

ISBN

978-3-031-82669-6

Book

SOFSEM 2025: Theory and Practice of Computer Science

Periodical

Lecture Notes in Computer Science

Country

Swiss Confederation

Pages from

80

Pages to

93

Number of pages

14

BibTeX

@inproceedings{BUT199780,
  author="Ivana {Beňová} and Jana {Košecká} and Michal {Gregor} and Martin {Tamajka} and Marcel {Veselý} and Marián {Šimko}",
  title="Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking",
  booktitle="SOFSEM 2025: Theory and Practice of Computer Science",
  year="2025",
  series="Lecture Notes in Computer Science",
  pages="80--93",
  publisher="Springer Nature",
  address="CHAM",
  doi="10.1007/978-3-031-82670-2_7",
  isbn="978-3-031-82669-6"
}
