The seminar explores how the integration of vision and language models is transforming recognition tasks across multiple domains. Central to this discussion is the challenge of recognizing and adapting to previously unseen or evolving categories in open-world settings, particularly without relying on pre-defined vocabularies or exhaustive training data. Leveraging pre-trained vision-language models (VLMs) such as CLIP, the presented works propose techniques for improving recognition and adaptation in various complex scenarios.
The seminar highlights the growing shift towards unsupervised and training-free methods, addressing the limitations of existing models that require extensive labeled data or specialized training. For example, AutoLabel automatically generates candidate class names for Open-set Unsupervised Video Domain Adaptation, removing the need for oracle knowledge of target label names. Similarly, the novel Vocabulary-free Image Classification task is accompanied by a framework that classifies images in an unconstrained semantic space, bypassing the restrictions of fixed vocabularies through dynamic category search.
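To illustrate the shared mechanism behind these works, the following is a minimal sketch of CLIP-style zero-shot classification: an image embedding is compared against text embeddings of candidate class names, and the best-matching name is selected. The checkpoint, candidate names, and image path are placeholders for illustration only and are not taken from the presented works.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint; the presented works build on CLIP-like models.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical candidate names; AutoLabel and vocabulary-free classification
# aim to generate or search such candidates automatically rather than fix them.
candidates = ["a photo of a dog", "a photo of a cat", "a photo of a bicycle"]
image = Image.open("example.jpg")  # placeholder image path

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-to-text similarity scores
probs = logits.softmax(dim=-1)[0]
print(candidates[probs.argmax().item()], probs.max().item())
```

The discussed methods replace the fixed candidate list above with names that are generated, searched, or adapted at test time.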
Another key area is zero-shot temporal action localization (ZS-TAL), where models must identify and localize unseen actions in videos without training on labeled data. Test-time adaptation emerges as a promising approach here, allowing models to adjust to new content at inference time without task-specific training. This emphasis on flexible, on-the-fly solutions also appears in Automatic Programming of Experiments (APEx), a framework that automates benchmarking for large multimodal models, accelerating evaluation and hypothesis testing.
The event will take place on 15 October 2024 from 10:00 to 11:30 am.
TU Berlin
Einsteinufer 17, 10587 Berlin