NAS just once: Neural Architecture Search for joint Image-Video Recognition

Sofia Casarin; S Escalera; Oswald Lanz

doi:10.1109/ICCVW69036.2025.00667

Conference proceeding

Peer reviewed

2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pp.6431-6441

IEEE International Conference on Computer Vision Workshops

IEEE International Conference on Computer Vision (Honolulu, Hawai'i, 19/10/2025–23/10/2025)

2025

DOI: https://doi.org/10.1109/ICCVW69036.2025.00667

Handle: https://hdl.handle.net/10863/51551

Abstract

Image classification

Neural Architecture Search

Video action recognition

Neural Architecture Search (NAS) for Video Understanding has slowly advanced compared to the Image-domain counterpart. Current approaches often focus on 3D networks, search for untied spatial and temporal components, or for pseudo-3D operators. As NAS methods for image-related tasks are often unsuitable for videos due to the lack of benchmarks like NASBench-101, many video-NAS methods use naïve search procedures and fail to leverage advancements in search mechanisms developed for NAS for image tasks. In this work, we propose the first approach to bridge the gap between NAS for Videos and IMages (VIM-NAS), proposing a unique solution to find high-performing and efficient neural networks across ImageNet, Kinetics-400, Kinetics-600, and Something-SomethingV2 datasets. We optimize the 2D space and 3D space-time tubes to tokenize images and videos, along with the architecture of a unique supernet Vision transformer, via a differentiable weight-entanglement mechanism. Leveraging a multi-dataset training strategy, VIM-NAS achieves 84.4% Top-1 accuracy on ImageNet, 90.7% on Kinetics-400, improves state-of-the-art on Kinetics-600 by 0.4%, and improves previous NAS SOTA by 13.4% on Something-SomethingV2, reducing the accuracy gap with hand-designed neural networks in Video Action Recognition.

Files and links (2)

url

https://openaccess.thecvf.com/content/ICCV2025W/Findings/html/Casarin_NAS_just_once_Neural_Architecture_Search_for_joint_Image-Video_Recognition_ICCVW_2025_paper.html

url

https://doi.org/10.1109/ICCVW69036.2025.00667

Details

Title: NAS just once: Neural Architecture Search for joint Image-Video Recognition
Creators: Sofia Casarin - Free University of Bozen-Bolzano
S Escalera - Barcelona Supercomputing Center
Oswald Lanz - Free University of Bozen-Bolzano
Publication Details: 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pp.6431-6441
ISBN: 979-8-3315-8989-9
EISBN: 979-8-3315-8988-2
ISSN: 2473-9936
EISSN: 2473-9944
Conference: IEEE International Conference on Computer Vision (Honolulu, Hawai'i, 19/10/2025–23/10/2025)
Series / Volume: IEEE International Conference on Computer Vision Workshops
Publisher: IEEE
Format: Online
Number of pages: 11
Identifiers: 979-8-3315-8989-9
(UNIBZ)95686863
991007307154001241
Scopus ID: 2-s2.0-105035150781
Academic Unit: Faculty of Engineering
Language: English
Resource Type: Conference proceeding
Author Names String: Casarin S, Escalera S, Lanz O
