Abstract
Neural Architecture Search (NAS) for video understanding has advanced slowly compared to its image-domain counterpart. Current approaches often focus on 3D networks, search for untied spatial and temporal components, or search for pseudo-3D operators. Because NAS methods for image-related tasks are often unsuitable for videos due to the lack of benchmarks like NASBench-101, many video-NAS methods use naïve search procedures and fail to leverage advances in search mechanisms developed for NAS on image tasks. In this work, we propose VIM-NAS, the first approach to bridge the gap between NAS for Videos and IMages, offering a unique solution for finding high-performing and efficient neural networks across the ImageNet, Kinetics-400, Kinetics-600, and Something-SomethingV2 datasets. We optimize the 2D space and 3D space-time tubes used to tokenize images and videos, along with the architecture of a unique supernet Vision Transformer, via a differentiable weight-entanglement mechanism. Leveraging a multi-dataset training strategy, VIM-NAS achieves 84.4% Top-1 accuracy on ImageNet and 90.7% on Kinetics-400, improves the state of the art on Kinetics-600 by 0.4%, and improves the previous NAS state of the art on Something-SomethingV2 by 13.4%, reducing the accuracy gap with hand-designed neural networks in video action recognition.