Abstract
Human pose estimation (HPE) is a pivotal task in computer vision, with applications in domains such as sports analytics, rehabilitation, and performance capture. However, obtaining labeled datasets for 3D pose estimation remains costly and resource-intensive. To address this challenge, we propose a novel pipeline that uses contrastive learning to reduce labeling requirements while maintaining adequate performance. Our method employs unsupervised fine-tuning of pre-trained ResNet backbones on unannotated multiview data acquired in a skiing scenario. The learned representations are then used to select a minimal yet diverse subset of the data for labeling, which is subsequently used for supervised training. We demonstrate the effectiveness of this approach using three contrastive paradigms, namely SimCLR, MoCo, and SimSiam, and evaluate their impact on data efficiency and model performance on the SkiPose dataset. Our results indicate that contrastive learning can significantly reduce labeling costs while retaining good pose estimation results, making it a promising solution for resource-constrained applications. Code is available at mmlab-cv.github.io/ICLaP.
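To make the subset-selection step concrete, the following is a minimal sketch of one way a diverse subset could be chosen from learned embeddings. The abstract does not specify the selection rule, so the greedy k-center (farthest-point) strategy used here is an assumption chosen for illustration, and the function name `k_center_greedy` and its parameters are hypothetical rather than taken from the released code.

```python
import math
import random


def k_center_greedy(embeddings, budget, seed=0):
    """Return `budget` indices whose embeddings are spread out in feature space.

    Illustrative stand-in for a diversity-driven selection step; this is an
    assumed strategy, not necessarily the one used in the paper.
    """
    n = len(embeddings)
    if not 0 < budget <= n:
        raise ValueError("budget must be between 1 and the number of samples")

    # Start from a reproducible random sample.
    first = random.Random(seed).randrange(n)
    selected = [first]

    # min_dist[i] = distance from sample i to its nearest selected sample.
    min_dist = [math.dist(e, embeddings[first]) for e in embeddings]

    while len(selected) < budget:
        # Pick the sample that is farthest from everything selected so far.
        nxt = max(range(n), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], math.dist(e, embeddings[nxt]))

    return selected


# Toy 2D "embeddings" forming three well-separated pairs.
toy = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0), (0.0, 10.0), (0.1, 10.0)]
to_label = k_center_greedy(toy, budget=3)
```

In this sketch, the returned indices would identify the frames sent for annotation; in practice the embeddings would come from the contrastively fine-tuned backbone rather than a toy list.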