Self-Supervised Vision Transformers for Medical Image Segmentation with Limited Annotations
Abstract
Annotating medical images for segmentation is expensive and requires domain expertise. We propose MedSSL-ViT, a self-supervised pre-training framework for Vision Transformers (ViTs) tailored to medical imaging. MedSSL-ViT combines masked image modeling with anatomy-aware contrastive learning, exploiting the consistent anatomical structure of medical images. Pre-trained on 850K unlabeled chest X-rays and CT slices, the model achieves state-of-the-art segmentation performance on four downstream tasks using only 10% of the annotations: lung segmentation (Dice: 97.2%), cardiac chamber segmentation (Dice: 93.5%), liver tumor segmentation (Dice: 78.8%), and retinal vessel segmentation (Dice: 82.1%). With just 1% of the labels, MedSSL-ViT still outperforms fully supervised baselines trained on 100% of the labels by 2-5% in Dice score.
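All results above are reported as Dice scores. For readers unfamiliar with the metric, the following is a minimal sketch of the standard Dice coefficient, 2|A∩B| / (|A| + |B|), for two binary masks. The function name `dice_score`, the flattened-list mask representation, the convention for empty masks, and the toy values are assumptions made for illustration; they are not taken from the MedSSL-ViT evaluation code.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Count pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Sum of the foreground pixel counts of each mask.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy 2x3 masks, flattened row by row (illustrative values, not from the paper).
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(f"Dice: {100 * dice_score(pred, target):.1f}%")  # prints "Dice: 66.7%"
```

In the toy example the masks share 2 foreground pixels and each contains 3, giving 2·2 / (3 + 3) ≈ 0.667. Multi-class tasks such as cardiac chamber segmentation are commonly scored by computing this value per class and averaging, though the abstract does not state which averaging scheme is used.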