CLIP (Contrastive Language-Image Pre-training) excels at zero-shot image classification across diverse domains, making it an ideal candidate for pre-labelling unlabelled datasets. This paper introduces three enhancements that improve CLIP-based pre-labelling without requiring any labelled data. First, we use a large language model (GPT-3.5-Turbo) to refine class prompts into more descriptive ones, significantly boosting accuracy across several datasets. Second, we address overconfident predictions through confidence calibration, achieving improved results without a separate labelled validation set.
Third, we ensemble CLIP with DINOv2 to exploit their complementary inductive biases, yielding a substantial boost in zero-shot labelling accuracy. Experiments across multiple datasets show consistent gains, particularly on ambiguous classes. This work not only addresses limitations of CLIP but also offers practical insights for applying multimodal models in real-world settings.
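To make the pre-labelling setup concrete, the sketch below shows CLIP zero-shot classification with several descriptive prompts per class, averaged into one text embedding per class. It is a minimal illustration, not the paper's implementation: the checkpoint, class names, and prompt texts are assumptions written by hand (the paper generates its prompts with GPT-3.5-Turbo), and the calibration and DINOv2 ensembling steps are omitted.

```python
# Minimal sketch: CLIP zero-shot pre-labelling with multiple descriptive
# prompts per class. Checkpoint, classes, and prompts below are illustrative
# assumptions, not the paper's actual (LLM-generated) prompts.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical classes, each with a few descriptive prompts.
class_prompts = {
    "golden retriever": [
        "a photo of a golden retriever, a large dog with a wavy golden coat",
        "a golden retriever dog with floppy ears and a friendly expression",
    ],
    "tabby cat": [
        "a photo of a tabby cat, a domestic cat with striped grey-brown fur",
        "a tabby cat with an M-shaped marking on its forehead",
    ],
}

@torch.no_grad()
def class_text_embeddings(prompt_dict):
    """Encode each class's prompts and average them into one text embedding."""
    embeddings = []
    for prompts in prompt_dict.values():
        tokens = processor(text=prompts, return_tensors="pt", padding=True)
        feats = model.get_text_features(**tokens)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        embeddings.append(feats.mean(dim=0))  # average over the class's prompts
    embeddings = torch.stack(embeddings)
    return embeddings / embeddings.norm(dim=-1, keepdim=True)

@torch.no_grad()
def pseudo_label(image_path, prompt_dict):
    """Return (class name, softmax confidence) for one unlabelled image."""
    text_emb = class_text_embeddings(prompt_dict)          # (num_classes, d)
    image = Image.open(image_path).convert("RGB")
    pixels = processor(images=image, return_tensors="pt")
    img_emb = model.get_image_features(**pixels)            # (1, d)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    logits = model.logit_scale.exp() * img_emb @ text_emb.T  # (1, num_classes)
    probs = logits.softmax(dim=-1).squeeze(0)
    idx = int(probs.argmax())
    return list(prompt_dict.keys())[idx], float(probs[idx])

# Example usage on a hypothetical file:
# label, confidence = pseudo_label("unlabelled_image.jpg", class_prompts)
```

The returned softmax confidence is exactly the quantity the paper's calibration step would adjust before the pseudo-labels are used downstream.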