Abstract: In learning vision-language representations from Web-scale data, the contrastive language-image pre-training (CLIP) mechanism has demonstrated remarkable performance on many vision tasks.
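As a point of reference for the CLIP mechanism named above, the sketch below shows the symmetric contrastive objective in its commonly described form: image and text embeddings are L2-normalized, pairwise cosine similarities are scaled by a temperature, and a cross-entropy loss pulls matched image-text pairs (the diagonal) together in both directions. All names, the dimensions, and the temperature value are illustrative assumptions, not taken from this paper.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # L2-normalize so dot products are cosine similarities (assumed setup)
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # Pairwise similarity logits, scaled by a learned/fixed temperature
    logits = img @ txt.T / temperature
    n = logits.shape[0]
    labels = np.arange(n)  # matched pairs sit on the diagonal

    def xent(l):
        # Numerically stable softmax cross-entropy against the diagonal
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), labels].mean()

    # Symmetric loss: image-to-text and text-to-image directions
    return (xent(logits) + xent(logits.T)) / 2

# Toy usage with random embeddings (batch of 4, dimension 8)
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
loss = clip_contrastive_loss(img, img.copy())  # perfectly matched pairs
```

With perfectly matched (identical) embeddings the diagonal similarity is maximal, so the loss is lower than for any mispaired batch.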