Multimodal Neurons in Artificial Neural Networks
We’ve discovered neurons in CLIP that respond to the same concept whether presented literally, symbolically, or conceptually. This may explain CLIP’s accuracy in classifying surprising visual renditions of concepts, and is also an important step toward understanding the associations and biases that CLIP and similar models learn.
The good and the bad...
By exploiting the model's ability to read text robustly, we find that even photographs of hand-written text can often fool the model: taping a piece of paper with "iPod" written on it to a Granny Smith apple, for example, is enough to make the model classify the apple as an iPod.
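As a rough sketch of how one might probe this behavior with the open-source CLIP package (github.com/openai/CLIP): compare the model's zero-shot predictions for a plain photo of an object and for the same object with a hand-written label attached. The image file names and label prompts below are illustrative assumptions, not part of the original experiments.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Two photos of the same object: one plain, one with a hand-written
# label taped onto it (hypothetical file names).
images = ["apple.jpg", "apple_with_ipod_label.jpg"]
labels = ["an apple", "an iPod", "a library", "a pizza"]
text = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)

for path in images:
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        # CLIP returns image-to-text similarity logits for zero-shot classification.
        logits_per_image, _ = model(image, text)
        probs = logits_per_image.softmax(dim=-1).squeeze(0)
    print(path, {label: round(p.item(), 3) for label, p in zip(labels, probs)})
```

If the model is susceptible to this kind of typographic attack, the probability mass for the labeled photo shifts toward the written word rather than the depicted object.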