TY - GEN
T1 - Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics
AU - Guo, Xiaoyuan
AU - Duan, Jiali
AU - Kuo, C. C. Jay
AU - Wawira Gichoya, Judy
AU - Banerjee, Imon
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
AB - The language modality within the vision-language pre-training framework is innately discretized, endowing each word in the language vocabulary with a semantic meaning. In contrast, the visual modality is inherently continuous and high-dimensional, which potentially hinders both the alignment and the fusion of the vision and language modalities. We therefore propose to "discretize" the visual representation by jointly learning a codebook that imbues each visual token with a semantic meaning. We then use these discretized visual semantics as self-supervised ground truths for building our Masked Image Modeling objective, a counterpart of Masked Language Modeling, which has proven successful for language models. To optimize the codebook, we extend the formulation of VQ-VAE, which gives a theoretical guarantee. Experiments validate the effectiveness of our approach across common vision-language benchmarks.
UR - http://www.scopus.com/inward/record.url?scp=85143593967&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85143593967&partnerID=8YFLogxK
U2 - 10.1109/ICPR56361.2022.9956616
DO - 10.1109/ICPR56361.2022.9956616
M3 - Conference contribution
AN - SCOPUS:85143593967
T3 - Proceedings - International Conference on Pattern Recognition
SP - 4779
EP - 4785
BT - 2022 26th International Conference on Pattern Recognition, ICPR 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 26th International Conference on Pattern Recognition, ICPR 2022
Y2 - 21 August 2022 through 25 August 2022
ER -