Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics

Xiaoyuan Guo, Jiali Duan, C.-C. Jay Kuo, Judy Wawira Gichoya, Imon Banerjee

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Language modality within the vision language pre-training framework is innately discretized, endowing each word in the language vocabulary with a semantic meaning. In contrast, visual modality is inherently continuous and high-dimensional, which potentially prohibits the alignment as well as the fusion between vision and language modalities. We therefore propose to "discretize" the visual representation by jointly learning a codebook that imbues each visual token with a semantic meaning. We then utilize these discretized visual semantics as self-supervised ground-truths for building our Masked Image Modeling objective, a counterpart of Masked Language Modeling, which has proven successful for language models. To optimize the codebook, we extend the formulation of VQ-VAE, which gives a theoretical guarantee. Experiments validate the effectiveness of our approach across common vision-language benchmarks.
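The abstract describes two coupled pieces: a VQ-VAE-style codebook that discretizes continuous visual tokens into code indices, and a Masked Image Modeling (MIM) objective that predicts those indices for masked patches. Below is a minimal PyTorch sketch of this general technique, not the paper's actual implementation; the module name `VisualCodebook`, the default sizes, and the loss wiring are illustrative assumptions.

```python
# Minimal sketch (assumed names/shapes, not the paper's code): VQ-VAE-style
# quantization of visual tokens, plus an MIM loss that uses the resulting
# discrete code indices as self-supervised targets.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualCodebook(nn.Module):  # hypothetical module name
    def __init__(self, num_codes=8192, dim=768, beta=0.25):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)
        self.beta = beta  # commitment-loss weight from the VQ-VAE objective

    def forward(self, z):  # z: (B, N, dim) continuous visual tokens
        B, N, D = z.shape
        # Nearest codebook entry per token (Euclidean distance).
        d = torch.cdist(z.reshape(-1, D), self.codes.weight)  # (B*N, num_codes)
        idx = d.argmin(dim=-1).view(B, N)   # discrete "visual semantics"
        z_q = self.codes(idx)               # quantized tokens, (B, N, dim)
        # VQ-VAE losses: move codes toward encoder outputs, and commit
        # encoder outputs to their assigned codes.
        vq_loss = (F.mse_loss(z_q, z.detach())
                   + self.beta * F.mse_loss(z, z_q.detach()))
        # Straight-through estimator: copy gradients through quantization.
        z_q = z + (z_q - z).detach()
        return z_q, idx, vq_loss

def masked_image_modeling_loss(logits, idx, mask):
    # logits: (B, N, num_codes) predictions for each patch position;
    # idx:    (B, N) codebook indices serving as ground-truth labels;
    # mask:   (B, N) boolean, True where a patch was masked.
    return F.cross_entropy(logits[mask], idx[mask])
```

The straight-through estimator lets gradients flow from the MIM head back into the visual encoder despite the non-differentiable argmin, which is what makes joint learning of the codebook and the encoder possible in this formulation.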

Original language: English (US)
Title of host publication: 2022 26th International Conference on Pattern Recognition, ICPR 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 4779-4785
Number of pages: 7
ISBN (Electronic): 9781665490627
DOIs
State: Published - 2022
Event: 26th International Conference on Pattern Recognition, ICPR 2022 - Montreal, Canada
Duration: Aug 21, 2022 - Aug 25, 2022

Publication series

Name: Proceedings - International Conference on Pattern Recognition
Volume: 2022-August
ISSN (Print): 1051-4651

Conference

Conference: 26th International Conference on Pattern Recognition, ICPR 2022
Country/Territory: Canada
City: Montreal
Period: 8/21/22 - 8/25/22

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
