Citation: Liqiang Jing, Yiren Li, Junhao Xu, Yongcan Yu, Pei Shen, Xuemeng Song. Vision Enhanced Generative Pre-trained Language Model for Multimodal Sentence Summarization. Machine Intelligence Research, vol. 20, no. 2, pp. 289–298, 2023. https://doi.org/10.1007/s11633-022-1372-x
[1] A. M. Rush, S. Chopra, J. Weston. A neural attention model for abstractive sentence summarization. In Proceedings of Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 379–389, 2015. DOI: 10.18653/v1/D15-1044.
[2] S. Chopra, M. Auli, A. M. Rush. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, USA, pp. 93–98, 2016. DOI: 10.18653/v1/N16-1012.
[3] H. R. Li, J. N. Zhu, T. S. Liu, J. J. Zhang, C. Q. Zong. Multi-modal sentence summarization with modality attention and image filtering. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, pp. 4152–4158, 2018.
[4] H. R. Li, J. N. Zhu, J. J. Zhang, X. D. He, C. Q. Zong. Multimodal sentence summarization via multimodal selective encoding. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, pp. 5655–5667, 2020. DOI: 10.18653/v1/2020.coling-main.496.
[5] M. Lewis, Y. H. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880, 2020. DOI: 10.18653/v1/2020.acl-main.703.
[6] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Q. Zhou, W. Li, P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, vol. 21, no. 1, Article number 140, 2020.
[7] Y. H. H. Tsai, S. J. Bai, P. P. Liang, J. Z. Kolter, L. P. Morency, R. Salakhutdinov. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6558–6569, 2019. DOI: 10.18653/v1/P19-1656.
[8] T. Z. Yu, W. L. Dai, Z. H. Liu, P. Fung. Vision guided generative pre-trained language models for multimodal abstractive summarization. In Proceedings of Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, pp. 3995–4007, 2021. DOI: 10.18653/v1/2021.emnlp-main.326.
[9] J. Devlin, M. W. Chang, K. Lee, K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, pp. 4171–4186, 2019. DOI: 10.18653/v1/N19-1423.
[10] J. T. Gu, Z. D. Lu, H. Li, V. O. K. Li. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, pp. 1631–1640, 2016. DOI: 10.18653/v1/P16-1154.
[11] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, USA, pp. 3111–3119, 2013.
[12] J. Pennington, R. Socher, C. Manning. GloVe: Global vectors for word representation. In Proceedings of Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, pp. 1532–1543, 2014. DOI: 10.3115/v1/D14-1162.
[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, pp. 6000–6010, 2017.
[14] X. Song, J. J. Chen, Z. X. Wu, Y. G. Jiang. Spatial-temporal graphs for cross-modal text2video retrieval. IEEE Transactions on Multimedia, vol. 24, pp. 2914–2923, 2022. DOI: 10.1109/TMM.2021.3090595.
[15] T. Hasan, A. Bhattacharjee, M. S. Islam, K. Mubasshir, Y. F. Li, Y. B. Kang, M. S. Rahman, R. Shahriyar. XL-Sum: Large-scale multilingual abstractive summarization for 44 languages. In Proceedings of Findings of the Association for Computational Linguistics, pp. 4693–4703, 2021. DOI: 10.18653/v1/2021.findings-acl.413.
[16] A. Nighojkar, J. Licato. Improving paraphrase detection with the adversarial paraphrasing task. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp. 7106–7116, 2021. DOI: 10.18653/v1/2021.acl-long.552.
[17] X. M. Song, L. Q. Jing, D. T. Lin, Z. Z. Zhao, H. Q. Chen, L. Q. Nie. V2P: Vision-to-prompt based multimodal product summary generation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, pp. 992–1001, 2022. DOI: 10.1145/3477495.3532076.
[18] T. Mikolov, K. Chen, G. Corrado, J. Dean. Efficient estimation of word representations in vector space. In Proceedings of the 1st International Conference on Learning Representations, Scottsdale, USA, 2013. DOI: 10.48550/arXiv.1301.3781.
[19] G. Kulkarni, V. Premraj, S. Dhar, S. M. Li, Y. Choi, A. C. Berg, T. L. Berg. Baby talk: Understanding and generating simple image descriptions. In Proceedings of Conference on Computer Vision and Pattern Recognition, IEEE, Colorado Springs, USA, pp. 1601–1608, 2011. DOI: 10.1109/CVPR.2011.5995466.
[20] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, D. Forsyth. Every picture tells a story: Generating sentences from images. In Proceedings of the 11th European Conference on Computer Vision, Springer, Heraklion, Greece, pp. 15–29, 2010. DOI: 10.1007/978-3-642-15561-1_2.
[21] M. Mitchell, J. Dodge, A. Goyal, K. Yamaguchi, K. Stratos, X. F. Han, A. Mensch, A. Berg, T. Berg, H. Daumé III. Midge: Generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, pp. 747–756, 2012.
[22] Q. Z. You, H. L. Jin, Z. W. Wang, C. Fang, J. B. Luo. Image captioning with semantic attention. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 4651–4659, 2016. DOI: 10.1109/CVPR.2016.503.
[23] T. Yao, Y. W. Pan, Y. H. Li, Z. F. Qiu, T. Mei. Boosting image captioning with attributes. In Proceedings of IEEE International Conference on Computer Vision, Venice, Italy, pp. 4904–4912, 2017. DOI: 10.1109/ICCV.2017.524.
[24] T. Yao, Y. W. Pan, Y. H. Li, T. Mei. Exploring visual relationship for image captioning. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 711–727, 2018. DOI: 10.1007/978-3-030-01264-9_42.
[25] L. Ke, W. J. Pei, R. Y. Li, X. Y. Shen, Y. W. Tai. Reflective decoding network for image captioning. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Republic of Korea, pp. 8887–8896, 2019. DOI: 10.1109/ICCV.2019.00898.
[26] M. Cornia, M. Stefanini, L. Baraldi, R. Cucchiara. Meshed-memory transformer for image captioning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 10575–10584, 2020. DOI: 10.1109/CVPR42600.2020.01059.
[27] Y. W. Pan, T. Yao, Y. H. Li, T. Mei. X-linear attention networks for image captioning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 10968–10977, 2020. DOI: 10.1109/CVPR42600.2020.01098.
[28] B. W. Cheng, A. G. Schwing, A. Kirillov. Per-pixel classification is not all you need for semantic segmentation. In Proceedings of the 35th Conference on Neural Information Processing Systems, pp. 17864–17875, 2021.
[29] Z. Liu, Y. T. Lin, Y. Cao, H. Hu, Y. X. Wei, Z. Zhang, S. Lin, B. N. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 9992–10002, 2021. DOI: 10.1109/ICCV48922.2021.00986.
[30] J. Cho, J. Lei, H. Tan, M. Bansal. Unifying vision-and-language tasks via text generation. In Proceedings of the 38th International Conference on Machine Learning, pp. 1931–1942, 2021.
[31] K. M. He, X. Y. Zhang, S. Q. Ren, J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 770–778, 2016. DOI: 10.1109/CVPR.2016.90.
[32] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. H. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, pp. 1–21, 2021.
[33] J. Yosinski, J. Clune, Y. Bengio, H. Lipson. How transferable are features in deep neural networks? In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, Canada, pp. 3320–3328, 2014.
[34] W. L. Taylor. "Cloze procedure": A new tool for measuring readability. Journalism & Mass Communication Quarterly, vol. 30, no. 4, pp. 415–433, 1953. DOI: 10.1177/107769905303000401.
[35] C. Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Proceedings of Text Summarization Branches Out, Barcelona, Spain, pp. 74–81, 2004.
[36] J. Clarke, M. Lapata. Global inference for sentence compression: An integer linear programming approach. Journal of Artificial Intelligence Research, vol. 31, no. 1, pp. 399–429, 2008.
[37] Q. Y. Zhou, N. Yang, F. R. Wei, M. Zhou. Selective encoding for abstractive sentence summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, pp. 1095–1104, 2017. DOI: 10.18653/v1/P17-1101.
[38] J. Libovický, J. Helcl. Attention strategies for multi-source sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, pp. 196–202, 2017. DOI: 10.18653/v1/P17-2031.
[39] I. Calixto, Q. Liu, N. Campbell. Doubly-attentive decoder for multi-modal neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, pp. 1913–1924, 2017. DOI: 10.18653/v1/P17-1175.
[40] A. See, P. J. Liu, C. D. Manning. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, pp. 1073–1083, 2017. DOI: 10.18653/v1/P17-1099.
[41] K. Simonyan, A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, USA, 2015.