Multimodal Recipe Retrieval Based on Consistency Loss

CLC number: TP391.4  Document code: A  Article ID: 2096-4706(2025)09-0074-06
Abstract: Multimodal recipe retrieval matches food images with their corresponding recipe texts. However, semantic inconsistencies between the image and text modalities limit retrieval accuracy and efficiency. This paper proposes the Consistent Multimodal Hierarchical Transformers (CMHT) model, which strengthens the semantic consistency between food images and recipe texts in the embedding space through cross-modal and intra-modal contrastive learning. Experiments on a recipe dataset show that CMHT improves retrieval accuracy, indicating the method's potential for multimodal data processing in the food domain.
Keywords: intelligent agriculture; food computing; multimodal recipe retrieval; cross-modal contrastive learning; semantic consistency
0 Introduction
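In recent years, with growth in computing power and data scale and with breakthroughs in Deep Neural Networks (DNNs) algorithms, Deep Learning has developed rapidly and become one of the core technologies of artificial intelligence.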