基于EALMDA的医疗命名实体识别数据增强方法

打印
收藏

收藏成功

微博 QQ空间微信

打开文本图片集

关键词：数据增强；命名实体识别；自然语言处理；生成模型；Mixup

中图分类号：TP391.1 文献标志码：A 文章编号：1671-6841（2026）01-0043-08

DOI： 10.13705/j. issn.1671-6841.2024116

Abstract：Named entity recognition in the medical field involved identifying named entities from unstructured medical texts.This played a crucial role in various downstream tasks.Due to the complexity of medical named entities leveraging domain-specific knowledge was required in expert annotations，which led to a severe scarcity of annotated data in the medical field. To address this issue，an entity-aware mask local mixup data augmentation method （EALMDA） was proposed. This method firstly extracted key elements using an entity-aware masking channel and masked non-entity parts to retain core semantics. Then，masked sentences were fused through a linear combination of two sampling strategies： contextual entity similarity and k -nearest neighbors，which preserved the core semantics while increasing sample diversity.Finally，after sequence linearization，the sentences were input into a generative model to obtain augmented samples. Comparative experiments were conducted on five mainstream medical named entity recognition datasets， such as NCBI-disease， simulating low-resource scenarios against mainstream data augmentation baselines，and significant improvements were observed compared to baseline methods.

Key Words ： data augmentation； named entity recognition； natural language processng； generative mod-el；Mixup

0 引言

近几年来，在医疗领域，命名实体识别（namedentityrecognition，NER）任务常用于从非结构化文本中识别和分类相关实体[1]。（剩余13234字）

试读结束

购买全文6.00元下一篇融合多级语义的中文医疗短文本分类模型

郑州大学学报（理学版）

2026年01期

¥4.00/本