Recognition of the agricultural named entities with multi-feature fusion based on BERT

Fund Project: National Key Research and Development Program of China (2019YFD1101105); National Natural Science Foundation of China (61871041); Beijing Science and Technology Plan Project (Z191100004019007)

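    Abstract (translated from the Chinese):

    Named entity recognition is an important step in information extraction from agricultural text. To address missing local context features, single-faceted character vector representations, and low recognition rates for rare entities, a named entity recognition method is proposed that fuses BERT (Bidirectional Encoder Representations from Transformers) character-level features with external dictionary features. The BERT pre-trained model integrates left and right context to enrich the semantic representation of each character and alleviate polysemy. A self-built agricultural dictionary, combined with a bidirectional maximum matching strategy, yields a distributed dictionary feature representation that improves recognition accuracy for rare or unknown entities. A bidirectional long short-term memory (BiLSTM) network produces the sequence feature matrix, and a conditional random field (CRF) model generates the globally optimal label sequence. Drawing on domain expert knowledge, an agricultural corpus was constructed containing 5 295 annotated samples and 5 categories of agricultural entities. On this corpus the model achieved a precision of 94.84%, a recall of 95.23%, and an F1-score of 95.03%. The results show that the method effectively recognizes agricultural named entities and achieves higher recognition precision than the compared models.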
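
    The bidirectional maximum matching strategy named in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' implementation, so everything below is an assumption: the function names, the tie-breaking heuristic (fewer words, then fewer single characters, then the backward result), the B/I/E/S/O tag scheme, and the three-entry example lexicon are all hypothetical.

```python
def forward_max_match(text, lexicon, max_len):
    """Greedy left-to-right segmentation, preferring the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len):
    """Greedy right-to-left segmentation, preferring the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def bidirectional_max_match(text, lexicon):
    """Run both directions and keep one result (assumed heuristic, see lead-in)."""
    max_len = max(map(len, lexicon), default=1)
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = backward_max_match(text, lexicon, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd

    def singles(ws):
        return sum(len(w) == 1 for w in ws)

    return fwd if singles(fwd) < singles(bwd) else bwd


def dictionary_tags(text, lexicon):
    """One tag per character: B/I/E for multi-character hits, S for a
    single-character hit, O for characters not covered by the lexicon."""
    tags = []
    for word in bidirectional_max_match(text, lexicon):
        if word not in lexicon:
            tags.extend(["O"] * len(word))
        elif len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


# Hypothetical lexicon: "wheat", "stripe rust", "wheat stripe rust".
example_lexicon = {"小麦", "条锈病", "小麦条锈病"}
# Hypothetical sentence: "wheat stripe rust prevention and control".
example_tags = dictionary_tags("小麦条锈病防治", example_lexicon)
print(example_tags)
```

    In a pipeline like the one described, each tag would presumably be mapped to a trainable embedding and combined with the BERT character vector before the BiLSTM layer; that step is not shown here.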

    Abstract:

    Agricultural named entity recognition is a fundamental task for information extraction in the agricultural domain. To address missing local context features, the inability to resolve polysemy, and the low recognition rate of rare entities, a model that combines character-level features with dictionary features was proposed to automatically identify entities in text. The character-level features were obtained from the BERT (Bidirectional Encoder Representations from Transformers) model. First, the BERT pre-trained language model was used to integrate left and right contextual information, yielding character-level features that enhance the semantic representation of characters and alleviate polysemy. Second, an agricultural dictionary was built and external dictionary information was introduced through a feature extraction strategy to improve recognition accuracy for rare or unknown entities. Two feature extraction strategies were designed to capture the dictionary features: an N-gram feature template algorithm and a bidirectional maximum matching algorithm. The character-level features and dictionary features were then fused as the input to the next neural network layer. Finally, the fused features were encoded by a BiLSTM (Bi-directional Long-short Term Memory) layer to obtain the sequence feature matrix, and the optimal label sequence was obtained by a CRF (Conditional Random Field). Based on the knowledge of domain experts, a labeling strategy for agricultural named entities was proposed to resolve fuzzy entity boundaries and preserve the integrity of the entities. The experiments were carried out on an agricultural corpus containing 5 295 labeled samples and 5 categories of agricultural entities.

    The model achieved good overall performance on the corpus, with a precision, recall, and F1-score of 94.84%, 95.23%, and 95.03%, respectively. Among the specific categories, crop diseases and pesticides have distinct boundary characteristics, so the model achieved higher precision on them than on the remaining three categories (machinery, pests, and crop varieties). In the comparison of dictionary feature extraction strategies, the model based on the bidirectional maximum matching algorithm outperformed the one based on the N-gram feature template algorithm. The N-gram feature template model performed best with 10 templates, reaching a precision of 93.95% and an F1-score of 94.03%. For the bidirectional maximum matching algorithm, feature embedding captured more latent information than one-hot encoding, improving precision and F1-score by 0.49 and 0.91 percentage points, respectively. Compared with the BiLSTM-CRF and BERT-BiLSTM-CRF models, the proposed BERT-Dic-BiLSTM-CRF model achieved the highest precision, 94.84%. Relative to the BERT-BiLSTM-CRF model, the precision of the BERT-Dic-BiLSTM-CRF model on rare and unknown entities improved by 5.93 and 6.44 percentage points, respectively, which further confirms that integrating dictionary features improves recognition of such entities.
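
    As a quick consistency check on the reported metrics, the F1-score can be recomputed from precision and recall using the standard harmonic-mean definition. This is a minimal sketch: the function and variable names are hypothetical, and it assumes the reported F1-score is derived from the overall precision and recall rather than averaged per category.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (result is in the same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall precision and recall as reported in the abstract, in percent.
reported_precision = 94.84
reported_recall = 95.23

recomputed_f1 = f1_score(reported_precision, reported_recall)
print(f"{recomputed_f1:.2f}")  # prints 95.03, matching the reported F1-score
```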

Cite this article

Zhao Pengfei, Zhao Chunjiang, Wu Huarui, Wang Wei. Recognition of the agricultural named entities with multi-feature fusion based on BERT[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2022, 38(3): 112-118. DOI: 10.11975/j.issn.1002-6819.2022.03.013

History
  • Received: 2021-09-16
  • Last revised: 2022-01-10
  • Published online: 2022-03-11