
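Word Sense Disambiguation Based on Semi-Supervised Convolutional Neural Networks
Cite this article: ZHANG Chunxiang, TANG Libo, GAO Xueyao. Word Sense Disambiguation Based on Semi-Supervised Convolutional Neural Networks[J]. Journal of Southwest Jiaotong University, 2022, 57(1): 11-17, 27.
Authors: ZHANG Chunxiang  TANG Libo  GAO Xueyao
Affiliation: School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, Heilongjiang, China
Funding: National Natural Science Foundation of China (61502124, 60903082); China Postdoctoral Science Foundation (2014M560249); Natural Science Foundation of Heilongjiang Province (F2015041, F201420)
Abstract: To address the difficulty of obtaining labeled corpora, a semi-supervised convolutional neural network (CNN) method for Chinese word sense disambiguation is proposed. First, the word form, part of speech, and semantic class of the two lexical units on each side of the ambiguous word are extracted as disambiguation features, and a word vector tool is used to vectorize them. Next, the labeled corpus is preprocessed to obtain the initial clustering centers and thresholds; at the same time, the labeled corpus is used to train the CNN disambiguation model. The optimized CNN then assigns semantic classes to the unlabeled corpus, and high-confidence instances that satisfy the threshold conditions are added to the training corpus. This process is repeated until the training corpus stops growing. Finally, experiments are carried out with SemEval-2007: Task#5 as the labeled corpus and the unannotated corpus of Harbin Institute of Technology as the unlabeled corpus. The experimental results show that the proposed method raises the disambiguation accuracy of the CNN by 3.1%.

Keywords: semi-supervised learning; convolutional neural network; word sense disambiguation; disambiguation features; word vector tool
Received: 2020-03-05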
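
The feature extraction step in the abstract (word form, part of speech, and semantic class of the two units on each side of the ambiguous word) can be illustrated with a minimal Python sketch. The function name `window_features`, the tuple layout, and the `<pad>` placeholder for positions beyond the sentence boundary are assumptions made for illustration; the abstract does not say how boundaries are handled or which tag sets are used.

```python
from typing import List, Tuple

# Hypothetical layout of one lexical unit: (word form, part of speech, semantic class)
Unit = Tuple[str, str, str]

# Assumed placeholder for positions that fall outside the sentence
PAD: Unit = ("<pad>", "<pad>", "<pad>")


def window_features(units: List[Unit], target: int, width: int = 2) -> List[str]:
    """Collect the three features of `width` units on each side of `target`.

    The ambiguous word itself (at index `target`) is skipped.
    """
    if not 0 <= target < len(units):
        raise IndexError("target index out of range")
    features: List[str] = []
    # Offsets -width..-1 (left context) followed by 1..width (right context)
    for offset in list(range(-width, 0)) + list(range(1, width + 1)):
        pos = target + offset
        unit = units[pos] if 0 <= pos < len(units) else PAD
        features.extend(unit)
    return features
```

With `width=2` the function returns 12 strings (4 context units times 3 features). Each string would then be mapped to a vector by a word vector tool before being fed to the network; that mapping is not shown here.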

Word Sense Disambiguation Based on Semi-Supervised Convolutional Neural Networks
Cite this article: ZHANG Chunxiang, TANG Libo, GAO Xueyao. Word Sense Disambiguation Based on Semi-Supervised Convolutional Neural Networks[J]. Journal of Southwest Jiaotong University, 2022, 57(1): 11-17, 27.
Authors: ZHANG Chunxiang  TANG Libo  GAO Xueyao
Institution: School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China
Abstract: To address the difficulty of acquiring tagged corpora, a Chinese word sense disambiguation method based on a semi-supervised convolutional neural network (CNN) is proposed. First, the word form, part of speech, and semantic category of the two word units on each side of the ambiguous word are extracted as disambiguation features, and a word vector tool converts these features into vectors. Second, the tagged corpus is preprocessed to obtain the initial clustering centers and thresholds, and the same corpus is used to train the CNN disambiguation model. The optimized CNN then determines the semantic categories of ambiguous words in the untagged corpus, and high-confidence instances that meet the threshold conditions are added to the training corpus. This process is repeated until the training corpus no longer grows. Finally, experiments are conducted with SemEval-2007: Task#5 as the tagged corpus and the unannotated corpus from Harbin Institute of Technology as the untagged corpus. Experimental results show that the proposed method improves the disambiguation accuracy of the CNN by 3.1%.
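Keywords: semi-supervised learning; convolutional neural network; word sense disambiguation; disambiguation features; word vector tool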
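
The iterative corpus expansion described in the abstract follows the general self-training pattern: train on the tagged data, label the untagged data, keep only high-confidence predictions, and repeat until nothing new is added. The Python sketch below illustrates that loop only. It is not the authors' implementation: `CentroidClassifier` is a simple stand-in for the CNN, and the fixed probability `threshold` is an assumption, since the abstract derives its thresholds from clustering centers without giving the exact rule.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


class CentroidClassifier:
    """Stand-in for the CNN (assumption): nearest-centroid scoring."""

    def fit(self, xs: List[Vector], ys: List[str]) -> "CentroidClassifier":
        sums: Dict[str, List[float]] = {}
        counts: Dict[str, int] = {}
        for x, y in zip(xs, ys):
            acc = sums.setdefault(y, [0.0] * len(x))
            for i, v in enumerate(x):
                acc[i] += v
            counts[y] = counts.get(y, 0) + 1
        self.centroids = {y: [v / counts[y] for v in acc] for y, acc in sums.items()}
        return self

    def predict_proba(self, x: Vector) -> Dict[str, float]:
        # Closer centroids get higher scores; scores are normalized to sum to 1
        scores = {y: math.exp(-math.dist(x, c)) for y, c in self.centroids.items()}
        total = sum(scores.values())
        return {y: s / total for y, s in scores.items()}


def self_train(
    model: CentroidClassifier,
    labeled_x: List[Vector],
    labeled_y: List[str],
    unlabeled: List[Vector],
    threshold: float = 0.8,  # assumed fixed confidence cut-off
    max_rounds: int = 10,    # assumed safety limit on iterations
) -> Tuple[CentroidClassifier, List[Vector], List[str]]:
    """Grow the training set with confidently labeled instances until it stops growing."""
    train_x, train_y = list(labeled_x), list(labeled_y)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model.fit(train_x, train_y)
        remaining: List[Vector] = []
        added = 0
        for x in pool:
            probs = model.predict_proba(x)
            label, confidence = max(probs.items(), key=lambda kv: kv[1])
            if confidence >= threshold:
                train_x.append(x)
                train_y.append(label)
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0:  # training corpus no longer expands
            break
    model.fit(train_x, train_y)  # final fit on the expanded corpus
    return model, train_x, train_y
```

Any classifier that exposes `fit` and `predict_proba` in the same shape could replace the stand-in. Instances that never reach the threshold simply stay in the unlabeled pool, which is what stops the loop.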