     

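Text Semantic Compression Algorithm Based on Phrase Topic Modeling
Cite this article: WANG Lidong, ZHANG Yin, LÜ Mingqi. Document Semantic Compression Algorithm Based on Phrase Topic Model[J]. Journal of Southwest Jiaotong University, 2015, 28(4): 755-763. DOI: 10.3969/j.issn.0258-2724.2015.04.027
Authors: WANG Lidong  ZHANG Yin  LÜ Mingqi
Funding: Zhejiang Provincial Natural Science Foundation of China (Q14F020032, LY15F020025); National Natural Science Foundation of China (61202282); China Academic Digital Associative Library (CADAL) Project
Abstract: To extract representative semantic terms from text, a text semantic compression algorithm based on phrase topic modeling, SCPTM (semantic compression based on phrase topic modeling), is proposed. The algorithm first converts the extraction of representative semantic terms into a maximization model and solves it approximately with a greedy search strategy. The phrase mining model LDACOL is then used for phrase topic modeling, which yields the input parameters of SCPTM; the model is also improved to address the instability of its topic assignments for phrases, so that the extracted representative terms better match human intuitions about meaning. Finally, the improved LDACOL model is compared experimentally with the LDA, LDACOL, and TNG models on topic mining performance, and SCPTM is applied to semantic compression of several corpora, with its effectiveness evaluated through clustering results. The experiments show that in most cases the improved LDACOL model extracts topics better than the other three models; SCPTM extracts representative semantic terms with a precision of 70% to 100% and yields better clustering results than traditional dimensionality reduction algorithms such as PCA, MDS, and ISOMAP.

Keywords: topic model; representative semantic terms; text mining; semantic compression; SCPTM
Received: 2014-06-16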

Document Semantic Compression Algorithm Based on Phrase Topic Model
WANG Lidong, ZHANG Yin, LÜ Mingqi. Document Semantic Compression Algorithm Based on Phrase Topic Model[J]. Journal of Southwest Jiaotong University, 2015, 28(4): 755-763. DOI: 10.3969/j.issn.0258-2724.2015.04.027
Authors: WANG Lidong  ZHANG Yin  LÜ Mingqi
Abstract: To extract representative semantic terms from documents, a semantic compression algorithm based on phrase topic modeling, SCPTM, is proposed. First, SCPTM formulates the extraction of representative semantic terms as a maximization problem and uses a greedy search algorithm to find an approximate solution. Then, to compute the input parameters of SCPTM, the phrase discovery model LDACOL is employed to extract topics expressed as phrases. The instability of topic assignment in the LDACOL model is also addressed, so that the extracted semantic terms agree better with human understanding of meaning. Finally, the topic discovery performance of the improved LDACOL is compared with that of LDA, LDACOL, and TNG, and SCPTM is applied to semantic compression on several corpora, with its effectiveness evaluated through clustering results. The experimental results show that the improved LDACOL outperforms the other three models at topic discovery in most cases. The proposed algorithm extracts representative semantic terms with an accuracy of 70% to 100%, and it achieves better document clustering results than other dimension-reduction algorithms such as PCA, MDS, and ISOMAP.
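The abstract does not state the objective that SCPTM maximizes, so the following is only a minimal sketch of the general pattern it describes: greedy search over candidate terms to approximately maximize a set objective. The objective used here (each topic is credited with its best-scoring selected term, weighted by the topic's share of the document) and the names `coverage`, `greedy_select`, `theta`, and `phi` are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, List


def coverage(selected: List[str],
             theta: Dict[str, float],
             phi: Dict[str, Dict[str, float]]) -> float:
    """Assumed objective, not the paper's: every topic is credited with the
    highest weight among the selected terms, scaled by the topic proportion."""
    if not selected:
        return 0.0
    return sum(
        weight * max(phi[topic].get(term, 0.0) for term in selected)
        for topic, weight in theta.items()
    )


def greedy_select(theta: Dict[str, float],
                  phi: Dict[str, Dict[str, float]],
                  budget: int) -> List[str]:
    """Pick up to `budget` terms, each time adding the term with the largest
    marginal gain in the objective; stop early when no term adds anything."""
    candidates = sorted({term for dist in phi.values() for term in dist})
    selected: List[str] = []
    for _ in range(budget):
        current = coverage(selected, theta, phi)
        best_term, best_gain = None, 0.0
        for term in candidates:
            if term in selected:
                continue
            gain = coverage(selected + [term], theta, phi) - current
            if gain > best_gain:
                best_term, best_gain = term, gain
        if best_term is None:  # no remaining term improves the objective
            break
        selected.append(best_term)
    return selected
```

In this sketch `theta` stands for a document's topic proportions and `phi` for per-topic term (or phrase) weights, the kind of quantities a topic model such as LDACOL could supply. Greedy selection is a common choice for objectives of this shape because each step only needs the marginal gain of one candidate.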