Text Mining for Causes of Ship Accidents Based on PMI and BTM
Citation:YU Weihong, FU Piaoyun, REN Yue, WANG Qingwu. Text Mining for Causes of Ship Accidents Based on PMI and BTM[J]. Journal of Transport Information and Safety, 2021, 39(1): 35-44.
Authors:YU Weihong, FU Piaoyun, REN Yue, WANG Qingwu
Affiliation:1. Maritime Economics and Management College, Dalian Maritime University, Dalian 116026, Liaoning, China; 2. Navigation College, Dalian Maritime University, Dalian 116026, Liaoning, China
Funding:National Key Research and Development Program of China; Fundamental Research Funds for the Central Universities
Abstract:To automatically extract water traffic safety knowledge from large volumes of ship accident investigation reports, this paper proposes a semantic mining method that works at two levels, words and topics, and applies it to a corpus of 100 investigation reports on ship foundering (self-sinking) accidents. At the word level, the PMI (pointwise mutual information) algorithm mines frequently co-occurring word patterns from the accident-cause texts, and the co-occurrence of feature words reveals associations between accident-causing factors. At the topic level, the BTM (biterm topic model) algorithm builds a topic model of the accident-cause texts, and the quality of the model is evaluated by topic log-likelihood and topic coherence. Topic modeling clusters the feature words that characterize the causes of foundering accidents, and the distribution of topics over the document collection gives a preliminary estimate of the occurrence probability of each cause. In a test of the model's predictive ability on 500 new data sets, the topic model recognized and automatically ignored 100% of the domain-irrelevant words. It clearly assigned 85.6% of the words in the corpus to a topic representing a specific cause; for the remaining 14.4%, topic boundaries were not distinct, so these words could not be assigned to any single topic with high probability.

Keywords:traffic safety; ship accident investigation report; text mining; topic model; word co-occurrence; PMI algorithm; BTM algorithm
Received:2020-10-23
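
The word-level step described in the abstract, scoring co-occurring word pairs with PMI, can be illustrated with a minimal sketch. The abstract does not say how the authors estimate co-occurrence (window size, smoothing, thresholds), so this sketch assumes document-level co-occurrence and the standard definition PMI(w1, w2) = log2(p(w1, w2) / (p(w1) p(w2))). The function name `pmi_pairs`, the `min_count` parameter, and the toy tokens are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_pairs(docs, min_count=2):
    """Score word pairs by PMI using document-level co-occurrence.

    docs: list of token lists, one per accident-cause text.
    min_count: pairs co-occurring in fewer documents are skipped
               (an assumed filter; rare pairs give unstable PMI).
    Returns {(w1, w2): pmi} with w1 < w2 in sort order.
    """
    n_docs = len(docs)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words
    for tokens in docs:
        vocab = sorted(set(tokens))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    scores = {}
    for (w1, w2), n12 in pair_df.items():
        if n12 < min_count:
            continue
        p12 = n12 / n_docs
        p1 = word_df[w1] / n_docs
        p2 = word_df[w2] / n_docs
        scores[(w1, w2)] = math.log2(p12 / (p1 * p2))
    return scores


# Toy, made-up token lists standing in for segmented accident-cause texts.
docs = [
    ["overload", "capsize", "wind"],
    ["overload", "capsize"],
    ["wind", "wave"],
    ["hull", "leak"],
]
scores = pmi_pairs(docs)
for pair, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 3))
```

A positive score means the two words appear together more often than their individual frequencies would predict under independence, which is the sense in which co-occurrence can point to associated causal factors.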
This article is indexed in CNKI, Wanfang Data, and other databases.