
Research on a Rule-Based Information Extraction Method for Maritime Free Text
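Cite this article: Yu Chen, Mao Zhe, Gao Song. Research on a rule-based information extraction method for maritime free text[J]. Journal of Transport Information and Safety (交通信息与安全), 2017, 35(2): 40-47.
Authors: Yu Chen (余晨), Mao Zhe (毛喆), Gao Song (高嵩)
Affiliations: Intelligent Transportation Systems Research Center, Wuhan University of Technology, Wuhan 430063; National Engineering Research Center for Water Transport Safety, Wuhan University of Technology, Wuhan 430063
Funding: Construction Science and Technology Project of the Ministry of Transport; High-Tech Ship Project of the Ministry of Industry and Information Technology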
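Abstract: Structuring maritime data is an important step in maritime safety research. A large amount of maritime information is available online, but most of it is unstructured document data in varying formats. A rule-based maritime information extraction method can be used to convert maritime free text into structured data. A web crawler collects the text to be extracted from maritime-related web pages. Based on the text obtained, the extraction task is defined as four data items: time, location, vessel name, and accident type. A custom maritime lexicon, built from the extraction task itself and its common trigger words, is used for word segmentation and part-of-speech tagging of the free text. Extraction rules, compiled by analyzing and summarizing a large accident corpus, are then applied to extract the maritime information and produce structured maritime data. The rule-based method was tested on accident details from the website of the Yangtze River Maritime Bureau (长江海事局). The results show a precision of 100% and a recall of 91% for time; 94.52% and 69% for location; 97.75% and 86% for vessel name; and 96.67% and 87% for accident type.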

Keywords: information extraction; maritime free text; custom lexicon; extraction rules
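The abstract does not list the actual lexicon or rules, so the following is only a minimal sketch of what rule-based extraction of the four data items could look like. It substitutes plain regular expressions and a trigger-word list for the paper's word segmentation and part-of-speech tagging step. The trigger words, the patterns, the `extract` function, and the sample sentence are all invented for illustration and are not taken from the paper.

```python
import re

# Hypothetical accident-type trigger words (assumption: the paper's lexicon is not published here).
ACCIDENT_TYPES = ["碰撞", "搁浅", "触礁", "火灾", "沉没"]

# Illustrative rules, one per data item. These are assumptions, not the authors' rules.
# Time: a date such as 2016年3月5日, optionally followed by hour and minute.
TIME_RULE = re.compile(r"\d{4}年\d{1,2}月\d{1,2}日(?:\d{1,2}时(?:\d{1,2}分)?)?")
# Vessel name: text inside Chinese curly quotes, followed by a vessel suffix.
VESSEL_RULE = re.compile(r"\u201c([^\u201d]+)\u201d(?:轮|号|船)")
# Location: text after the trigger 在/于, ending in a waterway-type suffix.
LOCATION_RULE = re.compile(r"(?:在|于)([^,。,]+?(?:水域|航道|码头|锚地))")


def extract(text):
    """Return one structured record with the four data items (None or [] if not found)."""
    time_match = TIME_RULE.search(text)
    location_match = LOCATION_RULE.search(text)
    return {
        "time": time_match.group(0) if time_match else None,
        "location": location_match.group(1) if location_match else None,
        "vessels": VESSEL_RULE.findall(text),
        # First trigger word (in list order) that occurs anywhere in the text.
        "accident_type": next((t for t in ACCIDENT_TYPES if t in text), None),
    }


# Made-up sample sentence in the style of an accident report.
sample = (
    "2016年3月5日14时30分,\u201c长江001\u201d轮在武汉长江大桥水域"
    "与\u201c江运8\u201d轮发生碰撞。"
)
record = extract(sample)
```

Rules of this kind tend to be precise but brittle: a location phrased without one of the listed suffixes is simply missed, which is one plausible reason a rule-based system can show high precision alongside lower recall.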

An Approach of Extracting Information for Maritime Unstructured Text Based on Rules
Abstract: Structuring maritime data plays an important role in maritime safety research. A large amount of maritime information is available on the Internet, but most of it is unstructured data in differing formats. This paper proposes a rule-based approach that extracts maritime information and converts unstructured text into structured data. Web crawlers are used to obtain the text data from maritime-related web pages. Based on the text obtained, the extraction task is defined as four items: time, location, vessel name, and type of accident. A maritime lexicon for Chinese word segmentation and part-of-speech tagging is constructed from the extraction task and its common trigger words. Extraction rules are derived from an analysis of a large corpus of accident reports and are then applied to produce structured maritime data. To verify the feasibility of the rule-based approach, accident details from the website of the Yangtze River Maritime Bureau are used as a case study. The results indicate that the precision of time extraction is 100%, with a recall of 91%; the precision of location extraction is 94.52%, with a recall of 69%; the precision of vessel name extraction is 97.75%, with a recall of 86%; and the precision of accident type extraction is 96.67%, with a recall of 87%.
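For readers less familiar with the two reported metrics, the sketch below shows the standard definitions: precision is the share of extracted items that are correct, and recall is the share of reference items that were found. The function name and the counts are hypothetical and are not the paper's data.

```python
def precision_recall(n_extracted, n_correct, n_gold):
    """Standard precision and recall from raw counts.

    n_extracted: items the rules returned
    n_correct:   returned items that match the reference annotation
    n_gold:      items present in the reference annotation
    """
    precision = n_correct / n_extracted if n_extracted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    return precision, recall


# Hypothetical counts, chosen only to illustrate the arithmetic.
p, r = precision_recall(n_extracted=8, n_correct=6, n_gold=10)
```

Read this way, the reported time result (100% precision, 91% recall) means every extracted time was correct, while 9% of the times in the reference data were not extracted.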
This article is indexed in CNKI and other databases.