
Research on an Incremental Web Crawler Based on Webpage-Denoised Hash

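Cite this article:	ZHANG Hao, ZHOU Xueguang. Research on an Incremental Web Crawler Based on Webpage-Denoised Hash [J]. Ship Electronic Engineering, 2014(2): 86-90.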

Authors:	ZHANG Hao, ZHOU Xueguang

Institution:	Department of Information Security, Naval University of Engineering

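Abstract:	An incremental web crawler driven by webpage Hash values can crawl webpages incrementally. However, because webpages contain noise, the Hash values that classical Hash algorithms produce from page text are overly sensitive, so judging whether a page has changed by comparing Hash values deviates from what actually happened. This study proposes a method that generates the Hash after denoising: the text blocks of a webpage are classified as "main text" or "noise", the noise is removed, and a Hash value is generated from the main text alone and used to decide whether the page has changed, which improves the efficiency of incremental crawling. Experimental results show that incremental crawling based on the proposed post-denoising Hash method lowers the sensitivity of the Hash value and effectively improves the crawler's incremental crawling performance.
Keywords:	Hash; webpage denoising; incremental; Heritrix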
Incremental Web Crawler Webpage Denoising Based on Hash

ZHANG Hao, ZHOU Xueguang. Incremental Web Crawler Webpage Denoising Based on Hash [J]. Ship Electronic Engineering, 2014(2): 86-90.

Authors:	ZHANG Hao, ZHOU Xueguang

Institution:	(Department of Information Security, Naval University of Engineering, Wuhan 430033)

Abstract:	An incremental web crawler based on webpage Hash values can crawl webpages incrementally. However, because of webpage noise, the Hash values that classical Hash algorithms generate from page text are too sensitive, so detecting webpage changes by comparing Hash values deviates from the actual situation. A method for generating the Hash after denoising is proposed. The text blocks of a webpage are divided into two categories, "webpage text" and "noise"; the noise is removed, and a Hash value is generated from the remaining webpage content and used to determine whether the webpage has changed, which improves the efficiency of incremental crawling. Experimental results show that incremental crawling based on the proposed post-denoising Hash generation method decreases the sensitivity of the Hash value and effectively improves the performance of incremental web crawling.

Keywords:	Hash; webpage denoising; incremental; Heritrix
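
The change test described in the abstract can be illustrated with a minimal Python sketch. This record does not give the paper's block classifier or hash function, so everything below is an assumption: `Block`, `is_content`, `raw_hash`, `denoised_hash`, the length and link-density thresholds, and the choice of SHA-256 are hypothetical stand-ins, not the authors' method.

```python
import hashlib
from typing import List, NamedTuple


class Block(NamedTuple):
    text: str        # visible text of one page block
    link_chars: int  # how many of those characters sit inside links


def is_content(block: Block, min_chars: int = 40,
               max_link_density: float = 0.3) -> bool:
    """Toy classifier (assumed): long, link-poor blocks count as main text."""
    text = block.text.strip()
    if not text or len(text) < min_chars:
        return False  # short blocks (timestamps, ad labels) are treated as noise
    return block.link_chars / len(text) <= max_link_density


def raw_hash(blocks: List[Block]) -> str:
    """Hash of every block, noise included."""
    joined = "\n".join(b.text.strip() for b in blocks)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def denoised_hash(blocks: List[Block]) -> str:
    """Hash of the blocks classified as main text only."""
    joined = "\n".join(b.text.strip() for b in blocks if is_content(b))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Two fetches of the same page: only the timestamp block differs.
body = ("Incremental crawlers revisit pages and fetch only those "
        "whose content has changed.")
nav = Block("Home | News | About | Contact | Login | Register | Sitemap", 40)
first = [nav, Block(body, 0), Block("Updated 2014-02-01 10:00", 0)]
second = [nav, Block(body, 0), Block("Updated 2014-02-02 09:30", 0)]

print(raw_hash(first) != raw_hash(second))            # True: noise changed the hash
print(denoised_hash(first) == denoised_hash(second))  # True: main text unchanged
```

Under these assumptions, a crawler would store the denoised hash for each URL and treat a page as changed only when that value differs on the next visit.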
This article is indexed in CNKI, VIP (维普), and other databases.
