
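Research on an Incremental Web Crawler Based on Denoised Webpage Hashes
Cite this article: ZHANG Hao, ZHOU Xueguang. Research on an Incremental Web Crawler Based on Denoised Webpage Hashes [J]. Ship Electronic Engineering, 2014(2): 86-90.
Authors: ZHANG Hao  ZHOU Xueguang
Affiliation: Department of Information Security, Naval University of Engineering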
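Abstract: An incremental web crawler driven by webpage Hash values can crawl pages incrementally. However, because webpages contain noise, the Hash values that classical Hash algorithms produce from page text are overly sensitive, so change detection by Hash comparison deviates from how the pages have actually changed. This study proposes a method that generates the Hash after denoising: webpage text blocks are classified as "main text" or "noise", the noise is removed, and a Hash value is computed over the main text to decide whether the page has changed, which improves the efficiency of incremental crawling. Experimental results show that incremental crawling based on the proposed post-denoising Hash method lowers Hash sensitivity and effectively improves the crawler's incremental crawling performance.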

Keywords: Hash  webpage denoising  incremental  Heritrix

Incremental Web Crawler Webpage Denoising Based on Hash
ZHANG Hao,ZHOU Xueguang.Incremental Web Crawler Webpage Denoising Based on Hash[J].Ship Electronic Engineering,2014(2):86-90.
Authors:ZHANG Hao  ZHOU Xueguang
Institution:(Department of Information Security, Naval University of Engineering, Wuhan 430033)
Abstract:An incremental web crawler based on webpage Hash values can realize incremental crawling of webpages. However, because of webpage noise, the Hash values that a classical Hash algorithm generates from page text are too sensitive, so judging webpage changes by comparing Hash values deviates from the actual situation. A method for generating the Hash after denoising is proposed. Text blocks are divided into two categories, "webpage text" and "noise"; after the noise is removed, Hash values are generated from the remaining webpage content and used to determine whether the webpage has changed, which improves the efficiency of incremental crawling. Experimental results show that incremental crawling based on the proposed post-denoising Hash method decreases Hash sensitivity and effectively improves the incremental crawling performance of the web crawler.
Keywords:Hash  webpage denoising  incremental  Heritrix
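
The idea in the abstract (classify text blocks, drop the noise, hash what remains, compare with the stored hash) can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use the Heritrix API. The abstract does not say how blocks are classified or which Hash algorithm is used, so the sketch assumes a simple length and link-density heuristic and SHA-256. The names `BlockExtractor`, `is_main_text`, `denoised_hash`, and `has_changed`, and both thresholds, are hypothetical.

```python
import hashlib
from html.parser import HTMLParser

# Assumption: these tags delimit text blocks; the paper's segmentation may differ.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3",
              "article", "section", "header", "footer", "nav"}
SKIP_TAGS = {"script", "style"}


class BlockExtractor(HTMLParser):
    """Collects (text, linked_chars) for every block-level run of text."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._parts = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Normalize whitespace and close the current block, if any.
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append((text, min(self._link_chars, len(text))))
        self._parts, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._parts.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def is_main_text(text, link_chars, min_len=40, max_link_density=0.3):
    """Hypothetical classifier: long blocks with few linked characters are main text."""
    return len(text) >= min_len and link_chars / len(text) <= max_link_density


def denoised_hash(html):
    """Hash only the blocks classified as main text (SHA-256 is an assumption)."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # close a trailing block that no end tag terminated
    kept = [text for text, links in parser.blocks if is_main_text(text, links)]
    return hashlib.sha256("\n".join(kept).encode("utf-8")).hexdigest()


def has_changed(url, html, seen):
    """Compare the denoised hash with the one stored for this URL, then update it."""
    new_hash = denoised_hash(html)
    changed = seen.get(url) != new_hash
    seen[url] = new_hash
    return changed


if __name__ == "__main__":
    body = ("<p>Incremental crawlers revisit pages and store a new copy "
            "only when the content has changed.</p>")
    first = "<div><a href='/ads/1'>Ad 1</a></div>" + body + "<div>Visits: 1001</div>"
    second = "<div><a href='/ads/2'>Ad 2</a></div>" + body + "<div>Visits: 1002</div>"
    seen = {}
    print(has_changed("http://example.com/a", first, seen))   # True: first visit
    print(has_changed("http://example.com/a", second, seen))  # False: only noise differs
```

A hash over the raw HTML of `first` and `second` would differ because of the rotating ad link and the visit counter, which is the oversensitivity the abstract describes. In this sketch the denoised hash stays stable until the main text itself is edited.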
This article is indexed in CNKI, VIP, and other databases.