Similar Documents
20 similar documents found (search time: 390 ms)
1.
For part-whole relation extraction from Chinese free text, this paper proposes an unsupervised method. First, a sub-pattern extraction algorithm obtains concept pairs and their surrounding context patterns from a domain corpus, and a distributional semantic model is built from the pairs and patterns. Next, a co-clustering algorithm groups concept pairs sharing the same semantic relation into clusters; an L1-regularized logistic regression model is trained to extract each cluster's features and yield the context patterns that represent its relation. Finally, clusters whose patterns express part-whole relations are identified, and the part-whole concept pairs are extracted from them. Experiments show the method performs well, reaching an F-measure of 68.97%, better than a traditional clustering method (55.77%) and a pattern-matching method (61.95%).
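The distributional step above can be sketched in a few lines: concept pairs that occur in the same context patterns receive similar vectors, so a similarity measure groups candidate part-whole pairs together. A minimal sketch with invented pairs and patterns (the paper's co-clustering and L1-regularized scoring are not reproduced here):

```python
from collections import Counter
from math import sqrt

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# toy occurrences of (concept pair, context pattern) -- hypothetical data
occurrences = [
    (("engine", "car"), "X is part of Y"),
    (("engine", "car"), "Y contains X"),
    (("wheel", "bicycle"), "X is part of Y"),
    (("paris", "france"), "X is located in Y"),
]

# distributional model: each concept pair -> vector over its context patterns
vectors = {}
for pair, pattern in occurrences:
    vectors.setdefault(pair, Counter())[pattern] += 1

# pairs sharing context patterns score high and land in the same cluster
same = cosine(vectors[("engine", "car")], vectors[("wheel", "bicycle")])
diff = cosine(vectors[("engine", "car")], vectors[("paris", "france")])
print(round(same, 2), round(diff, 2))
```

Pairs with overlapping patterns (both part-whole) score well above unrelated pairs, which is what lets the later clustering stage separate relation types.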

2.
This paper proposes a new method for semi-automatically extracting semantic structures from unlabeled corpora in specific domains. The approach is statistical in nature, and the extracted structures can be used for shallow parsing and semantic labeling. By iteratively extracting new words and clustering them, we build an initial semantic lexicon that groups words with the same meaning into classes. A bootstrapping algorithm then extracts semantic structures, which in turn are used to extract new keywords and augment the lexicon. The resulting structures are human-interpretable and amenable to hand-editing for refinement. In our experiments, the semi-automatically extracted structures (SSA) achieve a recall of 84.5%.
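The bootstrapping loop can be sketched as: a pattern is trusted once it matches seed lexicon entries, and a trusted pattern then labels new words. A toy sketch with an invented corpus, seed words, and a single hypothetical pattern:

```python
import re

corpus = [
    "fly from beijing to shanghai",
    "fly from guangzhou to chengdu",
    "book a ticket from shenzhen to xian",
]
lexicon = {"beijing", "shanghai"}          # seed CITY words
pattern = re.compile(r"from (\w+) to (\w+)")

# step 1: keep the pattern only if it matches seed words somewhere
supported = any(
    set(m.groups()) & lexicon for s in corpus for m in pattern.finditer(s)
)

# step 2: a supported pattern labels new words, growing the lexicon
if supported:
    for s in corpus:
        for m in pattern.finditer(s):
            lexicon.update(m.groups())

print(sorted(lexicon))
```

In the real method this alternates for several rounds, with new lexicon entries validating further patterns.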

3.
Introduction: Nowadays the applications of scene text recognition are rapidly expanding with the development of portable digital imaging devices [1]. However, character extraction from scene images has always been a challenging problem due to complex backgrounds, uneven illumination, shadows and image noise [2]. Besides, languages impose another level of variation in text. Characters in languages such as Chinese, Japanese and Korean are usually composed of several strokes, which do not …

4.
To improve the accuracy of text categorization and reduce the dimensionality of the feature space, this paper proposes a two-stage feature selection method based on a novel category correlation degree (CCD) measure and latent semantic indexing (LSI). In the first stage, the CCD measure selects the features most effective for classification, outperforming traditional feature selection methods. In the second stage, LSI addresses two weaknesses of plain document representation, its high dimensionality and its neglect of semantic relations between features, which otherwise hurt categorization accuracy: statistically derived conceptual indices replace individual terms, capturing important correlations among features and reducing the dimensionality. Concretely, each feature is first ranked by its importance for classification using CCD; a new semantic space over the features is then constructed with LSI. Experiments show that the method effectively reduces the dimensionality of text vectors and improves text categorization performance.
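A rough sketch of the two stages, with a simple class-mean difference standing in for the paper's CCD measure and plain truncated SVD standing in for LSI (the matrix and labels are synthetic):

```python
import numpy as np

# toy term-document matrix: rows = documents, cols = features
X = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 2, 0, 2],
], dtype=float)
y = np.array([0, 0, 1, 1])

# stage 1: rank features by a simple class-correlation score
# (a stand-in for CCD) and keep the top 3
score = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
top = np.argsort(score)[::-1][:3]
X1 = X[:, top]

# stage 2: LSI via truncated SVD, projecting documents onto 2 concepts
U, S, Vt = np.linalg.svd(X1, full_matrices=False)
docs_lsi = U[:, :2] * S[:2]
print(docs_lsi.shape)
```

The reduced document vectors (`docs_lsi`) feed the downstream classifier in place of raw term counts.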

5.
To improve risk monitoring and control in express delivery, and to reduce the chance that risky parcel contents cause urban safety incidents, this paper uses semantic mining to turn textual descriptions of shipped goods into quantitative risk indicators, providing an objective, measurable basis for express-transport risk assessment. Using court judgment documents obtained from web-scale data sources, item terms are linked to verdict outcomes; a latent Dirichlet allocation model mines item risk topics, and fuzzy c-means clustering then produces a quantitative, soft partition of the semantic risk of shipped goods. Unlike the traditional approach, in which inspectors subjectively check items against a fixed contraband list, this method mines transferable knowledge from web text and applies it to the wide variety of shipped goods, avoiding the misses and false flags of manual assessment. Results show high accuracy and a low false-alarm rate, and the risk scores are graded rather than binary yes/no judgments, supporting diverse, targeted risk warnings and countermeasures.
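The soft-partition step can be sketched with a tiny fuzzy c-means loop over hypothetical topic-proportion vectors; the point is that each item receives a graded membership rather than a 0/1 verdict:

```python
import numpy as np

# toy LDA topic proportions for four goods (hypothetical data)
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
c, m = 2, 2.0                      # clusters, fuzzifier
centers = X[[0, -1]].copy()        # crude initialisation

for _ in range(20):                # alternate membership / centre updates
    d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-9
    u = 1.0 / (d ** (2 / (m - 1)))
    u /= u.sum(axis=1, keepdims=True)          # soft memberships in [0, 1]
    centers = (u.T ** m @ X) / (u.T ** m).sum(axis=1, keepdims=True)

print(np.round(u[:, 1], 2))        # graded, not 0/1, risk membership
```

Items near the "risky" topic centre get memberships close to 1, borderline items get intermediate scores, which is what enables tiered warnings.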

6.
Recognizing dense pedestrian crowds in surveillance video is difficult: trajectories are hard to extract and motion semantics are under-exploited. This paper proposes a pedestrian trajectory capture and motion-semantics perception method based on the FairMOT multi-object tracking framework and K-means clustering. First, a multi-object tracker extracts motion-trajectory feature vectors of pedestrian groups crossing the street. Then, based on the numerical distribution of trajectory coordinates, a covariance filtering algorithm (STCCF) detects and removes "quasi-static trajectories", yielding clean trajectory clusters. Finally, K-means clustering is applied to the extracted trajectories, using the silhouette coefficient and the Davies-Bouldin index to perceive scene semantics such as pedestrian gathering and dispersal points. Experiments show the algorithm identified 179 quasi-static trajectories among 2,689 extracted trajectories, a detection rate of 81.73%; the resolved source and vanishing points of pedestrians in the video scene agree with manual annotation, validating the trajectory extraction and motion-semantics perception method. The work provides a basis for data-driven behavior prediction and trajectory modeling.
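The clustering and index-based evaluation step might look roughly like this; synthetic endpoint features stand in for real FairMOT trajectories (which require a trained detector), and scikit-learn is assumed to be available:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
# toy trajectory features: endpoints of two separated pedestrian streams
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])

best = None
for k in range(2, 5):              # pick k: silhouette high, DB index low
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    s, db = silhouette_score(X, labels), davies_bouldin_score(X, labels)
    if best is None or s > best[1]:
        best = (k, s, db)

print(best[0])                     # cluster count chosen for the toy data
```

The chosen cluster centres then correspond to gathering/dispersal regions in the scene.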

7.
Web page classification is important in many areas of Internet information retrieval, such as directory classification and vertical search. Methods based on query logs, a lightweight alternative to full Web page classification, avoid crawling page content and are therefore relatively efficient, but the sparsity of user click data makes it hard to build a classifier from the log directly. To solve this problem, we explore the semantic relations among queries through word embeddings and propose three improved graph-based classification algorithms. To capture semantic relevance between queries, we first map each user query into a low-dimensional space according to its query vector. We then compute a uniform resource locator (URL) vector from the relationships between queries and URLs. Finally, we classify unlabeled Web pages with an improved label propagation algorithm (LPA) and a bipartite-graph expansion algorithm. Experiments show our methods improve F1 by about 20% over other Web page classification methods based on query logs.
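One propagation round on a toy query-URL click graph can be sketched as follows (queries, URLs, and labels are invented; the paper's improved LPA and embedding step are not reproduced):

```python
# toy query-URL click graph: labels propagate from a few labeled queries
edges = {  # query -> clicked URLs (hypothetical log)
    "buy laptop": ["shop.com/a", "shop.com/b"],
    "cheap laptop": ["shop.com/b"],
    "python tutorial": ["learn.org/py"],
    "learn python": ["learn.org/py", "learn.org/intro"],
}
labels = {"buy laptop": "shopping", "python tutorial": "education"}

# round 1: labeled queries push their label onto clicked URLs
url_label = {}
for q, urls in edges.items():
    if q in labels:
        for u in urls:
            url_label[u] = labels[q]

# round 2: URLs vote back onto unlabeled queries
for q, urls in edges.items():
    if q not in labels:
        votes = [url_label[u] for u in urls if u in url_label]
        if votes:
            labels[q] = max(set(votes), key=votes.count)

print(labels["cheap laptop"], labels["learn python"])
```

Real LPA iterates this bidirectional pass with weighted votes until labels stabilise.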

8.
To combat pornographic content proliferating on the Internet, this paper proposes a method combining semantic-chain techniques with the vector space model. The semantic chain of a document to be classified is extracted, and its density-vector components are compared with the semantics of pornographic (sexual-culture) reference text to measure similarity, assigning the document to the corresponding class; the classification results are then used to filter pornographic content. Experiments show the method identifies and filters pornographic web pages well.

9.
To extract representative semantic words from text, this paper proposes SCPTM (semantic compression based on phrase topic modeling). The algorithm first casts representative-word extraction as a maximization problem and solves it approximately with a greedy search strategy. It then performs phrase topic modeling with the phrase-mining model LDACOL to obtain SCPTM's input parameters; LDACOL is also improved to address its unstable topic assignment for phrases, so the extracted representative words better match human intuitions about semantics. Finally, the improved LDACOL is compared experimentally with LDA, LDACOL, and TNG on topic-mining performance, and SCPTM is applied to semantic compression on several corpora, evaluated by clustering results. Experiments show that in most cases the improved LDACOL extracts better topics than the other three models; SCPTM attains 70%-100% precision in extracting representative semantic words and yields better clustering than traditional dimensionality-reduction algorithms such as PCA, MDS, and ISOMAP.
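The greedy search can be sketched as marginal-gain selection over word-topic weights; the weights and gain function here are illustrative, not the paper's exact objective:

```python
# greedy approximation to picking k words that best cover topic mass
word_topic = {                     # word -> topic-probability vector (toy)
    "network": [0.7, 0.1],
    "semantic": [0.6, 0.2],
    "cluster": [0.1, 0.8],
    "random": [0.1, 0.1],
}

def gain(selected, w):
    # marginal coverage: how much the per-topic maxima improve with w added
    cur = [max((word_topic[s][t] for s in selected), default=0.0)
           for t in range(2)]
    new = [max(cur[t], word_topic[w][t]) for t in range(2)]
    return sum(new) - sum(cur)

selected = []
for _ in range(2):                 # pick k = 2 representative words
    selected.append(max(word_topic.keys() - set(selected),
                        key=lambda w: gain(selected, w)))

print(selected)
```

Each step takes the word with the largest marginal gain, the standard greedy strategy for such coverage-style maximization problems.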

10.
The rapid development of the Internet brings a flood of raw information, including text, audio, and more, in volumes so large that finding the most useful knowledge quickly and accurately is difficult. Automatic text classification based on machine learning can assign large numbers of natural-language documents to the correct subject categories according to their semantics, helping users grasp textual information directly. Learning from a set of hand-labeled documents yields a traditional supervised classifier for text categorization (TC), but hand-labeling all the data is labor-intensive and time-consuming. Some scholars have proposed semi-supervised learning to train the classifier, but it still needs some hand-labeled data, which is impractical for the volume and variety of Web data. In 2012, Li et al. proposed a fully automatic categorization approach for text (FACT) based on supervised learning, requiring no manual labeling; however, automatically labeling all the data introduces noise, and the resulting accuracy falls short. We put forward a new idea: a portion of the data can be automatically labeled with high accuracy based on the semantics of the category names; a semi-supervised method then trains the classifier on both labeled and unlabeled data, ultimately achieving precise classification of massive text data. Experiments show the method outperforms a supervised support vector machine (SVM) in both F1 and classification accuracy in most cases, demonstrating the effectiveness of the semi-supervised algorithm for automatic TC.
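The idea of seeding labels from category-name semantics and then self-training can be sketched with keyword overlap standing in for a real classifier (documents and seed words are invented):

```python
# self-training sketch: seed labels from category-name keywords, then
# iteratively add the most confident predictions
docs = [
    "football match score goal",
    "goal keeper saves penalty",
    "stock market shares rise",
    "market prices and shares",
]
seed = {"sports": {"football"}, "finance": {"stock"}}

labels = {}
for _ in range(3):                 # a few self-training rounds
    for i, d in enumerate(docs):
        if i in labels:
            continue
        scores = {c: len(seed[c] & set(d.split())) for c in seed}
        c = max(scores, key=scores.get)
        if scores[c] > 0:          # only keep confident pseudo-labels
            labels[i] = c
            seed[c] |= set(d.split())   # grow the class vocabulary

print([labels[i] for i in range(4)])
```

In the paper's setting, the confident pseudo-labels would train an SVM-style classifier rather than grow a keyword set, but the loop structure is the same.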

11.
Web pages contain richer content than plain text, such as hyperlinks, HTML tags and metadata, so Web page categorization differs from plain-text categorization. For Chinese news pages on the Internet, a practical algorithm is proposed that extracts subject concepts from a page without a thesaurus; after these category subject concepts are incorporated into a knowledge base, Web pages are classified with a hybrid algorithm, using an experimental corpus drawn from Xinhua Net. Experimental results show that categorization performance improves when Web page features are used.

12.
Semantic textual similarity (STS) is a common task in natural language processing (NLP): it measures the degree of semantic equivalence between two textual snippets. Machine learning methods have recently been applied to this task, including methods based on support vector regression (SVR). However, the learning process involves many features, some of which are noisy and irrelevant to the result, and different parameter settings significantly affect the SVR model's prediction performance. In this paper, we propose a genetic algorithm (GA) to select effective features and optimize the parameters of the learning process simultaneously. We evaluate the approach on the STS-2012 dataset; compared with grid search, the GA-based approach achieves better regression performance.
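A bare-bones GA over feature masks can illustrate the selection side; a toy fitness stands in for SVR cross-validation (the real fitness would train an SVR per chromosome):

```python
import random

random.seed(0)
# chromosome = 6 feature-mask bits; the toy fitness rewards choosing the
# two "truly informative" features and penalises extras
INFORMATIVE = {0, 3}

def fitness(mask):
    chosen = {i for i, b in enumerate(mask) if b}
    return len(chosen & INFORMATIVE) - 0.1 * len(chosen - INFORMATIVE)

def mutate(mask):
    i = random.randrange(len(mask))          # single-bit flip
    return mask[:i] + (1 - mask[i],) + mask[i + 1:]

pop = [tuple(random.randint(0, 1) for _ in range(6)) for _ in range(20)]
for _ in range(30):                          # evolve: elitism + mutation
    pop.sort(key=fitness, reverse=True)
    pop = pop[:10] + [mutate(random.choice(pop[:10])) for _ in range(10)]

best = max(pop, key=fitness)
print([i for i, b in enumerate(best) if b])
```

A real GA for this paper would also encode SVR's kernel parameters in the chromosome and add crossover, but selection pressure toward informative features works the same way.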

13.
To address unsmooth segmentation edges and the difficulty of accurately segmenting small objects in traffic scenes, this paper proposes a dual-attention-guided, cross-layer optimized semantic segmentation algorithm. First, a multi-branch feature-extraction encoder is built, with serial non-proportional dilated convolutions capturing spatial context to reduce the loss of small-object information. Second, a cross-layer feature-fusion decoder based on spatial alignment fuses semantic and detail information, strengthening the representation of objects at different scales. Finally, channel and spatial attention mechanisms model global channel correlation and long-range positional correlation, improving the network's ability to learn key features. Experiments on the traffic-scene datasets Cityscapes and CamVid show that the proposed encoder, cross-layer fusion decoder, and attention modules are effective; the segmentation algorithm reaches mean IoU of 77.79% and 78.66%, smooths object boundaries, and is especially robust for thin, elongated objects.
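The channel-attention idea can be sketched in a few lines of NumPy: squeeze a feature map to per-channel statistics, pass them through a small gate, and reweight the channels. The gate weights here are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
feat = rng.normal(size=(C, H, W))           # a decoder feature map

# squeeze: global average pooling -> one descriptor per channel
squeeze = feat.mean(axis=(1, 2))

# excite: tiny bottleneck (random stand-ins for learned weights)
W1 = rng.normal(size=(2, C))
W2 = rng.normal(size=(C, 2))
hidden = np.maximum(W1 @ squeeze, 0)        # ReLU
gate = 1 / (1 + np.exp(-(W2 @ hidden)))     # sigmoid weights in (0, 1)

out = feat * gate[:, None, None]            # reweight channels
print(out.shape)
```

Spatial attention is the transposed idea: pool across channels to get one weight per position, then reweight positions.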

14.
In online public-opinion analysis, automated tools are urgently needed to extract the required information from massive data for further analysis. This paper proposes a Web information extraction method based on automatically generated templates, which removes page noise and extracts the target content quickly and effectively. A parser converts the Web document into a Document Object Model; extraction rules are built from user requirements, templates are generated automatically, and page content is extracted according to each template's rules. Experiments show the method achieves high recall and precision.
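A template rule of the kind described, "extract the text under a given tag/class", can be sketched with the standard-library HTML parser (the page and rule are invented):

```python
from html.parser import HTMLParser

# extract text of nodes matching a tag/class rule -- the kind of rule an
# auto-generated template would record for a page family
class Extractor(HTMLParser):
    def __init__(self, tag, cls):
        super().__init__()
        self.rule, self.hits, self.inside = (tag, cls), [], False

    def handle_starttag(self, tag, attrs):
        if tag == self.rule[0] and ("class", self.rule[1]) in attrs:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == self.rule[0]:
            self.inside = False

    def handle_data(self, data):
        if self.inside and data.strip():
            self.hits.append(data.strip())

page = '<div class="ad">buy now</div><div class="news">Main story</div>'
ex = Extractor("div", "news")
ex.feed(page)
print(ex.hits)
```

Noise regions (ads, navigation) simply never match the template's rule, which is how the template removes them without per-page hand-tuning.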

15.
This paper studies the architecture of KFS, a distributed file system for cloud computing environments, where metadata management efficiency is critical for cloud storage systems holding massive data. After analyzing KFS's metadata model, an improved model based on it is proposed: an in-memory buffering strategy preprocesses pending metadata and inserts it in batches, reducing lookups and node splits and greatly improving KFS's data-access efficiency. Complexity analysis shows the improved algorithm effectively raises the efficiency of the KFS metadata server. The improved model also applies to similar systems that centrally manage metadata with B+ tree indexes.
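The buffering idea can be sketched independently of KFS: collect metadata inserts in memory and apply them to the index in sorted order on flush, instead of paying one lookup per insert. A plain sorted list stands in for the B+ tree:

```python
import bisect

index = [10, 20, 30, 40]           # stand-in for B+ tree leaf keys
buffer = []

def insert(key, flush_at=3):
    """Buffer a metadata key; flush the batch in key order when full."""
    buffer.append(key)
    if len(buffer) >= flush_at:
        # inserting in sorted order keeps successive inserts in nearby
        # pages, which is what cuts lookups and node splits in a real tree
        for k in sorted(buffer):
            bisect.insort(index, k)
        buffer.clear()

for k in [25, 5, 35]:
    insert(k)
print(index)
```

Readers of the real system would also need to consult the buffer before the tree, a detail omitted here.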

16.
A grid resource discovery process and an economics-based grid resource scheduling process are designed to implement grid resource management. The discovery process provides a basic mechanism by which a grid request agent can find the resources it needs. The economic scheduling process manages grid task agents that purchase resources from grid resource agents through bidding in order to complete computing tasks. Both processes are formally described, together with the main data structures used.

17.
Introduction: Volunteer computing builds on the idea that home PCs are mostly idle and could therefore be harnessed to solve complex computational problems. The problem is decomposed into many chunks that can run concurrently with very little interaction, referred to by some as "embarrassingly parallel" computation. The best-known example is SETI@HOME. A comparison of a traditional high-performance server versus volunteer computing is shown in Fig. 1. Though the volunteer computing model has been succes…

18.
Automatic translation of Chinese text to Chinese Braille is important for blind people in China to acquire information using computers or smart phones. In this paper, a novel scheme of Chinese-Braille translation is proposed. Under the scheme, a Braille word segmentation model based on statistical machine learning is trained on a Braille corpus, and Braille word segmentation is carried out with the statistical model directly, without a separate Chinese word segmentation stage. This avoids hand-crafting rules about syntactic and semantic information: the statistical model learns the rules implicitly and automatically. To further improve performance, an algorithm that fuses the results of Chinese word segmentation and Braille word segmentation is also proposed. Our results show that the proposed method achieves an accuracy of 92.81% for Braille word segmentation and considerably outperforms current approaches using the segmentation-merging scheme.

19.
Guided by Martin's appraisal theory and combined with principles of narratology, this paper takes Virginia Woolf's famous novel Mrs Dalloway as a case study to examine the attitude system in stream-of-consciousness narration, how it is realized, and how it helps construct the novel's love theme. The findings show that the novel's narration is about people: characters' judgments are foregrounded, the complex relations among characters are appreciated, and characters' emotions are appraised. Through the free indirect thought of its characters, the novel evaluates attitudes from different characters' perspectives…

20.
Based on an analysis of the content and characteristics of rail transit base data, this paper discusses the concept and role of metadata for a rail transit base database and establishes a complete metadata content system for it, dividing the metadata into three layers: database-level, dataset-level, and feature-level. The role and constituent elements of each layer are analyzed in detail. The results lay a foundation for building the rail transit base database platform and for information sharing.
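The three metadata layers might be modeled as nested records; the field names below are illustrative, not the paper's schema:

```python
from dataclasses import dataclass, field

@dataclass
class FeatureMeta:            # feature-level: one geographic feature class
    name: str
    geometry: str

@dataclass
class DatasetMeta:            # dataset-level: a themed collection
    title: str
    features: list = field(default_factory=list)

@dataclass
class DatabaseMeta:           # database-level: the whole base database
    name: str
    datasets: list = field(default_factory=list)

db = DatabaseMeta(
    "rail-base",
    [DatasetMeta("track", [FeatureMeta("line", "polyline")])],
)
print(db.datasets[0].features[0].geometry)
```

Each layer describes the one below it, so catalog queries can stop at whichever granularity a sharing partner needs.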


Copyright©北京勤云科技发展有限公司  京ICP备09084417号