Journal of Software, Vol 5, No 5 (2010), 506-513, May 2010
doi:10.4304/jsw.5.5.506-513

A Novel Method for Extracting Information from Web Pages with Multiple Presentation Templates

Qingzhong Li, Yanhui Ding, An Feng, Yongquan Dong

Abstract


Web information extraction is a key part of web data integration. Driven by the needs of e-commerce websites and advances in web design, web pages with multiple presentation templates have become common. Current web information extraction systems usually assume a single presentation template, so they cannot extract data efficiently from pages with multiple presentation templates. This paper focuses on the extraction problem for such pages. Four variants of the problem are considered, and a novel method based on path entropy, presentation regularity, and ontology knowledge is presented. Experiments indicate that the method is promising and achieves high recall and precision.


Keywords


Information Extraction; Multiple Presentation Templates; Path Entropy; Presentation Regularity; Ontology



Journal of Software (JSW, ISSN 1796-217X)

Copyright © 2006-2014 by ACADEMY PUBLISHER – All rights reserved.