R&D Result Detail

Original Title

Layout Based Information Extraction from HTML Documents

English Title

Layout Based Information Extraction from HTML Documents

Type

Paper in proceedings outside WoS and Scopus

Original Abstract

We propose a method of information extraction from HTML documents based on modelling the visual information in the document. A page segmentation algorithm is used for detecting the document layout, and the extraction process is subsequently based on the analysis of the mutual positions of the detected blocks and their visual features. This approach is more robust than the traditional DOM-based methods and it opens new possibilities for the extraction task specification.

English abstract

We propose a method of information extraction from HTML documents based on modelling the visual information in the document. A page segmentation algorithm is used for detecting the document layout, and the extraction process is subsequently based on the analysis of the mutual positions of the detected blocks and their visual features. This approach is more robust than the traditional DOM-based methods and it opens new possibilities for the extraction task specification.

Keywords

page segmentation, layout analysis, information extraction

Key words in English

page segmentation, layout analysis, information extraction

Authors

BURGET, R.

Released

23.09.2007

Publisher

IEEE Computer Society

Location

Curitiba

ISBN

0-7695-2822-8

Book

9th International Conference on Document Analysis and Recognition ICDAR 2007

Pages from

624

Pages to

629

Pages count

6

BibTeX

@inproceedings{BUT28821,
  author="Radek {Burget}",
  title="Layout Based Information Extraction from HTML Documents",
  booktitle="9th International Conference on Document Analysis and Recognition ICDAR 2007",
  year="2007",
  pages="624--629",
  publisher="IEEE Computer Society",
  address="Curitiba",
  isbn="0-7695-2822-8"
}
