Publication result detail

Layout Based Information Extraction from HTML Documents

BURGET, R.

Original title

Layout Based Information Extraction from HTML Documents

English title

Layout Based Information Extraction from HTML Documents

Type

Paper in proceedings not indexed in WoS or Scopus

Original abstract

We propose a method of information extraction from HTML documents based on modelling the visual information in the document. A page segmentation algorithm is used for detecting the document layout and, subsequently, the extraction process is based on the analysis of the mutual positions of the detected blocks and their visual features. This approach is more robust than the traditional DOM-based methods and it opens new possibilities for the extraction task specification.

English abstract

We propose a method of information extraction from HTML documents based on modelling the visual information in the document. A page segmentation algorithm is used for detecting the document layout and, subsequently, the extraction process is based on the analysis of the mutual positions of the detected blocks and their visual features. This approach is more robust than the traditional DOM-based methods and it opens new possibilities for the extraction task specification.

Keywords

page segmentation, layout analysis, information extraction

Keywords in English

page segmentation, layout analysis, information extraction

Authors

BURGET, R.

Published

23.09.2007

Publisher

IEEE Computer Society

Place

Curitiba

ISBN

0-7695-2822-8

Book

9th International Conference on Document Analysis and Recognition ICDAR 2007

Pages from

624

Pages to

629

Number of pages

6

BibTeX

@inproceedings{BUT28821,
  author="Radek {Burget}",
  title="Layout Based Information Extraction from HTML Documents",
  booktitle="9th International Conference on Document Analysis and Recognition ICDAR 2007",
  year="2007",
  pages="624--629",
  publisher="IEEE Computer Society",
  address="Curitiba",
  isbn="0-7695-2822-8"
}