Publication result detail

Visual Area Classification for Article Identification in Web Documents

BURGET, R.

Original title

Visual Area Classification for Article Identification in Web Documents

English title

Visual Area Classification for Article Identification in Web Documents

Type

Paper in proceedings not indexed in WoS or Scopus

Original abstract

In the World Wide Web, the news and other articles are usually published in complex HTML documents containing many types of additional information that is not explicitly marked. In this paper, we propose a visual information analysis approach to the article discovery in complex HTML documents. We use a classification approach for the identification of the important parts of the article within the page and we propose an algorithm for the detection of the article bounds within the page. Finally, we provide the results of an experimental evaluation.

English abstract

In the World Wide Web, the news and other articles are usually published in complex HTML documents containing many types of additional information that is not explicitly marked. In this paper, we propose a visual information analysis approach to the article discovery in complex HTML documents. We use a classification approach for the identification of the important parts of the article within the page and we propose an algorithm for the detection of the article bounds within the page. Finally, we provide the results of an experimental evaluation.

Keywords

article extraction, document cleaning, page segmentation, visual analysis

Keywords in English

article extraction, document cleaning, page segmentation, visual analysis

Authors

BURGET, R.

RIV year

2011

Published

30.08.2010

Publisher

IEEE Computer Society

Place

Bilbao

ISBN

978-0-7695-4174-7

Book

21st International Workshop on Databases and Expert Systems Applications

Pages from

171

Pages to

175

Number of pages

5

BibTeX

@inproceedings{BUT35628,
  author="Radek {Burget}",
  title="Visual Area Classification for Article Identification in Web Documents",
  booktitle="21st International Workshop on Databases and Expert Systems Applications",
  year="2010",
  pages="171--175",
  publisher="IEEE Computer Society",
  address="Bilbao",
  isbn="978-0-7695-4174-7"
}