Publication result detail

Two-Phase Categorization of Web Documents

BARTÍK, V.; BURGET, R.

Original title

Two-Phase Categorization of Web Documents

Type

Paper in conference proceedings not indexed in WoS or Scopus

Original abstract

The number of pages on the World Wide Web is constantly growing, and there is a need to process pages efficiently and extract useful knowledge from them. Web page categorization is an important problem in this area. The method proposed here takes both visual and textual information into account. It consists of two phases. In the first phase, web page areas obtained by segmentation are classified based on their visual properties; in the second phase, pages are classified based on the results of the first phase together with textual information. Several experiments with web pages taken from news websites are presented in the final part of the paper.

Keywords

Web page categorization, visual block classification, term weighting, TF-IDF, page segmentation

Authors

BARTÍK, V.; BURGET, R.

RIV year

2011

Published

1 November 2010

Publisher

Institute for Systems and Technologies of Information, Control and Communication

Place

Valencia

ISBN

978-989-8425-28-7

Book

Proceedings of the International Conference on Knowledge Discovery and Information Retrieval

Pages from

458

Pages to

462

Number of pages

5

BibTeX

@inproceedings{BUT34415,
  author="Vladimír {Bartík} and Radek {Burget}",
  title="Two-Phase Categorization of Web Documents",
  booktitle="Proceedings of the International Conference on Knowledge Discovery and Information Retrieval",
  year="2010",
  pages="458--462",
  publisher="Institute for Systems and Technologies of Information, Control and Communication",
  address="Valencia",
  isbn="978-989-8425-28-7"
}