R&D result detail

Original title

Content collector and document analysis for the M-Eco project

Type

Software

Abstract

The system collects data from various sources and makes it accessible to other components of the M-Eco project. The collection focuses on three groups of data: multimedia data such as broadcast news from TV and radio, online news data from MedISys, and social media content from blogs, forums and Twitter messages.

The multimedia data is collected and transcribed by SAIL's Media Mining Indexing System (MMI), which then provides the transcriptions to MedISys via an RSS feed. The feed includes links to the original content for later retrieval. MedISys provides these RSS feeds, along with additional annotations and online news data collected by this system, for further processing by the document analysis component. The third source of data comprises social media content collected from MedWorm, Twitter, about 85 discussion forums and 45 blogs, written mainly in German.
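
The hand-over of transcriptions via RSS can be illustrated with a minimal sketch. It assumes a plain RSS 2.0 layout with `title`, `link` and `description` elements; the actual feed structure used by MMI and MedISys is not described here, so the sample feed, the `read_items` function and the field names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; the real MMI/MedISys feed layout is an assumption.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Sample transcription feed</title>
    <item>
      <title>Evening news, segment 3</title>
      <link>http://example.org/media/segment-3</link>
      <description>Transcribed text of the broadcast segment.</description>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Return one dict per RSS <item>, keeping the link to the original content."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),   # used for later retrieval
            "text": item.findtext("description", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in read_items(SAMPLE_FEED):
        print(entry["title"], entry["link"])
```

Keeping the `link` next to the transcribed text is what allows a downstream component to go back to the original broadcast when needed.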

Collected documents are pre-processed. This step includes filtering of irrelevant data, named entity recognition, parsing and tagging. The result is a set of tagged documents, which is stored in the annotated text repository and made available via web services for the indicator detection and signal generation process.
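
As a rough illustration of the filtering and named entity recognition steps, the sketch below matches token sequences against a small gazetteer stored as a trie, which behaves like a simple finite state automaton. The gazetteer entries, the labels, the function names and the rule that a document without any recognised entity counts as irrelevant are all assumptions made for this example, not a description of the released package.

```python
import re

# Hypothetical gazetteer; the resources used by the real system are not listed here.
GAZETTEER = {
    ("berlin",): "LOCATION",
    ("new", "york"): "LOCATION",
    ("swine", "flu"): "DISEASE",
}


def build_trie(entries):
    """Build a token trie; each node is a dict, with the label stored under None."""
    root = {}
    for tokens, label in entries.items():
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[None] = label  # marks an accepting state
    return root


def tag_entities(text, trie):
    """Return (surface form, label) pairs using longest-match lookup in the trie."""
    tokens = re.findall(r"\w+", text)
    found = []
    i = 0
    while i < len(tokens):
        node, j, best = trie, i, None
        # Follow transitions as long as the next token is known in this state.
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if None in node:
                best = (j, node[None])
        if best:
            end, label = best
            found.append((" ".join(tokens[i:end]), label))
            i = end
        else:
            i += 1
    return found


def preprocess(documents, trie):
    """Drop documents without any recognised entity and attach tags to the rest."""
    tagged = []
    for doc in documents:
        entities = tag_entities(doc, trie)
        if entities:  # crude relevance filter, an assumption for this sketch
            tagged.append({"text": doc, "entities": entities})
    return tagged


if __name__ == "__main__":
    trie = build_trie(GAZETTEER)
    print(preprocess(["Swine flu cases reported in New York", "Nice weather today"], trie))
```

A trie keeps lookup cost proportional to the length of the matched phrase rather than to the size of the gazetteer, which matters when the entry list is large, for example one derived from a geographic names database.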

Keywords

named entity recognition, geonames.org, finite state automaton, Twitter, MedISys, M-Eco

Location

https://github.com/iotrusina/M-Eco-WP3-package

Licence fee

Use of the result by another entity always requires obtaining a licence.

www

https://github.com/iotrusina/M-Eco-WP3-package
