Publication result detail
NANDI, T.; GUPTA, S.; KAUSHAL, A.; DUTTA, M.; BURGET, R.; JEŽEK, Š.
Original title
Semantic Fusion of Text and Images: A Novel Multimodal-RAG Framework for Document Analysis
Type
Paper in proceedings indexed in WoS or Scopus
Original abstract
This work presents the development of an advanced multimodal Retrieval-Augmented Generation (MM-RAG) framework, specifically designed to integrate and process both textual and visual data for comprehensive document analysis. Unlike traditional systems that handle only text, this framework employs cutting-edge techniques to extract and embed unstructured information from PDFs containing both text and images, ensuring a more holistic understanding of complex documents. Textual data is segmented into manageable chunks and embedded using transformer-based models, such as Gemini, which operates within a 768-dimensional embedding space to capture nuanced textual information. Simultaneously, visual data is processed through sophisticated vision-language models, which generate high-level semantic summaries that encapsulate the visual content's meaning. The MM-RAG framework then seamlessly unifies these text and image embeddings into a cohesive multimodal representation, significantly enhancing the system's ability to perform complex document retrieval and question-answering tasks. This integration enables more accurate and contextually relevant responses, making it a powerful tool for detailed document analysis.
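The flow described in the abstract (chunk the text, embed the chunks, describe each image with a textual summary, then search both kinds of entry in one vector space) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, chunk sizes, and the hashed bag-of-words `embed` function are hypothetical stand-ins chosen so the example runs without any external model; only the 768-dimensional width comes from the abstract. In a real system the embedder would be a transformer-based model and the image summaries would come from a vision-language model rather than being passed in as strings.

```python
import hashlib
import math
from dataclasses import dataclass

DIM = 768  # embedding width quoted in the abstract


@dataclass
class Entry:
    modality: str        # "text" or "image"
    content: str         # chunk text, or the summary describing an image
    vector: list[float]  # unit-length embedding


def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (sizes are illustrative)."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text: str) -> list[float]:
    """Stand-in embedder (assumption): hashed bag of words, L2-normalised."""
    vec = [0.0] * DIM
    for raw in text.lower().split():
        token = raw.strip(".,;:!?()\"'")
        if not token:
            continue
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(text: str, image_summaries: list[str]) -> list[Entry]:
    """Put text chunks and image summaries into one shared index."""
    entries = [Entry("text", c, embed(c)) for c in chunk_text(text)]
    entries += [Entry("image", s, embed(s)) for s in image_summaries]
    return entries


def retrieve(index: list[Entry], query: str, k: int = 3) -> list[tuple[float, Entry]]:
    """Return the k entries most similar to the query (cosine similarity)."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, e.vector)), e) for e in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Because image summaries are embedded with the same function as the text chunks, a single call such as `retrieve(index, "quarterly revenue chart")` can return entries of either modality, and the retrieved `content` strings would then be passed to a generator as context for question answering.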
Keywords
Multimodal Retrieval-Augmented Generation; Document Analysis; Vision-Language Models; Transformer-Based Models; Semantic Representation; Unified Embedding Space; Question-Answering System
Published
26.11.2024
Place
Meloneras
ISBN
978-3-8007-6544-7
Book
ICUMT 2024; 16th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops
Periodical
International Congress on Ultra Modern Telecommunications and Workshops
Country
United States of America
Pages from
106
Pages to
110
Number of pages
6
BibTeX
@inproceedings{BUT190080,
  author="Tuhina {Nandi} and Sidharth {Gupta} and Abhishek {Kaushal} and Malay Kishore {Dutta} and Radim {Burget} and Štěpán {Ježek}",
  title="Semantic Fusion of Text and Images: A Novel Multimodal-RAG Framework for Document Analysis",
  booktitle="ICUMT 2024; 16th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops",
  year="2024",
  journal="International Congress on Ultra Modern Telecommunications and Workshops",
  pages="106--110",
  address="Meloneras",
  isbn="978-3-8007-6544-7"
}