R&D result detail

Original title

Semantic Fusion of Text and Images: A Novel Multimodal-RAG Framework for Document Analysis

English title

Semantic Fusion of Text and Images: A Novel Multimodal-RAG Framework for Document Analysis

Type

Paper in proceedings indexed in the WoS or Scopus database

Original abstract

This work presents the development of an advanced multimodal Retrieval-Augmented Generation (MM-RAG) framework, specifically designed to integrate and process both textual and visual data for comprehensive document analysis. Unlike traditional systems that handle only text, this framework employs cutting-edge techniques to extract and embed unstructured information from PDFs containing both text and images, ensuring a more holistic understanding of complex documents. Textual data is segmented into manageable chunks and embedded using transformer-based models, such as Gemini, which operates within a 768-dimensional embedding space to capture nuanced textual information. Simultaneously, visual data is processed through sophisticated vision-language models, which generate high-level semantic summaries that encapsulate the visual content's meaning. The MM-RAG framework then seamlessly unifies these text and image embeddings into a cohesive multimodal representation, significantly enhancing the system's ability to perform complex document retrieval and question-answering tasks. This integration enables more accurate and contextually relevant responses, making it a powerful tool for detailed document analysis.

English abstract

This work presents the development of an advanced multimodal Retrieval-Augmented Generation (MM-RAG) framework, specifically designed to integrate and process both textual and visual data for comprehensive document analysis. Unlike traditional systems that handle only text, this framework employs cutting-edge techniques to extract and embed unstructured information from PDFs containing both text and images, ensuring a more holistic understanding of complex documents. Textual data is segmented into manageable chunks and embedded using transformer-based models, such as Gemini, which operates within a 768-dimensional embedding space to capture nuanced textual information. Simultaneously, visual data is processed through sophisticated vision-language models, which generate high-level semantic summaries that encapsulate the visual content's meaning. The MM-RAG framework then seamlessly unifies these text and image embeddings into a cohesive multimodal representation, significantly enhancing the system's ability to perform complex document retrieval and question-answering tasks. This integration enables more accurate and contextually relevant responses, making it a powerful tool for detailed document analysis.
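The pipeline outlined in the abstract (chunk the text, embed chunks and image summaries in one 768-dimensional space, then retrieve by similarity) can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, the image summaries are assumed to have been produced beforehand by a vision-language model, and all function names are illustrative.

```python
import hashlib
import math

DIM = 768  # embedding width quoted in the abstract


def chunk_text(text, max_words=40):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text):
    """Hypothetical stand-in embedder: hashed bag-of-words, L2-normalised.

    A real system would call an embedding model here instead.
    """
    vec = [0.0] * DIM
    for token in text.lower().split():
        token = token.strip(".,;:!?()\"'")
        if not token:
            continue
        # Map each token to a fixed bucket so the result is deterministic.
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(text, image_summaries):
    """Index text chunks and image summaries in one shared vector space."""
    entries = [("text", chunk) for chunk in chunk_text(text)]
    entries += [("image", summary) for summary in image_summaries]
    return [(kind, content, toy_embed(content)) for kind, content in entries]


def retrieve(index, query, k=2):
    """Return the k entries most similar to the query (cosine similarity)."""
    q = toy_embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), kind, content)
        for kind, content, vec in index
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    index = build_index(
        "The report describes quarterly revenue growth in the European market.",
        ["Bar chart showing turbine output per month",
         "Photo of a wind turbine blade"],
    )
    for score, kind, content in retrieve(index, "turbine output chart"):
        print(f"{score:.2f} [{kind}] {content}")
```

Because image summaries are embedded as text, a single similarity search covers both modalities; the retrieved entries would then be passed to a language model as context for question answering.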

Keywords

Multimodal Retrieval-Augmented Generation;Document Analysis;Vision-Language Models;Transformer-Based Models;Semantic Representation;Unified Embedding Space;Question-Answering System

Keywords in English

Multimodal Retrieval-Augmented Generation;Document Analysis;Vision-Language Models;Transformer-Based Models;Semantic Representation;Unified Embedding Space;Question-Answering System

Authors

NANDI, T.; GUPTA, S.; KAUSHAL, A.; DUTTA, M.; BURGET, R.; JEŽEK, Š.

Published

26 November 2024

Place

Meloneras

ISBN

978-3-8007-6544-7

Book

ICUMT 2024; 16th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops

Periodical

International Congress on Ultra Modern Telecommunications and Workshops

Country

United States of America

Pages from

106

Pages to

110

Page count

6

BibTeX

@inproceedings{BUT190080,
  author="Tuhina {Nandi} and Sidharth {Gupta} and Abhishek {Kaushal} and Malay Kishore {Dutta} and Radim {Burget} and Štěpán {Ježek}",
  title="Semantic Fusion of Text and Images: A Novel Multimodal-RAG Framework for Document Analysis",
  booktitle="ICUMT 2024; 16th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops",
  year="2024",
  journal="International Congress on Ultra Modern Telecommunications and Workshops",
  pages="106--110",
  address="Meloneras",
  isbn="978-3-8007-6544-7"
}
