Bachelor's Thesis

System for person retrieval based on text description

Author of thesis: Yaroslav Hryn

Acad. year: 2025/2026

Supervisor: Ing. Markéta Juránková, Ph.D.

Reviewer: Ing. Michal Hradiš, Ph.D.

Abstract:

This bachelor's thesis describes the design and implementation of a web application for searching for people in image datasets based on natural language descriptions. The motivation comes from scenarios where a user is looking for a specific person and knows only their appearance, such as the colour of their clothing or other visual characteristics, but does not have a photo available.
At its core, the system converts both text queries and images into a shared vector space using vision-language models, with support for loading any compatible model from the HuggingFace platform. The resulting embeddings are indexed and searched using the Qdrant cloud vector database. To improve search precision, negative prompting was implemented as a re-ranking method, allowing users to actively exclude unwanted visual attributes from results. Search results can also be grouped by person identity based on identifiers provided in a metadata file.
The resulting application allows users to manage datasets and models, search for people using natural language, apply negative queries, and browse results in three different view modes. The system was experimentally evaluated by comparing several vision-language models on a real-world dataset.
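
The negative prompting step described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the scoring rule (positive similarity minus a weighted negative similarity), the `penalty` weight, the function names, and the toy 3-dimensional embeddings are all assumptions made for the example.

```python
import math


def normalize(v):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rerank_with_negative(candidates, positive_vec, negative_vec, penalty=0.5):
    """Re-rank (image_id, embedding) pairs.

    Assumed score: sim(image, positive) - penalty * sim(image, negative).
    `penalty` is a hypothetical weight, not a value taken from the thesis.
    """
    pos = normalize(positive_vec)
    neg = normalize(negative_vec)
    scored = []
    for image_id, emb in candidates:
        e = normalize(emb)
        scored.append((image_id, dot(e, pos) - penalty * dot(e, neg)))
    # Highest combined score first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings: axis 0 = wanted attribute, axis 2 = unwanted attribute.
candidates = [
    ("img_a", [0.7, 0.7, 0.0]),
    ("img_b", [0.8, 0.0, 0.6]),  # best positive match, but has the unwanted attribute
    ("img_c", [0.1, 0.9, 0.0]),
]
positive = [1.0, 0.0, 0.0]
negative = [0.0, 0.0, 1.0]

without_negative = rerank_with_negative(candidates, positive, negative, penalty=0.0)
with_negative = rerank_with_negative(candidates, positive, negative, penalty=0.5)
```

In this toy data, `img_b` leads when the negative query is ignored and drops behind `img_a` once the penalty is applied, which is the behaviour the abstract describes as excluding unwanted visual attributes.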

Keywords:

person retrieval, text description, vision-language models, vector database, negative prompting, clustering, web application

Date of defence

16.06.2026

Result of the defence

Defended (thesis was successfully defended)

Grading

B

Process of defence

The student first presented the results achieved in the thesis. The committee then reviewed the supervisor's evaluation and the reviewer's report. The student subsequently answered the reviewer's questions and further questions from those present. Based on the reviewer's report, the supervisor's evaluation, the presentation, and the student's answers to the questions raised, the committee decided to grade the thesis B.

Topics for thesis defence

  1. For the negative queries, do you really have to retrieve a larger set of images with embedding vectors and re-rank them in the backend based on the dot product with the negative vector? Isn’t there a simpler and more efficient way to do that?
  2. Explain the evaluation. Who did the experiments? What was the order of embedding models used? Could you quantify the uncertainty of the measured results? Are some of the observed differences strong enough to formulate generalized conclusions?
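
One possible direction for the first topic, offered as an assumption rather than as the reviewer's intended answer: if the re-ranking score is a linear combination of dot products (positive similarity minus a weighted negative similarity), the dot product's linearity means the same score is obtained from a single combined query vector. The sketch below checks this identity on toy data; the scoring rule, the `penalty` weight, and all names are hypothetical.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def combined_query(positive_vec, negative_vec, penalty):
    """Fold the negative prompt into the query: q = p - penalty * n."""
    return [p - penalty * n for p, n in zip(positive_vec, negative_vec)]


positive = [0.6, 0.8, 0.0]
negative = [0.0, 0.6, 0.8]
penalty = 0.5  # hypothetical weight

# Toy image embeddings, not data from the thesis.
images = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.0, 0.0, 1.0],
}

q = combined_query(positive, negative, penalty)

# Two-step scoring (search, then penalise) vs. one dot product with q.
two_step = {
    name: dot(emb, positive) - penalty * dot(emb, negative)
    for name, emb in images.items()
}
one_step = {name: dot(emb, q) for name, emb in images.items()}

max_difference = max(abs(two_step[k] - one_step[k]) for k in images)
same_order = (
    sorted(images, key=two_step.get, reverse=True)
    == sorted(images, key=one_step.get, reverse=True)
)
```

Under this assumption the vector database could rank by the combined vector in one search, with no larger candidate set fetched for re-ranking. If the database normalises the query for cosine similarity, every score is scaled by the same positive factor, so the order of results is unchanged as long as the combined vector is non-zero.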

Language of thesis

English

Faculty

Department

Study programme

Information Technology (BIT)

Composition of Committee

doc. Ing. Tomáš Martínek, Ph.D. (chair)
doc. Ing. Michal Španěl, Ph.D. (vice-chair)
Ing. Jiří Hynek, Ph.D. (member)
Ing. Filip Orság, Ph.D. (member)
Ing. Vladimír Bartík, Ph.D. (member)

Supervisor’s report
Ing. Markéta Juránková, Ph.D.

The student worked actively throughout the year, independently proposed solutions, and consulted them. He fulfilled all points of the assignment and completed the thesis well ahead of schedule. I therefore award the supervisor's grade A.

Evaluation criteria Verbal classification
Assignment information

The thesis was of average difficulty, focused on creating intuitive software using existing technologies. All points of the assignment were fulfilled and the results achieved meet the assignment's requirements.

Work with literature

The student actively proposed his own solutions and modifications to the system.

Activity during the work, consultations, communication

The student consulted the thesis regularly throughout the year and came to meetings prepared, with current progress and suitable questions for consultation.

Activity during completion

The thesis was completed on time and its final form was consulted.

Publications, awards

The output of the thesis is suitable for publication as open-source software.

Points proposed by supervisor: 90

Grade proposed by supervisor: A

Reviewer’s report
Ing. Michal Hradiš, Ph.D.

The student created a working application which can be used to search in collections of images. However, the thesis does not demonstrate a systematic approach to understanding the problem domain, designing the solution, or testing it.

Evaluation criteria Verbal classification Points
Assignment difficulty

Evaluation level: more difficult assignment

The thesis topic combines advanced image and text processing with web application development.

Presentation level of the technical report

The text is understandable, but the structure and separation of topics could be improved. The text was clearly written to describe an already finished system. It focuses solely on the methods, models, and approaches used by the student, and lacks discussion and justification of the choices made.

Specific issues:

  • Section 2.1: Vision-language models are generally understood as generative models, not CLIP and similar models.
  • Section 2.2 should rather present methods for semantic similarity search, not the specific Qdrant vector database.
  • Section 2.3: The idea and possible approaches to negative prompting are not explained in sufficient detail.
  • Section 2.4 is limited to datasets used by the student; however, many other datasets exist, such as RSTPReid, ICFG-PEDES, IIITD-20K, UFine6926 / UFineBench, MALS, LUPerson-T, LUPerson-MLLM, HAM-PEDES, CUHK-SYSU-TBPS, PRW-TBPS, PRW-TPS-CN, and ScenePerson-13W.
  • Chapter 3: Users and use cases should be presented before the system requirements, which should be motivated by the use cases.
  • Chapter 3: The requirements already specify the technical solution.
  • Chapter 3: The requirements are stated without sufficient motivation, discussion, or justification. The same problem applies to the system architecture and UI design.
  • Section 3.5 could be more concise, for example by presenting the backend routes in a table.
  • Chapter 4: The experimental protocol is not described in sufficient detail.
Points: 70

Formal quality of the technical report

The thesis is reasonably well written. The typography is satisfactory, with some reservations:

  • Sections and chapters often do not start with introductory text, and several headings follow one another without intervening text, for example in Chapter 3.
  • The tables with results could be moved to an appendix, while the main text could present only aggregated values. The graphs in Section 4.5 are also not ideal.
  • Tables and figures should preferably be placed at the top of the page, as their placement in the middle of the page interrupts the flow of the text.
  • The lists on page 30 are poorly formatted.
  • Some floats are not referenced from the text, for example Table 2.1 and Listing 3.1.
  • Listing 3.1 is split across pages and is not referenced from the text. It is also not clear whether the Python code is necessary.
  • The abbreviations NC / Cl. and O.NC / O.C are not explained.
Points: 70

Implementation output

The student created a working web application. I appreciate that it includes asynchronous image processing and adequate configuration. On the other hand, it is still a rather basic, single-user local application with serious technical limitations: there are no user accounts, uploads are stored in memory, and the asynchronous backend contains long-running synchronous code.

Some of the UI choices are somewhat unexpected, and the text does not suggest any interaction with potential users or deeper consideration of practical use cases. The application does not include documentation or tests.

The evaluation was manual and was probably performed by the author. This has both advantages and disadvantages. It could simulate real usage and interaction with the system. However, the evaluation protocol is not described in sufficient detail to assess its validity, and no statistical analysis was performed. In any case, automatic testing should also have been performed; even the negative query selection could have been simulated.
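
The missing statistical analysis could take many forms. One common option is a percentile bootstrap confidence interval over per-query outcomes. The sketch below is only an illustration of that technique: the function name, the number of resamples, and the hit/miss data are invented for the example and are not results from the thesis.

```python
import random


def bootstrap_ci(per_query_hits, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-query outcomes."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(per_query_hits)
    means = []
    for _ in range(n_resamples):
        # Resample the queries with replacement and recompute the metric.
        sample = [per_query_hits[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up outcomes for 20 queries: 1 = target person found in the top results, 0 = not found.
hits = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

point_estimate = sum(hits) / len(hits)
lower, upper = bootstrap_ci(hits)
```

With only a few dozen manually evaluated queries, such intervals are typically wide, so reporting them alongside the point estimates would show whether differences between models are large enough to support general conclusions.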

Points: 73

Usability of results

The application works and can be used as a simple local tool. 

Extent to which the assignment requirements were met

Evaluation level: the student deviated from the assignment with justification

The solution does not use large language models. The question is to what extent this was a specific objective of the thesis topic, but the solution is meaningful as presented.

Length of the technical report

Evaluation level: within the usual range

The thesis lacks an overview of current approaches and tools for searching for people by appearance, of existing datasets, and of evaluation methodologies. The number of tables, graphs, and UI images is excessive.

Work with literature

The thesis references 18 relevant and generally high-quality sources. The sources are used well. However, I am missing a literature review beyond the sources directly used by the student. The thesis does not include even a review of similar applications.

Points: 69
Topics for thesis defence:
  1. Explain the evaluation. Who did the experiments? What was the order of embedding models used? Could you quantify the uncertainty of the measured results? Are some of the observed differences strong enough to formulate generalized conclusions?
  2. For the negative queries, do you really have to retrieve a larger set of images with embedding vectors and re-rank them in the backend based on the dot product with the negative vector? Isn’t there a simpler and more efficient way to do that?
Points proposed by reviewer: 72

Grade proposed by reviewer: C

Responsibility: Mgr. et Mgr. Hana Odstrčilová