Master's Thesis

Recognition of Trajectories in Video Sequences

Author of thesis: Ing. Menghan Zhang

Acad. year: 2025/2026

Supervisor: prof. Dr. Ing. Pavel Zemčík, dr. h. c.

Abstract:

This thesis explores trajectory-related learning in video sequences using neural-network-based methods. Based on a literature review, Masked Motion Encoding (MME) is chosen as the baseline, since it reconstructs motion trajectories during self-supervised video pre-training. Building on this method, the thesis investigates a motion-biased tube masking strategy as a possible way to better align masking with the trajectory-oriented objective. The method is implemented within the MME framework without adding learnable parameters or using optical flow, and is examined through pre-training on Kinetics-400 and fine-tuning on UCF-101, HMDB-51, and Something-Something V2. The thesis also discusses the resulting observations, limitations, and possible directions for future work.

Keywords:

video sequence analysis, trajectory recognition, self-supervised learning, masked video modeling, motion trajectory reconstruction, Masked Motion Encoding, Vision Transformer, action recognition

Date of defence

25.06.2026

Result of the defence

Defended (thesis was successfully defended)

Grading

E

Process of defence

The student first presented the results she had achieved in the thesis. The committee then reviewed the supervisor’s evaluation and the reviewer’s assessment. The student then answered questions from the reviewer and additional questions from those present at the defence. Based on the reviewer’s assessment, the supervisor’s evaluation, the excellent presentation, and the student's answers to the questions asked, the committee evaluated the thesis with the grade E.

Topics for thesis defence

In the text, it is written that "video information density is much lower than that of images". Explain this statement. Intuitively, one would expect to find more information in a video that is technically a sequence of images.
Why is the "Clip Length" cell in the "HMDB-51" row not defined in Tab. 4.2?
The author did not use the same parameters of pre-training epochs as in the original paper. Why? How long did the process take? Was it not possible to measure due to time limitations? Does this fact not affect the results? The method might be designed to work well assuming the same settings as the original authors described. Are the results still meaningful even with the low number of epochs and one single dataset?
The measurements use "accuracy" as the main metric. How exactly is it defined?
Can you comment on the review? Were auxiliary scripts a part of your thesis?

Language of thesis

English

Faculty

Faculty of Information Technology

Department

Department of Computer Graphics and Multimedia

Study programme

Master of Information Technology (MIT-EN)

Composition of Committee

prof. Ing. Adam Herout, Ph.D. (chair)
prof. Dr. Ing. Pavel Zemčík, dr. h. c. (vice-chair)
doc. RNDr. Milan Češka, Ph.D. (member)
Ing. David Bařina, Ph.D. (member)
doc. Ing. Vítězslav Beran, Ph.D. (member)
Ing. Tomáš Milet, Ph.D. (member)

Supervisor’s report
prof. Dr. Ing. Pavel Zemčík, dr. h. c.

Overall, I believe that the work is solid. The work could have been further improved with more extensive experiments and perhaps more new results, but it still met expectations and the student managed, after a series of experiments with alternating results, to successfully modify the algorithm and evaluate its results. Therefore, I give it an overall grade of good (C).

Evaluation criteria	Verbal classification
Information on the assignment	The assignment was, I believe, quite challenging. It concerned video processing algorithms with a focus on tracking. The assignment was difficult because it required extensive study of scientific publications, obtaining sources, and experimenting with them somewhat beyond the standard. I believe that the student completed the task and achieved good experimental results. It is true that the scope of experimental work and algorithm modifications could have been wider, but I am still quite satisfied with the results, as they could serve as a basis for future work and even for publishable experimental work.
Activity during completion	The work was completed only just before submission, which is slightly unfortunate: the experimental part could have been more extensive, and the text could be corrected only at the last minute and would benefit from more extensive corrections.
Publications, awards	N/A
Work with literature	The student worked with the literature proactively and sought literature and other resources far beyond the recommendations.
Activity during the work, consultations, communication	The student was quite active while working on the thesis. At first she attended consultations regularly; then the intensity of consultations decreased, although I believe that the student was working on the thesis continuously; consultations then resumed before the end of the semester. She was always prepared.

Points proposed by supervisor: 70

Grade proposed by supervisor: C

Reviewer’s report
Ing. Tomáš Chlubna, Ph.D.

The work is interesting from a research point of view; however, the amount of implementation work is far below the average level of a master's thesis. The text contains several non-critical formal issues, and the description of the author's contribution is very short.

The research on state-of-the-art is thorough and the author apparently gained an excellent understanding of the baseline method used in this thesis. The design of the experiment is also interesting. The results show that the proposed extension does not lead to a significant improvement. However, this is not a problem since this thesis is an experimental attempt and reveals interesting conclusions nonetheless. The main issue is that the proposed extension is very simple. The thesis describes several future-work improvements. At least these should have been present in the current work to make it more solid. The training process and experimental evaluation of the method probably took a lot of time, however, that does not justify the low quality of the original contribution.

Evaluation criteria	Verbal classification	Points
Extent of fulfilment of the assignment requirements	Evaluation level: assignment fulfilled with more serious reservations. All points of the assignment were addressed, but the proposed extension of the existing method is simple. The assignment was formally completed. However, the implemented extension is not very complex and does not even seem to be efficient. It is questionable whether the amount of implementation work is high enough for the standards of a master's thesis at this faculty. It is rather at the level of an average project in standard study courses.
Extent of the technical report	Evaluation level: meets only the minimum requirements. The thesis contains a thorough description of the state-of-the-art methods; however, the description of the author's contribution is limited. The thesis is approximately 73 standard pages long, according to the FIT Thesis Checker. This is slightly below the average but fulfils the required limits. The theoretical chapters are very rich and informative. In contrast, the chapters on the main proposal and its realisation are very brief.
Presentation level of the technical report	The text deviates from the standard structure, as the proposal, implementation, and measurement are merged into one chapter, which is short compared to the theoretical chapters. The two theoretical chapters are 23 pages long, while the description of the author's original contribution and its evaluation is 12 pages long. This ratio of almost two to one suggests that the original contribution is not very significant. Furthermore, Section 4.2.1 describes the properties of existing datasets and should be placed in the theoretical chapters. The author described the difference between the original method and the novel contribution well. The overall description of the theory and the contribution is well written. The experiments are also well described, and the analysis of the results is thorough.	57
Formal aspects of the technical report	The quality of the text is acceptable; only a few non-critical issues are present. The text contains only a few minor mistakes and typos (e.g., missing spaces before parentheses and after punctuation, wrong quotation marks, misplaced paragraph indentation, the reference to chapter 4 not capitalised, etc.). The excessive length of some sentences (e.g., under Tab. 2.1) worsens the clarity of the text. The sentence starting with "Just as the meaning..." seems to be duplicated. The paragraphs on page 21 are separated by an excessively large vertical space. The first-person narrative sometimes disrupts the technical nature of the text. The author created and recreated several nice figures in vector format. The names of the figures in the captions are highlighted in red, which is not a common format. The caption of Tab. 3.2 does not explain the meaning behind the percentage values in the last two columns. The captions of Tabs. 3.3 and 3.4 do not end with a full stop, unlike the rest. Some variables in equations are not defined in the text (e.g., Eq. 2.1 has them defined in the caption of Fig. 2.4, referenced later, and variable C is not defined in Eq. 4.1 at all). Tabs. 2.1, 3.1, and 3.2 are not referenced in the text. Eq. 2.3 is not followed by a full stop, although it is part of the previous sentence. Eqs. 4.4 and 4.5 are duplicates.	79
Work with literature	The thesis properly cites 47 relevant and mostly high-quality scientific sources. A small number of citations might be missing. Mentions of some algorithms, such as Farnebäck, could have been cited. A similar approach to the one proposed in this thesis exists in a highly relevant but not cited paper: B. Li, J. Chen, G. Li, D. Zhang, X. Bao and D. Huang, "Cross-Modal Contrastive Masked AutoEncoder for Compressed Video Pre-Training," in IEEE Transactions on Image Processing, vol. 34, pp. 4500-4514, 2025, doi: 10.1109/TIP.2025.3583168. Footnotes are sometimes not placed on the same page as their references (e.g., page 11). Some citation brackets are not preceded by spaces.	86
Implementation output	The idea behind the proposed extension is good; however, it is a very simple algorithm which could have been implemented in one day. The extension of the existing method sounds reasonable and worth experimenting with. However, it is a simple approach, consisting of computing average pixel differences over several video frames, computing the pixel-block average of these values, and using them as a mask. It is basically a simple preprocessing of the input data, while the rest is an unchanged state-of-the-art method. The author states in the README file that the main changes, compared to the original code of the existing method, are located in the files transforms/masking.py, dataset/_init_.py and directory scripts/run_scrips. The changes were assessed using a diff tool. The first and second files contain 171 and 34 changed lines, respectively. The directory contains 14 scripts, which are duplicates of one script, 50 lines long, only with different parameters.	34
Usability of results	The work has scientific potential, as it extends an existing method and experimentally evaluates its efficiency. The scientific approach, analysing the method and proposing an improvement of the masking strategy, is reasonable, and the experiments are well conducted and evaluated. With further improvements and experimental work, the thesis could lead to a scientific paper.
Difficulty of the assignment	Evaluation level: more difficult assignment. The assignment requires a detailed survey of video trajectory analysis methods, a thorough study of a selected method, and its substantial improvement. Extending a state-of-the-art method can be difficult and requires many experiments.
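
For orientation only, the three steps named in the reviewer's description of the implemented extension (average pixel differences over several frames, a pixel-block average of those values, and their use as a mask) can be sketched in a few lines of NumPy. This is not the thesis code and not the MME implementation: the function name `motion_biased_tube_mask`, the patch size, the mask ratio, the grayscale input, and the choice to sample masked patches with probability proportional to the motion score are all assumptions made for illustration.

```python
import numpy as np


def motion_biased_tube_mask(video, patch=16, mask_ratio=0.9, rng=None):
    """Hypothetical sketch of a motion-biased tube mask.

    video: float array of shape (T, H, W), grayscale, with H and W
    divisible by `patch`. Returns a boolean array of shape
    (T, H // patch, W // patch) where True marks a masked patch.
    """
    rng = np.random.default_rng() if rng is None else rng
    t, h, w = video.shape
    gh, gw = h // patch, w // patch

    # Step 1: average absolute difference between consecutive frames.
    diff = np.abs(np.diff(video, axis=0)).mean(axis=0)  # shape (H, W)

    # Step 2: average the differences inside each patch-sized block.
    score = diff.reshape(gh, patch, gw, patch).mean(axis=(1, 3)).ravel()

    # Step 3 (assumption): sample masked patches with probability
    # proportional to the motion score. The small constant keeps
    # static patches eligible so sampling without replacement works.
    prob = score + 1e-6
    prob = prob / prob.sum()
    n_mask = int(round(mask_ratio * score.size))
    chosen = rng.choice(score.size, size=n_mask, replace=False, p=prob)

    spatial = np.zeros(score.size, dtype=bool)
    spatial[chosen] = True

    # Tube masking: the same spatial mask is repeated for every frame.
    return np.broadcast_to(spatial.reshape(gh, gw), (t, gh, gw)).copy()
```

Under these assumptions the sketch uses no optical flow and no learnable parameters, which matches the constraint stated in the abstract; how the thesis actually converts the block averages into a mask is not specified in this record.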

Topics for thesis defence:

In the text, it is written that "video information density is much lower than that of images". Explain this statement. Intuitively, one would expect to find more information in a video that is technically a sequence of images.
Why is the "Clip Length" cell in the "HMDB-51" row not defined in Tab. 4.2?
The author did not use the same parameters of pre-training epochs as in the original paper. Why? How long did the process take? Was it not possible to measure due to time limitations? Does this fact not affect the results? The method might be designed to work well assuming the same settings as the original authors described. Are the results still meaningful even with the low number of epochs and one single dataset?
The measurements use "accuracy" as the main metric. How exactly is it defined?

Points proposed by reviewer: 48

Grade proposed by reviewer: F

Responsibility: Mgr. et Mgr. Hana Odstrčilová
