Master's Thesis

Reinforcement Learning of Agents with Memory

Author of thesis: Bc. Samuel Kuchta

Acad. year: 2025/2026

Supervisor: Ing. Michal Hradiš, Ph.D.

Reviewer: doc. Ing. Michal Španěl, Ph.D.

Abstract:

This thesis contributes a purpose-built environment, GridMazeWorld, designed to confront reinforcement learning agents with multiple, interdependent memory demands under partial observability. The environment combines procedurally generated mazes, non-local button–door dependencies, periodic dynamics, regrowing resources, and an energy budget, requiring simultaneous spatial, sequential, and relational memory. Alongside a comprehensive survey of memory-focused benchmarks and mechanisms, we implement and compare recurrent (LSTM) and attention-based (Transformer) architectures under controlled training with curriculum management. Our empirical study examines the influence of hyperparameter choices, network scaling, grid size, and curriculum design on the agents' ability to form and exploit internal memory, offering a systematic analysis of memory in reinforcement learning under partial observability.

Keywords:

partial observability, reinforcement learning, memory, GridMazeWorld, recurrent neural networks.

Date of defence

24.06.2026

Result of the defence

Defended (thesis was successfully defended)

Grading

D

Process of defence

The student first presented the results achieved in his thesis. The committee then reviewed the supervisor's evaluation and the reviewer's report. The student subsequently answered the reviewer's questions and further questions from those present. Based on the reviewer's report, the supervisor's evaluation, the presentation, and the student's answers to the questions, the committee decided to grade the thesis D.

Topics for thesis defence

  1. Is the design of the environment generator and the algorithms inspired by any existing environment, or is it your own?
  2. When initialising an empty grid, rather than caching the initial articulation point calculation, which is always the same, wouldn't it be better to randomly place some isolated obstacles in the grid?
  3. Do you have any insights into why the transformer training is failing? It seems that when task complexity increases (Figure 9.9), the transformer overwrites weights that were critical for simpler tasks. Since curriculum learning builds on prior knowledge, losing earlier representations destabilises the policy and the learning process. If my guess makes sense, what could be the reasons?
  4. How did you find good values for the parameters used in complexity adjustment, stage switching in curriculum learning, PPO hyperparameters, etc.?
  5. What is the key contribution of your work? Did you investigate extensions of the algorithms? Did you discover anything new?
  6. How do you place the rewards? What was the hardest part of the work?
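
Question 2 concerns articulation points: free cells whose removal would split the walkable area. As a minimal sketch of that computation (not the thesis's C++ implementation; the function name `articulation_cells` and the `.`/`#` grid encoding are assumptions made for illustration), a Tarjan-style depth-first search on a 4-connected grid could look like this:

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Cell = Tuple[int, int]


def articulation_cells(grid: List[str]) -> Set[Cell]:
    """Return free cells whose removal disconnects their connected component.

    `grid` is a list of equal-length strings: '.' is free, '#' is an obstacle
    (an assumed encoding). Cells are 4-connected. The DFS is recursive, so this
    sketch is only suitable for small grids.
    """
    rows, cols = len(grid), len(grid[0])
    disc: Dict[Cell, int] = {}  # discovery time of each visited cell
    low: Dict[Cell, int] = {}   # lowest discovery time reachable from the subtree
    result: Set[Cell] = set()
    timer = 0

    def neighbours(cell: Cell) -> Iterator[Cell]:
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == ".":
                yield (nr, nc)

    def dfs(cell: Cell, parent: Optional[Cell]) -> None:
        nonlocal timer
        disc[cell] = low[cell] = timer
        timer += 1
        children = 0
        for nxt in neighbours(cell):
            if nxt == parent:
                continue
            if nxt in disc:
                # Back edge: the subtree can reach an earlier cell.
                low[cell] = min(low[cell], disc[nxt])
            else:
                children += 1
                dfs(nxt, cell)
                low[cell] = min(low[cell], low[nxt])
                # A non-root cell is an articulation point if some child
                # subtree cannot reach above it.
                if parent is not None and low[nxt] >= disc[cell]:
                    result.add(cell)
        # The DFS root is an articulation point only with 2+ DFS children.
        if parent is None and children > 1:
            result.add(cell)

    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == "." and (r, c) not in disc:
                dfs((r, c), None)
    return result
```

On an obstacle-free grid of at least 2×2 cells this function returns an empty set, which is consistent with the question's remark that the initial calculation always gives the same result. In a one-cell-wide corridor such as `["..."]`, the middle cell is reported.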

Language of thesis

English

Faculty

Department

Study programme

Information Technology and Artificial Intelligence (MITAI)

Specialization

Computer Vision (NVIZ)

Composition of Committee

prof. Ing. Adam Herout, Ph.D. (chair)
prof. Ing. Martin Čadík, Ph.D. (vice-chair)
doc. RNDr. Milan Češka, Ph.D. (member)
prof. Dr. Ing. Pavel Zemčík, dr. h. c. (member)
Ing. David Bařina, Ph.D. (member)
Ing. Tomáš Milet, Ph.D. (member)

Supervisor’s report
Ing. Michal Hradiš, Ph.D.

The student has a clear interest in the topic and a determination to understand the current state of knowledge in greater depth. For various reasons, however, his activity was sporadic and he completed the thesis at the last minute.

Evaluation criteria Verbal classification
Information on the assignment

The topic is fairly demanding and stems from the student's own interests. The goal was to investigate advanced aspects of reinforcement learning methods. The topic is demanding because it requires a good understanding of the field, because the behaviour of the methods used is hard to interpret, and because the experiments are hard to design. The student proposed suitable experiments, but owing to gaps in his activity their execution was limited compared with the plan.

Activity during completion

A substantial part of the experiments was carried out at the last minute, and the thesis was submitted after the deadline.

Publications, awards
Work with literature

The student actively sought out the necessary sources and made use of them.

Activity during the work, consultations, communication

Consultations were very limited and the student's activity was sporadic.

Points proposed by supervisor: 69

Grade proposed by supervisor: D

Reviewer’s report
doc. Ing. Michal Španěl, Ph.D.

Mr Kuchta did a great job designing and implementing the GridMazeWorld environment. He clearly has deep knowledge of grid-like RL environments and their challenges. The second part, dedicated to RL experiments, appears unfinished: basic experiments were conducted, but the unsatisfactory transformer results were not further addressed. Time pressure likely accounts for both the incomplete RL experiments and the many shortcomings in the technical report.

Evaluation criteria Verbal classification Points
Extent to which the assignment requirements were fulfilled

Evaluation level: assignment fulfilled

The work designs and effectively implements a procedurally generated environment with a dynamic curriculum manager and experiments with baseline LSTM and transformer agents. The main emphasis is on the first part.

Length of the technical report

Evaluation level: within the usual range

The work contains several chapters dedicated to RL and an overview of existing methods, but the author merely meets the common requirement of providing an overview without apparent effort to apply the acquired knowledge to his own design or identify the most suitable SotA approaches.

  • Figures in theoretical chapters lack deeper comments. The work uses a transformer, so a more detailed description of the transformer architecture (Figure 3.6), positional embeddings (Figure 3.3), etc. would be expected.

Presentation level of the technical report

Chapter 6 presents the environment design clearly with well-motivated choices, though figures and pseudocode or flowcharts would improve clarity.

Figures 7.1 and 7.2 are overly simplistic and do not adequately describe the model architectures.

The structure of Chapters 8 and 9 is unnatural. Experiments are described on half a page with findings summarised in one sentence, while all relevant graphs appear in the next chapter without explanation of what they reveal or how to interpret them. It would be more appropriate to combine the graphs for different parameter settings (e.g., Figure 9.6) into a single graph, with one curve for each parameter.

I did not find any formulation of the rewards the agent receives; omitting such a “detail” is a pity.

Points: 60
Formal aspects of the technical report

The typographic and linguistic aspects of the work are good.

  • Bold highlighting in theoretical chapters is inconsistently replaced by a different font in later chapters.
  • The bibliography lacks required details; the first 5 items are mere links suitable only as footnotes.
  • Published works should be cited by their conference or journal rather than arxiv.org, which is not peer-reviewed.
Points: 70
Work with literature

The literature is broad, but the author's presentation does not indicate how it influenced his RL model design. Chapter 5 on existing environments is well written, briefly summarising many variants and their limitations, and noting the potential of his own solution, GridMazeWorld. However, an overview of which specific RL models and techniques have been successful on benchmarks is missing; this would have been an ideal starting point for the author's own baseline.

Points: 75
Implementation output

The program solution is the most extensive result and deserves recognition. The algorithmic design of the environment generator is well thought out, implemented in C++ for efficiency, with parallel generation and execution, and includes Python bindings.

Code quality and structure are generally good. However, large parts of the C++ code are undocumented. The GridMazeWorld class is too large and should be split into smaller components: world representation, generation algorithms, execution logic, and customisable parts such as reward calculation.
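
The split suggested above can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the thesis's C++, and every name in it (`World`, `Environment`, `energy_delta_reward`, `empty_world`), the one-unit energy cost per step, and the reward itself are hypothetical; the thesis's actual reward formulation is not given in this report.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class World:
    """World representation only (hypothetical fields): 0 = free, 1 = wall."""
    grid: List[List[int]]
    agent_pos: Tuple[int, int]
    energy: int


# Customisable part: any function of (previous world, current world).
RewardFn = Callable[[World, World], float]


def energy_delta_reward(prev: World, curr: World) -> float:
    """Illustrative reward only: the change in energy between two steps."""
    return float(curr.energy - prev.energy)


def empty_world(seed: int) -> World:
    """Stand-in generator: a 3x3 obstacle-free grid; the seed is unused here."""
    return World([[0] * 3 for _ in range(3)], (0, 0), energy=10)


class Environment:
    """Execution logic; generation and reward are injected, not built in."""

    def __init__(self, generate: Callable[[int], World], reward_fn: RewardFn):
        self._generate = generate
        self._reward_fn = reward_fn
        self.world: Optional[World] = None

    def reset(self, seed: int) -> World:
        self.world = self._generate(seed)
        return self.world

    def step(self, move: Tuple[int, int]) -> float:
        if self.world is None:
            raise RuntimeError("call reset() before step()")
        prev = self.world
        r, c = prev.agent_pos
        nr, nc = r + move[0], c + move[1]
        rows, cols = len(prev.grid), len(prev.grid[0])
        # The agent stays in place when the move hits a wall or the border.
        if not (0 <= nr < rows and 0 <= nc < cols) or prev.grid[nr][nc] == 1:
            nr, nc = r, c
        # Assumed dynamics: every step costs one unit of energy.
        self.world = World(prev.grid, (nr, nc), prev.energy - 1)
        return self._reward_fn(prev, self.world)
```

With this layout, swapping the reward means passing a different function to `Environment`, for example `Environment(empty_world, energy_delta_reward)`, without touching the generation or stepping code.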

Points: 87
Usability of results

The environment generator and runtime scripts offer an interesting alternative to existing solutions. The author also presents some nice ideas for extending the environment!

Difficulty of the assignment

Evaluation level: more demanding assignment

The topic places high demands on the literature review of reinforcement learning and on the realisation of the environment itself. Understanding and training RL techniques that leverage memory requires a deep understanding of those principles.

Topics for thesis defence:
  1. Do you have any insights into why the transformer training is failing? It seems that when task complexity increases (Figure 9.9), the transformer overwrites weights that were critical for simpler tasks. Since curriculum learning builds on prior knowledge, losing earlier representations destabilises the policy and the learning process. If my guess makes sense, what could be the reasons?
  2. When initialising an empty grid, rather than caching the initial articulation point calculation, which is always the same, wouldn't it be better to randomly place some isolated obstacles in the grid?
  3. Is the design of the environment generator and the algorithms inspired by any existing environment, or is it your own?
  4. How did you find good values for the parameters used in complexity adjustment, stage switching in curriculum learning, PPO hyperparameters, etc.?
Points proposed by reviewer: 75

Grade proposed by reviewer: C

Responsibility: Mgr. et Mgr. Hana Odstrčilová