Bachelor's Thesis

Automata techniques in DNA analysis

Final Thesis 2.83 MB

Author of thesis: Bc. Lucie Klímová

Acad. year: 2024/2025

Supervisor: doc. Mgr. Lukáš Holík, Ph.D.

Reviewer: Mgr. Juraj Síč

Abstract:

This thesis is focused on the possibilities of using finite automata to accelerate the detection of structural domains of transposons. The central part of the thesis introduces a method based on deterministic finite automata (DFA) as a faster alternative to the BLASTX tool. BLASTX is used by the TE-greedy-nester tool, which is designed to detect LTR retrotransposons. As a starting point, we used the HMMER tool, which uses a profile hidden Markov model (PHMM) to precisely describe the character of the searched sequence. Due to the significant nondeterminism of PHMMs, the determinization of the model of the entire domain proved unfeasible. Instead, a method to transform a PHMM into several smaller DFAs designed to detect domain subsequences was introduced. Closely located occurrences of these subsequences are subsequently interpreted as occurrences of the entire domain. The experimental evaluation demonstrated that the presented approach maintains high accuracy while achieving up to a tenfold search speedup compared to BLASTX.

Keywords:

profile hidden Markov models (PHMM), deterministic finite automata (DFA), LTR retrotransposons

Date of defence

16.06.2025

Result of the defence

Defended (thesis was successfully defended)

znamkaAznamka

Grading

A

Process of defence

Studentka nejprve prezentovala výsledky, kterých dosáhla v rámci své práce. Komise se poté seznámila s hodnocením vedoucího a posudkem oponenta práce. Studentka následně odpověděla na otázky oponenta a na další otázky přítomných. Komise se na základě posudku oponenta, hodnocení vedoucího, přednesené prezentace a odpovědí studentky na položené otázky rozhodla práci hodnotit stupněm A.

Topics for thesis defence

  1. Are there any plans to incorporate the proposed algorithm into TE-greedy-nester? How hard would it be?
  2. Konzultovala jste využití nástroje Blast? Do jaké míry je přínosný?
  3. Jakým způsobem plánujete v práci pokračovat?

Language of thesis

English

Faculty

Department

Study programme

Information Technology (BIT)

Composition of Committee

doc. RNDr. Milan Češka, Ph.D. (předseda)
Ing. Zbyněk Křivka, Ph.D. (člen)
Ing. Zdeněk Materna, Ph.D. (člen)
doc. Ing. Jan Kořenek, Ph.D. (člen)
Ing. Jaroslav Rozman, Ph.D. (člen)

Supervisor’s report
doc. Mgr. Lukáš Holík, Ph.D.

The work and results deserve to be priced by themselves, but the students tenacity and overall ability to conduct all kinds of research task highly independently stands out even among the best students.

Evaluation criteria Verbal classification
Informace k zadání

An extremely difficult research assignment. It combines a research on advanced automata-based pattern matching with advanced technology of gene annotations. The assignment is even more difficult due to the supervisor's illiteracy in gene annotation and computational biology. It is also very ambitious, aiming at outperforming state-of-the-art tools in gene annotation. 

Práce s literaturou

Exceptional. The student has compensated for my lack of knowledge in computational biology and studied on need literature form that field by herself, and generally worked with the literature competently and efficiently.

Aktivita během řešení, konzultace, komunikace

This project has been going on for several years, starting by an experimental collaboration and slowly evolving into a serious research endevour with a high potential to generate publications. Most of the credit for this goes to the student, who compensated for my lack of knowledge in computational biology, carried on with tenacity even after major setbacks (we run into several dead ends on the way, for instance with the Alergia learning approach), solved technical problems, understood quickly her advisor's half baked ideas and turned them into meaningful conversations and later solutions, and came up with very good and fresh ideas by herself. She has shown an exceptional abilities in theory and algorithm design as well as in implementation. The amount of work actually exceeds an excellent diploma thesis. 

Aktivita při dokončování

I was able to review the entire text before it was handed in. 

Publikační činnost, ocenění

The work is almost ready for publication, it may even turn into a research direction.  

Points proposed by supervisor: 100

Grade proposed by supervisor: A

Reviewer’s report
Mgr. Juraj Síč

The student presented two novel techniques for the detection of structural domains of transposons. The first approach, based on learning automata, did not yield competitive results. The second one is based on splitting profile hidden Markov models and turning them into deterministic finite automata which are then used for fast matching. The results show that it can outcompete the state-of-the-art tool BLAST. Considering the quality and scope of the work, I believe it would be suitable even for a master's thesis. Therefore, I recommend awarding the grade A.

Evaluation criteria Verbal classification Points
Náročnost zadání

Evaluation level: obtížnější zadání

I find the assignment quite difficult, as it required the student to study advanced methods of biological sequence alignment along with related topics in automata theory.

Prezentační úroveň technické zprávy

The thesis is very well written, with a good number of examples showing how the proposed algorithm works.

98
Formální úprava technické zprávy

The report is written in solid English, with only very minor problems.

95
Realizační výstup

The proposed algorithm was implemented in a new tool named DUCK. The tool is properly documented and was extensively compared with existing tools BLAST and HMMER.

93
Využitelnost výsledků

I think that the work could lead to a publication.

Rozsah splnění požadavků zadání

Evaluation level: zadání splněno

Rozsah technické zprávy

Evaluation level: přesahuje obvyklé rozmezí

Even though the thesis is longer than usual, it contains no unnecessary information. The length is justified and reflects the substantial amount of work the student has completed.

Práce s literaturou

The thesis contains an extensive overview of existing methods, with appropriate selection and citation of relevant sources. The student’s own work is clearly distinguished from existing approaches.

97
Topics for thesis defence:
  1. Are there any plans to incorporate the proposed algorithm into TE-greedy-nester? How hard would it be?
Points proposed by reviewer: 98

Grade proposed by reviewer: A

Responsibility: Mgr. et Mgr. Hana Odstrčilová