Detail publikačního výsledku

Towards Efficient Matching of Regexes with Backreferences using Register Set Automata

HAVLENA, V.; HOLÍK, L.; LENGÁL, O.; VAŠÁK, J.; GULČÍKOVÁ, S.

Originální název

Towards Efficient Matching of Regexes with Backreferences using Register Set Automata

Anglický název

Towards Efficient Matching of Regexes with Backreferences using Register Set Automata

Druh

Článek Scopus

Originální abstrakt

Matching regexes (regular expressions) is a common problem in many areas of computer science, with requirements on high speed and robust performance. Regexes with backreferences allow one to express certain patterns (even beyond regular) concisely, however, since the matching is usually done by backtracking, the matching speed can degrade to a degree that constitutes a service failure or a security threat. To facilitate high-speed matching of such regexes, we propose register set automata (RSAs), an extension of register automata where registers can contain sets of symbols (from a potentially infinite alphabet) and the following operations are supported: adding input values to registers, merging or clearing registers, and testing whether a register contains a value. We show that a large class of register automata can be transformed into deterministic RSAs, which can serve as a basis for fast matching of a family of regexes with single-letter capture groups and backreferences. We also give a derivative-based algorithm for transforming a large class of regexes with backreferences to register automata and show that the time complexity of matching is linear and quadratic to the length of the input for finite and infinite alphabets respectively. Our prototype implementation of a regex matcher shows that our approach can significantly improve the robustness of state-of-the-art regex matchers on regexes with backreferences. We also study the theoretical properties of the model and show that the emptiness problem for RSAs is decidable and complete for the F ω class and that RSAs are incomparable in expressive power to other popular automata models over data words.

Anglický abstrakt

Matching regexes (regular expressions) is a common problem in many areas of computer science, with requirements on high speed and robust performance. Regexes with backreferences allow one to express certain patterns (even beyond regular) concisely, however, since the matching is usually done by backtracking, the matching speed can degrade to a degree that constitutes a service failure or a security threat. To facilitate high-speed matching of such regexes, we propose register set automata (RSAs), an extension of register automata where registers can contain sets of symbols (from a potentially infinite alphabet) and the following operations are supported: adding input values to registers, merging or clearing registers, and testing whether a register contains a value. We show that a large class of register automata can be transformed into deterministic RSAs, which can serve as a basis for fast matching of a family of regexes with single-letter capture groups and backreferences. We also give a derivative-based algorithm for transforming a large class of regexes with backreferences to register automata and show that the time complexity of matching is linear and quadratic to the length of the input for finite and infinite alphabets respectively. Our prototype implementation of a regex matcher shows that our approach can significantly improve the robustness of state-of-the-art regex matchers on regexes with backreferences. We also study the theoretical properties of the model and show that the emptiness problem for RSAs is decidable and complete for the F ω class and that RSAs are incomparable in expressive power to other popular automata models over data words.

Klíčová slova

Antimirov’s derivatives | backreferences | determinization | ReDoS | register automata | register set automata | regular expression matching

Klíčová slova v angličtině

Antimirov’s derivatives | backreferences | determinization | ReDoS | register automata | register set automata | regular expression matching

Autoři

HAVLENA, V.; HOLÍK, L.; LENGÁL, O.; VAŠÁK, J.; GULČÍKOVÁ, S.

Vydáno

08.06.2026

Nakladatel

Association for Computing Machinery

Periodikum

Proceedings of the ACM on Programming Languages-PACMPL

Svazek

10

Číslo

PLDI

Stát

Spojené státy americké

Strany od

855

Strany do

880

Strany počet

26

URL

BibTex

@article{BUT211925,
  author="Vojtěch {Havlena} and Lukáš {Holík} and Ondřej {Lengál} and Jan {Vašák} and Sabína {Gulčíková}",
  title="Towards Efficient Matching of Regexes with Backreferences using Register Set Automata",
  journal="Proceedings of the ACM on Programming Languages-PACMPL",
  year="2026",
  volume="10",
  number="PLDI",
  pages="26",
  doi="10.1145/3808281",
  url="https://dl.acm.org/doi/10.1145/3808281"
}