R&D result detail

Original title

A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages

Type

Paper in proceedings not indexed in WoS or Scopus

Original abstract

Large Language Models (LLMs) are increasingly used to generate synthetic textual data for training smaller specialized models. However, a comparison of various generation strategies for low-resource language settings is lacking. While various prompting strategies have been proposed (such as demonstrations, label-based summaries, and self-revision), their comparative effectiveness remains unclear, especially for low-resource languages. In this paper, we systematically evaluate the performance of these generation strategies and their combinations across 11 typologically diverse languages, including several extremely low-resource ones. Using three NLP tasks and four open-source LLMs, we assess downstream model performance on generated versus gold-standard data. Our results show that strategic combinations of generation methods, particularly target-language demonstrations with LLM-based revisions, yield strong performance, narrowing the gap with real data to as little as 5% in some settings. We also find that smart prompting techniques can reduce the advantage of larger LLMs, highlighting efficient generation strategies for synthetic data generation in low-resource scenarios with smaller models.

Keywords

multilingual evaluation, less-resourced languages, model analysis, synthetic data generation

Authors

ANIKINA, T.; ČEGIŇ, J.; ŠIMKO, J.; OSTERMANN, S.

RIV year

2026

Published

4 November 2025

Publisher

Association for Computational Linguistics

Place

Suzhou, China

ISBN

979-8-89176-332-6

Pages from

8293

Pages to

8314

Page count

22

URL

https://aclanthology.org/2025.emnlp-main.418/

BibTeX

@inproceedings{BUT198568,
  author="T. {Anikina} and Ján {Čegiň} and Jakub {Šimko} and S. {Ostermann}",
  title="A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages",
  year="2025",
  pages="8293--8314",
  publisher="Association for Computational Linguistics",
  address="Suzhou, China",
  doi="10.18653/v1/2025.emnlp-main.418",
  isbn="979-8-89176-332-6",
  url="https://aclanthology.org/2025.emnlp-main.418/"
}
