AMÁLIA and the future of European Portuguese LLMs

TLDR

  • AMÁLIA, Portugal’s €5.5M government-funded LLM, continues pre-training from EuroLLM on a data mix that is only ~5.5% European Portuguese; its weights and benchmarks are not yet public.

Key Takeaways

  • AMÁLIA is not trained from scratch; it extends EuroLLM’s pre-training with modified context length and RoPE scaling, adding 107B tokens total.
  • European Portuguese data (from Arquivo.pt) accounts for ~5.8B of those 107B tokens (~5.5%); the SFT stage reaches ~17-18% Portuguese with the help of synthetically generated data.
  • The team built four new benchmarks, including ALBA; AMÁLIA beats Qwen3-8B on most Portuguese evals but loses on ALBA, raising questions about whether the European Portuguese data is sufficient.
  • At time of writing, model weights, training data, logs, and new benchmarks are not publicly available – contradicting the “fully open source” claim and falling short of OLMo-style openness.
  • The author argues that the benchmarks miss a key dimension: intrinsic knowledge about Portugal itself (history, culture, geography), not just language correctness.
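
The token-share figure in the takeaways can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the rounded numbers quoted above (107B added tokens, ~5.8B from Arquivo.pt); the constant and function names are illustrative, and the exact token counts are not given here, so the result is approximate.

```python
# Rounded figures quoted in the takeaways; exact counts are not given,
# so the computed share is only approximate.
TOTAL_ADDED_TOKENS = 107e9   # tokens added during continued pre-training
ARQUIVO_PT_TOKENS = 5.8e9    # European Portuguese tokens from Arquivo.pt


def token_share(part: float, whole: float) -> float:
    """Return part / whole as a fraction, rejecting a non-positive total."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


pt_share = token_share(ARQUIVO_PT_TOKENS, TOTAL_ADDED_TOKENS)
print(f"European Portuguese share: {pt_share:.1%}")  # prints 5.4%
```

With these rounded inputs the share comes out at about 5.4%, which is in line with the ~5.5% quoted once rounding of the inputs is taken into account.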

Hacker News Comment Review

  • Commenters broadly agree the “open source” framing is misleading: no weights, no dataset, broken GitHub links – closer to a technical report than a release.
  • Skepticism runs deep on the strategic rationale: for €5.5M, a small continued-training run on top of EuroLLM is seen as a poor return, with some arguing the budget would go further if spent on improving multilingual frontier models.
  • A practical alternative raised: fine-tune a model to convert Brazilian Portuguese corpora into European Portuguese, bypassing the data scarcity problem entirely.

Notable Comments

  • @algoth1: Suggests converting a Brazilian Portuguese corpus to European Portuguese with a fine-tuned model, as a cheaper data augmentation path.
  • @alexaholic: Points to Anália (analia.pt) as a usable European Portuguese model while AMÁLIA weights remain unreleased.
