AMÁLIA and the future of European Portuguese LLMs

May 11, 2026 · ai

TLDR

Portugal’s €5.5M government-funded LLM (AMÁLIA) continues EuroLLM’s pre-training on a data mix that is only ~5.5% European Portuguese; its weights and benchmarks are not yet public.

AMÁLIA is not trained from scratch; it extends EuroLLM’s pre-training with a modified context length and RoPE scaling, adding 107B tokens in total.
European Portuguese data (Arquivo.pt) accounts for ~5.8B of those tokens (~5.5%); the SFT stage reaches ~17-18% Portuguese with synthetically generated data.
The team built four new benchmarks, including ALBA; AMÁLIA beats Qwen3-8B on most Portuguese evals but loses on ALBA, raising questions about data sufficiency.
At the time of writing, the model weights, training data, logs, and new benchmarks are not publicly available, which contradicts the “fully open source” claim and falls short of OLMo-style openness.
The author argues the benchmarks miss a key dimension: intrinsic knowledge about Portugal itself (history, culture, geography), not just language correctness.
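
As a quick sanity check on the data-mix figures above, the European Portuguese share can be recomputed from the two token counts quoted in this summary. This is only a sketch: the constants are the rounded numbers given here, and the helper name `share_percent` is made up for illustration. With these rounded inputs the result comes out near 5.4%, in line with the roughly 5.5% quoted; the small gap is plausibly down to rounding of the inputs.

```python
# Approximate token counts as quoted in this summary (rounded figures).
TOTAL_CONTINUED_TOKENS = 107e9  # tokens added on top of EuroLLM
EUROPEAN_PT_TOKENS = 5.8e9      # Arquivo.pt portion of those tokens


def share_percent(part: float, whole: float) -> float:
    """Return part/whole expressed as a percentage (hypothetical helper)."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


pt_share = share_percent(EUROPEAN_PT_TOKENS, TOTAL_CONTINUED_TOKENS)
print(f"European Portuguese share of continued pre-training: {pt_share:.1f}%")
```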

Commenters broadly agree the “open source” framing is misleading: with no weights, no dataset, and broken GitHub links, it is closer to a technical report than a release.
Skepticism runs deep on the strategic rationale: for €5.5M, a small continuation fine-tune of EuroLLM is seen as a poor return, with some arguing the budget would go further if spent on improving multilingual frontier models.
A practical alternative raised: fine-tune a model to convert Brazilian Portuguese corpora into European Portuguese, bypassing the data scarcity problem entirely.

@algoth1: Suggests converting Brazilian Portuguese corpora to European Portuguese with a fine-tuned model, as a cheaper data augmentation path.
@alexaholic: Points to Anália (analia.pt) as a usable European Portuguese model while AMÁLIA weights remain unreleased.