Amália and the Future of European Portuguese LLMs

· ai · Source ↗

TLDR

  • A blog post critically reviews AMÁLIA, a €5.5M LLM funded by the Portuguese government and built on EuroLLM, raising concerns about its openness and its share of European Portuguese data.

Key Takeaways

  • AMÁLIA continues pre-training from EuroLLM rather than training from scratch; architecture mirrors EuroLLM with minor context length and RoPE scaling changes.
  • European Portuguese data is thin: only 5.8B of the 107B extended pre-training tokens come from Arquivo.pt, roughly 5.4%; the Portuguese share of SFT data is 17-18%.
  • Despite that, AMÁLIA beats Qwen3-8B on most Portuguese benchmarks, though Qwen3-8B still leads on the new ALBA benchmark.
  • Openness is incomplete at time of writing: no public model weights, training data, training logs, or new benchmarks – only some GitHub repos and Arquivo.pt processing scripts.
  • Benchmarks cover grammar, syntax, and Brazilian Portuguese bias but miss factual knowledge about Portugal specifically, a gap the author flags as a missed opportunity.
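The European Portuguese token share cited above is a simple ratio of the figures from the post; a quick sanity check:

```python
# Figures quoted in the post: 5.8B Arquivo.pt tokens out of
# 107B total extended pre-training tokens.
arquivo_tokens_b = 5.8
total_tokens_b = 107.0

share = arquivo_tokens_b / total_tokens_b
print(f"European Portuguese share: {share:.1%}")  # → 5.4%
```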

Hacker News Comment Review

  • No substantive HN discussion yet.

Original | Discuss on HN