Talking to Transformers

ai coding

TLDR

  • Four-pillar prompting framework: precise intent, attention-aware railroading, cross-domain compression, and actually reading model outputs.

Key Takeaways

  • Treat attention like a budget: every irrelevant token competes with signal; shorter context improves attention targeting.
  • Use /nothink at the end of the prompt to create a predictable attention sink that doesn’t pollute downstream tokens (see the first sketch after this list).
  • Non-reasoning models (e.g., IBM Granite 4.1) outperform large reasoning models on structured extraction tasks: lower latency and no cross-run variance (see the extraction sketch below).
  • Mirror model-specific RLHF language (e.g., Qwen’s “Now let me…”) to work with the training grain instead of against it.
  • Qwen 3.6 and Gemma4:26bA4b now replace Claude Opus 4.6 as the recommended models for coding and general use, respectively.
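A minimal sketch of the prompt-level tactics above, assuming a local Ollama endpoint: the context is kept short (the attention-budget point), the instruction mirrors Qwen’s “Now let me…” phrasing, and /nothink closes the prompt as a terminal attention sink. The qwen3:30b tag and the build_prompt/generate helpers are illustrative, not the article’s code.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def build_prompt(task: str, context: str) -> str:
    # Keep context minimal: every irrelevant token competes with the signal.
    # Mirror Qwen's RLHF phrasing so the continuation lands in-distribution,
    # then end with /nothink so the thinking toggle sits at the tail of the
    # prompt instead of leaking into downstream answer tokens.
    return (
        f"{context}\n\n"
        f"Now let me {task}.\n"
        "/nothink"
    )

def generate(model: str, prompt: str) -> str:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    prompt = build_prompt(
        task="summarize the failing test output in one sentence",
        context="FAILED test_parser.py::test_empty_input - AssertionError",
    )
    print(generate("qwen3:30b", prompt))  # model tag is a placeholder
```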
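And a sketch of the structured-extraction point, again against a local Ollama endpoint: a small non-reasoning model, greedy decoding, and Ollama’s `"format": "json"` constraint. The granite4:small tag and the field schema are placeholders; swap in whichever Granite build you actually run.

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def extract_fields(text: str, model: str = "granite4:small") -> dict:
    # "format": "json" constrains decoding to valid JSON, which is where a
    # small non-reasoning model shines: one cheap pass, no chain-of-thought
    # tokens to pay for or parse around.
    prompt = (
        'Extract the fields {"name": str, "date": str, "amount": float} '
        "from the text below. Return only JSON.\n\n" + text
    )
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": model,
            "prompt": prompt,
            "format": "json",
            "stream": False,
            "options": {"temperature": 0},  # greedy decoding: no cross-run variance
        },
        timeout=120,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["response"])

print(extract_fields("Invoice from Acme Corp dated 2025-03-14 for $1,280.00"))
```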

Hacker News Comment Review

  • No substantive HN discussion yet.

Original | Discuss on HN