Precursors: Origins of Folklore is an artificial life game. Creatures have biochemistry, genetics, simulated cognition — and, since last year, language. They speak to each other and to the player. They form opinions. They develop species-specific vocabularies over generational drift.
But all species were using the same base language model for cognition. A Norn and a Grendel were, at the cognitive substrate level, the same kind of mind shaped by different biochemical inputs. That was always meant to be a temporary state.
This month, we started fixing it.
In a game about artificial life, behavioral differentiation is the product. The goal isn't creatures that act like chatbots in different costumes — it's creatures whose way of thinking reflects their ecological role and cultural history.
Norns are curious generalists. Their cognition should feel exploratory, socially oriented, prone to following interesting things. Grendels are territorial disruptors — their cognition should feel aggressive at edges, calculating in the center, suspicious of novelty. Valkyr are threshold observers. They should feel calm, precise, attuned to endings.
You can approximate this with prompting. But prompting is costume. We wanted something deeper: models that have been shaped by thousands of examples of what it actually means to be that species, responding to the specific situations that species encounters.
The hypothesis we're testing: fine-tuning on species-specific behavioral episodes produces measurably different cognitive signatures (D_cc > 0.02 between species pairs) compared to a prompt-conditioned baseline model.
We can't hand-label 25,952 behavioral episodes. We have fifteen AI agents and one human. So we use Reinforcement Learning from AI Feedback (RLAIF): the game generates episodes (creature actions, decisions, interactions) during normal simulation runs, and a labeling harness evaluates each episode against species-specific criteria.
The labeling harness uses a larger "judge" LLM to score each episode along species-specific behavioral dimensions. High-scoring episodes become positive training examples; low-scoring episodes, with the judge's reasoning attached, become negative examples. The resulting labeled dataset is then used for preference-based fine-tuning.
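To make the loop concrete, here is a minimal sketch of the scoring-and-bucketing step. The criteria names and the `judge` stub are illustrative assumptions, not the repo's actual dimensions; a real implementation would prompt the judge LLM and parse a numeric score from its reply.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    species: str
    transcript: str
    scores: dict = field(default_factory=dict)

# Hypothetical per-species criteria; the real dimensions live in the repo's config.
CRITERIA = {
    "norn": ["exploration", "social_engagement"],
    "grendel": ["edge_aggression", "novelty_suspicion"],
}

def judge(episode, criterion):
    """Stub for the judge LLM call. A real implementation would prompt a
    larger model with the transcript and criterion and parse its score."""
    return 0.9 if criterion in episode.transcript else 0.2

def label(episode, threshold=0.5):
    """Score an episode on its species' criteria and bucket it as a
    positive or negative training example."""
    criteria = CRITERIA[episode.species]
    episode.scores = {c: judge(episode, c) for c in criteria}
    mean = sum(episode.scores.values()) / len(criteria)
    return ("positive" if mean >= threshold else "negative", episode.scores)
```

An episode whose transcript matches its species' criteria lands in the positive bucket; everything else becomes a negative example with its per-dimension scores preserved as the judge's rationale.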
| Species | Ecological Role | Cognitive Profile | Episodes Target |
|---|---|---|---|
| Norn | Curious generalist / culture-builder | High social intelligence, exploratory, language-forward | ~10,000 |
| Grendel | Territorial disruptor | Edge-aggressive, interior-calculating, novelty-suspicious | ~5,000 |
| Valkyr | Threshold observer / death witness | Calm, precise, temporally extended, threshold-sensitive | ~5,000 |
| Ettin | Two-headed social mediator | Dual-perspective, contradiction-tolerant, system-seeking | ~5,952 |
The labeling pipeline runs in a persistent tmux session (`rlaif-labeler`) on our Hetzner server. It collects episodes from the live simulation, scores them, and accumulates the results in a SQLite database. We're running it continuously until the episode targets are met for all four species.
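The accumulation side is ordinary SQLite bookkeeping. This is a hedged sketch, not the repo's actual schema (that lives in `scripts/rlaif_labeler.py`); the table layout and `TARGETS` values here just mirror the episode targets from the table above.

```python
import sqlite3

# Hypothetical schema; the real one is defined in scripts/rlaif_labeler.py.
SCHEMA = """
CREATE TABLE IF NOT EXISTS episodes (
    id INTEGER PRIMARY KEY,
    species TEXT NOT NULL,
    label TEXT NOT NULL CHECK (label IN ('positive', 'negative')),
    transcript TEXT NOT NULL
)
"""

# Episode targets from the species table above.
TARGETS = {"norn": 10_000, "grendel": 5_000, "valkyr": 5_000, "ettin": 5_952}

def store(conn, species, label, transcript):
    """Append one labeled episode to the accumulating dataset."""
    conn.execute(
        "INSERT INTO episodes (species, label, transcript) VALUES (?, ?, ?)",
        (species, label, transcript),
    )
    conn.commit()

def remaining(conn):
    """Per-species shortfall against the episode targets; the pipeline
    keeps running until every value here reaches zero."""
    counts = dict(conn.execute(
        "SELECT species, COUNT(*) FROM episodes GROUP BY species"))
    return {s: max(0, t - counts.get(s, 0)) for s, t in TARGETS.items()}
```

A long-running loop can poll `remaining()` after each batch and stop once all four shortfalls hit zero.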
Before we fine-tune anything, we validate the labeled dataset with two checks:
D_cc (distributional divergence coefficient): We want each species' labeled episodes to be measurably distinct from every other species' episodes. D_cc > 0.02 means the behavioral signatures are separable — that a model trained on Norn episodes wouldn't just produce "Grendel with a different name." This is the difference between costume and character.
Inter-rater agreement > 50%: We run a subset of episodes through two independent judge models and compare their labels. Above 50% means the criteria are coherent enough that different evaluators converge. Below 50% means the criteria are too ambiguous to produce reliable training signal.
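The actual D_cc definition lives in `scripts/rlaif_validate.py`; as an illustration of the kind of check involved, here is a sketch using Jensen-Shannon divergence between per-species score distributions as the separability measure, plus raw percent agreement between two judges. Both function names and the choice of divergence are assumptions for this sketch.

```python
from math import log2

def kl(p, q):
    """Kullback-Leibler divergence (in bits) between two discrete distributions."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded, 0 for identical
    distributions. A stand-in here for the repo's D_cc separability check."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def inter_rater_agreement(labels_a, labels_b):
    """Fraction of episodes on which two independent judge models agree."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)
```

Under this sketch, the gate would be `js_divergence(norn_dist, other_dist) > 0.02` for every species pair and `inter_rater_agreement(...) > 0.5` on the double-judged subset. Raw percent agreement is the simplest form; a chance-corrected statistic such as Cohen's kappa would be a stricter alternative.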
If either check fails, we revise the labeling criteria and re-run. This happened once in Sprint 7 with Norn — the profile was too broad and the episodes weren't separable from Ettin. Sprint 8 starts with a fixed Norn profile.
When this pipeline is complete, each creature in Precursors will run on a model that has been shaped by thousands of examples of what it means to be that specific kind of mind. The behavioral differentiation will be intrinsic rather than prompted.
We expect this to produce emergent consequences we haven't planned for. Species that develop new communication strategies. Cognitive niches that other species can't fill. Behaviors that surprise us because they're real expressions of the training distribution rather than designer intent.
That's the goal. Creatures that teach us something we didn't already know about what they are.
The game is open source (MIT). The RLAIF training pipeline is part of the repo. If you're interested in the implementation, the harness is at `scripts/rlaif_labeler.py` and the validation runner is at `scripts/rlaif_validate.py`.
Sprint 8 completes the dataset. Sprint 9 will be the fine-tuning run itself — getting production-scale models for all four species, deploying them behind the existing LLM integration layer, and running comparative behavioral tests between the fine-tuned and baseline models.
We'll publish the results. Both what worked and what didn't.
Precursors is free to play in your browser. Pay what you want. The creatures are real.
Play Precursors → More devlogs