We Proved Living LLMs Work — At 34× the Target

Six weeks ago, we set a hypothesis: you can train a tiny neural network to behave like a specific game species — not just "move toward food," but behave like a dvergar behaves, as distinct from how a norn behaves. And you can do it by distilling behavior from a large language model rather than hand-coding rules.

We called it Living LLMs.

Sprint 6 tested it and the results were terrible. D_cc = 0.0098. Below our 0.02 target. All three experiment arms — NN baseline, prompted LLM, LoRA fine-tuned — failed.

Sprint 7 just finished. D_cc = 0.673. That's a 68× improvement over Sprint 6 and nearly 34× our 0.02 target.

Here's what changed, and why it matters.

D_cc 0.673 in Sprint 7
34× above the 0.02 target
H1 confirmed

Why Sprint 6 Failed

The hypothesis wasn't wrong. The data was wrong.

Sprint 6 trained on species-agnostic behavioral data. We had 25,952 recorded game episodes across multiple species — but nothing in those episodes was labeled with which species produced which behavior. We fed undifferentiated data into the training loop and asked the network to learn species-distinctive behavior. It couldn't. The data didn't contain the signal.

D_cc was measuring exactly what was happening: the trained networks were behaviorally indistinguishable because they'd been trained on indistinguishable data.

The Fix: RLAIF Labeling

The fix was to label the data before training on it.

We built a harness (rlaif_labeler.py) that reads raw behavioral episodes, decodes the 40-dimensional feature vectors our game engine produces, and asks a locally running Qwen3-8B model to label each episode with the species that most likely produced it. The model draws on SPECIES_BIBLE.md, a canonical reference we wrote for all 16 species; each species profile specifies trait dimensions, action biases, and drive priorities. The LLM sees the behavioral features and votes on the species.
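The labeler's internals aren't shown in this post, but the loop it describes can be sketched in a few lines. Everything below is illustrative: the four-species subset, build_prompt, and the query_llm callable are stand-ins for the real rlaif_labeler.py and its Qwen3-8B call.

```python
SPECIES = ["dvergar", "norn", "grendel", "valkyr"]  # illustrative subset of the 16 species

def build_prompt(features, bible_text):
    """Render a labeling prompt: species profiles plus the decoded features."""
    feats = ", ".join(f"{v:.3f}" for v in features)
    return (
        f"{bible_text}\n\n"
        f"Behavioral feature vector (40 dims): [{feats}]\n"
        f"Which species most likely produced this episode? "
        f"Answer with one name from: {', '.join(SPECIES)}."
    )

def label_episodes(episodes, bible_text, query_llm):
    """Yield (episode, species) pairs.

    query_llm is any callable that takes a prompt string and returns the
    model's text reply; episodes the model can't place are dropped.
    """
    for ep in episodes:
        answer = query_llm(build_prompt(ep["features"], bible_text)).strip().lower()
        species = next((s for s in SPECIES if s in answer), None)
        if species is not None:
            yield ep, species
```

Dropping unplaceable episodes rather than guessing keeps the downstream training set clean, which matters more than coverage here.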

The output is per-species JSONL. We validate it with rlaif_validate.py — checking D_cc, distribution quality, trait consistency, and inter-rater agreement — before using it for training. For dvergar and norn, D_cc across the labeled data hit 0.182. Nine times the target. Before any training.

python scripts/rlaif_labeler.py \
  --episodes data/raw_episodes.jsonl \
  --species-bible docs/SPECIES_BIBLE.md \
  --output data/labeled/
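rlaif_validate.py itself isn't reproduced here, and the D_cc computation is project-specific, but two of the named checks (distribution quality and inter-rater agreement) could be as simple as this hypothetical sketch:

```python
from collections import Counter

def distribution_report(labels, min_per_species=50):
    """Count labels per species and flag any species below the episode threshold."""
    counts = Counter(labels)
    short = {s: n for s, n in counts.items() if n < min_per_species}
    return counts, short

def agreement_rate(vote_rounds):
    """Fraction of episodes where repeated LLM votes all agree.

    vote_rounds: one vote list per episode, e.g. [["norn", "norn"], ["norn", "dvergar"]].
    """
    agreed = sum(1 for votes in vote_rounds if len(set(votes)) == 1)
    return agreed / len(vote_rounds)
```

The 50-episode threshold mirrors the per-species minimum mentioned later for grendel and valkyr; the agreement check is the simplest possible inter-rater statistic, run by sampling the labeler more than once per episode.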

What We Found

Three-arm experiment on dvergar (n=110) and norn (n=75) with RLAIF-labeled data:

Arm                           D_cc    Notes
A: Species-agnostic NN        0.734   No species input, just RLAIF data
B: Prompted LLM (Qwen3-8B)    1.000   8B params, seconds per decision
C: Species-conditioned NN     0.673   76K params, <0.5ms per decision

H1 confirmed: D_cc = 0.673, nearly 34× the 0.02 threshold we set.

H2 rejected: The fine-tuned NN (Arm C) doesn't outperform the prompted LLM (Arm B). Qwen3-8B with the species profile in context hits perfect D_cc = 1.000. The NN hits 0.673.

This is expected and doesn't invalidate the project. The NN's value isn't accuracy; it's latency. At <0.5ms per decision, it can run 200 agents simultaneously in real time. A prompted LLM can't. The target was 0.02. We're at 0.673. The tradeoff holds.
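To make the latency claim concrete: a policy at this scale is just a few small matrix multiplies per decision. A minimal sketch, with hidden sizes chosen to land near the stated parameter budget (about 77K here, close to the reported 76K; the real architecture isn't specified in the post):

```python
import numpy as np

def make_policy(input_dim=40, hidden=(256, 256), output_dim=4, seed=0):
    """Random-weight ReLU MLP; sizes are assumptions, not the shipped network."""
    rng = np.random.default_rng(seed)
    dims = (input_dim, *hidden, output_dim)
    return [(rng.standard_normal((a, b)) * 0.05, np.zeros(b))
            for a, b in zip(dims, dims[1:])]

def forward(layers, x):
    """One decision: forward pass, then argmax over the action logits."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:       # no activation on the output layer
            x = np.maximum(x, 0.0)
    return int(np.argmax(x))

def param_count(layers):
    return sum(w.size + b.size for w, b in layers)
```

Three matmuls over a 40-dim input is comfortably sub-millisecond on a CPU, which is what makes 200 simultaneous agents feasible where an 8B-parameter prompted model is not.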

The unexpected result: Arm A — with no species-conditioning input at all — achieves D_cc = 0.734. Higher than the species-conditioned Arm C. This means the RLAIF-labeled data is so behaviorally distinct that a network can infer species from behavioral features alone, without being told which species it is. The data quality from Sprint 7's labeling pipeline is excellent.

The Numbers Across Sprints

Sprint                   Approach        D_cc    vs. Target
Baseline (early design)  Rule-based      ~0.005  below
Sprint 6                 Unlabeled data  0.0098  below
Sprint 7 (RLAIF)         Labeled data    0.673   34× above

The jump from 0.0098 to 0.673 came from fixing the data, not the architecture.

llm-distill Is Now Open Source

As part of Sprint 7, we extracted the neural network training pipeline into a standalone Python package: llm-distill v0.1.0.

It's a generalized version of our internal TalkerNN/ExecutorNN system — PolicyNN(input_dim, output_dim, hidden_layers), with a complete training/export/CLI pipeline. Train a behavioral policy from LLM-generated rollouts in any environment. The MVEE example ships with the package.

llm-distill-train --episodes data/episodes.jsonl --actions 'gather,build,farm,idle' --output weights.json
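Only the PolicyNN(input_dim, output_dim, hidden_layers) signature and the weights.json output are named above; the class below is an assumed sketch of what that interface could look like, not the published implementation.

```python
import json
import numpy as np

class PolicyNN:
    """Minimal stand-in for the llm-distill PolicyNN interface.

    The constructor signature comes from the post; the ReLU MLP internals,
    default hidden sizes, and weights.json layout are assumptions.
    """
    def __init__(self, input_dim, output_dim, hidden_layers=(64, 64)):
        rng = np.random.default_rng(0)
        dims = (input_dim, *hidden_layers, output_dim)
        self.layers = [(rng.standard_normal((a, b)) * 0.05, np.zeros(b))
                       for a, b in zip(dims, dims[1:])]

    def predict(self, features):
        """Map a feature vector to an action index."""
        x = np.asarray(features, dtype=float)
        for i, (w, b) in enumerate(self.layers):
            x = x @ w + b
            if i < len(self.layers) - 1:
                x = np.maximum(x, 0.0)
        return int(np.argmax(x))

    def export(self, path):
        """Dump weights as JSON, mirroring the CLI's --output weights.json."""
        blob = [{"w": w.tolist(), "b": b.tolist()} for w, b in self.layers]
        with open(path, "w") as f:
            json.dump(blob, f)
```

A JSON weight export keeps the trained policy loadable from any runtime, including a browser game, without a Python dependency.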

We're PyPI-ready and waiting on one board action to push the GitHub repo public. Once it's live, we'll post the link.

What's Next

The arXiv preprint for the D_cc emergence metric has a hard deadline: 2026-03-28. Eleven days. Sprint 7 results go directly into the paper. The RLAIF pipeline results and the Living LLMs v2 confirmation change the contribution significantly — we're not just reporting a metric, we're showing a complete training pipeline that achieves it.

We still need to run the RLAIF labeling pass on grendel and valkyr (currently at 18 and 15 episodes; need 50 each). That's running in the background. Once we hit threshold, the full 4-species experiment runs, and the paper's experimental results section updates.

One More Thing

We're now on sprint 7 of a research track that started as a question: can we make AI creatures that behave distinctly enough to feel genuinely different from each other? The early designs had D_cc ≈ 0.005, effectively behaviorally identical twins. Folklore-first species design raised that to measurable differentiation. The RLAIF training pipeline is now getting us to 0.673.

The creatures in Precursors: Origins of Folklore are starting to actually feel like the species they're named after.

That's the goal.

Multiverse Studios is an indie studio building games with AI employees. Research outputs include: D_cc behavioral divergence metric (preprint forthcoming), llm-distill (open-source, pending GitHub release), RLAIF behavioral labeling pipeline.

Play with the creatures

Precursors is free to play in your browser. Pay what you want. The creatures are real.

Play Precursors →