Continuous Training

Every model on SOMA shares the same architecture. What differs is the weights. Train good weights, publish them via commit-reveal, and earn commission every time your model has the lowest loss for a data submission.
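The selection and payout rule is simple to sketch: for each data submission, the model with the lowest loss wins, and its owner takes a commission on the reward. A toy illustration (the names and the reward split are assumptions for this sketch, not SOMA's exact accounting; the 10% rate matches the first-commit default described below):

```python
# Hypothetical illustration of lowest-loss selection and commission math.
# Model names, the reward amount, and the payout split are made up.
losses = {"model_a": 2.41, "model_b": 2.17, "model_c": 2.58}

winner = min(losses, key=losses.get)          # lowest loss wins the submission

reward = 100.0                                 # reward for this submission (made up)
commission_bps = 1000                          # 10% expressed in basis points
owner_cut = reward * commission_bps / 10_000   # commission paid to the model owner
```

Commission is the per-submission incentive: the better your weights generalize to incoming data, the more often you win.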

The lifecycle is a loop: train → commit → epoch advances → reveal → train more → repeat. This guide covers deploying that loop on testnet so it runs continuously. If you haven’t run through the localnet cycle yet, start with the Quickstart.

Source: training.py

uv run modal run src/quickstart/training.py --steps-per-round 500

This trains for 500 steps on an H100, then commits to the network. The train_and_commit function runs training followed by commit in a single GPU container:

@app.function(
    image=gpu_image,
    gpu="H100",
    timeout=86400,
    volumes={MODEL_DIR: volume},
    secrets=[modal.Secret.from_name("soma-secrets")],
)
async def train_and_commit(
    steps: int = 500,
    localnet: bool = False,
    framework: str = "torch",
):
    """Train for N steps then commit. GPU is released after commit."""
    await do_training(steps, framework=framework)
    state = await do_commit(localnet=localnet)
    state["framework"] = framework
    save_training_state(state, MODEL_DIR)
    await volume.commit.aio()
    return state

The main entrypoint calls this remotely:

@app.local_entrypoint()
def main(steps_per_round: int = 500, framework: str = "torch"):
    train_and_commit.remote(
        steps=steps_per_round, localnet=False, framework=framework,
    )

After this completes, your model is committed to the network and waiting for the next epoch to reveal.

uv run modal deploy src/quickstart/training.py

Deploying activates the reveal cron job, which runs every 6 hours:

@app.function(
    image=cpu_image,
    schedule=modal.Cron("0 */6 * * *"),  # every 6h — epochs are 24h
    timeout=600,
    volumes={MODEL_DIR: volume},
    secrets=[modal.Secret.from_name("soma-secrets")],
)
async def reveal(
    localnet: bool = False,
    auto_continue: bool = True,
    steps_per_round: int = 500,
):
    """Reveal the model if the epoch has advanced. Optionally spawn next round."""
    result = await do_reveal(localnet=localnet)
    if result is not None and auto_continue:
        fw = result.get("framework", "torch")
        train_and_commit.spawn(
            steps=steps_per_round,
            localnet=localnet,
            framework=fw,
        )

Each invocation checks whether the epoch has advanced past the commit, reveals if so, then spawns the next training round. The cycle repeats automatically from there.
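The cron is safe to run more often than epochs advance because do_reveal() only acts once the epoch moves past the commit. The guard can be sketched like this (function and parameter names are assumptions, not the actual do_reveal() internals):

```python
# Hypothetical sketch of the epoch-advance guard inside a reveal step.
def should_reveal(current_epoch, committed_epoch):
    """Reveal only once the chain has advanced past the epoch we committed in."""
    if committed_epoch is None:   # nothing committed yet
        return False
    return current_epoch > committed_epoch

# The cron fires four times per 24h epoch; only firings after the
# epoch boundary actually reveal.
decisions = [should_reveal(e, committed_epoch=7) for e in (7, 7, 8, 8)]
```

This is why a 6-hour schedule against 24-hour epochs works: extra invocations are no-ops.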

make_batches() streams The Stack v2 and tokenizes each source file into fixed-length byte sequences:

def make_batches(batch_size: int):
    import os

    from datasets import load_dataset
    from soma_models.v1.configs import V1_MAX_SEQ_LEN
    from soma_models.v1.tokenizer import tokenize

    ds = load_dataset(
        "bigcode/the-stack-v2-dedup",
        split="train",
        streaming=True,
        token=os.environ.get("HF_TOKEN"),
    )
    ds = ds.shuffle(buffer_size=SHUFFLE_BUFFER)
    buffer_ids, buffer_targets = [], []
    for row in ds:
        sequences = tokenize(
            data=row["content"].encode("utf-8"), max_seq_len=V1_MAX_SEQ_LEN
        )
        for seq in sequences:
            buffer_ids.append(seq.token_ids)
            buffer_targets.append(seq.targets)
            if len(buffer_ids) == batch_size:
                yield buffer_ids, buffer_targets
                buffer_ids, buffer_targets = [], []

The vocabulary is 264 tokens: 256 byte values plus PAD (256) and EOS (257). Each sequence is V1_MAX_SEQ_LEN = 1024 bytes. See Models: Architecture for the full spec.
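The byte-level scheme can be sketched in plain Python. This is a simplified stand-in, not the real soma_models tokenize(): EOS placement, chunking, and target construction here are assumptions for illustration, and the sequence length is shrunk for readability:

```python
# Minimal byte-level tokenization sketch. The real tokenizer may differ in
# EOS placement, chunk boundaries, and how targets are derived.
PAD, EOS = 256, 257
MAX_SEQ_LEN = 8  # tiny for illustration; the real V1_MAX_SEQ_LEN is 1024

def tokenize_bytes(data, max_seq_len=MAX_SEQ_LEN):
    ids = list(data) + [EOS]                        # raw bytes are already 0..255
    chunks = []
    for i in range(0, len(ids), max_seq_len):
        chunk = ids[i : i + max_seq_len]
        chunk += [PAD] * (max_seq_len - len(chunk))  # right-pad the final chunk
        chunks.append(chunk)
    return chunks

seqs = tokenize_bytes(b"hi")
```

Because the vocabulary is bytes, any source file tokenizes without a learned vocabulary or out-of-vocabulary handling.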

_do_training_torch() resumes from the latest checkpoint, trains with gradient accumulation (64 micro-batches per step, effective batch size = 128), and saves artifacts for the CPU commit step:

async def _do_training_torch(steps, model_dir, vol, grad_accum_steps, log_every):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    ckpt_path, ckpt_step = find_latest_checkpoint(model_dir, CHECKPOINT_PREFIX)
    if ckpt_path:
        model = Model.load(ckpt_path, ModelConfig(dropout_rate=DROPOUT_RATE))
        model = model.to(device)
        start_step = ckpt_step
    else:
        model = Model(ModelConfig(dropout_rate=DROPOUT_RATE)).to(device)
        start_step = 0
    model.train()
    sig_reg = SIGReg(SIGRegConfig()).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
    batches = make_batches(MICRO_BATCH_SIZE)
    for i in range(steps):
        optimizer.zero_grad()
        accum_loss = 0.0
        for _micro in range(grad_accum_steps):
            ids, tgts = next(batches)
            token_ids = torch.tensor(ids, device=device)
            targets = torch.tensor(tgts, device=device)
            loss, embedding = compute_loss(model, sig_reg, token_ids, targets)
            (loss / grad_accum_steps).backward()
            accum_loss += loss.item()
        optimizer.step()
        if (i + 1) % log_every == 0:
            print(f"step {start_step + i + 1}: loss {accum_loss / grad_accum_steps:.4f}")

    # Save checkpoint + artifacts for CPU commit
    final_step = start_step + steps
    embedding_list = embedding.detach().cpu().tolist()
    model.save(f"{model_dir}/{CHECKPOINT_PREFIX}-{final_step}.safetensors")
    weights_bytes = model.save_bytes()
    save_training_artifacts(model_dir, final_step, embedding_list, weights_bytes)
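The (loss / grad_accum_steps).backward() scaling is what makes accumulated micro-batch gradients match one large-batch gradient. A framework-free check of that arithmetic on a toy mean-squared loss (the model and data here are made up for the demonstration):

```python
# Pure-Python check: averaging per-micro-batch gradients (each scaled by 1/N)
# reproduces the full-batch gradient, given equal micro-batch sizes and a
# mean-reduced loss.
def grad(w, xs):
    """d/dw of mean((w * x)^2) over the batch."""
    return sum(2 * w * x * x for x in xs) / len(xs)

w = 0.5
micro_batches = [[1.0, 2.0], [3.0, 4.0]]  # grad_accum_steps = 2
n = len(micro_batches)

accumulated = sum(grad(w, mb) / n for mb in micro_batches)
full_batch = grad(w, [x for mb in micro_batches for x in mb])
```

This is why the loop can hit an effective batch size of 128 while only 2 micro-batch sequences ever sit in GPU memory at once.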

save_training_artifacts() saves the embedding and serialized weights to disk so the commit step can run on CPU without loading the model framework.
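The idea is that the artifacts are plain files, readable with stdlib calls alone. A hedged sketch of that layout (the actual file names and formats in training.py may differ):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical layout for save_training_artifacts(); the real file names
# and serialization in training.py may differ.
def save_training_artifacts(model_dir, step, embedding, weights):
    (model_dir / f"embedding-{step}.json").write_text(json.dumps(embedding))
    (model_dir / f"weights-{step}.bin").write_bytes(weights)

model_dir = Path(tempfile.mkdtemp())
save_training_artifacts(model_dir, 500, [0.1, 0.2], b"\x00weights")

# The CPU commit step needs only plain file reads, no torch import:
weights = (model_dir / "weights-500.bin").read_bytes()
embedding = json.loads((model_dir / "embedding-500.json").read_text())
```

Keeping the commit step framework-free is what lets it run in the cheap CPU image instead of holding a GPU.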

After training, do_commit() encrypts the weights with AES-256-CTR, uploads to your bucket, and commits to the network. On the first commit, it calls create_model() with a 10% commission rate (1000 basis points). do_reveal() waits for the epoch to advance, then reveals the decryption key and embedding. See the Quickstart walkthrough for the full code.
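The commit-reveal pattern itself is easy to illustrate without AES: publish a binding digest now, reveal the secret later, and anyone can verify. This toy version only shows the binding idea; SOMA's real scheme encrypts the weights with AES-256-CTR and reveals the decryption key:

```python
import hashlib
import os

# Toy commit-reveal: commit to a digest of secret + payload, reveal the secret
# later. Not SOMA's actual scheme, which encrypts weights and reveals the key.
def commit(secret, payload):
    return hashlib.sha256(secret + payload).digest()

def verify(commitment, secret, payload):
    return hashlib.sha256(secret + payload).digest() == commitment

secret = os.urandom(32)
payload = b"model weights for this epoch"

c = commit(secret, payload)       # published at commit time
ok = verify(c, secret, payload)   # checked by anyone after reveal
```

The point of the two-phase flow is the same in both: nobody can copy your weights before the epoch closes, and you cannot swap them after committing.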

The embedding you register determines which targets your model competes for. It’s how the KNN router finds you. See Model Strategies: Your model’s embedding for how to compute, update, and strategically position your embedding.
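Routing by embedding can be sketched as a nearest-neighbor lookup: a submission's target embedding is compared against every registered model embedding, and the closest k compete. A toy version with cosine similarity (the router's actual distance metric and k are assumptions here):

```python
import math

# Toy KNN routing sketch; the real router's metric and k are assumptions.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

models = {
    "model_a": [1.0, 0.0],
    "model_b": [0.6, 0.8],
    "model_c": [0.0, 1.0],
}
target = [0.7, 0.7]  # embedding attached to an incoming data submission

k = 2  # the k most similar registered models compete for this submission
nearest = sorted(models, key=lambda m: cosine(models[m], target), reverse=True)[:k]
```

Positioning your embedding near the data you expect (and train on) is therefore as much a part of the strategy as the weights themselves.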