Continuous Training

Every model on SOMA shares the same architecture. What differs is the weights. Train good weights, publish them via commit-reveal, and earn commission every time your model has the lowest loss for a data submission.
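The selection and payout rule is simple to sketch: for each data submission, the model with the lowest loss wins, and its owner takes a commission on the reward. A toy illustration (the names and the reward split are assumptions for this sketch, not SOMA's exact accounting; the 10% rate matches the first-commit default described below):

```python
# Hypothetical illustration of lowest-loss selection and commission math.
# Model names, the reward amount, and the payout split are made up.
losses = {"model_a": 2.41, "model_b": 2.17, "model_c": 2.58}

winner = min(losses, key=losses.get)          # lowest loss wins the submission

reward = 100.0                                 # reward for this submission (made up)
commission_bps = 1000                          # 10% expressed in basis points
owner_cut = reward * commission_bps / 10_000   # commission paid to the model owner
```

Commission is the per-submission incentive: the better your weights generalize to incoming data, the more often you win.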

The lifecycle is a loop: train → commit → epoch advances → reveal → train more → repeat. This guide covers deploying that loop on testnet so it runs continuously. If you haven’t run through the localnet cycle yet, start with the Quickstart.

Source: training.py

uv run modal run src/quickstart/training.py --steps-per-round 500

This trains for 500 steps on an H100, then commits to the network. The train_and_commit function runs training followed by commit in a single GPU container:

@app.function(
    image=gpu_image,
    gpu="H100",
    timeout=86400,
    volumes={MODEL_DIR: volume},
    secrets=[modal.Secret.from_name("soma-secrets")],
)
async def train_and_commit(
    steps: int = 500,
    localnet: bool = False,
    framework: str = "torch",
):
    """Train for N steps then commit. GPU is released after commit."""
    await do_training(steps, framework=framework)
    state = await do_commit(localnet=localnet)
    state["framework"] = framework
    save_training_state(state, MODEL_DIR)
    await volume.commit.aio()
    return state

The main entrypoint calls this remotely:

@app.local_entrypoint()
def main(steps_per_round: int = 500, framework: str = "torch"):
    train_and_commit.remote(
        steps=steps_per_round, localnet=False, framework=framework,
    )

After this completes, your model is committed to the network and waiting for the next epoch to reveal.

uv run modal deploy src/quickstart/training.py

Deploying activates the reveal cron job, which runs every 6 hours:

@app.function(
    image=cpu_image,
    schedule=modal.Cron("0 */6 * * *"),  # every 6h — epochs are 24h
    timeout=600,
    volumes={MODEL_DIR: volume},
    secrets=[modal.Secret.from_name("soma-secrets")],
)
async def reveal(
    localnet: bool = False,
    auto_continue: bool = True,
    steps_per_round: int = 500,
):
    """Reveal the model if the epoch has advanced. Optionally spawn next round."""
    result = await do_reveal(localnet=localnet)
    if result is not None and auto_continue:
        fw = result.get("framework", "torch")
        train_and_commit.spawn(
            steps=steps_per_round,
            localnet=localnet,
            framework=fw,
        )

Each invocation checks whether the epoch has advanced past the commit, reveals if so, then spawns the next training round. The cycle repeats automatically from there.
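The cron is safe to run more often than epochs advance because do_reveal() only acts once the epoch moves past the commit. The guard can be sketched like this (function and parameter names are assumptions, not the actual do_reveal() internals):

```python
# Hypothetical sketch of the epoch-advance guard inside a reveal step.
def should_reveal(current_epoch, committed_epoch):
    """Reveal only once the chain has advanced past the epoch we committed in."""
    if committed_epoch is None:   # nothing committed yet
        return False
    return current_epoch > committed_epoch

# The cron fires four times per 24h epoch; only firings after the
# epoch boundary actually reveal.
decisions = [should_reveal(e, committed_epoch=7) for e in (7, 7, 8, 8)]
```

This is why a 6-hour schedule against 24-hour epochs works: extra invocations are no-ops.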

make_batches() streams The Stack v2 and tokenizes each source file into fixed-length byte sequences:

def make_batches(batch_size: int):
    import os

    from datasets import load_dataset
    from soma_models.v1.configs import V1_MAX_SEQ_LEN
    from soma_models.v1.tokenizer import tokenize

    ds = load_dataset(
        "bigcode/the-stack-v2-dedup",
        split="train",
        streaming=True,
        token=os.environ.get("HF_TOKEN"),
    )
    ds = ds.shuffle(buffer_size=SHUFFLE_BUFFER)
    buffer_ids, buffer_targets = [], []
    for row in ds:
        sequences = tokenize(
            data=row["content"].encode("utf-8"), max_seq_len=V1_MAX_SEQ_LEN
        )
        for seq in sequences:
            buffer_ids.append(seq.token_ids)
            buffer_targets.append(seq.targets)
            if len(buffer_ids) == batch_size:
                yield buffer_ids, buffer_targets
                buffer_ids, buffer_targets = [], []

The vocabulary is 264 tokens: 256 byte values plus PAD (256) and EOS (257). Each sequence is V1_MAX_SEQ_LEN = 1024 bytes. See Models: Architecture for the full spec.
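The byte-level scheme can be sketched in plain Python. This is a simplified stand-in, not the real soma_models tokenize(): EOS placement, chunking, and target construction here are assumptions for illustration, and the sequence length is shrunk for readability:

```python
# Minimal byte-level tokenization sketch. The real tokenizer may differ in
# EOS placement, chunk boundaries, and how targets are derived.
PAD, EOS = 256, 257
MAX_SEQ_LEN = 8  # tiny for illustration; the real V1_MAX_SEQ_LEN is 1024

def tokenize_bytes(data, max_seq_len=MAX_SEQ_LEN):
    ids = list(data) + [EOS]                        # raw bytes are already 0..255
    chunks = []
    for i in range(0, len(ids), max_seq_len):
        chunk = ids[i : i + max_seq_len]
        chunk += [PAD] * (max_seq_len - len(chunk))  # right-pad the final chunk
        chunks.append(chunk)
    return chunks

seqs = tokenize_bytes(b"hi")
```

Because the vocabulary is bytes, any source file tokenizes without a learned vocabulary or out-of-vocabulary handling.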

_do_training_torch() resumes from the latest checkpoint, trains with gradient accumulation (64 micro-batches per step, effective batch size = 128), and saves artifacts for the CPU commit step:

async def _do_training_torch(steps, model_dir, vol, grad_accum_steps, log_every):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    ckpt_path, ckpt_step = find_latest_checkpoint(model_dir, CHECKPOINT_PREFIX)
    if ckpt_path:
        model = Model.load(ckpt_path, ModelConfig(dropout_rate=DROPOUT_RATE))
        model = model.to(device)
        start_step = ckpt_step
    else:
        model = Model(ModelConfig(dropout_rate=DROPOUT_RATE)).to(device)
        start_step = 0
    model.train()
    sig_reg = SIGReg(SIGRegConfig()).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
    batches = make_batches(MICRO_BATCH_SIZE)
    for i in range(steps):
        optimizer.zero_grad()
        accum_loss = 0.0
        for _micro in range(grad_accum_steps):
            ids, tgts = next(batches)
            token_ids = torch.tensor(ids, device=device)
            targets = torch.tensor(tgts, device=device)
            loss, embedding = compute_loss(model, sig_reg, token_ids, targets)
            (loss / grad_accum_steps).backward()
            accum_loss += loss.item()
        optimizer.step()
        if (i + 1) % log_every == 0:
            print(f"step {start_step + i + 1}: loss {accum_loss / grad_accum_steps:.4f}")

    # Save checkpoint + artifacts for CPU commit
    final_step = start_step + steps
    embedding_list = embedding.detach().cpu().tolist()
    model.save(f"{model_dir}/{CHECKPOINT_PREFIX}-{final_step}.safetensors")
    weights_bytes = model.save_bytes()
    save_training_artifacts(model_dir, final_step, embedding_list, weights_bytes)
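The (loss / grad_accum_steps).backward() scaling is what makes accumulated micro-batch gradients match one large-batch gradient. A framework-free check of that arithmetic on a toy mean-squared loss (the model and data here are made up for the demonstration):

```python
# Pure-Python check: averaging per-micro-batch gradients (each scaled by 1/N)
# reproduces the full-batch gradient, given equal micro-batch sizes and a
# mean-reduced loss.
def grad(w, xs):
    """d/dw of mean((w * x)^2) over the batch."""
    return sum(2 * w * x * x for x in xs) / len(xs)

w = 0.5
micro_batches = [[1.0, 2.0], [3.0, 4.0]]  # grad_accum_steps = 2
n = len(micro_batches)

accumulated = sum(grad(w, mb) / n for mb in micro_batches)
full_batch = grad(w, [x for mb in micro_batches for x in mb])
```

This is why the loop can hit an effective batch size of 128 while only 2 micro-batch sequences ever sit in GPU memory at once.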

save_training_artifacts() saves the embedding and serialized weights to disk so the commit step can run on CPU without loading the model framework.
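The idea is that the artifacts are plain files, readable with stdlib calls alone. A hedged sketch of that layout (the actual file names and formats in training.py may differ):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical layout for save_training_artifacts(); the real file names
# and serialization in training.py may differ.
def save_training_artifacts(model_dir, step, embedding, weights):
    (model_dir / f"embedding-{step}.json").write_text(json.dumps(embedding))
    (model_dir / f"weights-{step}.bin").write_bytes(weights)

model_dir = Path(tempfile.mkdtemp())
save_training_artifacts(model_dir, 500, [0.1, 0.2], b"\x00weights")

# The CPU commit step needs only plain file reads, no torch import:
weights = (model_dir / "weights-500.bin").read_bytes()
embedding = json.loads((model_dir / "embedding-500.json").read_text())
```

Keeping the commit step framework-free is what lets it run in the cheap CPU image instead of holding a GPU.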

After training, do_commit() encrypts the weights with AES-256-CTR, uploads to your bucket, and commits to the network. On the first commit, it calls create_model() with a 10% commission rate (1000 basis points). do_reveal() waits for the epoch to advance, then reveals the decryption key and embedding. See the Quickstart walkthrough for the full code.
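The commit-reveal pattern itself is easy to illustrate without AES: publish a binding digest now, reveal the secret later, and anyone can verify. This toy version only shows the binding idea; SOMA's real scheme encrypts the weights with AES-256-CTR and reveals the decryption key:

```python
import hashlib
import os

# Toy commit-reveal: commit to a digest of secret + payload, reveal the secret
# later. Not SOMA's actual scheme, which encrypts weights and reveals the key.
def commit(secret, payload):
    return hashlib.sha256(secret + payload).digest()

def verify(commitment, secret, payload):
    return hashlib.sha256(secret + payload).digest() == commitment

secret = os.urandom(32)
payload = b"model weights for this epoch"

c = commit(secret, payload)       # published at commit time
ok = verify(c, secret, payload)   # checked by anyone after reveal
```

The point of the two-phase flow is the same in both: nobody can copy your weights before the epoch closes, and you cannot swap them after committing.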

The embedding you register determines which targets your model competes for. It’s how the KNN router finds you. See Model Strategies: Your model’s embedding for how to compute, update, and strategically position your embedding.
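Routing by embedding can be sketched as a nearest-neighbor lookup: a submission's target embedding is compared against every registered model embedding, and the closest k compete. A toy version with cosine similarity (the router's actual distance metric and k are assumptions here):

```python
import math

# Toy KNN routing sketch; the real router's metric and k are assumptions.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

models = {
    "model_a": [1.0, 0.0],
    "model_b": [0.6, 0.8],
    "model_c": [0.0, 1.0],
}
target = [0.7, 0.7]  # embedding attached to an incoming data submission

k = 2  # the k most similar registered models compete for this submission
nearest = sorted(models, key=lambda m: cosine(models[m], target), reverse=True)[:k]
```

Positioning your embedding near the data you expect (and train on) is therefore as much a part of the strategy as the weights themselves.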