# Continuous Training
Every model on SOMA shares the same architecture. What differs is the weights. Train good weights, publish them via commit-reveal, and earn commission every time your model has the lowest loss for a data submission.
The lifecycle is a loop: train → commit → epoch advances → reveal → train more → repeat. This guide covers deploying that loop on testnet so it runs continuously. If you haven’t run through the localnet cycle yet, start with the Quickstart.
Source: `training.py`
## Kick off the first round

```sh
uv run modal run src/quickstart/training.py --steps-per-round 500
```

Or, to train with Flax instead of PyTorch:

```sh
uv run modal run src/quickstart/training.py --steps-per-round 500 --framework flax
```

This trains for 500 steps on an H100, then commits to the network. The `train_and_commit` function runs training followed by the commit in a single GPU container:
```python
@app.function(
    image=gpu_image,
    gpu="H100",
    timeout=86400,
    volumes={MODEL_DIR: volume},
    secrets=[modal.Secret.from_name("soma-secrets")],
)
async def train_and_commit(
    steps: int = 500,
    localnet: bool = False,
    framework: str = "torch",
):
    """Train for N steps then commit. GPU is released after commit."""
    await do_training(steps, framework=framework)

    state = await do_commit(localnet=localnet)
    state["framework"] = framework
    save_training_state(state, MODEL_DIR)
    await volume.commit.aio()
    return state
```

The main entrypoint calls this remotely:

```python
@app.local_entrypoint()
def main(steps_per_round: int = 500, framework: str = "torch"):
    train_and_commit.remote(
        steps=steps_per_round,
        localnet=False,
        framework=framework,
    )
```

After this completes, your model is committed to the network and is waiting for the next epoch to reveal.
## Deploy for continuous training

```sh
uv run modal deploy src/quickstart/training.py
```

Deploying activates the reveal cron job, which runs every 6 hours:
```python
@app.function(
    image=cpu_image,
    schedule=modal.Cron("0 */6 * * *"),  # every 6h — epochs are 24h
    timeout=600,
    volumes={MODEL_DIR: volume},
    secrets=[modal.Secret.from_name("soma-secrets")],
)
async def reveal(
    localnet: bool = False,
    auto_continue: bool = True,
    steps_per_round: int = 500,
):
    """Reveal the model if the epoch has advanced. Optionally spawn next round."""
    result = await do_reveal(localnet=localnet)

    if result is not None and auto_continue:
        fw = result.get("framework", "torch")
        train_and_commit.spawn(
            steps=steps_per_round,
            localnet=localnet,
            framework=fw,
        )
```

Each invocation checks whether the epoch has advanced past the commit, reveals the model if so, and spawns the next training round. The cycle then repeats automatically.
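The guard that makes the cron idempotent is the epoch check inside `do_reveal()`: if the epoch hasn't advanced past the one you committed in, there is nothing to reveal yet. A minimal sketch of that predicate, with hypothetical names (`should_reveal`, `commit_epoch` are assumptions, not the client's actual API):

```python
def should_reveal(state: dict, current_epoch: int) -> bool:
    """Reveal only once the chain has moved past the epoch we committed in.

    `state` stands in for the persisted training state; `commit_epoch` is a
    hypothetical field recording when the commit landed.
    """
    return current_epoch > state["commit_epoch"]

# Committed during epoch 7: nothing to do until epoch 8 begins.
ready = should_reveal({"commit_epoch": 7}, current_epoch=8)
not_yet = should_reveal({"commit_epoch": 7}, current_epoch=7)
```

Because the check is a pure comparison against persisted state, running the cron more often than the epoch length is harmless: extra invocations simply no-op.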
## Training details

### Data pipeline

`make_batches()` streams The Stack v2 and tokenizes each source file into fixed-length byte sequences:
```python
def make_batches(batch_size: int):
    from soma_models.v1.configs import V1_MAX_SEQ_LEN
    from soma_models.v1.tokenizer import tokenize

    ds = load_dataset(
        "bigcode/the-stack-v2-dedup",
        split="train",
        streaming=True,
        token=os.environ.get("HF_TOKEN"),
    )
    ds = ds.shuffle(buffer_size=SHUFFLE_BUFFER)

    buffer_ids, buffer_targets = [], []
    for row in ds:
        sequences = tokenize(
            data=row["content"].encode("utf-8"), max_seq_len=V1_MAX_SEQ_LEN
        )
        for seq in sequences:
            buffer_ids.append(seq.token_ids)
            buffer_targets.append(seq.targets)
            if len(buffer_ids) == batch_size:
                yield buffer_ids, buffer_targets
                buffer_ids, buffer_targets = [], []
```

The vocabulary is 264 tokens, including the 256 raw byte values plus PAD (256) and EOS (257). Each sequence is `V1_MAX_SEQ_LEN = 1024` bytes. See Models: Architecture for the full spec.
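To make the byte-level scheme concrete, here is an illustrative stand-in for what `tokenize` produces: it is a sketch under the assumptions stated in the text (ids 0–255 are raw bytes, EOS terminates the data, PAD fills out each fixed-length sequence), not the `soma_models` implementation.

```python
PAD, EOS, MAX_SEQ_LEN = 256, 257, 1024

def tokenize_bytes(data: bytes, max_seq_len: int = MAX_SEQ_LEN) -> list[list[int]]:
    """Split raw bytes into fixed-length id sequences: each byte maps to its
    own id, an EOS marks the end of the file, and PAD fills the final chunk."""
    ids = list(data) + [EOS]
    chunks = [ids[i : i + max_seq_len] for i in range(0, len(ids), max_seq_len)]
    return [c + [PAD] * (max_seq_len - len(c)) for c in chunks]

seqs = tokenize_bytes(b"def main():\n    pass\n")
```

A 21-byte file yields a single 1024-id sequence: 21 byte ids, one EOS, then PAD to the end.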
### Training loop

The PyTorch path resumes from the latest checkpoint, trains with gradient accumulation (64 micro-batches per step, effective batch size 128), and saves artifacts for the CPU commit step:
```python
async def _do_training_torch(steps, model_dir, vol, grad_accum_steps, log_every):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    ckpt_path, ckpt_step = find_latest_checkpoint(model_dir, CHECKPOINT_PREFIX)
    if ckpt_path:
        model = Model.load(ckpt_path, ModelConfig(dropout_rate=DROPOUT_RATE))
        model = model.to(device)
        start_step = ckpt_step
    else:
        model = Model(ModelConfig(dropout_rate=DROPOUT_RATE)).to(device)
        start_step = 0

    model.train()
    sig_reg = SIGReg(SIGRegConfig()).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

    batches = make_batches(MICRO_BATCH_SIZE)

    for i in range(steps):
        optimizer.zero_grad()
        accum_loss = 0.0

        for _micro in range(grad_accum_steps):
            ids, tgts = next(batches)
            token_ids = torch.tensor(ids, device=device)
            targets = torch.tensor(tgts, device=device)
            loss, embedding = compute_loss(model, sig_reg, token_ids, targets)
            (loss / grad_accum_steps).backward()
            accum_loss += loss.item()

        optimizer.step()

    # Save checkpoint + artifacts for CPU commit
    model.save(f"{model_dir}/{CHECKPOINT_PREFIX}-{final_step}.safetensors")
    weights_bytes = model.save_bytes()
    save_training_artifacts(model_dir, final_step, embedding_list, weights_bytes)
```

The Flax path uses a JIT-compiled micro-step with manual gradient accumulation:
```python
async def _do_training_flax(steps, model_dir, vol, grad_accum_steps, log_every):
    rngs = nnx.Rngs(0)
    model = Model(ModelConfig(dropout_rate=DROPOUT_RATE), rngs)
    model.train()
    sig_reg = SIGReg(SIGRegConfig(), rngs)
    optimizer = nnx.Optimizer(
        model, optax.adam(learning_rate=LEARNING_RATE), wrt=nnx.Param
    )

    @nnx.jit
    def micro_step(model, sig_reg, token_ids, targets):
        def loss_fn(model, sig_reg):
            return compute_loss(model, sig_reg, token_ids, targets)

        (loss, embedding), grads = nnx.value_and_grad(loss_fn, has_aux=True)(
            model, sig_reg
        )
        return loss, embedding, grads

    batches = make_batches(MICRO_BATCH_SIZE)

    for i in range(steps):
        accum_loss = jnp.zeros(())
        accum_grads = None

        for micro in range(grad_accum_steps):
            ids, tgts = next(batches)
            loss, embedding, grads = micro_step(
                model, sig_reg, jnp.array(ids), jnp.array(tgts)
            )
            accum_loss = accum_loss + loss
            accum_grads = grads if accum_grads is None else jax.tree.map(
                jnp.add, accum_grads, grads
            )

        accum_grads = jax.tree.map(lambda g: g / grad_accum_steps, accum_grads)
        optimizer.update(model, accum_grads)

    # Save checkpoint + artifacts for CPU commit
    model.save(f"{model_dir}/{CHECKPOINT_PREFIX}-{final_step}.safetensors")
    weights_bytes = model.save_bytes()
    save_training_artifacts(model_dir, final_step, embedding_list, weights_bytes)
```

`save_training_artifacts()` saves the embedding and serialized weights to disk so the commit step can run on CPU without loading the model framework.
## Commit and reveal

After training, `do_commit()` encrypts the weights with AES-256-CTR, uploads them to your bucket, and commits to the network. On the first commit, it calls `create_model()` with a 10% commission rate (1000 basis points). `do_reveal()` waits for the epoch to advance, then reveals the decryption key and embedding. See the Quickstart walkthrough for the full code.
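The shape of the scheme is worth spelling out: during the commit phase only a ciphertext (and its hash) is public, so no one can copy your weights; revealing the key after the epoch advances lets anyone decrypt and verify against the commitment. The sketch below uses a stdlib-only toy stream cipher (SHAKE-256 keystream XOR) purely as a stand-in for AES-256-CTR, and the function names are hypothetical:

```python
import hashlib
import secrets

def commit_phase(weights: bytes) -> tuple[bytes, bytes, str]:
    """Generate a 32-byte key (AES-256-sized), 'encrypt' the weights with a
    toy keystream, and publish the hash of the ciphertext as the commitment."""
    key = secrets.token_bytes(32)
    stream = hashlib.shake_256(key).digest(len(weights))  # NOT real AES-CTR
    ciphertext = bytes(a ^ b for a, b in zip(weights, stream))
    commitment = hashlib.sha256(ciphertext).hexdigest()
    return key, ciphertext, commitment

def reveal_phase(key: bytes, ciphertext: bytes) -> bytes:
    """Once the epoch advances, revealing the key lets anyone decrypt."""
    stream = hashlib.shake_256(key).digest(len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream))

key, ct, commitment = commit_phase(b"model weights")
recovered = reveal_phase(key, ct)
```

In the real client the ciphertext lives in your bucket and the commitment and key travel through the chain's commit and reveal transactions.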
## Your model’s embedding

The embedding you register determines which targets your model competes for. It’s how the KNN router finds you. See Model Strategies: Your model’s embedding for how to compute, update, and strategically position your embedding.
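To illustrate the routing idea only: a KNN router sends a submission to the models whose registered embeddings are nearest to the submission's embedding. The registry, the 2-D vectors, and the use of cosine similarity here are all assumptions for the sketch, not SOMA's actual router:

```python
import math

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical registered embeddings (the real ones are high-dimensional).
registered = {
    "model_a": (1.0, 0.0),
    "model_b": (0.0, 1.0),
    "model_c": (0.7, 0.7),
}

def nearest_models(query, k=2):
    """Return the k registered models most similar to the query embedding."""
    return sorted(registered, key=lambda m: -cosine(registered[m], query))[:k]

top = nearest_models((0.9, 0.1))  # → ["model_a", "model_c"]
```

A submission embedded near `(0.9, 0.1)` routes to `model_a` first, which is why positioning your embedding near the data you expect to win matters.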