Quickstart

SOMA trains a foundation model through competition. Participants independently train small models — all sharing the same architecture — and compete on a universal objective: given any data, predict what comes next.

You earn SOMA by submitting data or by training a model. Every 24 hours the network generates targets — random points in embedding space, each representing a domain the network wants to benchmark. A subset of models is assigned to each target. The first valid data submission wins, and the reward is split between the submitter and the model with the lowest loss.

This quickstart walks through both sides: submitting data against targets, then running a full model training cycle. See How SOMA Works for the full architecture.

Full source: github.com/soma-org/quickstart

  1. Clone the repo and install dependencies:

    git clone https://github.com/soma-org/quickstart.git
    cd quickstart
    uv sync
    uv run modal setup

    This installs all dependencies and authenticates with Modal.

  2. Copy the .env example:

    cp .env.example .env
  3. Export your private key and fill in SOMA_SECRET_KEY in .env:

    soma wallet export
  4. Create a HuggingFace access token (under your HuggingFace account settings) and fill in HF_TOKEN in .env.

  5. Approve access to the gated Stack v2 dataset.

  6. Set up an S3-compatible bucket for uploading model weights and submission data. Cloudflare R2 is the simplest option (no IAM, free egress):

    1. Create a Cloudflare account and go to R2 Object Storage → Overview. Activate R2 ($0/mo with 10 GB free).
    2. Create a bucket (e.g. soma-data).
    3. Enable public access: select your bucket → Settings → Public Development URL → enable the r2.dev subdomain. Copy the public URL.
    4. Go back to R2 Overview → Manage API Tokens. Create a token with Object Read & Write permissions. Copy the Access Key ID and Secret Access Key.

    Fill in your .env:

    S3_BUCKET=soma-data
    S3_ACCESS_KEY_ID=<your-key-id>
    S3_SECRET_ACCESS_KEY=<your-secret-key>
    S3_ENDPOINT_URL=https://<account-id>.r2.cloudflarestorage.com
    S3_PUBLIC_URL=https://pub-xxx.r2.dev
  7. Push secrets to Modal:

    uv run create-secrets
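
If uploads fail later, a quick way to confirm the S3_* values from step 6 are correct is a standalone boto3 round-trip. This is a sketch, not part of the quickstart repo; it assumes python-dotenv is installed to read .env:

import os

import boto3
from dotenv import load_dotenv

load_dotenv()  # pull the S3_* values from .env

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT_URL"],
    aws_access_key_id=os.environ["S3_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["S3_SECRET_ACCESS_KEY"],
)

# Round-trip a tiny object, then print the URL it should be publicly served from.
s3.put_object(Bucket=os.environ["S3_BUCKET"], Key="healthcheck.txt", Body=b"ok")
print(f"uploaded: {os.environ['S3_PUBLIC_URL']}/healthcheck.txt")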

To fill a target, you need data whose embedding lands close to it. Scoring runs your data through the target’s assigned models — each model produces an embedding for the data and a loss score. If the distance between the data’s embedding and the target is within the distance threshold, the submission is valid.
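
Concretely, the validity check is just a distance comparison. A minimal sketch with numpy, assuming cosine distance (which is what the submitter below uses via cosine_distance):

import numpy as np

def is_within_threshold(data_emb: np.ndarray, target_emb: np.ndarray,
                        distance_threshold: float) -> bool:
    # Hypothetical helper: cosine distance = 1 - cosine similarity.
    cos_sim = float(np.dot(data_emb, target_emb)
                    / (np.linalg.norm(data_emb) * np.linalg.norm(target_emb)))
    return (1.0 - cos_sim) <= distance_threshold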

The submitter automates this: it streams source code from The Stack v2, scores each file against open targets, and submits when one lands within threshold.

uv run modal run src/quickstart/submitter.py

This starts the SOMA scoring service on an L4 GPU, streams shuffled source files, scores them against the models assigned to open targets, and if a file’s embedding is close enough, uploads it to R2 and submits to the network. The first score may take a few minutes while model weights are downloaded into the scoring service.

Data source — stream_stack_v2() streams shuffled source files from The Stack v2. It uses HuggingFace datasets to get file metadata, downloads the actual source from Software Heritage's S3, decodes to UTF-8, and skips empty files and anything larger than 5 KB. A background thread prefetches data so samples are ready when the scoring service finishes a score:

import os

import boto3
from botocore import UNSIGNED
from botocore.config import Config
from datasets import load_dataset
from smart_open import open as smart_open

def stream_stack_v2():
    """Yield shuffled source files from The Stack v2 as UTF-8 bytes."""
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    ds = load_dataset(
        "bigcode/the-stack-v2-dedup",
        split="train",
        streaming=True,
        token=os.environ.get("HF_TOKEN"),
    ).shuffle(buffer_size=10_000)
    for row in ds:
        try:
            s3_url = f"s3://softwareheritage/content/{row['blob_id']}"
            with smart_open(
                s3_url, "rb", compression=".gz", transport_params={"client": s3}
            ) as fin:
                content = fin.read().decode(row["src_encoding"])
        except Exception:
            continue
        if not content.strip():
            continue
        data = content.encode("utf-8")
        if len(data) > 5_000:
            continue
        yield data
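
The prefetching thread mentioned above is not shown in the excerpt. One plausible shape for it, as a sketch (the repo's actual implementation may differ): a daemon thread fills a bounded queue so the next sample is ready the moment the scoring service frees up.

import threading
from queue import Queue

def prefetch(gen, maxsize: int = 8):
    # Sketch: pull items from `gen` on a background thread into a bounded
    # queue; `put` blocks when the queue is full, so memory stays bounded.
    q: Queue = Queue(maxsize=maxsize)

    def worker():
        for item in gen:
            q.put(item)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        yield q.get()

data_stream = prefetch(stream_stack_v2())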

Score and submit — the Submitter starts the SOMA scoring binary on a GPU, then loops: each iteration scores a data file against one target’s models, then checks the verified embedding against all open targets. Hits are submitted in the background while scoring continues:

async def _score_and_submit(self, kp):
    client = await SomaClient(
        chain="testnet",
        scoring_url=f"http://localhost:{SCORING_PORT}",
    )
    all_targets = list(await client.get_targets(status="open"))
    # Pick a target and score
    scoring_target = all_targets[sample_num % len(all_targets)]
    manifests = await client.get_model_manifests(scoring_target)
    data_bytes = next(self.data_stream)
    checksum = client.commitment(data_bytes)
    local_url, local_path = save_local(data_bytes, checksum)
    result = await client.score(
        data_url=local_url,
        models=manifests,
        target_embedding=scoring_target.embedding,
        data=data_bytes,
    )
    winner = result.winner
    winning_model_id = scoring_target.model_ids[winner]
    # Check verified embedding against ALL targets (free math)
    for t in all_targets:
        if winning_model_id not in t.model_ids:
            continue
        dist = cosine_distance(result.embedding, t.embedding)
        if dist <= t.distance_threshold:
            # Upload to S3 and submit on-chain
            data_url = upload_to_s3(data_bytes, checksum, t.generation_epoch)
            await client.submit_data(
                signer=kp,
                target_id=t.id,
                data=data_bytes,
                data_url=data_url,
                model_id=winning_model_id,
                embedding=result.embedding,
                distance_score=dist,
                loss_score=result.loss_score,
            )

The flow is: score data against one target’s models (~7s) → check the verified embedding against all open targets → submit to any that pass the threshold → refresh targets and continue. A single score call can hit multiple targets. Submissions run in the background so the scoring service stays busy.

Before going to testnet, run a full train → commit → reveal cycle on a local test network inside Modal.

Models on SOMA publish weights via a commit-reveal protocol. You train weights, commit an encrypted copy, wait for the next epoch, then reveal the decryption key. This prevents front-running.
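
The scheme is standard commit-reveal: until you publish the key, the network holds only ciphertext, so nobody can copy your weights and resubmit them as their own. A minimal illustration using Fernet from the cryptography package (a sketch of the idea only; SOMA's actual scheme is whatever SomaClient.encrypt_weights implements, shown later in do_commit()):

from cryptography.fernet import Fernet

weights_bytes = b"..."  # placeholder for your serialized model weights

# Commit phase: publish only the ciphertext.
key = Fernet.generate_key()
ciphertext = Fernet(key).encrypt(weights_bytes)

# Reveal phase (next epoch): publish `key`; anyone can decrypt and verify.
assert Fernet(key).decrypt(ciphertext) == weights_bytes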

uv run modal run src/quickstart/training.py::localnet

This trains a model on an NVIDIA H100 GPU, commits it to the network, advances the epoch, and reveals it — the entire lifecycle in a single run.

Data pipeline — make_batches() streams The Stack v2 and tokenizes each source file into fixed-length byte sequences using the soma_models tokenizer:

def make_batches(batch_size: int):
    """Stream shuffled, tokenized batches from The Stack v2."""
    from soma_models.v1.configs import V1_MAX_SEQ_LEN
    from soma_models.v1.tokenizer import tokenize

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    ds = load_dataset(
        "bigcode/the-stack-v2-dedup",
        split="train",
        streaming=True,
        token=os.environ.get("HF_TOKEN"),
    )
    ds = ds.shuffle(buffer_size=SHUFFLE_BUFFER)
    # ... downloads each file from Software Heritage S3
    buffer_ids, buffer_targets = [], []
    for row in ds:
        sequences = tokenize(
            data=row["content"].encode("utf-8"), max_seq_len=V1_MAX_SEQ_LEN
        )
        for seq in sequences:
            buffer_ids.append(seq.token_ids)
            buffer_targets.append(seq.targets)
            if len(buffer_ids) == batch_size:
                yield buffer_ids, buffer_targets
                buffer_ids, buffer_targets = [], []

Training loop — initializes a model with the SOMA architecture and trains it:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Model(ModelConfig(dropout_rate=DROPOUT_RATE)).to(device)
model.train()
sig_reg = SIGReg(SIGRegConfig()).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
batches = make_batches(MICRO_BATCH_SIZE)

for i in range(steps):
    optimizer.zero_grad()
    accum_loss = 0.0
    for _micro in range(grad_accum_steps):
        ids, tgts = next(batches)
        token_ids = torch.tensor(ids, device=device)
        targets = torch.tensor(tgts, device=device)
        loss, embedding = compute_loss(model, sig_reg, token_ids, targets)
        (loss / grad_accum_steps).backward()
        accum_loss += loss.item()
    optimizer.step()
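
The inner loop is plain gradient accumulation: each backward() call adds into the same .grad buffers, and scaling the loss by 1/grad_accum_steps makes the update equivalent to a single batch of MICRO_BATCH_SIZE × grad_accum_steps sequences while only one micro-batch sits in GPU memory at a time.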

After training, the model saves a checkpoint and training artifacts (embedding + serialized weights) so the commit step can run on CPU without the model framework.
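
One plausible shape for those artifacts, matching the (embedding, weights) pair that do_commit() below unpacks (a sketch; the helper name and file layout are hypothetical):

import pickle

def save_training_artifacts(model_dir: str, step: int, embedding, weights_bytes: bytes):
    # Hypothetical helper: store the embedding as a plain list and the weights
    # as raw bytes, so the commit step can unpickle them without torch.
    with open(f"{model_dir}/artifacts-{step}.pkl", "wb") as f:
        pickle.dump((embedding, weights_bytes), f)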

Commit — do_commit() loads the training artifacts, encrypts the weights, uploads them, and commits to the network:

async def do_commit(localnet=True, model_dir=MODEL_DIR, vol=None):
    # Load training artifacts (embedding + weights bytes)
    artifacts = load_training_artifacts(model_dir, ckpt_step)
    embedding, weights_bytes = artifacts
    # Encrypt with a fresh key
    encrypted, decryption_key = SomaClient.encrypt_weights(weights_bytes)
    # Upload (to localnet file server or S3)
    weights_url = upload_weights(encrypted, f"model-step-{ckpt_step}", current_epoch)
    # Create model on first commit
    signer = Keypair.from_secret_key(os.environ["SOMA_SECRET_KEY"])
    if state["model_id"] is None:
        model_id = await client.create_model(
            signer=signer,
            commission_rate=1000,  # 10%
        )
        state["model_id"] = model_id
    # Commit weights to the network
    await client.commit_model(
        signer=signer,
        model_id=state["model_id"],
        weights_url=weights_url,
        encrypted_weights=encrypted,
        decryption_key=decryption_key,
        embedding=embedding,
    )
    state["pending_reveal"] = True
    state["commit_epoch"] = current_epoch

Reveal — do_reveal() waits for the epoch to advance past the commit epoch, then reveals the decryption key and embedding:

async def do_reveal(localnet=True, model_dir=MODEL_DIR, vol=None):
    # Check if epoch has advanced
    if current_epoch <= state["commit_epoch"]:
        print("Epoch hasn't advanced yet — will retry later.")
        return None
    # Reveal
    await client.reveal_model(
        signer=signer,
        model_id=state["model_id"],
        decryption_key=state["decryption_key"],
        embedding=state["embedding"],
    )
    state["pending_reveal"] = False

Full cycle — the LocalnetTrainer runs all three steps in sequence, advancing the epoch in between:

@modal.method()
async def run(self, steps: int = 500, framework: str = "torch"):
    """Train → create → commit → advance epoch → reveal."""
    await do_training(
        steps, framework=framework,
        model_dir=LOCALNET_MODEL_DIR, vol=localnet_volume,
        grad_accum_steps=4, log_every=1,
    )
    state = await do_commit(
        localnet=True, model_dir=LOCALNET_MODEL_DIR, vol=localnet_volume
    )
    # Advance epoch so reveal is possible
    client = await SomaClient(chain="localnet")
    new_epoch = await client.advance_epoch()
    state = await do_reveal(
        localnet=True, model_dir=LOCALNET_MODEL_DIR, vol=localnet_volume,
        state=state,
    )

On localnet, advance_epoch() instantly moves to the next epoch. On testnet, epochs advance every 24 hours, so you’ll need to automate your training flow.
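
One way to automate it, sketched on top of the helpers above (hypothetical; a Modal scheduled function would work just as well, and the exact keyword arguments may differ from the repo's):

import asyncio

async def testnet_loop(steps: int = 500):
    while True:
        await do_training(steps, model_dir=MODEL_DIR)
        state = await do_commit(localnet=False, model_dir=MODEL_DIR)
        # do_reveal() returns None until the epoch has advanced past the commit.
        while await do_reveal(localnet=False, model_dir=MODEL_DIR, state=state) is None:
            await asyncio.sleep(3600)  # poll hourly; testnet epochs are 24 h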