# Quickstart

**Using an agent?** If you're using Claude Code, OpenClaw, Codex, or another AI agent, try the [Quickstart (Agent)](https://docs.soma.org/getting-started/quickstart-agent/) guide — install one skill and let your agent handle the rest.

SOMA trains a foundation model through competition. Participants independently train small models — all sharing the same [architecture](https://docs.soma.org/concepts/models/#architecture) — and compete on a universal objective: given any data, predict what comes next.

You earn SOMA by **submitting data** or by **training a model**. Every 24 hours the network generates [targets](https://docs.soma.org/concepts/targets/) — random points in [embedding space](https://docs.soma.org/concepts/targets/#embeddings), each representing a domain the network wants to benchmark. A subset of models is [assigned](https://docs.soma.org/concepts/targets/#model-assignment) to each target. The first valid data [submission](https://docs.soma.org/concepts/submitting) wins, and [rewards](https://docs.soma.org/concepts/economics/) split between the submitter and the model with the lowest loss.

This quickstart walks through both sides: submitting data against targets, then running a full model training cycle. See [How SOMA Works](https://docs.soma.org/overview/how-it-works/) for the full architecture.

> Full source: [github.com/soma-org/quickstart](https://github.com/soma-org/quickstart)

**Prerequisites:**

- SOMA CLI installed and a funded testnet account ([Installation & Setup](https://docs.soma.org/getting-started/install/))
- Python 3.13+
- [uv](https://docs.astral.sh/uv/getting-started/installation/)

## Clone and configure

1. Clone the repo and install dependencies:

    ```
    git clone https://github.com/soma-org/quickstart.git
    cd quickstart
    uv sync
    uv run modal setup
    ```

    This installs all dependencies and authenticates with Modal.

2. Copy the `.env` example:

    ```
    cp .env.example .env
    ```

3. Export your private key and fill in **`SOMA_SECRET_KEY`** in `.env`:
   
    ```
    soma wallet export
    ```

4. Create a HuggingFace access token [here](https://huggingface.co/settings/tokens) and fill in **`HF_TOKEN`** in `.env`.
   
5. [Approve access](https://huggingface.co/datasets/bigcode/the-stack-v2-dedup) to the gated Stack v2 dataset.

6. Set up an S3-compatible bucket for uploading model weights and submission data. **Cloudflare R2** is the simplest option (no IAM, free egress):

    1. Create a [Cloudflare account](https://dash.cloudflare.com/sign-up) and go to **R2 Object Storage** → Overview. Activate R2 ($0/mo with 10 GB free).
    2. Create a bucket (e.g. `soma-data`).
    3. Enable public access: select your bucket → **Settings** → **Public Development URL** → enable the `r2.dev` subdomain. Copy the public URL.
    4. Go back to the R2 Overview → **Manage API Tokens**. Create a token with **Object Read & Write** permissions. Copy the Access Key ID and Secret Access Key.

    Fill in your `.env`:

    ```
    S3_BUCKET=soma-data
    S3_ACCESS_KEY_ID=<your-key-id>
    S3_SECRET_ACCESS_KEY=<your-secret-key>
    S3_ENDPOINT_URL=https://<account-id>.r2.cloudflarestorage.com
    S3_PUBLIC_URL=https://pub-xxx.r2.dev
    ```

7. Push secrets to Modal:

    ```
    uv run create-secrets
    ```

**Danger:** Never hard-code secret keys in source files. Use `.env` for local scripts and Modal Secrets for cloud functions.

## Step 1: Submit data

To fill a target, you need data whose [embedding](https://docs.soma.org/concepts/targets/#embeddings) lands close to it. Scoring runs your data through the target's assigned models — each model produces an embedding for the data and a loss score. If the distance between the data's embedding and the target is within the [distance threshold](https://docs.soma.org/concepts/targets/#distance-threshold), the submission is valid.
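The validity check itself is just a distance comparison. A minimal sketch, using cosine distance (the same metric the submitter code below uses):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance: 0.0 for identical directions, 2.0 for opposite ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def is_valid(embedding: list[float], target: list[float], threshold: float) -> bool:
    """A submission is valid if its embedding lands within the target's threshold."""
    return cosine_distance(embedding, target) <= threshold

target = [1.0, 0.0]
print(is_valid([0.9, 0.1], target, 0.1))   # True: nearly aligned with the target
print(is_valid([-1.0, 0.0], target, 0.1))  # False: opposite direction, distance 2.0
```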

The submitter automates this: it streams source code from [The Stack v2](https://huggingface.co/datasets/bigcode/the-stack-v2-dedup), scores each file against open targets, and submits when one lands within threshold.

```
uv run modal run src/quickstart/submitter.py
```

This starts the SOMA scoring service on an L4 GPU, streams shuffled source files, scores them against the models assigned to open targets, and if a file's embedding is close enough, uploads it to R2 and submits to the network. The first score may take a few minutes while model weights are downloaded into the scoring service.

### How it works

**Data source** — `stream_stack_v2()` streams shuffled source files from The Stack v2. It uses HuggingFace datasets to get file metadata, downloads the actual source from Software Heritage's S3, decodes to UTF-8, and skips empty files and files larger than 5 KB. A background thread prefetches data so samples are ready when the scoring service finishes:

```python
def stream_stack_v2():
    """Yield shuffled source files from The Stack v2 as UTF-8 bytes."""
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    ds = load_dataset(
        "bigcode/the-stack-v2-dedup",
        split="train",
        streaming=True,
        token=os.environ.get("HF_TOKEN"),
    ).shuffle(buffer_size=10_000)

    for row in ds:
        try:
            s3_url = f"s3://softwareheritage/content/{row['blob_id']}"
            with smart_open(
                s3_url, "rb", compression=".gz", transport_params={"client": s3}
            ) as fin:
                content = fin.read().decode(row["src_encoding"])
        except Exception:
            continue
        if not content.strip():
            continue
        data = content.encode("utf-8")
        if len(data) > 5_000:
            continue
        yield data
```
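The background prefetching mentioned above isn't shown in this excerpt. One common pattern, sketched here as an assumption about the repo's approach, is a bounded queue filled by a daemon thread so the consumer never blocks on slow downloads:

```python
import queue
import threading

def prefetch(source, maxsize: int = 64):
    """Wrap an iterator with a background thread feeding a bounded queue."""
    q: queue.Queue = queue.Queue(maxsize=maxsize)
    _DONE = object()  # sentinel marking the end of the source iterator

    def worker():
        for item in source:
            q.put(item)  # blocks when the queue is full, bounding memory use
        q.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item

# Hypothetical usage with the generator above:
#   stream = prefetch(stream_stack_v2())
```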

**Score and submit** — the `Submitter` starts the SOMA scoring binary on a GPU, then loops: each iteration scores a data file against one target's models, then checks the verified embedding against **all** open targets. Hits are submitted in the background while scoring continues:

```python
async def _score_and_submit(self, kp):
    client = await SomaClient(
        chain="testnet",
        scoring_url=f"http://localhost:{SCORING_PORT}",
    )

    all_targets = list(await client.get_targets(status="open"))

    # Pick a target and score
    scoring_target = all_targets[sample_num % len(all_targets)]
    manifests = await client.get_model_manifests(scoring_target)

    data_bytes = next(self.data_stream)
    checksum = client.commitment(data_bytes)
    local_url, local_path = save_local(data_bytes, checksum)

    result = await client.score(
        data_url=local_url,
        models=manifests,
        target_embedding=scoring_target.embedding,
        data=data_bytes,
    )

    winner = result.winner
    winning_model_id = scoring_target.model_ids[winner]

    # Check verified embedding against ALL targets (free math)
    for t in all_targets:
        if winning_model_id not in t.model_ids:
            continue
        dist = cosine_distance(result.embedding, t.embedding)
        if dist <= t.distance_threshold:
            # Upload to S3 and submit on-chain
            data_url = upload_to_s3(data_bytes, checksum, t.generation_epoch)
            await client.submit_data(
                signer=kp,
                target_id=t.id,
                data=data_bytes,
                data_url=data_url,
                model_id=winning_model_id,
                embedding=result.embedding,
                distance_score=dist,
                loss_score=result.loss_score,
            )
```

The flow is: score data against one target's models (~7s) → check the verified embedding against all open targets → submit to any that pass the threshold → refresh targets and continue. A single score call can hit multiple targets. Submissions run in the background so the scoring service stays busy.

## Step 2: Train on localnet

Before going to testnet, run a full **train → commit → reveal** cycle on a local test network inside Modal.

Models on SOMA publish weights via a [commit-reveal protocol](https://docs.soma.org/concepts/models/#publishing-weights). You train weights, commit an encrypted copy, wait for the next [epoch](https://docs.soma.org/concepts/network/#epochs), then reveal the decryption key. This prevents front-running.
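The binding property behind this scheme can be illustrated with a toy hash commitment. SOMA's protocol encrypts the weights themselves (see `do_commit()` below); this sketch only shows why a committed value cannot be swapped out after the fact:

```python
import hashlib
import os

def commit(weights: bytes) -> tuple[bytes, bytes]:
    """Commit phase: publish a binding hash, keep the key secret."""
    key = os.urandom(32)  # secret revealed in the next epoch
    commitment = hashlib.sha256(key + weights).digest()
    return commitment, key

def verify_reveal(commitment: bytes, key: bytes, weights: bytes) -> bool:
    """Reveal phase: anyone can check the key and weights match the commitment."""
    return hashlib.sha256(key + weights).digest() == commitment

weights = b"serialized model weights"
c, k = commit(weights)
assert verify_reveal(c, k, weights)          # honest reveal passes
assert not verify_reveal(c, k, b"tampered")  # swapped weights are rejected
```

Because the commitment binds both the key and the weights before anyone can see them, a front-runner cannot copy a committed model, and the committer cannot change the weights after observing the competition.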

```
uv run modal run src/quickstart/training.py::localnet
```

To train with Flax instead of PyTorch, pass `--framework flax`:

```
uv run modal run src/quickstart/training.py::localnet --framework flax
```

This trains a model on an NVIDIA H100 GPU, commits it to the network, advances the epoch, and reveals it — the entire lifecycle in a single run.

### How it works

**Data pipeline** — `make_batches()` streams The Stack v2 and tokenizes each source file into fixed-length byte sequences using the `soma_models` tokenizer:

```python
def make_batches(batch_size: int):
    """Stream shuffled, tokenized batches from The Stack v2."""
    from soma_models.v1.configs import V1_MAX_SEQ_LEN
    from soma_models.v1.tokenizer import tokenize

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    ds = load_dataset(
        "bigcode/the-stack-v2-dedup",
        split="train",
        streaming=True,
        token=os.environ.get("HF_TOKEN"),
    )
    ds = ds.shuffle(buffer_size=SHUFFLE_BUFFER)

    # ... downloads each file from Software Heritage S3

    buffer_ids, buffer_targets = [], []

    for row in ds:
        sequences = tokenize(
            data=row["content"].encode("utf-8"), max_seq_len=V1_MAX_SEQ_LEN
        )
        for seq in sequences:
            buffer_ids.append(seq.token_ids)
            buffer_targets.append(seq.targets)

            if len(buffer_ids) == batch_size:
                yield buffer_ids, buffer_targets
                buffer_ids, buffer_targets = [], []
```
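The `tokenize` helper comes from `soma_models` and isn't shown here. For intuition only, a generic fixed-length byte tokenizer that pairs each position with its next byte as the target might look like this (an illustration, not the actual `soma_models` tokenizer):

```python
def tokenize_bytes(data: bytes, max_seq_len: int):
    """Split raw bytes into (token_ids, next-token targets) sequences.

    Each byte is treated as a token id; the target for position i is the
    byte at position i + 1. A real pipeline would pad or drop the short
    tail sequence.
    """
    ids = list(data)
    sequences = []
    for start in range(0, len(ids) - 1, max_seq_len):
        chunk = ids[start : start + max_seq_len + 1]  # one extra byte for targets
        if len(chunk) < 2:
            break
        sequences.append((chunk[:-1], chunk[1:]))
    return sequences

seqs = tokenize_bytes(b"hello world", max_seq_len=4)
# each pair is (inputs, inputs shifted by one position)
```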

**Training loop** — initializes a model with the SOMA architecture and trains it:

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = Model(ModelConfig(dropout_rate=DROPOUT_RATE)).to(device)
model.train()
sig_reg = SIGReg(SIGRegConfig()).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

batches = make_batches(MICRO_BATCH_SIZE)

for i in range(steps):
    optimizer.zero_grad()
    accum_loss = 0.0

    for _micro in range(grad_accum_steps):
        ids, tgts = next(batches)
        token_ids = torch.tensor(ids, device=device)
        targets = torch.tensor(tgts, device=device)
        loss, embedding = compute_loss(model, sig_reg, token_ids, targets)
        (loss / grad_accum_steps).backward()
        accum_loss += loss.item()

    optimizer.step()
```
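Dividing each micro-batch loss by `grad_accum_steps` before `backward()` makes the accumulated gradient equal the gradient of the mean loss over the full batch. With a scalar model this is easy to verify by hand:

```python
# Scalar "model" with per-micro-batch loss loss_i(w) = (w - t_i)^2,
# so d loss_i / d w = 2 * (w - t_i). Accumulating grad(loss_i / n)
# over n micro-batches equals the gradient of the mean loss.
w = 0.5
targets = [1.0, 2.0, 3.0, 4.0]
n = len(targets)

accum = sum(2 * (w - t) / n for t in targets)  # scaled micro-step gradients
full = 2 * (w - sum(targets) / n)              # gradient of the mean loss
assert abs(accum - full) < 1e-12
```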

The Flax NNX equivalent accumulates gradients across micro-batches explicitly and applies them once per step:

```python
rngs = nnx.Rngs(0)
model = Model(ModelConfig(dropout_rate=DROPOUT_RATE), rngs)
model.train()
sig_reg = SIGReg(SIGRegConfig(), rngs)
optimizer = nnx.Optimizer(
    model, optax.adam(learning_rate=LEARNING_RATE), wrt=nnx.Param
)

@nnx.jit
def micro_step(model, sig_reg, token_ids, targets):
    def loss_fn(model, sig_reg):
        return compute_loss(model, sig_reg, token_ids, targets)
    (loss, embedding), grads = nnx.value_and_grad(loss_fn, has_aux=True)(
        model, sig_reg
    )
    return loss, embedding, grads

batches = make_batches(MICRO_BATCH_SIZE)

for i in range(steps):
    accum_loss = jnp.zeros(())
    accum_grads = None

    for micro in range(grad_accum_steps):
        ids, tgts = next(batches)
        loss, embedding, grads = micro_step(
            model, sig_reg, jnp.array(ids), jnp.array(tgts)
        )
        accum_loss = accum_loss + loss
        accum_grads = grads if accum_grads is None else jax.tree.map(
            jnp.add, accum_grads, grads
        )

    accum_grads = jax.tree.map(lambda g: g / grad_accum_steps, accum_grads)
    optimizer.update(model, accum_grads)
```

After training, the model saves a checkpoint and training artifacts (embedding + serialized weights) so the commit step can run on CPU without the model framework.

**Commit** — `do_commit()` loads the training artifacts, encrypts the weights, uploads them, and commits to the network:

```python
async def do_commit(localnet=True, model_dir=MODEL_DIR, vol=None):
    # Load training artifacts (embedding + weights bytes)
    artifacts = load_training_artifacts(model_dir, ckpt_step)
    embedding, weights_bytes = artifacts

    # Encrypt with a fresh key
    encrypted, decryption_key = SomaClient.encrypt_weights(weights_bytes)

    # Upload (to localnet file server or S3)
    weights_url = upload_weights(encrypted, f"model-step-{ckpt_step}", current_epoch)

    # Create model on first commit
    signer = Keypair.from_secret_key(os.environ["SOMA_SECRET_KEY"])

    if state["model_id"] is None:
        model_id = await client.create_model(
            signer=signer,
            commission_rate=1000,  # 10%
        )
        state["model_id"] = model_id

    # Commit weights to the network
    await client.commit_model(
        signer=signer,
        model_id=state["model_id"],
        weights_url=weights_url,
        encrypted_weights=encrypted,
        decryption_key=decryption_key,
        embedding=embedding,
    )

    state["pending_reveal"] = True
    state["commit_epoch"] = current_epoch
```

**Reveal** — `do_reveal()` waits for the epoch to advance past the commit epoch, then reveals the decryption key and embedding:

```python
async def do_reveal(localnet=True, model_dir=MODEL_DIR, vol=None):
    # Check if epoch has advanced
    if current_epoch <= state["commit_epoch"]:
        print("Epoch hasn't advanced yet — will retry later.")
        return None

    # Reveal
    await client.reveal_model(
        signer=signer,
        model_id=state["model_id"],
        decryption_key=state["decryption_key"],
        embedding=state["embedding"],
    )

    state["pending_reveal"] = False
```

**Full cycle** — the `LocalnetTrainer` runs all three steps in sequence, advancing the epoch in between:

```python
@modal.method()
async def run(self, steps: int = 500, framework: str = "torch"):
    """Train → create → commit → advance epoch → reveal."""
    await do_training(
        steps, framework=framework,
        model_dir=LOCALNET_MODEL_DIR, vol=localnet_volume,
        grad_accum_steps=4, log_every=1,
    )
    state = await do_commit(
        localnet=True, model_dir=LOCALNET_MODEL_DIR, vol=localnet_volume
    )

    # Advance epoch so reveal is possible
    client = await SomaClient(chain="localnet")
    new_epoch = await client.advance_epoch()

    state = await do_reveal(
        localnet=True, model_dir=LOCALNET_MODEL_DIR, vol=localnet_volume,
        state=state,
    )
```

On localnet, `advance_epoch()` instantly moves to the next epoch. On testnet, epochs advance every 24 hours, so you'll need to automate your training flow.
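One way to automate testnet reveals is a scheduled job that polls until the epoch has advanced. A minimal sketch of the retry helper; the epoch-check predicate in the usage comment is hypothetical, not part of the SOMA client API:

```python
import asyncio

async def wait_for(predicate, poll_seconds: float = 600.0):
    """Poll an async predicate until it returns True, sleeping in between."""
    while not await predicate():
        await asyncio.sleep(poll_seconds)

# Hypothetical usage — epoch_has_advanced is an assumed helper that compares
# the chain's current epoch against state["commit_epoch"]:
#   await wait_for(lambda: epoch_has_advanced(client, state["commit_epoch"]))
#   await do_reveal(localnet=False, ...)
```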

## Next steps

[Continuous Training](https://docs.soma.org/guides/model-development/)

[Model Strategies](https://docs.soma.org/guides/model-strategies/)

[Data Strategies](https://docs.soma.org/guides/data-strategies/)

[Claim Rewards](https://docs.soma.org/guides/claim-rewards/)