# Quickstart

**Using an agent?** If you're using Claude Code, OpenClaw, Codex, or another AI agent, try the [Quickstart (Agent)](https://docs.soma.org/getting-started/quickstart-agent/) guide — install one skill and let your agent handle the rest.

SOMA trains a foundation model through competition. Participants independently train small models — all sharing the same [architecture](https://docs.soma.org/concepts/models/#architecture) — and compete on a universal objective: given any data, predict what comes next.

You earn SOMA by **submitting data** or by **training a model**. Every 24 hours the network generates [targets](https://docs.soma.org/concepts/targets/) — random points in [embedding space](https://docs.soma.org/concepts/targets/#embeddings), each representing a domain the network wants to benchmark. A subset of models is [assigned](https://docs.soma.org/concepts/targets/#model-assignment) to each target. The first valid data [submission](https://docs.soma.org/concepts/submitting) wins, and [rewards](https://docs.soma.org/concepts/economics/) split between the submitter and the model with the lowest loss.

This quickstart walks through both sides: submitting data against targets, then running a full model training cycle. See [How SOMA Works](https://docs.soma.org/overview/how-it-works/) for the full architecture.

> Full source: [github.com/soma-org/quickstart](https://github.com/soma-org/quickstart)

**Prerequisites:**

- SOMA CLI installed and a funded testnet account ([Installation & Setup](https://docs.soma.org/getting-started/install/))
- Python 3.13+
- [uv](https://docs.astral.sh/uv/getting-started/installation/)

## Clone and configure

1. Clone the repo and install dependencies:

    ```
    git clone https://github.com/soma-org/quickstart.git
    cd quickstart
    uv sync
    uv run modal setup
    ```

    This installs all dependencies and authenticates with Modal.

2. Copy the `.env` example:

    ```
    cp .env.example .env
    ```

3. Export your private key and fill in **`SOMA_SECRET_KEY`** in `.env`:
   
    ```
    soma wallet export
    ```

4. Create a HuggingFace access token [here](https://huggingface.co/settings/tokens) and fill in **`HF_TOKEN`** in `.env`.
   
5. [Approve access](https://huggingface.co/datasets/bigcode/the-stack-v2-dedup) to the gated Stack v2 dataset.

6. Set up an S3-compatible bucket for uploading model weights and submission data. **Cloudflare R2** is the simplest option (no IAM, free egress):

    1. Create a [Cloudflare account](https://dash.cloudflare.com/sign-up) and go to **R2 Object Storage** → Overview. Activate R2 ($0/mo with 10 GB free).
    2. Create a bucket (e.g. `soma-data`).
    3. Enable public access: select your bucket → **Settings** → **Public Development URL** → enable the `r2.dev` subdomain. Copy the public URL.
    4. Go back to the R2 Overview → **Manage API Tokens**. Create a token with **Object Read & Write** permissions. Copy the Access Key ID and Secret Access Key.

    Fill in your `.env`:

    ```
    S3_BUCKET=soma-data
    S3_ACCESS_KEY_ID=<your-key-id>
    S3_SECRET_ACCESS_KEY=<your-secret-key>
    S3_ENDPOINT_URL=https://<account-id>.r2.cloudflarestorage.com
    S3_PUBLIC_URL=https://pub-xxx.r2.dev
    ```

7. Push secrets to Modal:

    ```
    uv run create-secrets
    ```

**Danger:** Never hard-code secret keys in source files. Use `.env` for local scripts and Modal Secrets for cloud functions.

## Step 1: Submit data

To fill a target, you need data whose [embedding](https://docs.soma.org/concepts/targets/#embeddings) lands close to it. Scoring runs your data through the target's assigned models — each model produces an embedding for the data and a loss score. If the distance between the data's embedding and the target is within the [distance threshold](https://docs.soma.org/concepts/targets/#distance-threshold), the submission is valid.
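The validity check itself is just a distance comparison. A minimal sketch, using cosine distance (the same metric the submitter code below uses):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance: 0.0 for identical directions, 2.0 for opposite ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def is_valid(embedding: list[float], target: list[float], threshold: float) -> bool:
    """A submission is valid if its embedding lands within the target's threshold."""
    return cosine_distance(embedding, target) <= threshold

target = [1.0, 0.0]
print(is_valid([0.9, 0.1], target, 0.1))   # True: nearly aligned with the target
print(is_valid([-1.0, 0.0], target, 0.1))  # False: opposite direction, distance 2.0
```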

The submitter automates this: it streams source code from [The Stack v2](https://huggingface.co/datasets/bigcode/the-stack-v2-dedup), scores each file against open targets, and submits when one lands within threshold.

```
uv run modal run src/quickstart/submitter.py
```

This starts the SOMA scoring service on an L4 GPU, streams shuffled source files, scores them against the models assigned to open targets, and if a file's embedding is close enough, uploads it to R2 and submits to the network. The first score may take a few minutes while model weights are downloaded into the scoring service.

### How it works

**Data source** — `stream_stack_v2()` streams shuffled source files from The Stack v2. It uses HuggingFace datasets to get file metadata, downloads the actual source from Software Heritage's S3, decodes to UTF-8, and skips empty files and files larger than 5 KB. A background thread prefetches data so samples are ready when the scoring service finishes:

```python
def stream_stack_v2():
    """Yield shuffled source files from The Stack v2 as UTF-8 bytes."""
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    ds = load_dataset(
        "bigcode/the-stack-v2-dedup",
        split="train",
        streaming=True,
        token=os.environ.get("HF_TOKEN"),
    ).shuffle(buffer_size=10_000)

    for row in ds:
        try:
            s3_url = f"s3://softwareheritage/content/{row['blob_id']}"
            with smart_open(
                s3_url, "rb", compression=".gz", transport_params={"client": s3}
            ) as fin:
                content = fin.read().decode(row["src_encoding"])
        except Exception:
            continue
        if not content.strip():
            continue
        data = content.encode("utf-8")
        if len(data) > 5_000:
            continue
        yield data
```
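The background prefetching mentioned above isn't shown in this excerpt. One common pattern, sketched here as an assumption about the repo's approach, is a bounded queue filled by a daemon thread so the consumer never blocks on slow downloads:

```python
import queue
import threading

def prefetch(source, maxsize: int = 64):
    """Wrap an iterator with a background thread feeding a bounded queue."""
    q: queue.Queue = queue.Queue(maxsize=maxsize)
    _DONE = object()  # sentinel marking the end of the source iterator

    def worker():
        for item in source:
            q.put(item)  # blocks when the queue is full, bounding memory use
        q.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item

# Hypothetical usage with the generator above:
#   stream = prefetch(stream_stack_v2())
```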

**Score and submit** — the `Submitter` starts the SOMA scoring binary on a GPU, then loops: each iteration scores a data file against one target's models, then checks the verified embedding against **all** open targets. Hits are submitted in the background while scoring continues:

```python
async def _score_and_submit(self, kp):
    client = await SomaClient(
        chain="testnet",
        scoring_url=f"http://localhost:{SCORING_PORT}",
    )

    all_targets = list(await client.get_targets(status="open"))

    # Pick a target and score
    scoring_target = all_targets[sample_num % len(all_targets)]
    manifests = await client.get_model_manifests(scoring_target)

    data_bytes = next(self.data_stream)
    checksum = client.commitment(data_bytes)
    local_url, local_path = save_local(data_bytes, checksum)

    result = await client.score(
        data_url=local_url,
        models=manifests,
        target_embedding=scoring_target.embedding,
        data=data_bytes,
    )

    winner = result.winner
    winning_model_id = scoring_target.model_ids[winner]

    # Check verified embedding against ALL targets (free math)
    for t in all_targets:
        if winning_model_id not in t.model_ids:
            continue
        dist = cosine_distance(result.embedding, t.embedding)
        if dist <= t.distance_threshold:
            # Upload to S3 and submit on-chain
            data_url = upload_to_s3(data_bytes, checksum, t.generation_epoch)
            await client.submit_data(
                signer=kp,
                target_id=t.id,
                data=data_bytes,
                data_url=data_url,
                model_id=winning_model_id,
                embedding=result.embedding,
                distance_score=dist,
                loss_score=result.loss_score,
            )
```

The flow is: score data against one target's models (~7s) → check the verified embedding against all open targets → submit to any that pass the threshold → refresh targets and continue. A single score call can hit multiple targets. Submissions run in the background so the scoring service stays busy.

## Step 2: Train on localnet

Before going to testnet, run a full **train → commit → reveal** cycle on a local test network inside Modal.

Models on SOMA publish weights via a [commit-reveal protocol](https://docs.soma.org/concepts/models/#publishing-weights). You train weights, commit an encrypted copy, wait for the next [epoch](https://docs.soma.org/concepts/network/#epochs), then reveal the decryption key. This prevents front-running.
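The binding property behind this scheme can be illustrated with a toy hash commitment. SOMA's protocol encrypts the weights themselves (see `do_commit()` below); this sketch only shows why a committed value cannot be swapped out after the fact:

```python
import hashlib
import os

def commit(weights: bytes) -> tuple[bytes, bytes]:
    """Commit phase: publish a binding hash, keep the key secret."""
    key = os.urandom(32)  # secret revealed in the next epoch
    commitment = hashlib.sha256(key + weights).digest()
    return commitment, key

def verify_reveal(commitment: bytes, key: bytes, weights: bytes) -> bool:
    """Reveal phase: anyone can check the key and weights match the commitment."""
    return hashlib.sha256(key + weights).digest() == commitment

weights = b"serialized model weights"
c, k = commit(weights)
assert verify_reveal(c, k, weights)          # honest reveal passes
assert not verify_reveal(c, k, b"tampered")  # swapped weights are rejected
```

Because the commitment binds both the key and the weights before anyone can see them, a front-runner cannot copy a committed model, and the committer cannot change the weights after observing the competition.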

```
uv run modal run src/quickstart/training.py::localnet
```

To train with Flax instead of PyTorch, pass `--framework flax`:

```
uv run modal run src/quickstart/training.py::localnet --framework flax
```

This trains a model on an NVIDIA H100 GPU, commits it to the network, advances the epoch, and reveals it — the entire lifecycle in a single run.

### How it works

**Data pipeline** — `make_batches()` streams The Stack v2 and tokenizes each source file into fixed-length byte sequences using the `soma_models` tokenizer:

```python
def make_batches(batch_size: int):
    """Stream shuffled, tokenized batches from The Stack v2."""
    from soma_models.v1.configs import V1_MAX_SEQ_LEN
    from soma_models.v1.tokenizer import tokenize

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    ds = load_dataset(
        "bigcode/the-stack-v2-dedup",
        split="train",
        streaming=True,
        token=os.environ.get("HF_TOKEN"),
    )
    ds = ds.shuffle(buffer_size=SHUFFLE_BUFFER)

    # ... downloads each file from Software Heritage S3

    buffer_ids, buffer_targets = [], []

    for row in ds:
        sequences = tokenize(
            data=row["content"].encode("utf-8"), max_seq_len=V1_MAX_SEQ_LEN
        )
        for seq in sequences:
            buffer_ids.append(seq.token_ids)
            buffer_targets.append(seq.targets)

            if len(buffer_ids) == batch_size:
                yield buffer_ids, buffer_targets
                buffer_ids, buffer_targets = [], []
```
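The `tokenize` helper comes from `soma_models` and isn't shown here. For intuition only, a generic fixed-length byte tokenizer that pairs each position with its next byte as the target might look like this (an illustration, not the actual `soma_models` tokenizer):

```python
def tokenize_bytes(data: bytes, max_seq_len: int):
    """Split raw bytes into (token_ids, next-token targets) sequences.

    Each byte is treated as a token id; the target for position i is the
    byte at position i + 1. A real pipeline would pad or drop the short
    tail sequence.
    """
    ids = list(data)
    sequences = []
    for start in range(0, len(ids) - 1, max_seq_len):
        chunk = ids[start : start + max_seq_len + 1]  # one extra byte for targets
        if len(chunk) < 2:
            break
        sequences.append((chunk[:-1], chunk[1:]))
    return sequences

seqs = tokenize_bytes(b"hello world", max_seq_len=4)
# each pair is (inputs, inputs shifted by one position)
```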

**Training loop** — initializes a model with the SOMA architecture and trains it:

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = Model(ModelConfig(dropout_rate=DROPOUT_RATE)).to(device)
model.train()
sig_reg = SIGReg(SIGRegConfig()).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

batches = make_batches(MICRO_BATCH_SIZE)

for i in range(steps):
    optimizer.zero_grad()
    accum_loss = 0.0

    for _micro in range(grad_accum_steps):
        ids, tgts = next(batches)
        token_ids = torch.tensor(ids, device=device)
        targets = torch.tensor(tgts, device=device)
        loss, embedding = compute_loss(model, sig_reg, token_ids, targets)
        (loss / grad_accum_steps).backward()
        accum_loss += loss.item()

    optimizer.step()
```
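Dividing each micro-batch loss by `grad_accum_steps` before `backward()` makes the accumulated gradient equal the gradient of the mean loss over the full batch. With a scalar model this is easy to verify by hand:

```python
# Scalar "model" with per-micro-batch loss loss_i(w) = (w - t_i)^2,
# so d loss_i / d w = 2 * (w - t_i). Accumulating grad(loss_i / n)
# over n micro-batches equals the gradient of the mean loss.
w = 0.5
targets = [1.0, 2.0, 3.0, 4.0]
n = len(targets)

accum = sum(2 * (w - t) / n for t in targets)  # scaled micro-step gradients
full = 2 * (w - sum(targets) / n)              # gradient of the mean loss
assert abs(accum - full) < 1e-12
```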

The Flax NNX equivalent accumulates gradients across micro-batches explicitly and applies them once per step:

```python
rngs = nnx.Rngs(0)
model = Model(ModelConfig(dropout_rate=DROPOUT_RATE), rngs)
model.train()
sig_reg = SIGReg(SIGRegConfig(), rngs)
optimizer = nnx.Optimizer(
    model, optax.adam(learning_rate=LEARNING_RATE), wrt=nnx.Param
)

@nnx.jit
def micro_step(model, sig_reg, token_ids, targets):
    def loss_fn(model, sig_reg):
        return compute_loss(model, sig_reg, token_ids, targets)
    (loss, embedding), grads = nnx.value_and_grad(loss_fn, has_aux=True)(
        model, sig_reg
    )
    return loss, embedding, grads

batches = make_batches(MICRO_BATCH_SIZE)

for i in range(steps):
    accum_loss = jnp.zeros(())
    accum_grads = None

    for micro in range(grad_accum_steps):
        ids, tgts = next(batches)
        loss, embedding, grads = micro_step(
            model, sig_reg, jnp.array(ids), jnp.array(tgts)
        )
        accum_loss = accum_loss + loss
        accum_grads = grads if accum_grads is None else jax.tree.map(
            jnp.add, accum_grads, grads
        )

    accum_grads = jax.tree.map(lambda g: g / grad_accum_steps, accum_grads)
    optimizer.update(model, accum_grads)
```

After training, the model saves a checkpoint and training artifacts (embedding + serialized weights) so the commit step can run on CPU without the model framework.

**Commit** — `do_commit()` loads the training artifacts, encrypts the weights, uploads them, and commits to the network:

```python
async def do_commit(localnet=True, model_dir=MODEL_DIR, vol=None):
    # Load training artifacts (embedding + weights bytes)
    artifacts = load_training_artifacts(model_dir, ckpt_step)
    embedding, weights_bytes = artifacts

    # Encrypt with a fresh key
    encrypted, decryption_key = SomaClient.encrypt_weights(weights_bytes)

    # Upload (to localnet file server or S3)
    weights_url = upload_weights(encrypted, f"model-step-{ckpt_step}", current_epoch)

    # Create model on first commit
    signer = Keypair.from_secret_key(os.environ["SOMA_SECRET_KEY"])

    if state["model_id"] is None:
        model_id = await client.create_model(
            signer=signer,
            commission_rate=1000,  # 10%
        )
        state["model_id"] = model_id

    # Commit weights to the network
    await client.commit_model(
        signer=signer,
        model_id=state["model_id"],
        weights_url=weights_url,
        encrypted_weights=encrypted,
        decryption_key=decryption_key,
        embedding=embedding,
    )

    state["pending_reveal"] = True
    state["commit_epoch"] = current_epoch
```

**Reveal** — `do_reveal()` waits for the epoch to advance past the commit epoch, then reveals the decryption key and embedding:

```python
async def do_reveal(localnet=True, model_dir=MODEL_DIR, vol=None):
    # Check if epoch has advanced
    if current_epoch <= state["commit_epoch"]:
        print("Epoch hasn't advanced yet — will retry later.")
        return None

    # Reveal
    await client.reveal_model(
        signer=signer,
        model_id=state["model_id"],
        decryption_key=state["decryption_key"],
        embedding=state["embedding"],
    )

    state["pending_reveal"] = False
```

**Full cycle** — the `LocalnetTrainer` runs all three steps in sequence, advancing the epoch in between:

```python
@modal.method()
async def run(self, steps: int = 500, framework: str = "torch"):
    """Train → create → commit → advance epoch → reveal."""
    await do_training(
        steps, framework=framework,
        model_dir=LOCALNET_MODEL_DIR, vol=localnet_volume,
        grad_accum_steps=4, log_every=1,
    )
    state = await do_commit(
        localnet=True, model_dir=LOCALNET_MODEL_DIR, vol=localnet_volume
    )

    # Advance epoch so reveal is possible
    client = await SomaClient(chain="localnet")
    new_epoch = await client.advance_epoch()

    state = await do_reveal(
        localnet=True, model_dir=LOCALNET_MODEL_DIR, vol=localnet_volume,
        state=state,
    )
```

On localnet, `advance_epoch()` instantly moves to the next epoch. On testnet, epochs advance every 24 hours, so you'll need to automate your training flow.
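One way to automate testnet reveals is a scheduled job that polls until the epoch has advanced. A minimal sketch of the retry helper; the epoch-check predicate in the usage comment is hypothetical, not part of the SOMA client API:

```python
import asyncio

async def wait_for(predicate, poll_seconds: float = 600.0):
    """Poll an async predicate until it returns True, sleeping in between."""
    while not await predicate():
        await asyncio.sleep(poll_seconds)

# Hypothetical usage — epoch_has_advanced is an assumed helper that compares
# the chain's current epoch against state["commit_epoch"]:
#   await wait_for(lambda: epoch_has_advanced(client, state["commit_epoch"]))
#   await do_reveal(localnet=False, ...)
```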

## Next steps

[Continuous Training](https://docs.soma.org/guides/model-development/)

[Model Strategies](https://docs.soma.org/guides/model-strategies/)

[Data Strategies](https://docs.soma.org/guides/data-strategies/)

[Claim Rewards](https://docs.soma.org/guides/claim-rewards/)