Quickstart

SOMA trains a foundation model through competition. Participants independently train small models — all sharing the same architecture — and compete on a universal objective: given any data, predict what comes next.

You earn SOMA by submitting data or by training a model. Every 24 hours the network generates targets — random points in embedding space, each representing a domain the network wants to benchmark. A subset of models is assigned to each target. The first valid data submission wins, and the reward is split between the submitter and the model with the lowest loss.

This quickstart walks through both sides: submitting data against targets, then running a full model training cycle. See How SOMA Works for the full architecture.

Full source: github.com/soma-org/quickstart

  1. Clone the repo and install dependencies:

    git clone https://github.com/soma-org/quickstart.git
    cd quickstart
    uv sync
    uv run modal setup

    This installs all dependencies and authenticates with Modal.

  2. Copy the .env example:

    cp .env.example .env
  3. Export your private key and fill in SOMA_SECRET_KEY in .env:

    soma wallet export
  4. Create a HuggingFace access token (under your HuggingFace account settings) and fill in HF_TOKEN in .env.

  5. Approve access to the gated Stack v2 dataset.

  6. Set up an S3-compatible bucket for uploading model weights and submission data. Cloudflare R2 is the simplest option (no IAM, free egress):

    1. Create a Cloudflare account and go to R2 Object Storage → Overview. Activate R2 ($0/mo with 10 GB free).
    2. Create a bucket (e.g. soma-data).
    3. Enable public access: select your bucket → Settings → Public Development URL → enable the r2.dev subdomain. Copy the public URL.
    4. Go back to R2 Overview → Manage API Tokens. Create a token with Object Read & Write permissions. Copy the Access Key ID and Secret Access Key.

    Fill in your .env:

    S3_BUCKET=soma-data
    S3_ACCESS_KEY_ID=<your-key-id>
    S3_SECRET_ACCESS_KEY=<your-secret-key>
    S3_ENDPOINT_URL=https://<account-id>.r2.cloudflarestorage.com
    S3_PUBLIC_URL=https://pub-xxx.r2.dev
  7. Push secrets to Modal:

    uv run create-secrets
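
If uploads fail later, a quick way to confirm the S3_* values from step 6 are correct is a standalone boto3 round-trip. This is a sketch, not part of the quickstart repo; it assumes python-dotenv is installed to read .env:

import os

import boto3
from dotenv import load_dotenv

load_dotenv()  # pull the S3_* values from .env

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT_URL"],
    aws_access_key_id=os.environ["S3_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["S3_SECRET_ACCESS_KEY"],
)

# Round-trip a tiny object, then print the URL it should be publicly served from.
s3.put_object(Bucket=os.environ["S3_BUCKET"], Key="healthcheck.txt", Body=b"ok")
print(f"uploaded: {os.environ['S3_PUBLIC_URL']}/healthcheck.txt")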

To fill a target, you need data whose embedding lands close to it. Scoring runs your data through the target’s assigned models — each model produces an embedding for the data and a loss score. If the distance between the data’s embedding and the target is within the distance threshold, the submission is valid.
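
Concretely, the validity check is just a distance comparison. A minimal sketch with numpy, assuming cosine distance (which is what the submitter below uses via cosine_distance):

import numpy as np

def is_within_threshold(data_emb: np.ndarray, target_emb: np.ndarray,
                        distance_threshold: float) -> bool:
    # Hypothetical helper: cosine distance = 1 - cosine similarity.
    cos_sim = float(np.dot(data_emb, target_emb)
                    / (np.linalg.norm(data_emb) * np.linalg.norm(target_emb)))
    return (1.0 - cos_sim) <= distance_threshold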

The submitter automates this: it streams source code from The Stack v2, scores each file against open targets, and submits when one lands within threshold.

uv run modal run src/quickstart/submitter.py

This starts the SOMA scoring service on an L4 GPU, streams shuffled source files, scores them against the models assigned to open targets, and if a file’s embedding is close enough, uploads it to R2 and submits to the network. The first score may take a few minutes while model weights are downloaded into the scoring service.

Data source — stream_stack_v2() streams shuffled source files from The Stack v2. It uses HuggingFace datasets to get file metadata, downloads the actual source from Software Heritage's S3, decodes to UTF-8, and skips empty files and anything larger than 5 KB. A background thread prefetches data so samples are ready when the scoring service finishes a score:

import os

import boto3
from botocore import UNSIGNED
from botocore.config import Config
from datasets import load_dataset
from smart_open import open as smart_open

def stream_stack_v2():
    """Yield shuffled source files from The Stack v2 as UTF-8 bytes."""
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    ds = load_dataset(
        "bigcode/the-stack-v2-dedup",
        split="train",
        streaming=True,
        token=os.environ.get("HF_TOKEN"),
    ).shuffle(buffer_size=10_000)
    for row in ds:
        try:
            s3_url = f"s3://softwareheritage/content/{row['blob_id']}"
            with smart_open(
                s3_url, "rb", compression=".gz", transport_params={"client": s3}
            ) as fin:
                content = fin.read().decode(row["src_encoding"])
        except Exception:
            continue
        if not content.strip():
            continue
        data = content.encode("utf-8")
        if len(data) > 5_000:
            continue
        yield data
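
The prefetching thread mentioned above is not shown in the excerpt. One plausible shape for it, as a sketch (the repo's actual implementation may differ): a daemon thread fills a bounded queue so the next sample is ready the moment the scoring service frees up.

import threading
from queue import Queue

def prefetch(gen, maxsize: int = 8):
    # Sketch: pull items from `gen` on a background thread into a bounded
    # queue; `put` blocks when the queue is full, so memory stays bounded.
    q: Queue = Queue(maxsize=maxsize)

    def worker():
        for item in gen:
            q.put(item)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        yield q.get()

data_stream = prefetch(stream_stack_v2())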

Score and submit — the Submitter starts the SOMA scoring binary on a GPU, then loops: each iteration scores a data file against one target’s models, then checks the verified embedding against all open targets. Hits are submitted in the background while scoring continues:

async def _score_and_submit(self, kp):
    client = await SomaClient(
        chain="testnet",
        scoring_url=f"http://localhost:{SCORING_PORT}",
    )
    all_targets = list(await client.get_targets(status="open"))
    # Pick a target and score
    scoring_target = all_targets[sample_num % len(all_targets)]
    manifests = await client.get_model_manifests(scoring_target)
    data_bytes = next(self.data_stream)
    checksum = client.commitment(data_bytes)
    local_url, local_path = save_local(data_bytes, checksum)
    result = await client.score(
        data_url=local_url,
        models=manifests,
        target_embedding=scoring_target.embedding,
        data=data_bytes,
    )
    winner = result.winner
    winning_model_id = scoring_target.model_ids[winner]
    # Check verified embedding against ALL targets (free math)
    for t in all_targets:
        if winning_model_id not in t.model_ids:
            continue
        dist = cosine_distance(result.embedding, t.embedding)
        if dist <= t.distance_threshold:
            # Upload to S3 and submit on-chain
            data_url = upload_to_s3(data_bytes, checksum, t.generation_epoch)
            await client.submit_data(
                signer=kp,
                target_id=t.id,
                data=data_bytes,
                data_url=data_url,
                model_id=winning_model_id,
                embedding=result.embedding,
                distance_score=dist,
                loss_score=result.loss_score,
            )

The flow is: score data against one target’s models (~7s) → check the verified embedding against all open targets → submit to any that pass the threshold → refresh targets and continue. A single score call can hit multiple targets. Submissions run in the background so the scoring service stays busy.

Before going to testnet, run a full train → commit → reveal cycle on a local test network inside Modal.

Models on SOMA publish weights via a commit-reveal protocol. You train weights, commit an encrypted copy, wait for the next epoch, then reveal the decryption key. This prevents front-running.
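
The scheme is standard commit-reveal: until you publish the key, the network holds only ciphertext, so nobody can copy your weights and resubmit them as their own. A minimal illustration using Fernet from the cryptography package (a sketch of the idea only; SOMA's actual scheme is whatever SomaClient.encrypt_weights implements, shown later in do_commit()):

from cryptography.fernet import Fernet

weights_bytes = b"..."  # placeholder for your serialized model weights

# Commit phase: publish only the ciphertext.
key = Fernet.generate_key()
ciphertext = Fernet(key).encrypt(weights_bytes)

# Reveal phase (next epoch): publish `key`; anyone can decrypt and verify.
assert Fernet(key).decrypt(ciphertext) == weights_bytes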

uv run modal run src/quickstart/training.py::localnet

This trains a model on an NVIDIA H100 GPU, commits it to the network, advances the epoch, and reveals it — the entire lifecycle in a single run.

Data pipeline — make_batches() streams The Stack v2 and tokenizes each source file into fixed-length byte sequences using the soma_models tokenizer:

def make_batches(batch_size: int):
    """Stream shuffled, tokenized batches from The Stack v2."""
    from soma_models.v1.configs import V1_MAX_SEQ_LEN
    from soma_models.v1.tokenizer import tokenize

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    ds = load_dataset(
        "bigcode/the-stack-v2-dedup",
        split="train",
        streaming=True,
        token=os.environ.get("HF_TOKEN"),
    )
    ds = ds.shuffle(buffer_size=SHUFFLE_BUFFER)
    # ... downloads each file from Software Heritage S3
    buffer_ids, buffer_targets = [], []
    for row in ds:
        sequences = tokenize(
            data=row["content"].encode("utf-8"), max_seq_len=V1_MAX_SEQ_LEN
        )
        for seq in sequences:
            buffer_ids.append(seq.token_ids)
            buffer_targets.append(seq.targets)
            if len(buffer_ids) == batch_size:
                yield buffer_ids, buffer_targets
                buffer_ids, buffer_targets = [], []

Training loop — initializes a model with the SOMA architecture and trains it:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Model(ModelConfig(dropout_rate=DROPOUT_RATE)).to(device)
model.train()
sig_reg = SIGReg(SIGRegConfig()).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
batches = make_batches(MICRO_BATCH_SIZE)

for i in range(steps):
    optimizer.zero_grad()
    accum_loss = 0.0
    for _micro in range(grad_accum_steps):
        ids, tgts = next(batches)
        token_ids = torch.tensor(ids, device=device)
        targets = torch.tensor(tgts, device=device)
        loss, embedding = compute_loss(model, sig_reg, token_ids, targets)
        (loss / grad_accum_steps).backward()
        accum_loss += loss.item()
    optimizer.step()
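
The inner loop is plain gradient accumulation: each backward() call adds into the same .grad buffers, and scaling the loss by 1/grad_accum_steps makes the update equivalent to a single batch of MICRO_BATCH_SIZE × grad_accum_steps sequences while only one micro-batch sits in GPU memory at a time.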

After training, the model saves a checkpoint and training artifacts (embedding + serialized weights) so the commit step can run on CPU without the model framework.
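
One plausible shape for those artifacts, matching the (embedding, weights) pair that do_commit() below unpacks (a sketch; the helper name and file layout are hypothetical):

import pickle

def save_training_artifacts(model_dir: str, step: int, embedding, weights_bytes: bytes):
    # Hypothetical helper: store the embedding as a plain list and the weights
    # as raw bytes, so the commit step can unpickle them without torch.
    with open(f"{model_dir}/artifacts-{step}.pkl", "wb") as f:
        pickle.dump((embedding, weights_bytes), f)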

Commit — do_commit() loads the training artifacts, encrypts the weights, uploads them, and commits to the network:

async def do_commit(localnet=True, model_dir=MODEL_DIR, vol=None):
    # Load training artifacts (embedding + weights bytes)
    artifacts = load_training_artifacts(model_dir, ckpt_step)
    embedding, weights_bytes = artifacts
    # Encrypt with a fresh key
    encrypted, decryption_key = SomaClient.encrypt_weights(weights_bytes)
    # Upload (to localnet file server or S3)
    weights_url = upload_weights(encrypted, f"model-step-{ckpt_step}", current_epoch)
    # Create model on first commit
    signer = Keypair.from_secret_key(os.environ["SOMA_SECRET_KEY"])
    if state["model_id"] is None:
        model_id = await client.create_model(
            signer=signer,
            commission_rate=1000,  # 10%
        )
        state["model_id"] = model_id
    # Commit weights to the network
    await client.commit_model(
        signer=signer,
        model_id=state["model_id"],
        weights_url=weights_url,
        encrypted_weights=encrypted,
        decryption_key=decryption_key,
        embedding=embedding,
    )
    state["pending_reveal"] = True
    state["commit_epoch"] = current_epoch

Reveal — do_reveal() waits for the epoch to advance past the commit epoch, then reveals the decryption key and embedding:

async def do_reveal(localnet=True, model_dir=MODEL_DIR, vol=None):
    # Check if epoch has advanced
    if current_epoch <= state["commit_epoch"]:
        print("Epoch hasn't advanced yet — will retry later.")
        return None
    # Reveal
    await client.reveal_model(
        signer=signer,
        model_id=state["model_id"],
        decryption_key=state["decryption_key"],
        embedding=state["embedding"],
    )
    state["pending_reveal"] = False

Full cycle — the LocalnetTrainer runs all three steps in sequence, advancing the epoch in between:

@modal.method()
async def run(self, steps: int = 500, framework: str = "torch"):
    """Train → create → commit → advance epoch → reveal."""
    await do_training(
        steps, framework=framework,
        model_dir=LOCALNET_MODEL_DIR, vol=localnet_volume,
        grad_accum_steps=4, log_every=1,
    )
    state = await do_commit(
        localnet=True, model_dir=LOCALNET_MODEL_DIR, vol=localnet_volume
    )
    # Advance epoch so reveal is possible
    client = await SomaClient(chain="localnet")
    new_epoch = await client.advance_epoch()
    state = await do_reveal(
        localnet=True, model_dir=LOCALNET_MODEL_DIR, vol=localnet_volume,
        state=state,
    )

On localnet, advance_epoch() instantly moves to the next epoch. On testnet, epochs advance every 24 hours, so you’ll need to automate your training flow.
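
One way to automate it, sketched on top of the helpers above (hypothetical; a Modal scheduled function would work just as well, and the exact keyword arguments may differ from the repo's):

import asyncio

async def testnet_loop(steps: int = 500):
    while True:
        await do_training(steps, model_dir=MODEL_DIR)
        state = await do_commit(localnet=False, model_dir=MODEL_DIR)
        # do_reveal() returns None until the epoch has advanced past the commit.
        while await do_reveal(localnet=False, model_dir=MODEL_DIR, state=state) is None:
            await asyncio.sleep(3600)  # poll hourly; testnet epochs are 24 h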