Quickstart
SOMA trains a foundation model through competition. Participants independently train small models — all sharing the same architecture — and compete on a universal objective: given any data, predict what comes next.
You earn SOMA by submitting data or by training a model. Every 24 hours the network generates targets — random points in embedding space, each representing a domain the network wants to benchmark. A subset of models is assigned to each target. The first valid data submission wins, and rewards split between the submitter and the model with the lowest loss.
This quickstart walks through both sides: submitting data against targets, then running a full model training cycle. See How SOMA Works for the full architecture.
Full source: github.com/soma-org/quickstart
Clone and configure
1. Clone the repo and install dependencies:

   ```shell
   git clone https://github.com/soma-org/quickstart.git
   cd quickstart
   uv sync
   uv run modal setup
   ```

   This installs all dependencies and authenticates with Modal.

2. Copy the `.env` example:

   ```shell
   cp .env.example .env
   ```

3. Export your private key and fill in `SOMA_SECRET_KEY` in `.env`:

   ```shell
   soma wallet export
   ```

4. Create a HuggingFace access token here and fill in `HF_TOKEN` in `.env`.

5. Approve access to the gated Stack v2 dataset.

6. Set up an S3-compatible bucket for uploading model weights and submission data. Cloudflare R2 is the simplest option (no IAM, free egress):

   - Create a Cloudflare account and go to R2 Object Storage → Overview. Activate R2 ($0/mo with 10 GB free).
   - Create a bucket (e.g. `soma-data`).
   - Enable public access: select your bucket → Settings → Public Development URL → enable the `r2.dev` subdomain. Copy the public URL.
   - Go back to R2 Overview → Manage API Tokens. Create a token with Object Read & Write permissions. Copy the Access Key ID and Secret Access Key.

   Fill in your `.env`:

   ```shell
   S3_BUCKET=soma-data
   S3_ACCESS_KEY_ID=<your-key-id>
   S3_SECRET_ACCESS_KEY=<your-secret-key>
   S3_ENDPOINT_URL=https://<account-id>.r2.cloudflarestorage.com
   S3_PUBLIC_URL=https://pub-xxx.r2.dev
   ```

7. Push secrets to Modal:

   ```shell
   uv run create-secrets
   ```
Step 1: Submit data
To fill a target, you need data whose embedding lands close to it. Scoring runs your data through the target’s assigned models — each model produces an embedding for the data and a loss score. If the distance between the data’s embedding and the target is within the distance threshold, the submission is valid.
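In code, the validity check is a single cosine-distance comparison. A minimal sketch with made-up 3-dimensional embeddings (real embeddings are far higher-dimensional, the quickstart repo ships its own `cosine_distance` helper, and the network performs this check itself during verification):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def is_valid(data_embedding, target_embedding, distance_threshold):
    """A submission is valid when the data's embedding lands
    within the target's distance threshold."""
    return cosine_distance(data_embedding, target_embedding) <= distance_threshold

# Hypothetical embeddings for illustration only
print(is_valid([1.0, 0.0, 0.1], [1.0, 0.1, 0.0], distance_threshold=0.05))  # True
```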
The submitter automates this: it streams source code from The Stack v2, scores each file against open targets, and submits when one lands within threshold.
```shell
uv run modal run src/quickstart/submitter.py
```

This starts the SOMA scoring service on an L4 GPU, streams shuffled source files, scores them against the models assigned to open targets, and, if a file’s embedding is close enough, uploads it to R2 and submits to the network. The first score may take a few minutes while model weights are downloaded into the scoring service.
How it works
Data source — `stream_stack_v2()` streams shuffled source files from The Stack v2. It uses HuggingFace `datasets` to get file metadata, downloads the actual source from Software Heritage’s S3, decodes to UTF-8, and filters out files over 5 KB. A background thread prefetches data so samples are ready when the scoring service finishes:

```python
def stream_stack_v2():
    """Yield shuffled source files from The Stack v2 as UTF-8 bytes."""
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    ds = load_dataset(
        "bigcode/the-stack-v2-dedup",
        split="train",
        streaming=True,
        token=os.environ.get("HF_TOKEN"),
    ).shuffle(buffer_size=10_000)

    for row in ds:
        try:
            s3_url = f"s3://softwareheritage/content/{row['blob_id']}"
            with smart_open(
                s3_url, "rb", compression=".gz", transport_params={"client": s3}
            ) as fin:
                content = fin.read().decode(row["src_encoding"])
        except Exception:
            continue
        if not content.strip():
            continue
        data = content.encode("utf-8")
        if len(data) > 5_000:
            continue
        yield data
```

Score and submit — the `Submitter` starts the SOMA scoring binary on a GPU, then loops: each iteration scores a data file against one target’s models, then checks the verified embedding against all open targets. Hits are submitted in the background while scoring continues:
```python
async def _score_and_submit(self, kp):
    client = await SomaClient(
        chain="testnet",
        scoring_url=f"http://localhost:{SCORING_PORT}",
    )

    all_targets = list(await client.get_targets(status="open"))

    # Pick a target and score
    scoring_target = all_targets[sample_num % len(all_targets)]
    manifests = await client.get_model_manifests(scoring_target)

    data_bytes = next(self.data_stream)
    checksum = client.commitment(data_bytes)
    local_url, local_path = save_local(data_bytes, checksum)

    result = await client.score(
        data_url=local_url,
        models=manifests,
        target_embedding=scoring_target.embedding,
        data=data_bytes,
    )

    winner = result.winner
    winning_model_id = scoring_target.model_ids[winner]

    # Check verified embedding against ALL targets (free math)
    for t in all_targets:
        if winning_model_id not in t.model_ids:
            continue
        dist = cosine_distance(result.embedding, t.embedding)
        if dist <= t.distance_threshold:
            # Upload to S3 and submit on-chain
            data_url = upload_to_s3(data_bytes, checksum, t.generation_epoch)
            await client.submit_data(
                signer=kp,
                target_id=t.id,
                data=data_bytes,
                data_url=data_url,
                model_id=winning_model_id,
                embedding=result.embedding,
                distance_score=dist,
                loss_score=result.loss_score,
            )
```

The flow is: score data against one target’s models (~7s) → check the verified embedding against all open targets → submit to any that pass the threshold → refresh targets and continue. A single score call can hit multiple targets. Submissions run in the background so the scoring service stays busy.
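The submit-in-the-background pattern is plain `asyncio`: create a task for the submission and move on to the next scoring iteration without awaiting it. A sketch with a stubbed `submit` coroutine standing in for `client.submit_data` (the names here are illustrative, not the quickstart’s API):

```python
import asyncio

async def submit(target_id: str) -> str:
    """Stub for the slow on-chain submission round-trip."""
    await asyncio.sleep(0.05)
    return f"submitted {target_id}"

async def score_loop():
    tasks = []
    for i in range(3):
        # Fire off the submission without awaiting it, so the
        # next scoring iteration starts immediately.
        tasks.append(asyncio.create_task(submit(f"target-{i}")))
        # ... scoring work for the next sample would happen here ...
    # Drain outstanding submissions before shutting down.
    return await asyncio.gather(*tasks)

results = asyncio.run(score_loop())
print(results)
```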
Step 2: Train on localnet
Before going to testnet, run a full train → commit → reveal cycle on a local test network inside Modal.
Models on SOMA publish weights via a commit-reveal protocol. You train weights, commit an encrypted copy, wait for the next epoch, then reveal the decryption key. This prevents front-running.
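The binding property behind commit-reveal can be shown with a plain hash commitment. This is a simplification (SOMA commits encrypted weights and later reveals a decryption key, not a bare hash), but the mechanics are the same: the commit fixes your weights before anyone can see them, so nobody can copy or front-run them.

```python
import hashlib
import os

def commit(weights: bytes) -> tuple[bytes, bytes]:
    """Commit phase: publish H(key || weights); keep the key secret."""
    key = os.urandom(32)
    return hashlib.sha256(key + weights).digest(), key

def verify_reveal(commitment: bytes, key: bytes, weights: bytes) -> bool:
    """Reveal phase: anyone can check the commitment binds these weights."""
    return hashlib.sha256(key + weights).digest() == commitment

weights = b"serialized model weights"
commitment, key = commit(weights)                      # epoch N: commit
assert verify_reveal(commitment, key, weights)         # epoch N+1: reveal checks out
assert not verify_reveal(commitment, key, b"other")    # tampered weights fail
```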
```shell
uv run modal run src/quickstart/training.py::localnet

# or, with the Flax framework:
uv run modal run src/quickstart/training.py::localnet --framework flax
```

This trains a model on an NVIDIA H100 GPU, commits it to the network, advances the epoch, and reveals it — the entire lifecycle in a single run.
How it works
Data pipeline — `make_batches()` streams The Stack v2 and tokenizes each source file into fixed-length byte sequences using the `soma_models` tokenizer:
```python
def make_batches(batch_size: int):
    """Stream shuffled, tokenized batches from The Stack v2."""
    from soma_models.v1.configs import V1_MAX_SEQ_LEN
    from soma_models.v1.tokenizer import tokenize

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    ds = load_dataset(
        "bigcode/the-stack-v2-dedup",
        split="train",
        streaming=True,
        token=os.environ.get("HF_TOKEN"),
    )
    ds = ds.shuffle(buffer_size=SHUFFLE_BUFFER)

    # ... downloads each file from Software Heritage S3

    buffer_ids, buffer_targets = [], []

    for row in ds:
        sequences = tokenize(
            data=row["content"].encode("utf-8"), max_seq_len=V1_MAX_SEQ_LEN
        )
        for seq in sequences:
            buffer_ids.append(seq.token_ids)
            buffer_targets.append(seq.targets)

            if len(buffer_ids) == batch_size:
                yield buffer_ids, buffer_targets
                buffer_ids, buffer_targets = [], []
```

Training loop — initializes a model with the SOMA architecture and trains it:
PyTorch (the default):

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = Model(ModelConfig(dropout_rate=DROPOUT_RATE)).to(device)
model.train()
sig_reg = SIGReg(SIGRegConfig()).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

batches = make_batches(MICRO_BATCH_SIZE)

for i in range(steps):
    optimizer.zero_grad()
    accum_loss = 0.0

    for _micro in range(grad_accum_steps):
        ids, tgts = next(batches)
        token_ids = torch.tensor(ids, device=device)
        targets = torch.tensor(tgts, device=device)
        loss, embedding = compute_loss(model, sig_reg, token_ids, targets)
        (loss / grad_accum_steps).backward()
        accum_loss += loss.item()

    optimizer.step()
```

Flax (with `--framework flax`):

```python
rngs = nnx.Rngs(0)
model = Model(ModelConfig(dropout_rate=DROPOUT_RATE), rngs)
model.train()
sig_reg = SIGReg(SIGRegConfig(), rngs)
optimizer = nnx.Optimizer(
    model, optax.adam(learning_rate=LEARNING_RATE), wrt=nnx.Param
)

@nnx.jit
def micro_step(model, sig_reg, token_ids, targets):
    def loss_fn(model, sig_reg):
        return compute_loss(model, sig_reg, token_ids, targets)

    (loss, embedding), grads = nnx.value_and_grad(loss_fn, has_aux=True)(
        model, sig_reg
    )
    return loss, embedding, grads

batches = make_batches(MICRO_BATCH_SIZE)

for i in range(steps):
    accum_loss = jnp.zeros(())
    accum_grads = None

    for micro in range(grad_accum_steps):
        ids, tgts = next(batches)
        loss, embedding, grads = micro_step(
            model, sig_reg, jnp.array(ids), jnp.array(tgts)
        )
        accum_loss = accum_loss + loss
        accum_grads = grads if accum_grads is None else jax.tree.map(
            jnp.add, accum_grads, grads
        )

    accum_grads = jax.tree.map(lambda g: g / grad_accum_steps, accum_grads)
    optimizer.update(model, accum_grads)
```

After training, the model saves a checkpoint and training artifacts (embedding + serialized weights) so the commit step can run on CPU without the model framework.
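One way to keep the artifacts framework-free is to dump everything to plain NumPy arrays at save time. A hedged sketch: the helper name and file layout here are mine, and the repo’s actual artifact format may differ.

```python
import io
import numpy as np

def save_training_artifacts(path_prefix: str, embedding, state) -> bytes:
    """Serialize embedding + weights as raw NumPy arrays so the
    commit step can load them without torch or flax installed."""
    np.save(f"{path_prefix}.embedding.npy", np.asarray(embedding))
    buf = io.BytesIO()
    np.savez(buf, **{name: np.asarray(w) for name, w in state.items()})
    weights_bytes = buf.getvalue()
    with open(f"{path_prefix}.weights.npz", "wb") as f:
        f.write(weights_bytes)
    return weights_bytes  # same bytes the commit step encrypts
```

The commit step can then read both files back with `np.load` alone, with no model framework on the machine.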
Commit — do_commit() loads the training artifacts, encrypts the weights, uploads them, and commits to the network:
```python
async def do_commit(localnet=True, model_dir=MODEL_DIR, vol=None):
    # Load training artifacts (embedding + weights bytes)
    artifacts = load_training_artifacts(model_dir, ckpt_step)
    embedding, weights_bytes = artifacts

    # Encrypt with a fresh key
    encrypted, decryption_key = SomaClient.encrypt_weights(weights_bytes)

    # Upload (to localnet file server or S3)
    weights_url = upload_weights(encrypted, f"model-step-{ckpt_step}", current_epoch)

    # Create model on first commit
    signer = Keypair.from_secret_key(os.environ["SOMA_SECRET_KEY"])

    if state["model_id"] is None:
        model_id = await client.create_model(
            signer=signer,
            commission_rate=1000,  # 10%
        )
        state["model_id"] = model_id

    # Commit weights to the network
    await client.commit_model(
        signer=signer,
        model_id=state["model_id"],
        weights_url=weights_url,
        encrypted_weights=encrypted,
        decryption_key=decryption_key,
        embedding=embedding,
    )

    state["pending_reveal"] = True
    state["commit_epoch"] = current_epoch
```

Reveal — `do_reveal()` waits for the epoch to advance past the commit epoch, then reveals the decryption key and embedding:
```python
async def do_reveal(localnet=True, model_dir=MODEL_DIR, vol=None):
    # Check if epoch has advanced
    if current_epoch <= state["commit_epoch"]:
        print("Epoch hasn't advanced yet — will retry later.")
        return None

    # Reveal
    await client.reveal_model(
        signer=signer,
        model_id=state["model_id"],
        decryption_key=state["decryption_key"],
        embedding=state["embedding"],
    )

    state["pending_reveal"] = False
```

Full cycle — the `LocalnetTrainer` runs all three steps in sequence, advancing the epoch in between:
```python
@modal.method()
async def run(self, steps: int = 500, framework: str = "torch"):
    """Train → create → commit → advance epoch → reveal."""
    await do_training(
        steps,
        framework=framework,
        model_dir=LOCALNET_MODEL_DIR,
        vol=localnet_volume,
        grad_accum_steps=4,
        log_every=1,
    )
    state = await do_commit(
        localnet=True, model_dir=LOCALNET_MODEL_DIR, vol=localnet_volume
    )

    # Advance epoch so reveal is possible
    client = await SomaClient(chain="localnet")
    new_epoch = await client.advance_epoch()

    state = await do_reveal(
        localnet=True,
        model_dir=LOCALNET_MODEL_DIR,
        vol=localnet_volume,
        state=state,
    )
```

On localnet, `advance_epoch()` instantly moves to the next epoch. On testnet, epochs advance every 24 hours, so you’ll need to automate your training flow.
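A minimal automation sketch: commit, poll until the epoch advances, then reveal. The callables below are stubs standing in for the real client calls (on testnet you only observe the epoch; the fake epoch clock here is for illustration):

```python
import time

def run_cycle(get_epoch, do_commit, do_reveal, poll_seconds: float = 0.0):
    """Commit in the current epoch, then poll until the epoch
    advances past it and reveal. Stubs stand in for the SOMA client."""
    commit_epoch = get_epoch()
    do_commit()
    while get_epoch() <= commit_epoch:
        time.sleep(poll_seconds)  # on testnet, poll every few minutes
    do_reveal()

# Fake epoch clock: stays at 5 for two polls, then advances
ticks = iter([5, 5, 5, 6])
events = []
run_cycle(
    get_epoch=lambda: next(ticks),
    do_commit=lambda: events.append("commit"),
    do_reveal=lambda: events.append("reveal"),
)
print(events)  # ['commit', 'reveal']
```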