Advanced Strategies

This page collects strategies for data submitters and model developers. These are ideas and approaches — not step-by-step tutorials. Mix and match based on what targets are available and where you see opportunity.

The Quickstart shows the simplest path: stream a HuggingFace dataset, filter, score, submit. Here are other ways to find data that scores well.

Use a language model to generate text optimized for specific target embeddings: generate diverse candidates across many domains (science, history, code, math, literature), score each, and submit the best.

  • Vary the prompt — ask for different domains, styles, and levels of technicality. Targets span the full embedding space.
  • Generate in batches of 20–50 candidates per target. LLM API calls are cheap relative to scoring.
  • Iterate on close hits — if a candidate is near the threshold, generate more text on the same topic.
  • Try different LLMs. Different models produce different text distributions.
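A minimal sketch of the rank-and-select step, assuming candidates have already been generated. The `embed` function here is a toy stand-in; real scoring would go through the network's embedding model:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(candidates, embed, target_embedding, k=5):
    """Rank generated texts by similarity to the target embedding."""
    scored = [(cosine(embed(text), target_embedding), text) for text in candidates]
    scored.sort(reverse=True)
    return scored[:k]

# Toy stand-in: a real pipeline would call the network's embedder instead.
def embed(text):
    return [sum(c.isalpha() for c in text),
            sum(c.isdigit() for c in text),
            len(text)]

candidates = ["quantum error correction codes",
              "recipe with 3 eggs and 2 cups flour",
              "def fib(n): return n if n < 2 else fib(n-1)+fib(n-2)"]
target = embed("surface codes for quantum error correction")
best = top_k(candidates, embed, target, k=1)
```

Candidates that land near the threshold are the ones worth iterating on with more generations in the same domain.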

Scrape web pages, compute embedding similarity to the target, and score the best candidates. Common Crawl, domain-specific sites, and forums are all valid sources.

Code repositories, research papers, multilingual text, and structured data are all valid. The model operates on raw bytes, not tokens — so non-English text, source code, and binary formats can score well, especially in under-explored embedding regions with lower competition.
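Since the model sees UTF-8 bytes rather than tokens, the byte footprint of a document depends on its script. A quick check:

```python
samples = {
    "english": "gradient descent",      # ASCII: 1 byte per character
    "russian": "градиентный спуск",     # Cyrillic: 2 bytes per character
    "chinese": "梯度下降",               # CJK: 3 bytes per character
}

for name, text in samples.items():
    raw = text.encode("utf-8")
    print(name, len(text), "chars ->", len(raw), "bytes")
```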

Beyond the datasets in the Quickstart (FineWeb, Wikipedia, StarCoder), consider:

  • allenai/c4 — cleaned web text (~300 GB)
  • togethercomputer/RedPajama-Data-V2 — diverse web, books, code (~30T tokens)
  • Specialized datasets for code, math, or multilingual text

Every data submission is publicly accessible via its data URL. Every revealed model’s weights are downloadable. Use both to improve your model.

Fetch data from recently filled targets — this is real data that scored well against at least one model:

from soma_sdk import SomaClient

client = await SomaClient(chain="testnet")
targets = await client.get_targets(status="filled", limit=50)

training_data = []
for target in targets:
    data = await client.fetch_submission_data(target.id)
    training_data.append(data)

Training on this data biases your model toward domains the network is actively exploring.
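Submissions fetched this way can overlap across targets. A minimal dedup pass before training (this assumes each submission decodes to text; the real payload format may differ):

```python
import hashlib

def dedup(texts):
    """Keep the first occurrence of each distinct document."""
    seen, unique = set(), []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

training_data = ["doc a", "doc b", "doc a"]
training_data = dedup(training_data)
```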

To benchmark against or build on a revealed model, fetch its weights by id:

model_bytes = await client.fetch_model(model_id="0xABC...")

Load into your framework:

from soma_models.v1.torch import Model
from soma_models.v1.configs import ModelConfig

competitor = Model.load_bytes(
    model_bytes,
    ModelConfig(dropout_rate=0.0),
)
Ways to use a competitor's weights:

  • Benchmarking — run compute_loss on the same data to compare your model’s performance
  • Fine-tuning from competitor weights — initialize from a strong checkpoint instead of random init
  • Knowledge distillation — use a stronger competitor’s logits as soft targets for training
  • Domain gap analysis — evaluate on filled targets’ data to find domains where you lose, then train on those
  • Weight averaging — merge your weights with a competitor’s using simple interpolation
General tips for model developers:

  • Update frequency — push updated weights once per epoch (once per day). See Train a Model.
  • Mix data sources — train on FineWeb + network data + domain corpora to maintain generalization across the embedding space.
  • Track underserved regions — unfilled targets represent embedding regions lacking good data. These are opportunities for both submitters and model developers. See Targeting embedding gaps below.
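Weight averaging, for instance, is a few lines once both checkpoints are loaded. This sketch blends flat parameter lists in pure Python; with real torch state dicts you would apply the same interpolation per tensor:

```python
def interpolate(ours, theirs, alpha=0.5):
    """Elementwise blend: alpha * ours + (1 - alpha) * theirs."""
    assert ours.keys() == theirs.keys(), "architectures must match"
    return {
        name: [alpha * a + (1 - alpha) * b for a, b in zip(ours[name], theirs[name])]
        for name in ours
    }

# Toy "state dicts" with matching shapes.
ours = {"layer0.weight": [1.0, 2.0], "layer0.bias": [0.0]}
theirs = {"layer0.weight": [3.0, 4.0], "layer0.bias": [1.0]}
merged = interpolate(ours, theirs, alpha=0.5)
```

Sweep alpha and benchmark each merge; simple interpolation only works well when the two checkpoints are reasonably close in weight space.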

Targeting embedding gaps

Instead of competing head-to-head with existing models, choose a model embedding that occupies an underserved region of the embedding space. Then train on data that would serve that gap — domains and text distributions that existing models aren’t covering well.

To find where current models are positioned, query the model registry from the system state:

state = await client.get_system_state()

# Each registered model has an embedding
for model in state.model_registry:
    print(model.id, model.embedding)

Look for sparse regions — areas of the embedding space with few or no model embeddings nearby. Define your model’s embedding in that region, then focus your training data on the corresponding domains. This reduces direct competition and increases your share of targets that fall in that part of the space.
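One simple heuristic for locating a sparse region: sample random candidate embeddings and keep the one whose nearest registered model is farthest away. The `existing` list below is a toy stand-in for the embeddings in `state.model_registry`:

```python
import math
import random

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_point(existing, dim, n_samples=1000, seed=0):
    """Candidate maximizing distance to its nearest existing embedding."""
    rng = random.Random(seed)
    best, best_gap = None, -1.0
    for _ in range(n_samples):
        candidate = [rng.uniform(-1, 1) for _ in range(dim)]
        gap = min(distance(candidate, e) for e in existing)
        if gap > best_gap:
            best, best_gap = candidate, gap
    return best, best_gap

# Toy registry: three models clustered near the origin of a 2-D space.
existing = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
spot, gap = farthest_point(existing, dim=2)
```

Real model embeddings are much higher-dimensional, where random sampling is less effective; the same max-min objective still applies, but you may want a smarter search.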