Advanced Strategies

This page collects strategies for data submitters and model developers. These are ideas and approaches — not step-by-step tutorials. Mix and match based on what targets are available and where you see opportunity.

The Quickstart shows the simplest path: stream a HuggingFace dataset, filter, score, submit. Here are other ways to find data that scores well.

Use a language model to generate text optimized for specific target embeddings: generate diverse candidates across many domains (science, history, code, math, literature), score each, and submit the best.

  • Vary the prompt — ask for different domains, styles, and levels of technicality. Targets span the full embedding space.
  • Generate in batches of 20–50 candidates per target. LLM API calls are cheap relative to scoring.
  • Iterate on close hits — if a candidate is near the threshold, generate more text on the same topic.
  • Try different LLMs. Different models produce different text distributions.
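A minimal sketch of the rank-and-select step, assuming candidates have already been generated. The `embed` function here is a toy stand-in; real scoring would go through the network's embedding model:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(candidates, embed, target_embedding, k=5):
    """Rank generated texts by similarity to the target embedding."""
    scored = [(cosine(embed(text), target_embedding), text) for text in candidates]
    scored.sort(reverse=True)
    return scored[:k]

# Toy stand-in: a real pipeline would call the network's embedder instead.
def embed(text):
    return [sum(c.isalpha() for c in text),
            sum(c.isdigit() for c in text),
            len(text)]

candidates = ["quantum error correction codes",
              "recipe with 3 eggs and 2 cups flour",
              "def fib(n): return n if n < 2 else fib(n-1)+fib(n-2)"]
target = embed("surface codes for quantum error correction")
best = top_k(candidates, embed, target, k=1)
```

Candidates that land near the threshold are the ones worth iterating on with more generations in the same domain.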

Scrape web pages, compute embedding similarity to the target, and score the best candidates. Common Crawl, domain-specific sites, and forums are all valid sources.

Code repositories, research papers, multilingual text, and structured data are all valid. The model operates on raw bytes, not tokens — so non-English text, source code, and binary formats can score well, especially in under-explored embedding regions with lower competition.
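Since the model sees UTF-8 bytes rather than tokens, the byte footprint of a document depends on its script. A quick check:

```python
samples = {
    "english": "gradient descent",      # ASCII: 1 byte per character
    "russian": "градиентный спуск",     # Cyrillic: 2 bytes per character
    "chinese": "梯度下降",               # CJK: 3 bytes per character
}

for name, text in samples.items():
    raw = text.encode("utf-8")
    print(name, len(text), "chars ->", len(raw), "bytes")
```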

Beyond the datasets in the Quickstart (FineWeb, Wikipedia, StarCoder), consider:

  • allenai/c4 — cleaned web text (~300 GB)
  • togethercomputer/RedPajama-Data-V2 — diverse web, books, code (~30T tokens)
  • Specialized datasets for code, math, or multilingual text

Every data submission is publicly accessible via its data URL. Every revealed model’s weights are downloadable. Use both to improve your model.

Fetch data from recently filled targets — this is real data that scored well against at least one model:

from soma_sdk import SomaClient

client = await SomaClient(chain="testnet")
targets = await client.get_targets(status="filled", limit=50)

training_data = []
for target in targets:
    data = await client.fetch_submission_data(target.id)
    training_data.append(data)

Training on this data biases your model toward domains the network is actively exploring.
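Submissions fetched this way can overlap across targets. A minimal dedup pass before training (this assumes each submission decodes to text; the real payload format may differ):

```python
import hashlib

def dedup(texts):
    """Keep the first occurrence of each distinct document."""
    seen, unique = set(), []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

training_data = ["doc a", "doc b", "doc a"]
training_data = dedup(training_data)
```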

To benchmark against or build on a revealed model, fetch its weights by id:

model_bytes = await client.fetch_model(model_id="0xABC...")

Load into your framework:

from soma_models.v1.torch import Model
from soma_models.v1.configs import ModelConfig

competitor = Model.load_bytes(
    model_bytes,
    ModelConfig(dropout_rate=0.0),
)
Ways to use a competitor's weights:

  • Benchmarking — run compute_loss on the same data to compare your model’s performance
  • Fine-tuning from competitor weights — initialize from a strong checkpoint instead of random init
  • Knowledge distillation — use a stronger competitor’s logits as soft targets for training
  • Domain gap analysis — evaluate on filled targets’ data to find domains where you lose, then train on those
  • Weight averaging — merge your weights with a competitor’s using simple interpolation
General tips for model developers:

  • Update frequency — push updated weights once per epoch (once per day). See Train a Model.
  • Mix data sources — train on FineWeb + network data + domain corpora to maintain generalization across the embedding space.
  • Track underserved regions — unfilled targets represent embedding regions lacking good data. These are opportunities for both submitters and model developers. See Targeting embedding gaps below.
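Weight averaging, for instance, is a few lines once both checkpoints are loaded. This sketch blends flat parameter lists in pure Python; with real torch state dicts you would apply the same interpolation per tensor:

```python
def interpolate(ours, theirs, alpha=0.5):
    """Elementwise blend: alpha * ours + (1 - alpha) * theirs."""
    assert ours.keys() == theirs.keys(), "architectures must match"
    return {
        name: [alpha * a + (1 - alpha) * b for a, b in zip(ours[name], theirs[name])]
        for name in ours
    }

# Toy "state dicts" with matching shapes.
ours = {"layer0.weight": [1.0, 2.0], "layer0.bias": [0.0]}
theirs = {"layer0.weight": [3.0, 4.0], "layer0.bias": [1.0]}
merged = interpolate(ours, theirs, alpha=0.5)
```

Sweep alpha and benchmark each merge; simple interpolation only works well when the two checkpoints are reasonably close in weight space.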

Targeting embedding gaps

Instead of competing head-to-head with existing models, choose a model embedding that occupies an underserved region of the embedding space. Then train on data that would serve that gap — domains and text distributions that existing models aren’t covering well.

To find where current models are positioned, query the model registry from the system state:

state = await client.get_system_state()

# Each registered model has an embedding
for model in state.model_registry:
    print(model.id, model.embedding)

Look for sparse regions — areas of the embedding space with few or no model embeddings nearby. Define your model’s embedding in that region, then focus your training data on the corresponding domains. This reduces direct competition and increases your share of targets that fall in that part of the space.
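One simple heuristic for locating a sparse region: sample random candidate embeddings and keep the one whose nearest registered model is farthest away. The `existing` list below is a toy stand-in for the embeddings in `state.model_registry`:

```python
import math
import random

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_point(existing, dim, n_samples=1000, seed=0):
    """Candidate maximizing distance to its nearest existing embedding."""
    rng = random.Random(seed)
    best, best_gap = None, -1.0
    for _ in range(n_samples):
        candidate = [rng.uniform(-1, 1) for _ in range(dim)]
        gap = min(distance(candidate, e) for e in existing)
        if gap > best_gap:
            best, best_gap = candidate, gap
    return best, best_gap

# Toy registry: three models clustered near the origin of a 2-D space.
existing = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
spot, gap = farthest_point(existing, dim=2)
```

Real model embeddings are much higher-dimensional, where random sampling is less effective; the same max-min objective still applies, but you may want a smarter search.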