Advanced Strategies
This page collects strategies for data submitters and model developers. These are ideas and approaches — not step-by-step tutorials. Mix and match based on what targets are available and where you see opportunity.
Data strategies
The Quickstart shows the simplest path: stream a HuggingFace dataset, filter, score, submit. Here are other ways to find data that scores well.
LLM generation
Use a language model to generate text optimized for specific target embeddings. The approach: generate diverse candidates across many domains (science, history, code, math, literature), score each, submit the best.
- Vary the prompt — ask for different domains, styles, and levels of technicality. Targets span the full embedding space.
- Generate in batches of 20–50 candidates per target. LLM API calls are cheap relative to scoring.
- Iterate on close hits — if a candidate is near the threshold, generate more text on the same topic.
- Try different LLMs. Different models produce different text distributions.
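The score-and-select step above can be sketched with plain cosine similarity. A minimal example, assuming you already have candidate and target embeddings as NumPy arrays — the helper name `top_candidates` is illustrative, not part of the SDK:

```python
import numpy as np

def top_candidates(cand_embs, target_emb, k=5):
    """Rank candidate embeddings by cosine similarity to a target embedding."""
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb)
    sims = c @ t
    order = np.argsort(-sims)[:k]  # indices of the k most similar candidates
    return order, sims[order]
```

Keep the top few per target, and feed near-threshold candidates back into the prompt for another round of generation.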
Web scraping
Scrape web pages, compute embedding similarity to the target, and score the best candidates. Common Crawl, domain-specific sites, and forums are all valid sources.
Domain-specific corpora
Code repositories, research papers, multilingual text, and structured data are all valid. The model operates on raw bytes, not tokens — so non-English text, source code, and binary formats can score well, especially in under-explored embedding regions with lower competition.
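Because the model consumes raw bytes, any script or format reduces to the same input type before scoring. A quick pure-Python illustration (no SDK calls involved):

```python
samples = [
    'fn main() { println!("hi"); }',  # source code
    "数学の定理",                      # non-English text
    "SELECT * FROM runs;",            # structured/query text
]

# Each sample becomes a plain byte sequence
byte_seqs = [s.encode("utf-8") for s in samples]

# Byte length and character length diverge outside ASCII
for s, b in zip(samples, byte_seqs):
    print(len(s), len(b))
```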
Dataset diversity
Section titled “Dataset diversity”Beyond the datasets in the Quickstart (FineWeb, Wikipedia, StarCoder), consider:
- `allenai/c4` — cleaned web text (~300 GB)
- `togethercomputer/RedPajama-Data-V2` — diverse web, books, code (~30T tokens)
- Specialized datasets for code, math, or multilingual text
Model strategies
Learn from the network
Every data submission is publicly accessible via its data URL. Every revealed model’s weights are downloadable. Use both to improve your model.
Download submitted data
Fetch data from recently filled targets — this is real data that scored well against at least one model:
```python
from soma_sdk import SomaClient

client = await SomaClient(chain="testnet")
targets = await client.get_targets(status="filled", limit=50)

training_data = []
for target in targets:
    data = await client.fetch_submission_data(target.id)
    training_data.append(data)
```

Training on this data biases your model toward domains the network is actively exploring.
Download competitor weights
```python
model_bytes = await client.fetch_model(model_id="0xABC...")
```

Load into your framework:
PyTorch:

```python
from soma_models.v1.torch import Model
from soma_models.v1.configs import ModelConfig

competitor = Model.load_bytes(
    model_bytes,
    ModelConfig(dropout_rate=0.0),
)
```

Flax:

```python
from soma_models.v1.flax import Model
from soma_models.v1.configs import ModelConfig
from flax import nnx

competitor = Model.load_bytes(
    model_bytes,
    ModelConfig(dropout_rate=0.0),
    rngs=nnx.Rngs(0),
)
```

Ideas for using competitor data
- Benchmarking — run `compute_loss` on the same data to compare your model’s performance
- Fine-tuning from competitor weights — initialize from a strong checkpoint instead of random init
- Knowledge distillation — use a stronger competitor’s logits as soft targets for training
- Domain gap analysis — evaluate on filled targets’ data to find domains where you lose, then train on those
- Weight averaging — merge your weights with a competitor’s using simple interpolation
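The last idea fits in a few lines. A hedged sketch, assuming weights are exposed as a dict of NumPy arrays (real checkpoints may use framework-specific state dicts, but the interpolation is the same):

```python
import numpy as np

def average_weights(mine, theirs, alpha=0.5):
    """Interpolate two weight dicts: alpha * mine + (1 - alpha) * theirs."""
    assert mine.keys() == theirs.keys(), "models must share an architecture"
    return {k: alpha * mine[k] + (1 - alpha) * theirs[k] for k in mine}
```

An `alpha` near 1.0 stays close to your own weights; sweep a few values and keep whichever merge scores best on held-out data.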
Staying competitive
Section titled “Staying competitive”- Update frequency — push updated weights once per epoch (once per day). See Train a Model.
- Mix data sources — train on FineWeb + network data + domain corpora to maintain generalization across the embedding space.
- Track underserved regions — unfilled targets represent embedding regions lacking good data. These are opportunities for both submitters and model developers. See Targeting embedding gaps below.
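Mixing data sources can be as simple as weighted sampling per draw. A minimal sketch — the source names and weights below are illustrative, and in practice each source would be a streamed dataset rather than an in-memory list:

```python
import random

def sample_mixture(sources, weights, n, seed=0):
    """Draw n documents, choosing a source for each draw according to weights."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[name] for name in names]
    batch = []
    for _ in range(n):
        name = rng.choices(names, weights=probs, k=1)[0]
        batch.append(rng.choice(sources[name]))
    return batch

mix = sample_mixture(
    {"fineweb": ["web doc"], "network": ["filled-target doc"], "code": ["def f(): ..."]},
    {"fineweb": 0.6, "network": 0.2, "code": 0.2},
    n=10,
)
```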
Targeting embedding gaps
Instead of competing head-to-head with existing models, choose a model embedding that occupies an underserved region of the embedding space. Then train on data that would serve that gap — domains and text distributions that existing models aren’t covering well.
To find where current models are positioned, query the model registry from the system state:
```python
state = await client.get_system_state()

# Each registered model has an embedding
for model in state.model_registry:
    print(model.id, model.embedding)
```

Look for sparse regions — areas of the embedding space with few or no model embeddings nearby. Define your model’s embedding in that region, then focus your training data on the corresponding domains. This reduces direct competition and increases your share of targets that fall in that part of the space.
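To make "sparse region" concrete: score candidate positions by their distance to the nearest registered model embedding and pick the most isolated one. A sketch assuming embeddings are NumPy vectors — the helper name and the idea of sampling candidate points are illustrative, not part of the SDK:

```python
import numpy as np

def most_underserved(candidates, model_embs):
    """Return the index of the candidate farthest from every model embedding."""
    # Pairwise distances: (num_candidates, num_models)
    d = np.linalg.norm(candidates[:, None, :] - model_embs[None, :, :], axis=-1)
    nearest = d.min(axis=1)       # distance to the closest model, per candidate
    return int(nearest.argmax())  # the candidate with the largest gap around it
```

Candidate points could come from random samples of the embedding space or from embeddings of data you already hold; the chosen point becomes your model embedding, and its neighborhood tells you which domains to train on.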