Data Strategies

The quickstart streams raw data from The Stack v2 without filtering — every file is tokenized and trained on. This guide covers smarter approaches: deploying continuous submission, filtering data for relevance, and generating synthetic data with LLMs.

Source: submitter.py

uv run modal deploy src/quickstart/submitter.py && uv run submit

This deploys submitter.py as a cron job that runs every 24 hours with a 23h45m timeout, and triggers it immediately. The submission loop runs continuously within each invocation, scoring and submitting data against open targets:

@app.cls(
    image=image,
    gpu="L4",
    timeout=85500,  # 23h45m
    volumes={"/data": volume},
    secrets=[modal.Secret.from_name("soma-secrets")],
)
class Submitter:
    @modal.enter()
    def start_soma(self):
        # ... starts scoring service on GPU, HTTP file server, data stream

    @modal.method()
    async def run(self):
        kp = Keypair.from_secret_key(os.environ["SOMA_SECRET_KEY"])
        while True:
            try:
                await self._score_and_submit(kp)
            except Exception as e:
                print(f"Error during scoring iteration: {e}")

@app.local_entrypoint()
def main():
    Submitter().run.remote()

Start with The Stack v2 (the default). Filter to the top 10–15 languages for broad syntax diversity without diluting the mix with niche languages the model won’t see often enough to learn.
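As a sketch, language filtering can be a set check in the streaming loop. This assumes the dataset exposes a per-row language field; the set of languages below is illustrative, not a recommendation:

```python
# Keep only the most common languages; adjust the set to taste.
TOP_LANGUAGES = {
    "Python", "JavaScript", "TypeScript", "Java", "C", "C++",
    "C#", "Go", "Rust", "PHP", "Ruby", "Shell",
}

def stream_top_languages():
    """Yield file contents for the selected languages only."""
    from datasets import load_dataset

    ds = load_dataset("bigcode/the-stack-v2-dedup", split="train", streaming=True)
    for row in ds:
        if row.get("language") not in TOP_LANGUAGES:
            continue
        yield row["content"].encode("utf-8")
```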

Then layer on:

  • StarCoderData — Primary curated code mass. Higher quality filtering than raw Stack v2.
  • FineWeb-Edu — Natural language grounding for byte-level English. The educationally-scored subset is denser in technical explanations and documentation than raw web text. Gives models the NL comprehension to parse specifications and understand intent.
  • SWE-bench — Natural language to code at function granularity. Real GitHub issues paired with the code changes that resolve them. Trains the skill the network values most: read a spec, produce a correct implementation.
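One way to layer these sources is datasets.interleave_datasets. A sketch, with illustrative (untuned) sampling weights; the code and web sources carry their text under different column names, hence the small normalization helper:

```python
def extract_text(row):
    """Code rows carry "content"; web rows carry "text"."""
    return row.get("content") or row.get("text") or ""

def stream_mixed_corpus():
    """Yield bytes from a weighted mix of code and natural language."""
    from datasets import load_dataset, interleave_datasets

    code = load_dataset("bigcode/starcoderdata", split="train", streaming=True)
    web = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
    mixed = interleave_datasets([code, web], probabilities=[0.7, 0.3], seed=42)
    for row in mixed:
        text = extract_text(row)
        if text.strip():
            yield text.encode("utf-8")
```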

Fork submitter.py and replace stream_stack_v2() with a custom data source. The interface is simple — yield bytes objects under the current 1 MB maximum submission size:

def my_data_source():
    """Yield data as UTF-8 bytes."""
    from datasets import load_dataset

    ds = load_dataset(
        "HuggingFaceFW/fineweb-edu",
        split="train",
        streaming=True,
    ).shuffle(buffer_size=100_000)
    for row in ds:
        text = row.get("text", "")
        if not text.strip():
            continue
        data = text.encode("utf-8")
        if len(data) > 10_000:
            continue
        yield data

Replace self.data_stream = stream_stack_v2() in the Submitter.start_soma() method with self.data_stream = my_data_source(). The scoring and submission flow stays the same.

The quickstart’s make_batches() streams everything from The Stack v2 without filtering:

def make_batches(batch_size: int):
    ds = load_dataset("bigcode/the-stack-v2-dedup", split="train", streaming=True, ...)
    ds = ds.shuffle(buffer_size=SHUFFLE_BUFFER)
    for row in ds:
        sequences = tokenize(data=row["content"].encode("utf-8"), max_seq_len=V1_MAX_SEQ_LEN)
        for seq in sequences:
            # ... yield every sequence, no filtering

No quality checks, no relevance filtering — it tokenizes and trains on everything. Most files from The Stack v2 aren’t relevant to any given target, so the majority of training compute is spent on data that doesn’t move your model toward the domains that matter.

Use a small, fast embedding model to pre-filter data for relevance before training. Embed each source file, compare to the target embedding, and only train on files within a distance threshold:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a small, fast embedding model
filter_model = SentenceTransformer("all-MiniLM-L6-v2")

# Get your target embedding from the network
target_embedding = target.embedding  # from client.get_targets()
target_embedding = np.array(target_embedding)

def make_filtered_batches(batch_size: int, similarity_threshold: float = 0.3):
    ds = load_dataset("bigcode/the-stack-v2-dedup", split="train", streaming=True, ...)
    ds = ds.shuffle(buffer_size=SHUFFLE_BUFFER)
    for row in ds:
        content = row["content"]
        # Quick relevance check with the small model
        file_embedding = filter_model.encode(content[:512])  # first 512 chars is enough
        similarity = np.dot(file_embedding, target_embedding) / (
            np.linalg.norm(file_embedding) * np.linalg.norm(target_embedding)
        )
        if similarity < similarity_threshold:
            continue  # skip irrelevant files
        # Only tokenize and yield relevant files
        sequences = tokenize(data=content.encode("utf-8"), max_seq_len=V1_MAX_SEQ_LEN)
        for seq in sequences:
            buffer_ids.append(seq.token_ids)
            # ...

The small embedding model adds a few milliseconds per file but saves far more by skipping irrelevant training data entirely. This is especially effective when you’re targeting specific regions of the embedding space.
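The per-file cost can be amortized further by encoding in batches, since filter_model.encode accepts a list of strings and returns a 2-D array. A vectorized version of the same cosine check (a sketch; the threshold is the same illustrative 0.3):

```python
import numpy as np

def filter_relevant(texts, embeddings, target_embedding, threshold=0.3):
    """Keep only texts whose cosine similarity to the target clears the threshold.

    `embeddings` is the (n, d) array returned by filter_model.encode(texts).
    """
    emb = np.asarray(embeddings, dtype=np.float32)
    target = np.asarray(target_embedding, dtype=np.float32)
    # One matrix-vector product instead of n separate dot products.
    sims = emb @ target / (
        np.linalg.norm(emb, axis=1) * np.linalg.norm(target) + 1e-12
    )
    return [t for t, s in zip(texts, sims) if s >= threshold]
```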

Instead of relying solely on existing datasets, use a large language model to generate synthetic training data. This is generative distillation — you’re distilling the LLM’s knowledge into training data rather than directly into model weights (for weight-level distillation, see Model Strategies).

Qwen2.5-Coder-32B-Instruct is a strong open-weight model that can generate high-quality, diverse data across many languages and domains. Use it to produce targeted training data:

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-Coder-32B-Instruct", tensor_parallel_size=2)

prompts = [
    "Write a Python function that implements a binary search tree with insert, delete, and search operations.",
    "Write a Rust function that parses a CSV file and returns a Vec of structs.",
    "Write a Go HTTP middleware that implements rate limiting with a token bucket.",
    "Write a TypeScript function that validates and transforms a JSON schema.",
]

params = SamplingParams(temperature=0.8, max_tokens=1024)
outputs = llm.generate(prompts, params)

for output in outputs:
    generated = output.outputs[0].text
    data = generated.encode("utf-8")
    # Feed into your training pipeline or submit directly
A full workflow:

  1. Analyze target embeddings — query client.get_targets() to understand which domains the network needs
  2. Craft prompts — generate prompts across relevant domains, languages, and complexity levels
  3. Generate — run the LLM to produce outputs; vary temperature (0.7–1.0) for diversity
  4. Tokenize — feed generated output through the soma_models tokenizer into your training pipeline
  5. Score and submit — or train your model on the generated data directly
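The prompt-crafting step can be as simple as a cross product over domains, languages, and complexity levels. A sketch; the lists and prompt template below are illustrative:

```python
import itertools

DOMAINS = ["algorithms", "web services", "systems programming", "data pipelines"]
LANGUAGES = ["Python", "Rust", "Go", "TypeScript"]
LEVELS = ["simple", "production-quality", "highly optimized"]

def make_prompts():
    """One generation prompt per (domain, language, level) combination."""
    return [
        f"Write a {level} {lang} program for a {domain} task, "
        "with comments explaining the approach."
        for domain, lang, level in itertools.product(DOMAINS, LANGUAGES, LEVELS)
    ]
```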
Tips:

  • Vary prompts across domains (algorithms, web, systems, data science, DevOps) and styles. Targets span the full embedding space.
  • Generate in batches of 20–50 candidates per target. LLM API calls are cheap relative to scoring compute.
  • Iterate on close hits: if a candidate is near the distance threshold, generate more text on the same topic.
  • Try different models. Different LLMs produce different text distributions — Qwen excels at code, but models like Llama or Mistral may cover other domains better.
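The close-hit tip can be sketched as a small refinement loop. Here score_candidate and generate_variants are hypothetical hooks into your own scoring service and generation step, with lower distance meaning a closer match:

```python
def refine_close_hits(candidates, score_candidate, generate_variants,
                      threshold=1.0, margin=0.2):
    """Spend extra generation budget only on topics that almost qualified.

    score_candidate(data) -> distance to the target (lower is better).
    generate_variants(data) -> more candidates on the same topic.
    Both are hypothetical hooks into your own pipeline.
    """
    accepted, near_misses = [], []
    for data in candidates:
        distance = score_candidate(data)
        if distance <= threshold:
            accepted.append(data)
        elif distance <= threshold + margin:
            near_misses.append(data)
    # Regenerate on the near-miss topics and keep any variant that qualifies.
    for data in near_misses:
        for variant in generate_variants(data):
            if score_candidate(variant) <= threshold:
                accepted.append(variant)
    return accepted
```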