Data Strategies

The quickstart streams raw data from The Stack v2 without filtering — every file is tokenized and trained on. This guide covers smarter approaches: deploying continuous submission, filtering data for relevance, and generating synthetic data with LLMs.

Source: submitter.py

uv run modal deploy src/quickstart/submitter.py && uv run submit

This deploys submitter.py as a cron job that runs every 24 hours with a 23h45m timeout, and triggers it immediately. The submission loop runs continuously within each invocation, scoring and submitting data against open targets:

@app.cls(
    image=image,
    gpu="L4",
    timeout=85500,  # 23h45m
    volumes={"/data": volume},
    secrets=[modal.Secret.from_name("soma-secrets")],
)
class Submitter:
    @modal.enter()
    def start_soma(self):
        # ... starts scoring service on GPU, HTTP file server, data stream

    @modal.method()
    async def run(self):
        kp = Keypair.from_secret_key(os.environ["SOMA_SECRET_KEY"])
        while True:
            try:
                await self._score_and_submit(kp)
            except Exception as e:
                print(f"Error during scoring iteration: {e}")

@app.local_entrypoint()
def main():
    Submitter().run.remote()

Start with The Stack v2 (the default). Filter to the top 10–15 languages for broad syntax diversity without diluting the mix with niche languages the model won’t see often enough to learn.
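As a sketch, language filtering can be a set check in the streaming loop. This assumes the dataset exposes a per-row language field; the set of languages below is illustrative, not a recommendation:

```python
# Keep only the most common languages; adjust the set to taste.
TOP_LANGUAGES = {
    "Python", "JavaScript", "TypeScript", "Java", "C", "C++",
    "C#", "Go", "Rust", "PHP", "Ruby", "Shell",
}

def stream_top_languages():
    """Yield file contents for the selected languages only."""
    from datasets import load_dataset

    ds = load_dataset("bigcode/the-stack-v2-dedup", split="train", streaming=True)
    for row in ds:
        if row.get("language") not in TOP_LANGUAGES:
            continue
        yield row["content"].encode("utf-8")
```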

Then layer on:

  • StarCoderData — Primary curated code mass. Higher quality filtering than raw Stack v2.
  • FineWeb-Edu — Natural language grounding for byte-level English. The educationally-scored subset is denser in technical explanations and documentation than raw web text. Gives models the NL comprehension to parse specifications and understand intent.
  • SWE-bench — Natural language to code at function granularity. Real GitHub issues paired with the code changes that resolve them. Trains the skill the network values most: read a spec, produce a correct implementation.
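One way to layer these sources is datasets.interleave_datasets. A sketch, with illustrative (untuned) sampling weights; the code and web sources carry their text under different column names, hence the small normalization helper:

```python
def extract_text(row):
    """Code rows carry "content"; web rows carry "text"."""
    return row.get("content") or row.get("text") or ""

def stream_mixed_corpus():
    """Yield bytes from a weighted mix of code and natural language."""
    from datasets import load_dataset, interleave_datasets

    code = load_dataset("bigcode/starcoderdata", split="train", streaming=True)
    web = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
    mixed = interleave_datasets([code, web], probabilities=[0.7, 0.3], seed=42)
    for row in mixed:
        text = extract_text(row)
        if text.strip():
            yield text.encode("utf-8")
```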

Fork submitter.py and replace stream_stack_v2() with a custom data source. The interface is simple — yield bytes objects under the current 1 MB maximum submission size:

def my_data_source():
    """Yield data as UTF-8 bytes."""
    from datasets import load_dataset

    ds = load_dataset(
        "HuggingFaceFW/fineweb-edu",
        split="train",
        streaming=True,
    ).shuffle(buffer_size=100_000)
    for row in ds:
        text = row.get("text", "")
        if not text.strip():
            continue
        data = text.encode("utf-8")
        if len(data) > 10_000:
            continue
        yield data

Replace self.data_stream = stream_stack_v2() in the Submitter.start_soma() method with self.data_stream = my_data_source(). The scoring and submission flow stays the same.

The quickstart’s make_batches() streams everything from The Stack v2 without filtering:

def make_batches(batch_size: int):
    ds = load_dataset("bigcode/the-stack-v2-dedup", split="train", streaming=True, ...)
    ds = ds.shuffle(buffer_size=SHUFFLE_BUFFER)
    for row in ds:
        sequences = tokenize(data=row["content"].encode("utf-8"), max_seq_len=V1_MAX_SEQ_LEN)
        for seq in sequences:
            # ... yield every sequence, no filtering

No quality checks, no relevance filtering — it tokenizes and trains on everything. Most files from The Stack v2 aren’t relevant to any given target, so the majority of training compute is spent on data that doesn’t move your model toward the domains that matter.

Use a small, fast embedding model to pre-filter data for relevance before training. Embed each source file, compare to the target embedding, and only train on files within a distance threshold:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a small, fast embedding model
filter_model = SentenceTransformer("all-MiniLM-L6-v2")

# Get your target embedding from the network
target_embedding = target.embedding  # from client.get_targets()
target_embedding = np.array(target_embedding)

def make_filtered_batches(batch_size: int, similarity_threshold: float = 0.3):
    ds = load_dataset("bigcode/the-stack-v2-dedup", split="train", streaming=True, ...)
    ds = ds.shuffle(buffer_size=SHUFFLE_BUFFER)
    for row in ds:
        content = row["content"]
        # Quick relevance check with the small model
        file_embedding = filter_model.encode(content[:512])  # first 512 chars is enough
        similarity = np.dot(file_embedding, target_embedding) / (
            np.linalg.norm(file_embedding) * np.linalg.norm(target_embedding)
        )
        if similarity < similarity_threshold:
            continue  # skip irrelevant files
        # Only tokenize and yield relevant files
        sequences = tokenize(data=content.encode("utf-8"), max_seq_len=V1_MAX_SEQ_LEN)
        for seq in sequences:
            buffer_ids.append(seq.token_ids)
            # ...

The small embedding model adds a few milliseconds per file but saves far more by skipping irrelevant training data entirely. This is especially effective when you’re targeting specific regions of the embedding space.
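The per-file cost can be amortized further by encoding in batches, since filter_model.encode accepts a list of strings and returns a 2-D array. A vectorized version of the same cosine check (a sketch; the threshold is the same illustrative 0.3):

```python
import numpy as np

def filter_relevant(texts, embeddings, target_embedding, threshold=0.3):
    """Keep only texts whose cosine similarity to the target clears the threshold.

    `embeddings` is the (n, d) array returned by filter_model.encode(texts).
    """
    emb = np.asarray(embeddings, dtype=np.float32)
    target = np.asarray(target_embedding, dtype=np.float32)
    # One matrix-vector product instead of n separate dot products.
    sims = emb @ target / (
        np.linalg.norm(emb, axis=1) * np.linalg.norm(target) + 1e-12
    )
    return [t for t, s in zip(texts, sims) if s >= threshold]
```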

Instead of relying solely on existing datasets, use a large language model to generate synthetic training data. This is generative distillation — you’re distilling the LLM’s knowledge into training data rather than directly into model weights (for weight-level distillation, see Model Strategies).

Qwen2.5-Coder-32B-Instruct is a strong open-weight model that can generate high-quality, diverse data across many languages and domains. Use it to produce targeted training data:

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-Coder-32B-Instruct", tensor_parallel_size=2)

prompts = [
    "Write a Python function that implements a binary search tree with insert, delete, and search operations.",
    "Write a Rust function that parses a CSV file and returns a Vec of structs.",
    "Write a Go HTTP middleware that implements rate limiting with a token bucket.",
    "Write a TypeScript function that validates and transforms a JSON schema.",
]

params = SamplingParams(temperature=0.8, max_tokens=1024)
outputs = llm.generate(prompts, params)

for output in outputs:
    generated = output.outputs[0].text
    data = generated.encode("utf-8")
    # Feed into your training pipeline or submit directly
A full workflow:

  1. Analyze target embeddings — query client.get_targets() to understand which domains the network needs
  2. Craft prompts — generate prompts across relevant domains, languages, and complexity levels
  3. Generate — run the LLM to produce outputs; vary temperature (0.7–1.0) for diversity
  4. Tokenize — feed generated output through the soma_models tokenizer into your training pipeline
  5. Score and submit — or train your model on the generated data directly
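The prompt-crafting step can be as simple as a cross product over domains, languages, and complexity levels. A sketch; the lists and prompt template below are illustrative:

```python
import itertools

DOMAINS = ["algorithms", "web services", "systems programming", "data pipelines"]
LANGUAGES = ["Python", "Rust", "Go", "TypeScript"]
LEVELS = ["simple", "production-quality", "highly optimized"]

def make_prompts():
    """One generation prompt per (domain, language, level) combination."""
    return [
        f"Write a {level} {lang} program for a {domain} task, "
        "with comments explaining the approach."
        for domain, lang, level in itertools.product(DOMAINS, LANGUAGES, LEVELS)
    ]
```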
Tips:

  • Vary prompts across domains (algorithms, web, systems, data science, DevOps) and styles. Targets span the full embedding space.
  • Generate in batches of 20–50 candidates per target. LLM API calls are cheap relative to scoring compute.
  • Iterate on close hits: if a candidate is near the distance threshold, generate more text on the same topic.
  • Try different models. Different LLMs produce different text distributions — Qwen excels at code, but models like Llama or Mistral may cover other domains better.
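The close-hit tip can be sketched as a small refinement loop. Here score_candidate and generate_variants are hypothetical hooks into your own scoring service and generation step, with lower distance meaning a closer match:

```python
def refine_close_hits(candidates, score_candidate, generate_variants,
                      threshold=1.0, margin=0.2):
    """Spend extra generation budget only on topics that almost qualified.

    score_candidate(data) -> distance to the target (lower is better).
    generate_variants(data) -> more candidates on the same topic.
    Both are hypothetical hooks into your own pipeline.
    """
    accepted, near_misses = [], []
    for data in candidates:
        distance = score_candidate(data)
        if distance <= threshold:
            accepted.append(data)
        elif distance <= threshold + margin:
            near_misses.append(data)
    # Regenerate on the near-miss topics and keep any variant that qualifies.
    for data in near_misses:
        for variant in generate_variants(data):
            if score_candidate(variant) <= threshold:
                accepted.append(variant)
    return accepted
```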