# Data Strategies

The quickstart streams raw data from [The Stack v2](https://huggingface.co/datasets/bigcode/the-stack-v2-dedup) without filtering — every file is tokenized and trained on. This guide covers smarter approaches: deploying continuous submission, filtering data for relevance, and generating synthetic data with LLMs.

> Source: [`submitter.py`](https://github.com/soma-org/quickstart/blob/main/src/quickstart/submitter.py)

**Prerequisites:**

- Quickstart repo cloned, dependencies installed, secrets configured ([Quickstart: Clone and configure](https://docs.soma.org/getting-started/quickstart/#clone-and-configure))
- Funded testnet account ([Installation & Setup](https://docs.soma.org/getting-started/install/))
- Submitted data interactively at least once ([Quickstart: Submit data](https://docs.soma.org/getting-started/quickstart/#step-1-submit-data))

## Automate your submitter

```shell
uv run modal deploy src/quickstart/submitter.py && uv run submit
```

This deploys `submitter.py` as a cron job that runs every 24 hours with a 23h45m timeout, and triggers it immediately. The submission loop runs continuously within each invocation, scoring and submitting data against open targets:

```python
@app.cls(
    image=image,
    gpu="L4",
    timeout=85500,  # 23h45m
    volumes={"/data": volume},
    secrets=[modal.Secret.from_name("soma-secrets")],
)
class Submitter:
    @modal.enter()
    def start_soma(self):
        # ... starts scoring service on GPU, HTTP file server, data stream
        ...

    @modal.method()
    async def run(self):
        kp = Keypair.from_secret_key(os.environ["SOMA_SECRET_KEY"])
        while True:
            try:
                await self._score_and_submit(kp)
            except Exception as e:
                print(f"Error during scoring iteration: {e}")

@app.local_entrypoint()
def main():
    Submitter().run.remote()
```

**Tip:** Deploy both `training.py` and `submitter.py` together. Your model earns commission from other submitters' data, and your submitter earns rewards against all models — including your own.

## Recommended datasets

Start with The Stack v2 (the default). Filter to the top 10–15 languages for broad syntax diversity without diluting on niche languages the model won't see enough of to learn.

Then layer on:

- [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata) — Primary curated code mass. Higher quality filtering than raw Stack v2.
- [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) — Natural language grounding for byte-level English. The educationally-scored subset is denser in technical explanations and documentation than raw web text. Gives models the NL comprehension to parse specifications and understand intent.
- [SWE-bench](https://huggingface.co/datasets/princeton-nlp/SWE-bench) — Natural language to code at function granularity. Real GitHub issues paired with the code changes that resolve them. Trains the skill the network values most: read a spec, produce a correct implementation.

**Note:** All datasets are chunked to 1024 bytes by the `soma_models` tokenizer during training. Documents longer than the context window are split into independent sequences — the model doesn't see cross-sequence context. Shorter, self-contained passages (functions, docstrings, paragraphs) are more effective training data than long documents.
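The language and length guidance above can be sketched as a simple row predicate. The field names (`language`, `content`) are assumptions — check your dataset's actual schema before reusing this:

```python
# Keep only top languages and short, self-contained files.
# NOTE: field names ("language", "content") are assumptions; verify
# against your dataset's schema before using this in a real pipeline.
TOP_LANGUAGES = {
    "Python", "JavaScript", "TypeScript", "Java", "C", "C++",
    "Go", "Rust", "C#", "PHP", "Ruby", "Shell",
}
MAX_BYTES = 10_000  # favor short, self-contained files over long documents

def keep_row(row: dict) -> bool:
    """Return True if a streamed row passes the language and size filters."""
    if row.get("language") not in TOP_LANGUAGES:
        return False
    content = row.get("content") or ""
    data = content.encode("utf-8")
    return 0 < len(data) <= MAX_BYTES
```

Apply it inside the streaming loop (`if not keep_row(row): continue`) so rejected files are never tokenized.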

### Customizing the submitter

Fork `submitter.py` and replace `stream_stack_v2()` with a custom data source. The interface is simple — yield `bytes` objects under the current **1 MB** maximum submission size:

```python
def my_data_source():
    """Yield data as UTF-8 bytes."""
    from datasets import load_dataset

    ds = load_dataset(
        "HuggingFaceFW/fineweb-edu",
        split="train",
        streaming=True,
    ).shuffle(buffer_size=100_000)

    for row in ds:
        text = row.get("text", "")
        if not text.strip():
            continue
        data = text.encode("utf-8")
        if len(data) > 10_000:  # well under the 1 MB cap; shorter passages train better
            continue
        yield data
```

Replace `self.data_stream = stream_stack_v2()` in the `Submitter.start_soma()` method with `self.data_stream = my_data_source()`. The scoring and submission flow stays the same.
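If you want to mix several of the datasets above into one stream, a round-robin interleave of generators is a simple option. This is a sketch, not part of the quickstart code:

```python
def interleave(*sources):
    """Round-robin over several byte generators, dropping each as it runs dry."""
    iterators = [iter(s) for s in sources]
    while iterators:
        alive = []
        for it in iterators:
            try:
                yield next(it)
                alive.append(it)  # keep only sources that still have data
            except StopIteration:
                pass
        iterators = alive
```

For example, `self.data_stream = interleave(my_data_source(), stream_stack_v2())` would alternate between synthetic and Stack v2 data.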

## Filtering with a small embedding model

### The brute force baseline

The quickstart's `make_batches()` streams everything from The Stack v2 without filtering:

```python
def make_batches(batch_size: int):
    ds = load_dataset("bigcode/the-stack-v2-dedup", split="train", streaming=True, ...)
    ds = ds.shuffle(buffer_size=SHUFFLE_BUFFER)

    for row in ds:
        sequences = tokenize(data=row["content"].encode("utf-8"), max_seq_len=V1_MAX_SEQ_LEN)
        for seq in sequences:
            # ... yield every sequence, no filtering
            ...
```

No quality checks, no relevance filtering — it tokenizes and trains on everything. Most files from The Stack v2 aren't relevant to any given target, so the majority of training compute is spent on data that doesn't move your model toward the domains that matter.

### Smart filtering

Use a small, fast embedding model to pre-filter data for relevance before training. Embed each source file, compare to the target embedding, and only train on files within a distance threshold:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a small, fast embedding model
filter_model = SentenceTransformer("all-MiniLM-L6-v2")

# Get your target embedding from the network
target_embedding = target.embedding  # from client.get_targets()
target_embedding = np.array(target_embedding)

def make_filtered_batches(batch_size: int, similarity_threshold: float = 0.3):
    ds = load_dataset("bigcode/the-stack-v2-dedup", split="train", streaming=True, ...)
    ds = ds.shuffle(buffer_size=SHUFFLE_BUFFER)

    for row in ds:
        content = row["content"]

        # Quick relevance check with the small model
        file_embedding = filter_model.encode(content[:512])  # first 512 chars is enough
        similarity = np.dot(file_embedding, target_embedding) / (
            np.linalg.norm(file_embedding) * np.linalg.norm(target_embedding)
        )

        if similarity < similarity_threshold:
            continue  # skip irrelevant files

        # Only tokenize and yield relevant files
        sequences = tokenize(data=content.encode("utf-8"), max_seq_len=V1_MAX_SEQ_LEN)
        for seq in sequences:
            buffer_ids.append(seq.token_ids)
            # ...
```

The small embedding model adds a few milliseconds per file but saves far more by skipping irrelevant training data entirely. This is especially effective when you're targeting specific regions of the embedding space.
**Tip:** The embedding model used for filtering doesn't need to be large or precise — it just needs to be fast and directionally correct. `all-MiniLM-L6-v2` is ~80MB and runs on CPU. The goal is to reject obviously irrelevant files cheaply.
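Per-file `encode` calls dominate the filtering cost. `SentenceTransformer.encode` accepts a list, so buffering files and scoring them in batches is usually faster; the cosine step then reduces to one matrix-vector product. A sketch in plain NumPy:

```python
import numpy as np

def cosine_similarities(embeddings: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Cosine similarity of each row in `embeddings` against `target`."""
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    target = target / np.linalg.norm(target)
    return embeddings @ target

# Batched usage (sketch): encode a buffer of files at once, then filter.
# file_embeddings = filter_model.encode([c[:512] for c in buffer])  # shape (N, d)
# mask = cosine_similarities(file_embeddings, target_embedding) >= threshold
```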

## LLM distillation (generative)

Instead of relying solely on existing datasets, use a large language model to generate synthetic training data. This is *generative distillation* — you're distilling the LLM's knowledge into training data rather than directly into model weights (for weight-level distillation, see [Model Strategies](https://docs.soma.org/guides/model-strategies/#distilling-from-competitors)).

### Example: Qwen2.5-Coder-32B-Instruct

[Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) is a strong open-weight model that can generate high-quality, diverse data across many languages and domains. Use it to produce targeted training data:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-Coder-32B-Instruct", tensor_parallel_size=2)

prompts = [
    "Write a Python function that implements a binary search tree with insert, delete, and search operations.",
    "Write a Rust function that parses a CSV file and returns a Vec of structs.",
    "Write a Go HTTP middleware that implements rate limiting with a token bucket.",
    "Write a TypeScript function that validates and transforms a JSON schema.",
]

params = SamplingParams(temperature=0.8, max_tokens=1024)
outputs = llm.generate(prompts, params)

for output in outputs:
    generated = output.outputs[0].text
    data = generated.encode("utf-8")
    # Feed into your training pipeline or submit directly
```

### Pipeline

1. **Analyze target embeddings** — query `client.get_targets()` to understand what domains the network needs
2. **Craft prompts** — generate prompts across relevant domains, languages, and complexity levels
3. **Generate** — run the LLM to produce outputs. Vary temperature (0.7–1.0) for diversity
4. **Tokenize** — feed generated output through the `soma_models` tokenizer into your training pipeline
5. **Score and submit** — or train your model on the generated data directly
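Step 2 can be as simple as a template grid that crosses languages with tasks. The specific domains and template below are illustrative, not network-defined:

```python
from itertools import product

# Illustrative lists -- derive your own from target embeddings.
LANGUAGES = ["Python", "Rust", "Go", "TypeScript"]
TASKS = [
    "implements an LRU cache with O(1) get and put",
    "parses an INI config file into a nested structure",
    "rate-limits requests with a token bucket",
]

def make_prompts(languages=LANGUAGES, tasks=TASKS):
    """Cross languages with tasks to cover more of the embedding space."""
    return [
        f"Write a {lang} function that {task}."
        for lang, task in product(languages, tasks)
    ]
```

Feeding the full grid to `llm.generate()` gives you coverage across domains without hand-writing every prompt.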

### Tips

- Vary prompts across domains (algorithms, web, systems, data science, DevOps) and styles. Targets span the full embedding space.
- Generate in batches of 20–50 candidates per target. LLM API calls are cheap relative to scoring compute.
- Iterate on close hits: if a candidate is near the distance threshold, generate more text on the same topic.
- Try different models. Different LLMs produce different text distributions — Qwen excels at code, but models like Llama or Mistral may cover other domains better.
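The "iterate on close hits" tip can be sketched as a selection step: given candidates scored by distance to the target, re-prompt on the topics that just missed the threshold. The candidate structure here is hypothetical — adapt it to however you track generated samples:

```python
def close_misses(candidates, threshold, margin=0.1):
    """Pick topics of candidates that just missed the distance threshold.

    `candidates` is a list of (topic, distance) pairs -- a hypothetical
    structure, not part of the quickstart code.
    """
    return [
        topic
        for topic, distance in candidates
        if threshold < distance <= threshold + margin
    ]
```

Topics returned by `close_misses` are good seeds for another generation batch at a higher temperature.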

## Next steps

[Model Strategies](https://docs.soma.org/guides/model-strategies/)

[Claim Rewards](https://docs.soma.org/guides/claim-rewards/)