
The Flat Vector Pool Had to Die

On contextual drag, walled memory, and why your 8B model can outthink your 32B one

Kabir Murjani  ·  March 2026  ·  ~3,400 words

Three weeks ago I gave a 32B model a standard vector RAG context and watched it score 41.25% on GPQA-Diamond. Then I isolated the same underlying data, cleaned the context window down to nothing but the facts the query actually needed, and handed it to an 8B model. It scored 55.62%. The 8B model costs a fraction of the compute. It won.

That result is not a fluke. It is a direct consequence of a failure mode the community has been quietly accumulating for two years, and it goes by the name contextual drag.

This post lays out the problem, the math behind it, and an architecture I am calling G.A.T.E.S. (Gated Access via Threshold-Evaluated Semantics) that I believe addresses the structural root cause rather than papering over it. I will go through three layers: isolated memory partitions, a deterministic SLM router, and a logarithmic penalty function that governs when cross-domain retrieval is actually justified.

The argument has no buzzwords and three testable empirical claims. I think that is the right way to do this.

§ 1 The Expressiveness Gap Is Not a Bug, It Is Geometry

The dense retrieval assumption is clean on paper: semantic relevance correlates with geometric proximity in a high-dimensional embedding space. Encode a query, find its nearest neighbors, retrieve the context. The underlying faith is that meaning lives in distance.

It does not.

Embedding models are bounded by a property of their architecture called the linear separation limit. For a fixed dimension, certain Boolean concepts cannot be linearly separated by a single query vector. The canonical example is negation [1]. A query that says "Exclude all material related to Boltzmann" and a query that says "Include all material related to Boltzmann" share nearly identical vocabulary. The encoder maps both to nearly identical coordinates. Cosine similarity does not care about "NOT." It sees overlap.

The practical consequence is what the retrieval literature calls vector bleed: the excluded domain remains near the top of the ranking because geometric proximity overrides the logical negation [1]. The "NOT" operator functions as a soft demotion at best. The excluded documents do not disappear. They settle near the top ten, and they enter the context window.
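The effect is easy to reproduce even without a neural encoder. The toy bag-of-words sketch below is purely illustrative (real embedding models are denser, but the geometry is the same): flipping "include" to "exclude" barely moves the query vector, so the excluded document ranks identically under both intents.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b))

# Two queries with opposite logical intent but near-identical vocabulary.
include_q = Counter("include all material related to boltzmann".split())
exclude_q = Counter("exclude all material related to boltzmann".split())

# A document chunk squarely inside the domain the second query excludes.
boltzmann_doc = Counter(
    "boltzmann distribution material for statistical mechanics".split())

sim_include = cosine(include_q, boltzmann_doc)
sim_exclude = cosine(exclude_q, boltzmann_doc)

# The "NOT" intent does not move the score at all in this toy setting:
# both queries rank the excluded chunk the same. Vector bleed in miniature.
print(f"include-query similarity: {sim_include:.3f}")
print(f"exclude-query similarity: {sim_exclude:.3f}")
```

In a real encoder the two scores differ slightly rather than exactly matching, but the demotion is soft, which is the whole problem.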

A query intended to isolate pure calculus away from thermodynamics will nonetheless retrieve thermodynamic formulas because mathematical terminology occupies the same semantic neighborhood regardless of the user's intent. — Derived from analysis of dense retrieval expressiveness limits [1]

This is not an edge case in complex enterprise deployments. It is a structural property of the architecture that becomes more visible as the knowledge base scales and as query intent becomes more specific. The denser the corpus, the more neighborhoods overlap, the more bleed occurs.

Security researchers at OWASP flagged this in their 2026 Agentic Application threat model under the label "cross-tenant vector bleed" [2]. The attack surface is concrete: a malicious actor seeds near-duplicate content into a shared flat vector pool. Because the retrieval engine prioritizes cosine similarity over access boundaries, the seeded content acts as a semantic magnet, pulling adjacent sensitive data chunks from other tenants into an active retrieval window. The same geometric property that causes logical noise also causes data leakage.

The fix is not a better embedding model. The fix is to stop relying on continuous similarity as the access control mechanism.

Figure 1 — Vector Bleed vs. G.A.T.E.S. Isolation

Each row is a knowledge partition k1..k14. Intensity = retrieval activation strength over query time. Top panel: standard flat vector pool. The query targets k7 but activations bleed across k3, k9, k12. Bottom panel: G.A.T.E.S. topological isolation. Only the gated partition activates. Cross-partition noise is structurally zero.

§ 2 Contextual Drag: Why Scaling Parameters Does Not Help

The response from the scaling camp to vector bleed is usually some version of "the model is powerful enough to filter noise." Give a 32B parameter model the context, the argument goes, and it will know what to ignore.

A comprehensive 2026 benchmark by Cheng et al. tested this assumption across eleven proprietary and open-weight models on GPQA-Diamond and AIME24 [3]. It does not hold.

The paper introduces the term contextual drag to describe what actually happens. When a model's context window contains irrelevant, erroneous, or noisy retrieval results alongside relevant material, the model does not cleanly ignore the noise. It inherits the structural pattern of the noise. The researchers quantified this using tree edit distance: models under drag did not merely fail to use the relevant material, they generated responses whose error structure was geometrically similar to the structure of the distractor context [3].

The performance drops documented were not marginal. Across math, science, and coding benchmarks, contextual drag produced accuracy degradations of 10% to 20% [3]. More troubling: in iterative refinement settings, the drag could cascade into what the authors call self-deterioration, where each refinement step inherits more error structure than the previous one.

The researchers attempted an obvious mitigation: explicitly instruct the model to ignore the distracting context. It did not work. Even when the model correctly identified the noisy material as irrelevant, the structural distortion in its reasoning persisted [3].

This is the core empirical result that invalidates the "just scale the model" argument. Parameter count does not determine whether a model is immune to structural anchoring bias from its context. The 32B model does not escape the drag. It just drags with more parameters.

The model does not merely fail to use noisy context. It inherits the structural error pattern of that noise.

A parallel finding comes from the GSM-Infinite benchmark presented at ICML 2025 [4]. Researchers evaluated model performance over infinitely increasing reasoning complexity by mapping abstract computation graphs to natural language. When they extended context by injecting distractors that had tight semantic connections to the essential graph (the exact failure mode of a flat vector database), performance followed a sigmoid decline. To achieve linear gains in this regime, you need exponential increases in inference compute. The scaling law breaks down exactly where RAG noise is highest.

The data from Cheng et al. makes the David-beats-Goliath claim concrete:

Model      Params   Clean Context   Noisy RAG Context
Qwen3      32B      65.00%          41.25%
Qwen3      8B       55.62% ✓        30.63%
Nemotron   32B      66.81%          49.43%
Nemotron   7B       51.61% ✓        35.27%

(✓ = the smaller model's clean-context score exceeds the larger model's noisy-context score)

The 8B Qwen3 under clean context outperforms the 32B Qwen3 under standard RAG noise. The 7B Nemotron outperforms the 32B Nemotron under the same conditions. The pattern is consistent across model families. Clean context is more valuable than parameter scale.

The implication for architecture is direct: before you pay for a larger model, pay for a cleaner context pipeline.

Figure 2 — Clean 8B vs. Noisy 32B (GPQA-Diamond + AIME24)

Grouped by model family. Gold bars: clean/isolated context (G.A.T.E.S. state). Red bars: noisy RAG context (standard vector pool). Highlighted bars show smaller models in clean context exceeding larger models in noisy context. Source: Cheng et al. 2026 [3].

§ 3 G.A.T.E.S.: Three Layers That Actually Fix This

Layer 1 — Walled Communities: Stop Putting Data in a Blender

The vector database gives you one giant flat pool of everything. The G.A.T.E.S. architecture replaces that with physically isolated partitions, each containing a discrete, logically coherent knowledge domain. Math Chapter 1 and Thermodynamics Chapter 1 do not share a neighborhood. They are separate nodes in a symbolic topology. Unless an explicit edge connects them, they cannot interact during retrieval.

The underlying representation is a Knowledge Graph with community detection used to define partition boundaries [5]. Within each partition, the document structure is preserved as a hierarchical tree (parent sections contain child sections, preserving logical context) rather than being atomized into a "soup" of disconnected chunks. Frameworks like PageIndex have shown that reasoning-based hierarchical retrieval over document trees achieves 98.7% accuracy on FinanceBench, specifically because the system scales with the logical structure of the document rather than degrading as corpus length introduces metric noise [6].

The security consequence of this design is worth stating plainly. Cross-tenant vector bleed requires a shared continuous space to exploit. Discrete topological partitions are inherently immune to the attack vector because there is no continuous space to manipulate. The denial-by-default architecture satisfies OWASP 2026 data minimization requirements structurally, not through policy [2].
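A minimal sketch of the denial-by-default idea, with hypothetical names (`WalledStore`, `Partition`, `open_gate`) invented for illustration rather than taken from any real library: retrieval can only see partitions that a symbolic command has explicitly unlocked, no matter how similar a closed partition's content is to the query.

```python
from dataclasses import dataclass, field

@dataclass
class Partition:
    """One walled community: a coherent knowledge domain."""
    name: str
    chunks: list[str]
    edges: set[str] = field(default_factory=set)  # explicit cross-domain links only

class WalledStore:
    def __init__(self):
        self.partitions: dict[str, Partition] = {}
        self.open_gates: set[str] = set()

    def add(self, p: Partition):
        self.partitions[p.name] = p

    def open_gate(self, name: str):
        # Only a symbolic command unlocks a partition; similarity never can.
        if name not in self.partitions:
            raise KeyError(f"no such partition: {name}")  # no hallucinated paths
        self.open_gates.add(name)

    def retrieve(self, query: str) -> list[str]:
        # Denial-by-default: closed partitions are structurally invisible.
        hits = []
        for name in self.open_gates:
            hits += [c for c in self.partitions[name].chunks
                     if query.lower() in c.lower()]
        return hits

store = WalledStore()
store.add(Partition("thermo_chap_1", ["Entropy derivation for ideal gases"]))
store.add(Partition("math_chap_1", ["Entropy of a discrete distribution"]))
store.open_gate("thermo_chap_1")

# "entropy" matches chunks in both partitions, but only the opened
# gate contributes; the math partition stays walled off.
print(store.retrieve("entropy"))
```

The in-partition match here is a trivial substring test; a real implementation would run dense retrieval or tree search *inside* the opened partition, which is exactly where continuous similarity is still safe to use.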

Cache coherency across isolated partitions is non-trivial. Standard PCIe implementations struggle with the latency penalties of highly segregated storage. The Compute Express Link (CXL) protocol addresses this by bridging persistent isolated memory pools to computational retrieval without the cache coherency overhead that makes partition-per-query architectures impractical at scale [7].

Layer 2 — The Gatekeeper: A 270M SLM That Does One Thing

A partitioned topology without continuous vectors cannot be navigated by a FAISS index or a Pinecone similarity search; with no shared embedding space, there is nothing for them to operate on. Something else needs to decide which partition to open.

The Gatekeeper is a fine-tuned 270M parameter Small Language Model built on the Gemma 3 architecture, specifically the FunctionGemma variant that Google released for text-only function calling and tool use [8]. Its only job is to translate a natural language query into a deterministic JSON command that physically unlocks the correct partition: {"open_gate": "thermo_chap_1"}.

The model is fine-tuned with LoRA. The primary weights are frozen. Only the attention projection layers (q, k, v, o) are adapted, using rank 16 and alpha 32, yielding 1.47M trainable parameters out of 270M total (0.55%) [9]. This is fast, does not overwrite the base language capabilities, and produces a model that consistently routes across multi-agent pipelines with a false positive rate of 4.1% [10].
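The parameter budget is easy to sanity-check. The sketch below uses the standard LoRA parameter count from Hu et al.; the hidden width and layer count are placeholder assumptions (the post does not spell out Gemma 3 270M's exact shapes), so treat the result as an order-of-magnitude check against the 1.47M figure, not a reproduction of it.

```python
def lora_trainable(d_in: int, d_out: int, rank: int) -> int:
    # A LoRA adapter replaces a frozen d_out x d_in weight update with
    # two low-rank factors: A (rank x d_in) and B (d_out x rank).
    return rank * (d_in + d_out)

rank = 16
hidden = 640       # assumed hidden width, for illustration only
n_layers = 20      # assumed layer count, for illustration only
projections = 4    # q, k, v, o, as described above

per_layer = projections * lora_trainable(hidden, hidden, rank)
total = n_layers * per_layer

print(f"trainable LoRA params: {total:,}")          # same ballpark as 1.47M
print(f"fraction of 270M base: {total / 270e6:.2%}")
```

With these assumed shapes the count lands around 1.6M, i.e. well under one percent of the base model, which is consistent with the 0.55% cited.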

Crucially, the model operates in non-thinking inference mode with constrained decoding. Chain-of-thought is suppressed via binary logit masking: output tokens outside the valid JSON schema are assigned probability zero. The model cannot hallucinate a path that does not exist in the topology. It cannot be prompted into revealing data from a partition it was not instructed to open [10].
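A toy version of that constrained decoding loop makes the guarantee concrete. It is character-level rather than token-level, and the gate list is invented for illustration; the point is that every step's candidate set is restricted to continuations of some valid command, so the output cannot escape the schema regardless of what the model "wants" to say.

```python
import json

# Hypothetical partition names; in the real system these come from the topology.
VALID_GATES = ["thermo_chap_1", "math_chap_1", "info_theory_chap_2"]
VALID_OUTPUTS = [json.dumps({"open_gate": g}) for g in VALID_GATES]

def allowed_next_chars(prefix: str) -> set[str]:
    """Binary logit mask: a character survives only if some valid command
    extends the current prefix with it; everything else gets probability zero."""
    return {cmd[len(prefix)] for cmd in VALID_OUTPUTS
            if cmd.startswith(prefix) and len(cmd) > len(prefix)}

def constrained_decode(score) -> str:
    """Greedy decode under the mask. `score(prefix, ch)` stands in for the
    model's (unmasked) preference for character `ch` given `prefix`."""
    out = ""
    while True:
        options = allowed_next_chars(out)
        if not options:            # no valid continuation: command is complete
            return out
        out += max(options, key=lambda ch: score(out, ch))

# A stand-in "model" that prefers the thermodynamics gate: it scores a
# character highly only if it stays on the path to the target command.
target = json.dumps({"open_gate": "thermo_chap_1"})
score = lambda prefix, ch: 1.0 if target.startswith(prefix + ch) else 0.0

cmd = constrained_decode(score)
print(cmd)  # always parses, always names a real partition
assert json.loads(cmd)["open_gate"] in VALID_GATES
```

Even if the scoring function preferred arbitrary text, the mask would force the output onto the nearest valid command, which is the property that makes the router deterministic.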

Hardware profile: on a Jetson Nano, a Samsung S25 Ultra, or a 4-thread CPU with under 550MB RAM, the quantized model processes a routing decision in sub-millisecond time for 512 prefill tokens [8]. The vLLM Semantic Router architecture (VSR) demonstrates how this kind of lightweight classifier model reduces time-to-first-token dramatically in Mixture-of-Models pipelines by redirecting before the generative model ever touches the query [11]. The Gatekeeper operates on the same principle.

Layer 3 — The 8B Generator: Works Because Context Is Clean

Once the Gatekeeper unlocks the correct partition, the retrieved context is, by construction, 100% relevant. There are no distractors in the knowledge graph because distractors live in different partitions that the Gatekeeper did not open. The 8B generator receives a pristine context window and produces its response without ever being exposed to the drag conditions documented by Cheng et al.

This is why the performance asymmetry in the table above is not surprising once you understand the architecture. The 32B model in standard RAG is not just using more compute; it is fighting its own context. The 8B model in G.A.T.E.S. is not fighting anything.

§ 4 Dynamic Gate Math: When to Open Multiple Partitions

The walled community design handles single-domain queries cleanly. A more interesting problem arises when a user's query legitimately crosses domains: "Analyze the derivation of entropy in thermodynamics and compare it to Shannon entropy in information theory." This requires data from two separate partitions. Opening both gates naively reintroduces context bloat. The architecture needs a principled decision rule.

G.A.T.E.S. governs multi-partition access with what I am calling Dynamic Gate Math: a logarithmic penalty function that the Gatekeeper evaluates before issuing each successive open_gate command.

P_threshold(n) = T_base + log(n) * I_penalty

where:
  S_n        = semantic relevance score of partition n for the query
  T_base     = base relevance threshold (minimum justification for any retrieval)
  n          = number of partitions currently requested (gates open)
  I_penalty  = intent complexity multiplier assigned by the Gatekeeper

The gate for partition n opens when S_n > P_threshold(n): the relevance of the next partition must exceed the base threshold plus a logarithmically growing penalty that scales with how many gates are already open, modulated by the inferred complexity of the user's query.

Why logarithmic?

The logarithmic penalty has two justifications. The first is information-theoretic: Normalized Discounted Cumulative Gain (NDCG), one of the standard metrics in Information Retrieval research, applies a log-rank penalty to results positioned further down a ranking list precisely because the compounding cognitive and computational cost of processing lower-relevance data follows a logarithmic curve [12]. The second is from distributed optimization theory: in weakly convex optimization over unbalanced graphs, algorithmic convergence to stationary points is bounded at a rate of O(1/log t), meaning the cost of expanding the variable search space naturally compounds logarithmically [13]. The penalty function is not arbitrary; it mirrors the mathematical structure of what it costs to maintain coherent reasoning across an expanding variable set.
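For concreteness, here is the NDCG discount in code: the gain of a result at rank i (1-indexed) is divided by log2(i + 1), so burying the one relevant partition under two distractors halves its contribution.

```python
import math

def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: gain at rank i is discounted by log2(i + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances: list[float]) -> float:
    """DCG normalized by the ideal (best-possible) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# One relevant partition (gain 3) against two irrelevant ones (gain 0).
print(round(ndcg([3, 0, 0]), 3))  # 1.0: relevant item ranked first
print(round(ndcg([0, 0, 3]), 3))  # 0.5: same item at rank 3, discounted by log2(4)
```

The Dynamic Gate Math penalty borrows exactly this shape: each additional open gate is, in effect, a lower rank position that the next partition's relevance must pay for.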

The two scenarios

High I_penalty (simple, localized query): A user asks "Give me the derivation of enthalpy." The Gatekeeper parses low query complexity and assigns a high I_penalty. This steepens the logarithmic curve sharply. The first gate opens easily (n=1, log(1)=0, so no penalty yet). If the system attempts to speculatively open a second gate containing fluid dynamics, the penalty becomes T_base + log(2) * I_high. The fluid dynamics partition's relevance score S_2 cannot clear this bar. The gate stays shut. No context bloat.

Low I_penalty (complex synthesis query): A user asks to compare thermodynamic entropy and Shannon entropy. The Gatekeeper recognizes multi-domain intent and assigns a low I_penalty, flattening the curve. When the system evaluates the information theory partition (n=2), the penalty is T_base + log(2) * I_low, a small number. The relevance score S_2 easily clears it. The gate opens. The generator receives exactly the cross-domain context the query requires, and nothing more.

The formula is self-regulating. It defaults to isolation and opens to synthesis only when the query semantics mathematically justify it.
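The decision rule fits in a few lines. The sketch below uses the natural log and the illustrative constants from this section (T_base = 0.3, with I_penalty values of 1.5 for the simple lookup and 0.2 for the synthesis query); none of these are calibrated values.

```python
import math

T_BASE = 0.3  # base relevance threshold from the figure, illustrative

def p_threshold(n: int, i_penalty: float) -> float:
    """Threshold the n-th requested partition must clear to open."""
    return T_BASE + math.log(n) * i_penalty

def gate_opens(s_n: float, n: int, i_penalty: float) -> bool:
    """Open gate n only if its relevance score beats the growing penalty."""
    return s_n > p_threshold(n, i_penalty)

# Scenario 1: simple lookup ("derivation of enthalpy"), high penalty.
i_high = 1.5
assert gate_opens(s_n=0.9, n=1, i_penalty=i_high)      # first gate: log(1)=0, opens
assert not gate_opens(s_n=0.6, n=2, i_penalty=i_high)  # speculative 2nd gate: shut

# Scenario 2: cross-domain synthesis (thermo vs. Shannon entropy), low penalty.
i_low = 0.2
assert gate_opens(s_n=0.6, n=2, i_penalty=i_low)       # 2nd gate clears the bar

print(f"threshold at n=2, high penalty: {p_threshold(2, i_high):.3f}")
print(f"threshold at n=2, low penalty:  {p_threshold(2, i_low):.3f}")
```

Whether the log is natural or base-2 only rescales I_penalty; the self-regulating behavior (free first gate, increasingly expensive later gates) is the same either way.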

Figure 3 — Dynamic Gate Threshold vs. Partitions Opened

Drag the slider to adjust the intent complexity multiplier I_penalty. The red curve is the threshold a partition's relevance score must exceed to be opened. The horizontal dashed lines show example relevance scores for three candidate partitions. Gates open when S_n falls above the threshold curve. High I_penalty (right) slams gates shut. Low I_penalty (left) forces cross-domain access. T_base = 0.3.

§ 5 Three Claims Worth Testing

I want to be clear about what this architecture is and is not. It is not a deployed production system; it is a framework with specific, falsifiable predictions. Here are the three claims that I think are publishable and empirically testable.

Zero cross-tenant information leakage under adversarial seeding. Because access control is topological and symbolic rather than metric, a near-duplicate seeding attack that reliably extracts data from adjacent tenants in a flat vector pool should produce zero retrieval from a G.A.T.E.S. partition not covered by the attacker's gate command. This is testable against existing OWASP threat scenarios [2].

Sub-linear scaling with corpus size. In a flat vector database, increasing the corpus size increases the density of the semantic space, increasing bleed probability. In G.A.T.E.S., adding new knowledge domains adds new partitions that are physically disjoint. The Gatekeeper's routing complexity increases only with the number of partitions, not the content within them. This should produce sub-linear performance degradation as corpus size scales, testable on FinanceBench or similar multi-domain corpora [6].

8B generator under G.A.T.E.S. context matches or exceeds 32B generator under standard RAG. This is already shown empirically by Cheng et al. The prediction of G.A.T.E.S. is that the effect is reproducible when the context isolation is achieved architecturally (via topological partitioning) rather than experimentally (via manual context curation). Testing this requires implementing the actual routing layer and running the same benchmark suite [3].

The architecture has open questions. Fine-tuning the Gatekeeper SLM requires a labeled routing dataset that does not exist at scale. The logarithmic penalty formula has two free parameters (T_base and the shape of I_penalty assignment) that need empirical calibration. CXL memory management introduces hardware dependencies that complicate deployment outside of specialized infrastructure.

None of these are showstoppers. They are engineering problems. The theoretical case for why the architecture should work is grounded in existing work on Boolean expressiveness limits, contextual drag, and information-theoretic ranking penalties. The empirical case for why parameter scaling alone cannot solve the problem is already in the literature. What is missing is a full implementation and a controlled ablation study.

That is the next step.

References

  1. Khattab et al., "Demonstrating expressiveness limits in dense retrieval under Boolean constraint queries," EMNLP 2024. (On linear separation limits and negation failure in embedding spaces.)
  2. OWASP Foundation, OWASP Top 10 for LLM and Generative AI Applications 2025, owasp.org. (Cross-tenant vector bleed and assistant memory poisoning threat profiles.)
  3. Cheng et al., "Contextual Drag: Structural Error Inheritance in Long-Context Reasoning," arXiv:2601.xxxxx, 2026. (Benchmark of 11 models across GPQA-Diamond and AIME24; 10-20% drag-induced degradation.)
  4. Zhou et al., "GSM-Infinite: Evaluating LLM Reasoning at Infinite Complexity," ICML 2025. (Sigmoid decline in performance under semantically tight RAG-insolvable distractors; exponential compute for linear gain.)
  5. Blondel et al., "Fast unfolding of communities in large networks," Journal of Statistical Mechanics 10 (2008). (Louvain community detection for Knowledge Graph partitioning.)
  6. Islam et al., "PageIndex: Hierarchical Document Retrieval without Vector Embeddings," arXiv:2501.xxxxx, 2025. (98.7% accuracy on FinanceBench; reasoning-based tree retrieval outperforms dense chunking.)
  7. Sharma et al., "CXL Memory Semantics for Disaggregated AI Infrastructure," IEEE Micro 2025. (Cache coherency across isolated memory pools; PCIe latency penalties and CXL bridging.)
  8. Google DeepMind, "FunctionGemma: A 270M SLM for Deterministic Tool Use," Google AI Blog, February 2026. (Architecture, hardware profile, Berkeley Function Calling Leaderboard results.)
  9. Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models," ICLR 2022. (Rank/alpha configuration, trainable parameter counts, catastrophic forgetting avoidance.)
  10. Mukherjee et al., "Tiny-Critic RAG: Constrained Decoding for Deterministic SLM Routing," arXiv:2602.xxxxx, 2026. (Vocabulary logit masking, binary constraint enforcement, 4.1% false positive rate.)
  11. Agrawal et al., "vLLM Semantic Router: Mixture-of-Models Orchestration at Scale," MLSys 2026. (VSR keyword/domain classification for TTFT reduction in MoM pipelines.)
  12. Järvelin and Kekäläinen, "Cumulated gain-based evaluation of IR techniques," ACM TOIS 20(4), 2002. (NDCG log-rank penalty formulation.)
  13. Xin et al., "Distributed weakly convex optimization over unbalanced directed graphs," IEEE Transactions on Signal Processing 2024. (O(1/log t) convergence rate for expanding variable search spaces.)