No single model will save us
A continuation of When Deliberation Meets Reality and Building Critical Thinking Infrastructure, which together argued that the missing precondition for collective flourishing is critical thinking infrastructure fast enough to survive contact with reality. This post, Part 3 of the Critical thinking infrastructure essay series, addresses where AI sits in that architecture, why no single model can take its place, and where the architecture goes next.
Abstract
The instinct in 2026 is to reach for one large language model and ask it to do everything. This post argues, from Wolpert and Macready’s no free lunch theorems and from seventy years of overpromise and correction in computational optimisation, that no single model outperforms all others across all problem instances. The sensible response is to invest in a stable substrate with pluggable specialised models rather than to chase a god-model. The post then works through where AI sits in the four-layer architecture: extraction and entity resolution at the spatial layer, graph neural networks and causal models at the domain layer, classical mathematical solvers at the compute layer, and a general-purpose LLM as the facilitator that translates questions into compositions of the above. The platform is the infrastructure; the models are replaceable.
Hypotheses introduced (one per section):
- No single model or technology outperforms all others across all problem instances.
- The value of AI in decision support comes from specialised models operating at specific points in a stable infrastructure, orchestrated by a general-purpose facilitator. The infrastructure is the platform; the models are pluggable.
- The architecture evolves through three stages, each extending the scope of the system while preserving the stable base layers. The third stage is where the system begins to model decision-making itself.
Introduction
The previous post proposed a four-layer architecture (spatial knowledge graph, domain graph, solver integration, micro-tools) with a facilitator role that compresses ttQ and ttA inside the decision window. It closed by flagging the next place where compression has to happen: the facilitator role, today filled by a human who translates decision-makers’ questions into queries against the graph. It also flagged a separate concern that needed its own treatment: why no single model will save us. This post takes up both threads, because they are the same thread. Answering one answers the other. The work is to settle which model belongs at which point in the architecture, and why the architecture itself has to stay stable while the models attached to it change.
No free lunch
Hypothesis: no single model or technology outperforms all others across all problem instances.
The instinct in 2026 is to reach for a single large language model and ask it to do everything. Translate the question, query the data, run the analysis, summarise the result, defend the recommendation. The architecture in the previous post looks elaborate by comparison, with its stable spatial foundation, declarative ontology, formal contracts between layers, and specialised solvers behind disciplined adapters. A reasonable reader, especially one who watched LLMs collapse time-to-answer across a dozen white-collar tasks in two years, might wonder whether all those layers do real work or whether they are over-engineering in a field the model layer is about to absorb.
The field has heard the same question, in different guises, for nearly seventy years. The answer has not changed. A new technology arrives, posts impressive results on a class of problems, and is anointed the universal solution. Newton’s mechanics was going to predict everything until quantum phenomena and chaotic systems showed where its reach ended. Genetic algorithms were going to optimise any function until they ran into Wolpert and Macready’s No Free Lunch theorems in 1997. The theorems proved formally that any algorithm’s elevated performance on one class of problems is exactly offset by degraded performance on another1. Big data and statistical inference were going to render theory obsolete. Then deep learning was going to learn any pattern, until it turned out to be brittle to adversarial perturbations and unable to reason about causal structure.
Large language models were going to scale their way to general intelligence, with the scaling laws read as a straight line from more parameters and more data to AGI. When the curve started to bend, the story shifted to context windows, on the argument that a million tokens of working memory would close whatever gap remained. When that did not deliver either, the definition of AGI itself drifted, redefined from “human-level general intelligence” to whichever economic or benchmark threshold the next model release happened to clear. Each cycle of overpromise and correction takes five to ten years, and each ends with the same realisation: the technology is excellent at what it is excellent at, and unreliable at what it is not.
The relevance to critical thinking infrastructure is direct. A facility location problem with discrete coverage trade-offs is a combinatorial optimisation problem, not a language problem, and asking an LLM to solve it directly produces fluent output that is often wrong in ways the decision-maker cannot detect from the answer alone. A causal question about supply chain propagation is not a pattern-matching problem, and a model trained on text correlations will reproduce the correlations without distinguishing them from causes. A spatial overlay between a regulatory zone and a building dataset is not a reasoning task in the sense LLMs are good at, and a model trying to answer it from parametric memory will invent buildings, boundaries, or both. The first post argued that the binding constraint on decision quality is speed. The second showed how to compress that speed without sacrificing rigour. A god-model strategy collapses both arguments, because it produces fast answers by giving up the grounding that made the speed worth having. Speed without traceable evidence is just gut instinct in a more articulate voice.
The architecture in the previous post takes the opposite position. The stable base layers are technology-agnostic infrastructure. The models attached at each point are chosen for their fit to a specific sub-problem and designed to be swapped when something better arrives. There is a counter-argument that LLMs can write code that solves all the above, and it is a good one. It also fits within the architecture, at the micro-tools layer. This is not a philosophical commitment to eclecticism so much as an engineering response to a mathematical fact: any elevated performance on one class of problems comes at a cost on another, so the sensible move is to invest in the substrate that lets you match the right tool to each sub-problem rather than search for the one tool that handles everything. The rest of this post works through what that looks like layer by layer, and then traces where the architecture goes next.
Where AI lives in this architecture
Hypothesis: the value of AI in decision support comes from specialised models operating at specific points in a stable infrastructure, orchestrated by a general-purpose facilitator. The infrastructure is the platform; the models are pluggable.
The current discourse treats AI as a single capability: you have “an AI” and it does things. The framing inherits the god-model assumption. In practice, the architecture needs different AI capabilities at different layers, each with different characteristics, each serving a different part of the planning problem. LLMs do not have to be applied at every layer. But as a new technology, it is worth working out where they, and older fields of AI, may help.
At the spatial knowledge graph layer, the tasks are construction and maintenance. Large language models can extract entity relationships from unstructured text, parsing shipping manifests, news articles, regulatory filings, and corporate disclosures to populate the graph with supplier-to-manufacturer relationships that no single database contains2. Computer vision models extract building footprints from satellite imagery to fill gaps in crowdsourced mapping data. Spatial entity resolution models, specialised for multilingual address matching and proximity scoring, determine that two records in different databases refer to the same real-world building. These are narrow, well-defined tasks where domain-specific models beat general-purpose ones by a wide margin.
At the domain graph layer, graph neural networks earn their keep. GNNs can perform link prediction on supply chain networks, inferring hidden relationships between firms that are not directly observable in public data3. They can detect anomalous patterns in temporal graphs to flag bullwhip amplification or supply chain contagion before it cascades. Time series models handle demand forecasting. Causal inference models distinguish correlation from genuine propagation mechanisms in disruption scenarios4. Each is a specialised model trained on graph-structured data, operating within the domain ontology rather than outside it.
At the solver integration layer, the models are not neural at all. Facility location uses mixed-integer programming. Coverage and set-cover problems use combinatorial optimisation. Network flow problems use min-cost flow algorithms. Resource allocation under hard constraints uses constraint programming. These are mature, well-understood methods that outperform learned approaches on their specific problem classes because the problem structure is known and exploitable. Replacing a mixed-integer solver with a neural network would be a god-model move: it trades decades of problem-specific engineering for a general-purpose approximator that is slower and less reliable on this class of problem5. Reinforcement learning has a role in adaptive heuristic selection, learning which solver configuration works best for which sub-problem. The solvers themselves remain classical.
At the facilitator layer, the general-purpose LLM earns its place. The facilitator’s job is translation: converting a decision-maker’s natural language question into a composition of graph queries, solver invocations, and micro-tool calls. This needs broad language understanding, domain awareness loaded from the ontology, and the ability to decompose a vague question into precise sub-queries. A general-purpose LLM orchestrating specialised tools is the right architecture at this layer, for the same reason a human facilitator with broad knowledge and access to specialised tools tends to outperform either a specialist without breadth or a generalist without tools.
The interesting design decision is what sits between the general-purpose facilitator and the specialised models, because this is where small, domain-specific LLMs become useful. A model fine-tuned on the domain ontology, on the patterns of graph queries that answer common planning questions, and on the mapping between natural language and formal graph operations can translate faster and more reliably than a general-purpose model reasoning from first principles each time. It can also be tested, validated, and version-controlled against the ontology it serves in a way a general-purpose model cannot.
The open research question is where the boundary between these two sits. Which translation tasks benefit from domain-specific fine-tuning, and which are better handled by a general-purpose model with the right context? The answer is almost certainly “both, at different points,” which is what the No Free Lunch theorem predicts. The architecture is designed to make that boundary traversable, so a task can start with the general-purpose facilitator and migrate to a specialised model as usage patterns emerge and training data accumulates. This is P8 (materialisation is an economic decision) applied to AI itself6.
The research trajectory: from logistics to collective intelligence
Hypothesis: the architecture evolves through three stages, each extending the scope of the system while preserving the stable base layers. The third stage is where the system begins to model decision-making itself.
v0: Urban logistics (partially tested, 2020 onwards). The current setup focuses on a single city, a single organisation, and a single human facilitator. The spatial knowledge graph covers one geography. The domain graph represents one organisation’s operational problem. The facilitator is a person who knows both the system and the client’s domain. This is where the core concepts get validated: that pre-built spatial foundations compress project setup, that scenario comparison changes how planners think, and that iteration rate determines decision quality. The limitation is that it scales with the facilitator’s availability. Any given facilitator can serve one client at a time.
v1: Supply chains and global logistics (current research target). The next setup will enable multiple cities, multiple organisations, and AI facilitators. The spatial graph extends to global infrastructure: ports, shipping lanes, chokepoints, trade corridors. The domain graph extends upstream from movement (logistics) to production (supply chains), covering bills of materials, procurement relationships, and market signals. The AI facilitator replaces the human facilitator for routine questions and drives ttQ toward zero. Shipping is the obvious proving ground: the Suez blockage, Red Sea disruptions, tariff-driven sourcing shifts, and container imbalance are all graph problems the architecture is designed for7. The research questions here concern data sharing (firms will not share competitive information), multi-scale temporal reasoning (hourly market signals alongside yearly product architectures), and whether the AI facilitator can match the human facilitator’s knack for asking the question behind the question.
v2: Decision simulation and collective intelligence (blue sky). This is where the architecture turns reflexive. If every decision made through the system produces an inspectable trace (as the first post argued), then after enough decisions the system accumulates a dataset of how decision-makers use analytical tools: which questions they ask first, which scenarios they compare, what trade-offs they accept, where they override the model’s recommendation and why. That dataset enables a different kind of model, one trained not on the domain but on the decision process itself.
A small LLM trained on facilitator traces can learn the patterns of expert questioning. It can anticipate what a planner will ask next based on what planners in similar situations have asked before. It can also identify blind spots: questions that should be asked but are not. The prompt might read, “you have not asked about supplier concentration risk in the eastern corridor, but the graph shows 80% of your tier-2 suppliers depend on the same port.” The role of the system shifts from answering questions to helping generate them. The decision-maker stays in charge, but works against a tool that pushes back on omissions rather than only responding to prompts.
The same capability opens the door to adversarial modelling. If you can model how a decision-maker uses the tool, you can model how a competitor or adversary would respond to the same information. That lets you ask questions like, “if we shift sourcing away from supplier X, what is the most likely competitive response from firms Y and Z, and how does that move the price floor over the next two quarters?” This is wargaming conducted on the actual decision infrastructure rather than on a simplified simulation built separately for the purpose. The traces feeding it are real operational decisions, not synthetic scenarios.
At the collective level, v2 opens the possibility of federated intelligence. Multiple organisations run the same architecture on their own data, with shared public base layers (the spatial knowledge graph) and private domain layers (their own supply chain data). No organisation shares raw data. They share aggregated patterns: “demand for container shipping from Southeast Asia to Northern Europe has increased 15% this month.” The spatial knowledge graph becomes public infrastructure in the same sense that weather services or financial market data are public. The domain-specific insights stay private. The collective benefits from shared situational awareness without requiring any single organisation to reveal competitive information.
The step from v1 to v2 is the point at which the system stops being a decision support platform and becomes a research instrument for studying collective decision-making. If the traces are available with appropriate consent and anonymisation, researchers can ask questions that are hard to answer empirically today: how decision-makers under pressure use analytical tools, how iteration speed relates to decision quality in real-world settings, and whether inspectability changes behaviour or whether people mostly ignore the trace. These are empirical questions about human cognition and collective coordination, and the architecture allows them to be studied through actual operational decisions rather than through laboratory experiments on undergraduates solving toy problems.
That is the connection to ARIA’s Collective Flourishing programme. The architecture is infrastructure for studying collective flourishing empirically, using supply chains as the laboratory and decision traces as the data.
The next post in this series turns to the institutional question this trajectory raises but does not answer: who builds and maintains the public substrate the architecture depends on, and why it has to exist as public infrastructure rather than as anyone’s product.
Wolpert, D.H. and Macready, W.G. (1997) ‘No Free Lunch Theorems for Optimization’, IEEE Transactions on Evolutionary Computation, 1(1), pp. 67–82. Available at: https://doi.org/10.1109/4235.585893. Note: For an accessible explanation of the theorem and a discussion of its implications for modern machine learning, see Ho, Y.C. and Pepyne, D.L. (2002) ‘Simple Explanation of the No-Free-Lunch Theorem and Its Implications’, Journal of Optimization Theory and Applications, 115(3), pp. 549–570. Available at: https://doi.org/10.1023/A:1021251113462. ↩︎
AlMahri, S., Xu, L. and Brintrup, A. (2025) ‘Enhancing supply chain visibility with knowledge graphs and large language models’, International Journal of Production Research, pp. 2178–2209. Available at: https://doi.org/10.1080/00207543.2025.2575841. Note: The framework uses zero-shot LLM prompting for named entity recognition and relation extraction over public sources, validated on electric vehicle battery supply chains and demonstrated to extend visibility beyond tier-2 suppliers without requiring direct information sharing between stakeholders. The earlier arXiv preprint is also available at https://arxiv.org/abs/2408.07705. ↩︎
Kosasih, E.E., Margaroli, F., Gelli, S., Aziz, A., Wildgoose, N. and Brintrup, A. (2024) ‘Towards knowledge graph reasoning for supply chain risk management using graph neural networks’, International Journal of Production Research, 62(15), pp. 5596–5612. Available at: https://doi.org/10.1080/00207543.2022.2100841. Note: The same paper is referenced in the previous post on the role of ontologies in supply chain knowledge graphs. The authors demonstrate that graph neural networks trained on partial supply chain data can infer plausible hidden relationships between firms, addressing the structural problem that no single database has full visibility beyond tier 1. ↩︎
Wyrembek, M., Baryannis, G. and Brintrup, A. (2025) ‘Causal machine learning for supply chain risk prediction and intervention planning’, International Journal of Production Research, 63(15), pp. 5629–5648. Available at: https://doi.org/10.1080/00207543.2025.2458121. Note: The paper applies directed acyclic graphs and counterfactual reasoning to supplier intervention planning, demonstrated on a maritime engineering case study. It also makes the broader argument that black-box machine learning models are insufficient for supply chain risk management because managers are held accountable for outcomes and need to understand the causal mechanisms behind a recommendation before acting on it. ↩︎
Da Ros, F., Soprano, M., Di Gaspero, L. and Roitero, K. (2025) ‘Large Language Models for Combinatorial Optimization: A Systematic Review’, ACM Computing Surveys. Available at: https://doi.org/10.1145/3801961. Note: The review synthesises 103 studies and finds that classical solvers continue to outperform learned approaches on well-structured combinatorial problems, while LLMs show their most promise in problem formulation, heuristic generation, and algorithm selection rather than in direct solving. The arXiv preprint is available at https://arxiv.org/abs/2507.03637. ↩︎
P8 (materialisation is an economic decision, not a structural one) is the eighth governing principle from the Core Principles and Architectural Decisions reference document. The principle states that the position of any computation on the spectrum from ephemeral micro-tool to permanent infrastructure should be determined by observed reuse relative to compute cost, not by design-time classification. Applied to AI, it means a translation task should start at the general-purpose facilitator and migrate to a fine-tuned specialised model only when usage data justifies the materialisation. ↩︎
The current shipping disruption landscape provides an unusually rich test environment. Houthi attacks in the Red Sea caused a roughly 75% collapse in container shipments through the corridor, US tariff policy is driving real-time sourcing and inventory decisions, and container spot rates remain volatile. Each of these is a graph problem: supplier exposure analysis, multi-modal capacity reallocation, or temporal scenario comparison under uncertainty. Note: For a recent treatment of how supply chain network structure transmits financial and operational shocks at scale, see Fialkowski, J., Diem, C., Borsos, A. and Thurner, S. (2025) ‘A data-driven econo-financial stress-testing framework to estimate the effect of supply chain networks on financial systemic risk’, arXiv preprint, arXiv:2502.17044. Available at: https://arxiv.org/abs/2502.17044. ↩︎