How A Large Language Model Can Run On A Network

A large language model does not have to live in one building. It can be cut into pieces, and the pieces can run on machines scattered across the world, owned by people who have never spoken to each other. The model still answers. This is not a forecast. It has been working in a published research system since 2022.

The system is called Petals. It was built by a research collaboration that includes engineers from the University of Washington, Hugging Face, Yandex Research, and the Higher School of Economics, and it was presented at the Association for Computational Linguistics conference in 2023. The paper is public on arXiv and the code is open source on GitHub under a permissive license. Anyone can read exactly how it works. Most people have not, because conventional wisdom says large models require concentrated infrastructure, and a working counterexample does not get the attention the conventional wisdom gets.

This entry walks through how Petals runs a model across strangers' machines. Not the marketing version. The mechanism. Once the mechanism is clear, the claim that frontier-scale inference requires a hyperscale data center stops being a law of physics and starts being a business decision.

The Model Is A Stack, And A Stack Can Be Cut

The first thing to understand is what a language model is shaped like on the inside. A modern transformer model is not a single dense block of computation. It is a tall stack of nearly identical layers, called transformer blocks, run one after another. Input goes in at the bottom, passes up through every block in order, and an answer comes out at the top. A model with 70 billion parameters might have 80 of these blocks. A model with 176 billion parameters might have 70 larger ones. The exact count varies. The structure does not. It is a stack, and the blocks run in sequence.

A stack that runs in sequence can be cut into segments. This is the entire idea. If one machine holds blocks 1 through 10, a second machine holds blocks 11 through 25, a third holds 26 through 40, and so on until the stack is covered, then a request can travel up through all of them in order and produce exactly the same answer it would have produced on one giant machine. The computation is identical. It has simply been spread across a chain.
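The claim that cutting changes nothing can be checked on a toy stack. The sketch below is an illustration only, not Petals code: `make_block`, `run_blocks`, and `split_stack` are hypothetical names, and each "block" is a trivial arithmetic function standing in for a transformer block. It assumes only that every block is deterministic and that segments run in order.

```python
from typing import Callable, List

Block = Callable[[float], float]


def make_block(i: int) -> Block:
    # Toy stand-in for a transformer block: any deterministic function will do.
    return lambda x: 0.5 * x + i


def run_blocks(blocks: List[Block], x: float) -> float:
    # Run a list of blocks one after another, feeding each output forward.
    for block in blocks:
        x = block(x)
    return x


def split_stack(blocks: List[Block], cuts: List[int]) -> List[List[Block]]:
    # cuts=[10, 25] yields three segments: blocks[0:10], blocks[10:25], blocks[25:].
    edges = [0, *cuts, len(blocks)]
    return [blocks[a:b] for a, b in zip(edges, edges[1:])]


stack = [make_block(i) for i in range(40)]
segments = split_stack(stack, [10, 25])

# One giant machine runs the whole stack.
whole = run_blocks(stack, 1.0)

# A chain of machines runs one segment each, handing the result along.
chained = 1.0
for segment in segments:
    chained = run_blocks(segment, chained)

print(whole == chained)
```

The same operations run in the same order either way, so the two results match. The only thing the cut adds is a hand-off between segments.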

Petals calls the machines holding blocks servers. According to the project documentation, a server holds a subset of the model's layers on its local hardware and answers requests for those layers. How many blocks a server holds depends on how much memory it has. A machine with a small consumer graphics card holds a few blocks. A machine with more memory holds more. Nobody has to hold the whole model. That is the point. The whole model never has to fit anywhere.
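A rough sense of "a few blocks" versus "more" comes from simple division. The sketch below is back-of-envelope arithmetic under loud assumptions: parameters are spread evenly over 70 blocks (ignoring the input and output layers), weights are stored at 8 bits each, and no memory is reserved for running requests. `blocks_that_fit` is a hypothetical helper, and the 8 GiB and 24 GiB cards are example sizes, not figures from the project.

```python
GIB = 1024 ** 3


def blocks_that_fit(free_bytes: int, params_per_block: int, bits_per_param: int) -> int:
    # Whole blocks only: a server cannot usefully hold a fraction of a block.
    bytes_per_block = params_per_block * bits_per_param // 8
    return free_bytes // bytes_per_block


# Assumption: 176 billion parameters divided evenly across 70 blocks.
params_per_block = 176_000_000_000 // 70

small_card = blocks_that_fit(8 * GIB, params_per_block, bits_per_param=8)
large_card = blocks_that_fit(24 * GIB, params_per_block, bits_per_param=8)

print(small_card, large_card)
```

A real server also needs room for the context of in-flight requests, so the true counts would be lower than this estimate.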

The person who wants an answer runs what Petals calls a client. The client is light. According to the research paper, to run inference on a 176-billion-parameter model the client needs about 12 gigabytes of RAM, most of which is spent holding the model's input and output layers, and a network connection of around 25 megabits per second in each direction. That is an ordinary laptop on ordinary home internet. The heavy compute, the transformer blocks that hold nearly all of the 176 billion parameters, lives out in the swarm on other people's machines.

Here is the sequence when someone asks the model a question. The client turns the input text into numbers using the input layer it holds locally. It then finds a chain of servers that together cover every block of the model from first to last. It sends its numbers to the first server in the chain. That server runs its blocks and passes the result to the next server. That one runs its blocks and passes it onward. The signal climbs the chain, server by server, until it reaches the end, and the final result comes back. Crucially, what moves between machines is never the model's weights and never the original text in a readable form. What moves is the intermediate state, the activations, the partly-processed numbers in flight between one layer and the next.
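How much actually moves per hop can be estimated with arithmetic. The sketch below assumes a hidden size of 14,336, which comes from the published configuration of the 176-billion-parameter BLOOM model rather than from this entry, and assumes uncompressed 16-bit activations. The function names are hypothetical. It ignores latency and protocol overhead, so it is a lower bound on transfer time, not a prediction.

```python
# Assumption: hidden size taken from the published BLOOM-176B configuration.
HIDDEN_SIZE = 14_336
# Assumption: activations sent as 16-bit values with no compression.
BYTES_PER_VALUE = 2
# The 25 Mbit/s client link quoted earlier in this entry.
LINK_BITS_PER_SECOND = 25_000_000


def activation_bytes(tokens: int) -> int:
    # One vector of HIDDEN_SIZE values per token crosses each hop.
    return tokens * HIDDEN_SIZE * BYTES_PER_VALUE


def transfer_seconds(num_bytes: int, bits_per_second: int = LINK_BITS_PER_SECOND) -> float:
    # Pure bandwidth cost; network latency comes on top of this.
    return num_bytes * 8 / bits_per_second


one_token = activation_bytes(1)
one_hop = transfer_seconds(one_token)

print(f"{one_token} bytes per token per hop, about {one_hop * 1000:.1f} ms on the link")
```

Under these assumptions a single generated token costs tens of kilobytes per hop, against block weights measured in gigabytes. That asymmetry is why sending activations and leaving weights in place is workable over home internet.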

The Hard Part Is Not The Cutting, It Is The Chaos

Cutting a stack into segments is simple. The hard part is that the segments sit on machines you do not control, and those machines join, leave, and fail without warning. A research system that ignored this would collapse the first time someone closed a laptop. Petals does not ignore it, and the way it handles the chaos is the part worth studying.

The coordination problem is solved with a distributed hash table. This is the same class of technology that lets a file-sharing swarm find who has which piece of a file, with no central index. In Petals, each server periodically announces to this shared table which blocks it is currently serving. There is no central server keeping the list. The list is the network itself. When a client needs to build a chain, it reads the table to discover which machines are holding which blocks right now.
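The announce-and-discover pattern can be sketched in a few lines. This is a toy, not the Petals protocol: `ToyTable` is an in-memory dictionary standing in for a real distributed hash table, the expiry rule and the greedy chain builder are assumptions made for illustration, and every name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Announcement:
    server_id: str
    start: int         # first block served (inclusive)
    end: int           # one past the last block served (exclusive)
    expires_at: float  # announcements go stale unless refreshed


class ToyTable:
    """In-memory stand-in for the distributed hash table."""

    def __init__(self) -> None:
        self._records: Dict[str, Announcement] = {}

    def announce(self, server_id: str, start: int, end: int, ttl: float, now: float) -> None:
        # Re-announcing overwrites the old record and extends its lifetime.
        self._records[server_id] = Announcement(server_id, start, end, now + ttl)

    def live(self, now: float) -> List[Announcement]:
        # Servers that stopped announcing simply age out of the list.
        return [a for a in self._records.values() if a.expires_at > now]


def build_chain(table: ToyTable, num_blocks: int, now: float) -> Optional[List[Tuple[str, int, int]]]:
    """Greedy cover: from the current block, pick the live server reaching furthest."""
    servers = table.live(now)
    chain, pos = [], 0
    while pos < num_blocks:
        candidates = [a for a in servers if a.start <= pos < a.end]
        if not candidates:
            return None  # no live server holds this block right now
        best = max(candidates, key=lambda a: a.end)
        chain.append((best.server_id, pos, best.end))
        pos = best.end
    return chain


table = ToyTable()
table.announce("a", 0, 10, ttl=30, now=0)
table.announce("b", 8, 25, ttl=30, now=0)
table.announce("c", 20, 40, ttl=30, now=0)
table.announce("d", 10, 22, ttl=5, now=0)

chain = build_chain(table, num_blocks=40, now=1)
stale = build_chain(table, num_blocks=40, now=60)
print(chain, stale)
```

The expiry time is the design choice that matters. A server that vanishes never has to say goodbye, because its record lapses on its own and clients stop seeing it.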

The routing is not naive. A later version of the system, described in the project's release notes, has the client build a full graph of the measured latencies between itself and the servers, and between the servers themselves, along with how fast each server runs. It then chooses the chain that will be fastest overall, not just the chain that happens to be available. It even accounts for how much memory each server has left for holding the running context of a request, so it does not route to a fast machine that is about to run out of room. The newer version also lets servers pass results directly to the next server, which removes a round trip.
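Choosing the fastest chain over a latency graph is a shortest-path problem. The sketch below uses Dijkstra's algorithm as a stand-in; it is not taken from the Petals source, and the names `link` and `fastest_chain`, the server table, and all timing numbers are hypothetical. It assumes each server runs from the current block to the end of its range, and it leaves out the memory accounting described above.

```python
import heapq
from typing import Dict, List, Optional, Tuple


def link(latency: Dict[Tuple[str, str], float], a: str, b: str) -> float:
    # Symmetric lookup; unmeasured links get a pessimistic one-second default.
    return latency.get((a, b), latency.get((b, a), 1.0))


def fastest_chain(
    servers: Dict[str, Tuple[int, int, float]],  # id -> (start, end, seconds per block)
    latency: Dict[Tuple[str, str], float],       # (machine, machine) -> seconds
    num_blocks: int,
    client: str = "client",
) -> Optional[Tuple[float, List[str]]]:
    # State: (blocks finished so far, machine currently holding the activations).
    heap = [(0.0, 0, client, ())]
    done = set()
    while heap:
        cost, pos, at, path = heapq.heappop(heap)
        if (pos, at) in done:
            continue
        done.add((pos, at))
        if pos == num_blocks:
            if at == client:
                return cost, list(path)
            # Last hop: the final result travels back to the client.
            heapq.heappush(heap, (cost + link(latency, at, client), pos, client, path))
            continue
        for sid, (start, end, sec_per_block) in servers.items():
            if start <= pos < end and sid != at:
                step = link(latency, at, sid) + (end - pos) * sec_per_block
                heapq.heappush(heap, (cost + step, end, sid, path + (sid,)))
    return None  # the live servers do not cover every block


servers = {
    "near": (0, 10, 0.01),
    "mid": (10, 20, 0.01),
    "far": (0, 20, 0.004),  # faster hardware, but a slow link to the client
}
latency = {
    ("client", "near"): 0.02,
    ("near", "mid"): 0.02,
    ("mid", "client"): 0.02,
    ("client", "far"): 0.15,
}

result = fastest_chain(servers, latency, num_blocks=20)
print(result)
```

In this example the single fast server loses to two slower ones, because its link to the client costs more than its compute saves. That is the difference between picking an available chain and picking the fastest one.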

Failure is treated as normal, not exceptional. The system uses fault-tolerant protocols so that inference continues even as nodes drop mid-request. If a server in the chain disappears while a request is in flight, the chain re-forms around the gap and the request continues. The user is not supposed to notice. This is the same resilience model that lets a file-sharing swarm survive any individual peer leaving. The network is not the machines. The network is the protocol that keeps finding a working path through whatever machines happen to be present.
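One way a chain can re-form without the user noticing is for the client to remember what it sent. The sketch below illustrates that idea with toy classes. It is an assumption-laden illustration, not the Petals implementation: `ToyServer`, `SegmentSession`, and `ServerDown` are hypothetical, and the server's "cache" is a plain list standing in for the per-request context a real server keeps.

```python
from typing import List


class ServerDown(Exception):
    pass


class ToyServer:
    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True
        self.cache: List[int] = []  # stand-in for the running context of a request

    def step(self, x: int) -> int:
        if not self.alive:
            raise ServerDown(self.name)
        self.cache.append(x)
        # The output depends on everything seen so far, like a model with context.
        return x + sum(self.cache)


class SegmentSession:
    """Client-side handle for one segment of the chain, with failover."""

    def __init__(self, replicas: List[ToyServer]) -> None:
        self.replicas = list(replicas)  # servers able to run this segment
        self.current = self.replicas[0]
        self.sent: List[int] = []       # every input sent so far, kept by the client

    def step(self, x: int) -> int:
        while True:
            try:
                out = self.current.step(x)
                self.sent.append(x)
                return out
            except ServerDown:
                self._fail_over()

    def _fail_over(self) -> None:
        while True:
            self.replicas.remove(self.current)
            if not self.replicas:
                raise RuntimeError("no server left for this segment")
            self.current = self.replicas[0]
            try:
                # Replay past inputs so the replacement rebuilds the lost context.
                for past in self.sent:
                    self.current.step(past)
                return
            except ServerDown:
                continue  # the replacement died too; try the next one


healthy = ToyServer("reference")
reference = [healthy.step(x) for x in [1, 2, 3]]

a, b = ToyServer("a"), ToyServer("b")
session = SegmentSession([a, b])
outputs = [session.step(1), session.step(2)]
a.alive = False  # the server disappears mid-request
outputs.append(session.step(3))

print(outputs == reference)
```

The replay step is what keeps the answer unchanged. Without it, the replacement would start with an empty context and return a different result for the third input.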

There is also the matter of weight. Moving activations between machines over home internet is only practical if the activations are small. Petals compresses them. The system uses dynamic quantization to shrink the numbers passed between servers, and stores the model's weights on each server in compressed form. The project integrated a 4-bit compression method drawn from published research, which lets a server hold the same blocks using roughly 40 percent less memory and run generation about twice as fast as the previous approach, with what the researchers describe as relatively small loss in quality. Compression is what makes the consumer-internet link viable.
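The idea behind shrinking activations can be shown with a simple scheme. The sketch below uses linear absmax quantization to 8-bit codes, with one scale per small block of values. This is a deliberately simpler technique swapped in for illustration and should not be read as the exact method Petals uses; `quantize`, `dequantize`, and the block size of 4 are hypothetical choices.

```python
from typing import List, Tuple

Packet = Tuple[float, List[int]]  # (scale, integer codes in [-127, 127])


def quantize(values: List[float], block_size: int = 4) -> List[Packet]:
    packets = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        # "Dynamic": the scale is computed from the data in each block at send time.
        scale = max(abs(v) for v in chunk) / 127 or 1.0
        codes = [round(v / scale) for v in chunk]
        packets.append((scale, codes))
    return packets


def dequantize(packets: List[Packet]) -> List[float]:
    return [code * scale for scale, codes in packets for code in codes]


activations = [0.02, -1.5, 0.7, 3.1, 0.001, -0.004, 0.002, 0.0005]
packets = quantize(activations)
restored = dequantize(packets)

worst_error = max(abs(a - r) for a, r in zip(activations, restored))
print(f"worst absolute error: {worst_error:.6f}")
```

Per-block scales are the point of the design. The second block of values here is a thousand times smaller than the first, and a single shared scale would round it all to zero. With its own scale, each block keeps its detail while every value still travels as one byte instead of two.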

Where The Honesty Lives

A research system is worth studying for what it has not solved as much as for what it has. Petals is candid about its limits, and the limits are instructive.

Speed is the first one. Distributed inference across a swarm is not as fast as the same model on a tightly coupled cluster wired with specialized interconnects. The project's own documentation reports single-batch generation on the order of several tokens per second for large models, which it describes as enough for a chatbot or an interactive application. That is genuinely usable. It is not the raw throughput of a data center built for the job. For the workload most people actually run, interactive inference, the gap is narrow. For frontier-scale training, the gap is still wide, and the researchers do not pretend otherwise.

Privacy is the second, and it is sharper. On a public swarm, the machines running the middle of the model can, in principle, see the activations passing through them. The project states plainly that the public swarm should not be used for sensitive data, because a participant serving a layer could potentially reconstruct input or output. The researchers point to private swarms, where every machine is operated by a party the user trusts, as the answer for sensitive work. Stronger cryptographic answers exist in theory. Running a neural network fully encrypted, so the host computes on data it cannot read, currently carries a slowdown of ten to a hundred times. That is too slow to ship. Closing this gap properly is unfinished research.

The third limit is the most important one for anyone thinking about why distributed inference has not already taken over. Petals has no incentive layer. Servers join the public swarm because researchers and enthusiasts choose to donate their hardware. The project has openly discussed adding explicit rewards for contribution and has not shipped them. There is also no built-in way to verify that a server actually ran the computation correctly rather than returning a plausible fake. The honest summary is that the runtime works and the economics do not exist yet. The machine that splits a model across strangers has been built and proven. The system that makes those strangers want to participate, and that lets a client trust the result, has not.

That is the real state of distributed inference in 2026. The technical core is solved and published and open. A 176-billion-parameter model running across volunteer machines on home internet, fast enough for a chatbot, is not a thesis. It is a citation. What remains unsolved is not the computer science of cutting the stack. It is the coordination problem of who runs the pieces and why, and the trust problem of proving the pieces ran honestly.

The conventional wisdom says intelligence at scale must be concentrated. The conventional wisdom is describing a business arrangement and calling it a law of nature. The stack can be cut. The chain holds. The chaos is survivable. The computers do not have to have met. Everything left is incentive and trust, and incentive and trust are problems with histories of being solved.