Mastering Tokenomics in AI: Efficient Coding Strategies

I have been doing AI-assisted coding for over a year now. Maybe longer, depending on how you count the early months when I was mostly fighting with GPT-4 about indentation. I have used it for the kind of work that used to take me three days and now takes me three hours if I am honest with myself about what “done” means. And somewhere in that year I noticed something that nobody in the vibe coding community was talking about directly, even though everyone was dancing around it in their Slack channels and Discord servers and in the slightly embarrassed way people mention their monthly API bills.

Some people burn through their tokens much faster than others doing the exact same work.

That is not a trivial observation, my friend. It sounds like it should be obvious, like pointing out that some people spend more on groceries than others even when they cook the same meals. But in practice it is a deeply uncomfortable thing to say out loud because it implies that there is a right way and a wrong way to work with AI, and most of us have been pretending there is no difference.

Yes, we have been treating tokens like oxygen. As if they were an unlimited, invisible supply that is always there.

Then I started working at a place with a token cap.

Yes, a cap. Typical Dutch behavior, I thought, frugal by design: a ceiling on how much AI you can use in a period, because compute costs money and somebody has to pay for it and that somebody is apparently watching the dashboard. And that cap turned out to be one of the most instructive things that ever happened to my thinking about AI-assisted work, because suddenly it was possible to compare things. People got the same tasks, with the same tool, on the same codebase. And the thing that struck me was that some people hit the ceiling by Thursday and others still had room on Friday afternoon. The work was similar and so were the outcomes, but the token footprint was completely different.

So I went looking for why.

What a token actually is, and why it is not free

Before I get into the ways that we, vibe coders, incinerate tokens without realizing it, let me slow down and explain what the heck we are actually talking about. Because the word “token” is one of those terms that people use constantly without anyone agreeing on what it means in practice.

A token is a chunk of text. It is not necessarily a word, mind you, nor a letter; it is something in between. The word “cooking” is usually one token, and the phrase “I am cooking” is about three, although the exact split depends on the tokenizer. A long sentence might be fifteen tokens, a detailed prompt with context and instructions might be three hundred, and a big context window stuffed with code files, documentation, error logs, and conversation history might be fifty thousand.

And every time you send something to an AI, you pay for the tokens going in. And when the AI responds, you pay for the tokens coming out.

The input costs money, and so does the output, and when you are working in a long session with a lot of back and forth between input and output, with files attached and context accumulated and corrections flying about, those costs stack up faster than you think.
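To make that stacking concrete, here is a minimal sketch of the arithmetic. The prices are made-up placeholders, not any vendor's real rates, and `request_cost` and `session_cost` are hypothetical helper names.

```python
# Hypothetical prices in euros per million tokens; real rates differ per model and vendor.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: you pay for what goes in and for what comes out."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


def session_cost(turns: list[tuple[int, int]]) -> float:
    """Total cost of a session, given (input_tokens, output_tokens) per turn."""
    return sum(request_cost(i, o) for i, o in turns)


# The same short answer, once with a lean prompt and once with a stuffed context window.
lean = request_cost(input_tokens=300, output_tokens=500)
stuffed = request_cost(input_tokens=50_000, output_tokens=500)
print(f"lean: {lean:.4f}  stuffed: {stuffed:.4f}")
```

With these placeholder prices the stuffed request costs almost nineteen times as much as the lean one, for the same five hundred tokens of answer.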

The thing is that the AI has no memory between sessions and every time you start a fresh conversation, it starts from zero. Yes, that is the sad little truth, so if you want it to know what it knew yesterday, you have to tell it again. You have to stuff all that context back in, and that costs tokens every-single-time.

Unfortunately, this is the architecture. It means that how you structure your work, how you write your prompts, and how much context you carry from one moment to the next determine whether you are using ten times more tokens than necessary or staying lean.

And most people, including me for a long time, were not thinking about this at all.

The many ways to burn tokens badly

I will now describe the most common patterns that I have seen, including patterns I used to have myself before I started paying attention.

The first one is the wall of context where someone attaches an entire codebase to every prompt because they are not sure which part the AI will need. Yes, that happens, with every file and with all the documentation. The README – the test files – the configuration. And then they ask a question about one function in one file and of course, the AI reads all of it, and processes it all and uses maybe three percent of what was provided. The rest was noise that cost real money to push through the model.
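The alternative to the wall of context can be sketched in a few lines: filter first, attach second. Everything here is an assumption for illustration. The file contents are invented, `select_relevant_files` is a hypothetical helper that does a naive substring match, and the four-characters-per-token estimate is only a rough rule of thumb, not a real tokenizer.

```python
def select_relevant_files(files: dict[str, str], symbol: str) -> dict[str, str]:
    """Keep only the files whose text mentions the symbol the question is about."""
    return {name: text for name, text in files.items() if symbol in text}


def rough_token_estimate(text: str) -> int:
    """Very rough heuristic: about four characters per token for English-like text."""
    return max(1, len(text) // 4)


# Invented mini codebase: one relevant file and a lot of noise.
files = {
    "billing.py": "def compute_invoice(order):\n    return sum(order)\n",
    "README.md": "Project documentation. " * 200,
    "tests/test_ui.py": "def test_button():\n    assert True\n" * 50,
}

selected = select_relevant_files(files, "compute_invoice")
before = sum(rough_token_estimate(t) for t in files.values())
after = sum(rough_token_estimate(t) for t in selected.values())
print(f"{before} estimated tokens before, {after} after")
```

A substring match is crude, and a real project would probably want something smarter, but even this removes most of the noise before it reaches the model.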

Ok, the first one is kind of obvious, but wait for the second pattern.

That is the correction spiral, and here someone writes a vague prompt, gets a response that is slightly wrong and then corrects it with another vague prompt, then gets a response that is different but still wrong, corrects again, and continues this for six or seven rounds until the output is, um, uhhh, acceptable? Each round costs tokens of course, and a single well-written prompt at the start would have produced the right answer in one or two steps. The correction spiral is not a prompting problem, but a thinking problem. The person did not know what they wanted precisely enough to ask for it precisely, so they outsourced the clarification to the AI at full token rate. And this is one of the traps I fall into frequently. I even think this goes for all of us. Actually I think we should use cheap models to get our prompts right before we feed them to the costly stuff. I kinda work this way already, and that’s why I keep paying for my ChatGPT subscription. I mainly use it to gather my thoughts and have it create a prompt for me which either Claude Code or Manus turns into code or slide decks and what have you.

The third pattern is re-explanation. Someone explains the same background information in every prompt because they have not saved their context anywhere and so, every time they start a new message they explain who they are, what the project is, what the constraints are, what has happened so far. This is what I do every morning when I call a colleague while driving and spend twenty minutes catching them up on things they already knew yesterday. It is expensive and it is avoidable and it turns you into a jerk. Noted.

Then the fourth pattern is over-requesting. Someone asks the AI to generate something much larger than what they need because they are not sure exactly what they want. Generate ten options when two would do. Write a full implementation when a sketch would be enough. Produce a detailed analysis when a summary would answer the question. Do I need to go on?

The output tokens pile up, and most of the output gets discarded, and the cycle repeats. And I am guilty of that one too.

The fifth pattern is the agentic spiral, and this one deserves special attention because it is the most expensive and the least visible.

When the agent runs on its own and your wallet runs with it

Agentic AI is the version of AI that does not answer questions but takes actions instead of you, ya lazy bastard. It browses the web for you and it writes code and runs it and it reads files and writes files, breaks down a task into subtasks and executes each one.

It loops.

And then it checks its own output and tries again when something fails.

This is really powerful but it is also friggin’ expensive if you have not thought about what happens at each step.

An agent that is working on a task sends dozens of prompts and receives ten times more in return. For each subtask that it plans, it generates a prompt. For each action it takes, it logs the result and feeds that result back into its context. And for each check it performs on its own work, it processes everything it has done so far plus the new information. The context grows with every step, and because the cost of processing a prompt scales with the size of the context, each step costs more than the one before it.

An agent that takes twenty steps on a moderate task may well burn tokens at a rate that scales not linearly with the number of steps but closer to quadratically, since the context is growing the whole time. By step twenty it is carrying the full history of steps one through nineteen in its working memory, and it is processing all of that with every new output.
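The near-quadratic growth is easy to check with a toy model. This sketch assumes a fixed starting context and a fixed number of tokens added per step; both numbers are invented, and `cumulative_input_tokens` is a hypothetical helper.

```python
def cumulative_input_tokens(steps: int, base_context: int, growth_per_step: int) -> int:
    """Total input tokens processed when every step re-reads the whole history.

    Step k (counting from 0) sends base_context + k * growth_per_step tokens.
    """
    return sum(base_context + k * growth_per_step for k in range(steps))


# Assumed numbers: 2,000 tokens of starting context, 1,500 tokens added per step.
ten = cumulative_input_tokens(10, base_context=2_000, growth_per_step=1_500)
twenty = cumulative_input_tokens(20, base_context=2_000, growth_per_step=1_500)
print(ten, twenty, round(twenty / ten, 2))
```

Under these assumptions ten steps process 87,500 input tokens and twenty steps process 325,000. Doubling the steps multiplies the bill by roughly 3.7, not by 2.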

This is when the “agentic cost management” problem becomes tangible. The people who hit their token ceiling on Thursday are often the people who launched agents without thinking about what the agent would do. They set it loose on a task, it worked, they got an output, and they did not notice that the process consumed three times the tokens it needed to because nobody had thought about context compression, or about breaking the task into smaller bounded subtasks, or about when to clear the agent’s memory and start a cleaner sub-process.

Context compression is the practice of summarizing earlier parts of a conversation before they get too long, so the agent carries a compact version of its history rather than the full transcript. Model routing is the practice of using a smaller, cheaper model for simple subtasks and reserving the large expensive model for the steps that actually need it. Task decomposition is the practice of breaking a big agentic job into smaller jobs with clean handoffs, so no single process accumulates an enormous context. These are not exotic techniques. They are the difference between an agentic workflow that costs five euros and one that costs fifty.
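Context compression, the first of those three, can be sketched as follows. This is an illustration under stated assumptions: `naive_summary` is a stand-in that merely truncates each message, where a real workflow would call a small, cheap model, and `compress_history` is a hypothetical helper.

```python
from typing import Callable


def naive_summary(messages: list[str]) -> str:
    """Placeholder summarizer: keeps the first 60 characters of each message.

    In a real workflow this would be a call to a small, cheap model.
    """
    return "Summary of earlier steps: " + " | ".join(m[:60] for m in messages)


def compress_history(
    history: list[str],
    keep_recent: int = 3,
    summarize: Callable[[list[str]], str] = naive_summary,
) -> list[str]:
    """Replace everything except the last `keep_recent` messages with one summary."""
    if len(history) <= keep_recent:
        return list(history)
    split = len(history) - keep_recent
    older, recent = history[:split], history[split:]
    return [summarize(older)] + recent


# Ten long agent steps shrink to one summary plus the three most recent steps.
history = [f"step {i}: " + "x" * 200 for i in range(10)]
compact = compress_history(history, keep_recent=3)
print(len(history), "messages ->", len(compact), "messages")
```

The recent steps stay verbatim because the agent usually needs them in full; only the older history gets squeezed.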

The tools for managing this do already exist, but they aren’t really used since most people don’t even know they’re having this problem. But when you’re managing an AI factory at scale, you start to notice these things when your controller slides the monthly OPEX report across the table with the trembling hands of a man who has seen things.

The efficient way to work

If the inefficient patterns are about carelessness and size, the efficient patterns are about precision and structure.

Efficient prompting starts with knowing what you want before you ask. This sounds obvious, but it is really not. A significant fraction of the token waste I described earlier comes from people using the AI to figure out what they want, rather than figuring that out first and then using the AI to produce it. There is a place for exploratory conversation with AI but that place should be a cheap model on a small context, and certainly not a full-scale session with everything attached.

Efficient context management is knowing which files and information are actually relevant to the task you’re working on and attaching only those. It means maintaining a project brief (a compact document that describes the project, its constraints, the decisions made and its current state) and using that as a context anchor instead of the full codebase. You should also update that brief as the project evolves instead of reconstructing it from scratch each time.
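A project brief does not need tooling; a small structure that renders to plain text is enough. The sketch below is one possible shape, not a prescribed format. The class name, the field names and the example project are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectBrief:
    """Compact, reusable context anchor; field names are illustrative."""
    name: str
    constraints: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    current_state: str = ""

    def record_decision(self, decision: str) -> None:
        """Update the brief in place as the project evolves."""
        self.decisions.append(decision)

    def render(self) -> str:
        """Plain-text version to prepend to a prompt."""
        lines = [f"Project: {self.name}"]
        lines += [f"Constraint: {c}" for c in self.constraints]
        lines += [f"Decision: {d}" for d in self.decisions]
        if self.current_state:
            lines.append(f"State: {self.current_state}")
        return "\n".join(lines)


brief = ProjectBrief(
    name="billing-service",
    constraints=["Python 3.11 only"],
    current_state="invoice export half done",
)
brief.record_decision("Use SQLite for local tests")
print(brief.render())
```

Four short lines like these replace the twenty-minute catch-up, and they cost a few dozen tokens instead of a few thousand.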

Then there’s efficient session design: thinking of each AI session as having a defined input and a defined output, like a function. What goes in, what comes out, and what gets thrown away. Sessions that are open-ended and exploratory are fine for learning but sessions that are doing production work should be bounded.

Efficient correction is about frontloading precision. In prompt writing, this means putting your most important instructions and context at the very beginning of the prompt rather than at the end. That matters because LLMs don’t read linearly the way humans do. Attention is uneven across a long context window: what sits early tends to get weighted more heavily during generation, and what you bury deep in a 4000-token prompt is more likely to be underweighted or ignored entirely.

Here’s an example. Say you want JSON output with no preamble; then you state that in sentence one. If you say it in sentence forty after explaining your entire use case, you will likely get preamble. Voila. There are two common failure modes here. People write prompts the way they think (context first, instruction last), and that is exactly backwards for reliability. Then there are people who frontload the wrong thing. They spend the prime attention real estate on background the model doesn’t need and save the actual constraint for the end, where it gets diluted.

Personally I use the rule that if the model violating a particular instruction would make the output useless, that instruction goes in the first three lines.
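That rule is simple enough to bake into a small helper so nobody has to remember it. A minimal sketch, assuming a three-part prompt (hard constraints, task, background); `build_prompt` and the example strings are hypothetical.

```python
def build_prompt(hard_constraints: list[str], task: str, background: str = "") -> str:
    """Assemble a prompt with the make-or-break instructions at the very top."""
    parts = [f"- {c}" for c in hard_constraints]
    parts.append("")
    parts.append(f"Task: {task}")
    if background:
        parts.append("")
        parts.append(f"Background: {background}")
    return "\n".join(parts)


prompt = build_prompt(
    hard_constraints=["Return valid JSON only, with no preamble."],
    task="List the three largest tables in the schema.",
    background="The schema belongs to a billing service.",
)
print(prompt)
```

The ordering is the whole point: the constraint that would make the output useless if violated comes first, and the background that the model may or may not need comes last.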

And when you’re running a factory you have to design for this because prompting at scale means the costs will run wild. When designing your agents you have to treat the agent like the process you have designed rather than a helper you have let loose on your tasks. And like with a process, you specify the subtask boundaries and set context limits and use compression at regular intervals and route simple steps to lighter models. And in the end, our process-orchestrator – not a tool like Camunda, but a human role – monitors the cost as the agent runs rather than discovering it afterward.
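Treating the agent as a designed process can start with something as plain as a hard budget that is checked during the run rather than after it. The sketch below is an assumption-heavy illustration: the model names are placeholders, the token counts are invented, and `BoundedRun` and `pick_model` are hypothetical names rather than parts of any real agent framework.

```python
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a step would push the run past one of its limits."""


@dataclass
class BoundedRun:
    """Tracks token spend for one agent run against hard limits."""
    token_budget: int
    max_steps: int
    spent: int = 0
    steps: int = 0
    log: list[tuple[str, str, int]] = field(default_factory=list)

    def record(self, subtask: str, model: str, tokens: int) -> None:
        """Register one step; refuse it as soon as a limit would be crossed."""
        if self.steps + 1 > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} reached")
        if self.spent + tokens > self.token_budget:
            raise BudgetExceeded(f"token budget {self.token_budget} would be exceeded")
        self.steps += 1
        self.spent += tokens
        self.log.append((subtask, model, tokens))


def pick_model(subtask_kind: str) -> str:
    """Route simple steps to a lighter model; the model names are placeholders."""
    simple = {"rename", "format", "summarize"}
    return "small-model" if subtask_kind in simple else "large-model"


run = BoundedRun(token_budget=10_000, max_steps=5)
run.record("format", pick_model("format"), 1_200)
run.record("design", pick_model("design"), 6_000)
try:
    run.record("refactor", pick_model("refactor"), 4_000)
except BudgetExceeded as exc:
    print("stopped:", exc)
```

The third step is refused because it would take the run past its budget, which is exactly the moment a human orchestrator wants to be asked instead of informed afterward.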

Connecting this to the research on process automation

I have been building and studying AI-driven process automation for several years now, across a wide range of industries and deployment environments. The framework that came out of that work is called PASF, the Process Automation Scoring Framework, and its companion PADE, the Process Automation Deployment Engine. If you want the full technical treatment, it lives in the research section of Eigenvector (see the comments or this blog post “The real story behind enterprise scale process agentification”).

The core finding from that research is that most processes can be automated to some ceiling, and that ceiling is not determined by how smart the AI is but by the structure of the task itself. Repetitive, well-defined tasks with clear inputs and outputs hit that ceiling fast, but ambiguous tasks with shifting goals and fuzzy success criteria hit a lower ceiling, and the effort to push past it rises steeply.

What I did not have in the original framework was a systematic treatment of cost.

The PASF model was about what could be automated and how well, but it wasn’t about what it would cost per token, per session and ultimately, per quarter. That gap became more visible as agentic deployments multiplied in my factory and as the token cap reality set in. The question quickly became not whether the agent could do a particular task, but whether it could do it at a cost that makes sense.

This is where a recent piece of research, the TOKENOMICS framework, peeks its little head around the corner.

TOKENOMICS, or how to make agents pay their own way

TOKENOMICS is the name I have given to a structured approach for managing token costs in agentic AI systems, and I built it specifically for the production context where agents are running continuously, where token budgets are real constraints, and where the difference between an efficient deployment and an expensive one determines whether the whole project makes business sense.

The framework rests on a few core ideas that are worth explaining carefully.

The first idea is the cost stack. An agentic system does not have one cost. It has five distinct layers of cost, each of which can be managed separately. There is orchestration cost, and that is the overhead of coordinating tasks and managing the agent’s behavior. There is perception cost, and that’s about what the agent spends reading and processing inputs. Then there is reasoning cost, which is the expensive part where the model actually thinks. There is memory cost, or what it costs to carry and retrieve context. And there is output cost, which is what the model spends generating its response.

Most people think of AI cost as a single number.

But with TOKENOMICS you treat it as a five-layer stack where each layer has different levers for efficiency. Reducing orchestration cost requires better task decomposition, and reducing perception cost requires smarter context selection. Bringing down reasoning cost means routing simpler tasks to cheaper models, and lowering memory cost is about compression and summary strategies. And if you want to lower the output cost, you need to be more specific about what kind of output you actually need.
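As bookkeeping, the five-layer stack could look something like the sketch below. The layer names and levers come from the text above; the class, the method names and every token figure are assumptions for illustration and are not taken from the TOKENOMICS paper.

```python
from dataclasses import dataclass, asdict


@dataclass
class CostStack:
    """Token spend per layer for one workflow; the figures used below are invented."""
    orchestration: int
    perception: int
    reasoning: int
    memory: int
    output: int

    def total(self) -> int:
        return sum(asdict(self).values())

    def shares(self) -> dict[str, float]:
        """Fraction of total spend per layer (assumes a non-zero total)."""
        total = self.total()
        return {layer: tokens / total for layer, tokens in asdict(self).items()}

    def biggest_layer(self) -> str:
        layers = asdict(self)
        return max(layers, key=layers.get)


# The lever per layer, as described in the text.
LEVERS = {
    "orchestration": "better task decomposition",
    "perception": "smarter context selection",
    "reasoning": "route simpler tasks to cheaper models",
    "memory": "compression and summary strategies",
    "output": "be more specific about the output you need",
}

stack = CostStack(orchestration=2_000, perception=30_000,
                  reasoning=12_000, memory=5_000, output=1_000)
layer = stack.biggest_layer()
print(layer, "->", LEVERS[layer])
```

In this invented example perception eats sixty percent of the spend, so context selection is the lever to pull first. A single cost number would never have told you that.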

The second idea is about dynamic budgeting. In a static tokenomics system, you allocate a fixed budget to each task and hope it is enough. In a dynamic (agentic) system, the agent can negotiate that budget in real time. If a task is proving simpler than expected, it returns unused tokens to a shared pool, and if the task is proving more complex, it requests a reallocation. This may sound like a small optimization, but in practice, across dozens of concurrent agentic workflows, it makes a substantial difference to both cost and performance.
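A shared pool with release and reallocation can be sketched in a handful of lines. This is a simplified, single-threaded illustration; the class name, the task names and the numbers are assumptions, and a production version would need locking and a fairness policy on top.

```python
class TokenPool:
    """Shared pool that tasks draw from and return unused tokens to."""

    def __init__(self, total: int) -> None:
        self.available = total
        self.allocated: dict[str, int] = {}

    def allocate(self, task: str, tokens: int) -> bool:
        """Grant a budget if the pool can cover it."""
        if tokens > self.available:
            return False
        self.available -= tokens
        self.allocated[task] = self.allocated.get(task, 0) + tokens
        return True

    def release(self, task: str, unused: int) -> None:
        """Return unused tokens from a task that turned out simpler than expected."""
        unused = min(unused, self.allocated.get(task, 0))
        self.allocated[task] = self.allocated.get(task, 0) - unused
        self.available += unused

    def request_more(self, task: str, extra: int) -> bool:
        """Ask for a reallocation when a task proves more complex."""
        return self.allocate(task, extra)


pool = TokenPool(total=20_000)
pool.allocate("rename", 8_000)
pool.allocate("migrate", 10_000)

first_try = pool.request_more("migrate", 5_000)   # pool only has 2,000 left
pool.release("rename", 6_000)                     # the simple task hands back its surplus
second_try = pool.request_more("migrate", 5_000)  # now the pool can cover it
print(first_try, second_try, pool.available)
```

The harder task only gets its extra budget after the easier one has returned what it did not need, which is the whole idea of negotiating instead of fixing budgets up front.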

The third idea is the SDpD benchmark.

SDpD stands for Semantic Density per Dollar.

Yeah, I know, it sounds like something a finance dude who married an AI researcher cooked up after staring too long at a token invoice together, but in all honesty, the idea is actually sharp, if I say so myself.

It measures how much useful meaning and task success you squeeze out of every dollar spent on tokens, and that, my smart friend, is a question the industry has been heroically avoiding since GPT-3 made it socially acceptable to burn compute on vibes.

To put it into something us humans can understand, it is basically asking if this AI actually did something valuable or simply torched money while sounding intelligent. It combines how successful the task was, how complex that task actually is, and how many tokens you burned getting there. An agent hitting eighty percent success on complex tasks at ten thousand tokens is a different animal from one hitting the same eighty percent at fifty thousand tokens on identical work. So instead of worshipping raw model output you are forced to look at efficiency through-a-cold-economic-lens. You are measuring the same success rate, but you’re optimizing against wildly different economics and SDpD makes that gap visible and comparable across workflows, models, and prompting strategies. And that, my friend, is precisely the kind of uncomfortable visibility that most vendor dashboards are architecturally designed to avoid giving you because it’s actually the metric that kills the illusion that more tokens equals more intelligence.

And I know, it is probably bad news for roughly half your experiments, but you’ve got to start sometime.
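The exact SDpD formula lives in the paper and is not reproduced in this post, so the sketch below is only one plausible reading of "success, complexity and tokens burned": complexity-weighted success divided by dollars spent. The function, the complexity weight and the price are all assumptions for illustration.

```python
def sdpd(success_rate: float, complexity: float, tokens: int,
         price_per_m_tokens: float) -> float:
    """Illustrative Semantic Density per Dollar: complexity-weighted success per dollar.

    This formula is an assumption for illustration, not the definition from the paper.
    """
    dollars = tokens * price_per_m_tokens / 1_000_000
    if dollars <= 0:
        raise ValueError("token spend must be positive")
    return success_rate * complexity / dollars


# Same eighty percent success on the same work, at ten thousand versus fifty thousand tokens.
lean = sdpd(0.80, complexity=3.0, tokens=10_000, price_per_m_tokens=10.0)
heavy = sdpd(0.80, complexity=3.0, tokens=50_000, price_per_m_tokens=10.0)
print(round(lean / heavy, 2))
```

Under this reading the lean agent scores five times higher than the heavy one, even though both report the same success rate on the dashboard.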

Then the last idea is the Technical Debt-Aware Prompting Framework. And that also sounds like a mouthful because it’s something I cooked up two years ago, and my acronym creation skills haven’t really improved since. But even though it sounds complicated, in reality it is just the systematic application of good prompting practices across eleven domains of vibe coding and AI-assisted development.

Here’s a picture of these 11 domains.

If you want to read up on it, go to this blog “I may have found a solution to Vibe Coding’s technical debt problem”.

And from this vibe coding paper arose an MVP tool called MASSQ – “mask” or “mass questions” because that’s what it really is all about – that helps you with creating these prompts.

The insight here is that bad prompting is not just an inefficiency in the moment. Vibe coding creates technical debt if you don’t do it correctly across these 11 domains: when you write vague prompts and accept approximate outputs, you are taking on a liability that you will pay later in correction spirals, in rework, in agents that misunderstand their task three steps in because the original framing was unclear. The TOKENOMICS framework and the MASSQ tool give you a set of questions to work through before you start a session, so the upfront cost in thinking pays off in reduced correction cost downstream.

This matters beyond the token cap

I want to end with something that might seem obvious but that I think is genuinely underappreciated. The token cap at my company is a constraint, but it is also a signal that compute cost is real, and that AI-assisted work has an economic structure, and the people and organizations who understand that structure will get more out of their AI investment than the people who do not.

The vibe coding community has spent a lot of time celebrating speed and magic and the joy of watching an AI build something in twenty minutes that would have taken a week. That celebration is warranted, but the next conversation is about ROI, sustainability and efficiency: building workflows that scale without the cost scaling with them, and treating AI as a production resource instead of an infinite tap.

The PASF framework tells you what you can automate and how well, the VIBE CODING PAPER tells you how to prompt, and the TOKENOMICS framework tells you what it will cost and how to reduce it. When you put them together, you end up with a view of agentic AI deployment that is technically sound and economically honest.

If you want the numbers, the benchmarks, the deployment data from real production environments, and the full methodology behind both frameworks, all of that is available at Eigenvector in the research section. The PASF and PADE unified paper is there. The TOKENOMICS paper is there. They are written for people who want to understand this at depth rather than settle for a summary.

But the short version is that tokens are not a free resource, and agents that run unsupervised are not free either, and vague prompts will limit the success of your agentification factory, and the organizations that figure out the economics of AI-assisted work before their competitors do will have an advantage that is much harder to copy than any individual AI tool or model.

The ones who do not figure it out will just keep hitting the ceiling on Thursday and wondering why.

Eigenvector builds Agentification factories at scale, for production environments that actually have to pay off, and Eigenvector Research occasionally publishes papers about why this is harder than the demos suggest.

Signing off,

Marco

👉 Think a friend would enjoy this too? Share the newsletter and let them join the conversation. LinkedIn, Google and the AI engines reward your likes by making my articles available to more readers.

I spent a year burning money on AI and finally decided to do something about it