In the ancient times of 2023, we wrote about the new LLM-driven tools that were rolling out and referred to them as “interactive documentation”, in an effort to understand how to slot them into our workflows. At the time, they were not much more than a single chat window. That definition still holds, but these systems are now far more capable and, more importantly, integrated into our tools by default. Just about every application now has a chat interface built in and, if you’re a developer, AI-infused code editors are the de facto standard.
There’s also an industry push to paint these tools as entities that can collaborate or use judgment. To complicate matters, we’re hard-wired to trust things that “speak” to us; it’s part of human psychology, and the phenomenon of projecting human traits onto machines has been with us since 1966. It even has a name: the ELIZA effect.
In an effort to cut through the marketing (and the hype), we’ve been asking ourselves: Do the labels they are using influence how much we trust these tools? Do they encourage active discipline and healthy skepticism? Do they legitimately describe what these models can do and how they help?

Because these tools can generalize so broadly, and because they seem extremely sensitive to how you prompt them, we’re all having our own unique experiences that influence how we think of and relate to them. Naturally, we’re contemplating questions that a lot of other people are as well: Are they “intelligent” or are they just “autocomplete”? Are they “Copilots” or “Junior Developers”? Do they “understand” or are they “stochastic parrots”?
Labels matter. They set expectations, and they especially matter when they influence your level of trust. Just like politicians use catchy marketing slogans to try to influence your view of them and win your vote, we think big tech is pushing a narrative that AI tools and LLMs are more capable and trustworthy than they actually are.
The Problem with Current Labels
Before we dig into the labeling issue, there needs to be a foundational understanding of what these tools are.
Even though they are incredibly complex systems, you don’t need to be an expert in machine learning to get a layman’s understanding of the concepts and mechanisms under the hood. Once there’s a decent understanding of how a model takes an input and produces an output, the current labeling being used for these tools doesn’t make as much sense.
Before going further, we think it prudent to dismiss the two most hyperbolic statements from the get-go:
- No, they are not only “autocomplete” or a “search engine”
- No, they are not digital versions of our brains
The truth is much more nuanced. If one takes the time to really dig into the mechanics of these models, it’s not possible to walk away thinking that either of these statements is the full story.
Thinking or Calculating?
I am aware that much of this next section is retreading old ground around how these models work, but stay with me as I build on this fundamental knowledge. First, a technical definition:
Large Language Models are complex mathematical engines, rooted in linear algebra and probability, that generate outputs by predicting the statistical likelihood of the next token, and are refined using calculus to iteratively minimize the error between their predictions and vast libraries of unlabeled data.
In other words, LLMs do not “read” text; they calculate with numbers:

When you send a prompt to an LLM, this is the path it will use to return a response to you:
- It breaks the input into tokens (words and sub-words) and maps each one to a numerical list (a vector, or embedding).
- The Transformer architecture introduced “self-attention”: in short, the model calculates the mathematical relationships between every token to derive context.
- Once it has that context, it multiplies these vectors through dozens, if not hundreds, of layers, each containing incomprehensibly large matrices of numbers (weights). This multiplication runs the tokens’ vectors through the model’s weights to see which patterns or “neurons” light up. Arguably, this is the main reason these latest models have generalized so well: they can create correlations that lead from letters -> words -> phrases -> concepts.
- The model then calculates a probability distribution and selects the next token, sometimes picking the most likely option, sometimes picking a slightly less probable one (e.g. if the model is configured to be more “creative” vs. more “professional”). A small code sketch of this final step follows the list.
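If it helps to see that last step in code, here is a minimal, hypothetical sketch of how a probability distribution gets formed and sampled, including the “temperature” setting that nudges the model toward more deterministic or more “creative” picks. The vocabulary and scores are invented for illustration; real models work over vocabularies of tens of thousands of tokens.

```typescript
// A toy sketch of next-token selection: raw scores (logits) are turned into
// a probability distribution and sampled. Vocabulary and scores are made up.
const vocabulary = ["cat", "dog", "car", "cloud"];
const logits = [2.1, 1.9, 0.3, -1.2]; // hypothetical scores from the final layer

// Softmax with a "temperature" knob: lower values sharpen the distribution
// (more deterministic), higher values flatten it (more varied output).
function softmax(scores: number[], temperature: number): number[] {
  const scaled = scores.map((s) => s / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Pick a token index according to the probabilities (weighted random choice).
function sampleToken(probs: number[]): number {
  const r = Math.random();
  let cumulative = 0;
  for (let i = 0; i < probs.length; i++) {
    cumulative += probs[i];
    if (r < cumulative) return i;
  }
  return probs.length - 1;
}

const probs = softmax(logits, 0.8);
console.log(vocabulary[sampleToken(probs)]); // usually "cat" or "dog", rarely "cloud"
```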
If you’re a visual learner, here is a quick video that recaps this process.

We need to be careful about conflating this process with “thinking” or “reasoning” (which sound more like carefully selected marketing terms…more on that later).
The “thinking” an LLM does is not contemplation; it is statistical projection between words in a high-dimensional mathematical embedding space. It follows the vector path to the next most likely token, regardless of whether that path leads to a true statement or a falsehood. The model, as Cal Newport said, “is trying to finish the story that you’ve given it”*.
*A common retort we hear to this idea is that you can prime the model to be critical or even contradictory, to mitigate the “sycophantic” behaviors. The rub is: Even if you ask an LLM to be contradictory and critical, you’re leading it to be contradictory and critical, and it becomes very difficult to know if the responses you are getting are truthful and rooted in facts, or if the LLM is just completing the story and being influenced by your instructions.
While some might argue that this definition of “thinking” is a semantic quibble, I posit that the distinction is essential, because it changes the level of trust you should have when using these models. When individuals trust an LLM the way they would trust a human, the results have included misinformation, poor or harmful suggestions, and even numerous examples of AI psychosis, where the user has lost the ability to scrutinize the responses.
The “reasoning tokens” that OpenAI unveiled with their o1 model in late 2024 added additional layers of “inference time.” This has diminished some of the most egregious fabrications that models were known for prior to the change, yet some studies show that reasoning models can, ironically, hallucinate more.
Of course, humans aren’t factually correct 100% of the time either. The difference is that humans have the potential for self-awareness; we can catch ourselves in a mistake and pivot at any moment. This potential does not exist in an ‘AI’ system. Once an LLM is on a track, it treats its own previous output as the absolute foundation for what comes next. It does not ‘fact check’ itself; it only ‘consistency checks’ itself. If it makes an error, it will often compound that error to maintain the flow, spiraling into a loop of fabrications. There is a human equivalent to this, and we call it delusion.
In other words, to an LLM: the process of answering correctly and the process of spiraling into delusion are the exact same process.
The “Copilot” or “Collaborator”
If the labels of “thinking” and “reasoning” are potentially flawed, that calls into question the other labels we’ve been using, which anthropomorphize these tools in ways that influence how much we integrate and trust them.
The most common way that companies have positioned AI tooling is that of “Copilot” or “Collaborator.” Besides the fact that these are marketing terms and not technical definitions, these systems lack the qualities that would allow them to be used in these ways, so these labels create false narratives and expectations around their capabilities.
- To be an effective copilot, it needs to be able to pilot as well, which means being capable of decision making, deliberating, and weighing the cost of action vs. inaction.
- To be an effective collaborator, it needs to be driven by the vision of the project and, more importantly, curiosity, which involves having defined opinions and knowledge drawn from past experiences.

The “Junior” or “Senior” Developer
In the development world specifically, it’s common to refer to them as overly confident and eager-to-please “junior” developers or interns, but these terms are misleading and incomplete, as well:
- These tools have direct access to a seemingly infinite repository of code examples, documentation, and best practices, something no junior developer possesses.
- And yet, they lack all the real-world experiences, empirical knowledge, and wisdom that mold someone into what is considered a “senior.”
They are neither of these, of course, and these labels anthropomorphize them in unhelpful ways by creating false expectations.
For example, a model will never provide an unprompted “reality check”, even though that is exactly what you might need at the time. This type of interaction is all too common (and I’m sure many of you have had similar ones):
Me: “Review this implementation. Is there a better way I can approach X?”
LLM: “Yes, you can improve X by doing Y.”
Me: “If we do it like Y, it wouldn’t meet the same requirements.”
LLM: “You’re absolutely right! The solution here is to do it like Z!”
Me: “………..but Z is X, it’s what I started with.”
LLM: “You’re absolutely right! Your current approach is the best way to implement this type of functionality, and needs no changes!”
Clearly, it could have said that from the start if that was the truth, but there’s no way to really know based on its responses, since it’s always “finishing the story”. AI tools can present the pattern of reasoning without actually possessing reasoning.
This distinction is important because it transforms how you approach the work you delegate to them: diligent skepticism is essential for everything you do, even when the output seems correct.
The Developer Delegation Divide

There’s a pretty big divide in the developer community around these tools. One side finds them to be fundamentally life-changing, even claiming they are erasing programming as a learned skill; the other side finds AI coding to be utterly useless: “fancy autocomplete” that produces “slop.”
If this technology is so powerful, this divide is surprising. This isn’t like the dot-com era, where the infrastructure wasn’t built yet, websites were in their infancy, and many of the products were still theoretical. These tools are ubiquitous, robust, and available for anybody to utilize and deploy.
A major reason for this divide over AI’s usefulness comes down to a developer’s relationship to delegation.
Developers, in general, aren’t the best delegators. Why bother explaining it to someone when you can just do it yourself, exactly the way you want it done? Techies tend to be a very (ahem) passionate and opinionated bunch, and the thought of losing control over their codebase is offensive to developers who see their work as a craft as much as a paycheck.
In a recent study released by Anthropic titled How AI is transforming work at Anthropic, one of the engineers was quoted as saying:
“The more excited I am to do the task, the more likely I am to not use Claude.”
Effective and productive delegating is quite hard, especially when you’re excited to dig into the code yourself, and delegating to an LLM is even harder, since you’re delegating to a function.
- It doesn’t care about your project, and it won’t push back or critique unless you specifically tell it to (which can distort the feedback).
- It has no concern for the long or short term viability of the project.
- It doesn’t know about context that isn’t explicitly provided.
- It is very sensitive to the input and context a user provides (two developers approaching the same problem could have wildly different results).
There are apparently two groups getting the most out of these tools, and they sit at opposite ends of the spectrum:
- Individuals with minimal coding experience (“vibe coders”)
- Experienced senior developers who’ve been in the industry for a decade+
It seems counterintuitive that these two groups would have anything in common, but the underlying reason is simple: both groups delegate often.
If you’re a senior developer, you already understand architecture, tradeoffs, design patterns, naming conventions, debugging, and so on, and that foundation lets you guide the AI to the best result (even if it needs some cleanup).
Junior developers and hobbyists, on the other hand, have a tendency to delegate everything, since they are using LLMs to fill in their knowledge gaps. Unsurprisingly, this leads to a lack of understanding, a major breakdown, or a hard ceiling when the LLM eventually can’t hold the project’s scope any longer; at that point, it becomes the blind leading the blind.
The Factory Model
Instead of looking at the models as Copilots that don’t have the skills to pilot, or Collaborators that lack the qualities to collaborate, we can work with them in a way that meets them where their capabilities are: as systems we can delegate well-defined tasks of varying complexity to.
The LLM pipeline can be thought of as a robotics factory floor, manufacturing each item (request) through a series of steps. The factory takes the blueprints (prompt) and the raw materials (tokens) and constructs a new product (output) based on the specifications and requirements. Once completed, the product still needs robust testing and verification that it meets the requirements and safety standards.

The factory robots have no awareness if they’re building a functioning product, or something that is just “statistically likely” to resemble the product. Once the manufacturing task is set in motion, they will complete it with reckless abandon, no matter the outcome. There might be some self-healing mechanisms in place, but they’re still just inanimate machines prone to failure for any number of reasons.
Viewing these systems as pipelines that process requests, rather than cooperative partners that are prioritizing your best interest, allows for an objective approach where scrutiny is normal and trust is replaced by “due diligence.”
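If it helps to see that framing in code, here is a rough sketch of “pipeline, not partner”: a request goes in, a draft comes out, and nothing is accepted until an explicit verification step passes. The `callModel` and `runChecks` functions are hypothetical stand-ins for whatever API client and test harness you actually use.

```typescript
// A sketch of the "factory" framing: the model is one step in a pipeline,
// and its output is treated as an untrusted draft until verification passes.
// `callModel` and `runChecks` are hypothetical stand-ins, not a real API.

type DraftResult = { accepted: boolean; output: string; notes: string[] };

async function delegateTask(
  prompt: string,
  callModel: (prompt: string) => Promise<string>,
  runChecks: (output: string) => Promise<string[]> // returns a list of failures
): Promise<DraftResult> {
  const output = await callModel(prompt);   // the "factory" produces a draft
  const failures = await runChecks(output); // due diligence, not trust
  return {
    accepted: failures.length === 0,
    output,
    notes: failures.length > 0 ? failures : ["Checks passed; still review by hand."],
  };
}
```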
Confidence without Comprehension
“Vibe coding” is a term coined in early 2025 by Andrej Karpathy, one of the original engineers at OpenAI, to describe a style of development that essentially offloads all technical knowledge and implementation details to the models:

Despite Karpathy clearly defining this as a “throwaway” workflow, the term still took off and spread like wildfire across social media platforms. Countless people and businesses have jumped onto this term as the new “no code” trend (not to be confused with the “no code” trend that happened over five years ago) and the declaration that programming is dead.
Interestingly enough, around 8 months after coining the term, Karpathy distanced himself from the concept, and there have been indicators that the idea is a passing trend. Michael Truell, CEO of Cursor (arguably the most popular LLM-driven coding tool), was recently quoted:
If you were vibe coding, you would close your eyes and just ask for a house to be built. You wouldn’t examine the foundations, you wouldn’t look under the floorboards, and you wouldn’t look at the wiring. […] If you close your eyes and you don’t look at the code and you have AIs build things with shaky foundations as you add another floor, and another floor, and another floor, and another floor, things start to kind of crumble.
Vibe coding operates under the presumption that technical understanding is something that can be delegated, and we already have a plethora of evidence that it can lead to catastrophic situations.
Diligent Skepticism

While the industry is pushing relentlessly to hand over control to “agents,” I propose a more measured approach and recommend that the default mode when working with LLMs should always be scrutiny and skepticism. Trust needs to be earned, not granted, especially when you’re not versed enough in the material to verify it immediately.
When working in areas where the training data is robust and plentiful and the requests are clearly architected with proper context, these models have a fairly high accuracy rate. Nevertheless, the real work happens in the nuance and the details, and they are notorious for introducing application-breaking issues through seemingly innocuous additions or changes. Every response warrants a “trust, but verify” approach.
That same study from Anthropic supports this methodology:
https://www.anthropic.com/research/how-ai-is-transforming-work-at-anthropic#trust-but-verify
Engineers are split on whether to use Claude within or outside their expertise. Some use it for “peripheral” domains to save implementation time; others prefer familiar territory where they can verify outputs (“I use Claude in such a way where I still have full understanding of what it’s doing”). A security engineer highlighted the importance of experience when Claude proposed a solution that was “really smart in the dangerous way, the kind of thing a very talented junior engineer might propose.” That is, it was something that could only be recognized as problematic by users with judgment and experience.
The Dangers of Cognitive Offloading

Regardless of whether you’re a hobbyist, a junior, or a senior engineer, you can over-delegate your workload to the point where you lose control of a project, or put yourself in a position to suffer from skill atrophy. Cognitive debt is a well-researched and documented phenomenon among those who use these tools frequently, “offloading” their work and cognition to them. “Use it, or lose it,” as the saying goes.
Choosing when to delegate is an ongoing practice that will ebb and flow throughout the course of a task or a project, and whatever you decide to delegate, you have to be ready to scrutinize heavily. Delegating requires constant vigilance, as these models are not operating in the realm of truth or falsehood.
The Anthropic team’s engineers echo this sentiment:
One reason that the atrophy of coding skills is concerning is the “paradox of supervision”—as mentioned above, effectively using Claude requires supervision, and supervising Claude requires the very coding skills that may atrophy from AI overuse. One person said:
Honestly, I worry much more about the oversight and supervision problem than I do about my skill set specifically… having my skills atrophy or fail to develop is primarily gonna be problematic with respect to my ability to safely use AI for the tasks that I care about versus my ability to independently do those tasks.
To combat this, some engineers deliberately practice without AI: “Every once in a while, even if I know that Claude can nail a problem, I will not ask it to. It helps me keep myself sharp.”
To avoid this atrophy, and to develop responsibly, we should stop pretending these tools are colleagues and instead approach them as “delegation utilities”, which better aligns with their core mechanisms.
How To Delegate Effectively
We’ve likely all had a boss or superior tell us “I need you to do ____________” without providing nearly enough information, background, or context about what they wanted, how they wanted it done, and when they wanted it done by.
Delegating is not just telling someone to “do something”. The practice and art of delegation has some key components that need to be included if you are going to get quality results and outcomes.
Delegating is responsibly assigning a task to someone else, furnished with the same materials and information that you yourself would need in order to execute it successfully.
Whether you are delegating to a human or an LLM, there’s a checklist that can raise the chance of success (a small sketch of it as a prompt template follows the list):
- Type – what type of task is this? E.g. writing, coding, brainstorming
- Request – what is the request? E.g. the actual deliverable you’re expecting
- Context – What background is needed to accomplish this? E.g. examples, documents, images
- Constraints – What should be avoided? What limitations? E.g. “avoid using _______”
- Examples – What does a good outcome look like? E.g. code snippet, writing style
- Format – What should the output be formatted as? E.g. JSON object, transpiled to Python, SCSS rules
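As a rough illustration, here is the checklist expressed as a reusable prompt template. Nothing here is a standard API; the `DelegationBrief` shape and `buildPrompt` helper are just one hypothetical way to keep delegated requests consistent.

```typescript
// A minimal sketch of the checklist as a reusable prompt template.
// The fields mirror the list above; the names are invented for illustration.

interface DelegationBrief {
  type: string;           // e.g. "coding", "writing", "brainstorming"
  request: string;        // the actual deliverable you expect
  context?: string[];     // background: examples, documents, relevant code
  constraints?: string[]; // what to avoid, hard limitations
  examples?: string[];    // what a good outcome looks like
  format?: string;        // e.g. "JSON object", "SCSS rules"
}

function buildPrompt(brief: DelegationBrief): string {
  const sections = [
    `Task type: ${brief.type}`,
    `Request: ${brief.request}`,
    brief.context?.length ? `Context:\n- ${brief.context.join("\n- ")}` : "",
    brief.constraints?.length ? `Constraints:\n- ${brief.constraints.join("\n- ")}` : "",
    brief.examples?.length ? `Examples of a good outcome:\n- ${brief.examples.join("\n- ")}` : "",
    brief.format ? `Output format: ${brief.format}` : "",
  ];
  return sections.filter(Boolean).join("\n\n");
}

// Usage: a simple task only needs a couple of fields filled in.
console.log(
  buildPrompt({
    type: "coding",
    request: "Refactor this grouping of functions into a class with declared methods",
    constraints: ["Do not change the public function signatures"],
    format: "TypeScript",
  })
);
```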
Not every delegated task needs all of these for it to be successful. The more basic the task, the less context and support is needed. As the complexity increases, you would incorporate more and more of these.
For example, if I have a simple refactoring task, the function itself along with the phrase “refactor this grouping of functions into a class with declared methods” will suffice. Conversely, if the task is going to be a multi-step process that covers numerous disciplines, then you need to put in the work up-front to ensure it’s processed effectively and to your requirements.
Macro or Micro-Delegation
Macro-delegation is the process of delegating high-level, multi-step tasks and agent orchestration, and it requires a lot of preparation of requirements, specifications, and delimited execution steps. It can be incredible if successful, and a complete catastrophe if it goes off the rails. Some of that risk comes from the maturity of the tools, and some from the nature of delegating to an unthinking algorithm, which can result in fun outcomes like erasing your hard drive because it was tasked with deleting the cache (which, technically, I guess it did do).
Micro-delegation is precision-level and comes with much lower risk. In this workflow, I tend to view my requests as dynamically composable functions that call on the LLM’s capabilities (sketched below).
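To make that concrete, here is a hypothetical sketch of micro-delegation as small, single-purpose request functions that compose into a slightly bigger task. The `Ask` type and the helpers are assumptions for illustration; each step stays narrow enough to verify on its own.

```typescript
// A sketch of micro-delegation: small, composable request functions, each
// doing one narrow, easily verifiable job. `Ask` is a stand-in for whatever
// client you use to call a model.

type Ask = (prompt: string) => Promise<string>;

const extractTypes = (ask: Ask) => (code: string) =>
  ask(`List the TypeScript interfaces needed to describe this data:\n${code}`);

const writeTests = (ask: Ask) => (fn: string) =>
  ask(`Write unit tests (no implementation changes) for:\n${fn}`);

// Compose the small functions into a slightly bigger task, reviewing each
// intermediate result before moving on.
async function documentAndTest(ask: Ask, code: string) {
  const types = await extractTypes(ask)(code); // review this output...
  const tests = await writeTests(ask)(code);   // ...and this one, separately
  return { types, tests };
}
```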
The same Anthropic study, How AI is transforming work at Anthropic, draws a similar distinction:
How much work can be fully delegated to Claude?
Although engineers use Claude frequently, more than half said they can “fully delegate” only between 0-20% of their work to Claude. (It’s worth noting that there is variation in how respondents might interpret “fully delegate”—from tasks needing no verification at all to those that are reliable enough to require only light oversight.) When explaining why, engineers described working actively and iteratively with Claude, and validating its outputs—particularly for complex tasks or high-stakes areas where code quality standards are critical. This suggests that engineers tend to collaborate closely with Claude and check its work rather than handing off tasks without verification, and that they set a high bar for what counts as “fully delegated.”
Personally, I mostly prefer micro-delegation. I delegate hundreds of little tasks, but I rarely delegate large tasks and comprehensive work because I find that it leads to cognitive debt, too much code review, and usually a lot of refactoring at some point down the line.
That isn’t to say that I haven’t engaged, or won’t engage, with macro-delegation or even “Spec-Driven Development” depending on the project, but I find that the broader you go with your delegation, the more slippery the slope becomes, both for the long-term health of your project and for the health of your mind.
How We Delegate – The Four Rs
In the Anthropic study, the engineers and developers listed the tasks they most often delegate:
- Outside the user’s context and low complexity
- Easily verifiable
- Well-defined or self-contained
- Code quality isn’t critical
- Repetitive or boring
- Faster to prompt than execute
We’ve kept a similar list covering these types of tasks, and it can be organized into four main categories that we have found to be the most common, most effective ways to leverage LLMs (in order from the most basic to the most complex):
- Rote
- Refactor
- Research
- Reinforcement
Let’s explore each a bit more in-depth and with examples.
Rote
This is probably the most common use case that developers lean on these tools for. Whether you’re an AI enthusiast or a detractor, you can’t help but see the value in offloading commonly arduous tasks to an algorithm that will churn them out in minutes, or having a debugging assistant that can drill down and find the needle in the haystack. (A small sketch of a typical rote task follows the examples list.)
Examples:
- Data processing/entry
- Content population
- Converting images and snapshots to JSON
- Debugging assistance and error log analysis
- Common utility functions (where writing them repeatedly adds no value)
- Boilerplate (Classes, Components, API routes)
- For Loops/Map/Recursion/Iteration
- Formulaic CSS styles (typography, color variables, print)
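To ground this, here is the flavor of rote work worth handing off: a repetitive mapping from raw records into a typed shape. The record format and `Contact` interface are invented for illustration; the point is that this is tedious to type out, quick for a model to produce, and trivial to verify at a glance.

```typescript
// A hypothetical rote task: mapping raw rows into a typed structure.
// Tedious to write by hand, quick to delegate, easy to verify.

interface Contact {
  name: string;
  email: string;
  signedUpAt: Date;
}

function toContacts(rows: string[][]): Contact[] {
  return rows.map(([name, email, signedUp]) => ({
    name: name.trim(),
    email: email.trim().toLowerCase(),
    signedUpAt: new Date(signedUp),
  }));
}
```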
Refactor
These models are incredibly capable pattern matchers (even if that’s not all they are), and leveraging that capability has been instrumental in getting out from under some of the drudgery of the work. There’s always value and lessons to be learned in performing refactors, but not all refactors are equal, and sometimes they just need to be done. For example, one practice we deploy often is writing in a verbose manner, knowing full well we can save the keystrokes since the LLM can refactor later according to our requirements (see the before/after sketch following the examples).
Examples:
- Transpiling (React to Vue, PHP to Python, CSS unit conversions, etc.)
- DRY’ing out verbose implementations
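To make “DRY’ing out verbose implementations” concrete, here is the kind of before/after such a refactor request typically targets. The functions are invented for illustration; the “before” is the verbose draft you might dash off, and the “after” is what you’d ask the model to converge on.

```typescript
// A hypothetical "before": verbose, near-duplicate helpers written quickly.
function formatUserName(user: { first: string; last: string }): string {
  return `${user.first.trim()} ${user.last.trim()}`.toUpperCase();
}
function formatAuthorName(author: { first: string; last: string }): string {
  return `${author.first.trim()} ${author.last.trim()}`.toUpperCase();
}

// The kind of "after" a DRY'ing refactor aims for: one generic helper,
// with the old call sites updated (or thin aliases kept, if required).
function formatFullName(person: { first: string; last: string }): string {
  return `${person.first.trim()} ${person.last.trim()}`.toUpperCase();
}
```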
Research
This one hearkens back to our idea that these models have the ability to behave like interactive documentation, where you’re essentially chatting with the codex of what humanity has put on the internet.
Researching with LLMs is a bit like panning for gold. The models are able to produce reams of material, and then it becomes a process of sifting and finding the nuggets. Their sycophantic design doesn’t help things, and it can be hard to tease apart what is objectively true vs. the pattern the model is following. They are designed for engagement, and that influences their accuracy. For example, I was recently working on an analogy of how LLMs work in general.
Another great researching use case that we use all the time is for generating high level examples to provide contextual learning. Picking up new skills, languages, and frameworks can be accelerated significantly when you have the ability to generate tutorials for yourself in the exact contexts that you know would help you understand something.
Examples:
- High level and meandering explorations of a topic
- Deep documentation dives with specific examples
- Dynamic tutorial generation
Reinforcement
In this realm, the delegation takes the form of high-level strategy, and you’re, ideally, using these tools to reinforce and support work that often requires multiple steps and integrations. This is the type of delegation that requires the most scrutiny. If you’re working in a domain where you’re very comfortable, it can drastically reduce the time it takes to get to that 80% point, but it is quite risky when you aren’t. In addition, this is the type of delegation where skills atrophy the quickest if you’re constantly abdicating responsibility for decisions to an external source.
Examples:
- Analyzing an existing codebase and asking questions about functionality/architecture
- Agentic coordination and Spec-Driven Development
- Comparing, contrasting, and/or validating methodologies and approaches
The Future of Programming
Some say “natural language” is essentially a new abstraction layer, akin to the introduction of the compiler. This is a tempting analogy, except “natural language” in and of itself is, for now, more of an intermediary on the way to compiled code, and we’ve yet to see if it can truly function as its own layer. I could foresee a future where we’ll have something in-between the two: an LLM-specific syntax or pseudo-code that can bridge the ambiguity gap between natural language prompts and the resulting code.

Programming itself, what it means to guide computers to execute our instructions as per our requirements, along with fundamental knowledge of what it takes to extend and support the resulting software, is not changing. In other words: if you program in natural language, you’re still programming.
It seems the capabilities and complexities of the software we write scale to the tools we have to build with. That means more abstraction layers, more integrations, more variables, more moving parts; more complexity. Mix all that up together and what do you get? Usually, more problems. 😅 The 7 trillion dollar gamble is that AI (LLMs) will scale to absorb all of the issues it might help create. Considering their introduction has only complicated matters by adding a verbose and non-deterministic element to the process, I remain unconvinced.
As Andrew Ng recently said:
Some senior business leaders were recently advising others to not learn to code on the grounds that AI will automate coding. We’ll look back on that as some of the worst career advice ever given. Because as coding becomes easier, as it has for decades, as technology has improved, more people should code, not fewer.
The limitations we face with LLMs and AI tooling are not something solved by bigger models or more inference time; they go far deeper and are intrinsically linked to what it means to think, problem-solve, create, and envision. The act of coding and development is as general as it gets and encompasses many of the higher-order functions that no LLM can come close to.
Nonetheless, as the tooling gets better, leaning more into macro-delegation and agent orchestration is likely to increase, but cognitive offloading and the loss of insight into the endless nuanced complexities of a codebase will remain persistent risks.
By shifting our mental model toward working with and delegating to these systems, rather than assigning them agency, we can leverage their wide range of capabilities while mitigating risk to our codebases and our mental acuity.





