I use LLMs a lot lately. I even got unlimited access to a popular commercial model recently, which increased my usage even more. It’s all still so new to me. I keep poking at the limits, discovering what this means for software, and three things come up for me on most days:
Human focus is the scarcest resource
We can ask the model to do anything. It’s so very tempting to ask it to do everything. I think that’s a trap.
Toyota used work-in-progress limits to ship faster. The idea is: having more than n projects in progress at any given time increases the total time required to ship them. I believe Toyota’s wisdom holds for software developers operating LLM agents.
I’m very fatigued by switching between multiple agents and re-prompting them. I lose focus, and I can’t meaningfully drive anything to completion. This tells me I need to stop doing that and find a more productive way of working. That could mean driving just one or, at most, two agents at the same time.
I’m trying fully autonomous loops now. The idea is: I can start multiple loops today, then pick up the results one by one on my schedule and stay with each until the entire project is shipped. Technically, I define the expected outcome once and run the model in a while(true) bash loop. I’ve noticed models often report success prematurely, so the loop is instructed to continue working on the task even when the model thinks it’s finished.
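A minimal sketch of that loop pattern, with placeholders: `run_agent` stands in for whatever CLI invokes the model with the fixed prompt, and `check_done` stands in for the real acceptance check (tests, lint, acceptance criteria). Neither name comes from a real tool; the point is that the loop trusts the check, never the model’s own claim of success.

```shell
#!/usr/bin/env bash
# Sketch of a while(true) agent loop. run_agent and check_done are
# hypothetical stand-ins: swap in your agent CLI call and your real
# outcome verification (test suite, acceptance script, etc.).
set -u

attempts=0

run_agent() {
  # Placeholder for one model run against the fixed prompt.
  attempts=$((attempts + 1))
}

check_done() {
  # Placeholder verification; here we pretend three passes are needed.
  # In practice this would run the test suite and return its exit code.
  [ "$attempts" -ge 3 ]
}

while true; do
  run_agent
  if check_done; then              # trust the check, not the model's report
    echo "verified after $attempts attempts"
    break
  fi
  # Otherwise loop again: the agent is told to keep working even if it
  # already believes the task is finished.
done
```

In a real setup the body of `run_agent` would re-send the same outcome definition each iteration, and a `sleep` between runs keeps the loop from hammering the API while a check is flaky.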
Understanding of the software is eroding
Human developers use time to turn code + requirements into understanding of the system. The code is just a reflection of those mental models. Time is necessary because humans learn through spaced repetition. Solid mental models are created through hours of thinking, hacking, talking, and attacking the problem from multiple angles.
LLMs use electricity to turn code + requirements into large volumes of text. Time is secondary. A faster model will produce more tokens in less time.
Once LLMs are involved, humans can’t and don’t meaningfully engage with new code. It comes too fast, it lacks soul, and it requires too much effort to go through. The more new code produced per day, the faster maintainers give up on even reading it.
I’m not aware of any LLM-backed way of keeping up with the volume. No amount of summarization, diagrams, or automated reviews can compensate for the loss of understanding.
Dennis Snell said projects must be primarily driven by either humans or LLMs, but not by both. I agree. I’m very happy to start new repositories where I instruct the LLM what to do at a high level and merge most of the things it proposes. I forget the details quickly. In fact, I’m not sure how most of these projects work. That’s scary, and I’d rather not adopt that modus operandi in WordPress core.
Quality is more difficult than before
Previously reliable software now crumbles around me every day. GitHub throws errors and changes the content of people’s repositories, major companies have daily outages, and random buttons stop working in the apps I use.
Everyone incentivizes speed over quality these days. LLM-produced code is reviewed by LLM reviewers, then passes LLM-generated checks, gets tested by LLM testers, and only meets its first human being once it’s released to the customer.
Is that a viable business model? Who knows! If the LLM economy doesn’t change, tokens may increase in price 100x in the next few years. In that scenario, we’ll need a lot of trained developers to pick up the LLM slack. However, if the economics do change and token prices decrease, heavy LLM adoption might be the winning strategy.
In any case, the LLMs desperately need to get better with quality.
My agents spend most of their time running tests and fixing bugs. Even with less frequent test runs, the tests must run eventually. LLM-generated tests tend to cover surface-level happy paths in many of my long-running loops, even when I’m very specific about the desired depth of coverage. And even then, most test runs fail, leading the agent to change the code and re-run the tests. Which means more waiting.
I spend most of my time finding defects in the agent’s work and getting the agent to fix them. Sometimes I’m lucky and it doesn’t introduce another defect while fixing the first one. I haven’t found any way of getting an LLM to one-shot a reliable, production-ready system. I’ve tried goals, loops, rigorous harnesses, and specific acceptance criteria, and they still find ways of messing up.
Claude, especially, is very prone to ignoring my directions and solving a different problem than the one I wanted solved. It often chooses a simpler problem, a workaround, or it gives up and refuses to proceed. I think they overbooked their hardware capacity and nerfed their product to make up for it. In any case, Claude Code has been almost useless to me for a few days now, and I might cancel my Claude Max subscription until they figure this out.