Can AI Un-Slop Itself?

Everyone knows that LLMs can, at least sometimes, create slop.

The interesting question isn’t whether they can create slop. It’s: can they un-slop themselves?

Problem

I dreamed of building a programming language for 10 years. After Gemini 3.1-pro, I figured LLMs were good enough that I should at least finally see what this AI "vibe-coding" craze was all about.

I set out on a 6-month journey to build a programming language.

The problem is: this only barely worked, and no one wants a barely working programming language!

What didn’t work

WRT the language:

WRT the runtime:

What happened next?

I told the LLMs to fix their slop!

In about 35 minutes, they screamed:

“Done, you have a complete MIR pass in place! Everything is memory safe.”

They proceeded to feed me this same line of crap for ~1000 commits over 2 months, with me constantly asking:

“If the MIR pass is done, then how come X is still there and Y is still segfaulting?”

LLMs:

“Oh, yes, there’s just that one small part left. Okay, now it’s done!”

What happened next?

The definition of insanity is trying the same thing and expecting different results!

I gave up on LLMs' ability to tell me the truth about the state of a non-trivial codebase.

I got a wild idea...

I had LLMs build me a set of tools that don’t exist in Ruby (it's turtles all the way down).

  1. I built the compiler in Ruby because, initially, this started off as a project I was building by hand.
  2. The initial scope was merely to play around with syntax and see if I could develop something I liked.
  3. Never did I ever dream that I would actually take something this far.
  4. Everyone knows that a good compiler eventually self-hosts, and the goal was that - like Crystal - my language would stay quite close to Ruby (so LLMs should be able to migrate it easily).

Anyway, Ruby has decent tooling around “code health” like SimpleCov / Flay / Flog / Reek / Debride, etc.

The problem is, if you’re building a compiler with LLMs, these aren’t the core metrics you need.

The list goes on.
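For context, those gems all ship command-line tools, so a first-pass "code health" roll-up is cheap to get. Here is a minimal sketch of what aggregating their output into one report might look like; the script, paths, and output file are my own illustration, not the tooling built for this project.

```ruby
#!/usr/bin/env ruby
# Illustrative only: run the off-the-shelf Ruby code-health CLIs and
# collect their raw output into a single report file. Assumes the
# flog, flay, reek, and debride gems are installed and code lives in lib/.
require "fileutils"

TOOLS = {
  "flog"    => "flog lib/",    # complexity score per method
  "flay"    => "flay lib/",    # structural duplication
  "reek"    => "reek lib/",    # code smells
  "debride" => "debride lib/"  # potentially unused methods
}.freeze

FileUtils.mkdir_p("tmp")
File.open("tmp/code_health.txt", "w") do |report|
  TOOLS.each do |name, cmd|
    report.puts "== #{name} =="
    report.puts `#{cmd} 2>&1`  # capture whatever the CLI prints
    report.puts
  end
end
```

And that is exactly the gap: a roll-up like this tells you about complexity, duplication, and smells, not whether a compiler's passes are actually correct.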

What did they build?

I’m used to developing... well, like myself. LLMs are different. It’s sort of like pair-programming with a genius who is occasionally extremely dumb and sometimes downright adversarial.

How do you learn to work with this crazy partner?

I rarely need to think to implement code, and I type quite fast... So LLMs are not that much faster than me at implementation. They are faster at design, but definitely not better.

Regardless, the enormous benefit they have is that they scale, whereas I do not.

If I have to review all of their code line-by-line, I’d rather just write it myself.

The goal wasn’t to try LLMs and find a reason not to use them... The goal was to find a way TO use them effectively.

If I have to review EVERYTHING LLMs write line-by-line, the bottleneck is me. The entire goal of LLM development is to get the bottleneck NOT to be me - i.e. to scale.

You can’t scale yourself... but maybe you can scale with LLMs.

So, rather than say:

I’ll just insert myself as a bottleneck and review everything line-by-line with caution…

I thought:

What if I put a system in place that is very hard for them to poke holes in?

What does the system look like?

Tooling & Reporting

Tooling & Reporting in general is a sweet spot for LLMs. They are very good at getting things nearly working.

The problem is: most things need to actually work.

In some cases, you are far better off with nothing than with something that has very bad failure modes.

If you’re using a report to give out a billion dollars, you probably want a report that actually works.

But if you want a report that just gives you a rough estimate of how much AI slop there is in a commit, and how dangerous that slop might be... something is far better than nothing.
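As a sketch of what that "something" could be: scan a commit's diff for suspicious patterns and turn the hits into a rough risk number. The patterns, weights, and per-added-line scoring below are assumptions I'm making for illustration; they are not the actual report described here.

```ruby
# Illustrative heuristic only: estimate how sloppy, and how dangerous,
# the latest commit might be. The patterns and weights are invented
# for this sketch, not taken from the project's real tooling.
diff  = `git diff HEAD~1 --unified=0`
added = diff.lines.count { |l| l.start_with?("+") && !l.start_with?("+++") }

SIGNALS = {
  /TODO|FIXME|HACK/                   => 1, # deferred work left in the diff
  /\brescue\b\s*(StandardError)?\s*$/ => 3, # silently swallowed exceptions
  /stub|placeholder|not implemented/i => 2, # stubbed-out behavior
  /unsafe|unchecked/i                 => 3  # self-described unsafe paths
}.freeze

hits = SIGNALS.sum { |pattern, weight| diff.scan(pattern).size * weight }
risk = added.zero? ? 0.0 : (hits.to_f / added).clamp(0.0, 1.0)

puts format("slop risk ~%.2f (%d weighted hits across %d added lines)", risk, hits, added)
```

A score like this has plenty of false positives and negatives, and that is fine: it is meant to tell you where to look, not to replace review.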

The buzz phrase going around is:

LLMs increase implementation throughput far faster than they increase verification throughput.

What does that mean?

Invest in verification throughput!

I will continue to invest in layers of defense to protect myself from their common pitfalls and accept that 1) almost no software is perfect, even compilers, and 2) the system I have in place with LLMs is doing at least 10x better than I could do on my own.
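One cheap layer of that defense, sketched here under my own assumptions about file names and thresholds, is a gate that refuses to pass when a health metric regresses beyond a recorded baseline - for example, flog's total complexity score:

```ruby
# Illustrative gate only: refuse the change if flog's total complexity
# regressed more than 5% past a recorded baseline. Baseline file,
# threshold, and paths are assumptions made for this sketch.
BASELINE_FILE = "tmp/flog_baseline.txt"

baseline = File.exist?(BASELINE_FILE) ? File.read(BASELINE_FILE).to_f : Float::INFINITY
current  = `flog lib/ 2>/dev/null`[/^\s*([\d.]+): flog total/, 1].to_f
abort "could not read a flog total; is the flog gem installed?" if current.zero?

if current > baseline * 1.05
  abort format("flog total %.1f regressed past baseline %.1f", current, baseline)
else
  File.write(BASELINE_FILE, current) # record the passing value as the new baseline
  puts format("complexity gate passed at %.1f", current)
end
```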

Conclusion

Interestingly, I'm building a language that syntactically makes the vast majority of bugs simply impossible to represent, and - when possible - the compiler autogenerates all the failure paths that might unfold, so you can see exactly where a bug might be.

I wanted to do this before working with LLMs. Now I want it more than ever before!

The question is: can LLMs ever get this language to work?

Time will tell... but what I can tell you is that there’s not a shot in hell I’d ever get this done myself.

It at least seems possible with LLMs.

Source: docs/retrospective/can-ai-unslop-itself.md