I finally turned off GitHub Copilot yesterday.

David Clarke :tinoflag:

@david_chisnall @carbontwelve this is what has been gnawing at the back of my brain. The purveyors of LLMs have been talking up the latest improvements in reasoning. A calculator that isn't 100% accurate at returning correct answers to inputs is 100% useless. We're being asked to conflate the utility of LLMs with the same kind of utility as a calculator. Would we choose to drive over a bridge designed using AI? How will we know?

David Chisnall (*Now with 50% more sarcasm!*)

@zebratale @carbontwelve Calculators do make mistakes. Most pocket calculators do arithmetic in binary and so propagate errors converting decimal to binary floating point, for example not being able to represent 0.1 accurately. They use floating point to approximate rationals, so collect rounding errors for things like 1/3.

The difference is that you can create a mental model of how they fail and make sure that the inaccuracies are acceptable within your problem domain. You cannot do this with LLMs. They will fail in exciting and surprising ways. And those failure modes will change significantly across minor revisions.
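To make the "mental model" point concrete, here is a minimal sketch of the two rounding examples mentioned above (0.1 and 1/3). It assumes IEEE 754 64-bit binary floats with round-to-nearest, which is what Python's `float` uses on common platforms; it does not claim to model any particular pocket calculator. The variable names are illustrative only.

```python
import sys
from decimal import Decimal
from fractions import Fraction

# Exact value that a 64-bit binary float stores for the literal 0.1
# (assumption: IEEE 754 binary64, as used by Python floats).
print(Decimal(0.1))

# Representation error of 0.1, computed exactly with rational arithmetic.
error_tenth = Fraction(0.1) - Fraction(1, 10)

# Relative error of the float produced by the division 1 / 3.
rel_error_third = abs(Fraction(1 / 3) - Fraction(1, 3)) / Fraction(1, 3)

# With round-to-nearest, a single rounding stays within half the machine epsilon.
bound = Fraction(sys.float_info.epsilon) / 2

print(error_tenth != 0)          # 0.1 is not stored exactly
print(rel_error_third <= bound)  # but the error has a known upper bound
print(0.1 + 0.2 == 0.3)          # False: the rounding errors do not cancel
```

The errors are real, but they are deterministic and bounded, which is what lets you decide up front whether they are acceptable for a given problem.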

Stephen J. Anderson

@simon @kitten_tech @carbontwelve @david_chisnall How would you avoid or deal with the issues that David encountered? Specifically, subtle bugs where the debugging makes the whole process less efficient than writing it yourself. Is there one of your notes that deals with that already?

Glitzersachen

@david_chisnall @zebratale @carbontwelve

"do make mistakes" I wouldn't call that a mistake. The calculator does what it should do according to the spec how to approximate real numbers with a finite number of bits.

It's (as you explain) a rounding error. A "mistake" is what Pentiums with the famous Pentium bug made.

But maybe it's my understanding of English (as a second language) that is at fault here.

Pendell

@glitzersachen @david_chisnall @zebratale @carbontwelve the calculator /is/ doing exactly what it's been programmed to... and it is programmed to make specific and defined "mistakes" or errors in predictable and clear cut ways in order to make the pocket calculator run on as little power as possible.

An LLM, likewise, is also doing exactly what it was programmed to do... and that is to spew regurgitated nonsense it read off the internet.

Simon Willison

@utterfiction @kitten_tech @carbontwelve @david_chisnall you have to assume that the LLM will make weird mistakes all the time, so your job is all about code review and meticulous testing

I still find that a whole lot faster than writing all the code myself

Here's just one of many examples where I missed something important: https://simonwillison.net/2023/Apr/12/code-interpreter/#something-i-missed

Simon Willison

@utterfiction @kitten_tech @carbontwelve @david_chisnall but honestly, the disappointing answer is that most of this comes down to practice and building intuition for tasks the models are likely to do well vs mess up

Manipulating some elements in the HTML DOM with JavaScript? They'll nail that every time

Implementing something involving MDIO registers? My guess is there are FAR fewer examples relating to that in the (undocumented, unlicensed) training data, so they're much more likely to make mistakes

Martijn Faassen

@kitten_tech @carbontwelve @david_chisnall

Note that how @simon reports using this to generate little projects is an entirely different mode of working with them. I have used Copilot for a few years now and like it myself, but that is mostly context-sensitive autocomplete.

A Q&A session to create code for a CLI tool or web app is a very different way of working I started exploring more recently. It's surprisingly capable for little projects and requires a different approach.

Simon Willison

@faassen @kitten_tech @carbontwelve @david_chisnall Steve Yegge calls it CHOP, for Chat Oriented Programming https://simonwillison.net/2024/Jul/12/the-death-of-the-junior-developer/

{Insert Pasta Pun}

@pendell @glitzersachen @david_chisnall @zebratale @carbontwelve using floating point for finance calculations is a common mistake...
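A small sketch of why that mistake bites, assuming Python's binary `float` versus the standard-library `decimal.Decimal`. The ten-payments-of-0.10 scenario is made up for illustration and is not taken from the thread.

```python
from decimal import Decimal

# Add ten payments of 0.10 using binary floating point.
float_total = 0.0
for _ in range(10):
    float_total += 0.10

# The same sum using decimal arithmetic, built from strings so that
# no binary rounding sneaks in at construction time.
decimal_total = Decimal("0.00")
for _ in range(10):
    decimal_total += Decimal("0.10")

print(float_total == 1.0)                 # False: rounding error accumulated
print(decimal_total == Decimal("1.00"))   # True: exact in base 10
```

Decimal types (or integer counts of the smallest currency unit) are the usual alternatives when amounts must add up exactly.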

{Insert Pasta Pun}

@pendell @glitzersachen @david_chisnall @zebratale @carbontwelve programmers and CPU designers are just a tad sensitive and insecure when someone points out the calculator makes a mistake and isn't mathematically perfect