If you’re generating code, and you’re *not* doing it with an LLM, is it reasonable to use metrics like F1 and recall to measure how well the tools you use are doing? This is bothering me because it feels a bit weird to apply metrics like these to static analyses, build tooling frameworks, or things that just plain don’t have any notion of recall to begin with.
-
Ryan Castellucci :nonbinary_flag: replied to kaoudis
@kaoudis generating code, like, with build time scripts?
-
kaoudis replied to Ryan Castellucci :nonbinary_flag:
@ryanc yeah, the case I was thinking about is where you want a bunch of semi-reasonable test cases for a compiler or something, so you generate a bunch of build variants
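To make that concrete, a totally made-up sketch of what I mean (the template and output names are just for illustration):

```python
import pathlib

# Made-up sketch: emit trivially-equivalent C variants of one function
# by permuting commutative operands, as fodder for differential testing.
TEMPLATE = "int f(int a, int b) {{ return {expr}; }}\n"
EXPRS = ["a + b", "b + a", "(a) + (b)"]

for i, expr in enumerate(EXPRS):
    pathlib.Path(f"variant_{i}.c").write_text(TEMPLATE.format(expr=expr))
```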
-
@ryanc the thing in question is a talk I’m watching about using LLMs to figure out whether code variants are equivalent. As their baseline, they seem to have used precision, recall, and F1 to measure how well non-ML methods do at determining when code variants are equivalent
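What they’d be computing there is, mechanically, just confusion-matrix arithmetic over ground-truth labels; a minimal sketch with invented labels, just to show what the numbers mean in this setting:

```python
# Invented ground truth: 1 = variant pair truly equivalent, 0 = not.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
# Invented verdicts from some non-ML equivalence checker.
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)  # of pairs flagged equivalent, how many truly are
recall = tp / (tp + fn)     # of truly equivalent pairs, how many got flagged
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 0.75 0.75 0.75
```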
-
Ryan Castellucci :nonbinary_flag: replied to kaoudis
@kaoudis I have a number of personal projects that use build time code generation, some of it parameterized. Not sure if it would be useful for you to look at. The most complicated is a cryptographic hash library that generates HMAC, PBKDF2, and HKDF functions. I validate via test vectors.
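For example, checking an HMAC-SHA256 implementation against the first RFC 4231 test vector looks roughly like this (stdlib sketch, not the actual library code):

```python
import hashlib
import hmac

# RFC 4231 test case 1: HMAC-SHA256 with a 20-byte 0x0b key.
key = b"\x0b" * 20
data = b"Hi There"
expected = bytes.fromhex(
    "b0344c61d8db38535ca8afceaf0bf12b"
    "881dc200c9833da726e9376c2e32cff7"
)
assert hmac.new(key, data, hashlib.sha256).digest() == expected
```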