We had our first talk in the seminar series ‘The Cognitive Science of Generative AI’
-
But he also described their more recent work of using genAI systems *as cognitive models*: using (fine-tuned) models to predict human behaviour and comparing their performance to standard computational models in those domains. This work, with their system Centaur, has now been scaled to cover over 160 behavioural experiments, see
https://arxiv.org/abs/2410.20268
3/7
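(For anyone curious about the mechanics, here is a minimal, purely illustrative sketch of what “using a fine-tuned model as a cognitive model and comparing it to a standard computational model” can look like: score the same recorded human choice under a language model and under a textbook expected-value-plus-softmax rule. gpt2 stands in for the fine-tuned model here, and the single trial, the numbers and the baseline are placeholders, not the actual Centaur setup.)

```python
# Illustrative only: gpt2 stands in for a fine-tuned behavioural model,
# and the single made-up trial stands in for a real experiment.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# One made-up gamble trial plus the choice a human participant made.
trials = [
    {"p": 0.5, "win": 100, "sure": 45,
     "prompt": ("A: 50% chance of 100 points, otherwise nothing.\n"
                "B: 45 points for sure.\n"
                "You choose option"),
     "human_choice": "B"},
]

def lm_nll(trial) -> float:
    """Negative log-likelihood the language model assigns to the human's choice."""
    ids = tok(trial["prompt"], return_tensors="pt")
    with torch.no_grad():
        logits = lm(**ids).logits[0, -1]          # next-token distribution
    logprobs = torch.log_softmax(logits, dim=-1)
    choice_id = tok.encode(" " + trial["human_choice"], add_special_tokens=False)[0]
    return -logprobs[choice_id].item()

def ev_softmax_nll(trial, beta=0.05) -> float:
    """Same quantity under a textbook expected-value + softmax choice model."""
    ev_a = trial["p"] * trial["win"]
    ev_b = trial["sure"]
    p_b = 1.0 / (1.0 + math.exp(-beta * (ev_b - ev_a)))
    p_choice = p_b if trial["human_choice"] == "B" else 1.0 - p_b
    return -math.log(p_choice)

# Lower summed NLL = the model predicts the recorded human choices better.
print(sum(map(lm_nll, trials)), sum(map(ev_softmax_nll, trials)))
```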
-
Eric’s lab were the first to use behavioural tasks from psychology to probe LLMs, and in the talk he discussed not just what they found but why that is useful (and how it relates to the benchmarking of LLMs and genAI systems)
https://www.pnas.org/doi/full/10.1073/pnas.2218523120
2/7
-
@UlrikeHahn A bit lazy on my part to ask, but… are these tasks that have been studied? GenAI systems seem to do amazingly well when you can find the answer in their (humongous) training data, but not very well once you have novel tasks. Even presenting a problem in a bizarre way would throw them off.
-
@locha 2/2 e.g. “participants preferred the less risky option 75% of the time”.
So any straightforward application of “next token prediction” isn’t going to get you (the system) the ‘right’ answer when you are faced with the stimulus items themselves (here are two options [..] and [..] which…?) and have to go through and actually make your own choices.
-
@locha yes, the problem of ‘data leakage’. These tasks have been ‘studied extensively’, but my point is that their descriptions don’t contain the tasks themselves:
the task is, say, “here are two options [..] and […], which of these do you prefer?”. And you, as a participant, get a whole bunch of these choices to make. Your preferences across all of those items are the data that get recorded. That data is then only ever given a high-level summary description in a paper 1/2
-
@UlrikeHahn @locha I expect detailed descriptions of these tasks abound in textbooks and lecture notes put online (mine included), which very likely are in the training data.
-
@locha or to put it differently, these behavioural tasks are very different from something like false-belief tasks, where there is an (often single) text-based question and a ‘right answer’, and both the question and the right answer will invariably be described in the paper.
-
@MathieuP @locha of course there are “descriptions of these tasks” - the point is how you get from those descriptions to actual choices. The descriptions will be something like: “to examine risk aversion, we gave participants 60 gambles involving two options that varied in risk”…”we found that participants chose the risk-averse option 80% of the time”. By contrast, the task the system gets is 60 iterations of “here are two options [..] and [..], which of these do you prefer?”
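(A toy illustration of that gap, with every name and number made up: the one-line summary a paper reports versus the unlabelled, trial-level items a participant, or a system, actually has to respond to.)

```python
# All text and numbers here are invented for illustration.
import random

random.seed(0)

# What the paper reports: one summary sentence, no individual items.
paper_summary = ("To examine risk aversion, we gave participants 60 gambles; "
                 "they chose the risk-averse option 80% of the time.")

# What the participant (or the system) actually gets: 60 unlabelled choices.
def make_trial() -> str:
    p = random.choice([0.3, 0.5, 0.7])
    win = random.randrange(50, 200, 10)
    sure = round(p * win * random.uniform(0.7, 0.95))  # sure amount below expected value
    return (f"Option 1: {int(p * 100)}% chance of winning {win} points, otherwise nothing.\n"
            f"Option 2: {sure} points for sure.\n"
            "Which of these do you prefer?")

trials = [make_trial() for _ in range(60)]
print(paper_summary)
print(trials[0])  # nothing in this text says which option is the 'risk-averse' one
```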
-
@UlrikeHahn @locha I beg to differ. In my class, I fully described examples of options, and which one people preferred. That looks to me like the kind of structure an LLM could pick up.
-
@MathieuP @locha how, in your view, do you get from the summary description to *picking individual options* in a way that matches human preferences? The options *aren’t labelled*. The term “risk aversion” literally features nowhere, and even if it did, to pick the corresponding options you would need to know what ‘risk aversion’ means such that you are able to identify the right option of the two, *and* then do that the right proportion of times…
-
@UlrikeHahn @locha In my opinion (as a statistician), you would get fairly accurate results if the figures used in the problems the model faces are close enough to those used in Masters-level textbooks and exercises. The LLM would treat the figures as words, and would be able to predict that the options containing those figures are associated with a positive weight.
-
@UlrikeHahn @locha Strictly speaking, a genAI does not "identify" the right option in any meaningful way. It just predicts that a string of the form "choose this" is more often associated with some formulation of option A than with some formulation of option B. Boilerplate lecture slides with "people chose this over that", together with the two options, provide the required training data.
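(A toy sketch of that claim, with gpt2 as a stand-in and made-up text: the system simply scores which continuation is more probable given whatever formulations of the options appear in the prompt; nothing in the computation "identifies" an option.)

```python
# Toy sketch: compare the log-probability of two continuations under gpt2.
# All text is made up; gpt2 stands in for whatever system is being probed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` given `prompt`."""
    full = tok(prompt + continuation, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = lm(full).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    targets = full[0, 1:]
    rows = torch.arange(n_prompt - 1, full.shape[1] - 1)  # continuation positions only
    return logprobs[rows, targets[n_prompt - 1:]].sum().item()

prompt = ("Option A: a 50% chance of 100 points, otherwise nothing. "
          "Option B: 45 points for sure. Most people chose")
print("' option B':", continuation_logprob(prompt, " option B"))
print("' option A':", continuation_logprob(prompt, " option A"))
```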
-
@UlrikeHahn @locha I question the very idea that a generalization was made. My first assumption is that something close enough to the "new" items is in the training dataset, and that the model, as expected, just reproduces what is in the training set.
-
@UlrikeHahn @locha I did, but I fail to see how it answers my concern.
-
@UlrikeHahn @locha I object to this wording, since it assumes an abstract reasoning capacity, which LLMs lack. It may *look like* a generalization, but it is not, considering the output could essentially be training data plus some noise (the noise parameter being responsible for the "close enough" part).
-