We had our first talk in the seminar series ‘The Cognitive Science of Generative AI’
-
But he also described their more recent work of using genAI systems *as cognitive models*: using (fine-tuned) models to predict human behaviour and comparing their performance to standard computational models in those domains. This work, with their system Centaur, has now been scaled to cover over 160 behavioural experiments, see
https://arxiv.org/abs/2410.20268
3/7
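(For anyone curious about the mechanics, here is a minimal, purely illustrative sketch of what “using a fine-tuned model as a cognitive model and comparing it to a standard computational model” can look like: score the same recorded human choice under a language model and under a textbook expected-value-plus-softmax rule. gpt2 stands in for the fine-tuned model here, and the single trial, the numbers and the baseline are placeholders, not the actual Centaur setup.)

```python
# Illustrative only: gpt2 stands in for a fine-tuned behavioural model,
# and the single made-up trial stands in for a real experiment.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# One made-up gamble trial plus the choice a human participant made.
trials = [
    {"p": 0.5, "win": 100, "sure": 45,
     "prompt": ("A: 50% chance of 100 points, otherwise nothing.\n"
                "B: 45 points for sure.\n"
                "You choose option"),
     "human_choice": "B"},
]

def lm_nll(trial) -> float:
    """Negative log-likelihood the language model assigns to the human's choice."""
    ids = tok(trial["prompt"], return_tensors="pt")
    with torch.no_grad():
        logits = lm(**ids).logits[0, -1]          # next-token distribution
    logprobs = torch.log_softmax(logits, dim=-1)
    choice_id = tok.encode(" " + trial["human_choice"], add_special_tokens=False)[0]
    return -logprobs[choice_id].item()

def ev_softmax_nll(trial, beta=0.05) -> float:
    """Same quantity under a textbook expected-value + softmax choice model."""
    ev_a = trial["p"] * trial["win"]
    ev_b = trial["sure"]
    p_b = 1.0 / (1.0 + math.exp(-beta * (ev_b - ev_a)))
    p_choice = p_b if trial["human_choice"] == "B" else 1.0 - p_b
    return -math.log(p_choice)

# Lower summed NLL = the model predicts the recorded human choices better.
print(sum(map(lm_nll, trials)), sum(map(ev_softmax_nll, trials)))
```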
-
Eric’s lab were the first to use behavioural tasks from psychology to probe LLMs, and in the talk he discussed not just what they found but why that is useful (and how it relates to the benchmarking of LLMs and genAI systems)
https://www.pnas.org/doi/full/10.1073/pnas.2218523120
2/7
-
@UlrikeHahn A bit lazy on my part to ask, but… are these tasks that have been studied? GenAI systems seem to do amazingly well when you can find the answer in their (humongous) training data, but not very well once you have novel tasks. Even presenting a problem in a bizarre way would throw them off.
-
@locha 2/2 e.g. “participants preferred the less risky option 75% of the time”.
So any straightforward application of “next token prediction” isn’t going to get you (the system) the ‘right’ answer when you are faced with the stimulus items themselves (here are two options [..] and [..] which…?) and have to go through and actually make your own choices.
-
@locha yes, the problem of ‘data leakage’. These tasks have been ‘studied extensively’, but my point is that their descriptions don’t contain the tasks themselves:
the task is, say, “here are two options [..] and […], which of these do you prefer?”. And you, as a participant, get a whole bunch of these choices to make. Your preferences across all of those items are the data that get recorded. That data is then only ever given a high-level summary description in a paper 1/2
-
@UlrikeHahn @locha I expect detailed descriptions of these tasks abound in textbooks and lecture notes put online (mine included), which very likely are in the training data.
-
@locha or to put it differently, these behavioural tasks are very different from something like false-belief tasks, where there is an (often single) text-based question and a ‘right answer’, and both the question and the right answer will invariably be described in the paper.
-
@MathieuP @locha of course there are “descriptions of these tasks” - the point is how you get from those descriptions to actual choices. The descriptions will be something like: “to examine risk aversion, we gave participants 60 gambles involving two options that varied in risk”…”we found that participants chose the risk-averse option 80% of the time”. By contrast, the task the system gets is 60 iterations of “here are two options [..] and [..], which of these do you prefer?”
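(A toy illustration of that gap, with every name and number made up: the one-line summary a paper reports versus the unlabelled, trial-level items a participant, or a system, actually has to respond to.)

```python
# All text and numbers here are invented for illustration.
import random

random.seed(0)

# What the paper reports: one summary sentence, no individual items.
paper_summary = ("To examine risk aversion, we gave participants 60 gambles; "
                 "they chose the risk-averse option 80% of the time.")

# What the participant (or the system) actually gets: 60 unlabelled choices.
def make_trial() -> str:
    p = random.choice([0.3, 0.5, 0.7])
    win = random.randrange(50, 200, 10)
    sure = round(p * win * random.uniform(0.7, 0.95))  # sure amount below expected value
    return (f"Option 1: {int(p * 100)}% chance of winning {win} points, otherwise nothing.\n"
            f"Option 2: {sure} points for sure.\n"
            "Which of these do you prefer?")

trials = [make_trial() for _ in range(60)]
print(paper_summary)
print(trials[0])  # nothing in this text says which option is the 'risk-averse' one
```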
-
@UlrikeHahn @locha I beg to differ. In my class, I fully described examples of options, and which one people preferred. That looks to me like the kind of structure an LLM could pick up.
-
@MathieuP @locha how, in your view, do you get from the summary description to *picking individual options* in a way that matches human preferences? The options *aren’t labelled*. The term “risk aversion” literally features nowhere, and even if it did, to pick the corresponding options you would need to know what ‘risk aversion’ means such that you are able to identify the right option of the two, *and* then do that the right proportion of times…
-
@UlrikeHahn @locha In my opinion (as a statistician), you would get fairly accurate results if the figures used in the problems the model faces are close enough to those used in Masters-level textbooks and exercises. The LLM would treat the figures as words, and would be able to predict that the options containing those figures are associated with a positive weight.
-
@UlrikeHahn @locha Strictly speaking, a genAI does not "identify" the right option in any meaningful way. It just predicts that a string of the form "choose this" is more often associated with some formulation of option A than with some formulation of option B. Boilerplate lecture slides with "people chose this over that", together with the two options, provide the required training data.
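(A toy sketch of that claim, with gpt2 as a stand-in and made-up text: the system simply scores which continuation is more probable given whatever formulations of the options appear in the prompt; nothing in the computation "identifies" an option.)

```python
# Toy sketch: compare the log-probability of two continuations under gpt2.
# All text is made up; gpt2 stands in for whatever system is being probed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` given `prompt`."""
    full = tok(prompt + continuation, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = lm(full).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    targets = full[0, 1:]
    rows = torch.arange(n_prompt - 1, full.shape[1] - 1)  # continuation positions only
    return logprobs[rows, targets[n_prompt - 1:]].sum().item()

prompt = ("Option A: a 50% chance of 100 points, otherwise nothing. "
          "Option B: 45 points for sure. Most people chose")
print("' option B':", continuation_logprob(prompt, " option B"))
print("' option A':", continuation_logprob(prompt, " option A"))
```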
-
@UlrikeHahn @locha I question the very idea that a generalization was made. My first assumption is that something close enough to the "new" items is in the training dataset, and that the model, as expected, just reproduces what is in the training set.
-
@UlrikeHahn @locha I did, but I fail to see how it answers my concern.
-
@UlrikeHahn @locha I object to this wording, since it assumes an abstract reasoning capacity, which LLMs lack. It may *look like* a generalization, but it is not, considering the output could essentially be training data plus some noise (the noise parameter being responsible for the "close enough" part).
-