We had our first talk in the seminar series ‘The Cognitive Science of Generative AI’
-
the subsequent discussion (not recorded) was lively - it was an added bonus that there were some Mastodon friends there…
next week, Nov 21st, our speaker will be Cameron Buckner
join us, registration here: https://psyc.bbk.ac.uk/cccm/cccm-seminar-series/
-
..that seems potentially transformative if it were to hold on a wider scale..
All in all, it was a thought-provoking opener in terms of what the seminar series is specifically meant to be about: a focus on the implications of genAI for cognitive science and vice versa
7/7
-
2. my preferred interpretation of my own work with connectionist/ML models of cognition has always been deflationary, taking the real value of these models to be that they identify structure available in the data (i.e., identify what information is available for executing the cognitive task)
Where genAI models manage to predict human behaviour better than extant cognitive models, they are providing information on how much there is to explain, i.e., how well our theories could ever do
6/7
-
(as the majority of experimental tasks in psych are).
That relates back to the question of how human-like these systems are, inasmuch as the corresponding human response choices were never themselves in the training data.
(the only thing likely to have been there is papers describing those results in aggregate, and an aggregate description is a detail-poor meta-statement, not a choice in the experimental task itself). So what is driving that correspondence?
5/7
-
And most recently, they have been extending this work to try to get genAI systems to also provide explanations, by letting them infer underlying heuristics from human data (e.g., ‘take the best’)
Several thoughts struck me hearing the talk:
1. The ability of these models to predict human behaviour so well is surprising: asking a model to, say, make choices in a two-armed bandit problem is *not a linguistic task*, merely one communicated in language
4/7
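To make that first point concrete, here is a minimal sketch of what ‘a bandit problem communicated in language’ might look like; the prompt wording and the query_model stub are my own illustrative assumptions, not the materials actually used in the work discussed in the talk.

```python
# Hypothetical sketch: presenting one two-armed bandit trial to a language model.
# The prompt wording and query_model() are illustrative stand-ins, not the setup
# used in the work discussed above.

def bandit_trial_prompt(history):
    """Render the trial as text: the task is communicated in language,
    but the required output is a choice, not a linguistic judgement."""
    lines = ["You are playing a game with two slot machines, A and B."]
    for arm, reward in history:
        lines.append(f"You chose machine {arm} and received {reward} points.")
    lines.append("Which machine do you choose next? Answer with A or B.")
    return "\n".join(lines)

def query_model(prompt):
    """Placeholder for whichever genAI system is being probed."""
    raise NotImplementedError

history = [("A", 4), ("B", 0), ("A", 7)]
print(bandit_trial_prompt(history))
# choice = query_model(bandit_trial_prompt(history))  # expected output: "A" or "B"
```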
-
But he also described their more recent work of using genAI systems *as cognitive models*, using (fine-tuned) models to predict human behaviour and compare their performance to standard computational models in those domains. This work, with their system Centaur, has now been scaled to cover over 160 behavioural experiments, see
https://arxiv.org/abs/2410.20268
3/7
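As a rough sketch of what ‘compare their performance to standard computational models’ can mean in practice (my own toy illustration, not the Centaur evaluation code), both kinds of model can be scored by how much probability they assign to the choices participants actually made:

```python
import math

def negative_log_likelihood(choice_probs, observed_choices):
    """Score a model by the probability it assigns to each observed human choice.
    choice_probs: per-trial dicts mapping option label -> predicted probability.
    observed_choices: the option labels participants actually chose."""
    return -sum(math.log(p[c]) for p, c in zip(choice_probs, observed_choices))

# Toy numbers, purely illustrative.
observed = ["A", "B", "A"]
finetuned_llm   = [{"A": 0.7, "B": 0.3}, {"A": 0.4, "B": 0.6}, {"A": 0.8, "B": 0.2}]
cognitive_model = [{"A": 0.5, "B": 0.5}, {"A": 0.5, "B": 0.5}, {"A": 0.5, "B": 0.5}]

print("fine-tuned model NLL:", negative_log_likelihood(finetuned_llm, observed))
print("cognitive model NLL: ", negative_log_likelihood(cognitive_model, observed))
# Lower negative log-likelihood = better fit to the observed human choices.
```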
-
Eric’s lab were the first to use behavioural tasks from psychology to probe LLMs, and in the talk he discussed not just what they found but why that is useful (and how it relates to the benchmarking of LLMs and genAI systems)
https://www.pnas.org/doi/full/10.1073/pnas.2218523120
2/7
-
@UlrikeHahn A bit lazy on my part to ask, but… are these tasks that have already been studied? GenAI seems to do amazingly well when you can find the answer in their (humongous) training data, but not very well once you have novel tasks. Even presenting a problem in a bizarre way would throw them off.
-
@locha 2/2 e.g. “participants preferred the less risky option 75% of the time”.
So any straightforward application of “next token prediction” isn’t going to get you (the system) the ‘right’ answer when you are faced with the stimulus items themselves (here are two options [..] and [..] which…?) and have to go through and actually make your own choices.
-
@locha yes, the problem of ‘data leakage’. These tasks have been ‘studied extensively’, but my point is that their descriptions don’t contain the tasks themselves:
the task is, say, "here are two options [..] and […], which of these do you prefer?". And you, as a participant, get a whole bunch of these choices to make. And your preferences across all of those items are the data that is recorded. That data is then only ever given a high-level summary description in a paper 1/2
-
@UlrikeHahn @locha I expect detailed descriptions of these tasks abound in textbooks and lecture notes put online (mine included), which very likely are in the training data.
-
@locha or to put it differently, these behavioural tasks are very different from something like false belief tasks, where there is an (often single) text-based question and a ‘right answer’, and both the question and the right answer will invariably be described in the paper.
-
@MathieuP @locha of course there are “descriptions of these tasks” - the point is how to get from those descriptions to actual choices. The descriptions will be something like: “to examine risk aversion, we gave participants 60 gambles involving two options that varied in risk”…”we found that participants chose the risk-averse option 80% of the time”. By contrast, the task the system gets is 60 iterations of “here are two options [..] and [..], which of these do you prefer?”
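A toy sketch of that contrast (wording and numbers are my own invention, not the actual materials): the literature contains one aggregate sentence, while the system faces the trial-level prompts, and nothing in the sentence says which option to pick on any given trial.

```python
import random

random.seed(0)

# What the system faces: 60 trial-level prompts (wording is invented here).
def gamble_prompt(safe, risky, p):
    return (f"Here are two options: (1) receive {safe} points for sure, "
            f"(2) receive {risky} points with probability {p}. "
            f"Which of these do you prefer?")

trials = [gamble_prompt(safe=random.randint(20, 50),
                        risky=random.randint(60, 120),
                        p=round(random.uniform(0.2, 0.6), 2))
          for _ in range(60)]

# What a paper (or lecture slide) typically contains: one aggregate summary.
summary = "Participants chose the risk-averse option 80% of the time."

print(trials[0])
print(summary)
# The summary never labels which option is risk-averse on a given trial, so
# matching human choices requires more than reproducing the summary sentence.
```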
-
@UlrikeHahn @locha I beg to differ. In my class, I fully described examples of options, and which one people preferred. That looks to me like the kind of structure an LLM could pick up.
-
@MathieuP @locha how, in your view, do you get from the summary description to *picking individual options* in a way that matches human preferences? The options *aren’t labelled*. The phrase “risk aversion” literally features nowhere, and even if it did, to pick the corresponding options you would need to know what ‘risk aversion’ means such that you are able to identify the right option of the two, *and* then do that the right proportion of times…
-
@UlrikeHahn @locha In my opinion (as a statistician), you would get quite accurate results if the figures used in the problem the model faces are close enough to those used in Masters-level textbooks and exercises. The LLM would treat the figures as words, and be able to predict that the options containing those figures are associated with a positive weight.
-
@UlrikeHahn @locha Strictly speaking, a genAI does not "identify" the right option in any meaningful way. It just predicts that a token sequence of the form "choose this" is more often associated with some formulation of option A than with some formulation of option B. Boilerplate lecture slides with "people chose this over that", together with the two options, provide the required training data.
-
@UlrikeHahn @locha I question the very idea that a generalization was made. My first assumption is that something close enough to the "new" items is there in the training dataset, and that the model, as expected, just reproduces what is in the training set.