Animal personality, operationalism and ontological ambiguity

Animal personality is one of the fastest growing areas of animal behaviour science, commonly defined as the study of relatively consistent among-individual behavioural variation through time and across contexts. It has, however, also attracted a considerable degree of criticism for being overly anthropomorphic, for being concerned solely with increasing the media accessibility of animal behaviour studies, and for providing no fundamentally new insights into animal behaviour (Beekman & Jordan, 2017). Some have gone further, calling personality "an empty placeholder feigning understanding" (Jungwirth et al., 2017), one which has "emerged like a virus within behavioral ecology" (Crews, 2013). The consequences of these criticisms have been relatively minor, but they have meant that, every now and then, a new article appears trying to clarify, settle and 'level the playing field' about what animal personality actually means and how it should be studied, creating a large number of 'data free' papers on personality which 'fail to be impactful' (see DiRenzo & Montiglio, 2015).

My PhD was on animal personality, specifically focused on dog behaviour, and I followed these debates relatively closely. The topic continues to interest me almost five years on from leaving academia because it encapsulates a fundamental issue in many of the behavioural sciences, and likely other fields too: one that isn't discussed as often as it should be, but that sits at the core of the reproducibility crisis, scientific method and statistical inference.

Good science requires strong theories, as only then can we design and accurately interpret the results of scientific studies. Strong theories are detailed blueprints about how the world works, blueprints that produce predictions and provide evidence for the processes we believe cause natural phenomena. Empirical evidence is often not enough, due to the probabilistic nature of how we sample observations of the world around us and the many ways that statistical results can confuse and mislead us. The same empirical results can support competing hypotheses or theories of how the world works, for instance, and problems with measurement and observational error cause ambiguity in using empirical results as evidence for any hypothesis or theory. We need strong theories to guide us through these murky empirical waters to avoid being misled. Importantly, many of the open science bulwarks against questionable research practices such as p-hacking or HARKing cannot protect against the ambiguity resulting from a lack of strong theory.

It is not controversial to say that many areas of behavioural science do not build strong theories of how the world works before engaging in empirical research programs. The incentives for building strong theory, as opposed to quickly running experiments, collecting data and publishing, don't exist. Most studies start from a vague hypothesis, such as 'General intelligence is influenced by socio-economic status' or 'Animal behaviour is moderately consistent across contexts', define a number of measurements of the constructs of interest, collect some data and churn out a series of numbers (correlation coefficients, p-values, posterior distributions etc.) which tell us how variables are associated with each other and the degree to which the data match the statistical model's (but infrequently the theoretical model's) assumptions. If there are many similar such studies, and sets of results, some consensus might be reached, but often there are plenty of exceptions. What we are left with is samples of empirical results that, as we have learned in recent years, have a less than desirable chance of being reproduced, and that are hard to interpret as evidence for or against a particular hypothesis, because there was scarce theory in the first place to tell us what the results should even be.

It's worth transporting ourselves back one hundred years to the early 20th century, when the dominant philosophy of science, and its agenda for scientific measurement, was a quite radical empiricism, which generally attempted to expunge reference to the metaphysical, holding that we can learn all there is to know about the world from our direct observations of and experiences with it. In 1927, Percy Bridgman introduced the term operationalism in his book The Logic of Modern Physics: a philosophy of science holding that we can define scientific constructs purely by the operations by which we measure them. For instance, here's a well-known quote from his book about the concept of length:

To find the length of an object we have to perform certain physical operations. The concept of length is therefore fixed when the operations by which length is measured are fixed: that is, the concept of length involves as much as and nothing more than the set of operations by which length is measured.

At the time, this was highly influential, and particularly appealing to the logical positivist school of philosophy, which was likewise trying to define scientific constructs, such as those in psychology at a time of growing interest in psychological testing, without reference to anything concrete in the real world. Operationalism meant that a construct like 'intelligence' is as much as and nothing more than scores from an intelligence test. Stevens (1966) wrote that operationalism "ensures us against hazy, ambiguous and contradictory notions and provides the rigor of definition which silences useless controversy".

The death knell of operationalism as a valid philosophy of science was that constructs cannot, in fact, be exhausted by how we measure them. There are multiple methods for measuring length, for instance, but we would all agree that all calibrated methods measure the same thing; believing that each method measures a different concept is deeply confusing. If you believe that there are multiple measurements of intelligence, and that intelligence is something that exists independently of those measurements, something that has a biological basis, you are in opposition to operationalism. You are also in opposition to operationalism if you believe there are better and worse methods of measuring constructs like intelligence, or if you buy into the concepts of reliability and validity of measurements, because such things are illogical under a philosophy in which every measurement defines its own concept, so that no measurement can be better or worse than another. Operationalism synonymises meaning with measurement, and it does so to avoid any reference to external 'things' in the universe whose existence was, at the time, difficult to prove. This induces a great deal of ontological ambiguity about what constructs are. If you want to learn more, I recommend starting with Green's (2001) article Of immortal mythological beasts: operationism in psychology.

Logical positivism is largely dead, and operationalism has been criticised and rejected by philosophers of science since its inception, with even Bridgman and the logical positivists abandoning it in later years. Yet operationally defining scientific constructs remains a core part of the behavioural scientist's toolkit. The use of operational definitions is not in itself the problem; the problem arises when those operational definitions become sufficient, rather than simply necessary, to define behavioural constructs. Intelligence should causally affect scores on intelligence tests, but it is not reducible to a score on an intelligence test. It is this operational stance that I think has confused, and continues to confuse, the field of animal personality.

In one of the most influential articles on personality, Reale et al. (2007) defined animal 'temperament', now regarded as largely synonymous with animal personality, as "the phenomenon that individual behavioural differences are consistent over time/across situations", and defined five initial traits that most behaviours reflect (aggression, sociality, boldness, exploration and activity). The authors wrote that "each trait should be operationally defined and its ecological validity tested" to guard against interpreting these traits as "'unobservables' or qualities that are difficult to measure (e.g. dispositions)", lest researchers refrain from taking animal personality seriously. This purely operational definition of animal personality has been reiterated in numerous articles since. For instance, Carter et al. (2013) say that the term 'trait' "is used in behavioural ecology to mean a measurable aspect of an individual's behaviour that is, usually, repeatable while in psychology the use is more abstract and describes a construct", and Bell (2017) wrote that behavioural ecologists are "agnostic about the sources of variation" underlying animal personality. More recently, Kaiser & Müller (2021) wrote in their article "What is animal personality?", published in Biology & Philosophy:

Since personality traits are empirically accessible only through the behaviours that manifest them under specific conditions they should be operationally defined also in terms of these manifestations (i.e., behaviours) and manifestation conditions (i.e., experimental setups).

The key measurement of personality is behavioural repeatability ($R$), defined as the ratio of among-individual variance in behaviour to the total behavioural or phenotypic variance, which has a direct relationship to quantities of interest in quantitative genetics. Repeatability is most commonly calculated from hierarchical statistical models, where repeated measurements of behaviour are summarised by the ratio of among- to within-individual variance components, leading to the 'variance partition coefficient' or 'intra-class correlation coefficient':

$$ R = \frac{\sigma_{b}^2}{\sigma_b^2 + \sigma_w^2} $$

Therefore, if a researcher wants to study personality, it is often enough to collect repeated measurements of some behaviour believed to be linked to a personality trait, fit a hierarchical statistical model, extract the variance components and report this quotient. The random intercepts in the model then represent the 'personality' values of each individual. Repeatability is frequently between 0.3 and 0.5, but apart from this general range there is little theory predicting what counts as a high versus a low value, or at which point repeatability is too high or too low to constitute evidence for personality. Personality researchers often use repeatability to dissociate themselves from human psychologists, arguing that its calculation, and its links to quantitative genetics, make animal personality research more objective than the messier work of personality psychologists.
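To make this concrete, here is a minimal sketch (my own, not from any cited paper) of how a repeatability coefficient falls out of repeated measurements. It uses the classical one-way ANOVA variance-component estimator rather than a fitted hierarchical model, and all the simulated parameters are purely illustrative:

```python
import random

random.seed(1)

def repeatability(data):
    """One-way ANOVA estimate of R = sigma_b^2 / (sigma_b^2 + sigma_w^2).

    data maps each individual to an equal-length list of repeated
    behavioural measurements (a balanced design).
    """
    groups = list(data.values())
    n, k = len(groups), len(groups[0])
    grand = sum(sum(g) for g in groups) / (n * k)
    means = [sum(g) / k for g in groups]
    msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)  # between mean square
    msw = sum((x - m) ** 2
              for g, m in zip(groups, means) for x in g) / (n * (k - 1))
    sigma_b2 = max((msb - msw) / k, 0.0)  # among-individual variance component
    return sigma_b2 / (sigma_b2 + msw)

# Each individual gets a stable 'intercept' (its personality value, under the
# operational definition) plus occasion-level noise; true R = 1 / (1 + 1.2^2),
# roughly 0.41.
data = {}
for i in range(200):
    intercept = random.gauss(0, 1.0)  # among-individual sd = 1.0
    data[i] = [intercept + random.gauss(0, 1.2) for _ in range(5)]  # within sd = 1.2

print(round(repeatability(data), 2))  # estimate should land near the true value
```

Under the operationalist reading criticised below, printing this quotient just is 'measuring personality'; nothing in the calculation requires any claim about what, biologically, generates the stable intercepts.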

But this is wrong: anyone familiar with psychometrics will recognise the quotient above as the reliability coefficient used extensively in classical test theory in psychology, defined as the ratio of 'true score' variance to the total observed score variance. Reliability is a population-level measure of measurement precision, often calculated from repeated observations as a test-retest correlation, just as repeatability is necessarily a population-level estimate of the stability of among-individual behavioural differences. Classical test theory is still one of the dominant paradigms in psychometrics, and is where the popular concepts of reliability and validity originate. The 'true score' is the expectation of observed scores across a long-running series of independent experiments, and is quite similar to the idea in animal personality that random intercepts represent individual personality values. As Borsboom (2003) explains, true scores cannot be taken to be 'construct scores' (scores on the underlying attribute being researched), precisely because classical test theory has no cohesive measurement model linking construct scores to true scores, which follows from its operationalist philosophy. Borsboom writes:

[Classical test] theory does not contain the identity of true scores and construct scores - either by definition, by assumption, or by hypothesis. Moreover, it is obvious from the definition of the true score that classical test theory does not assume that there is a construct underlying the measurements at all. In fact, from the point of view of classical test theory, literally every test has a true score associated with it. For example, suppose we constructed a test consisting of the items "I would like to be a military leader", "$.10\sqrt{.05 + .05} = ..$", and "I am over six feet tall". After arbitrary — but consistent — scoring of a person's item responses and adding them up, we multiply the resulting number by the number of letters in the person's name, which gives the test score. This test score has an expectation over a hypothetical long run of independent observations, and so the person has a true score on the test. The test will probably even be highly reliable in the general population, because the variation in true scores will be large relative to the variation in random error (see also Mellenbergh, 1996). The true score on this test, however, presumably does not reflect an attribute of interest. The argument shows that it is very easy to construct true scores that have no substantial meaning in terms of scientific theories, and are therefore invalid upon any reasonable account of validity.

By the same token, it is easy to construct operational definitions and measurements of animal behaviour that result in moderately high repeatability coefficients that have no substantial meaning. For instance, the rate at which dogs blink will probably be at least moderately repeatable, but we'd struggle to link this to any particular trait.
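Borsboom's nonsense test, like the blinking example, is easy to demonstrate in simulation. The following sketch (mine, with entirely made-up numbers) builds a score in the same spirit: consistently but arbitrarily scored items, multiplied by the number of letters in the respondent's name. Nothing here measures any attribute, yet the test-retest correlation, the classical estimate of reliability mentioned above, comes out very high:

```python
import random

random.seed(7)

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Each 'person' has fixed, consistently scored item responses and a fixed
# name length; the test score is their product, observed on two occasions
# with a little occasion-level error.
occasion1, occasion2 = [], []
for _ in range(300):
    items = sum(random.choice([0, 1]) for _ in range(3))  # arbitrary item scoring
    letters = random.randint(3, 12)                       # letters in the name
    true_score = items * letters
    occasion1.append(true_score + random.gauss(0, 1))
    occasion2.append(true_score + random.gauss(0, 1))

print(round(pearson(occasion1, occasion2), 2))  # very high, despite meaning nothing
```

The score is 'reliable' simply because its stable component varies far more across people than the occasion-level error does, which is exactly Borsboom's point: high reliability, or repeatability, is cheap.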

Behavioural ecologists will rebut this by noting that such behaviours, like the rate of blinking, have little ecological validity: there is little ecological or evolutionary reason why blinking should be related to a trait of significant interest. This argument also rests on shaky ground, however. Animal behaviourists have not provided a detailed account of what validity means beyond the introductory textbook definitions used in psychometrics and classical test theory. I realise this might sound pretentious, but I believe it is completely fair. For instance, Carter et al. (2013) try to answer the question "What are behavioural ecologists measuring?" in reference to animal personality, but do not go beyond basic descriptions of reliability and validity, which are methods of validation and do not elucidate what is being measured. That question can only be answered by examining the ontological stances in animal personality, because methods of validation concern epistemology, not ontology. The notion of validity established in animal behaviour derives from classical test theory, which has over-complicated the meaning of validity into 'degrees of validity', interpretations of evidence, and methods that rely on eye-balling patterns in correlation matrices. This is a natural consequence of an operationalist philosophy of scientific measurement that does its best to skirt the existence of unobservable constructs. Borsboom et al. (2004) is one of the clearest references on this topic, so I'm going to quote the authors liberally here:

When claiming that a test is valid, one is taking the ontological position that the attribute being measured exists and affects the outcome of the measurement procedure.
A valid test can convey the effect of variation in the attribute one intends to measure. This means that the relation between test scores and attributes is not correlational but causal. A test is valid for measuring an attribute if variation in the attribute causes variation in the test scores. In this case, we say that it is true that the test measures the attribute in question.
However, as is often the case in psychology, we have beautiful models but too little theory to go with them. We can formulate the hypothesis that Extraversion is the common cause of the scores on a number of different items of subtests, but there is no good theory available to specify how different levels of Extraversion lead to different item responses. Thus, there is, at present, no detailed hypothesis of how the causal effect of Extraversion on the test scores is being conveyed. The causal model may be set up, but the arrows in it are devoid of interpretation. This does not show that personality tests are invalid or that no such thing as Extraversion exists. However, it does preclude any firm treatment of the problem of validity. The reason for this is that researchers expect to get an answer to the question of what the test measures, without having a hypothesis on how the test works. If one attempts to sidestep the most important part of test behavior, which is what happens between item administration and item response, then one will find no clarity in tables of correlation coefficients. No amount of empirical data can fill a theoretical gap.

Animal personality researchers are not rampant operationalists in the way that Bridgman was, but the founding stance of animal personality, which makes repeatability sufficient evidence for personality and remains largely agnostic about the mechanisms of behavioural variation, shares many similarities with early 20th century psychology and its sympathy for the operationalist paradigm. It also parallels the rationale of classical test theory in psychology. With this philosophical quicksand as its bedrock, it is completely unsurprising to me that personality research has attracted so much criticism and confusion. If someone says that a particular behaviour is a measure of 'boldness', but fails to define what boldness means in the biology of the animal, fails to offer a theoretical model of behaviour in which 'boldness' plays a part, and offers no blueprint for how boldness should translate into a population-level repeatability coefficient, confusion about boldness, or any other trait in question, necessarily follows.

The solution, I believe, is three-fold:

1. Commit to a realist interpretation of personality traits. While we cannot 'see' what traits are, we believe they are concrete, biological systems that exist in animals. Those systems are likely very complex and have multiple causes, but there are established ways of thinking about complex systems that do not entail anthropomorphic interpretations of behaviour.
2. Accept that repeatability is not, on its own, sufficient evidence of personality. Evidence needs to be accrued that a particular behaviour is an effect of the aforementioned biological systems; simply put, a behaviour is a valid measurement of a personality trait when that trait causes variation in that behaviour (Borsboom et al., 2004).
3. Build formal models delineating how personality traits result in relatively consistent among-individual variation in behaviour, models that attempt to predict repeatability coefficients and can be used to build strong theories.

There is much more one could write on this topic, including the other paradigms used within animal personality, such as latent variable modelling, and how they relate to operationalism and behavioural theory. I wrote part of this argument up in a paper a few years ago, which can be read as a preprint, but it was rejected from a journal after reviewers considered my reasoning ignorant of the particular trajectory of animal personality as a science, and took me to be equating human and animal personality. With respect to the reviewers, I think my points were misunderstood: they believed I was unaware of how animal personality originated from a growing interest in the adaptive significance of behavioural variation. Perhaps I should have written it up more clearly.