Do we, can we, have a science of intelligence?
Two current approaches to the computer science of artificial intelligence are, in my opinion, barriers to the development of artificial general intelligence. The first is the push for formal mathematical expression and algorithms. The second is the focus on leaderboards—the list of projects and their scores on defined problems.
The fixation on formalism is reminiscent of the logical positivists, who tried to rid science of intellectual sloppiness. They argued that all scientific statements could, and should, be reduced to observation statements (“This is what I saw”) and logical deductions from those statements. Logical positivism failed for a number of reasons (see Kuhn, 1970, The Structure of Scientific Revolutions; Lakatos & Musgrave, 1970, Criticism and the Growth of Knowledge). Although mathematical formalisms have a role to play, the insistence on them reflects the position that formal algorithms alone are enough to generate artificial general intelligence.
The focus on leaderboard success is a recognition that computers do not have to solve problems in the same way that humans do. For example, tree-traversal algorithms are competitive with the best chess players; the computer does not have to know human psychology. On the one hand, this pragmatism is useful. On the other, the focus on being at the top of a leaderboard, sometimes by only a fraction of a percentage point, reflects the position that the behavior of models is all there is, or at least all that matters.
The current approaches to artificial intelligence are strongly reminiscent of the behaviorism of the 1950s and 1960s. The key assumptions of behaviorism, translated into the computer science of intelligence, would include:
1- Artificial intelligence science is the science of the (primarily verbal) behavior of artificial systems.
2- Behavior can be described and explained without reference to anything other than the behavior of systems, preferably described in formal, mathematical terms. To be intelligent means precisely to act intelligent.
3- Mental terms, particularly human mental terms, are not needed. These terms should be avoided or translated into behavioral terms (formal algorithms and benchmark performance).
Artificial general intelligence, for example, is usually defined in terms of the number of problems a system can solve. Agüera y Arcas and Norvig recently defined artificial general intelligence as a system that can deal with many topics, perform many tasks in multiple modalities and multiple languages, and can be instructed. These are all behavioral properties: a generally intelligent system is one that can perform these tasks. But, as it turns out, large language models solve only one task: guess the next word, given the context. It is the cleverness of users, applying that capacity through well-designed prompts, that accomplishes what look like multiple behaviors. If a task is consistent with the learned language probability patterns, then the machine can accomplish that nominal task.
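To make the “one task” concrete, here is a toy sketch of next-word prediction. The corpus is a made-up stand-in for web text, and the bigram counting is a deliberately minimal caricature of what real language models do at vastly greater scale; nothing here represents any actual system.

```python
from collections import Counter, defaultdict

# A toy "language model" that does the one thing the essay describes:
# guess the next word, given the context. The corpus is invented.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def guess_next(word):
    """Return the most probable next word after `word`."""
    return counts[word].most_common(1)[0][0]

print(guess_next("the"))  # "cat" -- the most frequent continuation
```

Everything a model like this does, however multi-task it may appear, reduces to lookups in these learned conditional frequencies.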
In my opinion, a science of artificial intelligence based on discredited epistemological foundations (logical positivism and behaviorism) is doomed to fail, just as the attempt to build other sciences on them has failed. Scientific progress depends on better understanding of the phenomena, not just producing slight improvements in some well-defined task. The current large language models are far superior to their predecessors at predicting the next word, given a context. They are great at tasks that can take advantage of that predictability, but they fail at tasks that are not compatible with it.
A focus on leaderboard position, which can be affected by chance and other factors, tends to show diminishing incremental value over time: early projects on a given task tend to show large improvements from project to project; later projects show progressively smaller ones.
One source of this improvement is the availability of increased compute resources, but these, too, bring diminishing returns. GPT-4 is said to have 1.76 trillion parameters, ten times more than GPT-3.5 (175 billion), but the bigger model is arguably only slightly better at some tasks.
The focus on leaderboards to measure progress is almost antithetical to scientific progress. Better performance can come from better techniques, but also from chance, or from brute force using an expanding capacity of computational resources. Scientific progress, on the other hand, depends on better understanding of the phenomena.
A larger, related, problem for artificial intelligence research is a lack of critical thinking. As long as performance improves, there is, too often, a failure to consider alternative hypotheses. For any given result, could another, perhaps simpler, model do the job as well? Could there be another explanation for the observation? This lack of critical thinking is most apparent in the claims that GenAI models, such as the GPT series, Google’s PaLM, or Meta’s Llama, have “discovered” some emergent property, such as reasoning, cognition, a world model, or even sentience.
An observation consistent with learned language patterns does not require the supposition of deeper cognitive processes. The proponents of these cognitive claims, on the other hand, observe behavior that is consistent with a model having one or more of these properties and then jump to the conclusion that the model must therefore have those properties. This pattern of reasoning is a logical fallacy called affirming the consequent.
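The fallacy can be checked mechanically. The following sketch enumerates every truth assignment for the argument form “If P then Q; Q; therefore P” and finds an assignment where both premises hold but the conclusion fails, which is exactly what makes the form invalid:

```python
from itertools import product

def implies(p, q):
    # Material implication: "if P then Q" is false only when P and not Q.
    return (not p) or q

# Affirming the consequent: premises are (P -> Q) and Q; conclusion is P.
# Collect assignments where the premises are true but the conclusion is false.
counterexamples = [
    (p, q)
    for p, q in product([True, False], repeat=2)
    if implies(p, q) and q and not p
]
print(counterexamples)  # [(False, True)]: premises hold, yet P is false
```

The single counterexample (P false, Q true) is precisely the Deep Blue and Lincoln cases below: the predicted observation occurred, but the hypothesis was false.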
Following this illogic, we would have to conclude that artificial general intelligence was achieved in 1997. In 1979, Douglas Hofstadter speculated that in order for a chess-playing program to be successful, it would have to be a program of general intelligence.
Hypothesis: In order to beat a chess master, a computer program will need to be capable of general intelligence.
Observation: In 1997, IBM’s Deep Blue defeated Garry Kasparov, the reigning world chess champion.
Conclusion: IBM’s Deep Blue was capable of general intelligence.
Here is another, even more absurd, example of that pattern.
Hypothesis: Abraham Lincoln was killed by robots.
Observation: Lincoln is dead.
Conclusion: Therefore, Lincoln was killed by robots.
A historian promoting this obviously bogus conclusion would be laughed at, but an artificial intelligence researcher making the same kind of argument is taken seriously.
A recent paper claims evidence that large language models learn world models:
The capabilities of large language models (LLMs) have sparked debate over whether such systems just learn an enormous collection of superficial statistics or a coherent model of the data generating process—a world model. We find evidence for the latter by analyzing the learned representations of three spatial datasets (world, US, NYC places) and three temporal datasets (historical figures, artworks, news headlines) in the Llama-2 family of models. We discover that LLMs learn linear representations of space and time across multiple scales. These representations are robust to prompting variations and unified across different entity types (e.g. cities and landmarks).
It may be desirable for an artificial intelligence to learn a world model, but their observation is no more supportive of this conclusion than are the Kasparov or Lincoln observations. The language model itself, that is, its statistical model, is enough to explain their observed results. The World Wide Web is replete with pages that describe the distance between arbitrary locations, for example. Google reports that there are 4,920,000 pages retrieved for the query “Kalamazoo Omaha distance.” Surely, not all 4.9 million of them actually contain a specific statement of the distance, but many of them do. Here is an example:
The total driving distance from Omaha, NE to Kalamazoo, MI is 596 miles or 959 kilometers.
If the answer is in the language model, no other model is needed, and so this result provides no support for the hypothesis that a model trained to guess the next word could somehow learn something else entirely. A simple bag-of-words model (in which word-document associations are learned independently of word order) might be sufficient.
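The bag-of-words alternative can be sketched in a few lines. The “pages” below are invented stand-ins for retrieved web documents (the quoted distance sentence is the only real text reused from above); the point is that the right answer falls out of word-document co-occurrence alone, with word order and any “world model” discarded entirely:

```python
from collections import Counter

# A minimal bag-of-words index: each page becomes an unordered
# multiset of its words. The page texts are illustrative stand-ins.
pages = {
    "p1": "the total driving distance from omaha ne to kalamazoo mi is 596 miles",
    "p2": "omaha is the largest city in nebraska",
    "p3": "kalamazoo is a city in michigan",
}
index = {doc: Counter(text.split()) for doc, text in pages.items()}

def best_page(query):
    """Score pages by overlap with the query words -- no order, no geometry."""
    q = set(query.split())
    return max(index, key=lambda doc: sum(index[doc][w] for w in q))

print(best_page("kalamazoo omaha distance"))  # "p1", the page with the answer
```

A system that answers distance questions this way exhibits the observed behavior without representing space at all, which is why the observation cannot discriminate between the two hypotheses.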
It would be extraordinary if advanced cognitive processes were to emerge from a word-guessing model. Extraordinary claims require extraordinary evidence, and that evidence has not been forthcoming.
When distinguished artificial intelligence experts like Peter Norvig and Geoffrey Hinton claim that large language models are approximating or beginning to demonstrate artificial general intelligence, these claims are based on some kind of at least implicit theory. Under what theory could a model that is built to predict the next word be considered an example of general intelligence?
The key concepts in this implicit theory include:
Language is enough. Intelligence is the same thing as language ability. Everything that can be known can be written down in an encyclopedia or, more specifically, on the World Wide Web. All “thought” is expressed in language, and no other representations (such as a world model) are needed. Thinking is the behavior of speaking English for English speakers, Chinese for Chinese speakers, and so on. The only representation that can exist is the conditional probability of a word, given its context. There is no need for a “language of thought” that is more abstract than this, or for any other kind of mental representation or process. Playing a doctor on TV, then, is equivalent to being a doctor.
Existing language pattern statistics are enough. The current GenAI systems have been trained explicitly on a broad range of language and those patterns are sufficient to account for all future intelligent language, and therefore, all future intelligence.
Reward is enough. Intelligence is “the maximization of a certain quantity, by a system interacting with a dynamic environment” (Pennachin and Goertzel, 2007). Gradient descent or some other method of optimizing parameters is enough to generate intelligence. The human brain has structure, which may be critical to human intelligence, but reward is enough to produce any comparable structure needed for machine intelligence.
Scale is enough. Once we have a computing system in which the number of parameters approximates the number of synapses in the human brain, we will achieve human-like intelligence. The human brain is thought to have about 100 trillion synapses (connections between neurons). If each of those synapses were represented by a single parameter, then a comparably sized computer system would have 100 trillion parameters. GPT-4 is said to have 1.76 trillion parameters, so there is a way to go.
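Under the theory’s own one-parameter-per-synapse assumption, the size of that remaining gap is easy to compute (both figures are the rough estimates quoted above, not measurements):

```python
# Rough figures from the text: ~100 trillion synapses in the human brain,
# and a reported ~1.76 trillion parameters for GPT-4.
human_synapses = 100e12
gpt4_params = 1.76e12

# One parameter per synapse implies the model is short by this factor.
print(human_synapses / gpt4_params)  # ~56.8x more parameters needed
```

Even granting the assumption, the claim is that another fifty-fold-plus scaling would, by itself, produce general intelligence.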
To summarize this implicit theory: currently existing systems are assumed to be sufficient to create artificial general intelligence, if only we had computational systems and data sources of sufficient scale. If we just had enough, intelligence would spontaneously emerge.
The most implausible assumption is the scale one. It relies on a “miracle” of some sort. According to this theory, a model reaching a certain size will spontaneously generate general intelligence. Spontaneous generation has not been a useful theory in biology. It did not work to produce living vermicelli; and it is not a sound basis for producing artificial general intelligence. Artificial general intelligence will require a great deal more than this theory, and the models associated with it, can provide.
Why is a theory of artificial general intelligence needed? If the possibility of intelligence is a scientific question, then progress will be limited or even blocked without such a theory. Science is driven by theory, and the observations a theory motivates in turn drive its development. Much of medicine would be impossible without the germ theory of disease. Before germ theory, miasmas were thought to be the cause of disease, which led to inadequate and even dangerous treatments, including bloodletting.
Researchers who claim that large language models manifest cognitive processes (such as having a world model) apparently recognize the desirability of such non-behaviorist representations, but they fail to employ adequate methodology to support claims of their existence. The model may behave as if it has a world model (when the demonstration is congruent with its language model), but that does not mean that it has this capability, nor does the observation explain how such a model would be achieved, given that we know the actual model was constructed as a word guesser.
A misguided theory of intelligence will still guide research, but in a faulty direction that ultimately wastes effort. Poor theory also leads to inappropriate and useless regulation. The existing theory of language models is based on the assumption that the preceding words are useful predictors of the following word. That theory is sufficient to predict the behavior of language models (without implicating other cognitive processes), but it is not a theory of artificial general intelligence, even with the additional assumptions described above. It is a theory of language production. Until we have such a theory, we are doing little more than collecting observations and explaining them through the computational equivalent of miasma. Hope is not a theory.
Secondarily, this implicit theory refers to systems that adjust the parameters of a specified structure. These models depend on a human to provide the data, the structure, and the reward (objective) function. They depend on learning from human-generated data, which they learn to emulate. In short, they are great at solving equations that are provided to them, but have no mechanism for generating new equations. The only “thoughts” that they can entertain are the “thoughts” that can be expressed in their structured set of parameters, which, in the case of language models, are the word probability patterns. As long as these models are dependent on humans for the data and for the structure, they can never achieve comparable levels of general intelligence.
I can do little more than sketch out what a theory of artificial general intelligence would look like. Although there are many definitions of intelligence, Robert Sternberg suggests:
Successful intelligence is defined as one’s ability to set and accomplish personally meaningful goals in one’s life, given one’s cultural context.
In the context of computational intelligence, I think that the key part of this definition is that intelligence includes the ability not only to accomplish goals but to set them. Anyone with high-school algebra can solve Einstein’s famous equation (E = mc²), but it took special intelligence to identify the need for, and then create, that equation.
Current computer models are accomplished at solving equations, but so far, it takes a human to have the insight to create significant new ones. A theory of general intelligence must include a way to identify the need for a problem to be solved, determine the structure for solving it, and implement the means to solve it. So far, the first two steps can be accomplished only by humans. A more complete theory of artificial general intelligence would, at a minimum, include processes that can create these structures.
Representation of data and the problem solving approach would be, I think, critical to a theory of intelligence. The representation determines what the computer can “think” about and how it thinks about it.
For some problems, often called “insight” problems, the key to a solution is to come up with the right representation. For example, what numbers follow in this sequence: 85491 …? If you are having trouble guessing the numbers that follow, here is a clue: solving this problem may require a change of representation. I will describe the solution later. So far, we have no idea how a computer could make such a change of representation, but we also have no reason to expect that it cannot be done.
A forward-looking theory of intelligence would also recognize that intelligence requires deeper representations than words. Current models are limited to “distributional” semantics, meaning that a word’s meaning is specified by the contexts in which it is used. Words with similar meanings occur in similar contexts. On this view, synonyms are related to one another because they occur in similar contexts. I think it is more appropriate to say that similar contexts are indicative of similar meanings. There are many words that can communicate the same meaning (synonymy) and many meanings that can be represented by the same word (polysemy). Synonyms have similar meanings because they represent similar concepts; they appear in similar contexts because they have similar meanings, not the other way around. Even the early proponents of distributional semantics (e.g., Wittgenstein) recognized that there is more to a word’s meaning than the words with which it co-occurs.
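Distributional semantics fits in a few lines, which is part of its appeal and part of its limitation. In this toy sketch (the three sentences are invented for illustration), each word’s “meaning” is nothing but the counts of the words appearing around it, and similarity is the cosine of those count vectors:

```python
from collections import Counter
from math import sqrt

# Invented mini-corpus; "doctor" and "physician" share their contexts.
sentences = [
    "the doctor treated the patient",
    "the physician treated the patient",
    "the dog chased the ball",
]

def context_vector(target):
    """A word's distributional 'meaning': counts of its co-occurring words."""
    ctx = Counter()
    for s in sentences:
        words = s.split()
        if target in words:
            ctx.update(w for w in words if w != target)
    return ctx

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = lambda v: sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b))

# Shared contexts make the synonyms maximally similar; "dog" less so.
print(cosine(context_vector("doctor"), context_vector("physician")))  # 1.0
print(cosine(context_vector("doctor"), context_vector("dog")))        # ~0.67
```

The model duly rates “doctor” and “physician” as similar, yet it represents nothing about medicine; the similarity is a symptom of shared meaning, not the meaning itself, which is exactly the point above.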
The truly important innovations in representation that have advanced artificial intelligence have so far come exclusively from humans. For example, ideas like using backpropagation to adjust the weights of neural networks, or combining many heterogeneous layers into a deep learning network, are human inventions. Once the structure has been invented, the machine can use that structure or representation to solve problems, but a general intelligence would also need a means to invent the structure. Given a space, determined by the parameters, and a problem, extant machine learning can find a path through that space (a set of parameter values) that will solve the problem if it can be done. But so far, it cannot construct that space.
General intelligence requires capabilities that are absent from current models, but being beyond the capacity of current models does not imply that those capabilities could not be within the purview of future models. The present models are not enough, but some future model could be, if we invest in the theory-driven development of the science of artificial intelligence.
The answer to the digit sequence requires you to sort the digits by the alphabetical order of their English names:
Eight, five, four, nine, one, seven, six, three, two, zero.
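Notice that once the representation change has been made, the rest is trivial; a few lines of code reproduce the whole sequence. What no current system supplies is the insight that the digits should be re-described by their English names in the first place:

```python
# After the insight, the puzzle is mechanical: re-represent each digit
# by its English name, then sort alphabetically by that name.
names = {0: "zero", 1: "one", 2: "two", 3: "three", 4: "four",
         5: "five", 6: "six", 7: "seven", 8: "eight", 9: "nine"}

sequence = sorted(names, key=lambda d: names[d])
print(sequence)  # [8, 5, 4, 9, 1, 7, 6, 3, 2, 0] -- so 85491 continues 76320
```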
I always find amusing the statements about how current approaches are “doomed to fail”, when people advocating your position have nothing to show for it, while current approaches are making a lot of progress.
Yes, current approaches are very limited. But they are good at solving concrete problems, and the range of problems they can solve is getting larger.
Any time people tried to do clever things, they failed. Only a pragmatic focus on incremental improvements has worked.
We will move beyond current algorithms, but incrementally, as we uncover more patterns in the world and as we see more systematic problems with current methods.
I do agree that “Representation of data and the problem solving approach would be, I think, critical to a theory of intelligence.” The problem is that people doing cognitive research have been doing this for decades, and the only thing they have accomplished is narrow-purpose, rigid models that can’t grow or express things beyond their hand-crafted representations.
That is why recently people have moved towards very large neural nets, where representations arise implicitly. Such representations are shallow, and do not separate the data from the concepts, but are more flexible.
I think learning how to build representations reliably is very hard. A solution where we guide neural nets towards building them based on data is likely going to work better than us doing that.