Abstract
It has long been recognized that humans (and possibly other animals) usually break problems down into smaller and more manageable problems using subgoals. Despite a general consensus that subgoaling helps problem solving, it is still unclear what the mechanisms guiding online subgoal selection are during the solution of novel problems for which predefined solutions are not available. Under which conditions does subgoaling lead to optimal behaviour? When is subgoaling better than solving a problem from start to finish? Which is the best number and sequence of subgoals to solve a given problem? How are these subgoals selected during online inference? Here, we present a computational account of subgoaling in problem solving. Following Occam's razor, we propose that good subgoals are those that permit planning solutions and controlling behaviour using less information resources, thus yielding parsimony in inference and control. We implement this principle using approximate probabilistic inference: subgoals are selected using a sampling method that considers the descriptive complexity of the resulting subproblems. We validate the proposed method using a standard reinforcement learning benchmark (fourrooms scenario) and show that the proposed method requires less inferential steps and permits selecting more compact control programs compared to an equivalent procedure without subgoaling. Furthermore, we show that the proposed method offers a mechanistic explanation of the neuronal dynamics found in the prefrontal cortex of monkeys that solve planning problems. Our computational framework provides a novel integrative perspective on subgoaling and its adaptive advantages for planning, control and learning, such as for example lowering cognitive effort and working memory load.
1. Introduction
The human problemsolving literature shows that an oftenused strategy to solve complex problems is subgoaling: splitting the original problem into smaller and more manageable tasks whose achievement corresponds to ‘milestones’ or ‘subgoals’ of the original problem [1]. This strategy is ubiquitous while solving problems of different kinds. For example, navigational planning problems can be decomposed by using landmarks (e.g. known locations) as subgoals and puzzles such as the Tower of Hanoi can be decomposed by using subgoals such as ‘free up third rod’.
In cognitive neuroscience, there is a general consensus that subgoals are key to complex planning and problem solving. Most theories of executive function assume that goals and subgoals are maintained in prefrontal hierarchies and permit exerting cognitive control during demanding activities [2–5]. It has been reported that the monkey prefrontal cortex encodes a sequence of activation of subgoals and goals during a delay period prior to action [6,7]. Failure to find good task decomposition or to follow the right sequence of subgoals can produce poor planning performance, as frequently observed in prefrontal patients solving the Tower of Hanoi [8].
Computational studies also provide many demonstrations of the usefulness of subgoals. In the hierarchical reinforcement learning (HRL) literature, it has been consistently shown that, given the right decomposition, problems can be learned and solved more efficiently, that is, in less time and with less resources [9,10]. The most popular HRL framework, called the ‘Options’ framework, explicitly uses subgoals for building temporal abstractions that allow faster learning and planning [11]. An Option can be conceptualized as a sort of macroaction that includes a list of starting conditions, a policy and a termination condition (subgoal). It is well assessed in the HRL literature that Options and subgoals bring several advantages in terms of faster learning and a reuse of strategies across problems (or skill transfer) [12]. However, this is only true if the right subgoals are selected: the selection of the wrong subgoals can be deleterious and slow down learning and inference [13].
Subgoals are not only useful for learning but they might also facilitate cognitive control. A recent series of studies used informationtheoretic notions to measure the amount of information in working memory that (sub)goals can ‘save’ while at the same time allowing successful task completion. In this formulation, the most useful subgoals are those that allow controlling behaviour with less information in memory; from a cognitive perspective, these advantages might be linked to lower working memory requirements during monitoring and control [14].
Finally, subgoaling is essential for planning and problem solving. Most popular architectures for symbolic problem solving [1,15] and planning [16,17] include a subgoaling component; and subgoaling is also used in a few connectionist systems [18,19]. It is often assumed (especially in the human problem solving/AI tradition [1,15]) that subgoaling proceeds backward from the final goal states and serves to resolve ‘impasses’: if a goalachieving action cannot be executed because of a missing precondition, achieving the precondition becomes the next system subgoal (and so on). Less investigated is the case of the feedforward (i.e. from the current to the goal state) selection of subgoals during planning, which is recognized as an important determinant of human problem solving [20,21] and whose neuronal underpinnings have been studied in monkeys [6,7] and humans [22].
We offer a model of subgoal selection that starts from a normative principle: we propose that during planning subgoals should be selected that minimize the computational complexity [19,23] of the problem at hand. In our proposed approach, subgoaling proceeds in a feedforward way with the objective of maximizing the probability of the resulting subproblems (or programs, see later) by selecting those ones with minimum algorithmic complexity and description length. Below we describe the computational approach and successively we validate it in two computational experiments, which focus on two issues: (i) under which conditions using subgoals leads to optimal behaviour and (ii) whether the specific subgoaling mechanisms envisaged here can be linked to neuronal computations in primates.
2. Material and methods
Our approach to subgoaling is framed within the planning as probabilistic inference framework, where goaldirected planning and problem solving are cast in terms of probabilistic inference [24–26]. In this framework, planning is usually implemented by imposing (‘clamping’) future goal states as observations, and running a probabilistic inference until a solution is found that goes from the current to the goal state [27]. In other words, planning consists of selecting actions that bias the probability of future events towards desirable states. A variant of this method consists of clamping future rewards rather than goal states [28,29]. Here, we use a related method that stems from Active Inference theory [30]: we set goals as (high) Bayesian priors and not as future observations, with the advantage that one has more flexibility on when (i.e. after how many steps) the goal will be eventually achieved.
In the remainder of this section, we introduce the three key components of the proposed method: (i) the probabilistic model used for the inference, which describes the conditional dependencies between the relevant stochastic variables (e.g. states, actions and subgoals); (ii) the method we adopted to assign ‘prior’ values to the subgoal variable; and (iii) the probabilistic inference method.
2.1. Probabilistic model
The probabilistic model we adopt is the Dynamic Bayesian Network (DBN [31]) of figure 1. In this representation, the nodes are stochastic variables expressing the elements that influence the planning process. They are

— S describes the agent's states; for example, its position in a maze. Here, we assume a discrete set of states, represented as integer values in the range {0, … , n}, with n being the total number of states.

— SG represents the subgoals (or subgoal states) used for planning. Potentially, every state S can be a subgoal used to achieve the final goal; this implies that SG can assume the same values as S: {0, … , n}. We assume that the final goal state is a particular subgoal having the highest a priori probability (see below).

— A represents the actions that an agent can execute to move from a state to another state. In the twodimensional mazes we use for our simulations, the agent has five actions: {u, d, l, r, ɛ}. The first four actions represent moving up, down, left and right, respectively. ɛ is a ‘rest’ action, or a transition from a state to the same state.

— Π represents the policies (or mappings between states and actions) available to the agent [32]. Intuitively, policies permit specification of sequences of transitions from (say) a start and a goal state. The number m of policies available to the agent vary depending on the number of states and actions in the environment. We include also a rest policy π_{ɛ} associated with each state the action ɛ.

— F describes the agent's progress to the goal state, i.e. whether it has finished or not. The allowed values of F are {0, 1, 2}. The value is 2 if the agent has reached the final goal state (and is an absorbing state preventing further inference); 1 if the agent has just reached a subgoal state; 0 otherwise. The main role of this variable is monitoring subgoal achievement and guiding the transitions between subgoals [33].
Overall, the model corresponds to a twolayered DBN (where the nodes are arranged on two layers corresponding to two consecutive slices of time indicated with subscripts, e.g. S_{t} and S_{t}_{−1}). Firstorder and stationary Markov properties hold: the conditional probabilities do not change as a function of t and every variable depends exclusively on other variables expressed at the same or the immediately preceding time step. In our simulations, the DBN structure and parameters are assumed to be known. Note that the transition P(πSG, S) captures the concept of an Option in HRL, but is expressed in a probabilistic way; furthermore, here the Options are not ‘cached’ but formed onthefly.
2.2. How to generate algorithmic priors for subgoals
A key departure from current models using the planningasinference framework is that we address the problem of subgoal selection. Here, subgoals are states that are inferred during the inference and guide the selection of compact and effective sequences of transitions from the initial state to the subgoal(s) and ultimately to the final goal. By assuming that the state space is discrete, such transitions can be considered as series of programs necessary to execute a series of moves from the agent's initial state s_{0} to the first subgoal state sg_{1}, to the second sg_{2}, and so on (following the instructions contained in some policy π_{j}) until the agent's final goal state s_{goal} is achieved.
It is important to note that every program can formally be transformed into a code, a binary string of length  that can be processed by a computing machine. Thus, following principles of algorithmic probability theory [23,34] (see appendix A for a discussion), one can associate a probability to every binary program , depending on the program length. By taking into account every program returning the subgoal sg, built with every possible combination between input states and policies, the algorithmic probability for a subgoal sg can be computed, up to a normalization factor, by 2.1
The aboveequation does not concern just subgoal states; it can be applied to every state belonging to S, in order to generate an a priori ‘algorithmic probability’ distribution (hereafter also denoted as algorithmic priors) that assigns a prior to each state, with highest priors given to the most strongly informative ones, i.e. the subgoals, important for decomposing any multistep planning task. This approach is conceptually similar to the method described in [35] that considers for each state ‘the amount of Shannon information that the agent needs to maintain about the current goal at a given state to select the appropriate action’.
Algorithmic priors for subgoal states depend on how many programs halt in those states and how long they are, and are thus specific for a given environment. The computations required to compute algorithmic priors for subgoals in a specific environment are very costly, but there are ways of alleviating this problem. First, algorithmic priors can be computed with a significantly reduced computational cost (see appendix B) in an offline way (in batch), once and for all, and can be successively reused for every inference in the same environment. Second, there are effective approximations to the exact calculations. One can uniformly sample a set of policies, to be used in the algorithmic probability formula, with a cardinality of some order less than π and (given a sufficient number of samples) this procedure approximates well the prior distributions calculated by enumerating every potential policy. The fidelity of the priors depend on the computational resources used to calculate them and can be improved trial by trial making new observations, as evident in the ‘cumulative’ nature of equation (2.1) and of equation (A 1) in appendix A.
Finding subgoals that might be ‘a priori’ good is only the first step of our method. A key distinction in the proposed approach is between potentially useful subgoals and selected subgoals, the former being a priori and independent of the agent's current goal and the latter dependent on it. To understand this difference, consider the 18states ‘fourrooms’ planning problem shown in figure 2, which is often used as a testbed in HRL scenarios [11]. Most studies have assessed that specific states such as bottlenecks (in this case, the ‘doors’ S2, S3, S16 and S17) are often useful subgoals and permit splitting the search space down into more manageable subproblems [12,13,35]. Our proposed method using equation (2.1) yields similar results, as represented by the colour gradient of figure 2 (the darker the colour, the higher the probability). In this perspective, although all the states can potentially be subgoals, some of them (e.g. bottlenecks or doors) are more likely candidates and have higher probability.
However, knowing the potentially useful subgoals is not sufficient to solve a specific problem, because the best subgoals for each specific planning problem depend on the start and goal states and not only on a priori information on the environment. Consider, for example, an agent solving the problem of planning how to go from S15 to S18 in the ‘fourrooms’ scenario of figure 2. Potentially, all ‘doors’ are useful subgoals but given the specific start and goal states the best choice would be passing through S16 and S17, not S2 or S3. The inferential procedure should thus preferentially find a plan that traverses these selected subgoals and not the others. We describe next the inference permitting to do so.
Algorithm 1: Planning inference (s_{0}, s_{goal}, p(SG), T_{max})
Require: Agent's initial state s_{0}, goal state s_{goal}, subgoal algorithmic priors p(SG), maximum number of forward inferences T_{max}.
Ensure: State sequence [s_{0, … ,}s_{goal}], subgoal sequence Seq.
t = 0
set S_{0} to the agent's initial state s_{0}
sample a subgoal sg_{0} from the prior distribution p(SG) attained by using equation (2.1) on each state
select a policy π_{0} maximizing equation (2.2) sampled through a Monte Carlo method
determine the action a_{0} depending on π_{0} and s_{0}
evaluate the termination condition state F_{0} according to p(F_{0}sg_{0}, s_{0})
while (F_{t} ≠ 2 and t ≤ T_{max}) do
t = t + 1
determine the state s_{t} by means of p(S_{t}a_{(t−1)}, s_{(t−1)})
select the subgoal sg_{t} maximizing equation (2.3) sampled through a Monte Carlo method
select a policy π_{t} maximizing equation (2.2) sampled through a Monte Carlo method
determine the action a_{t} depending on π_{t} and s_{t}
evaluate the termination condition variable F_{t} according to p(F_{t}sg_{t}, s_{t})
end while
2.3. Inferential procedure
The aim of the proposed planning as probabilistic inference scheme is inferring a plan from the agent's initial state s_{0} to its final goal state s_{goal}, which are assumed to be known (say, from S1 to S18 in the ‘fourrooms’ scenario). Normally, in HRL this would correspond to finding a suitable policy from S1 to S18. Rather, here we consider the possibility that the plan actually consists of a series of smaller plans, say, one from S1 to S16 and one from S16 to S18. In this case, the inference would result in a first subgoal (S16) along with a suitable policy permitting the transition from S1 to S16; and a second subgoal (S18), which actually corresponds to the final goal, along with a suitable policy permitting the transition from S16 to S18. Importantly, here we use considerations of algorithmic probability to select subgoals and policies; as a consequence, the plan results in series of transitions across states with highest algorithmic probability.
The inference uses the probabilistic model of figure 1 to iteratively sample (i.e. extract probabilistically) candidate subgoals from the previously described a priori subgoal distribution (see §2.2); the candidate subgoals are then retained or discarded by considering the computational complexity of the resulting subproblems or programs. When a new subgoal is selected, it becomes the starting point for a new iteration, until the final goal is eventually found. The iterative sampling procedure runs, for a maximum number of steps T_{max}, through all the problem space and returns a sequence of subgoals and associated programs that essentially solve the original problem by finding sequences of smaller subproblems (but in principle, a sample might also generate a solution without subgoals, from the start to the goal state).
The pseudocode of Algorithm 1 describes the procedure stepbystep. Essentially, using the model of figure 1, the inference uses the agent's initial state s_{0} as a clamped (i.e. observed) state on the node S at the time t = 0, and the agent's goal state s_{goal} as the state with highest prior on the subgoal node SG.
Initially, at the time t = 0, a subgoal sg_{0} is drawn over the a priori algorithmic probability distribution described in §2.2. Then, for each time t, the inference proceeds by finding out a policy π_{t} so that a program permitting a transition from s_{t} to sg_{t} can be built. Formally, this corresponds to drawing a policy from the probability distribution p(Π_{t}s_{t}, sg_{t}). Using Bayesian calculus and assuming that there is no prior information about policies (see appendix A, equation (A 2)), p(Π_{t}s_{t}, sg_{t}) can be approximated by the following expression: 2.2
According to the above equation, the probability of selecting a specific policy π_{j} depends (in addition to its prior probability) on the length of the program generated with π_{j} that from the current state s_{t} takes us to the currently examined subgoal sg_{t}, weighted by the s_{t} prior computed just over π_{j}. Drawing from the distribution defined in equation (2.2) would require to reckon every distribution value and this would be very costly and often infeasible for even moderately large state spaces. To overcome this problem, we use a Monte Carlo sampling method [36] to collect a set of policies accepted as realizations of equation (2.2) by using the uniform distribution p(Π_{t}) as a proposal distribution (note that the sampling method can be conducted using a parallel procedure). From this set we select the policy π* = argmax_{j} p(π_{j}s_{t}, sg_{t}). Note, however, that alternative methods such as heuristic techniques or tree search might be adopted to this purpose [37].
To sum up, the inference described so far starts from an initial (clamped) state and returns a subgoal and a policy to reach it. In other words, it builds an Optionlike plan onthefly that would (ideally) correspond to the shortest possible program from the initial to the subgoal state. Once this program is established, one can get the transition to the time t + 1 off the ground. The action a_{t} is directly determined for passing from the state s_{t} to the next one s_{t}_{+1}.
Simultaneously, the node F_{t} monitors (sub)goal achievement and guides the transitions SG_{t} → SG_{t}_{+1} between subgoals [33]. If s_{t} ≠ sg_{t}, the value of F_{t} is zero and the subgoal sg_{t}_{+1} is forced to be the same as of the time t. When the current state s_{t} is equal to the selected subgoal sg_{t}, the rest policy π_{ɛ} (i.e. a specific policy associating with every state a ‘rest’ action ɛ) is selected determining s_{t}_{+1} = s_{t} = sg_{t}.
Then, a new subgoal (SG_{t}_{+1}) has to be chosen. To guide subgoaling towards the final goal state (rather than, say, away from it), not only the inference ‘clamps’ the justachieved subgoal as the current state, but it also assumes that f_{t}_{+1} = 2, that is, it fictively assumes that the goal state is (has to be) observed. The procedure then continues as explained earlier, until the final goal state s_{goal} is reached (and f_{t} = 2).
All these operations are summarized by the following distribution: 2.3where δ_{ab} is 1 when a = b while is 0 otherwise, and p(SG_{t}_{+1}f_{t}_{+1} = 2, sg_{t}) is estimated by a Monte Carlo process [36] applied to the posterior probability distribution of SG_{t}_{+1}: 2.4considering that p(sg_{k}sg_{t}) estimates the probability that sg_{k} succeeds sg_{t} as subgoal and that the evaluation of the likelihood p(f_{t}_{+1} = 2sg_{k}) corresponds to computing the conditional probability p(gsg_{k}) of the goal state with respect to the subgoal sg_{t}_{+1} (for a more detailed description, see appendix A, especially equation (A 3), and appendix B). Here, the value sg* = argmax_{k} p(sg_{k}f_{t}_{+1} = 2, sg_{t}), sampled by adopting the algorithmic priors of equation (2.1) as proposal distribution, is set as subgoal at time t + 1. Note that using this procedure the probability of selecting an optimal policy increases when the state S_{t} becomes closer to the subgoal.
3. Results
3.1. Experiment 1: fourrooms scenario
A first question we ask is whether the proposed subgoalingbased method is more efficient compared to an equivalent procedure without subgoaling that searches for a solution to the problem from start to finish, and under which conditions it leads to optimal behaviour (e.g. the selection of shorter paths). To answer this question, we compare the planningasinference method with and without subgoals in the 18states ‘fourrooms’ scenario shown in figure 2. Note that even in this apparently simple scenario the number of potential policies is around seven million, making exact inference impracticable.
In the first (withoutsubgoals) strategy, the probability of choosing a subgoal different from the goal state is zero: p(SG ≠ s_{goal}) = 0. In the second (withsubgoals) strategy, the abovedefined algorithmic priors of the subgoals are used: p(SG ≠ s_{goal}) > 0. To study the generality of the method, we tested three different ensembles (of 50, 100 and 1000 of independent inferential instances) and made 10 simulation runs for each. The parameters used in the simulations are reported in table 1. In the experiment, we assume S1 as starting state and S18 as goal. We set the prior probability S18 to be the highest among all the values of p(SG) (table 1). Specifically, we set P(S18) as the highest value of p(SG) plus a small value calculated as the difference between the first and second highest values of p(SG); then we normalize all the other probabilities in p(SG) to sum up to 1.
To compare the inference with and without subgoals, we consider various factors: the number of successfully reached goals, the optimality of behaviour, expressed here as the number of inferential instances achieving the optimal planning strategy (in this scenario, a sequence of eight states, including both start and goal states), the complexity of the inference (i.e. the percentage of instances failing to arrive at the goal) and the complexity of the control (i.e. the average length of the programs necessary to achieve the task).
The differences between the percentage of successful strategies and the percentage of achieved optimal paths carried out in the withoutsubgoals and withsubgoals modalities are shown in table 2. The latter strategy achieves a better performance and the largest number of optimal paths by using subgoals to split the search space. In keeping, the programs produced by the strategy withsubgoals (dotted black line) are on average shorter than those produced by the strategy withoutsubgoals (grey line), especially in the first inferential steps (figure 3a). Here the results are relative to the programs used to plan (up to the next subgoal) at each inferential step, for N = 100 instances and averaged on 10 runs. Results are similar with 50 and 1000 instances (not shown).
The average percentage of instances that fail to find a successful planning strategy (for N = 100) is shown in figure 3b, revealing again an advantage of the strategy withsubgoals (dotted black line) over the strategy withoutsubgoals (grey line), especially in the first inferential steps. Furthermore, the fact that the percentage of failures is stable in all the steps suggests that the strategy withsubgoals is robust.
Overall, these results highlight that the proposed method using subgoals is more efficient compared to a standard planningasinference procedure that does not use subgoals and makes a more parsimonious use of resources (e.g. working memory).
We next analysed the solutions found by the proposed method using subgoals. The distribution of the number of subgoals found by the strategy withsubgoals (averaged on the different runs) is shown in figure 3c. The results show that the strategies including two subgoals before the actual goal (with a total of three subplans) are on average the most successful. This result reflects the specificity of the fourrooms scenario; for example, successful cases of navigation between S1 (start) and S18 (goal) include the subgoal sequences [S2, S3, S18] and [S16, S17, S18].
As a final remark, we note that the probability distributions of S and SG—as estimated by considering the average of 100 instances during the inference—change at every step, as shown in figures 4 and 5, respectively. In the next experiment, we ask whether these dynamic representations might have equivalents in the brain of primates executing planning tasks.
3.2. Experiment 2: neuronal correlates of planning in monkey lateral prefrontal cortex
Few neurophysiological studies have directly probed the role of goals and subgoals during planning and control at the single cell level (see [4] for a review). Perhaps the most direct test of the neuronal underpinnings of planning in prefrontal circuits comes from two monkey studies performed by Tanji and collaborators [6,7]. In these studies, monkeys performed a multistep path planning task in a maze (shown in figure 6a) and were required to prepare multiple movements of a cursor stepwise from a starting position (at the centre of the maze) to a preinstructed goal position which varied in the experimental conditions (see labels from G1 to G8 in figure 6a). These experiments revealed that during the preparatory period monkey lateral prefrontal cortex (lPFC) neurons encoded sequential representations of the path plans [7] and that these prefrontal representations are specific for goals and not motor actions (i.e. the position to be reached and not the cursor movement to be executed), in contrast with representations in the primary motor cortex that encoded arm movements [6].
Most importantly for our study, the experiment reported in [7] identified two neuronal populations in the lPFC that are crucial for preparation of path planning: one in which neurons specifically coded the position within the maze that represented the final path goal independent of the steps required to achieve it (called final goalselective neural activity) and one in which neurons specifically coded the position within the maze to which the animal intended to move the cursor next, independent of the movement required or the specific maze configuration (called immediate goalselective neural activity). The aim of our modelling study is identifying whether the dynamics of two key variables used in our model, P(SG) and P(S), might correspond to the activity of final and immediate goalselective neurons in the monkey study, respectively, thus providing a computationally motivated explanation of what these neuronal populations might encode during the planning task.
We mapped the maze used in the monkey experiment of [7] (figure 6a) into a discrete domain that includes 21 states and four actions (giving approx. 10^{9} possible policies), and we calculated the algorithmic prior of each state using the aforementioned method; the results are shown in figure 6b.
We next conducted a simulated experiment on a sample trial (equivalent to one of those studied in [7]), which consists of running an inference from a starting position (S11) to a known goal location (S3) (figure 7). The prior probability S3 is set to be the highest value (table 1) and all the other probabilities in p(SG) are normalized to sum up to 1. Results are for 10 execution runs using 100 instances with the parameters reported in table 1. Note that this example is chosen for illustrative purposes but the results reported below generalize to all the startgoal pairs of the maze used in the original monkey experiment.
The former panel of figure 7 shows the prior SG once the goal (S3) is known; the other panels show how the distribution p(SG_{t}) changes during the probabilistic pathplanning inference. It is possible to appreciate that during the inference the final goal position S3 is tonically maintained in a way that closely mimics the final goalselective neural activity reported by Saito et al. [7]. Importantly, while in the first panel the final goal S3 is clamped, in the successive steps it is not, but it results from sampling over the p(SG_{t}) distribution (e.g. p(SG_{t}_{=2}s_{start} = S11, s_{goal} = S3) in the second inferential step (see §2.3 for details). The fact that the probabilistic inference of SG_{t} in our proposed algorithm predicts the final goalselective neural population found in monkeys lPFC [7] suggests that these neurons might be a signature of an ongoing probabilistic inference [38,39] rather than only playing a role in working memory, as usually assumed. This result provides a novel perspective on how prefrontal goal representations might support inference and control, which is in line with other evidence that goalcoding neurons peak before an action is made but are also maintained tonically after action onset [4,40].
The probability distribution of the states S_{t}_{=1} − S_{t}_{=4} in the same path planning example as before (i.e. from S11 to S3) is shown in figure 8. It is possible to appreciate that these probability distributions closely mimic the immediate goalselective neural activity reported in [7], with a preferential coding of the next spatial position in the maze required to reach the goal. This suggests that the activity of immediate goalselective neurons codes (predicted) state representations; these prospective representations might complement the activity of final goalselective neurons in a coherent probabilistic pathplanning inference [41]. Importantly, these state representations maintained by lPFC are not just ‘reachable’ future states but compose optimal (i.e. shorter) plans, in keeping with previous theoretical proposals [42]. Supporting this idea is the fact that the probability distributions found by our method preferentially select optimal plans. For example, in the second step (figure 8b) the probability of S6 is higher than S12. This fact can be easily justified by making simple probabilistic considerations: the probability P(S3S7, S6) that S3 can be reached through the path [S11, S6, S7, S3] is greater than the probability P(S3S7, S12). Indeed, there are two optimal paths passing through S6 while there is only one through S12, which would imply the path [S11, S12, S7, S3] (and thus visiting S6) increases the probability of finding an optimal path to the desired goal.
Overall, this second study shows that the two distributions p(SG) and p(S) used in our probabilistic inference method can be fruitfully linked to two distinctive (final and immediate) goalselective neural populations found in monkeys lPFC, and provide a novel interpretation of what these populations might code during the task (i.e. subgoal distributions and states that compose optimal plans) and how these neuronal responses might be produced (i.e. through probabilistic inference). Note that the inference used to compute p(SG) and p(S) is not conditionally dependent on action, which implies that these populations are goal and not movementrelated, as reported in the target monkey study [7].
4. Discussion
The benefits of subgoaling have long been recognized by many disciplines that include psychology, neuroscience and computational modelling, but its computational and neuronal principles are incompletely known. From a normative perspective, we proposed that subgoaling permits planning solutions and control behaviour using less information by selecting more compact programs. To implement this idea, we devised a probabilistic inference method that combines planning as inference with informationtheoretic measures based on Solomonoff's Algorithmic Probability and Kolmogorov Complexity (KC), the latter being widely used in AI [19,23] and closely related to variational probabilistic methods that minimize free energy [30,43]. Essentially, our approach formalizes the ‘Occam's razor’ principle: a priori, among the strings (here, programs) that represent the procedures returning the same output, ‘simpler’ strings (i.e. strings with low descriptive complexity) are more probable and should more likely be selected by the probabilistic inference.
There is increased attention in theoretical neuroscience to the idea that brain computations, and in particular state inference and planning, can be understood in terms of probabilistic inference; still, the required computations are computationally expensive and it is unclear how the brain might solve or approximate them [27,30,33,38]. Within this framework, we have demonstrated that parsimony principles derived from normative (informationtheoretical) considerations can guide efficient subgoaling and significantly lower the descriptive complexity of inferential procedures underlying planning and problem solving (Experiment 1). We have also linked aspects of the proposed method to planning dynamics found in monkey lPFC (Experiment 2). The proposed computational framework thus offers a mechanistic explanation of subgoaling in planning and problem solving and suggests parallels with neurobiological findings, which remain to be tested in future studies.
The subgoaling approach proposed here shares some similarities with ‘task partitioning’ strategies that have been reported in insect societies [44–46]. In insect societies, a task is split into smaller tasks that are carried out in a distributed manner by several individuals (e.g. insects), which can collectively produce optimal solutions. The similarities between the collective decisionmaking strategies of insect societies and of neurons in the brain have often been noticed [47,48]; understanding the benefits of distributed schemes for subgoaling is an open interrogative for future research.
The proposed method represents a novel approach to planning and subgoaling that can also inform technical fields such as AI/HRL. In HRL, temporal abstraction methods such as Options are widely used but they are normally modelled as structures that, once learned, are retrieved in an allornothing way. On the contrary, our method permits building Optionslike constructs on the fly using probabilistic inference, which yields increased flexibility. Note that the two approaches are not mutually exclusive but can be combined. The method presented here can be easily extended to model the acquisition of stable or ‘cached’ skills or Options; it is sufficient to allow the system to store the results of previous inferences (e.g. in the form of priors on policies) and reuse them in similar future inferences. Running inferences based on such prior information would generate a wellknown dilemma between faster but less flexible (habitual) and slower but more flexible (goaldirected) selection mechanisms [26,49–53].
Although we have focused on the importance of subgoaling for planning, the same computational principles might operate at the three different timescales of planning, control and learning. These three problems are traditionally studied in isolation but are closely related in probabilistic schemes and might use common optimization principles, including the proposed subgoalbased, Divide et Impera (divide and conquer) strategy. During planning, subgoals (or Options in HRL) reduce the search space by permitting planning at a higher level of temporal abstraction (i.e. at the level of macroactions and not only basic actions). During control, subgoals permit maintaining the smallest possible information in working memory that is sufficient for successful task achievement [35]. During learning, subgoals (and associated Options) permit more efficient learning, reducing the search space, which also helps in the transfer of knowledge and skills to novel domains [12].
An open research question is whether goals and subgoals might support efficient coding and information compression in cortical hierarchies. The rationale of this idea is that, in the same way perceptual coding might be based on minimum description length and efficient coding principles [54], information compression principles might explain how the brain efficiently codes control programs [55]. In this perspective, goals and subgoals might ensure that information is compressed in a way that is more useful to solve problems efficiently and thus bias which control programs (e.g. Options) should be learned. It remains to be tested in future studies whether plans that are initially formed using probabilistic inference and subgoaling might be successively stored in prefrontal hierarchies, yielding an efficient coding of controlrelated information.
Funding statement
Research funded by the EU's FP7 under grant agreement no FP7ICT270108 and the HFSP under grant agreement RGY0088/2014. The GEFORCE Titan used for this research was donated by the NVIDIA Corporation.
Appendix A. Algorithmic probability, Bayesian inference and subgoaling
As usually in computer science, transition sequences can be considered as series of instructions that let a computing machine return an output for some input, provided that the execution halts. In the spirit of Algorithmic Probability Theory introduced by Solomonoff [23,34], we assume that (in a discrete state space) it is possible to determine a set of executable instructions by considering a starting state s_{i}, an arrival state s_{i}_{′} and a policy π_{j}. The policy is represented as a stateaction table encoding a list of actions that, by starting from s_{i}, will get to s_{i}_{′}. In algorithmic terms, the triple (s_{i}, π_{j}, s_{i}_{′}) defines a program for the state s_{i}_{′}. This correspondence is not biunique because the same program can be determined by different policies.
The next step consists of encoding the program in a computable format. A simple and efficient way to do so is using the bitwise encoding. By mapping the program (s_{i}, π_{j}, s_{i}_{′}) into a binary string with length , we can algorithmically assign to it an a priori probability equal to . This probability goes to 0 in the limit of (e.g. if a program does not halt in s_{i}_{′} or it does not halt at all) and is equal to 1 when (i.e. when initial and arrival states coincide). In our representation, this last condition can happen if and only if there exists a policy working as an ‘identity function’, namely having the transition s_{i} → s_{i}, ∀i (that corresponds to the policy π_{ɛ}, in our approach).
By considering the set of all the programs returning the state s_{i}_{′}, the a priori algorithmic probability of s_{i}_{′} can be defined as A 1when the aboveequation is applied on every state, it generates an a priori probability distribution, depending on the specific domain, in which the subgoal states have the highest probability values.
This algorithmic method to define probabilities can be used to determine the conditional probability distribution commonly involved in Bayesian inference. For example, let us consider equation (2.2) of §2.3: we can obtain the expressions for p(sg_{t}s_{i}, π_{j)} by setting s_{t} and π_{j} in equation (A 1); and the expression for p(s_{t}π_{j}) by setting s_{t} in equation (A 1). Therefore, equation (2.2) can be rewritten in an algorithmic form as A 2with Z as a normalization constant. According to the above equation, the probability of a specific policy π_{j}, given s_{t} and sg_{t}, is proportional to its aptitude to generate programs of minimum length which, starting from every possible state s_{i}, permit movement to the currently examined subgoal sg_{t} by passing through the current state s_{t}.
Analogously, one can determine the algorithmic expression of p(SG_{t}_{+1} = sg_{k}f_{t}_{+1} = 2, sg_{t}) in equation (2.4) by constraining all the programs returning the goal to have sg_{k} as an intermediate outcome. Equation (2.4) thus becomes A 3where Z′ is a normalization constant.
It is worth noting that equation (A 3) is equivalent, up to the normalization factor, to p(s_{goal}sg_{t}) when sg_{k} = s_{goal} or sg_{k} = sg_{t}. Besides, it holds 1 when s_{goal} _{=} sg_{t}. This is a direct consequence of the fact that, when initial and final states coincide, the only policy able to determine a halting program of length zero is the ‘rest policy’ π_{ɛ} with probability equal to 1. Rather, when the rest policy is fixed, the probability of a program is different from zero if and only if the starting and final states are the same.
Evaluating equation (A 1) or (A 3) is computationally burdening as it depends on the number m of policies, which is usually prohibitive also for a space with few states. However, it is possible to introduce various approximations and heuristics, such as that presented in appendix B, to render their evaluation computationally treatable.
Appendix B. Heuristic procedure for estimating algorithmic conditional probabilities
Equations (A 1) and (A 3) have a considerable computational cost, because they are evaluated over the whole set of the policies. However, it is possible to use heuristics to decrease the cost of calculating the conditional probabilities used in the two equations.
The state space of the problem can be arranged as a graph having the states s_{i} as the n vertices and the transitions e between states as the edges. The graph can be represented as an adjacency list: a collection of unordered lists, one for each state, composed of the neighbours of the related vertex. Given two states, s_{i} and s_{i′}, we exploit a standard depthfirst search algorithm for figuring out all the paths between them. Every path c corresponds intuitively to a distinct program , but each program can be attained by different policies.
As a consequence, various policies give the same contribution to the algorithmic probability computation. Their number can be estimated by considering every possible combination of the neighbours of the states not involved in the specified path c (and, thus, not present in the program ). We call this value multiplicity of the program and we indicate it as μ().
Therefore, the probability p(s_{i′}s_{i}) can be written in the following form: B 1which makes the conditional probability estimation affordable with a computational complexity of O(n + e), equal to the depthfirst search complexity. An analogous procedure can be used to calculate p(sg_{k}sg_{t}) and p(gsg_{k}).
 Received December 5, 2014.
 Accepted January 13, 2015.
 © 2015 The Author(s) Published by the Royal Society. All rights reserved.