## Abstract

Feather pecking in laying hens is a major welfare and production problem for commercial egg producers, resulting in mortality, loss of production as well as welfare issues for the damaged birds. Damaging outbreaks of feather pecking are currently impossible to control, despite a number of proposed interventions. However, the ability to predict feather damage in advance would be a valuable research tool for identifying which management or environmental factors could be the most effective interventions at different ages. This paper proposes a framework for forecasting the damage caused by injurious pecking based on automated image processing and statistical analysis. By frame-by-frame analysis of video recordings of laying hen flocks, optical flow measures are calculated as indicators of the movement of the birds. From the optical flow datasets, measures of disturbance are extracted using hidden Markov models. Based on these disturbance measures and age-related variables, the levels of feather damage in flocks in future weeks is predicted. Applying the proposed method to real-world datasets, it is shown that the disturbance measures offer improved predictive values for feather damage thus enabling an identification of flocks with probable prevalence of damage and injury later in lay.

## 1. Introduction

A major welfare problem in commercial egg production is that of injurious feather pecking, where the hens peck each other's feathers and skin, leading to feather damage, injury and even cannibalism (Rodenburg *et al.* 2004; Dixon 2008). The problem is worse in barn and free-range systems than in cages (Blokhuis *et al.* 2007; Fossum *et al.* 2009) and so is likely to become even more widespread with the European Union ban on barren cages in 2012. Feather pecking increases feed consumption owing to heat loss (Tauson & Svensson 1980), reduces egg production (El-Lethey *et al.* 2000), causes pain and suffering of the injured birds (Gentle & Hunter 1991) and increases bird mortality, including cannibalism (Huber-Eicher & Sebö 2001). Despite a great deal of research it is currently not possible to prevent feather pecking altogether (Rodenburg *et al.* 2004; Dixon 2008). The main method of controlling it is to blunt the birds' beaks to reduce the damage they can do to each other. This blunting is usually done when they are day-olds by use of hot-blade or infra-red beak treatment (Dennis *et al.* 2009), but it raises welfare problems in its own right (Gentle *et al.* 1990). A major research priority is therefore to eliminate feather pecking without resorting to beak trimming.

An important step forward would be an ability to predict outbreaks of feather pecking in advance and to identify those flocks that were at particular risk of damage (Drake *et al.* in press). A large number of interventions such as changing the litter, changing the lighting, changing the methods of providing food and water, changing the times of feeding have been proposed, although none are currently completely effective (Rodenburg *et al.* 2004; Dixon 2008). Some of these would cause considerable expense and so it is important to know those or which combinations are most effective and to target the most at-risk flocks. Being able to analyse the effects of such interventions, or other management practices, at an early stage will help to identify key factors in this problem.

This paper proposes a novel method for predicting feather damage based on visual processing and statistical analysis of flock behaviour taken from video recordings. We show that a disturbance profile can be generated for a flock based on optical flow measures of the movements of the birds (Dawkins *et al.* 2009) and the use of hidden Markov models (HMMs; Rabiner 1989). We then show that this disturbance profile, together with other measures such as the age of the birds, has predictive power as to which flocks will develop severe feather damage in the future.

## 2. Data collection

The procedure is outlined in figure 1. We used video recordings taken inside commercial free-range laying houses for our basic data.

### 2.1. Video recordings and laying flocks

The recordings were made as part of a larger study into the causes of feather pecking in laying hens (Drake *et al.* in press), in which approximately 335 000 commercially reared laying hens were followed from 12 different rearing farms onto 19 laying farms (44 different houses) between February 2006 and August 2008. Laying flock size varied between 800 and 4000 birds. All flocks were visited on at least four occasions: at end of rear (less than 17 weeks, soon after transfer to the laying house (18–22 weeks), peak lay (23–30 weeks) and towards clearance (approx. 50 weeks). A variety of management factors (e.g. feeder and drinker type, stocking density, litter type, range size, etc) and environmental factors (e.g. levels of light, sound, CO_{2}, ammonia, litter pH, etc) were recorded, as were key production variables (mortality, per cent flock in lay and food consumption (g/h/d)) and feather scores (see below). Details of methods and results are given in Drake *et al.* (in press). The study included both organic free-range and free-range houses and five different breeds (Hyline, Lohman (Brown and Traditional), Shaver, Goldline and Columbian Black Tail). All but two of the houses contained birds that had been beak-treated as day-olds. Video recordings were made at bird level in each house to record behaviour, but these were found not to be suitable for optical flow analysis, so during the second half of the study, covering 13 separate farms and 18 individual houses, additional overhead video recordings were made.

Digital video camcorders (Canon MV940) were affixed inside houses and used to record 1 h video recordings of hens' behaviour within the houses. The recordings were taken of 18 flocks at various ages on separate days, giving a total of 32 useable video files, with each flock recorded at least once and in most cases twice. The files were also subsequently converted to 8-bit greyscale, with four frames-per-second (FPS) and a frame size of 320 × 240 pixels. Between 20 and 100 birds were visible in most frames.

### 2.2. Optical flows

Most automated visual processing techniques keep track of individual animals, which is very difficult for a large number of animals without marking individuals and even then computationally prohibitive. Instead, we extract motion from the hens as a whole group using the optical flow algorithm (Beauchemin & Barron 1995). An optical flow is an approximation to apparent velocities of image motion and can be used to describe the mass movement of a group of objects.

Consider a video file that consists of *T* + 1 image frames each of 320 × 240 pixels. Each image is divided into 1200 (=40 × 30) 8-by-8 pixel blocks. The optical flow algorithm estimates, for each block, a local velocity vector derived by analysis of the frame-by-frame changes between two consecutive image frames at time *t* and *t* + 1. The velocity vector contains two elements, horizontal and vertical, i.e. [*H*_{t,ℓ}, *V*_{t,ℓ}], for frames at time *t* = 1, … , *T*, and blocks ℓ = 1, … , 1200. From the velocity vectors, the amount of movement for each block was obtained as the magnitude of the velocity, . Then, as a snapshot of global movement at time *t*, the spatial mean, , was sequentially calculated for each frame in the video file, resulting in a time-series of length *T*.

### 2.3. Expert labelling of disturbances

As HMMs are used for identifying disturbance periods in the optical flow datasets there is a need of validation against expert opinions. To evalue the efficacy of the HMMs to give results that correspond to what a human observer might notice, the visual records used for the optical flow analysis were independently viewed by an expert human observer (K.A.D.) who was familiar with the hen behaviour. While this was not an essential part of validating the method (the method might be more sensitive than any human), correlations between the two methods would serve to link the statistical analysis to an understandable measure in the real world and to give an indication of what the model was detecting. The human observer viewed the tapes and then scored a variety of different types of disturbances, such as running, birds pecking each other or several birds suddenly turning in on the same spot. She viewed the tapes without being given any information about age or identity of the flocks. She identified 19 disturbances in 32 video sequences and these same sequences were then presented to the model.

### 2.4. Scoring feather damage

Levels of feather damage were scored for the flocks using the method described by Bright *et al.* (2006), which is summarized in table 1. An observer entered the laying house and visually scored 100 birds inside the laying house and 100 birds on the range outside, using a random number grid to select the birds. The feather scores for the five different body regions (neck, back, rump, tail and wing) were then averaged over all 100 birds to give a single mean feather score for that flock for that day. Further details are given in Drake *et al.* (in press). A total of 198 feather scores were recorded, but where houses were divided into colonies (physical barriers between different parts of the same house), the colonies within a house were not considered to be independent, so that the scores for all the colonies within a house were combined, giving a total of 18 (houses) × 4 (visits) = 72 feather scores.

## 3. Extraction of disturbance measures

As we are primarily trying to detect periods of disturbance, which are manifested as state changes in an optical flow time-series, we use a HMM to identify disturbance periods within the dataset. From the state transition identification by the HMM, a set of disturbance measures are obtained for each flock.

### 3.1. Hidden Markov models

HMMs (Rabiner 1989) have been widely used for inferring sequential structures of data in various areas such as speech recognition and bioinformatics. Consider a sequence of observations, **Y** = { *y*_{1}, *y*_{2}, … , *y*_{T}}, where . It is assumed that the probability densities of the observations are determined by the corresponding discrete *hidden* states, **S** = {*s*_{1}, … , *s*_{T}}, where *s*_{t} ∈ {1, … , *J*}. Only the observations **Y** are available to us, but not the hidden states **S**. Analogously, in the application of disturbance identification for laying hens, we can observe optical flows, but not the underlying (that is, hidden) states of hens' behaviour, e.g. normal or disturbed. HMMs are hence used to infer a sequence of likely hidden states based on the observed optical flows.

Figure 2 depicts the dependency between the variables as a graphical model. First of all, a hidden state at time *t* = 1, *s*_{1}, is determined by an initial state vector, *π* = [*π*_{1}, … ,*π*_{J}]^{T}, where *π*_{j} = *p*(*s*_{1} = *j*). At time *t*, given a hidden state at the previous time step *s*_{t−1}, the current hidden state *s*_{t} depends only on *s*_{t−1} and not on those at time *t* − 2 and before, i.e. the state transitions follow a *Markov* process: *p*(*s*_{t}|*s*_{t−1}, … , *s*_{1}) = *p*(*s*_{t}|*s*_{t−1}). By this (first-order) Markov property, the sequential dependency in the data is modelled. The state transition is determined by a transition matrix, **A**, where *A*_{ij} = *p*(*s*_{t} = *j*|*s*_{t−1} = *i*). In this paper, the probability density of an observation, given a hidden state, is taken to be a finite mixture of Gaussians (MoG) as this offers the possibility of modelling arbitrarily complex multi-modal distributions (see Bishop 2006 for example). Taking an MoG emission model introduces another set of hidden variables, **Z** = {*z*_{1}, … , *z*_{T}}, where *z*_{t} ∈ {1, … , *K*}, an indicator determining which mixture emission component is most responsible for observed *y*_{t}. Hence, the emission model has the following probability density for component *k*:
3.1
where *μ*_{j} is a mean value for the *j*th state and *λ*_{jk} is a precision (or inverse variance) value for the *j*th state and the *k*th component. Periods of distinctive behaviour of the laying hens can hence be identified using the HMM, as we interpret the variable *s*_{t} as dictating which state the observation *y*_{t} is generated from, namely normal or disturbance. Moreover, an MoG emission enables us to model *heteroscedasticity*, i.e. to identify observation sequences with differing variance structures, resulting in a model which is more robust against outliers, thus avoiding spurious state transitions.

A HMM with an MoG emission has the following parameters: an initial state vector *π*, a state transition matrix **A**, a mixture component matrix **C** = [*p*(*z*_{t} = *k*|*s*_{t} = *j*)], a set of mean values * μ* = {

*μ*_{j}} and a set of precision values

*= {*

*Λ**λ*

_{jk}}. Finding a solution for an HMM involves estimation of the parameters,

*= {*

*θ**,*

*π***A**,

**C**,

*,*

*μ**}, as well as inference of the posterior distributions of the hidden variables,*

*Λ**p*(

**S**,

**Z**|

**Y**). A two-stage algorithm is usually adopted, known as the Baum–Welch algorithm (Baum

*et al.*1970), which itself is a special case of the expectation–maximization (EM) algorithm (Dempster

*et al.*1977). However, the conventional EM algorithm has several limitations, most notably its inability to incorporate prior knowledge, its tendency to overfit and its difficulty in model selection.

In order to overcome these limitations, the variational Bayesian (VB) approach (Jordan *et al.* 1999; Beal & Ghahramani 2003) is adopted in this paper. Given a model order ℳ = (*J*, *K*), the log marginal likelihood, log *p*(**Y**|ℳ), is approximated by maximizing a lower bound, which is referred to as the *variational free energy* after its original usage in the statistical physics literature^{1}. This quantity is maximized by the VB-EM algorithm where the VB-E and VB-M steps are alternately iterated until convergence. The maximization is equivalent to approximating the joint posterior of the parameters and the hidden variables. In Bayesian model selection, one selects the model order with the highest model posterior, *p*(ℳ|**Y**) ∝ *p*(**Y**|ℳ)*p*(ℳ). Assuming a flat prior over the model structure, i.e. *p*(ℳ) = const., leads to selecting the model order with the highest (negative) free energy.

Periods of distinctive behaviour can be identified by decoding the most probable path, {*s**_{1}, … , *s**_{T}}, resulting in a partition *G*_{j} = {*t*|*s**_{t} = *j*}. The decoding is essentially equivalent to the Viterbi algorithm (Rabiner 1989) with the only difference being that the ‘submarginalized’ parameter and likelihood are used. The VB learning algorithm for HMMs is presented in appendix 6. For more details, see Ji *et al.* (2006) and Roberts & Penny (2002).

### 3.2. State identification details

The optical flow datasets were log-transformed and a moving median filter was applied over the previous five samples to remove very short-term transients, lasting for one or two time steps. The success of the final predictive model is not sensitive to inclusion of this filter; its inclusion, however, is observed to improve the rate of convergence of the HMM. The data were normalized to have a zero mean and a unit variance, before the application of the HMM. Each frame was classified as normal or disturbance depending on which state it belonged to. In particular, since we know that disturbances are characterized by periods of higher activity, each frame was classified as normal if it was associated with the lowest mean activity hidden state and as a disturbance otherwise.

It is possible, of course, to apply supervised learning, e.g. binary classification, to the disturbance identification. However, this requires a significant amount of labelled data from both ‘classes’, normal and disturbance. For typical datasets, only a small fraction of data belong to the disturbance class, resulting in severe class imbalance. Moreover, labelled data are, in practice, often difficult, expensive and time-consuming to obtain, as they require the intensive efforts of human experts. We, therefore, focus on the development of unsupervised methodologies in this paper.

Figure 3 shows a typical example of an identification result on a section of data from a 27-week-old flock. The optical flows are close to zero (the global mean over all data) on the whole, but there is a clear increase, around *t* = 600. Bayesian model selection led to a model with two hidden states, *J* = 2, which are interpreted as normal and disturbance, and three components for the mixture model, *K* = 3. The HMM identified the disturbance period at the same location as the expert labelling. Although there are other very visible spikes in the optical flow, e.g. at *t* ≈ 530 and *t* ≈ 900, the HMM raised no false alarms, as transients in the optical flow were assigned to a high-variance non-disturbance state.

The identification results on 19 such sections of data comprising 19 000 normal and disturbance events were compared with the expert labelling. Table 2 presents the resultant contingency table, where both the true positive and true negative rates were around 95 per cent. A *χ*^{2}-test was carried out, using bootstrap undersampling so as to alleviate the severe class imbalance, with the null hypothesis that the cells in the table are evenly distributed. It was resoundingly rejected with a *p*-value of almost zero. We can conclude therefore that disturbance periods can be accurately identified by the methodology we present in this paper. As full state probabilities are returned by the model it is also possible to adjust the trade-offs between false positives and false negatives if there are unequal costs associated with them. In this paper, however, we take the cost associated with all misclassifications to be the same.

### 3.3. Disturbance measures

The HMM algorithm detailed in previous subsections was applied to each full-length recording with the model order (*J*, *K*) = (3, 3), which was determined by Bayesian model selection using the variational free energy over all datasets. Using the same model structure for all the datasets allows for easier collation and comparison of the results and avoids the computation required to evaluate a full set of models for each and every dataset. An example result is illustrated in figure 4. State 1 is designated as normal, while the other two as disturbances. For the *j*th state, for example, the mean value and its total duration are denoted by *μ*_{j} and *τ*_{j}, respectively. We derived three simple measures representing the disturbance profile as shown in table 3.

## 4. Prediction of future feather damage

Our goal is to predict which flocks will develop severe feather damage. As described in §3, we derived a set of disturbance measures, which we will denote by the three-dimensional vector **d**. Another relevant and readily available piece of information is the age of each flock, *a*, when the video was taken. We define the interval, *Δ*, as being the time interval between the dates of video recording and future feather scoring. All variables, **d**, *a*, *Δ*) are used for predicting future feather scores. In other words, given the disturbance measures and the ‘video age’, a feather score will be predicted *Δ* days ahead. This section describes the prediction models employed and then presents the prediction results.

### 4.1. Prediction models

Consider a dataset of input–output instances, (**X**, **f**) = {(**x**_{n}, *f*_{n})}, *n* = 1, … , *N*. An input vector consists of the disturbance measure **d**, the video age *a*, and the time interval *Δ*, i.e. . An output value corresponds to a future feather score. The number of instances available is 198, i.e. *N* = 198. We here consider two prediction models, one linear and the other nonlinear.

Linear regression (e.g. Johnson & Wichern 1998) is one of the most popular regression models. Given an input vector **x***, the prediction is given by
4.1
where *b*_{0} and **b** = [*b*_{1}, … , *b*_{d}]^{T} are *d* + 1 parameters to be estimated from the training data.

Gaussian processes (GPs; Rasmussen & Williams 2006) perform Bayesian inference by assuming a Gaussian distribution over the infinite numbers of possible outputs of the function to be estimated. A GP is specified by a mean vector, **m**, and a covariance matrix, **K**, over the function outputs:
4.2
where the covariance matrix is defined using a covariance function, i.e. *K*_{nn′} = *k*(**x**_{n}, **x**_{n′}). In this paper, we adopt the squared exponential covariance function with additive independent output noise. The squared exponential covariance function offers a canonical model which is ideally suited to interpolating smooth functions with sparse data. The reader is pointed to, e.g. Rasmussen & Williams (2006) for full discussion regarding the choice of covariance function.
4.3
where the hyperparameters *σ*_{in} and *σ*_{out} are the input and output length-scales, respectively. In the second term, *σ*_{noise}^{2} is the output noise variance and * δ*(

**x**,

**x**′) is a Kronecker delta, which is one for

**x**=

**x**′ and zero otherwise.

When a new (test) input vector, **x***, is given, the joint prior distribution of the training outputs, **f**, and the test output, *f**, is
4.4
where **k***≡ [*k*(**x***, **x**_{1}), … , *k*(**x***, **x**_{N})]^{T}. We can hence easily obtain the predictive distribution as a conditional Gaussian distribution:
4.5
where the mean and variance are expressed as follows:
4.6
and
4.7

The mean value, *m**, is usually taken as the point prediction, i.e. *f̂** = *m**.

Both models contain parameters or hyperparameters to be optimized. Their performance should be evaluated on a sample independent of the training data to avoid overfitting. As only 198 instances were available, leave-one-out validation was adopted to estimate performance on out-of-sample data. Specifically, a single instance from the dataset was assigned as the test data, and the remaining instances as the training data. A model is then trained using the training data and evaluated on the test data. This procedure is repeated for all possible choices of the test data.

In addition, since a feather score is a strictly positive quantity, each output value, *f*_{n}, was transformed to *f̃*_{n} = log(*f*_{n} + 1). A prediction *m̃** was made for *f̃** and then transformed to *m** = exp(*m̃**) − 1. For this reason, the error bars on the figures in the next subsection are asymmetric.

### 4.2. Prediction results

Feather scores depend on the birds' ages when the scores were recorded, since they tend to increase monotonically (Bright 2009; Drake *et al.* in press). Figure 5 shows the GP fitting result based only on the age (=*a* + *Δ*). The relationship is slightly nonlinear although the prediction by linear regression (in a log scale) was not significantly different. Note that uncertainty increases, as expected, as the hens grow older. While this result shows an obvious relationship between ages and feather scores, better performance could be obtained using the disturbance measures in conjunction with age information.

Figure 6 shows boxplots of absolute errors made by linear regression and Gaussian process regressors. For each model, two sets of input variables were compared: with age variables only (*a* and *Δ*) and with both the age variables and the disturbance measures. Linear regression performed slightly better with the disturbance measures, although the improvement was statistically insignificant according to a Brown–Forsythe test (Brown & Forsythe 1974). On the other hand, the GP was considerably better with the disturbance measures included. The variances of the errors were significantly different between the two sets of input variables with a *p*-value of less than 1 per cent (using the Brown–Forsythe test).

Table 4 lists the correlation coefficients between the true and the predicted feather scores with respect to the input variables used. Respectable performance was observed when the age variables were used. However, the best results were obtained using both sets of variables. The correlation coefficient of 0.8710 from the GP was higher than any other values with a *p*-value of less than 1 per cent according to a test based on Fisher's *Z*-transformation (e.g. Meng *et al.* 1992). Linear regression did not perform much better with the disturbance measures included, implying that the relationship between the disturbance measures and the (log) feather scores is nonlinear. Figure 7*a* shows an example plot of GP out-of-sample prediction for a 10-week-old flock as a function of prediction time. Figure 7*b* shows, for comparison, the GP output using age only as a predictive measure for future feather score.

## 5. Discussion

Our results suggest that flocks at greatest risk of later developing severe feather damage can be recognized several weeks earlier using optical flow analysis of flock behaviour. Such prediction would be most valuable if it could be used to trigger the use of measures to prevent or control feather pecking, which is currently very difficult to do, especially if birds are not beak trimmed. However, being able to predict which flocks are most at risk might at least allow control measures such as lowered light levels or additional environmental enrichment to be targeted at these at-risk flocks, rather than used indiscriminately on all flocks—owing to cost or because some, such as straw bales or vegetation, are a biosecurity risk. Furthermore, detecting the early warning signs of severe feather pecking will be a valuable research tool in the search for more ethically acceptable ways of preventing and controlling feather pecking. As a caveat, it should be noted that the proposed method was used for predicting feather scores, not feather pecking itself. Feather pecking and feather scores, though closely related to each other, are not necessarily equivalent.

Optical flow measurements are now increasingly used for a range of applications such as monitoring traffic flow (Bellomo *et al.* 2009), the movement of glaciers (Giles *et al.* 2009), cell and sperm motility (Shi *et al.* 2008), the study of human crowds (Courty & Corpetti 2007) and for studying groups of non-human animals (Bremond *et al.* 2006). Optical flow can be used both to track individuals and to recognize specific behaviour patterns (Johansson & Helbing 2008), but its great advantage is that it can also be used to analyse the mass movements of whole groups such as herds, flocks or schools, where tracking individual animals is difficult or impossible. Whole frames or sections of frames, containing tens or hundreds of individuals can be assessed together and the basic statistical properties of these flow patterns can be derived automatically and continuously (Sonka *et al.* 1999) using algorithms simple enough to deliver results continuously in real time. The flow patterns can of course potentially be analysed in many different ways and the key issue then becomes to find the appropriate statistical analysis that detects patterns of biological significance. In this particular case, we have shown that statistical anomalies detected by using HMM appear to be of significance to the future welfare of laying hens in that they appear to predict future feather damage before it has actually occurred. At risk flocks show signs of behavioural disturbance that this method is able to detect, but the method described here is only one of a range of possible models which have yet to be explored such as an HMM with an autoregressive (AR; Penny & Roberts 1999), or a generalized AR emission model (Roberts & Penny 2002) or even, a HMM with an infinite number of states (Beal *et al.* 2002). We suggest that similar approaches might also be useful for early detection of other signs of poor welfare such as tail-biting in pigs or even subclinical signs of disease. Current research is investigating whether the spatial distribution of optical flow, on the image frames, improves the analysis. It also remains to be seen whether, on short timescales, these measures offer predictive information regarding the occurrence or prevalence of disturbance events.

All the methods considered in this paper were implemented using the Matlab programming language. Code for performing generic modelling using Bayesian dynamical models may be found under http://www.robots.ox.ac.uk/~parg/software.html.

## 6. Conclusions

This paper addresses the problem of predicting feather damage of laying hen flocks based on visual processing and statistical analysis. From video recordings, optical flows were extracted as indicators of the hens' movements. HMM were then applied to the optical flow datasets to identify disturbance periods and to derive the disturbance measures. Using these disturbance measures and age-related variables, feather scores were predicted via linear regression and nonlinear regression using Gaussian processes. It was shown that the Gaussian process model performed well and that the disturbance measures provided predictive information regarding future feather scores.

## Acknowledgements

The work was carried out as part of the BBSRC Animal Welfare Initiative (grant no. BBC5189491) and we would like to thank the BBRSC for supporting H.j.L. The data were collected with support from Defra (contract no. AW1134) and we would like to thank them for supporting K.A.D.

## Appendix A. Variational Bayesian HMM

This section describes the VB learning algorithm for a HMM. First of all, the priors are defined as *p*(*θ*) = *p*(*π*)*p*(**A**)*p*(**C**)*p*(*μ*)*p*(**Λ**), each of which is to be conjugate, that is, Dirichlet for

*π*,

**A**and

**C**, Gaussian for

*μ*, and Gamma for

*as follows: A1 A2 A3 A4 A5 where*

*Λ***u**

^{π},

**u**

^{A},

**u**

^{C}, {

*w*

_{j}}, {

*r*

_{j}}, {

*ν*

_{jk}} and {

*v*

_{jk}} are hyperparameters, and 𝒟(·), 𝒩(·), 𝒢(·) denote the Dirichlet, Gaussian and Gamma distributions, respectively. The density functions and other relevant quantities of the distributions are given in appendix A.5.

A lower bound of a log marginal likelihood can be obtained using Jensen's inequality: A6

The last term, *F*(*q*(**S**, **Z**, *θ*)), is called the variational free energy. It is easy to show that log *p*(**Y**) = *F*(*q*(**S**, **Z**, *θ*)) + KL(*q*(**S**, **Z**, *θ*)‖ *p*(**S**, **Z**, *θ*|**Y**)), where the KL(·‖·) is the Kullback–Leibler (KL) divergence between two distributions. Hence, maximizing *F*(*q*(**S**, **Z**, *θ*)) is equivalent to approximating *p*(**S**, **Z**, *θ* |**Y**) by *q*(**S**, **Z**, *θ*). For brevity, the model order, ℳ = (*J,K*), is omitted from all the conditional terms and assumed implicit.

Since approximation of the joint posterior is analytically intractable, the *mean-field* approximation is adopted, i.e. the approximate posterior can be factorized as
A7

Consequently, the free energy is decomposed as A8

The approximate posteriors are found using the VB-EM algorithm. The hidden variables part, *F*(*q*(**S**, **Z**)), is maximized in the VB-E step while the parameters part, *F*(*q*(*θ*)), is maximized in the VB-M step.

In the VB-E step, we have
A9
where *q*★ (**S**, **Z**) ∝ exp[〈log *p*(**Y**, **S**, **Z**|*θ*)〉_{q(θ)}]. The relation means that the equality holds up to a constant, and 〈·〉_{q(·)} indicates an expectation with respect to the specified distribution. From equation (A 9), maximizing *F*(*q*(**S**, **Z**)) is equivalent to approximating *q*(**S**, **Z**) to *q**(**S**, **Z**). Likewise, in the VB-M step, notating *q*(*θ*) = ∏_{ℓ=1}^{L} *q*(*θ*_{ℓ}), we have
A10

The posterior of a parameter is approximated as
A11
where *θ*_{−ℓ} denotes all the parameters excluding *θ*_{ℓ}. Since the conjugate priors are defined, *q**(*θ*_{ℓ}) can be specified without actually computing the normalization terms. These two steps are iterated alternately until some terminating conditions. After finding the optimal model order, the most probable path of the hidden states is decoded. Algorithm 1 outlines the VB learning algorithm. The following subsections provide more detailed descriptions of the algorithm.

#### A.1. VB-E step

Suppose that the posteriors of the parameters are available from the previous iteration, that is, equations (A 19)–(A 23). Then, the ‘submarginalized’ likelihood for each observation, *y*_{t}, given a hidden state and a mixture component variables, is evaluated as
A12
where 〈log *λ*_{jk}〉 is given in equation (B 7). A superscript of ‘(old)’ means that the posterior or the hyperparameter has been obtained in the previous iteration. A tilde (̃.) indicates an updated posterior hyperparameter, probability, or parameter obtained by marginalizing nuisance variables.⇑

Consider the following relation:
A13
where *p̃*(*z*_{t}|*s*_{t}) corresponds to C̃ = [*C̃*_{jk}] ∝ [exp(〈log *C*_{jk}〉)], given in equation (B 2). The forward and backward recursions are, respectively, as follows:
A14
and
A15
where *α*_{1} ∝ *p̃*(*y*_{t}|*s*_{t})*p̃*(*s*_{1}) and *β*_{T} = [1, … , 1]^{T}. The initial state probability, *p̃*(*s*_{1}), and the transition probability, *p̃*(*s*_{t}|*s*_{t−1}), correspond to *π̃* = [*π̃*_{j}] ∝ [exp(〈log *π*_{j}〉)] and **Ã** = [Ã_{ij}] ∝ [exp(〈log*A*_{ij}〉)], respectively. Then, the posteriors of the hidden variables are now written as
A16
A17
A18

#### A.2. VB-M step

The Dirichlet posteriors for the initial state, the transition, and the mixture component parameters are A19 A20 A21

The Gaussian posteriors for the mean values are A22 and

The Gamma posteriors for the precisions are A23 and

#### A.3. Variational free energy

The variational free energy is evaluated after each round of the two steps. From equation (A 8), the variational free energy can be expanded as A24 where the first term can be evaluated as A25

The second term, the entropy of the hidden variables, is A26

The remaining terms associated to the KL divergences can be evaluated as given in equations (B 3), (B 5) and (B 8).

#### A.4. Model order selection

In Bayesian model selection, one selects a model order with the highest model posterior,
A27
where *p*(**Y**|*J, K*), *p*(*J, K*) and *p*(**Y**) are the marginal likelihood, the model prior and the total evidence, respectively. Since the variational free energy is an approximation to the log-marginal likelihood, if we assume a flat model prior, that is, *p*(*J, K*) = const., ∀*J, K*, the model posterior is approximated as
A28

In essence, we select a model order that leads to the highest free energy.

#### A.5. Decoding of the most probable path

The decoding of the most probable path of hidden states is essentially equivalent to the Viterbi algorithm described in Rabiner (1989). The only difference is that we use the submarginalized parameter, **Ã**, and likelihood *p̃*(*y*_{t}|*s*_{t}) = exp[〈log *p*(*y*_{t}|*s*_{t}, *z*_{t})〉 _{q*} _{(zt},_{μ,Λ)}].

## Appendix B. Relevant distributions

This section presents the density functions, KL divergences and some relevant statistics for the Dirichlet, Gaussian and Gamma distributions.

#### B.1. Dirichlet distribution

For a variable , the Dirichlet density function with a parameter is B1

The expectation of the logarithm of the Dirichlet variable is
B2
where *ψ* (·) is the digamma function. For two Dirichlet distributions, *q*(*π*) = 𝒟(*π*; **ũ**) and *p*(*π*) = *D*(*π*; **u**), the KL divergence is
B3

#### B.2. Gaussian distribution

The Gaussian density function of is specified by a mean *μ* and a precision *λ*:
B4

For two Gaussian distributions, *q*(*y*) = 𝒩(*y*;̃ *μ*, *λ̃*^{−1}) and *p*(*y*) = 𝒩(*y*; *μ*,*λ*^{−1}), the KL divergence is
B5

#### B.3. Gamma distribution

For a variable , the Gamma density function with a shape parameter and a scale parameter is B6

The expectation of the logarithm of the Gamma variable is written as B7

For two Gamma distributions, *q*(*λ*) = 𝒢(*λ*; *ν̃*, *ṽ*) and *p*(*λ*) = 𝒢(*λ*; *ν*, *v*), the KL divergence is
B8

## Footnotes

↵1 The variational free energy provides a bound to the true (negative) log evidence of the data under the model. The quantity refers to the statistical modelling properties of the model and is not in any way a measure of physical movement or energy in this context.

- Received May 20, 2010.
- Accepted June 30, 2010.

- © 2010 The Royal Society