## Abstract

The sequential sampling of populations with unequal probabilities and with replacement in a closed population is a recurrent problem in ecology and evolution. Examples range from biodiversity sampling, epidemiology to the estimation of signal repertoire in animal communication. Many of these questions can be reformulated as urn problems, often as special cases of the coupon collector problem, most simply expressed as the number of coupons that must be collected to have a complete set. We aimed to apply the coupon collector model in a comprehensive manner to one example—hosts (balls) being searched (draws) and parasitized (ball colour change) by parasitic wasps—to evaluate the influence of differences in sampling probabilities between items on collection speed. Based on the model of a complete multinomial process over time, we define the distribution, distribution function, expectation and variance of the number of hosts parasitized after a given time, as well as the inverse problem, estimating the sampling effort. We develop the relationship between the risk distribution on the set of hosts and the speed of parasitization and propose a more elegant proof of the weak stochastic dominance among speeds of parasitization, using the concept of Schur convexity and the ‘Robin Hood transfer’ numerical operation. Numerical examples are provided and a conjecture about strong dominance—an ordering characteristic of random variables—is proposed. The speed at which new items are discovered is a function of the entire shape of the sampling probability distribution. The sole comparison of values of variances is not sufficient to compare speeds associated with different distributions, as generally assumed in ecological studies.

## 1. Introduction

The description of sequential sampling of a population of individuals for which the probability of being selected does not vary until a specific event, such as the collection of all or some types of individuals or a specific subgroup of the population, occurs is a common problem in ecology and evolution studies. In probability theory, such problems are often treated as urn problems, generally as the ‘coupon collector problem’ (CCP). The CCP is a mathematical model that belongs to the family of urn problems that can be formulated as follows: a company issues coupons of different types, each with a particular probability of being issued. The object of interest is the number of coupons that must be collected to obtain a full collection. This problem has been widely studied. The first findings concerned the classical problem in which all coupons are equally likely to be obtained [1]. Rapid advances have been made in this field [2–4], but they have gone largely unnoted by most scientists working in ecology and evolution. This is partly due to difficulties in making the correct analogies, partly due to a lack of worked examples and partly because each field devises its own vocabulary, procedures and formalism. In ecological sciences, for example, a vibrant field of theoretical and applied ecological statistics developed in the 1950s from the repeated sampling of populations to estimate biodiversity richness [5,6]. This field could greatly benefit from the latest advances in the CCP [7,8] as we show through the working of a biological example. Related problems deal with the relative abundance of species from a community containing many species [9], or the sampling effort required to achieve a particular level of coverage [10]. Increases in the number of new hosts being infected or superinfected are a topic of great importance in population dynamics and epidemiology [11–14]. Several of the questions posed in capture–recapture studies relate to the coupon collector problem. Occupancy problems and related capture–recapture techniques are, indeed, defined as problems in which the probability of a given species occupying a given state at a given time must be determined (see the review [15] and the paper [16]). In ethological sciences, the estimation of a repertoire of signals in animal communication is considered as a form of the CCP, because vocal repertoire size is a key behavioural indicator of the complexity of the vocal communication system in birds and mammals [17]. In genetics and evolution, the CCP has been recognized as such only occasionally, despite these fields having generated some of the most elegant theorems and uses of other urn processes [18–20]. Indeed, the CCP has been used in the context of exhaustive haplotype sampling in phylogeography [21], determining the number of beneficial mutations as a function of sequence lines [22] and estimations of the size of the library required to target a particular percentage of the non-essential genome displaying a given property [23], for example.

Urn models have been much more widely used for modelling host–parasitoid systems than in other topics of ecology. We therefore used the biological context and formalism of parasitism by parasitic wasps, as the results obtained with this system can easily be extended to other ecological and evolutionary contexts. Parasitic wasps search for insects hosts, such as caterpillars, in which they lay a single, or multiple eggs. In solitary wasp species, only a single wasp develops fully in a given host. Parasitism can thus be formalized as a probabilistic dynamic process with hosts as ‘balls’ and parasitoids changing their ‘colour’ by parasitizing them. In work beginning more than a century ago [24,25], the pioneering population dynamicists assumed that hosts were found and attacked on successive occasions governed by exponential laws in continuous time. The number of draws was thus considered to be random and the number of eggs for a given host was assumed to follow a Poisson law [26]. If we assume that the number of draws is fixed, then the distribution of the number of eggs for a given host is binomial, but closely approximates a Poisson distribution in large host populations. The proportion of the population without eggs (the zero class) is of particular interest, because these hosts survive parasitism and produce offspring for the next generation. In field studies however, observed distributions are generally more aggregated than would be expected under the assumed Poisson distribution [27]. Aggregation is interpreted as the result of heterogeneity in the risk of being found, owing to differences in location, accessibility, appearance, colour, developmental stage or any other trait [28,29]. The risk distribution greatly influences the stability of the host–parasitoid system and has been widely studied [30–32].

All these works make strong assumptions about parasitoid searching and attacking behaviour and hence the egg distribution over the population of hosts, after a given time or a given number of draws (figure 1). However, the use of this distribution greatly decreases the amount of information available, as it collapses individual host histories. Parasitism is a multinomial process (figure 1), in which time corresponds to host draws. Its dynamics determines, for example, the percentage of hosts parasitized at the end of the season, the opportunity and time at which alternative pest control methods need to be deployed in supplement in biological control with parasitoids, and the time required to achieve a given degree of control by parasitic wasps. In this paper, we aimed to model parasitism as a multinomial urn process over time and we study the speed of parasitization (figure 1). We consider host encounters followed by oviposition without discrimination. The parasitism process described above can be considered as a CCP. In this case, there is a finite population of hosts differing in appearance, location, developmental stage or other factors. This heterogeneity results in different probabilities of hosts being found by parasitoids. These probabilities, *p*_{h} for host *h*, do not change over time. Our work therefore entails describing in depth the CCP, highlighting unnoticed analogies among previous works within the probability literature, and comparing the influence of the degree of heterogeneity among hosts on the speed of infection. We give a compact and hopefully more elegant proof than previously known of the following fact: the more the distribution *p* on the set of hosts is heterogeneous, the more the (random) number *Y* of parasitized hosts after a given number of draws is small; in other terms, there is a monotonic relationship between the majorization relation on the set of probability distributions *p* with the stochastic dominance on the set of random numbers *Y*.

This paper is structured as follows. In §2, we define a succession *S*_{n}, of *N*-dimensional random variables describing the state of the host population over time, in which time, *n*, is given by the number of attacks on the set of hosts. Each marginal distribution of *S*_{n} provides us information about a subset of hosts, including, in particular, the *h*th component representing the number of times that host *h* has been attacked by a parasitoid between times 1 and *n*. In §3, we define the random variables *Y*_{n}, representing the number of parasitized hosts after *n* draws. We also compute the distribution, the distribution function, the expectation and the variance of *Y*_{n}. We found no examples of calculations of this value in previous studies and therefore believe this aspect to be novel. We obtain the expected number of draws required for all hosts in a given subset to be parasitized and provide upper and lower bounds for this value in §4. In §5, we apply the results developed in previous sections to two particular risk distributions on the set of hosts. We first use the uniform distribution, and then a distribution corresponding to a host population with two different kinds of hosts. We calculate the most relevant values for each of these cases. In §6, we develop the relationship between the speed of parasitization and the risk distribution in the set of hosts. A narrower risk distribution is associated with faster parasitization. Thus, parasitization is fastest when the risk distribution is uniform. A conjecture on strong stochastic dominance is proposed in §7. In §8, we highlight the relationship between the speed of parasitization and the risk distribution with numerical examples and in §9 we apply our approach to a real host–parasitoid system. We have included the most mathematical aspects of the paper as electronic supplementary material.

## 2. Modelling parasitism as an urn process

We assume a finite population of hosts, constant for the entire duration of the experiment. The size of the parasitoid population is irrelevant, but we assume that the number of eggs that can be laid in the host population is not limiting. The situation is developed in successive stages or draws. At each stage, a parasitoid attacks a host, in which it lays an egg. The model is based on the fundamental assumption that successive draws are mutually independent. The hosts differ in appearance owing to intrinsic qualities, and these differences modify their probability of being attacked by a parasitoid. If the hosts are named 1, 2, 3, … , *N*, then host *h* has a probability *p*_{h} ≥ 0 () of being attacked by a parasitoid in a draw. This probability does not change during the process. We will say that *p*_{1}, *p*_{2}, … , *p*_{N} or (*p*_{1}, *p*_{2}, … , *p*_{N}) is the risk distribution for the set of hosts *H* = {1, 2, … , *N*}.

The underlying probability space of our model is , where the elements of *Ω* are all the possible histories of parasitism, that is , equipped with its product *σ*-algebra and the probability *P* given by Kolmogorov's theorem: if we therefore fix in *H*, the probability of the event is . When necessary, the vector (*p*_{1}, *p*_{2}, … , *p*_{N}) will be denoted by a single letter *p*, and the probability *P* will be denoted by *P*_{p}.

We can describe this situation by defining a succession of random variables,
2.1 where *S*_{nj} denotes the number of eggs in host *j* after *n* draws.

Variable *S*_{n} represents the state of the host population after *n* draws, that is, the distribution of eggs over the total population of hosts. If host *i* was visited *r*_{i} times between stages 1 and *n*, for *i* = 1, 2, … , *N*, then *S*_{n} takes the value (*r*_{1}, *r*_{2}, … , *r*_{N}). This variable has a multinomial distribution with parameters *n*, *p*_{1}, *p*_{2}, … , *p*_{N}, that is, for every integers *r*_{1}, *r*_{2}, … , *r*_{N}, , ,
2.2

The marginal distribution of *S*_{n}, for distinct elements of is given by
2.3 with

This is the probability that, after *n* draws host *i*_{1} has been visited *r*_{i1} times by the parasitoids, host *i*_{2} *r*_{i2} times and host *i*_{h} *r*_{ih} times, without considering the rest of the hosts.

In particular, the component *S*_{nh} of *S*_{n} has a binomial distribution with parameters *n*, *p*_{h},
2.4where
This variable represents the state of host *h* after *n* draws. Thus, is the probability that host *h* has been attacked *r* times during the *n* draws.

The expected value and variance of this random variable are

Let denote the canonical base of the space . We emphasize that the process is the random walk on with independent increments obeying the following law: with probability *p*_{k}. The statistical behaviour of this process is also very well known.

Note that, in this model, the sequence of random subsets of *H*, describing the set of parasitized hosts over time, is a Markov chain, and it is not difficult to give a precise description of its probability transitions. However, it is not straightforward to study this Markov chain directly.

## 3. Number of parasitized hosts after *n* draws

Let *Y*_{n} be the random variable representing the number of parasitized hosts after *n* draws, that is if there are exactly *k* parasitized hosts after *n* draws. In this section, we give expressions for the probability mass function (3.3), distribution function (3.4), expectation (3.5) and variance (3.6) of this random variable. We show in annex A in electronic supplementary material how these expressions can be obtained.

From now on, for any integer *h* > 0 and real *x*, we write
3.1

The distribution and the distribution function of *Y*_{n} have been obtained in previous studies, see [4]. Denoting *p*_{j1}_{,j2}_{,…,jk} = *p*_{j1} + *p*_{j2} + … + *p*_{jk}, the probability mass function is given by
3.2

Using the notation for any , this can be written in a more compact form
3.3where |*A*| denotes the number of elements of the set *A*.

For the distribution function of *Y*_{n}, we obtain
3.4

A similar expression can be seen in [4].

It is well known that
3.5but we have been unable to find any expression for and the variance of *Y*_{n} in previous studies. For these values, we obtain
and
3.6as can be seen in annex A in electronic supplementary material.

## 4. The number of draws required to reach a given level of parasitism

The expected number of draws required for the parasitization of *k* unparasitized hosts may be of considerable interest. For example, we might want to know the expected number of draws required for *k* of the hosts occupying a determined region, or with probabilities of parasitization greater (or less) than a given value, etc., to be parasitized. We define below a random variable representing the number of draws required for the event of interest to happen and we obtain its expectation. We also describe the relationship between the random variables defined here and the variables *Y*_{n} defined in §3.

Let us consider that, at a given stage of the process, there is a set of unparasitized hosts. This is our set of interest, and the remaining hosts *H*–*K* are or are not parasitized. Let us use *X* to denote the number of hosts in the set *H*–*K* attacked by the parasitoids before one of the hosts in *K* is attacked.

As this process involves the repeating of independent trials, the random variable *X* follows a geometric distribution with parameter (or a degenerate distribution if *K* = *H*). Thus,
4.1

Now, let *k* and be integers . Let *H*_{1} be a subset of the set of hosts, *H*, and , . We can assume that without loss of generality.

If we consider the hosts of set *H*_{1} to be unparasitized, then we can define *T*_{k,N1} as the random number of draws required to ensure that *k* hosts of set *H*_{1} have been parasitized. Its expectation is the expected number of draws required for *k* hosts of set *H*_{1} be parasitized. In this section, we include only the main ideas used to obtain this expectation, leaving the complete development for annex B in electronic supplementary material. The case , and therefore *N*_{1} = *N*, has been studied before and different expressions for have been obtained. We include these at the end of this section. In [2], an expression is proposed for the particular case in which *k* = *N*_{1} = *N*.

Let *i*_{1}, *i*_{2}, … , *i*_{k} be distinct elements of *H*_{1}. Let *D*_{i1, i2, … , ik} be the event defined by the fact that the first *k* hosts of set *H*_{1} parasitized (i.e. attacked by a parasitoid for first time) are hosts *i*_{1}, *i*_{2}, … , *i*_{k}, and are parasitized in the precise order *i*_{1}, *i*_{2}, … , *i*_{k}. In other words, some of the hosts of set *H*_{2} may be attacked first, followed by host *i*_{1}. Next, some hosts of may be attacked, followed by host *i*_{2}, etc. Let , .

As proven in annex B in electronic supplementary material, we can write the equality
4.2where *Π*_{k} is the set of all *k*-permutations of 1, 2, … , *N*_{1}.

The probability of event *D*_{i1i2…ik} and the conditional expectation are given by
4.3and
4.4

Bearing in mind (4.2)–(4.4), we can state the following.

### Proposition 4.1.

*The expected value of* *T*_{k,N1} *is*
4.5*where* *Π*_{k} *is the set of all* *k*-*permutations of set* {1, 2, … , *N*_{1}}, *i.e. the arrangements of length* *k* *of different elements of* {1, 2, … , *N*_{1}}.

Thus, *E*(*T*_{k,N1}) given by (4.5) is the expected number of draws required for *k* hosts of a set of unparasitized hosts with cardinality *N*_{1}, to be parasitized. This value is generally difficult to obtain, because the number of terms required for its computation is the number of *k*-permutations of 1, 2, … , *N*_{1}, that is . This value is huge when *N*_{1} and *k* are large. It is therefore important to obtain upper and lower bounds for this value.

### Proposition 4.2.

*Let* *k* *be given and* *p*_{1}, *p*_{2}, … , *p*_{N1} *be real numbers satisfying* . *Then, the maximum of* *defined by (4.4) over all possible choices of the* *k*-*subsets* *of* *H*_{1} *is*
4.6*and the minimum is*
4.7

### Proof.

From hypothesis , it follows directly that 4.8then and the proof is complete. ▪

### Proposition 4.3.

*Let* *p*_{1}, *p*_{2}, … , *p*_{N1} *be real numbers satisfying* , *for* *and* . *It is then true that*
In other words, and are upper and lower bounds, respectively, for the expected number of draws required for *k* hosts of the set *H*_{1} to be parasitized.

Furthermore, the mode of the distribution on the events , is , i.e. the order of parasitism of *k* hosts mostly likely to occur is

### Proof.

The first part of this proposition is a straightforward consequence of the proposition 4.2.

The second part comes directly from the fact that which follows from (4.3) and (4.8). ▪

Propositions 4.2 and 4.3 prove that, if , then the most likely order of parasitization of *k* hosts in *H*_{1} is the preferential order 1, 2, … , *k*. Moreover, the quickest scenario (in terms of expectation) for the parasitization of *k* hosts of *H*_{1} is the sequence extending from the least likely host, *N*_{1}, to the most likely host, , in the correct order. The slowest scenario (in terms of expectation) for the parasitization of *k* hosts of *H*_{1} extends from the most likely, 1, to the least likely host, *k*, in the correct order.

These results can be intuitively explained as follows. Let us suppose that host 1 is parasitized in the first place. The probability of a new host of the set *H*_{1} − {1} being parasitized is then . This value is less than any other value with . It is therefore more difficult for a host of the set *H*_{1} − {1} to be parasitized than for a host of the set , to be parasitized. The repeated application of this reasoning explains the first inequality of the proposition. The second inequality can be explained in a similar manner.

For simplicity, we denote *T*_{k,N} by *T*_{k} in the particular case in which *N*_{1} = *N*. Now, recalling the definitions of these random variables and the random variables *Y*_{n}, and bearing in mind the equivalence between ‘The number of parasitized hosts after *n* draws is less than or equal to *k* − 1 in the history of the sequence of parasitism’, and ‘To parasitize *k* hosts, more than *n* draws are necessary’, we can write the equality
and from this
and
From (3.3), (3.4) and above equalities, we see that
4.9and
is the expected number of draws required for *k* hosts to be parasitized. Different expressions have been described for this expectation [2,3]. From (4.9), it follows immediately that

This expression was obtained in [3]. In [2], the following expression was obtained:
where is the coefficient of *x*^{r} in the power series development of *f*(*x*).

If *k* = *N*_{1} = *N*, then is the expected number of draws required to obtain complete parasitism. From (4.5)
4.10where *Π*_{N} is the group of permutations of This expression for is proposed in [2]. The authors provide no proof for this formula, and we have found no proof elsewhere.

## 5. Applications to various risk distributions

In this section, we consider two different risk distributions on the set of hosts and compute the most relevant values for every distribution.

### 5.1. The uniform distribution

The situation in which the risk is distributed uniformly, i.e. all the hosts have the same probability of being parasitized, with
5.1has been widely studied. In this case, the expectation and variance of the random variable *Y*_{n} representing the number of parasitized hosts after *n* draws are
and the expected number of draws for *k* new hosts to be parasitized (4.5) is
which, in the case in which *k* = *N*, can be written as the following well-known formula:
It is clear that in this case the upper and lower bounds for *E*(*T*_{k,N1}) obtained in proposition 4.3 are both equal to *E*(*T*_{k,N1}), and all the probabilities are equal to .

### 5.2. Two kinds of hosts

The two types of host situations are an idealization of the following cases. Hosts which are dead, either because they were previously parasitized or because they produced artefacts such as mines and galls, remain in the ecosystem for much longer than the existence of the host. They can make up to 90% of the host population. They can be still attractive to parasitoids long after the host death. Parasitoids will not lay eggs in them, but they will be checked carefully, implying a waste of time of up to 20% [33,34]. In such cases, it is possible to envision two categories, living and dead hosts, while being interested in the rate of parasitism of the living ones only.

Let us now consider the situation in which there are two kinds of hosts and, therefore, two different probabilities of being detected by a parasitoid.

In a population of *N* hosts, each of the hosts has a probability *α* of being parasitized, and each host has a probability *β* of being parasitized, such that
5.2

The probability of host 1 being visited *r*_{1} times, host 2 *r*_{2} times, etc., for , given by (2.2) is in this case
The probability that, after *n* draws host *i*_{1} had been chosen *r*_{i1} times by the parasitoids, host *i*_{2} *r*_{i2} times and host *i*_{h} *r*_{ih} times, without taking the other hosts into account, is given by (2.3). It is equal to

We now calculate the expected number of parasitized hosts after *n* draws with this risk distribution, using the results obtained in §2.

Let *Y*_{n} be the random variable representing the number of parasitized hosts after *n* draws. From (3.3), it follows that
and the expected value of *Y*_{n}, (3.5), is equal to

To compute the expected number of draws for *k* hosts of a set of unparasitized hosts to be parasitized, we name the hosts of the set *H*_{1}, hosts . Without any loss of generality, we can assume and . Let *Π*_{k} be the set of all *k*-permutations of the integers . For every , let be the set defined by if . It is clear that the probability given by (4.3) is, in this case,
where
5.3Then, if for and , it follows directly that
We can therefore define an equivalence relation on *Π*_{k} as follows: *I* is related to *I*′ if . We denote by I̅ the equivalence class of *I*, and by the set whose elements are the equivalence classes of the elements of *Π*_{k}, that is
There are as many equivalence classes as subsets of with cardinalities greater than or equal to , where , and less than or equal to , and the cardinalities of these equivalence classes are
Given the above considerations, it is clear that *E*(*T*_{k,N1}) can be written in this case as
5.4where *γ*_{j} is defined by (5.3).

Let us suppose that

To obtain an upper bound for *E*(*T*_{k,N1}), we distinguish two cases, and . If then
If , this upper bound is
Similarly, to obtain a lower bound for *E*(*T*_{k,N1}) we distinguish the cases and *k* > *n*_{1}. If this lower bound is
and if *k* > *n*_{1}, a lower bound for *E*(*T*_{k,N1}) is
The maximum of the values is

## 6. Relationship between the risk distribution and the speed of parasitization

In the preceding sections, we studied the process of parasitization for a given risk distribution in the set of hosts. In this section, we compare this process for different risk distributions. We show how parasitization speed depends on the risk distribution, and its scatter in particular. We use the concept of ‘majorization’ to formalize the idea that risk distributions have different degrees of spread. This notion dates from the start of the twentieth century. A comprehensive review of the theory can be found in [35].

Less spread distributions are associated with faster parasitization. In other words, the more spread out the risk distribution, the larger the number of draws required for a given number of hosts to be parasitized. Thus, the distribution function for the first-time parasitization of a given number of hosts, viewed as a function of the vector *p*, is Schur convex (see the definition at the end of this section). The mathematical community studying the CCP seems to be largely unaware of it, but this result is not new and can be found in [36]. This result constitutes the first part of theorem 6.3. We give a proof which is more concise and clearer than a previous proposal. It is contained in annex C in supplementary electronic material.

Moreover, our method provides a precise result for strict Schur convexity. This refinement constitutes the second part of theorem 6.3. We make use in our proof of the relationship between the concept of majorization and the numerical operation known as ‘Robin Hood transfer’, described below.

In this section, we work with different risk distributions, requiring further notation and definitions. Given a risk distribution , we denote by *P*_{p} the probability distribution induced by *p* on the *σ*-field over the space of the all the possible incidences of parasitization.

Given (*p*_{1}, *p*_{2}, … , *p*_{N}) in , we denote by the *N*-uple obtained by permutation of *p*_{i} such that .

The following definitions are given in [35].

### Definition 6.1.

Let *p*_{1}, *p*_{2}, … , *p*_{N}, , be real numbers. We say that is majorized by , and we write *p* ≺ *q*, if
and

It is clear that when we apply this definition to the comparison of two risk distributions, the last equality is trivially satisfied.

Let . If , then we can transfer an amount Δ, from to to obtain the following new risk distribution , , … , , where , and for . Then, is less spread out than the initial distribution, that is, . Such operations involving the shifting of some ‘income’ from one individual to a poorer individual are described, somewhat poetically, as Robin Hood transfers [37]. If we define then we can write and .

### Proposition 6.2.

*The following conditions are equivalent:*

(a)

*p*≺*q**and*(b)

*p can be derived from q by successive applications of a finite number of Robin Hood transfers*.

It is not difficult to prove this equivalence. It was proved for the first time, to the best of our knowledge, in [38] for vectors of non-negative integer components.

Let denote a probability distribution *p* over the set *H*. Suppose that *p* is not uniform. We can assume without loss of generality. Let , . We then define a new risk distribution *p*′ by applying a Robin Hood transfer as follows:
6.1We indeed have .

### Theorem 6.3.

*Let* *p* *be a non-uniform probability distribution over* *H*. *Without loss of generality, we can assume that* . *Let* *p*′ *be defined by* (6.1). *Then, for all* *k* *between 1 and* ,
6.2*which is equivalent to*
6.3*Moreover, if at least* *k* − 1 *of the values* *are non-zero, then*
6.4*which is equivalent to*
6.5*where* *p*′ *is defined by* (6.1).

### Proof.

Can be found in annex C in supplementary electronic material. ▪

We can state the following corollaries.

### Corollary 6.4.

*Let* *and* *be risk distributions on* *H* = {1, 2, … , *N*}. *If* *then, for every* *and every*
6.6*is satisfied and*

*Furthermore, if the distributions* *p* *and* *q* *are actually different, meaning that they do not differ only by a permutation, then the preceding inequalities are strict, except in trivial cases. More precisely, denoting by* *j* *the number of non zero* *p*_{i} *values (and remarking that the number of non-zero* *q*_{i} *values is at most* *j*), *we have*

—

*if k*≥*n or k*≥*j, then*—

*if n*≥ 2,*k*<*n and k*<*j, then*

### Proof.

As it is possible to go from vector *q* to vector *p* by a finite sequence of Robin Hood transfers, the corollary follows directly from theorem 6.3, which proves that each transfer decreases the quantity . We just have to consider the cases in which this quantity is strictly decreased. ▪

### Remark 6.5.

We can interpret the results obtained above in terms of the theory of Schur-convex functions. A real-valued function *ϕ* defined on a set is said to be Schur-convex on if, for every *x* and *y* pair of elements in such that , the inequality holds. The first part of corollary 6.4 states that the map is Schur-convex. This was already proved in [36], and was stated as a conjecture in [4].

### Corollary 6.6.

*Let* *be the uniform distribution on* *H* = {1, 2, … , *N*} and *any other risk distribution on* *H*. *Then*

### Proof.

It can be clearly seen that is majorized by any other distribution on *H* and the corollary follows. ▪

### Remark 6.7.

The results obtained in corollaries 6.4 and 6.6 can be expressed in terms of a comparison of probability distributions as follows. If *p* ≺ *q*, then relation (6.6) proves that the random variable *Y*_{n} defined on the probability space determined by *p* on the space of the random sets of *H* = {1, 2, … , *N*} is weakly stochastically dominated by the random variable *Y*_{n} defined on the probability space determined by *q*. Corollary 6.6 proves that the random variable *Y*_{n} defined on the probability space determined by the uniform distribution is always weakly stochastically dominated by the random variable *Y*_{n} defined on the probability space determined by any other probability distribution on *H*.

### Remark 6.8.

A very recent and similar study in [39] proves inequalities (6.2) and (6.3) of theorem 6.3 through a different procedure. Our contribution offers a more elegant argument, based on use of fundamental formulae (3.2) and (3.3) in different contexts. Furthermore, the quality of the arguments allows us to obtain cases of strict inequalities.

## 7. A conjecture on strong dominance

In §6, we used an order relationship between random variables (or more precisely between their distributions) that can be defined formally as follows.

### Definition 7.1.

Let *X* and *X*′ be two real random variables, defined on probability spaces (*Ω*, *P*) and (*Ω*′, *P*′), respectively. We say that the random variable *X* weakly stochastically dominates the random variable *X*′ if the cumulative distribution function of *X*′ dominates the cumulative distribution function of *X*. That is, for any ,

The main result of §6 is that if *p* ≺ *q*, then the random variable *Y*_{n} defined on the probability space weakly stochastically dominates the random variable *Y*_{n} defined on the probability space

A particular case of weak dominance is one in which inequalities apply not only to the cumulative distribution functions, but also to the distributions themselves. We refer to this situation as strong dominance, and we provide a formal definition of strong dominance below, for the case of discrete random variables. (A similar definition can be given for continuous random variables with densities.) In short, *X* strongly dominates *X*′ if, for any small enough value *d*, , and if for any other possible value *e*, .

### Definition 7.2.

Let *X* and *X*′ be two real random variables, defined on probability spaces (*Ω*, *P*) and (*Ω*′, *P*′), respectively, and taking values in a denumerable set *D*. We say that the random variable*X* strongly stochastically dominates the random variable *X*′ if there is a critical value such that, for any

— if

*d*≤*c*, then*P*(*X*=*d*) ≤*P*′(*X*′ =*d*)*and*— if

*d*>*c*, then*P*(*X*=*d*) ≥*P*′(*X*′ =*d*).

It is easy to show that strong dominance implies weak dominance, but that the converse is not true. Coming back to our CCP model, we propose the following.

### Conjecture.

If *p* ≺ *q*, then the random variable *Y*_{n} defined on the probability space (*Ω*, *P*_{p}) strongly stochastically dominates the random variable *Y*_{n} defined on the probability space (*Ω, P*_{q}).

This conjecture has been tested on various examples, but we have been able to prove it formally for only a few values of the pair (*n, N*), namely for *n* = 2 or 3 and any *N*, and for *n* = 4 and *N* ≤ 5.

In applications, strong dominance reinforces weak dominance. It gives more precise statements concerning the relative probabilities that a given number of hosts are parasitized after a given number of eggs laid, for different risk distributions.

## 8. Illustrative examples

In this section, we show graphically the relationships satisfied among the distribution functions of random variables *Y*_{n} as well as the distribution functions of random variables *T*_{k}, when their corresponding risk distributions are able to be compared by majorization.

Figures 2 and 3 represent very different situations in terms of risk distribution over hosts. The risk distributions are majoring each other in figure 2, whereas this is not the case for figure 3.

The distribution functions of five variables *Y*_{n} are represented in figure 2*a*. They correspond to five different risk distributions, *p*_{1}, *p*_{2}, *p*_{3}, *p*_{4} and *p*_{5}, satisfying . These are distributions on the set (so *N* = 12), *p*_{1} is the uniform distribution, for *i* = 2 and 3, and for *i* = 4 and 5. We have also used *n* = 12, and it can be observed that , with . This panel shows that, for example, the probability to observe no more than five parasitized hosts with 12 draws is nearly one for the distribution drawn in red, and nearly zero for the distribution in black.

The distribution functions of 10 variables *T*_{k} are represented in figure 2*b,c*. The number of hosts, *N*, and the risk distributions are the same in both cases; *N* = 10, *p*_{1} is the uniform distribution and , for . For these risk distributions, is satisfied. In figure 2*b*, *k* = 6 and the values of *n* lie between 6 to 50. In figure 2*c*, *k* = 9 and the values of *n* lie from 9 to 100. It can be seen that , for , in both graphics. For example, figure 2*b* shows that the probability to observe at least six parasitized hosts when sampling 10 hosts is nearly 0.8 for the distribution in black, whereas at least 50 hosts have to be sampled in order to obtain the same probability in the situation described by the green curve in the bottom of the panel. Figure 2*c* shows a similar case for nine parasitized hosts. One clearly observes that the ordering of the distributions of random variables *Y*_{n} and *T*_{k} follows the ordering of the risk distributions, described above. This is very different if the risk distributions do not have any simple ordering among themselves, as shown in figure 3. Figure 3 compares distribution functions of random variables *T*_{k} corresponding to two unrelated risk distributions *p* and *q*, i.e. neither *p* ≺ *q* nor *q* ≺ *p*. Thus, these distribution functions act in different ways depending on the value of *k*. We include three different graphics, each bearing two curves. These curves are the distribution functions of two random variables *T*_{k}. The risk distributions associated with these random variables are, in the three graphics, and . In figure 3*a*, *k* = 5, and the probability that at least five hosts have been parasitized after *n* draws is always less for the distribution in red than for the one in black. It does not occur in figure 3*b,c*, where *k* = 8 and 9, respectively. In these cases, the probabilities that at least *k* hosts have been parasitized after *n* draws are lower for the red distribution than for the black one when the number of draws, *n*, is small, but are reversed when *n* is larger.

## 9. Applying our approach to a real host–parasitoid system

The population dynamics of host–parasitoid systems have often been theoretically studied in terms of urn processes, as explained in Introduction. The otherwise quite vast literature on parasitoid behaviour seems however nearly void of real datasets available for applying urn processes. The single reference we identified containing all needed information, namely both the host number identity and the handling behaviour of the wasp over time, is by Simmonds [40], but the datasets presented there are too sparse to enable modelling. Van Lenteren presents in [41] figures with nearly complete information, save for host identity. We therefore use the biological interaction between the butterfly *Melitaea cinxia* (Lepidoptera: Nymphalidae) and its ichneumonid parasitoid *Hyposoter horticola* (Hymenoptera: Ichneumonidae) as an example (see inset in figure 4). The wasp females search for and find the butterfly egg clusters and attack them. We chose this system for several reasons. First, we know more on this system in the wild than on many others [3,42–44]. Second, butterflies lay their eggs in clusters, enabling a clear identification of the number of hosts available. Third, a female leaves a clutch after having parasitized only a third of the eggs. The puzzle of leaving most of a resource unexploited is explained in this case by the avoidance of superparasitism [26]. Indeed, as only one larva survives in a host, superparasitism should be avoided when encountering previously parasitized hosts. In reality, it was observed to occur occasionally, with a probability of some 20%. Our model is silent about the fate of superparasitized hosts. Fourth, the wasp is ‘making haphazard passes across the cluster’ [26, p. 543] and does not show any systematic way of searching, enabling us to assume that the hypothesis of uniform distribution of risk at the cluster level is valid. The authors state that ‘because about 30% of each roughly 200-egg host cluster is parasitized, the wasp must probe, on average, more than 60 eggs per cluster’. In the following, we use the theory developed in this paper to make the above statement more precise and more generic.

Our approach defines the time axis by the number of successive draws. Handling time, a key parameter in most models of host–parasitoid systems, is therefore included in the concept of a draw, i.e. time is not measured in seconds or minutes. Assuming no parasitism at the arrival of the wasp on the 200-eggs cluster, the expected number of probes necessary to have 60 hosts parasitized, *E*(*T*_{60}), is 72, according to equation (4.5). The number of necessary draws to attain a high percentage of parasitism is increasing with that level of parasitism and explodes when approaching 80% parasitism and more: it needs nearly 1200 draws to parasitize 199 hosts. This is the ‘diminishing return’ explanation given by the authors for *Hyposoter*: a foraging female might, due to incomplete knowledge of parasitism status of the hosts, superparasitize some. The number of expected parasitized hosts, *E*(*Y*_{n}), as a function of the number of hosts probed is given in figure 4. Making inverse inference from *Y*_{n} to *T*_{k} is not immediate, because we deal with stochasticity: for a single trajectory of the whole dynamics, the map is a well-defined generalized inverse of the map , but this fact has no direct translation on the link between the maps and . In other words, the strong correspondence between the distribution functions of random variables *Y*_{n} and *T*_{k} has no direct translation on their expectations. A similar problem often encountered by biologists is inverse regression. The behaviour of the variance is interesting: it is very small early on, as each sampled host will most likely be unparasitized (the variance is actually null for first draw), maximal at mid-range and small again at very large numbers of parasitized hosts.

The eggs of a single cluster are building ‘mounts’, and some hosts might be more accessible than others. While Montovan *et al*. [26] have checked the uniformity of parasitism according to depth in the mount and found no gradient, risk of parasitism is known to be varying with accessibility in other host–parasitoid systems (see references in [26]). Let us assume that a single host egg has a probability *α* of being parasitized and the others have probability *β* of being parasitized. This is the extreme case of the situation studied in §5.2. From (5.4), we obtain the following expressions for *E*(*T*_{k,N1}):

(1) If the host with probability α of being parasitized does not belong to the set

*H*_{1}, then(2) If the host with probability α of being parasitized belongs to the set

*H*_{1}, then

9.1

Now, let us assume, for the purpose of illustration, that a single host egg, out of the 200, has a risk of being parasitized 10 times higher than all the others. For this very specific case, we obtain by applying (9.1) the value of *E*(*T*_{60}) of 73 to have 60 hosts parasitized, an insignificant increase over the situation of uniform risk distribution of all hosts. If this host has now a risk of parasitism 100 times higher than all the others, *E*(*T*_{60}) increases markedly, to 105. Two observations can be made on the basis of these computations. First, it seems difficult to inversely infer the risk distribution among hosts on the basis of some characteristics of the random variables *T*_{k} or *Y*_{n}. It is in fact nearly impossible to do so if we do not know *N*, *k*, and the fact that there is only one host different from the others. Theoretical developments would be valuable to enable such inverse inference, of high importance in practical terms. Second, the increased number of draws with increased risk heterogeneity reflects the shadowing effect mentioned earlier, in which this single host with higher risk of parasitism ‘protects’ the others, by being sampled more often.

## 10. Conclusion

Urn sampling processes are well understood when items have a uniform probability to be picked up [45]. This simplest model and its implications, such as the Poisson distribution of draws per sampled unit for large populations, have been frequently used in ecology and evolution. The theoretical developments in this area are however not complete, as some fundamental statistics of practical importance, such as the variance of *T*_{k}, have not been worked out in the probability theory literature yet. This area thus represents a worthwhile probabilistic development. Dealing with non-uniform distributions of risk is more difficult, both because the mathematics are indeed more elaborate and because the outcome is highly sensitive to variations of the risk distributions. Distributions of risk observed in real, natural systems are however highly non-uniformly distributed and often markedly skewed [28,29], hence the need to expand the research field into this direction. The relationship between risk distribution and population dynamics remains complex to understand, constrained by strong assumptions about timing of events and depending on the entire distribution of risk rather than the variance only (see [30] and the following flurry of publications). As a consequence, the assessment of the influence of different distributions of risk of parasitism on the population dynamics of host–parasitoid systems relied so far only on extensive simulations [32]. The next challenge is thus to integrate biased urn sampling theory with their population dynamics models. How the stability properties of the interaction are influenced by the dominance properties of various distributions of risk of parasitism can then be assessed in generic terms.

## Authors' contributions

J.C. framed the question, N.Z., E.L. and M.J.F.S. carried out the bulk of the mathematical reasoning, E.L. proposed the new conjecture, P.Z. computed the illustrative examples and J.C. and N.Z. worked out the biological example. The first draft was written by N.Z., with amendments by J.C. and E.L.

## Funding

N.Z. and M.J.F.S. acknowledge the financial support of the Fundación Seneca of the Comunidad Autónoma de la Región de Murcia, project no.19320/IP/14.

## Acknowledgments

We thank Dr van Nouhuys and Chicago University Press for the use of the photograph of parasitoid from [26]. N.Z. is also grateful to the University François-Rabelais of Tours, for its support and hospitality.

## Footnotes

Electronic supplementary material is available online at https://dx.doi.org/10.6084/m9.figshare.c.3677185.

- Received August 12, 2016.
- Accepted January 11, 2017.

- © 2017 The Author(s)

Published by the Royal Society. All rights reserved.