





















Hal 01212435 v
Abstract
Measuring biodiversity requires empirical techniques to effectively estimate it from real data. The well-known underestimation of the number of species applies to low orders of diversity in general. I test nine estimators including three new ones on geometric and lognormal distributions that represent realistic, hyper-diverse communities. The best two estimators allow a good estimation of diversity of orders over 0.5, even when the sampling effort is low. I provide criteria to choose the estimator and the necessary code in the R package entropart.

Keywords
Biodiversity, HCDT entropy, Phylodiversity

(1) AgroParisTech, UMR EcoFoG, CNRS, Cirad, INRA, Université des Antilles, Université de Guyane, Campus agronomique, BP 316, F- Kourou, French Guiana.
Introduction
1 Methods
1.1 Sample coverage
1.2 Estimators of entropy
1.3 Confidence intervals
1.4 From entropy to diversity
1.5 Typical distributions
1.6 Evaluation of the performance of estimators
2 Results
2.1 Sample coverage
2.2 Entropy and diversity
3 Discussion
3.1 The sample coverage is not always the good indicator of the quality of estimation
3.2 Comparing the diversity of real communities with different distributions remains untractable
3.3 Estimating the number of species is the critical step
3.4 Better, but probably not much better, estimators may be derived
4 Application to real data
5 Conclusion
Introduction
Measuring biodiversity requires both a robust theoretical framework (Patil and Taillie, 1982) and empirical techniques to effectively estimate the theoretical variables with real data (Beck and Schwanghart, 2010). In this paper I focus on species-neutral measures of diversity based on HCDT entropy (Havrda and Charvát, 1967; Daróczy, 1970; Tsallis, 1988) that fulfill the first requirement. Entropy measures the average surprise brought by observing individuals of a community. Surprise is a decreasing function of probability, dropping to 0 when probability is 1. HCDT entropy uses a parameterized surprise function that is the deformed logarithm of order q of the reciprocal of probability (Marcon et al., 2014a). Traditional measures of diversity, namely the number of species as well as Shannon's and Simpson's indices, are special cases of the HCDT entropy for values of q equal to 0, 1 and 2. HCDT entropy should be transformed into Hill numbers (Hill, 1973) for better interpretation of the value of diversity as an effective number of species (Jost, 2006). Hill numbers are simply the deformed exponential of HCDT entropy (Marcon et al., 2014a).
Rather than focusing on a single value of q, a profile of diversity, i.e. a plot of diversity against q, can be built (Tothmeresz, 1995). Low values of q (starting from 0) give much importance to rare species, whilst higher values (usually up to 2) focus on abundant species. Negative values of q are not used because of poor mathematical properties of their entropy (Beck, 2009), and values over 2 generally bring little more information. Ordering communities in terms of diversity requires that their profiles do not cross (Tothmeresz, 1995); otherwise, declaring a community more diverse than another only holds for a range of values of q reflecting the importance given to rare or frequent species (Lande et al., 2000).
To plot those profiles, diversity must be estimated from the data. Estimation bias (I follow the terminology of Dauby and Hardy, 2012) is a well-known issue (Marcon et al., 2014a). Real data are almost always samples of larger communities, so some species may have been missed. The induced bias on the Simpson entropy is smaller than on the Shannon entropy because the former assigns lower weights to rare species, i.e. the sampling bias becomes more important as q decreases. Another estimation bias has been widely studied by physicists, who generally consider that all species of a given community are known and their probabilities quantified. Their main issue is not missing species but the non-linearity of entropy measures (see Bonachela et al., 2008, for a short review). Estimating probabilities at power q > 0 by the power of their estimator is an important source of underestimation of entropy. The need for corrections has generated a considerable literature in ecological statistics and statistical physics.
In this paper, I test the performance of the state-of-the-art estimators when applied to the kind of data ecologists have to deal with. I start with simulated distributions, which have the advantage of being easily manipulated to generate various sampling intensities, and evaluate the bias and root mean square error (RMSE) of the estimators. I address the classical models of the literature, namely the lognormal and the geometric distributions. The lognormal distribution describes, at least roughly, many hyperdiverse ecosystems even though the link between its statistical success and the underlying ecological mechanisms is poorly documented (Tokeshi, 1993). The geometric distribution is a far more difficult case because it is very uneven: the frequency of rare species is several orders of magnitude smaller than that of the frequent ones, making them impossible to observe with reasonable sampling effort (Haegeman et al., 2013). I apply the best-known and best-performing estimators, including three new ones, to those distributions and to two actual forest data sets. My purpose is to provide recommendations about the estimation technique to choose when facing different types of data and to draw general conclusions about the possible accuracy of diversity estimation.
Phyloentropy is the sum of HCDT entropy along an ultrametric tree (Marcon and Hérault, 2015a), so estimating it reduces to estimating HCDT entropy. Phylodiversity is the deformed exponential of phyloentropy. In short, estimating phylodiversity relies on the methods presented here, so I will focus on species-neutral diversity for clarity. I used the package entropart (Marcon and Hérault, 2015b) for R (R Development Core Team, 2015) for all tests. The R code necessary to reproduce all results is in the electronic appendix.
1 Methods
Consider a community of species indexed by $s = 1, 2, \dots, S$. $n_s$ is the number of individuals of species $s$ sampled in the community, and $n = \sum_s n_s$ is the total number of sampled individuals. The (unknown) probability $p_s$ for an individual to belong to species $s$ is estimated by $\hat{p}_s = n_s/n$.
The number of species represented by $\nu$ individuals in the sample of size $n$ is $s^n_\nu$, so $s^n_0$ is the (unknown) number of unobserved species given the sampling effort. $s^n_\nu$ is considered as a realization of the random variable $S^n_\nu$, so it is used to estimate its expectation $E(S^n_\nu)$. $\pi_\nu$ is the sum of the probabilities $p_s$ of the species represented by $\nu$ individuals.
The deformed logarithm formalism (Tsallis, 1994) is very convenient to manipulate entropies. The deformed logarithm of order $q$ is defined as:
$$\ln_q x = \frac{x^{1-q} - 1}{1 - q} \qquad (1)$$
It converges to the natural logarithm when $q \to 1$. The inverse function of $\ln_q x$ is the deformed exponential:
$$e^x_q = \left[1 + (1 - q)x\right]^{\frac{1}{1-q}} \qquad (2)$$
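The two deformed functions, and the way they combine into HCDT entropy and Hill numbers as described in the introduction, can be illustrated with a minimal Python sketch. It is not the entropart implementation; the function names `ln_q`, `exp_q`, `hcdt_entropy` and `hill_number` are chosen here for illustration, and the probabilities are assumed to be known.

```python
import math

def ln_q(x, q):
    """Deformed logarithm of order q (eq. 1); natural logarithm when q is 1."""
    if math.isclose(q, 1.0):
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    """Deformed exponential of order q (eq. 2), the inverse of ln_q."""
    if math.isclose(q, 1.0):
        return math.exp(x)
    return (1.0 + (1.0 - q) * x) ** (1.0 / (1.0 - q))

def hcdt_entropy(probs, q):
    """Average surprise, the surprise of a species being ln_q(1/p)."""
    return sum(p * ln_q(1.0 / p, q) for p in probs if p > 0)

def hill_number(probs, q):
    """Effective number of species: deformed exponential of the entropy."""
    return exp_q(hcdt_entropy(probs, q), q)

# Four equally frequent species: diversity is 4 species whatever the order q.
uniform = [0.25] * 4
profile = {q: hill_number(uniform, q) for q in (0.0, 0.5, 1.0, 2.0)}
```

For the uniform community, the entropy of order 0 is the number of species minus one and the entropy of order 2 is the Simpson index, which is a quick way to check the special cases listed in the introduction.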
1.1 Sample coverage
The sample coverage (Good, 1953) is the probability for an individual in the community to belong to a species observed in the sample. It equals the sum of the probabilities of the observed species. It is an essential tool for diversity estimation because it is included in some estimators (e.g. Chao and Shen, 2003) and it allows the evaluation of the completeness of sampling (Chao and Jost, 2012). Its estimator given by Good is:
$$\hat{C} = 1 - \frac{s^n_1}{n} \qquad (3)$$
It is biased (Zhang and Huang, 2007), because:
$$C = 1 - \frac{E(S^n_1) - \pi_1}{n} \qquad (4)$$
Good's estimator neglects the term $\pi_1$, the sum of the probabilities of singletons. It was built from Turing's frequency formula relating the average probability of species observed $\nu$ times to the number of species observed $\nu + 1$ and $\nu$ times. This formula has been improved by Chao et al. (Chao and Shen, 2010; Chiu et al., 2014) to estimate $\pi_1$. Estimating the number of species by the Chao estimator (Chao, 1984), Chao and Shen (2010) obtained an improved estimator of the sample coverage:
$$\hat{C} = 1 - \frac{s^n_1}{n}\left[\frac{(n-1)s^n_1}{(n-1)s^n_1 + 2s^n_2}\right] \qquad (5)$$
This estimator has been further used by Chao and Jost (2015) to derive an estimator of entropy (see below). An almost unbiased estimator has been derived using the information provided by the whole distribution (Chao et al., 1988; Zhang and Huang, 2007):
$$\hat{C} = 1 - \sum_{\nu=1}^{n} (-1)^{\nu+1} \binom{n}{\nu}^{-1} s^n_\nu \qquad (6)$$
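The three coverage estimators of this section can be compared on a small vector of abundances with the following Python sketch. It is an illustration under stated assumptions, not the code of the entropart package: the function names are hypothetical, and the formulas follow the reading of equations (3), (5) and (6) in which the alternating sum estimates the probability mass of unobserved species.

```python
from collections import Counter
from math import comb

def abundance_frequencies(counts):
    """Map nu to the number of species observed exactly nu times."""
    return Counter(c for c in counts if c > 0)

def coverage_good(counts):
    """Good's estimator: one minus the proportion of singletons (eq. 3)."""
    n = sum(counts)
    return 1.0 - abundance_frequencies(counts)[1] / n

def coverage_chao(counts):
    """Improved estimator using singletons and doubletons (eq. 5)."""
    n = sum(counts)
    s = abundance_frequencies(counts)
    s1, s2 = s[1], s[2]
    if s1 == 0:
        return 1.0  # no singleton: the correction term vanishes
    return 1.0 - (s1 / n) * ((n - 1) * s1 / ((n - 1) * s1 + 2 * s2))

def coverage_zhang_huang(counts):
    """Estimator using the whole abundance distribution (eq. 6)."""
    n = sum(counts)
    s = abundance_frequencies(counts)
    missing = sum((-1) ** (nu + 1) * s_nu / comb(n, nu) for nu, s_nu in s.items())
    return 1.0 - missing

# Toy sample: 27 individuals, 10 species, 5 singletons, 2 doubletons.
sample = [10, 5, 3, 2, 2, 1, 1, 1, 1, 1]
estimates = (coverage_good(sample), coverage_chao(sample), coverage_zhang_huang(sample))
```

On this toy sample the two corrected estimators are slightly higher than Good's, as expected since Good's estimator ignores the term $\pi_1$.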
Zhang (2013) shows that the bias due to ignoring the remaining terms is asymptotically normal and decays exponentially fast. I call the estimator based on $\tilde{h}_q$ the Zhang and Grabchak (2014) estimator:
$${}^q\tilde{H} = \frac{1 - \tilde{h}_q}{q - 1}$$
Some attempts have been made to estimate the remaining bias (Zhang and Grabchak, 2013). The most accomplished one is that of Chao and Jost (2015), completing Chao et al. (2013). It relies on two assumptions: the total number of species is estimated by the Chao1 estimator, and the actual probabilities of unobserved species are assumed to be all equal. A consequence is that the estimator of the average probability of species sampled once also equals the probability estimator of unobserved species. Its value is denoted $A$. It is $2s^n_2/[(n-1)s^n_1 + 2s^n_2]$ if singletons and doubletons are present, or $2/[(n-1)(s^n_1 - 1) + 2]$ if doubletons are missing. The Chao-Wang-Jost estimator of HCDT entropy is:
$${}^q\tilde{H} = \frac{1}{q-1}\left[1 - \tilde{h}_q - \frac{s^n_1}{n}(1-A)^{1-n}\left(A^{q-1} - \sum_{r=0}^{n-1}\binom{q-1}{r}(A-1)^r\right)\right]$$
In the absence of singletons and doubletons, $A$ is set to 1 and the estimator is identical to that of Zhang and Grabchak.
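The value of $A$ and the extra term that the Chao-Wang-Jost estimator adds to the Zhang-Grabchak one can be sketched in Python as follows. This is a hedged illustration, not the entropart code: the function names are hypothetical, the sign of the extra term follows the usual form of the Chao and Jost (2015) estimator, and the direct summation shown here is only numerically safe for small samples.

```python
def chao_a(n, s1, s2):
    """Estimated average probability A of the species sampled once."""
    if s2 > 0:
        return 2.0 * s2 / ((n - 1) * s1 + 2.0 * s2)
    if s1 > 0:
        return 2.0 / ((n - 1) * (s1 - 1) + 2.0)
    return 1.0  # neither singletons nor doubletons

def cwj_extra_term(q, n, s1, s2):
    """(s1/n) (1-A)^(1-n) [A^(q-1) - sum_r C(q-1, r) (A-1)^r], r = 0..n-1."""
    a = chao_a(n, s1, s2)
    if s1 == 0 or a == 1.0:
        return 0.0  # the estimator reduces to Zhang-Grabchak
    series = 0.0
    coef = 1.0  # generalized binomial coefficient C(q-1, 0)
    for r in range(n):
        if r > 0:
            coef *= (q - r) / r  # C(q-1, r) from C(q-1, r-1)
        series += coef * (a - 1.0) ** r
    return (s1 / n) * (1.0 - a) ** (1 - n) * (a ** (q - 1) - series)

# Toy sample of 20 individuals with 5 singletons and 2 doubletons.
term_q0 = cwj_extra_term(0.0, 20, 5, 2)
term_q1 = cwj_extra_term(1.0, 20, 5, 2)
```

At $q = 0$ the extra term equals $\frac{n-1}{n}\frac{(s^n_1)^2}{2s^n_2}$, the number of unobserved species estimated by Chao1, and it vanishes when $q - 1$ is a small non-negative integer because the binomial series then terminates.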
1.3 Confidence intervals
Two methods allow the evaluation of confidence intervals: asymptotic, closed forms are available for some estimators, and bootstrapping is required in the general case. Esty (1983), completed by Zhang and Huang (2007), showed that the estimator of sample coverage (eq. 6) is asymptotically normal with the following confidence interval:
$$C = \hat{C} \pm t^n_{1-\alpha/2}\,\frac{\sqrt{s^n_1\left(1 - \frac{s^n_1}{n}\right) + 2s^n_2}}{n}$$
where $t^n_{1-\alpha/2}$ is the quantile of a Student distribution with $n$ degrees of freedom at the risk threshold $\alpha$, here 1.96 for all sample sizes and $\alpha = 5\%$. The Zhang-Grabchak estimator is also asymptotically normal and comes with an asymptotic confidence interval (Zhang and Grabchak, 2014) implemented in the package EntropyEstimation (Cao and Grabchak, 2014).
The theoretical distribution of the other estimators is unknown. Their confidence intervals must be built by bootstrap techniques: the observed community is re-sampled, say 1000 times, and entropy is calculated each time. The $\alpha/2$ and $1 - \alpha/2$ quantiles of the distribution of entropy are the bounds of the confidence interval. The issue of re-sampling a community is the same as that of sampling it: rare species are often eliminated, so the entropy is underestimated. Starting from the whole community, a first estimation bias is caused by sampling it. The estimators presented here aim at correcting it. When this observed community is re-sampled, a second estimation bias appears. Estimating the entropy of re-sampled communities with bias correction yields, on average, the entropy of the observed community estimated by the plug-in estimator (Marcon et al., 2012): if the estimator works well, it eliminates the second estimation bias but it cannot address the first one. The solution to this problem is simply to recenter the entropy distribution of re-sampled communities around the value of the entropy of the observed community (Marcon et al., 2012; Chao and Jost, 2015).
The re-sampling technique may just consist of drawing individuals in the observed community with replacement, or, equivalently, drawing a community in a multinomial distribution respecting the size and probability distribution of the observed community (Marcon et al., 2014a). A more sophisticated technique has been proposed by Chao and Jost (2015). Given the sample size, the probability distribution of observed species can be estimated more accurately than by the estimator $\tilde{p}_s = \hat{C}\hat{p}_s$, which underestimates the probability of rare species (Chao et al., 2015). A better estimate of the probabilities is used (actually, a simplified version of that of the unveiled estimators above) and completed by an estimation of the number of unobserved species, whose probabilities are assumed identical. Despite these extra efforts, the distribution of the entropy of re-sampled communities still has to be recentered.
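The simple multinomial re-sampling with recentering described above can be sketched in Python. This is an illustration under assumptions, not the procedure of any package: the function names are hypothetical, the plug-in Shannon entropy stands in for a bias-corrected estimator to keep the block short, and the simulated distribution is shifted so that its mean matches the value estimated from the observed community.

```python
import math
import random

def shannon_plugin(counts):
    """Plug-in Shannon entropy, used here only as a stand-in estimator."""
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

def bootstrap_interval(counts, estimator, n_boot=1000, alpha=0.05, seed=0):
    """Recentered bootstrap confidence interval of an entropy estimator."""
    rng = random.Random(seed)
    n = sum(counts)
    species = range(len(counts))
    simulated = []
    for _ in range(n_boot):
        # Draw n individuals with replacement from the observed community.
        resampled = [0] * len(counts)
        for s in rng.choices(species, weights=counts, k=n):
            resampled[s] += 1
        simulated.append(estimator(resampled))
    observed = estimator(counts)
    # Recenter the simulated distribution on the observed estimate.
    shift = observed - sum(simulated) / n_boot
    recentered = sorted(v + shift for v in simulated)
    lower = recentered[math.floor(alpha / 2 * (n_boot - 1))]
    upper = recentered[math.ceil((1 - alpha / 2) * (n_boot - 1))]
    return observed, lower, upper

observed, lower, upper = bootstrap_interval(
    [10, 5, 3, 2, 2, 1, 1, 1, 1, 1], shannon_plugin, n_boot=200, seed=1
)
```

Without the shift, the interval would be centered on the average entropy of the re-sampled communities, which is lower than the observed one because rare species are lost again at each draw.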
1.4 From entropy to diversity
All entropy estimations are finally transformed into diversity values to be interpretable (Jost, 2006). It is not correct to recenter the confidence interval of diversity estimations because of the non-linearity of the transformation of entropy into diversity (Marcon et al., 2012). The correct process consists of evaluating entropy with its confidence interval and then applying the final exponential transformation to all values to obtain diversity.
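The order of operations can be illustrated by a short Python sketch: the interval is built on entropy, then the estimate and both bounds are transformed by the deformed exponential. The function names and the numeric values are hypothetical, chosen only to make the asymmetry visible.

```python
import math

def exp_q(x, q):
    """Deformed exponential of order q (eq. 2)."""
    if math.isclose(q, 1.0):
        return math.exp(x)
    return (1.0 + (1.0 - q) * x) ** (1.0 / (1.0 - q))

def entropy_interval_to_diversity(estimate, lower, upper, q):
    """Transform an entropy estimate and its bounds into diversity."""
    return tuple(exp_q(v, q) for v in (estimate, lower, upper))

# Hypothetical Simpson entropy (q = 2) of 0.75 with bounds 0.70 and 0.80.
diversity, d_lower, d_upper = entropy_interval_to_diversity(0.75, 0.70, 0.80, 2.0)
```

The symmetric entropy interval becomes an asymmetric diversity interval (about 3.3 to 5 around 4 effective species), which is why recentering has to be done on entropy and not on diversity.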
1.5 Typical distributions
Comparing the performance of estimators requires simulations of realistic communities. I chose to focus on two opposed models that make sense in ecology. The lognormal distribution (Preston, 1948) fits species-rich communities well for several reasons, including population dynamics (Engen and Lande, 1996), niche apportionment (Bulmer, 1974), or even statistical physics arguments (Pueyo et al., 2007; Dewar and Porté, 2008). It is often well fitted
Figure 1. Rank-Abundance curves of 300 species following a lognormal (top curve) or a geometric distribution (straight line). The red lines are the fitted models. Axes: rank (x), probability (y, log scale).
empirically (Tokeshi, 1990) even though it has been questioned theoretically (Williamson and Gaston, 2005). The local community distribution according to the neutral theory (Volkov et al., 2003) is not lognormal but departs from it very moderately. The logarithm of the species probabilities follows a Gaussian distribution.
The geometric series model (Motomura, 1932; Whittaker, 1972) generates far more uneven species distributions. In this model, the first species is represented by a part p of the total resources. The second one has the same part p of the remaining resources, and so on. Finally, probabilities are normalized to be proportional to the resources taken.
I generated four artificial communities following those distributions. Figure 1 presents a lognormal one, with log-standard-deviation equal to 2 (typical of the distribution of tree species in a rainforest) and a geometric distribution with parameter p = 0.1. Both contain 300 species. The other two distributions have identical parameters except for the number of species augmented to
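Both reference distributions can be generated with a few lines of Python. This is a sketch under assumptions: the function names and the seed are hypothetical, and drawing the lognormal abundances at random is one possible reading of how such a community is built.

```python
import random

def geometric_probabilities(n_species, p):
    """Each species takes a part p of the resources left by the previous ones."""
    raw = [p * (1.0 - p) ** k for k in range(n_species)]
    total = sum(raw)
    return [x / total for x in raw]  # normalized to sum to 1

def lognormal_probabilities(n_species, sd_log, seed=0):
    """Probabilities whose logarithm follows a Gaussian distribution."""
    rng = random.Random(seed)
    raw = sorted((rng.lognormvariate(0.0, sd_log) for _ in range(n_species)),
                 reverse=True)
    total = sum(raw)
    return [x / total for x in raw]

geometric = geometric_probabilities(300, 0.1)
lognormal = lognormal_probabilities(300, 2.0)
```

With p = 0.1 the rarest of the 300 geometric species has a probability of the order of 1e-15, which illustrates why the tail of this distribution cannot be observed with any realistic sampling effort.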
1.6 Evaluation of the performance of estimators
The performance of each estimator was calculated as its average relative bias on all values of q (i.e. the average difference between the mean simulated entropy and its true value) and its Root Mean Square Error (RMSE, i.e. the square root of the sum of the squared bias and the variance, divided by the true value). The true entropy of each reference distribution was calculated with the known values of $p_s$. For each reference distribution, 1000 random samples of the chosen size were drawn in a multinomial distribution respecting the reference probabilities $p_s$. Entropy was calculated for q between 0 and 2. The average entropy and its first and last 2.5% quantiles were retained to build the profile and its confidence envelope (which is quite different from that of the estimation of real communities). Finally, entropy was transformed into diversity to be plotted.
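The evaluation loop can be sketched in Python for a single quantity. The sketch is not the code of the electronic appendix: names are hypothetical, the number of observed species is used as a deliberately naive estimator, and the averaging over all values of q is left out.

```python
import math
import random

def simulate_performance(probs, estimator, true_value, sample_size,
                         n_sim=1000, seed=0):
    """Relative bias and relative RMSE of an estimator on multinomial samples."""
    rng = random.Random(seed)
    species = range(len(probs))
    values = []
    for _ in range(n_sim):
        counts = [0] * len(probs)
        for s in rng.choices(species, weights=probs, k=sample_size):
            counts[s] += 1
        values.append(estimator(counts))
    mean = sum(values) / n_sim
    variance = sum((v - mean) ** 2 for v in values) / n_sim
    bias = mean - true_value
    rmse = math.sqrt(bias ** 2 + variance)
    return bias / true_value, rmse / true_value

def observed_richness(counts):
    """Plug-in estimator of the number of species."""
    return sum(1 for c in counts if c > 0)

# Geometric community of 50 species with p = 0.1, samples of 100 individuals.
raw = [0.1 * 0.9 ** k for k in range(50)]
probs = [x / sum(raw) for x in raw]
rel_bias, rel_rmse = simulate_performance(probs, observed_richness, 50, 100, n_sim=50)
```

The relative bias is negative here because a sample can only miss species, never add any, which is the underestimation the corrected estimators try to compensate.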
2 Results
I drew multinomial samples of various sizes in the chosen species distributions, simulating a real, independent sampling of individuals. Sample sizes are between 200 and 5000 individuals to cover a range from obvious undersampling to a high-effort inventory: 5000 individuals correspond to 9 to 10 ha of forest.
2.1 Sample coverage
I first evaluated the performance of the estimator of sample coverage. Two communities of each size between 200 and 5000 individuals were sampled in each typical distribution. The real and estimated sample coverages are compared in figure 2. The estimation of sample coverage is very efficient. A model II linear regression (Legendre, 2014) validated the accuracy of the estimation. Conditional on the sample size, the relation vanishes but the average estimation is very close to the average actual value: as predicted by the theory, the estimation bias is very small (Figures 7 and 8).
2.2 Entropy and diversity
I estimated the entropy profiles of the lognormal and geometric distributions of 300 and 600 species, sampled at 4 different intensities (200, 500, 1000 and 5000 individuals), by 9 estimators: Chao-Shen, Grassberger, Chao-Wang-Jost, Zhang-Grabchak, Generalized Coverage, the three unveiled and the Plug-in. The diversity profiles are plotted in the electronic appendices. The root mean square error of the estimators is shown in figure 3 for the lognormal and the geometric distributions with 300 species when 1000 individuals are sampled, a typical tropical forest inventory of trees.
Unsurprisingly, the plug-in estimator is severely biased and has the poorest results in the tests. The Chao-Wang-Jost estimator systematically outperforms the Zhang-Grabchak estimator (which actually performs little better than the plug-in estimator here) by construction. Its additional bias correction does not come at the price of increased variance. The Grassberger estimator is totally inefficient for low values of q, as already noticed by Marcon et al. (2014a). The generalized coverage estimator outperforms Chao-Shen because of its better estimation of conditional probabilities. The Chao-unveiled estimator is almost indistinguishable from the Chao-Wang-Jost estimator. Both are outperformed by the iChao-unveiled estimator because it improves the estimation of the number of
Figure 4. Diversity profiles estimated from 1000 random samples of 1000 individuals from a lognormal community of 300 species. The bold, black line represents the real diversity (starting from $^0D = 269$). The jackknife-unveiled estimator is plotted by the blue, dashed line ($^0\hat{D} \approx 250$). Its confidence interval (blue dots) is very wide. The Chao-Wang-Jost estimator (red, bold line: $^0\hat{D} = 209$) is more biased downward but its confidence interval (red, solid lines) is much smaller. Axes: order of diversity (x), diversity (y).
species. The jackknife-unveiled estimator is more flexible than the previous ones to estimate the number of species. The order of the jackknife estimator it uses changes between simulations, causing an excessive variance for q < 0.1. It performs best for higher orders of diversity.
Results are consistent whatever the model. The general pattern is a poor estimation of low orders of diversity, and a quite accurate estimation of high orders, as previously shown by Haegeman et al. (2013). The RMSE varies a lot according to the model.
I continue the analysis with the best two estimators, Chao-Wang-Jost and jackknife-unveiled, ignoring the iChao-unveiled estimator, which lies between them but is too similar to the jackknife-unveiled to bring decisive arguments for the discussion. Figure 4 shows their profiles for a 1000-individual sample of a lognormal distribution of 300 species, with their confidence intervals.
3 Discussion
The underlying distribution of species is the most important determinant of the success of diversity estimation: the estimation bias of heavy-tailed distributions decays more slowly when the sample size is increased (Zhang and Grabchak, 2013). Estimating the low-order diversity of a sample from a geometric distribution is all but impossible (Haegeman et al., 2013) but the low-order diversity of lognormal communities can be estimated meaningfully when the sample size is sufficient. Empirically, it is not possible to discriminate between a severely-censused geometric distribution and a lognormal one (Tokeshi, 1993): both models fit well since most of the difference is contained in the unobserved tails of the distributions. So, theoretical, ecological arguments about the actual distribution of the community are necessary to decide whether an estimation of diversity is reliable.
Diversity of order over 0.5 is pretty well estimated in the context of this paper. Haegeman et al. showed that this remains true for q ≥ 1, even when geometric communities of millions of species with parameter 0.5 (the most abundant species takes half the resources, the second one a quarter and so on) are addressed.
3.1 The sample coverage is not always the good indicator of the quality of estimation
The sample coverage cannot be used as a proxy for how much an estimate of diversity can be relied upon. At the same sampling effort, the sample coverage appears to be higher for the geometric distribution (Figures 7 and 8). Far more species are not sampled than in a lognormal distribution, but their total probability is smaller. For example, samples of 200 individuals drawn in 300-species geometric and lognormal communities yield an average estimation of 54 and 149 species by the jackknife-unveiled estimator, but the respective sample coverages are over 95% for the geometric distribution versus around 81% for the lognormal one. The estimation bias is thus much greater for low orders of diversity even though the sample coverage is higher.
Chao and Jost (2012) argue in favor of the sample coverage as a better measure of the sampling effort than the sample size. I agree as long as the underlying distribution of communities is the same: then, standardizing the sampling effort by the sample coverage is pertinent.
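The contrast between coverage and the number of sampled species can be checked without simulation, since both have a closed-form expectation when the probabilities are known: the expected coverage is $\sum_s p_s[1 - (1 - p_s)^n]$ and the expected number of observed species is $\sum_s [1 - (1 - p_s)^n]$. The Python sketch below uses these two standard results; the function names are hypothetical, and the lognormal community is built from evenly spaced Gaussian quantiles, a deterministic stand-in for random draws.

```python
import math
from statistics import NormalDist

def geometric_probabilities(n_species, p):
    raw = [p * (1.0 - p) ** k for k in range(n_species)]
    total = sum(raw)
    return [x / total for x in raw]

def lognormal_probabilities(n_species, sd_log):
    """Deterministic lognormal community from evenly spaced Gaussian quantiles."""
    normal = NormalDist()
    raw = [math.exp(sd_log * normal.inv_cdf((i + 0.5) / n_species))
           for i in range(n_species)]
    total = sum(raw)
    return [x / total for x in raw]

def expected_coverage(probs, n):
    """Expected total probability of the species seen in n individuals."""
    return sum(p * (1.0 - (1.0 - p) ** n) for p in probs)

def expected_observed_species(probs, n):
    """Expected number of species seen at least once in n individuals."""
    return sum(1.0 - (1.0 - p) ** n for p in probs)

geometric = geometric_probabilities(300, 0.1)
lognormal = lognormal_probabilities(300, 2.0)
summary = {
    name: (expected_coverage(probs, 200), expected_observed_species(probs, 200))
    for name, probs in (("geometric", geometric), ("lognormal", lognormal))
}
```

Under these assumptions the geometric community has the higher expected coverage while revealing fewer of its 300 species, which is the situation described above.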
3.2 Comparing the diversity of real communities with different distributions remains untractable
When the number of species of the theoretical distributions is doubled, everything else being equal, the sampling bias increases (compare figures 14 and 18). With the same sampling effort, the coverage of the lognormal distribution decreases (compare figures 7 and 8, left columns). Doubling the effort brings both the sample coverage and the bias back to their previous level, with a reduced variance (compare figures 18c and 14e).
This is a very simple and intuitive behavior, but it is completely different with the geometric distribution: the sample coverage does not change when richness is doubled (compare figures 7 and 8, right columns) because the probabilities of the 300 rarest species are negligible. Doubling the sample size does not restore the bias level (compare figures 18d and 14f). An extensive and rigorous analysis of the influence of the parameters of the theoretical distributions (beyond manipulating the number of species) is beyond the scope of this paper, but this simple example shows that no general and simple rules are available to compare the low-order diversity of communities of different nature.
3.3 Estimating the number of species is the critical step
The lower q, the more difficult the estimation is, but the estimation of the number of species has long been studied and simple decision rules have been proposed (Burnham and Overton, 1979; Brose et al., 2003) to choose the most appropriate order of the jackknife estimator. Burnham and Overton derived a selection procedure to obtain the order that minimizes the RMSE of the estimation of the number of species. It is implemented in the package SPECIES (Wang, 2011) for R. Brose et al. showed (empirically) that the first-order jackknife is selected when the sample completeness (terminology by Beck and Schwanghart, 2010), i.e. the proportion of observed species $(S - s^n_0)/S$, is over 3/4 (precisely 74% in their paper). When it is less, higher orders have less bias but more variance. It is easy to estimate the number of species of an actual sample this way and compare it to the Chao1 estimator. If both coincide, the Chao-Wang-Jost estimator will perform well for the whole profile: its value at q = 0 is that of Chao1. Otherwise, the jackknife-unveiled estimator will be the best choice since its value at q = 0 is the optimal-order jackknife. If one does not want to rely on the jackknife estimator for some reason, such as its poor theoretical support, the iChao-unveiled estimator is a reasonable compromise as a lower bound estimation.
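The comparison of Chao1 with a jackknife estimate can be sketched in Python. The sketch rests on several assumptions: the function names are hypothetical, Chao1 is written in its bias-corrected form, only the first two jackknife orders are given, the Burnham and Overton selection procedure is not implemented (the jackknife value is passed in by the user), and the tolerance that decides whether two estimates "coincide" is an arbitrary illustrative parameter, not a rule from the text.

```python
def richness_estimates(counts):
    """Chao1 and jackknife (orders 1 and 2) estimates of the number of species."""
    observed = [c for c in counts if c > 0]
    n = sum(observed)
    s_obs = len(observed)
    s1 = sum(1 for c in observed if c == 1)
    s2 = sum(1 for c in observed if c == 2)
    if s2 > 0:
        chao1 = s_obs + (n - 1) / n * s1 ** 2 / (2 * s2)
    else:
        chao1 = s_obs + (n - 1) / n * s1 * (s1 - 1) / 2
    jack1 = s_obs + s1 * (n - 1) / n
    jack2 = s_obs + s1 * (2 * n - 3) / n - s2 * (n - 2) ** 2 / (n * (n - 1))
    return {"chao1": chao1, "jackknife1": jack1, "jackknife2": jack2}

def suggest_profile_estimator(chao1, jackknife, rel_tolerance=0.05):
    """Hypothetical rule: Chao-Wang-Jost if both estimates roughly coincide."""
    if abs(chao1 - jackknife) <= rel_tolerance * jackknife:
        return "Chao-Wang-Jost"
    return "jackknife-unveiled"

estimates = richness_estimates([10, 5, 3, 2, 2, 1, 1, 1, 1, 1])
choice = suggest_profile_estimator(estimates["chao1"], estimates["jackknife1"])
```

In practice the jackknife order would come from the selection procedure mentioned above rather than being fixed in advance.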
3.4 Better, but probably not much better, estimators may be derived
The most promising research directions according to the present results are a better estimation of the remaining bias of the Zhang-Grabchak estimator and the improvement of the distribution modeling of the unveiled estimators. The first approach is that of the Chao-Wang-Jost estimator, which is limited by its estimation of the number of species (the lower bound, Chao1 estimator). The price for releasing this constraint is losing the elegant, closed form of the estimator, allowed by appropriate approximations of the infinite sum of the unknown elements of eq. (11), in favor of a numeric approximation.
The distribution of species is modeled with two parameters in the unveiled estimators. This can be refined by extending the technique presented by Chao et al. (2015) to higher orders of sample coverage.
In both cases, better fitting the data to reduce the bias has its limits because the variance of estimation is likely to increase (Bonachela et al., 2008). So, the estimators presented here may not be far from the optimum trade-off (less
Figure 5. Estimated diversity profile of the tree species of the BCI 50-ha plot. The shaded zone is the 95% confidence interval of the estimation. Axes: order of diversity (x), diversity (y).
bias with the jackknife-unveiled estimator, less variance with Chao-Wang-Jost).
4 Application to real data
I now estimate the diversity of two real forest plots. The first case is Barro Colorado Island's 50-ha plot of tropical forest (Hubbell et al., 2005), whose inventory data of trees over 10 cm diameter at breast height are available in the package vegan (Oksanen et al., 2012) for R. 225 species have been sampled, with a quite good fit to a lognormal distribution. The sample size is over 20000 individuals and the sample coverage is over 99.9%. Estimating the number of species with the Chao1 ( species) or the jackknife 1 (244) estimators gives very similar results. This is an unusually large dataset, whose diversity estimation (Figure 5) is quite easy. The best estimator is Chao-Wang-Jost since the Chao estimator is appropriate for the number of species. The 95% confidence interval of the estimation is built by re-sampling according to the technique by Chao and Jost (2015). It is very small due to the abundance of data.
The second example lies at the other extreme of sampling intensity. A 1-ha plot (plot 18) of tropical forest in the experimental forest of Paracou (Gourlet-Fleury et al., 2004), French Guiana, has been inventoried. Data are available in the package entropart for R. Only 481 trees over 10 cm diameter at breast height have been sampled. They belong to 149 species. The sample coverage is 84.6±4.4%. The estimated number of species is 254 according to Chao1, but the appropriate jackknife estimator (of order 3) returns 309 species. Clearly, the sampling effort is not sufficient for an accurate estimation: the sample coverage is too low and the estimation of the
Chao A, Hsieh TC, Chazdon RL, Colwell RK, Gotelli NJ (2015). “Unveiling the Species-Rank Abundance Distri- bution by Generalizing Good-Turing Sample Coverage Theory.” Ecology, 96 (5), 1189–1201.
Chao A, Jost L (2012). “Coverage-based rarefaction and extrapolation: standardizing samples by completeness rather than size.” Ecology, 93 (12), 2533–2547.
Chao A, Jost L (2015). “Estimating diversity and entropy profiles via discovery rates of new species.” Methods in Ecology and Evolution, 6 (8), 873–882.
Chao A, Lee SM, Chen TC (1988). “A generalized Good’s nonparametric coverage estimator.” Chinese Journal of Mathematics, 16 , 189–199.
Chao A, Shen TJ (2003). “Nonparametric estimation of Shannon’s index of diversity when there are un- seen species in sample.” Environmental and Ecological Statistics, 10 (4), 429–443.
Chao A, Shen TJ (2010). “Program SPADE: Species Prediction And Diversity Estimation. Program and user’s guide.” URL http://chao.stat.nthu.edu.tw/ softwareCE.html.
Chao A, Wang YT, Jost L (2013). “Entropy and the species accumulation curve: a novel entropy estimator via discovery rates of new species.” Methods in Ecology and Evolution, 4 (11), 1091–1100.
Chiu CH, Wang YT, Walther BA, Chao A (2014). “An Improved Nonparametric Lower Bound of Species Rich- ness via a Modified Good-Turing Frequency Formula.” Biometrics, 70 (3), 671–682.
Cormack RM (1989). “Log-Linear Models for Capture- Recapture.” Biometrics, 45 (2), 395–413.
Dar´oczy Z (1970). “Generalized information functions.” Information and Control, 16 (1), 36–51.
Dauby G, Hardy OJ (2012). “Sampled-based estimation of diversity sensu stricto by transforming Hurlbert diversities into effective number of species.” Ecography, 35 (7), 661–672.
Dewar RC, Port´e A (2008). “Statistical mechanics unifies different ecological patterns.” Journal of theoretical biology, 251 (3), 389–403.
Engen S, Lande R (1996). “Population dynamic models generating the lognormal species abundance distribu- tion.” Mathematical Biosciences, 132 (2), 169–183.
Esty WW (1983). “A Normal Limit Law for a Non- parametric Estimator of the Coverage of a Random Sample.” The Annals of Statistics, 11 (3), 905–912.
Good IJ (1953). “On the Population Frequency of Species and the Estimation of Population Parameters.” Biometrika, 40 (3/4), 237–264.
Gourlet-Fleury S, Guehl JM, Laroussinie O (2004). Ecology & Management of a Neotropical Rainforest. Lessons Drawnfrom Paracou, a Long-Term Experimen- tal Research Site in French Guiana. Elsevier, Paris, France.
Grassberger P (1988). “Finite sample corrections to entropy and dimension estimates.” Physics Letters A, 128 (6-7), 369–373.
Haegeman B, Hamelin J, Moriarty J, Neal P, Dushoff J, Weitz JS (2013). “Robust estimation of microbial diversity in theory and in practice.” The ISME journal, 7 (6), 1092–101.
Havrda J, Charv´at F (1967). “Quantification method of classification processes. Concept of structural a- entropy.” Kybernetika, 3 (1), 30–35.
Hill MO (1973). “Diversity and Evenness: A Unifying Notation and Its Consequences.” Ecology, 54 (2), 427–
Horvitz DG, Thompson DJ (1952). “A generalization of sampling without replacement from a finite uni- verse.” Journal of the American Statistical Association, 47 (260), 663–685.
Hubbell SP, Condit R, Foster RB (2005). “Barro Col- orado Forest Census Plot Data.” URL https://ctfs. arnarb.harvard.edu/webatlas/datasets/bci.
Jost L (2006). “Entropy and diversity.” Oikos, 113 (2), 363–375.
Lande R, DeVries PJ, Walla TR (2000). “When species accumulation curves intersect: implications for ranking diversity using small samples.” Oikos, 89 (3), 601–605.
Legendre P (2014). lmodel2: Model II Regression. R package version 1.7-2, URL http://CRAN.R-project.org/package=lmodel2.
Marcon E, Hérault B (2015a). “Decomposing Phylodiversity.” Methods in Ecology and Evolution, 6 (3), 333–339.
Marcon E, Hérault B (2015b). “entropart, an R Package to Partition Diversity.” Journal of Statistical Software, 67 (8), 1–26.
Marcon E, Hérault B, Baraloto C, Lang G (2012). “The Decomposition of Shannon’s Entropy and a Confidence Interval for Beta Diversity.” Oikos, 121 (4), 516–522.
Marcon E, Scotti I, Hérault B, Rossi V, Lang G (2014a). “Generalization of the partitioning of Shannon diversity.” Plos One, 9 (3), e90289.
Marcon E, Zhang Z, Hérault B (2014b). “The Decomposition of Similarity-Based Diversity and its Bias Correction.” HAL, 00989454 (version 3), 1–10.
Motomura I (1932). “On the statistical treatment of communities.” Zoological Magazine, 44 , 379–383.
Oksanen J, Blanchet FG, Kindt R, Legendre P, Minchin PR, O’Hara RB, Simpson GL, Solymos P, Stevens MHH, Wagner H (2012). “vegan: Community Ecology Package.” URL http://cran.r-project.org/package=vegan.
Patil GP, Taillie C (1982). “Diversity as a concept and its measurement.” Journal of the American Statistical Association, 77 (379), 548–561.
Preston FW (1948). “The Commonness, And Rarity, of Species.” Ecology, 29 (3), 254–283.
Pueyo S, He F, Zillio T (2007). “The maximum entropy formalism and the idiosyncratic theory of biodiversity.” Ecology letters, 10 (11), 1017–28.
R Development Core Team (2015). “R: A Language and Environment for Statistical Computing.” URL http://www.r-project.org.
Tokeshi M (1990). “Niche Apportionment or Random Assortment: Species Abundance Patterns Revisited.” Journal of Animal Ecology, 59 (3), 1129–1146.
Tokeshi M (1993). “Species Abundance Patterns and Community Structure.” Advances in Ecological Research, 24 , 111–186.
Tothmeresz B (1995). “Comparison of different methods for diversity ordering.” Journal of Vegetation Science, 6 (2), 283–290.
Tsallis C (1988). “Possible generalization of Boltzmann-Gibbs statistics.” Journal of Statistical Physics, 52 (1), 479–487.
Tsallis C (1994). “What are the numbers that experiments provide?” Química Nova, 17 (6), 468–471.
Volkov I, Banavar JR, Hubbell SP, Maritan A (2003). “Neutral theory and relative species abundance in ecology.” Nature, 424 (6952), 1035–1037.
Wang JP (2011). “SPECIES: An R Package for Species Richness Estimation.” Journal of Statistical Software, 40 (9), 1–15.
Whittaker RH (1972). “Evolution and Measurement of Species Diversity.” Taxon, 21 (2/3), 213–251.
Williamson M, Gaston KJ (2005). “The lognormal distribution is not an appropriate null hypothesis for the species-abundance distribution.” Journal of Animal Ecology, 74 (2001), 409–422.
Zhang Z (2013). “Asymptotic normality of an entropy estimator with exponentially decaying bias.” IEEE Transactions on Information Theory, 59 (1), 504–508.
Zhang Z, Grabchak M (2013). “Bias adjustment for a nonparametric entropy estimator.” Entropy, 15 (6), 1999–2011.
Zhang Z, Grabchak M (2014). “Entropic Representation and Estimation of Diversity Indices.” arXiv, 1403.3031(v. 2), 1–12.
Zhang Z, Huang H (2007). “Turing’s formula revisited.” Journal of Quantitative Linguistics, 14 (2-3), 222–241.
Zhang Z, Zhou J (2010). “Re-parameterization of multinomial distributions and diversity indices.” Journal of Statistical Planning and Inference, 140 (7), 1731–1738.
Figures 7 and 8 compare the estimated and the real sample coverages of 1000 samples of sizes between 200 and 5000 individuals, drawn from a lognormal and a geometric distribution of 300 or 600 species. The average estimated sample coverage virtually equals the average real coverage even when the sample size is small.
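The comparison between estimated and real coverage can be reproduced in miniature. The sketch below is not the code behind the figures: it is written in Python rather than with entropart, it uses Good's (1953) singleton-based estimator as a simple stand-in for whichever coverage estimator produced the figures, and the function names and the ratio of the geometric series are hypothetical choices made for illustration.

```python
import random
from collections import Counter


def geometric_community(n_species, ratio):
    """Species probabilities of a geometric series, normalised to sum to 1.

    `ratio` is an illustrative parameter, not a value taken from the paper.
    """
    weights = [ratio ** rank for rank in range(n_species)]
    total = sum(weights)
    return [w / total for w in weights]


def draw_sample(probabilities, n_individuals, rng):
    """Multinomial sample: a Counter mapping species index -> abundance."""
    species = range(len(probabilities))
    return Counter(rng.choices(species, weights=probabilities, k=n_individuals))


def real_coverage(probabilities, abundances):
    """Total probability of the species actually present in the sample."""
    return sum(probabilities[s] for s in abundances)


def good_coverage(abundances):
    """Good's estimator: 1 - (number of singletons) / (sample size)."""
    n = sum(abundances.values())
    singletons = sum(1 for count in abundances.values() if count == 1)
    return 1.0 - singletons / n


def mean_coverages(probabilities, n_individuals, n_simulations, seed=1):
    """Average real and estimated coverage over repeated simulated samples."""
    rng = random.Random(seed)
    real_total = 0.0
    estimated_total = 0.0
    for _ in range(n_simulations):
        abundances = draw_sample(probabilities, n_individuals, rng)
        real_total += real_coverage(probabilities, abundances)
        estimated_total += good_coverage(abundances)
    return real_total / n_simulations, estimated_total / n_simulations


community = geometric_community(300, 0.95)
real, estimated = mean_coverages(community, n_individuals=500, n_simulations=200)
print(f"mean real coverage: {real:.4f}, mean estimated coverage: {estimated:.4f}")
```

The real coverage is computable here only because the simulated community is known. With field data only the estimate is available, which is why its agreement with the real value on simulations matters.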
Figures 9 to 17 show the estimation of diversity profiles of communities of 300 species. Each figure presents an estimator. The diversity of the lognormal and of the geometric community is estimated, for sample sizes from 200 to 5000 individuals. 1000 simulations were performed to produce a 95% confidence envelope of the profiles. Figures 18 and ?? present the same profiles for 600-species communities, limited to the best two estimators.
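The simulation procedure behind such profiles can be sketched as follows, for the plug-in estimator only. This is an illustrative Python sketch, not the entropart code used for the figures: the function names, the ratio of the geometric series, the number of simulations and the quantile rule (plain order statistics) are all assumptions. Diversity is expressed as the Hill number of order q computed from the observed frequencies.

```python
import math
import random
from collections import Counter


def hill_number(probabilities, q):
    """Hill number (effective number of species) of order q."""
    positive = [p for p in probabilities if p > 0]
    if abs(q - 1.0) < 1e-12:
        # Order 1 is the limit case: exponential of Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in positive))
    return sum(p ** q for p in positive) ** (1.0 / (1.0 - q))


def plugin_diversity(abundances, q):
    """Plug-in estimate: observed frequencies stand in for probabilities."""
    n = sum(abundances)
    return hill_number([count / n for count in abundances], q)


def envelope(probabilities, n_individuals, orders, n_simulations, seed=1):
    """For each order q, the 2.5% and 97.5% order statistics of the estimates."""
    rng = random.Random(seed)
    species = range(len(probabilities))
    estimates = {q: [] for q in orders}
    for _ in range(n_simulations):
        sample = Counter(rng.choices(species, weights=probabilities, k=n_individuals))
        counts = list(sample.values())
        for q in orders:
            estimates[q].append(plugin_diversity(counts, q))
    bands = {}
    for q in orders:
        values = sorted(estimates[q])
        lower = values[int(0.025 * (n_simulations - 1))]
        upper = values[int(0.975 * (n_simulations - 1))]
        bands[q] = (lower, upper)
    return bands


# Hypothetical geometric community of 300 species (ratio chosen for illustration).
weights = [0.97 ** rank for rank in range(300)]
total = sum(weights)
community = [w / total for w in weights]

orders = [0.0, 0.5, 1.0, 2.0]
bands = envelope(community, n_individuals=500, orders=orders, n_simulations=200)
for q in orders:
    low, high = bands[q]
    print(f"q={q}: real={hill_number(community, q):.1f}, envelope=[{low:.1f}, {high:.1f}]")
```

At order 0 the plug-in estimate is the observed number of species, so its envelope necessarily lies below the real richness whenever the sample misses species. Replacing `plugin_diversity` with a bias-corrected estimator is the only change needed to test another estimator under the same procedure.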
[Figure 8 panels; axis label: Estimated Sample Coverage. (a) Lognormal, 5000 individuals; (b) Geometric, 5000 individuals; (c) Lognormal, 1000 individuals; (d) Geometric, 1000 individuals; (e) Lognormal, 500 individuals; (f) Geometric, 500 individuals; (g) Lognormal, 200 individuals; (h) Geometric, 200 individuals.]
Figure 8. Estimated vs real sample coverage of simulated samples. The dotted lines are the average values. The distributions contain 600 species. The plain line represents equality.
[Figure 9 panels; axis label: Diversity. (a) Lognormal, 300 species, 5000 individuals; (b) Geometric, 300 species, 5000 individuals; (c) Lognormal, 300 species, 1000 individuals; (d) Geometric, 300 species, 1000 individuals; (e) Lognormal, 300 species, 500 individuals; (f) Geometric, 300 species, 500 individuals; (g) Lognormal, 300 species, 200 individuals; (h) Geometric, 300 species, 200 individuals.]
Figure 9. Estimation by the plug-in estimator of the diversity profiles of simulated lognormal (left) and geometric (right) communities. The sample size decreases from 5000 (top) to 200 (bottom) individuals. The 95% confidence envelope of the estimation is shaded. The real diversity is plotted by the bold line.
[Figure 11 panels; axis label: Diversity. (a) Lognormal, 300 species, 5000 individuals; (b) Geometric, 300 species, 5000 individuals; (c) Lognormal, 300 species, 1000 individuals; (d) Geometric, 300 species, 1000 individuals; (e) Lognormal, 300 species, 500 individuals; (f) Geometric, 300 species, 500 individuals; (g) Lognormal, 300 species, 200 individuals; (h) Geometric, 300 species, 200 individuals.]
Figure 11. Estimation by the Grassberger estimator of the diversity profiles of simulated lognormal (left) and geometric (right) communities. The sample size decreases from 5000 (top) to 200 (bottom) individuals. The 95% confidence envelope of the estimation is shaded. The real diversity is plotted by the bold line.
[Figure 12 panels; axis label: Diversity. (a) Lognormal, 300 species, 5000 individuals; (b) Geometric, 300 species, 5000 individuals; (c) Lognormal, 300 species, 1000 individuals; (d) Geometric, 300 species, 1000 individuals; (e) Lognormal, 300 species, 500 individuals; (f) Geometric, 300 species, 500 individuals; (g) Lognormal, 300 species, 200 individuals; (h) Geometric, 300 species, 200 individuals.]
Figure 12. Estimation by the Chao-Shen estimator of the diversity profiles of simulated lognormal (left) and geometric (right) communities. The sample size decreases from 5000 (top) to 200 (bottom) individuals. The 95% confidence envelope of the estimation is shaded. The real diversity is plotted by the bold line.
[Figure 14 panels; axis label: Diversity. (a) Lognormal, 300 species, 5000 individuals; (b) Geometric, 300 species, 5000 individuals; (c) Lognormal, 300 species, 1000 individuals; (d) Geometric, 300 species, 1000 individuals; (e) Lognormal, 300 species, 500 individuals; (f) Geometric, 300 species, 500 individuals; (g) Lognormal, 300 species, 200 individuals; (h) Geometric, 300 species, 200 individuals.]
Figure 14. Estimation by the Chao-Wang-Jost estimator of the diversity profiles of simulated lognormal (left) and geometric (right) communities. The sample size decreases from 5000 (top) to 200 (bottom) individuals. The 95% confidence envelope of the estimation is shaded. The real diversity is plotted by the bold line.
[Figure 15 panels; axis label: Diversity. (a) Lognormal, 300 species, 5000 individuals; (b) Geometric, 300 species, 5000 individuals; (c) Lognormal, 300 species, 1000 individuals; (d) Geometric, 300 species, 1000 individuals; (e) Lognormal, 300 species, 500 individuals; (f) Geometric, 300 species, 500 individuals; (g) Lognormal, 300 species, 200 individuals; (h) Geometric, 300 species, 200 individuals.]
Figure 15. Estimation by the Chao-unveiled estimator of the diversity profiles of simulated lognormal (left) and geometric (right) communities. The sample size decreases from 5000 (top) to 200 (bottom) individuals. The 95% confidence envelope of the estimation is shaded. The real diversity is plotted by the bold line.