Title: | Analysis of Heavy Tailed Distributions |
---|---|
Description: | An implementation of maximum likelihood estimators for a variety of heavy tailed distributions, including both the discrete and continuous power law distributions. Additionally, a goodness-of-fit based approach is used to estimate the lower cut-off for the scaling region. |
Authors: | Colin Gillespie [aut, cre] |
Maintainer: | Colin Gillespie <[email protected]> |
License: | GPL-2 | GPL-3 |
Version: | 0.70.6.9001 |
Built: | 2025-01-12 04:35:40 UTC |
Source: | https://github.com/csgillespie/powerlaw |
The poweRlaw package aims to make fitting power laws and other heavy-tailed distributions straightforward. This package contains R functions for fitting, comparing and visualising heavy tailed distributions. Overall, it provides a principled approach to fitting power laws to data.
The code developed in this package has been heavily influenced from the python and R code found at: http://tuvalu.santafe.edu/~aaronc/powerlaws/ . In particular, the R code of Laurent Dubroca and Cosma Shalizi.
A. Clauset, C.R. Shalizi, and M.E.J. Newman, "Power-law distributions in empirical data" SIAM Review 51(4), 661-703 (2009).
https://github.com/csgillespie/poweRlaw
To explore the uncertainity in the model fit, this package provides a
bootstrap
function.
The output from running 5000 bootstraps on the full
Moby Dick data set (for a discrete power law)
using the bootstrap
function.
The output from running 5000 bootstraps on the full
Moby Dick data set (for a discrete power law)
using the bootstrap_p
function.
The bootstrap_moby
values correspond to the first row of
table 6.1 in the Clauset et al paper:
bootstrap_moby$gof
the K-S statistic
bootstrap_moby$bootstraps
a data frame for the optimal values from the bootstrapping procedure. Column 1: K-S, Column 2: xmin, Column 3: alpha. So standard deviation of column 2 and 3 is 2.2 and 0.033 (the paper gives 2 and 0.02 respectively).
The bootstrap_p_moby
gives the p-value for the hypothesis
test of whether the data follows a power-law. For this simulation study,
we get a value of 0.43 (the paper gives 0.49).
A list
M. E. J. Newman, "Power laws, Pareto distributions and Zipf's law." Contemporary Physics 46, 323 (2005). See http://tuvalu.santafe.edu/~aaronc/powerlaws/data.htm for further details.
moby
, bootstrap
, bootstrap_p
## Generate the bootstrap_moby data set ## Not run: data(moby) m = displ$new(moby) bs = bootstrap(m, no_of_sims=5000, threads=4, seed=1) ## End(Not run) #' ## Generate the bootstrap_p_moby data set ## Not run: bs_p = bootstrap_p(m, no_of_sims=5000, threads=4, seed=1) ## End(Not run)
## Generate the bootstrap_moby data set ## Not run: data(moby) m = displ$new(moby) bs = bootstrap(m, no_of_sims=5000, threads=4, seed=1) ## End(Not run) #' ## Generate the bootstrap_p_moby data set ## Not run: bs_p = bootstrap_p(m, no_of_sims=5000, threads=4, seed=1) ## End(Not run)
Since it is possible to fit power law models to any data set,
it is recommended that alternative distributions are considered.
A standard technique is to use Vuong's test.
This is a likelihood ratio test for model selection using the
Kullback-Leibler criteria.
The test statistic, R
, is the ratio of the
log-likelihoods of the data between the two competing models.
The sign of R
indicates which model is better.
Since the value of R
is esimated,
we use the method proposed by Vuong, 1989 to select the model.
compare_distributions(d1, d2)
compare_distributions(d1, d2)
d1 |
A distribution object |
d2 |
A distribution object |
This function compares two models. The null hypothesis is that both classes of distributions are equally far from the true distribution. If this is true, the log-likelihood ratio should (asymptotically) have a Normal distribution with mean zero. The test statistic is the sample average of the log-likelihood ratio, standardized by a consistent estimate of its standard deviation. If the null hypothesis is false, and one class of distributions is closer to the "truth", the test statistic goes to +/-infinity with probability 1, indicating the better-fitting class of distributions.
This function returns
test_statistic
The test statistic.
p_one_sided
A one-sided p-value, which is an upper limit
on getting that small a log-likelihood ratio if the
first distribution, d1
, is actually true.
p_two_sided
A two-sided p-value, which is the probability of getting a log-likelihood ratio which deviates that much from zero in either direction, if the two distributions are actually equally good.
ratio
A data frame with two columns. The first column is
the x
value and second column is the difference in
log-likelihoods.
Code initially based on R code developed by Cosma Rohilla Shalizi (http://bactra.org/). Also see Appendix C in Clauset et al, 2009.
Vuong, Quang H. (1989): "Likelihood Ratio Tests for Model Selection and Non-Nested Hypotheses", Econometrica 57: 307–333.
######################################################## # Example data # ######################################################## x = rpldis(100, xmin=2, alpha=3) ######################################################## ##Continuous power law # ######################################################## m1 = conpl$new(x) m1$setXmin(estimate_xmin(m1)) ######################################################## ##Exponential ######################################################## m2 = conexp$new(x) m2$setXmin(m1$getXmin()) est2 = estimate_pars(m2) m2$setPars(est2$pars) ######################################################## ##Vuong's test # ######################################################## comp = compare_distributions(m1, m2) plot(comp)
######################################################## # Example data # ######################################################## x = rpldis(100, xmin=2, alpha=3) ######################################################## ##Continuous power law # ######################################################## m1 = conpl$new(x) m1$setXmin(estimate_xmin(m1)) ######################################################## ##Exponential ######################################################## m2 = conexp$new(x) m2$setXmin(m1$getXmin()) est2 = estimate_pars(m2) m2$setPars(est2$pars) ######################################################## ##Vuong's test # ######################################################## comp = compare_distributions(m1, m2) plot(comp)
The poweRlaw package supports a number of distributions:
Discrete power-law
Discrete log-normal
Discrete Poisson
Discrete Exponential
Continuous power-law
Continuous log-normal
Continuous exponential
Each object inherits the discrete_distribution
or the ctn_distribution
class.
... |
The object is typically created by passing
data using the |
a reference object
Each distribution object has four fields. However, the object
is typically created by passing
data, to the dat
field. Each field has standard
setters and getters. See examples below
The data set.
The lower threshold, xmin. Typically set after initialisation. For the continuous power-law, xmin >= 0 for the discrete distributions, xmin >0
A parameter vector. Typically set after initialisation. Note the lognormal distribution has two parameters.
A list. This list differs between objects and shouldn't be altered.
Distribution objects are reference classes. This means that when we copy
objects, we need to use the copy
method, i.e. obj$copy()
.
See the examples below for further details.
############################################################## #Load data and create distribution object # ############################################################## data(moby) m = displ$new(moby) ############################################################## #Xmin is initially the smallest x value # ############################################################## m$getXmin() m$getPars() ############################################################## #Set Xmin and parameter # ############################################################## m$setXmin(2) m$setPars(2) ############################################################## #Plot the data and fitted distribution # ############################################################## plot(m) lines(m) ############################################################## #Copying # ############################################################## ## Shallow copy m_cpy = m m_cpy$setXmin(5) m$getXmin() ## Instead m_cpy = m$copy()
############################################################## #Load data and create distribution object # ############################################################## data(moby) m = displ$new(moby) ############################################################## #Xmin is initially the smallest x value # ############################################################## m$getXmin() m$getPars() ############################################################## #Set Xmin and parameter # ############################################################## m$setXmin(2) m$setPars(2) ############################################################## #Plot the data and fitted distribution # ############################################################## plot(m) lines(m) ############################################################## #Copying # ############################################################## ## Shallow copy m_cpy = m m_cpy$setXmin(5) m$getXmin() ## Instead m_cpy = m$copy()
This is generic function for distribution objects. This function calculates the data or empirical cdf.
The functions dist_data_all_cdf
and dist_all_cdf
are only
available for discrete distributions.
Their main purpose is to optimise the bootstrap procedure, where generating a
vector xmin:xmax
is
very quick. Also, when bootstrapping very large values can be generated.
dist_all_cdf(m, lower_tail = TRUE, xmax = 1e+05) dist_data_cdf(m, lower_tail = TRUE, xmax = 1e+05) dist_data_all_cdf(m, lower_tail = TRUE, xmax = 1e+05) ## S4 method for signature 'discrete_distribution' dist_data_cdf(m, lower_tail = TRUE, xmax = 1e+05) ## S4 method for signature 'discrete_distribution' dist_data_all_cdf(m, lower_tail = TRUE, xmax = 1e+05) ## S4 method for signature 'ctn_distribution' dist_data_cdf(m, lower_tail = TRUE, xmax = 1e+05)
dist_all_cdf(m, lower_tail = TRUE, xmax = 1e+05) dist_data_cdf(m, lower_tail = TRUE, xmax = 1e+05) dist_data_all_cdf(m, lower_tail = TRUE, xmax = 1e+05) ## S4 method for signature 'discrete_distribution' dist_data_cdf(m, lower_tail = TRUE, xmax = 1e+05) ## S4 method for signature 'discrete_distribution' dist_data_all_cdf(m, lower_tail = TRUE, xmax = 1e+05) ## S4 method for signature 'ctn_distribution' dist_data_cdf(m, lower_tail = TRUE, xmax = 1e+05)
m |
a distribution object. |
lower_tail |
logical; if |
xmax |
default |
This method does *not* alter the internal state of the distribution objects.
########################################## #Load data and create distribution object# ########################################## data(moby_sample) m = displ$new(moby_sample) m$setXmin(7);m$setPars(2) ########################################## # The data cdf # ########################################## dist_data_cdf(m)
########################################## #Load data and create distribution object# ########################################## data(moby_sample) m = displ$new(moby_sample) m$setXmin(7);m$setPars(2) ########################################## # The data cdf # ########################################## dist_data_cdf(m)
This is a generic function for calculating
the cumulative distribution function (cdf) of
distribution objects. This is similar to base R's pnorm
for the normal distribution.
The dist_cdf
function calculates the
cumulative probability distribution for the
current parameters and xmin value.
dist_cdf(m, q = NULL, lower_tail = FALSE) ## S4 method for signature 'conlnorm' dist_cdf(m, q = NULL, lower_tail = TRUE) ## S4 method for signature 'conlnorm' dist_all_cdf(m, lower_tail = TRUE, xmax = 1e+05) ## S4 method for signature 'conexp' dist_cdf(m, q = NULL, lower_tail = TRUE) ## S4 method for signature 'conexp' dist_all_cdf(m, lower_tail = TRUE, xmax = 1e+05) ## S4 method for signature 'conpl' dist_cdf(m, q = NULL, lower_tail = TRUE) ## S4 method for signature 'conpl' dist_all_cdf(m, lower_tail = TRUE, xmax = 1e+05) ## S4 method for signature 'conweibull' dist_cdf(m, q = NULL, lower_tail = TRUE) ## S4 method for signature 'conweibull' dist_all_cdf(m, lower_tail = TRUE, xmax = 1e+05) ## S4 method for signature 'disexp' dist_cdf(m, q = NULL, lower_tail = TRUE) ## S4 method for signature 'disexp' dist_all_cdf(m, lower_tail = TRUE, xmax = 1e+05) ## S4 method for signature 'dislnorm' dist_cdf(m, q = NULL, lower_tail = TRUE) ## S4 method for signature 'dislnorm' dist_all_cdf(m, lower_tail = TRUE, xmax = 1e+05) ## S4 method for signature 'displ' dist_cdf(m, q = NULL, lower_tail = TRUE) ## S4 method for signature 'displ' dist_all_cdf(m, lower_tail = TRUE, xmax = 1e+05) ## S4 method for signature 'dispois' dist_cdf(m, q = NULL, lower_tail = TRUE) ## S4 method for signature 'dispois' dist_all_cdf(m, lower_tail = TRUE, xmax = 1e+05)
dist_cdf(m, q = NULL, lower_tail = FALSE) ## S4 method for signature 'conlnorm' dist_cdf(m, q = NULL, lower_tail = TRUE) ## S4 method for signature 'conlnorm' dist_all_cdf(m, lower_tail = TRUE, xmax = 1e+05) ## S4 method for signature 'conexp' dist_cdf(m, q = NULL, lower_tail = TRUE) ## S4 method for signature 'conexp' dist_all_cdf(m, lower_tail = TRUE, xmax = 1e+05) ## S4 method for signature 'conpl' dist_cdf(m, q = NULL, lower_tail = TRUE) ## S4 method for signature 'conpl' dist_all_cdf(m, lower_tail = TRUE, xmax = 1e+05) ## S4 method for signature 'conweibull' dist_cdf(m, q = NULL, lower_tail = TRUE) ## S4 method for signature 'conweibull' dist_all_cdf(m, lower_tail = TRUE, xmax = 1e+05) ## S4 method for signature 'disexp' dist_cdf(m, q = NULL, lower_tail = TRUE) ## S4 method for signature 'disexp' dist_all_cdf(m, lower_tail = TRUE, xmax = 1e+05) ## S4 method for signature 'dislnorm' dist_cdf(m, q = NULL, lower_tail = TRUE) ## S4 method for signature 'dislnorm' dist_all_cdf(m, lower_tail = TRUE, xmax = 1e+05) ## S4 method for signature 'displ' dist_cdf(m, q = NULL, lower_tail = TRUE) ## S4 method for signature 'displ' dist_all_cdf(m, lower_tail = TRUE, xmax = 1e+05) ## S4 method for signature 'dispois' dist_cdf(m, q = NULL, lower_tail = TRUE) ## S4 method for signature 'dispois' dist_all_cdf(m, lower_tail = TRUE, xmax = 1e+05)
m |
a distribution object. |
q |
a vector values where the function will be evaluated.
If |
lower_tail |
logical; if |
xmax |
default |
This method does *not* alter the internal state of the distribution objects.
########################################## #Load data and create distribution object# ########################################## data(moby_sample) m = displ$new(moby_sample) m$setXmin(7); m$setPars(2) ########################################## #Calculate the CDF at a particular values# ########################################## dist_cdf(m, 10:15) ########################################## #Calculate the CDF at the data values # ########################################## dist_cdf(m)
########################################## #Load data and create distribution object# ########################################## data(moby_sample) m = displ$new(moby_sample) m$setXmin(7); m$setPars(2) ########################################## #Calculate the CDF at a particular values# ########################################## dist_cdf(m, 10:15) ########################################## #Calculate the CDF at the data values # ########################################## dist_cdf(m)
This is generic function for distribution objects. This function calculates the log-likelihood for the current parameters and xmin value.
dist_ll(m) ## S4 method for signature 'conlnorm' dist_ll(m) ## S4 method for signature 'conexp' dist_ll(m) ## S4 method for signature 'conpl' dist_ll(m) ## S4 method for signature 'conweibull' dist_ll(m) ## S4 method for signature 'disexp' dist_ll(m) ## S4 method for signature 'dislnorm' dist_ll(m) ## S4 method for signature 'displ' dist_ll(m) ## S4 method for signature 'dispois' dist_ll(m)
dist_ll(m) ## S4 method for signature 'conlnorm' dist_ll(m) ## S4 method for signature 'conexp' dist_ll(m) ## S4 method for signature 'conpl' dist_ll(m) ## S4 method for signature 'conweibull' dist_ll(m) ## S4 method for signature 'disexp' dist_ll(m) ## S4 method for signature 'dislnorm' dist_ll(m) ## S4 method for signature 'displ' dist_ll(m) ## S4 method for signature 'dispois' dist_ll(m)
m |
a distribution object. |
The log-likelihood
This method does *not* alter the internal state of the distribution objects.
dist_cdf
, dist_pdf
and dist_rand
########################################## #Load data and create distribution object# ########################################## data(moby_sample) m = displ$new(moby_sample) m$setXmin(7); m$setPars(2) ########################################## #Calculate the log-likelihood # ########################################## dist_ll(m)
########################################## #Load data and create distribution object# ########################################## data(moby_sample) m = displ$new(moby_sample) m$setXmin(7); m$setPars(2) ########################################## #Calculate the log-likelihood # ########################################## dist_ll(m)
This is generic function for distribution objects. This function calculates the probability density function (pdf) for the current parameters and xmin value.
dist_pdf(m, q = NULL, log = FALSE) ## S4 method for signature 'conlnorm' dist_pdf(m, q = NULL, log = FALSE) ## S4 method for signature 'conexp' dist_pdf(m, q = NULL, log = FALSE) ## S4 method for signature 'conpl' dist_pdf(m, q = NULL, log = FALSE) ## S4 method for signature 'conweibull' dist_pdf(m, q = NULL, log = FALSE) ## S4 method for signature 'disexp' dist_pdf(m, q = NULL, log = FALSE) ## S4 method for signature 'dislnorm' dist_pdf(m, q = NULL, log = FALSE) ## S4 method for signature 'displ' dist_pdf(m, q = NULL, log = FALSE) ## S4 method for signature 'dispois' dist_pdf(m, q = NULL, log = FALSE)
dist_pdf(m, q = NULL, log = FALSE) ## S4 method for signature 'conlnorm' dist_pdf(m, q = NULL, log = FALSE) ## S4 method for signature 'conexp' dist_pdf(m, q = NULL, log = FALSE) ## S4 method for signature 'conpl' dist_pdf(m, q = NULL, log = FALSE) ## S4 method for signature 'conweibull' dist_pdf(m, q = NULL, log = FALSE) ## S4 method for signature 'disexp' dist_pdf(m, q = NULL, log = FALSE) ## S4 method for signature 'dislnorm' dist_pdf(m, q = NULL, log = FALSE) ## S4 method for signature 'displ' dist_pdf(m, q = NULL, log = FALSE) ## S4 method for signature 'dispois' dist_pdf(m, q = NULL, log = FALSE)
m |
a distribution object. |
q |
a vector values where the function will be evaluated.
If |
log |
default |
The probability density (or mass) function
This method does *not* alter the internal state of the distribution objects.
dist_cdf
, dist_ll
and dist_rand
########################################## #Create distribution object # ########################################## m = displ$new() m$setXmin(7); m$setPars(2) ########################################## #Calculate the pdf at particular values # ########################################## dist_pdf(m, 7:10)
########################################## #Create distribution object # ########################################## m = displ$new() m$setXmin(7); m$setPars(2) ########################################## #Calculate the pdf at particular values # ########################################## dist_pdf(m, 7:10)
This is generic function for generating random numbers
from the underlying distribution of the distribution reference objects.
This function generates n
random numbers using the parameters
and xmin values found in the associated reference object.
dist_rand(m, n) ## S4 method for signature 'conlnorm' dist_rand(m, n = "numeric") ## S4 method for signature 'conexp' dist_rand(m, n = "numeric") ## S4 method for signature 'conpl' dist_rand(m, n = "numeric") ## S4 method for signature 'conweibull' dist_rand(m, n = "numeric") ## S4 method for signature 'disexp' dist_rand(m, n = "numeric") ## S4 method for signature 'dislnorm' dist_rand(m, n = "numeric") ## S4 method for signature 'displ' dist_rand(m, n = "numeric") ## S4 method for signature 'dispois' dist_rand(m, n = "numeric")
dist_rand(m, n) ## S4 method for signature 'conlnorm' dist_rand(m, n = "numeric") ## S4 method for signature 'conexp' dist_rand(m, n = "numeric") ## S4 method for signature 'conpl' dist_rand(m, n = "numeric") ## S4 method for signature 'conweibull' dist_rand(m, n = "numeric") ## S4 method for signature 'disexp' dist_rand(m, n = "numeric") ## S4 method for signature 'dislnorm' dist_rand(m, n = "numeric") ## S4 method for signature 'displ' dist_rand(m, n = "numeric") ## S4 method for signature 'dispois' dist_rand(m, n = "numeric")
m |
a distribution object. |
n |
number of observations to be generated. |
n random numbers
This method does *not* alter the internal state of the distribution object.
dist_cdf
, dist_pdf
and dist_ll
########################################## #Create distribution object # ########################################## m = displ$new() m$setXmin(7);m$setPars(2) ########################################## #Generate five random numbers # ########################################## dist_rand(m, 5)
########################################## #Create distribution object # ########################################## m = displ$new() m$setXmin(7);m$setPars(2) ########################################## #Generate five random numbers # ########################################## dist_rand(m, 5)
Density and distribution function of the continuous power-law distribution, with parameters xmin and alpha.
dplcon(x, xmin, alpha, log = FALSE) pplcon(q, xmin, alpha, lower.tail = TRUE) rplcon(n, xmin, alpha)
dplcon(x, xmin, alpha, log = FALSE) pplcon(q, xmin, alpha, lower.tail = TRUE) rplcon(n, xmin, alpha)
x , q
|
vector of quantiles. The discrete power-law distribution is defined for x > xmin |
xmin |
The lower bound of the power-law distribution. For the continuous power-law, xmin >= 0. for the discrete distribution, xmin > 0. |
alpha |
The scaling parameter: alpha > 1. |
log |
logical (default FALSE) if TRUE, log values are returned. |
lower.tail |
logical;
if TRUE (default), probabilities are |
n |
Number of observations. If |
dplcon gives the denisty and pplcon gives the distribution function.
The discrete random number generator is very inefficient
xmin = 1; alpha = 1.5 x = seq(xmin, 10, length.out=1000) plot(x, dplcon(x, xmin, alpha), type="l") plot(x, pplcon(x, xmin, alpha), type="l", main="Distribution function") n = 1000 con_rns = rplcon(n, xmin, alpha) con_rns = sort(con_rns) p = rep(1/n, n) #Zipfs plot plot(con_rns, rev(cumsum(p)), log="xy", type="l")
xmin = 1; alpha = 1.5 x = seq(xmin, 10, length.out=1000) plot(x, dplcon(x, xmin, alpha), type="l") plot(x, pplcon(x, xmin, alpha), type="l", main="Distribution function") n = 1000 con_rns = rplcon(n, xmin, alpha) con_rns = sort(con_rns) p = rep(1/n, n) #Zipfs plot plot(con_rns, rev(cumsum(p)), log="xy", type="l")
Density, distribution function and random number generation for the discrete power law distribution with parameters xmin and alpha.
dpldis(x, xmin, alpha, log = FALSE) ppldis(q, xmin, alpha, lower.tail = TRUE) rpldis(n, xmin, alpha, discrete_max = 10000)
dpldis(x, xmin, alpha, log = FALSE) ppldis(q, xmin, alpha, lower.tail = TRUE) rpldis(n, xmin, alpha, discrete_max = 10000)
x , q
|
vector of quantiles. The discrete power-law distribution is defined for x > xmin. |
xmin |
The lower bound of the power-law distribution. For the continuous power-law, xmin >= 0. for the discrete distribution, xmin > 0. |
alpha |
The scaling parameter: alpha > 1. |
log |
logical (default FALSE) if TRUE, log values are returned. |
lower.tail |
logical;
if TRUE (default), probabilities are |
n |
Number of observations. If |
discrete_max |
The value when we switch from the discrete random numbers to a CTN approximation. |
The Clauset, 2009 paper provides an algorithm for generating discrete random numbers. However, if this algorithm is implemented in R, it gives terrible performance. This is because the algorithm involves "growing vectors". Another problem is when alpha is close to 1, this can result in very large random number being generated (which means we need to calculate the discrete CDF for very large values).
The algorithm provided in this package generates true
discrete random numbers up to 10,000 then switches to
using continuous random numbers. This switching point can altered by
changing the discrete_max
argument.
In order to get a efficient power-law discrete random number generator, the algorithm needs to be implemented in C.
dpldis returns the density, ppldis returns the distribution function and rpldis return random numbers.
The naming of these functions mirrors standard R functions, i.e. dnorm. When alpha is close to one, generating random number can be very slow.
Clauset, Aaron, Cosma Rohilla Shalizi, and Mark EJ Newman. "Power-law distributions in empirical data." SIAM review 51.4 (2009): 661-703.
xmin = 1; alpha = 2 x = xmin:100 plot(x, dpldis(x, xmin, alpha), type="l") plot(x, ppldis(x, xmin, alpha), type="l", main="Distribution function") dpldis(1, xmin, alpha) ############################################### ## Random number generation # ############################################### n = 1e5 x1 = rpldis(n, xmin, alpha) ## Compare with exact (dpldis(1, xmin, alpha)) sum(x1==1)/n ## Using only the approximation x2 = rpldis(n, xmin, alpha, 0) sum(x2==1)/n
xmin = 1; alpha = 2 x = xmin:100 plot(x, dpldis(x, xmin, alpha), type="l") plot(x, ppldis(x, xmin, alpha), type="l", main="Distribution function") dpldis(1, xmin, alpha) ############################################### ## Random number generation # ############################################### n = 1e5 x1 = rpldis(n, xmin, alpha) ## Compare with exact (dpldis(1, xmin, alpha)) sum(x1==1)/n ## Using only the approximation x2 = rpldis(n, xmin, alpha, 0) sum(x2==1)/n
estimate_pars
estimates the distribution's
parameters using their maximum likelihood estimator. This estimate
is conditional on the current xmin value.
estimate_pars(m, pars = NULL)
estimate_pars(m, pars = NULL)
m |
A reference class object that contains the data. |
pars |
default |
returns list.
data(moby_sample) m = displ$new(moby_sample) estimate_xmin(m) m$setXmin(7) estimate_pars(m)
data(moby_sample) m = displ$new(moby_sample) estimate_xmin(m) m$setXmin(7) estimate_pars(m)
When fitting heavy tailed distributions, sometimes it is necessary to estimate the lower threshold, xmin. The lower bound is estimated by minimising the Kolmogorov-Smirnoff statistic (as described in Clauset, Shalizi, Newman (2009)).
get_KS_statistic
Calculates the KS statistic for a particular value of xmin.
estimate_xmin
Estimates the optimal lower cutoff using a
goodness-of-fit based approach. This function may issue warnings
when fitting lognormal, Poisson or Exponential distributions. The
warnings occur for large values of xmin
. Essentially, we are discarding
the bulk of the distribution and cannot calculate the tails to enough
accuracy.
bootstrap
Estimates the unncertainty in the xmin and parameter values via bootstrapping.
bootstrap_p
Performs a bootstrapping hypothesis test to determine
whether a suggested
(typically power law) distribution is plausible. This is only available for distributions that
have dist_rand
methods available.
get_bootstrap_sims(m, no_of_sims, seed, threads = 1) bootstrap( m, xmins = NULL, pars = NULL, xmax = 1e+05, no_of_sims = 100, threads = 1, seed = NULL, distance = "ks" ) get_bootstrap_p_sims(m, no_of_sims, seed, threads = 1) bootstrap_p( m, xmins = NULL, pars = NULL, xmax = 1e+05, no_of_sims = 100, threads = 1, seed = NULL, distance = "ks" ) get_distance_statistic(m, xmax = 1e+05, distance = "ks") estimate_xmin(m, xmins = NULL, pars = NULL, xmax = 1e+05, distance = "ks")
get_bootstrap_sims(m, no_of_sims, seed, threads = 1) bootstrap( m, xmins = NULL, pars = NULL, xmax = 1e+05, no_of_sims = 100, threads = 1, seed = NULL, distance = "ks" ) get_bootstrap_p_sims(m, no_of_sims, seed, threads = 1) bootstrap_p( m, xmins = NULL, pars = NULL, xmax = 1e+05, no_of_sims = 100, threads = 1, seed = NULL, distance = "ks" ) get_distance_statistic(m, xmax = 1e+05, distance = "ks") estimate_xmin(m, xmins = NULL, pars = NULL, xmax = 1e+05, distance = "ks")
m |
A reference class object that contains the data. |
no_of_sims |
number of bootstrap simulations. When |
seed |
default |
threads |
number of concurrent threads used during the bootstrap. |
xmins |
default |
pars |
default |
xmax |
default |
distance |
A string containing the distance measure (or measures) to calculate.
Possible values are |
When estimating xmin
for discrete distributions, the search space when
comparing the data-cdf (empirical cdf)
and the distribution_cdf runs from xmin to max(x)
where x
is the data set. This can often be
computationally brutal. In particular, when bootstrapping
we generate random numbers from the power law distribution,
which has a long tail.
To speed up computations for discrete distributions it is sensible to put an
upper bound, i.e. xmax
and/or explicitly give values of where to search, i.e. xmin
.
Occassionally bootstrapping can generate strange situations. For example,
all values in the simulated data set are less then xmin
. In this case,
the estimated distance measure will be Inf
and the parameter values, NA
.
There are other possible distance measures that can be calculated. The default is the
Kolomogorov Smirnoff statistic (KS
). This is equation 3.9 in the CSN paper. The
other measure currently available is reweight
, which is equation 3.11.
Adapted from Laurent Dubroca's code found at http://tuvalu.santafe.edu/~aaronc/powerlaws/plfit.r
################################################### # Load the data set and create distribution object# ################################################### x = 1:10 m = displ$new(x) ################################################### # Estimate xmin and pars # ################################################### est = estimate_xmin(m) m$setXmin(est) ################################################### # Bootstrap examples # ################################################### ## Not run: bootstrap(m, no_of_sims=1, threads=1) bootstrap_p(m, no_of_sims=1, threads=1) ## End(Not run)
################################################### # Load the data set and create distribution object# ################################################### x = 1:10 m = displ$new(x) ################################################### # Estimate xmin and pars # ################################################### est = estimate_xmin(m) m$setXmin(est) ################################################### # Bootstrap examples # ################################################### ## Not run: bootstrap(m, no_of_sims=1, threads=1) bootstrap_p(m, no_of_sims=1, threads=1) ## End(Not run)
This function is now deprecated and may be removed in future versions.
get_KS_statistic(m, xmax = 1e+05, distance = "ks")
get_KS_statistic(m, xmax = 1e+05, distance = "ks")
m |
A reference class object that contains the data. |
xmax |
default |
distance |
A string containing the distance measure (or measures) to calculate.
Possible values are |
get_distance_statistic
Returns the sample size of the data set contained within the distribution object.
get_n(m)
get_n(m)
m |
a distribution object. |
################################################ # Load data and create example object ################################################ data(moby_sample) m = displ$new(moby_sample) ################################################ # get_n and length should return the same value ################################################ get_n(m) length(moby_sample)
################################################ # Load data and create example object ################################################ data(moby_sample) m = displ$new(moby_sample) ################################################ # get_n and length should return the same value ################################################ get_n(m) length(moby_sample)
Returns the number of data points greater than or equal to current value of xmin. In the Clauset et al, paper this is called 'ntail'.
get_ntail(m, prop = FALSE, lower = FALSE)
get_ntail(m, prop = FALSE, lower = FALSE)
m |
a distribution object. |
prop |
default |
lower |
default |
################################################ # Load data and create example object ################################################ data(moby_sample) m = displ$new(moby_sample) m$setXmin(7) ################################################ # Get ntail ################################################ get_ntail(m) sum(moby_sample >= 7)
################################################ # Load data and create example object ################################################ data(moby_sample) m = displ$new(moby_sample) m$setXmin(7) ################################################ # Get ntail ################################################ get_ntail(m) sum(moby_sample >= 7)
These are generic functions for distribution reference objects. Standard plotting functions, i.e. plot, points, and lines work with all distribution objects.
## S4 method for signature 'distribution' lines(x, cut = FALSE, draw = TRUE, length.out = 100, ...) ## S4 method for signature 'distribution' plot(x, cut = FALSE, draw = TRUE, ...) ## S4 method for signature 'distribution' points(x, cut = FALSE, draw = TRUE, length.out = 100, ...)
## S4 method for signature 'distribution' lines(x, cut = FALSE, draw = TRUE, length.out = 100, ...) ## S4 method for signature 'distribution' plot(x, cut = FALSE, draw = TRUE, ...) ## S4 method for signature 'distribution' points(x, cut = FALSE, draw = TRUE, length.out = 100, ...)
x |
a distribution reference object. |
cut |
logical (default |
draw |
logical (default |
length.out |
numeric, default 100. How many points should the distribution be evaulated at. This argument is only for plotting the fitted lines. |
... |
Further arguments passed to the |
This method does *not* alter the internal state of the distribution objects.
The frequency of occurrence of unique words in the novel Moby Dick by Herman Melville.
The data set moby_sample is 2000 values sampled from the moby data set.
A vector
M. E. J. Newman, "Power laws, Pareto distributions and Zipf's law." Contemporary Physics 46, 323 (2005). See http://tuvalu.santafe.edu/~aaronc/powerlaws/data.htm for further details.
These data files contain the observed casualties in the American Indian Wars.
The data sets native_american
and us_american
contain the
casualties on the Native American and US American
sides respectively. Each data set is a data frame, with two columns:
the number of casualties and the conflict date.
Data frame
Friedman, Jeffrey A. "Using Power Laws to Estimate Conflict Size." The Journal of conflict resolution (2014).
A simple wrapper around the plot function to aid with visualising the bootstrap results. The values plotted are returned as an invisible object.
## S3 method for class 'bs_xmin' plot(x, trim = 0.1, ...) ## S3 method for class 'bs_p_xmin' plot(x, trim = 0.1, ...) ## S3 method for class 'compare_distributions' plot(x, ...)
## S3 method for class 'bs_xmin' plot(x, trim = 0.1, ...) ## S3 method for class 'bs_p_xmin' plot(x, trim = 0.1, ...) ## S3 method for class 'compare_distributions' plot(x, ...)
x |
an object of class |
trim |
When plotting the cumulative means and standard deviation,
the first trim percentage of values are not displayed. default |
... |
graphics parameters to be passed to the plotting routines. |
This data set contains the population size of cities and towns in England. For further details on the algorithm used to determine city boundries, see the referenced paper.
vector
Arcaute, Elsa, et al. "City boundaries and the universality of scaling laws." arXiv preprint arXiv:1301.1674 (2013).
The distribution objects have an internal structure that is used for caching purposes.
Using the default show
method gives the illusion of duplicate values.
This show method aims to avoid this confusion.
## S4 method for signature 'distribution' show(object)
## S4 method for signature 'distribution' show(object)
object |
A distribution object. |
This dataset contains all the words extracted from the Swiss-Prot version 9 data (with the resulting frequency for each word). Other datasets for other database versions can be obtained by contacting Michael Bell (http://homepages.cs.ncl.ac.uk/m.j.bell1/annotationQualityPaper.php)
Full details in http://arxiv.org/abs/arXiv:1208.2175v1
data frame
Bell, MJ, Gillespie, CS, Swan, D, Lord, P. An approach to describing and analysing bulk biological annotation quality: A case study using UniProtKB. Bioinformatics 2012, 28, i562-i568.