
Chaos, Complexity, and Inference (36-462)
Lecture 15: Estimating Heavy-Tailed Distributions

Cosma Shalizi

3 March 2009


Estimating Heavy-Tailed Distributions

Maximum likelihood: the good way to get power-law parameter estimates

Log-log regression: the bad way to get power-law parameter estimates

Non-parametric density estimation: do you care if you have a power law?

Further reading: Clauset et al. (2009)


Maximum Likelihood

Start with the Pareto (continuous) case; probability density:

\[ p(x; \alpha, x_{\min}) = \frac{\alpha - 1}{x_{\min}} \left( \frac{x}{x_{\min}} \right)^{-\alpha} \]

Assuming IID samples, the log-likelihood is easy:

\[ L(\alpha, x_{\min}) = n \log \frac{\alpha - 1}{x_{\min}} - \alpha \sum_{i=1}^{n} \log \frac{x_i}{x_{\min}} \]


Take the derivative and set it equal to zero at the MLE:

\[ \partial_\alpha L = \frac{n}{\hat{\alpha} - 1} - \sum_{i=1}^{n} \log \frac{x_i}{x_{\min}} = 0 \]

\[ \hat{\alpha} = 1 + \frac{n}{\sum_{i=1}^{n} \log x_i / x_{\min}} \]

What about x_min? If we know that it's really a Pareto, then the MLE for x_min is min x_i. Otherwise, see later.
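As a quick illustration, here is a minimal R sketch of this estimator; the function name, the standard-error line, and the simulated data are mine, not the lecture's (the lecture's own pareto.fit appears in code later):

# Pareto MLE for the exponent, given x_min (a sketch, not pareto.fit)
pareto.mle <- function(data, xmin) {
  x <- data[data >= xmin]        # use only the tail
  n <- length(x)
  alpha <- 1 + n / sum(log(x / xmin))
  se <- (alpha - 1) / sqrt(n)    # asymptotic standard error (see below)
  list(exponent = alpha, std.err = se, n.tail = n)
}

# Check on simulated Pareto(2.5, 1) data, via inverse-transform sampling:
x <- 1 * runif(10000)^(-1 / (2.5 - 1))
pareto.mle(x, xmin = 1)          # exponent should come out near 2.5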


Zipf or Zeta Distribution

Same story:

\[ p(x; \alpha, x_{\min}) = \frac{x^{-\alpha}}{\zeta(\alpha, x_{\min})} \]

\[ L(\alpha, x_{\min}) = -n \log \zeta(\alpha, x_{\min}) - \alpha \sum_{i=1}^{n} \log x_i \]

Setting the derivative to zero gives

\[ \frac{\zeta'(\hat{\alpha}, x_{\min})}{\zeta(\hat{\alpha}, x_{\min})} = -\frac{1}{n} \sum_{i=1}^{n} \log x_i \]

In practice it's easier to just maximize L numerically than to solve that equation


When x_min > 6 or so,

\[ \hat{\alpha} \approx 1 + \frac{n}{\sum_{i=1}^{n} \log \frac{x_i}{x_{\min} - 0.5}} \]

Result due to M. E. J. Newman; see Clauset et al. (2009)
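A quick check of this approximation against the numerical MLE, reusing the zeta.mle sketch above (the simulated sample is a rough discretized power law of my construction, not data from the lecture):

x <- floor(7 * runif(5000)^(-1 / 1.5))  # roughly zeta-distributed, xmin = 7, alpha near 2.5
c(newman.approx = 1 + length(x) / sum(log(x / (7 - 0.5))),
  numerical.mle = zeta.mle(x, xmin = 7))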


Properties of the MLE

1. Consistency: easiest to see for the Pareto. By the LLN,

\[ \frac{1}{n} \sum_{i=1}^{n} \log \frac{x_i}{x_{\min}} \to \mathbf{E}\left[ \log \frac{X}{x_{\min}} \right] = \frac{1}{\alpha - 1} \]

so α̂ → α; similarly for the Zipf.

2. Standard error:

\[ \mathrm{Var}[\hat{\alpha}] = \frac{(\alpha - 1)^2}{n} + O(n^{-2}) \]

Can plug in α̂, or do the jack-knife or bootstrap


Nonparametric Bootstrap in One Slide

Wanted: the sampling distribution of some estimator Ĝ of a functional G of a distribution F (e.g., a parameter)
Given: data x_1, x_2, ..., x_n, all assumed IID from F
Procedure: draw n samples, with replacement, from the data, giving b_1, b_2, ..., b_n
Calculate Ĝ(b_1, ..., b_n) = Ĝ_b
Repeat many times
The empirical distribution of Ĝ_b is about the sampling distribution of Ĝ
(some conditions apply)
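In code, for the standard error of the Pareto exponent (a sketch reusing the hypothetical pareto.mle defined earlier):

# Bootstrap standard error for the exponent estimate
boot.pareto.se <- function(data, xmin, B = 1000) {
  boot.alphas <- replicate(B, {
    resample <- sample(data, size = length(data), replace = TRUE)
    pareto.mle(resample, xmin)$exponent
  })
  sd(boot.alphas)  # spread of the replicates approximates the standard error
}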


Properties of the MLE (continued)

3. Asymptotically Gaussian and efficient:

\[ \hat{\alpha} \rightsquigarrow \mathcal{N}\left( \alpha, \frac{(\alpha - 1)^2}{n} \right) \]

and this is the fastest rate of convergence

4. (Pareto) If x_min is known or fixed, (α̂ − 1)/n has an inverse gamma distribution, which gives exact confidence intervals
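Concretely: S = Σ_i log(x_i/x_min) has a Gamma(n, α − 1) distribution (a sum of n IID exponentials), so (α − 1)S is a pivot. A sketch of the resulting exact interval (my code, assuming all data are ≥ x_min):

# Exact confidence interval for alpha when x_min is known
pareto.exact.ci <- function(data, xmin, level = 0.95) {
  S <- sum(log(data / xmin))
  n <- length(data)
  a <- (1 - level) / 2
  # (alpha - 1) * S ~ Gamma(n, 1), so invert the pivot:
  1 + qgamma(c(a, 1 - a), shape = n) / S
}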


Log-Log Regression

Recall that for a power law

\[ F^{\uparrow}(x) = \Pr(X \ge x) \propto x^{-(\alpha - 1)} \]

\[ \log F^{\uparrow}(x) = C - (\alpha - 1) \log x \]

Empirical survival function:

\[ F^{\uparrow}_n(x) \equiv \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{[x, \infty)}(x_i) \]

As n → ∞, F↑_n(x) → F↑(x). Estimate α by linearly regressing log F↑_n(x) on log x.
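For concreteness, a minimal R sketch of this estimator (my code, not the lecture's; shown only so the comparisons below can be reproduced):

# The "bad way": least squares on the log-log survival plot
llr.estimator <- function(data) {
  x <- sort(unique(data))
  surv <- sapply(x, function(v) mean(data >= v))  # empirical survival function
  fit <- lm(log(surv) ~ log(x))
  1 - coef(fit)[["log(x)"]]  # slope is -(alpha - 1), so alpha = 1 - slope
}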


History

The first real investigation of power-law data came with Vilfredo Pareto's work on economic inequality in the 1890s
Used log-log regression
Taken up by Zipf in the 1920s–1940s
Very widely used in physics, computer science, etc.


[Figure: US City Sizes with Log-Log Regression Line. Log-log plot of the survival function against population.]


If the data really come from a power law, this is consistent:

\[ \hat{\alpha}_{LLR} \to \alpha \]

but this doesn't say how fast it converges, and in fact the errors are large and persistent (compared to the MLE)


[Figure: Exponent estimates compared; 500 replicates at each sample size, from 100 to 20000. Simulated from Pareto(2.5, 1); blue = regression, black = MLE (shifted a bit for clarity); points show mean α̂ ± standard deviation, with estimated α running from 2.2 to 2.7 on the vertical axis.]


Why This Is Bad: Improperly Normalized

Notice that F↑(x_min) = 1, so the true log survival function crosses 0 at x_min
But the least-squares line does not do so in general! ⇒ the estimated function cannot be a probability distribution!
Could do constrained linear regression, but somehow you never see that (a sketch follows)
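For the record, such a constrained fit is easy: force the line through (log x_min, 0) by regressing without an intercept on the shifted predictor (my illustration, not the lecture's code):

# Constrained log-log regression: log-survival forced to 0 at xmin
llr.constrained <- function(data, xmin) {
  tail.data <- data[data >= xmin]
  x <- sort(unique(tail.data))
  surv <- sapply(x, function(v) mean(tail.data >= v))
  z <- log(x) - log(xmin)
  fit <- lm(log(surv) ~ z - 1)  # no intercept: line passes through (log(xmin), 0)
  1 - coef(fit)[["z"]]          # slope is -(alpha - 1)
}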


Why This Is Bad: Wrong Error Estimates

The usual formulas for standard errors in regression assume Gaussian noise:

\[ Y = \beta_0 + \beta_1 Z + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2) \]

so using those formulas here means you're assuming log-normal noise for F↑_n
but the central limit theorem says F↑_n has Gaussian noise
i.e., the usual formulas do not apply here
Can get error estimates (if you must) by bootstrap


Why This Is Bad: Lack of Power

People often point to a high R² for the regression as a sign that it must be right
This is always foolish when it comes to regression
It is especially foolish here: distributions like the log-normal have very high R² even with infinite samples
The R² test lacks power and severity against such alternatives


Example: Log-Log Regressions of Power Laws and Log-Normals

Simulated from Pareto(2.5, 5) and from log N(0.6626308, 0.65393343), the latter chosen to come close to the former. (Also, simulated values < 5 were discarded.)
Did log-log regression for both; the plot shows the distribution of R² values from the simulations.
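A single replicate of this comparison can be sketched as follows (my code, reusing the simulation trick and the slide's parameters; llr.r2 is a hypothetical helper, not the lecture's):

# Compare R^2 of log-log regressions on Pareto vs. lognormal samples
llr.r2 <- function(data) {
  x <- sort(unique(data))
  surv <- sapply(x, function(v) mean(data >= v))
  summary(lm(log(surv) ~ log(x)))$r.squared
}
n <- 10000
x.par <- 5 * runif(n)^(-1 / (2.5 - 1))               # Pareto(2.5, 5)
x.ln <- rlnorm(20 * n, 0.6626308, 0.65393343)
x.ln <- head(x.ln[x.ln >= 5], n)                     # discard values < 5
c(pareto = llr.r2(x.par), lognormal = llr.r2(x.ln))  # both R^2 come out high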


[Figure: R² values from samples; black = Pareto, blue = lognormal; 500 replicates at each sample size, from 1 to 10000; R² runs from 0.0 to 1.0 on the vertical axis.]

Note: R² stabilizes at over 0.9 for the log-normal!


Log-Log Regression on the Histogram

Bad as log-log regression of the survival function is, it's still better than log-log regression of the histogram:

1 Loss of information (unlike the survival function)
2 Results depend on the choice of bins for the histogram
3 Even bigger normalization issues
4 Even worse errors: comparatively larger fluctuations, especially in the tail, where they have the most leverage on the regression

There may be times when log-log regression on the survival function is reasonable (though I can't think of any); there are none when log-log regression on the histogram is


Conclusions about Log-Log Regression

1 Do not use it.
2 Do not believe papers using it.


Estimating x_min

Need to estimate x_min
Simple for a pure power law: to maximize the likelihood, x̂_min = min x_i
Not useful when it is only the tail which follows a power law


Hill Plots

One approach: try various x_min, plot α̂ vs. x_min, and look for a stable region
Called a "Hill plot" after Hill (1975)
Also gives an idea of the fragility of the results

> hill.estimator <- function(xmin, data) {
    pareto.fit(data, xmin)$exponent
  }
> hill.plotter <- function(xmin, data) {
    sapply(xmin, hill.estimator, data = data)
  }
> curve(hill.plotter(x, cities), from = min(cities), to = max(cities),
    log = "x", main = "Hill Plot for US City Sizes",
    xlab = expression(x[min]), ylab = expression(alpha))


[Figure: Hill Plot for US City Sizes; α̂ (from 1 to 7) against x_min (from 1 to 10^6).]

Note log scale of horizontal axis


Kolmogorov-Smirnov Distance

The Kolmogorov-Smirnov statistic is a measure of the distance between one-dimensional cumulative distribution functions:

\[ D_{KS}(F, G) = \sup_x |F(x) - G(x)| \]

Here look at

\[ D_{KS}(x_{\min}) = \sup_{x \ge x_{\min}} \left| F^{\uparrow}_n(x) - P(x; \hat{\alpha}, x_{\min}) \right| \]

where P(x; α̂, x_min) is the Pareto survival function we get by assuming a given x_min and estimating α̂


Estimate x_min by Minimizing D_KS

Pick the x_min where the distance between the data and the estimated distribution is smallest:

\[ \hat{x}_{\min} = \operatorname*{argmin}_{x_{\min} \in \{x_i\}} D_{KS}(x_{\min}) \]

Considering only the actual data values is faster and seems not to miss anything
Another principled approach: BIC (Handcock and Jones, 2004), but we find that works slightly less well than minimum KS (Clauset et al., 2009)


> ks.test.for.pareto <- function(threshold, data) {
    model <- pareto.fit(data, threshold)
    d <- ks.test(data[data >= threshold], ppareto,
      threshold = threshold, exponent = model$exponent)
    return(as.vector(d$statistic))
  }
> ks.test.for.pareto.vectorized <- function(threshold, data) {
    sapply(threshold, ks.test.for.pareto, data = data)
  }
> curve(ks.test.for.pareto.vectorized(x, cities),
    from = min(cities), to = max(cities),
    xlab = expression(x[min]), ylab = expression(d[KS]),
    main = "KS discrepancy vs. xmin for US cities")
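To actually pick x̂_min, minimize over the observed values, as on the previous slide (a direct but slow sketch, my code, using the functions above):

> candidates <- sort(unique(cities))
> d.ks <- ks.test.for.pareto.vectorized(candidates, data = cities)
> xmin.hat <- candidates[which.min(d.ks)]  # the KS-minimizing threshold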


[Figure: KS discrepancy vs. x_min for US cities; d_KS against x_min over the full range, 0 to 8 × 10^6.]

x̂_min is evidently small


[Figure: KS discrepancy vs. x_min for US cities, zoomed in to x_min ≤ 10^5.]

x̂_min = 5.246 × 10^4


Properties

1. In simulations, when there is a power-law tail, this is good at finding it
2. When there isn't a distinct tail but there is an asymptotic exponent, it chooses x_min such that α̂ becomes right
3. Error estimates: bootstrap


Nonparametric Density Estimation as an Alternative

All of this assumes a power-law tail, i.e., a parametric form
Often this is neither justified nor important, but estimating the distribution is
Can then use non-parametric density estimation


Kernel Density Estimation in One Slide

Data x_1, x_2, ..., x_n:

\[ \hat{p}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_n} K\left( \frac{x - x_i}{h_n} \right) \]

where the kernel K has K ≥ 0, ∫K(x)dx = 1, ∫xK(x)dx = 0, 0 < ∫x²K(x)dx < ∞
and the bandwidth h_n → 0, nh_n → ∞ as n → ∞
common choice of K: the standard Gaussian density φ
Ideally, h_n = O(n^{-1/5})
Generally, pick h_n by cross-validation
basic R command: density
better: use the np package from CRAN (Hayfield and Racine, 2008)
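For instance, a one-liner with the basic command (illustrative data of my choosing; "ucv" selects the bandwidth by unbiased cross-validation):

> x <- rlnorm(1000)                                 # some right-skewed sample
> d <- density(x, kernel = "gaussian", bw = "ucv")  # Gaussian-kernel KDE
> plot(d, main = "Kernel density estimate")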


Ordinary non-parametric estimation works poorly with heavy-tailed data, since it generally produces light tails
Special methods exist, e.g. (sketched in code below):

transform the data so [0, ∞) ↦ [0, 1] monotonically, e.g., (2/π) arctan x
do ordinary density estimation on the transformed data, being careful to keep Pr([0, 1]) = 1
apply the reverse transformation to the estimated density

See Markovitch and Krieger (2000) or Markovich (2007) (harder to read)
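A minimal sketch of that recipe (my code, not from the references; note it skips the boundary correction the second step warns about, so some probability mass leaks outside [0, 1]):

# KDE for heavy tails via the arctan transform
heavy.tail.kde <- function(data, eval.points) {
  y <- (2 / pi) * atan(data)          # map [0, Inf) to [0, 1)
  d <- density(y, from = 0, to = 1)   # ordinary KDE on the transformed scale
  q <- approxfun(d$x, d$y)            # interpolate the estimated density
  y0 <- (2 / pi) * atan(eval.points)
  # change of variables: dy/dx = (2/pi) / (1 + x^2)
  q(y0) * (2 / pi) / (1 + eval.points^2)
}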


Next time: how to tell the difference between power laws and other distributions


References

Clauset, Aaron, Cosma Rohilla Shalizi and M. E. J. Newman (2009). "Power-law distributions in empirical data." SIAM Review, forthcoming. URL http://arxiv.org/abs/0706.1062.

Handcock, Mark S. and James Holland Jones (2004). "Likelihood-based inference for stochastic models of sexual network formation." Theoretical Population Biology, 65: 413–422. URL http://www.csss.washington.edu/Papers/wp29.pdf.

Hayfield, Tristen and Jeffrey S. Racine (2008). "Nonparametric Econometrics: The np Package." Journal of Statistical Software, 27(5): 1–32. URL http://www.jstatsoft.org/v27/i05.

Hill, B. M. (1975). "A simple general approach to inference about the tail of a distribution." Annals of Statistics, 3: 1163–1174. URL http://www.jstor.org/pss/2958370.

Markovich, Natalia (2007). Nonparametric Analysis of Univariate Heavy-Tailed Data: Research and Practice. New York: John Wiley.

Markovitch, Natalia M. and Udo R. Krieger (2000). "Nonparametric estimation of long-tailed density functions and its application to the analysis of World Wide Web traffic." Performance Evaluation, 42: 205–222. doi:10.1016/S0166-5316(00)00031-6.
