Model Selection: LRT, AIC, and BIC
Use the likelihood ratio test when one model is nested inside another. Use AIC or BIC when comparing models that are not nested. The three criteria can disagree, and understanding when they disagree is the actuarial point.
- Role: Concept
- Level: Core
- Time: Reference
- Freshness: Stable
Nested vs Non-Nested Models
Two parametric families are nested when one is a special case of the other. Exponential is nested inside gamma (set α = 1). Geometric is nested inside negative binomial (set r = 1). Poisson is nested inside negative binomial as the limit r → ∞ with rβ fixed (this is a boundary case and requires care).
Non-nested families do not contain each other. Lognormal and gamma are not nested; lognormal and Pareto are not nested; Weibull and gamma are not nested. Information criteria like AIC and BIC apply to non-nested comparisons; the likelihood ratio test does not, except via extensions such as Vuong's test that are not on the ASTAM syllabus.
Likelihood Ratio Test For Nested Families
Let L_0 be the maximized likelihood under the restricted (smaller) model and L_1 under the full (larger) model, with ℓ_0 = ln L_0 and ℓ_1 = ln L_1. The LRT statistic is Λ = 2(ℓ_1 − ℓ_0). Under the null that the restricted model is correct and standard regularity conditions, Λ asymptotically follows a chi-squared distribution with degrees of freedom equal to the number of restrictions.
This is Wilks' theorem. The degrees of freedom equal the number of parameters that the restricted model fixes — for testing exponential vs gamma, df = 1 because the restriction is α = 1.
Information Criteria For Non-Nested Comparison
Both AIC and BIC penalize the maximized log-likelihood ℓ by a function of the number of estimated parameters k: AIC = −2ℓ + 2k and BIC = −2ℓ + k ln n. AIC's penalty of 2 per parameter is flat in sample size; BIC's penalty of ln n grows with the sample.
For small n, the two criteria agree closely. For n above e^2 ≈ 7.39, BIC penalizes additional parameters more harshly than AIC. By n = 100, ln n ≈ 4.6, so BIC's per-parameter penalty is about 2.3 times AIC's. This is why BIC tends to select smaller models on large samples while AIC retains more parameters.
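The penalty comparison can be checked directly. A minimal sketch (the helper names `aic` and `bic` are illustrative, using the standard formulas AIC = −2ℓ + 2k and BIC = −2ℓ + k ln n):

```python
import math

def aic(loglik, k):
    """Akaike information criterion: flat penalty of 2 per parameter."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian information criterion: penalty of ln(n) per parameter."""
    return -2.0 * loglik + k * math.log(n)

# Per-parameter penalties cross where ln(n) = 2, i.e. n = e^2 ~ 7.39:
# below that BIC is the gentler criterion, above it the harsher one.
assert math.log(7) < 2.0 < math.log(8)

# At n = 100, BIC's per-parameter penalty is ln(100)/2 ~ 2.3x AIC's.
ratio = math.log(100) / 2.0
```

The crossover at n ≈ 7.39 is why the choice of criterion rarely matters for tiny samples but drives model size on large ones.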
How To Read AIC And BIC
Lower is better. The absolute value of AIC or BIC has no interpretation on its own; only differences between candidate models matter. A difference of 2 or less is often called a tie; a difference of 4 or more is strong evidence for the lower-AIC (or BIC) model.
The criteria are point estimates; bootstrap or cross-validation can be used to assess the stability of the comparison. ASTAM treats AIC and BIC at the formula level; PA and SRM go further into predictive validation.
When AIC And BIC Disagree
Suppose two candidate severity models for a sample of n = 500 claims have log-likelihoods ℓ_1 = -2,400 (k = 2, lognormal) and ℓ_2 = -2,395 (k = 3, three-parameter family).
AIC favors model 2: AIC_1 = 4,804, AIC_2 = 4,796, so model 2 wins by 8. BIC also favors model 2 but by a narrower margin: BIC_1 = 4,800 + 2 × 6.215 = 4,812.4, BIC_2 = 4,790 + 3 × 6.215 = 4,808.6, so model 2 wins by 3.8. The gap narrows because the BIC penalty per added parameter is ln(500) ≈ 6.2 versus AIC's flat 2. BIC flips to the smaller model when 2(ℓ_2 − ℓ_1) < (k_2 − k_1) · ln n; with the improvement of 5 here, the cutoff is n > e^{10} ≈ 22,000. The disagreement that does appear in practice is at moderate n with small log-likelihood improvements: AIC keeps the extra parameter, BIC drops it.
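The arithmetic in this comparison is easy to script. A sketch using the same hypothetical log-likelihoods (variable names are illustrative):

```python
import math

def aic(loglik, k):
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    return -2.0 * loglik + k * math.log(n)

n = 500
aic1, aic2 = aic(-2400.0, 2), aic(-2395.0, 3)        # 4804, 4796
bic1, bic2 = bic(-2400.0, 2, n), bic(-2395.0, 3, n)  # ~4812.4, ~4808.6

# BIC flips to the smaller model once ln(n) > 2(l2 - l1)/(k2 - k1) = 10,
# so the flip point for this improvement is n > e^10 ~ 22,000 claims.
flip_n = math.exp(10)
```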
Worked Example: LRT For Exponential vs Gamma
A sample of n = 50 claim sizes is fit by both exponential (k = 1) and gamma (k = 2). Maximized log-likelihoods are ℓ_0 = -310.4 for exponential and ℓ_1 = -307.1 for gamma. The LRT statistic is 2(ℓ_1 − ℓ_0) = 2(3.3) = 6.6.
Under H_0 (exponential), Λ ~ χ²_1. The 95th percentile of χ²_1 is 3.84 and the 99th percentile is 6.63. So 6.6 is on the edge: reject at α = 0.05, do not reject at α = 0.01. The borderline result is realistic; on ASTAM, partial credit goes to candidates who report both decisions with the relevant cut-offs.
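The same test in a few lines, using scipy's chi-squared distribution for the cutoffs (variable names are illustrative):

```python
from scipy.stats import chi2

# Exponential (restricted, l0) vs gamma (full, l1); one restriction, df = 1.
l0, l1 = -310.4, -307.1
lrt_stat = 2.0 * (l1 - l0)            # 6.6

crit_05 = chi2.ppf(0.95, df=1)        # ~3.841
crit_01 = chi2.ppf(0.99, df=1)        # ~6.635
reject_05 = lrt_stat > crit_05        # True: reject exponential at 5%
reject_01 = lrt_stat > crit_01        # False: cannot reject at 1%

p_value = chi2.sf(lrt_stat, df=1)     # just above 0.01, the borderline case
```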
Worked Example: AIC And BIC For Three Non-Nested Severity Models
Same n = 50 claims. Fitted log-likelihoods: lognormal (k = 2) ℓ = -305.8, Weibull (k = 2) ℓ = -307.0, Pareto (k = 2) ℓ = -304.2. All three have k = 2, so AIC ordering matches log-likelihood ordering: Pareto best at AIC = 612.4, lognormal next at 615.6, Weibull last at 618.0.
BIC also ranks them in the same order because k is constant: Pareto 616.2, lognormal 619.4, Weibull 621.8. With identical k, AIC and BIC must agree on ranking. They diverge only when candidate models have different parameter counts.
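Reproducing both rankings from the fitted log-likelihoods above (a sketch; dictionary keys are just labels):

```python
import math

models = {"lognormal": -305.8, "weibull": -307.0, "pareto": -304.2}
k, n = 2, 50

aic = {m: -2.0 * ll + 2.0 * k for m, ll in models.items()}
bic = {m: -2.0 * ll + k * math.log(n) for m, ll in models.items()}

# With identical k, both criteria add the same constant to -2l,
# so the rankings must coincide with the log-likelihood ordering.
aic_rank = sorted(aic, key=aic.get)
bic_rank = sorted(bic, key=bic.get)
```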
Selecting The Final Model
A practical actuarial workflow: (1) run goodness-of-fit on each candidate against the data (chi-squared, K-S, A-D); (2) discard any candidate that fails an absolute-fit test; (3) among the survivors, use LRT for nested comparisons and AIC or BIC for non-nested comparisons; (4) consider stability under bootstrap and out-of-sample loss for the final selection.
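The workflow above can be sketched end-to-end on simulated data. This is an illustration under stated assumptions, not a prescribed implementation: the candidate set, the 0.05 KS cutoff, and the use of AIC among survivors are all choices, and the KS p-values are only a rough screen because the parameters are estimated from the same data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
claims = rng.gamma(shape=2.0, scale=500.0, size=200)  # simulated severities

# Candidate non-nested severity families (location fixed at 0).
candidates = {
    "gamma": stats.gamma,
    "lognorm": stats.lognorm,
    "weibull": stats.weibull_min,
}

results = {}
for name, dist in candidates.items():
    params = dist.fit(claims, floc=0)      # (1) fit each candidate by MLE
    ks_p = stats.kstest(claims, dist.cdf, args=params).pvalue  # absolute fit
    loglik = np.sum(dist.logpdf(claims, *params))
    k = len(params) - 1                    # floc was fixed, not estimated
    results[name] = {"ks_p": ks_p, "aic": -2.0 * loglik + 2.0 * k}

# (2) discard candidates that fail the absolute-fit screen,
# (3) pick the survivor with the lowest AIC.
survivors = {m: r for m, r in results.items() if r["ks_p"] > 0.05}
best = min(survivors, key=lambda m: survivors[m]["aic"])
```

Step (4), bootstrap stability and out-of-sample loss, would wrap this loop in a resampling layer; the screen-then-compare structure stays the same.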
Both ASTAM and SRM grade on this kind of complete workflow rather than on a single computed criterion. Don't leave points on the table: report the test, the criterion, the decision, and the actuarial implication of the chosen model.