Confidence Interval for a ratio

1 Introduction

We surveyed two count variables (\(X\) and \(Y\)) from a population and we are interested in their ratio \(Y/X\). We want to make inference about the average ratio in the population.

We need \(X > 0\) to avoid infinities in the ratio. Typically, we survey one variable (\(Y\)) given strictly positive values of the other, so choose the variables accordingly. Specifically, we survey households inhabited by at least one person and, conditional on that, count the number of dogs. Not the converse.
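As a minimal sketch of this convention, assuming a hypothetical data frame with columns `humans` and `dogs` (names and values are illustrative, not the survey's), the ratio is computed only for inhabited households:

```r
# Hypothetical toy survey: one row per household (illustrative values only).
households <- data.frame(
  humans = c(4, 2, 0, 5, 3),
  dogs   = c(1, 0, 0, 0, 2)
)

# Keep only inhabited households so that the ratio is always finite.
inhabited <- subset(households, humans > 0)

# Household-level dog-human ratio Y / X.
inhabited$dh_ratio <- inhabited$dogs / inhabited$humans

mean(inhabited$dh_ratio)
```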

2 Data description

Figure 2.1: Sample distributions of survey data (dog-human ratio, number of humans and number of dogs) by zone.

3 Method 1: classical CI for a population mean

This is the most standard and classical method, which is based on asymptotic normality of the sample mean of any distribution (Wang 2001).

Let \(R = Y/X\) be the quantity of interest. Consider the sample mean \(\bar R = \sum_{i=1}^n r_i/n\) and variance \(S^2 = \sum_{i=1}^n (r_i - \bar R)^2 / (n-1)\).

The theory says that, when \(n\) is large, \((\bar R - \mu_R) / (S/\sqrt{n})\) is approximately distributed as \(t_{n-1}\). Thus, an approximate confidence interval for the population mean \(\mu_R\) is \[ \bar R \pm t_{\alpha/2}\,S/\sqrt{n} \] where \(1-\alpha\) is the confidence level and \(t_{\alpha/2}\) is the upper-tail \(\alpha/2\) quantile of the Student \(t\) distribution with \(n-1\) degrees of freedom.

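# Classical t-based confidence interval for the mean of the ratios in x.
confint_ratio_normal <- function(x, alpha = 0.05) {
  hatx <- mean(x)  # sample mean
  s2 <- var(x)     # sample variance (denominator n - 1)
  n <- length(x)
  # Upper-tail alpha/2 quantile of the Student t with n - 1 degrees of freedom
  ta2 <- qt(alpha/2, n - 1, lower.tail = FALSE)

  hatx + ta2 * sqrt(s2/n) * c(-1, 1)
}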

Table 3.1: Results using the Normal asymptotic approximation.

zone     Mean    Variance   95% CI lower   95% CI upper
RURAL    0.057   0.039      0.027          0.087
URBAIN   0.063   0.041      0.032          0.093
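
As a quick cross-check of the formula, the following sketch computes the interval by hand on simulated ratios (illustrative values, not the survey data) and compares it with the interval reported by base R's one-sample `t.test()`, which should coincide:

```r
set.seed(1)
# Simulated household ratios (illustrative only, not the survey data).
r <- rpois(200, lambda = 0.3) / sample(1:8, 200, replace = TRUE)

n <- length(r)
alpha <- 0.05
half_width <- qt(alpha / 2, df = n - 1, lower.tail = FALSE) * sd(r) / sqrt(n)
ci_manual <- mean(r) + c(-1, 1) * half_width

# Base R's one-sample t-test reports the same interval.
ci_ttest <- as.numeric(t.test(r, conf.level = 1 - alpha)$conf.int)

all.equal(ci_manual, ci_ttest)
```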

If all that is required is a rough estimate of the average dog-human ratio in urban and rural areas, this method is good enough.

However, we cannot make inference about the difference between the urban and rural ratios. In particular, drawing conclusions from the overlap of the two confidence intervals is incorrect. For that we need something else, like bootstrap estimates.

4 Method 2: Bootstrapping

The bootstrap is a non-parametric computational approach based on resampling the observed data with replacement in order to simulate a large number of replications of the experiment.

There are several variations of the method for calculating confidence intervals. I will use the basic version here.
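For orientation, the basic interval reflects the bootstrap quantiles around the observed estimate \(\hat\theta\): the lower limit is \(2\hat\theta - q_{1-\alpha/2}\) and the upper limit is \(2\hat\theta - q_{\alpha/2}\), where \(q_p\) is the \(p\) quantile of the bootstrap replicates. A minimal hand-rolled sketch on simulated ratios follows; the function name `basic_boot_ci` and the data are assumptions for illustration only.

```r
set.seed(42)
# Simulated household ratios (illustrative only, not the survey data).
r <- rpois(150, lambda = 0.3) / sample(1:8, 150, replace = TRUE)

# Hypothetical helper: basic bootstrap interval for the mean of x.
basic_boot_ci <- function(x, n_boot = 2000, alpha = 0.05) {
  theta_hat <- mean(x)
  # Resample the data with replacement and recompute the mean each time.
  theta_star <- replicate(n_boot, mean(sample(x, replace = TRUE)))
  q <- quantile(theta_star, c(1 - alpha / 2, alpha / 2), names = FALSE)
  # Basic interval: reflect the bootstrap quantiles around the estimate.
  2 * theta_hat - q
}

ci <- basic_boot_ci(r)
ci
```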

library(boot)

# Statistic for boot(): zone means and their difference on the resampled rows i.
get_estimates <- function(x, i) {
  i_u <- i[x$zone[i] == "URBAIN"]
  i_r <- i[x$zone[i] == "RURAL"]
  m_u <- mean(x$dh_ratio[i_u])
  m_r <- mean(x$dh_ratio[i_r])
  c(RURAL = m_r, URBAIN = m_u, `U-R` = m_u - m_r)
}
# 10,000 bootstrap replicates
clean_data_boot <- boot(clean_data, get_estimates, R = 1e4)

Table 4.1: Results using the basic Bootstrap method.

zone     Mean    95% CI lower   95% CI upper
RURAL    0.057   0.023          0.082
URBAIN   0.063   0.030          0.090
U-R      0.006   -0.036         0.049
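
The source does not show how the intervals were extracted from the bootstrap object. One plausible route, sketched here on simulated data with the `boot` package, is `boot.ci()` with `type = "basic"`. The data frame `toy` and the helper `diff_means` are assumptions for illustration.

```r
library(boot)

set.seed(123)
# Simulated stand-in for the survey data (illustrative only).
toy <- data.frame(
  zone = rep(c("RURAL", "URBAIN"), each = 100),
  dh_ratio = rpois(200, lambda = 0.3) / sample(1:8, 200, replace = TRUE)
)

# Statistic: difference of zone means (urban minus rural) on a resample.
diff_means <- function(d, i) {
  d <- d[i, ]
  mean(d$dh_ratio[d$zone == "URBAIN"]) - mean(d$dh_ratio[d$zone == "RURAL"])
}

toy_boot <- boot(toy, diff_means, R = 2000)

# type = "basic" requests the basic bootstrap interval.
ci <- boot.ci(toy_boot, conf = 0.95, type = "basic")
ci$basic[4:5]  # lower and upper endpoints
```

For a statistic that returns several values, as `get_estimates` does, the `index` argument of `boot.ci()` selects which component the interval refers to.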

We get similar confidence intervals for the dog-human ratio in rural and urban areas, with a slight downward shift, especially in rural areas. More importantly, we also get inference on the difference, which is an order of magnitude smaller than the ratios themselves and not significant, since its interval contains zero.

Yet, if we need more than the population mean, such as predictive statements (what is the chance of a household having more dogs than expected, or of having at least one dog), then we need a proper model.

5 Method 3: Statistical modelling

Since we are working with counts, the simplest model is Poisson.

\[\begin{equation} \begin{aligned} \label{eq:model} Y_i & \sim \text{Po}(X_i\cdot\lambda_i) \\ \log\lambda_i & = \beta_0 + \beta_U \mathbb{I}_U \end{aligned} \end{equation}\] where \(\mathbb{I}_U\) is an indicator variable of urban area.

In this model, the number of dogs \(Y_i\) in a household \(i\) is a Poisson random variable, with mean proportional to the number of human inhabitants \(X_i\) with a proportionality constant \(\lambda_i\), which is the parameter of interest. This represents the average dog-human ratio, and depends on the zone (rural/urban) of the household.

Specifically, \(\log\lambda_R = \beta_0\) is the log of the average dog-human ratio in rural areas, and \(\log\lambda_U = \beta_0 + \beta_U\) is the log of the average dog-human ratio in urban areas.
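A minimal sketch of how such a model can be fit in R, on simulated data with known ratios (the column names `humans` and `dogs` and all parameter values are assumptions for illustration). With the log link that `glm()` uses by default for the Poisson family, a mean of \(X_i\lambda_i\) corresponds to an offset of \(\log X_i\) on the linear-predictor scale.

```r
set.seed(2024)
n <- 2000
# Simulated households (illustrative values, not the survey data).
sim <- data.frame(
  zone   = factor(rep(c("RURAL", "URBAIN"), each = n / 2)),
  humans = sample(1:8, n, replace = TRUE)
)
lambda_true <- c(RURAL = 0.05, URBAIN = 0.08)
sim$dogs <- rpois(n, sim$humans * lambda_true[as.character(sim$zone)])

# Poisson regression with log link; log(humans) enters as an offset.
fit <- glm(dogs ~ zone + offset(log(humans)), family = poisson, data = sim)

# Back-transform the coefficients to per-person ratios.
beta <- unname(coef(fit))
lambda_hat <- c(RURAL = exp(beta[1]), URBAIN = exp(beta[1] + beta[2]))
lambda_hat

# Predictive statement: chance that an urban household of 3 has at least one dog.
p_any_dog <- ppois(0, 3 * lambda_hat[["URBAIN"]], lower.tail = FALSE)
p_any_dog
```

Note that the call shown in the output that follows passes `clean_data$Nbh` directly as the offset. Assuming `Nbh` is the number of inhabitants, under the log link this implies a mean of \(e^{X_i}\lambda_i\) rather than \(X_i\lambda_i\), so it may be worth checking whether `log(Nbh)` was intended.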

## 
## Call:
## glm(formula = Ndog ~ zone, family = "poisson", data = clean_data, 
##     offset = clean_data$Nbh)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.2637  -0.3500  -0.1945  -0.0716   9.6940  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -8.7929     0.1543 -56.988   <2e-16 ***
## zoneURBAIN    1.8251     0.2209   8.261   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 640.23  on 333  degrees of freedom
## Residual deviance: 580.52  on 332  degrees of freedom
## AIC: 702.14
## 
## Number of Fisher Scoring iterations: 8
## # A tibble: 3 x 2
##   zone       Mean
##   <chr>     <dbl>
## 1 RURAL  0.000152
## 2 URBAIN 0.000942
## 3 U/R    6.20

This gives much smaller ratios, and a factor of about 6 in favour of urban areas. Note that we no longer talk of differences but of a relative factor, as a consequence of the model formulation.

Computing confidence intervals here would involve the use of the Bootstrap again, but using the model estimates instead of empirical averages. However, there is evidence of over-dispersion in the data, so a more appropriate model should be first developed in order to continue the analysis.
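In the output above, the residual deviance of 580.52 on 332 degrees of freedom already points in that direction. One common diagnostic, not necessarily the one used here, is the Pearson dispersion statistic, which should be close to 1 under a Poisson model. A minimal sketch on simulated over-dispersed counts (all values are illustrative assumptions):

```r
set.seed(7)
n <- 500
humans <- sample(1:8, n, replace = TRUE)
# Over-dispersed counts: negative binomial with mean humans * 0.3.
dogs <- rnbinom(n, mu = humans * 0.3, size = 0.5)

fit <- glm(dogs ~ 1 + offset(log(humans)), family = poisson)

# Pearson dispersion statistic: close to 1 under a Poisson model,
# clearly above 1 when counts are over-dispersed.
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
dispersion
```

When the statistic is well above 1, alternatives such as a quasi-Poisson or negative binomial model are often considered.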

6 Conclusions

If simple confidence intervals in urban and rural areas are required for reporting, the Normal approximation will suffice.

If the difference between urban and rural ratios is of interest as well, then a Bootstrap method can be used.

However, if the actual interest is in the number of dogs that can be expected in a household, then a proper statistical model is needed. In particular, modelling reveals that considering the difference of ratios can be misleading, since it is determined in part by the distribution of the number of inhabitants in urban and rural areas. It might make more sense to consider the ratios instead.

References

Wang, F. K. 2001. “Confidence Interval for the Mean of Non-Normal Data.” Quality and Reliability Engineering International 17 (4): 257–67. https://doi.org/10.1002/qre.400.