1 Introduction

We have genetic samples of Culicoides from 24 sites spread around the Mediterranean Sea. The figure below display the sample locations.

We measured a genetic distance (CSE, Cavalli-Sforza and Edwards’ chord distance) between samples and recorded the sampling dates.

We assume that populations spread through areas of suitable environmental conditions and that between every pair of sampling locations there is a main/preferred dispersal path (unknown).

These paths can be modelled as least-cost paths of some latent rugosity landscape, determined by environmental factors (in an unknown relationship).

Assuming that the genetic distance between samples is associated with:

the length of the main dispersal path,
the environmental conditions along the path and
the temporal distance between the samples,

the objective of the present study is to estimate the rugosity landscape, the dispersal paths and the effect of the environmental factors on the genetic distance.

The method consists of an iterative procedure (EM-like algorithm, Dempster, Laird, and Rubin (1977), see appendix) that starts by assumming that the geodesic (direct) paths between locations are an approximation of the dispersal paths. Next, we:

compute a series of summaries of the environmental variables along the working paths that serve as covariates together with the length of the route and the difference in sampling times
fit a regression model for the genetic distances using the previous computed covariates
predict the rugosity (i.e. expected increase in genetic distance) at each pixel, given the environmental variables alone (i.e. geographical and temporal distance fixed at 0)
compute the next working paths as the least-cost paths between sampling locations, given the previously computed rugosity map

and iterate until convergence, when the maximum absolute relative difference between successive rugosity maps drops below 5%.

This method is based on Bouyer et al. (2015), with the difference that here we work over a domain with a large water body (the Mediterranean Sea). As a consequence, some of the variables in the statistical model are specific to land areas (e.g. tree-cover, elevation). Moreover, the common variables (e.g. air temperature) can have different effects over land or over sea.

2 Data description

Figure 2.1: Joint and marginal distributions of genetic and temporal distances between the 276 samples.

2.1 Environmental variables

Figure 2.2: Spatial representation of centered and scaled environmental covariates.

3 Results

3.1 Convergence process

The process converged after 5 iterations dropping down the established threshold of 5 % relative difference of maximum-absolute value of rugosity with respect to the previous step (Fig. 3.1).

Figure 3.1: Relative difference of maximum absolute rugosity with respect to previous step.

Below, the convergence of the rugosity map and least-cost dispersal paths are displayed.

## legend.postion is used for plot mode. Use view.legend.position in tm_view to set the legend position in view mode.

## Tip: rasters can be shown as layers instead of facets by setting tm_facets(as.layers = TRUE).

## The number of facets exceeds the limit of 4. The limit can be extended to 5 with:
## tmap_options(limits = c(facets.view = 5))

Figure 3.2 displays the relationship between the genetic distance and the model covariates in the final iteration.

Figure 3.2: Genetic distance vs covariates.

Some of the zeros that are isolated correspond to variables that exist only over either land or water for routes that are within water or land respectively. These should be more appropriately NA’s to avoid influencing the regression.

Here is the summary of the last-fit model.

## 
## Call:
## lm(formula = y ~ ., data = sumcovars[[i]])
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.080481 -0.018420  0.000052  0.019309  0.079451 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.877e-01  1.802e-02  10.419  < 2e-16 ***
## difftime                   2.457e-06  1.765e-06   1.392  0.16499    
## max0_sea_air_tmp_bio1      3.125e-03  2.221e-03   1.407  0.16075    
## max0_land_air_tmp_bio1     6.372e-03  2.457e-03   2.594  0.01004 *  
## max0_land_elev            -1.737e-05  8.134e-06  -2.136  0.03366 *  
## mean0_sea_avg_wind_speed  -1.208e-02  4.554e-03  -2.652  0.00849 ** 
## mean0_sea_avg_wind_dir     7.848e-03  5.443e-03   1.442  0.15055    
## mean0_sea_prec_may        -8.963e-04  4.314e-04  -2.078  0.03872 *  
## mean0_land_avg_wind_speed  1.456e-02  6.199e-03   2.348  0.01962 *  
## mean0_land_avg_wind_dir   -1.491e-03  7.512e-03  -0.199  0.84279    
## mean0_land_lst_bio1       -9.230e-04  1.880e-03  -0.491  0.62381    
## mean0_land_treecover       5.159e-04  3.493e-04   1.477  0.14094    
## mean0_land_prec_may       -8.361e-04  4.285e-04  -1.951  0.05210 .  
## mean0_land_cattledens     -9.072e-03  3.620e-03  -2.506  0.01283 *  
## mean0_land_sheepdens       5.743e-03  3.330e-03   1.725  0.08577 .  
## mean0_land_popdens         4.776e-03  3.245e-03   1.472  0.14225    
## mean0_land_suitability    -1.264e-03  5.795e-04  -2.181  0.03008 *  
## sum0_sea_nbdist            3.617e-04  1.223e-04   2.958  0.00338 ** 
## sum0_land_nbdist           6.005e-04  1.893e-04   3.172  0.00170 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02832 on 257 degrees of freedom
## Multiple R-squared:   0.38,  Adjusted R-squared:  0.3366 
## F-statistic: 8.753 on 18 and 257 DF,  p-value: < 2.2e-16

And finally, the estimated rugosity-map and least-cost dispersal paths.

4 TODO

Variables in model centered-scaled for regression?
NA’s rather than zeros for variables over land/water-specific routes?
Consider transformations (e.g. log) of some of the environmental variables?
Linear model vs. LMM with spatial component?

5 Conclusions

Some more work is needed on the regression model (see TODO items)
The method seems to work (i.e. converge to some place) but I have not been able to put it clearly within any established theoretical framework (see appendix). Thus, I have doubts about the theoretical convergence to the optimal/true/convenient value. The justification used in the previous paper was based on the improvement of AIC, which was quite convincing there, but not so much here (but again, some more modelling effort is needed). I’ve also thought of performing a experiment with simulated data to see whether we can recover the simulation parameters. But if you have other ideas on this, they are welcome.

6 Appendix: Relationship with the Expectation-Maximisation algorithm

This is not exactly an EM-algortigm, but very similar.

Let:

\(y\): the measured genetic distance between all pairs of populations
\(\theta = (\beta', \gamma', \sigma_y, \sigma_z)'\): the vector of regression parameters
\(z\): the hidden rugosity at each pixel

We consider the following augmented model for the joint distribution of genetic distances and rugosity surfaces:

\[ f(y, z \,|\, \theta) = f(y \,|\, z, \theta) \, f(z \,|\, \theta) \] with, \[ \tag{1} y \,|\, z, \theta \sim \mathcal{N}\left(X_e(z)\beta + X_d(z)\gamma,\, \sigma_y\right), \] where \(X_e(z)\) is the design matrix of environmental covariates summarised along the least-cost dispersal paths and \(X_d(d)\) is the design matrix of the time-differences and the lengths of the dispersal paths.

On the other hand, \[ \tag{2} z \,|\, \theta \sim \mathcal{N}\left(X_e \beta,\, \sigma_z\right), \] where \(X_e\) is the design matrix of environmental covariates evaluated at each pixel and \(\beta\) is the same vector of coefficients as for component (1).

EM focuses on computing (in the Expectation step) \[ Q(\theta \,|\, \theta_0) = \mathbb{E}_{Z\,|\,y,\theta_0} \left[ \log f(y, z \,|\, \theta) \,|\, y, \theta_0 \right]. \] This is, the expected log-likelihood of the joint distribution of the observations and the rugosity with respect to the posterior distribution of the rugosity (pluging-in \(\theta_0\)) as a function of \(\theta\).

In the Maximisation step, this quantity is maximised with respect to \(\theta\), which is used in turn in the next iteration as the plug-in value.

In our case, we can’t compute this quantity explicitly since for each possible value of \(z\) we need to re-compute the covariates (\(X_e\) and \(X_d\)) in order to evaluate the likelihood \(f(y, z \,|\, \theta)\). Instead, we compute the expected value of the rugosity surface given the working vector of parameters \(\mathbb{E}_{Z\,|\,\theta_0} = X_e\beta_0\) and plug-it into (1) to iterate and find the next vector of parameters, in a more Gibbs-like approach.

I’m not really sure whether this is guaranteed to converge to something reasonable (as EM does), but it seems to work.

References

Bouyer, Jérémy, Ahmadou H. Dicko, Giuliano Cecchi, Sophie Ravel, Laure Guerrini, Philippe Solano, Marc J. B. Vreysen, Thierry De Meeûs, and Renaud Lancelot. 2015. “Mapping Landscape Friction to Locate Isolated Tsetse Populations That Are Candidates for Elimination.” Proceedings of the National Academy of Sciences 112 (47): 14575–80. https://doi.org/10.1073/pnas.1516778112.

Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. “Maximum Likelihood from Incomplete Data via the Em Algorithm.” Journal of the Royal Statistical Society: Series B (Methodological) 39 (1): 1–22. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x.

Dispersion of Culicoides imicola using friction maps

Working document - In Progress