In 2015, the Zika virus (ZIKV) was first introduced and spread across Brazil due to a plethora of Aedes aegypti mosquitoes.1 This species transmits ZIKV to humans, which can cause mild symptoms or even severe health conditions such as Guillain-Barré Syndrome and microcephaly.23 Lo and Park (2018) modeled the spread of ZIKV using linear regression, while comparing results between two models — one with predictors calculated from persistent homology (PH) and one without.4 They show that using the number of 0-dimensional and 1-dimensional homological features and the maximum lifetime (persistence) of any 1-dimensional feature can better predict the number of ZIKV cases in Brazil.
Here, we demonstrate how to use topological features to model disease transmission by A. aegypti, similar to the ZIKV model built by Lo and Park.
To imitate the model from Lo and Park with available data, we would predict the number of ZIKV cases confirmed in each state using linear regression. However, data points of A. aegypti mosquitoes in Brazil are only available for the year 2013, while documented ZIKV cases are available only since 2015. Instead of using the number of ZIKV cases as the response variable in our regression model, we obtained the number of Dengue cases in Brazil for the year 2013,5 since Dengue is also transmitted by A. aegypti mosquitoes. Predictions of number of cases in each state will be made using average temperatures, precipitation levels,6 and human population densities.78 Notice that the source for temperature data provided by Lo and Park is no longer available. The accessible dataset for temperatures and precipitations are averaged throughout multiple years, thus it may not be completely accurate for 2013.
Besides these state-level attributes, we will also utilize spatial information on A. aegypti occurrences to model the disease.9 We include the number of 0-dimensional features (H0) as a predictor since a higher value of H0 indicates that a state harbors more mosquitoes. We also include the number of 1-dimensional features (H1) in our model, as we interpret a larger value of H1 to mean that A. aegypti mosquitoes are more spread out. In addition, extending the analysis of Lo and Park, we include the maximum lifetime (H1ML) and the median lifetime (H1MD) of the 1-dimensinoal features, which we expect give additional insight into the statewide connectivity of A. aegypti distributions. A high H1ML indicates mosquitoes have a wider spread and are less clustered at a small number of specific regions. We are also curious about whether the median of H1 lifetime can improve upon the effect of H1ML demonstrated by Lo and Park.
First, we load the aegypti
dataset containing geographic
locations of A. aegypti mosquitoes. Also, we load the
case_predictors
dataset to obtain human population density,
average temperature, precipitation level, and the number of cases in
each state.
We will use state abbreviations to join elements from these two data
sets and from the PH calculations. The vector stateAbbSort
,
de-duplicated from the A. aegypti dataset, provides a
convenient tool for these tasks.
devtools::load_all()
data(aegypti)
data(case_predictors)
stateAbbSort <- sort(aegypti$state_code[!duplicated(aegypti$state_code)])
We will incrementally add the predictive variables to the data frame
caseModel
.
Before building the model, the following code provides an example of
using Vietoris-Rips filtration on the mosquito occurrence patterns for a
single state. First, we retrieve occurrence coordinates in the state of
Acre (AC) from the A. aegypti dataset. Applying the
vietoris_rips
function to these coordinates, we receive a
data frame that contains the birth and death information of every
0-dimensional (H0) and 1-dimensional (H1) feature. We plot these
persistence data in the following graph, where the horizontal
and vertical axes represent the birth and death of each feature,
respectively.
AC_coord <- aegypti[aegypti$state_code == "AC", c("x", "y"), drop = FALSE]
AC_rips <- vietoris_rips(AC_coord) ##filtration
plot.new()
plot.window(
xlim = c(0, max(AC_rips$death)),
ylim = c(0, max(AC_rips$death)),
asp = 1
)
axis(1L)
axis(2L)
abline(a = 0, b = 1)
points(AC_rips[AC_rips$dimension == 0L, c("birth", "death")], pch = 16L)
points(AC_rips[AC_rips$dimension == 1L, c("birth", "death")], pch = 17L)
For reference, we can also plot the mosquito occurrences based on their latitudes and longitudes in the aegypti dataset. Note, however, that the plot is not exactly faithful to the geography, since the distances between longitude lines vary from location to location. This also means that our PH calculations are not exactly right. The same limitation applies to the analysis of Lo and Park.
plot(x = AC_coord$x, y = AC_coord$y, asp = 1, xlab = "X", ylab = "Y",
main = "Aedes Aegypti Occurrences in AC")
The following function takes in a single state’s persistence data and returns a vector consisting of the values H0, H1, H1ML, and H1MD.
topologicalFeatures <- function(state_rips){
numH0 <- length(which(state_rips$dimension == 0))
numH1 <- length(which(state_rips$dimension == 1))
if (numH1 != 0) {
maxPerst <- max(state_rips$persistence[(numH0 + 1) : (numH0 + numH1)])
medianPerst <- median(state_rips$persistence[(numH0 + 1) : (numH0 + numH1)])
} else {
maxPerst <- 0
medianPerst <- 0
}
return(c(numH0, numH1, medianPerst, maxPerst))
}
Iterating over all the states in Brazil, we bind every state’s
topological features to caseModel
.
for(val in stateAbbSort) {
state_coord <- aegypti[aegypti$state_code == val, c("x", "y"), drop = FALSE]
state_rips <- vietoris_rips(state_coord)
state_rips$persistence<-(state_rips$death - state_rips$birth)
features <- topologicalFeatures(state_rips)
caseModel <- rbind(
caseModel,
c(features[1], features[2], features[3], features[4])
)
}
colnames(caseModel) <- c("H0", "H1", "H1MD","H1ML")
rownames(caseModel) <- stateAbbSort
We merge caseModel
with case_predictors
,
matched by state code. We have now loaded all potential predictors to
our model. The following code prints out first few lines of the
caseModel
to demonstrate how states and their variables lie
within this data frame.
caseModel <- merge(caseModel, case_predictors, by = "row.names")[, -1]
rownames(caseModel) <- stateAbbSort
head(caseModel)
## H0 H1 H1MD H1ML POP TEMP PRECIP CASE
## AC 13 2 0.06063392 0.07681639 4.729529 25.66136 169.20068 2568
## AL 101 29 0.03059555 0.12310635 118.607876 24.67370 96.40583 11296
## AM 29 3 1.24451720 1.24898089 2.442278 26.49226 198.98254 17832
## AP 12 1 0.28318879 0.28318879 5.158925 26.22867 224.78617 1708
## BA 412 147 0.04711213 0.31302836 26.638086 23.88865 80.96277 6111
## CE 179 63 0.04965020 0.27650333 58.958386 26.22163 78.03878 30219
The plots below visualize the number of cases within each state. To reproduce the model by Lo and Park, we will log-transform the number of cases to reduce data skew and the influence of outliers.
par(mfrow = c(2L, 1L))
hist(caseModel$CASE, main = "Distribution of Case Counts")
hist(log(caseModel$CASE),
main = "Distribution of Log-Transformed Case Counts")
We use the following plot to check if there is any correlation between predictors. We see that H0 and H1 are linearly related to each other, so H1 may not be a necessary predictor. During the model selection process, there will be a comparison testing inclusion of H1 as a predictor and decisions will be made at that point.
First, we fit a linear model to log-transformed case counts predicted by population, temperature, precipitation, and H0 (the number of aegypti occurrences). We reject the null hypothesis of no predictive value (p = 0.0199).
##
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + PRECIP + H0, data = caseModel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3891 -1.1671 -0.1294 0.7238 3.3588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.416851 4.061618 0.349 0.73053
## POP 0.007909 0.003406 2.322 0.02988 *
## TEMP 0.238266 0.144471 1.649 0.11331
## PRECIP 0.001867 0.008762 0.213 0.83324
## H0 0.007375 0.002306 3.198 0.00415 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.634 on 22 degrees of freedom
## Multiple R-squared: 0.399, Adjusted R-squared: 0.2898
## F-statistic: 3.652 on 4 and 22 DF, p-value: 0.0199
To check error assumptions for linear regression, we plot residuals versus fitted values and theoretical quantiles versus standardized residuals. We observe that residuals for each fitted values are evenly distributed above and below the mean 0, and the relationship between residuals and fitted values seem to have a linear relation. In the quantile-quantile plot, standardized residuals closely line up with normal values, with a few outliers. Therefore, the normality assumption on the error term is satisfied.
Next, we use the likelihood ratio test (LRT) to assess if we should
drop the precipitation predictor PRECIP
. In the LRT, p = 0.8135, which fails to reject
the nested model. This result suggests that the simpler model without
PRECIP
is better. We therefore replace the original model
with this more parsimonious one. Even though TEMP
is not
statistically significant, we preserve it in the model as temperature is
an effectual predictor in most epidemiology modeling, and one whose
point estimate we may want to be able to report.
## Likelihood ratio test
##
## Model 1: log(CASE) ~ POP + TEMP + H0
## Model 2: log(CASE) ~ POP + TEMP + PRECIP + H0
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 5 -48.834
## 2 6 -48.807 1 0.0557 0.8135
##
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + H0, data = caseModel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3757 -1.1723 -0.0940 0.6976 3.3161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.771317 3.627567 0.488 0.62997
## POP 0.007684 0.003170 2.424 0.02362 *
## TEMP 0.235534 0.140883 1.672 0.10811
## H0 0.007161 0.002032 3.524 0.00182 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.6 on 23 degrees of freedom
## Multiple R-squared: 0.3978, Adjusted R-squared: 0.3193
## F-statistic: 5.065 on 3 and 23 DF, p-value: 0.007721
Let’s add H1
, H1ML
, and H1MD
to our current regression model and name it fit.2
. From the
summary output of the second model, F = 3.268 and p = 0.01878. We know that
coefficients of these predictors are not all 0. However, we might
overfit the data as many predictors are not statistically significant.
Thus, we prefer to exclude irrelevant variables.
##
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + H0 + H1 + H1ML + H1MD,
## data = caseModel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.014 -1.151 0.246 0.577 3.068
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.228176 3.473642 0.929 0.36380
## POP 0.009529 0.003143 3.032 0.00659 **
## TEMP 0.155718 0.138196 1.127 0.27317
## H0 -0.021864 0.023392 -0.935 0.36109
## H1 0.078002 0.063214 1.234 0.23153
## H1ML 3.632102 1.767727 2.055 0.05321 .
## H1MD -1.377716 1.941589 -0.710 0.48615
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.508 on 20 degrees of freedom
## Multiple R-squared: 0.5348, Adjusted R-squared: 0.3952
## F-statistic: 3.832 on 6 and 20 DF, p-value: 0.01049
Similarly, the two plots below exhibit that error terms from the regression model are normally distributed and centered at zero. There are a few outliers from the Q-Q plot, but their effects are trivial in our regression model.
Using LRT to see if H1MD
is a useful predictor, we fail
to reject the nested model when H1MD is absent (p = 0.4126). Therefore, we leave out
H1MD
in our model.
##
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + H0 + H1 + H1ML, data = caseModel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9994 -0.9372 0.1673 0.6046 3.0713
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.121003 3.429087 0.910 0.37307
## POP 0.009399 0.003101 3.032 0.00635 **
## TEMP 0.160272 0.136405 1.175 0.25316
## H0 -0.017450 0.022281 -0.783 0.44226
## H1 0.067017 0.060560 1.107 0.28097
## H1ML 2.731668 1.216049 2.246 0.03557 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.49 on 21 degrees of freedom
## Multiple R-squared: 0.5231, Adjusted R-squared: 0.4095
## F-statistic: 4.606 on 5 and 21 DF, p-value: 0.005412
## Likelihood ratio test
##
## Model 1: log(CASE) ~ POP + TEMP + H0 + H1 + H1ML
## Model 2: log(CASE) ~ POP + TEMP + H0 + H1 + H1ML + H1MD
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 7 -45.687
## 2 8 -45.351 1 0.6713 0.4126
We observe that coefficients of both H0 and H1 are not statistically
significant. Also, it is concerning because the coefficient of H0 in the
nested model above is negative, which means more mosquito occurrences
tend to decrease the number of cases. We decide to test the model
without H1
, as it seems to correlate with H0 and cause
confounding.
##
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + H0 + H1ML, data = caseModel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2853 -1.0814 -0.1371 0.6899 3.1475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.708512 3.426125 0.791 0.43765
## POP 0.009607 0.003111 3.089 0.00537 **
## TEMP 0.158410 0.137089 1.156 0.26027
## H0 0.007117 0.001902 3.742 0.00113 **
## H1ML 2.471004 1.199093 2.061 0.05135 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.498 on 22 degrees of freedom
## Multiple R-squared: 0.4952, Adjusted R-squared: 0.4035
## F-statistic: 5.396 on 4 and 22 DF, p-value: 0.003494
## Likelihood ratio test
##
## Model 1: log(CASE) ~ POP + TEMP + H0 + H1ML
## Model 2: log(CASE) ~ POP + TEMP + H0 + H1 + H1ML
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 6 -46.452
## 2 7 -45.687 1 1.5303 0.2161
The likelihood ratio test suggests that we need to choose the simpler
model without H1
(p = 0.2161). In this case, we only
need H1ML as our topological predictor.
We compare regression summaries of our final first and second models
and observe an improvement of the adjusted R-squared when adding
H1ML
as an additional predictor.
##
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + H0, data = caseModel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3757 -1.1723 -0.0940 0.6976 3.3161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.771317 3.627567 0.488 0.62997
## POP 0.007684 0.003170 2.424 0.02362 *
## TEMP 0.235534 0.140883 1.672 0.10811
## H0 0.007161 0.002032 3.524 0.00182 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.6 on 23 degrees of freedom
## Multiple R-squared: 0.3978, Adjusted R-squared: 0.3193
## F-statistic: 5.065 on 3 and 23 DF, p-value: 0.007721
##
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + H0 + H1ML, data = caseModel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2853 -1.0814 -0.1371 0.6899 3.1475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.708512 3.426125 0.791 0.43765
## POP 0.009607 0.003111 3.089 0.00537 **
## TEMP 0.158410 0.137089 1.156 0.26027
## H0 0.007117 0.001902 3.742 0.00113 **
## H1ML 2.471004 1.199093 2.061 0.05135 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.498 on 22 degrees of freedom
## Multiple R-squared: 0.4952, Adjusted R-squared: 0.4035
## F-statistic: 5.396 on 4 and 22 DF, p-value: 0.003494
H0
and H1ML
are both positively related to
the number of cases in each state. A higher value of H0
indicates more A. aegypti occurrences within the given state,
so it leads to more confirmed cases. Since we include both
H0
and H1ML
in the model, given a fixed number
of A. aegypti occurrences, a more persistent cycle would
predict more cases as disease outbreaks can involve more locations. To
interpret this, we select three states with similar H0
but
different H1ML
and compare the number of cases in each.
We choose states Pernambuco, Ceará, and Goiás for this example and
print out H0
, H1ML
, POP
, and
CASE
for each state. We observe that Goiás has the largest
H1ML
and the most cases, while Pernambuco has the smallest
H1ML
and the least number of cases. To reduce the effect of
H0
on case numbers and H1ML
, H0
is fixed since H0
for three selected states all fall into
the range around 200.
Example_States <- c("PE", "CE", "GO")
Coord_Example <- data.frame()
for(val in Example_States){
targetRow <- caseModel[rownames(caseModel) == val,]
Coord_Example <- rbind(Coord_Example,
c(targetRow$H0, targetRow$H1ML, targetRow$CASE))
}
rownames(Coord_Example) <- Example_States
colnames(Coord_Example) <- c("H0", "H1ML", "CASE");Coord_Example
## H0 H1ML CASE
## PE 184 0.1594328 7985
## CE 179 0.2765033 30219
## GO 245 0.3654354 139357
While controlling H0
, we want to give a visualized
interpretation of how H1ML
positively relates to the number
of cases. To achieve this, we plot A. aegypti occurrences on an
xy-axis. We also limit the range of the x-axis to 10 in all three plots
to fix the geographic dimension, in order to have a fair comparison
between aegypti distributions.
par(mfrow = c(1L, 3L))
PE_coord <- aegypti[aegypti$state_code == "PE", c("x", "y"), drop = FALSE]
PE_rips <- vietoris_rips(PE_coord)
plot(x = PE_coord$x, y = PE_coord$y, asp = 1, col = "green",
xlim = c(-42, -32), xlab = "X", ylab = "Y", main = "Pernambuco")
CE_coord <- aegypti[aegypti$state_code == "CE", c("x", "y"), drop = FALSE]
CE_rips <- vietoris_rips(CE_coord)
plot(x = CE_coord$x, y = CE_coord$y, asp = 1, col = "blue",
xlim = c(-43, -33), xlab = "X", ylab = "Y", main = "Ceará")
GO_coord <- aegypti[aegypti$state_code == "GO", c("x", "y"), drop = FALSE]
GO_rips <- vietoris_rips(GO_coord)
plot(x = GO_coord$x, y = GO_coord$y, asp = 1, col = "red",
xlim = c(-54, - 44), xlab = "X", ylab = "Y", main = "Goiás")
Distributions of aegypti populations serve as a guidance map
for interpreting H1ML
: a bigger H1ML
comes
from sparse and wider range of occurrences where infections can extend
to more locations. In Pernambuco, which has the tiniest
H1ML
, most occurrences appear to form a single cluster.
Hence, the spread of disease in Pernambuco may be easier to prevent and
contain in the population close to this aegypti cluster, such
as by applying pesticides and increasing predators. This may be part of
the reason that Pernambuco, with the tiniest H1ML
, has a
small number of cases. In Ceará, occurrences are not clustered in one
chunk but are separated into two clusters, one to the north and the
other to the south. Based on the aegypti distribution in Goiás,
we observe that multiple aegypti points are sparsely located
instead of tightly attached to one another, which can lead to a higher
value of H1ML
. Comparing the shape of distributions among
state Ceará and Goiás, occurrences in Goiás have a larger spread and are
more uniformly distributed across the state, so the disease in Goiás can
be transmitted across a wider range to reach more distant locations.
More uniformed occurrences of aegypti species can increase the
number of infections as it bridges more neighboring communities.
Overall, predictor H1ML
effectively detects patterns among
clusters of occurrences and explains that the increase in cases may be
due to a wider range of disease spread.
We have shown that using topological features from A.
aegypti occurrences in conjunction with standard epidemiology
variables could better use the spatial information of existing data to
benefit the prediction of ZIKV cases. Compared to using different years’
datasets by Lo and Park, we have kept all our datasets to 2013 so our
model leads to a different result. Also, improvements in our regression
model after using topological features are not extraordinarily
significant. In addition to variables included in our model, it is
possible to further involve interaction terms, quadratic predictors, and
log transformation of topological features when they are contextually
and statistically appropriate. We have tested our model while
incorporating the interaction between H1
and
H1ML
as an additional predictor, in order to follow the
approach by Lo and Park. However, this interaction term is not
significant and does not contribute to the overall fitness of our model.
Even though variable selections and results deviate somewhat from the
regression model built by Lo and Park, our model successfully
demonstrates how to take advantage of topological features generated by
the Vietoris-Rips filtration.
Gatherer D and Kohl A. Zika virus, a previously slow pandemic spreads rapidly through the Americas. Journal of General Virology. 2016 97: 269–273. https://doi.org/10.1099/jgv.0.000381 PMID: 26684466↩︎
Winer J. An update in Guillain-Barre´ Syndrome. Autoimmune Diseases. 2014. https://doi.org/10.1155/ 2014/793024 PMID: 24511391↩︎
Johansson MA, Mier-y-Teran-Romero L, Reefhuis J, Gilboa S and Hills S. Zika and the risk of microcephaly. N Engl J Med. 2016 375: 1–4. https://doi.org/10.1056/NEJMp1605367 PMID: 27222919↩︎
Lo D, Park B (2018) Modeling the spread of the Zika virus using topological data analysis. PLoS ONE 13(2): e0192120. https://doi.org/10.1371/journal.pone.0192120↩︎
Brazil Ministry of Health. 2014. Boletim Epidemiológico. ISSN 2358-9450. https://web.archive.org/web/20210209122713/https://www.gov.br/saude/pt-br/assuntos/boletins-epidemiologicos-1/por-assunto↩︎
Brazilian Institute of Geography and Statistics. 2015. Retrieved from https://ftp.ibge.gov.br/Estimativas_de_Populacao/↩︎
Brazilian Institute of Geography and Statistics. 2020. https://www.ibge.Goiásv.br/geociencias/organizacao-do-territorio/estrutura-territorial/15761-areas-dos-municipios.html?edicao=30133&t=acesso-ao-produto↩︎
Kraemer, Moritz U. G. et al. 2017. Data from: The global compendium of Aedes aegypti and Ae. albopictus occurrence, Dryad, Dataset, https://doi.org/10.5061/dryad.47v3c↩︎