A novel spatio-temporal clustering algorithm with applications on COVID-19 data from the United States

Soudeep Deb, Sayar Karmakar

Abstract: The time series of COVID-19 incidence rates varied widely across geographical locations. This happened not only because of epidemiological factors, but also because of differing government policies and the changes in human mobility that followed them. In this article, the authors propose a new clustering algorithm for analyzing such spatio-temporal data. In particular, the target datasets are those in which a time series is observed at each geographical location, be it at the county, state or country level; the goal is an unsupervised clustering of these locations based on the similarity of the series observed in them.

The proposed method uses a weighted combination of a spatial haversine distance matrix and a spectral-density-based temporal distance matrix between the locations. Concepts from the partitioning around medoids (PAM) algorithm and the gap statistic are used to develop the algorithm and to determine the optimal number of clusters. This non-parametric algorithm is novel in that it incorporates both the spatial and the temporal distances between units, and it works for time series of possibly different lengths. A theoretical guarantee of consistency of the proposed method is provided. An extensive simulation study is also given to demonstrate the efficacy of the algorithm.
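The idea of combining the two distance matrices and clustering on the result can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the temporal distance here is a root-mean-square difference of log-periodograms interpolated onto a common frequency grid (a stand-in for the paper's spectral-density-based distance, chosen because it tolerates series of different lengths), both matrices are rescaled to [0, 1] before mixing, the clustering step is a simple alternating k-medoids rather than the full PAM swap search, and the number of clusters is supplied by hand instead of being selected by the gap statistic. All function names and the `weight` parameter are hypothetical.

```python
import numpy as np

def haversine_matrix(lat, lon, radius_km=6371.0):
    """Pairwise great-circle distances (km) between locations given in degrees."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

def log_periodogram(x, n_freq=32):
    """Log-periodogram of one series on a fixed frequency grid in (0, 0.5).

    Interpolating onto a common grid lets series of different lengths be compared.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    spec = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    freqs = np.fft.rfftfreq(len(x))
    grid = np.linspace(0.0, 0.5, n_freq + 2)[1:-1]
    return np.log(np.interp(grid, freqs, spec) + 1e-12)

def temporal_matrix(series, n_freq=32):
    """Assumed temporal distance: RMS difference between log-periodograms."""
    feats = np.array([log_periodogram(s, n_freq) for s in series])
    diff = feats[:, None, :] - feats[None, :, :]
    return np.sqrt((diff ** 2).mean(axis=2))

def combined_distance(d_space, d_time, weight=0.5):
    """Weighted mix of the two matrices, each rescaled to [0, 1] (an assumption)."""
    def rescale(d):
        m = d.max()
        return d / m if m > 0 else d
    return weight * rescale(d_space) + (1.0 - weight) * rescale(d_time)

def k_medoids(dist, k, max_iter=100):
    """Simple alternating k-medoids on a precomputed distance matrix.

    Initialisation is deterministic: the most central point first, then the
    point farthest from the medoids chosen so far.
    """
    medoids = [int(np.argmin(dist.sum(axis=1)))]
    while len(medoids) < k:
        nearest = dist[:, medoids].min(axis=1)
        medoids.append(int(np.argmax(nearest)))
    medoids = np.array(medoids)
    for _ in range(max_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:  # keep the old medoid if a cluster is empty
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return labels, medoids
```

Given latitudes `lat`, longitudes `lon` and a list of series `series`, a call such as `k_medoids(combined_distance(haversine_matrix(lat, lon), temporal_matrix(series), weight=0.7), k=2)` returns a cluster label per location and the indices of the medoids. A larger `weight` pulls the clusters toward spatial contiguity, while a smaller one lets the shape of the series dominate.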

Next, the proposed algorithm is used to analyze the spatio-temporal dynamics of coronavirus (COVID-19) incidence rates observed at the county level in the United States of America. The authors collected time series for 3190 counties across the contiguous USA over a 15-month period from 22 January 2020 to 31 March 2021. They first applied an existing clustering method based on the similarity of the auto-correlation functions in these counties. The results, obtained both for all 3190 counties and at a more local level, were difficult to interpret since the resulting clusters were barely connected. They then applied their own method to datasets of different sizes: the entire country, the Midwest region and the state of California. They emphasized the last two and discussed how the clustering results offer insights into the progression of the epidemic in these areas. In particular, they were able to shed light on whether state-mandated restrictions affected the entire state similarly or whether there were distinct local behaviors in the spread of COVID-19. In the supplementary materials, they analyzed another spatio-temporal dataset, of temperatures recorded in the southern part of India, and found that the clusters clearly reflected the effects of terrain, vegetation and proximity to water bodies.
