A Spatio-Temporal Attentive Network for Video-Based Crowd Counting
Marco Avvenuti, Marco Bongiovanni, Luca Ciampi, Fabrizio Falchi, Claudio Gennaro, Nicola Messina

Abstract

Automatic people counting from images has recently drawn attention for urban monitoring in modern Smart Cities, owing to the ubiquity of surveillance camera networks. Current computer vision techniques rely on deep learning-based algorithms that estimate pedestrian densities in individual still images. Only a handful of works take advantage of temporal consistency in video sequences. In this work, we propose a spatio-temporal attentive neural network to estimate the number of pedestrians from surveillance videos. By exploiting the temporal correlation between consecutive frames, we lower the state-of-the-art count error by 5% and the localization error by 7.5% on the widely used FDST benchmark.
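
For readers new to density-based counting, the following is a minimal sketch of how a count is commonly read off a density map and how a count error can be averaged over frames. It is not taken from our repository: the function names are hypothetical, and using the mean absolute error as the count error is an assumption based on common practice in crowd counting, not a statement of the exact evaluation protocol used in the paper.

```python
from typing import Sequence

# A density map is modelled here as a 2D grid of non-negative floats.
DensityMap = Sequence[Sequence[float]]


def count_from_density(density: DensityMap) -> float:
    """Estimate the number of people as the sum of all density values."""
    return sum(sum(row) for row in density)


def mean_absolute_count_error(
    predicted_maps: Sequence[DensityMap],
    true_counts: Sequence[float],
) -> float:
    """Average absolute difference between predicted and true per-frame counts.

    Assumption: the count error is measured as a mean absolute error (MAE).
    """
    if not predicted_maps or len(predicted_maps) != len(true_counts):
        raise ValueError("need one ground-truth count per predicted map")
    errors = [
        abs(count_from_density(density) - true_count)
        for density, true_count in zip(predicted_maps, true_counts)
    ]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Two toy 2x2 "frames" with made-up values, purely for illustration.
    frames = [
        [[0.5, 0.5], [1.0, 0.0]],
        [[0.25, 0.25], [0.25, 0.25]],
    ]
    print(mean_absolute_count_error(frames, [2, 3]))
```

The toy values above are invented for the example and carry no relation to the FDST results reported in the abstract.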

Papers

Code

The code for reproducing results and the trained models are available in our GitHub Repository.

Cite our work

If you find this work or code useful for your research, please cite the following:

@inproceedings{Avvenuti_2022,
    doi = {10.1109/iscc55528.2022.9913019},
    url = {https://doi.org/10.1109%2Fiscc55528.2022.9913019},
    year = 2022,
    month = {jun},
    publisher = {{IEEE}},
    author = {Marco Avvenuti and Marco Bongiovanni and Luca Ciampi and Fabrizio Falchi and Claudio Gennaro and Nicola Messina},
    title = {A Spatio-Temporal Attentive Network for Video-Based Crowd Counting},
    booktitle = {2022 {IEEE} Symposium on Computers and Communications ({ISCC})}
}
    

This work was partially funded by Extension (ESA, n. 4000132621/20/NL/AF) and by “Intelligenza Artificiale per il Monitoraggio Visuale dei Siti Culturali” (AI4CHSites) CNR4C program, CUP B15J19001040004.