Identification of patterns and predictive modeling of traffic accidents.



To develop predictive models of traffic accidents.


Research question

What is the risk of traffic accidents for a specific road segment and period of the day?


Business understanding

The #5 Emergency use case builds on a need identified by the Lisbon City Council firefighters (RSB) as important for their operations: a better allocation of emergency resources across the city, to decrease the time taken to respond to emergencies.

This need is especially relevant for traffic accidents. RSB showed interest in a model that identifies the areas of Lisbon that are more prone to traffic accidents and that assesses how the risk changes when some road characteristics are modified (e.g., the number of traffic lanes or the number of lighting poles). For this use case, a simulation model was therefore developed that estimates the risk of traffic accidents for a specific road segment as its road characteristics are modified.

Data understanding

The RSB provided the project research team with data on traffic accidents for the period from 2013 to 2020, a total of 6076 traffic accidents.

Note that this database contains only the traffic accidents that required firefighters’ intervention. The details of the provided information are presented in Table 32.

Table 32. Description of the provided traffic accidents data

The spatial distribution of traffic accidents was computed through kernel density estimation (KDE) (Silverman, 1998), using 50x50 m cells and considering the expected counts in each cell (Figure 63).

Figure 63. KDE of traffic accidents estimated counts in a square grid of 50x50 m cells using traffic accidents data from 2013 to 2020
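As a rough illustration of how expected counts per 50x50 m cell can be derived from a KDE, the sketch below sums a Gaussian kernel over the accident locations at each cell centre and multiplies the resulting density by the cell area. The Gaussian kernel, the 150 m bandwidth, the function name and the sample coordinates are all assumptions made for the example; the source states only the cell size.

```python
import math

def kde_expected_counts(points, x_min, y_min, n_cols, n_rows,
                        cell=50.0, bandwidth=150.0):
    """Evaluate a 2-D Gaussian KDE at cell centres and scale it to counts.

    points    : accident locations as (x, y) in projected metres (assumed)
    bandwidth : kernel standard deviation in metres (assumed, not from the source)
    """
    # Normalising constant of a 2-D Gaussian kernel
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2)
    grid = [[0.0] * n_cols for _ in range(n_rows)]
    for row in range(n_rows):
        cy = y_min + (row + 0.5) * cell
        for col in range(n_cols):
            cx = x_min + (col + 0.5) * cell
            density = 0.0
            for px, py in points:
                d2 = (cx - px) ** 2 + (cy - py) ** 2
                density += norm * math.exp(-d2 / (2.0 * bandwidth ** 2))
            # density is in accidents per square metre; times cell area gives a count
            grid[row][col] = density * cell * cell
    return grid

# Hypothetical accident locations inside a 2 km x 2 km window
sample_points = [(1000.0, 1000.0), (1020.0, 990.0), (980.0, 1010.0)]
counts = kde_expected_counts(sample_points, 0.0, 0.0, n_cols=40, n_rows=40)
```

Because the kernel is summed rather than averaged over the points, the cell values add up to approximately the number of accidents whenever the grid covers the whole kernel mass.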

Data preparation

The relevant datasets for the development of the #5 Emergency use case are presented in Table 33.

Table 33. Relevant datasets used for the development of the #5 Emergency use case

The data presented in Table 33 were aggregated to the nearest road segment to prepare the dataset for the modelling phase. Figure 64 presents the number of traffic accidents associated with the nearest road segment.

Figure 64. Number of traffic accidents associated to the nearest road segment from 2013 to 2020
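The aggregation to the nearest road segment can be sketched as follows. This is a simplified illustration, not the project's implementation: it assumes projected coordinates in metres and treats each road segment as a single straight line between two end points, whereas a real road network would use polylines and a spatial index. The identifiers `r1` and `r2` and all coordinates are hypothetical.

```python
import math
from collections import Counter

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the straight segment a-b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        # Degenerate segment: both end points coincide
        return math.hypot(px - ax, py - ay)
    # Position of the projection of p on the line, clamped to the segment
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len2
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def count_by_nearest_segment(accidents, segments):
    """Count accidents per road segment, assigning each to its nearest one."""
    counts = Counter()
    for p in accidents:
        nearest = min(
            segments,
            key=lambda rid: point_segment_distance(p, *segments[rid]),
        )
        counts[nearest] += 1
    return counts

# Hypothetical road segments: road_id -> (start point, end point)
segments = {
    "r1": ((0.0, 0.0), (100.0, 0.0)),
    "r2": ((0.0, 50.0), (100.0, 50.0)),
}
accidents = [(10.0, 5.0), (50.0, 40.0), (90.0, 30.0)]
per_segment = count_by_nearest_segment(accidents, segments)
```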


Several approaches have been developed for traffic accident prediction, including ARIMA time series models (Ihueze & Onwurah, 2018), Poisson and negative binomial models (Fancello, Soddu, & Fadda, 2018), and machine learning techniques such as regression models (Chang & Chen, 2005), K-Nearest Neighbour (KNN), Bayesian networks (Hossain & Muromachi, 2012), and decision trees (Lin, Wang, & Sadek, 2015).

Deep learning approaches (Chen, Song, Yamada, & Shibasaki, 2016; Ren, Song, Wang, Hu, & Lei, 2018) have also been developed to estimate the risk of traffic accidents, but on coarser regular spatial grids that do not provide the spatial detail needed for emergency operations.

In addition, most studies on traffic accident prediction address non-urban contexts, and little attention has been given to the prediction of traffic accidents in urban environments (Yu et al., 2021).

The modelling strategy developed for the emergency use case was divided into two stages. In the first stage, the probability of occurrence of a traffic accident was computed by road segment for a specific period of the day, considering meteorological conditions (namely temperature and precipitation). Table 34 presents the variables required for the first stage of the modelling strategy.

Table 34. Variables used for the computation of traffic accidents probability

The probability of traffic accident occurrence was computed by dividing [sum_occurrences] by [count].

In the second stage of the modelling strategy, all combinations of the features [road_id], [temperature], [precipitation], [period], and [off_day] with a [count] lower than 100 were discarded, as they were not considered statistically significant.
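A minimal sketch of the empirical probability and of the support filter is given below. It assumes that [count] is the number of observations recorded for a feature combination and [sum_occurrences] the number of those observations with an accident; the `occurrence` field, the binned feature values and the function names are hypothetical and used only for illustration.

```python
from collections import defaultdict

FEATURES = ("road_id", "temperature", "precipitation", "period", "off_day")
MIN_COUNT = 100  # threshold taken from the text

def accident_probability(observations):
    """Group observations by feature combination and compute sum / count."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum_occurrences, count]
    for obs in observations:
        key = tuple(obs[f] for f in FEATURES)
        totals[key][0] += obs["occurrence"]  # 1 if an accident occurred, else 0
        totals[key][1] += 1
    return {
        key: {"sum_occurrences": s, "count": c, "probability": s / c}
        for key, (s, c) in totals.items()
    }

def split_by_support(stats, min_count=MIN_COUNT):
    """Separate combinations with enough observations from sparse ones."""
    reliable = {k: v for k, v in stats.items() if v["count"] >= min_count}
    sparse = {k: v for k, v in stats.items() if v["count"] < min_count}
    return reliable, sparse

def make_obs(road_id, occurrence):
    """Build one hypothetical observation with fixed weather and period bins."""
    return {"road_id": road_id, "temperature": "mild", "precipitation": "none",
            "period": "morning", "off_day": 0, "occurrence": occurrence}

# Hypothetical data: road 1 has 200 observations, road 2 only 4
observations = [make_obs(1, 1 if i < 10 else 0) for i in range(200)]
observations += [make_obs(2, 1 if i == 0 else 0) for i in range(4)]

stats = accident_probability(observations)
reliable, sparse = split_by_support(stats)
```

In this sketch the sparse combinations are kept apart rather than dropped, so that a model can later estimate a probability for them.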

To estimate a probability for the cases where statistical significance was not met, two machine learning approaches were trained and tested: a gradient boosting algorithm, LightGBM (LGBM) (Lv, Lou, Feng, Chen, & Lv, 2021), which uses tree-based learning algorithms, and a deep learning model built with the Keras framework (Ketkar, 2017). Both frameworks were applied in two steps:

  1. as a classification algorithm, to identify (among the situations in which the combination of features was < 100) the observations where the probability was non-null; and
  2. as a regressor, to assign a probability of traffic accident occurrence to each observation identified in the previous step.
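The way the two steps combine can be sketched independently of the framework. In the snippet below, the classifier and the regressor are passed in as plain callables. In the source these roles are played by trained LGBM or Keras models, while the stand-in rules, the `n_lanes` feature and the function name used here are purely hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]

def two_step_probability(
    rows: List[Row],
    is_non_null: Callable[[Row], bool],
    estimate: Callable[[Row], float],
) -> List[float]:
    """Classify first, then regress only on the rows flagged as non-null."""
    probabilities = []
    for row in rows:
        if is_non_null(row):
            # Step 2: the regressor assigns a probability, clipped to [0, 1]
            probabilities.append(min(1.0, max(0.0, estimate(row))))
        else:
            # Step 1 predicted a null probability, so no regression is needed
            probabilities.append(0.0)
    return probabilities

# Hypothetical stand-ins for the trained classifier and regressor
def toy_classifier(row: Row) -> bool:
    return row["n_lanes"] >= 2

def toy_regressor(row: Row) -> float:
    return 0.01 * row["n_lanes"]

rows = [{"n_lanes": 1.0}, {"n_lanes": 2.0}, {"n_lanes": 4.0}]
predicted = two_step_probability(rows, toy_classifier, toy_regressor)
```

Separating the steps lets the regressor be trained only on observations with a non-zero target, at the cost of propagating any classification error into the final probability.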

Table 35 lists the variables used by the classification and regression algorithms.

Table 35. Input data for the prediction of the probability of traffic accident occurrence in the observations where the combination of the features [road_id], [temperature], [precipitation], [period], and [off_day] is < 100

To assess the quality of the models, the AUC (Huang & Ling, 2005) was used for the classification models and the MAPE (de Myttenaere, Golden, Le Grand, & Rossi, 2016) for the regression models.

The results of the classification and regression models for both frameworks are presented in Table 36.
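For reference, both metrics can be written in a few lines. The sketch below uses the pairwise-ranking definition of the AUC and the standard definition of the MAPE; the function names and the sample values are illustrative, and the numbers are unrelated to the results reported for this use case.

```python
def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(errors) / len(errors)

# Illustrative values only
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
example_mape = mape([0.2, 0.4], [0.1, 0.5])
```

The MAPE is undefined when a true value is zero, so it suits a regression step that is applied only to observations with a non-null probability.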

Table 36. Quality metrics of the models developed for the #5 Emergency use case

Considering the results presented in Table 36, both models have similar quality. However, gradient boosting is much less computationally expensive than the neural network.

For this reason, the model used to build the traffic accident simulator was the gradient boosting algorithm implemented with the LGBM framework.

The simulator will be further presented in the next section.


To validate the results, a dashboard with several reports was built on a star-schema dimensional model.

Figure 65. Dimensional model for the #5 Emergency use case


Chang, L. Y., & Chen, W. C. (2005). Data mining of tree-based models to analyze freeway accident frequency. Journal of Safety Research, 36(4), 365–375. https://doi.org/10.1016/J.JSR.2005.06.013

Chen, Q., Song, X., Yamada, H., & Shibasaki, R. (2016). Learning deep representation from big and heterogeneous data for traffic accident inference. 30th AAAI Conference on Artificial Intelligence, AAAI 2016, 338–344.

de Myttenaere, A., Golden, B., Le Grand, B., & Rossi, F. (2016). Mean Absolute Percentage Error for regression models. Neurocomputing, 192, 38–48. https://doi.org/10.1016/j.neucom.2015.12.114

Fancello, G., Soddu, S., & Fadda, P. (2018). An accident prediction model for urban road networks. Journal of Transportation Safety and Security, 10(4), 387–405. https://doi.org/10.1080/19439962.2016.1268659

Hossain, M., & Muromachi, Y. (2012). A Bayesian network based framework for real-time crash prediction on the basic freeway segments of urban expressways. Accident Analysis & Prevention, 45, 373–381. https://doi.org/10.1016/J.AAP.2011.08.004

Huang, J., & Ling, C. X. (2005). Using AUC and Accuracy in Evaluating Learning Algorithms. IEEE Transactions on Knowledge and Data Engineering, 17(3), 299–310.

Ihueze, C. C., & Onwurah, U. O. (2018). Road traffic accidents prediction modelling: An analysis of Anambra State, Nigeria. Accident Analysis and Prevention, 112, 21–29. https://doi.org/10.1016/j.aap.2017.12.016

Ketkar, N. (2017). Introduction to Keras. In Deep Learning with Python: A Hands-on Introduction (pp. 97–111). Berkeley, CA: Apress. https://doi.org/10.1007/978-1-4842-2766-4_7

Lin, L., Wang, Q., & Sadek, A. W. (2015). A novel variable selection method based on frequent pattern tree for real-time traffic accident risk prediction. Transportation Research Part C: Emerging Technologies, 55, 444–459. https://doi.org/10.1016/J.TRC.2015.03.015

Lv, Z., Lou, R., Feng, H., Chen, D., & Lv, H. (2021). Novel Machine Learning for Big Data Analytics in Intelligent Support Information Management Systems. ACM Trans. Manage. Inf. Syst., 13(1). https://doi.org/10.1145/3469890

Ren, H., Song, Y., Wang, J., Hu, Y., & Lei, J. (2018). A Deep Learning Approach to the Citywide Traffic Accident Risk Prediction. IEEE Conference on Intelligent Transportation Systems, Proceedings, ITSC, 2018-Novem(October), 3346–3351. https://doi.org/10.1109/ITSC.2018.8569437

Silverman, B. W. (1998). Density Estimation for Statistics and Data Analysis (1st ed.). https://doi.org/10.1201/9781315140919

Yu, L., Du, B., Hu, X., Sun, L., Han, L., & Lv, W. (2021). Deep spatio-temporal graph convolutional network for traffic accident prediction. Neurocomputing, 423, 135–147. https://doi.org/10.1016/J.NEUCOM.2020.09.043