
Machine Learning: US Traffic and Weather Analysis


In a 2008 crash analysis report, the state of Georgia estimated 342,534 traffic accidents, in which 133,555 individuals were injured and 1,703 were killed. On average, Georgia sees around 1,000 traffic accidents per day.

One explanation for higher crash rates on Georgia roads is that extreme road conditions due to weather (e.g. rain, snow, ice) create potential safety hazards. Such hazards include, but are not limited to: drivers completely losing control of their vehicles, improper lane changes, and obstructed visibility. The United States Department of Transportation Road Weather Management Program reports that annual averages from 2007-2016 show 15% of vehicle crashes occurred due to wet pavement, 10% due to rain, 4% due to snow, and 3% due to ice [1].

Eliminating weather conditions and their associated factors is not possible; however, understanding the relationship between conditions and crash risk could make drivers more aware of dangerous conditions. The following document presents an analysis of US traffic accidents surveyed over the span of several years, with the intention of developing a severity assessment model, i.e., how do weather conditions impact vehicle crash damage?


  • Feature Extraction and Dimensionality Reduction:

    • Normalizing the data with respect to every feature

    • Filling in missing categorical data with the mode of that feature

    • Filling in missing numerical data with the median of that feature

    • Parsing date/time and replacing True/False with 1/0

    • Applying one-hot encoding on categorical data

  • Principal Component Analysis

    • PCA projects the data onto principal components, the directions in feature space along which the data vary the most, in order to retain the largest possible variance.

    • Reduces the dimensionality of the data, thereby reducing complexity.

    • Applying PCA with 97% recovered variance retained the top 49 principal components.
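The preprocessing and PCA steps above can be sketched with scikit-learn. The column names and toy values below are illustrative stand-ins, not the actual dataset's schema:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the accident data; column names are hypothetical.
df = pd.DataFrame({
    "Temperature": [70.0, np.nan, 55.0, 80.0],
    "Humidity": [0.4, 0.9, np.nan, 0.5],
    "Weather_Condition": ["Rain", "Clear", np.nan, "Rain"],
    "Traffic_Signal": [True, False, True, False],
})

# Replace True/False with 1/0 before imputation and scaling.
df["Traffic_Signal"] = df["Traffic_Signal"].astype(int)

numeric = ["Temperature", "Humidity", "Traffic_Signal"]
categorical = ["Weather_Condition"]

preprocess = ColumnTransformer(
    [
        # Missing numerical data -> median, then normalize each feature.
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric),
        # Missing categorical data -> mode, then one-hot encode.
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
         categorical),
    ],
    sparse_threshold=0.0,  # keep output dense so PCA can consume it
)

# PCA with n_components as a float keeps enough components to
# recover 97% of the variance.
pipeline = Pipeline([("prep", preprocess), ("pca", PCA(n_components=0.97))])
Z = pipeline.fit_transform(df)
print(Z.shape)
```

On the full dataset this is where the 49 retained components would come from; with the four-row toy frame above the number of components is of course much smaller.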


  • Logistic Regression

    • Logistic regression in its basic form uses a logistic function to model a binary dependent variable; the algorithm also extends to model several classes of events.
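As a minimal sketch of the multiclass extension, scikit-learn's LogisticRegression handles more than two classes out of the box. Synthetic data stands in for the PCA-reduced features, since the report's features are not reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in with 4 classes, mirroring the severity set {1, 2, 3, 4}.
X, y = make_classification(n_samples=400, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

# Binary logistic regression generalizes to several classes via a
# multinomial (softmax) formulation, which sklearn applies automatically.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))
```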

  • Support Vector Machine (SVM)

    • SVM maps data into a high-dimensional space so that decision boundaries can distinguish between the different classes.

    • Implemented after tuning hyperparameters

    • Results:

      • SVM struggled to generalize to the test set after performing well on the training set, with test and training accuracies of 0.479 and 0.9997, respectively.
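A sketch of such an SVM on synthetic stand-in data, using an RBF kernel as one common choice (the report does not state which kernel or hyperparameter values were used):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 4-class stand-in for the real features.
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The RBF kernel implicitly maps the data into a high-dimensional space;
# C and gamma are the hyperparameters that shape the decision boundary.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_tr, y_tr)

# Comparing train and test accuracy exposes the kind of train/test
# gap the report observed.
print(svm.score(X_tr, y_tr), svm.score(X_te, y_te))
```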

  • Gradient Boosting/Ensemble Learning using Decision Trees
    • Gradient boosting combines many small decision trees (relatively weak estimators) through a gradient descent procedure, rather than growing a single decision tree, in order to produce a strong classification model that is robust to overfitting. scikit-learn has been developing an experimental histogram-based approach to gradient boosting that bins the data to speed up computation; this is the implementation we used.
    • Implemented using suitable hyperparameters
    • Grid Search Results (these were better than a single run):



Overall, the project found some promise in its approach, but it is clear that a deeper investigation of the reported features is necessary. Another factor, mentioned during the dataset discussion, was an imbalance in the severity classes, with a heavy skew towards severity scores of 2 and 3. Future work may consider grouping these classes into low and high severity, as opposed to the set {1, 2, 3, 4}; this requires a better understanding of how an accident was initially categorized during data collection. Further future work lies in determining whether local spatial classifiers are a better representation of vehicle accidents than the global spatial classifiers demonstrated here. Since each physical location described by the latitude and longitude features may be more or less susceptible to weather conditions, a classifier trained across an entire state or even city may eliminate any such distinctions.

Tools and Platform Used

  • Jupyter Notebook

  • Python 3: sklearn and numpy

  • MySQL
