Impaired Performance for New Connections and Sessions

Incident Report

Executive Summary

Beginning at 14:47 UTC on September 23, 2022, our Engineering team received high-priority alerts for increased OCPP message processing times and timeout errors. These affected multiple services, including the OCPP Core Network and the Management API.

We noticed impaired performance for new connections and charging sessions across the network until 18:20 UTC on September 23, 2022, when full network functionality was restored. Most existing connections and charging sessions remained unaffected during this interval.

After the final minor changes to the database at 21:09 UTC, we moved the incident into monitoring and closed it at 22:00 UTC on September 23, 2022.

Because of the data migration to the new database, a very small number of sessions may need to be closed manually as all data migration checks are completed.

Events Timeline

Closed Sept 23 22:00 UTC
Our Engineering teams continued monitoring and observed no further error rate spikes for several hours, so we have resolved this incident.

Monitoring Sept 23 21:09 UTC
The engineering team has completed the scaling and tuning of the new database and has now put this incident in monitoring mode.

Update Sept 23 18:20 UTC
Data synchronization of the new database is complete. New chargestation connections and charging sessions are now starting successfully. The network is fully functional at this time, though the engineering teams continue to tune the database for performance.

Update Sept 23 17:23 UTC
All system configurations were updated to connect to the new database. Synchronization of the new database with existing data and incoming traffic data continued at this time.

Update Sept 23 15:23 UTC
The engineering team resized the existing database, spun up a new parallel database, and started to sync the data. While this change was being made, the old database continued to support 90% of all incoming message requests from the network.

Identified Sept 23 15:07 UTC
We declared an incident after confirming a severe degradation in response times from our core database. This was preventing new chargestation connections and charging sessions from being established. We were in contact with the cloud database provider to analyze the root cause and resolve this issue.

Investigating Sept 23 14:47 UTC
Our Engineering team received a high-priority alert for an increase in the OCPP network response times.

Mitigation Actions

In an attempt to reduce the impact of similar incidents in the future, we are taking the following actions:

We continue to engage with our cloud database provider on improving our monitoring and resiliency systems.
The eDRV engineering team will explore a "hot-standby" mode for our application and cloud database, allowing fast switchover and higher resiliency.
The engineering team continues to increase automation in our network infrastructure, testing, monitoring, and deployment.
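
As an illustration only of the "hot-standby" idea mentioned above, the following is a minimal Python sketch of one common switchover pattern: traffic stays on the primary database until it fails several consecutive health checks, then moves to the standby. This is not eDRV's implementation. The `FailoverRouter` class, the `check_primary` probe, and the threshold of three failures are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FailoverRouter:
    """Routes to the primary database until it fails repeated health checks."""

    check_primary: Callable[[], bool]  # hypothetical probe; True means healthy
    max_failures: int = 3              # assumed threshold of consecutive failures
    active: str = "primary"
    _failures: int = field(default=0, init=False)

    def probe(self) -> str:
        """Run one health check and return the endpoint that should serve traffic."""
        if self.active == "standby":
            # Switching back to the primary is left to an operator in this sketch.
            return self.active
        try:
            healthy = self.check_primary()
        except Exception:
            # A probe that raises (for example, on a timeout) counts as a failure.
            healthy = False
        # Any healthy result resets the counter, so only consecutive failures count.
        self._failures = 0 if healthy else self._failures + 1
        if self._failures >= self.max_failures:
            self.active = "standby"
        return self.active


# Simulated probe results: one healthy check followed by three failures.
results = iter([True, False, False, False])
router = FailoverRouter(check_primary=lambda: next(results))
states = [router.probe() for _ in range(4)]
# states == ["primary", "primary", "primary", "standby"]
```

Requiring consecutive failures rather than a single failed check is a design choice that avoids switching over on a transient slow response. A real deployment would also need the standby to be kept in sync with the primary, which this sketch does not address.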