[MONITORING] Degraded performance on project hosting

Some component of our infrastructure is causing very high load on our servers, eventually leaving one of them stuck. At that point we have to terminate it, but after a few hours the problem reappears on a different server. The likely cause is a rogue project that triggers an edge case in our services.

When a server is terminated, the projects on it become unavailable for a few minutes. In rare cases, the cleanup process for one of those projects may not complete correctly, which can cause problems with that project. If this happens to you, don't worry: we take frequent backups :slight_smile: Just report the issue on this forum and we'll act promptly to restore your project.

We are currently investigating a possible fix.

Sorry for the inconvenience!

The issue has not recurred so far. We have added metrics and monitors to help us identify the root cause more quickly and then deploy an effective fix.

The issue just showed up again :frowning: We are going to recreate one of the servers. You might experience trouble connecting to your projects for a few minutes.

Sorry again for the inconvenience!

The issue has occurred again. You might have trouble connecting to your projects for the next few minutes.