March 31, 2020

The first issue on production 🐞

eleutheromaniac

Yesterday was a rough day for us at Oila Studio. In the morning we started receiving alerts about slow response times for our service. I immediately went to check the graphs, assuming the issue was caused by a memory leak or a spike in requests. After a quick check I found nothing: no memory leaks and no strange behavior in any of the metrics.
Grafana and Prometheus remained silent... Since the metrics gave me no clue, I paired with a teammate, hoping to find something that would point us toward what was happening. We checked the database and the load balancer - nothing.

The logs seemed ok too, but some entries looked weird, so we decided to check the log histogram and - eureka! - around 750,000 log entries had been produced that day (and were still being produced), all coming from calls to our own Nginx load balancer. In other words, our system was being DDoSed by our own service internally, and we had little protection against a DoS attack coming from localhost )) We finally fixed the issue and can continue serving our customers at a blazing speed.
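For anyone hunting a similar issue, one quick way to spot a self-inflicted flood is to count requests per client IP in the Nginx access log: if localhost dominates, the traffic is coming from your own box. A minimal sketch (the sample log lines and helper below are hypothetical, not our actual logs or tooling; it assumes the default combined log format, where the client IP is the first field):

```python
from collections import Counter

# Hypothetical sample of Nginx access-log lines (combined log format).
sample_lines = [
    '127.0.0.1 - - [30/Mar/2020:10:00:01 +0000] "GET /api/items HTTP/1.1" 200 12',
    '127.0.0.1 - - [30/Mar/2020:10:00:01 +0000] "GET /api/items HTTP/1.1" 200 12',
    '203.0.113.7 - - [30/Mar/2020:10:00:02 +0000] "GET / HTTP/1.1" 200 512',
    '127.0.0.1 - - [30/Mar/2020:10:00:02 +0000] "GET /api/items HTTP/1.1" 200 12',
]

def top_clients(lines):
    # In the combined log format the client IP is the first
    # whitespace-separated field of each line.
    counts = Counter(line.split(maxsplit=1)[0] for line in lines if line.strip())
    return counts.most_common()

for ip, n in top_clients(sample_lines):
    print(ip, n)
```

With real logs you would feed `open("/var/log/nginx/access.log")` into `top_clients` instead of the sample list; a huge count for `127.0.0.1` is the smoking gun.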

  1. 2

    Woah what

  2. 2

    Definitely been there. At one point I generated 1,000,000 log entries by accidentally writing a microservice that calls itself. Thankfully it was in development on my laptop rather than production.

    1. 1

      Lucky you :) On production it's a completely different feeling haha

  3. 2

    That's a crazy amount of logs! Glad you figured it out. Fixing prod bugs must feel like playing a round of CS:Go haha.

    1. 1

      Haha, it felt exactly like that. So stressful and time-pressured until you finally manage to get almost all headshots (you need to get all headshots or you lose on production)