Yesterday was a rough day for us at Oila Studio. In the morning we started receiving alerts about slow response times for our service. I immediately went to check the graphs, assuming the issue was caused by a memory leak or a spike in requests. After a quick check, I found nothing - we had no memory leaks and no strange behavior according to any of the metrics.
Grafana and Prometheus remained silent... Since the metrics gave me no clue, I paired with a teammate, hoping to find anything that would hint at what was happening. We checked the database and the load balancer - nothing.
The logs seemed OK too, but some entries looked odd, so we decided to check the histogram for the logs, and eureka! We had around 750,000 log entries produced that day, with more still arriving, all coming from calls to our Nginx load balancer. In other words, our system was being DDoSed internally by our own service, and we had little protection against a DoS attack that comes from localhost )) In the end we fixed the issue, and we can continue serving our customers at blazing speed.
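For anyone who wants to run the same kind of check without a log dashboard, here is a minimal sketch of a log histogram in Python. It is not what we actually ran. It assumes every log line starts with an ISO 8601 timestamp followed by a space, and the names `log_histogram` and `find_spikes`, as well as the "five times the median" threshold, are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def log_histogram(lines, bucket_format="%Y-%m-%d %H:00"):
    """Count log lines per hourly bucket.

    Assumption: each line begins with an ISO 8601 timestamp
    (e.g. 2024-05-01T11:02:03) followed by a space.
    """
    counts = Counter()
    for line in lines:
        stamp = line.split(" ", 1)[0]
        try:
            moment = datetime.fromisoformat(stamp)
        except ValueError:
            # Skip lines that do not start with a parsable timestamp.
            continue
        counts[moment.strftime(bucket_format)] += 1
    return counts


def find_spikes(counts, factor=5.0):
    """Return buckets whose count exceeds `factor` times the median count.

    The threshold is an arbitrary illustrative choice, not a tuned value.
    """
    if not counts:
        return []
    ordered = sorted(counts.values())
    median = ordered[len(ordered) // 2]
    threshold = factor * max(median, 1)
    return sorted(bucket for bucket, n in counts.items() if n > threshold)
```

Feeding the function an open log file, for example `find_spikes(log_histogram(open("access.log")))` with a hypothetical `access.log`, would list the hours in which the number of entries jumped far above the usual level.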