Forum slows down to unusability during events
An apology in advance: this is going to get lengthy. Please hear me out, though, as I am clueless as to how to proceed.
I run https://schreibnacht.de. We host a monthly event where a renowned author joins the community for an hour-long Q&A. During this hour the forum slows down to the point of being unusable: some pages take longer than the timeout to load, some requests are cancelled outright and return a 503, and some requests go lightning fast, returning in less than a second.
Which requests return fast is random and follows no pattern. About 95% of all requests either take tens of seconds to load or end in a timeout/503.
The nodebb instance is a single one running on 6 vCores with 16 GB RAM, with Redis as DB, on Ubuntu 16.04. The system load stays very constant at 0.4/0.32/0.27. 8.9 GB of 16 GB RAM is used, and CPU usage is minimal outside of Redis' bg-save actions, where one core is used and the others keep idling. 336 MB of 1 GB swap is in use. We use Apache as a proxy between the clients and nodebb.
Stopping the forum via ./nodebb stop, stopping redis (shutdown save), and then restarting redis followed by the forum does not change the situation, which is one of the weirdest parts about it.
During our events we have around 50-150 concurrent users, depending on how renowned the special guest is. (So let's face it, it SHOULDN'T have any problems at all).
I desperately need some help identifying bottlenecks, or hints on where I can start looking.
I will go so far as to ping @julian directly; maybe you have some insight, although I know you are very busy.
I could bet my pants and shoes that it's some misconfiguration or some error happening somewhere between browser, Apache, Nodebb and Redis, but I just don't know my way around them well enough to find the needle in the haystack.
Thank you for all your help and time guys.
How many nodebb instances are running? Have you checked https://docs.nodebb.org/configuring/scaling/
Sounds to me like some type of limit is being reached. Maybe check that your ulimits are high enough, and try changing some apache settings https://httpd.apache.org/docs/2.4/misc/perf-tuning.html
Yeah, definitely could be resource exhaustion. Check ss -s during peak times to see
We prefer scaling out horizontally for this very reason.
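For anyone who wants to log socket counts over the course of an event rather than run ss -s by hand, here is a minimal sketch. It does not call ss at all; it swaps in a different technique and reads /proc/net/tcp and /proc/net/tcp6 directly, tallying connections by state. The function names are made up for this illustration, and it assumes a Linux host.

```python
from collections import Counter

# TCP state codes as they appear (in hex) in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Tally sockets per TCP state from the text of /proc/net/tcp (hypothetical helper)."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        code = fields[3].upper()
        # Unknown codes are kept under their raw hex value.
        counts[TCP_STATES.get(code, code)] += 1
    return counts


def read_tcp_state_counts(paths=("/proc/net/tcp", "/proc/net/tcp6")):
    """Sum the per-state counts over IPv4 and IPv6 sockets (hypothetical helper)."""
    total = Counter()
    for path in paths:
        try:
            with open(path) as fh:
                total += count_tcp_states(fh.read())
        except OSError:
            pass  # file missing, e.g. IPv6 disabled or not running on Linux
    return total


if __name__ == "__main__":
    for state, n in sorted(read_tcp_state_counts().items()):
        print(f"{state:12} {n}")
```

Running it once a minute during the Q&A hour and keeping the output would show whether ESTABLISHED, TIME_WAIT or CLOSE_WAIT counts climb while the forum is slow.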
Oh my...! Thank you for all the replies. And especially julian for taking the time! Amazing, I will check all of that right now and run ss -s next time we have an event.
Will report back asap!
@yariplus What would a good ulimit be for nodebb? (or even for the described system above?)
We use 500000, set in
fs.file-max = 2097152 in
No guarantees that those are the right numbers, of course
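To compare a machine against numbers like these, a rough sketch along the following lines may help. It reads the open-file limit of the process running the script plus the system-wide fs.file-max, and flags anything below a target. The function names and the 500000 target are assumptions taken from this thread for illustration, not a recommendation, and the script assumes Linux.

```python
import resource


def read_limits(file_max_path="/proc/sys/fs/file-max"):
    """Return (soft, hard, file_max) for the current process and system (hypothetical helper)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        with open(file_max_path) as fh:
            file_max = int(fh.read().split()[0])
    except (OSError, ValueError, IndexError):
        file_max = None  # not on Linux, or the file is unreadable
    return soft, hard, file_max


def check_limits(soft, hard, file_max, wanted):
    """List every limit that is below `wanted` (hypothetical helper)."""
    problems = []
    if soft != resource.RLIM_INFINITY and soft < wanted:
        problems.append(f"soft nofile limit {soft} is below {wanted}")
    if hard != resource.RLIM_INFINITY and hard < wanted:
        problems.append(f"hard nofile limit {hard} is below {wanted}")
    if file_max is not None and file_max < wanted:
        problems.append(f"fs.file-max {file_max} is below {wanted}")
    return problems


if __name__ == "__main__":
    soft, hard, file_max = read_limits()
    print("soft:", soft, "hard:", hard, "fs.file-max:", file_max)
    # 500000 is the per-process figure quoted above; treat it as an example only.
    for problem in check_limits(soft, hard, file_max, wanted=500000):
        print("WARNING:", problem)
```

Note that this reports the limits of the shell or user that runs the script, which can differ from those of the running nodebb or Apache process. The limits of an already running process can be read from /proc/<pid>/limits.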