Forum slows down to unusability during events

  • Apologies in advance: this is going to get lengthy. Please hear me out, though, as I am clueless as to how to proceed.

    I run a forum where we host a monthly event in which a renowned author joins the community for an hour-long Q&A. During this hour the forum slows down to the point of being unusable: some pages take longer than the timeout to load, some requests are cancelled completely and return a 503, and some requests go lightning fast, returning in less than a second.
    Which requests return fast is random and follows no pattern. 95% of all requests either take tens of seconds to load or run into a timeout/503.

    We run a single NodeBB instance on 6 vCores with 16 GB RAM, with Redis as the DB, on Ubuntu 16.04. The system load stays very constant at 0.4/0.32/0.27. RAM usage is 8.9 GB of 16 GB, and CPU usage is minimal outside of Redis' bg-save actions, during which one core is used while the others keep idling. Swap usage is 336 MB of 1 GB. We use Apache as a proxy between the clients and NodeBB.

    Stopping the forum via ./nodebb stop, stopping Redis (shutdown save), and then restarting Redis followed by the forum does not change the situation, which is one of the weirdest parts about it.

    During our events we have around 50-150 concurrent users, depending on how renowned the special guest is. (So let's face it, it SHOULDN'T have any problems at all).

    I desperately need some help identifying bottlenecks, and hints on where I can start looking.

    I will go so far as to ping @julian directly; maybe you have some insight, although I know you are very busy.

    I could bet my pants and shoes that it's some misconfiguration or some error happening somewhere between the browser, Apache, NodeBB and Redis, but I just don't know my way around them well enough to find the needle in the haystack.

    Thank you for all your help and time guys.

  • Admin

    How many nodebb instances are running? Have you checked

  • Community Rep

    Sounds to me like some type of limit is being reached. Maybe check that your ulimits are high enough, and try changing some Apache settings.
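
To make the ulimit suggestion concrete, here is a minimal sketch, assuming a Linux host with /proc mounted. It reads the open-file limit that a running process actually has and counts its open descriptors. The function names and the idea of passing in the NodeBB process ID are illustrative assumptions, not part of NodeBB. Checking the live process can be worthwhile because limits in /etc/security/limits.conf are applied through PAM sessions and may not reach a process started as a service.

```python
import os
from typing import Optional, Tuple


def parse_open_files_limit(limits_text: str) -> Optional[Tuple[str, str]]:
    """Return (soft, hard) for 'Max open files' from /proc/<pid>/limits text.

    Values are returned as strings because a limit can be 'unlimited'.
    """
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # Remaining columns: soft limit, hard limit, units
            fields = line[len(label):].split()
            return fields[0], fields[1]
    return None


def read_open_files_limit(pid: int) -> Optional[Tuple[str, str]]:
    """Read the open-file limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_open_files_limit(fh.read())


def count_open_fds(pid: int) -> int:
    """Count file descriptors currently open in the process (Linux only).

    Needs to run as the same user as the process, or as root.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))
```

Calling read_open_files_limit(pid) and count_open_fds(pid) with the PID of the node process during an event would show whether the descriptor count gets close to the soft limit.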

  • Admin

    Yeah, definitely could be resource exhaustion. Maybe check ss -s during peak times to see how many connections are in TIMEWAIT?

    We prefer scaling out horizontally for this very reason.
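
A small sketch of how the ss -s check could be automated during an event, assuming the summary line contains a "timewait N" field as it does in common iproute2 versions. The function names are hypothetical, and the exact output layout may differ between versions.

```python
import re
import subprocess
from typing import Optional


def parse_timewait(ss_summary: str) -> Optional[int]:
    """Extract the TIME-WAIT socket count from `ss -s` output.

    Returns None if no 'timewait' field is found.
    """
    match = re.search(r"timewait\s+(\d+)", ss_summary)
    return int(match.group(1)) if match else None


def current_timewait() -> Optional[int]:
    """Run `ss -s` and return the current TIME-WAIT count."""
    result = subprocess.run(
        ["ss", "-s"], capture_output=True, text=True, check=True
    )
    return parse_timewait(result.stdout)
```

Logging current_timewait() every few seconds before and during the Q&A hour would show whether the count climbs sharply when the slowdown starts.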

  • Oh my...! Thank you for all the replies. And especially julian for taking the time! Amazing, I will check all of that right now and run ss -s next time we have an event.

    Will report back asap!

  • @yariplus What would a good ulimit be for nodebb? (or even for the described system above?)

  • Admin

    We use 500000, set in /etc/security/limits.conf, and fs.file-max = 2097152 in /etc/sysctl.conf.

    No guarantees that those are the right numbers, of course 😛
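
To see how close a system comes to the fs.file-max ceiling mentioned above, here is a rough sketch assuming Linux's /proc/sys/fs/file-nr. That file holds three numbers: allocated file handles, allocated-but-unused handles, and the system-wide maximum. The helper names are hypothetical.

```python
from typing import Dict


def parse_file_nr(text: str) -> Dict[str, int]:
    """Parse the three whitespace-separated fields of /proc/sys/fs/file-nr."""
    allocated, unused, maximum = (int(value) for value in text.split())
    return {"allocated": allocated, "unused": unused, "max": maximum}


def file_handle_usage(text: str) -> float:
    """Return the fraction of the system-wide file-handle limit in use."""
    stats = parse_file_nr(text)
    return (stats["allocated"] - stats["unused"]) / stats["max"]


def read_file_handle_usage() -> float:
    """Read the live value from the kernel (Linux only)."""
    with open("/proc/sys/fs/file-nr") as fh:
        return file_handle_usage(fh.read())
```

A usage fraction that stays far below 1.0 during an event would suggest that the system-wide limit is not the bottleneck, and that per-process limits or the proxy settings deserve a closer look.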

