Forum slows down to unusability during events

worp1900

Apologies in advance: this is going to get lengthy. Please hear me out though, as I am clueless as to how to proceed.

I run https://schreibnacht.de. We host a monthly event where a renowned author joins the community for an hour-long Q&A. During this hour the forum slows down to the point of being unusable: some pages take longer than the timeout to load, some requests are cancelled outright and return a 503, and some requests are lightning fast, returning in less than a second.
Which requests return fast is random and does not follow any pattern. About 95% of all requests either take tens of seconds to load or end in a timeout/503.

The NodeBB instance is a single one running on 6 vCores with 16 GB RAM, with Redis as the DB, on Ubuntu 16.04. The system load stays very steady at 0.4/0.32/0.27. RAM usage is 8.9 GB of 16 GB, and CPU usage is minimal outside of Redis' bgsave runs, during which one core is busy while the others keep idling. 336 MB of 1 GB swap is in use. We use Apache as a proxy between the clients and NodeBB.

Stopping the forum via ./nodebb stop, stopping Redis (shutdown save), and then restarting Redis followed by the forum does not change the situation, which is one of the weirdest parts about it.

During our events we have around 50-150 concurrent users, depending on how renowned the special guest is. (So let's face it, it SHOULDN'T have any problems at all).

I desperately need some help identifying bottlenecks, hints for analysis where I can start looking.

I will go so far as to ping @julian directly. Maybe you have some insight, although I know you are very busy.

I could bet my pants and shoes that it's some misconfiguration or some error happening somewhere between browser, Apache, NodeBB and Redis, but I just don't know my way around them well enough to find the needle in the haystack.

Thank you for all your help and time guys.

baris

How many nodebb instances are running? Have you checked https://docs.nodebb.org/configuring/scaling/

yariplus

Sounds to me like some type of limit is being reached. Maybe check that your ulimits are high enough, and try changing some apache settings https://httpd.apache.org/docs/2.4/misc/perf-tuning.html

julian

Yeah, definitely could be resource exhaustion. Check ss -s during peak times to see TIMEWAIT?

We prefer scaling out horizontally for this very reason.
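For anyone who wants to log the TIME_WAIT count over the course of an event instead of eyeballing ss -s, here is a minimal sketch. It is not part of NodeBB; it assumes a Linux host and reads /proc/net/tcp and /proc/net/tcp6 directly, where the kernel reports TIME_WAIT as state code 06. The function name count_time_wait is made up for this example.

```python
from pathlib import Path

# State code the Linux kernel prints for TIME_WAIT in /proc/net/tcp.
TIME_WAIT = "06"


def count_time_wait(proc_net_tcp_text):
    """Count TIME_WAIT sockets in the text of /proc/net/tcp (or tcp6)."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        # Columns: sl, local_address, rem_address, st, ...
        if len(fields) > 3 and fields[3] == TIME_WAIT:
            count += 1
    return count


if __name__ == "__main__":
    total = 0
    for name in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(name)
        if path.exists():  # absent on non-Linux systems
            total += count_time_wait(path.read_text())
    print("sockets in TIME_WAIT:", total)
```

Running it once a minute from cron during an event would give a rough timeline to compare against the slowdowns.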

worp1900

Oh my...! Thank you for all the replies. And especially julian for taking the time! Amazing. I will check all of that right now and run ss -s next time we have an event.

Will report back asap!

worp1900

@yariplus What would a good ulimit be for nodebb? (or even for the described system above?)

julian

We use 500000, set in /etc/security/limits.conf and fs.file-max = 2097152 in /etc/sysctl.conf

No guarantees that those are the right numbers, of course

worp1900

So... this took a long time to work on. I finally have some odd numbers and findings that I'd like to share.

Maybe you have some insight or interpretations for me:

I found the server's process landscape basically always like this:
[Screenshot: htop and ss output]

What caught my eye:

One node process with high CPU load and 3 others (further down) with 0% usage.
Redis' bgsave command hammers the CPU.
During Redis bgsaves the timewait count reported by ss jumps from about 18-19k or 24k to 31-35k or even 40k. It stays around 24k while no bgsave is running.

Notes
The number of open files, as counted by lsof, is:

sudo lsof | wc -l
56353

While the maximum seems to be:

cat /proc/sys/fs/file-max
524288

So ulimit shouldn't be the problem. Right?
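One caveat on that comparison: /proc/sys/fs/file-max is the system-wide ceiling, while ulimit -n is a per-process limit, so the node and Apache processes can hit their own limit long before the system-wide one is reached. A minimal sketch for looking at both, assuming Linux; the helper names are invented for this example, and the PID of the node process has to be passed in by hand:

```python
import sys
from pathlib import Path


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft, hard = line.split()[3:5]
            return soft, hard  # kept as strings, a value may be 'unlimited'
    return None


def parse_file_nr(text):
    """Parse /proc/sys/fs/file-nr: allocated, unused, system-wide maximum."""
    allocated, unused, maximum = (int(x) for x in text.split())
    return {"allocated": allocated, "unused": unused, "max": maximum}


if __name__ == "__main__":
    # Usage: python check_limits.py <pid of the node process>
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    limits = Path("/proc") / pid / "limits"
    file_nr = Path("/proc/sys/fs/file-nr")
    if limits.exists():
        print("per-process open files (soft, hard):",
              parse_max_open_files(limits.read_text()))
    if file_nr.exists():
        print("system-wide file handles:", parse_file_nr(file_nr.read_text()))
```

If the soft limit shown for the node or Apache process is something small like 1024, the per-process limit is still worth raising even though the system-wide maximum looks generous.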

I was also able to see, using a load testing tool (Locust) that I finally got around to setting up, that simulating 200 users (for their behaviour see #1 below) did not produce any exceptions UNTIL Redis' bgsave started. While bgsave was running I was getting a 60-90% request error rate with the following failures:

# fails	Method	Name	Type
8850	GET	/topic/11705/testthema2	u'ConnectionError(MaxRetryError("HTTPSConnectionPool(host=\'schreibnacht.de\', port=443): Max retries exceeded with url: /topic/11705/testthema2 (Caused by NewConnectionError(\'<urllib3.connection.VerifiedHTTPSConnection object at 0x....>: Failed to establish a new connection: [Errno 61] Connection refused\',))",),)'
2870	GET	/user/someUser/settings	u'ConnectionError(MaxRetryError("HTTPSConnectionPool(host=\'schreibnacht.de\', port=443): Max retries exceeded with url: /user/someUser/settings (Caused by NewConnectionError(\'<urllib3.connection.VerifiedHTTPSConnection object at 0x....>: Failed to establish a new connection: [Errno 61] Connection refused\',))",),)'
1	GET	/topic/11705/testthema2	u'ConnectionError(MaxRetryError("HTTPSConnectionPool(host=\'schreibnacht.de\', port=443): Max retries exceeded with url: /topic/11705/testthema2 (Caused by NewConnectionError(\'<urllib3.connection.VerifiedHTTPSConnection object at 0x....>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known\',))",),)'
2899	GET	/user/janerow	u'ConnectionError(MaxRetryError("HTTPSConnectionPool(host=\'schreibnacht.de\', port=443): Max retries exceeded with url: /user/janerow (Caused by NewConnectionError(\'<urllib3.connection.VerifiedHTTPSConnection object at 0x....>: Failed to establish a new connection: [Errno 61] Connection refused\',))",),)'
2	GET	/topic/11705/testthema2	u'ConnectionError(MaxRetryError("HTTPSConnectionPool(host=\'schreibnacht.de\', port=443): Max retries exceeded with url: /topic/11705/testthema2 (Caused by NewConnectionError(\'<urllib3.connection.VerifiedHTTPSConnection object at 0x....>: Failed to establish a new connection: [Errno 60] Operation timed out\',))",),)'
1	GET	/user/janerow	u"ConnectionError(ProtocolError('Connection aborted.', error(54, 'Connection reset by peer')),)"
2949	GET	/	u'ConnectionError(MaxRetryError("HTTPSConnectionPool(host=\'schreibnacht.de\', port=443): Max retries exceeded with url: / (Caused by NewConnectionError(\'<urllib3.connection.VerifiedHTTPSConnection object at 0x....>: Failed to establish a new connection: [Errno 61] Connection refused\',))",),)'
1	GET	/topic/11705/testthema2	u"ConnectionError(ProtocolError('Connection aborted.', error(54, 'Connection reset by peer')),)"
2	GET	/topic/11705/testthema2	u'ConnectionError(ProtocolError(\'Connection aborted.\', BadStatusLine("\'\'",)),)'
1	GET	/user/janerow	u'ConnectionError(ProtocolError(\'Connection aborted.\', BadStatusLine("\'\'",)),)'
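To make the table above easier to read, here is a small sketch that groups the failure counts by error kind. The counts are copied from the table by hand; the grouping labels are shortened versions of the error messages.

```python
from collections import Counter

# (number of failures, error kind), one entry per row of the table above.
FAILURES = [
    (8850, "Connection refused"),
    (2870, "Connection refused"),
    (1, "nodename nor servname provided, or not known"),
    (2899, "Connection refused"),
    (2, "Operation timed out"),
    (1, "Connection reset by peer"),
    (2949, "Connection refused"),
    (1, "Connection reset by peer"),
    (2, "BadStatusLine"),
    (1, "BadStatusLine"),
]


def tally(failures):
    """Sum the failure counts per error kind."""
    totals = Counter()
    for count, kind in failures:
        totals[kind] += count
    return totals


if __name__ == "__main__":
    totals = tally(FAILURES)
    grand_total = sum(totals.values())
    for kind, count in totals.most_common():
        share = 100.0 * count / grand_total
        print("%6d  %6.2f%%  %s" % (count, share, kind))
```

"Connection refused" accounts for 17568 of the 17576 failures. That suggests, though it does not prove, that most requests were rejected at the TCP level on port 443 before they ever reached NodeBB.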

Deductions I made myself (please correct me if I'm wrong)

My Apache proxy is not configured correctly. I have a cluster of node servers running on 4 ports, but Apache is only routing requests to one of those ports, so even though I have a cluster, I am effectively running a single instance. (I have found some Apache configs that don't seem to establish the connection to a balancer group correctly. I will keep looking. #2)
Redis bgsave basically disables communication with the forum, but I have not figured out why.

Further actions

The next obvious thing is to get Apache to balance the load correctly across the 4 node processes.
Next I need to understand why Redis' bgsave is killing the forum. It might just be that both run on the same core and Redis uses 100% of that core, so nothing else can use it in the meantime? Any other suggestions are very welcome.

Additions
#1 - The simulated users log in, then heavily refresh the index page, a good number of forum posts, their profile and their settings page, then log out.

#2 - Apache configuration I was playing with:

<Proxy balancer://nodebbcluster>
        # node process 4567
        BalancerMember http://x.x.x.x:4567

        # node process 4568
        BalancerMember http://x.x.x.x:4568

        # node process 4569
        BalancerMember http://x.x.x.x:4569
</Proxy>

And then :

   ProxyPass / balancer://nodebbcluster/

With nodebb configuration:

{
    "url": "https://www.schreibnacht.de",
    "port": ["4567", "4568", "4569"],
    "secret": "...",
    "database": "redis",
    "redis": {
        "host": "127.0.0.1",
        "port": "6379",
        "password": "...",
        "database": "0"
    },
    "secure": true
}
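Since the post mentions 4 processes while both snippets above list three ports, a quick consistency check between the two files might be worth having. This is only a sketch: the function names are made up, IPv6 member addresses are not handled, and it only compares port numbers, it does not tell whether Apache actually balances across them.

```python
import json
import re


def balancer_ports(apache_conf_text):
    """Ports of all BalancerMember lines; comment lines are ignored."""
    ports = set()
    for line in apache_conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue
        match = re.match(r"BalancerMember\s+\w+://[^\s:/]+:(\d+)", stripped)
        if match:
            ports.add(int(match.group(1)))
    return ports


def nodebb_ports(config_json_text):
    """Ports from NodeBB's config.json; 'port' may be one value or a list."""
    port = json.loads(config_json_text).get("port", 4567)  # 4567 is the NodeBB default
    if not isinstance(port, list):
        port = [port]
    return {int(p) for p in port}


if __name__ == "__main__":
    # Inline copies of the two snippets above; in practice read the real files.
    apache = """
    <Proxy balancer://nodebbcluster>
        # node process 4567
        BalancerMember http://x.x.x.x:4567
        BalancerMember http://x.x.x.x:4568
        BalancerMember http://x.x.x.x:4569
    </Proxy>
    """
    config = '{"port": ["4567", "4568", "4569"]}'
    a, n = balancer_ports(apache), nodebb_ports(config)
    print("only in Apache config:", sorted(a - n))
    print("only in NodeBB config:", sorted(n - a))
```

Any port that shows up in only one of the two lists is either a node process that never receives traffic or a balancer member with nothing listening behind it.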

Dravere

May I ask why you're using Apache as a load balancer and reverse proxy? It is a terrible idea to use Apache HTTPd for something like this. Apache is a general web server. Nginx was created as a reverse proxy and load balancer and is much faster and much more efficient at it. If you're running into performance issues with Apache, I would suggest switching over to Nginx.

Regarding the Redis process: what are the specifications of your HDD/SSD? Could it be that the drive is too slow for the writes by Redis?

worp1900

@Dravere thx for the reply. Sadly it couldn't be more true.

The reason why I chose this setup (for now; it certainly won't stick around, as I hate it) is that I have Plesk installed to manage the standard websites on the server. Plesk comes with Apache and (at least in my Plesk version) can put an nginx in front as a reverse proxy for all its features. Sadly, the nginx version it ships is older than 1.3.13, which is the minimum required to use nginx as a reverse proxy for NodeBB.

I have tried manually updating nginx but, as Plesk brings its own custom repo to pull its own images from, this is non-trivial and rendered my setup unusable until it was reset. So currently I am stuck with Apache until I can get that figured out.

The only solution I can think of right now is to put a small machine in place that works solely as a load balancer and reverse proxy, to which all domains point in the first place. That one can then distribute requests to the real server. That would enable me to run an additional nginx (not the Plesk one) on a "non-80" port, allowing me to use any nginx version. But as this is a side project that doesn't generate any income yet, putting a machine in front is not a viable option.

The Server is running on full SSDs. I would need to ask them which ones specifically are being used if it's of relevance. I figured "as long as it's SSD, it should be fine" up until now. Let me know if I should inquire and I will.

Thanks again for your time. Cheers!