So... this took a long time to work on. I finally have some odd numbers and findings that I'd like to share.
Maybe you have some insights or interpretations for me.
The server's process list basically always looked like this:
What caught my eye:
- One node process has a high CPU load, while 3 others (further down) sit at 0% usage.
- Redis' bgsave hits the CPU hard.
- During Redis bgsaves, the TIME-WAIT socket count reported by ss jumps from 18-19k or 24k to 31-35k or even 40k; it stays around 24k while no bgsave is running.
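To make the TIME-WAIT numbers easier to compare between "bgsave running" and "idle", the counting could be scripted. This is only a minimal sketch: it assumes the `ss` tool from iproute2 is installed and that the first column of `ss -tan` output is the socket state. The function names are made up for illustration.

```python
import subprocess
from collections import Counter


def count_tcp_states(ss_output: str) -> Counter:
    """Tally the first column (socket state) of `ss -tan` output."""
    states = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, which starts with "State".
        if not fields or fields[0] == "State":
            continue
        states[fields[0]] += 1
    return states


def sample_once() -> Counter:
    """Run `ss -tan` once and tally the states (assumes ss is on PATH)."""
    result = subprocess.run(
        ["ss", "-tan"], capture_output=True, text=True, check=True
    )
    return count_tcp_states(result.stdout)


# Example: print(sample_once().get("TIME-WAIT", 0))
```

Calling `sample_once()` every few seconds and logging the result next to the bgsave start and end times should show whether the jump really lines up with the saves.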
Notes
The number of currently open files (as counted by lsof) is:
sudo lsof | wc -l
56353
While the system-wide maximum seems to be:
cat /proc/sys/fs/file-max
524288
So ulimit shouldn't be the problem. Right?
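One caveat on the numbers above: /proc/sys/fs/file-max is the system-wide ceiling, whereas ulimit -n is a per-process limit that is usually much lower, so the two are not directly comparable. A rough sketch for reading the per-process limit of a running process (for example a node or redis-server PID) from /proc/<pid>/limits follows. The helper names are hypothetical, and it assumes a Linux /proc filesystem.

```python
import resource


def parse_nofile_limit(limits_text: str):
    """Return (soft, hard) 'Max open files' values from /proc/<pid>/limits text.

    Values are returned as strings because they may be 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files  <soft>  <hard>  files"
            soft, hard = line.split()[3:5]
            return soft, hard
    return None


def nofile_limit_of(pid: int):
    """Read the open-file limit of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_nofile_limit(fh.read())


def own_nofile_limit():
    """Soft and hard open-file limit of the current Python process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

If the soft limit of the node or Apache process turns out to be something like 1024, that would be worth ruling out before blaming Redis alone.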
Using a load testing tool (Locust) that I finally got around to setting up, I could also see that simulating 200 users (for their behaviour see #1 below) produced no exceptions UNTIL Redis' bgsave started. While bgsave was running, 60-90% of requests failed, with the following failures:
# fails Method Name Type
8850 GET /topic/11705/testthema2 u'ConnectionError(MaxRetryError("HTTPSConnectionPool(host=\'schreibnacht.de\', port=443): Max retries exceeded with url: /topic/11705/testthema2 (Caused by NewConnectionError(\'<urllib3.connection.VerifiedHTTPSConnection object at 0x....>: Failed to establish a new connection: [Errno 61] Connection refused\',))",),)'
2870 GET /user/someUser/settings u'ConnectionError(MaxRetryError("HTTPSConnectionPool(host=\'schreibnacht.de\', port=443): Max retries exceeded with url: /user/someUser/settings (Caused by NewConnectionError(\'<urllib3.connection.VerifiedHTTPSConnection object at 0x....>: Failed to establish a new connection: [Errno 61] Connection refused\',))",),)'
1 GET /topic/11705/testthema2 u'ConnectionError(MaxRetryError("HTTPSConnectionPool(host=\'schreibnacht.de\', port=443): Max retries exceeded with url: /topic/11705/testthema2 (Caused by NewConnectionError(\'<urllib3.connection.VerifiedHTTPSConnection object at 0x....>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known\',))",),)'
2899 GET /user/janerow u'ConnectionError(MaxRetryError("HTTPSConnectionPool(host=\'schreibnacht.de\', port=443): Max retries exceeded with url: /user/janerow (Caused by NewConnectionError(\'<urllib3.connection.VerifiedHTTPSConnection object at 0x....>: Failed to establish a new connection: [Errno 61] Connection refused\',))",),)'
2 GET /topic/11705/testthema2 u'ConnectionError(MaxRetryError("HTTPSConnectionPool(host=\'schreibnacht.de\', port=443): Max retries exceeded with url: /topic/11705/testthema2 (Caused by NewConnectionError(\'<urllib3.connection.VerifiedHTTPSConnection object at 0x....>: Failed to establish a new connection: [Errno 60] Operation timed out\',))",),)'
1 GET /user/janerow u"ConnectionError(ProtocolError('Connection aborted.', error(54, 'Connection reset by peer')),)"
2949 GET / u'ConnectionError(MaxRetryError("HTTPSConnectionPool(host=\'schreibnacht.de\', port=443): Max retries exceeded with url: / (Caused by NewConnectionError(\'<urllib3.connection.VerifiedHTTPSConnection object at 0x....>: Failed to establish a new connection: [Errno 61] Connection refused\',))",),)'
1 GET /topic/11705/testthema2 u"ConnectionError(ProtocolError('Connection aborted.', error(54, 'Connection reset by peer')),)"
2 GET /topic/11705/testthema2 u'ConnectionError(ProtocolError(\'Connection aborted.\', BadStatusLine("\'\'",)),)'
1 GET /user/janerow u'ConnectionError(ProtocolError(\'Connection aborted.\', BadStatusLine("\'\'",)),)'
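To see which error dominates, the failure counts can be grouped by error kind. The sketch below just transcribes the counts from the table above by hand; the short labels are my own abbreviations of the error messages.

```python
from collections import Counter

# (failure count, error kind), transcribed from the Locust failure table.
FAILURES = [
    (8850, "Connection refused"),
    (2870, "Connection refused"),
    (1, "nodename nor servname provided"),
    (2899, "Connection refused"),
    (2, "Operation timed out"),
    (1, "Connection reset by peer"),
    (2949, "Connection refused"),
    (1, "Connection reset by peer"),
    (2, "BadStatusLine"),
    (1, "BadStatusLine"),
]


def tally(failures):
    """Sum the failure counts per error kind."""
    totals = Counter()
    for count, kind in failures:
        totals[kind] += count
    return totals


totals = tally(FAILURES)
grand_total = sum(totals.values())
for kind, count in totals.most_common():
    print(f"{count:6d}  {count / grand_total:7.2%}  {kind}")
```

"Connection refused" accounts for 17568 of the 17576 failures. That error is raised while opening the TCP connection to port 443, before any HTTP exchange takes place, so the requests seem to be rejected before they could reach a node process.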
Deductions I made myself (please correct me if I'm wrong)
- My Apache proxy is not configured correctly. I have a cluster of node servers running on 4 ports, but Apache only routes requests to one of those ports, so even though I have a cluster, I am effectively running a single instance. (I have found some Apache configs, but they don't seem to establish the connection to a balancer group correctly. I will keep looking. See #2.)
- Redis bgsave basically cuts off communication with the forum, but I have not figured out why.
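Before digging further into the Apache side of the first deduction, it may help to confirm that every node process is actually listening on its port. A minimal sketch, assuming it is run on the server itself: the host is a placeholder and the ports are the ones from the NodeBB configuration in #2.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


def probe_ports(host, ports):
    """Map each port to whether something accepts connections on it."""
    return {port: port_is_open(host, port) for port in ports}


# Example (placeholder host, ports from the NodeBB config):
# print(probe_ports("127.0.0.1", [4567, 4568, 4569]))
```

This only shows whether the backends accept connections. It says nothing about whether Apache forwards requests to them, so it narrows the problem down rather than proving the balancer is at fault.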
Further actions
The next obvious step is to get Apache to balance the load correctly across the 4 node processes.
After that I need to understand why Redis bgsave is killing the forum. It might just be that both run on the same core and Redis uses 100% of that core, so nothing else can use it in the meantime? Any other suggestions are very welcome.
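The "same core" guess could be checked by sampling which CPU each process last ran on. On Linux this is field 39 (processor) of /proc/<pid>/stat. The sketch below assumes that layout; the function names are made up. Note that bgsave forks a child process to write the dump, so the child's PID is the interesting one while a save is running.

```python
def last_cpu_from_stat(stat_line: str) -> int:
    """Extract field 39 (CPU last executed on) from a /proc/<pid>/stat line."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the LAST ')' before tokenising the rest.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 39 sits at index 39 - 3 = 36.
    return int(rest[36])


def last_cpu_of(pid: int) -> int:
    """Read /proc/<pid>/stat for a running process (Linux only)."""
    with open(f"/proc/{pid}/stat") as fh:
        return last_cpu_from_stat(fh.read())
```

Unless the processes are pinned to a core, the scheduler is free to move them around, so a single sample proves little. Sampling the node and Redis PIDs repeatedly during a bgsave would show whether they keep landing on the same core.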
Additions
#1 - The simulated users log in, then heavily refresh the index page, a good number of forum posts, their profile and their settings page, and then log out.
#2 - Apache configuration I was playing with:
<Proxy balancer://nodebbcluster>
    # node process 4567
    BalancerMember http://x.x.x.x:4567
    # node process 4568
    BalancerMember http://x.x.x.x:4568
    # node process 4569
    BalancerMember http://x.x.x.x:4569
</Proxy>
And then:
ProxyPass / balancer://nodebbcluster/
With nodebb configuration:
{
    "url": "https://www.schreibnacht.de",
    "port": ["4567", "4568", "4569"],
    "secret": "...",
    "database": "redis",
    "redis": {
        "host": "127.0.0.1",
        "port": "6379",
        "password": "...",
        "database": "0"
    },
    "secure": true
}