Posts made by worp1900

worp1900

@Dravere thx for the reply. Sadly it couldn't be more true.

The reason why I chose this setup (for now, it certainly won't stick around as I hate it) is that I have Plesk installed to manage the standard websites on the server. Plesk comes with Apache and (at least in my Plesk version) can put an nginx in front as a reverse proxy to all its features. Sadly the nginx version it ships with is older than the 1.3.13 that would be required to use it as a reverse proxy for nodebb.

I have tried manually updating nginx but, as Plesk brings its own custom repo to pull its own images from, this is non-trivial and rendered my setup unusable until it was reset. So currently I am stuck with Apache until I can get that figured out.

The only solution I could think of right now is to put a small machine in place that works as a sole load balancer and reverse proxy to which all domains point in the first place. That one can then distribute requests to the real server. That would enable me to run an additional nginx (not the Plesk one) on a "non-80" port, allowing me to use any nginx. But as this is a side project that doesn't generate any income yet, putting a machine in front is not a viable option.

The Server is running on full SSDs. I would need to ask them which ones specifically are being used if it's of relevance. I figured "as long as it's SSD, it should be fine" up until now. Let me know if I should inquire and I will.

Thanks again for your time. Cheers!

worp1900

So... this took a long time to work on. I finally have some odd numbers and findings that I'd like to share.

Maybe you have some insight or interpretations for me:

I found the server's process landscape basically always like this:
[Screenshot: htop and ss output]

What caught my eye:

One node process with high cpu load and 3 others (further down) with 0% usage.
redis' bgsave command hits the CPU hard.
During redis bgsaves the timewait count reported by ss jumps from 18/19k or 24k to 31/35k or even 40k; it stays around 24k while no bgsave is running.

Notices
The number of currently open files is:

sudo lsof | wc -l
56353

While the maximum seems to be:

cat /proc/sys/fs/file-max
524288

So ulimit shouldn't be the problem. Right?
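
One caveat on the comparison above: /proc/sys/fs/file-max is the system-wide cap on open files, while ulimit -n is a per-process limit, so the two numbers answer different questions. A minimal sketch for reading both follows. The function name fd_limits is made up for illustration, and the values it reports are those of the Python process itself, not of the node or redis processes; on Linux the limits of a running process can be read from /proc/<pid>/limits.

```python
import resource
from pathlib import Path


def fd_limits():
    """Return (soft, hard, system_max) open-file limits.

    soft/hard are the per-process limits of *this* process (what
    `ulimit -n` and `ulimit -Hn` show in the same shell); system_max is
    the system-wide cap from /proc/sys/fs/file-max, or None if that
    file does not exist (non-Linux systems).
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    file_max = Path("/proc/sys/fs/file-max")
    system_max = int(file_max.read_text()) if file_max.exists() else None
    return soft, hard, system_max


if __name__ == "__main__":
    soft, hard, system_max = fd_limits()
    print(f"per-process soft limit: {soft}")
    print(f"per-process hard limit: {hard}")
    print(f"system-wide maximum:    {system_max}")
```

If the per-process soft limit turns out to be far below the system-wide maximum, the per-process value is the one a single node process would hit first.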

I was also able to see, using a load testing tool (Locust) that I finally got around to setting up, that simulating 200 users (behaviour see #1 below) did not produce any exceptions UNTIL redis' bgsave started. While bgsave was running I was getting a 60-90% request-error percentage with the following failures:

# fails	Method	Name	Type
8850	GET	/topic/11705/testthema2	u'ConnectionError(MaxRetryError("HTTPSConnectionPool(host=\'schreibnacht.de\', port=443): Max retries exceeded with url: /topic/11705/testthema2 (Caused by NewConnectionError(\'<urllib3.connection.VerifiedHTTPSConnection object at 0x....>: Failed to establish a new connection: [Errno 61] Connection refused\',))",),)'
2870	GET	/user/someUser/settings	u'ConnectionError(MaxRetryError("HTTPSConnectionPool(host=\'schreibnacht.de\', port=443): Max retries exceeded with url: /user/someUser/settings (Caused by NewConnectionError(\'<urllib3.connection.VerifiedHTTPSConnection object at 0x....>: Failed to establish a new connection: [Errno 61] Connection refused\',))",),)'
1	GET	/topic/11705/testthema2	u'ConnectionError(MaxRetryError("HTTPSConnectionPool(host=\'schreibnacht.de\', port=443): Max retries exceeded with url: /topic/11705/testthema2 (Caused by NewConnectionError(\'<urllib3.connection.VerifiedHTTPSConnection object at 0x....>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known\',))",),)'
2899	GET	/user/janerow	u'ConnectionError(MaxRetryError("HTTPSConnectionPool(host=\'schreibnacht.de\', port=443): Max retries exceeded with url: /user/janerow (Caused by NewConnectionError(\'<urllib3.connection.VerifiedHTTPSConnection object at 0x....>: Failed to establish a new connection: [Errno 61] Connection refused\',))",),)'
2	GET	/topic/11705/testthema2	u'ConnectionError(MaxRetryError("HTTPSConnectionPool(host=\'schreibnacht.de\', port=443): Max retries exceeded with url: /topic/11705/testthema2 (Caused by NewConnectionError(\'<urllib3.connection.VerifiedHTTPSConnection object at 0x....>: Failed to establish a new connection: [Errno 60] Operation timed out\',))",),)'
1	GET	/user/janerow	u"ConnectionError(ProtocolError('Connection aborted.', error(54, 'Connection reset by peer')),)"
2949	GET	/	u'ConnectionError(MaxRetryError("HTTPSConnectionPool(host=\'schreibnacht.de\', port=443): Max retries exceeded with url: / (Caused by NewConnectionError(\'<urllib3.connection.VerifiedHTTPSConnection object at 0x....>: Failed to establish a new connection: [Errno 61] Connection refused\',))",),)'
1	GET	/topic/11705/testthema2	u"ConnectionError(ProtocolError('Connection aborted.', error(54, 'Connection reset by peer')),)"
2	GET	/topic/11705/testthema2	u'ConnectionError(ProtocolError(\'Connection aborted.\', BadStatusLine("\'\'",)),)'
1	GET	/user/janerow	u'ConnectionError(ProtocolError(\'Connection aborted.\', BadStatusLine("\'\'",)),)'
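
To make the table easier to read, the failures can be grouped by cause instead of by URL. The sketch below is one way to do that, with the counts copied from the table above and the failure texts shortened to their distinguishing part. The names FAILURES, CAUSES and tally are made up for illustration.

```python
from collections import Counter

# (fail count, shortened failure text), in the order of the table above.
FAILURES = [
    (8850, "[Errno 61] Connection refused"),
    (2870, "[Errno 61] Connection refused"),
    (1, "[Errno 8] nodename nor servname provided, or not known"),
    (2899, "[Errno 61] Connection refused"),
    (2, "[Errno 60] Operation timed out"),
    (1, "error(54, 'Connection reset by peer')"),
    (2949, "[Errno 61] Connection refused"),
    (1, "error(54, 'Connection reset by peer')"),
    (2, "BadStatusLine"),
    (1, "BadStatusLine"),
]

# Substrings used to recognise each cause.
CAUSES = [
    "Connection refused",
    "Operation timed out",
    "Connection reset by peer",
    "nodename nor servname",
    "BadStatusLine",
]


def tally(failures):
    """Sum the fail counts per cause; unrecognised texts go to 'other'."""
    totals = Counter()
    for count, text in failures:
        cause = next((c for c in CAUSES if c in text), "other")
        totals[cause] += count
    return totals


if __name__ == "__main__":
    for cause, count in tally(FAILURES).most_common():
        print(f"{count:6d}  {cause}")
```

Grouped this way, 17568 of the 17576 failures are refused connections, so nearly every failed request was turned away before reaching the forum at all rather than timing out inside it.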

Deductions I made myself (pls correct if wrong)

My Apache proxy is not configured correctly. I have a cluster of node servers running on 4 ports but Apache is only routing requests to one of those ports, so even though I have a cluster, I am running with one instance. (I have found some Apache configs that don't seem to establish the connection to a balancer group correctly. I will keep looking. #2)
Redis bgsave basically disables communication with the forum, but I have not figured out why.

Further actions

The next obvious thing is to get the Apache cluster working with correct load balancing between the 4 processes.
Next I need to understand why redis bgsave is killing the forum. It might just be that both run on the same core and redis uses this core 100%, so nothing else can use it in the meantime? Any other suggestions very welcome.

Additions
#1 - The simulated users log in, then heavily refresh the index page, a good number of forum posts, their profile and their settings page, then log out.

#2 - Apache configuration I was playing with:

<Proxy balancer://nodebbcluster>
        # node process 4567
        BalancerMember http://x.x.x.x:4567

        # node process 4568
        BalancerMember http://x.x.x.x:4568

        # node process 4569
        BalancerMember http://x.x.x.x:4569
</Proxy>

And then :

   ProxyPass / balancer://nodebbcluster/

With nodebb configuration:

{
    "url": "https://www.schreibnacht.de",
    "port": ["4567", "4568", "4569"],
    "secret": "...",
    "database": "redis",
    "redis": {
        "host": "127.0.0.1",
        "port": "6379",
        "password": "...",
        "database": "0"
    },
    "secure": true
}
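
Before debugging the balancer itself, it may help to confirm that every port listed in the nodebb configuration actually accepts connections. The following is a minimal sketch using only the Python standard library. The function name check_ports is made up for illustration, and it only tests whether a TCP connection can be opened, not whether NodeBB answers correctly on that port.

```python
import socket


def check_ports(host, ports, timeout=1.0):
    """Try to open a TCP connection to each port.

    Returns a dict mapping port number to True (connection accepted)
    or False (refused, timed out, or otherwise unreachable).
    """
    results = {}
    for port in ports:
        port = int(port)  # config.json lists the ports as strings
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

Run on the server itself, check_ports("127.0.0.1", ["4567", "4568", "4569"]) would show which of the configured node processes are listening. Any port reported as False would be a backend that a balancer cannot use, whatever its configuration.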

worp1900

@yariplus What would a good ulimit be for nodebb? (or even for the described system above?)

worp1900

Oh my...! Thank you for all the replies. And especially julian for taking the time! Amazing, I'll check all of that right now and ss -s next time we have an event.

Will report back asap!

worp1900

A word of sorry ahead. This is going to get lengthy. If you will, hear me out though as I am clueless as to how to progress.

I run https://schreibnacht.de. We host a monthly event where a renowned author joins the community for a Q&A for an hour. During this hour the forum slows down to the point of being unusable: loading a page takes longer than the timeout, some requests are cancelled completely and return a 503, and some requests go lightning fast, returning in less than a second.
Which requests return fast is random and does not follow a pattern. 95% of all requests take tens of seconds to load or end in a timeout/503.

The nodebb instance is a single one running on 6 vCores with 16 GB RAM with Redis as DB on Ubuntu 16.04. The system load is at 0.4/0.32/0.27 very constantly. RAM is 8.9 GB / 16 used, CPU usage is minimal outside of Redis' bg-save actions where one Core is used and the others keep idling. 336 MB of 1GB in swap. We use Apache as a proxy between the clients and nodebb.

Stopping the forum via ./nodebb stop, stopping redis (shutdown save), and then restarting first redis and then the forum does not change the situation, which is one of the weirdest parts about it.

During our events we have around 50-150 concurrent users, depending on how renowned the special guest is. (So let's face it, it SHOULDN'T have any problems at all).

I desperately need some help identifying bottlenecks, hints for analysis where I can start looking.

I will go so far as to ping @julian directly, maybe you have some insight, although I know you are very busy.

I could bet my pants and shoes that it's some misconfiguration or some error happening somewhere between browser, Apache, Nodebb and Redis but I just don't know my way around them enough to find the needle in the haystack.

Thank you for all your help and time guys.

worp1900

@juan-g hey hey, thanks for the reply. I appreciate it. Guess I gotta take a wild guess at which tool to use and try my best. I knew about the load testing tools, was just hoping to get a hint as to what tool or configuration is used by the nodebb team.

Nonetheless: Thank you for caring I really do appreciate it.

worp1900

I am still very much in need of this. Any help possible?

bump

worp1900

I think this question does not belong in this forum. Try a general google first.

worp1900

In the meantime I found:

Testing
Run NodeBB in development mode:

./nodebb dev
This will expose the plugin debug logs, allowing you to see if your plugin is loaded, and its hooks registered. Activate your plugin from the administration panel, and test it out.
From https://docs.nodebb.org/development/plugins/

This might also be helpful.

worp1900

@pitaj thank you so much for your reply. The response noting

ProxyPass / http://backendserver:8080/ retry=0

might be worth investigating. I was not aware that Apache waits before retrying a backend it has marked as failed, although it makes obvious sense that something like that would exist.

Do you have any recommendation how I might load-test the nodebb instance? Any tool that you have used or have seen used effectively?

I know about various tools with which one can test a web application, but I might not be aware of any obvious pitfalls that some tools might have with nodeBB or Proxy architectures in general.

If I could create a testing scenario to reproduce the error, I might be able to verify if that fix helps. Otherwise I'll have to wait until our next event and potentially realize it hasn't helped at all.
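
As a starting point for such a testing scenario, here is a minimal sketch of a load generator that uses only the Python standard library. It is not a substitute for a dedicated load testing tool: it sends plain anonymous GET requests, does not log in, and does not open websockets. The names fetch and run_load and all parameter defaults are made up for illustration.

```python
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=10.0):
    """Return the HTTP status as a string, or the exception class name."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return str(resp.status)
    except urllib.error.HTTPError as exc:
        return str(exc.code)        # e.g. "503" answered by the proxy
    except Exception as exc:        # refused, reset, timed out, DNS, ...
        return type(exc).__name__


def run_load(base_url, paths, users=20, rounds=5, timeout=10.0):
    """Simulate `users` concurrent clients.

    Each client requests every path in `paths`, `rounds` times in a row.
    Returns a Counter of outcomes (status codes and exception names).
    """
    def one_user(_user_id):
        outcomes = Counter()
        for _ in range(rounds):
            for path in paths:
                outcomes[fetch(base_url + path, timeout)] += 1
        return outcomes

    total = Counter()
    with ThreadPoolExecutor(max_workers=users) as pool:
        for outcomes in pool.map(one_user, range(users)):
            total += outcomes
    return total
```

Running it once with and once without the proxy change, and comparing the share of "503" and error outcomes, would give a rough before/after picture without having to wait for the next event. It should be pointed at a test instance where possible, since it puts real load on the target.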

This is my Apache config btw, if it's informative in any way:

Protocols h2 http/1.1
ServerName www.schreibnacht.de
ServerAlias schreibnacht.de

SSLEngine on

# Basic security headers
Header always set X-Content-Type-Options "nosniff"
Header always set X-Xss-Protection "1; mode=block"

# NodeBB header
RequestHeader set X-Forwarded-Proto "https"

# Static file cache
<FilesMatch "\.(ico|jpg|jpeg|png|gif|js|css)$">
    <IfModule mod_expires.c>
        ExpiresActive on
        ExpiresDefault "access plus 14 days"
        Header set Cache-Control "public"
    </IfModule>
</FilesMatch>

ProxyRequests off
<Proxy *>
    Order deny,allow
    Allow from all
</Proxy>

# Custom Error Document when NodeBB is offline
ProxyPass /error-documents !
ErrorDocument 503 /error-documents/503.html
Alias /error-documents /path/to/public

# Websocket passthrough
RewriteEngine On
RewriteCond %{REQUEST_URI}  ^/socket.io            [NC]
RewriteCond %{QUERY_STRING} transport=websocket    [NC]
RewriteRule /(.*)           ws://localhost:4567/$1 [P,L]

ProxyPass / http://localhost:4567/ retry=0
ProxyPassReverse / http://localhost:4567/ retry=0

# Log stuff
ErrorLog ${APACHE_LOG_DIR}/www-schreibnacht-error.log
CustomLog ${APACHE_LOG_DIR}/www-schreibnacht-access.log combined

worp1900

Hey everyone, hopefully especially the devs who do this much more often.

I am trying to get to the bottom of a strange issue we keep having. On our NodeBB instance, we have a regular (once a month) event, where the highest number of users (~40-80, so not very many) are concurrently online.

With such low numbers, I figured NodeBB would never have any problems. Sadly we experienced extremely high response times (up to 1 minute, with eventual 503s being thrown) at almost every one of the last 3 events.

The problem is not accompanied by unusual resource usage (CPU < 10%, RAM at 12/16 GB, Swap 660 MB/1GB). While redis is saving to disk we have 1 of 6 available vCores at 100% usage, but the others are basically idle. Also, the slow response times and timeouts do not necessarily occur while redis is saving to disk. They appear much more often than redis writes its dump.

It is usually a fairly short period of time that we experience this problem (10-30 minutes). Then the forum goes back to normal speed.

One of the most curious findings: Stopping nodebb, restarting redis, waiting for it to load the dump and then starting nodebb again does NOT resolve the problem. Not even for a short time. As soon as the forum is reachable again, we are back to high response times and 503s. Then, eventually, without something observable changing, the forum goes back to normal.

The logs of NodeBB and redis do not seem to spill out anything of note. (redis just reports when it's dumping to DB (every 5 minutes as configured) and the success of the operation, NodeBB reports that it was started).

Apache has some conspicuous entries:

[Fri Jul 06 21:18:22.308468 2018] [proxy:error] [pid 21668] AH00940: HTTP: disabled connection for (localhost)
[Fri Jul 06 21:18:22.308604 2018] [authz_core:error] [pid 21668] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/public/503.html, referer: https://www.schreibnacht.de/...
[Fri Jul 06 21:18:22.350118 2018] [proxy:error] [pid 21390] AH00940: HTTP: disabled connection for (localhost)
[Fri Jul 06 21:18:22.350170 2018] [authz_core:error] [pid 21390] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/public/503.html, referer: https://www.schreibnacht.de/...
[Fri Jul 06 21:18:22.395862 2018] [proxy:error] [pid 22316] AH00940: HTTP: disabled connection for (localhost)
[Fri Jul 06 21:18:22.395923 2018] [authz_core:error] [pid 22316] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/public/503.html, referer: https://www.schreibnacht.de/...
[Fri Jul 06 21:18:22.431708 2018] [proxy:error] [pid 21668] AH00940: HTTP: disabled connection for (localhost)
[Fri Jul 06 21:18:22.431781 2018] [authz_core:error] [pid 21668] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/public/503.html, referer: https://www.schreibnacht.de/...
[Fri Jul 06 21:18:22.464752 2018] [proxy:error] [pid 22334] AH00940: HTTP: disabled connection for (localhost)
[Fri Jul 06 21:18:22.464817 2018] [authz_core:error] [pid 22334] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/public/503.html, referer: https://schreibnacht.de/...
[Fri Jul 06 21:18:22.557643 2018] [proxy:error] [pid 22318] AH00940: HTTP: disabled connection for (localhost)
[Fri Jul 06 21:18:22.557711 2018] [authz_core:error] [pid 22318] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/public/503.html, referer: https://www.schreibnacht.de/...
[Fri Jul 06 21:18:22.570026 2018] [proxy:error] [pid 22329] AH00940: HTTP: disabled connection for (localhost)
[Fri Jul 06 21:18:22.570079 2018] [authz_core:error] [pid 22329] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/public/503.html, referer: https://www.schreibnacht.de/...
[Fri Jul 06 21:18:22.738724 2018] [proxy:error] [pid 21529] AH00940: HTTP: disabled connection for (localhost)
[Fri Jul 06 21:18:22.738807 2018] [authz_core:error] [pid 21529] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/public/503.html, referer: https://www.schreibnacht.de/...
[Fri Jul 06 21:18:22.898662 2018] [proxy:error] [pid 21390] AH00940: HTTP: disabled connection for (localhost)
[Fri Jul 06 21:18:22.898724 2018] [authz_core:error] [pid 21390] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/public/503.html, referer: https://schreibnacht.de/...
[Fri Jul 06 21:18:22.957591 2018] [proxy:error] [pid 22324] AH00940: HTTP: disabled connection for (localhost)
[Fri Jul 06 21:18:22.957675 2018] [authz_core:error] [pid 22324] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/public/503.html, referer: https://www.schreibnacht.de/...
[Fri Jul 06 21:18:23.019271 2018] [proxy:error] [pid 21668] AH00940: HTTP: disabled connection for (localhost)
[Fri Jul 06 21:18:23.019347 2018] [authz_core:error] [pid 21668] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/public/503.html, referer: https://schreibnacht.de/...
[Fri Jul 06 21:18:23.033226 2018] [proxy:error] [pid 20448] (111)Connection refused: AH00957: HTTP: attempt to connect to 127.0.0.1:4567 (localhost) failed
[Fri Jul 06 21:18:23.033249 2018] [proxy:error] [pid 20448] AH00959: ap_proxy_connect_backend disabling worker for (localhost) for 60s
[Fri Jul 06 21:18:23.033257 2018] [proxy_http:error] [pid 20448] [client x.x.x.x:xxxx] AH01114: HTTP: failed to make connection to backend: localhost
[Fri Jul 06 21:18:23.033310 2018] [authz_core:error] [pid 20448] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/public/503.html
[Fri Jul 06 21:18:23.090340 2018] [proxy:error] [pid 20860] AH00940: HTTP: disabled connection for (localhost)
[Fri Jul 06 21:18:23.090435 2018] [authz_core:error] [pid 20860] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/public/503.html, referer: https://www.schreibnacht.de/...
[Fri Jul 06 21:18:23.334201 2018] [proxy:error] [pid 22323] AH00940: HTTP: disabled connection for (localhost)
[Fri Jul 06 21:18:23.334289 2018] [authz_core:error] [pid 22323] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/public/503.html, referer: https://www.schreibnacht.de/...
[Fri Jul 06 21:18:23.346097 2018] [proxy:error] [pid 21669] (111)Connection refused: AH00957: HTTP: attempt to connect to 127.0.0.1:4567 (localhost) failed
[Fri Jul 06 21:18:23.346146 2018] [proxy:error] [pid 21669] AH00959: ap_proxy_connect_backend disabling worker for (localhost) for 60s
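
Read by their messages, the entries describe a chain: a connection to 127.0.0.1:4567 is refused (AH00957), the worker is then disabled for 60s (AH00959), requests arriving in that window are rejected without a connection attempt (AH00940), and the custom 503 page itself is denied by the server configuration (AH01630). To see how often each step occurs over a whole log file, the codes can be counted. A minimal sketch, where the names AH_CODE, count_ah_codes and SAMPLE are made up for illustration and the sample lines are the four entries of pid 20448 from the excerpt above:

```python
import re
from collections import Counter

# Apache 2.4 tags each log message with a code of the form AH + 5 digits.
AH_CODE = re.compile(r"\bAH\d{5}\b")


def count_ah_codes(lines):
    """Count the first AH code found on each log line."""
    counts = Counter()
    for line in lines:
        match = AH_CODE.search(line)
        if match:
            counts[match.group()] += 1
    return counts


SAMPLE = [
    "[Fri Jul 06 21:18:23.033226 2018] [proxy:error] [pid 20448] (111)Connection refused: AH00957: HTTP: attempt to connect to 127.0.0.1:4567 (localhost) failed",
    "[Fri Jul 06 21:18:23.033249 2018] [proxy:error] [pid 20448] AH00959: ap_proxy_connect_backend disabling worker for (localhost) for 60s",
    "[Fri Jul 06 21:18:23.033257 2018] [proxy_http:error] [pid 20448] [client x.x.x.x:xxxx] AH01114: HTTP: failed to make connection to backend: localhost",
    "[Fri Jul 06 21:18:23.033310 2018] [authz_core:error] [pid 20448] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/public/503.html",
]

if __name__ == "__main__":
    for code, count in count_ah_codes(SAMPLE).most_common():
        print(code, count)
```

Applied to the full error log, for example via count_ah_codes(open(path)), a high ratio of AH00940 to AH00957 entries would indicate that most 503s come from the 60s disable window rather than from individual refused connections.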

Sadly we only have this event once every month, which limits our testing period extremely, and every time we can experience and try to monitor the problem, we anger our users a lot at the same time.

So I am asking a very wide, very thin spread question:
How might you go about debugging this issue?
What might my apache log hint at?
Which tools are you guys using (if any) to generate artificial load for testing on your servers?
What are regular troubleshooting techniques that you deploy?
What other advice might you have, what other information can I provide to help my cause?

Many thanks for your time and sorry for the lengthy post.

Best regards,
Worp

worp1900

We found the solution after quite a war of versions.

The problem was that the server has Plesk installed. Plesk runs Apache as the primary webserver but can put an Nginx proxy in between, which I wanted to use as a proxy to my nodebb setup as well.

Turns out though that Plesk installs an Nginx 1.11.

In the requirements on Github it is stated that NodeBB requires Nginx 1.3.13 to run (Github - Nodebb - requirements), so 1.11 seemed feasible.

Turns out, it wasn't. After fiddling around with it for over 15 hours, I decided to kick Plesks webserver management, dumped Apache, compiled Nginx from source to get 1.13.7 and retried everything. It worked without any problems.

Thus I am somewhat assuming that NodeBB really requires 1.13.3 to run and not 1.3.13.

The right Nginx version fixed the problem.
I am assuming problems with the proxy functionality for websockets.

Can anyone confirm my suspicion about the version mismatch in the docs?
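
For reference when reading the version numbers in this post: dotted versions are normally compared component by component, which is why 1.11 counts as newer than 1.3.13 (11 > 3 in the second position). A minimal sketch, assuming plain numeric dotted versions with no suffixes; the function names are made up for illustration:

```python
def parse_version(text):
    """Turn '1.3.13' into (1, 3, 13) so tuples compare component-wise."""
    return tuple(int(part) for part in text.split("."))


def satisfies(installed, required):
    """True if `installed` is at least `required`."""
    return parse_version(installed) >= parse_version(required)


if __name__ == "__main__":
    for installed in ("1.11", "1.13.7"):
        print(installed, ">= 1.3.13:", satisfies(installed, "1.3.13"))
```

Both 1.11 and 1.13.7 satisfy a minimum of 1.3.13 under this comparison, so the stated requirement on its own does not distinguish the two versions. What actually differed between the two setups remains unconfirmed in this thread.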

worp1900

I have a successful nodebb setup working with my IP at http://5.35.243.92:4567.

When I try to access the forum with the generic hostname (http://lvps5-35-243-92.dedicated.hosteurope.de:4567) though, I get

"Looks like your connection to NodeBB was lost, please wait while we try to reconnect."

Very similar to this problem:
https://community.nodebb.org/topic/10396/connection-lost

When going to the login page and trying to log in, upon hitting the "Submit" button, I get:

Login failed.
Your login was not successful. Your session might have expired. Please try again.

(freely translated from German)

So I reckon my problem is happening somewhere in the nodebb config.

My working config is:

{
    "url": "http://5.35.243.92:4567",
    "secret": "d68e...",
    "database": "redis",
    "port": [
        "4567",
        "4568",
        "4569"
    ],
    "redis": {
        "host": "127.0.0.1",
        "port": "6379",
        "database": "0"
    }
}

So now I would like to do this:

{
    "url": "http://lvps5-35-243-92.dedicated.hosteurope.de:4567",
    "secret": "d68e2..."
}

After stopping and starting nodebb, I browse to
http://lvps5-35-243-92.dedicated.hosteurope.de:4567/
And I get 403 forbidden - Permission denied when trying to login with my credentials.

And the website reports the same problem as above:

Login failed [...] Session expired.

The log entry for the login request seems to be:

2017-12-10 16:23:11	Error	31.16.249.122	403	POST /login?error=csrf-invalid HTTP/1.1

Any ideas are greatly appreciated.

edit:
Ultimately I will need the forum to run with the actual domain: https://schreibnacht.de

So instead of getting it to run with the generic hostname, making it work with the schreibnacht domain would be even more appreciated.

For the schreibnacht.de domain I get:

/index.html Not found
This page does not exist. Return to homepage.

When browsing to https://schreibnacht.de/
And a

This site can’t be reached
schreibnacht.de unexpectedly closed the connection.
ERR_CONNECTION_CLOSED

When browsing to https://schreibnacht.de:4567

With the nodebb configuration

{
    "url": "https://schreibnacht.de",
    "secret": "d68e...",
    "database": "redis",
    "port": [
        "4567",
        "4568",
        "4569"
    ],
    "redis": {
        "host": "127.0.0.1",
        "port": "6379",
        "database": "0"
    }
}

And an nginx proxy configured like this:

location / {
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header Host $http_host;
    proxy_set_header X-NginX-Proxy true;

    proxy_pass http://127.0.0.1:4567;
    proxy_redirect off;

    # Socket.IO Support
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}

The domain points to the same server as http://lvps5-35-243-92.dedicated.hosteurope.de. Shouldn't that mean that http://schreibnacht.de:4567 delivers the same as http://lvps5-35-243-92.dedicated.hosteurope.de:4567, considering that neither should be picked up by nginx because both use the open port directly?

Curiously, going to http://5.35.243.92:4567/ (IP directly, with nodebb configured to "url": "https://schreibnacht.de") with this configuration, the page still shows up and doesn't display the "/index.html not found" problem. But it does report "You've been disconnected, please wait while we reconnect you."
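
One thing the failing cases above have in common is that the address in the browser differs from the "url" value in the nodebb configuration. Whether that is the cause here is an assumption, not something confirmed in this thread. A minimal sketch for checking whether two addresses share scheme, host and port, with function names made up for illustration:

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return (scheme, host, port), filling in the default port if omitted."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return scheme, host, port


def same_origin(config_url, browser_url):
    """True if both URLs share scheme, host and port; the path is ignored."""
    return origin(config_url) == origin(browser_url)


if __name__ == "__main__":
    config_url = "https://schreibnacht.de"
    for browser_url in (
        "https://schreibnacht.de/",
        "http://5.35.243.92:4567/",
        "http://lvps5-35-243-92.dedicated.hosteurope.de:4567/",
    ):
        print(browser_url, same_origin(config_url, browser_url))
```

Of the three addresses tried, only https://schreibnacht.de/ has the same origin as the configured url; the two direct-port addresses differ in scheme, host and port.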