Error analysis / debugging / performance analysis tools
-
Hey everyone, hopefully especially the devs who do this much more often.
I am trying to get to the bottom of a strange issue we keep having. On our NodeBB instance, we have a regular (once a month) events, where the highest number of users (~40-80, so not very many) are concurrently online.
With such low numbers, I figure NodeBB would not have any problems ever. Sadly we are experiencing extremely high response times (< 1 minute and eventual 503s being thrown) almost every time in the last 3 times we had the event.
The problem is not accompanied by unusual resource usage (CPU < 10%, RAM at 12/16 GB, Swap 660 MB/1GB). While redis is saving to disk we have 1 of 6 availables vCores at 100% usage, but the others are basically idle. Also, the slow response time and timeouts do not necessarily occur while redis is saving to disk. They appear much more often than redis writes its dump.
It is usually a fairly short period of time that we experience this problem (10-30 minutes). Then the forum goes back to normal speed.
One of the most curious findings: Stopping nodebb, restarting redis, waiting for it to load the dump and then starting nodebb again does NOT resolve the problem. Not even for a short time. As soon as the forum is reachable again, we are back to high response times and 503s. Then, eventually, without something observable changing, the forum goes back to normal.
The logs of NodeBB and redis do not seem to spill out anything of note. (redis just reports when it's dumping to DB (every 5 minutes as configured) and the success of the operation, NodeBB reports that it was started).
Apache has some conspicuous entries:
7804 [Fri Jul 06 21:18:22.308468 2018] [proxy:error] [pid 21668] AH00940: HTTP: disabled connection for (localhost) 7805 [Fri Jul 06 21:18:22.308604 2018] [authz_core:error] [pid 21668] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/pu blic/503.html, referer: https://www.schreibnacht.de/... 7806 [Fri Jul 06 21:18:22.350118 2018] [proxy:error] [pid 21390] AH00940: HTTP: disabled connection for (localhost) 7807 [Fri Jul 06 21:18:22.350170 2018] [authz_core:error] [pid 21390] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/pub lic/503.html, referer: https://www.schreibnacht.de/... 7808 [Fri Jul 06 21:18:22.395862 2018] [proxy:error] [pid 22316] AH00940: HTTP: disabled connection for (localhost) 7809 [Fri Jul 06 21:18:22.395923 2018] [authz_core:error] [pid 22316] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/pu blic/503.html, referer: https://www.schreibnacht.de/... 7810 [Fri Jul 06 21:18:22.431708 2018] [proxy:error] [pid 21668] AH00940: HTTP: disabled connection for (localhost) 7811 [Fri Jul 06 21:18:22.431781 2018] [authz_core:error] [pid 21668] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/pub lic/503.html, referer: https://www.schreibnacht.de/... 7812 [Fri Jul 06 21:18:22.464752 2018] [proxy:error] [pid 22334] AH00940: HTTP: disabled connection for (localhost) 7813 [Fri Jul 06 21:18:22.464817 2018] [authz_core:error] [pid 22334] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/pu blic/503.html, referer: https://schreibnacht.de/... 7814 [Fri Jul 06 21:18:22.557643 2018] [proxy:error] [pid 22318] AH00940: HTTP: disabled connection for (localhost) 7815 [Fri Jul 06 21:18:22.557711 2018] [authz_core:error] [pid 22318] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/public /503.html, referer: https://www.schreibnacht.de/... 7816 [Fri Jul 06 21:18:22.570026 2018] [proxy:error] [pid 22329] AH00940: HTTP: disabled connection for (localhost) 7817 [Fri Jul 06 21:18:22.570079 2018] [authz_core:error] [pid 22329] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/public /503.html, referer: https://www.schreibnacht.de/... 7818 [Fri Jul 06 21:18:22.738724 2018] [proxy:error] [pid 21529] AH00940: HTTP: disabled connection for (localhost) 7819 [Fri Jul 06 21:18:22.738807 2018] [authz_core:error] [pid 21529] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/pu blic/503.html, referer: https://www.schreibnacht.de/... 7820 [Fri Jul 06 21:18:22.898662 2018] [proxy:error] [pid 21390] AH00940: HTTP: disabled connection for (localhost) 7821 [Fri Jul 06 21:18:22.898724 2018] [authz_core:error] [pid 21390] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/pub lic/503.html, referer: https://schreibnacht.de/... 7822 [Fri Jul 06 21:18:22.957591 2018] [proxy:error] [pid 22324] AH00940: HTTP: disabled connection for (localhost) 7823 [Fri Jul 06 21:18:22.957675 2018] [authz_core:error] [pid 22324] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/publ ic/503.html, referer: https://www.schreibnacht.de/... 7824 [Fri Jul 06 21:18:23.019271 2018] [proxy:error] [pid 21668] AH00940: HTTP: disabled connection for (localhost) 7825 [Fri Jul 06 21:18:23.019347 2018] [authz_core:error] [pid 21668] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/publ ic/503.html, referer: https://schreibnacht.de/... 7826 [Fri Jul 06 21:18:23.033226 2018] [proxy:error] [pid 20448] (111)Connection refused: AH00957: HTTP: attempt to connect to 127.0.0.1:4567 (localhost) failed 7827 [Fri Jul 06 21:18:23.033249 2018] [proxy:error] [pid 20448] AH00959: ap_proxy_connect_backend disabling worker for (localhost) for 60s 7828 [Fri Jul 06 21:18:23.033257 2018] [proxy_http:error] [pid 20448] [client x.x.x.x:xxxx] AH01114: HTTP: failed to make connection to backend: localhost 7829 [Fri Jul 06 21:18:23.033310 2018] [authz_core:error] [pid 20448] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/publ ic/503.html 7830 [Fri Jul 06 21:18:23.090340 2018] [proxy:error] [pid 20860] AH00940: HTTP: disabled connection for (localhost) 7831 [Fri Jul 06 21:18:23.090435 2018] [authz_core:error] [pid 20860] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/pub lic/503.html, referer: https://www.schreibnacht.de/... 7832 [Fri Jul 06 21:18:23.334201 2018] [proxy:error] [pid 22323] AH00940: HTTP: disabled connection for (localhost) 7833 [Fri Jul 06 21:18:23.334289 2018] [authz_core:error] [pid 22323] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/publ ic/503.html, referer: https://www.schreibnacht.de/... 7834 [Fri Jul 06 21:18:23.346097 2018] [proxy:error] [pid 21669] (111)Connection refused: AH00957: HTTP: attempt to connect to 127.0.0.1:4567 (localhost) failed 7835 [Fri Jul 06 21:18:23.346146 2018] [proxy:error] [pid 21669] AH00959: ap_proxy_connect_backend disabling worker for (localhost) for 60s
Sadly we only have this even once every month which limits our testing period extremely and every time we can experience and try to monitor the problem, we anger our users a lot at the same time.
So I am asking a very wide, very thin spread question:
How might you go about debugging this issue?
What might my apache log hint at?
Which tools are you guys using (if any) to generate artificial load for testing on your servers?
What are regular troubleshooting techniques that you deploy?
What other advice might you have, what other information can I provide to help my cause?Many thanks for your time and sorry for the lengthy post.
Best regards,
Worp -
@worp1900 I looked up one of those errors and found this: https://stackoverflow.com/questions/23709832/ap-proxy-connect-backend-disabling-worker-for-127-0-0-1
-
@pitaj thank you so much for your reply. The response noting
ProxyPass / http://backendserver:8080/ retry=0
Might be worth investigating. I was not aware there was a delay before determining a proxy-action as failed. Although it makes obvious sense something like that would exist.
Do you have any recommendation how I might load-test the nodebb instance? Any tool that you have used or have seen used effectively?
I know about various tools with which one can test a web application, but I might not be aware of any obvious pitfalls that some tools might have with nodeBB or Proxy architectures in general.
If I could create a testing scenario to reproduce the error, I might be able to verify if that fix helps. Otherwise I'll have to wait until our next event and potentially realize it hasn't helped at all
This is my Apache config btw, if it's informative in any way:
Protocols h2 http/1.1 ServerName www.schreibnacht.de ServerAlias schreibnacht.de SSLEngine on # Basic security headers Header always set X-Content-Type-Options "nosniff" Header always set X-Xss-Protection "1; mode=block" # NodeBB header RequestHeader set X-Forwarded-Proto "https" # Static file cache <FilesMatch "\.(ico|jpg|jpeg|png|gif|js|css)$"> <IfModule mod_expires.c> ExpiresActive on ExpiresDefault "access plus 14 days" Header set Cache-Control "public" </IfModule> </FilesMatch> ProxyRequests off <Proxy *> Order deny,allow Allow from all </Proxy> # Custom Error Document when NodeBB is offline ProxyPass /error-documents ! ErrorDocument 503 /error-documents/503.html Alias /error-documents /path/to/public # Websocket passthrough RewriteEngine On RewriteCond %{REQUEST_URI} ^/socket.io [NC] RewriteCond %{QUERY_STRING} transport=websocket [NC] RewriteRule /(.*) ws://localhost:4567/$1 [P,L] ProxyPass / http://localhost:4567/ retry=0 ProxyPassReverse / http://localhost:4567/ retry=0 # Log stuff ErrorLog ${APACHE_LOG_DIR}/www-schreibnacht-error.log CustomLog ${APACHE_LOG_DIR}/www-schreibnacht-access.log combined
-
In the meantime I found:
Testing
Run NodeBB in development mode:./nodebb dev
This will expose the plugin debug logs, allowing you to see if your plugin is loaded, and its hooks registered. Activate your plugin from the administration panel, and test it out.
From https://docs.nodebb.org/development/plugins/This might also be helpful.
-
@worp1900 said:
On our NodeBB instance, we have a regular (once a month) events, where the highest number of users (~40-80, so not very many) are concurrently online.
That shouldn’t be a problem. See for example this case with an average of 100 concurrent users and working fine on very moderate resources.
However, they use Nginx, which works much better for concurrent users, etc. than Apache.
Anyhow, there is probably some misconfiguration somewhere. To find it by trying different configurations, of course there are many load testing tools to simulate concurrent users. But sorry, I’m just an user with no load testing experience.
-
@juan-g hey hey, thanks for the reply. I appreciate it. Guess I gotta take a wild guess at which tool to use and try my best. I knew about the load testing tools, was just hoping to get a hint as to what tool or configuration is used by the nodebb team.
Nonetheless: Thank you for caring I really do appreciate it.