Hey everyone, hopefully especially the devs who do this much more often.
I am trying to get to the bottom of a strange issue we keep having. On our NodeBB instance, we have a regular (once a month) events, where the highest number of users (~40-80, so not very many) are concurrently online.
With such low numbers, I figure NodeBB would not have any problems ever. Sadly we are experiencing extremely high response times (< 1 minute and eventual 503s being thrown) almost every time in the last 3 times we had the event.
The problem is not accompanied by unusual resource usage (CPU < 10%, RAM at 12/16 GB, Swap 660 MB/1GB). While redis is saving to disk we have 1 of 6 availables vCores at 100% usage, but the others are basically idle. Also, the slow response time and timeouts do not necessarily occur while redis is saving to disk. They appear much more often than redis writes its dump.
It is usually a fairly short period of time that we experience this problem (10-30 minutes). Then the forum goes back to normal speed.
One of the most curious findings: Stopping nodebb, restarting redis, waiting for it to load the dump and then starting nodebb again does NOT resolve the problem. Not even for a short time. As soon as the forum is reachable again, we are back to high response times and 503s. Then, eventually, without something observable changing, the forum goes back to normal.
The logs of NodeBB and redis do not seem to spill out anything of note. (redis just reports when it's dumping to DB (every 5 minutes as configured) and the success of the operation, NodeBB reports that it was started).
Apache has some conspicuous entries:
7804 [Fri Jul 06 21:18:22.308468 2018] [proxy:error] [pid 21668] AH00940: HTTP: disabled connection for (localhost)
7805 [Fri Jul 06 21:18:22.308604 2018] [authz_core:error] [pid 21668] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/pu blic/503.html, referer: https://www.schreibnacht.de/...
7806 [Fri Jul 06 21:18:22.350118 2018] [proxy:error] [pid 21390] AH00940: HTTP: disabled connection for (localhost)
7807 [Fri Jul 06 21:18:22.350170 2018] [authz_core:error] [pid 21390] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/pub lic/503.html, referer: https://www.schreibnacht.de/...
7808 [Fri Jul 06 21:18:22.395862 2018] [proxy:error] [pid 22316] AH00940: HTTP: disabled connection for (localhost)
7809 [Fri Jul 06 21:18:22.395923 2018] [authz_core:error] [pid 22316] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/pu blic/503.html, referer: https://www.schreibnacht.de/...
7810 [Fri Jul 06 21:18:22.431708 2018] [proxy:error] [pid 21668] AH00940: HTTP: disabled connection for (localhost)
7811 [Fri Jul 06 21:18:22.431781 2018] [authz_core:error] [pid 21668] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/pub lic/503.html, referer: https://www.schreibnacht.de/...
7812 [Fri Jul 06 21:18:22.464752 2018] [proxy:error] [pid 22334] AH00940: HTTP: disabled connection for (localhost)
7813 [Fri Jul 06 21:18:22.464817 2018] [authz_core:error] [pid 22334] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/pu blic/503.html, referer: https://schreibnacht.de/...
7814 [Fri Jul 06 21:18:22.557643 2018] [proxy:error] [pid 22318] AH00940: HTTP: disabled connection for (localhost)
7815 [Fri Jul 06 21:18:22.557711 2018] [authz_core:error] [pid 22318] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/public /503.html, referer: https://www.schreibnacht.de/...
7816 [Fri Jul 06 21:18:22.570026 2018] [proxy:error] [pid 22329] AH00940: HTTP: disabled connection for (localhost)
7817 [Fri Jul 06 21:18:22.570079 2018] [authz_core:error] [pid 22329] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/public /503.html, referer: https://www.schreibnacht.de/...
7818 [Fri Jul 06 21:18:22.738724 2018] [proxy:error] [pid 21529] AH00940: HTTP: disabled connection for (localhost)
7819 [Fri Jul 06 21:18:22.738807 2018] [authz_core:error] [pid 21529] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/pu blic/503.html, referer: https://www.schreibnacht.de/...
7820 [Fri Jul 06 21:18:22.898662 2018] [proxy:error] [pid 21390] AH00940: HTTP: disabled connection for (localhost)
7821 [Fri Jul 06 21:18:22.898724 2018] [authz_core:error] [pid 21390] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/pub lic/503.html, referer: https://schreibnacht.de/...
7822 [Fri Jul 06 21:18:22.957591 2018] [proxy:error] [pid 22324] AH00940: HTTP: disabled connection for (localhost)
7823 [Fri Jul 06 21:18:22.957675 2018] [authz_core:error] [pid 22324] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/publ ic/503.html, referer: https://www.schreibnacht.de/...
7824 [Fri Jul 06 21:18:23.019271 2018] [proxy:error] [pid 21668] AH00940: HTTP: disabled connection for (localhost)
7825 [Fri Jul 06 21:18:23.019347 2018] [authz_core:error] [pid 21668] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/publ ic/503.html, referer: https://schreibnacht.de/...
7826 [Fri Jul 06 21:18:23.033226 2018] [proxy:error] [pid 20448] (111)Connection refused: AH00957: HTTP: attempt to connect to 127.0.0.1:4567 (localhost) failed
7827 [Fri Jul 06 21:18:23.033249 2018] [proxy:error] [pid 20448] AH00959: ap_proxy_connect_backend disabling worker for (localhost) for 60s
7828 [Fri Jul 06 21:18:23.033257 2018] [proxy_http:error] [pid 20448] [client x.x.x.x:xxxx] AH01114: HTTP: failed to make connection to backend: localhost
7829 [Fri Jul 06 21:18:23.033310 2018] [authz_core:error] [pid 20448] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/publ ic/503.html
7830 [Fri Jul 06 21:18:23.090340 2018] [proxy:error] [pid 20860] AH00940: HTTP: disabled connection for (localhost)
7831 [Fri Jul 06 21:18:23.090435 2018] [authz_core:error] [pid 20860] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/pub lic/503.html, referer: https://www.schreibnacht.de/...
7832 [Fri Jul 06 21:18:23.334201 2018] [proxy:error] [pid 22323] AH00940: HTTP: disabled connection for (localhost)
7833 [Fri Jul 06 21:18:23.334289 2018] [authz_core:error] [pid 22323] [client x.x.x.x:xxxx] AH01630: client denied by server configuration: /path/to/nodebb/schreibnacht/publ ic/503.html, referer: https://www.schreibnacht.de/...
7834 [Fri Jul 06 21:18:23.346097 2018] [proxy:error] [pid 21669] (111)Connection refused: AH00957: HTTP: attempt to connect to 127.0.0.1:4567 (localhost) failed
7835 [Fri Jul 06 21:18:23.346146 2018] [proxy:error] [pid 21669] AH00959: ap_proxy_connect_backend disabling worker for (localhost) for 60s
Sadly we only have this even once every month which limits our testing period extremely and every time we can experience and try to monitor the problem, we anger our users a lot at the same time.
So I am asking a very wide, very thin spread question:
How might you go about debugging this issue?
What might my apache log hint at?
Which tools are you guys using (if any) to generate artificial load for testing on your servers?
What are regular troubleshooting techniques that you deploy?
What other advice might you have, what other information can I provide to help my cause?
Many thanks for your time and sorry for the lengthy post.
Best regards,
Worp