Posting to a Specific Thread Completely Locks Up NodeBB

Bug Reports
  • We have one special thread on our forum that, anytime someone posts to it, the CPU spikes to 100% for maybe a little less than five minutes and the site is completely unavailable. Nothing goes wrong in the logs and after the time is up, everything is normal. We locked the thread and rarely see an issue. If we unlock it, one post to it and the issue comes right back.

    Here is the link to the specific thread:

  • I should note, if we stop the NodeBB process and restart it, it spikes right back up again. We just have to wait out the spike, unfortunately.

  • I grabbed an strace of the process during the spike in case that helps. It just looks like this over and over again...

    gettimeofday({1408323867, 541586}, NULL) = 0
    gettimeofday({1408323867, 541638}, NULL) = 0
    gettimeofday({1408323867, 541743}, NULL) = 0
    futex(0x7f4fc00008c8, FUTEX_WAKE_PRIVATE, 1) = 1
    mmap(0x2568ba73f000, 9588736, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x2568ba73f000
    munmap(0x2568ba73f000, 790528) = 0
    munmap(0x2568bb025000, 258048) = 0
    mmap(0x2568ba800000, 8538824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2568ba800000
    mmap(0x3600d0bc9000, 9588736, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x3600d0bc9000
    munmap(0x3600d0bc9000, 225280) = 0
    munmap(0x3600d1425000, 823296) = 0
    mmap(0x3600d0c00000, 8538792, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3600d0c00000
    mmap(0x14316f988000, 9588736, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x14316f988000
    munmap(0x14316f988000, 491520) = 0
    munmap(0x143170225000, 557056) = 0
    mmap(0x14316fa00000, 8538752, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x14316fa00000
    mmap(0x4aff0cb4000, 9588736, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x4aff0cb4000
    munmap(0x4aff0cb4000, 311296) = 0
    munmap(0x4aff1525000, 737280) = 0
    mmap(0x4aff0d00000, 8538720, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4aff0d00000
    mmap(0x29f36d128000, 9588736, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x29f36d128000
    munmap(0x29f36d128000, 884736) = 0
    munmap(0x29f36da25000, 163840) = 0
    mmap(0x29f36d200000, 8538688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x29f36d200000
    gettimeofday({1408323867, 562398}, NULL) = 0
    gettimeofday({1408323867, 590361}, NULL) = 0
    gettimeofday({1408323867, 591337}, NULL) = 0
    gettimeofday({1408323867, 592171}, NULL) = 0
    gettimeofday({1408323867, 592268}, NULL) = 0
    gettimeofday({1408323867, 592316}, NULL) = 0
    gettimeofday({1408323867, 592357}, NULL) = 0
    gettimeofday({1408323867, 592399}, NULL) = 0
    gettimeofday({1408323867, 592440}, NULL) = 0
    gettimeofday({1408323867, 592481}, NULL) = 0
    gettimeofday({1408323867, 592577}, NULL) = 0
    gettimeofday({1408323867, 592621}, NULL) = 0
    gettimeofday({1408323867, 592689}, NULL) = 0
    gettimeofday({1408323867, 592732}, NULL) = 0
    gettimeofday({1408323867, 592773}, NULL) = 0
    gettimeofday({1408323867, 592870}, NULL) = 0
    gettimeofday({1408323867, 592916}, NULL) = 0
    gettimeofday({1408323867, 592956}, NULL) = 0
    gettimeofday({1408323867, 593601}, NULL) = 0
    munmap(0x4aff0d00000, 8540160) = 0
    munmap(0x14316fa00000, 8540160) = 0
    munmap(0x3600d0c00000, 8540160) = 0
    munmap(0x2568ba800000, 8540160) = 0
    munmap(0x21beb2f00000, 8540160) = 0
    gettimeofday({1408323867, 594822}, NULL) = 0
    gettimeofday({1408323867, 594890}, NULL) = 0
    gettimeofday({1408323867, 594986}, NULL) = 0
    futex(0x7f4fc00008c8, FUTEX_WAKE_PRIVATE, 1) = 1
    mmap(0x2e5e2dba1000, 9588736, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x2e5e2dba1000
    munmap(0x2e5e2dba1000, 389120) = 0
    munmap(0x2e5e2e425000, 659456) = 0
    mmap(0x2e5e2dc00000, 8538640, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2e5e2dc00000
    mmap(0xc7259f7c000, 9588736, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0xc7259f7c000
    munmap(0xc7259f7c000, 540672) = 0
    munmap(0xc725a825000, 507904) = 0
    mmap(0xc725a000000, 8538680, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xc725a000000
    mmap(0x3cfa03435000, 9588736, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x3cfa03435000
    munmap(0x3cfa03435000, 831488) = 0
    munmap(0x3cfa03d25000, 217088) = 0
    mmap(0x3cfa03500000, 8538664, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3cfa03500000
    mmap(0x1b58bac54000, 9588736, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x1b58bac54000
    munmap(0x1b58bac54000, 704512) = 0
    munmap(0x1b58bb525000, 344064) = 0
    mmap(0x1b58bad00000, 8538640, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x1b58bad00000
    mmap(0x5870c1ca000, 9588736, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x5870c1ca000
    munmap(0x5870c1ca000, 221184) = 0
    munmap(0x5870ca25000, 827392) = 0
    mmap(0x5870c200000, 8538624, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x5870c200000
    gettimeofday({1408323867, 616858}, NULL) = 0
    gettimeofday({1408323867, 644188}, NULL) = 0
    gettimeofday({1408323867, 645073}, NULL) = 0
    gettimeofday({1408323867, 645907}, NULL) = 0
    gettimeofday({1408323867, 645990}, NULL) = 0
    gettimeofday({1408323867, 646034}, NULL) = 0
    gettimeofday({1408323867, 646075}, NULL) = 0
    gettimeofday({1408323867, 646114}, NULL) = 0
    gettimeofday({1408323867, 646156}, NULL) = 0
    gettimeofday({1408323867, 646196}, NULL) = 0
    gettimeofday({1408323867, 646291}, NULL) = 0
    gettimeofday({1408323867, 646334}, NULL) = 0
    gettimeofday({1408323867, 646411}, NULL) = 0
    gettimeofday({1408323867, 646455}, NULL) = 0
    gettimeofday({1408323867, 646498}, NULL) = 0
    gettimeofday({1408323867, 646538}, NULL) = 0
    gettimeofday({1408323867, 646579}, NULL) = 0
    gettimeofday({1408323867, 646619}, NULL) = 0
    gettimeofday({1408323867, 647215}, NULL) = 0
    munmap(0x1b58bad00000, 8540160) = 0
    munmap(0x3cfa03500000, 8540160) = 0
    munmap(0xc725a000000, 8540160) = 0
    munmap(0x2e5e2dc00000, 8540160) = 0
    munmap(0x29f36d200000, 8540160) = 0
    gettimeofday({1408323867, 648436}, NULL) = 0
    gettimeofday({1408323867, 648500}, NULL) = 0
    gettimeofday({1408323867, 648596}, NULL) = 0
    futex(0x7f4fc00008c8, FUTEX_WAKE_PRIVATE, 1) = 1
    mmap(0x1e2678b8b000, 9588736, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x1e2678b8b000
    munmap(0x1e2678b8b000, 479232) = 0
    munmap(0x1e2679425000, 569344) = 0
    mmap(0x1e2678c00000, 8538600, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x1e2678c00000
    mmap(0x266b76515000, 9588736, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x266b76515000
    munmap(0x266b76515000, 962560) = 0
    munmap(0x266b76e25000, 86016) = 0
    mmap(0x266b76600000, 8538584, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x266b76600000
    mmap(0x22128e50c000, 9588736, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x22128e50c000
    munmap(0x22128e50c000, 999424) = 0
    munmap(0x22128ee25000, 49152) = 0
    mmap(0x22128e600000, 8538560, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x22128e600000
    mmap(0x24ec62b8b000, 9588736, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x24ec62b8b000
    munmap(0x24ec62b8b000, 479232) = 0
    munmap(0x24ec63425000, 569344) = 0
    mmap(0x24ec62c00000, 8538544, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x24ec62c00000
    mmap(0x7fdf4db2000, 9588736, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fdf4db2000
    munmap(0x7fdf4db2000, 319488) = 0
    munmap(0x7fdf5625000, 729088) = 0
    mmap(0x7fdf4e00000, 8538520, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fdf4e00000
    gettimeofday({1408323867, 669966}, NULL) = 0
    gettimeofday({1408323867, 698507}, NULL) = 0
    gettimeofday({1408323867, 699587}, NULL) = 0
    gettimeofday({1408323867, 700691}, NULL) = 0
    gettimeofday({1408323867, 700787}, NULL) = 0
    gettimeofday({1408323867, 700833}, NULL) = 0
    gettimeofday({1408323867, 700873}, NULL) = 0
    gettimeofday({1408323867, 700916}, NULL) = 0
    gettimeofday({1408323867, 700962}, NULL) = 0
    gettimeofday({1408323867, 701007}, NULL) = 0
    gettimeofday({1408323867, 701165}, NULL) = 0
    gettimeofday({1408323867, 701208}, NULL) = 0
    gettimeofday({1408323867, 701269}, NULL) = 0
    gettimeofday({1408323867, 701310}, NULL) = 0
    gettimeofday({1408323867, 701350}, NULL) = 0
    gettimeofday({1408323867, 701389}, NULL) = 0
    gettimeofday({1408323867, 701430}, NULL) = 0
    gettimeofday({1408323867, 701470}, NULL) = 0
    gettimeofday({1408323867, 702268}, NULL) = 0
    munmap(0x24ec62c00000, 8540160) = 0
    munmap(0x22128e600000, 8540160) = 0
    munmap(0x266b76600000, 8540160) = 0
    munmap(0x1e2678c00000, 8540160) = 0
    munmap(0x5870c200000, 8540160) = 0
    gettimeofday({1408323867, 703120}, NULL) = 0
    gettimeofday({1408323867, 703192}, NULL) = 0
    gettimeofday({1408323867, 703318}, NULL) = 0
    futex(0x7f4fc00008c8, FUTEX_WAKE_PRIVATE, 1) = 1
    mmap(0xd83d282d000, 9588736, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0xd83d282d000
    munmap(0xd83d282d000, 864256) = 0
    munmap(0xd83d3125000, 184320) = 0
    mmap(0xd83d2900000, 8538504, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xd83d2900000
    mmap(0x10ebf46cb000, 9588736, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x10ebf46cb000
    munmap(0x10ebf46cb000, 217088) = 0
    munmap(0x10ebf4f25000, 831488) = 0
    mmap(0x10ebf4700000, 8538480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x10ebf4700000

  • This post is deleted!
  • Weird, does it happen in this thread as well? http://mangolassi.it/topic/1022/what-are-you-doing-right-now

    You can also try these http://docs.mongodb.org/manual/administration/monitoring/ to see what the database is doing during the spike.

    I see you are on 0.5.0-2 now is it still happening on that version?

  • Yes on 0.5.0-2

    No, that other thread is fine. I've been watching it carefully assuming that it would freak out too but it doesn't. There is something unique about that one thread.

    I will wait until we have a lull (site is active around the clock) and see what the DB is doing.

    I suspect one of the posts contains something causing a problem. But who knows what 😞

  • We have noticed that even reading the thread causing the CPU to spike for a few minutes.

  • @scottalanmiller Strange, I've got a topic with just under 600 posts on my forum, and it loads and I can post to it without issue (I do use Pagination though). Does it happen when you view a certain portion of posts? Maybe one of the posts is causing the issue?

  • I am seeing this in one of the posts and a matching error in the browser console.

    <p><a href="uploading 10%" target="_blank"><img src="uploading 10%" alt="Awesome recovery.gif" class="img-responsive"></a> </p>

    GET http://mangolassi.it/topic/389/random-thread-anything-goes/uploading%2010% 504 (Gateway Time-out) nodebb.min.js?eb0c6f65:3

    Maybe thats related.

    It causes this on the server side

    TypeError: Failed to decode param 'uploading%2010%'
        at decode_param (/home/baris/node-forum/node_modules/express/lib/router/layer.js:147:15)
        at Layer.match (/home/baris/node-forum/node_modules/express/lib/router/layer.js:119:11)
        at Function.match_layer (/home/baris/node-forum/node_modules/express/lib/router/index.js:281:16)
        at next (/home/baris/node-forum/node_modules/express/lib/router/index.js:182:10)
        at /home/baris/node-forum/node_modules/express/lib/router/index.js:184:16
        at Function.match_layer (/home/baris/node-forum/node_modules/express/lib/router/index.js:288:3)
        at next (/home/baris/node-forum/node_modules/express/lib/router/index.js:182:10)
        at /home/baris/node-forum/node_modules/express/lib/router/index.js:184:16
        at Function.match_layer (/home/baris/node-forum/node_modules/express/lib/router/index.js:288:3)
        at next (/home/baris/node-forum/node_modules/express/lib/router/index.js:182:10)
    
  • @a_5mith said:

    @scottalanmiller Strange, I've got a topic with just under 600 posts on my forum, and it loads and I can post to it without issue (I do use Pagination though). Does it happen when you view a certain portion of posts? Maybe one of the posts is causing the issue?

    Yes, I think that it is a single post, we've got others nearly as long. It didn't deteriorate, it failed suddenly.

  • @baris said:

    I am seeing this in one of the posts and a matching error in the browser console.

    So maybe we have something being treated as non-text causing an issue? That would make sense.

  • @baris said:

    Awesome recovery.gif

    So how do I edit that entry so that it doesn't crash? Deleting it won't make it completely go aware. But I can't delete or edit anyway because just by going to the page makes the server hang. Hmm...

  • Hmm once you are on that page you can inspect that element to get its post id. Once you have that you can update it in the database itself it will have a _key = post:<post id>.

    db.objects.find({_key:"post:<post_id>");

    Then something like

    db.objects.update({_key:"post:<post_id>"}, {$set:{"content": "testing edit"}});

    Let me know if that works.

  • Any updates on this?

    I am seeing a similar issue on a pretty busy mongo instance, when it happens the nodebb nodejs process hits 100% cpu but the mongo process is idle. Usually it takes a restart to fix it.

  • This issue has returned for us big time. We don't know what thread is doing it now but it is happening about once an hour. It is getting really bad all of a sudden - of course coinciding with a major increase of traffic over the last two days. We are getting a little desperate now.

  • @baris said:

    Any updates on this?

    I am seeing a similar issue on a pretty busy mongo instance, when it happens the nodebb nodejs process hits 100% cpu but the mongo process is idle. Usually it takes a restart to fix it.

    that's exactly what we see. We now just hop in and do an immediate "kill -9" on the offending process and restart right away. It is horrible.

  • Oh sure, it happened while I was posting this! That's twice in ten minutes now!

  • I'll wait for @baris reply but since he replied to you a month ago he's done a ton of improvements to the mongodb driver (our biggest client is actually using mongo right now) so hang tight, hopefully we can help. Are you on the latest code at the moment (0.5.1)?

  • No, we are on 0.5.0-4. When should we update?

  • You can update to 0.5.1, when it happens can you send a screen grab of the top and mongostat that might give us more insight.


Suggested Topics