So does anyone have a good guide on what I am actually looking at in the Sharkey dashboard?
-
Amber 🌸 replied to Aurora 🏳️🌈 last edited by [email protected]
@[email protected] That’s complicated. It depends on the context really. 1,000 delayed jobs? For a single user instance that’s horrifying. For transfem.social that’s a bit on the lower end because we have such wide federation that there’s always a ton of small to big servers offline that prevent jobs from being sent out or inbox jobs from processing. It also depends on how quickly it is rising, in the case of our dos I noticed it rising pretty damn fast up to 80,000 delayed jobs within a span of an hour. That’s pretty fucking scary even with as much traffic as we have.
-
@[email protected] Could it be because we did our huge migration and some of that stuff is still settling out? It seems like that was the biggest traffic we got by far.
I'm not sure what the bottleneck would be here; my CPU is running at ~10% and I've got plenty of RAM.
I was running the server off wifi until recently, maybe that contributed to the issue?
It's not rising very much now, maybe 100 added today.
-
@[email protected] id have to see your queues.
-
@[email protected] I posted the one image above, do you need a screenshot of something else? I'm not sure what is sensitivity to send openly on Fedi tbh, but I doubt analytics are.
-
@[email protected] that’s perfectly fine, just got the edit. that’s absolutely normal
-
@[email protected] Oh okay, great. It does seem to be going down now too, only about ~600 delayed as of a couple minutes ago.
-
@[email protected] the job queues are hard to wrap your head around because it’s a bunch of independent variables.
1) Worker configuration (yes)
2) Database latency (yes, including concurrent connections because too many connections overloads the database and it responds slower)
3) other servers (outbound jobs build up when it fails to deliver, so even if everything is good on your side a remote instance might be unable to accept that job and it’ll queue it to retry later)
4) again remote instances, if you have an activity that’s a reply to another note the instance wants to fetch the original note and if it can’t then the job will be delayed. so instance A might be fine, but instance B which made the note that a user on instance A replied to is stopping the job from being processed
5) cpu, although that’s the least likely. Jobs can be gridlocked by other instance tasks such as the media proxy which does some on the fly server sided work to process images. It’s how the second dos attack on transfem.social worked. They spammed our media proxy, requiring the instance to process the same image over and over again which made 8 cpu cores go to 100%. There is no sharkey side caching for that so to account for it before the patches we made HAProxy cache the media proxy so the conversion wasn’t done hundreds of times per second for the same image. -
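(Editor's sketch, for readers wondering what "making HAProxy cache the media proxy" might look like. This is not transfem.social's actual config: the `/proxy/` path, port, and cache sizes here are assumptions; check your instance's real media-proxy route before copying anything.)

```haproxy
# Sketch: cache Sharkey's media-proxy responses in HAProxy (1.8+) so the
# same image isn't re-converted hundreds of times per second.
# ASSUMPTIONS: media proxy lives under /proxy/, Sharkey listens on :3000.
cache media_cache
    total-max-size 256        # total cache size in MB
    max-object-size 10485760  # largest cacheable response, in bytes (10 MB)
    max-age 3600              # keep entries for up to an hour

frontend fe_sharkey
    bind :80
    default_backend be_sharkey

backend be_sharkey
    acl is_media_proxy path_beg /proxy/
    filter cache media_cache
    # serve from / store into the cache only for media-proxy requests
    http-request  cache-use   media_cache if is_media_proxy
    http-response cache-store media_cache if is_media_proxy
    server sharkey 127.0.0.1:3000
```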
@[email protected] source? I’m the head administrator (and one of the main sysadmins) of transfem.social and a sharkey developer/contributor (although I mostly serve an administrative role)
-
@[email protected] (just for people seeing this thread in passing lol)
-
@[email protected] Yeah I'm not worried about any source for you lmao. You could tell me the delay is because I don't have enough blahaj in proximity to the server and I would probably try it.
-
@[email protected] oh also, the api itself can gridlock the instance but you don’t see that until you’re this size. If we ran sharkey in the unified configuration (default, db workers and frontend+api in the same process) it’d have imploded in on itself.
-
ash nova :neocat_flag_genderfluid: replied to Aurora 🏳️🌈 last edited by
@[email protected] @[email protected] hehe that sounds like good advice actually, make sure the server is comfy
-
Aurora 🏳️🌈 replied to ash nova :neocat_flag_genderfluid: last edited by
@[email protected] @[email protected] Server is very comfy don't worry, it has its own blanket keeping it nice and warm. I even activated the RGB for extra 10% ram just in case.
-
Amber 🌸 replied to Amber 🌸 last edited by [email protected]
@[email protected] and that gridlocking isn’t bypassable by throwing more CPUs at it I already tried that you have to put on a big girl face and set the variable MK_SERVER_ONLY (iirc? I’d have to double check) and MK_WORKER_ONLY.
-
ash nova :neocat_flag_genderfluid: replied to Amber 🌸 last edited by
@[email protected] @[email protected] incidentally mine is also split and a bit overscaled but that's mostly because I can, not because I need it strictly
-
Amber 🌸 replied to ash nova :neocat_flag_genderfluid: last edited by
@[email protected] @[email protected] there’s another level where you use haproxy to send websocket traffic to its own MK_SERVER_ONLY node based on /streaming & HTTP/1.1 -> websocket negotiation. This isn’t possible with sharkey config but you can use middleware. There’s also doing this but matching on headers "Accept: application/ld+json" (and other content types like activity+json) to route federation traffic to its own node…
-
Gwen, the kween fops :neofox_flag_trans: :sheher: replied to Aurora 🏳️🌈 last edited by
@[email protected] @[email protected] @[email protected] I can't get over putting a blanket on your server to keep it warm lmaaaaao
-
@[email protected] @[email protected] @[email protected] I once tossed a comforter on my 25u server rack suffocating it so the temps would spike just to heat my room up a bit more when I took off the comforter because I knew it’d take a while for it to go back to its normal temperature range. This is because I am a completely normal individual
-
ash nova :neocat_flag_genderfluid: replied to Amber 🌸 last edited by
@[email protected] @[email protected] I'm running dedicated workers and MK_SERVER_ONLY nodes, but I haven't split those up into different API routes or anything like that if that's what you mean, they just do all the web traffic pretty much. Been meaning to separate out AP things from client web but I can't quite be bothered to write that much nginx conf rn and changing my own HTTP to HAProxy is more of a longer term project xD
-
@[email protected] @[email protected] @[email protected] it’s so funny to see idrac reporting 130°F exhaust temperatures. I didn’t believe it, so I put my hand behind my server and wew you’ll never guess this - it wasn’t kidding.