For the first time the #CoSocialCa Mastodon server has started to struggle just a little bit to keep up with the flow of the Fediverse.
We’ve usually been “push” heavy but we’ve started to see some spikes in “pull” queue latency. The worst of these spikes was today, where we fell behind by at least a couple minutes for most of the afternoon.
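(Queue latency here means how long the oldest waiting job has been sitting in the queue. A toy pure-Ruby model of the measurement, with invented timestamps; Sidekiq computes the real number from each job's enqueue time:)

```ruby
# Toy model of queue latency: seconds between "now" and the oldest
# waiting job's enqueue time. Timestamps below are made up.
now = 1_000_000
enqueued_at = [999_880, 999_940, 999_995]  # pending jobs, oldest first

latency = now - enqueued_at.min
puts "pull queue latency: #{latency}s"  # 120s, i.e. the "couple minutes" above
```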
1/?
-
@mick i wonder if it'd make sense to produce log messages for each instance, actor, activity type, directionality and other properties over time & make that graphable?
What percentage of activities processed or sent in the last day were Deletes? What types of activities does Server Y send me?
(Maybe this also ties into @polotek's question earlier today)
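(A sketch of the kind of tally that could produce, assuming a made-up structured log line of host / activity type / direction — none of these field names come from Mastodon itself:)

```ruby
# Hypothetical log lines: "<remote host> <activity type> <direction>".
lines = [
  "example.social Create inbound",
  "example.social Delete inbound",
  "other.tld Accept outbound",
  "example.social Delete inbound",
]

# Count activity types per remote instance.
tally = Hash.new { |h, k| h[k] = Hash.new(0) }
lines.each do |line|
  host, type, _direction = line.split
  tally[host][type] += 1
end

p tally
# {"example.social"=>{"Create"=>1, "Delete"=>2}, "other.tld"=>{"Accept"=>1}}
```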
-
For anyone interested in understanding the guts of Mastodon, I have found this article from Digital Ocean very helpful: https://www.digitalocean.com/community/tutorials/how-to-scale-your-mastodon-server#perfecting-sidekiq-queues
Eventually we’ll grow so big that we’ll need oodles of sidekiq queues, we’ll want to customize how many of each type we run, and we’ll spread them as jobs across multiple servers and so on.
But for now I’m just going to make the number of threads slightly bigger and see what happens.
3/?
-
We’ll do this in staging first, because I am a responsible sysadmin (and I am only ever half sure I know what I’m doing).
We’re running the default config that came with our DigitalOcean droplet, which has a single sidekiq service running 25 threads.
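(For reference, on a DigitalOcean-style install the thread count is the -c flag in the sidekiq systemd unit, and DB_POOL should match it. Paths and the unit name may differ on your box — check your own service file:)

```ini
# /etc/systemd/system/mastodon-sidekiq.service (excerpt; verify against your install)
[Service]
Environment="DB_POOL=25"
ExecStart=/home/mastodon/.rbenv/shims/bundle exec sidekiq -c 25
```

Raising the thread count means bumping both numbers, then a daemon-reload and a restart of the service.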
4/?
-
@virtuous_sloth When someone here writes a post, makes a comment, or interacts with a remote post, the activity lands in the push queue and is then distributed to other servers in the network. This is pretty active on our server but tends to clear very quickly.
The pull queue “handles tasks such as handling imports, backups, resolving threads, deleting users, and forwarding replies.” For some reason it’s been getting clogged of late.
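(The queues and their relative weights live in Mastodon’s config/sidekiq.yml; this is roughly the stock layout, though the weights and exact queue list vary between Mastodon versions, so check your own file:)

```yaml
:queues:
  - [default, 8]
  - [push, 6]
  - [ingress, 4]
  - [mailers, 2]
  - [pull]
  - [scheduler]
```

A higher weight means the queue is checked more often, so pull’s low priority is part of why it’s the first to back up under load.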
-
That article from DigitalOcean suggests budgeting 10-15 threads per 1 GB of RAM.
We also need to give each thread its own DB connection.
In staging the DB is local, so we don’t need to worry too much about a few extra connections.
In production, we’re connected to a DB pool that will funnel the extra connections into a smaller number of connections to the DB. Our Database server still has oodles of capacity to keep up with all of this.
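(A back-of-envelope version of that rule of thumb in plain Ruby — the 10-15 threads/GB figure is the article’s heuristic, not something Sidekiq enforces:)

```ruby
# Rule of thumb from the DigitalOcean article: 10-15 sidekiq threads
# per 1 GB of RAM, plus one DB connection per thread.
def thread_budget(ram_gb, threads_per_gb)
  ram_gb * threads_per_gb
end

ram_gb = 2  # e.g. a small staging droplet
puts thread_budget(ram_gb, 10)  # conservative budget: 20 threads
puts thread_budget(ram_gb, 15)  # generous budget: 30 threads
# Either way, plan on that many DB connections (or a pooler in front).
```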
5/?
-
Staging server only has 2 GB of RAM but it also has virtually no queue activity so let’s give it a shot.
Having confirmed that we have sufficient resources to accommodate the increase and then picked a number out of a hat, I’m going to increase the number of threads to 40.
6/?
-
@virtuous_sloth Only thing I can figure is that we’ve taken on a bunch of new followers as a server and tipped across some threshold? Or the users we follow are more chatty of late?
It did start when Eurovision began, but as a small server of Canadians idk how much Eurovision discussion we were plugged into.
A few more threads ought to help. If not, I’ll delve deeper.
-
@mick @thisismissem so the thing that has me thinking about this is I was using activitypub.academy to view some logs. I did a follow to my server and it showed that my server continually sent duplicate "Accept" messages back. I can't tell if that's an issue with my server or with the academy. Because I can't see my logs.
-
@mick @thisismissem people told me that activity pub was very "chatty". I understand a lot better why that is now. But I now suspect that there's also a ton of inefficiency there. Because few people are looking at the actual production behavior.
-
@thisismissem @mick one thing I know is a problem is retries that build up in sidekiq. Sidekiq will retry jobs basically forever. And when servers disappear, their jobs sit in the retry queue failing indefinitely. I'm sure larger instances with infra teams do some cleanup here. But how are smaller instances supposed to learn about this?
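(For what it’s worth, stock Sidekiq doesn’t literally retry forever: the default is 25 attempts with roughly count**4 + 15 seconds of backoff between them, plus random jitter, after which the job lands in the Dead set. A quick sum of that schedule, jitter omitted:)

```ruby
# Sidekiq's default backoff between retry attempts, minus the random jitter.
def retry_delay(count)
  (count ** 4) + 15
end

# 25 attempts (counts 0..24) before the job is moved to the Dead set.
total_seconds = (0...25).sum { |c| retry_delay(c) }
puts (total_seconds / 86_400.0).round(1)  # ~20.4 days of retrying
```

So a dead server’s jobs do clog the retry set for weeks, but not indefinitely.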
-
@[email protected] that's definitely a problem with the sidekiq configuration then. Keeping retries forever just clogs up the queue... I can't imagine they'd be kept forever!