For the first time the #CoSocialCa Mastodon server has started to struggle just a little bit to keep up with the flow of the Fediverse.
We’ve usually been “push” heavy but we’ve started to see some spikes in “pull” queue latency. The worst of these spikes was today, where we fell behind by at least a couple minutes for most of the afternoon.
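(Queue latency here means how long the oldest waiting job has been sitting in the queue. A toy pure-Ruby model of the measurement, with invented timestamps; Sidekiq computes the real number from each job's enqueue time:)

```ruby
# Toy model of queue latency: seconds between "now" and the oldest
# waiting job's enqueue time. Timestamps below are made up.
now = 1_000_000
enqueued_at = [999_880, 999_940, 999_995]  # pending jobs, oldest first

latency = now - enqueued_at.min
puts "pull queue latency: #{latency}s"  # 120s, i.e. the "couple minutes" above
```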
1/?
-
@mick i wonder if it'd make sense to produce log messages for each instance, actor, activity type, directionality and other properties over time & make that graphable?
What percentage of activities processed or sent in the last day were Deletes? What types of activities does Server Y send me?
(Maybe this also ties into @polotek's question earlier today)
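(A sketch of the kind of tally that could produce, assuming a made-up structured log line of host / activity type / direction — none of these field names come from Mastodon itself:)

```ruby
# Hypothetical log lines: "<remote host> <activity type> <direction>".
lines = [
  "example.social Create inbound",
  "example.social Delete inbound",
  "other.tld Accept outbound",
  "example.social Delete inbound",
]

# Count activity types per remote instance.
tally = Hash.new { |h, k| h[k] = Hash.new(0) }
lines.each do |line|
  host, type, _direction = line.split
  tally[host][type] += 1
end

p tally
# {"example.social"=>{"Create"=>1, "Delete"=>2}, "other.tld"=>{"Accept"=>1}}
```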
-
For anyone interested in understanding the guts of Mastodon, I have found this article from Digital Ocean very helpful: https://www.digitalocean.com/community/tutorials/how-to-scale-your-mastodon-server#perfecting-sidekiq-queues
Eventually we’ll grow so big that we’ll need oodles of sidekiq queues, we’ll want to customize how many of each type we run, and we’ll spread them as jobs across multiple servers and so on.
But for now I’m just going to make the number of threads slightly bigger and see what happens.
3/?
-
We’ll do this in staging first, because I am a responsible sysadmin (and I am only ever half sure I know what I’m doing).
We’re running the default config that came with our DigitalOcean droplet, which has a single sidekiq service running 25 threads.
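(For reference, on a DigitalOcean-style install the thread count is the -c flag in the sidekiq systemd unit, and DB_POOL should match it. Paths and the unit name may differ on your box — check your own service file:)

```ini
# /etc/systemd/system/mastodon-sidekiq.service (excerpt; verify against your install)
[Service]
Environment="DB_POOL=25"
ExecStart=/home/mastodon/.rbenv/shims/bundle exec sidekiq -c 25
```

Raising the thread count means bumping both numbers, then a daemon-reload and a restart of the service.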
4/?
-
@virtuous_sloth When someone here writes a post, makes a comment, or interacts with a remote post, the activity lands in the push queue and is then distributed to other servers in the network. This is pretty active on our server but tends to clear very quickly.
The pull queue “handles tasks such as handling imports, backups, resolving threads, deleting users, and forwarding replies.” For some reason it’s been getting clogged of late.
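(The queues and their relative weights live in Mastodon’s config/sidekiq.yml; this is roughly the stock layout, though the weights and exact queue list vary between Mastodon versions, so check your own file:)

```yaml
:queues:
  - [default, 8]
  - [push, 6]
  - [ingress, 4]
  - [mailers, 2]
  - [pull]
  - [scheduler]
```

A higher weight means the queue is checked more often, so pull’s low priority is part of why it’s the first to back up under load.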
-
That article from DigitalOcean suggests budgeting 10-15 threads per 1 GB of RAM.
We also need to give each thread its own DB connection.
In staging the DB is local, so we don’t need to worry too much about a few extra connections.
In production, we’re connected to a DB pool that will funnel the extra connections into a smaller number of connections to the DB. Our Database server still has oodles of capacity to keep up with all of this.
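(A back-of-envelope version of that rule of thumb in plain Ruby — the 10-15 threads/GB figure is the article’s heuristic, not something Sidekiq enforces:)

```ruby
# Rule of thumb from the DigitalOcean article: 10-15 sidekiq threads
# per 1 GB of RAM, plus one DB connection per thread.
def thread_budget(ram_gb, threads_per_gb)
  ram_gb * threads_per_gb
end

ram_gb = 2  # e.g. a small staging droplet
puts thread_budget(ram_gb, 10)  # conservative budget: 20 threads
puts thread_budget(ram_gb, 15)  # generous budget: 30 threads
# Either way, plan on that many DB connections (or a pooler in front).
```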
5/?
-
Staging server only has 2 GB of RAM but it also has virtually no queue activity so let’s give it a shot.
Having confirmed that we have sufficient resources to accommodate the increase and then picked a number out of a hat, I’m going to increase the number of threads to 40.
6/?
-
@virtuous_sloth Only thing I can figure is that we’ve taken on a bunch of new followers as a server and tipped across some threshold? Or the users we follow are more chatty of late?
It did start when Eurovision began, but as a small server of Canadians idk how much Eurovision discussion we were plugged into.
A few more threads ought to help. If not, I’ll delve deeper.
-
@mick @thisismissem so the thing that has me thinking about this is I was using activitypub.academy to view some logs. I did a follow to my server and it showed that my server continually sent duplicate "Accept" messages back. I can't tell if that's an issue with my server or with the academy. Because I can't see my logs.
-
@mick @thisismissem people told me that activity pub was very "chatty". I understand a lot better why that is now. But I now suspect that there's also a ton of inefficiency there. Because few people are looking at the actual production behavior.
-
@thisismissem @mick one thing I know is a problem is retries that build up in sidekiq. Sidekiq will retry jobs basically forever. And when servers disappear, their jobs sit in the retry queue failing indefinitely. I'm sure larger instances with infra teams do some cleanup here. But how are smaller instances supposed to learn about this?
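(For what it’s worth, stock Sidekiq doesn’t literally retry forever: the default is 25 attempts with roughly count**4 + 15 seconds of backoff between them, plus random jitter, after which the job lands in the Dead set. A quick sum of that schedule, jitter omitted:)

```ruby
# Sidekiq's default backoff between retry attempts, minus the random jitter.
def retry_delay(count)
  (count ** 4) + 15
end

# 25 attempts (counts 0..24) before the job is moved to the Dead set.
total_seconds = (0...25).sum { |c| retry_delay(c) }
puts (total_seconds / 86_400.0).round(1)  # ~20.4 days of retrying
```

So a dead server’s jobs do clog the retry set for weeks, but not indefinitely.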
-
@[email protected] that's definitely a problem with the sidekiq configuration then. Keeping retries forever just clogs up the queue... I can't imagine they'd be kept forever!