For the first time the #CoSocialCa Mastodon server has started to struggle just a little bit to keep up with the flow of the Fediverse.
We’ve usually been “push” heavy but we’ve started to see some spikes in “pull” queue latency. The worst of these spikes was today, where we fell behind by at least a couple minutes for most of the afternoon.
1/?
-
Emelia replied to Mick 🇨🇦
@mick I wonder if it'd make sense to produce log messages for each instance, actor, activity type, directionality and other properties over time & make that graphable?
What percentage of activities processed or sent in the last day were Deletes, what types of activities does Server Y send me?
(Maybe this also ties into @polotek's question earlier today)
-
This is great! It’s exciting to see our community growing.
I’m going to make a simple change to see if we can better keep up.
The system that we’re running on has plenty of headroom for more sidekiq threads.
2/?
-
Mick 🇨🇦 replied to Emelia
@thisismissem @polotek That would be helpful. I’d love to be better able to interpret weird traffic spikes like this.
Without being creepy about it.
-
For anyone interested in understanding the guts of Mastodon, I have found this article from DigitalOcean very helpful: https://www.digitalocean.com/community/tutorials/how-to-scale-your-mastodon-server#perfecting-sidekiq-queues
Eventually we’ll grow so big that we’ll need oodles of sidekiq queues, we’ll want to be able to customize how many of each type we run, and we’ll spread them as jobs across multiple servers, and so on.
But for now I’m just going to make the number of threads slightly bigger and see what happens.
3/?
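For anyone following along, the eventual shape of that is one sidekiq service per queue. A rough sketch of what that split might look like; the unit names, paths, and thread counts here are illustrative assumptions, not anything we actually run:

    # /etc/systemd/system/mastodon-sidekiq-push.service (sketch)
    [Unit]
    Description=mastodon sidekiq worker for the push queue
    After=network.target

    [Service]
    Type=simple
    User=mastodon
    WorkingDirectory=/home/mastodon/live
    Environment="RAILS_ENV=production"
    Environment="DB_POOL=25"
    ExecStart=/home/mastodon/.rbenv/shims/bundle exec sidekiq -c 25 -q push
    Restart=always

    [Install]
    WantedBy=multi-user.target

    # ...and a matching mastodon-sidekiq-pull.service with "-q pull", and so on:
    # one unit per queue, each sized independently.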
-
@mick Push and pull depend on the point of view. Would you please clarify as it isn't immediately obvious to me (it occurs to me that I do not have a good mental model of AP)?
-
We’ll do this in staging first, because I am a responsible sysadmin (and I am only ever half sure I know what I’m doing).
We’re running the default config that came with our DigitalOcean droplet, which has a single sidekiq service running 25 threads.
4/?
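For reference, the two lines that matter in the stock unit look roughly like this (paths as in a standard non-Docker Mastodon install; a sketch, not a copy of our file):

    # /etc/systemd/system/mastodon-sidekiq.service (excerpt, sketch)
    Environment="DB_POOL=25"
    ExecStart=/home/mastodon/.rbenv/shims/bundle exec sidekiq -c 25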
-
@virtuous_sloth When someone here writes a post, makes a comment, or interacts with a remote post, the activity lands in the push queue and is then distributed to other servers in the network. This is pretty active on our server but tends to clear very quickly.
The pull queue “handles tasks such as handling imports, backups, resolving threads, deleting users, and forwarding replies,” and for some reason it is getting clogged of late.
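For the curious: Mastodon declares those queues, with relative priorities, in config/sidekiq.yml. As I understand it the defaults look roughly like this (treat it as a sketch; the exact weights can differ between releases):

    :queues:
      - [default, 8]
      - [push, 6]
      - [ingress, 4]
      - [mailers, 2]
      - [pull]
      - [scheduler]

The weights control roughly how often the worker checks each queue, which is part of why a busy pull queue can fall behind even when the process has spare capacity.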
-
@mick interesting. Definitely sounds like pull would be a lot lower workload, but perhaps the push queues have a lot more resources dedicated to them while pull does not?
-
That article from DigitalOcean suggests budgeting roughly 1 GB of RAM for every 10-15 threads.
We also need to give each thread its own DB connection.
In staging the DB is local, so we don’t need to worry too much about a few extra connections.
In production, we’re connected to a connection pooler that will funnel the extra connections into a smaller number of connections to the DB. Our database server still has oodles of capacity to keep up with all of this.
5/?
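Rough arithmetic with that rule of thumb (an estimate, not a measurement of our droplet): our current 25 threads call for somewhere around 1.7-2.5 GB of RAM (25 ÷ 15 up to 25 ÷ 10) and hold 25 application-side DB connections. Every thread we add moves both numbers up in proportion, with the pooler absorbing the connection growth on the database side.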
-
The staging server only has 2 GB of RAM, but it also has virtually no queue activity, so let’s give it a shot.
Having confirmed that we have sufficient resources to accommodate the increase, and then picked a number out of a hat, I’m going to increase the number of threads to 40.
6/?
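Concretely, on a stock non-Docker install that's a small edit to the sidekiq unit plus a restart. A sketch, assuming the default paths and that DB_POOL gets bumped to match:

    # /etc/systemd/system/mastodon-sidekiq.service (edited lines, sketch)
    Environment="DB_POOL=40"
    ExecStart=/home/mastodon/.rbenv/shims/bundle exec sidekiq -c 40

    # pick up the change and restart the worker
    sudo systemctl daemon-reload
    sudo systemctl restart mastodon-sidekiq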
-
No signs of trouble. Everything still hunky-dory in staging.
On to production.
If this is the last post you read from our server then something has gone very wrong.
7/?
-
@virtuous_sloth Only thing I can figure is that we’ve taken on a bunch of new followers as a server and tipped across some threshold? Or the users we follow are more chatty of late?
It did start when Eurovision began, but as a small server of Canadians idk how much Eurovision discussion we were plugged into.
A few more threads ought to help. If not, I’ll delve deeper.
-
Aaaand we’re good.
I’ll keep an eye on things over the next days and weeks and see if this has any measurable impact on performance one way or the other.
And that’s enough recreational server maintenance for one Friday night.
8/?
-
@mick @thisismissem so the thing that has me thinking about this is I was using activitypub.academy to view some logs. I did a follow to my server and it showed that my server continually sent duplicate "Accept" messages back. I can't tell if that's an issue with my server or with the academy. Because I can't see my logs.
-
@mick @thisismissem people told me that ActivityPub was very "chatty". I understand a lot better why that is now. But I now suspect that there's also a ton of inefficiency there. Because few people are looking at the actual production behavior.
-
@thisismissem @mick one thing I know is a problem is retries that build up in sidekiq. Sidekiq will retry jobs basically forever. And when servers disappear, their jobs sit in the retry queue failing indefinitely. I'm sure larger instances with infra teams do some cleanup here. But how are smaller instances supposed to learn about this?
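For anyone who'd rather check than guess, Sidekiq exposes the retry and dead sets from the Rails environment. A minimal sketch, assuming a standard non-Docker install run from the Mastodon directory as the mastodon user:

    # print how many jobs are currently waiting to retry or parked as dead
    RAILS_ENV=production bundle exec rails runner \
      'puts "retry: #{Sidekiq::RetrySet.new.size}, dead: #{Sidekiq::DeadSet.new.size}"'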
-
@polotek that's definitely a problem with the sidekiq configuration then. Keeping retries forever just clogs up the queue... I can't imagine they'd be kept forever!