Posts made by feld@friedcheese.us

feld

@octothorpe you can, my friend @SlicerDicer just did it without issues

though the specific model he used caused problems because there was ONE specific USB-C port that does not support being used as a boot drive and it wasn't obvious. He had to dig through hardware docs to find it as the source of the problem haha... thanks Tim Apple... :woz:

feld

@octothorpe just buy the 4TB NVMe, get a 20Gbit USB 4.0 adapter, plug it in, and make it your boot drive.

feld

@silverpill @p @jeffcliff @frogzone looking at that mastodon pull request: they're really trying to overcomplicate everything aren't they

feld

@p @silverpill @frogzone @jeffcliff We'll do ed25519 if it's gonna actually stick. I don't think it's been revisited since the original proposal was seen and I didn't even know Honk was supporting it.

Anyone else? Mastodon?

feld

@p Unfamiliar with tkrzw; I was thinking they should build on LMDB.

feld

@p I spent so many hours fighting with IPFS I think it's a dead end. You can even find a stale MR where I upgraded the storage backend from Badger to Badger2 (because the default filesystem store sucks) and it didn't help much. Nobody working on Kubo has any interest in fixing the storage scalability problems it seems. To use it for fedi we'd really want the garbage collection to work and not spin 100% CPU and IO for hours but that's what it does when you get enough data in there 🤬

feld

@p

> If it's targeted, that's great, but a meg of stuff, something useful should have been in there.

Not really. The only good stuff in there would be the Ecto stats but they're not granular enough to be useful. Someone sharing the raw Postgres stats from pg_exporter would have been better.

> but "the directory listing is 187MB" is a real problem (even if it's not a pointer-chase, you're still reading 187MB from the disk and you're still copying 187MB of dirents into userspace and `ls` takes 45s), and that gets marked as a dup of a "nice to have" S3 bug, but this is the default Pleroma configuration. It's stuff you can't write off, you know? You hit real constraints.

Where is the "ls" equivalent happening for Pleroma? Fetching a file by name is not slow even when there are millions in the same directory.

> Yeah, BEAM using CPU isn't the bottleneck, though. Like I said in the post you're replying to, it's I/O.

BEAM's schedulers intentionally busy-wait, spinning on the CPU when they're done doing work in the hopes that they will get more work. That's what causes the bottleneck. It might not look like high CPU usage percentage-wise, but it's preventing the kernel from context switching to another process.

And what IO? Can someone please send dtrace, systemtap, some useful tracing output showing that BEAM is doing excessive unnecessary IO? BEAM should be doing almost zero IO; we don't read and write to files except when people upload attachments. Even if you're not using S3 your media files should be served by your webserver, not Pleroma/Phoenix.

> That is cool, but if you could do that for fetching objects from the DB, you'd have a bigger bump.

patches welcome, but I don't have time to dig into this in the very near future.

> Anyway, I am very familiar with fedi's n*(n-1)/2 problems. (Some time in the future, look for an object proxy patch.)

plz plz send

> But you know, back-pressure, like lowering the number of retries based on the size of the table, that could make a big difference when a system gets stressed.

patches welcome. You can write custom backoff algorithms for Oban. It's supported.
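
A rough sketch of what such a back-pressure policy could look like, written in Python for illustration rather than as an actual Oban worker. The function names, the 15-second base delay, the 10,000-job soft limit, and the one-day cap are all made-up assumptions, not Pleroma or Oban defaults:

```python
def backoff_seconds(attempt, queue_depth, base=15, soft_limit=10_000, cap=86_400):
    """Exponential backoff that stretches as the job table grows.

    Hypothetical policy: every full `soft_limit` of queued jobs
    multiplies the delay by one more step, capped at `cap` seconds.
    """
    delay = base * 2 ** (attempt - 1)
    pressure = 1 + queue_depth // soft_limit
    return min(delay * pressure, cap)


def max_attempts(queue_depth, default=20, floor=3, soft_limit=10_000):
    """Allow fewer retries when the backlog is large (hypothetical policy)."""
    return max(floor, default // (1 + queue_depth // soft_limit))
```

With an empty table the third attempt waits 60 seconds; with 25,000 jobs queued the same attempt waits 180 seconds and the retry budget drops from 20 to 6.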

> You could ping graf; it's easier to just ask stressed instances than to come up with a good way to do stress-testing.

Everyone I've asked to get access to their servers which were struggling has refused except mint. Either everyone's paranoid over nothing or far too many people have illegal shit on their servers. I don't know what to think. It's not exactly a motivator to solve their problems.

> Oh, yeah, so 403s? What counts as permanent?

Depends on what it is. If you get a 403 on an object fetch or profile refresh they're blocking you, so no point in retrying. If it was deleted you get a 404 or a 410, so no point in retrying that either... (when a Delete for an activity you didn't even have came in, it would put in the job to fetch the activity it was referencing... and kept trying to fetch it over and over and over...)
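
The rule above, sketched in Python for illustration (this is not Pleroma's actual code; the `Outcome` enum and `classify_fetch` are hypothetical names, and treating every other non-2xx status as retryable is an assumption):

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    DISCARD = "discard"  # permanent failure: never retry
    RETRY = "retry"      # possibly transient: let the job queue retry


# 403 means we are blocked; 404 and 410 mean the object is gone.
PERMANENT_STATUSES = {403, 404, 410}


def classify_fetch(status):
    """Decide what a fetch job should do with an HTTP status code."""
    if 200 <= status < 300:
        return Outcome.OK
    if status in PERMANENT_STATUSES:
        return Outcome.DISCARD
    return Outcome.RETRY
```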

> You think you might end up with a cascade for those? Like, if rendering TWKN requires reading 10MB...

No, I mean it was hanging to fetch latest data from remote server before rendering the activity, which was completely unnecessary. Same with rich media previews -- if it wasn't in cache, the entire activity wouldn't render until it tried to fetch it. Stupid when it could be fetched async and pushed out over websocket like we do now.

> The schedule doesn't bug me. ... the following bug is a big problem, that's a basic thing that was broken in a release. Some kind of release engineer/QA situation could have caught it.

Again, it wasn't broken in a release. There were two different bugs: one was that we used the wrong source of truth for whether or not you were successfully following someone. The other bug became more prominent because more servers started federating Follow requests without any cc field and for some reason our validator was expecting at least an empty cc field when it doesn't even make sense to have one on a Follow request.

You seem to have a lot of opinions and ideas on how to improve things but nobody else on the team seems to give a shit about any of this stuff right now. So send patches. I'll merge them. Join me.

feld

@p

> I mean, like I mentioned, the Prometheus endpoints were public at the time.

Problem is that this data is useful for monitoring overall health of an instance but doesn't give enough granular information to track down a lot of issues. With the metrics/telemetry work I have in progress we'll be able to export more granular Pleroma-specific metrics that will help a lot.

> The main bottleneck is the DB

So often it's just badly configured Postgres. If your server has 4 cores and 4 GB of RAM you can't go use pgtune and tell it you want to run Postgres with 4 cores and 4GB. There's nothing leftover for the BEAM. You want at least 500MB-1GB dedicated to BEAM, more if your server has a lot of local users so it can handle memory allocation spikes.

And then what else is running on your OS? That needs resources too. There isn't a good way to predict the right values for everyone.
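
As a back-of-the-envelope illustration only, the budgeting looks something like this in Python. The 1GB BEAM reservation is the upper end of the range above; the 512MB for the OS and everything else is a made-up placeholder you'd have to measure on your own box:

```python
def postgres_ram_budget_mb(total_mb, beam_mb=1024, os_mb=512):
    """RAM left to tell a Postgres tuning tool about, after reserving
    memory for BEAM and for the OS plus other services.

    `os_mb` is a hypothetical placeholder, not a recommendation.
    """
    budget = total_mb - beam_mb - os_mb
    if budget <= 0:
        raise ValueError("not enough RAM left for Postgres")
    return budget
```

On a 4GB server that leaves 2560MB for Postgres, not 4096MB.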

Like I said, it's running *great* on my little shitty thin client PC with old slow Intel J5005 cores and 4GB RAM. But I have an SSD for the storage and almost nothing else runs on the OS (FreeBSD). I'm counting a total of 65 processes before Pleroma, Postgres, and Nginx are running. Most Linux servers have way more services running by default. That really sucks when trying to make things run well on lower specced hardware.

You also have to remember that BEAM is greedy and will intentionally hold the CPU longer than it needs because it wants to produce soft-realtime performance results. This needs to be tuned down on lower resource servers because BEAM itself will be preventing Postgres from doing productive work. It's just punching itself in the face then. Set these vm.args on any server that isn't massively overpowered:

+sbwt none
+sbwtdcpu none
+sbwtdio none

> using an entire URL for an index is costing a lot in disk I/O

For the new Rich Media cache (link previews stored in the db so they're not constantly refetched) I hashed the URLs for the index for that same reason. Research showed a hash and the chosen index type were super optimal.
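
The general idea, as a minimal Python sketch: index on a fixed-size digest instead of the full URL. SHA-256 and the in-memory dict are assumptions for illustration; the hash and index type actually used in Pleroma aren't stated here.

```python
import hashlib


def url_key(url):
    """Fixed-size (32 byte) lookup key for a URL of any length."""
    return hashlib.sha256(url.encode("utf-8")).digest()


# Hypothetical cache keyed by digest rather than by the raw URL.
preview_cache = {}


def store_preview(url, preview):
    preview_cache[url_key(url)] = preview


def lookup_preview(url):
    return preview_cache.get(url_key(url))
```

However long the URL is, the indexed key stays 32 bytes.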

Another thing I did was I noticed we were storing *way* too much data in Oban jobs. Like when you federated an activity we were taking the entire activity's JSON and storing it in the jobs. Imagine making a post with 100KB of content that needs to go to 1000 servers? Each delivery job in the table was HUGE. Now it's just the ID of the post and we do the JSON serialization at delivery time. Much better, lower resource usage overall, lower IO.

Even better would be if we could serialize the JSON *once* for all deliveries but it's tricky because we gotta change the addressing for each delivery. Jason library has some features we might be able to leverage for this but it doesn't seem important to chase yet. Even easier might be to put placeholders in the JSON text, store it in memory, and then just use regex or cheaper string replacement to fill those fields at delivery time. Saves all that repeat JSON serialization work.
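
A minimal Python sketch of the placeholder idea, for illustration only. It assumes the per-delivery addressing lives in the `to` and `cc` fields, and the marker strings and function names are made up:

```python
import json

# Hypothetical marker tokens that stand in for per-delivery addressing.
TO_MARK = "@@TO@@"
CC_MARK = "@@CC@@"


def render_template(activity):
    """Serialize the activity once, with markers where addressing goes."""
    templated = dict(activity, to=TO_MARK, cc=CC_MARK)
    return json.dumps(templated)


def fill(template, to, cc):
    """Cheap per-delivery step: swap the quoted markers for JSON arrays.

    Caveat: any other field whose entire value equals a marker string
    would be replaced too, so real markers would need to be unguessable.
    """
    return (
        template
        .replace(json.dumps(TO_MARK), json.dumps(to))
        .replace(json.dumps(CC_MARK), json.dumps(cc))
    )
```

The expensive `json.dumps` over the whole activity runs once; each delivery only serializes two short lists and does two string replacements.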

Other things I've been doing:

- making sure Oban jobs that have an error we should really treat as permanent are caught and don't allow the job to repeat. It's wasteful for us, rude to remote servers when we're fetching things

- finding every possible blocker for rendering activities/timelines and making those things asynchronous. One of the most recent ones I found was with polls. They could stall rendering a page of the timeline if the poll wasn't refreshed in the last 5 mins or whatever. (and also... I'm pretty sure polls were still being refreshed AFTER the poll was closed 🤬)

I want Pleroma to be the most polite Fedi server on the network. There are still some situations where it's far too chatty and sends requests to other servers that could be avoided, so I'm trying to plug them all. Each of these improvements lowers the resource usage on each server. Just gotta keep striving to make Pleroma do *less* work.

I do have my own complaints about the whole Pleroma releases situation. I wish we were cutting releases like ... every couple weeks if not every month. But I don't make that call.

feld

@p @cvnt @phnt @NonPlayableClown @Owl @dj @ins0mniak @transgrammaractivist Nobody proved there was an *Oban* bottleneck, and nobody has since.

I'm always running my changes live on my instances. They were massively overpowered. Now I have a severely underpowered server and it's still fine.

If I could reproduce reported issues it would be much easier to solve them but things generally just work for me.

A ton of work has been put into correctness (hundreds of Dialyzer fixes) and tracking down elusive bugs and looking for optimizations like reducing JSON encode/decode work when we don't need to, avoiding excess queries, etc.

I'm halfway done with an entire logging rewrite and telemetry integration which will make it even easier to identify bottlenecks.

It's actually been going really great

feld

RIP Android

Discontinuing syncthing-android - Announce - Syncthing Community Forum https://forum.syncthing.net/t/discontinuing-syncthing-android/23002

feld

@thisismissem @jerry then please make sure Mastodon does not make this same mistake.

If GoToSocial wants to pretend that it's possible to lock a thread and keep people from having a discussion they can do that.

But it won't accomplish anything outside their bubble. The users of any other software that exists can still have a discussion under that public parent post. The trolls and harassers will continue to do what they do, you just won't see it. Which can be accomplished by simply muting the thread or silently dropping the activities.

feld

@thisismissem @jerry Stop this please. There's no need for it. If you don't want to receive replies to a post, just drop them. The senders do not need to know that the server does not want responses to a public post. There is no benefit to this.

If the server software (Mastodon, GoToSocial) supports recognizing these locked threads it should deny the ability to send the reply in the first place.

If the server software does not support recognizing these locked threads there is no point in responding with a Reject. They won't understand the Reject anyway.

Just add a new key to the activity/object and let software that understands it Do The Right Thing.

feld

@thisismissem @jerry

> there are cases where you want the server to know “hey, we rejected your message"

That's what we thought about email too and it formed the basis for mail bomb / backscatter attacks. This is just the Fediverse making the same mistakes all over again...

I guarantee this will be abused and cause servers to have a massive backlog of Rejects in their queues. Especially since the attacker can ensure the Reject can never be delivered successfully.

IMO every Reject activity should be from deliberate human interaction so it can't be weaponized.

Communicating state with Accept/Reject for Follow Requests is the only option we've got. But trying to further establish state across the Fediverse for other activities is not something I'd recommend.

We need to stop trying to make the Fediverse act like a centralized platform because it cannot and will not work that way.

edit: also the more we do things like this the harder it will be to self-host on commodity hardware as it continues to raise the hardware requirements to process the garbage that will constantly be trying to crush your little fedi server. We already suffer under Mastodon Deletes, please don't send Rejects.

feld

@thisismissem @jerry nooo please don't send anything back. This is a bad idea

feld

World's biggest waste of energy

feld

I know what a TCP packet looks like, it's easy to find diagrams even.

But what about the data going over a Unix Domain Socket? I can't seem to find a diagram of what that looks like and what/if any encapsulation is happening

feld

@danderson can't wait to see this used in a programming language instead of curly braces

feld

@erincandescent @thisismissem @mttaggart @mwl > they really do not want to wade into geopolitics

Ok I'll bite, why would they have .tw for Taiwan then?

feld

This "beanless coffee" is a scam. I don't need my coffee to be ultra processed crap that sorta tastes like coffee

Beanless Coffee https://www.atomocoffee.com/pages/beanless-coffee

feld

@Raccoon That's super admirable of you. Not many people would offer to do this. Everyone on fedi is so tribal, so I really respect that.

I'll probably DM you tomorrow.

Have a good night
