Periodically I see the absolute monster that is the Akkoma timeline query
-
SELECT a0."id", a0."data", a0."local", a0."actor", a0."recipients", a0."inserted_at", a0."updated_at", o1."id", o1."data", o1."inserted_at", o1."updated_at", b2."id", b2."user_id", b2."activity_id", b2."inserted_at", b2."updated_at", NOT (t3."id" IS NULL) FROM "activities" AS a0 INNER JOIN "objects" AS o1 ON (o1."data"->>'id') = COALESCE(a0."data"->'object'->>'id', a0."data"->>'object') LEFT OUTER JOIN "bookmarks" AS b2 ON (b2."user_id" = $1) AND (b2."activity_id" = a0."id") LEFT OUTER JOIN "thread_mutes" AS t3 ON (t3."user_id" = $2) AND (t3."context" = a0."data"->>'context') INNER JOIN LATERAL (SELECT is_active from users WHERE ap_id = a0."actor" AND is_active = TRUE) AS f4 ON TRUE INNER JOIN "users" AS u5 ON (a0."actor" = u5."ap_id") AND (u5."invisible" = FALSE) WHERE ((($3 && a0."recipients")) OR (a0."actor" = $4)) AND (a0."id" > $5) AND (a0."data"->>'type' = ANY($6)) AND (not (a0."actor" = ANY($7))) AND (((not (a0."recipients" && $8)) or a0."actor" = $9)) AND ((recipients_contain_blocked_domains(a0."recipients", $10) = false) or a0."actor" = $11) AND (not (a0."data"->>'type' = 'Announce' and a0."data"->'to' ?| $12)) AND ((not (split_part(a0."actor", '/', 3) = ANY($13))) or a0."actor" = ANY($14)) AND ((not (split_part(o1."data"->>'actor', '/', 3) = ANY($15))) or (o1."data"->>'actor') = ANY($16)) AND (not ( a0."data"->>'type' = 'Announce' and a0."actor" = ANY($17))) AND (a0."data"->>'type' != 'Announce' or o1."data"->>'actor' != $18) AND (not(o1."data"->>'type' = 'Answer')) AND (a0."id" > $19) ORDER BY a0."id" desc nulls last LIMIT $20
And it just makes me think that, as much complexity as it adds, Mastodon's decision to precompute the timelines was correct
(Not depicted here: the query parameters, which include the URL of the followers collection of every user I follow)
-
Honestly I wonder how many indexes one could drop if the timeline was precomputed instead
-
Gabbo the wafrn guy :neocat_floof_devil_256: (not a vampire) replied to Erin 👽✨
@erincandescent wafrn pastes in the internal ids of the users you follow, but no joins (multiple queries get done but it's always the same number)
ids are hardcoded because a subquery was really slow and I used the power of redis for this one
-
Gabbo the wafrn guy :neocat_floof_devil_256: (not a vampire) replied to Gabbo the wafrn guy :neocat_floof_devil_256: (not a vampire)
@erincandescent mine used to be worse btw. lots of joins and stuff
-
Erin 👽✨ replied to Gabbo the wafrn guy :neocat_floof_devil_256: (not a vampire)
@gabboman I think the best approach would just be to create an inbox table for each user where each row contained
(user_id, activity_id, object_id, reasons_post_might_be_hidden_maybe)
that it could just slam down and then JOIN in the activity/object data it needs for the frontend
-
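[Editor's note] That per-user inbox idea can be sketched roughly as follows. All table and column names are hypothetical (taken from the row shape Erin describes, not from any real project's schema), and SQLite stands in for a real database purely for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE activities (id INTEGER PRIMARY KEY, actor TEXT, data TEXT);
-- Hypothetical precomputed per-user inbox: one row per (user, activity),
-- written ("slammed down") at delivery time, then JOINed for display.
CREATE TABLE inbox (
    user_id       INTEGER NOT NULL,
    activity_id   INTEGER NOT NULL REFERENCES activities(id),
    hidden_reason TEXT,  -- e.g. thread mute or filter match, NULL if visible
    PRIMARY KEY (user_id, activity_id)
);
""")
con.execute("INSERT INTO activities VALUES (1, 'alice', 'hello'), (2, 'bob', 'hi')")
con.execute("INSERT INTO inbox VALUES (42, 1, NULL), (42, 2, NULL)")

# Timeline read: one dense scan of the user's own inbox rows,
# plus a single join to pull in the activity payload.
rows = con.execute("""
    SELECT a.id, a.actor, a.data, i.hidden_reason
    FROM inbox AS i JOIN activities AS a ON a.id = i.activity_id
    WHERE i.user_id = ?
    ORDER BY i.activity_id DESC
    LIMIT 20
""", (42,)).fetchall()
```

The visibility checks that bloat the Akkoma query (mutes, blocks, domain filters) would be resolved once at write time instead of on every read.
-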
Jenniferplusplus replied to Erin 👽✨
@erincandescent
me: that's not SO bad
me: wait, there's like 4 tables referenced in that SELECT statement, where are those coming from?
me: *sees read more link*
me: *clicks*
me: *gasps*
-
Erin 👽✨ replied to Jenniferplusplus
@jenniferplusplus I'm sure once upon a time this was an entirely reasonable query
…
It is no longer an entirely reasonable query
-
Jenniferplusplus replied to Erin 👽✨
@erincandescent
FWIW, I'm actually pretty convinced that timelines are an obvious use case for a timeseries database, and based on that, Letterbook can partially pre-compute them, with fast insertion and queries. I don't have the actual SQL this generates handy, but it closely resembles the source LINQ expression:

```csharp
(from post in _feeds.Timelines
 where post.Time <= before
 where post.AudienceId in audienceKeys
 where includeShared != (post.SharedBy == null)
 select post)
.Take(limit)
.Distinct();
```
-
Jenniferplusplus replied to Jenniferplusplus
@erincandescent At some point, I'll likely need to complicate this a little bit, because that where ... in is less performant than a join, so I'll need to find a break point where it makes more sense to create a temporary join table rather than include a big list of keys. but still
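[Editor's note] The trade-off being described can be sketched like this, with hypothetical tables in SQLite; where the break-even point actually falls in a real engine depends on the planner and the size of the key list:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, audience_id INTEGER)")
con.executemany("INSERT INTO posts VALUES (?, ?)", [(i, i % 5) for i in range(20)])

audience_keys = [1, 3]

# Small key lists: inline them as a parameterised IN (...) clause.
placeholders = ",".join("?" * len(audience_keys))
small = con.execute(
    f"SELECT id FROM posts WHERE audience_id IN ({placeholders})", audience_keys
).fetchall()

# Past some break-even point: load the keys into a temporary table and
# JOIN against it, instead of shipping a huge parameter list per query.
con.execute("CREATE TEMP TABLE keys (audience_id INTEGER PRIMARY KEY)")
con.executemany("INSERT INTO keys VALUES (?)", [(k,) for k in audience_keys])
large = con.execute(
    "SELECT p.id FROM posts AS p JOIN keys AS k ON k.audience_id = p.audience_id"
).fetchall()
```

Both forms return the same rows; only the plan (and the cost of transferring the keys) differs.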
-
Erin 👽✨ replied to Jenniferplusplus
@jenniferplusplus I kinda think you want to precompute a separate timeline for each user, because otherwise you will just end up dealing with so much complexity etc. for blocking
(obviously this does mean that you e.g. need to garbage collect entries when someone blocks someone else)
-
Jenniferplusplus replied to Erin 👽✨
@erincandescent that's handled via the audience mechanism, so blocking someone removes them from your followers audience, and thus their next timeline query will exclude your posts.
The audience memberships are queried separately, prior to this timeline query.
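[Editor's note] A toy in-memory model of that two-step flow, with every name hypothetical (this is not Letterbook's actual schema, just the shape of the mechanism as described):

```python
# Each post is tagged with an audience key; a user's timeline query filters
# on the set of audiences they belong to. Blocking only edits a membership,
# so there are no precomputed timeline rows to garbage-collect.
audiences = {"followers:alice": {"erin"}, "followers:bob": {"erin"}}
posts = [("alice", "followers:alice", 1), ("bob", "followers:bob", 2)]

def block(user: str, actor: str) -> None:
    """Blocking removes the blocker from the blocked actor's followers audience."""
    audiences[f"followers:{actor}"].discard(user)

def timeline(user: str) -> list[int]:
    # Step 1: audience memberships are queried separately...
    keys = {k for k, members in audiences.items() if user in members}
    # Step 2: ...then the timeline query filters on those keys.
    return [pid for _, key, pid in posts if key in keys]

block("erin", "bob")  # bob's next posts vanish from erin's timeline
```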
-
Erin 👽✨ replied to Jenniferplusplus
@jenniferplusplus yes, but this layout means (structurally) you're scanning the outbox of everyone you follow, which becomes an increasingly big gather op as someone follows more people
Alternatively you can do the fan-out at delivery time and then it just becomes one dense index scan
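[Editor's note] A toy contrast of the two layouts, with hypothetical in-memory data (real implementations use indexes, not Python lists):

```python
import heapq

# Pull model: a timeline read merges every followee's outbox,
# so the gather grows with the follow count.
outboxes = {"alice": [9, 5, 1], "bob": [8, 2], "carol": [7, 3]}  # newest first
following = ["alice", "bob", "carol"]

def timeline_pull(limit: int = 4) -> list[int]:
    # k-way merge over k followed outboxes: the gather op described above.
    merged = heapq.merge(*(outboxes[a] for a in following), reverse=True)
    return [post_id for post_id, _ in zip(merged, range(limit))]

# Push model: delivery-time fan-out already wrote the ids into one
# per-user inbox, so a read is one dense scan regardless of follow count.
inbox = [9, 8, 7, 5, 3, 2, 1]

def timeline_push(limit: int = 4) -> list[int]:
    return inbox[:limit]
```

The push model trades write amplification (one inbox write per follower at delivery time) for that cheap, constant-shape read.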
-
Jenniferplusplus replied to Erin 👽✨
@erincandescent the tsdb that powers timelines isn't the canonical source of truth for the app. It only handles timelines and notifications, and everything gets denormalized at write time. Single origin scenarios like populating an outbox use the canonical db.
-
Jenniferplusplus replied to Jenniferplusplus
@erincandescent There are a few reasons I don't want to precompute explicit individual timelines for each user. A big one is that I want inactive users to cost as close to zero as possible. So with the exception of cases where the inactive user is the only follower of some remote actor, zero compute is spent on their timelines.
-
λ Natty :butterfly_:β:neofox_lesbian: replied to Erin 👽✨
@[email protected] scared to ask where's the part that populates it with all the data the frontend needs
-
Jenniferplusplus replied to Jenniferplusplus
@erincandescent You're right that following more people makes for larger gathers. But it's still just a single inner join at query time, because of the denormalization.
-
Erin 👽✨ replied to λ Natty :butterfly_:β:neofox_lesbian:
@natty the entire thing is
-
@jenniferplusplus I think you're falling prey to corpo-think, where the desired outcome for your service is an eternally increasing number of users.
In my opinion that's the wrong way to design services for communities of the fediverse.
If you limit the number of users, then you need to care less about the inactive ones, because you don't need to worry about eternally increasing compute time.
-
@mariusor @jenniferplusplus even on small instances you need to think about scaling, because social media is just an enormous amount of traffic; people way underestimate this
-
Jenniferplusplus replied to Erin 👽✨
@erincandescent @mariusor
This. But even if that didn't have an effect, supporting large instances is an explicit goal for the project. There's like 5 billion people who use social media, and approximately all of them use "corpo" services, which I think we generally agree is bad. So if that is ever to change, even community services will need to operate at internet scale. The vast majority of people cannot and will not operate their own communication infrastructure.