Periodically I see the absolute monster that is the Akkoma timeline query
-
SELECT a0."id", a0."data", a0."local", a0."actor", a0."recipients", a0."inserted_at", a0."updated_at", o1."id", o1."data", o1."inserted_at", o1."updated_at", b2."id", b2."user_id", b2."activity_id", b2."inserted_at", b2."updated_at", NOT (t3."id" IS NULL) FROM "activities" AS a0 INNER JOIN "objects" AS o1 ON (o1."data"->>'id') = COALESCE(a0."data"->'object'->>'id', a0."data"->>'object') LEFT OUTER JOIN "bookmarks" AS b2 ON (b2."user_id" = $1) AND (b2."activity_id" = a0."id") LEFT OUTER JOIN "thread_mutes" AS t3 ON (t3."user_id" = $2) AND (t3."context" = a0."data"->>'context') INNER JOIN LATERAL (SELECT is_active from users WHERE ap_id = a0."actor" AND is_active = TRUE) AS f4 ON TRUE INNER JOIN "users" AS u5 ON (a0."actor" = u5."ap_id") AND (u5."invisible" = FALSE) WHERE ((($3 && a0."recipients")) OR (a0."actor" = $4)) AND (a0."id" > $5) AND (a0."data"->>'type' = ANY($6)) AND (not (a0."actor" = ANY($7))) AND (((not (a0."recipients" && $8)) or a0."actor" = $9)) AND ((recipients_contain_blocked_domains(a0."recipients", $10) = false) or a0."actor" = $11) AND (not (a0."data"->>'type' = 'Announce' and a0."data"->'to' ?| $12)) AND ((not (split_part(a0."actor", '/', 3) = ANY($13))) or a0."actor" = ANY($14)) AND ((not (split_part(o1."data"->>'actor', '/', 3) = ANY($15))) or (o1."data"->>'actor') = ANY($16)) AND (not ( a0."data"->>'type' = 'Announce' and a0."actor" = ANY($17))) AND (a0."data"->>'type' != 'Announce' or o1."data"->>'actor' != $18) AND (not(o1."data"->>'type' = 'Answer')) AND (a0."id" > $19) ORDER BY a0."id" desc nulls last LIMIT $20
And it just makes me think that, as much complexity as it adds, Mastodon's decision to precompute the timelines was correct
(Not depicted here: the query parameters, which include the URL of the followers collection of every user I follow)
-
Honestly I wonder how many indexes one could drop if the timeline was precomputed instead
-
Gabbo the wafrn guy :neocat_floof_devil_256: (not a vampire) replied to Erin 👽✨
@erincandescent wafrn pastes in the internal ids of the users you follow, but no joins (multiple queries get done but it's always the same number)
ids are hardcoded because a subquery was really slow and I used the power of redis for this one
-
Gabbo the wafrn guy :neocat_floof_devil_256: (not a vampire) replied to Gabbo the wafrn guy :neocat_floof_devil_256: (not a vampire)
@erincandescent mine used to be worse btw. lots of joins and stuff
-
Erin 👽✨ replied to Gabbo the wafrn guy :neocat_floof_devil_256: (not a vampire)
@gabboman I think the best approach would just be to create an inbox table for each user where each row contained
(user_id, activity_id, object_id, reasons_post_might_be_hidden_maybe)
that it could just slam down and then JOIN in the activity/object data it needs for the frontend
-
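[Editor's note] That per-user inbox idea can be sketched roughly as follows. All table and column names are hypothetical (taken from the row shape Erin describes, not from any real project's schema), and SQLite stands in for a real database purely for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE activities (id INTEGER PRIMARY KEY, actor TEXT, data TEXT);
-- Hypothetical precomputed per-user inbox: one row per (user, activity),
-- written ("slammed down") at delivery time, then JOINed for display.
CREATE TABLE inbox (
    user_id       INTEGER NOT NULL,
    activity_id   INTEGER NOT NULL REFERENCES activities(id),
    hidden_reason TEXT,  -- e.g. thread mute or filter match, NULL if visible
    PRIMARY KEY (user_id, activity_id)
);
""")
con.execute("INSERT INTO activities VALUES (1, 'alice', 'hello'), (2, 'bob', 'hi')")
con.execute("INSERT INTO inbox VALUES (42, 1, NULL), (42, 2, NULL)")

# Timeline read: one dense scan of the user's own inbox rows,
# plus a single join to pull in the activity payload.
rows = con.execute("""
    SELECT a.id, a.actor, a.data, i.hidden_reason
    FROM inbox AS i JOIN activities AS a ON a.id = i.activity_id
    WHERE i.user_id = ?
    ORDER BY i.activity_id DESC
    LIMIT 20
""", (42,)).fetchall()
```

The visibility checks that bloat the Akkoma query (mutes, blocks, domain filters) would be resolved once at write time instead of on every read.
-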
Jenniferplusplus replied to Erin 👽✨
@erincandescent
me: that's not SO bad
me: wait, there's like 4 tables referenced in that SELECT statement, where are those coming from?
me: *sees read more link*
me: *clicks*
me: *gasps*
-
Erin 👽✨ replied to Jenniferplusplus
@jenniferplusplus I'm sure once upon a time this was an entirely reasonable query
…
It is no longer an entirely reasonable query
-
Jenniferplusplus replied to Erin 👽✨
@erincandescent
FWIW, I'm actually pretty convinced that timelines are an obvious use case for a timeseries database, and based on that, Letterbook can partially pre-compute them, with fast insertion and queries. I don't have the actual SQL this generates handy, but it closely resembles the source LINQ expression:

```csharp
(from post in _feeds.Timelines
 where post.Time <= before
 where post.AudienceId in audienceKeys
 where includeShared != (post.SharedBy == null)
 select post)
.Take(limit)
.Distinct();
```
-
Jenniferplusplus replied to Jenniferplusplus
@erincandescent At some point, I'll likely need to complicate this a little bit, because that where ... in is less performant than a join, so I'll need to find a break point where it makes more sense to create a temporary join table rather than include a big list of keys. but still
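[Editor's note] The trade-off being described can be sketched like this, with hypothetical tables in SQLite; where the break-even point actually falls in a real engine depends on the planner and the size of the key list:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, audience_id INTEGER)")
con.executemany("INSERT INTO posts VALUES (?, ?)", [(i, i % 5) for i in range(20)])

audience_keys = [1, 3]

# Small key lists: inline them as a parameterised IN (...) clause.
placeholders = ",".join("?" * len(audience_keys))
small = con.execute(
    f"SELECT id FROM posts WHERE audience_id IN ({placeholders})", audience_keys
).fetchall()

# Past some break-even point: load the keys into a temporary table and
# JOIN against it, instead of shipping a huge parameter list per query.
con.execute("CREATE TEMP TABLE keys (audience_id INTEGER PRIMARY KEY)")
con.executemany("INSERT INTO keys VALUES (?)", [(k,) for k in audience_keys])
large = con.execute(
    "SELECT p.id FROM posts AS p JOIN keys AS k ON k.audience_id = p.audience_id"
).fetchall()
```

Both forms return the same rows; only the plan (and the cost of transferring the keys) differs.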
-
Erin 👽✨ replied to Jenniferplusplus
@jenniferplusplus I kinda think you want to precompute a separate timeline for each user, because otherwise you will just end up dealing with so much complexity etc. for blocking
(obviously this does mean that you e.g. need to garbage collect entries when someone blocks someone else)
-
Jenniferplusplus replied to Erin 👽✨
@erincandescent that's handled via the audience mechanism, so blocking someone removes them from your followers audience, and thus their next timeline query will exclude your posts.
The audience memberships are queried separately, prior to this timeline query.
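[Editor's note] A toy in-memory model of that two-step flow, with every name hypothetical (this is not Letterbook's actual schema, just the shape of the mechanism as described):

```python
# Each post is tagged with an audience key; a user's timeline query filters
# on the set of audiences they belong to. Blocking only edits a membership,
# so there are no precomputed timeline rows to garbage-collect.
audiences = {"followers:alice": {"erin"}, "followers:bob": {"erin"}}
posts = [("alice", "followers:alice", 1), ("bob", "followers:bob", 2)]

def block(user: str, actor: str) -> None:
    """Blocking removes the blocker from the blocked actor's followers audience."""
    audiences[f"followers:{actor}"].discard(user)

def timeline(user: str) -> list[int]:
    # Step 1: audience memberships are queried separately...
    keys = {k for k, members in audiences.items() if user in members}
    # Step 2: ...then the timeline query filters on those keys.
    return [pid for _, key, pid in posts if key in keys]

block("erin", "bob")  # bob's next posts vanish from erin's timeline
```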
-
Erin 👽✨ replied to Jenniferplusplus
@jenniferplusplus yes, but this layout means (structurally) you're scanning the outbox of everyone you follow, which becomes an increasingly big gather op as someone follows more people
Alternatively you can do the fan-out at delivery time and then it just becomes one dense index scan
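[Editor's note] A toy contrast of the two layouts, with hypothetical in-memory data (real implementations use indexes, not Python lists):

```python
import heapq

# Pull model: a timeline read merges every followee's outbox,
# so the gather grows with the follow count.
outboxes = {"alice": [9, 5, 1], "bob": [8, 2], "carol": [7, 3]}  # newest first
following = ["alice", "bob", "carol"]

def timeline_pull(limit: int = 4) -> list[int]:
    # k-way merge over k followed outboxes: the gather op described above.
    merged = heapq.merge(*(outboxes[a] for a in following), reverse=True)
    return [post_id for post_id, _ in zip(merged, range(limit))]

# Push model: delivery-time fan-out already wrote the ids into one
# per-user inbox, so a read is one dense scan regardless of follow count.
inbox = [9, 8, 7, 5, 3, 2, 1]

def timeline_push(limit: int = 4) -> list[int]:
    return inbox[:limit]
```

The push model trades write amplification (one inbox write per follower at delivery time) for that cheap, constant-shape read.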
-
Jenniferplusplus replied to Erin 👽✨
@erincandescent the tsdb that powers timelines isn't the canonical source of truth for the app. It only handles timelines and notifications, and everything gets denormalized at write time. Single origin scenarios like populating an outbox use the canonical db.
-
Jenniferplusplus replied to Jenniferplusplus
@erincandescent There are a few reasons I don't want to precompute explicit individual timelines for each user. A big one is that I want inactive users to cost as close to zero as possible. So with the exception of cases where the inactive user is the only follower of some remote actor, zero compute is spent on their timelines.
-
λ Natty :butterfly_:β:neofox_lesbian: replied to Erin 👽✨
@[email protected] scared to ask where's the part that populates it with all the data the frontend needs
-
Jenniferplusplus replied to Jenniferplusplus
@erincandescent You're right that following more people makes for larger gathers. But it's still just a single inner join at query time, because of the denormalization.
-
Erin 👽✨ replied to λ Natty :butterfly_:β:neofox_lesbian:
@natty the entire thing is
-
@jenniferplusplus I think you're falling prey to corpo-think, where the desired outcome for your service is an eternally increasing number of users.
In my opinion that's the wrong way to design services for communities of the fediverse.
If you limit the number of users, then you need to care less about the inactive ones, because you don't need to worry about eternally increasing compute time.
-
@mariusor @jenniferplusplus even on small instances you need to think about scaling, because social media is just an enormous amount of traffic; people way underestimate this
-
Jenniferplusplus replied to Erin 👽✨
@erincandescent @mariusor
This. But even if that didn't have an effect, supporting large instances is an explicit goal for the project. There's like 5 billion people who use social media, and approximately all of them use "corpo" services, which I think we generally agree is bad. So if that is ever to change, even community services will need to operate at internet scale. The vast majority of people cannot and will not operate their own communication infrastructure.