on #bluesky there literally is the guy with the big dial constantly looking back at the audience for approval like a contestant on the price is right

Irenes (many)

@aud @jonny we want to be SUPER clear that we aren't making fun of the person. especially since it really does sound like there's just one person doing this part of the system, it's a super easy idea to miss.

Asta [AMP]

@[email protected] @[email protected] I was really surprised by how relatively simple the algorithm seemed. I haven’t tried to implement it, so maybe it’s harder than it seems, but I would also be surprised if there weren’t some existing implementation that they could at least test.

Asta [AMP]

@[email protected] @[email protected] oh no, for sure. I didn’t get that impression, nor was I trying to make fun of them either.

jonny (good kind)

wow that's so much responsibility, serious props

jonny (good kind)

@ireneista @aud confirmed mostly one person, dang: https://neuromatch.social/@jonny/113638937294602495
https://bsky.app/profile/why.bsky.team/post/3ld3uammbjs2j

Asta [AMP]

@[email protected] @[email protected] Jesus

If they’re hiring, can I apply <_<

jonny (good kind)

i guess my background assumption was that would be something you would want constant feedback on from everyone, and eventually if i was in charge of it my whole goal would be to make it not exist by giving the handles to everyone because that's too much responsibility for anyone to have lmao

Ulrike Hahn

@jonny understudied question: how perceptions of algorithmically determined feed quality change as a function of network size (by algorithm, including reverse chronological)

Cassandrich

@aud @jonny @ireneista If you can scrape enough and explicitly disregard junk, you can probably make a genuine search engine (not reskinned aggregator) at least as good as early Google (i.e. way better than late Google) this way.

Cassandrich

@aud @jonny @ireneista Proposed algorithm for first stage disregarding junk: statistical model for SEO spam (that doubles as model for AI spam since it was trained off SEO spam). 🤪

Irenes (many)

@dalias @aud @jonny mm. the models that detect spam are essentially the same models that generate spam, just used in a slightly different mode. who wins is a question of who has more data.

we do not have more data than google does, and the spammers have breached the walls of google's castle and are pouring into the keep.

Irenes (many)

@dalias @aud @jonny we do think there might be something to the idea of refusing to index (just flat out refusing, not down-ranking) any site that is seeking any sort of profit. google has a hard classification problem in part because there isn't a lot of difference between spam and things the company considers legitimate.

Cassandrich

@ireneista @aud @jonny You can also supplement by boosting sources linked from Wikipedia and penalizing ones edited out of Wikipedia as spam or illegitimate.

Google's quality plummeted the more they tried to bury Wikipedia.
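
(A minimal sketch of the boost/penalize idea above, not something anyone in the thread wrote. The function name, the two domain sets, and the `boost`/`penalty` multipliers are all hypothetical; how one would actually collect "linked from Wikipedia" and "edited out of Wikipedia as spam" is left open.)

```python
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lowercased host part of a URL."""
    return urlparse(url).netloc.lower()


def wikipedia_adjusted_score(
    base_score: float,
    url: str,
    cited_domains: set[str],    # assumed: domains currently linked from Wikipedia
    removed_domains: set[str],  # assumed: domains edited out as spam/illegitimate
    boost: float = 1.5,         # hypothetical multiplier, not a tuned value
    penalty: float = 0.25,      # hypothetical multiplier, not a tuned value
) -> float:
    """Scale a ranking score up or down based on Wikipedia's treatment of the source."""
    domain = domain_of(url)
    # Removal wins over citation: a domain in both sets is penalized.
    if domain in removed_domains:
        return base_score * penalty
    if domain in cited_domains:
        return base_score * boost
    return base_score
```

Checking the removed set first is a design choice in this sketch: a source that editors actively threw out gets penalized even if some other article still links to it.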

jonny (good kind)

@dalias @ireneista @aud i just gotta believe a global search index is more complicated than that

Asta [AMP]

@[email protected] @[email protected] @[email protected] (I definitely agree; I don’t know of a way to automate it or whether one even should, but)

Asta [AMP]

@[email protected] @[email protected] @[email protected] my personal feeling is “start small”. Do not try to encompass everything. The slop will come and go and including it just rewards it.

I guess I’m somewhat advocating for a sort of organically grown dataset.

Cassandrich

@aud @jonny @ireneista I don't think "seeking profit" is the right distinction. "Displaying any sort of deceptive ads" is closer.

This is similar to the whole fedi HOA thing of harassing anyone who's trying to sell their art because folks can't distinguish that from capitalism.

Cassandrich

@aud @jonny @ireneista Crawling rooted from Wikipedia really would not be that bad an idea.

jonny (good kind)

@dalias @aud @ireneista bootstrapping off anything is great but wikipedia links to like what percent of the web i wonder.