I know the Internet Archive has been under a ton of infrastructural pressure lately, but anyone have any idea about how long they might take to review an application and get back to you?
-
@aud that makes a ton of sense to us. we're very glad to be on the same page, as well!
-
@aud a thing we've asked ourselves recently is whether the world has changed enough that something like the original Yahoo, which was a list put together entirely by humans, makes sense again
it originally stopped making sense around the time the web got bigger than 30,000 pages (which also messed up altavista, in a totally different way)
-
@aud the web is, of course, still much larger than that today, even if we exclude corporate sites
-
@[email protected] so maybe I’ve been saying it incorrectly; what I think is important is not the search algorithm itself, but the specific set of webpages that it’s searching through. Any algorithm I could make could be gamed to return click bait genAI slop, but it’s much harder to ruin a curated set.
-
@[email protected] I’ve been asking exactly the same question.
-
@aud the first thing we see tugging against that is that, with any sort of crowdsourcing, there are a variety of reasons people choose to obscure things rather than clarify them. we do think that this stuff has been around long enough that those motivations can be taxonomized and reasoned about...
-
@aud on quora we've seen people do ideologically motivated hostile taxonomy, for example to suppress write-ups surrounding the ethics of abortion
-
@aud on stack overflow we've seen people use the site's various unlockable moderation powers as tools for building their personal fame and "winning" petty grievances
-
@aud on wikipedia we've read about various governing bodies being subverted in the usual ways that democratic structures get subverted, mostly for ideological reasons. we've also seen the higher-level bodies refuse to take corrective action, which we attribute to the usual thing about any power structure's highest incentive being self-preservation - conflict is dangerous
-
@aud we don't yet have a positive recommendation about any of this
(we'll make a negative one: avoid any sort of programmatically-granted authority that feels like a game mechanic, or else people will treat interpersonal conflict as a videogame)
-
@aud we also think the presentation of information, the affordances it enables and the experience it gives, is super important to the success of anything like this.
-
@[email protected] it's like you are in my brain
-
Asta [AMP] replied to Irenes (many), last edited by [email protected]
@[email protected] competitive gamification is so, so, so antithetical to this, I feel.
This did start making me think about gamification elements à la, say, how the protein folding community does it, and whether there are any parallels or lessons there. Borderlands 3 had a freaking community protein-building game; I wonder if there's a review of how that worked, somewhere? Engaging the mind is necessary, but we want people to work together, not against one another.
(no coffee in me yet but if we think of curation as a form of trimming a graph or tree, what sorts of things can we do from there? Randomly pull former searches and results and have people make cuts? etc)
-
@[email protected] ugh
yeah, this is exactly the type of stuff that should be avoided, I think.
-
@[email protected] I wonder if there's a way to semi-reliably transform data such that a set of results for a given query could be assessed without giving away too much information? I kind of doubt it but I just wanna write down the thought experiment anyway (I think it's often the 'metadata' about the search results that gives you context as to whether the result will be any good; for instance, searching for quests about Baldur's Gate 3 might give you articles that appear to be good, but if they're hosted on, say, I dunno, the financial times or ESPN or something wild...)
Say a user searches for query A, and results Z, Y, and X are given as the 'first' ones. Assume that given the true search query and the true set of results, a human could reasonably expect to exclude result X and we'd want them to rank it lower / trim it. What information, and what transformations, would work?
Perhaps word similarity in meanings, including all the data about the results? To go with the above example, say the user searched for "BG3 Astarion Quest" and result Z is Astarion's page from a gaming wiki, result Y is from another gaming site about BG3 companion missions, and result X is from a url with gibberish and/or the words 'finance' or 'politics' in it, but otherwise the text snippet seems to indicate it's a good match.
If you did something like a... word similarity, or clustering, and obtained a distance metric for all the words in the URL and provided snippets related to the query, you could then arguably transform it to another word meaning vector space. So long as you retained the distances, the URL of result X could still pop up as an outlier and be marked as a potential candidate for trimming...
mmmmmm... lots of ways that can go wrong, though. I don't really know if there's any way to preserve anonymity and maintain legibility of the problem or not just gatekeep places with weird urls.
so, yeah. I think that introduces exploitable elements, extra noise, and diverts from the purpose while obfuscating the underlying problem that is trust.
-
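A minimal sketch of the distance-preserving transform idea from the posts above. Everything here is an assumption for illustration: the toy vectors stand in for real text embeddings of each result's URL + snippet, and a random orthogonal rotation is just one example of a transform that exactly preserves pairwise distances, so the finance-looking result X stays an outlier even after the original word axes are obscured.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for text embeddings of each result's URL + snippet
# (a real system would use an embedding model; these are hand-picked
# so that Z and Y cluster together and X sits apart).
embeddings = {
    "Z": np.array([0.9, 0.1, 0.0]),  # gaming wiki page for Astarion
    "Y": np.array([0.8, 0.2, 0.1]),  # gaming site, BG3 companion missions
    "X": np.array([0.0, 0.1, 0.9]),  # gibberish/finance-looking URL
}

# One distance-preserving transform: a random orthogonal rotation Q.
# ||Qa - Qb|| == ||a - b||, so the pairwise structure survives even
# though the rotated vectors no longer map to recognizable word axes.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
transformed = {k: Q @ v for k, v in embeddings.items()}

def mean_distance(vecs, key):
    """Average distance from one result's vector to all the others."""
    return np.mean([np.linalg.norm(vecs[key] - vecs[other])
                    for other in vecs if other != key])

# X is the biggest outlier in both spaces, i.e. it is still a
# candidate for trimming after the transform.
for space in (embeddings, transformed):
    outlier = max(space, key=lambda k: mean_distance(space, k))
    assert outlier == "X"
```

As the follow-up post notes, this only hides the words, not the structure; it doesn't by itself solve the trust problem.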
@[email protected] I keep falling back on the idea of anonymous accounts in which only a randomly assigned ID and the things they specifically want to submit are tracked (and only for auditing purposes) and weighted. In some contexts, that might be sufficient: if a certain account keeps ranking unrelated results highly (as a form of automated SEO, basically) their internal ranking can be lowered upon review. This starts to get into data analytics, unfortunately, and potential abuse. I wonder if there's a way to genuinely avoid that, though. In this case, you could easily see which accounts are ranking things similarly, as well, so you could potentially suss out if there's a bot campaign or whatever.
But you need processing power and data storage for auditing. Perhaps there should be a "data jubilee" every X months? All data for auditing purposes is only kept for a certain amount of time before being 'baked in' and deleted?
-
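A combined sketch of the trust-weighted auditing and "data jubilee" ideas in the post above. All names and numbers (the halving penalty, the ~3-month window) are made up for illustration: flagged reviewers lose weight in the aggregate ranking, and the jubilee deletes raw audit events while the baked-in trust weight survives.

```python
import time
from collections import defaultdict

JUBILEE_SECONDS = 90 * 24 * 3600   # hypothetical ~3-month audit window

trust = defaultdict(lambda: 1.0)   # per-anonymous-ID weight
audit_log = []                     # (timestamp, reviewer_id, flagged) events

def record_audit(reviewer_id, flagged, now, penalty=0.5, floor=0.05):
    """Log an audit outcome; halve the reviewer's weight if their
    ranking was flagged as unrelated (SEO-style) upon review."""
    audit_log.append((now, reviewer_id, flagged))
    if flagged:
        trust[reviewer_id] = max(floor, trust[reviewer_id] * penalty)

def aggregate_score(votes):
    """Trust-weighted average of per-reviewer scores for one result."""
    total = sum(trust[r] for r, _ in votes)
    return sum(trust[r] * s for r, s in votes) / total

def jubilee(now):
    """The 'data jubilee': drop raw audit events older than the window.
    The baked-in effect (the current trust weight) is kept, but the
    per-event detail that could deanonymize someone is deleted."""
    global audit_log
    audit_log = [(ts, r, f) for ts, r, f in audit_log
                 if now - ts < JUBILEE_SECONDS]

# A reviewer caught promoting unrelated results three times:
for i in range(3):
    record_audit("anon-42", flagged=True, now=float(i))

# Their high score for a spammy result is now largely discounted...
votes = [("anon-42", 1.0), ("anon-7", 0.1), ("anon-9", 0.2)]
assert abs(aggregate_score(votes) - 0.2) < 1e-9

# ...and after a jubilee the raw events are gone, the weight retained.
jubilee(now=JUBILEE_SECONDS + 10.0)
assert audit_log == [] and trust["anon-42"] == 0.125
```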
@aud the challenge with pseudonyms is that queries can be highly identifying. not all are, but they can be.
-
@[email protected] I guess human participation comes down to, in a way, to "How can we allow people to contribute?" that avoids:
1. Surveillance of participating parties (probably best to assume any data that is kept at any point lives forever, although this can certainly be mitigated through constant data clearing and re-issuance of new IDs)
2. Exploits to destroy the sanctity of the information.
As the point is to enable information access without allowing surveillance.
-
@[email protected] Maybe the idea of issuing IDs then letting them live only for X amount of time isn't a bad one, actually. Because then even if a server doesn't 'respect' the demand to clear the data, the user won't be submitting under their old ID anymore.
-
@[email protected] RIGHT, exactly. Shit, anything is. The unfortunate reality is that if a piece of data is correlated to another one, the relationship exists both ways. This is why the best thing to do is not to collect any data that isn't necessary.
Part of the issue is that under a 'hostile regime', keeping the set of contributors small makes it easier to identify them, as well. So there need to be enough contributors that no one piece of data can reliably be pinpointed to a contributor, but not so many that you simply can't trust anyone... blehhh
(I know you were talking about queries but I think the problem exists at basically every point of the problem)
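One way to phrase that "enough contributors" condition is a k-anonymity-style release rule, sketched here with an arbitrary threshold of k=3: per-query data is only released once enough distinct contributors have submitted for it.

```python
from collections import defaultdict

K = 3  # hypothetical anonymity threshold, arbitrary for this sketch

def releasable_queries(submissions, k=K):
    """Only release aggregate data for a query once at least k distinct
    contributors have submitted for it, so no single submission can be
    reliably pinned to one person."""
    contributors = defaultdict(set)
    for contributor_id, query in submissions:
        contributors[query].add(contributor_id)
    return {q for q, ids in contributors.items() if len(ids) >= k}

subs = [("a", "bg3 astarion"), ("b", "bg3 astarion"), ("c", "bg3 astarion"),
        ("a", "rare medical query")]
assert releasable_queries(subs) == {"bg3 astarion"}
```

The rare, highly identifying query stays withheld, which matches the concern above that queries themselves can deanonymize.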