I know the Internet Archive has been under a ton of infrastructural pressure lately, but does anyone have any idea how long they might take to review an application and get back to you?
-
@[email protected] ugh
yeah, this is exactly the type of stuff that should be avoided, I think. -
@[email protected] I wonder if there's a way to semi-reliably transform data such that a set of results for a given query could be assessed without giving away too much information? I kind of doubt it, but I just wanna write down the thought experiment anyway. (I think it's often the 'metadata' about the search results that gives you context as to whether a result will be any good; for instance, searching for Baldur's Gate 3 quests might turn up articles that appear to be good, but if they're hosted on, say, I dunno, the Financial Times or ESPN or something wild...)
Say a user searches for query A, and results Z, Y, and X come back as the 'first' ones. Assume that, given the true search query and the true set of results, a human could reasonably be expected to exclude result X, and that we'd want to rank it lower / trim it. What information, and what transformations, would work?
Perhaps similarity between word meanings, taking into account all the data about the results? To go with the above example, say the user searched for "BG3 Astarion Quest": result Z is Astarion's page from a gaming wiki, result Y is from another gaming site about BG3 companion missions, and result X is from a URL full of gibberish and/or containing the words 'finance' or 'politics', but otherwise the text snippet seems to indicate it's a good match.
If you did something like word similarity, or clustering, and obtained a distance metric over all the words in the URL and the provided snippets relative to the query, you could then arguably transform it into another word-meaning vector space. So long as you retained the distances, the URL of result X could still pop up as an outlier and be marked as a candidate for trimming...
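To make the 'retain the distances' bit concrete, here's one possible (and probably leaky!) reading: the client embeds the query, snippets, and URL text locally, then applies a random orthogonal rotation before anything leaves the machine. Rotations preserve cosine similarity exactly, so the outlier check still works server-side without the server ever seeing raw text. Totally illustrative; the vectors below are random stand-ins for real embeddings:

```python
import numpy as np

def random_rotation(dim: int, rng: np.random.Generator) -> np.ndarray:
    # QR decomposition of a Gaussian matrix gives a random orthogonal matrix
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 384                        # typical sentence-embedding size
query = rng.normal(size=dim)     # stand-ins for real text embeddings
url_text = rng.normal(size=dim)

R = random_rotation(dim, rng)    # per-session secret held by the client

# the server only ever sees rotated vectors, but similarity is unchanged,
# so a result whose URL embedding sits far from everything else still
# shows up as an outlier
assert np.isclose(cosine(query, url_text), cosine(R @ query, R @ url_text))
```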
mmmmmm... lots of ways that can go wrong, though. I don't really know if there's any way to preserve anonymity and maintain legibility of the problem without just gatekeeping places with weird URLs.
so, yeah. I think that introduces exploitable elements and extra noise, and diverts from the purpose while obfuscating the underlying problem, which is trust. -
@[email protected] I keep falling back on the idea of anonymous accounts, where only a randomly assigned ID and the things the account specifically chooses to submit are tracked (for auditing purposes only) and weighted. In some contexts, that might be sufficient: if a certain account keeps ranking unrelated results highly (as a form of automated SEO, basically), its internal weighting can be lowered upon review. This starts to get into data analytics, unfortunately, and potential abuse; I wonder if there's a way to genuinely avoid that, though. You could also easily see which accounts are ranking things similarly, so you could potentially suss out whether there's a bot campaign or whatever.
But you need processing power and data storage for auditing. Perhaps there should be a "data jubilee" every X months: all data kept for auditing purposes survives only a certain amount of time before being 'baked in' and deleted?
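Purely as a sketch of the jubilee mechanics (the names, the structure, and the 90-day window are all made up):

```python
import time
from collections import defaultdict

JUBILEE_SECONDS = 90 * 24 * 3600     # made-up 90-day auditing window

audit_log = defaultdict(list)        # contributor_id -> [(timestamp, delta)]
baked_weight = defaultdict(float)    # contributor_id -> accumulated trust

def record(contributor_id: str, delta: float) -> None:
    # delta: +/- trust adjustment from reviewing this ID's submissions
    audit_log[contributor_id].append((time.time(), delta))

def jubilee(now: float | None = None) -> None:
    # fold expired audit records into a single baked-in weight, then
    # delete the raw records so they can't be correlated later
    now = now if now is not None else time.time()
    for cid, records in audit_log.items():
        keep, expire = [], []
        for ts, delta in records:
            (keep if now - ts <= JUBILEE_SECONDS else expire).append((ts, delta))
        baked_weight[cid] += sum(delta for _, delta in expire)
        audit_log[cid] = keep
```
-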
@aud the challenge with pseudonyms is that queries can be highly identifying. not all are, but they can be.
-
@[email protected] I guess human participation comes down, in a way, to "How can we allow people to contribute?" while avoiding:
1. Surveillance of participating parties (probably best to assume any data that is kept at any point lives forever, although this can certainly be mitigated through constant data clearing and re-issuance of new IDs)
2. Exploits that destroy the sanctity of the information.
As the point is to enable information access without allowing surveillance. -
@[email protected] Maybe the idea of issuing IDs then letting them live only for X amount of time isn't a bad one, actually. Because then even if a server doesn't 'respect' the demand to clear the data, the user won't be submitting under their old ID anymore.
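Rough sketch of the rotation, assuming the client holds a secret and derives a fresh ID per epoch (everything here is illustrative, including the monthly window):

```python
import hmac
import time

EPOCH_SECONDS = 30 * 24 * 3600   # made-up monthly rotation

def current_id(client_secret: bytes, now: float | None = None) -> str:
    # derive the submission ID from the secret and the current epoch;
    # once the epoch rolls over, old submissions can't be linked to new
    # ones without the client's secret, even by an uncooperative server
    epoch = int((now if now is not None else time.time()) // EPOCH_SECONDS)
    return hmac.new(client_secret, epoch.to_bytes(8, "big"), "sha256").hexdigest()[:16]
```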
-
@[email protected] RIGHT, exactly. Shit, anything can be. The unfortunate reality is that if a piece of data is correlated with another one, the relationship exists both ways. This is why the best thing to do is not to collect any data that isn't necessary.
Part of the issue is that under a 'hostile regime', keeping the set of contributors small makes it easier to identify them, as well. So there need to be enough contributors that no one piece of data can reliably be pinned to a contributor, but not so many that you simply can't trust anyone... blehhh
(I know you were talking about queries, but I think this exists at basically every point of the system) -
@[email protected] Perhaps anonymous data about results can be gathered: assume that, all of a sudden, a particular result has spiked in popularity (without collecting the query itself); flag it for manual review. That way you don't know how it came up, and if the site has no real relevance (say it's an obvious "AI"-generated result or a website for a sales service or something), a set of individuals can then decide whether the spike is likely to be natural or the result of an influence campaign.
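As a sketch, with invented thresholds, where only per-URL daily counts exist and no queries are ever stored:

```python
from statistics import mean

K_ANON_FLOOR = 50     # never surface counts small enough to identify anyone
SPIKE_FACTOR = 5.0    # "spiked" = today is 5x the trailing average

def flag_spikes(history: dict[str, list[int]]) -> list[str]:
    # history: url -> daily hit counts, oldest first, today last;
    # queries are never stored, only aggregate counts per result
    flagged = []
    for url, counts in history.items():
        if len(counts) < 2:
            continue
        today, baseline = counts[-1], mean(counts[:-1])
        if today >= K_ANON_FLOOR and today > SPIKE_FACTOR * max(baseline, 1.0):
            flagged.append(url)
    return flagged
```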
-
@aud yeah, aggregation thresholds are a good technique for that sort of thing
we do think you're perhaps still thinking far too much in terms of search... at some point when we have spoons we can get into more detail (it's not like it's secret), but everything you're describing so far is something existing web search platforms already do in some form, and they're still drowning in spam
-
@[email protected] I'm definitely centering my thinking around AI slop and SEO spam, which are certainly... two of the major problems surrounding search at the moment.
It's pretty obvious that people at Google had solved this relevancy problem before they let the ad people take over. Ugh. So at the very least, it's been proven not to be intractable (except we also know that Google the company itself abuses the analytics for surveillance, but let's ignore that one).
But basically, any auditing framework and search algorithm that gets implemented and deployed will ultimately be insufficient, as eventually a way to abuse it will be found.
That also makes me wonder about data snapshots: i.e., effectively mirrors of the data as it existed at certain points in time, which no longer change except to trim outdated results (or flag them as no longer existing). That way a 'known good' version still exists and can't be fucked with.
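Sketch of what I mean, where fetch() stands in for re-crawling a URL:

```python
import hashlib
from typing import Callable, Optional

def content_hash(body: bytes) -> str:
    return hashlib.sha256(body).hexdigest()

def take_snapshot(crawl: dict[str, bytes]) -> dict[str, str]:
    # freeze the index as url -> content hash at a point in time
    return {url: content_hash(body) for url, body in crawl.items()}

def trim_snapshot(snapshot: dict[str, str],
                  fetch: Callable[[str], Optional[bytes]]) -> dict[str, str]:
    # later passes may only REMOVE entries (dead pages, or pages whose
    # content no longer matches the frozen hash); nothing is ever added
    # or rewritten, so the known-good version can't be tampered with
    kept = {}
    for url, digest in snapshot.items():
        body = fetch(url)
        if body is not None and content_hash(body) == digest:
            kept[url] = digest
    return kept
```
-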
@aud for example, so-called "evil unicorns" are sites that purport to be the authoritative answer to something that wasn't a point of contention until something in the external world happened. when it somehow becomes newsworthy that Evil Billionaire Seven thinks cows eat cars, or whatever, the site that claims they do gets a lot of traffic... because it was already the authoritative result for a question nobody was asking
-
@aud that's more of a disinformation example than a generative slop example; both are serious problems, and likely require different solutions
-
@aud but we don't think this is solvable, in the end. the thing search is trying to be inherently flattens the social structures that are the only solution humanity has ever had for staying grounded in reality. the problem is not to simply invent some weird search-quality trick the big companies haven't found, it's to come up with something other than search.
-
@[email protected] Yeah, that doesn't surprise me. For better or for worse, I'm very very new to this space and approaching it from the opposite direction. So, I'm glad to hear others have thought about it more deeply, for sure...
I think, so long as there are mechanisms that can be deployed, I am definitely most... not interested, but maybe most capable? in figuring out how to scale it out on volunteer hardware in a reasonably secure way. It's just that if I can't visualize the problem space from one end to the other, I can't reliably assess whether my idea of the necessary backend framework would be suitable or not. -
@aud no, we don't agree with that. the world has changed in a way which makes the old solutions not work anymore. the reason that has happened is that generative models and "is this slop" models are the exact same thing, almost literally, and certainly in terms of what training data they need. whoever has more data wins... and the playing field has been leveled, so google no longer has an advantage on that front.
-
@[email protected] Right. I think to a certain extent one should avoid indexing those in the first place. Or, at least, that's how I imagine it would work.
Of course, a person could also create a bunch of reasonable sounding personal blogs, get them into the index, then change the contents to be a bunch of evil unicorns after a period of time has passed and let it slip in that way. Sigh. -
@aud yeah, you have to keep in mind that this is an adaptive system; we're not fighting against the wind or the rain, we're fighting against people who will find new forms of malicious behavior in response to any protection we come up with.
-
@aud that isn't to say we can't win, but for each proposed mitigation we need to do adversarial modeling: how would an attacker get around this? what would we do about that? how would they get around that? etc
-
@[email protected] hmmmm. yes. I'm definitely keeping my thinking too bounded by that problem, probably because I remember how good search used to be.
I think you're right that I need to free myself of it. -
@aud and mitigations that just become an endless chain of "we'll re-train the thing based on the feedback that we got it wrong at the previous step" are generally not the best idea because, like, there's no reason to think we can actually stay ahead in that arms race