I know the Internet Archive has been under a ton of infrastructural pressure lately, but anyone have any idea about how long they might take to review an application and get back to you?

Irenes (many)

@aud yeah, aggregation thresholds are a good technique for that sort of thing

we do think you're perhaps still thinking far too much in the nature of search... at some point when we have spoons we can get into more detail (it's not like it's secret), but everything you're describing so far is something that existing web search platforms already do in some way, but they're still drowning in spam

Asta [AMP]

@[email protected] I'm definitely centering my thinking around AI slop and SEO spam, which are certainly... two of the major problems surrounding search at the moment.

It's pretty obvious that people at Google had solved this relevancy problem before they let the ad people take over. Ugh. So at the very least, it's been proven not to be intractable (except we also know that Google the Company itself abuses the analytics for surveillance, but ignoring that one).

But basically, any auditing framework and search algorithm that is implemented and deployed will ultimately be insufficient as eventually, a way to abuse it will be found.

That also makes me wonder about data snapshots: ie, effectively mirrors of the data that existed at certain points in time that no longer change, except to trim outdated results (or flag them as no longer existing). That way a 'known good' version still exists and can't be fucked with.
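
A rough sketch of that snapshot idea, just to make it concrete. Everything here is an assumption for illustration: the `Snapshot` class, `take_snapshot`, the choice of SHA-256 content hashes, and the example URL are all hypothetical, not part of any existing system. The snapshot is read-only, and the only change it permits is flagging an entry as no longer existing.

```python
import hashlib
from dataclasses import dataclass
from types import MappingProxyType


def digest(text: str) -> str:
    """Content hash used to notice later changes to a page (assumed: SHA-256)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical point-in-time mirror: URL -> content hash, plus URLs flagged as gone."""
    taken_at: str
    hashes: MappingProxyType  # read-only view, so entries can't be rewritten
    gone: frozenset = frozenset()

    def flag_gone(self, url: str) -> "Snapshot":
        # The only permitted change: mark an entry as no longer existing.
        # Returns a new Snapshot; the original is left untouched.
        if url not in self.hashes:
            raise KeyError(url)
        return Snapshot(self.taken_at, self.hashes, self.gone | {url})

    def matches(self, url: str, current_text: str) -> bool:
        # True only if the page is not flagged and still has the snapshotted content.
        return url not in self.gone and self.hashes.get(url) == digest(current_text)


def take_snapshot(taken_at: str, pages: dict) -> Snapshot:
    """Freeze a {url: text} mapping into a read-only snapshot."""
    return Snapshot(taken_at, MappingProxyType({u: digest(t) for u, t in pages.items()}))


snap = take_snapshot("2024-01-01", {"https://example.org/a": "hello"})
still_good = snap.matches("https://example.org/a", "hello")
```

Storing hashes rather than full page text keeps the sketch small; a real mirror would presumably keep the content too, and would need somewhere trustworthy to keep the snapshot itself.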

Irenes (many)

@aud for example, so-called "evil unicorns" are sites which purport to be the authoritative answer for something that wasn't a point of contention until something in the external world happened. when it somehow becomes newsworthy that Evil Billionaire Seven thinks cows eat cars, or whatever, the site that claims they do gets a lot of traffic.... because it was already the authoritative result for a question nobody was asking

Irenes (many)

@aud that's more of a disinformation example than a generative slop example; both are serious problems, and likely require different solutions

Irenes (many)

@aud but we don't think this is solvable, in the end. the thing search is trying to be, inherently flattens the social structures that are the only solution humanity has ever had to staying grounded in reality. the problem is not to simply invent some search quality weird trick that the big companies haven't, it's to come up with something other than search.

Asta [AMP]

@[email protected] Yeah, that doesn't surprise me. For better or for worse, I'm very very new to this space and approaching it from the opposite direction. So, I'm glad to hear others have thought about it more deeply, for sure...

I think, so long as there are mechanisms that can be deployed, I am definitely most... not interested, but maybe most capable? in how to scale it out on volunteer hardware in a reasonably secure way. It's just, if I can't visualize the problem space from one end to the other, I can't reliably assess whether my idea of what backend framework is necessary would be suitable or not.

Irenes (many)

@aud no, we don't agree with that. the world has changed in a way which makes the old solutions not work anymore. the reason that has happened is that generative models and "is this slop" models are the exact same thing, almost literally, and certainly in terms of what training data they need. whoever has more data wins... and the playing field has been leveled, so google no longer has an advantage on that front.

Asta [AMP]

@[email protected] Right. I think to a certain extent one should avoid indexing those in the first place. Or, at least, that's how I imagine it would work.

Of course, a person could also create a bunch of reasonable sounding personal blogs, get them into the index, then change the contents to be a bunch of evil unicorns after a period of time has passed and let it slip in that way. Sigh.

Irenes (many)

@aud yeah, you have to keep in mind that this is an adaptive system, we're not fighting against the wind or the rain, we're fighting against people who will find new forms of malicious behavior in response to any protection we come up with.

Irenes (many)

@aud that isn't to say we can't win, but for each proposed mitigation we need to do adversarial modeling: how would an attacker get around this? what would we do about that? how would they get around that? etc

Asta [AMP]

@[email protected] hmmmm. yes. I'm definitely keeping my thinking too bounded by that problem, probably because I remember how good search used to be.

I think you're right that I need to free myself of it.

Irenes (many)

@aud and mitigations that just become an endless chain of "we'll re-train the thing based on the feedback that we got it wrong at the previous step" are generally not the best idea because, like, there's no reason to think we can actually stay ahead in that arms race

Asta [AMP]

@[email protected] ah, yeah, sorry. I really meant "solved" for a particular time and place for a particular set of cases. In the sense that Yahoo, etc., had 'solved' it during their heyday, so to speak.

But perhaps that's still not correct, even with a rather loosey-goosey definition of what solving means.

Irenes (many)

@aud oh good, sorry to belabor the point!

Asta [AMP]

@[email protected] I guess the idea of "where is the information and how do we find it?" is unsolvable so long as time keeps flowing. There are just approaches that work better than others, until they don't.

Asta [AMP]

@[email protected] agreed.

(that's actually a big reason why we're having this discussion, in that I feel very free and un-judged to propose whatever comes to mind, whether I can see and work through the flaws or not. You're not only very insightful and helpful, but you're also very non-judgemental and safe to bounce even non-working ideas against. It's nice)

Asta [AMP]

@[email protected] Well, and then I think that comes down to a sort of fundamental issue with the approach, too, which harkens back to you saying that maybe search itself is dead. Constant 're-training' might stem the bleeding, but it also might mean your approach is just... not going to work and you're just throwing bandaids on top of it.

(see: LLMs and "fixes" for the most obvious example. But SEO spam is possibly a form of that as well)

Irenes (many)

@aud awww!!!! that's really sweet of you to say. we try really hard at that, but we still feel like we fail a lot

Asta [AMP]

@[email protected] No no, it's good. Even though I'm less devoted to the idea of 'search' than my language might indicate (I'm really just using it as a shorthand for 'how do we get information?'), it's still very much bound up in my thinking as to how information might be obtained. I mean, the idea of asking questions and receiving an answer and judging its suitability is... well, old. As you're aware. And search engines try to present themselves as basically that in a very literal form. So as the "current existing model", they definitely dominate my brain space.

Asta [AMP]

@[email protected] I know the feeling (as someone who tries for the same thing... most of the time. When someone is in good faith), but I can safely say I always feel that way about any discussion we've had!