Hang on, I think attaching semantics to schemas, rather than data, solves 100% of the problems with both semantics and schemas.
-
Jenniferplusplus replied to Jenniferplusplus last edited by
@tetron I want to give my json schema and human-readable documentation to the people who want that. And I want them to go off on their own and devise their own method to attach semantic meaning to things, one that doesn't burden me with solving this problem that I don't have and don't care about.
-
@jenniferplusplus
I'm not very familiar with the ActivityPub spec, but this is about AP, isn't it?
-
@tetron That is certainly the largest and most immediate contributor, yes.
But it's a concern almost any time a W3C standard or working group is involved with something that needs to operate at high QPS.
-
@jenniferplusplus
So the irony is that linked data semantic web stuff is totally designed for annotating external resources the way you want, but only if the resource itself has a linked data mapping (i.e. there's a way to refer to individual elements in the document), and schema documents written with json schema don't. Which is why the schemas need to be linked data themselves. Cue the endless screaming.
-
@tetron That's not really ironic, so much as tangential. I get the benefits in a reference context. But at best it's useless in a processing context. To the extent that it displaces techniques that enable processing, it's actually a detriment.
-
@jenniferplusplus
So I'm writing from the perspective of the particular thing I linked earlier, but I just want to mention a couple of things it has:
a) code generators for a bunch of languages, including C#, which use the schema to write the data structures and parsing/validation for you; this is very fast, and there's no lunacy like having to transit through an rdf triple store (a rough sketch of the output shape is below)
b) knowing which fields are identifiers or references to other things has some nice properties for validation
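As a concrete illustration of that output shape, here is a hedged sketch in Rust (used for consistency with the implementation discussed later in the thread); the type, fields, and function are invented for the example, not Schema Salad's actual generated code:

```rust
use serde::{Deserialize, Serialize};

// Illustrative only: the rough *shape* of what a schema-driven code
// generator emits -- typed structs plus a parse/validate entry point,
// with no RDF triple store anywhere in the hot path.
#[derive(Debug, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
struct Note {
    id: String,                    // the schema marks this as an identifier
    content: Option<String>,
    attributed_to: Option<String>, // the schema marks this as a reference
}

fn parse_note(raw: &str) -> Result<Note, serde_json::Error> {
    // Structural validation falls out of typed deserialization.
    serde_json::from_str(raw)
}

fn main() {
    let note = parse_note(r#"{"id": "https://example.com/notes/1", "content": "hi"}"#);
    println!("{:?}", note);
}
```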
-
@tetron That would be helpful if there was a defined schema, or if it was even possible to define a schema. But with activitypub, that's not actually possible.
-
@jenniferplusplus
So if we're talking about https://www.w3.org/TR/activitystreams-vocabulary/
there is a machine-readable formal model under there; it's just defined in OWL. I don't offhand know of tools that take in OWL and give you data models in more practical languages, but that doesn't mean they don't exist. For ActivityStreams specifically, it doesn't look like it would be all that hard.
At this rate I'm going to talk myself into writing a proof of concept, which is dangerous.
-
@tetron I'm pretty sure both the OWL and the context are broken and contradict the spec. The spec also mandates that fields with a single value must serialize as a value rather than an array, which creates an enormous number of problems.
I'm pretty sure that combines to make it impossible. But if you can somehow turn it into a proper schema, you'd be advancing fediverse development by years.
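To make that single-value rule concrete, here is a minimal sketch (recipient URLs are invented) of the two shapes the same logical field can arrive in, which every consumer then has to branch on:

```rust
use serde_json::json;

fn main() {
    // Per the single-value rule described above, one recipient MUST be
    // serialized as a bare value...
    let one = json!({ "to": "https://example.com/users/a" });

    // ...while two or more arrive as an array, so a consumer can never
    // assume the shape of the field ahead of time.
    let many = json!({ "to": ["https://example.com/users/a",
                              "https://example.com/users/b"] });

    println!("{one}\n{many}");
}
```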
-
@jenniferplusplus
this is incomplete and broken, but if it was more complete and less broken and could be used to generate C# code, would it be useful?
very incomplete conversion of activitystreams2.owl to schema salad representation - gist:2714b63983063548af4705a7cf9defa2 (gist.github.com)
-
@tetron Potentially. It at least gets to the next hurdle, which is that adding things to the @context is what passes for an extension mechanism in AP. And the state of those vocabularies is even worse. Several of the most common ones don't even exist.
-
Jenniferplusplus replied to Jenniferplusplus last edited by
@tetron But, let's assume this can turn into JSON schemas, accounting for extensions. And those schemas are suitable to do validation and code generation. Then it would become a primarily social problem of convincing maintainers to switch to using schema-defined implementations.
-
d@nny "disc@" mc²replied to Jenniferplusplus last edited by
@jenniferplusplus @tetron cc @aud who i think is working on something related but possibly a different angle
-
infinite love ⴳ replied to Jenniferplusplus last edited by
@jenniferplusplus @tetron this is ironically what json-ld contexts were *supposed* to do -- you can "upgrade" any arbitrary json into json-ld by providing your own context, even if it wasn't explicitly declared by the document producer. but this requires you to "guess" what the producer meant by any given term, instead of the producer telling you explicitly what they meant. and your "guess" might not match someone else's "guess".
anyway, i don't see why this can't be layered on top of a schema.
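A minimal sketch of that "upgrade" path; the guessed IRIs below are purely illustrative:

```rust
use serde_json::json;

fn main() {
    // A producer ships plain JSON with no context declared at all.
    let doc = json!({ "name": "A post", "summary": "hello" });

    // A consumer "upgrades" it to JSON-LD by supplying its own context.
    // These IRIs are this consumer's guess at what the producer meant;
    // another consumer might guess differently.
    let guessed_context = json!({
        "name": "https://www.w3.org/ns/activitystreams#name",
        "summary": "https://www.w3.org/ns/activitystreams#summary"
    });

    println!("{doc} interpreted under {guessed_context}");
}
```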
-
@jenniferplusplus @tetron it's just that usually, the semantic data nerds will insist that the semantics are required while the schema is optional. it feels like the counterargument here is that the schema should be ~required instead, while the semantics should be optional.
-
@jenniferplusplus @tetron or maybe in an ideal world you could package both together. this is something i've been trying out -- have the context document include not just an intended context mapping, but also schema/ontology information. see https://w3id.org/fep/1985.jsonld for example, which defines a context mapping for 4 terms, but then also separately contains a graph for those term definitions. for example, `orderType` will tell you its domain, range, min/max cardinality (i.e. required/functional).
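A hedged sketch of the *shape* being described here, not the literal contents of https://w3id.org/fep/1985.jsonld: one document carries both a @context term mapping and a graph of ontology-style definitions for the same terms.

```rust
use serde_json::json;

fn main() {
    // Illustrative values throughout; only the two-part structure
    // (context mapping + term-definition graph) is the point.
    let context_doc = json!({
        "@context": {
            "orderType": { "@id": "https://w3id.org/fep/1985/orderType", "@type": "@id" }
        },
        "@graph": [{
            "@id": "https://w3id.org/fep/1985/orderType",
            "domain": "OrderedCollection",
            "minCardinality": 0,
            "maxCardinality": 1   // i.e. optional, and functional (at most one value)
        }]
    });

    println!("{context_doc}");
}
```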
-
Jenniferplusplus replied to infinite love ⴳ last edited by
@trwnh @tetron meaning is necessary to do anything meaningful with a document, sure. But the meaning is implicit in the context. We're all out here building AP social networking services, passing each other social messages. We know what these things mean. But without a schema, doing that processing is slow, expensive, and error prone. We gain nothing by defining these messages semantically, and lose a lot from the lack of structure.
-
@jenniferplusplus @trwnh
So the problem the semantic stuff is trying to solve is how to have an extensible standard without causing chaos, e.g. if two different implementations decide to add a field called "grilledCheese" but actually each one uses it to mean different things with different structure. Then the semantic markup lets you tell them apart.
-
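To make that collision concrete, a small sketch with invented IRIs and payloads: both producers use the same term, but each context expands it to a distinct IRI, so a consumer can tell the two apart after expansion.

```rust
use serde_json::json;

fn main() {
    // Implementation A: "grilledCheese" is a plain string.
    let from_impl_a = json!({
        "@context": { "grilledCheese": "https://impl-a.example/ns#grilledCheese" },
        "grilledCheese": "american-on-sourdough"
    });

    // Implementation B: same term name, different meaning and structure.
    let from_impl_b = json!({
        "@context": { "grilledCheese": "https://impl-b.example/ns#grilledCheese" },
        "grilledCheese": { "cheese": "gruyere", "bread": "rye" }
    });

    println!("{from_impl_a}\n{from_impl_b}");
}
```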
@jenniferplusplus @trwnh
But your application probably only cares about or understands a subset of all the terms in use, and it makes sense to use a schema to rigorously validate the things you support and ignore the rest.
-
@[email protected] @[email protected] @[email protected] So, like, definitely I am actively working on doing an AP implementation in Rust, and the fact that the schema is so broad is definitely difficult.
I have some advantages in that I am implementing a specific case rather than thinking about the problem in broader, non-AP-specific terms. For instance: the additional schema definition added by @context is something I'm able to parse and deserialize into native types, to an extent, but it is not something I necessarily have to care about unless I choose to. If I don't know the data is there, I... don't really have to do anything about it if I don't want to.
As an example, Mastodon seems to add blur hashes and positional information to image objects. Misskey adds a _misskey_summary field to notes. These are defined in the @context section of the payload. In the implementation I'm working on, things that are part of the incoming payload but aren't part of the AP spec are left in an _extra HashMap that exists on the object (rather than a specific field, which I'm reserving for things that are defined in AP, such as id, name, etc.). The idea is that someone using these structures (myself or others) might care about that data and do something with it... assuming they know it's there and part of the payload, of course. But if you don't know if it's there, well... not much you can do with it at compile time, really.
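That catch-all approach maps naturally onto serde's flatten pattern; a minimal sketch (the field set is invented for illustration, not aud's actual type):

```rust
use serde::{Deserialize, Serialize};
use serde_json::Value;
use std::collections::HashMap;

// Spec-defined fields get real struct members; anything else (Mastodon's
// blur hashes, Misskey's _misskey_summary, ...) lands in a catch-all map
// that callers can dig into if they know what to look for.
#[derive(Debug, Serialize, Deserialize)]
struct APObject {
    id: Option<String>,
    name: Option<String>,
    #[serde(flatten)]
    _extra: HashMap<String, Value>,
}

fn main() {
    let raw = r#"{"id": "https://example.com/notes/1", "_misskey_summary": "cw"}"#;
    let obj: APObject = serde_json::from_str(raw).unwrap();
    println!("{:?}", obj._extra.get("_misskey_summary"));
}
```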
About the arrays and having things serialize as a single value if only one element of their structure is populated: I'm handling that via specific serialization functions. Basically, everything is deserialized into the AP type that is specified in the original schema, regardless of whether it arrives as a simple string or a more complex struct (links in particular often are just a simple string). At serialization time, I check how many of my fields are populated... and if it's only one, I spit it out like a string.
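One way such a serialization function can look, sketched with an invented Link type rather than aud's actual code:

```rust
use serde::ser::{Serialize, SerializeMap, Serializer};

// Invented for illustration: a Link with one required field and extras.
struct Link {
    href: String,
    name: Option<String>,
    media_type: Option<String>,
}

impl Serialize for Link {
    fn serialize<S: Serializer>(&self, s: S) -> Result<S::Ok, S::Error> {
        // If only href is populated, emit a bare string, matching how
        // links commonly appear in real payloads.
        if self.name.is_none() && self.media_type.is_none() {
            s.serialize_str(&self.href)
        } else {
            let mut map = s.serialize_map(None)?;
            map.serialize_entry("type", "Link")?;
            map.serialize_entry("href", &self.href)?;
            if let Some(name) = &self.name {
                map.serialize_entry("name", name)?;
            }
            if let Some(mt) = &self.media_type {
                map.serialize_entry("mediaType", mt)?;
            }
            map.end()
        }
    }
}

fn main() {
    let bare = Link { href: "https://example.com/x".into(), name: None, media_type: None };
    println!("{}", serde_json::to_string(&bare).unwrap()); // "https://example.com/x"
}
```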
Similar for arrays: if I am expecting a payload of a single item but receive instead an array of items, I deserialize the payload into the __array element (which is an array of APObjects) of my AP Object type. Basically, my AP object implementation can be both a real AP object or a simple container for an array of AP Objects.
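A hedged sketch of that container idea using serde's untagged-enum pattern (aud's real type folds the array into an __array field on the object itself, but the accept-either-shape effect is similar):

```rust
use serde::Deserialize;

// Invented minimal object type for the example.
#[derive(Debug, Deserialize)]
struct APObject {
    id: Option<String>,
    name: Option<String>,
}

// Accept either a single object or an array of them in the same slot.
#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum ObjectOrArray {
    One(APObject),
    Many(Vec<APObject>),
}

fn main() {
    let single: ObjectOrArray =
        serde_json::from_str(r#"{"id": "https://example.com/1"}"#).unwrap();
    let array: ObjectOrArray =
        serde_json::from_str(r#"[{"id": "https://example.com/1"}, {"id": "https://example.com/2"}]"#).unwrap();
    println!("{single:?}\n{array:?}");
}
```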
This came about because I noticed when working with real data that AP didn't talk much about arrays, but they're everywhere in real payloads. I think I need to generalize this functionality (currently it only works on Objects when it should work on anything and everything that inherits from it).
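That generalization could look like a generic one-or-many wrapper, sketched here under the same assumptions as the examples above:

```rust
use serde::{Deserialize, Serialize};

// Works for any T, not just Object: the same one-or-many handling for
// every field that can arrive in either shape.
#[derive(Debug, Serialize, Deserialize)]
#[serde(untagged)]
enum OneOrMany<T> {
    One(T),
    Many(Vec<T>),
}

impl<T> OneOrMany<T> {
    // Collapse a one-element array back to a bare value before
    // serializing, matching the single-value rule discussed earlier.
    fn normalized(self) -> Self {
        match self {
            OneOrMany::Many(mut v) if v.len() == 1 => OneOrMany::One(v.remove(0)),
            other => other,
        }
    }
}

fn main() {
    let to: OneOrMany<String> =
        serde_json::from_str(r#"["https://example.com/users/a"]"#).unwrap();
    // Re-serializes as a bare string after normalization.
    println!("{}", serde_json::to_string(&to.normalized()).unwrap());
}
```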
Basically: because I'm working on a specific example of this type of problem, I'm free to make decisions that wouldn't necessarily work for every type of problem... and also because I'm working with real data, I have to deviate (in a sense) from the written spec to handle real payloads. Thankfully, there's no shortage of data...