Hang on, I think attaching semantics to schemas, rather than data, solves 100% of the problems with both semantics and schemas.
-
@tetron That is certainly the largest and most immediate contributor, yes.
But it's a concern almost any time that almost any W3C standard or working group is involved with something that needs to operate at high QPS.
-
@jenniferplusplus
So the irony is that linked data semantic web stuff is totally designed for annotating external resources the way you want, but only if the resource itself has a linked data mapping (i.e. there's a way to refer to individual elements in the document), and schema documents written with json schema don't. Which is why the schemas need to be linked data themselves. Cue the endless screaming.
-
@tetron That's not really ironic, so much as tangential. I get the benefits in a reference context. But at best it's useless in a processing context. To the extent that it displaces techniques that enable processing, it's actually a detriment.
-
@jenniferplusplus
So I'm writing from the perspective of the particular thing I linked earlier but I just want to mention a couple of things it has:
a) code generators for a bunch of languages including C#, which use the schema to write the data structures and parsing/validation for you, which is very fast and there's no lunacy like having to transit through an rdf triple store
b) knowing which fields are identifiers or references to other things has some nice properties for validation
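For a rough sense of what that looks like, here is an illustration in Rust rather than C# (it is not schema-salad's actual output; the record and its fields are invented): a schema-driven generator can emit both the data structure and the identifier/reference checks mentioned in (b).

```rust
// Illustrative only: roughly the kind of typed loader a schema-driven code
// generator could emit for a hypothetical `Person` record.
use std::collections::HashSet;

use serde::Deserialize;

#[derive(Debug, Deserialize)]
#[serde(deny_unknown_fields)]
struct Person {
    id: String,           // declared as an identifier field in the schema
    name: String,
    follows: Vec<String>, // declared as references to other Person ids
}

// Generated validation: identifier fields must be non-empty, and reference
// fields can be resolved against the set of known identifiers.
fn validate(p: &Person, known_ids: &HashSet<String>) -> Result<(), String> {
    if p.id.is_empty() {
        return Err("id must be a non-empty IRI".into());
    }
    for r in &p.follows {
        if !known_ids.contains(r) {
            return Err(format!("unresolved reference: {r}"));
        }
    }
    Ok(())
}
```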
-
@tetron That would be helpful if there was a defined schema, or if it was even possible to define a schema. But with activitypub, that's not actually possible.
-
@jenniferplusplus
So if we're talking about https://www.w3.org/TR/activitystreams-vocabulary/
there is a machine readable formal model under there, it is just defined in OWL. I don't offhand know of tools that take in OWL and give you data models in more practical languages but that doesn't mean they don't exist. For ActivityStreams specifically it doesn't look like it would be all that hard.
At this rate I'm going to talk myself into writing a proof of concept, which is dangerous.
-
@tetron I'm pretty sure both the owl and context are broken and contradict the spec. The spec also mandates that fields with a single value must serialize as a value rather than an array, which creates an enormous number of problems.
I'm pretty sure that combines to make it impossible. But if you can somehow turn it into a proper schema, you'd be advancing fediverse development by years.
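To illustrate the shape problem (the property and types below are invented, not quoted from the spec): a typed consumer can't assume either form, so it ends up wrapping everything in a one-or-many type.

```rust
// Minimal sketch: the same property may arrive as a bare value or as an
// array, so the type has to accept both shapes.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum OneOrMany<T> {
    One(T),
    Many(Vec<T>),
}

#[derive(Debug, Deserialize)]
struct Note {
    // one recipient serializes as a string, several as an array of strings
    to: OneOrMany<String>,
}

fn main() {
    let single: Note = serde_json::from_str(r#"{"to": "https://example.com/a"}"#).unwrap();
    let many: Note =
        serde_json::from_str(r#"{"to": ["https://example.com/a", "https://example.com/b"]}"#).unwrap();
    println!("{single:?} / {many:?}");
}
```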
-
@jenniferplusplus
this is incomplete and broken but if it was more complete and less broken and could be used to generate C# code, would it be useful?
very incomplete conversion of activitystreams2.owl to schema salad representation - gist:2714b63983063548af4705a7cf9defa2
-
@tetron Potentially. It at least gets to the next hurdle, which is that adding things to the @/context is what passes for an extension mechanism in AP. And the state of those vocabularies is even worse. Several of the most common ones don't even exist.
-
Jenniferplusplus replied to Jenniferplusplus
@tetron But, let's assume this can turn into JSON schemas, accounting for extensions. And those schemas are suitable to do validation and code generation. Then it would become a primarily social problem of convincing maintainers to switch to using schema-defined implementations.
-
d@nny "disc@" mc² replied to Jenniferplusplus
@jenniferplusplus @tetron cc @aud who i think is working on something related but possibly a different angle
-
infinite love ⴳ replied to Jenniferplusplus
@jenniferplusplus @tetron this is ironically what json-ld contexts were *supposed* to do -- you can "upgrade" any arbitrary json into json-ld by providing your own context, even if it wasn't explicitly declared by the document producer. but this requires you to "guess" what the producer meant by any given term, instead of the producer telling you explicitly what they meant. and your "guess" might not match someone else's "guess".
anyway, i don't see why this can't be layered on top of a schema.
-
@jenniferplusplus @tetron it's just that usually, the semantic data nerds will insist that the semantics are required while the schema is optional. it feels like the counterargument here is that the schema should be ~required instead, while the semantics should be optional.
-
@jenniferplusplus @tetron or maybe in an ideal world you could package both together. this is something i've been trying out -- have the context document include not just an intended context mapping, but also schema/ontology information. see https://w3id.org/fep/1985.jsonld for example, which defines a context mapping for 4 terms, but then also separately contains a graph for those term definitions. for example, `orderType` will tell you its domain, range, min/max cardinality (i.e. required/functional).
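As a speculative sketch of how a consumer could act on that kind of term metadata (the struct and field names below are assumptions for illustration, not taken from FEP-1985):

```rust
// Sketch: validate a property's cardinality against per-term schema data
// that a context/ontology document might carry.
use serde_json::Value;

struct TermDefinition {
    iri: String,              // the expanded IRI the term maps to
    min_count: usize,         // 0 means the property is optional
    max_count: Option<usize>, // None = unbounded; Some(1) = functional
}

fn check_cardinality(term: &TermDefinition, value: Option<&Value>) -> Result<(), String> {
    // count how many values are present for this property
    let count = match value {
        None | Some(Value::Null) => 0,
        Some(Value::Array(items)) => items.len(),
        Some(_) => 1,
    };
    if count < term.min_count {
        return Err(format!("{} requires at least {} value(s)", term.iri, term.min_count));
    }
    if let Some(max) = term.max_count {
        if count > max {
            return Err(format!("{} allows at most {} value(s)", term.iri, max));
        }
    }
    Ok(())
}
```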
-
Jenniferplusplus replied to infinite love ⴳ
@trwnh @tetron meaning is necessary to do anything meaningful with a document, sure. But the meaning is implicit in the context. We're all out here building AP social networking services, passing each other social messages. We know what these things mean. But without a schema, doing that processing is slow, expensive, and error prone. We gain nothing by defining these messages semantically, and lose a lot from the lack of structure.
-
@jenniferplusplus @trwnh
So the problem the semantic stuff is trying to solve is how to have an extensible standard without causing chaos, e.g. if two different implementations decide to add a field called "grilledCheese" but actually each one uses it to mean different things with different structure. Then the semantic markup lets you tell them apart.
-
@jenniferplusplus @trwnh
But your application probably only cares about or understands a subset of all the terms in use and it makes sense to use a schema to rigorously validate the things you support and ignore the rest.
-
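A minimal sketch of that approach (type and field names invented): model only the terms you support and let the parser drop the rest, which serde does by default for unknown fields.

```rust
// Sketch: a consumer that only understands a handful of Note fields.
// Anything else in the payload (grilledCheese included) is silently ignored.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct SupportedNote {
    id: String,
    #[serde(rename = "type")]
    kind: String,
    content: Option<String>,
    to: Option<Vec<String>>,
}

fn main() {
    let json = r#"{"id": "https://example.com/notes/1", "type": "Note",
                   "content": "hi", "grilledCheese": {"cheddar": true}}"#;
    let note: SupportedNote = serde_json::from_str(json).unwrap();
    println!("{note:?}");
}
```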
@[email protected] @[email protected] @[email protected] So, like, definitely I am actively working on doing an AP implementation in Rust, and the fact that the schema is so broad is definitely difficult.
I have some advantages in that I am implementing a specific case rather than thinking about the problem in broader, non-AP specific terms. For instance: the additional schema definition added by `@context` is something I'm able to parse and deserialize into native types, to an extent, but it is not something I necessarily have to care about unless I choose to. If I don't know the data is there, I... don't really have to do anything about it if I don't want to.
As an example, Mastodon seems to add blur hashes and positional information to image objects. Misskey adds a `_misskey_summary` field to notes. These are defined in the `@context` section of the payload. In the implementation I'm working on, things that are part of the incoming payload but aren't part of the AP spec are left in an `_extra` HashMap that exists on the object (rather than a specific field, which I'm reserving for things that are defined in AP, such as `id`, `name`, etc). The idea is that someone using these structures (myself or others) might care about that data and do something with it... assuming they know it's there and part of the payload, of course. But if you don't know if it's there, well... not much you can do with it at compile time, really.
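A simplified sketch of that catch-all pattern (not the actual implementation): known AP fields get typed members, and serde's `flatten` sweeps everything else into `_extra` so extension data survives a round-trip.

```rust
// Sketch of the "_extra" catch-all: unknown producer-specific fields
// (e.g. Mastodon's blurhash on image attachments) are kept instead of dropped.
use std::collections::HashMap;

use serde::{Deserialize, Serialize};
use serde_json::Value;

#[derive(Debug, Serialize, Deserialize)]
struct ApImage {
    #[serde(rename = "type")]
    kind: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    url: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    name: Option<String>,
    // everything not modelled above lands here instead of being lost
    #[serde(flatten)]
    _extra: HashMap<String, Value>,
}
```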
About the arrays, and having things be just a single value if only one element of their structure is populated: I'm handling that via specific serialization functions. Basically, everything is deserialized into the AP type that is specified in the original schema regardless of whether they're a simple string or a more complex struct (links in particular often are just a simple string). At serialization time, I check how many of my fields are populated... and if it's only one, I spit it out as a string.
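Roughly the shape of that serialization trick (field names invented): the full struct is always used internally, but it is written back out as a bare string when nothing beyond the href is populated.

```rust
// Rough sketch: a Link always deserializes into the full struct, but is
// serialized as a plain string when only the href is set.
use serde::ser::{Serialize, SerializeMap, Serializer};

struct Link {
    href: String,
    name: Option<String>,
    media_type: Option<String>,
}

impl Serialize for Link {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        if self.name.is_none() && self.media_type.is_none() {
            // only the href is populated: emit a bare JSON string
            serializer.serialize_str(&self.href)
        } else {
            // otherwise emit the full object form
            let mut map = serializer.serialize_map(None)?;
            map.serialize_entry("href", &self.href)?;
            if let Some(name) = &self.name {
                map.serialize_entry("name", name)?;
            }
            if let Some(media_type) = &self.media_type {
                map.serialize_entry("mediaType", media_type)?;
            }
            map.end()
        }
    }
}
```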
Similar for arrays: if I am expecting a payload of a single item but receive instead an array of items, I deserialize the payload into the `__array` element (which is an array of `APObject`s) of my AP Object type. Basically, my AP object implementation can be both a real AP object or a simple container for an array of AP Objects.
This came about because I noticed when working with real data that AP didn't talk much about arrays, but they're everywhere in real payloads. I think I need to generalize this functionality (currently it only works on `Object`s when it should work on anything and everything that inherits from it).
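A simplified sketch of that container idea (not the actual code): when the top-level payload turns out to be an array, it gets folded into a single object whose `__array` field holds the items.

```rust
// Sketch: accept either a single object or a bare array of objects and
// return one container type either way.
use serde::Deserialize;
use serde_json::Value;

#[derive(Debug, Default, Deserialize)]
struct ApObject {
    #[serde(rename = "type")]
    kind: Option<String>,
    id: Option<String>,
    #[serde(skip)]
    __array: Option<Vec<ApObject>>, // populated when the payload was an array
}

fn from_value(v: Value) -> serde_json::Result<ApObject> {
    match v {
        // a bare array becomes a container object whose __array holds the items
        Value::Array(items) => {
            let parsed = items
                .into_iter()
                .map(serde_json::from_value)
                .collect::<Result<Vec<ApObject>, _>>()?;
            Ok(ApObject { __array: Some(parsed), ..Default::default() })
        }
        // anything else is treated as a single object
        other => serde_json::from_value(other),
    }
}
```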
Basically: because I'm working on a specific example of this type of problem, I'm free to make decisions that wouldn't necessarily work for every type of problem... and also because I'm working with real data, I have to deviate (in a sense) from the written spec to handle real payloads. Thankfully, there's no shortage of data...
-
@tetron @jenniferplusplus this assumes you are working purely in one problem space and never cross any boundaries. for example, if your schema is roughly "activitystreams plus some extensions", then you won't know what to do with something that isn't as2. here, the mime type is doing a lot of the semantic work for you. if you want to ensure that certain extensions are understood, you end up basically needing to define a new mime type. but the problem is you can embed documents in other documents
-
@tetron @jenniferplusplus so the mime type actually changes for only *part of the document* instead of the entire document. i think this is something a lot of people are not prepared to encounter, and generally don't know how to deal with, except by making assumptions based on popular usage. for example, the `publicKey` property is not part of as2. it's from the old deprecated security/v1. if doing ld, you expect some CryptographicKey object(s) inside it, but a "plain json" might use a string!
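A small sketch of that `publicKey` ambiguity (the fields follow common security/v1 usage, but treat the names as illustrative): untagged deserialization accepts either a bare IRI string or an embedded key object.

```rust
// Sketch: `publicKey` may be an embedded key object (security/v1 style) or
// just an IRI string in a "plain json" document.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct CryptographicKey {
    id: String,
    owner: Option<String>,
    #[serde(rename = "publicKeyPem")]
    public_key_pem: Option<String>,
}

#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum PublicKey {
    Reference(String),          // e.g. "https://example.com/users/a#main-key"
    Embedded(CryptographicKey), // full key object
}
```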