An idea of new plugin html-preparser

qgp9

Hi, I'm back with a new idea again

After I made MagicBlock and got an issue of insecure images, I thought to make a plugin ( tentatively named long nodebb-plugin-secure-image-advance ) which include white list ( managed by human in first phase ) checking of images.

But immediately I realized that it needs to handle <code> blocks again and this could be same for many other plugin which parse contents and filter it.

So I bring new idea of pre html parsing plugin. This should be placed after a composer related plugin ( markdown ), plugin is better then a core's. And it can fire own filter hooks, and other plugin can use it.

I also have thought some specs. Basic list of preparsing will be

Replace <code>..</code> block with an unique string , and fire filter hooks. After all hooos, recover code blocks. I do same in Magicblock.
Fire filter hooks with list of image tags.
Fire filter hooks with list of link tags.

This can reduce amount of resources for multiple parsings of whole contents for same purposes by plugins.

Cons are also quite clear.

Plugin dependency ( I think this is huge. Has NodeBB any solution for this? Or am I only one who doesn't know an already existing feature? )
whole filtering structure of contents will become much more complicate.
If many plugin doesn't use this, then maybe useless

So, personally, I want to have it, and I may do because I'll have own similar plugins more than two. But I want to hear comments from a community before.

julian

Well, the main thing is you have raw markdown saved into the database (which is ok), then converted to html by the markdown plugin, and after that, going back to a tokenized form is quite difficult.

In fact, because of this reason (that it is not a 1:1 relationship between md, html, and back to md) that we can't natively support a wysiwyg and markdown solution.

It is possible to listen to a plugin hook fired by markdown if you'd like to hook into the parser used, though I will admit the logic is needlessly complex

qgp9

@julian Hi,

Actually, what I meant about this plugin idea is manipulates of compiled html by nodbb-markdown-plugin like many other plugin does.

Even, I don't want to touch guts of markdown syntax or stored text, it can easily ruin things against future NodeBB updates. Hooking from nodebb-markdown-plugin is also good idea but with fully compiled html :). ( but then what is different from filter:parse.post? )

Wait, that's good point, I don't think it's so secure way but if we convert addresses of images before saving or while writing, then it will be quite efficient. But I'm not sure that it's really better way. Right now, I take my way

frissdiegurke

I actually think the idea is great.

However I wouldn't do it via custom filters but via method attaching like so:

Attach a method [String] getWithoutCode() that caches the resulting value so it does not calculate twice.
Attach a method setWithoutCode(String) to apply changes to the cache.
Bind a listener to get of postData (or whatever) to parse and apply the cached value (cache a variable like [Boolean] anyChanges to not parse when the content hasn't changed).
Bind a listener to set of postData (or whatever) to clean the cache.

This way the calculations would only happen when any plugin requires it and it would not happen more than once during a series of plugins that don't manipulate the original content.
I guess it would even be neater not to use get/set functions but put the logic in getter/setter ofdata.postDataWithoutCode (or a better name).

Same stuff could be added for various data that could be pre-processable (although setting postDataWithoutCode would require the cached list of tags to be updated, etc.).

@qgp9 If you wanna give it a try take a look at Object.defineProperty() (link) for the javascript way of getter/setter definition.

This would work until any plugin returns a new object instead of a modified version of the passed one thought. But I've not seen any plugin that does this and I guess it's negligible due to no valid reason to do so.

qgp9

@frissdiegurke said:

However I wouldn't do it via custom filters but via method attaching like so:

If I understood correctly, this means we need to manage two branches at the same time, and always merge those two branches after every method call or filter hook. otherwise result of getWithoutCode and postData will contain different version.
A Merging means re-parsing for next plugin(filter). This way could provide good interface and save lines of code but not resources.

qgp9

@frissdiegurke I think I misunderstood. I'm re-reading

frissdiegurke

@qgp9 It's not merging after every method call, but whenever the non-modified branch should be get.

So if n plugins in series work with the get/setWithoutCode only and don't touch postData, the postData will not be recalculated between those plugins. This is highly likely to be due to the NodeBB hooks priority which would be needed anyways to ensure it's running after html parsing.

qgp9

@frissdiegurke My original understood might be quite close

So if n plugins in series work with the get/setWithoutCode only and don't touch postData, the postData will not be recalculated between those plugins

I don't understand this part. Let's call this plugin as "pre-parser", and if we have 4 plugins C,D,F which call get/setWithoutCode, and E which use just full raw data

C will read cached contents(w/o code) by getWithoutCode.
C will update cached contents by SetWithoutCode.
D will read and update cached contents
E will read data and edit or not return.
F will read cached contents.

So here problem is that.

At 1. When setWithoutCode is called, pre-parser doesn't know D will use what.
At 2. When setWithoutCode is called, pre-parser doesn't know E will use what.
At 3. If D doesn't say 'updated' or 'nochanged" to pre-parser explicitly, pre-parser need to compare raw data and cached data from 2) while next calling of getWithoutCode because of no information. Of course I assume that pre-praser can catch whole list of plugin in a pipe and also catch which plugin is calling get/setWithoutCode.

and 2) can be solved pre-parser require pre-registraion or similar to plugins which will use cached data.
but what about 3). If any plugin doesn't know about pre-parser then pre-parser need to take care of it and so on, in the end, it will need resource again. and complicate

frissdiegurke

I guess your basic problem is: you don't know that javascript getter/setter can be defined (see my link to Object.defineProperty()).

I'll do some schematic code here (not tested):

function prepareData(data) {
  let postData = data.postData;
  let cache = null;
  let hasChanged = false;

  Object.defineProperty(data, "postDataWithoutCode", {
    set: function (value) {
      cache = value;
      hasChanged = true;
    },
    get: function () {
      if (cache == null) { cache = calculateCache(postData); }
      return cache;
    }
  });
  Object.defineProperty(data, "postData", {
    set: function (value) {
      postData = value;
      cache = null;
      hasChanged = false;
    },
    get: function () {
      if (hasChanged) { postData = calculatePostData(cache); hasChanged = false; }
      return postData;
    }
  });

  return data;
}

qgp9

@frissdiegurke I need to read a document what you gave more carefully , but before, is that so different from an usual OOP object which has 3 data-members and and 4 get/set accessors?

frissdiegurke

This way it calculates/merges at most as often as it's get/set (thus in worst case no performance decrease).
In best case (and likely case due to priority within hooks that'd be needed by those plugins) all plugins that know about the pre-parser are called in series thus it's only one calculation and one merge in total.

Overall: For m series of plugins that only use postDataWithoutCode there are m total calculations and merges.

@qgp9 It's pretty much like a private attribute and two public ones with get/set accessors except it's not needed to call those accessors explicitly. And we overwrite the get/set accessors of the original postData to keep in mind.

qgp9

@frissdiegurke Sorry, I read the documents, I don't understand some points. I may be missing some key point/feature of javascript.

My question is simply that,

How does pre-parser deal with a plugin which doesn't call any function/method of pre-parser?

In long, If plugin E knows nothing of pre-parser, How pre-prarse decide to merge cache to postdata when D called setWithoutCode. Any other chance after D's set before E take data in usual way?

qgp9

And If pre-parser has to decide and manage data Before a next plugin in a filtering pipe because of above reason ( D->E ), then again How pre-parser know next plugin will call getAnything in that moment?

frissdiegurke

When E does anything to the postData, it means E calls at least once the getter of postData; Thus the cache gets merged to the postData first (postData = calculatePostData(cache);). So E gets all changes of previous plugins.

When E does modify the postData the cache gets cleared to be re-generated when D gets postDataWithoutCode.

It's not about knowing which plugins comes next, but about providing up-to-date information when it is required.

We don't have to merge the cache back before the original data is required.
We don't have to calculate the cache before it's required.

qgp9

@frissdiegurke

OK, now I understand. Those getter and setter doesn't mean literal 'getpostData' but they are invoked whenever postData is accessed and modified. right?

I was totally stupid while reading. I have to go home. too tired

I see, let me think more based on that. I understand your story now, but not sure yet that's really better way or not.

Thank you!

frissdiegurke

@qgp9 said:

@frissdiegurke

OK, now I understand. Those getter and setter doesn't mean literal 'getpostData' but they are invoked whenever postData is accessed and modified. right?

Yes

I was totally stupid while reading. I have to go home. too tired

I see, let me think more based on that. I understand your story now, but not sure yet that's really better way or not.

If you find any cons let me know.

Thank you!

You're welcome

qgp9

@frissdiegurke said:

If you find any cons let me know.

Pros

Your method is very flexible. One can cooperate with other plugins just by just control of priorities. This is really un-replaceable advantage.
Also I can imagine that we could do this without activated NodeBB plugin called pre-parser but just with helper module per each plugin. I'm not sure of update and version control with npm, but I believe there are certain solutions.

Cons

If we want to provide "image list" or "link list" above "contents without code block", a situation will become too complicate. Without accurate and elaborate priorities configurations, we will loose an advantage in resources. In this situation, separated filtering pipe will be better.
Your method actually will be touching - let's say - core object. I think this is what we have to be very careful. In the future when NodeBB core development team wants to use get/set of postData, it can be a big problem. Actually it may be possible we check if there are already get/seter and make new get/seter which include old function, but depend on type of original functions, it's not easy to guarantee a proper working. I think a better way could be a pre-registration of plugins which use pre-parser and pre-parser access whole filtering pipe information and decide what to do in each step. An disadvantage of this is that after uninformed plugin before next get call of cached value, pre-parser should check what is changed.

Now what I want to say here is that your method is really brilliant and has un-replaceable advantage but too deep and touching core ( maybe not practically yet but conceptually ). Therefore it needs to be consulted by a core development team.

In other point, your method is quite independent from my custom filter hooks and we can even take both advantages ( of course it will make another problem like Too much or Too complicate :), but just as an idea ).

qgp9

So now, I have questions to the NodeBB development team. @julian

About @frissdiegurke 's idea.
- What do you think? Is it secure or do you have any better ideas, policies , plans?
To manage dependencies between plugin,
- Can we enforce to install and activate plugins by dependencies?
- Or can we give warnings of wrong configurations, dependencies and so on by plugin on admin UI ( dash board? )?
Do we have secure way to add/remove/modify custom data by plugin in a filtering pipe?
- For example, on filtering hook, data.pluginData.<plugin name> can be always for plugins (means some sure for core will not use it for different reason!! ), whether they are removed or not after filtering. I know if I select good name( random or highly uniq), then it will be quite safe but it will be better if we have an official space.
If I want to make a npm module which is dedicated to NodeBB while it's not a plugin, then what is good name for. Maybe nodebb-helper-any-name. It may be for individual plugins.

I had one more, but I forgot what was it while writing....