Import/Display Issue With &#codes;

medwards

I'm running a data import using @bentael 's importer.

It looks like posts that have escape sequences render correctly as part of a post's content, but display the literal escape sequence in the post title. Screenshot:

Screen Shot 2014-02-18 at 3.37.28 PM.png

And looking at the data via redis on the backend:

127.0.0.1:6379> hget topic:3369 title
"&#1048;&#1089;&#1095;&#1077;&#1079;&#1083;&#1080; &#1076;&#1077;&#1085;&#1100;&#1075;&#1080; ?!"
127.0.0.1:6379> hget post:3369 content
"&#1042;&#1089;&#1077;&#1084; &#1087;&#1088;&#1080;&#1074;&#1077;&#1090;, &#1073;&#1091;&#1082;&#1074;&#1072;&#1083;&#1100;&#1085;&#1086; 5.06 &#1087;&#1086;&#1084;&#1077;&#1085;&#1103;&#1083; &#1088;&#1077;&#1082;&#1074;&#1080;&#1079;&#1080;&#1090;&#1099; &#1089; &#1087;&#1086;&#1084;&#1086;&#1097;&#1100;&#1102; &#1089;&#1074;&#1086;&#1077;&#1075;&#1086; &#1084;&#1077;&#1085;&#1077;&#1085;&#1078;&#1077;&#1088;&#1072;, &#1089;&#1077;&#1075;&#1086;&#1076;&#1085;&#1103; &#1079;&#1072;&#1093;&#1086;&#1078;&#1091; &#1074; dashboard, &#1072; &#1085;&#1072; &#1073;&#1072;&#1083;&#1083;&#1072;&#1085;&#1089;&#1077; 2 &#1076;&#1086;&#1083;&#1083;&#1072;&#1088;&#1072;, &#1073;&#1099;&#1083;&#1086; &#1076;&#1086; &#1101;&#1090;&#1086;&#1075;&#1086; &#1087;&#1086;&#1095;&#1090;&#1080; 50 &#1076;&#1086;&#1083;&#1083;&#1072;&#1088;&#1086;&#1074;, &#1074; &#1088;&#1072;&#1079;&#1076;&#1077;&#1083;&#1077; Royalties &#1085;&#1077;&#1090;&#1091; &#1074;&#1099;&#1087;&#1083;&#1072;&#1090;, &#1075;&#1076;&#1077; &#1090;&#1086;&#1075;&#1076;&#1072; &#1076;&#1077;&#1085;&#1100;&#1075;&#1080;?"
127.0.0.1:6379>

It looks like the content and title are being encoded the same way, but for some reason it renders properly in the post content but not the title.

I also tried copying and pasting some of the properly rendered characters as the title of a new post. This appears to work but produces a different encoding within the database:

127.0.0.1:6379> hget topic:3472 title
"\xd0\x92\xd1\x81\xd0\xb5\xd0\xbc \xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82"
127.0.0.1:6379>

Any suggestions? I suppose I could convert between the two encodings during the import... any idea what these two encodings are called?

bentael

hmm, could it be one of the plugins?
which plugins do you have activated? list them all please.

bentael

@medwards

never mind it's not a plugin, it's this line, i think
https://github.com/designcreateplay/NodeBB/blob/master/src/topics.js#L282

I tried hacking the source to take out the .escape() call on that line. It turns out sanitize() returns a validator object (i think that's why you can chain .escape()), but the string result is in there as {str: '... result ...'}, so the new line would look like:

// topics.js:282'ish

// FROM
data.title = validator.sanitize(data.title).escape();

// TO
data.title = (validator.sanitize(data.title) || {}).str || '';

then for some reason i was getting a TypeError in webserver.js, so i added a safety check there

// webserver.js:121'ish

// FROM 
				tag.content = tag.content.replace(/[&<>'"]/g, function(tag) {
// TO
				tag.content = (tag.content || '').replace(/[&<>'"]/g, function(tag) {

That worked for me, but it's a hack. If it works for you, file an issue on the github repo; there may be a good explanation why. If nodebb expects us to escape the title beforehand, the importer can do that.. but i'm a little skeptical.
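
To illustrate why an escaped title would show the literal sequence: a minimal sketch, assuming the escape step turns `&` into `&amp;` the way a typical HTML escape does. `escapeHtml` is a hypothetical stand-in written for this example, not NodeBB's or validator's actual code.

```javascript
// Hypothetical stand-in for an HTML escape step (an assumption, not the real NodeBB code).
function escapeHtml(str) {
  return String(str)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// First two entities of the imported title, as stored in redis.
const storedTitle = '&#1048;&#1089;';
const escapedTitle = escapeHtml(storedTitle);

// Each "&" is escaped again, so the browser decodes "&amp;" back to "&"
// and shows the literal text "&#1048;&#1089;" instead of the characters.
console.log(escapedTitle); // &amp;#1048;&amp;#1089;
```

If the content is not escaped the same way, the entities there reach the browser intact and get decoded, which would match the behaviour in the screenshot.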

julian

As usual, we only sanitize stuff on the way out, not on the way in, so I'd look to the importer to see why they are being saved into Redis as ... html entities?

For reference, redis stores the raw UTF-8 bytes, which redis-cli displays as \x escapes:

127.0.0.1:6379[2]> set foo тест
OK
127.0.0.1:6379[2]> get foo
"\xd1\x82\xd0\xb5\xd1\x81\xd1\x82"
127.0.0.1:6379[2]>
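
The `\x` sequences in that output are just the UTF-8 bytes of the string, printed in escaped form by redis-cli. A quick sketch in Node.js to check this, using only the built-in `Buffer`:

```javascript
// The bytes redis-cli printed for the key "foo", written out as a byte array.
const bytes = Buffer.from([0xd1, 0x82, 0xd0, 0xb5, 0xd1, 0x81, 0xd1, 0x82]);

// Decoding them as UTF-8 gives back the original string.
const text = bytes.toString('utf8');
console.log(text); // тест

// Going the other way: encode the string as UTF-8 and print the bytes as hex.
const hex = Buffer.from('тест', 'utf8').toString('hex');
console.log(hex); // d182d0b5d181d182
```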

bentael

@medwards the only place where the import plugin touches the content is the convert function, if there is a valid convert config. You mentioned in a previous conversation that you're using a custom-built bbcode-to-md convert function, or that you modified the one in there. Can you find an example topic, map it to its content in the vb database, and paste the two here so I can test the conversion functions?

btw @julian i noticed that the validator package is way out of date. looks like the repo has moved and the API changed a bit, e.g. sanitize() was removed

julian

"validator": "~1.5.1",

Latest version seems to be 3.2.1. Wow, what a pain. Thanks @bentael

medwards

Sorry for the late reply.

So it sounds like HTML entities were never really meant to make it into the database. I was able to pull in the node module "html-entities" and feed the titles through it on the way in through the importer. The post contents were rendering okay so I left those as is; I didn't want to mess around with what order to apply the conversions in and whether they'd interfere with each other.
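
For anyone who lands here later, a rough sketch of what "decode on the way in" does. This is not the html-entities module's API; `decodeNumericEntities` is a hypothetical hand-rolled helper that only handles numeric references like `&#1048;` or `&#x418;` and leaves named entities such as `&amp;` alone.

```javascript
// Hypothetical helper (an assumption for illustration): decode numeric character references only.
function decodeNumericEntities(str) {
  return String(str).replace(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g, (match, hex, dec) => {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave references outside the Unicode range untouched.
    if (codePoint > 0x10ffff) return match;
    return String.fromCodePoint(codePoint);
  });
}

// The title from topic:3369 as it was stored by the import.
const importedTitle = '&#1048;&#1089;&#1095;&#1077;&#1079;&#1083;&#1080; &#1076;&#1077;&#1085;&#1100;&#1075;&#1080; ?!';
const decodedTitle = decodeNumericEntities(importedTitle);
console.log(decodedTitle); // Исчезли деньги ?!
```

Once the decoded string is saved, redis holds plain UTF-8 for it, the same as for a title typed into the composer.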

One of these days I'll get around to contributing back some changes to nodebb-plugin-import. Things have just been crazy around here lately.

Thanks for all your help.

julian

Validator is now v3.2.1, but it's a transparent upgrade, so it shouldn't affect anything here.