Regex Question.

a_5mith

So I've got a strange issue with my YouTube plugin: it doesn't seem to handle parameters after the YouTube ID very well.

I've got a var that looks like this:

id = $el.data('youtube-id')

Which is parsed via the following regex

    var regularUrl = /<a href="(?:https?:\/\/)?(?:www\.)?(?:youtube\.com)\/(?:watch\?v=)([\w\-_]+)?&([\w\-_]+)">.+<\/a>/g;
    var shortUrl = /<a href="(?:https?:\/\/)?(?:www\.)?(?:youtu\.be)\/([\w\-_]+)">.+<\/a>/g;
    var embedUrl = /<a href="(?:https?:\/\/)?(?:www\.)youtube.com\/embed\/([\w\-_]+)">.+<\/a>/;

Except it doesn't just put the ID in; it also includes any parameters that a user adds afterwards, like start times. That breaks things, because the ID becomes tGZlwK2qTCI&t=3m20s, which isn't a valid video ID. Normally this would only be a problem for the thumbnail that I fetch, but I also append ?autoplay=1 to the URL here:

src="//www.youtube.com/embed/' + id + '?autoplay=1"

I assume something is up with my regex. I would like $1 to be just the video ID (11 characters), and everything after that to become part of $2, so I can add the parameters back in afterwards.

Basically what I'm after is the youtube URL looking like

src="//www.youtube.com/embed/' + id + '?autoplay=1' + parameters + '"

So $1 would be the 11 character youtube ID, and $2 would be all other parameters after that ID.
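
A minimal sketch of that split for the watch?v= form, assuming the href contains a raw & (not &amp;) and the v= parameter comes first. The names splitWatchLinks and embedSrc are made up for illustration and are not part of the plugin. The optional group means $2 is simply undefined when there are no parameters. Whether the embed player accepts each parameter as-is is not checked here.

```javascript
// Hypothetical helper: group 1 is the 11-character video ID,
// group 2 is everything after the first "&" (undefined when absent).
function splitWatchLinks(html) {
  // Built inside the function so the /g flag's lastIndex always starts at 0.
  var watchUrl = /<a href="(?:https?:\/\/)?(?:www\.)?youtube\.com\/watch\?v=([\w-]{11})(?:&([^"]*))?">.+?<\/a>/g;
  var results = [];
  var m;
  while ((m = watchUrl.exec(html)) !== null) {
    results.push({ id: m[1], params: m[2] || '' });
  }
  return results;
}

// Hypothetical helper: rebuild the embed URL, re-attaching the parameters
// after autoplay only when there are any.
function embedSrc(link) {
  var extra = link.params ? '&' + link.params : '';
  return '//www.youtube.com/embed/' + link.id + '?autoplay=1' + extra;
}
```

Because the ID group is fixed at 11 characters and stops before the &, nothing after the ID can leak into $1.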

julian

Is this client-side or server-side? Usually I can tell, except with Node, it's all js

a_5mith

@julian said:

Is this client-side or server-side? Usually I can tell, except with Node, it's all js

Ermmm.

https://github.com/a5mith/nodebb-plugin-youtube-lite/blob/master/library.js

&

https://github.com/a5mith/nodebb-plugin-youtube-lite/blob/master/static/lib/lazyYT.js

julian

Server-side, then.

Use this module: http://nodejs.org/api/url.html#url_url_parse_urlstr_parsequerystring_slashesdenotehost

It's going to make your life a million times easier than parsing a URL via regex.

For client-side, use the Location object, built in. But that's another topic
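
As a rough sketch of that idea: the example below uses the WHATWG URL class (a global in current Node.js and in browsers) rather than the url.parse call linked above, which is a substitution on my part. The function name parseYouTubeUrl is hypothetical. It assumes the href is an absolute URL with a scheme, since new URL() throws on anything else.

```javascript
// Hypothetical helper: returns the video ID plus any remaining query
// parameters, for the three link shapes discussed in this thread.
function parseYouTubeUrl(href) {
  var url = new URL(href); // throws on relative or scheme-less input
  var host = url.hostname.replace(/^www\./, '');
  var id = null;

  if (host === 'youtu.be') {
    // http://youtu.be/<id>?t=47s
    id = url.pathname.slice(1);
  } else if (host === 'youtube.com' && url.pathname === '/watch') {
    // v= may appear anywhere in the query string
    id = url.searchParams.get('v');
    url.searchParams.delete('v');
  } else if (host === 'youtube.com' && url.pathname.indexOf('/embed/') === 0) {
    id = url.pathname.slice('/embed/'.length);
  }

  // Whatever is left in the query string, serialized without the leading "?"
  return { id: id, params: url.searchParams.toString() };
}
```

Since the query string is parsed rather than pattern-matched, the ID comes out clean regardless of parameter order.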

a_5mith

@julian said:

Server-side, then.

Use this module: http://nodejs.org/api/url.html#url_url_parse_urlstr_parsequerystring_slashesdenotehost

It's going to make your life a million times easier than parsing a URL via regex.

For client-side, use the Location object, built in. But that's another topic

Remember you're talking to an idiot here? I'll look into that.

? Offline

@Ted sits idly by and stalks the topic, knowing that with a little more time, this will be resolved.

esiao

@a_5mith Your regex seems fine to me. You can use tools like http://regexr.com to debug them more easily.
I've just tested the href attribute. Why are you testing everything and not only the actual link?

a_5mith

@esiao The regex works, but if you use a parameter, the ID becomes {ID}&{parameter}, which breaks embedding.

I forked the youtube plugin that psychobunny made, so I've not really changed much of it.

EDIT: Using that site, I've managed to get what looks right, I'll give it a go and let you know how it goes.

EDIT again: as you can see from http://regexr.com/39m51, the end of the ID now ends up in $2 if there are no parameters, which also breaks it. Is there a way of getting null if there are no parameters? I'm so close. I think.

esiao

@a_5mith

With /<a href="(?:https?:\/\/)?(?:www\.)?(?:youtu\.be)\/((?:[\w\-_]+){11})\??([^&]+)?(&?[\w&]+)*">.+<\/a>/g
On <a href="http://youtu.be/foNkJJWFuI8?t=47s&parameter">something</a>
It creates three groups
1: id
2: time
3: parameter

Is that what you wanted? If the time is not used, you can turn ([^&]+) into a non-capturing group.

a_5mith

That wouldn't work on <a href="http://youtu.be/foNkJJWFuI8?t=47s&parameter=1">something</a> due to the = sign.

I'm ok with not using the parameters bit, but time would be good to have. As long as I can get the ID without anything else leaking into it, I'm not 100% concerned about parameters etc.

esiao

@a_5mith Just adding (&?[\w&=]+) instead of (&?[\w&]+) should do the trick.

a_5mith

Hey @esiao, thanks for the code. There's a slight issue though, which appears to be regex-based: each pattern only fires once. If I embed the same URL twice, it will only embed one, not the other. However, if I change one of the embeds to another URL variation, replacing watch?v= with /embed/ for example, then it embeds fine. As I can't read regex, is there something in this that stops it from firing again afterwards?

esiao

@a_5mith Yes you're right, here's the fix

/(?:<a href="(?:https?:\/\/)?(?:www\.)?(?:youtu\.be)\/((?:[\w\-_]+){11})\??([^&]+)?(&?[\w&=]+)*">[^<a]+<\/a>)+/gm

But if the links are like <a href="">link</a><a href="">link</a> it will not work.
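
One possible cause of the "only fires once" symptom (not confirmed anywhere in this thread) is the greedy .+ before <\/a> in the original patterns: when two links sit on the same line, .+ runs to the last </a>, so both links are swallowed by a single match. The sketch below compares a greedy and a lazy version of a simplified short-URL pattern. The variable names are only for illustration.

```javascript
var html = '<a href="http://youtu.be/foNkJJWFuI8">one</a> some text ' +
           '<a href="http://youtu.be/foNkJJWFuI8">two</a>';

// Greedy: ".+" extends to the LAST </a> on the line, so both links
// end up inside one match.
var greedy = /<a href="(?:https?:\/\/)?youtu\.be\/([\w-]{11})">.+<\/a>/g;

// Lazy: ".+?" stops at the FIRST </a>, so each link matches on its own.
var lazy = /<a href="(?:https?:\/\/)?youtu\.be\/([\w-]{11})">.+?<\/a>/g;

var greedyCount = (html.match(greedy) || []).length;
var lazyCount = (html.match(lazy) || []).length;

console.log(greedyCount, lazyCount);
```

The lazy form also copes with directly adjacent links such as <a href="...">link</a><a href="...">link</a>. Since . does not match newlines, links on separate lines are not affected by this either way.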

a_5mith

@esiao said:

Unfortunately, that doesn't seem to work either. Even if I put the works of Shakespeare between the two YouTube URLs, it still only displays one.

Also it doesn't seem to match watch?v=videoID either. But it's probably a slightly different regex.

frissdiegurke

I'd like to help you out, but I'd need more specific example inputs that you'd like to read the ID and the parameters from.

For now I can only tell you that something like [\w\-_] is not clean regex, since it's equivalent to [\w-], and the shorter the pattern, the easier it is to read.
Also, the [^<a]+ in the last full regex from @esiao would stop at the first occurrence of a, not only at the first occurrence of <a as it may suggest.
So there are a few not-so-good parts in each regex I've seen so far, and you didn't consider users who put the v=... parameter after other parameters in the regularUrls. And are you sure that it will always be <a href="...">...</a>, and that the a-tag can never get another attribute? (My emoji-extended broke at some version because the code-tags got ^^)...
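
The two character-class points can be checked with a small sketch (the variable names are only for illustration):

```javascript
// 1) [\w\-_] and [\w-] accept exactly the same characters, because \w
//    already includes "_". Compare them over every ASCII character.
var classesAgree = true;
for (var code = 0; code < 128; code++) {
  var ch = String.fromCharCode(code);
  if (/^[\w\-_]$/.test(ch) !== /^[\w-]$/.test(ch)) {
    classesAgree = false;
  }
}

// 2) [^<a]+ is a character class: it stops at ANY "<" or ANY "a",
//    not at the two-character sequence "<a".
var stopsAtFirstA = 'watch a video</a>'.match(/[^<a]+/)[0];

// For link text without nested tags, [^<]+ reads up to the closing tag.
var stopsAtTag = 'watch a video</a>'.match(/[^<]+/)[0];

console.log(classesAgree, stopsAtFirstA, stopsAtTag);
```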

If you want help with cleaner regex (to the best of my knowledge), I'd gladly help if I get a few example URLs that cover all cases.
Also, if you'd be willing to learn regex syntax, I'd try to explain my results afterwards.

But for now I have to sleep first, good night zzz