
Bug 15382 – std.uri has an incorrect set of reserved characters

Status: REOPENED
Severity: minor
Priority: P4
Component: phobos
Product: D
Version: D2
Platform: Other
OS: Other
Creation time: 2015-11-27T01:06:17Z
Last change time: 2024-12-01T16:25:28Z
Assigned to: No Owner
Creator: Neia Neutuladh

Comments

Comment #0 by dhasenan — 2015-11-27T01:06:17Z

https://tools.ietf.org/html/rfc3986#section-2.2 says that the following characters are reserved and may have special meaning in a URI:

    :/?#[]@!$&'()*+,;=

std.uri only includes the following characters in the reserved set:

    ;/?:@&=+$,

I'm not sure how encode() and encodeComponent() are supposed to be used, so I'm unsure what the appropriate fix would be. It looks like you're supposed to manually escape anything that could be construed as a reserved character but shouldn't be, then pass the rest to encode() for it to escape everything that shouldn't appear in any URL; or alternatively encode each segment that you don't want to have any control characters in using encodeComponent() and join them together. But I'm not sure and it's not really documented.
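The second workflow guessed at above (encode each piece with encodeComponent(), then join) can be sketched as follows. This is only an illustration of how the two functions behave, not documented guidance from Phobos; queryParam is a hypothetical helper introduced here, and the example URL is made up.

```d
import std.array : join;
import std.uri : encode, encodeComponent;

/// Hypothetical helper (not part of Phobos): escapes key and value
/// separately so that '&' or '=' inside the data cannot be mistaken
/// for query delimiters.
string queryParam(string key, string value)
{
    return encodeComponent(key) ~ "=" ~ encodeComponent(value);
}

unittest
{
    // encode() leaves delimiters such as ':', '/', '?', '&', '=' and '#'
    // untouched and escapes only what is outside its unescaped set
    // (here, the space).
    assert(encode("http://example.com/a b?x=1&y=2#frag")
        == "http://example.com/a%20b?x=1&y=2#frag");

    // encodeComponent() escapes those delimiters as well.
    assert(encodeComponent("a b&c=d") == "a%20b%26c%3Dd");

    // Encode each part on its own, then join with the real delimiters.
    const query = [queryParam("name", "Tom & Jerry"),
                   queryParam("q", "1+1=2")].join("&");
    assert(query == "name=Tom%20%26%20Jerry&q=1%2B1%3D2");
}
```

The block has no main(); assuming a file named example.d, it can be run with `dmd -main -unittest -run example.d`.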

Comment #1 by belka — 2016-04-08T05:39:23Z

Look at "2.4. When to Encode or Decode":

"the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts. This is when an implementation determines which of the reserved characters are to be used as subcomponent delimiters and which can be safely used as data."

So reserved characters can be encoded, but it isn't a must. Only characters that act as delimiters in a particular URL scheme must be encoded when they appear as data. Wikipedia distinguishes between reserved characters with and without a reserved meaning. I tested it quickly in Firefox, and Firefox doesn't seem to encode characters like * or ().

The behavior of encodeComponent is actually exactly the same as that of encodeURIComponent in JavaScript. The behavior described in the issue is how PHP's urlencode works, which encodes all reserved characters.
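To make the contrast between the two behaviors concrete, here is a minimal sketch of an encoder that escapes every reserved character. encodeStrict is a hypothetical helper written for this illustration; it is neither part of Phobos nor a port of PHP's urlencode. It assumes that only the RFC 3986 "unreserved" characters may pass through unescaped.

```d
import std.ascii : isAlphaNum;
import std.format : format;
import std.uri : encodeComponent;

/// Hypothetical helper (not part of Phobos): percent-encodes every UTF-8
/// code unit outside the RFC 3986 "unreserved" set
/// (ALPHA / DIGIT / "-" / "." / "_" / "~").
string encodeStrict(string component)
{
    string result;
    foreach (char c; component) // iterates code units, not code points
    {
        if (isAlphaNum(c) || c == '-' || c == '.' || c == '_' || c == '~')
            result ~= c;
        else
            result ~= format("%%%02X", cast(ubyte) c);
    }
    return result;
}

unittest
{
    // std.uri passes these characters through unchanged ...
    assert(encodeComponent("a!(b)*") == "a!(b)*");
    // ... while the strict variant escapes them.
    assert(encodeStrict("a!(b)*") == "a%21%28b%29%2A");

    // Both agree on characters that neither treats as safe.
    assert(encodeComponent("a b&c") == "a%20b%26c");
    assert(encodeStrict("a b&c") == "a%20b%26c");
}
```

Both outputs decode to the same data; the difference only matters to a consumer that gives characters such as ! or ( a delimiter role.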

Comment #2 by bugzilla — 2019-12-23T09:36:02Z

This is more a question about how std.uri works than a bug report. Please use the forum [1] for such questions in the future.

[1] https://forum.dlang.org/group/learn

Comment #3 by kdevel — 2021-01-24T22:43:55Z

According to §§ 2.2 and 2.3 of RFC 3986 [1] there are the following character classes:

    unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
    reserved   = gen-delims / sub-delims
    gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

The code in phobos/std/uri.d references these character classes instead (excerpt, lines not contiguous):

    uflags['#'] |= URI_Hash;
    uflags[c] |= URI_Alpha;
    uflags[c + 0x20] |= URI_Alpha; // lowercase letters
    foreach (c; '0' .. '9' + 1) uflags[c] |= URI_Digit;
    foreach (c; ";/?:@&=+$,") uflags[c] |= URI_Reserved;
    foreach (c; "-_.!~*'()") uflags[c] |= URI_Mark;

If encodeComponent is used, URI_Encode is invoked with unescapedSet = URI_Alpha | URI_Digit | URI_Mark. This leads to some reserved characters not being encoded, e.g. ! or (. The notion of mark characters stems from the obsoleted RFC 2396 [2]. RFC 3986 explains the changes in its Appendix D.2 [3].

[1] https://tools.ietf.org/html/rfc3986#section-2
[2] https://tools.ietf.org/html/rfc2396#section-2.3
[3] https://tools.ietf.org/html/rfc3986#appendix-D.2
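The mismatch between the two sets of character classes can be checked mechanically. The sketch below feeds each RFC 3986 reserved character to encodeComponent and collects the ones that come back unchanged. reservedLeftUnescaped is a hypothetical helper, and the genDelims and subDelims constants are copied from the grammar quoted in this comment.

```d
import std.uri : encodeComponent;

// RFC 3986 reserved character classes, as quoted in the comment.
enum genDelims = ":/?#[]@";
enum subDelims = "!$&'()*+,;=";

/// Hypothetical helper: returns the RFC 3986 reserved characters that
/// std.uri.encodeComponent passes through without percent-encoding.
string reservedLeftUnescaped()
{
    string result;
    foreach (char c; genDelims ~ subDelims)
    {
        const string s = "" ~ c; // one-character string
        if (encodeComponent(s) == s)
            result ~= c;
    }
    return result;
}

unittest
{
    // Exactly the characters that are both sub-delims in RFC 3986
    // and "marks" in std.uri.
    assert(reservedLeftUnescaped() == "!'()*");
}
```

The result is the overlap between sub-delims and the URI_Mark string "-_.!~*'()", which is where the RFC 2396 heritage shows through.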

Comment #4 by robert.schadek — 2024-12-01T16:25:28Z

THIS ISSUE HAS BEEN MOVED TO GITHUB

https://github.com/dlang/phobos/issues/10149

DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT, THIS ISSUE HAS BEEN MOVED TO GITHUB