Bug 15382 – std.uri has an incorrect set of reserved characters

Status
REOPENED
Severity
minor
Priority
P4
Component
phobos
Product
D
Version
D2
Platform
Other
OS
Other
Creation time
2015-11-27T01:06:17Z
Last change time
2024-12-01T16:25:28Z
Assigned to
No Owner
Creator
Neia Neutuladh
Moved to GitHub: phobos#10149

Comments

Comment #0 by dhasenan — 2015-11-27T01:06:17Z
https://tools.ietf.org/html/rfc3986#section-2.2 says that the following characters are reserved and may have special meaning in a URI:

    :/?#[]@!$&'()*+,;=

std.uri only includes the following characters in the reserved set:

    ;/?:@&=+$,

I'm not sure how encode() and encodeComponent() are supposed to be used, so I'm unsure what the appropriate fix would be. It looks like there are two options. Either you manually escape anything that could be construed as a reserved character but shouldn't be, then pass the rest to encode() so it escapes everything that shouldn't appear in any URL. Or you encode each segment that must not contain any delimiter characters with encodeComponent() and join the segments together. But I'm not sure, and it's not really documented.
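To make the two usage patterns guessed at above concrete, here is a minimal sketch, assuming a current Phobos where `std.uri.encode` and `std.uri.encodeComponent` accept and return strings. The URL and the segment values are made up for illustration; nothing here is taken from the std.uri documentation.

```d
import std.algorithm : map;
import std.array : join;
import std.uri : encode, encodeComponent;

void main()
{
    // Pattern 1: encode() a whole URI. Characters in std.uri's reserved
    // set ";/?:@&=+$," and '#' are left alone; the space is escaped.
    assert(encode("http://example.com/a b?x=1&y=2")
        == "http://example.com/a%20b?x=1&y=2");

    // Pattern 2: encodeComponent() each segment, then join with the
    // delimiter. A '/' inside a segment is escaped, so it cannot be
    // mistaken for a path separator.
    auto segments = ["a b", "c/d"];   // hypothetical path segments
    auto path = segments.map!(s => encodeComponent(s)).join("/");
    assert(path == "a%20b/c%2Fd");
}
```

The second pattern is the one RFC 3986 section 2.4 describes: percent-encoding happens while the URI is assembled from its parts.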
Comment #1 by belka — 2016-04-08T05:39:23Z
Look at "2.4. When to Encode or Decode": "the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts. This is when an implementation determines which of the reserved characters are to be used as subcomponent delimiters and which can be safely used as data." So reserved characters can be encoded, but they don't have to be. Only characters that are used as delimiters in a particular URL scheme must be encoded when they appear as data. Wikipedia distinguishes between reserved characters with and without a reserved purpose. I tested it quickly in Firefox, and Firefox doesn't seem to encode characters like * or (). The behavior of encodeComponent is actually exactly the same as that of encodeURIComponent in JavaScript. The behavior described in the issue is how PHP's urlencode works, which encodes all reserved characters.
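As a small check of the claim about encodeComponent, the sketch below asserts which characters pass through unchanged. It assumes the flag tables in std.uri are as quoted elsewhere in this issue (alphanumerics plus the mark characters `-_.!~*'()` are never escaped by encodeComponent); the input strings are arbitrary.

```d
import std.uri : encodeComponent;

void main()
{
    // The "mark" characters are left as-is, which matches the set that
    // JavaScript's encodeURIComponent leaves unescaped.
    assert(encodeComponent("-_.!~*'()") == "-_.!~*'()");

    // Characters from std.uri's reserved set are escaped in a component.
    assert(encodeComponent(";/?:@&=+$,")
        == "%3B%2F%3F%3A%40%26%3D%2B%24%2C");

    // '#' is escaped as well.
    assert(encodeComponent("#") == "%23");
}
```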
Comment #2 by bugzilla — 2019-12-23T09:36:02Z
This is more a question about how std.uri works than a bug report. Please use the forum [1] for such questions in the future.

[1] https://forum.dlang.org/group/learn
Comment #3 by kdevel — 2021-01-24T22:43:55Z
According to § 2.2 of RFC 3986 [1] there are the following character classes:

    unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
    reserved   = gen-delims / sub-delims
    gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

The code in phobos/std/uri.d references these character classes instead (excerpt, some lines omitted):

    uflags['#'] |= URI_Hash;
    ...
    uflags[c] |= URI_Alpha;
    uflags[c + 0x20] |= URI_Alpha; // lowercase letters
    ...
    foreach (c; '0' .. '9' + 1) uflags[c] |= URI_Digit;
    foreach (c; ";/?:@&=+$,") uflags[c] |= URI_Reserved;
    foreach (c; "-_.!~*'()") uflags[c] |= URI_Mark;

If encodeComponent is used, URI_Encode is invoked with unescapedSet = URI_Alpha | URI_Digit | URI_Mark. This leads to some characters that RFC 3986 classifies as reserved not being encoded, e.g. ! or (. The notion of mark characters stems from the obsoleted RFC 2396 [2]. RFC 3986 explains the changes in its Appendix D.2 [3].

[1] https://tools.ietf.org/html/rfc3986#section-2
[2] https://tools.ietf.org/html/rfc2396#section-2.3
[3] https://tools.ietf.org/html/rfc3986#appendix-D.2
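For comparison, a component encoder that follows the RFC 3986 classes could look roughly like the sketch below. `encodeComponentStrict` is a hypothetical helper written for this illustration, not a Phobos function and not a proposed patch. It escapes every UTF-8 code unit that is not in the RFC 3986 unreserved set.

```d
import std.array : appender;
import std.ascii : isAlphaNum;
import std.uri : encodeComponent;

/// Hypothetical helper (not part of Phobos): percent-encode everything
/// except the RFC 3986 "unreserved" characters.
string encodeComponentStrict(const(char)[] s)
{
    static immutable hex = "0123456789ABCDEF";
    auto buf = appender!string();
    foreach (char c; s)   // iterate over UTF-8 code units
    {
        if (isAlphaNum(c) || c == '-' || c == '.' || c == '_' || c == '~')
        {
            buf.put(c);
        }
        else
        {
            buf.put('%');
            buf.put(hex[c >> 4]);
            buf.put(hex[c & 0xF]);
        }
    }
    return buf.data;
}

void main()
{
    // std.uri keeps the RFC 2396 mark characters unescaped ...
    assert(encodeComponent("!()") == "!()");
    // ... while the strict sketch escapes them as RFC 3986 sub-delims.
    assert(encodeComponentStrict("!()") == "%21%28%29");
    // Unreserved characters pass through unchanged.
    assert(encodeComponentStrict("a-b_c.d~e") == "a-b_c.d~e");
}
```

Whether std.uri should behave this way is the open question of this issue; the sketch only shows where the two character sets diverge.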
Comment #4 by robert.schadek — 2024-12-01T16:25:28Z
THIS ISSUE HAS BEEN MOVED TO GITHUB

https://github.com/dlang/phobos/issues/10149

DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT