Recently, the question came up about why it is not possible for IIS to handle a URL that contains a ‘%’ character that is not part of an escape sequence. The resulting discussion produced some informative references to the relevant RFC documents and also included some anecdotes on URL canonicalization.
I would like to share my response to the question. I hope that you find it informative.
Disclaimers and Assumptions
Discussing how URLs are processed can be somewhat confusing because common usage of the term “URL” does not reflect the fact that the string contains several parts that are each processed differently. Consider the following URL:
http://www.myserver.com/somefolder/somepage.aspx?morestuff
It can be thought of as 4 distinct parts that are each subject to their own rules and processing. The first part is the protocol specification: “http://”. This information is used by the client to know what protocol to use when contacting a server to handle the request. The second part is the hostname: “www.myserver.com”. This information is used by the client to resolve the IP address of the server and connect. It may optionally include IP port information by using a colon to delimit the port ID: “www.myserver.com:80”. Notably, neither of these two parts is typically sent to the server as part of the request URL. They are consumed directly by the client (an HTTP/1.1 client does repeat the hostname, but in a separate Host header rather than in the URL). The exception to this is when your browser is configured to use a proxy server. In that case, the client will send them in the URL because the proxy server will need the information to know how to connect to the downstream server.
The third and fourth parts of the URL are the path: “/somefolder/somepage.aspx” and the query string: “morestuff”. This is the information that a web server deals with. The URL canonicalization rules discussed in this post are applied to the path only. The query string is generally considered opaque by the web server and is left for the application that handles the request to process (i.e., somepage.aspx in this example).
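To make the four parts concrete, here is a minimal sketch that splits a URL using Python's standard urllib.parse module. The helper name split_url and the dictionary keys are my own illustrative choices, and the sample URL is assembled from the pieces quoted above. This only shows the split; it says nothing about how IIS itself parses requests.

```python
from urllib.parse import urlsplit

def split_url(url):
    """Split a URL into the parts discussed above (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,   # used by the client only
        "host": parts.hostname,     # resolved by the client to an IP address
        "port": parts.port,         # None when no ":port" is given
        "path": parts.path,         # canonicalization rules apply here
        "query": parts.query,       # opaque to the web server
    }

example = split_url(
    "http://www.myserver.com:80/somefolder/somepage.aspx?morestuff"
)
print(example["path"], example["query"])
```

Only the last two entries, the path and the query string, correspond to what the web server works with in the request line.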
Next, I would like to note that there are many different components involved in processing a typical HTTP request, and each may apply its own rules regarding what is allowed in a URL and what is not. In the case of IIS, the first component to see the request is http.sys. This is the driver component that picks up the raw request from the wire and parses the HTTP protocol. Next, the request goes to user mode and the IIS core server, which sends the request through our pipeline. In the pipeline, the request may be handled by any number of modules that can each get a look at the request. In the case of this conversation, the module of interest is the RequestFilteringModule. The request is then dispatched to a handler, which is a special module that is responsible for the heavy lifting of processing the request and producing a response. Some common handlers in IIS are the static file handler, which serves plain old static pages; ASP.NET, which provides its own development platform and gives access to the .NET framework; and FastCGI, which provides a platform to support things like PHP.
My expertise lies in IIS and the modules that ship with it. Some of the common handlers, like ASP.NET and PHP, can and do implement their own checks for URL validity, but I am not going to try to explain them because I don’t know their internals well enough. I also would not consider myself an expert on http.sys, but I do know some things about how it validates URLs, which I will explain here.
Http.sys is the first component that looks at the incoming URL. One of its jobs is to parse the HTTP request according to a well-defined set of rules. Strictness is important here, as http.sys needs to protect itself from malformed requests. For the URL, http.sys applies RFC 2396. Http.sys also has a documented registry setting called AllowRestrictedChars, which opens things up a bit. In the case of the original question that started this discussion, even this setting will not allow requests through with a ‘%’ character that is not part of an escape sequence. The reason comes directly from RFC 2396, which has this to say in section 2.4.2, entitled “When to Escape and Unescape”:
Because the percent "%" character always has the reserved purpose of being the escape indicator, it must be escaped as "%25" in order to be used as data within a URI. Implementers should be careful not to escape or unescape the same string more than once, since unescaping an already unescaped string might lead to misinterpreting a percent data character as another escaped character, or vice versa in the case of escaping an already escaped string.
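As a rough illustration of the rule in that quote, the following sketch flags any ‘%’ that is not followed by two hexadecimal digits and shows the escaped form a client should send instead. The function name has_stray_percent is hypothetical, and the regular expression is my own approximation of the rule, not the actual http.sys parser.

```python
import re
from urllib.parse import quote

# A '%' is only legal when it starts an escape: '%' plus two hex digits.
_STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")

def has_stray_percent(path):
    """Return True if the path contains a '%' that is not an escape."""
    return _STRAY_PERCENT.search(path) is not None

raw_path = "/100%/report"
print(has_stray_percent(raw_path))   # the lone '%' is not a valid escape

# To send a literal '%' as data, the client escapes it as "%25".
escaped_path = quote(raw_path)
print(escaped_path, has_stray_percent(escaped_path))
```

Here quote leaves the ‘/’ separators alone and rewrites the lone ‘%’ as “%25”, which is the form the RFC asks for.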
Once a request moves from http.sys to the IIS core server, the RequestFilteringModule will get a look at it. This happens in the RQ_BEGIN_REQUEST notification. The behavior of the request filter is pretty clearly called out by its schematized configuration, where we’ve tried to name the properties as clearly as possible. I won’t go over all of the details here, but I do want to call out one property that generates some confusion: allowDoubleEscaping.
The allowDoubleEscaping feature maps to an old UrlScan feature called VerifyNormalization. The original intent of VerifyNormalization was to ensure that any URLs passed to applications were reduced to their canonical form (see the caution above in the quote from RFC 2396 about misinterpretation of a ‘%’ character.) Guaranteeing that a URL is in canonical form protects against the case where an application naively does its own unescape on a URL that has already been unescaped. Strictly speaking, there is nothing in any RFC that disallows URLs that produce different results when decoded more than once, but proper process mandates that the application handling the request does the right thing. As such, this check is a defense in depth measure. We turn it on by default because misinterpreting non-canonical URLs is probably the most common type of bug that leads to security problems.
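The idea behind this check can be sketched in a few lines: unescape the path once, unescape the result again, and reject the request if the two results differ. This is only an approximation of the concept using Python's urllib.parse.unquote. The function name is_double_escaped is hypothetical and the code is not the RequestFilteringModule implementation.

```python
from urllib.parse import unquote

def is_double_escaped(path):
    """Return True if a second unescape pass changes the path again."""
    once = unquote(path)
    twice = unquote(once)
    return once != twice

# "%2541" decodes to "%41", which a naive second decode turns into "A".
print(unquote("/a%2541"), unquote(unquote("/a%2541")))
print(is_double_escaped("/a%2541"))

# "%41" decodes to "A" and stays "A", so it is already stable.
print(is_double_escaped("/a%41"))
```

An application that only ever sees paths for which this returns False cannot be tricked by a second, accidental unescape.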
So allowDoubleEscaping/VerifyNormalization seems pretty straightforward. Why did I say that it causes confusion? The issue is when a ‘+’ character appears in a URL. The ‘+’ character doesn’t appear to be escaped, since it does not involve a ‘%’. Also, RFC 2396 notes it as a reserved character that can be included in a URL when it’s in escaped form (%2b). But with allowDoubleEscaping set to its default value of false, we will block it even in escaped form. The reason for this is historical: way back in the early days of HTTP, a ‘+’ character was considered shorthand for a space character. Some canonicalizers, when given a URL that contains a ‘+’, will convert it to a space. For this reason, we consider a ‘+’ to be non-canonical in a URL. I was not able to find any reference to an RFC that calls out this ‘+’ treatment, but there are many references on the web that talk about it as a historical behavior.
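Python's standard library happens to ship both kinds of decoder, which makes the ambiguity easy to see. In this sketch, unquote leaves ‘+’ alone while unquote_plus applies the historical “plus means space” convention. The helper contains_ambiguous_plus is a hypothetical name for illustration and is not how IIS implements the check.

```python
from urllib.parse import unquote, unquote_plus

def contains_ambiguous_plus(path):
    """Return True if the unescaped path contains a '+' character."""
    return "+" in unquote(path)

path = "/files/a%2Bb.txt"

# First decode: "%2B" becomes a literal '+'.
first = unquote(path)
print(first)

# A later decoder that treats '+' as a space changes the name again.
second = unquote_plus(first)
print(second)

print(contains_ambiguous_plus(path))
```

The two decoders disagree about what the path names, and that disagreement is exactly what makes ‘+’ non-canonical.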
The IIS Core Server
OK, so getting back to the components that might be checking URLs, I would like to say just a bit about the IIS core server itself.
In general, the IIS core server does not enforce any restrictions on URLs - but there are a couple of things it does with the physical file mappings that are produced from URL mappings. The core pipeline API exposes the IHttpFileInfo interface that allows modules to work with files in the IIS file cache. The static file handler, for example, uses this interface. Also, when you configure a handler to verify the existence of a file, the core itself uses this. The file cache API will not work with file paths that contain a ‘:’ character. The reason for this is that the colon is used by NTFS to specify alternate data streams. It is possible to use alternate data stream syntax as a non-canonical way to get file data. The most infamous case of this was the ancient ::$Data bug.
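A check of this kind might look like the sketch below. It assumes that a leading drive letter such as “C:” is the only legitimate colon in a physical path and treats any other colon as stream syntax. That assumption, the function name has_stream_syntax, and the sample paths are mine; the real file cache logic may differ.

```python
import re

# Assumption: a drive letter prefix like "C:" is the only allowed colon.
_DRIVE_PREFIX = re.compile(r"^[A-Za-z]:")

def has_stream_syntax(physical_path):
    """Return True if the path has a ':' beyond the drive letter."""
    remainder = _DRIVE_PREFIX.sub("", physical_path, count=1)
    return ":" in remainder

# Hypothetical paths for illustration only.
print(has_stream_syntax("C:\\inetpub\\wwwroot\\default.asp"))
print(has_stream_syntax("C:\\inetpub\\wwwroot\\default.asp::$DATA"))
```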
The other notable thing that the IIS core does whenever it works with physical file names is that it uses a special syntax. Specifically, we prepend “\\?\” to any local filenames, or “\\?\UNC\” to UNC filenames. This special syntax tells Win32 file APIs that they should not attempt to do their own canonicalization on the filename and should instead pass it directly to the file system layers below. This ensures that when we open a file, we are using its canonical name. If a filename gets past our canonicalization checks and is still non-canonical, this will cause the file operation to fail. The downside to this is that the CLR (at least the last time I checked) does not support this syntax. So if you are writing a .NET application that uses IIS server variables, you may need to strip out the above prefixes.
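Stripping the prefixes is a simple string operation. The sketch below shows the logic in Python for brevity (the same few lines translate directly to a .NET language). The function name strip_extended_prefix and the sample paths are hypothetical. Note that the UNC form has to be checked first, because it also starts with the local prefix, and that a UNC name needs its leading “\\” restored.

```python
LOCAL_PREFIX = "\\\\?\\"        # the four characters \\?\
UNC_PREFIX = "\\\\?\\UNC\\"     # the eight characters \\?\UNC\

def strip_extended_prefix(path):
    """Remove the \\\\?\\ or \\\\?\\UNC\\ prefix, if present."""
    if path.startswith(UNC_PREFIX):
        # \\?\UNC\server\share\... becomes \\server\share\...
        return "\\\\" + path[len(UNC_PREFIX):]
    if path.startswith(LOCAL_PREFIX):
        # \\?\C:\dir\file becomes C:\dir\file
        return path[len(LOCAL_PREFIX):]
    return path

# Hypothetical paths for illustration only.
print(strip_extended_prefix("\\\\?\\C:\\inetpub\\wwwroot\\page.htm"))
print(strip_extended_prefix("\\\\?\\UNC\\server\\share\\page.htm"))
```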
So there you have it. This covers the things that a URL must pass through in order to be processed by IIS. I hope that you find it informative and useful.