December 25, 2011

What changes can you safely make to a URL?

URLs don’t look so good in running text: they start with an incomprehensible code, they contain strange-looking punctuation, and they run together all-lowercase words that are next to impossible to read. To try to make them look a little nicer, you may be tempted to leave some parts off and gussy up the rest with some camel casing. But how do you know what you can safely change and what will break the link? Here is what you can and cannot safely change.

Parts of a URL

Without getting into too much detail, the structure of a URL is something like this:
protocol | subdomain | domain name | TLD | path
http:// | (none) | paulgraham | .com | /hp.html
http:// | philip. | greenspun | .com | /panda
http:// | philip. | greenspun | .com | /images/pcd0803/florence-bike-6.4.jpg
For our purposes, when I say “hostname,” I mean the domain name including any subdomains (plus the second-level domain if present) plus the top-level domain (TLD). For example, google.com, scholar.google.com, or cra.gc.ca: everything that comes before the first slash.
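If you want to see these parts programmatically, here is a minimal sketch using Python's standard urllib.parse module. The function name url_parts and the dictionary keys are my own labels for illustration, chosen to match the terms used above.

```python
from urllib.parse import urlsplit


def url_parts(url):
    """Split a full URL (one that includes its protocol) into the
    pieces discussed above. The key names are illustrative only."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,    # e.g. "http"
        "hostname": parts.hostname,  # subdomains + domain name + TLD
        "path": parts.path,          # everything from the first slash on
    }
```

Note that urlsplit only recognizes the hostname when the protocol is present, so this sketch expects input such as http://philip.greenspun.com/panda rather than the shortened form.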

Editing the URL

So there you are, faced with lots of unsightly and incomprehensible “http”s and forward slashes. How can you make these URLs fit into your text a little better?

Removing the protocol descriptor

Can you remove the http://? Yes.
http://www.google.com = www.google.com

The http in the URL stands for “hypertext transfer protocol,” which is the protocol used by the World Wide Web. It’s there to tell your browser that you are asking for a web page and that the browser should use the HTTP protocol as opposed to, say, FTP (file transfer protocol). But it’s reasonable to expect that any browser is going to assume HTTP as a default, so feel free to leave this off.

If the protocol descriptor is anything but http:// (for example https:// or ftp://), you should leave it in. Otherwise, browsers will assume that it’s an HTTP request, and because it isn’t, the request will fail. If you’re lucky, the server will be kind enough to redirect the request to the right URL, but it’s not a good idea to rely on this.
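The rule above can be sketched as a small helper. This is only an illustration of the advice in this section: the name display_form is hypothetical, and it relies on the same assumption as the text, namely that readers' browsers will default to HTTP.

```python
def display_form(url):
    """Drop a leading http:// for display; leave any other protocol
    (https://, ftp://, and so on) untouched."""
    prefix = "http://"
    if url.lower().startswith(prefix):
        return url[len(prefix):]
    return url
```

For example, display_form("http://www.google.com") gives www.google.com, while an https:// or ftp:// URL comes back unchanged.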

Changing capitalization

Can you add or remove capitalization? Sometimes. In the hostname, yes:
paulgraham.com = PaulGraham.com
but in the path, no.
slate.me/tbFnWs ≠ slate.me/tbfnws

For the hostname, it’s OK to have a house style that puts capital letters in the hostname (Scholar.Google.com), or even uses camel casing (OurCompany.com). However, the capitalization in the path (the section after the first slash) should not be changed: slate.me/tbFnWs is not the same as slate.me/tbfnws.

The reason is in the way that URLs are processed. The first thing that happens after you type in a URL and press Enter is that your browser sends a request out to the internet. DNS servers accept this request and translate the hostname (philip.greenspun.com) into an IP address (64.95.64.40), which is the address of the server that will have the files you’re looking for. Once the request reaches the server at 64.95.64.40, the server uses the path part of the URL to look through its file system and return the file you requested (/images/pcd0803/florence-bike-6.4.jpg). Only some web servers take the case of the path into account, but you shouldn’t assume that it won’t matter.
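To check an edited URL against the original, you could compare the hostnames without regard to case and the paths exactly. The sketch below does that with plain string handling. The function names are mine, and it assumes simple URLs like the ones in this article (an optional protocol, a hostname, and a path).

```python
def split_url(url):
    """Return (protocol, hostname, path) using simple string splitting."""
    protocol, sep, rest = url.partition("://")
    if not sep:
        # No protocol was given, e.g. "slate.me/tbFnWs".
        protocol, rest = "", url
    hostname, slash, path = rest.partition("/")
    return protocol, hostname, slash + path


def safe_recapitalization(original, edited):
    """True if the two URLs differ at most in the capitalization of
    the protocol and hostname. The path must match exactly."""
    proto_a, host_a, path_a = split_url(original)
    proto_b, host_b, path_b = split_url(edited)
    return (
        proto_a.lower() == proto_b.lower()
        and host_a.lower() == host_b.lower()
        and path_a == path_b
    )
```

With this, safe_recapitalization("paulgraham.com", "PaulGraham.com") is True, and safe_recapitalization("slate.me/tbFnWs", "slate.me/tbfnws") is False.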

As a matter of style, use capital letters very sparingly. Traditionally URLs are all lowercase, and to the purist, capitalization looks funny. Keep your caps for the beginnings of words (PaulGraham.com) and never capitalize the whole URL or the top-level domain name (.com, .ca, etc.).

Removing the www

Can you add or remove a www on the beginning of a URL? No. At least, only sometimes.
www.pashley.co.uk ≠ pashley.co.uk

The www is a subdomain, just as the scholar in scholar.google.com is. If you take the subdomain designation off, for example to change www.google.com to google.com, you are changing the domain name.

OK, I admit that most (almost all) servers are configured to treat a domain name with and without the www subdomain designation the same way by forwarding traffic from one to the other, so you can usually get away with changing this. But it’s important to understand that if you add or remove a www it’s not the same domain name. If you are determined to add or remove a www, test the new form of the URL to make sure it works.
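If you have many URLs to check, part of that testing can be scripted. The sketch below is one possible approach, not a complete test: www_variants produces both forms of a hostname, and resolves asks DNS whether a hostname has an address at all. Both function names are hypothetical. A hostname that resolves can still serve a different page or an error, so it is still worth loading the URL in a browser.

```python
import socket


def www_variants(hostname):
    """Return the hostname without and with the www subdomain."""
    bare = hostname[4:] if hostname.lower().startswith("www.") else hostname
    return bare, "www." + bare


def resolves(hostname):
    """True if DNS returns at least one address for the hostname.
    This needs a network connection and checks DNS only."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False
```

A typical use would be to call resolves on each of the two variants and flag any hostname where only one of them succeeds.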

Try the URLs above and see what happens. The version without the www is invalid. If it works for you, you might be using a browser with aggressive “domain guessing.” That’s fine, but not every browser does this: my versions of Chrome and Firefox won’t guess the URL in the above example. By the way, I’m not holding up the above site as an example of bad design or configuration. I think it’s fine to accept only one version of your domain name, but editor beware.

Removing a terminal slash

Can you remove a slash from the end of a URL? Yes.

The final slash in each of these URLs can be omitted: www.google.com/ or philip.greenspun.com/panda/.

Edit, October 2022: I used to say that there was no web server in the world that wasn’t configured to behave as if there was a slash on the end of the URL, but I found one. Surprisingly, https://www.vlada.gov.sk/koalicia-vita-schvalenie-reformy-nemocnic/ behaves differently from https://www.vlada.gov.sk/koalicia-vita-schvalenie-reformy-nemocnic, and the latter gives a 404 not found error. So I have to amend my advice here to say that it’s almost always OK to remove the final slash.
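As a sketch of this edit, the helper below removes a single final slash and leaves everything else alone. The name drop_final_slash is mine, and given the exception described above, the shortened URL should still be tested before you print it.

```python
def drop_final_slash(url):
    """Remove one trailing slash, if there is one.
    A bare protocol such as "http://" is left as it is."""
    if url.endswith("/") and not url.endswith("://"):
        return url[:-1]
    return url
```

For example, drop_final_slash("philip.greenspun.com/panda/") gives philip.greenspun.com/panda.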

But don’t these changes only affect people who use out-of-date browsers?

The pitfalls I’ve described above result from the way the DNS servers and web servers on the internet work, not the features of the user’s browser. However, some browsers use “domain guessing” to try other forms of a URL if the first request fails, so they’re more likely to be able to work around missing information in a URL. To make sure the URLs you print work for all your readers, be conservative about how you change them.