I recently received a query from a visitor attempting to create a Sitemaps file using the Sitemaps protocol as described in the tutorial
How to Get Search Engines to Discover (Index) All the Web Pages on Your Site. He wanted to know whether he should refer to a page
on his site as (say) "www.example.com/about/index.html", as "www.example.com/about/", or as both in his site map.
Both web addresses ("URLs") point to the same file. This article attempts to answer that question. My answer, however, as you will see,
applies to more than just the site map.
Web servers are usually configured to deliver a default web page (if it exists) whenever a browser requests a directory name. For example,
if you were to ask for "www.example.com/about/", a typical web server will look for a file called "index.html" in the
"about" folder of your website. If it exists, the server will deliver that page's content to the browser. The browser's
address bar, however, will still show the URL you requested, which is "www.example.com/about/" in this example. If the page
does not exist, and the server is not configured to look for any other index page, it will just show a directory listing of the
folder (unless you have disabled that facility on your site).
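The lookup described above can be sketched in a few lines of Python. This is only a simplified model, not how any particular web server is implemented: the function name `resolve_request`, the `index_names` parameter and the "forbidden" result for a disabled directory listing are all assumptions made for illustration.

```python
def resolve_request(path, files, index_names=("index.html",), allow_listing=True):
    """Model what a simplified web server sends back for a requested path.

    `files` is a set of paths that exist on the (hypothetical) site.
    Returns a (kind, value) pair.
    """
    if not path.endswith("/"):
        # A request for a specific file: deliver it if it exists.
        return ("file", path) if path in files else ("not_found", path)

    # A request for a directory: look for a default index page.
    for name in index_names:
        candidate = path + name
        if candidate in files:
            return ("file", candidate)

    # No index page found: fall back to a directory listing, if allowed.
    if allow_listing:
        return ("listing", sorted(f for f in files if f.startswith(path)))
    return ("forbidden", path)


site_files = {"/about/index.html", "/about/team.html", "/images/logo.png"}

# Both requests end up delivering the same file.
print(resolve_request("/about/", site_files))
print(resolve_request("/about/index.html", site_files))

# A folder without an index page produces a listing instead.
print(resolve_request("/images/", site_files))
```

Note that the first two requests return the same file even though the requested URLs differ, which is exactly the situation this article is about.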
This means that for special pages like the "index.html" of your directory, there are actually two ways of accessing the file: by the directory name alone, or by the full filename.
The problem of having more than one URL pointing to the same file is not primarily a human usability problem (since humans can easily figure out they're looking at the same page). It is a search engine problem. I have written about this problem at length elsewhere, such as in the article How to Create a Search Engine Friendly Website. If you have not read that article, please read it now before continuing further. I shall assume you understand the issues of content duplication in the rest of this article.
In view of the problems discussed in that tutorial, if there are two or more ways of referring to a particular web page on your site, you should always
decide on one URL and consistently use that on your site. For example, decide whether you want to refer to a page as
"www.example.com/about/" or "www.example.com/about/index.html". Once you've made that decision, make sure that all web pages
on your site link to the page using the form of URL that you've settled on. Your site map, whether a normal site map or the
search engine specific site map using the Sitemaps protocol, should also refer to that page with the same URL.
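If you generate your site map with a script, one way to enforce this is to pass every URL through a small function that converts it to the form you have chosen. The following is a minimal sketch only. The names `canonical_url` and `sitemap_urls` are hypothetical, and it assumes that "index.html" is the only index filename in use on your site and that your directory URLs always end with a slash.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: "index.html" is the only index filename used on the site.
INDEX_NAME = "index.html"


def canonical_url(url, use_directory_form=True):
    """Rewrite a URL into the single form chosen for the whole site."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if use_directory_form:
        # ".../about/index.html" becomes ".../about/"
        if path.endswith("/" + INDEX_NAME):
            path = path[: -len(INDEX_NAME)]
    else:
        # ".../about/" becomes ".../about/index.html"
        if path.endswith("/"):
            path += INDEX_NAME
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def sitemap_urls(urls, use_directory_form=True):
    """Return the canonical URLs in their original order, without duplicates."""
    result = []
    for url in urls:
        canonical = canonical_url(url, use_directory_form)
        if canonical not in result:
            result.append(canonical)
    return result


pages = [
    "http://www.example.com/about/index.html",
    "http://www.example.com/about/",
    "http://www.example.com/contact.html",
]
for url in sitemap_urls(pages):
    print(url)
```

Whichever form you pass in, each page ends up listed once, under one URL. Ordinary pages like "contact.html" are left untouched.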
Which form of URL should you use? The one with the directory name alone, or the one with the filename? There are a few ways to look at this.
Some people argue that using the directory name alone (like "
www.example.com/about/") is superior to using the actual
filename. When you use the directory name, the web server will transparently find the index file and deliver it to the user (or search engine).
In theory, this means that if you ever want to change to use a different filename for your index page, such as if you want to use a script
like "index.php" to display the page instead of a static page like index.html, you can easily do that without changing any of the URLs on
your site. All you have to do is to modify your server configuration file accordingly.
In practice, however, the above advantage is not significant. If you currently directly refer to "index.html" and later want to use a script file named "index.php" to generate the content, it's also possible to modify your server configuration file so that the web server invokes "index.php" when the "index.html" file is requested. The technique for this is given in my article How to Masquerade Your CGI/PHP Scripts as Static HTML Pages and it involves no more work than that required to deliver "index.php" for a directory name.
If you have a brand new site that has not been indexed yet, and cannot decide which method to use, use the directory name form (like
"example.com/about/"). I personally think it is marginally better because the URL is shorter. Short URLs have some advantages:
besides being easier to remember, they also avoid some of the mangling that third party sites and forum software inflict on long URLs,
as mentioned in my article How to Create Good Filenames for Your Web Pages.
If your site has already been in existence for some time, you should look for the form of URL that is most frequently used, both by your website and by others linking to your site, and use that URL consistently throughout your site. This is the method I adopted on thesitewizard.com. The site had already been in existence for a while, with its subdirectory index pages referred to by name, before I realised ("realized" in US English) I preferred the shorter form. Since there were already many links pointing directly to these folder index pages, changing them would have caused more problems than it solved. As a result, I decided to be consistent and stick to the form I had been using in the past.
If your site is in the same boat, and you have a change of heart about what constitutes a prettier URL, you too may have to resign yourself to your current form for practical reasons, as I did.
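If you have a list of the links pointing to your index pages (for example, gathered from your own pages and from sites linking to you), a short script can tell you which form is more common. This is only a rough sketch: the function name `count_url_forms` and the sample links are made up for illustration, and it assumes "index.html" is the index filename on your site.

```python
from collections import Counter


def count_url_forms(links, index_name="index.html"):
    """Count how many links use the directory form and how many the index file form."""
    counts = Counter()
    for link in links:
        # Ignore any query string or fragment when classifying the link.
        path = link.split("?", 1)[0].split("#", 1)[0]
        if path.endswith("/" + index_name):
            counts["index file form"] += 1
        elif path.endswith("/"):
            counts["directory form"] += 1
        # Links to other pages (like "contact.html") are not counted.
    return counts


# Hypothetical sample of links found on your site and on other sites.
links = [
    "http://www.example.com/about/index.html",
    "http://www.example.com/about/index.html",
    "http://www.example.com/about/",
    "http://www.example.com/contact.html",
]

for form, count in count_url_forms(links).most_common():
    print(form, count)
```

The form at the top of the list is the one most links already use, and so the one that is probably least disruptive to standardise on.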
Don't spend too much time mulling over whether to
use the directory name or the index file
form of URL. In practice, it probably does not matter which you use. Just decide on one form and stick to it.
The important thing is to be consistent. If you have referred to a file as "example.com/about/" in the past, continue
to refer to it as such and don't link to it in other places as "example.com/about/index.html". This applies to both your web pages
and your site map.
This article is copyrighted. Please do not reproduce this article in whole or part, in any form, without obtaining my written permission.