If your site is one of those websites where only a few pages seem to be indexed by the search engines, this article is for you. It describes how you can provide the major search engines with a list of all the pages on your website, thus allowing them to learn of the existence of pages which they may have missed in the past.
How do you know which pages of your site have been indexed by a search engine and which have not? One way is to use "site:domain-name" to search for your site. This works with Google, Yahoo and Microsoft Live, although not with Ask.
For example, if your domain is example.com, type "site:example.com" (without the quotes) into the search field of the search engine. From the results list, you should be able to see all the pages which the search engine knows about. If you find that a page from your site is not listed, and you have not intentionally blocked it using robots.txt or a meta tag, then perhaps that search engine does not know about that page or has been unable to access it.
Here's what to do when you discover that there are pages not indexed by the search engine.
The first thing to do is to check your robots.txt file, and make sure it complies with the rules of a robots.txt file. Many webmasters, new and old, unintentionally block a search engine from a part of their site by having errors in their robots.txt file.
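One way to test a robots.txt file for accidental blocks is to run its contents through a parser and ask whether a given page is allowed. The following is only a minimal sketch, assuming you have Python installed. It uses the standard library's urllib.robotparser module; the function name is_blocked and the sample rules are made up for illustration and are not part of any search engine's tools.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt, url, user_agent="*"):
    """Return True if the robots.txt text disallows `url` for `user_agent`."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so nothing is fetched over the network.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks the /private/ folder for every robot.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for page in ("http://www.example.com/private/notes.html",
                 "http://www.example.com/index.html"):
        print(page, "blocked" if is_blocked(SAMPLE_ROBOTS, page) else "allowed")
```

Paste the contents of your own robots.txt in place of the sample and try the addresses of the pages that are missing from the search results. Bear in mind that individual search engines may interpret some rules slightly differently from this parser.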
Another thing you might want to do is to make sure that the page itself does not contain a meta tag that tells robots not to index it. This may occur if you have ever put a meta "noindex" tag on the page, and later wanted it indexed but forgot to remove it.
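If you have many pages to check, looking for a leftover "noindex" tag can be automated. The sketch below is one possible approach using Python's built-in html.parser module. The class and function names are hypothetical, and it only looks at meta tags named "robots", not at tags addressed to one specific search engine.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # The content attribute is a comma-separated list such as "noindex, follow".
            for directive in attr.get("content", "").split(","):
                if directive.strip():
                    self.directives.append(directive.strip().lower())


def has_noindex(html):
    """Return True if the page source carries a robots meta tag with noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(has_noindex(page))
```

Feed the function the HTML source of each page you expected to see in the search results. Any page for which it returns True is asking the robots to leave it out.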
The major search engines, Google, Yahoo, Live and Ask, all support something known as a Sitemap file. This is not the "Site Map" that you see on many websites, including thesitewizard.com. My Site Map and others like it are primarily designed to help human beings find specific pages on the website. The sitemap file that uses the Sitemap protocol is, instead, designed for search engines, and is not at all human-friendly.
Sitemaps have to adhere to a particular format. The detailed specifications for this can be found at the sitemaps.org website. It is not necessary to use every aspect of the specification to create a sitemap if all you want is to make sure the search engines locate all your web pages. Details on how to create your own sitemap will be given later in this article.
As a result of the sitemap protocol, an extension to the robots.txt file has been agreed upon by the search engines. Once you have finished creating the sitemap file and uploaded it to your website, modify your robots.txt file to include the following line:
Sitemap: http://www.example.com/name-of-sitemap-file.xml
You should change the web address ("URL") given to the actual location of your sitemap file. For example, change "www.example.com" to your domain name and "name-of-sitemap-file.xml" to the name that you have given your sitemap file.
If you don't have a robots.txt file, please see my article on robots.txt for more information on how to create one. The article can be found at http://www.thesitewizard.com/archive/robotstxt.shtml
The search engines that visit your site will automatically look into your robots.txt file before spidering your site. When they read the file, they will see the sitemap file listed and load it for more information. This will enable them to discover the pages that they have missed in the past. In turn, this will hopefully send them to index those files.
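To confirm that your robots.txt file really advertises your sitemap, you can look for the Sitemap line the same way a robot would. This is a rough sketch in Python of that lookup, not the code any search engine actually runs. The function name sitemap_urls is hypothetical, and the sample file reuses the placeholder address from this article.

```python
def sitemap_urls(robots_txt):
    """Return the addresses listed on 'Sitemap:' lines of a robots.txt text."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()      # drop comments and padding
        key, sep, value = line.partition(":")     # split at the first colon only
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt with the Sitemap line added at the end.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /cgi-bin/

Sitemap: http://www.example.com/name-of-sitemap-file.xml
"""

if __name__ == "__main__":
    print(sitemap_urls(SAMPLE_ROBOTS))
```

If the list that comes back is empty, or the address in it does not match where you uploaded the file, the search engines will not find your sitemap through robots.txt.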
A sitemap file that follows the Sitemap Protocol is just a straightforward ASCII text file. You can create it using any ordinary ASCII text editor. If you use Windows, Notepad (found in the Accessories folder of your Start menu) can be used. Do not use a word processor like Microsoft Office or Word.
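By way of example, take a look at the following sample sitemap file.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.example.com/</loc></url>
<url><loc>http://www.example.com/another-page.html</loc></url>
</urlset>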
You will notice that a sitemap file begins with the text
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
and ends with
</urlset>
Those portions of the sitemap file are invariant. All sitemaps have to begin and end this way, so you can simply copy them from my example to your own file.
Next, notice that every page on the website (that you want indexed in the search engine) is listed in the sitemap, using the following format:
<url><loc>http://www.example.com/</loc></url>
where http://www.example.com/ should be replaced by the URL of the page you want indexed. In other words, if you want to add a page, say, http://www.example.com/sing-praises-for-thesitewizard.com.html to your website, just put the web address for that page between <url><loc> and </loc></url>, and place the entire line inside the section demarcated by <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> and </urlset>.
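Since a single mistyped tag can make the whole file unreadable, it may be worth checking a finished sitemap before uploading it. The following sketch, assuming Python and its standard xml.etree.ElementTree module, confirms that the file is well-formed, that it is wrapped in the urlset section, and lists the addresses it contains. The function name list_sitemap_urls and the sample data are only for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def list_sitemap_urls(sitemap_xml):
    """Parse sitemap text and return the addresses found in its <loc> tags."""
    # Encoding to bytes lets the parser honour the encoding in the XML declaration.
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    if root.tag != "{%s}urlset" % SITEMAP_NS:
        raise ValueError("the file is not wrapped in the Sitemap <urlset> section")
    locs = root.iterfind("sm:url/sm:loc", {"sm": SITEMAP_NS})
    return [loc.text.strip() for loc in locs if loc.text]


# Hypothetical sitemap listing two pages.
SAMPLE_SITEMAP = """\
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.example.com/</loc></url>
<url><loc>http://www.example.com/sing-praises-for-thesitewizard.com.html</loc></url>
</urlset>
"""

if __name__ == "__main__":
    for address in list_sitemap_urls(SAMPLE_SITEMAP):
        print(address)
```

If the file has an unclosed tag, the parser raises an error naming the line at fault. If the opening urlset line is wrong, the function raises a ValueError instead.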
To make your job simpler, just copy the entire example sitemap that I gave in the example above, replace all the example URLs with your own page addresses, add any more that you like, and you're done.
Save the file under any name you like. Most people save it with a ".xml" file extension. If you don't have any particular preference, call it "sitemap.xml". If you use Notepad instead of a decent text editor, you should note the tips I gave in my article on how to save a file without the .txt extension in Notepad, otherwise you will encounter other problems.
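If your site has more than a handful of pages, typing the sitemap by hand gets tedious. As an alternative, here is a minimal sketch in Python that builds the same text from a list of page addresses. The function name build_sitemap and the example addresses are hypothetical. It escapes characters such as "&" that cannot appear as-is inside an XML file.

```python
from xml.sax.saxutils import escape

# Characters that escape() does not handle by default.
EXTRA_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def build_sitemap(page_urls):
    """Return the text of a sitemap file listing the given page addresses."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url in page_urls:
        lines.append("<url><loc>%s</loc></url>" % escape(url, EXTRA_ENTITIES))
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    pages = [
        "http://www.example.com/",
        "http://www.example.com/view.php?item=1&lang=en",
    ]
    print(build_sitemap(pages), end="")
```

To save the result, write the returned text to a file, for example with open("sitemap.xml", "w", encoding="utf-8"), and upload that file to your website as you would any other page.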
Remember to update your robots.txt file as mentioned earlier to include the URL of your sitemap file, so that the search engines can learn of the existence of the file.
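Note: a sitemap file cannot have more than 50,000 URLs (web addresses). There is also a limit on the size of the file: this was 10 MB when this article was written, and the Sitemaps site now gives it as 50 MB uncompressed. If yours exceeds either limit, you'll have to create multiple sitemap files. Please see the Sitemaps site on how this can be done.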
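For a very large site, the splitting can be done mechanically. The sketch below, in Python, divides a list of addresses into groups of at most 50,000 and builds a sitemap index file, which is the mechanism the Sitemaps protocol provides for listing several sitemap files. The function names and file names are hypothetical, and the sketch only counts URLs; it does not check the size of the files in bytes.

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50000


def split_urls(page_urls, limit=MAX_URLS_PER_SITEMAP):
    """Split the addresses into consecutive groups of at most `limit` each."""
    urls = list(page_urls)
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]


def build_sitemap_index(sitemap_file_urls):
    """Return the text of a sitemap index pointing at each sitemap file."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url in sitemap_file_urls:
        lines.append("<sitemap><loc>%s</loc></sitemap>" % escape(url))
    lines.append("</sitemapindex>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    pages = ["http://www.example.com/page%d.html" % n for n in range(120001)]
    groups = split_urls(pages)
    # Hypothetical names for the individual sitemap files: sitemap1.xml, sitemap2.xml, ...
    names = ["http://www.example.com/sitemap%d.xml" % (n + 1) for n in range(len(groups))]
    print([len(group) for group in groups])
    print(build_sitemap_index(names), end="")
```

Each group would be written out as its own sitemap file in the usual format, and the index file is then the one you list in robots.txt.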
If you have pages on your website that seem to be omitted from the search engine indices, following the tips in this article will help you make sure that the search engines learn of all the pages on your website. Of course, whether they actually go about spidering and listing them is another matter. However, with the sitemap file, you can at least know that they are aware of all the available pages on your site.
This article is copyrighted. Please do not reproduce this article in whole or part, in any form, without obtaining my written permission.