How to Get Search Engines to Discover (Index) All the Web Pages on Your Site

And How To Find Out Which Pages Are Left Out By The Search Engine


by Christopher Heng, thesitewizard.com

If your site is one of those websites where only a few pages seem to be indexed by the search engines, this article is for you. It describes how you can provide the major search engines with a list of all the pages on your website, thus allowing them to learn of the existence of pages which they may have missed in the past.

How do you Find Out which Pages of your Website are Indexed?

How do you know which pages of your site have been indexed by a search engine and which have not? One way is to use "site:domain-name" to search for your site. This works with Google, Bing and Yahoo, although not with Ask.

For example, if your domain is example.com, type "site:example.com" (without the quotes) into the search field of the search engine. From the results list, you should be able to see all the pages which the search engine knows about. If you find that a page from your site is not listed, and you have not intentionally blocked it using robots.txt or a meta tag, then perhaps that search engine does not know about that page or has been unable to access it.
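
If you suspect the meta tag case mentioned above, you can scan the page's HTML for a robots meta tag that contains "noindex". The following is only a minimal sketch in Python, using the standard html.parser module. The names RobotsMetaParser and has_noindex, and the sample HTML, are my own inventions for illustration. It looks only at the generic name="robots" tag, not at tags aimed at one particular search engine.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None (for valueless attributes), so default to "".
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # The content attribute is a comma-separated list, e.g. "noindex, nofollow".
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(html_text):
    """Return True if the HTML contains a robots meta tag with "noindex"."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return "noindex" in parser.directives


# Hypothetical page source, for illustration only.
sample = '<html><head><meta name="robots" content="noindex, follow"></head><body></body></html>'
print(has_noindex(sample))
```

To check a real page, save its source from your browser and pass the text of that file to has_noindex().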

Steps to Getting the Search Engine to Discover and Index Your Whole Site

Here's what to do when you discover that some of your pages have not been indexed by the search engine.

  1. Check Whether Search Engines are Blocked from that Page

    The first thing to do is to check your robots.txt file, and make sure it complies with the rules of a robots.txt file. Many webmasters, new and old, unintentionally block a search engine from a part of their site by having errors in their robots.txt file.

    Another thing you might want to do is to make sure that your web page does not have a meta tag that prevents a robot from indexing a particular page. This may occur if you have ever put a meta "noindex" tag on the page, and later wanted it indexed but forgot to remove it.

  2. Create a File Using the Sitemap Protocol

    The major search engines, Google, Bing, Yahoo and Ask, all support something known as a Sitemap file. This is not the "Site Map" that you see on many websites, including thesitewizard.com. My Site Map and others like it are primarily designed to help human beings find specific pages on the website. The sitemap file that uses the Sitemap protocol is, instead, designed for search engines, and is not at all human-friendly.

    Sitemaps have to adhere to a particular format. The detailed specifications for this can be found at the sitemaps.org website. It is not necessary to use every aspect of the specification to create a sitemap if all you want is to make sure the search engines locate all your web pages. Details on how to create your own sitemap will be given later in this article.

  3. Modify Your Robots.txt File for Sitemaps Auto-Discovery

    As a result of the sitemap protocol, an extension to the robots.txt file has been agreed upon by the search engines. Once you have finished creating the sitemap file and uploaded it to your website, modify your robots.txt file to include the following line:

    Sitemap: http://www.example.com/name-of-sitemap-file.xml

    You should change the web address ("URL") given to the actual location of your sitemap file. For example, change "www.example.com" to your domain name and "name-of-sitemap-file.xml" to the name that you have given your sitemap file.

    If you don't have a robots.txt file, please see my article on robots.txt for more information on how to create one. The article can be found at https://www.thesitewizard.com/archive/robotstxt.shtml

    The search engines that visit your site will automatically look into your robots.txt file before spidering your site. When they read the file, they will see the sitemap file listed and load it for more information. This will enable them to discover the pages that they have missed in the past. In turn, this will hopefully lead them to index those pages.
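
If you want to double-check steps 1 and 3 above, Python's standard urllib.robotparser module can read a robots.txt file in roughly the way a well-behaved robot would. The following is only a sketch. The robots.txt content, the /private/ directory and the page addresses are hypothetical, and real search engines may interpret some rules differently from this module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

Sitemap: http://www.example.com/sitemap.xml
"""


def load_robots(text):
    """Parse robots.txt text into a RobotFileParser without touching the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


robots = load_robots(ROBOTS_TXT)

# can_fetch() reports whether the given user agent is allowed to crawl the URL.
print(robots.can_fetch("*", "http://www.example.com/index.html"))
print(robots.can_fetch("*", "http://www.example.com/private/notes.html"))

# site_maps() (Python 3.8 and later) returns the Sitemap: URLs in the file, or None.
print(robots.site_maps())
```

If can_fetch() reports False for a page you want indexed, or site_maps() does not list your sitemap file, look at your robots.txt file again.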

How to Create a Sitemap File

A sitemap file that follows the Sitemap Protocol is just a straightforward plain text file. You can create it using any ordinary plain text editor. If you use Windows, Notepad can be used. If you use a Mac, try TextEdit. Do not use a word processor like Microsoft Office, Wordpad or Word. For Windows users (Windows Vista, 7, 8.1 and later versions), you can start up Notepad by clicking the Start menu (or Start screen), and typing "notepad" (without the quotes), then clicking the "Notepad" line that appears.

By way of example, take a look at the following.

You will notice that a sitemap file begins with the text

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

and ends with

</urlset>

Those portions of the sitemap file are invariant. All sitemaps have to begin and end this way, so you can simply copy them from my example to your own file.
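
If you are not sure whether you copied those opening and closing portions correctly, a quick check is to see if the file parses as XML and has urlset as its outermost element. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The function name is_urlset and the sample data are my own, for illustration only. It does not validate the file against the full Sitemap specification.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def is_urlset(sitemap_bytes):
    """Return True if the data parses as XML and its root is the Sitemap <urlset>."""
    try:
        root = ET.fromstring(sitemap_bytes)
    except ET.ParseError:
        # Malformed XML, for example a missing </urlset> at the end.
        return False
    # ElementTree reports namespaced tags as "{namespace}name".
    return root.tag == "{%s}urlset" % SITEMAP_NS


good = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    b"<url><loc>http://www.example.com/</loc></url>\n"
    b"</urlset>\n"
)
missing_close = good.replace(b"</urlset>", b"")

print(is_urlset(good))
print(is_urlset(missing_close))
```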

Next, notice that every page on the website (that you want indexed in the search engine) is listed in the sitemap, using the following format:

<url><loc>http://www.example.com/</loc></url>

where http://www.example.com/ should be replaced by the URL of the page you want indexed. In other words, if you want to add a page, say, http://www.example.com/sing-praises-for-thesitewizard.com.html to your website, just put the web address for that page between <url><loc> and </loc></url>, and place the entire line inside the section demarcated by <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> and </urlset>.

To make your job simpler, copy the entire example sitemap that I gave in the example above into an empty Notepad window (or TextEdit on the Mac). Then replace all the example URLs with your own page addresses, adding any more that you like, and you're done.
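
If you have many pages, typing each line by hand gets tedious. As an alternative, here is a rough sketch of a Python script that produces the same format from a list of page addresses, using the standard xml.etree.ElementTree module. The function name build_sitemap and the list of pages are hypothetical. The sketch assumes Python 3.8 or later, since it relies on the xml_declaration argument.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as bytes) with one <url><loc>...</loc></url> per address."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in urls:
        url = ET.SubElement(urlset, "url")
        # Special characters such as & are escaped automatically on output.
        ET.SubElement(url, "loc").text = page
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


# Hypothetical page addresses, for illustration only.
pages = [
    "http://www.example.com/",
    "http://www.example.com/sing-praises-for-thesitewizard.com.html",
    "http://www.example.com/page?a=1&b=2",
]

print(build_sitemap(pages).decode("utf-8"))
```

To save the result instead of printing it, open a file such as "sitemap.xml" in binary mode ("wb") and write the returned bytes to it.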

If you are wondering "But where do I copy it to? Should I paste it in the <head> section or the <body> section?", it means you didn't read my instructions above. Close whatever program you have running that allowed you to see all those things and made you confused. Start up Notepad per my instructions. The window should be empty, without any content at all. Paste my example into that empty window. Then modify the lines as mentioned above.

Save the file under any name you like. Most people save it with a ".xml" file extension. If you don't have any particular preference, call it "sitemap.xml". If you use Notepad instead of a decent text editor, you should note the tips I gave in my article on how to save a file without the .txt extension in Notepad, otherwise you will encounter other problems.

Remember to update your robots.txt file as mentioned earlier to include the URL of your sitemap file, so that the search engines can learn of the existence of the file.

Note: a sitemap file cannot have more than 50,000 URLs (web addresses) nor be bigger than 50 MB (uncompressed). If yours is bigger than that, you'll have to create multiple sitemap files. Please see the sitemaps.org website for how this can be done.
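
To give a rough idea of what splitting involves, the Sitemap protocol lets you list several sitemap files in a "sitemap index" file, which uses sitemapindex and sitemap elements in place of urlset and url. The following Python sketch divides a list of addresses into groups of at most 50,000 and builds such an index. The function names, the 120,000 made-up pages and the file names sitemap-1.xml and so on are hypothetical. The sketch only deals with the URL count and does not check the 50 MB size limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50000  # per-file limit on the number of URLs


def chunk(urls, size=MAX_URLS):
    """Split the URL list into consecutive pieces of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(sitemap_urls):
    """Return a sitemap index (as bytes) listing the given sitemap file URLs."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for location in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = location
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)


# Hypothetical site with 120,000 pages, for illustration only.
pages = ["http://www.example.com/page-%d.html" % n for n in range(120000)]
parts = chunk(pages)
names = ["http://www.example.com/sitemap-%d.xml" % (i + 1) for i in range(len(parts))]

print([len(p) for p in parts])
print(build_index(names).decode("utf-8"))
```

Each group in parts would then be written to its own sitemap file in the usual urlset format. The index file is the one you would list in robots.txt.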

Conclusion: Dealing with Missing Pages in the Search Engine's Index

If you have pages on your website that seem to be omitted from the search engine indices, following the tips in this article will help you make sure that the search engines learn of all the pages on your website. Of course, whether they actually go about spidering and listing them is another matter. However, with the sitemap file, you can at least know that they are aware of all the available pages on your site.

Copyright © 2008-2018 by Christopher Heng. All rights reserved.
Get more free tips and articles like this, on web design, promotion, revenue and scripting, from https://www.thesitewizard.com/.

Please Do Not Reprint This Article

This article is copyrighted. Please do not reproduce or distribute this article in whole or part, in any form.
