Tracking Google index performance with XML sitemaps

Posted January 1st, 2010 in Google Webmaster Tools, SEO by Tim

I recently updated a script that generates sitemap XMLs for a web site I run. It’s a local review business directory with hundreds of thousands of pages, but with only around 150k in the Google index.

Like any database-driven site, those hundreds of thousands of pages can be boiled down to a few distinct templates. For me they are:

  1. Home page
  2. “Article” pages (FAQs, online help, about us, contact us, etc.)
  3. Locality page (e.g. best rated Sydney metro businesses)
  4. Industry + locality page (e.g. Sydney hairdressers)
  5. Business listing page (e.g. Toni & Guy Bondi Beach)

The bottom three templates account for the most unique pages (business listings have the most, then industry + locality pages, then locality pages) and make up 99% of organic traffic to the site.

Currently, my URL structure is pretty systematic, so using Google search filters such as “site:abc.com inurl:business_listing” I am able to get the total number of indexed pages on a template-by-template basis.

However, in the next few months I intend to improve some URLs from, say, /business-listing/toni-guy-bondi-beach/12345-54321.html to simply /toni-guy-bondi-beach/, which will make it impossible to track total indexed pages using my current method.

But if you’re in the same position there is a solution, which is so simple that I feel like a post dedicated to it is overkill. Because it’s a database-driven site, I store all the page aliases (e.g. “toni-guy-bondi-beach”) in my locality, industry and business listing database tables.

This allows me to update my script to write each template’s URLs to its own sitemap. Google Webmaster Tools will then show you the total # of URLs in each sitemap vs. the number indexed.
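For anyone who wants to try this, here is a minimal sketch in Python of splitting sitemaps by template. It is not my actual script: the function name build_sitemaps, the file naming scheme and the abc.com base URL are all made up for illustration, and it assumes you have already pulled the aliases for one template out of your database into a list. The sitemaps.org protocol caps each file at 50,000 URLs, so a big template has to be spread over several files.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50000  # per-file URL limit from the sitemaps.org protocol


def build_sitemaps(template_name, aliases, base_url="http://abc.com",
                   max_urls=MAX_URLS_PER_FILE):
    """Return {filename: xml_string} for one template.

    template_name, base_url and the filename pattern are hypothetical;
    aliases is a list of page aliases such as "toni-guy-bondi-beach".
    """
    files = {}
    for start in range(0, len(aliases), max_urls):
        chunk = aliases[start:start + max_urls]
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for alias in chunk:
            url = ET.SubElement(urlset, "url")
            # ElementTree escapes special characters such as & for us.
            ET.SubElement(url, "loc").text = f"{base_url}/{alias}/"
        filename = f"sitemap-{template_name}-{start // max_urls + 1}.xml"
        files[filename] = ET.tostring(urlset, encoding="unicode")
    return files
```

Calling build_sitemaps("business_listing", aliases) once per template gives you a separate set of files for each one, and each file can then be submitted on its own so its indexed count is reported separately.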

By dividing the number indexed by the total number of URLs in each sitemap, I can easily see which templates aren’t performing as well and then look for any obvious cause, such as the pages being considered duplicate content or an internal linking issue.
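The arithmetic is trivial, but as a rough sketch it could look like this in Python. The function name index_ratios is hypothetical, and the counts are assumed to be copied by hand from Webmaster Tools, so there is no API call here.

```python
def index_ratios(counts):
    """Take {template: (submitted, indexed)} and return a list of
    (template, ratio) pairs sorted with the worst-indexed template first."""
    ratios = []
    for template, (submitted, indexed) in counts.items():
        # Guard against an empty sitemap to avoid dividing by zero.
        ratio = indexed / submitted if submitted else 0.0
        ratios.append((template, ratio))
    return sorted(ratios, key=lambda item: item[1])
```

Feeding it something like {"locality": (submitted, indexed), "business_listing": (submitted, indexed)} puts the template with the lowest share of indexed pages at the top of the list, which is the one to investigate first.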

If you’re interested in learning more about large site indexation, there is an SEOmoz post by Rand on Google’s indexation cap which is an interesting read.