Increasing Search Indexing Coverage With an XML Sitemap
I just read Jeff Atwood’s post on Coding Horror about the importance of Sitemaps. I’m always eager to hear about people’s experiences since I spent so much time on XML Sitemaps and getting sitemaps.org launched while I was at Google. Sitemaps, of course, are supported by Google, Yahoo, and Live Search. All you have to do is reference the Sitemap location in your robots.txt file and all the engines will pick it up.
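For reference, the directive is a single line you can put anywhere in robots.txt (the URL here is just a placeholder for wherever your Sitemap actually lives):

```
Sitemap: http://www.example.com/sitemap.xml
```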
Atwood noted that he uses Google to search for his own stuff, which makes it that much more frustrating when some of the content isn’t indexed. (Not to mention of course, the lost visitor opportunities.) Once he created an XML Sitemap, Google started finding and indexing more of his pages. Yay!
However, he and his commenters had a few questions about the process, so I thought I’d take a few minutes to answer them. Of course, I don’t work for Google anymore, so these answers are entirely my own. If you want official answers, check out the Official Google Webmaster Help forum.
Why is Google having so much trouble crawling my dynamic site? Can’t Googlebot figure out my URL scheme? (I’m paraphrasing Atwood’s post here.)
I haven’t spent a lot of time studying stackoverflow.com (the site in question), but since Google is crawling and indexing the URLs after finding them in the Sitemap, the problem likely isn’t with the dynamic nature of the URLs themselves. The issue is probably that the internal linking structure doesn’t provide links to every single page. Since Googlebot crawls the web by following links, it wouldn’t know about the unlinked URLs. Atwood notes this possibility:
“On a Q&A site like Stack Overflow, only the most recent questions are visible on the homepage… I guess I was spoiled by my previous experience with blogs, which are almost incestuously hyperlinked, where everything ever posted has a permanent and static hyperlink attached to it, with simple monthly and yearly archive pages. With more dynamic websites, this isn’t necessarily the case.”
Of course, pages with few links to them (and particularly none from external sites) may not have substantial PageRank and therefore are unlikely to rank for anything other than long tail queries. But since the scenario Atwood describes is all about long tail queries (typing in the exact title of a page, for instance), getting those pages crawled and indexed is sufficient.
To dig a bit more into Atwood’s needs, he says, “It’s far easier to outsource the burden of search to Google and their legions of server farms than it is for our tiny development team to do it on our one itty-bitty server. At least not well.” If he’s looking to provide comprehensive search for his site’s visitors, he might consider Google’s Custom Search Engine (CSE). Generally, the CSE searches over what’s in the Google index. But if you’ve submitted a Sitemap, Google will maintain a CSE-specific index that contains any URLs from the Sitemap that aren’t in Google’s web search index. So the CSE could provide even better search results than a regular web search.
Why would Google put some URLs in the CSE-specific index and not the regular web index? Well, Google’s algorithms use lots of criteria to determine not only how to rank pages, but which pages to crawl and index in the first place. So if, for instance, Googlebot has crawled what it deems the maximum number of URLs from your site for the week for the web index (I’m oversimplifying a bit), it can still add the remainder to the CSE index.
It doesn’t sound very scalable. (John Topley)
You can easily write a script that regenerates the Sitemap each time the site is updated. And if your Sitemap reaches the maximum size (50,000 URLs or 10MB uncompressed, per the protocol), you can break it up into multiple Sitemaps automatically, or you can segment them by folder (or whatever organizational structure works best for you). If you want, you can even ping the search engines each time the Sitemap is updated, or you can just reference it in your robots.txt file as Atwood suggests and let them pick it up.
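For the curious, here’s a minimal sketch in Python of what such a script might look like. The get_all_urls() helper and the example.com URLs are placeholders for your own data layer; the 50,000-URL-per-file limit and the Sitemap index format come straight from the sitemaps.org protocol, and the ping URL shown is Google’s (Yahoo and Live Search publish their own).

```python
import urllib.parse
import urllib.request
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50000  # limit set by the sitemaps.org protocol


def write_sitemaps(urls, base="http://www.example.com"):
    """urls is a list of (url, last_modified_datetime) tuples,
    e.g. from a hypothetical get_all_urls() query against your database."""
    # Split the full URL list into as many Sitemap files as needed.
    chunks = [urls[i:i + MAX_URLS_PER_SITEMAP]
              for i in range(0, len(urls), MAX_URLS_PER_SITEMAP)]
    for n, chunk in enumerate(chunks, start=1):
        with open(f"sitemap{n}.xml", "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for url, lastmod in chunk:
                f.write("  <url>\n")
                f.write(f"    <loc>{escape(url)}</loc>\n")
                f.write(f"    <lastmod>{lastmod:%Y-%m-%d}</lastmod>\n")
                f.write("  </url>\n")
            f.write("</urlset>\n")
    # If more than one file was needed, tie them together with a Sitemap index.
    if len(chunks) > 1:
        with open("sitemap_index.xml", "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for n in range(1, len(chunks) + 1):
                f.write(f"  <sitemap><loc>{base}/sitemap{n}.xml</loc></sitemap>\n")
            f.write("</sitemapindex>\n")


def ping_google(sitemap_url):
    """Optionally notify Google that the Sitemap has changed."""
    ping = ("http://www.google.com/webmasters/tools/ping?sitemap="
            + urllib.parse.quote(sitemap_url, safe=""))
    urllib.request.urlopen(ping)
```

Hook write_sitemaps() into whatever publish or update event your site already fires, and the Sitemap stays current without anyone thinking about it.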
How do you determine change frequency? (John Topley)
If your script can determine this, then you can set it up programmatically. Otherwise, I’d skip this attribute and just concentrate on listing the URLs.
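If your data does support it, a heuristic as simple as this works; the cutoffs here are my own arbitrary assumptions, not anything from the protocol (which allows always, hourly, daily, weekly, monthly, yearly, and never):

```python
from datetime import datetime, timedelta


def guess_changefreq(last_modified, now=None):
    """Derive a changefreq hint from how recently a page was edited."""
    age = (now or datetime.utcnow()) - last_modified
    if age < timedelta(days=1):
        return "daily"
    if age < timedelta(days=7):
        return "weekly"
    if age < timedelta(days=30):
        return "monthly"
    return "yearly"
```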
I think google is not happy with the “dynamic” parts of the url e.g. “?” or “&” (Marcel Sauer)
Google does fine with dynamic URLs. Googlebot can run into trouble if the dynamic nature of a site leads to things like infinite URL spaces, lots of URLs that display the same page, crazy parameters, or recursive redirects, but as I noted above, the trouble tends not to be with the URLs themselves; it’s that they aren’t always well linked.