Posts Tagged ‘Sitemaps’

Increasing Search Indexing Coverage With an XML Sitemap

October 13, 2008

I just read Jeff Atwood’s post on Coding Horror about the importance of Sitemaps. I’m always eager to hear about people’s experiences since I spent so much time on XML Sitemaps and getting sitemaps.org launched while I was at Google. Sitemaps, of course, are supported by Google, Yahoo, and Live Search. All you have to do is reference the Sitemap location in your robots.txt file and all the engines will pick it up.

Atwood noted that he uses Google to search for his own stuff, which makes it that much more frustrating when some of the content isn’t indexed. (Not to mention of course, the lost visitor opportunities.) Once he created an XML Sitemap, Google started finding and indexing more of his pages. Yay!

However, he and his commenters had a few questions about the process, so I thought I’d take a few minutes to answer them. Of course, I don’t work for Google anymore, so these answers are entirely my own. If you want official answers, check out the Official Google Webmaster Help forum.

Why is Google having so much trouble crawling my dynamic site? Can’t Googlebot figure out my URL scheme? (I’m paraphrasing Atwood’s post here.)
I haven’t spent a lot of time studying stackoverflow.com (the site in question), but since Google is crawling and indexing the URLs after finding them in the Sitemap, the problem likely isn’t with the dynamic nature of the URLs themselves. The issue is probably that the internal linking structure doesn’t provide links to every single page. Since Googlebot crawls the web by following links, it wouldn’t know about the unlinked URLs. Atwood notes this possibility:

“On a Q&A site like Stack Overflow, only the most recent questions are visible on the homepage… I guess I was spoiled by my previous experience with blogs, which are almost incestuously hyperlinked, where everything ever posted has a permanent and static hyperlink attached to it, with simple monthly and yearly archive pages. With more dynamic websites, this isn’t necessarily the case.”

Of course, pages with few links to them (particularly with no external links) may not have substantial PageRank and are therefore unlikely to rank for anything other than long tail queries. But since the scenario Atwood describes is all about long tail queries (typing in the exact title of a page, for instance), getting those pages crawled and indexed is sufficient.

To dig a bit more into Atwood’s needs, he says, “It’s far easier to outsource the burden of search to Google and their legions of server farms than it is for our tiny development team to do it on our one itty-bitty server. At least not well.” If he’s looking to provide comprehensive search for visitors of his site, he might consider Google’s custom search engine (CSE). Generally, the CSE searches over what’s in the Google index. But if you’ve submitted a Sitemap, Google will maintain a CSE-specific index that contains any URLs from the Sitemap that aren’t in Google’s web search index. So, the CSE could provide even better search results than a regular web search.

Why would Google put some URLs in the CSE-specific index and not the regular web index? Well, Google’s algorithms use lots of criteria for determining not only how to rank pages, but what pages to crawl and index as well. So, if, for instance, Googlebot has crawled what it’s deemed the maximum number of URLs from your site for the week for the web index (I’m over-simplifying here a bit), it can still add the remainder to the CSE index.

It doesn’t sound very scalable. (from John Topley)
You can easily write a script that updates the Sitemap each time the site is updated. And if your Sitemap reaches the maximum size, you can break it up into multiple Sitemaps automatically or you can segment them by folder (or whatever organizational structure works best for you). If you want, you can even ping the search engines each time the Sitemap is updated, or you can just reference it in your robots.txt file as Atwood suggests and let them pick it up.
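
A minimal sketch of such a script, in Python, might look like the following. It assumes you already have a list of your site's URLs; the file naming scheme (sitemap-1.xml and so on), the example.com address, and the function names are all made up for illustration. It splits the list into files of at most 50,000 URLs and builds a Sitemap Index that references them. It does not check the protocol's per-file size limit, and it does not ping anyone.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit in the sitemaps.org protocol


def build_sitemap(urls):
    """Return the XML text of one Sitemap file listing the given URLs."""
    # escape() turns & into &amp; so dynamic URLs stay valid XML.
    entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<urlset xmlns="{SITEMAP_NS}">\n{entries}</urlset>\n'
    )


def build_sitemap_index(sitemap_urls):
    """Return the XML text of a Sitemap Index pointing at each Sitemap."""
    entries = "".join(
        f"  <sitemap><loc>{escape(u)}</loc></sitemap>\n" for u in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{SITEMAP_NS}">\n{entries}</sitemapindex>\n'
    )


def split_into_sitemaps(urls, base_url, max_urls=MAX_URLS):
    """Chunk urls into Sitemap files; return ({filename: xml}, index_xml)."""
    files = {}
    for start in range(0, len(urls), max_urls):
        # Hypothetical naming scheme: sitemap-1.xml, sitemap-2.xml, ...
        name = f"sitemap-{start // max_urls + 1}.xml"
        files[name] = build_sitemap(urls[start:start + max_urls])
    index_xml = build_sitemap_index(f"{base_url}/{name}" for name in files)
    return files, index_xml
```

You would call split_into_sitemaps(all_urls, "http://www.example.com") whenever the site changes, write each entry of the returned dictionary to disk, and reference the index file from robots.txt.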

How do you determine change frequency? (John Topley)
If your script can determine this, then you can set it up programmatically. Otherwise, I’d skip this optional element and just concentrate on listing the URLs.
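
If you do have a modification history for each page, one rough way to determine it programmatically is to look at the average gap between edits. The sketch below is only a heuristic of my own, not anything the protocol or the search engines prescribe: the thresholds and the function name are assumptions, and the return values are a subset of the changefreq values that sitemaps.org defines.

```python
from datetime import datetime, timedelta


def estimate_changefreq(modified_times):
    """Map the average gap between edits to a changefreq value.

    Returns None when there is too little history, in which case the
    element is best left out of the Sitemap entirely.
    """
    times = sorted(modified_times)
    if len(times) < 2:
        return None
    average_gap = (times[-1] - times[0]) / (len(times) - 1)
    # Assumed thresholds; tune them to how your own site behaves.
    thresholds = [
        (timedelta(hours=1), "hourly"),
        (timedelta(days=1), "daily"),
        (timedelta(weeks=1), "weekly"),
        (timedelta(days=31), "monthly"),
    ]
    for limit, label in thresholds:
        if average_gap <= limit:
            return label
    return "yearly"


# A page edited every three days falls into the "weekly" bucket.
edits = [datetime(2008, 10, 1) + timedelta(days=3 * i) for i in range(5)]
print(estimate_changefreq(edits))
```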

I think google is not happy with the “dynamic” parts of the url e.g. “?” or “&” (Marcel Sauer)
Google does fine with dynamic URLs. They can have trouble if the dynamic nature of the site leads to things like infinite URLs, lots of URLs that display the same page, crazy parameters, or recursive redirects, but as I noted above, the trouble tends not to be with the URLs themselves, but the fact that they aren’t always well-linked.

Is PageRank the Ultimate Measure of Online Influence?

October 8, 2008

Steve Rubel recently wrote a blog post about measuring online influence. He concluded that Google PageRank is the ultimate way to measure online influence.

I completely disagree.

I agree with him that we need better measures and the ones that we have are looking through a glass darkly (at best), but PageRank is probably one of the worst measures around. John Mueller asked me if I ever worried about PageRank, so I can answer that question while I explain why I disagree so much with Rubel.

What is PageRank anyway?
First, a bit of explanation about PageRank. Two entirely different things are called “PageRank”. There’s the Google toolbar PageRank, which is represented by integers 1 through 10, and then there’s the internal PageRank number that Google uses as one of its many (hundreds of) ranking factors. When I say PageRank doesn’t matter (and I say it a lot), I’m talking about the toolbar PageRank. The internal PageRank that Google uses does matter, but it’s far from the only thing that matters.

At the simplest level, PageRank (both toolbar and internal) is a measure of a page’s link popularity. How many links does a page have and how authoritative are those links?
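
To make "how many links, and how authoritative" concrete, here is a toy version of the commonly published, simplified PageRank calculation. It is a sketch of the textbook formulation only, not what Google runs internally; the damping value of 0.85, the iteration count, and the tiny link graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank. links maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page site: nothing links to "orphan".
site = {"home": ["a", "b"], "a": ["home"], "b": ["home"], "orphan": ["home"]}
ranks = pagerank(site)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the home page ends up with the highest score because every other page links to it, and the page nothing links to ends up with only the baseline share.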

Why do I think PageRank doesn’t matter?
Rubel is talking about the toolbar PageRank in his post. So why do I say it doesn’t matter while he says it’s the “ultimate measure”?

  • It’s updated infrequently. As Matt Cutts has said, it’s updated every “few months”. So, it’s generally pretty stale data. When you see a 5, there’s really no way of knowing if the site is currently a 5 or was a 5 two months ago but is now a 7. Or a 2. (“Real” PageRank is computed continually.)
  • It’s not very accurate. The internal PageRank is not an integer number 1 – 10. It’s something much more precise. So even without the staleness problem, there’s still an accuracy problem.
  • It can easily be gamed. Link schemes, link exchanges, and paid links have been around for a long time. Google is always working to be one step ahead, but these techniques can work for a time.
  • Link builders have an advantage. Certainly savvy SEOs and link builders know how to get quality links. One site could have more online influence and engagement but just not have an owner who knows about link building.
  • The toolbar number may be obfuscated. Google has to maintain a delicate balance of giving as much information as possible to web site owners, while not giving away enough to let spammers impact the quality of search results. This was one of the hardest parts of my job when I ran Google Webmaster Central. I talked to B&B owners who just wanted people to know their inns existed. And I talked to black hats who used every loophole to get their viagra sites on the first page. The Official Google Webmaster Central blog talked about obfuscation that Google did late last year. In this particular case, sites were selling text links that weren’t marked as advertising and their major selling point was the high PageRank of the site. By reducing the visible PageRank, those sites could not as easily sell links.
  • PageRank doesn’t necessarily correlate to ranking. Matt mentioned this recently on Sphinn, when he said “Even if you don’t show much PageRank, Google still has 200+ other signals we use in our ranking. It’s definitely common to see lower-PageRank sites ranking above higher-PageRank sites–which tends confuses the people who obsess too much about PageRank and who don’t focus on other factors that search engines might use to rank pages”.

Why does Rubel think PageRank is the “ultimate”?
Rubel sees things a little differently. He said:

  • “Page Rank is something you earn by producing high quality content that people link to” – Unfortunately, that’s not entirely correct. An average piece of content might get lots of links via a link builder, or if the person writing the content is popular, or even if the person writing the content is universally hated and lots of people link to the content to trash it. A piece of high quality content may be very engaging and may influence a lot of people, but those people may not be linkers by default. And if what you’re really looking to do is measure what content gets the most links, based on the argument that something with a lot of links has a lot of influence because the links themselves raise awareness about that content, then use Yahoo! Site Explorer, which will give you up-to-date and accurate link counts. Don’t use a rounded, out of date number that approximates link counts.
  • “It enables you to influence people on the Internet’s biggest stage – Google – and just as people are searching for the topics you are knowledgeable about. This means it amplifies your influence because the press start at search engines when researching stories” – As noted above, PageRank is one of more than a hundred factors in determining ranking. It happens all the time that a site with a lower toolbar PageRank will rank above high PageRank sites. Ranking isn’t just about link quantity. It’s about crawlability, extractability, quality content, link quality, anchor text…. Well, a lot of things.
  • “Page Rank is channel agnostic and takes the entire online ecosystem into account. It judges you based on links from all kinds of sources, not just people who live in the same fish tank. In other words, it goes beyond people who hang out on Twitter who love people who Tweet or bloggers who link to other bloggers, etc. It eschews the echo chamber” – Again, not exactly. It may eschew the echo chamber but it rewards a savvy link builder. And some audiences are more likely to link than others. For instance, marketing blogs link out all the time. Recipe blogs are getting better at linking. But some newspapers don’t link at all, or provide the link as text. Some audiences aren’t the type to have sites from which they can link, so you can only see their involvement through things like comments and subscriber numbers. And for some audiences that do control sites, linking just doesn’t cross their minds. It’s not something they think about doing.

I’m not the first person to disagree with Rubel on this. Michael Gray mentioned it on Twitter and Rubel replied “I know Page Rank is not perfect. But it determines your footprint on Google and that’s why it’s the ultimate influence metric.”

If there’s one thing that PageRank is not, it’s the determination of your Google footprint. The internal “real” PageRank isn’t even that. Lots of things go into determining your Google footprint. His discussion in the comments goes further down this path of misunderstanding what PageRank is. He agrees with someone in the comments who says that “PageRank is the sum of all other measurements.” It’s not. It’s one measurement added in with a whole bunch of others.

Others in the comments do point this out. In fact, James Joyner said “My site has gone from PR7 to PR4 for no apparent reason. At the same time, my visitors, commenters, and social media followers have gone up. My content gets syndicated at Newsweek. It’s included in Google News, for goodness sakes. But my PR has plummeted. Oddly, however, my Google traffic has not.”

So how do you measure online influence?
But what of Rubel’s actual question? How do you measure online influence? I would ask why you want to measure it. I spoke at the eMetrics summit a few months ago and the big discussion was around measuring engagement. But what is actionable about that measure, even if you are able to track it down?

It could be that the measure is different depending on your goal. If you’re Coke and you want to sell more soft drinks, then the only measure you care about may be increased sales.

Rubel mentions that unique visitor counts are largely empty numbers as hordes of visitors might come from search but then leave immediately. Well, sure. That’s why you have to measure bounce rate. And conversion. And understand that the goal isn’t to rank #1 in Google and get a lot of traffic, it’s to rank highly for search queries that your customers who want to buy your products are doing. But I’ve talked about that before.

If you’re a blogger and you don’t sell anything, then you might care about getting more readers. Or you might make money from advertising. Or maybe you want to get famous so a big time magazine wants you to write for them. “Online influence” is a nebulous term, at best.

I do agree that we need better measures. That we’re overwhelmed with numbers and we don’t know what’s actionable or useful. And I think you can measure the impact and value of things like social media that don’t correlate directly to sales. These are the things I spend a lot of my time thinking about these days. But I’m pretty sure toolbar PageRank is not that magic measure.

Google Moderator Beta: Ask a Google Engineer

September 30, 2008

A few days ago, I noticed that Zurich-based Googler John Mueller posted a Twitter link to Google’s new Moderator application and invited everyone to ask Google engineers, such as Matt Cutts, questions. I asked Matt to bring me some frozen yogurt from the Google cafe. As Google Moderator takes advantage of the wisdom of crowds, my question soon plummeted to last place as apparently I’m the only one interested in my need for icy treats and everyone else (er, at least 86 people) voted that they didn’t like my question. (I don’t mind being voted down; I just wish I at least had some frozen yogurt!)

Matt blogged about this tool, explaining that it has been available internally at Google for a while and is a great way to prioritize questions.

If only it were searchable
It seems like a pretty cool tool. The biggest drawback I see to using it is that none of the content gets indexed, so unlike something like Yahoo! Answers, any work you put into asking or answering questions can’t be found later by those searching for that information. Why can no content be indexed, you ask? For one thing, the content is entirely in JavaScript. With JavaScript turned off, all you see is a message that says:

Google Moderator is a tool that allows distributed communities to submit and vote on questions for talks, presentations, and events. You must have JavaScript enabled in order to use this feature.

You can see that Google is indexing what’s in the noscript tag by checking out the search result:

[Screenshot: Google search result for Google Moderator]

Huh.

The other problem with indexing is that every URL is differentiated by characters that begin with a #. Even if the content did load without JavaScript, search engines would see every URL as moderator.appspot.com, since they drop everything after a # (since traditional web practices dictate that the # in a URL indicate an anchor point within the existing page).
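
You can see the effect with a few lines of Python. The standard library's urldefrag function strips the fragment in the same way; the question-1 and question-2 fragments below are invented for illustration and are not Moderator's real URL format.

```python
from urllib.parse import urldefrag

# Hypothetical fragment-style URLs; the part after # is made up.
urls = [
    "http://moderator.appspot.com/#question-1",
    "http://moderator.appspot.com/#question-2",
    "http://moderator.appspot.com/",
]


def crawlable_urls(urls):
    """Return the distinct URLs that remain once fragments are dropped."""
    return sorted({urldefrag(url).url for url in urls})


print(crawlable_urls(urls))
```

All three addresses collapse into the single URL http://moderator.appspot.com/, which is why the individual questions can't be indexed as separate pages.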

As a sidenote on how ensuring your site is search-friendly can help usability, note that these issues also keep the back button from working in the browser, so if you read an answer, you can’t easily get back to the list of questions.

(The site has other, more minor, search issues, such as that the logo doesn’t link to the home page and there’s no meta description, but really, fixing those issues would be like using a thimble to bail out a sinking boat with a hole the size of a bowling ball in the bottom of it.)

Go ahead, ask me a question
I started a question series to test out the system, and I will answer the questions that are voted to the top, but I’m likely to answer them here (and post links to the answers there) because of that er, minor indexing issue. Feel free to ask a question there and test things out yourself.

Answering questions to Matt
I’m not Matt, nor do I play him on TV, but I did see a few questions to him that I thought I’d steal away to answer. (Although he has answered quite a few already himself!)

Q: What’s the best way to get a count of indexed pages in Google? Last I checked, Webmaster Tools just links to the standard “site:” operator. Various query tricks have had different levels of success in the past, but none have been reliable. (Nick, Chicago)

I love this question because I get to talk up Google Webmaster Central. A reliable count does indeed exist! Simply create an XML Sitemap that includes a comprehensive and accurate list of the pages you would like indexed. Alternately, you could create several Sitemaps, each with a different set of pages you want to track. If you want to track more than 50,000 URLs, simply add multiple Sitemaps to one Sitemap Index file. Submit the Sitemap or Sitemap Index file to Google Webmaster Tools and check back after it’s been processed. The Sitemaps tab displays not only the Sitemap URL count, but also tells you the number of URLs from that Sitemap that have been indexed. You can track this number over time to measure indexing coverage.

[Screenshot: Sitemap URL counts in Google Webmaster Tools]

It’s a pretty handy trick and much more accurate than the site: operator.

Q: Is Google looking for a true solution to deal with duplicate content between UK & US Websites own by the same company? (François, Brussels) and How will Google Identifies that Particular Website belongs to particular location even when it is hosted in US, and uses that Data to show that website to users of country for which it is appropriate? (Cold, Jaipur India)
I can’t speak for what Google is looking to do, but I do know that generally search engines filter duplicate content and show the most relevant version to the searcher. So, in the case of US and UK content, Google would look to show the US version to the US searcher and the UK version to the UK searcher. It wouldn’t generally look to show both versions in a single search result. Google figures out which is more relevant to the searcher using things like the searcher’s geographic location (based on IP address) and whether the searcher is using google.com or google.co.uk.

You can provide signals to Google about the content using TLD (putting the US content on yoursite.com and the UK content on yoursite.co.uk) or domains hosted in the target country (yoursiteus.com hosted in the US and yoursiteuk.com hosted in the UK), segmenting the content into subdomains or subfolders and then specifying the target country for each in Google Webmaster Tools (us.yoursite.com associated with US and uk.yoursite.com associated with UK), and using the meta language element. (To be honest, I’m not entirely sure about the meta country tag. Anyone have experience with this?)

[Screenshot: geographic target setting in Google Webmaster Tools]

So, back to the tool
In particular, I think this tool could be really handy for Q&A at conferences. I’ve spoken at several conferences (including SMX and Web 2.0 Expo) where attendees could use an online system to submit questions, and I’ve used Twitter for this a few times, but this is the first system I’ve seen that also lets the rest of the audience vote questions up or down. I may have to try it out for my next event.

Sitemaps.org Update: You Can Now Store Your XML Sitemap Files Anywhere!

February 27, 2008

The major search engines have announced an update to the sitemaps.org protocol which enables site owners to store their XML Sitemap files in any location — even on a different domain than the one referenced in the Sitemap. This will be a welcome change for those who manage multiple domains and would like to keep all Sitemap files in one place, as well as for those who would like to store their Sitemap in a location other than the root.

The only caveat? You have to be able to edit the robots.txt file of the domain the Sitemap file references.
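
As a rough illustration, the sketch below reads the Sitemap lines from a robots.txt body and reports whether each Sitemap is stored on a different host than the site it describes. The host names and the robots.txt content are placeholders, and the helper functions are my own; the only library call it relies on is the site_maps() method of Python's urllib.robotparser (available from Python 3.8).

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt served by www.example.com. It points at a
# Sitemap stored on a different (also hypothetical) host.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://sitemaps.example.net/example-com.xml
"""


def declared_sitemaps(robots_txt):
    """Return the Sitemap URLs declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when there are no Sitemap lines.
    return parser.site_maps() or []


def is_cross_host(sitemap_url, site_host):
    """True if the Sitemap file lives on a different host than the site."""
    return urlsplit(sitemap_url).hostname != site_host


for url in declared_sitemaps(ROBOTS_TXT):
    where = "another host" if is_cross_host(url, "www.example.com") else "the same host"
    print(f"{url} is stored on {where}")
```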

Read more on Search Engine Land
