In an earlier post, I said that key to government opening its data to citizens, being more transparent, and improving the relationship between citizens and government in light of our web 2.0 world was ensuring content on government sites could be easily found in search engines. Architecting sites to be search engine friendly, particularly sites with as much content and legacy code as those the government manages, can be a resource-intensive process that takes careful long-term planning. But two keys are:
Thursday Congressman Honda asked, “how can congress take advantage of web 2.0 technologies to transform the relationship between citizens and government?” He noted that “A dramatic shift in perspective is needed before that need can be met. Instead of databases becoming available as a result of Freedom Of Information Act requests, government officials should be required to justify why any public data should not be freely available to the taxpayers who paid for its creation.” He asked for input on what web 2.0 features he should add to his website to take advantage of today’s online world.
The most important feature government web sites can add isn’t really a feature at all. But it would absolutely transform the relationship between citizens and government and make an amazing array of public data available. What’s this magic feature?
Make government web sites search engine friendly.
This category contains fundamental building blocks about search engine optimization (SEO). You might also want to check out the following blog posts I’ve written:
SEO Basics
Linking
Technical Development and Troubleshooting
SEO Commentary
I just read Jeff Atwood’s post on Coding Horror about the importance of Sitemaps. I’m always eager to hear about people’s experiences since I spent so much time on XML Sitemaps and getting sitemaps.org launched while I was at Google. Sitemaps, of course, are supported by Google, Yahoo, and Live Search. All you have to do is reference the Sitemap location in your robots.txt file and all the engines will pick it up.
Atwood noted that he uses Google to search for his own stuff, which makes it that much more frustrating when some of the content isn’t indexed. (Not to mention of course, the lost visitor opportunities.) Once he created an XML Sitemap, Google started finding and indexing more of his pages. Yay!
However, he and his commenters had a few questions about the process, so I thought I’d take a few minutes to answer them. Of course, I don’t work for Google anymore, so these answers are entirely my own. If you want official answers, check out the Official Google Webmaster Help forum.
Why is Google having so much trouble crawling my dynamic site? Can’t Googlebot figure out my URL scheme? (I’m paraphrasing Atwood’s post here.)
I haven’t spent a lot of time studying stackoverflow.com (the site in question), but since Google is crawling and indexing the URLs after finding them in the Sitemap, the problem likely isn’t with the dynamic nature of the URLs themselves. The issue is probably that the internal linking structure doesn’t provide links to every single page. Since Googlebot crawls the web by following links, it wouldn’t know about the unlinked URLs. Atwood notes this possibility:
“On a Q&A site like Stack Overflow, only the most recent questions are visible on the homepage… I guess I was spoiled by my previous experience with blogs, which are almost incestuously hyperlinked, where everything ever posted has a permanent and static hyperlink attached to it, with simple monthly and yearly archive pages. With more dynamic websites, this isn’t necessarily the case.”
Of course, pages with few links to them (particularly no external links) may not have substantial PageRank and therefore are unlikely to rank for anything other than long tail queries. But since the scenario Atwood describes is all about long tail queries (typing in the exact title of a page, for instance), getting those pages crawled and indexed is sufficient.
To dig a bit more into Atwood’s needs, he says, “It’s far easier to outsource the burden of search to Google and their legions of server farms than it is for our tiny development team to do it on our one itty-bitty server. At least not well.” If he’s looking to provide comprehensive search for visitors of his site, he might consider Google’s custom search engine (CSE). Generally, the CSE searches over what’s in the Google index. But if you’ve submitted a Sitemap, Google will maintain a CSE-specific index that contains any URLs from the Sitemap that aren’t in Google’s web search index. So, the CSE could provide even better search results than a regular web search.
Why would Google put some URLs in the CSE-specific index and not the regular web index? Well, Google’s algorithms use lots of criteria for determining not only how to rank pages, but what pages to crawl and index as well. So, if, for instance, Googlebot has crawled what it’s deemed the maximum number of URLs from your site for the week for the web index (I’m over-simplifying here a bit), it can still add the remainder to the CSE index.
It doesn’t sound very scalable. (from John Topley)
You can easily write a script that updates the Sitemap each time the site is updated. And if your Sitemap reaches the maximum size, you can break it up into multiple Sitemaps automatically or you can segment them by folder (or whatever organizational structure works best for you). If you want, you can even ping the search engines each time the Sitemap is updated, or you can just reference it in your robots.txt file as Atwood suggests and let them pick it up.
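Here's a minimal sketch of what such a script might look like, assuming you can already produce a list of your site's URLs. The function names, the file names (sitemap-1.xml, sitemap-index.xml), and the example.com URLs are all made up for illustration; the 50,000-URL limit and the XML namespace come from the sitemaps.org protocol.

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit in the sitemaps.org protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into consecutive groups of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap(urls):
    """Return the XML for one Sitemap file listing the given URLs."""
    entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<urlset xmlns="{SITEMAP_NS}">\n{entries}</urlset>\n'
    )


def build_sitemap_index(sitemap_urls):
    """Return the XML for a Sitemap Index file that points at several Sitemaps."""
    entries = "".join(
        f"  <sitemap><loc>{escape(u)}</loc></sitemap>\n" for u in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{SITEMAP_NS}">\n{entries}</sitemapindex>\n'
    )


def build_all(urls, base_url, size=MAX_URLS_PER_SITEMAP):
    """Return {filename: xml}; an index file is added only when the URLs
    do not fit in a single Sitemap. File names are hypothetical."""
    files = {}
    for n, group in enumerate(chunk(urls, size), start=1):
        files[f"sitemap-{n}.xml"] = build_sitemap(group)
    if len(files) > 1:
        sitemap_names = list(files)
        files["sitemap-index.xml"] = build_sitemap_index(
            [f"{base_url}/{name}" for name in sitemap_names]
        )
    return files


def robots_txt_line(sitemap_url):
    """The line to add to robots.txt so the engines can discover the Sitemap."""
    return f"Sitemap: {sitemap_url}"
```

You'd call build_all() from whatever hook runs when content is published, write each returned file to the web root, and put the output of robots_txt_line() in robots.txt once. Note that escape() matters for dynamic URLs, since a raw & in a URL isn't valid inside XML.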
How do you determine change frequency? (John Topley)
If your script can determine this, then you can set it up programmatically. Otherwise, I’d skip this attribute and just concentrate on listing the URLs.
I think google is not happy with the “dynamic” parts of the url e.g. “?” or “&” (Marcel Sauer)
Google does fine with dynamic URLs. They can have trouble if the dynamic nature of the site leads to things like infinite URLs, lots of URLs that display the same page, crazy parameters, or recursive redirects, but as I noted above, the trouble tends not to be with the URLs themselves, but the fact that they aren’t always well-linked.
Steve Rubel recently wrote a blog post about measuring online influence. He concluded that Google PageRank is the ultimate way to measure online influence.
I completely disagree.
I agree with him that we need better measures and the ones that we have are looking through a glass darkly (at best), but PageRank is probably one of the worst measures around. John Mueller asked me if I ever worried about PageRank, so I can answer that question while I explain why I disagree so much with Rubel.
What is PageRank anyway?
First, a bit of explanation about PageRank. Two entirely different things are called “PageRank”. There’s the Google toolbar PageRank, which is represented by integers 1 through 10, and then there’s the internal PageRank number that Google uses as one of its many (hundreds of) ranking factors. When I say PageRank doesn’t matter (and I say it a lot), I’m talking about the toolbar PageRank. The internal PageRank that Google uses does matter, but it’s far from the only thing that matters.
At the simplest level, PageRank (both toolbar and internal) is a measure of a page’s link popularity. How many links does a page have and how authoritative are those links?
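To make "how many links, and how authoritative" concrete, here's a rough sketch of the textbook PageRank calculation from the original published paper, run on a tiny made-up link graph. This is only the classroom version: it is not how Google computes its internal number today, and the page names, the 0.85 damping default, and the iteration count are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by repeated redistribution of scores.

    `links` maps each page to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score evenly among its outlinks,
                # so a link from a high-scoring page is worth more.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its score over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page site: everything links to "home", nothing links to "orphan".
graph = {
    "home": ["a", "b"],
    "a": ["home"],
    "b": ["home"],
    "orphan": ["home"],
}
scores = pagerank(graph)
```

In this toy graph, "home" ends up with the highest score because every other page links to it, and "orphan" ends up with the lowest because nothing links to it.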
Why do I think PageRank doesn’t matter?
Rubel is talking about the toolbar PageRank in his post. So why do I say it doesn’t matter while he says it’s the “ultimate measure”?
Why does Rubel think PageRank is the “ultimate”?
Rubel sees things a little differently. He said:
I’m not the first person to disagree with Rubel on this. Michael Gray mentioned it on Twitter and Rubel replied “I know Page Rank is not perfect. But it determines your footprint on Google and that’s why it’s the ultimate influence metric.”
If there’s one thing that PageRank is not, it’s the determination of your Google footprint. The internal “real” PageRank isn’t even that. Lots of things go into determining your Google footprint. His discussion in the comments goes further down this path of misunderstanding what PageRank is. He agrees with someone in the comments who says that “PageRank is the sum of all other measurements.” It’s not. It’s one measurement added in with a whole bunch of others.
Others in the comments do point this out. In fact, James Joyner said “My site has gone from PR7 to PR4 for no apparent reason. At the same time, my visitors, commenters, and social media followers have gone up. My content gets syndicated at Newsweek. It’s included in Google News, for goodness sakes. But my PR has plummeted. Oddly, however, my Google traffic has not.”
So how do you measure online influence?
But what of Rubel’s actual question? How do you measure online influence? I would ask why you want to measure it. I spoke at the eMetrics summit a few months ago and the big discussion was around measuring engagement. But what is actionable about that measure, even if you are able to track it down?
It could be that the measure is different depending on your goal. If you’re Coke and you want to sell more soft drinks, then the only measure you care about may be increased sales.
Rubel mentions that unique visitor counts are largely empty numbers as hordes of visitors might come from search but then leave immediately. Well, sure. That’s why you have to measure bounce rate. And conversion. And understand that the goal isn’t to rank #1 in Google and get a lot of traffic, it’s to rank highly for search queries that your customers who want to buy your products are doing. But I’ve talked about that before.
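As a rough illustration, here's how you might compute bounce rate and conversion rate per search query from session data. The session records and field names (query, pageviews, converted) are hypothetical, and I'm assuming the common definition of a bounce as a visit that views only one page; your analytics package may define it differently.

```python
def bounce_rate(sessions):
    """Share of sessions that viewed exactly one page (assumed definition)."""
    if not sessions:
        return 0.0
    bounces = sum(1 for s in sessions if s["pageviews"] == 1)
    return bounces / len(sessions)


def conversion_rate(sessions):
    """Share of sessions that ended in a conversion (sale, signup, etc.)."""
    if not sessions:
        return 0.0
    conversions = sum(1 for s in sessions if s["converted"])
    return conversions / len(sessions)


def by_query(sessions):
    """Group sessions by the search query that brought the visitor in."""
    groups = {}
    for s in sessions:
        groups.setdefault(s["query"], []).append(s)
    return {
        query: {
            "visits": len(group),
            "bounce_rate": bounce_rate(group),
            "conversion_rate": conversion_rate(group),
        }
        for query, group in groups.items()
    }


# Made-up sample data for illustration only.
sessions = [
    {"query": "tour de france", "pageviews": 1, "converted": False},
    {"query": "tour de france", "pageviews": 1, "converted": False},
    {"query": "buy road bike", "pageviews": 4, "converted": True},
    {"query": "buy road bike", "pageviews": 1, "converted": False},
]
report = by_query(sessions)
```

In the sample data both queries bring the same number of visits, but only one of them brings visitors who stay and buy, which is the distinction a raw visitor count hides.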
If you’re a blogger and you don’t sell anything, then you might care about getting more readers. Or you might make money from advertising. Or maybe you want to get famous so a big time magazine wants you to write for them. “Online influence” is a nebulous term, at best.
I do agree that we need better measures. That we’re overwhelmed with numbers and we don’t know what’s actionable or useful. And I think you can measure the impact and value of things like social media that don’t correlate directly to sales. These are the things I spend a lot of my time thinking about these days. But I’m pretty sure toolbar PageRank is not that magic measure.
A few days ago, I noticed that Zurich-based Googler John Mueller posted a Twitter link to Google’s new Moderator application and invited everyone to ask Google engineers, such as Matt Cutts, questions. I asked Matt to bring me some frozen yogurt from the Google cafe. As Google Moderator takes advantage of the wisdom of crowds, my question soon plummeted to last place as apparently I’m the only one interested in my need for icy treats and everyone else (er, at least 86 people) voted that they didn’t like my question. (I don’t mind being voted down; I just wish I at least had some frozen yogurt!)
Matt blogged about this tool, explaining that it has been available internally at Google for a while and is a great way to prioritize questions.
If only it were searchable
It seems like a pretty cool tool. The biggest drawback I see to using it is that none of the content gets indexed, so unlike something like Yahoo! Answers, any work you put into asking or answering questions can’t be found later by those searching for that information. Why can no content be indexed, you ask? For one thing, the content is entirely in JavaScript. With JavaScript turned off, all you see is a message that says:
Google Moderator is a tool that allows distributed communities to submit and vote on questions for talks, presentations, and events. You must have JavaScript enabled in order to use this feature.
You can see that Google is indexing what’s in the noscript tag by checking out the search result:
Huh.
The other problem with indexing is that every URL is differentiated by characters that begin with a #. Even if the content did load without JavaScript, search engines would see every URL as moderator.appspot.com, since they drop everything after a # (since traditional web practices dictate that the # in a URL indicate an anchor point within the existing page).
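A quick sketch of that effect, using Python's standard urllib.parse.urldefrag to strip the fragment the way a crawler would. The fragment values below are invented for illustration; they aren't real Moderator URLs.

```python
from urllib.parse import urldefrag


def url_as_crawled(url):
    """Return the URL with everything from the # onward removed."""
    return urldefrag(url).url


# Hypothetical deep links into the application; only the fragment differs.
deep_links = [
    "http://moderator.appspot.com/#series-1/question-5",
    "http://moderator.appspot.com/#series-1/question-9",
    "http://moderator.appspot.com/#series-2",
]

# After dropping the fragment, every "page" collapses to the same address.
distinct = {url_as_crawled(u) for u in deep_links}
```

Three different views for a visitor, one URL for a search engine.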
As a sidenote on how ensuring your site is search-friendly can help usability, note that these issues also keep the back button from working in the browser, so if you read an answer, you can’t easily get back to the list of questions.
(The site has other, more minor, search issues, such as that the logo doesn’t link to the home page and there’s no meta description, but really, fixing those issues would be like using a thimble to bail out a sinking boat with a hole the size of a bowling ball in the bottom of it.)
Go ahead, ask me a question
I started a question series to test out the system, and I will answer the questions that are voted to the top, but I’m likely to answer them here (and post links to the answers there) because of that er, minor indexing issue. Feel free to ask a question there and test things out yourself.
Answering questions to Matt
I’m not Matt, nor do I play him on TV, but I did see a few questions to him that I thought I’d steal away to answer. (Although he has answered quite a few already himself!)
Q: What’s the best way to get a count of indexed pages in Google? Last I checked, Webmaster Tools just links to the standard “site:” operator. Various query tricks have had different levels of success in the past, but none have been reliable. (Nick, Chicago)
I love this question because I get to talk up Google Webmaster Central. A reliable count does indeed exist! Simply create an XML Sitemap that includes a comprehensive and accurate list of the pages you would like indexed. Alternately, you could create several Sitemaps, each with a different set of pages you want to track. If you want to track more than 50,000 URLs, simply add multiple Sitemaps to one Sitemap Index file. Submit the Sitemap or Sitemap Index file to Google Webmaster Tools and check back after it’s been processed. The Sitemaps tab displays not only the Sitemap URL count, but also tells you the number of URLs from that Sitemap that have been indexed. You can track this number over time to measure indexing coverage.
It’s a pretty handy trick and much more accurate than the site: operator.
Q: Is Google looking for a true solution to deal with duplicate content between UK & US Websites own by the same company? (François, Brussels) and How will Google Identifies that Particular Website belongs to particular location even when it is hosted in US, and uses that Data to show that website to users of country for which it is appropriate? (Cold, Jaipur India)
I can’t speak for what Google is looking to do, but I do know that generally search engines filter duplicate content and show the most relevant version to the searcher. So, in the case of US and UK content, Google would look to show the US version to the US searcher and the UK version to the UK searcher. It wouldn’t generally look to show both versions in a single search result. Google figures out which is more relevant to the searcher using things like the searcher’s geographic location (based on IP address) and whether the searcher is using google.com or google.co.uk.
You can provide signals to Google about the content using TLD (putting the US content on yoursite.com and the UK content on yoursite.co.uk) or domains hosted in the target country (yoursiteus.com hosted in the US and yoursiteuk.com hosted in the UK), segmenting the content into subdomains or subfolders and then specifying the target country for each in Google Webmaster Tools (us.yoursite.com associated with US and uk.yoursite.com associated with UK), and using the meta language element. (To be honest, I’m not entirely sure about the meta country tag. Anyone have experience with this?)
So, back to the tool
In particular, I think this tool could be really handy for QA at conferences. I’ve spoken at several conferences (including SMX and Web 2.0 Expo) where attendees could use an online system to submit questions, and I’ve used Twitter for this a few times, but this is the first system I’ve seen that also lets the rest of the audience vote questions up or down. I may have to try it out for my next event.
SMX East is coming up in NYC in early October. It should be a great conference with lots of great stuff about search marketing, including a whole day on doing SEO in-house, programmed by Jessica Bowman.
I’m coordinating several sessions, and we’re nearing last call for speakers. If you’re interested in speaking at these or any other sessions, get your pitch in now! (And if you’ve already pitched for any of my sessions, you should hear something by late this week or early next.)
The sessions I’m programming are below. In particular, I’m looking for case studies, real-world implementation best practices, and tactical, actionable stuff that attendees can bring home and implement right away. OK, maybe bring to work. Unless they work from home. Or a coffee shop. Like I am right now.
CSS, AJAX, Web 2.0 & SEO
This session looks at CSS, AJAX and Web 2.0 dynamic design techniques that can cause search engine indexing and ranking issues, with solutions to consider.
Enhanced Listings
Yahoo has SearchMonkey. Google has sitelinks management. Even Microsoft is looking at ways to dress-up listings. This session looks at the move toward enhanced listings and how search marketers can tap into them.
Flash & SEO
Google is handling Flash in a new way thanks to a partnership with Adobe, and Yahoo may soon do the same. Meanwhile, there are plenty of ‘old’ techniques to make Flash sites search engine friendly. But none of these techniques mean that Flash issues are solved. More in this session.
Unraveling URLs & Demystifying Domains
Can you find the same page on your site using different URLs? That might cause you duplicate content issues. Does your content management system put out parameters that block crawling? Own multiple domains pointing at the same site? Are you 301 redirecting them or leaving canonicalization to chance? Confused on even how to pronounce canonicalization, in addition to now being worried about it? Relax. This session looks at a variety of URL and domain name issues you should consider to increase your success with SEO.
Last week at SES, I sat in on the Black Hat, White Hat session. The panelists gave their definitions of “white hat” and “black hat” SEO, and then talked about particular techniques, such as paid links, and whether they thought they were OK or not. The panelists talked about shades of gray and the lack of rules and the need for experimentation.
I come from a background of working at a search engine, rather than from a background of being an SEO, so I may see things a bit differently than those on the panel.
White hat = no guidelines violations
Generally speaking, I don’t see a lot of shades of gray in this discussion. The search engines have published guidelines (and while I was at Google, I spent a lot of time expanding the descriptions of those guidelines to help make them clearer). Violate any of those guidelines and you risk having the site removed from the index. That’s pretty black and white.
Related discussions have more shades of gray
I do see shades of gray in different discussions, such as:
While all of these discussions and more are certainly valid and useful, I feel the trouble comes in when people have these discussions as the discussion about what is white hat and what is black hat. When people who aren’t experienced in the intricacies of SEO look for information and they see statements like “these are white hat reasons to cloak” and “all paid links aren’t bad”, they can be led astray and think that those things adhere to search engine guidelines.
What are the guidelines?
As for the SES panel, I respect all of the panelists and felt they all had interesting, useful things to say. But for me, the question at hand is simple to answer. Techniques that violate the guidelines aren’t white hat. They may be effective (at least for a time), commonplace, non-deceptive, or justified, but that doesn’t make them white hat. To me, white hat is anything that doesn’t put the site at risk of being removed from the search index.
When I was at Google, I spent a lot of time expanding the guidelines, detailing examples, and providing options of techniques that didn’t violate the guidelines. Since then, Google’s continued to expand the information and make it even more helpful. Check them out, and in particular, click the links under the “quality guidelines – specific guidelines” if you want to read up on the details.
What is SEO?
So what is white hat SEO? The panelists agreed it was about creating quality content: being the most relevant result for a desired query. I absolutely agree, but SEO is also about making sure the site can be easily crawled and indexed by search engines. From a search engine perspective, the best site in the world is unlikely to rank if the bot can’t extract any content from it. That latter part of SEO is, of course, the motivation behind Jane and Robot (which is just getting started; look for more soon!).
You might also check out:
A couple of weeks ago, Adobe announced that it was working with Google and Yahoo! on making Flash content easier to index in search engines. Google said it was using the search-engine specific Flash player that Adobe had made available (Yahoo!’s integration is still in the works). While I think it’s great and absolutely vital that search engines continue to evolve beyond strictly text (to ensure they are providing the best possible experience for their users), I don’t think this announcement means that all the Flash content on the web will now suddenly start ranking in search results and I don’t think that Flash developers can stop thinking about search engine optimization.
How search engines work
It all goes back to how search engines work. At least for now (even with all of the advancements in the last year around universal search), the foundations of the major search engines are based on text. The web began with primarily text-only pages and the search engine algorithms were built on that idea. When people started searching for information, they searched with words. We’re used to asking for things in words, after all, and since words were what the web was made up of, the questions and answers matched up quite well. Search engines are a bit of a middleman (middlemachine?) between a searcher’s textual questions and a web site’s textual answers.
Searching continues to be text based
Sure, you might imagine other types of exchanges. I might want to upload a picture of a person and ask for all the other pictures on the web of that person. Or I might want to search through the audio of a song for a particular lyric. All of those types of searches and more are coming (and some have been tried, with varying degrees of success), but at least for now, those applications are not how the three major search engines work and not how most people search.
Over time, search engines have experimented with different elements on pages beyond simply the text itself to better understand what those pages are about. But since these experiments are built on a text-based foundation, they have still mostly focused on text. For instance, search engines found that the text that’s in the title may be a strong indicator of the focus of the page. The textual caption under an image is likely describing that image.
How Flash fits in with text-based search engines
Now, consider Flash. Most Flash pages contain little text. Those that do could often just as easily display that text outside of the Flash components (which would make it easier for those on screen readers and mobile phones, for instance, to view the content).
With this latest innovation in crawling Flash, Google can more easily access the text in Flash, but it still can’t process that text quite as well as it can HTML text because it isn’t extracting any meta data about that text. As I mentioned earlier, search engines are now storing all kinds of meta data based on the structure of the text in HTML, like if it’s in a title tag, or an H1 and so on. So Flash-based text has that disadvantage.
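To show the kind of structural meta data HTML gives you for free, here's a small sketch using Python's standard html.parser module. It is only an illustration of the idea, not how any search engine actually processes pages; the class name, the set of tracked tags, and the sample markup are my own assumptions.

```python
from html.parser import HTMLParser


class TextWithContext(HTMLParser):
    """Collect each run of text along with the tag that encloses it."""

    TRACKED = {"title", "h1", "h2", "p", "a"}  # arbitrary illustrative set

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tracked tags, innermost last
        self.found = []  # list of (enclosing tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            context = self.stack[-1] if self.stack else "unknown"
            self.found.append((context, text))


def extract(markup):
    """Return the (context, text) pairs found in a piece of markup."""
    parser = TextWithContext()
    parser.feed(markup)
    parser.close()
    return parser.found


html_page = (
    "<html><head><title>Tour de France coverage</title></head>"
    "<body><h1>Stage results</h1><p>Daily recaps and video.</p></body></html>"
)
structured = extract(html_page)

# The same words with no markup around them carry no such hints.
unstructured = extract("Tour de France coverage Stage results Daily recaps and video.")
```

With the HTML version, an indexer can tell which words were the title and which were the main heading. With bare extracted text, every word looks the same.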
Provide a separate URL for each piece of Flash content
Another consideration is how the Flash application itself is constructed. This new Flash player that Adobe is making available to Google and Yahoo! helps the search engines in that it enables them to access content they never could before. The crawlers can interact with the Flash application as a user would and crawl deeper into the application to get to text that may be four or five levels deep. On first glance, this may seem similar to search engine crawlers following links within HTML sites, but it can actually be quite different.
HTML pages (generally) have unique URLs for each page. Flash applications can be constructed that way, but can also be constructed so that as you go deeper into the application, the URL doesn’t change. This can be problematic for lots of usability reasons that have nothing to do with search. For instance, the back button in the browser doesn’t work. Users can’t easily email, Digg, or otherwise share a particular section of the Flash application. Bookmarking only works for the beginning of the Flash app.
As you might imagine, it also causes problems in search. Sure, the search engine crawlers may now be able to get to some of that content several levels in, but they have to index all of the text under a single URL. (Also note that they likely won’t index all of the application in this case; they will execute only a certain number of interactions.)
Say information about your latest product line is available once you choose “products” from the home page, then “new” from the products page, then “coming soon” from the new page. If the URL of the application doesn’t change for each interaction, then search engines will have to index the content from the home page, products page, new page, and coming soon page all under a single URL. When a searcher looks for your latest product line, that URL may appear in the results. But once the searcher clicks over, they aren’t brought to your coming soon page, they see your home page, and may have no idea where to go from there. If you ensure your Flash app uses a different URL for each page, then the searcher can be brought directly to the page that has the right content, which should greatly improve conversion rates and lower bounce rates.
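Here's a toy sketch of that difference, using a very simplified word-to-URL index. Real search engine indexes are far more involved; the URLs, page text, and function names are all hypothetical.

```python
def build_index(pages):
    """Map each word to the set of URLs whose text contains it.

    `pages` is a list of (url, text) pairs.
    """
    index = {}
    for url, text in pages:
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))


# One URL for the whole application: all the text is indexed under the home page.
single_url_site = [
    ("http://example.com/", "welcome products new coming soon spring line"),
]

# One URL per section: each piece of text is indexed where it actually lives.
separate_url_site = [
    ("http://example.com/", "welcome"),
    ("http://example.com/products", "products"),
    ("http://example.com/products/new", "new"),
    ("http://example.com/products/new/coming-soon", "coming soon spring line"),
]

landing_single = search(build_index(single_url_site), "spring line")
landing_separate = search(build_index(separate_url_site), "spring line")
```

In the first case the only URL a search can return is the home page, so that's where the searcher lands. In the second, the same query returns the coming soon page directly.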
But if you take the announcement that Google can now index Flash at face value, without looking deeper, you may not realize this, and think that your single-URL Flash application is now perfectly positioned for search.
Taking back the tour
Want an example of how the statement “Google can now index Flash” isn’t the whole story?
I’ve been watching the Tour de France. It’s playing on the Versus network for the first time this year. I’d never heard of the Versus network before (since it seems to mostly show ultimate fighting cage matches, this may be because I’m not its target audience; not to mention that I wasn’t the target audience for the network under its previous name, OLN, as I think it mostly played shows about people fishing then), and the network is looking to capitalize on this potential new audience.
Versus is spending a lot of money on its Tour de France campaign “Take Back the Tour”. It has put together flashy commercials and an equally flashy website.
Versus probably would like to be found when people search for [tour de france]. The Tour de France page on the main versus.com domain shows up in the search results, but the Take Back The Tour site that they spent so much money on? Nowhere to be found.
Well, they’re spending all the money on commercials and print ads, so maybe people have been searching for [take back the tour] as well. The site does rank #1 for that query on both Google and Live (although it’s down at #8 on Yahoo!). For all three engines, even those who do the search because they saw an ad might not be sure if the takebackthetour.com listing is really the official site based on how the listing looks in the search results.
You can see that at this point, Google doesn’t see any content on the site and in fact, notes on the cached page that [take back the tour] appears only in links pointing to the page. Since it can’t extract any text, it has no way of knowing that the site is about the Tour de France.
Google still doesn’t crawl Flash executed via JavaScript
So. What’s the problem? Google crawls Flash now and all should be well. I see at least two problems. The first is fundamental. The Flash executes via JavaScript. Google noted in their blog post that:
“Googlebot does not execute some types of JavaScript. So if your web page loads a Flash file via JavaScript, Google may not be aware of that Flash file, in which case it will not be indexed.”
They did update the post later to say that:
“For our July 1st launch, we didn’t enable Flash indexing for Flash files embedded via SWFObject. We’re now rolling out an update that enables support for common JavaScript techniques for embedding Flash, including SWFObject and SWFObject2.”
Will this update help the Take Back the Tour site? Maybe not.
Can Google find any words to index?
Another big obstacle to the crawl of this site is that even if Google could get to the Flash, it would find few words to index. Nearly all of the text on the site is contained in images. The first thing you see when you go to the site is lots of words, but the only ones that seem to be text, rather than part of the image, are in the link “join the movement”.
So, once Google can access the Flash, it will be able to crawl and index those words. This design is a theme throughout the site. Links like “back” are text. Nearly everything else is in images.
Let’s pretend for a moment that they changed the Flash file so that the text wasn’t contained in images (and that the JavaScript problem didn’t exist). Would this help indexing? Yes and no.
No separate URLs can lead to a poor experience for searchers
Each time you click a link in the Flash file, you are taken to another page, but the URL doesn’t change. It stays at takebackthetour.com no matter how you navigate. That means that any text Google does pick up will be indexed under that one URL.
By clicking about three levels deep, I can find TV spots about the tour. If the site designers added some text about those TV spots, using the language of their customers, then searchers looking for [tour de france video] or something similar might see the takebackthetour.com site come up in their search results. But when they clicked through to the site, they wouldn’t see the TV spots. They would see the Flash splash page. And they would have to figure out how to navigate through the site to find the video section. Chances are that many searchers would scan the initial page that came up, not see what they were looking for and go back to the search results to find another site.
Little chance for viral success
This makes for a poor user experience from search, but consider also that the creators of this campaign obviously are hoping it goes viral. If you want a site to go viral, you have to make it easily shareable. Sure, people may love the rant section or the video section or the contest, but no URL of any of these sections exists for those people to email, Digg, Twitter, Stumble, or otherwise share. A viral campaign that requires every person who shares the content to say, “go to this URL, then click ‘join the movement’, then click ‘how will you take back the tour’” is over before it even begins.
And what about accessibility? And those on the go? I watched the first night of the tour at a friend’s house. What if I had seen the commercial, wanted to check it out, and pulled up the site on my Windows Mobile Smartphone? I would have had this awesome experience:
It’s not even an accurate error message, since the first problem is that I don’t have JavaScript support.
Be smart about Flash
Clearly, a few problems still exist with Flash websites. My view is this:
A (Google official) blog post about scraped content on a scraper site.
But the original does rank first.
Although not in blog search.
(But that appears to be because the post isn’t indexed in blogsearch at all. Because rivva.de is listed in its place?)