Posts Tagged ‘SEO’

Practical Tips for Government Web Sites (And Everyone Else!) To Improve Their Findability in Search

April 15, 2009

In an earlier post, I said that the key to government opening its data to citizens, being more transparent, and improving the relationship between citizens and government in light of our web 2.0 world was ensuring content on government sites could be easily found in search engines. Architecting sites to be search engine friendly, particularly sites with as much content and legacy code as those the government manages, can be a resource-intensive process that takes careful long-term planning. But two keys are:

  • Assessing who the audience is and what they’re searching for
  • Ensuring the site architecture is easily crawlable

Read more on O’Reilly Radar

Transforming the Relationship Between Citizens and Government: Making Content Findable Online

March 24, 2009

Thursday Congressman Honda asked, “how can congress take advantage of web 2.0 technologies to transform the relationship between citizens and government?” He noted that “A dramatic shift in perspective is needed before that need can be met. Instead of databases becoming available as a result of Freedom Of Information Act requests, government officials should be required to justify why any public data should not be freely available to the taxpayers who paid for its creation.” He asked for input on what web 2.0 features he should add to his website to take advantage of today’s online world.

The most important feature government web sites can add isn’t really a feature at all. But it would absolutely transform the relationship between citizens and government and make an amazing array of public data available. What’s this magic feature?

Make government web sites search engine friendly.

Read more on O’Reilly Radar

Search Engine Optimization Recap

November 18, 2008

Increasing Search Indexing Coverage With an XML Sitemap

October 13, 2008

I just read Jeff Atwood’s post on Coding Horror about the importance of Sitemaps. I’m always eager to hear about people’s experiences since I spent so much time on XML Sitemaps and getting sitemaps.org launched while I was at Google. Sitemaps, of course, are supported by Google, Yahoo, and Live Search. All you have to do is reference the Sitemap location in your robots.txt file and all the engines will pick it up.
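
As a minimal sketch of what that robots.txt reference looks like, and of how a crawler might read it, the snippet below parses a hypothetical robots.txt with Python’s standard urllib.robotparser module. The example.com URL is a placeholder, and this only illustrates the discovery step, not how any search engine actually implements it.

```python
from typing import List
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the Sitemap line is the only addition needed.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml
"""


def find_sitemaps(robots_txt: str) -> List[str]:
    """Return the Sitemap URLs declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present (Python 3.8+).
    return parser.site_maps() or []


if __name__ == "__main__":
    print(find_sitemaps(ROBOTS_TXT))
```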

Atwood noted that he uses Google to search for his own stuff, which makes it that much more frustrating when some of the content isn’t indexed. (Not to mention of course, the lost visitor opportunities.) Once he created an XML Sitemap, Google started finding and indexing more of his pages. Yay!

However, he and his commenters had a few questions about the process, so I thought I’d take a few minutes to answer them. Of course, I don’t work for Google anymore, so these answers are entirely my own. If you want official answers, check out the Official Google Webmaster Help forum.

Why is Google having so much trouble crawling my dynamic site? Can’t Googlebot figure out my URL scheme? (I’m paraphrasing Atwood’s post here.)
I haven’t spent a lot of time studying stackoverflow.com (the site in question), but since Google is crawling and indexing the URLs after finding them in the Sitemap, the problem likely isn’t with the dynamic nature of the URLs themselves. The issue is probably that the internal linking structure doesn’t provide links to every single page. Since Googlebot crawls the web by following links, it wouldn’t know about the unlinked URLs. Atwood notes this possibility:

“On a Q&A site like Stack Overflow, only the most recent questions are visible on the homepage… I guess I was spoiled by my previous experience with blogs, which are almost incestuously hyperlinked, where everything ever posted has a permanent and static hyperlink attached to it, with simple monthly and yearly archive pages. With more dynamic websites, this isn’t necessarily the case.”

Of course, pages with few links to them (particularly no external links) may not have substantial PageRank and therefore are unlikely to rank for anything other than long tail queries. But since the scenario Atwood describes is all about long tail queries (typing in the exact title of a page, for instance), getting those pages crawled and indexed is sufficient.

To dig a bit more into Atwood’s needs, he says, “It’s far easier to outsource the burden of search to Google and their legions of server farms than it is for our tiny development team to do it on our one itty-bitty server. At least not well.” If he’s looking to provide comprehensive search for visitors of his site, he might consider Google’s custom search engine (CSE). Generally, the CSE searches over what’s in the Google index. But if you’ve submitted a Sitemap, Google will maintain a CSE-specific index that contains any URLs from the Sitemap that aren’t in Google’s web search index. So, the CSE could provide even better search results than a regular web search.

Why would Google put some URLs in the CSE-specific index and not the regular web index? Well, Google’s algorithms use lots of criteria for determining not only how to rank pages, but what pages to crawl and index as well. So, if, for instance, Googlebot has crawled what it’s deemed the maximum number of URLs from your site for the week for the web index (I’m over-simplifying here a bit), it can still add the remainder to the CSE index.

It doesn’t sound very scalable. (from John Topley)
You can easily write a script that updates the Sitemap each time the site is updated. And if your Sitemap reaches the maximum size, you can break it up into multiple Sitemaps automatically or you can segment them by folder (or whatever organizational structure works best for you). If you want, you can even ping the search engines each time the Sitemap is updated, or you can just reference it in your robots.txt file as Atwood suggests and let them pick it up.
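
Here is a rough sketch of such a script, assuming you can already produce a list of your site’s URLs (from a database query, for instance). The function name, file names, and example.com base URL are made up for illustration. It splits the list into files of at most 50,000 URLs each and writes a Sitemap index that points at them.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50000  # per-file URL limit for a Sitemap


def build_sitemaps(urls: List[str], base_url: str,
                   max_urls: int = MAX_URLS_PER_SITEMAP) -> Dict[str, str]:
    """Return {filename: xml_text} for the Sitemap files plus a Sitemap index."""
    files: Dict[str, str] = {}
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)

    for start in range(0, len(urls), max_urls):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url in urls[start:start + max_urls]:
            # ElementTree escapes characters such as & in the URL text.
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url

        name = "sitemap-%d.xml" % (start // max_urls + 1)
        files[name] = ET.tostring(urlset, encoding="unicode")

        # List each Sitemap file in the index.
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = base_url.rstrip("/") + "/" + name

    files["sitemap-index.xml"] = ET.tostring(index, encoding="unicode")
    return files
```

Run something like this whenever the site changes, write the returned files to your web root, and reference the index file in robots.txt. Segmenting by folder instead of by count would only change how the URL list is grouped before it is passed in.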

How do you determine change frequency? (John Topley)
If your script can determine this, then you can set it up programmatically. Otherwise, I’d skip this attribute and just concentrate on listing the URLs.
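
One way a script might determine this, assuming it has a history of modification times for each page, is to average the gap between edits and map it to one of the changefreq values. The thresholds below are arbitrary assumptions for illustration, not anything the search engines have published.

```python
from datetime import datetime
from typing import List, Optional

# (upper bound on average days between changes, changefreq value)
_THRESHOLDS = [(1 / 24, "hourly"), (1, "daily"), (7, "weekly"), (31, "monthly")]


def estimate_changefreq(modified: List[datetime]) -> Optional[str]:
    """Guess a changefreq value from past modification times.

    Returns None when there is too little history, in which case the
    value can simply be left out of the Sitemap.
    """
    if len(modified) < 2:
        return None
    times = sorted(modified)
    span_days = (times[-1] - times[0]).total_seconds() / 86400
    avg_gap_days = span_days / (len(times) - 1)
    for limit, label in _THRESHOLDS:
        if avg_gap_days <= limit:
            return label
    return "yearly"
```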

I think google is not happy with the “dynamic” parts of the url e.g. “?” or “&” (Marcel Sauer)
Google does fine with dynamic URLs. They can have trouble if the dynamic nature of the site leads to things like infinite URLs, lots of URLs that display the same page, crazy parameters, or recursive redirects, but as I noted above, the trouble tends not to be with the URLs themselves, but the fact that they aren’t always well-linked.
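
If you suspect your own site produces lots of URLs that display the same page, a quick check along these lines can help you spot them in a crawl log or URL export. This is only a sketch: it assumes parameter order doesn’t change the page and that the parameters in IGNORED_PARAMS (hypothetical names) don’t affect content, which you’d need to confirm for your own site.

```python
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that change the URL but not the page content.
IGNORED_PARAMS = {"sessionid", "ref", "utm_source"}


def normalize(url: str) -> str:
    """Drop ignored parameters and sort the rest so equivalent URLs match."""
    parts = urlsplit(url)
    params = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(params), ""))


def group_duplicates(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Map each normalized URL to the variants that collapse into it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}
```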

Is PageRank the Ultimate Measure of Online Influence?

October 8, 2008

Steve Rubel recently wrote a blog post about measuring online influence. He concluded that Google PageRank is the ultimate way to measure online influence.

I completely disagree.

I agree with him that we need better measures and the ones that we have are looking through a glass darkly (at best), but PageRank is probably one of the worst measures around. John Mueller asked me if I ever worried about PageRank, so I can answer that question while I explain why I disagree so much with Rubel.

What is PageRank anyway?
First, a bit of explanation about PageRank. Two entirely different things are called “PageRank”. There’s the Google toolbar PageRank, which is represented by integers 1 through 10, and then there’s the internal PageRank number that Google uses as one of its many (hundreds of) ranking factors. When I say PageRank doesn’t matter (and I say it a lot), I’m talking about the toolbar PageRank. The internal PageRank that Google uses does matter, but it’s far from the only thing that matters.

At the simplest level, PageRank (both toolbar and internal) is a measure of a page’s link popularity. How many links does a page have and how authoritative are those links?
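
To make “link popularity” concrete, here is a toy version of the textbook PageRank calculation (power iteration with a damping factor of 0.85, the commonly cited value). It is a sketch of the published idea only. It is not Google’s internal computation, and it has nothing to do with how the toolbar number is derived. The page names in the usage example are invented.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Textbook PageRank by power iteration over a small link graph.

    `links` maps each page to the pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Each link passes along an equal share of the page's score.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"home": ["a", "b"], "a": ["home"], "b": ["home"], "c": ["home"]}
    for page, score in sorted(pagerank(graph).items(), key=lambda item: -item[1]):
        print(page, round(score, 4))
```

In this graph, “home” scores highest because three pages link to it, and “c” scores lowest because nothing links to it. Both the number of links and the score of the linking pages feed into the result.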

Why do I think PageRank doesn’t matter?
Rubel is talking about the toolbar PageRank in his post. So why do I say it doesn’t matter while he says it’s the “ultimate measure”?

  • It’s updated infrequently. As Matt Cutts has said, it’s updated every “few months”. So, it’s generally pretty stale data. When you see a 5, there’s really no way of knowing if the site is currently a 5 or was a 5 two months ago but is now a 7. Or a 2. (“Real” PageRank is computed continually.)
  • It’s not very accurate. The internal PageRank is not an integer number 1 – 10. It’s something much more precise. So even without the staleness problem, there’s still an accuracy problem.
  • It can easily be gamed. Link schemes, link exchanges, and paid links have been around for a long time. Google is always working to be one step ahead, but these techniques can work for a time.
  • Link builders have an advantage. Certainly savvy SEOs and link builders know how to get quality links. One site could have more online influence and engagement but just not have an owner who knows about link building.
  • The toolbar number may be obfuscated. Google has to maintain a delicate balance of giving as much information as possible to web site owners, while not giving away enough to let spammers impact the quality of search results. This was one of the hardest parts of my job when I ran Google Webmaster Central. I talked to B&B owners who just wanted people to know their inns existed. And I talked to black hats who used every loophole to get their viagra sites on the first page. The Official Google Webmaster Central blog talked about obfuscation that Google did late last year. In this particular case, sites were selling text links that weren’t marked as advertising and their major selling point was the high PageRank of the site. By reducing the visible PageRank, those sites could not as easily sell links.
  • PageRank doesn’t necessarily correlate to ranking. Matt mentioned this recently on Sphinn, when he said “Even if you don’t show much PageRank, Google still has 200+ other signals we use in our ranking. It’s definitely common to see lower-PageRank sites ranking above higher-PageRank sites–which tends to confuse the people who obsess too much about PageRank and who don’t focus on other factors that search engines might use to rank pages”.

Why does Rubel think PageRank is the “ultimate”?
Rubel sees things a little differently. He said:

  • “Page Rank is something you earn by producing high quality content that people link to” – Unfortunately, that’s not entirely correct. An average piece of content might get lots of links via a link builder, or if the person writing the content is popular, or even if the person writing the content is universally hated and lots of people link to the content to trash it. A piece of high quality content may be very engaging and may influence a lot of people, but those people may not be linkers by default. And if what you’re really looking to do is measure which content gets the most links, on the argument that something with a lot of links has a lot of influence because the links themselves raise awareness about that content, then use Yahoo! Site Explorer, which will give you up-to-date and accurate link counts. Don’t use a rounded, out of date number that approximates link counts.
  • “It enables you to influence people on the Internet’s biggest stage – Google – and just as people are searching for the topics you are knowledgeable about. This means it amplifies your influence because the press start at search engines when researching stories” – As noted above, PageRank is one of more than a hundred factors in determining ranking. It happens all the time that a site with a lower toolbar PageRank will rank above high PageRank sites. Ranking isn’t just about link quantity. It’s about crawlability, extractability, quality content, link quality, anchor text…. Well, a lot of things.
  • “Page Rank is channel agnostic and takes the entire online ecosystem into account. It judges you based on links from all kinds of sources, not just people who live in the same fish tank. In other words, it goes beyond people who hang out on Twitter who love people who Tweet or bloggers who link to other bloggers, etc. It eschews the echo chamber”- Again, not exactly. It may eschew the echo chamber but it rewards a savvy link builder. And some audiences are more likely to link than others. For instance, marketing blogs link out all the time. Recipe blogs are getting better at linking. But some newspapers don’t link at all, or provide the link as text. Some audiences aren’t the type to have sites from which they can link, so you can only see their involvement through things like comments and subscriber numbers. And for some audiences that do control sites, linking just doesn’t cross their minds. It’s not something they think about doing.

I’m not the first person to disagree with Rubel on this. Michael Gray mentioned it on Twitter and Rubel replied “I know Page Rank is not perfect. But it determines your footprint on Google and that’s why it’s the ultimate influence metric.”

If there’s one thing that PageRank is not, it’s the determination of your Google footprint. The internal “real” PageRank isn’t even that. Lots of things go into determining your Google footprint. His discussion in the comments goes further down this path of misunderstanding what PageRank is. He agrees with someone in the comments who says that “PageRank is the sum of all other measurements.” It’s not. It’s one measurement added in with a whole bunch of others.

Others in the comments do point this out. In fact, James Joyner said “My site has gone from PR7 to PR4 for no apparent reason. At the same time, my visitors, commenters, and social media followers have gone up. My content gets syndicated at Newsweek. It’s included in Google News, for goodness sakes. But my PR has plummeted. Oddly, however, my Google traffic has not.”

So how do you measure online influence?
But what of Rubel’s actual question? How do you measure online influence? I would ask why you want to measure it. I spoke at the eMetrics summit a few months ago and the big discussion was around measuring engagement. But what is actionable about that measure, even if you are able to track it down?

It could be that the measure is different depending on your goal. If you’re Coke and you want to sell more soft drinks, then the only measure you care about may be increased sales.

Rubel mentions that unique visitor counts are largely empty numbers as hordes of visitors might come from search but then leave immediately. Well, sure. That’s why you have to measure bounce rate. And conversion. And understand that the goal isn’t to rank #1 in Google and get a lot of traffic, it’s to rank highly for search queries that your customers who want to buy your products are doing. But I’ve talked about that before.

If you’re a blogger and you don’t sell anything, then you might care about getting more readers. Or you might make money from advertising. Or maybe you want to get famous so a big time magazine wants you to write for them. “Online influence” is a nebulous term, at best.

I do agree that we need better measures. That we’re overwhelmed with numbers and we don’t know what’s actionable or useful. And I think you can measure the impact and value of things like social media that don’t correlate directly to sales. These are the things I spend a lot of my time thinking about these days. But I’m pretty sure toolbar PageRank is not that magic measure.

Google Moderator Beta: Ask a Google Engineer

September 30, 2008

A few days ago, I noticed that Zurich-based Googler John Mueller posted a Twitter link to Google’s new Moderator application and invited everyone to ask Google engineers, such as Matt Cutts, questions. I asked Matt to bring me some frozen yogurt from the Google cafe. As Google Moderator takes advantage of the wisdom of crowds, my question soon plummeted to last place as apparently I’m the only one interested in my need for icy treats and everyone else (er, at least 86 people) voted that they didn’t like my question. (I don’t mind being voted down; I just wish I at least had some frozen yogurt!)

Matt blogged about this tool, explaining that it has been available internally at Google for a while and is a great way to prioritize questions.

If only it were searchable
It seems like a pretty cool tool. The biggest drawback I see to using it is that none of the content gets indexed, so unlike something like Yahoo! Answers, any work you put into asking or answering questions can’t be found later by those searching for that information. Why can no content be indexed, you ask? For one thing, the content is entirely in JavaScript. With JavaScript turned off, all you see is a message that says:

Google Moderator is a tool that allows distributed communities to submit and vote on questions for talks, presentations, and events. You must have JavaScript enabled in order to use this feature.

You can see that Google is indexing what’s in the noscript tag by checking out the search result:

[Image: Google search result for Google Moderator, showing the noscript text]

Huh.

The other problem with indexing is that every URL is differentiated by characters that begin with a #. Even if the content did load without JavaScript, search engines would see every URL as moderator.appspot.com, since they drop everything after a # (traditional web practices dictate that the # in a URL indicates an anchor point within the existing page).
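
You can see the effect with a few lines of Python. The fragment paths below are invented for illustration (I’m not reproducing Moderator’s actual URL scheme), and urldefrag simply stands in for a crawler dropping the fragment.

```python
from urllib.parse import urldefrag

# Hypothetical fragment-style URLs; the parts after # are made up.
urls = [
    "http://moderator.appspot.com/#question/1",
    "http://moderator.appspot.com/#question/2",
    "http://moderator.appspot.com/",
]


def crawler_view(url: str) -> str:
    """Return the URL with everything from the # onward removed."""
    return urldefrag(url).url


# Collect the distinct URLs that remain once fragments are dropped.
distinct = {crawler_view(url) for url in urls}
print(distinct)
```

All three URLs collapse to the same address, so there is only one page to index no matter how many questions the application contains.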

As a sidenote on how ensuring your site is search-friendly can help usability, note that these issues also keep the back button from working in the browser, so if you read an answer, you can’t easily get back to the list of questions.

(The site has other, more minor, search issues, such as that the logo doesn’t link to the home page and there’s no meta description, but really, fixing those issues would be like using a thimble to bail out a sinking boat with a hole the size of a bowling ball in the bottom of it.)

Go ahead, ask me a question
I started a question series to test out the system, and I will answer the questions that are voted to the top, but I’m likely to answer them here (and post links to the answers there) because of that er, minor indexing issue. Feel free to ask a question there and test things out yourself.

Answering questions to Matt
I’m not Matt, nor do I play him on TV, but I did see a few questions to him that I thought I’d steal away to answer. (Although he has answered quite a few already himself!)

Q: What’s the best way to get a count of indexed pages in Google? Last I checked, Webmaster Tools just links to the standard “site:” operator. Various query tricks have had different levels of success in the past, but none have been reliable. (Nick, Chicago)

I love this question because I get to talk up Google Webmaster Central. A reliable count does indeed exist! Simply create an XML Sitemap that includes a comprehensive and accurate list of the pages you would like indexed. Alternately, you could create several Sitemaps, each with a different set of pages you want to track. If you want to track more than 50,000 URLs, simply add multiple Sitemaps to one Sitemap Index file. Submit the Sitemap or Sitemap Index file to Google Webmaster Tools and check back after it’s been processed. The Sitemaps tab displays not only the Sitemap URL count, but also the number of URLs from that Sitemap that have been indexed. You can track this number over time to measure indexing coverage.

[Image: Sitemaps tab in Google Webmaster Tools, showing submitted and indexed URL counts]

It’s a pretty handy trick and much more accurate than the site: operator.

Q: Is Google looking for a true solution to deal with duplicate content between UK & US Websites own by the same company? (François, Brussels) and How will Google Identifies that Particular Website belongs to particular location even when it is hosted in US, and uses that Data to show that website to users of country for which it is appropriate? (Cold, Jaipur India)
I can’t speak for what Google is looking to do, but I do know that generally search engines filter duplicate content and show the most relevant version to the searcher. So, in the case of US and UK content, Google would look to show the US version to the US searcher and the UK version to the UK searcher. It wouldn’t generally look to show both versions in a single search result. Google figures out which is more relevant to the searcher using things like the searcher’s geographic location (based on IP address) and whether the searcher is using google.com or google.co.uk.

You can provide signals to Google about the content using TLD (putting the US content on yoursite.com and the UK content on yoursite.co.uk) or domains hosted in the target country (yoursiteus.com hosted in the US and yoursiteuk.com hosted in the UK), segmenting the content into subdomains or subfolders and then specifying the target country for each in Google Webmaster Tools (us.yoursite.com associated with US and uk.yoursite.com associated with UK), and using the meta language element. (To be honest, I’m not entirely sure about the meta country tag. Anyone have experience with this?)

[Image: target country setting in Google Webmaster Tools]

So, back to the tool
In particular, I think this tool could be really handy for QA at conferences. I’ve spoken at several conferences (including SMX and Web 2.0 Expo) where attendees could use an online system to submit questions, and I’ve used Twitter for this a few times, but this is the first system I’ve seen that also lets the rest of the audience vote questions up or down. I may have to try it out for my next event.

Speak at SMX East About Search

August 27, 2008

SMX East is coming up in NYC in early October. It should be a great conference with lots of great stuff about search marketing, including a whole day on doing SEO in-house, programmed by Jessica Bowman.

I’m coordinating several sessions, and we’re nearing last call for speakers. If you’re interested in speaking at these or any other sessions, get your pitch in now! (And if you’ve already pitched for any of my sessions, you should hear something by late this week or early next.)

SMX East speaker pitch form

The sessions I’m programming are below. In particular, I’m looking for case studies, real-world implementation best practices, and tactical, actionable stuff that attendees can bring home and implement right away. OK, maybe bring to work. Unless they work from home. Or a coffee shop. Like I am right now.

CSS, AJAX, Web 2.0 & SEO
This session looks at CSS, AJAX and Web 2.0 dynamic design techniques that can cause search engine indexing and ranking issues, with solutions to consider.

Enhanced Listings
Yahoo has SearchMonkey. Google has sitelinks management. Even Microsoft is looking at ways to dress-up listings. This session looks at the move toward enhanced listings and how search marketers can tap into them.

Flash & SEO
Google is handling Flash in a new way thanks to a partnership with Adobe, and Yahoo may soon do the same. Meanwhile, there are plenty of ‘old’ techniques to make Flash sites search engine friendly. But none of these techniques mean that Flash issues are solved. More in this session.

Unraveling URLs & Demystifying Domains
Can you find the same page on your site using different URLs? That might cause you duplicate content issues. Does your content management system put out parameters that block crawling? Own multiple domains pointing at the same site? Are you 301 redirecting them or leaving canonicalization to chance? Confused on even how to pronounce canonicalization, in addition to now being worried about it? Relax. This session looks at a variety of URL and domain name issues you should consider to increase your success with SEO.

What’s Really Black Hat Anyway?

August 24, 2008

Last week at SES, I sat in on the Black Hat, White Hat session. The panelists gave their definitions of “white hat” and “black hat” SEO, and then talked about particular techniques, such as paid links, and whether they thought they were OK or not. The panelists talked about shades of gray and the lack of rules and the need for experimentation.

I come from a background of working at a search engine, rather than from a background of being an SEO, so I may see things a bit differently than those on the panel.

White hat = no guidelines violations
Generally speaking, I don’t see a lot of shades of gray in this discussion. The search engines have published guidelines (and while I was at Google, I spent a lot of time expanding the descriptions of those guidelines to help make them clearer). Violate any of those guidelines and you risk having the site removed from the index. That’s pretty black and white.

Related discussions have more shades of gray
I do see shades of gray in different discussions, such as:

  • Do all the guidelines make sense in today’s technological environment? For instance, are there valid reasons for cloaking that don’t manipulate search engines and deceive users, such as showing search engine bots a canonical version of a URL and different versions of that URL to users for tracking purposes? For discussions like this, there’s a reasonable debate to be had about whether search engines should consider different rules, but too often, I see the discussion tally up the reasons why a technique isn’t deception and then conclude that it is, therefore, white hat. But whether a technique “should” be OK and whether it’s white hat are two different discussions. Cloaking is currently against Google guidelines regardless of intent. So justified or not, it’s not white hat with the current set of guidelines.
  • Do techniques that violate the guidelines work? Often, I see the discussion of black hat vs. white hat veer into a discussion about what techniques are most effective: which ones work for enterprise sites or affiliate sites or sites in highly competitive areas. This came up at the panel, when someone talked about how they didn’t believe that a completely white hat site in one of the three Ps (porn, pills or poker) could rank highly. Again, this is an interesting discussion, but a different one. Techniques that violate the guidelines may eventually get the site banned, regardless of their initial efficacy, so it’s important to understand the long term goal of the site before engaging in them. As the panelists noted, sites that are in it for the long term likely want the slow and steady approach.
  • What about techniques that violate the guidelines but are commonplace? I hear this discussion integrated with the hat discussion as well, often when talking about paid links. Paid links aren’t black hat, I sometimes hear, because everyone uses them and they’re vital for ranking success. But again, whether or not a technique is commonplace is somewhat irrelevant to the question people are asking when they want to know what is white hat. Paid links violate the guidelines (at least Google’s; the other engines aren’t quite as strict), so they can’t be considered a white hat technique. A different, but valid, discussion is whether all paid links should be against the guidelines.

While all of these discussions and more are certainly valid and useful, I feel the trouble comes in when people have these discussions as the discussion about what is white hat and what is black hat. When people who aren’t experienced in the intricacies of SEO look for information and they see statements like “these are white hat reasons to cloak” and “all paid links aren’t bad”, they can be led astray and think that those things adhere to search engine guidelines.

What are the guidelines?
As for the SES panel, I respect all of the panelists and felt they all had interesting, useful things to say. But for me, the question at hand is simple to answer. Techniques that violate the guidelines aren’t white hat. They may be effective (at least for a time), commonplace, non-deceptive, or justified, but that doesn’t make them white hat. To me, white hat is anything that doesn’t put the site at risk of being removed from the search index.

When I was at Google, I spent a lot of time expanding the guidelines, detailing examples, and providing options of techniques that didn’t violate the guidelines. Since then, Google’s continued to expand the information and make it even more helpful. Check them out, and in particular, click the links under the “quality guidelines – specific guidelines” if you want to read up on the details.

What is SEO?
So what is white hat SEO? The panelists agreed it was about creating quality content: being the most relevant result for a desired query. I absolutely agree, but SEO is also about making sure the site can be easily crawled and indexed by search engines. From a search engine perspective, the best site in the world is unlikely to rank if the bot can’t extract any content from it. That latter part of SEO is, of course, the motivation behind Jane and Robot (which is just getting started; look for more soon!).

Search-Friendly Flash?

July 15, 2008

A couple of weeks ago, Adobe announced that it was working with Google and Yahoo! on making Flash content easier to index in search engines. Google said it was using the search-engine specific Flash player that Adobe had made available (Yahoo!’s integration is still in the works). While I think it’s great and absolutely vital that search engines continue to evolve beyond strictly text (to ensure they are providing the best possible experience for their users), I don’t think this announcement means that all the Flash content on the web will now suddenly start ranking in search results and I don’t think that Flash developers can stop thinking about search engine optimization.

How search engines work
It all goes back to how search engines work. At least for now (even with all of the advancements in the last year around universal search), the foundations of the major search engines are based on text. The web began with primarily text-only pages and the search engine algorithms were built on that idea. When people started searching for information, they searched with words. We’re used to asking for things in words, after all, and since words were what the web was made up of, the questions and answers matched up quite well. Search engines are a bit of a middleman (middlemachine?) between a searcher’s textual questions and a web site’s textual answers.

Searching continues to be text based
Sure, you might imagine other types of exchanges. I might want to upload a picture of a person and ask for all the other pictures on the web of that person. Or I might want to search through the audio of a song for a particular lyric. All of those types of searches and more are coming (and some have been tried, with varying degrees of success), but at least for now, those applications are not how the three major search engines work and not how most people search.

Over time, search engines have experimented with different elements on pages beyond simply the text itself to better understand what those pages are about. But since these experiments are built on a text-based foundation, they have still mostly focused on text. For instance, search engines found that the text in the title may be a strong indicator of the focus of the page. The textual caption under an image is likely describing that image.

How Flash fits in with text-based search engines
Now, consider Flash. Most Flash pages contain little text. Those that do could often just as easily display that text outside of the Flash components (which would make it easier for those on screen readers and mobile phones, for instance, to view the content).

With this latest innovation in crawling Flash, Google can more easily access the text in Flash, but it still can’t process that text quite as well as it can HTML text because it isn’t extracting any meta data about it. As I mentioned earlier, search engines store all kinds of meta data based on the structure of the text in HTML, such as whether it’s in a title tag or an H1. So Flash-based text has that disadvantage.

Provide a separate URL for each piece of Flash content
Another consideration is how the Flash application itself is constructed. This new Flash player that Adobe is making available to Google and Yahoo! helps the search engines in that it enables them to access content they never could before. The crawlers can interact with the Flash application as a user would and crawl deeper into the application to get to text that may be four or five levels deep. At first glance, this may seem similar to search engine crawlers following links within HTML sites, but it can actually be quite different.

HTML pages (generally) have unique URLs for each page. Flash applications can be constructed that way, but can also be constructed so that as you go deeper into the application, the URL doesn’t change. This can be problematic for lots of usability reasons that have nothing to do with search. For instance, the back button in the browser doesn’t work. Users can’t easily email, Digg, or otherwise share a particular section of the Flash application. Bookmarking only works for the beginning of the Flash app.

As you might imagine, it also causes problems in search. Sure, the search engine crawlers may now be able to get to some of that content several levels in, but they have to index all of the text under a single URL. (Also note that they likely won’t index all of the application in this case; they will execute only a certain number of interactions.)

Say information about your latest product line is available once you choose “products” from the home page, then “new” from the products page, then “coming soon” from the new page. If the URL of the application doesn’t change for each interaction, then search engines will have to index the content from the home page, products page, new page, and coming soon page all under a single URL. When a searcher looks for your latest product line, that URL may appear in the results. But once the searcher clicks over, they aren’t brought to your coming soon page. They see your home page, and may have no idea where to go from there. If you ensure your Flash app uses a different URL for each page, then the searcher can be brought directly to the page that has the right content, which should greatly improve conversion rates and lower bounce rates.

But if you take the announcement that Google can now index Flash at face value, without looking deeper, you may not realize this, and think that your single-URL Flash application is now perfectly positioned for search.
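The products example above can be sketched as a toy index in Python. This is a deliberately simplified model (a plain word-to-URL lookup, nothing like a real search engine), and the example.com URLs, section text, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical text for each section of the application.
SECTIONS = {
    "home": "welcome to our store",
    "products": "browse all products",
    "new": "new arrivals this month",
    "coming_soon": "coming soon our latest product line",
}

# Case 1: the URL never changes, so every section is indexed under one URL.
SINGLE_URL_PAGES = [("http://example.com/", text) for text in SECTIONS.values()]

# Case 2: each section has its own URL.
MULTI_URL_PAGES = [
    ("http://example.com/", SECTIONS["home"]),
    ("http://example.com/products", SECTIONS["products"]),
    ("http://example.com/products/new", SECTIONS["new"]),
    ("http://example.com/products/new/coming-soon", SECTIONS["coming_soon"]),
]


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages:
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


single_index = build_index(SINGLE_URL_PAGES)
multi_index = build_index(MULTI_URL_PAGES)

single_hit = search(single_index, "latest product line")
multi_hit = search(multi_index, "latest product line")
```

With the single-URL version, the only thing the query can return is the home page URL, even though the matching words live three levels down. With separate URLs, the same query returns the coming soon page itself.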

Taking back the tour
Want an example of how the statement “Google can now index Flash” isn’t the whole story?

I’ve been watching the Tour de France. It’s playing on the Versus network for the first time this year. I’d never heard of the Versus network before (since it seems to mostly show ultimate fighting cage matches, this may be because I’m not its target audience; not to mention that I wasn’t the target audience for the network under its previous name, OLN, as I think it mostly played shows about people fishing then), and the network is looking to capitalize on this potential new audience.

Versus is spending a lot of money on its Tour de France campaign “Take Back the Tour”. It has put together flashy commercials and an equally flashy website.

[Image: firstpage]

Versus probably would like to be found when people search for [tour de france]. The Tour de France page on the main versus.com domain shows up in the search results, but the Take Back The Tour site that they spent so much money on? Nowhere to be found.

Well, they’re spending all the money on commercials and print ads, so maybe people have been searching for [take back the tour] as well. The site does rank #1 for that query on both Google and Live (although it’s down at #8 on Yahoo!). For all three engines, even those who do the search because they saw an ad might not be sure if the takebackthetour.com listing is really the official site based on how the listing looks in the search results.

[Image: results]

You can see that at this point, Google doesn’t see any content on the site and in fact, notes on the cached page that [take back the tour] appears only in links pointing to the page. Since it can’t extract any text, it has no way of knowing that the site is about the Tour de France.

Google still doesn’t index Flash executed via JavaScript
So. What’s the problem? Google crawls Flash now and all should be well. I see at least two problems. The first is fundamental. The Flash is loaded via JavaScript. Google noted in their blog post that:

“Googlebot does not execute some types of JavaScript. So if your web page loads a Flash file via JavaScript, Google may not be aware of that Flash file, in which case it will not be indexed.”

They did update the post later to say that:

“For our July 1st launch, we didn’t enable Flash indexing for Flash files embedded via SWFObject. We’re now rolling out an update that enables support for common JavaScript techniques for embedding Flash, including SWFObject and SWFObject2.”

Will this update help the Take Back the Tour site? Maybe not.
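The limitation in the first quote, a crawler that doesn’t run JavaScript never learning that the Flash file exists, can be sketched in Python. This is only a rough model under my own assumptions: the helper names and sample pages are hypothetical, the parser looks at static markup only, and it says nothing about how Googlebot is actually implemented. The `swfobject.embedSWF` call in the second sample is just inert text as far as this parser is concerned.

```python
from html.parser import HTMLParser


class StaticSwfFinder(HTMLParser):
    """Finds .swf references that are visible without running any script."""

    def __init__(self):
        super().__init__()
        self.swf_urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        candidate = None
        if tag == "object":
            candidate = attributes.get("data")
        elif tag == "embed":
            candidate = attributes.get("src")
        elif tag == "param" and (attributes.get("name") or "").lower() == "movie":
            candidate = attributes.get("value")
        # Ignore any query string when checking the file extension.
        if candidate and candidate.lower().split("?")[0].endswith(".swf"):
            if candidate not in self.swf_urls:
                self.swf_urls.append(candidate)


def find_static_swf_urls(html):
    """Return the unique .swf URLs referenced directly in the markup."""
    finder = StaticSwfFinder()
    finder.feed(html)
    finder.close()
    return finder.swf_urls


# Hypothetical page that embeds Flash directly in the markup.
STATIC_EMBED_HTML = """
<object type="application/x-shockwave-flash" data="tour.swf" width="800" height="600">
  <param name="movie" value="tour.swf">
  <p>Tour de France coverage and video.</p>
</object>
"""

# Hypothetical page that only loads Flash from a script.
SCRIPT_EMBED_HTML = """
<div id="flash">Tour de France coverage and video.</div>
<script>
  swfobject.embedSWF("tour.swf", "flash", "800", "600", "9.0.0");
</script>
"""

static_urls = find_static_swf_urls(STATIC_EMBED_HTML)
script_urls = find_static_swf_urls(SCRIPT_EMBED_HTML)
```

For the first page, the Flash file is discoverable from the markup alone. For the second, nothing in the static markup points to it, so anything that doesn’t execute the script has no Flash file to look at.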

Can Google find any words to index?
Another big obstacle to the crawl of this site is that even if Google could get to the Flash, it would find few words to index. Nearly all of the text on the site is contained in images. The first thing you see when you go to the site is lots of words, but the only ones that seem to be text, rather than part of the image, are in the link “join the movement”.

So, once Google can access the Flash, it will be able to crawl and index those words. This design is a theme throughout the site. Links like “back” are text. Nearly everything else is in images.

Let’s pretend for a moment that they changed the Flash file so that the text wasn’t contained in images (and that the JavaScript problem didn’t exist). Would this help indexing? Yes and no.

No separate URLs can lead to a poor experience for searchers
Each time you click a link in the Flash file, you are taken to another page, but the URL doesn’t change. It stays at takebackthetour.com no matter how you navigate. That means that any text Google does pick up will be indexed under that one URL.

By clicking about three levels deep, I can find TV spots about the tour. If the site designers added some text about those TV spots, using the language of their customers, then searchers looking for [tour de france video] or something similar might see the takebackthetour.com site come up in their search results. But when they clicked through to the site, they wouldn’t see the TV spots. They would see the Flash splash page. And they would have to figure out how to navigate through the site to find the video section. Chances are that many searchers would scan the initial page that came up, not see what they were looking for and go back to the search results to find another site.

Little chance for viral success
This makes for a poor user experience from search, but consider also that the creators of this campaign obviously are hoping it goes viral. If you want a site to go viral, you have to make it easily shareable. Sure, people may love the rant section or the video section or the contest, but no URL of any of these sections exists for those people to email, Digg, Twitter, Stumble, or otherwise share. A viral campaign that requires every person who shares the content to say, “go to this URL, then click ‘join the movement’, then click ‘how will you take back the tour’” is over before it even begins.

And what about accessibility? And those on the go? I watched the first night of the tour at a friend’s house. What if I had seen the commercial, wanted to check it out, and pulled up the site on my Windows Mobile Smartphone? I would have had this awesome experience:

[Image: nojavascript]

It’s not even an accurate error message, since the first problem is that I don’t have JavaScript support.

Be smart about Flash
Clearly, a few problems still exist with Flash websites. My view is this:

  • It’s important for web technology providers to think about things like accessibility and search engine optimization or those who implement those technologies will turn to other solutions. To this end, Adobe should be commended for continuing to evolve their offerings to better serve the needs of their users.
  • Search engines have to continue to evolve beyond HTML as their primary goal is to provide the best possible results for searchers. They can’t rely on site owners across the web understanding what technologies are better for search. Google is clearly working on “organizing all the world’s information”, not just all the information well optimized for search engines, and this latest Flash development is an important part of that evolution.
  • If you operate a business online, search is an important acquisition channel. Don’t leave such an important avenue for gaining new customers in the hands of others. Ensure that you are making it as easy as possible for search engines to find your content.
  • Flash may very well be a great technology for your site, but implement it wisely.

Irony

July 11, 2008

A (Google official) blog post about scraped content on a scraper site.

But the original does rank first.

Although not in blog search.

(But that appears to be because the post isn’t indexed in blogsearch at all. Because rivva.de is listed in its place?)
