Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index

A proposal for making AJAX crawlable

Wednesday, October 07, 2009 at 10:51 AM

Webmaster level: Advanced

Today we're excited to propose a new standard for making AJAX-based websites crawlable. This will benefit webmasters and users by making content from rich and interactive AJAX-based websites universally accessible through search results on any search engine that chooses to take part. We believe that making this content available for crawling and indexing could significantly improve the web.

While AJAX-based websites are popular with users, search engines traditionally are not able to access any of the content on them. The last time we checked, almost 70% of the websites we know about use JavaScript in some form or another. Of course, most of that JavaScript is not AJAX, but the better that search engines could crawl and index AJAX, the more that developers could add richer features to their websites and still show up in search engines.

Some of the goals that we wanted to achieve with this proposal were:
  • Minimal changes are required as the website grows

  • Users and search engines see the same content (no cloaking)

  • Search engines can send users directly to the AJAX URL (not to a static copy)

  • Site owners have a way of verifying that their AJAX website is rendered correctly and thus that the crawler has access to all the content


Here's how search engines would crawl and index AJAX in our initial proposal:
  • Slightly modify the URL fragments for stateful AJAX pages
    Stateful AJAX pages display the same content whenever accessed directly. These are pages that could be referred to in search results. Instead of a URL like http://example.com/page?query#state we would like to propose adding a token to make it possible to recognize these URLs: http://example.com/page?query#[FRAGMENTTOKEN]state . Based on a review of current URLs on the web, we propose using "!" (an exclamation point) as the token for this. The proposed URL that could be shown in search results would then be: http://example.com/page?query#!state.

  • Use a headless browser that outputs an HTML snapshot on your web server
    The headless browser is used to access the AJAX page and generates HTML code based on the final state in the browser. Only specially tagged URLs are passed to the headless browser for processing. By doing this on the server side, the website owner is in control of the HTML code that is generated and can easily verify that all JavaScript is executed correctly. An example of such a browser is HtmlUnit, an open-source "GUI-less browser for Java programs".

  • Allow search engine crawlers to access these URLs by escaping the state
    As URL fragments are never sent with requests to servers, it's necessary to slightly modify the URL used to access the page. At the same time, this tells the server to use the headless browser to generate HTML code instead of returning a page with JavaScript. Other, existing URLs - such as those used by the user - would be processed normally, bypassing the headless browser. We propose escaping the state information and adding it to the query parameters with a token. Using the previous example, one such URL would be http://example.com/page?query&[QUERYTOKEN]=state . Based on our analysis of current URLs on the web, we propose using "_escaped_fragment_" as the token. The proposed URL would then become http://example.com/page?query&_escaped_fragment_=state .

  • Show the original URL to users in the search results
    To improve the user experience, it makes sense to refer users directly to the AJAX-based pages. This can be achieved by showing the original URL (such as http://example.com/page?query#!state from our example above) in the search results. Search engines can check that the indexable text returned to Googlebot is the same or a subset of the text that is returned to users.
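
The server-side routing described in the steps above could look roughly like the following Python sketch. It is only an illustration, not part of the proposal: `handle_request`, `render_snapshot`, and `serve_ajax_page` are hypothetical names, and `render_snapshot` merely stands in for a real headless browser such as HtmlUnit.

```python
from urllib.parse import parse_qs, urlsplit

QUERY_TOKEN = "_escaped_fragment_"  # query token proposed in the post


def render_snapshot(path: str, state: str) -> str:
    # Hypothetical stand-in: a real server would drive a headless browser
    # here and return the HTML it produces for this state.
    return f"<html><!-- snapshot of {path}#!{state} --></html>"


def serve_ajax_page(path: str) -> str:
    # Hypothetical stand-in for the normal response that ships JavaScript.
    return f"<html><script src='app.js'></script><!-- {path} --></html>"


def handle_request(request_url: str) -> str:
    """Send only specially tagged URLs to the snapshot renderer."""
    parts = urlsplit(request_url)
    # parse_qs also decodes the percent-escaped state value.
    params = parse_qs(parts.query, keep_blank_values=True)
    if QUERY_TOKEN in params:
        return render_snapshot(parts.path, params[QUERY_TOKEN][0])
    # Ordinary user requests bypass the headless browser entirely.
    return serve_ajax_page(parts.path)
```

A request for `/page?query&_escaped_fragment_=state` would take the snapshot branch, while `/page?query` would be served normally.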



(Graphic by Katharina Probst)

In summary, starting with a stateful URL such as
http://example.com/dictionary.html#AJAX , it could be available to both crawlers and users as
http://example.com/dictionary.html#!AJAX which could be crawled as
http://example.com/dictionary.html?_escaped_fragment_=AJAX which in turn would be shown to users and accessed as
http://example.com/dictionary.html#!AJAX
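
The mapping in this summary can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names `to_crawlable` and `to_user_facing` are hypothetical, and plain percent-encoding is assumed for "escaping the state", since the post does not spell out the exact escaping rules.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

FRAGMENT_TOKEN = "!"                # fragment token proposed in the post
QUERY_TOKEN = "_escaped_fragment_"  # query token proposed in the post


def to_crawlable(url: str) -> str:
    """Map .../page?query#!state to .../page?query&_escaped_fragment_=state."""
    parts = urlsplit(url)
    if not parts.fragment.startswith(FRAGMENT_TOKEN):
        return url  # not a tagged stateful URL; leave it unchanged
    state = parts.fragment[len(FRAGMENT_TOKEN):]
    # Assumption: percent-encode every reserved character in the state.
    param = f"{QUERY_TOKEN}={quote(state, safe='')}"
    query = f"{parts.query}&{param}" if parts.query else param
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def to_user_facing(url: str) -> str:
    """Inverse mapping: recover the #! URL that is shown in search results."""
    parts = urlsplit(url)
    kept, state = [], None
    for pair in parts.query.split("&"):
        name, _, value = pair.partition("=")
        if name == QUERY_TOKEN:
            state = unquote(value)
        elif pair:
            kept.append(pair)
    if state is None:
        return url  # no escaped fragment present
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       "&".join(kept), FRAGMENT_TOKEN + state))
```

With the dictionary example, `to_crawlable("http://example.com/dictionary.html#!AJAX")` yields the `?_escaped_fragment_=AJAX` form, and `to_user_facing` turns it back into the `#!AJAX` URL.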

View the presentation

We're currently working on a proposal and a prototype implementation. Feedback is very welcome — please add your comments below or in our Webmaster Help Forum. Thank you for your interest in making the AJAX-based web accessible and useful through search engines!


84 comments:

Kelvin Mackay said...

First thoughts - while this is an excellent idea, won't it allow dodgy webmasters to serve different content to search engines and users?

Since the basis of the argument in favour is that executing JavaScript to see what the user sees is far too resource-intensive, logically this means verifying that the content isn't different is also too resource-intensive.

I can imagine major search engines have the resources to do random checks from time to time, but I bet it'll be a while (years?) before the technology to do that is perfected.

Until then, will this be a spammer's dream come true?

Marcus Westin said...

This is fantastic. There is no reason to make websites page-based anymore, except seo. Let's get this moving.

Miller said...

what about Flash websites that also use deep-linking? making the server return an HTML snapshot wouldn't work..

why use #!deep-link instead of #deep-link? - is it just to mark which pages should be indexed? if it is why not add the special char just to the pages that you don't want to be indexed?

and what about old pages? everybody should change the link structure just because of this?

I don't think it's a good solution to add the headless browser to the server.. I would rather do a version of the website that runs without javascript.. content should be accessible to users and not only to crawlers..

Ortega said...

The basic problem of getting Ajax the same way as the user boils down to the old problem:

"The search engine crawler is a blind and deaf user. How does a blind and deaf user perceive an Ajax heavy page?"

The solution would be to look at already existing standards to solve this problem. This way you could avoid the risk that SEOs serve you different content than they serve to regular users.

Your solution is too synthetic and is detached from the disability of search engines: blind and deaf.

So let's have a look at how users with disabilities deal with Ajax pages:

WAI-ARIA can add an additional semantic information layer on top of the Ajax generated page. This way the content and the actions get sense again and you can deal with information like you already do with plain html.

Additionally, you encourage web developers to use defined standards to make their Ajax pages accessible to everyone - not only to one very important user with a disability: your search engine.

Now let's have a look how this can be solved:

WAI-ARIA usually sets roles and states with JavaScript. These roles can be queried even in a headless browser.
My suggestion would be to look at the tools that users with disabilities use.

An intelligent solution would be to look at NVDA.

NVDA is a free, open-source screen reader for Windows written in Python. You can run NVDA in a Python console and interact with the page like a blind user.

Google has a lot of Python coders who would be able to copy the behavior of blind NVDA users in Python and at the same time create an incentive for developers to make their Ajax pages accessible.

But the biggest benefit would be to use the same view of Ajax-generated pages as a regular user, and not the one created by SEOs.

Tejaswi said...

Hope the prototype implementation is on a Google SERP :-)

http://www.google.com/#q=citeseer

Should look like:
http://www.google.com/#!q=citeseer

Would make it easier to crawl AJAX :-)

Simon Lynch said...

Fantastic to see you are moving on this (have just spent a month trying to stop new code confusing Googlebot...and I am not sure we are really getting there in a very elegant way).

What about adding a switch to solve the problem of how to serve contextual (or interest-based, demographically targeted) ads on pages where the user is logged in and there is no public URL for the content, e.g. an inbox? If we can insert a flag in the URL, an ad network would then know not to try to index the page (as it will not work) and can then choose to serve ads based on other parameters.

An example would be AdSense (as there is now display from Doubleclick and other networks and interest-based targeting). If the Mediapartners bot could be told 'don't bother', it would speed up ad serving and reduce load for both the publisher and Google.

monty said...

Or you could practice unobtrusive javascript that doesn't depend on JS for content. My site Complimedia Online does this quite well.

Depending on servers to change will take years.

Gorka said...

Truly I think this is a waste of time. If you follow a real MVC Framework for web applications there is no real need to have content that needs to be indexed behind some AJAX interface. This is like saying "we know you have no idea how to build RIA and we don't expect you to change things so we will do our best to crawl everything (even things that shouldn't be crawled by html spiders)"

If you want content to be indexed, put it behind a permanent unique URL. (period - I cannot stress this enough!!)

Cheers,
Gorka
Mexico

Phil Bogle said...

I'm missing something important. Won't many of the AJAX URLs be composed and fetched in code? How is the crawler supposed to determine the set of AJAX URLs that will be fetched by a given page?

James Butler said...

This seems to be a proposed solution in search of a problem. AFAIK, the benefits of AJAX have less to do with whether the visitor's screen refreshes (it always does) and more to do with rapid interaction with complex dataset displays.

For example, while it is "cool" and all to make an entire site using AJAX just because one can, this is not the reason why AJAX was developed.

Of what use are the returned status codes if all the AJAX is doing is making a request for static page subcontent? AJAX is more interactive with the server than static pages because it is important to the structure to understand each step of the request process to validate the quality of the data being requested, not necessarily to overcome browser caching limitations, as many have suggested.

I cannot think of a situation where having Google crawl AJAX-delivered content would not be handled better with static pages, or page requests with one or two content-oriented parameters that would be passed using GET-style notation.

For every AJAX project I have ever been involved with, AJAX was selected as an appropriate coding environment because of the interaction between the page and the back end ... such as updating dynamic data drawn from a server-side database or from some other location.

Plain old content pages are never served well by using AJAX. There is too much server interaction to warrant it, and the visitor cannot tell the difference between the delay in grabbing and displaying content in an AJAX-supplied section of the page and the delay in grabbing and displaying the full page, utilizing the browser's standard caching capabilities to minimize image requests, etc.

I can't see that this will be important to anyone who is a serious developer. It will only be important to those who are amused by coding tricks and feel the need to employ them at every opportunity, regardless of whether the technology is appropriate or not.

Simon Lynch said...

@James Butler

This is a real problem. To give a simple example from what we have been doing on http://jobs.justlanded.com - every job can be replied to, favourited and reported. If you do this properly (i.e. to work for non-JS users) you get 3 useless pages for Googlebot to index for each useful page of content. If your site runs in 40 languages, Googlebot ends up spidering 120 URLs of useless noindex actions. OK, you can choose to just give functionality to JS-enabled users, but that doesn't work either, as Googlebot is following a lot of JS links now.

Costa Rica said...

Sometimes an opinion is just an opinion and it doesn't necessarily have to be the right one, so please accept my apologies in advance if I'm wrong for any reason.

The crss seems to have the solution that Google bot needs for crawling webpages with important content.
1. the sitemap must contain only an acceptable website, webpages within websites ( acceptable and able to index ).
2. developing a sitemap maker with crss system, where new important content is uploaded to websites servers anytime, they get submitted by webmaster with crss, if they qualify by an automatic acceptance in Google, if it gets approved.
3. A sitemap for pictures and video must be done separately and submitted via crss only by the webmasters when they are accepted and ranked, if not, there is no sense to make it difficult, just change it for something better.
4. Google bot will not have to spend any more time crawling and discovering endless links and useless websites while exposing itself and the viewers to spam and garbage content.
5. Let the webmaster make their sitemaps with a certified sensitive program from Google only; Approved by Google only, and with qualified webpages by Google only.
In my opinion the job is not for Google to spider the webpages, it should be done by the interested webmasters, if they want their content to appear in the search engines ever. For all of these reasons, a standard system with site mapping must be developed, and there the crss will take effect. The program will automatically let the webmaster find out the importance and page rank of the new content. If it is not enough, then he will have to improve and submit it again and take it into consideration.
Also, a while ago someone from Google said that what counts is excellent content on text and pictures, and sometimes 5 or 10 webpages do better than a million or more non relevant webpages. So, let's take the time to do the best in small amounts and not millions of unusable ones that are good for nothing.
Why do I say all of this? Well, it doesn't make any sense to see webmasters tools showing for a website more than a thousand indexed webpages and in the results for searching from viewers only one or two, and then just 2 weak keywords working. Lets get real and submit only the webpages that will work some way or another.

Simon Lynch said...

@Costa Rica

Would just like to politely point out that your proposed solution suffers from a big problem. You write, "In my opinion the job is not for Google to spider the webpages, it should be done by the interested webmasters". In a perfect world this might be true, but Google needs to spider because it needs to determine relevance. If it were up to webmasters to do so I can guarantee that the results would not be very useful (unless you really need to buy performance enhancing medication (I would type something else but I will get spamblocked) etc). You then write, "The program will automatically let the webmaster find out the importance and page rank of the new content." - this is clearly not a good idea for users or Google. In a world which worked this way, all you would produce is spammy results. Lots of people are very keen on being 'important' and 'ranked' - if you gave people an iterative system to work against, all you will produce is spam.

Simon Lynch said...

Have a question for anyone at Google reading this. On http://jobs.justlanded.com we ended up putting actions (such as reply or report) in a directory blocked by robots.txt. From what we can see the crawl is still going to these pages.

Previously we tried with a single page with a URL parameter marked with a noindex meta, but they were still being crawled (ie the bot was taking each param as a unique URL and trying to crawl).

Is there any way we can tell Googlebot not to crawl where it doesn't need to? Maybe you could propose a rel=dontfollow tag to instruct a crawler to ignore the link completely (which is not the case with nofollow).

Jano said...

What does "_escaped_fragment_=" mean in this case? How can Google recognize the variable name I'm sending via AJAX?

Ellithy said...

wow, this is good
but will make major traffic changes
I think sites like facebook will now get huge traffic from Google

k said...

I can't see this working as there are so many ajax methods out there


Forums

Tiago said...

It looks like an excellent idea. I have some questions:

- How would you be able to index AJAX webpages with real-time content (e.g. a page with the same address but a high frequency of content changes, like Twitter)?

- How many resources and how much bandwidth will Googlebot need from the hosted pages to index this real-time content?

- Have you published any paper on that?

Lorenzo Pastrana said...

Google says : "No, no, our crawler will not be AJAX compatible, please make sure (for us to continue to make ca$h on your content) that everything works exactly as in 1996 ...", suggested solution is just an ugly workload transfer. Arrgh ! This is a google prescribed hack!

frank said...

It goes much further than only making AJAX crawlable. Reading systems for blind people are also not compatible with dynamic JavaScript solutions.
My suggestion would be to let the application run in two 'states': a JavaScript (AJAX) enabled state and an HTML (non-AJAX) state. This means doing either partial updates or full page refreshes. The 'default' HTML returned from the web server is the HTML already rendered for the initial state, and all state (page) transitions are normal links (with or without SEO tags). All links can be followed by crawlers. When JavaScript is enabled, however, the behavior of the HTML links (which should become AJAX interactions) is overruled. All onclicks go to the 'ajax interaction' JavaScript component, which decides which AJAX action to execute and does a partial update of the page state instead of a full page transition. This combines both worlds in my opinion and is also a solution for non-JavaScript environments.

Lorenzo Pastrana said...

This is exactly what we are doing ... And should be the only way to have a crawlable AJAX site (that is not an in-browser web app that would not need any referencing)

MrM said...

In my opinion, this is a terrible solution. It is on the right track, but with one major mistake. The headless browser should be run at the side of the crawler, not the webserver.

Why should every webserver install a headless browser to accommodate Google, when Google can do it and solve the problem instantly.

Googlebot should simply fetch the page with all its JavaScript, execute it in a headless browser and index the result. This would also eliminate the problem of serving different content to spiders and regular users.

Tiago Vieira said...

@MrM - your argument is good.

aartiles said...

Here is my interpretation of the proposal, in Spanish:
http://www.bitacoradewebmaster.com/2009/10/08/propuesta-de-estandar-de-google-para-hacer-un-sitio-ajax-indexable/

Errioxa (Spanish speaker) said...

And is this example valid?

<a href="domain.com/title-url/archive.html" onclick="myFunctionAjax(x,y); return false;">text description</a>

Google will follow the link and the users will see ajax.

justpcgamer said...

Why not just use degradable Ajax? What is a headless browser also :S

Sky said...

I was trying to work out a standard for this, from a web-developer perspective a little while ago.

I think the main issue with ajax hashes is that they conflict with id markers. So this was my thought.

Browsers should implement id searches of everything in the location.hash section before a space (i.e. %20). So, e.g. at:
http://example.com/#foo%20bar
an id="foo" element would match CSS :target selectors.

The advantage to using a space (%20) is that spaces are not allowed in id elements (even in HTML5).

So, in summary, I think a space is a better choice than an '!' but including !'s as not-allowed in ID attributes in HTML5 would be another way to go.

Either way, I'd ask that browsers split the hash section into what registers for #target selectors, and keep the rest static, etc. That way, the hash section can represent a specific sub-section of a document AND the ajax state of the document.

Alan said...

We've been working on the 'headless browser' concept, and actually have a system that runs server-side if the client doesn't have Javascript. All state change events get converted into clickable URLs. It's called golf:

http://code.google.com/p/golf/

FinalDestiny said...

Woooow damn indexing ajax is great!!

Blake said...

This is really great. Using the Hijax approach for building web applications makes implementing this solution incredibly simple. Since no public content should be accessible only via AJAX, all content is already mapped to a static page. The trick is making your application aware of the _escaped_fragment_ in the query string and rendering the appropriate data from there.

For Newd (Crispin Porter + Bogusky's open source data aggregator) the updates necessary to meet this standard took only a few hours, with no perceivable negative impact on the application itself.

You can grab the source at my GitHub page and read through the process I posted on my site.

XUL Developers said...

The server-side headless browser is a really crazy idea. I understand that Google is simply trying to offload its job/capacity problems to others.

I witnessed last week that their Javascript parser is not working properly when resolving onclick links resulting in "http://example.com/undefined" urls. It seems that headless browser for crawling the pages is a tough problem for Google itself and it is trying to make it the problem of others. Not fair.

Michael Bensoussan said...

ahah, love that idea, I could code in GWT more often :-)

Jeremy said...

Interesting, but the life of a URL looks complex. Not sure why Google could not just crawl the fragments (considering the fragments as part of the URL). The webmaster will have to make sure to serve the right state when a URL with a fragment is requested, but this is definitely a good practice to have anyway.

Citrus said...




I agree with MrM above, who said:
> "Googlebot should simply fetch the page with all its javascript, execute it in a headless browser and index the result."

The results of AJAX requests should not be considered indexable content. What if the browser-side JavaScript filters out some of the content before displaying it? And what if the JavaScript is used to generate content and insert it into the DOM without AJAX requests?

The proposal from Google is unsuitable for a modern thin server web architecture like LimeBits.

Google needs to bite the bullet and execute the JavaScript on its own headless browsers. Imagine what a great service this will be, and how much of an advance over Bing!


tjpick@gmail.com said...

Wouldn't it be much much easier to:

1. give the content behind the ajax call a static url, which can be accessed by non-js clients too

2. add something like < link rel="alternate" media="ajax" type="text/html" href="http://example.com/doc.html#state" /> to the head of that static version, which the search engines can then use as a pointer back to the JS version

?

James Butler said...

@Simon Lynch

You have exposed one of the biggest issues with AJAX crawling, in my opinion (aside from its overuse in situations that simply do not benefit from it):

Should ALL of the AJAX content be crawled?

What about that bit that is showing the local temperature? Does the crawler need that one?

How about the one that is maintaining user state? Useful to a searcher from another country?

And the other one that is simply a preview function? Should that one be crawled, too?

Using an indicator, such as the bang under consideration, would help here, but will not be useful in every situation. For example, if Google is crawling the directory you intended to be private (probably because its permissions inheritance from the "mother page" allows Google to access that directory without re-checking robots.txt), where does that end?

If you are serving pages that contain no dynamic data, then use static pages. The issue is gone.

If your page requires dynamic data in order to become fully-formed, and you want spiders to crawl it in its state following the insertion of dynamic data, you need to look hard at WHY you are using AJAX to generate that page, and WHETHER it should be converted to static, or supplied to the crawler as static alongside the AJAX-enabled version for real visitors to use.

Again, just because you CAN does not mean that you SHOULD.

Ortega said...

@frank That's not correct. I'm blind myself and I can use dynamic html with different screen-readers very well, if the pages use WAI-ARIA. The myth blind users don't use javascript or can't perceive ajax pages seems still to be alive. What a pity!

@miller If you look at the newest release of the free and open source screen-reader NVDA you find already support for flash and java applets:
http://www.nvda-project.org/blog/NVDA2009.1beta1Released

XoraX said...

I think this requires taking the real fragment (the anchor) into account too, if it is passed in the AJAX request.

So you would define another token in the AJAX state...

for example :
http://www.xorax.info/blog/programmation/259-x-sendfile.html#programmation/171-ma-killer-app.html#comments
target on :
http://www.xorax.info/blog/programmation/171-ma-killer-app.html#comments

Here, if the second token is "#", the URL available to users and shown in results will be:
http://www.xorax.info/blog/programmation/259-x-sendfile.html#!programmation/171-ma-killer-app.html#comments

Jamie Hill said...

...or just use unobtrusive javascript.

Ranjeet Smith said...

The sad thing is that people actually got paid to come up with this.

Tim Acheson said...

May I offer some constructive criticism? This is an extremely bad solution.

The role of Google's search engine is to facilitate search. That means working with the web as it is, not trying to change it.

Yes, it's a challenge to index Ajax; but it's not impossible. Yes, it will require work by Google. Guys, that’s your job. The next-generation of indexing wizardry needs to be able to do this without every website implementing something that Google has come up with.

I wish Google would stop trying to tell us how to create the web. When Google issues a decree like this, it can have a huge impact.

For instance, Google has decreed that dashes are better for friendly URLs. It’s still in their guidelines now. As a direct result, the web has been corrupted by a spam of hyphens in the majority of URLs. In almost every case, the result is grossly incorrect punctuation. This might be comical, if it were not such a serious matter. Increasing numbers of children are leaving school with poor basic skills, while Google merrily tells everybody to misuse punctuation and ignore its proper purpose because it works better with their search engine. Where will this corruption of the web by Google end?

oliverw said...

1. Introduce a new meta tag that pages can use to indicate that they are AJAX-powered

2. Change your crawler backend to load the content into a headless browser on your infrastructure and crawl the DOM

3. Profit

Sharanyan said...

John,

This post is very useful for all programmers and webmasters. Programmers don't know how to develop a search engine crawlable site with Ajax.

This post shows us how to get the best results by using Ajax properly so search engines can easily crawl and index our website. When using AJAX, functions are called using onClick in links. These functions aren't visible to search engines because of the JavaScript restriction.

My website Seo Company was also showing the same conditions; however, I have now optimized all Ajax code properly.

Thanks for the great post, John, keep it up.

Regards,
Sharanyan

Remi said...

AJAX shouldn't be used for navigation. This is not a good idea. It will just allow more spamming opportunities to exploit for the next year.

grauw said...

I think it’s good that Google is thinking about tackling a long-standing problem with AJAX-sites, but the suggested implementation is a bit ugly… I don’t like that you have to add a bang to the hash, and instead of using a _escaped_fragment_ query parameter it would be better to use HTTP content negotiation.

I would say that probably the best solution is to allow JavaScript to rewrite the path and query parts of the URL, then you would have neither problem, nor the need to (ab)use the hash part of the URL. Functionality for scripts to do that will be part of HTML5 (and is apparently already implemented by IE8). But Google doesn’t seem to want to wait for that :(.

marcin said...

I can't understand why you don't want to request the URL as it is: http://example.com/#?key=value&q=query

Why do you want to replace # with #! ? It makes no sense to me. In the scenario above, http://example.com/#?key=value&q=query, if you can read any #hash params from REQUEST_URI, it means that a headless browser has requested it, since only a headless browser can send a URL with #hash params; for all other browsers the part after # won't be sent. I am confused by this idea of ! or _escaped_fragment_. Why make it so complicated instead of simply using http://example.com/#?key=value&q=query and leaving the rest to developers: checking whether any # fragment params have been sent and, if so, treating it as a headless browser asking for static content? It's much simpler, I reckon.

cheers,
/Marcin

Mikhail Kozlov said...

Sounds like a good idea, but how about making unobtrusive JavaScript a standard? I think this is where the main push should be made at first. Those who care about information appearing in search results will use it; others won't.

darkflame said...

I'm really keen for a solution asap.
Because at the moment it leads to some crazy situations.
If you have an app like Google Maps, and you want anyone to be able to link to a spot AND those links to be visible to search engines, it's nearly impossible.

Even Google Maps has a horrible "click for a link" feature...something I'm also having to implement on my site.

Users should be able to just cut and paste what's in their URL, post that on a page, and search engines should be able to associate it with the content correctly.

marcin said...

Hi there, please have a look at our proposal with a working example at http://feedsmanagement.com/proposals/crawlable.ajax and please let us know what you think about this solution

cheers,
/Marcin

snifan said...

I am a newbie blogger and am still learning how to modify my blog so it does better in search engine results. I do not use AJAX in my blog. Please help me!

Leigh said...

Perhaps Google should put its energy into developing declarative approaches for what the AJAX community currently does with JavaScript. There are already standards in this area that are designed for accessibility and are search-engine friendly (XForms).

A binding language for matching intent-based markup to JavaScript presentation (along the lines of XBL) would be a welcome addition, and let Javascript take the web beyond CSS.

Darren Beige said...

This implementation of the idea is terrible. I don't want to have to code into my server-side application the reception of an '_escaped_fragment_' GET param, nor do I wish to have to deal with a ! character in my JS. The MAJOR issue with this approach is having to rely on a headless browser running on the webserver, rather than on Googlebot's many thousands of server farms. This causes large amounts of unnecessary strain on my resources, something which I cannot justify (due to cost).

Google (and other engines) have certainly got to crawl the AJAX web, but not like this. The suggestion raised earlier (about interpreting JavaScript URL rewrites) is very appealing though.
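
For reference, the server-side work being objected to here comes down to recognizing one extra query parameter. The following is only a rough sketch, assuming the mapping discussed in this thread (a `#!state` fragment is requested by the crawler as `_escaped_fragment_=state`); the function names `toCrawlerUrl` and `stateFromCrawlerRequest` are hypothetical, and producing the actual HTML snapshot (the expensive headless-browser part) is not shown.

```javascript
// Hypothetical helpers for illustration; not part of any library.

// Crawler side: turn a "#!" URL into the URL that would be requested.
function toCrawlerUrl(prettyUrl) {
  const url = new URL(prettyUrl);
  if (!url.hash.startsWith("#!")) {
    return null; // an ordinary fragment (or none): nothing to rewrite
  }
  const state = url.hash.slice(2); // drop the leading "#!"
  url.hash = "";
  url.searchParams.append("_escaped_fragment_", state);
  return url.toString();
}

// Server side: extract the AJAX state from an incoming request URL.
// Returns null for ordinary (non-crawler) requests.
function stateFromCrawlerRequest(requestPath) {
  // The base is a placeholder so that relative request paths can be parsed.
  const url = new URL(requestPath, "http://placeholder.invalid");
  return url.searchParams.get("_escaped_fragment_");
}

console.log(toCrawlerUrl("http://example.com/page?q=1#!state"));
console.log(stateFromCrawlerRequest("/page?q=1&_escaped_fragment_=state"));
```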

RJ said...

Is there a way to delete the presentation that found its way into my Google Docs account?

KPI LIST said...

When using rewrite rules that do not have a hash, what element of the URL should be presented with the [FRAGMENTTOKEN] "!"?

Does this look like a stateful URL?

ajax.open("GET", unescape("http%3A//www.example.com/process/find/34847/Product_Home"));

Adding a token might look like this:

http://www.example.com/2/34847/Product_Home/#!AJAX

Should the URLs be rewritten to generate a stateful URL?

When developing for products we like to rewrite URLs like this:

http://www.examle.com/1/190999/product _name_product_description_descritpion2/

Our AJAX is obviously calling a central DB, so should we simply present URLs without headers to bots? How do we encode the URL # properly?

Jimbo said...

Why not just use the # sign followed by the usual key/value params found in normal URLs? That's how Gmail and Google Website Optimizer work. I'm guessing that Google can't unwind its ignoring of the # sign and is inventing the ! system to make up for it... Keep it simple. This is way too complicated.

Manish said...

Fine solution, except:

1. There's no need for '_escaped_fragment_' or to mandate '!' in the URL.

2. Let the site direct all requests from Google and other search engines to the headless server, based on either the user-agent string or some other request header that says it's a searchbot.

3. Let the headless server generate the HTML in such a way that it rewrites all '#' as '?'.

4. A rule may be added to the headless server that converts only '#!' URLs (leaving other '#' URLs intact). The site may choose '^' or '@' or anything else instead of '!' -- Google never even knows about this (it only sees '?' in the end).

In short, I like the idea of using a headless server, but there's no need for '!' and '_escaped_fragment_'; let the site have its own rules and mechanisms for this (Google doesn't need to care). The interface between Google and the website being crawled should be minimal.

Any downsides to this?
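
Points 3 and 4 could be sketched roughly as follows, assuming the headless browser has already produced the snapshot HTML as a string. The function name `rewriteFragmentLinks` is hypothetical, and a regular expression over HTML is only adequate for simple, double-quoted `href` attributes; a real implementation would more likely rewrite links in the DOM.

```javascript
// Hypothetical helper: rewrite href="...#<token>state" to href="...?state"
// (or "...&state" when the link already has a query string).
// Fragments that do not start with the token are left intact.
function rewriteFragmentLinks(html, token = "!") {
  // Escape the token so characters such as "^" are matched literally.
  const escapedToken = token.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(`href="([^"#]*)#${escapedToken}([^"]*)"`, "g");
  return html.replace(pattern, (match, base, state) => {
    const separator = base.includes("?") ? "&" : "?";
    return `href="${base}${separator}${state}"`;
  });
}

const snapshot = '<a href="/page#!state">stateful</a> <a href="#top">anchor</a>';
console.log(rewriteFragmentLinks(snapshot));
```

With the default token only the first link changes; passing `"^"` or `"@"` as the second argument models a site choosing its own marker.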

Manish Jethani said...

Oops. By "headless server" I really meant headless browser.

Simon Lynch said...

@Jimbo

Anything after # is not sent by the browser with a request, so strictly speaking this does not identify distinct URLs. Google 'unwinding' that treatment doesn't work with the other use of this for on-page anchors so is not very compatible with the rest of the [pre-AJAX] web.
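
A small sketch of that point, using the standard `URL` class: the part of a URL that a client puts into the HTTP request is the path plus the query, so two URLs that differ only in their fragment look identical to the server. The helper name `requestTarget` is made up for this illustration.

```javascript
// Hypothetical helper: the portion of a URL that is sent to the server.
// The fragment (url.hash) is deliberately left out, as it stays in the client.
function requestTarget(href) {
  const url = new URL(href);
  return url.pathname + url.search;
}

const first = requestTarget("http://example.com/page?query#state");
const second = requestTarget("http://example.com/page?query#other");

console.log(first);            // /page?query
console.log(first === second); // true
```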

mrgccc3 said...

When will this type of #!ajax be active?

Mauricio said...

When will this type of #!ajax be active? I already implemented it on my site www.magicafm.com

Alaa Murad said...

Do we really get blacklisted if we create a static site and feed Googlebot static HTML?

I was thinking of making a JSP that outputs the HTML as our GWT app would, and then using a Tomcat filter or other methods to rewrite the URL!

I would think it's a better way like this, but I have a fear of ending up on the blacklist. We spent the last months creating a CMS that is totally based on GWT, and we thought we would find a solution to this problem.

Alaa Murad said...

I'm re-looking at this proposal and I do see this as cloaking!

Not sure what is new here, as a lot of people are already using some method to provide static HTML to the server.

I think Google should indeed keep the "#" in the URL and leave it up to the user to cloak.

flex said...

Hi,
I would suggest you offer AJAX developers a few options instead of settling on one.

Here is how I separate the content from the design.

Standard Wordpress Blog:
http://www.flexcapacitor.com/content

Flex Wordpress Blog requesting only the content:
http://www.flexcapacitor.com/demos/Wordpress/
Notice the updates to the URL fragment.

Only the content:
http://www.flexcapacitor.com/content/?xml=1
Select view source

In this example I'm using a query string rather than using a fragment.
If I left "?xml=1" off, the blog page would load. There could be a standard fragment or query string parameter you could add to your page that returns only the content. Something like "seo=1".


The second method the site is using works like this:

Whenever the site makes an AJAX / asynchronous call it updates the URL fragment so it can restore the state later. Then it dispatches an "ASYNC_UPDATE_COMPLETE" event after a result/content is received from the server.

Now, when a search engine comes along to index a page that uses AJAX the search engine can listen for an ASYNC_UPDATE_COMPLETE event. When it receives this event it can grab the updated URL and treat the page as a completely new page and then index the new content.

For this to work you would have to have a standard event that search engines could listen for. That wouldn't be too hard for developers. You could also modify the popular AJAX libraries to dispatch this event when new data is returned. That way developers don't even have to think about it. They would only need to update the URL fragment.
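
The idea can be sketched like this. It is only an illustration under assumptions: "ASYNC_UPDATE_COMPLETE" is the event name suggested above, not an existing standard, and no crawler is known to listen for it. A plain `EventTarget` stands in for the page, and the `AsyncUpdateEvent` class, the `completeAsyncUpdate` function and the `crawlerIndex` map are all hypothetical names.

```javascript
// Event name proposed in the comment above; not an existing standard.
const ASYNC_UPDATE_COMPLETE = "ASYNC_UPDATE_COMPLETE";

// Event carrying the updated URL and the content that was loaded.
class AsyncUpdateEvent extends Event {
  constructor(url, content) {
    super(ASYNC_UPDATE_COMPLETE);
    this.url = url;
    this.content = content;
  }
}

// Stand-in for the page; in a browser this would be window or document.
const page = new EventTarget();

// Page side: called once the asynchronous response has been applied.
function completeAsyncUpdate(baseUrl, fragment, content) {
  const url = `${baseUrl}#${fragment}`; // the updated URL, fragment included
  page.dispatchEvent(new AsyncUpdateEvent(url, content));
}

// Crawler side: treat every announced URL as a new page to index.
const crawlerIndex = new Map();
page.addEventListener(ASYNC_UPDATE_COMPLETE, (event) => {
  crawlerIndex.set(event.url, event.content);
});

completeAsyncUpdate("http://example.com/", "s=batman", "<ul>search results</ul>");
console.log(crawlerIndex.get("http://example.com/#s=batman"));
```

An AJAX library could call something like `completeAsyncUpdate` from its own response handler, which is how developers would get the behaviour without thinking about it.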

jessica said...

I have a question from an SEO viewpoint. My website is developed in ASP (promodirect.com). I have heard that HTML is the best way to make your website SEO friendly compared to any scripting language. Can someone compare ASP and HTML websites from an SEO viewpoint?

nextadmin said...

Making AJAX searchable has the added significance of potentially adding AdSense to RIAs. I have been struggling with doing this for two years...

Francisco said...

As Ortega said,

Take a tour of the WAI-ARIA W3C standard at http://www.w3.org/WAI/intro/aria . It could very probably be a better solution.

We can't go patching code and URLs in different ways.

See the similarity of a crawler and a blind user.

Peace!

Madhu said...

I haven't chosen GWT for two reasons. One is indexing and two is ads. This at least solves one of the problems. What do you plan to do about ads? Ads currently are per page and AJAX has only one page.

Nadia King said...

Can we get an update on this topic? Any additional information would be great - is there a timeline on making a decision one way or the other?

Thanks!

ivanceras said...

Usually, when users share a link from an AJAX application, they copy the text from the location bar ("http://example.com/index.php#module=3") rather than clicking a special "Copy link" button, which has extra functionality to rephrase the link as "http://example.com/index.php?module=3". The rephrased link is supposed to work: when the user pastes it on his blog, Googlebot indexes it, and when a user clicks on it, the web server knows which AJAX module to serve, since the web server can capture the request URI "?module=3" but not "#module=3". With the fragment form, the only way to do it is to load the initial page, then look at the location URL, then load the AJAX module, which means a double request to the server.
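
What such a "Copy link" button might do can be sketched as below. The function name `toShareableLink` is hypothetical, and the sketch assumes the fragment holds ordinary key=value pairs, as in the "#module=3" example.

```javascript
// Hypothetical "Copy link" helper: move key=value pairs from the fragment
// into the query string, so that the server receives them with the request.
function toShareableLink(href) {
  const url = new URL(href);
  const fragmentParams = new URLSearchParams(url.hash.slice(1)); // drop "#"
  for (const [key, value] of fragmentParams) {
    url.searchParams.set(key, value);
  }
  url.hash = "";
  return url.toString();
}

console.log(toShareableLink("http://example.com/index.php#module=3"));
// http://example.com/index.php?module=3
```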

Miky said...

Does this "standard" work right now on Google, or is it just an idea for now?

Thanks.

mrgccc3 said...

Yeah, when will this become standard?

Career Soultions Workshop said...

If you need a site for testing please consider using www.BizPartnerHunt.com

This site is great but not used very much because people can't find it. The ads are not able to be crawled because it is an AJAX based website.

Gary said...

I'm surprised Google hasn't advanced further with crawling AJAX than this proposal.

As MrM points out, the obvious flaw is that the headless browser should be run at the side of the crawler, not the webserver.

flex said...

I like what was proposed but I also think developers should have another option. That is:

1. When the page first loads dispatch an event "hasAsync". This tells the search engine crawlers that the page has AJAX or Flash or whatever.
2. Henceforth, when you make an async call, update the URL fragment and, once the data is returned (and/or applied if possible), dispatch an "asyncUpdate" event.

This will tell the crawler that an asynchronous call has been made and to check the URL fragment and then associate the new content on the page with the new URL fragment.

USE CASE:
A fella goes to a webpage (for example, an AJAX version of www.google.com). When the page loads, an event, "hasAsync" or "hasAJAX", is dispatched via JavaScript. He doesn't know this or care, but the crawler is listening for this event. He enters a term in the search box, "batman", and then clicks the "search" button. When this button is clicked, the webpage updates the URL fragment to "http://google.com/#s=batman". Then it makes an asynchronous call. When the search results come back, the new information populates the page and the webpage dispatches an "asyncUpdate" event. The crawler is listening for this event. It then crawls and indexes the new content and associates it with the new URL and its new fragment, "http://google.com/#s=batman". The event would also contain the data returned.

Google, please hire me.

ERF said...

Really great!!!
But what is the status of this right now?
In our company we wanted to use GWT for product development, but we had to reconsider and use JSP and other Java technologies instead, just because the content can be indexed.
We would really like to start using the web toolkit.
I would really like to know the answer to my question, please.

Aaron said...

I too am wondering what the status of this is. I'm designing a new site (completely in AJAX) and am going to have to write additional linking methods to make it SEO friendly.