Wednesday, October 07, 2009 at 10:51 AM
Webmaster level: Advanced
Today we're excited to propose a new standard for making AJAX-based websites crawlable. This will benefit webmasters and users by making content from rich and interactive AJAX-based websites universally accessible through search results on any search engine that chooses to take part. We believe that making this content available for crawling and indexing could significantly improve the web.
While AJAX-based websites are popular with users, search engines traditionally are not able to access any of the content on them. The last time we checked, almost 70% of the websites we know about use JavaScript in some form or another. Of course, most of that JavaScript is not AJAX, but the better that search engines could crawl and index AJAX, the more that developers could add richer features to their websites and still show up in search engines.
Some of the goals that we wanted to achieve with this proposal were:
- Minimal changes are required as the website grows
- Users and search engines see the same content (no cloaking)
- Search engines can send users directly to the AJAX URL (not to a static copy)
- Site owners have a way of verifying that their AJAX website is rendered correctly and thus that the crawler has access to all the content
Here's how search engines would crawl and index AJAX in our initial proposal:
- Slightly modify the URL fragments for stateful AJAX pages
Stateful AJAX pages display the same content whenever accessed directly. These are pages that could be referred to in search results. Instead of a URL like http://example.com/page?query#state we would like to propose adding a token to make it possible to recognize these URLs: http://example.com/page?query#[FRAGMENTTOKEN]state . Based on a review of current URLs on the web, we propose using "!" (an exclamation point) as the token for this. The proposed URL that could be shown in search results would then be: http://example.com/page?query#!state.
- Use a headless browser that outputs an HTML snapshot on your web server
The headless browser is used to access the AJAX page and generates HTML code based on the final state in the browser. Only specially tagged URLs are passed to the headless browser for processing. By doing this on the server side, the website owner is in control of the HTML code that is generated and can easily verify that all JavaScript is executed correctly. An example of such a browser is HtmlUnit, an open-sourced "GUI-less browser for Java programs."
- Allow search engine crawlers to access these URLs by escaping the state
As URL fragments are never sent with requests to servers, it's necessary to slightly modify the URL used to access the page. At the same time, this tells the server to use the headless browser to generate HTML code instead of returning a page with JavaScript. Other, existing URLs - such as those used by the user - would be processed normally, bypassing the headless browser. We propose escaping the state information and adding it to the query parameters with a token. Using the previous example, one such URL would be http://example.com/page?query&[QUERYTOKEN]=state . Based on our analysis of current URLs on the web, we propose using "_escaped_fragment_" as the token. The proposed URL would then become http://example.com/page?query&_escaped_fragment_=state .
- Show the original URL to users in the search results
To improve the user experience, it makes sense to refer users directly to the AJAX-based pages. This can be achieved by showing the original URL (such as http://example.com/page?query#!state from our example above) in the search results. Search engines can check that the indexable text returned to Googlebot is the same or a subset of the text that is returned to users.
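As a rough illustration of the server-side step described above, the following Python sketch routes a request to a snapshot renderer only when the "_escaped_fragment_" token is present in the query string. The function names `handle_request`, `render_snapshot` and `serve_ajax_page` are hypothetical and not part of the proposal; `render_snapshot` stands in for whatever headless browser the site owner chooses to run.

```python
from urllib.parse import urlsplit, parse_qs

# Query token named in the proposal.
TOKEN = "_escaped_fragment_"


def handle_request(url, render_snapshot, serve_ajax_page):
    """Dispatch a request to the snapshot renderer or the normal AJAX page.

    render_snapshot(path, state) and serve_ajax_page(path) are hypothetical
    callables supplied by the site; they are assumptions for this sketch.
    """
    parts = urlsplit(url)
    # parse_qs percent-decodes values, which undoes the escaping of the state.
    params = parse_qs(parts.query, keep_blank_values=True)
    if TOKEN in params:
        state = params[TOKEN][0]
        return render_snapshot(parts.path, state)
    # Ordinary user requests bypass the headless browser entirely.
    return serve_ajax_page(parts.path)
```

A crawler request for http://example.com/page?query&_escaped_fragment_=state would reach `render_snapshot("/page", "state")`, while a normal visit to http://example.com/page?query would be served the usual JavaScript page.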

(Graphic by Katharina Probst)
In summary, starting with a stateful URL such as
http://example.com/dictionary.html#AJAX , it could be available to both crawlers and users as
http://example.com/dictionary.html#!AJAX which could be crawled as
http://example.com/dictionary.html?_escaped_fragment_=AJAX which in turn would be shown to users and accessed as
http://example.com/dictionary.html#!AJAX
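The mapping in this summary can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `to_crawler_url` and `to_user_url` are made up for this example, and the exact escaping rules are not spelled out in the proposal, so plain percent-encoding is assumed here.

```python
from urllib.parse import quote, unquote

# Query token named in the proposal.
TOKEN = "_escaped_fragment_"


def to_crawler_url(url):
    """Map a '#!state' URL to the '_escaped_fragment_' form a crawler would fetch."""
    base, sep, fragment = url.partition("#")
    if not sep or not fragment.startswith("!"):
        return url  # plain fragments (no '!') are left untouched
    state = fragment[1:]
    joiner = "&" if "?" in base else "?"
    # Assumption: the state is percent-encoded when moved into the query.
    return f"{base}{joiner}{TOKEN}={quote(state, safe='')}"


def to_user_url(url):
    """Map a crawled '_escaped_fragment_' URL back to the '#!state' form shown to users."""
    for marker in ("&" + TOKEN + "=", "?" + TOKEN + "="):
        base, sep, state = url.rpartition(marker)
        if sep:
            return f"{base}#!{unquote(state)}"
    return url
```

With these helpers, http://example.com/dictionary.html#!AJAX maps to http://example.com/dictionary.html?_escaped_fragment_=AJAX and back again, while http://example.com/dictionary.html#AJAX (no token) is returned unchanged.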
View the presentation
We're currently working on a proposal and a prototype implementation. Feedback is very welcome — please add your comments below or in our Webmaster Help Forum. Thank you for your interest in making the AJAX-based web accessible and useful through search engines!


120 comments:
First thoughts - while this is an excellent idea, won't it allow dodgy webmasters to serve different content to search engines and users?
Since the basis of the argument in favour is that executing JavaScript to see what the user sees is far too resource intensive, logically this means verifying that the content isn't different is also too resource intensive.
I can imagine major search engines have the resources to do random checks from time to time, but I bet it'll be a while (years?) before the technology to do that is perfected.
Until then, will this be a spammer's dream come true?
This is fantastic. There is no reason to make websites page-based anymore, except seo. Let's get this moving.
what about Flash websites that also use deep-linking? making the server return an HTML snapshot wouldn't work..
why use #!deep-link instead of #deep-link? - is it just to mark which pages should be indexed? if it is why not add the special char just to the pages that you don't want to be indexed?
and what about old pages? everybody should change the link structure just because of this?
I don't think it's a good solution to add the headless browser to the server.. I would rather do a version of the website that runs without javascript.. content should be accessible to users and not only to crawlers..
The basic problem of getting Ajax the same way as the user boils down to the old problem:
"The search engine crawler is a blind and deaf user. How does a blind and deaf user perceive an Ajax heavy page?"
The solution would be to look at already existing standards to solve this problem. This way you could avoid the risk that SEOs serve you different content than they serve to regular users.
Your solution is too synthetic and is detached from the disability of search engines: blind and deaf.
So let's have a look at how users with disabilities deal with Ajax pages:
WAI-ARIA can add an additional semantic information layer on top of the Ajax generated page. This way the content and the actions get sense again and you can deal with information like you already do with plain html.
Additionally you encourage web developers to use defined standards to make their Ajax pages accessible to everyone, not only to one very important user with a disability: your search engine.
Now let's have a look how this can be solved:
WAI-ARIA usually sets roles and states with JavaScript. These roles can be queried even in a headless browser.
My suggestion would be to look at the tools that users with disabilities use.
An intelligent solution would be to look at NVDA.
NVDA is a free open source screen reader for Windows written in Python. You can run NVDA in a Python console and interact with the page like a blind user.
There are a lot of Python coders at Google who would be able to copy the behavior of blind NVDA users in Python and at the same time create an incentive for developers to make their Ajax pages accessible.
But the biggest benefit would be to use the same view on Ajax generated pages as a regular user and not the created one by SEOs.
Hope the prototype implementation is on a Google SERP :-)
http://www.google.com/#q=citeseer
Should look like:
http://www.google.com/#!q=citeseer
Would make it easier to crawl AJAX :-)
Fantastic to see you are moving on this (have just spent a month trying to stop new code confusing Googlebot...and I am not sure we are really getting there in a very elegant way).
What about adding a switch to solve the problem of how to serve contextual (or interest-based, demographically targeted) ads on pages where the user is logged in and there is no public URL for the content, eg an inbox? If we can insert a flag in the URL, an ad network would then know not to try and index the page (as it will not work) and can then choose to serve ads based on other parameters.
An example would be AdSense (as there is now display from Doubleclick and other networks and interest-based targeting). If the Mediapartners bot could be told 'don't bother' it would speed up ad serving and reduce load for the publisher and Google.
Or you could practice unobtrusive javascript that doesn't depend on JS for content. My site Complimedia Online does this quite well.
Depending on servers to change will take years.
Truly I think this is a waste of time. If you follow a real MVC Framework for web applications there is no real need to have content that needs to be indexed behind some AJAX interface. This is like saying "we know you have no idea how to build RIA and we don't expect you to change things so we will do our best to crawl everything (even things that shouldn't be crawled by html spiders)"
If you want content to be indexed put it behind a permanent unique url. (period - I cannot stress this enough!!)
Cheers,
Gorka
Mexico
I'm missing something important. Won't many of the AJAX urls be composed and fetched in code? How is the crawler supposed to determine the set of AJAX urls that will be fetched by a given page?
This seems to be a proposed solution in search of a problem. AFAIK, the benefits of AJAX have less to do with whether the visitor's screen refreshes (it always does) and more to do with rapid interaction with complex dataset displays.
For example, while it is "cool" and all to make an entire site using AJAX just because one can, this is not the reason why AJAX was developed.
Of what use are the returned status codes if all the AJAX is doing is making a request for static page subcontent? AJAX is more interactive with the server than static pages because it is important to the structure to understand each step of the request process to validate the quality of the data being requested, not necessarily to overcome browser caching limitations, as many have suggested.
I cannot think of a situation where having Google crawl AJAX-delivered content would not be handled better with static pages, or page requests with one or two content-oriented parameters that would be passed using GET-style notation.
For every AJAX project I have ever been involved with, AJAX was selected as an appropriate coding environment because of the interaction between the page and the back end ... such as updating dynamic data drawn from a server-side database or from some other location.
Plain old content pages are never served well by using AJAX. There is too much server interaction to warrant it, and the visitor cannot tell the difference between the delay in grabbing and displaying content in an AJAX-supplied section of the page and the delay in grabbing and displaying the full page, utilizing the browser's standard caching capabilities to minimize image requests, etc.
I can't see that this will be important to anyone who is a serious developer. It will only be important to those who are amused by coding tricks and feel the need to employ them at every opportunity, regardless of whether the technology is appropriate or not.
@James Butler
This is a real problem. To give a simple example from what we have been doing on http://jobs.justlanded.com - every job can be replied to, favourited and reported. If you do this properly (ie to work for non-JS users) you get 3 useless pages for Googlebot to index for each useful page of content. If your site runs in 40 languages Googlebot ends up spidering 120 URLs of useless noindex actions. OK, you can choose to just give functionality to JS enabled users, but that doesn't work either as Googlebot is following a lot of JS links now.
Sometimes an opinion is just an opinion and it doesn't necessarily have to be the right one, so please accept my apologies in advance if I'm wrong for any reason.
The crss seems to have the solution that Google bot needs for crawling webpages with important content.
1. the sitemap must contain only an acceptable website, webpages within websites ( acceptable and able to index ).
2. developing a sitemap maker with crss system, where new important content is uploaded to websites servers anytime, they get submitted by webmaster with crss, if they qualify by an automatic acceptance in Google, if it gets approved.
3. A sitemap for pictures and video must be done separately and submitted via crss only by the webmasters when they are accepted and ranked, if not, there is no sense to make it difficult, just change it for something better.
4. Google bot will not have to spend any more time crawling and discovering endless links and useless websites while exposing itself and the viewers to spam and garbage content.
5. Let the webmaster make their sitemaps with a certified sensitive program from Google only; Approved by Google only, and with qualified webpages by Google only.
In my opinion the job is not for Google to spider the webpages, it should be done by the interested webmasters, if they want their content to appear in the search engines ever. For all of these reasons, a standard system with site mapping must be developed, and there the crss will take effect. The program will automatically let the webmaster find out the importance and page rank of the new content. If it is not enough, then he will have to improve and submit it again and take it into consideration.
Also, a while ago someone from Google said that what counts is excellent content on text and pictures, and sometimes 5 or 10 webpages do better than a million or more non relevant webpages. So, let's take the time to do the best in small amounts and not millions of unusable ones that are good for nothing.
Why do I say all of this? Well, it doesn't make any sense to see webmasters tools showing for a website more than a thousand indexed webpages and in the results for searching from viewers only one or two, and then just 2 weak keywords working. Lets get real and submit only the webpages that will work some way or another.
@Costa Rica
Would just like to politely point out that your proposed solution suffers from a big problem. You write, "In my opinion the job is not for Google to spider the webpages, it should be done by the interested webmasters". In a perfect world this might be true, but Google needs to spider because it needs to determine relevance. If it were up to webmasters to do so I can guarantee that the results would not be very useful (unless you really need to buy performance enhancing medication (I would type something else but I will get spamblocked) etc). You then write, "The program will automatically let the webmaster find out the importance and page rank of the new content." - this is clearly not a good idea for users or Google. In a world which worked this way, all you would produce is spammy results. Lots of people are very keen on being 'important' and 'ranked' - if you gave people an iterative system to work against, all you will produce is spam.
Have a question for anyone at Google reading this. On http://jobs.justlanded.com we ended up putting actions (such as reply or report) in a directory blocked by robots.txt. From what we can see the crawl is still going to these pages.
Previously we tried with a single page with a URL parameter marked with a noindex meta, but they were still being crawled (ie the bot was taking each param as a unique URL and trying to crawl).
Is there any way we can tell Googlebot to not crawl where it doesn't need to? Maybe you could propose a rel=dontfollow tag to instruct a crawler to ignore the link completely (which is not the case with nofollow).
What does it mean in this case: "_escaped_fragment_=" How can google recognize the variable name I'm sending via AJAX?
wow, this is good
but will make major traffic changes
I think sites like facebook will now get huge traffic from Google
I can't see this working as there are so many ajax methods out there
Forums
It looks like an excellent idea. I have some questions:
- how would you be able to index AJAX webpages with real-time content (e.g. page with same address but with high frequency of content changes)? (Like Twitter does?)
- How much resource and bandwidth will Googlebot need from the hosted pages to index this real-time content?
- Have you published any paper on that?
Google says : "No, no, our crawler will not be AJAX compatible, please make sure (for us to continue to make ca$h on your content) that everything works exactly as in 1996 ...", suggested solution is just an ugly workload transfer. Arrgh ! This is a google prescribed hack!
It goes much further than only making AJAX crawlable. Reading systems for blind people are also not compatible with dynamic JavaScript solutions.
My suggestion would be to let the application run in two 'states': a JavaScript (AJAX) enabled state and an HTML (non-AJAX) state. This means doing either partial updates or full page refreshes. The 'default' HTML returned from the web server is the HTML already rendered for the initial state, and all state (page) transitions are normal links (with or without SEO tags). All links can be followed by crawlers. When JavaScript is enabled, however, the behavior of the HTML links (which should become AJAX interactions) is overruled. All onclicks go to the 'AJAX interaction' JavaScript component, which decides which AJAX action to execute and does a partial update of the page state instead of a full page transition. This combines both worlds in my opinion and is also a solution for non-JavaScript environments.
This is exactly what we are doing ... And should be the only way to have a crawlable AJAX site (that is not an in-browser web app that would not need any referencing)
In my opinion, this is a terrible solution. It is on the right track, but with one major mistake. The headless browser should be run at the side of the crawler, not the webserver.
Why should every webserver install a headless browser to accommodate Google, when Google can do it and solve the problem instantly?
Googlebot should simply fetch the page with all its javascript, execute it in a headless browser and index the result. This would also eliminate the problem of serving different content to spiders and regular users.
@MrM - your argument is good.
Here is my interpretation of the proposal in Spanish:
http://www.bitacoradewebmaster.com/2009/10/08/propuesta-de-estandar-de-google-para-hacer-un-sitio-ajax-indexable/
And is this example valid?
<a href="domain.com/title-url/archive.html" onclick="myFunctionAjax(x,y); return false;">text description</a>
Google will follow the link and the users will see ajax.
Why not just use degradable Ajax? What is a headless browser also :S
I was trying to work out a standard for this, from a web-developer perspective a little while ago.
I think the main issue with ajax hashes is that they conflict with id markers. So this was my thought.
Browsers should implement id searches of everything in the location.hash section before a space (i.e. %20). So, e.g. at:
http://example.com/#foo%20bar
an id="foo" element would match CSS :target selectors.
The advantage to using a space (%20) is that spaces are not allowed in id elements (even in HTML5).
So, in summary, I think a space is a better choice than an '!' but including !'s as not-allowed in ID attributes in HTML5 would be another way to go.
Either way, I'd ask that browsers split the hash section into what registers for #target selectors, and keep the rest static, etc. That way, the hash section can represent a specific sub-section of a document AND the ajax state of the document.
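The split suggested in the comment above can be sketched in Python as follows. This is only an illustration of that commenter's idea, not of the proposal in the post or of any browser behavior; the function name `split_hash` is made up for this example.

```python
from urllib.parse import urlsplit, unquote


def split_hash(url):
    """Split a URL fragment at the first space into (target id, AJAX state).

    Hypothetical helper: the part before the first space would be matched
    against element ids, and the remainder would carry the AJAX state.
    """
    fragment = unquote(urlsplit(url).fragment)  # '%20' becomes a literal space
    target_id, _, ajax_state = fragment.partition(" ")
    return target_id, ajax_state
```

For http://example.com/#foo%20bar this yields the id "foo" and the state "bar"; a fragment without a space is treated entirely as an id.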
We've been working on the 'headless browser' concept, and actually have a system that runs server-side if the client doesn't have Javascript. All state change events get converted into clickable URLs. It's called golf:
http://code.google.com/p/golf/
Woooow damn indexing ajax is great!!
This is really great. Using the Hijax approach for building web applications makes implementing this solution incredibly simple. Since no public content should be accessible only via AJAX, all content is already mapped to a static page. The trick is making your application aware of the _escaped_fragment_ in the query string and rendering the appropriate data from there.
For Newd (Crispin Porter + Bogusky's open source data aggregator) the updates necessary to meet this standard took only a few hours with no perceivable negative impact to the application itself.
You can grab the source at my GitHub page and read through the process I posted on my site
The server-side headless browser is a really crazy idea. I understand that Google is simply trying to offload its job/capacity problems to others.
I witnessed last week that their Javascript parser is not working properly when resolving onclick links resulting in "http://example.com/undefined" urls. It seems that headless browser for crawling the pages is a tough problem for Google itself and it is trying to make it the problem of others. Not fair.
ahah, love that idea, I could code in GWT more often :-)
Interesting, but the life of a URL looks complex. Not sure why Google could not just crawl the fragments (considering the fragments as part of the URL). The webmaster would have to make sure to serve the right state when a URL with a fragment is requested, but this is definitely a good practice to have anyway.
I agree with MrM above, who said:
> "Googlebot should simply fetch the page with all its javascript, execute it in a headless browser and index the result."
The results of AJAX requests should not be considered indexable content. What if the browser-side JavaScript filters out some of the content before displaying it? And what if the JavaScript is used to generate content and insert it into the DOM without AJAX requests?
The proposal from Google is unsuitable for a modern thin server web architecture like LimeBits.
Google needs to bite the bullet and execute the JavaScript on its own headless browsers. Imagine what a great service this will be, and how much of an advance over Bing!
Wouldn't it be much much easier to:
1. give the content behind the ajax call a static url, which can be accessed by non-js clients too
2. add something like <link rel="alternate" media="ajax" type="text/html" href="http://example.com/doc.html#state" /> to the head of that static version, which the search engines can then use as a pointer back to the js version
?
@Simon Lynch
You have exposed one of the biggest issues with AJAX crawling, in my opinion (aside from its overuse in situations that simply do not benefit from it):
Should ALL of the AJAX content be crawled?
What about that bit that is showing the local temperature? Does the crawler need that one?
How about the one that is maintaining user state? Useful to a searcher from another country?
And the other one that is simply a preview function? Should that one be crawled, too?
Using an indicator, such as the bang under consideration, would help here, but will not be useful in every situation. For example, if Google is crawling the directory you intended to be private (probably because its permissions inheritance from the "mother page" allows Google to access that directory without re-checking robots.txt), where does that end?
If you are serving pages that contain no dynamic data, then use static pages. The issue is gone.
If your page requires dynamic data in order to become fully-formed, and you want spiders to crawl it in its state following the insertion of dynamic data, you need to look hard at WHY you are using AJAX to generate that page, and WHETHER it should be converted to static, or supplied to the crawler as static alongside the AJAX-enabled version for real visitors to use.
Again, just because you CAN does not mean that you SHOULD.
@frank That's not correct. I'm blind myself and I can use dynamic html with different screen-readers very well, if the pages use WAI-ARIA. The myth blind users don't use javascript or can't perceive ajax pages seems still to be alive. What a pity!
@miller If you look at the newest release of the free and open source screen-reader NVDA you find already support for flash and java applets:
http://www.nvda-project.org/blog/NVDA2009.1beta1Released
I think this requires handling the real fragment (the anchor) too, if it is passed in the AJAX request.
So you would define another token in the AJAX state...
For example:
http://www.xorax.info/blog/programmation/259-x-sendfile.html#programmation/171-ma-killer-app.html#comments
targets:
http://www.xorax.info/blog/programmation/171-ma-killer-app.html#comments
Here, if the second token is "#", the URL available to users and shown in results will be:
http://www.xorax.info/blog/programmation/259-x-sendfile.html#!programmation/171-ma-killer-app.html#comments
...or just use unobtrusive javascript.
The sad thing is that people actually got paid to come up with this.
May I offer some constructive criticism? This is an extremely bad solution.
The role of Google's search engine is to facilitate search. That means working with the web as it is, not trying to change it.
Yes, it's a challenge to index Ajax; but it's not impossible. Yes, it will require work by Google. Guys, that’s your job. The next-generation of indexing wizardry needs to be able to do this without every website implementing something that Google has come up with.
I wish Google would stop trying to tell us how to create the web. When Google issues a decree like this, it can have a huge impact.
For instance, Google has decreed that dashes were better for friendly URLs. It’s still in their guidelines now. As a direct result, the web has been corrupted by a Spam of hyphens in the majority of URLs. In almost every case, the result is grossly incorrect punctuation. This might be comical, if it were not such a serious matter. Increasing numbers of children are leaving school with poor basic skills, while Google merrily tells everybody to misuse punctuation and ignore its proper purpose because it works better with their search engine. Where will this corruption of the web by Google end?
1. Introduce a new meta tag that pages can use to indicate that they are ajax powered
2. Change your crawler backend to load the content into a headless browser on your infrastructure and crawl the DOM
3. Profit
John,
This post is very useful for all programmers and webmasters. Many programmers don't know how to develop a search-engine-crawlable site with Ajax.
This post shows us how to get the best results by using Ajax properly so search engines can easily crawl and index our website. When using AJAX, functions are called using onClick in links. These functions aren't visible to search engines because of the JavaScript restriction.
My website, Seo Company, also showed the same conditions; however, I have now optimized all the Ajax code properly.
Thanks for the great post john, keep it up.
Regards,
Sharanyan
AJAX shouldnt be used for navigation. This is not a good idea. It will just allow more spamming opportunities to exploit for the next year.
I think it’s good that Google is thinking about tackling a long-standing problem with AJAX-sites, but the suggested implementation is a bit ugly… I don’t like that you have to add a bang to the hash, and instead of using a _escaped_fragment_ query parameter it would be better to use HTTP content negotiation.
I would say that probably the best solution is to allow JavaScript to rewrite the path and query parts of the URL, then you would have neither problem, nor the need to (ab)use the hash part of the URL. Functionality for scripts to do that will be part of HTML5 (and is apparently already implemented by IE8). But Google doesn’t seem to want to wait for that :(.
I can't understand why you don't want to request url as it is: http://example.com/#?key=value&q=query
Why do you want to replace # with #!? It makes no sense to me. In the scenario above, http://example.com/#?key=value&q=query, if you can read any #hash params from REQUEST_URI, it means that a headless browser has requested it, since only a headless browser can send a URL with #hash params; for all other browsers the part after # won't be sent. I am confused by this idea of ! or _escaped_fragment_. Why make it so complicated instead of simply using http://example.com/#?key=value&q=query and leaving the rest to developers: check whether any # fragment params have been sent, and if so, it means that a headless browser is asking for static content. It's much simpler, I reckon.
cheers,
/Marcin
Sounds like a good idea, but how about making unobtrusive JavaScript a standard? I think this is where the main push should be made at first. Those who care about information appearing in search results will use it, others won't.
I'm really keen for a solution asap.
Because at the moment it leads to some crazy situations.
If you have an app like Google Maps, and you want anyone to be able to link to a spot AND those links to be visible to search engines, it's nearly impossible.
Even Google Maps has a horrible "click for a link" feature...something I'm also having to implement on my site.
Users should be able to just cut and paste what's in their URL, post that on a page, and for search engines to be able to associate it with the content correctly.
Hi there, please have a look at our proposal with a working example at http://feedsmanagement.com/proposals/crawlable.ajax and please let us know what you think about this solution
cheers,
/Marcin
I am a newbie blogger and am still learning how to modify my blog so that it does better in search engine results. I do not use AJAX in my blog. Please help me!
Perhaps Google should put its energy into developing declarative approaches for what the AJAX community currently does with JavaScript. There are already standards in this area that are designed for accessibility and are search-engine friendly (XForms).
A binding language for matching intent-based markup to JavaScript presentation (along the lines of XBL) would be a welcome addition, and let Javascript take the web beyond CSS.
This implementation of an idea is terrible. I don't want to have to code into my application on the server-side the reception of a '_escaped_fragment_' GET param, nor do I wish to have to deal with a ! character in my JS. The MAJOR issue with this approach is having to rely on a headless browser running on the webserver, not GoogleBot's many thousands of server farms. This causes large amounts of unnecessary strain on my resources, something which I cannot justify (due to cost).
Google (and other engines) have certainly got to crawl the AJAX web, but not like this. The suggestion raised earlier (about interpreting Javascript URL rewrites) is very appealing though.
Is there a way to delete the presentation that found its way into my Google Docs account?
When using rewrite rules that do not have a hash, what element of the url should be presented with [FRAGMENTTOKEN]"!"?
Does this look like a stateful URL?
ajax.open("GET", unescape("http%3A//www.example.com/process/find/34847/Product_Home"));
Adding a token might look like this
http://www.example.com/2/34847/Product_Home/#!AJAX
Should the URLs be rewritten to generate a stateful URL?
When developing for products we like to rewrite URLS like this
http://www.examle.com/1/190999/product _name_product_description_descritpion2/
Our Ajax is obviously calling a central DB, so should we simply present urls without headers to bots? How do we encode the URL # properly?
Why not just use the # sign followed by the usual key/value params found in normal URLs? That's how Gmail and Google Website Optimizer work. I'm guessing that Google can't unwind its ignoring of the # sign and is inventing the ! system to make up for it...keep it simple. Way too complicated.
Fine solution, except:
1. There's no need for '_escaped_fragment_' or to mandate '!' in the URL.
2. Let the site direct all requests from Google and other search engines to the headless server, based on either the user-agent string or some other request header that says it's a searchbot.
3. Let the headless server generate the HTML in such a way that it rewrites all '#' as '?'.
4. A rule may be added to the headless server that converts only '#!' URLs (leaving other '#' URLs intact). The site may choose '^' or '@' or anything else instead of '!' -- Google never even knows about this (it only sees '?' in the end).
In short, I like the idea of using a headless server, but there's no need for '!' and '_escaped_fragment_'; let the site have its own rules and mechanisms for this (Google doesn't need to care). The interface between Google and the website being crawled should be minimal.
Any downsides to this?
Oops. By "headless server" I really meant headless browser.
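Points 3 and 4 above could be sketched roughly as follows: a post-processing step on the snapshot HTML that rewrites links carrying a site-chosen token after '#' into '?' links, and leaves ordinary '#' anchors alone. The function name `rewriteTokenLinks` and the regex-based approach are assumptions for illustration; a real implementation would more likely walk the DOM inside the headless browser.

```javascript
// Sketch only: rewrite href="...#<token>state" to href="...?state" in snapshot HTML.
// Assumption: links use double-quoted href attributes and have no existing query string.
function rewriteTokenLinks(html, token = "!") {
  // Escape the token so characters like "^" are treated literally in the pattern.
  const escapedToken = token.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp('href="([^"#]*)#' + escapedToken + '([^"]*)"', "g");
  return html.replace(pattern, 'href="$1?$2"');
}

console.log(rewriteTokenLinks('<a href="#!BookStore/Home/">Home</a> <a href="#top">Top</a>'));
```

Because the token is a parameter, the site can pick '^' or '@' instead of '!' without the search engine ever seeing it, which is the point the comment makes.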
@Jimbo
Anything after # is not sent by the browser with a request, so strictly speaking this does not identify distinct URLs. Google 'unwinding' that treatment doesn't work with the other use of this for on-page anchors so is not very compatible with the rest of the [pre-AJAX] web.
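A small sketch of that point, using the standard `URL` class: the request a browser sends is built from the path and query only, so two addresses that differ only after the '#' produce the same request. The helper name `requestTarget` is made up for illustration.

```javascript
// Sketch only: show which part of an address reaches the server.
function requestTarget(address) {
  const url = new URL(address);
  // The fragment (url.hash) is deliberately left out: browsers do not send it.
  return url.pathname + url.search;
}

const withToken = "http://example.com/page?query#!state";
const plainAnchor = "http://example.com/page?query#state";

console.log(requestTarget(withToken));   // same request target for both addresses
console.log(requestTarget(plainAnchor));
console.log(new URL(withToken).hash);    // the fragment stays on the client
```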
When will this type of #!ajax be active? I already implemented it in my site www.magicafm.com
Do we really get blacklisted if we create a static site and feed Googlebot static HTML?
I was thinking of making a JSP that outputs the same HTML as our GWT app would, and then using a Tomcat filter or some other method to rewrite the URL!
I would think it's better to do it like this, but I have a fear of ending up on the blacklist. We spent the last months creating a CMS that is totally based on GWT, and we thought we would find a solution to this problem.
I'm re-looking at this proposal and I do see this as cloaking!
Not sure what is new here, as a lot of people are already using some method to provide static HTML to the server.
I think Google should indeed keep the "#" in the URL and leave it up to the user to cloak.
Hi,
I would suggest you offer AJAX developers a few options instead of settling on one.
Here is how I separate the content from the design.
Standard Wordpress Blog:
http://www.flexcapacitor.com/content
Flex Wordpress Blog requesting only the content:
http://www.flexcapacitor.com/demos/Wordpress/
Notice the updates to the URL fragment.
Only the content:
http://www.flexcapacitor.com/content/?xml=1
Select view source
In this example I'm using a query string rather than using a fragment.
If I left "?xml=1" off, the blog page would load. There could be a standard fragment or query string parameter you could add to your page that returns only the content. Something like "seo=1".
The second method the site is using works like this:
Whenever the site makes an AJAX / asynchronous call it updates the URL fragment so it can restore the state later. Then it dispatches an "ASYNC_UPDATE_COMPLETE" event after a result/content is received from the server.
Now, when a search engine comes along to index a page that uses AJAX the search engine can listen for an ASYNC_UPDATE_COMPLETE event. When it receives this event it can grab the updated URL and treat the page as a completely new page and then index the new content.
For this to work you would have to have a standard event that search engines could listen for. That wouldn't be too hard for developers. You could also modify the popular AJAX libraries to dispatch this event when new data is returned. That way developers don't even have to think about it. They would only need to update the URL fragment.
I have a question from an SEO viewpoint. I have my website developed in ASP (promodirect.com). I believe/heard that HTML is the best way to make your website SEO friendly compared to any other scripting language. Can someone distinguish between ASP and HTML scripts/websites from an SEO viewpoint?
Making AJAX searchable has the added significance of potentially adding AdSense to RIAs. I have been struggling with doing this for two years...
As Ortega said,
Take a tour of the WAI-ARIA W3C standard at http://www.w3.org/WAI/intro/aria . It could quite probably be a better solution.
We can't go patching code and URLs in different ways.
See the similarity of a crawler and a blind user.
Peace!
I haven't chosen GWT because of two reasons. One is indexing and two is ads. This at least solves one of the problems. What do you plan to do about ads? Ads currently are per page and AJAX has only one page.
Can we get an update on this topic? Any additional information would be great - is there a timeline on making a decision one way or the other?
Thanks!
Usually, when users share a link from an AJAX application, they copy the text from the location bar ("http://example.com/index.php#module=3") rather than clicking a special "Copy link" button, which has extra functionality to rephrase the link as "http://example.com/index.php?module=3". The rephrased link is supposed to work: when the user pastes it on his blog, Googlebot indexes it, and when another user clicks on it, the web server knows which AJAX module to serve, since the web server can capture the request URI "?module=3" but not "#module=3". With the "#" form, the only way is to load the initial page, then look at the location URL, then load the AJAX module, which means a double request to the server.
Does this "standard" work right now on Google, or is it just an idea for now?
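A minimal sketch of what such a "Copy link" button might compute, assuming the application keeps its whole state in the fragment and the page has no query string of its own. The function name `toShareableLink` is hypothetical.

```javascript
// Sketch only: turn "index.php#module=3" into "index.php?module=3" for sharing.
// Assumption: the fragment holds the entire state and the URL has no existing query.
function toShareableLink(address) {
  const url = new URL(address);
  if (url.hash.length > 1 && url.search === "") {
    url.search = url.hash.slice(1); // move the state into the query string
    url.hash = "";                  // drop the fragment
  }
  return url.toString();
}

console.log(toShareableLink("http://example.com/index.php#module=3"));
```

With the query form the server sees "module=3" on the first request and can serve the right module directly, which avoids the double request described above.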
Thanks.
Yeah, when will this become standard?
If you need a site for testing please consider using www.BizPartnerHunt.com
This site is great but not used very much because people can't find it. The ads are not able to be crawled because it is an AJAX based website.
I'm surprised Google hasn't advanced further with crawling AJAX than this proposal.
As MrM points out, the obvious flaw is that the headless browser should be run at the side of the crawler, not the webserver.
I like what was proposed but I also think developers should have another option. That is:
1. When the page first loads dispatch an event "hasAsync". This tells the search engine crawlers that the page has AJAX or Flash or whatever.
2. Henceforth, whenever you make an async call, update the URL fragment, and once the data is returned (and/or applied if possible), dispatch an "asyncUpdate" event.
This will tell the crawler that an asynchronous call has been made and to check the URL fragment and then associate the new content on the page with the new URL fragment.
USE CASE:
A fella goes to a webpage (for example, an AJAX version of www.google.com). When the page loads, an event is dispatched via JavaScript, "hasAsync" or "hasAJAX". He doesn't know this or care, but the crawler is listening for this event. He enters a term in the search box, "batman", and then he clicks the "search" button. When this button is clicked the webpage updates the URL fragment to "http://google.com/#s=batman". Then it makes an asynchronous call. When the search results come back, the new information populates the page and the webpage dispatches an "asyncUpdate" event. The crawler is listening for this event. It then crawls and indexes the new content and associates the content with the new URL and its new fragment, "http://google.com/#s=batman". The event would also contain the data returned.
Google, please hire me.
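The event idea in the comment above can be sketched with the standard `EventTarget` and `Event` classes. Everything here is hypothetical: the "asyncUpdate" event name comes from the comment, the `fragment` and `content` properties are invented for illustration, and no crawler is known to listen for such an event.

```javascript
// Sketch only: a page announces each completed async update; a listener records it.
const page = new EventTarget();
const indexed = new Map(); // what a hypothetical crawler would store: fragment -> content

// Crawler side (hypothetical): associate the new content with the new fragment.
page.addEventListener("asyncUpdate", (event) => {
  indexed.set(event.fragment, event.content);
});

// Page side: call this once the server response has been applied to the page.
function completeAsyncCall(fragment, content) {
  const event = new Event("asyncUpdate");
  event.fragment = fragment; // e.g. "#s=batman"
  event.content = content;   // the data returned by the server
  page.dispatchEvent(event);
}

completeAsyncCall("#s=batman", "<ul><li>Batman result</li></ul>");
console.log(indexed.get("#s=batman"));
```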
Really great!!!
But what is the status with this right now?
Because in our company we wanted to use GWT for product development. But we had to reconsider and use JSP and other Java technologies instead, just because the content can be indexed.
But we would really like to start using the web toolkit.
I would really like to know the answer to my question, please.
I too am wondering what the status of this is. I'm designing a new site (completely in AJAX) and am going to have to write additional linking methods to make it SEO friendly.
A single-page Ajax framework would be so much easier and faster than regular sites nowadays :(
I'd really like to see Google crawling JavaScript.
I know this is a little OT, but is there a way to load google adsense ads so that they update with dynamically generated AJAX content?
I know it is theoretically possible - given that GMail itself is driven by AJAX and those ads change with every email.
My website initially loads the basic header/footer and adsense, and then content comes along as the user navigates the website. But my ads never change and they aren't relevant to the content since they loaded with the initial page load.
Has anyone figured out a better way?
This is certainly a very good idea. I had long been wondering why Google wasn't doing anything to make Ajax crawlable.
This will really help in developing more user-friendly websites with the help of Ajax.
Nowadays users are even more attracted to Ajax-based websites.
But I'm still in doubt: on one side Google is talking about page speed and giving more weight to it, and now it says it is going to make Ajax crawlable. Using Ajax to some extent would be fine, but using it excessively would increase the size of the webpage.
Still confused. But certainly, making Ajax-based websites would be more user friendly.
+1 for googlebot rolling its own headless browser on ajaxy url.
Errata: on the Getting Started page you have:
return the HTML snapshot for
www.example.com/index.html#!mystate
(that is, the original URL!) to the
crawler.
Shouldn't that be "ajax.html" rather than "index.html"?
Question to Google:
Crawling AJAX URLs will work for HTML/JavaScript sites, but what about Flex/Flash sites? Will Google re-read a SWF's metadata for different URLs?
Or at least Google should consider hiding HTML content behind a Flash object NOT a black-hat practice,
because from a Flex site's perspective, upon an AJAX URL change, the SWF must have a means to give the crawler meaningful data.
OK, so I must do two jobs:
1) The webserver must generate a plain xhtml page;
2) At the client, I must change the full plain xhtml page to the client related requirements;
So, we have twice more work, for the same money and time.
Thanks, Google ;) It's very nice to help you make more money.
This sounds great. I've lost count of the number of AJAX based designs we've binned because they won't play nice with search engines.
So is it ever going to happen?
Hi all:
Please look at how my site works perfectly using the "/#!" proposed by Google (look in the status bar when you mouse over a link).
Now look at how Google doesn't index this kind of link correctly.
If somebody can explain this...
Thanks.
I like the proposal by tjpick@gmail.com about adding [link rel=alternative] tags or similar between the JS and non-JS versions.
Sometimes a page uses AJAX even without a #fragment in the URL, e.g. for performance reasons or in order to reuse the same code to generate the page without the #fragment. With link tags, sites can specify the non-JS version without resolving to a specific format for #fragments. But the link approach could be just a supplement to the proposed #! format.
@Mauricio:
For some reason, when I click your site, the bottom link says, for instance:
http://www.magicafm.com/#!historia.2559
, but actually, in the address bar it shows
http://www.magicafm.com/#historia.2559
Hope that helps.
Google posted documentation a looong time ago on how to make your ajaxy site crawlable.
Link: http://code.google.com/web/ajaxcrawling/docs/html-snapshot.html
Thanks for sharing this information. I'll use it in an article that I'm writing on my own site at www.otimizacaodesites.org
@ERF:
Thanks a lot ERF, really, really thanks!
This is interesting but it does not solve the real and difficult problems:
* This solution is specific to Google Search.
* Things that are quick to say but very hard to do: "Web servers execute their own JavaScript at crawl time"
I have already solved the AJAX and SEO problem for any search engine with ItsNat; in some ways ItsNat is a headless browser on the server... in Java (no JavaScript code is executed on the server).
With ItsNat you develop a Single Page Interface (AJAX intensive) application and (almost) automatically the same is page based when JavaScript is disabled or ignored (like search engine crawlers see web sites).
More info:
The Single Page Interface Manifesto
Tutorial: Single Page Interface Web Site With ItsNat
Online demo
First of all I think that is great.
But are flash websites (which in that case are obviously a complex construct of html, javascript and a backend) invited to be a part of this agreement?
Whatever the answer is, it would be great, if you would add it to the faq.
http://code.google.com/intl/en-US/web/ajaxcrawling/docs/faq.html
Cheers,
Jan
Someone else might have said this already, but: whatever happened to graceful degradation? If you have that, the crawler should have no trouble in the first place; the only problem would be that this would send everyone to the static URL rather than the AJAXy one, whether their browser could handle AJAX or not.
Which brings to mind another problem that I hadn't really thought of before: what if someone gives a URL with the AJAX state fragment in it to a (JS-less) Links user?
I agree with the comment from Gary on Dec 23, 2009. The headless browser should be run at the crawler side. This is a crawler problem, not a website problem.
I agree with the comment Gary made on Dec 23, 2009. This is a crawler problem, not a website problem.
Is this actually up and running yet?
I've tried to convert my site, but neither is Google.com indexing my pages nor is FetchAsGoogleBot converting #! links to _escaped_fragment_=
It's hard to tell if I'm doing something wrong, or if Google is not indexing my site for another reason.
I supplied a sitemap full of #! two weeks ago so I would have thought some would have been indexed :-/
@darkflame: Yes, the AJAX crawling scheme is live. What's your domain?
What are the benefits of this proposal versus using a technology such as jQuery Ajaxy, which gracefully upgrades a website and thus never encounters this search engine problem in the first place?
Here is its project page:
http://www.balupton.com/projects/jquery-ajaxy
Here is a website it is used on:
http://www.balupton.com
To me this proposal has a severe disadvantage compared to the jQuery Ajaxy solution in that it requires major complicated work on the server side (in regards to the state escaping, translation and rendering) as well as making your websites break for js disabled users - something which has never been good.
I really fail to see a benefit of this proposal.
I implemented this on our site. On Google Webmaster Tools under Diagnostics > HTML Suggestions every page now appears as a duplicate:
/foo
/foo?_escaped_fragment_=
Did I do something wrong, or is this a bug in the Webmaster Tools?
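One way to read the report above is that "/foo?_escaped_fragment_=" carries an empty state and so returns the same HTML as "/foo". Below is a minimal sketch of how a server might tell the cases apart, using the standard `URL` class; the function name `stateFromCrawlerRequest` and the base host are assumptions for illustration, and whether the duplicate listing is a Webmaster Tools bug is not something this sketch can settle.

```javascript
// Sketch only: extract the AJAX state from a crawler request, if there is one.
function stateFromCrawlerRequest(requestUrl) {
  // The base is only there so that relative paths like "/foo" can be parsed.
  const url = new URL(requestUrl, "http://example.com");
  if (!url.searchParams.has("_escaped_fragment_")) {
    return null; // ordinary request, not the crawler form
  }
  // An empty string means "no state": the same content as the bare page.
  return url.searchParams.get("_escaped_fragment_");
}

console.log(stateFromCrawlerRequest("/foo"));
console.log(stateFromCrawlerRequest("/foo?_escaped_fragment_="));
console.log(stateFromCrawlerRequest("/foo?_escaped_fragment_=key=value"));
```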
I would like to hear some clarification on why a headless browser cannot be run on the search backend servers.
It would make webmasters' lives easier and results more reliable.
Only drawback is increase of the load on search backend servers.
If Google is missing computing power for it - it's the end of the world as we know it.
@balupton: Such approach only applies to "a bit smarter webpages", not to a webapp having all its UI created on a blank page using JS.
Why not do as the Facebook user agent does, which simply replaces the hash char with its URL-encoded version:
http://example.com/dictionary.html#AJAX becomes
http://example.com/dictionary.html%23AJAX ?
%23AJAX can then be interpreted server side...
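A minimal sketch of that suggestion, with made-up helper names: one function encodes the '#' so the state travels in the request path, and another recovers the state on the server. That Facebook's user agent behaves this way is taken from the comment and not verified here.

```javascript
// Sketch only: carry the fragment in the path by encoding "#" as "%23".
function encodeHashInUrl(address) {
  const hashIndex = address.indexOf("#");
  if (hashIndex === -1) return address; // nothing to encode
  return address.slice(0, hashIndex) + encodeURIComponent("#") + address.slice(hashIndex + 1);
}

// Server side: decode the received path and split off the state again.
function stateFromEncodedPath(path) {
  const decoded = decodeURIComponent(path);
  const hashIndex = decoded.indexOf("#");
  return hashIndex === -1 ? null : decoded.slice(hashIndex + 1);
}

console.log(encodeHashInUrl("http://example.com/dictionary.html#AJAX"));
console.log(stateFromEncodedPath("/dictionary.html%23AJAX"));
```

Note that the server then receives a path that does not match any real file, so it would need a routing rule to strip the state before looking up the page.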
Using a headless browser and waiting for the JavaScript to finish executing is very fragile. The reason is that different pages have different rendering times, and over time those times may change. In addition, Google takes page load time into account for ranking, and if the "JavaScript wait time" is high enough, this affects the ranking (unless Google ignores load times of snapshot pages).
Thoughts?
Exclamation points in fragment identifiers are already being used as search anchors for pages that don't have anchor tags, so there are things like #!s!text #!s10!text and #!s-3!text
http://en.wikipedia.org/wiki/Fragment_identifier#Proposals
Is your proposal going to conflict with this use? Please make hacks of the fragment identifier independent of each other so there aren't conflicts.
Using a headless browser on the server side just to serve search engine requests is a poor idea. Maybe this is Google's way of offloading the work to individual webservers.
As a service provider, I should be able to focus on my users, rather than millions of web application/service providers adding HtmlUnit to their applications.
An alternate thought: Could a search engine use HtmlUnit or a headless browser on their end (of course, in an isolated secure environment where any bad script execution can run only within the defined boundary) and generate the content just like normal browsers do?
We are really in need of a solution for Ajax crawling; we have a yoga site done entirely in Ajax.
We have some confusion about implementing Google's recommendation for Ajax crawling.
Can we give XML data for the Ajax pages instead of the HTML snapshot? Is it really an XML file that you mean by "HTML snapshot" in the document?
How do we generate the HTML snapshot using any of the headless browsers?
If we implement this technique, will the pages be indexed and cached like normal HTML pages?
The site has many videos and photos, and I have to implement the solution ASAP.
We have generated the snapshot using HtmlUnit.
But instead of HTML we could only get an XML version of the HTML in the page, like below:
?xml version="1.0" encoding="ISO-8859-1"?
html
head
/body
/html
(Removed the angle brackets in the above code, as no HTML is allowed in the comment box here.)
The links in the page code we generated like this are in the format
a href="#!BookStore/Home/"
(Removed the angle brackets in the above code too.)
Two questions to get answered:
Is the HTML unit meant to produce an output like above starting with
Are links in the format a href="#!BookStore/Home/"
crawlable by search engine bots?
Please do reply ASAP if anyone knows, as we are in the middle of the process.
I have a site full of Ajax. I don't want to use '#' as you have mentioned. I can change the title and page description programmatically; will doing this help in getting my pages crawled? Please help.
I don't know if this was already proposed in the comments or not, but I propose using JavaScript to load content from clicked links on JavaScript-compatible browsers.
So, let's say you have a page called demo.php.
If you access it directly it will generate the content with the template, but if you access it with something like "?_token_ajax=1" it will render just the content (that would be called from JavaScript).
Browsers that don't support JS/AJAX, and search engines, would go to "demo.php".
Browsers that support JS/AJAX will return false when the link is clicked and load "demo.php?_token_ajax=1" via AJAX...
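The server half of this idea might look something like the sketch below. It is written in JavaScript rather than PHP purely for illustration; the function name `renderPage`, the base host, and the wrapper markup are all assumptions, and only the "_token_ajax" parameter comes from the comment.

```javascript
// Sketch only: one URL serves either the full page or just the content fragment.
function renderPage(requestUrl, content) {
  // The base is only there so that relative paths like "/demo.php" can be parsed.
  const url = new URL(requestUrl, "http://example.com");
  const ajaxOnly = url.searchParams.get("_token_ajax") === "1";

  if (ajaxOnly) {
    return content; // bare content, to be injected by the page's script
  }
  // Full template: what crawlers and script-less browsers receive.
  return '<html><body><div id="main">' + content + "</div></body></html>";
}

console.log(renderPage("/demo.php", "<p>Hello</p>"));
console.log(renderPage("/demo.php?_token_ajax=1", "<p>Hello</p>"));
```

Since every link keeps a normal href, crawlers never need to know the AJAX variant exists.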
It seems Google has done the right thing and is beginning to index Ajax type content e.g. Facebook comments.
As I commented previously, it's a search engine's job to index content on the web -- without altering the web!
http://googlewebmastercentral.blogspot.com/2009/10/proposal-for-making-ajax-crawlable.html?showComment=1255079811655#c96003387870917958
How does this fit in with HTML5 History.pushState()?
Hi everyone,
Since over a year has passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Central Help Forum.
Thanks and take care,
The Webmaster Central Team