Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index

GET, POST, and safely surfacing more of the web

Tuesday, November 01, 2011 at 3:53 PM

Webmaster Level: Intermediate to Advanced

As the web evolves, Google’s crawling and indexing capabilities also need to progress. We improved our indexing of Flash, built a more robust infrastructure called Caffeine, and we even started crawling forms where it makes sense. Now, especially with the growing popularity of JavaScript and, with it, AJAX, we’re finding more web pages requiring POST requests -- either for the entire content of the page or because the pages are missing information and/or look completely broken without the resources returned from POST. For Google Search this is less than ideal, because when we’re not properly discovering and indexing content, searchers may not have access to the most comprehensive and relevant results.

We generally advise using GET for fetching resources a page needs, and this is by far our preferred method of crawling. We’ve started experiments to rewrite POST requests to GET, and while this remains a valid strategy in some cases, the contents returned by a web server for GET vs. POST are often completely different. Additionally, there are legitimate reasons to use POST (e.g., you can attach more data to a POST request than to a GET). So, while GET requests remain far more common, to surface more content on the web, Googlebot may now perform POST requests when we believe it’s safe and appropriate.
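
For site authors, the "GET unless there is a reason for POST" rule can be sketched as a small helper. This is only an illustration: `chooseRequest` and the 2000-character limit are assumptions made for the example, not values from Google or from any specification, and the helper is meant for read-only fetches only.

```javascript
// Hypothetical helper: prefer GET, and fall back to POST only when the
// encoded parameters would make the URL too long.
// MAX_URL_LENGTH is an assumed conservative figure, not a standard.
const MAX_URL_LENGTH = 2000;

function chooseRequest(baseUrl, params) {
  const query = new URLSearchParams(params).toString();
  const getUrl = query ? `${baseUrl}?${query}` : baseUrl;
  if (getUrl.length <= MAX_URL_LENGTH) {
    // Small payload: a plain, crawlable GET is enough.
    return { method: 'GET', url: getUrl, body: null };
  }
  // Large payload: send the same parameters in the request body instead.
  return { method: 'POST', url: baseUrl, body: query };
}

const small = chooseRequest('hot-fudge-info.html', { topping: 'fudge' });
console.log(small.method, small.url); // GET hot-fudge-info.html?topping=fudge

const large = chooseRequest('search.html', { q: 'x'.repeat(5000) });
console.log(large.method, large.url); // POST search.html
```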

We take precautions to avoid performing any task on a site that could result in executing an unintended user action. Our POSTs are primarily for crawling resources that a page requests automatically, mimicking what a typical user would see when they open the URL in their browser. This will evolve over time as we find better heuristics, but that’s our current approach.

Let’s run through a few POST request scenarios that demonstrate how we’re improving our crawling and indexing to evolve with the web.

Examples of Googlebot’s POST requests
  • Crawling a page via a POST redirect
    <html>
      <body onload="document.foo.submit();">
        <form name="foo" action="request.php" method="post">
          <input type="hidden" name="bar" value="234"/>
        </form>
      </body>
    </html>
  • Crawling a resource via a POST XMLHttpRequest
    In this step-by-step example, we improve both the indexing of a page and its Instant Preview by following the automatic XMLHttpRequest generated as the page renders.

    1. Google crawls the URL, yummy-sundae.html.
    2. Google begins indexing yummy-sundae.html and, as a part of this process, decides to attempt to render the page to better understand its content and/or generate the Instant Preview.
    3. During the render, yummy-sundae.html automatically sends an XMLHttpRequest for a resource, hot-fudge-info.html, using the POST method.
      <html>
        <head>
          <title>Yummy Sundae</title>
          <script src="jquery.js"></script>
        </head>
        <body>
          This page is about a yummy sundae.
          <div id="content"></div>
          <script type="text/javascript">
            $(document).ready(function() {
              $.post('hot-fudge-info.html', function(data) {
                $('#content').html(data);
              });
            });
          </script>
        </body>
      </html>
    4. The URL requested through POST, hot-fudge-info.html, along with its data payload, is added to Googlebot’s crawl queue.
    5. Googlebot performs a POST request to crawl hot-fudge-info.html.
    6. Google now has an accurate representation of yummy-sundae.html for Instant Previews. In certain cases, we may also incorporate the contents of hot-fudge-info.html into yummy-sundae.html.
    7. Google completes the indexing of yummy-sundae.html.
    8. User searches for [hot fudge sundae].
    9. Google’s algorithms can now better determine how yummy-sundae.html is relevant for this query, and we can properly display a snapshot of the page for Instant Previews.
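
Steps 3 through 5 above can be pictured with a toy model: the page's onload script is run against a recording stand-in for `$.post`, and every request it makes automatically is placed in a crawl queue together with its method and payload. This is a sketch for intuition only. `createCrawlQueue`, `renderAndCollect`, and the page object are hypothetical names, and nothing here describes how Googlebot is actually implemented.

```javascript
// Toy crawl queue: method, URL and payload together identify a fetch,
// so the same automatic request is queued only once.
function createCrawlQueue() {
  const seen = new Set();
  const items = [];
  return {
    items,
    add(request) {
      const key = JSON.stringify([request.method, request.url, request.body]);
      if (!seen.has(key)) {
        seen.add(key);
        items.push(request);
      }
    },
  };
}

// "Render" a page by running its onload script with a recorder in place
// of jQuery, capturing the POSTs the page issues without user action.
function renderAndCollect(page, queue) {
  const recorder = {
    post(url, data) {
      queue.add({
        method: 'POST',
        url: url,
        body: data === undefined ? null : data,
        requestedBy: page.url,
      });
    },
  };
  page.onLoad(recorder);
}

const queue = createCrawlQueue();
const yummySundae = {
  url: 'yummy-sundae.html',
  onLoad($) {
    $.post('hot-fudge-info.html'); // mirrors the $.post call in step 3
  },
};

renderAndCollect(yummySundae, queue);
console.log(queue.items.length, queue.items[0].method, queue.items[0].url);
// 1 POST hot-fudge-info.html
```
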
Improving your site’s crawlability and indexability

General advice for creating crawlable sites is found in our Help Center. For webmasters who want to help Google crawl and index their content and/or generate the Instant Preview, here are a few simple reminders:
  • Prefer GET for fetching resources, unless there’s a specific reason to use POST.
  • Verify that we're allowed to crawl the resources needed to render your page. In the example above, if hot-fudge-info.html is disallowed by robots.txt, Googlebot won't fetch it. More subtly, if the JavaScript code that issues the XMLHttpRequest is located in an external .js file disallowed by robots.txt, we won't see the connection between yummy-sundae.html and hot-fudge-info.html, so even if the latter is not disallowed itself, that may not help us much. We've seen even more complicated chains of dependencies in the wild. To help Google better understand your site, it's almost always better to allow Googlebot to crawl all resources.

    You can test whether resources are blocked through Webmaster Tools “Labs -> Instant Previews.”
  • Make sure to return the same content to Googlebot as is returned to users’ web browsers. Cloaking (sending different content to Googlebot than to users) is a violation of our Webmaster Guidelines because, among other things, it may cause us to provide a searcher with an irrelevant result -- the content the user views in their browser may be a complete mismatch from what we crawled and indexed. We’ve seen numerous POST-request examples where a webmaster non-maliciously cloaked (which is still a violation), and their cloaking -- on even the smallest of changes -- then caused JavaScript errors that prevented accurate indexing and completely defeated their reason for cloaking in the first place. Summarizing, if you want your site to be search-friendly, cloaking is an all-around sticky situation that’s best to avoid.

    To verify that you're not accidentally cloaking, you can use Instant Previews within Webmaster Tools, or try setting the User-Agent string in your browser to something like:

    Mozilla/5.0 (compatible; Googlebot/2.1;
      +http://www.google.com/bot.html)

    Your site shouldn't look any different after such a change. If you see a blank page, a JavaScript error, or if parts of the page are missing or different, that means that something's wrong.
  • Remember to include important content (i.e., the content you’d like indexed) as text, visible directly on the page and without requiring user-action to display. Most search engines are text-based and generally work best with text-based content. We’re always improving our ability to crawl and index content published in a variety of ways, but it remains a good practice to use text for important information.
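
The User-Agent check described in the cloaking reminder above can also be scripted. The sketch below assumes a recent Node.js with a global `fetch`. The Googlebot string is the one quoted in this post, while `BROWSER_UA`, `fetchAs`, and `compareForCloaking` are made-up names for illustration. It only compares what the server returns, so it will not catch JavaScript errors that appear during rendering, and pages with timestamps or per-request tokens will legitimately differ.

```javascript
const GOOGLEBOT_UA =
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
// Placeholder browser string, invented for this example.
const BROWSER_UA = 'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0';

// Fetch a URL while presenting the given User-Agent.
// fetchFn is injectable so the function can be exercised without a network.
async function fetchAs(url, userAgent, fetchFn = fetch) {
  const response = await fetchFn(url, { headers: { 'User-Agent': userAgent } });
  return { status: response.status, body: await response.text() };
}

// Request the same URL as "Googlebot" and as a browser, then compare.
async function compareForCloaking(url, fetchFn = fetch) {
  const [asBot, asBrowser] = await Promise.all([
    fetchAs(url, GOOGLEBOT_UA, fetchFn),
    fetchAs(url, BROWSER_UA, fetchFn),
  ]);
  return {
    sameStatus: asBot.status === asBrowser.status,
    sameBody: asBot.body === asBrowser.body,
  };
}

// Example use (the URL is hypothetical):
// compareForCloaking('http://www.example.com/yummy-sundae.html')
//   .then((result) => console.log(result));
```

If either field comes back `false`, that is a prompt to look at the two responses by hand rather than proof of cloaking.
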
Controlling your content

If you’d like to prevent content from being crawled or indexed for Google Web Search, traditional robots.txt directives remain the best method. To prevent the Instant Preview for your page(s), please see our Instant Previews FAQ, which describes the “Google Web Preview” User-Agent and the nosnippet meta tag.
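
Before relying on a set of Disallow rules, it can help to list which of a page's resources they would block, since blocking a script can hide the requests that script makes. The following is a deliberately simplified sketch: it handles only plain path-prefix Disallow lines, ignores User-agent groups, Allow lines and wildcards, and the `/scripts/sundae.js` path is a hypothetical addition to the sundae example.

```javascript
// Extract the Disallow path prefixes from a robots.txt body.
// Simplification: every Disallow line is treated as applying to all crawlers.
function parseDisallowRules(robotsTxt) {
  return robotsTxt
    .split('\n')
    .map((line) => line.split('#')[0].trim()) // drop comments
    .filter((line) => /^disallow\s*:/i.test(line))
    .map((line) => line.slice(line.indexOf(':') + 1).trim())
    .filter((path) => path.length > 0); // an empty Disallow blocks nothing
}

// Return the resource paths that match at least one Disallow prefix.
function blockedResources(paths, rules) {
  return paths.filter((path) => rules.some((rule) => path.startsWith(rule)));
}

const robotsTxt = [
  'User-agent: *',
  'Disallow: /scripts/   # hypothetical rule',
].join('\n');

// The page, the script that issues the XMLHttpRequest, and its target.
const needed = ['/yummy-sundae.html', '/scripts/sundae.js', '/hot-fudge-info.html'];

console.log(blockedResources(needed, parseDisallowRules(robotsTxt)));
// [ '/scripts/sundae.js' ]
```

Here hot-fudge-info.html is not blocked itself, but the script that requests it is, which is the subtle case where the link between the two pages would not be seen.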

Moving forward

We’ll continue striving to increase the comprehensiveness of our index so searchers can find more relevant information. And we expect our crawling and indexing capability to improve and evolve over time, just like the web itself. Please let us know if you have questions or concerns.

The comments you read here belong only to the person who posted them. We do, however, reserve the right to remove off-topic comments.

26 comments:

Anna said...

I don't want Google to use POST to compensate for other developers' lack of knowledge and skill. Why not just write a short guide on how to make ajax content crawlable? Explain to them what graceful degradation is so that you can get away with just following "href". I put a great deal of effort to make my site crawlable AND user-friendly. Don't touch my forms please.

iGEL said...

I agree with Anna. Crawling AJAX content is a good idea (if it works well), but don't do POSTs and don't rewrite POST to GET. It's not your job to undo the stupidity of developers, and if you do the rewrite, it will cause problems.

ach444 said...

But it is Google's job to find and index all of the information on the web. Unfortunately a lot of valuable information (think government websites) is hidden behind POST requests. It is silly to think that Google would have a policy of "Just count on developers to do their job."

Also you can imagine they would err on the side of caution here.

I think the most important sentence is: "Remember to include important content (i.e., the content you’d like indexed) as text, visible directly on the page and without requiring user-action to display."

This means that they are giving weight to text based on its visibility. So avoid Tabs, Modals, Hovers, etc.

Tony said...

So now we can't trust that a POST form which is an action will not get executed by Google as a GET.

Bringing in the potential that a robot can start editing a website.

I'd guess the safest protection is to make sure your server only supports POST based requests for actions and returns a 405 code (Method Not Allowed) if a GET request was made.
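
A minimal sketch of that suggestion, assuming a Node.js-style request handler: `actionHandler` and the response texts are hypothetical, and a real application would do the actual work where the placeholder comment is.

```javascript
// Hypothetical action endpoint: only POST is allowed to change state.
function actionHandler(req, res) {
  if (req.method !== 'POST') {
    // A 405 response names the permitted methods in its Allow header.
    res.writeHead(405, { Allow: 'POST', 'Content-Type': 'text/plain' });
    res.end('Method Not Allowed\n');
    return;
  }
  // ... perform the state-changing action here ...
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('Action performed\n');
}

// To serve it with Node's built-in http module:
// require('http').createServer(actionHandler).listen(8080);
```

This guards against a POST being replayed as a GET. It does not by itself stop an automated client from sending a real POST.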

Kevin said...

This so seems like a bad idea. POST requests often are done to manipulate data. Maybe Google is being careful, but GET requests are intended for reading data, POST for changing/adding data.

Unknown said...

On the subject of cloaking, is it permissible to show Google the content that someone would see if they were logged in, but prompt other users to register/login? Or does this count as showing Google different content?

MarkAtRamp51 said...

I think this is what google should be doing, enhancing their product as best they can.

It's not the intent of google to create a "platform for competition" between developers. The intent is to make relevant content easy to find.

There shouldn't be any worry about POSTs getting rewritten to GETs, because since we are all so smart here, we know that a request that comes in over GET to manipulate data should be answered with some type of 400 response code. That's my 2 cents.

Stefan Tilkov said...

This sucks so much I could cry. The difference between GET and POST is defined in the HTTP spec, and the fact that GET (and not POST) is what can be safely called without consequences is a fundamental tenet of the Web's architecture. Not adhering to this is plainly a horrible idea.

John E Lincoln said...

Google you really need to work on simplifying. The things you make websites do to be present in your index are way too elaborate and technical. Sure we get it, but overall your mission should be to simplify your product.

Everfluxx said...

# Answer Googlebot's POST requests with 405 Method Not Allowed.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteCond %{REQUEST_METHOD} =POST
RewriteRule .* - [R=405,L]
# end of the story.

Esteban Felix said...

The issue I have with this is that we have an AJAX POST call on page load that verifies a user has loaded the page. We don't want google to show up as a user for various reasons.

Would we get penalized for modifying/disallowing google from POSTing in this particular case?

Illuminatus said...

I think it is easier if you work with facebook comments and disqus platform and a couple of others to make the content readable and you're done.

Javi said...

Hi,
if you now make Ajax requests, how does this affect the "#!..." notation for links in Ajax requests?

thanks

jhon apps said...

I also have the same problem creating a custom Facebook app.
If anyone has a solution for this, please let us know.

Matej said...

Developers, you must be prepared for someone to POST to any publicly accessible doc, whether Google or the bad guys. I'm not saying Google is bad or is doing the right thing with POST; I'm saying you should be prepared already.

Alper said...

I'll be waiting for the POST blocking middleware for Googlebot any time soon then.

Brian Moon said...

"Cloaking (sending different content to Googlebot than to users) is a violation of our Webmaster Guidelines"

So, when will Amazon be removed from your index? They return much less content to a Googlebot user agent than a Firefox user agent.

Coffee Grinder said...

As a general rule, I always code my AJAX requests as POSTs because I found that IE was caching GET requests (even with all the usual page expiry headers), so dynamically updated content was not being displayed (this was not an issue for Chrome, just IE). So yes, if you are going to index AJAX content, you need to crawl POSTs as well as GETs.

Paul said...

"Also you can imagine they would err on the side of caution here."

Meaningless - even if it were possible for a search bot to "judge" it would still have to execute the action before being able to make that judgement, with whatever unknown consequences.

Perhaps we'll see a nojsfollow one day.

Fizster said...

Need some help with respect to best practices on linking, redirects and canonical tag usage on an Ajax driven website. I’ve used Twitter as a reference.
Redirects:
Should you 301 your escaped frag pages to the hash bang versions?
I see Twitter using both a 301 redirect and window.location redirect on http://twitter.com/?_escaped_fragment_=/andersoncooper to the /#!/andersoncooper version.
I would have thought that if you used a 301 on the escaped frag version, Googlebot would not be able to crawl it, and the whole point of having an escaped frag version is to serve a “static” version of the page to Google so they have something crawlable. I can see the window.location redirect making sense on the escaped frag, as you want to serve users the usable version of the page, but I don’t get the 301.
Then on the "pretty URL" http://twitter.com/andersoncooper they are using the window.location to redirect to the #! version. I would have thought that using a 301 from the pretty URL to the hashbang would make more sense…
Linking:
Is it best practice to link to the #! URLs, or to the pretty URLs throughout the site? I see Twitter linking to hashbang URLs.
Canonical tags:
Should pretty URLs canonicalize to the #!, or should the #! canonicalize to the pretty URLs? And what should the escaped frag canonicalize to? Anything? Using Twitter as an example, it looks like they are using the base URL tag, coupled with canonicalizing to the root "/".
Very hard to find specific info from Google on what’s best practice with these issues – so any comments or help would be much appreciated.

Ömer Konat said...

Very good… Thank you very much!

MagicDude4Eva said...

I have not been able to establish if GoogleBot really crawls and indexes Disqus comments. Just looking at my GWMT fetch as GoogleBot, I don't see any comments being fetched - this post is an example.

Perhaps a WordPress misconfiguration or the GWMT function does not show what the bot would do?

terrell gibbs said...

HOW CAN I BLACK OUT MY OWN SITE?
