Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index

Crawling through HTML forms

Friday, April 11, 2008 at 10:50 AM



Google is constantly trying new ideas to improve our coverage of the web. We already do some pretty smart things like scanning JavaScript and Flash to discover links to new web pages, and today, we would like to talk about another new technology we've started experimenting with recently.

In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn't find and index for users who search on Google. Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.
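
As a rough illustration of the URL-generation step described above, here is a minimal Python sketch. It is not Googlebot's actual implementation; the function name `candidate_urls`, the `limit` parameter, the field names, and the example.com address are all assumptions made for the example. It takes a handful of candidate values per input and builds the GET URLs a user's query could have produced.

```python
from itertools import islice, product
from urllib.parse import urlencode, urljoin


def candidate_urls(page_url, action, fields, limit=5):
    """Yield at most `limit` GET URLs, one per combination of field values.

    `fields` maps each input name to its candidate values (hypothetical data).
    """
    names = list(fields)
    combos = product(*(fields[name] for name in names))
    for combo in islice(combos, limit):
        query = urlencode(dict(zip(names, combo)))
        # Resolve the form's action relative to the page that contains the form.
        yield urljoin(page_url, action) + "?" + query


# Hypothetical form: one text box and one select menu.
fields = {
    "q": ["hotels", "flights"],    # text box: words drawn from the site itself
    "city": ["London", "Paris"],   # select menu: values listed in the HTML
}
urls = list(candidate_urls("http://example.com/find/", "search", fields, limit=3))
for url in urls:
    print(url)
```

Capping the number of combinations with `limit` mirrors the "small number of queries" mentioned above; how the real crawler picks and prunes values is not described in this post.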

Needless to say, this experiment follows good Internet citizenry practices. Only a small number of particularly useful sites receive this treatment, and our crawl agent, the ever-friendly Googlebot, always adheres to robots.txt, nofollow, and noindex directives. That means that if a search form is forbidden in robots.txt, we won't crawl any of the URLs that a form would generate.  Similarly, we only retrieve GET forms and avoid forms that require any kind of user information. For example, we omit any forms that have a password input or that use terms commonly associated with personal information such as logins, userids, contacts, etc. We are also mindful of the impact we can have on web sites and limit ourselves to a very small number of fetches for a given site.
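
The rules in the paragraph above can be sketched as a simple filter. This is only an illustration under stated assumptions: the term list, the `(type, name)` representation of inputs, and both function names are hypothetical, and the robots.txt check uses Python's standard `urllib.robotparser` rather than anything Google-specific.

```python
from urllib.robotparser import RobotFileParser

# Assumed list of terms that hint at personal information (illustrative only).
SENSITIVE_TERMS = ("login", "userid", "password", "contact")


def is_crawlable_form(method, inputs):
    """Return True for GET forms with no password or personal-data inputs.

    `inputs` is a list of (input_type, input_name) pairs.
    """
    if method.lower() != "get":
        return False
    for input_type, name in inputs:
        if input_type.lower() == "password":
            return False
        if any(term in name.lower() for term in SENSITIVE_TERMS):
            return False
    return True


def allowed_by_robots(robots_lines, url, agent="Googlebot"):
    """Check a generated URL against the lines of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt that forbids the search form's result URLs.
robots = ["User-agent: *", "Disallow: /search"]
print(is_crawlable_form("get", [("text", "q"), ("select", "city")]))
print(allowed_by_robots(robots, "http://example.com/search?q=hotels"))
```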

The web pages we discover in our enhanced crawl do not come at the expense of regular web pages that are already part of the crawl, so this change doesn't reduce PageRank for your other pages. As such it should only increase the exposure of your site in Google. This change also does not affect the crawling, ranking, or selection of other web pages in any significant way.

This experiment is part of Google's broader effort to increase its coverage of the web. In fact, HTML forms have long been thought to be the gateway to large volumes of data beyond the normal scope of search engines. The terms Deep Web, Hidden Web, or Invisible Web have been used collectively to refer to such content that has so far been invisible to search engine users. By crawling using HTML forms (and abiding by robots.txt), we are able to lead search engine users to documents that would otherwise not be easily found in search engines, and provide webmasters and users alike with a better and more comprehensive search experience.
The comments you read here belong only to the person who posted them. We do, however, reserve the right to remove off-topic comments.

51 comments:

John said...

I've noticed this with a few sites where the robots have gone through forms that search for products in locations and indexed the results pages. Because the results pages have replicated titles and descriptions, they've shown up as a duplicate content problem in Webmaster Tools.

The fix: use what the person searched for in the title and description of the results page.

E.g.

Title - You searched for 'hotels' in 'London'

Description - Search results for 'hotels' based in 'London'

So this way all the different combinations of results boxes have unique title and descriptions.
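
A minimal sketch of this idea in Python, assuming a results page that receives a query and a location; the function name `result_page_head` and the exact wording of the strings are illustrative, not taken from any real site.

```python
from html import escape


def result_page_head(query, location):
    """Build a <title> and meta description unique to one search combination."""
    title = f"You searched for '{query}' in '{location}'"
    description = f"Search results for '{query}' based in '{location}'"
    # Escape user-supplied text so a search term cannot inject markup.
    return (
        f"<title>{escape(title, quote=False)}</title>\n"
        f'<meta name="description" content="{escape(description)}">'
    )


print(result_page_head("hotels", "London"))
```

Escaping matters here because the search terms come straight from the form, whether typed by a visitor or submitted by a crawler.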

Tom said...

don't you Googlers know that you only release big news on a Friday if you want to bury it? this has monday or tuesday written all over it ;-)

Anthony said...

Don't you realize that those of us who want the form results Googled already provide it to you in a static page or pages?

mucows said...

[...]"A little while back I wrote about how I thought Google was indexing site SERPs for those sites that had Google Analytics tracking site searches. In effect, I mistakenly accused Google of leaking Analytics data into its index. I had enabled site search tracking and my friend Brian had too, and we were both seeing these site search results pages turning up in Google’s SERPs. Thus we were worried about the integrity of our data.

As it turns out, Google’s experimenting with a new form of discovering deep content on “high quality” sites. "[...]

SuperJason said...

Hmmm, I may have to exclude googlebot from my error list. I'm guessing it's pretty easy for it to put in some bogus content.

It reminds me of the monkey test. Have a group of monkeys test your website. They'll randomly do things until it breaks.

Doesn't Google already have enough to index? I'm really curious what will come of this.

--SuperJason (my tech blog)

incrediblehelp said...

"The web pages we discover in our enhanced crawl do not come at the expense of regular web pages that are already part of the crawl, so this change doesn't reduce PageRank for your other pages."

Can you explain this sentence a little more clearly?

Olaf Lederer said...

Can't say that I like this feature. I noticed that Google has indexed hundreds of pages from a dealer list I offer on one website. These results are all the same with only one difference: the address!

I think this way you get millions of pages with duplicated content in the list.

The only solution is for result pages to get the "noindex" tag.

Les Porter said...

I can't see why this evolution is a bad thing. In the short term, it creates duplicate content issues for webmasters who have created methods for traditional spiders to access the content. The quickest fix for this is to protect the content from spiders with a change to your robots.txt.

Long term and as Googlebot improves, webmasters won't have to go through the extra effort of creating other avenues to the content beyond the form. One less task.

Olaf Lederer said...

@Les,

sure there are enough ways to keep the bot outside ;)

But I think this should be at least an option inside the G webmaster tools.

We sell toys for children and our dealer list has about 1500 dealers in the agriculture sector; that's 1500 pages with unrelated content if the bot spiders our dealer list.

ShopDownLite.com said...

Is this really new? For years our website (http://www.shopdownlite.com) has been getting these giant carts added with no purchases and when we track it down it is the result of robots adding items to carts.

This seems like a good service and I like the first poster's search results idea. I think we will recode for that now.

Jennifer Mathews Somogyi said...

I noticed this a while back and haven't seen any trouble come up from it other than the analytics not knowing it was Google that searched an in site search and not one of our visitors.
As for the forms - I stop thinking about bots after a form is filled out and think only of the usability. The titles are for the user, and the description is nonexistent. Not to mention that at times iframes are used for the forms.
Our numbers have been off in our tracking, showing more visitors to a confirmation page, and the filled-out form data is different.
In order to help us meet our goals and make sure we are making the site search engine friendly every step of the way - can you possibly let us know of these tests before you start testing them?
It would save me having to go back to our dev and design team because all of our forms now need to be reworked because of the search engines.
Yes - it's the perils of being an in-house SEO - got to get design and dev on board and then spend the time leveraging and waiting for the changes to get implemented.

Jenn

Gael Fraiteur said...

For good internet citizens (and according to the REST bible), the rule is that a robot should not invoke POST methods, because they may have side effects. GET methods are supposed to be programmed so that they have no side effect.

I hope Google respects these rules too.

Dan Thies said...

(Checks date... no, it's April 11, not April 1)

Bad robot! Down!

LebossTom said...

You can't change crawl rules without webmaster permission...

I suggest Google ask for it in Webmaster Tools.

incrediblehelp said...

"You can't change crawl rules without webmaster permission..."

LOLOLOLOLOL

admin said...

This is so stupid.

It's indexing stuff in JavaScript that's used for my Ajax that I don't want it to index.

It's been in my robots.txt for months
and it has a noindex/nofollow link on the page.

AND yet google still indexes it!
Duplicate content abound.

Olly said...

I thought Google indexes so many pages that they would know this is a really bad idea. No one bothered adding noindex or nofollow to forms or to pages that are only reachable through forms.
Everyone knew that bots don't use forms. And that's good, because it's easy to provide links for content that should get crawled.

Inserting random text into form fields is just plain wrong. That's what spammers do.

Since close to no website is prepared for this, it will end with heaps of webmasters and developers getting badly annoyed with Google, and some might just block Googlebot from the whole site. Is that what you want?

Top Nerd said...

Dynamic site links. Can a web master submit a request to Google to have a site generate dynamic site links in a fashion of his or her choosing? Or is it a Google automated function only? I have a technical tip directory that shows up as - similar pages. I would like to have it index as Technical Tips: http://www.clickanerd.com/techtips/tips/index.asp

David Hulbert said...

I don't think I like the sound of this. Can we have a rel="noindex" for forms please?

Vikash Bucha said...

what 'bout newsletter subscription and registration forms. Wouldn't that contribute to bogus entries ?

Robin said...

One of the websites that is being scraped this way by Google is Lyrics.net, which is owned by a coworker of mine.
I have published some initial findings on my blog:
Lunchpauze.com

Ron Michael said...

Perhaps there is a better way to crawl HTML forms by "asking" the web site: define a standard way of exposing typical or possible results of a form. Add an extra element to the FORM tag, like "PRESULTS=/results.xml" which could be a URL to a static or dynamic list of URLs that might result from a search. For example: if you are searching by author in a box, and you search your database for the author, results.xml could contain a dynamically generated list of URLs for all authors in your database.

I think there is some possibility of people abusing this for black hat SEO, but it'd be a great tool for white hat SEO folks.

Leon Derczynski said...

Interesting! Who wins out of Googlebot and Akismet?

Aerophilia said...

OK. So won't this mean that any "High Quality" hotel-type website I might own could soon be sending me hundreds of enquiries via the form on the site?

What about shopping sites? Is Googlebot trying to set up an account with Play.com & buy DVDs? O_o

Susan Moskwa said...

Hi folks,

Thanks for all the comments. To answer some of your questions:

@Anthony: Actually, there are more people than you might think who aren't aware that forms are stopping crawlers from accessing content that they want crawled. Here's a classic example.

@incrediblehelp: That sentence means that form crawling shouldn't impact the crawling or indexing of your site's other (non-form) pages.

@Gael: As mentioned in the post, we'll only be retrieving GET forms, not POST.

@admin: If you have a noindex meta tag on a particular page, that page needs to be allowed in your robots.txt file; otherwise we won't be able to crawl the page, and thus won't be able to see the meta tag on the page. Note that disallowing a URL from crawling (in your robots.txt file) does not necessarily mean it won't be indexed.
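
The interaction between robots.txt and the noindex meta tag described in the reply above can be summarized as a small decision function. This is a sketch of the stated behavior only; the function name and the outcome strings are illustrative.

```python
def indexing_outcome(disallowed_in_robots, has_noindex_meta):
    """Summarize what a crawler can do with a page under the two settings."""
    if disallowed_in_robots:
        # The page is never fetched, so a meta tag on it is never seen.
        return "not crawled; noindex unseen; URL may still be indexed"
    if has_noindex_meta:
        # The page is fetched, the tag is read, and the page stays out.
        return "crawled; noindex seen; kept out of the index"
    return "crawled; eligible for indexing"


for disallowed in (True, False):
    for noindex in (True, False):
        print(disallowed, noindex, "->", indexing_outcome(disallowed, noindex))
```

In other words, a page that should be kept out of the index through its meta tag has to remain crawlable.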

Matt said...

If developers bothered to understand HTTP, they would know that anything that can be retrieved using GET is fair game to be indexed. If GETting a URL is going to result in something bad, make sure to put it behind a POST.

bobx said...

I guess it's their way of filling up the index with content beyond the ubiquitous:
"yea I am 18 years of age or older, can I now get to the pr0n pls" ?
checkboxes.

PestProJoe said...

Does this mean that links on forums may count against your site similar to links on Blogs?

-Joe
Do-It-Yourself Pest Control

gatwanagu said...

Hi, I have a question about this related to language recognition. If this is the wrong place, please point me to the right spot to ask.

My web site runs on Django and does create the (static) content including some meta tags dynamically out of the database.
One advantage is, that also the language content shown to the user is dynamically created and served.

If a visitor has 'en-US' as the standard browser language and goes to, let's say, pal-pad.com/product/, he gets English content; if he has 'de-AT' and goes to pal-pad.com/product/, he will receive German content. He can also choose his language on a dedicated page, where a 'Select widget' showing the available languages and a 'Select button' to finalise the chosen language are available.
Is the robot able to get this and crawl first in English and then in German, so that the correct content is shown later depending on the search language of the user? Or is my non-English content lost in hidden space because the robot comes with an initial English language identity? And would simplifying the selection process to use only one widget (a selection list) make it easier?

Madsen said...

Isn't this potentially going to spam some wikis into oblivion? I know *most* wikis use POST, but my guess is that there are probably some that will accept both POST and GET when posting content. (I'm just speculating as I remember something about some PHP apps accepting both POST and GET from forms. This might have been in the pre-PHP5 days, but there's still a lot of PHP4 apps around.)

MrGamma said...

This is actually scary...

MrGamma said...

Whoops... I missed the part about crawling GET forms only... okay... no big deal...

Mike said...

This kind of thing is what I always hoped Google Sitemaps would do, i.e. let those of us who can't, for whatever reason, create browsable web views of our large datasets at least provide the URLs to Google in a sensible format. But no, a sitemap is no guarantee and the robot still needs to be able to spider it.

So how about a version of Sitemaps to work with your forms system? A list of keywords authorised by the site owner - even if that's just 'ID=001, ID=002' etc?

Q5 Grafisch Webdesign etten-leur said...

This is big news! thanks

Aella said...

In some worlds, that's called fuzzing.

boenisch said...

We have a website with over 6,000 pages in Google's index. But now the 600 pages that came through the Deep Web activities are often keen competition for the rest.

Because we have lists of content (that should be indexed) and a form that can refine the list on the same page, we cannot just exclude the robots from the entire site.

Google has to give us the chance to exclude a form's functionality exclusively.

Web Promotion Team said...

Really good for all webmasters, but I want to ask a small question:
Does Google Webmaster Tools show correct results?
When I looked at the top queries option in Google Webmaster Tools, it showed "keyword" in 4th position, but when I check, it doesn't show up... can anyone explain?

pkharat@telegenisys.com
Telegenisys Inc

Kamal Patel said...

I fully agree with John.
I've also noticed this with a few sites where the robots have gone through forms that search for products in locations and indexed the results pages. Because the results pages have replicated titles and descriptions, they've shown up as a duplicate content problem in Webmaster Tools.

Soapy said...

I've noticed that Googlebot is actively entering search terms into our search form; it's been doing this for the past week or so. However, it's not entering whole words. It always chops a few characters off: "doubl" instead of double, "machin" instead of machine. Or sometimes totally random stuff like "me ta".

While Google taking an active interest in our site is good, I'm not sure how these odd phrases are going to help create good results for indexing.

Geromme Talampas said...

well thanks for the information!!
pls visit my site
http://nuevaecijashoppingmall.6te.net

Aerophilia said...

@joginder singh punia

I CAN HAS SPAM EL GOOG 4UMS WITH CAPS AND LINK TEH WURST SITE EVA?

lolwut?

Maile Ohye said...

@Aerophilia: LOL, good one. I just deleted the spammy comment by joginder singh punia, though I neglected to check out his/her site.

Take care,
Maile

Jonathan said...

This communication is regarding http://ingramdatacomm.x10hosting.com. On Nov 6, 2008 I removed the cache copy and site from the Google index. Since then I have updated and rolled out the new site. I submitted the new sitemap.xml on Nov 11, 2008. I have checked both my sitemap and robots.txt file for any errors that would cause Googlebot to not index my site. Your help would be much appreciated.
Sitemap http://ingramdatacomm.x10hosting.com/sitemap.xml
Robots File http://ingramdatacomm.x10hosting.com/robots.txt

Maile Ohye said...

Hi Jonathan,

Sorry to hear that your new site isn't indexed. A better place for your question is in the Webmaster Help Forum (as a comment on this blog post it's a bit off-topic). I think other members of the community would quickly notice that removing "the cache copy and site from Google index" could potentially be problematic.

If you don't hear a response to your thread after 36 hours-ish, bump it. :)

http://www.google.com/support/forum/p/Webmasters

Good luck,
Maile

skunkbad said...

I have a CSS theme changer form on my site, and Google has indexed the theme changer script according to a site search for my domain on Google.
( www.brianswebdesign.com )
I have had in place a robots.txt file that excludes this script from being indexed, so I am wondering why this has happened. I can't use a meta tag in this script, and there is no rel="nofollow" attribute for a form, so I am wondering what to do. Any advice?

Just a thought said...

[quote->>It is a nice feature however Google among others are missing and will continue to miss lots of new content that is being published using Ajax programming. None are able to index such functionality. The traditional crawling technique fails over time as Ajax becomes more popular. Read more here:
<<-endquote]
...........
The quoted message above is confusing to me. I use ajax to make polls on my celeb website and the polls DO show up in the Google Index. So, have there been changes? Or is this a different kind of ajax?

My Google webmaster panel keeps telling me that I have a bunch of restricted URLs in my robots text, such as EDIT PROFILE! Of course I do not want these crawled for every member of my fun site and forum

Maile Ohye said...

Hi everyone,

Since some time has passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Help Forum.

Thanks and take care,
The Webmaster Central Team