Google Operating System: Unofficial news and tips about Google

Sunday, 13 April 2008

Google Starts to Index the Invisible Web

Posted on 10:05 by Unknown

Google Webmaster Central Blog recently announced that Google has started to index web pages hidden behind web forms:

"In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn't find and index for users who search on Google. Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page."

For now, only a small number of websites will be affected by this change, and Google will only fill in forms that use GET to submit data and don't require personal information.
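
To make the quoted process more concrete, here is a minimal sketch of how candidate URLs could be generated from a GET form. It is not Google's crawler: the class FormFields, the function candidate_urls, the limit parameter and the example.com URL used below are all hypothetical names chosen for illustration, and the sketch assumes a page that contains a single form.

```python
from html.parser import HTMLParser
from itertools import islice, product
from urllib.parse import urlencode, urljoin


class FormFields(HTMLParser):
    """Collect the action, method and candidate values of a page's form.

    Assumption: the page contains a single form.
    """

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"
        self.choices = {}      # field name -> values listed in the HTML
        self.text_fields = []  # names of free-text inputs
        self._select = None    # name of the <select> currently open

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self.action = a.get("action") or ""
            self.method = (a.get("method") or "get").lower()
        elif tag == "select":
            self._select = a.get("name")
        elif tag == "option" and self._select and a.get("value"):
            self.choices.setdefault(self._select, []).append(a["value"])
        elif tag == "input" and a.get("name"):
            kind = (a.get("type") or "text").lower()
            if kind in ("radio", "checkbox") and a.get("value"):
                self.choices.setdefault(a["name"], []).append(a["value"])
            elif kind == "text":
                self.text_fields.append(a["name"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def candidate_urls(page_url, html, site_words, limit=5):
    """Return up to `limit` GET URLs a user could have produced with the form."""
    form = FormFields()
    form.feed(html)
    if form.method != "get":
        return []  # only GET forms are filled in, as in the announcement
    fields = dict(form.choices)
    for name in form.text_fields:
        fields[name] = list(site_words)  # text boxes get words from the site
    if not fields:
        return []
    names = list(fields)
    combos = product(*(fields[n] for n in names))
    base = urljoin(page_url, form.action)
    # Keep only "a small number of queries" per form.
    return [base + "?" + urlencode(dict(zip(names, combo)))
            for combo in islice(combos, limit)]
```

Calling candidate_urls("http://example.com/search.html", html, ["library", "atlas"], limit=3) on a page whose form has a text box and a select menu would yield three query URLs to try. Deciding whether each resulting page is "valid, interesting, and includes content not in our index" is a separate step that this sketch does not attempt.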

Many web pages are difficult to find because they're not indexed by search engines and they're only available if you know where to search and what to use as a query. Together, these pages make up the Invisible Web, which was estimated in 2001 to include 550 billion documents. "Traditional search engines create their indices by spidering or crawling surface Web pages. To be discovered, the page must be static and linked to other pages. Traditional search engines can not see or retrieve content in the deep Web -- those pages do not exist until they are created dynamically as the result of a specific search."

Anand Rajaraman points out that the new feature is related to a low-profile Google acquisition from 2005:

Between 1995 and 2005, Web search had become the dominant mechanism for finding information. Search engines, however, had a blind spot: the data behind HTML forms. (...) The key problems in indexing the Invisible Web are:

1. Determining which web forms are worth penetrating.
2. If we decide to crawl behind a form, how do we fill in values in the form to get at the data behind it? In the case of fields with checkboxes, radiobuttons, and drop-down menus, the solution is fairly straightforward. In the case of free-text inputs, the problem is quite challenging - we need to understand the semantics of the input box to guess possible valid inputs.

Transformic's technology addressed both problems (1) and (2). It was always clear to us that Google would be a great home for Transformic, and in 2005 Google acquired Transformic. (...) The Transformic team have been working hard for the past two years perfecting the technology and integrating it into the Google crawler.
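
The free-text problem from point 2 can be illustrated with a deliberately naive sketch: pick the most frequent non-trivial words from the site's own pages and try them as queries. This is only an assumption about what "choosing words from the site" could look like, not Transformic's or Google's method, and it does not understand the semantics of the input box. The function candidate_keywords and the STOP_WORDS list are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would need a much larger one.
STOP_WORDS = {"the", "and", "for", "that", "with", "this", "from", "are", "was"}


def candidate_keywords(page_texts, top_n=5):
    """Return the most frequent non-trivial words across a site's pages."""
    counts = Counter()
    for text in page_texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            # Skip very short words and common function words.
            if len(word) > 2 and word not in STOP_WORDS:
                counts[word] += 1
    return [word for word, _ in counts.most_common(top_n)]
```

A frequency count like this would suggest plausible queries for a general search box, but it would do poorly on a field that expects a ZIP code or a date, which is why the quoted text calls the free-text case "quite challenging".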

It's not clear which high-quality sites Google uses for the new feature, but this list includes some good options. Along with Google Book Search, Google Scholar, and Google News Archive, this is yet another way to bring valuable information to light.
Posted in Web Search | No comments