Google Operating System Unofficial news and tips about Google


Friday, 31 October 2008

Google Uses OCR to Index Scanned PDF Files

Posted on 04:39 by Unknown
Google started to index the full text of "scanned" PDF files using a technique called OCR (optical character recognition). "Every day, people all over the world post scanned documents online -- everything from official government reports to obscure academic papers. These files usually contain images of text, rather than the text themselves. But all of these documents have one thing in common: someone somewhere thought they were valuable enough to share with the world," says Evin Levey.

The great thing about the new feature is that you won't notice it unless you look for it, but it improves the quality of Google's search results. Google doesn't mention how many of the 300 million indexed PDF files were converted into text, but you can see some examples if you search for [repairing aluminium wiring] or [Steady success in a volatile world] and click on "View as HTML".


Google sponsors an open-source OCR system called OCRopus, and it's likely that Google used it to index PDF files from the web. "OCRopus is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities. (...) It's initially intended for high-throughput, high-volume document conversion efforts. We expect that it will also be an excellent OCR system for many other applications."
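
To give a rough idea of what "pluggable" means in that description, here is a minimal Python sketch of a three-stage pipeline (layout analysis, character recognition, language modeling) in which each stage can be swapped independently. This is not OCRopus code and does not use its API: the names OcrPipeline, toy_layout, toy_recognizer, toy_language_model and CORRECTIONS are all made up for illustration, and plain strings stand in for scanned line images.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stage signatures; none of these names come from OCRopus.
LayoutAnalyzer = Callable[[List[str]], List[str]]  # page -> text-line "images"
CharRecognizer = Callable[[str], str]              # line "image" -> raw text
LanguageModel = Callable[[str], str]               # raw text -> corrected text


@dataclass
class OcrPipeline:
    """Chains three replaceable stages, one after another."""
    layout: LayoutAnalyzer
    recognizer: CharRecognizer
    language_model: LanguageModel

    def run(self, page: List[str]) -> List[str]:
        lines = self.layout(page)
        return [self.language_model(self.recognizer(line)) for line in lines]


def toy_layout(page: List[str]) -> List[str]:
    # Stand-in for layout analysis: keep only rows that contain something.
    return [row for row in page if row.strip()]


def toy_recognizer(line: str) -> str:
    # Stand-in for character recognition: a real recognizer would map
    # pixels to characters; here the "image" is already a string.
    return line.strip()


# Assumed examples of recognition mistakes, invented for this sketch.
CORRECTIONS = {"alurninium": "aluminium", "w1ring": "wiring"}


def toy_language_model(text: str) -> str:
    # Stand-in for statistical language modeling: a fixed lookup table.
    return " ".join(CORRECTIONS.get(word, word) for word in text.split())


pipeline = OcrPipeline(toy_layout, toy_recognizer, toy_language_model)
page = ["  repairing alurninium w1ring ", "", "steady success"]
print(pipeline.run(page))
```

Replacing one stage, for example passing a different recognizer function to OcrPipeline, leaves the other two untouched, which is the general idea behind a pluggable design.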
Posted in Web Search | No comments



