Parsing PDF text with coordinates in Ruby

When I was looking for a gem to parse PDF text, pdf-reader turned out to be a good choice. Unfortunately, out of the box it only offers plain text output per page. To process content based on text positions, a little customization is required to retrieve the coordinates.

Customized subclasses (1)

The Page class provides a #walk method which takes visitors that get called with the rendering instructions for the page content. To get access to the text runs on a page, a subclass of PageTextReceiver can be used, which only adds readers for the @characters and @mediabox attributes:

  class CustomPageTextReceiver < PDF::Reader::PageTextReceiver
    attr_reader :characters, :mediabox
  end

With these two values, PageLayout can be instantiated. It merges the characters it is given into word groups (runs). To retrieve these runs afterwards, we also need a slightly chattier subclass:

  class CustomPageLayout < PDF::Reader::PageLayout
    attr_reader :runs
  end

Customized subclasses (2)

Using these two subclasses we could now retrieve the text from PDFs together with its coordinates. But I observed two drawbacks with the original implementations.

First, for some files the resulting runs contained duplicates, which seem to stem from text with shadowing. This can be handled by rejecting duplicates before PageLayout processes the characters:

  class CustomPageLayout < PDF::Reader::PageLayout
    attr_reader :runs

    def group_chars_into_runs(chars)
      # filter out duplicate chars before going on with regular logic,
      # seems to happen with shadowed text
      chars.uniq! {|val| {x: val.x, y: val.y, text: val.text}}
      super
    end
  end
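
To see what the uniq! call does in isolation, here is a minimal sketch. It uses a hypothetical FakeChar struct in place of PDF::Reader::TextRun (an assumption, so that the example runs without a PDF), a hypothetical dedupe_chars helper, and an array as the uniqueness key instead of a hash, which should behave the same way:

```ruby
# FakeChar is a hypothetical stand-in for PDF::Reader::TextRun,
# reduced to the three attributes used as the uniqueness key.
FakeChar = Struct.new(:x, :y, :text)

# Keep the first occurrence of each (x, y, text) combination.
def dedupe_chars(chars)
  chars.uniq { |c| [c.x, c.y, c.text] }
end

chars = [
  FakeChar.new(10.0, 700.0, "A"),
  FakeChar.new(10.0, 700.0, "A"), # shadow copy at the same position
  FakeChar.new(16.5, 700.0, "B"),
  FakeChar.new(10.0, 680.0, "A")  # same glyph on another line: kept
]

puts dedupe_chars(chars).map(&:text).join # prints "ABA"
```

Note that such a filter only catches copies at exactly the same position. A shadow drawn with a small offset would pass through, and identical characters that legitimately share the same coordinates would be dropped.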

Second, in some cases pdf-reader missed spaces in the parsed text. I think this may happen because PageTextReceiver discards the spaces found in the PDF stream and pdf-reader calculates spaces itself instead. I found it more reliable to keep the spaces and strip extra ones during further processing:

  class PageTextReceiverKeepSpaces < PDF::Reader::PageTextReceiver
    # We must expose the characters and mediabox attributes to instantiate PageLayout
    attr_reader :characters, :mediabox

    private
    def internal_show_text(string)
      if @state.current_font.nil?
        raise PDF::Reader::MalformedPDFError, "current font is invalid"
      end
      glyphs = @state.current_font.unpack(string)
      glyphs.each_with_index do |glyph_code, index|
        # paint the current glyph
        newx, newy = @state.trm_transform(0,0)
        utf8_chars = @state.current_font.to_utf8(glyph_code)

        # apply the glyph displacement for the current glyph so the next
        # glyph will appear in the correct position
        glyph_width = @state.current_font.glyph_width(glyph_code) / 1000.0
        th = 1
        scaled_glyph_width = glyph_width * @state.font_size * th

        # modification to original pdf-reader code which accidentally removes spaces in some cases
        # unless utf8_chars == SPACE
        @characters << PDF::Reader::TextRun.new(newx, newy, scaled_glyph_width, @state.font_size, utf8_chars)
        # end

        @state.process_glyph_displacement(glyph_width, 0, utf8_chars == SPACE)
      end
    end
  end

This is the original code, except that the two lines around the @characters << ... statement (the unless utf8_chars == SPACE condition and its end) are commented out so that spaces are kept as well.
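
Because every space from the content stream is now kept, text may carry leading, trailing, or repeated spaces. One possible way to clean that up during later processing is sketched below. The helper name normalize_spaces is hypothetical and not part of pdf-reader:

```ruby
# Hypothetical helper: collapse runs of whitespace into a single space
# and trim both ends of the string.
def normalize_spaces(text)
  text.gsub(/\s+/, ' ').strip
end

puts normalize_spaces("  Some   text  on the left ") # prints "Some text on the left"
```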

Further processing

Based on the customized PageTextReceiver and PageLayout, I wrote a basic processor which takes the runs of each page and brings them into a structured form for further processing.
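
The core of that step, sorting the runs and merging neighbours on the same line, can be tried without a PDF. The following is a simplified sketch under a few assumptions: SampleRun is a hypothetical stand-in for PDF::Reader::TextRun, group_runs is a hypothetical helper, and MAX_GAP plays the role of an experimentally chosen distance threshold:

```ruby
# SampleRun is a hypothetical stand-in for PDF::Reader::TextRun with
# just the attributes the grouping step reads.
SampleRun = Struct.new(:x, :y, :width, :text) do
  def endx
    x + width
  end
end

MAX_GAP = 10 # assumed threshold; tune it for your documents

# Sort runs top to bottom, then left to right, and merge runs on the
# same line whose horizontal gap is at most MAX_GAP.
def group_runs(runs)
  sorted = runs.sort_by { |r| [-r.y, r.x] } # (0,0) is bottom left
  lines = Hash.new { |hash, y| hash[y] = [] }
  sorted.each do |run|
    groups = lines[run.y]
    if groups.empty? || groups.last.last.endx + MAX_GAP < run.x
      groups << [run]     # gap too large: start a new text group
    else
      groups.last << run  # close enough: extend the current group
    end
  end
  lines.map do |y, groups|
    { y: y, text_groups: groups.map { |g| g.map(&:text).join.strip } }
  end
end

runs = [
  SampleRun.new(300.0, 700.0, 40.0, "right"),
  SampleRun.new(60.0, 700.0, 30.0, "Some"),
  SampleRun.new(95.0, 700.0, 25.0, " text"),
  SampleRun.new(60.0, 680.0, 50.0, "More")
]

group_runs(runs).each { |line| p line }
```

For this sample data the sketch yields two lines: the groups "Some text" and "right" at y = 700.0, and the single group "More" at y = 680.0. Runs are only merged when their y values are exactly equal, so text on slightly different baselines ends up on separate lines.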

The processor class can be found in the following script. With the pdf-reader gem installed, invoke it with ./script.rb /path/to/some.pdf:

  #!/usr/bin/env ruby
  require 'pdf-reader'

  class CustomPageLayout < PDF::Reader::PageLayout
    attr_reader :runs

    # we need to filter duplicate characters which seem to be caused by shadowing
    def group_chars_into_runs(chars)
      # filter out duplicate chars before going on with regular logic,
      # seems to happen with shadowed text
      chars.uniq! {|val| {x: val.x, y: val.y, text: val.text}}
      super
    end
  end

  class PageTextReceiverKeepSpaces < PDF::Reader::PageTextReceiver
    # We must expose the characters and mediabox attributes to instantiate PageLayout
    attr_reader :characters, :mediabox

    private
    def internal_show_text(string)
      if @state.current_font.nil?
        raise PDF::Reader::MalformedPDFError, "current font is invalid"
      end
      glyphs = @state.current_font.unpack(string)
      glyphs.each_with_index do |glyph_code, index|
        # paint the current glyph
        newx, newy = @state.trm_transform(0,0)
        utf8_chars = @state.current_font.to_utf8(glyph_code)

        # apply the glyph displacement for the current glyph so the next
        # glyph will appear in the correct position
        glyph_width = @state.current_font.glyph_width(glyph_code) / 1000.0
        th = 1
        scaled_glyph_width = glyph_width * @state.font_size * th

        # modification to the original pdf-reader code which otherwise accidentally removes spaces in some cases
        # unless utf8_chars == SPACE
        @characters << PDF::Reader::TextRun.new(newx, newy, scaled_glyph_width, @state.font_size, utf8_chars)
        # end

        @state.process_glyph_displacement(glyph_width, 0, utf8_chars == SPACE)
      end
    end
  end

  class PDFTextProcessor
    MAX_KERNING_DISTANCE = 10 # experimental value

    # pages may specify which pages to actually parse (zero based)
    #   [0, 3] will process only the first and fourth page if present
    def self.process(pdf_io, pages = nil)
      pdf_io.rewind
      reader = PDF::Reader.new(pdf_io)
      fail 'Could not find any pages in the given document' if reader.pages.empty?
      processed_pages = []
      text_receiver = PageTextReceiverKeepSpaces.new
      requested_pages = pages ? reader.pages.values_at(*pages) : reader.pages
      requested_pages.each do |page|
        unless page.nil?
          page.walk(text_receiver)
          runs = CustomPageLayout.new(text_receiver.characters, text_receiver.mediabox).runs

          # sort text runs from top left to bottom right
          # read as: if both runs are on the same line first take the leftmost, else the uppermost - (0,0) is bottom left
          runs.sort! {|r1, r2| r2.y == r1.y ? r1.x <=> r2.x : r2.y <=> r1.y}

          # group runs by lines and merge those that are close to each other
          lines_hash = {}
          runs.each do |run|
            lines_hash[run.y] ||= []
            # runs that are very close to each other are considered to belong to the same text "block"
            if lines_hash[run.y].empty? || (lines_hash[run.y].last.last.endx + MAX_KERNING_DISTANCE < run.x)
              lines_hash[run.y] << [run]
            else
              lines_hash[run.y].last << run
            end
          end
          lines = []
          lines_hash.each do |y, run_groups|
            lines << {y: y, text_groups: []}
            run_groups.each do |run_group|
              group_text = run_group.map { |run| run.text }.join('').strip
              lines.last[:text_groups] << ({
                x: run_group.first.x,
                width: run_group.last.endx - run_group.first.x,
                text: group_text,
              }) unless group_text.empty?
            end
          end
          # note: page.number is 1-based, unlike the zero-based pages param
          processed_pages << {page: page.number, lines: lines}
        end
      end
      processed_pages
    end
  end

  path = ARGV[0]
  # guard against a missing argument; File.exists? was removed in Ruby 3.2
  if path && File.exist?(path)
    # open in binary mode and close the file once processing is done
    pages = File.open(path, 'rb') { |file| PDFTextProcessor.process(file) }
    puts pages
    puts "Parsed #{pages.count} pages"
  else
    puts "Cannot open file '#{path}' (or no file given)"
  end

The overall output is an array of hashes, where each hash covers the text on one page. Each page hash has an array of lines, in which each line is also represented by a hash. A line has a y-position and an array of text groups found on this line. Lines are sorted from top to bottom ([0,0] is at the bottom left) and text groups from left to right:

  {
    page: 1,
    lines: [
      {
        y: 771.4006,
        text_groups: [
          {x: 60.7191, width: 164.6489200000004, text: "Some text on the left"},
          {x: 414.8391, width: 119.76381600000008, text: "Some text on the right"}
        ]
      },
      {
        y: 750.7606,
        text_groups: [{x: 60.7191, width: 88.51979999999986, text: "More text"}]
      }
    ]
  }
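
With the data in this shape, looking up text by position becomes a matter of filtering hashes. The following sketch shows one possible way to do that. The helper text_in_region and its keyword arguments are hypothetical, and the sample data mirrors the output above with shortened widths:

```ruby
# Hypothetical helper: collect the text of all groups on one page whose
# x and y positions fall inside the given ranges.
def text_in_region(pages, page:, x_range:, y_range:)
  entry = pages.find { |p| p[:page] == page }
  return [] if entry.nil?

  entry[:lines]
    .select { |line| y_range.cover?(line[:y]) }
    .flat_map { |line| line[:text_groups] }
    .select { |group| x_range.cover?(group[:x]) }
    .map { |group| group[:text] }
end

pages = [
  {
    page: 1,
    lines: [
      { y: 771.4006, text_groups: [
        { x: 60.7191, width: 164.64892, text: "Some text on the left" },
        { x: 414.8391, width: 119.763816, text: "Some text on the right" }
      ] },
      { y: 750.7606, text_groups: [
        { x: 60.7191, width: 88.5198, text: "More text" }
      ] }
    ]
  }
]

# Everything in the left column of page 1
p text_in_region(pages, page: 1, x_range: 0..200, y_range: 700..800)
# => ["Some text on the left", "More text"]
```

Only the left edge (x) of a group is compared here. Checking x plus width as well would be a stricter variant.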

Comments (4)

Josh, April 18th, 2014 at 00:11

Hi Jonas, your PDF parsing code is awesome, great job. I’m using it in a Rails app to parse PDF data by x,y coordinate location.

Unfortunately, I’ve had some trouble with the text output getting scrambled. For example, some text in the PDF I want to extract reads "3204.0", but when I run it through your code, it comes out as :text=>"2634.0", with the numbers transposed. This doesn’t happen all the time and I’m not sure what is triggering it, but it is happening regularly, on a variety of lines. Here is a link to the PDF. For example, the data I want to extract is in box 19:

http://webapps.rrc.state.tx.us/CMPL/viewPdfReportFormAction.do?method=cmplW2FormPdf&packetSummaryId=96755

Do you have any thoughts on how to keep the text from getting scrambled? I tried playing around with your code, but I was not able to find a solution.

Thanks,
Josh

jonas, April 18th, 2014 at 13:10

I have no clear idea why this is happening. What I observed is that the calculated width for all groups in your PDF is 0.0, so there seems to be something odd with the characters in the file or with the way pdf-reader handles them. I tried to extract the text with pdf-reader’s basic reader.pages[0].text, but that raised an exception (FloatDomainError Exception: Infinity).

Further, the label "Oil – BBLS" in Box 19 (y: 449.22009999999995, x: 145.0) is only parsed correctly if I disable the duplicate filtering in CustomPageLayout. With filtering it is parsed as "Oil -BLS", i.e. a space and a B are missing. It seems as if the PDF specifies the exact same coordinates for each character in the group (or pdf-reader calculates the same for all of them). This also fits with the observation that the width is 0.0 for all groups, and with the permutation of characters within text groups, as in your number or in "russerP gnisaCe" instead of "Casing Pressure" (filtering disabled).

My experience is that pdf-reader’s internal calculation has problems with small fonts, but the only symptoms I got with that were missing or extra spaces. So in my code I use both variants, the original parsing and the one that keeps spaces, and decide based on the outcome which one to use. But that won’t help here.

The first thing I would do is have a look at the object stream of your PDF file to find out whether the PDF is the actual problem or not. Using pdftk you can uncompress the object stream and then look at it with a text editor. If the characters in there are all at the same position, then there is your problem. Otherwise you could maybe open an issue for pdf-reader.

Josh, April 18th, 2014 at 17:53

Thanks for the feedback. For the record, I don’t think it’s a pdf-reader based problem. When I run the code pasted below, which only uses methods from pdf-reader, all the text I want from the PDF is displayed properly, which seems to indicate that pdf-reader can handle it.

Unfortunately pdf-reader does not provide location coordinates. So with your methods, I can get the location, but not the exact text, and with pdf-reader, I can get the text, but not a consistent location. Vexing!

Thanks again for the help.

  task :parse_and_save_w2_rrc_form_v3 => :environment do
    require 'pdf-reader'
    require 'open-uri'

    receiver = PDF::Reader::RegisterReceiver.new
    filename = "http://webapps.rrc.state.tx.us/CMPL/viewPdfReportFormAction.do?method=cmplW2FormPdf&packetSummaryId=66732"
    # URI.open is the open-uri entry point; Kernel#open no longer opens URLs on Ruby 3+
    io = URI.open(filename)

    PDF::Reader.open(io) do |reader|
      reader.pages.each do |page|
        page.walk(receiver)
        receiver.callbacks.each do |cb|
          # print the arguments of every recorded show_text callback
          puts cb[:args].to_s if cb[:name] == :show_text
        end
      end
    end
  end

______________________

Output (sample):
  ["RAILROAD COMMISSION OF TEXAS"]
  [" "]
  ["Oil and Gas Division"]
  ["API No."]
  [" "]
  ["42- "]
  ["7. RRC District No."]
  ["8. RRC Lease No."]
  [" Oil Well Potential Test, Completion or Recompletion Report, and Log"]
  ["PEARSALL (BUDA, S.)"]
  ["HEITZ-FEHRENBACH UNIT"]
  ["HUGHES, DAN A. COMPANY, L.P."]
  ["411736"]
  [" 1H "]
  ["P O DRAWER 669 BEEVILLE, TX 78104-0669"]
  ["38 , 3 , I&GN RR CO , A-347"]
  ["BIG WELLS"]
  ["Initial Potential"]
  ["Reclass"]
  ["(Explain In remarks)"]
  ["Well record only"]
  ["Retest"]
  ["11. Purpose of filing"]
  ["6b. Distance and direction to nearest town in this county."]
  ["6a. Location (Section, Block, and Survey)"]
  ["4. ADDRESS"]
  ["9. Well No."]
  ["RRC Operator No."]

jonas, April 19th, 2014 at 13:01

I had a look at your two PDFs. Among the fonts defined in them there is a font (Font3) which has a lot of characters specified with width 0.
Do you have any influence on how the PDFs are generated? PDF version 1.2 is also pretty old, although I don’t know whether that actually could be a problem.

Using the debugger I found out that there also seems to be a problem with the fonts derived from basic fonts, one of which is used for the heading "RAILROAD COMMISSION OF TEXAS". It is specified with Font0, which is derived from Times-Bold. Times-Bold belongs to the 14 basic fonts of the PDF standard, so I don’t know why there is a problem with that (wrong specification of a derived font? wrong processing?).
I put a debugger statement right after the font check at the beginning of internal_show_text of PageTextReceiverKeepSpaces and played around with the glyph processing statements beneath. It turned out that the calculated width of each character of the heading is 0, while the text itself is correct.

This messes up text processing in PageLayout and TextReceiver because it advances the text cursor within show_text by the width of processed characters. I guess there are two ways to fix this:
1. Change the PDF generation process such that the created files are compatible with the text processing of pdf-reader.
2. Find out why glyph width calculation yields 0 and fix that.
If you have no influence on PDF generation, the choice is easy 😉
