Parsing PDF text with coordinates in Ruby
When I was looking for a gem to parse PDF text, pdf-reader turned out to be a good choice. Unfortunately, it only offers plain text output for pages. To process content by text position, a little customization is required to retrieve the coordinates.
Customized subclasses (1)
The Page class provides a #walk method which takes visitors that get called with the rendering instructions for the page content. To get access to the text runs on a page, a subclass of PageTextReceiver can be used, which only adds readers for the @characters and @mediabox attributes:
- class CustomPageTextReceiver < PDF::Reader::PageTextReceiver
-   attr_reader :characters, :mediabox
- end
With these two values, PageLayout can be instantiated. It merges the characters it finds into word groups (runs). To retrieve these runs afterwards, we also need a slightly chattier subclass:
- class CustomPageLayout < PDF::Reader::PageLayout
-   attr_reader :runs
- end
Customized subclasses (2)
Using these two subclasses we could now retrieve the text from PDFs together with its coordinates. But I observed two drawbacks with the original implementations.
First, I had files for which the output runs contained duplicates, which seem to stem from text with shadowing. This can be handled by rejecting duplicates before PageLayout processes them:
- class CustomPageLayout < PDF::Reader::PageLayout
-   attr_reader :runs
-   def group_chars_into_runs(chars)
-     # filter out duplicate chars before going on with regular logic,
-     # seems to happen with shadowed text
-     chars.uniq! {|val| {x: val.x, y: val.y, text: val.text}}
-     super
-   end
- end
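To see what this filter does without a PDF at hand, here is a minimal sketch. ShadowChar is a hypothetical stand-in for pdf-reader's TextRun (only x, y and text are modelled), dedupe_chars is a made-up helper name, and the coordinates are invented. Each letter is painted twice at the same position, as shadowed text might be.

```ruby
# Hypothetical stand-in for PDF::Reader::TextRun; only the fields the filter uses.
ShadowChar = Struct.new(:x, :y, :text)

# Same key as in group_chars_into_runs above: position plus text.
def dedupe_chars(chars)
  chars.uniq { |c| { x: c.x, y: c.y, text: c.text } }
end

# "Hi" drawn twice at identical positions (made-up coordinates).
shadowed = [
  ShadowChar.new(10.0, 700.0, 'H'), ShadowChar.new(10.0, 700.0, 'H'),
  ShadowChar.new(17.2, 700.0, 'i'), ShadowChar.new(17.2, 700.0, 'i')
]

deduped = dedupe_chars(shadowed)
puts deduped.map(&:text).join
```

Only characters that agree in position and text are dropped, so two identical letters at different x-positions both survive.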
Second, in some cases pdf-reader missed spaces in the parsed text. I think this happens because pdf-reader calculates spaces itself while PageTextReceiver discards the spaces found in the PDF stream. I found it more reliable to keep the spaces and strip extra ones during further processing:
- class PageTextReceiverKeepSpaces < PDF::Reader::PageTextReceiver
-   # We must expose the characters and mediabox attributes to instantiate PageLayout
-   attr_reader :characters, :mediabox
-   private
-   def internal_show_text(string)
-     if @state.current_font.nil?
-       raise PDF::Reader::MalformedPDFError, "current font is invalid"
-     end
-     glyphs = @state.current_font.unpack(string)
-     glyphs.each_with_index do |glyph_code, index|
-       # paint the current glyph
-       newx, newy = @state.trm_transform(0,0)
-       utf8_chars = @state.current_font.to_utf8(glyph_code)
-       # apply the glyph displacement for the current glyph so the next
-       # glyph will appear in the correct position
-       glyph_width = @state.current_font.glyph_width(glyph_code) / 1000.0
-       th = 1
-       scaled_glyph_width = glyph_width * @state.font_size * th
-       # modification to the original pdf-reader code, which otherwise removes spaces in some cases
-       # unless utf8_chars == SPACE
-       @characters << PDF::Reader::TextRun.new(newx, newy, scaled_glyph_width, @state.font_size, utf8_chars)
-       # end
-       @state.process_glyph_displacement(glyph_width, 0, utf8_chars == SPACE)
-     end
-   end
- end
It is the original code except for two lines that are commented out so that spaces are kept as well: the "unless utf8_chars == SPACE" guard and its matching "end".
Further processing
Based on the customized PageTextReceiver and PageLayout, I wrote a basic processor which takes the runs of each page and brings them into a structured form for further processing.
The processor class can be found in the following script. With the pdf-reader gem installed, invoke it with ./script.rb /path/to/some.pdf:
- #! /usr/bin/ruby
- require 'pdf-reader'
- class CustomPageLayout < PDF::Reader::PageLayout
-   attr_reader :runs
-   # we need to filter duplicate characters which seem to be caused by shadowing
-   def group_chars_into_runs(chars)
-     # filter out duplicate chars before going on with regular logic,
-     # seems to happen with shadowed text
-     chars.uniq! {|val| {x: val.x, y: val.y, text: val.text}}
-     super
-   end
- end
- class PageTextReceiverKeepSpaces < PDF::Reader::PageTextReceiver
-   # We must expose the characters and mediabox attributes to instantiate PageLayout
-   attr_reader :characters, :mediabox
-   private
-   def internal_show_text(string)
-     if @state.current_font.nil?
-       raise PDF::Reader::MalformedPDFError, "current font is invalid"
-     end
-     glyphs = @state.current_font.unpack(string)
-     glyphs.each_with_index do |glyph_code, index|
-       # paint the current glyph
-       newx, newy = @state.trm_transform(0,0)
-       utf8_chars = @state.current_font.to_utf8(glyph_code)
-       # apply the glyph displacement for the current glyph so the next
-       # glyph will appear in the correct position
-       glyph_width = @state.current_font.glyph_width(glyph_code) / 1000.0
-       th = 1
-       scaled_glyph_width = glyph_width * @state.font_size * th
-       # modification to the original pdf-reader code which otherwise accidentally removes spaces in some cases
-       # unless utf8_chars == SPACE
-       @characters << PDF::Reader::TextRun.new(newx, newy, scaled_glyph_width, @state.font_size, utf8_chars)
-       # end
-       @state.process_glyph_displacement(glyph_width, 0, utf8_chars == SPACE)
-     end
-   end
- end
- class PDFTextProcessor
-   MAX_KERNING_DISTANCE = 10 # experimental value
-   # pages may specify which pages to actually parse (zero based)
-   # [0, 3] will process only the first and fourth page if present
-   def self.process(pdf_io, pages = nil)
-     pdf_io.rewind
-     reader = PDF::Reader.new(pdf_io)
-     fail 'Could not find any pages in the given document' if reader.pages.empty?
-     processed_pages = []
-     text_receiver = PageTextReceiverKeepSpaces.new
-     requested_pages = pages ? reader.pages.values_at(*pages) : reader.pages
-     requested_pages.each do |page|
-       unless page.nil?
-         page.walk(text_receiver)
-         runs = CustomPageLayout.new(text_receiver.characters, text_receiver.mediabox).runs
-         # sort text runs from top left to bottom right
-         # read as: if both runs are on the same line first take the leftmost, else the uppermost - (0,0) is bottom left
-         runs.sort! {|r1, r2| r2.y == r1.y ? r1.x <=> r2.x : r2.y <=> r1.y}
-         # group runs by lines and merge those that are close to each other
-         lines_hash = {}
-         runs.each do |run|
-           lines_hash[run.y] ||= []
-           # runs that are very close to each other are considered to belong to the same text "block"
-           if lines_hash[run.y].empty? || (lines_hash[run.y].last.last.endx + MAX_KERNING_DISTANCE < run.x)
-             lines_hash[run.y] << [run]
-           else
-             lines_hash[run.y].last << run
-           end
-         end
-         lines = []
-         lines_hash.each do |y, run_groups|
-           lines << {y: y, text_groups: []}
-           run_groups.each do |run_group|
-             group_text = run_group.map { |run| run.text }.join('').strip
-             lines.last[:text_groups] << ({
-               x: run_group.first.x,
-               width: run_group.last.endx - run_group.first.x,
-               text: group_text,
-             }) unless group_text.empty?
-           end
-         end
-         # consistent indexing with pages param and reader.pages selection
-         processed_pages << {page: page.number, lines: lines}
-       end
-     end
-     processed_pages
-   end
- end
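- # File.exists? was removed in Ruby 3.2; the nil check covers a missing argument
- if ARGV[0] && File.exist?(ARGV[0])
-   # open in binary mode and close the file when done
-   File.open(ARGV[0], 'rb') do |file|
-     pages = PDFTextProcessor.process(file)
-     puts pages
-     puts "Parsed #{pages.count} pages"
-   end
- else
-   puts "Cannot open file '#{ARGV[0]}' (or no file given)"
- end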
The overall output is an array of hashes where each hash covers the text on one page. Each page hash has an array of lines, and each line is in turn represented by a hash. A line has a y-position and an array of the text groups found in this line. Lines are sorted from top to bottom ([0,0] is at the bottom left) and text groups from left to right:
- {
-   page: 1,
-   lines: [
-     {
-       y: 771.4006,
-       text_groups: [
-         {x: 60.7191, width: 164.6489200000004, text: "Some text on the left"},
-         {x: 414.8391, width: 119.76381600000008, text: "Some text on the right"}
-       ]
-     },
-     {
-       y: 750.7606,
-       text_groups: [{x: 60.7191, width: 88.51979999999986, text: "More text"}]
-     }
-   ]
- }
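Since every line carries its y-position and every text group its x-position, picking out text by location comes down to a simple filter. The following is a minimal sketch: find_text_groups is a hypothetical helper (it is not part of the script above or of pdf-reader), and the ranges are chosen to match the sample output.

```ruby
# Hypothetical helper: returns the texts of all groups on the given page whose
# line y-position and group x-position fall into the given ranges.
def find_text_groups(pages, page:, x_range:, y_range:)
  pages.select { |p| p[:page] == page }.flat_map do |p|
    p[:lines].select { |line| y_range.cover?(line[:y]) }.flat_map do |line|
      line[:text_groups].select { |group| x_range.cover?(group[:x]) }
                        .map { |group| group[:text] }
    end
  end
end

# The sample output from above, as returned by PDFTextProcessor.process.
sample_pages = [
  {
    page: 1,
    lines: [
      {
        y: 771.4006,
        text_groups: [
          { x: 60.7191, width: 164.6489200000004, text: 'Some text on the left' },
          { x: 414.8391, width: 119.76381600000008, text: 'Some text on the right' }
        ]
      },
      {
        y: 750.7606,
        text_groups: [{ x: 60.7191, width: 88.51979999999986, text: 'More text' }]
      }
    ]
  }
]

# Everything in the right-hand column of the top line.
right_column = find_text_groups(sample_pages, page: 1, x_range: 400..450, y_range: 760..780)
puts right_column.inspect
```

Ranges are used rather than exact values because the positions are floats and rarely match a literal exactly.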
Hi Jonas, your PDF parsing code is awesome, great job. I’m using it in a rails app to parse PDF data by x,y coordinate location.
Unfortunately, I’ve had some trouble with the text output getting scrambled. For example, some text in the PDF I want to extract reads "3204.0", but when I run it through your code, it comes out as :text=>"2634.0", with the numbers transposed. This doesn’t happen all the time and I’m not sure what is triggering it, but it is happening regularly, on a variety of lines. Here is a link to the PDF. For example, the data I want to extract is in box 19:
http://webapps.rrc.state.tx.us/CMPL/viewPdfReportFormAction.do?method=cmplW2FormPdf&packetSummaryId=96755
Do you have any thoughts on how to keep the text from getting scrambled? I tried playing around with your code, but I was not able to find a solution.
Thanks,
Josh
I have no clear idea why this is happening. What I observed is that the calculated width for all groups in your PDF is 0.0 so there seems to be something odd with the characters in the file or the way pdf-reader handles them. I tried to extract the text with pdf-reader’s basic reader.pages[0].text but that raised an exception (FloatDomainError Exception: Infinity).
Further, the label "Oil - BBLS" in Box 19 (y: 449.22009999999995, x: 145.0) is only parsed correctly if I disable the duplicate filtering in CustomPageLayout. With filtering it is parsed as "Oil -BLS", i.e. a space and a B are missing. It seems as if the PDF specifies the exact same coordinates for each character in the group (or pdf-reader calculates the same for them all). This also fits with the observation that the width is 0.0 for all groups, and with the permutation of characters within text groups, such as in your number or "russerP gnisaCe" instead of "Casing Pressure" (filtering disabled).
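The effect can be reproduced without the PDF. The sketch below assumes, as suspected above, that every glyph has width 0, so the text cursor never advances and all characters end up at the same coordinates. Glyph and place_glyphs are hypothetical stand-ins for illustration, not pdf-reader code, and the cursor model is heavily simplified.

```ruby
# Hypothetical stand-in for a positioned character (not PDF::Reader::TextRun).
Glyph = Struct.new(:x, :y, :text)

# Simplified cursor model: each character is placed at the current x, then the
# cursor advances by the glyph width.
def place_glyphs(text, start_x, y, glyph_width)
  x = start_x
  text.each_char.map do |ch|
    glyph = Glyph.new(x, y, ch)
    x += glyph_width
    glyph
  end
end

# All widths are 0, so every character lands on x = 145.0.
glyphs = place_glyphs('Oil - BBLS', 145.0, 449.22, 0.0)
puts glyphs.map(&:x).uniq.inspect # => [145.0]

# The duplicate filter from CustomPageLayout now sees the second space and the
# second "B" as duplicates and drops them.
filtered = glyphs.uniq { |g| { x: g.x, y: g.y, text: g.text } }
puts filtered.map(&:text).join # => Oil -BLS
```

With a non-zero glyph width the positions differ and the filter leaves the label untouched.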
My experience is that pdf-reader’s internal calculation has problems with small fonts, but the only symptoms I got from that were missing or extra spaces. So in my code I run both variants, the original parsing and the one that keeps spaces, and decide based on the outcome which one to use. But that won’t help here.
The first thing I would do is have a look at the object stream of your PDF file to find out whether the PDF is the actual problem or not. Using pdftk you can uncompress the object stream and then look at it with a text editor. If the characters in there are all at the same position, then there is your problem. Otherwise you could maybe open an issue for pdf-reader.
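For reference, a minimal sketch of the uncompress step driven from Ruby. It assumes pdftk is installed and on the PATH; the two method names and the file arguments are placeholders made up for this example.

```ruby
# Builds the argument list for pdftk's "uncompress" operation, which writes a
# copy of the PDF with its page streams uncompressed.
def pdftk_uncompress_command(input_path, output_path)
  ['pdftk', input_path, 'output', output_path, 'uncompress']
end

# Runs pdftk. Returns true on success, false on a non-zero exit status and
# nil if pdftk could not be executed at all.
def uncompress_pdf(input_path, output_path)
  system(*pdftk_uncompress_command(input_path, output_path))
end

# Usage: ruby uncompress.rb input.pdf readable.pdf
if __FILE__ == $PROGRAM_NAME && ARGV.size == 2
  ok = uncompress_pdf(ARGV[0], ARGV[1])
  puts(ok ? "Wrote #{ARGV[1]}" : 'pdftk failed or is not installed')
end
```

The resulting file can be opened in a text editor to inspect the text positioning operators directly.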
Thanks for the feedback. For the record, I don’t think it’s a pdf-reader based problem. When I run the code pasted below, which only uses methods from pdf-reader, all the text I want from the PDF is displayed properly, which seems to indicate that pdf-reader can handle it.
Unfortunately pdf-reader does not provide location coordinates. So with your methods, I can get the location, but not the exact text, and with pdf-reader, I can get the text, but not a consistent location. Vexing!
Thanks again for the help.
task :parse_and_save_w2_rrc_form_v3 => :environment do
  require 'pdf-reader'
  require 'open-uri'
  receiver = PDF::Reader::RegisterReceiver.new
  filename = "http://webapps.rrc.state.tx.us/CMPL/viewPdfReportFormAction.do?method=cmplW2FormPdf&packetSummaryId=66732"
  # URI.open is required on Ruby 3.0+, where Kernel#open no longer opens URLs
  io = URI.open(filename)
  PDF::Reader.open(io) do |reader|
    reader.pages.each do |page|
      page.walk(receiver)
      receiver.callbacks.each do |cb|
        if cb.values.first.to_s == "show_text"
          puts cb[:args].to_s
        end
      end
    end
  end
end
______________________
Output (sample):
["RAILROAD COMMISSION OF TEXAS"]
[" "]
["Oil and Gas Division"]
["API No."]
[" "]
["42- "]
["7. RRC District No."]
["8. RRC Lease No."]
[" Oil Well Potential Test, Completion or Recompletion Report, and Log"]
["PEARSALL (BUDA, S.)"]
["HEITZ-FEHRENBACH UNIT"]
["HUGHES, DAN A. COMPANY, L.P."]
["411736"]
[" 1H "]
["P O DRAWER 669 BEEVILLE, TX 78104-0669"]
["38 , 3 , I&GN RR CO , A-347"]
["BIG WELLS"]
["Initial Potential"]
["Reclass"]
["(Explain In remarks)"]
["Well record only"]
["Retest"]
["11. Purpose of filing"]
["6b. Distance and direction to nearest town in this county."]
["6a. Location (Section, Block, and Survey)"]
["4. ADDRESS"]
["9. Well No."]
["RRC Operator No."]
I had a look at your two PDFs. Among the fonts defined in them there is a font (Font3) which has a lot of characters specified with width 0.
Do you have any influence on how the PDFs are generated? PDF version 1.2 is also pretty old, although I don’t know whether that could actually be a problem.
Using the debugger I found out that there also seems to be a problem with the fonts derived from basic fonts, one of which is used e.g. for the heading "RAILROAD COMMISSION OF TEXAS". It is specified with Font0, which is derived from Times-Bold. Times-Bold belongs to the 14 basic fonts of the PDF standard, so I don’t know why there is a problem with that (wrong specification of a derived font? wrong processing?).
I put a debugger statement right after the font check at the beginning of internal_show_text of PageTextReceiverKeepSpaces and played around with the glyph processing statements beneath. It turned out that the calculated width of each character of the heading is 0, while the text itself is correct.
This messes up text processing in PageLayout and PageTextReceiver, because the text cursor is advanced within show_text by the width of the processed characters. I guess there are two ways to fix this:
1. Change the PDF generation process such that the created files are compatible with the text processing of pdf-reader.
2. Find out why glyph width calculation yields 0 and fix that.
If you have no influence on PDF generation, the choice is easy 😉
Jonas, Vielen Dank from the USA.
I searched for hours for this exact solution and I’m so happy I found it.
Thank you very much!
Many thanks, Jonas,
works like a charm!
Christian