How To Stop Pdfplumber From Reading The Header Of Every Page?
I want pdfplumber to extract the text from an arbitrary PDF supplied by the user. The problem is that pdfplumber also extracts the header text or the title from each page. How can I pro
Solution 1:
I don't think you can.
However, you can crop each page with the crop method. This way, you can extract the text only from the cropped part of the page, leaving out headers and footers.
Of course, this method requires that you know the height of the headers and footers in advance.
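Here is an explanation of the crop coordinates. In the code below, each one is given as a fraction (0 to 1) of the page width or height:
x0 = distance of the left side of the crop box from the left side of the page, as a fraction of the page width.
top = distance of the top of the crop box from the top of the page, as a fraction of the page height.
x1 = distance of the right side of the crop box from the left side of the page, as a fraction of the page width.
bottom = distance of the bottom of the crop box from the top of the page, as a fraction of the page height.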
Here is the code:
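import pdfplumber

filename = 'document.pdf'  # placeholder path; replace with the user's PDF
# Example fractions only (assumed values): skip the top and bottom 10% of each page
crop_coords = [0.0, 0.1, 1.0, 0.9]  # [x0, top, x1, bottom]

# Get text of whole document as string
text = ''
pages = []
with pdfplumber.open(filename) as pdf:
    for page in pdf.pages:
        my_width = float(page.width)
        my_height = float(page.height)
        # Convert the fractions to absolute coordinates and crop the page
        my_bbox = (crop_coords[0] * my_width, crop_coords[1] * my_height,
                   crop_coords[2] * my_width, crop_coords[3] * my_height)
        page_crop = page.crop(bbox=my_bbox)
        # Some pdfplumber versions return None when the region has no text
        text += (page_crop.extract_text() or '').lower()
        pages.append(page_crop)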
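If the header and footer heights are known in points rather than as fractions of the page, the crop box can be computed directly from them. The following is only a sketch under assumptions: the helper names body_bbox and extract_body_text are hypothetical (they are not part of pdfplumber), the header and footer heights must be measured by you, and each page is assumed to have its origin at (0, 0).

```python
def body_bbox(page_width, page_height, header_height=0.0, footer_height=0.0):
    """Return (x0, top, x1, bottom) for the area between header and footer.

    Hypothetical helper: all values use the same units as the page
    dimensions (PDF points in pdfplumber), measured from the top-left corner.
    """
    if header_height < 0 or footer_height < 0:
        raise ValueError("header_height and footer_height must be non-negative")
    top = float(header_height)
    bottom = float(page_height) - float(footer_height)
    if top >= bottom:
        raise ValueError("header and footer leave no room for body text")
    return (0.0, top, float(page_width), bottom)


def extract_body_text(filename, header_height, footer_height):
    """Extract text from every page, skipping the header and footer bands."""
    import pdfplumber  # imported here so body_bbox works without pdfplumber

    parts = []
    with pdfplumber.open(filename) as pdf:
        for page in pdf.pages:
            bbox = body_bbox(page.width, page.height,
                             header_height, footer_height)
            # Fall back to '' in case extract_text() returns None
            parts.append(page.crop(bbox).extract_text() or "")
    return "\n".join(parts)
```

For example, on a 612 x 792 point page with a 50-point header and a 40-point footer, body_bbox(612, 792, 50, 40) gives (0.0, 50.0, 612.0, 752.0). If headers differ in height from one PDF to another, these values have to be determined for each document.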