
How To Stop Pdfplumber From Reading The Header Of Every Page?

I want pdfplumber to extract the text from an arbitrary PDF supplied by the user. The problem is that pdfplumber also extracts the header text or the title from every page. How can I pro

Solution 1:

I don't think you can.

However, you can crop each page with the crop method and extract text only from the cropped region, leaving out headers and footers. Of course, this requires knowing the heights of the headers and footers in advance.

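Here is the explanation of the coords. Each value is a fraction of the page width or height (between 0 and 1), because the code multiplies it by the page dimensions: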

x0 = % Distance of left side of character from left side of page.
top = % Distance of top of character from top of page.
x1 = % Distance of right side of character from left side of page.
bottom = % Distance of bottom of the character from top of page.

Here is the code:

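import pdfplumber

filename = 'document.pdf'  # placeholder: path to the user's PDF
# Fractions of page width/height in the order [x0, top, x1, bottom].
# Example values that drop the top and bottom 10% of each page.
crop_coords = [0.0, 0.1, 1.0, 0.9]

# Get text of whole document as string
text = ''
pages = []
with pdfplumber.open(filename) as pdf:
    for page in pdf.pages:
        my_width = float(page.width)
        my_height = float(page.height)
        # Crop pages: scale the fractions to absolute page coordinates
        my_bbox = (crop_coords[0]*my_width, crop_coords[1]*my_height, crop_coords[2]*my_width, crop_coords[3]*my_height)
        page_crop = page.crop(bbox=my_bbox)
        # Fall back to '' if no text is extracted from the cropped region
        text = text + (page_crop.extract_text() or '').lower()
        pages.append(page_crop)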
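
If the header and footer heights are known in page units (points) rather than as fractions, the bounding box can be computed directly. What follows is a minimal sketch and not part of the original answer. The names body_bbox, extract_body_text, header_height and footer_height are hypothetical, and the sketch assumes each page's coordinate origin is at (0, 0), as the code above does.

```python
def body_bbox(page_width, page_height, header_height, footer_height):
    """Return (x0, top, x1, bottom) for the page minus header and footer.

    All values share the units of the page dimensions (PDF points in
    pdfplumber). Hypothetical helper for illustration only.
    """
    top = float(header_height)
    bottom = float(page_height) - float(footer_height)
    if header_height < 0 or footer_height < 0 or top >= bottom:
        raise ValueError("header and footer leave no page body to crop")
    return (0.0, top, float(page_width), bottom)


def extract_body_text(path, header_height, footer_height):
    """Join the text of every page after cropping off header and footer."""
    import pdfplumber  # imported lazily so body_bbox works without it

    parts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            bbox = body_bbox(page.width, page.height,
                             header_height, footer_height)
            # extract_text() may yield nothing for an empty region
            parts.append(page.crop(bbox).extract_text() or "")
    return "\n".join(parts)
```

For example, extract_body_text("document.pdf", 50, 40) would skip the top 50 and bottom 40 points of each page. Both numbers are placeholders that have to be measured for the actual document.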
