Download Pdfs With Python
I am trying to download several PDFs that sit behind different hyperlinks on a single URL. My approach was first to retrieve the URLs that contained the 'fileEntryId' text.
Solution 1:
Create a folder anywhere and put the script in it. When you run the script, the downloaded PDF files should appear in that folder. If for some reason the script doesn't work for you, check that your bs4 version is up to date, as I've used pseudo CSS selectors to target the required links.
import requests
from bs4 import BeautifulSoup

link = 'https://www.contraloria.gov.co/resultados/proceso-auditor/auditorias-liberadas/regalias/auditorias-regalias-liberadas-ano-2015'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    # Each table row's first cell links to a detail page whose URL contains 'fileEntryId'
    for item in soup.select("table.table > tbody.table-data td.first > a[href*='fileEntryId']"):
        inner_link = item.get("href")
        resp = s.get(inner_link)
        inner_soup = BeautifulSoup(resp.text, "lxml")
        # The download anchor on the detail page is labelled 'Descargar';
        # on newer bs4/soupsieve, use ":-soup-contains('Descargar')" instead of ":contains(...)"
        pdf_link = inner_soup.select_one("a.taglib-icon:contains('Descargar')").get("href")
        # Derive a file name from the last path segment, dropping the query string
        file_name = pdf_link.split("/")[-1].split("?")[0]
        with open(f"{file_name}.pdf", "wb") as f:
            f.write(s.get(pdf_link).content)
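The file-name derivation above (last path segment, query string stripped) can also be done with the standard library's urlparse, which additionally decodes percent-encoded names. This is a small sketch of that step in isolation; the example URL is hypothetical, for illustration only:

```python
from urllib.parse import urlparse, unquote

def pdf_name_from_url(pdf_link: str) -> str:
    # Equivalent to pdf_link.split("/")[-1].split("?")[0], but urlparse also
    # strips fragments, and unquote turns e.g. %20 back into a space.
    path = urlparse(pdf_link).path
    return unquote(path.rsplit("/", 1)[-1])

# Hypothetical URL, for illustration only:
print(pdf_name_from_url("https://example.com/docs/informe%202015.pdf?version=1.0"))
# → informe 2015.pdf
```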