
Script Throws An Error When It Is Made To Run Using Multiprocessing

I've written a script in python in combination with BeautifulSoup to extract the title of books which get populated upon providing some ISBN numbers in amazon search box. I'm provi

Solution 1:

It does not make sense to reference the global variable row in get_data(), because

  1. It's a global and will not be shared between each "thread" in the multiprocessing Pool, because they are actually separate python processes that do not share globals.

  2. Even if they did, you build the entire ISBN list before executing get_info(), so the loop has already completed and row would always hold its final value, ws.max_row.
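
Both points can be checked with a small stand-alone sketch. It has nothing to do with Amazon or openpyxl: counter, bump(), row_after_loop() and run_demo() are made-up names used only for illustration, with counter playing the role of the global row.

```python
from multiprocessing import Pool

counter = 0  # module-level global, playing the role of `row` in the question


def bump(_):
    """Increment the global and return the value this process now sees."""
    global counter
    counter += 1
    return counter


def row_after_loop(max_row):
    """Return the loop variable once a range loop has finished (assumes max_row >= 2)."""
    for row in range(2, max_row + 1):
        pass
    return row


def run_demo():
    """Run bump() in worker processes, then read the global in the parent."""
    with Pool(2) as pool:
        seen_in_workers = pool.map(bump, range(4))
    # Each worker changed its own copy of `counter`; the parent's copy is untouched.
    return seen_in_workers, counter


if __name__ == "__main__":
    seen_in_workers, parent_value = run_demo()
    print("values seen inside workers:", seen_in_workers)
    print("value in the parent:", parent_value)  # still 0
    print("row after a finished loop:", row_after_loop(10))  # 10
```

The exact values reported by the workers depend on how the four tasks are split between the two processes, but the parent's counter stays at 0 either way.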

So you would need to pass the row values as part of the data given as the second argument of p.map(). But even if you were to do that, writing to and saving the spreadsheet from multiple processes is a bad idea: each process works on its own in-memory copy of the workbook, so concurrent saves can overwrite one another, and on Windows you can also run into file-locking errors. You're better off building the list of titles with multiprocessing and then writing them out once when that's done, as in the following:

import requests
from bs4 import BeautifulSoup
from openpyxl import load_workbook
from multiprocessing import Pool


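def get_info(isbn):
    """Search Amazon for one ISBN and return the title of the first result, or None if nothing is found."""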
    params = {
        'url': 'search-alias=aps',
        'field-keywords': isbn
    }
    res = requests.get("https://www.amazon.com/s/ref=nb_sb_noss?", params=params)
    soup = BeautifulSoup(res.text, "lxml")
    itemlink = soup.select_one("a.s-access-detail-page")
    if itemlink:
        return get_data(itemlink['href'])


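def get_data(link):
    """Fetch a product page and return its title, or a placeholder if the title element is missing."""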
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    try:
        itmtitle = soup.select_one("#productTitle").get_text(strip=True)
    except AttributeError:
        itmtitle = "N/A"
    return itmtitle


def main():
    wb = load_workbook('amazon.xlsx')
    ws = wb['content']

    isbnlist = []
    # Collect ISBNs from column A, starting below the header row,
    # and stop at the first empty cell.
    for row in range(2, ws.max_row + 1):
        val = ws.cell(row=row, column=1).value
        if val is None:
            break
        isbnlist.append(val)

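    # map() blocks until every ISBN has been processed and returns the titles
    # in the same order as isbnlist; leaving the with-block shuts the pool down.
    with Pool(10) as p:
        titles = p.map(get_info, isbnlist)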

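    # Write each title next to its ISBN: titles[0] belongs to row 2, and so on.
    # Iterating over titles (not ws.max_row) avoids an IndexError when the
    # ISBN loop stopped early at an empty cell.
    for row, title in enumerate(titles, start=2):
        ws.cell(row=row, column=2).value = title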

    wb.save("amazon.xlsx")


if __name__ == '__main__':
    main()
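
If the row numbers do need to travel with the data, as mentioned above, one possible shape is to hand p.map() a list of (row, isbn) pairs and have the worker return (row, title). The sketch below is only an illustration under assumptions: pair_with_rows(), fetch_for_row() and collect_titles() are hypothetical names, lookup_title() is a stand-in for the real get_info() so the example runs without network access, and the ISBNs are sample values.

```python
from multiprocessing import Pool


def lookup_title(isbn):
    """Hypothetical stand-in for get_info(); returns a fake title without any network call."""
    return f"title for {isbn}"


def pair_with_rows(isbns, first_row=2):
    """Attach the spreadsheet row number to every ISBN: [(2, isbn0), (3, isbn1), ...]."""
    return list(enumerate(isbns, start=first_row))


def fetch_for_row(item):
    """Worker: receives one (row, isbn) pair and returns (row, title)."""
    row, isbn = item
    return row, lookup_title(isbn)


def collect_titles(isbns, processes=4):
    """Look up all titles in parallel and return them keyed by row number."""
    with Pool(processes) as pool:
        return dict(pool.map(fetch_for_row, pair_with_rows(isbns)))


if __name__ == "__main__":
    # Sample ISBNs, for illustration only.
    titles_by_row = collect_titles(["9780131103627", "9780201633610"])
    for row, title in sorted(titles_by_row.items()):
        # In the real script this is where the parent process alone would
        # write to the worksheet, e.g. ws.cell(row=row, column=2).value = title
        print(row, title)
```

The workers never touch the workbook here; they only return data, and the single parent process decides where each title goes.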
