How To Scrape All Results From Google Search Results Pages (python/selenium Chromedriver)

July 27, 2023 Post a Comment

I am working on a Python script using selenium chromedriver to scrape all google search results (link, header, text) off a specified number of results pages. The code I have seems

Solution 1:

You have an indentation problem:

You should to have return pageInfo outside for loop, otherwise you're returning results after first loop execution

for result in searchResults:
       element = result.find_element_by_css_selector('a')
       link = element.get_attribute('href')
       header = result.find_element_by_css_selector('h3').text
       text = result.find_element_by_class_name('IsZvec').text
       pageInfo.append({
           'header' : header, 'link' : link, 'text': text
       })
       return pageInfo

Like this:

for result in searchResults:
       element = result.find_element_by_css_selector('a')
       link = element.get_attribute('href')
       header = result.find_element_by_css_selector('h3').text
       text = result.find_element_by_class_name('IsZvec').text
       pageInfo.append({
           'header' : header, 'link' : link, 'text': text
       })
return pageInfo

I've ran your code and got results:

[{'header': 'Cars (film) — Wikipédia', 'link': 'https://fr.wikipedia.org/wiki/Cars_(film)', 'text': "Cars : Quatre Roues, ou Les Bagnoles au Québec (Cars), est le septième long-métrage d'animation entièrement en images de synthèse des studios Pixar.\nPays d’origine : États-Unis\nDurée : 116 minutes\nSociétés de production : Pixar Animation Studios\nGenre : Animation\nCars 2 · Michel Fortin · Flash McQueen"}, {'header': 'Cars - Wikipedia, la enciclopedia libre', 'link': 'https://es.wikipedia.org/wiki/Cars', 'text': 'Cars es una película de animación por computadora de 2006, producida por Pixar Animation Studios y lanzada por Walt Disney Studios Motion Pictures.\nAño : 2006\nGénero : Animación; Aventuras; Comedia; Infa...\nHistoria : John Lasseter Joe Ranft Jorgen Klubi...\nProductora : Walt Disney Pictures; Pixar Animat...'}, {'header': '', 'link': 'https://fr.wikipedia.org/wiki/Flash_McQueen', 'text': ''}, {'header': '', 'link': 'https://www.allocine.fr/film/fichefilm-55774/secrets-tournage/', 'text': ''}, {'header': '', 'link': 'https://fr.wikipedia.org/wiki/Martin_(Cars)', 'text': ''},

Suggestions:

Use a timer to control your for loop, otherwise you could be banned by Google due to suspicious activity

Steps: 1.- Import sleep from time: from time import sleep 2.- On your last loop add a timer:

for i in range(0 , numPages - 1):
    sleep(5) #It'll wait 5 seconds for each iteration
    nextButton = driver.find_element_by_link_text('Next')
    nextButton.click()
    infoAll.extend(scrape())

Learn Python Programming

How To Scrape All Results From Google Search Results Pages (python/selenium Chromedriver)

Solution 1:

Post a Comment for "How To Scrape All Results From Google Search Results Pages (python/selenium Chromedriver)"