How To Scrape All Results From Google Search Results Pages (python/selenium Chromedriver)
Solution 1:
You have an indentation problem:
You should to have return pageInfo
outside for loop, otherwise you're returning results after first loop execution
for result in searchResults:
element = result.find_element_by_css_selector('a')
link = element.get_attribute('href')
header = result.find_element_by_css_selector('h3').text
text = result.find_element_by_class_name('IsZvec').text
pageInfo.append({
'header' : header, 'link' : link, 'text': text
})
return pageInfo
Like this:
for result in searchResults:
element = result.find_element_by_css_selector('a')
link = element.get_attribute('href')
header = result.find_element_by_css_selector('h3').text
text = result.find_element_by_class_name('IsZvec').text
pageInfo.append({
'header' : header, 'link' : link, 'text': text
})
return pageInfo
I've ran your code and got results:
[{'header': 'Cars (film) — Wikipédia', 'link': 'https://fr.wikipedia.org/wiki/Cars_(film)', 'text': "Cars : Quatre Roues, ou Les Bagnoles au Québec (Cars), est le septième long-métrage d'animation entièrement en images de synthèse des studios Pixar.\nPays d’origine : États-Unis\nDurée : 116 minutes\nSociétés de production : Pixar Animation Studios\nGenre : Animation\nCars 2 · Michel Fortin · Flash McQueen"}, {'header': 'Cars - Wikipedia, la enciclopedia libre', 'link': 'https://es.wikipedia.org/wiki/Cars', 'text': 'Cars es una película de animación por computadora de 2006, producida por Pixar Animation Studios y lanzada por Walt Disney Studios Motion Pictures.\nAño : 2006\nGénero : Animación; Aventuras; Comedia; Infa...\nHistoria : John Lasseter Joe Ranft Jorgen Klubi...\nProductora : Walt Disney Pictures; Pixar Animat...'}, {'header': '', 'link': 'https://fr.wikipedia.org/wiki/Flash_McQueen', 'text': ''}, {'header': '', 'link': 'https://www.allocine.fr/film/fichefilm-55774/secrets-tournage/', 'text': ''}, {'header': '', 'link': 'https://fr.wikipedia.org/wiki/Martin_(Cars)', 'text': ''},
Suggestions:
Use a timer to control your for loop, otherwise you could be banned by Google due to suspicious activity
Steps:
1.- Import sleep from time: from time import sleep
2.- On your last loop add a timer:
for i in range(0 , numPages - 1):
sleep(5) #It'll wait 5 seconds for each iteration
nextButton = driver.find_element_by_link_text('Next')
nextButton.click()
infoAll.extend(scrape())
Post a Comment for "How To Scrape All Results From Google Search Results Pages (python/selenium Chromedriver)"