Scraping LinkedIn jobs across paginated search results

Problem description

I am trying to scrape all jobs in Egypt from LinkedIn at the following URL:

https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start=25

First, I was able to scrape all the job titles on a single page with this code:

# Collect the title text of every job card on the current page
job_titles = browser.find_elements_by_css_selector("a.job-card-list__title")
c = []

for title in job_titles:
    c.append(title.text)
print(c)
print(len(c))

Then I realized that to go from one page to the next I have to manipulate the start parameter in the URL, since it increases by 25 on each page, and that works with the following code:

page = 25 
pagination = browser.get('https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start={}'.format(page))
for i in range(1,5):
    page =  i * 25 
    pagination = browser.get('https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start={}'.format(page))

However, putting the two code blocks together:

page = 25 
job_titles = browser.find_elements_by_css_selector("a.job-card-list__title")
c = []
pagination = browser.get('https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start={}'.format(page))
for i in range(1,5):
    page =  i * 25 
    for title in job_titles:
        c.append(title.text)
    pagination = browser.get('https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start={}'.format(page))
print(c)
print((len(c))) 

gives me this error:

StaleElementReferenceException            Traceback (most recent call last)
<ipython-input-107-8fe8527b4d0f> in <module>
      6     page =  i * 25
      7     for title in job_titles:
----> 8         c.append(title.text)
      9     pagination = browser.get('https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start={}'.format(page))
     10 print(c)

~\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py in text(self)
     74     def text(self):
     75         """The text of the element."""
---> 76         return self._execute(Command.GET_ELEMENT_TEXT)['value']
     77 
     78     def click(self):

~\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py in _execute(self, command, params)
    631             params = {}
    632         params['id'] = self._id
--> 633         return self._parent.execute(command, params)
    634 
    635     def find_element(self, by=By.ID, value=None):

~\anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py in execute(self, driver_command, params)
    319         response = self.command_executor.execute(driver_command, params)
    320         if response:
--> 321             self.error_handler.check_response(response)
    322             response['value'] = self._unwrap_value(
    323                 response.get('value', None))

~\anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py in check_response(self, response)
    240                 alert_text = value['alert'].get('text')
    241             raise exception_class(message, screen, stacktrace, alert_text)
--> 242         raise exception_class(message, screen, stacktrace)
    243 
    244     def _value_or_default(self, obj, key, default):

StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: chrome=92.0.4515.159)

How can I fix this error so that I can run this code, open each of the 40 search pages, and scrape all the job titles?

Tags: python, selenium, selenium-webdriver, beautifulsoup, selenium-chromedriver

Solution

