Python Selenium, scraping LinkedIn: looping through work and education history

Problem description

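I am using Selenium from Python to scrape data from LinkedIn profiles. It mostly works, but I can't figure out how to extract the information for each employer or school in a profile's history sections.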

I am following this tutorial: https://www.linkedin.com/pulse/how-easy-scraping-data-from-linkedin-profiles-david-craven/

I am looking at this profile: https://www.linkedin.com/in/pauljgarner/?originalSubdomain=uk

Here is a partial snippet of the HTML I am struggling with:

<section id="experience-section" class="pv-profile-section experience-section ember-view"><header class="pv-profile-section__card-header">
  <h2 class="pv-profile-section__card-heading">
    Experience
  </h2>

<!----></header>

  <ul class="pv-profile-section__section-info section-info pv-profile-section__section-info--has-more">
<li id="ember136" class="pv-entity__position-group-pager pv-profile-section__list-item ember-view">        <section id="1762786165" class="pv-profile-section__card-item-v2 pv-profile-section pv-position-entity ember-view">  <div class="display-flex justify-space-between full-width">
    <div class="display-flex flex-column full-width">
<a data-control-name="background_details_company" href="/company/wagestream/" id="ember138" class="full-width ember-view">          <div class="pv-entity__logo company-logo">
  <img src="https://media-exp1.licdn.com/dms/image/C560BAQEkzVWoORqWFQ/company-logo_100_100/0/1615996325297?e=1631145600&amp;v=beta&amp;t=SoZQKV09PqqYxYTzbjqV4XTJa7HkGUZRe4QT0jU5hmE" loading="lazy" alt="Wagestream" id="ember140" class="pv-entity__logo-img EntityPhoto-square-5 lazy-image ember-view">
</div>
<div class="pv-entity__summary-info pv-entity__summary-info--background-section ">
  <h3 class="t-16 t-black t-bold">Senior Software Engineer</h3>
  <p class="visually-hidden">Company Name</p>
  <p class="pv-entity__secondary-title t-14 t-black t-normal">
      Wagestream
        <span class="pv-entity__secondary-title separator">Full-time</span>
  </p>
    <div class="display-flex">
    <h4 class="pv-entity__date-range t-14 t-black--light t-normal">
      <span class="visually-hidden">Dates Employed</span>
      <span>Apr 2021 – Present</span>
    </h4>
      <h4 class="t-14 t-black--light t-normal">
        <span class="visually-hidden">Employment Duration</span>
        <span class="pv-entity__bullet-item-v2">3 mos</span>
      </h4>
  </div>

  <h4 class="pv-entity__location t-14 t-black--light t-normal block">
    <span class="visually-hidden">Location</span>
    <span>London, England, United Kingdom</span>
  </h4>
<!---->
</div>

</a>
<!---->    </div>

<!---->  </div>
</section>

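More "li" sections follow. So the whole history section can be identified by id="experience-section", and the work (as opposed to education) history can be identified by the "ul" element with class="pv-profile-section__section-info section-info pv-profile-section__section-info--has-more". The information for the first job in the list can be identified by the "li" element with id="ember136".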

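I am trying to get the job title, company, years worked, and so on from this section, but I don't know how to do it. Here is some Python code showing what I have tried (skipping my login):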

from parsel import Selector
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests

path = r'C:\Program Files (x86)\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(path)

# driver.get method() will navigate to a page given by the URL address
driver.get('https://www.linkedin.com/in/pauljgarner/?originalSubdomain=uk')

text=driver.page_source
sel = Selector(text) 

# Using the "Copy xPath" option in Inspect in Google Chrome, I can manually extract the company name
sel.xpath('//*[@id="ember187"]/div[2]/p[2]/text()').extract_first()  

# This will give me all of the text in the Work Experience section
html_list = driver.find_element_by_id("experience-section")
items = html_list.find_elements_by_tag_name("ul")
items = html_list.find_elements_by_tag_name("h3")
for item in items:
    print(type(item))
    text = item.text
    print(text)

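But these approaches are not a good way to extract information automatically and systematically from every job in a profile. What I would like to do is loop through the "li" elements inside each "ul" and, within each "li", extract only the company name, which has class="pv-entity__secondary-title t-14 t-black t-normal". But find_element_by_class_name only yields NoneTypes.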

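Conceptually, I am not sure how to use Selenium to build an iterable list of the "ul" and "li" elements and, on each iteration, extract specific bits of text by class name.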

Tags: python, html, selenium, web-scraping, linkedin

Solution


Here is the solution I came up with. I should point out that I "cross-posted" in the YouTube comments of the following tutorial: https://www.youtube.com/watch?v=W4Md-kopmE

Run the whole code, but substitute your own email and password.

First, open a browser, log in to LinkedIn, and navigate to the profile in question:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
from time import sleep

# Path to the chromedriver.exe
path = r'C:\Program Files (x86)\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(path)

driver.get('https://www.linkedin.com')

# Log into LinkedIn
username = driver.find_element_by_id('session_key')
username.send_keys('mail@mail.com')

sleep(0.5)

password = driver.find_element_by_id('session_password')
password.send_keys('password')

sleep(0.5)

log_in_button = driver.find_element_by_class_name('sign-in-form__submit-button')
log_in_button.click()

sleep(3)

# The example profile I am trying to scrape
driver.get('https://www.linkedin.com/in/pauljgarner/?originalSubdomain=uk')
sleep(3)

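If I start trying to extract things right away, I get an error. It turns out that I need to scroll down to the relevant section for it to load; otherwise its data is never created: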

# The experience section doesn't load until you scroll to it, this will scroll to the section
section = driver.find_element_by_xpath('//*[@id="oc-background-section"]')
driver.execute_script("arguments[0].scrollIntoView(true);", section)
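
The fixed sleep(3) calls above are a guess at how long the page needs. A minimal sketch of an alternative is to poll until a lookup succeeds. `wait_until` is a hypothetical helper written in plain Python for illustration (Selenium ships its own WebDriverWait for this purpose), the timeout values are assumptions, and the fake lookup merely stands in for a page that loads late.

```python
import time


def wait_until(lookup, timeout=10.0, poll=0.5):
    """Call lookup() repeatedly until it returns a truthy value or time runs out.

    Hypothetical helper: lookup is any zero-argument callable, for example
    lambda: driver.find_elements_by_id("experience-section").
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = lookup()
        except Exception:
            # Treat a failed lookup like "not there yet" and retry.
            result = None
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Stand-in for a page whose section only appears on the third check.
attempts = {"count": 0}


def fake_lookup():
    attempts["count"] += 1
    return ["<experience-section>"] if attempts["count"] >= 3 else []


found = wait_until(fake_lookup, timeout=5.0, poll=0.01)
print(found, attempts["count"])
```

With a real driver, the call after the scroll might look like wait_until(lambda: driver.find_elements_by_id("experience-section")), which returns as soon as the list is non-empty instead of always sleeping for the full interval.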

To loop through the work experience, I first identify its "id" value, which in this case is "experience-section", and fetch it with the "find_element_by_id" method.

# Get stuff in work experience section
html_list = driver.find_element_by_id("experience-section")

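This section contains a list of "li" elements (that is, elements with tag name "li"), each holding all the information for one past job. Use "find_elements_by_tag_name" to create a list of these as WebElement objects.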

# Jobs listed as li sections, create list of li 
items = html_list.find_elements_by_tag_name("li")
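
The nesting can also be tried without a browser. The sketch below is only a stand-in, not what this answer runs: it uses the standard library's xml.etree.ElementTree on a simplified, well-formed copy of the markup, with iter("li") playing the role of find_elements_by_tag_name("li"). The second job entry is invented for illustration so that the loop has two items.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed copy of the experience markup.
# The second <li> is a made-up entry, added only so the loop has two items.
SAMPLE = """
<section id="experience-section">
  <ul>
    <li>
      <h3>Senior Software Engineer</h3>
      <p>Company Name</p>
      <p>Wagestream <span>Full-time</span></p>
    </li>
    <li>
      <h3>Example Job Title</h3>
      <p>Company Name</p>
      <p>Example Employer <span>Part-time</span></p>
    </li>
  </ul>
</section>
"""

root = ET.fromstring(SAMPLE)

jobs = []
for li in root.iter("li"):
    # One li per job: the title is in h3, the employer in the second p.
    title = li.find("h3").text.strip()
    paragraphs = li.findall("p")
    # itertext() also collects the nested span; split/join normalizes spaces.
    employer = " ".join("".join(paragraphs[1].itertext()).split())
    jobs.append((title, employer))

for title, employer in jobs:
    print(title, "|", employer)
```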

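Looking at the source, I noticed that the employer name, for example, can be identified by the tag "p". This produces a list that sometimes contains more than one item, so make sure you select the one you need: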

x = items[0].find_elements_by_tag_name("p")
print(x[0].text)
# "Company Name"
print(x[1].text)
# "Wagestream Full-time"
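
Picking the employer by position (x[1]) depends on the order of the "p" tags. A possible alternative is to match on the class from the question. A class-name locator expects a single class, which is probably why the lookup with the full space-separated value failed; a CSS selector can chain the classes instead. `css_for_classes` is a hypothetical helper, and the commented-out Selenium call assumes the same Selenium 3 style API used in the rest of this answer (newer Selenium releases use find_element(By.CSS_SELECTOR, ...) instead).

```python
def css_for_classes(tag, class_attr):
    """Build a CSS selector that requires every class in class_attr.

    Hypothetical helper: "a b c" on tag "p" becomes "p.a.b.c".
    """
    classes = class_attr.split()
    return tag + "".join("." + name for name in classes)


selector = css_for_classes("p", "pv-entity__secondary-title t-14 t-black t-normal")
print(selector)

# With a live driver (not run here), the lookup inside each li might be:
# employer = item.find_element_by_css_selector(selector).text
```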

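Finally, loop through the "li" elements, pull out the relevant elements, extract their strings, and print the information you want (or save it as rows in a CSV):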

# Loop through li list, extract each piece by tag name
for item in items:
    name_job = item.find_elements_by_tag_name("h3")
    name_emp = item.find_elements_by_tag_name("p")
    more = item.find_elements_by_tag_name("h4")
    job = name_job[0].text
    emp = name_emp[1].text
    # Each h4's text is "hidden label\nvalue"; keep only the value line
    yrs = more[0].text.split('\n')[1]
    loc = more[2].text.split('\n')[1]
    
    print(job)
    print(emp)
    print(yrs)
    print(loc)

# terminates the application
driver.quit()
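
The text above mentions saving the results as rows in a CSV, and the loop indexes straight into the split text. A minimal sketch of both steps follows, using only the standard library. `value_line` and the column names are hypothetical, and the sample strings only mimic the "label\nvalue" text that the h4 elements return; nothing here talks to a browser.

```python
import csv
import io


def value_line(text, default=""):
    """Return the line after the hidden label in 'label\\nvalue' text.

    Hypothetical helper: falls back to default when there is no second line,
    so a job without a location does not raise IndexError.
    """
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    return lines[1] if len(lines) > 1 else default


# Sample strings shaped like the .text of the h3, p and h4 elements.
scraped = [
    {
        "job": "Senior Software Engineer",
        "employer": "Wagestream Full-time",
        "dates": "Dates Employed\nApr 2021 – Present",
        "location": "Location\nLondon, England, United Kingdom",
    },
]

buffer = io.StringIO()  # swap for open("jobs.csv", "w", newline="") to write a file
writer = csv.writer(buffer)
writer.writerow(["job", "employer", "years", "location"])
for entry in scraped:
    writer.writerow([
        entry["job"],
        entry["employer"],
        value_line(entry["dates"]),
        value_line(entry["location"]),
    ])

print(buffer.getvalue())
```

In the scraping loop, the same writer.writerow call could take job, emp, yrs and loc in place of the sample dictionary.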

