Matching words with information from a web scraper

Problem description

First of all, sorry if this is in the wrong section. Since it isn't strictly a coding question, I didn't know where to put it.

My question is:

Let's say I created a web scraper that extracts all the information from a job posting website. The information looks like this:

Row 1 -  Company X , Computer engineer
Row 2 -  Company X , Civil engineer
Row 2 -  Company Y , Data Scientist
Row 3 -  Company Z , Data Analyst

I want to create something in Python, or even Excel if it's easier, that automatically flags a row or scores a company based on some predetermined words.

If "engineer" is the word in question, then the score would be:

Company X = 2 , Company Y = 0 , Company Z = 0

If you need any more detail, don't hesitate to ask. What am I supposed to search for online to find an answer? Would NLP or regex help me?

Thank you!

Tags: python, regex, web-scraping, nlp

Solution


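A regular expression is enough to solve your problem. First, you should clean up the scraped data so that its format stays consistent; then you can use a regular expression to extract the fields. Here is an example using your data: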

import re
from pprint import pprint

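# Named groups capture the row number, the company name (a single
# whitespace-free token), and the job title (the rest of the line).
REGEX = re.compile(r'Row (?P<row>\d+) *- *Company (?P<company>\S+) *, *(?P<profession>.*)')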

rows = [
    'Row 1 -  Company X , Computer engineer',
    'Row 2 -  Company X , Civil engineer',
    'Row 2 -  Company Y , Data Scientist',
    'Row 3 -  Company Z , Data Analyst'
]

found_data = []

for row in rows:
    found = REGEX.match(row)
    if found:
        found_data.append([
            found.group('row'),
            found.group('company'),
            found.group('profession')
        ])

pprint(found_data)
# Output:
# [['1', 'X', 'Computer engineer'],
#  ['2', 'X', 'Civil engineer'],
#  ['2', 'Y', 'Data Scientist'],
#  ['3', 'Z', 'Data Analyst']]
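
The example above only extracts the fields; it does not yet produce the per-company score asked for in the question. One way to get there, as a minimal sketch, is to count the rows per company whose job title contains the keyword as a whole word. The helper name score_companies and the choice of case-insensitive whole-word matching are assumptions made for illustration, not part of the original answer:

```python
import re

# Same pattern as in the extraction example above.
ROW_RE = re.compile(r'Row (?P<row>\d+) *- *Company (?P<company>\S+) *, *(?P<profession>.*)')


def score_companies(rows, keyword):
    """Return {company: number of rows whose job title contains keyword}.

    Hypothetical helper: matching is case-insensitive and on whole words,
    so 'engineer' does not match 'engineering'.
    """
    keyword_re = re.compile(r'\b' + re.escape(keyword) + r'\b', re.IGNORECASE)
    scores = {}
    for row in rows:
        found = ROW_RE.match(row)
        if not found:
            continue  # skip lines that do not follow the expected format
        company = 'Company ' + found.group('company')
        # Register every company, so those without a match still score 0.
        scores.setdefault(company, 0)
        if keyword_re.search(found.group('profession')):
            scores[company] += 1
    return scores


rows = [
    'Row 1 -  Company X , Computer engineer',
    'Row 2 -  Company X , Civil engineer',
    'Row 2 -  Company Y , Data Scientist',
    'Row 3 -  Company Z , Data Analyst',
]

scores = score_companies(rows, 'engineer')
print(scores)
# {'Company X': 2, 'Company Y': 0, 'Company Z': 0}
```

To score against several predetermined words, the same function could be called once per word and the results summed. For simple keyword counting like this, a regular expression is sufficient, and NLP tooling is probably only worth considering if the titles need stemming or synonym handling.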

