python - Matching words with information from a web scraper
Problem description
First of all, sorry if this is in the wrong section. Since it isn't strictly a coding question, I didn't know where to put it.
My question is:
Let's say I created a web scraper that extracts all the information from a job posting website. The information looks like this:
Row 1 - Company X , Computer engineer
Row 2 - Company X , Civil engineer
Row 2 - Company Y , Data Scientist
Row 3 - Company Z , Data Analyst
I want to create something in Python, or even Excel if it's easier, that automatically flags a row or scores a company based on some predetermined words.
If "engineer" is the word in question, then the score would be:
Company X = 2 , Company Y = 0 , Company Z = 0
If you need more detail, don't hesitate to ask. What am I supposed to search for online to find an answer? Would NLP or regex help me?
Thank you!
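Solution
Regular expressions are enough to solve your problem. First, clean up the scraped data so that every row follows the same format; then you can extract the fields with a regular expression. Here is an example using your data: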
import re
from pprint import pprint

# Named groups capture the row number, the company name, and the job title.
REGEX = re.compile(r'Row (?P<row>\d+) *- *Company (?P<company>\S+) *, *(?P<profession>.*)')

rows = [
    'Row 1 - Company X , Computer engineer',
    'Row 2 - Company X , Civil engineer',
    'Row 2 - Company Y , Data Scientist',
    'Row 3 - Company Z , Data Analyst'
]

found_data = []
for row in rows:
    found = REGEX.match(row)
    if found:  # skip rows that do not follow the expected format
        found_data.append([
            found.group('row'),
            found.group('company'),
            found.group('profession')
        ])

pprint(found_data)
[['1', 'X', 'Computer engineer'],
 ['2', 'X', 'Civil engineer'],
 ['2', 'Y', 'Data Scientist'],
 ['3', 'Z', 'Data Analyst']]
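The extraction above stops short of the per-company score the question asks for. One possible way to get there is to count, for each company, the job titles that contain a keyword. This is only a sketch: the helper name `score_companies` and its `keywords` parameter are made up for illustration, and it assumes records shaped like `found_data` (row, company, profession). Matching is case-insensitive and limited to whole words, which is a design choice rather than something the question specifies.

```python
import re
from collections import Counter


def score_companies(records, keywords):
    """Hypothetical helper: count, per company, the job titles
    that contain at least one of the given keywords."""
    # \b restricts matches to whole words, so "engineer" does not
    # match inside a longer word such as "engineering".
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b',
        re.IGNORECASE,
    )
    scores = Counter()
    for _row, company, profession in records:
        # Adding 0 still registers the company, so companies with
        # no matching title appear in the result with a score of 0.
        scores[company] += 1 if pattern.search(profession) else 0
    return dict(scores)


# Same records as produced by the regex extraction above.
found_data = [
    ['1', 'X', 'Computer engineer'],
    ['2', 'X', 'Civil engineer'],
    ['2', 'Y', 'Data Scientist'],
    ['3', 'Z', 'Data Analyst'],
]

scores = score_companies(found_data, ['engineer'])
print(scores)  # {'X': 2, 'Y': 0, 'Z': 0}
```

Passing several words, for example `['engineer', 'analyst']`, scores a title once if any of them appears. If you would rather weight keywords differently, a dictionary mapping each word to a weight would be a natural next step.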