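python - I scraped some reviews from a website, but I don't know how to put them into an Excel file. Can anyone help?
Problem description
I am extracting reviews and related information from a website, and I want to put them into an Excel file while keeping the information structured.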
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

url = 'website'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Review titles
for statements in soup.findAll("h3", {'class': "delta weight-bold half-margin-bottom"}):
    print(statements.text)
# Reviewer names
for names in soup.findAll("div", {'class': "epsilon weight-bold inline-block"}):
    print(names.text)
# Labels such as "Used the software for:" and the text that follows each one
for used_software in soup.findAll("span", {'class': "weight-semibold"}):
    print(used_software.text, used_software.next_sibling)
Solution
You can use pandas (this example uses Python 3; Python 2 needs a few small changes):
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.capterra.com/p/104588/RecTrac/#reviews'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Collect each kind of element into its own list
statements = [
    x.text.strip() for x in soup.findAll("h3", {'class': "delta weight-bold half-margin-bottom"})
]
print(statements)
names = [x.text.strip() for x in soup.findAll("div", {'class': "epsilon weight-bold inline-block"})]
print(names)
used_software = [x.text.strip() for x in soup.findAll("span", {'class': "weight-semibold"})]
used_software_sibling = [x.next_sibling for x in soup.findAll("span", {'class': "weight-semibold"})]
print(used_software)
print(used_software_sibling)

d = {
    'statements': statements,
    'names': names,
    'used_software': used_software,
    'sw_sibling': used_software_sibling,
}
# Wrapping each list in pd.Series lets columns of different lengths share one
# DataFrame; the shorter columns are padded with NaN.
df = pd.DataFrame.from_dict(dict([(k, pd.Series(v)) for k, v in d.items()]))
print(df)
df.to_csv('/tmp/out.csv', index=False)
The final print statement, print(df), displays:
statements names used_software sw_sibling
0 RecTrac is so close to being awesome! Verified Reviewer Used the software for: 6-12 months
1 Powerful software, but a steep learning curve ... Verified Reviewer Source: Capterra
2 Using this program for the last five years.... Michael B. Used the software for: 1-2 years
3 User-friendly membership management system--ea... Verified Reviewer Source: Capterra
4 Robust Software Verified Reviewer Used the software for: 2+ years
5 Very useful product, but could be more user fr... Kimberli D. Source: Capterra
6 Customer Service is great to work with. Brad B. Used the software for: 2+ years
7 NaN NaN Source: Capterra
8 NaN NaN Used the software for: 2+ years
9 NaN NaN Source: Capterra
10 NaN NaN Used the software for: 2+ years
11 NaN NaN Source: Capterra
12 NaN NaN Used the software for: 2+ years
13 NaN NaN Source: Capterra
And the .csv file contains:
$ cat /tmp/out.csv
statements,names,used_software,sw_sibling
RecTrac is so close to being awesome!,Verified Reviewer,Used the software for:, 6-12 months
"Powerful software, but a steep learning curve when coming from other systems",Verified Reviewer,Source:, Capterra
Using this program for the last five years....,Michael B.,Used the software for:, 1-2 years
User-friendly membership management system--easy to learn and use,Verified Reviewer,Source:, Capterra
Robust Software,Verified Reviewer,Used the software for:, 2+ years
"Very useful product, but could be more user friendly.",Kimberli D.,Source:, Capterra
Customer Service is great to work with.,Brad B.,Used the software for:, 2+ years
,,Source:, Capterra
,,Used the software for:, 2+ years
,,Source:, Capterra
,,Used the software for:, 2+ years
,,Source:, Capterra
,,Used the software for:, 2+ years
,,Source:, Capterra
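The empty leading fields in the last rows come from padding: the statements and names lists are shorter than the label lists, so pandas fills the gaps. As a rough illustration of the same padding idea without pandas, here is a minimal sketch using only the standard library. The helper name columns_to_csv and the sample data are hypothetical, not part of the original answer, and it pads with empty strings rather than NaN.

```python
import csv
import io
from itertools import zip_longest


def columns_to_csv(columns):
    """Return CSV text for a dict of lists, padding short columns with ''."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns.keys())  # header row from the dict keys
    # zip_longest keeps going until the longest column is exhausted
    for row in zip_longest(*columns.values(), fillvalue=""):
        writer.writerow(row)
    return buffer.getvalue()


# Illustrative data only, shaped like the scraped lists above
sample = {
    "statements": ["Robust Software"],
    "names": ["Verified Reviewer"],
    "used_software": ["Used the software for:", "Source:"],
    "sw_sibling": ["2+ years", "Capterra"],
}
csv_text = columns_to_csv(sample)
print(csv_text)
```

The second data row starts with two empty fields, just like the trailing rows in the output above.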
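Here is an update for the OP's example from the comments (that's how much I love you, @y.emond):
This is a quick and dirty way to get the output you want; there may be a better way.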
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.capterra.com/p/104588/RecTrac/#reviews'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")


def add_skips(lst):
    """Return a copy of lst with a NaN inserted after every item."""
    skipped_statements = []
    for item in lst:
        skipped_statements.append(item)
        skipped_statements.append(float('nan'))
    return skipped_statements


statements = [
    x.text.strip() for x in soup.findAll("h3", {'class': "delta weight-bold half-margin-bottom"})
]
statements = add_skips(statements)
names = [x.text.strip() for x in soup.findAll("div", {'class': "epsilon weight-bold inline-block"})]
names = add_skips(names)
used_software = [x.text.strip() for x in soup.findAll("span", {'class': "weight-semibold"})]
used_software_sibling = [x.next_sibling for x in soup.findAll("span", {'class': "weight-semibold"})]

d = {
    'statements': statements,
    'names': names,
    'used_software': used_software,
    'sw_sibling': used_software_sibling,
}
df = pd.DataFrame.from_dict(dict([(k, pd.Series(v)) for k, v in d.items()]))
print(df)
df.to_csv('/tmp/out.csv', index=False)
Output:
statements names used_software sw_sibling
0 RecTrac is so close to being awesome! Verified Reviewer Used the software for: 6-12 months
1 NaN NaN Source: Capterra
2 Powerful software, but a steep learning curve ... Verified Reviewer Used the software for: 1-2 years
3 NaN NaN Source: Capterra
4 Using this program for the last five years.... Michael B. Used the software for: 2+ years
5 NaN NaN Source: Capterra
6 User-friendly membership management system--ea... Verified Reviewer Used the software for: 2+ years
7 NaN NaN Source: Capterra
8 Robust Software Verified Reviewer Used the software for: 2+ years
9 NaN NaN Source: Capterra
10 Very useful product, but could be more user fr... Kimberli D. Used the software for: 2+ years
11 NaN NaN Source: Capterra
12 Customer Service is great to work with. Brad B. Used the software for: 2+ years
13 NaN NaN Source: Capterra
All NaN values appear as empty cells when the file is opened in Excel/LibreOffice.
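If the NaN filler rows are not wanted, another option is to fold each review's label/value pairs into columns so that every review occupies a single row. The sketch below is only an illustration under a stated assumption: every review contributes exactly two label/value pairs, in order. The function name one_row_per_review and the sample data are hypothetical.

```python
def one_row_per_review(statements, names, labels, values):
    """Combine the scraped lists into one dict per review.

    Assumes each review has exactly two label/value pairs, in order
    (for example "Used the software for:" and "Source:").
    """
    rows = []
    for i, (statement, name) in enumerate(zip(statements, names)):
        row = {"statement": statement, "name": name}
        pair = slice(2 * i, 2 * i + 2)  # the two pairs belonging to review i
        for label, value in zip(labels[pair], values[pair]):
            # "Used the software for:" becomes the column "Used the software for"
            row[label.strip().rstrip(":")] = str(value).strip()
        rows.append(row)
    return rows


# Illustrative data only, shaped like the scraped lists above
rows = one_row_per_review(
    ["Robust Software"],
    ["Verified Reviewer"],
    ["Used the software for:", "Source:"],
    [" 2+ years", " Capterra"],
)
print(rows)
```

A list of dicts like this can be passed straight to pd.DataFrame(rows). To get a real Excel workbook instead of a CSV, df.to_excel('reviews.xlsx', index=False) should work, assuming an Excel writer engine such as openpyxl is installed; the file name here is just an example.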