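python - Python: Converting a large text file to a DataFrame using formatting
Problem description
I built a web scraper to compile lists, such as playlist information from Spotify, job descriptions from Indeed, or company lists from LinkedIn. I now have large text files that I want to format into a DataFrame by converting them to CSV or a dictionary.
Text file: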
Scribd
MobileQAEngineer
VitaminT
MobileQAEngineer
Welocalize
MobileQAEngineer
RWSMoravia
MobileQAEngineer
Desired output:
Scribd,MobileQAEngineer
VitaminT,MobileQAEngineer
Welocalize,MobileQAEngineer
RWSMoravia,MobileQAEngineer
I thought I could try something like the following:
if line of text does not have 4 \n afterwards
    then it is the 1st element of the tuple
if line of text has 4 \n afterwards
    then it is the 2nd element of the tuple
with open(input("Enter a file to read: "), 'r') as f:
    for line in f:
        newline = line + ":"
        #f.write(newline)
        print(newline)
When I tried to put a ":" at the end of each line, I ended up with a ":" before and after each line instead:
:
Scribd
:
MobileQAEngineer
:
:
VitaminT
:
MobileQAEngineer
:
:
Welocalize
:
MobileQAEngineer
:
:
RWSMoravia
:
MobileQAEngineer
:
Solution
You can parse the data with a regex and then convert it to a DataFrame:
import re
import pandas as pd

with open('data.txt', 'r') as f:
    data = f.read()

# Each match is a (company, position) pair taken from two consecutive lines
m = re.findall(r'(\w+)\n(\w+)', data)
d = {'Company': [c[0] for c in m], 'Position': [c[1] for c in m]}
df = pd.DataFrame(data=d)
print(df)
Output:
      Company          Position
0      Scribd  MobileQAEngineer
1    VitaminT  MobileQAEngineer
2  Welocalize  MobileQAEngineer
3  RWSMoravia  MobileQAEngineer
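As a side note, the stray ":" lines in the question appear because every line read from a file still carries its trailing "\n", so appending ":" places the colon on the next line. A minimal sketch of a regex-free alternative follows, assuming the file simply alternates company and position lines with optional blank lines in between. The helper name `pair_lines` and the inline `sample` text are hypothetical, used only for illustration. It strips each line, pairs them up, and writes the comma-separated form shown under the desired output:

```python
import csv
import io


def pair_lines(text):
    """Group the non-blank lines of text into (company, position) pairs."""
    # strip() removes the trailing "\n" from each line; blank lines are skipped
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Pair even-indexed lines (companies) with odd-indexed lines (positions)
    return list(zip(lines[0::2], lines[1::2]))


# Hypothetical sample standing in for the contents of the scraped text file
sample = "\nScribd\nMobileQAEngineer\n\nVitaminT\nMobileQAEngineer\n"
pairs = pair_lines(sample)

# Write the pairs as CSV into an in-memory buffer (a real file works the same way)
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(pairs)
csv_text = buffer.getvalue()
print(csv_text)
```

The same list of pairs could also be passed to `pd.DataFrame(pairs, columns=['Company', 'Position'])` if a DataFrame is preferred over a CSV file.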