python - 我想将以下文本文件转换为标题为列且相应数据为行的文件
问题描述
下载后,我拥有的所有数据都在文本文件 (.txt) 中。我有多个像这样的文件(至少 70000 个),我想我会先使用 cmd 中的 copy.txt 将它们合并到一个主文本文件中。然后我想分析 mastertext 文件,因此理想的输出是 .csv 文件。
出于分析目的,需要将数据转换为类似结构的表格。每个文件(在我拥有的 70000 个 .txt 文件中)都以“[”开头并以“]”结尾,每个新数据行都以“{”开头并以“}”结尾,我有多行数据相同的格式。我希望转换后的数据具有诸如“campaignId”之类的标题,以便我可以使用数据透视表分析列(就像您在 excel 中所做的那样)。我是编程新手,不确定 Python 中的哪些库可以帮助我实现目标。
这是我拥有的数据的示例。(注意 - 这些只是一个文件中的两行,我有大约 70000 个这样的文件)
{
"campaignId": "all",
"startDate": "2020-06-11",
"endDate": "2020-06-11",
"device": "Computers",
"network": "Display Network",
"channel": "all",
"accLevelQS": -1.0,
"impressions": 389,
"clicks": 3,
"ctr": 0.0,
"avgCPC": 0.0,
"convValuePerClick": 0.0,
"convValuePerCost": 0.0,
"costConv1PerClick": 0.0,
"convRate1PerClick": 0.0,
"cost": 0.142884,
"conv1PerClick": 0.0,
"totalConvValue": 0.00,
"allConversions": 0.0,
"allConversionValue": 0.00,
"avgPosition": 0.0,
"intr": 3,
"searchImprShare": 0.0,
"contImprShare": 5.0,
"impressionShare": 5.0
},
{
"campaignId": "all",
"startDate": "2020-06-11",
"endDate": "2020-06-11",
"device": "Mobile devices with full browsers",
"network": "Display Network",
"channel": "all",
"accLevelQS": -1.0,
"impressions": 6101,
"clicks": 90,
"ctr": 0.0,
"avgCPC": 0.0,
"convValuePerClick": 0.0,
"convValuePerCost": 0.0,
"costConv1PerClick": 0.0,
"convRate1PerClick": 0.0,
"cost": 4.342799,
"conv1PerClick": 0.0,
"totalConvValue": 0.00,
"allConversions": 0.0,
"allConversionValue": 0.00,
"avgPosition": 0.0,
"intr": 90,
"searchImprShare": 0.0,
"contImprShare": 5.0077566465021217,
"impressionShare": 5.0077566465021217
}
解决方案
我做了一些假设,因为您提供的数据看起来像是 JSON 数据列表。这是因为您的条目之间有一个“,”。
这是我的input.json
文件。请注意,我在顶部和底部添加了 [ 和 ],因为这将为您的数据提供正确的 JSON 结构。
[
{
"campaignId": "all",
"startDate": "2020-06-11",
"endDate": "2020-06-11",
"device": "Computers",
"network": "Display Network",
"channel": "all",
"accLevelQS": -1.0,
"impressions": 389,
"clicks": 3,
"ctr": 0.0,
"avgCPC": 0.0,
"convValuePerClick": 0.0,
"convValuePerCost": 0.0,
"costConv1PerClick": 0.0,
"convRate1PerClick": 0.0,
"cost": 0.142884,
"conv1PerClick": 0.0,
"totalConvValue": 0.00,
"allConversions": 0.0,
"allConversionValue": 0.00,
"avgPosition": 0.0,
"intr": 3,
"searchImprShare": 0.0,
"contImprShare": 5.0,
"impressionShare": 5.0
},
{
"campaignId": "all",
"startDate": "2020-06-11",
"endDate": "2020-06-11",
"device": "Mobile devices with full browsers",
"network": "Display Network",
"channel": "all",
"accLevelQS": -1.0,
"impressions": 6101,
"clicks": 90,
"ctr": 0.0,
"avgCPC": 0.0,
"convValuePerClick": 0.0,
"convValuePerCost": 0.0,
"costConv1PerClick": 0.0,
"convRate1PerClick": 0.0,
"cost": 4.342799,
"conv1PerClick": 0.0,
"totalConvValue": 0.00,
"allConversions": 0.0,
"allConversionValue": 0.00,
"avgPosition": 0.0,
"intr": 90,
"searchImprShare": 0.0,
"contImprShare": 5.0077566465021217,
"impressionShare": 5.0077566465021217
}
]
这是使用该pandas
库将数据处理成数据帧然后将其写入 CSV 文件的代码。
import json # Available by default, no install required.
import glob # Available by default, no install required.
import pandas as pd # Requires installation via pip.
# Initialise a list to store our results.
combined_json = []
# Set a glob pattern to *.txt since your files are txt files.
# You can also write the full path e.g. /home/user/textfiles/*.txt
text_files = glob.glob("*.txt")
# Loop through all the text files and combine them into a single JSON list.
# As for 70,000 files, I am unsure how the performance will turn out.
for json_text in text_files:
with open(json_text, 'r') as text_file:
combined_json.extend(json.load(text_file))
# Write all the files to a JSON file. For your future usage.
# You also can read directly from the combed_json variable.
with open('input.json', 'w') as json_file:
json.dump(combined_json, json_file, indent=2)
# Convert the JSON data into a dataframe, using the combined_json variable.
json_df = pd.json_normalize(combined_json)
# Write the data from the dataframe to the CSV file.
# Mode "w" will always overwrite the CSV file, use mode "a" to append text instead of overwriting.
json_df.to_csv("dataframe.csv", mode="w")
有关如何pd.json_normalize
工作的更多信息,您可以参考此处。
要开始使用该pandas
库,您可以参考此处。
如果您希望从字符串而不是文件加载 JSON,可以参考此处。
想了解更多glob
,可以参考这里。
推荐阅读
- node.js - 使用 Node.js 列出 Google Cloud Storage Bucket 中的文件夹
- java - java 8获取最大出现次数相等的元素
- android - 何时通过 ContentProvider 和 ContactsContract 设置联系人图像失败
- php - 查询不再工作
- java - 其他活动中导航抽屉中工具栏下方的活动
- algorithm - 比较 n^a * log(n)^b 和 n^c*log(n)^d 的运行时复杂度
- php - 重写时的数据库查询
- javascript - 如何选择每个
- 使用 Jquery 或 Javascript 依次单击每个按钮
- c++ - 带有 T = char* 的 `const T` 不是 `const char*`?
- java - 将逗号分隔的字符串转换为最后一个逗号的列表