python-3.x - How do I extract the contents of multiple text files into a pandas DataFrame using Python?
Problem description
I have 2 text files with the following contents:
/*foo1.txt*/
Number of data records: 1000
Number of attributes: 231
Class attribute index: 231
Monotonic Transformation: None
Number of class labels: 10
Number of folds: 10
Test fold: 1
Random seed: 0
(Dis)similarity measure: Test_SVM
Task: SVMi
Number of bins (b): 10
Histogram type: EF
Number of trees (T): 0 (For tree-based methods.)
Sample size (W): 0 (For tree-based methods.)
Running Experiment... Please wait...
#Atts. considered as irrelevant: 0
Data size: 900; Query size: 100
Dimensionality of the space: 230
... using Test SVM for SVM ...
... Equal Frequency discretisation (b=10) ...
Max. num. of bins: 10, Min. num. of bins: 10
SVM Classification accuarcy scores (C=0.1): 0.5300
SVM Classification accuarcy scores (C=0.5): 0.6300
SVM Classification accuarcy scores (C=10): 0.7300
SVM Classification accuarcy scores (C=100): 0.7300
Done!
Total runtime: 6.8169 second.
/*foo2.txt*/
Number of data records: 1000
Number of attributes: 231
Class attribute index: 231
Monotonic Transformation: None
Number of class labels: 10
Number of folds: 10
Test fold: 1
Random seed: 0
(Dis)similarity measure: Test_SVM
Task: SVM
Number of bins (b): 30
Histogram type: EF
Number of trees (T): 0 (For tree-based methods.)
Sample size (W): 0 (For tree-based methods.)
Running Experiment... Please wait...
#Atts. considered as irrelevant: 0
Data size: 900; Query size: 100
Dimensionality of the space: 230
... using Test SVM for SVM ...
... Equal Frequency discretisation (b=30) ...
Max. num. of bins: 30, Min. num. of bins: 30
SVM Classification accuarcy scores (C=0.1): 0.6600
SVM Classification accuarcy scores (C=0.5): 0.7400
SVM Classification accuarcy scores (C=10): 0.8000
SVM Classification accuarcy scores (C=100): 0.8000
Done!
Total runtime: 8.2947 second.
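The goal is to extract the contents of both text files (foo1.txt and foo2.txt) into a single pandas DataFrame, df.
How can I get these values into the DataFrame?
Edit: the text in my actual .txt files is structured differently, so I have edited the question to reflect the data in the actual files.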
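Solution
Update (based on the text files you shared in the comments and chat)
Use one regular expression to extract the relevant parts of each file's text, then use a second regular expression to find all column-value pairs and map them into a dictionary that forms one record per file. Note: I assume data is the folder containing the text files; replace it with your actual folder.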
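import re
from pathlib import Path

import pandas as pd

def read_files():
    # 'data' is the assumed folder holding the .txt files; one record per file.
    for file in Path('data').glob('*.txt'):
        data = file.read_text()
        # group(1): header lines before "Running Exp"; group(2): SVM score lines before "Done!".
        m = re.search(r'(.*?)Running Exp.*?(?=SVM Class)(.*?)Done!', data, re.DOTALL)
        # (column, value) pairs; a trailing "(...)" note after the value is dropped.
        # \s also matches newlines, so a line starting with "(" is swallowed by the previous match.
        c = re.findall(r'^(.*?)\s*:\s*(.*?)\s*(?:\(|$)', m.group(1), re.MULTILINE)
        yield {**dict(c), 'Results': m.group(2).strip()}

df = pd.DataFrame(read_files())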
|   | Number of data records | Number of attributes | Class attribute index | Monotonic Transformation | Number of class labels | Number of folds | Test fold | Random seed | Task | Number of bins (b) | Histogram type | Number of trees (T) | Sample size (W) | Results |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | 231 | 231 | None | 10 | 10 | 1 | 0 | SVM | 30 | EF | 0 | 0 | SVM Classification accuarcy scores (C=0.1): 0.6600\nSVM Classification accuarcy scores (C=0.5): 0.7400\nSVM Classification accuarcy scores (C=10): 0.8000\nSVM Classification accuarcy scores (C=100): 0.8000 |
| 1 | 1000 | 231 | 231 | None | 10 | 10 | 1 | 0 | SVMi | 10 | EF | 0 | 0 | SVM Classification accuarcy scores (C=0.1): 0.5300\nSVM Classification accuarcy scores (C=0.5): 0.6300\nSVM Classification accuarcy scores (C=10): 0.7300\nSVM Classification accuarcy scores (C=100): 0.7300 |
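To try the two patterns without any files on disk, here is a minimal sketch that applies them to an in-memory string. SAMPLE is an abbreviated stand-in for foo1.txt, and parse_report, the SCORES pattern, and the 'accuracy (C=...)' column names are hypothetical additions for illustration, not part of the answer above. As an optional extra step, the sketch splits the score lines into separate numeric columns rather than keeping them as one Results string.

```python
import re

import pandas as pd

# Abbreviated sample mirroring the layout of foo1.txt (only a few lines kept).
SAMPLE = """\
Number of data records: 1000
Task: SVMi
Number of bins (b): 10
Number of trees (T): 0 (For tree-based methods.)
Running Experiment... Please wait...
Data size: 900; Query size: 100
SVM Classification accuarcy scores (C=0.1): 0.5300
SVM Classification accuarcy scores (C=10): 0.7300
Done!
Total runtime: 6.8169 second.
"""

# Same two patterns as in the answer, precompiled.
SECTIONS = re.compile(r'(.*?)Running Exp.*?(?=SVM Class)(.*?)Done!', re.DOTALL)
PAIRS = re.compile(r'^(.*?)\s*:\s*(.*?)\s*(?:\(|$)', re.MULTILINE)
# Hypothetical extra pattern: captures the C value and its score from each result line.
SCORES = re.compile(r'\(C=([^)]+)\):\s*([0-9.]+)')


def parse_report(text):
    """Return one record (dict) for a single report's text."""
    m = SECTIONS.search(text)
    if m is None:
        raise ValueError("text does not match the expected report layout")
    # Header block -> {column name: value as string}
    record = dict(PAIRS.findall(m.group(1)))
    # Score block -> one float column per C value (assumed column naming)
    for c, score in SCORES.findall(m.group(2)):
        record[f'accuracy (C={c})'] = float(score)
    return record


record = parse_report(SAMPLE)
df_demo = pd.DataFrame([record])
print(df_demo.T)  # transposed so the single record prints as a column
```

With real files, the same function could be called on file.read_text() for each path; the header values stay strings here, so a conversion such as pd.to_numeric would still be needed for numeric work on those columns.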