How to extract the contents of multiple text files into a pandas DataFrame using Python?

Question

I have 2 text files with the following contents:


/*foo1.txt*/

Number of data records: 1000
Number of attributes: 231
Class attribute index: 231
Monotonic Transformation: None
Number of class labels: 10
Number of folds: 10
Test fold: 1
Random seed: 0
(Dis)similarity measure: Test_SVM
Task: SVMi
Number of bins (b): 10
Histogram type: EF
Number of trees (T): 0 (For tree-based methods.)
Sample size (W): 0 (For tree-based methods.)
Running Experiment... Please wait...
     #Atts. considered as irrelevant: 0
     Data size: 900; Query size: 100
     Dimensionality of the space: 230
     ... using Test SVM for SVM ...
     ... Equal Frequency discretisation (b=10) ...
     Max. num. of bins: 10, Min. num. of bins: 10
SVM Classification accuarcy scores (C=0.1): 0.5300
SVM Classification accuarcy scores (C=0.5): 0.6300
SVM Classification accuarcy scores (C=10): 0.7300
SVM Classification accuarcy scores (C=100): 0.7300
Done!
Total runtime: 6.8169 second.


/*foo2.txt*/

Number of data records: 1000
Number of attributes: 231
Class attribute index: 231
Monotonic Transformation: None
Number of class labels: 10
Number of folds: 10
Test fold: 1
Random seed: 0
(Dis)similarity measure: Test_SVM
Task: SVM
Number of bins (b): 30
Histogram type: EF
Number of trees (T): 0 (For tree-based methods.)
Sample size (W): 0 (For tree-based methods.)
Running Experiment... Please wait...
     #Atts. considered as irrelevant: 0
     Data size: 900; Query size: 100
     Dimensionality of the space: 230
     ... using Test SVM for SVM ...
     ... Equal Frequency discretisation (b=30) ...
     Max. num. of bins: 30, Min. num. of bins: 30
SVM Classification accuarcy scores (C=0.1): 0.6600
SVM Classification accuarcy scores (C=0.5): 0.7400
SVM Classification accuarcy scores (C=10): 0.8000
SVM Classification accuarcy scores (C=100): 0.8000
Done!
Total runtime: 8.2947 second.

The goal is to extract the contents of the two text files (foo1.txt and foo2.txt) into a pandas DataFrame, df, that should look like this:

(Table: image of the expected data frame, not preserved.)

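How do I get the values into the data frame shown above?

Edit: the text in my actual txt files is structured differently, so I have edited the question to reflect the data in the actual files.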

Tags: python-3.x, pandas, dataframe

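Solution


Update (based on the text files you shared in the comments and discussion)

Use one regular expression to extract the relevant parts of each file's text, then use a second regular expression to find all column-value pairs in the header part and map those pairs to a dictionary, which becomes one record. Note: I assume that data is the folder containing the text files; replace it with your actual folder.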

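import re
from pathlib import Path

import pandas as pd

def read_files():
    # 'data' is the folder that holds the text files; replace it with your own.
    for file in Path('data').glob('*.txt'):
        data = file.read_text()
        # Group 1: the header before "Running Exp..."; group 2: the SVM score lines before "Done!".
        m = re.search(r'(.*?)Running Exp.*?(?=SVM Class)(.*?)Done!', data, re.DOTALL)
        if m is None:
            continue  # skip files that do not follow the expected layout
        # Collect "name: value" pairs, dropping any trailing parenthesised note.
        c = re.findall(r'^(.*?)\s*:\s*(.*?)\s*(?:\(|$)', m.group(1), re.MULTILINE)
        yield {**dict(c), 'Results': m.group(2).strip()}

df = pd.DataFrame(read_files())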

  Number of data records Number of attributes Class attribute index Monotonic Transformation Number of class labels Number of folds Test fold Random seed  Task Number of bins (b) Histogram type Number of trees (T) Sample size (W)                                                                                                                                                                                                        Results
0                   1000                  231                   231                     None                     10              10         1           0   SVM                 30             EF                   0               0  SVM Classification accuarcy scores (C=0.1): 0.6600\nSVM Classification accuarcy scores (C=0.5): 0.7400\nSVM Classification accuarcy scores (C=10): 0.8000\nSVM Classification accuarcy scores (C=100): 0.8000
1                   1000                  231                   231                     None                     10              10         1           0  SVMi                 10             EF                   0               0  SVM Classification accuarcy scores (C=0.1): 0.5300\nSVM Classification accuarcy scores (C=0.5): 0.6300\nSVM Classification accuarcy scores (C=10): 0.7300\nSVM Classification accuarcy scores (C=100): 0.7300
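
If one numeric column per C value is more useful than a single multi-line Results string, the same two patterns can be combined with a third one that picks apart the score lines. The following is only a sketch under assumptions: parse_report, SAMPLE, and the "C=<value>" column names are hypothetical names introduced here, and SAMPLE is a shortened excerpt of foo1.txt from the question so that the snippet runs without any files on disk.

```python
import re

import pandas as pd

# Shortened excerpt of foo1.txt from the question (used here instead of a file).
SAMPLE = """\
Number of data records: 1000
Task: SVMi
Number of trees (T): 0 (For tree-based methods.)
Running Experiment... Please wait...
     Data size: 900; Query size: 100
SVM Classification accuarcy scores (C=0.1): 0.5300
SVM Classification accuarcy scores (C=10): 0.7300
Done!
Total runtime: 6.8169 second.
"""


def parse_report(text):
    """Return one flat record (hypothetical helper): header fields plus one float per C value."""
    # Same outer pattern as in the answer: header part, then the score lines.
    m = re.search(r'(.*?)Running Exp.*?(?=SVM Class)(.*?)Done!', text, re.DOTALL)
    if m is None:
        return None  # text does not follow the expected layout
    # Same "name: value" pattern as in the answer; values stay strings.
    record = dict(re.findall(r'^(.*?)\s*:\s*(.*?)\s*(?:\(|$)', m.group(1), re.MULTILINE))
    # Assumed extra step: turn each "(C=<c>): <score>" line into its own numeric column.
    for c, score in re.findall(r'\(C=([\d.]+)\):\s*([\d.]+)', m.group(2)):
        record[f'C={c}'] = float(score)
    return record


df = pd.DataFrame([parse_report(SAMPLE)])
print(df.T)  # transposed so the single record is easier to read
```

To apply this to real files, the loop from the answer could call parse_report(file.read_text()) for each path and skip any None results. The header values remain strings, so a conversion such as pd.to_numeric may still be wanted for those columns.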
