Regular expressions for data preparation and subsequent processing in Python

Problem description

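I have a fairly large data file that is not in good enough shape for further processing. I therefore want to run a regular expression over it and then process the extracted data in pandas for further analysis.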

The Data-Information section repeats throughout the file and contains the information I need.

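So far, my regex extracts some of the header information. What I am still missing are all three data-point parts. I only need everything from the Points header line down to the last data point. How can I capture these parts as several groups, or as one group?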

^(?:Data-Information.*)
(?:\nName:\t+)(?P<Name>.+)
(?:\nSample:\t+)(?P<Sample>.+)
((?:\r?\n.+)+)
(?:\nSystem:\t+)(?P<System>.+)
(?:\r?\n(?!Data-Information).*)*

Example file

Data-Information
Name:           Polymer A
Sample:     Sunday till Monday
User:           SUD
Count Segments:         5
Application:            RHEOSTAR
Tool:           CP
Date/Time:          24.10.2021; 13:37
System:         CP25

Constants:
- Csr [min/s]:          2,5421
- Css [Pa/mNm]:         2,54679

Section:            1
Number measuring points:            0

Time limit:         2 measuring points, drop
            Duration 30 s
Measurement profile:
  Temperature           T[-1] = 25 °C

Section:            2
Number measuring points:            30

Time limit:         30 measuring points
            Duration 2 s

Points  Time    Viscosity   Shear rate  Shear stress    Momentum    Status
    [s] [Pa·s]  [1/s]   [Pa]    [mNm]   []
1   62  10,93   100 1.090   4,45    TGC,Dy_
2   64  11,05   100 1.100   4,5 TGC,Dy_
3   66  11,07   100 1.110   4,51    TGC,Dy_
4   68  11,05   100 1.100   4,5 TGC,Dy_
5   70  10,99   100 1.100   4,47    TGC,Dy_
6   72  10,92   100 1.090   4,44    TGC,Dy_


Section:            3
Number measuring points:            0

Time limit:         2 measuring points, drop
            Duration 60 s

Section:            4
Number measuring points:            30

Time limit:         30 measuring points
            Duration 2 s

Points  Time    Viscosity   Shear rate  Shear stress    Momentum    Status
    [s] [Pa·s]  [1/s]   [Pa]    [mNm]   []
*** 1 ***   242 -6,334E+6   -0,0000115  72,7    0,296   TGC,Dy_
2   244 63,94   10,3    661 2,69    TGC,Dy_
3   246 35,56   20,7    736 2,99    TGC,Dy_
4   248 25,25   31  784 3,19    TGC,Dy_
5   250 19,82   41,4    820 3,34    TGC,Dy_


Section:            5
Number measuring points:            300

Time limit:         300 measuring points
            Duration 1 s

Points  Time    Viscosity   Shear rate  Shear stress    Momentum    Status
    [s] [Pa·s]  [1/s]   [Pa]    [mNm]   []
1   301 4,142   300 1.240   5,06    TGC,Dy_
2   302 4,139   300 1.240   5,05    TGC,Dy_
3   303 4,138   300 1.240   5,05    TGC,Dy_
4   304 4,141   300 1.240   5,06    TGC,Dy_
5   305 4,156   300 1.250   5,07    TGC,Dy_
6   306 4,153   300 1.250   5,07    TGC,Dy_


Data-Information
Name:           Polymer B
Sample:     Monday till Tuesday
User:           SUD
Count Segments:         5
Application:            RHEOSTAR
Tool:           CP
Date/Time:          24.10.2021; 13:37
System:         CP25

Constants:
- Csr [min/s]:          2,5421
- Css [Pa/mNm]:         2,54679

Section:            1
Number measuring points:            0

Time limit:         2 measuring points, drop
            Duration 30 s
Measurement profile:
  Temperature           T[-1] = 25 °C

Section:            2
Number measuring points:            30

Time limit:         30 measuring points
            Duration 2 s

Points  Time    Viscosity   Shear rate  Shear stress    Momentum    Status
    [s] [Pa·s]  [1/s]   [Pa]    [mNm]   []
1   62  10,93   100 1.090   4,45    TGC,Dy_
2   64  11,05   100 1.100   4,5 TGC,Dy_
3   66  11,07   100 1.110   4,51    TGC,Dy_
4   68  11,05   100 1.100   4,5 TGC,Dy_
5   70  10,99   100 1.100   4,47    TGC,Dy_
6   72  10,92   100 1.090   4,44    TGC,Dy_


Section:            3
Number measuring points:            0

Time limit:         2 measuring points, drop
            Duration 60 s

Section:            4
Number measuring points:            30

Time limit:         30 measuring points
            Duration 2 s

Points  Time    Viscosity   Shear rate  Shear stress    Momentum    Status
    [s] [Pa·s]  [1/s]   [Pa]    [mNm]   []
*** 1 ***   242 -6,334E+6   -0,0000115  72,7    0,296   TGC,Dy_
2   244 63,94   10,3    661 2,69    TGC,Dy_
3   246 35,56   20,7    736 2,99    TGC,Dy_
4   248 25,25   31  784 3,19    TGC,Dy_
5   250 19,82   41,4    820 3,34    TGC,Dy_


Section:            5
Number measuring points:            300

Time limit:         300 measuring points
            Duration 1 s

Points  Time    Viscosity   Shear rate  Shear stress    Momentum    Status
    [s] [Pa·s]  [1/s]   [Pa]    [mNm]   []
1   301 4,142   300 1.240   5,06    TGC,Dy_
2   302 4,139   300 1.240   5,05    TGC,Dy_
3   303 4,138   300 1.240   5,05    TGC,Dy_
4   304 4,141   300 1.240   5,06    TGC,Dy_
5   305 4,156   300 1.250   5,07    TGC,Dy_
6   306 4,153   300 1.250   5,07    TGC,Dy_

Tags: python, regex, pandas

Solution


One option is to do it in two steps.

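First, get all Data-Information sections with a pattern that starts at a Data-Information line and then matches every following line that is not itself a Data-Information line. The pattern relies on ^ and $ matching at line boundaries, so multiline mode must be enabled (re.MULTILINE in Python).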

^Data-Information(?:\n(?!Data-Information$).*)*
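As a minimal sketch of this first step, the pattern can be applied with re.finditer. The text variable is a shortened, hypothetical stand-in for the example file, and it assumes tab-separated fields, as the \t+ in the question's own pattern suggests. The name_re helper is only an illustration of pulling one header field out of each section.

```python
import re

# Shortened stand-in for the example file (assumption: fields are tab-separated).
text = (
    "Data-Information\n"
    "Name:\t\tPolymer A\n"
    "System:\t\tCP25\n"
    "\n"
    "Data-Information\n"
    "Name:\t\tPolymer B\n"
    "System:\t\tCP25\n"
)

# Step 1: one match per Data-Information section.
# re.MULTILINE lets ^ and $ match at the start and end of each line.
section_re = re.compile(
    r"^Data-Information(?:\n(?!Data-Information$).*)*", re.MULTILINE
)

# Illustrative helper pattern: the value that follows "Name:" in a section.
name_re = re.compile(r"^Name:\t+(.+)", re.MULTILINE)

sections = [m.group(0) for m in section_re.finditer(text)]
names = [name_re.search(section).group(1) for section in sections]

print(len(sections))
print(names)
```

Each element of sections is the full text of one section, so the Points pattern from the second step can be run on it separately.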


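Then, within each section, match the line that starts with Points, followed by all subsequent lines that contain at least one character. The first empty line ends the match.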

^Points\b.*(?:\n.+)+
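The second step might look like the sketch below. The section string is a shortened, hypothetical stand-in for one Data-Information section, again assuming tab-separated columns, and parse_block is an illustrative helper rather than part of the original answer.

```python
import re

# Shortened stand-in for one Data-Information section (assumption: tab-separated).
section = (
    "Section:\t\t2\n"
    "Number measuring points:\t\t30\n"
    "\n"
    "Points\tTime\tViscosity\n"
    "\t[s]\t[Pa·s]\n"
    "1\t62\t10,93\n"
    "2\t64\t11,05\n"
    "\n"
    "\n"
    "Section:\t\t4\n"
    "\n"
    "Points\tTime\tViscosity\n"
    "\t[s]\t[Pa·s]\n"
    "*** 1 ***\t242\t-6,334E+6\n"
    "2\t244\t63,94\n"
)

# Step 2: the Points header line plus every following non-empty line.
points_re = re.compile(r"^Points\b.*(?:\n.+)+", re.MULTILINE)


def parse_block(block):
    """Split one Points block into (column names, data rows)."""
    lines = block.split("\n")
    header = lines[0].split("\t")
    # lines[1] is the units row ([s], [Pa·s], ...) and is skipped here.
    rows = [line.split("\t") for line in lines[2:]]
    return header, rows


blocks = [parse_block(m.group(0)) for m in points_re.finditer(section)]

for header, rows in blocks:
    print(header, rows)
```

The values are still strings with decimal commas (for example 10,93), and the first row of one block carries the marker *** 1 ***, so some cleanup is needed before numeric analysis. If each block is handed to pandas.read_csv through io.StringIO, its sep and decimal parameters can take care of the tab separator and the decimal comma.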
