Reading a large text file in Python is much slower than reading the same file with equivalent Matlab code. Any idea why?

Problem description

I have the following Matlab code for reading a text file. The file is in XML format, but I read it as plain text:

    function [jointAngleData, PositionData, AccelerationData, OrientationData, ...
            AngularVelocityData, AngularAccelerationData, TimeStamps] = ...
            getDatafromMVNX(file, eliminate_samples)
        fid = fopen(file);
        currentline = fgetl(fid);
        jointAngleData = [];
        PositionData = [];
        AccelerationData = [];
        OrientationData = [];
        AngularVelocityData = [];
        AngularAccelerationData = [];
        while ischar(currentline)

            if (contains(currentline,'<jointAngle>'))
                [data,~] = strsplit(currentline,'<\D*>','DelimiterType', 'RegularExpression');
                currentlinedata = str2num(data{2}); %#ok<*ST2NM>
                jointAngleData = [jointAngleData ; currentlinedata];  %#ok<*AGROW>
            end
            if (contains(currentline,'<position>'))
                [data,~] = strsplit(currentline,'<\D*>','DelimiterType', 'RegularExpression');
                currentlinedata = str2num(data{2});
                PositionData = [PositionData ; currentlinedata];
            end

            if (contains(currentline,'<acceleration>'))
                [data,~] = strsplit(currentline,'<\D*>','DelimiterType', 'RegularExpression');
                currentlinedata = str2num(data{2});
                AccelerationData = [AccelerationData ; currentlinedata];
            end

            if (contains(currentline,'<orientation>'))
                [data,~] = strsplit(currentline,'<\D*>','DelimiterType', 'RegularExpression');
                currentlinedata = str2num(data{2});
                OrientationData = [OrientationData ; currentlinedata];
            end
            if (contains(currentline,'<angularVelocity>'))
                [data,~] = strsplit(currentline,'<\D*>','DelimiterType', 'RegularExpression');
                currentlinedata = str2num(data{2});
                AngularVelocityData = [AngularVelocityData ; currentlinedata];
            end

            if (contains(currentline,'<angularAcceleration>'))
                [data,~] = strsplit(currentline,'<\D*>','DelimiterType', 'RegularExpression');
                currentlinedata = str2num(data{2});
                AngularAccelerationData = [AngularAccelerationData ; currentlinedata];
            end
            currentline = fgetl(fid);

        end
        Data_ends = size(jointAngleData,1) - eliminate_samples;
        jointAngleData = jointAngleData(1:Data_ends,:);
        AccelerationData = AccelerationData(1:Data_ends,:);
        OrientationData = OrientationData(4:Data_ends+3,:);
        PositionData = PositionData(4:Data_ends+3,:);
        AngularVelocityData = AngularVelocityData(1:Data_ends,:);
        AngularAccelerationData = AngularAccelerationData(1:Data_ends,:);
        TimeStamps = size(OrientationData,1);
    end

For the same task, I wrote the following code in Python:

    import timeit

    import numpy as np
    import pandas as pd


    def _read_feature_text(line):
        # Take the text between the opening tag and the closing tag.
        start = line.find('>') + 1
        lend = line.find('</')
        workingportion = line[start:lend]
        # Parse the space-separated numbers into a one-row DataFrame.
        return pd.DataFrame([np.fromstring(workingportion, sep=' ')])


    def read_mvnx(mvnxfile):
        orientation = pd.DataFrame()
        positions = pd.DataFrame()
        velocities = pd.DataFrame()
        accelerations = pd.DataFrame()
        angularVelocities = pd.DataFrame()
        angularAccelerations = pd.DataFrame()
        jointAngles = pd.DataFrame()
        with open(mvnxfile, "r") as myfile:
            # Read the file once; an earlier myfile.read() call would leave nothing for readlines().
            wholefilecontent = myfile.readlines()
            start_time = timeit.default_timer()

            for line in wholefilecontent:
                # Each DataFrame.append call copies the whole frame (the method was removed in pandas 2.0).
                if 'orientation' in line:
                    orientation = orientation.append(_read_feature_text(line), ignore_index=True)
                elif 'position' in line:
                    positions = positions.append(_read_feature_text(line), ignore_index=True)
                elif 'velocity' in line:
                    velocities = velocities.append(_read_feature_text(line), ignore_index=True)
                elif 'acceleration' in line:
                    accelerations = accelerations.append(_read_feature_text(line), ignore_index=True)
                elif 'angularVelocity' in line:
                    angularVelocities = angularVelocities.append(_read_feature_text(line), ignore_index=True)
                elif 'angularAcceleration' in line:
                    angularAccelerations = angularAccelerations.append(_read_feature_text(line), ignore_index=True)
                elif 'jointAngle' in line:  # was misspelled 'joinAngle', which never matched
                    jointAngles = jointAngles.append(_read_feature_text(line), ignore_index=True)

            elapsed = timeit.default_timer() - start_time
            print(elapsed)

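I even tried using regular expressions and the BeautifulSoup package. Neither gave me better timing. Any suggestions as to why? Is there another way to make it faster? By faster, I mean faster than this version.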

Tags: python, xml, matlab, text

Solution


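In my code, I found that the slowdown came from converting the data found on each line into a DataFrame and appending it to the end of the overall DataFrame. That per-line conversion and append made it extremely slow. I fixed it by collecting the data in a NumPy array and converting the whole array to a DataFrame once at the end.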
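
A minimal sketch of that fix, under a few assumptions: each data line looks like `<tag>v1 v2 ...</tag>`, the names `read_features`, `to_frames`, and `TAGS` are illustrative rather than the original code, and plain Python lists stand in for the intermediate NumPy array. Rows are accumulated cheaply, and each DataFrame is built once at the end:

```python
import io

# Tags to extract; taken from the question's code.
TAGS = ("orientation", "position", "velocity", "acceleration",
        "angularVelocity", "angularAcceleration", "jointAngle")


def read_features(lines):
    """Collect the numeric rows for each tag in plain lists (cheap appends)."""
    rows = {tag: [] for tag in TAGS}
    for line in lines:
        stripped = line.strip()
        for tag in TAGS:
            # Match the exact opening tag so one tag name cannot shadow another.
            if stripped.startswith("<" + tag + ">"):
                start = stripped.find(">") + 1
                end = stripped.find("</", start)
                if end != -1:  # skip lines without a closing tag
                    values = stripped[start:end].split()
                    rows[tag].append([float(x) for x in values])
                break
    return rows


def to_frames(rows):
    """Convert each list of rows to a DataFrame exactly once."""
    import pandas as pd  # imported here so the parsing step needs only the stdlib
    return {tag: pd.DataFrame(data) for tag, data in rows.items()}


# Tiny in-memory stand-in for an MVNX file (made-up values).
sample = io.StringIO(
    "<frame>\n"
    "<orientation>1 0 0 0</orientation>\n"
    "<position>0.1 0.2 0.3</position>\n"
    "<angularVelocity>0.5 0.6 0.7</angularVelocity>\n"
    "</frame>\n"
)
rows = read_features(sample)
```

Calling `to_frames(rows)` then gives one DataFrame per tag. Appending to a list is cheap, whereas every `DataFrame.append` call returns a new frame with all earlier rows copied, so the per-line version does more and more work as the frame grows.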

I also used the xmltodict package to parse the file instead of parsing it line by line.
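
The answer does not show its xmltodict code, so the following is only a sketch of the same idea (parsing the XML structure instead of scanning lines), with the standard library's `xml.etree.ElementTree` swapped in as a stand-in. The sample document, its element layout, its namespace URL, and the helper names `local_name` and `collect_tag` are all assumptions for illustration; a real MVNX file may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Made-up sample document; the namespace URL and layout are placeholders.
SAMPLE = """<mvnx xmlns="http://example.com/mvnx">
  <frames>
    <frame time="0">
      <orientation>1 0 0 0</orientation>
      <position>0.1 0.2 0.3</position>
    </frame>
    <frame time="10">
      <orientation>0 1 0 0</orientation>
      <position>0.4 0.5 0.6</position>
    </frame>
  </frames>
</mvnx>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def collect_tag(root, name):
    """Return one list of floats per element whose local name is `name`."""
    return [
        [float(x) for x in (elem.text or "").split()]
        for elem in root.iter()
        if local_name(elem.tag) == name
    ]


root = ET.fromstring(SAMPLE)
orientation = collect_tag(root, "orientation")
position = collect_tag(root, "position")
```

Each result is a list of rows that can be passed to `pandas.DataFrame` in a single call, which keeps the one-conversion-at-the-end approach described above.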

