Parsing skips elements

Problem description

I have the following source data structure (the whole file is 2.5 GB, which is why I rely on iterparse):


<!-- ====================================================================== -->

    <person id="10002042">
        <attributes>
            <attribute name="age" class="java.lang.Integer" >86</attribute>
            <attribute name="censusId" class="java.lang.Integer" >3674945</attribute>
            <attribute name="employed" class="java.lang.Boolean" >false</attribute>
            <attribute name="hasLicense" class="java.lang.String" >yes</attribute>
            <attribute name="htsId" class="java.lang.Long" >2601700100002</attribute>
            <attribute name="isOutside" class="java.lang.Boolean" >true</attribute>
            <attribute name="isPassenger" class="java.lang.Boolean" >true</attribute>
            <attribute name="ptSubscription" class="java.lang.Boolean" >false</attribute>
            <attribute name="sex" class="java.lang.String" >f</attribute>
        </attributes>
        <plan selected="yes">
            <activity type="outside" link="284251" facility="outside_1" x="653218.0059491959" y="6857536.564730054" end_time="09:49:38" >
            </activity>
            <leg mode="car_passenger" dep_time="09:49:38" trav_time="00:02:36">
                <route type="links" start_link="284251" end_link="63873" trav_time="00:02:36" distance="3117.285137236383" vehicleRefId="null">284251 660231 129607 129599 139064 641998 641663 159806 170160 85864 635804 572378 435246 190032 526059 525761 525778 525779 450362 63873</route>
            </leg>
            <activity type="outside" link="63873" facility="outside_2" x="656055.3097541996" y="6859009.979613776" end_time="09:52:18" >
            </activity>
            <leg mode="outside" dep_time="09:52:18" trav_time="00:00:00">
                <route type="generic" start_link="63873" end_link="85890" trav_time="00:00:00" distance="746.7439307235369"></route>
            </leg>
            <activity type="outside" link="85890" facility="outside_3" x="656635.5166858744" y="6859480.071535116" end_time="09:53:00" >
            </activity>
            <leg mode="car_passenger" dep_time="09:53:00" trav_time="00:01:21">
                <route type="links" start_link="85890" end_link="47652" trav_time="00:01:21" distance="1499.4956773327315" vehicleRefId="null">85890 202345 202323 202322 85868 569745 569762 535571 535243 616420 7195 408601 47652</route>
            </leg>
            <activity type="outside" link="47652" facility="outside_4" x="657143.7893766644" y="6860882.64702696" end_time="10:41:02" >
            </activity>
            <leg mode="outside" dep_time="10:41:02" trav_time="00:00:00">
                <route type="generic" start_link="47652" end_link="466140" trav_time="00:00:00" distance="16.659217552989976"></route>
            </leg>
            <activity type="outside" link="466140" facility="outside_5" x="657155.3197720037" y="6860894.671149082" end_time="10:43:55" >
            </activity>
            <leg mode="car_passenger" dep_time="10:43:55" trav_time="00:01:32">
                <route type="links" start_link="466140" end_link="85887" trav_time="00:01:32" distance="1841.175613889593" vehicleRefId="null">466140 666788 205956 205957 205958 315381 584891 7193 150557 535291 535555 569763 569764 569744 202426 202425 202424 535572 85887</route>
            </leg>
            <activity type="outside" link="85887" facility="outside_6" x="656620.921626125" y="6859492.595666251" end_time="10:45:38" >
            </activity>
            <leg mode="outside" dep_time="10:45:38" trav_time="00:00:00">
                <route type="generic" start_link="85887" end_link="63872" trav_time="00:00:00" distance="744.9330931635377"></route>
            </leg>
            <activity type="outside" link="63872" facility="outside_7" x="656043.6710628852" y="6859021.737831518" end_time="10:46:13" >
            </activity>
            <leg mode="car_passenger" dep_time="10:46:13" trav_time="00:02:37">
                <route type="links" start_link="63872" end_link="46435" trav_time="00:02:37" distance="3138.4720080186116" vehicleRefId="null">63872 63869 332997 332998 85873 525752 525750 525764 435247 635803 572374 572375 210451 159662 170159 159663 641997 641996 139065 129610 557816 525663 46435</route>
            </leg>
            <activity type="outside" link="46435" facility="outside_8" x="653338.6697731011" y="6857579.601421991" end_time="10:48:56" >
            </activity>
            <leg mode="outside" dep_time="10:48:56" trav_time="00:00:00">
                <route type="generic" start_link="46435" end_link="46426" trav_time="00:00:00" distance="187.1198640488319"></route>
            </leg>
            <activity type="outside" link="46426" facility="outside_9" x="653160.1865588573" y="6857523.409022551" end_time="10:49:17" >
            </activity>
            <leg mode="car_passenger" dep_time="10:49:17" trav_time="00:00:04">
                <route type="links" start_link="46426" end_link="398730" trav_time="00:00:04" distance="131.48553148334906" vehicleRefId="null">46426 46421 284422 506155 506168 398730</route>
            </leg>
            <activity type="outside" link="398730" facility="outside_10" x="653013.2075560454" y="6857532.214432823" end_time="10:49:27" >
            </activity>
        </plan>

    </person>

<!-- ====================================================================== -->

    <person id="10002043">
        <attributes>
            <attribute name="age" class="java.lang.Integer" >90</attribute>
            <attribute name="censusId" class="java.lang.Integer" >3674946</attribute>
            <attribute name="employed" class="java.lang.Boolean" >false</attribute>
            <attribute name="hasLicense" class="java.lang.String" >yes</attribute>
            <attribute name="htsId" class="java.lang.Long" >2400810100001</attribute>
            <attribute name="isOutside" class="java.lang.Boolean" >true</attribute>
            <attribute name="isPassenger" class="java.lang.Boolean" >false</attribute>
            <attribute name="ptSubscription" class="java.lang.Boolean" >false</attribute>
            <attribute name="sex" class="java.lang.String" >m</attribute>
        </attributes>
        <plan selected="yes">
            <activity type="outside" link="284251" facility="outside_1" x="653218.0059491959" y="6857536.564730054" end_time="08:29:24" >
            </activity>
            <leg mode="car" dep_time="08:29:24" trav_time="00:02:36">
                <route type="links" start_link="284251" end_link="63873" trav_time="00:02:36" distance="3117.285137236383" vehicleRefId="null">284251 660231 129607 129599 139064 641998 641663 159806 170160 85864 635804 572378 435246 190032 526059 525761 525778 525779 450362 63873</route>
            </leg>
            <activity type="outside" link="63873" facility="outside_2" x="656055.3097541996" y="6859009.979613776" end_time="08:32:04" >
            </activity>
            <leg mode="outside" dep_time="08:32:04" trav_time="00:00:00">
                <route type="generic" start_link="63873" end_link="85890" trav_time="00:00:00" distance="746.7439307235369"></route>
            </leg>
            <activity type="outside" link="85890" facility="outside_3" x="656635.5166858744" y="6859480.071535116" end_time="08:32:46" >
            </activity>
            <leg mode="car" dep_time="08:32:46" trav_time="00:01:21">
                <route type="links" start_link="85890" end_link="47652" trav_time="00:01:21" distance="1499.4956773327315" vehicleRefId="null">85890 202345 202323 202322 85868 569745 569762 535571 535243 616420 7195 408601 47652</route>
            </leg>
            <activity type="outside" link="47652" facility="outside_4" x="657143.7893766644" y="6860882.64702696" end_time="09:35:48" >
            </activity>
            <leg mode="outside" dep_time="09:35:48" trav_time="00:00:00">
                <route type="generic" start_link="47652" end_link="466140" trav_time="00:00:00" distance="16.659217552989976"></route>
            </leg>
            <activity type="outside" link="466140" facility="outside_5" x="657155.3197720037" y="6860894.671149082" end_time="09:42:26" >
            </activity>
            <leg mode="car" dep_time="09:42:26" trav_time="00:01:32">
                <route type="links" start_link="466140" end_link="85887" trav_time="00:01:32" distance="1841.175613889593" vehicleRefId="null">466140 666788 205956 205957 205958 315381 584891 7193 150557 535291 535555 569763 569764 569744 202426 202425 202424 535572 85887</route>
            </leg>
            <activity type="outside" link="85887" facility="outside_6" x="656620.921626125" y="6859492.595666251" end_time="09:44:09" >
            </activity>
            <leg mode="outside" dep_time="09:44:09" trav_time="00:00:00">
                <route type="generic" start_link="85887" end_link="63872" trav_time="00:00:00" distance="744.9330931635377"></route>
            </leg>
            <activity type="outside" link="63872" facility="outside_7" x="656043.6710628852" y="6859021.737831518" end_time="09:44:44" >
            </activity>
            <leg mode="car" dep_time="09:44:44" trav_time="00:02:37">
                <route type="links" start_link="63872" end_link="46435" trav_time="00:02:37" distance="3138.4720080186116" vehicleRefId="null">63872 63869 332997 332998 85873 525752 525750 525764 435247 635803 572374 572375 210451 159662 170159 159663 641997 641996 139065 129610 557816 525663 46435</route>
            </leg>
            <activity type="outside" link="46435" facility="outside_8" x="653338.6697731011" y="6857579.601421991" end_time="09:47:28" >
            </activity>
            <leg mode="outside" dep_time="09:47:28" trav_time="00:00:00">
                <route type="generic" start_link="46435" end_link="46426" trav_time="00:00:00" distance="187.1198640488319"></route>
            </leg>
            <activity type="outside" link="46426" facility="outside_9" x="653160.1865588573" y="6857523.409022551" end_time="09:47:49" >
            </activity>
            <leg mode="car" dep_time="09:47:49" trav_time="00:00:04">
                <route type="links" start_link="46426" end_link="398730" trav_time="00:00:04" distance="131.48553148334906" vehicleRefId="null">46426 46421 284422 506155 506168 398730</route>
            </leg>
            <activity type="outside" link="398730" facility="outside_10" x="653013.2075560454" y="6857532.214432823" end_time="09:55:48" >
            </activity>
            <leg mode="outside" dep_time="09:55:48" trav_time="00:00:00">
                <route type="generic" start_link="398730" end_link="284251" trav_time="00:00:00" distance="204.84459212547162"></route>
            </leg>
            <activity type="outside" link="284251" facility="outside_1" x="653218.0059491959" y="6857536.564730054" end_time="09:59:24" >
            </activity>
            <leg mode="car" dep_time="09:59:24" trav_time="00:01:56">
                <route type="links" start_link="284251" end_link="525753" trav_time="00:01:56" distance="2349.4934769631172" vehicleRefId="null">284251 660231 129607 129599 139064 641998 641663 159806 170160 85864 635804 572378 435246 362748 643661 525753</route>
            </leg>
            <activity type="outside" link="525753" facility="outside_11" x="655306.9611509901" y="6858641.834279304" end_time="10:35:48" >
            </activity>
            <leg mode="outside" dep_time="10:35:48" trav_time="00:00:00">
                <route type="generic" start_link="525753" end_link="133164" trav_time="00:00:00" distance="70.96782044637413"></route>
            </leg>
            <activity type="outside" link="133164" facility="outside_12" x="655356.203591104" y="6858692.93822857" end_time="10:44:25" >
            </activity>
            <leg mode="car" dep_time="10:44:25" trav_time="00:02:16">
                <route type="links" start_link="133164" end_link="46435" trav_time="00:02:16" distance="2594.925451303471" vehicleRefId="null">133164 133165 525784 525781 159395 582076 84099 84100 525760 435247 635803 572374 572375 210451 159662 170159 159663 641997 641996 139065 129610 557816 525663 46435</route>
            </leg>
            <activity type="outside" link="46435" facility="outside_8" x="653338.6697731011" y="6857579.601421991" end_time="10:46:48" >
            </activity>
            <leg mode="outside" dep_time="10:46:48" trav_time="00:00:00">
                <route type="generic" start_link="46435" end_link="46426" trav_time="00:00:00" distance="187.1198640488319"></route>
            </leg>
            <activity type="outside" link="46426" facility="outside_9" x="653160.1865588573" y="6857523.409022551" end_time="10:47:09" >
            </activity>
            <leg mode="car" dep_time="10:47:09" trav_time="00:00:04">
                <route type="links" start_link="46426" end_link="398730" trav_time="00:00:04" distance="131.48553148334906" vehicleRefId="null">46426 46421 284422 506155 506168 398730</route>
            </leg>
            <activity type="outside" link="398730" facility="outside_10" x="653013.2075560454" y="6857532.214432823" end_time="10:47:19" >
            </activity>
        </plan>

    </person>

<!-- ====================================================================== -->

    <person id="10004136">
        <attributes>
            <attribute name="age" class="java.lang.Integer" >41</attribute>
            <attribute name="censusId" class="java.lang.Integer" >3675631</attribute>
            <attribute name="employed" class="java.lang.Boolean" >false</attribute>
            <attribute name="hasLicense" class="java.lang.String" >yes</attribute>
            <attribute name="htsId" class="java.lang.Long" >2403610200001</attribute>
            <attribute name="isOutside" class="java.lang.Boolean" >true</attribute>
            <attribute name="isPassenger" class="java.lang.Boolean" >false</attribute>
            <attribute name="ptSubscription" class="java.lang.Boolean" >false</attribute>
            <attribute name="sex" class="java.lang.String" >f</attribute>
        </attributes>
        <plan selected="yes">
            <activity type="outside" link="284251" facility="outside_1" x="653218.0059491959" y="6857536.564730054" end_time="19:22:27" >
            </activity>
            <leg mode="car" dep_time="19:22:27" trav_time="00:01:56">
                <route type="links" start_link="284251" end_link="525753" trav_time="00:01:56" distance="2349.4934769631172" vehicleRefId="null">284251 660231 129607 129599 139064 641998 641663 159806 170160 85864 635804 572378 435246 362748 643661 525753</route>
            </leg>
            <activity type="outside" link="525753" facility="outside_11" x="655306.9611509901" y="6858641.834279304" end_time="19:24:31" >
            </activity>
        </plan>

    </person>

<!-- ====================================================================== -->

    <person id="10004137">
        <attributes>
            <attribute name="age" class="java.lang.Integer" >53</attribute>
            <attribute name="censusId" class="java.lang.Integer" >3675632</attribute>
            <attribute name="employed" class="java.lang.Boolean" >true</attribute>
            <attribute name="hasLicense" class="java.lang.String" >yes</attribute>
            <attribute name="htsId" class="java.lang.Long" >1157470400001</attribute>
            <attribute name="isOutside" class="java.lang.Boolean" >true</attribute>
            <attribute name="isPassenger" class="java.lang.Boolean" >true</attribute>
            <attribute name="ptSubscription" class="java.lang.Boolean" >true</attribute>
            <attribute name="sex" class="java.lang.String" >m</attribute>
        </attributes>
        <plan selected="yes">
            <activity type="outside" link="31240" facility="outside_13" x="652838.038196341" y="6858295.183610428" end_time="07:34:00" >
            </activity>
            <leg mode="access_walk" dep_time="07:34:00" trav_time="00:00:39">
                <route type="generic" start_link="31240" end_link="pt_StopPoint:59298" trav_time="00:00:39" distance="46.250835788845635"></route>
            </leg>
            <activity type="pt interaction" link="31240" x="652838.038196341" y="6858295.183610428" max_dur="00:00:00" >
            </activity>
            <leg mode="pt" dep_time="07:34:39" trav_time="00:02:21">
                <route type="enriched_pt" start_link="pt_StopPoint:59298" end_link="pt_StopPoint:59666" trav_time="00:02:21" distance="515.6409073075592">{"inVehicleTime":120.0,"transferTime":21.0,"accessStopIndex":26,"egressStopindex":27,"transitRouteId":"93517783-1_287780","transitLineId":"100110007:7","departureId":"93517632-1_287842_06:58:00"}</route>
            </leg>
            <activity type="pt interaction" link="31240" x="652838.038196341" y="6858295.183610428" max_dur="00:00:00" >
            </activity>
            <leg mode="egress_walk" dep_time="07:37:00" trav_time="00:08:29">
                <route type="generic" start_link="pt_StopPoint:59666" end_link="508756" trav_time="00:08:29" distance="610.543587585534"></route>
            </leg>
            <activity type="outside" link="508756" facility="outside_14" x="652601.8490830011" y="6857663.731302492" end_time="07:53:26" >
            </activity>
            <leg mode="access_walk" dep_time="07:53:26" trav_time="00:08:29">
                <route type="generic" start_link="508756" end_link="pt_StopPoint:59666" trav_time="00:08:29" distance="610.543587585534"></route>
            </leg>
            <activity type="pt interaction" link="508756" x="652601.8490830011" y="6857663.731302492" max_dur="00:00:00" >
            </activity>
            <leg mode="pt" dep_time="08:01:55" trav_time="00:24:04">
                <route type="enriched_pt" start_link="pt_StopPoint:59666" end_link="pt_StopPoint:59209" trav_time="00:24:04" distance="7410.255050348954">{"inVehicleTime":1260.0,"transferTime":184.905695489786,"accessStopIndex":3,"egressStopindex":17,"transitRouteId":"93517741-1_288723","transitLineId":"100110007:7","departureId":"93517701-1_288827_08:01:00"}</route>
            </leg>
            <activity type="pt interaction" link="508756" x="652601.8490830011" y="6857663.731302492" max_dur="00:00:00" >
            </activity>
            <leg mode="transit_walk" dep_time="08:26:00" trav_time="00:01:05">
                <route type="generic" start_link="pt_StopPoint:59209" end_link="pt_StopPoint:59212" trav_time="00:01:05" distance="78.60144797794317"></route>
            </leg>
            <activity type="pt interaction" link="pt_StopPoint:59209" x="651042.0886563308" y="6863599.716479325" max_dur="00:00:00" >
            </activity>
            <leg mode="pt" dep_time="08:27:05" trav_time="00:08:54">
                <route type="enriched_pt" start_link="pt_StopPoint:59212" end_link="pt_StopPoint:59470" trav_time="00:08:54" distance="2841.5271228126094">{"inVehicleTime":420.0,"transferTime":114.498793351715,"accessStopIndex":17,"egressStopindex":22,"transitRouteId":"95331274-1_267292","transitLineId":"100110008:8","departureId":"95331302-1_267323_08:07:00"}</route>
            </leg>
            <activity type="pt interaction" link="pt_StopPoint:59209" x="651042.0886563308" y="6863599.716479325" max_dur="00:00:00" >
            </activity>
            <leg mode="egress_walk" dep_time="08:36:00" trav_time="00:03:05">
                <route type="generic" start_link="pt_StopPoint:59470" end_link="269385" trav_time="00:03:05" distance="221.08599197383575"></route>
            </leg>
            <activity type="work" link="269385" facility="22974" x="649200.4" y="6861852.6" start_time="07:38:40" end_time="16:38:40" >
                <attributes>
                    <attribute name="innerParis" class="java.lang.Boolean" >true</attribute>
                </attributes>
            </activity>
            <leg mode="access_walk" dep_time="16:38:40" trav_time="00:03:05">
                <route type="generic" start_link="269385" end_link="pt_StopPoint:59470" trav_time="00:03:05" distance="221.08599197383575"></route>
            </leg>
            <activity type="pt interaction" link="269385" x="649200.4" y="6861852.6" max_dur="00:00:00" >
            </activity>
            <leg mode="pt" dep_time="16:41:45" trav_time="00:09:15">
                <route type="enriched_pt" start_link="pt_StopPoint:59470" end_link="pt_StopPoint:59212" trav_time="00:09:15" distance="2841.5271228126094">{"inVehicleTime":420.0,"transferTime":135.0,"accessStopIndex":6,"egressStopindex":11,"transitRouteId":"95305985-1_264552","transitLineId":"100110008:8","departureId":"95305925-1_264577_16:36:00"}</route>
            </leg>
            <activity type="pt interaction" link="269385" x="649200.4" y="6861852.6" max_dur="00:00:00" >
            </activity>
            <leg mode="transit_walk" dep_time="16:51:00" trav_time="00:01:05">
                <route type="generic" start_link="pt_StopPoint:59212" end_link="pt_StopPoint:59209" trav_time="00:01:05" distance="78.60144797794317"></route>
            </leg>
            <activity type="pt interaction" link="pt_StopPoint:59212" x="650982.2282691017" y="6863608.229197035" max_dur="00:00:00" >
            </activity>
            <leg mode="pt" dep_time="16:52:05" trav_time="00:19:54">
                <route type="enriched_pt" start_link="pt_StopPoint:59209" end_link="pt_StopPoint:59298" trav_time="00:19:54" distance="6894.614143041396">{"inVehicleTime":1140.0,"transferTime":54.498793351711356,"accessStopIndex":13,"egressStopindex":26,"transitRouteId":"93518107-1_287714","transitLineId":"100110007:7","departureId":"93518059-1_287550_16:35:00"}</route>
            </leg>
            <activity type="pt interaction" link="pt_StopPoint:59212" x="650982.2282691017" y="6863608.229197035" max_dur="00:00:00" >
            </activity>
            <leg mode="egress_walk" dep_time="17:12:00" trav_time="00:00:39">
                <route type="generic" start_link="pt_StopPoint:59298" end_link="31240" trav_time="00:00:39" distance="46.250835788845635"></route>
            </leg>
            <activity type="outside" link="31240" facility="outside_13" x="652838.038196341" y="6858295.183610428" end_time="17:14:00" >
            </activity>
        </plan>

    </person>

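What I want is to extract all person ids and the corresponding trav_time values stored in the leg nodes. The travel times of the legs should only be stored if the plan has selected="yes".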

My algorithm looks like this:

import gzip
import xml.etree.ElementTree as ET
from collections import defaultdict

import pandas as pd

tree = ET.iterparse(gzip.open('V0_1pm/output_plans.xml.gz', 'r'))
traveltimes = defaultdict(list)
for xml_event, elem in tree:
    if elem.tag=='person':
        items = list(elem)
        target = items[1]
        if target.attrib['selected']=='yes':
            traveltimes[elem.attrib["id"]]
            legs = list(items[1])
            for leg in legs:
                if leg.tag=='leg':
                    traveltimes[leg.attrib["trav_time"]]
        elem.clear()


traveltimes = pd.DataFrame.from_dict(traveltimes, orient='index')

And the output looks like this:

1   10002042
2   00:02:36
3   00:00:00
4   00:01:21
5   00:01:32
6   00:02:37
7   00:00:04
8   10002043
9   00:02:54
10  00:01:40
11  00:02:00
12  00:03:00
13  00:00:14
14  00:02:07
15  00:02:45
16  10004136
17  10004137
18  00:00:39
...

As you can see in rows 16 to 17, the trav_time of 10004136 is not stored, and I don't know why. I would appreciate any help!

Tags: python, xml, pandas, parsing, elementtree

Solution


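I think your code extracts the relevant travel times correctly; the problem is how the defaultdict is used. An expression such as traveltimes[key] only creates an empty list under that key if the key is not in the dictionary yet, and it does nothing if the key already exists. Because both the person ids and the travel times are used as keys, a travel time that has already occurred is not stored a second time. For example, 00:01:56 appears twice, first for id 10002043 and then for id 10004136, so the second occurrence just hits the existing key. I tried running your code on only a part of the XML, and it ran without errors.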

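from lxml import etree
import pandas as pd
from collections import defaultdict

# `xml` is assumed to hold part of the source file as a string,
# wrapped in a single root element
tree = etree.fromstring(xml)
traveltimes = defaultdict(list)
count = 0
for elem in tree:
    count = count + 1
    # skip the first two children of the root
    if count >= 3:
        if elem.tag == 'person':
            items = list(elem)
            target = items[1]  # second child of <person>, i.e. the <plan>
            if target.attrib['selected'] == 'yes':
                # only creates an empty list under the id
                traveltimes[elem.attrib["id"]]
                legs = list(target)
                for leg in legs:
                    print(leg)
                    if leg.tag == 'leg':
                        print(leg.attrib)
                        # trav_time is used as a key, so a repeated
                        # value does not create a new entry
                        traveltimes[leg.attrib["trav_time"]]

print(traveltimes)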

This is the output I received: [image]

So I think the repeated value just ends up under the same key (I am not sure, though).
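
If the diagnosis above is right, one possible way around it is to use only the person id as the key and append each trav_time to that person's list, so that equal travel times can never collide. The following is a minimal sketch, not a tested replacement for the original script: the SAMPLE document, its <population> root element, and the function name collect_travel_times are assumptions made for illustration, and the sample keeps only the attributes needed here.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature of the source file; the <population> root is an assumption.
SAMPLE = b"""<population>
  <person id="10002043">
    <attributes>
      <attribute name="age" class="java.lang.Integer">90</attribute>
    </attributes>
    <plan selected="yes">
      <leg mode="car" dep_time="09:59:24" trav_time="00:01:56"/>
      <leg mode="outside" dep_time="10:35:48" trav_time="00:00:00"/>
    </plan>
  </person>
  <person id="10004136">
    <attributes>
      <attribute name="age" class="java.lang.Integer">41</attribute>
    </attributes>
    <plan selected="yes">
      <leg mode="car" dep_time="19:22:27" trav_time="00:01:56"/>
    </plan>
  </person>
</population>"""


def collect_travel_times(source):
    """Map each person id to the trav_time values of its selected plan(s)."""
    traveltimes = defaultdict(list)
    # iterparse yields "end" events by default, so a <person> element
    # is complete (including all its legs) when it shows up here.
    for _event, elem in ET.iterparse(source):
        if elem.tag == "person":
            for plan in elem.findall("plan"):
                if plan.get("selected") == "yes":
                    # Looking up the id creates the (possibly empty) list once;
                    # travel times are appended as values, never used as keys.
                    times = traveltimes[elem.get("id")]
                    for leg in plan.findall("leg"):
                        times.append(leg.get("trav_time"))
            elem.clear()  # drop the finished subtree to keep memory low
    return dict(traveltimes)


result = collect_travel_times(io.BytesIO(SAMPLE))
print(result)
```

For the real file, the same function should accept the object returned by gzip.open('V0_1pm/output_plans.xml.gz', 'r') in place of the io.BytesIO wrapper, and the resulting dictionary can then be passed to pd.DataFrame.from_dict(result, orient='index') as in the question, which gives one row per person.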

