Processing text between two parent tags

Problem description

I have the following HTML text, where the number of "tr" elements is dynamic:

<tr>
    <td>Dec 1, 2019 11:12 PM</td>
    <td>some text1</td>
    <td>some text2</td>
    <td>some text3</td>
    <td>
        <input type=button value="Add" id="add" onCLick="add(12345)" data-toggle="modal" data-target="#add" />
    </td>
    <td></td>
</tr>

<tr>
    <td>Dec 5, 2019 4:33 PM</td>
    <td>some text1</td>
    <td>some text2</td>
    <td>some text3</td>
    <td>
        <input type=button value="Add" id="add" onCLick="add(12345)" data-toggle="modal" data-target="#add" />
    </td>
    <td></td>
</tr>

<tr>
    <td>Dec 9, 2019 1:06 PM</td>
    <td>some text1</td>
    <td>some text2</td>
    <td>some text3</td>
    <td>
        <input type=button value="Add" id="add" onCLick="add(12345)" data-toggle="modal" data-target="#add" />
    </td>
    <td></td>
</tr>

I would like to get the following result:

Dec 1, 2019 11:12 PM | some text1 | some text2 | some text3 
Dec 5, 2019 4:33 PM | some text1 | some text2 | some text3 
Dec 9, 2019 1:06 PM | some text1 | some text2 | some text3 

I tried using sed grouping:

sed '/^<tr>/d;:a;N;/^<\/tr>/M!s/\n/ /;ta;P;d'

But of course it does not work. Any suggestions on how to handle this?

Tags: html, bash, sed

Solution


Unless you want a quick-and-dirty solution, HTML should be parsed with an HTML parser, as noted in the comments.

For example, using Python:

import bs4  # bs4 is Beautiful Soup, an HTML parser

# Placeholder file names; replace them with your own paths
input_path = "input.html"
output_path = "output.txt"

# Open both the input and the output file
with open(input_path) as myinput, open(output_path, "w") as myoutput:
    # Parse the HTML
    soup = bs4.BeautifulSoup(myinput, "html.parser")

    # For each tr tag
    for tr in soup.find_all("tr"):
        # Build a list with the text of every td in the row
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # Write the first 4 values joined by " | ", as in the requested output
        # (csv.writer only accepts a single-character delimiter, so it cannot emit " | ")
        myoutput.write(" | ".join(cells[:4]) + "\n")
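
If installing Beautiful Soup is not an option, roughly the same parser-based idea can be sketched with only the Python standard library. The following is a minimal sketch, not part of the original answer: the names RowCollector and rows_to_lines are hypothetical, and it assumes the table is not nested and that only the first four cells of each row are wanted.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the <tr> being read, None outside a row
        self._cell = None  # text chunks of the <td> being read, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td>
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


def rows_to_lines(html, columns=4, sep=" | "):
    """Return one string per <tr>, joining its first `columns` cells with `sep`."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return [sep.join(row[:columns]) for row in parser.rows]


sample = """
<tr>
    <td>Dec 1, 2019 11:12 PM</td>
    <td>some text1</td>
    <td>some text2</td>
    <td>some text3</td>
    <td>
        <input type=button value="Add" id="add" onCLick="add(12345)" data-toggle="modal" data-target="#add" />
    </td>
    <td></td>
</tr>
"""

for line in rows_to_lines(sample):
    print(line)
```

Run on the single sample row above, this prints "Dec 1, 2019 11:12 PM | some text1 | some text2 | some text3". Because the parser tracks tags rather than lines, it should not matter whether each td sits on its own line.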

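If the parser-based approach does not convince you, here is a quick-and-dirty solution using awk. It relies on each <td> sitting on its own line, as in the sample: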

awk '
   # Split fields on "<" or ">" and join output fields with " | "
   BEGIN{FS="<|>"; OFS=" | "}
   # On a line containing <td>, field 3 is the cell text; store it in td_info
   /<td>/{td_info[++counter]=$3}
   # At the end of a row, print the first four cells, then reset the array and counter
   /<\/tr>/{
       print td_info[1], td_info[2], td_info[3], td_info[4]
       counter=0
       delete td_info
   }
' input.html

Output:

Dec 1, 2019 11:12 PM | some text1 | some text2 | some text3
Dec 5, 2019 4:33 PM | some text1 | some text2 | some text3
Dec 9, 2019 1:06 PM | some text1 | some text2 | some text3
