Filtering strings from Html Agility Pack

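Problem description

I fetch the HTML from a URL, then select the table element and, inside it, all the tr elements whose id attribute contains "tr". I now have 20 or so elements like this: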

<th class="nw">1 Jan</th><td class="nw">Friday</td><td><a href="/holidays/andorra/new-year-day">New Year&#39;s Day</a></td><td>National holiday</td>

How can I get each piece of text from the element above separately?
Example output: 1 Jan/Friday/New Year's Day/National holiday

var url = "https://www.timeanddate.com/holidays/andorra/";
var client = new HttpClient();
client.DefaultRequestHeaders.Add("Accept-Language", "en-US,en;q=0.5");
var html = await client.GetStringAsync(url);

var document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(html);

var a1 = document.DocumentNode.Descendants("table")
    .Where(node => node.GetAttributeValue("id","").Equals("holidays-table"))
    .ToList();

var a2 = a1[0].Descendants("tr")
    .Where(node => node.GetAttributeValue("id","").Contains("tr"))
    .ToList();

Tags: c#, web-scraping, html-agility-pack

Solution

This should give you what you want:

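List<List<string>> holidays = document
    .DocumentNode
    // Select every row of the holidays table (SelectNodes returns null if nothing matches).
    .SelectNodes("//table[@id='holidays-table']/tbody/tr")
    .Select(tr => tr.ChildNodes
                    // Keep only the header and data cells of the row.
                    .Where(n => n.Name == "th" || n.Name == "td")
                    // InnerText leaves entities such as &#39; encoded, so decode them.
                    .Select(n => HtmlAgilityPack.HtmlEntity.DeEntitize(n.InnerText).Trim())
                    .ToList())
    .Where(row => row.Any())  // filter out empty rows
    .ToList();

foreach (var row in holidays)
{
    // Join with "/" to match the example output in the question.
    Console.WriteLine(string.Join("/", row));
}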

Working demo here: https://dotnetfiddle.net/0SADls
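
To try the cell extraction without hitting the network, the same idea can be applied to the single sample row from the question. This is only a minimal sketch: the class name RowTextDemo and the method CellTexts are hypothetical names introduced for illustration, and it assumes the HtmlAgilityPack NuGet package is referenced. It decodes entities with HtmlEntity.DeEntitize so that `&#39;` comes out as an apostrophe.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class RowTextDemo
{
    // Hypothetical helper: returns the trimmed, entity-decoded text of
    // every <th> and <td> element found in the given HTML fragment.
    public static string[] CellTexts(string rowHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(rowHtml);

        return doc.DocumentNode
            .Descendants()
            .Where(n => n.Name == "th" || n.Name == "td")
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .ToArray();
    }

    public static void Main()
    {
        // Sample row copied from the question.
        const string row =
            "<th class=\"nw\">1 Jan</th><td class=\"nw\">Friday</td>" +
            "<td><a href=\"/holidays/andorra/new-year-day\">New Year&#39;s Day</a></td>" +
            "<td>National holiday</td>";

        // Join the cells with "/" as in the question's example output.
        Console.WriteLine(string.Join("/", CellTexts(row)));
    }
}
```

For the rows already collected in a2, the same projection could be applied per row, for example a2.Select(tr => string.Join("/", RowTextDemo.CellTexts(tr.InnerHtml))).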

