HtmlAgilityPack problem when reading certain websites

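Problem description

I'm having trouble reading certain websites with HtmlAgilityPack, for example https://faranesh.com and https://cbi.ir.

Problem: urlResponse returns "\r\n\r\n\r\n

I tried the code below, but it only returns null. I want to access the site's source, but I haven't been able to.

The C# code is: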

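    using System;
    using HtmlAgilityPack;

    class Program
    {
        static void Main()
        {
            var url = "https://www.cbi.ir/";
            var web = new HtmlWeb();
            // Load fetches only the raw HTML the server returns; no JavaScript is executed.
            var doc = web.Load(url);
            // SelectSingleNode returns null when nothing matches the XPath.
            var node = doc.DocumentNode.SelectSingleNode("//title");
            Console.WriteLine(node != null ? $"Title is {node.InnerText}" : "No <title> element found.");
        }
    }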

Tags: c#, html-agility-pack

Solution


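It looks like the examples you posted are single-page applications or otherwise heavily JavaScript-based. The first example returns the following HTML: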

<!DOCTYPE html>
<html lang="fa-IR">
<head>
<script type="9055e798d34ceda9b8089665-text/javascript">(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
            new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
        j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
        'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
    })(window,document,'script','dataLayer','GTM-MSQZK3S');</script>
<script type="9055e798d34ceda9b8089665-text/javascript">
        !function (t, e, n) {
            t.yektanetAnalyticsObject = n, t[n] = t[n] || function () {
                t[n].q.push(arguments)
            }, t[n].q = t[n].q || [];
            var a = new Date, r = a.getFullYear().toString() + "0" + a.getMonth() + "0" + a.getDate() + "0" + a.getHours(),
                    c = e.getElementsByTagName("script")[0], s = e.createElement("script");
            s.id = "ua-script-yn-2448-adv"; s.dataset.analyticsobject = n;
            s.async = 1; s.type = "text/javascript";
            s.src = "https://cdn.yektanet.com/rg_woebegone/scripts_v2/yn-2448-adv/rg.complete.js?v=" + r, c.parentNode.insertBefore(s, c)
        }(window, document, "yektanet");
    </script>
<base href="/">
<meta charset="UTF-8">
<meta name="theme-color" content="#2e9ed8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="language" content="fa" />
<link rel="apple-touch-icon" sizes="180x180" href="./apple-touch-icon.png">
<link rel="icon" type="image/png" sizes="32x32" href="./favicon-32x32.png">
<link rel="icon" type="image/png" sizes="16x16" href="./favicon-16x16.png">
<link rel="manifest" href="./site.webmanifest">
<link rel="mask-icon" href="./safari-pinned-tab.svg" color="#5bbad5">
<meta name="msapplication-TileColor" content="#ffc40d">
<meta name="theme-color" content="#ffffff">
<meta name="google-signin-scope" content="profile email">
<link rel="search" type="application/opensearchdescription+xml" title="Faranesh" href="./opensearch.xml" />
<link rel="manifest" href="./manifest.json" />

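As you can see, the initial response has no body and no title tag, so SelectSingleNode("//title") has nothing to match and returns null.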
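To check this without hitting the network, the raw markup can be fed to HtmlAgilityPack directly. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is installed; the class name `TitleProbe`, the helper `TryGetTitle`, and both sample strings are hypothetical and only stand in for the real responses.

```csharp
using System;
using HtmlAgilityPack;

public static class TitleProbe
{
    // Returns the trimmed <title> text, or null when the markup has no <title> element.
    public static string TryGetTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectSingleNode returns null when the XPath matches nothing.
        var node = doc.DocumentNode.SelectSingleNode("//title");
        return node?.InnerText.Trim();
    }

    public static void Main()
    {
        // Hypothetical stand-in for a script-only response like the one shown above.
        var shell = "<html><head><meta charset=\"UTF-8\"><base href=\"/\"></head></html>";
        // Hypothetical server-rendered page that does include a title.
        var full = "<html><head><title>Example</title></head><body></body></html>";

        Console.WriteLine(TryGetTitle(shell) ?? "(no title in raw HTML)");
        Console.WriteLine(TryGetTitle(full) ?? "(no title in raw HTML)");
    }
}
```

Guarding against the null result this way turns the silent failure into an explicit "nothing to parse yet" signal, which is usually the hint that the page is rendered client-side.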

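If you want to parse the content, including DOM elements generated by JavaScript, you need to automate a headless browser instead of parsing the raw HTML returned by the server.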

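For example, try Puppeteer Sharp.

I have not tested this locally, but based on the examples in its repository, it would look something like this: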

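using System;
using PuppeteerSharp;

// Download a browser build for Puppeteer Sharp to drive.
// (Newer releases may expose a parameterless DownloadAsync() instead.)
await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);
var browser = await Puppeteer.LaunchAsync(new LaunchOptions
{
    Headless = true
});
var page = await browser.NewPageAsync();
await page.GoToAsync("https://faranesh.com/");

// Evaluate the expression inside the page, after its scripts have run.
var title = await page.EvaluateExpressionAsync<string>("document.title");
Console.WriteLine($"Title: {title}");

await browser.CloseAsync();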
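
If the goal is to keep using HtmlAgilityPack for the XPath queries, one option is to let the headless browser render the page and then hand the resulting HTML to HtmlAgilityPack. The following is an untested sketch, assuming the PuppeteerSharp and HtmlAgilityPack NuGet packages and an already downloaded browser build; the class `RenderedPage` and its method names are hypothetical, and waiting for network idle is an assumption about when the page has finished rendering.

```csharp
using System;
using System.Threading.Tasks;
using HtmlAgilityPack;
using PuppeteerSharp;

public static class RenderedPage
{
    // Parses already-rendered HTML with HtmlAgilityPack; returns the <title> text or null.
    public static string ExtractTitle(string renderedHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(renderedHtml);
        return doc.DocumentNode.SelectSingleNode("//title")?.InnerText.Trim();
    }

    // Renders the page in a headless browser and returns the DOM serialized as HTML.
    public static async Task<string> FetchRenderedHtmlAsync(string url)
    {
        // Assumes a browser build was downloaded beforehand with BrowserFetcher.
        var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
        try
        {
            var page = await browser.NewPageAsync();
            // Wait until network activity settles so client-side rendering can finish.
            await page.GoToAsync(url, WaitUntilNavigation.Networkidle0);
            return await page.GetContentAsync();
        }
        finally
        {
            await browser.CloseAsync();
        }
    }

    public static async Task Main()
    {
        var html = await FetchRenderedHtmlAsync("https://faranesh.com/");
        var title = ExtractTitle(html) ?? "(none)";
        Console.WriteLine($"Title: {title}");
    }
}
```

Keeping the parsing step separate from the fetching step means the XPath logic can be exercised against saved HTML without launching a browser.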
