Is there a function to skip duplicate values when writing to a text file?

Problem description

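I'm trying to write every hyperlink whose text matches the word "Contacts" to a .csv text file, for each site in a list of domains. The problem is that if it finds another contact hyperlink on the same website, it prints it again. How can I fix this? Also, how can I scrape multiple websites for specific divs that contain keywords ("phone number", "address", "email", etc.)?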

private void contactbutton_Click(object sender, EventArgs e)
    {
        ArrayList domainlist = new ArrayList();  
        const Int32 BufferSize = 128;
        // -- Location of domain list file --
        using (var fileStream = File.OpenRead("C:/Users/Username/Desktop/domains.txt"))           
        using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize))
        {
            String line;
            while ((line = streamReader.ReadLine()) != null)
                domainlist.Add(line);
        }
        foreach (string s in domainlist)
        {
            SearchHyperlinks("https://" + s);
        }
    }



public static void SearchHyperlinks(string address4)
    {

        HtmlWeb hw = new HtmlWeb();
        HtmlAgilityPack.HtmlDocument doc = hw.Load(address4);

        
        String GetAbsoluteUrlString(string baseUrl, string url)
        {
            var uri = new Uri(url, UriKind.RelativeOrAbsolute);
            if (!uri.IsAbsoluteUri)
                uri = new Uri(new Uri(baseUrl), uri);
            return uri.ToString();
        }      
        try
        {
            using (var w = new StreamWriter("C:/Users/Username/Desktop/hyperlink.csv"))
                foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[starts-with(., 'Contacts') or starts-with(., 'contacts') or starts-with(., 'CONTACTS') or starts-with (., 'Shop Contacts')]"))
                {
                    String hrefValue = link.Attributes["href"].Value;

                    if (hrefValue != null)
                    {
                        String fullhref = GetAbsoluteUrlString(address4, hrefValue);
                        Console.WriteLine(fullhref);
                        using (var textWriter = new StreamWriter("C:/Users/Username/Desktop/hyperlinks.csv", true))
                        {
                            var writer = new CsvWriter(textWriter, CultureInfo.InvariantCulture);
                            writer.Configuration.Delimiter = ",";
                            writer.WriteField(fullhref);
                            writer.NextRecord();
                        }
                    }

                }
        }

        catch (System.NullReferenceException)
        {
            Console.WriteLine("Hyperlinks not found");
        }
    }

Tags: c#, html, html-agility-pack

Solution


Rewrite your method so that it actually searches and returns the links, instead of writing them:

// Requires: using System; using System.Collections.Generic; using HtmlAgilityPack;
public static IEnumerable<string> SearchHyperlinks(string address4)
{
    var hw = new HtmlWeb();
    var doc = hw.Load(address4);

    // Resolve a relative href against the address of the page it was found on.
    String GetAbsoluteUrlString(string baseUrl, string url)
    {
        var uri = new Uri(url, UriKind.RelativeOrAbsolute);
        if (!uri.IsAbsoluteUri)
            uri = new Uri(new Uri(baseUrl), uri);
        return uri.ToString();
    }
    var links = doc.DocumentNode.SelectNodes("//a[starts-with(., 'Contacts') or starts-with(., 'contacts') or starts-with(., 'CONTACTS') or starts-with (., 'Shop Contacts')]");
    // SelectNodes returns null when nothing matches, so stop here instead of throwing.
    if (links == null)
        yield break;
    foreach (var link in links)
    {
        // The indexer returns null for a missing attribute; ?. avoids a NullReferenceException.
        var hrefValue = link.Attributes["href"]?.Value;
        if (hrefValue != null)
        {
            var fullhref = GetAbsoluteUrlString(address4, hrefValue);
            yield return fullhref;
        }
    }
}

Then keep only the distinct values (Distinct() needs using System.Linq):

var distinct = SearchHyperlinks(input).Distinct();

Then write them all wherever you want.
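As a rough sketch of that last step, the helper below collects the links for every domain, drops duplicates, and writes the result once. The class name LinkExport and its two methods are hypothetical names made up for this illustration, and the search parameter is a stand-in for SearchHyperlinks, so the sample compiles without HtmlAgilityPack. It assumes an exact, case-sensitive comparison is what counts as a duplicate, and that one link per line is an acceptable single-column CSV.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of the original answer.
public static class LinkExport
{
    // Runs `search` for every domain and returns each link only once.
    // `search` stands in for SearchHyperlinks from the answer.
    public static List<string> CollectDistinct(
        IEnumerable<string> domains,
        Func<string, IEnumerable<string>> search)
    {
        return domains
            .Where(d => !string.IsNullOrWhiteSpace(d))        // skip blank lines in the domain file
            .SelectMany(d => search("https://" + d.Trim()))   // flatten the links of all sites
            .Distinct(StringComparer.Ordinal)                 // exact, case-sensitive comparison
            .ToList();
    }

    // Writes one link per line, overwriting any previous file so that
    // repeated runs do not append the same links again.
    public static void WriteLinks(string path, IEnumerable<string> links)
    {
        File.WriteAllLines(path, links);
    }
}
```

With the paths from the question, usage could look like LinkExport.WriteLinks("C:/Users/Username/Desktop/hyperlinks.csv", LinkExport.CollectDistinct(File.ReadLines("C:/Users/Username/Desktop/domains.txt"), SearchHyperlinks)). Opening the file once, after deduplication, also avoids reopening a StreamWriter for every link as the original code did. A link that itself contains a comma or a quote would still need CSV quoting, for example through CsvWriter.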

