How to merge sets of elements in an array in Ruby

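Problem description

I have an array full of US state names that were scraped from a website.

The problem is that every state with a two-word name ends up as separate elements in the array. "New York" appears in the array as ["New", "York"].

I need to fix this for every two-word state in the array.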

 ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
 "Connecticut", "Florida", "Georgia", "Idaho", "Illinois", "Indiana",
 "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland",
 "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri",
 "Montana", "Nebraska", "Nevada", "New", "Hampshire", "New", "Jersey",
 "New", "Mexico", "New", "York", "North", "Carolina", "North", "Dakota",
 "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "South", "Carolina",
 "South", "Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia",
 "Washington", "West", "Virginia", "Wisconsin", "Wyoming"]

def scrape_koa_states
  doc = Nokogiri::HTML(open("https://koa.com/campgrounds/"))
  title = doc.search("h4").text
  title_array = title.split
  title_array = title_array.delete_if{|ele| ele == "in"}
  title_array.map {|ele| ele.gsub!("Campgrounds", "")}
  new_array = title_array[2, 56]
  binding.pry
end

title = doc.search("h4").text
=> "KOA NewsletterCampgrounds in AlabamaCampgrounds in AlaskaCampgrounds in ArizonaCampgrounds in ArkansasCampgrounds in CaliforniaCampgrounds in ColoradoCampgrounds in ConnecticutCampgrounds in FloridaCampgrounds in GeorgiaCampgrounds in IdahoCampgrounds in IllinoisCampgrounds in IndianaCampgrounds in IowaCampgrounds in KansasCampgrounds in KentuckyCampgrounds in LouisianaCampgrounds in MaineCampgrounds in MarylandCampgrounds in MassachusettsCampgrounds in MichiganCampgrounds in MinnesotaCampgrounds in MississippiCampgrounds in MissouriCampgrounds in MontanaCampgrounds in NebraskaCampgrounds in NevadaCampgrounds in New HampshireCampgrounds in New JerseyCampgrounds in New MexicoCampgrounds in New YorkCampgrounds in North CarolinaCampgrounds in North DakotaCampgrounds in OhioCampgrounds in OklahomaCampgrounds in OregonCampgrounds in PennsylvaniaCampgrounds in South CarolinaCampgrounds in South DakotaCampgrounds in TennesseeCampgrounds in TexasCampgrounds in UtahCampgrounds in VermontCampgrounds in VirginiaCampgrounds in WashingtonCampgrounds in West VirginiaCampgrounds in WisconsinCampgrounds in WyomingCampgrounds in AlbertaCampgrounds in British ColumbiaCampgrounds in ManitobaCampgrounds in New BrunswickCampgrounds in Newfoundland and LabradorCampgrounds in Nova ScotiaCampgrounds in OntarioCampgrounds in Prince Edward IslandCampgrounds in Quebec"

Tags: arrays, ruby

Solution

There are several problems with the code.

search returns a NodeSet:

doc.search('h4').class # => Nokogiri::XML::NodeSet

A NodeSet is a collection, much like an Array, and text, when called on a NodeSet, concatenates the text of all its nodes into a single string:

doc.search('h4').text[0..40] # => "KOA NewsletterCampgrounds in AlabamaCampg"

Recovering from that is a nightmare, and we see this question asked often because people did not read the documentation.
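To see why recovery is painful, here is a minimal sketch in plain Ruby (no Nokogiri) that mimics what happened in the question. The sample strings, the `merge_two_word_names` helper and its prefix list are illustrative assumptions, not part of the original code:

```ruby
# Hypothetical stand-in for the text of three separate h4 nodes.
node_texts = ["Campgrounds in Nevada", "Campgrounds in New York", "Campgrounds in Ohio"]

# Calling text on a NodeSet behaves like joining the per-node strings
# with no separator, so the node boundaries are lost.
joined = node_texts.join

# Mimic the cleanup from the question: split on whitespace, drop "in",
# strip "Campgrounds", and discard the empty string left behind.
tokens = joined.split
               .reject { |word| word == "in" }
               .map { |word| word.gsub("Campgrounds", "") }
               .reject(&:empty?)
# tokens is now ["Nevada", "New", "York", "Ohio"]

# Re-join a token with the one after it when the token is a known
# first word of a two-word name (assumed prefix list).
def merge_two_word_names(tokens, prefixes = %w[New North South West])
  merged = []
  index = 0
  while index < tokens.length
    word = tokens[index]
    following = tokens[index + 1]
    if prefixes.include?(word) && following
      merged << "#{word} #{following}"
      index += 2
    else
      merged << word
      index += 1
    end
  end
  merged
end

merged = merge_two_word_names(tokens)
# merged is now ["Nevada", "New York", "Ohio"]
```

This only works while every multi-word name is exactly two words long and starts with a listed prefix. Names such as "Prince Edward Island" or "Newfoundland and Labrador", which also appear on the page, would need more special cases, which is why it is better not to join the text in the first place.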

Here is the documentation for NodeSet's text:

Get the inner text of all contained Node objects

Note: This joins the text of all Node objects in the NodeSet:

doc = Nokogiri::XML('<xml><a><d>foo</d><d>bar</d></a></xml>')
doc.css('d').text # => "foobar"

Instead, if you want to return the text of all nodes in the NodeSet:

doc.css('d').map(&:text) # => ["foo", "bar"]

So, as the example shows, use:

doc.search('h4').map(&:text)[0..4] # => ["KOA Newsletter", "Campgrounds in Alabama", "Campgrounds in Alaska", "Campgrounds in Arizona", "Campgrounds in Arkansas"]

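Note that I'm using a slice, [0..4], for now to reduce the output of the array. You don't want to do that. Unless you do.

Next, CSS and XPath have ways to peek inside nodes to see what their children are, so let them. CSS is almost always more readable, so I use it most of the time: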

doc.search('h4:contains("Campgrounds")').map(&:text)[0..4] # => ["Campgrounds in Alabama", "Campgrounds in Alaska", "Campgrounds in Arizona", "Campgrounds in Arkansas", "Campgrounds in California"]
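For comparison, a similar filter can be sketched in plain Ruby on an array of strings that has already been extracted. The sample array below is a hypothetical stand-in for the result of map(&:text), and note one difference: the selector matches the word anywhere in the node's text, while this pattern is anchored to the start of the string:

```ruby
# Hypothetical sample of heading texts, as map(&:text) would return them.
texts = ["KOA Newsletter", "Campgrounds in Alabama", "Campgrounds in Alaska"]

# Keep only the headings that start with the "Campgrounds in " marker.
campground_texts = texts.grep(/\ACampgrounds in /)
# campground_texts is now ["Campgrounds in Alabama", "Campgrounds in Alaska"]
```

Letting the selector do the filtering is usually tidier, but this form can help when the strings come from somewhere other than a parsed document.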

Now cleaning up the results is simple:

doc.search('h4:contains("Campgrounds")').map { |h4| 
  h4.text[15..-1]
} 

Which results in:

# => ["Alabama",
#     "Alaska",
#     "Arizona",
#     "Arkansas",
#     "California",
#     "Colorado",
#     "Connecticut",
#     "Florida",
#     "Georgia",
#     "Idaho",
#     "Illinois",
#     "Indiana",
#     "Iowa",
#     "Kansas",
#     "Kentucky",
#     "Louisiana",
#     "Maine",
#     "Maryland",
#     "Massachusetts",
#     "Michigan",
#     "Minnesota",
#     "Mississippi",
#     "Missouri",
#     "Montana",
#     "Nebraska",
#     "Nevada",
#     "New Hampshire",
#     "New Jersey",
#     "New Mexico",
#     "New York",
#     "North Carolina",
#     "North Dakota",
#     "Ohio",
#     "Oklahoma",
#     "Oregon",
#     "Pennsylvania",
#     "South Carolina",
#     "South Dakota",
#     "Tennessee",
#     "Texas",
#     "Utah",
#     "Vermont",
#     "Virginia",
#     "Washington",
#     "West Virginia",
#     "Wisconsin",
#     "Wyoming",
#     "Alberta",
#     "British Columbia",
#     "Manitoba",
#     "New Brunswick",
#     "Newfoundland and Labrador",
#     "Nova Scotia",
#     "Ontario",
#     "Prince Edward Island",
#     "Quebec"]
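The slice [15..-1] works because "Campgrounds in " is 15 characters long. As a sketch of a variant that names the prefix instead of counting characters, String#delete_prefix (available since Ruby 2.5) can do the same cleanup. The sample array here is a hypothetical stand-in for the heading texts:

```ruby
# Hypothetical sample of heading texts, as map(&:text) would return them.
headings = [
  "Campgrounds in Alabama",
  "Campgrounds in New York",
  "Campgrounds in Prince Edward Island"
]

prefix = "Campgrounds in "

# Remove the leading marker; strings without the prefix are left unchanged.
states = headings.map { |text| text.delete_prefix(prefix) }
# states is now ["Alabama", "New York", "Prince Edward Island"]
```

Unlike a fixed slice, this leaves a string untouched if it does not start with the prefix, which may be easier to debug if the page's wording changes.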
