scala - How do I process this large dataset in Scala with IntelliJ?

Problem description

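I started learning Scala in IntelliJ a few days ago, and I am teaching myself, so please bear with my beginner mistakes. I have a CSV file with more than 10,000 rows and 13 columns.

The column headers are:

Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver

I did manage to read and display the CSV file with the following code: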

import scala.io.Source


object task {
  def main(args: Array[String]): Unit = {
    for(line <- Source.fromFile("D:/data.csv"))
    {
      println(line)
    }
  }
}

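The problem is that this code prints a single letter or digit, moves to the next line, and prints the next letter or digit. It does not print each row of the file on one line.

I want to find the best app in each category (ART_AND_DESIGN, AUTO_AND_VEHICLES, BEAUTY, ...) based on priorities assigned to reviews and ratings. The priorities are defined as 60% for the "Reviews" column and 40% for the "Rating" column. Using these weights, compute a priority value for each app within its category (ART_AND_DESIGN, AUTO_AND_VEHICLES, BEAUTY, ...). That value identifies the best app in each category. The following formula can be used.

priority = ( (((rating/max_rating) * 100) * 0.4) + (((reviews/max_reviews) * 100) * 0.6) )

Here max_rating is the highest rating among the given data in the same category; for example, for category "ART_AND_DESIGN" the maximum rating is "4.7". max_reviews is the highest review count of any app in the same category; for example, for category "ART_AND_DESIGN" the maximum is "295221". So the priority value of the first data record of category "ART_AND_DESIGN" is computed from these values:

rating = 4.1, reviews = 159,

max rating = 4.7, max reviews = 295221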
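
For reference, plugging these numbers into the formula gives a priority of about 34.93. A minimal Scala sketch of the calculation follows; the `priority` helper and its parameter names are only illustrative and not part of the assignment.

```scala
object PriorityExample {
  // Weighted score from the formula above: 40% rating, 60% reviews,
  // each normalised by the maximum within the same category.
  def priority(rating: Double, reviews: Int, maxRating: Double, maxReviews: Int): Double =
    (rating / maxRating) * 100 * 0.4 + (reviews.toDouble / maxReviews) * 100 * 0.6

  def main(args: Array[String]): Unit = {
    // Values of the first ART_AND_DESIGN record quoted above
    val p = priority(4.1, 159, 4.7, 295221)
    println(p) // roughly 34.93
  }
}
```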

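My question is: how can I store each column in an array? That is how I plan to do the calculation. If there is any other way to solve the problem above, I am open to suggestions.

I can upload a small sample of the data if anyone wants it.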

Tags: scala intellij-idea

Solution

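By default, Source gives you an Iterator of characters (it is an Iterator[Char]), which is why your loop prints one character at a time. To iterate over lines, use .getLines: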

 Source.fromFile(fileName)
   .getLines
   .foreach(println)

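To split a line into an array, use split (assuming the column values do not contain the separator). Note that split with a String argument treats it as a regular expression, so pass the pipe as a Char or escape it. If your file is comma-separated, use ',' instead:

  val arrays = Source.fromFile(fileName).getLines.map(_.split('|'))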
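
The reason the separator needs care is that `String.split(String)` interprets its argument as a regular expression, and a bare `|` is the regex alternation operator. A small sketch of two ways to split on a literal pipe; the sample line is made up for illustration:

```scala
object SplitExample {
  // A made-up record, used only for illustration
  val line = "ART_AND_DESIGN|4.1|159"

  // Escaping the pipe makes the regex match a literal '|'
  val byRegex: Array[String] = line.split("\\|")

  // The Char overload treats the separator literally, no escaping needed
  val byChar: Array[String] = line.split('|')

  def main(args: Array[String]): Unit = {
    println(byRegex.mkString(", "))
    println(byChar.mkString(", "))
  }
}
```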

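It is better to avoid raw arrays, though. Creating a case class gives you nicer, more readable code:

   // Adjust the field types to whatever your data actually contains
   case class AppData(
     category: String,
     rating: Double,      // ratings are fractional, e.g. 4.1
     reviews: Int,
     size: Int,
     installs: Int,
     `type`: String,
     price: Double,
     contentRating: Int,
     genres: Seq[String],
     lastUpdated: Long,
     version: String,
     androidVersion: String
  ) {
     // Weighted score: 40% rating, 60% reviews, each relative to the category maximum
     def priority(maxRating: Double, maxReviews: Int): Double =
       if (maxRating == 0 || maxReviews == 0) 0.0 else
         (rating * 0.4 / maxRating + reviews * 0.6 / maxReviews) * 100
  }

  object AppData {
    // Parse one line of the file; the explicit result type is needed
    // because apply is overloaded
    def apply(str: String): AppData = {
       val fields = str.split('|')
       assert(fields.length == 12)
       AppData(
         fields(0),
         fields(1).toDouble,
         fields(2).toInt,
         fields(3).toInt,
         fields(4).toInt,
         fields(5),
         fields(6).toDouble,
         fields(7).toInt,
         fields(8).split(",").toSeq,
         fields(9).toLong,
         fields(10),
         fields(11)
       )
    }
  }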
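
With a file of this size, a single malformed row would make the `assert` or a `toInt` call throw and abort the whole run. One possible alternative is a parser that returns an `Option` so bad rows can be skipped. The sketch below uses a hypothetical, trimmed-down `Row` with only three fields, and assumes the category, rating, and review count are the first three columns:

```scala
import scala.util.Try

object SafeParse {
  // Hypothetical, trimmed-down record: only the fields the priority needs
  final case class Row(category: String, rating: Double, reviews: Int)

  // Returns None instead of throwing when a line has too few fields
  // or a field that cannot be converted to a number
  def parse(line: String): Option[Row] =
    line.split('|') match {
      case Array(category, rating, reviews, _*) =>
        Try(Row(category, rating.toDouble, reviews.toInt)).toOption
      case _ => None
    }
}
```

Applied to the lines of a file, `lines.flatMap(SafeParse.parse)` would keep only the rows that parsed successfully.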

Now you can do what you want quite neatly:

  // Read the data, parse it and group by category
  // This gives you a map of categories to a seq of apps
  val byCategory = Source.fromFile(fileName)
    .getLines
    .drop(1)            // skip the header row, if the file has one
    .map(AppData(_))
    .toList             // Iterator has no groupBy, so materialize first
    .groupBy(_.category)

  // Now, find out max ratings and reviews for each category
  // This could be done even nicer with another case class and
  // a monoid, but tuple/fold will do too
  // It is tempting to use `.mapValues` here, but that's not a good idea
  // because .mapValues is LAZY, it will recompute the max every time
  // the value is accessed!
  val maxes = byCategory.map { case (cat, data) =>
     cat ->
        data.foldLeft(0.0 -> 0) { case ((maxRating, maxReviews), in) =>
          (maxRating max in.rating, maxReviews max in.reviews)
        }
  }.withDefault( _ => (0.0, 0))

  // And finally go through your categories, and find best for each,
  // that's it!
  val bestByCategory = byCategory.map { case (cat, apps) =>
    val (maxRating, maxReviews) = maxes(cat)
    cat -> apps.maxBy(_.priority(maxRating, maxReviews))
  }
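
To try the grouping and ranking logic without the CSV file, here is a self-contained sketch on in-memory data. The `App` class is a hypothetical, trimmed-down stand-in for the full record, and the app names and numbers are invented for illustration:

```scala
object BestAppDemo {
  // Hypothetical, trimmed-down record with only the fields the ranking needs
  final case class App(name: String, category: String, rating: Double, reviews: Int) {
    // 40% rating, 60% reviews, each relative to the category maximum
    def priority(maxRating: Double, maxReviews: Int): Double =
      if (maxRating == 0.0 || maxReviews == 0) 0.0
      else (rating * 0.4 / maxRating + reviews * 0.6 / maxReviews) * 100
  }

  // Made-up sample data
  val apps: List[App] = List(
    App("Sketchy", "ART_AND_DESIGN", 4.1, 159),
    App("Colorful", "ART_AND_DESIGN", 4.7, 87510),
    App("Doodler", "ART_AND_DESIGN", 3.9, 295221),
    App("CarFinder", "AUTO_AND_VEHICLES", 4.5, 1200),
    App("FuelLog", "AUTO_AND_VEHICLES", 4.9, 300)
  )

  // For each category, compute the maxima and pick the app with the highest priority
  val bestByCategory: Map[String, App] =
    apps.groupBy(_.category).map { case (cat, group) =>
      val maxRating = group.map(_.rating).max
      val maxReviews = group.map(_.reviews).max
      cat -> group.maxBy(_.priority(maxRating, maxReviews))
    }

  def main(args: Array[String]): Unit =
    bestByCategory.foreach { case (cat, app) => println(s"$cat -> ${app.name}") }
}
```

Because reviews carry 60% of the weight, the app with the most reviews tends to win unless its rating is far below the category maximum.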
