java - spark mlib ALS 算法的 Maven 依赖地狱
问题描述
我有一小段 java 代码来获取 apache spark 建议:
public class Main { public static class Rating实现Serializable { private int userId; 私人 int 电影 ID;私人浮动评级;私人长时间戳;
public Rating() {}
public Rating(int userId, int movieId, float rating, long timestamp) {
this.userId = userId;
this.movieId = movieId;
this.rating = rating;
this.timestamp = timestamp;
}
public int getUserId() {
return userId;
}
public int getMovieId() {
return movieId;
}
public float getRating() {
return rating;
}
public long getTimestamp() {
return timestamp;
}
public static Rating parseRating(String str) {
String[] fields = str.split(",");
if (fields.length != 4) {
throw new IllegalArgumentException("Each line must contain 4 fields");
}
int userId = Integer.parseInt(fields[0]);
int movieId = Integer.parseInt(fields[1]);
float rating = Float.parseFloat(fields[2]);
long timestamp = Long.parseLong(fields[3]);
return new Rating(userId, movieId, rating, timestamp);
}
}
static String parse(String str) {
Pattern pat = Pattern.compile("\\[[0-9.]*,[0-9.]*]");
Matcher matcher = pat.matcher(str);
int count = 0;
StringBuilder sb = new StringBuilder();
while (matcher.find()) {
count++;
String substring = str.substring(matcher.start(), matcher.end());
String itstr = substring.split(",")[0].substring(1);
sb.append(itstr + " ");
}
return sb.toString().trim();
}
static TreeMap<Long, String> res = new TreeMap<>();
public static void add(long k, String v) {
res.put(k, v);
}
public static void main(String[] args) throws IOException {
Logger.getLogger("org").setLevel(Level.OFF);
Logger.getLogger("akka").setLevel(Level.OFF);
SparkSession spark = SparkSession
.builder()
.appName("SomeAppName")
.config("spark.master", "local")
.getOrCreate();
JavaRDD<Rating> ratingsRDD = spark
.read().textFile(args[0]).javaRDD()
.map(Rating::parseRating);
Dataset<Row> ratings = spark.createDataFrame(ratingsRDD, Rating.class);
ALS als = new ALS()
.setMaxIter(1)
.setRegParam(0.01)
.setUserCol("userId")
.setItemCol("movieId")
.setRatingCol("rating");
ALSModel model = als.fit(ratings);
model.setColdStartStrategy("drop");
Dataset<Row> rowDataset = model.recommendForAllUsers(50);
rowDataset.foreach((ForeachFunction<Row>) row -> {
String str = row.toString();
long l = Long.parseLong(str.substring(1).split(",")[0]);
add(l, parse(str));
});
BufferedWriter bw = new BufferedWriter(new FileWriter(args[1]));
for (long l = 0; l < res.lastKey(); l++) {
if (!res.containsKey(l)) {
bw.write("\n");
continue;
}
String str = res.get(l);
bw.write(str);
}
bw.close();
}
}
我正在我的 pom.xml 中尝试不同的依赖项以使其运行,但所有变体都失败了。这个:
<dependency>
<groupId>com.sparkjava</groupId>
<artifactId>spark-core</artifactId>
<version>2.8.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.12</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.6.4</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.6.4</version>
</dependency>
java.lang.ClassNotFoundException: text.DefaultSource失败,修复它我添加
org.apache.spark spark-sql-kafka-0-10_2.10 2.0.2
现在它与ClassNotFoundException: org.apache.spark.internal.Logging$class一起崩溃,为了修复它,我添加了另一个:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.2.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>2.2.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>2.2.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
<version>2.2.2</version>
</dependency>
现在它失败了java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce来修复它我尝试了十几种其他组合,它们都失败了,最后一个是
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.2.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>2.2.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.2.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.2.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
<version>2.2.2</version>
</dependency>
这又给了我ClassNotFoundException: text.DefaultSource,我该如何解决?在 Spark 中实现运行时链接背后有什么逻辑吗?
UPD:也试过
<dependencies>
<dependency> <!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.0.1</version>
</dependency>
<dependency> <!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>2.0.1</version>
</dependency>
<dependency> <!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.0.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.0.1</version>
</dependency>
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>spark-streaming-twitter_2.11</artifactId>
<version>2.0.1</version>
</dependency>
</dependencies>
(这仍然给我java.lang.ClassNotFoundException: text.DefaultSource))
我还尝试了在这个问题中发布的依赖项,但它们也失败了:Resolving dependency questions in Apache Spark
源代码在此处提供,因此您可以自己尝试各种 maven 设置:https ://github.com/stiv-yakovenko/sparkrec
解决方案
最后我能够让它工作:
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
</dependencies>
您必须使用这些确切的版本,否则它将以多种方式崩溃。
推荐阅读
- c - c 程序在代码块上运行但不在 clion 上运行,这怎么可能?
- angular - 安装 Angular CLI [问题:错误消息]
- eclipse - 编辑框/Nodeclipse EditBox 插件不再适用于 Eclipse 4.18.0
- css - 带标题的图像 - “图形”框不适应图像大小,更改图像大小,长标题不换行,
- java - Spring Boot弹性 - 设置默认日期格式
- multithreading - 在 scoped_threshhold 之外的 Rust 处理变量
- javascript - 在函数中返回某些内容后如何重新渲染反应组件?
- azure - Kusto 将 ID 转换为名称
- excel - 是否有使用 vba 复制另一个工作簿的防弹方法?
- javascript - 在缩放期间在时间刻度上设置刻度