pyspark - Running wordcount in local mode: lines = sc.textFile('../spark_test.txt'); lines.count()
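Problem description
I am running pyspark in local mode to learn the wordcount example. I created a spark_test.txt file in the current directory, then ran the pyspark commands lines = sc.textFile('../spark_test.txt') and lines.count(), which threw an error:
Input path does not exist: file:/Users/lisl/myproject/spark_test.txt
**What I tried:**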
sc.textFile('spark_test.txt')
lines = sc.textFile('../spark_test.txt')
lines.count()
I expected the output to be the number of lines in the file, but the actual output was:
file "/Users/lisl/.pyenv/versions/3.7.1/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/lisl/myproject/learn_test/spark_test.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:55)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:567)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:835)
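Solution
First, a few lines of code are missing. Try the code below, and make sure that the path to your file exists and that you have granted read permission on it to all users (chmod a+r yourFile).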
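# Read the input file into an RDD of lines
text_file = sc.textFile("path/to/your/input/file")
# Split each line into words, pair each word with 1, then sum the counts per word
wordCounts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
# Write the (word, count) pairs to the output directory
wordCounts.saveAsTextFile("path/to/your/output/file")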
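To rule out a path or permission problem before Spark gets involved, it can help to resolve the path in plain Python first. The two error messages in the question suggest that relative paths are being resolved against the directory pyspark was started from, so a relative name may not point where you expect. The following is only a minimal sketch: `resolve_input_path` is a hypothetical helper, not part of PySpark, and it assumes the file is on the local filesystem of the machine running the driver.

```python
import os
from pathlib import Path


def resolve_input_path(path):
    """Return a file:// URI for `path`, or raise if it is missing or unreadable.

    Hypothetical helper for illustration; not part of PySpark.
    """
    # Relative paths are resolved against the current working directory.
    absolute = Path(path).resolve()
    if not absolute.is_file():
        raise FileNotFoundError(f"Input path does not exist: {absolute}")
    if not os.access(absolute, os.R_OK):
        raise PermissionError(f"Input path is not readable: {absolute}")
    # An absolute file:// URI leaves no ambiguity about which file is meant.
    return absolute.as_uri()


# Usage in the pyspark shell, where `sc` already exists:
# lines = sc.textFile(resolve_input_path("spark_test.txt"))
# lines.count()
```

If the helper raises FileNotFoundError, the message shows the absolute path that was actually checked, which you can compare with where spark_test.txt really lives.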