Converting a large txt.gz file into a sqlcontext dataframe object in pyspark for text analysis

Problem Description

I'm still trying to learn pyspark, and it's almost like a foreign language to me. So I downloaded a large text file of Amazon reviews.

from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName('app')
sc = SparkContext(conf=conf)
from pyspark.sql import SQLContext

Here is what the raw file looks like:

sc.textFile('/home/john/Downloads/Software.txt.gz').take(12)

['product/productId: B000068VBQ',
 'product/title: Fisher-Price Rescue Heroes: Lava Landslide',
 'product/price: 8.88',
 'review/userId: unknown',
 'review/profileName: unknown',
 'review/helpfulness: 11/11',
 'review/score: 2.0',
 'review/time: 1042070400',
 'review/summary: Requires too much coordination',
 "review/text: I bought this software for my 5 year old. He has a couple of the other RH software games and he likes them a lot. This game, however, was too challenging for him. The biggest problem I see is that the game requires the child to be able to maneuver the vehicle using all 4 scroll keys on the keyboard. During one exercise, which by the way you can't get to the next level until you complete this exercise, the game requires that you use the keys to move while watching out for falling lava rocks and clouds, monitor a fuel gauge, watch arrow indicators that help you determine where objects are in the arena below, and watch a scope that shows animals when you're hovering over the top of them.I tried to perform this exercise myself and got frustrated. It's just too hard to expect even a 7 year old to complete this exercise let alone a 5 year old.There are some exercises he can complete himself but they mostly require using the left, right keys.I don't know who this game would be good for. Parts of it would be too easy for someone 7 or older. Yet some parts are too difficult for those younger than that.",
 '',
 'product/productId: B000068VBQ']

Note that we have "product/productId", "product/title", and so on. There are 10 different fields, then an empty line (the '' entry), and then the fields repeat again with "product/productId", "product/title", etc.

I created a parser to separate out the items to the right of the colon:

def parse(x):
    # split on the first colon only, so colons inside the review text survive
    result = x.split(':', 1)
    if len(result) == 1:
        return "SEPARATOR"   # the blank line between records
    else:
        return result[1]

Data1 = sc.textFile('/home/john/Downloads/Software.txt.gz').map(parse)
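One pitfall worth noting (a standalone sketch with a made-up review line, not from the file above): splitting on every colon truncates any review text that itself contains a colon, which is why splitting on the first colon only is safer:

```python
line = "review/text: Great game: my son loves it"

# splitting on every ':' loses everything after the second colon
truncated = line.split(':')[1]      # ' Great game'

# splitting on the first ':' only keeps the full text
full = line.split(':', 1)[1]        # ' Great game: my son loves it'

print(truncated)
print(full)
```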

I used the same function, but with return result[0], to get the values to the left of the colon, and assigned that to Data0.

I figured that zipping them together and then doing a groupByKey would combine all the entries for "product/productId", "product/title", ... "review/text" into key-value pairs. In a sense, it worked:

zipped = Data0.zip(Data1)
f = sorted(zipped.groupByKey().mapValues(list).collect())
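Conceptually, the zip + groupByKey step does the following (a plain-Python sketch with made-up sample values, not the Spark API):

```python
from collections import defaultdict

# stand-ins for Data0 (keys) and Data1 (values), in record order
keys = ['product/price', 'review/score', 'product/price', 'review/score']
values = [' 8.88', ' 2.0', ' 3.99', ' 5.0']

# zip pairs each key with its value; groupByKey collects values per key
grouped = defaultdict(list)
for k, v in zip(keys, values):
    grouped[k].append(v)

f = sorted(grouped.items())
print(f)
# [('product/price', [' 8.88', ' 3.99']), ('review/score', [' 2.0', ' 5.0'])]
```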

Now f is a list of tuples, and it looks like this:

('product/price',
 [' 8.88',
  ' 8.88',
  ' 8.88',
  ' 8.88',
  ' 8.88',
  ' 8.88',
  ' 8.88',
  ' unknown',
  ' unknown',
  ' unknown',
  ' unknown',
   .
   .
   .

What I really want to do is get this into a SQLContext dataframe, with the keys as column names and all the entries on the right as that column's entries. I haven't been able to find an example of this, or any documentation in Spark on how to do it. I can easily turn it back into an RDD object,

rdd = sc.parallelize(f)

which gives me the key-value pairs, but how do I now turn that into a Spark dataframe?

Tags: python-3.x, pyspark, apache-spark-sql, nlp

Solution
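One possible approach (a sketch under assumptions, not a verified answer): if every value list in f has the same length and preserves record order, you can transpose the (column, values) pairs into one dict per record and pass those to createDataFrame. The transpose itself is plain Python; the sample values below are illustrative:

```python
# f as built above: list of (column_name, list_of_values) tuples,
# one value per record, all lists the same length
f = [('product/price', [' 8.88', ' unknown']),
     ('product/productId', [' B000068VBQ', ' B000068VBQ'])]

cols = [name for name, _ in f]
columns = [vals for _, vals in f]

# transpose: one dict per record, keyed by column name
rows = [dict(zip(cols, record)) for record in zip(*columns)]
print(rows[0])
# {'product/price': ' 8.88', 'product/productId': ' B000068VBQ'}

# with a SQLContext available (sqlContext = SQLContext(sc)), this
# list of dicts can then be turned into a dataframe:
# df = sqlContext.createDataFrame(rows)
```

Note this only works if the groupByKey preserved a consistent order across all ten columns; sorting with collect() as above does not guarantee that, so a more robust route would be to group the raw lines into records first and build rows directly.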

