Converting a text file from a string to a list

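Problem description

I need help converting this text file ( https://www.gutenberg.org/files/768/768.txt ) from a string to a list in Google Colab. The text should start after "ccx074@pglaf.org" and end before "END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS", so that I get an accurate total word count. My code so far is listed below.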

# Download and install pyspark in Colab
!pip install -q pyspark

# download Wuthering Heights, by Emily Bronte
!wget -q https://www.gutenberg.org/files/768/768.txt

from pyspark import SparkContext
sc = SparkContext()

import os.path
baseDir = os.path.join('data')
inputPath = os.path.join('/content/768.txt')
# inputPath is absolute, so os.path.join discards baseDir and fileName is '/content/768.txt'
fileName = os.path.join(baseDir, inputPath)
with open(fileName, 'r') as f:
    text = f.read()

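# Get the start location: the index just past the e-mail address
start_loc = text.find("ccx074@pglaf.org") + len("ccx074@pglaf.org")

# Get the end location, relative to start_loc: the first "***" after the start
end_loc = text[start_loc:].find("***")

# Slice the text between the two locations and replace newlines with spaces
text[start_loc:start_loc+end_loc].replace("\n", " ")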

Tags: python, google-colaboratory

Solution
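One possible approach, offered as a minimal sketch and not a definitive answer: locate both markers with `str.find`, slice the text between them, and call `str.split()` with no arguments to get a list of words. The helper names `extract_body` and `to_word_list` are hypothetical, and the short `sample` string stands in for the downloaded file so the sketch runs on its own. The marker strings are taken from the question and are assumed to appear verbatim (same case) in the text.

```python
def extract_body(text, start_marker, end_marker):
    """Return the text strictly between start_marker and end_marker."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    start += len(start_marker)          # begin just past the start marker
    end = text.find(end_marker, start)  # search only after the start marker
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return text[start:end]


def to_word_list(text):
    """Split on any run of whitespace (spaces, tabs, newlines)."""
    return text.split()


# Stand-in for the contents of 768.txt (an assumption made for illustration).
sample = (
    "header ccx074@pglaf.org\n"
    "I have just returned\nfrom a visit\n"
    "END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS footer"
)
START = "ccx074@pglaf.org"
END = "END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS"

words = to_word_list(extract_body(sample, START, END))
print(words)       # the list of words between the two markers
print(len(words))  # total word count
```

To apply this to the real file, pass the `text` read from `/content/768.txt` in place of `sample`. Because `split()` separates on whitespace only, punctuation stays attached to words, and any stray tokens that sit directly before the end marker (for example a leading `***`) would be counted as words unless they are filtered out first.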

