python - 如何将抓取功能应用于数据框中的整个列
问题描述
这将是一篇很长的文章,因为我面临的问题嵌入在一个更大的项目中——感谢任何花时间阅读本文的人。
基本上,我正在抓取维基百科:精选文章页面。这个页面上有数百个文章链接,我已经成功地从这个页面编译了一个传记文章列表。使用了以下代码:
def __api_GET_latest_page(title):
parameters = {
"action": "parse",
"page": title,
"format": "json"
}
response_json = __get("revisions", title, parameters)
if("parse" in response_json.keys()
and "text" in response_json["parse"].keys()
and "*" in response_json["parse"]["text"].keys()):
return response_json["parse"]["text"]["*"]
return None
def __get(function_key, key, parameters, check_cache=True, write_cache=True):
target = "https://en.wikipedia.org/w/api.php"
cache_path = "cached_api"
params_unicode = str(parameters).encode('utf-8')
md5 = hashlib.md5(params_unicode).hexdigest()
return_json = None
cache_file = os.path.join(cache_path, function_key, str(key), md5)
cache_exists = os.path.isfile(cache_file)
if cache_exists:
try:
json_in = open(cache_file, "r")
json_str = json_in.read()
return_json = json.loads(json_str)
if "error" in return_json.keys() and "code" in return_json["error"].keys() and return_json["error"]["code"]=="maxlag":
cache_exists = False
except:
cache_exists = False
if not cache_exists:
cache_dir = os.path.dirname(cache_file)
if not os.path.isdir(cache_dir):
os.makedirs(cache_dir)
r = requests.get(target, params=parameters)
request_json = r.json()
json_out = open(cache_file, "w")
print(json.dumps(request_json), file=json_out)
return_json = request_json
return return_json
def __remove_tables_and_scripts(tree):
tags_to_remove = ["tbody", "td", "script"]
for tag in tags_to_remove:
elements = tree.find(f".//{tag}")
if elements is not None:
for e in elements:
e.getparent().remove(e)
return tree
def page_text(name, format, include_tables = False):
try:
result = __api_GET_latest_page(name)
except:
print("API request failed.")
if result:
e = etree.fromstring(result)
if not include_tables:
e = __remove_tables_and_scripts(e)
if format == "html":
return str(etree.tostring(e))
elif format == "text":
return ''.join(e.itertext())
elif format == "list":
return ''.join(e.itertext()).split('\n')
else:
print("Failed to retrieve a page.")
return None
上面的代码,特别是 page_text() 函数,获取任何维基百科页面的纯文本并将结果缓存在本地。无论如何,使用以下代码,我得到了维基百科特色文章页面中所有文章标题的列表,这些文章标题是传记:
def get_featured_biographies(t):
titles = page_text("Wikipedia:Featured articles", "list")
titles = titles[40: ]
titles = titles[:-7]
titles = list(filter(lambda x: x != '', titles))
list_featured_biographies = []
boolean = False
for elem in t:
if ('[edit]' in elem) and ('biographies' in elem) | ('Biographies' in elem):
boolean = True
continue
elif ('[edit]' in elem) and ('biographies' not in elem):
boolean = False
if boolean:
list_featured_biographies = list_featured_biographies + [elem]
else:
continue
return list_featured_biographies
list_featured_biographies = get_featured_biographies(titles)
这是一个输出示例:
这是我遇到问题的地方。我需要编写一个函数来抓取我创建的列表中的特色文章传记标题的所有单独页面。具体来说,我需要编写一个函数来提取每个传记的第一段。我使用以下代码成功完成了这项任务:
for title in list_featured_biographies:
page_content = page_text(title, "list")
list_of_values_with_keywords = []
for value in page_content:
if ('was a' in value) | ('was an ' in value) | ('is a ' in value) | ('is an ' in value):
list_of_values_with_keywords.append(value)
first_paragraph = list_of_values_with_keywords[0]
print(first_paragraph)
所以布朗温·班克罗夫特、费利斯·比托和让·贝莱特是前三个名字。以下屏幕截图显示了前三个名称的输出。
如您所见,我的输出本质上是第一段的列表。我想将这些信息组织成一个两列数据框,第一列是文章标题的名称,第二列是文章的第一段。以下代码在尝试实现此目的时遇到错误:
title2_list = []
list_of_first_para = []
for title in list_featured_biographies:
page_content = page_text(title, "list")
title2_list.append(title)
list_of_values_with_keywords = []
for value in page_content:
if ('was a' in value) | ('was an ' in value) | ('is a ' in value) | ('is an ' in value):
list_of_values_with_keywords.append(value)
first_paragraph = list_of_values_with_keywords[0]
list_of_first_para.append(first_paragraph)
data2_for_df = {'Article_Title':title2_list, 'First_Paragraph':list_of_first_para}
wiki1para_df = pd.DataFrame(data2_for_df)
print(wiki1para_df)
这是我遇到的错误:
IndexError Traceback (most recent call last)
<ipython-input-317-f36585876409> in <module>
13 return first_paragraph
14
16 print(first_paragraph)
<ipython-input-317-f36585876409> in get_first_paragraph(list)
9 list_of_values_with_keywords.append(value)
10
---> 11 first_paragraph = list_of_values_with_keywords[0]
12
13 return first_paragraph
IndexError: list index out of range
解决方案
推荐阅读
- c++ - OpenCV Mat 矩阵和 Eigen 矩阵之间的数据不匹配
- django - React 前端的 Rest-framework ID 问题以“颜色是,颜色否”模式显示数据
- html - Vuejs-在当前页面中POST请求后获取对象ID
- c++ - 如何定义一个可以结构化绑定的对象的概念?
- python - 如何安装蟒蛇 Python
- java - R.java 文件丢失/未在 android studio 中生成
- r - R 中德语变音符号 ä、ö、ü 和 ß 的编码问题
- merge - 检查 Stylecop 警告是否已减少
- sql - 在 RDB 中设计一个表来销售多个项目,并计算销售的价值
- c# - 如何在 .editorconfig 的 CSharp 编辑器的嵌套范围内更喜欢完全限定的“使用名称”