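python - How to merge the AWS Comprehend batch_detect_key_phrases() ResultList and ErrorList
Problem description
I have a data frame of tweets, one tweet per row. I can get the key phrases with AWS Comprehend batch_detect_key_phrases(), which returns a ResultList and an ErrorList in its payload. To merge the key phrase results back into the data frame, they need to line up with the original tweets, so I need to keep the ResultList and ErrorList aligned.
The code at line 267 processes the ErrorList and the ResultList separately.
According to the Python Boto documentation, "ErrorList (list) - A list containing one object for each document that contained an error. The results are sorted in ascending order by the Index field and match the order of the documents in the input list. ..."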
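To make the quoted behaviour concrete, here is a minimal sketch of how an Index value points back at the input list. It assumes Index is the zero-based position of the document in the list that was sent; the `texts` list, the response contents, and the `source_of` name are all made up for illustration.

```python
# Hypothetical input list and a made-up response for it.
texts = ["first tweet", "second tweet", "third tweet"]
response = {
    "ResultList": [
        {"Index": 0, "KeyPhrases": [{"Text": "first tweet"}]},
        {"Index": 2, "KeyPhrases": [{"Text": "third tweet"}]},
    ],
    "ErrorList": [{"Index": 1, "ErrorCode": "123", "ErrorMessage": "made-up error"}],
}

# Trace every entry, successful or not, back to the text it came from.
source_of = {}
for entry in response["ResultList"]:
    source_of[entry["Index"]] = ("ok", texts[entry["Index"]])
for entry in response["ErrorList"]:
    source_of[entry["Index"]] = ("error", texts[entry["Index"]])

for index in sorted(source_of):
    print(index, *source_of[index])
```

Between them, the two lists account for every input position exactly once, which is what makes realignment possible.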
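The code I wrote below uses the ResultList and ErrorList index numbers to make sure they are merged correctly into a single keyPhrases list, which is then merged back into the original data frame. Essentially, keyPhrases[0] holds the key phrases associated with row 0 of the data frame. If an error occurred while processing a tweet, a placeholder error message is added to that row instead.
The only other way I can think of to keep the ResultList and ErrorList aligned is to combine the two lists into one bigger list sorted in ascending order by their respective indexes, and then process that single list.
Is there a simpler way to process the ResultList and ErrorList so that they stay aligned?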
keyphraseResults = {'ResultList': [
{'Index': 0, 'KeyPhrases': [{'Score': 0.9999997615814209, 'Text': 'financial status', 'BeginOffset': 26, 'EndOffset': 42}, {'Score': 1.0, 'Text': 'my job', 'BeginOffset': 58, 'EndOffset': 64}, {'Score': 1.0, 'Text': 'title', 'BeginOffset': 69, 'EndOffset': 71}, {'Score': 1.0, 'Text': 'a new job', 'BeginOffset': 77, 'EndOffset': 86}]},
{'Index': 1, 'KeyPhrases': [{'Score': 0.9999849796295166, 'Text': 'Holy moley', 'BeginOffset': 0, 'EndOffset': 4}, {'Score': 1.0, 'Text': 'Batman', 'BeginOffset': 27, 'EndOffset': 29}, {'Score': 1.0, 'Text': 'has a jacket', 'BeginOffset': 47, 'EndOffset': 55}]},
{'Index': 3, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'USA', 'BeginOffset': 4, 'EndOffset': 7}]},
{'Index': 5, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'home town', 'BeginOffset': 6, 'EndOffset': 15}]}],
'ErrorList': [{"ErrorCode": "123", "ErrorMessage": "First error goes here", "Index": 2},
{"ErrorCode": "456", "ErrorMessage": "Second error goes here", "Index": 4}],
'ResponseMetadata': {'RequestId': '123b6c73-45e0-4595-b943-612accdef41b', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '123b6c73-e5f7-4b95-b52s-612acc71341d', 'content-type': 'application/x-amz-json-1.1', 'content-length': '1125', 'date': 'Sat, 06 Jun 2020 20:38:04 GMT'}, 'RetryAttempts': 0}}
# Holds the ordered list of key phrases that correspond to the data frame.
keyPhrases = []
# Start with an arbitrarily large index so that, if ErrorList below is empty,
# there is still a number to compare against.
errIndexlist = [9999]
# This will be inserted for the rows corresponding to the ErrorList.
ErrorMessage = "* Error processing keyphrases"
# Since the rows of the response need to be kept in alignment with the rows of the dataframe,
# get the error indices first, if any. These will be compared to the ResultList below.
if 'ErrorList' in keyphraseResults and len(keyphraseResults['ErrorList']) > 0:
    batchErroresults = keyphraseResults["ErrorList"]
    errIndexlist = []
    for entry in batchErroresults:
        errIndexlist.append(entry["Index"])
        print(entry)
    # Sort the indices to ensure they are in ascending order since that order is
    # important for the logic below.
    errIndexlist.sort(reverse = False)
if 'ResultList' in keyphraseResults:
    batchResults = keyphraseResults["ResultList"]
    for entry in batchResults:
        resultDict = entry["KeyPhrases"]
        if len(errIndexlist) > 0:
            if entry['Index'] < errIndexlist[0]:
                results = ""
                for textDict in resultDict:
                    results = results + ", " + textDict['Text']
                # Remove the leading comma.
                if len(results) > 1:
                    results = results[2:]
                keyPhrases.append(results)
            else:
                # Else we have an error to merge from the PRIOR result.
                keyPhrases.append(ErrorMessage)
                errIndexlist.remove(errIndexlist[0])
                # THEN add the key phrase for the current result.
                results = ""
                for textDict in resultDict:
                    results = results + ", " + textDict['Text']
                # Remove the leading comma.
                if len(results) > 1:
                    results = results[2:]
                keyPhrases.append(results)
print("\nFinal results are:")
for text in keyPhrases:
    print(text)
Solution
I figured it out based on this SO post.
In short: merge the ResultList and ErrorList, sort the merged list on Index, and then process the merged list sequentially.
from operator import itemgetter
keyphraseResults = {'ResultList': [
{'Index': 0, 'KeyPhrases': [{'Score': 0.9999997615814209, 'Text': 'financial status', 'BeginOffset': 26, 'EndOffset': 42}, {'Score': 1.0, 'Text': 'my job', 'BeginOffset': 58, 'EndOffset': 64}, {'Score': 1.0, 'Text': 'title', 'BeginOffset': 69, 'EndOffset': 71}, {'Score': 1.0, 'Text': 'a new job', 'BeginOffset': 77, 'EndOffset': 86}]},
{'Index': 1, 'KeyPhrases': [{'Score': 0.9999849796295166, 'Text': 'Holy moley', 'BeginOffset': 0, 'EndOffset': 4}, {'Score': 1.0, 'Text': 'Batman', 'BeginOffset': 27, 'EndOffset': 29}, {'Score': 1.0, 'Text': 'has a jacket', 'BeginOffset': 47, 'EndOffset': 55}]},
{'Index': 3, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'USA', 'BeginOffset': 4, 'EndOffset': 7}]},
{'Index': 5, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'home town', 'BeginOffset': 6, 'EndOffset': 15}]}],
'ErrorList': [{"ErrorCode": "123", "ErrorMessage": "First error goes here", "Index": 2},
{"ErrorCode": "456", "ErrorMessage": "Second error goes here", "Index": 4}],
'ResponseMetadata': {'RequestId': '123b6c73-45e0-4595-b943-612accdef41b', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '123b6c73-e5f7-4b95-b52s-612acc71341d', 'content-type': 'application/x-amz-json-1.1', 'content-length': '1125', 'date': 'Sat, 06 Jun 2020 20:38:04 GMT'}, 'RetryAttempts': 0}}
keyPhrases = []
# This will be inserted for the rows in ErrorList or just make it empty.
ErrorMessage = "* Error processing keyphrases"
if len(keyphraseResults["ResultList"]) > 0 and len(keyphraseResults["ErrorList"]) > 0:
    processResults = keyphraseResults["ResultList"].copy() + keyphraseResults["ErrorList"].copy()
elif len(keyphraseResults["ResultList"]) > 0:
    processResults = keyphraseResults["ResultList"].copy()
else:
    processResults = keyphraseResults["ErrorList"].copy()
processResults = sorted(processResults, key=itemgetter('Index'), reverse = False)
for entry in processResults:
    if 'ErrorCode' in entry:
        keyPhrases.append(ErrorMessage)
    elif 'KeyPhrases' in entry:
        resultDict = entry["KeyPhrases"]
        results = ""
        for textDict in resultDict:
            results = results + ", " + textDict['Text']
        # Remove the leading comma.
        if len(results) > 2:
            results = results[2:]
        keyPhrases.append(results)
print("\nFinal results are:")
for text in keyPhrases:
    print(text)
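As a possible alternative to merging and sorting, the Index field can be used directly as a list position. This is a different technique from the one above, sketched under the assumption that Index is the zero-based position of the document in the input list and that the number of submitted documents is known. The function name `align_key_phrases`, its `n_docs` parameter, and the sample data are hypothetical.

```python
def align_key_phrases(response, n_docs, error_message="* Error processing keyphrases"):
    """Return a list of length n_docs aligned with the input documents."""
    # Start with the placeholder in every slot; only the documents that
    # appear in ResultList are overwritten, so errors keep the placeholder.
    aligned = [error_message] * n_docs
    for entry in response.get("ResultList", []):
        aligned[entry["Index"]] = ", ".join(p["Text"] for p in entry["KeyPhrases"])
    return aligned


# Made-up response for four input documents; documents 1 and 3 failed.
sample = {
    "ResultList": [
        {"Index": 0, "KeyPhrases": [{"Text": "my job"}, {"Text": "a new job"}]},
        {"Index": 2, "KeyPhrases": [{"Text": "USA"}]},
    ],
    "ErrorList": [
        {"Index": 1, "ErrorCode": "123", "ErrorMessage": "First error goes here"},
        {"Index": 3, "ErrorCode": "456", "ErrorMessage": "Second error goes here"},
    ],
}

aligned = align_key_phrases(sample, n_docs=4)
print(aligned)
```

With a data frame whose rows were sent in order (call it `df`, a hypothetical name), the returned list has one entry per row and could be assigned as a new column, for example `df["keyPhrases"] = align_key_phrases(keyphraseResults, len(df))`. This version never reads ErrorList, so an error in the last position still gets its placeholder.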