python - Identify all instances of problematic quotation marks
Problem description
I have a (properly formed) large string variable that I turn into lists of dictionaries. I iterate over the massive string, split by newline characters, and run list(eval(i)) on each line. This works for the majority of cases, but for every exception thrown, I add the 'malformed' string to a failed_attempt array. I have been inspecting the failed cases for an hour now, and I believe they fail whenever there is an extra quotation mark that is not part of the dictionary's keys. For example,
eval('''[{"question":"What does "AR" stand for?","category":"DFB","answers":["Assault Rifle","Army Rifle","Automatic Rifle","Armalite Rifle"],"sources":["https://www.npr.org/2018/02/28/588861820/a-brief-history-of-the-ar-15"]}]''')
This fails because there are double quotation marks around "AR". If you replace them with single quotation marks, e.g.
eval('''[{"question":"What does 'AR' stand for?","category":"DFB","answers":["Assault Rifle","Army Rifle","Automatic Rifle","Armalite Rifle"],"sources":["https://www.npr.org/2018/02/28/588861820/a-brief-history-of-the-ar-15"]}]''')
It now succeeds.
Similarly:
eval('''[{"question":"Test Question, Test Question?","category":"DFB","answers":["2004","1930","1981","This has never occurred"],"sources":[""SOWELL: Exploding myths""]}]''')
This fails because of the extra quotes around "SOWELL: Exploding myths", but again succeeds if you replace them with single quotes.
So I need a way to identify quotes that appear anywhere other than around the keys of the dictionary (question, category, sources) and replace them with single quotes. I'm not sure of the right way to do this.
@Wiktor's submission nearly does the trick, but will fail on the following:
example = '''[{"question":"Which of the following is NOT considered to be "interstate commerce" by the Supreme Court, and this cannot be regulated by Congress?","category":"DFB","answers":["ANSWER 1","ANSWER 2","ANSWER 3","All of these are considered "Interstate Commerce""],"sources":["SOURCE 1","SOURCE 2","SOURCE 3"]}]'''
re.sub(r'("\w+":[[{]*")(.*?)("(?:,|]*}))', lambda x: "{}{}{}".format(x.group(1),x.group(2).replace('"', "'"),x.group(3)), example)
Out[170]: '[{"question":"Which of the following is NOT considered to be \'interstate commerce\' by the Supreme Court, and this cannot be regulated by Congress?","category":"DFB","answers":["ANSWER 1","ANSWER 2","ANSWER 3","All of these are considered "Interstate Commerce""],"sources":["SOURCE 1","SOURCE 2","SOURCE 3"]}]'
Notice that the second set of double quotation marks on "Interstate Commerce" in the answers is not replaced.
Solution
Rather than converting the values extracted from this monster string back into a string representation of a list and then calling eval() on it, take the values you already hold in variables and append them to the list directly.
Or construct a dict from the values, rather than creating a string representation of a dictionary and then evaluating it.
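As a rough illustration of that suggestion, the sketch below builds the record directly from variables. The field values are hypothetical stand-ins for whatever the upstream code extracts, and the use of json.dumps for the case where a text form is still needed is an assumption, not something taken from the question.

```python
import json

# Hypothetical field values, standing in for whatever was extracted upstream.
question = 'What does "AR" stand for?'
category = "DFB"
answers = ["Assault Rifle", "Army Rifle", "Automatic Rifle", "Armalite Rifle"]

# Build the record directly: no string representation and no eval().
record = {"question": question, "category": category, "answers": answers}
records = [record]

# If a text form is still needed, json.dumps escapes the embedded quotes,
# so the result can be parsed back without errors.
line = json.dumps(records)
assert json.loads(line) == records
```

Embedded double quotes never cause trouble here, because the text is never re-parsed as source code.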
It doesn't help that you haven't put any code in your question, so these answers are sketchy. If you add a minimal reproducible example (https://stackoverflow.com/help/minimal-reproducible-example) to your question, with some very minimal data (a good example that doesn't cause an exception in eval() and a bad example that recreates the problem), then I should be able to suggest how to apply my answer.
Your code must be doing something a bit like this:
import traceback

sourcesentences = [
    'this is no problem',
    "he said 'That is no problem'",
    '''he said "It's a great day"''',
]

# this breaks as soon as a word contains a double quote,
# because each word is wrapped in double quotes below
for sentence in sourcesentences:
    words = sentence.split()
    myliststring = "[\"" + "\",\"".join(words) + "\"]"
    print(f"The sentence is >{sentence}<")
    print(f"my string representation of the sentence is >{myliststring}<")
    try:
        mylistfromstring = eval(myliststring)
        print(f"my list is >{mylistfromstring}<")
    except SyntaxError:
        print(f"eval failed with SyntaxError on >{myliststring}<")
        traceback.print_exc()
    print()
This produces a SyntaxError on the third test sentence.
Now let's try escaping characters in the variable before wrapping them in quotation marks:
# this adapts to a quote within the string
def safequote(s):
    if '"' in s:
        s = s.replace('"', '\\"')
    return s

for sentence in sourcesentences:
    print(f"The sentence is >{sentence}<")
    words = [safequote(s) for s in sentence.split()]
    myliststring = "[\"" + "\",\"".join(words) + "\"]"
    print(f"my string representation of the sentence is >{myliststring}<")
    try:
        mylistfromstring = eval(myliststring)
        print(f"my list is >{mylistfromstring}<")
    except SyntaxError:
        print(f"eval failed with SyntaxError on >{myliststring}<")
        traceback.print_exc()
    print()
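The safequote helper above only escapes double quotes. If a string representation really is required, one possible variation (not part of the original answer) is to let repr() do the quoting and to parse with ast.literal_eval, which accepts only literals and so cannot execute arbitrary code. A minimal sketch, reusing the same test sentences:

```python
import ast

sourcesentences = [
    'this is no problem',
    "he said 'That is no problem'",
    '''he said "It's a great day"''',
]

for sentence in sourcesentences:
    words = sentence.split()
    # repr() emits a valid Python string literal for each word, picking the
    # quote style and adding backslashes where needed.
    myliststring = "[" + ",".join(repr(w) for w in words) + "]"
    # literal_eval parses literals only; it does not evaluate expressions.
    mylistfromstring = ast.literal_eval(myliststring)
    assert mylistfromstring == words
```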
Escaping works, but is there a better way?
It is a lot simpler to avoid eval() altogether. That means never constructing a string representation of the list, which in turn means never running into problems with quotation marks in the text:
for sentence in sourcesentences:
    print(f"The sentence is >{sentence}<")
    words = sentence.split()
    print(f"my list is >{words}<")
    print()
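If the malformed lines already exist and cannot be regenerated, a heuristic repair is still possible. The sketch below is an assumption-laden illustration, not a general parser: it assumes the lines are compact JSON with no whitespace between tokens, and treats a double quote as structural only when it directly follows one of `[{,:` or directly precedes one of `:,]}`. Every other double quote becomes a single quote. The function name repair_inner_quotes is hypothetical, and json.loads is used in place of eval() on the assumption that the data is meant to be JSON.

```python
import json

OPENERS = set("[{,:")  # characters that may directly precede an opening quote
CLOSERS = set(":,]}")  # characters that may directly follow a closing quote

def repair_inner_quotes(line):
    """Replace double quotes that do not look structural with single quotes."""
    out = []
    for i, ch in enumerate(line):
        if ch == '"':
            prev_ch = line[i - 1] if i > 0 else ""
            next_ch = line[i + 1] if i + 1 < len(line) else ""
            # Leave already-escaped quotes (\") untouched.
            if prev_ch not in OPENERS and prev_ch != "\\" and next_ch not in CLOSERS:
                ch = "'"
        out.append(ch)
    return "".join(out)

bad = '[{"question":"What does "AR" stand for?","answers":["Assault Rifle","All of these are "AR""]}]'
records = json.loads(repair_inner_quotes(bad))
print(records[0]["question"])  # What does 'AR' stand for?
```

The heuristic is ambiguous by nature: text such as `"hi", he said` inside a value has a quote directly followed by a comma, which looks structural and will still fail to parse. Lines that still raise json.JSONDecodeError after the repair need manual inspection.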