Identify all instances of problematic quotation marks

Problem description

I have a (properly formed) large string variable that I turn into lists of dictionaries. I split the string on newline characters and run list(eval(i)) on each line i. This works in most cases, but whenever an exception is thrown I add the 'malformed' line to a failed_attempt array. After inspecting the failed cases for an hour, I believe they fail whenever a value contains an extra double quotation mark, that is, one that is not a delimiter around a dictionary key or value. For example,

eval('''[{"question":"What does "AR" stand for?","category":"DFB","answers":["Assault Rifle","Army Rifle","Automatic Rifle","Armalite Rifle"],"sources":["https://www.npr.org/2018/02/28/588861820/a-brief-history-of-the-ar-15"]}]''')

This fails because of the double quotation marks around "AR". If you replace them with single quotation marks, e.g.

eval('''[{"question":"What does 'AR' stand for?","category":"DFB","answers":["Assault Rifle","Army Rifle","Automatic Rifle","Armalite Rifle"],"sources":["https://www.npr.org/2018/02/28/588861820/a-brief-history-of-the-ar-15"]}]''')

It now succeeds.

Similarly:

eval('''[{"question":"Test Question, Test Question?","category":"DFB","answers":["2004","1930","1981","This has never occurred"],"sources":[""SOWELL: Exploding myths""]}]''')

This fails because of the extra double quotes around "SOWELL: Exploding myths", but again succeeds if you replace them with single quotes.

So I need a way to identify double quotes that are not delimiters of the dictionary's keys (question, category, answers, sources) or values, and replace them with single quotes. I'm not sure what the right way to do this is.

@Wiktor's submission nearly does the trick, but will fail on the following:

import re

example = '''[{"question":"Which of the following is NOT considered to be "interstate commerce" by the Supreme Court, and this cannot be regulated by Congress?","category":"DFB","answers":["ANSWER 1","ANSWER 2","ANSWER 3","All of these are considered "Interstate Commerce""],"sources":["SOURCE 1","SOURCE 2","SOURCE 3"]}]'''
re.sub(r'("\w+":[[{]*")(.*?)("(?:,|]*}))', lambda x: "{}{}{}".format(x.group(1),x.group(2).replace('"', "'"),x.group(3)), example)


Out[170]: '[{"question":"Which of the following is NOT considered to be \'interstate commerce\' by the Supreme Court, and this cannot be regulated by Congress?","category":"DFB","answers":["ANSWER 1","ANSWER 2","ANSWER 3","All of these are considered "Interstate Commerce""],"sources":["SOURCE 1","SOURCE 2","SOURCE 3"]}]'

Notice that the second set of double quotation marks on "Interstate Commerce" in the answers is not replaced.
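
To make the requirement concrete, here is a rough heuristic sketch, not a robust parser. It assumes a double quote is structural only when it directly follows one of [ { , : or directly precedes one of ] } , : and treats every other double quote as part of the text. The names INNER_QUOTE and repair_inner_quotes are made up for illustration, and json.loads stands in for eval since the lines look like JSON.

```python
import json
import re

# Hypothetical heuristic: a double quote that neither follows [ { , :
# nor precedes ] } , : is assumed to sit inside a value.
INNER_QUOTE = re.compile(r'(?<![\[{,:])"(?![\]},:])')


def repair_inner_quotes(line):
    """Replace double quotes that appear to be inside a value with single quotes."""
    return INNER_QUOTE.sub("'", line)


bad = '''[{"question":"What does "AR" stand for?","category":"DFB","answers":["Assault Rifle","Army Rifle"],"sources":[""SOWELL: Exploding myths""]}]'''

records = json.loads(repair_inner_quotes(bad))
print(records[0]["question"])
print(records[0]["sources"])
```

The heuristic still misfires when an inner quote touches a comma or colon, as in "He said "no", then left", so any line that still fails to parse should keep going to the failed_attempt list.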

Tags: python, regex

Solution


Rather than converting the values extracted from this monster string back into a string representation of a list and then calling eval() on it, take the values you already have in variables and append them to the list directly.

Or construct a dict from the values rather than creating a string representation of a dictionary and then evaluating it.
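
As a minimal sketch of that idea, assuming you control the code that produces the records: the values below are made up for illustration, and the json round trip is only there for the case where a text form is needed at all (for example to write one record per line).

```python
import json

# Hypothetical values; in real code they come from wherever the data originates.
question = 'What does "AR" stand for?'
answers = ["Assault Rifle", "Army Rifle", "Automatic Rifle", "Armalite Rifle"]

# Build the dict directly: no string representation, so no quoting problem.
record = {"question": question, "category": "DFB", "answers": answers}

# If a text form is required, let json.dumps do the escaping of inner quotes.
line = json.dumps([record])
restored = json.loads(line)

print(line)
print(restored == [record])
```

The inner double quotes survive unchanged because the serializer escapes them, so nothing has to be swapped for single quotes.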

It doesn't help that you haven't put any code in your question, so these answers are sketchy. If you add a minimal reproducible example (https://stackoverflow.com/help/minimal-reproducible-example) to your question, with some very minimal data (a good example that doesn't cause an exception in eval() and a bad one that recreates the problem), then I should be able to better suggest how to apply my answer.

Your code must be doing something a bit like this:

import traceback

sourcesentences = [
    'this is no problem',
    "he said 'That is no problem'",
    '''he said "It's a great day"''',
]

# this is doomed if there is a double quote in the sentence
for sentence in sourcesentences:
    words = sentence.split()
    myliststring="[\""+"\",\"".join(words)+"\"]"    
    print( f"The sentence is >{sentence}<" )
    print( f"my string representation of the sentence is >{myliststring}<" )
    try:
        mylistfromstring = eval(myliststring)
        print( f"my list is >{mylistfromstring}<" )
    except SyntaxError as e:
        print( f"eval failed with SyntaxError on >{myliststring}<")
        traceback.print_exc()
    print()

This produces a SyntaxError on the third test sentence, because its double quotes end the string literals in the generated list early.

Now let's try escaping characters in the variable before wrapping them in quotation marks:

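# this escapes backslashes and double quotes so the text can be wrapped in "..."
def safequote(s):
    # escape backslashes first, otherwise the ones added for quotes get doubled
    return s.replace('\\', '\\\\').replace('"', '\\"')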

for sentence in sourcesentences:
    print( f"The sentence is >{sentence}<" )
    words = [safequote(s) for s in sentence.split()]
    myliststring="[\""+"\",\"".join(words)+"\"]"    
    print( f"my string representation of the sentence is >{myliststring}<" )
    try:
        mylistfromstring = eval(myliststring)
        print( f"my list is >{mylistfromstring}<" )
    except SyntaxError as e:
        print( f"eval failed with SyntaxError on >{myliststring}<")
        traceback.print_exc()
    print()

This works, but is there a better way?

Isn't it a lot simpler to avoid eval altogether? That means never constructing a string representation of the list, which in turn avoids every problem with quotation marks in the text:

for sentence in sourcesentences:
    print( f"The sentence is >{sentence}<" )
    words = sentence.split()
    print( f"my list is >{words}<" )
    print()
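
If a string form of the list really is needed somewhere, one option is to let Python do the quoting instead of joining the pieces by hand. The sketch below assumes the same three test sentences and uses repr() to produce the text and ast.literal_eval() to read it back. The names as_text, restored and results are only illustrative.

```python
import ast

sourcesentences = [
    'this is no problem',
    "he said 'That is no problem'",
    '''he said "It's a great day"''',
]

results = []
for sentence in sourcesentences:
    words = sentence.split()
    as_text = repr(words)                 # Python picks the quoting and escaping
    restored = ast.literal_eval(as_text)  # parses literals only, never runs code
    results.append(restored == words)
    print(f"{as_text} -> round trip ok: {restored == words}")
```

ast.literal_eval only accepts Python literals, so unlike eval it will not execute arbitrary expressions hidden in the data.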
