Regular expression for CSV parsing? Python 3 re module

Question

Is there a regular expression (compatible with Python's re module) that can be used to parse CSV?

Tags: python, python-3.x, regex, csv, parsing

Answer


Edit: I didn't realize that Python's standard library already has a csv module.
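Since the standard library ships a csv module, a minimal sketch of parsing with `csv.reader` might look like the following. The variable names and the sample text are illustrative assumptions, not part of the original answer.

```python
import csv
import io

# Illustrative sample text (an assumption; it stands in for a small CSV file)
sample = 'foo,bar,zbar\nfoo_row,foo1,,\nbarie,"2,000",,\n'

# csv.reader accepts any iterable of lines; StringIO wraps the raw text
rows = list(csv.reader(io.StringIO(sample)))

for row in rows:
    print(row)
```

Note that `csv.reader` strips the surrounding quotes, so the quoted field comes back as `2,000` in a single cell.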

Here is the regular expression: (?<!,\"\w)\s*,(?!\w\s*\",). It is compatible with both Python and JavaScript. Here is the complete parsing script (as a Python function):

def parseCSV(csvDoc, output_type="dict"):
    """
    To parse csv files.
    Arguments:
    csvDoc - The csv document to parse (raw text).
    output_type - the output type this function will return:
                  "dict" or "json" for a JSON string (the default),
                  "list", "numpy", "array", etc. for a NumPy array of rows
    """
    from re import compile as c
    from json import dumps
    from numpy import array

    # This is where all the parsing happens: split on commas that are
    # not guarded by the lookbehind/lookahead for a quoted field
    csvparser = c(r'(?<!,"\w)\s*,(?!\w\s*",)')
    lines = str(csvDoc).split('\n')

    # Keep only the non-empty lines
    necessary_lines = [line for line in lines if line != ""]

    # dtype=object allows rows with different numbers of fields;
    # recent NumPy versions reject ragged input without it
    All = array([csvparser.split(line) for line in necessary_lines],
                dtype=object)

    if output_type.lower() in ("list",
                               "numpy",
                               "array",
                               "matrix",
                               "np.array",
                               "np.ndarray",
                               "numpy.array",
                               "numpy.ndarray"):
        return All

    # "dict", "json", and any unrecognized output_type all return JSON
    # All the python dict keys required (At the top of the file or top row)
    top_line   = list(All[0])
    main_table = {}      # The parsed data will be here
    main_table[top_line[0]] = {
        name[0]: {
            # Pair each header with the value in the same column
            thing: name[top_line.index(thing)]
            for thing in top_line[1:]  # The requirements
        } for name in All[1:]
    }
    return dumps(main_table, skipkeys=True, ensure_ascii=False, indent=1)

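Dependency: NumPy. All you need to do is pass in the raw text of the CSV file, and the function returns a JSON string in this format (or a two-dimensional array of rows, if you prefer):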

{"top-left-corner name":{
     "foo":{"Item 2 of the top row":"Item 2 of the foo row",
            "Item 3 of the top row":"Item 3 of the foo row",
             ...},
     "bar":{...}
  }
}

Here is an example, CSV.csv:

foo,bar,zbar
foo_row,foo1,,
barie,"2,000",,

It outputs:

{
 "foo": {
  "foo_row": {
   "bar": "foo1",
   "zbar": ""
  },
  "barie": {
   "bar": "\"2,000\"",
   "zbar": ""
  }
 }
}
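For comparison, the same nested layout can be sketched with only the standard library's csv and json modules, without NumPy. The helper name `parse_csv_nested` and the sample text are hypothetical and are not part of the original answer.

```python
import csv
import io
import json


def parse_csv_nested(text):
    """Hypothetical helper: build {corner: {row_name: {header: value}}}."""
    # Skip blank lines, which csv.reader yields as empty lists
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    header, body = rows[0], rows[1:]
    return {
        header[0]: {
            # zip stops at the shorter sequence, so extra trailing
            # empty fields in a row are ignored
            row[0]: dict(zip(header[1:], row[1:]))
            for row in body
        }
    }


# Illustrative sample text (an assumption)
sample = 'foo,bar,zbar\nfoo_row,foo1,,\nbarie,"2,000",,\n'
result = parse_csv_nested(sample)
print(json.dumps(result, ensure_ascii=False, indent=1))
```

One difference from the regex approach: the csv module removes the surrounding quotes, so the value is stored as `2,000` rather than `"2,000"` with literal quote characters.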

If your CSV file is well-formed, it should work (the files I tested were produced by Apple's Numbers).
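To see what the pattern does and does not protect, here is a small sketch that applies it to two lines. The pattern is the one from this answer, written as a raw string; the input lines and variable names are illustrative assumptions.

```python
import re

# The splitting pattern from the answer, written as a raw string
splitter = re.compile(r'(?<!,"\w)\s*,(?!\w\s*",)')

# One character between the opening quote and the inner comma:
# the lookbehind blocks the split, so the quoted field stays whole
short_quoted = splitter.split('barie,"2,000",,')

# Two characters before the inner comma: neither lookaround matches,
# so the quoted field is split in two
long_quoted = splitter.split('x,"12,000",y')

print(short_quoted)  # ['barie', '"2,000"', '', '']
print(long_quoted)   # ['x', '"12', '000"', 'y']
```

The lookarounds each cover a single word character next to the quote, so quoted fields with longer runs on both sides of the comma are better handled by the csv module.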

