Python: Use regex to extract a column of a file

Problem description

I am currently extracting columns in a file by using awk in os.system():

os.system("awk '{print $'%i'}' < infile > outfile"%some_column)
np.loadtxt('outfile')

Is there an equivalent way to accomplish this using regex?

Thanks.

Edit: To clarify, I am looking for the most efficient way to extract specific columns from large files.

Tags: python, regex

Solution


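Depending on how your data is delimited, a regular expression may be overkill. If the delimiter is simple (whitespace, or a specific character or string), you can separate the columns with the str.split method.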
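As a rough illustration of that distinction, the sketch below splits a line three ways: on runs of whitespace, on a fixed delimiter, and with re.split for a case where the separator varies. The sample strings and the regex pattern are invented for illustration and are not taken from the question's data.

```python
import re

# Hypothetical sample lines, invented for illustration.
whitespace_line = "1.0   2.5\t3.7\n"
comma_line = "1.0,2.5,3.7\n"
mixed_line = "1.0; 2.5 ,3.7\n"

# No argument: split on any run of whitespace; leading and trailing
# whitespace (including the newline) is ignored.
fields_ws = whitespace_line.split()

# Explicit delimiter: strip the newline first, then split on every comma.
fields_comma = comma_line.strip().split(",")

# Irregular separators: this is where a regex helps. The pattern treats a
# comma or semicolon, with optional surrounding whitespace, as one delimiter.
fields_mixed = re.split(r"\s*[,;]\s*", mixed_line.strip())

# Second column (index 1) from each variant.
print(fields_ws[1], fields_comma[1], fields_mixed[1])
```

Only the third case needs the re module; the first two are covered by str.split alone.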

Here is a sample program showing how the split-based approach works:

column = 0  # First column
with open("data.txt") as file:
  data = file.readlines()
columns = list(map(lambda x: x.strip().split()[column], data))

To break this down:

column = 0
# Read a file named "data.txt" into an array of lines
with open("data.txt") as file:
  data = file.readlines()
# This is where we will store the columns as we extract them
columns = []
# Iterate over each line in the file
for line in data:
  # Strip the whitespace (including the trailing newline character) from the
  # start and end of the string
  line = line.strip()
  # Split the line, using the standard delimiter (arbitrary number of
  # whitespace characters)
  line = line.split()
  # Extract the column data from the desired index and store it in our list
  columns.append(line[column])
# columns now holds a list of strings extracted from that column
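Since the question mentions large files, it may be worth noting that readlines() loads the entire file into memory. A minimal sketch of a streaming variant follows, assuming whitespace-delimited data by default. The function name extract_column and its parameters are hypothetical, and the index is zero-based, unlike awk's one-based $N.

```python
def extract_column(path, column, delimiter=None):
    """Yield the field at zero-based index `column` from each line of `path`.

    Hypothetical helper. delimiter=None splits on runs of whitespace,
    which is the default behaviour of str.split().
    """
    with open(path) as handle:
        # Iterating over the file object reads one line at a time,
        # so the whole file is never held in memory.
        for line in handle:
            fields = line.rstrip("\n").split(delimiter)
            # Skip lines with too few fields instead of raising IndexError
            # (an assumption; awk would print an empty line here).
            if column < len(fields):
                yield fields[column]
```

For example, [float(v) for v in extract_column("infile", 2)] would collect the third column of the question's infile as floats, without writing an intermediate outfile.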
