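java - Reading a CSV that contains double quotes and commas
Problem description
Update: working solution posted below
I am trying to process a CSV file by splitting each line on commas. However, in several places a quoted field has commas embedded in it.
Example: "# 29. Toxic substances properly identified, stored, used"
Every field that contains commas is wrapped in double quotes (""). Is there a way to detect the double quotes and skip the commas inside them?
Thanks!
Original code: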
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.FileWriter;
import java.io.PrintWriter;

public class csvFileReader {
    public static void main(String[] args) {
        String csvFile = "/Users/zzmle/Desktop/data.csv";
        BufferedReader br = null;
        String line = "";
        String cvsSplitBy = ",";
        int count=0;
        try {
            br = new BufferedReader(new FileReader(csvFile));
            String firstline = br.readLine();
            String[] header = firstline.split(",");
            while ((line = br.readLine()) != null && count<10) {
                //comma is the separator
                String[] Restaurant = line.split(cvsSplitBy);
                for (int i=0; i<header.length; i++) {
                    System.out.println(header[i]+": "+Restaurant[i]+" ");
                }
                System.out.println("-------------------");
                count++;
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (br != null) {
                try {
                    br.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
Working solution:
// @author Zhiming Zhao
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.FileWriter;
import java.io.PrintWriter;

public class csvFileReader {
    public static void main(String[] args) {
        String csvFile = "data.csv";
        BufferedReader br = null;
        String line = "";
        String cvsSplitBy = ",";
        int count=0;
        try {
            br = new BufferedReader(new FileReader(csvFile));
            String firstline = br.readLine();
            String[] header = firstline.split(cvsSplitBy);
            while ((line = br.readLine()) != null && count<10) { //count<10 is for testing purposes
                String[] Restaurant = line.split(cvsSplitBy); //comma is the separator
                //deal with commas inside quotation marks, which split a field and shift later fields into the wrong places
                process(Restaurant);
                //print header + restaurant value for the first ten lines
                for (int i=0; i<header.length; i++) {
                    System.out.println(header[i]+": "+Restaurant[i]+" ");
                }
                System.out.println("-------------------");
                count++;
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
            System.out.println("The file cannot be found, check if the file is under root directory");
        } catch (IOException e) {
            e.printStackTrace();
            System.out.println("Input & Output operations error");
        } finally {
            if (br != null) {
                try {
                    br.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    // @brief This function specifically deals with the issue of commas within quotation marks
    // @detail It finds the indexes of the two elements that hold the opening and closing quotation marks, then concatenates everything between them. It works with multiple quoted fields on the same line.
    // NOTE: charAt(0) throws StringIndexOutOfBoundsException if an element is empty (e.g. two adjacent commas).
    public static void process(String[] data) {
        int index1 = -1; //index of the element with the opening ", -1 for empty
        int index2 = 0;  //index of the element with the closing ", 0 for empty
        for (int i=0; i<data.length; i++) {
            //if index1 is empty and the first char of the current element is "
            if (String.valueOf(data[i].charAt(0)).equals("\"") && index1 == -1) {
                index1 = i; //set index1 to current index number
            }
            //if index1 is not empty and the last char of the current element is "
            if (String.valueOf(data[i].charAt(data[i].length()-1)).equals("\"") && index1 != -1) {
                index2 = i; //set index2 to current index number
                multiconcat(index1, index2, data); //concat the elements between index1 and index2
                //delete the elements that were copied (index1+1:index2)
                //NOTE: this reassigns only the local variable; the caller's array keeps its original length
                data = multidelet(index1+1, index2, data);
                i -= (index2-index1); //reset the cursor back to index1 (could be replaced with i = index1)
                index1 = -1; //set index1 to empty
            }
        }
    }

    // @brief Appends elements index1+1..index2 onto data[index1] in place (the separating commas are not restored); doesn't return anything
    public static void multiconcat(int index1, int index2, String[] data) {
        for (int i=index1+1; i<=index2; i++) {
            data[index1] += data[i];
        }
    }

    // @brief Returns a copy of data without the elements index1..index2 (inclusive)
    public static String[] multidelet(int index1, int index2, String[] data) {
        String[] newarr = new String[data.length-(index2-index1+1)];
        int n = 0;
        for (int i=0; i<data.length; i++) {
            if (index1 <= i && i <= index2) continue;
            newarr[n] = data[i];
            n++;
        }
        return newarr;
    }
}
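Output (one of the rows with quotes and commas). It isn't perfect (the comma inside the quotes gets eaten), but that's a minor issue and I couldn't be bothered to fix it, haha: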
serial_number: DA08R0TCU
activity_date: 03/30/2018 12:00:00 AM
facility_name: KRUANG TEDD
violation_code: F035
violation_description: "# 35. Equipment/Utensils - approved; installed; clean; good repair capacity"
violation_status: capacity"
points: OUT OF COMPLIANCE
grade: 1
facility_address: A
facility_city: 5151 HOLLYWOOD BLVD
facility_id: LOS ANGELES
facility_state: FA0064949
facility_zip: CA
employee_id: 90027
owner_id: EE0000857
owner_name: OW0001034
pe_description: 5151 HOLLYWOOD LLC
program_element_pe: RESTAURANT (31-60) SEATS HIGH RISK
program_name: 1635
program_status: KRUANG TEDD
record_id: ACTIVE
score: PR0031205
service_code: 92
service_description: 1
row_id: ROUTINE INSPECTION
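In the output above, every value after violation_description sits one column too far down (violation_status shows ` capacity"`, points shows the status, and so on). The likely cause is that `process` assigns the result of `multidelet` to its own local `data`, so `main` still prints the original, unshortened array. A minimal sketch of the same merge idea that returns the merged array and puts the commas back is shown below. The class and method names (`QuotedFieldMerger`, `mergeQuotedFields`) are hypothetical, and the sketch assumes fields never contain escaped quotes (`""`).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class QuotedFieldMerger {
    // Hypothetical helper (not from the original post): merges the pieces of a
    // quoted field that a plain split(",") broke apart, restoring the commas.
    // Assumes no escaped quotes ("") inside fields.
    static String[] mergeQuotedFields(String[] parts) {
        List<String> merged = new ArrayList<>();
        StringBuilder open = null; // non-null while inside an unfinished quoted field
        for (String part : parts) {
            if (open == null) {
                boolean startsQuoted = part.startsWith("\"");
                boolean closesItself = part.length() >= 2 && part.endsWith("\"");
                if (startsQuoted && !closesItself) {
                    open = new StringBuilder(part); // quoted field continues in later pieces
                } else {
                    merged.add(part); // ordinary field, or quoted field without commas
                }
            } else {
                open.append(',').append(part); // put back the comma that split() removed
                if (part.endsWith("\"")) {
                    merged.add(open.toString());
                    open = null;
                }
            }
        }
        if (open != null) {
            merged.add(open.toString()); // unterminated quote: keep what was collected
        }
        return merged.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Sample line made up for illustration
        String line = "F035,\"# 35. Equipment - approved; good repair, capacity\",OUT OF COMPLIANCE";
        // limit -1 keeps trailing empty fields
        String[] fields = mergeQuotedFields(line.split(",", -1));
        System.out.println(Arrays.toString(fields));
    }
}
```

In the loop in `main`, this would be used as `Restaurant = mergeQuotedFields(line.split(cvsSplitBy, -1));` so that the printed array is the merged one.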
Solution
My own solution: read the first character of each element. If it is a double quote, concatenate this element with the following ones (this will need recursion) until reaching an element whose last character is a double quote.
I expect this to run faster than the char-by-char approach suggested by JGFMK. Also, I am not allowed to use external libraries for this project.
Still implementing this; I will update if it works.
EDIT: Working solution posted in the original post.
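For comparison, the char-by-char approach mentioned above can also be written without external libraries. The following is only a sketch under stated assumptions: `CsvLineSplitter` and `splitCsvLine` are hypothetical names, each record is assumed to fit on one line (no line breaks inside quoted fields), and a doubled quote (`""`) inside a quoted field is treated as an escaped quote. It makes a single pass over the line, so its cost is linear in the line length.

```java
import java.util.ArrayList;
import java.util.List;

class CsvLineSplitter {
    // Hypothetical helper (not from the original post): splits one CSV line on
    // commas that are outside double quotes. The surrounding quotes are dropped
    // and a doubled quote ("") inside a quoted field becomes a single quote.
    static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // escaped quote inside a quoted field
                        i++;                 // skip the second quote of the pair
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas are kept as data here
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString()); // field separator
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field (may be empty)
        return fields;
    }

    public static void main(String[] args) {
        // Sample line made up for illustration
        String line = "F029,\"# 29. Toxic substances properly identified, stored, used\",1";
        for (String field : splitCsvLine(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Unlike the merge approach, this version strips the surrounding quotes, so the values can be printed next to the header names directly.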