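java - Remove unclosed tags from an HTML file using Java
Question
I need to remove unclosed tags from an HTML file using Java. Is there a quick way to do this? Is there an API that automatically drops unclosed tags while parsing the file? If not, how else can it be done?
Answer
The idea is to process the whole file and look for a closing tag for every opening tag. If no closing tag is found, we save the line number of the opening tag so that the line can be removed from the file later.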
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Stack;

/*
 * Returns a stack with the line numbers of tags that don't have a closing tag.
 * Assumes at most one tag per line, located at the start of the line.
 * Void elements written without "/>" (e.g. <br>) are reported as unclosed.
 */
public static Stack<Integer> removeUnclosedTags(String filePath) {
    // Stores the names of all currently open HTML tags
    Stack<String> tags = new Stack<>();
    // Stores the line number of each tag in 'tags'
    Stack<Integer> lineNumbers = new Stack<>();
    // Stores the line numbers of tags without a closing tag
    Stack<Integer> linesToRemove = new Stack<>();
    try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
        int lineNumber = 0;
        String line;
        while ((line = br.readLine()) != null) {
            lineNumber++;
            line = line.trim();
            // Skip lines that do not start with a tag, declarations and comments
            // (e.g. <!DOCTYPE html>), and tags closed right away (e.g. <br />)
            if (!line.startsWith("<") || line.startsWith("<!") || line.contains("/>")) {
                continue;
            }
            if (line.startsWith("</")) {
                // Closing tag: extract its name, e.g. "div" from "</div>"
                String name = line.substring(2).split("[\\s>]", 2)[0].toLowerCase();
                // A closing tag that was never opened is ignored
                if (!tags.contains(name)) {
                    continue;
                }
                // Every tag opened after the matching one has no closing tag
                while (!tags.peek().equals(name)) {
                    System.out.println("unclosed tag at line number " + lineNumbers.peek() + ": " + tags.pop());
                    linesToRemove.push(lineNumbers.pop());
                }
                // The tag on top of the stack now matches: remove it
                tags.pop();
                lineNumbers.pop();
            } else {
                // Opening tag: extract its name, e.g. "div" from "<div class=...>"
                String name = line.substring(1).split("[\\s>]", 2)[0].toLowerCase();
                // Push it unless it is closed on the same line (e.g. <p>text</p>)
                if (!line.toLowerCase().contains("</" + name + ">")) {
                    tags.push(name);
                    lineNumbers.push(lineNumber);
                }
            }
        }
        // Tags still open at the end of the file were never closed
        while (!tags.isEmpty()) {
            System.out.println("unclosed tag at line number " + lineNumbers.peek() + ": " + tags.pop());
            linesToRemove.push(lineNumbers.pop());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return linesToRemove;
}
This method returns a stack containing the line numbers of the tags that have no closing tag. We can then remove those lines like this:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.HashSet;
import java.util.Set;

public static void main(String[] args) {
    String filePath = "/some/path/test.html";
    // Copy into a set: the stack is not guaranteed to be ordered by line number
    Set<Integer> lines = new HashSet<>(removeUnclosedTags(filePath));
    File inputFile = new File(filePath);
    File tempFile = new File(filePath.replace(".html", "_cleaned.html"));
    try {
        try (BufferedReader reader = new BufferedReader(new FileReader(inputFile));
             BufferedWriter writer = new BufferedWriter(new FileWriter(tempFile))) {
            String currentLine;
            int lineNumber = 0;
            while ((currentLine = reader.readLine()) != null) {
                lineNumber++;
                // Copy every line except the ones flagged for removal
                if (!lines.contains(lineNumber)) {
                    writer.write(currentLine);
                    writer.newLine();
                }
            }
        }
        // Comment this line if you want a separate file
        Files.move(tempFile.toPath(), inputFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
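The line-based code above looks at one tag per line, at the start of the line. If your file has several tags on a line, or void elements such as <br> written without "/>", the same stack idea can be applied to every tag a regular expression finds. Below is a minimal sketch of that variant, not a full HTML parser. The class name UnclosedTagFinder and both method names are made up for this example. It assumes attribute values never contain ">", it does not skip tags inside comments or scripts, and, like the code above, it assumes deleting the whole line is acceptable.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class UnclosedTagFinder {

    // Matches "<name ...>", "</name>" and "<name ... />".
    // Declarations and comments ("<!...") are not matched.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>");

    // HTML void elements: they never take a closing tag.
    private static final Set<String> VOID_ELEMENTS = new HashSet<>(Arrays.asList(
            "area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "source", "track", "wbr"));

    // Returns the 1-based line numbers of opening tags that are never closed.
    public static Set<Integer> findUnclosedTagLines(List<String> lines) {
        Deque<String> names = new ArrayDeque<>();        // open tag names, innermost first
        Deque<Integer> lineNumbers = new ArrayDeque<>(); // line of each open tag
        Set<Integer> unclosed = new TreeSet<>();
        for (int i = 0; i < lines.size(); i++) {
            Matcher m = TAG.matcher(lines.get(i));
            while (m.find()) {
                String name = m.group(2).toLowerCase(Locale.ROOT);
                boolean closing = !m.group(1).isEmpty();
                boolean selfClosing = !m.group(3).isEmpty();
                if (!closing) {
                    if (!selfClosing && !VOID_ELEMENTS.contains(name)) {
                        names.push(name);
                        lineNumbers.push(i + 1);
                    }
                } else if (names.contains(name)) {
                    // Everything opened after the matching tag was never closed.
                    while (!names.peek().equals(name)) {
                        names.pop();
                        unclosed.add(lineNumbers.pop());
                    }
                    names.pop();
                    lineNumbers.pop();
                }
                // A closing tag without a matching opening tag is ignored.
            }
        }
        // Tags still open at the end of the input were never closed.
        unclosed.addAll(lineNumbers);
        return unclosed;
    }

    // Returns a copy of 'lines' without the given 1-based line numbers.
    public static List<String> removeLines(List<String> lines, Set<Integer> lineNumbers) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (!lineNumbers.contains(i + 1)) {
                kept.add(lines.get(i));
            }
        }
        return kept;
    }
}
```

To use it on a file, you could read the lines with Files.readAllLines, pass them to findUnclosedTagLines, and write the result of removeLines back with Files.write. Working on an in-memory list keeps the matching logic separate from the file handling, which makes it easier to try on small inputs first.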