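antlr4 - ANTLR4 line comment and text parsing problem
Problem description
I am writing a parser for C++ header-style files and am having trouble handling line comments correctly.
CustomLexer.g4: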
lexer grammar CustomLexer;
SPACES : [ \r\n\t]+ -> skip;
COMMENT_START : '//' -> pushMode(COMMENT_MODE);
PRAGMA : '#pragma';
SECTION : '@section';
DEFINE : '#define';
UNDEF : '#undef';
IF : '#if';
ELIF : '#elif';
ELSE : '#else';
IFDEF : '#ifdef';
IFNDEF : '#ifndef';
ENDIF : '#endif';
ENABLED : 'ENABLED';
DISABLED : 'DISABLED';
EITHER : 'EITHER';
ANY : 'ANY';
DEFINED : 'defined';
BOTH : 'BOTH';
BOOLEAN_LITERAL : 'true' | 'false';
STRING : '"' .*? '"';
HEXADECIMAL : '0x' ([a-fA-F0-9])+;
LITERAL_SUFFIX : 'L'|'u'|'U'|'Lu'|'LU'|'uL'|'UL'|'f'|'F';
IDENTIFIER : [a-zA-Z_] [a-zA-Z_0-9]*;
BLOCK_COMMENT : '/**' .*? '*/';
NUMBER : ('-')? Int ('.' Digit*)? | '0';
CHAR_SEQUENCE : [a-zA-Z_] [a-zA-Z_0-9.]*;
ARRAY_SEQUENCE : '{' .*? '}';
OPAREN : '(';
CPAREN : ')';
OBRACE : '{';
CBRACE : '}';
ADD : '+';
SUBTRACT : '-';
MULTIPLY : '*';
DIVIDE : '/';
MODULUS : '%';
OR : '||';
AND : '&&';
EQUALS : '==';
NEQUALS : '!=';
GTEQUALS : '>=';
LTEQUALS : '<=';
GT : '>';
LT : '<';
EXCL : '!';
QMARK : '?';
COLON : ':';
COMA : ',';
OTHER : .;
fragment Int : [0-9] Digit* | '0';
fragment Digit : [0-9];
mode COMMENT_MODE;
COMMENT_MODE_DEFINE : '#define' -> type(DEFINE), popMode;
COMMENT_MODE_SECTION : '@section' -> type(SECTION), popMode;
COMMENT_MODE_IF : '#if' -> type(IF), popMode;
COMMENT_MODE_ENDIF : '#endif' -> type(ENDIF), popMode;
COMMENT_MODE_LINE_BREAK : [\r\n]+ -> skip, popMode;
COMMENT_MODE_PART : ~[\r\n];
CustomParser.g4:
parser grammar CustomParser;
options { tokenVocab=CustomLexer; }
compilationUnit
: statement* EOF
;
statement
: comment? pragmaDirective
| comment? defineDirective
| comment? undefDirective
| comment? ifDirective
| comment? ifdefDirective
| comment? ifndefDirective
| sectionLineComment
| comment
;
pragmaDirective
: PRAGMA char_sequence
;
subDirectives
: ifDirective+
| ifdefDirective+
| ifndefDirective+
| defineDirective+
| undefDirective+
| comment+
;
ifdefDirective
: IFDEF IDENTIFIER subDirectives+ ENDIF
;
ifndefDirective
: IFNDEF IDENTIFIER subDirectives+ ENDIF
;
ifDirective
: ifStatement elseIfStatement* elseStatement? ENDIF
;
ifStatement
: IF expression (subDirectives)*
;
elseIfStatement
: ELIF expression (subDirectives)*
;
elseStatement
: ELSE (subDirectives)*
;
defineDirective
: BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER BOOLEAN_LITERAL info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER (char_sequence COMA?)+ info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER OPAREN? NUMBER LITERAL_SUFFIX? CPAREN? info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER HEXADECIMAL info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER STRING info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER OBRACE? (ARRAY_SEQUENCE COMA?)+ CBRACE? info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER expression info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER info_comment?
;
undefDirective
: BLOCK_COMMENT? COMMENT_START? UNDEF IDENTIFIER info_comment?;
sectionLineComment
: COMMENT_START COMMENT_MODE_PART? SECTION char_sequence
;
comment
: BLOCK_COMMENT
| line_comment+
;
expression
: simpleExpression
| customExpression
| enabledExpression
| disabledExpression
| bothExpression
| eitherExpression
| anyExpression
| definedExpression
| comparisonExpression
| arithmeticExpression
;
arithmeticExpression
: arithmeticExpression (MULTIPLY | DIVIDE) arithmeticExpression
| arithmeticExpression (ADD | SUBTRACT) arithmeticExpression
| OPAREN arithmeticExpression CPAREN
| expressionIdentifier
;
comparisonExpression
: comparisonExpression (EQUALS | NEQUALS | GTEQUALS | LTEQUALS | GT | LT) comparisonExpression
| comparisonExpression (AND | OR) comparisonExpression
| EXCL? OPAREN comparisonExpression CPAREN
| eitherExpression
| enabledExpression
| bothExpression
| anyExpression
| definedExpression
| disabledExpression
| customExpression
| simpleExpression
| expressionIdentifier
;
enabledExpression : EXCL? OPAREN? ENABLED OPAREN IDENTIFIER CPAREN CPAREN?;
disabledExpression : EXCL? OPAREN? DISABLED OPAREN IDENTIFIER CPAREN CPAREN?;
bothExpression : EXCL? OPAREN? BOTH OPAREN identifiers identifiers CPAREN CPAREN?;
eitherExpression : EXCL? OPAREN? EITHER OPAREN identifiers+ CPAREN CPAREN?;
anyExpression : EXCL? OPAREN? ANY OPAREN identifiers+ CPAREN CPAREN?;
definedExpression : EXCL? OPAREN? DEFINED OPAREN IDENTIFIER CPAREN CPAREN?;
customExpression : EXCL? IDENTIFIER OPAREN IDENTIFIER CPAREN;
simpleExpression : EXCL? IDENTIFIER;
expressionIdentifier : IDENTIFIER | NUMBER;
identifiers
: IDENTIFIER COMA?
;
line_comment
: COMMENT_START COMMENT_MODE_PART*
;
info_comment
: COMMENT_START COMMENT_MODE_PART*
;
char_sequence
: CHAR_SEQUENCE
| IDENTIFIER
;
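About 95% of the directives and comments in my header files parse correctly, but a few scenarios are still not handled properly:
1. Line comments
Input: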
//1
//#define ID1 //2
This is the token list:
01. compilationUnit
02. statement:2
03. comment:2
04. line_comment
05. COMMENT_START: "//"
06. COMMENT_MODE_PART: "1"
07. line_comment
08. COMMENT_START: "//"
09. defineDirective:8
10. DEFINE: "#define"
11. IDENTIFIER: "ID1"
12. info_comment
13. COMMENT_START: "//"
14. COMMENT_MODE_PART: "2"
15.<EOF>
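I want the token from line 07 to become part of the node on line 09 and to be parsed as a COMMENT_START token.
2. Define directive with text
The other define rules work fine, but: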
#define USER_DESC_2 "abc " DEF "ABC2 \" M100 (100)
#define USER_GCODE_2 "M140 S" STRINGIFY(PREHEAT_1_TEMP_BED) "\nM104 S" STRINGIFY(PREHEAT_1_TEMP_HOTEND)
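These define directives are not parsed correctly.
I would appreciate any help with the two problems I am currently facing, or any suggestions on how to improve my lexer/parser.
Thanks in advance!
================================== UPDATE ==================================
First test case:
Input: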
//1
//#define ID1 //2
Current result:
01. compilationUnit
02. statement:2
03. comment:2
04. line_comment
05. COMMENT_START: "//"
06. COMMENT_MODE_PART: "1"
07. line_comment
08. COMMENT_START: "//"
09. defineDirective:8
10. DEFINE: "#define"
11. IDENTIFIER: "ID1"
12. info_comment
13. COMMENT_START: "//"
14. COMMENT_MODE_PART: "2"
15.<EOF>
Expected result:
01. compilationUnit
02. statement:2
03. comment:2
04. line_comment
05. COMMENT_START: "//"
06. COMMENT_MODE_PART: "1"
07. defineDirective:8
08. COMMENT_START: "//"
09. DEFINE: "#define"
10. IDENTIFIER: "ID1"
11. info_comment
12. COMMENT_START: "//"
13. COMMENT_MODE_PART: "2"
14.<EOF>
Second test case:
Input:
#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL
Current result:
01.compilationUnit
02. statement:2
03. defineDirective:5
04. DEFINE: "#define"
05. IDENTIFIER: "USER_DESC_2"
06. STRING: "\"Preheat for \""
07. IDENTIFIER: "PREHEAT_1_LABEL"
<EOF>
Expected result:
01.compilationUnit
02. statement:2
03. defineDirective:5
04. DEFINE: "#define"
05. IDENTIFIER: "USER_DESC_2"
06. STRING: "\"Preheat for \" PREHEAT_1_LABEL"
<EOF>
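In the expected result, STRING stands for the resulting text. I really don't know whether it is better to enhance the STRING lexer token definition or to introduce a new parser rule to cover this case.
Solution
Mixing this post, your previous question, and Bart's answer, and assuming that a define directive has the form
optional_// #define IDENTIFIER replacement_value optional_line_comment
and given the input file input.txt: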
/**
* BLOCK COMMENT
*/
#pragma once
//#pragma once
/**
* BLOCK COMMENT
*/
#define CONFIGURATION_H_VERSION 12345
#define IDENTIFIER abcd
#define IDENTIFIER_1 abcd
#define IDENTIFIER_1 abcd.dd
#define IDENTIFIER_2 true // Line
#define IDENTIFIER_20 {ONE, TWO} // Line
#define IDENTIFIER_20_30 { 1, 2, 3, 4 }
#define IDENTIFIER_20_30_A [ 1, 2, 3, 4 ]
#define DEFAULT_A 10.0
//================================================================
//============================= INFO =============================
//================================================================
/**
* SEPARATE BLOCK COMMENT
*/
// Line 1
// Line 2
//
//======================= this is a section ======================
// @section test
// Line 3
#define IDENTIFIER_TWO "(ONE, TWO, THREE)" // Line 4
//#define IDENTIFIER_3 Version.h // Line 5
// Line 6
#define IDENTIFIER_THREE
//1
//#define ID1 //2
#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL
#define USER_DESC_2 "abc " DEF "ABC2 \" M100 (100)
#define USER_GCODE_2 "M140 S" STRINGIFY(PREHEAT_1_TEMP_BED) "\nM104 S" STRINGIFY(PREHEAT_1_TEMP_HOTEND)
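If I have understood your two questions correctly, the grammar must produce one statement for each directive, or for each comment that is not followed by a directive. A directive can be preceded by a comment, which then becomes part of the statement. A directive can also be commented out and followed by an inline line comment (i.e. on the same line).
Grammar Header.g4 (without trace):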
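To make the assumed line shape `optional_// #define IDENTIFIER replacement_value optional_line_comment` concrete, here is a rough plain-Java sketch that splits a single line into those four parts with a regular expression. It is only an illustration, not part of the grammar: the class name `DefineLineSketch` and the method `split` are made up for this example, and the regex is naive (it would mis-split a replacement value that itself contains `//`, for instance inside a string).

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; the ANTLR grammar does not use it.
class DefineLineSketch {
    // group 1: optional "//"   group 2: identifier
    // group 3: replacement value (may be empty)   group 4: optional trailing line comment
    static final Pattern DEFINE_LINE = Pattern.compile(
        "^\\s*(//)?\\s*#define\\s+([A-Za-z_][A-Za-z_0-9]*)\\s*(.*?)\\s*(//.*)?$");

    // Returns {commentedOut, identifier, replacementValue, lineComment},
    // or null when the line is not a define directive.
    static String[] split(String line) {
        Matcher m = DEFINE_LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group(1) == null ? "no" : "yes",
            m.group(2),
            m.group(3),
            m.group(4) == null ? "" : m.group(4)
        };
    }

    public static void main(String[] args) {
        // Lines taken from input.txt above.
        System.out.println(Arrays.toString(split("//#define ID1 //2")));
        System.out.println(Arrays.toString(split("#define IDENTIFIER_2 true // Line")));
        System.out.println(Arrays.toString(split("#define IDENTIFIER_THREE")));
    }
}
```

A line-based regex like this cannot attach a preceding comment to the directive, which is one reason the answer uses a grammar instead.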
grammar Header;
compilationUnit
@init {System.out.println("Last update 1253");}
: ( statement {System.out.println("Statement found : `" + $statement.text + "`");}
)* EOF
;
statement
: comment? pragma_directive
| comment? define_directive
| section
| comment
;
pragma_directive
: PRAGMA char_sequence
;
define_directive
: define_identifier replacement_comment[$define_identifier.statement_line]
;
define_identifier returns [int statement_line]
: LINE_COMMENT_DELIMITER? DEFINE {$statement_line = getCurrentToken().getLine();} IDENTIFIER
;
replacement_comment [int statement_line]
: anything+ line_comment?
| {getCurrentToken().getLine() == $statement_line}? line_comment
| {getCurrentToken().getLine() != $statement_line}?
;
section
: LINE_COMMENT_DELIMITER OTHER? SECTION char_sequence
;
comment
: BLOCK_COMMENT
| line_comment
| SEPARATOR ( IDENTIFIER | EQUALS )*
;
line_comment
: LINE_COMMENT_DELIMITER anything*
;
anything
: IDENTIFIER
| CHAR_SEQUENCE
| STRING
| NUMBER
| OTHER
;
char_sequence
: CHAR_SEQUENCE
| IDENTIFIER
;
LINE_COMMENT_DELIMITER : '//' ;
PRAGMA : '#pragma';
SECTION : '@section';
DEFINE : '#define';
STRING : '"' .*? '"';
EQUALS : '='+ ;
SEPARATOR : LINE_COMMENT_DELIMITER EQUALS ;
IDENTIFIER : [a-zA-Z_] [a-zA-Z_0-9]*;
CHAR_SEQUENCE : [a-zA-Z_] [a-zA-Z_0-9.]*;
NUMBER : [0-9.]+ ;
BLOCK_COMMENT : '/**' .*? '*/';
WS : [ \t]+ -> channel(HIDDEN) ;
NL : ( '\r' '\n'?
| '\n'
) -> channel(HIDDEN) ;
OTHER : . ;
Execution:
$ export CLASSPATH=".:/usr/local/lib/antlr-4.9-complete.jar"
$ alias a4='java -jar /usr/local/lib/antlr-4.9-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Header.g4
$ javac Header*.java
$ grun Header compilationUnit -tokens input.txt
[@0,0:23='/**\n * BLOCK COMMENT\n */',<BLOCK_COMMENT>,1:0]
[@1,24:24='\n',<NL>,channel=1,3:3]
[@2,25:31='#pragma',<'#pragma'>,4:0]
[@3,32:32=' ',<WS>,channel=1,4:7]
[@4,33:36='once',<IDENTIFIER>,4:8]
[@5,37:37='\n',<NL>,channel=1,4:12]
...
[@84,315:321='#define',<'#define'>,19:0]
[@85,322:322=' ',<WS>,channel=1,19:7]
[@86,323:340='IDENTIFIER_20_30_A',<IDENTIFIER>,19:8]
[@87,341:343=' ',<WS>,channel=1,19:26]
[@88,344:344='[',<OTHER>,19:29]
[@89,345:345=' ',<WS>,channel=1,19:30]
[@90,346:346='1',<NUMBER>,19:31]
[@91,347:347=',',<OTHER>,19:32]
...
[@139,644:668='//=======================',<SEPARATOR>,34:0]
[@140,669:669=' ',<WS>,channel=1,34:25]
[@141,670:673='this',<IDENTIFIER>,34:26]
...
[@257,1103:1102='<EOF>',<EOF>,51:0]
Last update 1253
Statement found : `/**
* BLOCK COMMENT
*/
#pragma once`
Statement found : `//#pragma once`
...
Statement found : `#define DEFAULT_A 10.0`
...
Statement found : `// Line 2`
Statement found : `//`
...
Statement found : `//#define IDENTIFIER_3 Version.h // Line 5`
Statement found : `// Line 6
#define IDENTIFIER_THREE`
Statement found : `//1
//#define ID1 //2`
Statement found : `#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL`
Statement found : `#define USER_DESC_2 "abc " DEF "ABC2 \" M100 (100)`
Statement found : `#define USER_GCODE_2 "M140 S" STRINGIFY(PREHEAT_1_TEMP_BED) "\nM104 S" STRINGIFY(PREHEAT_1_TEMP_HOTEND)`
Grammar Header_trace.g4 (with trace):
grammar Header_trace;
compilationUnit
@init {System.out.println("Last update 1137");}
: statement[this.getRuleNames() /* parser rule names */]* EOF
;
statement [String[] rule_names]
locals [String rule_name, int start_line, int end_line]
@after { System.out.print("The next statement is a " + $rule_name);
$start_line = $start.getLine();
$end_line = $stop.getLine();
if ($start_line == $end_line)
System.out.print(" on line " + $start_line);
else
System.out.print(" on lines " + $start_line + " to " + $end_line);
System.out.println(" : ");
System.out.println("`" + $text + "`");
}
: comment? pragma_directive [rule_names] {$rule_name = $pragma_directive.rule_name;}
| comment? define_directive [rule_names] {$rule_name = $define_directive.rule_name;}
| section [rule_names] {$rule_name = $section.rule_name;}
| comment_only [rule_names] {$rule_name = $comment_only.rule_name;}
// comment_only can be replaced by comment when the trace is removed
;
pragma_directive [String[] rule_names] returns [String rule_name]
: PRAGMA char_sequence
{ $rule_name = rule_names[$ctx.getRuleIndex()]; }
;
define_directive [String[] rule_names] returns [String rule_name]
locals [String dir_rule_name, int statement_line = 0]
@init {$dir_rule_name = rule_names[_localctx.getRuleIndex()];}
: define_identifier replacement_comment[$dir_rule_name, $define_identifier.statement_line]
{ $rule_name = $replacement_comment.rule_name; }
;
define_identifier returns [int statement_line]
: LINE_COMMENT_DELIMITER? DEFINE {$statement_line = getCurrentToken().getLine();} IDENTIFIER
;
replacement_comment [String dir_rule_name, int statement_line] returns [String rule_name]
: any+=anything+ line_comment?
{ $rule_name = $dir_rule_name + " with replacement value";
System.out.print(" anything matched : " );
if ($any.size() > 0)
for (AnythingContext r : $any)
System.out.print(r.getText());
else
System.out.print("(nothing)");
System.out.println();
}
| {getCurrentToken().getLine() == $statement_line}?
line_comment
{ $rule_name = $dir_rule_name + " WITHOUT replacement value and with inline line comment"; }
| {getCurrentToken().getLine() != $statement_line}?
{ $rule_name = $dir_rule_name + " WITHOUT replacement value"; }
;
section [String[] rule_names] returns [String rule_name]
: LINE_COMMENT_DELIMITER OTHER? SECTION char_sequence
{ $rule_name = rule_names[$ctx.getRuleIndex()]; }
;
comment_only [String[] rule_names] returns [String rule_name]
: comment
{ $rule_name = rule_names[$ctx.getRuleIndex()]; }
;
comment
: BLOCK_COMMENT
| line_comment
| SEPARATOR ( IDENTIFIER | EQUALS )*
;
line_comment
: LINE_COMMENT_DELIMITER anything*
;
anything
: IDENTIFIER
| CHAR_SEQUENCE
| STRING
| NUMBER
| OTHER
;
char_sequence
: CHAR_SEQUENCE
| IDENTIFIER
;
LINE_COMMENT_DELIMITER : '//' ;
PRAGMA : '#pragma';
SECTION : '@section';
DEFINE : '#define';
STRING : '"' .*? '"';
EQUALS : '='+ ;
SEPARATOR : LINE_COMMENT_DELIMITER EQUALS ;
IDENTIFIER : [a-zA-Z_] [a-zA-Z_0-9]*;
CHAR_SEQUENCE : [a-zA-Z_] [a-zA-Z_0-9.]*;
NUMBER : [0-9.]+ ;
BLOCK_COMMENT : '/**' .*? '*/';
WS : [ \t]+ -> channel(HIDDEN) ;
NL : ( '\r' '\n'?
| '\n'
) -> channel(HIDDEN) ;
OTHER : .;
Execution:
$ a4 Header_trace.g4
$ javac Header*.java
$ grun Header_trace compilationUnit -tokens input.txt
[@0,0:23='/**\n * BLOCK COMMENT\n */',<BLOCK_COMMENT>,1:0]
[@1,24:24='\n',<NL>,channel=1,3:3]
[@2,25:31='#pragma',<'#pragma'>,4:0]
[@3,32:32=' ',<WS>,channel=1,4:7]
[@4,33:36='once',<IDENTIFIER>,4:8]
[@5,37:37='\n',<NL>,channel=1,4:12]
...
[@257,1103:1102='<EOF>',<EOF>,51:0]
Last update 1137
The next statement is a pragma_directive on lines 1 to 4 :
`/**
* BLOCK COMMENT
*/
#pragma once`
...
anything matched : 10.0
The next statement is a define_directive with replacement value on line 20 :
`#define DEFAULT_A 10.0`
The next statement is a comment_only on line 22 :
`//================================================================`
...
The next statement is a comment_only on line 31 :
`// Line 2`
The next statement is a comment_only on line 32 :
`//`
...
anything matched : Version.h
The next statement is a define_directive with replacement value on line 39 :
`//#define IDENTIFIER_3 Version.h // Line 5`
The next statement is a define_directive WITHOUT replacement value on lines 41 to 42 :
`// Line 6
#define IDENTIFIER_THREE`
The next statement is a define_directive WITHOUT replacement value and with inline line comment on lines 44 to 45 :
`//1
//#define ID1 //2`
anything matched : "Preheat for "PREHEAT_1_LABEL
The next statement is a define_directive with replacement value on line 47 :
`#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL`
...
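Note that the trace prints `"Preheat for "PREHEAT_1_LABEL` without the space, because it concatenates the text of each matched token and the whitespace tokens sit on the hidden channel, whereas the statement text printed with `$text` keeps the original spacing. If the goal is the single replacement text from the question's expected result, one option is to take the source characters between the first and the last token of the replacement value. The following plain-Java sketch shows the idea without any ANTLR dependency; the class and method names are hypothetical, and in a real parser the two indexes would presumably come from `Token.getStartIndex()` and `Token.getStopIndex()`.

```java
// Hypothetical illustration: recover the original text spanned by a run of tokens.
class SourceSpanSketch {
    // Returns the source text from startIndex to stopIndex, both inclusive,
    // which is the convention ANTLR uses for token start/stop indexes.
    static String textBetween(String source, int startIndex, int stopIndex) {
        return source.substring(startIndex, stopIndex + 1);
    }

    public static void main(String[] args) {
        String line = "#define USER_DESC_2 \"Preheat for \" PREHEAT_1_LABEL";
        int start = line.indexOf('"');   // first character of the STRING token
        int stop = line.length() - 1;    // last character of the trailing IDENTIFIER token

        // Source span: the whitespace between the tokens is preserved.
        System.out.println(textBetween(line, start, stop));

        // Token-by-token concatenation: the hidden whitespace is lost.
        System.out.println(String.join("", "\"Preheat for \"", "PREHEAT_1_LABEL"));
    }
}
```

With this approach no change to the STRING lexer rule is needed; the parser rule only has to accept the mixed sequence of tokens, as `anything+` does.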
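As it happens, thanks to the LINE_COMMENT_DELIMITER? at the beginning of the define directive rule (as you did with COMMENT_START?), and because there are no special tokens after //, it is no longer necessary to switch to COMMENT_MODE when a line comment delimiter is encountered.
I ran into a difficulty with a first approach: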
define_directive
: LINE_COMMENT_DELIMITER? DEFINE IDENTIFIER anything+ line_comment?
| LINE_COMMENT_DELIMITER? DEFINE {$statement_line = getCurrentToken().getLine();}
IDENTIFIER same_line_line_comment[$statement_line]
| LINE_COMMENT_DELIMITER? DEFINE IDENTIFIER
same_line_line_comment [int statement_line]
: {getCurrentToken().getLine() == $statement_line}?
line_comment
With these rules, the following lines
// Line 6
#define IDENTIFIER_THREE
//1
were parsed with the second alternative instead of the third:
compare statement line 42 with comment line 44
line 44:0 rule same_line_line_comment failed predicate: {getCurrentToken().getLine() == $statement_line}?
The next statement is a define_directive WITHOUT replacement value and with inline line comment on lines 41 to 42 :
`// Line 6
#define IDENTIFIER_THREE`
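Although the same_line_line_comment subrule is guarded by a predicate that evaluates to false, the semantic predicate has no effect on the choice of alternative. The resulting FailedPredicateException is undesirable, and the trace message is wrong. This may be related to "Finding Visible Predicates".
The solution was to split the processing of the #define directive into a fixed part, the define_identifier rule, and a variable part, the replacement_comment rule, which carries the semantic predicates (to take part in the parsing decision, a predicate must be placed at the very beginning of its alternative).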
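The decision that replacement_comment makes once its predicates sit at the start of the alternatives can be pictured with a small plain-Java sketch. This is a simplified model written for illustration, not what ANTLR generates: the class `ReplacementCommentSketch`, the enum `Alt`, and the method `choose` are hypothetical, and real ANTLR prediction is adaptive and can look further ahead than one token.

```java
import java.util.Set;

// Hypothetical one-token model of the three alternatives of replacement_comment.
class ReplacementCommentSketch {
    enum Alt { WITH_VALUE, INLINE_COMMENT_ONLY, EMPTY, NO_VIABLE_ALT }

    // Token types accepted by the `anything` rule of Header.g4.
    static final Set<String> ANYTHING =
        Set.of("IDENTIFIER", "CHAR_SEQUENCE", "STRING", "NUMBER", "OTHER");

    // nextType/nextLine describe the first visible token after "#define IDENTIFIER";
    // statementLine is the line recorded by define_identifier.
    static Alt choose(String nextType, int nextLine, int statementLine) {
        if (ANYTHING.contains(nextType)) {
            return Alt.WITH_VALUE;            // alternative 1: anything+ line_comment?
        }
        boolean sameLine = nextLine == statementLine;
        if (sameLine && nextType.equals("LINE_COMMENT_DELIMITER")) {
            return Alt.INLINE_COMMENT_ONLY;   // alternative 2: predicate "same line" holds
        }
        if (!sameLine) {
            return Alt.EMPTY;                 // alternative 3: predicate "different line" holds
        }
        return Alt.NO_VIABLE_ALT;             // same line, but nothing the rule can start with
    }

    public static void main(String[] args) {
        // "#define IDENTIFIER_THREE" on line 42, next visible token "//" on line 44.
        System.out.println(choose("LINE_COMMENT_DELIMITER", 44, 42));
        // "//#define ID1 //2" on line 45: the trailing "//" is on the same line.
        System.out.println(choose("LINE_COMMENT_DELIMITER", 45, 45));
        // "#define DEFAULT_A 10.0" on line 20.
        System.out.println(choose("NUMBER", 20, 20));
    }
}
```

In the first call the comment on line 44 is left for the next statement, which is the behaviour the failed first approach did not achieve.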