January 2023 - Skywind Inside

使用 LIBLR 解析带注释的 JSON

January 27th, 2023 skywind 4 comments

前文《基于 LR(1) 和 LALR 的 Parser Generator》里介绍了春节期间开发的小玩具 LIBLR ，今天春节最后一天，用它跑一个小例子，解析带注释的 json 文件。由于 python 自带 json 库不支持带注释的 json 解析，而 vscode 里大量带注释的 json 没法解析，所以我们先写个文法，保存为 json.txt：

# 定义两个终结符
%token NUMBER
%token STRING

start: value                {get1}
     ;

value: object               {get1}
     | array                {get1}
     | STRING               {get_string}
     | NUMBER               {get_number}
     | 'true'               {get_true}
     | 'false'              {get_false}
     | 'null'               {get_null}
     ;

array: '[' array_items ']'                  {get_array}
     ;

array_items: array_items ',' value          {list_many}
           | value                          {list_one}
           |                                {list_empty}
           ;

object: '{' object_items '}'                {get_object}
      ;

object_items: object_items ',' item_pair    {list_many}
            | item_pair                     {list_one}
            |                               {list_empty}
            ;

item_pair: STRING ':' value                 {item_pair}
         ;

# 词法：忽略空白
@ignore [ \r\n\t]*

# 词法：忽略注释
@ignore //.*

# 词法：匹配 NUMBER 和 STRING
@match NUMBER [+-]?\d+(\.\d*)?
@match STRING "(?:\\.|[^"\\])*"

有了文法，程序就很短了，50 多行足够：（点击 more 展开）

基于 LR(1) 和 LALR 的 Parser Generator

January 26th, 2023 skywind No comments

最近处理文本比较多，先前想增强下正则，看来不够用了，有同学推荐了我 Pyl 和 Lark，看了两眼，初看还行，但细看有一些不太喜欢的地方，于是刚好春节几天有空，从头写了一个 LR(1) / LALR 的 Generator，只有一个 LIBLR.py 的单文件，没有其它依赖：

GitHub – skywind3000/LIBLR: Parser Generator for LR(1) and LALR

用法很简单，给定文法，返回 Parser：

import LIBLR

# 注意这里是 r 字符串，方便后面写正则
# 所有词法规则用 @ 开头，从上到下依次匹配
grammar = r'''
start: WORD ',' WORD '!';

@ignore [ \r\n\t]*
@match WORD \w+
'''

parser = LIBLR.create_parser(grammar)
print(parser('Hello, World !'))

输出：

Node(Symbol('start'), ['Hello', ',', 'World', '!'])

默认没有加 Semantic Action 的话，会返回一颗带注释的语法分析树（annotated parse-tree）。

支持语义动作（Semantic Action），可以在生成式中用 {name} 定义，对应 name 的方法会在回调中被调用：

import LIBLR

# 注意这里是 r 字符串，方便后面写正则
grammar = r'''
# 事先声明终结符
%token number

E: E '+' T          {add}
 | E '-' T          {sub}
 | T                {get1}
 ;

T: T '*' F          {mul}
 | T '/' F          {div}
 | F                {get1}
 ;

F: number           {getint}
 | '(' E ')'        {get2}
 ;

# 忽略空白
@ignore [ \r\n\t]*
# 词法规则
@match number \d+
'''

# 定义语义动作：各个动作由类成员实现，每个方法的
# 第一个参数 rule 是对应的生成式
# 第二个参数 args 是各个部分的值，类似 yacc/bison 中的 $0-$N 
# args[1] 是生成式右边第一个符号的值，以此类推
# args[0] 是继承属性
class SemanticAction:
    def add (self, rule, args):
        return args[1] + args[3]
    def sub (self, rule, args):
        return args[1] - args[3]
    def mul (self, rule, args):
        return args[1] * args[3]
    def div (self, rule, args):
        return args[1] / args[3]
    def get1 (self, rule, args):
        return args[1]
    def get2 (self, rule, args):
        return args[2]
    def getint (self, rule, args):
        return int(args[1])

parser = LIBLR.create_parser(grammar, SemanticAction())
print(parser('1+2*3'))

输出：

（点击 more 查看更多）

Python 中使用组合方式构建复杂正则

January 17th, 2023 skywind No comments

正则写复杂了很麻烦，难写难调试，只需要两个函数，就能用简单正则组合构建复杂正则：

比如输入一个字符串规则，可以使用 {name} 引用前面定义的规则：

# rules definition
rules = r'''
    protocol = http|https
    login_name = [^:@\r\n\t ]+
    login_pass = [^@\r\n\t ]+
    login = {login_name}(:{login_pass})?
    host = [^:/@\r\n\t ]+
    port = \d+
    optional_port = (?:[:]{port})?
    path = /[^\r\n\t ]*
    url = {protocol}://({login}[@])?{host}{optional_port}{path}?
'''

然后调用 regex_build 函数，将上面的规则转换成一个字典并输出：

# expand patterns in a dictionary
m = regex_build(rules, capture = True)

# list generated patterns
for k, v in m.items(): 
    print(k, '=', v)

结果：

protocol = (?P<protocol>http|https)
login_name = (?P<login_name>[^:@\r\n\t ]+)
login_pass = (?P<login_pass>[^@\r\n\t ]+)
login = (?P<login>(?P<login_name>[^:@\r\n\t ]+)(:(?P<login_pass>[^@\r\n\t ]+))?)
host = (?P<host>[^:/@\r\n\t ]+)
port = (?P<port>\d+)
optional_port = (?P<optional_port>(?:[:](?P<port>\d+))?)
path = (?P<path>/[^\r\n\t ]*)
url = (?P<url>(?P<protocol>http|https)://((?P<login>(?P<login_name>[^:@\r\n\t ]+)(:(?P<login_pass>[^@\r\n\t ]+))?)[@])?(?P<host>[^:/@\r\n\t ]+)(?P<optional_port>(?:[:](?P<port>\d+))?)(?P<path>/[^\r\n\t ]*)?)

用手写直接写是很难写出这么复杂的正则的，写出来也很难调试，而组合方式构建正则的话，可以将小的简单正则提前测试好，要用的时候再组装起来，就不容易出错，上面就是组装替换后的结果。

下面用里面的 url 这个规则来匹配一下：

（点击 more 展开）

Skywind Inside

Archive

使用 LIBLR 解析带注释的 JSON

基于 LR(1) 和 LALR 的 Parser Generator

Python 中使用组合方式构建复杂正则

Categories

Recent Comments

Meta

Categories

Blogroll

Archives

Meta

Skywind Inside

Archive

使用 LIBLR 解析带注释的 JSON

基于 LR(1) 和 LALR 的 Parser Generator

Python 中使用组合方式构建复杂正则

Popular Posts

Tag Cloud

Categories

Recent Comments

Meta

Categories

Blogroll

Archives

Meta