Kenny Jue

home  about
      

rematch

June 07, 2020

Project Cover by Unknown Artist

About This Project

Rematch is a Go implementation of a simple grammar for order-independent string matching of words and patterns. It was created as Go’s regexp package does not support lookaheads (due to inefficient time complexity) thus making order-independent string matching infeasible.

With Rematch, I wanted to:

  • Create a simple grammar to match words and patterns that’s easy to memorize and understand
  • Use a mix of regex and standard library string functions to achieve order-independent string matching

How It Works

A Rematch expression is composed of alphanumeric, case sensitive words and patterns to be matched against an arbitrary target string. When processed, it produces a boolean match result, along with all found substrings if specified.

A word is identified as a token delimited by whitespaces, and behaves as a Regex word boundary \b would.

  • Word order is disregarded unlike most regex flavors.
  • When word matching, only alphanumeric tokens are compared with one another. Before matching occurs, any invalid characters present in the string will be replaced with whitespaces before being split with whitespace delimiters.
  • Words are not matched using regex.

A pattern is simply a string with any _, ?, and * wildcard operators present.

  • Unlike a word, it is matched against the entire string rather than word tokens.
  • This can allow matching more complex patterns such as URLs or words that may have punctuation or other non-alphanumeric characters present.
  • The wildcard operators are converted into their regex equivalents during the matching process.

Processing a Rematch expression consists of several core phases, and runs in linear time and space:

  1. Expression tokenization. During this phase, expressions are sanitized and tokenized. If invalid word or patterns are provided, it will be caught at this phase.
  2. Reorder expression tokens into Reverse Polish Notation using the Shunting-yard algorithm designed by Dijkstra. Invalid expression syntax will also be caught in this phase through a basic state machine. Any grouping operators will be discarded, and negated tokens will be recorded.
  3. Evaluation of the tokens in RPN. Any operands get pushed into a stack from the token collection, and are specially handled when an operator is seen. A match will be evaluated and any matching substrings will also be returned.

To determine which substrings get returned, Rematch will simply check if each word or pattern is present in the target string and return one or more matches if true. Note that these returned substrings are unordered and may have duplicates. However, there are some more complicated rules on which results get returned.

  • If the final match result of the expression evaluates into false, then the returned substring collection will be empty. It will be non-empty on most occasions if the final match result is true.
  • Infix operators such as + and | will not affect whether this matched substring is returned or not.
  • Example: exression (a+b)|c is matched against a target string. a is present, b is not, and c is present. Since it evaluates to a final result of true, anything matching a and c will be returned in the substring collection, even though the subexpression (a+b) evaluated to false.
  • However, the negation ! will affect whether a matched word or pattern is returned. The following examples are simple, but the core concept will persist for more complex expressions.

    • Example: expression !foo is matched against a target string that does not contain afoo word. While this means !foo will return true, it will not return any matching substrings as it does not exist. This is also the case
    • Example: expression !!foo is matched against a target string that does contain a foo word. As two ! operators would cancel out, foo will be returned as a matched substring.
    • Example: expression !foo|bar is matched against a target string that contains both a foo and bar word. Because foo is negated, it will not be returned as a matched substring - even though it is is present in the target string. bar, however, is also present in the target string, so this will get returned along with a final evaluated match result of true.

Expression Examples

  • cat|food_and_yarn - this matches the word cat or a pattern where food and yarn are present in this order and can be separated by zero or more whitespaces.

  • cat*|fo?d - this matches a pattern that is prefixed with cat or a pattern that starts with fo and ends with d and contains no more than one character in between.

  • foo*bar+!(two+3) - this matches a pattern that begins with foo and ends with bar. Anything or nothing can be present in between. If this and two and3 are present in the string, will evaluate to false. In other words, the string must contain the word one but not two and 3, though one of these appearing in the string is fine (sort of acts like an XOR in this example)

  • Using Rematch’s pattern matching capabilities, it’s possible to match URLs despite being limited to alphanumeric characters only: https???github?com?pixeltopic?rematch

    • However keep in mind something like httpsgithub.compixeltopicrematch could be matched, or https$$$github$com&pixeltopic&rematch. This behavior makes it less effective at matching exact patterns, but more effective at finding similar patterns.

In the future, exact matches with non-alphanumeric symbols could be implemented, but requires wrapping words and patterns in double quotes and escaping wildcard operators and quotes when applicable.

Grammar

Rematch supports the following grammar:

  • | OR infix operator, used between operands. Specifies at least one of the operands it modifies must be present in the target string.
  • + AND infix operator, used between operands. Specifies that both of the operands it modifies must be present in the target string.
  • * wildcard (0 to n). When evaluating, * gets converted into a lazy match wildcard in regex: [\s\S]*?.
  • ? wildcard (0 to 1). When evaluating, ? gets converted into a regex [\s\S]?.
  • _ whitespace wildcard (0 to n). When evaluating, _ gets converted into a lazy whitespace match in regex: [\s]*?. Works like an asterisk * wildcard, but only captures whitespaces instead of all characters. This operator is the only way to match whitespaces.
  • () grouping to override standard operator precedence, which is left to right.
  • ! NOT operator, used before words. Use this with caution, as you may end up with broad query matches.

Installation

You can get the latest release of Rematch by using:

go get github.com/pixeltopic/rematch

import "github.com/pixeltopic/rematch"

Usage

package main

import (
    "fmt"
	  "github.com/bwmarrin/discordgo"
	  "github.com/pixeltopic/rematch"
)

func main() {
    // the words "moon" and "cow" must both be present in the string.
    res, _ := rematch.EvalRawExpr(
        "moon+cow", "The cow jumped over the moon.") 
    fmt.Println(res)
}

See /examples for more detailed usage.

Links