Understanding complex RegEx

I’ve been wondering for a while whether there is a good way to reverse engineer the meaning of a complex regular expression, such as the one used in the make_clickable function in WordPress. This morning, while debugging an issue where this function causes occasional segfaults in PHP, I went looking for a suitable tool and found YAPE::Regex::Explain to be the only reasonable option.

So, with some trepidation, I installed the modules from CPAN and adapted an example script to take the regex used in make_clickable. I ended up with the following simple script:

use strict;
use warnings;
use YAPE::Regex::Explain;

# The pattern used in make_clickable, compiled as a Perl regex object.
my $re = qr/(?<!=[\'"])(?<=[*\')+.,;:!&\$\s>])(\()?([\w]+?:\/\/(?:[\w\\x80-\\xff#%~\/?@\[\]-]|[\'*(+.,;:!=&\$](?![\b\)]|(\))?([\s]|$))|(?(1)\)(?![\s<.,;:]|$)|\)))+)/;

# Print a node-by-node explanation of the compiled pattern.
print YAPE::Regex::Explain->new($re)->explain();

Which generated the following detailed output:

The regular expression:

(?-imsx:(?<!=[\'"])(?<=[*\')+.,;:!&\$\s>])(\()?([\w]+?://(?:[\w\\x80-\\xff#%~/?@\[\]-]|[\'*(+.,;:!=&\$](?![\b\)]|(\))?([\s]|$))|(?(1)\)(?![\s<.,;:]|$)|\)))+))

matches as follows:

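NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
(?<!                     look behind to see if there is not:
----------------------------------------------------------------------
=                        '='
----------------------------------------------------------------------
[\'"]                    any character of: '\'', '"'
----------------------------------------------------------------------
)                        end of look-behind
----------------------------------------------------------------------
(?<=                     look behind to see if there is:
----------------------------------------------------------------------
[*\')+.,;:!&\$\s>]       any character of: '*', '\'', ')', '+',
                         '.', ',', ';', ':', '!', '&', '\$',
                         whitespace (\n, \r, \t, \f, and " "),
                         '>'
----------------------------------------------------------------------
)                        end of look-behind
----------------------------------------------------------------------
(                        group and capture to \1 (optional
                         (matching the most amount possible)):
----------------------------------------------------------------------
\(                       '('
----------------------------------------------------------------------
)?                       end of \1 (NOTE: because you are using a
                         quantifier on this capture, only the LAST
                         repetition of the captured pattern will be
                         stored in \1)
----------------------------------------------------------------------
(                        group and capture to \2:
----------------------------------------------------------------------
[\w]+?                   any character of: word characters (a-z,
                         A-Z, 0-9, _) (1 or more times (matching
                         the least amount possible))
----------------------------------------------------------------------
://                      '://'
----------------------------------------------------------------------
(?:                      group, but do not capture (1 or more
                         times (matching the most amount
                         possible)):
----------------------------------------------------------------------
[\w\\x80-                any character of: word characters (a-
\\xff#%~/?@\[\]-         z, A-Z, 0-9, _), '\\', 'x', '8', '0'
]                        to '\\', 'x', 'f', 'f', '#', '%', '~',
                         '/', '?', '@', '\[', '\]', '-'
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
[\'*(+.,;:!=&\$]         any character of: '\'', '*', '(', '+',
                         '.', ',', ';', ':', '!', '=', '&',
                         '\$'
----------------------------------------------------------------------
(?!                      look ahead to see if there is not:
----------------------------------------------------------------------
[\b\)]                   any character of: '\b' (backspace),
                         '\)'
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
(                        group and capture to \3 (optional
                         (matching the most amount
                         possible)):
----------------------------------------------------------------------
\)                       ')'
----------------------------------------------------------------------
)?                       end of \3 (NOTE: because you are
                         using a quantifier on this capture,
                         only the LAST repetition of the
                         captured pattern will be stored in
                         \3)
----------------------------------------------------------------------
(                        group and capture to \4:
----------------------------------------------------------------------
[\s]                     any character of: whitespace (\n,
                         \r, \t, \f, and " ")
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
$                        before an optional \n, and the end
                         of the string
----------------------------------------------------------------------
)                        end of \4
----------------------------------------------------------------------
)                        end of look-ahead
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
(?(1)                    if back-reference \1 matched, then:
----------------------------------------------------------------------
\)                       ')'
----------------------------------------------------------------------
(?!                      look ahead to see if there is not:
----------------------------------------------------------------------
[\s<.,;:]                any character of: whitespace (\n,
                         \r, \t, \f, and " "), '<', '.',
                         ',', ';', ':'
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
$                        before an optional \n, and the end
                         of the string
----------------------------------------------------------------------
)                        end of look-ahead
----------------------------------------------------------------------
|                        else:
----------------------------------------------------------------------
\)                       ')'
----------------------------------------------------------------------
)                        end of conditional on \1
----------------------------------------------------------------------
)+                       end of grouping
----------------------------------------------------------------------
)                        end of \2
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------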
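
The node that is probably least familiar in this output is the conditional, (?(1)...|...), which picks a branch depending on whether capture group 1 took part in the match. As a minimal sketch of how that construct behaves on its own, here is a much simpler pattern than the WordPress one. The pattern and the helper name balanced are made up for illustration and are not part of make_clickable:

```perl
use strict;
use warnings;

# Hypothetical, simplified pattern (NOT the make_clickable regex):
#   (\()?      group 1: an optional opening parenthesis
#   \w+        the body, standing in for the URL
#   (?(1)\))   conditional: only if group 1 matched, require a closing ")"
my $re = qr/\A(\()?\w+(?(1)\))\z/;

# Returns 1 if the whole string matches the pattern, 0 otherwise.
sub balanced {
    my ($s) = @_;
    return $s =~ $re ? 1 : 0;
}

for my $s ('(abc)', 'abc', '(abc', 'abc)') {
    printf "%-6s %s\n", $s, balanced($s) ? 'match' : 'no match';
}
```

This accepts '(abc)' and 'abc' but rejects '(abc' and 'abc)'. Going by the explanation above, make_clickable seems to use the same idea in a more elaborate form: when an opening parenthesis was captured just before the URL, a ')' inside the URL is only accepted under the stricter look-ahead condition.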

5 thoughts on “Understanding complex RegEx”

  1. Not bad! These types of hairy regular expressions remind me that I should consider using regex comments for anything nearing complex patterns.

  2. Maybe we should use the “x” pattern modifier for complicated regex like this in WordPress so we can add comments to the trickier parts.

    Can you open a ticket with the steps to reproduce the segfault?
