Understanding complex RegEx

I’ve been wondering for a while whether there is a good way to reverse engineer the meaning of a complex regular expression, such as the one used in the make_clickable function in WordPress. This morning, while debugging an issue where this function causes occasional segfaults in PHP, I went looking for a suitable tool and found YAPE::Regex::Explain to be the only reasonable option.

So, with some trepidation, I installed the modules from CPAN and adapted an example script to take the regex used in make_clickable. I ended up with the following simple script:

use strict;
use warnings;
use YAPE::Regex::Explain;

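# The pattern used in make_clickable, with the slashes escaped for qr/.../.
my $re = qr/(?<!=[\'"])(?<=[*\')+.,;:!&\$\s>])(\()?([\w]+?:\/\/(?:[\w\\x80-\\xff#%~\/?@\[\]-]|[\'*(+.,;:!=&\$](?![\b\)]|(\))?([\s]|$))|(?(1)\)(?![\s<.,;:]|$)|\)))+)/;

# Print a node-by-node explanation of the compiled pattern.
print YAPE::Regex::Explain->new($re)->explain();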

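This generated the following detailed output. One caveat: in the PHP source the pattern lives in a single-quoted string, where \\ collapses to a single backslash, so PCRE sees the byte range \x80-\xff. Perl was given the doubled backslashes literally, which is why that character class is explained below in terms of '\\', 'x', '8' and so on.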

The regular expression:

(?-imsx:(?<!=[\'"])(?<=[*\')+.,;:!&\$\s>])(\()?([\w]+?://(?:[\w\\x80-\\xff#%~/?@\[\]-]|[\'*(+.,;:!=&\$](?![\b\)]|(\))?([\s]|$))|(?(1)\)(?![\s<.,;:]|$)|\)))+))

matches as follows:

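NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
(?<!                     look behind to see if there is not:
----------------------------------------------------------------------
=                        '='
----------------------------------------------------------------------
[\'"]                    any character of: '\'', '"'
----------------------------------------------------------------------
)                        end of look-behind
----------------------------------------------------------------------
(?<=                     look behind to see if there is:
----------------------------------------------------------------------
[*\')+.,;:!&\$\s>]       any character of: '*', '\'', ')', '+',
                         '.', ',', ';', ':', '!', '&', '\$',
                         whitespace (\n, \r, \t, \f, and " "),
                         '>'
----------------------------------------------------------------------
)                        end of look-behind
----------------------------------------------------------------------
(                        group and capture to \1 (optional
                         (matching the most amount possible)):
----------------------------------------------------------------------
\(                       '('
----------------------------------------------------------------------
)?                       end of \1 (NOTE: because you are using a
                         quantifier on this capture, only the LAST
                         repetition of the captured pattern will be
                         stored in \1)
----------------------------------------------------------------------
(                        group and capture to \2:
----------------------------------------------------------------------
[\w]+?                   any character of: word characters (a-z,
                         A-Z, 0-9, _) (1 or more times (matching
                         the least amount possible))
----------------------------------------------------------------------
://                      '://'
----------------------------------------------------------------------
(?:                      group, but do not capture (1 or more
                         times (matching the most amount
                         possible)):
----------------------------------------------------------------------
[\w\\x80-                any character of: word characters (a-
\\xff#%~/?@\[\]-         z, A-Z, 0-9, _), '\\', 'x', '8', '0'
]                        to '\\', 'x', 'f', 'f', '#', '%', '~',
                         '/', '?', '@', '\[', '\]', '-'
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
[\'*(+.,;:!=&\$]         any character of: '\'', '*', '(', '+',
                         '.', ',', ';', ':', '!', '=', '&',
                         '\$'
----------------------------------------------------------------------
(?!                      look ahead to see if there is not:
----------------------------------------------------------------------
[\b\)]                   any character of: '\b' (backspace),
                         '\)'
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
(                        group and capture to \3 (optional
                         (matching the most amount
                         possible)):
----------------------------------------------------------------------
\)                       ')'
----------------------------------------------------------------------
)?                       end of \3 (NOTE: because you are
                         using a quantifier on this capture,
                         only the LAST repetition of the
                         captured pattern will be stored in
                         \3)
----------------------------------------------------------------------
(                        group and capture to \4:
----------------------------------------------------------------------
[\s]                     any character of: whitespace (\n,
                         \r, \t, \f, and " ")
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
$                        before an optional \n, and the end
                         of the string
----------------------------------------------------------------------
)                        end of \4
----------------------------------------------------------------------
)                        end of look-ahead
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
(?(1)                    if back-reference \1 matched, then:
----------------------------------------------------------------------
\)                       ')'
----------------------------------------------------------------------
(?!                      look ahead to see if there is not:
----------------------------------------------------------------------
[\s<.,;:]                any character of: whitespace (\n,
                         \r, \t, \f, and " "), '<', '.',
                         ',', ';', ':'
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
$                        before an optional \n, and the end
                         of the string
----------------------------------------------------------------------
)                        end of look-ahead
----------------------------------------------------------------------
|                        else:
----------------------------------------------------------------------
\)                       ')'
----------------------------------------------------------------------
)                        end of conditional on \1
----------------------------------------------------------------------
)+                       end of grouping
----------------------------------------------------------------------
)                        end of \2
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------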
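
With a breakdown like this in hand, it becomes practical to write a pattern so that it documents itself. The following is only a rough sketch of that idea, not the WordPress pattern: it takes the two look-behinds and the scheme match from above, shortens the character classes considerably, and spreads them out using Perl's /x flag so each piece can carry a comment. The variable name $url_re, the helper find_urls and the sample URLs are made up for illustration.

```perl
use strict;
use warnings;

# A cut-down sketch of the opening of the pattern, NOT the full WordPress regex.
# Under /x, whitespace outside character classes is ignored and "#" starts a
# comment, so a literal "#" is written as "\#".
my $url_re = qr{
    (?<! = ['"] )        # not directly after =' or =" (an attribute value)
    (?<= [\s>('] )       # must follow whitespace or one of > ( ' (shortened set)
    ( \w+? :// )         # group 1: the scheme, for example http://
    ( [\w\#%~/?\@.-]+ )  # group 2: a simplified run of URL characters
}x;

# Hypothetical helper: return every URL-like substring found in $text.
sub find_urls {
    my ($text) = @_;
    my @urls;
    while ($text =~ /$url_re/g) {
        push @urls, $1 . $2;
    }
    return @urls;
}

print join("\n", find_urls(" see http://example.com/path and (https://example.org/x)")), "\n";
```

Because the positive look-behind requires a preceding character, a URL at the very start of the string will not match unless the text is padded with a leading space first. The simplified character class also has none of the trailing-punctuation handling that the look-aheads in the real pattern provide.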


5 Responses to “Understanding complex RegEx”

  1. Joseph Scott says:

    Not bad! These types of hairy regular expressions remind me that I should consider using regex comments for anything nearing complex patterns.

  2. Austin says:

    Maybe we should use the “x” pattern modifier for complicated regex like this in WordPress so we can add comments to the trickier parts.

    Can you open a ticket with the steps to reproduce the segfault?


