PHP Lesson 17 - Regular Expressions

Parham:
Regular expressions are one of the trickiest things to learn.  There are a lot of components to them, but they can at the same time be very powerful.  A regular expression is a pattern which lets you match an arbitrary string, dissect it, and check it for validity.  A regular expression uses a set of special characters to describe the strings it should match.  For those of you that have used DOS or the Unix shell, the wildcards in "dir *.txt" (for DOS) or "ls *.txt" work on a similar idea: they ask that the dir/ls commands only return names that end with ".txt" and have any other characters before that.  Shell wildcards are not true regular expressions, though; in a regular expression "*" means "zero or more of the previous item" rather than "any characters".

Why would you want to use regular expressions in your scripts?  The biggest reason would be to validate what a user inputs into fields in an HTML form and submits to your PHP script.  I won't go into the negatives, but for example, if you had the HTML field "age", you would only expect the user to input a number.  If the user inputs anything other than numbers, you don't want that information to go into your database.  You can use regular expressions to validate what the user inputs in the "age" field, and if they type in something bad, you can warn them.

The six basic patterns used in regular expressions are:

Pattern: a*
Matches: '', 'a', 'aa', ...
Explanation: match "a" zero or more times

Pattern: b+
Matches: 'b', 'bb', ...
Explanation: match "b" one or more times

Pattern: ab?c
Matches: 'ac', 'abc'
Explanation: match "a" followed by "b" optionally and then "c"

Pattern: [abc]
Matches: 'a' or 'b' or 'c'
Explanation: match "a" or "b" or "c" once

Pattern: [a-c]
Matches: 'a' or 'b' or 'c'
Explanation: Abbreviation for the above

Pattern: [abc]*
Matches: '', 'accb', ...
Explanation: Combination of "one from a set" and "zero or more"; match "a" or "b" or "c" zero or more times from the set

The "^" character is used to check whether something "starts at the beginning of the string".  The "$" character is used to check whether something "finishes at the end of the string".  The "|" character is used as the "or" separator.  The "|" character is not like the square bracket characters, because the "|" character separates regular expressions, NOT characters.  Parentheses are used to group regular expressions.  Curly brackets are used to match a regular expression a certain number of times (or a minimum/maximum number of times).  I know this is a little too much to take in, but soon there will be a massive amount of examples to explain all of these regular expression characters.

There are also a few special sequences which are used to represent common characters.  Those are:

\t -> Tab
\n -> Newline
\r -> Carriage Return
\* -> Asterisk
\\ -> Backslash
\d -> Digits [0-9]
\w -> Word [a-zA-Z0-9_] (letters, numbers, and the underscore)
\s -> Whitespace [ \t\r\n] (a space, a tab, a carriage return, a newline)
. -> Anything except end-of-line [^\n] (literally any character that isn't a newline)

The function used in PHP to match a string using regular expressions is the preg_match() function.  This function uses Perl-compatible regular expressions to match a string.  The function takes the following [simple] parameters:

int preg_match (string pattern, string subject)

The "pattern" must be wrapped in delimiters, and by convention it starts and ends with the "/" character.  The main reason for this is that this function follows Perl's regular expression syntax, and Perl writes its patterns between "/"'s (if you have used Perl, regular expressions there aren't function calls; they use the m// operator instead).  The function returns 1 if the "pattern" matched something in the "subject" and 0 otherwise (it can also return false if the pattern itself is invalid).
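
Since the return value is what you will usually test in a condition, here is a minimal sketch of doing that.  The sample strings are made up for illustration, and the comparison against 1 is only there so that a 0 and an error are both treated as "no match".


--- Code: ---<?php
// preg_match() returns 1 on a match and 0 on no match. It can also return
// false if the pattern itself is broken, so compare against 1 explicitly.
$subjects = array("aaa", "bbb"); // sample inputs, chosen for illustration
$results = array();
foreach ($subjects as $subject) {
    if (preg_match('/a+/', $subject) === 1) {
        $results[$subject] = "match";
    } else {
        $results[$subject] = "no match";
    }
    echo $subject, ": ", $results[$subject], "\n";
}

--- End code ---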

Here are a few simple examples:


--- Code: ---echo preg_match("/a/", "a"); //matches "a"
echo preg_match("/b/", "a"); //doesn't match, needs a "b"
echo preg_match("/a+/",""); //doesn't match, needs to have at least 1 "a"
echo preg_match("/a+/","a"); //matches, at least one "a"
echo preg_match("/a+/","aaaaaa"); //matches, at least one "a"
echo preg_match("/a*/",""); //matches, 0 or more "a"'s
echo preg_match("/a*/","aaaaaaaaaa"); //matches, 0 or more "a"'s
echo preg_match("/[xyz]/","x"); //matches, there is an "x"
echo preg_match("/[xyz]/","y"); //matches, there is a "y"
echo preg_match("/[xyz]/","z"); //matches, there is a "z"
echo preg_match("/[xyz]/","a"); //doesn't match, there is no "x", "y", or "z"
echo preg_match("/[a-z]/","q"); //matches, "q" is in the range from "a" to "z"
echo preg_match("/[0-9]/","5"); //matches, "5" is in the range from "0" to "9"
echo preg_match("/[0-9]/","s"); //doesn't match, "s" is not in the range from "0" to "9"

--- End code ---

examples of the "|" character:


--- Code: ---//note that the | does not match only the character before or after,
//the | character matches everything either before or after unless you group
//them

//not grouped
echo preg_match("/ab|cd/","ab"); //matches
echo preg_match("/ab|cd/","cd"); //matches

//grouped
echo preg_match("/a(b|c)d/","abd"); //matches
echo preg_match("/a(b|c)d/","acd"); //matches
echo preg_match("/a(b|c)d/","ad"); //doesn't match

--- End code ---

examples of the "*" character:


--- Code: ---echo preg_match("/ab*/","abbbb"); //matches
echo preg_match("/ab*/","bbbbb"); //fails

--- End code ---

examples of the "+" character:


--- Code: ---echo preg_match("/a+b/","aaaab"); //matches
echo preg_match("/a+b/","b"); //fails

--- End code ---

examples with "\w" character:


--- Code: ---echo preg_match("/\w+/","abc"); //matches
echo preg_match("/\w+/","a_b_c"); //matches
echo preg_match("/\w+/","0123456789"); //matches
echo preg_match("/\w+/","-"); //fails, "-" is not a part of \w
echo preg_match("/\w+/"," "); //fails, space is not a part of \w
echo preg_match("/\w+/",""); //fails, have to have at least one \w

--- End code ---
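
The "\d" and "\s" sequences from the list above work the same way as "\w".  This is a small sketch with made-up sample strings; the patterns are written in single quotes here so that PHP itself leaves the backslashes alone.


--- Code: ---<?php
// Single-quoted patterns keep PHP from interpreting the backslashes itself.
$d1 = preg_match('/\d+/', "2003");        // 1: contains digits
$d2 = preg_match('/\d+/', "abc");         // 0: no digit anywhere
$s1 = preg_match('/\s/', "hello there");  // 1: contains a space
$s2 = preg_match('/\s/', "hello");        // 0: no whitespace
// digits, then one whitespace character, then word characters
$s3 = preg_match('/^\d+\s\w+$/', "17 lessons"); // 1
echo $d1, $d2, $s1, $s2, $s3, "\n";

--- End code ---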

examples with "?" character:


--- Code: ---echo preg_match("/a?b?c?/","a"); //matches
echo preg_match("/a?b?c?/","b"); //matches
echo preg_match("/a?b?c?/","c"); //matches
echo preg_match("/a?b?c?/","abc"); //matches
echo preg_match("/a?b?c?/","ab"); //matches
echo preg_match("/a?b?c?/","bc"); //matches

--- End code ---

examples with "^" and "$" characters:


--- Code: ---echo preg_match("/^im/","image"); //matches
echo preg_match("/^im/","imagine"); //matches
echo preg_match("/^im/","embrace"); //doesn't match
echo preg_match("/er$/","programmer"); //matches
echo preg_match("/er$/","designer"); //matches
echo preg_match("/er$/","designing"); //doesn't match
echo preg_match("/^(ab|cd)$/","ab"); //matches
echo preg_match("/^(ab|cd)$/","cd"); //matches
echo preg_match("/^(ab|cd)$/","abcd"); //doesn't match
echo preg_match("/^(ab|cd)$/","xy"); //doesn't match

--- End code ---

examples with curly brackets:


--- Code: ---echo preg_match("/a{2}/","aaa"); //matches, found "aa" somewhere
echo preg_match("/^a{2}$/","aa"); //matches, entire string is "aa"
echo preg_match("/^a{2}$/","aaa"); //doesn't match, entire string isn't "aa"
echo preg_match("/a{2,4}/","aaa"); //matches, minimum "aa", maximum "aaaa"
echo preg_match("/a{2,4}/","aaaa"); //matches
echo preg_match("/a{2,4}/","a"); //doesn't match
echo preg_match("/^a{2,4}$/","aabaa"); //doesn't match

--- End code ---

a few common regular expressions (these are by no means secure... JUST simple):


--- Code: ---echo preg_match("/^[-.\w]+\@[-.\w]+$/","email@email.com"); //email addresses
echo preg_match("/^\d{2}$/","24"); //ages
echo preg_match("/^(19|20)\d\d$/","1983"); //years
echo preg_match("/^([\w\s]+)$/","hello there"); //a simple string
echo preg_match("/^(http:\/\/www\.|http:\/\/|www\.)([\w\.\/\=\?\&\-]+)$/","http://www.google.com"); //urls

--- End code ---
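
To tie this back to the "age" form field mentioned at the start, here is one way the check could be wrapped up.  The is_valid_age() helper and the $submitted variable are made up for this sketch (they are not part of PHP), and like the patterns above it is simple rather than secure.


--- Code: ---<?php
// Hypothetical helper, not part of PHP: accepts ages of one to three digits.
function is_valid_age($input)
{
    return preg_match('/^\d{1,3}$/', $input) === 1;
}

// $submitted stands in for a value such as $_POST["age"].
$submitted = "24";
if (is_valid_age($submitted)) {
    echo "Age accepted\n";
} else {
    echo "Please enter your age as a number\n";
}

--- End code ---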

Tyris:
hey, thanks a lot for that!!!! :)
just wish you'd written it a bit earlier ;)
It really helped me understand Regular Expressions a lot better than I did...
as well as cleared up quite a few problems I was pondering.

one question...
with

--- Code: ---echo preg_match("/^(ab|cd)$/","abcd"); //doesn't match
--- End code ---
that kinda thing...
from what it seems like you've stated in the ^ and $ description...
it should match either an 'ab or cd' at the start, and either an 'ab or cd' at the end...
so wouldn't that pass...?
I understand that you've said that ^blah$ means the WHOLE string is and ONLY is "blah"... but I don't quite understand why because of your explanation of ^ and $...
how would you make it start AND end with "ab"... coz preg_match("/^ab$/","ab_blah_ab") would apparently fail...

thanx.

Parham:
I should probably clear that up... the ^ and $ characters don't match any characters themselves; they anchor the regex to a position in the string.  They are almost always written at the very beginning and very end of the regex (or of one alternative inside a group).  Now if you only have the ^, that'll mean that your regex must match at the very beginning of the string.  If you only have $, that means your regex must match at the very end.  If you have both, then your regex must match the entire string.
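
A quick sketch of those three cases against the same string, plus what happens to "|" when the parentheses are left out (the variable names are just for illustration):


--- Code: ---<?php
$start = preg_match('/^ab/', "abcd");  // 1: string starts with "ab"
$end   = preg_match('/cd$/', "abcd");  // 1: string ends with "cd"
$whole = preg_match('/^ab$/', "abcd"); // 0: string is not exactly "ab"

// Without the parentheses, "|" splits the whole pattern into "^ab" OR "cd$".
$loose = preg_match('/^ab|cd$/', "abcd");   // 1: it starts with "ab"
$tight = preg_match('/^(ab|cd)$/', "abcd"); // 0: whole string must be "ab" or "cd"
echo $start, $end, $whole, $loose, $tight, "\n";

--- End code ---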

Look carefully at the regex and the example you're inquiring about, Tyris.  If you read the regex back, it says something like this: "From beginning to end, the string must be 'ab' OR 'cd'".  The "|" character splits up your regex, and it'll match either one or the other, NOT both.  If you wanted to alter that regex to match both, then you'd have to do something like this:


--- Code: ---echo preg_match("/^(ab|cd)(ab|cd)$/","abcd"); //matches

--- End code ---

now if you wanted to make it start and end with "ab", then this would probably be the regex you want:


--- Code: ---echo preg_match("/^ab.*ab$/","abcdab"); //note, .* means match any character that isn't a newline zero or more times

--- End code ---

Reading this regex literally: "The string must start with 'ab', then anything can exist in the middle, and it must end with 'ab'"
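
Here is that same regex run against a few strings, including the "ab_blah_ab" from the question.  The list of test strings is made up for this sketch.


--- Code: ---<?php
$pattern = '/^ab.*ab$/';
$tests = array("ab_blah_ab", "abcdab", "abab", "ab", "ab_blah");
$results = array();
foreach ($tests as $t) {
    // 1 when $t starts with "ab" and ends with a separate "ab", otherwise 0
    $results[$t] = preg_match($pattern, $t);
    echo $t, " => ", $results[$t], "\n";
}

--- End code ---

Note that a plain "ab" fails, because the regex asks for one "ab" at the start and another one at the end.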

You might just be thinking of them a little too generally or abstractly.  Say out loud what you want to match, and how you want to match it, and then try to build a regex from that.  When you know something MUST exist for it to match, put it in there literally, and then just try to fill the gaps with more general regex characters.

Hope that helps :)

Parham:
oh, for you VERY curious people, there is this tool that'll check to see whether regular expressions match strings you input:

http://www.weitz.de/regex-coach/

There is also A LOT of literature about the subject.  O'Reilly's "Mastering Regular Expressions" over at http://www.amazon.ca/exec/obidos/ASIN/0596002890/qid=1070772784/sr=1-1/ref=sr_1_0_1/701-8818493-0043539

I know post-secondary institutions are teaching theory on regular expressions also.  I at least was taught about them this year, both the theory and the practice.

Tyris:
ok, sweet thanks... just wanted the info that having both means it must match the entire string as such.. :)

am using the coach on my laptop too (tho laptop is dead temporarily)...
Thanx again :)
