> Regex

This is a discussion on Regex, within the PHP section. This forum and the thread "Regex" are both part of the Programming Your Website category.

Ryan
post May 12 2008, 09:54 PM
Post #1


Rapid Squeezer
****

Group: Members
Posts: 119
Joined: 14-February 08
From: Hounslow, London
Member No.: 133



I suck at regex, and I just can't get the hang of it no matter how long I stare at tutorials.

How would I retrieve the URL from an anchor tag in HTML using preg_replace?

For example: <a href="/thispage.php"> to /thispage.php


--------------------
Rakuli
post May 12 2008, 10:39 PM
Post #2


Squeeze Machine
*****

Group: Team Leaders
Posts: 568
Joined: 13-February 08
From: Catching the squeezed drips downunder.
Member No.: 13



Do you want to retrieve the href or replace it in the string?

CODE
// To replace the opening tag with just its URL
$string = preg_replace('/<a[\s]{1}href\=\"([\w\.0-9\/\-\%\?\=]+)\"[\s]?>/', '$1', $string);

echo $string;

// To capture the URL without changing the string
$matches = array();

preg_match('/<a[\s]{1}href\=\"([\w\.0-9\/\-\%\?\=]+)\"[\s]?>/', $string, $matches);

echo $matches[1];



That should work, I think, though I haven't tested it.

To Explain

/ = open the regex

<a[\s]{1} = match this exactly <a and one space

href\=\" = match this exactly escaping the equals and quotes

( = open the capture group. Whatever matches inside the parentheses is captured, so it can be used in the replacement string ($1) or read from the matches array

[\w\.0-9\/\-\%\?\=]+ = Looking for characters you may find in a URL. \w means any word character (a-z, A-Z, 0-9 and _), so the explicit 0-9 is redundant but harmless; then come full stops, slashes, hyphens and the rest... surrounded by square brackets to flag that this is a class of characters we are looking for. The plus sign means we need at least one of the preceding characters to match

) = close the capture

\"[\s]?> = look for a quote followed by zero or one spaces and then the closing bracket.

I hope that helps a little
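
For anyone who wants to try the two calls side by side, here is a minimal sketch. The sample string in $html is made up for illustration; the pattern is the one from this post with its closing / delimiter in place.

```php
<?php
// Hypothetical sample input; the pattern is the one from this post.
$html = 'Go to <a href="/thispage.php"> for details';
$pattern = '/<a[\s]{1}href\=\"([\w\.0-9\/\-\%\?\=]+)\"[\s]?>/';

// preg_match returns 1 on a match and fills $matches:
// index 0 is the whole match, index 1 is the first capture group.
$matches = array();
$found = preg_match($pattern, $html, $matches);
$url = $found ? $matches[1] : null;

// preg_replace swaps the matched opening tag for the captured URL.
$replaced = preg_replace($pattern, '$1', $html);

echo $url, "\n";       // /thispage.php
echo $replaced, "\n";  // Go to /thispage.php for details
```

Checking the return value of preg_match before reading $matches[1] avoids an undefined index notice when nothing matches.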


--------------------
Bright Idea? -- Don't Let it disappear
rewake
post May 13 2008, 11:02 AM
Post #3


Rapid Squeezer
****

Group: Mentor
Posts: 205
Joined: 14-February 08
From: NY, USA
Member No.: 127



Hi guys,

The only suggestion I have is you may want to open up the regexp a bit to allow for other attributes within the link tag, like title, name, etc.

Rich


--------------------
QUOTE
if ($name=='will') echo '/(bb|[^b]{2})/';

Rakuli
post May 13 2008, 05:55 PM
Post #4



<a[\s]{1}href\=\"([\w\.0-9\/\-\%\?\=]+)\"(.*)?>

Changing the regex to the one above will take care of rewake's suggestion without getting too finicky. Quantifiers are greedy by default, so (.*)? will swallow any additional attributes, provided the href is the first one.


--------------------
Ryan
post May 14 2008, 05:22 AM
Post #5



Ah, it works. I'm running into a couple problems, though.

I was using this to pick up anchors on a page, and it's coughing when it comes to <acronym> and yours is also picking up <a href="#identifier"> when I don't want it to. Can you make it ignore identifier links?

CODE
preg_match_all("@<a[^>]*?>@siu", $site, $matches);



And a quick question... Where does the $1 get the data from? Is it the first set of parenthesis? Would the next set of () be $2?
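
On the <acronym> problem mentioned above, a rough sketch of why it happens and one possible way around it. The $site fragment and the stricter second pattern are assumptions for illustration, not something from the thread.

```php
<?php
// Hypothetical page fragment with one acronym and two anchors.
$site = '<acronym title="x">X</acronym> <a href="/a.php">A</a> <a href="#top">Top</a>';

// The pattern from the post: "<a" followed by anything up to ">",
// so "<acronym ...>" qualifies too.
preg_match_all("@<a[^>]*?>@siu", $site, $loose);

// Requiring whitespace right after "<a" rules out <acronym>, <abbr>, <address>...
preg_match_all('@<a\s[^>]*>@si', $site, $strict);

print_r($loose[0]);   // three tags, including <acronym title="x">
print_r($strict[0]);  // only the two <a ...> tags
```

The stricter pattern still picks up the #top link; leaving those out is a separate step.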


--------------------
c010depunkk
post May 14 2008, 05:40 AM
Post #6


Rapid Squeezer
****

Group: Members
Posts: 164
Joined: 14-February 08
From: Willich, Germany
Member No.: 56



QUOTE (Ryan @ May 14 2008, 12:22 PM) *
Where does the $1 get the data from? Is it the first set of parenthesis? Would the next set of () be $2?

Yes and yes! You're catching on :D
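
A small sketch of the numbering, in case it helps. The tag and both patterns are made up for illustration; only the $1 / $2 behaviour is the point.

```php
<?php
// Hypothetical input; both patterns below are illustrative, not from the thread.
$tag = '<a href="/thispage.php" title="This page">';

// Two capture groups: $1 is the href value, $2 is the title value.
$pattern = '/<a\s+href="([^"]*)"\s+title="([^"]*)">/';

$swapped = preg_replace($pattern, '$2 -> $1', $tag);
echo $swapped, "\n"; // This page -> /thispage.php

// Groups are numbered by their opening parenthesis, left to right,
// so with nesting the outer group gets the lower number.
preg_match('/((\w+)\.php)/', '/thispage.php', $m);
echo $m[1], ' ', $m[2], "\n"; // thispage.php thispage
```

The same numbers are the indexes of the array that preg_match fills, with index 0 holding the whole match.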


--------------------
www.c010depunkk.com ~ the hangout of a web developer
Rakuli
post May 14 2008, 06:49 AM
Post #7



Whoops, forgot to mention that in preg_replace, $1, $2, etc. correspond to the sets of parentheses, counted from left to right, as c010depunkk has confirmed.

Okay, so your previous regex was picking up acronyms when it shouldn't, and you want mine to ignore #identifiers. You could change it to something like

<a[\s]{1}href\=\"([\w\.0-9\/\-\%\?\=^#]{1}[\w\.0-9\/\-\%\?\=]+)\"(.*)?>

That should ignore the in-page links but still allow for identifiers appended to the end of URLs.
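
One thing worth checking before relying on this pattern: as far as I know, ^ only negates a character class when it is the first character inside the brackets. The sketch below runs the pattern from this post (wrapped in / delimiters) against two made-up tags to see what it actually does.

```php
<?php
// Pattern from this post, wrapped in / delimiters.
$pattern = '/<a[\s]{1}href\=\"([\w\.0-9\/\-\%\?\=^#]{1}[\w\.0-9\/\-\%\?\=]+)\"(.*)?>/';

// Inside [...], ^ only negates when it is the very first character.
// Anywhere else it is a literal caret, and # is a literal hash, so the
// first class still accepts "#".
$inPage = preg_match($pattern, '<a href="#identifier">');

// The second class has no "#", so a fragment after a URL blocks the match.
$withFragment = preg_match($pattern, '<a href="page.php#top">');

var_dump($inPage);       // int(1)
var_dump($withFragment); // int(0)
```

So with these inputs the in-page link is still matched, and a URL with an identifier on the end is not.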


--------------------
Ryan
post May 14 2008, 08:01 PM
Post #8



Could it remove the identifiers at the end of the URL, actually?

Does [\s]{1} mean that there can only be one space after the <a? I originally thought the {1} corresponded with the $1, but looking at it, that didn't make sense :)

Would this match <a href=" and <a href='

<a[\s]{1}href\=[\"\']


--------------------
Rakuli
post May 14 2008, 08:10 PM
Post #9



QUOTE (Ryan @ May 15 2008, 11:01 AM) *
Could it remove the identifiers at the end of the URL, actually?

Does [\s]{1} mean that there can only be one space after the <a? I originally thought the {1} corresponded with the $1, but looking at it, that didn't make sense :)

Would this match <a href=" and <a href='

<a[\s]{1}href\=[\"\']


I can see that you are starting to get the hang of it now -- be careful though, you can fall terribly in love with regular expressions and their usefulness.

Yes, that last snippet would match both " and '. The numbers in curly braces set how many times the preceding item may repeat: {1} means exactly one, {1,} means at least one (+ means the same thing, but you can use {3,} for at least 3), {1,5} means between 1 and 5, and {0,6} means at most 6 occurrences (write the 0 explicitly; {,6} is not treated as a quantifier by every PCRE version). These numbers always apply to the pattern or character class directly preceding them.
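
The counts are easy to try out. A minimal sketch, using throwaway strings of the letter a (my own test inputs, nothing from the thread):

```php
<?php
// Each pattern is anchored with ^ and $ so the count must fit the whole string.
$exactlyOne  = preg_match('/^a{1}$/', 'a');        // 1: exactly one
$notTwo      = preg_match('/^a{1}$/', 'aa');       // 0: two is too many
$atLeastOne  = preg_match('/^a{1,}$/', 'aaaa');    // 1: same as a+
$atLeast3    = preg_match('/^a{3,}$/', 'aa');      // 0: needs three or more
$oneToFive   = preg_match('/^a{1,5}$/', 'aaaaaa'); // 0: six exceeds five
$atMostSix   = preg_match('/^a{0,6}$/', '');       // 1: zero is allowed

// The quantifier applies to the item directly before it:
// here the whole class ["'] rather than a single literal.
$eitherQuote = preg_match('/^href=["\']{1}$/', "href='"); // 1
```

Without the ^ and $ anchors, a pattern like a{1} would also succeed on 'aa', because it is free to match just part of the string.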



Now, to remove the identifier at the end, this gets a little trickier

<a[\s]{1}href\=\"([\w\.0-9\/\-\%\?\=^#]{1}[\w\.0-9\/\-\%\?\=&;]+)(#[\w\.0-9\/\-\%\?\=&;]*)?\"(.*)?>

There is a new subpattern at the end of the href="" which says: look for # followed by some URL characters. The question mark after the subpattern means it may appear once or not at all and the pattern will still match. The first subpattern in parentheses will now hold the whole URL and query string, excluding the identifier at the end.
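
To see the optional subpattern at work, here is a rough sketch using the pattern from this post wrapped in / delimiters. The two tags are hypothetical test inputs.

```php
<?php
// Pattern from this post, wrapped in / delimiters.
$pattern = '/<a[\s]{1}href\=\"([\w\.0-9\/\-\%\?\=^#]{1}[\w\.0-9\/\-\%\?\=&;]+)(#[\w\.0-9\/\-\%\?\=&;]*)?\"(.*)?>/';

// Hypothetical tags: one with a trailing identifier, one without.
$withId    = preg_replace($pattern, '$1', '<a href="page.php?id=3#comments">');
$withoutId = preg_replace($pattern, '$1', '<a href="page.php?id=3">');

echo $withId, "\n";    // page.php?id=3
echo $withoutId, "\n"; // page.php?id=3
```

The identifier ends up in $2 when it is present, so it can still be recovered if needed.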

Cheers,


--------------------
Ryan
post May 14 2008, 09:05 PM
Post #10



Eek, that's huge. Two things, though. The preg_match <a[\s]{1}href\=\"([\w\.0-9\/\-\%\?\=^#]{1}[\w\.0-9\/\-\%\?\=]+)\"(.*)?> is matching the anchor and everything after it, when I want it to stop after the a closes.

For instance, it grabs <a href="#this" title="this and that">this and that</a></li> and <a href="index.php" title="this">some more of that</a></li>

Is it possible to not match links with just an identifier link in the first place? So it would completely ignore <a href="#skip">.

Thanks a ton for your help. Is there a program that helps you learn Regex? I learn best doing examples, and there's really not a way you can see why a certain regular expression doesn't match a string or give an example of what it would match.


--------------------
Rakuli
post May 14 2008, 10:01 PM
Post #11



LOL -- okay, so if you were a little confused before, you're going to love it now.

I'll rewrite it and explain as best I can but it will involve "lookarounds" using lookaheads and behinds to skip anything that is just an identifier link (And we'll turn off the greediness so it doesn't catch so much at once).

/<a[\s]+href=[\"\']{1}(?!#)([\w\.\d\/\-\%\?\=&;]+)(#[\w\.0-9\/\-\%\?\=&;]*)?[\"\']{1}[\w\d\.\-\/\%\?\=&]*>/

Okay, so I've refined it a touch this time around...

<a followed by one or more spaces [\s]+ then href= with " or '. Then comes the tricky part: <a[\s]+href=[\"\']{1} will only be considered a match if it is not followed by a #. This is a negative lookahead, (?!#): (? opens the lookaround, ! makes it negative, # is the pattern to look for, and ) closes it. Basically the rest is much the same, except I have narrowed the scope of allowed characters after the href closes to stop it matching the entire file when you search.

See how that goes.
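
A trimmed-down sketch of the lookahead on its own may make it easier to follow. The shortened pattern and the $html fragment are assumptions for illustration, not the full pattern from this post.

```php
<?php
// A trimmed-down variant of the pattern in this post (an assumption for
// illustration): quote, negative lookahead, then the captured URL.
$pattern = '/<a[\s]+href=["\'](?!#)([\w\.\/\-\%\?\=&;]+)/';

// Hypothetical fragment with an in-page link and a normal link.
$html = '<a href="#skip">Skip</a> <a href=\'index.php\'>Home</a>';

preg_match_all($pattern, $html, $matches);

// (?!#) checks the next character without consuming it, so the capture
// still starts at the first character of the URL.
print_r($matches[1]); // only index.php
```

Because the lookahead consumes nothing, it can sit in front of the capture group without changing what ends up in $1.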


--------------------
Ryan
post May 14 2008, 10:40 PM
Post #12



It's not matching anchors with the title attribute. Is that why it was greedy?


--------------------
Rakuli
post May 14 2008, 11:07 PM
Post #13



Forgot to include ' and " as valid characters after the href. Since pretty much any character could appear between the href and the closing >, I'll go back to the match-any-character method but switch greediness off to try to stop it from matching everything after the href. (Note the U after the closing /; this is the pattern modifier that turns greediness off.)

/<a[\s]+href=[\"\']{1}(?!#)([\w\.\d\/\-\%\?\=&;]+)(#[\w\.0-9\/\-\%\?\=&;]*)?[\"\']{1}.*>/U
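
To see what the U modifier changes, here is a minimal sketch with a shortened pattern (my own simplification) run against a made-up line similar to Ryan's example.

```php
<?php
// Hypothetical line: one link followed by more markup.
$html = '<li><a href="index.php" title="this">some more</a></li>';

// Greedy: .* runs to the end and backs off only to the LAST ">".
preg_match('/<a href="[^"]+".*>/', $html, $greedy);

// U modifier: quantifiers become lazy, so .* stops at the FIRST ">".
preg_match('/<a href="[^"]+".*>/U', $html, $ungreedy);

// Inline equivalent without the modifier: make just this quantifier lazy.
preg_match('/<a href="[^"]+".*?>/', $html, $lazy);

echo $greedy[0], "\n";   // <a href="index.php" title="this">some more</a></li>
echo $ungreedy[0], "\n"; // <a href="index.php" title="this">
echo $lazy[0], "\n";     // <a href="index.php" title="this">
```

The U modifier flips every quantifier in the pattern, while .*? only affects the one it is attached to, so the inline form is probably the safer choice when only one part is over-matching.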


--------------------
Ryan
post May 14 2008, 11:58 PM
Post #14



Okay, one more problem, and I think it's minor. The preg_replace isn't catching <a href="/">.


--------------------
Rakuli
post May 15 2008, 01:04 AM
Post #15



What do you mean by "isn't catching"? Do you want to be able to nab it in a backreference? You can just put parentheses around it and count from left to right to get your $1, $2, etc.


--------------------
Ryan
post May 15 2008, 01:18 AM
Post #16



preg_replace("/<a[\s]{1}href\=\"(http:\/\/[www.]?ryanfait.co.uk[\/]?)?([\w\.0-9\/\-\%\?\=^#]{1}[\w\.0-9\/\-\%\?\=&;]+)(#[\w\.0-9\/\-\%\?\=&;]*)?\"(.*)?>/", "$2", $anchor);

It doesn't pull out the "/" in this: <a href="/" title="Las Vegas Web Design" accesskey="1">


--------------------
Ryan
post May 16 2008, 12:47 AM
Post #17



I still can't figure out why it's not working on <a href="/">. Sorry to be a pain, Rakuli, but I have learned quite a bit about regex in this little thread :)


--------------------
Rakuli
post May 16 2008, 01:09 AM
Post #18



The reason it isn't matching the / for the root is this part

([\w\.0-9\/\-\%\?\=^#]{1}[\w\.0-9\/\-\%\?\=&;]+

It wants one character from the first set AND the + after the second set requires one or more from that set, so the URL needs at least two characters. Change that plus to an asterisk (zero or more) and you will be in business. A question mark would also let / through, but it would cap the URL at two characters.

Cheers,

preg_replace("/<a[\s]{1}href\=\"(http:\/\/[www.]?ryanfait.co.uk[\/]?)?([\w\.0-9\/\-\%\?\=^#]{1}[\w\.0-9\/\-\%\?\=&;]*)(#[\w\.0-9\/\-\%\?\=&;]*)?\"(.*)?>/", "$2", $anchor);
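
A quick way to check how the quantifier after the second set behaves on short and long URLs. This sketch isolates just the URL-capturing part of the pattern and anchors it with ^ and $; the variable names and the two test URLs are my own.

```php
<?php
// Only the URL-capturing part of the pattern, anchored so the whole
// href value must fit. The three variants differ only in the quantifier
// that follows the second set.
$first  = '[\w\.0-9\/\-\%\?\=^#]{1}';
$second = '[\w\.0-9\/\-\%\?\=&;]';

$plus     = '/^' . $first . $second . '+$/';
$question = '/^' . $first . $second . '?$/';
$star     = '/^' . $first . $second . '*$/';

// Print 1 (match) or 0 (no match) for each variant and URL.
foreach (array('/', '/index.php') as $url) {
    printf("%-12s + %d   ? %d   * %d\n", $url,
        preg_match($plus, $url),
        preg_match($question, $url),
        preg_match($star, $url));
}
```

With these inputs, + rejects the bare /, ? rejects /index.php, and * accepts both.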


--------------------
Ryan
post May 16 2008, 01:15 AM
Post #19

