Regular Expressions [Archive]

Kon-Tiki

26-01-2006, 09:32 AM

I'm learning regex now, and I'm quite stumped. I'd ask the instructor, but he said yesterday that he's no good with them either. I've made a test page to get the hang of it, but its output isn't what it should be, according to how I read the code.

$string = "Best niet [b] bevet Moddervet Niet vet";
echo $string, "<br>";

eregi("(\[b\])", $string, $resultaat);
$open_tags = count($resultaat);

eregi("(\[/b\])", $string, $resultaat_closed);
$closed_tags = count($resultaat_closed);

echo "Open: ", $open_tags, "<br>Closed: ", $closed_tags;

That's my code (stripped of all set-up tags like <html> etc). This is its output:
Best niet [b] bevet Moddervet Niet vet
Open: 2
Closed: 0
No matter how many results it should find, it'll always give 2, except if there are no results, like in Closed. Then it'll give 0. I don't see why it doesn't give 1 for the Open one in the current example. Any help would be greatly appreciated :)

Rogue

26-01-2006, 09:59 AM

I did work a bit with regular expressions, but not in PHP.

Can you first explain what you'd like the program to do?

Kon-Tiki

26-01-2006, 10:32 AM

I would like the program to tell me how many times it encounters [*b*] and how many times it encounters [/*b*] (without the *s) in a certain string. It then should add as many [/*b*]s as needed if the amount of [*b*]s is bigger than the amount of [/*b*]s.

In the end, it should turn all the [*b*][/*b*]-tags into HTML's <b> tags, but it should close as many as are opened, or it'd mess up my input. Eventually I'd like to do that with hyperlinks too, which makes it so complicated.

Bobbin Threadbare

26-01-2006, 10:59 AM

This is PHP? :blink:

Kon-Tiki

26-01-2006, 11:23 AM

Yep, regular expressions within PHP. It's one of the biggest brainwreckers out there for it, but sometimes a necessity. For more basic information, you can always look here (http://www.regular-expressions.info). The site explains what it is, but it won't help me solve my problem. preg_match_all() can, though. I'm going to try that one out now.

Data

26-01-2006, 11:29 AM

Uhm, you misunderstood what is placed in $resultaat.

$resultaat[0] holds a copy of the complete matched string,
$resultaat[1] holds the match for the first () group,
$resultaat[2] holds the match for the second () group, and so on.

So if you match for [*b*], you will find [*b*] in 0 (as it's the full match) and again in 1 (as it is the first set of () that is matched). As you don't have a second pair of (), you won't find any more entries.
The count idea you are trying to implement will not work this way.
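
The array layout described above can be sketched as follows. This is only an illustration: eregi() was removed in PHP 7, so preg_match() with the i modifier stands in for it here, on the assumption that the match array is filled the same way (slot 0 for the full match, slot 1 for the first group).

```php
<?php
// Sketch: preg_match() stands in for eregi(); the i modifier makes it case-insensitive.
$string = "Best niet [b] bevet Moddervet Niet vet";

// One capture group, so a successful match fills exactly two slots.
preg_match('/(\[b\])/i', $string, $resultaat);

echo count($resultaat), "\n";   // 2: the full match plus one group
echo $resultaat[0], "\n";       // [b] (full match)
echo $resultaat[1], "\n";       // [b] (first parenthesised group)

// No match leaves the array empty, which is where "Closed: 0" comes from.
preg_match('/(\[\/b\])/i', $string, $resultaat_closed);
echo count($resultaat_closed), "\n";  // 0
```

So count() on this array reports the number of groups plus one, not the number of times the tag occurs in the string.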

Kon-Tiki

26-01-2006, 11:42 AM

var_dump() gives wrong results too.

$string = "kakakapipikaka";
$reg_ex = "(ka)";
eregi($reg_ex, $string, $test);
echo "Test: ", var_dump($test), "<br>";

Output: Test: array(2) { [0]=> string(2) "ka" [1]=> string(2) "ka" }
The way I'd expect it to be is [0]=> string(2) "ka" [1]=> string(2) "ka" [2]=> string(2) "ka" [3]=> string(2) "ka" [4]=> string(2) "ka" [5]=> string(2) "ka" (with something different for [0]), if I understand correctly.

A different solution works now, though.
$number = preg_match_all($reg_ex, $string, $resultaat) returns 5 :)
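
A minimal sketch of the counting approach, for reference. The preg functions expect a delimited pattern; the string "(ka)" happens to work because PHP reads the parentheses as delimiters around the pattern ka. The version below uses the more usual slash delimiters, and adds substr_count() as an alternative for fixed strings (not something used in the posts above).

```php
<?php
$string = "kakakapipikaka";

// preg_match_all() returns the number of full-pattern matches it found.
$number = preg_match_all('/ka/', $string, $resultaat);
echo $number, "\n";               // 5
echo count($resultaat[0]), "\n";  // 5, $resultaat[0] lists every full match

// For a fixed substring, substr_count() gives the same answer without a regex.
echo substr_count($string, "ka"), "\n";  // 5
```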

Data

26-01-2006, 11:52 AM

well the example you posted using var_dump gives the correct output for
ereg(i)

you don't understand what eregi matches. (it only matches the string ka once.)
and as that is the total string as well you see it in both 0 and 1

Kon-Tiki

26-01-2006, 11:56 AM

So eregi() stops the moment it finds a match. That's pretty useful, but not for what I'm trying to do :D Thanks for the clarification :cheers:

Data

26-01-2006, 12:17 PM

well it can match it more often if you specify it in the regexstring.
but it's not suitable for counting.
the preg_match_all seems a better choice for that

Kon-Tiki

26-01-2006, 12:19 PM

Quick side-question... is it possible to make a function in PHP where one of the arguments is optional? A function that'd be in the manual as function_name(int one, int two, [int three])

Rogue

26-01-2006, 12:30 PM

What are you trying to do?

Replace [b] with <b>?

If that's the case, then use ereg(i)_replace.

Kon-Tiki

26-01-2006, 12:31 PM

Or str_replace(), but it wouldn't make sure each [*b*]'d have a [/*b*], which's another thing that needs to be checked and fixed.

Data

26-01-2006, 12:46 PM

uhm

function blah($arg1, $arg2=10) {

can be called like:
blah(10)
and blah(10,20)
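
Spelled out as a complete, runnable sketch (add_up is a made-up name, mirroring the function_name(int one, int two, [int three]) shape from the question):

```php
<?php
// Hypothetical function: $three is optional because it has a default value.
function add_up($one, $two, $three = 0)
{
    return $one + $two + $three;
}

echo add_up(1, 2), "\n";     // 3
echo add_up(1, 2, 3), "\n";  // 6
```

Arguments with defaults have to come after the required ones, otherwise they can't be left out in a call.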

Rogue

26-01-2006, 12:47 PM

Then in reg. expression, you have to check for this:

'(\[b\])?(\[\/b\])'

or something like that.

So, you're looking for a regular expression that has both the opening and the closing tag in one line, and then you use a replace function to fix it.

I'm not sure if ? will work for all the characters in between; Data might be able to tell you that (or just check the reference on the page you posted above).

When you are done, check for tags that have no matching opening/closing tag.

plix

26-01-2006, 04:42 PM

Originally posted by Anubis@Jan 26 2006, 08:47 AM
Then in reg. expression, you have to check for this:
'(\[b\])?(\[\/b\])'
That wouldn't work as that statement doesn't require the opening bold tag to exist. A better (though not great) solution would be:
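/\[b\](.*?)\[\/b\]/i
which uses non-greedy matching (so you'd be best off using the Perl-compatible preg functions in PHP rather than the ereg functions). In Perl you'd add the g modifier as well, but PHP's preg functions don't have one: preg_match_all() and preg_replace() already work on every match.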
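
A small sketch of that non-greedy pattern with preg_replace(), assuming well-formed input where every opening tag has a closing tag. The sample string is made up for illustration.

```php
<?php
$input = "Foo [b]bar[/b] and [B]baz[/B]";

// .*? is non-greedy: each [b] pairs with the nearest following [/b].
// The i modifier makes [B] match as well; preg_replace() replaces every match.
$html = preg_replace('/\[b\](.*?)\[\/b\]/i', '<b>$1</b>', $input);

echo $html, "\n";  // Foo <b>bar</b> and <b>baz</b>
```

With a greedy .* instead, the first [b] would pair with the last [/B] and the tags in between would be left untouched inside one big bold span.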

If you're trying to write a forgiving BBCode parser and aren't just learning regexp then you're going about things all wrong. Regular expressions are for matching patterns, not for constructing push-down automata. Keep in mind that there's nesting and nesting requirements which regular expressions just can't handle well (it's possible with things like look-ahead and look-behind matching, but it's not pretty, it's not fast, and it's not reliable).

Kon-Tiki

26-01-2006, 06:41 PM

How do you suggest I'd go 'bout it then?

Reup

26-01-2006, 06:46 PM

Originally posted by plix@Jan 26 2006, 07:42 PM
it's possible with things like look-ahead and look-behind matching, but it's not pretty, it's not fast, and it's not reliable
Not to mention a b*tch to debug...

plix

26-01-2006, 08:39 PM

Originally posted by Kon-Tiki@Jan 26 2006, 02:41 PM
How do you suggest I'd go 'bout it then?
Originally posted by plix
Regular expressions are for matching patterns, not for constructing push-down automata.
Use a push-down automaton (a finite state machine doesn't include a stack, which is necessary to do open- and close-tag matching). It's a bit harder to implement if you haven't written one before, but it's not only easier to maintain, it's much more flexible.
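
To give an idea of what the stack buys you, here is a rough sketch in that spirit. It is not plix's parser and not a full push-down automaton, just a tokenizer plus a stack of open tags. The function name bbcode_to_html, the tag list and the decision to drop stray closing tags are all assumptions made for the example.

```php
<?php
// Hypothetical helper: convert simple BBCode tags to HTML, tracking open tags on a stack.
function bbcode_to_html($input, array $allowed = ['b', 'i', 'u'])
{
    $names = implode('|', $allowed);
    // Split into text and tag tokens; the capture group keeps the tags in the result.
    $tokens = preg_split('~(\[/?(?:' . $names . ')\])~i', $input, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $stack = [];
    $out = '';
    foreach ($tokens as $token) {
        if (!preg_match('~^\[(/?)(' . $names . ')\]$~i', $token, $m)) {
            $out .= htmlspecialchars($token);   // plain text, escaped for HTML
            continue;
        }
        $name = strtolower($m[2]);
        if ($m[1] === '') {                     // opening tag: push it
            $stack[] = $name;
            $out .= "<$name>";
        } elseif (in_array($name, $stack, true)) {
            // Closing tag: pop until the matching opener has been closed.
            do {
                $top = array_pop($stack);
                $out .= "</$top>";
            } while ($top !== $name);
        }
        // A closing tag with no opener on the stack is simply dropped.
    }
    // Close whatever is still open, innermost first.
    while ($stack) {
        $out .= '</' . array_pop($stack) . '>';
    }
    return $out;
}

echo bbcode_to_html("Foo [b]bar [B]Foo[b]bar[/b]"), "\n";
// Foo <b>bar <b>Foo<b>bar</b></b></b>
```

Because the stack remembers the order in which tags were opened, the same loop also handles mixed nesting such as [b][i]x[/b], which plain counting can't get right.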

Kon-Tiki

26-01-2006, 08:45 PM

Hmmm... I've never heard of a push-down automaton. I'll see if I can find a tutorial 'bout it :ok:

Reup

26-01-2006, 08:54 PM

Me neither. There's an excellent tutorial on finite state automata on this website (http://chortle.ccsu.edu/CS355/FiniteAutomata/Section01/sect01_1.html) though. And check out the Wikipedia article on the pushdown automaton (http://en.wikipedia.org/wiki/Pushdown_automaton). Might get you further along the way. It seems a bit over the top though for a simple pattern matching operation, or am I being naive here?

Kon-Tiki

26-01-2006, 09:12 PM

It's a bit more than pattern matching. It's case-insensitive pattern replacing, with filling in the missing parts of a pattern. Let me explain it more concretely, with the example of a guestbook.

In the guestbook, users can use BBcode (the []-tags you can use on forums as well). It's quickly done by using str_replace(), which'd replace all BBcode tags with their corresponding HTML in a given string (which's the user's input, in this case).

Now take this case:
User_one: Foo [b]bar
If I'd use str_replace(), all text after the tag would be bold, until someone accidentally used a [/b] somewhere. User_two's, User_three's and User_one's next posts would all be bold.

What the guestbook should do instead is check, for each tag, whether there are fewer closing tags than opening tags, then add the missing closing tags at the end of the input, so that it'll look like this (in a case of bad user input):
User_one: Foo [b]bar [b]Foo[b]bar[/b][/b][/b]
After that, I can do a str_replace() on the tags (it's a bit more complicated for URL tags, but still possible. I already got that to work).

If that works, it'll be good, but it'll still not catch all cases of input. It'd miss this, for example:
User_one: Foo[B]bar [b]Foo[b]bar[/b][/b]
It'll add only two closing-tags at the end, for the simple reason that it won't match the capital B. For simple, one-letter tags, it's still doable, though. In cases like three-letter tags, like the URL-one, it's too cumbersome.

That's what the problems and purposes are. It's more than just a pattern matching, due to the possible adding and the definite replacing. Regular expressions seem to be providing horrible code for this. That or I didn't code it well :angel:
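
The count-and-append idea described above could be sketched like this. close_and_convert is a hypothetical helper, and it assumes the simple case of a tag without attributes; URL tags would need more work.

```php
<?php
// Hypothetical helper: append missing closing tags, then convert the tag to HTML.
function close_and_convert($input, $tag)
{
    $quoted = preg_quote($tag, '/');
    // The i modifier counts [b] and [B] alike.
    $open   = preg_match_all('/\[' . $quoted . '\]/i', $input);
    $closed = preg_match_all('/\[\/' . $quoted . '\]/i', $input);

    if ($open > $closed) {
        $input .= str_repeat("[/$tag]", $open - $closed);
    }
    // str_ireplace() is the case-insensitive sibling of str_replace().
    return str_ireplace(["[$tag]", "[/$tag]"], ["<$tag>", "</$tag>"], $input);
}

echo close_and_convert("Foo[B]bar [b]Foo[b]bar[/b]", "b"), "\n";
// Foo<b>bar <b>Foo<b>bar</b></b></b>
```

Counting only compares totals, so input like [/b]text[b] still comes out unbalanced; that is the kind of case where keeping track of the order of tags is needed.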

plix

26-01-2006, 09:49 PM

Originally posted by Reup@Jan 26 2006, 04:54 PM
Might get you further along the way. It seems a bit over the top though for a simple pattern matching operation, or am I being naive here?
Naive. One of those methods (FSM, PDA, LBA, etc) is standard in any kind of parsing. The key idea here is what most people don't seem to understand: parsing != pattern matching. This is the primary reason why most BBCode parsers and other such small HTML-related parsers (sanitizers, etc) are so fragile and error-prone: because the developers simply assumed that using some pattern matching and replacement they could kludge together an easy parser.

About a year ago I wrote an extremely forgiving, correcting, (X)HTML parser which validates against an arbitrary DTD (for custom tag support) and supports callback filters for custom rendering of elements. It's *way* more complex than what you need to do basic BBCode parsing and sanitization, but it's based on the exact same idea. However, since my implementation was correcting and supported filters it required the development of a full parse tree, which you shouldn't need. Unless you want to do complex transformations you can probably get away with doing things in-place.

Note: FSMs and PDAs are not the same thing, only similar. The stack is absolutely crucial for an HTML or BBCode parser, which is why a FSM is not appropriate.

plix

26-01-2006, 09:58 PM

Originally posted by Kon-Tiki@Jan 26 2006, 05:12 PM
That's what the problems and purposes are. It's more than just a pattern matching, due to the possible adding and the definite replacing. Regular expressions seem to be providing horrible code for this. That or I didn't code it well :angel:
The problem with the case-insensitivity is easily fixed. Perl regular expressions have a bunch of modifiers available for such things. You probably want to be using at least the i modifier (for case-insensitive matching). In Perl itself you'd also add g for global matching; PHP's preg functions have no g modifier, because preg_match_all() and preg_replace() are global already.

Another major problem with using regexps for this is that running them across multiple lines can be a real pain (it's possible, but it complicates things).
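
One concrete instance of the multi-line complication, as a sketch: by default the dot does not match a newline, so a tag pair that spans two lines is silently skipped unless the s modifier is added. The sample input is made up.

```php
<?php
$input = "[b]first line\nsecond line[/b]";

// Without s, . stops at the newline, so the pair is never matched.
$without = preg_replace('/\[b\](.*?)\[\/b\]/i', '<b>$1</b>', $input);
// With s, . also matches "\n" and the tag pair spans both lines.
$with = preg_replace('/\[b\](.*?)\[\/b\]/is', '<b>$1</b>', $input);

var_dump($without === $input);  // bool(true): nothing was replaced
echo $with, "\n";
```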