Go Back   Forums > Community Chatterbox > Tech Corner > Programming
Memberlist Forum Rules Today's Posts
Search Forums:
Click here to use Advanced Search

Reply
 
Thread Tools Display Modes
Old 26-01-2006, 08:54 PM   #21
Reup
10 GOSUB Abandonia
20 GOTO 10
 
Reup's Avatar

 
Join Date: Dec 2004
Location: Eindhoven, Netherlands
Posts: 1,508
Default

Me neither. There's an excellent tutorial on finite state automate on this website though. And check out the wikipedia links on Pushdown automaton. Might get you further along the way. It seems a bit over the top though for a simple pattern matching operation, or am I being naive here?
Reup is offline                         Send a private message to Reup
Reply With Quote
Old 26-01-2006, 09:12 PM   #22
Kon-Tiki
[BANNED]

 
Join Date: Sep 2004
Location: Dentergem, Belgium
Posts: 1,811
Default

It's a bit more than pattern matching. It's case insensitive pattern replacing with filling up lacking parts of a pattern. Let me explain it more concrete, with the example of a guestbook.

In the guestbook, users can use BBcode (the []-tags you can use on forums as well). It's quickly done by using str_replace(), which'd replace all BBcode tags with their corresponding HTML in a given string (which's the user's input, in this case).

Now take this case:
Code:
User_one: Foo [b]bar
If I'd use str_replace(), all text after the tag'd be bold, until I'd accidentally use a [/b] somewhere. User_two, User_three and User_one's next post all'll be bold.

What the guestbook should do instead, is check if there're less closing tags for each tag than opening tags, then add the closing tags at the end of the input, so that it'll look like this (in a case of bad user input):
Code:
User_one: Foo [b]bar [b]Foo[b]bar[/b][/b][/b]
After that, I can do a str_replace on the tags (still's a bit more complicated for url-tags, but that still's possible to do. Already got it to work)

If that works, it'll be good, but it'll still not catch all cases of input. It'd miss this, for example:
Code:
User_one: Foo[B]bar [b]Foo[b]bar[/b][/b]
It'll add only two closing-tags at the end, for the simple reason that it won't match the capital B. For simple, one-letter tags, it's still doable, though. In cases like three-letter tags, like the URL-one, it's too cumbersome.

That's what the problems and purposes are. It's more than just a pattern matching, due to the possible adding and the definite replacing. Regular expressions seem to be providing horrible code for this. That or I didn't code it well :angel:
Kon-Tiki is offline                         Send a private message to Kon-Tiki
Reply With Quote
Old 26-01-2006, 09:49 PM   #23
plix
Game freak

 
Join Date: Oct 2005
Location: ,
Posts: 113
Default

Quote:
Originally posted by Reup@Jan 26 2006, 04:54 PM
Might get you further along the way. It seems a bit over the top though for a simple pattern matching operation, or am I being naive here?
Naive. One of those methods (FSM, PDA, LBA, etc) is standard in any kind of parsing. The key idea here is what most people don't seem to understand: parsing != pattern matching. This is the primary reason why most BBCode parsers and other such small HTML-related parsers (sanitizers, etc) are so fragile and error-prone: because the developers simply assumed that using some pattern matching and replacement they could kludge together an easy parser.

About a year ago I wrote an extremely forgiving, correcting, (X)HTML parser which validates against an arbitrary DTD (for custom tag support) and supports callback filters for custom rendering of elements. It's *way* more complex than what you need to do basic BBCode parsing and sanitization, but it's based on the exact same idea. However, since my implementation was correcting and supported filters it required the development of a full parse tree, which you shouldn't need. Unless you want to do complex transformations you can probably get away with doing things in-place.

Note: FSMs and PDAs are not the same thing, only similar. The stack is absolutely crucial for an HTML or BBCode parser, which is why a FSM is not appropriate.
plix is offline                         Send a private message to plix
Reply With Quote
Old 26-01-2006, 09:58 PM   #24
plix
Game freak

 
Join Date: Oct 2005
Location: ,
Posts: 113
Default

Quote:
Originally posted by Kon-Tiki@Jan 26 2006, 05:12 PM
That's what the problems and purposes are. It's more than just a pattern matching, due to the possible adding and the definite replacing. Regular expressions seem to be providing horrible code for this. That or I didn't code it well :angel:
The problem with the case-insensitivity is easily fixed. Perl regular expressions have a bunch of modifiers available for such things. You probably want to be using at least the i and g modifiers (for case insensitive and global matching, respectively).

Another major problem with using regexps is for this is that running them across multiple lines can be a real pain (it's possible, but it complicates things).
plix is offline                         Send a private message to plix
Reply With Quote
Reply


Similar Threads
Thread Thread Starter Forum Replies Last Post
regular sgtboat Blah, blah, blah... 4 26-11-2008 03:30 AM
When will regular updates start? Doink Blah, blah, blah... 6 13-12-2007 10:00 AM


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Forum Jump
 


The current time is 01:15 PM (GMT)

 
Powered by vBulletin® Version 3.7.1
Copyright ©2000 - 2025, Jelsoft Enterprises Ltd.