L2J Server

Posted: **Tue Jul 10, 2012 10:24 am**

I've been updating a few pages in the wiki over the past couple of days, and I ran into a rather dumb word filter. I tried putting the word "afford" in the text, but it said "ford found, gtfo!"...
Perhaps the word filter should check for whitespace/punctuation around the word?
Because if it doesn't match "/[\s.,_+!@#$-]+ford[\s.,_+!@#$-]+/iu" or something like that, then it's not the word you're looking for.
http://www.morewords.com/contains/ford/ lists the words that contain "ford".
There are definitely other words that could be mis-filtered, and IMHO they shouldn't be getting in the way like this.
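To make the difference concrete, here is a minimal sketch comparing a bare substring pattern with the delimiter-class idea from this post. Both patterns and the `isFlagged` helper are hypothetical; the wiki's real filter is not shown here. The `-` sits at the end of the character class so that it is read as a literal hyphen and not as a range.

```php
<?php
// Hypothetical patterns for illustration only; not the wiki's actual filter.
$naive     = '/ford/iu';                               // flags any substring
$delimited = '/[\s.,_+!@#$-]+ford[\s.,_+!@#$-]+/iu';   // needs a delimiter on BOTH sides

// Returns true when the pattern finds at least one match in the text.
function isFlagged(string $pattern, string $text): bool
{
    return preg_match($pattern, $text) === 1;
}

$samples = ['I cannot afford it', 'cheap ford loans', 'Oxford University', 'ford'];

foreach ($samples as $text) {
    printf(
        "%-20s naive=%d delimited=%d\n",
        $text,
        isFlagged($naive, $text),
        isFlagged($delimited, $text)
    );
}
```

The naive pattern flags all four samples, while the delimited one flags only "cheap ford loans". Note that it also lets a bare "ford" through, because the class must consume a character on each side; a zero-width test such as `\b` does not have that gap.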

Posted: **Tue Jul 10, 2012 6:21 pm**

Too many Ford car loan, insurance and other spam ads on the wiki.

You're right, maybe a more complex regex would fix that.

Posted: **Tue Jul 10, 2012 6:43 pm**

What, are there that many despite registration and captcha?

Posted: **Tue Jul 10, 2012 6:45 pm**

ThePhoenixBird wrote: Too many Ford car loan, insurance and other spam ads on the wiki.

You're right, maybe a more complex regex would fix that.

Posted: **Wed Jul 11, 2012 11:31 pm**

jurchiks wrote: What, are there that many despite registration and captcha?

Hell yeah. That crazy regex spam word filter is actually what stops bots from posting junk on the wiki.

Check the user registration log to see the bots getting past the captchas: http://l2jserver.com/wiki/Special:Log/newusers

Posted: **Thu Jul 12, 2012 7:57 am**

What's your regex like? Is it smth like this?

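Code:

    // Alternation of blocked words; \b on both sides requires whole-word matches.
    $someString = 'audi|mazda|ford';
    $regex = "/\b($someString)\b/iu";

    // $yourTextHere is the text being checked.
    $result = preg_match_all($regex, $yourTextHere, $matches);
    if ($result)
    {
        print_r($matches);
    }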
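A quick way to see what the `\b` version does and does not catch, as a sketch only: the `findBlockedWords` helper and the sample strings are made up for illustration. One thing worth knowing is that `_` counts as a word character, so `\b` finds no boundary between "ford" and "_".

```php
<?php
// Hypothetical helper built around the word-boundary pattern from this post.
function findBlockedWords(string $text): array
{
    $words = 'audi|mazda|ford';
    $regex = "/\b($words)\b/iu";
    preg_match_all($regex, $text, $matches);
    return $matches[0]; // full matches only
}

$samples = [
    'I cannot afford a Mazda',   // "afford" passes, "Mazda" is caught
    'ford-loans for everyone',   // "-" is not a word character, so "ford" is caught
    'ford_loans for everyone',   // "_" is a word character, so nothing is caught
];

foreach ($samples as $text) {
    echo $text, ' => [', implode(', ', findBlockedWords($text)), "]\n";
}
```

If spammers glue words together with underscores, the boundary test would need to be replaced with explicit lookarounds that treat `_` as a separator.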

Posted: **Mon Sep 17, 2012 1:38 am**

Code:

    ## Spam Regex
    $wgSpamRegex =  "/".                        # The "/" is the opening wrapper
                    "s-e-x|zoofilia|sexyongpin|grusskarte|geburtstagskarten|animalsex|".
                    "job|bureau|employer|jobster|salary|jobs|bangalore|india|employee|blackberry|".
                    "freelancing|career|medical|airlines|government|florida|".
                    "toyota|mercedes|benz|chevrolet|honda|".
                    "diet|snack|carbohydrates|diets|cholesterol|vitamins|".
                    "brokers|banks|insurance|fargo|commonwealth|bank|credit|federal|wachovia|".
                    "sex-with|dogsex|adultchat|adultlive|camsex|sexcam|livesex|sexchat|footjob|".
                    "chatsex|onlinesex|adultporn|adultvideo|adultweb.|hardcoresex|hardcoreporn|".
                    "teenporn|xxxporn|lesbiansex|livegirl|livenude|livesex|livevideo|camgirl|pussy|".
                    "spycam|voyeursex|casino-online|online-casino|kontaktlinsen|cheapest-phone|".
                    "laser-eye|eye-laser|fuelcellmarket|lasikclinic|cragrats|parishilton|".
                    "paris-hilton|paris-tape|2large|fuel-dispenser|fueling-dispenser|huojia|".
                    "jinxinghj|telematicsone|telematiksone|a-mortgage|diamondabrasives|".
                    "reuterbrook|sex-plugin|sex-zone|lazy-stars|eblja|liuhecai|".
                    "buy-viagra|-cialis|-levitra|boy-and-girl-kissing|". # These match spammy words
                    "dirare\.com|".           # This matches dirare.com, a spammer's domain name
                    "overflow\s*:\s*auto|".   # This matches overflow:auto (regardless of whitespace on either side of the colon)
                    "height\s*:\s*[0-4]px|".  # This matches height:0px (most CSS hidden spam) (regardless of whitespace on either side of the colon)
                    "\<\s*a\s*href|".         # This blocks all href links entirely, forcing wiki syntax
                    "display\s*:\s*none".     # This matches display:none (regardless of whitespace on either side of the colon)
                    "/i";                     # The "/" ends the regular expression and the "i" switch which follows makes the test case-insensitive
                                              # The "\s" matches whitespace
                                              # The "*" is a repeater (zero or more times)
                                              # The "\s*" means to look for zero or more whitespace characters

Posted: **Mon Sep 17, 2012 1:39 am**

the regex is based mostly on the words used to spam our wiki

Posted: **Mon Sep 17, 2012 8:04 am**

1) If you put "sex" and "porn" in there, you can throw out all the words that contain them, since the pattern matches substrings anyway... that shortens the regex by a good amount.
2) job/jobs - only the former is necessary...
3) I'd put just "viagra" instead of "buy-viagra".
4) What is the actual code that parses this pattern? Does it go directly into preg_match*?
And what's with all the "-"?
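Points 1 to 3 can be automated. The following is a rough sketch, assuming the list holds only plain literal terms in an unanchored alternation; `pruneRedundantTerms` is a hypothetical helper, and entries that use regex syntax (such as "overflow\s*:\s*auto") would have to be kept out of it.

```php
<?php
// Hypothetical helper: drop every literal term that already contains another
// term from the list, because an unanchored alternation matches it anyway.
function pruneRedundantTerms(array $terms): array
{
    $terms = array_values(array_unique(array_map('strtolower', $terms)));
    $kept = [];
    foreach ($terms as $term) {
        $redundant = false;
        foreach ($terms as $other) {
            // A different term found inside this one makes this one redundant.
            if ($other !== $term && strpos($term, $other) !== false) {
                $redundant = true;
                break;
            }
        }
        if (!$redundant) {
            $kept[] = $term;
        }
    }
    return $kept;
}

$terms = ['sex', 'porn', 'camsex', 'sexchat', 'teenporn', 'job', 'jobs', 'viagra', 'buy-viagra'];
$kept  = pruneRedundantTerms($terms);

// Quote each literal for use inside a "/"-delimited pattern.
$quoted = array_map(function ($term) {
    return preg_quote($term, '/');
}, $kept);
$regex = '/' . implode('|', $quoted) . '/i';

print_r($kept);
echo $regex, "\n";
```

For this sample list the result is the pattern `/sex|porn|job|viagra/i`. The shorter pattern matches exactly the same texts as the long one, but it does nothing about false positives: "sex" still hits "Essex" in the same way "ford" hits "afford".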
