wiki

Posted: Tue Jul 10, 2012 10:24 am
by jurchiks
I've been updating a few wiki pages over the past couple of days, and I ran into a somewhat dumb word filter. I tried putting the word "afford" in the text, but it said "ford found, gtfo!"...
Perhaps the word filter should check for whitespace/punctuation around the word?
Because if it doesn't match "/[\s\.,\-_+!@#\$]+ford[\s\.,\-_+!@#\$]+/iu" or something like that, then it's not the word you're looking for.
http://www.morewords.com/contains/ford/
That page lists the words that contain "ford".
There are definitely other words that could be misfiltered, and IMHO they shouldn't be getting in the way like this.
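
To illustrate the difference, here is a minimal sketch that compares a plain substring match with a whole-word match. The function names are made up for this example and this is not the wiki's actual filter code. Note that a delimiter list like the one above needs at least one delimiter on each side, so it would miss "ford" at the very start or end of the text; the sketch uses lookarounds to avoid that.

```php
<?php
// Hypothetical helper names for illustration; not the wiki's real filter.

// Naive filter: matches "ford" anywhere, even inside another word.
function containsSubstring(string $text): bool
{
    return preg_match('/ford/i', $text) === 1;
}

// Stricter filter: "ford" must not have a letter or digit directly
// before or after it. Lookarounds also allow a match at the very
// start or end of the text, where no delimiter character exists.
function containsWholeWord(string $text): bool
{
    return preg_match('/(?<![\p{L}\p{N}])ford(?![\p{L}\p{N}])/iu', $text) === 1;
}

var_dump(containsSubstring('I cannot afford it'));  // bool(true)  -> false positive
var_dump(containsWholeWord('I cannot afford it'));  // bool(false)
var_dump(containsWholeWord('Cheap Ford loans!'));   // bool(true)
```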

Re: wiki

Posted: Tue Jul 10, 2012 6:21 pm
by ThePhoenixBird
Too many Ford car loan, insurance, and other spam ads on the wiki.

You are right, maybe a more complex regex would fix that.

Re: wiki

Posted: Tue Jul 10, 2012 6:43 pm
by jurchiks
What, are there that many despite registration and captcha?

Re: wiki

Posted: Tue Jul 10, 2012 6:45 pm
by Zoey76
ThePhoenixBird wrote:Too many Ford car loan, insurance, and other spam ads on the wiki.

You are right, maybe a more complex regex would fix that.

Image

Re: wiki

Posted: Wed Jul 11, 2012 11:31 pm
by ThePhoenixBird
jurchiks wrote:What, are there that many despite registration and captcha?
Hell yeah, actually that crazy regex spam word filter is what stops bots from posting junk on the wiki.

Check the user registration log to see how the bots bypass the captchas: http://l2jserver.com/wiki/Special:Log/newusers

Re: wiki

Posted: Thu Jul 12, 2012 7:57 am
by jurchiks
What's your regex like? Is it something like this?

Code: Select all

$someString = 'audi|mazda|ford';
$regex = "/\b($someString)\b/iu";
$result = preg_match_all($regex, $yourTextHere, $matches);
if ($result)
{
    print_r($matches);
}

Re: wiki

Posted: Mon Sep 17, 2012 1:38 am
by ThePhoenixBird

Code: Select all

## Spam Regex
$wgSpamRegex =  "/".                        # The "/" is the opening wrapper
                "s-e-x|zoofilia|sexyongpin|grusskarte|geburtstagskarten|animalsex|".
                "job|bureau|employer|jobster|salary|jobs|bangalore|india|employee|blackberry|".
                "freelancing|career|medical|airlines|government|florida|".
                "toyota|mercedes|benz|chevrolet|honda|".
                "diet|snack|carbohydrates|diets|cholesterol|vitamins|".
                "brokers|banks|insurance|fargo|commonwealth|bank|credit|federal|wachovia|".
                "sex-with|dogsex|adultchat|adultlive|camsex|sexcam|livesex|sexchat|footjob|".
                "chatsex|onlinesex|adultporn|adultvideo|adultweb.|hardcoresex|hardcoreporn|".
                "teenporn|xxxporn|lesbiansex|livegirl|livenude|livesex|livevideo|camgirl|pussy|".
                "spycam|voyeursex|casino-online|online-casino|kontaktlinsen|cheapest-phone|".
                "laser-eye|eye-laser|fuelcellmarket|lasikclinic|cragrats|parishilton|".
                "paris-hilton|paris-tape|2large|fuel-dispenser|fueling-dispenser|huojia|".
                "jinxinghj|telematicsone|telematiksone|a-mortgage|diamondabrasives|".
                "reuterbrook|sex-plugin|sex-zone|lazy-stars|eblja|liuhecai|".
                "buy-viagra|-cialis|-levitra|boy-and-girl-kissing|". # These match spammy words
                "dirare\.com|".           # This matches dirare.com a spammer's domain name
                "overflow\s*:\s*auto|".   # This matches against overflow:auto (regardless of whitespace on either side of the colon)
                "height\s*:\s*[0-4]px|".  # This matches against height:0px (most CSS hidden spam) (regardless of whitespace on either side of the colon)
                "\\s*a\s*href|".         # This blocks all href links entirely, forcing wiki syntax
                "display\s*:\s*none".     # This matches against display:none (regardless of whitespace on either side of the colon)
                "/i";                     # The "/" ends the regular expression and the "i" switch which follows makes the test case-insensitive
                                          # The "\s" matches whitespace
                                          # The "*" is a repeater (zero or more times)
                                          # The "\s*" means to look for 0 or more amount of whitespace

Re: wiki

Posted: Mon Sep 17, 2012 1:39 am
by ThePhoenixBird
The regex is based mostly on the words used to spam our wiki.
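
Because the words in that list are not anchored to word boundaries, they also match inside longer, harmless words. A rough sketch of how one could check which listed word trips the filter is below. It uses only a short excerpt of the list, and the capture group and the helper name firstSpamHit are additions made for this example, not part of the wiki's configuration.

```php
<?php
// Short excerpt of the posted word list; the full $wgSpamRegex is much longer.
// The capture group is an addition for this sketch so the hit can be reported.
$excerpt = '/(job|bank|india|credit)/i';

// Hypothetical helper: returns the first listed word found in $text
// (lowercased), or null if nothing in the list matches.
function firstSpamHit(string $pattern, string $text): ?string
{
    return preg_match($pattern, $text, $m) === 1 ? strtolower($m[1]) : null;
}

var_dump(firstSpamHit($excerpt, 'Built on the river embankment')); // string(4) "bank"
var_dump(firstSpamHit($excerpt, 'Server hosted in Indiana'));      // string(5) "india"
var_dump(firstSpamHit($excerpt, 'How to compile the datapack'));   // NULL
```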

Re: wiki

Posted: Mon Sep 17, 2012 8:04 am
by jurchiks
1) If you put "sex" and "porn" in there, you can throw out every entry that contains those words, which shortens the regex by a good amount.
2) job/jobs: only the former is necessary, since "job" already matches inside "jobs".
3) I'd put just "viagra" instead of "buy-viagra".
4) What is the actual code that parses this pattern? Does it go directly into preg_match*?
And what's with all the "-"?
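
Points 1 and 2 can be sketched as a small cleanup pass over the word list. This is only an illustration under a few assumptions: the entries are plain literals without regex metacharacters, the pattern stays unanchored and case-insensitive, and the function name pruneRedundantAlternatives is made up for this example.

```php
<?php
// Hypothetical helper: drops every entry that already contains another
// entry as a substring, because an unanchored, case-insensitive pattern
// matches the shorter entry inside the longer one anyway.
// Assumes plain literal words (no regex metacharacters).
function pruneRedundantAlternatives(array $words): array
{
    // Normalize case and remove exact duplicates (e.g. "livesex" listed twice).
    $words = array_values(array_unique(array_map('strtolower', $words)));

    $kept = [];
    foreach ($words as $word) {
        $redundant = false;
        foreach ($words as $other) {
            if ($other !== $word && strpos($word, $other) !== false) {
                $redundant = true;
                break;
            }
        }
        if (!$redundant) {
            $kept[] = $word;
        }
    }
    return $kept;
}

$sample = ['sex', 'porn', 'job', 'jobs', 'jobster', 'camsex',
           'teenporn', 'bank', 'banks', 'livesex', 'livesex'];
echo implode('|', pruneRedundantAlternatives($sample)), "\n"; // sex|porn|job|bank
```

Keep in mind that a shorter list is also a broader one: an unanchored "sex" would match inside words like "Sussex", which is the same kind of false positive as "ford" in "afford".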