DiscussWorldIssues - Socio-Economic Religion and Political Uncensored Debate


Wmshyrga 05-02-2007 07:32 AM

Anyone good at regular expressions? (computing)
 
I have posted this at a specialist forum, but activity there is rather slow, so I was hoping some of you computer scientists might be able to help me with it.

I have spent a long time today trying to build a regular expression for screen scraping a website. The (example) text it will be scanning is:

"Glouster Museums: Abbey Home Museum<span class=text ALIGN="justify"> - Kirkhall Road, Kirkhall, Glouster, GL4 5BY, England"

What I am trying to get from that is ONLY the museum name and the address, with none of the HTML. So I would like:

Abbey Home Museum - Kirkhall Road, Kirkhall, Glouster, GL4 5BY


My regex at the moment is:

[\w \\=\"]*\-[\w, ]*LS[\d ]*[\w, ]*land

Which returns:

Abbey Home Museum<span class=text ALIGN="justify"> - Kirkhall Road, Kirkhall, Glouster, GL4 5BY, England

This is close, but I need to omit the HTML tags, and ideally get rid of the trailing 'England' too. The latter is not essential, however; just getting rid of the HTML will do.

I am using this program, http://www.webscrape.com/, which you run from the command line. I know it might be a bit of a long shot, but does anyone have an idea of what I can try? Thanks.
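
One possible approach, sketched here in Python rather than in the webscrape tool (whose regex options are not described in this thread), is to do the job in two passes: first delete anything that looks like an HTML tag, then run a simpler expression over the cleaned text. The sample line is reconstructed from the output quoted above, and the names `SAMPLE`, `TAG`, `ENTRY` and `extract_entry` are made up for illustration. The entry pattern assumes one listing per line and an optional trailing ", England".

```python
import re

# Hypothetical sample line, reconstructed from the output quoted in the post.
SAMPLE = ('Glouster Museums: Abbey Home Museum<span class=text ALIGN="justify">'
          ' - Kirkhall Road, Kirkhall, Glouster, GL4 5BY, England')

# One HTML tag: "<", then anything that is not ">", then ">".
TAG = re.compile(r"<[^>]*>")

# Name, " - ", then the address. The optional ", England" sits outside the
# capture group, so it is matched but not returned.
ENTRY = re.compile(r"([\w ]+ - [\w, ]+?)(?:, England)?\s*$")


def extract_entry(html):
    """Return 'name - address' from one listing line, or None if nothing matches."""
    text = TAG.sub("", html)      # pass 1: drop the tags
    match = ENTRY.search(text)    # pass 2: pick out name and address
    return match.group(1).strip() if match else None


print(extract_entry(SAMPLE))
```

Because `[\w ]` does not include ":", the match cannot start inside the "Glouster Museums:" heading, so only the museum name and address are captured.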

Wmshyrga 05-02-2007 07:43 AM

OK, final stretch. I found this, which selects everything within HTML tags:




Now how do I include that in my expression? I want it to get everything my initial expression matches EXCEPT for anything inside HTML tags.

Anyone?
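
A regex match is always one contiguous stretch of text, so a single expression cannot return a match with the tag cut out of the middle. What it can do is capture the text on either side of the tag in separate groups, which are then joined afterwards. The sketch below shows that idea in Python, not in the webscrape tool; the tag sub-pattern `<[^>]*>` is a common idiom and an assumption here, not necessarily the expression found above, and `SAMPLE`, `PATTERN` and `join_around_tags` are illustrative names.

```python
import re

# Hypothetical sample line, reconstructed from the output quoted earlier in the thread.
SAMPLE = ('Glouster Museums: Abbey Home Museum<span class=text ALIGN="justify">'
          ' - Kirkhall Road, Kirkhall, Glouster, GL4 5BY, England')

PATTERN = re.compile(
    r"([\w ]+?)"            # group 1: museum name
    r"(?:\s*<[^>]*>)*"      # any tags after the name: matched, not captured
    r"(\s*-\s*[\w, ]+?)"    # group 2: the dash and the address
    r"(?:, England)?\s*$"   # optional country, left out of both groups
)


def join_around_tags(html):
    """Return the text on both sides of the tag(s), joined; None if no match."""
    match = PATTERN.search(html)
    if match is None:
        return None
    return (match.group(1) + match.group(2)).strip()


print(join_around_tags(SAMPLE))
```

This only works if the tool in use gives access to individual capture groups; if it can only return the whole match, stripping the tags in a separate step is the simpler route.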


All times are GMT +1.
