Anyone good at regular expressions? (computing)
I have posted this at a specialist forum but activity is rather slow, and was hoping maybe some of you computer scientists might be able to help me with it.
I have spent a long time today trying to build a regular expression for screen scraping a website. The (example) text it will be scanning is:

" Glouster Museums: Abbey Home Museum - Kirkhall Road, Kirkhall, Glouster, GL4 5BY, England "

What I am trying to get from that is ONLY the museum name and the address, with none of the HTML. So I would like:

Abbey Home Museum - Kirkhall Road, Kirkhall, Glouster, GL4 5BY

My regex at the moment is:

[\w \\=\"]*\-[\w, ]*LS[\d ]*[\w, ]*land

Which returns:

Abbey Home Museum<span class=text ALIGN="justify"> - Kirkhall Road, Kirkhall, Glouster, GL4 5BY, England

This is close, but I need to omit the HTML tags, and ideally get rid of the trailing 'England' too. The latter is not so essential, however; just getting rid of the HTML will do.

I am using this program, http://www.webscrape.com/, which you run from the command line. I know it might be a bit of a long shot, but does anyone have an idea of what I can try? Thanks.
OK, final stretch: I found this, which selects everything within HTML tags -
Now, how do I include that in my expression? I want it to get everything my initial expression matches EXCEPT everything inside HTML tags (). Anyone?
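A possible direction, offered as a sketch rather than a definitive answer: a single regex match is always one contiguous stretch of text, so it cannot skip the tags in the middle of it. A two-step approach may be easier: grab the match first, then strip the tags from it. The example below uses Python rather than the command-line tool mentioned above, since I don't know whether that tool supports a replace step. The tag pattern `<[^>]*>` and the helper name `clean` are my own assumptions, not something taken from the posts, and the sample string is the match quoted in the first post.

```python
import re

# Sample input: the match quoted in the first post (tags still present).
raw = ('Abbey Home Museum<span class=text ALIGN="justify"> - '
       'Kirkhall Road, Kirkhall, Glouster, GL4 5BY, England')

# Assumed tag pattern: "<", then any run of characters that are not ">", then ">".
TAG = re.compile(r"<[^>]*>")


def clean(match_text):
    """Hypothetical helper: strip HTML tags, then drop a trailing ', England'."""
    no_tags = TAG.sub("", match_text)
    no_country = re.sub(r",\s*England\s*$", "", no_tags)
    return no_country.strip()


print(clean(raw))
# prints: Abbey Home Museum - Kirkhall Road, Kirkhall, Glouster, GL4 5BY
```

This simple tag pattern would break on a tag whose attribute value contains a ">" character, so for messier pages a real HTML parser is probably the safer choice.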