Extracting text from HTML pages returned from ASP and other query engines

Discussion in 'Programming/Scripts' started by richard.campbell, May 8, 2007.

  1. richard.campbell

    richard.campbell New Member

    Hi:

    What I am asking is how you might best use any of the standard programming tools (the answer will determine how I then go on to do parsing, sorting, and deleting) to take the output from either of the sample URLs I have pasted below and capture it into an appendable text file that can be further manipulated.

    http://www.jobbank.gc.ca/JobResult_...obBank&Category=4212,4214,5254,6471,6472,6474

    http://www.van.net/rew/vannet.nsf/c...earchOrder=1&SearchMax=0&SearchThesaurus=TRUE


    I am currently loading 60+ queries running in tabs in Firefox, selecting all, pasting into an empty file, and then running many iterative search-and-replaces to sort the entries and weed out duplicates and inappropriate ones.
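    The iterative search-and-replace passes described above could be collapsed into one small script. Here is a minimal Python sketch (the file names are hypothetical, and as discussed later in the thread, Perl or PHP would do the same job):

    ```python
    # Minimal sketch of the "sort and weed out dups" step:
    # strip whitespace, drop blank lines and duplicates, sort the rest.
    def clean_listings(lines):
        """Return sorted, de-duplicated, non-blank entries."""
        seen = set()
        result = []
        for line in lines:
            entry = line.strip()
            if entry and entry not in seen:
                seen.add(entry)
                result.append(entry)
        return sorted(result)

    # Example usage (file names are placeholders):
    #   with open("listings.txt") as f:
    #       cleaned = clean_listings(f)
    #   with open("listings_clean.txt", "w") as f:
    #       f.write("\n".join(cleaned) + "\n")
    ```

    This replaces only the dedup/sort passes; filtering out "inappropriate entries" would still need rules specific to each job site.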

    I intend to hand this off to a non-techie, Microsoft-type user (as a Windows VMware VM of DSL3.01) so I need to automate as much as I can because they will not understand the software or the html code stripping and reformatting process, let alone be able to handle having 60+ browser tabs open.

    Manually, this is still currently about a 4 hour job under Ubuntu Linux, with many structural shortcuts and templates. Under Windows it is about a 12-16 hour job. As this is being done by volunteers aiding charities in helping the homeless and addicted, anything that can help a non-technical volunteer through all the mountains of information is a big plus.

    I thank you for your help in advance. If I am in the wrong area, let me know that too.

    RC
     
    Last edited: May 8, 2007
  2. Ben

    Ben ISPConfig Developer

    If I understand correctly, having taken a look at these links, you just want to filter out the necessary content to work with further, right?
    If so, why not use Perl (perhaps better suited for a parser) or PHP directly, and build this as a module for each service: fetch the content, e.g. via wget, parse it, and do whatever you want with it.
    You could then build a web frontend or something similar to control those modules, so that anybody can use them.
    You just have to make sure that you, or another developer, gets informed when a site's internal page structure changes and makes a module inoperable...
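    A rough sketch of this fetch-and-strip idea, in Python rather than the Perl/PHP suggested above (the URL and output file name are placeholders; in practice each job site would get its own parsing module, as described):

    ```python
    # Fetch a results page and strip the HTML down to plain text,
    # appending it to a working text file for later sorting/dedup.
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class TextExtractor(HTMLParser):
        """Collect the text content of an HTML document, ignoring tags."""
        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            if data.strip():
                self.chunks.append(data.strip())

    def strip_html(html):
        """Return the visible text of an HTML page, one chunk per line."""
        parser = TextExtractor()
        parser.feed(html)
        return "\n".join(parser.chunks)

    # Example usage (network fetch; URL and file name are placeholders):
    #   html = urlopen("http://example.com/JobResult").read().decode("utf-8", "replace")
    #   with open("results.txt", "a") as out:
    #       out.write(strip_html(html) + "\n")
    ```

    This only strips tags; the per-site modules would still need to select the relevant part of each page, and would break whenever a site's markup changes, which is why the warning above about monitoring page-structure changes matters.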
     
  3. richard.campbell

    richard.campbell New Member


    It would seem that you meant to include a link for me to check out, Ben, but none arrived. Could you amend that?

    Thx
     
