Hi all, can anyone help me out with creating web spiders? I have been able to do it site-specifically using cURL, i.e. for one site, but I need to integrate several sites. Can anyone help me out? Regards, Shiny
Thank you, but you see, that spider only performs searches. What I want is to log in to authorized sites using a username and password and retrieve data from them. It's possible using cURL, and I did one site that way, but I have to cover several sites, so I need to make a generalized spider driven by a database. Regards, Shiny
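P.S. Roughly the kind of thing I'm after, sketched with placeholder URLs and form field names (in the real version the per-site entries would come from the database, not a hard-coded array):

<?php
// Hypothetical per-site configuration; in practice these rows would
// come from a database table rather than a hard-coded array.
$sites = array(
    array(
        'login_url' => 'https://example-a.test/login',   // placeholder
        'data_url'  => 'https://example-a.test/report',  // placeholder
        'fields'    => array('username' => 'user1', 'password' => 'secret1'),
    ),
    // ... one entry per site ...
);

foreach ($sites as $site) {
    $cookieJar = tempnam(sys_get_temp_dir(), 'spider');

    // Step 1: POST the login form and let cURL store the session cookie.
    $ch = curl_init($site['login_url']);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($site['fields']));
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_exec($ch);
    curl_close($ch);

    // Step 2: request the protected page, sending the stored cookies back.
    $ch = curl_init($site['data_url']);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);

    // $html now holds the page to parse for whatever data this site exposes.
}
?>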
Should be possible with cURL. wget and Snoopy (http://sourceforge.net/projects/snoopy/) might be other options.
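With Snoopy, for instance, a basic fetch looks something like the sketch below; Snoopy.class.php comes from the SourceForge download, and the URL is just a placeholder.

<?php
// Snoopy is a single-class PHP HTTP client; include the class file
// from wherever you unpacked the download.
include 'Snoopy.class.php';

$snoopy = new Snoopy();

// Grab a page; on success the raw HTML is available in $snoopy->results.
if ($snoopy->fetch('http://www.example.com/')) {   // placeholder URL
    echo $snoopy->results;
} else {
    echo 'Fetch failed: ' . $snoopy->error;
}

// For login forms, Snoopy can also POST via $snoopy->submit($url, $formvars).
?>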
How about JavaScript? Raw HTML web spiders are a thing of the past; you need a full-blown browser API at your fingertips. How about pages that modify the DOM after the page has already been loaded? AJAX, for example: Safari Books does this just to make it difficult for the end user to simply strip the HTML. See if you can use the WebClient class (System.Net); you can use C# and Visual Studio Express to test it out. If not, you need to build Mozilla or hijack IE using DLLs. The goal is to work with a browser programmatically. See the Mono project for their Mozilla client API; it's on my to-do list.
OmniFind

You can download OmniFind for free from http://omnifind.ibm.yahoo.net/register/form.php and it will meet your requirements. It's based on Nutch (http://lucene.apache.org/nutch/). OmniFind is easy to use and can be installed by absolute noobs. It runs on:

* 32-bit Red Hat Enterprise Linux Version 5
* 32-bit SUSE Linux Enterprise 10
* 32-bit Windows XP SP2
* 32-bit Windows 2003 Server SP2

/ Petter
There are a few ways to spider. The first, which I'll call general spidering, simply grabs a page and searches it for whatever you're looking for, for instance a search phrase. The second, specific spidering, grabs only a certain portion of a page. This scenario is useful in cases where you might want to grab news headlines from another site. If you want to get fancy, you can build in functionality to ignore links that are within the same site.

If you are using an ASP page for this, there are a few drawbacks, however. Normally, you can get around this issue by not allowing the Internet Transfer Control (ITC) to use default values; specify the values every time. Another, more serious, problem involves licensing issues. ASPs do not have the ability to invoke the license manager. The license manager checks the key in the actual component and compares it to the one in the Windows registry; if they're not the same, the component won't work.
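Since the thread started with cURL, here is a rough PHP sketch of the two flavours. The URL, search phrase, and headline markup below are all made up for illustration; a real site would need its own pattern.

<?php
// Fetch a page once; both kinds of spidering start from the raw HTML.
function fetchPage($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

$html = fetchPage('http://www.example.com/news');   // placeholder URL

// General spidering: search the whole page for a phrase.
if (stripos($html, 'search phrase') !== false) {
    echo "Phrase found on page\n";
}

// Specific spidering: pull out just one portion, e.g. headlines that a
// hypothetical site wraps in <h2 class="headline"> tags.
if (preg_match_all('/<h2 class="headline">(.*?)<\/h2>/is', $html, $matches)) {
    foreach ($matches[1] as $headline) {
        echo strip_tags($headline) . "\n";
    }
}
?>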