How to scrape this website

I have a website here

BSE SmallCap

About 100 companies are listed here. How can I save the next 100 companies programmatically, using Python (or C#)? At the bottom of this page,

Showing 1 - 100 of 528 << Previous | Next >>

is shown. How can I access the link

Next >>

programmatically? This link appears as the base URL plus '#' (http://money.rediff.com/indices/bse/bsesmallcap#). How can I save all 528 company details (as separate web pages: 1-100, 101-200, etc.)? Are there any special tailor-made programs for this kind of task?

You don't even need Scrapy or anything like that. There's no link to find behind that "Next" link, since it's actually JavaScript:

javascript:nextPage(document.paging.totalPages.value)

I used Chrome's developer tools to see what request it was actually making, and it turns out it's just a simple unauthenticated POST request. You can get any page you want with the following:

import requests

# A plain unauthenticated POST; the form fields control the paging.
r = requests.post('http://money.rediff.com/indices/bse/bsesmallcap',
                  data={'currentPageNo': 3, 'RowPerPage': 100})
print(r.text)

All you have to do is change the 'currentPageNo' argument to get whichever page you're looking for. You could probably also change the number of rows per page, but I didn't experiment with that. Update: You can't; I tried.
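Since the question asks to save each block of 100 companies as its own web page, here's a minimal sketch that just writes each response to disk (the filename scheme is my own, purely illustrative):

import requests

# 528 companies at 100 rows per page means 6 pages in total.
for page in range(1, 7):
    r = requests.post('http://money.rediff.com/indices/bse/bsesmallcap',
                      data={'currentPageNo': page, 'RowPerPage': 100})
    # Save the raw HTML of each page to its own file.
    with open('bsesmallcap_page{}.html'.format(page), 'w') as f:
        f.write(r.text)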

In terms of actually saving the information, you can use BeautifulSoup to pull the data out of each response and store it however you like. Since the table carries the 'dataTable' class on every page, it's easy to find. So, given that there are 6 pages (528 rows at 100 per page), you'd end up with code that looks something like:

import requests
from bs4 import BeautifulSoup as BS

# 528 rows at 100 per page means pages 1 through 6.
for page in range(1, 7):
    r = requests.post('http://money.rediff.com/indices/bse/bsesmallcap',
                      data={'currentPageNo': page, 'RowPerPage': 100})
    # Pass an explicit parser so BeautifulSoup doesn't guess (and warn).
    soup = BS(r.text, 'html.parser')
    table = soup.find(class_='dataTable')
    # Add table information to whatever output you plan to use
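For instance, if CSV is the output you want, a rough sketch might look like the following (the exact cell layout of the table is something to verify against the live page, and 'companies.csv' is just an illustrative filename):

import csv
import requests
from bs4 import BeautifulSoup as BS

with open('companies.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for page in range(1, 7):
        r = requests.post('http://money.rediff.com/indices/bse/bsesmallcap',
                          data={'currentPageNo': page, 'RowPerPage': 100})
        soup = BS(r.text, 'html.parser')
        table = soup.find(class_='dataTable')
        # Write each row's cell text as one CSV row; rows without
        # <td> cells (such as the header) are skipped.
        for tr in table.find_all('tr'):
            cells = [td.get_text(strip=True) for td in tr.find_all('td')]
            if cells:
                writer.writerow(cells)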

The full link to "each page" is: http://money.rediff.com/indices/bse/bsesmallcap?cTab=12&sortBy=&sortDesc=&pageType=indices_wise&currentPageNo=1&RowPerPage=100&bTab=12

(I've removed the totalPages aspect, since you'll need to scrape this bit yourself)

Once you know the number of pages (from scraping), you can increment the currentPageNo until you have all the rows.
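Since the nextPage JavaScript reads document.paging.totalPages.value, the page count presumably sits in a form field named totalPages. Here's a hedged sketch of reading it before looping (I haven't verified the exact markup; the input-element assumption comes from the JavaScript above):

import requests
from bs4 import BeautifulSoup as BS

# Fetch the first page and read the page count from the form field
# that nextPage() refers to. Assumption: it's an <input> named
# 'totalPages' somewhere in the page.
r = requests.post('http://money.rediff.com/indices/bse/bsesmallcap',
                  data={'currentPageNo': 1, 'RowPerPage': 100})
soup = BS(r.text, 'html.parser')
total_pages = int(soup.find('input', {'name': 'totalPages'})['value'])

for page in range(2, total_pages + 1):
    r = requests.post('http://money.rediff.com/indices/bse/bsesmallcap',
                      data={'currentPageNo': page, 'RowPerPage': 100})
    # ...parse r.text as before...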

You can increase RowPerPage, but there seems to be an internal limit of 200 rows (even if you set it to, say, 500). At 200 rows per page, the 528 rows would still take ceil(528 / 200) = 3 requests.

Here's a spin on jdotjdot's answer using PyQuery instead of BeautifulSoup; I like it for the jQuery-esque notation for traversing. PyQuery will use urllib by default, or requests for scraping if it's installed.

from pyquery import PyQuery as pq

for page in range(1, 3):
    # PyQuery issues the POST request itself and parses the response
    d = pq(url="http://money.rediff.com/indices/bse/bsesmallcap",
           data={"currentPageNo": page, "RowPerPage": 50},
           method="post")
    # jQuery-esque notation for selecting elements
    print(d("table.dataTable").text())