
Learning Python: a small crawler built from the urllib2 module and regular expressions

I recently started learning Python. On http://www.the5fire.com/python-sohuspider-software-people.html I came across the author's simple crawler for pulling a novel off Sohu, which made a nice practice exercise. The code is several years old, though, and Sohu has since redesigned its pages, so the original no longer works; I took the chance to fix it up. Below is the updated code, which I have tested myself:

#!/usr/bin/python

import re
import urllib2
import sys

def getPage(url,offset = '3399'):
    # fetch one chapter page, e.g. http://lz.book.sohu.com/chapter-1213399.html
    realurl = "%s%s%s" % (url,offset,".html")
    print realurl
    content = urllib2.urlopen(realurl).read()

    # grab the block between the chapter marker div and the closing </div>;
    # the pattern already handles newlines explicitly, so no re.S flag is needed
    content_re = re.compile(r'<div class="chapter"></div>\n(.*[\n])+?[\t]</div>')
    try:
        content_list = content_re.search(content).group(0)
    except Exception as e:
        print str(e)
        return
    contentresult = content_list

    # append this chapter to the output file
    fp = open(r'renxingruanjian.txt','a')

    # strip the leftover HTML tags and stray whitespace from the matched block
    contentresult = contentresult.replace('<div class="chapter"></div>','')
    contentresult = contentresult.replace('</p><p>','')
    contentresult = contentresult.replace('</p></p>','')
    contentresult = contentresult.replace('<p><p>','')
    contentresult = contentresult.replace(' ','')
    contentresult = contentresult.replace('     ','')
    contentresult = contentresult.replace('</div>','')
    fp.write(contentresult + '\n')
    fp.flush()
    fp.close()

def getBook(url,startoffset,endoffset):
    # download chapters startoffset through endoffset - 1 in sequence
    while startoffset < endoffset:
        getPage(url,offset = str(startoffset))
        startoffset += 1

if __name__ == '__main__':
    getBook(url = 'http://lz.book.sohu.com/chapter-121',startoffset = 3399,endoffset = 3426)
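
Note that urllib2 exists only on Python 2. Purely for reference, below is a minimal sketch of the same idea on Python 3, where urllib2's functionality lives in urllib.request. The URL scheme and the chapter regex are carried over from the script above, while the GBK decoding and the tag-stripping pattern are my own assumptions about the page, so treat it as illustrative rather than guaranteed to work against today's Sohu pages.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re
import urllib.request

def get_page(url, offset):
    # same URL scheme as above: base URL + numeric offset + ".html"
    realurl = "%s%s%s" % (url, offset, ".html")
    print(realurl)
    # assumption: the page is GBK-encoded, as many Chinese sites are; adjust if needed
    html = urllib.request.urlopen(realurl).read().decode('gbk', errors='ignore')

    # same chapter-extraction pattern as the urllib2 version above
    content_re = re.compile(r'<div class="chapter"></div>\n(.*[\n])+?[\t]</div>')
    match = content_re.search(html)
    if match is None:
        return
    # crude tag stripping, standing in for the chain of replace() calls above
    text = re.sub(r'</?p>|</?div[^>]*>', '', match.group(0))
    with open('renxingruanjian.txt', 'a', encoding='utf-8') as fp:
        fp.write(text + '\n')

def get_book(url, startoffset, endoffset):
    for offset in range(startoffset, endoffset):
        get_page(url, str(offset))

if __name__ == '__main__':
    get_book('http://lz.book.sohu.com/chapter-121', 3399, 3426)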

Please credit BackNode as the source when reposting.
