Day 6 - Building a Distributed Crawler and a Search Engine from Start to Finish - 同乐学堂

I. Logging in to Zhihu

1. First, create the Zhihu spider directly inside the project from the previous days.

scrapy genspider zhihu www.zhihu.com

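2. Immediately getting stuck on Zhihu's upside-down character captcha

Zhihu keeps getting harder to crawl because of its annoying captcha, which now even uses upside-down Chinese characters. Luckily, someone has already cracked it with AI, haha!

As for the installation process, I can only describe it with four concatenated characters: "F" + "U" + "C" + "K"!!!

https://github.com/muchrooms/zheye (My site only has a few GB of storage, so if you are interested, please download it yourself~)

Then copy the zheye library into your own project.

Installation guide for this library's dependencies:

pip install -i https://pypi.douban.com/simple/ Keras==2.0.1 (fails)

One of its dependencies failed to install and has to be installed separately: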

pip install scipy-0.19.1-cp35-cp35m-win_amd64.whl

Then run it again: pip install -i https://pypi.douban.com/simple/ Keras==2.0.1 (succeeds)

pip install -i https://pypi.douban.com/simple/ tensorflow==1.0.1 (succeeds)

pip install -i https://pypi.douban.com/simple/ h5py==2.6.0 (fails)

pip install -i https://pypi.douban.com/simple/ Pillow==3.4.2 (you have to quit PyCharm first!)

Install h5py separately from a wheel file:

pip install h5py-2.7.1-cp35-cp35m-win_amd64.whl

The wheel files for these separately installed libraries can be downloaded here:

http://www.lfd.uci.edu/~gohlke/pythonlibs/
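When picking a wheel from that page, the filename itself says which interpreter and platform it targets: in scipy-0.19.1-cp35-cp35m-win_amd64.whl, cp35 means CPython 3.5 and win_amd64 means 64-bit Windows. The following is a minimal sketch of how such a name can be split apart and compared with the running interpreter. The helper names parse_wheel_filename and current_python_tag are hypothetical, and the comparison assumes a CPython interpreter.

```python
import sys


def parse_wheel_filename(filename):
    """Split a wheel filename into its parts.

    Wheel names follow the pattern
    name-version(-build)-python_tag-abi_tag-platform_tag.whl
    """
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: %s" % filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel filename: %s" % filename)
    # An optional build tag may sit between the version and the Python tag,
    # so the three tags are always taken from the end.
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def current_python_tag():
    """Return the tag of the running interpreter, assuming CPython (e.g. 'cp35')."""
    return "cp%d%d" % (sys.version_info[0], sys.version_info[1])


info = parse_wheel_filename("scipy-0.19.1-cp35-cp35m-win_amd64.whl")
print(info["python_tag"], info["platform_tag"])  # cp35 win_amd64
print("matches this interpreter:", info["python_tag"] == current_python_tag())
```

If the last line prints False, pip will most likely refuse the file as "not a supported wheel on this platform", so pick the file whose tags match your own Python version and architecture.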

pip install -i https://pypi.douban.com/simple/ requests

Keras==2.0.1
Pillow==3.4.2 (skipped, because a newer version is already installed on the system)
#jupyter==1.0.0
#matplotlib==1.5.3
numpy==1.12.1
scikit-learn==0.18.1
tensorflow==1.0.1
h5py==2.6.0
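
After all of these installs and retries it is easy to lose track of what actually ended up in the environment. The sketch below compares the pinned versions from the list above with what is installed. It is only an illustration: the helper names parse_pins and check_pins are hypothetical, Pillow is left out because it was skipped above, and it relies on importlib.metadata, which exists only in Python 3.8 and later. On the Python 3.5 setup used in this post, running pip freeze gives the same information.

```python
from importlib import metadata

# Pins copied from the list above; commented-out entries are ignored by the parser.
PINNED = """
Keras==2.0.1
#jupyter==1.0.0
#matplotlib==1.5.3
numpy==1.12.1
scikit-learn==0.18.1
tensorflow==1.0.1
h5py==2.6.0
"""


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict, skipping comments."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Return {name: (wanted, installed)}; installed is None if the package is missing."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed)
    return report


for name, (wanted, installed) in check_pins(parse_pins(PINNED)).items():
    status = "ok" if installed == wanted else "MISMATCH"
    print("%-14s wanted %-8s installed %-10s %s" % (name, wanted, installed, status))
```

A MISMATCH is not necessarily fatal. For example, h5py was installed above from a 2.7.1 wheel although the pin says 2.6.0.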

Cracking Zhihu's upside-down Chinese character captcha:

# -*- coding: utf-8 -*-
import scrapy
import re
import json
import datetime
class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com/']
    headers = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.87 Safari/537.36"
    }
    def parse(self, response):
        pass
    def start_requests(self):
        return [scrapy.Request('https://www.zhihu.com/#signin', headers=self.headers, callback=self.login)]
    def login(self, response):
        response_text = response.text
        match_obj = re.match('.*name="_xsrf" value="(.*?)"', response_text, re.DOTALL)
        xsrf = ''
        if match_obj:
            xsrf = (match_obj.group(1))
        if xsrf:
            post_url = "https://www.zhihu.com/login/phone_num"
            post_data = {
                "_xsrf": xsrf,
                "phone_num": "your_phone_number",  # placeholder: fill in your own account
                "password": "xxxxxxx",
                "captcha": ""
            }
            import time
            t = str(int(time.time() * 1000))
            captcha_url_cn =  "https://www.zhihu.com/captcha.gif?r={0}&type=login&lang=cn".format(t)
            yield scrapy.Request(captcha_url_cn, headers=self.headers,meta={"post_data":post_data},callback=self.login_after_captcha_cn)
    def login_after_captcha_cn(self, response):
        # Save the captcha image so zheye can locate the upside-down Chinese characters
        with open("captcha.jpg", "wb") as f:
            f.write(response.body)
        from zheye import zheye
        z = zheye()
        positions = z.Recognize('captcha.jpg')
        pos_arr = []
        if len(positions) == 2:
            if positions[0][1] > positions[1][1]:
                pos_arr.append([positions[1][1], positions[1][0]])
                pos_arr.append([positions[0][1], positions[0][0]])
            else:
                pos_arr.append([positions[0][1], positions[0][0]])
                pos_arr.append([positions[1][1], positions[1][0]])
        else:
            pos_arr.append([positions[0][1], positions[0][0]])
        post_url = "https://www.zhihu.com/login/phone_num"
        post_data = response.meta.get("post_data", {})
        if len(positions)  == 2:
            post_data["captcha"] = '{"img_size": [200, 44], "input_points": [[%.2f, %f], [%.2f, %f]]}' % (
                pos_arr[0][0] / 2, pos_arr[0][1] / 2, pos_arr[1][0] / 2, pos_arr[1][1] / 2)
        else:
            # Keep the same list-of-points shape as the two-character case
            post_data["captcha"] = '{"img_size": [200, 44], "input_points": [[%.2f, %f]]}' % (
                pos_arr[0][0] / 2, pos_arr[0][1] / 2)
        post_data["captcha_type"] = "cn"
        return [scrapy.FormRequest(
            url=post_url,
            formdata=post_data,
            headers=self.headers,
            callback=self.check_login
        )]
    def check_login(self, response):
        # Check the data returned by the server to determine whether login succeeded
        text_json = json.loads(response.text)
        if "msg" in text_json and text_json["msg"] == "登录成功":
            for url in self.start_urls:
                yield scrapy.Request(url, dont_filter=True, headers=self.headers)
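
The spider above mixes three small pieces of logic into its callbacks: pulling the _xsrf token out of the login page, turning the positions returned by zheye into the captcha form field, and checking the login response. As a rough sketch, they can be written as plain functions that run without Scrapy or zheye, which makes them easy to try out in isolation. The function names are hypothetical. Like the spider, the sketch assumes each zheye position is a (y, x) pair measured on the downloaded image and that the form wants (x, y) at half that scale, ordered left to right. Unlike the spider, it builds the string with json.dumps, so the numbers are not rounded, and it always sends a list of points, even for a single character. What the login endpoint really accepts is an assumption here.

```python
import json
import re


def extract_xsrf(html):
    """Return the _xsrf token found in the login page HTML, or '' if it is absent."""
    match = re.search(r'name="_xsrf" value="(.*?)"', html)
    return match.group(1) if match else ""


def build_captcha_payload(positions):
    """Build the 'captcha' form value from zheye-style (y, x) positions."""
    if not positions:
        raise ValueError("no upside-down characters were recognized")
    # Swap each pair to (x, y), halve it, and order the points left to right.
    points = sorted([x / 2, y / 2] for y, x in positions)
    return json.dumps({"img_size": [200, 44], "input_points": points})


def is_login_success(response_text):
    """Return True if the JSON body carries the success message used by the spider."""
    try:
        data = json.loads(response_text)
    except ValueError:
        return False
    return isinstance(data, dict) and data.get("msg") == "登录成功"


# Two made-up positions, given as (y, x), with the right-hand character listed first.
print(build_captcha_payload([(40.0, 260.0), (44.0, 96.0)]))
print(extract_xsrf('<input type="hidden" name="_xsrf" value="abc123"/>'))
print(is_login_success('{"r": 0, "msg": "登录成功"}'))
```

Inside the spider, login_after_captcha_cn could then set post_data["captcha"] = build_captcha_payload(positions), and check_login could call is_login_success(response.text) before yielding the follow-up requests.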