Building a phantomjs cluster with docker + phantomjs + haproxy

Date: 2018-04-13 23:55:40

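Goal:

    Set up a remote phantomjs server that provides a highly available service and supports concurrent use.

Ingredients:

    1. A docker environment and a docker-compose environment

    2. phantomjs image: docker.io/wernight/phantomjs

    3. haproxy image: haproxy:latest

docker-compose project directory structure

    phantomjs/

        haproxy/

            haproxy.cfg

        docker-compose.yml

Configuration file contents

docker-compose.yml configuration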

version: "2"
services:
    phantomjs1:
        image: docker.io/wernight/phantomjs
        ports:
            - "8910"
        command: phantomjs --webdriver=8910 --cookies-file=/cookies.txt
        restart: always
        # memory limit, in bytes
        mem_limit: 2000000000
        expose:
            - "8910"

    phantomjs2:
        image: docker.io/wernight/phantomjs
        ports:
            - "8910"
        command: phantomjs --webdriver=8910 --cookies-file=/cookies.txt
        restart: always
        # memory limit, in bytes
        mem_limit: 2000000000
        expose:
            - "8910"

    haproxy:
        image: haproxy:latest
        volumes:
            - ./haproxy:/haproxy-override
            - ./haproxy/haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
        links:
            - phantomjs1
            - phantomjs2
        restart: always
        ports:
            - "8910:8910"
            - "8911:8911"
        expose:
            - "8910"
            - "8911"


haproxy.cfg contents

global
  log 127.0.0.1 local0
  log 127.0.0.1 local1 notice

defaults
  log global
  mode http
  #option httplog
  option dontlognull
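  # option  redispatch # if a backend goes down, redispatch to another server
  retries 5  # retry a failed connection to a server up to 5 times
  timeout connect 5000ms
  timeout client 50000ms
  timeout server 50000ms
  timeout check 20s  # give a health check up to 20s to respond before it counts as failed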

listen stats
    bind 0.0.0.0:8911
    stats enable
    stats uri /

frontend balancer
    bind 0.0.0.0:8910
    mode http
    default_backend phantomjs_backends

backend phantomjs_backends
    mode http
    option forwardfor
    #balance source
    balance hdr(Cookie)  # required: the crawler side hooks this header
    #balance url_param sessionId check_post 64
    server phantomjs1 phantomjs1:8910 check inter 10000  # health-check interval raised to 10s
    server phantomjs2 phantomjs2:8910 check inter 10000
    option httpchk GET /status
    http-check expect status 200


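Configuration done

Run the following inside the phantomjs directory:

    Start:  docker-compose up -d

    Stop:   docker-compose stop

Using the phantomjs service:

    http://<machine-ip>:8910

Checking the cluster status:

    http://<machine-ip>:8911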

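Now for the big pitfall:

    Let's go through it:

    1. When Python selenium connects to the phantomjs service remotely, the HTTP connection it uses does not support any session mechanism such as cookies or sessions. Because phantomjs sits behind haproxy for load balancing, and haproxy by default round-robins requests across the backend servers, consecutive requests are routed to different backends. So selenium's first request issues the command to create a new phantomjs session and gets back a phantomjs sessionId. When it then uses that sessionId to drive phantomjs, the request is sent to a different backend, the resource for that sessionId cannot be found, and the setup is simply unusable. Haproxy's other load-balancing strategies are mostly unusable as well.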

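    First, to be clear about the behavior I want:

        1) The first request (creating the phantomjs session) is assigned randomly and spread evenly across the backends

        2) Subsequent requests must not switch servers unless the server goes down (if it does, nothing can be done: the operation is interrupted and has to start over)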

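    Now let's go through haproxy's load-balancing strategies one by one:

        1) roundrobin: the default round robin. Unusable.

        2) static-rr: weight-based. Unusable (weights cannot guarantee that the server never changes).

        3) leastconn: least connections. No chance.

        4) source: hashes the source IP. Unusable (unless my source IPs and their request rates are both evenly distributed, the load will pile up on a few machines).

        5) uri: hashes the part of the URL before the ?, or the whole URL. Unusable (the operations are much the same every time, so the APIs being called are not evenly distributed).

        6) url_param: hashes a given GET (or POST) parameter. Unusable (the first request has no sessionId yet).

        7) hdr(name): hashes a given header (such as User-Agent). Unusable as-is (selenium requests are stateless: every client sends exactly the same headers on every request, and they cannot be changed). Still, this is the one I ended up choosing; how to change the headers is covered below.

        8) rdp-cookie(name): selects by cookie. Unusable (selenium requests are stateless).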
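The difference between round robin and a header hash can be illustrated with a small simulation. This is only a sketch: the `Backend` class and the `pick_by_header` function are hypothetical names invented for the illustration, and MD5 stands in for haproxy's own internal hash, which is a different function.

```python
import hashlib
from itertools import cycle


class Backend:
    """Stand-in for one phantomjs container: sessions live only in its own memory."""

    def __init__(self, name):
        self.name = name
        self.sessions = set()

    def create_session(self, session_id):
        self.sessions.add(session_id)

    def has_session(self, session_id):
        return session_id in self.sessions


def pick_by_header(backends, header_value):
    """Map a header value onto a backend (MD5 here; haproxy uses its own hash)."""
    digest = hashlib.md5(header_value.encode()).hexdigest()
    return backends[int(digest, 16) % len(backends)]


# Round robin: the second request lands on a different backend.
backends = [Backend("phantomjs1"), Backend("phantomjs2")]
rotation = cycle(backends)
next(rotation).create_session("abc")                    # request 1: new session
found_round_robin = next(rotation).has_session("abc")   # request 2: session lookup

# Header hash: requests carrying the same header value always hit the same backend.
backends = [Backend("phantomjs1"), Backend("phantomjs2")]
pick_by_header(backends, "token-1").create_session("abc")
found_header_hash = pick_by_header(backends, "token-1").has_session("abc")

print(found_round_robin, found_header_hash)  # False True
```

With round robin the lookup fails because the session was created on the other backend. With a hash over a per-client header value the lookup succeeds, provided each client sends a value that stays constant for the lifetime of its session.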

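    Now for the trick:

    After the analysis above there seemed to be no way out. But after a lot of hard thinking and digging through the selenium source, I finally found a way to change the headers sent with every remote call to the phantomjs API without modifying the selenium source. Enough talk, here is the code: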

# coding:utf8
import base64

from selenium.webdriver.remote import remote_connection


# Hook: replacement for RemoteConnection.get_remote_connection_headers
class MyRemoteConnection(object):
    @classmethod
    def get_remote_connection_headers(cls, parsed_url, keep_alive=False):
        """
        Get headers for remote request.

        :Args:
         - parsed_url - The parsed url
         - keep_alive (Boolean) - Is this a keep-alive connection (default: False)
        """

        headers = {
            'Accept': 'application/json',
            'Content-Type': 'application/json;charset=UTF-8',
            'User-Agent': 'Python http auth'
        }

        if parsed_url.username:
            base64string = base64.b64encode('{0.username}:{0.password}'.format(parsed_url).encode())
            headers.update({
                'Authorization': 'Basic {}'.format(base64string.decode())
            })

        if keep_alive:
            headers.update({
                'Connection': 'keep-alive'
            })
        # The lines below are my addition. The key points: keep_alive is not
        # strictly type-checked, and it can be passed in when the remote driver
        # is created, so its value can double as a per-client routing key.
        headers.update({
            "Cookie": keep_alive,
        })
        return headers


# Override the corresponding method in the selenium package
remote_connection.RemoteConnection.get_remote_connection_headers = MyRemoteConnection.get_remote_connection_headers


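    How it works:

    In selenium.webdriver.remote.remote_connection, the RemoteConnection class has a get_remote_connection_headers method that controls the headers used on every API call, and it accepts a keep_alive parameter. More importantly, keep_alive can be passed in when the remote driver is created. Even more importantly, keep_alive is only ever checked for truthiness, never for its actual value, so we can use it as a unique identifier, put it into a header, and have haproxy balance on that value. As long as the algorithm that generates keep_alive produces evenly distributed values, this satisfies my requirements perfectly.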
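The truthiness point can be shown in isolation. The function below is a simplified stand-in written for this illustration, not selenium's actual code; `build_headers` and the token value are hypothetical.

```python
def build_headers(keep_alive=False):
    """Simplified stand-in for the header logic; not selenium's real implementation."""
    headers = {"Accept": "application/json"}
    if keep_alive:  # only truthiness is tested, so any non-empty string passes
        headers["Connection"] = "keep-alive"
        headers["Cookie"] = str(keep_alive)  # reuse the value as a routing key
    return headers


with_token = build_headers("3f2a9c")   # a string token behaves like True
without_token = build_headers(False)   # the plain boolean still works as before

print(with_token)
print(without_token)
```

Because a non-empty string is truthy, the keep-alive branch still runs, and the same string is available to be copied into a header that haproxy can hash.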

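    So I chose the Cookie header: before the code runs, hook this method dynamically so that it puts keep_alive into the Cookie header, then generate a unique keep_alive value and pass it in when creating the webdriver object. See the code: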

# coding:utf8
import time
import hashlib
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium import webdriver

desired_capabilities = DesiredCapabilities.PHANTOMJS.copy()
browser = webdriver.Remote(
    command_executor='http://localhost:8910',
    desired_capabilities=desired_capabilities,
    # hooked: used as the unique identifier (sent as the Cookie header)
    keep_alive=hashlib.md5("{}".format(time.time()).encode()).hexdigest()
)
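One caveat about the token: it is derived from time.time(), so two drivers created at exactly the same timestamp would get the same token and land on the same backend. A possible alternative, sketched here as an assumption rather than as part of the original setup, is a random UUID. The `new_routing_token` and `bucket` helpers are hypothetical, and MD5 modulo the backend count only approximates how a hash-based balancer spreads values; haproxy's real hash differs.

```python
import hashlib
import uuid
from collections import Counter


def new_routing_token():
    """Random per-driver token; does not depend on the clock."""
    return uuid.uuid4().hex


def bucket(token, n_backends=2):
    """Approximate a hash-based balancer: hash the token, take it modulo the backend count."""
    return int(hashlib.md5(token.encode()).hexdigest(), 16) % n_backends


tokens = [new_routing_token() for _ in range(10000)]
counts = Counter(bucket(t) for t in tokens)

print(len(set(tokens)))  # number of distinct tokens
print(counts)            # how many tokens fall on each backend
```

A value from new_routing_token() could be passed as keep_alive in place of the MD5-of-timestamp expression. In this sketch the tokens split close to evenly between the two buckets, which is the property the first requirement asks for.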