前言
通过本文的学习,您将可以
1. 了解TBase对部署环境的要求
2. 掌握部署TBase前需要对操作系统做的准备性工作
3. 掌握TBase的安装部署操作
4. 掌握TBase实例初始化
5. 了解TBase安装中的其他注意事项
第一章 部署环境介绍
1.1 机器配置
1.2 管理角色设备分配方案
1.3.1 操作系统部署要求
1.4 TBase安装包与许可证的获取
安装TBase,你需要以下文件:
——以上请联系腾讯商务代表,或在 https://github.com/Tencent/TBase 获取。
第二章 服务器环境准备
2.1 测试数据盘的随机同步写入性能
time dd if=/dev/zero of=test bs=1M count=1024 oflag=dsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 17.4811 s, 61.4 MB/s
real 0m17.482s
user 0m0.002s
sys 0m0.748s
2.2 较准机器的时间
crontab –e (写入以下内容)
--NTP服务器可自行选择
*/20 * * * * /usr/sbin/ntpdate ntpupdate.tencentyun.com >/dev/null &
2.3 防火墙与selinux配置
关闭SELinux
setenforce 0
设置SELinux开机不自启动
vi /etc/sysconfig/selinux
将其中的SELINUX= XXXXXX修改为SELINUX=disabled
2.4 修改limits.conf参数(生产环境)
[root@tbase01]#cat /etc/security/limits.conf
--准许单个进程打开文件数
* soft nofile 131072
* hard nofile 131072
--准放运行的进程数
* soft nproc 131072
* hard nproc 131072
--限制内核文件的大小
* soft core unlimited
* hard core unlimited
2.5 修改内核参数sysctl.conf(生产环境)
[root@tbase01]# vi /etc/sysctl.conf
--所有在消息队列中的消息总和的最大值(msgmnb=64k)
kernel.msgmnb = 65536
--指定内核中消息队列中消息的最大值(msgmax=64k)
kernel.msgmax = 65536
--共享内存总量,以页为单位(4K/页),默认这个值足够大了
kernel.shmall = 4294967296
--单个共享内存段的最大值 ,这个值需要大到让应用的内存段不用分成多个创建,否则性能会下降
kernel.shmmax=137438953472
--可以创建的内存段数 ,默认这个值足够大了
kernel.shmmni = 4096
--4个数据分别对应:SEMMSL、SEMMNS、SEMOPM、SEMMNI这四个核心参数
--SEMMSL :用于控制每个信号集的最大信号数量,建议最小值为postgresql最大连接数+10
--SEMOPM: 每个 semop 系统调用可以执行的信号操作的数量。建议设置 SEMOPM 等于SEMMSL 。
--SEMMNI :内核参数用于控制整个 Linux 系统中信号集的最大数量,建议128
--SEMMNS:用于控制整个 Linux 系统中信号(而不是信号集)的最大数,为SEMMSL * SEMMNI
kernel.sem = 50100 64128000 50100 1280
--准许系统所有进程一共可以打开的文件数量
fs.file-max = 6553600
--系统允许的最大异步IO队列长度
fs.aio-max-nr = 1048576
--增大locale port数量,默认是从32768开始
net.ipv4.ip_local_port_range = 1024 65535
--keepalive认定探测失效时间,默认是2小时,7200秒
net.ipv4.tcp_keepalive_time = 60
--在认定连接失效之前,发送多少个TCP的keepalive探测包。缺省值是9
net.ipv4.tcp_keepalive_probes = 6
--每隔多少秒重新发送keepalive探测包,默认是75秒
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.ip_forward=1
vm.dirty_background_bytes = 102400000
2.6 主机名与IP配置
修改主机名
hostnamectl set-hostname XXXXXX
IP地址配置
IP地址无特别要求,可以存在多个网段,只能保证网络是连通的就行。
带宽要求
同IDC,同城最好是万兆,并且要求是低于3ms的延迟。
异地视要同步数据量要求,在做备机时需要比较高的带宽,延迟按设计是30ms左右。
2.7 /etc/ssh/sshd_config配置
vi /etc/ssh/sshd_config
UseDNS no
CentOS 6 版本service sshd restart
CentOS 7 版本systemctl restart sshd
第三章 应用程序部署
3.1 上传安装包并解压
[root@TBASE01~]# pwd
/root
[root@TBASE01~]# unzip package_name.zip
tar -zxf package_name.tar.gz
3.2 配置安装选项
[root@TBASE01~ ]# vi tbase_mgr/conf/role.info
#eth ip idc_name root password role( OssCenterMaster OssCenterSlave OssAgent Confdb Alarm TStudio Zookeeper)
#Notice: The machine that will be installed OssCenterMaster or OssCenterSlave can‘t be installed OssAgent
ens33 192.168.43.129 idc_1 root oracle OssCenterMaster|Confdb|TStudio|Etcd
ens33 192.168.43.130 idc_1 root oracle OssCenterSlave|Confdb|Alarm
--Etcd组件是confdb倒换和center倒换的容灾组件,Etcd部署个数必须是奇数个
--Etcd建议跟center或者confdb部署在一起,OssCenter支持一主多备。如果机器数量有足够多,除部署以上所示组件的两台机器,其他机器可以只部署OssAgent即可
--(建议最少部署3个Etcd组件,可靠性更高)。
3.3 运行安装程序
[root@TBASE01tbase_mgr]# ./tbase_mgr.sh install
Hey, Welcome and now we will install TBase OSS by the flowing steps:
0. Check role configuration read from conf/role.info ...
1. Install some base rpm packages such as dos2unix/createrepo/expect and so on ...
2. Check root password and do some initalization on all machines read from conf/role.info ...
3. Check package requires ...
4. Create OS user tbase specifided in conf/oss/oss_init.conf ...
5. Create yum repository which store all the rpm packages needed by TBase OSS ...
6. Install and Start ConfdbMaster ...
7. Install and Start all ConfdbSlaves ...
8. Install and Start Alarm Server ...
9. Install and Start Zookeeper If needed ...
10. Install and Start OssCenterMaster and all OssCenterSlaves ...
11. Install and Start all Agents ...
12. Install default tbase_pgxz package...
13. Install and Start Tstudio ...
Ready to continue (Yy|Nn) ?
3.4 安装完成
0. Now start to check role configuration ...
1. Now start to install dos2unix/createrepo/expect and so on ...
2. Now start to check root password and do some initalization on all machines ...
3. Now start to check package requires ...
4. Now start to create OS user tbase ...
5. Now start to create yum repository ...
6. Now start to install Etcd ...
7. Now start to install all Confdbs ...
8. Now start to install Alarm Server ...
9. Now start to install all OssCenters ...
10. Now start to install all OssAgents ...
11. Now start to install default tbase_pgxz...
--[ 2020-12-01 22:39:21 ] install studio
12. Now start to install TStudio ...
13. Now start to write etcd keys ...
Successed to install TBase OSS, visit http://192.168.43.129:8080 to continue ...
Successed to install TStudio, visit http://192.168.43.129:5050
default account: postgres@postgres.com password: postgres
3.4 安装失败解决办法
[root@TBASE01logs]# cd /root/tbase_mgr/logs
[root@TBASE01logs]# ll
total 780
-rw------- 1 root root 784 Dec 1 22:30 authorized_keys.root
-rw-r--r-- 1 root root 392 Dec 1 22:30 authorized_keys.root.192.168.43.129
-rw-r--r-- 1 root root 392 Dec 1 22:30 authorized_keys.root.192.168.43.130
-rw-r--r-- 1 root root 524730 Dec 1 22:39 tbase_mgr.sh.info.log.2020-12-01
-rw-r--r-- 1 root root 38726 Dec 4 17:07 tbase_mgr.sh.info.log.2020-12-04
-rw-r--r-- 1 root root 70527 Dec 5 22:10 tbase_mgr.sh.info.log.2020-12-05
-rw-r--r-- 1 root root 4622 Dec 6 00:07 tbase_mgr.sh.info.log.2020-12-06
-rw-r--r-- 1 root root 133901 Dec 7 15:11 tbase_mgr.sh.info.log.2020-12-07
在提示查看的日志文件的末尾,有报错中的具体执行语句,手动执行即可显示详细的错误信息
解决后再次执行安装程序即可
[root@TBASE01TBase_mgr]# sh tbase_mgr.sh install
docker版本与内核冲突
如果docker版本与内核冲突,我们可以忽略安装docker,这样的Tstudio组件也要放弃安装
以下是修改方法:
编辑tbase_mgr.sh文件,修改以下内容:
1)把tstudio_install > /dev/null 2>&1注释掉
2)把yum install -y dos2unix expect lrzsz net-tools lsof createrepo httpd httpd-tools php docker中“docker”删除
如果一定要安装tstduio,找到当前内核版本下的docker,安装好后,再独立部署tstudio组件。
3.5 yum安装源恢复方法
[root@TBase_mgr]# rm -rf /etc/yum.repos.d/PGXZ.repo
[root@TBASE01TBase_mgr]# mv /etc/yum.repos.d/backup/* /etc/yum.repos.d/
[root@TBASE01TBase_mgr]# yum makecache
[root@TBASE01TBase_mgr]# yum clean all
所有机器都要恢复,如不再需要安装其它软件,可以不需要恢复
第四章 安装失败问题汇总
4.1 IDC参数写为hostname错误
root@tbase01 tbase_mgr]# cat logs/tbase_mgr.sh.info.log.2020-09-25
[ 2020-09-25 11:08:33 ] [ read_machine_info ] [ERROR]: Agent must be specified, exit
[ 2020-09-25 11:08:33 ] [ check_machine_idc ] [ERROR]: The machine_ip in (tbase01), but the tbase01 is not exist.
[ 2020-09-25 11:08:33 ] [ check_machine_idc ] [ERROR]: Fail to check machine idc in idc_1:local, exit
[ 2020-09-25 11:08:33 ] [ check_progress ] [ERROR]: from [read_machine_info] Abort progress due to an error occurs during step above, check logs/tbase_mgr.sh.info.log.2020-09-25 for more details ...
解决方法
--这里idc_1在OSS配置文件中写死,不建议修改
查看tbase_mgr/conf/role.info配置文件里的idc配置,不是hostname
[root@cdh01 tbase_mgr]# cat conf/role.info
eth0 192.168.43.129 idc_1 root xxxxxxxx OssCenterMaster|Confdb|TStudio|Etcd
eth0 192.168.43.130 idc_1 root xxxxxxxx OssCenterSlave|Confdb|Alarm
2、安装日志报yum源问题
Other repos take up 737 M of disk space (use --verbose for details)
[ 2020-09-25 11:26:09 ] [ conf_yum_repo ] [ERROR]: Failed to exec: ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no root@192.168.43.199 "yum -y install dos2unix expect readline createrepo net-tools lsof uuid" < /dev/null, retcode: 1, retmsg: Loaded plugins: fastestmirror, langpacks
Determining fastest mirrors, exit
[ 2020-09-25 11:26:09 ] [ check_progress ] [ERROR]: from [conf_yum_repo] Abort progress due to an error occurs during step above, check logs/tbase_mgr.sh.info.log.2020-09-25 for more details ...
解决方法
1、通过执行日志中的命令,发现yum源因权限问题无法访问
http://192.168.43.199/repodata/repomd.xml: [Errno 14] HTTP Error 403 - Forbidde
2、修改/etc/httpd/conf/httpd.conf
将denied改为granted
<Directory />
AllowOverride none
Require all granted
</Directory>
Error: Package: policycoreutils-python-2.5-11.el7_3.x86_64 (PGXZ)
Requires: policycoreutils = 2.5-11.el7_3
Installed: policycoreutils-2.5-29.el7.x86_64 (@anaconda)
policycoreutils = 2.5-29.el7
Available: policycoreutils-2.5-11.el7_3.x86_64 (PGXZ)
policycoreutils = 2.5-11.el7_3
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
解决方法
安装包冲突,需要删除原来的包
yum -y remove policycoreutils-2.5-29.el7.x86_64
3、 Etcd数量为偶数的问题
[ 2020-12-01 15:41:59 ] [parse_machine_role] [Info]: 192.168.43.130, OssCenterSlave|Confdb|Alarm|Etcd r=[Alarm]
[ 2020-12-01 15:41:59 ] [parse_machine_role] [Info]: 192.168.43.130, OssCenterSlave|Confdb|Alarm|Etcd r=[Etcd]
[ 2020-12-01 15:41:59 ] [ parse_machine_role ] [Info]: XZEtcds 192.168.43.130 r=[Etcd]
[ 2020-12-01 15:41:59 ] [ read_machine_info ] [ERROR]: Agent must be specified, exit
[ 2020-12-01 15:41:59 ] [ read_machine_info ] [ERROR]: The number of Etcd must be a odd number. etcd_cnt=2, etcd_server=192.168.43.129
[ 2020-12-01 15:41:59 ] [ check_progress ] [ERROR]: from [read_machine_info] Abort progress due to an error occurs during step above, check logs/tbase_mgr.sh.info.log.2020-12-01 for more details ...
解决方案
查看role.info,发现Etcd配置了两个,必须为奇数
[root@cdh01 tbase_mgr]# cat conf/role.info
ens33 192.168.43.129 cdh root oracle OssCenterMaster|Confdb|TStudio|Etcd
ens33 192.168.43.130 cdh02 root oracle OssCenterSlave|Confdb|Alarm|Etcd --删掉Etcd组件
4、root密码错误
[ 2020-12-01 15:51:32 ] [ selinux_policy_off ] [INFO]: spawn_pswd=oracle1
[ 2020-12-01 15:51:32 ] [ selinux_policy_off ] [INFO]: after spawn_pswd=oracle1
[ 2020-12-01 15:51:34 ] [ selinux_policy_off ] [ERROR]: Failed to exec: ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no root@192.168.43.129 "setenforce 0;iptables -I INPUT -j ACCEPT;service iptables save", retcode: 3, exit
解决方案
查看role.info,发现root密码错误
[root@cdh01 tbase_mgr]# cat conf/role.info
ens33 192.168.43.129 idc_1 root oracle1 OssCenterMaster|Confdb|TStudio|Etcd
ens33 192.168.43.130 idc_1 root oracle2 OssCenterSlave|Confdb|Alarm
5、网卡参数错误
[ 2020-12-01 16:00:46 ] [ check_net_eth ] [ERROR]: 192.168.43.129 not match 1ens33
解决方案
查看role.info,发现网络端口不存在
[root@cdh01 tbase_mgr]# cat conf/role.info
1ens33 192.168.43.129 idc_1 root oracle1 OssCenterMaster|Confdb|TStudio|Etcd
ens33 192.168.43.130 idc_1 root oracle2 OssCenterSlave|Confdb|Alarm
6、IP错误
测试发现,这步不影响YUM安装软件
[ 2020-12-01 16:04:33 ] [ selinux_policy_off ] [ERROR]: Failed to exec: ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no root@192.168.43.131 "sed -i ‘s/SELINUX=enforcing/SELINUX=disabled/‘ /etc/selinux/config", retcode: 5, exit
解决方案
查看role.info,发现IP
[root@cdh01 tbase_mgr]# cat conf/role.info
ens33 192.168.43.129 idc_1 root oracle OssCenterMaster|Confdb|TStudio|Etcd
ens33 192.168.43.131 idc_1 root oracle OssCenterSlave|Confdb|Alarm
7、不进入 tbase_mgr目录执行脚本
sh tbase_mgr/tbase_mgr.sh install日志报错:This system is not registered with an entitlement server. You can use subscription-manager to register.Determining fastest mirrors, exit
解决方法
cd tbase_mgr
sh tbase_mgr.sh install
8、docker安装失败
[ 2020-12-01 20:28:02 ] [ tstudio_install ] [ERROR]: Failed to exec: ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no root@192.168.43.129 "service docker start && sleep 10 && docker info" < /dev/null, retcode: 1, retmsg: , exit
[ 2020-12-01 20:28:02 ] [ check_progress ] [ERROR]: from [tstudio_install] Abort progress due to an error occurs during step above, check logs/tbase_mgr.sh.info.log.2020-12-01 for more details ...
解决
测试上述命令,发现docker进程启动失败
ils ...
[root@cdh tbase_mgr]# ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no root@192.168.43.129 "service docker start && sleep 10 && docker info" < /dev/null
Redirecting to /bin/systemctl start docker.service
Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
[root@cdh tbase_mgr]# systemctl status docker.service
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2020-12-01 20:30:19 CST; 11s ago
Docs: http://docs.docker.com
Process: 5490 ExecStart=/usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current $OPTIONS $DOCKER_STORAGE_OPTIONS $DOCKER_NETWORK_OPTIONS $ADD_REGISTRY $BLOCK_REGISTRY $INSECURE_REGISTRY (code=exited, status=1/FAILURE)
Main PID: 5490 (code=exited, status=1/FAILURE)
Dec 01 20:30:17 cdh systemd[1]: Starting Docker Application Container Engine...
Dec 01 20:30:17 cdh dockerd-current[5490]: time="2020-12-01T20:30:17.953932215+08:00" level=info msg="libcontainerd: new containerd process, pid: 5497"
Dec 01 20:30:18 cdh dockerd-current[5490]: time="2020-12-01T20:30:18.965695547+08:00" level=warning msg="devmapper: Usage of loopback devices ...section."
Dec 01 20:30:19 cdh dockerd-current[5490]: time="2020-12-01T20:30:19.006156809+08:00" level=warning msg="devmapper: Base device already exists...ignored."
Dec 01 20:30:19 cdh dockerd-current[5490]: time="2020-12-01T20:30:19.021358030+08:00" level=fatal msg="Error starting daemon: error initializi...DRIVER>)"
Dec 01 20:30:19 cdh systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Dec 01 20:30:19 cdh systemd[1]: Failed to start Docker Application Container Engine.
Dec 01 20:30:19 cdh systemd[1]: Unit docker.service entered failed state.
Dec 01 20:30:19 cdh systemd[1]: docker.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
journalctl -xe
--详细日志
-- Unit docker-storage-setup.service has begun starting up.
Dec 01 20:38:48 cdh docker-storage-setup[14921]: ERROR: Docker has been previously configured for use with devicemapper graph driver. Not creating a new t
Dec 01 20:38:48 cdh systemd[1]: docker-storage-setup.service: main process exited, code=exited, status=1/FAILURE
Dec 01 20:38:48 cdh docker-storage-setup[14921]: INFO: Docker state can be reset by stopping docker and by removing /var/lib/docker directory. This will d
Dec 01 20:38:48 cdh systemd[1]: Failed to start Docker Storage Setup.
-- Subject: Unit docker-storage-setup.service has failed
Dec 1 20:28:01 cdh docker-storage-setup: ERROR: Docker has been previously configured for use with devicemapper graph driver. Not creating a new thin pool as existing docker metadata will fail to work with it. Manual cleanup is required before this will succeed.
Dec 1 20:28:01 cdh docker-storage-setup: INFO: Docker state can be reset by stopping docker and by removing /var/lib/docker directory. This will destroy existing docker images and containers and all the docker metadata.
Dec 1 20:28:01 cdh systemd: docker-storage-setup.service: main process exited, code=exited, status=1/FAILURE
Dec 1 20:28:01 cdh systemd: Failed to start Docker Storage Setup.
Dec 1 20:28:01 cdh systemd: Unit docker-storage-setup.service entered failed state.
根据上述描述,发现是/var/lib/docker/已经存在,删除就可以。这个情况是我之前安装使用过DOCKER导致的
[root@cdh tbase_mgr]# rm -rf /var/lib/docker/
[root@cdh tbase_mgr]#
[root@cdh tbase_mgr]# ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no root@192.168.43.129 "service docker start && sleep 10 && docker info" < /dev/null
Redirecting to /bin/systemctl start docker.service
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 0
Server Version: 1.12.6
Storage Driver: devicemapper
Pool Name: docker-253:0-69112218-pool
Pool Blocksize: 65.54 kB
原文:https://www.cnblogs.com/dotalf/p/14098083.html