Hive installation
1. Upload and unpack the Hive tarball
2. cd conf; create hive-site.xml (with the MySQL metastore connection settings)
vim hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<!-- note: '&' must be escaped as '&amp;' inside an XML value -->
<value>jdbc:mysql://pengyy63:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
<description>password to use against metastore database</description>
</property>
</configuration>
3. cd lib
Upload the MySQL JDBC driver: mysql-connector-java-5.1.39.jar
4. Start Hive
bin/hive
Hive 1.x starts and works without any extra steps.
With Hive 3.x, the shell starts, but running any SQL statement fails with the following error:
FAILED: HiveException java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Solution:
bin/schematool -initSchema -dbType mysql
The error shows that the metastore schema has not been initialized yet; run the schematool command above from the bin directory of the Hive installation before using Hive.
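To double-check that initialization succeeded, you can look for the metastore tables in the MySQL database configured above (a hedged sketch, reusing the hive database on pengyy63 and the root/root credentials from hive-site.xml):
mysql -h pengyy63 -uroot -proot -e "use hive; show tables;"   # should list metastore tables such as DBS, TBLS and COLUMNS_V2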
Hive syntax
1.show databases;
2.create database pengyy;
3.use pengyy;
4.show tables;
5.select * from t_user;
6. Show column headers: set hive.cli.print.header=true; -- only takes effect in the current session
7. Show the current database name in the prompt: set hive.cli.print.current.db=true; -- only takes effect in the current session
-- Note: to have both settings applied automatically whenever Hive starts, create a .hiverc file in the current user's home directory:
cd ~    -- go to the home directory
vim .hiverc
set hive.cli.print.current.db=true;
set hive.cli.print.header=true;
8. Create a table
use pengyy;
create table t_movies(id string,name string,director string)
row format delimited
fields terminated by ',';
vim movies.txt
1,哪吒,张艺谋
2,星空,唐晶
3,啦啦,呼呼
hadoop fs -put movies.txt /user/hive/warehouse/pengyy.db/t_movies/
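To confirm the file landed in the table directory (a hedged check, assuming the default warehouse location shown above):
hadoop fs -ls /user/hive/warehouse/pengyy.db/t_movies/   # should list movies.txt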
In the Hive client:
select * from t_movies;
+--------------+----------------+--------------------+
| t_movies.id | t_movies.name | t_movies.director |
+--------------+----------------+--------------------+
| 1 | 哪吒 | 张艺谋 |
| 2 | 星空 | 唐晶 |
| 3            | 啦啦           | 呼呼               |
+--------------+----------------+--------------------+
vim movies2.txt
aa,战狼2,吴京
bb,三生三世十里桃花,杨幂
hadoop fs -put movies2.txt /user/hive/warehouse/pengyy.db/t_movies/
select * from t_movies;
+--------------+----------------+--------------------+
| t_movies.id | t_movies.name | t_movies.director |
+--------------+----------------+--------------------+
| 1 | 哪吒 | 张艺谋 |
| 2 | 星空 | 唐晶 |
| 3 | 啦啦 | 呼呼 |
| aa | 战狼2 | 吴京 |
| bb | 三生三世十里桃花 | 杨幂 |
+--------------+----------------+--------------------+
Starting the Hive server
Running HiveServer2 in the foreground
On the Hive server host pengyy42, start: bin/hiveserver2
On the client host pengyy43, start: bin/beeline
bin/beeline
!connect jdbc:hive2://pengyy42:10000
If the following error appears: 19/07/30 22:49:29 [main]: WARN jdbc.HiveConnection: Failed to connect to pengyy42:10000
Error: Could not open client transport with JDBC Uri: jdbc:hive2://pengyy42:10000: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: root is not allowed to impersonate root (state=08S01,code=0)
Solution:
Add the following properties to Hadoop's core-site.xml:
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
In the hadoop.proxyuser.xxx.hosts and hadoop.proxyuser.xxx.groups properties above, xxx must be set to root (i.e. to whatever user name appears after "User:" in your error log). The value "*" means that any user, group, or host is allowed to act through the super proxy user "xxx". Then restart HDFS and HiveServer2 and reconnect with beeline (a sketch follows below).
Why this change is needed:
Hadoop has an impersonation (proxy user) mechanism: an upper-layer system such as HiveServer2 is not allowed to pass the real end user straight through to Hadoop. Instead it hands the real user to a configured super proxy user, and that proxy performs the operations on Hadoop on the user's behalf, which prevents arbitrary clients from operating on Hadoop at will.
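After editing core-site.xml, the new proxyuser settings have to be picked up. A hedged sketch of the usual options, assuming the standard Hadoop sbin scripts and admin commands are available on the server:
# option 1: restart HDFS, then restart hiveserver2 and reconnect with beeline
stop-dfs.sh && start-dfs.sh
# option 2: refresh the proxyuser settings without a full restart
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration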
Running HiveServer2 in the background
In Linux, file descriptor 1 is standard output and 2 is standard error:
./test.sh
./test.sh &
./test.sh 1>test.log 2>err.log &    -- redirect the console output into log files
In Linux, /dev/null behaves like a "black hole" file: everything written to it is discarded forever
If the script's output is not needed, redirect it to /dev/null:
./test.sh 1>/dev/null 2>/dev/null &
./test.sh 1>/dev/null 2>&1 &    -- redirect stderr (2) to wherever stdout (1) currently points
fg brings a background job to the foreground; Ctrl+C then stops it
fg 1
nohup ./test.sh 1>/dev/null 2>&1 &    -- keeps the program running in the background even after the terminal is closed
Start HiveServer2 in the background:
nohup bin/hiveserver2 >/dev/null 2>&1 &
HiveServer2 listens on port 10000 by default; check that it is up with:
netstat -ntlp
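If the port is not listening, the HiveServer2 log usually shows why. A hedged check, assuming the default log location (/tmp/<user>/hive.log) and that the server runs as root:
netstat -ntlp | grep 10000
tail -f /tmp/root/hive.log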
Client connection, first way:
bin/beeline
!connect jdbc:hive2://pengyy42:10000
Client connection, second way:
bin/beeline -u jdbc:hive2://pengyy42:10000 -n root
Running Hive from scripts
1. Run a statement directly: hive -e "select * from t_user"
2. Put the statements in a shell script and run it:
vim test1.sh
#!/bin/bash
hive -e "create table t_count_sex(sex string,number int)"
hive -e "insert into t_count_sex select sex,count(1) from t_user group by sex"
hive -e "select * from t_count_sex"
sh test1.sh
3. Create an SQL script test2.sql and run it with hive -f test2.sql:
vim test2.sql
select * from t_user; -- a long list of SQL statements, omitted here
hive -f test2.sql
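hive -f also accepts variables, which is handy for parameterizing daily scripts. A small hedged sketch; the script name test3.sql and the variable tname are made up for illustration:
vim test3.sql
select * from ${hivevar:tname};   -- substituted at run time
hive --hivevar tname=t_user -f test3.sql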
Hive managed (internal) and external tables
Managed (internal) table: the table directory follows Hive's conventions and lives under the warehouse directory /user/hive/warehouse/
Dropping a managed table 1. deletes the data files on HDFS and 2. deletes the table's metadata (the entries in the MySQL hive database); see the drop comparison after the example below
External table: the table directory is chosen by the user who creates the table
vim pv_log.txt
192.168.31.22,http://sina.com/a,2019-07-31 20:00:23
192.168.31.22,http://sina.com/b,2019-07-26 15:00:31
192.168.31.23,http://sina.com/k,2019-07-31 20:09:27
192.168.31.24,http://sina.com/a,2019-07-27 20:00:31
192.168.31.25,http://sina.com/h,2019-07-31 20:08:29
192.168.31.28,http://sina.com/a,2019-07-29 20:22:31
192.168.31.22,http://sina.com/e,2019-07-31 20:07:22
192.168.31.88,http://sina.com/c,2019-07-31 20:00:11
hadoop fs -mkdir -p /pvlogs
hadoop fs -put pv_log.txt /pvlogs
use pengyy;
create external table t_pv_log(ip string,url string ,access_time string)
row format delimited
fields terminated by ','
location '/pvlogs'
;
select * from t_pv_log;
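To illustrate the difference when dropping (a hedged example; do not actually run these if you still need the data):
drop table t_movies;   -- managed table: removes both the metadata and the files under /user/hive/warehouse/pengyy.db/t_movies
drop table t_pv_log;   -- external table: removes only the metadata; the files under /pvlogs stay on HDFS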
Hive partitioned tables
cd /usr/local/test
vim pv_log.31
192.168.31.22,http://sina.com/a,2019-07-31 20:00:23
192.168.31.22,http://sina.com/b,2019-07-31 15:00:31
192.168.31.23,http://sina.com/k,2019-07-31 20:09:27
192.168.31.24,http://sina.com/a,2019-07-31 20:00:31
192.168.31.25,http://sina.com/h,2019-07-31 20:08:29
192.168.31.28,http://sina.com/a,2019-07-31 20:22:31
192.168.31.22,http://sina.com/e,2019-07-31 20:07:22
192.168.31.88,http://sina.com/c,2019-07-31 20:00:11
vim pv_log.30
192.168.31.22,http://sina.com/a,2019-07-30 20:00:23
192.168.31.22,http://sina.com/b,2019-07-30 15:00:31
192.168.31.23,http://sina.com/k,2019-07-30 20:09:27
192.168.31.24,http://sina.com/a,2019-07-30 20:00:31
192.168.31.25,http://sina.com/h,2019-07-30 20:08:29
192.168.31.28,http://sina.com/a,2019-07-30 20:22:31
192.168.31.22,http://sina.com/e,2019-07-30 20:07:22
192.168.31.88,http://sina.com/c,2019-07-30 20:00:11
vim pv_log.29
192.168.31.22,http://sina.com/a,2019-07-29 20:00:23
192.168.31.22,http://sina.com/b,2019-07-29 15:00:31
192.168.31.23,http://sina.com/k,2019-07-29 20:09:27
192.168.31.24,http://sina.com/a,2019-07-29 20:00:31
192.168.31.25,http://sina.com/h,2019-07-29 20:08:29
192.168.31.28,http://sina.com/a,2019-07-29 20:22:31
192.168.31.22,http://sina.com/e,2019-07-29 20:07:22
192.168.31.88,http://sina.com/c,2019-07-29 20:00:11
-- create a partitioned table
create external table t_pv_log_daily(ip string,url string ,access_time string)
partitioned by (dtime string)
row format delimited
fields terminated by ','
;
Load the local Linux files into Hive.
In the Hive client:
load data local inpath "/usr/local/test/pv_log.31" into table t_pv_log_daily partition(dtime='20190731');
load data local inpath "/usr/local/test/pv_log.30" into table t_pv_log_daily partition(dtime='20190730');
load data local inpath "/usr/local/test/pv_log.29" into table t_pv_log_daily partition(dtime='20190729');
Note: a partition column must not be one of the columns already defined in the table; Hive treats the partition column as a pseudo-column of the table.
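After the loads, each partition is visible both in the metastore and as its own subdirectory under the table location (a hedged check, assuming the default warehouse path confirmed later by show create table):
show partitions t_pv_log_daily;   -- should list dtime=20190729, dtime=20190730, dtime=20190731
hadoop fs -ls /user/hive/warehouse/pengyy.db/t_pv_log_daily/   # one subdirectory per partition, e.g. dtime=20190731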
Create a table with the same structure: create table t_pv_log_daily_2 like t_pv_log_daily; -- identical structure (including partitioning), no data
Create a table whose columns match a query result:
create table t_pv_log_daily_3 as select * from t_pv_log_daily; -- populated with the query result; dtime becomes an ordinary column and the new table is not partitioned
select * from t_pv_log_daily;
+------------------+---------------------+-----------------------------+-----------------------+
| t_pv_log_daily.ip | t_pv_log_daily.url | t_pv_log_daily.access_time | t_pv_log_daily.dtime |
+------------------+---------------------+-----------------------------+-----------------------+
| 192.168.31.22 | http://sina.com/a | 2019-07-29 20:00:23 | 20190729 |
| 192.168.31.22 | http://sina.com/b | 2019-07-29 15:00:31 | 20190729 |
| 192.168.31.23 | http://sina.com/k | 2019-07-29 20:09:27 | 20190729 |
| 192.168.31.24 | http://sina.com/a | 2019-07-29 20:00:31 | 20190729 |
| 192.168.31.25 | http://sina.com/h | 2019-07-29 20:08:29 | 20190729 |
| 192.168.31.28 | http://sina.com/a | 2019-07-29 20:22:31 | 20190729 |
| 192.168.31.22 | http://sina.com/e | 2019-07-29 20:07:22 | 20190729 |
| 192.168.31.88 | http://sina.com/c | 2019-07-29 20:00:11 | 20190729 |
| 192.168.31.22 | http://sina.com/a | 2019-07-30 20:00:23 | 20190730 |
| 192.168.31.22 | http://sina.com/b | 2019-07-30 15:00:31 | 20190730 |
| 192.168.31.23 | http://sina.com/k | 2019-07-30 20:09:27 | 20190730 |
| 192.168.31.24 | http://sina.com/a | 2019-07-30 20:00:31 | 20190730 |
| 192.168.31.25 | http://sina.com/h | 2019-07-30 20:08:29 | 20190730 |
| 192.168.31.28 | http://sina.com/a | 2019-07-30 20:22:31 | 20190730 |
| 192.168.31.22 | http://sina.com/e | 2019-07-30 20:07:22 | 20190730 |
| 192.168.31.88 | http://sina.com/c | 2019-07-30 20:00:11 | 20190730 |
| 192.168.31.22 | http://sina.com/a | 2019-07-31 20:00:23 | 20190731 |
| 192.168.31.22 | http://sina.com/b | 2019-07-31 15:00:31 | 20190731 |
| 192.168.31.23 | http://sina.com/k | 2019-07-31 20:09:27 | 20190731 |
| 192.168.31.24 | http://sina.com/a | 2019-07-31 20:00:31 | 20190731 |
| 192.168.31.25 | http://sina.com/h | 2019-07-31 20:08:29 | 20190731 |
| 192.168.31.28 | http://sina.com/a | 2019-07-31 20:22:31 | 20190731 |
| 192.168.31.22 | http://sina.com/e | 2019-07-31 20:07:22 | 20190731 |
| 192.168.31.88 | http://sina.com/c | 2019-07-31 20:00:11 | 20190731 |
+------------------+---------------------+-----------------------------+-----------------------+
As the output shows, the partition column dtime appears as a pseudo-column of the table.
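Because dtime is a partition column, filtering on it prunes the query to the matching partition directory instead of scanning the whole table, for example:
select * from t_pv_log_daily where dtime='20190730';   -- only the files under dtime=20190730 are read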
desc t_pv_log_daily;
+--------------------------+------------+----------+
| col_name | data_type | comment |
+--------------------------+------------+----------+
| ip | string | |
| url | string | |
| access_time | string | |
| dtime | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| dtime | string | |
+--------------------------+------------+----------+
show create table t_pv_log_daily;
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE EXTERNAL TABLE `t_pv_log_daily`( |
| `ip` string, |
| `url` string, |
| `access_time` string) |
| PARTITIONED BY ( |
| `dtime` string) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| WITH SERDEPROPERTIES ( |
| 'field.delim'=',', |
| 'serialization.format'=',') |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.mapred.TextInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION |
| 'hdfs://pengyy22:9000/user/hive/warehouse/pengyy.db/t_pv_log_daily' |
| TBLPROPERTIES ( |
| 'bucketing_version'='2', |
| 'transient_lastDdlTime'='1564577640') |
+----------------------------------------------------+
desc formatted t_pv_log_daily;
| col_name | data_type | comment |
+---------------------------+-----------------------------------------+-----------------------+
| # col_name | data_type | comment |
| ip | string | |
| url | string | |
| access_time | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| dtime | string | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | pengyy | NULL |
| OwnerType: | USER | NULL |
| Owner: | root | NULL |
| CreateTime: | Wed Jul 31 20:54:00 CST 2019 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://pengyy22:9000/user/hive/warehouse/pengyy.db/t_pv_log_daily | NULL |
| Table Type: | EXTERNAL_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | EXTERNAL | TRUE |
| | bucketing_version | 2 |
| | numFiles | 3 |
| | numPartitions | 3 |
| | numRows | 0 |
| | rawDataSize | 0 |
| | totalSize | 1249 |
| | transient_lastDdlTime | 1564577640 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | field.delim | , |
| | serialization.format | , |
+-------------------------+-------------------------------------------+-----------------------+
Data import and export
In the Hive client:
Import a local file into Hive:
load data local inpath '/usr/local/test/user.info' into table t_user;
Import a file already on HDFS into Hive:
load data inpath '/user2.info' into table t_user;
Difference:
Loading a local file into Hive is a copy: an identical file is created under the table directory and the local source file is left in place
Loading an HDFS file into Hive is a move: the file is moved from its HDFS source path into the table directory
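A hedged way to observe the copy-versus-move behaviour, plus a simple export example (the export directory /usr/local/test/export_user is made up for illustration):
ls /usr/local/test/user.info                             # still present after 'load data local inpath ...'
hadoop fs -ls /user2.info                                # gone after 'load data inpath ...'
hadoop fs -ls /user/hive/warehouse/pengyy.db/t_user/     # the loaded files now sit under the table directory
Export: write a query result to a local directory:
insert overwrite local directory '/usr/local/test/export_user'
row format delimited fields terminated by ','
select * from t_user;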
Original post: https://www.cnblogs.com/pengyy/p/11318262.html