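CDH 4.3 does not ship a ready-made Parquet package, so using the Parquet format in Hive or Impala requires a manual installation. When creating a Parquet table, you must specify the Parquet InputFormat, OutputFormat, and SerDe. The CREATE TABLE statement is as follows: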

    hive> create table parquet_test(x int, y string)
        > row format serde 'parquet.hive.serde.ParquetHiveSerDe'
        > stored as inputformat 'parquet.hive.DeprecatedParquetInputFormat'
        > outputformat 'parquet.hive.DeprecatedParquetOutputFormat';
    FAILED: SemanticException [Error 10055]: Output Format must implement HiveOutputFormat, otherwise it should be either IgnoreKeyTextOutputFormat or SequenceFileOutputFormat

The statement fails because the class parquet.hive.DeprecatedParquetOutputFormat is not on Hive's CLASSPATH. The class lives in parquet-hive-1.2.5.jar under $IMPALA_HOME/lib, so creating a symlink to it in $HIVE_HOME/lib is enough:

    cd $HIVE_HOME/lib
    ln -s $IMPALA_HOME/lib/parquet-hive-1.2.5.jar

Submitting the CREATE TABLE statement again produces the following error:

    hive> create table parquet_test(x int, y string)
        > row format serde 'parquet.hive.serde.ParquetHiveSerDe'
        > stored as inputformat 'parquet.hive.DeprecatedParquetInputFormat'
        > outputformat 'parquet.hive.DeprecatedParquetOutputFormat';
    Exception in thread "main" java.lang.NoClassDefFoundError: parquet/hadoop/api/WriteSupport
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.hadoop.hive.ql.plan.CreateTableDesc.validate(CreateTableDesc.java:403)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:8858)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:8190)
        at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:459)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:349)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:938)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:902)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:412)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:613)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
    Caused by: java.lang.ClassNotFoundException: parquet.hadoop.api.WriteSupport
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        ... 20 more

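The trace names the class that could not be loaded, parquet.hadoop.api.WriteSupport. One way to check whether any jar already on disk contains it is sketched below. Both function names are hypothetical helpers, not part of Hive, and the sketch assumes the `unzip` command is installed.

```shell
# Hypothetical helpers for locating a class inside jar files.

# Convert a class name in either form the trace uses
# (parquet/hadoop/api/WriteSupport or parquet.hadoop.api.WriteSupport)
# into the entry name it would have inside a jar.
class_to_entry() {
    echo "$(echo "$1" | tr . /).class"
}

# Print every jar in directory $2 that contains class $1.
# Assumes unzip is available; jars it cannot read are skipped silently.
jars_containing_class() {
    entry=$(class_to_entry "$1")
    for jar in "$2"/*.jar; do
        [ -f "$jar" ] || continue
        if unzip -l "$jar" 2>/dev/null | grep -qF "$entry"; then
            echo "$jar"
        fi
    done
}
```

For example, `jars_containing_class parquet.hadoop.api.WriteSupport "$HIVE_HOME/lib"` prints nothing if no jar in that directory provides the class.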
This error occurs because several other Parquet jars are missing. Downloading them directly into $HIVE_HOME/lib resolves it:

    cd /usr/lib/hive/lib
    for f in parquet-avro parquet-cascading parquet-column parquet-common parquet-encoding parquet-generator parquet-hadoop parquet-hive parquet-pig parquet-scrooge parquet-test-hadoop2 parquet-thrift
    do
        curl -O https://oss.sonatype.org/service/local/repositories/releases/content/com/twitter/${f}/1.2.5/${f}-1.2.5.jar
    done
    curl -O https://oss.sonatype.org/service/local/repositories/releases/content/com/twitter/parquet-format/1.0.0/parquet-format-1.0.0.jar

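A failed download is easy to miss in a loop like this, so it may be worth confirming that every jar actually landed. The sketch below is one possible check; `missing_parquet_jars` is a hypothetical helper, and the module list and versions are simply copied from the download commands.

```shell
# Module names and versions copied from the download loop.
PARQUET_MODULES="parquet-avro parquet-cascading parquet-column parquet-common parquet-encoding parquet-generator parquet-hadoop parquet-hive parquet-pig parquet-scrooge parquet-test-hadoop2 parquet-thrift"

# Print the file name of every expected jar that is absent from directory $1.
# Hypothetical helper, not part of Hive or Parquet.
missing_parquet_jars() {
    dir=$1
    for m in $PARQUET_MODULES; do
        [ -f "$dir/$m-1.2.5.jar" ] || echo "$m-1.2.5.jar"
    done
    [ -f "$dir/parquet-format-1.0.0.jar" ] || echo "parquet-format-1.0.0.jar"
}
```

Running `missing_parquet_jars /usr/lib/hive/lib` after the downloads should print nothing. Any name it prints is a jar to fetch again.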
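Submitting the CREATE TABLE statement once more succeeds. With the table created, data from other tables needs to be loaded into it. Executing that HQL requires the Parquet jars, and there are two ways to provide them. The first is to run add jar for every jar before the statement, which is tedious. The second is to configure them in hive-site.xml: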

    <property>
      <name>hive.aux.jars.path</name>
      <value>file:///usr/lib/hadoop/lib/parquet-hive-1.2.5.jar,file:///usr/lib/hadoop/lib/parquet-hadoop-1.2.5.jar,file:///usr/lib/hadoop/lib/parquet-avro-1.2.5.jar,file:///usr/lib/hadoop/lib/parquet-cascading-1.2.5.jar,file:///usr/lib/hadoop/lib/parquet-column-1.2.5.jar,file:///usr/lib/hadoop/lib/parquet-common-1.2.5.jar,file:///usr/lib/hadoop/lib/parquet-encoding-1.2.5.jar,file:///usr/lib/hadoop/lib/parquet-format-1.0.0.jar,file:///usr/lib/hadoop/lib/parquet-generator-1.2.5.jar,file:///usr/lib/hadoop/lib/parquet-scrooge-1.2.5.jar,file:///usr/lib/hadoop/lib/parquet-test-hadoop2-1.2.5.jar,file:///usr/lib/hadoop/lib/parquet-thrift-1.2.5.jar</value>
    </property>

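Either approach means listing a dozen jars by hand, which invites typos. The sketch below generates both forms from whatever parquet-*.jar files a directory actually contains. The two function names are hypothetical helpers, and the directory argument is assumed to be an absolute path so that the resulting URIs start with `file:///`.

```shell
# Hypothetical helpers: neither function ships with Hive.

# Emit one "add jar" statement per Parquet jar found in directory $1.
parquet_add_jar_statements() {
    for jar in "$1"/parquet-*.jar; do
        [ -f "$jar" ] && echo "add jar $jar;"
    done
    return 0
}

# Emit the same jars as a comma-separated list of file:// URIs,
# the form used in the hive.aux.jars.path value.
# Assumes $1 is an absolute path.
parquet_aux_jars_path() {
    list=""
    for jar in "$1"/parquet-*.jar; do
        [ -f "$jar" ] || continue
        list="${list:+$list,}file://$jar"
    done
    echo "$list"
}
```

For the first approach, the output of `parquet_add_jar_statements` can be saved to a file and run at the start of a session. For the second, the output of `parquet_aux_jars_path` can be pasted into the `<value>` element.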
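Once this is configured, set the parquet.compression property to choose the compression used for the converted data; UNCOMPRESSED, GZIP, and SNAPPY are currently supported. The conversion itself can then be done with insert...select.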
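As a rough illustration of that last step, the sketch below prints the two HiveQL statements involved. `parquet_convert_hql` is a hypothetical helper, the source table name in the usage line is a placeholder, and `insert overwrite ... select *` is just one form the insert...select could take.

```shell
# Print HiveQL that copies table $1 into Parquet table $2 using codec $3.
# Hypothetical helper; it only generates text and does not call Hive itself.
parquet_convert_hql() {
    src=$1; dst=$2; codec=$3
    # Accept only the three codecs listed in the article.
    case $codec in
        UNCOMPRESSED|GZIP|SNAPPY) ;;
        *) echo "unsupported codec: $codec" >&2; return 1 ;;
    esac
    echo "set parquet.compression=$codec;"
    echo "insert overwrite table $dst select * from $src;"
}
```

It could be used as `hive -e "$(parquet_convert_hql some_text_table parquet_test SNAPPY)"`, where some_text_table stands for an existing table with matching columns.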
Reference: http://cmenguy.github.io/blog/2013/10/30/using-hive-with-parquet-format-in-cdh-4-dot-3/
Original: http://www.cnblogs.com/ledemi/p/6322804.html