Elasticsearch for Apache Hadoop

Posted: 2018-09-05 19:23:59

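Elasticsearch for Apache Hadoop (elasticsearch-hadoop) is a small, open-source, stand-alone library that lets Hadoop jobs (MapReduce, Hive, Pig, Cascading, Spark) interact with Elasticsearch. It can be thought of as a connector that allows data to flow in both directions.

elasticsearch-hadoop provides first-class support for MapReduce, Cascading, Pig, and Hive, so using Elasticsearch feels much like using any other resource in the Hadoop cluster.

elasticsearch-hadoop is a client library for Elasticsearch, albeit one with extended functionality for supporting operations on Hadoop/Spark.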

Note

Elasticsearch for Apache Hadoop does not support rolling upgrades well. During a rolling upgrade, nodes that elasticsearch-hadoop is communicating with will be regularly disappearing and coming back online. Due to the constant connection failures that elasticsearch-hadoop will experience during the time frame of a rolling upgrade, there is a high probability that your jobs will fail. Thus, it is recommended that you disable any elasticsearch-hadoop based write or read jobs against Elasticsearch during your rolling upgrade process.

The elasticsearch-hadoop library is available from the Maven Central repository:

<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop</artifactId>
  <version>6.3.0</version>
</dependency>

elasticsearch-hadoop targets Hadoop 2.x (that is, YARN) environments. Support for Hadoop 1.x was phased out starting with version 5.5.

I. Integrating with Hive: add the elasticsearch-hadoop-hive-6.3.0.jar package

<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop-hive</artifactId>
  <version>6.3.0</version>
</dependency>

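Installation

Make elasticsearch-hadoop.jar (downloadable from the Maven repository) available on the Hive classpath. Any one of the following methods works:

1. Run the add jar /path/elasticsearch-hadoop.jar command in the Hive command line.

The path can be a local path or an HDFS path. An HDFS path is recommended, because the same command can then be run on multiple Hive machines.

2. Pass the auxpath option when starting Hive:

hive --auxpath=/path/elasticsearch-hadoop.jar

3. Set the hive.aux.jars.path property when starting Hive:

hive -hiveconf hive.aux.jars.path=/path/elasticsearch-hadoop.jar

4. Set the hive.aux.jars.path property directly in hive-site.xml: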

<property>
  <name>hive.aux.jars.path</name>
  <value>/path/elasticsearch-hadoop.jar</value>
  <description>A comma separated list (with no spaces) of the jar files</description>
</property>
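
The property description above says the value is a comma-separated list with no spaces. As a minimal sketch of that rule (the helper name aux_jars_value and the second jar path are hypothetical, and rejecting paths that contain spaces or commas is a conservative assumption rather than documented Hive behavior), the value could be assembled like this:

```python
def aux_jars_value(jar_paths):
    """Join jar paths into a comma-separated value with no spaces."""
    # Trim surrounding whitespace and skip empty entries.
    cleaned = [path.strip() for path in jar_paths if path.strip()]
    for path in cleaned:
        # Assumption: refuse paths that would make the list ambiguous.
        if "," in path or " " in path:
            raise ValueError(f"unsupported character in jar path: {path!r}")
    return ",".join(cleaned)


# The second jar is a made-up name, for illustration only.
value = aux_jars_value(["/path/elasticsearch-hadoop.jar", " /path/other-udfs.jar "])
print(value)
```

The resulting string can be pasted into the value element of hive-site.xml or passed after hive.aux.jars.path= on the command line.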

Configuration

When creating the table in Hive, declare it as an external table and use TBLPROPERTIES to set the Elasticsearch-related properties:

CREATE EXTERNAL TABLE artists (...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists', 'es.index.auto.create' = 'false');
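
Since the column list is elided in the statement above, it may help to see a complete statement. The following is only a sketch: render_es_table is a hypothetical helper, the id and name columns are invented for illustration, and no quote escaping is attempted. It renders the DDL text from a column list and a dictionary of table properties:

```python
def render_es_table(table, columns, properties):
    """Render a CREATE EXTERNAL TABLE statement backed by EsStorageHandler."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    # Each property becomes a 'key' = 'value' pair (values are not escaped here).
    props = ", ".join(f"'{key}' = '{val}'" for key, val in properties.items())
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        "STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'\n"
        f"TBLPROPERTIES({props});"
    )


# Hypothetical columns; the property names and values come from the example above.
ddl = render_es_table(
    "artists",
    [("id", "BIGINT"), ("name", "STRING")],
    {"es.resource": "radio/artists", "es.index.auto.create": "false"},
)
print(ddl)
```

The printed text can be run in the Hive command line once the jar is on the classpath.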

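Mapping

By default, elasticsearch-hadoop uses the Hive table's column names and types to map the data in Elasticsearch. In some cases, however, the Hive names cannot be used as-is with Elasticsearch, for example when an Elasticsearch field name contains characters that Hive does not accept in a column name (such as the @ in @timestamp). For these cases, set the es.mapping.names property when creating the Hive table. Its value is a comma-separated list of mappings in the form hive column name:es field name. For example: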

CREATE EXTERNAL TABLE artists (...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists', 'es.mapping.names' = 'date:@timestamp, url:url_123');

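In the example above, the date and url columns of the Hive external table artists correspond to the @timestamp and url_123 fields of the documents stored under the radio/artists resource (index radio, type artists) in Elasticsearch.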
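
To make the format of es.mapping.names concrete, here is a rough sketch of how such a value could be interpreted. It is not the library's implementation: parse_mapping_names and to_es_document are hypothetical helpers, the sample row is made up, and trimming whitespace around each entry is an assumption.

```python
def parse_mapping_names(value):
    """Parse an es.mapping.names-style string into {hive_column: es_field}."""
    mapping = {}
    for entry in value.split(","):
        entry = entry.strip()  # assumption: surrounding whitespace is ignored
        if not entry:
            continue
        # Split on the first colon: left is the Hive column, right the ES field.
        hive_column, sep, es_field = entry.partition(":")
        if not sep:
            raise ValueError(f"expected 'hive:es' but got {entry!r}")
        mapping[hive_column.strip()] = es_field.strip()
    return mapping


def to_es_document(row, mapping):
    """Rename the keys of a Hive row to their Elasticsearch field names."""
    return {mapping.get(column, column): value for column, value in row.items()}


mapping = parse_mapping_names("date:@timestamp, url:url_123")
# Made-up row; columns without a mapping entry keep their Hive name.
doc = to_es_document(
    {"date": "2018-09-05", "url": "http://example.org", "name": "sample"}, mapping
)
print(mapping)
print(doc)
```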

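Hive is case-insensitive, whereas Elasticsearch is case-sensitive. To avoid losing information when the case of column names does not match, elasticsearch-hadoop converts all Hive column names to lowercase.

Hive handles missing values through the special value NULL. This means that when an incorrect query is run against Elasticsearch (for example, one that references a nonexistent field), the Hive table is populated with NULL values instead of an exception being thrown.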
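
The two behaviors described above (lowercased column names and NULL for missing fields) can be pictured with a small model. This is only an illustration under stated assumptions, not elasticsearch-hadoop code: read_row is a hypothetical function, the sample document is invented, and Python's None stands in for Hive's NULL.

```python
def read_row(es_document, hive_columns, mapping=None):
    """Project an Elasticsearch document onto a list of Hive columns."""
    mapping = mapping or {}
    row = {}
    for column in hive_columns:
        name = column.lower()  # Hive column names are treated as lowercase
        es_field = mapping.get(name, name)  # optional hive-to-ES renaming
        # A field that is absent from the document yields None (NULL), not an error.
        row[name] = es_document.get(es_field)
    return row


# Made-up document with no "url" field.
document = {"name": "sample artist", "@timestamp": "2018-09-05"}
row = read_row(document, ["Name", "DATE", "url"], {"date": "@timestamp"})
print(row)
```

In this model a misspelled column name silently produces a column of None values, which is why a table full of NULLs is worth checking against the actual field names in the index.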

Writing data to Elasticsearch

Original article: https://www.cnblogs.com/koushr/p/9593697.html
