Hive安装及使用

hive安装及简单实用

本文旨在提供一种hive的安装方法，分享了在安装过程中踩到的各种坑，并介绍了实用spark读写hive数据库。

什么是hive

hive是一个构建在hadoop上的数据仓库框架，是应facebook每天产生的海量新兴社会网络数据进行管理和（机器）学习的需求而产生和发展的，其设计目的是让精通sql技能但java编程技能较弱的分析师能够对facebook存放在hdfs中的大规模数据集执行查询。

hive与传统数据仓库的区别

传统数据仓库采用oracle或mysql等数据库搭建，其数据也是存储在这些数据库中；而hive是建立在hadoop的hdfs上的数据仓库基础框架，其数据存储在hdfs上，但其元数据（metadata）因其metaStore配置不同而存在不同的地方。

hive的metastore有三种不同的配置方式：

默认是采用嵌入方式：内部集成derby数据库（hive自带的），元数据存储在deby中；
- 这种方式的局限在于：只允许创建一个连接，同时只允许一个用户访问，适用于作演示用的demo
- 这种方式的特性是hive服务和metastore服务（derby）在同一个jvm下
本地模式：元数据存储在外部数据库mysql（或其他）中
- mysql和hive运行在同一台物理机上，多用于开发和测试
- 允许创建多个连接
远程模式：元数据存储在外部数据库mysql（或其他）中
- hive和mysql运行在不同的机器上，多用于生产环境，允许创建多个连接

hive的metastore的三种不同配置方式，也对应了三种不同的安装方式，本文主要介绍第二种安装方式。

hive安装前置条件

安装hadoop
- 本文采用伪分布式安装
- 可参考：http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html 中的 Pseudo-Distributed Operation 一节。
安装mysql
- mac下使用brew install mysql命令就可以

安装hive

安装hive并配置环境变量

在mac下使用命令brew install hive ，该命令默认将hive安装在/usr/local/Cellar/hive下，接下来为hive创建环境变量，在/etc/profile文件中添加

#Hive
export HCAT_HOME="/usr/local/opt/hive/libexec/hcatalog"
export HIVE_HOME="/usr/local/Cellar/hive/3.1.1"
PATH=".$PATH:$HCAT_HOME/bin:$HIVE_HOME/bin"

并使用source /etc/profile 使其生效。

在mysql中为hive创建用户，及初始化数据库

# 登录数据库命令行
mysql -u root
# 创建数据库hive
create database hive;
# 新建用户hadoop，主机为localhost，密码为mysql
create user 'hadoop'@'localhost' identified by 'mysql';
# 为刚创建的用户授权
grant all privileges on *.* to 'hadoop'@'localhost' with grant option;
flush privileges;
# 查看用户
select user,host from mysql.user;

修改hive配置文件

配置文件路径: /usr/local/Cellar/hive/3.1.1/libexec/conf

在该路径下创建hive-site.xml文件并添加如下配置：

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
	<!--加入上一步创建的mysql的用户名、密码、数据库表连接url -->
	<property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hadoop</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>mysql</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>mysql
        <value>jdbc:mysql://localhost:3306/hive?useSSL=false</value>
    </property>
    <!-- 数据库连接驱动器 -->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <!-- 版本信息配置 -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
</configuration>

在hadoop中创建hive所需的仓库

# $HADOOP_HOME为hadoop的主目录
$HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir   -p  /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse

hive初始化mysql中的数据库hive

# $HIVE_HOME为hive安装路径的主目录
$HIVE_HOME/bin/schematool -dbType msyql -initSchema

在这一步时，遇到了如下问题

org.apache.hadoop.hive.metastore.HiveMetaException: Failed to load drive

这是因为找不到数据库的驱动器，解决方法是，在mysql官网中下载jar包放在 /usr/local/Cellar/hive/3.1.1/libexec/lib 目录下即可。注意在这个过程中，必须注意到下载的jar包和本地安装的mysql版本的一致性，否则会出现如下问题：

org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
Underlying cause: com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException : Client does not support authentication protocol requested by server; consider upgrading MySQL client

mysql connector jar包下载地址：https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.12.zip （可换成对应版本的地址）

启动hive的metastore server服务进程

$HIVE_HOME/bin/hive --service metastore &

登录hive客户端

# 若配置了hive环境变量，可以直接执行hive
$HIVE_HOME/bin/ hive 

至此，就完成了hive的安装。

hive sql的基本命令

# 基本的操作
create database if not exists sparktest;
create table if not exists sparktest.student(id int, name string, gender string, age int);
use sparktest;
show tables;
insert into student values(1, "Xueqian','F',23);
select * from student;

hive在执行sql语句时，会将insert等语句转换成map reduce任务去执行，比如在执行上述insert语句后，控制台会有以下流程显示：

hive> insert into student values(1,'Xueqian','F',23);
Query ID = horizonliu_20181218151635_973814dc-f035-4365-a6db-963d9c4fbabe
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1545117373444_0001, Tracking URL = http://HORIZONLIU-MC0:8088/proxy/application_1545117373444_0001/
Kill Command = /usr/local/hadoop-2.7.6/bin/mapred job  -kill job_1545117373444_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-12-18 15:16:44,175 Stage-1 map = 0%,  reduce = 0%
2018-12-18 15:16:49,324 Stage-1 map = 100%,  reduce = 0%
2018-12-18 15:16:53,438 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_1545117373444_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/sparktest.db/student/.hive-staging_hive_2018-12-18_15-16-35_709_3404301773481002595-1/-ext-10000
Loading data to table sparktest.student
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   HDFS Read: 17760 HDFS Write: 318 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 19.439 seconds

在向hive表中插入数据时，遇到了如下问题，导致数据插入失败：

hive> insert into student values(1, 'Xueqian','F',23);
Query ID = horizonliu_20181217224658_f3553e1a-d7b5-47a0-a065-8c7442a87b0d
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1545049302909_0001, Tracking URL = http://HORIZONLIU-MC0:8088/proxy/application_1545049302909_0001/
Kill Command = /usr/local/hadoop-2.7.6/bin/mapred job  -kill job_1545049302909_0001
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2018-12-17 22:47:06,953 Stage-1 map = 0%,  reduce = 0%
Ended Job = job_1545049302909_0001 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

为了定位这个问题，找了hive的日志，发现还没配置，所以我先配置了日志，然后再次执行了insert语句，翻看日志。

配置日志路径

hive的日志配置在/usr/local/Cellar/hive/3.1.1/libexec/conf/hive-log4j2.properties 文件中，包含基本的日志配置选项

# list of properties
property.hive.log.level = INFO
property.hive.root.logger = DRFA
# 这里替换成你想存储日志的路径
property.hive.log.dir = /usr/local/Cellar/hive/3.1.1/libexec/conf/logs
property.hive.log.file = hive.log
property.hive.perflogger.log.level = INFO

定位问题
- 再次执行insert语句，打开日志文件发现和控制台上显示的并无差别，怎么搞！

因此在网上查hive无法插入数据的问题，搜索到：https://stackoverflow.com/questions/11185528/what-is-hive-return-code-2-from-org-apache-hadoop-hive-ql-exec-mapredtask ，告诉通过hadoop的jobtracker去定位问题。于是我去找hadoop的jobtracker的web界面，网上给的localhost:50030 端口，但无法打开，以为是hadoop没有成功启动jobtracker，但后来发现是hadoop1.x和hadoop2.x使用的端口不一样，hadoop2.x jobtracker使用的端口号是8088，打开localhost:8088 看到相关界面如下：

点击failed找到失败的任务，从logs 中获取相关日志，发现出奇简单：

/bin/bash /bin/java no such file or directory

加入hadoop关键字搜索了一下，得到解决方案：

https://stackoverflow.com/questions/33968422/bin-bash-bin-java-no-such-file-or-directory，是没有配置 hadoop的 hadoop-config.sh 文件，该文件在/usr/local/hadoop-2.7.6/libexec 路径下。做如下配置：

# Attempt to set JAVA_HOME if it is not set
if [[ -z $JAVA_HOME ]]; then
  # On OSX use java_home (or /Library for older versions)
  if [ "Darwin" == "$(uname -s)" ]; then
    if [ -x /usr/libexec/java_home ]; then
      export JAVA_HOME=$(/usr/libexec/java_home)
    else
      export JAVA_HOME=/Library/Java/Home
    fi
  fi

接下来重启hadoop，再在hive中执行insert语句，发现执行成功。

使用spark操作hive数据库

本小节主要介绍使用spark读写hive中的数据，在这之前，需要对spark进行一些配置，主要是修改spark-env.sh文件，该文件在 /usr/local/spark-2.3.1-bin-hadoop2.7/conf 路径下。在其中加入：

# 各路径的配置依赖你的安装路径
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop-2.7.6/bin/hadoop classpath)
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home
export CLASSPATH=$CLASSPATH:/usr/local/Cellar/hive/3.1.1/libexec/lib:/usr/local/spark-2.3.1-bin-hadoop2.7/jars
export SCALA_HOME=/usr/local/scala-2.12.6
export SPARK_WORKER_MEMORY=1G
export HADOOP_HOME=/usr/local/hadoop-2.7.6
export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.6/etc/hadoop
export HIVE_CONF_DIR=/usr/local/Cellar/hive/3.1.1/libexec/conf
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/usr/local/spark-2.3.1-bin-hadoop2.7/jars/mysql-connector-java-8.0.12.jar
export SPARK_MASTER_IP=localhost

接下来通过spark-shell来测试下hive的读写，在spark的bin目录下执行./spark-shell 命令进入spark命令行。执行：

scala> import org.apache.spark.sql.hive.HiveContext
scala> val hiveCtx = new HiveContext(sc)
scala> val studentRDD = hiveCtx.sql("select * from sparktest.student").rdd
scala> studentRDD.foreach(t => println("Name:"+t(1)+",Gender:"+t(2)+",Age:"+t(3)))

在执行上述的第三步时，遇到错误，说是找不到sparktest.student 这个表，但这个表在hive中是的确存在的，且能通过hadoop的web页面找到：

那为什么spark读不到student表信息？这里我怀疑在spark在读hive的metastore的时候默认读的是derby(hive嵌入模式）中的数据，而derby中明显没有这个数据库的元信息，因此spark也就无法找到student表。所以考虑到在hive配置的时候hive-site.xml中加入了关于mysql的配置，这里spark可能也需要用这个配置去识别，所以将 hive-site.xml文件复制到spark的 conf 目录下，重新启动spark.shell，再次执行 val studentRDD = hiveCtx.sql("select * from sparktest.student").rdd 。此时出现无法找到mysql-connector对应驱动，将

mysql-connector-java-8.0.12.jar 包copy到spark的jars目录下，并将路径加到spark-env.sh，成功解决上述问题、读到hive表中的数据。

参考博客

hive安装：https://www.jianshu.com/p/5c11073d19d3

hive安装过程中遇到的问题：https://blog.csdn.net/BrotherDong90/article/details/49661731

mysql创建/删除/授权/查看用户：https://blog.csdn.net/u014453898/article/details/55064312

Spark入门-连接Hive读写数据（DataFrame）：http://dblab.xmu.edu.cn/blog/1086-2/

hive表插入数据失败：https://stackoverflow.com/questions/11185528/what-is-hive-return-code-2-from-org-apache-hadoop-hive-ql-exec-mapredtask

hive表插入数据失败，需配置hadoop-config.sh文件：https://stackoverflow.com/questions/33968422/bin-bash-bin-java-no-such-file-or-directory

hive日志配置：https://www.jianshu.com/p/e5d3c59634fc