Environment Preparation

Hive depends on the following environment.

The cluster topology is as follows.

Install MySQL

# Install the MySQL yum repository
yum install wget
wget https://repo.mysql.com/mysql80-community-release-el7-1.noarch.rpm
yum install mysql80-community-release-el7-1.noarch.rpm
yum update -y

# Install MySQL
yum install mysql-community-server

# Start MySQL now and enable it at boot
systemctl start mysqld
systemctl enable mysqld

Installation Pitfalls

My installation failed with the following error:

warning: /var/cache/yum/x86_64/7/mysql80-community/packages/mysql-community-common-8.0.28-1.el7.x86_64.rpm: Header V4 RSA/SHA256 Signature, key ID 3a79bd29: NOKEY
Retrieving key from file:///etc/pki/rpm-gpg/RPM-GPG-KEY-mysql


The GPG keys listed for the "MySQL 8.0 Community Server" repository are already installed but they are not correct for this package.
Check that the correct key URLs are configured for this repository.


Failing package is: mysql-community-common-8.0.28-1.el7.x86_64
GPG Keys are configured as: file:///etc/pki/rpm-gpg/RPM-GPG-KEY-mysql

A likely cause is that the repo RPM ships MySQL's old GPG signing key (MySQL rotated its key in 2022); importing the new key with rpm --import https://repo.mysql.com/RPM-GPG-KEY-mysql-2022 should fix it properly. Here I simply bypassed GPG verification:

yum install mysql-community-server --nogpgcheck

Log in to MySQL

The initial root password was written to the MySQL log during initialization:

cat /var/log/mysqld.log

[root@master ~]# cat /var/log/mysqld.log
2022-04-25T07:22:42.969583Z 0 [System] [MY-013169] [Server] /usr/sbin/mysqld (mysqld 8.0.28) initializing of server in progress as process 52213
2022-04-25T07:22:42.987672Z 1 [System] [MY-013576] [InnoDB] InnoDB initialization has started.
2022-04-25T07:22:43.426894Z 1 [System] [MY-013577] [InnoDB] InnoDB initialization has ended.
2022-04-25T07:22:45.853037Z 6 [Note] [MY-010454] [Server] A temporary password is generated for root@localhost: Tu9&xtetlynf
2022-04-25T07:22:52.474308Z 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.28) starting as process 52262
2022-04-25T07:22:52.490859Z 1 [System] [MY-013576] [InnoDB] InnoDB initialization has started.
2022-04-25T07:22:52.673778Z 1 [System] [MY-013577] [InnoDB] InnoDB initialization has ended.
2022-04-25T07:22:53.295166Z 0 [Warning] [MY-010068] [Server] CA certificate ca.pem is self signed.
2022-04-25T07:22:53.295239Z 0 [System] [MY-013602] [Server] Channel mysql_main configured to support TLS. Encrypted connections are now supported for this channel.
2022-04-25T07:22:53.371308Z 0 [System] [MY-011323] [Server] X Plugin ready for connections. Bind-address: '::' port: 33060, socket: /var/run/mysqld/mysqlx.sock
2022-04-25T07:22:53.371379Z 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.28' socket: '/var/lib/mysql/mysql.sock' port: 3306 MySQL Community Server - GPL.

There we can find the initial password: Tu9&xtetlynf

Log in with the command mysql -u root -p
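Rather than scanning the log by eye, the password line can be pulled out with grep. A minimal sketch; it runs against a sample log line copied from the output above so it is self-contained, but on a real host you would grep /var/log/mysqld.log directly:

```shell
# On a real host: grep 'temporary password' /var/log/mysqld.log
# Here we grep a sample line, so the sketch runs anywhere; the password
# is the last whitespace-separated field of the line.
log=$(mktemp)
cat > "$log" <<'EOF'
2022-04-25T07:22:45.853037Z 6 [Note] [MY-010454] [Server] A temporary password is generated for root@localhost: Tu9&xtetlynf
EOF
password=$(grep 'temporary password' "$log" | awk '{print $NF}')
echo "$password"    # Tu9&xtetlynf
rm -f "$log"
```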

Relax the Password Policy

The default password policy is quite strict. To make things easier, lower the policy level:

set global validate_password.policy=0;
set global validate_password.length=1;

But here I hit another error:

mysql> set global validate_password.policy=0;
ERROR 1820 (HY000): You must reset your password using ALTER USER statement before executing this statement.

On first login, any statement triggers a prompt to reset the password, and changing the password policy itself also requires resetting the password first. So we must first set a password that satisfies the strict default policy; here I simply changed the last two characters of the initial password:

alter user user() identified by "Tu9&xtetly00";

Then re-run the two set global statements above to lower the policy level.

Now a simpler password can be set:

alter user user() identified by "123456";

Remote Access

Allow the root user to connect from external IP addresses. From another machine you can then verify with mysql -h master -u root -p, provided port 3306 is reachable.

use mysql;
-- update the host field for root
update user set host = '%' where user = 'root';
-- check the result
select host, user from user;
-- flush privileges to take effect
flush privileges;
mysql> select host, user from user;
+-----------+------------------+
| host      | user             |
+-----------+------------------+
| %         | root             |
| localhost | mysql.infoschema |
| localhost | mysql.session    |
| localhost | mysql.sys        |
+-----------+------------------+
4 rows in set (0.01 sec)

Install Hive

Download Hive

The download page:

Apache Downloads

I downloaded Hive 3.1.3:

apache-hive-3.1.3-bin.tar.gz

wget https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz

Extract Hive

tar -zxvf apache-hive-3.1.3-bin.tar.gz
mv apache-hive-3.1.3-bin /opt

Edit the Hive Configuration Files

Go into the conf directory under the Hive installation directory.

Edit hive-env.sh

Copy it from the template:

cp hive-env.sh.template hive-env.sh

Add one line:

export HADOOP_HOME=/opt/hadoop-3.1.4

Create hive-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- Hive scratch (temporary) directory -->
<property>
<name>hive.exec.scratchdir</name>
<value>hdfs://master:8020/user/hive/tmp</value>
</property>

<!-- Hive warehouse directory -->
<property>
<name>hive.metastore.warehouse.dir</name>
<value>hdfs://master:8020/user/hive/warehouse</value>
</property>

<!-- Hive query log directory -->
<property>
<name>hive.querylog.location</name>
<value>hdfs://master:8020/user/hive/log</value>
</property>


<!-- MySQL connection; note that & in XML values must be escaped as &amp; -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://master:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false&amp;allowPublicKeyRetrieval=true</value>
</property>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
</configuration>

Configure the MySQL JDBC Driver

Download the MySQL JDBC driver:

MySQL :: Download Connector/J

Extract it to get mysql-connector-java-8.0.28.jar.

Copy it into Hive's lib directory:

cp mysql-connector-java-8.0.28.jar /opt/apache-hive-3.1.3-bin/lib/

Configure Environment Variables

Edit /etc/profile and append:

export HIVE_HOME=/opt/apache-hive-3.1.3-bin
export PATH=$PATH:$HIVE_HOME/bin

Then run source /etc/profile to apply the changes.

Align Hive's Guava Library with Hadoop's

Remove the Guava jar bundled with Hive:

rm -f $HIVE_HOME/lib/guava-19.0.jar

Copy Hadoop's Guava jar into Hive:

cp $HADOOP_HOME/share/hadoop/common/lib/guava-27.0-jre.jar $HIVE_HOME/lib/
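When the two Guava versions differ, schematool typically fails with java.lang.NoSuchMethodError on com.google.common.base.Preconditions, so it is worth confirming the two lib directories now agree. A minimal sketch of that check, using temporary directories with dummy jar files in place of the real $HIVE_HOME/lib and $HADOOP_HOME/share/hadoop/common/lib so it can run anywhere:

```shell
# Compare the guava jar names in two lib directories; a mismatch is the
# classic cause of NoSuchMethodError from schematool. Temp dirs with
# empty placeholder files stand in for the real Hive/Hadoop lib dirs.
hive_lib=$(mktemp -d)
hadoop_lib=$(mktemp -d)
touch "$hive_lib/guava-19.0.jar" "$hadoop_lib/guava-27.0-jre.jar"

hive_guava=$(basename "$hive_lib"/guava-*.jar)
hadoop_guava=$(basename "$hadoop_lib"/guava-*.jar)
if [ "$hive_guava" != "$hadoop_guava" ]; then
  echo "guava mismatch: hive=$hive_guava hadoop=$hadoop_guava"
fi
rm -rf "$hive_lib" "$hadoop_lib"
```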

Initialize the Metastore Schema

schematool -dbType mysql -initSchema

Initialization script completed
schemaTool completed

Seeing these two lines means the metastore schema was initialized successfully. You can double-check with mysql -u root -p -e 'show tables from hive;', which should list the metastore tables.

Start the Metastore Service

hive --service metastore &

Run Hive

Now we can enter the Hive shell:

hive

Pitfalls

Every command I ran was interleaved with a flood of log messages.

I referred to this article:

On excessive log output in the hive shell (xueai8.com)

However, the HBase 2.4.11 I am using has already removed log4j in favor of reload4j, so the same trick applies here: rename the reload4j jar's extension so that it is no longer loaded.

Now the shell output is clean.

Word Count with Hive

Create a File

Create a local file /root/test.txt with the following content:

I am a student
I learn hadoop
I learn MapReduce

Create the Database and Table

Note that, because we never run use test;, the table actually ends up in the default database; the default.words in the load output below confirms this.

create database test;
create table words(text string);

Load and View the Data

load data local inpath '/root/test.txt' overwrite into table words;
select * from words;

The output is as follows:

hive> load data local inpath '/root/test.txt' overwrite into table words;
Loading data to table default.words
OK
Time taken: 1.828 seconds

hive> select * from words;
OK
I am a student
I learn hadoop
I learn MapReduce
Time taken: 3.729 seconds, Fetched: 3 row(s)

Word Count

SELECT word, count(word) AS wordcount
FROM (SELECT explode(split(text, ' ')) AS word
      FROM words) AS tmp
GROUP BY word;

The output is as follows:

hive> SELECT word, count(word) as wordcount
> FROM(SELECT explode (split(text, ' ')) as word
> FROM words) as tmp
> GROUP BY word;

Query ID = root_20220425171959_28a278a7-511b-46c7-bcbd-babc03c6acce
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1650876267702_0001, Tracking URL = http://master:8088/proxy/application_1650876267702_0001/
Kill Command = /opt/hadoop-3.1.4/bin/mapred job -kill job_1650876267702_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-04-25 17:20:22,861 Stage-1 map = 0%, reduce = 0%
2022-04-25 17:20:39,686 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.0 sec
2022-04-25 17:20:51,175 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 18.24 sec
MapReduce Total cumulative CPU time: 18 seconds 240 msec
Ended Job = job_1650876267702_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 18.24 sec HDFS Read: 9102 HDFS Write: 223 SUCCESS
Total MapReduce CPU Time Spent: 18 seconds 240 msec

OK
I 3
MapReduce 1
a 1
am 1
hadoop 1
learn 2
student 1
Time taken: 53.971 seconds, Fetched: 7 row(s)
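As a sanity check, the same counts can be reproduced locally with standard Unix tools: tr plays the role of split plus explode (one word per line), and sort | uniq -c plays GROUP BY with count:

```shell
# Recreate the test file, split each line on spaces so every word sits on
# its own line, then group identical words and count them.
cat > /tmp/test.txt <<'EOF'
I am a student
I learn hadoop
I learn MapReduce
EOF
tr ' ' '\n' < /tmp/test.txt | sort | uniq -c
```

The counts (I 3, learn 2, and 1 for each remaining word) should match the Hive job's output above.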