Step 1: Create a database

First, use the show databases command to list the databases that already exist. Then create a new database with the CREATE DATABASE command; in this lab it is named "demo" followed by the last 4 digits of your student ID.

show databases;

CREATE DATABASE IF NOT EXISTS demo0740;
hive> CREATE DATABASE IF NOT EXISTS demo0740;
OK
Time taken: 0.074 seconds

hive> show databases;
OK
default
demo0740
test
Time taken: 0.038 seconds, Fetched: 3 row(s)

Step 2: Use the created database

Use the USE command to make the demo database you just created the current database.

use demo0740;
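
To confirm the switch took effect, you can ask Hive for the current database, or make the CLI prompt display it; both of the following are standard Hive features and are an optional check, not part of the assignment:

-- show the name of the database the session is currently using
SELECT current_database();
-- or make the prompt itself show it, e.g. hive (demo0740)>
set hive.cli.print.current.db=true;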

Step 3: Create a table

Create a table named users with the CREATE TABLE command. Once it is created, check the result with DESCRIBE <table name> to make sure the table was built successfully (a sample check follows the CREATE statement below). The table columns are:

id: int, record number, unique

uid: string, user ID

item_id: string, item ID

behavior_type: int, the user action: browse, favorite, add to cart, or purchase, encoded as 1, 2, 3, 4 respectively

item_category: string, item category

visit_date: date, the time the record was generated

province: string, the user's province

CREATE TABLE IF NOT EXISTS users(
id int,
uid string,
item_id string,
behavior_type int,
item_category string,
visit_date date,
province string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
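
As noted above, verify the result with DESCRIBE, which lists every column with its type; DESCRIBE FORMATTED additionally shows storage details such as the table location and SerDe:

DESCRIBE users;
-- DESCRIBE FORMATTED users;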

Step 4: Load data into the table.

Following the layout of the attached sample (users_table.txt, fields separated by tabs), write more rows of txt data yourself, then load the file with load data local inpath '<path>'. Note: LOCAL means the data is read from a file on the local file system rather than from HDFS (a non-LOCAL variant is sketched after the output below).

Contents of users_table.txt

1   u001   w12340   1   c01   2019-11-10   shanghai
2   u001   w98881   4   c01   2019-11-11   shanghai
3   u001   w12444   4   c02   2019-11-11   shanghai
4   u002   w12340   4   c01   2019-11-11   hangzhou
5   u002   w86658   4   c05   2019-11-11   hangzhou
6   u003   w87658   3   c02   2019-11-10   nanjing
7   u003   w88667   4   c01   2019-11-11   nanjing
8   u004   w77668   2   c03   2019-11-09   nanjing
9   u004   w12334   1   c03   2019-11-09   nanjing
10  u005   w12324   4   c04   2019-11-11   shanghai
load data local inpath "/root/users_table.txt" into table users;

Output

hive> load data local inpath "/root/users_table.txt" into table users;
Loading data to table demo0740.users
OK
Time taken: 0.48 seconds

hive> select * from users;
OK
1 u001 w12340 1 c01 2019-11-10 shanghai
2 u001 w98881 4 c01 2019-11-11 shanghai
3 u001 w12444 4 c02 2019-11-11 shanghai
4 u002 w12340 4 c01 2019-11-11 hangzhou
5 u002 w86658 4 c05 2019-11-11 hangzhou
6 u003 w87658 3 c02 2019-11-10 nanjing
7 u003 w88667 4 c01 2019-11-11 nanjing
8 u004 w77668 2 c03 2019-11-09 nanjing
9 u004 w12334 1 c03 2019-11-09 nanjing
10 u005 w12324 4 c04 2019-11-11 shanghai
Time taken: 0.222 seconds, Fetched: 10 row(s)
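
For comparison, omitting LOCAL makes Hive read the path from HDFS (the file is moved into the table's warehouse directory), and adding OVERWRITE replaces the existing rows instead of appending. A sketch, assuming the file was first uploaded to a hypothetical HDFS path /data/users_table.txt:

-- /data/users_table.txt is a hypothetical HDFS path, not part of this lab
load data inpath '/data/users_table.txt' overwrite into table users;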

Step 5: Retrieve data from the table.

(1) Retrieve the records whose visit_date is '2019-11-11'. (filtering on a given condition)

SELECT * FROM users
WHERE visit_date = '2019-11-11';

Output

hive> SELECT * FROM users
> WHERE visit_date = '2019-11-11';
OK
2 u001 w98881 4 c01 2019-11-11 shanghai
3 u001 w12444 4 c02 2019-11-11 shanghai
4 u002 w12340 4 c01 2019-11-11 hangzhou
5 u002 w86658 4 c05 2019-11-11 hangzhou
7 u003 w88667 4 c01 2019-11-11 nanjing
10 u005 w12324 4 c04 2019-11-11 shanghai
Time taken: 0.402 seconds, Fetched: 6 row(s)
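
Note that visit_date is declared as date while '2019-11-11' is a string literal; Hive converts it implicitly, but the comparison can be made explicit with a cast or a date literal (both standard HiveQL, equivalent to the query above):

SELECT * FROM users
WHERE visit_date = CAST('2019-11-11' AS DATE);
-- or: WHERE visit_date = DATE '2019-11-11'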

(2) Retrieve the item category and user province of the first 5 purchase records with visit_date '2019-11-11'. (a given condition plus a bound on the number of rows, via LIMIT; see the note on behavior_type after the output)

SELECT item_category, province FROM users
WHERE visit_date = '2019-11-11'
LIMIT 5;

Output

hive> SELECT item_category, province FROM users
> WHERE visit_date = '2019-11-11'
> LIMIT 5;

OK
c01 shanghai
c02 shanghai
c01 hangzhou
c05 hangzhou
c01 nanjing
Time taken: 0.236 seconds, Fetched: 5 row(s)
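
Strictly speaking, "purchase" corresponds to behavior_type = 4. In this sample every record dated 2019-11-11 happens to be a purchase, so the query above returns the same rows, but on other data the extra filter matters; a stricter version:

SELECT item_category, province FROM users
WHERE visit_date = '2019-11-11' AND behavior_type = 4
LIMIT 5;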

(3) Use the aggregate function count() to compute how many rows the table contains.

SELECT count(*) FROM users;

Output

hive> SELECT count(*) FROM users;

Query ID = root_20220425184634_6ea834ce-d99a-42cd-9741-ffa1188ec3b9
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1650876267702_0002, Tracking URL = http://master:8088/proxy/application_1650876267702_0002/
Kill Command = /opt/hadoop-3.1.4/bin/mapred job -kill job_1650876267702_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-04-25 18:46:49,277 Stage-1 map = 0%, reduce = 0%
2022-04-25 18:46:58,650 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.21 sec
2022-04-25 18:47:07,972 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.2 sec
MapReduce Total cumulative CPU time: 8 seconds 200 msec
Ended Job = job_1650876267702_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 8.2 sec HDFS Read: 13756 HDFS Write: 102 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 200 msec

OK
10
Time taken: 34.846 seconds, Fetched: 1 row(s)

As shown, the result is 10 rows.
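
count() also combines with GROUP BY when a per-group total is wanted instead of a single number; as a small optional extension of the same aggregate (not part of the assignment), counting records per province would look like:

SELECT province, count(*) FROM users GROUP BY province;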

(4) Count how many distinct uid values the table contains. (DISTINCT)

SELECT count(DISTINCT uid) FROM users;

Output

hive> SELECT count(DISTINCT uid) FROM users;

Query ID = root_20220425184757_c1464142-e985-47f7-82d1-d078b3c69eba
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1650876267702_0003, Tracking URL = http://master:8088/proxy/application_1650876267702_0003/
Kill Command = /opt/hadoop-3.1.4/bin/mapred job -kill job_1650876267702_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-04-25 18:48:12,757 Stage-1 map = 0%, reduce = 0%
2022-04-25 18:48:22,144 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.75 sec
2022-04-25 18:48:36,668 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.37 sec
MapReduce Total cumulative CPU time: 8 seconds 370 msec
Ended Job = job_1650876267702_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 8.37 sec HDFS Read: 9748 HDFS Write: 101 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 370 msec

OK
5
Time taken: 41.043 seconds, Fetched: 1 row(s)
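
A note on performance: with a plain count(DISTINCT uid), the deduplication is typically handled by a single reducer, which can become a bottleneck on large tables. An equivalent two-stage rewrite that spreads the work across reducers (it returns the same result, 5):

SELECT count(*) FROM (
  SELECT uid FROM users GROUP BY uid
) t;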