Hive 基础篇

Hive 基本概念

Hive 简介

官方介绍
Apache Hive™ 数据仓库软件有助于使用 SQL 读取、编写和管理驻留在分布式存储中的大型数据集。

可以将结构投影到已存储的数据中。提供了命令行工具和 JDBC 驱动程序以将用户连接到 Hive。

通俗理解
Hive 就是在 Hadoop 上架了一层 SQL 接口，可以将 SQL 翻译成 MapReduce 去 Hadoop 上执行，这样就使得

数据开发和分析人员很方便的使用 SQL 来完成海量数据的统计和分析，而不必使用编程语言开 MapReduce 那么麻烦。

Hive 数据类型

正常数据类型

# 整型，4个字节
    int

# 大整型，8个字节
    bigint

# 字符串，最大长度2G
    String

复杂数据类型

# 数组，相同类型元素的数组 array<String>
    array => ['tom','jerry']

# 结构体，事先定义好字段名称 struct<column int, fruit string>
    struct => {1,'apple'}

# K-V对的集合，事先定义好k-v类型 map<int,string>
    map => {1:'apple'}

# 联合类型，此例子是 map 和 array 的嵌套
    unoin => {'jack':['tom','jerry']}

Hive 应用场景

Hive 适用于离线数据的分析，操作的是通用格式的（如通用的日志文件）、被 Hadoop 

管理的数据文件，它支持类 sql，比编写 MapReduce 的 java 代码来的更加方便，

它的定位是数据仓库，存储和分析历史数据。

Hive 安装与配置

安装前的准备

1	将 Hive 相关软件下的文件或文件夹均拷贝到 /home/centos 下

安装 MySQL

# 0.说明
    本地安装 MySQL ，仓库镜像与安装文件皆为下载好的

# 1.安装 MySQL 的仓库镜像
    sudo rpm -ivh mysql-community-release-el7-5.noarch.rpm
			
# 2.安装mysql
    cd mysql
    sudo yum -y localinstall *

# 3.启动并修改配置mysql密码
    centos> systemctl start mysqld
    centos> systemctl enable mysqld
    # 进入mysql中
    centos> mysql -uroot

    # 设置mysql密码为root
    mysql> update mysql.user set password=password('root');
    # 刷新权限列表
    mysql> flush privileges;

# 4.退出mysql并重新进入
    mysql> exit
    centos> mysql -uroot -proot

安装 Hive

# 1.解压
    tar -xzvf apache-hive-2.1.1-bin.tar.gz -C /soft/

# 2.创建符号链接
    cd /soft/
    ln -s apache-hive-2.1.1-bin/ hive

# 3.环境变量
    # hive环境变量
    export HIVE_HOME=/soft/hive
    export PATH=$PATH:$HIVE_HOME/bin

# 4.生效环境变量
    source /etc/profile

# 5.修改hive配置文件
    # 重命名所有的 template 文件后缀去掉
    rename '.template' '' *.template

    # 修改 hive-env.sh 文件，添加
    HADOOP_HOME=/soft/hadoop

    # 重命名 hive-default.xml 文件为 hive-site.xml
    mv hive-default.xml hive-site.xml

    # 修改hive-site.xml		//替换自己的hive-site.xml
    <!-- 连接MySQL的JDBC driver -->
    <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
    </property>

    <!-- 连接MySQL的密码 -->
    <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
    <description>password to use against metastore database</description>
    </property>

    <!-- 连接MySQL的url -->
    <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://s101:3306/hive</value>
    </property>

	# 修改 hive-site.xml 中的${system:user.name}
        # 和${system:java.io.tmpdir}	//可选
        sed -i 's@${system:user.name}@centos@g' hive-site.xml
        sed -i 's@${system:java.io.tmpdir}@/home/centos/hive@g' hive-site.xml

        # 拷贝mysql驱动到hive下
        cp ~/mysql-connector-java-5.1.44.jar /soft/hive/lib/

        # 在mysql中创建数据库hive
        create database hive;
				
        # 初始化元数据库
        schematool -initSchema -dbType mysql

体验 Hive

1 2	# 输入 hive 进入命令行 hive

启动 Hive 的顺序

启动 ZooKeeper

1	xzk.sh start

xzk.sh 脚本内容

启动 Hadoop(HDFS+MR)

1	start-all.sh

启动 Hive

hive

Hive 2代服务

Hive 的 jdbc 接口，用户可以连接此端口来连接 Hive 服务器

# 先启动
    hiveserver2

# 使用新一代客户端 beeline 连接 hiveserer2
    beeline -u jdbc:hive2://localhost:10000

Hive 与 Hadoop 的关系

关系

Hive构建在 Hadoop 之上
HQL 中对查询语句的解释、优化、生成查询计划是由 Hive 完成的
所有的数据都是存储在 HDFS 中
查询计划被转化为 MapReduce 任务，在 Hadoop 中执行

（有些查询没有 MR 任务，如：select * from table）
Hadoop 和 Hive 都是用 UTF-8 编码的

Hive 语句和 MR 执行过程对应

Hive 与传统数据库对比

对比1

SQL => struct query language

HQL => hive query language  类sql，和sql语句差别不大

no sql => not only sql     不仅仅是sql，和sql语句差距较大

对比2

传统数据库 OLTP

online transaction process 在线事务处理

# 事务：
    ACID
    A：atomic	        原子性
    C: consistent	一致性
    I：isolation	隔离性
    D: durability	持久性

# RDBMS(关系型数据库管理系统)
    # 范式
        # 第一范式
        无重复的列,一列只能包含一个字段
        
        # 第二范式
        主键约束，一行只能被唯一标识

        # 第三范式
        非主键字段要严格依赖于主键字段

Hive

OLAP：online analyze process 在线分析处理

Hive数据仓库：反范式设计，允许甚至推荐冗余

提供类似SQL的查询语言，我们称之为HiveQL或者HQL

熟悉的，快速的，可伸缩的，可扩展的

Hive 的数据存储机制

Hive 中的表为文件夹

实际数据为文件

Hive 基本操作

Hive 建表指定分隔符

Demo

# 创建表 user2 ，指定行格式化分隔符中的字段分隔符为制表位 \t
    create table users2 (id int, name string, age int)
    row format delimited fields terminated by '\t';

# 向表中插入数据
    insert into users2 values(1,'tom',18);

分隔符类型

# 字段分隔符
    fields terminated by '\t'

# array 类型成员分隔符
    collection items terminated by ','

# map 的 key 和 value 之间的分隔符
    map keys terminated by ':'

# 行分隔符,必须放在最后
    lines terminated by '\n'

Hive 插入数据

Hive 插入数据的方法

Hive 插入数据不是使用 insert，而是 load

Hive 插入数据的顺序

# 1.先定义好表结构
    create table employee(name string,
    work_place array<string>,
    sex_age struct<sex:string, age:int>,
    score map<string, int>,
    depart_title map<string, string>)
    row format delimited
    fields terminated by '|'  
    collection items terminated by ',' 
    map keys terminated by ':'
    lines terminated by '\n' 
    stored as textfile;

# 2.准备数据。数据格式要和表结构对应 employee.txt
Michael|Montreal,Toronto|Male,30|DB:80|Product:Developer
Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Shelley|New York|Female,27|Python:80|Test:Lead,COE:Architect
Lucy|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead
	
# 3.空表中使用load命令加载数据
    load data local inpath '/home/centos/files/employee.txt' into table employee;

# 4.取出所有的成员
    # array获取
    select name ,work_place[0] from employee;
    # 结构体获取
    select name ,sex_age.sex from employee;
    # map成员获取
    select name, score['Python'] from employee;

插入复杂类型数据 insert

# 设置显示表头,临时修改，永久修改需要修改配置文件 hive-site.xml
    set hive.cli.print.header=true;

# 插入复杂类型使用转储：insert xxx select xxx
    # 通过 select 语句构造出 array 类型
    select array('tom','tomas','tomson') ;
    insert into employee(name,work_place)
    select 'tom',array('beijing','shanghai','guangzhou');

    # 通过select语句构造出map类型
    select map('bigdata',100);
    insert into employee(name,score)
    select 'tomas', map('bigdata',100);

    # 通过select语句构造出struct类型
    select struct('male',10);
    select named_struct('sex','male','age',10);
    insert into employee(name,sex_age)
    select 'tomson',named_struct('sex','male','age',10);

load命令详解

前提: 先建表

create table duowan(id int, name string, pass string, mail string, nickname string)
row format delimited
fields terminated by '\t' 
lines terminated by '\n' 
stored as textfile;

使用 load

# 1.load 本地数据,相当于上传或者复制，源文件不变
    load data local inpath '/home/centos/files/employee.txt' into table employee;

# 2.load hdfs 数据,相当于移动
    load data inpath '/duowan_user.txt' into table duowan;

# 3.load本地数据 + 覆盖原始数据
    load data local inpath '/home/centos/files/employee.txt' overwrite into table employee;

# 4.load hdfs数据 + 覆盖原始数据
    load data inpath '/duowan_user.txt' overwrite into table duowan;

JDBC 连接 Hive

报错

# 错误
    java.lang.RuntimeException: 
    org.apache.hadoop.ipc.RemoteException:
    User: centos is not allowed to impersonate anonymous

# 修改 hive-site.xml 配置文件禁用连接验证
    hive.server2.enable.doAs=false

Demo

本地 navicat 连接服务器中的 MySQL

主要目的是查看 Hive 的元数据

报错

# 错误
    host 192.168.10.1 is not allowed to connect to mysql server

# 在服务器端的 MySQL 中输入
    grant all privileges on *.* to 'root'@'192.168.10.1' identified by 'root';
    flush privileges;

在 Hive 中执行 shell 语句

1 2	# 在hive命令行执行shell语句 !sh ls -l /home/centos

在 Hive 中执行 HDFS 语句

# 在hive命令行执行HDFS语句
    dfs -ls /;

# 在 Hive 中执行 HDFS 相关操作速度快的原因是
# Hive 在启动时已经开启了 jvm 虚拟机
# 同时这也是 Hive 占内存的原因

Hive 中的 DDL 操作

说明

DDL（Data Definition Languages）语句：数据定义语言

这些语句定义了不同的数据段、数据库、表、列、索引等数据库对象的定义。

常用的语句关键字主要包括 create、drop、alter等。

alter

# 1.修改表名
    alter table wc rename to wc3;	

# 2.添加列
    alter table customers add columns(wife string);

# 3.删除列，不能用
    alter table customers drop column wife;

# 4.替换所有列,修改字段名
    alter table customers replace columns(id int, name string, age int);

# 5.将id列改名为no，类型为string
     alter table customers change column id no string;

实现高效的 JOIN 查询

说明

在 Hive 中如何实现高效的 JOIN 查询

join 最好将大表放在右侧，避免内存溢出异常

普通 join 操作

说明

# 返回行数 a ∩ b
    a inner join b

# 返回行数 a
    a left [outer] join b

# 返回行数 b
    a right [outer] join b

# 返回行数 a+b - (a ∩ b)
    a full [outer] join b

# 返回行数 a * b
    a cross join b

例子

# 1.创建 customer 表
    create table customers(id int, name string, age int)
    row format delimited
    fields terminated by '\t';

# 2.创建 orders 表
    create table orders(oid int, oname string, oprice float, uid int)
    row format delimited
    fields terminated by '\t';

# 3.创建并加载数据
    load data local inpath '/home/centos/files/customers.txt' into table customers;

    load data local inpath '/home/centos/files/orders.txt' into table orders;

# 4.使用join

    # 内连接
    select a.id, a.name, b.oname, b.oprice
    from customers a inner join orders b
    on a.id=b.uid;

    # 左外连接
    select a.id, a.name, b.oname, b.oprice
    from customers a left outer join orders b
    on a.id=b.uid;

    # 右外连接
    select a.id, a.name, b.oname, b.oprice
    from customers a right outer join orders b
    on a.id=b.uid;

    # 全外连接
    select a.id, a.name, b.oname, b.oprice
    from customers a full outer join orders b
    on a.id=b.uid;

高效的 join 查询

map join

说明

# 1.结构
    小表+大表
    将小表加入到分布式缓存，通过迭代大表所有数据进行处理

# 2.调优

    # 在老版的 hive 中(0.7)之前，所有的 join 操作都是
    # 在 reduce 端执行的(reduce 端join)
    # 想要进行 map 端 join，需要进行以下操作
    SET hive.auto.convert.join=true; 

    #  声明暗示 a join b , a小表，b大表
    # /*+ mapjoin(小表) */
    SELECT /*+ MAPJOIN(a) */ a.id, a.name, b.orderno, b.oprice
    from customers a inner join orders b
    on a.id=b.cid;

    # 在新版hive中，如果想要进行map端join

    # 设置自动转换成map端join
    jdbc:hive2://> SET hive.auto.convert.join=true;	
    # 设置map端join中小表的最大值
    jdbc:hive2://> SET hive.mapjoin.smalltable.filesize=600000000;

common join

说明

# 1.说明
    common join 即 reduce 端 join

# 2.实现的两种方式
    # 声明暗示，指定大表
    /*+ STREAMTABLE(大表) */
    # 将大表放在右侧

测试

# 1.使用以下两个表进行测试
    customers和orders
	
# 2.不写任何暗示，观察是 map 端 join 还是 reduce join
    SELECT  a.no, a.name, b.oname, b.oprice
    from customers a inner join orders b
    on a.no=b.uid;

# 3.写暗示，观察效果
    SELECT /*+ MAPJOIN(a) */ a.no, a.name, b.oname, b.oprice
    from customers a inner join orders b
    on a.no=b.uid;

# 4.将自动转换 map join 设置成 false
    SET hive.auto.convert.join=false;	

# 5.写 reduce 端 join 的暗示，观察结果
    SELECT /*+ STREAMTABLE(a) */ a.no, a.name, b.oname, b.oprice
    from customers a inner join orders b
    on a.no=b.uid;

Hive 的内置函数应用

基本操作

# 查看函数
    show functions;

# 查看函数的用法
    desc function function_name;

时间函数

# 显示当前数据库
    select current_database();

# 显示当前日期
    select current_date();

# 显示当前时间戳(精确到毫秒)
    select current_timestamp();

# 将时间格式化
    select date_format( current_timestamp(), 'yyyy-MM-dd HH:mm:ss');

# 将日期转换成时间戳(精确到秒)
    select unix_timestamp(current_timestamp());

# 将时间戳转换成日期
    select from_unixtime(2378934500000,'yyyy-MM-dd');

# 计算两个指定日期相差多少天
    select datediff('2018-08-08','2017-05-10');

String函数

# 切分，按空格切分成 array 数组 ["hello","word"]
    select split('hello word',' ');

# 截取字符串，从第7位开始，结果为 word
    select substr('hello word', 7);

# 去掉前后的空格
    select trim(' world');           

# 格式化数字
format_number()

    # 1,234.35
    select format_number(1234.345,2);

    # 001234.35
    select format_number(1234.345,'000000.00');

    # 1234.35
    select format_number(1234.345,'########.##');

# 拼串操作，返回hello world
    select concat('hello', ' world');
    
# 查看字符串的长度
    select length('helloworld');

条件语句

# if
# 相当于三元运算符,第一个表达式成立返回第二个表达式,第一个不成立返回第三个表达式。
    select if( age > 10 ,'old', 'young') from user_par;

# case
# 小于20，返回young，小于40，返回middle，否则，返回old
    select case when age<20 then 'young' when age<40 then 'middle' else 'old' end from user_par;

explode

# 1.描述

    separates the elements of array a into multiple rows
    or the elements of a map into multiple rows and columns

    分裂 array 数组的中的元素变成多行

    或者分裂 map 中的元素变成多行和多列
    
    是一个表生成函数

# 2.使用方法
    explode(array)

split

# 1.描述
    Splits str around occurances that match regex

    拆分与正则表达式匹配的事件

# 2.使用方法
    split(str, regex)

substr

# 1.描述
    截取从指定位置开始指定长度的的字符串

    默认从第一位开始截取

# 2.使用方法
    substr(str, pos[, len])

Hive shell的高级使用方式

计算 Top 10

# 创建表
    create table duowan(id int, name string, pass string, mail string, nickname string)
    row format delimited
    fields terminated by '\t'
    lines terminated by '\n'
    stored as textfile;

# 加载数据
    load data inpath '/duowan_user.txt' into table duowan;

# 开始执行
    select pass , count(*) as count
    from duowan
    group by pass
    order by count
    desc
    limit 10; 

# 设置reduce个数
    set mapreduce.job.reduces=2;

WordCount

思路1

# 1.将 array 类型成员分隔符指定为空格
    collection items terminated by ' '

# 2.调用 explode 函数将数组分裂成多行
    explode

实现1

# 1.创建表wc
    create table wc(line array<string>)
    row format delimited
    collection items terminated by ' ';

# 2.加载数据
    load data local inpath '/home/centos/files/1.txt' into table wc;

# 3.查看 explode 之后的效果

    select explode(line) from wc;

# 4.编写sql语句
    select word, count(*) as count
    from (select explode(line) word from wc ) a
    group by word
    order by count
    desc;

思路2

1
2
3

# 1.建立一个表，只有一行字符串

# 1.调用 split 方法

实现2

# 1.创建表wc2	
    create table wc2(line string)
    row format delimited;

# 2.加载数据
    load data local inpath '/home/centos/files/1.txt' into table wc2;

# 3.查看 explode 之后的效果

    select explode(split(line,' '))

    from wc2;

# 4.编写sql语句

    select word , count(*) as count
    from (select explode(split(line,' ')) as word from wc2) a
    group by word
    order by count
    desc;
```		  

### 最高气温统计

``` bash
# 1.创建tmp表
    create table temp(line string);

# 2.加载数据
    load data local inpath '/home/centos/files/temp' into table temp;

# 3.查看截取的年份和温度字段
    select substr(line,16,4) as year, substr(line,88,5) as temp 
    from temp;

# 4.使用 cast 将 string 转换成 int,方法如下
    cast(substr(line,88,5) as int)

# 5.查看截取的年份和温度字段，将温度类型由 string 转换成 int
    select substr(line,16,4) as year,
    cast(substr(line,88,5) as int) as temp 
    from temp;

# 6.获取最高温度，并过滤非法字段9999
    select year, max(temp) from
    (select substr(line,16,4) as year,
    cast(substr(line,88,5) as int) as temp from temp) a
    where temp != 9999
    group by year;

Hive 常用参数配置

Hive 使用 Hadoop 本地模式

1 2	# 1.让 Hive 自动使用 Hadoop 的本地模式运行作业，提升处理性能 set hive.exec.mode.local.auto=true;

Hive 自定义函数和Transform的使用技巧

	
	UDF	//user define function	
		//输入单行，输出单行，类似于 format_number(age,'000')

	UDTF	//user define table-gen function
		//输入单行，输出多行，类似于 explode(array);

	UDAF	//user define aggr function
		//输入多行，输出单行，类似于 sum(xxx)

	

UDF:
	1、继承于UDF
	2、evaluate方法

@Description(
        name = "myudf",
        value = "this is a sum function",
        extended = "eg: select myudf(1,2) => 3 , select myudf(\"hello\",\"world\") => helloworld"
)

public class MyUDF extends UDF {

    public int evaluate(int i , int j){
        return i+j;
    }

    public String evaluate(String a, String b){
        return a+b;
    }
}


	注册hive自定义函数
		1、将hive自定义函数打包并发送到/soft/hive/lib下
		2、重启hive
		3、注册函数
			create function myudf as 'com.oldboy.hive.udf.MyUDF'		//永久函数
			create temporary function myudf as 'com.oldboy.hive.udf.MyUDF'	//临时函数

Hive UDF/UDAF开发实例

Hive 执行过程分析及优化策略

Hive 优化策略

性能工具

explain
对 Hive 语句进行解析，解析为 MR 的各个阶段

analyze
对表进行分析，得出 SQL 语句的优化策略和基本统计

设计优化

分区表

说明

partition，是一个字段，是文件夹

前提：在 Hive 进行 where 子句查询的时候，会将条件语句和全表进行比对，
搜索出所需的数据，性能极差，partition 就是为了避免全表扫描

# 1.创建非分区表 user_nopar
    create table user_nopar
    (id int, name string, age int, province string, city string)
    row format delimited
    fields terminated by '\t';

# 2.加载数据
    load data local inpath '/home/centos/files/user_nopar.txt' into table user_nopar;

# 3.创建分区表 user_par
    create table user_par(id int, name string, age int)
    partitioned by(province string, city string)
    row format delimited
    fields terminated by '\t';

# 4.手动添加分区
    alter table user_par
    add partition(province='beijing',city='beijing');

# 5.将数据加载到指定分区(分区可以不存在)
    load data local inpath '/home/centos/files/customers.txt'
    into table user_par
    partition (province='shanxi',city='taiyuan');

# 6.将表清空
    truncate table user_par;

# 7.设置动态分区非严格模式，无需指定静态分区
    set hive.exec.dynamic.partition.mode=nonstrict;

# 7.插入数据动态指定分区
    insert into user_par
    partition(province,city)
    select * from user_nopar;

分桶表

避免分区文件夹过多

索引

数据文件优化

文件格式

压缩

存储

Hive 基本概念

Hive 简介

Hive 数据类型

Hive 应用场景

Hive 安装与配置

安装前的准备

安装 MySQL

安装 Hive

体验 Hive

启动 Hive 的顺序

Hive 与 Hadoop 的关系

关系

Hive 语句和 MR 执行过程对应

Hive 与传统数据库对比

对比1

对比2

Hive 的数据存储机制

Hive 基本操作

Hive 建表指定分隔符

Hive 插入数据

Hive 插入数据的方法

Hive 插入数据的顺序

插入复杂类型数据 insert

load命令详解

JDBC 连接 Hive

本地 navicat 连接服务器中的 MySQL

在 Hive 中执行 shell 语句

在 Hive 中执行 HDFS 语句

Hive 中的 DDL 操作

说明

alter

实现高效的 JOIN 查询

说明

普通 join 操作

高效的 join 查询

map join

common join

测试

Hive 的内置函数应用

基本操作

时间函数

String函数

条件语句

explode

split

substr

Hive shell的高级使用方式

计算 Top 10

WordCount

Hive 常用参数配置

Hive 使用 Hadoop 本地模式

Hive 自定义函数和Transform的使用技巧

Hive UDF/UDAF开发实例

Hive 执行过程分析及优化策略

Hive 优化策略

性能工具

设计优化

分区表

分桶表

数据文件优化

作业和查询优化