Compiling the Spark 2.2.0 Source on CentOS 7.2

Compiling the Spark 2.2.0 Source

Reference: Compiling the Spark 2.1.0 Source on CentOS 6.4

Reference: Spark 2.2.0 Download, Installation, and Source Compilation

Requirements Analysis


In real-world work, the prebuilt packages offered on the official Spark website often cannot meet our needs, because differences in the target environment lead to problems. So we must compile Spark from source against our actual environment before using it.


The official Spark documentation describes the build requirements as follows:

The Maven-based build is the build of reference for Apache Spark. Building Spark using Maven requires Maven 3.3.9 or newer and Java 8+. Note that support for Java 7 was removed as of Spark 2.2.0.

From this we conclude that compiling Spark 2.2.0 requires Maven 3.3.9 or newer and Java 8+.

Prerequisites

  1. Install Java 8

  2. Install Maven 3.3.9

After downloading Maven, set up its environment variables and configure Maven's memory usage by adding the following line to your environment variables:

export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

  3. Install Scala

  4. Install git: run the command sudo yum install git
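Before starting a long build, it is worth confirming the installed tools meet the documented floors (Maven 3.3.9+, Java 8+). A minimal sketch — the `version_ge` helper is hypothetical, not part of Spark, and relies on GNU `sort -V`:

```shell
# Hypothetical helper: compare dotted version strings using GNU sort -V.
# True (exit 0) when the first argument is >= the second.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example checks against the floors named in the Spark docs
version_ge "3.3.9" "3.3.9" && echo "maven version ok"
version_ge "1.7.0" "1.8.0" || echo "java too old"
```

In practice you would feed it the output of `mvn -version` and `java -version` instead of literals.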

Downloading the Spark 2.2.0 Source

Spark download address

Extraction

Transfer the downloaded spark-2.2.0.tgz to the /abs/software directory via Xftp.

Extract spark-2.2.0.tgz:

# Extract the Spark source package into /abs/app/

tar -zxvf spark-2.2.0.tgz -C /abs/app/

The extracted directory structure is shown below:

Configuration

Add the CDH maven repository to pom.xml:

<repository>
  <id>cloudera</id>
  <name>cloudera Repository</name>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
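For context, this entry belongs inside the `<repositories>` element of the top-level `/abs/app/spark-2.2.0/pom.xml`, next to the repository entries Spark already declares — a sketch of the placement (the `...` stands for the existing entries):

```xml
<repositories>
  ...
  <repository>
    <id>cloudera</id>
    <name>cloudera Repository</name>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>
```

Without this repository, Maven cannot resolve the `2.6.0-cdh5.7.0` Hadoop artifacts used later, since CDH builds are not published to Maven Central.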

Modify the /abs/app/spark-2.2.0/dev/make-distribution.sh file:

# Comment out the following lines:

#VERSION=$("$MVN" help:evaluate -Dexpression=project.version $@ 2>/dev/null | grep -v "INFO" | tail -n 1)
#SCALA_VERSION=$("$MVN" help:evaluate -Dexpression=scala.binary.version $@ 2>/dev/null\
# | grep -v "INFO"\
# | tail -n 1)
#SPARK_HADOOP_VERSION=$("$MVN" help:evaluate -Dexpression=hadoop.version $@ 2>/dev/null\
# | grep -v "INFO"\
# | tail -n 1)
#SPARK_HIVE=$("$MVN" help:evaluate -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null\
# | grep -v "INFO"\
# | fgrep --count "<id>hive</id>";\
# # Reset exit status to 0, otherwise the script stops here if the last grep finds nothing\
# # because we use "set -o pipefail"
# echo -n)

# Add the following lines:

VERSION=2.2.0
SCALA_VERSION=2.11
SPARK_HADOOP_VERSION=2.6.0-cdh5.7.0
SPARK_HIVE=1
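Hard-coding these four variables skips several slow `mvn help:evaluate` invocations at the start of every build. The edit can also be scripted; a sketch using `sed`, demonstrated on a minimal stand-in file (substitute `/abs/app/spark-2.2.0/dev/make-distribution.sh` to apply it for real):

```shell
# Stand-in for dev/make-distribution.sh, holding one of the slow lookups
work=$(mktemp -d)
cat > "$work/make-distribution.sh" <<'EOF'
VERSION=$("$MVN" help:evaluate -Dexpression=project.version $@ 2>/dev/null | grep -v "INFO" | tail -n 1)
EOF

# Comment out the mvn lookup, then append the hard-coded values
sed -i 's/^VERSION=/#VERSION=/' "$work/make-distribution.sh"
cat >> "$work/make-distribution.sh" <<'EOF'
VERSION=2.2.0
SCALA_VERSION=2.11
SPARK_HADOOP_VERSION=2.6.0-cdh5.7.0
SPARK_HIVE=1
EOF
```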

Compilation

See the source-compilation section of the official Spark documentation.

We can use the make-distribution.sh script under dev in the Spark source directory. The build command given in the official documentation is:

./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn

We can tailor the build to our actual conditions. For example, our Hadoop version is 2.6.0-cdh5.7.0, and we need Spark to run on YARN and support Hive, so our build command becomes:

./dev/make-distribution.sh  --name 2.6.0-cdh5.7.0  --tgz  -Dhadoop.version=2.6.0-cdh5.7.0  -Phadoop-2.6  -Phive -Phive-thriftserver -Pyarn
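The `--name` argument, together with the `VERSION` hard-coded earlier, determines the output file name: `make-distribution.sh` names the `--tgz` artifact `spark-$VERSION-bin-$NAME.tgz`. A quick sketch of the rule:

```shell
# The naming rule make-distribution.sh applies to its --tgz output:
# VERSION was hard-coded in the script, NAME comes from the --name flag.
VERSION=2.2.0
NAME=2.6.0-cdh5.7.0
TGZ="spark-${VERSION}-bin-${NAME}.tgz"
echo "$TGZ"   # -> spark-2.2.0-bin-2.6.0-cdh5.7.0.tgz
```

This is why the artifact produced below is named spark-2.2.0-bin-2.6.0-cdh5.7.0.tgz.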

Console output from a successful build:

main:
[INFO] Executed tasks
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 26.000 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 18.265 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 20.524 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 35.170 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 20.208 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 29.250 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 32.780 s]
[INFO] Spark Project Core ................................. SUCCESS [07:07 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [ 50.509 s]
[INFO] Spark Project GraphX ............................... SUCCESS [ 42.277 s]
[INFO] Spark Project Streaming ............................ SUCCESS [03:55 min]
[INFO] Spark Project Catalyst ............................. SUCCESS [03:49 min]
[INFO] Spark Project SQL .................................. SUCCESS [05:04 min]
[INFO] Spark Project ML Library ........................... SUCCESS [03:16 min]
[INFO] Spark Project Tools ................................ SUCCESS [ 10.737 s]
[INFO] Spark Project Hive ................................. SUCCESS [02:04 min]
[INFO] Spark Project REPL ................................. SUCCESS [ 12.745 s]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [ 27.419 s]
[INFO] Spark Project YARN ................................. SUCCESS [ 29.455 s]
[INFO] Spark Project Hive Thrift Server ................... SUCCESS [01:28 min]
[INFO] Spark Project Assembly ............................. SUCCESS [ 15.889 s]
[INFO] Spark Project External Flume Sink .................. SUCCESS [ 24.513 s]
[INFO] Spark Project External Flume ....................... SUCCESS [ 26.800 s]
[INFO] Spark Project External Flume Assembly .............. SUCCESS [ 5.247 s]
[INFO] Spark Integration for Kafka 0.8 .................... SUCCESS [ 28.671 s]
[INFO] Kafka 0.10 Source for Structured Streaming ......... SUCCESS [ 26.704 s]
[INFO] Spark Project Examples ............................. SUCCESS [01:44 min]
[INFO] Spark Project External Kafka Assembly .............. SUCCESS [ 6.346 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 31.861 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [ 5.706 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 37:22 min
[INFO] Finished at: 2018-03-08T22:01:29+08:00
[INFO] Final Memory: 91M/440M
[INFO] ------------------------------------------------------------------------

You can now see the generated spark-2.2.0-bin-2.6.0-cdh5.7.0.tgz in the /abs/app/spark-2.2.0 directory.

Move spark-2.2.0-bin-2.6.0-cdh5.7.0.tgz to /abs/software/:

mv spark-2.2.0-bin-2.6.0-cdh5.7.0.tgz /abs/software/
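Before distributing the tarball, it is worth confirming it is intact: `tar -tzf` lists an archive's contents without extracting and fails on a truncated or corrupt file. A sketch, demonstrated on a throwaway archive (substitute `/abs/software/spark-2.2.0-bin-2.6.0-cdh5.7.0.tgz` to check the real one):

```shell
# Build a tiny throwaway .tgz, then verify it the same way you would
# verify the real /abs/software/spark-2.2.0-bin-2.6.0-cdh5.7.0.tgz
work=$(mktemp -d)
echo demo > "$work/file.txt"
tar -czf "$work/demo.tgz" -C "$work" file.txt

if tar -tzf "$work/demo.tgz" > /dev/null 2>&1; then
  result="archive OK"
else
  result="archive CORRUPT"
fi
echo "$result"   # -> archive OK
```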

Summary

Compiling the Spark source is not as simple as it might seem; I am grateful to those who solved these problems before me and published their solutions on the internet.

I also hope someone can take away a problem-solving approach from this article.
