Install Apache Spark on Ubuntu 20.04/18.04 and Debian 10/9

Welcome to our tutorial on how to install Apache Spark on Ubuntu 20.04/18.04 and Debian 10/9.
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework.
It is a fast, unified analytics engine for big data and machine learning workloads.

Spark provides high-level APIs in Java, Scala, Python, and R, as well as an optimized engine that supports general execution graphs.
It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Install Apache Spark on Ubuntu 20.04/18.04 / Debian 10/9

Before installing Apache Spark on Ubuntu/Debian, let's update the system packages.

sudo apt update
sudo apt -y upgrade

Now install Spark on Ubuntu/Debian using the steps shown below.

Step 1: Install Java

Apache Spark requires Java to run, so let's make sure Java is installed on our Ubuntu/Debian system.

For the default system Java:

sudo apt install default-jdk

Verify the Java version with:

java -version

For Java 8 on Ubuntu 18.04:

sudo apt update
sudo add-apt-repository ppa:webupd8team/java
sudo apt update
sudo apt install oracle-java8-installer oracle-java8-set-default

If the add-apt-repository command is missing, see our guide on installing add-apt-repository on Debian/Ubuntu.
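
Note that the WebUpd8 Java PPA is no longer maintained, so the Oracle installer commands above may fail on newer releases. In that case, OpenJDK 8 from the standard Ubuntu/Debian repositories should work as an alternative:

sudo apt install openjdk-8-jdk
sudo update-alternatives --config java

The second command lets you choose which installed Java version the java command points to.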

Step 2: Download Apache Spark

Download the latest release of Apache Spark from the downloads page.
As of this update, that is version 2.4.5.

curl -O https://www.apache.org/dyn/closer.lua/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
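
If the mirror redirector returns an HTML page instead of the actual tarball, the same release can be fetched directly from the Apache release archive (adjust the version in the URL as needed):

curl -O https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz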

Extract the Spark tarball.

tar xvf spark-2.4.5-bin-hadoop2.7.tgz

Move the Spark folder created after extraction to the /opt/ directory.

sudo mv spark-2.4.5-bin-hadoop2.7/ /opt/spark

Set the Spark environment

Open the bashrc configuration file.

vim ~/.bashrc

Add the following:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Activate the changes.

source ~/.bashrc
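
As a quick sanity check that the new variables are in effect (the exact version string will depend on your download):

echo $SPARK_HOME
spark-submit --version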

Step 3: Start a standalone master server

We can now start a standalone master server using the start-master.sh command.

# start-master.sh 
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ubuntu.out

The process will listen on TCP port 8080.

# ss -tunelp | grep 8080
tcp   LISTEN  0       1                           *:8080                *:*      users:(("java",pid=8033,fd=238)) ino:41613 sk:5 v6only:0 <->

The web UI displays the master's Spark URL, which workers use to connect. Mine is spark://ubuntu:7077.
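
By default the master binds to the machine's hostname and serves its web UI on port 8080. If you need different values, start-master.sh accepts --host and --webui-port options; the values below are only an illustration:

$SPARK_HOME/sbin/start-master.sh --host 0.0.0.0 --webui-port 8090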

Step 4: Start the Spark worker process

The start-slave.sh command is used to start a Spark worker process.

$ start-slave.sh spark://ubuntu:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ubuntu.out

If the script is not in your $PATH, locate it first.

$ locate start-slave.sh
/opt/spark/sbin/start-slave.sh

You can also run the script using its absolute path.
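
A worker can also be started with an explicit resource cap via the --cores and --memory options of start-slave.sh; the values here are just an example:

$SPARK_HOME/sbin/start-slave.sh spark://ubuntu:7077 --cores 2 --memory 2g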

Step 5: Using the Spark shell

Use the spark-shell command to access the Spark shell.

# /opt/spark/bin/spark-shell
19/04/25 21:48:59 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 116.203.127.13 instead (on interface eth0)
19/04/25 21:48:59 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.11-2.4.1.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
19/04/25 21:49:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://static.13.127.203.116.clients.your-server.de:4040
Spark context available as 'sc' (master = local[*], app id = local-1556221755866).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.1
      /_/
         
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.2)
Type in expressions to have them evaluated.
Type :help for more information.
scala> println("Hello Spark World")
Hello Spark World
scala>

If you are a Python person, use pyspark instead.

# /opt/spark/bin/pyspark
Python 2.7.15rc1 (default, Nov 12 2016, 14:31:15) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
19/04/25 21:53:44 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 116.203.127.13 instead (on interface eth0)
19/04/25 21:53:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.11-2.4.1.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
19/04/25 21:53:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.1
      /_/
Using Python version 2.7.15rc1 (default, Nov 12 2016 14:31:15)
SparkSession available as 'spark'.
>>>
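
To confirm that the standalone cluster actually runs jobs, you can submit the bundled SparkPi example with spark-submit. The examples jar filename depends on the Spark and Scala versions you downloaded, so adjust it to match the file under /opt/spark/examples/jars:

/opt/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://ubuntu:7077 /opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 10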

Use the commands below to shut down the master and slave Spark processes.

$SPARK_HOME/sbin/stop-slave.sh
$SPARK_HOME/sbin/stop-master.sh
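
Spark also ships start-all.sh and stop-all.sh helpers in the sbin directory. On a multi-node setup they start or stop the master plus every worker listed in conf/slaves over SSH; on a single machine they simply act on localhost:

$SPARK_HOME/sbin/stop-all.sh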