PySpark
Exercise Review
1. Download the MovieLens data set and PySpark
! wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
! unzip ml-latest-small.zip
! ls
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark
2. Start a Spark session
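import os
import findspark
# Point findspark at the JDK and the Spark build unpacked above.
# The JAVA_HOME path is an assumption (the usual openjdk-8 location on 64-bit Ubuntu/Debian); adjust if yours differs.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = os.path.join(os.getcwd(), "spark-3.1.1-bin-hadoop3.2")
findspark.init()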
from pyspark.sql import SparkSession
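# SparkSession: the entry point; creates a Spark instance
# builder: the builder object used to attach further settings
# appName("..."): name of the application
# master("local[*]"): which cluster to connect to; "local" runs Spark on this machine, [*] uses as many worker threads as there are cores
# local[4] would instead run locally with 4 worker threads
# getOrCreate(): return the existing session, or create a new one if none exists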
spark = SparkSession \
.builder \
.master("local[*]") \
.appName("movie analysis") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
# Test the spark
df = spark.createDataFrame([{"hello": "world"} for x in range(1000)])
df.show(3, False)
3. Load data
Notes on creating a DataFrame:
When converting a pandas DataFrame to a PySpark DataFrame, define a StructType schema and pass it to createDataFrame together with the pandas DataFrame.
When creating a PySpark DataFrame directly, pass createDataFrame a list of dictionaries; each element of the list is one row of the table.
4. Change data type:
5. Query:
5.1 DataFrame method
5.2 SQL method
First register the Spark DataFrame as a SQL table (a temporary view).
Then use spark.sql to run SQL queries against it directly.
6. User-Defined Functions (UDF) in PySpark