PySpark

练习回顾

1. Download data set MovieLen data set and PySpark

! wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
! unzip ml-latest-small.zip
! ls
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark

2. Start a session for spark

import findspark
findspark.init()
from pyspark.sql import SparkSession
# SparkSession: 创建一个spark的实例
# builder: 构造器,用于添加其他设定
# appName("..."): 实例应用名称
# master("local[*]"): 连接spark 到cluster, local指本地,[*]指任意core 数目
# 如果是local[4]: 连接到本地4个cores
# getOrCreate(): obtain existing instance or create new instance if not exist
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("moive analysis") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
# Test the spark
df = spark.createDataFrame([{"hello": "world"} for x in range(1000)])
df.show(3, False)

3. Load data

Note when creating dataframe:

  1. When converting Pandas DataFrame to PySpark DataFrame, use StructType + createDataFrame + pandas dataframe

  2. When creating pyspark dataframe directly, use createDataFrame + list of dictionary, each element in list = one row in table

4. Change data type:

5. Query:

5.1 DataFrame method

5.2 SQL method

First register SQL tables from Spark data frame

Then use spark.sql to run SQL directly

6. User Define Function in PySpark

Last updated