PySpark

练习回顾

1. Download data set MovieLen data set and PySpark

! wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
! unzip ml-latest-small.zip
! ls
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark

2. Start a session for spark

import findspark
findspark.init()
from pyspark.sql import SparkSession
# SparkSession: 创建一个spark的实例
# builder: 构造器,用于添加其他设定
# appName("..."): 实例应用名称
# master("local[*]"): 连接spark 到cluster, local指本地,[*]指任意core 数目
# 如果是local[4]: 连接到本地4个cores
# getOrCreate(): obtain existing instance or create new instance if not exist
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("moive analysis") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
# Test the spark
df = spark.createDataFrame([{"hello": "world"} for x in range(1000)])
df.show(3, False)

3. Load data

Note when creating dataframe:

  1. When converting Pandas DataFrame to PySpark DataFrame, use StructType + createDataFrame + pandas dataframe

  2. When creating pyspark dataframe directly, use createDataFrame + list of dictionary, each element in list = one row in table

4. Change data type:

5. Query:

5.1 DataFrame method

5.2 SQL method

First register SQL tables from Spark data frame

Then use spark.sql to run SQL directly

6. User Define Function in PySpark

Last updated

Was this helpful?