The from_json() function parses a JSON string column into a StructType or another complex type (e.g., ArrayType). It is commonly used when JSON data is stored as strings in a column and you need to extract structured fields from it.
Syntax: -
pyspark.sql.functions.from_json(col, schema, options={})
- col: The column containing JSON strings to parse.
- schema: A schema that specifies the expected structure of the JSON data (can be defined using StructType or ArrayType).
- options: A dictionary of options for JSON parsing (optional).
Example: -
- Parsing JSON string.
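from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Reuse the active SparkSession, or create one if none exists
spark = SparkSession.builder.getOrCreate()
# Example data
data = [("1", '{"name":"Alice", "age":30}'),
        ("2", '{"name":"Bob", "age":25}')]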
# Create a DataFrame
df = spark.createDataFrame(data, ["id", "json_string"])
# Define schema
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
# Use from_json to parse JSON strings
parsed_df = df.withColumn("parsed_json", from_json(col("json_string"), schema))
# Extract fields from the parsed JSON
result_df = parsed_df.select("id", col("parsed_json.name").alias("name"), col("parsed_json.age").alias("age"))
# Show result
result_df.show()
+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice| 30|
|  2|  Bob| 25|
+---+-----+---+
- Parsing JSON array.
from pyspark.sql.types import ArrayType
# Example data
data = [("1", '[{"name":"Alice", "age":30}, {"name":"Bob", "age":25}]')]
# Create DataFrame
df = spark.createDataFrame(data, ["id", "json_array"])
# Define array schema
array_schema = ArrayType(StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
]))
# Parse JSON array
parsed_array_df = df.withColumn("parsed_json", from_json(col("json_array"), array_schema))
# Show result
parsed_array_df.show(truncate=False)
+---+------------------------+
|id |parsed_json             |
+---+------------------------+
|1  |[{Alice, 30}, {Bob, 25}]|
+---+------------------------+
Common Use Cases
- Parsing streaming JSON data: Often used in structured streaming where JSON data arrives in string format.
- Nested data extraction: Allows access to deeply nested fields in JSON.
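- Schema enforcement: Applies the expected schema while parsing. In the default PERMISSIVE mode, strings that cannot be parsed yield null values instead of raising an error (a null struct or a struct of null fields, depending on the Spark version), so malformed records can be filtered out afterwards.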