JSON, Schemas and Types

Kasim Ali
Jan 2, 2023

In this post, you will learn why you might use JSON and how schemas and types help you work with it.


Tabular vs. Semi-structured Data.

Tabular data is what most people associate with data: ordered rows and columns. You need to consider the following with tabular data:

  • What happens when your data changes?
  • Null values will be present when you are missing data.
  • The data ranges in complexity.

Semi-structured data does not conform to a formal structure:

  • All observations do not necessarily share fields.
  • Also known as self-describing structure.
  • Allows for hierarchies.
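A quick sketch with Python's standard `json` module makes these properties concrete (the records and field names here are invented for illustration):

```python
import json

# Two observations from the same "dataset": they do not share all
# fields, and "address" is a nested object (a hierarchy).
records = [
    '{"name": "Ada", "age": 36, "address": {"city": "London"}}',
    '{"name": "Grace", "languages": ["COBOL", "FORTRAN"]}',
]

for line in records:
    obj = json.loads(line)
    # Each object is self-describing: its keys double as its schema.
    print(sorted(obj.keys()))
```

Note that neither record is "missing" anything; each simply carries the fields it has.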

JSON.

Here are some things you need to know about JSON.

  • Works in attribute-value pairs.
  • Can serialise data: there is no object-relational impedance mismatch and no need to match types between environments.
  • Supports nested and hierarchical data with complex relations.
  • It does not always map cleanly onto SQL.
  • Relaxing data model constraints can reduce data quality.
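Serialisation is simply turning an in-memory object into text and back. A minimal sketch with Python's standard `json` module (the record itself is invented):

```python
import json

event = {"call_id": 123, "unit": "E01", "detail": {"priority": 2}}

# Serialise: dict -> JSON text. The text is language-neutral, so the
# receiving environment needs no matching in-memory types.
payload = json.dumps(event)

# Deserialise: JSON text -> dict, structurally identical to the original.
restored = json.loads(payload)
assert restored == event
```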

Schemas.

Schemas are a vital part of Spark’s data structure. By defining the column names and data types in a table, they improve both performance and code reliability. Because a DataFrame carries a schema, Spark knows the structure of the data and can optimise query execution. In computer science more broadly, choosing the right data structure is what makes an algorithm efficient. So why use schemas? Because they help you get the most out of your data and ensure fast, reliable processing.

Schemas with Semi-structured JSON Data.

In tabular data, like that found in CSV files or databases, each row has a value for every column. Semi-structured data, such as JSON, does not need to follow a formal data model: a given field may appear zero times, once, or multiple times for a given observation. This makes the format well suited to hierarchical data and evolving schemas. JSON is a common example of semi-structured data and consists of attribute-value pairs.

Reading JSON w/ InferSchema and JSON Lines.

Reading in JSON is not too different from reading in CSV files.

Much like the CSV reader, the JSON reader assumes two things:

  • There is one JSON object per line.
  • Each object is delimited by a newline.

This format is referred to as JSON Lines or newline-delimited JSON.
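To make the format concrete, here is a small Python sketch (the file contents are invented) that parses newline-delimited JSON one object at a time:

```python
import json

# Contents of a hypothetical JSON Lines file: one complete JSON
# object per line, delimited by newlines.
jsonl = (
    '{"call_number": 1, "unit": "E01"}\n'
    '{"call_number": 2, "unit": "T05"}\n'
)

# One json.loads call per line, never for the file as a whole.
rows = [json.loads(line) for line in jsonl.splitlines()]
```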

Here is how we can read a JSON file:

CREATE OR REPLACE TEMPORARY VIEW fireCallsJSON
USING JSON
OPTIONS (
  path "/mnt/davis/fire-calls/fire-calls-truncated.json"
)

We can take a look at our table like this:

DESCRIBE fireCallsJSON

Spark can automatically determine the schema of your data, but this process (called schema inference) can hurt performance because it requires scanning all of your data. To avoid this, you can define your own schema: you specify the data types up front and select only the fields you need, which improves performance and lets you focus on the data that is relevant to you.
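In Spark you would declare such a schema with `StructType` and `StructField`, but the underlying idea can be sketched in plain Python (the field names and types here are invented):

```python
import json

# A user-defined schema: field name -> expected type. Declaring this up
# front means no scan of the data is needed to discover the structure.
schema = {"call_number": int, "unit": str}

record = json.loads('{"call_number": 1, "unit": "E01", "city": "SF"}')

# Keep only the declared fields, and check each value's type.
parsed = {name: record[name] for name in schema}
assert all(isinstance(parsed[n], t) for n, t in schema.items())
```

Fields outside the declared schema (like `city` above) are simply never materialised, which is where the performance win comes from.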

Primitive and Non-primitive Types.

Spark’s types package provides the building blocks for creating schemas.

There are two main types of data: primitive and non-primitive (also known as composite).

Primitive types, like FloatType and StringType, contain the data itself.

Non-primitive types, like ArrayType and MapType, contain references to memory locations and allow a field to contain an arbitrary number of elements in an array or map form.

Non-primitive types are made up of primitive types, such as an Array of IntegerType.
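The same split shows up in ordinary Python values; a small illustrative sketch (the row and its mapping to Spark type names are assumptions for illustration, not Spark output):

```python
# Primitive values hold the data directly; composite values hold other
# values. Spark mirrors this: StringType and FloatType are primitive,
# while ArrayType and MapType are composites built from primitives.
row = {
    "station": "B02",         # analogue of StringType (primitive)
    "response_minutes": 4.2,  # analogue of FloatType (primitive)
    "units": ["E01", "T05"],  # analogue of ArrayType(StringType)
    "counts": {"fires": 3},   # analogue of MapType(StringType, IntegerType)
}

# Composite fields can hold an arbitrary number of elements.
composite_fields = [k for k, v in row.items() if isinstance(v, (list, dict))]
```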

Do let me know what you think of this post. I am still a learner myself and I would love to hear your thoughts. You are more than welcome to message me on LinkedIn or Twitter.
