Accessing Data and File Formats

Kasim Ali
5 min read · Dec 29, 2022

In this post, you will learn how to access data and work with different file formats within Databricks.


There are a number of ways to connect to remote data. In this lesson, you will learn how to connect to object (BLOB) stores and to databases over JDBC. Object stores are a common way of storing large amounts of data. JDBC is a Java API for connecting to databases.

Spark works well with JDBC thanks to an optimization called predicate pushdown, where parts of a SQL query (the predicates) are executed by the database itself before the data is transferred to the Spark cluster. This saves a lot of network transfer.

DBFS (Databricks File System) works alongside S3 (Amazon Simple Storage Service). To read from and write to our data source, we need to mount the S3 bucket.

Here is what that could look like:
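A minimal sketch, assuming the bucket is accessed with an AWS access key and secret key:

ACCESS_KEY = "."
SECRET_KEY = "."
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")  # URL-encode any "/" in the secret key
AWS_BUCKET_NAME = "."
MOUNT_NAME = "."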

As you can see, I have taken out the information and left a period as a placeholder.

AWS_BUCKET_NAME refers to the name of the bucket as it currently sits in S3.

MOUNT_NAME refers to how it will appear in the Databricks environment.

Here is some boilerplate code to mount the bucket:
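A minimal sketch using dbutils.fs.mount and the variables defined above:

# Mount the S3 bucket into DBFS under /mnt/<MOUNT_NAME>
dbutils.fs.mount(
  "s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME),
  "/mnt/%s" % MOUNT_NAME
)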

dbutils gives us a number of different functions for accessing our data.

We can now test our connection by listing the contents of the mount.

Try the following:

%fs ls /mnt/xxx

In this case, your mount name will be in place of ‘xxx’.

This will allow you to see your directories.

The following will allow you to see all of your mounts:

%fs mounts

Serial JDBC Reads.

JDBC is an API that allows Java programs to connect to databases. Spark, which runs on the Java Virtual Machine, can easily connect to databases using JDBC. Predicate pushdown, an optimization used by Spark, lets the database handle certain parts of a query (the predicates) before the data is transferred to the Spark cluster, reducing data transfer and improving performance. This works because Spark folds the pushed-down predicates into the SQL it sends over the JDBC connection.

The following code makes sure the PostgreSQL JDBC driver class is available, since we are connecting to a Postgres database:

%scala
Class.forName("org.postgresql.Driver")

We also need to create a JDBC connection string and specify a few options:
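Here is a minimal sketch in Python; the host, database, credentials, and table name are placeholders, and the result is registered under the name ‘twitterJDBC’ so it can be queried later:

# Hypothetical connection details -- replace with your own host, database, and credentials
jdbc_url = "jdbc:postgresql://<hostname>:5432/<database>"

twitter_df = (spark.read
  .format("jdbc")
  .option("url", jdbc_url)
  .option("dbtable", "training.Twitter")  # hypothetical table name
  .option("user", "<username>")
  .option("password", "<password>")
  .load())

# Register the result so it can be queried with SQL
twitter_df.createOrReplaceTempView("twitterJDBC")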

You can now run SQL queries against ‘twitterJDBC’.
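As a quick check that pushdown is happening, you can filter the view and look at the physical plan; for a JDBC source, pushed predicates appear under ‘PushedFilters’. The column name below is hypothetical:

# The WHERE clause is pushed down to Postgres rather than filtered on the cluster;
# the explain output lists it under PushedFilters (column name is hypothetical)
spark.sql("SELECT * FROM twitterJDBC WHERE userID > 100").explain()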

Parallel JDBC Reads.

We can parallelize our JDBC read so that Spark pulls the data over multiple concurrent connections.

You will notice it looks very similar, apart from three additional parameters:

We passed:

  • lowerBound
  • upperBound
  • numPartitions

This tells Spark that we want two partitions between the two bounds.
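Here is a sketch of the parallel version, reusing the hypothetical connection details from above. Note that Spark also needs a numeric partitionColumn to go with the bounds:

twitter_parallel_df = (spark.read
  .format("jdbc")
  .option("url", jdbc_url)
  .option("dbtable", "training.Twitter")  # hypothetical table name
  .option("user", "<username>")
  .option("password", "<password>")
  .option("partitionColumn", "userID")    # hypothetical numeric column to split on
  .option("lowerBound", 0)                # hypothetical bounds on that column
  .option("upperBound", 2000000)
  .option("numPartitions", 2)             # two concurrent reads
  .load())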

By adjusting the number of partitions we can actually speed up the query run time. We can measure this by using a simple SELECT * query.
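For example, reusing the sketch above, you could register the parallel read as a view and time a full scan under different numPartitions settings:

# Register the parallel read and force a full scan so we can compare timings
twitter_parallel_df.createOrReplaceTempView("twitterParallelJDBC")
spark.sql("SELECT * FROM twitterParallelJDBC").count()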

File Formats.

You have likely seen a number of different file formats and compression options; CSV is probably the most familiar. The ideal format for distributed systems is Delta.

Delta is columnar and is built on top of the powerful Apache Parquet file format.
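Here is a minimal sketch of writing and reading a Delta table on Databricks; the DataFrame is a toy example and the path is a placeholder:

# Toy DataFrame just to illustrate the round trip; the path is a placeholder
df = spark.range(10)
df.write.format("delta").mode("overwrite").save("/mnt/<your-mount>/delta-demo")
delta_df = spark.read.format("delta").load("/mnt/<your-mount>/delta-demo")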

You will learn which is the most appropriate file format to use for a number of different conditions.
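One way to get basic information on a file is to list it with dbutils; the path below is a placeholder for wherever the file sits under your mount:

# List the file to see its name, path, and size in bytes (the path is hypothetical)
display(dbutils.fs.ls("/mnt/<your-mount>/<file>.txt"))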

Here, you can see some information on the file we are working with. The size is reported in bytes, which means our file is approximately 1.8 GB.

This means you would likely need at least 1.8 GB of RAM to process this file, plus additional resources for data manipulation and for any other system processes happening in the background.

Let’s take a look at the first 1,000 bytes of the data:
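A sketch using dbutils.fs.head, again with a placeholder path:

# Print the first 1,000 bytes of the raw file (the path is hypothetical)
print(dbutils.fs.head("/mnt/<your-mount>/<file>.txt", 1000))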

You can see that this file uses ‘:’ as a delimiter.

You might have noticed that the file is a .txt file. We can read it as a CSV, using ‘:’ as the separator, with the following code:

CREATE OR REPLACE TEMPORARY VIEW fireCallsCSV
USING CSV
OPTIONS (
path "/..",
header "true",
sep ":"
)

Let’s take a look at our data:
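For example, a quick sample of the view we just created (run from a Python cell; a %sql cell works equally well):

# Show a sample of rows from the fireCallsCSV view
display(spark.sql("SELECT * FROM fireCallsCSV LIMIT 10"))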

Spark has interpreted all our data as strings, which can create problems: storing integers as strings is less efficient, can cause interpretation issues when manipulating the data, and prevents us from doing integer operations.

We can resolve this by running the same command with an ‘inferSchema’ option.

CREATE OR REPLACE TEMPORARY VIEW fireCallsCSV
USING CSV
OPTIONS (
path "/..",
header "true",
sep ":",
inferSchema "True"
)

This command takes a LOT longer to run, too. The first command took around 0.28 seconds, whereas this one took Spark around 1.6 minutes.

However, the schema has now been inferred, so the data is stored more efficiently and integer operations are available.
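You can confirm the inferred column types by describing the view:

# The column types should now include integers rather than only strings
spark.sql("DESCRIBE fireCallsCSV").show()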

I will now run an experiment so that you can see the difference in speed when working with other file and compression formats such as gzip and bzip2.

Here are the results when we try to view some basic information:

  • .txt: 2.00 seconds
  • .gz: 1.72 seconds
  • .bz2: 0.71 seconds

Next, we will load each of these using the code we previously used (with the inferSchema option):

  • .gz: 51.05 seconds
  • .bz2: 1.96 minutes
  • .parquet: 2.69 seconds

The Parquet file was the fastest because the Parquet format stores metadata, including the schema, alongside the data, so Spark does not need to infer it.

Do let me know what you think of this post. I am still a learner myself and I would love to hear your thoughts. You are more than welcome to message me on LinkedIn or Twitter.
