How to read CSV files in Spark

In this tutorial, we are going to work on Databricks Community Edition and read a CSV file using Spark.


To follow along, we are going to upload a CSV file from the following dataset.

https://www.ecdc.europa.eu/en/publications-data/data-daily-new-cases-covid-19-eueea-country.

To complete the task, follow the steps below.

Steps to read a CSV file in Spark

Step 1: Upload the CSV file

First, you need to upload the file to a file system that Spark can access. In my case, since I am using Databricks, I am going to use its file system (DBFS).

If you are using Databricks, you can check that the uploaded file exists. To do so, create a Scala or SQL notebook and execute the following command.

%fs ls /FileStore/shared_uploads/alvaroagx@gmail.com/

If the file is there, it will appear in the listing returned by the command.
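You can also run this check from a Scala cell. A minimal sketch using the Databricks dbutils file system utilities, assuming the same upload path, could look like this.

// List the uploaded files from a Scala cell (Databricks only)
val uploads = dbutils.fs.ls("/FileStore/shared_uploads/alvaroagx@gmail.com/")
uploads.foreach(f => println(s"${f.name} (${f.size} bytes)"))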

Step 2: Load the CSV dataset into a DataFrame or a table

After uploading the file, you can load the CSV data into a DataFrame or a table as follows.

Load the CSV dataset into a Spark DataFrame

To load the CSV data into a DataFrame, use the following Scala code.

val csvFile = "dbfs:/FileStore/shared_uploads/alvaroagx@gmail.com/data.csv"
val tempDF = (spark.read           
   .option("sep", ",")
   .option("header", "true")    
   .option("inferSchema", "true")                 
   .csv(csvFile)
)

On the first line we set the path to the CSV file, and on the following lines we read the file into a DataFrame and store it in tempDF.

When reading CSV files, Spark offers several options; in this example we are using the following.

sep. This option specifies the separator used in the CSV file; in our case that is a comma (,).

header. By setting this option to true, we tell Spark that the first CSV line contains the column headers. If we set it to false, the columns receive default names instead.

inferSchema. When this option is true, Spark infers the data type of each column in the CSV file. If it is set to false, all columns are loaded as strings.

After executing this command, Spark returns a message showing the resulting DataFrame and the columns it detected.
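Note that inferSchema requires an extra pass over the data. If you already know the column layout, you can declare the schema yourself instead. The sketch below uses two hypothetical columns; adjust the names and types to match your file.

import org.apache.spark.sql.types._

// Hypothetical schema: replace the column names and types with your own
val manualSchema = StructType(Seq(
  StructField("dateRep", StringType, nullable = true),
  StructField("cases", IntegerType, nullable = true)
))

val typedDF = spark.read
  .option("sep", ",")
  .option("header", "true")
  .schema(manualSchema)   // providing a schema makes inferSchema unnecessary
  .csv(csvFile)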

Load the CSV dataset into a Spark table

To load the CSV into a table, create a SQL notebook.

Then execute the following command to create a table and store the CSV information in it.

CREATE TABLE TestData
USING csv
OPTIONS (
  path "/FileStore/shared_uploads/alvaroagx@gmail.com/data.csv",
  header "true");

This creates a table that is stored in the Spark file system; in the options we set the path where the CSV file is located.

We also set header to true because our CSV file contains a header row. If you set it to false, the column names are given generic defaults instead.
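If you would rather stay in a Scala notebook, the same statement can be issued through spark.sql. This is a minimal sketch assuming the same table name and upload path.

// Create the same CSV-backed table from a Scala cell
spark.sql("""
  CREATE TABLE TestData
  USING csv
  OPTIONS (
    path "/FileStore/shared_uploads/alvaroagx@gmail.com/data.csv",
    header "true"
  )
""")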

Step 3: Check the data schema

To check the schema of the DataFrame you loaded, execute the following command.

tempDF.printSchema()

This displays the schema of the DataFrame we loaded: the column names, their data types, and whether each column can contain null values.

If you want to check the table information, execute the following query instead.

DESCRIBE TABLE TestData;

It returns the same information as the printSchema call we used for the DataFrame.
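You can also inspect the table without leaving Scala. The short sketch below assumes the TestData table created above.

// Inspect the table schema from a Scala cell
spark.table("TestData").printSchema()

// DESCRIBE can also be run through spark.sql and displayed
display(spark.sql("DESCRIBE TABLE TestData"))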

Step 4: Explore the loaded data

To inspect the data you loaded into the DataFrame, you can visualize it in the following way.

display(tempDF)
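Keep in mind that display is specific to Databricks notebooks. On plain Spark, the DataFrame show method is a reasonable substitute; a minimal example is below.

// Outside Databricks, print the first rows to the console instead
tempDF.show(20, false)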

If you want to check the data loaded into the table, explore it with a SQL query.

SELECT * 
FROM TestData;

Note that in a Databricks notebook this query only displays the first 1000 rows of data.

As you can see, both options return the same information, and neither is better than the other.

This is because multi-disciplinary teams typically work on this platform, so you can handle the data with SQL queries or with Scala/Python code.

