What datasets can we use to learn data processing with Spark?

A common problem when starting to learn Spark is that there is not much information about datasets you can practice on.


I found an interesting collection of datasets maintained by the Databricks team. You can check its documentation at the following link: https://docs.databricks.com/data/databricks-datasets.html.

To access these datasets, follow the steps below.

Steps for using the Databricks datasets.

Step 1: Check the datasets available in the folder.

This consists of checking the folder that contains the Databricks community datasets.

To access this folder, all you need to do is create a notebook on Databricks Community Edition. The folder is there by default.

To check which datasets this folder contains, run the following command.

display(dbutils.fs.ls("/databricks-datasets"))

This will return the list of available datasets.
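If you want to work with this listing programmatically instead of just displaying it, note that dbutils.fs.ls returns a sequence of file-info objects that you can filter and map like any Scala collection. A minimal sketch (this only works inside a Databricks notebook, where dbutils is available by default):

```scala
// dbutils.fs.ls returns a sequence of FileInfo objects
// (with fields such as path, name, and size) -- Databricks notebooks only
val entries = dbutils.fs.ls("/databricks-datasets")

// Print only the directory names, one dataset per line
entries.filter(_.isDir).map(_.name).foreach(println)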

After checking which datasets are available, you can look at what is inside the directories. For example, I am going to check which files are inside dbfs:/databricks-datasets/flights/.

display(dbutils.fs.ls("/databricks-datasets/flights"))

This will return the list of files available in that directory.

Step 2: Check the README files inside the folder.

This will give you an insight into what the dataset contains. To do that, execute the following command.

val df = spark.read.text("/databricks-datasets/flights/README.md")
df.printSchema()
df.show(200, false)

This will load the Markdown file into a dataframe and then display that dataframe, which contains the information about the dataset.
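If you only want to skim the README without going through a dataframe, dbutils offers a simpler option. A minimal sketch (Databricks notebooks only; the second argument is the maximum number of bytes to read):

```scala
// Print up to the first 10,000 bytes of the README as plain text
println(dbutils.fs.head("/databricks-datasets/flights/README.md", 10000))
```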

Step 3: Load the dataset into a dataframe.

To be able to use the information and work on it, you will have to load the dataset into a dataframe. To do that, use the following command.

val csvFile = "/databricks-datasets/flights/departuredelays.csv"
val tempDF = (spark.read         
   .option("sep", ",")
   .option("header", "true")  
   .option("inferSchema", "true")           
   .csv(csvFile)
)

If you don't understand how we are loading data from a CSV file into a dataframe, check the following link: https://datamajor.net/how-to-read-csv-files-in-spark/.

After doing that you can check the schema of the dataframe.

tempDF.printSchema()

Then you can proceed to display the dataframe.
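For example, here is a minimal sketch of showing the dataframe and running a first sanity check on it. The column name used in the aggregation (origin) is an assumption based on the departuredelays.csv header, so verify it against the output of tempDF.printSchema() first:

```scala
// Show the first 10 rows; passing false keeps long values from being truncated
tempDF.show(10, false)

// A quick aggregation: the five origin airports with the most recorded departures
// ("origin" is an assumed column name -- check it against the printed schema)
tempDF.groupBy("origin")
  .count()
  .orderBy(org.apache.spark.sql.functions.desc("count"))
  .show(5)
```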

As you have seen, there are a lot of datasets available in this directory. This gives you a good starting point for learning how to work with big datasets and a chance to practice your Spark skills.

