A common problem when you are starting to learn Spark is that there is not much information about datasets you can practice on.
I found an interesting collection of datasets maintained by the Databricks team. You can check its documentation at the following link: https://docs.databricks.com/data/databricks-datasets.html.
To access these datasets, follow the steps below.
Steps for using the Databricks datasets.
Step 1: Check the datasets available in the folder.
This consists of checking the folder that contains the Databricks community datasets.
To access this folder, all you need to do is create a notebook in Databricks Community Edition; the folder is there by default.
To check which datasets this folder contains, run the following command.
display(dbutils.fs.ls("/databricks-datasets"))
This will return the list of datasets available.
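If you prefer plain Scala over the display helper, a minimal sketch like the one below (assuming you are running it inside a Databricks notebook, where dbutils is available) prints the same listing:

// List the top-level entries of /databricks-datasets and print their paths.
// dbutils.fs.ls returns FileInfo objects that expose path, name and size.
val datasets = dbutils.fs.ls("/databricks-datasets")
datasets.foreach(info => println(info.path))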

After checking which datasets are available, you can look at what is inside the individual directories. For example, I am going to check what files are inside dbfs:/databricks-datasets/flights/.
display(dbutils.fs.ls("/databricks-datasets/flights"))
This is going to return the list of files available in that directory.

Step 2: Check README files inside the folder.
This will give you an insight into what the dataset contains. To do that, execute the following command.
val df = spark.read.text("/databricks-datasets/flights/README.md")
df.printSchema()
df.show(200, false)
This will load the README file into a dataframe and then display that dataframe, which contains the information about the dataset.

Step 3: Load the dataset into a dataframe
To be able to use the data and work with it, you will have to load the dataset into a dataframe. To do that, use the following command.
val csvFile = "/databricks-datasets/flights/departuredelays.csv"
val tempDF = (spark.read
  .option("sep", ",")
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(csvFile)
)
If you don't understand how we are loading data from a CSV file into a dataframe, check the following link: https://datamajor.net/how-to-read-csv-files-in-spark/.
Once the dataframe is loaded, you can check its schema.
tempDF.printSchema()
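Note that the inferSchema option makes Spark scan the file to guess the column types, which can be slow on large files. As an alternative, here is a minimal sketch of declaring the schema explicitly; the column names and types below are assumptions, so adjust them to whatever tempDF.printSchema() actually reports:

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Hypothetical explicit schema for departuredelays.csv -- the field names
// and types are assumptions and should be matched to the real file.
val delaysSchema = StructType(Seq(
  StructField("date", StringType, nullable = true),
  StructField("delay", IntegerType, nullable = true),
  StructField("distance", IntegerType, nullable = true),
  StructField("origin", StringType, nullable = true),
  StructField("destination", StringType, nullable = true)
))

// Reading with an explicit schema avoids the extra pass over the data.
val typedDF = spark.read
  .option("header", "true")
  .schema(delaysSchema)
  .csv(csvFile)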

Then you can proceed to display the dataframe.
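For example, in a Databricks notebook you can use the built-in display helper, or the standard show method if you want plain console output:

// Render the dataframe as an interactive table in the notebook.
display(tempDF)

// Or print the first rows to the console without truncating the columns.
tempDF.show(10, false)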

As you have seen, there are a lot of datasets available in this directory. This gives you a good starting point for learning how to work with big datasets and a chance to practice your Spark skills.
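As a quick first exercise, here is a minimal sketch of the kind of aggregation you could try on the departure delays dataframe. It assumes the file has origin and delay columns, so check printSchema() first:

import org.apache.spark.sql.functions.{avg, count, col}

// Average departure delay per origin airport, sorted from worst to best.
// The column names origin and delay are assumptions taken from the CSV header.
val delaysByOrigin = tempDF
  .groupBy("origin")
  .agg(count("*").as("flights"), avg("delay").as("avg_delay"))
  .orderBy(col("avg_delay").desc)

display(delaysByOrigin)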