In this tutorial, we are going to work on Databricks Community Edition and use the spark-xml library.
To follow the tutorial, we are going to use a CSV file that is stored in a folder on Databricks Community Edition. If you need further information, please go to the following URL: https://datamajor.net/datasets-to-use-for-learning-data-processing-on-spark/.
To complete the task, please follow the steps below.
Steps to convert dataframes into XML files in Spark.
Step 1: Download and generate a jar file for spark-xml.
spark-xml is a library that allows us to convert dataframes into XML files. For this tutorial, we are going to download the project containing spark-xml in ZIP format from the following Git repository: https://github.com/databricks/spark-xml.
Then unzip it into a folder.
After that, we are going to download SBT to build a JAR file. Go to https://www.scala-sbt.org/download.html and install it.
After SBT is installed on your computer, check that it is installed correctly by typing sbt in the Command Prompt. If it is installed correctly, sbt will start and print its startup messages.
Next, go to the folder where you unzipped the spark-xml library and run the sbt package command.
Don’t forget that you need to install and configure a JDK before running this command; otherwise it will fail with an error.
The JDK also needs to be a compatible version. In my case, I tried to run the sbt package command with JDK 16, but it failed.
You can check JDK/Scala Compatibility on the following link: https://docs.scala-lang.org/overviews/jdk-compatibility/overview.html.
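If you are not sure which JDK and Scala versions your environment actually picks up, one way to check is to print them from a Scala REPL or script. This is only a minimal sketch using the standard scala.util.Properties helper; the value names jdkVersion and scalaVersion are chosen here for illustration and are not part of the original tutorial.

```scala
// Prints the JDK and Scala versions visible to the running JVM.
import scala.util.Properties

val jdkVersion = Properties.javaVersion            // value of the java.version system property
val scalaVersion = Properties.versionNumberString  // version of the Scala library on the classpath

println(s"JDK:   $jdkVersion")
println(s"Scala: $scalaVersion")
```

You can then compare both values against the compatibility table linked above.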
Next, check that you have the right JDK version (I am using JDK 12.0.2) and then run the command.
After sbt package finishes, you will receive a message confirming that the JAR file was generated and copied into a folder.
You can also install the library from a Maven repository, but from what I’ve seen, it is not kept up to date.
Step 2: Install spark-xml library into a Databricks Cluster.
Go to Databricks Community Edition, open the cluster that you are currently using, go to the Libraries tab, and click on “Install New”.
Then, in the Install Library window, drag and drop the JAR file that we generated in step 1 and click on Install.
Wait a few seconds until the status changes from “Installing” to “Installed”.
Then, if you are using a newer Runtime version (in my case, I am using 8.2), you will need to install a Maven library called “maven-jaxb2-plugin”.
To do that, go to “Workspace” > “Shared” > “Create” > “Library”.
Then click on “Search Package”.
In “Search Packages”, search “Maven Central” for “maven-jaxb2-plugin”.
I am going to select “org.jvnet.jaxb2.maven2”, as some Java forums recommend this one, but you can try different ones.
After selecting the one you want to install, click on “Select” and click on “Create”.
Next, you will be redirected to the library’s window.
At the bottom, you can check which clusters the library is installed on.
After installing the library, there should not be any issues when running the conversions to XML files.
Step 3: Convert a dataset into an XML file.
For that, we are going to load a dataset consisting of airport codes into a dataframe.
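val csvFile = "/databricks-datasets/flights/airport-codes-na.txt"

// The file is tab-delimited, so the separator is a tab character.
val tempDF = spark.read
  .option("sep", "\t")
  .option("header", "true")       // first line holds the column names
  .option("inferSchema", "true")  // let Spark guess the column types
  .csv(csvFile)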
Then we are going to display it, just to check that it loaded correctly.
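A minimal sketch of that check, assuming the tempDF value created in the previous cell. display is the Databricks notebook built-in, while printSchema and show are the standard Spark equivalents that also work outside a notebook.

```scala
// Assumes tempDF was created in the previous cell.
tempDF.printSchema()              // column names and inferred types
tempDF.show(5, truncate = false)  // first five rows as plain text
display(tempDF)                   // Databricks notebook table view
```

If the City, Country and IATA columns appear as separate columns, the separator was read correctly.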
After that we are going to import the libraries and export this to our DBFS.
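import org.apache.spark.sql.SparkSession
import com.databricks.spark.xml._

// Reuse the notebook's existing Spark session.
val spark = SparkSession.builder.getOrCreate()

// Keep only the columns we want in the XML output.
val selectedFlights = tempDF.select("City", "Country", "IATA")

selectedFlights.write
  .option("rootTag", "airports") // element that wraps all rows
  .option("rowTag", "airport")   // element written for each row
  .xml("/FileStoreemail@example.com/nflights.xml")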
If the cell finishes without errors, it worked as expected.
Finally, this generates an XML file with the structure that we defined in the script.
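To confirm the result, the output can be read back with the same library. With the options used here, each row should become an airport element nested inside a single airports root element, with one child element per selected column. The sketch below assumes spark-xml is installed on the cluster; outputPath is a hypothetical placeholder that you need to replace with the path you passed to .xml(...) when writing, and airportsBack is a name chosen only for this example.

```scala
import com.databricks.spark.xml._

// Hypothetical placeholder: replace with the path used in the write step.
val outputPath = "/FileStore/<your-folder>/nflights.xml"

// rowTag tells spark-xml which element represents one row.
val airportsBack = spark.read
  .option("rowTag", "airport")
  .xml(outputPath)

airportsBack.printSchema()              // expect the City, Country and IATA columns
airportsBack.show(5, truncate = false)  // first five rows read from the XML output
```

Keep in mind that Spark writes the output path as a directory containing one or more part files, so nflights.xml will be a folder rather than a single file.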