Spark Interview Question - Online Assessment Coding Test Round | Using Spark with Scala
In this blog, we will discuss an online assessment given by one of the IT organizations in India. Nowadays in Spark interviews, candidates are often asked to take an online coding test before getting into the Spark technical discussion. The Spark coding challenge provided here had to be completed within 3 hours and submitted through the portal. Let us see the scenario and the dataset given to solve the problem. The requirement was for a Spark Scala developer with 4+ years of experience in Spark and other big data tools.

Dataset:

You are given the MovieLens dataset, with one million ratings, for this online exercise. The dataset contains:
  • a Movies dataset, 
  • a Ratings dataset, and 
  • a Users dataset.
You are also given a readme.txt file, which provides detailed information about the dataset.

The dataset can be downloaded from the link given here: Download MovieLens Data.


Technology To Be Used:

Read the points below carefully before starting your assessment.

  • You can use any IDE to develop the code. [I would recommend IntelliJ IDEA, which is flexible for development, building the JAR, and unit testing.]
  • Write the code in Spark Scala with SBT or Maven for the following questions. Your code should be of production quality. Any version of Spark can be used in your development.
  • Create a zip file that contains your Maven (or) SBT project, including any test data that you used, to share with us. Do not include the large dataset inside the folder.
  • Do not include any output files inside the zip folder, since we can generate them by running your code on our end.

Steps to Set Up IntelliJ for Spark Scala with SBT:

You can follow the video below to set up Spark Scala with SBT on your own machine for development.



If you face any issues while setting up your machine, reach out to me by commenting below or mailing me.
Please do support by subscribing and sharing if you like my content.

Let's move on to the questions now.

Assessment Steps:

Step 1 - Create a new project named sparkDemo. Define a proper project folder structure and dependencies using SBT or Maven, and add all the required Spark packages/dependencies to the project.
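As a sketch for Step 1, a minimal build.sbt for the SBT route might look like the following; the Spark, Scala, and ScalaTest versions shown here are illustrative, so pick ones that match your environment:

```scala
// build.sbt - minimal sketch; versions are illustrative, not prescribed by the assessment
name := "sparkDemo"
version := "0.1"
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.3.2",
  "org.apache.spark" %% "spark-sql"  % "3.3.2",
  "org.scalatest"    %% "scalatest"  % "3.2.15" % Test
)
```

With this in place, `sbt compile` pulls the Spark dependencies and `sbt package` builds the JAR you will submit.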


Step 2 - Write Spark code to generate the required output for the use cases below and save the output in the target folder inside the project.
    1. Create a CSV file containing the list of movies with the number of users who rated each movie and the average rating per movie. The file should have three columns, i.e., MovieId, No of users, Average rating. A header row is not required. [Note - Use RDDs for this task (no Datasets or DataFrames).]
    2. Create a CSV file containing the list of unique genres and the number of movies under each genre. The file should contain two columns, i.e., Genres, No of movies. Column headers are not required. [Note - Use RDDs for this task (no Datasets or DataFrames).]
    3. Generate an output in Parquet format that contains the top 100 movies based on their ratings, with the following fields: Rank (from 1 to 100), MovieId, Title, Average Rating.
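As a hedged sketch of the RDD-only tasks (2.1 and 2.2), the following assumes the MovieLens 1M "::"-delimited layout described in the readme (UserID::MovieID::Rating::Timestamp for ratings, MovieID::Title::Genres for movies); the object name, input paths, and output folders are my own placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MovieStats {
  // ratings.dat lines look like: UserID::MovieID::Rating::Timestamp
  def parseRating(line: String): (Int, Double) = {
    val f = line.split("::")
    (f(1).toInt, f(2).toDouble) // (movieId, rating)
  }

  // movies.dat lines look like: MovieID::Title::Genre1|Genre2|...
  def parseGenres(line: String): Array[String] =
    line.split("::")(2).split("\\|")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sparkDemo").setMaster("local[*]"))

    // 2.1: MovieId, number of users who rated it, average rating
    sc.textFile("data/ratings.dat")
      .map(parseRating)
      .aggregateByKey((0L, 0.0))(
        { case ((n, sum), r) => (n + 1, sum + r) },          // fold one rating in
        { case ((n1, s1), (n2, s2)) => (n1 + n2, s1 + s2) }) // merge partitions
      .map { case (movieId, (n, sum)) => s"$movieId,$n,${sum / n}" }
      .saveAsTextFile("target/movie-ratings")

    // 2.2: Genre, number of movies carrying that genre
    sc.textFile("data/movies.dat")
      .flatMap(parseGenres)
      .map(g => (g, 1))
      .reduceByKey(_ + _)
      .map { case (g, n) => s"$g,$n" }
      .saveAsTextFile("target/genre-counts")

    sc.stop()
  }
}
```

Note that saveAsTextFile writes a directory of part files rather than a single CSV; if the grader expects one file per task, a coalesce(1) before the save is a common workaround on data this small.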
Step 3 - Write unit test cases for Steps 2.1 and 2.2.
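For Step 3, a full answer would typically use ScalaTest (declared as a test dependency) with a local SparkContext shared across suites. As a minimal sketch of the idea, one way to keep the logic testable without a cluster is to factor the Step 2.1 aggregation into a pure function and assert on small in-memory samples; the names MovieStatsSpec and averageRatings below are my own illustration, not part of the assessment:

```scala
object MovieStatsSpec {
  // Pure mirror of the Step 2.1 RDD aggregation, so the per-movie arithmetic
  // can be unit-tested without starting a SparkContext.
  def averageRatings(ratings: Seq[(Int, Double)]): Map[Int, (Int, Double)] =
    ratings.groupBy(_._1).map { case (movieId, rs) =>
      movieId -> (rs.size, rs.map(_._2).sum / rs.size)
    }

  def main(args: Array[String]): Unit = {
    val stats = averageRatings(Seq((1, 4.0), (1, 2.0), (2, 5.0)))
    assert(stats(1) == (2, 3.0)) // movie 1: two users, average 3.0
    assert(stats(2) == (1, 5.0)) // movie 2: one user, average 5.0
    println("unit checks passed")
  }
}
```

The same shape works for Step 2.2 by asserting on a genre-count helper fed a handful of hand-written movies.dat lines.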

Step 4 - Create a simple readme text file with the spark-submit command and a set of instructions on how to run each part of Step 2.
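For Step 4, a sketch of what the readme's run instructions might contain is shown below; the class name, JAR path, and master setting are placeholders that depend on how you named and packaged your own project:

```shell
# 1. Build the project JAR from the project root
sbt package

# 2. Run the Step 2 jobs locally; outputs land under target/ as described above
spark-submit \
  --class MovieStats \
  --master "local[*]" \
  target/scala-2.12/sparkdemo_2.12-0.1.jar
```

A readme like this lets the reviewer regenerate every output without opening the code, which is exactly why the zip must not contain output files.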

Final Note:

Try this assessment on your own machine and validate the results yourself. If you need answers to these questions, post a GitHub link to your answer or share your answers by mail; I will verify them and send you the link to the answers I prepared for your verification.

I hope that by solving this you will be ready to face any online coding round in Spark Scala. Leave your comments on this tutorial. I am happy to hear from you.

Happy Learning!!! 
