
Vacancy Recommender Hackathon with Spark



BigData Republic organized a small hackathon for the Big Data course I currently follow at university. The challenge was to build a job recommendation system using real data from one of their clients, RandStad, a big employment agency. To my surprise, I ended up with the highest score and went home with a nice book as a prize. I was fully convinced that my score was very low, and I know for a fact that the road to victory had far less to do with intelligence than with strategic pragmatism. I will not share the Spark notebook itself, as the data we worked with is not open and much of the code was already provided by BigData Republic. Nevertheless, I did gain some insights that I would like to share.

The challenge

Employment agencies such as RandStad want to show customers looking for a job the most relevant vacancies, given their preferences. The challenge for this hackathon was to build a recommender system that predicts a top 15 of vacancies that can be shown to the user.

Data

All data was anonymized.

  • A dataset containing information about the behavior of clients in the web interface of RandStad. It stores whether users opened a particular vacancy, started an application, or finished one, alongside further information about that vacancy, such as the hours per week, the hourly wage, etc.
  • A dataset of user profiles storing user preferences, such as the desired wage, minimum and maximum working hours, and maximum travel distance.
  • A dataset of vacancies, of which we will make a selection for recommendation.

Architecture of the solution

The basic model used for recommendation is collaborative filtering using alternating least squares (ALS).

There are two basic ingredients for this type of recommendation system:

1) We have some data of users using some items, e.g. buying products in a supermarket. We can represent this in a user-item matrix. However, most users do not buy all items, and most items are not bought by all users, so this matrix is sparse, i.e. mostly filled with zero entries.

2) We thus need some way to associate users with products they have not bought yet, based on the knowledge we already have of user preferences for particular products, so we can potentially recommend those products. In other words, the zero entries need to be filled in with a preference estimate. Collaborative filtering with ALS does this by factorizing the user-item matrix into two lower-dimensional matrices: one mapping users onto a number of latent factors (a “user profile”), and one mapping these latent factors back onto the items (an “item profile”). ALS alternately optimizes these two matrices so that their product approximates the original input matrix. From these smaller estimated matrices of latent factors, it is possible to re-compute the user-item association matrix, which now has preference scores for items that previously had zero entries.
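As a toy illustration of this idea (with made-up users, items, and numbers, not the hackathon data): once the factorization is learned, every user and every item has a small vector of latent factors, and a predicted preference is simply the dot product of the two.

```scala
// Toy illustration of ALS-style matrix factorization (made-up numbers).
// Each user and each item gets a small vector of latent factors;
// a predicted preference is the dot product of the two vectors.
object FactorizationToy {
  def dot(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Hypothetical learned factors with 2 latent dimensions.
  val userFactors = Map(
    "alice" -> Vector(0.9, 0.1), // leans towards the first factor
    "bob"   -> Vector(0.2, 0.8)  // leans towards the second factor
  )
  val itemFactors = Map(
    "management" -> Vector(1.0, 0.0),
    "cashier"    -> Vector(0.0, 1.0)
  )

  // Re-computed user-item association: fills in entries that were zero.
  def predict(user: String, item: String): Double =
    dot(userFactors(user), itemFactors(item))
}
```

Here `FactorizationToy.predict("alice", "management")` yields 0.9 even if Alice never interacted with a management vacancy, which is exactly how the zero entries get a preference estimate.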

To implement this model in Spark, there are two major things to take into consideration:

Implicit versus explicit feedback

Preferences of users for particular products can be explicit, for example when you ask users to rate the products they buy on a scale from 1 to 10 in a questionnaire. However, one can also have an implicit measure of preference. If, for example, a particular customer very often buys cucumbers, we can infer that this user has a preference for cucumbers, even though we do not have an explicit normalized rating of cucumbers.

When it comes to Big Data, it is more likely that you have implicit preference data at your disposal. In the case of this hackathon, the indirect information we have about customer preference is a log of which vacancies users click on in the vacancy search engine of RandStad. If a user clicks more on a particular type of vacancy, e.g. management functions, we can infer that this user prefers management functions over, say, being a cashier in a supermarket.
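The conversion from a click log to implicit ratings can be sketched in plain Scala (field names and the click log are made up for illustration; the real pipeline did this with Spark DataFrames):

```scala
// Sketch: turning a raw click log into implicit ratings by summing
// interaction weights per (user, function) pair.
case class Click(user: String, function: String, action: Int)

def implicitRatings(clicks: Seq[Click]): Map[(String, String), Int] =
  clicks
    .groupBy(c => (c.user, c.function))
    .map { case (key, cs) => key -> cs.map(_.action).sum }
```

A user who clicked management vacancies twice ends up with a higher implicit rating for "management" than for anything clicked once, without ever rating anything explicitly.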

Cold-start problem

Another challenge for this setup is the so-called cold-start problem. Computing a user-item association matrix for a given set of users and items is computationally quite expensive. But in the case of a big employment agency, new job vacancies come in continuously. Unless you retrain the whole model, you cannot recommend these new vacancies, which is obviously very undesirable. At the same time, it is prohibitive to continuously redo all your work to include these vacancies in real time.

The workaround suggested by the people from BigData Republic and used in this hackathon is to train the recommendation model not on user-vacancy preferences, but on user-function preferences. This is a good solution because function titles are not as volatile as individual vacancy descriptions. In other words, when a new vacancy comes in, we already know the preference of a user for its function title, because the ALS model has been trained on many other vacancies with the same function description.

We thus end up with a model like this (written in Scala):

val als = new ALS()
  .setMaxIter(20)                 // number of ALS iterations
  .setRegParam(0.001)             // regularization strength
  .setRank(10)                    // number of latent factors
  .setUserCol("candidate_number")
  .setItemCol("function_index")
  .setRatingCol("rating")
  .setImplicitPrefs(true)         // ratings are implicit click counts
val model = als.fit(grouped_train)

Here grouped_train is the data of user clicks, with vacancies grouped under their function name.

Recommending vacancies

Given that basic model, however, we have a recommendation score for functions, not for vacancies. If we take the top 3 preferred functions for a user and then join all vacancies on these function descriptions, we end up with a very large list of recommended vacancies for that user.

Therefore, the rest of the work in the hackathon was to come up with a good way of selecting a top 15 from this long list of vacancies. This is done by joining in profile data containing further user preferences, such as the desired wage, working times, and maximum traveling distance. Based on that information you can either filter out vacancies or integrate these preferences into a final weighted recommendation score.

The end result of this whole process is a top 15 of vacancies to first display to the end user.

Parameter optimization and weighing factors for a final prediction

Everyone used the same general approach with the ALS model, so what distinguished my solution from the others were 1) the model parameters and 2) the further scoring and processing of vacancies based on profile data.

This is where the hackathon really started feeling “hacky” to me.

A major practical limitation was that I was running a Spark notebook on a real-life data problem, within Docker, on an old ThinkPad with limited computing power and memory. This effectively resulted in the Spark notebook kernel dying on me regularly, so running the whole data pipeline even once was quite a hassle. Fancy techniques to search for optimal parameter settings were thus out of the question for me, and I had to resort to playing around with parameters manually.

Especially because running the whole process took a while, I really wanted to be smart about which parameter combinations I tried out. But the somewhat disappointing answer (not a bad answer, though) I got from one of the BigData Republic people was that there are no very specific rules of thumb, for example for choosing the number of latent factors in the ALS model. Normally, instead of having 12 GB of working memory, similar Spark code would be run on a cluster with 1 TB of working memory… which allows an automated search for the best parameter settings.

From there on, pragmatism took over. With respect to model parameters, the adage “higher is better” did not hold for me: first of all because it made my PC crash, and secondly because the risk of overfitting on the training data became larger. So compared to the default ALS parameters, I actually only lowered them: fewer iterations and fewer latent factors in the matrix factorization.

The largest improvement in my final score was achieved by using the profile data and weighing various factors differently. We computed a score for whether the vacancy matched the preferred working hours, and a normalized score for how far away the job is from the candidate. These factors, together with the recommendation score for the function title of a particular vacancy, were weighed together to produce a final score per vacancy. It turned out that people care a lot about how far away the job is, and I gave this factor a very big weight of 10:1 compared to the recommendation score for the actual function title (but note that only vacancies for the top 3 function descriptions were taken into account, so the ALS model already fulfilled its purpose).
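The weighting step can be sketched like this. The field names, the distance normalization, and the hours-match weight are illustrative assumptions; only the 10:1 ratio between the distance score and the ALS function score comes from what I actually did.

```scala
// Sketch of the final weighted score per vacancy (illustrative weights,
// except the 10:1 distance-vs-function ratio mentioned in the text).
case class ScoredVacancy(
  functionScore: Double, // ALS score for the vacancy's function title
  distanceKm: Double,    // distance from the candidate to the job
  hoursMatch: Boolean    // does the vacancy fit the preferred hours?
)

def finalScore(v: ScoredVacancy, maxDistanceKm: Double = 100.0): Double = {
  // Closer vacancies score higher, normalized to [0, 1].
  val distanceScore = math.max(0.0, 1.0 - v.distanceKm / maxDistanceKm)
  val hoursScore    = if (v.hoursMatch) 1.0 else 0.0
  10.0 * distanceScore + 1.0 * v.functionScore + 1.0 * hoursScore
}

def top15(vacancies: Seq[ScoredVacancy]): Seq[ScoredVacancy] =
  vacancies.sortBy(v => -finalScore(v)).take(15)
```

With weights like these, a nearby vacancy with a mediocre function score easily outranks a far-away vacancy with a great one, which matches the observation that distance dominates.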

Result and reflection

The final score for the competition was a very simple recall measure, i.e. what percentage of the vacancies candidates actually applied for (which can be extracted from the dataset of browsing behavior) was recommended in the top 15 by the recommendation model. My final recall score on the test set was 16.8% (19.8% on the validation set). For comparison, a baseline performance of 2.9% was calculated by always predicting the 15 most popular vacancies.
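That recall measure boils down to a few lines (a sketch per candidate; the hackathon computed this over all candidates, and the identifiers here are made up):

```scala
// Recall@15: the fraction of vacancies a candidate actually applied for
// that appear in the recommended top 15.
def recallAt15(recommended: Seq[String], applied: Set[String]): Double = {
  val top = recommended.take(15).toSet
  if (applied.isEmpty) 0.0
  else applied.count(top.contains).toDouble / applied.size
}
```

For example, if a candidate applied to two vacancies and one of them is in the recommended list, the recall for that candidate is 0.5.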

I thought my score was pretty low (and I’m sure it is), so I was very surprised to win. But given that all competitors were beginners and faced similar hardware issues as I did, the playing field of recall scores ranged roughly between 13 and 17%. People with more interesting ideas about parameter optimization were probably unsuccessful in their efforts due to serious hardware limitations. Perhaps some also put more effort into optimizing their ALS model, only to see it overfit on the training data and drop sharply in score on the test data. The overall impression I am left with is that real data science is extremely hard to do properly. For the mortals not designing the algorithms and data structures themselves, the most intelligence is required for choosing the right methods for the problem at hand and making smart design decisions about what information to exploit. But apart from that, I have the feeling that the average attitude is: please don’t ask too much about the internals of the algorithms or the meaning of a parameter setting. I suspect that for many people in the data business, “data science/engineering” is mostly slapping together pre-existing models and making computers crunch a lot on optimizing them.

Tools used

  • Docker
  • Scala
  • Spark ML
  • Spark Dataframes
  • Spark SQL
  • My poor old ThinkPad


Comments

Alex on Monday, May 20, 2019:

Sounds like you’re at the level where you can do useful things with what you know 🙌. In the shallow ‘you can get money for this’ sense.

Curious to see what you’ll do.

Regarding the article, this sadly seems to be the first that has bits that I as a layman can’t follow. Specifically the collaborative filtering with ALS. That said, I’m impressed you managed to make everything else accessible even though I hadn’t heard of most of these concepts. Quite the feat 👍.

Finally, why run computationally intensive work on your laptop? It sounds like you could’ve done significantly better with more power. It’s not what laptops are for, and luckily for us compute power is cheap!

Edwin on Monday, May 20, 2019
In reply to Alex

Thanks for reading and responding!

Leaving ALS and Collaborative Filtering unexplained indeed goes against the intention of this blog, where I try to explain as much as possible in simple terms. Nevertheless, it was a conscious decision not to explain them in detail, because that would be a post in itself. On top of that, in line with my conclusion, for most people it’s really not that necessary or interesting to know all the details. I know you are not one of those people, of course. For a more in-depth explanation, you can look here.

W.r.t. the computing on a laptop… that’s just a bit stupid. Not every student gets or has access to a cluster (most don’t). For educational purposes it’s also not absolutely necessary, and the playing field is more level if everyone just uses their own machines. Obviously, for any real project this is an absolute no-go! :-)

Edo on Monday, May 20, 2019:

Really cool post. As a web developer it’s really cool to read an approachable article from the trenches. From my perspective, I see a lot of people being interested in ML, but I see very few practical applications in the wild.

The roadblocks that you describe hitting remind me of my early programming days. It’s hard to explain to outsiders what kind of problems you’re experiencing day-to-day, when all they really understand is that you “write code” or “build websites”. Good luck explaining how you’re debugging “Undefined is not a function” on a daily basis.

One question that I have is whether these “function names” that you’re learning against, are these a fixed set of predefined names, or were they free form text?

Edwin on Monday, May 20, 2019
In reply to Edo

Thanks for your response! The function names are read out from a database with the clicking behaviour of people. They were recorded per clicking action, so in order to make them available for ML some preprocessing was needed. The algorithm only deals with numeric values, not strings, so we need to index the function names:

val indexer = new StringIndexer()
  .setInputCol("function_name")
  .setOutputCol("function_index")
  .setHandleInvalid("skip")

// Fit the indexer
val stringIndexerModel = indexer.fit(clicks_train)

And then we need to train the model on the click information aggregated over functions per candidate:


// By transforming the stringindexer we get a new column 'function_index'
val clicks_train_indexed = stringIndexerModel.transform(clicks_train)
val clicks_val_indexed = stringIndexerModel.transform(clicks_val)

val grouped_train = clicks_train_indexed
  .groupBy("candidate_number", "function_index")
  .agg(sum("action").alias("rating"))

Alex on Monday, May 20, 2019
In reply to Edo

@Edo I think you can explain these things to those that are curious! Take an example in Feynman. After watching some ingenious explanations by the man it seems all you need to do is put your mind to it 😄.

Edwin on Tuesday, May 21, 2019
In reply to Alex

@Alex I also think you can and should explain complicated things in clear language, and moreover that this is a true sign of understanding. In that regard, Feynman is an absolute genius of course, even though he claims himself that he is an average man that worked hard … and in that regard I could have done a better job in this post. But don’t underestimate the difficulty of making things easy :-)!

