Using Machine Learning to Retrieve Relevant CVs Based on Job Description
We use the average word embeddings (AWE) model for retrieving relevant CVs based on a job description. We present a step-by-step guide in order to combine domain-trained word embeddings with pre-trained embeddings for Spanish documents (CVs). We also use Principal Component Analysis (PCA) as a reduction technique used to put similar dimensions to word embeddings results.
Information retrieval (IR) models are composed of an indexed corpus and a scoring or ranking function. The main goal of an IR system is to retrieve relevant documents or web pages based on a user request. During the retrieval, the scoring function is used to sort the retrieved documents according to their relevance to the user query. The classic IR models such as BM25 and language models are based on the bag-of-words (BOW) indexing scheme. BOW models have two major weaknesses: they lose the context where a word appears and they also ignore its semantics. Latent semantic indexing (LSI) is a technique used to handle this problem but when the number of documents increases, the process of indexing becomes computationally expensive. The standard technique used to overcome this is to train word or paragraph embeddings over a corpus or use pre-trained embeddings.
Word embeddings (WE) are distributed representations of terms obtained from a neural network model. These continuous representations have been used recently in different natural language processing tasks. The average word embeddings (AWE) is a popular technique to represent long sequences of text, not just a term.
In our case, a set of CVs is available, but job descriptions are priorly unknown and we need to provide a solution based on an unsupervised learning approach. Thus, word embeddings seem to be a good starting point for our experiments.
Architecture is defined in the next figure.
Step 1: Train Domain Word Embeddings (Trained WEs)
As a first step, we build a balanced corpus of CVs from four known job profiles: Java, Tester, SAP HCM, and SAP SD. As CVs are written in several formats and with different styles and vocabulary, we decide to use only nouns and verbs in order to obtain just the important and relevant information from the CV. Once the corpus is built, we pass it through Word2vec with the configuration parameters as follows: window size is 5, minimum word count is 3, and dimensions are 200. CBOW is the default Word2vec model used.
We use Python 3.6.1 with Anaconda 64 bit for Linux Ubuntu 16.04 LTS. To install several libraries, the
pip install command must be run as follows:
After all needed packages are installed, we create a function to retrieve all CVs from a specific folder, read them (using textract), lemmatize them (using pattern3), and finally create the word embeddings (using gensim). The python function responsible for extracting the text from CVs (PDF, TXT, DOC, DOCX) is defined as follows:
Once all the embeddings are saved into
dir_model_name and we have set the word embeddings to the global variable model, we can use PCA technique to reduce dimensions of the pretrained word embeddings.
Step 2: Download and Reduce Pretrained Word Embeddings (Pretrained PCA WEs)
After we download the Spanish pre-trained word embeddings, we observe that these vectors have 300 dimensions and our proposed domain trained embeddings have 200 dimensions. We decide to reduce 300-dimensional vectors into 200 dimensions and then we build our hybrid space with both word embeddings spaces. The following function is responsible for reducing dimensions of the pre-trained word embeddings:
Step 3: Build the Hybrid Word Embeddings Space and Retrieve Relevant Documents (CVs)
We show a service developed in the lab where we essentially load both embeddings spaces and finally select, when a request comes, the embedding space that must be used. For instance, if the user puts the title of the job “Java,” we will load the trained embedding space. When another unknown profile is typed, i.e. “Cobol Analyst,” then pre-trained word embeddings are used. Besides, for each CV and job request, a mean word embedding vector is computed. Finally, we just retrieve the top three CVs that match with the job description requirement. The following Python function is responsible for this processing block:
And that’s it!