This week, I continued looking for a suitable dataset for developing workforce planning machine learning models, and eventually settled on the employee absenteeism dataset available on Kaggle.com as the base. The main challenge with this dataset was that it has no binary variable to use as a target for classification models. Moreover, its numeric variables, i.e., length of service and absent hours, would not provide enough depth on their own to train and test machine learning models adequately. I therefore put significant effort into aggregating additional variables against each record, such as the number of employees holding a specific job title per store, the turnover rate of a position, the store location, and an estimated turnover likelihood per employee record. These derived variables were then combined to define a single classification for training and testing called “Workforce planning risk”. That variable in turn helped create a secondary classification called “Recruitment required?”, which is True when a record is both terminated (1) and high risk (1). “Recruitment required?” will be the target variable for the machine learning models, which I plan to develop over the next week.
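To make the aggregation and labelling steps described above more concrete, here is a minimal sketch in plain Python. It is not the actual pipeline: the field names (`store`, `job_title`, `terminated`), the toy records, and the 0.5 turnover threshold used to flag high risk are all assumptions made for illustration, since the post does not state how "Workforce planning risk" is calculated.

```python
from collections import Counter

# Toy records; every field name and value here is hypothetical,
# not taken from the Kaggle dataset.
records = [
    {"store": "A", "job_title": "Cashier", "terminated": 1},
    {"store": "A", "job_title": "Cashier", "terminated": 0},
    {"store": "A", "job_title": "Baker", "terminated": 0},
    {"store": "B", "job_title": "Cashier", "terminated": 1},
    {"store": "B", "job_title": "Baker", "terminated": 1},
    {"store": "B", "job_title": "Baker", "terminated": 0},
]

RISK_THRESHOLD = 0.5  # assumed cut-off for "high risk", chosen for illustration


def add_labels(rows, threshold=RISK_THRESHOLD):
    """Attach aggregated features and the two derived labels to each record."""
    # Number of records with a given job title in a given store.
    title_counts = Counter((r["store"], r["job_title"]) for r in rows)
    # Turnover rate per job title: terminated records / all records.
    totals = Counter(r["job_title"] for r in rows)
    leavers = Counter(r["job_title"] for r in rows if r["terminated"] == 1)

    labelled = []
    for r in rows:
        rate = leavers[r["job_title"]] / totals[r["job_title"]]
        # Assumed rule: a record is high risk when its position's
        # turnover rate reaches the threshold.
        high_risk = int(rate >= threshold)
        labelled.append({
            **r,
            "title_count_in_store": title_counts[(r["store"], r["job_title"])],
            "title_turnover_rate": rate,
            "workforce_planning_risk": high_risk,
            # True only when the record is terminated (1) AND high risk (1).
            "recruitment_required": r["terminated"] == 1 and high_risk == 1,
        })
    return labelled


labelled = add_labels(records)
for row in labelled:
    print(row["store"], row["job_title"], row["recruitment_required"])
```

In this toy data, only the two terminated cashiers end up with "Recruitment required?" set to True; the terminated baker does not, because the baker position falls below the assumed risk threshold.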