I’ve been working on the preprocessing for the Data Mining project (First post). Before the original data entries can be used with any data mining software, or with any algorithm I may implement, a few things need to be done to them, including making them anonymous. Together, these steps are the preprocessing. To do it I’ve made two Python scripts, which are available at GitHub.
The dataset is a registry of the academic success of every student of the Computer Engineering Department at the University of Évora since 1995. Each entry is a relation between a course, a student and the student’s success at that course. So there is an entry for every time a student has completed a course, and that entry also contains the grade. There is also an entry for every time a student has failed a course, whether because the grade simply wasn’t good enough, because the student didn’t show up for evaluations, or because of any other undesirable situation.
The dataset was built up over several years at the University of Évora, since the start of the Computer Engineering course. It has 52265 entries and each entry has 19 fields. There are entries for undergraduate courses, master’s degree courses and PhD courses. Undergraduate courses account for 49461 entries, while the master’s and PhD courses account for only 2438 and 366, respectively.
Because of changes over the years, there are quite a few courses that no longer exist, courses that changed name, and courses that still exist with the same name but with different course codes. This makes it hard to identify which courses are the same over the years.
The first thing to do is to remove entries that will be useless. Because master’s and PhD courses have so few entries, those entries are removed from the preprocessed dataset.
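As a rough sketch of this first filter, assuming each entry has been loaded as a dict, the step could look like the code below. The field name `level` and its values are hypothetical, since the real column names are not given here, and this is not necessarily how the actual scripts do it.

```python
# Hypothetical field name and values; the real dataset's columns may differ.
def keep_undergraduate(entries, level_field="level", undergraduate="undergraduate"):
    """Return only the entries whose degree level is undergraduate."""
    return [e for e in entries if e[level_field] == undergraduate]


sample = [
    {"student": 1, "course": "A", "level": "undergraduate"},
    {"student": 2, "course": "B", "level": "masters"},
    {"student": 3, "course": "C", "level": "phd"},
]
undergrad_entries = keep_undergraduate(sample)
print(undergrad_entries)
```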
Inside the University there are many cases of students who start an undergraduate degree in another area, unrelated to computer engineering, and then change to computer engineering. When that happens their previously completed courses come into the dataset, so there are many entries for courses that appear only once or twice. These rare courses are removed from the final dataset.
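One possible way to drop these courses is to count the entries per course and keep only the courses that show up often enough. In the sketch below the field name `course_code` is hypothetical, and the threshold of three is an assumption taken from the remark further down that the final dataset has no courses with only one or two entries.

```python
from collections import Counter


# Hypothetical field name; min_entries=3 is an assumed threshold.
def drop_rare_courses(entries, course_field="course_code", min_entries=3):
    """Remove every entry whose course appears fewer than min_entries times."""
    counts = Counter(e[course_field] for e in entries)
    return [e for e in entries if counts[e[course_field]] >= min_entries]


sample = [
    {"student": 0, "course_code": "INF1"},
    {"student": 1, "course_code": "INF1"},
    {"student": 2, "course_code": "INF1"},
    {"student": 2, "course_code": "BIO9"},  # brought in from another degree
]
filtered = drop_rare_courses(sample)
print(sorted({e["course_code"] for e in filtered}))
```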
Some interesting things to note about the dataset are the changes over the years in terms of existing courses. Some older courses show up in the first years of the dataset but not in recent years, because they no longer exist. Other courses kept the same name but changed course code. These are kept in the final dataset, but they are treated as different courses for two reasons. First, joining both entry types into a single one is not easily done automatically. Second, even if the courses have the same name there might have been considerable changes, so considering course A of 1997 to be different from course A of 2011 may be desirable.
One very important thing in this dataset is that the student identification needs to be anonymous. Every entry has both the student’s number and the student’s name. In the final version the name is simply removed, because it serves no purpose for the analysis, and the number is replaced by another value: an incrementing integer starting from 0. So one student will have their number replaced by 0, the next by 1, and so on.
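A minimal sketch of this anonymization step, assuming entries are dicts and using the hypothetical field names `student_number` and `student_name`, could look like the following. New identifiers are handed out in order of first appearance, which is an assumption and not necessarily the order the real scripts use.

```python
# Hypothetical field names; the real dataset's columns may differ.
def anonymize(entries, number_field="student_number", name_field="student_name"):
    """Drop the student's name and replace the number with 0, 1, 2, ..."""
    new_ids = {}  # original student number -> anonymous integer
    result = []
    for entry in entries:
        cleaned = {k: v for k, v in entry.items() if k != name_field}
        original = cleaned[number_field]
        if original not in new_ids:
            new_ids[original] = len(new_ids)  # next unused integer
        cleaned[number_field] = new_ids[original]
        result.append(cleaned)
    return result


sample = [
    {"student_number": 20345, "student_name": "Ana", "course": "A"},
    {"student_number": 19876, "student_name": "Rui", "course": "A"},
    {"student_number": 20345, "student_name": "Ana", "course": "B"},
]
anonymous = anonymize(sample)
print([e["student_number"] for e in anonymous])
```

The same student always maps to the same integer, so the entries of one student stay linked. If the input happens to be sorted by student number, the new identifiers keep that order, so shuffling the students first may be worth considering.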
The final thing to consider in the dataset is some missing values and some redundant information. The final three fields of the dataset are the grade of the entry; the status, which says whether the student was approved, skipped the evaluation, or had the evaluation canceled; and whether that is the final result. To make things simple, the status of an entry will be binary, so the student was either approved at the course or not. In many entries where the student wasn’t approved there is no grade; in the final version of the dataset those entries all get a grade of 0. The last field, detailing if that is the final result, will be removed because it doesn’t add anything new to the dataset.
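These three changes can be sketched as a single function applied to each entry. The field names `status`, `grade` and `is_final`, and the status value `approved`, are hypothetical. The sketch also assumes a missing grade is stored as `None` or an empty string, and it only fills in missing grades, leaving existing ones untouched.

```python
# Hypothetical field names and status values; adjust to the real columns.
def normalize_result(entry, status_field="status", grade_field="grade",
                     final_field="is_final", approved_value="approved"):
    """Make the status binary, fill missing grades with 0, drop the final flag."""
    cleaned = {k: v for k, v in entry.items() if k != final_field}
    # Anything other than an approval (failed, skipped, canceled) becomes 0.
    cleaned[status_field] = 1 if cleaned[status_field] == approved_value else 0
    # Assumed encodings of a missing grade: None or an empty string.
    if cleaned.get(grade_field) in (None, ""):
        cleaned[grade_field] = 0
    return cleaned


sample = [
    {"course": "A", "grade": 14, "status": "approved", "is_final": "yes"},
    {"course": "B", "grade": "", "status": "skipped", "is_final": "yes"},
    {"course": "C", "grade": None, "status": "canceled", "is_final": "no"},
]
normalized = [normalize_result(e) for e in sample]
print(normalized)
```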
The final result is a dataset with 49299 entries and 12 fields. The entries are all anonymous and there are no missing values. No course is left with only one or two entries. All values of grades and approval status are normalized.
The next steps of this project will deal with the application of algorithms to find association rules.