Archive | November 2014

Data Mining Project at UE – Part 1

This semester I have a course on Data Mining. This area of knowledge is concerned with exploring large quantities of information using different methods to make sense of the data. I will be blogging about my project for the next couple of months and might submit some code to GitHub. The results of this project might become useful for my Master’s Degree Thesis. I haven’t talked about that in my blog, but I will do so in the future.

Dataset Used and Preprocessing

For the Data Mining project there was an opportunity to work with a dataset from my university, which is the University of Évora (UE). The dataset comes from the Computer Engineering Department (this may not be the official name in English). Each entry of the dataset consists of a relation between a course from Computer Engineering, information about the student, whether or not the student finished that course, and the final grade, if there is one. The dataset has information on students in undergraduate courses and in master’s degree courses. The first entries were made in the academic year of 1994/1995, when the course first started at the University, and there are a bit over 52000 entries in total.

The first thing that needs to be done with the dataset is some preprocessing. The existing dataset needs to be processed and transformed into a working dataset, with a few changes that are described below.

First, the dataset contains the names and numbers of students. To keep things anonymous, the names will be removed from the final working dataset. Student numbers are a little bit problematic. The student number for undergraduates at UE always starts with the letter ‘l’ and is followed by an integer, for example l27402. The master’s number has the same form but starts with an ‘m’, for example m11153. In the dataset both numbers are stored without the letter, so an older student’s undergraduate number may collide with a newer student’s master’s number. Fortunately, it is possible to tell whether an entry belongs to an undergraduate or a master’s student, because undergraduate and master’s courses have different codes.
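The name removal and the number disambiguation described above could be sketched in Python roughly like this. The field names (`name`, `number`, `course`) and the course codes are made up for illustration, since the real layout of the dataset is not shown here.

```python
# Hypothetical master's course codes; the real codes are not listed in this post.
MASTERS_COURSES = {"MEI-DM", "MEI-ML"}


def student_key(number, course_code, masters_courses=MASTERS_COURSES):
    """Restore the 'l' or 'm' prefix from the course the entry refers to."""
    prefix = "m" if course_code in masters_courses else "l"
    return f"{prefix}{number}"


def anonymize(entry, masters_courses=MASTERS_COURSES):
    """Return a copy of the entry without the name, keyed by the prefixed number."""
    # Drop the name and the bare number; every other field is kept as-is.
    cleaned = {k: v for k, v in entry.items() if k not in ("name", "number")}
    cleaned["student"] = student_key(entry["number"], entry["course"], masters_courses)
    return cleaned


# Hypothetical entry, only to show the call.
entry = {"name": "Jane Doe", "number": 11153, "course": "MEI-DM", "grade": 15}
print(anonymize(entry))
```

With the prefix restored, the undergraduate 11153 and the master’s student 11153 end up with different keys (l11153 and m11153), so they no longer collide.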

Another issue with the entries in the dataset is missing values, and there are a lot of them. An ideal entry should contain information on the student, information on the course, and information about the completion of that course: whether the student succeeded or failed, whether the course was done during the normal season or another season, and the final grade in case of success, among other things. However, there are many entries with no information on the success of the student, no grade, and no explicit statement that the student failed the course. For now it is assumed that such a case means that the student simply didn’t take the course. Making assumptions about the data needs to be done carefully. In some cases it might be better to simply disregard entries.
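The assumption about incomplete entries could be written down as a small labelling function. This is only a sketch: the field names `passed` and `grade` and the label strings are hypothetical, and a missing value is represented by `None`.

```python
def outcome(entry):
    """Classify an entry by what it says about the completion of the course."""
    passed = entry.get("passed")  # True, False, or None when missing
    grade = entry.get("grade")    # a number, or None when missing

    if passed is True:
        return "passed"
    if passed is False:
        return "failed"
    if grade is not None:
        # A grade without a pass/fail flag: keep it apart for manual inspection.
        return "inconsistent"
    # No flag and no grade: assumed to mean the student never took the course.
    return "not_attempted"


# Hypothetical entries, only to show the call.
entries = [
    {"course": "Programming I", "passed": True, "grade": 14},
    {"course": "Calculus I", "passed": None, "grade": None},
]
print([outcome(e) for e in entries])
```

Keeping the assumed cases under their own label, instead of merging them with real failures, makes it easy to drop them later if the assumption turns out to be wrong.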

What Will Be Done With The Dataset

There are a lot of things that can be done with the dataset. The initial idea was to use it for churn prediction. This kind of prediction is commonplace at ISPs, telecommunications companies, insurance companies, and most companies that provide some sort of service. The objective is to determine when a client will cancel a certain service to go to a competing company, to ask for another similar service, or to try to renegotiate some payment scheme. If there is a good enough model to predict client cancellation, then the company can have some sort of mechanism to keep a particular client from leaving. For example, if a client is leaving because another company seems to offer a better service, the company may step in and propose a new deal. For students in colleges there aren’t many services being offered, but it is still useful to determine when a student may leave the course, or how long the student will take to finish it.

Another thing that might be useful to do with this dataset is to look for association rules. It is common to find cases where students who finished Programming I with certain grades in the first semester will most probably finish Programming II in the following semester with similar grades, or where a student who finished Calculus I will probably finish Calculus II. Other similar rules are found throughout the course. The objective here is to find such association rules automatically. The rules may take into account not only the completion of a course but also the grade, because completing a course with an 18 is vastly different than completing the same course with an 11. They may also take into account the time it takes to finish courses, meaning that completing courses A and B one year apart is different than completing the very same courses several years apart, even with the same grades.
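To make the idea of a rule such as “Programming I implies Programming II” concrete, here is a minimal sketch of the two usual measures for an association rule, support and confidence. It assumes that each student has already been reduced to the set of courses they completed, and the data is made up.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    transactions is a list of sets, one set of completed courses per student.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    total = len(transactions)
    # Students who completed every course in the antecedent.
    with_antecedent = sum(1 for t in transactions if antecedent <= t)
    # Students who completed every course in both sides of the rule.
    with_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

    support = with_both / total if total else 0.0
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Made-up data: four students and the courses they completed.
students = [
    {"Programming I", "Programming II"},
    {"Programming I"},
    {"Programming I", "Programming II", "Calculus I"},
    {"Calculus I"},
]
support, confidence = rule_metrics(students, {"Programming I"}, {"Programming II"})
print(support, confidence)
```

Grades or time between courses could be brought into the same scheme by turning them into items, for example “Programming I with 16 or more”, although how to bin them is a choice that still has to be made.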

Something else that can be done is to try to find clusters of good and bad students, or of good, average and bad students. What makes a good student is not as trivial as people might think. The final grade of a student in the course is not at all representative of how good the student was as an undergraduate and, of course, of how good that student will be in any other academic or professional pursuit. It is plausible that there will be clusters of students who finish the course quickly but with grades only a little above average, while at the same time there will be clusters of students who have great averages but take longer to complete the course.
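One way such clusters might be looked for is k-means over a couple of features per student. The sketch below is a plain implementation of Lloyd’s algorithm and assumes each student is summarised as a pair (average grade, years to finish). Both the features and the numbers are hypothetical, and this is not necessarily the algorithm the project will end up using.

```python
def assign(points, centroids):
    """Return, for each point, the index of the nearest centroid."""
    labels = []
    for p in points:
        distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels


def kmeans(points, centroids, iterations=20):
    """Lloyd's algorithm starting from the given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for index, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == index]
            if members:
                # New centroid is the coordinate-wise mean of its members.
                updated.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                # An empty cluster keeps its previous centroid.
                updated.append(old)
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Made-up students: (average grade, years to finish the course).
points = [(12.0, 3.0), (12.5, 3.0), (17.0, 5.0), (16.5, 6.0)]
centroids, labels = kmeans(points, [(12.0, 3.0), (17.0, 5.0)])
print(centroids, labels)
```

In practice the features would need to be put on a comparable scale first, otherwise the one with the larger range dominates the distance.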

The objectives described above still have to be worked on and, as of the writing of this blog post, it is not possible to know if any of them will work with this dataset.

Tools

The classes of the Data Mining course use Weka for exercises. It is possible to use the many algorithm implementations available in Weka from a Java program, so I’ll probably be using Java to work with them. The idea is to have a Java program that does the necessary preprocessing and then uses the Weka algorithms to do whatever mining is necessary. In addition to using algorithms that are already implemented, it might be interesting to modify or implement some data mining algorithms. Those implementations would be done in Java and would probably be better suited to this specific dataset.

If some algorithms are implemented from scratch, it might be interesting to use them with existing datasets that have been the subject of other data mining studies.

Another possibility is to simply use Python. Basically, if new algorithms are going to be implemented, they will be implemented in Python. The preprocessing may also be done in Python. It would also be interesting to get the ProbPy Python library working for the mining.