This has been a very busy semester at UÉ, so I haven’t written much for this blog. In this post I give an update on something that has been occupying me recently: Neural Networks.
I’ve taken a course on Machine Learning. One of the topics I wanted to explore was Neural Networks. To do so I started by reading books on the subject. Next I found a few things online and worked from those.
The book “Machine Learning”, by Tom Mitchell, was a great starting point. It talked about Perceptrons and Sigmoids, gave an introduction to Gradient Descent, introduced the concept of a Neural Network built from Sigmoid Neurons, and finally talked about Backpropagation and how this algorithm is used to train the network.
After that I looked for things online. Of course, there are many resources available. I found two websites to be of particular interest. These:
The first website, “Neural Networks and Deep Learning”, really caught my attention. It explains NNs in full detail. It goes on to show an example of using NNs to recognize handwritten digits using the famous MNIST Dataset. What the network does is take a handwritten digit from 0 to 9 in an image of 28×28 pixels and recognize which digit it actually represents. The site explains all the equations used in detail and also provides a Python+NumPy implementation.
I wanted to implement my own version of the algorithms. I prefer to do so because I learn much more than by just using an implementation found on the Internet. My version is very similar to the one by the author of the book, but it does have a few differences.
Implementing this algorithm took a few hours of understanding every little detail of the equations that describe how Backpropagation calculates the derivatives of the quadratic error function, how Gradient Descent uses those results, how training is done, and so on. It was worth it.
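Those equations can be sketched in a few lines of Python+NumPy. The network below is a hypothetical toy (2 inputs, a hidden layer of 3 sigmoid neurons, 1 output, trained on XOR-style data), not the code from my repository, but the forward pass, the Backpropagation derivatives of the quadratic error, and the Gradient Descent step have the same structure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)

# Toy network: 2 inputs -> 3 hidden sigmoid neurons -> 1 output
W1, b1 = rng.normal(size=(3, 2)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))

# XOR-style training set, columns are examples
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float).T  # 2x4
Y = np.array([[0, 1, 1, 0]], dtype=float)                      # 1x4

def loss():
    # quadratic error 0.5 * sum((output - target)^2)
    return 0.5 * np.sum((sigmoid(W2 @ sigmoid(W1 @ X + b1) + b2) - Y) ** 2)

initial_loss = loss()
eta = 1.0
n = X.shape[1]
for _ in range(5000):
    # forward pass
    z1 = W1 @ X + b1; a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
    # backward pass: Backpropagation of the quadratic error derivatives
    d2 = (a2 - Y) * sigmoid_prime(z2)
    d1 = (W2.T @ d2) * sigmoid_prime(z1)
    # Gradient Descent step (gradients averaged over the batch)
    W2 -= eta * (d2 @ a1.T) / n; b2 -= eta * d2.mean(axis=1, keepdims=True)
    W1 -= eta * (d1 @ X.T) / n;  b1 -= eta * d1.mean(axis=1, keepdims=True)
final_loss = loss()
```

The delta terms `d2` and `d1` are the usual layer-wise error derivatives; everything else is just bookkeeping around them.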
The code can be found on my GitHub page.
I’ve made two simple experiments. They consist of creating a training set of random points, each classified as belonging to a given class. The network then learns from that training set and is able to classify new points into their correct classes.
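A training set like the ones in those experiments could be generated along these lines. The labelling rule here (which side of the line y = x a point falls on) is an illustrative assumption, not necessarily the one I used:

```python
import numpy as np

rng = np.random.default_rng(1)

# 200 random 2D points in the square [-1, 1] x [-1, 1]
points = rng.uniform(-1.0, 1.0, size=(200, 2))

# class 1 if the point lies above the line y = x, class 0 otherwise
labels = (points[:, 1] > points[:, 0]).astype(int)
```

Feeding `points` and `labels` to the network and then classifying fresh random points shows whether the decision boundary was actually learned.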
I don’t have anything too formal to show yet, nor any real example. In the following weeks I will be looking into real applications of NNs and will use my implementation on the MNIST Dataset, like in the book.
I’ve been working on the preprocessing for the Data Mining project (first post). A few things need to be done to the original data entries to make them anonymous and to make them usable with any data mining software or any algorithm I may implement. Those things are the preprocessing. To do it I’ve written two Python scripts, available on GitHub.
The dataset is a registry of the academic success of every student of the Computer Engineering Department at the University of Évora since 1995. Each entry is a relation between a course, a student, and the student’s success in that course. So there is an entry for every time a student has completed a course, and that entry also contains the grade. There is also an entry for every time a student has failed a course, whether because the grade simply wasn’t good enough, because the student didn’t show up for evaluations, or for some other undesirable situation.
The dataset was built over several years at the University of Évora, since the start of the Computer Engineering course, and has 52265 entries, each with 19 fields. There are entries for undergraduate, master’s degree, and PhD courses. Undergraduate courses make up 49461 entries of the dataset, while master’s and PhD courses only account for 2438 and 366 entries, respectively.
Because of changes over the years, there are quite a few courses that no longer exist, courses that changed name, and courses that still exist under the same name but with different course codes. This makes it hard to identify which courses are the same across the years.
The first thing to do is to remove entries that will be useless. Because of the small number of entries for the master’s and PhD courses, they are removed from the preprocessed dataset.
Inside the University there are many cases of students first starting undergraduate degrees in other areas, unrelated to computer engineering, and then switching to computer engineering. When that happens their previously completed courses come into the dataset, so there are many entries that contain unique courses. These unique courses are removed from the final dataset.
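This filtering step amounts to counting entries per course and dropping the rare ones. A minimal sketch, where the field name `course_code` is hypothetical:

```python
from collections import Counter

def drop_rare_courses(entries, min_count=3):
    # count how many entries each course has, then keep only entries
    # whose course reaches the minimum count
    counts = Counter(e["course_code"] for e in entries)
    return [e for e in entries if counts[e["course_code"]] >= min_count]
```

With `min_count=3`, courses appearing only once or twice disappear from the working dataset along with their entries.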
Some interesting things to note about the dataset are the changes over the years in terms of existing courses. Some older courses show up in the first years of the dataset but not in recent years, because they no longer exist. Courses with the same name but different course codes are kept in the final dataset but are treated as different courses, for two reasons. First, joining both entry types into a single one is not easily done automatically, and second, even if two courses have the same name there may have been considerable changes, so considering course A of 1997 to be different from course A of 2011 may be desirable.
One very important requirement for this dataset is that the student identification be anonymous. Every entry has both the student’s number and the student’s name. In the final version the name is simply removed, because it serves no purpose there in the first place, and the number is replaced by another value: an incrementing integer starting from 0. So one student will have their number replaced by 0, the next by 1, and so on.
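The replacement can be done with a simple mapping built in order of first appearance. A sketch of the idea (the actual scripts may differ):

```python
def anonymise(student_numbers):
    mapping = {}      # original number -> anonymous integer
    anonymous = []
    for number in student_numbers:
        if number not in mapping:
            # first time this student appears: assign the next integer
            mapping[number] = len(mapping)
        anonymous.append(mapping[number])
    return anonymous
```

Every occurrence of the same original number gets the same integer, so the relations between entries of one student are preserved while the identity is not.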
The final thing to consider in the dataset is missing values and redundant information. The last three fields of the dataset are the grade of the entry; the status, saying whether the student was approved, skipped the evaluation, or had the evaluation cancelled; and whether this is the final result. To keep things simple, the status of an entry will be binary: the student is either approved in the course or not. In many entries where the student isn’t approved there is no grade; in the final version of the dataset these all get 0. The last field, saying whether this is the final result, is removed because it doesn’t add anything new to the dataset.
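The cleaning of those last three fields can be sketched per entry like this. All field names here are hypothetical placeholders for the real ones:

```python
def clean_entry(entry):
    entry = dict(entry)  # work on a copy
    # status becomes binary: approved or not
    entry["approved"] = 1 if entry.get("status") == "approved" else 0
    entry.pop("status", None)
    # missing grades (failed / skipped / cancelled entries) become 0
    if entry.get("grade") is None:
        entry["grade"] = 0
    # the "is this the final result" field is redundant and is dropped
    entry.pop("final_result", None)
    return entry
```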
The final result is a dataset with only 49299 entries and 12 fields. The entries are all anonymous and there are no missing values. There are no entries for courses with only one or two entries. All grade and approval status values are normalized.
The next steps of this project will deal with the application of algorithms to find association rules.
This semester I have a course on Data Mining. This area of knowledge is concerned with exploring large quantities of information, using different methods to make sense of the data. I will be blogging about my project for the next couple of months and might submit some code to GitHub. The results of this project might become useful for my Master’s Degree Thesis. I haven’t talked about that on my blog, but I will in the future.
Dataset Used and Preprocessing
For the Data Mining project there was an opportunity to work with a dataset from my university, the University of Évora (UÉ). The dataset comes from the Computer Engineering Department (which may not be its official name in English). Each entry of the dataset consists of a relation between a Computer Engineering course, information about the student, and whether the student finished that course or not, as well as the final grade, if there is one. The dataset has information for students in undergraduate and master’s degree courses. The first entries were made in the academic year of 1994/1995, when the course first started at the University, and there are a bit over 52000 entries in total.
The first thing that needs to be done with the dataset is some pre-processing. The existing dataset needs to be processed and transformed into a working dataset, with a few changes that will be specified.
First, the dataset contains the names and numbers of students. To keep things anonymous, the names will be removed from the final working dataset. Student numbers are a little more problematic. The student number for undergraduates at UÉ always starts with the letter ‘l’ followed by an integer, for example l27402. A master’s number is the same but starts with an ‘m’, for example m11153. In the dataset both numbers are represented without the letter, so an older student’s undergraduate number may collide with a newer student’s master’s number. Fortunately, it is possible to distinguish between an undergraduate entry and a master’s entry because undergraduate and master’s courses have different course codes.
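Restoring the missing letter from the course code is enough to undo the collision. A sketch of the idea, where the rule that master’s course codes start with “M” is an assumption for illustration:

```python
def full_student_number(number, course_code):
    # pick the prefix from the kind of course the entry refers to
    prefix = "m" if course_code.startswith("M") else "l"
    return prefix + str(number)
```

Applying this to every entry before any other processing makes the student number a safe unique key again.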
Another issue with the entries in the dataset is missing values, of which there are a lot. An ideal entry should contain information on the student, information on the course, and information about the completion of that course: whether the student succeeded or failed, whether the course was done during the normal season or another season, the final grade in case of success, among other things. However, there are many entries without information on the success of the student, without any grade, and without explicitly saying that the student failed the course. For now it is assumed that such a case means the student simply didn’t take the course. Making assumptions over the data needs to be done carefully; in some cases it might be better to simply disregard such entries.
What Will Be Done With The Dataset
There are a lot of things that can be done with the dataset. The initial idea of working with the dataset dealt with churn prediction. This kind of prediction is commonplace in companies like ISPs, telecommunications companies, insurance companies, and most companies that provide some sort of service. The objective is to determine when a client will cancel a certain service to go to a competing company, ask for another similar service, or try to renegotiate some payment scheme. If there is a good enough model to predict client cancellation, then the company can have some sort of mechanism to keep a particular client from leaving; for example, if a client is leaving because another company offers a better service, then the company may step in and propose a new deal. For students in college there aren’t many services being offered, but it is still useful to determine when a student may leave the course, or how long the student will take to finish it.
Another thing that might be useful to do with this dataset deals with association rules. It is common to find that students who finished Programming I with certain grades in the first semester will most probably finish Programming II the following semester with similar grades, or that a student who finished Calculus I will probably finish Calculus II. Other similar rules are found throughout the course. The objective here is to find such association rules in an automatic way. The rules may take into account not only the completion of the course but also grading, because completing a course with an 18 is vastly different from completing the same course with an 11, and also the time it takes to finish courses: completing courses A and B one year apart is different from completing the very same courses several years apart, even with the same grades.
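The core quantities behind rules like these are support and confidence. A minimal sketch that scores one candidate rule A → B over sets of completed courses (a full Apriori-style search would generate the candidates; the course names below are just examples):

```python
def support_confidence(transactions, antecedent, consequent):
    # a transaction is the set of courses completed by one student
    a = set(antecedent)
    both = a | set(consequent)
    n_a = sum(1 for t in transactions if a <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    support = n_both / len(transactions)
    confidence = n_both / n_a if n_a else 0.0
    return support, confidence
```

Extending the transactions with grade bands or completion years would let the same machinery capture the grade and time distinctions described above.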
Something else that can be done is to try to find clusters of good and bad students, or good, average, and bad students. What makes a good student is not as trivial as people might think. The final grade of a student in a course is not at all representative of how good the student was during the degree and, of course, of how good that student will be in any other academic or professional pursuit. It is plausible that there will be clusters of students who finish the course quickly but have grades just a little above average, while at the same time there will be clusters of students with great averages who take longer to complete their courses.
The objectives described still have to be worked on, and as of the writing of this blog post it is not possible to know whether any of them will work with this dataset.
The classes of the Data Mining course use Weka for exercises. It is possible to use the many algorithm implementations available in Weka from a Java program, so I’ll probably be using Java to work with them. The idea is to have a Java program that does the necessary preprocessing and then uses the Weka algorithms to do whatever mining is needed. In addition to using ready-made algorithms, it might be interesting to modify or implement some data mining algorithms myself. Those implementations would be done in Java and would probably fit this specific dataset better.
If some algorithms are implemented from scratch, it might be interesting to use them on existing datasets that have been subject to other data mining studies.
Another possibility is to simply use Python. Basically, if new algorithms are going to be implemented, they will be implemented in Python. The preprocessing may also be done in Python. One thing that would be interesting is to get the ProbPy Python library working for the mining.