
ProbPy Tests – Part 1

Now that ProbPy is working and being used (by me), two things must be assured, both paramount for the library. One is correctness: the library must produce correct results and be reliable in doing so. The other is temporal and spatial efficiency. In this blog post I'll make a first pass at the second point.

Scripting languages such as Python always have hidden machinery working behind the scenes that may increase execution time without us realizing it. In Python, as in many other scripting languages, simple syntactic constructions that take up very few lines of code can end up consuming a lot of execution time. Naturally, we need to be careful with what is used in a Python program in order to make it as fast as possible.

To begin with, ProbPy intends to offer an abstraction for factors and probabilistic operations. Every time a product between two factors is made, it takes a little more than just the element-wise product between the factors to get to the result. So how is ProbPy in terms of speed?

To test this, the same script was executed against the current version of ProbPy and against an older version from the repository. That older version was made right before a few improvements that speed up the algorithms. To set up these experiments, the main ProbPy repository was cloned to another directory.

Setting Up

To clone the repository just do:

git clone /path/to/ProbPy

Now, there are two important commits for this test, one is the current master:

e366ecd78f91702a836efb30a9771bb8190326a2

And the other is just an older commit made before some improvements:

aec6f44b1cd557bc238bc8eb5f3a2211705872bd

To jump from one commit to another just do one of the following commands:

git checkout e366ecd78f91702a836efb30a9771bb8190326a2
git checkout aec6f44b1cd557bc238bc8eb5f3a2211705872bd

To measure execution time, two scripts were made. They are exactly the same, but one is used with the older version of ProbPy while the other is used with the current master. They are in this repository on GitHub. Each script performs three operations.

The first operation creates two factors with n random variables. Both factors will have the same variables in the same order. Their values are irrelevant. The script iterates through n from 0 to 20 and measures how long a factor product takes to finish. The first iteration is just the product between two scalar factors, that is, factors that have no variables and are just scalars. The second has only one variable, the next has two, and so on all the way up to twenty random variables. Each script executes for the two commits specified.
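
As a rough sketch of what such a timing script might look like (assuming the constructor accepts an empty variable list for the scalar case, and with an import path that may differ between the two commits):

import time
import random

# The exact import path depends on the commit being tested
from probpy.factor import RandVar, Factor

for n in range(21):
    # n binary variables; the values themselves are irrelevant
    variables = [RandVar("X%d" % i, ["T", "F"]) for i in range(n)]
    values = [random.random() for _ in range(2 ** n)]

    fa = Factor(variables, values)
    fb = Factor(variables, values)

    start = time.time()
    fa * fb
    print(n, time.time() - start)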

The second operation is similar. The only difference is the order of the variables in the factor. The second factor will have exactly the same variables but in reverse order. This is done to test whether the order of the variables matters or not.

Finally, the third test runs the product between two factors with the same variables in the same order, but does so under Python's profile tool. With this it is easy to observe where Python is taking too much time to execute the code, which functions are being called a lot, among other things.
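
For reference, a run like this can be reproduced with Python's standard cProfile module. A minimal sketch, reusing the setup from the timing script above:

import cProfile
import random

from probpy.factor import RandVar, Factor  # import path may differ

n = 20
variables = [RandVar("X%d" % i, ["T", "F"]) for i in range(n)]
values = [random.random() for _ in range(2 ** n)]

fa = Factor(variables, values)
fb = Factor(variables, values)

# Profile a single factor product. The statement is evaluated in the
# __main__ namespace, so fa and fb must be module-level names.
cProfile.run("fa * fb")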

Results of the First Test

The results of the first test are displayed from 15 variables to 20. The execution time before 15 is too small, so it is not displayed. The results for the older version:

15  0.47489118576049805
16  1.0099499225616455
17  2.180561065673828
18  4.577056884765625
19  9.783910989761353
20  20.155898094177246

For the master version:

15  0.3872871398925781
16  0.8108508586883545
17  1.7335929870605469
18  3.6766469478607178
19  7.790879011154175
20  16.35800790786743

Here it is simple to see that the master version improves quite a bit. In the last example the gain of the master over the older version is almost four seconds. In this situation we are looking at the product between two factors that each have 20 random variables. That is a lot of variables, and it is probable that no one will work with factors of these dimensions. However, the observed gain shows that the newer algorithm is faster, which is good.

Results of the Second Test

The second test, where one of the factors has the random variables reversed, has the following results. For the older version:

15  0.4831809997558594
16  1.0258629322052002
17  2.1770758628845215
18  4.642666816711426
19  9.749383211135864
20  20.390637159347534

For the master version:

15  0.3916609287261963
16  0.8321959972381592
17  1.8294868469238281
18  3.765069007873535
19  7.900938034057617
20  16.56574010848999

The differences between these results are negligible. It is simple to see that the order of the variables is not very important, which was what was expected. The fact that the operations must take the order of variables into account matters because it makes the algorithms a little more complex. However, which order is chosen for the variables makes no real difference to performance.

Results of the Third Test

The third test uses Python's profile tool and runs the product between two factors, each with 20 variables. The results for the older version:

        52429122 function calls in 132.335 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.008    0.008  132.335  132.335 <string>:1(<module>)
        1    0.000    0.000  132.327  132.327 factor.py:122(mult)
  2097152    1.972    0.000    1.972    0.000 factor.py:125(<lambda>)
        1   88.247   88.247  132.327  132.327 factor.py:146(factorOp)
        1    0.000    0.000  132.327  132.327 factor.py:536(__mul__)
        1    0.453    0.453    0.453    0.453 factor.py:96(__init__)
        1    0.000    0.000  132.335  132.335 {built-in method exec}
 48234748   39.872    0.000   39.872    0.000 {built-in method len}
  2097215    1.782    0.000    1.782    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

For the master:

        4194670 function calls in 42.300 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.008    0.008   42.300   42.300 <string>:1(<module>)
        1    0.000    0.000   42.292   42.292 factor.py:126(mult)
  2097152    1.992    0.000    1.992    0.000 factor.py:134(<lambda>)
        1   38.110   38.110   42.292   42.292 factor.py:170(factorOp)
        1    0.445    0.445    0.445    0.445 factor.py:35(__init__)
        1    0.000    0.000    0.000    0.000 factor.py:599(getValuesListSize)
        1    0.000    0.000   42.292   42.292 factor.py:719(__mul__)
        1    0.000    0.000   42.300   42.300 {built-in method exec}
      274    0.000    0.000    0.000    0.000 {built-in method len}
  2097236    1.744    0.000    1.744    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

These results are very complete. They contain lots of useful information that may help decide where improvements need to be made.

The first big difference between the two results is the usage of the len function. In the older implementation that function was called in the middle of loops where it wasn't needed. The simple decision of taking the call out of the loops and making it only once saved a lot of time. This is a trivial mistake anyone can make: placing a len call inside a loop.
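
A schematic example of the fix (an illustration only, not the actual ProbPy code):

values = list(range(1000))

# Before: len(values) is re-evaluated on every iteration
total = 0
i = 0
while i < len(values):
    total += values[i]
    i += 1

# After: len(values) is evaluated only once
total = 0
i = 0
size = len(values)
while i < size:
    total += values[i]
    i += 1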

Improvements Made Taking Profile Into Account

One of the worst things in ProbPy's code is the final loop of the factorOp function. That function takes two factors and applies whatever operation it receives as a parameter. The parameter is either a function or a lambda. The final loop is where the values of the operation are calculated, so for very big factors the lambda will be called many, many times, which is terrible.

Looking at the profile output, the total amount of time it takes to call a lambda 2097152 times is not that much, but it is still something that can easily be eliminated with a different factorOp implementation. One option is, instead of having the operation as a parameter, to implement the operation using the operators directly: the + operator for sums, * for products, etc. This way the loop doesn't call anything. The problem is that a loop needs to be replicated for each operation, and the version taking the parameter must still exist for other operations.
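
The idea, sketched here for factors whose values are already aligned (an illustration of the technique, not the actual factorOp implementation):

def factor_op_generic(values1, values2, fun):
    # One call to fun per element: real overhead on very big factors
    return [fun(a, b) for a, b in zip(values1, values2)]

def factor_prod(values1, values2):
    # Specialized loop: the * operator is applied directly,
    # with no per-element function call
    return [a * b for a, b in zip(values1, values2)]

The generic version stays around for user-defined operations, while the most common operations, like the product, each get their own loop.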

Another noticeable thing is the number of times the append method is called. There is some time wasted there, and there might be parts of the code bloated with too many unnecessary appends.

Future Improvements

Two things would be cool if implemented in ProbPy. One is implementing a few methods using parallelism. I have thought about a way of making factorOp parallel, but I am yet to implement it. Using Python's standard library for this very end, a parallel version of ProbPy could be put online and still run on every machine that runs ProbPy. It is just a question of implementing what is necessary.

Another cool thing would be to implement some functions in the C programming language. That way, some of the most executed parts of the code could gain much speed.

In later posts further tests will be made, not only of speed but also of the correctness of some inference algorithms which are already implemented.

Factors in ProbPy – Part 2

In the previous article the two most important building blocks of ProbPy were introduced. Those were the RandVar and Factor classes. The way information gets stored in a factor is extremely important because ProbPy will have to perform operations between the factors.

This part discusses which operations are made between factors and how those operations are implemented; finally, some remarks are made about complexity, efficiency, and optimization of the algorithms.

Operations Between Factors

Factor multiplication is the cornerstone of ProbPy and it's one of the main reasons for the development of the library. To quickly explain what factor multiplication is, suppose you have the following distributions:

P(X | Y),
P(Y).

As was seen, these two distributions are represented by factors. Just to see some numbers, let's define the following:

X = RandVar("X", ["a", "b", "c"])
Y = RandVar("Y", ["T", "F"])

fy = Factor(Y, [0.9, 0.1])
fx_y = Factor([X, Y], [[0.2, 0.3, 0.5],
                       [0.6, 0.2, 0.2]])

The factors fy and fx_y represent P(Y) and P(X | Y) respectively, or in factor notation, fa(Y) and fb(X, Y). Remember that the notion of conditioning variable is not important for factor notation, so there is no “|” symbol used in the factor.

From the study of probabilities, it is known that multiplying both distributions yields a new distribution; in this case that new distribution will be the joint distribution of X and Y. The following operation shows this multiplication. Notice that this multiplication is also called the chain rule in some contexts.

P(X, Y) = P(X | Y) P(Y).

Using factors:

fr(X, Y) = fb(X, Y) fa(Y).

Notice that the resulting factor fr has the same variables as the operand factors. That is common to all of these operations. The set of variables of the resulting factor is always the union of the sets of variables of the operand factors. Mathematically:

vars(fr) = vars(fa) ∪ vars(fb)

After having that notion, the resulting factor can be constructed using the following rule:

fr(X=x, Y=y) = fa(Y=y) fb(X=x, Y=y),   for every x in dom(X) and y in dom(Y)

The following table shows the operation applied to the example given before.

X Y | fa(Y)  fb(X, Y) | fr(X, Y)
--------------------------------
a T |  0.9      0.2   |   0.18
b T |  0.9      0.3   |   0.27
c T |  0.9      0.5   |   0.45
a F |  0.1      0.6   |   0.06
b F |  0.1      0.2   |   0.02
c F |  0.1      0.2   |   0.02

If you have read books like “Artificial Intelligence: A Modern Approach” by Stuart Russell and Peter Norvig, or “Probabilistic Graphical Models: Principles and Techniques” by Daphne Koller and Nir Friedman, you have no doubt seen this very operation being defined and used. The operation is factor multiplication, meaning that each element in one factor gets multiplied with the corresponding elements in the other factor.

But what about division, addition or subtraction of factors? The operation is exactly the same except for the operator between the two operand factors. The example shown used multiplication, but the same is valid for any other operator, even the % operator like it is used in most programming languages. Extending on the idea, one could see how binary operators are not the limit. It is possible to define a factor operation with any function relating both operands. Of course the usability of such operation would depend on the context of such use.

To make things simple, all these operations are called factor operations. A factor operation is a relation between two factors which results in a new one. The operation used to relate the factors can be multiplication, addition, or any other operation defined by the user.
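
With the factors fy and fx_y defined above, and using the * and / operators that appear in ProbPy examples elsewhere on this blog, two factor operations look like this:

fxy = fx_y * fy  # factor multiplication: fr(X, Y) = fb(X, Y) fa(Y)
fq  = fx_y / fy  # same index matching, with division as the operation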

How Factors are Related For Operations

Relating two factors always results in a new factor. It was seen in the first part that the new factor will always have the union of the sets of variables of the two operand factors. To illustrate what is going on, suppose the following two operand factors:

fa(X, Y)=[ 2   4   1   3 ]
       X | 0 | 1 | 0 | 1 |
       Y |   0   |   1   |

fb(Z, Y)=[ 8   6   9   7 ]
       Z | 0 | 1 | 0 | 1 |
       Y |   0   |   1   |

The values on the factors are just placeholder values to see the result of the operations.

The first thing to do is to determine the resulting factor's variables. Using the definition from before, the set of variables of the resulting factor will be X, Y, Z. Remember that the order of variables in the factor representation is very important, and as such, the order in the new factor will also be important. We need a systematic way of performing the union of variables. The way that was chosen is simple: take the variables of the first factor and follow them with the variables of the second factor which are not in the first factor. The pseudo-algorithm for this operation is:

res_vars = variables of fa
for each var in variables of fb:
    if var not in res_vars:
        append var to res_vars

Having the variables in order the resulting factor will be represented as:

fr(X, Y, Z)=[                            ]
       X | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
       Y |   0   |   1   |   0   |   1   |
       Z |       0       |       1       |
       i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |

The values are not yet inserted; that is the second thing to do. The last line is just an index i. The problem now is: given this resulting structure, how will the values be calculated?

The approach used in ProbPy starts by considering the index i; for every i the algorithm finds the corresponding value in each of the two operand factors and uses them in the operation. Because the first variables in the resulting factor are the same variables as in the first factor, and because their order is the same, finding the corresponding index for the first factor is simple. The index in the first factor is the same as the index in the resulting factor. So index1 could simply be:

index1 = i

This works for i between 0 and 3, but fails for greater values. For values greater than 3, the values of the indexing variables X and Y just repeat. So i=4 has the same indexes as i=0, i=5 the same as i=1, and so on. It is simple to see that index1 will just be:

index1 = i % fa_len

Where fa_len is the size of the list of values in factor 1.

The variables of the second factor are in no specific order in relation to the resulting factor, so to get index2 a more complex method must be used. Recall from the first part of the article that it's simple to take the indexes of each variable and calculate the i index. Supposing f(X=0, Y=1, Z=1), the i index would be:

i = x + 2*y + 2*2*z = 0 + 2*1 + 2*2*1 = 6

So now the reverse method needs to be applied to get the value of each indexing variable of factor 2. The variables in factor 2 are Z and Y, in that order. For every value of i, the value of Z is:

Z = int(i / (2*2)) % 2

So we take the value of i and divide it by 4. Why? Because Z encapsulates two variables in the resulting factor and each of those variables has a size of 2. This division scales i down past the size of everything below Z. After the division, the modulo by the variable's size extracts its value, just as before. The same is done for the Y variable:

Y = int(i / 2) % 2

Notice that now i only gets divided by 2, because Y only encapsulates one other variable. If this were done for the X variable, no division would be needed.

After having both the Z and Y variable values, calculating the corresponding index of factor 2 is simply done like:

index2 = Z + 2*Y

Which is the method shown before.

This operation is implemented in the factor.py file. The method is big, but the important part is the following:

for i in range(res_values_size):
    # Get index 1
    index1 = i % len_values

    # Get index 2
    index2 = 0
    for j in index2_range:
        index2 += (int(i / div[j]) % dim[j]) * mult[j]

    # Calculate value
    res_values.append(fun(self.values[index1], factor.values[index2]))

# Make Factor object and return
return Factor(res_rand_vars, res_values)

Line 8 is where the index i is transformed into index2. The div list contains the values by which i is going to be divided. In the example in this article the list would be [2, 4], for Y and Z respectively. The dim list contains the sizes of the variables. The mult list contains the values by which each variable's value is going to be multiplied; here that list would be [2, 1].

With this, we have a method that relates two factors. Whether the method performs a multiplication, an addition or anything else depends on the fun function, which is passed as a parameter.

Factors in ProbPy – Part 1

Every data structure and algorithm in the ProbPy library rests on two very important classes, the Factor class and the RandVar class. This article, and the next, will talk about what these classes are, how they work, and why they are used like this.

Factors

Let us consider a binary variable X which has a domain of either True or False. Let us now consider a distribution over that variable P(X) for which we assign the following probabilities:

P(X=True)  = 0.8,
P(X=False) = 0.2.

So we have a distribution over a binary variable. Nothing new here, it all adds up to 1, meaning:

P(X=True) + P(X=False) = 1.

For many reasons, which I'm not going to get into now, it is quite convenient to represent these distributions not as individual values but as a whole vector. That vector is indexed by the very domain of the variable, meaning that the first element corresponds to the probability of X=True and the second element corresponds to the probability of X=False. In a more compact way, we can write:

P(X) = [0.8, 0.2].

To expand on this idea we can start considering distributions that have two variables, like the joint distribution P(X, Y) or the conditional distribution P(X | Y). To represent the values of such a distribution we can't resort to the structure of a vector; we use a matrix instead. That matrix will be indexed by the domains of both X and Y. The following shows an example of this. For simplicity's sake we consider both variables to be binary.

P(X, Y) = | 0.2 0.4 |
          | 0.3 0.1 |

So if we consider the rows to be indexed by X and the columns indexed by Y we have, for example, the probability P(X=True, Y=True) = 0.2 and P(X=True, Y=False) = 0.4. If we consider the distribution P(X | Y) we would need to have different values than the ones shown above, but the representation would be the same. The domain of the variables would index a matrix and the elements of that matrix would be the values of the distribution.

This leads to the notion of a factor. This representation doesn't consider whether the variables are conditioning or not. It also doesn't consider whether the matrix being represented is a probability distribution. It can simply be a relation between a set of random variables and a few values, or a function mapping random variables to some values.

So, rather than writing P(X, Y) or P(X | Y), we simply write f(X, Y). The probabilistic significance of the variables is therefore not represented, because it is not needed for the actual operations between the values.

Continuing the generalization of this idea of a factor, we can now consider what happens with a factor with three variables, like f(X, Y, Z). The idea is of course the same; the representation would use a three-dimensional matrix, like the following:

f(X, Y, Z=True) =  | 0.05 0.1 |
                   | 0.1  0.1 |

f(X, Y, Z=False) = | 0.15 0.1 |
                   | 0.2  0.2 |

Having an arbitrary number of random variables in the factor, we can now start thinking about the generalized data representation of a factor.

How Data is Represented

To figure out how to store a factor in memory we can start by looking at how some programming languages store matrices.

In the C programming language it is possible to define a matrix in a very simple way. Suppose the following 2×2 integer matrix:

mat = | 2 4 |
      | 3 1 |

The definition in the C programming language would be:

int mat[2][2] = {{2, 4}, {3, 1}};

In memory, the matrix gets vectorized. That means that its values, which were arranged in rows and columns, are logically placed in a line, one after the other. The GDB output confirms this vectorization by showing the values of the matrix in memory:

(gdb) x/4x mat
0xbffff6a8:    2    4    3    1

We would get something similar with three dimensions. The matrix:

mat[0][][] = | 2 4 |
             | 3 1 |

mat[1][][] = | 7 5 |
             | 6 8 |

Would become:

int mat[2][2][2] = {{{2, 4}, {3, 1}}, {{7, 5}, {6, 8}}};

GDB:

(gdb) x/8x mat
0xbffff698:   2   4   3   1
0xbffff6a8:   7   5   6   8

Vectorization is done using a simple process. The first element placed in memory is the element at mat[0][0][0]. Next, one of the indexes is incremented, the one on the far right, giving mat[0][0][1]. So up to this point we have the array:

mat = [2, 4, ...].

Continuing this, if we increment the index on the right again we go out of bounds, so we reset that index to its first value and increment the index in the middle, getting mat[0][1][0].

It should be clear how the rest plays out. To get a better visualization of how a multidimensional matrix can be vectorized, the following representation is used for a matrix with indexes mat[z][y][x]:

mat=[ 2   4   3   1   7   5   6   8 ]
  x | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
  y |   0   |   1   |   0   |   1   |
  z |       0       |       1       |
  i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |

The first line in the representation corresponds to the values of the matrix itself. The three lines starting with x, y, and z correspond to the values of the indexes for each element. For example, supposing the tuple of indexes (x, y, z), the number 1 falls at indexes (1, 1, 0), the number 5 falls at indexes (1, 0, 1), and so on.

The question now is, how do we take our indexing variables and find the i index? We do this with a series of sums and multiplications. For the example above we have:

i = x + 2*y + 2*2*z.

So we take the value of x and start by adding 2*y. We multiply by two because each index of y encapsulates every index of x; hence the number 2 comes from the size of x.

After that we do the same for the z variable: we add its value multiplied by 2 once because of the x variable, and multiplied by 2 again because of the y variable.
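
As a small sketch of both directions of this computation, for the mat[z][y][x] example where every variable has size 2:

def flat_index(x, y, z):
    # from variable indexes to the vectorized index i
    return x + 2*y + 2*2*z

def var_indexes(i):
    # from the vectorized index i back to the variable indexes
    x = i % 2
    y = (i // 2) % 2
    z = (i // (2*2)) % 2
    return x, y, z

assert flat_index(0, 1, 1) == 6
assert var_indexes(6) == (0, 1, 1)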

To finish this chapter, suppose we do the same with a factor in which variable A has size 3, and variable B has size 2.

mat = | 4 2 6 |
      | 8 1 9 |

The representation would be:

mat=[ 4   2   6   8   1   9 ]
  A | 0 | 1 | 2 | 0 | 1 | 2 |
  B |     0     |     1     |
  i | 0 | 1 | 2 | 3 | 4 | 5 |

But what if we start by changing the B variable first instead of changing A? We would have:

mat=[ 4   8   2   1   6   9 ]
  B | 0 | 1 | 0 | 1 | 0 | 1 |
  A |   0   |   1   |   2   |
  i | 0 | 1 | 2 | 3 | 4 | 5 |

What do we have? The order of the values is different, the order of the variables is different, and some i indexes correspond to different values of mat. But each tuple of indexes (A, B) still corresponds to the same value. That is, the number 8 has indexes A=0 and B=1 in the first representation, the same in the second representation, and the same holds for every other value in mat.

Concluding, we see that the order of the variables indexing the factor is very important. To store a factor using this method we need to store the actual values and the order of the variables. This may seem bad, but in the next post we will see that this method brings flexibility when working with multi-variable factors. That flexibility is what allows one to define an arbitrary factor using some variables and then perform operations between these factors in an easy way: multiplications, additions, variable instantiations, among other things.

Working with ProbPy

This article shows a few things that can be done with the ProbPy library, currently in development. The reader should have a basic knowledge of probability theory to be able to work with this library.

Random Variables and Factors

Before defining probability distributions we must first think about random variables. In something like P(X, Y), the variables would be X and Y. So what are these random variables?

Let's take a simple example of a coin toss. When tossed, a coin can fall either with the heads side up or the tails side up; we disregard any other highly improbable outcome. In this case the outcome of the coin toss is represented using a random variable. Let's call that random variable C. So if the coin falls with the heads side up, C will take the value H, and T otherwise.

The random variable is characterized by two things. First, its name, which is C, and second, its domain. The domain of the variable can be thought of as an array [H, T] of size two. For now that is the only thing needed to characterize a variable. Defining a variable in ProbPy is done like this:


Coin = RandVar("Coin", ["H", "T"])

The arguments for the constructor are the name and domain of the variable respectively.

Variables by themselves are not very useful, so we extend to the implementation of probability distributions. A probability distribution for the coin toss scenario would be very simple:


P(Coin = Head) = 0.8
P(Coin = Tail) = 0.2

Here we have two events and their probabilities. We also note that this is an unfair coin. To represent this with ProbPy the following can be done:


PCoin = factor.Factor(Coin, [0.8, 0.2])

A few questions arise. Why is it called a factor and not a probability distribution? And what are the two arguments? More on the factor later. The first argument of this constructor is the random variable defined above. The second is a list of probability values. Because the domain of the variable is "H" followed by "T", we know that the 0.8 in the probabilities list is the probability of the coin falling with its heads side up, and the 0.2 the probability otherwise. It's a correspondence, index by index.

But now we need to represent something with many variables, like P(X, Y, Z). So first, the variables:


X = RandVar("X", ["T", "F"])
Y = RandVar("Y", ["T", "F"])
Z = RandVar("Z", ["T", "F"])
PCoin = Factor([X, Y, Z], values)

So first notice that the first argument is now a list of variables instead of a single variable. Also, the second argument is just a variable named values. For this to work, that variable needs to be a list of probability values just like the previous one, but how to build that list is not exactly trivial and will be the point of a follow-up article. One of the next things to be implemented in this project is an easy way to write that list of probabilities; in a few weeks it will be done.
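
For the curious, here is what such a list could look like for the three binary variables above, following the vectorization convention described in the factor posts (the first variable in the list changes fastest; the numbers are placeholders that sum to one):

values = [
    0.05,  # X=T, Y=T, Z=T
    0.10,  # X=F, Y=T, Z=T
    0.15,  # X=T, Y=F, Z=T
    0.10,  # X=F, Y=F, Z=T
    0.05,  # X=T, Y=T, Z=F
    0.20,  # X=F, Y=T, Z=F
    0.15,  # X=T, Y=F, Z=F
    0.20,  # X=F, Y=F, Z=F
]

PCoin = Factor([X, Y, Z], values)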

One last point: why is the class called Factor? Mathematically we have probability distributions and conditional probabilities, like the following:


P(X | Y)
P(X, Y | Z, W)

Their meaning is different from non-conditional probabilities, as you should know, but their implementation in this library is the same: just a list of variables and a list of values. The fact that the variables are conditioning the distribution or not is irrelevant. Also, you can have a function that maps the variables to values that do not sum up to one, for example:


Q(X=T) = 10
Q(X=F) = 20

Here, Q maps the values of X, T and F, to 10 and 20, making it not a probability distribution but still something that can be implemented in this library in exactly the same way. Objects like this are quite common when working with algorithms over probability distributions. Because of this we drop the notion of left and right side variables and just stick with the term factor, which is a relation between random variables and their values.
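
Such a factor is defined exactly like the distributions above, using the values from the Q example:

Q = Factor(X, [10, 20])  # not a distribution, but still a valid factor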

Using ProbPy

We've seen what can be done in regard to variables and factors, but what can be implemented with them? To start off, suppose the following:


P(X | Y) P(Y) = P(X, Y)

We have the trivial chain rule, where the product of the conditional distribution of X given Y and the distribution of Y is the joint distribution. This is very common. This is how we do the same in ProbPy:


X = RandVar("X", ["T", "F"])
Y = RandVar("Y", ["T", "F"])
fy = Factor(Y, [0.2, 0.8]) # P(Y)
fx_y = Factor([X, Y], [0.1, 0.9, 0.5, 0.5]) # P(X | Y)
fxy = fx_y * fy # Results in factor that represents P(X, Y)

So we define the first two distributions and compute the joint. Moving on to something more complex, suppose the definition of the Bayes theorem:


P(X | Y) P(Y) 1/P(X) = P(X | Y) P(Y) 1/n = P(Y | X)

In the above definition we suppose that 1/n is just a normalization constant. From the previous code we need only to add one extra line to get the final result:


fy_x = fxy.normalize(Y)

Normalizing over the argument is equivalent to dividing fxy by a factor with the probability of X. This method is just simpler, but alternatively we could do:


fx = fxy.marginal(X)
fy_x = fxy / fx

For more features of the Factor class, such as calculating expected values, instantiating variables in factors, and more, check the documentation in ProbPy/doc/_build/html/index.html, which is distributed with this project.

For now, note that an implementation of a Bayesian Network is done, along with the implementation of the Elimination Ask algorithm. Examples of these features in use are in the examples directory distributed with this project. An implementation of ProbPy being used with the Bayes theorem is also distributed in the examples directory.

Future Work

One of the next things to be done is to write about the factorOp method, which allows the implementation of products and other operations between factors. That method is basically what allows us to have factors with an arbitrary number of random variables of different sizes and still work with them in a logical way.

Other things to be done are the implementation of more algorithms and examples to expand the library. The next few things will still be algorithms over Bayesian Networks and examples displaying those same algorithms. I would also like to implement a method that makes it easy to work with index variables. Index variables are practical when implementing HMMs, for example.

Choosing a license for the Bayes Project

Licensing is a topic that very few people talk about during the development of a project, but it's still important. This blog post talks a bit about licenses in general and shows why the MIT license was used for the Bayes Project.

Why a license is needed

For a project to function correctly in an open source environment it needs lots of things. In the previous blog post some of those things were stated: the usage of a version control system, the usage of conventions such as PEP8, and the development of tests using a unit test framework. There was also one thing that was not mentioned in the last post, which is the writing of documentation, but that should go without saying!

Anyway, one big important thing in any open source project is the license. Licensing a project is very important because if anyone is going to be using the project, or contributing to said project in the first place, they need to know what they can and can’t do.

Before applying a license one should know what happens if the code doesn't carry any license. Under US law, software without a license is under exclusive copyright by default, so nobody else may copy, modify, or distribute it. I don't know how this applies outside of the US, but that is mostly irrelevant since this project is already under the GitHub Terms of Service, which, quoting the source, “allow[s] other GitHub users some rights. Specifically, you allow others to view and fork your repository”. And this happens simply because the project is hosted at GitHub.

But having the project just on GitHub is not enough. We still need an open source license in case we want to distribute this project outside of GitHub and in case this project gets used by other projects. The various open source licenses must now be considered.

Open Source licenses

As one should expect, there are many open source licenses that can be used in an open source project. To name a few, there are the GNU GPL, GNU LGPL, MIT, and BSD licenses, among other more specific ones, like the Python Software Foundation License, the Boost license, the zlib license, etc. For this project one of these licenses should be picked. The more specific ones can be ignored; even the Python Software Foundation License should be ignored, because that license is used in the distribution of the Python project itself. Some projects developed in Python fall under this license because they may later be integrated with Python itself, which is not the case with the Bayes project (name really needs changing).

One of the licenses that could be used is the GNU GPL. This is undoubtedly the most used license and is recognized by almost anyone. The license states that the source code and associated files of software distributed under it must be made available to anyone. It also states that anyone can modify the software and redistribute it. If someone does redistribute the software, with or without modifications, then that person must grant the same freedoms as the original.

Any software under the GPL must grant the same basic freedoms and can't be used in a project that is not GPL, meaning that one can't take part of a GPL program and shove it into a non-GPL program. For example, taking the Linux kernel, using it in the development of a whole new operating system, and selling that operating system without ever providing its source: can't be done.

The GNU LGPL license is a bit different. It allows the usage of a project under the LGPL in software that is under other licenses, even proprietary ones. There is an interesting article at gnu.org about why one should not use the LGPL for a library.

The MIT license is less restrictive than the GPL and LGPL. A project using this license can “(…) deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software”. A warranty disclaimer is also included, stating that “in no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.”

In the end the MIT license was chosen because it is simple, not very restrictive, and allows the code to be used later in projects with other licenses.

Implementation of a Library for Probabilistic Calculus

Probabilities, Algorithms and Complexity

It is quite common in subjects like Artificial Intelligence, Machine Learning and Decision Theory for problems which deal with uncertainty to show up. This is because these problems deal with settings that are only partially observable, or systems that are simply too complex to be explained by an approach like First Order Logic or Propositional Logic.

To deal with uncertainty the concept of probability is introduced. A probability is basically a degree of belief assigned to some proposition. A proposition is represented using a random variable.

One of the many applications of probabilities is in Bayesian Networks. A Bayesian Network can be thought of as a directed acyclic graph where each node represents a proposition about some setting. That proposition is seen as a random variable, and each node in the network has a conditional probability distribution associated with it. That distribution is the probability of the node's variable given its parents.

Of course, Bayesian Networks are but one example. Probabilities are used in many different contexts and the algorithms that can be applied to them are vast. These calculations and algorithms are not exactly trivial, meaning that they are not easily done by hand or with the default constructs and data structures of many programming languages.

The problem of representing an arbitrarily big multi-variable probability distribution can be non-trivial by itself, let alone all the possible operations over such distributions: multiplication between distributions, computing marginals, expected values, implementing algorithms like elimination ask, among other things. A simple way should exist to deal with these problems.

This is the main motivation for the development of this project: to provide an efficient yet simple to use Python library that abstracts all the intricacies of operations between factors and lets the user implement complex algorithms over them.

Efficient Yet Simple

Making things efficient will be the subject of a whole other blog post. Making things simple can be shown now.

So the main problem is representing random variables and probability distributions. After that we think about all the algorithms and functions that can be implemented over these distributions, things like random variables that change over time, among other things.

To represent a random variable in this library one simply does:

X = factor.RandVar("X", [True, False])

The first argument of the constructor of the class RandVar is a string. That string is the name of this variable, in this case X. The second argument is the domain of the variable. In this case X is a binary variable which can take the value True or False.

Besides binary variables we can have arbitrarily big variables. Suppose:

Y = factor.RandVar("Y", list(range(100)))

A random variable with a domain with size 100.

From this we need a distribution for the variables. For example:

X = factor.RandVar("X", [True, False])
Y = factor.RandVar("Y", [True, False])
x_dist = factor.Factor([X], [0.2, 0.8])
y_dist = factor.Factor([Y], [0.9, 0.1])

So above we have two binary variables, X and Y, and their distributions. For example, the probability of X being True is 0.2, that is, P(X = True) = 0.2. The probability of Y being False is 0.1, P(Y = False) = 0.1.

So we can create distributions for one variable. How about two? If we suppose X and Y are independent, we can just multiply them:

# P(X, Y) = P(X)P(Y)
xy_dist = x_dist.mult(y_dist)

If not, we just define it like we did the others:

xy_dist = factor.Factor([X, Y], [0.1, 0.2, 0.3, 0.4])

Here we have to vectorize the matrix that would represent the distribution P(X, Y). That is where the usage of this library is not so simple. But trust me, it is the only thing.
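
Following the vectorization convention described in the factor posts (the first variable in the list changes fastest), the list [0.1, 0.2, 0.3, 0.4] reads as:

P(X=True,  Y=True)  = 0.1
P(X=False, Y=True)  = 0.2
P(X=True,  Y=False) = 0.3
P(X=False, Y=False) = 0.4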

A simpler method needs to be implemented to circumvent this. I already have a few ideas, but nothing very solid yet.

From here we can get a lot of operations going, such as the marginal of X in a joint distribution:

x_dist = xy_dist.marginal(X)

Instantiating a variable with some value, P(X = True, Y), is done like this:

vec = xy_dist.instVar(X, True)

Of course, the same operations work with many variables in a distribution. Suppose similar operations on a distribution with four variables, P(X, Y, Z, W):

xy_dist = xyzw.marginal([X, Y])
xy_dist = xyzw.instVar([X, Y], [True, False])

Without going too much further into implementations, here is a link to an example with a Bayesian Network implementation.

There are many plans for the future of this project, both in things to be implemented and in places where it is going to be used. Development shall continue.

The Project

The project is being developed using Git as a version control system, of course. Its source is hosted at GitHub. Future contributors are expected to be able to use this platform.

Like many programming languages, Python allows many ways to do the same thing. You can use different indentation widths in the same file, space function and class definitions with a different number of newlines, among other small details.

Problem is, if these details don't follow some convention, the code files end up being a big ugly jumble of code instead of an eye-appealing project. This can be a problem if someone wants to participate in a project and has to deal with very confusing code. Not to mention that some Python projects end up having some portions in Python 2 and others in Python 3, which causes many problems down the road.

To avoid this, the convention is that the project follows the PEP8 standard and is written only in Python 3.

To end, I've participated in far too many projects that have no notion of unit tests. I am not having this now. This project will have unit tests implemented for everything. To run these tests we use nosetests, a simple testing framework.
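
As an example, a minimal test of the kind nosetests picks up might look like this (a hypothetical test, not one from the repository):

from probpy.factor import RandVar, Factor  # import path may differ

def test_factor_values():
    X = RandVar("X", ["T", "F"])
    f = Factor(X, [0.2, 0.8])

    # The values list is stored index by index
    assert f.values == [0.2, 0.8]

Running the nosetests command at the repository root discovers and runs every function whose name starts with test.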