Wednesday, May 9, 2012

TraitsUI - Python for Quick and Efficient GUIs

A highlight of the Enthought Tool Suite is the Traits package. Folks at Enthought eat and sleep Traits. Those who regularly use ETS will be familiar with Traits, as almost all of ETS builds on top of it. The TraitsUI tutorial by Gael Varoquaux is an excellent introduction to using Traits to create powerful UIs in Python. The tutorial uses Traits and matplotlib to create a basic interactive GUI. Last month I started as an intern at Enthought, Inc. My first task was to convert Gael's tutorial to use Chaco instead of matplotlib for the plotting. Chaco is an interactive 2D plotting package that is already well integrated with TraitsUI (there is a module for 3D plotting too: Mayavi). Although I have only seen Traits being used in the context of making UIs, it is basically a sophisticated type definition system that can go well beyond UIs. In this post I will attempt to give the reader a cursory feel of TraitsUI specifically and Traits in general.

Traits as Type Definitions in Python
As a beginner in programming, and especially in Python, I have found it quite easy to create new types by simply writing new classes. One could, of course, take a lower-level approach and extend Python using the C/C++ API, but that would miss the point of programming in Python: to write readable, clean code, and to write it fast. The Traits package makes the former approach more palatable. It does so by adding a number of characteristics to Python class attributes, such as initializing default values for attributes, restricting the values an attribute may take, and notifying the program when an attribute changes. These advantages make Traits particularly effective for GUI design, where users might want to interactively modify attributes, visualize the return values of different methods, etc. The visualization of traits is supported by the TraitsUI package.


Visual Representation of Traits
To demonstrate the graphical and interactive modification of traits, consider the following snippet:
from traits.api import HasTraits, CInt, Enum

class Camera(HasTraits):
    
    gain = Enum(1,2,3, desc = 'The gain index of the camera',
                label = 'Gain')
    exposure = CInt(10, desc = 'Exposure in ms',
                    label = 'Exposure')
    
    def capture(self):
        # Parentheses keep the two-line expression valid.
        print("capturing an image at %i ms exposure, gain: %i" %
              (self.exposure, self.gain))

Camera().configure_traits()
Let's consider the trait attributes in the Camera class. Any class that subclasses HasTraits can declare trait attributes, and the package provides the trait types most commonly required in UI applications. The gain of a camera can be (virtually) any real number, but in this case we wish to restrict the gain trait to a predefined set of integers. For this we use the Enum trait, which restricts a trait to a predefined set of values; the values do not have to be of the same type. The default value of an Enum trait is the first value passed to it - in this case, 1. The next trait, CInt, is called a casting trait. A casting trait casts any value assigned to it to the type that the trait requires. When the class is instantiated and the configure_traits() method is called on it, TraitsUI automatically creates the widgets required by the UI. The user does not have to worry about the layers between writing algorithms and rendering widgets.

A great variety of traits are available in the package, most of them with characteristic visual properties.

Interactive GUIs with Traits
A major advantage of using traits is its notification property. This is indispensable in making interactive GUIs. There are many ways of making a Python programme react to changes in variables. Trait notifications are particularly effective in accomplishing this graphically. Consider the following code snippet:
from traits.api import HasTraits, Str

class EchoBox(HasTraits):
    
    input = Str
    output = Str
    
    def _input_changed(self):
        # Assign to the trait on the instance, not to a local variable.
        self.output = self.input

EchoBox().configure_traits()
For any trait foo, a handler that is notified of its changes can be created by writing a method _foo_changed(). If we use a Button object in a UI, then the handler is written as a _bar_fired() method in the class (where bar is the name of the Button object). This is illustrated in the following snippet:
from traits.api import HasTraits, Button, Int
from traitsui.api import View, Item

class ButtonClick(HasTraits):
    value = Int()
    add_one = Button()
    
    def _add_one_fired(self):
        self.value += 1
    
    view = View('value', Item('add_one', show_label = False))

ButtonClick().configure_traits()
The View and Item classes from the traitsui.api module are used to add different panels to one TraitsUI window.

More with Traits
I have taken a very minimal approach to Traits so far. In the next post I will write about embedding 2-D plots in TraitsUI, for which I will use ETS Chaco. Chaco is a very convenient package for adding 2-D plotting functionality to a TraitsUI, as it is fully integrated with Traits. Even standalone Chaco plot widgets require Traits to be rendered. I have omitted a few parts of the original tutorial - especially the parts on executing event loops via threads and ensuring safety of these threads. I will rewrite the tutorial when I have played around with Traits a bit.

Monday, April 2, 2012

GSoC Notes: Image Segmentation by Clustering

This photograph was taken by my cousin somewhere near Onyx, California. An interesting problem would be trying to differentiate the clouds from the snow in this picture. In this post, I shall attempt to do that with the help of k-means clustering. The heuristics of all the methodology and algorithms have been deliberately kept simple in order to see the least these algorithms can do.

Generally, in performing clustering on the information in images, the feature vectors consist of readily available data, like the intensities of pixels and the coordinates of those intensity values. Let's make a naive assumption that this image has no discernible spatial patterns, and omit the coordinates from our feature vectors. The dimensions of the image are 960 x 720 x 3. So we have 960*720 feature vectors, each having three components, the red, green and blue values of the respective pixel. It is also possible to do some pre-processing and come up with more elaborate feature vectors, but we shall leave that for later.

For initializing a k-means algorithm we need to decide on the number of clusters we want. Let's say we want to cluster the pixels into three groups, one representing the snow, one for the clouds, and everything else goes into the third. This is a highly relaxed assumption and a lot can go wrong with this kind of reasoning. For instance, whatever we intend to put in the third group might itself belong to a number of classes (whatever isn't snow or clouds might be the earth, the tree cover, etc.) and these classes might not have sufficiently common characteristics to fall into one cluster. A popular way to address this is to use one-vs-all classification, which identifies objects belonging to one class and rejects all others. But this would require knowledge of all possible classes, and we don't know how many classes there are other than snow and clouds. Unsupervised clustering comes in handy in such cases.
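
To make the setup concrete, here is a minimal sketch of clustering pixels by colour alone. It is not the code used for the results in this post: it runs a plain NumPy implementation of Lloyd's k-means on a small synthetic image (three noisy colour bands standing in for clouds, snow and land), and the deterministic brightness-based initialization is an assumption made so that the example is reproducible.

```python
import numpy as np


def kmeans(features, k, n_iter=10):
    """Plain Lloyd's k-means on an (n_samples, n_features) array."""
    # Deterministic start (an assumption made for this sketch): take the
    # feature vectors at evenly spaced ranks of overall brightness.
    order = np.argsort(features.sum(axis=1))
    ranks = [(2 * j + 1) * len(features) // (2 * k) for j in range(k)]
    centers = features[order[ranks]].astype(float)
    for _ in range(n_iter):
        # Squared distance from every feature vector to every center.
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels, centers


# Synthetic stand-in for the photograph: three horizontal colour bands.
rng = np.random.default_rng(0)
height, width = 60, 80
image = np.empty((height, width, 3))
image[:20] = (0.75, 0.75, 0.75)     # "clouds": light grey
image[20:40] = (0.95, 0.95, 0.95)   # "snow": near white
image[40:] = (0.25, 0.20, 0.15)     # "land": dark brown
image = np.clip(image + rng.normal(0.0, 0.02, image.shape), 0.0, 1.0)

# One feature vector per pixel, holding only its red, green and blue values.
features = image.reshape(-1, 3)
labels, centers = kmeans(features, k=3)
segmentation = labels.reshape(height, width)
print(np.round(centers, 2))
```

On the real photograph the three colour groups are nowhere near this well separated, which is exactly why snow and clouds are hard to tell apart from pixel values alone.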

Here is the code I used for generating these contours on the image. It is an adaptation of Vincent Michel and Alexandre Gramfort's demo of segmentation, modified to use k-means clustering.

While the results are far from satisfactory, even a crude clustering of the pixel values can provide a reasonable segmentation of the image. In retrospect, this result tells us how ambitious the problem was, and that it calls for better preprocessing. Now it can be clearly seen that being ambiguous about the third cluster has caused only the land to be well segmented, whereas in many places the snow and the clouds fall in the same cluster.

Let's try a more comfortable image now. This one is from the CASIA Iris dataset. We initialize three clusters here too. Detecting the pupil is the easiest part of segmenting an iris image, because the pupil is easily the darkest part of the image and is also spatially very convenient. The real goal is to be able to isolate the iris region within exactly two contours. As the results show, the pupil and the eyelids can be easily identified. But even the eyelashes have to be accounted for. In such a case, simple clustering will not do. It will have to be combined with some confidence measures on fitting contours to the iris. I don't know how to do that yet. In a later post I will write about using clustering as pre-processing for other machine learning based segmentation algorithms.

(Snow and Clouds photo by Sandeep Hardas)

Thursday, January 12, 2012

Scikit-Signal - Python for Signal Processing - Developer Talks

In this post I intend to summarize the discussion that the SciPy Developers' community has been having about the idea of developing Scikit-Signal - a Python toolbox for advanced signal processing tools. The suggestion has received considerable interest on the mailing lists (it can be found here in the archives), and has triggered off a few more threads, with discussions ranging from what such a scikit should include to how it can contribute to the SciPy ecosystem. These discussions have also been followed by debates about the very future of SciPy.

Of course, not all of that is relevant to the purpose at hand - which is to identify the scope of the scikit-signal project, with due regard to the original signal processing abilities of SciPy. Therefore the aim of this post is to consolidate the different viewpoints and suggestions that we have encountered in the threads. It would be quite premature and presumptuous to attempt to define the scope of the project at this time. We do not know what it can grow to become. So let this post serve only as a concise record of what we all have been discussing. This will hopefully allow us to streamline discussions in the future. (Like the minutes of a meeting, if you will.)

What's in a name?

Travis Oliphant wrote to me asking why I don't do this work under scipy.signal. I told him that what I really had in mind was a twofold aim: improving the signal processing routines and documentation already present in SciPy, and building scikit-signal as a dedicated, advanced signal processing package, much like the other scikits. I said that I, along with a lot of other people, have some signal processing ideas on which to write Python code. I will let the community decide later on which namespace the code should reside in. I'm happy with any namespace because I, for one, do not understand the consequences a different namespace will have in the long run. If I can perform well, it might turn out that this particular scikit is doing a better job researching and developing signal processing tools than it would if we were developing in SciPy. But as Gael said, this should not preclude improvement of the existing documentation and code in scipy.signal.

This project is whatever we, the developers, want it to be. If we want to maintain it as a separate scikit, so be it. However, if it becomes indispensable later on and we want to merge it with SciPy, so be it. That would be a true mark of its maturity. No matter what, SciPy as a central toolbox will remain indispensable, as Josef Perktold said in the discussion. So, for now, let us consider the namespace issue settled, or let's just put it on the backburner and proceed with the coding.

The Status Quo

The general consensus seems to be that scipy.signal isn't in good shape. It is also limited to filter design and basic linear system analysis. Here's what some people on the list suggested (about signal processing in Python generally, not just scipy.signal).
  1. Charles R Harris suggests that filter design needs improvement.
  2. Zachary Pincus suggests that scipy.interpolate needs improvement because it overlaps with a lot of other packages. (I myself +1 this, we really need better interpolation schemes. However, I think most of it is linear algebra and not signal processing, so I think it's an independent project.) Travis Oliphant said he is looking for someone who is working on and willing to coordinate future development of scipy.interpolate.
  3. Josef Perktold suggested that we include periodograms and the Levinson-Durbin algorithm (many more have mentioned this one). He also mentioned that there is a control system toolbox that overlaps considerably with the scikit-signal idea. If so, let us track them down and take a look at their code.
  4. Scipy.wavelets could do with better documentation and examples of plotting.
  5. David Cournapeau says that we are missing some simple linear code, and scipy.signal itself needs a lot of refactoring (He also stresses periodograms and the Levinson-Durbin algorithm).
These are some of the most stressed issues with signal processing in Python that the discussion has highlighted. There might be many more, and I might have left out a few crucial issues too. If so, please point them out.

Proceeding With the Scikit

The most contentious issue that came up in the discussion is why we need 'yet another' signal processing package. I won't defend the idea of a new scikit for signal processing here; that has already been done on the list. However, I believe that the concerns voiced against the project were very cogent. I wrote to Mitar Milutinovic, a researcher for Orange Data Mining, and a mentor in GSoC 2010. He had the following to say:
We had quite few sessions at Google Summer of Code mentors meeting this year about science open source. And we have found again and again that there are too many tools and everybody is implementing their own implementations again and again, instead of us collaborating. To avoid unnecessary re-implementations, inform all existing projects of the proposal of extending their project with such improvements and see which project agrees. And then extend that.
The risk of this project adding to the clutter and fragmentation of code is very real and it is our responsibility to avoid this at all costs.

To this end, we must:
  1. Give scipy.signal its due. Something that is already in scipy.signal, and is well executed and fast, should not be in the scikit. For instance, Charles R Harris says he has a Remez algorithm that works for complex filter design that belongs somewhere. It belongs in scipy.signal, with the original Remez algorithm.
  2. (For other dedicated signal processing code) start by reviewing existing code and defining the scope of this project (I understand that doing it right now is premature - but we must start somewhere)
  3. Assimilate fragmented code from projects that are dead or dormant.
  4. Assimilate code from other scikits / SciPy projects like nitime, talkbox, etc.
So now to start coding and to wait and watch the kind of response the project receives. I hope I have managed to faithfully represent the reactions of the community through this blog post. I look forward with great excitement to the range of ideas people would bring to this project. I have a few ideas on which to write code - particularly in adaptive data analysis. But soon there will come a time when I'll need to be told what to do about this project. Till such time, I hope (rather ambitiously) that this post suffices as a starting point to the developers of scikit-signal.

Monday, December 26, 2011

Proposal for New Scikit for Signal Processing

In the wake of my talk at SciPy India 2011, Gael Varoquaux suggested that there be a separate Scikit for signal processing. I'd be happy to work on it.

Here's the link to my talk. http://urtalk.kpoint.in/kapsule/gcc-57b6c86b-2f12-4244-950c-a34360a2cc1f/view/search/tag%3Ascipy

Tuesday, November 15, 2011

Python for Hilbert-Huang Transform: Introducing the PyHHT Project

I need to finish writing the paper on the Python toolbox for the Hilbert-Huang transform by the 28th of this month. That's almost two weeks away. Since PyHHT is an ambitiously chosen project that is

1. a big deal in signal processing
2. computationally expensive
3. not exactly easy to code for,

I conclude that I really need to maintain a proper record of what I work on. By not doing this, I have lost the ability to maintain a bird's eye view of my work on the toolbox. I often get stuck in just one layer of abstraction for weeks.

The following few blog posts are intended to be journal entries for the PyHHT project, and will be my primary resources when I submit my paper to SciPy India 2011. The abstract can be found here.

Bringing together different heuristics and interpretations of the Hilbert-Huang transform will require programming that ranges from simple numerical methods (for instance, as required in cubic spline interpolation) to complex machine learning and signal processing tools (for screening of the IMFs and further) to rendering plots (visualizing spectrograms for the time-frequency analysis). This calls for a comprehensive study of basic signal processing (especially the Fourier, Wavelet, and Hilbert transforms along with time-frequency analysis). On the other hand I need to develop high-performance algorithms for these techniques which complement the theory as neatly as possible.

Thus, beginning here, are my journal entries for the development of the PyHHT project. In the following posts I will break the project down into smaller development tasks.

Thursday, October 27, 2011

A Python Toolbox for the Hilbert-Huang Transform: Abstract for SciPy.in 2011

This paper introduces the PyHHT project. The aim of the project is to develop a Python toolbox for the Hilbert-Huang Transform (HHT) for nonlinear and nonstationary data analysis. The HHT is an algorithmic tool particularly useful for the time-frequency analysis of nonlinear and nonstationary data. It uses an iterative algorithm called Empirical Mode Decomposition (EMD) to break a signal down into so-called Intrinsic Mode Functions (IMFs). The set of these IMFs is characterized by being piecewise narrowband, thus making the IMFs suitable for Hilbert spectral analysis.

HHT is primarily an algorithmic tool and is relatively simple to implement. Therefore, even a crude implementation of the HHT is quite powerful for a given class of signals. Thus, one of the motivations for building a toolbox is to sustain the power of HHT across a variety of applications. This can be achieved by bringing together different heuristics associated with HHT on one programming platform (since HHT is largely algorithmic, there are a great many heuristics). It is thus the purpose of the toolbox to provide those implementations of the HHT that are popular in the literature. Along with making the application of HHT more dexterous and flexible, the toolbox will also be a good research tool as it provides a platform for comparison of different HHT implementations. It also supports comparison with conventional data analysis tools like Fourier and Wavelets.

Most of the existing implementations of the HHT have functions that are drawn from different numerical computing packages, and hence are generalized, not optimized particularly for HHT. PyHHT includes functions that are optimized specifically for analysis with HHT. They are designed to operate at the least possible computational complexity, thus greatly increasing the performance of the analysis. The paper includes examples of such components of EMD which have been optimized to operate at the least possible expense – in comparison with conventional implementations. This optimization can be done in a number of ways. One example of optimizing conventional algorithms for PyHHT discussed in the paper is that of cubic spline interpolation. It is a major bottleneck in the EMD method (needs to be performed twice over the entire range of the signal in a single iteration). Most implementations for cubic splines involve the use of Gaussian elimination, whereas for PyHHT the much simpler tridiagonal system of equations will suffice. Furthermore, it can be improved using many different methods like using NumPy vectorization, the weave and blitz functions in SciPy, or by using the Python-C/C++ API. Thus, the portability of Python comes in handy when optimizing the algorithm on so many different levels. The paper also discusses the possibility of further improving such functions that are the biggest bottlenecks in the EMD algorithm.

Other heuristics of the HHT include imposing different stopping conditions for the iterative EMD process. Once the IMFs of the original signal are obtained, their time-frequency-energy distributions can be obtained. PyHHT uses Matplotlib to visualize the distributions. The IMFs can further be used in computer vision and machine learning applications. PyHHT uses a number of statistical and information theoretic screening tools to detect the useful IMFs from among the decomposed data.

Finally we perform HHT on a few test signals and compare it with the corresponding Fourier and Wavelet analyses. We comment on the advantages and limitations of the HHT method and discuss future improvements in the PyHHT project.

Thursday, September 29, 2011

Empirical Mode Decomposition: Cubic Spline Interpolation

I've been programming for four straight hours. I'm nowhere close to improving the performance of my EMD algorithm. If anything, I've made it worse. I'm sure nobody fills up 2 GB of RAM completely just trying to calculate solutions to a tridiagonal system of linear equations.

Maybe the problem is with the linalg.solve function in SciPy. I'll have to look into it, compare it with other solvers. Wikipedia says tridiagonal systems are easily solved with Gaussian elimination. Gaussian elimination... that sounds like first year engineering mathematics, would certainly explain why I vaguely remember only the name.

But I have mastered the algorithm itself. Earlier my EMD program would bottleneck at the interpolation (which I earlier did with interpolate.splrep and splev). Now it's only a matter of how fast I can solve the tridiagonal system.
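
For reference, a tridiagonal system does not need a general dense solver at all. A dense solver works on the full n x n matrix, whose storage grows quadratically with the length of the signal, and that may well be where the memory goes. The sketch below is the standard Thomas algorithm (Gaussian elimination specialized to three diagonals), not the code from my EMD program; the function name solve_tridiagonal and the test system are hypothetical. It stores only the three diagonals and does no pivoting, so it assumes a well-conditioned system such as the diagonally dominant ones that arise in cubic spline interpolation.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system in O(n) time and memory (Thomas algorithm).

    lower[i] multiplies x[i-1] in row i (lower[0] is unused),
    diag[i] multiplies x[i], and upper[i] multiplies x[i+1]
    (upper[-1] is unused). No pivoting is performed.
    """
    n = len(diag)
    c = [0.0] * n   # modified super-diagonal
    d = [0.0] * n   # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward sweep: eliminate the sub-diagonal.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# A small spline-like system: 4 on the diagonal, 1 on either side.
# The right-hand side is built so that the exact solution is 1, 2, 3, 4, 5.
lower = [0.0, 1.0, 1.0, 1.0, 1.0]
diag = [4.0, 4.0, 4.0, 4.0, 4.0]
upper = [1.0, 1.0, 1.0, 1.0, 0.0]
rhs = [6.0, 12.0, 18.0, 24.0, 24.0]
solution = solve_tridiagonal(lower, diag, upper, rhs)
print(solution)
```

SciPy also ships a banded solver, scipy.linalg.solve_banded, which takes the diagonals in a compact array instead of the full matrix and is worth comparing against.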