Monday, May 7, 2012

Connexions Importer

Since the beginning of January, I have been fortunate enough to work with Kathi Fletcher and an international team of developers on the Connexions Importer project. Funded by the Shuttleworth Foundation (http://www.shuttleworthfoundation.org/fellows/kathi-fletcher/), the project aims to smooth content sharing by creating tools that make it easier for content creators to contribute to open educational resources (OER). The importer does this by converting content from various formats to CNXML and then to HTML, making it easy for contributors to share their content with anyone through Connexions (cnx.org).

My contribution to the importer has mostly been bug fixes, which helped me become familiar with the project's code base while actively contributing. I started with a couple of transformation bugs and then moved on to more complex bugs that required heavier fixes. They are listed below for your reading pleasure:

  1. CNXML Editor Server Error (Bug 61): Choosing the 'Edit CNXML' option in the importer's Advanced Mode caused an 'Internal Server Error'. I traced the issue to a Unicode error: the CNXML was being interpreted as ASCII instead of Unicode. Adding a couple of lines of code to explicitly ensure a Unicode interpretation fixed the issue.
  2. MathML Rendering Error (Bug 26): HTML documents with extensive MathML failed to render after being imported. This error had two causes. First, the MathML namespace that the browser needs to interpret MathML correctly was not set in the HTML document. Second, the importer used a modified, Connexions-specific MathJax script based on an older version of MathJax, which also caused MathML interpretation errors. Explicitly adding the XML namespace and replacing the modified script with an up-to-date MathJax script fixed the errors in Chrome, Firefox, and IE8. However, MathML still failed to load in IE9. I fixed this by adding a compatibility header that forced rendering in IE7 mode. More recently, the folks at MathJax have released a new version that thankfully removes the need for this workaround.
  3. Missing Images (Bug 137): Some embedded images in Google Docs fail to upload correctly through the importer. I investigated this issue with another intern and discovered that the failure was mainly due to a permissions issue. Images that are editable, such as PNGs, are stored in a different part of Google that requires broader permissions. Although we haven't found a fix yet, we have identified the source of the error, which is always half the battle.
  4. OpenOffice doc/docx to odt failures (Bug 123): My most significant contribution to date has been my fix for the OpenOffice bug. Simultaneous doc/docx uploads would periodically cause OpenOffice to hang, blocking every other doc/docx conversion request from completing (which is not entirely surprising, because OpenOffice was not built to handle multiple conversion requests). As an initial remedy, I modified the conversion pipeline to start OpenOffice as a background process. The process listened on a port, where it received and handled conversion requests from clients. I then wrote an additional script that ensures OpenOffice is listening whenever a conversion request is sent. However, this only handled the case in which OpenOffice crashes. It did not handle the case in which OpenOffice hangs after receiving simultaneous requests. While researching how other developers had solved this problem, I found that they suggested a Java-based tool called JODConverter (JOD). JOD handles simultaneous requests beautifully by creating a pool of OpenOffice processes that retrieve and handle requests from a global queue. It also provides additional features such as automatic restarts of OpenOffice after a crash, a task queue timeout, and a configurable maximum number of queued requests.
However, because JOD is written in Java, I had to build some surrounding infrastructure to make it compatible with the importer's Python code base. Luckily, the creators provided a sample webapp that receives HTTP requests while running locally on a Tomcat server, essentially making it compatible with virtually any language. I spent the next couple of weeks setting up JOD and then writing some Python code to construct and send HTTP requests. Then I ran some benchmarks to compare my solution to running OpenOffice as a daemon.
I saw a 2.5x improvement in conversion times for simultaneous requests and, of course, no OpenOffice freezes related to simultaneous requests. Finally, I wrote an install script to allow easy customization of JOD's features and to ease the integration of my solution into the current pipeline. Currently, I am working with another developer to test and verify my solution before integrating it into the importer's pipeline. After doing so, I will provide a GitHub link to my Python solution, which I hope will be helpful to anyone looking for a Python JOD solution.
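
For readers curious what the Python side of this can look like, here is a minimal sketch of the two pieces described above: checking that a conversion service is listening on its port, and constructing an HTTP request that carries a document. This is an illustration under stated assumptions, not the code from the project. The function names are hypothetical, and the service URL and the use of Content-Type/Accept headers to select the input and output formats are assumptions about how a JOD-style webapp might be configured.

```python
import socket
import urllib.request


def is_listening(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper: a watchdog script could call this before each
    conversion and restart the service when it returns False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def build_conversion_request(url, document_bytes, input_mime, output_mime):
    """Build (but do not send) a POST request carrying the document.

    Assumption: the webapp reads the source format from Content-Type and
    the target format from Accept. Check the deployed webapp before
    relying on this.
    """
    return urllib.request.Request(
        url,
        data=document_bytes,
        headers={"Content-Type": input_mime, "Accept": output_mime},
        method="POST",
    )
```

With a service actually running, the request could be sent with urllib.request.urlopen(request, timeout=...) and the converted document read from the response body. Passing a timeout on the client side means that a hung conversion fails that one call instead of blocking it indefinitely.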
The last couple of months have been quite exciting for me, and I look forward to seeing what new challenges this summer will bring.

Thanks for reading
-Gbenga

Update:

Here is a Dropbox link to my solution:
https://www.dropbox.com/s/8nt7dngi4e29zi4/JOD.zip