Redesigning Sequel Pro's Export Architecture
Ever since joining the team of developers working on Sequel Pro in November 2008, we've released four new versions, each including a huge number of major new features, performance improvements and bug fixes. Highlights of these changes include a completely redesigned preferences window, a redesigned query editor, built-in SSH tunnel support, our own forked version of the underlying MySQL framework and, most noticeably, major improvements to nearly every aspect of the application's interface. With every new release the feedback from users has been great, from invaluable insight into how they think something should work to general expressions of appreciation that the application and the project behind it exist. This has been even more evident with the latest release, version 0.9.6, with SSH tunnel support proving to be a very popular addition.
Having said this though, very few applications are considered perfect or perform every one of their functions extremely well, and Sequel Pro is no exception. With no disrespect intended to either the developers of the original CocoaMySQL (of which Sequel Pro is a fork and without which it wouldn't exist) or the other current Sequel Pro developers, CocoaMySQL had a lot of very annoying bugs and performance issues. Nearly all of these have now been addressed, but there are still a few areas that require some major improvement before I think we can release version one. I'm not talking about adding support for other RDBMSs, such as SQLite and PostgreSQL; although this is one of our major long term goals, it unsurprisingly requires a significant amount of work, including redesigning and rewriting large parts of the application. The particular areas I am referring to (which reflect my own personal view and not necessarily those of the other developers) are data import/export, user account management and advanced table content filtering using predicates. As is obvious from the title of this post, I'm interested in redesigning the application's data export architecture, as it, along with data import, is among the most heavily used functions of the application.
First, before I propose my grand plan for the export redesign, which will inevitably require a bit of rewriting and a lot of restructuring work, it's only right that I provide an overview of the current implementation and the problems associated with it. One of the first things I should point out is that the core functionality of the current data export (that is, the actual code that generates the CSV, SQL and XML from the data extracted from the database) is generally not the problem and actually performs quite well (with one big drawback, which I'll soon get to). The main problem with the current implementation is in the design and how everything fits together, which subsequently affects its performance. The following outlines the two main and most obvious problems.
The first is structure: put simply, there is none. All of Sequel Pro's data import and export methods are contained in a single class, which, if you ask any of the other developers, I'm sure they'll agree is far from pretty to look at, let alone work with. You might think that having everything to do with the current operation in one place (including the references to the relevant user interface controls) would make things a lot easier, as you don't have to worry about passing around references to objects of interest, but it doesn't. As is evident from the current implementation, the data import/export code can become quite complex, and having it all in one place doesn't help; it just adds to the overall complexity.
The second is multithreading, and this is the big drawback: just as above, there is none. None of the data export code is multithreaded, meaning any operation performed will be executed on the application's main thread. When exporting a large amount of data (regardless of output format), this means the entire application is blocked until the export is complete. What makes things worse is factoring in the speed of the network over which you are connected to the MySQL host. If, like me, you're connected to a database provided by your web host and it's located on the other side of the world, chances are you're going to be waiting a while (with no way of cancelling) if you attempt to export a significant amount of data.
As stated, the above are just two of the main issues with the current design; others are not so obvious and are in some cases no doubt side effects of these two. It should be noted, though, that addressing these areas is not going to solve every single issue with the current design and implementation. For example, splitting all the code into separate classes of common functionality and greatly improving upon its current non-existent structure is not going to improve the overall performance of a data export operation, but it will greatly improve the ease with which it can be maintained and further developed. In contrast, introducing a level of concurrency into an operation should bring significant performance improvements (assuming it's implemented right), but such concurrency, no matter how well it's implemented, is not going to solve the problems associated with working over a slow network.
Something that is inherently missing from the current implementation is any form of a recognisable design pattern. I'm a huge fan of design patterns when it comes to designing software, especially MVC. The reason being that it ultimately results in easily maintainable and reusable modules of code, something that is well worth thinking about before beginning any implementation.
The key thing to note about the data export operation is that very few areas of the complete operation are specific to the format selected by the user. For example, regardless of which format the user selects, there is a set of format independent operations that are always going to be performed, including giving the user a visual indication that the export process has begun, getting the data by querying the database, updating the UI to reflect the progress of the export process and finally, writing the resulting data to disk in the chosen format. The obvious missing step (which is format specific) is the conversion of the extracted data to the chosen format, which would take place while its progress is reported to the user.
It would then seem logical to split out all format specific data conversion code into their own respective classes. Because they all share some common properties and processes, such as indicating when the conversion process starts and ends, providing the progress of the conversion and specifying what character encoding should be used when dealing with the data, these separate classes should all be subclasses of a common exporter. For example, the end result would be SPExporter as the base class containing all non-specific exporter functionality, and data format specific subclasses such as SPXMLExporter containing all format specific conversion code. Adhering to this design would mean that the task of adding a new export format is as simple as subclassing SPExporter and implementing the format specific conversion code.
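A rough sketch of what that hierarchy might look like (all method and instance variable names here are illustrative guesses, not a final API):

```objc
// SPExporter.h -- base class holding everything the exporters have in common.
@interface SPExporter : NSObject {
    NSArray *exportData;             // rows supplied by the caller
    NSStringEncoding exportEncoding; // character encoding used for the output
    id delegate;                     // informed of conversion start/progress/end
}
- (void)setExportData:(NSArray *)data;
- (void)setDelegate:(id)aDelegate;
- (NSString *)exportedString; // overridden by each format specific subclass
@end

// SPXMLExporter.h -- one concrete format; CSV and SQL exporters take the same shape.
@interface SPXMLExporter : SPExporter
- (NSString *)exportedString; // emits the supplied data as XML
@end
```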
Note that the exporters should not have to worry about where the data comes from or how it is obtained. They simply need to be supplied with it, most likely in the form of a multi-dimensional array, as is the case when exporting entire tables. The exporters could also be designed to handle the data in the form of the raw result set (in this case an instance of MCPStreamingResult), although this would mean making the exporters database and even framework specific. If such support were added, though, the time consuming step of converting a result set to an array could be eliminated.
All of this is constructed and controlled in a single location by the new SPExportController, which in turn is designed to work with the newly redesigned export interface completed by one of the other developers. SPExportController is responsible for all UI interaction, the initialisation of a specific exporter based on the user's format selection and the task of supplying data to it. The new controller is also the assigned delegate of the exporter, allowing it to be informed of the progress of the data conversion process, which in turn enables the UI to be updated, providing accurate user feedback.
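That delegate relationship could be expressed as an informal protocol (a category on NSObject, as was common before 10.6; again, these method names are only a guess at the eventual API):

```objc
// Callbacks SPExportController might implement to track an exporter's progress.
@interface NSObject (SPExporterDelegate)
- (void)exporterProcessWillBegin:(SPExporter *)exporter;
- (void)exporter:(SPExporter *)exporter progressUpdatedTo:(double)percentComplete;
- (void)exporterProcessDidEnd:(SPExporter *)exporter;
@end
```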
The original idea for introducing concurrency into the export architecture was to manually manage all threads ourselves using NSThread. For example, as SPExporter is the superclass of all exporters, the idea was to give this class a single instance of NSThread, which each exporter subclass would inherit and subsequently use when performing its most time consuming tasks (e.g. converting a data array to CSV data). There is nothing immediately wrong with this approach, but upon implementing it the obvious problems of manually managing threads, and having to do it multiple times (resulting in code duplication in each specific exporter), become apparent. Regardless, with the introduction of NSOperation in Leopard and, more recently, Grand Central Dispatch (GCD) in Snow Leopard, it's no longer the easiest or preferred method of introducing concurrency into an application or process.
Since we're not yet targeting 10.6 only, we unfortunately can't use GCD, so NSOperation is the next best thing, and it is actually perfectly suited to what we are trying to accomplish. With the introduction of both GCD and NSOperation there has been a shift away from manually managing everything with threads, towards configuring and submitting isolated 'packets' of work to be executed concurrently, with their results collated once complete.
For our purposes, subclassing NSOperation is probably the best way to go, as it gives us complete control over everything (with the ability to add extra functionality), and because we are already creating an instance of an exporter, it makes sense to make SPExporter the subclass. Although we'll still have to manually create our own thread to make the operation concurrent, doing so enables all the custom NSOperation code to live in the exporter base class, with the subclasses only having to implement the data conversion code that is specific to each exporter in their main() method.
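In sketch form, with SPExporter inheriting from NSOperation, a format specific subclass reduces to little more than a main method (class names are illustrative):

```objc
// SPExporter now inherits the queueing and scheduling machinery from NSOperation.
@interface SPExporter : NSOperation
@end

@interface SPCSVExporter : SPExporter
@end

@implementation SPCSVExporter

// Only the format specific conversion lives in the subclass; the base class
// and the operation queue handle everything else.
- (void)main
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    // Convert the supplied data array to CSV here, periodically informing
    // the delegate of progress and checking -isCancelled.

    [pool release];
}

@end
```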
As stated, the most time consuming part of the export process once the data is obtained is the conversion of this data into the selected format. You could factor in the time it takes to get the data from the server, but this is a separate issue and has also been largely addressed by the addition of result set streaming. The data conversion process is therefore our designated 'packet' of work that needs to be run concurrently. The benefit of this is that we can now theoretically create as many exporter instances as we want, assign them 'chunks' of data to be converted and place them on an operation queue for the conversion process to be executed concurrently. For example, currently if we want to export ten different tables then we have to, one after the other, get the data from each table, convert it to whatever format and then write the data to separate files on disk. Using the proposed concurrent implementation, after getting the data (either concurrently or not), it can be assigned to one of the exporter instances and placed on the operation queue.
The end result is that the exporters are placed on the operation queue as soon as their data is available, without having to wait on the data from every table. The system then decides when each conversion should be executed based on available resources, which could potentially mean the data conversion for all ten tables taking place concurrently. The converted data could then be written to disk on a table per file basis, as is the case just now, but this could also happen as soon as each conversion completes, removing the need to wait on the data from the other tables.
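The queueing step itself would then be trivial; something along these lines (a method such as dataForTable: is a hypothetical stand-in for however the data is actually fetched):

```objc
// One exporter instance per table, all submitted to a single queue; the
// system decides how many conversions actually run at once.
NSOperationQueue *exportQueue = [[NSOperationQueue alloc] init];

for (NSString *table in exportTables) {
    SPCSVExporter *exporter = [[SPCSVExporter alloc] init];

    [exporter setExportData:[self dataForTable:table]];
    [exporter setDelegate:self];

    [exportQueue addOperation:exporter];
    [exporter release];
}
```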
So that's the plan. It's not perfect, but hopefully addressing these two issues will bring noticeable benefits to both end-users and the Sequel Pro developers. It's also a start towards a completely rewritten export, and possibly import, architecture. Although data import is fundamentally different at its core, the same general design and approach can be applied to it. The key difference would be in the implementation of the data conversion logic and what is done with its result: for example, converting CSV data into the relevant SQL INSERT statements and executing those statements against the database instead of writing them to a file.
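As a very loose illustration of that import-side conversion (the variable names are made up, and real code would need proper field escaping and type handling):

```objc
// One parsed CSV row becomes one INSERT statement to be executed against
// the database rather than written to disk. Quoting and escaping are
// deliberately omitted for brevity.
NSArray *fields = [csvRow componentsSeparatedByString:@","];
NSString *insert = [NSString stringWithFormat:@"INSERT INTO %@ VALUES ('%@')",
                    tableName, [fields componentsJoinedByString:@"', '"]];
```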
If you've reached this far and think what you've just read is more like a bunch of notes than a solid statement of intention, that's because it is (and the reason there aren't any pretty pictures or diagrams): my thoughts on how I think it should be done, not necessarily how it will eventually be done. It's always much easier to state how you would like something to be done in theory, which is why I'm sure I'll come across, and have to overcome, problems I haven't yet thought of while refining the design and eventually implementing it. All the more reason why any form of feedback is welcome and encouraged, especially on things I may have misinterpreted or overlooked.
Update: I guess I should read the docs more carefully. The statement that "we'll still have to manually create our own thread to make the operation concurrent" is not entirely correct, at least for what we are doing here. You only need to implement the spawning and management of your own thread in your main() method if you plan on starting the execution of the operation manually, that is, by calling its start() method as opposed to adding it to an operation queue. If you're not planning on starting the operation manually, you don't need to implement any threading in your main() method, only the operation code. Once added to an operation queue, the system will execute your main() method concurrently.