Thursday, October 07, 2010

Complexity, Size and Focus

Why is it so hard to understand software? Or, more to the point: what makes it so hard to write understandable software? This question is the driving theme of this article: the quest to uncover techniques of complexity management that go beyond established principles of software engineering such as information hiding and modularization, and that are not bound to specific paradigms like object-orientation or functional programming.

For that reason I studied two kinds of software systems in depth: telecommunication systems on the one hand, and modeling and programming languages on the other. I did so for two reasons. First, telecommunication systems are among the oldest, largest and most successful systems. They are highly reliable, robust and scalable. What are the key design principles that let them meet such demanding requirements? Second, each modeling and programming language is its creator's attempt to provide a set of language concepts to address and manage complexity. A language, be it a programming or a modeling language, is, so to speak, the distillate of another person's opinion, experience and expert knowledge about how to write "better" programs or how to design "better" software.

My observation and claim is that two factors are most responsible for what I capture under the term "complexity": size and focus. Size refers to the number of lines of code and the number of features; focus refers to intellectual comprehensibility.

The code base of today's operating systems runs into millions of lines of code. Recently, Linux was reported to have reached 10 million lines of code. Windows 7 is estimated to be of the same size. These numbers do not include application software typically shipped with these operating systems. Applications such as OpenOffice also amount to some ten million lines of code. The latest release of the Eclipse IDE (Integrated Development Environment), version 3.6, counts 33 million lines of code. Since the number of faults to be expected grows with code size (estimates range from about 1 to 25 faults per 1000 lines of code), software of this size inevitably contains a large number of faults.
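To make the last point concrete, here is a back-of-the-envelope calculation. It simply multiplies the code size by the fault densities quoted above; the function name `expected_faults` is my own, and the result is a rough estimate under the assumption that fault density is uniform across the code base.

```python
def expected_faults(lines_of_code, faults_per_kloc):
    """Estimate the number of faults from code size and fault density."""
    return lines_of_code * faults_per_kloc / 1000


LINES = 10_000_000  # roughly the size reported above for Linux

low = expected_faults(LINES, 1)    # optimistic: 1 fault per 1000 lines
high = expected_faults(LINES, 25)  # pessimistic: 25 faults per 1000 lines

print(f"{low:,.0f} to {high:,.0f} faults")  # 10,000 to 250,000 faults
```

Even at the optimistic end of the range, a system of this size would carry thousands of faults.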

Another dimension of size is the number of features: today's software is extremely feature-rich. A whole industry of education and consulting is built around teaching and configuring software that is too feature-rich to use out of the box. Office applications and SAP R/3 come to mind. This observation also applies to the programming languages used to build these systems. The most widespread languages are C, C++, C# and Java. Java, for instance, is so feature-rich that it requires a programmer to learn and understand a language specification of almost 700 pages of text. It is no exaggeration to say that most Java programmers master only a personal subset of these 700 pages.

The sheer volume of code in today's software is overwhelming and makes it impossible for a software developer to understand a software system in its entirety. It is a valid question whether this size complexity is inherent to the problem domain or a symptom of a certain design philosophy that has become mainstream and is manifested in languages like C(++), C# and Java. Alternative approaches indicate the latter: TeX, a typesetting system designed in the 1980s by Donald E. Knuth, is still top-class in its typesetting quality and widely used in academia and among textbook authors; many publishers prefer manuscripts produced in TeX. TeX is based on a language kernel with primitives for typesetting, and it can be easily extended via a powerful macro system. Apart from bug fixes, the TeX kernel has been kept stable for almost 30 years now. Nonetheless, the system has constantly adapted, via its macro system, to growing demands and emerging technologies. Another example is PostScript.

There are also alternatives to feature-rich languages like C(++), C# and Java. Kernel-based languages like Lisp/Scheme, Prolog, Forth and Smalltalk are easy to understand. Their implementations fit on a few pages of code. Thanks to their extensibility, they easily incorporated new paradigms and trends (e.g. object-orientation and aspect-orientation).
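To give a feeling for what "kernel-based" means, here is a minimal sketch of a Forth-like stack language in Python. It is a toy of my own making, not the implementation of any real Forth: the names `make_kernel` and `run` and the choice of three primitives are assumptions for illustration. The kernel knows only `+`, `*` and `dup`; everything else is added by the program itself through `: name body ;` definitions.

```python
def make_kernel():
    """Return the primitive words; each one operates on a stack (a list)."""
    return {
        "+": lambda s: s.append(s.pop() + s.pop()),
        "*": lambda s: s.append(s.pop() * s.pop()),
        "dup": lambda s: s.append(s[-1]),
    }


def run(source, words, stack=None):
    """Evaluate a whitespace-separated program and return the stack."""
    stack = [] if stack is None else stack
    tokens = source.split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == ":":
            # ": name body ;" extends the language with a new word.
            end = tokens.index(";", i)
            name, body = tokens[i + 1], " ".join(tokens[i + 2:end])
            words[name] = lambda s, body=body: run(body, words, s)
            i = end
        elif tok in words:
            words[tok](stack)
        else:
            stack.append(int(tok))  # anything else is an integer literal
        i += 1
    return stack


words = make_kernel()
run(": square dup * ;", words)            # extend the language from within
print(run("3 square 4 square +", words))  # [25]
```

The interpreter itself never changes when `square` is introduced; the vocabulary grows while the kernel stays the same size.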

Another aspect of complexity management is focus. From a cognitive viewpoint, complexity is a human being's inability to intellectually manage information which is (a) too much and (b) spread in time and space. It is a combination of information overload and a failure to recognize temporal and/or spatial patterns. Two techniques address these issues: one is condensation, the other is localization. Condensation comes in two forms: abstraction building and modeling. Abstraction building can be reversed by refinement without loss of information; modeling condenses at the price of losing information, thereby simplifying things. A simplification introduces faults and errors; an oversimplification overstretches the acceptance of incorrectness. Localization is a technique to bring together (to bring into focus) what was spread and distributed before and thus appeared to be unrelated and unconnected. To concentrate on a problem (domain) means to put it in focus, to single out and localize the relevant parts and highlight their relations, which might be spatial (i.e. structural) and/or temporal (behavioral). The act of localization establishes a new context, a new perspective or point of view, a new universe of discourse, a new domain.
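The difference between the two forms of condensation can be illustrated with a small analogy in Python. This is only a sketch under my own assumptions: run-length encoding stands in for a condensation that can be refined back without loss, and a mean value stands in for a model that deliberately drops detail. The function names are made up for this example.

```python
from itertools import groupby


def condense_lossless(values):
    """Run-length encode: every original value can be recovered."""
    return [(v, len(list(g))) for v, g in groupby(values)]


def refine(pairs):
    """Reverse the lossless condensation."""
    return [v for v, n in pairs for _ in range(n)]


def condense_lossy(values):
    """Summarize by the mean: shorter, but the details are gone."""
    return sum(values) / len(values)


signal = [2, 2, 2, 5, 5, 2]
pairs = condense_lossless(signal)

print(pairs)                    # [(2, 3), (5, 2), (2, 1)]
print(refine(pairs) == signal)  # True
print(condense_lossy(signal))   # 3.0
```

The value 3.0 is a useful summary, yet it never occurs in the original data, and no refinement step can reconstruct the signal from it. That is the price of a model.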

In software engineering, several techniques have been developed for abstraction and localization. Among many other ideas, I would just like to mention abstract data types, object-orientation and meta-object protocols, aspect-orientation, meta-programming and macro systems. All these approaches have one thing in common: they try to rearrange parts of a software description, they try to bring things into focus, they localize. I call the flexibility of a language to adapt to different localization needs its expressiveness.
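As a small illustration of localization, consider call tracing, a concern that would otherwise be scattered across every function. The following Python sketch uses a decorator to gather it in one place, in the spirit of aspect-orientation; the names `traced` and `call_log` are hypothetical and chosen only for this example.

```python
import functools

call_log = []


def traced(func):
    """Record every call to func; the tracing concern lives only here."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        call_log.append((func.__name__, args, result))
        return result
    return wrapper


@traced
def add(a, b):
    return a + b


@traced
def scale(x, factor):
    return x * factor


scale(add(1, 2), 10)
print(call_log)  # [('add', (1, 2), 3), ('scale', (3, 10), 30)]
```

Neither `add` nor `scale` contains a single line of tracing code. What was spread over many places is now described once and can be read, changed or removed in one spot.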

Interestingly, the other aspect of condensation, modeling, is rarely used in software engineering in a systematic manner, that is, with a clear understanding of the degree of incorrectness and imprecision a model introduces. This understanding of modeling differs significantly from the common interpretation of the term. Typically, modeling is taken to mean a form of visual programming or a means to visually create code templates.

My assumption is that small systems are a natural consequence of designing with extremely expressive languages. Empirical data point in this direction: systems developed in expressive languages like Lisp/Scheme, Prolog, Python or Ruby are reported to need less code than comparable systems written in C, C++, C# or Java. The former languages (Lisp etc.) are quite expressive, whereas C and the others strictly separate the language from the problem domain. If a certain localization need is not covered by language features, frameworks need to be designed and implemented to simulate expressiveness.

I think that software engineering has so far underestimated the use and the value of highly expressive languages and highly extensible kernel-based systems.