- The first shop is organized so that each machine runs at optimal capacity, in its best operating conditions. Buffers are introduced and the transport between machines is a little longer (so that each machine can be set up optimally).
- The second shop is organized so that the flow is shortened as much as possible. Buffers are reduced (and eliminated where possible) and the transport is optimized. The consequence is that each machine no longer works optimally: some are underutilized and others run in operating modes that do not yield the best productivity.
What does Lean Manufacturing (and experience) say? On paper, the first shop obviously costs less to operate (cost per unit produced), but unless it operates in an ideal world with no variation at all, it actually costs more in real life. The second approach costs less from an inventory perspective, but above all it is more flexible (with respect to priority changes) and more robust (with respect to load variations).
Let us now consider two information systems, within our scope of large-scale, distributed information systems (many parallel nodes running business processes):
- The first one has been designed so that each node runs close to its optimal capacity. A node here may be a group (cluster, farm, blade) of servers running services, which are the elementary components of the business processes. The computing power of the node is sized so that it runs at 85% capacity under full load (i.e., when the business processes are running at their maximal expected load).
- The second one has been designed to speed up process execution and to avoid queuing wait time. Hence the computing power of each node is adjusted so that the average utilization ratio is closer to 50%.
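The cost of running a node "hot" can be made concrete with the simplest queueing model. This is a sketch under M/M/1 assumptions (Poisson arrivals, exponential service, service rate normalized to 1), not the model used in my own experiments: the mean response time is 1/(mu - lambda), which already separates the 85% and 50% designs sharply.

```python
# Sketch (M/M/1 assumptions, not the post's simulation model):
# mean response time W = 1 / (mu - lambda), with service rate mu
# normalized to 1 job per unit of time.

def mean_response_time(utilization: float, mu: float = 1.0) -> float:
    """Mean sojourn time (wait + service) of an M/M/1 queue,
    expressed in multiples of the mean service time."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    lam = utilization * mu          # arrival rate implied by the target utilization
    return 1.0 / (mu - lam)

for rho in (0.50, 0.85):
    print(f"utilization {rho:.0%}: mean response time = "
          f"{mean_response_time(rho):.2f} x service time")
```

At 50% utilization a request takes on average 2 service times; at 85% it takes almost 7, before any stress is applied.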
Here also, the first data center is clearly cheaper to build than the second one. The second one has a few advantages: a better SLA (service level agreement) may be promised to the customer (tighter = faster guaranteed response time), and the upgrade process (as the company grows) may be planned in a more regular way ... but let us assume that these are not compelling advantages. That is, let us suppose that the customer accepts two different SLAs:
- in the first case, the SLA is such that the target response time is met 98% of the time under regular business conditions.
- in the second case, the SLA is also such that the target response time is met 98% of the time (but the target itself is a smaller number than in the first case).
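The gap between the two targets can be sketched with the same M/M/1 assumptions as above (an illustration, not the figures behind the actual SLAs): the response time of an M/M/1 queue is exponentially distributed with rate mu - lambda, so the target met in 98% of cases is -ln(0.02) / (mu - lambda).

```python
# Sketch (M/M/1 assumptions): the 98th-percentile response time that
# each design can promise, in multiples of the mean service time.
import math

def sla_target(utilization: float, quantile: float = 0.98,
               mu: float = 1.0) -> float:
    """Response time met in `quantile` of cases for an M/M/1 queue."""
    lam = utilization * mu
    # Sojourn time is exponential with rate (mu - lam); invert its CDF.
    return -math.log(1.0 - quantile) / (mu - lam)

print(f"85% node: promise {sla_target(0.85):.1f} x service time")
print(f"50% node: promise {sla_target(0.50):.1f} x service time")
```

Under these assumptions the 50%-utilization design can promise a target more than three times tighter for the same 98% compliance, which is the sense in which the second SLA is "a smaller number".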
These experiments were reported in a talk that I gave at the "Colloque d'Automne du LIX", from which I have extracted the last slide:
You may find the complete presentation on the CAL web site. To keep things simple, the curves describe the behavior of systems (1) and (2) under different stress scenarios. The different curves correspond to different strategies of "adaptive middleware" (recall my interest in autonomic computing :)). What matters here is that the lower curve reproduces the strategy that ALL existing systems use today (first-come, first-served). What you may see is a tremendous difference:
- The lean IS (on the left) actually does very well under stress. Only the loss of a node creates a real problem (and it is not major: the SLA drops to 75%).
- The loose IS (on the right) is definitely not robust. The stress conditions cause a significant drop of the SLA (down to 20%!).
There is another way to say it: if your IS is run in such a way that message queues are often full of pending requests, setting up a proper SLA is a very difficult job, because predicting the behavior (response time) of an overloaded queueing system is hard science. It is not enough to add reasonable margins (such as promising a 10-minute response time because the average processing time is 1 minute).
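A toy single-server simulation (my own illustration, not the model from the talk; the burst size and durations are arbitrary assumptions) shows why fixed margins fail: a temporary 20% load burst barely moves response times at 50% utilization, but near saturation it pushes the queue into overload and response times explode far past any "10x the average" margin.

```python
# Toy Lindley-recursion simulation of a single server with exponential
# service (mean 1) and Poisson arrivals. A 20% load burst is applied
# during the middle fifth of the run. All parameters are illustrative.
import random

def simulate(base_rho: float, n_jobs: int = 200_000, seed: int = 1) -> float:
    """Mean response time, in multiples of the mean service time."""
    rng = random.Random(seed)
    clock = free_at = 0.0   # current arrival time / time the server frees up
    total = 0.0
    for i in range(n_jobs):
        # 20% load burst during the middle fifth of the run
        burst = 2 * n_jobs // 5 < i < 3 * n_jobs // 5
        rho = base_rho * (1.2 if burst else 1.0)
        clock += rng.expovariate(rho)           # next Poisson arrival
        start = max(clock, free_at)             # wait if the server is busy
        free_at = start + rng.expovariate(1.0)  # exponential service time
        total += free_at - clock                # this job's response time
    return total / n_jobs

for rho in (0.50, 0.90):
    print(f"rho = {rho}: mean response = {simulate(rho):.1f} x service time")
```

At base load 0.9 the burst drives utilization above 1, the backlog grows linearly for the whole burst, and the average response time ends up orders of magnitude above the unstressed value; the 0.5 system barely notices.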
There is nothing new here: this experiment confirms what experience and intuition suggest. What is interesting (and what surprised me) is the HUGE difference that the computing experiment reveals.
I plan to run similar experiments within the (global) enterprise context. I need a model that links the behavior of the IS with that of the company itself. Fortunately, I can rely on the great work (and models) just released by the CEISAR.
The CEISAR is a French initiative, under the patronage of the Ecole Centrale, to create a repository of models and practical knowledge about Enterprise Architecture. A first gem is their global model (follow "main concepts" then "Core Business System" on their web site), an attempt to define Enterprise Architecture with 10 key concepts. Another extremely useful piece of the first release is a document about entity modeling. In one of my books I complained that this type of knowledge was not accessible (and could only be obtained from experience). It is nice to see real-life experts, such as Jean-René Lyon, share their knowledge about such topics.
I definitely plan to adhere to CEISAR's terminology and framework in my own future work on IS architecture. One of the most pressing issues (as I have already testified on this blog) is to build a framework/model to explain, discuss, and simulate data distribution and synchronization protocols. The only way to make this a relevant topic is to keep a very broad perspective, one that includes a model of the coupling between IS and business. The nice conclusion is that this type of work falls neatly between my two topics of interest (cf. my other blog): IS efficiency and Enterprise efficiency.