Monday, March 23, 2015

Internet of Things and Smart Systems


Last month I had the pleasure of leading two roundtables about connected objects during an event jointly organized by Cité des Sciences and the National Academy of Technologies entitled “Connected Objects: a third digital revolution”. The first roundtable gave me the opportunity to discuss the systemic vision that is necessary to understand and design connected objects, together with a very interesting list of companies: Terraillon, Medissimo, Connected Cycle, Withings, Kolibree. This post proposes a summary of some of these ideas, with a focus on understanding what makes connected objects smart and useful. This is not a summary of the complete discussion; it is filtered by my bias towards system analysis, as expressed in a previous post from this blog. In a way, this is a follow-up to the talk that I gave at the 2013 Amsterdam conference on Smart Home, which I made available on Slideshare. It is also influenced by discussions at the IRT SystemX, the French technology research institute on “Systems of Systems”.

I will follow a simple outline:
  • First I will take a look at “connected objects” seen as objects with the additional benefits of being connected. My opening statement during the roundtable was that we should not care about this from a technology standpoint (what is added to the object) but focus on the customer experience. This may sound like a boring and self-evident statement, but my experience shows that (1) there is too much excitement about the technology and (2) not enough effort on experience design, so that we are bound to see more disappointment with connected objects.
  • The second section will look at the Internet of Things (IoT) as a component of the Web Squared vision. With this second level of system analysis, connected objects interact with the complete Web in both directions: they bring additional senses (through sensors) to the Web and become contact / interaction points (bringing the Web “closer” to the user, what Joel de Rosnay calls the “clickable environment”). Obviously, the burning engine of this system is the value that we can bring through Big Data, a point that was made abundantly clear during our roundtable discussion and which is the topic of many books. However, a similar observation may be made: data follows usage, and focusing on customer experience first is the best way to avoid technology mirages.
  • The last section of this post will focus on “smart objects”, as defined by Olivier Ezratty – who was also a participant in the roundtable – as “objects that talk to one another”. There is a consensus that “we are not there yet” in the sense that experiences that are built on many connected objects interacting with each other are rare, tend to involve objects from the same manufacturer, and tend to follow a closed set of possible interactions. “Smart objects” correspond to a more advanced “connected object ecosystem”, which is necessary to unlock more value for the end user.

1. Connected Objects Seen as Objects

At a first glance, or at a first systemic level of analysis, a connected object gets two additional benefits from its connection:
  • Access to a deported interface, which is most often the smartphone. The tremendous capabilities of the smartphone (quality of the screen, touch interface, motion sensors, etc.), together with its ubiquity, make it a wonderful user interface for most scenarios.
  • Access to large storage and computing capabilities. This is true for computing, in the sense that one may do more on a smartphone or in the cloud than on the embedded chipset of the connected object. This is even more important for storage, since external storage gives both better capacity and durability. Connected objects have “memory”, as exemplified by the Withings connected scale.
We will look at the deeper benefits later on, but I would maintain that the large majority of connected devices that are available today are mostly selling the benefit of being controlled by the smartphone with some form of extended memory. That is, the additional benefits of true Web, Big Data and Smart System integration are either not there, not developed enough to be useful, or readily available in other ways.

My experience, especially in the realm of smart homes where I was quite active in my previous job, is that remote interaction with the smartphone does not carry enough weight to last long. One needs to add systemic value to the connected experiences, based on scenarios, context, and personalization, to keep the interest going. Otherwise, once the novelty wears off, you are left with the underlying object: if it is a great, useful object in itself, you keep using it; otherwise you simply discard it. This is precisely what I said during the Amsterdam Smart Home talk, with a focus on “life moments”, which are the daily events that may trigger scenarios.

The business model of such connected objects is, most often, to sell the object with a premium that corresponds to the improved service. The challenge is twofold: (a) to deliver enough recurring value and (b) to reduce the ownership cost/burden of the “connectedness”. The keyword “recurring” is important: it is easy to sell a connected object with the combination of aesthetic design and technology excitement, but if there is no recurring value, the business model is not sustainable. The ownership cost grows over time, with issues such as battery replacement / charging, network setup and update (the first wifi/Bluetooth pairing is easier because of the excitement of the recent purchase; the later ones, once the box and user guide card are gone, are trickier). Genevieve Bell, the anthropologist from Intel, explains this very well: each connected object is begging for attention, from its notifications to its battery charge, creating a true “cost” of ownership.

The Nurun design agency has a nice way of formulating the value challenge: the value of the connected object, defined as immediate (out of the box), aggregate (over time) and emerging (such as social benefits), must outweigh the replacement cost. This is the only way to generate a sustainable business. As of today, and if the systemic value that we will discuss in the next sections is not there, there are few objects that will pass this test, and their success comes more from their intrinsic value (objects that prove useful and pleasant to use irrespective of their connection) than from their “connected” status.
To solve this difficult equation, the technology battle to win is excellence in service delivery (especially the app on the smartphone). Companies need to develop state-of-the-art skills with respect to digital software. This makes perfect sense since this is a “service business model”: delighting the customer with superior services associated with the object. This is where the “Web Giants’ lessons”, together with Lean Startup principles, are relevant.

Giving memory and control to objects requires sending data to the service provider, which raises the issue of respecting data privacy. Our second roundtable was dedicated, among other things, to this issue. It would require a different blog post to deal with it, but I would like to point out that, as of now, the absence of real value delivered to customers from their collected data is a bigger issue than the respect of data privacy, which is a great way to introduce the next section. Waze is a great example to remind us that applications which generate true user value get access (consent) to lots of personal data.

2. Internet of Things and Web Squared

If we pursue our thinking, and look at a more global system analysis, the connected object may benefit from two additional features:

  • It may send information to the cloud, using its sensors. The object is a mobile / wearable / personalized data capture device, which feeds “global services”, hosted on the Web. This corresponds to the “IoT as the senses of the Web” metaphor.
  • It may receive information from the cloud and act upon it. The object becomes animated, from a simple display (where the connected device acts as a more ergonomic and more convenient device than, say, the smartphone) to a complete animated behavior change (for instance a small robot whose interaction embodies a web service).

These two sides are the two parts of the “Web Squared” vision: the connected object becomes part of something bigger. The object is not enough; it’s the digital experience that matters. The value does not come solely from what is in the object (sensors & actuators included); it comes in large part from what is in the cloud. If you are not familiar with the “Web Squared” concept, the movie “Eagle Eye” is a great and entertaining illustration.

Because of its very large scope, it is difficult to categorize the different business models that such a system approach enables. Without completeness, we can list:
  • Personalization and contextualization of a digital experience: the connected object is used to enrich a positive experience that is sold as a service. This leverages the first direction: object to cloud. Ambient digital experiences rely on connected objects.
  • Reciprocally, connected objects may better the digital experience through improved usability. In his book “Enchanted Objects: Design, Human Desire and the Internet of Things”, David Rose writes about glanceability, gesturability and wearability, which are three compelling reasons to use connected objects as a way to interact with the Web.
  • The complete B2B scope of connected objects, which is huge and outside the scope of this blog post, fits nicely into this object-to-cloud system view. Although the B2B model has its own logic, there are clear dependencies and cross-opportunities. For instance, metering devices, which are introduced with their separate business logic, may become part of customer digital experiences.
  • IoT/Cloud integration is a way to improve the efficiency of business processes (I will not qualify them as digital, since most business processes nowadays are digital in some form). This efficiency improvement may translate into cost reduction that may be passed on to the consumer. Improved efficiency may come from cost reduction (cf. the metering example), risk avoidance or better risk characterization (insurance), or performance improvement (cf. automated logistics). The “pure case” (when the value created by the efficiency gain is larger than the cost of the connected device) is uninteresting (objects become connected and the cost of their “status upgrade” is not seen by the customer) but common (more and more objects will be built with “connected object capabilities” – think of your printer’s ink cartridges). The hybrid case (where the created efficiency value is not enough on its own, but is part of the business case for the connected object) is the most interesting one.

If we get back to our topic of “consumer connected objects”, the common umbrella for most of the digital experiences that may be enriched in a systemic way through connected objects is the “assistance” experience, from immediate to long-term assistance and coaching, from Siri or Google Now to healthcare services. Assistance requires context (personal data), knowledge (big data), personalization (machine learning) and problem solving (automated reasoning). The consensus during our roundtable discussion was that there needs to be a fair amount of each of these capabilities to provide a cloud assistance or coaching service that can bring sustainable recurrent value to the user. Most wearable devices, intended to be a piece of a wellbeing / health care digital experience, are still very far from meeting this goal, as can be seen from the mediocre smartphone applications and the fast decline in usage rate once the initial excitement wears off. The buzz about “quantified self” gives a good illustration of this point. Three years ago, I conjectured that quantified self would fit perfectly a combination of narcissists, geeks and people who enjoy “system thinking” (if seeing a curve with the evolution of your bio-data tickles you, you’re in). This is a fairly large intersection – I fit right in there :) – but it is still a niche. For most people, the “quantified self” experience left to oneself – which is exactly what I enjoy about my Withings connected scale – is too abstract and must be replaced by a true coaching experience.

This leads to two major challenges:
  • To generate knowledge from connected objects, one needs smart algorithms and data, and a little bit of time. To get data, one needs both usage and consent from the user, both of which require delivering actual value. There is a chicken-and-egg conundrum: one needs data to deliver value, and value to attract the data. In most cases, the solution is to follow a two-step process, where the value generated by Big Data analysis comes in a second step, with a different value model to bootstrap usage.
  • The combination of information and technology required to produce a viable assistance or coaching service is accessible to large & technology-focused companies such as Apple or Google, but difficult to reach for most companies. This means that a partnership is required between different players, one of which is the connected object manufacturer. The object is not the center of the business value; what matters is reaching the critical mass of content (what to say), contextualization (when to say it), personalization (how to say it) and knowledge (why to say it). It is possible to add human interaction into the system loop, as a way to add knowledge or reasoning, which brings the digital experience closer to traditional coaching, but this usually comes at a cost. Still, there is likely a sweet spot for hybrid assistance or coaching services that mix the benefits of smart technology and human care.

The technology challenges associated with this systemic level are different from those of Section 1. The first key domain is API (Application Programming Interface) architecture, used both for data collection and for ecosystem partnerships. The importance of APIs comes from the second observation above: partnerships will be required. The second domain is obviously data mining, as in Big Data, which must be a strong point either of the aspiring connected object contender or of one of its preferred partners. As stated earlier, most of the connected devices that have been introduced so far, especially in the field of connected health, fall short of delivering value with their data mining abilities.

There is a related challenge to this system vision, which is the social acceptability of “smart” cloud-based coaching services. The question of the “social acceptability” of artificial intelligence is also worth a separate discussion which I will address in my next post. During a presentation at the NATF, Dominique Cardon explained to us that the word “algorithm”, which had a very positive image a few decades ago as a symbol of technological progress, is now seen with more suspicion by the average citizen. “Smart” systems are under scrutiny, whereas social systems, where recommendation comes from the community, have a clear trust advantage.

3. Smart Objects Ecosystems

The ultimate goal of smart objects is not to improve, enrich or extend existing experiences; it is to create brand new ones. This will happen once smart objects start to act as a smart system, which supposes that they interact with one another. Building truly new and exciting experiences requires adaptive and intelligent behavior, obtained from distributed control and autonomous smart communication between objects. Today, the “state of the art” for mashing up connected objects and web services seems to be IFTTT, which is deterministic, rule-based and centralized.
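
To make “deterministic, rule-based and centralized” concrete, here is a minimal sketch of a trigger-action hub in Python. It is not IFTTT’s actual API: the `Rule` and `CentralHub` names, the event fields and the two rules are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]  # hypothetical event format, e.g. {"source": "door", "state": "open"}


@dataclass
class Rule:
    """One 'if this then that' rule: a trigger predicate and an action."""
    name: str
    trigger: Callable[[Event], bool]
    action: Callable[[Event], str]


class CentralHub:
    """Centralized: every event from every object goes through this single hub."""

    def __init__(self, rules: List[Rule]) -> None:
        self.rules = rules

    def dispatch(self, event: Event) -> List[str]:
        # Deterministic: the same event against the same rules always
        # yields the same list of actions. Nothing is learned or adapted.
        return [rule.action(event) for rule in self.rules if rule.trigger(event)]


# Two made-up rules connecting objects to actions.
rules = [
    Rule(
        name="evening light",
        trigger=lambda e: e.get("source") == "door"
        and e.get("state") == "open"
        and e.get("hour", 0) >= 19,
        action=lambda e: "turn on hallway light",
    ),
    Rule(
        name="weigh-in log",
        trigger=lambda e: e.get("source") == "scale",
        action=lambda e: f"log weight {e['kg']} kg",
    ),
]
hub = CentralHub(rules)
```

Calling `hub.dispatch({"source": "door", "state": "open", "hour": 21})` returns the hallway light action, and the same door event at 9 in the morning returns nothing. The closed rule set and the single dispatch point are exactly what separates this style from the distributed, adaptive behavior described above.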

This long-term goal of what would be described as a smart object ecosystem is close to the general topic of this blog, in the spirit of Kevin Kelly’s organic/grown distributed systems. I have already mentioned the ideal of “calm computing” in a previous post. The three key principles of calm ubiquitous computing are the invisibility of the “smart dimension” (computing) – which does not mean invisible objects – the ability to switch off the assistance, and implicit machine learning – the smart object ecosystem must learn from the user, not the other way around. I often quote Adhoco, a Swiss smart home solution provider, as an example of an existing smart object service that learns from its users and operates under calm computing principles.
Another direct application of biomimicry is the design of fail-safe smart systems. Not only should control be distributed across the system elements rather than centralized, following complex systems design principles, but smart functions and devices must also operate as assistance to lower level functions, and not the opposite. Following the simple principle that the more sophisticated the device, the more likely it is to fail, control must be designed so that lower level automation may function independently from the more advanced “smart” or “adaptive” component.

A traditional failure of the advanced smart home systems of the previous decade was the introduction of complex computerized systems to control low-level but vital automatic functions (such as light control or blind opening). The inevitable computer problems would then translate into a dysfunctional house, which is anything but smart. The appropriate modern design is both to avoid single points of failure and to make sure that high-level control logic is introduced as a fail-safe “add-on” to the lower levels of system logic.
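
The layering principle can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the “smart” layer is reduced to an optional callback, and the rule that a manual switch press always wins is my own simplification.

```python
from typing import Callable, Optional

# Hypothetical smart layer: given the current light state, it may suggest
# a new state (True/False) or abstain (None). It is allowed to fail.
SmartAdvisor = Callable[[bool], Optional[bool]]


def basic_light_control(switch_pressed: bool, light_on: bool) -> bool:
    """Low-level automation: a wall switch toggles the light.

    It depends on nothing else, so it keeps working whatever happens above it.
    """
    return (not light_on) if switch_pressed else light_on


def control_light(
    switch_pressed: bool,
    light_on: bool,
    smart_advisor: Optional[SmartAdvisor] = None,
) -> bool:
    # The low-level function always runs first and always yields a valid state.
    state = basic_light_control(switch_pressed, light_on)
    # Assumption: a manual press always wins over the smart layer.
    if switch_pressed or smart_advisor is None:
        return state
    try:
        suggestion = smart_advisor(state)
    except Exception:
        # The smart layer crashed: the house keeps behaving like a normal house.
        return state
    return state if suggestion is None else suggestion
```

The smart component is an add-on to the basic function, never a prerequisite for it: removing it or letting it crash leaves the switch working exactly as before.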

The business model of a smart object ecosystem is to provide a new experience, or a new level of satisfaction linked to an existing digital experience. The conviction that I expressed during this 2013 Amsterdam Smart Home interview is that it requires an “operator”, that is, a company that takes responsibility for installing and maintaining the smart system. The alternative approach is to believe in interoperable standards and let customers be the integrators of their own systems, from their home to their health digital environment. It follows from the low level of maturity of “smart object interaction”, which we all agreed on during the roundtable, that letting the user be the integrator is a risky business scenario. This DIY (Do It Yourself) approach has two drawbacks, which I have seen firsthand with smart home experiments: first, the smart object setup is a barrier for all consumers but a few passionate geeks; second, the simplicity required by DIY installation prevents the delivery of a true “smart system” behavior, which keeps the experience within the range of what is achieved with the Section 1 and Section 2 approaches. This means that the value associated with the “target ideal experience” (of a smart home, for instance) is simply not there, which prevents charging a sustainable fee for the new service. The business challenge is to set up a brand with a clear promise, a distribution and installation network, and the service maintenance infrastructure. I see the move of US telcos, such as AT&T or Verizon, towards becoming smart home operators as a significant signal.

The technology challenges associated with this “third level of system integration” for connected objects are precisely linked to system integration. The skills and the software challenges are also different from what we saw in Sections 1 and 2. Open system design, fault-tolerant architecture and machine learning are three key components of what is necessary to build a smart object ecosystem. Although digital and Big Data skills are bound to play an important role (level 3 encompasses levels 1 & 2), this is foremost a “system of systems” integration game. Usability, availability, reliability, maintainability and adaptability are the key design challenges for smart object ecosystems.


On the one hand, I am expecting a lot of disappointment with respect to connected devices in the years to come because I believe that many of the barriers identified in this post still have to be removed. A very crude summary would be to say that connected objects are not enough, and that we are still far from the promise of smart “assistance” through connected objects, because there is a critical mass of data, learning and service innovation that requires more time and energy than what has been spent so far. On the other hand, I am a big believer in connected objects because I think that the best is yet to come. This is just the beginning: not only will technology improve in all directions (miniaturization, better performance, new interaction channels, much better speech, image and pattern recognition, etc.), but time will turn the conundrum identified in Section 2 into a virtuous cycle. The more time passes, the more data we accumulate, which enriches the value that may be delivered through connected objects.

I believe that the distinction made here between three business models for mass market connected objects is useful, because it emphasizes different challenges and different expectations from the user. These three dimensions are not exclusive, many connected object strategies are bound to be a combination, but it matters to focus on the true benefits brought to the consumers. Here is a last attempt to summarize these three models:
  • The “service” business model uses objects to deliver improved services, with a key contribution of digital software excellence to the connected object experience.
  • The “efficiency” model uses objects to deliver better efficiency through big data, with a technology focus on API architecture and data mining.
  • The “new business” model provides a new experience that is made possible by a system of connected objects, where the technical challenge is the combination of usability, resilience and machine learning, from a “system of systems” point of view.

Friday, October 24, 2014

Big Data hides more than one paradigm shift

This post originates from a report which I wrote this summer for the NATF, which proposes a summary of what the ICT commission learned from its two-year cycle of interviews about Big Data. The commission decided in 2012 to investigate the impact of Big Data on the French economy. Big Data is such a popular and hyped topic that it was not clear, at first, if a report would be necessary. So many books and reports have been published in the past two years (see the extract from the bibliography in the next section) that it made little sense to add a new one. However, throughout our interviews, and thanks to the FoE conference that NATF co-organized last year with NAE in Chantilly – which included Big Data as one of its four topics – we came to think that there was more to say about the topic than the usual piece about new customer insights, data scientists, the internet of things and the opportunities for new services.

Today I will focus on two ideas that may be characterized as paradigm shifts. I will keep this post short so it may be seen as a “teaser” for the full report, which should be available soon. The first paradigm shift is a new way to analyze data based on systemic cycles and the real-time analysis of correlation. The old adage “correlation is not causation” is made “obsolete” because data mining is embedded into an operational loop that is judged, not by the amount of knowledge that is extracted, but by the dollar amount of new business that is generated. The second paradigm shift is about programming: Big Data entails a new way to produce code in a massively distributed environment. This disruption comes from two fronts: on the one hand, the massive volume of data requires distributing both data and procedures; on the other hand, algorithmic tuning needs to be automated. Algorithms are grown as much as they are designed; they are derived from data through machine learning.
The full report contains an overview of what Big Data is because a report from NATF needs to be self-contained, but this is not necessary for a blog post. I assume that the reader has some familiarity with the Big Data topic. Otherwise, the Wikipedia page is a good place to start, followed by the upcoming bibliography entries, the first of which is Viktor Mayer-Schönberger and Kenneth Cukier’s book.

1. Big Data – A Revolution That Will Transform How We Live

This is the title of the book from Viktor Mayer-Schönberger and Kenneth Cukier, which covers the most famous paradigm shift of Big Data: its ability to transform our lives, from hard science, such as medicine, to marketing. The paradigm shift comes from the combination of what technology makes possible today – the ability to analyze very large amounts of heterogeneous data in a very short amount of time – and the availability of the relevant data, which are the traces of our lives that have become digital. Thanks to the web, to smartphones, and to technology which is everywhere in our lives and objects, there is a continuous stream of information that describes the world and our actions. Big Data may be described as the Information Technology which is able to mine these “digital logs” and produce new insights, opportunities and services. The constant improvement of technology (from Moore’s Law about processing to Kryder’s Law about storage) is matched by the increase in digital details about our lives. New connected objects, sensors and the growth of IoT (Internet of Things) mean that we are only seeing the beginning of what Big Data will be able to do in the future.
One of the reasons for not discussing these themes further is that the book from Viktor Mayer-Schönberger and Kenneth Cukier covers them very well, so I encourage you to read it. The other reason is that there are many other sources that develop these theses. Here is a short extract from our report’s bibliography:
[1] Commission Anne Lauvergeon. Un principe et sept ambitions pour l’innovation. 2013.
[2] John Podesta et al. Big Data: Seizing Opportunities, Preserving Values. Executive Office of the President, May 2014.
[3] François Bourdoncle. Peut-on créer un écosystème français du Big Data ? Le Journal de l’Ecole de Paris n°108, July/August 2014.
[5] Viktor Mayer-Schönberger, Kenneth Cukier. Big Data – A Revolution That Will Transform How We Live, Work and Think. John Murray, 2013.
[8] Gilles Babinet. L’ère numérique, un nouvel âge de l’humanité : Cinq mutations qui vont bouleverser votre vie. Le Passeur, 2014.
[10] Phil Simon. The Age of The Platform – How Amazon, Apple, Facebook and Google have redefined business. Motion Publishing, 2011.
[12] IBM Global Business Services. “Analytics: Real-world use of big data in telecommunications – How innovative communication service providers are extracting value from uncertain data”. IBM Institute for Business Value, April 2013.
[13] Thomas Dapp. “Big Data – The untamed force”, Deutsche Bank Research, May 5, 2014.
[15] David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani. “The Parable of Google Flu: Traps in Big Data Analysis”.
[16] Tim Harford. “Big data: are we making a big mistake?”, Financial Times, March 28th, 2014.
[19] Octo Technology. Les géants du Web : Culture – Pratiques - Architecture. Octo 2012.
[21] Tony Hey, Stewart Tansley, Kristin Tolle (eds). The Fourth Paradigm – Data-Intensive Scientific Discovery. Microsoft Research, 2009.
[22] Max Lin. “Machine Learning on Big Data – Lessons Learned from Google Projects”.
[24] Michael Kopp. “Top Performance Problems discussed at the Hadoop and Cassandra Summits”, July 17, 2013.
[25] Eddy Satterly. “Big Data Architecture Patterns”.
[26] Paul Ohm. “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization”. UCLA Law Review, Vol. 57, p. 1701, 2010.
[27] CIGREF. “Big Data : La vision des grandes entreprises”, 2013.

2. Big Data as a new way to extract value from data

As introduced earlier, the key idea is to forget about causation, or about trying to extract knowledge from data mining. A large part of Viktor Mayer-Schönberger and Kenneth Cukier’s book (chapter 5) is dedicated to the difference between causation and correlation. But what we found when we heard about “real big data systems”, such as those from Google or Criteo, is that these systems are anything but static data mining systems that aim at producing knowledge. They are dynamic systems that constantly evolve and “learn” from the data, in a controlled loop. Most big data statistical tools, such as logistic regressions or machine learning algorithms, are still looking for correlations, but these correlations are not meant to hold intrinsic value; they are input for action, whose effect is measured in real time. Hence knowing the “why” of the correlation, knowing if there is a causation, a reverse causation, or a complex dependency cycle which is the signature of complex systems … does not really matter. Nor does it matter (as much as in the past) to be assured that the correlation is stable and will last in time. The correlation detection is embedded into a control loop that is evaluated through the overall financial result of the process.

This Big Data approach towards data mining is about practice and experiments. It is also broader in scope than statistics or data mining, since the performance comes from the whole “dynamic system” implementation. It is also easy to deduce from this, which we do in our report in more detail, that building such a Big Data system is a team effort, with a strong emphasis on technology, distributed systems and algorithms. From a marketing perspective, the goal is no longer to produce “customer knowledge” – which meant understanding what the customer wants – but to build an adaptive process which leads to better customer satisfaction – which is actually less ambitious. If we consider the Walmart example that is detailed in the previously mentioned book [5], analyzing checkout receipts produces correlations that need not be thought of as “customer insights”. There is no need to find out why there is a correlation between purchases of diapers and beer packs. It is enough to put these two items close by and see if sales improve (which they did). In the virtual world of the Web, testing these hypotheses becomes child’s play (no physical displacement necessary).
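
A minimal sketch of such an action loop, in Python, may help. It uses a simple epsilon-greedy strategy, which is my choice of illustration and not a description of what Google, Criteo or Walmart actually run; the function name, the option names and the conversion rates in the usage note are made up, and the “customers” are simulated with a random generator.

```python
import random
from typing import Dict, Tuple


def run_loop(
    true_rates: Dict[str, float],  # hidden sale probability of each option (simulated world)
    rounds: int,
    epsilon: float = 0.1,          # share of rounds spent trying options at random
    seed: int = 0,
) -> Tuple[Dict[str, int], int]:
    """Try options, measure sales, and shift traffic to whatever sells best.

    Returns how often each option was used and the total number of sales.
    The loop never asks *why* an option works; it only measures the outcome.
    """
    rng = random.Random(seed)
    pulls = {option: 0 for option in true_rates}
    wins = {option: 0 for option in true_rates}
    for _ in range(rounds):
        if rng.random() < epsilon:
            # Explore: pick any option, so the estimates keep being refreshed.
            option = rng.choice(list(true_rates))
        else:
            # Exploit: pick the option with the best observed sale rate so far.
            option = max(
                true_rates,
                key=lambda o: wins[o] / pulls[o] if pulls[o] else 0.0,
            )
        sale = rng.random() < true_rates[option]  # simulated customer reaction
        pulls[option] += 1
        wins[option] += int(sale)
    return pulls, sum(wins.values())
```

For example, `run_loop({"separate_shelves": 0.05, "adjacent_shelves": 0.08}, rounds=10_000)` runs the loop on two hypothetical shelf layouts. Because exploration never stops, the loop can also follow a correlation that drifts over time, which is why its long-term stability matters less than in classical data mining.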

There is a natural risk of error if one takes these correlations out of their dynamic-loop context and tries to see them as predictions. Our different experts, including the speakers invited for the 2013 Frontiers of Engineering conference – in particular Jeff Hammerbacher from Cloudera and Thomas Hofmann from Google – were adamant about the fact that “Big Data produces information that one does not really understand”, which creates the risk of poor utilization. This is similar in a sense to the phenomenon of “spurious correlations”: when one analyzes a large cloud of data points with a very large number of variables, one is statistically bound to find a large number of correlations without any meaningful significance. A great illustration of this pitfall is given by the “Google Flu Trends” (GFT) story. When they analyzed search requests that used words linked to flu, Google researchers found that they could forecast the propagation of flu epidemics with a good level of accuracy. This claim was instantly absorbed, amplified and orchestrated as a proof of Big Data greatness. Then more detailed analyses [15] [16] showed the limits and the shortcomings of GFT. The article that is published on Harvard’s blog [15] is actually quite balanced. Although it is now clear that GFT is not a panacea and shows more errors than other simpler and more robust forecasting methods, the article also states that: “The initial vision regarding GFT – that producing a more accurate picture of the current prevalence of contagious diseases might allow for life-saving interventions – is fundamentally correct, and all analyses suggest that there is indeed valuable signal to be extracted”.
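
The spurious correlation effect is easy to reproduce. The following Python sketch generates variables that are pure independent noise and counts how many pairs still look correlated; the function names, the sample sizes and the 0.5 threshold are arbitrary choices for illustration.

```python
import math
import random
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def count_spurious(n_vars: int, n_points: int, threshold: float, seed: int = 0) -> int:
    """Count pairs of *independent* random variables with |correlation| > threshold."""
    rng = random.Random(seed)
    # Every variable is independent Gaussian noise: any correlation is pure chance.
    data: List[List[float]] = [
        [rng.gauss(0.0, 1.0) for _ in range(n_points)] for _ in range(n_vars)
    ]
    return sum(
        1
        for i in range(n_vars)
        for j in range(i + 1, n_vars)
        if abs(pearson(data[i], data[j])) > threshold
    )
```

With 200 variables there are 19,900 pairs to compare, so even if only a small fraction of pairs cross the threshold by chance, a call such as `count_spurious(200, 20, 0.5)` should still report a substantial number of “correlations” that mean nothing. The more variables one throws at a fixed number of observations, the more of these one finds.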

3. Data is the new code

This great catch phrase was delivered to us by Henri Verdier, one of the many experts who were interviewed by the ICT commission. When Google’s teams look at a new startup, they compute the valuation mostly from the volume and the quality of the data that has been collected, with much less regard for the code that has been developed. Data valuation comes both from the difficulty of collecting the data and from its estimated future usage potential. Code is seen as an “artefact” that is linked to data, which is both destined to change and easy to replace. In this new world of Big Data, code is conceptually less important than the data it is applied to. It is less important because it changes constantly, because it is made of simple sub-linear algorithms (the only ones that can be run on petabytes of data) and because it is the result of a learning loop (algorithms that are simple in their principles, but with zillions of parameters that need to be fine-tuned through experiments). To caricature the reasoning, Google could tell these young startups: “I will buy your data and I will re-grow the code base using our own methods”.

This new way of programming does not apply only to new problems and new opportunities! This approach may be used to re-engineer more classical “information systems” through the combined application of commodity computing, massively parallel programming and open-source data distribution software tools. This combination helps gain one or two orders of magnitude with respect to cost, as was shown to us with numerous examples. In the previously mentioned book [5], one may learn about the VISA example, where Big Data technology was used to re-build an IT process with spectacular gains in cost and throughput. This “new way of programming”, centered on data, may be characterized in three ways:
  • Massively parallel programming because of the distribution of very large amounts of data. The data distribution architecture becomes the software architecture because, as the volume grows, it becomes important to avoid “moving data”.
  • Sub-linear algorithms (whose compute time grows slower than the amount of data that they process) play a key role. We heard many great examples about the importance of such algorithms, such as the use of HyperLogLog counters in the computation of the diameter of the Facebook social graph.
  • Algorithms need to be adaptive and tuned incrementally from their data. Hence machine learning becomes a key skill when one works on a very large amount of data.
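To give a feeling for the kind of algorithm mentioned in the second point, here is a simplified HyperLogLog counter. It is a teaching sketch under my own assumptions (SHA-1 as the hash function, 2^12 registers, only the small-range correction), not the implementation used at Facebook. It estimates the number of distinct items in a stream with a fixed amount of memory, whatever the size of the stream.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog cardinality estimator (illustrative only)."""

    def __init__(self, p=12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        index = h >> (64 - self.p)                   # first p bits
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def estimate(self):
        harmonic = sum(2.0 ** -r for r in self.registers)
        raw = self.alpha * self.m * self.m / harmonic
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    counter = HyperLogLog()
    for i in range(50000):
        counter.add(f"user-{i}")
    print(f"estimated distinct items: {counter.estimate():.0f}")
```

The memory footprint is the 4096 small registers, independent of how many items are added; the price is an approximate answer, with a typical relative error of about 1.04 divided by the square root of the number of registers.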

During the 2013 FoE conference, Thomas Hofmann told us that “Big data is getting at the core of computer science”. This means that all the current problems that receive the attention of today’s computer scientists, such as artificial intelligence, robotics, natural language processing, behavioral learning, and so on, require the combination of these three characteristics: need for massive, hence distributed, computing power, huge amounts of data and machine learning to grow better algorithms.

This does not mean that « data as the new code » is a universal approach towards information system design. Massive distribution of data has its own constraints and faces fundamental (theory-proven) data architecture difficulties. They are known, for instance, as the CAP Theorem or the problem of snapshot algorithms in distributed computing. Simply put, it is not possible to guarantee at the same time data consistency, high availability and tolerance to network partitions (that is, continued operation when part of the network becomes unavailable). Big Data solutions usually pick a weakened form of consistency or availability. The logical consequence is that there remain domains – mostly related to transactions and ACID requirements as well as very low latency requirements – where “more classical” architectures are still better suited.

4. Conclusion

These two paradigm shifts are accompanied by changes in culture, methods and tools. To finish this post and to summarize, I would quote: agile methods, open-source software culture and DevOps (continuous build, integration and delivery). It is stunningly obvious that one cannot succeed in developing the kind of closed-loop data mining systems described in Section 2, nor the machine-learning, data-driven algorithms described in Section 3, without the help of agile methods. Agile methods advocate incremental and short-batch development cycles, organized around multi-skilled teams, where everyone works in a synchronous manner on the same objective. The same argument applies to the proper use of open-source software, though it is less obvious and comes from experience. It is not about using great software available for free; it is more about using continuously evolving software that represents the bleeding edge of Big Data technology. It is even more about the open-source culture, which fits the concept of software factories like a glove (the topic of another post!). To succeed in this new software world, you need to love code and respect developers (and yes, I am aware of the paradox that this may cause together with the “data is the new code” motto). Last, there is no other way to produce continuously evolving code (which is implied by both these paradigm shifts, but is also true in the digital world) than switching to continuous build, integration and delivery, as exemplified by DevOps. I am quoting DevOps but I could also make a reference to the software factory idea (the two are closely related).

Not surprisingly, the reader who is familiar with “Les Géants du Web” [19] from Octo will recognize the culture which is common to the “Web Giants”, that is, the companies which are the most successful in the digital world, such as Amazon, Google or Facebook. There is no surprise because these companies are also amongst the world leaders in leveraging the promises of Big Data. Agile (hence collaborative) development is critical to Big Data, which requires mixing computer science, information technology, statistical and subject-matter (business) skills. Because Big Data requires working on the “real” (large) sets of data, it means a strong collaboration between IT operations and development. This is made even more critical by the paradigm shift described in Section 2, since algorithmic development and tuning is embedded into an operation cycle, which is an obvious call for DevOps.

I will conclude with a few of the recommendations from the NATF report:
  • Big Data is much more than new opportunities to do new things. Fueled by a technology shift that is caused by drastic price drops (storage and computing), the Big Data paradigm disrupts how information systems are built.
  • Massive parallelism and huge volumes of data are bringing a new way of programming that is urgent to learn, and to teach. This goes for companies as well as universities or engineering schools.
  • The old world of cautious “analyze/model/design/run” waterfall linear projects is in competition with a new world of systemic loops “experiment/learn/try/check”. This is true for science [21] as well as for business. Hence, Big Data’s new paradigms need to be taught in business schools as well as in engineering schools.

Readers who are familiar with Francois Bourdoncle’s theses on Big Data will recognize them in these recommendations, which is quite natural since he was one of the experts interviewed by the ICT commission of the NATF.

Monday, July 14, 2014

Viral Propagation Models for Apps and Social Software

Today’s post is a follow-up to my previous text on software ecosystems: I will focus on the virality of social applications, that is, the ability for applications to grow their customer bases through social networks. This post is more technical than most, because it is unfortunately necessary, but I will try to keep everything “as simple as possible, but no simpler” :).

Social propagation of applications is desirable because the fight to survive on the smartphone is quite tough. Not only do most people download only a few tens of apps (statistics vary according to sources; however, the story is the same), but most of these apps are never used. 80 to 90% of downloaded apps are used only once then discarded. Becoming one of the few apps that stay in the “smartphone top of mind” (i.e., an active app) is very hard, and being a collected app (installed for future use) seems to be very precarious. This is why the route of the web application (smart responsive HTML5 page with embedded bells & whistles) that is accessed through all the classical Web paths (search, links, etc.) is looking more and more interesting for many companies.

We may categorize the social behavior of apps into three categories:
  • Solo apps: applications whose main goal is to be used on your own, even if the score (of a game) may be shared eventually.
  • Communication apps: applications which are used to synchronously communicate with other people. The value of the service grows with the number of correspondents that may be reached.
  • Social apps: applications which use asynchronous communication to become content publishing platforms. The distinction between “communication & social” will become clearer later on, but we may state right now that the value depends on the amount of available content, which depends on the total amount of time spent by social partners on the social app.

Not surprisingly, we know that solo apps appeal more to men than women. What I want to look at is the ability for social software (a larger category than apps) to propagate itself through social use and recommendation.

1. Metcalfe’s Law for Communication Software

If we consider a simple communication tool (such as instant messaging), its customer base defines a communication network whose value grows as the square of the number of users (O(N^2)), according to Metcalfe’s Law. Metcalfe’s Law states that the value of a communication network grows as the number of possible pairs of connected users.

The value for one individual is linear in the number of users (O(N)), but both the total value and the virality are quadratic. The virality, which is linked to the growth rate, may be seen as the product of the “infected” population (number of users) and the probability for one customer to “infect” another person (that is, recommend the service), which is linked to her or his satisfaction (hence, to the value).

One may notice that this is already quite different from an epidemiology model, since the probability of transmitting the disease depends not only on being infected, but also on the number of your infected friends.
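These orders of magnitude can be written down in a few lines. The sketch below is only an illustration of the reasoning above: the function names and the `recommend_rate_per_contact` parameter are hypothetical, introduced to show that when the probability of recommending grows with the individual value, virality scales quadratically with the number of users.

```python
def possible_pairs(n_users):
    """Number of distinct pairs of users: the Metcalfe value of the network."""
    return n_users * (n_users - 1) // 2


def value_per_user(n_users):
    """Number of correspondents that one user may reach: linear in N."""
    return n_users - 1


def virality(n_users, recommend_rate_per_contact):
    """'Infected' population times the probability of recommending the service.

    Assumption: the probability of recommending is proportional to the
    individual value, hence to the number of reachable correspondents.
    """
    return n_users * recommend_rate_per_contact * value_per_user(n_users)


if __name__ == "__main__":
    for n in (100, 200, 400):
        print(n, possible_pairs(n), virality(n, recommend_rate_per_contact=0.001))
```

Doubling the number of users roughly multiplies both the total value and the virality by four, whereas in a plain epidemic model the per-person transmission probability would stay constant.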

There are two points which are usually debated in this reasoning. The first remark states that we do not benefit from a very large network of possible contacts, since the number of meaningful correspondents is usually bounded (whether by Dunbar’s number or any other).

The second idea is that not all correspondents are equal and that the communication time distribution usually follows a law similar to Zipf’s Law. This leads to the result that the value grows in an O(N log N) fashion. The whole issue boils down to the question of whether the distribution of the communication tool among your possible correspondents is homogeneous (randomly distributed) or not. This is actually a debate about strong ties versus weak ties, one of my favorite topics. If the communication tool is used to communicate with your close friends, then the propagation model follows the strong ties social graph and we may assume that the value for each customer grows in O(log N) because of Zipf’s law. On the other hand, if the communication tool is used to reach a larger set of people, then the probability that one of these contacts is equipped with the same communication tool is roughly linear with respect to the usage rate, hence the individual value grows in O(N).
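The O(log N) individual value comes from the harmonic series. As a small sketch, assume (this weighting is my simplification of “a law similar to Zipf’s Law”) that the k-th closest correspondent carries a weight of 1/k; the value for one user is then the harmonic number, which grows like the natural logarithm of the number of contacts.

```python
import math


def zipf_value_per_user(n_contacts):
    """Value for one user when the k-th closest contact weighs 1/k."""
    return sum(1.0 / k for k in range(1, n_contacts + 1))


def total_value(n_users, homogeneous):
    """Total network value under the two hypotheses discussed above.

    homogeneous=True  : weak ties, every contact counts equally -> O(N^2)
    homogeneous=False : strong ties with Zipf-like weights      -> O(N log N)
    """
    if homogeneous:
        return n_users * (n_users - 1)
    return n_users * zipf_value_per_user(n_users - 1)


if __name__ == "__main__":
    for n in (10, 1000, 100000):
        print(n, round(zipf_value_per_user(n), 3), round(math.log(n), 3))
```

The two columns printed differ by a nearly constant offset (close to 0.58, the Euler-Mascheroni constant), which is what “grows in O(log N)” means in practice.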

2. Social Software and Cumulative Valuation of Time

We now consider an application, like Facebook, that acts as an asynchronous content publishing platform. The key observation is that the value of a Facebook session does not depend on how many friends you have, but on how frequently they visit and contribute.
People have different profiles when it comes to reading and contributing on social platforms. However, it is plausible to assume that (a) the read/write ratio is different for each individual but remains rather stable over time and (b) the number of messages read and written is proportional to the time spent on the social platform. Similarly, the attractiveness (i.e., interest to others) of content varies significantly from one user to another, but we may assume that the interest varies linearly with the number of messages that are exchanged (this is clearly wrong for “newsworthy events” but seems to be true for the vast majority of exchanges that happen on Facebook).

This leads to a recursive system of complexity equations (written in a rather informal style):
  • Total Value = N x Average Value
  • Average Value = Average Degree x O(Average Time Spent) x Filtering Factor
  • Average Time Spent = O(Average Value)
The only way to make this system balanced is to assume that the asymptotic behavior of the “Filtering Factor” is O(1/D), where D is the average degree (which makes sense, there is only so much that you can read). So if the average degree grows, some filtering is necessary. For instance, Facebook relates that, every day, it has to choose among 1500 messages what to display to each user. This is the role of the “EdgeRank” filtering algorithm, a topic which I have discussed in a previous post.

Once the role of “filtering” is understood, we are left with a “self-fulfilling” set of circular equations that tells us that the value is proportional to the average time spent, which is proportional to the perceived value. It may be thought of as a disappointing tautology, but it says that similar social platforms may indeed meet very different fates.
At this time we can state two things:
  1. The formula that describes the value obtained by a social app user is complex, hence the virality percolation model is complex. It does not compare at all with an epidemiology model since the probability of “infecting” someone depends both on (a) the number of your infected friends and (b) how deeply infected they are.
  2. There is no simple model for understanding the spread of social network platforms: there may exist multiple solutions with similar customer bases (N). The example of Google Plus and Facebook springs to mind: they both have large customer bases (1,230 million monthly active users for Facebook and 300 million monthly active users for Google Plus) and average time spent statistics which are totally different (8 hours per month for Facebook versus 7 minutes for Google Plus). Nothing in the percolation models tells us whether Google Plus should grow closer to Facebook in the future; it all depends on much finer details (value provided to the user per unit of time and per unit of meaningful social content). The non-linear nature of the equation (re-entering loop) means that a tiny difference in this value-creation function may lead to a radical difference in customer usage (i.e., the presentation difference produces different time allocation patterns that, in turn, amplify the perceived value difference).
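The amplification effect of the re-entering loop can be illustrated with a toy simulation. Everything in it is an assumption of mine and not a calibrated model: two platforms compete for a fixed time budget, the perceived value of each is its “quality” (value per unit of time and content) multiplied by the share of time that users already spend on it, and users re-allocate their time in proportion to perceived value.

```python
def simulate_time_share(quality_a, quality_b, steps, initial_share_a=0.5):
    """Iterate the value <-> time-spent loop for two competing platforms.

    Returns the share of the time budget captured by platform A after
    the given number of re-allocation steps.
    """
    share_a = initial_share_a
    for _ in range(steps):
        # Perceived value grows with the time that others already spend there.
        value_a = quality_a * share_a
        value_b = quality_b * (1.0 - share_a)
        # Users re-allocate their time in proportion to perceived value.
        share_a = value_a / (value_a + value_b)
    return share_a


if __name__ == "__main__":
    for quality in (1.0, 1.01, 1.05):
        share = simulate_time_share(quality, 1.0, steps=100)
        print(f"quality ratio {quality:.2f} -> share of time {share:.3f}")
```

In this toy loop, the ratio between the two time shares is multiplied by the quality ratio at each step, so a 5% advantage ends up capturing almost all of the time budget, while two identical platforms stay at an even split. It says nothing about the speed of the real process, only about its instability.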

Notice that usage and subscription are two very different things, with different percolation models. Subscription is much closer to an epidemiological model (modulo the observations that we made earlier), and it is easier both to predict and to steer towards viral adoption.

3. Why Facebook’s Doom Cannot Be Predicted with Epidemiological Models

Early this year there was a lot of excitement about a paper that predicted that Facebook would almost disappear before 2017. This information was printed and commented on in many famous news sites and newspapers. The origin of this information is an arXiv (i.e., submitted for publication) paper from two Princeton PhD students, John Cannarella and Joshua Spechler.

Facebook replied with a humorous answer in which they use various buggy-but-convincing statistical charts to show the future decline of Princeton and of breathing air. They conclude that “We don’t really think Princeton or the world’s air supply is going anywhere soon. We love Princeton (and air). As data scientists, we wanted to give a fun reminder that not all research is created equal – and some methods of analysis lead to pretty crazy conclusions.”

I actually downloaded and read the article, which is very simple and straightforward. It looks at how social network percolation may be modelled with an epidemiology model (which is clearly wrong, as we showed in the previous section). On the one hand, the paper is “technically correct”: it simply asks what would happen if Facebook’s usage behaved like the spread of a disease. What is incorrect is the wrong conclusion that all the newspapers drew from it. On the other hand, it is of no value since it is very clear that the model does not fit the problem. The fact that the authors were able to tweak the virology parameters so that the first phase of Facebook growth matched historical data is irrelevant. There are many percolation models that would give a similar “S-curve” phase of growth. I laughed at Facebook’s debunking of the article (the fact that it is quoted as a viral / epidemiology research article from two PhD students from the Mechanical and Aerospace Engineering department should have raised some suspicions), but the debunking misses the point: it is not poor data science, it is poor science to begin with. If you look at the illustration, you will see that the « input data » used for the epidemiology model is the number of « Facebook searches », which means that the decline may also be interpreted as the complete domination of Facebook!

4. Percolation Models for Social Software are Unstable

The previous “model” of Section 2 is crude because it does not introduce the connection frequency. To understand and to model the behavior of a social app user, one needs both the average frequency and the average time spent per user (slightly more than once a day and 20 minutes for an average Facebook session). I tried to build a computational model two years ago, and failed because I did not have enough connection frequency data. This means that I could have used my model to predict almost any possible outcome … somewhat like the Princeton computational experiment.

From a system science perspective, the “re-entrant” characteristic of the “time spent” parameter in the value equation means that any model is bound to be quite unstable and very sensitive to other dimensions (see the conclusion). One could point out that, as a consequence, the outcome proposed by John Cannarella and Joshua Spechler is not impossible :). Let us look at a possible “Facebook displacement” scenario (since users seem to enjoy the time they spend on Facebook, it is logical to assume that such a scenario is the outcome of the introduction of a newer, better platform). It makes perfect sense to illustrate this with the rise of WhatsApp (considering the money spent by Facebook to acquire them, someone else must have thought that there was a real threat). The scenario breaks into four steps:

  1. A new app appears that is more efficient for a new group of users (most likely an age-based group, but not necessarily; it may be a matter of geography or culture). WhatsApp is a great example since it has reached 500 M users in record time.
  2. Because the app is significantly better (from the point of view of new users), it eats away the “free time budget”: the time spent on the new app is taken away from the time spent on Facebook. This is clearly true for WhatsApp with more than 10 hours of monthly use (here also, statistics vary, but the tally is still impressive).
  3. This decreases the perceived value of Facebook for other users, who open an account and then spend some of their SNS time on the new app. This has yet to appear in the WhatsApp case; for instance, in Spain where WhatsApp is very strong, Facebook is still growing, even if the adoption rate is slower than in other European countries. Also, the fastest growing segment of Facebook users is people over 55; it will be hard to draw them away as a community.
  4. Eventually the new app becomes the place where the majority of users go (there is a winner-take-all system dynamic, which has been very profitable for Facebook since it started).

Steps (1) and (2) may happen rapidly, but (3) and (4) will take much longer (this is a guess; as said earlier, the speed has nothing to do with an epidemiological model and is much harder to model). This is because time spent becomes a habit, and habits take longer to change (it takes longer to forget a habit than to pick up a new one).

A lot of work is available in the scientific community related to percolation over social networks, including the work from Callaway, Newman, Strogatz and Watts, which has inspired my own research about social networks. However, the time aspect of social network usage completely changes the percolation model.
The previous curve shows that social apps have a stronger percolation capability than simpler communication apps.

5. Conclusion

Rather than drawing a conclusion from this difficulty to efficiently model percolation of social software, I will simply point out a few directions for developing social and viral adoption of applications:
  • One must “pick the right fight”: it does not make sense to fight for usage time if the usage frequency is not high enough. If the frequency is too low, it’s a different game: how to use other SNS for “signaling” (letting people know that their friends have used your app).
  • “Surf the wave instead of racing it”: profit from existing SNS which are created as platforms, to leverage existing social networks to grow your own app’s social usage.
  • Make it easy to share your content on competing platforms (a good example being LinkedIn, which allows easy sharing with Twitter, while the reciprocal exchange, that is, sharing from Twitter on any other SNS, is not possible).
  • Empower your users to do whatever they please with your app, making it a true "platform". This follows from the observation that increasing time spent will increase value, hence adoption. This is something that Facebook has been quite good at (although this is a subject of debate), and that Snapchat or Instagram are also good examples of.
  • Think about “value / effort” all the time and focus on simplicity, usability and speed. In particular, regarding the previous point, sharing/publishing must be as effortless as possible. We are back to the “maximize the value per unit of time and unit of content” principle stated in Section 2. The dynamics of content/time percolation means that a small competitive advantage in efficiency can accumulate rapidly into a larger, sustainable advantage in content and customer base.
