Brittleness and incremental improvement
Every system — computer or otherwise — needs to deal with possibilities of damage or error. If it does this well, it may be regarded as “robust”, “mature(d)”, “strengthened”, or simply “improved”.* Otherwise, it can reasonably be called “brittle”.
*It’s also common to use the word “harden(ed)”. But I think that’s a poor choice, as brittle things are often also hard.
0. As a general rule in IT:
- New technologies and products are brittle.
- They are strengthened incrementally over time.
There are many categories of IT strengthening. Two of the broadest are:
- Bug-fixing.
- Bottleneck Whack-A-Mole.
1. One of my more popular posts stated:
Developing a good DBMS requires 5-7 years and tens of millions of dollars.
The reasons I gave all spoke to brittleness/strengthening, most obviously in:
Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.
Similar things are true for other kinds of “platform software” or distributed systems.
2. The UI brittleness/improvement story starts similarly:
Notes on Spark and Databricks — technology
During my recent visit to Databricks, I of course talked a lot about technology — largely with Reynold Xin, but a bit with Ion Stoica as well. Spark 2.0 is just coming out now, and of course has a lot of enhancements. At a high level:
- Using the new terminology, Spark originally assumed users had data engineering skills, but Spark 2.0 is designed to be friendly to data scientists.
- A lot of this is via a focus on simplified APIs, based on Spark DataFrames:
- Unlike the similarly named APIs in R and Python, Spark DataFrames work with nested data. (A brief sketch follows this list.)
- Machine learning and Spark Streaming both work with Spark DataFrames.
- There are lots of performance improvements as well, some substantial. Spark is still young enough that Bottleneck Whack-A-Mole yields huge benefits, especially in the SparkSQL area.
- SQL coverage is of course improved. For example, SparkSQL can now perform all TPC-DS queries.
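Since Spark DataFrames carry most of the weight above, here is a minimal PySpark sketch (mine, not Databricks’) of what “works with nested data” means in practice. The schema and the values are invented purely for illustration:

```python
# Minimal sketch: a Spark DataFrame whose rows contain a nested struct and an
# array of structs -- something the similarly named R and pandas structures
# don't natively handle.
from pyspark.sql import SparkSession, Row
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nested-dataframe-sketch").getOrCreate()

rows = [
    Row(name="alice",
        address=Row(city="Boston", zip="02134"),
        orders=[Row(item="widget", qty=3), Row(item="gadget", qty=1)]),
    Row(name="bob",
        address=Row(city="Austin", zip="78701"),
        orders=[Row(item="widget", qty=5)]),
]
df = spark.createDataFrame(rows)

# Dot notation reaches into the struct; explode() turns the array into rows.
flat = df.select(F.col("name"),
                 F.col("address.city").alias("city"),
                 F.explode("orders").alias("o"))

flat.groupBy("city").agg(F.sum("o.qty").alias("total_qty")).show()
```

The same DataFrame can then be handed to the DataFrame-based machine learning and streaming APIs, which is much of the point of standardizing on it.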
The majority of Databricks’ development efforts, however, are specific to its cloud service, rather than being donated to Apache for the Spark project. Some of the details are under NDA, but it seems fair to mention at least:
- Databricks’ notebooks feature for organizing and launching machine learning processes and so on is a biggie. Jupyter is an open source analog.
- Databricks has been working on security, and even on the associated certifications.
Two of the technical initiatives Reynold told me about seemed particularly cool.
Categories: Benchmarks and POCs, Cloud computing, Databricks, Spark and BDAS, Predictive modeling and advanced analytics, Streaming and complex event processing (CEP)
Notes on vendor lock-in
Vendor lock-in is an important subject. Everybody knows that. But few of us realize just how complicated the subject is, nor how riddled it is with paradoxes. Truth be told, I wasn’t fully aware either. But when I set out to write this post, I found that it just kept growing longer.
1. The most basic form of lock-in is:
- You do application development for a target set of platform technologies.
- Your applications can’t run without those platforms underneath.
- Hence, you’re locked into those platforms.
2. Enterprise vendor standardization is closely associated with lock-in. The core idea is that you have a mandate or strong bias toward having different apps run over the same platforms, because:
- That simplifies your environment, requiring less integration and interoperability.
- That simplifies your staffing; the same skill sets apply to multiple needs and projects.
- That simplifies your vendor support relationships; there’s “one throat to choke”.
- That simplifies your price negotiation.
3. That last point is double-edged; you have more power over suppliers to whom you give more business, but they also have more power over you. The upshot is often an ELA (Enterprise License Agreement), which commonly works like this:
- For a fixed period of time, the enterprise may use as much of a given product set as they want, with costs fixed in advance.
- A few years later, the price is renegotiated, based on then-current levels of usage.
Categories: Amazon and its cloud, Buying processes, Cassandra, Exadata, Facebook, IBM and DB2, Microsoft and SQL*Server, MongoDB, Neo Technology and Neo4j, Open source, Oracle, SAP AG
Some checklists for making technical choices
Whenever somebody asks for my help on application technology strategy, I start by trying to ascertain three things. The absolute first is actually a prerequisite to almost any kind of useful conversation: ascertaining in general terms what the hell it is that we are talking about. 🙂
My second goal is to ascertain technology constraints. Three common types are:
- Compatible with legacy systems and/or enterprise standards.
- Cheap, free and/or open source.
- Proven, vetted by sufficiently many references, and/or generally having an “enterprise-y” reputation.
That’s often a short and straightforward discussion, except in those awkward situations when all three of my bullet points above are applicable at once.
The third item is usually more interesting. I try to figure out what is to be accomplished. That’s usually not a simple matter, because the initial list of goals and requirements is almost never accurate. It’s actually more common that I have to tell somebody to be more ambitious than that I need to rein them in.
Commonly overlooked needs include:
- If you want to sell something and have happy users, you need a good UI.
- You will also soon need tools and a UI for administration.
- Customers demand low-latency/fresh data. Your explanation of why they don’t really need it doesn’t contradict the fact that they want it.
- Providing data access and saying “You can hook up any BI tool you want and build charts” is not generally regarded as offering a good UI.
- When “adding analytics” to something previously focused on short-request processing, it is common to underestimate the variety of things users will soon want to do. (One common reason for this under-estimate is that after years of being told it can’t be done, they’ve learned not to ask.)
And if you take one thing away from this post, then take this:
- If you “know” exactly which features are or aren’t helpful to users, …
- … and if you supply only what you “know” they should use, …
- … then you will discover that what you “knew” wasn’t really accurate.
I guarantee it.
Categories: Business intelligence, Buying processes, EAI, EII, ETL, ELT, ETLT, Predictive modeling and advanced analytics
CDH 5.5
I talked with Cloudera shortly ahead of today’s announcement of CDH 5.5. Much of what we talked about had something or other to do with SQL data management. Highlights include:
- Impala and Kudu are being donated to Apache. This actually was already announced Tuesday. (Due to Apache’s rules, if I had any discussion with Cloudera speculating on the likelihood of Apache accepting the donations, I would not be free to relay it.)
- Cloudera is introducing SQL extensions so that Impala can query nested data structures. More on that below.
- The basic idea for the nested datatype support is that there are SQL extensions with a “dot” notation to let you get at the specific columns you need. (A purely illustrative sketch follows this list.)
- From a feature standpoint, we’re definitely still in the early days.
- When I asked about indexes on these quasi-columns, I gathered that they’re not present in beta but are hoped for by the time of general availability.
- Basic data skipping, also absent in beta, seems to be more confidently expected in GA.
- This is for Parquet first, Avro next, and presumably eventually native JSON as well.
- This is said to be Dremel-like, at least in the case of Parquet. I must confess that I’m not familiar enough with Apache Drill to compare the two efforts.
- Cloudera is increasing its coverage of Spark in several ways.
- Cloudera is adding support for MLlib.
- Cloudera is adding support for SparkSQL. More on that below.
- Cloudera is adding support for Spark going against S3. The short answer to “How is this different from the Databricks service?” is:
- More “platform” stuff from the Hadoop stack (e.g. for data ingest).
- Less in the way of specific Spark usability stuff.
- Cloudera is putting into beta what it got in the Xplain.io acquisition, which it unfortunately is naming Cloudera Navigator Optimizer. More on that in a separate post.
- Impala and Hive are getting column-level security via Apache Sentry.
- There are other security enhancements.
- Some policy-based information lifecycle management is being added as well.
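For concreteness, below is the sort of query the “dot” notation enables, wrapped in a small Python sketch. Everything in it is illustrative: the table, schema, and host names are invented, the impyla client is just one convenient way to reach Impala from Python, and since the nested-type support is in beta the exact syntax is ultimately Cloudera’s to define.

```python
# Illustrative only: querying a nested Parquet-backed table through Impala.
# Assumes the impyla client (pip install impyla); table, column, and host
# names are made up, and the beta syntax may differ from what ships in GA.
from impala.dbapi import connect

conn = connect(host="impala-daemon.example.com", port=21050)
cur = conn.cursor()

# Hypothetical table:
#   customers(id BIGINT,
#             address STRUCT<city:STRING, zip:STRING>,
#             tags ARRAY<STRING>)
#
# Dot notation reaches into the STRUCT, while the ARRAY is referenced like a
# joined table in the FROM clause, its elements exposed via an ITEM pseudo-column.
cur.execute("""
    SELECT c.id, c.address.city, t.item AS tag
    FROM customers c, c.tags t
    WHERE c.address.city = 'Boston'
""")
for row in cur.fetchall():
    print(row)
```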
While I had Cloudera on the phone, I asked a few questions about Impala adoption, specifically focused on concurrency. There was mention of:
Differentiation in business intelligence
Parts of the business intelligence differentiation story resemble the one I just posted for data management. After all:
- Both kinds of products query and aggregate data.
- Both are offered by big “enterprise standard” behemoth companies and also by younger, nimbler specialists.
- You really, really, really don’t want your customer data to leak via a security breach in either kind of product.
That said, insofar as BI’s competitive issues resemble those of DBMS, they are those of DBMS-lite. For example:
- BI is less mission-critical than some other database uses.
- BI has done a lot less than DBMS to deal with multi-structured data.
- Scalability demands on BI are less than those on DBMS — indeed, they’re the ones that are left over after the DBMS has done its data crunching first.
And full-stack analytic systems — perhaps delivered via SaaS (Software as a Service) — can moot the BI/data management distinction anyway.
Of course, there are major differences between how DBMS and BI are differentiated. The biggest are in user experience. I’d say:
Categories: Business intelligence, Buying processes, ClearStory Data, Data mart outsourcing, Pricing, QlikTech and QlikView, Rocana, Tableau Software
Differentiation in data management
In the previous post I broke product differentiation into 6-8 overlapping categories, which may be abbreviated as:
- Scope
- Accuracy
- (Other) trustworthiness
- Speed
- User experience
- Cost
and sometimes also issues in adoption and administration.
Now let’s use this framework to examine two market categories I cover — data management and, in a separate post, business intelligence.
Applying this taxonomy to data management:
Categories: Buying processes, Clustering, Data warehousing, Database diversity, Microsoft and SQL*Server, Predictive modeling and advanced analytics, Pricing
Sources of differentiation
Obviously, a large fraction of what I write about involves technical differentiation. So let’s try for a framework where differentiation claims can be placed in context. This post will get through the generalities. The sequels will apply them to specific cases.
Many buying and design considerations for IT fall into six interrelated areas:
21st Century DBMS success and failure
As part of my series on the keys to and likelihood of success, I outlined some examples from the DBMS industry. The list turned out too long for a single post, so I split it up by millennia. The part on 20th Century DBMS success and failure went up Friday; in this one I’ll cover more recent events, organized in line with the original overview post. Categories addressed will include analytic RDBMS (including data warehouse appliances), NoSQL/non-SQL short-request DBMS, MySQL, PostgreSQL, NewSQL and Hadoop.
DBMS rarely have trouble with the criterion “Is there an identifiable buying process?” If an enterprise is doing application development projects, a DBMS is generally chosen for each one. And so the organization will generally have a process in place for buying DBMS, or accepting them for free. Central IT, departments, and — at least in the case of free open source stuff — developers all commonly have the capacity for DBMS acquisition.
In particular, at many enterprises either departments have the ability to buy their own analytic technology, or else IT will willingly buy and administer things for a single department. This dynamic fueled much of the early rise of analytic RDBMS.
Buyer inertia is a greater concern.
- A significant minority of enterprises are highly committed to their enterprise DBMS standards.
- Another significant minority aren’t quite as committed, but set pretty high bars for new DBMS products to cross nonetheless.
- FUD (Fear, Uncertainty and Doubt) about new DBMS is often justifiable, about stability and consistent performance alike.
A particularly complex version of this dynamic has played out in the market for analytic RDBMS/appliances.
- First the newer products (from Netezza onwards) were sold to organizations who knew they wanted great performance or price/performance.
- Then it became more about selling “business value” to organizations who needed more convincing about the benefits of great price/performance.
- Then the behemoth vendors became more competitive, as Teradata introduced lower-price models, Oracle introduced Exadata, Sybase got more aggressive with Sybase IQ, IBM bought Netezza, EMC bought Greenplum, HP bought Vertica and so on. It is now hard for a non-behemoth analytic RDBMS vendor to make headway at large enterprise accounts.
- Meanwhile, Hadoop has emerged as a serious competitor for at least some analytic data management, especially but not only at internet companies.
Otherwise I’d say:
Hardware and storage notes
My California trip last week focused mainly on software — duh! — but I had some interesting hardware/storage/architecture discussions as well, especially in the areas of:
- Rack- or data-center-scale systems.
- The real or imagined demise of Moore’s Law.
- Flash.
I also got updated as to typical Hadoop hardware.
If systems are designed at the whole-rack level or higher, then there can be much more flexibility and efficiency in terms of mixing and connecting CPU, RAM and storage. The Google/Facebook/Amazon cool kids are widely understood to be following this approach, so others are naturally considering it as well. The most interesting of several mentions of that point came when I got the chance to talk with Berkeley computer architecture guru Dave Patterson, who’s working on plans for 100-petabyte/terabit-networking kinds of systems, for usage after 2020 or so. (If you’re interested, you might want to contact him; I’m sure he’d love more commercial sponsorship.)
One of Dave’s design assumptions is that Moore’s Law really will end soon (or at least greatly slow down), if by Moore’s Law you mean that every 18 months or so one can get twice as many transistors onto a chip of the same area and cost as one could before. However, while he thinks that applies to CPU and RAM, Dave thinks flash is an exception. I gathered that he thinks the power/heat reasons for Moore’s Law to end will be much harder to defeat than the other ones; note that flash, because of what it’s used for, has vastly less power running through it than CPU or RAM do.