Analytic technologies
Discussion of technologies related to information query and analysis. Related subjects include:
- Business intelligence
- Data warehousing
- (in Text Technologies) Text mining
- (in The Monash Report) Data mining
- (in The Monash Report) General issues in analytic technology
DataStax and Cassandra update
MongoDB isn’t the only company I reached out to recently for an update. Another is DataStax. I chatted mainly with Patrick McFadin, somebody with whom I’ve had strong consulting relationships at both a user and a vendor. But Rachel Pedreschi contributed the marvelous phrase “twinkling dashboard”.
It seems fair to say that in most cases:
- Cassandra is adopted for operational applications, specifically ones with requirements for extreme uptime and/or extreme write speed. (Of course, it should also be the case that NoSQL data structures are a good fit.)
- Spark, including SparkSQL, and Solr are seen primarily as ways to navigate or analyze the resulting data. (A minimal sketch of that pattern appears below.)
Those generalities, in my opinion, make good technical sense. Even so, there are some edge cases or counterexamples, such as:
- DataStax trumpets British Gas’ plans to collect a lot of sensor data and immediately offer it up for analysis.*
- Safeway uses Cassandra for a mobile part of its loyalty program, scoring customers and pushing coupons at them.
- A large title insurance company uses Cassandra-plus-Solr to manage a whole lot of documents.
*And so a gas company is doing lightweight analysis on boiler temperatures, which it regards as hot data. 🙂
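To make the “Cassandra for operations, Spark/SparkSQL for analysis” pattern a bit more concrete, here’s a minimal PySpark sketch. It assumes the DataStax Spark Cassandra connector is available to the Spark job, and the keyspace, table, and column names are purely hypothetical.

```python
# Generic sketch: analyzing operational Cassandra data with SparkSQL.
# Assumes the DataStax Spark Cassandra connector is on the Spark classpath;
# the keyspace/table/column names ("sensors", "readings", etc.) are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-analysis")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

readings = (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace="sensors", table="readings")
            .load())

# A "twinkling dashboard" style rollup: average temperature per device.
readings.createOrReplaceTempView("readings")
spark.sql("""
    SELECT device_id, avg(temperature) AS avg_temp
    FROM readings
    GROUP BY device_id
""").show()
```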
While most of the specifics are different, I’d say similar things about MongoDB, Cassandra, or any other NoSQL DBMS that comes to mind: Read more
MongoDB update
One pleasure in talking with my clients at MongoDB is that few things are NDA. So let’s start with some numbers:
- >2,000 named customers, the vast majority of which are unique organizations who do business with MongoDB directly.
- ~75,000 users of MongoDB Cloud Manager.
- Estimated ~1/4 million production users of MongoDB total.
Also >530 staff, and I think that number is a little out of date.
MongoDB lacks many capabilities RDBMS users take for granted. MongoDB 3.2, which I gather is slated for early November, narrows that gap, but only by a little. Features include:
- Some JOIN capabilities.
- Specifically, these are left outer joins, so they’re for lookup but not for filtering. (There’s a small sketch of this after the list.)
- JOINs are not restricted to specific shards of data …
- … but do benefit from data co-location when it occurs.
- A BI connector. Think of this as a MongoDB-to-SQL translator. Using this does require somebody to go in and map JSON schemas and relational tables to each other. Once that’s done, the flow is:
- Basic SQL comes in.
- Filters and GroupBys are pushed down to MongoDB. A result set … well, it results. 🙂
- The result set is formatted into a table and returned to the system — for example a business intelligence tool — that sent the SQL.
- Database-side document validation, in the form of field-specific rules that combine into a single expression against which to check a document. (A second sketch after the list illustrates this.)
- This is fairly simple stuff — no dependencies among fields in the same document, let alone foreign key relationships.
- MongoDB argues, persuasively, that this simplicity makes it unlikely to recreate the spaghetti code maintenance nightmare that was 1990s stored procedures.
- MongoDB concedes that, for performance, it will ordinarily be a good idea to still do your validation on the client side.
- MongoDB points out that enforcement can be either strict (throw errors) or relaxed (just note invalid documents to a log). The latter option is what makes it possible to install this feature without breaking your running system.
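To make the JOIN item concrete, here’s a minimal sketch of a left outer join via the aggregation pipeline’s $lookup stage, using the Python driver. The collection and field names are hypothetical.

```python
# Minimal sketch of MongoDB 3.2's left-outer-join support via $lookup.
# Collection and field names (orders, customers, customer_id) are hypothetical.
from pymongo import MongoClient

db = MongoClient()["shop"]

pipeline = [
    {"$lookup": {
        "from": "customers",          # collection to join against
        "localField": "customer_id",  # field in orders
        "foreignField": "_id",        # field in customers
        "as": "customer"              # array of matching customer docs
    }}
]

# Every order comes back, with a "customer" array that is empty if no match
# was found -- which is what makes this a lookup rather than a filter.
for order in db.orders.aggregate(pipeline):
    print(order["_id"], order["customer"])
```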
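And here’s a similarly minimal sketch of document validation with the relaxed enforcement option. Again, the collection, fields, and rules are hypothetical.

```python
# Minimal sketch of MongoDB 3.2 document validation; names and rules are hypothetical.
from pymongo import MongoClient

db = MongoClient()["shop"]

# Field-specific rules combined into a single expression per document.
validator = {"email": {"$exists": True}, "total": {"$gte": 0}}

# validationAction "warn" just notes invalid documents to the log (the relaxed
# option), so the rule can be added without breaking a running system;
# "error" would reject them outright.
db.command("collMod", "orders",
           validator=validator,
           validationAction="warn")
```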
There’s also a closed-source database introspection tool coming, currently codenamed MongoDB Scout. Read more
Categories: Business intelligence, EAI, EII, ETL, ELT, ETLT, Market share and customer counts, MongoDB, NoSQL, Open source, Structured documents, Text | 6 Comments |
Multi-model database managers
I’d say:
- Multi-model database management has been around for decades. Marketers who say otherwise are being ridiculous.
- Thus, “multi-model”-centric marketing is the last refuge of the incompetent. Vendors who say “We have a great DBMS, and by the way it’s multi-model (now/too)” are being smart. Vendors who say “You need a multi-model DBMS, and that’s the reason you should buy from us” are being pathetic.
- Multi-logical-model data management and multi-latency-assumption data management are greatly intertwined.
Before supporting my claims directly, let me note that this is one of those posts that grew out of a Twitter conversation. The first round went:
Merv Adrian: 2 kinds of multimodel from DBMS vendors: multi-model DBMSs and multimodel portfolios. The latter create more complexity, not less.
Me: “Owned by the same vendor” does not imply “well integrated”. Indeed, not a single example is coming to mind.
Merv: We are clearly in violent agreement on that one.
Around the same time I suggested that Intersystems Cache’ was the last significant object-oriented DBMS, only to get the pushback that they were “multi-model” as well. That led to some reasonable-sounding justification — although the buzzwords of course aren’t from me — namely: Read more
Categories: Data models and architecture, Database diversity, Databricks, Spark and BDAS, Intersystems and Cache', MOLAP, Object, Streaming and complex event processing (CEP) | 3 Comments |
Data messes
A lot of what I hear and talk about boils down to “data is a mess”. Below is a very partial list of examples.
To a first approximation, one would expect operational data to be rather clean. After all, it drives and/or records business transactions. So if something goes awry, the result can be lost money, disappointed customers, or worse, and those are outcomes to be strenuously avoided. Up to a point, that’s indeed true, at least at businesses large enough to be properly automated. (Unlike, for example — 🙂 — mine.)
Even so, operational data has some canonical problems. First, it could be inaccurate; somebody can just misspell or otherwise botch an entry. Further, there are multiple ways data can be unreachable, typically because it’s:
- Inconsistent, in which case humans might not know how to look it up and database JOINs might fail. (There’s a tiny illustration of this below.)
- Unintegrated, in which case one application might not be able to use data that another happily maintains. (This is the classic data silo problem.)
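As a tiny, contrived illustration of the “inconsistency breaks JOINs” point (the customer names below are made up):

```python
# Two systems spell the same customer differently, so a join silently drops it.
import pandas as pd

crm = pd.DataFrame({"customer": ["IBM", "Acme Corp."], "region": ["East", "West"]})
billing = pd.DataFrame({"customer": ["I.B.M.", "Acme Corp."], "balance": [1000, 250]})

# Only "Acme Corp." survives the merge; the IBM rows never line up.
print(pd.merge(crm, billing, on="customer", how="inner"))
```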
Inconsistency can take multiple forms, including: Read more
Zoomdata and the Vs
Let’s start with some terminology biases:
- I dislike the term “big data” but like the Vs that define it — Volume, Velocity, Variety and Variability.
- Though I think it’s silly, I understand why BI innovators flee from the term “business intelligence” (they’re afraid of not sounding new).
So when my clients at Zoomdata told me that they’re in the business of providing “the fastest visual analytics for big data”, I understood their choice, but rolled my eyes anyway. And then I immediately started to check how their strategy actually plays against the “big data” Vs.
It turns out that:
- Zoomdata does its processing server-side, which allows for load-balancing and scale-out. Scale-out and claims of great query speed are relevant when data is of high volume.
- Zoomdata depends heavily on Spark.
- Zoomdata’s UI assumes data can be a mix of historical and streaming, and that if looking at streaming data you might want to also check history. This addresses velocity.
- Zoomdata assumes data can be in a variety of data stores (a generic sketch of this appears below the footnote), including:
- Relational (operational RDBMS, analytic RDBMS, or SQL-on-Hadoop).
- Files (generic HDFS, i.e. the Hadoop Distributed File System, or S3).*
- NoSQL (MongoDB and HBase were mentioned).
- Search (Elasticsearch was mentioned among others).
- Zoomdata also tries to detect data variability.
- Zoomdata is OEM/embedding-friendly.
*The HDFS/S3 aspect seems to be a major part of Zoomdata’s current story.
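To be clear, the following is not Zoomdata’s implementation; it’s just a generic sketch of what fronting a variety of data stores with Spark can look like, with a relational table pulled in over JDBC alongside files on HDFS or S3. All hosts, paths, and table names are hypothetical.

```python
# Generic illustration (not Zoomdata's code) of querying a variety of stores
# through Spark: a relational table via JDBC plus files on HDFS or S3.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("variety-demo").getOrCreate()

# Relational source (requires the appropriate JDBC driver on the classpath).
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/sales")
          .option("dbtable", "orders")
          .load())

# File source on HDFS; an S3 path such as "s3a://bucket/clicks/" works the same way.
clicks = spark.read.parquet("hdfs:///logs/clicks/")

orders.createOrReplaceTempView("orders")
clicks.createOrReplaceTempView("clicks")

# One SQL view over both stores.
spark.sql("""
    SELECT o.customer_id, count(*) AS clicks
    FROM orders o JOIN clicks c ON o.customer_id = c.customer_id
    GROUP BY o.customer_id
""").show()
```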
Core aspects of Zoomdata’s technical strategy include: Read more
“Chilling effects” revisited
In which I observe that Tim Cook and the EFF, while thankfully on the right track, haven’t gone nearly far enough.
Traditionally, the term “chilling effect” referred specifically to inhibitions on what in the US are regarded as First Amendment rights — the freedoms of speech, the press, and in some cases public assembly. Similarly, when the term “chilling effect” is used in a surveillance/privacy context, it usually refers to the fear that what you write or post online can later be held against you. This concern has been expressed by, among others, Tim Cook of Apple, Laura Poitras, and the Electronic Frontier Foundation, and several research studies have supported the point.
But that’s only part of the story. As I wrote in July, 2013,
… with the new data collection and analytic technologies, pretty much ANY action could have legal or financial consequences. And so, unless something is done, “big data” privacy-invading technologies can have a chilling effect on almost anything you want to do in life.
The reason, in simplest terms, is that your interests could be held against you. For example, models can estimate your future health, your propensity for risky hobbies, or your likelihood of changing your residence, career, or spouse. Any of these insights could be useful to employers or financial services firms, and not in a way that redounds to your benefit. And if you think enterprises (or governments) would never go that far, please consider an argument from the sequel to my first “chilling effects” post: Read more
Hadoop generalities
Occasionally I talk with an astute reporter — there are still a few left 🙂 — and get led toward angles I hadn’t considered before, or at least hadn’t written up. A blog post may then ensue. This is one such post.
There is a group of questions going around that includes:
- Is Hadoop overhyped?
- Has Hadoop adoption stalled?
- Is Hadoop adoption being delayed by skills shortages?
- What is Hadoop really good for anyway?
- Which adoption curves for previous technologies are the best analogies for Hadoop?
To a first approximation, my responses are: Read more
Teradata will support Presto
At the highest level:
- Presto is, roughly speaking, Facebook’s replacement for Hive, at least for queries that are supposed to run at interactive speeds.
- Teradata is announcing support for Presto with a classic open source pricing model.
- Presto will also become, roughly speaking, Teradata’s replacement for Hive.
- Teradata’s Presto efforts are being conducted by the former Hadapt.
Now let’s make that all a little more precise.
Regarding Presto (and I got most of this from Teradata):
- To a first approximation, Presto is just another way to write SQL queries against HDFS (Hadoop Distributed File System). However …
- … Presto queries other data stores too, such as various kinds of RDBMS, and federates query results. (A small sketch appears after this list.)
- Facebook created both Hive and, more recently, Presto.
- Facebook started the Presto project in 2012 and now has 10 engineers on it.
- Teradata has named 16 engineers – all from Hadapt – who will be contributing to Presto.
- Known serious users of Presto include Facebook, Netflix, Groupon and Airbnb. Airbnb likes Presto well enough to have 1/3 of its employees using it, via an Airbnb-developed tool called Airpal.
- Facebook is known to have a cluster, cited at 300 petabytes and 4,000 users, on which Presto is presumed to be a principal part of the workload.
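As a small illustration of the federation point, here’s what a Presto query that joins HDFS-resident data against an RDBMS-resident table might look like from Python, assuming the PyHive Presto client. The host, catalogs, and table names are all hypothetical.

```python
# Minimal sketch of querying Presto from Python via PyHive (pip install pyhive).
# Coordinator host, catalogs, and table names are hypothetical.
from pyhive import presto

cursor = presto.connect(host="presto-coordinator", port=8080,
                        username="analyst").cursor()

# A federated query: a Hive table backed by HDFS joined against a table
# living in an RDBMS exposed through a second Presto catalog ("mysql").
cursor.execute("""
    SELECT c.region, count(*) AS page_views
    FROM hive.web.page_views v
    JOIN mysql.crm.customers c ON v.user_id = c.user_id
    GROUP BY c.region
""")
for row in cursor.fetchall():
    print(row)
```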
Daniel Abadi said that Presto satisfies what he sees as some core architectural requirements for a modern parallel analytic RDBMS project: Read more
IT-centric notes on the future of health care
It’s difficult to project the rate of IT change in health care, because:
- Health care is suffused with technology — IT, medical device and biotech alike — and hence has the potential for rapid change. However, it is also the case that …
- … health care is heavily bureaucratic, political and regulated.
Timing aside, it is clear that health care change will be drastic. The IT part of that starts with vastly comprehensive electronic health records, which will be accessible (in part or whole as the case may be) by patients, care givers, care payers and researchers alike. I expect elements of such records to include:
- The human-generated part of what’s in ordinary paper health records today, but across a patient’s entire lifetime. This of course includes notes created by doctors and other care-givers.
- Large amounts of machine-generated data, including:
- The results of clinical tests. Continued innovation can be expected in testing, for reasons that include:
- Most tests exploit electronic technology. Progress in electronics is intense.
- Biomedical research is itself intense.
- In particular, most research technologies (for example gene sequencing) can be made cheap enough over time to be affordable clinically.
- The output of consumer health-monitoring devices — e.g. Fitbit and its successors. The buzzword here is “quantified self”, but what it boils down to is that every moment of our lives will be measured and recorded.
These vastly greater amounts of data will allow for greatly changed analytics.
Read more
Notes on analytic technology, May 13, 2015
1. There are multiple ways in which analytics is inherently modular. For example:
- Business intelligence tools can reasonably be viewed as application development tools. But the “applications” may be developed one report at a time.
- The point of a predictive modeling exercise may be to develop a single scoring function that is then integrated into a pre-existing operational application. (A tiny sketch of this pattern appears after the list.)
- Conversely, a recommendation-driven website may be developed a few pages — and hence also a few recommendations — at a time.
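As a tiny sketch of the “single scoring function” pattern (the features and training data below are made up):

```python
# Train offline, then expose one function an operational application can call.
from sklearn.linear_model import LogisticRegression

# Offline: fit a model on historical examples (feature vectors, churn labels).
X_train = [[1, 200.0], [0, 50.0], [1, 20.0], [0, 300.0]]
y_train = [1, 0, 1, 0]
model = LogisticRegression().fit(X_train, y_train)

def churn_score(num_complaints, monthly_spend):
    """The one artifact handed to the operational app: a probability score."""
    return float(model.predict_proba([[num_complaints, monthly_spend]])[0][1])

# In the operational application, e.g. when deciding whether to show a retention offer:
print(churn_score(2, 80.0))
```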
Also, analytics is inherently iterative.
- Everything I just called “modular” can reasonably be called “iterative” as well.
- So can any work process of the nature “OK, we got an insight. Let’s pursue it and get more accuracy.”
If I’m right that analytics is or at least should be modular and iterative, it’s easy to see why people hate multi-year data warehouse creation projects. Perhaps it’s also easy to see why I like the idea of schema-on-need.
2. In 2011, I wrote, in the context of agile predictive analytics, that
… the “business analyst” role should be expanded beyond BI and planning to include lightweight predictive analytics as well.
I gather that a similar point is at the heart of Gartner’s new term citizen data scientist. I am told that the term resonates with at least some enterprises. Read more