Data warehousing
Analysis of issues in data warehousing, with extensive coverage of database management systems and data warehouse appliances that are optimized to query large volumes of data. Related subjects include:
Notes on data security
1. In June I wrote about burgeoning interest in data security. I’d now like to add:
- Even more than I previously thought, demand seems to be driven largely by issues of regulatory compliance.
- In an exception to that general rule, many enterprise have vague mandates for data encryption.
- In awkward contradiction to that general rule, there’s a general sense that it’s just security’s “turn” to be a differentiating feature, since various other “enterprise” needs are already being well-addressed.
We can reconcile these anecdata pretty well if we postulate that:
- Enterprises generally agree that data security is an important need.
- Exactly how they meet this need depends upon what regulators choose to require.
2. My current impressions of the legal privacy vs. surveillance tradeoffs are basically: Read more
Categories: Data warehousing, Databricks, Spark and BDAS, EAI, EII, ETL, ELT, ETLT, Hadoop, Surveillance and privacy | Leave a Comment |
Analytics on the edge?
There’s a theory going around to the effect that:
- Compute power is and will be everywhere, for example in cars, robots, medical devices or microwave ovens. Let’s refer to these platforms collectively as “real-world appliances”.
- Much more data will be created on these platforms than can reasonably be sent back to centralized/cloudy servers.
- Therefore, cloud-centric architectures will soon be obsolete, perhaps before they’re ever dominant in the first place.
There’s enough truth to all that to make it worth discussing. But the strong forms of the claims seem overblown.
1. This story doesn’t even make sense except for certain new classes of application. Traditional business applications run all over the world, in dedicated or SaaSy modes as the case may be. E-commerce is huge. So is content delivery. Architectures for all those things will continue to evolve, but what we have now basically works.
2. When it comes to real-world appliances, this story is partially accurate. An automobile is a rolling network of custom Linux systems, each running hand-crafted real-time apps, a few of which also have minor requirements for remote connectivity. That’s OK as far as it goes, but there could be better support for real-time operational analytics. If something as flexible as Spark were capable of unattended operation, I think many engineers of real-world appliances would find great ways to use it.
3. There’s a case to be made for something better yet. I think the argument is premature, but it’s worth at least a little consideration. Read more
Generally available Kudu
I talked with Cloudera about Kudu in early May. Besides giving me a lot of information about Kudu, Cloudera also helped confirm some trends I’m seeing elsewhere, including:
- Security is an ever bigger deal.
- There’s a lot of interest in data warehouses (perhaps really data marts) that are updated in human real-time.
- Prospects for that respond well to the actual term “data warehouse”, at least when preceded by some modifier to suggest that it’s modern/low-latency/non-batch or whatever.
- Flash is often — but not yet always — preferred over disk for that kind of use.
- Sometimes these data stores are greenfield. When they’re migrations, they come more commonly from analytic RDBMS or data warehouse appliance (the most commonly mentioned ones are Teradata, Netezza and Vertica, but that’s perhaps just due to those product lines’ market share), rather than from general purpose DBMS such as Oracle or SQL Server.
- Intel is making it ever easier to vectorize CPU operations, and analytic data managers are increasingly taking advantage of this possibility.
Now let’s talk about Kudu itself. As I discussed at length in September 2015, Kudu is:
- A data storage system introduced by Cloudera (and subsequently open-sourced).
- Columnar.
- Updatable in human real-time.
- Meant to serve as the data storage tier for Impala and Spark.
Kudu’s adoption and roll-out story starts: Read more
The data security mess
A large fraction of my briefings this year have included a focus on data security. This is the first year in the past 35 that that’s been true.* I believe that reasons for this trend include:
- Security is an important aspect of being “enterprise-grade”. Other important checkboxes have been largely filled in. Now it’s security’s turn.
- A major platform shift, namely to the cloud, is underway or at least being planned for. Security is an important thing to think about as that happens.
- The cloud even aside, technology trends have created new ways to lose data, which security technology needs to address.
- Traditionally paranoid industries are still paranoid.
- Other industries are newly (and rightfully) terrified of exposing customer data.
- My clients at Cloudera thought they had a chance to get significant messaging leverage from emphasizing security. So far, it seems that they were correct.
*Not really an exception: I did once make it a project to learn about classic network security, including firewall appliances and so on.
Certain security requirements, desires or features keep coming up. These include (and as in many of my lists, these overlap):
- Easy, comprehensive access control. More on this below.
- Encryption. If other forms of security were perfect, encryption would never be needed. But they’re not.
- Auditing. Ideally, auditing can alert you to trouble before (much) damage is done. If not, then it can at least help you do proactive damage control in the face of breach.
- Whatever regulators mandate.
- Whatever is generally regarded as best practices. Security “best practices” generally keep enterprises out of legal and regulatory trouble, or at least minimize same. They also keep employees out of legal and career trouble, or minimize same. Hopefully, they even keep data safe.
- Whatever the government is known to use. This is a common proxy for “best practices”.
More specific or extreme requirements include: Read more
Categories: Business intelligence, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, QlikTech and QlikView, Tableau Software | 4 Comments |
Light-touch managed services
Cloudera recently introduced Cloudera Altus, a Hadoop-in-the-cloud offering with an interesting processing model:
- Altus manages jobs for you.
- But you actually run them on your own cluster, and so you never have to put your data under Altus’ control.
Thus, you avoid a potential security risk (shipping your data to Cloudera’s service). I’ve tentatively named this strategy light-touch managed services, and am interested in exploring how broadly applicable it might or might not be.
For light-touch to be a good approach, there should be (sufficiently) little downside in performance, reliability and so on from having your service not actually control the data. That assumption is trivially satisfied in the case of Cloudera Altus, because it’s not an ordinary kind of app; rather, its whole function is to improve the job-running part of your stack. Most kinds of apps, however, want to operate on your data directly. For those, it is more challenging to meet acceptable SLAs (Service-Level Agreements) on a light-touch basis.
Let’s back up and consider what “light-touch” for data-interacting apps (i.e., almost all apps) would actually mean. The basics are: Read more
Categories: Cloud computing, Cloudera, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, Software as a Service (SaaS), Surveillance and privacy | 3 Comments |
Interana
Interana has an interesting story, in technology and business model alike. For starters:
- Interana does ad-hoc event series analytics, which they call “interactive behavioral analytics solutions”.
- Interana has a full-stack analytic offering, include:
- Its own columnar DBMS …
- … which has a non-SQL DML (Data Manipulation Language) meant to handle event series a lot more fluently than SQL does, but which the user is never expected to learn because …
- … there also are BI-like visual analytics tools that support plenty of drilldown.
- Interana sells all this to “product” departments rather than marketing, because marketing doesn’t sufficiently value Interana’s ad-hoc query flexibility.
- Interana boasts >40 customers, with annual subscription fees ranging from high 5 figures to low 7 digits.
And to be clear — if we leave aside any questions of marketing-name sizzle, this really is business intelligence. The closest Interana comes to helping with predictive modeling is giving its ad-hoc users inspiration as to where they should focus their modeling attention.
Interana also has an interesting twist in its business model, which I hope can be used successfully by other enterprise software startups as well. Read more
Analyzing the right data
0. A huge fraction of what’s important in analytics amounts to making sure that you are analyzing the right data. To a large extent, “the right data” means “the right subset of your data”.
1. In line with that theme:
- Relational query languages, at their core, subset data. Yes, they all also do arithmetic, and many do more math or other processing than just that. But it all starts with the set theory.
- Underscoring the power of this approach, other data architectures over which analytics is done usually wind up with SQL or “SQL-like” language access as well.
2. Business intelligence interfaces today don’t look that different from what we had in the 1980s or 1990s. The biggest visible* changes, in my opinion, have been in the realm of better drilldown, ala QlikView and then Tableau. Drilldown, of course, is the main UI for business analysts and end users to subset data themselves.
*I used the word “visible” on purpose. The advances at the back end have been enormous, and much of that redounds to the benefit of BI.
3. I wrote 2 1/2 years ago that sophisticated predictive modeling commonly fit the template:
- Divide your data into clusters.
- Model each cluster separately.
That continues to be tough work. Attempts to productize shortcuts have not caught fire.
Introduction to SequoiaDB and SequoiaCM
For starters, let me say:
- SequoiaDB, the company, is my client.
- SequoiaDB, the product, is the main product of SequoiaDB, the company.
- SequoiaDB, the company, has another product line SequoiaCM, which subsumes SequoiaDB in content management use cases.
- SequoiaDB, the product, is fundamentally a JSON data store. But it has a relational front end …
- … and is usually sold for RDBMS-like use cases …
- … except when it is sold as part of SequoiaCM, which adds in a large object/block store and a content-management-oriented library.
- SequoiaDB’s products are open source.
- SequoiaDB’s largest installation seems to be 2 PB across 100 nodes; that includes block storage.
- Figures for DBMS-only database sizes aren’t as clear, but the sweet spot of the cluster-size range for such use cases seems to be 6-30 nodes.
Also:
- SequoiaDB, the company, was founded in Toronto, by former IBM DB2 folks.
- Even so, it’s fairly accurate to view SequoiaDB as a Chinese company. Specifically:
- SequoiaDB’s founders were Chinese nationals.
- Most of them went back to China.
- Other employees to date have been entirely Chinese.
- Sales to date have been entirely in China, but SequoiaDB has international aspirations
- SequoiaDB has >100 employees, a large majority of which are split fairly evenly between “engineering” and “implementation and technical support”.
- SequoiaDB’s marketing (as opposed to sales) department is astonishingly tiny.
- SequoiaDB cites >100 subscription customers, including 10 in the global Fortune 500, a large fraction of which are in the banking sector. (Other sectors mentioned repeatedly are government and telecom.)
Unfortunately, SequoiaDB has not captured a lot of detailed information about unpaid open source production usage.
Notes on the transition to the cloud
1. The cloud is super-hot. Duh. And so, like any hot buzzword, “cloud” means different things to different marketers. Four of the biggest things that have been called “cloud” are:
- The Amazon cloud, Microsoft Azure, and their competitors, aka public cloud.
- Software as a service, aka SaaS.
- Co-location in off-premises data centers, aka colo.
- On-premises clusters (truly on-prem or colo as the case may be) designed to run a broad variety of applications, aka private cloud.
Further, there’s always the idea of hybrid cloud, in which a vendor peddles private cloud systems (usually appliances) running similar technology stacks to what they run in their proprietary public clouds. A number of vendors have backed away from such stories, but a few are still pushing it, including Oracle and Microsoft.
This is a good example of Monash’s Laws of Commercial Semantics.
2. Due to economies of scale, only a few companies should operate their own data centers, aka true on-prem(ises). The rest should use some combination of colo, SaaS, and public cloud.
This fact now seems to be widely understood.
“Real-time” is getting real
I’ve been an analyst for 35 years, and debates about “real-time” technology have run through my whole career. Some of those debates are by now pretty much settled. In particular:
- Yes, interactive computer response is crucial.
- Into the 1980s, many apps were batch-only. Demand for such apps dried up.
- Business intelligence should occur at interactive speeds, which is a major reason that there’s a market for high-performance analytic RDBMS.
- Theoretical arguments about “true” real-time vs. near-real-time are often pointless.
- What matters in most cases is human users’ perceptions of speed.
- Most of the exceptions to that rule occur when machines race other machines, for example in automated bidding (high frequency trading or otherwise) or in network security.
A big issue that does remain open is: How fresh does data need to be? My preferred summary answer is: As fresh as is needed to support the best decision-making. I think that formulation starts with several advantages:
- It respects the obvious point that different use cases require different levels of data freshness.
- It cautions against people who think they need fresh information but aren’t in a position to use it. (Such users have driven much bogus “real-time” demand in the past.)
- It covers cases of both human and automated decision-making.
Straightforward applications of this principle include: Read more