Hadoop
Discussion of Hadoop.
Imanis Data
I talked recently with the folks at Imanis Data. For starters:
- The point of Imanis is to make copies of your databases, for purposes such as backup/restore, test/analysis, or compliance-driven archiving. (That’s in declining order of current customer activity.) Another use is migration via restoring to a different cluster than the one that created the data in the first place.
- The data can come from NoSQL database managers, from Hadoop, or from Vertica. (Again, that’s in declining order.)
- As you might imagine, Imanis makes incremental backups; the only full backup is the first one you do for that database.
- “Imanis” is a new name; the previous name was “Talena”.
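To illustrate the incremental-backup idea generically — this is a toy sketch, not Imanis' actual mechanism, and every path and function name in it is made up — a backup job can keep a manifest of what it copied last time and copy only what has changed since, so that the first run is the one and only full backup:

```python
import os, shutil, json

def incremental_backup(source_dir, backup_dir, manifest_path):
    """Copy only files changed since the previous run, tracked via a manifest
    of path -> last-modified time. The first run copies everything, i.e. it
    acts as the single full backup."""
    try:
        with open(manifest_path) as f:
            manifest = json.load(f)
    except FileNotFoundError:
        manifest = {}  # no previous backup: everything counts as "new"

    for root, _, files in os.walk(source_dir):
        for name in files:
            src = os.path.join(root, name)
            mtime = os.path.getmtime(src)
            if manifest.get(src) != mtime:  # new or changed since last run
                rel = os.path.relpath(src, source_dir)
                dst = os.path.join(backup_dir, rel)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copy2(src, dst)
                manifest[src] = mtime

    with open(manifest_path, "w") as f:
        json.dump(manifest, f)
```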
Categories: Cassandra, Hadoop, Market share and customer counts, NoSQL, Predictive modeling and advanced analytics, Vertica Systems | 1 Comment |
Notes on data security
1. In June I wrote about burgeoning interest in data security. I’d now like to add:
- Even more than I previously thought, demand seems to be driven largely by issues of regulatory compliance.
- In an exception to that general rule, many enterprises have vague mandates for data encryption.
- In awkward contradiction to that general rule, there’s a general sense that it’s just security’s “turn” to be a differentiating feature, since various other “enterprise” needs are already being well-addressed.
We can reconcile these anecdata pretty well if we postulate that:
- Enterprises generally agree that data security is an important need.
- Exactly how they meet this need depends upon what regulators choose to require.
2. My current impressions of the legal privacy vs. surveillance tradeoffs are basically: Read more
Categories: Data warehousing, Databricks, Spark and BDAS, EAI, EII, ETL, ELT, ETLT, Hadoop, Surveillance and privacy | Leave a Comment |
Generally available Kudu
I talked with Cloudera about Kudu in early May. Besides giving me a lot of information about Kudu, Cloudera also helped confirm some trends I’m seeing elsewhere, including:
- Security is an ever bigger deal.
- There’s a lot of interest in data warehouses (perhaps really data marts) that are updated in human real-time.
- Prospects for such systems respond well to the actual term “data warehouse”, at least when it is preceded by some modifier suggesting that it's modern/low-latency/non-batch or whatever.
- Flash is often — but not yet always — preferred over disk for that kind of use.
- Sometimes these data stores are greenfield. When they're migrations, they come more commonly from analytic RDBMS or data warehouse appliances (the most commonly mentioned ones are Teradata, Netezza and Vertica, but that's perhaps just due to those product lines' market share) than from general-purpose DBMS such as Oracle or SQL Server.
- Intel is making it ever easier to vectorize CPU operations, and analytic data managers are increasingly taking advantage of this possibility.
Now let’s talk about Kudu itself. As I discussed at length in September 2015, Kudu is:
- A data storage system introduced by Cloudera (and subsequently open-sourced).
- Columnar.
- Updatable in human real-time.
- Meant to serve as the data storage tier for Impala and Spark.
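To make that last point concrete, here is a minimal PySpark sketch of treating Kudu as Spark's storage tier via the kudu-spark connector. It assumes the connector package is on the classpath; the master address and table name are placeholders, not anything Cloudera told me.

```python
# Minimal sketch: reading a Kudu table from Spark. The master address and
# table name below are invented placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kudu-example")
         .getOrCreate())

events = (spark.read
          .format("org.apache.kudu.spark.kudu")
          .option("kudu.master", "kudu-master.example.com:7051")
          .option("kudu.table", "impala::default.events")
          .load())

# Because Kudu tables are updatable in place, the same table can be appended
# to from Spark while being queried with low latency from Impala.
recent = events.filter("event_time > '2017-08-01'")
recent.show()
```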
Kudu’s adoption and roll-out story starts: Read more
The data security mess
A large fraction of my briefings this year have included a focus on data security. This is the first year in the past 35 that that’s been true.* I believe that reasons for this trend include:
- Security is an important aspect of being “enterprise-grade”. Other important checkboxes have been largely filled in. Now it’s security’s turn.
- A major platform shift, namely to the cloud, is underway or at least being planned for. Security is an important thing to think about as that happens.
- The cloud even aside, technology trends have created new ways to lose data, which security technology needs to address.
- Traditionally paranoid industries are still paranoid.
- Other industries are newly (and rightfully) terrified of exposing customer data.
- My clients at Cloudera thought they had a chance to get significant messaging leverage from emphasizing security. So far, it seems that they were correct.
*Not really an exception: I did once make it a project to learn about classic network security, including firewall appliances and so on.
Certain security requirements, desires or features keep coming up. These include (and as in many of my lists, these overlap):
- Easy, comprehensive access control. More on this below.
- Encryption. If other forms of security were perfect, encryption would never be needed. But they're not. (A minimal encryption sketch appears after this list.)
- Auditing. Ideally, auditing can alert you to trouble before (much) damage is done. If not, then it can at least help you do proactive damage control in the face of a breach.
- Whatever regulators mandate.
- Whatever is generally regarded as best practices. Security “best practices” generally keep enterprises out of legal and regulatory trouble, or at least minimize same. They also keep employees out of legal and career trouble, or minimize same. Hopefully, they even keep data safe.
- Whatever the government is known to use. This is a common proxy for “best practices”.
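As for the encryption bullet, here's what the at-rest case can look like in miniature, using the widely available Python cryptography package. This is a generic illustration, not any vendor's implementation; in a real deployment the key would live in a key management system rather than next to the data.

```python
# Minimal at-rest encryption sketch using the Python "cryptography" package.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice, fetched from a KMS
cipher = Fernet(key)

plaintext = b"customer_ssn=123-45-6789"
token = cipher.encrypt(plaintext)    # safe to store or ship as-is

# Only a holder of the key can recover the original value.
assert cipher.decrypt(token) == plaintext
```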
More specific or extreme requirements include: Read more
Categories: Business intelligence, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, QlikTech and QlikView, Tableau Software | 4 Comments |
Light-touch managed services
Cloudera recently introduced Cloudera Altus, a Hadoop-in-the-cloud offering with an interesting processing model:
- Altus manages jobs for you.
- But you actually run them on your own cluster, and so you never have to put your data under Altus’ control.
Thus, you avoid a potential security risk (shipping your data to Cloudera’s service). I’ve tentatively named this strategy light-touch managed services, and am interested in exploring how broadly applicable it might or might not be.
For light-touch to be a good approach, there should be (sufficiently) little downside in performance, reliability and so on from having your service not actually control the data. That assumption is trivially satisfied in the case of Cloudera Altus, because it’s not an ordinary kind of app; rather, its whole function is to improve the job-running part of your stack. Most kinds of apps, however, want to operate on your data directly. For those, it is more challenging to meet acceptable SLAs (Service-Level Agreements) on a light-touch basis.
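To sketch what that separation looks like in code — purely illustratively, with hypothetical class and field names rather than Altus' actual API — the managed service sees only job metadata, while the data and the execution stay on the customer's own cluster:

```python
# Illustrative "light-touch" control plane: the vendor-hosted service handles
# scheduling and status, but only metadata ever crosses the boundary.
from dataclasses import dataclass

@dataclass
class JobSpec:
    name: str
    script_uri: str        # points at code, not at data
    cluster_endpoint: str  # the customer's own cluster

class LightTouchService:
    """The vendor-hosted piece: schedules, retries, and reports on jobs."""
    def submit(self, spec: JobSpec) -> str:
        # Hand the spec to an agent on the customer's cluster; only status
        # and metrics flow back to the service.
        print(f"dispatching {spec.name} to {spec.cluster_endpoint}")
        return "job-0001"  # an opaque job id

    def status(self, job_id: str) -> str:
        return "RUNNING"   # derived from metadata the agent reports

svc = LightTouchService()
svc.submit(JobSpec("nightly-etl", "s3://bucket/etl.py", "my-cluster.internal"))
```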
Let’s back up and consider what “light-touch” for data-interacting apps (i.e., almost all apps) would actually mean. The basics are: Read more
Categories: Cloud computing, Cloudera, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, Software as a Service (SaaS), Surveillance and privacy | 3 Comments |
Cloudera Altus
I talked with Cloudera before the recent release of Altus. In simplest terms, Cloudera’s cloud strategy aspires to:
- Provide all the important advantages of on-premises Cloudera.
- Provide all the important advantages of native cloud offerings such as Amazon EMR (Elastic MapReduce), or at least come sufficiently close to that goal.
- Benefit from customers’ desire to have on-premises and cloud deployments that work:
- Alike in any case.
- Together, to the extent that that makes use-case sense.
In other words, Cloudera is porting its software to an important new platform.* And this port isn’t complete yet, in that Altus is geared only for certain workloads. Specifically, Altus is focused on “data pipelines”, aka data transformation, aka “data processing”, aka new-age ETL (Extract/Transform/Load). (Other kinds of workload are on the roadmap, including several different styles of Impala use.) So what about that is particularly interesting? Well, let’s drill down.
*Or, if you prefer, improving on early versions of the port.
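To make "data pipelines, aka new-age ETL" concrete, here is a generic PySpark transformation job of the sort such a service schedules. The paths and column names are invented for illustration; nothing here is specific to Altus.

```python
# Generic "data pipeline" job: read raw files, transform, write a cleaned
# result. Paths and column names are invented.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nightly-etl").getOrCreate()

raw = spark.read.json("s3a://example-bucket/raw/clicks/2017-08-01/")

cleaned = (raw
           .filter(F.col("user_id").isNotNull())
           .withColumn("event_date", F.to_date("timestamp"))
           .dropDuplicates(["user_id", "timestamp"]))

cleaned.write.mode("overwrite").parquet("s3a://example-bucket/cleaned/clicks/")
```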
Categories: Amazon and its cloud, Cloud computing, Cloudera, Databricks, Spark and BDAS, Hadoop, Log analysis, MapReduce, Software as a Service (SaaS) | 2 Comments |
Cloudera’s Data Science Workbench
0. Matt Brandwein of Cloudera briefed me on the new Cloudera Data Science Workbench. The problem it purports to solve is:
- One way to do data science is to repeatedly jump through the hoops of working with a properly-secured Hadoop cluster. This is difficult.
- Another way is to extract data from a Hadoop cluster onto your personal machine. This is insecure (once the data arrives) and not very parallelized.
- A third way is needed.
Cloudera’s idea for a third way is:
- You don’t run anything on your desktop/laptop machine except a browser.
- The browser connects you to a Docker container that holds (and isolates) a kind of virtual desktop for you.
- The Docker container runs on your Cloudera cluster, so connectivity-to-Hadoop and security are handled rather automagically.
In theory, that’s pure goodness … assuming that the automagic works sufficiently well. I gather that Cloudera Data Science Workbench has been beta tested by 5 large organizations and many 10s of users. We’ll see what is or isn’t missing as more customers take it for a spin.
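For flavor only — this is not Cloudera's actual implementation, and the image name, mounts and environment variables are hypothetical — a per-user, container-isolated workspace on a cluster node could be launched along these lines with the Docker SDK for Python:

```python
# Illustrative launch of a per-user workspace container on a cluster node.
# Image name, mounts, and environment are invented for the sketch.
import docker

client = docker.from_env()

session = client.containers.run(
    image="example/data-science-workbench:latest",
    detach=True,
    ports={"8080/tcp": None},  # the user's browser is proxied to this port
    volumes={
        "/etc/krb5.conf": {"bind": "/etc/krb5.conf", "mode": "ro"},
        "/home/alice/project": {"bind": "/workspace", "mode": "rw"},
    },
    environment={"HADOOP_USER_NAME": "alice"},
)

print(session.id)  # handed to the web tier so it can route the user's session
```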
Categories: Cloudera, Hadoop, Market share and customer counts, Predictive modeling and advanced analytics | 5 Comments |
DBAs of the future
After a July visit to DataStax, I wrote:
The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things.
- Handle the database design part of application development. In NoSQL environments, this part of the job is indeed largely refactored away. More precisely, it is integrated into the general app developer/architect role.
- Manage production databases. This part of the DBA job is, if anything, a bigger deal in the NoSQL world than in more mature and automated relational environments. It’s likely to be called part of “devops” rather than “DBA”, but by whatever name it’s very much a thing.
That turns out to understate the core point, which is that DBAs still matter in non-RDBMS environments. Specifically, it’s too narrow in two ways.
- First, it’s generally too narrow as to what DBAs do; people with DBA-like skills are also involved in other areas such as “data governance”, “information lifecycle management”, storage, or what I like to call data mustering.
- Second — and more narrowly 🙂 — the first bullet point of the quote is actually incorrect. In fact, the database design part of application development can be done by a specialized person up front in the NoSQL world, just as it commonly is for RDBMS apps.
My wake-up call for that latter bit was a recent MongoDB 3.4 briefing. MongoDB certainly has various efforts in administrative tools, which I won’t recapitulate here. But to my surprise, MongoDB also found a role for something resembling relational database design. The idea is simple: A database administrator defines a view against a MongoDB database, where views: Read more
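For reference, MongoDB 3.4 views are defined on a source collection via an aggregation pipeline, which is exactly the kind of up-front design work described above. A minimal PyMongo sketch, with collection and field names invented:

```python
# Minimal sketch of defining a MongoDB 3.4 view: imposing a stable, restricted
# shape on an underlying collection. Collection and field names are invented.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["appdb"]

db.command(
    "create", "customer_summary",   # the view's name
    viewOn="customers",             # the underlying collection
    pipeline=[
        {"$project": {"name": 1, "region": 1, "_id": 0}},  # hide other fields
    ],
)

# Applications then query the view like a read-only collection.
for doc in db["customer_summary"].find({"region": "EMEA"}):
    print(doc)
```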
Categories: Databricks, Spark and BDAS, Hadoop, MongoDB, NoSQL, Streaming and complex event processing (CEP) | Leave a Comment |
Notes on the transition to the cloud
1. The cloud is super-hot. Duh. And so, like any hot buzzword, “cloud” means different things to different marketers. Four of the biggest things that have been called “cloud” are:
- The Amazon cloud, Microsoft Azure, and their competitors, aka public cloud.
- Software as a service, aka SaaS.
- Co-location in off-premises data centers, aka colo.
- On-premises clusters (truly on-prem or colo as the case may be) designed to run a broad variety of applications, aka private cloud.
Further, there's always the idea of hybrid cloud, in which a vendor peddles private cloud systems (usually appliances) running similar technology stacks to what they run in their proprietary public clouds. A number of vendors have backed away from such stories, but a few are still pushing them, including Oracle and Microsoft.
This is a good example of Monash’s Laws of Commercial Semantics.
2. Due to economies of scale, only a few companies should operate their own data centers, aka true on-prem(ises). The rest should use some combination of colo, SaaS, and public cloud.
This fact now seems to be widely understood.
Are analytic RDBMS and data warehouse appliances obsolete?
I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:
- Many of the independent vendors were scooped up by acquisition.
- None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
- Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
- Impala and other Hadoop-based alternatives are technology options.
- Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big. 🙂
Simply reciting all that, however, raises the question of whether one should still care about analytic RDBMS at all.
My answer, in a nutshell, is:
Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud — are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.