Analytic technologies
Discussion of technologies related to information query and analysis. Related subjects include:
- Business intelligence
- Data warehousing
- (in Text Technologies) Text mining
- (in The Monash Report) Data mining
- (in The Monash Report) General issues in analytic technology
Brittleness, Murphy’s Law, and single-impetus failures
In my initial post on brittleness I suggested that a typical process is:
- Build something brittle.
- Strengthen it over time.
In many engineering scenarios, a fuller description could be:
- Design something that works in the base cases.
- Anticipate edge cases and sources of error, and design for them too.
- Implement the design.
- Discover which edge cases and error sources you failed to consider.
- Improve your product to handle them too.
- Repeat as needed.
So it’s necessary to understand what is or isn’t likely to go wrong. Unfortunately, that need isn’t always met. Read more
Categories: Analytic technologies, Text | 5 Comments |
Brittleness and incremental improvement
Every system, computer or otherwise, needs to deal with possibilities of damage or error. If it does this well, it may be regarded as “robust”, “mature(d)”, “strengthened”, or simply “improved”.* Otherwise, it can reasonably be called “brittle”.
*It’s also common to use the word “harden(ed)”. But I think that’s a poor choice, as brittle things are often also hard.
0. As a general rule in IT:
- New technologies and products are brittle.
- They are strengthened incrementally over time.
There are many categories of IT strengthening. Two of the broadest are:
- Bug-fixing.
- Bottleneck Whack-A-Mole.
1. One of my more popular posts stated:
Developing a good DBMS requires 5-7 years and tens of millions of dollars.
The reasons I gave all spoke to brittleness/strengthening, most obviously in:
Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.
Similar things are true for other kinds of “platform software” or distributed systems.
2. The UI brittleness/improvement story starts similarly: Read more
Some stuff that’s always on my mind
I have a LOT of partially-written blog posts, but am struggling to get any of them finished (obviously). Much of the problem is that they have so many dependencies on each other. Clearly, then, I should consider refactoring my writing plans. 🙂
So let’s start with this. Here, in no particular order, is a list of some things that I’ve said in the past, and which I still think are or should be of interest today. It’s meant to be background for numerous posts I plan to write in the near future, and indeed a few hooks for such posts are included below.
1. Data(base) management technology is progressing pretty much as I expected.
- Vendors generally recognize that maturing a data store is an important, many-years-long process.
- Multiple kinds of data model are viable …
- … but it’s usually helpful to be able to do some kind of JOIN.
- To deal with the variety of hardware/network/storage arrangements out there, layering/tiering is on the rise. (An amazing number of vendors each seem to think they invented the idea.)
2. Rightly or wrongly, enterprises are often quite sloppy about analytic accuracy.
- My two central examples have long been inaccurate metrics and false-positive alerts.
- In predictive analytics, it’s straightforward to quantify how much additional value you’re leaving on the table with your imperfect accuracy.
- Enterprise search and other text technologies are still often terrible.
- After years of “real-time” overhype, organizations have seemingly swung to under-valuing real-time analytics.
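To make the predictive-analytics point concrete, here is a minimal sketch of how the value left on the table might be computed. The scenario (a campaign that contacts every predicted positive), the names `Outcomes`, `campaign_value` and `value_left_on_table`, and the figures for gain and cost are all illustrative assumptions, not anything from a specific product or customer.

```python
from dataclasses import dataclass


@dataclass
class Outcomes:
    """Confusion-matrix counts for a hypothetical yes/no prediction."""
    true_pos: int
    false_pos: int
    false_neg: int
    true_neg: int


def campaign_value(o: Outcomes, gain_per_hit: int, cost_per_contact: int) -> int:
    """Net value of contacting everyone the model flags as positive."""
    contacted = o.true_pos + o.false_pos
    return o.true_pos * gain_per_hit - contacted * cost_per_contact


def value_left_on_table(o: Outcomes, gain_per_hit: int, cost_per_contact: int) -> int:
    """Gap between a perfect model and the actual one on the same population."""
    actual_positives = o.true_pos + o.false_neg
    actual_negatives = o.true_neg + o.false_pos
    # A perfect model flags exactly the real positives and nobody else.
    perfect = Outcomes(actual_positives, 0, 0, actual_negatives)
    return (campaign_value(perfect, gain_per_hit, cost_per_contact)
            - campaign_value(o, gain_per_hit, cost_per_contact))


if __name__ == "__main__":
    observed = Outcomes(true_pos=80, false_pos=40, false_neg=20, true_neg=860)
    print(value_left_on_table(observed, gain_per_hit=100, cost_per_contact=5))
```

With these assumed numbers the actual model nets 7,400 and a perfect one would net 9,500, so the gap is 2,100. The hard part in practice is usually agreeing on the gain and cost figures, not the arithmetic.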
Categories: Data models and architecture, Database diversity, Predictive modeling and advanced analytics, Public policy, Theory and architecture | 5 Comments |
Notes on artificial intelligence, December 2017
Most of my comments about artificial intelligence in December, 2015 still hold true. But there are a few points I’d like to add, reiterate or amplify.
1. As I wrote back then in a post about the connection between machine learning and the rest of AI,
It is my opinion that most things called “intelligence” — natural and artificial alike — have a great deal to do with pattern recognition and response.
2. Accordingly, it can be reasonable to equate machine learning and AI.
- AI based on machine learning frequently works, on more than a toy level. (Examples: Various projects by Google)
- AI based on knowledge representation usually doesn’t. (Examples: IBM Watson, 1980s expert systems)
- “AI” can be the sexier marketing or fund-raising term.
3. Similarly, it can be reasonable to equate AI and pattern recognition. Glitzy applications of AI include:
- Understanding or translation of language (written or spoken as the case may be).
- Machine vision or autonomous vehicles.
- Facial recognition.
- Disease diagnosis via radiology interpretation.
4. The importance of AI and of recent AI advances differs greatly according to application or data category. Read more
Categories: Cloud computing, Predictive modeling and advanced analytics, Public policy, Surveillance and privacy | 4 Comments |
Imanis Data
I talked recently with the folks at Imanis Data. For starters:
- The point of Imanis is to make copies of your databases, for purposes such as backup/restore, test/analysis, or compliance-driven archiving. (That’s in declining order of current customer activity.) Another use is migration via restoring to a different cluster than the one that created the data in the first place.
- The data can come from NoSQL database managers, from Hadoop, or from Vertica. (Again, that’s in declining order.)
- As you might imagine, Imanis makes incremental backups; the only full backup is the first one you do for that database.
- “Imanis” is a new name; the previous name was “Talena”.
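The incremental-backup idea can be sketched generically as follows. This is not Imanis’ implementation or API; `BackupChain` is a hypothetical toy that treats a database as a dictionary of keys to values, takes a full copy the first time, and afterwards records only what changed or was deleted.

```python
class BackupChain:
    """Toy incremental backup: one full copy, then only deltas."""

    def __init__(self):
        self.backups = []   # list of (changed_items, deleted_keys)
        self._last = {}     # state as of the most recent backup

    def backup(self, current: dict) -> int:
        """Record a backup and return how many entries it had to store."""
        changed = {
            k: v for k, v in current.items()
            if k not in self._last or self._last[k] != v
        }
        deleted = set(self._last) - set(current)
        self.backups.append((changed, deleted))
        self._last = dict(current)
        return len(changed) + len(deleted)

    def restore(self, upto=None) -> dict:
        """Rebuild the state by replaying the first `upto` backups (all if None)."""
        state = {}
        for changed, deleted in self.backups[:upto]:
            for key in deleted:
                state.pop(key, None)
            state.update(changed)
        return state


if __name__ == "__main__":
    chain = BackupChain()
    chain.backup({"a": 1, "b": 2})           # first backup is effectively full
    chain.backup({"a": 1, "b": 3, "c": 4})   # stores only b and c
    chain.backup({"a": 1, "c": 4})           # stores only the deletion of b
    print(chain.restore())
    print(chain.restore(upto=2))
```

Restoring into a fresh dictionary rather than the original store also mirrors the migration use mentioned above, where the restore target is a different cluster from the source.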
Categories: Cassandra, Hadoop, Market share and customer counts, NoSQL, Predictive modeling and advanced analytics, Vertica Systems | 1 Comment |
Notes on data security
1. In June I wrote about burgeoning interest in data security. I’d now like to add:
- Even more than I previously thought, demand seems to be driven largely by issues of regulatory compliance.
- In an exception to that general rule, many enterprises have vague mandates for data encryption.
- In awkward contradiction to that general rule, there’s a general sense that it’s just security’s “turn” to be a differentiating feature, since various other “enterprise” needs are already being well-addressed.
We can reconcile these anecdata pretty well if we postulate that:
- Enterprises generally agree that data security is an important need.
- Exactly how they meet this need depends upon what regulators choose to require.
2. My current impressions of the legal privacy vs. surveillance tradeoffs are basically: Read more
Categories: Data warehousing, Databricks, Spark and BDAS, EAI, EII, ETL, ELT, ETLT, Hadoop, Surveillance and privacy | Leave a Comment |
Analytics on the edge?
There’s a theory going around to the effect that:
- Compute power is and will be everywhere, for example in cars, robots, medical devices or microwave ovens. Let’s refer to these platforms collectively as “real-world appliances”.
- Much more data will be created on these platforms than can reasonably be sent back to centralized/cloudy servers.
- Therefore, cloud-centric architectures will soon be obsolete, perhaps before they’re ever dominant in the first place.
There’s enough truth to all that to make it worth discussing. But the strong forms of the claims seem overblown.
1. This story doesn’t even make sense except for certain new classes of application. Traditional business applications run all over the world, in dedicated or SaaSy modes as the case may be. E-commerce is huge. So is content delivery. Architectures for all those things will continue to evolve, but what we have now basically works.
2. When it comes to real-world appliances, this story is partially accurate. An automobile is a rolling network of custom Linux systems, each running hand-crafted real-time apps, a few of which also have minor requirements for remote connectivity. That’s OK as far as it goes, but there could be better support for real-time operational analytics. If something as flexible as Spark were capable of unattended operation, I think many engineers of real-world appliances would find great ways to use it.
3. There’s a case to be made for something better yet. I think the argument is premature, but it’s worth at least a little consideration. Read more
Generally available Kudu
I talked with Cloudera about Kudu in early May. Besides giving me a lot of information about Kudu, Cloudera also helped confirm some trends I’m seeing elsewhere, including:
- Security is an ever bigger deal.
- There’s a lot of interest in data warehouses (perhaps really data marts) that are updated in human real-time.
- Prospects for that respond well to the actual term “data warehouse”, at least when preceded by some modifier to suggest that it’s modern/low-latency/non-batch or whatever.
- Flash is often — but not yet always — preferred over disk for that kind of use.
- Sometimes these data stores are greenfield. When they’re migrations, they come more commonly from analytic RDBMSs or data warehouse appliances (the most commonly mentioned ones are Teradata, Netezza and Vertica, but that’s perhaps just due to those product lines’ market share) than from general-purpose DBMSs such as Oracle or SQL Server.
- Intel is making it ever easier to vectorize CPU operations, and analytic data managers are increasingly taking advantage of this possibility.
Now let’s talk about Kudu itself. As I discussed at length in September 2015, Kudu is:
- A data storage system introduced by Cloudera (and subsequently open-sourced).
- Columnar.
- Updatable in human real-time.
- Meant to serve as the data storage tier for Impala and Spark.
Kudu’s adoption and roll-out story starts: Read more
The data security mess
A large fraction of my briefings this year have included a focus on data security. This is the first year in the past 35 that that’s been true.* I believe that reasons for this trend include:
- Security is an important aspect of being “enterprise-grade”. Other important checkboxes have been largely filled in. Now it’s security’s turn.
- A major platform shift, namely to the cloud, is underway or at least being planned for. Security is an important thing to think about as that happens.
- The cloud even aside, technology trends have created new ways to lose data, which security technology needs to address.
- Traditionally paranoid industries are still paranoid.
- Other industries are newly (and rightfully) terrified of exposing customer data.
- My clients at Cloudera thought they had a chance to get significant messaging leverage from emphasizing security. So far, it seems that they were correct.
*Not really an exception: I did once make it a project to learn about classic network security, including firewall appliances and so on.
Certain security requirements, desires or features keep coming up. These include (and as in many of my lists, these overlap):
- Easy, comprehensive access control. More on this below.
- Encryption. If other forms of security were perfect, encryption would never be needed. But they’re not.
- Auditing. Ideally, auditing can alert you to trouble before (much) damage is done. If not, then it can at least help you do proactive damage control in the face of a breach.
- Whatever regulators mandate.
- Whatever is generally regarded as best practices. Security “best practices” generally keep enterprises out of legal and regulatory trouble, or at least minimize same. They also keep employees out of legal and career trouble, or minimize same. Hopefully, they even keep data safe.
- Whatever the government is known to use. This is a common proxy for “best practices”.
More specific or extreme requirements include: Read more
Categories: Business intelligence, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, QlikTech and QlikView, Tableau Software | 4 Comments |
Light-touch managed services
Cloudera recently introduced Cloudera Altus, a Hadoop-in-the-cloud offering with an interesting processing model:
- Altus manages jobs for you.
- But you actually run them on your own cluster, and so you never have to put your data under Altus’ control.
Thus, you avoid a potential security risk (shipping your data to Cloudera’s service). I’ve tentatively named this strategy light-touch managed services, and am interested in exploring how broadly applicable it might or might not be.
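As a rough illustration of the split, here is a minimal sketch in which a managing service sees only job specifications and status reports, while the data and the results stay on the customer’s cluster. Every name here (`JobSpec`, `CustomerCluster`, `ManagedService`, the `count` and `sum` operations) is hypothetical; none of it reflects how Cloudera Altus is actually built or invoked.

```python
from dataclasses import dataclass


@dataclass
class JobSpec:
    job_id: str
    operation: str   # name of an operation the cluster knows how to run
    dataset: str     # dataset name only; the data itself is never included


class CustomerCluster:
    """Holds the data, runs the jobs, and keeps the results locally."""

    def __init__(self, datasets: dict):
        self._datasets = datasets
        self._operations = {"count": len, "sum": sum}
        self.results = {}

    def run(self, spec: JobSpec) -> dict:
        try:
            operation = self._operations[spec.operation]
            data = self._datasets[spec.dataset]
        except KeyError:
            return {"job_id": spec.job_id, "status": "failed"}
        self.results[spec.job_id] = operation(data)
        # Only metadata leaves the cluster, not data or results.
        return {"job_id": spec.job_id, "status": "succeeded"}


class ManagedService:
    """Tracks and dispatches jobs without ever holding customer data."""

    def __init__(self):
        self.status = {}

    def submit(self, spec: JobSpec, cluster: CustomerCluster) -> str:
        report = cluster.run(spec)
        self.status[report["job_id"]] = report["status"]
        return report["status"]


if __name__ == "__main__":
    cluster = CustomerCluster({"orders": [10, 20, 30]})
    service = ManagedService()
    print(service.submit(JobSpec("j1", "sum", "orders"), cluster))
    print(cluster.results["j1"])
    print(service.status)
```

The service’s `status` table is all it ever learns. An app that needed the contents of `orders` to do its own work could not be written this way without further machinery.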
For light-touch to be a good approach, there should be (sufficiently) little downside in performance, reliability and so on from having your service not actually control the data. That assumption is trivially satisfied in the case of Cloudera Altus, because it’s not an ordinary kind of app; rather, its whole function is to improve the job-running part of your stack. Most kinds of apps, however, want to operate on your data directly. For those, it is more challenging to meet acceptable SLAs (Service-Level Agreements) on a light-touch basis.
Let’s back up and consider what “light-touch” for data-interacting apps (i.e., almost all apps) would actually mean. The basics are: Read more