Predictive modeling and advanced analytics

Discussion of technologies and vendors in the overlapping areas of predictive analytics, predictive modeling, data mining, machine learning, Monte Carlo analysis, and other “advanced” analytics.

November 1, 2012

Introduction to Continuuity

I chatted with Todd Papaioannou about his new company Continuuity. Todd is as handy at combining buzzwords as he is at concatenating vowels, and so Continuuity — with two “U”s — is making a big data fabric platform as a service with REST APIs that runs over Hadoop and HBase in the private or public clouds. I found the whole thing confusing, in that:

I recoil against buzzwords. In particular …
… I pay as little attention to distinctions among PaaS/IaaS/WaaS — Platform/Infrastructure/Whatever as a Service — as I can.
The Continuuity story sounds Heroku-like, but Todd doesn’t want Continuuity compared to Heroku.
Todd does want Continuuity discussed in terms of the application server category, but:
- It is hard to discuss app servers without segueing quickly amongst development, deployment, and data connectivity, and Continuuity is no exception to that rule.
- There is doubt as to whether using app servers makes any sense.

But all confusion aside, there are some interesting aspects to Continuuity. Read more

Categories: Application servers, Cloud computing, Hadoop, HBase, MapReduce, Parallelization, Predictive modeling and advanced analytics, Software as a Service (SaaS)

7 Comments

September 26, 2012

When should analytics be in-memory?

I was asked today for rules or guidance regarding “analytical problems, situations, or techniques better suited for in-database versus in-memory processing”. There are actually two kinds of distinction to be drawn:

Some workloads, in principle, should run on data to which there’s very fast and unfettered access — so fast and unfettered that you’d love the whole data set to be in RAM. For others, there is little such need.
Some products, in practice, are coupled to specific in-memory data stores or to specific DBMS, even though other similar products don’t make the same storage assumptions.

Let’s focus on the first part of that — what work, in principle, should be done in memory? Read more

Categories: Business intelligence, Data warehousing, Memory-centric data management, Parallelization, Predictive modeling and advanced analytics

2 Comments

September 7, 2012

Integrated internet system design

What are the central challenges in internet system design? We probably all have similar lists, comprising issues such as scale, scale-out, throughput, availability, security, programming ease, UI, or general cost-effectiveness. Screw those up, and you don’t have an internet business.

Much new technology addresses those challenges, with considerable success. But the success is usually one silo at a time — a short-request application here, an analytic database there. When it comes to integration, unsolved problems abound.

The top integration and integration-like challenges for me, from a practical standpoint, are:

Integrating silos — a decades-old problem still with us in a big way.
Dynamic schemas with joins.
Low-latency business intelligence.
Human real-time personalization.

Other concerns that get mentioned include:

Geographical distribution due to privacy laws, which for some users is a major requirement for compliance.
Logical data warehouse, a term that doesn’t actually mean anything real.
In-memory data grids, which some day may no longer always be hand-coupled to the application and data stacks they accelerate.

Let’s skip those latter issues for now, focusing instead on the first four.

Categories: About this blog, Business intelligence, Cache, Clustering, Data integration and middleware, Data warehousing, Database diversity, EAI, EII, ETL, ELT, ETLT, Exadata, NoSQL, OLTP, Oracle, Predictive modeling and advanced analytics, SAP AG, Surveillance and privacy

6 Comments

August 19, 2012

In-database analytics — analytic glossary draft entry

This is a draft entry for the DBMS2 analytic glossary. Please comment with any ideas you have for its improvement!

Note: Words and phrases in italics will be linked to other entries when the glossary is complete.

“In-database analytics” is a catch-all term for analytic capabilities, beyond standard SQL, running on the same machine as and under the management of an analytic DBMS. These can run in one or both of two modes:

In-process or unfenced, i.e. in the same process as the DBMS itself. This option gives maximum performance, but any defects in the analytic code may crash the whole DBMS. Also, it generally requires that the code be in the same language as the DBMS, i.e. C++.
Out-of-process or fenced, i.e. in a separate process. This option sacrifices performance, in favor of reliability and language flexibility.

In-database analytics may offer great performance and scalability advantages versus the alternative of extracting data and having it be processed on a separate server. This is particularly likely to be the case in MPP (Massively Parallel Processing) analytic DBMS environments.

Examples of in-database analytics include:

Creating temporary data structures that persist past the life of a query.
Creating temporary data structures that are non-tabular.
Predictive modeling that uses all the same nodes in an MPP cluster where the data resides.
Predictive analytics (scoring only).

Other common domains for in-database analytics include sessionization, time series analysis, and relationship analytics.

Notable products offering in-database analytics include:

Teradata Aster SQL/MR.
Multiple other analytic platforms, such as Sybase IQ, Vertica, or IBM Netezza. Indeed, in-database analytics are a defining feature of analytic platforms.
Fuzzy Logix (for predictive analytics).

Categories: Analytic glossary, Aster Data, Data warehousing, IBM and DB2, MapReduce, Netezza, Parallelization, Predictive modeling and advanced analytics, Sybase, Teradata, Vertica Systems

8 Comments

August 6, 2012

People’s facility with statistics — extremely difficult to predict

My recent post on broadening the usefulness of statistics presupposed two things about the statistical sophistication of business intelligence tool users:

It varies a lot.
In many cases, it isn’t be very high.

Let me now say a little more on the subject. My basic message is — people’s facility with statistics is extremely difficult to predict.

If you DO have to make a point estimate, however, you could do worse than just putting quotation marks around the last four words of that sentence …

Suppose we measure people’s statistical understanding on a 5-point scale:

People who haven’t clue what a p-value is.
People who think a p-value of .05 signifies a 95% chance of truth.
People who know better than that, but who still think that “statistically significant” is pretty close to the same as “true”.
People who know better yet, but aren’t fluent in using statistical techniques correctly.
People who are fluent in statistics.

Just knowing somebody’s job description, can you confidently predict their ranking to within, say, +/- 1 point? I suggest you can’t. People differ wildly in general numeracy and in specific statistical knowledge.

Even our guesses about average knowledge may be off, not least because education is changing things. Read more

Categories: Predictive modeling and advanced analytics, Scientific research

5 Comments

July 31, 2012

Integrating statistical analysis into business intelligence

Business intelligence tools have been around for two decades.* In that time, many people have had the idea of integrating statistical analysis into classical BI. Yet I can’t think of a single example that was more than a small, niche success.

*Or four decades, if you count predecessor technologies.

The first challenge, I think, lies in the paradigm. Three choices that come to mind are:

“I think I may have discovered a new trend or pattern. Tell me, Mr. BI — is the pattern real?”
“I think I see a trend. Help me, Mr. BI, to launch a statistical analysis.”
“If you want to do some statistical analysis, here’s a link to fire it up.”

But the first of those approaches requires too much intelligence from the software, while the third requires too much numeracy from the users. So only the second option has a reasonable chance to work, and even that one may be hard to pull off unless vendors focus on one vertical market at a time.

The challenges in full automation start: Read more

Categories: Business intelligence, Predictive modeling and advanced analytics

13 Comments

July 28, 2012

Some Vertica 6 features

Vertica 6 was recently announced, and so it seemed like a good time to catch up on Vertica features. The main topics I want to address are:

External tables and the associated new Hadoop connector.
Online schema evolution.
Workload management.

Also:

I have some tidbits to add to my June, 2011 coverage of Vertica’s analytic functionality.
I’ll stand for now on my previous coverage of Vertica’s database organization.

In general, the main themes of Vertica 6 appear to be:

Enterprise/SaaS-friendliness, high uptime, and so on.
Improved analytic usefulness.

Let’s do the analytic functionality first. Notes on that include:

Vertica has extended its user-defined function/analytic procedure/whatever functionality to include user-defined load. (Same SDK, different specific classes.)
One of the languages Vertica supports is R. But for now, parallel R is limited to “Of course, you can run the same functions and procedures on many nodes at once.”
Based on community activity around bugs and so on, it seems there are users for Vertica’s JSON-based Twitter sentiment analysis plug-in.

I’ll also take this opportunity to expand on something I wrote about a few vendors — including Vertica — at the end of my post on approximate query results. When I probed how customers of Vertica and other RDBMS-based analytic platform vendors used vendor-proprietary advanced analytic SQL and other analytic capabilities, answers included: Read more

Categories: Columnar database management, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, Investment research and trading, Predictive modeling and advanced analytics, SQL/Hadoop integration, Vertica Systems, Workload management

2 Comments

May 28, 2012

Quick-turnaround predictive modeling

Last November, I wrote two posts on agile predictive analytics. It’s time to return to the subject. I’m used to KXEN talking about the ability to do predictive modeling, very quickly, perhaps without professional statisticians; that the core of what KXEN does. But I was surprised when Revolution Analytics told me a similar story, based on a different approach, because ordinarily that’s not how R is used at all.

Ultimately, there seem to be three reasons why you’d want quick turnaround on your predictive modeling: Read more

Categories: Business intelligence, Investment research and trading, KXEN, Predictive modeling and advanced analytics, Revolution Analytics, Telecommunications, Web analytics

10 Comments

May 21, 2012

Cool analytic stories

There are several reasons it’s hard to confirm great analytic user stories. First, there aren’t as many jaw-dropping use cases as one might think. For as I wrote about performance, new technology tends to make things better, but not radically so. After all, if its applications are …

… all that bloody important, then probably people have already been making do to get it done as best they can, even in an inferior way.

Further, some of the best stories are hard to confirm; even the famed beer/diapers story isn’t really true. Many application areas are hard to nail down due to confidentiality, especially but not only in such “adversarial” domains as anti-terrorism, anti-spam, or anti-fraud.

Even so, I have two questions in my inbox that boil down to “What are the coolest or most significant analytics stories out there?” So let’s round up some of what I know. Read more

Categories: Analytic technologies, Google, Health care, Investment research and trading, Predictive modeling and advanced analytics, Scientific research, Telecommunications, Web analytics

6 Comments

← Previous Page — Next Page →

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in