Hadoop

Discussion of Hadoop. Related subjects include:

MapReduce
Open source database management systems

October 10, 2013

Aster 6, graph analytics, and BSP

Teradata Aster 6 has been preannounced (beta in Q4, general release in Q1 2014). The general architectural idea is:

There are multiple data stores, the first two of which are:
- The classic Aster relational data store.
- A file system that emulates HDFS (Hadoop Distributed File System).
There are multiple processing “engines”, where an engine is what occupies and controls a processing thread. These start with:
- Generic analytic SQL, as Aster has had all along.
- SQL-MR, the MapReduce Aster has also had all along.
- SQL-Graph aka SQL-GR, a graph analytics system.
The Aster parser and optimizer accept glorified SQL, and work across all the engines combined.

There’s much more, of course, but those are the essential pieces.

Just to be clear: Teradata Aster 6, aka the Teradata Aster Discovery Platform, includes HDFS compatibility, native MapReduce and ways of invoking Hadoop MapReduce on non-Aster nodes or clusters — but even so, you can’t run Hadoop MapReduce within Aster over Aster’s version of HDFS.

The most dramatic immediate additions are in the graph analytics area.* The new SQL-Graph is supported by something called BSP (Bulk Synchronous Parallel). I’ll start by observing (and some of this is confusing):

BSP was thought of a long time ago, as a general-purpose computing model, but recently has come to the fore specifically for graph analytics. (Think Pregel and Giraph, along with Teradata Aster.)
BSP has a kind of execution-graph metaphor, which is different from the graph data it helps analyze.
BSP is described as being a combination hardware/software technology, but Teradata Aster and everybody else I know of implement it in software only.
Aster long ago talked of adding a graph data store, but has given up that plan; rather, it wants you to do graph analytics on data stored in tables (or accessed through views) in the usual way.
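
For readers new to BSP, here is a minimal Python sketch of the superstep idea applied to graph data held as rows of an edge table. It is an illustration only, not Aster's SQL-GR implementation or API: the function name `bsp_connected_components`, the `(src, dst)` row layout, and the choice of connected components as the example problem are all assumptions made for this sketch.

```python
from collections import defaultdict


def bsp_connected_components(edges):
    """Label each vertex with the smallest vertex id in its component.

    `edges` is a list of (src, dst) rows, as if read from an edge table.
    Hypothetical helper for illustration; not part of any product API.
    """
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    # Every vertex starts out labeled with its own id.
    label = {v: v for v in neighbors}

    # Superstep 0: every vertex sends its label to its neighbors.
    inbox = defaultdict(list)
    for v in neighbors:
        for n in neighbors[v]:
            inbox[n].append(label[v])

    supersteps = 1
    while inbox:
        outbox = defaultdict(list)
        # Compute phase: a vertex looks only at its own state and at the
        # messages that were delivered at the previous barrier.
        for v, messages in inbox.items():
            best = min(messages)
            if best < label[v]:
                label[v] = best
                # Communication phase: queue messages for the next superstep.
                for n in neighbors[v]:
                    outbox[n].append(best)
        # Barrier: queued messages become visible only once every vertex
        # has finished the current superstep.
        inbox = outbox
        supersteps += 1
    return label, supersteps


if __name__ == "__main__":
    labels, steps = bsp_connected_components([(1, 2), (2, 3), (4, 5)])
    print(labels, steps)
```

The point of the sketch is the shape of the loop: local computation, message exchange, then a global barrier, repeated until no vertex has anything left to send. A real system would partition the vertices across workers and run each compute phase in parallel.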

Use cases suggested are a lot of marketing, plus anti-fraud.

*Pay no attention to Aster’s previous claims to do a good job on graph — and not only via nPath — in SQL-MR.

So far as I can infer from examples I’ve seen, the semantics of Teradata Aster SQL-Graph start:

Ordinary SQL except in the FROM clause.
Functions/operators that are the arguments for FROM; of course, they output tables. You can write these yourself, or use Teradata Aster’s prebuilt ones.

Within those functions, the core idea is: Read more

Categories: Application areas, Aster Data, Business intelligence, Data models and architecture, Data warehousing, Hadoop, Parallelization, Predictive modeling and advanced analytics, RDF and graphs, Teradata

September 29, 2013

ClearStory, Spark, and Storm

ClearStory Data is:

One of the two start-ups I’m most closely engaged with.
Run by a CEO for whom I have great regard, but who does get rather annoying about secrecy. 🙂
On the verge, finally, of fully destealthing.

I think I can do an interesting post about ClearStory while tap-dancing around the still-secret stuff, so let’s dive in.

ClearStory:

Has developed a full-stack business intelligence technology — which will however be given a snazzier name than “BI” — that is focused on incorporating a broad variety of third-party information, usually along with some of the customer’s own data. Thus, ClearStory …
… pushes Variety and Variability to extremes, more so than it stresses Volume and Velocity. But it does want to be used at interactive/memory-centric speeds.
Has put a lot of effort into user interface, but in ways that fit my theory that UI is more about navigation than actual display.
Has much of its technical differentiation in the area of data mustering …
… and much of the rest in DBMS-like engineering.
Is a flagship user of Spark.
Also relies on Storm, HDFS (Hadoop Distributed File System) and various lesser open source projects (e.g. the ubiquitous Zookeeper).
Is to a large extent written in Scala.
Is at this time strictly a multi-tenant SaaS (Software as a Service) offering, except insofar as there’s an on-premises agent to help feed customers’ own data into the core ClearStory cloud service.

To a first approximation, ClearStory ingests data in a system built on Storm (code name: Stormy), dumps it into HDFS, and then operates on it in a system built on Spark (code name: Sparky). Along the way there’s a lot of interaction with another big part of the system, a metadata catalog with no code name I know of. Or as I keep it straight:

ClearStory’s end-user UI talks mainly to Sparky, and also to the metadata store.
ClearStory’s administrative UI talks mainly to Stormy, and also to the metadata store.

Categories: ClearStory Data, Cloud computing, Data integration and middleware, Data models and architecture, Databricks, Spark and BDAS, Derived data, EAI, EII, ETL, ELT, ETLT, Hadoop, Memory-centric data management, Software as a Service (SaaS)

September 8, 2013

Layering of database technology & DBMS with multiple DMLs

Two subjects in one post, because they were too hard to separate from each other.

Any sufficiently complex software is developed in modules and subsystems. DBMS are no exception; the core trinity of parser, optimizer/planner, and execution engine merely starts the discussion. But increasingly, database technology is layered in a more fundamental way as well, to the extent that different parts of what would seem to be an integrated DBMS can sometimes be developed by separate vendors.

Major examples of this trend — where by “major” I mean “spanning a lot of different vendors or projects” — include:

The object/relational, aka universal, extensibility features developed in the 1990s for Oracle, DB2, Informix, Illustra, and Postgres. The most successful extensions probably have been:
- Geospatial indexing via ESRI.
- Full-text indexing, notwithstanding questionable features and performance.
MySQL storage engines.
MPP (Massively Parallel Processing) analytic RDBMS relying on single-node PostgreSQL, Ingres, and/or Microsoft SQL Server — e.g. Greenplum (especially early on), Aster (ditto), DATAllegro, DATAllegro’s offspring Microsoft PDW (Parallel Data Warehouse), or Hadapt.
Splits in which a DBMS has serious processing both in a “database” layer and in a predicate-pushdown “storage” layer — most famously Oracle Exadata, but also MarkLogic, InfiniDB, and others.
SQL-on-HDFS — Hive, Impala, Stinger, Shark and so on (including Hadapt).
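
The "database layer plus predicate-pushdown storage layer" split in the list above can be illustrated with a toy Python sketch. Everything in it is hypothetical (the `StorageLayer` class, the `rows_shipped` counter, and the two `total_*` functions are names invented here), and it says nothing about how Exadata, MarkLogic, InfiniDB or any other product is actually built. It only shows why evaluating a filter in the storage layer can reduce the data handed up to the database layer.

```python
class StorageLayer:
    """Toy storage layer that can optionally apply a filter while scanning."""

    def __init__(self, rows):
        self.rows = rows
        self.rows_shipped = 0  # rows handed up to the database layer

    def scan(self, predicate=None):
        for row in self.rows:
            # With pushdown, non-matching rows never leave this layer.
            if predicate is None or predicate(row):
                self.rows_shipped += 1
                yield row


def total_without_pushdown(storage, predicate, column):
    # The database layer receives every row and filters them itself.
    return sum(row[column] for row in storage.scan() if predicate(row))


def total_with_pushdown(storage, predicate, column):
    # The predicate is handed down; only matching rows come back.
    return sum(row[column] for row in storage.scan(predicate))


if __name__ == "__main__":
    rows = [
        {"region": "east", "amount": 10},
        {"region": "west", "amount": 5},
        {"region": "east", "amount": 7},
    ]

    def is_east(row):
        return row["region"] == "east"

    plain, pushed = StorageLayer(rows), StorageLayer(rows)
    print(total_without_pushdown(plain, is_east, "amount"), plain.rows_shipped)
    print(total_with_pushdown(pushed, is_east, "amount"), pushed.rows_shipped)
```

Both functions return the same total. The difference is in how many rows cross the boundary between the two layers, which is the quantity such architectures try to minimize.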

Other examples on my mind include:

Data manipulation APIs being added to key-value stores such as Couchbase and Aerospike.
TokuMX, the Tokutek/MongoDB hybrid I just blogged about.
NuoDB’s willing reliance on third-party key-value stores (or HDFS in the role of one).
FoundationDB’s strategy, and specifically its acquisition of Akiban.

And there are several others I hope to blog about soon, e.g. current-day PostgreSQL.

In an overlapping trend, DBMS increasingly have multiple data manipulation APIs. Examples include: Read more

Categories: Aerospike, Akiban, Aster Data, Cache, Calpont, Cloudera, Data models and architecture, Database diversity, Databricks, Spark and BDAS, DATAllegro, Derived data, Greenplum, Hadapt, Hadoop, JPMorgan Chase, NoSQL, NuoDB, Parallelization, Solid-state memory, SQL/Hadoop integration, Structured documents, Text

August 25, 2013

Cloudera Hadoop strategy and usage notes

When we scheduled a call to talk about Sentry, Cloudera’s Charles Zedlewski and I found time to discuss other stuff as well. One interesting part of our discussion was around the processing “frameworks” Cloudera sees as most important.

The four biggies are:
- MapReduce. Duh.
- SQL, specifically Impala. This is as opposed to the uneasy Hive/MapReduce layering.
- Search.
- “Math”, which seems to mainly be through partnerships with SAS and Revolution Analytics. I don’t know a lot about how these work, but I presume they bypass MapReduce, in which case I could imagine them greatly outperforming Mahout.
Stream processing (Storm) is next in line.
Graph — e.g. Giraph — rises to at least the proof-of-concept level. Again, the hope would be that this well outperforms graph-on-MapReduce.
Charles is also seeing at least POC interest in Spark.
But MPI (Message Passing Interface) on Hadoop isn’t going anywhere fast, except to the extent it’s baked into SAS or other “math” frameworks. Generic MPI use cases evidently turn out to be a bad fit for Hadoop, due to factors such as:
- Low data volumes.
- Latencies in various parts of the system.

HBase was artificially omitted from this “frameworks” discussion because Cloudera sees it as a little bit more of a “storage” system than a processing one.

Another good subject was offloading work to Hadoop, in a couple different senses of “offload”: Read more

Categories: Cloudera, Databricks, Spark and BDAS, Endeca, Hadoop, HP and Neoview, MapReduce, Predictive modeling and advanced analytics, RDF and graphs, Revolution Analytics, SAS Institute, Streaming and complex event processing (CEP), Teradata

August 25, 2013

Cloudera Sentry and other security subjects

I chatted with Charles Zedlewski of Cloudera on Thursday about security — especially Cloudera’s new offering Sentry — and other Hadoop subjects.

Sentry is:

Developed by Cloudera.
An Apache incubator project.
Slated to be rolled into CDH — Cloudera’s Hadoop distribution — over the next couple of weeks.
Only useful with Hive in Version 1, but planned to also work in the future with other Hadoop data access systems such as Pig, search and so on.
Lacking in administrative scalability in Version 1, something that is also slated to be fixed in future releases.

Apparently, Hadoop security options pre-Sentry boil down to:

Kerberos, which only works down to directory or file levels of granularity.
Third-party products.
Roll-your-own.

Sentry adds role-based permissions for SQL access to Hadoop:

By server.
By database.
By table.
By view.

for a variety of actions — selections, transformations, schema changes, etc. Sentry does this by examining a query plan and checking whether each step in the plan is permissible. Read more
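
The plan-checking idea described above can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not Sentry's actual policy format or API: the `Grant` and `PlanStep` classes, the action names, and the rule that a grant on a server or database covers everything beneath it are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    role: str
    action: str   # e.g. "select", "insert", "alter" (hypothetical names)
    scope: tuple  # (server,), (server, db) or (server, db, table_or_view)


@dataclass(frozen=True)
class PlanStep:
    action: str
    target: tuple  # fully qualified: (server, db, table_or_view)


def step_allowed(step, roles, grants):
    """A step passes if some grant held by the user covers its target."""
    for grant in grants:
        covers = step.target[: len(grant.scope)] == grant.scope
        if grant.role in roles and grant.action == step.action and covers:
            return True
    return False


def denied_steps(plan, roles, grants):
    """Return the plan steps that are not permitted; empty means run it."""
    return [step for step in plan if not step_allowed(step, roles, grants)]


if __name__ == "__main__":
    grants = [
        Grant("analyst", "select", ("srv1", "sales")),
        Grant("etl", "insert", ("srv1", "sales", "orders")),
    ]
    plan = [
        PlanStep("select", ("srv1", "sales", "orders")),
        PlanStep("insert", ("srv1", "sales", "orders")),
    ]
    print(denied_steps(plan, {"analyst"}, grants))
```

In this sketch a query is rejected if any single step of its plan is denied, which mirrors the "check each step" description. How a real system treats views, wildcards, or inherited privileges is outside what the sketch assumes.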

Categories: Cloudera, Hadoop, IBM and DB2, Oracle

August 24, 2013

Hortonworks business notes

Hortonworks did a business-oriented round of outreach, talking with at least Derrick Harris and me. Notes from my call — for which Rob Bearden didn’t bother showing up — include, in no particular order:

Hortonworks denies advanced acquisition discussions with either Microsoft or Intel. Of course, that doesn’t exactly contradict the widespread story of Intel having made an acquisition offer. Edit: I have subsequently heard, very credibly, that the denial was untrue.
As vendors usually do, Hortonworks denies the extreme forms of Cloudera’s suggestion that Hortonworks’ competitive wins relate to price slashing. But Hortonworks does believe that its license fees often wind up being lower than Cloudera’s, due especially to Hortonworks offering fewer extra-charge items than Cloudera.
Hortonworks used a figure of ~75 subscription customers. (Edit: That figure turns out in retrospect to have been inflated.) The count does not include OEM sales through, for example, Teradata, Microsoft Azure, or Rackspace. However, it does include …
… a small number of installations hosted in the cloud — e.g. ~2 on Amazon Web Services — or otherwise remotely. Also, testing in the cloud seems to be fairly frequent, and the cloud can also be a source of data ingested into Hadoop.
Since Hortonworks a couple of times made it seem that Rackspace was an important partner, behind only Teradata and Microsoft, I finally asked why. Answers boiled down to a Rackspace Hadoop-as-a-service offering, plus joint work to improve Hadoop-on-OpenStack.
Other Hortonworks reseller partners seem more important in terms of helping customers consume HDP (Hortonworks Data Platform), rather than for actually doing Hortonworks’ selling for it. (This is unsurprising — channel sales rarely are a path to success for a product that is also appropriately sold by a direct force.)
Hortonworks listed its major industry sectors as:
- Web and retailing, which it identifies as one thing.
- Media.
- Telecommunications.
- Health care (various subsectors).
- Financial services, which it called “competitive” in the kind of tone that usually signifies “we lose a lot more than we win, and would love to change that”.

In Hortonworks’ view, Hadoop adopters typically start with a specific use case around a new type of data, such as clickstream, sensor, server log, geolocation, or social. Read more

Categories: Cloud computing, Cloudera, EAI, EII, ETL, ELT, ETLT, GIS and geospatial, Hadoop, Health care, Hortonworks, Log analysis, Market share and customer counts, Microsoft and SQL*Server, Open source, Petabyte-scale data management, Pricing, Telecommunications, Teradata, Text, Web analytics, Yahoo

August 12, 2013

Things I keep needing to say

Some subjects just keep coming up. And so I keep saying things like:

Most generalizations about “Big Data” are false. “Big Data” is a horrific catch-all term, with many different meanings.

Most generalizations about Hadoop are false. Reasons include:

Hadoop is a collection of disparate things, most particularly data storage and application execution systems.
The transition from Hadoop 1 to Hadoop 2 will be drastic.
For key aspects of Hadoop — especially file format and execution engine — there are or will be widely varied options.

Hadoop won’t soon replace relational data warehouses, if indeed it ever does. SQL-on-Hadoop is still very immature. And you can’t replace data warehouses unless you have the power of SQL.

Note: SQL isn’t the only way to provide “the power of SQL”, but alternative approaches are just as immature.

Most generalizations about NoSQL are false. Different NoSQL products are … different. It’s not even accurate to say that all NoSQL systems lack SQL interfaces. (For example, SQL-on-Hadoop often includes SQL-on-HBase.)

Categories: Actian and Ingres, Amazon and its cloud, Benchmarks and POCs, Business intelligence, Cloud computing, Columnar database management, Data warehouse appliances, Data warehousing, Hadoop, HBase, In-memory DBMS, Infobright, Market share and customer counts, NoSQL, OLTP, ParAccel, Pricing, SAP AG, Sybase, Vertica Systems

August 8, 2013

Curt Monash on video

I made a remarkably rumpled video appearance yesterday with SiliconAngle honchos John Furrier and Dave Vellante. (Excuses include <3 hours sleep, and then a scrambling reaction to a schedule change.) Topics covered included, with approximate timechecks:

0:00 Introductory pabulum, and some technical difficulties
2:00 More introduction
3:00 Dynamic schemas and data model churn
6:00 Surveillance and privacy
13:00 Hadoop, especially the distro wars
22:00 BI innovation
23:30 More on dynamic schemas and data model churn

Edit: Some of my remarks were transcribed.

Related links

I posted on dynamic schemas and data model churn a few days ago.
I capped off a series on privacy and surveillance a few days ago.
I commented on various Hadoop distributions in June.

Categories: Business intelligence, ClearStory Data, Data warehousing, Hadoop, MapR, MapReduce, Surveillance and privacy

Hortonworks, Hadoop, Stinger and Hive

I chatted yesterday with the Hortonworks gang. The main subject was Hortonworks’ approach to SQL-on-Hadoop — commonly called Stinger — but at my request we cycled through a bunch of other topics as well. Company-specific notes include:

Hortonworks founder J. Eric “Eric14” Baldeschwieler is no longer at Hortonworks, although I imagine he stays closely in touch. What he’s doing next is unspecified, except by the general phrase “his own thing”. (Derrick Harris has more on Eric’s departure.)
John Kreisa still is at Hortonworks, just not as marketing VP. Think instead of partnerships and projects.
~250 employees.
~70-75 subscription customers.

Our deployment and use case discussions were a little confused, because a key part of Hortonworks’ strategy is to support and encourage the idea of combining use cases and workloads on a single cluster. But I did hear:

10ish nodes for a typical starting cluster.
100ish nodes for a typical “data lake” committed adoption.
Teradata UDA (Unified Data Architecture)* customers sometimes (typically?) jumping straight to a data lake scenario.
A few users in the 10s of 1000s of nodes. (Obviously Yahoo is one.)
HBase used in >50% of installations.
Hive probably even more than that.
Hortonworks is seeing a fair amount of interest in Windows Hadoop deployments.

*By the way — Teradata seems serious about pushing the UDA as a core message.

Ecosystem notes, in Hortonworks’ perception, included:

Cloudera is obviously Hortonworks’ biggest distro competitor. Next is IBM, presumably in its blue-forever installed base. MapR is barely on the radar screen; Pivotal’s likely rise hasn’t yet hit sales reports.
Hortonworks evidently sees a lot of MicroStrategy and Tableau, and some Platfora and Datameer, the latter two at around the same level of interest.
Accumulo is a big deal in the Federal government, and has gotten a few health care wins as well. Its success is all about security. (Note: That’s all consistent with what I hear elsewhere.)

I also asked specifically about OpenStack. Hortonworks is a member of the OpenStack project, contributes nontrivially to Swift and other subprojects, and sees Rackspace as an important partner. But despite all that, I think strong Hadoop/OpenStack integration is something for the indefinite future.

Hortonworks’ views about Hadoop 2.0 start from the premise that its goal is to support running a multitude of workloads on a single cluster. (See, for example, what I previously posted about Tez and YARN.) Timing notes for Hadoop 2.0 include:

It’s been in preview/release candidate/commercial beta mode for weeks.
Q3 is the goal; H2 is the emphatic goal.
Yahoo’s been in production with YARN >8 months, and has no MapReduce 1 clusters left. (Yahoo has >35,000 Hadoop nodes.)
The last months of delays have been mainly about sprucing up various APIs and protocols, which may need to serve for a multi-year period, as Hadoop 1’s have. But there also was some YARN stabilization into May.

Frankly, I think Cloudera’s earlier and necessarily incremental Hadoop 2 rollout was a better choice than Hortonworks’ later big bang, even though the core-mission aspect of Hadoop 2.0 is what was least ready. HDFS (Hadoop Distributed File System) performance, NameNode failover and so on were well worth having, and more than a year will have passed between Cloudera starting to support them and Hortonworks offering Hadoop 2.0.

Hortonworks’ approach to doing SQL-on-Hadoop can be summarized simply as “Make Hive into as good an analytic RDBMS as possible, all in open source”. Key elements include: Read more

Categories: Cloudera, Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Datameer, Facebook, Greenplum, Hadoop, Hortonworks, IBM and DB2, MapR, MapReduce, Market share and customer counts, Microsoft and SQL*Server, MicroStrategy, Open source, Petabyte-scale data management, Solid-state memory, SQL/Hadoop integration, Tableau Software, Teradata, Yahoo

July 31, 2013

“Disruption” in the software industry

I lampoon the word “disruptive” for being badly overused. On the other hand, I often refer to the concept myself. Perhaps I should clarify. 🙂

You probably know that the modern concept of disruption comes from Clayton Christensen, specifically in The Innovator’s Dilemma and its sequel, The Innovator’s Solution. The basic ideas are:

Market leaders serve high-end customers with complex, high-end products and services, often distributed through a costly sales channel.
Upstarts serve a different market segment, often cheaply and/or simply, perhaps with a different business model (e.g. a different sales channel).
Upstarts expand their offerings, and eventually attack the leaders in their core markets.

In response (this is the Innovator’s Solution part):

Leaders expand their product lines, increasing the value of their offerings in their core markets.
In particular, leaders expand into adjacent market segments, capturing margins and value even if their historical core businesses are commoditized.
Leaders may also diversify into direct competition with the upstarts, but that generally works only if it’s via a separate division, perhaps acquired, that has permission to compete hard with the main business.

But not all cleverness is “disruption”.

Routine product advancement by leaders — even when it’s admirably clever — is “sustaining” innovation, as opposed to the disruptive stuff.
Innovative new technology from small companies is not, in itself, disruption either.

Here are some of the examples that make me think of the whole subject. Read more

Categories: Business intelligence, Data warehousing, Hadoop, Microsoft and SQL*Server, MongoDB, MySQL, Netezza, NewSQL, NoSQL, Oracle, Predictive modeling and advanced analytics, QlikTech and QlikView, Tableau Software
