Application areas
Posts focusing on the use of database and analytic technologies in specific application domains.
Netezza Q1 earnings call transcript
I finally read the Netezza Q1 earnings call transcript, put out by Seeking Alpha. Highlights included:
- Netezza got 14 new-name accounts and 21 follow-on deals. Average sale in both groups was right around $1 million.
- The economy is tough, deals are slipping, and nobody knows for sure what will happen.
- Netezza’s main head-to-head competitors are Oracle and Teradata. Netezza claims good but not perfect win rates against each, but concedes that those vendors (especially Oracle) of course get other deals Netezza never sees.
- Netezza characterizes Teradata as offering its multiple product lines, trying to upsell many customers from cheaper to more expensive product lines, and being selectively aggressive about pricing. None of this is surprising to me.
- 80% of Netezza’s Q1 revenue, and perhaps even a higher fraction of new-name accounts, was in four vertical markets: “Digital media,” telecom, government, and financial services.
- Sometime over the next few months, Netezza will give at least some more clarity about future products.
One tip for the Netezza folks, by the way, from this former stock analyst — you should never use the word “certainly” about a deal you haven’t closed yet. “Almost surely” could be OK, but “certainly” — well, it certainly was not the thing to say.
Greenplum blogs about some customers
I’ve written some about Greenplum’s customers at eBay and Fox Interactive Media. But as I recently grumped, I’m not in the mood right now to write much about other Greenplum customers. Fortunately, Greenplum has filled the gap itself. Marketing chief Paul Salazar just blogged about a number of other big Greenplum customers. And last month Paul blogged in considerable detail about what he characterizes as an enterprise data warehouse (EDW) conversion — Oracle replacement — at a large pharmaceutical company.
Categories: Application areas, Data warehousing, Greenplum, Oracle
More on Fox Interactive Media’s use of Greenplum
Greenplum’s most important reference is probably its energetic advocate Fox Interactive Media, even ahead of much larger Greenplum user eBay, and notwithstanding Aster Data’s large presence in Fox subsidiary MySpace. I just ran across a “review” of Greenplum by FIM’s Brian Dolan, neatly summarizing his views about Greenplum’s strengths, weaknesses, and uses inside Fox. Highlights include: Read more
Categories: Data warehousing, Fox and MySpace, Greenplum, Web analytics
Sneakernet to the cloud
Recently, Amazon CTO Werner Vogels put up a blog post which suggested that, now and in the future, the best way to get large databases into the cloud is via sneakernet. In some circumstances, he is surely right. Possible implications include:
- When sending data to the cloud, you probably want to compress it to the max before sending. Clearpace’s new RainStor structured-data archiving service emphasizes that idea. RainStor marketing says cloud, cloud, cloud — but Clearpace thinks you really should have a bit of its software onsite too, to compress the data before sending it across the wire. (A generic sketch of the idea appears below.)
- Getting data from one cloud to another cloud could be problematic. I’m fond of saying that weblog data naturally lives in the cloud at your hosting company’s location, so you should analyze it there too. But this makes the most sense if you analyze it or at least filter/reduce it in place. (That said, the really, really big web companies have lots of different data centers, and presumably do move huge amounts of log data from place to place.)
But for one-time moves of data sets — sure, sneakernet/snail mail should work just fine.
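To make that compress-before-you-ship point a bit more concrete, here is a minimal sketch in generic Python, assuming nothing about Clearpace's or anybody else's actual software, and with invented file names. A purpose-built structured-data compressor would of course do far better than gzip; the point is just the workflow of squeezing the data down onsite before it ever touches the wire or the shipping box.

```python
import gzip
import shutil

def compress_for_shipping(src_path: str, dest_path: str) -> None:
    """Gzip a local data extract before it goes onto the drive you ship.

    gzip is just a stand-in here; a purpose-built structured-data compressor
    (the sort of thing Clearpace keeps onsite) would shrink the data far more.
    """
    with open(src_path, "rb") as src, gzip.open(dest_path, "wb", compresslevel=9) as dest:
        shutil.copyfileobj(src, dest)

# Hypothetical usage, with invented file names:
# compress_for_shipping("weblogs_2009_05.csv", "weblogs_2009_05.csv.gz")
```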
Categories: Amazon and its cloud, Cloud computing, Database compression, EAI, EII, ETL, ELT, ETLT, Web analytics
Followup on IBM System S/InfoSphere Streams
After posting about IBM’s System S/InfoSphere Streams CEP offering, I sent three followup questions over to Jeff Jones. It seems simplest to just post the Q&A verbatim.
1. Just how many processors or cores does it take to get those 5 million messages/sec through? A little birdie says 4,000 cores. Read more
Categories: Analytic technologies, IBM and DB2, Investment research and trading, Streaming and complex event processing (CEP)
Microsoft announced CEP this week too
Microsoft still hasn’t worked out all the kinks regarding when and how intensely to brief me. So most of what I know about their announcement earlier this week of a CEP/stream processing product* is what I garnered on a consulting call in March. That said, I sent Microsoft my notes from that call, they responded quickly and clearly to my question as to what remained under NDA, and for good measure they included a couple of clarifying comments that I’ll copy below.
*”in the SQL Server 2008 R2 timeframe,” about which Microsoft wrote “the first Community Technology Preview (CTP) of SQL Server 2008 R2 will be available for download in the second half of 2009 and the release is on track to ship in the first half of calendar year 2010.”
Perhaps it is more than coincidence that IBM rushed out its own announcement of an immature CEP technology — due to be more mature in a 2010 release — immediately after Microsoft revealed its plans. Anyhow, taken together, these announcements support my theory that the small independent CEP/stream processing vendors are more or less ceding broad parts of the potential stream processing market.
The main use cases Microsoft talks about for CEP are in the area of sensor data. Read more
Categories: Analytic technologies, Application areas, Microsoft and SQL*Server, Streaming and complex event processing (CEP)
IBM System S Streams, aka InfoSphere Streams, aka stream processing, aka “please don’t call it CEP”
IBM has hastily announced System S Streams, a product that was supposed to be called InfoSphere Streams and introduced only in 2010. Apparently, the rush is because senior management wanted to talk about it later this week, and perhaps also because it was implicitly baked into some of IBM’s advertising already. Scrambling ensued. Even so, Jeff Jones and team got to me fast, and briefed me — fairly non-technically, unfortunately, but otherwise how I like it, namely on a harmless embargo* and without any NDAs. That’s more than can be said for my clients at Microsoft, who also introduced CEP this week, but I digress …
*Indeed, as I draft this post-Celtics-game, the embargo has already expired.
Marketing aside, IBM System S/InfoSphere Streams is indeed a CEP/stream processing engine + language (with an Eclipse-based development environment). Apparently, IBM thinks InfoSphere Streams (if that’s what it winds up being renamed to) is or will be differentiated from other CEP packages in:
- Scale-out. (That’s the one that appears to be real today. In fact, there’s a prototype running on Blue Gene.)
- Support for complex datatypes such as XML, text, voice, video, etc.
- Security and general industrial-strengthness.
Categories: Analytic technologies, Application areas, IBM and DB2, Investment research and trading, Scientific research, Streaming and complex event processing (CEP)
Facebook, Hadoop, and Hive
A few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of Cloudera, in which he discussed a Hadoop-based effort at Facebook he previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon and in a couple of instances correct what Jeff had said. They also filled me in on Hive, a data-manipulation add-on to Hadoop that they developed and subsequently open-sourced.
Updating the metrics in my Cloudera post:
- Facebook has 400 terabytes of disk managed by Hadoop/Hive, with a slightly better than 6:1 overall compression ratio. So the 2 1/2 petabytes figure for user data is reasonable. (Back-of-envelope arithmetic below.)
- Facebook’s Hadoop/Hive system ingests 15 terabytes of new data per day now, not 10.
- Hadoop/Hive cycle times aren’t as fast as I thought I heard from Jeff. Ad targeting queries are the most frequent, and they’re run hourly. Dashboards are repopulated daily.
Nothing else in my Cloudera post was called out as being wrong.
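As a quick sanity check on those figures, here is the back-of-envelope arithmetic, which is mine rather than Facebook's, written out in Python for concreteness:

```python
# Back-of-envelope check on the Facebook figures above (my arithmetic, not Facebook's)
disk_tb = 400        # terabytes of disk under Hadoop/Hive
compression = 6.25   # "slightly better than 6:1"
user_data_pb = disk_tb * compression / 1000
print(f"Implied user data: ~{user_data_pb:.1f} petabytes")  # ~2.5 PB, consistent with the earlier post
```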
In a new-to-me metric, Facebook has 610 Hadoop nodes, running in a single cluster, due to be increased to 1000 soon. Facebook thinks this is the second-largest* Hadoop installation, or else close to it. What’s more, Facebook believes it is unusual in spreading all its apps across a single huge cluster, rather than doing different kinds of work on different, smaller sub-clusters. Read more
Categories: Data warehousing, EAI, EII, ETL, ELT, ETLT, Facebook, Hadoop, MapReduce, Parallelization, Petabyte-scale data management, Specific users, Web analytics, Yahoo
37 Ways To Get More From Analytics, Version 2.0
As I hoped, there were some very helpful responses to my post listing ways to improve analytic effectiveness. Here’s a second draft incorporating them. Comments continue to be very welcome. I need to finalize this soon. Read more
Categories: Analytic technologies, Business intelligence, Data warehousing, Presentations, Web analytics
eBay’s two enormous data warehouses
A few weeks ago, I had the chance to visit eBay, meet briefly with Oliver Ratzesberger and his team, and then catch up later with Oliver for dinner. I’ve already alluded to those discussions in a couple of posts, specifically on MapReduce (which eBay doesn’t like) and the astonishingly great difference between high- and low-end disk drives (to which eBay clued me in). Now I’m finally getting around to writing about the core of what we discussed, which is two of the very largest data warehouses in the world.
Metrics on eBay’s main Teradata data warehouse include:
- >2 petabytes of user data
- 10s of 1000s of users
- Millions of queries per day
- 72 nodes
- >140 GB/sec of I/O, or 2 GB/node/sec (though that may be a peak figure for scan-heavy workloads)
- 100s of production databases being fed in
Metrics on eBay’s Greenplum data warehouse (or, if you like, data mart) include:
- 6 1/2 petabytes of user data
- 17 trillion records
- 150 billion new records/day, which seems to suggest an ingest rate well over 50 terabytes/day (back-of-envelope arithmetic at the end of this post)
- 96 nodes
- 200 MB/node/sec of I/O (that’s the order of magnitude difference that triggered my post on disk drives)
- 4.5 petabytes of storage
- 70% compression
- A small number of concurrent users
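For the curious, here is the rough arithmetic behind a couple of those bullets. The estimates are mine, derived only from the figures above; in particular, the bytes-per-record number is just total user data divided by total records.

```python
# Rough arithmetic behind the eBay figures above (my estimates, not eBay's)

# Teradata system: >140 GB/sec of I/O across 72 nodes
teradata_gb_per_node = 140 / 72
print(f"Teradata I/O per node: ~{teradata_gb_per_node:.1f} GB/sec")  # roughly the 2 GB/node/sec cited above

# Greenplum system: 200 MB/node/sec, i.e. about an order of magnitude less per node
greenplum_gb_per_node = 0.2
print(f"Per-node I/O ratio: ~{teradata_gb_per_node / greenplum_gb_per_node:.0f}x")

# Ingest check: 6.5 petabytes / 17 trillion records suggests roughly 380 bytes/record,
# so 150 billion new records/day works out to well over 50 terabytes/day
bytes_per_record = 6.5e15 / 17e12
daily_ingest_tb = 150e9 * bytes_per_record / 1e12
print(f"Implied ingest: ~{daily_ingest_tb:.0f} TB/day")
```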