July 2, 2013
Notes and comments, July 2, 2013
I’m not having a productive week, part of the reason being a hard drive crash that took out early drafts of what were to be last weekend’s blog posts. Now I’m operating from a laptop, rather than my preferred dual-monitor set-up. So please pardon me if I’m even more terse than usual.
- My recent posts based on surveillance news have been partly superseded by – well, by more news. Some of that news, along with some good discussion, may be found in the comment threads.
- The same goes for my recent Hadoop posts.
- The replay of my recent webinar on real-time analytics is now available. My part ran under 25 minutes.
- One of my numerous clients using or considering a “real-time analytics” positioning is Sqrrl, the company behind the NoSQL DBMS Accumulo. Last month, Derrick Harris reported on a remarkable Accumulo success story – multiple US intelligence instances managing tens of petabytes each, and supporting a variety of analytic (I think mainly query/visualization) approaches.
- Several sources have told me that MemSQL’s Zynga sale is (in part) for Membase replacement. This is noteworthy because Zynga was the original pay-for-some-of-the-development Membase customer.
- More generally, the buzz out of Couchbase is distressing. Ex-employees berate the place; job-seekers check around and then decide not to go there; rivals tell me of resumes coming out in droves. Yes, there’s always some of that, even at obviously prospering companies, but this feels like more than the inevitable low-level buzz one hears anywhere.
- I think the predictive modeling state of the art has become:
- Cluster in some way.
- Model separately on each cluster.
- And if you still want to do something that looks like a regression – linear or otherwise – then you might want to use a tool that lets you shovel training data in WITHOUT a whole lot of preparation* and receive a model back out. Even if you don’t accept that as your final model, it can at least be a great guide to feature selection (in the statistical sense of the phrase) and the like. (A minimal sketch of the cluster-then-model recipe appears after these notes.)
- Champion/challenger model testing is also a good idea, at least if you’re in some kind of personalization/recommendation space and have enough traffic to test that way.** (A toy traffic-split sketch also follows below.)
- Most companies have significant turnover after being acquired, perhaps after a “golden handcuff” period. Vertica is no longer an exception.
- Speaking of my clients at HP Vertica – they’ve done a questionable job of communicating that they’re willing to price their product quite reasonably. (But at least they allowed me to write about $2K/terabyte for hardware/software combined.)
- I’m hearing a little more Amazon Redshift buzz than I expected to. Just a little.
- StreamBase was bought by TIBCO. The rumor says $40 million.
*Basic and unavoidable ETL (Extract/Transform/Load) of course excepted.
**I could call that ABC (Always Be Comparing) or ABT (Always Be Testing), but they each sound like – well, like The Glove and the Lions.
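To make the cluster-then-model recipe above concrete, here is a minimal sketch using scikit-learn. The synthetic data, the choice of k-means, the value of k, and plain linear regression as the per-cluster model are all my illustrative assumptions, not any particular vendor’s pipeline.

```python
# Minimal sketch of "cluster first, then model separately on each cluster".
# Assumptions (not from the post): scikit-learn, synthetic data, k-means,
# and plain linear regression as the per-cluster model.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))  # 1,000 rows, 5 features (synthetic)
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=1000)

# Step 1: cluster in some way -- here, k-means with an arbitrary k=3.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: fit a separate model on each cluster.
models = {}
for c in np.unique(clusters):
    mask = clusters == c
    models[c] = LinearRegression().fit(X[mask], y[mask])

# Scoring a new row: assign it to a cluster, then use that cluster's model.
# The per-cluster coefficients also serve as a rough guide to feature selection.
for c, m in models.items():
    print(f"cluster {c}: n={np.sum(clusters == c)}, coef={np.round(m.coef_, 2)}")
```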
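And for the champion/challenger point, a toy sketch of the traffic-split mechanics. The model names, scoring stubs, and 10% split are made up for illustration; a real system would log eventual outcomes per arm and apply a significance test before promoting the challenger.

```python
# Toy sketch of champion/challenger testing: route a small, random slice of
# traffic to the challenger model and compare results. All names and the
# 10% split are illustrative assumptions, not from the post.
import random

def champion_score(user):    # stand-in for the incumbent model
    return 0.5

def challenger_score(user):  # stand-in for the candidate model
    return 0.6

CHALLENGER_SHARE = 0.10      # fraction of traffic sent to the challenger

def serve(user, outcomes):
    if random.random() < CHALLENGER_SHARE:
        arm, score = "challenger", challenger_score(user)
    else:
        arm, score = "champion", champion_score(user)
    # A real system would record the eventual outcome (click, purchase, ...)
    # keyed by arm, then compare conversion rates with a significance test.
    outcomes.setdefault(arm, []).append(score)
    return arm, score

outcomes = {}
for user_id in range(10_000):
    serve(user_id, outcomes)

for arm, scores in outcomes.items():
    print(arm, len(scores), round(sum(scores) / len(scores), 3))
```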
Categories: About this blog, Amazon and its cloud, Couchbase, Data warehousing, Hadoop, HP and Neoview, Market share and customer counts, MemSQL, NoSQL, Petabyte-scale data management, Predictive modeling and advanced analytics, Pricing, StreamBase, Surveillance and privacy, Vertica Systems, Zynga
Comments
7 Responses to “Notes and comments, July 2, 2013”
Vertica for $2K/terabyte for hardware/software combined? Really? What kind of hardware are you talking about given that bare SAS drives cost $500/GB or more (and over $1,000/GB in RAID1)?
Or is it really about $2M/petabyte – which not that many need and can afford?
I meant (of course) $500/TB – not $500/GB
Yes, $2 million/petabyte, and that would be with reasonable assumptions about compression.
With regard to “a hard drive crash that took out early drafts of what were to be last weekend’s blog posts”:
I do all writing and coding in a Dropbox folder so that each file is copied to the cloud within seconds after it is saved to the local disk, as long as the laptop or desktop is connected to the internet. When the hard disk on my Ubuntu laptop failed, I was able to recover from Dropbox the last saved version of everything on which I was working. The free version of Dropbox is more than sufficient for months of my work.
Dropbox also downloads any newer version of every file in the Dropbox folder to each of my laptops and desktops, including those running Windows 7 and XP, within seconds after I turn one on. This allows me to start working on a file on one PC and continue working on the file on another PC.
I use free accounts at MiMedia, Mozy, SkyDrive, and Ubuntu One for more extensive backups.
Nevertheless, a hard drive crash is a pain in the neck, or lower down.
Enjoyed the post. I too am seeing clients do more stratified modeling. I like to use the target (response) variable and a supervised (predictive) technique, like a shallow decision tree, to define initial segments (aka clusters), rather than unsupervised clustering. Then I develop a model in each segment. I compare the stratified models to a fit on the entire training data to evaluate potential lift / error reduction. I’m definitely also seeing more use of boosting methods that require less data preparation.
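A minimal sketch of the segmentation approach this comment describes, under assumed details: scikit-learn, a depth-2 regression tree whose leaves define the segments, a linear model per segment, and an in-sample comparison against a single fit on all the training data.

```python
# Sketch of the comment's approach: a shallow decision tree on the target
# defines segments, a model is fit per segment, and the result is compared
# to one model fit on the whole training set. Data and model choices are
# illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 4))
# Synthetic target whose behavior differs by region of feature space.
y = np.where(X[:, 0] > 0, X @ [1, 2, 0, 0], X @ [0, 0, 3, -1])
y = y + rng.normal(scale=0.1, size=2000)

# Shallow (depth-2) tree on the target; its leaves define the segments.
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
segments = tree.apply(X)  # leaf index for each row

# One linear model per segment...
seg_pred = np.empty_like(y)
for leaf in np.unique(segments):
    mask = segments == leaf
    seg_pred[mask] = LinearRegression().fit(X[mask], y[mask]).predict(X[mask])

# ...versus a single fit on the entire training data.
global_pred = LinearRegression().fit(X, y).predict(X)

# In-sample comparison for brevity; a real evaluation would use held-out data.
print("stratified MSE:", round(mean_squared_error(y, seg_pred), 4))
print("global MSE:   ", round(mean_squared_error(y, global_pred), 4))
```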
[…] July 2 comments on predictive modeling were far from my best work. Let’s try […]
[…] have to be permanent, and in fact there are a number of fairly low-cost RDBMS offerings, such as petascale Vertica, the Teradata 1000 series, or […]