Investment research and trading
Discussion of how data management and analytic technologies are used in trading and investment research. (As opposed to a discussion of the services we ourselves provide to investors.) Related subjects include:
- CEP (Complex Event Processing)
- (in Text Technologies) The use of text analytics in trading and investment research
The Sybase Aleri RAP
Well, I got a quick Sybase/Aleri briefing, along with multiple apologies for not being prebriefed. (Main excuse: News was getting out, which accelerated the announcement.) Nothing badly contradicted my prior post on the Sybase/Aleri deal.
To understand Sybase’s plans for Aleri and CEP, it helps to understand Sybase’s current CEP-oriented offering, Sybase RAP. So far as I can tell, Sybase RAP has to date only been sold in the form of Sybase RAP: The Trading Edition. In that guise, Sybase RAP has been sold to >40 outfits since its May, 2008 launch, mainly big names in the investment banking and stock exchange sectors. If I understood correctly, the next target market for Sybase RAP is telcos, for real-time network tuning and management.
In addition to any domain-specific applications, Sybase RAP has three layers:
- CEP (Complex Event Processing). Sybase RAP CEP is based on a version of the Coral8 engine that Sybase licensed and has subsequently been developing.
- In-memory DBMS. Sybase’s IMDB is part of (but, I gather, separable from) Sybase’s OLTP DBMS Adaptive Server Enterprise (ASE, aka Sybase Classic), and has the same API as ASE.
- Sybase IQ. Actually, Sybase used the phrase “based on Sybase IQ,” but I’m guessing it’s just Sybase IQ.
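To make the division of labor among those layers concrete, here’s a minimal, purely illustrative Python sketch of how one event might flow through a RAP-like three-layer stack: cheap immediate handling (CEP), fast recent-state lookup (in-memory DBMS), and durable storage for after-the-fact analysis (Sybase IQ). All names and thresholds below are my own inventions, not Sybase APIs; in the real product those roles are played by the Coral8-derived CEP engine, the ASE-compatible in-memory database, and Sybase IQ.

```python
from collections import defaultdict, deque

# Layer 1 stand-in: CEP-style continuous query -- a short sliding window per symbol,
# with an alert when the latest price strays more than 2% from the window average.
windows = defaultdict(lambda: deque(maxlen=5))

# Layer 2 stand-in: in-memory DBMS -- the latest state per symbol, for fast lookups.
in_memory_store = {}

# Layer 3 stand-in: Sybase IQ -- the full history, append-only, for deep analytics.
historical_archive = []

def on_tick(symbol, price):
    """Run one market tick through all three (simulated) layers."""
    window = windows[symbol]
    window.append(price)
    avg = sum(window) / len(window)
    if len(window) == window.maxlen and abs(price - avg) / avg > 0.02:
        print(f"CEP alert: {symbol} moved more than 2% off its short-term average")
    in_memory_store[symbol] = {"last_price": price, "window_avg": avg}
    historical_archive.append((symbol, price))

for p in (100, 101, 100, 102, 101, 110):   # toy tick stream
    on_tick("XYZ", p)
print(in_memory_store["XYZ"], len(historical_archive))
```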
Quick thoughts on Sybase/Aleri
Sybase announced an asset purchase that amounts to a takeover of CEP (Complex Event Processing) vendor Aleri. Perhaps not coincidentally, Sybase already had technology under the hood from Aleri predecessor/acquiree Coral8, for financial services uses (notwithstanding that, between Aleri Classic and Coral8, Aleri Classic was the one more focused on financial services). Quick reactions include:
- The folks at Sybase still haven’t figured out when to prebrief me. (Edit: I’ve been briefed subsequently.)
- Sybase/Aleri is a potentially powerful combination, if they can effectively address the point I just made about integrating disparate latencies. That said, I’m not expecting a lot, because the CEP industry always disappoints me.
- Microsoft, IBM, and (somewhat less clearly) Oracle are all trying to do CEP inhouse. Sybase is making a good choice in having serious CEP inhouse itself.
- Surely the main focus and financial justification for the Sybase/Aleri acquisition is the financial services market.
- Specifically, I expect the focus of technical integration between Aleri and Sybase’s DBMS products to start with Sybase IQ.
- Coral8 had some interesting ideas about how to integrate CEP with OLTP/operational BI, but I’m not aware that they got much traction.
- I bet there are use cases in which Sybase tries and fails to sell Adaptive Server SQL Anywhere even though CEP would be a better technical fit, but I don’t immediately see much practical business significance to that observation.
- While this deal could easily strengthen the Vertica/StreamBase partnership, I don’t see any reason why it would lead those two companies to actually merge.
Three broad categories of data
People often try to draw a distinction between:
- Traditional data of the sort that’s stored in relational databases, aka “structured.”
- Everything else, aka “unstructured” or “semi-structured” or “complex.”
There are plenty of problems with these formulations, not the least of which is that the supposedly “unstructured” data is the kind that actually tends to have interesting internal structures. But of the many reasons why these distinctions don’t tend to work very well, I think the most important one is that:
Databases shouldn’t be divided into just two categories. Even as a rough-cut approximation, they should be divided into three, namely:
- Human/Tabular data – i.e., human-generated data that fits well into relational tables or arrays
- Human/Nontabular data — i.e., all other data generated by humans
- Machine-Generated data
Even that trichotomy is grossly oversimplified, for reasons such as:
- These categories overlap.
- There are kinds of data that get into fuzzy border zones.
- Not all data in each category has all the same properties.
But at least as a starting point, I think this basic categorization has some value.
A framework for thinking about data warehouse growth
There are only three ways that the amount of data stored in data warehouses can grow:
- The same kinds of data are stored as before, with more being added over time.
- The same kinds of data are stored as before, but in more detail.
- New kinds of data are stored.
Boston Big Data Summit keynote outline
Last month, Bob Zurek asked me to give a talk on “Big Data,” with “big” meaning anything from a few terabytes on up, and then to moderate a panel on cloud computing. We agreed that I could talk just from notes, without slides. So, since I have them typed up, I’m posting them below.
General introduction to Splunk
I dropped by log analysis software vendor Splunk a few weeks ago for a chat with Marketing VP Steve Sommer (whom some of you may know from Cognos and/or Informix), Product Management VP Christina Noren, and above all co-founder/CTO Erik Swan. Splunk turns out to be a pretty interesting company, from both business and technical standpoints. For one thing, Splunk seems highly regarded by most people I mention it to.
Splunk’s technical stories include:
- Text search over log files.
- Business intelligence over text search. (That part sounds a lot like Attivio.)
- MapReduce with schema flexibility and smart multi-stage execution plans. (That part sounds a lot like Aster Data.)
More on those in a separate post.
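To illustrate the first of those stories: “text search over log files” generally means indexing raw log lines without a predefined schema and extracting fields at search time. Here’s a toy sketch of that schema-on-read idea in Python; it is emphatically not Splunk’s search language or internals, and the log format is made up.

```python
import re

raw_logs = [
    "2009-10-01 12:00:01 host=web01 status=200 bytes=512 GET /index.html",
    "2009-10-01 12:00:02 host=web02 status=500 bytes=0 GET /checkout",
    "2009-10-01 12:00:03 host=web01 status=200 bytes=2048 GET /cart",
]

def extract_fields(line):
    """Schema on read: pull key=value pairs out of a raw line at query time."""
    return dict(re.findall(r"(\w+)=(\S+)", line))

def search(logs, keyword, **field_filters):
    """Keyword search over raw text, plus optional filters on extracted fields."""
    for line in logs:
        if keyword not in line:
            continue
        fields = extract_fields(line)
        if all(fields.get(k) == v for k, v in field_filters.items()):
            yield line

# "Find GET requests that returned a 500."
for hit in search(raw_logs, "GET", status="500"):
    print(hit)
```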
Less technical Splunk highlights include: Read more
Infobright notes
I had lunch with Bob Zurek and Susan Davis of Infobright today. This wasn’t primarily a briefing, but a few takeaways are:
- Infobright now has >100 paying customers.
- Typical database size is from the low 100s of gigabytes to the low single-digit number of terabytes.
- Agile development is at or approaching two-week release cycles.
- Like Kickfire, Infobright has a multi-year deal with MySQL that insulates it against many potential Oracle/MySQL shenanigans.
- From an industry perspective, Infobright’s customer base sounds a lot like other vendors’:
- Data mart outsourcing/online analytics
- Log files for websites
- Telecommunications
- Financial services
- OEM, especially in the markets cited above
- “Hey, we’re beginning to see the occasional energy deal”
- A few random others
- Infobright is seeing some household-name customers who surely have big-name analytic DBMS products, but who also have a policy that open source is the default choice; if open source can get the job done, then the favorite closed-source choices aren’t used.
- Infobright has the usual open-source community story — lots of involvement and engagement in the forums, but contributions are limited mainly to connectivity, utility scripts, etc. (Maybe some national language translation too; I’m not sure.)
How 30+ enterprises are using Hadoop
MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of Hadoop World, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera’s files. Facts and metrics ranged widely, of course:
- Some are in heavy production with Hadoop, and closely engaged with Cloudera. Others are active Hadoop users but are very secretive. Yet others signed up for initial Hadoop training last week.
- Some have Hadoop clusters in the thousands of nodes. Many have Hadoop clusters in the 50-100 node range. Others are just prototyping Hadoop use. And one seems to be “OEMing” a small Hadoop cluster in each piece of equipment sold.
- Many export data from Hadoop to a relational DBMS; many others just leave it in HDFS (Hadoop Distributed File System), e.g. with Hive as the query language, or in exactly one case Jaql.
- Some are household names, in web businesses or otherwise. Others seem to be pretty obscure.
- Industries include financial services, telecom (Asia only, and quite new), bioinformatics (and other research), intelligence, and lots of web and/or advertising/media.
- Application areas mentioned — and these overlap in some cases — include:
- Log and/or clickstream analysis of various kinds
- Marketing analytics
- Machine learning and/or sophisticated data mining
- Image processing
- Processing of XML messages
- Web crawling and/or text processing
- General archiving, including of relational/tabular data, e.g. for compliance
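To give a flavor of the first application area, log/clickstream analysis on Hadoop is often written as a small pair of map and reduce scripts (e.g., run under Hadoop Streaming). The sketch below counts page views per URL; it’s a generic illustration on a made-up input format, not something drawn from any of the customers Jeff described.

```python
from itertools import groupby

def mapper(lines):
    """Map step: emit (url, 1) for each clickstream record.
    Assumes tab-separated records with the URL in the third column."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            yield fields[2], 1

def reducer(pairs):
    """Reduce step: sum counts per URL. Pairs must arrive grouped/sorted by URL,
    which is what Hadoop's shuffle phase guarantees."""
    for url, group in groupby(pairs, key=lambda kv: kv[0]):
        yield url, sum(count for _, count in group)

if __name__ == "__main__":
    # Toy local run; under Hadoop Streaming the map and reduce halves would be
    # separate scripts reading stdin and writing tab-separated stdout.
    sample = ["2009-11-01\tuser1\t/home", "2009-11-01\tuser2\t/home",
              "2009-11-01\tuser1\t/cart"]
    for url, views in reducer(sorted(mapper(sample))):
        print(f"{url}\t{views}")
```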
Sybase IQ business notes
As specialized analytic DBMS go, Sybase is near the top of the charts both in age (Sybase IQ was first introduced in the mid 1990s) and adoption. That’s even more true, of course, if we restrict the discussion strictly to columnar DBMS, aka column stores. Basic Sybase IQ adoption claims include:
- >1500 users
- >3000 installations (Sybase has variously cited 2.1 and 2.5+ as the installation/user ratio)
- At least ~50-60 installations with >5 terabytes of user data
Note that 98% of Sybase IQ installations are under 5 terabytes; the heart of Sybase IQ’s business is the sub-terabyte data warehouse market.*
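For what it’s worth, those figures hang together arithmetically. A quick back-of-the-envelope check (my arithmetic, using the rounded numbers above):

```python
installations = 3000   # ">3000 installations"
users = 1500           # ">1500 users"
big_installs = 60      # "at least ~50-60 installations with >5 terabytes"

print(f"Installations per user: {installations / users:.1f}")                   # ~2.0, near the cited 2.1-2.5+
print(f"Share of installations over 5 TB: {big_installs / installations:.0%}")  # ~2%, i.e. ~98% under 5 TB
```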
Vertica’s version of MapReduce integration
I talked with Omer Trajman of Vertica Monday night about Vertica’s MapReduce integration, part of its Vertica 3.5 release. Highlights included:
- By “integrating Vertica and MapReduce,” Vertica means “integrating Vertica and Hadoop.”
- Vertica’s Hadoop integration is based on Cloudera’s DBInputFormat.
- Omer called out for me several features of Vertica’s Hadoop integration that didn’t just come from Cloudera, namely:
- Cloudera’s DBInputFormat assumes the database runs on a single computer, or a single head node of an MPP system. Vertica’s technology, however, runs on peer parallel nodes with no head, and so Vertica adapted the DBInputFormat technology accordingly.
- Vertica lets you push down Map functions to the database. Omer reports a roughly even division among users and prospects between those who want to do this and those who don’t.
- Vertica lets you do Reduce functions (or Map functions, if you don’t push them down to the database) on a separate cluster from the one running the database software. Vertica asserts that its customers and prospects all want to do this. Right here is the big difference between Vertica’s MapReduce integration and Aster’s or Greenplum’s. (Aster would also say that Vertica’s weaker MapReduce/SQL programming integration is a big difference as well.)
- Indeed, Vertica lets you Reduce into a different DBMS than Vertica, if you choose.
- Vertica gives you flexibility on the size of the Map and Reduce clusters. Omer agreed with me when I said there were some limits on how fast one can add or subtract nodes in a Vertica grid, because there’s data redistribution involved. But one can add/change/delete Hadoop clusters extremely quickly.
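Since DBInputFormat itself is Java code inside Hadoop, and Vertica hasn’t published its adaptation in detail, here is only a conceptual Python sketch of the idea behind the first bullet above: instead of funneling all input through one head node, each map task is handed its own slice of the database to query directly on some peer node. Every name and the partitioning predicate here are hypothetical.

```python
def plan_input_splits(db_nodes, table, splits_per_node=2):
    """Assign each map task a (node, query) pair, so map tasks read directly
    from the peer database nodes rather than through a single head node."""
    splits = []
    for node in db_nodes:
        for i in range(splits_per_node):
            predicate = f"MOD(row_hash, {splits_per_node}) = {i}"   # hypothetical partitioning column
            splits.append({"node": node, "query": f"SELECT * FROM {table} WHERE {predicate}"})
    return splits

def run_map_task(split, map_fn, fetch_rows):
    """A map task connects to its assigned node, runs its slice of the query,
    and applies the user's Map function to each row."""
    for row in fetch_rows(split["node"], split["query"]):
        yield from map_fn(row)

# Example wiring: three peer nodes, a trivial Map function, and a stand-in fetcher
# (a real implementation would open a database connection per split).
nodes = ["node1", "node2", "node3"]
fake_fetch = lambda node, query: [{"symbol": "XYZ", "qty": 100}]
emit_pairs = lambda row: [(row["symbol"], row["qty"])]

for split in plan_input_splits(nodes, "trades"):
    print(split["node"], list(run_map_task(split, emit_pairs, fake_fetch)))
```

The Reduce side, in this picture, simply consumes the pairs the map tasks emit, and per the bullets above can run on a separate Hadoop cluster and even write its output to a different DBMS.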
Apparently, the use cases for Vertica/Hadoop integration to date lie in algorithmic trading and two kinds of web analytics. Specifically: Read more