Kickfire’s FPGA-based technical strategy
Kickfire’s basic value proposition is that, if you have a data warehouse in the hundreds of gigabytes, they’ll sell you – for $32,000 – a tiny box that solves all your query performance problems, as per the Kickfire spec sheet. And Kickfire backs that up with a pretty cool product design. However, thanks in no small part to Kickfire’s heretofore penchant for self-defeating secrecy, the Kickfire story is not widely appreciated.
Fortunately, Kickfire is getting over its secrecy kick. And so, here are some Kickfire technical basics.
- Kickfire is MySQL-based, with all the SQL functionality and lack of functionality that entails.
- The Kickfire/MySQL DBMS is columnar, with the usual benefits in compression and I/O reduction (see the sketch after this list).
- Kickfire is based on FPGAs (Field-Programmable Gate Arrays).
- The Kickfire DBMS is ACID-compliant.
- Kickfire runs only as a single-box appliance.
- While Kickfire earlier estimated that, at least for data sets that compressed well, a Kickfire box could hold 3-10 terabytes of user data, more recent figures I’ve heard from Kickfire have been in the 1 1/2 terabyte range. (Edit: Karl Van Der Bergh subsequently wrote in to say that the 1 1/2 TB is a raw disk figure, not user data.)
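To make the columnar point above concrete, here is a minimal sketch (my illustration, not Kickfire’s code) of why column stores cut I/O and compress well: a scan that needs one column reads only that column’s array, and repetitive columns shrink dramatically under simple schemes like run-length encoding.

```python
# Minimal sketch of columnar storage benefits -- illustrative only,
# not Kickfire's implementation.
from itertools import groupby

# Row store: every row carries every column.
rows = [
    ("2009-08-11", "US", 100),
    ("2009-08-11", "US", 250),
    ("2009-08-11", "DE", 75),
    ("2009-08-12", "US", 310),
]

# Column store: one array per column.
columns = {
    "date":    [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "amount":  [r[2] for r in rows],
}

def rle_encode(values):
    """Run-length encode a column; repetitive columns shrink a lot."""
    return [(value, len(list(group))) for value, group in groupby(values)]

# SELECT SUM(amount) touches one column's array, not whole rows.
print(sum(columns["amount"]))        # 735
print(rle_encode(columns["date"]))   # [('2009-08-11', 3), ('2009-08-12', 1)]
```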
The new information there is that Kickfire relies on an FPGA; Kickfire had long been artfully vague on the subject of FPGA vs. custom silicon. This had the unfortunate effect that people believed Kickfire relied on a proprietary chip, with all the negative implications for future R&D effectiveness that such a choice is believed to carry. But in fact Kickfire just relies on standard chips, even if — like Netezza and XtremeData — Kickfire does rely on less programmer-friendly FPGAs to do some of what most rival vendors do on Intel-compatible CPUs.
In terms of how it uses the FPGA, Kickfire is more like XtremeData than like Netezza. That is, large fractions of actual SQL processing seem to be done on the FPGA, not just projections and restrictions. Pipelining is a key concept, in that data is shunted among various “processing engines” without, unless absolutely necessary, being sent back into RAM. If I understood Kickfire founder Raj Cherabuddi correctly (a sketch follows the list below):
- There are three kinds of on-FPGA Kickfire “processing engines”.
- Each Kickfire processing engine can do any of about half a dozen different basic things.
- When data finishes at one engine it is sent straight to another engine if at all possible.
- One of the Kickfire optimizer’s main responsibilities is to ensure that this will be possible as often as – well, as often as possible. 🙂
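Here’s a software analogue of that pipelining idea: a minimal Python sketch with invented operator names (Kickfire, of course, does this in FPGA hardware, not Python). Each “engine” consumes tuples lazily and streams its output straight to the next engine, so intermediate results never land back in a big memory buffer.

```python
# Software analogue of a pipelined query engine (hypothetical names;
# not Kickfire's code). Each "engine" streams tuples to the next one.

def scan(table):                      # engine 1: produce tuples
    for row in table:
        yield row

def restrict(rows, predicate):        # engine 2: filter tuples
    for row in rows:
        if predicate(row):
            yield row

def project(rows, cols):              # engine 3: keep selected columns
    for row in rows:
        yield tuple(row[c] for c in cols)

table = [
    {"country": "US", "amount": 100},
    {"country": "DE", "amount": 75},
    {"country": "US", "amount": 310},
]

# The "optimizer" here is simply how we wire the engines together.
pipeline = project(restrict(scan(table), lambda r: r["country"] == "US"),
                   ["amount"])
print(list(pipeline))   # [(100,), (310,)]
```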
Raj says that there are two main reasons data can ever be sent back to memory mid-query. First, the optimizer might sadly fail to find a “networking solution” that allows for perfect pipelining. Second, a query might be so complex that several passes through the pipeline are needed to get it done.
That’s one of Kickfire’s top two performance strategies. I ran out of time on my last visit before I properly understood the other one, which is something Kickfire calls “deep indexing,” but which sounds a lot like an inverted list. (Key point: if you already have an inverted list created, joins can be very fast.) When and how exactly that’s used, and what Kickfire does in “hardware” to support it, is a subject for another time.
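To illustrate that key point, here is a generic sketch of an inverted-list join (my illustration, not a description of Kickfire’s actual “deep indexing”): if each join-key value already maps to the row positions that hold it, the join becomes a lookup per row rather than a scan per row.

```python
# Why a pre-built inverted list makes joins fast (generic illustration,
# not Kickfire's "deep indexing").
from collections import defaultdict

orders    = [("o1", "c1"), ("o2", "c2"), ("o3", "c1")]   # (order_id, cust_id)
customers = [("c1", "Alice"), ("c2", "Bob")]             # (cust_id, name)

# Build the inverted list once: cust_id -> row positions in `orders`.
inverted = defaultdict(list)
for pos, (_, cust_id) in enumerate(orders):
    inverted[cust_id].append(pos)

# Join: for each customer, jump straight to its matching order rows.
for cust_id, name in customers:
    for pos in inverted[cust_id]:
        print(orders[pos][0], name)   # o1 Alice / o3 Alice / o2 Bob
```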
On the negative side: To get good update/trickle-feed performance, columnar vendors have to do something or other clever. That’s still a future for Kickfire, with the specifics of the roadmap being NDA. I imagine Kickfire also has performance weaknesses in areas where it relies on MySQL for things that MySQL doesn’t happen to be very good at.
Comments
Hi Curt,
Thanks for the post. A couple of comments.
I’d put our value proposition a little differently. The Kickfire appliance targets what we call the data warehousing mass market, which comprises deployments up to about 5TB in size (representing over two thirds of all deployments, according to IDC).
We believe the top needs of this market are 1) performance, 2) fast time to value, and 3) low upfront cost and TCO. Our appliance addresses these needs with 1) the highest performance per dollar, per watt, and per cubic foot of any vendor, based on the industry data available; 2) a true plug-and-play appliance (i.e., not a bundle from multiple vendors) that is ready to go in 15 minutes; and 3) a starting price of just $32K all in (hardware, software, storage) and a TCO that is 10X better than the industry norm.
On the question of trickle-feed updates: we support this today, but for fastest performance we typically recommend that our customers use our high-speed incremental loader, which does micro-batch updating.
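For readers unfamiliar with micro-batching, the pattern looks roughly like this (a generic sketch with invented names and thresholds, not our actual loader): trickle-feed rows accumulate in a small write buffer, which is flushed into the columnar store as one batch, amortizing the cost of rewriting compressed column blocks.

```python
# Generic micro-batch loading pattern (illustrative; invented names).
import time

class MicroBatchLoader:
    def __init__(self, flush_fn, max_rows=10_000, max_age_sec=5.0):
        self.flush_fn = flush_fn      # writes one batch into the column store
        self.max_rows = max_rows
        self.max_age_sec = max_age_sec
        self.buffer = []
        self.oldest = None

    def insert(self, row):
        if self.oldest is None:
            self.oldest = time.monotonic()
        self.buffer.append(row)
        # Flush when the buffer is big enough or old enough.
        if (len(self.buffer) >= self.max_rows or
                time.monotonic() - self.oldest >= self.max_age_sec):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)            # one batched columnar write
            self.buffer, self.oldest = [], None

loader = MicroBatchLoader(flush_fn=lambda b: print(f"flushed {len(b)} rows"))
for i in range(25_000):
    loader.insert(("2009-08-11", "US", i))
loader.flush()   # flush the tail: flushed 10000 / 10000 / 5000 rows
```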
We look forward to following up with you further on the details behind our chip.
Karl
@Curt
While Kickfire may have been artfully vague or what not, the SIGMOD 2009 paper, “FPGA: What’s in it for a Database?”, specifically mentions that Kickfire uses an FPGA:
So I guess it’s not entirely new information…
Interesting. Not sure how they got that information, or whether it was just a lucky/educated guess.
They also made the same assertion in http://www.inf.ethz.ch/personal/jteubner/publications/streams-on-wires.pdf
@Karl:
Call me stupid, but I have looked at Kickfire from an architecture perspective.
To put it very, very bluntly:
The Kickfire FPGA is simply a massive memory controller array. The Kickfire idea is nothing more than the traditional, tried-and-true philosophy: fit all your data in RAM.
The FPGA boards each control 128GB of RAM or so. So basically it is a PCI-Express RAM expansion card for 128GB of RAM. You put, say, 8 of them in a dual-socket Xeon system, and you suddenly have a nice 1TB RAM cache for your database. Add in a column store and compression, and you get a nice RAM-based database able to handle a 1-3TB data warehouse.
Let me ask this question: given the current Nehalem-EX design, an 8-socket Nehalem-EX will be able to have 1TB of RAM cache directly controlled by the integrated memory controller. Why offload RAM to PCI-Express boards when you can have 1TB of RAM in your system at a low, low price? For the $32,000 or whatever you guys want for the Kickfire, you can easily just build a big box and put everything in RAM, like Markus Frind did for plentyoffish.com, using 8-socket AMDs or 8-socket Nehalem-EXs.
So from the “store everything in RAM” perspective, all Kickfire did was add a RAM-based cache in a non-commodity way (FPGA). The current trend is to do database caching in MLC flash on top of hard drive arrays (ZFS L2ARC, for example). In that sense, you can already buy PCI-Express MLC flash cards from OCZ, Micron, heck, even Fusion-io. So from the perspective of “store everything in flash,” Kickfire is now hopelessly outpriced by flash.
Call me a non-believer, but Kickfire will never be big.
@TS You are stupid.
The RAM the Kickfire appliance uses is not just a cache, and the FPGA isn’t just a big memory controller. The Kickfire appliance does of course include memory controllers to access the RAM connected directly to the chip.
The SQL chip can only access the memory directly attached to it, and it does so in a DATA FLOW manner, not an INSTRUCTION FLOW manner. A lot of memory is needed to store tuples as they flow through the chip’s processing.
The current form factor of the appliance doesn’t include “cards”. There is a QPM attached via a PCI-X bus.
Buying 1TB of RAM and sticking your data in it will not perform as well as buying a smaller amount of RAM and using it more efficiently, which is what the data flow chip on the appliance is all about.
Neither flash nor SSD will outperform a data flow chip. Period.
Get your facts straight before making claims you don’t understand. Maybe send us an email and ask some questions instead of spouting off about things you have no grasp on.
@Curt that paper makes some suppositions about Kickfire that are not entirely accurate.
TS – Get back to me when your 1TB machine beats our appliance in price/performance on the TPC-H 300GB benchmark.
Hi Curt,
Thank you for your post on our technology. I have a couple of additional details to provide:
1) I think it would be interesting for your readers to understand the key design philosophy behind our SQL chip. In a nutshell, with our chip, we have done for memory bandwidth what column stores have done for I/O bandwidth.

Why is this important? The SQL chip is based on a dataflow architecture which employs direct transistor-based processing engines to natively execute high-level relational operations and database algorithms. This approach delivers an order of magnitude more query processing capability than today’s state-of-the-art microprocessors, which in turn creates the need for an order of magnitude more memory bandwidth than today’s microprocessors have available.

There are several techniques employed in our SQL chip and systems stack to get this increased memory bandwidth. As we discussed, the top two are a) the ability to keep the intermediate data sets/tuple sets resulting from the processing of complex queries live on the chip, without spilling to memory, and b) “deep indexing,” which helps avoid memory-intensive column scans and which we can discuss later at your convenience. The net result is that you get the processing power and memory bandwidth equivalent of 20 or so Nehalem-based CPUs in a single chip.
2) A little more detail on how we manage the MySQL interface. After receiving the parse tree from MySQL, our optimizer generates a plan that uses the SQL chip (our FPGA), our software SQL execution engine (executed on the x86 in our base server), or both. The majority of queries run natively in hardware, or with just a small component in software; a smaller set runs just in our software execution engine. Neither of these paths uses the MySQL optimizer or execution engine, thereby ensuring consistently high performance.
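Schematically, the routing works something like this (a generic sketch with invented names, to illustrate the idea; not our actual code): each operator in the plan goes to the chip if the chip supports it, and to the software engine otherwise.

```python
# Schematic sketch of plan routing between an FPGA path and a software
# path (invented names; illustrative only).

def plan_and_route(parse_tree, chip_supported):
    """Split a parse tree's operators between hardware and software paths."""
    hw_ops, sw_ops = [], []
    for op in parse_tree:             # pretend the tree is a flat list of ops
        (hw_ops if op in chip_supported else sw_ops).append(op)
    return hw_ops, sw_ops

CHIP_OPS = {"scan", "restrict", "project", "join", "aggregate"}

hw, sw = plan_and_route(["scan", "restrict", "join", "regexp"], CHIP_OPS)
print("on FPGA:", hw)      # on FPGA: ['scan', 'restrict', 'join']
print("in software:", sw)  # in software: ['regexp']
```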
Thanks,
Raj
TS,
These may help illustrate the value of the FPGA in the Kickfire architecture:
1) “Where’s the Beef? Why FPGAs Are So Fast” http://research.microsoft.com/apps/pubs/default.aspx?id=70636
So… orders-of-magnitude speedups are typical for FPGAs across a wide range of apps… but note the importance of the customized memory subsystem and interface:
“Our results show that custom memory interfaces are the most effective way at enabling much greater performance on the FPGA, and that memory interfaces traditional software use become a bottleneck when the FPGA uses the same interface.”
This is why Kickfire uses a separate memory subsystem. It’s also why in-socket FPGA accelerator approaches like XtremeData DBX fall far short.
Now consider that, just for the job of compression/decompression ALONE (crucial in the column-store context), a single FPGA is faster than dozens of multi-GHz cores, and uses a tiny fraction of the power.
For a specific example, take a look at this:
2) “Streaming implementation of a sequential decompression algorithm on an FPGA”
http://portal.acm.org/citation.cfm?id=1508195#abstract
And of course compression becomes exponentially more important in the Flash SSD context.
Hope this helps.
Eric Wendel
Eric,
Consider further that Kickfire doesn’t use sequential compression algorithms but instead uses dictionary compression, which doesn’t require decompression during query processing.
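Generically, that looks something like this (an illustrative sketch, not Kickfire’s actual scheme): the predicate’s constant is translated to its dictionary code once, after which the scan compares small integer codes and never decompresses the column.

```python
# Querying dictionary-compressed data without decompressing it
# (generic illustration, not Kickfire's scheme).

values = ["US", "DE", "US", "FR", "US", "DE"]

# Build the dictionary and encode the column as integer codes.
dictionary = {v: i for i, v in enumerate(sorted(set(values)))}
# dictionary == {'DE': 0, 'FR': 1, 'US': 2}
codes = [dictionary[v] for v in values]    # [2, 0, 2, 1, 2, 0]

# WHERE country = 'US': look up the code once, then scan codes only.
target = dictionary["US"]
matches = [pos for pos, c in enumerate(codes) if c == target]
print(matches)   # [0, 2, 4]
```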
@ Eric
Two things. First, everything about dbX is dramatically different from Kickfire. To start with, we scale to data sets that are 1000x bigger than Kickfire’s. Beyond that, the In-Socket Accelerator has QDR memory on it and uses the largest of FPGAs, which have massive internal memory bandwidth, and we use the local DDR DIMMs on the motherboard via two separate controllers. That gives us all the power we need to offer 1GB/sec/node regardless of data partitioning. I’ll recommend this ChalkTalk for anyone who wants to know more: http://www.xtremedata.com/parallelism.php
Second, I’d love to learn more about Kickfire. I’ve read the website, but I wanted more. So I just did a search at http://www.uspto.gov (the US Patent and Trademark Office) and could only find a patent application, not an approval. I’m guessing I just missed it, or their search engine isn’t finding it. (Note that calling something “patented” when it is only “patent pending” is highly illegal, so perhaps I just can’t find it.) Can you post or email us the patent number? That would probably clear some things up about what you are doing, and help everyone reading understand the information. A tall request, I know, but you basically made this public by filing for a patent in the first place.
Hi Geno,
I don’t have any association with Kickfire whatsoever, so I can’t address your patented vs. pending question, but Kickfire’s Joseph Chamdani (for one) has a very long and impressive resume in the area, with many patents issued, and worked for Sun in related areas as well. It’s entirely possible that there are previously issued patents that Kickfire has licensed and can therefore legitimately use “patented,” even if none of its current applications have been issued as patents.
Anyway, I think we should assume (especially here in a public forum) that the Kickfire team and their lawyers are smart enough to use the term “patented” appropriately.
Now, the QDR memory on your device is (a) an SRAM cache, and (b) only 32Mbytes. The need for a custom memory interface I refer to applies to the system DRAM, which needs to be GBytes in size and support massive concurrency of “in flight” memory transfers (in addition to bandwidth) for Data Warehousing apps.
If the XtremeData approach of using the server mobo memory subsystem works as well as Kickfire’s, then (by virtue of being a much cheaper way to go) it should deliver even better cost/performance in TPC-H… right?
Eric
You are correct about the QDR size, but there are 4 of them (vs. just one), each with separate addresses, in addition to the two controllers to the mobo DIMMs.
However, I’m sure that under the hood the data flow and execution models are pretty different. We’ve tried to optimize for large data problems and MPP, to allow the system to be data-model agnostic (query time on hash-partitioned data = query time on the same data partitioned round-robin). I’ll let KF comment, but I’m positive it is not the same “goal”.
The In-Socket concept allows us to mix and match as we need to. If we felt we needed more FPGA and/or more memory for something in the future, we’d just move to the HP DL785, for example. For now, we don’t need to.
If/when we publish our benchmark numbers publicly, it would not be for the 300GB size, but for the largest sizes (3TB, 10TB, and 30TB). We haven’t because our customers aren’t asking for it. They want a rack (or racks) on site, with their data, running their queries. That is what we are doing, with 100% effort.
I agree a Kickfire/XtremeData comparison doesn’t make too much sense, given the different product focus. I will be more interested in a Kickfire/VectorWise comparison, as these both seem to have a similar target (but I guess we are 12 months away from that). One could assume that an FPGA approach will outperform, but the price/performance gap will be interesting.
Regardless of FPGA/non-FPGA strategies, a cheap “plug and play” appliance for mid-range MySQL data marts still seems like a pretty good idea to me.
[…] architecture; you can read about it in Daniel’s blog and in a great discussion thread on Curt Monash’s blog post. As a result of all this work, the system can keep intermediate data sets in the chip’s […]
[…] strategy, e.g. from Daniel Abadi, Merv Adrian and, kicking things off — as it were — me. Weeks after a recent Kickfire product release, there’s finally a fairly accurate data sheet […]