Boston Big Data Summit keynote outline
Last month, Bob Zurek asked me to give a talk on “Big Data”, where “big” is anything from a few terabytes on up, then moderate a panel on cloud computing. We agreed that I could talk just from notes, without slides. So, since I have them typed up, I’m posting them below.
The top two points from Q&A probably were:
- Big Data and the cloud actually have relatively little to do with each other, a few exceptions notwithstanding, especially if the data is in a shared-nothing DBMS (as opposed to, say, a MapReduce-oriented file cluster). Two principal reasons are:
  - Redistributing data from node to node is a little slow, undermining some of the elasticity benefits of the cloud.
  - Getting data into the cloud in the first place is a lot slow.
- The NoSQL movement is a lot like the Ron Paul campaign — it consists of people who are dissatisfied with the status quo, whose dissatisfaction has a lot to do with insufficient liberty and/or excessive expenditure, and who otherwise don’t have a whole lot in common with each other.
Anyhow, here are my notes for the talk, edited in just a couple of places for readability or linkage.
Quick introduction
- Big Data vs. cloud
- How big is Big Data?
- At the low end of that range, there’s little you can’t do with conventional technology if you have:
- An unlimited budget for hardware
- An unlimited budget for software
- An unlimited budget for people, especially Oracle DBAs
Big Data in OLTP
- Hard-core OLTP
- Focus of DBMS technology for a long time
- Big budgets because each transaction has significant value
- Tough to get users to change technologies
- Lighter-weight OLTP
- Classic example = web companies
- Big ones — retail-oriented ones (eBay, Amazon) partially excepted — rolled their own technology stacks
- Reluctant to give money to anybody
- Open source, etc.
- Difficulty finding market
- Product vs. feature
- Clustering/HA/DR/whatever
- Ditto cloud enablement
- True products haven’t found much traction yet
Analytic Big Data use cases
- Kinds of data for analytics
- More of same != big
- More detail and/or new kinds
- Complete data sets
- Transactions
- Call details
- Tick/trade history
- Web clickstreams
- Network event logs
- Other machine-generated data
- CAM bottom line
- Anything human-generated should and will be retained in its entirety
- Quantities of machine-generated data retained should and will grow roughly in line w/ computing cost reductions (Moore’s Law, etc.)
- Analytic uses of Big Data
- Analytics is mainly about three things
- Problem detection
- Customer relationship improvement
- (Those overlap when the customer relationship is bad)
- Financial statements on steroids
- Main kinds of analytics
- What BI vendors traditionally sell
- General reporting and dashboards
- Ad-hoc query (now driven from those reports and dashboards)
- Planning (allegedly integrated with BI)
- Research
- Ad hoc relational query (worth mentioning twice because it drives so much of the market)
- Data mining
- Most web search and web mining
- Operational/near-real-time
- Archiving/compliance
- What gets Big?
- Mainly research and archiving
- But when reporting or operational get Big, you have really interesting computing problems
Technology issues and trends
- Moore’s Law
- CPUs — All about cores, hence parallelism is key
- RAM
- SSDs – hence replace disks
- Sensors – hence generate lots more data
- Kryder’s Law
- But rotational speeds up only 12.5X since Eisenhower Administration
- Hence solid-state memory (or RAM) will soon take over
- In the meantime, I/O bottlenecks have had to be beaten
- Hence sequential scans
- Hence index-light architectures
- Hence columnar
- DBMS “overhead”
- Raw license and maintenance fees – software increasing fraction of total
- OLTP vestiges – locking and all that
- DBAs
- People costs = huge fraction of total
- Index-lightness addresses
- So does appliance
- Many people don’t really know how to write SQL
- Configuration
- Appliance/tightly-balanced
- Netezza
- Teradata earlier
- Greenplum/Sun
- Oracle
- IBM
- Microsoft/Madison
- Commodity/do what you want
- Vertica
- Greenplum now
- Infobright, Aster and others
- MapReduce-oriented file systems
- Extreme rigidity is silly
- Teradata, Oracle have both signaled moving to more modularity
- Big driver of that = heterogeneous storage
- Cheap disk
- Expensive disk
- Solid-state
- RAM
- CPU/storage ratio is even more of a driver
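To put rough numbers on the I/O points in this section, here is a minimal back-of-the-envelope sketch. The RPM figures (about 1,200 for a mid-1950s drive, 15,000 for a fast enterprise drive) are my assumptions, chosen because they reproduce the 12.5X ratio; the table shape, the equal-width columns, and the function names are likewise hypothetical and only there to show why columnar, scan-oriented designs cut I/O.

```python
# Illustrative arithmetic only. The RPM figures and the table shape are
# assumptions made for this sketch, not numbers taken from the talk.


def avg_rotational_latency_ms(rpm: float) -> float:
    """Average wait for a sector to come around: half a revolution, in ms."""
    return 0.5 * 60_000.0 / rpm


def scan_bytes(rows: int, row_width_bytes: int, columns_total: int,
               columns_needed: int, columnar: bool) -> int:
    """Bytes read by a full scan, assuming equal-width columns, no compression."""
    table_bytes = rows * row_width_bytes
    if not columnar:
        # A row store has to read every column of every row.
        return table_bytes
    # A column store reads only the columns the query touches.
    return table_bytes * columns_needed // columns_total


OLD_RPM = 1_200    # assumed: mid-1950s disk drive
NEW_RPM = 15_000   # assumed: fast enterprise drive

speedup = avg_rotational_latency_ms(OLD_RPM) / avg_rotational_latency_ms(NEW_RPM)

# Hypothetical fact table: 1 billion rows, 200 bytes wide, 50 columns,
# and a query that needs 5 of those columns.
row_store = scan_bytes(1_000_000_000, 200, 50, 5, columnar=False)
col_store = scan_bytes(1_000_000_000, 200, 50, 5, columnar=True)

print(f"rotational speedup: {speedup}x")
print(f"row-store scan:     {row_store / 1e9:.0f} GB")
print(f"columnar scan:      {col_store / 1e9:.0f} GB")
```

Under these assumptions the mechanical gain is small next to what Moore's Law delivered for CPUs over the same decades, while reading only the needed columns cuts scan volume by the ratio of columns touched to columns stored.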
Theoretically defensible ways to segment the market
- Latency requirements
- High availability and low latency go together
- Query types
- Simultaneous users for same
- Database size
- Budget
Actual segments right now
- Utter ADW/EDW
- Data mart
- Size
- Naturally columnar vs. naturally row-based
- Operational/frontline
- Less dramatic/smaller EDW
Comments
6 Responses to “Boston Big Data Summit keynote outline”
Really cool list. Do you consider Aster (and Vertica, which I don’t believe you mentioned) the only viable/existing cloud players in the big data world or do you see others in there?
Thanks
J.
Jerome,
No promises of completeness are implied.
But last I looked, Aster and Vertica had a little more cloud track record — with “little” being the operative word — than most of the others.
Interested in your comment, “Many people don’t really know how to write SQL”.
I think it’s even worse than that: many web developers are writing using frameworks that don’t allow them to easily see the SQL generated, so they don’t even think to tune it. Maybe we have become blasé, thinking that the database problem is “solved”, because it’s invisible.
However, I agree that even if they could see the SQL, most developers wouldn’t know what to do with it. Database and query tuning remain black arts.
[…] It’s particularly hard to describe NoSQL (Not Only SQL) accurately, given the basic confusion as to what NoSQL is all about. […]
[…] expense of expertise. Highly skilled Oracle DBAs are expensive. The same can be said for many other categories of people, whether in IT or business units, needed […]
[…] for my Boston Big Data Summit (no relation to Aster Data’s Big Data Summit series) talk in October, 2009 Categories: […]