November 23, 2009

Boston Big Data Summit keynote outline

Last month, Bob Zurek asked me to give a talk on “Big Data”, where “big” is anything from a few terabytes on up, then moderate a panel on cloud computing. We agreed that I could talk just from notes, without slides. So, since I have them typed up, I’m posting them below.

The top two points from Q&A probably were:

Big Data and the cloud actually have relatively little to do with each other, a few exceptions notwithstanding, especially if the data is in a shared-nothing DBMS (as opposed to, say, a MapReduce-oriented file cluster). Two principal reasons are:
- Redistributing data from node to node is a little slow, undermining some of the elasticity benefits of the cloud.
- Getting data into the cloud in the first place is a lot slow.
The NoSQL movement is a lot like the Ron Paul campaign — it consists of people who are dissatisfied with the status quo, whose dissatisfaction has a lot to do with insufficient liberty and/or excessive expenditure, and who otherwise don’t have a whole lot in common with each other.
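
To put a rough number on the "a lot slow" point about getting data into the cloud, here is a back-of-the-envelope sketch. The 10 TB data set and the sustained 100 Mbit/s uplink are illustrative assumptions, not figures from the talk, and the `transfer_days` helper is hypothetical.

```python
def transfer_days(terabytes: float, megabits_per_second: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to push `terabytes` (decimal TB) over a network link.

    `efficiency` is the fraction of nominal bandwidth actually sustained;
    1.0 is an optimistic assumption.
    """
    bits = terabytes * 1e12 * 8
    seconds = bits / (megabits_per_second * 1e6 * efficiency)
    return seconds / 86_400  # seconds per day


# Assumed scenario: 10 TB over a fully utilized 100 Mbit/s link.
print(f"{transfer_days(10, 100):.1f} days")  # about 9.3 days
```

Even under these generous assumptions the initial load takes more than a week, which is the scale of delay the point above refers to.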

Anyhow, here are my notes for the talk, edited in just a couple of places for readability or linkage.

Quick introduction

Big Data vs. cloud
How big is Big Data?
At the low end of that range, there’s little you can’t do with conventional technology if you have:
- An unlimited budget for hardware
- An unlimited budget for software
- An unlimited budget for people, especially Oracle DBAs

Big Data in OLTP

Hard-core OLTP
- Focus of DBMS technology for a long time
- Big budgets because each transaction has significant value
- Tough to get users to change technologies
Lighter-weight OLTP
- Classic example = web companies
  - Big ones — retail-oriented ones (eBay, Amazon) partially excepted — rolled their own technology stacks
  - Reluctant to give money to anybody
    - Open source, etc.
- Difficulty finding market
  - Product vs. feature
    - Clustering/HA/DR/whatever
    - Ditto cloud enablement
  - True products haven’t found much traction yet

Analytic Big Data use cases

Kinds of data for analytics
- More of same != big
- More detail and/or new kinds
  - Complete data sets
  - Transactions
  - Call details
  - Tick/trade history
  - Web clickstreams
  - Network event logs
  - Other machine-generated data
  - CAM bottom line
    - Anything human-generated should and will be retained in its entirety
    - Quantities of machine-generated data retained should and will grow roughly in line w/ computing cost reductions (Moore’s Law, etc.)
Analytic uses of Big Data
- Analytics is mainly about three things
  - Problem detection
  - Customer relationship improvement
    - (Those overlap when the customer relationship is bad)
  - Financial statements on steroids
- Main kinds of analytics
  - What BI vendors traditionally sell
    - General reporting and dashboards
    - Ad-hoc query (now driven from those reports and dashboards)
    - Planning (allegedly integrated with BI)
  - Research
    - Ad hoc relational query (worth mentioning twice because it drives so much of the market)
    - Data mining
    - Most web search and web mining
  - Operational/near-real-time
  - Archiving/compliance
- What gets Big?
  - Mainly research and archiving
  - But when reporting or operational get Big, you have really interesting computing problems

Technology issues and trends

Moore’s Law
- CPUs — All about cores, hence parallelism is key
- RAM
- SSDs – hence replace disks
- Sensors – hence generate lots more data
Kryder’s Law
- But rotational speeds up only 12.5X since Eisenhower Administration
- Hence solid-state memory (or RAM) will soon take over
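
A small worked example of the 12.5X figure. The notes do not state the endpoints, so the sketch assumes 1,200 RPM for the first commercial disk drive (1956) and 15,000 RPM for a fast enterprise drive in 2009. It also converts each speed into average rotational latency, the time to wait for half a revolution.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency in milliseconds (half a revolution)."""
    return 0.5 * 60_000.0 / rpm


# Assumed endpoints; the talk notes give only the 12.5X ratio.
RPM_1956 = 1_200
RPM_2009 = 15_000

speedup = RPM_2009 / RPM_1956
print(speedup)                              # 12.5
print(avg_rotational_latency_ms(RPM_1956))  # 25.0
print(avg_rotational_latency_ms(RPM_2009))  # 2.0
```

Under these assumptions, half a century of progress cut the average wait for a random read from 25 ms to 2 ms, while capacity grew by many orders of magnitude over the same period.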
In the meantime, I/O bottlenecks have had to be beaten
- Hence sequential scans
- Hence index-light architectures
- Hence columnar
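
As a rough illustration of why columnar storage helps with the I/O bottleneck, the sketch below compares the bytes a full sequential scan must read in a row layout versus a column layout. The table, its column widths, and the row count are all hypothetical, and compression (which usually favors columnar further) is ignored.

```python
# Hypothetical fact table: column name -> average bytes per value.
COLUMN_BYTES = {"customer_id": 8, "event_time": 8, "amount": 8, "description": 56}
ROWS = 1_000_000_000  # assumed row count


def bytes_scanned(rows, column_bytes, needed, columnar):
    """Bytes a full scan reads to answer a query touching the `needed` columns."""
    if columnar:
        # A column store reads only the columns the query references.
        per_row = sum(column_bytes[name] for name in needed)
    else:
        # A row store reads every column of every row.
        per_row = sum(column_bytes.values())
    return rows * per_row


# Query such as SUM(amount): only one column is needed.
row_store = bytes_scanned(ROWS, COLUMN_BYTES, ["amount"], columnar=False)
column_store = bytes_scanned(ROWS, COLUMN_BYTES, ["amount"], columnar=True)

print(row_store / 1e9, "GB")     # 80.0 GB
print(column_store / 1e9, "GB")  # 8.0 GB
print(row_store / column_store)  # 10.0
```

The ratio depends entirely on how wide the table is relative to the columns a query touches; the tenfold gap here is a consequence of the assumed widths, not a general result.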
DBMS “overhead”
- Raw license and maintenance fees – software increasing fraction of total
- OLTP vestiges – locking and all that
- DBAs
  - People costs = huge fraction of total
  - Index-lightness addresses this
  - So does appliance
- Many people don’t really know how to write SQL
Configuration
- Appliance/tightly-balanced
  - Netezza
  - Teradata earlier
  - Greenplum/Sun
  - Oracle
  - IBM
  - Microsoft/Madison
- Commodity/do what you want
  - Vertica
  - Greenplum now
  - Infobright, Aster and others
  - MapReduce-oriented file systems
- Extreme rigidity is silly
  - Teradata, Oracle have both signaled moving to more modularity
  - Big driver of that = heterogeneous storage
    - Cheap disk
    - Expensive disk
    - Solid-state
    - RAM
  - CPU/storage ratio is even more of a driver

Theoretically defensible ways to segment the market

Latency requirements
- High availability and low latency go together
Query types
- Simultaneous users for same
Database size
Budget

Actual segments right now

Utter ADW/EDW
Data mart
- Size
- Naturally columnar vs. naturally row-based
Operational/frontline
Less dramatic/smaller EDW

Categories: Analytic technologies, Archiving and information preservation, Business intelligence, Cloud computing, Clustering, Columnar database management, Data warehouse appliances, Data warehousing, DBMS product categories, Humor, Investment research and trading, Log analysis, MapReduce, Market share and customer counts, NoSQL, OLTP, Open source, Parallelization, Presentations, Pricing, Solid-state memory, Storage, Telecommunications, Theory and architecture, Web analytics


Comments

6 Responses to “Boston Big Data Summit keynote outline”

Jerome Pineau on November 23rd, 2009 8:02 pm

Really cool list. Do you consider Aster (and Vertica, which I don’t believe you mentioned) the only viable/existing cloud players in the big data world or do you see others in there?

Thanks
J.
Curt Monash on November 23rd, 2009 8:27 pm

Jerome,

No promises of completeness are implied.

But last I looked, Aster and Vertica had a little more cloud track record — with “little” being the operative word — than most of the others.
Malcolm on November 28th, 2009 6:19 pm

Interested in your comment, “Many people don’t really know how to write SQL”.

I think it’s even worse than that: many web developers are writing using frameworks that don’t allow them to easily see the SQL generated, so they don’t even think to tune it. Maybe we have become blase, thinking that the database problem is “solved”, because it’s invisible.

However, I agree that even if they could see the SQL, most developers wouldn’t know what to do with it. Database and query tuning remain black arts.
The Naming of the Foo | DBMS2 -- DataBase Management System Services on March 18th, 2010 1:19 am

[…] It’s particularly hard to describe NoSQL (Not Only SQL) accurately, given the basic confusion as to what NoSQL is all about. […]
8 not very technical problems with analytic technology | DBMS2 -- DataBase Management System Services on May 8th, 2010 8:33 am

[…] expense of expertise. Highly skilled Oracle DBAs are expensive. The same can be said for many other categories of people, whether in IT or business units, needed […]
Notes and cautions about new analytic technology | DBMS 2 : DataBase Management System Services on April 8th, 2011 12:00 am

[…] for my Boston Big Data Summit (no relation to Aster Data’s Big Data Summit series) talk in October, 2009 Categories: […]
