Thoughts on “data science”
Teradata is paying me to join a panel on “data science” in downtown Boston, Tuesday May 22, at 3:00 pm. A planning phone call led me to jot down a few notes on the subject, which I’m herewith adapting into a blog post.
For starters, I have some concerns about the concepts of data science and data scientist. Too often, the term “data scientist” is used to suggest that one person needs to have strong skills both in analytics and in data management. But in reality, splitting those roles makes perfect sense. Further:
- It may or may not make sense to say that a computer scientist is doing “science”; the term “data scientist” inherits that ambiguity.
- It may or may not make sense to say that a corporate scientist is doing “science”; for example, a petroleum geologist might do very valuable work without making any scientific discoveries. The term “data scientist” inherits that ambiguity too.
- Too often, people use the term big data as if it were something radically new, rather than a continuation of what has been done in large-scale analytic data management for decades. “Data science” has a similar problem.
- The term “data science” sounds as if you need specialized academic training to do it, which isn’t really true.
The leader in raising these issues is probably Neil Raden.
But there’s one respect in which I think the term “data science” is highly appropriate. In conventional science, gathering data is just as much of an accomplishment as analyzing it. Indeed, most Nobel Prizes are given for experimental results. Similarly, if you’re doing data science, you should be thinking hard about how to corral ever more useful data. Techniques include but are not limited to:
- Keeping data you used to throw away. This has driven a lot of growth in relational data warehouses and big bit buckets alike.
- Bribing customers and prospects. Loyalty cards are the paradigmatic example.
- Split testing. The more internet-based users you have, the more tests you can do.
- Storing derived data. That can be as simple as pre-computing the scores from your predictive analytics model, or it can be as complex as running a 50-step sequence of Hadoop jobs.
- Getting data from third parties, for example:
  - Supply chain partners (right now this rarely amounts to more than simple BI, but that could change in the future).
  - Data vendors of various kinds (e.g. credit bureaus).
  - Social media/the internet in general, which also usually involves some kind of service provider.
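To make the split-testing item a bit more concrete, here is a minimal sketch of how one might compare two page variants. It assumes a plain two-proportion z-test, which is only one of several reasonable ways to evaluate a split test; the function name `two_proportion_z_test` and the visitor and conversion counts are hypothetical, chosen purely for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a, conv_b: number of conversions in variants A and B (hypothetical inputs)
    n_a, n_b:       number of users shown variants A and B
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis that A and B perform alike.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Same observed rates (10% vs. 15%), but ten times as many users in the second test.
small_z, small_p = two_proportion_z_test(10, 100, 15, 100)
large_z, large_p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"100 users per variant:  z={small_z:.2f}, p={small_p:.3f}")
print(f"1000 users per variant: z={large_z:.2f}, p={large_p:.4f}")
```

With identical observed conversion rates, the larger test produces a much smaller p-value. That is one way to see why a bigger user base lets you run more tests, or detect smaller differences, than a small one.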
Comments
4 Responses to “Thoughts on “data science””
Great feedback – I do think one place “data scientist” is appropriate is the scientist who is now using tech to collect data and do analysis.
Not different from previously, except that with the ubiquity of sensors, gathering data about the physical world is easier.
That applies to industrial processes, to running your car, even to a modern exercise monitor like the high-end Polar, Garmin and Suunto models: my Polar RS800CX has more instrumentation than my first car! (By count, about 3 times as much.)
Pulling in data from these different types of sensors, and then applying statistical analysis methods: that’s data *science*.
I think the essence of data science is the set of techniques and activities necessary to arrive at actionable insight.
Nice summary. I also think that in real life, in many cases, the collection/load/transformation etc. is actually done by “data engineers” (that seems to be the term in fashion, I guess) and the analysis after that by “data scientists”, but I could be wrong.
“Data scientist” is how we refer to analysts who do not depend on user-friendly tools and vendor-defined OOTB “solutions”.
For the record, none of the generic techniques cited — from retaining data previously discarded, to leveraging experimental design, to leveraging third party data — are new. Technology, however, has advanced the frontier of what is commercially viable.