Text
Analysis of data management technology optimized for text data. Related subjects include:
- Native XML database management
- (in Text Technologies) More extensive coverage of text search
How to beat “fake news”
Most observers hold several or all of the following views:
- “Fake news” and the like are severe problems.
- Algorithmic solutions have not worked well to date.
- Neither have manual ones.
- Trusting governments to censor is a bad idea.
- In light of the previous points, trusting large social media corporations to censor is a bad idea too.
- Educating consumers to evaluate news and opinions accurately would be … difficult.
And further:
- Whatever you think of the job traditional journalistic organizations previously did as news arbiters, they can’t do it as well anymore, for a variety of economic, structural and societal reasons.
But despite all those difficulties, I also believe that a good solution to news/opinion filtering is feasible; it just can’t be as simple as everybody would like.
Categories: Public policy, Text | 9 Comments |
Brittleness, Murphy’s Law, and single-impetus failures
In my initial post on brittleness I suggested that a typical process is:
- Build something brittle.
- Strengthen it over time.
In many engineering scenarios, a fuller description could be:
- Design something that works in the base cases.
- Anticipate edge cases and sources of error, and design for them too.
- Implement the design.
- Discover which edge cases and error sources you failed to consider.
- Improve your product to handle them too.
- Repeat as needed.
So it’s necessary to understand what is or isn’t likely to go wrong. Unfortunately, that need isn’t always met. Read more
Categories: Analytic technologies, Text | 5 Comments |
MongoDB 3.4 and “multimodel” query
“Multimodel” database management is a hot new concept these days, notwithstanding that it’s been around since at least the 1990s. My clients at MongoDB of course had to join the train as well, but they’ve taken a clear and interesting stance:
- A query layer with multiple ways to query and analyze data.
- A separate data storage layer in which you have a choice of data storage engines …
- … each of which has the same logical (JSON-based) data structure.
When I pointed out that it would make sense to call this “multimodel query” — because the storage isn’t “multimodel” at all — they quickly agreed.
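As a rough illustration of that layering, here is a minimal Python sketch. It is not MongoDB's API: the class names (`InMemoryEngine`, `SerializedEngine`, `QueryLayer`) and their methods are hypothetical, chosen only to show several read paths sitting over interchangeable storage engines that all hold the same JSON-like documents.

```python
import json
from collections import Counter


class InMemoryEngine:
    """Hypothetical storage engine: keeps documents as Python dicts."""

    def __init__(self):
        self._docs = []

    def write(self, doc):
        self._docs.append(dict(doc))

    def scan(self):
        return [dict(d) for d in self._docs]


class SerializedEngine:
    """Hypothetical alternative engine: stores each document as a JSON string."""

    def __init__(self):
        self._rows = []

    def write(self, doc):
        self._rows.append(json.dumps(doc))

    def scan(self):
        return [json.loads(r) for r in self._rows]


class QueryLayer:
    """Several ways to read, one way to write, whichever engine is plugged in."""

    def __init__(self, engine):
        self.engine = engine

    def insert(self, doc):
        # The single write path: a JSON-like document goes to the engine.
        self.engine.write(doc)

    def find(self, **equals):
        # Document-style read: match on field equality.
        return [d for d in self.engine.scan()
                if all(d.get(k) == v for k, v in equals.items())]

    def count_by(self, field):
        # Aggregation-style read over the same documents.
        return Counter(d.get(field) for d in self.engine.scan())

    def text_search(self, field, term):
        # Naive text-style read: case-insensitive substring match.
        return [d for d in self.engine.scan()
                if term.lower() in str(d.get(field, "")).lower()]
```

Swapping one engine for the other changes nothing the query layer can see, which is the sense in which the storage here is not "multimodel" while the querying is.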
To be clear: While there are multiple ways to read data in MongoDB, there’s still only one way to write it. Letting that sink in helps clear up confusion as to what about MongoDB is or isn’t “multimodel”. To spell that out a bit further: Read more
Categories: Database diversity, Emulation, transparency, portability, MongoDB, MySQL, NoSQL, Open source, RDF and graphs, Structured documents, Text | 4 Comments |
Rapid analytics
“Real-time” technology excites people, and has for decades. Yet the actual, useful technology to meet “real-time” requirements remains immature, especially in cases which call for rapid human decision-making. Here are some notes on that conundrum.
1. I recently posted that “real-time” is getting real. But there are multiple technology challenges involved, including:
- General streaming. Some of my posts on that subject are linked at the bottom of my August post on Flink.
- Low-latency ingest of data into structures from which it can be immediately analyzed. That helps drive the (re)integration of operational data stores, analytic data stores, and other analytic support — e.g. via Spark.
- Business intelligence that can be used quickly enough. This is a major ongoing challenge. My clients at Zoomdata may be thinking about this area more clearly than most, but even they are still in the early stages of providing what users need.
- Advanced analytics that can be done quickly enough. Answers there may come through developments in anomaly management, but that area is still in its super-early days.
- Alerting, which has been under-addressed for decades. Perhaps the anomaly management vendors will finally solve it.
2. In early 2011, I coined the phrase investigative analytics, about which I said three main things: Read more
Notes on the transition to the cloud
1. The cloud is super-hot. Duh. And so, like any hot buzzword, “cloud” means different things to different marketers. Four of the biggest things that have been called “cloud” are:
- The Amazon cloud, Microsoft Azure, and their competitors, aka public cloud.
- Software as a service, aka SaaS.
- Co-location in off-premises data centers, aka colo.
- On-premises clusters (truly on-prem or colo as the case may be) designed to run a broad variety of applications, aka private cloud.
Further, there’s always the idea of hybrid cloud, in which a vendor peddles private cloud systems (usually appliances) running similar technology stacks to what they run in their proprietary public clouds. A number of vendors have backed away from such stories, but a few are still pushing them, including Oracle and Microsoft.
This is a good example of Monash’s Laws of Commercial Semantics.
2. Due to economies of scale, only a few companies should operate their own data centers, aka true on-prem(ises). The rest should use some combination of colo, SaaS, and public cloud.
This fact now seems to be widely understood.
What is AI, and who has it?
This is part of a four-post series spanning two blogs.
- One post gives a general historical overview of the artificial intelligence business.
- One post specifically covers the history of expert systems.
- One post (this one) gives a general present-day overview of the artificial intelligence business.
- One post explores the close connection between machine learning and (the rest of) AI.
1. “Artificial intelligence” is a term that usually means one or more of:
- “Smart things that computers can’t do yet.”
- “Smart things that computers couldn’t do until recently.”
- “Technology that has emerged from the work of computer scientists who said they were doing AI.”
- “Underpinnings for other things that might be called AI.”
But that covers a lot of ground, especially since reasonable people might disagree as to what constitutes “smart”.
2. Examples of what has been called “AI” include:
- Rule-based processing, especially if it is referred to as “expert systems”.
- Machine learning.
- Many aspects of “natural language processing” — a term almost as overloaded as “artificial intelligence” — including but not limited to:
- Text search.
- Speech recognition, especially but not only if it seems somewhat lifelike.
- Automated language translation.
- Natural language database query.
- Machine vision.
- Autonomous vehicles.
- Robots, especially but not only ones that seem somewhat lifelike.
- Automated theorem proving.
- Playing chess at an Elo rating of 1600 or better.
- Beating the world champion at chess.
- Beating the world champion at Jeopardy.
- Anything that IBM brands or rebrands as “Watson”.
Categories: IBM and DB2, Text | 5 Comments |
Sources of differentiation
Obviously, a large fraction of what I write about involves technical differentiation. So let’s try for a framework where differentiation claims can be placed in context. This post will get through the generalities. The sequels will apply them to specific cases.
Many buying and design considerations for IT fall into six interrelated areas: Read more
DataStax and Cassandra update
MongoDB isn’t the only company I reached out to recently for an update. Another is DataStax. I chatted mainly with Patrick McFadin, somebody with whom I’ve had strong consulting relationships at both a user and a vendor. But Rachel Pedreschi contributed the marvelous phrase “twinkling dashboard”.
It seems fair to say that in most cases:
- Cassandra is adopted for operational applications, specifically ones with requirements for extreme uptime and/or extreme write speed. (Of course, it should also be the case that NoSQL data structures are a good fit.)
- Spark, including SparkSQL, and Solr are seen primarily as ways to navigate or analyze the resulting data.
Those generalities, in my opinion, make good technical sense. Even so, there are some edge cases or counterexamples, such as:
- DataStax trumpets British Gas’ plans to collect a lot of sensor data and immediately offer it up for analysis.*
- Safeway uses Cassandra for a mobile part of its loyalty program, scoring customers and pushing coupons at them.
- A large title insurance company uses Cassandra-plus-Solr to manage a whole lot of documents.
*And so a gas company is doing lightweight analysis on boiler temperatures, which it regards as hot data. 🙂
While most of the specifics are different, I’d say similar things about MongoDB, Cassandra, or any other NoSQL DBMS that comes to mind: Read more
MongoDB update
One pleasure in talking with my clients at MongoDB is that few things are NDA. So let’s start with some numbers:
- >2,000 named customers, the vast majority of which are unique organizations who do business with MongoDB directly.
- ~75,000 users of MongoDB Cloud Manager.
- Estimated ~1/4 million production users of MongoDB total.
Also >530 staff, and I think that number is a little out of date.
MongoDB lacks many capabilities RDBMS users take for granted. MongoDB 3.2, which I gather is slated for early November, narrows that gap, but only by a little. Features include:
- Some JOIN capabilities.
- Specifically, these are left outer joins, so they’re for lookup but not for filtering.
- JOINs are not restricted to specific shards of data …
- … but do benefit from data co-location when it occurs.
- A BI connector. Think of this as a MongoDB-to-SQL translator. Using this does require somebody to go in and map JSON schemas and relational tables to each other. Once that’s done, the flow is:
- Basic SQL comes in.
- Filters and GroupBys are pushed down to MongoDB. A result set … well, it results. 🙂
- The result set is formatted into a table and returned to the system — for example a business intelligence tool — that sent the SQL.
- Database-side document validation, in the form of field-specific rules that combine into a single expression against which to check a document.
- This is fairly simple stuff — no dependencies among fields in the same document, let alone foreign key relationships.
- MongoDB argues, persuasively, that this simplicity makes it unlikely to recreate the spaghetti code maintenance nightmare that was 1990s stored procedures.
- MongoDB concedes that, for performance, it will ordinarily be a good idea to still do your validation on the client side.
- MongoDB points out that enforcement can be either strict (throw errors) or relaxed (just note invalid documents to a log). The latter option is what makes it possible to install this feature without breaking your running system.
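Two of the features above are easier to see in code. What follows is a minimal plain-Python sketch of the semantics only, not MongoDB's actual syntax or API: the function names, parameters and the `mode` values are assumptions made for illustration. It shows why a left outer join serves lookup but not filtering, and how field-specific rules can be enforced either strictly or in a relaxed, log-only way.

```python
def left_outer_lookup(left_docs, right_docs, local_field, foreign_field, as_field):
    """Attach matching right-side documents to every left-side document.

    Left documents with no match are kept (with an empty list), which is
    why a left outer join serves lookup but filters nothing out.
    """
    index = {}
    for r in right_docs:
        index.setdefault(r.get(foreign_field), []).append(r)
    return [{**d, as_field: index.get(d.get(local_field), [])} for d in left_docs]


def validate(doc, rules, mode="strict", log=None):
    """Check a document against field-specific rules, all of which must pass.

    `rules` maps a field name to a predicate on that field's value alone,
    so no rule can depend on another field. Returns True if the document
    is valid. On failure, mode="strict" raises an error, while
    mode="relaxed" appends the failure to `log` and returns False, leaving
    the caller free to store the document anyway.
    """
    failed = [field for field, ok in rules.items() if not ok(doc.get(field))]
    if not failed:
        return True
    if mode == "strict":
        raise ValueError(f"invalid fields: {failed}")
    if log is not None:
        log.append((doc, failed))
    return False
```

In relaxed mode an existing application keeps running while the log shows which documents would have been rejected, which matches the "install without breaking your running system" point above.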
There’s also a closed-source database introspection tool coming, currently codenamed MongoDB Scout. Read more
Categories: Business intelligence, EAI, EII, ETL, ELT, ETLT, Market share and customer counts, MongoDB, NoSQL, Open source, Structured documents, Text | 6 Comments |
IT-centric notes on the future of health care
It’s difficult to project the rate of IT change in health care, because:
- Health care is suffused with technology — IT, medical device and biotech alike — and hence has the potential for rapid change. However, it is also the case that …
- … health care is heavily bureaucratic, political and regulated.
Timing aside, it is clear that health care change will be drastic. The IT part of that starts with vastly comprehensive electronic health records, which will be accessible (in part or whole as the case may be) by patients, care givers, care payers and researchers alike. I expect elements of such records to include:
- The human-generated part of what’s in ordinary paper health records today, but across a patient’s entire lifetime. This of course includes notes created by doctors and other care-givers.
- Large amounts of machine-generated data, including:
- The results of clinical tests. Continued innovation can be expected in testing, for reasons that include:
- Most tests exploit electronic technology. Progress in electronics is intense.
- Biomedical research is itself intense.
- In particular, most research technologies (for example gene sequencing) can be made cheap enough over time to be affordable clinically.
- The output of consumer health-monitoring devices — e.g. Fitbit and its successors. The buzzword here is “quantified self”, but what it boils down to is that every moment of our lives will be measured and recorded.
The vastly greater amounts of data cited above will allow for greatly changed analytics.
Read more