Details of the JPMorgan Chase Oracle database outage
After posting my speculation about the JPMorgan Chase database outage, I was contacted by – well, by somebody who wants to be referred to as “a credible source close to the situation.” We chatted for a long time; I think it is very likely that this person is indeed what s/he claims to be; and I am honoring his/her requests to obfuscate many identifying details. However, I need a shorter phrase than “a credible source close to the situation,” so I’ll refer to him/her as “Deep Packet.”
According to Deep Packet,
- The JPMorgan Chase database outage was caused by corruption in an Oracle database.
- This Oracle database stored user profiles, which are more than just authentication data.
- Applications that went down include but may not be limited to:
  - The main JPMorgan Chase portal.
  - JPMorgan Chase’s ability to use the ACH (Automated Clearing House).
  - Loan applications.
  - Private client trading portfolio access.
- The Oracle database was back up by 1:12 Wednesday morning. But on Wednesday a second problem occurred, namely an overwhelming number of web requests. This turned out to be a cascade of retries in the face of – and of course exacerbating – poor response time (there’s a sketch of the usual mitigation right after this list). While there was no direct connection to the database outage, Deep Packet is sympathetic to my suggestions that:
- Network/app server traffic was bound to be particularly high as people tried to get caught up after the Tuesday outage, or just see what was going on in their accounts.
- Given that Deep Packet said there was a definite operator-error contributing cause, perhaps the error would not have happened if people weren’t so exhausted from dealing with the database outage.
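As an aside on the retry cascade: the standard mitigation is for clients to back off rather than retry immediately. Here is a minimal sketch of capped exponential backoff with jitter; it is purely illustrative, and implies nothing about how JPMorgan Chase’s applications are actually written.

```python
import random
import time

def call_with_backoff(request_fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Call request_fn, retrying on failure with capped exponential backoff.

    Full jitter (a random delay between 0 and the current cap) keeps a burst of
    failing clients from retrying in lockstep and amplifying the overload.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up instead of hammering an overloaded service
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))

# Hypothetical usage:
# call_with_backoff(lambda: fetch_account_summary(customer_id))
```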
Deep Packet stressed the opinion that the Oracle outage was not the fault of JPMorgan Chase (the Wednesday slowdown is a different matter), but rather can be blamed on an Oracle bug. However, Deep Packet was not able to immediately give me details as to root cause, or for that matter which version of Oracle JPMorgan Chase was using. Sources for that or other specific information would be much appreciated, as would general confirmation/disconfirmation of anything in this post.
Metrics and other details supplied by Deep Packet include:
- The Oracle database was restored from a Saturday night backup. 874K transactions were reapplied, starting early Tuesday morning and ending late Tuesday night.
- $132 million in ACH transfers were held up by the JPMorgan Chase database outage.
- Roughly 1,000 auto loan applications and a similar number of student loan applications were lost due to the outage.
- The Oracle cluster has 8 biggish Solaris boxes (T5420s, with 64 GB of RAM each).
- EMC is the storage provider. In early troubleshooting, EMC hardware – specifically a SAN controller – was suspected of causing the problem, but that was ruled out at some point Monday night.
- JPMorgan Chase’s whole fire drill started at 7:38 Monday night, when the slowdown was noticed. Recognition that the problem was database related was very quick (before 8 pm).
- Before long, JPMorgan Chase DBAs realized that the Oracle database was corrupted in about 4 files, and the corruption was mirrored on the hot backup. Hence the manual database restore starting early Tuesday morning.
- And by the way, even before all this started JPMorgan Chase had an open project to look into replacing Oracle, perhaps with DB2.
One point that jumps out at me is this – not everything in that user profile database needed to be added via ACID transactions. The vast majority of updates are surely web-usage-log kinds of things that could be lost without impinging on the integrity of JPMorgan Chase’s financial dealings, not too different from what big web companies use NoSQL (or sharded MySQL) systems for. Yes, some of it is orders for the scheduling of payments and so on – but on the whole, the database was probably over-engineered, introducing unnecessary brittleness to the overall system.
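To make that concrete, here is a minimal sketch of the kind of split I have in mind, with hypothetical names and no claim about Chase’s actual code: money-moving profile updates stay under ACID control, while usage-log-type events go down a fire-and-forget path whose loss would be annoying rather than catastrophic.

```python
import json
import queue
import sqlite3  # stand-in for the transactional DBMS

usage_log_queue = queue.Queue()           # stand-in for a log pipeline / NoSQL store
profile_db = sqlite3.connect(":memory:")  # stand-in for the ACID profile database
profile_db.execute(
    "CREATE TABLE payment_schedule (user_id TEXT, pay_on TEXT, amount_cents INTEGER)"
)

def schedule_payment(user_id, pay_on, amount_cents):
    """Money-moving updates stay under ACID control."""
    with profile_db:  # commits on success, rolls back on error
        profile_db.execute(
            "INSERT INTO payment_schedule VALUES (?, ?, ?)",
            (user_id, pay_on, amount_cents),
        )

def record_page_view(user_id, page):
    """Usage-log events are fire-and-forget; losing a few is acceptable."""
    usage_log_queue.put(json.dumps({"user_id": user_id, "page": page}))

schedule_payment("u123", "2010-09-20", 50000)
record_page_view("u123", "/accounts/summary")
```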
Comments
[…] Edit: Subsequent to making this post, I obtained more detail about the JP Morgan Chase database outage. […]
“The vast majority of updates are surely web-usage-log kinds of things that could be lost without impinging on the integrity of JPMorgan Chase’s financial dealings”
I disagree there. Even the less critical data can’t be lost in this sort of company, purely for auditing reasons.
They need to know there was a failed/successful login from ip x.x.x.x at x:09 PM. Is it as important as an actual transaction? No, but they need the data nonetheless.
Do they need to know that a browser claiming to be Firefox loaded the main page, then left? Probably not, but separating a lot of that data is what leads to over-engineering.
Chris,
Actually, I have a hunch that the same project queued up to replace the Oracle database will in fact break it back up by application, simplifying it somewhat. Does that mean any part can go non-transactional? Not necessarily.
But I’m not sure that they need an ACID-compliant security audit trail even on authentication attempts, which seems to be what you were suggesting.
In fact, I’d go so far as to suggest that even at banks, national security departments, etc., there actually are lots and lots of perimeter security devices that don’t keep all authentication request data under the control of an ACID-compliant DBMS.
“Yes, some of it is orders for the scheduling of payments and so on – but on the whole, the database was probably over-engineered, introducing unnecessary brittleness to the overall system.”
Under the assumption that you have not worked on or audited the application, how can you possibly speculate that the system was over-engineered?
The same way I comment on other technology I didn’t write personally, which includes almost everything I talk about on this blog.
“Before long, JPMorgan Chase DBAs realized that the Oracle database was corrupted in about 4 files, and the corruption was mirrored on the hot backup. Hence the manual database restore starting early Tuesday morning.”
Having had some past exposure to the Chase environment, I can say this is not the first time an issue of DB corruption has caused an extended outage.
Reading your synopsis, yes, I have to agree with the over-engineering comment, but I would extend it across the whole architecture, which has led to many moving transactions, messages, and points of failure that require constant attention.
What is worth questioning is how well balanced JPC’s high-availability strategy is. From what has been described, this is common with physical replication tools, and I have seen many occurrences where this type of replication cascades corrupted files to the secondary/backup nodes. Reading the technologies involved, and knowing that JPC is somewhere between Oracle 9i and 10g, my assumption is that it’s using physical Data Guard, and possibly SRDF (leading to some finger pointing at EMC).
Thanks, Ryan!
[…] insight into technology and marketplace trends.”, has some very interesting details on his blog, including this: “…even before all this started JPMorgan Chase had an open project to […]
…which leads to the interesting question: Does anyone know of a setup in which one replicates *with a delay*? That is, there’s a master and a replica to which updates are always applied with a 15-minute delay. If the master blows up, you stop the updates to the replica until you’re sure you can make them safely, then continue. (Obviously, 15 minutes is an arbitrary value – you trade off the delay in restarting against the window you have in which to detect a problem. Given enough funds, you could have multiple replicas, but even someone like JPMorgan Chase would have trouble justifying that.)
In effect, “hot backup” rather than “hot standby”.
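The essential mechanism is just an apply process that deliberately lags the change stream, giving you a window in which to halt it if the primary turns out to be shipping corruption. A conceptual sketch, with hypothetical names and no particular DBMS in mind:

```python
import time

APPLY_DELAY_SECONDS = 15 * 60  # the deliberate lag; tune against recovery-time needs

def delayed_apply(change_log, apply_change, corruption_suspected):
    """Apply each logged change only once it is older than APPLY_DELAY_SECONDS.

    change_log yields (commit_timestamp, change) pairs in commit order;
    corruption_suspected() is an operator-controlled flag that pauses apply.
    """
    for committed_at, change in change_log:
        # Wait until the change has aged past the delay window.
        while time.time() - committed_at < APPLY_DELAY_SECONDS:
            time.sleep(1)
        # If the primary is suspected of shipping corruption, hold everything.
        while corruption_suspected():
            time.sleep(5)
        apply_change(change)
```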
— Jerry
RJP – “What is worth questioning is how well balanced JPC’s high-availability strategy is. From what has been described, this is common with physical replication tools, and I have seen many occurrences where this type of replication cascades corrupted files to the secondary/backup nodes. Reading the technologies involved, and knowing that JPC is somewhere between Oracle 9i and 10g, my assumption is that it’s using physical Data Guard, and possibly SRDF (leading to some finger pointing at EMC).”
Okay, so why point at EMC? SRDF maintains consistency – and quickly at that. If the DBMS is corrupted, that corruption is obviously replicated to the failover site (active/passive mirror); the question should come down to whether SRDF alone is a suitable vehicle for near-real-time recovery from application error. Ryan, care to elaborate on the finger pointing?
I am going to abstain from the finger-pointing dialogue, but want to mention that SRDF does its job well; it is important, though, to assess the implementation and the what-ifs. There are solutions available to handle delaying the replication of data to the BCP site, etc. EMC’s RecoverPoint is one such solution. As always, not everything is a fit for everyone, and it needs to be designed, tested, and validated based on the needs.
They could have prevented this if they had a Data Guard standby with flashback, or a logical replica via GoldenGate. But oh well … as my friend says: my CIO has a contingency plan in case of disaster: find a new job. And disasters/bugs don’t happen too often.
[…] of the JPMorgan Chase Oracle database outage (Curt Monash/DBMS 2) Curt Monash / DBMS 2:Details of the JPMorgan Chase Oracle database outage — After posting my speculation about the JPMorgan Chase database outage, I was […]
I reported that some of the early analysis of the problem examined what turned out to be an INCORRECT theory that EMC was to blame. So it was wholly appropriate for Ryan to speculate on why anybody might have ever held that theory in the first place.
Oracle is a nightmare to patch and is increasingly vulnerable – NITS
http://www.computerworld.com/s/article/9057226/Update_Two_thirds_of_Oracle_DBAs_don_t_apply_security_patches
This is good reporting, but I take issue with the opinion making: “but on the whole, the database was probably over-engineered, introducing unnecessary brittleness to the overall system.”
Over-engineering a database does not cause file corruption. File corruption is typically caused by a disk error, hence the first assumption by the JPMorgan Chase group. The corruption appears (from this post) to have been caused by a bug in the Oracle DBMS itself. Over-engineering a database is unlikely to lead to the exposure of a software bug. Under-engineering or ignorant misuse of DB options seems more likely to expose bugs, and that seems unlikely given JP Morgan Chase’s buying/hiring power.
Over-engineering a DB can lead to a kind of brittleness. But it is misleading to suggest that kind of brittleness was a contributing factor to a bug failure in Oracle. (And I’m no fan of Oracle; I think that’s over-engineering right there!)
JPMC probably keeps these files/tables along with all their other files in the same DBMS, and if they don’t, then brittleness can arise from applications that interact across DBMSs and other repositories. And it’s those applications that give rise to brittleness, and even then they are very unlikely to produce a bug-driven Oracle DBMS failure.
[…] been no official incident report offering details of the outage, but database industry analystCurt Monashhas an interesting unofficial account. After writing about the incident last week, Monash was […]
I believe several folks have misunderstood Curt’s assessment of brittleness. I take it he was referring to how this over-engineering, rather than being a contributing *cause* of the outage, was likely a major factor in the difficulty of *ending* the outage quickly — and also in the failure to limit its effect to the authentication portion of the app, which ultimately led to a lengthy back-end outage that prevented processing of non-online transactions (which I believe was established in Curt’s prior post).
Then again, if JPMC split this up into a MySQL authentication DB + Oracle user profile DB scenario, and the Oracle back end had a corruption, it sure seems like you have most of the same issues. The only advantage you get is that customers can authenticate into a system that has an outage in a different DBMS downstream.
Some advantage, eh?
DH, no finger pointing, nor am I raising any flags about the reliability of SRDF; rather, as Curt pointed out, I was commenting on the possible reason why SRDF was suspected. As with many DR failures, doubt is cast upon every tool in the recovery process until it is ruled out.
As to Curt’s suggestion that over-engineering was a factor in the recovery process (that is, in resolving the matter and maintaining a low-MTTR strategy): having witnessed some architectures that fell into that category, I’d say the brittleness didn’t contribute to the failure, but it limited the options on recovery to the point where radical changes had to be implemented.
The other brittleness point is that we still don’t know what was corrupted. Was it authentication? Web log type of profile data? ACH instructions? If it really was authentication, that’s a good excuse for many apps to go down at once. If it was something else, perhaps more apps came down than really “needed” to.
If they had Data Guard configured and corruption was detected on the primary, the MRP process on the standby should have prevented it from being applied. They could have restored a copy of those 4 files from the standby and recovered them, which should have led to a smaller outage window. Taking this a step further, if they were/are on 11g, there is a new feature that does automatic block recovery from primary to standby and vice versa. The fact that corruption appears to have been replicated to the DR site leads me to believe they were probably using only a storage replication technology, which would replicate corruption too … well, that’s just my opinion.
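On the detection side, one simple check is Oracle’s V$DATABASE_BLOCK_CORRUPTION view, which RMAN populates after a VALIDATE or BACKUP VALIDATE run. A minimal sketch of polling it from Python with cx_Oracle follows; the connection details are placeholders, and it assumes validation jobs are already being run regularly.

```python
import cx_Oracle  # assumes the Oracle client libraries are installed

def corrupted_blocks(dsn, user, password):
    """Return the rows RMAN has flagged as corrupt since the last VALIDATE run."""
    conn = cx_Oracle.connect(user, password, dsn)
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT file#, block#, blocks, corruption_type "
            "FROM v$database_block_corruption"
        )
        return cur.fetchall()
    finally:
        conn.close()

# Hypothetical usage: alert if anything shows up.
# for row in corrupted_blocks("dbhost/ORCL", "monitor_user", "***"):
#     print("corrupt:", row)
```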
[…] subsequently posted that the outage was caused by corruption in an Oracle database which stored user profiles. Four […]
I think that, generally speaking, array-based replication (or any replication) is used for disaster recovery and so should happen in as near real time as possible. Technologies such as point-in-time copies or ‘snapshots’ are more usually used for recovery from corruption. The thing about corruption in a database is that you can never tell when it started, so how do you know how long to delay the application of changes to the remote copy? JPMC may well have used a snapshot to recover and then performed database recovery. That they lost data says something about the design, though.
[…] a post about the recent JPMorgan Chase database outage, I suggested that JPMorgan Chase’s user profile database was over-engineered, in that various […]
[…] recent JPMorgan Chase outage caused by an Oracle RAC block corruption places an old question back on the agenda that gets ignored way too often: How to tell whether you […]
[…] Vijayan of Computerworld did a story based on my reporting on the JP Morgan Chase Oracle outage. He did a good job, getting me to simplify some of what I said before. He also added a quote from […]
[…] Details of the JPMorgan Chase Oracle database outage | DBMS2 : DataBase Management System Services […]
[…] An anonymous tipster spent 2 ½ hours IMing with me to reveal the true cause of the JP Morgan Chase site outages. […]
[…] 2010 at 8:32 am (data, design, engineering, musings) The blogosphere is abuzz about JPMC outage (1, 2, 3). The basic reason people site for long recovery time is a big, ambitious database design […]
[…] Meanwhile, RJP supplied details about the JP Morgan Chase Oracle outage that my actual source didn’t know. […]
[…] Details of the JPMorgan Chase Oracle database outage | DBMS 2 : DataBase Management System Services (tags: database ha oracle jpmorgan outage) […]
[…] Source [DBMS2] photo credit: peterkaminski Admin’s comment: It’s surprisingly common for a relatively unimportant piece like logging to drag the whole system down. If that’s what happened here, it was truly a cart-before-the-horse kind of crash. […]
> Jerry Leichter Said:
> Does anyone know of a setup in which one
> replicates *with a delay*?
MongoDB has a configurable slave delay.
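For the record, a rough sketch of setting that up via pymongo (the host name is hypothetical; the field is slaveDelay in MongoDB versions of that era, secondaryDelaySecs in current releases, and a delayed member is normally also made hidden with priority 0 so it can’t be elected or read from):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://primary-host:27017")  # hypothetical host

# Fetch the current replica set config, bump its version, and delay member 2.
cfg = client.admin.command("replSetGetConfig")["config"]
cfg["version"] += 1
member = cfg["members"][2]
member["priority"] = 0      # never eligible to become primary
member["hidden"] = True     # invisible to normal client reads
member["slaveDelay"] = 900  # apply changes 15 minutes behind the primary

client.admin.command("replSetReconfig", cfg)
```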
[…] a third choice, and it’s often misunderstood. Let’s start by looking at a blog post on the JPMorgan Chase incident. The author makes the following observation, with which I agree: […]
[…] there can be considerable costs to giving them what they don’t need. A classic example is the 2010 Chase fiasco, in which recovery from an Oracle outage was delayed by database clutter that would have fit better […]
[…] and another 1,000 student loan applications were lost due to the outage, Monash said in a blog post detailing his conversation with the […]
[…] an additional monitoring tool from Oracle. The most recent news about an outage experienced by major bank JP Morgan Chase shows the seriousness of database outages for businesses. In addition, Oracle RAC uses many shared […]