An ERP database is different and more complex; it is designed for use with an enterprise-wide integrated software system, which may include components from different functional areas in the company, like accounting, manufacturing, and human resources. I have been at customer sites that employ dozens of DBAs but have only 1 or 2 ERP DBAs. I've seen some companies split the ERP DBA job into two roles: the software administrator and the physical DBA. The former handles software configuration, patching, and application administration; the latter handles database operations and maintenance, backups, etc.
- Read the Documentation
This may seem to be an obvious point, but let me provide an explanation. I will also draw on this scenario in a follow-up discussion on platform differences. I've had a lot of experience with Oracle's ERP Suite, traditionally called Oracle Applications (Financials, Manufacturing, etc.) or more recently referenced as e-Business Suite.
I was an hourly subcontractor to an outsourcing company doing operations work for the park district in a large Midwestern city. My Apps DBA experience has ranged from 10.7, including a client-server architecture, to the more web-oriented 11.0.x, to 11i, to the current R12. The city had an aggressive 3-month upgrade schedule to go from 11.0.x to 11i. One reason I had been selected for the operations role was that I had experience with both versions. The city had awarded the upgrade project to a different consulting unit of the vendor. (The project had a tight deadline because the city wanted it done 6 weeks before the end of the calendar year, ahead of end-of-year processing. What the client didn't know was that the vendor was staffing the project not with full-time employees but with subcontractors like me. The vendor project manager did not ask for my input in hiring the 3 or so project DBAs. I was not asked to be on the project, in part because of the different funding sources.)
The platform was Windows NT; there are some nuances in running Oracle on Windows. For purposes of the present discussion it suffices to know that the Windows registry presents issues. Additional background: the upgrade process was guided by the installation of the RapidWiz utility. The stages of the upgrade are called categories. The upgrade usually includes a higher version both of the ERP software and the database. Now Oracle also allows a way for customers to upgrade their database separately from the application software, but you generally have to run a series of interoperability patches on the application layer to accommodate a higher version database. (We'll return to this point later in the post.)
In a broader view, the upgrade includes a series of steps to prepare the database for the upgrade. You then shut down the database and upgrade the database if necessary (see above discussion). You then run the Apps driver patch and then complete post-upgrade activities. Most of what RapidWiz does is install software for the new database and software version. It's not really needed for the closeout steps for the database prior to upgrade--with one major exception: some of the scripts used for the closeout steps are installed with RapidWiz. (If I had been Oracle, I would have bundled these separately.)
Now the Apps upgrade manual said, in boldface type in a centered paragraph: do NOT run this on a Windows server which already has Oracle server software installed. I had given the upgrade team a production database clone with Oracle database server software installed and properly configured. Oracle didn't spell out the reasons why you should avoid running RapidWiz on such a server, but they are clear from context if you understand the implications of the Windows registry. For example, I believe 11.0.3 had RDBMS 8.0.x installed and RapidWiz installed version 8i. If you try to bring up a lower-version database with higher-version database software, the software is going to balk and tell you that you need to upgrade the database first.
Now the project team DBAs really didn't talk to me about their activities other than in an all-hands morning staff meeting. I was in the server room working on the production server when I heard the project DBAs in a general state of confusion. They explained they had just bounced the server, and the database and listener wouldn't come up. The lead guy, CB, was a piece of work; I quickly pointed out they were trying to run on 8i executables and asked them if they had run RapidWiz. They conceded the obvious. They didn't have a clue how to proceed; I went in and made a number of changes to the registry to point back to the version 8 database server locations. (CB, of course, protested that Microsoft and/or Oracle doesn't support hacking the registry. Of course, Oracle didn't support what he had done to cause the problem, either.)
I later tried to educate CB on the meaning of the warning. I said, "Look, I can install RapidWiz on a fresh server and network-share the scripts directory so you can finish up your closeout activities." CB demanded to know where Oracle said to do that, as if I had control over Oracle software deployment or documentation; even if Oracle didn't flesh out a solution, it was fairly obvious given the context. For example, you have a new database server with 11i installed, and you migrate the production database to that server after closeout. CB came up with a convoluted interpretation of the plain-word meaning of the warning, which the consulting PM and my boss seemed to buy into, and my boss told me I had to open a ticket with Oracle Tech Support to back up my understanding. The analyst was unsympathetic to my case: "Didn't you tell them that violating warnings is against Technical Support agreements and we won't support future Apps issues?" I said yes, but this was politics, and I needed a piece of paper telling them that Oracle meant what it said in that boldface warning.
- ERP Databases Can Have Different Requirements
Take one example to make the point. The default character set for databases in the US distribution is US7ASCII. DBAs should be generally aware that this parameter can be modified only at the time the DBA creates the database. The traditional remedy for a misspecified character set is to create a new database with the non-default character set and do an export/import. If you are talking about databases several gigabytes or terabytes in size, this can be nontrivial.
The scenario for this example involved the US subsidiary of a European company which operates a chain of airport stores; the subsidiary is located in a Baltimore suburb, and my boss was okay with me commuting to projects outside of the Chicago area, where I lived at the time. My new DC-area consulting company employer had been targeting me for an Oracle Apps project at a major tobacco products company headquartered in NYC. The project had been delayed when a female company DBA who had gotten into an argument with the client's female VP was walked off the project in just her second week. So I was tasked to replace her. It was just a couple of weeks before Christmas, and I was worried about the political baggage I would inherit. You want your lead-off project to be a success story; I know stories of DBAs literally being walked out of a job on their second day (not me personally). On the plus side, I got to drive a Mercedes to work. Not my idea, by the way. My boss had hired a tech manager who had made the lease of a Mercedes a condition of employment. The manager had been caught by the boss pursuing his own business interests at Oracle's annual conference. They were stuck with the lease of the Mercedes once the tech manager was terminated. I tend to be cheap when spending someone else's money; I will sometimes even take the cheapest subcompact when my boss would sign off on an SUV. In this case, it made no sense to pay for a second vehicle.
On my first day, I met with the production DBA, who wasn't a client employee but a contractor who had advanced to a DBA role from his prior work as a developer. When I first saw him, he had just kicked off the Smart Client driver patch for the test system running 10.7 Apps. The client had just licensed Oracle Human Resources in conjunction with the already installed Oracle Financials. The HR component required the GUI component of Smart Client, which was not installed in production.
I had run through this driver patch probably a dozen times without incident. When I got to his desk, the driver had just broken down. Oracle gave a message to the effect of: "You've just encountered a problem. Do you wish to continue on with the patch (Y/N)?" This is a matter of common sense, but he was about to say 'Y' when I screamed at him to stop. Maybe not so common sense.
I looked at the error in question, and it involved the routine compilation of a code object. I looked up the issue in Oracle Metalink/My Oracle Support's Knowledge Base. No hit. I started looking at the infrastructure; the environment variables were correctly specified. At some point the thought crossed my mind: the environment thinks the database is running on the correct Western European character set; what does the database think its character set is? Bingo--US7ASCII. I later showed the client I could run the patch against a database running the Western European character set without a problem. This raised two issues: first, it meant the production database had the same problem; second, how could they have been operating all this time with the wrong character set and not noticed until my discovery?
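The comparison that exposed the problem can be sketched in a few lines of Python. This is only an illustration, not the procedure I ran at the time. The `NLS_LANG` value and the `WE8ISO8859P1` name are assumed examples of a Western European setting, the function names are made up, and the database value is passed in as a plain string (on current Oracle releases it could come from a query such as `SELECT value FROM nls_database_parameters WHERE parameter = 'NLS_CHARACTERSET'`).

```python
def client_charset(nls_lang: str) -> str:
    """Return the character set part of an NLS_LANG value.

    NLS_LANG has the form language_territory.charset,
    for example AMERICAN_AMERICA.WE8ISO8859P1.
    """
    _, sep, charset = nls_lang.rpartition(".")
    if not sep or not charset:
        raise ValueError(f"no character set in NLS_LANG value: {nls_lang!r}")
    return charset.upper()


def charset_mismatch(nls_lang: str, db_charset: str) -> bool:
    """True when the client environment and the database disagree."""
    return client_charset(nls_lang) != db_charset.strip().upper()


def fits_in_7bit_ascii(text: str) -> bool:
    """True when every character of text is representable in 7-bit ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    env = "AMERICAN_AMERICA.WE8ISO8859P1"  # assumed client setting
    print(charset_mismatch(env, "US7ASCII"))       # environment and database disagree
    print(charset_mismatch(env, "WE8ISO8859P1"))   # environment and database agree
    print(fits_in_7bit_ascii("Zürich"))            # accented character is outside 7-bit ASCII
```

The last helper hints at why the difference matters: a 7-bit character set has no room for the accented characters a Western European installation is expected to store.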
At some point I became aware that the Apps database had been on a different server and was rehosted to the current AIX server by an independent, obviously non-Apps DBA. I don't know the nuances of the migration (e.g., platform implications) or whether they wanted to reorganize the database in the migration process, but rather than clone the existing database across the servers, the DBA recreated the Apps database doing an export/import. All the DBA's activities had been scripted and logged. Bingo! I fished out the CREATE DATABASE statement, which did not specify the Western European (or other) character set.
There was only one supported solution, as described above. The production DBA made it political by producing a Metalink note describing a hack to the character set value, but even that note described the hack as a temporary measure until the official solution was implemented. The troublemaker DBA told the VP I was trying to bump up consulting fees. I told the client it didn't matter to me if they assigned the production server to the production DBA, me, or an independent consultant; what concerned me was that running an unsupported configuration violated their support contract with Oracle, which they couldn't risk in their production environment.
The VP of IT didn't let it go. One day she called me in on what she thought was a gotcha. I had to choose my words carefully. She showed me proof that Oracle supported US7ASCII. Yes, of course, Oracle supported databases using its default character set, but that didn't apply to Oracle Apps databases. I had to show her, in writing, the first appendix to the Apps installation manual, which explicitly stated the requirement of Western European. (The follow-up remedy is a separate story, including server hardware problems not reported to management.) But the point remains: the non-Apps DBA, in doing the migration across servers, had not done due diligence on the prerequisites for creating an Apps database, and his simple omission of the character set meant his efforts had to be redone. Why the client had selected a non-Apps DBA in the first place is a separate issue.
- Be Aware That Independent Modifications or Operations on Proprietary Software Objects May Violate Vendor Support Standards
Several years back I worked, initially as a subcontractor through an agency, in what I call a "babysitting gig"; in this case, the client was an American marketing/support subsidiary of a large Japanese maker of chip testing machines; its clients included IBM, Intel and Micron. Almost a year earlier they had relocated their American headquarters from the northwest suburbs of Chicago to Silicon Valley (Santa Clara), near one of their largest clients. Many employees refused to relocate with the company, apparently including the company's DBA, since the company had hired a local DBA (I infer this from the fact he told me he had been there 7 months). It was a relatively small IT group, maybe 15 people. There was a soap opera beyond the context of this post; he had applied for a vacant IT manager position and was denied. So he had given his notice, and there was no backup for him to transition to. This was mid-1999, at the peak of the Internet bubble. The client couldn't recruit a replacement, and I agreed to a 5-week contract commuting out of Chicago (it later got extended, and I ended up relocating to Santa Clara for 18 months when I agreed to go perm).
The arrangement was for V, the departing DBA, to transition me over a week, but he was ready to leave on my second day. As a former senior principal consultant for Oracle Consulting, I had high standards; I soon discovered that the client was behind on megapatches, by perhaps 3 years or more. For example, one product was at level F and the latest megapatch was at the Q level. And the system was poorly documented; for instance, they had replaced Oracle's standard check with a custom one (so if you patched Apps, it would regenerate its standard check). I lobbied the newly hired IT manager to update the Apps. This was critical because if we ran into a problem with production, the first question Oracle Support would ask is whether the Apps were at the current patch level, including all interim product fixes (vs a one-off patch). The users, particularly the accountants, resisted change, worrying any patching would break functionality. I applied the megapatches one weekend, skipping my commute back home.
Now the accountants were attached to "green screen" character-mode 10.7, which was analogous to Lotus 1-2-3/DOS in terms of keystrokes to get to the desired screen; they also had a more direct connection to the database server. Oracle had desupported green screen mode at the end of 1998 in favor of its GUI approach, and the accountants were not impressed by what they saw as usability issues in navigating through a series of pop-ups to get to a desired screen. As someone with an active research interest in human factors in computing, I was sympathetic to the accountants' complaints about Oracle force-feeding them an unrequested, less usable interface. On the other hand, I had a responsibility to the company to ensure our configuration was supported through our maintenance contract with Oracle. I could not afford for something to blow up in production and have Oracle say, apply these patches and call us in the morning if the problem hasn't gone away--or have Oracle ask why an accountant was still running green screen after desupport.
I can still remember this six-foot female receivables clerk telling her boss she could get through all her invoices today if she could go green screen, or only part of them doing things "Ron's way"--his choice. (I only mention her height because my boss was hinting she might like me; I don't have issues dating taller women. However, I like them more when they don't fight me doing my job.)
One particular Asian female accounting manager who was unhappy I had replaced V was agitating over my "reckless patching", confident I was "screwing things up". She finally discovered an anomaly: the company's fixed assets appeared in green screen, but not in the GUI. Even worse, she contacted Oracle Support directly, without going through me. At some point, they asked her to do a "row who" on a record, and she reported back "ANONYMOUS". Game! Set! Match! Oracle had tricked her into unknowingly revealing that the company had violated Oracle's support agreement. Let me explain: Oracle wants you to do things through the application, and it tags records accordingly. When a record has an anonymous tag, it means the record was inserted independently of Apps processes, e.g., by a SQL statement or SQL*Loader. Oracle Consulting gave my boss a project quote of $10,000 to fix the product, plus the Apps would be unavailable for 2 weeks (a non-starter because of the effects on operations).
It took me a while to piece together what had happened and how to fix the problem. When the company installed the Fixed Assets module some 3 years earlier, the developers didn't want to go through the busywork of setting up the subcategory "None" for each of 17 asset categories through the application, so they went straight to the underlying Apps table and loaded the rows into the database with SQL*Loader. What they didn't know was that the setup process, when worked through in the application, converted the subcategory name into uppercase. The green screen interface would allow for a mixed-case match. However, the GUI software converted the search subcategory name into uppercase and then searched the relevant database object. Since the stored subcategory names were "None", not "NONE", no assets appeared under the GUI.
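The mismatch is easy to reproduce outside Oracle. The following Python sketch is a toy model, not Oracle's code: the two lookup functions are hypothetical stand-ins for how the two interfaces behaved as I describe them above, and I am assuming the green screen match was simply case-insensitive.

```python
def green_screen_lookup(stored_names, search):
    """Case-insensitive match (assumed behavior of the character-mode screens)."""
    return [name for name in stored_names if name.upper() == search.upper()]


def gui_lookup(stored_names, search):
    """Uppercase only the search term, then compare exactly with stored values."""
    return [name for name in stored_names if name == search.upper()]


if __name__ == "__main__":
    loaded_directly = ["None"]  # rows inserted behind the application's back
    entered_in_app = ["NONE"]   # what the setup screens would have stored

    print(green_screen_lookup(loaded_directly, "None"))  # finds the row
    print(gui_lookup(loaded_directly, "None"))           # finds nothing
    print(gui_lookup(entered_in_app, "None"))            # finds the row
```

The data was never missing; the GUI was asking for a value that the direct load had never stored.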
Devising a fix was easier said than done, because the subcategory name field was referenced in multiple objects across the database, and of course the fix itself would violate Oracle's agreement. After thoroughly testing my fix in the test environment, I promoted it into production after hours. I won a record 3 CEO awards during my 13 months with the company, but not for this issue. Did I get a raise, bonus or award? Nope, but there is the satisfaction of knowing you solved a problem few people are capable of solving. (Why did I leave after 13 months? A long story; my boss had reneged on some promises and was taking advantage of me as an employee (70-hour workweeks at a below-market salary), and I was about to go to an in-person interview at a real estate portal in Austin when my boss had a heart attack while visiting our Vermont branch office (primarily servicing IBM). I waited until he was back on his feet before resigning.)
The arrangement was for V to transition me over a week, but he was ready to leave my second day. As a former senior principal consultant for Oracle Consulting, I had high standards; I soon discovered that the client was behind megapatches maybe up to 3 years or more. For example, one product was at level F and the latest megapatch was at the Q level. And it was poorly documented; for instance, they had replaced Oracle's standard check with a custom one (so if you patched Apps, it would regenerate its standard check). I lobbied the newly hired IT manager to update the Apps. This was critical because if we ran into a problem with production, the first question Oracle Support would ask is whether the Apps were current patch-level including all interim product fixes (vs a one-off patch). The users, particularly the accountants, resisted change, worrying any patching would break functionality. I applied the megapatches one weekend, skipping my commute back home.
Now the accountants were attached to "green screen" character-mode 10.7 which was analogous to Lotus 1-2-3/DOS in terms of keystrokes to get to the desired screen access; they also had a more direct connection to the database server. Oracle had desupported green screen mode at the end of 1998 in favor of its GUI approach, and the accountants were not impressed by what they saw as usability issues navigating through a series of pop ups to get to a desired screen. As someone with an active research interest in human factors in computing, I was empathetic with the accountants' complaints on Oracle force feeding an unrequested, less usable interface on them. On the other hand, I had a responsibility to the company to ensure our configuration was supported through our maintenance contract with Oracle. I could not afford for something to blow up in production and Oracle say, apply these patches and call us in the morning if the problem hasn't gone away--or have Oracle ask why an accountant is still running green screen after desupport.
I can still remember this six-foot female receivables clerk telling her boss she could get through all her invoices today if she could go green screen or part of them doing things "Ron's way"--his choice. (I only mention her height, because my boss was hinting she might like me, I don't have issues dating taller women. However, I like them more when they don't fight me doing my job.)
One particular Asian female accounting manager who was unhappy I replaced V was agitating over my "reckless patching", confident I was "screwing things up". She finally discovered an anomaly: the company's fixed assets appeared in green screen, but not in GUI. Even worse, she contacted Oracle Support directly, without going through me. At some point, they ask her to do a "row who" on a record, and she reports back "ANONYMOUS". Game! Set! Match! Oracle had tricked her unknowingly into revealing the company had violated Oracle's support agreement. Let me explain: Oracle wants you to do things through the application, and they tag records accordingly. When a record has an anonymous tag, it means the record was inserted independently of Apps processes, e.g., a SQL statement or SQL-Loader. Oracle Consulting gave my boss a project quote of $10,000 to fix the product, plus the Apps would be unavailable for 2 weeks (a non-starter because of the effects on operations).
It took me a while to piece together what had happened and how to fix the problem. When the company installed the Fixed Assets module some 3 years earlier, developers didn't want to go through the busy work hassle of setting up the subcategory "None" for each of 17 asset categories, so they targeted the target Apps table and SQL-LOADed it into the database. What they didn't know was that in working through the setup, Oracle converted the subcategory name into uppercase. The green screen interface would allow for a mixed case match. However, what the GUI software did was convert the search subcategory name into uppercase and do a search on the relevant database object. Since the subcategory names were "None", not "NONE", no assets appeared under GUI.
Devising a fix was easier said than done, because the subcategory name field was referenced in multiple objects across the database, and of course the fix itself would violate Oracle's agreement. After thoroughly testing my fix in test, I promoted it into production after hours. I won a record 3 CEO awards during my 13 months with the company but not for this issue. Did I get a raise, bonus or award? Nope, but there is the satisfaction of knowing you solved a problem few people are capable of solving. (Why did I leave after 13 months? A long story; my boss had reneged on some promises and was taking advantage of me as an employee (70-hour workweeks at a below-market salary), and I was about to go to an in-person interview at a real estate portal in Austin, when my boss had a heart attack while visiting our Vermont branch office (primarily servicing IBM). I waited until he was back on his feet before resigning.)
- Be Careful of Database Shut Downs/Start Ups in an ERP Environment
There are a couple of examples to make this point, the second involving the park district upgrade project cited under my documentation point above.
I started my ERP DBA experience in 1996 doing SAP Basis Administrator duties at a well-known baking supply company located in the southwest suburbs of Chicago. At that time, Basis Administrators were in high demand and could fetch 6-figure salaries. The company had put a former network administrator through over $30,000 in training at SAP, Veritas, Oracle, and Sun; they were concerned that he could be poached by another company, so they hired me as a form of insurance. On their part, they made a commitment to some SAP training.
In reality I ended up doing the day-to-day administration while the other administrator was off to meetings. A couple of other points setting up the story: the company had Deloitte consultants on site, and bouncing SAP and the database had to be done in a specific order; the process usually took around 20 minutes. Basically bouncing relevant servers/services at least once a week was important because the server memory would fragment, and if a job couldn't be allocated a big enough chunk of memory, it would fail--and these could snowball very quickly, typically on a Friday. For the most part we could get through 4 days or so between reboots, but the fifth day was potentially a problem. And let me tell you, working through 20-letter German error terms is not fun. I remember the odometer was at about 35 jobs one Friday, and he started playing politics going through approvals. I'm telling him, "We've got to bring things down now; no time to waste." The odometer had surged past 220 as now all new jobs were failing before I got the okay to bring things down. I spent literally weeks working through the failed jobs.
As part of the company's agreement, I had signed up for an SAP on Unix course. (Their courses were so popular that I used to joke they could offer a $4200 course on "Basket weaving for SAP" and sell out. SAP wasn't forgiving of late changes like cancellations or rescheduling.) So my boss, just before I'm scheduled to start training, decides to pull me out of the class to shadow Deloitte contractors (who didn't want to be shadowed) working on an upgrade. (SAP at the time strictly qualified who could do upgrades. I had not qualified through their system to do upgrades.) I'm obviously not happy, and my boss, rather than lose my course fee, decides to send our main Unix admin to the class in my place. Huge mistake of Biblical proportions.
So at lunch time we're eating our sack lunches when my colleague takes a phone call from the junior Unix admin, who reports some message he's seeing on one of the monitors. Now recall my point about the protocol of bouncing SAP; my colleague's brain seemed disengaged from his mouth as he responded, "Well, when I was at Sun school, they said when we see a message like that, reboot." Say what? I turned to my colleague, astonished at what he had just said. He seemed to suddenly realize what had come out of his mouth. "Of course, don't do it NOW...." He turned pale, hung up and headed to the server room. "Gotta go." All of a sudden, everything went black. About 20 minutes later, my colleague asks for my help. "Oracle is not coming up. I don't understand these messages I'm seeing." Now my colleague had a lot of pride and would probably lose a limb before asking me for help. It's difficult to explain; Oracle's messages when it encounters hardware issues can be obscure. It was like Oracle was saying, "Now I see the redo logs; now I don't." I know we're running Veritas and I quickly infer we have corrupt mirrors. I had no training in Veritas, but I had him open it to some display which showed mirrors being rebuilt one pane at a time. I extrapolated recovery time to be 3.5 hours, which was almost exactly the case. In the meantime, we've got 75 people whose work responsibilities depend on SAP being up coming back from lunch and discovering things are dead.
My boss asked my colleague what happened, and he lies, of course, saying something like it's a random incident, shit happens, etc. I repeatedly argue that it's a Veritas issue, not an SAP or Oracle issue. I suspect my boss realized that having his senior Unix admin, who would have known better than to reboot in a nonstandard manner in the middle of a business day, attend my class would come back to bite him on the ass. So my boss has my colleague talking to the SAP SWAT team (high level emergency support), and they're looking at the database initialization parameter file. Idiots! If there was an issue with the parameter file, it would be consistent across database bounces, and this was different. The CIO, based on my boss's information, tells the company users the cover up story. I later tell the CIO what really happened, and my supervisor immediately fires me. To be honest, I was close to quitting on my own over my boss's boneheaded decision. There were management and training issues there, and scapegoating me didn't change anything.
Now to the metropolitan park district issue: Oracle is configured somewhat differently on the NT vs. Unix platform. I'm going to respond in terms of the platform at the time; I have not worked on NT backservers recently and there could be nuances since then. But typically you have a couple of key services per database SID. If you manually shut down the database, it may shut down one service but not both. The key takeaway is that both services need to be down before you can make a usable cold backup.
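As a rough illustration of that takeaway, here is a small Python sketch of a pre-backup guard. The service names and the `VIS` SID are hypothetical placeholders, and the state strings are assumed to come from however you query the Windows service manager; this is a sketch of the check, not of any Oracle-supplied tool.

```python
# Guard for a cold backup: every service tied to the SID must be stopped.
# Service names below are hypothetical placeholders for illustration.

def safe_for_cold_backup(service_states: dict) -> bool:
    """Return True only if at least one service is listed and all are STOPPED."""
    if not service_states:
        return False  # no information is not the same as "everything is down"
    return all(state == "STOPPED" for state in service_states.values())


# A manual database shutdown that stopped one service but not the other.
after_manual_shutdown = {
    "OracleServiceVIS": "STOPPED",
    "OracleStartVIS": "RUNNING",
}

# Both services explicitly stopped.
after_full_stop = {
    "OracleServiceVIS": "STOPPED",
    "OracleStartVIS": "STOPPED",
}

print(safe_for_cold_backup(after_manual_shutdown))
print(safe_for_cold_backup(after_full_stop))
```

The point of the guard is to fail closed: if you can't confirm that both services are down, you don't copy the files.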
I had advised CB to keep the database in archivelog mode (which enables more flexible recovery), but he disregarded my advice. (DBAs don't normally keep test databases in archivelog mode, but you don't accumulate logs closing out an Apps database for upgrade.) Now I have to explain a huge mistake he made in order to make my point about shutting down one of my Apps databases, Vision Demo.
Oracle has traditionally maintained the statuses of concurrent managers (Apps background processes) in the database. If you shut down the concurrent managers properly before shutting down the database, these statuses will be updated. The issue comes into play on startup: Oracle looks at the stored status to see whether the process is up; it's not going to restart a process that is already up. So you could have a case where the process is really down but the stored status is invalid/corrupt, as when the database comes down before the Apps background processes are closed out. Oracle has a cmclean.sql script available on Metalink/MOS which should clear out the relevant corruptions.
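The stale-status trap is easier to see in a toy model. The Python sketch below is only a simulation under my own assumptions: the manager names, the "UP"/"DOWN" values, and both functions are hypothetical, and the cleanup step merely stands in for the idea of a cleanup script rather than reproducing what cmclean.sql actually does.

```python
# Toy simulation of stored manager statuses going stale after a crash.
# Names, status values, and functions are hypothetical illustrations.

def start_managers(stored_status: dict, running: set) -> list:
    """Start only the managers whose stored status says they are down."""
    started = []
    for name, status in stored_status.items():
        if status == "UP":
            continue  # stored status is trusted, even if the process is gone
        running.add(name)
        stored_status[name] = "UP"
        started.append(name)
    return started


def reset_stale_statuses(stored_status: dict, running: set) -> None:
    """Mark any manager that is not really running as DOWN."""
    for name in stored_status:
        if name not in running:
            stored_status[name] = "DOWN"


# The database came down before the managers were closed out: the stored
# statuses still say UP, but nothing is actually running.
stored = {"manager_a": "UP", "manager_b": "UP"}
running = set()

print(start_managers(stored, running))  # nothing gets started
reset_stale_statuses(stored, running)
print(start_managers(stored, running))  # both managers start
```

An orderly shutdown keeps the stored status and reality in agreement, which is why the order of operations matters.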
One day I was doing a last-minute check of Vision Demo, a training database, to ensure everything was up for park service personnel in for training. The concurrent managers were down. This was curious; maybe the concurrent managers hadn't been launched. But no, they weren't coming up. It had to be sticky statuses. I apply cmclean. The concurrent managers are back up. Now I know what happened: CB had brought down my Vision Demo database without first bringing down the managers. Incompetent; but more importantly, why was he messing with Vision Demo?
I ask CB if he knew anything about Vision Demo being down. He said, yes; he explained that he had just added a fourth datafile to the system tablespace and realized that he could just as easily have extended an existing datafile. So he manually deletes the new datafile. The database crashes and won't recover. (OK, if I'm a DBA manager, I terminate the DBA for cause immediately when this happens. You NEVER, EVER screw with the system tablespace, and you never fat-finger a datafile unless you're sure it's not in the database or it has been removed in a process like offlining.) But what does this have to do with my Vision Demo?
CB called Oracle Support. (I doubt that Oracle Support is this bad.) He says Oracle told him to look for other datafiles around named SYSTEM04.dbf. My Vision Demo had one. So he brought my database down and scp'ed a copy of SYSTEM04.dbf to his server. He looked at me through his thick glasses and said with sincere astonishment, "Would you believe it didn't work, Ron?"
OK, if any Oracle DBAs are reading this, it's milk-squirting-out-your-nose funny: you can't mix-and-match datafiles from different databases. It took all my self-control not to laugh. This is the kind of story I would use as an ice-breaker at Oracle's annual conference. It's tragic but funny as hell. Dude, you worked at Oracle for 17 years? As what--a janitor?
It turned out that all four of his weekly cold backups were unusable. The bottom line is that 6 weeks into a 12-week upgrade process that committed to 3 test upgrades, we were back at the start. The PM refused to fire him because he didn't want to train a new DBA. I would have trained a replacement for free. I didn't even trust this guy to go to McDonald's to pick up lunch. (I once went there with him and saw him go around picking up abandoned receipts. An expense report scam.) There's more to the story, but I'll save it for a future post.
Let me summarize a few points from the first section and this section: know your platform. For example, Windows on a backserver stores logs in a trace directory off Oracle Home. You reference service controls. Be careful of what an install or upgrade does to your Oracle registry. There are nuances to references in environment variables. Note that this point applies to non-ERP Oracle databases as well:
- You Need To Take Platform Nuances Into Account
Finally, quite often an ERP release includes interoperable integrated Oracle/vendor software--database, middleware, application software, etc. These components are typically certified for specific platform versions, e.g., RHEL 6.6, with various Unix/Linux packages, etc. Oracle publishes certification matrices in Metalink/MOS. You can't simply make a unilateral upgrade of one of these components just to be up to date with the latest software. Oracle won't support an uncertified mix of software product releases.
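The check I have in mind before touching any single component can be sketched in a few lines of Python. Everything here is a placeholder: the version labels and the matrix entries are made up for illustration and are not real certification data, which you would have to look up in Metalink/MOS yourself.

```python
# Toy certification check. The matrix entries are made-up placeholders,
# NOT real Oracle certification data.

CERTIFIED = {
    ("apps-A", "db-1", "os-X"),
    ("apps-A", "db-2", "os-X"),
    ("apps-B", "db-2", "os-Y"),
}


def upgrade_is_certified(current: dict, component: str, new_version: str,
                         matrix: set = CERTIFIED) -> bool:
    """Return True if changing one component leaves a certified combination."""
    proposed = dict(current)
    proposed[component] = new_version
    combo = (proposed["apps"], proposed["db"], proposed["os"])
    return combo in matrix


stack = {"apps": "apps-A", "db": "db-1", "os": "os-X"}

print(upgrade_is_certified(stack, "db", "db-2"))  # a certified move
print(upgrade_is_certified(stack, "os", "os-Y"))  # an uncertified mix
```

Note that the check is on the whole combination, not on the component being upgraded: a version that is perfectly fine on its own can still leave you with an unsupported stack.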
Let me go back to a heated argument I had with the project DBAs on the park district project, referenced twice above. Now some big shops often liked to run one version of the Oracle RDBMS, e.g., run 11.0.3 of Oracle Apps on 8i vs. 8 when they had other databases running 8i. They also wanted to be able to break down the Apps upgrade into separate database and Apps upgrade portions. Now let me first point out that you don't save much time, because the database upgrade is maybe one hour in a 3-5 day upgrade window. The point is that you had to run a lot of INTEROPERABILITY PATCHES so you could run 11.0.3 on 8i. And there can be issues with interoperability patches. So here's the top-level view: an upgrade is done in categories; in a typical upgrade you do closeout activities in Categories 1-3, and late in Category 3 you do the database upgrade, followed by the Apps upgrade driver. Now the closeout steps are not very time-consuming, maybe a few hours at most.
This is the point that the consulting PM, BK, allegedly a PhD like me, couldn't seem to grasp, and neither could his staff DBAs: there is no advantage to running interoperability patches just to do closeout activities for the database, but it was like pulling teeth. So they wanted to do the 8i upgrade and interoperability patching in category 1 vs doing the 8i upgrade in category 3. You add a hell of a lot of patching in Category 1 which isn't necessary to get to the upgrade portions of category 3. Here's the next salient point: the client wasn't going to do a two-phase upgrade. We got one outage to go live over 3-5 days. The category 1 option requires a sinkhole of patching for a short-term, at best, benefit. No interoperability patching means I get to the database upgrade and Apps driver sooner than later. This is not rocket science, folks!
I eventually carried the day because the IT deputy manager understood my point. But just remember: don't upgrade anything as a one-off until you verify it's certified, and know that new software versions imply a certain level of interoperability patching. Otherwise, Oracle won't support your installation.
A second example. I got called to join an Apps project in the Virginia Beach area. This was after I finished my work on the airport store chain referenced above. At that time, Developer was part of the middle server configuration (forms, reports, etc.). The client Unix admin had caused a problem because he decided to upgrade to the latest (not Apps certified) version of Developer, not realizing the interoperability issues with what he did.
The Unix admin shadowed me like a hawk. He complained to his boss that there was nothing I was doing that he couldn't do. Dude! The reason the client hired me to the project is because I wouldn't do what you did. While I was there, the trains ran on time. He had caused a problem where consultants costing cumulatively hundreds of dollars an hour were sitting on their hands unable to do their job. I was treated by the consulting PM like a rock star on my arrival. And so my final point for this essay:
- Do Not Make Unilateral Changes to Software Components Unless the Vendor Has Certified the Changes. Understand that Major Component Upgrades May Require Interoperability Patches