Happy Thanksgiving!

2024 has been an especially hectic year. With Thanksgiving drawing near, and it also being my wife’s favorite holiday, I wanted to try to make it extra special for her this year. My wife does all of the cooking for our family, and she spends a lot of time in the kitchen. When we bought our house, the kitchen had a combination wall oven/microwave installed. We never use the microwave and it would make cooking big meals much easier if we had a double wall oven, so I installed one. My wife is also a big football fan, and the annual Detroit Lions Thanksgiving game is always happening while she is preparing our Thanksgiving feast. I had previously rigged up a laptop in the kitchen so she could watch the game but this year I thought I’d level up and install a TV on the wall in the kitchen.

My mom came over for the day, so she and my son were having fun playing games, looking at pictures, and being entertained by our dog. It was shaping up to be a perfect Thanksgiving! As it got closer to dinner time, my son started getting grumpy. We must’ve asked him 10 times what was bothering him, but he kept saying nothing. He gets like this sometimes, and it is usually fruitless to keep trying to get an answer out of him, so I decided to just let him be grumpy. After working diligently in the kitchen all day, my wife finally sat down so that we could enjoy the meal she had been working hard on all day. As we began to take our first bite, my son says his stomach hurts and he goes to sit on the couch. Not even a minute later, it happened…

From the dining room table, I see my son running toward the bathroom with a creamy substance flowing out of his mouth and onto the floor. My wife runs over to the bathroom to see what was happening, and I followed behind while holding our dog back. It looked like a crime scene. Vomit was probably in 10 different places on the floor in the hallway, and running down the front of the sink, on 3 different walls, on the garbage can, and in, on, and around the toilet. I didn’t know whether to ask if he was OK or call for an exorcist. Then my wife comes out of the bathroom holding a small black object and saying that it probably needs to be thrown away. I looked on in horror as I realized it was my Logitech Harmony Elite universal remote control, absolutely covered in vomit! My son had been in the process of turning on the TV when he realized he was going to vomit, and he was in such a hurry to get to the bathroom that he kept the remote in his hand. Obviously I’m more worried about him than the remote, but this remote is one of the best universal remotes of all time, and it went out of production about 5 years ago. Good examples of it are hard to come by, and quite expensive.

After the violence ended, my wife and I began cleaning up the mess. I’d never seen anything like it. The sheer coverage area and variety of affected surfaces were astounding. My son appears to have a gifted ability to spew vomit in a manner that is very difficult to clean. The sections of it that had been sprayed at the front of the sink and toilet had seeped underneath the edges of them. There’s a good chance they’ll need to be removed in order to thoroughly clean underneath each of them. I sense my Black Friday will be more brown than black. I would estimate that less than 5% of the content made it into the toilet. My mom kept yelling from the other room offering to help, but I was intent on keeping her out of the bathroom because I didn’t want a slip and fall accident to add to the chaos.

While cleaning the area around the toilet, I managed to spray some cleaner on the leak sensor installed behind the toilet, which sounded the alarm on all of our phones and automatically shut the water off. Perfect for when you’re wrist deep in what is basically sewage, and now have no way to wash your hands!

My mom, now unable to eat and fearing we had the plague, high-tailed it out of the house. When the dust settled, I gazed at the dining room table, which had 4 servings of untouched food, and various serving dishes of food that my wife had been working on all day. None of us got to enjoy the Thanksgiving dinner that we had all been looking forward to. I turned the water back on, washed my hands 10 times, and sat down to eat dinner. No one else was able to eat after that experience, but I’m able to compartmentalize just about anything when I’m hungry. So, I enjoyed Thanksgiving dinner alone, while still somewhat in shock from what I’d just witnessed. This Thanksgiving was definitely not the awesome day I had planned for my wife, but it will definitely go down as one we will not forget!

I just can’t leave well enough alone

The TJ started making a faint squeaking noise when going over any kind of bump. Anyone that knows me knows that I can’t stand squeaks and rattles. The noise was coming from the tailgate, because of the oversized spare tire that is mounted to it. The tailgate hinges just aren’t up to the task of supporting such a large tire. There are all kinds of solutions to this problem, but in order to maintain a mostly stock look, I decided to go with some beefed up hinges to stabilize the tailgate.

Installing the new hinges is just 8 bolts, so it should only take a half hour or so, if you include aligning the tailgate. The first 6 bolts came right out, with zero drama. The 7th bolt started to lock up and make a loud squeaking noise, which is a tell-tale sign of the threads on the back side of the nut being rusted/corroded. I stopped and hit the bolt with some penetrating oil, and decided to let it sit for a bit. The 8th bolt made similar noises, but came out just fine.

Back to the 7th bolt. The bolt backed out about 1/4″ before it started to lock up. I decided to hit it with more penetrating oil and run it back in, to make sure the threads were lubed. As soon as I tried to tighten it, the bolt snapped with very little effort. FML. I was devastated. I figured my best bet was to drill out most of the bolt and try to remove what’s left with a punch. Naturally, I couldn’t find my center punch or screw extractors, so I took the boy to Home Depot for some supplies.

Home Depot didn’t have a regular center punch, so I had to get one of the stupid automatic ones. I also picked up a set of screw extractors and some cutting oil. I get home and pull the stuff out of the bag and everything is covered in oil. The spout on the cutting oil container was cracked and the oil spilled everywhere inside the bag. Luckily the bag contained it, so it wasn’t all over the car.

After center-punching the bolt to align the bit, I drilled the bolt about 95% of the way to the threads. It was fairly uneventful. The last bit I ran through it actually spun the back half of the bolt out the backside of the nut, so I only had the front half of the bolt left in the nut. I decided to grab the remainder of the bolt with a screw extractor, thinking it likely wasn’t jammed that bad. I got about one full rotation with the screw extractor and SNAP. The screw extractor jammed inside the nut and snapped off.

What you see here is a broken screw extractor inside of a broken bolt, inside of a captive nut inside the body of the TJ, which I removed because of a very minor squeaking sound.

I’m not actually sure how to proceed at this point. Screw extractors are obviously very hard and brittle, so I tried to hit it hard with a punch a few times. I was able to break a few small chunks off of it, but didn’t make much progress at all. I spent some time going at it with a Dremel, but the screw extractor was just too hard for any of the bits I had on hand. I went back up to Home Depot to buy a diamond bit for the Dremel, and I hope to be able to burrow through the center of the screw extractor in the morning. If I can get through the center of it, I hope to be able to crack it into a couple pieces and pull it out.

I hate leaving a project in a state like this, but everything I touched today turned to shit, so it’s time to call it and regroup in the morning.

Day 2

I spent a couple hours trying to make a dent with one (actually, two) of these diamond bits in my Dremel tool. It is an incredibly slow process, and it produces a surprising amount of very fine metal dust given how little material it appears to be removing. It was very difficult to tell where the broken screw extractor ended, and where the nut and surrounding bracket started.

Eventually, at what seemed like glacial pace, I broke through the center of the screw extractor. At this point, I was able to widen the hole with the less worn-down part of the diamond bit. This went much faster than trying to burrow straight through, but it presented a new challenge… I needed to make sure I didn’t go beyond the screw extractor (and the remains of the broken bolt that was still in the hole). If I did, I’d compromise the captive nut in the body, and who knows how the hell I’d recover from that. So, I tried to go slowly and carefully, while looking inside every so often to see if I saw the material change to the bolt or the nut.

Once I could finally see the threads, I broke out some of the remains of the screw extractor with a punch. Now it was just a small portion of the screw extractor and what was left of the bolt. I started trying to run a tap through the hole, but it jammed pretty early on. I wasn’t about to break off a tap in the same damn hole I’d been working on for hours. I kept running the tap up until it jammed, backing it off, adding oil, and running it again. I repeated this process over and over until I started making progress.

Then I heard a crack. My heart sank. I was afraid to move my hands. When I gained the courage to move, I was relieved to learn that the tap was still intact. The crack appeared to come from the screw extractor. I backed the tap out and did what I could to clean out the threads. I ran the tap back through the hole and I was able to power through the rest of the junk in the threads and make it completely to the back side of the nut! I backed the tap out and inspected the nut and it appeared damage free! Eureka! Talk about a relief!

I am astounded that I was able to remove the screw extractor and the remains of the broken bolt without ruining the threads on the nut. This felt like a complete impossibility when I ended my day yesterday. Now all that remains is to install the new tailgate bracket.

Thankfully, the new bracket installed drama-free. I’m very happy with the quality and it appears to hold the tailgate much more rigidly. No squeaks, and the tailgate shuts better than it ever has!

I only ordered the hinge kit, because I already have a heavy duty spare tire bracket and I didn’t like the one that was intended to go with these hinges. I expected that I would have to do a little custom work to make my tire bracket work with these hinges. All that remained was to take some measurements so that I could design and 3D print a spacer for the tire bumper pieces that support it.

A few simple measurements and an hour later, I had a nice spacer that wrapped around the new hinge and brings the bumper right back to where it was before the project began. I’d already added some spacers to these bumpers when I installed my heavy duty spare tire bracket; I just needed to make one with a cutout for the new hinge, since it cut into the area the spacer used to occupy.

Victory! I am very happy with the end result, but this is absolutely the longest I’ve ever struggled with a single bolt in my entire life. I’d say it took 5-6 hours of suffering to accomplish what should’ve been a 20-30 minute project.

Today, I was tested.

I had a doctor appointment scheduled for this morning at 8a. It’s cold as hell, so I tried to remote start my truck.

It did not start. No lights, no crank, no nothing. I drove the truck 2 days ago and everything was fine. Something must’ve caused it to drain the battery.

I walk out to my truck to see what is going on and the battery is completely dead. I had to put the key in the door to open it, and was presented with no dome light, etc.

I measure the battery with a voltmeter and it’s zero volts. It’s a 2 year old Optima yellow top deep cycle AGM battery. Most batteries won’t come back from being this dead, but I am determined to try.

I go grab my battery charger to see if I can revive it, but I need to run an extension cord out to the truck to charge it.

The garage door won’t open when I press the button on the wall. The screen on the button says “Press the push bar to activate control” but pressing it does nothing. We had a power event a few days ago, which likely caused something to go crazy with the RATGDO garage door controller I use. It’s times like this where I wonder why I complicate my life with this stuff. I was able to use my phone to open the garage door and drag the cord out to the truck.

I connect the charger to the truck and it doesn’t even recognize it’s connected to a battery. After some screwing around, I manage to get it to engage the 75A engine start function, which wakes up the truck. At this point, the alarm starts going off. HONK HONK HONK. It’s just after 7am. This likely wakes up my son and some of my neighbors. I disconnect the charger, which immediately stops the furious honking, because the battery returned to zero volts. I close everything up and head inside. I will deal with this later when everyone is awake and it’s not dark out.

I call to cancel my doctor appointment and after navigating the phone prompts, I eventually realize there is literally no way to speak to an actual person. Pressing 0 just takes you to the main menu, and pressing 6 for “extra help” just reads off the URL to the website and then hangs up on you. There is also no way to cancel the appointment online, because it is on the same day. About a half hour later I get a nastygram from the doctor because I didn’t show up.

At about 9:30a, one of my meetings ends early so I head back outside to try to get the battery charging. I disconnect the battery from the truck so that the alarm doesn’t go off again.

Now the charger just gives me an error saying there is an open cell in the battery, and it won’t charge. It’s not possible to have an open cell in this battery. I try a different charger. The moment I connect the clamp to the positive terminal, the clamp explodes and part of it hits me in the forehead. Apparently single digit temperatures combined with old plastic and a lot of spring tension equals a spectacular failure of the clamp.

I go get the voltmeter to see the state of the battery. The battery in the voltmeter is dead. Mind you, this is the same voltmeter that I used at 7a, and I did not forget to power it off.

I grab a third battery charger (what, you don’t have 3 battery chargers?). It stops charging almost immediately, saying the battery is “Full.” It reports the battery voltage at 5V.

These “smart” chargers are apparently too smart to charge a battery this dead. I need an old-school charger that just applies 14V to whatever you connect it to, regardless of the circumstances. I’ve probably thrown away a charger like that and now I need it.

After much screwing around, I manage to get the first battery charger to trickle charge the battery, or at least it appears to be. I leave it be, and will come back at lunch and see if I can charge it at a higher rate.

When I check on it at lunch, the battery is measuring 12.1V. The “smart” charger finally agrees to try to charge the battery at 25A. I’m going to leave it like this for the rest of the day and check it around dinner to see what the status is. Chances that it didn’t error out and cancel charging as soon as I walked away from it are about 0% today.

I can’t wait to repeat this entire process tomorrow morning, since that is when I rescheduled the blood draw and I have no idea what caused my truck to drain the battery.

Don’t ever buy a Frigidaire refrigerator

This is an ongoing saga that is not yet resolved… However, one thing that has been determined is that Frigidaire has the worst customer service of any company I have ever dealt with in my entire life.

Update: After 35 days, Frigidaire finally replaced the refrigerator, concluding the saga.

9/30: A 3 month old fridge fails on Saturday night (9/30/23). When I went into the kitchen in the morning, there was water all over the floor because the ice in the ice maker had all melted and dripped out of the dispenser on the door.

I call Lowe’s because I have a service contract on it. They say since the fridge is within its 12 month warranty, I have to work with the manufacturer.

Frigidaire service is closed for the weekend.

10/2: Call Frigidaire on Monday morning and they say a tech has to come look at the fridge. They assign me to a service company in Flint called Saginaw Valley Service, and say my appointment is Tuesday afternoon.

Get a call Monday afternoon from the service company saying they can’t help me because they don’t have any available techs. They say I need to have my case with Frigidaire re-queued to a different service company.

I call Frigidaire and after following the phone prompts I am greeted with a message saying “Due to unusually high call volume, we are unable to take your call.” They end the call. I’ve been trying to get through for the last two hours and haven’t been able to.

I call Lowe’s to see if there’s anything they can do, and now they are unable to find my service contract… the one they successfully looked up on Saturday when I initially called.

I finally get through to Frigidaire again, after 3 hours of trying. They inform me that there are no other service contractors in my area so they have to escalate my case to a “service locator team.” They say this process may take 2-3 business days. I tell them this is unacceptable and I want my fridge replaced, but I get nowhere. The best they can do is “expedite the service locator escalation” so that it takes “only” 24-48 hours to try to find another service contractor. They will not, no matter how much I complain or how many managers (3) I speak with, replace the fridge without waiting for a service appointment.

10/4: More than 48hrs goes by. I call Frigidaire back. They say my case has an update from yesterday that says a service company called RSI Appliance was assigned to me. They ask if I have gotten a call from them and I tell them no, and that I had no idea a company had been assigned. While on the phone with Frigidaire, they put me on hold to call RSI Appliance to see if they can perform the service call. I sit on hold in silence for more than 10 minutes, then magically I hear some hold music. This goes on for another 10-15 mins until someone says “RSI Appliance service, how can I help you?” Frigidaire service somehow blind transferred me to the service contractor, who now needs a reference number that I don’t have. He manages to look me up by the fridge serial number and says the soonest they can come out is October 13th!! I have them book the appointment in case all else fails.

I call Frigidaire back and say I need a service appt sooner than 10/13. They agree that 9 days away is unacceptable and they send my case to the “replacement review team” to process a replacement.

10/5: I have major surgery and cannot deal with fridge drama, so no progress is made.

10/6: Still recovering from surgery, and Frigidaire will be closed tomorrow for the weekend.

10/9: I call Frigidaire on Monday morning to get an update. They rejected the replacement because they found another service contractor (Autumn Appliance), and supposedly scheduled a service appointment. I call Autumn Appliance and no service appointment has been scheduled. The soonest they can come out is 10/16. I have them schedule the appointment, but in my head I am banking on RSI Appliance coming on the 13th.

10/13: RSI Appliance calls me to tell me their tech is sick and they won’t be able to make my service appointment today. They reschedule me for 10/20.

10/16: Autumn Appliance visits the house and determines that the Fresh Food evap blower fan has a locked rotor, and the fridge failed diagnostics. The tech sits on the phone with Frigidaire support for over an hour to complete the diagnostics and check the availability of the part. Frigidaire says the fan itself is unavailable so they will need to replace the entire “air tower” in the fridge. Frigidaire verifies that the air tower is available and in stock.

10/17: Autumn Appliance tech calls me to tell me that the air tower is in fact NOT in stock, but now they can order the fan itself. They place the order for the fan and say it will take 3-4 days to arrive and they schedule a service appointment for Tuesday 10/24.

10/23: I call Autumn Appliance to verify that the part came in and they will be able to come to my house tomorrow to service the fridge. The person on the phone says they don’t see that the part came in, but they will check on it and call me back. They call back an hour later and say that the part is back ordered and never shipped. Frigidaire did not tell them it was back ordered when it was ordered the previous week, so the last 8 days have been wasted.

I call Frigidaire to plead my case for a replacement once again. The representative that answers the phone definitely does not speak English as her first language. She can barely understand what I am saying and keeps repeating the same thing to me over and over again. At one point she says she will call Autumn Appliance and see if she can get the tech to change the repair order to say the fridge is unrepairable! I inform her that this is absolutely not what I want her to do, and that it should not be up to the service contractor to falsify records in order to try to get Frigidaire to do the right thing. I continue to plead my case and she puts me on hold to “see what she can do.” An hour goes by and I am still on hold, with no interaction from the Frigidaire representative. A few minutes later, the line goes dead and hangs up.

10/24: I call Frigidaire to try to get an update. The representative I speak with now says that he needs to escalate the back-order to the service locator team to try to locate the part at a local repair facility/warehouse and that the process will take 5-7 business days. I inform him that time frame is unacceptable and re-explain the history of my case to him. I demand the fridge be replaced. He just continues to repeat the same thing back to me, “We need to complete the service locator process before a replacement can be issued.” He also stated several times that, “If the part cannot be located, we will issue a replacement refrigerator.” I tell him this is the third time I’ve been told that the fridge will be replaced and it still has not been replaced. Eventually, I am able to convince him to assign a “very high priority” to the service locator ticket and he says I will get a response in 24-48 hours. Out of frustration, I accept those terms and end the call.

10/25: On a whim, I look up some Frigidaire (Electrolux) execs on LinkedIn and try to get some contact information from them. With the help of a friend, we’re able to infer the email address of the COO, so I send them a long email that reeks of desperation, in hopes they can help.

10/26: After 48 hours, I did not receive any updates from Frigidaire, so I contact them for an update. I was in the middle of a few things today, so I decided to try the text message support for the first time. I was clearly connected to a bot, but I provided my reference number and asked for an update on the backorder escalation. The bot responds that there is no update and she will submit the case to the replacement review team (again). She informs me that it will take 5-7 business days to get an update (again). I respond that 5-7 more business days is unacceptable and I want this to be a high priority request. I go back and forth for several minutes trying to convince the bot that it needs to happen sooner than 5-7 business days, but my efforts were futile. Having already been escalated to the replacement review team earlier in this process, I expect that I will receive no update. When I call next week it will have been rejected and the process will start all over again. This is absolutely maddening.

10/31: I get a call from the appliance service contractor saying that they’re still unable to get the parts needed to repair the fridge. For some reason, they gave me a new case number to track the process.

10/31: I get an email from Frigidaire–the first form of any communication I have received from Frigidaire since this process began–that my case has been closed. No other information is provided.

11/1: I call Frigidaire support to see what happened to my case. They inform me that my replacement has been approved! They hand me off to a local store for replacement logistics. I call the local store and they have no authorization from Frigidaire on the replacement. Outstanding.

11/3: New refrigerator has arrived and seems to work. Finally, this circus is over!

The little timing gear that couldn’t

I needed to replace the water pump and the lifters on my Jeep TJ. The lifters are quite involved, since you have to remove the head from the engine. I decided to do this job in the winter, since I don’t typically drive the Jeep in the winter and it wouldn’t matter how long it was torn apart. Boy am I glad I did.

While the engine was torn down, I decided to replace a bunch of other maintenance items: radiator, radiator hoses, heater hoses, and the timing set (chain and gears). I had ordered some of the parts back in the summer, so that I knew I would have what I needed on-hand. This summer when I replaced the oil pan gasket, I could access the timing chain and I noticed it had quite a lot of slop in it. They are known to last upwards of 300k miles in this engine, but it was easy enough to replace it while I had everything else torn apart. Or so I thought.

My good friend Joel was a valuable second set of hands for the entire project. We encountered some minor difficulties with the other items on the list. For instance, I could not get the lifters out with a magnet because they were stuck in their bores. I scrambled and bought a tool to get them out, but the tool was garbage. Thankfully, my neighbor had a much better version of the tool and let me use it. We got the head reinstalled and were making great progress. I decided to tackle the timing set last, as I expected it to be fairly straightforward. After all, it is just two gears and a chain.

A photo of the old timing set. You can see the slack in the chain here

The old timing set came off without incident. I grabbed the new set and slid the small sprocket on the crankshaft. It got stuck and would not slide on.

You can see in this photo that the crankshaft has two keys on it. The outermost key is for the harmonic balancer, and the inner key is for the timing gear. The new gear was binding on the inner key and would not slide back against the engine block. I took the gear off and inspected it. No burrs or other crud in the keyway that would prevent it from fitting. I inspected the key on the crankshaft with the same result. I took the old gear and slid it right back on the crankshaft without any resistance. I compared the two gears and they looked nearly identical, but clearly the keyway on the new gear was narrower. I grabbed a set of digital calipers and measured the keyways. This was more difficult than I thought, because the shape/size of the keyway makes it very difficult to get calipers inside for measurement.

Close-up of the keyway on the crankshaft timing sprocket

The original gear keyway measured about 0.1890″.
The new keyway measured about 0.1850″.
The crankshaft key measured 0.1880″.

The new gear was clearly too small. I called Cloyes (the manufacturer of the timing set). They were able to give me the spec for that keyway, which was 0.1885 – 0.1910″. I grabbed a file and tried to open up the keyway a bit. The gear appeared very hard, and the file didn’t seem to have any effect, so I stopped. I called Cloyes back and let them know that my gear was out of spec. The tech had someone measure the gears they had on-hand and they were all also undersized and out of spec. They didn’t have one gear that was within spec to ship to me. Since I bought this timing set back in August, there was no way I could return it now (January). I decided to try to file it again.

Out of the arsenal of files I have in the toolbox, only about 5 of them fit inside the keyway and were aggressive enough to attempt to remove any of the steel. I spent about an hour with the gear in a vise, putting a lot of weight on the files trying to remove any material at all. I managed to get the keyway to open up to about 0.1855 or 0.1860″, but that was not enough. Frustrated, I began my search for a replacement.
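For anyone who wants to see the numbers side by side, here is a rough sketch of the comparison I was doing in my head. The measurements and the Cloyes spec range are the ones quoted above; the function name and the idea of rounding to four decimal places are just my own choices for illustration, and my caliper readings were approximate to begin with.

```python
# All dimensions in inches, taken from the measurements in this post.
CRANK_KEY = 0.1880                    # width of the key on the crankshaft
SPEC_MIN, SPEC_MAX = 0.1885, 0.1910   # keyway spec quoted by Cloyes

keyways = {
    "original gear": 0.1890,
    "new gear": 0.1850,
    "new gear after filing": 0.1860,  # best case after an hour of filing
}

def check_keyway(width):
    """Return (in_spec, clearance) for a keyway width.

    clearance is keyway width minus key width; a negative value means
    the key is wider than the slot, so the gear cannot slide on.
    """
    in_spec = SPEC_MIN <= width <= SPEC_MAX
    clearance = round(width - CRANK_KEY, 4)
    return in_spec, clearance

for name, width in keyways.items():
    in_spec, clearance = check_keyway(width)
    print(f"{name}: in spec = {in_spec}, clearance = {clearance:+.4f} in")
```

Even after filing, the keyway is still about two thousandths narrower than the key, which is why the gear kept binding.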

I wanted to find another brand, to try to avoid getting another out of spec gear. I called every local auto parts store, and all of them only carried Cloyes timing parts for the Jeep. I went on Rock Auto and ordered a Melling timing set, and another Cloyes sprocket (just the sprocket was pretty cheap) just to see if I could get lucky. If the new sprocket fit, I could return the Melling set and save some money. They arrived a couple days later. The Cloyes sprocket was identical to what I had, and measured the same, so it did not fit. No surprise there. I opened the Melling set, and to my surprise there was a Cloyes set inside the box. Apparently Melling just rebrands Cloyes products in this case. This gear was also out of spec and did not fit. For those keeping score at home, I now have 3 identical Cloyes timing gears, and none of them fit.

Frustrated, I began another online search for alternative brands of timing sets for this engine. Most of the sets I could find were for the earlier model of this engine with a different type of camshaft sprocket (the larger sprocket). You are supposed to install these as a set, and not mix and match new and old gears or gears from different sets, as the gears and chain wear together. I eventually located a Comp Cams timing set on Summit Racing, and ordered it. I paid for overnight shipping because I just want to be done with it at this point.

The Comp Cams timing set arrived the next morning, and I could not resist a quick test fit. The gear slid right on the crankshaft, with virtually no resistance. Victory!

The Comp Cams sprockets are cast instead of milled like the Cloyes unit. They definitely aren’t as nice, but they fit and I’m rolling with it!

Storage saga

A few weeks ago, I noticed that my ZFS array was resilvering, due to a HD failure.  This is the first time a drive has failed in my ZFS array, which is a little over 2 years old.  My ZFS pool was 99.7% full; an issue I’ve been meaning to deal with for quite some time now, but have had other priorities.  As a result, a resilver (rebuild/resync in RAID terms) is causing quite a bit of thrashing on the disks.

A bit of background:  In early 2011, I built the successor to my 20HD 30TB Raid6 array.  It is a 24HD 48TB ZFS RaidZ2 array.  RaidZ2 is similar to Raid6, in that two drives' worth of capacity are used for parity rather than storage.  This means that two drives can fail without losing any data.  I knowingly went outside best practices while building it, and put 24 drives in one vdev.  Actually 23 drives in the vdev, and one hot spare.  A vdev is a group of drives within a ZFS pool.  Parity is localized to a vdev, and you can have multiple vdevs within a pool (with more parity drives in each vdev).  You are only supposed to have up to 9 drives in a RaidZ2 vdev.  At the time, it seemed like the only disadvantage to having that many drives in a vdev was performance, which was not a huge factor for me.  A RaidZ vdev delivers roughly the random I/O of a single drive, so your vdev is only about as fast as your slowest drive.  What I didn’t consider was the amount of stress on the drives, and overall time it would take for a resilver to complete.  Especially when your pool is 99.7% full, and it has to move data around in little tiny chunks.
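To put some rough numbers on that layout, here is a back-of-the-envelope sketch.  It assumes 24 equal-sized drives (48TB / 24 = 2TB each, which is my inference from the totals above) and ignores ZFS metadata overhead and the TB vs. TiB difference, so treat the output as an approximation rather than what `zpool list` would actually report.  The function name is made up for this example.

```python
def raidz_usable_tb(drives_in_vdev, parity, drive_tb):
    """Rough usable capacity of one RaidZ vdev: data drives times drive size."""
    return (drives_in_vdev - parity) * drive_tb

DRIVE_TB = 48 / 24   # assumption: 24 equal drives totaling 48TB raw
VDEV_DRIVES = 23     # 24 drives minus the hot spare
PARITY = 2           # RaidZ2
FILL = 0.997         # pool was 99.7% full

usable = raidz_usable_tb(VDEV_DRIVES, PARITY, DRIVE_TB)
free = usable * (1 - FILL)

print(f"approx usable capacity: {usable:.1f} TB")
print(f"approx free space at 99.7% full: {free * 1000:.0f} GB")
```

By this estimate the pool had somewhere around a hundred gigabytes free out of roughly 42TB, which goes a long way toward explaining why the resilver had to shuffle data around in tiny chunks.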

So back to the drive failure.  This should be no big deal.  I had a hot spare, which is why the array was resilvering itself, without any intervention from me.  By the time I noticed it, the resilver was about 3 hours in.  The ETA to complete the resilver was 72 hours from then!  That’s 72 hours of continuous hard disk thrashing, in addition to the normal load caused by the r/w of the 22 VMs I have running on that array.  I shut some of my VMs down, to hopefully speed up the process, and checked back in a few hours.  To my horror and dismay, the hot spare failed, and 4 other drives had taken IO errors (and were being resilvered as a result).  The array continued to resilver, across the remaining drives, and was still thrashing like hell.  Several hours later, 3 more drives had taken IO errors.  That’s 8 drives resilvering, and 2 faulted in the pool.  It doesn’t get much worse than this.  I always buy my drives from multiple sources when I build a NAS, so that they will have different dates of manufacture, and are less likely to fail in huge batches.  What the hell was going on?  All I could figure was that the incredible stress of the resilver was too much for my consumer-grade HDs to handle.

About 10pm that night, it happened.  A third drive failed, less than 18 hours into the resilver.  Since one of the three was the hot spare, technically only two drives from the original pool had failed, which is the maximum allowed without losing data.  At this point I am shitting bricks, and literally can’t sleep.  I shut all my VMs down, and am scrambling to move critical data to other disks in the house.  I lit up my old array, which hadn’t been powered on since we moved into the new house.  It wouldn’t boot!  Something was up with the OS on the boot drive, so I booted off of an Ubuntu Live CD and mounted the array.  All was fine, but the data (which was originally a backup of what was on the new array) was quite stale.  Since 48TB > 30TB, I obviously had to decide what I was willing to lose, and only copy some stuff over.  I started using external USB drives, and my desktop machine as temporary storage to move data to, in case another drive failed.  The next morning my wife says, “Dave, is something supposed to be beeping in the furnace room?”  This can only mean one thing.  A drive failed on my old array (which has an enterprise RAID card on it, and notifies you when a drive fails).  What else could go wrong?  Since my VMs were shut down, I did not get an email notification.  I hopped on the console and noticed that drive 9 had failed, and the array was rebuilding with drive 10 (the hot spare).  The ETA on this rebuild was much less: 10 hours.  It completes without incident, and I swap the bad HD with a cold spare that I had on-hand.

Eventually, the resilver completes, in 73hrs.  No more drives failed, and I haven’t lost any data.  I’m relieved, but still incredibly spooked that I could lose it all at any minute, if another drive failed.  Throughout all of this, I’ve been trying to figure out what my long-term plan was going to be.  Up until this all happened, I had been considering rebuilding my old array (the one with the hardware RAID card in it) with larger (and more) disks.  But now there is critical data copied on that array, and I can’t scrap it and start over.  It seemed like my only option was to build another (third) NAS server, at a considerable expense.  I could go ZFS, which requires lots of RAM (expensive), or RAID, which requires a hardware RAID card (expensive).  I’m highly annoyed, because if I had just addressed this a few months ago when I knew I was running out of space, I would not have to build a third server.  Then it occurred to me that I could buy a drive enclosure with a built-in SAS expander, and connect it to my existing server.  That would require me to upgrade the amount of RAM (ZFS likes RAM), but it was doable.  Of course, I would have to scrap my existing RAM, because it was ECC unbuffered, and I had maxed out what my motherboard could handle (48GB).  I would have to purchase ECC Registered DIMMs to go beyond the 48GB barrier.  I was telling my tale of woe to a coworker, and he mentioned that we had a bunch of servers in our warehouse that we weren’t using, and that he thought they were full of RAM.  I checked it out, and they were indeed full of RAM.  352GB of ECC Registered RAM, to be exact!  So I borrowed 12 8GB DIMMs and put them in my server.  Voila!  96GB!

I ordered up my enclosure, a Norco DS-24E, and 8 Toshiba 3TB 7200RPM SATA drives.  I figured I would start with 8 drives, and expand later on.  The enclosure and drives arrived a few days later, and appeared to install without incident.  That is, until I realized that all of the drives were detected as 2.2TB drives.  WTF?  Some googling quickly revealed that the LSI SAS1068E chipsets on my SAS controllers did not support 3TB drives!  At this point it’s been over a week since the first drive failed, and I’m on borrowed time with this array.  After a few hours of research, I order an LSI SAS2008 PCI-E SAS HBA.  It’s not the best, or newest, but it is known to work in the very unusual configuration I am running (ESXi hardware passthrough to a Solaris VM, to share the array back to ESXi via NFS).  I also ordered 8 more 3TB drives, because I realized that with only 8 drives in each vdev now, I would have much less usable space and still needed more room to temporarily store my data.  This is getting quite expensive!
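For the curious, that 2.2TB number is not random.  Here is a minimal sketch of the arithmetic, assuming the cap comes from 32-bit sector addressing with 512-byte sectors.  That is a common limit on older controllers, but it is an assumption on my part rather than something confirmed for the SAS1068E specifically, and the function name is just made up for illustration.

```python
# Assumption: the controller addresses sectors with a 32-bit LBA and the
# drives present 512-byte sectors. Neither value comes from the LSI docs.
SECTOR_BYTES = 512
LBA_BITS = 32


def max_addressable_bytes(lba_bits: int = LBA_BITS,
                          sector_bytes: int = SECTOR_BYTES) -> int:
    """Largest drive size (in bytes) reachable with the given LBA width."""
    return (2 ** lba_bits) * sector_bytes


limit = max_addressable_bytes()
print(f"{limit / 1e12:.2f} TB")    # decimal terabytes, as printed on the drive box
print(f"{limit / 2 ** 40:.0f} TiB")  # binary tebibytes
```

Under those assumptions the ceiling works out to about 2.2 decimal TB (exactly 2 TiB), which would explain why a 3TB drive shows up short.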

The new controller and drives show up 2 days later, and I begin surgery.  It goes surprisingly well.  ZFS is fantastically resilient and scalable.  After an export/import, the pool detected perfectly on the new controller, even though all of the drive IDs had changed.  I was super relieved at this point.  I then added the other 8 drives to the enclosure, and built the new pool as two 8-drive RaidZ2 vdevs.  The pool created without incident.  I enabled compression, NFS and SMB on the pool, and immediately began copying my data to it.  It’s now been a little over 24 hours, and 20TB worth of the data has been copied to it.  I intend to get a current copy of everything on the failing array, and scrap it completely.  I will then rebuild it with three 8-drive RaidZ2 vdevs, just like the new array, and forgo the hot spare.  I’ll lose a significant amount of storage (6TB), but this whole event will be much less likely to occur again.  Any future resilvering will be limited to 8 drives, instead of 24.  Also, my IOPS will be greatly improved, because ZFS stripes across multiple vdevs.
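The 6TB figure can be sanity-checked with a bit of arithmetic.  This is a rough sketch only: it assumes 2TB drives in the old array (48TB across 24 drives), counts raw capacity minus two parity drives per vdev, and ignores ZFS metadata overhead and the TB vs TiB difference.  The helper name is hypothetical.

```python
RAIDZ2_PARITY_DRIVES = 2  # RaidZ2 spends two drives' worth of space on parity per vdev


def raidz2_usable_tb(drives_per_vdev: int, vdev_count: int, drive_tb: float) -> float:
    """Approximate usable space: data drives per vdev x vdevs x drive size."""
    if drives_per_vdev <= RAIDZ2_PARITY_DRIVES:
        raise ValueError("a RaidZ2 vdev needs more than two drives")
    return (drives_per_vdev - RAIDZ2_PARITY_DRIVES) * vdev_count * drive_tb


# Old array today: one 23-drive vdev plus a hot spare (the spare holds no data).
old_layout = raidz2_usable_tb(drives_per_vdev=23, vdev_count=1, drive_tb=2)

# Old array after the rebuild: three 8-drive vdevs, no hot spare.
rebuilt_layout = raidz2_usable_tb(drives_per_vdev=8, vdev_count=3, drive_tb=2)

# New enclosure: two 8-drive vdevs of 3TB drives.
new_pool = raidz2_usable_tb(drives_per_vdev=8, vdev_count=2, drive_tb=3)

print(old_layout, rebuilt_layout, old_layout - rebuilt_layout)
print(new_pool)
```

By this estimate the rebuild trades 42TB for 36TB, which matches the 6TB loss, and the new two-vdev pool of 3TB drives also lands at about 36TB.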


‘Twas the night before Thanksgiving, when all through the house… The DVR was not working; no TV for my spouse.

So the night before Thanksgiving, I’m doing some work in my office and I hear, “Dave…  The TV isn’t working.”  Great.  The DirecTV HD DVR in my living room evidently crapped out.  It locked up, and upon rebooting it I was greeted with an “ERROR 14 – Internal Storage Problem.”  Several reboots later, I was on the phone with DirecTV.  They ordered a replacement receiver, but could not guarantee what model I would receive.  Over a year ago, I fought tooth and nail to get the HR24 receiver, because the previous models are painfully slow.  I was certain that I was going to get one of the previous models, because the guy said I would be receiving a refurbished unit. Oh, and did I mention I have to pay $20 for shipping on the replacement receiver?

The order didn’t get officially entered until Friday afternoon, due to the holiday.  7 days go by…

Last night I received the replacement unit.  To my surprise, it was an HR24!  I was quite pleased.  I immediately unboxed it and hooked it up in place of the dead unit.  Upon plugging it in, I kept hearing the “bong” noise through the speakers that you normally hear when you press an invalid button on the remote.  My first thought was that they had shipped me a remote and I didn’t see it in the box, and the buttons were being mashed.  No remote in the box.  I went through the entire activation process (muted the audio).  When the programming finally came up, it started wildly cycling through the screen resolutions.  So fast that my TV couldn’t keep up and I basically just saw a black screen.  The resolution indicator on the front of the unit was rapidly and erratically cycling between 480p, 720p, 1080i, 1080p.  This particular unit has a touch-sensitive front panel, so it does not have any physical “hard” buttons.  For some reason, it thinks the resolution button is being pressed repeatedly.  In an effort to stop it, I press the resolution “button”, and some of the other buttons on the front of the unit.  No joy.  It just keeps cycling through the resolutions.  I thought maybe I should block the front of the unit, to make sure that it wasn’t getting some erroneous IR signal causing this to happen.  The cycling continued.  So again, I’m on the phone with DirecTV.  This time they run me through 20 minutes of useless troubleshooting.  “Try unplugging the HDMI cable from the receiver.  Ok, reboot it.  Ok, now try unplugging the HDMI cable from the TV.”  ?!  Eventually, the “tech” concludes that the replacement receiver has an internal issue.  They assign me a case manager and tell me that I will be contacted within two hours.  Mind you, it is 8:30pm on a Wednesday.  Unsurprisingly, I don’t get a phone call that night.

This morning I got a call just after 8am from a case manager in Colorado.  She asked me to explain to her, in detail, the issue and the steps I’ve taken to try to resolve it.  She recommends having a technician visit my house.  I tell her that I don’t want to take time off work to resolve this issue.  She says that my only other option is to “chance” having another receiver shipped to my house.  She says that she has no way to guarantee that it will be an HR24, and that it may or may not have a similar issue.  How can you run a business like this?  Is there no quality control?  By the tone in her voice, I could tell that she did not have a high level of confidence in the refurbished receivers that ship from their warehouse.  I suspect that their quality control process consists of little more than checking to see if the lights come on when it is plugged in, and then it gets shipped out to another poor sap.  There is no shipping charge for the second replacement, and I managed to get her to credit me the shipping for the first unit.  Still not really compensation for not having service for what will probably wind up being 11 days (at best).

I’m amazed by how little control anyone seems to have over this process.  She said the receiver should arrive in 2 days, but she couldn’t promise that the order would get processed today.  It was 8:30am (6:30am her time) on a weekday.  If she submitted the order during the call, how could it not be processed the same day?  She had no ability to choose to ship me a new receiver, vs a refurbished one.  Apparently the only way you can get a new receiver is to have a tech visit your home and install it himself.  What a sweet setup.

I realize it’s just TV.  Big fucking deal, in the grand scheme of things.  But this kind of stuff infuriates me to no end!  So, DirecTV, you are on very thin ice.  One more strike and I’ll be choosing an alternative content provider.

To be continued…


The damn tree

So my neighbor has a large Silver Maple tree in her backyard.  It hangs over my driveway and frequently drops large branches on our vehicles.  I’ve had lots of dents/scratches in my vehicles over the years, courtesy of this tree.  So I finally work out a deal with my neighbor to have the tree cut down.  It was scheduled to be cut down on Wednesday (yesterday).  However, the tree guy had to postpone the work due to weather.  This morning I went to get in my truck to go to work, and couldn’t open the door.  The door handle was GONE.  There’s a huge 20′ limb lying next to my truck, and the door handle is lying in the grass next to it.  The door also has some nice scratches on it.  Awesome!  How’s that for lucky?  The same day the tree was supposed to be cut down, we have a wind storm and a fallen limb breaks the door handle off of my truck.


How hard is it to manufacture a power button?

So over the last couple of days, I have been building a new VMware server / NAS in my basement.  This involves ordering and assembling probably 50 different components.  One of those components is the case that everything mounts to.  Last night I completed mounting everything in the case, and began testing.  I could not, for the life of me, get this server to turn on.  Since I had assembled everything over a span of a couple of days, I figured I must have missed or forgotten about something.  I start double checking all of my work, and I can’t find anything wrong with it.  Finally, I get frustrated and just short the pins for the power button with a screwdriver.  The server immediately powers on!  WTF?  So I test the power button with an ohmmeter.  It reads about 130 ohms when pressed.  A good switch should read close to zero, so the button sort of works, but isn’t quite completely making a connection.  How the hell does that even happen?  How hard is it to manufacture a power button?

So my options are these:  I either disassemble the entire server and return the case for a new one (a huge amount of work), I install a secondary power button and stick it to the front of the case (cheese), or I just deal with a server that doesn’t have a power button.  I have the server configured to automatically power on when it receives power from the wall, but having to unplug the power cord to shut down the server is not exactly what I had in mind.  What a pain in the ass!

Oh, and to be clear, this is a rackmount case, with a one-of-a-kind power button.  It’s not something I can just purchase and replace.


Who does that?

In May of 2010, we are in the process of moving from Lansing to Grand Rapids. I have loaded up the Traverse I was driving at the time with boxes and kids, and hooked up the camper to haul to the new place. Chris has already headed out with his Dad and a truckload of stuff. On the highway in 5 o’clock traffic on a Friday, I am driving along when all of a sudden the car in front of me swerves out of our lane. I look quickly around and see that I cannot make the same move, not only because I am hauling the longest freaking camper ever but because there are cars all around me. Then I see why the car swerved… there is a freaking tire, rim and all, lying in the middle of the highway! I have no option but to slow down, but that doesn’t help much. I run over the stupid tire, it catches on the undercarriage, and I continue to drag it while I come to a stop. Once I have stopped and pulled over, I get out (a fun task all by itself on the highway at 5 o’clock), smell rubber burning, and look under the car. Yup, the tire is lodged under there.  Does anyone stop to help? Nope.  Who would stop for the crazy lady kicking the front end of the car on the highway screaming obscenities, “Who leaves a f*&^%^& tire on the highway?” Once I calm down a bit, I decide to back up my car; mind you, I have the camper hooked up and I am on the HIGHWAY at 5 o’clock. I get this train backed up, remove the tire from under my car, and kick the stupid f@#! down the side of the road. So if you’re ever driving on 96 NW towards GR, look to your left just past the Grand Ledge exit; you’ll see the stupid tire down there by a tree.
