Erica Thomas manages the InfoDev team responsible for updating all user documentation for SolarWinds products. While documentation is often treated as a supporting resource, its accuracy can directly affect customer satisfaction and product usability. “We’re not just writing text,” says Erica. “We’re developing and maintaining an essential part of the product experience.”

Troubleshooting Script Errors

As part of their documentation processes, Erica’s team uses custom PowerShell scripts that automate various steps, such as building content, processing files, and preparing updates for publication. However, only a few of the team’s 11 writers were comfortable troubleshooting script errors, especially when the issues stemmed from setup problems or nuanced scripting bugs. “We relied on individual writers to copy and paste error messages they encountered,” Erica explains. “But if they missed details or copied incomplete logs, it could take hours to recreate what happened and figure out where the problem started.”

This inefficient back-and-forth often led to:

  • Significant delays in publishing documentation updates
  • Writers feeling intimidated by technical errors they couldn’t easily diagnose
  • Erica having to spend time piecing together incomplete information to troubleshoot

To solve these issues, Erica implemented centralized logging:

  • Automated error reporting: “Our custom scripts now send logs, error messages, and warnings directly to SolarWinds Observability SaaS via a webhook. The script posts messages formatted like syslog messages,” Erica says. (A minimal sketch of this kind of webhook call follows the list.)
  • Comprehensive visibility: “I can log into SolarWinds Observability SaaS and see all logs from before, during, and after the error, instead of relying on someone to notice and report the issue manually.”
  • Faster resolution: By eliminating fragmented communication, Erica can quickly identify the root cause of errors and guide her team to a fix, saving hours or even days of troubleshooting time.
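
The case study doesn’t show the team’s script itself, so the following is only a minimal sketch of the kind of webhook call Erica describes, written in PowerShell since that is what the team’s automation uses. The endpoint URL, token variable, and message fields are placeholders rather than the team’s actual configuration.

    # Minimal sketch (hypothetical): post a syslog-style message to an observability webhook.
    # $env:LOG_WEBHOOK_URL and $env:LOG_WEBHOOK_TOKEN are placeholders for values provisioned
    # in SolarWinds Observability SaaS; the team's real script is not reproduced here.
    param(
        [string]$Message,
        [string]$Severity = "error"
    )

    $endpoint = $env:LOG_WEBHOOK_URL
    $token    = $env:LOG_WEBHOOK_TOKEN

    # Build a syslog-like line: <priority>timestamp host app: message
    $timestamp = (Get-Date).ToUniversalTime().ToString("yyyy-MM-ddTHH:mm:ssZ")
    $line = "<11>$timestamp $env:COMPUTERNAME docs-build: [$Severity] $Message"

    try {
        Invoke-RestMethod -Uri $endpoint -Method Post `
            -Headers @{ Authorization = "Bearer $token" } `
            -ContentType "text/plain" -Body $line
    } catch {
        # If the webhook is unreachable, keep a local copy so the error isn't lost.
        Add-Content -Path "$PSScriptRoot\build-errors.log" -Value $line
        Write-Warning "Could not reach log endpoint: $_"
    }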

Securing the File Upload Process

Another major challenge was the way Erica’s team handled uploading completed documentation to the web server. They relied on the aging File Transfer Protocol (FTP), which offered no built-in safeguards, and had to coordinate carefully whenever an urgent update required a manual push to make documentation live before the next scheduled sync.

This created several serious issues:

  • Long upload times, sometimes exceeding 30 minutes on slower connections, made it difficult for writers to keep track of whether an upload was still in progress
  • Urgent updates required manually initiating a push that disconnected the FTP service, risking corruption if anyone was in the middle of uploading
  • Communication relied on team members noticing last-minute messages about these manual pushes—an unreliable method, especially if someone stepped away for coffee or was distracted

One particularly stressful incident underscored the risks. “There was a time when someone accidentally deleted several important files from the root of our website,” Erica recalls. “They were just trying to upload documentation for one product, but ended up deleting multiple products. Once they realized what happened, they stopped and closed the FTP tool. But because there was no logging on their computer, we had no record of what was deleted. We were forced to rely on memory to figure out which folders were missing, and it was incredibly frustrating.”

Erica turned to an existing but underused resource:

  • Leveraging FTP logs in Loggly: “Our web team was already storing all FTP logs in Loggly. Now, before doing a manual push, I check Loggly for recent activity in the last ten minutes,” Erica says. “That way, I know exactly who’s been uploading and can reach out directly to confirm whether it’s safe to proceed.” (A rough sketch of this kind of check appears after the list.)
  • Reducing risk of incomplete uploads: By confirming active uploads are complete before pushing, the team avoids publishing incomplete content or losing data.
  • Recovering from mistakes: “When someone accidentally deletes the wrong folder, I use the FTP logs to get a clear record of what was deleted, so we know exactly what to recover and recreate.”
  • Strategic alerting: “In the case above,” explains Erica, “we were lucky that the person even noticed their mistake. If they hadn't, we would have had an even bigger problem on our hands, as the deletion would have persisted to the production server.” Erica created alerts in the Loggly® tool to notify the team when an entire folder is deleted. Importantly, the alert does not fire when only a folder’s contents are deleted, since that is normal practice.
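
Erica performs this pre-push check in the Loggly console, but the same idea can be scripted. The sketch below is a rough illustration only: it assumes Loggly’s two-step search API (/apiv2/search followed by /apiv2/events), and the subdomain, token, and query string are placeholders, since the exact query depends on how the web team tags its FTP logs.

    # Rough sketch: look for FTP activity in the last ten minutes before a manual push.
    # Subdomain, token, and query are placeholders; assumes Loggly's two-step search API.
    $subdomain = "example"                                           # Loggly account subdomain (placeholder)
    $headers   = @{ Authorization = "bearer $($env:LOGGLY_API_TOKEN)" }

    # Step 1: start a search over the last ten minutes of FTP events.
    $searchUri = "https://$subdomain.loggly.com/apiv2/search?q=tag:ftp&from=-10m&until=now"
    $search    = Invoke-RestMethod -Uri $searchUri -Headers $headers

    # Step 2: fetch the matching events using the returned search id (rsid).
    $eventsUri = "https://$subdomain.loggly.com/apiv2/events?rsid=$($search.rsid.id)"
    $events    = Invoke-RestMethod -Uri $eventsUri -Headers $headers

    if ($events.total_events -gt 0) {
        Write-Warning "FTP activity in the last 10 minutes - confirm with the uploader before pushing."
        $events.events | ForEach-Object { $_.logmsg }    # show who was uploading what
    } else {
        Write-Output "No recent FTP activity - safe to proceed with the manual push."
    }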

Detecting and Resolving 404 Errors

Maintaining hundreds of pages across multiple products inevitably leads to the occasional broken link or missing file. In the past, Erica’s team frequently encountered issues where certain files returned 404 errors after a writer changed a link or accidentally omitted a topic.

“These weren’t random errors—it was often the same file affected repeatedly,” Erica says. “The problem was, we would only discover it after customers reported the broken link, which hurt the user experience and made us look careless.”

To catch these issues before they reached users, Erica’s team transitioned from the Pingdom® solution to SolarWinds® Observability SaaS and began using transaction monitoring:

  • Automated testing of key user flows: “We set up transaction monitoring, so it visits specific URLs, finds designated links, clicks them, and checks whether the resulting page has specific text in the H1 tag,” Erica explains. “If the expected text isn’t there, it means the page didn’t load as expected or returned a 404 error.” (A stand-alone sketch of this kind of check follows the list.)
  • Proactive alerts: “We now get notified about broken links for important pages before our customers ever encounter them.”
  • Targeted fixes: Transaction monitoring pinpoints exactly where the process failed, so the team can fix issues immediately instead of sifting through reports or guessing what went wrong.
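
The checks themselves are configured inside SolarWinds Observability SaaS rather than written by hand, but the same idea can be approximated in a short script. The sketch below is a stand-alone, hypothetical version of one such check: it loads a page, follows a designated link, and verifies the H1 text on the target page. The URLs, link text, and expected heading are placeholders.

    # Stand-alone approximation (hypothetical) of the check transaction monitoring performs:
    # load a page, follow a designated link, and verify the H1 text on the target page.
    # The URLs, link text, and expected heading below are placeholders.
    $startUrl   = "https://documentation.example.com/product/index.htm"
    $linkText   = "Release Notes"
    $expectedH1 = "Release Notes"

    $page = Invoke-WebRequest -Uri $startUrl -UseBasicParsing
    $link = $page.Links | Where-Object { $_.outerHTML -match $linkText } | Select-Object -First 1
    if (-not $link) { throw "Designated link '$linkText' not found on $startUrl" }

    # Resolve the link target and request it; a 404 surfaces here as a failed request.
    $target = [System.Uri]::new([System.Uri]$startUrl, $link.href).AbsoluteUri
    try {
        $result = Invoke-WebRequest -Uri $target -UseBasicParsing
    } catch {
        throw "Page at '$target' failed to load (for example, a 404): $_"
    }

    # Check that the H1 contains the expected text, as the real monitor does.
    if ($result.Content -match '<h1[^>]*>\s*([^<]+)</h1>' -and $Matches[1] -match $expectedH1) {
        Write-Output "OK: '$target' loaded with the expected heading."
    } else {
        Write-Error "Unexpected content at '$target' - expected H1 text '$expectedH1'."
    }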

Future Plans: Building a Robust Automation Pipeline

Despite these improvements, Erica still sees further opportunities to modernize the InfoDev team’s workflow. One of the biggest priorities is eliminating the fully manual build and upload process for documentation updates. “Right now, I need to open our documentation tool, press the build button, wait for it to finish, and then manually upload everything via FTP,” Erica explains. “If someone tries to upload at the same time I’m pushing changes, it can interrupt the service entirely.”

With a build automation pipeline, Erica envisions:

  • One-click deployments: “A system where I could simply press a button, have the latest content automatically built, and see the files moved to the correct folders without any manual intervention.” (A hypothetical sketch follows the list.)
  • Elimination of FTP: By moving away from FTP, the team can avoid disconnecting the service for manual pushes, improving security and reliability
  • Scheduled, automated pushes: Updates could happen automatically at set intervals, reducing the need for urgent manual interventions
  • Comprehensive observability of the process: Erica plans to send all logs related to the automated build pipeline to SolarWinds Observability SaaS so her team can monitor for errors and spot potential improvements
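
None of this exists yet, so the sketch below is purely hypothetical: one way the single-button flow Erica describes could look. The build command, project path, and destination share are invented for illustration, since the case study does not name the actual documentation tooling.

    # Purely hypothetical sketch of the one-click deployment Erica envisions. The build
    # command, project path, and destination share are invented for illustration; the
    # actual documentation tooling is not named in the case study.
    $buildExe  = 'C:\Tools\DocBuilder\builder.exe'       # placeholder build CLI
    $outputDir = 'C:\DocBuilds\latest'                   # placeholder build output
    $destDir   = '\\webserver\docs$\product'             # placeholder destination share

    # Step 1: build the latest content and stop if the build fails.
    & $buildExe --project 'C:\Docs\product.project' --output $outputDir
    if ($LASTEXITCODE -ne 0) { throw "Documentation build failed with exit code $LASTEXITCODE" }

    # Step 2: mirror the built files into place over the network - no FTP involved.
    robocopy $outputDir $destDir /MIR /R:2 /W:5 | Out-Null
    if ($LASTEXITCODE -ge 8) { throw "Copy to $destDir failed (robocopy exit code $LASTEXITCODE)" }

    Write-Output "Deployment complete: $outputDir -> $destDir"

The same script, invoked by a scheduled task, could also cover the automated pushes at set intervals that Erica mentions.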

Beyond automation, Erica is also excited about adding advanced monitoring for new features in documentation:

  • Advanced website coding with APM libraries: “We’re considering leveraging APM libraries to track user interactions on advanced PHP or JavaScript pages. If a user encounters an error, we’d receive alerts and have data to analyze the root cause.”
  • Improved performance insights: Monitoring site performance with APM would help ensure documentation pages load quickly and reliably for all users.

Visibility Unlocks New Efficiencies

The improvements Erica and her team made were only possible once they gained comprehensive visibility into their processes. “When you have the right solutions in place to give you holistic insight, the third- and fourth-degree uses for these tools are almost limitless,” she says. “You can’t anticipate all the benefits until you have the data and understanding. That’s when opportunities for efficiency begin emerging on their own.”

Operational success often depends on how quickly IT personnel can resolve issues when they arise. Is your incident response framework up to scratch?