Use PowerShell to Edit a CSV, Revisited

1. Back in January 2019, I started writing on another website. In December of that same year, I modified how I was doing things. Instead of just writing a full post there, I would start a new post there and then finish the post here, on my site. I did that through June 2020. All the posts from January 2019 to November 2019 are gone and all the posts from December 2019 to June 2020 are partially gone (as only a portion was written away from tommymaynard.com).

2. I have recently mentioned that if I hit 416 posts by the end of June 2022, I will have an average of one post per week over an eight-year timeframe. Bringing those posts back to life and publishing them on my site would get me to my goal much quicker. Beyond that, I have long wanted to recapture that work anyway, and realistically, those posts could have just as easily — maybe even more easily — been published here, to begin with.

As it is time to fix this, it would not make sense to do things manually unless it was an absolute requirement. Automation is kind of why we are all here. I intend to recover as many of my posts as I can and publish them here with help from archive.org — the Internet Archive — and their API.

I have been down the CSV road before. While not as popular as my Hash Table to CSV post, with greater than 21,000 all-time views, my Use PowerShell to Edit a CSV post has greater than 15,000 all-time views. Hash Table to CSV has part II or revised post, and now Use PowerShell to Edit a CSV does, too.

At first, I started to manually collect data — the data in the below image. I did not want to do that any more than I had already done. That decision was all I needed to save some time that was better spent prepping and preparing for this post. In this first image, you can see the CSV from which I began.

Do notice a few things: One, we’re missing some dates on the left, two, we’re missing all the URLs from the WB_URL column, and three, all the Notes are empty too. Our goal is to add dates on the left where we can, add URLs into the WB_URL column, and Notes when that’s necessary. There’s nothing to add to the TM_URL column. Those empty lines indicate posts that were never written and published on my site.

The first thing to do is ensure I can import the above CSV file. We will use the $CsvBasePath variable to hold the path to the file we will import with the Import-Csv command. The values returned from this command will be stored in the $CsvFileContents variable. Then they will be output to a formatted table, so they are easier to read. This is just a quick check to ensure the CSV file can be imported. As you can see below, the file’s contents could be imported. This is the identical data we saw in the previous image.

$CsvBasePath = 'C:\users\tommymaynard\Desktop\tommymaynard.com'
$CsvFileContents = Import-Csv -Path "$CsvBasePath/Corrections.csv"
$CsvFileContents | Format-Table -AutoSize

If you want to read about it, and it is short, here is information on the API that we will be using: https://archive.org/help/wayback_api.php. It is too important not to show you. The below image is what a response looks like when it is returned from using the API. We will use this API with Invoke-RestMethod. It may not mean much now, but you may end up referring to it as you progress further into this post. As you will see shortly, I will use …status -eq 200, I will use and edit the timestamp, and I will collect the URL. Having seen this image may make understanding the PowerShell much more straightforward.

I now know the CSV file can be imported and I understand the structure of the API response.

In the below PowerShell, I removed outputting the contents of the $CsvFileContents variable to the host, and instead, I set up a ForEach-Object looping construct. The first thing done inside this loop is put the current PS_URL value into the $UriRemainder variable in line 5. In line 6, we concatenate the $UriBase variable and the $UriRemainder variable and use them as the URI supplied to the Invoke-RestMethod command. We will continue working through this code below.

$CsvBasePath = 'C:\users\tommymaynard\Desktop\tommymaynard.com'
$UriBase = 'http://archive.org/wayback/available?url='
$CsvFileContents = Import-Csv -Path "$CsvBasePath\Corrections.csv"
$CsvFileContents | ForEach-Object {
	$UriRemainder = $_.PS_URL
	$WBInfoFull = Invoke-RestMethod -Uri "$UriBase$UriRemainder"
	If ($WBInfoFull.archived_snapshots.closest.status -eq 200) {
		$WBInfoDate = $WBInfoFull.archived_snapshots.closest.timestamp
		$WBInfoDate = -join $WBInfoDate[0..7]
		$WBInfoUrl = $WBInfoFull.archived_snapshots.closest.url
		$WBInfoFull
		$WBInfoDate
		$WBInfoUrl
		<#
		$_.Date = $WBInfoDate
		$_.WB_URL = $WBInfoUrl
		#>
		# Remove-Variable -Name WBInfoFull,WBInfoDate,WBInfoUrl
	} Else {
		'[[[[[NOPE]]]]]'
	} # End If-Else.
	$_
} # | Export-Csv -Path "$CsvBasePath\Corrections-temp.csv" -NoTypeInformation

The Invoke-RestMethod command reaches out to archive.org and stores the result — remember the response object — in the $WBInfoFull variable. For each iteration through the loop, this variable is repeatedly filled with data from each lookup against the Wayback Machine — another name for archive.org if I did not say that already. If the status is 200, we know our Invoke-RestMethod command was a success and so we progress further into the If portion of our nested If-Else construct.

We will then set $WBInfoDate by returning the timestamp property such as 20201022005417 and then joining the first eight “digits” [lines 8 and 9]. We then set, or assign, the $WBInfoURL variable. In the remainder of this code, we just dump our values to the screen, clear the variables, and then move onto the next line in the CSV file. We have yet to actually write to a CSV yet.

The below image shows a portion of the output generated by the above commands. Again, I’m not writing to the CSV yet; I’m only making sure the values in my variables are accurate.

It is here where the working PowerShell will be modified in such a way, that we can begin writing to the CSV file. I should make something clear. I did not edit an existing CSV file as much as I created a new one. I suppose I could have written back to the same file…or maybe I could not. That is probably worth finding out someday, but I do suspect that an open CSV can be written to. In the first iteration of this post, I created a temporary file and then did a remove/rename, so it appeared I actually edited a file.  But this whole time, I have not really been editing anything. Such a fraud, I know.

Moving along though. The below changes include having removed the code that outputs the values in the variables to the screen. Instead, these values are being written to the Date and WB_URL columns. When the status isn’t 200, instead of writing [[[[[NOPE]]]]] to the screen as I did above, something more pleasant and professional is written to the Notes column. In order to write to a new CSV, I uncommented the Export-CSV command, as well. In the post’s final image, you can view the “updated” CSV file. By my count, there are over 40 lines in this CSV that I didn’t have to type or paste in after manually doing the search myself. I’ll take it!

Always search for an API to use. Always.

$CsvBasePath = 'C:\users\tommymaynard\Desktop\tommymaynard.com'
$UriBase = 'http://archive.org/wayback/available?url='
$CsvFileContents = Import-Csv -Path "$CsvBasePath/Corrections.csv"
$CsvFileContents | ForEach-Object {
	$UriRemainder = $_.PS_URL
	$WBInfoFull = Invoke-RestMethod -Uri "$UriBase$UriRemainder"
	If ($WBInfoFull.archived_snapshots.closest.status -eq 200) {
		$WBInfoDate = $WBInfoFull.archived_snapshots.closest.timestamp
		$WBInfoDate = -join $WBInfoDate[0..7]
		$WBInfoUrl = $WBInfoFull.archived_snapshots.closest.url
		$_.Date = $WBInfoDate
		$_.WB_URL = $WBInfoUrl
		Remove-Variable -Name WBInfoFull,WBInfoDate,WBInfoUrl
	} Else { 
		$_.Notes = 'Unable to locate an archived webpage for that URL.'
	} # End If-Else.
	$_
} | Export-Csv -Path "$CsvBasePath\Corrections-temp.csv" -NoTypeInformation

While the above image includes the updated data, there have been some unexpected changes from the CSV image earlier in this post. This was worrisome for me at first — why were the dates changing!? My PowerShell worked before, so why not now? It turns out that it still is working. The date changes are because newer snapshots of the pages have been taken by the Wayback Machine since I began this post back in October — yeah, it has been a long time coming. Therefore, no worries. What clued me in was the above code and the response. Take a look at this for a brief moment.

$WBInfoFull.archived_snapshots.closest.timestamp

The keyword is “closest,” as in the most recent snapshot. My most recent snapshots changed between the creation of my original CSV file, and the updated one.

There may be a way to use PowerShell and an API to gather the old content, but for now, I am collecting it manually. I want to reclaim this content ASAP in order to line up getting these posts — the ones that are still relevant — republished here on tommymaynard.com. It is a lot of work, but I am after 416 posts by the end of June 2022. It is fair to say that I will be back with some new, old content very soon.

Leave a Reply

Your email address will not be published. Required fields are marked *