How to Archive Your Online Articles
When Yahoo Contributor Network (YCN) shut down in 2014, I had more than a hundred articles published on the site. Unless I did something to preserve that treasure trove (at least that’s what I considered it to be), all that content would simply disappear when the YCN site went away.
To make sure that didn’t happen, I wanted to create an online archive of my articles where they would be available in almost exactly the form in which they appeared on the YCN site.
Knowing that a number of these articles had already been stolen and republished on rogue websites, I needed my archive to be accessible online. That way I could simply provide a link to my original content in order to establish proof of authorship when filing DMCA copyright violation complaints.
On the other hand, because I would be republishing some of this work on other writing sites, I needed to ensure that my archive would not show up as duplicate content in web searches done through Google, Bing, or other search engines.
Since I already had a self-hosted WordPress website* I could use to house my files, I just had to figure out how to transfer my articles so that they both retained their original appearance and wouldn’t be listed by search engines.
After some trial and error, I came up with a three-step process to create such an archive, and I thought it might be useful to other online writers to know what I did. The steps are:
1. Copy your article web pages onto your computer
2. Upload your articles to your web site
3. Set up a robots.txt file to prevent search engines from seeing your files
I make no claim of this being the best way to go about creating such an archive; it’s simply the way I chose to do it.
So, here are the steps a writer could take to create an online archive similar to mine.
* Note that you can only upload files to a website you own. If you have a free wordpress.com site, for example, you can't upload files and so can't use this method.
1. Copy Your Article Web Pages Onto Your Computer
The first step is to get a copy of each of your article web pages, along with all files (like image files) necessary for the page to appear as it originally did. For me as a Windows user, this was a very simple though somewhat time-consuming process.
All you have to do is open the web page of each article in a browser, and do a Save As to your computer.
In Windows this is as simple as hitting Ctrl-S. That opens up a window that will allow you to save the article’s web page file, plus all the ancillary files that are necessary to retain its original appearance.
Saving your web page to a folder on your computer
Start by selecting or creating a folder on your computer to receive the downloaded files. Now, for each article file, open it in your browser and use Ctrl-S to save it into the folder you selected.
The Save As process will place two entities into your download folder. The first is the file named in the File name box. The second is a folder containing all the files necessary to allow the page to retain the appearance it had online.
Here’s how the Save As box looked when I pressed Ctrl-S to save an article called Pennsylvania’s “Benevolent Gesture” Bill Makes Sense into my Yahoo folder.
Both the web page file and the folder containing the ancillary files are given the same name, except that the folder has “_files” added to the end of the name. This common name is what links the two together.
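If you save a lot of pages, it can be worth checking that every page still has its companion folder before you upload anything. Here is a minimal Python sketch of such a check. It assumes the companion folder ends in "_files", the suffix seen in the index listing later in this article; the suffix is a parameter in case your browser uses a different one. The function name is my own invention for illustration.

```python
from pathlib import Path


def find_unpaired_pages(folder, suffix="_files"):
    """Return the names of saved pages that have no companion folder."""
    folder = Path(folder)
    missing = []
    for page in sorted(folder.iterdir()):
        # Only saved web pages are expected to have a companion folder.
        if page.is_file() and page.suffix.lower() in (".html", ".htm"):
            companion = folder / (page.stem + suffix)
            if not companion.is_dir():
                missing.append(page.name)
    return missing
```

Calling find_unpaired_pages("Yahoo") on a download folder named Yahoo would list any page whose link to its folder has been broken, for example by a rename.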
Important tips concerning file names
The name with which you download your web page will be its name from now on. That’s because if you rename either the web page file or its associated folder, the link between them will be broken. That happens even if you rename them to the same name. The only approved way to rename a downloaded web page is to open it in your browser, and save it again under the new name. So, be sure to put your desired name into the File name box before saving the page.
I should have modified the name of this file before saving it for a couple of reasons.
First of all, the name automatically given to it by the YCN site carries a lot of extra baggage I didn’t need (the part that says “-Yahoo Voices – voices.yahoo.com”). All I really wanted for the downloaded filename was the article title alone.
Watch out for “special” characters in the file name
The second reason I needed to choose a different name is that the article name has some non-standard characters in it. Although they don’t cause a problem on my Windows computer, when the article web page and its associated folder were uploaded to my web site, those non-standard characters prevented the linkage between the two from being recognized. The result was that although I could see all the written content of my page, all the formatting, as well as the images it contained, were lost.
Here’s how the original page looked on the YCN site:
But because of the interference caused by the non-standard characters in the name, here’s how it appeared on my web site:
Here are the non-standard characters that can get you into trouble
What were those non-standard characters that messed up my beautifully formatted page? Here are the ones I’ve found: ; : ‘ ’ “ ” –
These are the “smart” versions of double quotes, single quotes, and dashes that may be produced by a document editor like Microsoft Word, plus colons and semicolons. When my website server sees any of those characters in a file or folder name, it doesn’t know what to do with them. Here’s how the name of the file I uploaded looked in the file manager of my website:
Pennsylvania�s �Benevolent Gesture� Bill Makes Sense - Yahoo Voices - voices.yahoo.com.html
The easy solution is to either strip such characters out of the file name completely, or replace any “smart” characters with their simple equivalents. In other words, if I select a smart quote ( “ ) in the File name box, and type over it with that same character from the keyboard, ( “ ) becomes ( " ) and the problem is eliminated.
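The cleanup described above can also be worked out ahead of time, so you know exactly what to type into the File name box. The following Python sketch is only an illustration under a few assumptions of mine: it swaps smart single quotes and the dash for their keyboard equivalents, and it takes the stricter option of dropping double quotes, colons, and semicolons entirely, because Windows does not accept a straight double quote or a colon in a file name. The names SMART_TO_PLAIN, STRIP, and plain_name are hypothetical.

```python
# Smart characters that have a safe keyboard equivalent.
SMART_TO_PLAIN = {
    "\u2018": "'",  # left single smart quote
    "\u2019": "'",  # right single smart quote
    "\u2013": "-",  # en dash
}

# Characters that are simply removed from the name.
STRIP = set('\u201c\u201d";:')


def plain_name(name):
    """Return the name with the troublesome characters replaced or removed."""
    cleaned = []
    for ch in name:
        if ch in STRIP:
            continue
        cleaned.append(SMART_TO_PLAIN.get(ch, ch))
    return "".join(cleaned)
```

For the example article, plain_name turns the title into Pennsylvania's Benevolent Gesture Bill Makes Sense, with a plain apostrophe and no double quotes.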
Get rid of spaces!
One final thing I would now do in renaming my downloaded web page is to replace all spaces in the name with dashes or underscores. So, “Bill Makes Sense” would become “Bill-Makes-Sense” or “Bill_Makes_Sense”. The reason for that is purely esthetic. Any space in a filename shows up as %20 in the page’s web address. (%20 is the URL encoding of the space character, whose ASCII code is 20 in hexadecimal.) So, “Bill Makes Sense” would be seen as “Bill%20Makes%20Sense”. I’d rather see the dashes.
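To see the difference for yourself, here is a small Python sketch. It uses quote from Python's standard urllib.parse module, which applies the same kind of encoding a web address gets; the two helper names are my own, chosen just for this illustration.

```python
from urllib.parse import quote


def url_form(name):
    """Show how a file name appears once it is encoded for a web address."""
    return quote(name)


def dashed(name):
    """Replace each run of spaces in a name with a single dash."""
    return "-".join(name.split())
```

Here url_form("Bill Makes Sense") gives Bill%20Makes%20Sense, while url_form(dashed("Bill Makes Sense")) gives Bill-Makes-Sense, since a dash needs no encoding.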
Once you get your article web pages downloaded to your computer under the names you’d like them to have, the next step is to upload them to your web site.
2. Upload Your Articles to Your Web Site
You will need to upload both the article file and its associated folder to the same folder on your website. Typically, you'll create this folder through your site's file manager dashboard as a subfolder of your public_html folder.
The easiest way to upload files is by use of a program called an FTP client. That’s simply an application you run on your computer that allows you to bulk upload files to the chosen folder on your website.
The FTP client recommended by my web hosting service is FileZilla, and that’s the one I used. You can get more information about this free, open source program at https://filezilla-project.org/.
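I did my uploads with FileZilla, but if you are comfortable with a little scripting, the same bulk upload can be sketched with Python's built-in ftplib module. This is not what I used, just an illustration under some assumptions: the host name, user name, password, and the remote folder (given as a relative path such as public_html/myArchive) are placeholders you would replace with your own, and the function names are hypothetical. Plain FTP is not encrypted; ftplib also provides FTP_TLS if your host supports it.

```python
import os
from ftplib import FTP, error_perm


def upload_plan(local_dir, remote_dir):
    """List (local_path, remote_path) pairs for every file under local_dir."""
    plan = []
    for root, _dirs, files in os.walk(local_dir):
        rel = os.path.relpath(root, local_dir)
        parts = [remote_dir] if rel == "." else [remote_dir] + rel.split(os.sep)
        for name in files:
            plan.append((os.path.join(root, name), "/".join(parts + [name])))
    return sorted(plan, key=lambda pair: pair[1])


def ensure_remote_dir(ftp, folder, made):
    """Create each level of a relative remote folder path, once."""
    prefix = ""
    for part in folder.split("/"):
        prefix = part if not prefix else prefix + "/" + part
        if prefix and prefix not in made:
            try:
                ftp.mkd(prefix)
            except error_perm:
                pass  # most likely the folder already exists
            made.add(prefix)


def upload(host, user, password, local_dir, remote_dir):
    """Upload every page and companion folder, keeping the folder layout."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        made = set()
        for local_path, remote_path in upload_plan(local_dir, remote_dir):
            ensure_remote_dir(ftp, remote_path.rsplit("/", 1)[0], made)
            with open(local_path, "rb") as handle:
                ftp.storbinary("STOR " + remote_path, handle)
```

Running upload_plan on its own first is a harmless way to preview exactly which files would go where before anything is sent.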
In researching FTP clients I ran across an interesting alternative you might want to check out. It’s called FireFTP, and like FileZilla, it's free. As the name suggests, it’s an add-on to the Firefox browser. Once you install FireFTP, it will appear on the browser’s Tools menu. You only have to click on it, and a simple window opens that lets you upload your files quickly.
You can see further information about FireFTP, and download it if you so desire, on cnet.com.
3. Set Up Your robots.txt File to Prevent Search Engines From Seeing Your Files
Search engines use web crawling robots to identify every file that is accessible from the internet. However, there is provision for people who don’t want these robots to see their files to opt out. It’s called a robots.txt file.
The robots.txt file, which is housed in the top-level directory of your web site, gives specific instructions to any web crawler about which folders or files on your site should be ignored.
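To give a rough idea of what such instructions look like, here is a Python sketch that builds a two-line robots.txt and then checks it with Python's standard urllib.robotparser module. The folder name myArchive is the example name used later in this article, the function names are hypothetical, and the rule shown is only one simple possibility: it asks every crawler that honors robots.txt to stay out of a single folder.

```python
from urllib.robotparser import RobotFileParser


def robots_txt(folder):
    """Build a robots.txt that asks all crawlers to skip one folder."""
    return "User-agent: *\nDisallow: /" + folder.strip("/") + "/\n"


def is_blocked(robots_text, path, agent="Googlebot"):
    """Report whether a rule-abiding crawler would skip the given path."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return not parser.can_fetch(agent, path)
```

With the text from robots_txt("myArchive"), is_blocked reports True for /myArchive/My-Article.html and False for /index.html, so the rest of the site stays visible to search engines.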
In another article I give detailed instructions on how to set up a robots.txt file.
How to Get a Table of Contents for Your Uploaded Files
Here’s one final tip I found very useful. If you enter the web address of your archive folder (without any file names) into your browser, it will list the files that folder contains, producing a page that looks something like this:
Index of /myArchive
- Parent Directory
- My-Second-Article_files/ ... and so on.
You can open any article simply by clicking on its link on the index page.
Also, I found it convenient to copy the index into a Microsoft Word document: press Ctrl-A followed by Ctrl-C in the browser, then Ctrl-V to paste the list into Word, and then delete the lines that end in "_files/". That way, I can use that Word document as a Table of Contents, and access any of my article files simply by holding down the Ctrl key while clicking on the link.
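If you would rather not trim the list by hand, the same Table of Contents can be sketched in Python. This is only an illustration: it takes the names shown on the index page, keeps the web page files, skips the "_files/" folders, and produces a simple HTML list of links. The base address (something like https://example.com/myArchive) is a placeholder, and the function name is hypothetical.

```python
from html import escape
from urllib.parse import quote


def toc_html(names, base_url):
    """Build an HTML list linking to each saved page, skipping folders."""
    links = []
    for name in sorted(names):
        # Keep only the web page files; companion folders are left out.
        if not name.lower().endswith((".html", ".htm")):
            continue
        title = name.rsplit(".", 1)[0]
        url = base_url.rstrip("/") + "/" + quote(name)
        links.append('<li><a href="%s">%s</a></li>' % (escape(url), escape(title)))
    return "<ul>\n" + "\n".join(links) + "\n</ul>"
```

The result can be saved as an HTML file and opened in a browser, where each article is one click away.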
My Files Look the Way They Should
My uploaded files appear on my website in almost exactly their original form, including, by the way, comments and most ads.
If you’d like to see the article I’ve been using as an example of the process, you can access it by clicking here.
There may be quicker and easier ways to do what I’ve done here, but for someone whose sole interest is in preserving his articles exactly as they originally looked, this works for me.
I hope it works for you as well.
© 2014 Ronald E Franklin