battling technology: data scraping and kumu

I’ve always liked to think of myself as a reasonably technologically capable person. I know the very basics of HTML, and have managed to poke at some pre-written page code when I didn’t quite like it and come out with a result that pleased me.

Data scraping, unfortunately, did not give that same experience.

When I sat down to do this project, after plugging multiple years around the 1830s into the Dissenting Books search and coming back with 900-1000 results, I decided that surely going back a couple of decades would yield a far smaller data set. I was correct: the year 1810 only had three pages of results. So, with a halfway giddy (only three pages!) and halfway guilty (only three pages…) conscience, I began data scraping.

The scraping process itself was not difficult. I was able to successfully scrape the data from the table, and put it into Google Sheets without a problem. In fact, I enjoyed adding all the small codes to add different columns from the table into the scraped data. I understood (mostly) what was going on, and faced no troubles on that front.

From that point on, it was downhill.

My thoughts, the further I got into this process of cleaning and copying the data, went something like this: Google Sheets is now my worst enemy. If I ever have to use Google Sheets for anything other than making a cross-stitching pattern ever again I’m going to fling my computer across the room. It was a mess of formulas not working. Even now, four days after I was defeated by the dragon, I’m not sure if it was human error on my part, if some of the data went wrong, or if Sheets just decided it didn’t like formulas that day. Thankfully, my data set was small enough that when some things started to go wrong, I was able to simply go in and fix them by hand.

Things that went wrong:
– Find and replace didn’t get rid of all the :, /, and assorted punctuation at the end of book titles, no matter what variations of the formula I tried.
– The formula for finding the weight refused to actually count the books, putting a “1” in for every single book, even when I could see five different occurrences on my screen.
– Copying and pasting values was a nightmare I could not seem to escape, and I never managed to fix it.
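
For comparison, the first two items on that list can be sketched outside of Sheets. What follows is a minimal Python sketch, not the formulas used in this project: the sample rows, the borrower names, and the helper names `clean_title` and `weights` are all hypothetical, made up purely for illustration.

```python
import re
from collections import Counter

# Hypothetical sample rows of (borrower, title); not real data from the search.
rows = [
    ("Borrower A", "Sermons :"),
    ("Borrower A", "Sermons /"),
    ("Borrower B", "Travels."),
]

def clean_title(title: str) -> str:
    """Strip any run of trailing whitespace and : / . , ; from a title."""
    return re.sub(r"[\s:/.,;]+$", "", title)

def weights(rows):
    """Count how often each (borrower, cleaned title) pair occurs."""
    return Counter((borrower, clean_title(title)) for borrower, title in rows)

w = weights(rows)
for (borrower, title), count in w.items():
    print(borrower, title, count)
```

One possible reason a count comes back as “1” for every row is that titles which look identical on screen still differ by a stray space or punctuation mark, which is why the sketch cleans before it counts. That is a guess, not a diagnosis of what actually happened in the sheet.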

By the time I made it to the Connections sheet and got stuck on trying to wrangle VLOOKUP, four hours had passed, curses had poured from my mouth, and tears had been shed. It was at this point in time that I raised my figurative white flag and emailed Dr. Pauley what I’m sure was a rather disconcerting email, including a sentence that said something along the lines of “I’d rather chop off a hand than continue to do this.” (In retrospect, that line was very dramatic. I will, however, continue to stand by it.)

After Dr. Pauley swooped in and saved my Google Sheet from being abandoned completely (thank you so much, BP), I continued to avoid the data for the weekend, because I am a coward and did not want anything to do with it. However, today I finally logged back on and was pleased to have everything go smoothly with Kumu.

If you look at the map, you’ll notice something I found incredibly ironic: not a single person at Manchester Academy in 1810 checked out the same book as any other person. There are literally no connections; everyone is their own tiny map. Thanks to the weight, you can tell that some people checked out the same book multiple times. However, that is as interesting as it gets. After all of my plight to find their connections and study them, the people of 1810 said, “Checking out the same books? Nah.”
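
The “no connections” result can also be checked without a map. This is a rough Python sketch under assumptions: the loan rows are invented placeholders, and `shared_book_pairs` is a hypothetical helper, not anything Kumu or the Dissenting Books site provides.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (borrower, title) loans; repeat loans by one person are allowed.
loans = [
    ("Borrower A", "Sermons"),
    ("Borrower A", "Sermons"),
    ("Borrower B", "Travels"),
]

def shared_book_pairs(loans):
    """Return (borrower, borrower, title) for every pair who borrowed the same title."""
    readers = defaultdict(set)
    for borrower, title in loans:
        readers[title].add(borrower)
    pairs = []
    for title, people in readers.items():
        # A title only links people if at least two different borrowers took it out.
        for a, b in combinations(sorted(people), 2):
            pairs.append((a, b, title))
    return pairs

print(shared_book_pairs(loans))
```

With rows like these, where one person borrows the same title twice and nobody overlaps, the list of pairs comes back empty, which matches a map made up entirely of disconnected people.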

C’est la vie, I suppose. The fact that it turned out not to give me the kind of data I wanted to see will probably add to my desire to never, ever, ever in my life use data scraping and cleaning as a tool again.
