What can I understand about myself through my book highlights – Part 2 [Targeted Text Retrieval]

(How my eyes looked after all these late-night programming sessions to find and retrieve my precious highlights)

Read Part 1 first, so you know why I’m doing this!  It’ll take you a minute.

So, I’ve found a couple of ways to do targeted text retrieval on my book highlights. Google Play Books stores my highlights in Google Docs, but does not make it easy for me to do much with them from there.

The two methods I used are:

1) Text scanning in Ruby, and
2) “Selecting” things on a web page using JavaScript / jQuery

The first allowed me to preserve my highlights as a whole, but the second is much faster to code on the fly. Let me walk you through how I did it with the first method, then I’ll show you the second.

For the first, I had to download the source code of the page and there, I noticed that each of my highlights was wrapped in these “u001c*\n\u”…things. Yuck.

text = '%{
\n\n\u0010\u0012\u001c\n\u0010\u0012\u001c*\n\u001cWhile science, 
medicine, art, poetry, architecture, chess, space, sports, 
number theory and all things hard and beautiful promise purity, 
elegance and sometimes even transcendence, they are fundamentally 
subordinate. In the end, they must bow to the sovereignty of politics.
\n\nNovember 2, 2014\n\u001c8\n\u0011\n\u0011\n\n\u0010\u0012\u001c\n\
u0010\u0012\u001c*\n\u001cGet your politics wrong, however, and 
everything stands to be swept away. This is not ancient history. 
This is Germany 1933.\n\nJune 13, 2017\n\u001c8\n\u0011\n\u0011\n\n\

Do you see the pattern of text that surrounds my highlights? Patterns are our friends. The code I crafted to pull out my highlights is below. It basically scans the text (such as the sample above), looking for every stretch of text surrounded by those patterns we just talked about. With this one line of code, I was able to liberate my highlights, feeding them into an array (fancy computer speak for ‘list’).

data = text.scan(/(?<=\\n\\n\\u0010\\u0012\\u001c\\n\\u0010\\u0012\\
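To make the idea concrete, here is a minimal sketch of that kind of scan. It is not the exact one-liner above: the shortened sample text and the opening / closing delimiters are assumptions picked for illustration, and the real page source may need a longer pattern.

```ruby
# Shortened stand-in for the downloaded page source. Single quotes keep
# sequences such as \n and \u001c as literal backslash text, as in the source.
text = '\u0010\u0012\u001c*\n\u001cIn the end, they must bow to the sovereignty of politics.\n\nNovember 2, 2014\n\u001c8\n\u0011' \
       '\u0010\u0012\u001c*\n\u001cThis is Germany 1933.\n\nJune 13, 2017\n\u001c8\n\u0011'

# Assumed delimiters (hypothetical, not the original regex): a highlight
# starts right after the literal text \u001c*\n\u001c and ends right before
# the literal text \n\n that precedes the date.
opening = Regexp.escape('\u001c*\n\u001c')
closing = Regexp.escape('\n\n')

# Lookbehind and lookahead keep the delimiters out of the captured text.
highlights = text.scan(/(?<=#{opening}).+?(?=#{closing})/m)

highlights.each { |highlight| puts highlight }
```

The lookbehind has to be a fixed-length pattern in Ruby, which is why the opening delimiter here is one literal string rather than something with a wildcard in it.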

Now, with my text in an array, I can do more interesting things, like sampling the highlights (there are 2050 of them!) or breaking them down into words for later text analysis. Below, you’ll see how I output my work, as a simple web page that on reload shows a random highlight from my database of highlights. Nice! Apparently, something piqued my interest when I read this paragraph on the importance of politics…

Screen Shot 2017-08-02 at 7.09.40 AM
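The sampling and word-splitting mentioned above could look something like this. It is only a sketch: the three-item highlights array stands in for the real one, and words_in is a hypothetical helper name, not something from my actual code.

```ruby
# Hypothetical stand-in for the array produced by the scan step.
highlights = [
  'In the end, they must bow to the sovereignty of politics.',
  'Get your politics wrong, however, and everything stands to be swept away.',
  'This is Germany 1933.'
]

# Pick one highlight at random, like the page that shows a new one on reload.
random_highlight = highlights.sample

# Break a highlight into lowercase words, dropping punctuation.
# (words_in is an illustrative helper, assumed for this sketch.)
def words_in(highlight)
  highlight.downcase.scan(/[a-z0-9']+/)
end

puts random_highlight
p words_in(highlights.last)
```

Array#sample returns a different element on each run, which is all the random-highlight page needs.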

All of the highlights from the books I read on Google Play Books end up in a folder of Google Docs (one for each book). Each document shows how many highlights I made, on what dates, and the highlights themselves. Since a Google Doc is just a web page, I’ve learned recently that another way to get text from a page is using JavaScript. Actually, one of my instructors begged me to put off this project until I learned more JavaScript, but I was stubborn.

Here is how I did it. First, I inspected the page (press Command + Option + J on a Mac) and figured out which elements my text was associated with. Turns out my highlights (and a bunch of other stuff) are in spans, with the class name of, well, you’ll see below.

spans = 

// convert HTML collection into a regular, easy-to-work-with array.
const nodeArray = Array.from(spans);

// lots of spans are just empty strings (empty spaces). get rid of 'em
const filtered = nodeArray.filter(span => span.innerText.trim() !== "");

// give me my highlights, one line at a time, with no spaces after.
// (the original line is missing here; this map is an assumed reconstruction)
const lines = filtered.map(span => span.innerText.trim());

Screen Shot 2017-08-02 at 6.58.44 AM

The downside to this JS strategy is that it only lets me get my text line by line, rather than as a whole highlight (usually a few sentences of text). However, if my goal is to collect each new bit of text every time I make a highlight on my phone, and maybe train an algorithm to analyze what I’ve just read, the JS approach may be more flexible. Jury’s still out on that one.

Now, for next steps:
1) Text analysis using either Ruby or JavaScript. The output will be something like the below, a demonstration of a topic modeling process called “latent Dirichlet allocation”. A JS implementation is here. Basically, I feed this algorithm a number of documents, each containing a number of words (each book’s highlights broken down into individual words), and the algorithm outputs the topics that could have generated those words. Each of the boxes below represents a topic, with the larger words having more weight in that topic. I’m very interested in what topics my highlights generate. Weird, likely boring things probably, but insights nonetheless.

Screen Shot 2017-08-02 at 6.55.03 AM

2) I’m taking aim at screenshots. I’ve realized that I take a lot of screenshots on my phone. It’s basically how I bookmark things, such as restaurants to take friends to, subway directions to look at later, inspirational stuff I read and wanna come back to later (I rarely do though), and more. Apparently, if I give some of my screenshots to a free Google service called the Google Cloud Vision API, their computers can pull out the text in the screenshot (yet another text retrieval!) and even tell me what my screenshot may represent (ex. is there a dog in the screenshot?). It’s gonna be a fight though. Early efforts to even count all of my screenshots in Google Photos have been difficult, because the images do not all load on the page at once (that keeps image load times short). Gonna have to put on my programmer hat (read: google like a madman) and come up with a clever solution.
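For step 1, the “documents containing words” input could be prepared roughly like this before handing it to a topic-modeling library. This is a sketch under assumptions: the book titles are made up, highlights_by_book is a hypothetical structure, and no particular LDA library is implied.

```ruby
# Hypothetical input: book title => highlights pulled from that book.
highlights_by_book = {
  'Book A' => [
    'In the end, they must bow to the sovereignty of politics.',
    'Get your politics wrong, however, and everything stands to be swept away.'
  ],
  'Book B' => ['This is Germany 1933.']
}

# One "document" per book: a hash of word => number of occurrences.
documents = highlights_by_book.transform_values do |highlights|
  words = highlights.join(' ').downcase.scan(/[a-z0-9']+/)
  words.each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }
end

documents.each { |title, counts| puts "#{title}: #{counts}" }
```

Word counts per document are the usual starting point for this kind of analysis, though a real run would probably also drop very common words like “the” and “to” first.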

Thanks for reading, and happy to answer any questions!

