Parts 1 and 2 deal with the motivation for this project and targeted text retrieval using Ruby and jQuery. Here in Part 3, we will delve into the wild west of natural language processing.
Remember, we started with 2,100 of these highlights across forty books from various genres.
The good news: I’ve got a moderately interactive React front end and a Rails API backend chock full of book highlights. This lets users easily take a peek at the highlights I’ve made over the years. Give it a try (a screenshot is also below)!
The bad news: What do I do with all this data?
My goal: from the books I’ve read, I want to automate the detection of topics I consciously annotated. I want to make sense of data, especially data about myself. We already do this when we wear devices like a Fitbit, which tracks our steps, heart rate, and other vitals to let us know how active we are, helps us see patterns in our exercise, and motivates us to keep going. I’m trying to do something similar with my book highlights.
In broader terms, I have been asking myself these questions:
1. Can I use algorithms to help me understand myself better?
2. Are book highlights a good source of material to analyze?
3. Can I design some experiments to resolve (1) and (2)?
4. What are some applications of my work, even if (1) is too broad a question for the methods I am trying?
I have been working with an algorithm commonly used to discover latent (unobservable) topics in a large set of documents: Latent Dirichlet Allocation (LDA). Published in 2003 by Blei, Ng, and Jordan, LDA posits that each document is generated from a random mixture of topics, with each topic characterized by a set of words that tend to appear together.
Here’s a nice explanation from Edwin Chen (on Quora):
Suppose you have the following set of sentences:
- I ate a banana and spinach smoothie for breakfast.
- I like to eat broccoli and bananas.
- Chinchillas and kittens are cute.
- My sister adopted a kitten yesterday.
- Look at this cute hamster munching on a piece of broccoli.
Latent Dirichlet allocation is a way of automatically discovering topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like
- Sentences 1 and 2: 100% Topic A
- Sentences 3 and 4: 100% Topic B
- Sentence 5: 60% Topic A, 40% Topic B
- Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
- Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)
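To make this concrete, here is a minimal sketch of Chen’s toy experiment in JavaScript, assuming the `lda` npm package (`npm install lda`). The exact terms and percentages will vary from run to run, since the sampling is randomized:

```javascript
// A minimal sketch of the toy example above, assuming the `lda` npm package.
const lda = require('lda');

// Chen's five sentences, treated as five tiny "documents."
const documents = [
  'I ate a banana and spinach smoothie for breakfast.',
  'I like to eat broccoli and bananas.',
  'Chinchillas and kittens are cute.',
  'My sister adopted a kitten yesterday.',
  'Look at this cute hamster munching on a piece of broccoli.'
];

// Ask for 2 topics and the top 5 terms per topic.
const topics = lda(documents, 2, 5);

// Each topic comes back as a list of { term, probability } objects.
topics.forEach((topic, i) => {
  const terms = topic.map(t => `${t.term} (${t.probability})`).join(', ');
  console.log(`Topic ${i + 1}: ${terms}`);
});
```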
OK, so let’s say that all a machine knows about the things I think about is a set of documents (each book’s set of highlights) and all the words contained within those highlights. If I can run LDA on my data, I may be able to understand:
- The topics represented across all of my books
- The topic distribution within these various books.
So, I have 154,210 words in my “corpus,” and I have read that 50-100K words is generally a good starting point. I can filter out words that carry little meaning (words like be, being, are, etc.), words that are just one or two letters long, and so on, as sketched below. Some people keep only nouns or verbs, but I won’t reduce my dataset that much to start.
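As a rough illustration of that reduction step, a filter like the one below drops stop words and very short tokens before the corpus reaches LDA. The stop-word list here is just a stub; a real one runs to a few hundred entries:

```javascript
// A minimal sketch of the corpus-reduction step. The stop-word list is
// truncated for illustration; a real list has a few hundred entries.
const STOP_WORDS = new Set([
  'be', 'being', 'are', 'is', 'was', 'the', 'a', 'an',
  'and', 'or', 'of', 'to', 'in', 'that', 'this', 'it'
]);

function cleanTokens(text) {
  return text
    .toLowerCase()
    .split(/[^a-z]+/)                      // crude tokenizer: keep runs of letters
    .filter(word => word.length > 2)       // drop one- and two-letter words
    .filter(word => !STOP_WORDS.has(word));
}

console.log(cleanTokens('I like to eat broccoli and bananas.'));
// -> [ 'like', 'eat', 'broccoli', 'bananas' ]
```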
Let’s run an LDA algorithm developed in JavaScript by Awaisathar, based on an earlier implementation in Java. I used 40 topics; a rough sketch of the equivalent call is below, and results follow it (zoomed in & zoomed out):
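Expressed as a Node script, the run looks roughly like this. Note that `highlightsByBook` is a hypothetical stand-in for however the highlights come out of the Rails API, and my actual run used Awaisathar’s in-browser page rather than Node:

```javascript
// A rough sketch of the 40-topic run, assuming the `lda` npm package.
// `highlightsByBook` is a hypothetical { title: [highlight, ...] } map;
// in reality it would hold all forty books' highlights.
const lda = require('lda');

const highlightsByBook = {
  'Book One': ['First saved highlight...', 'Another highlight...'],
  'Book Two': ['A highlight from a different book...']
};

// One "document" per book: that book's highlights joined into one string.
const documents = Object.values(highlightsByBook).map(hs => hs.join(' '));

// 40 topics, keeping the top 10 terms for each.
const topics = lda(documents, 40, 10);
console.log(topics);
```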
Takeaways:
- The good: the algorithm was simple to work with, is open source, and surfaced a few topics that make sense (especially the technical ones). It is fairly fast to run in a Chrome browser (under 10 minutes), though it freezes a few times while running.
- The bad: some garbage topics; poor performance on topics related to emotions and everyday life; results that do not drive a decision; and no metric to assess whether the results are good or bad. Furthermore, someone other than me might find the topics hard to interpret.
Future work:
- Can the topics be automatically ranked based on their ‘significance’ and whether a topic is real or junk? AlSumait et al. wrote a great paper about this (a rough sketch of one of their ideas follows this list)!
- Figure out how many words were screened out by the stop words and other reduction strategies I used on the original corpus of ~154K words.
- Better data visualization may also make the results easier to understand.
- Given the data, is my question too broad? How can I scope down the question to give the machine a chance to help me? What am I trying to do with my data, anyway?
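On the ranking question, here is a crude sketch of one signal from AlSumait et al.: junk topics tend to look like a near-uniform distribution over the vocabulary, so topics can be scored by how far their word distribution sits from uniform, with low scorers flagged as likely junk. The `topicWordDists` input is a hypothetical array of per-topic { word: probability } maps covering the same vocabulary:

```javascript
// A rough sketch of one idea from AlSumait et al.'s topic significance
// ranking: score each topic by the KL divergence of its word distribution
// from the uniform distribution over the vocabulary. Higher = more "real."
function klFromUniform(dist) {
  const words = Object.keys(dist);
  const uniform = 1 / words.length;
  return words.reduce(
    (kl, w) => (dist[w] > 0 ? kl + dist[w] * Math.log(dist[w] / uniform) : kl),
    0
  );
}

// `topicWordDists`: hypothetical array of { word: probability } maps,
// one per topic, each covering the same vocabulary.
function rankTopics(topicWordDists) {
  return topicWordDists
    .map((dist, i) => ({ topic: i, score: klFromUniform(dist) }))
    .sort((a, b) => b.score - a.score); // most significant topics first
}
```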
Next steps: Discuss these results with someone knowledgeable in NLP.
Thanks for reading! I would love to hear what you think!
Paul