A novel has many characters. How can we do sentiment analysis for each one of them individually? Is he or she feeling good in this chapter? Is this episode a good or bad scenario for a given character?
Bigger questions: Is this news good or bad for one specific company? Are the events that happened today good or bad for each individual company? (One event can be good for Apple but bad for Google.)
After reading some papers and articles, I came up with an idea.
Distributional Hypothesis: words that are used and occur in the same contexts tend to purport similar meanings. The Distributional Hypothesis is the basis for statistical semantics. (http://en.wikipedia.org/wiki/Distributional_semantics)
There are many methods for representing words as vectors (LSA, HAL, topic models, etc.). Most of them are based on the Distributional Hypothesis. Once the words are converted to vectors, we can use the top K nearest neighbors of a word (in vector form) to represent the meaning of that word.
To get accurate results we normally need a fairly large corpus, because the hypothesis is a statistical one. But what if we don’t have a large corpus? What if we only have one book, or one article? Here is my hypothesis: in that scenario, the top K nearest neighbors of a word represent the book’s or article’s perspective on that word.
Then I started doing some experiments to verify my hypothesis.
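To make the "top K nearest neighbors" idea concrete, here is a minimal sketch. It is not LSA or HAL proper: it assumes plain co-occurrence counts within a small window as the word vectors and cosine similarity as the distance. The function names, the window size and the toy sentence are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(tokens, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest_neighbors(vectors, target, k=3):
    """Return the k words whose vectors are most similar to the target's vector."""
    target_vec = vectors.get(target, Counter())
    scored = [(word, cosine(target_vec, vec))
              for word, vec in vectors.items() if word != target]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy corpus, invented for this example only.
tokens = "the cat drinks milk the dog drinks water the cat chases the dog".split()
vectors = cooccurrence_vectors(tokens, window=2)
neighbors = nearest_neighbors(vectors, "cat", k=3)
print(neighbors)
```

On a real corpus the vectors would come from whichever method is chosen (LSA, HAL, a topic model); only the neighbor lookup at the end stays the same.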
Experiment 1, The Novel
A novel is a very good target for experiments. There are many characters in one novel; some are friends and some are enemies. In one chapter a character is one of the most powerful people in the world, and in the next chapter he is a dead man, like a rollercoaster. For each novel, the procedure was:
Create LSA space for each individual chapter of one novel.
Get K nearest neighbors for some leading roles.
Do sentiment analysis based on these K words.
Read through the novel to see whether the sentiment analysis results make sense.
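The first three steps can be sketched roughly as follows. This is only an illustration under several assumptions of mine: each sentence of a chapter is treated as one "document" for the LSA matrix, the LSA space is a truncated SVD of raw term counts computed with NumPy, and the sentiment of the K neighbors is the average of a tiny hypothetical lexicon scaled to [-10, 10]. The lexicon, the sample sentences (invented, not quoted from any novel) and all function names are placeholders.

```python
import re
import numpy as np

# Hypothetical toy lexicon; a real run would need a proper sentiment lexicon.
LEXICON = {"victory": 1.0, "wins": 1.0, "feast": 0.5,
           "wounded": -1.0, "prisoner": -1.0, "dies": -1.0}

def lsa_space(sentences, rank=2):
    """Build a term-by-sentence count matrix and reduce it with a truncated SVD."""
    docs = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for w in doc:
            counts[index[w], j] += 1
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    rank = min(rank, len(s))
    return vocab, u[:, :rank] * s[:rank]  # one row of LSA coordinates per term

def k_nearest(vocab, term_vectors, target, k=5):
    """Return the k terms closest to `target` by cosine similarity in the LSA space."""
    i = vocab.index(target)
    norms = np.linalg.norm(term_vectors, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    sims = (term_vectors @ term_vectors[i]) / (norms * norms[i])
    order = [j for j in np.argsort(-sims) if j != i]
    return [vocab[j] for j in order[:k]]

def sentiment_score(words, scale=10.0):
    """Average the lexicon values of the words found in LEXICON, scaled to [-scale, scale]."""
    hits = [LEXICON[w] for w in words if w in LEXICON]
    return scale * sum(hits) / len(hits) if hits else 0.0

# Invented sample chapters, one LSA space per chapter.
chapters = {
    "chapter_a": ["Stark wins a great victory", "Stark holds a feast",
                  "The army marches north"],
    "chapter_b": ["Stark is wounded", "Stark is taken prisoner",
                  "The army marches north"],
}
scores = {}
for name, sentences in chapters.items():
    vocab, term_vectors = lsa_space(sentences, rank=2)
    neighbors = k_nearest(vocab, term_vectors, "stark", k=5)
    scores[name] = sentiment_score(neighbors)
    print(name, neighbors, scores[name])
```

Building a separate space per chapter is the point of the exercise: the neighbors of a name change from chapter to chapter, and so does the score derived from them.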
I tested on 5 novels (two Chinese novels, three in English). The results were promising.
Here is the result for the Starks and the Lannisters from the novel A Song of Ice and Fire: A Game of Thrones by George R. R. Martin. The left column is the Starks, the right is the Lannisters. The good/bad scale is [-10, 10]; a negative number means bad, a positive number means good.
As you can see above, when the Starks’ value is negative in a chapter, something bad happens to the Starks in that chapter, and vice versa. I have to point out the C1 and C2 points in the graph. They mark chapters in which the Starks and the Lannisters are in conflict. In C1, the Starks win while the Lannisters lose. In C2, the Starks lose while the Lannisters win. (This is the case I mentioned before: “one event can be good for Apple but bad for Google”.)
Experiment 2, IT News
After experimenting on novels, I started crawling news from Slashdot, Engadget and cnBeta (a Chinese IT news website), trying to answer the question: “Are the events that happened today good or bad for each individual company?” I think the answer should be reflected in the stock price of each company.
Here are some of the results I got:
The X-axis only runs from 0 to 250 because there are roughly 250 trading days in one year. The green line is the stock price: (stock price of the day) / (stock price of the first trading day) * 100%. The red line is the accumulated good/bad value: (good/bad value of the day + value of the previous day) / (value of the first trading day, which is 100) * 100%.
The results seem good, but many more experiments need to be done to verify my hypothesis, and there is a lot of room to improve the algorithm and the process I used in these experiments.
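The two curves can be computed along these lines. This is my reading of the formulas above: the red line is treated as a running total that starts at 100 on the first trading day and adds each later day's good/bad value. The function names and the sample numbers are invented for illustration.

```python
from itertools import accumulate

def price_index(prices):
    """Green line: each day's price as a percentage of the first trading day's price."""
    base = prices[0]
    return [p / base * 100.0 for p in prices]

def sentiment_index(daily_scores, start=100.0):
    """Red line: running total of daily good/bad values, starting at `start`,
    expressed as a percentage of the first trading day's value."""
    running = accumulate(daily_scores, initial=start)
    return [value / start * 100.0 for value in running]

# Made-up numbers: three trading days of prices, and good/bad values
# for the days after the first one.
green = price_index([50.0, 55.0, 45.0])
red = sentiment_index([2.0, -3.0, 1.0])
print(green)
print(red)
```

Both series start at 100, which is what makes them comparable on the same Y-axis.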