I love when I find something that ties all of my worlds together. This should be of interest to anyone who is into interactive infographics of literary studies using hip hop… You know, or any subset thereof…
Things of note:
1) I love that they have Shakespeare on there as a benchmark
2) I love that there is a Wu-Tang button. I hereby propose that all infographics, nay, all software ever from this point on, have a Wu-Tang button.
3) I love that this is definitive proof that Wu-Tang is both nothing to fuck with, and for the children… 36 Chambers of Education, yo…
This is amazing
Well, the difficulty with the analysis lies in the tokenization method he used. As he describes in his methodology, the tokenization approach ensures that multiple variations of the same word are counted, so if I released a piece where I just went “pimp, pimping, pimped out pimpitude pimposity pimper pimpaliciousness pimps” that would be counted as 9 unique terms as opposed to 2. He needs to at least run the work through a stemmer, and a proper comparison would include multiple metrics. The Melville button is an indicator of the problem, really, since he picked Moby Dick, and part of the reason that Moby Dick ends up off the scale is because it’s got a gigantic pile of chapters detailing cetology; it’s sort of a Tom Clancy novel in that respect.
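To make the "9 versus 2" point concrete, here is a minimal sketch of counting unique tokens with and without stemming. It is not the chart author's code. The `crude_stem` function and its `SUFFIXES` list are hypothetical stand-ins tailored to this one example, and a real stemmer (Porter, for instance) would behave differently on invented words like these.

```python
import re

def tokenize(text):
    # Lowercase, then keep runs of letters and apostrophes as tokens.
    return re.findall(r"[a-z']+", text.lower())

# Toy suffix list, longest first. An assumption for this example only,
# not a real stemming algorithm.
SUFFIXES = ("aliciousness", "itude", "osity", "ing", "ed", "er", "s")

def crude_stem(word):
    # Strip the first matching suffix, but leave at least 3 letters behind.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

text = "pimp, pimping, pimped out pimpitude pimposity pimper pimpaliciousness pimps"
tokens = tokenize(text)

raw_unique = set(tokens)                          # every surface form counts
stemmed_unique = {crude_stem(t) for t in tokens}  # variants collapse to a stem

print(len(raw_unique), len(stemmed_unique))
```

On this one sentence the raw count is 9 and the stemmed count is 2 ("pimp" and "out"). Whether the gap is anywhere near that dramatic on real lyrics is exactly what would have to be measured.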
So, I’m not just doing this to be an asshole (although I admit that helps!), it’s just that I’ve seen a lot of bad work of this type in data science. Someone builds a counterintuitive conclusion out of flimsy methodology and then issues a press release — I’m not saying he’s wrong, I’m saying he hasn’t demonstrated he’s right yet, because Big Boy Research requires accounting for these errors, and there are some big-ass honking ones glowing there. Not Wu-Tang, but Melville.
luckily, if you wanted to try the alternate approach all of his source data (rap genius, complete works of shakespeare, and moby dick) are easily available.
I’ve been tossing around an article on bad data science for a while, but I don’t particularly want to just beat on this one. FWIW, I found an article where someone was doing a similar analysis on Shakespeare Vs. Johnson, Johnson had the higher vocabulary, strangely enough. Author never mentioned directly that the Johnson book he used was his Dictionary.
Can you be more specific on “Johnson”
(i.e. Ben Jonson or some random rapper?)
Samuel Johnson.
in any case, I’m not convinced the methodology is that flawed. He applied the same tokenization to everyone. And I get why he did it. It’s not like Shakespeare standardized on spelling any more than Outkast does. Such is the way with poetic verse. And that is what ultimately is being measured. So “pimping” and “pimpin” should be different words, as should “over” and “o’er”
moreover should “among” and “amongst” be the same word? “ax” and “axe”? “theater” and “theatre?” I’d argue that differentiating between them makes as much sense as differentiating between “who” and “whom” or “him” and “his” which are just as semantically similar as “pimp” and “pimping”
though I agree, I can’t tell for certain without mining the data myself, I’m betting that the semantically similar tokens are rare enough in the data (and regular enough across authors) that it would just be noise anyway… UNLESS you’re willing to argue that you really think DMX used over 3000 separate and distinct words, while Æsop Rock came up with more than 7000 variations of the word “pimp” which would in and of itself still be impressive and I think validate the chart.
This thread just seems so ironic….
how so?
Among other things, the scientific approach on categorizing the variations of the word “pimp”. It’s quite amusing.
I’d want to use either stemming or n grams, that would limit the problem of our 7000 pimps, although I agree that if they got that many variations they should get a medal. I would also strip out all particles – he his him her etc. then I’d randomly sample from each corpus. I’d likely also want to remove all proper names. Even then there’s open questions about word usage – Shakespeare isn’t going to talk about Bugattis and… Goddamit I was about to say that wu-tang wouldn’t mention water clocks, but wu-tang will eventually reference everything.
The real problem is that proper peer review is an endurance contest for this reason – you have to address the holes. At the minimum I would want to see him use multiple metrics in the display
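A rough sketch of two of the cleanup steps proposed above (dropping the pronouns, then randomly sampling a fixed number of tokens from each corpus) might look like this. The `STOPWORDS` set, the function name, and the two toy corpora are all assumptions for illustration. Stemming and proper-name removal are left out, and sampling individual tokens rather than contiguous passages is just one way to read "randomly sample".

```python
import random

# Hypothetical stop list built from the pronouns named in the comment.
STOPWORDS = {"he", "his", "him", "her"}

def sampled_unique_count(tokens, sample_size, seed=0):
    # Drop stop words first so they do not take up space in the sample.
    kept = [t.lower() for t in tokens if t.lower() not in STOPWORDS]
    if len(kept) < sample_size:
        raise ValueError("corpus is smaller than the requested sample")
    # A fixed seed makes the draw repeatable; sampling is without replacement.
    rng = random.Random(seed)
    sample = rng.sample(kept, sample_size)
    return len(set(sample))

# Toy corpora, invented for this example.
corpus_a = "he said the whale took his leg and the whale took her boat".split()
corpus_b = "cash rules everything around me cream get the money".split()

count_a = sampled_unique_count(corpus_a, 8)
count_b = sampled_unique_count(corpus_b, 8)
print(count_a, count_b)
```

Using the same sample size for every corpus is what makes the counts comparable. Repeating the draw with several seeds and averaging would show how much of any difference is just sampling noise.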
Christian Roylo, Shakespeare constantly refers to pimping and prostitution, and was a subject of countless moral panics and bluestocking rewrites. The only difference is he’s been sanitized over 400 years, as exhibited by the Oxfordian twerps. Mark my words, in 400 years someone is going to claim that RZA was a Harvard-educated German American
It all depends on what it is you are trying to prove. You have to define “word.” If you are asking the question “Who uses the most semantic tokens in his writing between DMX, Shakespeare and Æsop Rock?” Then the answer is clearly the last, and the OP’s study proves that. If you’re trying to ask who references the most separate and unique concepts, then you need to do something like your study, and even that isn’t enough, because if you’re going to argue that “pimp” (noun) is the same concept as “pimped” (verb) and that “over” and “o’er” are the same, then I’d say you’d have to also link “yell” to “shout” and “red” to “crimson.” etc. Eskimos having 18 words for snow and all that
but the poet (be he West Coast or Elizabethan) doesn’t really want that. The entire point of poetry is to create beauty through words. The choice of word is intentional. “O’er” is used rather than “over” for a reason. That might be for its diction, to preserve meter, to make a rhyme work. The same is true of “pimping” and “pimpin'” or “nigger” and “nigga.” In fact, I’d argue that even outside of hiphop lyrics, most people who speak ebonics mean something different by pimpin’ than pimping. Similarly, DMX means something different when he says “bitch” vs. “ho.” It’s not just rhyme.
I just wanna say that Moby Dick is awesome and I would totally buy a boxed set of wu tang clan retelling it. Stubb is totally pimpin’.
What is the difference between pimpin’ and pimping?
Defining your goal is a critical part of it, but the goal is often “please hire me because of this neat counterintuitive result” (and I don’t think the result is counterintuitive; to think that Shakespeare references more stuff than WTC indicates either unfamiliarity with WTC or a pathological level of bardolatry [Mike Heidenberg, any thoughts? You know Shakespeare and hip-hop better than I do]). Every study is flawed, the question is how bad the flaws are and whether they’re important enough that they invalidate the result. The fact that he then does my research bullshit alarm actions (nifty visualization, publication) doesn’t help in the slightest.
context… “pimping” is pretty much always the act of selling prostitutes. “pimpin'” may be the act of selling prostitutes, but it also is a more generalized adjective or form that is highly connotative depending on how it’s used. It could be a walk (“I saw Cousin Junebug pimpin’ down the street yesterday.”) It could be a lifestyle (“pimpin’ ain’t easy”). It could be an adjective (“Those are some pimpin’ shoes”)
I think you’re being unfair with the bullshit alarms. I’m very fond of nifty visualizations… half because that was my job for so long, but moreover because I think that high usability info graphics are much better at conveying information to the masses than a 200 page academic report. I think both are important, but he posted that on a web page, which needs to be “bam, distill this info so that it can disseminate to the masses and the .001% who REALLY care will dig deeper for the exact theory.” I’m very much into Edward Tufte.
As for why, I mean, looking at his website (http://www.mdaniels.com) he appears to be a data analysis specialist with a specific personal interest in hip-hop. If you’re going to question everyone’s ulterior motives in their research then you’re going to lose out on a lot of research. I mean, yes, I agree that ulterior motives can color research, but I also think pretty much everyone has them (on some level at least) and that doesn’t automatically devalue it.
The problem isn’t the graphics themselves, it’s that the original quality of the research is mediocre, there are obvious avenues of attack that he hasn’t addressed (stemming, n-grams, the quality of transcription between the two sets, the vocabulary), there are problems he can’t address but hasn’t (the 400 year gap between authors), and he’s comparing apples and oranges (everyone is a poet on the list except Melville, who is a prose stylist). THEN he spent an enormous amount of time working on the visualization. Visuals are important, but well-crafted visuals on top of poor results raise an alarm. It’s published EDA, and like a lot of data science, it’s a headline with a justification written underneath it.
The thing about the motivation is that it’s easier to slack off when the conclusions match your beliefs, which is why peer review is so critical. Couple of years ago, some NZ researchers published a study arguing that the environmental impact of owning a dog was about the same as owning an SUV. Their goal was to reduce pet ownership, but it turned out that most readers thought the opposite: owning an SUV wasn’t so bad since it had the same impact as owning a medium sized dog, right?
Re: the info graphic. that’s what I’m saying. That’s not fair. You don’t know how long anything took him. That visualization would have taken me like 2 hours to work up. The data mining to get the info would take me like a week, or so… maybe quicker, if the webpages from rap genius are regular enough that I could write a good parser. He’s a data analysis weenie… that’s what he does. Audience matters. Working up a 10 year data mining project for a blog posting would be silly. Being more technical would have limited his audience. It’s a problem with academia in general.
Re motivation: He’s not saying this is the be all end all of linguistic study. He’s saying “look, here’s something interesting. Let’s talk about it.” In your SUV study example, “oh, let’s get an SUV” is a perfectly reasonable conclusion if it is presented in the form of “here’s something interesting.”
You’re in grad school now, Mav, as one of your predecessors, it’s my job to crush all joy in your life until nothing remains except analysis 😛
dude… 1) I’m an English major. I’m all about analyzing nature of language and poetry. This is right up my alley. 2) I’m not most grad students. My whole philosophy is moving the academy away from the soulless micro minutia into something meaningful and popular. It was like in my personal statement and everything.
I’d like to see your personal statement at some point if you feel like passing it by. English PhDs do this for love and are far braver men people than I.
Sumner: that was kinda my point earlier, actually. Word choice matters to poets. See Spot Run is actually far less syntactically interesting than say DMX’s Party Up. Your average rap song is probably like 4.5 minutes long. DMX likely has a much lower unique word count because he is very fond of repetition. He will end each line with the same word for like 5 or 6 lines in a row.
That’s not (necessarily) low vocabulary. That’s an intentional poetic choice.
Just ask, say Jill, Margaret, or Jennifer (three poets off the top of my head who read my FB). Style matters and varies by artist.
I picked the word “time” poorly, what I’m really trying to say is that as a data scientist, what I see here is padding — an attempt to justify a weak result. You’re right, it’s a blog post, and he’s not causing harm here that he would in a research journal, but then again a research journal wouldn’t accept this result anyway. What I worry about with contemporary data science is that he’s normal for the discipline, and that’s a problem for the discipline to be taken seriously.
So given your point about poets, do you think he should have Melville in the set at all? Melville’s a prose stylist.
I would include it just because it’s an interesting baseline vs “regular” speech. Well, really… it’s an interesting baseline against “other” speech, since I dunno that Melville is “regular.” So it’s interesting baseline against prose.
I’d actually include others too just so that we can compare better. So for clarity’s sake, I grabbed some random long e-texts and ran them through some quick sed-magic to wrangle them down to unique word counts. Here’s what I found (looking at only the first 35,000 words of each):
9/11 Commission Report: 6627 words
Dante’s Inferno: 5827 words
Homer’s Odyssey: 3795 words
Joyce’s Portrait of the Artist as a Young Man: 4963 words
Joyce’s Ulysses: 7315 words
King James Bible: 3590 words
MLK Speeches*: 4101 words
Thoreau’s Walden: 5966 words
(For the King Speeches, I had to use several in order to get him up to 35,000 words, so I used his most famous ones, from here: http://www.famous-speeches-and-speech-topics.info/martin-luther-king-speeches/)
So that ought to give you a taste of where those rappers sit in relation to other texts
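For anyone who wants to repeat the counts above on other texts, here is a minimal Python sketch of the same idea: take the first 35,000 words and count the distinct ones. The original numbers came from a sed pipeline that isn't shown, so the tokenization rule here (lowercase, runs of letters and apostrophes) is an assumption and will not necessarily reproduce those figures. The function name and sample sentence are made up for the example.

```python
import re

def unique_words_in_prefix(text, limit=35000):
    # Assumed tokenization: lowercase, runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Only the first `limit` words count, so long and short texts compare fairly.
    return len(set(words[:limit]))

# Tiny stand-in text so the function can be tried without downloading anything.
sample = "Call me Ishmael. Some years ago, never mind how long precisely, call me."

full_count = unique_words_in_prefix(sample)             # all 13 words considered
short_count = unique_words_in_prefix(sample, limit=3)   # only "call me ishmael"
print(full_count, short_count)
```

For a real e-text, read the file into a string and pass it in. Different choices about apostrophes, hyphens, and case will shift the totals, which is part of why the method matters as much as the ranking.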
To complicate this discussion: Part of the job of poets is to wordplay. I might use the word pimpin in relation to any number of things that would have nothing to do with any of the notes listed above, just because I can in a poem. I also might repeat a simple word like sad or bad over and over again, not because I don’t know others, but because I want to. We change definitions constantly, changing meaning and vocabulary as we go. And then, there are prose poems, and there’s also poetic prose. Take an author like Ron Rash–I’d argue that his poetry is more prose-y than his prose (his prose, like any number of writers–such as Dostoevsky, Henry James, etc.–is incredibly poetic, more like poetry than prose). In any case, I had to pop by because Mav drew my attention to all this, but I’m not a data person. I’m going to go read Melville while listening to DMX now…
What Jen said. (Way better than me. Which is why I tagged her. 🙂 )
Also, you’ll note I intentionally used two different Joyce books…. books which are about THE SAME CHARACTER. Just to show what Jennifer is talking about more. Joyce certainly has a large vocabulary. But he was doing something different with Ulysses, than he was with Portrait of the Artist. So he ended up making an intentional style choice to use almost twice as many unique words in the same space of the former as the latter.
As a random last thought, I decided to parse through one more. So the unique word count for the first 35,000 words of E. L. James’ 50 Shades of Grey is: 4658 words
I think that criticizing the comparison between rappers and Melville and Shakespeare is missing the point of the chart. It’s not about comparing rappers to Melville and Shakespeare. It’s about comparing rappers to other rappers. Melville and Shakespeare are just included as points of interest. That’s why they’re lines, not circles. Plus I think that the bigger problem is likely to be the reliability of the transcription. I know that Homeboy Sandman has been very vocal about his low opinion of the quality of the transcription of his lyrics. Some, like Aesop Rock have printed books of their lyrics, and it is likely that for those rappers the lyrics have just been copied from there, but the rest are just transcribed by whoever feels like it, sometimes with improvements from the community. Generally, when I’ve looked at transcriptions of rap lyrics for mistakes, the most frequent mistake is the substitution of a more commonly used word for an uncommonly used one. So part of the question is whether or not the rate of transcription mistakes is evenly distributed. Even with that caveat, I still think that it’s an interesting chart. And part of what it proves is that vocabulary isn’t everything in terms of the value of lyrics. It’s not as though I’m going to look at this and conclude that Nelly is a much better lyric writer than Kanye or 2Pac due to higher ranking on this chart.
Keith makes a really good point–transcriptions of lyrics (sometimes even those within albums) are pretty laughable. I’d probably teach rap more often, except that I so often have to take forever getting the lyrics down myself for those students who can’t quite keep up with the audio.
I <3 anaphora
I have a hard time accepting Tech N9ne being so low on this list.