Monday, October 8, 2007

Statistics and the independence of syntax and semantics

I know we're moving on to animal communication, but I had a comment on last week's materials that I felt should be made. It seemed to me that the "take-away" lesson we converged on during our discussion of the LSLT excerpts was that Markovian processes, whether hidden (over POS tags), high-order (lots of history), or both, were dismantled by Chomsky as implausible accounts of natural language phenomena. This is undoubtedly true, but not even remotely controversial for most folks who are interested in statistical models of language. What I thought was also quite evident from reading LSLT, but wasn't really addressed, is that Chomsky also argues for the independence of meaning and grammaticality (I know this sounds so obvious as to almost be another example of the inanity of people who "work on statistics"). However, the implications of this claim are actually extremely precise for any statistical model. Specifically, we know that the independence of two events A and B has the following properties:
  1. P(A,B) = P(A)P(B)
  2. P(A|B) = P(A)
  3. P(B|A) = P(B)
Therefore, if we are interested in "grammaticality" and are working with a model which conflates meaning and grammaticality (i.e., one that gives us only P(A,B)), we must immediately doubt whether we can really even address the latter without factoring out the former. Conversely, if we have a model which predicts both, we would expect their relationship (if Chomsky is right) to conform approximately to the relationship expressed in (1).
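
To make the three properties concrete, here is a minimal sketch in Python. The marginal probabilities are hypothetical numbers picked only for illustration (nothing here is estimated from data), and the function names are my own. It builds a joint distribution over A ("is grammatical") and B ("has a sensible meaning") that is independent by construction, then checks property (1) cell by cell and property (2) directly.

```python
from itertools import product

# Hypothetical marginals, chosen only for illustration:
# A = "sentence is grammatical", B = "sentence has a sensible meaning".
p_a = 0.75
p_b = 0.5

# A joint distribution over (A, B) that is independent by construction.
joint = {
    (a, b): (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
    for a, b in product([True, False], repeat=2)
}

# A second, made-up joint with the same marginals but where A and B interact.
dependent = {
    (True, True): 0.5,
    (True, False): 0.25,
    (False, True): 0.0,
    (False, False): 0.25,
}


def marginal_a(dist):
    """P(A): sum the joint over both values of B."""
    return sum(p for (a, _), p in dist.items() if a)


def marginal_b(dist):
    """P(B): sum the joint over both values of A."""
    return sum(p for (_, b), p in dist.items() if b)


def is_independent(dist, tol=1e-9):
    """Check property (1), P(A,B) = P(A)P(B), in all four cells."""
    pa, pb = marginal_a(dist), marginal_b(dist)
    for (a, b), p in dist.items():
        expected = (pa if a else 1 - pa) * (pb if b else 1 - pb)
        if abs(p - expected) > tol:
            return False
    return True


def conditional_a_given_b(dist):
    """P(A|B) = P(A,B) / P(B); assumes P(B) > 0."""
    return dist[(True, True)] / marginal_b(dist)


print(is_independent(joint), is_independent(dependent))
print(conditional_a_given_b(joint), marginal_a(joint))
```

The point of the second distribution is that two joints can share the same marginals and still differ in whether (1) holds, which is why a model that only reports P(A,B) cannot by itself tell us P(A).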

Any thoughts? Did I miss something obvious?


Tim Hunter said...

I guess I agree so far but I'm not sure what you're getting at ... can you be a bit more specific? What are the events A and B which we'd be plugging in? And what is the model that conflates meaning and grammaticality?

Chris said...

A probability model that conflates meaning and grammaticality is any standard n-gram language model over words. Things that have low semantic likelihood (e.g., "The candle murdered my eggplant.") end up with low probability alongside things that have low grammatical probability but are perfectly acceptable semantically (e.g., "Chris want to go to sleep").
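
As a rough illustration of that conflation, here is a small bigram model sketch in Python. Everything in it is an assumption made for the example: the four-sentence corpus is invented, add-one smoothing is just the simplest choice available, and `log_prob` is a hypothetical helper name. The model has a single score per sentence, so an unattested word combination lowers it whether the cause is agreement or meaning.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented purely for this illustration.
corpus = [
    "chris wants to go to sleep",
    "the candle lit my room",
    "the man murdered my neighbor",
    "i ate my eggplant",
]


def bigrams(sentence):
    """Return the (previous, next) token pairs, with boundary markers."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return list(zip(tokens, tokens[1:]))


bigram_counts = Counter(bg for s in corpus for bg in bigrams(s))
context_counts = Counter(prev for s in corpus for prev, _ in bigrams(s))

# Vocabulary of tokens that can be predicted (includes </s>, excludes <s>).
vocab = {nxt for s in corpus for _, nxt in bigrams(s)}
V = len(vocab)


def log_prob(sentence):
    """Add-one smoothed bigram log-probability of a sentence.

    Unseen bigrams and unseen context words simply get a count of zero.
    """
    return sum(
        math.log((bigram_counts[bg] + 1) / (context_counts[bg[0]] + V))
        for bg in bigrams(sentence)
    )


for s in [
    "chris wants to go to sleep",
    "chris want to go to sleep",
    "the candle lit my room",
    "the candle murdered my eggplant",
]:
    print(f"{log_prob(s):8.3f}  {s}")
```

In this toy setup both the agreement error and the odd meaning score below their attested counterparts, and nothing in the single number says which kind of oddness was responsible.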

I guess my question is the same as your first one. Are there meaningful linguistic objects that can be put in the formulas above that correspond roughly to Chomsky's hypothesis of semantic/syntactic independence?

Tim Hunter said...

I think A and B would be the event of the sentence "being grammatical" and the event of the sentence "having a sensible meaning". If we build an n-gram model over a corpus we're probably going to end up with a model which conflates the two properties, because being more grammatical and having a more sensible meaning both push a sentence in the direction of being more frequently used.

I think Chomsky's whole point is that models from corpora conflate the two and give you P(A,B), when what we're really interested in as linguists is just P(A). Both "The candle murdered my eggplant" and "Chris want to go to sleep" have a high score on one of these scales and a low score on the other, so they end up close together when we multiply the two. (Also, "The candle murder my eggplant" has a low score on both so it's probably the lowest of all, whereas "Chris wants to go to sleep" has a high score on both so it's probably the highest of all, in a corpus.)
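
The multiplication can be sketched in a few lines of Python. The two scores per sentence are hypothetical placeholders (0.9 for "high", 0.1 for "low"), not measurements of anything, and the product assumes the two properties really are independent.

```python
# Hypothetical placeholder scores: "high" and "low" on each scale.
HIGH, LOW = 0.9, 0.1

# (grammaticality score, sensible-meaning score) for each example sentence.
scores = {
    "Chris wants to go to sleep": (HIGH, HIGH),
    "Chris want to go to sleep": (LOW, HIGH),
    "The candle murdered my eggplant": (HIGH, LOW),
    "The candle murder my eggplant": (LOW, LOW),
}


def joint_score(grammatical, sensible):
    """P(A,B) under independence: the product of the two scores."""
    return grammatical * sensible


# Rank sentences by the conflated score, highest first.
ranked = sorted(scores, key=lambda s: joint_score(*scores[s]), reverse=True)

for sentence in ranked:
    print(f"{joint_score(*scores[sentence]):.2f}  {sentence}")
```

The two mixed cases get the same product even though they are low for different reasons, so the conflated score alone cannot separate them.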

So these are, I think, the two probabilities that would slot in as A and B above, but I don't think a sentence's probability of "having a sensible meaning" is necessarily a linguistic object.

(Again I feel like I might be missing something in the question though ...)