The ongoing "Corpus Pattern Analysis" project is building a "Pattern Dictionary of English Verbs" (PDEV). PDEV is an inventory of the normal patterns of use of English verbs, with links in each case to an unambiguous meaning.
The problem that PDEV aims to solve is the highly ambiguous nature of the link between words and meanings. First, it is necessary to recognize that in natural language words don't have meaning -- they have only "meaning potential" -- something that is so vague, variable, and ambiguous that it cannot be used as a basis for reliable disambiguation, computing, or anything else. What we see in standard English dictionaries are lists of undifferentiated meaning potentials. The human user has to guess at the differentiating criteria. Humans are good at guessing such criteria, so they can use ordinary dictionaries. Computers aren't, so they can't.
The problem is soluble, however, if we take a closer look at how words are actually used to make meanings. We see that meanings in text depend (probabilistically, statistically) on patterns of interaction between nouns and verbs. The aim, therefore, is to establish an inventory of patterns (PDEV). It turns out that patterns are almost always unambiguous.
The next step is to develop computational techniques for relating unseen uses of a verb in text to the relevant pattern.
Corpus Pattern Analysis (CPA) is a new technique for mapping meaning onto words in text. It is currently being used to build a 'Pattern Dictionary of English Verbs', which will be a fundamental resource for use in computational linguistics, language teaching, and cognitive science. It is based on the Theory of Norms and Exploitations (TNE, see Hanks 2004 and 2013, Hanks and Pustejovsky 2005). TNE in turn is a theory that owes much to the work of Pustejovsky on the Generative Lexicon (see Pustejovsky 1995), to Wilks's theory of preference semantics (e.g. Wilks 1975), to Sinclair's work on corpus analysis and collocations (e.g. Sinclair 1966, 1987, 1991, 2004), to the Cobuild project in lexical computing (Sinclair et al. 1987), and to the Hector project (Atkins 1993; Hanks 1994). CPA is also influenced by frame semantics (Fillmore and Atkins 1992). It is complementary to FrameNet. Where FrameNet offers an in-depth analysis of semantic frames, CPA offers a systematic analysis of the patterns of meaning and use of each verb. Each CPA pattern can in principle be plugged into a FrameNet semantic frame. Some recent work in American linguistics (Jackendoff 2002) has complained about the excessive "syntactocentrism" of American linguistics in the 20th century. TNE offers a lexicocentric approach, with opportunities for synthesis, which will go some way towards redressing the balance.
The focus of the analysis is on the prototypical syntagmatic patterns with which words in use are associated. Patterns for verbs and patterns for nouns are different in kind. Noun patterns will consist of corpus-driven gnomic statements, into which all relevant collocates will be incorporated, to give a full picture of the noun's idiomatic use. Verb patterns consist not only of the basic "argument structure" or "valency structure" of each verb (typically with semantic values stated for each of the elements), but also of subvalency features, where relevant, such as the presence or absence of a determiner in noun phrases constituting a direct object. For example, the meaning of take place is quite different from the meaning of take his place. The possessive determiner makes all the difference to the meaning.
No attempt is made in CPA to identify the meaning of a verb or noun directly, as a word in isolation. Instead, meanings are associated with prototypical sentence contexts. Concordance lines are grouped into semantically motivated syntagmatic patterns. Associating a "meaning" with each pattern is a secondary step, carried out in close coordination with the assignment of concordance lines to patterns. The identification of a syntagmatic pattern is not an automatic procedure: it calls for a great deal of lexicographic art. Among the most difficult of all lexicographic decisions is the selection of an appropriate level of generalization on the basis of which senses are to be distinguished. For example, one might say that the intransitive verb abate has only one sense ("become less in intensity"), or one might separate storm abate from political protest abate, on the grounds that the two contexts have different implicatures. That is a simple example, but in more complex cases (e.g. the verb bear) patterns are indispensable for effective disambiguation. Bearing a heavy burden is a pattern that normally has an abstract interpretation in English (as opposed to, say, carrying a heavy load), and the meaning is associated with the prototypical phrase, which is quite different in turn from I can't bear it.
In CPA, the "meaning" of a pattern is expressed as a set of basic implicatures. E.g., for the verb file one pattern is: [[Human = Plaintiff]] file [[Procedure = Lawsuit]], of which the implicature may be expressed as "If you file a lawsuit, you are acting as the plaintiff and you activate a procedure by which you hope to obtain redress for some wrong that you believe has been done to you". Depending on the proposed application, the implicature of a pattern may be expressed in any of a wide variety of other ways, e.g. as a translation into another language or as a synonym set such as "file = activate, start, begin, lodge". Each argument of each pattern is linked to a node in a shallow semantic ontology (Pustejovsky et al. 2004).
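As an illustration only, a pattern of this kind could be held in a small data structure. The sketch below is not PDEV's actual storage format: the class names `Argument` and `Pattern`, their fields, and the `render` method are all assumptions made for this example. It simply shows how semantic types, roles, and an implicature might be kept together for one pattern.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Argument:
    """One argument slot of a pattern (hypothetical representation)."""
    semantic_type: str           # e.g. "Human"; would link to an ontology node
    role: Optional[str] = None   # e.g. "Plaintiff"; optional contextual role

    def render(self) -> str:
        # Reproduce the double-bracket notation used in the text.
        if self.role:
            return f"[[{self.semantic_type} = {self.role}]]"
        return f"[[{self.semantic_type}]]"


@dataclass(frozen=True)
class Pattern:
    """A verb pattern together with its implicature (hypothetical representation)."""
    verb: str
    subject: Argument
    direct_object: Optional[Argument]
    implicature: str

    def render(self) -> str:
        parts = [self.subject.render(), self.verb]
        if self.direct_object is not None:
            parts.append(self.direct_object.render())
        return " ".join(parts)


# The "file a lawsuit" pattern quoted in the text.
file_lawsuit = Pattern(
    verb="file",
    subject=Argument("Human", "Plaintiff"),
    direct_object=Argument("Procedure", "Lawsuit"),
    implicature=(
        "If you file a lawsuit, you are acting as the plaintiff and you "
        "activate a procedure by which you hope to obtain redress for some "
        "wrong that you believe has been done to you"
    ),
)

print(file_lawsuit.render())
```

The implicature is stored as free text here; a translation or a synonym set, as mentioned above, could be held in the same field or in additional ones.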
Every verb pattern is supported by links to analysed evidence of usage in a corpus (at present, the British National Corpus). Corpus samples are randomly selected and typically consist of 250 corpus lines (or a multiple thereof) for each verb. Every line in the sample is classified. This enables CPA to complement the Sketch Engine’s focus on statistically significant collocations.
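The sampling step can be sketched as follows. This is a minimal illustration, not the project's own tooling: the function name `draw_sample`, its parameters, and the placeholder concordance are hypothetical, and the only figure taken from the text is the sample unit of 250 lines.

```python
import random

SAMPLE_UNIT = 250  # sample size stated in the text; larger samples are multiples of it


def draw_sample(concordance, multiple=1, seed=None):
    """Return a random sample of SAMPLE_UNIT * multiple lines.

    If the verb has fewer lines than requested, all of its lines are returned
    (in random order). Passing a seed makes the selection reproducible.
    """
    size = min(SAMPLE_UNIT * multiple, len(concordance))
    rng = random.Random(seed)
    return rng.sample(concordance, size)


# Placeholder concordance standing in for real corpus lines of one verb.
concordance = [f"corpus line {i}" for i in range(1000)]
sample = draw_sample(concordance, seed=42)
print(len(sample))
```

Every line in the returned sample would then be classified by hand, which is what distinguishes this procedure from reporting only the statistically salient collocations.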
Most corpus lines are ‘normal uses’, assigned to a specific pattern. However, a few corpus lines are classified as a creative ‘exploitation’ of a normal use and tagged as such. There are three subclasses of exploitations: semantically anomalous arguments, figurative uses, and syntactically anomalous structures. In this way, CPA is building up an unrivalled collection of examples of creative language use, as a spin-off from its central task of identifying normal usage and meanings.
Among other features, PDEV also provides information about the relative frequency of each phraseological pattern.
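Once every line in a sample has been classified, relative frequencies follow from simple counting. The sketch below assumes a hypothetical tagging scheme (`TaggedLine`, with an optional `Exploitation` label for the three subclasses named above) and invented counts; whether PDEV counts exploitations under the pattern they exploit is also an assumption made here.

```python
from collections import Counter
from enum import Enum
from typing import NamedTuple, Optional


class Exploitation(Enum):
    """The three subclasses of exploitation named in the text."""
    ANOMALOUS_ARGUMENT = "semantically anomalous argument"
    FIGURATIVE = "figurative use"
    ANOMALOUS_SYNTAX = "syntactically anomalous structure"


class TaggedLine(NamedTuple):
    pattern_id: int                              # pattern the line is assigned to
    exploitation: Optional[Exploitation] = None  # None means a normal use


def relative_frequencies(lines):
    """Share of the sample assigned to each pattern (exploitations included)."""
    if not lines:
        raise ValueError("sample is empty")
    counts = Counter(line.pattern_id for line in lines)
    return {pid: n / len(lines) for pid, n in counts.items()}


def exploitation_share(lines):
    """Share of the sample tagged as an exploitation of some pattern."""
    if not lines:
        raise ValueError("sample is empty")
    exploited = sum(1 for line in lines if line.exploitation is not None)
    return exploited / len(lines)


# Invented sample of 250 classified lines for a verb with two patterns.
sample_lines = (
    [TaggedLine(1)] * 150
    + [TaggedLine(2)] * 90
    + [TaggedLine(1, Exploitation.FIGURATIVE)] * 10
)

print(relative_frequencies(sample_lines))
print(exploitation_share(sample_lines))
```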
CPA & PDEV bibliography
- Atkins, B. T. S. 1993. Tools for computer-aided corpus lexicography: the Hector project. Acta Linguistica Hungarica, 41.
- Fillmore, Charles J., and Beryl T. [Sue] Atkins. 1992. "Towards a frame-based organization of the lexicon: the semantics of RISK and its neighbors" in Adrienne Lehrer and Eva Feder Kittay (eds.): Frames, Fields, and Contrasts. Lawrence Erlbaum Associates.
- Hanks, Patrick. 1994. "Linguistic Norms and Pragmatic Exploitations, or Why Lexicographers need Prototype Theory and Vice Versa" in F. Kiefer, G. Kiss, and J. Pajzs (eds.), Papers in Computational Lexicography: Complex '94, Research Institute for Linguistics, Hungarian Academy of Sciences.
- Hanks, Patrick, and Pustejovsky, James. 2004. "Common Sense About Word Meaning: Sense in Context". TSD 2004, Brno, Czech Republic.
- Hanks, Patrick. 2004. "The Syntagmatics of Metaphor and Idioms". International Journal of Lexicography, Volume 17, Number 3.
- Hanks, Patrick, and James Pustejovsky. 2005. "A Pattern Dictionary for Natural Language Processing" in Revue Française de linguistique appliquée, 10:2.
- Hanks, Patrick. 2008. "Mapping meaning onto use: a Pattern Dictionary of English Verbs". AACL 2008, Utah.
- Hanks, Patrick. 2013. Lexical Analysis: Norms and Exploitations. Cambridge, MA: MIT Press.
- Jackendoff, Ray. 2002. Foundations of Language. Oxford University Press.
- Pustejovsky, James. 1995. The Generative Lexicon. MIT Press.
- Pustejovsky, James, Hanks, Patrick, and Rumshisky, Anna. 2004. "Automated Induction of Sense in Context". COLING 2004. Geneva, Switzerland.
- Rumshisky, Anna, Hanks, Patrick, Havasi, C., and Pustejovsky, James. 2006. "Constructing a Corpus-based Ontology using Model Bias". FLAIRS 2006, Melbourne Beach, Florida.
- Rumshisky, Anna, and Pustejovsky, James. 2006. "Inducing Sense-Discriminating Context Patterns from Sense-Tagged Corpora". LREC 2006, Genoa, Italy.
- Sinclair, John. 1966. "Beginning the Study of Lexis" in C. E. Bazell, J. C. Catford, M. A. K. Halliday, and R. H. Robins (eds.): In Memory of J. R. Firth. Longman.
- Sinclair, John (ed.). 1987. Looking Up: an Account of the Cobuild Project in Lexical Computing. HarperCollins.
- Sinclair, John. 1991. Corpus, Concordance, Collocation. Oxford University Press.
- Sinclair, John. 2004. Trust the Text: Language, Corpus, and Discourse. Routledge.
- Sinclair, John, Hanks, Patrick, et al. 1987. The Collins Cobuild English Language Dictionary. HarperCollins.
- Wilks, Yorick. 1975. A Preferential, Pattern-Seeking, Semantics for Natural Language Inference. In: Artificial Intelligence 6(1).
- Bradbury, Jane and El Maarouf, Ismail. 2013. An empirical classification of verbs based on Semantic Types: the case of the 'poison' verbs. In: Proceedings of JSSP.
- El Maarouf, Ismail and Baisa, Vít. 2013. Automatic classification of semantic patterns from the Pattern Dictionary of English Verbs. In: Proceedings of JSSP.
- El Maarouf, Ismail and Bradbury, Jane and Baisa, Vít and Hanks, Patrick. 2014. Disambiguating Verbs by Collocation: Corpus Lexicography meets Natural Language Processing. In: Proceedings of LREC, Reykjavik.
- Popescu, Octavian. 2012. Building a Resource of Patterns Using Semantic Types. In: Proceedings of LREC, Istanbul.
- Popescu, Octavian and Palmer, Martha and Hanks, Patrick. 2014. Mapping CPA onto OntoNotes Senses. In: Proceedings of LREC, Reykjavik.
- Baisa, Vít and El Maarouf, Ismaïl and Rychlý, Pavel and Rambousek, Adam. 2015. Software and Data for Corpus Pattern Analysis. In: Proceedings of the Ninth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno, Tribun EU. pp. 75–86.