Wednesday, February 25, 2015

StNER: Interface to the Stanford Named Entity Recognizer

Introduction

StNER provides a Pharo Smalltalk interface to the Stanford Named Entity Recognizer (NER). Stanford NER is an implementation of a named entity recognizer, used to tag raw text, which is a central task in Information Retrieval and Natural Language Processing. The input is a sequence of words in a text, and the NER classifier, using a previously trained model, tries to recognize "Named Entities" (NEs), typically of three types: PERSON, LOCATION and ORGANIZATION (more classes exist). The output is the tagged text in one of the common token-tagging formats. The recognizer works best on input similar to the labeled data sets it was trained on (muc6, muc7, conll2003); however, there are reports of its use on tweets, and you can retrain it to recognize entities for your particular needs.

To tag text in other languages, for example Chinese, German, or Spanish, a different classifier (in this context a .tgz file) can be used (see NLP Stanford Demo).

Installation

  • Java is required to run the server locally.
  • Download the Stanford NER packages.
  • Inside Pharo, open the Configuration Browser, select StNER, then click Install. Alternatively, evaluate:
    Gofer it
     smalltalkhubUser: 'hernan' project: 'StNER';
     configurationOf: 'StNER';
     loadStable
    

Launch the server

  • Start the Java server from Smalltalk using the StNER Smalltalk server interface. For example, to start the server with default parameters on Windows:
    StSocketNERServer new
        stanfordNERPath: 'c:\stanford-ner-2015-01-30\';
        startServer.
    
  • Query an input text using the StNER Smalltalk client interface.

Server Settings

The path to the Stanford NER directory is mandatory. Any other setting that is not supplied takes its default value:
  • host: localhost (127.0.0.1)
  • port: 8080
  • JVM memory: 1000m
  • output format: inlineXML

You can configure the server with the following taggers:
  • 3 class NER tagger that labels PERSON, ORGANIZATION, and LOCATION entities (#setEnglish3ClassTagger).
  • 4 class NER tagger, trained on the CoNLL 2003 Shared Task training data, that labels PERSON, ORGANIZATION, LOCATION, and MISC entities (#setEnglish4ClassTagger).
  • 7 class NER tagger, trained only on data from MUC, that labels TIME, LOCATION, ORGANIZATION, PERSON, MONEY, PERCENT, and DATE entities (#setEnglish7ClassTagger).
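
As a minimal sketch of how one of these taggers might be selected, the snippet below sends #setEnglish4ClassTagger in the same cascade used to start the server. It assumes the tagger selectors are understood by StSocketNERServer and must be sent before #startServer; the path is the one from the example above and should be adjusted to your installation.

```smalltalk
"Sketch only: assumes the tagger selector is sent to the server
 instance before it is started."
StSocketNERServer new
    stanfordNERPath: 'c:\stanford-ner-2015-01-30\';
    setEnglish4ClassTagger;
    startServer.
```

With this tagger, a MISC class would be expected in the results in addition to PERSON, ORGANIZATION, and LOCATION.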

Client Usage

To tag text you can use the #tagText: method as follows:
StSocketNERClient new 
  tagText: 'University of California is located in California, United States'
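and the output is the same text with each recognized entity wrapped in an inline XML tag named after its class (the tags are not reproduced here):
'University of California is located in California, United States'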
Another example including PERSON tagging:
StSocketNERClient new 
 tagText: 'Argentina President Kirchner has been asked to testify in court on the death of Alberto Nisman the crusading prosecutor who had accused her of conspiring to cover up involvement of Iran'
which results in (inline XML tags not shown):
'Argentina President Kirchner has been asked to testify in court on the death of Alberto Nisman the crusading prosecutor who had accused her of conspiring to cover up involvement of Iran'
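
Because the default output format is inlineXML, the tagged string can be post-processed with Pharo's built-in regular expressions. The sketch below pulls the PERSON elements out of a tagged string. The sample string is hand-written in the inlineXML style of Stanford NER (each entity wrapped in a tag named after its class); it is not actual StNER output.

```smalltalk
| tagged people |
"Hand-written sample in inlineXML style; not actual StNER output."
tagged := 'the death of <PERSON>Alberto Nisman</PERSON> and the involvement of <LOCATION>Iran</LOCATION>'.

"Collect every PERSON element, including its surrounding tags."
people := tagged allRegexMatches: '<PERSON>[^<]*</PERSON>'.

people do: [ :each | Transcript show: each; cr ].
```

In practice the `tagged` variable would hold the string answered by #tagText:.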
Parse text to in-line XML
StSocketNERClient new 
  parseText: 'University of California is located in California, United States'
results in a Dictionary of Bags holding the occurrences of each tagged class.
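
A Dictionary of Bags can be traversed with the standard collection protocol. The sketch below counts the entities found per class. To stay runnable without a server it builds a hypothetical stand-in for the result; the assumption that keys are class names and that each Bag holds entity strings is mine and may not match what #parseText: actually answers.

```smalltalk
| result counts |
"Hypothetical stand-in for the value answered by #parseText:.
 With a running server, use instead:
 result := StSocketNERClient new parseText: aString."
result := Dictionary new.
result at: 'LOCATION' put: (Bag withAll: #('California' 'United States')).
result at: 'ORGANIZATION' put: (Bag withAll: #('University of California')).

"Count how many entities were collected for each class."
counts := Dictionary new.
result keysAndValuesDo: [ :tagClass :entities |
    counts at: tagClass put: entities size ].

counts keysAndValuesDo: [ :tagClass :n |
    Transcript show: tagClass; show: ': '; show: n printString; cr ].
```

Since a Bag keeps duplicates, sending `occurrencesOf:` to one of the values gives the number of times a particular entity was tagged.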
