lang2vec
Author: Patrick Littell
Last updated: July 15, 2016

Usage: ./lang2vec (-m) (-f) (-r) <LANGUAGES> <SETS>

<LANGUAGES> is a space-separated string of ISO 639-3 codes (e.g., "deu eng fra").  Any two-letter ISO 639-1 codes will be mapped to their corresponding ISO 639-3 codes.
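As a rough illustration of the code mapping described above, the following Python sketch normalizes a <LANGUAGES> string.  The table ISO_639_1_TO_3 and the function normalize_languages are hypothetical names made up for this example, and the table covers only three languages; this is not the tool's actual implementation.

```python
# Tiny illustrative subset of the ISO 639-1 to ISO 639-3 mapping.
# (Hypothetical table for this sketch; the real mapping is far larger.)
ISO_639_1_TO_3 = {"de": "deu", "en": "eng", "fr": "fra"}


def normalize_languages(languages):
    """Split a space-separated code string and map two-letter codes to three-letter ones."""
    codes = []
    for code in languages.split():
        if len(code) == 2:
            # Raises KeyError for codes outside the toy table.
            code = ISO_639_1_TO_3[code]
        codes.append(code)
    return codes


print(normalize_languages("de eng fr"))  # ['deu', 'eng', 'fra']
```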

<SETS> is a named feature set (e.g., syntax_wals or phonology_knn), or an elementwise union A|B of two feature sets, or a concatenation A+B of two feature sets.  The union binds more tightly than the concatenation, so "id+syntax_wals|syntax_sswl" gives the id vector concatenated with the elementwise union of the WALS and SSWL syntax feature sets.
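To make the two operators concrete, here is a minimal Python sketch that evaluates a <SETS> expression over toy vectors for a single language.  Everything in it is an assumption made for illustration: the vectors in TOY_SETS are invented, None stands in for a missing value, and the union is assumed to take the larger of the known values at each position.  The tool's real union semantics may differ.

```python
# Hypothetical toy vectors for one language; None marks a missing value.
TOY_SETS = {
    "id": [1, 0],
    "syntax_wals": [1, None, 0],
    "syntax_sswl": [None, 1, 0],
}


def union(a, b):
    """Elementwise union (assumed semantics: max of the known values)."""
    if len(a) != len(b):
        raise ValueError("union needs vectors of equal length")
    out = []
    for x, y in zip(a, b):
        known = [v for v in (x, y) if v is not None]
        out.append(max(known) if known else None)
    return out


def evaluate(expr, sets=TOY_SETS):
    """Evaluate a set expression; '|' is applied before '+'."""
    result = []
    for term in expr.split("+"):
        names = term.split("|")
        vec = list(sets[names[0]])
        for name in names[1:]:
            vec = union(vec, sets[name])
        result.extend(vec)  # concatenation appends the columns
    return result


print(evaluate("id+syntax_wals|syntax_sswl"))  # [1, 0, 1, 1, 0]
```

Splitting on "+" first and on "|" second is what makes the union bind more tightly, matching the reading of the example expression above.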

The named sets are:

    Sets from feature and inventory databases:
        "syntax_wals",
        "phonology_wals",
        "syntax_sswl",
        "syntax_ethnologue",
        "phonology_ethnologue",
        "inventory_ethnologue",
        "inventory_phoible_aa",
        "inventory_phoible_gm",
        "inventory_phoible_saphon",
        "inventory_phoible_spa",
        "inventory_phoible_ph",
        "inventory_phoible_ra",
        "inventory_phoible_upsid",

    Averages of sets:
        "syntax_average",
        "phonology_average",
        "inventory_average",

    KNN predictions of feature values:
        "syntax_knn",
        "phonology_knn",
        "inventory_knn",

    Membership in language families and subfamilies:
        "fam",

    Distance from fixed points on Earth's surface:
        "geo",
        
    One-hot identity vector:
        "id",
    
    
OPTIONS:

-m, --minimal: Suppress columns that contain only zeros, only ones, or only nulls.
-f, --fields: Display field names as the first row.
-r, --random: Randomize the values (as, for example, a control).

The "minimal" transformation applies after any union or concatenation.  (If it did not, sets in the same group, like the syntax_* sets, would not be the same dimensionality for comparison.) The "random" transformation applies after the "minimal" transformation.  (So if you're doing an experiment with a minimized set and using a randomized set as a control, the randomized set will be the same dimensionality as the original.)
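The ordering can be sketched in Python as follows.  This is an illustration under stated assumptions, not the tool's code: rows are per-language vectors, None stands in for a null, and "randomize" is assumed to mean replacing every value with a random 0 or 1 (the README does not say how values are randomized).  The function names are hypothetical.

```python
import random

# A column is dropped when its set of values is exactly one of these.
CONSTANT_COLUMNS = ({0}, {1}, {None})


def minimal(rows):
    """Drop columns that contain only zeros, only ones, or only nulls."""
    if not rows:
        return rows
    keep = [
        j for j in range(len(rows[0]))
        if {row[j] for row in rows} not in CONSTANT_COLUMNS
    ]
    return [[row[j] for j in keep] for row in rows]


def randomize(rows, seed=0):
    """Assumed randomization: replace every value with a random 0 or 1."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in row] for row in rows]


# Three languages, four features (already unioned or concatenated).
rows = [
    [1, 0, 1, None],
    [1, 1, 0, None],
    [1, 0, 0, None],
]

minimized = minimal(rows)       # columns 0 and 3 are suppressed
control = randomize(minimized)  # randomizing last keeps the reduced shape

print(minimized)                  # [[0, 1], [1, 0], [0, 0]]
print([len(r) for r in control])  # [2, 2, 2]
```

Because the randomization runs on the already minimized rows, the control has the same number of columns as the minimized set.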

REFERENCES:

The different sets above are derived from many sources:

*_wals -- Features derived from the World Atlas of Language Structures.
*_sswl -- Features derived from Syntactic Structures of the World's Languages.
*_ethnologue -- Features derived from (shallowly) parsing the prose typological descriptions in Ethnologue (Lewis et al. 2015).
*_phoible_aa -- AA = Alphabets of Africa. Features derived from PHOIBLE's normalization of *Systèmes alphabétiques des langues africaines* (Hartell 1993, Chanard 2006).
*_phoible_gm -- GM = Green and Moran.  Features derived from PHOIBLE's normalization of Christopher Green and Steven Moran's pan-African inventory database.
*_phoible_ph -- PH = PHOIBLE.  Features derived from PHOIBLE proper, by Moran, McCloy, and Wright (2012).
*_phoible_ra -- RA = Ramaswami.  Features derived from PHOIBLE's normalization of *Common Linguistic Features in Indian Languages: Phonetics* (Ramaswami 1999).
*_phoible_saphon -- SAPHON = South American Phonological Inventory Database.  Features derived from PHOIBLE's normalization of SAPHON (Lev et al. 2012).
*_phoible_spa -- SPA = Stanford Phonology Archive.  Features derived from PHOIBLE's normalization of SPA (Crothers et al. 1979).
*_phoible_upsid -- UPSID = UCLA Phonological Segment Inventory Database.  Features derived from PHOIBLE's normalization of UPSID (Maddieson 1984, Maddieson and Precoda 1990).

