Commit 34e7918a authored by Aileen Reichelt's avatar Aileen Reichelt
Browse files

Restore Bolukbasi et al. repository

parent 0aecf184
Loading
Loading
Loading
Loading
+94 −0
Original line number Diff line number Diff line
# PROJECT SPECIFIC


# PYTHON RELATED

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# IPython Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# dotenv
.env

# virtualenv
venv/
ENV/

# Spyder project settings
.spyderproject

# Rope project settings
.ropeproject
+21 −0
Original line number Diff line number Diff line
MIT License

Copyright (c) 2016 Tolga

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
+37 −0
Original line number Diff line number Diff line
# Debiaswe: try to make word embeddings less sexist

🔴[FAT* 2018 tutorial slides](https://drive.google.com/file/d/1IxIdmreH4qVYnx68QVkqCC9-_yyksoxR/view?usp=sharing)


Here we have the code and data for the following paper:
[Man is to Computer Programmer as Woman is to
Homemaker? Debiasing Word Embeddings](http://papers.nips.cc/paper/6228-man-is-to-computer-programmer-as-woman-is-to-homemaker-debiasing-word-embeddings.pdf) by 
Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. Proceedings of [NIPS 2016](https://papers.nips.cc/paper/6228-man-is-to-computer-programmer-as-woman-is-to-homemaker-debiasing-word-embeddings).

**Just looking to download a debiased embedding?**

You can download [binary](https://drive.google.com/file/d/0B5vZVlu2WoS5ZTBSekpUX0RSNDg/view?usp=sharing&resourcekey=0-qO1UY06KB42G1T6IeJ2XCQ)/[txt](https://drive.google.com/file/d/1_PvT4ZvtZjhq4HPywA8-u06epht9ccOw/view?usp=sharing) hard debiased version of the Google's Word2Vec embedding trained on Google News (Origin: GoogleNews-vectors-negative300.bin.gz found [here](https://code.google.com/archive/p/word2vec/)).

**Python scripts:**
- **learn_gender_specific.py**: given a word embedding and a seed set of gender-specific words (like <i>king</i>, <i>she</i>, etc.), it learns a much larger list of gender-specific words
- **debias.py**: given a word embedding, sets of gender-pairs, gender-specific words, and pairs to equalize, it outputs a new word embedding. This version basically reads/writes word2vec binary file format.  

```
python learn_gender_specific.py ../embeddings/GoogleNews-vectors-negative300.bin 50000 ../data/gender_specific_seed.json gender_specific_full.json
```

```
python debias.py ../embeddings/GoogleNews-vectors-negative300.bin ../data/definitional_pairs.json ../data/gender_specific_full.json ../data/equalize_pairs.json ../embeddings/GoogleNews-vectors-negative300-hard-debiased.bin
```


We also have seed data used to debias and crowd data used to evaluate the embeddings.

**Data files:**
- **gender_specific_seed.json**: A list of 218 gender-specific words
- **gender_specific_full.json**: A list of 1441 gender-specific words
- **definitional_pairs.json**: The ten pairs of words we use to define the gender direction
- **equalize_pairs.json**: Some crowdsourced F-M pairs of words that represent gender direction


(All external files that I refer within this repo can be found in [this folder](https://drive.google.com/drive/folders/0B5vZVlu2WoS5dkRFY19YUXVIU2M?resourcekey=0-rZ1HR4Fb0XCi4HFUERGhRA&usp=sharing).)
+1 −0
Original line number Diff line number Diff line
["germane", "ostgeld", "focaccia", "ostalgie", "volksgenosse", "hetman", "auslandsdeutscher", "sinto", "lech", "auslandsgeschäft", "bambino", "reichsbahn", "engadin", "schilling", "grundgesetz", "prosecco", "kleindeutsch", "aventiure", "europide", "flüchtlingsausweis", "weser", "völkerwanderung", "azzurri", "landammann", "trecento", "deutschfeindlichkeit", "polnisch", "baron", "mitteldeutsch", "bundesminister", "germanisch", "itaker", "groschen", "quempaslied", "flüchtlingshilfe", "baden-württemberg",  "thai", "zuwanderin", "edeling", "italienisch", "ausländerhass", "confoederatio helvetica", "germanisieren", "vaudeville", "italowestern", "mittelhochdeutsch", "schwarz-rot-gold", "westmitteldeutsch", "tamtam", "janitscharenmusik", "öterreichisch-ungarisch", "weichsel",  "germanentum", "jungdeutscher", "plattdeutsch", "grappa", "exarch",  "abate", "carabiniere", "bairisch", "alldeutsch", "quart", "sultan", "ramasan", "liechtenstein", "sachsen-anhalt", "settecento", "greyerzer", "reichsdeutsch", "urdeutsch", "bundesrepublikanisch", "thurgau", "germanist", "labiovelar", "kampanile", "ostmitteldeutsch", "frühneuhochdeutsch", "ostverträge", "geschäftsträger", "hochlautung", "reinheitsgebot", "wallis", "signore",  "brandenburg", "nazarener", "sbrinz", "carnotzet", "verhochdeutschen", "cinquecento", "großglockner", "aargau", "einwanderungsstrom", "beg", "kastell", "asylsuchende", "panje",  "spartakiade", "veltliner", "pizzaservice", "wittum", "helvetien", "nibelungen", "papagallo", "amerikahaus", "romand", "ausländisch", "boatpeople", "neudeutsch", "zibetkatze", "rom", "bundesstadt", "schweizweit", "mora", "signor", "chianti", "bundesstraße", "asylsuchender", "indogermane", "kenning", "sejm", "stadtammann", "wessi", "normanne", "ostdeutschland", "volksdeutscher", "rugier", "sütterlinschrift", "secondo", "gambir", "hochdeutsch", "bel paese", "frühneuhochdeutsch", "bundesliga", "devisenbewirtschaftung", "signorina", "zweigelt", "toskana", "auswärtig", "bundesadler", "deutschtum", "khan",  "hodscha", "hinterindien", "franken", "deutschlandchef", "helvetier", "afroasiatisch", "bundesbürger", "eindeutschung", "oberdeutsch", "deutschtümelei", "deutschsprachlich", "radicchio", "missingsch", "ostblock", "himalaja", "puzzolan", "taverne", "sundainseln", "außerdeutsch", "signoria", "cinzano", "wirtschaftsflüchtling", "markomanne", "austrofaschismus", "swissair", "ddr-bürger", "deutsche", "migrationshintergrund", "asylbewerberin", "deutschamerikanisch", "levante", "piefke", "padre", "theatiner", "westdeutschland", "zimber", "stasimitarbeiter", "deutschnational", "karawane", "norddeutsch", "russland", "spumante", "kosovokrieg", "italienisch", "italienerin", "mittelhochdeutsch", "bundeskanzlei", "gastarbeiter", "freiheitlich", "deutsch-schweizerisch", "italienreise", "flüchtlingselend", "ostler", "landesstraße", "sachsen", "verfassungsgerichtshof", "trentino-südtirol", "freisinn", "sprachgesellschaft", "ausländerfeindlich", "kurfürst", "deutschtürke", "schwäbeln", "kirchenstaat", "kebab", "staatsgerichtshof", "oberlandesgericht", "deutschkunde", "ostmitteldeutsch", "bergama", "cajunmusic", "spieloper", "rheinfall", "auffangen", "auslieferungsantrag", "spatha", "ausführen", "magnat", "polonistik", "undeutsch", "westfernsehen", "uri", "indoeuropäisch", "donnerer", "odal", "indogermanisch", "schufa", "dw", "po", "schleswig-holstein", "unter", "humanismus", "piazza", "ausländerkind", "westen", "kentumsprache", "reisefreiheit", "tessin", "greencard", "stasiakte", "ard", "einwandererkind", "pirogge", "lambrusco", "zimbal", "cavaliere", "mark", "raki", "aussiedler", "flüchtlingslager", "ufa", "deutschschweizerisch", "jot", "gorgonzola", "kanake", "u-häkchen", "fichtelberg", "südtirol", "asti", "burgunde", "ost-west-dialog", "centime", "tiber", "jul", "hartz", "quattrocento", "alphorn", "bundesverdienstkreuz", "pasta", "madrigal", "kelte", "deutschschweizer", "displaced person", "zentralasien", "handballbundesliga", "rentenmark", "althochdeutsch", "fremdenpolizei", "tarantella", "frühmittelhochdeutsch", "sibirien", "urgermanisch", "migrantenkind", "schwyzertütsch", "arte povera", "gastarbeiterin", "weißherbst", "bundespräsident", "öterreichweit", "ural", "pidginenglisch", "uckermark", "immigrantin", "ger", "scudo", "spätaussiedler", "lastenausgleichsgesetz", "kulturinstitut", "welschschweizer", "menschenhandel", "austriazismus", "bundeshaus", "italer", "mitternachtssonne", "deutschlandfunk", "welsche", "latsche", "weißbuch", "hilfswillige", "schweizerdeutsch", "kufe", "kolonialherrschaft", "bundesdeutsche", "mitteldeutschland", "stracciatella", "frühmittelhochdeutsch", "vielfraß", "bundesdeutscher", "kaspisches meer", "kanton", "deutschtürkisch", "pfingstochse", "teutonin", "dihk", "amerikadeutsche", "stasiunterlagen", "deutsch", "ausländerpolitik", "niedersachsen", "ausfuhrgarantie", "harz", "lateinisch", "palazzo", "futhark", "schriftdeutsch", "wien", "pan", "weißwurstäquator", "pazifischer ozean", "basso", "welschland", "stabreim", "bundesministerium", "quarta", "schwyzerdütsch", "lasagne", "aga", "karelien", "polnisch", "russlanddeutsch", "kurrentschrift", "canzone", "oberdeutsch", "härtefallkommission", "nachkriegsschweiz", "levantiner", "faschismus", "pole", "angelsachse", "ararat", "reichstag", "verismo", "börde", "paying guest", "balsamessig", "schwabenspiegel", "westmitteldeutsch", "landesversicherungsanstalt", "illyrer", "pagode", "treuhandanstalt", "doktorhut", "fußballbundesliga", "italoamerikaner", "kalabrien", "arbeitsemigrant", "deutschlehrer", "arier", "bajazzo", "kabinett", "lufthansa", "mikrozensus", "verrechnungseinheit", "hanswurst", "sezession", "schlepper", "aufenthaltsgenehmigung", "deutschjüdisch", "einwandererstrom", "außenwert", "misereor", "bundesgartenschau", "bezirkstag", "alpenrepublik", "zwangsumtausch", "auslandsdeutsch", "teehaus", "panasiatisch", "einwanderin", "westdeutscher", "duce", "konsularkorps", "italianisieren", "siamkatze", "auslese", "isolationismus", "expedition", "zav", "einreisen", "turkisieren", "ostöterreich", "anwerbestopp", "waadt", "ausländerin", "franzöisch-deutsch", "vorderasien", "administrator", "stradivari", "welschschweizerisch", "ostgermane", "kolonie", "bundesbetreuung", "gefolgschaft", "bundeshaushalt", "sizilien", "vendetta", "botschafter", "hermesbürgschaft", "nidwalden", "zahlungsbilanz", "apo", "generaloberst", "altnordisch", "jura", "ostasien", "pandschabi", "volksdeutsche", "einwanderer", "saarland", "effendi", "deutschlandlied", "intershop", "eisheiligen", "ch-laut", "bundschuh", "landeshauptmann", "cherusker", "migrantin", "deutsch gesinnt", "dolma", "pecorino", "nordrhein-westfalen", "inländerin", "obwalden", "schrumpfgermane", "osten", "eindeutschen", "amerikadeutscher", "thing", "ciabatta", "hamburg", "schweizergarde", "welscher", "parmesan", "altdeutsch", "mazurka", "böhmerwald", "sowjetzone", "westdeutsche", "berlin", "deutsch", "rhododendron", "fra", "hispano", "deutschsprachig", "osmane", "immigrant", "bundespolitiker", "ubier", "hilfswilliger", "wechselkurs", "marchese", "apulien", "reisescheck", "bergamotte", "defa", "sonata", "zentralschweiz", "apennin", "dax", "ostdeutsche", "bremen", "konsistorium", "deutschfreundlichkeit", "honved", "padrone", "schweizer", "kawass", "departement", "frikadelle", "großdeutsch",  "verdeutschung", "jiddisch", "neubürger", "trattoria",  "panettone", "austromarxismus", "metamusik", "ddr-bürgerin", "boreal", "nordgermane", "notaufnahme", "antipasto", "drk", "catenaccio", "hesperien", "pannacotta", "schweizerin", "moxibustion", "allgäu", "schriftdeutsch", "welschschweiz", "bundesgebiet", "auslandsdeutsche", "eurasier", "schakal", "jass", "bundesrat", "warenumsatzsteuer", "deutscher", "swiss", "westschweiz", "trakehner", "gote", "fürstentag", "autarkie", "flühtlingsstrom", "landesgartenschau", "futurismus", "ligurien", "bundesautobahn", "ku-klux-klan", "standarddeutsch", "kappadozien", "westdeutsch", "westlich", "innerschweiz", "steppenhuhn", "ösi", "orient", "achtundvierziger", "entsendegesetz", "hethiter", "deutsch-türkisch", "romanismus", "schweizerbürgerin", "daus", "franke", "senat", "bundesnachrichtendienst", "bundesbahn", "beamtendeutsch", "zuwandrer", "lombardei", "rittmeister", "lori", "alta moda", "standarddeutsch", "buntnessel", "belcanto", "deutschkenntnis", "piccolo", "tschibuk", "auffanglager", "elba", "arlecchino", "lira", "exilliteratur", "niederdeutsch", "bundesausbildungsförderungsgesetz", "ehrenspielführer", "durchgangslager", "apenninen-halbinsel", "cassata", "schwarz-weiß-rot", "deutschlandsender", "autark", "erzherzog", "eurokommunismus", "europider", "hennastrauch", "öterreichisch", "brd", "plateresk", "prignitz", "treck", "buch", "iberer", "pancetta", "lüneburger heide", "ostig", "fdp", "couvert", "asylbewerberheim", "quintal", "heldenlied", "asiatisch", "kandidat", "notlager", "ems", "bundestag", "hindukusch", "beitrittsgebiet", "türkisch", "güteraustausch", "importe", "mittelniederdeutsch", "mauerschütze", "bundeskanzleramt", "ß", "tagliatelle", "büffel", "ossi", "seconda", "zaubernuss", "ziehungsrecht", "brandgans", "katamaran", "feldgrau", "pizza", "afrodeutsch", "importhandel", "zloty", "italienische", "ostdeutsch", "anopheles", "betäubungsmittelgesetz", "kreuzer", "resident", "bundesdeutsch", "italianismus", "ötlich", "türkischstämmig", "welsch", "valuta", "schleichkatze", "fernamt", "südasien", "deutschlandpolitik", "germanin", "muchtar", "ostpolitik", "thüringen", "flüchtlingsrat", "brillenschlange", "met", "schabzieger", "piva", "krevette", "devise", "ausländerfeindlichkeit", "boccia", "konak", "alpenjäger", "prädikatswein", "preislied", "studienkolleg", "sudetenland", "chassidismus", "hemlocktanne", "baba", "novecento", "großdeutschland", "rheinland-pfalz", "lizenziat", "nachkriegsöterreich", "binnendeutsch", "geest", "billigflagge", "bundeswehr", "amischer", "getto", "kanzleideutsch", "moschustier", "neudeutsch", "polentum", "italienischsprachig", "kamtschatka", "vacherin", "fantasia", "volksgericht", "nationalratspräsident", "kontor", "scampi", "teutonisch", "plattdeutsch", "germanistik", "biedermeier", "certosa", "eurocityzug", "ausländer", "seele", "staatsrat", "bundeskabinett", "alitalia", "italien", "migrationspolitik", "verfassungsinitiative", "diplomatie", "neuhochdeutsch", "zwergkiefer", "marktamt", "dienstpragmatik", "deutschschweiz", "frascati", "kurrent", "türkisch", "fpö", "eurasien", "kemalismus", "landeskirche", "mittelmeerländer", "eidgenosse", "friedensfahrt", "renaissance", "rotwelsch", "hyäne", "italianist", "prälat", "pfalz", "fremdarbeiter", "quent", "spruch", "wandervogel", "hortensie", "türbe", "bundesgesetzblatt", "schwarzwald", "ausländeranteil", "hafenzoll", "integrationsbeauftragte", "mecklenburg-vorpommern", "ostdeutscher", "satemsprache", "mittelniederdeutsch", "botschaft", "maggiore", "schutztruppe", "ländle", "kreole", "hamam", "conte", "incoming", "ripuarisch", "lingua franca", "aare", "bundesversammlung", "bootsflühtling", "mitteldeutsch", "unteritalien", "althochdeutsch", "bigos", "ingwäonen", "schwarzes meer", "bundesanleihe", "fremde", "ober", "ausländeramt", "qualitätswein", "sardinien", "westler", "einigungsvertrag", "asean", "visconte", "don", "halbesel", "bundesbank", "gesandtschaft", "indogermanistik", "behördendeutsch", "notaufnahmelager", "ausländerbehörde", "josephinismus", "schwaben", "flühtlingspolitik", "rote-armee-fraktion", "schutzzoll", "katzelmacher", "deutschstämmig", "reichsdeutscher", "deutsch sprechend", "staatsminister", "präfekt", "deutschamerikaner", "asylgerichtshof", "glosse", "italianistisch", "alemanne", "legionär", "sammellager", "reichsdeutsche", "kapitalflucht", "ostschweiz", "germanien", "orientteppich", "landeshauptfrau", "romandie", "ultra", "oder-neiße-linie", "platt", "neuhochdeutsch", "staatssicherheitsdienst", "südeuropäisch", "deutschstämmige", "umweltflühtling", "ostzone", "mezzogiorno", "villanell", "frisör", "oberitalien", "süddeutsch", "treudeutsch", "bundesverfassungsgericht", "ischia", "mozzarella", "sudetendeutsch", "tramontana", "bayern", "einwandererfamilie", "sprachführer", "durchgangsverkehr", "arno", "rütlischwur", "volkskammer", "mad", "ns-staat", "volksmarine", "dienstleistungsverkehr", "expatriate", "gemeindeutsch", "österreicherin", "zonenrandgebiet", "amtssprache", "tifoso", "schweizerisch", "studienaufenthalt", "hansestadt", "hessen", "bure", "ostflüchtling", "flüchtlingstreck", "ristorante", "osteria", "teutonengrill", "assisen", "riviera", "kolonialherr", "wendezeit", "flüchtlingsheim", "bundesverwaltungsgericht", "diwan", "exequatur", "krautrock", "deutschstämmiger", "woiwod", "geniezeit", "anatolien", "bundessozialgericht", "freiburg"]
 No newline at end of file
+1 −0

File added.

Preview size limit exceeded, changes collapsed.

Loading