Thursday, October 1, 2015

Using Genderize to infer gender in a LinkedIn network

A month or so ago, I got to wondering whether there was any way to determine the gender breakdown of my LinkedIn network. Surprisingly, LinkedIn doesn't even ask for gender at sign-up, so I couldn't just pull the info directly from LinkedIn. And I didn't need a 100% accurate solution – I just wanted a directionally useful metric.

After doing a bit of Googling, I found Genderize, a nice little API that gives you a best guess at a gender if you give it a name. If you send it this string:
you get back this result:

In other words, Genderize believes with 100% confidence that "richard" is a male name. (From Genderize's documentation, the count "represents the number of data entries examined in order to calculate the response.")
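To make the shape of that exchange concrete, here is a minimal sketch. The endpoint URL and the `probability` field name are my assumptions about the API's public interface rather than something taken from this post, and the sample response is made up for illustration (the count in particular is a placeholder, not real data).

```python
import json
from urllib.parse import urlencode

# Assumed endpoint; check the API documentation before relying on it.
BASE_URL = "https://api.genderize.io/"


def build_query(name):
    """Return the query URL for a single first name."""
    return BASE_URL + "?" + urlencode({"name": name})


def describe(response_text):
    """Summarize a JSON response in one readable line."""
    data = json.loads(response_text)
    return "{name}: {gender} ({probability:.0%} confidence, {count} entries)".format(**data)


# Illustrative response only; the count is a placeholder value.
sample = '{"name": "richard", "gender": "male", "probability": 1.0, "count": 1234}'

print(build_query("richard"))
print(describe(sample))
```

Nothing here touches the network, so it is safe to run without an account or a token.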

I have more than 2,300 connections on LinkedIn, so checking the names one at a time was going to be too time-consuming. Instead, I signed up for a developer account and paid for up to 100,000 queries/month. (For more than a handful of queries, Genderize will rate-limit you; with a developer account, you get an access token that bypasses the rate limits.)

With an access token, here are the steps I used to get a breakdown of my LinkedIn network's gender split:

  1. Export LinkedIn connections
  2. Import the file into a Google Sheet
  3. Delete everything but the first name field ("Given Name")
  4. In a separate column, create a URL string that appends the contents of the Given Name column to a tokenized URL that includes your access token. For me this looked like:
  5. In a new column, use Google Sheets's "ImportHTML" function to execute the query represented in the adjacent column:
  6. Step 5 creates several columns, because Google Sheets brings the query results into the spreadsheet; unfortunately, it does not split the gender result into its own column. Create a new column and use the "Split" function to break the string [gender:"female"] into separate cells, then use "CountIF" to count how many times the word "female" appears in your worksheet. Divide that number by the total number of rows in your spreadsheet, and you have your percentage of female contacts.
(If I were a better programmer, I could have built a simple Python script using Genderize's API to do this automatically. Maybe someone who reads this will want to build it? Let me know!)
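The local parts of that workflow, pulling the first names out of the export and computing the percentage, can be roughly sketched in Python. This is only a sketch under assumptions: the "Given Name" column comes from step 3, while the "Family Name" column, the sample rows, and the `fake_lookup` table standing in for live API calls are all hypothetical.

```python
import csv
import io


def given_names(csv_text, column="Given Name"):
    """Yield the non-empty values of the first-name column."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = (row.get(column) or "").strip()
        if name:
            yield name


def percent_female(genders):
    """Share of 'female' among all rows, as in step 6 (None = undetermined)."""
    genders = list(genders)
    if not genders:
        return 0.0
    return genders.count("female") / len(genders)


# Tiny stand-in for an export file; a real export has many more columns.
export = "Given Name,Family Name\nAlice,Smith\nBob,Jones\n,Nameless\n"
names = list(given_names(export))

# Hypothetical lookup results used in place of live API calls.
fake_lookup = {"Alice": "female", "Bob": "male"}

print(names)
print(percent_female(fake_lookup.get(n) for n in names))
```

Like step 6, this divides by the total number of rows, so undetermined names count against the female share rather than being dropped.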


  1. Here you go! Replace the "names" array with the result of your LinkedIn export. NB the API limits non-dev users to 1000 queries per day. Also the comments field is eating my indentations, which renders Python nonsensical - argh. Hopefully it's obvious from context.

    I noticed that a few of my contacts include a surname in the "First Name" field ('Hillary Rodham' in the below example), and those names come back as undetermined. Might be enough to skew your data, if you assume that women are more likely than men to do this.

    import requests

    names = [
        'Hillary Rodham',
    ]

    female = 0
    male = 0
    cant_tell = 0
    undetermined_names = []

    for name in names:
        # The API query URL prefix belongs inside the quotes.
        request_string = "" + name
        r = requests.get(request_string)
        result = r.json()
        if result['gender'] == 'female':
            female = female + 1
        elif result['gender'] == 'male':
            male = male + 1
        else:
            # No confident guess: count the name and keep it for review.
            cant_tell = cant_tell + 1
            undetermined_names.append(name)

    # Avoid dividing by zero when no name could be classified.
    ratio = float(female) / (female + male) if (female + male) else 0.0
    print("Female: " + str(female))
    print("Male: " + str(male))
    print("Percent female " + '{:.1%}'.format(ratio))
    print("Undetermined: " + str(cant_tell))
    print(undetermined_names)

  2. Code is up on GitHub (with a fix for the multiple-given-names problem):
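Without seeing that code, one plausible way to handle the multiple-given-names problem is to query only the first whitespace-separated token of the field. The sketch below is a guess at such a fix, not a copy of the linked code, and `first_given_name` is a hypothetical helper name.

```python
def first_given_name(raw):
    """Return the first whitespace-separated token of a first-name field."""
    parts = raw.strip().split()
    return parts[0] if parts else ""


# 'Hillary Rodham' would be looked up as 'Hillary'.
print(first_given_name("Hillary Rodham"))
```

This trades one error for another: a genuinely two-part given name is reduced to its first half, which may or may not be the part the API knows best.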