How to parse postal addresses

Parse address string, get street, city, postcode, and other address components

A postal address is often provided as a string, for example when it is entered by a user or comes from a spreadsheet file. The address string can be stored as is and used as an additional information field. More frequently, however, the string has to be parsed into address components: house number, street, city, postcode, etc.

This article gives an overview of methods that can help to parse postal addresses. The methods are listed from basic to more advanced.

Parsing addresses with regex

The easiest way to parse an address is to apply a regular expression (regex). This method works well when the addresses follow a regular form. For example, if all the address strings look like STREET_NAME XX, YYYYYY CITY_NAME, you can write a regex that splits each string into [STREET_NAME, XX, YYYYYY, CITY_NAME]. Stack Overflow has many threads with examples of regular expressions for address lines.

You can use an online tool such as RegExr to build and test the regular expression.
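
As a minimal sketch of this approach, the Python snippet below parses strings of the form STREET_NAME XX, YYYYYY CITY_NAME. The pattern, the function name parse_address_line and the component names are illustrative assumptions. The pattern expects a numeric postcode of 4 to 6 digits and would need adjusting for other formats.

```python
import re

# Illustrative pattern for "STREET_NAME XX, YYYYYY CITY_NAME".
# Assumes a numeric postcode of 4 to 6 digits; adjust for your data.
ADDRESS_PATTERN = re.compile(
    r"(?P<street>.+?)\s+"          # street name (lazy, so the house number is not swallowed)
    r"(?P<house_number>\d+\w*),"   # house number, optionally with a suffix such as "8a"
    r"\s*(?P<postcode>\d{4,6})\s+" # numeric postcode
    r"(?P<city>.+)"                # the rest of the line is the city
)


def parse_address_line(line):
    """Return a dict of address components, or None if the line is not regular."""
    match = ADDRESS_PATTERN.fullmatch(line.strip())
    return match.groupdict() if match else None


print(parse_address_line("Franz-Rennefeld-Weg 8, 40472 Düsseldorf"))
# {'street': 'Franz-Rennefeld-Weg', 'house_number': '8', 'postcode': '40472', 'city': 'Düsseldorf'}
```

Strings that do not follow the expected form, such as "19 East Rd, Sheffield S2 3BH, UK", return None, which illustrates why this method only suits regular addresses.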

Pros

  • An effective way to parse structured strings
  • Powerful and configurable
  • No external library/service/API needed

Cons

  • Difficult to design, debug and support
  • Can create performance issues
  • Does not work if the addresses are not regular

npm packages to parse addresses

There are npm packages designed to parse addresses. However, most of them focus on the standards of a particular country. This can be a very good option when your addresses belong to one or a few countries. The table below lists examples of npm packages that parse postal addresses.

Package                        Description                               Popularity   Dependencies   License
parse-address                  Parsing US addresses                      +++          1              ISC
addresser                      Parsing property addresses                ++           0              MIT
addressit                      Freeform street address parser            ++           10             MIT
humanparser                    Parsing human and address information     ++           32             MIT
address-parse                  Analysing and parsing Chinese addresses   +            6              MIT
@minutrade/br-address-parser   Simple parser for Brazilian addresses     +            0              MIT
parse-address-australia        Australia street address parser           +            0              ISC

We recommend trying a few libraries and choosing the one that works best with your addresses.

Pros

  • Easy and convenient
  • Open source and community driven
  • No external service/API needed

Cons

  • Address formats of only a limited number of countries are covered
  • Licenses and dependencies have to be checked carefully before using a library in a commercial project

Libpostal - international address parser trained with NLP and open data

Libpostal is a lightweight C library created by Mapzen and designed to parse international addresses. Unlike most other address parsers, Libpostal uses machine learning and is trained on millions of real-world addresses from open data sources.

You can use the library directly or through bindings, which are available for Python, Go, Ruby, Java, and Node.js.

Libpostal is open source and distributed under the MIT license.

Here are some examples of parsing addresses with Libpostal:

> Franz-Rennefeld-Weg 8, 40472 Düsseldorf

Result:

{
  "road": "franz-rennefeld-weg",
  "house_number": "8",
  "postcode": "40472",
  "city": "düsseldorf"
}

> P.za Giuseppe Garibaldi, 80142 Napoli NA, Italy

Result:

{
  "road": "p.za giuseppe garibaldi",
  "postcode": "80142",
  "city": "napoli",
  "state_district": "na",
  "country": "italy"
}

> 19 East Rd, Sheffield S2 3BH, UK

Result:

{
  "house_number": "19",
  "road": "east rd",
  "city": "sheffield",
  "postcode": "s2 3bh",
  "country": "uk"
}
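
Results like the ones above could be produced from Python roughly as follows. This is a sketch, assuming that Libpostal and its Python binding (the postal package) are installed. The binding's parse_address function returns a list of (value, label) pairs, so the helper components_to_dict, a name made up for this example, groups them into a dictionary.

```python
from collections import defaultdict


def components_to_dict(pairs):
    """Group (value, label) pairs into a dict, joining values that share a label."""
    grouped = defaultdict(list)
    for value, label in pairs:
        grouped[label].append(value)
    return {label: " ".join(values) for label, values in grouped.items()}


def parse_with_libpostal(address):
    """Parse an address with Libpostal; requires the library and its Python binding."""
    # Imported here so the helper above stays usable without Libpostal installed.
    from postal.parser import parse_address

    return components_to_dict(parse_address(address))


# Hand-written pairs in the same shape the binding returns:
pairs = [("19", "house_number"), ("east rd", "road"), ("sheffield", "city")]
print(components_to_dict(pairs))
# {'house_number': '19', 'road': 'east rd', 'city': 'sheffield'}
```

With the library installed, parse_with_libpostal("19 East Rd, Sheffield S2 3BH, UK") would call the trained model. The exact labels it returns depend on the Libpostal data version.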

Pros

  • smart and effective
  • parses location-based strings. Can understand categories and phrases like "restaurants", "near", "in", etc.
  • open-source product with a permissive license

Cons

  • the library has to be installed and maintained
  • keeps the trained data model in memory and therefore consumes about 4 GB of RAM

Geocoding API - to parse, standardize, validate and verify addresses

A Geocoding API is the most powerful tool to parse addresses. Together with the address structure, you get useful location information and latitude/longitude coordinates.

However, keep in mind that the purpose of a Geocoding API is not to parse the address but to find the most suitable existing location for it. For example, if you search for an address that doesn't exist in the geocoder database, the API returns the nearest matches, or an empty result if no similar address is found.

Let's have a look at how to parse, standardize, validate and verify addresses with Geoapify Geocoding API.

The Geoapify Geocoding API returns the following information as a result:

  • corresponding address: latitude/longitude coordinates, standardized address, address components
  • confidence level for the found address, street, and city
  • original address text and parsed address
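
A request could be sent from Python along these lines. This is a sketch only: the endpoint URL and the text and apiKey parameter names are assumptions based on the public Geoapify documentation and should be checked against the current API reference. YOUR_API_KEY is a placeholder.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; verify it against the current Geoapify documentation.
GEOCODE_URL = "https://api.geoapify.com/v1/geocode/search"


def build_geocode_url(address, api_key):
    """Build the request URL with the address and API key URL-encoded."""
    query = urllib.parse.urlencode({"text": address, "apiKey": api_key})
    return f"{GEOCODE_URL}?{query}"


def geocode(address, api_key):
    """Send the request and return the decoded JSON response as a dict."""
    url = build_geocode_url(address, api_key)
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


print(build_geocode_url("Rue des Francs 70, 6001 Charleroi, Belgium", "YOUR_API_KEY"))
```

Calling geocode(...) with a valid key performs the network request. In the returned dictionary, the parsed components sit under query.parsed and the matched locations under features.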

Here is an example of a Geoapify Geocoding API result:

{
   "type":"FeatureCollection",
   "features":[
      {
         "type":"Feature",
         "properties":{
            "datasource":{
               "sourcename":"openstreetmap",
               "attribution":"© OpenStreetMap contributors",
               "license":"Open Database Licence",
               "url":"https://www.openstreetmap.org/copyright"
            },
            "housenumber":"70",
            "street":"Rue des Francs",
            "district":"Marcinelle",
            "city":"Charleroi",
            "county":"Hainaut",
            "state":"Wallonia",
            "country":"Belgium",
            "country_code":"be",
            "lon":4.433351295038911,
            "lat":50.40213325,
            "formatted":"Rue des Francs 70, Charleroi, Belgium",
            "address_line1":"Rue des Francs 70",
            "address_line2":"Charleroi, Belgium",
            "result_type":"building",
            "rank":{
               "importance":0.721,
               "popularity":7.051335815095913,
               "confidence":1,
               "confidence_city_level":1,
               "confidence_street_level":1,
               "confidence_building_level":1,
               "match_type":"full_match"
            },
            "place_id":"51487c1f71c0bb1140592db1321a79334940f00102f901e388422300000000c00203"
         },
         "geometry":{
            "type":"Point",
            "coordinates":[
               4.433351295038911,
               50.40213325
            ]
         },
         "bbox":[
            4.4332593,
            50.402073,
            4.4334427,
            50.4021931
         ]
      }
   ],
   "query":{
      "text":"Rue des Francs 70, 6001 Charleroi, Belgium",
      "parsed":{
         "housenumber":"70",
         "street":"rue des francs",
         "postcode":"6001",
         "city":"charleroi",
         "country":"belgium",
         "expected_type":"building"
      }
   }
}

The rank values of a result can be read as follows:

  • When rank.confidence = 1, it's safe to say that the address is valid and verified. The housenumber, street, city, etc. represent the parsed and normalized address.
  • When rank.confidence_city_level < 1 or rank.confidence_street_level < 1, the address can't be fully verified. However, the most similar address is returned as a result.
  • When rank.confidence_building_level = 0 and rank.confidence_street_level = 1, the address is valid and verified up to street level, but the building was not found.
  • When rank.confidence_building_level is greater than 0 but below 1 and rank.confidence_street_level = 1, the address is valid up to street level, but there are doubts about the exact building position.
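
The rules above can be sketched as a small Python function that takes the rank object of a result. The function name describe_match and the returned labels are invented for illustration. Missing confidence fields are treated as 0, which is an assumption.

```python
def describe_match(rank):
    """Translate the rank object of a geocoding result into a coarse status label."""
    # Missing fields are treated as 0 (an assumption of this sketch).
    confidence = rank.get("confidence", 0)
    city = rank.get("confidence_city_level", 0)
    street = rank.get("confidence_street_level", 0)
    building = rank.get("confidence_building_level", 0)

    if confidence == 1:
        return "verified"
    if city < 1 or street < 1:
        return "not fully verified"
    # From here on, city and street are both confirmed.
    if building == 0:
        return "verified to street level, building not found"
    return "verified to street level, building position uncertain"


rank = {
    "confidence": 1,
    "confidence_city_level": 1,
    "confidence_street_level": 1,
    "confidence_building_level": 1,
}
print(describe_match(rank))  # verified
```

An application could, for example, accept "verified" results automatically and route the other cases to manual review.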

Pros

  • cross-platform and flexible
  • normalizes the address
  • validates and verifies the address
  • returns address location

Cons

  • usually not free when parsing a large number of addresses
  • requires additional logic to handle addresses that could not be verified

Which method is the best and what to choose?

We recommend the following approach when deciding which method to use for your address strings:

  • If you need address locations or normalized addresses, use a Geocoding API
  • If all the addresses are regular and have the same structure, use regex
  • Otherwise, try to find the most suitable npm library or use Libpostal to parse the addresses

FAQ

How can I parse postal addresses?

Use regex if the addresses are regular, that is, all formatted the same way. Use an npm library or Libpostal to parse more complicated addresses. Use a Geocoding API to parse addresses and find the corresponding locations.

Which NPM libraries can be used to parse addresses?

There are many npm libraries that parse addresses. Search by the tags street, address, and parse. Try several and choose the one that best fits your requirements. Always check licenses before adding a new package to a commercial project.

How can I use Geocoding API to parse addresses?

The Geocoding API looks for a location that corresponds to the provided address, so it does more than just parse the address. It tries to understand the address and find the country, state, city, and street it belongs to and, finally, the house number. As a result, you get a standardized and verified address.

What is address normalization?

Address normalization, or address standardization, is the process of formatting free-form or structured addresses according to a country's postal address standards. The process includes replacing abbreviations with their conventional full names. For example, "Ansbacher Str. 7, 91541 Rothenburg o.d.T." will be normalized as "Ansbacher Straße 7, 91541 Rothenburg ob der Tauber, Germany".
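
The abbreviation-replacement step alone can be sketched in a few lines of Python. The lookup table below is a hypothetical two-entry example. Real normalization relies on country-specific reference data, which is why a Geocoding API is usually used for it.

```python
import re

# Hypothetical, deliberately tiny lookup table for this one example.
ABBREVIATIONS = {
    "Str.": "Straße",
    "o.d.T.": "ob der Tauber",
}


def expand_abbreviations(address, country=None):
    """Replace known abbreviations and optionally append the country name."""
    for abbreviation, full_name in ABBREVIATIONS.items():
        # The lookbehind avoids matching in the middle of a longer word.
        pattern = r"(?<!\w)" + re.escape(abbreviation)
        address = re.sub(pattern, full_name, address)
    return f"{address}, {country}" if country else address


print(expand_abbreviations("Ansbacher Str. 7, 91541 Rothenburg o.d.T.", country="Germany"))
# Ansbacher Straße 7, 91541 Rothenburg ob der Tauber, Germany
```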

