
Python - Parse IPv4 Addresses From String (Even When Censored)

Objective: Write Python 2.7 code to extract IPv4 addresses from string. String content example: The following are IP addresses: 192.168.1.1, 8.8.8.8, 101.099.098.000. These can al

Solution 1:

Here is a regex that works:

import re
pattern = r"((([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])[ (\[]?(\.|dot)[ )\]]?){3}([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5]))"
text = "The following are IP addresses: 192.168.1.1, 8.8.8.8, 101.099.098.000. These can also appear as 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. "
ips = [match[0] for match in re.findall(pattern, text)]
print ips

# output: ['192.168.1.1', '8.8.8.8', '101.099.098.000', '192.168.1[.]1', '192.168.1(.)1', '192.168.1[dot]1', '192.168.1(dot)1', '192 .168 .1 .1', '192. 168. 1. 1']

The regex has a few main parts, which I will explain here:

  • ([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5]) This matches the numerical parts of the IP address. | means "or". The first case handles numbers from 0 to 199, with or without leading zeroes. The other two cases handle numbers from 200 to 255.
  • [ (\[]?(\.|dot)[ )\]]? This matches the "dot" parts. There are three sub-components:
    • [ (\[]? The "prefix" for the dot. Either a space, an open paren, or open square brace. The trailing ? means that this part is optional.
    • (\.|dot) Either "dot" or a period.
    • [ )\]]? The "suffix". Same logic as the prefix.
  • {3} means repeat the previous component 3 times.
  • The final element is another number, which is the same as the first, except it is not followed by a dot.
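To make the breakdown above concrete, here is a minimal sketch that assembles the same regex from named pieces. The names OCTET, SEP and PATTERN are my own labels (not from the answer), and I switched to non-capturing groups so each match comes back as one string; treat it as an illustration rather than a drop-in replacement.

```python
import re

# Hypothetical names for the parts described in the bullets above.
OCTET = r"(?:[01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])"  # numeric part
SEP = r"[ (\[]?(?:\.|dot)[ )\]]?"                     # optional prefix, dot, optional suffix

# Three "octet + separator" components followed by a final octet.
PATTERN = re.compile("(?:" + OCTET + SEP + "){3}" + OCTET)

sample = "192.168.1[.]1 or 192.168.1(dot)1 or 8.8.8.8"
matches = [m.group(0) for m in PATTERN.finditer(sample)]
print(matches)
```

Because the groups are non-capturing, there is no need to pick element 0 out of a tuple as in the findall version.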

Solution 2:

Description

This regex matches each of the four octets of what looks like an IP address. Each octet is placed into its own capture group for collection.

(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])\D{1,5}(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])\D{1,5}(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])\D{1,5}(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])


Given the following sample text this regex will match all 10 embedded IP strings in their entirety including the first one. Working example: http://www.rubular.com/r/1MbGZOhuj5

The following are IP addresses: 192.168.1.222, 8.8.8.8, 101.099.098.000. These can also appear as 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. and these censorship methods could apply to any of the dots (Ex: 192[.]168[.]1[.]1).

The resulting matches could be iterated over and a properly formatted IP string could be constructed by joining the 4 capture groups with a dot.
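A rough sketch of that last step, assuming the regex above is built by repeating one octet group four times with \D{1,5} in between. The names OCTET, pattern, sample and normalized are only for illustration.

```python
import re

# One capture group per octet, same alternation as the regex above.
OCTET = r"(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])"

# Four octets separated by one to five non-digit characters.
pattern = re.compile(r"\D{1,5}".join([OCTET] * 4))

sample = "192.168.1[.]1 or 192 .168 .1 .1 or 8.8.8.8"

# Join the four capture groups of every match with a plain dot.
normalized = [".".join(m.groups()) for m in pattern.finditer(sample)]
print(normalized)
```

Keep in mind that \D{1,5} accepts any non-digit characters, so a run of unrelated numbers such as "1, 2, 3, 4" would also be reported as an address.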

Solution 3:

The code below will...

  • find IPs in strings even when censored (ex: 192.168.1[dot]20 or 10.10.10 .21)
  • place them into a list
  • clean them of the censorship (spaces/brackets/parentheses)
  • and replace the uncleaned list entry with the cleaned one.

Caveat: The code below does not account for invalid IPs such as 192.168.0.256 or 192.168.1.2.3. Currently, it will drop the trailing digit (6 and 3 in these examples). If the first octet is invalid (ex: 256.10.10.10), it will drop the leading digit (resulting in 56.10.10.10).

import re

def extractIPs(fileContent):
    pattern = r"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)([ (\[]?(\.|dot)[ )\]]?(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3})"
    # findall returns one tuple per match; element 0 is the whole address.
    ips = [each[0] for each in re.findall(pattern, fileContent)]
    for index, item in enumerate(ips):
        # Strip spaces, brackets and parentheses, then turn "dot" into ".".
        ip = re.sub(r"[ ()\[\]]", "", item)
        ips[index] = re.sub("dot", ".", ip)
    return ips


with open('***INSERT FILE PATH HERE***') as myFile:
    fileContent = myFile.read()

IPs = extractIPs(fileContent)
print "Original file content:\n{0}".format(fileContent)
print "--------------------------------"
print "Parsed results:\n{0}".format(IPs)
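The caveat noted for this solution can be reproduced with a few lines. The sketch below also shows one possible mitigation that is not part of the original answer: wrapping the pattern in digit lookarounds so that a match cannot start or end in the middle of a longer number. LOOSE, STRICT and find_all are hypothetical names, and the groups are non-capturing for brevity.

```python
import re

OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
SEP = r"[ (\[]?(?:\.|dot)[ )\]]?"

# Same structure as the pattern in extractIPs.
LOOSE = re.compile(OCTET + "(?:" + SEP + OCTET + "){3}")

# Assumed variant: refuse matches that touch another digit on either side.
STRICT = re.compile(r"(?<![0-9])" + LOOSE.pattern + r"(?![0-9])")


def find_all(regex, text):
    """Return the full text of every match."""
    return [m.group(0) for m in regex.finditer(text)]


print(find_all(LOOSE, "192.168.0.256"))   # trailing digit is dropped
print(find_all(LOOSE, "256.10.10.10"))    # leading digit is dropped
print(find_all(STRICT, "192.168.0.256"))  # rejected instead of truncated
print(find_all(STRICT, "10.10.10 .21"))   # censored addresses still match
```

The lookarounds do not help with 192.168.1.2.3, because the character after the fourth octet is a dot rather than a digit, so that case would still need a separate check.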

Solution 4:

Extract and Categorize IPv4 Addresses (Even When Censored)

Note: This is just an implementation of a class I wrote for extracting IPv4 Addresses. I will likely update my class with a method for this functionality in the future. You can find it on my GitHub page.


What I'm demonstrating below is the following:

  1. Cleaning up your string content example

  2. Bringing your string data into a list

  3. Using the ExtractIPs() class to parse and categorize IPv4 Addresses

    • This class returns a dictionary containing 4 lists:

      • Valid IPv4 Addresses

      • Public IPv4 Addresses

      • Private IPv4 Addresses

      • Invalid IPv4 Addresses


  • ExtractIPs class

    #!/usr/bin/env python
    """Extract and Classify IP Addresses."""
    
    import re  # Use Regular Expressions.
    
    
    __program__ = "IPAddresses.py"
    __author__ = "Johnny C. Wachter"
    __copyright__ = "Copyright (C) 2014 Johnny C. Wachter"
    __license__ = "MIT"
    __version__ = "0.0.1"
    __maintainer__ = "Johnny C. Wachter"
    __contact__ = "wachter.johnny@gmail.com"
    __status__ = "Development"
    
    
    class ExtractIPs(object):
    
        """Extract and Classify IP Addresses From Input Data."""
    
        def __init__(self, input_data):
            """Instantiate the Class."""
    
            self.input_data = input_data
    
            self.ipv4_results = {
                'valid_ips': [],  # Store all valid IP Addresses.
                'invalid_ips': [],  # Store all invalid IP Addresses.
                'private_ips': [],  # Store all Private IP Addresses.
                'public_ips': []  # Store all Public IP Addresses.
            }
    
        def extract_ipv4_like(self):
            """Extract IP-like strings from input data.
            :rtype : list
            """
    
            ipv4_like_list = []
    
            ip_like_pattern = re.compile(r'([0-9]{1,3}\.){3}([0-9]{1,3})')
    
            for entry in self.input_data:
    
                if re.match(ip_like_pattern, entry):
    
                    if len(entry.split('.')) == 4:
    
                        ipv4_like_list.append(entry)
    
            return ipv4_like_list
    
        def validate_ipv4_like(self):
            """Validate that IP-like entries fall within the appropriate range."""
    
            if self.extract_ipv4_like():
    
                # We're gonna want to ignore the below two addresses.
                ignore_list = ['0.0.0.0', '255.255.255.255']
    
                # Separate the Valid from Invalid IP Addresses.
                for ipv4_like in self.extract_ipv4_like():
    
                    # Split the 'IP' into parts so each part can be validated.
                    parts = ipv4_like.split('.')
    
                    # All part values should be between 0 and 255.
                    if all(0 <= int(part) < 256 for part in parts):
    
                        if ipv4_like not in ignore_list:
    
                            self.ipv4_results['valid_ips'].append(ipv4_like)
    
                    else:
    
                        self.ipv4_results['invalid_ips'].append(ipv4_like)
    
            else:
                pass
    
        def classify_ipv4_addresses(self):
            """Classify Valid IP Addresses."""
    
            if self.ipv4_results['valid_ips']:
    
                # Now we will classify the Valid IP Addresses.
                for valid_ip in self.ipv4_results['valid_ips']:
    
                    private_ip_pattern = re.findall(
    
                        r"""
    
                        (^127\.0\.0\.1)|  # Loopback
    
                        (^10\.(\d{1,3}\.){2}\d{1,3})|  # 10/8 Range
    
                        # Matching the 172.16/12 Range takes several matches
                        (^172\.1[6-9]\.\d{1,3}\.\d{1,3})|
                        (^172\.2[0-9]\.\d{1,3}\.\d{1,3})|
                        (^172\.3[0-1]\.\d{1,3}\.\d{1,3})|
    
                        (^192\.168\.\d{1,3}\.\d{1,3})|  # 192.168/16 Range
    
                        # Match APIPA Range.
                        (^169\.254\.\d{1,3}\.\d{1,3})
    
                        # VERBOSE for a clean look of this RegEx.
                        """, valid_ip, re.VERBOSE
                    )
    
                    if private_ip_pattern:
    
                        self.ipv4_results['private_ips'].append(valid_ip)
    
                    else:
                        self.ipv4_results['public_ips'].append(valid_ip)
    
            else:
                pass
    
        def get_ipv4_results(self):
            """Extract and classify all valid and invalid IP-like strings.
            :returns : dict
            """
    
            self.extract_ipv4_like()
            self.validate_ipv4_like()
            self.classify_ipv4_addresses()
    
            return self.ipv4_results
    
  • Example Extraction With Censorship

    censored = re.compile(
        r"""
    
        \(\.\)|
        \(dot\)|
        \[\.\]|
        \[dot\]|
        ([ ]\.)  # Space in a class, because re.VERBOSE ignores a bare space.
    
        """, re.VERBOSE | re.IGNORECASE
    )
    
    data_list = input_string.split()  # Bring your input string to a list.
    
    clean_list = []  # List to store the cleaned up input.
    
    for entry in data_list:
    
        # Remove undesired leading and trailing characters.
        clean_entry = entry.strip(' .,<>?/[]\\{}"\'|`~!@#$%^&*()_+-=')
    
        clean_list.append(clean_entry)  # Add the entry to the clean list.
    
    clean_unique_list = list(set(clean_list))  # Remove duplicates in list.
    
    # Now we can go ahead and extract IPv4 Addresses. Note that this will be a dict.
    results = ExtractIPs(clean_list).get_ipv4_results()
    
    for k, v in results.iteritems():
    
        # After all that work, make sure the results are nicely presented!
        print("\n%s: %s" % (k, v))
    
    • Results:

      public_ips: ['8.8.8.8', '101.099.098.000']
      
      valid_ips: ['192.168.1.1', '8.8.8.8', '101.099.098.000']
      
      invalid_ips: []
      
      private_ips: ['192.168.1.1']
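The example defines the censored pattern, but the step that applies it to the input string is not shown. One way it might be used, purely as an assumption on my part, is to rewrite every obfuscated separator as a plain dot before splitting the string into entries. The helper name uncensor and the sample string are hypothetical, and the space is written as [ ] because re.VERBOSE ignores unescaped whitespace.

```python
import re

# Assumed variant of the `censored` pattern from the example.
censored = re.compile(
    r"""
    \(\.\)  |   # (.)
    \(dot\) |   # (dot)
    \[\.\]  |   # [.]
    \[dot\] |   # [dot]
    [ ]\.       # a space followed by a dot
    """, re.VERBOSE | re.IGNORECASE
)


def uncensor(text):
    """Rewrite each obfuscated separator as a plain dot."""
    return censored.sub(".", text)


input_string = "192.168.1[.]1, 10.0.0(dot)5, 172 .16 .0 .1"

# Split on whitespace and trim the commas left over from the sentence.
tokens = [token.strip(",") for token in uncensor(input_string).split()]
print(tokens)
```

The resulting list could then be handed to ExtractIPs in place of clean_list. The "dot then space" form (192. 168. 1. 1) is not covered by this pattern and would need its own alternative.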
