Python - Parse IPv4 Addresses From String (Even When Censored)
Solution 1:
Here is a regex that works:
import re
pattern = r"((([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])[ (\[]?(\.|dot)[ )\]]?){3}([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5]))"
text = "The following are IP addresses: 192.168.1.1, 8.8.8.8, 101.099.098.000. These can also appear as 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. "
ips = [match[0] for match in re.findall(pattern, text)]
print(ips)
# output: ['192.168.1.1', '8.8.8.8', '101.099.098.000', '192.168.1[.]1', '192.168.1(.)1', '192.168.1[dot]1', '192.168.1(dot)1', '192 .168 .1 .1', '192. 168. 1. 1']
The regex has a few main parts, which I will explain here:
([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])
This matches the numerical parts of the IP address. | means "or". The first case handles numbers from 0 to 199 with or without leading zeroes. The other two cases handle numbers over 199.
[ (\[]?(\.|dot)[ )\]]?
This matches the "dot" parts. There are three sub-components:
[ (\[]?
The "prefix" for the dot. Either a space, an open paren, or open square brace. The trailing ? means that this part is optional.
(\.|dot)
Either "dot" or a period.
[ )\]]?
The "suffix". Same logic as the prefix.
{3}
means repeat the previous component 3 times.
The final element is another number, which is the same as the first, except it is not followed by a dot.
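The parts described above can be assembled from named pieces, which makes the structure easier to see. This is a minimal sketch, not the answer's exact pattern: the names OCTET, SEP, IP_PATTERN and find_ips are hypothetical, the groups are made non-capturing so that re.findall returns whole matches, and 25[0-5] is tried first so that an octet such as 255 at the end of an address is not cut short at 25.

```python
import re

# Hypothetical names; each piece mirrors one part of the explanation.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])"  # 0-255, leading zeroes allowed
SEP = r"[ (\[]?(?:\.|dot)[ )\]]?"                    # optional prefix, "." or "dot", optional suffix
IP_PATTERN = re.compile(r"(?:%s%s){3}%s" % (OCTET, SEP, OCTET))


def find_ips(text):
    """Return every IP-like substring, censored or not, exactly as it appears."""
    return IP_PATTERN.findall(text)


print(find_ips("192.168.1[dot]1 and 10.0.0.255"))
# ['192.168.1[dot]1', '10.0.0.255']
```

Values above 255 are still not rejected by this sketch; a string like 1.1.1.256 would be matched up to 1.1.1.25.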
Solution 2:
Description
This regex will match each of the four octets of what looks like an IP address. Each octet will be placed into its own capture group for collection.
(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])\D{1,5}(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])\D{1,5}(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])\D{1,5}(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])
Given the following sample text this regex will match all 10 embedded IP strings in their entirety including the first one. Working example: http://www.rubular.com/r/1MbGZOhuj5
The following are IP addresses: 192.168.1.222, 8.8.8.8, 101.099.098.000. These can also appear as 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. and these censorship methods could apply to any of the dots (Ex: 192[.]168[.]1[.]1).
The resulting matches could be iterated over and a properly formatted IP string could be constructed by joining the 4 capture groups with a dot.
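The joining step could look roughly like this in Python. It is a sketch under assumptions: the names OCTET, PATTERN and rebuild_ips are hypothetical, and the pattern is simply the regex above built by joining four copies of the octet group with \D{1,5}.

```python
import re

# Same regex as above: four captured octets separated by 1-5 non-digits.
OCTET = r"(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])"
PATTERN = re.compile(r"\D{1,5}".join([OCTET] * 4))


def rebuild_ips(text):
    """Join the four capture groups of each match with a dot."""
    return [".".join(match.groups()) for match in PATTERN.finditer(text)]


print(rebuild_ips("See 192.168.1[.]1 and 8(dot)8(dot)8(dot)8."))
# ['192.168.1.1', '8.8.8.8']
```

Because \D{1,5} accepts any non-digit characters, a plain list of numbers such as "1, 2, 3, 4" would also be rebuilt as 1.2.3.4, so the results may need further filtering.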
Solution 3:
The code below will...
- find IPs in strings even when censored (ex: 192.168.1[dot]20 or 10.10.10 .21)
- place them into a list
- clean them of the censorship (spaces/brackets/parentheses)
- and replace the uncleaned list entry with the cleaned one.
Caveat: The code below does not account for invalid IPs such as 192.168.0.256 or 192.168.1.2.3. Currently, it drops the trailing digit (6 and 3 in those examples). If the first octet is invalid (ex: 256.10.10.10), it drops the leading digit (resulting in 56.10.10.10).
import re

def extractIPs(fileContent):
    pattern = r"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)([ (\[]?(\.|dot)[ )\]]?(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3})"
    # findall returns one tuple per match; element 0 is the whole address.
    ips = [each[0] for each in re.findall(pattern, fileContent)]
    for location, item in enumerate(ips):
        # Strip spaces, parentheses and brackets, then turn "dot" into ".".
        ip = re.sub(r"[ ()\[\]]", "", item)
        ip = re.sub("dot", ".", ip)
        # Replace the uncleaned entry with the cleaned one.
        ips[location] = ip
    return ips

with open('***INSERT FILE PATH HERE***') as myFile:
    fileContent = myFile.read()

IPs = extractIPs(fileContent)
print("Original file content:\n{0}".format(fileContent))
print("--------------------------------")
print("Parsed results:\n{0}".format(IPs))
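One possible way to address the caveat is to add lookaround guards so that an invalid address is skipped instead of being truncated. The following is a minimal sketch, not part of the original answer: STRICT_PATTERN and extract_strict are hypothetical names, and the guards only look at plain digits and dots, so a censored trailing part (for example 192.168.1.2[.]3) is not caught.

```python
import re

OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
SEP = r"[ (\[]?(?:\.|dot)[ )\]]?"
STRICT_PATTERN = re.compile(
    r"(?<![\d.])"                            # not preceded by a digit or a dot
    + OCTET + "(?:" + SEP + OCTET + "){3}"
    + r"(?!\d|\.\d)"                         # not followed by a digit or by ".digit"
)


def extract_strict(text):
    """Return cleaned addresses, skipping candidates that would be truncated."""
    cleaned = []
    for item in STRICT_PATTERN.findall(text):
        ip = re.sub(r"[ ()\[\]]", "", item)  # drop spaces, parentheses, brackets
        cleaned.append(ip.replace("dot", "."))
    return cleaned


sample = "ok 10.10.10 .21, bad 192.168.0.256, 256.10.10.10 and 192.168.1.2.3."
print(extract_strict(sample))
# ['10.10.10.21']
```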
Solution 4:
Extract and Categorize IPv4 Addresses (Even When Censored)
Note: This is just an implementation of a class I wrote for extracting IPv4 Addresses. I will likely update my class with a method for this functionality in the future. You can find it on my GitHub page.
What I'm demonstrating below is the following:
- Cleaning up your string content example
- Bringing your string data into a list
- Using the ExtractIPs() class to parse and categorize IPv4 Addresses
This class returns a dictionary containing 4 lists:
- Valid IPv4 Addresses
- Public IPv4 Addresses
- Private IPv4 Addresses
- Invalid IPv4 Addresses
ExtractIPs class
#!/usr/bin/env python
"""Extract and Classify IP Addresses."""
import re  # Use Regular Expressions.

__program__ = "IPAddresses.py"
__author__ = "Johnny C. Wachter"
__copyright__ = "Copyright (C) 2014 Johnny C. Wachter"
__license__ = "MIT"
__version__ = "0.0.1"
__maintainer__ = "Johnny C. Wachter"
__contact__ = "wachter.johnny@gmail.com"
__status__ = "Development"


class ExtractIPs(object):
    """Extract and Classify IP Addresses From Input Data."""

    def __init__(self, input_data):
        """Instantiate the Class."""
        self.input_data = input_data
        self.ipv4_results = {
            'valid_ips': [],  # Store all valid IP Addresses.
            'invalid_ips': [],  # Store all invalid IP Addresses.
            'private_ips': [],  # Store all Private IP Addresses.
            'public_ips': []  # Store all Public IP Addresses.
        }

    def extract_ipv4_like(self):
        """Extract IP-like strings from input data.

        :rtype : list
        """
        ipv4_like_list = []
        ip_like_pattern = re.compile(r'([0-9]{1,3}\.){3}([0-9]{1,3})')
        for entry in self.input_data:
            if re.match(ip_like_pattern, entry):
                if len(entry.split('.')) == 4:
                    ipv4_like_list.append(entry)
        return ipv4_like_list

    def validate_ipv4_like(self):
        """Validate that IP-like entries fall within the appropriate range."""
        if self.extract_ipv4_like():
            # We're gonna want to ignore the below two addresses.
            ignore_list = ['0.0.0.0', '255.255.255.255']
            # Separate the Valid from Invalid IP Addresses.
            for ipv4_like in self.extract_ipv4_like():
                # Split the 'IP' into parts so each part can be validated.
                parts = ipv4_like.split('.')
                # All part values should be between 0 and 255.
                if all(0 <= int(part) < 256 for part in parts):
                    if ipv4_like not in ignore_list:
                        self.ipv4_results['valid_ips'].append(ipv4_like)
                else:
                    self.ipv4_results['invalid_ips'].append(ipv4_like)
        else:
            pass

    def classify_ipv4_addresses(self):
        """Classify Valid IP Addresses."""
        if self.ipv4_results['valid_ips']:
            # Now we will classify the Valid IP Addresses.
            for valid_ip in self.ipv4_results['valid_ips']:
                private_ip_pattern = re.findall(
                    r"""
                    (^127\.0\.0\.1)|  # Loopback
                    (^10\.(\d{1,3}\.){2}\d{1,3})|  # 10/8 Range
                    # Matching the 172.16/12 Range takes several matches
                    (^172\.1[6-9]\.\d{1,3}\.\d{1,3})|
                    (^172\.2[0-9]\.\d{1,3}\.\d{1,3})|
                    (^172\.3[0-1]\.\d{1,3}\.\d{1,3})|
                    (^192\.168\.\d{1,3}\.\d{1,3})|  # 192.168/16 Range
                    # Match APIPA Range.
                    (^169\.254\.\d{1,3}\.\d{1,3})
                    # VERBOSE for a clean look of this RegEx.
                    """,
                    valid_ip, re.VERBOSE
                )
                if private_ip_pattern:
                    self.ipv4_results['private_ips'].append(valid_ip)
                else:
                    self.ipv4_results['public_ips'].append(valid_ip)
        else:
            pass

    def get_ipv4_results(self):
        """Extract and classify all valid and invalid IP-like strings.

        :returns : dict
        """
        self.extract_ipv4_like()
        self.validate_ipv4_like()
        self.classify_ipv4_addresses()
        return self.ipv4_results
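For comparison, the same valid/invalid/private/public split can be sketched with the standard library's ipaddress module instead of hand-written regexes. This is a swapped-in technique, not the class's own method, and it assumes Python 3; the function name classify is hypothetical. Its notion of "private" follows the module's own rules, so it may not match the regex list in the class exactly, and recent Python 3 releases reject octets with leading zeroes (such as 101.099.098.000) as invalid.

```python
import ipaddress


def classify(candidates):
    """Sort candidate strings into the same four lists the class returns."""
    results = {'valid_ips': [], 'invalid_ips': [],
               'private_ips': [], 'public_ips': []}
    for text in candidates:
        try:
            address = ipaddress.IPv4Address(text)
        except ValueError:  # raised for anything that is not a valid IPv4 address
            results['invalid_ips'].append(text)
            continue
        results['valid_ips'].append(text)
        key = 'private_ips' if address.is_private else 'public_ips'
        results[key].append(text)
    return results


print(classify(['192.168.1.1', '8.8.8.8', '999.1.1.1']))
```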
Example Extraction With Censorship
censored = re.compile(
    r"""
    \(\.\)|
    \(dot\)|
    \[\.\]|
    \[dot\]|
    (\ \.)  # Space is escaped because re.VERBOSE ignores bare whitespace.
    """,
    re.VERBOSE | re.IGNORECASE
)

# input_string is assumed to hold the text to parse.
data_list = input_string.split()  # Bring your input string to a list.
clean_list = []  # List to store the cleaned up input.

for entry in data_list:
    # Remove undesired leading and trailing characters.
    clean_entry = entry.strip(' .,<>?/[]\\{}"\'|`~!@#$%^&*()_+-=')
    clean_list.append(clean_entry)  # Add the entry to the clean list.

clean_unique_list = list(set(clean_list))  # Remove duplicates in list.

# Now we can go ahead and extract IPv4 Addresses. Note that this will be a dict.
results = ExtractIPs(clean_list).get_ipv4_results()

for k, v in results.items():
    # After all that work, make sure the results are nicely presented!
    print("\n%s: %s" % (k, v))
Results:
public_ips: ['8.8.8.8', '101.099.098.000']
valid_ips: ['192.168.1.1', '8.8.8.8', '101.099.098.000']
invalid_ips: []
private_ips: ['192.168.1.1']
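As far as the snippet shows, the censored pattern is compiled but never applied before the string is split, which would explain why only the uncensored addresses appear in the results. One way to fold the censored forms back in is to normalize the separators first and then split. The sketch below is an assumption rather than the author's method: CENSORED_DOT and uncensor are hypothetical names, and the lookarounds restrict the replacement to separators that sit between two digits.

```python
import re

# A censored or spaced dot that sits directly between two digits.
CENSORED_DOT = re.compile(
    r"(?<=\d)[ (\[]?(?:\.|dot)[ )\]]?(?=\d)",
    re.IGNORECASE
)


def uncensor(text):
    """Replace censored separators such as [.] or (dot) with a plain dot."""
    return CENSORED_DOT.sub(".", text)


print(uncensor("192.168.1[.]1 or 192.168.1(dot)1 or 192 .168 .1 .1"))
# 192.168.1.1 or 192.168.1.1 or 192.168.1.1
```

The normalized string could then be passed through split() and the cleanup loop as before. Note that this also joins unrelated text such as "1. 2" into "1.2", so it is best applied only to input where that side effect is acceptable.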