You could pattern match on the last substring to distinguish known domains from file extensions. It's not too difficult to enumerate at least the basic top-level domains like .com, .gov, .org, etc. If you are familiar with regular expressions, you can match on a pattern like '\.com$' (escape the dot so it matches a literal period rather than any character). Otherwise,...
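As a rough illustration of the suffix-matching idea above, here is a minimal Java sketch. The class name `SuffixCheck`, both method names, and the two short lists of suffixes are assumptions made for this example; a real list of top-level domains is much longer and would need to be maintained separately.

```java
import java.util.regex.Pattern;

public class SuffixCheck {
    // Hypothetical, deliberately incomplete list of top-level domains.
    // The dot is escaped so it matches a literal '.', and '$' anchors the match at the end.
    private static final Pattern KNOWN_TLD =
            Pattern.compile("\\.(com|gov|org|net|edu)$", Pattern.CASE_INSENSITIVE);

    // Hypothetical list of file extensions to tell apart from domains.
    private static final Pattern KNOWN_FILE_EXT =
            Pattern.compile("\\.(html?|php|pdf|png|jpe?g)$", Pattern.CASE_INSENSITIVE);

    // True if the string ends with one of the listed top-level domains.
    public static boolean endsWithKnownTld(String s) {
        return KNOWN_TLD.matcher(s).find();
    }

    // True if the string ends with one of the listed file extensions.
    public static boolean endsWithKnownFileExtension(String s) {
        return KNOWN_FILE_EXT.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = {"www.somesite.com", "report.pdf", "notes.txt"};
        for (String s : samples) {
            System.out.println(s + " -> tld=" + endsWithKnownTld(s)
                    + ", file=" + endsWithKnownFileExtension(s));
        }
    }
}
```

Anything that matches neither list (such as `notes.txt` here) stays ambiguous, so this check is only as good as the lists you enumerate.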
Here is some executable code based on your fragments: import java.net.MalformedURLException; import java.net.URL; public class URLExample { public static void main(String[] args) throws MalformedURLException { printURLInformation(new URL("https://www.somesite.com/?param1=val1")); printURLInformation(new URL("https://www.somesite.com?param1=val1")); } private static void printURLInformation(URL url) { System.out.println(url); System.out.println("Path:\t" + url.getPath()); System.out.println("File:\t" + url.getFile());...
php,html,url-parsing,input-sanitization
For anyone looking for an answer - I posted a related (more specific) question which solved the problem: PHP - remove words (http|https|www|.com|.net) from string that do not start with specific words