java - Extracting pairs of words using String.split()


Keywords:java 


Question: 

Given:

String input = "one two three four five six seven";

Is there a regex that works with String.split() to grab (up to) two words at a time, such that:

String[] pairs = input.split("some regex");
System.out.println(Arrays.toString(pairs));

results in this:

[one two, three four, five six, seven]

This question is about the split regex. It is not about "finding a work-around" or other "making it work in another way" solutions.


4 Answers: 

Currently (up to and including Java 10) this is possible with split(), but don't use this approach in real-world code. It appears to rely on a bug: look-behind in Java must have an obvious maximum length, yet this solution uses \w+, which has no such limit, and somehow still works. If that is indeed a bug and it gets fixed in a later release, this solution will stop working.

Instead, use the Pattern and Matcher classes with a regex like \w+\s+\w+. Besides being safer, this also spares whoever inherits the code from maintenance hell (remember: "Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live").
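
A minimal sketch of the Pattern/Matcher approach recommended above. The class and method names (WordPairs, extractPairs) are made up for illustration, and the optional second word in the regex is an assumption added here so that a trailing unpaired word such as "seven" is still returned:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordPairs {
    // One word, optionally followed by whitespace and a second word.
    // The optional part is an assumption so a trailing single word is kept.
    private static final Pattern PAIR = Pattern.compile("\\w+(?:\\s+\\w+)?");

    // Hypothetical helper: collects each match instead of splitting around separators.
    static List<String> extractPairs(String input) {
        List<String> pairs = new ArrayList<>();
        Matcher matcher = PAIR.matcher(input);
        while (matcher.find()) {
            pairs.add(matcher.group());
        }
        return pairs;
    }

    public static void main(String[] args) {
        // prints [one two, three four, five six, seven]
        System.out.println(extractPairs("one two three four five six seven"));
    }
}
```

This does not answer the split() question itself, but it does not depend on any look-behind behaviour.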


Is this what you are looking for?
(You can replace \\w with \\S to include all non-space characters, but for this example I will keep \\w, since a regex with \\w\\s is easier to read than one with \\S\\s.)

String input = "one two three four five six seven";
String[] pairs = input.split("(?<!\\G\\w+)\\s");
System.out.println(Arrays.toString(pairs));

output:

[one two, three four, five six, seven]

\G matches at the end of the previous match, and (?<!regex) is a negative look-behind.

In split we are trying to

  1. find spaces -> \\s
  2. that are not preceded -> (?<!negativeLookBehind)
  3. by some word -> \\w+
  4. with the previously matched (space) -> \\G
  5. before it -> \\G\\w+.

The only thing that confused me at first was how it would work for the first space, since we want that space to be ignored. The important point is that at the start \\G matches the start of the String (^).

So before the first iteration, the regex in the negative look-behind behaves like (?<!^\\w+), and since the first space does have ^\\w+ before it, it cannot be a match for split. The next space does not have this problem, so it is matched, and information about it (such as its position in the input String) is stored in \\G and used later in the next negative look-behind.

So for the 3rd space, the regex checks whether the previously matched space \\G and a word \\w+ come right before it. Since this test succeeds, the negative look-behind rejects it and this space is not matched. The 4th space does not have this problem, because the space before it is not the one stored in \\G (it has a different position in the input String).
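
To see the behaviour of \\G in isolation, here is a small sketch that uses Matcher.find() rather than split(). The class and method names are hypothetical, and the input "123a45" is just an illustrative value. It only demonstrates that \\G matches at the start of the input on the first attempt and afterwards only at the position where the previous match ended:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastMatchAnchorDemo {
    // Collects every match of \G\d: a digit that starts exactly
    // where the previous match ended (or at index 0 on the first attempt).
    static List<String> contiguousDigits(String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile("\\G\\d").matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // "1", "2" and "3" each start where the previous match ended.
        // The 'a' breaks the chain, so "4" and "5" are never matched.
        // prints [1, 2, 3]
        System.out.println(contiguousDigits("123a45"));
    }
}
```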


Also, if someone would like to split on, let's say, every 3rd space, you can use this form (based on @maybeWeCouldStealAVan's answer, which was deleted when I posted this fragment of the answer):

input.split("(?<=\\G\\w{1,100}\\s\\w{1,100}\\s\\w{1,100})\\s")

Instead of 100 you can use a bigger value, as long as it is at least the length of the longest word in the String.
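
As a rough sketch, the bounded look-behind above can be generated for any group size. The names NthSpaceSplitter, buildRegex and splitIntoGroups are made up for illustration; for 3 words and a maximum length of 100, buildRegex is meant to produce the same regex as the one shown above:

```java
import java.util.Arrays;

class NthSpaceSplitter {
    // Builds (?<=\G WORD \s WORD ...)\s with wordsPerGroup bounded words
    // inside the look-behind, where WORD is \w{1,maxWordLength}.
    static String buildRegex(int wordsPerGroup, int maxWordLength) {
        String word = "\\w{1," + maxWordLength + "}";
        StringBuilder regex = new StringBuilder("(?<=\\G").append(word);
        for (int i = 1; i < wordsPerGroup; i++) {
            regex.append("\\s").append(word);
        }
        return regex.append(")\\s").toString();
    }

    // Hypothetical convenience wrapper around String.split().
    static String[] splitIntoGroups(String input, int wordsPerGroup, int maxWordLength) {
        return input.split(buildRegex(wordsPerGroup, maxWordLength));
    }

    public static void main(String[] args) {
        String input = "one two three four five six seven";
        System.out.println(buildRegex(3, 100));
        System.out.println(Arrays.toString(splitIntoGroups(input, 3, 100)));
    }
}
```

The word pattern is repeated literally instead of being wrapped in a quantified group, to stay as close as possible to the hand-written regex.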


I just noticed that we can also use + instead of {1,maxWordLength} if we want to split on every odd-numbered occurrence, for example every 3rd, 5th or 7th:

String data = "0,0,1,2,4,5,3,4,6,1,3,3,4,5,1,1";
String[] array = data.split("(?<=\\G\\d+,\\d+,\\d+,\\d+,\\d+),");//every 5th comma 
 

This will work, but maximum word length needs to be set in advance:

String input = "one two three four five six seven eight nine ten eleven";
String[] pairs = input.split("(?<=\\G\\S{1,30}\\s\\S{1,30})\\s");
System.out.println(Arrays.toString(pairs));

I like Pshemo's answer better, being shorter and usable on arbitrary word lengths, but this (as @Pshemo pointed out) has the advantage of being adaptable to groups of more than 2 words.

 

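This worked for me: (\w+\s*){2}\K\s (note that \K is a PCRE-style feature; java.util.regex does not support it, so this regex cannot be passed to String.split() in Java)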

  • a required word followed by an optional space (\w+\s*)
  • repeated two times {2}
  • ignore previously matched characters \K
  • the required space \s
 

You can try this:

[a-z]+\s[a-z]+

Updated:

([a-z]+\s[a-z]+)|[a-z]+


Updated:

 String pattern = "([a-z]+\\s[a-z]+)|[a-z]+";
 String input = "one two three four five six seven";

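 Pattern splitter = Pattern.compile(pattern);
 // The pattern matches the word pairs themselves, so iterate over the matches;
 // split() would throw the pairs away and return only the text between them.
 Matcher matcher = splitter.matcher(input);

 while (matcher.find()) {
     System.out.println("Output = \"" + matcher.group() + "\"");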