uniqueWords ( anyText ; minimWordLength ; caseSensitive ; shouldSkipNumbers )

Turns text into a value list of its unique words of a given minimum word length, with or without numbers.

Average rating: 3.9 (42 votes)

Jan Bardi
http://users.skynet.be/fa022720/_computing/Welcome%20to%20my%20computer%20hobby%20page.html

Sample input:

("The Proof 2008, the proof - 2008",0,0,0)

("The Proof 2008, the proof - 2008",4,0,0)

("The Proof 2008, the proof - 2008",0,1,0)

("The Proof 2008, the proof - 2008",0,0,1)

Sample output:

the¶proof¶2008¶

proof¶2008¶

The¶Proof¶2008¶the¶proof¶

the¶proof¶
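
The four sample calls can be cross-checked with a small Python model of the function's behaviour. This is only a sketch, not the FileMaker code: it uses a loop and a set instead of recursion and "substitute", the name unique_words and the regular expression are assumptions made for illustration, and the tokenizer only approximates FileMaker's word separators (it keeps 11,8 and 23:59:59 together and splits 9-11-2001).

```python
import re

# Assumption: a word is a run of letters/digits, optionally joined by
# '.', ',' or ':' when no space follows (so "11,8" stays one token).
WORD = re.compile(r"[^\W_]+(?:[.,:][^\W_]+)*")

# Mirrors firstWord = GetAsText ( GetAsNumber ( firstWord ) ) for plain
# integers: digits only, no leading zeros.
PLAIN_NUMBER = re.compile(r"0|[1-9][0-9]*")


def unique_words(any_text, minim_word_length=0, case_sensitive=0, should_skip_numbers=0):
    """Return the unique words of any_text as a ¶-terminated list."""
    text = any_text if case_sensitive else any_text.lower()
    seen = set()
    kept = []
    for word in WORD.findall(text):
        if word in seen:
            continue  # duplicate: already handled at its first occurrence
        seen.add(word)
        if len(word) < minim_word_length:
            continue  # shorter than the requested minimum
        if should_skip_numbers and PLAIN_NUMBER.fullmatch(word):
            continue  # unpunctuated, standalone number
        kept.append(word)
    return "".join(word + "¶" for word in kept)


sample = "The Proof 2008, the proof - 2008"
for args in [(0, 0, 0), (4, 0, 0), (0, 1, 0), (0, 0, 1)]:
    print(args, unique_words(sample, *args))
```

Like the samples above, the model keeps the trailing ¶ after the last word.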

Function definition: (Copy & paste into FileMaker's Edit Custom Function window)

Case (

 caseSensitive < 2 ; uniqueWords (  (Substitute (  TextFormatRemove ( anyText )  ; [ "	" ; "¶" ]; [ " " ; "¶" ]; [ ". " ; "¶" ]; [ ", " ; "¶" ]; [ ": " ; "¶" ]; [ "•" ; "¶" ]; [ "'" ; "¶" ]; [ "§" ; "¶" ]; [ "!" ; "¶" ]; [ "\"" ; "¶" ]; [ "#" ; "¶" ]; [ "$" ; "¶" ]; [ "%" ; "¶" ]; [ "&" ; "¶" ]; [ "(" ; "¶" ]; [ ")" ; "¶" ]; [ "*" ; "¶" ]; [ "+" ; "¶" ]; [ "-" ; "¶" ]; [ "/" ; "¶" ]; [ "|" ; "¶" ]; [ ";" ; "¶" ]; [ "<" ; "¶" ]; [ "=" ; "¶" ]; [ ">" ; "¶" ]; [ "?" ; "¶" ]; [ "@" ; "¶" ]; [ "[" ; "¶" ]; [ "\\" ; "¶" ]; [ "]" ; "¶" ]; [ "^" ; "¶" ]; [ "_" ; "¶" ]; [ "`" ; "¶" ]; [ "{" ; "¶" ]; [ "|" ; "¶" ]; [ "}" ; "¶" ]; [ "~" ; "¶" ]; [ "£" ; "¶" ]; [ "¿" ; "¶" ]; [ "®" ; "¶" ]; [ "¬" ; "¶" ]; [ "¡" ; "¶" ]; [ "«" ; "¶" ]; [ "»" ; "¶" ]; [ "©" ; "¶" ]; [ "¢" ; "¶" ]; [ "¥" ; "¶" ]; [ "¯" ; "¶" ]; [ "´" ; "¶" ]; [ "±" ; "¶" ]; [ "" ; "¶" ]; [ "§" ; "¶" ]; [ "÷" ; "¶" ]; [ "¸" ; "¶" ]; [ "°" ; "¶" ]; [ "¨" ; "¶" ]; [ "·" ; "¶" ]; [ "’" ; "¶" ]; [ "’" ; "¶" ]; [ ":¶" ; "¶" ]; [ ".¶" ; "¶" ]; [ ",¶" ; "¶" ]))  & "¶" ; minimWordLength ; caseSensitive + 2 ; shouldSkipNumbers ) ;

caseSensitive = 2; uniqueWords ( Lower ( anyText) ; minimWordLength ; 3 ; shouldSkipNumbers ); 

WordCount ( anyText ) = 0 ; "" ;

Let ( [ 
firstValue = GetValue ( anyText ; 1 ) ; 
firstWordCount = WordCount ( firstValue );
firstWord = LeftWords ( firstValue ; 1);
isItANumber = If ( firstWord = GetAsText ( GetAsNumber ( firstWord ) ) ; 1 ; 0 );
filteredRest = Substitute ( "¶" & RightValues ( anyText ; ValueCount ( anyText ) - 1) ; [ "¶" & firstWord & "¶" ; "¶" ] ; [ "¶" & firstValue & "¶" ; "¶" ] );
otherValues = uniqueWords ( Right ( filteredRest ; Length ( filteredRest ) - 1 ) ; minimWordLength ; 3 ; shouldSkipNumbers )] ; 
If ( (firstWordCount > 0)   and (Length ( firstWord ) ≥ minimWordLength) and ( not (isItANumber and shouldSkipNumbers)) ; firstWord & "¶" ; "" ) & otherValues)

)

/*

QUICK & DIRTY WORD DEDUPLICATION IN 1 STANDALONE FUNCTION

This function recursively parses any text input into a return-separated, deduplicated list of its words. Minimum word length and case sensitivity can be specified, as well as how to treat numbers in the text.

Uses: uniqueWords was created as part of an effort to automatically calculate an instant content correlation index between text records. Other obvious uses come to mind: (facilitating) glossary index list creation, automatic keyword list generation, etc.

With this updated version, texts of more than 10000 words can in principle be processed, as long as they contain fewer than 10000 DIFFERENT words. Calculations on texts of more than a few thousand words cause noticeable stalling on most systems.

DETAILS & HOWTO:

UniqueWords uses recursion to call FileMaker Pro's "substitute" function for every different word and number in the input text. This has 3 important consequences:

1) The higher the number of DIFFERENT words (and numbers!), the slower the calculation.
2) While it's usable in principle on texts with several tens of thousands of words and numbers, it fails when the number of DIFFERENT words and numbers approaches or exceeds 10000 - FMPro's recursion limit is 10000 iterations.
3) Deduplication is decent overall but can be incomplete due to limitations in FMPro's "substitute" function - for 100% deduplication, replace the "substitute" calls with Peter Wagemans' "substituteCompletely" (refer to its detail page on this site for more info on FMPro's "substitute" limitations). Alternatively, nest the uniqueWords call inside a deduplicating sort function for a 100% deduplicated, sorted value list of unique words (search this site with the keyword "sort" for several versions).

The function first separates the words of the input text into a return-delimited word list, based on the word separators listed in the multiple "substitute" call in the first "Case" block. Your browser may not correctly render all word separator characters in the HTML of this web page. If you run into trouble with that, download the raw text version via http://users.skynet.be/fa022720/_computing/txt-downloads/uniquewords.txt on my site.

Parameter use:

1) Set minimWordLength to, for instance, 3 to exclude the often generic or meaningless words of 1 and 2 letters (a, I, it, on, of, up,...). Set it to 0 or 1 to include everything.
2) Set caseSensitive to 1 to return "Jobs" and "jobs" as 2 different words. If caseSensitive is set to 0, case is ignored in the deduplication and all unique words are returned in lowercase.
3) Set shouldSkipNumbers to 1 to filter out unpunctuated, standalone numbers like 118 or 5 (but not 11,8 or 5th or time notations like 23:59:59); set it to 0 to include them. The function parses date notations like 9-11-2001 into the numbers 9, 11 and 2001 before considering the shouldSkipNumbers setting.

*/
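
The incomplete deduplication mentioned in consequence 3 can be illustrated in Python. This is a sketch under an assumption the page does not spell out: that a single "substitute" pass, like Python's str.replace, scans left to right and replaces only non-overlapping matches, so a match that shares its ¶ delimiter with the previous match is skipped. The helper substitute_completely is a hypothetical stand-in for the idea of repeating the replacement until nothing changes; it is not Peter Wagemans' actual function.

```python
def substitute_once(text, search, replacement):
    # One left-to-right pass; characters already consumed by a match
    # are never re-examined (str.replace used as a stand-in).
    return text.replace(search, replacement)


def substitute_completely(text, search, replacement):
    # Hypothetical helper: repeat the pass until no occurrence is left.
    if search in replacement:
        # Also rejects an empty search string; either case would never end.
        raise ValueError("replacement must not contain the search string")
    while search in text:
        text = text.replace(search, replacement)
    return text


# A list in which the word "the" repeats four times in a row.
rest = "¶the¶the¶the¶the¶"

one_pass = substitute_once(rest, "¶the¶", "¶")
two_passes = substitute_once(one_pass, "¶the¶", "¶")
complete = substitute_completely(rest, "¶the¶", "¶")

print(repr(one_pass))    # two copies survive the first pass
print(repr(two_passes))  # one copy still survives a second pass
print(repr(complete))    # every copy is gone
```

Adjacent matches share a ¶, so each pass removes only every other copy. uniqueWords applies two replacement pairs per word, which is why a run of four or more identical consecutive values can leave a duplicate behind in this model.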

Comments

		Mike Dec 7, 2009
This calc leaves a trailing return. How do I get rid of that?

		Mike Jul 9, 2013
That is just what I'm looking for. You've saved me a lot of work for which I'm very grateful!!

Note: these functions are not guaranteed or supported by BrianDunning.com. Please contact the individual developer with any questions or problems.
