outline of noteset #6 the ascii character encoding table...


Post on 20-Oct-2020


  • 1

    COMP 110 Note Set #6: Characters, Strings, Tokens, Tokenizing (or Splitting) Strings

    Outline of Noteset #6
    • Summary/Review of Algorithms from earlier: accumulator, min/max, linear search
    • Program Testing: compile-time checks, run-time checks, logical checks, performance checks
    • Text: char and String datatypes
    • The ASCII character encoding table.
    • The char datatype and character constants.
    • The String datatype and string constants.
    • Converting Strings to numbers with Integer.parseInt(), Double.parseDouble(), etc.
    • Converting numbers to Strings with Integer.toString(), Double.toString(), etc.
    • Using the length() method for String (note that this is a method for String but a property for arrays).
    • Comparing String content using equals().
    • Comparing String references using ==
    • Comparing String ordering using compareTo()
    • Examining individual characters in a String with charAt()
    • Extracting substrings with substring()
    • Searching for characters in a String with indexOf()
    • Tokenizing Strings with StringTokenizer class
    • Tokenizing Strings with split() method.
    • Other String formatting examples
    • Command Line Arguments

  • 2

    Summary: Algorithms Based on Loops

    Here are a few of the algorithms discussed recently that you should memorize as fundamental “patterns” or “building blocks” that will come up over and over again in future problem solutions (examples use while loops, but for loops are equivalent):

    Accumulator

        int sum = 0;
        int[] data = ...;
        int i = 0;
        while (i < data.length) {
            sum = sum + data[i];
            i++;
        }

    Min/Max

        int[] data = ...;
        int mi = 0;   // index of max element (initially 0)
        int i = 1;    // index of current element to compare
        while (i < data.length) {
            if (data[i] > data[mi])
                mi = i;   // update max index
            i++;
        }
        int max = data[mi];

    Linear Search

        int[] data = ...;
        int value = ...;
        boolean found = false;
        int i = 0;
        while (i
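The linear search listing is cut off in this transcript. As a rough illustration of the pattern only (not the original listing), here is a minimal sketch. The class name LinearSearchSketch, the method name linearSearch, and the choice to return an index (or -1) instead of setting a boolean found flag are all assumptions made for this example.

```java
class LinearSearchSketch {
    // Hypothetical helper: returns the index of the first element equal
    // to value, or -1 if no element matches.
    static int linearSearch(int[] data, int value) {
        int i = 0;
        while (i < data.length) {
            if (data[i] == value) {
                return i;   // found: stop at the first match
            }
            i++;
        }
        return -1;          // examined every element without a match
    }

    public static void main(String[] args) {
        int[] data = { 3, 18, 22, 34, 19 };
        System.out.println(linearSearch(data, 22));  // prints 2
        System.out.println(linearSearch(data, 7));   // prints -1
    }
}
```

Returning the index makes the result usable both as a "found" test (index >= 0) and as a position for later processing.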

  • 3

    Getting it Right: How do you know when your program is correct?

    As you develop your programs, your program passes through several gates or checkpoints. These checks get progressively harder to pass. As your program passes them, you gain increased confidence that your program is correct. But as you now realize, the compiler’s (and even the interpreter’s) ability to find errors is limited. A professional programmer must adopt a skeptical attitude toward his or her software, i.e., you must constantly look for problems and work hard to convince yourself that your program is correct.

    The first hurdle: the compiler

    The compiler detects the most fundamental kind of programming error, usually thought of as syntax errors: mismatched parens and braces, spelling mistakes, etc. Errors caught by the compiler are said to be detected at compile time. In other words, these are errors found in your program before it ever even runs. The compiler won’t let you run your program until all compiler errors are removed.

    The second hurdle: the interpreter

    Even though the compiler says that your program is free of compiler errors, this does not alone guarantee that the program is correct. The program must be tested by executing it and seeing what happens. If your program runs into an unexpected situation, the JVM (Java Virtual Machine) environment will usually stop your program prematurely and print out an error message. The “unexpected situation” that an incorrect program creates is usually called an exception. You must decipher the message, correct the source code, recompile, and reexecute your program to verify that the exception has been eliminated. Errors caught by exceptions are said to be detected at run time.

    The final hurdle: subjecting the output to your own judgment

    The hardest of all errors to detect are subtle (or not-so-subtle) logic errors. These don’t cause compiler errors and may not even cause runtime exceptions. Only careful examination of a program’s inputs and outputs can detect such errors. The moral of the story is to adopt a very skeptical attitude when testing your programs.

    “Future” hurdles

    Actually, there are other measures of quality for programs that you will see in later classes. Already in this class, you are encouraged to write programs that are “general”. In COMP 182/282, you will learn how to solve problems efficiently, i.e., solve the problem in a way that minimizes the use of time and space (memory).

    Example: compute the sum of integers a and b

    Solution #1: c = a + b;

    Solution #2: c = 0; for (int i=0; i

  • 4

    Characters

    The ASCII Character Encoding Standard

    As with all data stored in digital form, textual information is ultimately encoded as numbers within a computer program. The basis of this process is to create a character encoding system. The encoding assigns a number to every character that can appear in printable text (or text that can be entered from a keyboard). The encoding usually includes codes for a few special characters that are not directly displayable. The best-known encoding for English is called ASCII (American Standard Code for Information Interchange), officially ANSI Standard X3.4-1968. See example table at http://asciitable.com.

    A portion of the ASCII character encoding standard is repeated below:

        Decimal  Octal  Hex  Binary    Value
        -------  -----  ---  --------  -----
        ...
        007      007    007  00000111  BEL (Bell)
        008      010    008  00001000  BS (Backspace)
        009      011    009  00001001  HT (Horizontal Tab)
        010      012    00A  00001010  LF (Line Feed)
        011      013    00B  00001011  VT (Vertical Tab)
        012      014    00C  00001100  FF (Form Feed)
        013      015    00D  00001101  CR (Carriage Return)
        ...
        032      040    020  00100000  SP (Space)
        033      041    021  00100001  !
        034      042    022  00100010  "
        035      043    023  00100011  #
        036      044    024  00100100  $
        037      045    025  00100101  %
        038      046    026  00100110  &
        ...
        048      060    030  00110000  0
        049      061    031  00110001  1
        050      062    032  00110010  2
        ...
        065      101    041  01000001  A
        066      102    042  01000010  B
        067      103    043  01000011  C
        068      104    044  01000100  D
        069      105    045  01000101  E
        ...
        097      141    061  01100001  a
        098      142    062  01100010  b
        099      143    063  01100011  c
        100      144    064  01100100  d
        101      145    065  01100101  e
        ...
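Rows of such a table can be regenerated from Java itself, which is one way to check an entry. The following is only a sketch: the class name AsciiRowSketch and the helper row() are made up for this example, and the column widths simply imitate the table in these notes.

```java
class AsciiRowSketch {
    // Hypothetical helper: formats one table row (decimal, octal, hex,
    // 8-bit binary, character) for a printable character c.
    static String row(char c) {
        int code = c;  // a char widens to its numerical code
        // Integer.toBinaryString drops leading zeros, so pad to 8 digits.
        String binary = String.format("%8s", Integer.toBinaryString(code)).replace(' ', '0');
        return String.format("%03d %03o %03X %s %c", code, code, code, binary, c);
    }

    public static void main(String[] args) {
        System.out.println(row('A'));  // prints 065 101 041 01000001 A
        System.out.println(row('a'));  // prints 097 141 061 01100001 a
        System.out.println(row('0'));  // prints 048 060 030 00110000 0
    }
}
```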

  • 5

    It’s hard to remember many numerical codes, so Java provides a set of mnemonics called character constants. Character constants (for printable characters) are created by writing the name of the single character surrounded by single quotes.

        'a' encoded by the number 97
        'b' encoded by the number 98
        'c' encoded by the number 99
        ...
        'A' encoded by the number 65
        'B' encoded by the number 66
        'C' encoded by the number 67
        ...

    An interesting usage of the encoding is for characters that represent numerical digits.

        '0' encoded by the number 48
        '1' encoded by the number 49
        '2' encoded by the number 50
        ...
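Because the digit characters occupy consecutive codes starting at 48, subtracting the code for '0' from a digit character yields the digit's numerical value. A minimal sketch of this idea follows; the class name DigitSketch and the helper digitValue() are assumptions for illustration, and the helper assumes its argument really is one of '0' through '9'.

```java
class DigitSketch {
    // Hypothetical helper: converts a digit character to its numerical
    // value. Assumes c is between '0' and '9'.
    static int digitValue(char c) {
        return c - '0';   // e.g. '7' is code 55, '0' is code 48, 55 - 48 = 7
    }

    public static void main(String[] args) {
        System.out.println((int) '7');        // prints 55 (the character code)
        System.out.println(digitValue('7'));  // prints 7  (the digit's value)
    }
}
```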

    Note: double quotes are used for String constants; double and single quotes are not interchangeable.
    • 'a' is a character constant for a single character
    • "a" is a String constant for a String of length 1 (a String containing one character).

    A single character constant is a primitive value stored in a char variable. A String containing one character is similar to an array with one element. It is stored differently from a single primitive char.

    The char Datatype

    To further support operations on characters, Java provides a data type called char (an abbreviation of the word “character”, just like “int” is an abbreviation of the word “integer”). This datatype can be used to declare variables of type char. Character constants can be assigned to variables of type char.

    char c; c = 'a';

    The assignment statement stores or assigns the numerical code for the character 'a' into the char variable c. If you were to look at the value actually stored in the variable c, you would find the number 97. To verify this, you can use casting, i.e., cast the character constant to an int and print out the int value.

        char c;
        c = 'a';
        int x;
        x = (int) c;
        System.out.println(x);

    Or to demonstrate the same principle in one line, you can just write:

        System.out.println( (int)'a');

    Now look at the following program and try to figure out what it prints to the display:

        public class CharTest {
            public static void main(String[] args) {
                System.out.println('a');        // print character 'a'
                System.out.println((int)'a');   // print code for 'a'
            }
        }

    Output is

        a
        97

  • 6

    Special Characters Usually, character constants are written as a single char inside single quotes. There are a few special character constants that begin with the backslash character ‘\’. Here are a few of the most common ones:

        '\n'    the newline character
        '\t'    the tab character
        '\"'    the double quote character (single quote + backslash + double quote + single quote)

    These are useful for detailed formatting of information output from your program to the display. They are frequently used by embedding them in the middle of a longer String constant containing other text. For example, to print the text

        Value of "x" is 3

    to the display, including the double quotation marks, use

        int x = 3;
        System.out.println("Value of \"x\" is " + x);

    The special character for double quote is necessary to prevent the compiler from prematurely terminating the String.

    Strings

    A String is a series of characters that have been linked together for convenience, to represent text that contains more than one character. Informally, you can think of a String as being implemented as an array of characters.

    Caution for C/C++ programmers: Java String is not the same as C/C++ string. Conceptually a String is similar to an array of characters, but you cannot treat it as such syntactically in a Java program. In C/C++ a string really is an array of characters.

    C/C++ Only (not permitted in Java)

        char str1[] = "this is a string";
        char c = str1[3];         // stores character 's' into variable c
        int x = strlen( str1 );   // calculates number of characters in string str1

    In Java, a String must be manipulated using object-style syntax. For example, to obtain a single character from a String, the method “charAt()” must be applied to the String, and the method “length()” must be used to obtain the number of characters in the String:

    Java (using Object Style Syntax)

        String str1 = "this is a string";
        char c = str1.charAt(3);  // you cannot use the notation str1[3]
                                  // to get the character at position 3 in Java
        int x = str1.length();

    String Constants

    A String constant is a series of characters inside double quotes.

        "a String constant"
        "a String constant that contains \n a newline character"

    The backslash can be used to put double quotes into the inside of the String

        "another \"String\" constant"

    String Variables

    Variables can be created to refer to Strings, in a very similar way that variables can be created to refer to arrays. That is, variables to refer to Strings are also reference variables. The statement

    String s; creates a variable s of type “reference to String”. As with arrays, the string doesn’t yet exist. It’s very common to assign String constants to a String reference variable.

    String s = "this is a string";

  • 7

    Many useful methods return Strings as their result. Such a result can also be assigned to a String.

        int x = 22;
        String s = Integer.toString(x);

    String Constructors A String constructor is a special predefined method String(). It is always used in conjunction with the “new” operator (which you already used to create arrays). It takes one parameter of type String (usually a String constant or expression, but other expressions are possible). The result is a reference to a newly created unique String in memory.

        String s = "abcde";               // assign a String constant to reference variable s
        String t = new String("abcde");   // assign a unique String to reference variable t

    Class String and Predefined String Methods

    The String in Java is actually defined by a class. Non-primitive data types are defined in Java by writing a class definition. The Java language has predefined many useful classes which are organized into packages. The String class is defined inside package “java.lang”. Other packages such as java.io and java.util will be used shortly. Normally, each package your program uses must be imported by your program

        import java.io.*;

    or

        import java.util.*;

    This statement “imports” all the classes inside package “java.io” or package “java.util”. The package “java.lang” is special because it is automatically imported into every Java program. So class String is automatically available to every Java program without importing its definition.

    There are many predefined methods (operations) that can be used to perform useful operations on Strings.

    int length(): The length of a String

    The length of a String indicates how many characters it contains. This is similar to the concept of length for arrays. But there is an important difference. For arrays, length is a property. For Strings, length is a method (an operation or a function).

        int[] data = new int[10];
        int x = data.length;             // no parens
        String s = "this is a string";
        int y = s.length();              // with parens

    Comparing Strings

    There are at least three ways to compare two Strings
    • ==: testing to see if two Strings are identically the same String
    • boolean equals(): comparing two Strings for equality (same content)
    • int compareTo(): comparing two Strings for relative ordering

    Why can’t we just use “==” when comparing two Strings? As a general rule, the equality comparison operator “==” should only be used with numbers. Two String values should be compared using the String equals() method.

    DO use

        String x = ...;
        String y = ...;
        if (x.equals(y)) { ... }   // recommended

    DON’T use

        if (x==y) { ... }          // NOT recommended

    It may be appropriate in some cases to use “==”, but usually the result is not what was intended.

  • 8

        String x = "abcde";
        String y = "abcde";
        String z = "AbCdE";
        String w = x;

        x == x    true
        x == y    true (identical String constants share a single String object)
        x == z    false
        x == w    true

    • Case “x == x” is trivially true
    • Case “x == z” is trivially false
    • Case “x == w” is interesting, because it illustrates a property about reference variables. Two reference variables can easily be made to refer to the same value in memory.

    The expression “x==y” is also interesting. If the same String constant appears more than once in a program, only one String object is created for it, and that object is reused wherever the constant appears. So “x == y” is true. [The Java Language Specification requires this behavior: String constants are “interned”, so identical constants always refer to the same String object. The guarantee does not extend to Strings built while the program runs, which is one more reason to prefer equals().] Here’s a slightly different case with a subtle difference:

        String x = new String("abcde");
        String y = new String("abcde");
        String z = new String("AbCdE");
        String w = x;

        x == x    true
        x == y    false
        x == z    false
        x == w    true

    In the case “x == y”, the value is always false. The String constructor always creates unique copies of its parameter. The key difference is the use of the keyword “new” with the String constructor. This forces a physically different String object to be created in memory which happens to contain the same characters as the original.

        x == y         false   // because Strings are physically distinct in memory
        x.equals(y)    true    // because Strings contain same characters
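The two cases can be checked directly. The sketch below is only an illustration; the class name StringIdentitySketch and the helper sameObject() are invented for this example.

```java
class StringIdentitySketch {
    // Hypothetical helper: true only when both variables refer to the
    // very same String object in memory.
    static boolean sameObject(String p, String q) {
        return p == q;
    }

    public static void main(String[] args) {
        String x = "abcde";              // String constant
        String y = "abcde";              // same constant, same object
        String u = new String("abcde");  // new, physically distinct object
        String w = u;                    // second reference to u's object

        System.out.println(sameObject(x, y));  // prints true
        System.out.println(sameObject(x, u));  // prints false
        System.out.println(sameObject(u, w));  // prints true
        System.out.println(x.equals(u));       // prints true (same characters)
    }
}
```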

    [Figure: reference variables x, y, z, and w and the String objects “abcde” and “AbCdE” that they refer to.]

  • 9

    If we have two String values to compare, we usually don’t care if the two Strings are physically the same String or not. We only want to know if they have the same content or not. When testing for equivalent content, we want to use “x.equals(y)”.

    Note: there is no notequals() method. To test for inequality, use !x.equals(y)

        String x = ...;
        String y = ...;
        if (!x.equals(y)) { ... }

    Equality Comparisons that Ignore Case

    It might be useful to compare Strings without regard for upper case – lower case. Not surprisingly, there’s a method for this: equalsIgnoreCase()

        "abcde".equals("abcde")              true
        "abcde".equals("AbCdE")              false
        "abcde".equalsIgnoreCase("AbCdE")    true

    What about <, <=, >, and >=? No! Comparison operations are designed for use with primitive values (int, double). They cannot be used at all with String values.

    Ordered Comparisons: compareTo()

    equals() works as intended, but since its result is boolean, it can only be used to answer
    • “yes (true), the two Strings are equal” or
    • “no (false), the two Strings are not equal”

    In some applications we might want to know more than just simple equality or inequality. We might want to know if one String comes before or after another String alphabetically or lexicographically. The compareTo() method allows us to compare two String values lexicographically. Its result is an int whose value indicates the order of the two Strings.

    Remember the ASCII character encoding discussed earlier. The numerical values of all the character codes can be used to put Strings into a specific order.

    Normal alphanumeric or lexicographic comparison:

        “a” comes before “b”
        “aa” comes before “ab”
        ...

    When one String is the beginning (a prefix) of another, the shorter String comes first

        “a” comes before “aa”
        ...

    Digits start at 48, upper case letters at 65, lower case letters at 97

        “0” comes before “A”
        “A” comes before “a”
        ...
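These ordering rules can be verified with compareTo(), whose int result is negative, zero, or positive. The following is a small sketch under that reading; the class name OrderingSketch and the helper order() are hypothetical names used only here.

```java
class OrderingSketch {
    // Hypothetical helper: describes where a falls relative to b,
    // looking only at the sign of compareTo()'s result.
    static String order(String a, String b) {
        int result = a.compareTo(b);
        if (result < 0) {
            return "before";
        } else if (result > 0) {
            return "after";
        }
        return "same";
    }

    public static void main(String[] args) {
        System.out.println(order("a", "b"));    // prints before
        System.out.println(order("a", "aa"));   // prints before (prefix comes first)
        System.out.println(order("0", "A"));    // prints before (48 < 65)
        System.out.println(order("A", "a"));    // prints before (65 < 97)
        System.out.println(order("aa", "a"));   // prints after
        System.out.println(order("a", "a"));    // prints same
    }
}
```

Note that length alone does not decide the order: "aa" comes before "b" because the first characters already differ.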

    [Figure: reference variables x, y, z, and w and the String objects “abcde”, “AbCdE”, and “abcde” that they refer to.]

  • 10

    The compareTo() method uses character codes to make decisions about String ordering.

        System.out.println("a".compareTo("aa"));
        System.out.println("aa".compareTo("a"));

    The value returned by compareTo() is
    • < 0 if the first String comes before the second
    • == 0 if the two Strings are equivalent
    • > 0 if the first String comes after the second

    Examples

        "a".compareTo("aa") < 0
        "a".compareTo("a") == 0
        "aa".compareTo("a") > 0

    The exact numerical value of the comparisons that result in a non-zero value actually provides more detailed info about how the Strings are different, but in simple programs, it’s more common to just look for 0, < 0, or > 0.

    Parsing Strings that contain only Characters that represent numbers

    Special Strings that contain only characters for numbers can be converted into the corresponding numerical value. Let’s illustrate the difference between String constants such as “123” and the numerical constant 123.

    int x = 123;

    String s = "123";

    Or, knowing what we know about character encodings:

        49 50 51

    Since this special kind of String comes up so often, we need easy-to-use predefined operations to do the conversions for us.

    When converting a String to a number, Java provides several operations for parsing, i.e., converting the String into its numerical equivalent. Any attempt to parse a String that contains non-numerical characters generates a runtime exception. How to correctly handle such situations in your program will be covered in the unit on exceptions later in the course. (exercise: think about how the Integer.parseInt() operation works, given that you now know how Strings representing numbers are actually stored)

    When converting a number into a String, the operation is usually performed automatically. There are a few predefined operations that will do this operation, called formatting, for you explicitly. For example:

        int x = 3;
        System.out.println("value of x is " + x);

    Look at the information given to System.out.println() for output:

        "value of x is " + x

    Here is how the compiler deals with this expression, which appends an int to the String "value of x is ":

  • 11

    In order to make sense out of this expression, the int must be converted into an equivalent String before the append operation can be completed. So do the following:

    • take the value of x, which is 3
    • convert it into an equivalent String, i.e., “3”
    • append it to the first string “value of x is ”
    • result is the String “value of x is 3”, and send this to the display
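The difference between a number and a String of digit characters shows up as soon as "+" is applied. Here is a brief sketch using the parsing and formatting methods named in these notes; the class name ConversionSketch and the helpers addOne() and appendOne() are assumptions for illustration.

```java
class ConversionSketch {
    // Hypothetical helper: parses the String, then does arithmetic.
    static int addOne(String digits) {
        return Integer.parseInt(digits) + 1;
    }

    // Hypothetical helper: no parsing, so "+" appends instead of adding.
    static String appendOne(String digits) {
        return digits + 1;   // the int 1 is converted to "1" automatically
    }

    public static void main(String[] args) {
        System.out.println(addOne("123"));               // prints 124
        System.out.println(appendOne("123"));            // prints 1231
        System.out.println(Double.parseDouble("2.5") * 2);  // prints 5.0
        String s = Integer.toString(22);                 // explicit formatting
        System.out.println(s.length());                  // prints 2
    }
}
```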

    Operations on Strings Use the Selector Operator “.”

    When writing expressions that use multiple Strings, we frequently use the dot or selection operator. Example:

        String s = "xyz";
        String t = "abc";

    Instead of writing

        if (s == t) ...

    we write

        if ( s.equals(t) ) ...

    Other examples

        int x = s.length();
        char c = s.charAt( 2 );
        etc.

    In an expression like “s.equals(t)”, we say that the method “equals()” is applied to s, and t is a parameter to the method. This is actually an introduction to an OOP style of programming. Strings are a type of object because the String type is defined by a class definition. We create objects from class definitions. When performing an operation on an object, the method is applied to the object, rather than passing the object as a parameter. If other data is required in addition to the original object, then this data is passed as parameters normally.

    Examples:

    Length: instead of writing

        String s = "xyz";
        int x = length(s);              // wrong

    We write

        String s = "xyz";
        int x = s.length();             // correct

    Equals: instead of writing

        String s = "xyz";
        String t = "abc";
        if ( equals( s, t ) ) ...       // wrong

    We write

        if ( s.equals( t ) ) ...        // correct

    This kind of expression is going to be very common when dealing with objects and OOP which we’ll cover in lecture shortly.

  • 12

    Obtaining Individual Characters at Specific Positions

    A String is similar to an array of characters. Each character in a String occupies an indexed position. The indexes start at 0 and go until the length of the String – 1.

        String x = "Hello, world!";

        H   e   l   l   o   ,       w   o   r   l   d   !
        0   1   2   3   4   5   6   7   8   9   10  11  12

    (Position 6 holds the space character.)

        System.out.println( x.length() );       // 13
        System.out.println( x.charAt( 7 ) );    // w
        System.out.println( x.charAt( 11 ) );   // d

    The expression “x.charAt( 7 )” is conceptually similar to the expression “x[ 7 ]”. But remember that Strings are not the same as arrays of characters in Java, so the expression “x[ 7 ]” where x is a String is not allowed in Java. You must use “x.charAt( 7 )”.

    Comparison Between Arrays and String

    Strings are very similar to arrays of characters, but not identical. Each character in a String occupies a position or index, using 0-based counting.

        int[] x = { 3, 18, 22, 34, 19 };
        x[ 0 ] refers to the number stored in array x at position 0: 3
        x[ 1 ] refers to the number stored in array x at position 1: 18
        x[ 2 ] refers to the number stored in array x at position 2: 22
        x[ 3 ] refers to the number stored in array x at position 3: 34
        x[ 4 ] refers to the number stored in array x at position 4: 19

    The analogous code for String data requires the charAt() operation to be applied to the String variable using the selector operator.

        String s = "pxvtmr";
        s.charAt( 0 ) refers to the character at position 0: 'p'
        s.charAt( 1 ) refers to the character at position 1: 'x'
        s.charAt( 2 ) refers to the character at position 2: 'v'
        s.charAt( 3 ) refers to the character at position 3: 't'
        s.charAt( 4 ) refers to the character at position 4: 'm'
        s.charAt( 5 ) refers to the character at position 5: 'r'

    So Strings are similar to arrays of characters, but we do not use the square bracket notation with Strings.
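Combining charAt() with length() gives the String version of the array-traversal loops reviewed earlier. As one possible illustration, the sketch below counts how often a character occurs; the class name CharCountSketch and the helper countChar() are invented for this example.

```java
class CharCountSketch {
    // Hypothetical helper: counts occurrences of target in s by visiting
    // every index from 0 to s.length() - 1.
    static int countChar(String s, char target) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {   // length() is a method here
            if (s.charAt(i) == target) {         // chars are primitives, so == is fine
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String x = "Hello, world!";
        System.out.println(countChar(x, 'l'));   // prints 3
        System.out.println(countChar(x, 'o'));   // prints 2
        System.out.println(countChar(x, 'z'));   // prints 0
    }
}
```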

  • 13

    Substrings Given an original String, the substring() method creates a new String that is a part of the original. Substrings are defined by their starting and stopping index positions. There are several versions.

        String s = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
        String t = s.substring( 4, 8 );
        System.out.println( t );   // "EFGH"

    Note that the substring(4,8) includes the characters at positions 4, 5, 6, and 7 (does not include 8).

        String v = s.substring(8);
        System.out.println( v );   // "IJKLMNOPQRSTUVWXYZ"

    This 2nd version of substring() only takes one argument. The result is the substring from that position to the end of the string.

        s.substring( x ) is the same as s.substring( x, s.length() )

    Method Overloading: another preview of OOP

    In this example, we have seen two versions of a method with the same name: one version of substring() that takes two arguments, and one version of substring() that takes one argument. This is an example of method overloading, i.e., defining two methods that have the same name but different numbers and/or types of arguments.

    [some examples in this section taken from Hubbard, Programming with Java, Schaum’s Outline Series, McGraw-Hill, 2004].

    Locating Characters Within a String

    The method indexOf() returns an index position of the first occurrence of a character within a String. This is also an overloaded function, i.e., two versions with different numbers of arguments.

        String str = "This is the Mississippi River.";
        int i = str.indexOf( 's' );
        // 1st occurrence of 's' in str from the beginning
        System.out.println( i );   // 3
        int j = str.indexOf( 's', i+1 );
        // 1st occurrence of 's' in str, starting from position i+1
        System.out.println( j );   // 6
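The two-argument indexOf() can be called repeatedly to visit every occurrence, and indexOf() combines naturally with substring(). The sketch below relies on the fact that indexOf() returns -1 when the character is not found; the class name IndexOfSketch and both helper names are assumptions for illustration.

```java
class IndexOfSketch {
    // Hypothetical helper: counts occurrences of c by restarting the
    // search one position past each match until indexOf() returns -1.
    static int countOccurrences(String str, char c) {
        int count = 0;
        int pos = str.indexOf(c);
        while (pos != -1) {
            count++;
            pos = str.indexOf(c, pos + 1);
        }
        return count;
    }

    // Hypothetical helper: returns the text before the first space,
    // or the whole String if it contains no space.
    static String firstWord(String str) {
        int space = str.indexOf(' ');
        if (space == -1) {
            return str;
        }
        return str.substring(0, space);   // positions 0 .. space-1
    }

    public static void main(String[] args) {
        String str = "This is the Mississippi River.";
        System.out.println(countOccurrences(str, 's'));   // prints 6
        System.out.println(firstWord(str));               // prints This
    }
}
```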

  • 14

    Tokens and Tokenization

    Many programs deal with the processing of textual data in the form of Strings. So far, we’ve been thinking about Strings as a single data item. Programs that prompt the user to input a numerical value receive a String that is then parsed into the numerical equivalent, for example.

        String s = "45";
        int x = Integer.parseInt(s);

    But many String values that are input actually have some internal structure. For example, suppose you wanted to write a simple line-oriented calculator application that works something like this:

        > java Calc
        Please enter an expression: 4 + 3
        7
        Please enter an expression: 15 - 4
        11
        Please enter an expression: quit
        >

    Using the Scanner method nextLine(), we can obtain a line of input from the keyboard in the form of a single String:

        Scanner in = new Scanner(System.in);
        String s = in.nextLine();   // assume s contains "4 + 3"

    But now we have a problem. We want to parse s to obtain the numerical values that are “buried” inside it, but we can’t just parse s “as is”. It will generate an exception because the String s is not simply a number. It is a mixture of numerical characters, arithmetic symbol characters, and space characters:

        '4'  ' '  '+'  ' '  '3'
         0    1    2    3    4

    Clearly what we need is to first break up the single String s into 3 separate Strings a, b, and c as follows:

        s: "4 + 3"   original
        a: "4"       1st token
        b: "+"       2nd token
        c: "3"       3rd token

    We need a way to read the characters of the original String s, locate the pieces that we are interested in, discard uninteresting characters such as spaces, and save the remaining pieces as separate String values. This task comes up so frequently that Java provides predefined code to solve the problem for us. We just need to learn a little terminology to use it.

    The Scanner class with its nextInt() and nextDouble() methods takes care of this problem automatically. In this example, we are doing it “the hard way”. In some cases the “hard way” might be “the only way” to solve some problems.

    The subparts of our original String "4 + 3" that we are interested in separating out are called tokens. In our example, the first token is "4", the second token is "+", and the third token is "3". The process of taking the original String and breaking it apart into its tokens is called tokenization.

    In order to know where one token stops and the next one begins, we have to decide or agree on the separator characters. In this case, we assume that the space character separates the tokens, but we might want to change the definition of the separators from time to time. The separator character or characters are called delimiters.

  • 15

    The “split()” method from class String

    The split() method is built into the String class. It takes an initial string and chops or splits it into substrings based on a set of splitting rules provided by the user. The returned result is an array of Strings, with each substring stored in one element of the array.

    Example:

        String s = "this is a multi word string";
        String[] t = s.split( " " );

    The input parameter to the split() method is itself a string. This string specifies the rules for performing the split. In the simplest case as shown, the rule is simply the space character “ ”. This rule tells the split() method to create a new substring every time it encounters a space character.

    The complete set of split() rules is very complex. The method can be used to perform sophisticated text processing, but this is outside the scope of the course.

    After the split() method returns, the array can be examined like any other array:

        for (int i=0; i
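The loop that examines the array is cut off in this transcript. As a rough stand-in (not the original listing), here is a minimal sketch that splits on a single space and walks the resulting array; the class name SplitSketch and the helper splitOnSpace() are hypothetical.

```java
class SplitSketch {
    // Hypothetical helper: splits s at every single space character.
    static String[] splitOnSpace(String s) {
        return s.split(" ");
    }

    public static void main(String[] args) {
        String[] t = splitOnSpace("this is a multi word string");
        // The result is an ordinary array, so length is a property here.
        for (int i = 0; i < t.length; i++) {
            System.out.println("t[" + i + "] = \"" + t[i] + "\"");
        }
    }
}
```

With the input shown, the array has six elements, from t[0] = "this" to t[5] = "string".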

  • 16

    Example

        String s = "This is a test.";

    • To obtain the individual tokens, create a StringTokenizer.
    • Place this statement near the top of the file with any other import statements

        import java.util.*;

    • Create a StringTokenizer object associated with String s.

        StringTokenizer tk = new StringTokenizer(s);

    • Obtain the tokens by applying the nextToken() method to the StringTokenizer object tk.

        String t1 = tk.nextToken();   // "This"
        String t2 = tk.nextToken();   // "is"
        String t3 = tk.nextToken();   // "a"
        String t4 = tk.nextToken();   // "test."

    Putting all the pieces together gives:

        import java.util.*;

        public class TokenDemo {
            public static void main(String args[]) {
                String s = "This is a test.";
                StringTokenizer tk = new StringTokenizer(s);

                String t1 = tk.nextToken();   // "This"
                String t2 = tk.nextToken();   // "is"
                String t3 = tk.nextToken();   // "a"
                String t4 = tk.nextToken();   // "test."

                System.out.println("token #1 = \"" + t1 + "\"");
                System.out.println("token #2 = \"" + t2 + "\"");
                System.out.println("token #3 = \"" + t3 + "\"");
                System.out.println("token #4 = \"" + t4 + "\"");
            }
        }

    Output is

        token #1 = "This"
        token #2 = "is"
        token #3 = "a"
        token #4 = "test."

    Limitations with this approach?
    • Not general
    • What if we have a String with more than four tokens?
    • Better to use methods like countTokens() or hasMoreTokens()

    The following general version uses loops and built-in operations to test for more tokens.

        String s = "This is another test with a larger number of tokens.";
        StringTokenizer tk = new StringTokenizer(s);
        String[] tokens = new String[tk.countTokens()];
        for (int i=0; i

  • 17

    Output is

        token #1 = "This"
        token #2 = "is"
        token #3 = "another"
        token #4 = "test"
        token #5 = "with"
        token #6 = "a"
        token #7 = "larger"
        token #8 = "number"
        token #9 = "of"
        token #10 = "tokens."

    Note: it’s not required for you to first put all the tokens into an array, but it’s sometimes convenient.
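The general version's loop is truncated in this transcript. One way such a loop might look is sketched below, under the assumption that the tokens are first copied into an array sized by countTokens() and then printed. The class name TokenArraySketch and the helper tokenize() are made up for this example.

```java
import java.util.StringTokenizer;

class TokenArraySketch {
    // Hypothetical helper: copies every token of s into a new array.
    static String[] tokenize(String s) {
        StringTokenizer tk = new StringTokenizer(s);
        // countTokens() reports the tokens still remaining, so it is
        // called once, before any nextToken() call consumes a token.
        String[] tokens = new String[tk.countTokens()];
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = tk.nextToken();
        }
        return tokens;
    }

    public static void main(String[] args) {
        String s = "This is another test with a larger number of tokens.";
        String[] tokens = tokenize(s);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println("token #" + (i + 1) + " = \"" + tokens[i] + "\"");
        }
    }
}
```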

  • 18

    • Another common programming style with StringTokenizer is to use a while loop

        String s = ...;
        StringTokenizer tk = new StringTokenizer(s);
        while ( tk.hasMoreTokens() ) {
            String t = tk.nextToken();
            ...
        }

    Another example:

        String s = "4 + 3";
        StringTokenizer t = new StringTokenizer(s);
        System.out.println(t.nextToken());
        System.out.println(t.nextToken());
        System.out.println(t.nextToken());

    Output is:

        4
        +
        3

    There is also a related class called BreakIterator, which partitions a string into subsets of characters. In the case of the StringTokenizer, the separator characters are “consumed” by the tokenizer and are not available for later examination. The BreakIterator reports information about the tokens by index position of characters within the original string and does not consume any characters.

    In general, the split() method of class String is the preferred way to tokenize a string. It should be your default choice. Only use StringTokenizer or BreakIterator for special-purpose problems.

    Other Delimiters

    The documentation for the constructor of the usual StringTokenizer looks like this:

        public StringTokenizer(String str)

    Constructs a string tokenizer for the specified string. The tokenizer uses the default delimiter set, which is " \t\n\r\f": the space character, the tab character, the newline character, the carriage-return character, and the form-feed character. Delimiter characters themselves will not be treated as tokens.

    Parameters: str - a string to be parsed.

    But for special situations, you want to be able to create a StringTokenizer that is customized to recognize other delimiter characters. Here’s the documentation for the customizable StringTokenizer:

        public StringTokenizer(String str, String delim)

    Constructs a string tokenizer for the specified string. The characters in the delim argument are the delimiters for separating tokens. Delimiter characters themselves will not be treated as tokens.

    Parameters:
        str - a string to be parsed
        delim - the delimiters.

    Example:

        String s = "this:is:a:string:with:colon:separators";
        StringTokenizer t = new StringTokenizer(s, ":");

    The StringTokenizer in this example is initialized to look for the colon character ":" as the delimiter or separator, rather than the default characters of space, tab, newline, etc.
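To see the two-argument constructor at work, here is a minimal sketch that walks through the colon-separated string. The class name ColonTokens and the helper tokenize() are hypothetical names used only for this illustration.

```java
import java.util.StringTokenizer;

public class ColonTokens {
    // Hypothetical helper: splits s on the characters in delim and returns the tokens.
    static String[] tokenize(String s, String delim) {
        StringTokenizer t = new StringTokenizer(s, delim);
        String[] tokens = new String[t.countTokens()];
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = t.nextToken();
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] words = tokenize("this:is:a:string:with:colon:separators", ":");
        for (int i = 0; i < words.length; i++) {
            System.out.println(words[i]);   // one word per line, colons removed
        }

        // With ":" as the only delimiter, a space is an ordinary character.
        String[] fields = tokenize("first field:second field", ":");
        for (int i = 0; i < fields.length; i++) {
            System.out.println(fields[i]);  // "first field", then "second field"
        }
    }
}
```

Note that supplying a delimiter string replaces the default set rather than adding to it, which is why the spaces in the second example stay inside the tokens.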


    More String Processing

    The tokenizer approach described above is a useful practice exercise for beginning programmers, but it’s not the best approach for solving the general problem. The best approach uses the rules of regular expressions to break up a string according to a specified pattern (the regular expression). One way to access this feature in Java is to use the split operation for Strings:

    Example:

        String t = "This  is   a    test";    // note extra spaces
        String[] tokens = t.split(" +");      // " +" is RE for "one or more spaces"
        for (int i=0; i
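The split() example above is cut off in these notes. The following is a minimal sketch of how such an example might run end to end, assuming the loop simply prints each token. The class name SplitSketch and the helper splitOnSpaces() are hypothetical.

```java
public class SplitSketch {
    // Hypothetical helper: breaks t apart wherever one or more spaces occur.
    static String[] splitOnSpaces(String t) {
        return t.split(" +");
    }

    public static void main(String[] args) {
        String t = "This  is   a    test";   // runs of 2, 3 and 4 spaces
        String[] tokens = splitOnSpaces(t);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println("token #" + (i + 1) + " = \"" + tokens[i] + "\"");
        }
        // Prints four tokens: "This", "is", "a", "test"
    }
}
```

One difference from StringTokenizer worth knowing: if the String starts with spaces, split(" +") returns an empty String as its first element, so it is common to call trim() on the input first.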


    Converting a char to a String

    There is a difference between a character constant like 'b' and a String consisting of a single character like "b". A convenient operation to convert a character constant into a one-character String is Character.toString(). For example:

        char c = 'x';                        // assigns char constant 'x' to char variable c
        String s = Character.toString(c);    // converts 'x' to "x" in String variable s

    Shifting Characters by Manipulating their ASCII Code

    We can sometimes take advantage of the fact that the ASCII codes for the characters are assigned sequentially for certain ranges. From the ASCII table given earlier, here are a couple of lines:

        Decimal  Octal  Hex  Binary    Value
        -------  -----  ---  --------  -----
        097      141    061  01100001  a
        098      142    062  01100010  b
        099      143    063  01100011  c
        ...

    The statement System.out.print('a') prints the character a on the screen. But look at the following code:

        int mysterycode = (int) 'a';
        mysterycode++;
        char mysterychar = (char) mysterycode;
        System.out.println(mysterychar);

    Hopefully, it won’t be too surprising that it prints out the character b on the screen. The expression ‘a’ is just a symbol for the ASCII code value of 97, which is assigned to the int variable. This int value is then incremented from 97 to 98, and turned back into a character constant with the cast operation. As a character code, 98 represents ‘b’, so this is what is printed out. You will need this little trick as you write your solution to the next lab.
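The same cast, increment, cast sequence can be wrapped in a small method and tried on other characters. This is a minimal sketch; the class name ShiftSketch and the method next() are hypothetical names for this illustration.

```java
public class ShiftSketch {
    // Hypothetical helper: returns the character whose code is one more than c's.
    static char next(char c) {
        int code = (int) c;    // e.g. 'a' becomes 97
        code++;                // 97 becomes 98
        return (char) code;    // 98 becomes 'b'
    }

    public static void main(String[] args) {
        System.out.println(next('a'));   // prints b
        System.out.println(next('A'));   // prints B
        System.out.println(next('0'));   // prints 1
        System.out.println(next('z'));   // prints { because code 123 follows 'z' (122)
    }
}
```

The last call shows the limit of the trick: the codes are only sequential within a range such as a to z, so shifting past the end of a range needs an explicit wrap-around check.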


    Strings are Immutable

    Once a Java String has been created, it has a property called “immutability”, which is just a fancy word for “cannot be modified”. This is very different in general from C/C++, where strings are just arrays of characters which can be modified however you want (unless you use the keyword const).

    Example:

        String s = "abcde";
        char c = s.charAt(0);    // assigns 'a' to variable c
        s.setCharAt(0, 'A');     // illegal, can't modify a character
                                 // within the String

    Perhaps surprisingly, it’s okay to throw a String away and replace it with a new one.

        String s = "abcde";    // assign a reference to a String to s
        s = "pqrst";           // replace the reference to the first
                               // String with a second String
                               // this effectively discards 1st String

    But in general if you want to systematically modify an existing String to produce a new one, you have a couple of options. Some of the predefined String operations create a new String which can be used to replace the original.

        String s = "abcde";
        s = s.toUpperCase();    // s now has value "ABCDE"

    To create the new String one character at a time, you can use a loop and the “append” operation. What does the following code do? Trace it to find out.

        String t = "uvwxyz";
        String u = "";
        for (int i=0; i
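The loop in the listing above is cut off in these notes, so its body is not known. As a separate illustration (an assumption, not the missing code), here is a minimal sketch of building a new String one character at a time with charAt() and the + operator. The class name BuildStringSketch and the method reverse() are hypothetical.

```java
public class BuildStringSketch {
    // Hypothetical helper: builds a reversed copy of t one character at a time.
    static String reverse(String t) {
        String u = "";
        for (int i = 0; i < t.length(); i++) {
            // Each + creates a brand-new String; t itself is never modified.
            u = t.charAt(i) + u;
        }
        return u;
    }

    public static void main(String[] args) {
        String t = "uvwxyz";
        String u = reverse(t);
        System.out.println(u);   // prints zyxwvu
        System.out.println(t);   // prints uvwxyz, the original is unchanged
    }
}
```

Writing u = u + t.charAt(i) instead would append each character at the end and produce a plain copy, so the order of the operands decides the result.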


    String Operations (Summary)

    Comparisons and Testing for Equality/Inequality

        "a".compareTo("b")        result is < 0 because b comes after a
        "a".compareTo("a")        result is 0 because the two Strings have the same content
        "abc".equals("abc")       result is true
        !("abc".equals("cde"))    result is true; note that there is no operation called
                                  .notequals(); use ! and .equals()
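With compareTo(), it is the sign of the result that carries the meaning, not the exact number. The following minimal sketch turns the three cases into words; the class name CompareSketch and the method describe() are hypothetical names for this illustration.

```java
public class CompareSketch {
    // Hypothetical helper: reports how x is ordered relative to y.
    static String describe(String x, String y) {
        int r = x.compareTo(y);
        if (r < 0) {
            return x + " comes before " + y;
        } else if (r > 0) {
            return x + " comes after " + y;
        } else {
            return x + " has the same content as " + y;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("a", "b"));   // a comes before b
        System.out.println(describe("b", "a"));   // b comes after a
        System.out.println(describe("a", "a"));   // a has the same content as a
        // Ordering follows the character codes, so every uppercase letter
        // sorts before every lowercase letter.
        System.out.println(describe("B", "a"));   // B comes before a
    }
}
```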

    Use “==” only to check if two Strings are physically the same String in memory (a much less common comparison than comparing content).

        String a = "abc";
        String b = "cde";
        String c = "abc";
        String d = new String("abc");
        a == b    // false
        a == c    // true (identical String constants unique in memory)
        a == d    // false (constructor builds distinct String)

    Extracting Characters from a String

        "abcde".charAt(3)    // result is 'd'

    Extracting Substrings

        "wxyzabcd".substring(3,5)    // result is "za"
        "wxyzabcd".substring(5)      // result is "bcd"

    Parsing and Formatting

    Some String constants can represent a number, others cannot.

        "123"        // can parse to an int
        "123.456"    // can parse to a double
        "abc"        // cannot parse to any number

    A String that contains characters representing a number can be converted into the corresponding numerical value. This operation is called parsing. Going in the other direction, converting a number into a String, and optionally adding other punctuation, is called formatting. This extra punctuation is added purely for visual effect. It has no effect on the numerical value of the underlying data item.

    To parse a String into an int, use
    • int Integer.parseInt(String s)

    To parse a String into a double, use
    • double Double.parseDouble(String s)

    To format an int into a String, use
    • Integer.toString(int n)

    To format a double into a String, use
    • Double.toString(double d)

    Many operations that output numerical values automatically convert the numerical value to a String, so most of the time, it is not necessary to explicitly format a number to convert it to a String. The exception is when you specifically want to control the formatting details, such as the number of digits to display to the right of the decimal point, or whether to separate groups of three digits on the left with commas. The simple formatting methods shown above do not provide a way to control the formatting at this level of detail. Instead, there is a special class called DecimalFormat that is used to perform detailed formatting on numbers. This class is defined in the package java.text.
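Before turning to DecimalFormat, here is a minimal sketch of the simple parsing and formatting methods listed above. It also shows what happens with a String such as "abc": parseInt() throws a NumberFormatException, which the hypothetical helper isInt() catches. The class name ParseSketch and the helper are assumptions made for this illustration, and try/catch may not have been covered yet in the course.

```java
public class ParseSketch {
    // Hypothetical helper: returns true if s can be parsed as an int.
    static boolean isInt(String s) {
        try {
            Integer.parseInt(s);
            return true;
        } catch (NumberFormatException e) {
            return false;   // s does not have the form of an int
        }
    }

    public static void main(String[] args) {
        int n = Integer.parseInt("123");             // String to int
        double d = Double.parseDouble("123.456");    // String to double
        System.out.println(n + 1);                   // prints 124, ordinary arithmetic

        String a = Integer.toString(n);              // int to String
        String b = Double.toString(d);               // double to String
        System.out.println(a + b);                   // prints 123123.456, concatenation

        System.out.println(isInt("123"));            // prints true
        System.out.println(isInt("123.456"));        // prints false
        System.out.println(isInt("abc"));            // prints false
    }
}
```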


    The operation of a DecimalFormat object has two steps or phases. The first step is that it must be created or instantiated (“instantiate” is the OOP term meaning “use a class definition to build a new non-primitive data item”). At the time it is created, it is given a formatting “pattern” to work with. The second step is, once it is created, it can then be used multiple times to turn numbers into Strings with the “format()” operation. The first step is performed once. The second step is performed as many times as needed. Here’s how to create a DecimalFormat object:

        import java.text.DecimalFormat;
        ...
        DecimalFormat f = new DecimalFormat("#.###");

    This statement creates a DecimalFormat object named f. The String “#.###” is called a format string. It is composed of a small number of characters including ‘#’ ‘0’ ‘.’ ‘,’ and a few others. The character ‘#’ is used to indicate that a digit should be printed in that position unless it is a trailing zero. The character ‘0’ means that a digit should be printed in that position even if it is a trailing or leading zero.

    After the DecimalFormat object has been created, the format() method is used to convert numerical values into a String formatted according to the specified format String.

        import java.text.*;    // remember to import java.text package

        DecimalFormat f;
        String s;

        f = new DecimalFormat("#.###");
        s = f.format(5.12345);
        System.out.println( s );    // 5.123
        s = f.format(5.12);
        System.out.println( s );    // 5.12

        f = new DecimalFormat("#.000");
        s = f.format(5.12345);
        System.out.println( s );    // 5.123
        s = f.format(5.12);
        System.out.println( s );    // 5.120

    Once the DecimalFormat object has been created, it can be reused as many times as desired to convert numerical values to properly formatted Strings. The resulting Strings can then be used in whatever way is desired, i.e., print to the monitor, perform further internal processing, etc.

    System.out.printf() Method

    A more recent addition to the language is the System.out.printf() method, which is a throwback to the original printf() function introduced by the C language. printf() uses embedded format sequences to accomplish formatting similar to DecimalFormat, but in an arguably simpler way, at least for programmers familiar with the function from the C language. The format is:

        System.out.printf("format string", expr, expr, expr, …);

    where the format string contains normal text to be output as-is, plus format specifiers embedded in the string starting with the character %. For each format specifier, there should be an additional expression separated by a comma after the end of the format string. The compiler does not enforce this rule, but violations may generate a runtime exception.

    Format specifiers include:

        %d    for integers
        %f    for reals
        %b    for booleans

    In addition to the basic datatype, precision and field width info can be added. For example

        %.3f      real number with 3 digits to the right of the decimal
        %10.3f    real number with a total minimum width of 10 characters (including decimal and sign) and 3 digits to the right of the decimal


    String.format() is a similar method that takes the same inputs as System.out.printf(), but which produces a String and returns it rather than immediately printing it to the output. The following two code fragments are equivalent:

        int x = 123;
        double q = 23.456;
        boolean r = true;
        System.out.printf("x = %d, q = %f, r = %b", x, q, r);
        System.out.println();

    Or

        int x = 123;
        double q = 23.456;
        boolean r = true;
        String s = String.format("x = %d, q = %f, r = %b", x, q, r);
        System.out.println(s);
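The precision and field-width specifiers are easiest to understand by looking at the Strings they produce. The following is a minimal sketch using String.format(); square brackets are added around each result so the padding is visible. The class name FormatSketch and the helper show() are hypothetical, and Locale.US is passed explicitly (an addition not in the notes) so that the decimal separator is a period on every machine.

```java
import java.util.Locale;

public class FormatSketch {
    // Hypothetical helper: formats x with the given specifier, wrapped in brackets.
    static String show(String spec, double x) {
        return String.format(Locale.US, "[" + spec + "]", x);
    }

    public static void main(String[] args) {
        double q = 23.456;
        System.out.println(show("%f", q));       // [23.456000]   %f alone gives 6 decimals
        System.out.println(show("%.3f", q));     // [23.456]
        System.out.println(show("%.1f", q));     // [23.5]        rounded to 1 decimal
        System.out.println(show("%10.3f", q));   // [    23.456]  padded on the left to width 10
    }
}
```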


    Another Tokenizer Example

    Here’s a program that
    • Reads a String from the keyboard
    • Breaks the line apart into a series of tokens
    • Parses each token to convert it to an integer
    • Computes the sum of all the integers
    • Prints the sum to the monitor.

        import java.io.*;      // needed for IOException
        import java.util.*;

        public class AddInts {
            public static void main(String args[]) throws IOException {
                Scanner in = ...;
                System.out.print("Enter a series of integers: ");
                String line = in.nextLine();
                int sum = 0;
                StringTokenizer tkline = new StringTokenizer(line);
                while (tkline.hasMoreTokens()) {
                    String currenttok = tkline.nextToken();
                    int currentval = Integer.parseInt(currenttok);
                    sum += currentval;
                }
                System.out.println("The sum is " + sum);
            }
        }

    Output is

        C:\> java AddInts
        Enter a series of integers: 5 6 7 8
        The sum is 26

    Exercise: rewrite using string split().


    Command-Line Arguments

    Here’s a little more detail on how a Java source file is turned into an executable application. Strings placed on the command line after the name of the class are called command-line arguments.

        C:\> javac SomeClass.java
        C:\> java SomeClass 10 11 12

    In this example the command line arguments are “10”, “11”, and “12”, i.e., every item of information on the command line that follows the name of the Java class being executed. Command line arguments are passed to your program as an array of Strings. This array is made available to the main method as the input argument named args.

        public static void main( String[] args) { ... }

    In this case:

        java    SomeClass     10         11         12
        cmd     class file    args[0]    args[1]    args[2]

        args[0] = "10"
        args[1] = "11"
        args[2] = "12"
        args.length = 3

    The name of the command being executed (“java”) and the name of the class file (“SomeClass”) are not part of the argument list. The list starts with the first String or “token” on the command line after the class file.

    As a nice simple demo, here’s a program that does nothing but “echo” the command line arguments, if any, back to the display:

        public static void main(String[] args) {
            for (int i=0; i
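The echo listing above is cut off in these notes. The following is a minimal sketch of such a program, assuming it simply prints each element of args in order. The class name EchoArgs and the helper describe() are hypothetical names for this illustration.

```java
public class EchoArgs {
    // Hypothetical helper: builds the text printed for the argument at position i.
    static String describe(int i, String arg) {
        return "args[" + i + "] = \"" + arg + "\"";
    }

    public static void main(String[] args) {
        // args.length is 0 when no arguments are given, so the loop body never runs.
        for (int i = 0; i < args.length; i++) {
            System.out.println(describe(i, args[i]));
        }
    }
}
```

Running it as java EchoArgs 10 11 12 prints one line per argument, starting with args[0] = "10". Each argument arrives as a String, so a program that needs the numeric value still has to call Integer.parseInt() on it.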