What is a better way to cut a string that includes 2-byte characters in Java?

I am writing a method that creates a fixed-length message for an interface between two other systems.

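The message must carry each item at the agreed length (in bytes); if an item is longer than the agreed length, it must be truncated to that length.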

The message contains 2-byte characters, so if it is cut in the middle of a character, the result contains a broken character.

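To count the bytes correctly, the method scans from the beginning of the string up to the cut position. If the message is long, performance is likely to be poor.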

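I could not find a better way, so I am asking for help here. I am sorry that the code is complicated and redundant. The whole project is available here.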

package thecodinglog.string; 

public class StringHelper { 

public static String substrb2(String str, Number beginByte) { 
    return substrb2(str, beginByte, null, null, null); 
} 

public static String substrb2(String str, Number beginByte, Number byteLength) { 
    return substrb2(str, beginByte, byteLength, null, null); 
} 

/** 
* Returns the substring of the String. 
* It returns a string as specified length and byte position. 
* You can pad characters left or right when there is a specified length. 
* It distinguishes between 1 byte character and 2 byte character and returns it exactly as specified byte length. 
* If the start position or the specified length causes a 2-byte character to be truncated in the middle, 
* it will be converted to Space. 
* You can specify either left or right padding. 
* 
* If beginByte is 0, it is changed to 1 and processed. 
* If beginByte is less than 0, the string is searched for from right to left. 
* If beginByte or byteLength is a real number, the decimal point is discarded. 
* If you do not specify a length, returns everything from the starting position to the right-end string. 
* 
* Examples: 
* <blockquote><pre> 
*  StringHelper.substrb2("a好호b", 1, 10, null, "|") returns "a好호b||||" 
*  StringHelper.substrb2("ab한글", 4, 2) returns " " 
*  StringHelper.substrb2("한a글", -3, 2) returns "a " 
*  StringHelper.substrb2("abcde한글이han gul다ykd", 7) returns " 글이han gul다ykd" 
* </pre></blockquote> 
* 
* @param str a string to substring 
* @param beginByte the beginning byte 
* @param byteLength length of bytes 
* @param leftPadding a character for padding. It must be 1 byte character. 
* @param rightPadding a character for padding. It must be 1 byte character. 
* @return a substring 
*/ 
public static String substrb2(String str, Number beginByte, Number byteLength, String leftPadding, String rightPadding) { 
    if (str == null || str.equals("")) { 
     throw new IllegalArgumentException("The source string can not be an empty string or null."); 
    } 

    if (leftPadding != null && rightPadding != null) { 
     throw new IllegalArgumentException("Left padding, right padding Either of two must be null."); 
    } 

    if (leftPadding != null) { 
     if (leftPadding.length() != 1) { 
      throw new IllegalArgumentException("The length of the padding string must be one."); 
     } 
     if (getByteLengthOfChar(leftPadding.charAt(0)) != 1) { 
      throw new IllegalArgumentException("The padding string must be 1 Byte character."); 
     } 
    } 

    if (rightPadding != null) { 
     if (rightPadding.length() != 1) { 
      throw new IllegalArgumentException("The length of the padding string must be one."); 
     } 
     if (getByteLengthOfChar(rightPadding.charAt(0)) != 1) { 
      throw new IllegalArgumentException("The padding string must be 1 Byte character."); 
     } 
    } 

    int beginPosition = beginByte.intValue(); 
    if (beginPosition == 0) beginPosition = 1; 

    int length; 
    if (byteLength != null) { 
     length = byteLength.intValue(); 
     if (length < 0) { 
      return null; 
     } 
    } else { 
     length = -1; 
    } 

    if (length == 0) 
     return null; 

    boolean beginHalf = false; 
    int accByte = 0; 
    int startIndex = -1; 

    if (beginPosition >= 0) { 
     for (int i = 0; i < str.length(); i++) { 
      if (beginPosition - 1 == accByte) { 
       startIndex = i; 
       accByte = accByte + getByteLengthOfChar(str.charAt(i)); 
       break; 
      } else if (beginPosition == accByte) { 
       beginHalf = true; 
       startIndex = i; 
       accByte = accByte + getByteLengthOfChar(str.charAt(i)); 
       break; 
      } else if (accByte + 2 == beginPosition && i == str.length() - 1) { 
       beginHalf = true; 
       accByte = accByte + getByteLengthOfChar(str.charAt(i)); 
       break; 
      } 
      accByte = accByte + getByteLengthOfChar(str.charAt(i)); 
     } 
    } else { 
     beginPosition = beginPosition * -1; 
     if(length > beginPosition){ 
      length = beginPosition; 
     } 

     for (int i = str.length() - 1; i >= 0; i--) { 

      accByte = accByte + getByteLengthOfChar(str.charAt(i)); 

      if (i == str.length() - 1) { 
       if (getByteLengthOfChar(str.charAt(i)) == 1) { 
        if (beginPosition == accByte) { 
         startIndex = i; 
         break; 
        } 
       } else { 
        if (beginPosition == accByte) { 
         if (length > 1) { 
          startIndex = i; 
          break; 
         } else { 
          beginHalf = true; 
          break; 
         } 
        }else if(beginPosition == accByte - 1){ 
         if(length == 1){ 
          beginHalf = true; 
          break; 
         } 
        } 
       } 
      } else { 
       if (getByteLengthOfChar(str.charAt(i)) == 1) { 
        if (beginPosition == accByte) { 
         startIndex = i; 
         break; 
        } 
       } else { 
        if (beginPosition == accByte) { 
         if (length > 1) { 
          startIndex = i; 
          break; 
         } else { 
          beginHalf = true; 
          break; 
         } 

        } else if(beginPosition == accByte - 1) { 
         if(length > 1){ 
          startIndex = i + 1; 
         } 
         beginHalf = true; 
         break; 

        } 
       } 

      } 
     } 
    } 


    if (accByte < beginPosition) { 
     throw new IndexOutOfBoundsException("The start position is larger than the length of the original string."); 
    } 


    StringBuilder stringBuilder = new StringBuilder(); 
    int accSubstrLength = 0; 

    if (beginHalf) { 
     stringBuilder.append(" "); 
     accSubstrLength++; 
    } 


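    if (byteLength == null) { 
     // startIndex stays -1 when the cut begins inside the last character, 
     // so guard the substring call to avoid a StringIndexOutOfBoundsException 
     if (startIndex >= 0) { 
      stringBuilder.append(str.substring(startIndex)); 
     } 
     return new String(stringBuilder); 
    } 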


    for (int i = startIndex; i < str.length() && startIndex >= 0; i++) { 
     accSubstrLength = accSubstrLength + getByteLengthOfChar(str.charAt(i)); 
     if (accSubstrLength == length) { 
      stringBuilder.append(str.charAt(i)); 
      break; 
     } else if (accSubstrLength - 1 == length) { 
       stringBuilder.append(" "); 
      break; 
     } else if (accSubstrLength - 1 > length) { 

      break; 
     } 
     stringBuilder.append(str.charAt(i)); 
    } 

    if (leftPadding != null) { 
     int diffLength = byteLength.intValue() - accSubstrLength; 
     StringBuilder padding = new StringBuilder(); 
     for (int i = 0; i < diffLength; i++) { 
      padding.append(leftPadding); 
     } 
     stringBuilder.insert(0, padding); 
    } 

    if (rightPadding != null) { 
     int diffLength = byteLength.intValue() - accSubstrLength; 
     StringBuilder padding = new StringBuilder(); 
     for (int i = 0; i < diffLength; i++) { 
      padding.append(rightPadding); 
     } 
     stringBuilder.append(padding); 
    } 


    return new String(stringBuilder); 
} 

private static int getByteLengthOfChar(char c) { 
    if ((int) c < 128) { 
     return 1; 
    } else { 
     return 2; 
    } 
} 
}

New attempt

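import java.nio.ByteBuffer; 
import java.nio.CharBuffer; 
import java.nio.charset.Charset; 
import java.nio.charset.CharsetDecoder; 
import java.nio.charset.CodingErrorAction; 
import java.util.Arrays; 

String testData = "한글이가득"; 

Charset charset = Charset.forName("EUC-KR"); 
ByteBuffer byteBuffer = charset.encode(testData); 

// Take bytes 1..4 (d1 b1 db c0); the slice starts in the middle of 한 
byte[] newone = Arrays.copyOfRange(byteBuffer.array(), 1, 5); 

// Replace anything the decoder cannot decode with a space 
CharsetDecoder charsetDecoder = charset.newDecoder() 
     .replaceWith(" ") 
     .onMalformedInput(CodingErrorAction.REPLACE) 
     .onUnmappableCharacter(CodingErrorAction.REPLACE); 

CharBuffer charBuffer = charsetDecoder.decode(ByteBuffer.wrap(newone)); 

System.out.println(charBuffer.toString());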

I expected "글", but got "畸邦". I think the start index has to be a valid decoding position, but I do not see a way to let the method know what I want.

Added an example of the failure

index | 0 1  | 2 3  | 4 5  | 6 7  | 8 9 
char  | 한   | 글   | 이   | 가   | 득 
hex   | c7d1 | b1db | c0cc | b0a1 | b5e6

Assuming a start index of 1 and a length of 4 bytes, the hex codes are split like this:

index | 0 1  | 2 3  | 4 5  | 6 7  | 8 9 
char  | 한   | 글   | 이   | 가   | 득 
hex   | c7d1 | b1db | c0cc | b0a1 | b5e6 
sub   |   d1 | b1db | c0   |      |

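When the decoder decodes d1b1dbc0, it treats d1b1 as one character and dbc0 as another. The details may vary by charset, but something similar happens in this case. Unless the decoder knows where the byte boundaries of the original characters are, it decodes the bytes into the wrong characters, because the bytes alone do not reveal the starting point.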

I think the key to this approach is how to let the decoder know where the original characters start (in bytes).
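One way to keep track of the character boundaries is to never decode a raw byte slice at all: walk the string character by character, accumulate each character's encoded length, and keep only the characters that lie fully inside the byte window. The following is a minimal sketch under stated assumptions, not the asker's method. The class name `ByteWindow` and the method `slice` are hypothetical, the byte offset is 0-based as in the table above (unlike the 1-based `substrb2`), and each byte of a partially covered character is replaced by one space. It still scans from the start of the string, so it does not remove the linear cost.

```java
import java.nio.charset.Charset;

final class ByteWindow {

    /**
     * Returns the characters of str whose encoded bytes lie fully inside
     * [beginByte, beginByte + byteLength). Every byte of a character that is
     * only partly inside the window is emitted as one space.
     */
    static String slice(String str, Charset charset, int beginByte, int byteLength) {
        int endByte = beginByte + byteLength;
        StringBuilder out = new StringBuilder();
        int pos = 0; // byte offset at which the current character starts

        for (int i = 0; i < str.length() && pos < endByte; ) {
            int codePoint = str.codePointAt(i);
            int charCount = Character.charCount(codePoint);
            String ch = str.substring(i, i + charCount);
            // Encoded size of this single character in the target charset
            int next = pos + ch.getBytes(charset).length;

            if (pos >= beginByte && next <= endByte) {
                out.append(ch); // fully inside the window
            } else {
                // Number of this character's bytes that fall inside the window
                // (zero or negative when the character lies before the window)
                int overlap = Math.min(next, endByte) - Math.max(pos, beginByte);
                for (int k = 0; k < overlap; k++) {
                    out.append(' ');
                }
            }
            pos = next;
            i += charCount;
        }
        return out.toString();
    }
}
```

With the sample string, EUC-KR, a start byte of 1 and a length of 4, this sketch yields a space, 글, and another space, which matches the `sub` row of the table. Calling `getBytes` per character allocates a small array each time; for long messages a reusable `CharsetEncoder` would likely be cheaper.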

Source

2017-08-08 JeongjinKim

You know that a char is two bytes in Java? – Rodney
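A Java `char` is a 16-bit UTF-16 code unit in memory, but the number of bytes a character occupies in a message depends on the charset used to encode it. A minimal sketch of that difference (the class name `EncodedLength` and the method `byteLength` are hypothetical names for illustration):

```java
import java.nio.charset.Charset;

final class EncodedLength {

    /** Number of bytes the string occupies once encoded in the given charset. */
    static int byteLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }
}
```

For example, `byteLength("한", Charset.forName("EUC-KR"))` is 2, while the same character takes 3 bytes in UTF-8. A rule such as "below 128 is one byte, everything else is two" therefore only holds for charsets like EUC-KR.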

That is a lot of code to ask people to go through... please see how to create a [mcve] –

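Your whole question could be rephrased as "find the longest truncation of a string whose byte representation under a given charset fits a given length"? If so, I would use a 'CharsetEncoder', appending 'char' by 'char', and wait until the result overflows (or better, see the 'encodeLoop' method) – GPI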

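It would be easier to convert the whole string to a byte[] and cut the array. Then try to convert the cut piece back to a String. If the conversion fails, drop the last byte of the piece.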
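A minimal sketch of this cut-and-retry idea, assuming the cut always starts at byte 0 (a cut that starts in the middle of a character is not handled here). The class name `CutAndRetry` and the method `truncate` are hypothetical; the decoder is configured with `CodingErrorAction.REPORT` so that a dangling lead byte raises a `CharacterCodingException` instead of being silently replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class CutAndRetry {

    /** Longest prefix of str that decodes cleanly from at most maxBytes bytes. */
    static String truncate(String str, Charset charset, int maxBytes) {
        byte[] all = str.getBytes(charset);
        int end = Math.min(maxBytes, all.length);

        while (end > 0) {
            try {
                // REPORT makes the decoder throw on an incomplete trailing character
                return charset.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(all, 0, end))
                        .toString();
            } catch (CharacterCodingException e) {
                end--; // drop the last byte of the piece and try again
            }
        }
        return "";
    }
}
```

With the string 한글이가득 in EUC-KR and a limit of 9 bytes, the first attempt fails on the dangling byte and the retry with 8 bytes returns 한글이가.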

Source

2017-08-08 14:25:06 StanislavL

I have already tried that, but a cut in the middle of a 2-byte character is the problem. – JeongjinKim

There is an NIO way.

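Using CharsetEncoder#encode, one can encode a string (or rather a CharBuffer, but the conversion is trivial) into a byte array (actually a ByteBuffer) in such a way that as many characters as possible are converted from the input, up to the point where the input is fully processed, but without ever overflowing the output.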

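CoderResult.OVERFLOW indicates that there is insufficient space in the output buffer to encode any more characters. This method should be invoked again with an output buffer that has more remaining bytes. This is typically done by draining any encoded bytes from the output buffer.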

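Following your edit, here is an example (although I am still not sure what you are trying to accomplish, this is my best guess), using the EUC-KR encoding and your string 한글이가득.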

First, let us look at the byte representation of each character:

char | 한   | 글   | 이   | 가   | 득 
hex  | c7d1 | b1db | c0cc | b0a1 | b5e6

So the whole string takes 10 bytes to write.

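Now, suppose we have a message length of 9 bytes. That lets us send 한글이가 (8 bytes), which is 0xc7d1b1dbc0ccb0a1, but since there is not enough room to send 득 (it needs the 2 bytes 0xb5e6 and we have only one left), the rest of the buffer should be blank.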

Indeed:

Charset charset = Charset.forName("EUC-KR"); 
String testData = "한글이가득"; 
CharsetEncoder encoder = charset.newEncoder(); 
// We create a 9-byte buffer 
ByteBuffer limitedSizeOutput = ByteBuffer.allocate(9); 
// We encode 
CoderResult coderResult = encoder.encode(CharBuffer.wrap(testData.toCharArray()), limitedSizeOutput, true); 
// The encoder tells us that it could not fit all the chars in 9 bytes 
System.out.println(coderResult); // prints OVERFLOW 
// We can check that it encoded 8 bytes out of the 10 that compose the original string data 
limitedSizeOutput.flip(); 
System.out.println(limitedSizeOutput.limit()); // prints 8 
// We can see that these are in effect 한글이가 by reading the buffer 
System.out.println(charset.newDecoder().decode(limitedSizeOutput).toString());
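Building on the example above, the "rest of the buffer should be blank" step can be sketched as a small helper that returns a fixed-length field. This is a sketch under assumptions, not a definitive implementation: the class name `FixedField` and the method `toField` are hypothetical, the padding byte is assumed to be an ASCII space (0x20), and the charset is assumed to be stateless (as EUC-KR is), so no `flush` call is made.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

final class FixedField {

    /** Encodes value into exactly fieldBytes bytes, truncating or space-padding. */
    static byte[] toField(String value, Charset charset, int fieldBytes) {
        ByteBuffer out = ByteBuffer.allocate(fieldBytes);
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        // The result is UNDERFLOW when everything fit, or OVERFLOW when the
        // next character did not fit; either way no partial character is written.
        encoder.encode(CharBuffer.wrap(value), out, true);

        // Fill whatever space is left with ASCII spaces (assumed padding byte)
        while (out.hasRemaining()) {
            out.put((byte) ' ');
        }
        return out.array();
    }
}
```

For 한글이가득 with EUC-KR and a field of 9 bytes, the result is the 8 bytes of 한글이가 followed by a single space byte.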

Source

2017-08-08 14:36:49 GPI

Thank you for your answer. I tried the NIO approach, but I could not get the result I wanted. See the code above. – JeongjinKim

@JeongjinKim I edited to clarify. If this does not work for you, then I may have misunderstood your intent. Could you clarify? – GPI

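Indeed, there is no problem when cutting from the start of the string (index 0). But if I want to cut from the middle of the string, it becomes a complicated problem. I will edit my question with more details. – JeongjinKim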
