2012-11-08 61 views
5

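I am using an open-source method to parse HTML text into an NSString. The open-source HTML parsing category does not handle the whitespace between paragraphs correctly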

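The resulting string has a large amount of blank space between the first couple of paragraphs, but only a single blank line between the later ones. Here is an example of the output.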

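[image] Below is the method I am calling. I changed only two lines of the code: for stopCharacters and newLineAndWhitespaceCharacters, I removed \n from the character set, because when it is included the whole text comes out as one long paragraph.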

- (NSString *)stringByConvertingHTMLToPlainText { 

    // Pool 
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init]; 

    // Character sets 
    NSCharacterSet *stopCharacters = [NSCharacterSet characterSetWithCharactersInString:[NSString stringWithFormat:@"< \t\r%C%C%C%C", 0x0085, 0x000C, 0x2028, 0x2029]]; 
    NSCharacterSet *newLineAndWhitespaceCharacters = [NSCharacterSet characterSetWithCharactersInString:[NSString stringWithFormat:@" \t\r%C%C%C%C", 0x0085, 0x000C, 0x2028, 0x2029]]; 
    NSCharacterSet *tagNameCharacters = [NSCharacterSet characterSetWithCharactersInString:@"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"]; 

    // Scan and find all tags 
    NSMutableString *result = [[NSMutableString alloc] initWithCapacity:self.length]; 
    NSScanner *scanner = [[NSScanner alloc] initWithString:self]; 
    [scanner setCharactersToBeSkipped:nil]; 
    [scanner setCaseSensitive:YES]; 
    NSString *str = nil, *tagName = nil; 
    BOOL dontReplaceTagWithSpace = NO; 
    do { 

     // Scan up to the start of a tag or whitespace 
     if ([scanner scanUpToCharactersFromSet:stopCharacters intoString:&str]) { 
      [result appendString:str]; 
      str = nil; // reset 
     } 

     // Check if we've stopped at a tag/comment or whitespace 
     if ([scanner scanString:@"<" intoString:NULL]) { 

      // Stopped at a comment or tag 
      if ([scanner scanString:@"!--" intoString:NULL]) { 

       // Comment 
       [scanner scanUpToString:@"-->" intoString:NULL]; 
       [scanner scanString:@"-->" intoString:NULL]; 

      } else { 

       // Tag - remove and replace with space unless it's 
       // a closing inline tag then dont replace with a space 
       if ([scanner scanString:@"/" intoString:NULL]) { 

        // Closing tag - replace with space unless it's inline 
        tagName = nil; dontReplaceTagWithSpace = NO; 
        if ([scanner scanCharactersFromSet:tagNameCharacters intoString:&tagName]) { 
         tagName = [tagName lowercaseString]; 
         dontReplaceTagWithSpace = ([tagName isEqualToString:@"a"] || 
                [tagName isEqualToString:@"b"] || 
                [tagName isEqualToString:@"i"] || 
                [tagName isEqualToString:@"q"] || 
                [tagName isEqualToString:@"span"] || 
                [tagName isEqualToString:@"em"] || 
                [tagName isEqualToString:@"strong"] || 
                [tagName isEqualToString:@"cite"] || 
                [tagName isEqualToString:@"abbr"] || 
                [tagName isEqualToString:@"acronym"] || 
                [tagName isEqualToString:@"label"]); 
        } 

        // Replace tag with string unless it was an inline 
        if (!dontReplaceTagWithSpace && result.length > 0 && ![scanner isAtEnd]) [result appendString:@" "]; 

       } 

       // Scan past tag 
       [scanner scanUpToString:@">" intoString:NULL]; 
       [scanner scanString:@">" intoString:NULL]; 

      } 

     } else { 

      // Stopped at whitespace - replace all whitespace and newlines with a space 
      if ([scanner scanCharactersFromSet:newLineAndWhitespaceCharacters intoString:NULL]) { 
       if (result.length > 0 && ![scanner isAtEnd]) [result appendString:@" "]; // Dont append space to beginning or end of result 
      } 

     } 

    } while (![scanner isAtEnd]); 

    // Cleanup 
    [scanner release]; 

    // Decode HTML entities and return 
    NSString *retString = [[result stringByDecodingHTMLEntities] retain]; 
    [result release]; 

    // Drain 
    [pool drain]; 

    // Return 
    return [retString autorelease]; 

} 

Edit:

Here is the NSLog output of the string. I have pasted only the first few paragraphs.

Mitt Romney spent the past six years running for president. After his loss to President Barack Obama, he'll have to chart a different course. 


His initial plan: spend time with his family. He has five sons and 18 grandchildren, with a 19th on the way. 






"I don't look at postelection to be a time of regrouping. Instead it's a time of forward focus," Romney told reporters aboard his plane Tuesday evening as he returned to Boston after the final campaign stop of his political career. "I have, of course, a family and life important to me, win or lose." 

The most visible member of that family — wife Ann Romney — says neither she nor her husband will seek political office again. 

etc.

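// Print the 25 UTF-16 code units that follow "chart a different course." (which is 25 characters long)
NSUInteger start = [completeTrimmed rangeOfString:@"chart a different course."].location;
for (NSUInteger j = 25; j < 50; j++) {
    unichar test = [completeTrimmed characterAtIndex:(start + j)]; // characterAtIndex: returns unichar, not char
    NSLog(@"%hu", test);
}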

2012-11-11 17:15:57.668 LMU_LAL_LAUNCHER[5431:c07] 32 
2012-11-11 17:15:57.669 LMU_LAL_LAUNCHER[5431:c07] 32 
2012-11-11 17:15:57.669 LMU_LAL_LAUNCHER[5431:c07] 10 
2012-11-11 17:15:57.670 LMU_LAL_LAUNCHER[5431:c07] 32 
2012-11-11 17:15:57.670 LMU_LAL_LAUNCHER[5431:c07] 32 
2012-11-11 17:15:57.670 LMU_LAL_LAUNCHER[5431:c07] 10 
2012-11-11 17:15:57.671 LMU_LAL_LAUNCHER[5431:c07] 32 
2012-11-11 17:15:57.671 LMU_LAL_LAUNCHER[5431:c07] 32 
2012-11-11 17:15:57.671 LMU_LAL_LAUNCHER[5431:c07] 10 
2012-11-11 17:15:57.672 LMU_LAL_LAUNCHER[5431:c07] 32 
2012-11-11 17:15:57.672 LMU_LAL_LAUNCHER[5431:c07] 72 
2012-11-11 17:15:57.672 LMU_LAL_LAUNCHER[5431:c07] 105 
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 115 
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 32 
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 105 
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 110 
2012-11-11 17:15:57.674 LMU_LAL_LAUNCHER[5431:c07] 105 
2012-11-11 17:15:57.674 LMU_LAL_LAUNCHER[5431:c07] 116 
2012-11-11 17:15:57.674 LMU_LAL_LAUNCHER[5431:c07] 105 
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 97 
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 108 
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 32 
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 112 
2012-11-11 17:15:57.676 LMU_LAL_LAUNCHER[5431:c07] 108 
2012-11-11 17:15:57.676 LMU_LAL_LAUNCHER[5431:c07] 97 
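
Reading the values above: 32 is a space and 10 is a line feed, so the text after "course." is a run of two spaces, a newline, two spaces, a newline, and so on, before "His initial pla..." begins. A minimal sketch of a debugging helper that makes this pattern visible in one NSLog call is shown below. The function name VisibleWhitespace and the bracketed tokens are assumptions made for illustration; they are not part of NSString+HTML.

```objc
#import <Foundation/Foundation.h>

// Hypothetical debugging helper (not part of NSString+HTML): returns a copy
// of `input` in which spaces, tabs, carriage returns and line feeds are
// replaced by visible tokens, so interleaved whitespace is easy to spot.
static NSString *VisibleWhitespace(NSString *input) {
    NSMutableString *out = [NSMutableString stringWithCapacity:[input length]];
    for (NSUInteger i = 0; i < [input length]; i++) {
        unichar c = [input characterAtIndex:i];
        switch (c) {
            case ' ':  [out appendString:@"[SP]"];  break;
            case '\t': [out appendString:@"[TAB]"]; break;
            case '\r': [out appendString:@"[CR]"];  break;
            case '\n': [out appendString:@"[LF]"];  break;
            default:   [out appendFormat:@"%C", c]; break; // copy everything else unchanged
        }
    }
    return out;
}
```

Calling NSLog(@"%@", VisibleWhitespace(completeTrimmed)) would then show at a glance whether consecutive newlines are separated by spaces.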
+0

You may need to use the stringByReplacingCharactersInRange method to remove the extra whitespace from the final string. – iDev

+0

I have already tried 'completeTrimmed = [completeTrimmed stringByReplacingOccurrencesOfString:@" " withString:@""];', but it does nothing – Mahir

+0

There should be about 15 spaces inside the first @" ", but the comment formatting automatically collapses them to 1 space – Mahir

Answers

1

I tried reproducing the problem above, and this is how I fixed it:

NSString *retString = [result stringByDecodingHTMLEntities]; 

retString = [retString stripDuplicateCharactersInSet:[NSCharacterSet whitespaceCharacterSet] withString:@" "]; 
retString = [retString stripDuplicateCharactersInSet:[NSCharacterSet newlineCharacterSet] withString:@"\n"]; 

// Under manual reference counting, retain only the final string so it survives [pool drain] 
[retString retain]; 
[result release]; 

I declared the method in a category on NSString as:

- (NSString *)stripDuplicateCharactersInSet:(NSCharacterSet *)characterSet withString:(NSString *)joiningString; 

The implementation is as follows:

- (NSString *)stripDuplicateCharactersInSet:(NSCharacterSet *)characterSet withString:(NSString *)joiningString { 

    NSMutableString *originalStr = [NSMutableString string]; 

    if (!self) { 
     return nil; 
    } 

    NSArray *componentsArray = [self componentsSeparatedByCharactersInSet:characterSet]; 

    int counter = 0; 
    for (NSString *stringComponent in componentsArray) { 

     counter ++; 

     if ((stringComponent) && ([stringComponent length] > 0) && (![stringComponent isEqualToString:@" "]) && ((![stringComponent isEqualToString:@"\n"]) || (![joiningString isEqualToString:@"\n"]))) { 

      if ([componentsArray count] == counter) { 
       [originalStr appendFormat:@"%@", stringComponent];     
      } else { 
       [originalStr appendFormat:@"%@%@", stringComponent, joiningString]; 
      } 
     } 
    } 

    return originalStr; 
} 

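Add the above method to the NSString+HTML.m file as a category on NSString. Basically, in the HTML you provided, spaces and newlines are interleaved many times, so trying to strip the newlines alone does not work. I therefore split the string on the character set, skip any component that is empty or is just a space or a newline, and append the remaining components to the result string, which removes the duplicate newlines and spaces shown above.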

Alternatively, you can try:

NSString *retString = [result stringByDecodingHTMLEntities]; 

// Under manual reference counting, retain only the final string so it survives [pool drain] 
retString = [[retString stripDuplicateNewlineCharacters] retain]; 
[result release]; 

The method is defined as:

- (NSString *)stripDuplicateNewlineCharacters { 

    NSMutableString *originalStr = [NSMutableString string]; 

    if (!self) { 
     return nil; 
    } 

    NSArray *componentsArray = [self componentsSeparatedByCharactersInSet:[NSCharacterSet newlineCharacterSet]]; 

    int counter = 0; 
    for (NSString *stringComponent in componentsArray) { 

     counter ++; 

     stringComponent = [stringComponent stringByReplacingOccurrencesOfString:@" " withString:@"<#$%$#>"]; 
     stringComponent = [stringComponent stringByReplacingOccurrencesOfString:@"<#$%$#><#$%$#>" withString:@"<#$%$#>"]; 
     stringComponent = [stringComponent stringByReplacingOccurrencesOfString:@"<#$%$#>" withString:@" "]; 

     if ((stringComponent) && ([stringComponent length] > 0) && (![stringComponent isEqualToString:@" "]) && (![stringComponent isEqualToString:@"\n"])) { 

      if ([componentsArray count] == counter) { 
       [originalStr appendFormat:@"%@", stringComponent]; 
      } else { 
       [originalStr appendFormat:@"%@\n", stringComponent]; 
      } 
     } 
    } 

    return originalStr; 
} 

In this case, the duplicate spaces are removed inside the method itself, at the same time as the newline characters are removed.
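
Note that the placeholder substitution above replaces each pair of adjacent spaces only once, so a run of four spaces still leaves two. As a hedged alternative, here is a minimal sketch of the common split, filter, and join idiom, which collapses a run of any length. The helper name CollapseRuns is hypothetical and is not part of NSString+HTML; it also drops leading and trailing separators, which may or may not be what you want.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of NSString+HTML): collapses every run of
// characters from `set` into a single `joiner`, and drops leading and
// trailing runs entirely.
static NSString *CollapseRuns(NSString *input, NSCharacterSet *set, NSString *joiner) {
    // Splitting on a set yields an empty string between adjacent separators.
    NSArray *parts = [input componentsSeparatedByCharactersInSet:set];
    NSMutableArray *kept = [NSMutableArray arrayWithCapacity:[parts count]];
    for (NSString *part in parts) {
        if ([part length] > 0) {
            [kept addObject:part]; // keep only the non-empty pieces
        }
    }
    return [kept componentsJoinedByString:joiner];
}
```

For example, CollapseRuns(text, [NSCharacterSet whitespaceCharacterSet], @" ") squeezes runs of spaces and tabs while leaving newlines alone, because whitespaceCharacterSet does not include line breaks.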

+0

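Now it is one long paragraph. I tried removing the 'stripDuplicateCharacters' call for whitespace so that only the newline call remains, and the result is one extra blank line after paragraph 1 and correct spacing after paragraph 2; the remaining paragraphs each start on a new line, but with no blank line between them. – Mahir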

+0

I edited the newline method and now all the paragraphs look fine except the first, which has one extra blank line. I removed the condition '(![stringComponent isEqualToString:@" "])' – Mahir

+0

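@Mahir, that is strange, because it works for me. Could you try adding 'retString = [retString stringByReplacingOccurrencesOfString:@"\n" withString:@"**#NEWLINE#**"];' before printing to the console, to see how many newlines there are between the paragraphs? – iDev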

4

Try this:

    // Decode HTML entities, trim, and retain the result before draining the pool; 
    // draining first would deallocate the autoreleased string under manual reference counting 
    NSString *retString = [result stringByDecodingHTMLEntities]; 
    retString = [[retString stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]] retain]; 
    [result release]; 

    // Drain 
    [pool drain]; 

    // Return 
    return [retString autorelease]; 
} 

If the above does not work, also try

completeTrimmed = [completeTrimmed stringByReplacingOccurrencesOfString:@"\n" withString:@""]; 

and

completeTrimmed = [completeTrimmed stringByReplacingOccurrencesOfString:@"\r" withString:@""]; 
+0

Still nothing... – Mahir

+0

In case it helps, the source HTML comes from articles like this one: http://www.laloyolan.com/news/after-defeat-cloudy-future-ahead-for-mitt-romney/article_63f50e24-294e-11e2-a963-001a4bcf6878.html – Mahir

+0

I am using the body paragraphs – Mahir

2

You can replace @"\n\n" with @"\n" to reduce the number of newlines.

+0

I tried 'stringByReplacingString:@"\n\n" with:@"\n"' but nothing changed – Mahir

+0

I realized it did not work because the consecutive \n characters are separated by spaces – Mahir

+0

Glad you found the problem. So you can replace "\n\n" with "\n" – Darren
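
Because the newlines in this input are separated by spaces, a plain "\n\n" search never matches. One possible way around that, sketched here under assumptions, is a regular-expression replace that treats any whitespace run containing at least one newline as a single paragraph break. The helper name NormalizeParagraphBreaks is hypothetical, and the choice of a blank line ("\n\n") as the separator is an assumption. The NSRegularExpressionSearch option requires iOS 3.2 or OS X 10.7 and later.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of NSString+HTML): replaces every whitespace
// run that contains at least one line feed with exactly one blank line.
// Runs of spaces without a newline (ordinary word gaps) are left untouched.
static NSString *NormalizeParagraphBreaks(NSString *input) {
    // [ \t\r]*  optional spaces, tabs, or carriage returns before the newline
    // \n        the first line feed in the run
    // \s*       any further whitespace, including more line feeds
    NSString *pattern = @"[ \\t\\r]*\\n\\s*";
    return [input stringByReplacingOccurrencesOfString:pattern
                                            withString:@"\n\n"
                                               options:NSRegularExpressionSearch
                                                 range:NSMakeRange(0, [input length])];
}
```

A trailing newline at the end of the input would also become "\n\n", so it may be worth trimming with whitespaceAndNewlineCharacterSet first.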