2017-01-01 75 views
0

Where do the spaces in this RDD come from? The goal is to transform the integers stored in a file:

1 2 3 
4 5 6 
7 8 9 

into three arrays, so that math operations can be performed on them.

Expected

[[1, 2, 3], [4, 5, 6], [7, 8, 9]] 

Actual

[[u'1', u' ', u'2', u' ', u'3'], [u'4', u' ', u'5', u' ', u'6'], [u'7', u' ', u'8', u' ', u'9']] 

Code

txt = sc.textFile("integers.txt") 
print txt.collect() 
#[u'1 2 3', u'4 5 6', u'7 8 9'] 

pairs = txt.map(lambda x: x.split(' ')) 
print pairs.collect() 
#[[u'1', u'2', u'3'], [u'4', u'5', u'6'], [u'7', u'8', u'9']] 

pairs = txt.map(lambda x: [s for s in x]) 
print pairs.collect() 
#[[u'1', u' ', u'2', u' ', u'3'], [u'4', u' ', u'5', u' ', u'6'], [u'7', u' ', u'8', u' ', u'9']] 

Answers

2

The problem seems to be that the numbers are unicode strings rather than ints. You can convert them to int to fix it (see https://docs.python.org/2/library/functions.html#int):

>>> pairs = txt.map(lambda x: x.split(' ')) 
>>> print pairs.collect() 
[[u'1', u'2', u'3'], [u'4', u'5', u'6'], [u'7', u'8', u'9']] 

>>> pairs2 = pairs.map(lambda x: [int(s) for s in x]) 
>>> print pairs2.collect() 
[[1, 2, 3], [4, 5, 6], [7, 8, 9]] 
>>> 
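
For anyone who wants to see the difference without a Spark context, here is a minimal plain-Python sketch. The `lines` list is a hypothetical stand-in for what `sc.textFile("integers.txt")` yields, and `to_chars` / `to_ints` are illustrative names that do not appear in the original question.

```python
# Hypothetical stand-in for the lines that sc.textFile("integers.txt") yields.
lines = ["1 2 3", "4 5 6", "7 8 9"]


def to_chars(line):
    # Iterating over a string yields one character at a time, spaces included.
    return [s for s in line]


def to_ints(line):
    # Split on single spaces first, then convert each token to int.
    return [int(s) for s in line.split(' ')]


# A list comprehension plays the role of RDD.map here.
chars = [to_chars(x) for x in lines]
ints = [to_ints(x) for x in lines]

print(chars[0])
print(ints)
```

With Spark, the same functions could presumably be passed straight to the RDD, e.g. `txt.map(to_ints)`.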
-2
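pairs = txt.map(lambda x: x.split(' ')) 
# This returns each run of characters separated by a space ' '. It is roughly equivalent to the following function, except that the function drops the empty strings that split(' ') produces for consecutive spaces. (textFile has already split the file into lines, so the lambda receives one line at a time.) 
def AFunc(aString): 
    returnArray = [] 
    tempString = "" 
    for char in aString: 
        if char == ' ': 
            if tempString != "": 
                returnArray.append(tempString) 
                tempString = "" 
        else: 
            tempString += char 
    # Flush the last token, which has no space after it. 
    if tempString != "": 
        returnArray.append(tempString) 
    return returnArray 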


# .. 
pairs = txt.map(lambda x: [s for s in x]) 
# This returns every character in the string, spaces included, which is where the u' ' entries come from. It is roughly equivalent to the following function. 
def BFunc(aString): 
    returnArray = [] 
    for char in aString: 
        returnArray.append(char) 
    return returnArray 
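
As a side note, the choice of separator matters when the input is not cleanly single-spaced. The sketch below uses a made-up line with a double space and a trailing space (an assumption, the question's file may not look like this) to compare `split(' ')`, `split()` with no argument, and per-character iteration. `safe_ints` is a hypothetical helper name.

```python
# Hypothetical input: a double space in the middle and a trailing space.
line = "1  2 3 "

# Splitting on a literal space keeps an empty string for every extra space.
by_space = line.split(' ')

# Splitting with no argument collapses runs of whitespace and trims the ends.
by_whitespace = line.split()

# list(line) is the same as [s for s in line]: one entry per character.
chars = list(line)


def safe_ints(text):
    # int('') raises ValueError, so split() is the safer choice before int().
    return [int(s) for s in text.split()]


print(by_space)
print(by_whitespace)
print(safe_ints(line))
```

If the file might contain stray whitespace, mapping with `split()` rather than `split(' ')` avoids passing empty strings to `int`.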

http://www.python-course.eu/lambda.php