
So far I have been trying to implement fit_generator for sentiment analysis, because I only have a small GPU and a large dataset. However, I keep getting this error from Keras fit_generator(): the input arrays should have the same number of samples as the target arrays.

Using Theano backend. 
Can not use cuDNN on context None: cannot compile with cuDNN. We got this error: 
b'In file included from C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v8.0\\include/driver_types.h:53:0,\r\n     from C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v8.0\\include/cudnn.h:63,\r\n     from C:\\Users\\Def\\AppData\\Local\\Temp\\try_flags_p2iwer2o.c:4:\r\nC:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v8.0\\include/host_defines.h:84:0: warning: "__cdecl" redefined\r\n #define __cdecl\r\n ^\r\n<built-in>: note: this is the location of the previous definition\r\nd000029.o:(.idata$5+0x0): multiple definition of `__imp___C_specific_handler\'\r\nd000026.o:(.idata$5+0x0): first defined here\r\nC:/Users/Def/Anaconda3/envs/Final/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/lib/../lib/crt2.o: In function `__tmainCRTStartup\':\r\nC:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:285: undefined reference to `_set_invalid_parameter_handler\'\r\ncollect2.exe: error: ld returned 1 exit status\r\n' 
Mapped name None to device cuda: GeForce GTX 960M (0000:01:00.0) 
Epoch 1/10 
Traceback (most recent call last): 
    File "C:/Users/Def/PycharmProjects/KerasUkExpenditure/TweetParsing.py", line 136, in <module> 
    epochs=10) 
    File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\legacy\interfaces.py", line 88, in wrapper 
    return func(*args, **kwargs) 
    File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\models.py", line 1097, in fit_generator 
    initial_epoch=initial_epoch) 
    File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\legacy\interfaces.py", line 88, in wrapper 
    return func(*args, **kwargs) 
    File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\engine\training.py", line 1876, in fit_generator 
    class_weight=class_weight) 
    File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\engine\training.py", line 1614, in train_on_batch 
    check_batch_axis=True) 
    File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\engine\training.py", line 1307, in _standardize_user_data 
    _check_array_lengths(x, y, sample_weights) 
    File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\engine\training.py", line 229, in _check_array_lengths 
    'and ' + str(list(set_y)[0]) + ' target samples.') 
ValueError: Input arrays should have the same number of samples as target arrays. Found 1000 input samples and 1 target samples. 

I have a matrix that is 1000 elements long, since I only have a maximum corpus of 1000 words, which is specified in the Tokenizer().

Then I have the sentiment, which is a 0 for negative or a 1 for positive.
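
To make the shapes concrete, here is a tiny illustration (made-up tweets, not my real data) of what one vectorised tweet and its label look like with num_words=1000:

from keras.preprocessing.text import Tokenizer 

tok = Tokenizer(lower=False, num_words=1000) 
tok.fit_on_texts(["great day", "awful service"])   #made-up example tweets 
features = tok.texts_to_matrix(["great day"])      #shape (1, 1000) 
label = 1                                          #1 = positive, 0 = negative 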

My question is, why am I getting this error? I have tried converting both the data and the labels, but I still receive the same error. Here is my code.

from keras.models import Sequential 
from keras.layers import Dense, Dropout 
from keras.preprocessing.text import Tokenizer 
import numpy as np 
import pandas as pd 
import pickle 
import matplotlib.pyplot as plt 
import re 

""" 
the number of samples (out of the 1 million available) to use; my 960M 2GB can only handle 
about 30,000 or so at the moment, depending on the number of neurons in the 
deep layers and the number of layers. 
""" 
maxSamples = 3000 

#Load the CSV and get the correct columns 
data = pd.read_csv("C:\\Users\\Def\\Desktop\\Sentiment Analysis Dataset1.csv") 
dx = pd.DataFrame() 
dy = pd.DataFrame() 
dy[['Sentiment']] = data[['Sentiment']] 
dx[['SentimentText']] = data[['SentimentText']] 

dataY = dy.iloc[0:maxSamples] 
dataX = dx.iloc[0:maxSamples] 

testY = dy.iloc[maxSamples: maxSamples + 1000] 
testX = dx.iloc[maxSamples: maxSamples + 1000] 


""" 
here I filter the data and clean it up by removing @ tags, hyperlinks and 
also any characters that are not alpha-numeric. 
""" 
def removeTagsAndLinks(dataframe): 
    for x in dataframe.iterrows(): 
     #Removes Hyperlinks 
     x[1].values[0] = re.sub("(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\[email protected]?^=%&/~+#-])?", "", str(x[1].values[0])) 
     #Removes @ tags 
     x[1].values[0] = re.sub("@\\w+", '', str(x[1].values[0])) 
     #keeps only alpha-numeric chars 
     x[1].values[0] = re.sub("\W+", ' ', str(x[1].values[0])) 
    return dataframe 

xData = removeTagsAndLinks(dataX) 
xTest = removeTagsAndLinks(testX) 

""" 
This loop looks for any Tweets shorter than 2 characters and, once one is found, writes the 
index of that Tweet to an array so I can later remove it from the sentiment DataFrame and the 
list of Tweets 
""" 
indexOfBlankStrings = [] 
for index, string in enumerate(xData): 
    if len(string) < 2: 
     indexOfBlankStrings.append(index) 

for row in indexOfBlankStrings: 
    dataY.drop(row, axis=0, inplace=True) 

""" 
This makes a BOW model out of all the tweets then creates a 
vector for each of the tweets containing all the words from 
the BOW model; each vector is the same size because the 
network expects it 
""" 
def vectorise(tokenizer, list): 
    return tokenizer.fit_on_texts(list) 

#Make BOW model and vectorise it 
t = Tokenizer(lower=False, num_words=1000) 
t.fit_on_texts(dataX.iloc[:,0].tolist()) 
t.fit_on_texts(dataX.iloc[:,0].tolist()) 

""" 
Here I'm experimenting with multiple layers whose sizes are the total 
number of words in the corpus divided by successive powers of 2. This 
has given me quite accurate results compared to random guesses for the 
number of neurons. 
""" 
l1 = int(xData.shape[0]/4) #Too big for my GPU 
l2 = int(xData.shape[0]/8) #Too big for my GPU 
l3 = int(xData.shape[0]/16) 
l4 = int(xData.shape[0]/32) 
l5 = int(xData.shape[0]/64) 
l6 = int(xData.shape[0]/128) 


#Make the model 
model = Sequential() 
model.add(Dense(l1, input_dim=xData.shape[1])) 
model.add(Dropout(0.15)) 
model.add(Dense(l2)) 
model.add(Dropout(0.2)) 
model.add(Dense(l3)) 
model.add(Dropout(0.2)) 
model.add(Dense(l4)) 
model.add(Dense(1, activation='relu')) 

#Compile the model 
model.compile(optimizer='RMSProp', loss='binary_crossentropy', metrics=['acc']) 

""" 
This here will use multiple batches to train the model. 
    startIndex: 
     This is the starting index of the array from which you want to 
     start training the network. 
    dataRange: 
     The number of elements used to train the network in each batch, so 
     since dataRange = 1000 this means it goes from 
     startIndex...dataRange, i.e. 0...1000 
    amountOfEpochs: 
     This is kinda self explanatory: the more epochs, the more it 
     is supposed to learn, i.e. the more it updates the optimisation algorithm's numbers 
""" 
amountOfEpochs = 1 
dataRange = 1000 
startIndex = 0 

def generator(tokenizer, data, labels, totalSize=maxSamples, startIndex=0): 
    l = labels.as_matrix() 
    while True: 
     for i in range(startIndex, totalSize): 
      batch_features = tokenizer.texts_to_matrix(xData.iloc[i]) 
      batch_labels = l[i] 
      yield batch_features, batch_labels 

derp = generator(t, data=xData, labels=dataY) 
##This runs the model batch by batch, i.e. load a little, process it, then load a little more 
for amountOfData in range(1000, maxSamples, 1000): 
    #(loss, acc) = model.train_on_batch(x=dim[startIndex:amountOfData], y=np.asarray(dataY.iloc[startIndex:amountOfData])) 
    history = model.fit_generator(generator=generator(tokenizer=t, 
              data=xData, 
              labels=dataY), 
              steps_per_epoch=1, 
              epochs=10) 

Thanks


The problem is that you have 1000 samples in your X input matrix and 1 in your output Y matrix – DJK


But the 1 in the Y matrix is the sentiment. There should only be a 1 or a 0 for each Tweet – Definity

Answer


The problem you are running into is that the number of samples in your input array is not equal to the number of samples in your target array. That means the number of rows in the two matrices does not match. The problem comes from your generator function. You index the data as

batch_labels = l[i] 

which returns only one sample (one matrix row), when it should be something like ...

batch_labels = l[i:i+1000] 
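
For example, a batch-wise generator that always yields the same number of feature rows and label rows could look roughly like this (just a sketch, reusing the tokenizer t, xData and dataY from your code; the batch size of 1000 is only illustrative):

def batch_generator(tokenizer, texts, labels, batch_size=1000): 
    y = labels.as_matrix()        #shape (n_samples, 1) 
    n = len(texts) 
    while True: 
        for start in range(0, n, batch_size): 
            end = min(start + batch_size, n) 
            #texts_to_matrix on a list of strings -> (batch, num_words) matrix 
            x_batch = tokenizer.texts_to_matrix(texts.iloc[start:end, 0].tolist()) 
            y_batch = y[start:end]        #same number of rows as x_batch 
            yield x_batch, y_batch 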

However, there are other problems with your use of fit_generator. You should not call it inside a loop. I don't see how that benefits the program, and calling fit_generator in a loop defeats the purpose of using a generator. The function you would use to train on a single batch of data is

train_on_batch() 

as seen in the docs.
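
For instance, a rough sketch of that batch-by-batch approach (again reusing the names t, model, xData and dataY from your code; the 1000-sample slicing is only an example) would be:

batch_size = 1000 
y = dataY.as_matrix() 
for start in range(0, len(xData), batch_size): 
    end = min(start + batch_size, len(xData)) 
    x_batch = t.texts_to_matrix(xData.iloc[start:end, 0].tolist()) 
    y_batch = y[start:end] 
    #train_on_batch runs a single gradient update on this matched batch 
    loss, acc = model.train_on_batch(x_batch, y_batch) 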
