F ree webcam chat poolse
We also varied the recognition features provided to the techniques, using both character and token n-grams.
For all techniques and features, we ran the same 5-fold cross-validation experiments in order to determine how well they could be used to distinguish between male and female authors of tweets.
The authors do not report the set of slang words, but the non-dictionary words appear to be more related to style than to content, showing that purely linguistic behaviour can contribute information for gender recognition as well.
Gender recognition has also already been applied to Tweets. (2010) examined various traits of authors from India tweeting in English, combining character N-grams and sociolinguistic features like manner of laughing, honorifics, and smiley use.
In this paper, we start modestly, by attempting to derive just the gender of the authors 1 automatically, purely on the basis of the content of their tweets, using author profiling techniques.
For our experiment, we selected 600 authors for whom we were able to determine with a high degree of certainty a) that they were human individuals and b) what gender they were.
Then we describe our experimental data and the evaluation method (Section 3), after which we proceed to describe the various author profiling strategies that we investigated (Section 4). Gender Recognition Gender recognition is a subtask in the general field of authorship recognition and profiling, which has reached maturity in the last decades(for an overview, see e.g. Even so, there are circumstances where outright recognition is not an option, but where one must be content with profiling, i.e.
If no cue is found in a user s profile, no gender is assigned.
The general quality of the assignment is unknown, but in the (for this purpose) rather unrepresentative sample of users we considered for our own gender assignment corpus (see below), we find that about 44% of the users are assigned a gender, which is correct in about 87% of the cases.
With lexical N-grams, they reached an accuracy of 67.7%, which the combination with the sociolinguistic features increased to 72.33%. (2011) attempted to recognize gender in tweets from a whole set of languages, using word and character N-grams as features for machine learning with Support Vector Machines (SVM), Naive Bayes and Balanced Winnow2.
Their highest score when using just text features was 75.5%, testing on all the tweets by each author (with a train set of 3.3 million tweets and a test set of about 418,000 tweets). (2012) used SVMlight to classify gender on Nigerian twitter accounts, with tweets in English, with a minimum of 50 tweets.