Support Vector Machine In Python Using Libsvm Example Of Features

October 26, 2023

I have scraped a lot of ebay titles like this one: Apple iPhone 5 White 16GB Dual-Core and I have manually tagged all of them in this way B M C S NA where B=Brand (Apple) M=Model

Solution 1:

Here's a step-by-step guide to training an SVM on your data and then evaluating it on the same dataset. It's also available at http://nbviewer.ipython.org/gist/anonymous/2cf3b993aab10bf26d5f. At that URL you can also see the intermediate output and the resulting accuracy (it's an IPython notebook).

Step 0: Install dependencies

You need to install the following libraries:

pandas
scikit-learn

From command line:

pip install pandas
pip install scikit-learn

Step 1: Load the data

We will use pandas to load our data. pandas is a library for easily loading data. For illustration, we first save sample data to a csv and then load it.

We will train the SVM with train.csv and get test labels with test.csv

import pandas as pd

train_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1"""

with open('train.csv', 'w') as output:
    output.write(train_data_contents)

train_dataframe = pd.read_csv('train.csv')
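The feature columns in the CSV above are not defined in the text, so the following is only a guess at how they might be derived from a title. It's a minimal sketch, assuming positions are counted in whitespace-separated tokens starting at 0 and that "capitalized" means the first character is uppercase. The function name token_features is made up for illustration.

```python
def token_features(title):
    """Return one feature dict per whitespace-separated token in a title."""
    tokens = title.split()
    rows = []
    for position, token in enumerate(tokens):
        rows.append({
            "token": token,
            # Number of tokens before this one (0 for the first token).
            "distance_from_beginning": position,
            # Number of tokens after this one (0 for the last token).
            "distance_from_end": len(tokens) - 1 - position,
            # 1 if any character in the token is a digit, else 0.
            "contains_digit": int(any(ch.isdigit() for ch in token)),
            # 1 if the first character is an uppercase letter, else 0.
            "capitalized": int(token[0].isupper()),
        })
    return rows


for row in token_features("Apple iPhone 5 White 16GB Dual-Core"):
    print(row)
```

Each dict would become one CSV row once you add the manually assigned class label for that token.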

Step 2: Process the data

We will convert our dataframe into numpy arrays, which is a format that scikit-learn understands.

We also need to convert the labels "B", "M", "C", ... to numbers, because the SVM does not understand strings.

Then we will train an SVM on the data. Note that svm.SVC() uses an RBF kernel by default; pass kernel='linear' or use LinearSVC if you want a linear SVM.

import numpy as np

train_labels = train_dataframe.class_label
labels = list(set(train_labels))
train_labels = np.array([labels.index(x) for x in train_labels])
train_features = train_dataframe.iloc[:,1:]
train_features = np.array(train_features)

print("train labels:")
print(train_labels)
print()
print("train features:")
print(train_features)

We see here that the length of train_labels (5) exactly matches the number of rows in train_features. Each item in train_labels corresponds to a row.
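One thing to watch with the label conversion is that list(set(...)) gives no guaranteed order, so the number assigned to each label can differ between runs. A minimal sketch of a reproducible mapping, assuming sorted order is acceptable (the helper names build_label_index, encode and decode are hypothetical):

```python
def build_label_index(label_names):
    """Map each distinct label to an integer, in sorted order so the result is reproducible."""
    return {name: index for index, name in enumerate(sorted(set(label_names)))}


def encode(label_names, label_index):
    """Turn label strings into their integer ids."""
    return [label_index[name] for name in label_names]


def decode(label_ids, label_index):
    """Turn integer ids (for example, classifier predictions) back into label strings."""
    names_by_id = {index: name for name, index in label_index.items()}
    return [names_by_id[i] for i in label_ids]


label_index = build_label_index(["B", "M", "C", "S", "N"])
print(label_index)
print(encode(["S", "B"], label_index))
print(decode([0, 3], label_index))
```

Building the mapping once from the training labels and reusing it for the test labels keeps both sets of numbers consistent.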

Step 3: Train the SVM

from sklearn import svm

classifier = svm.SVC()
classifier.fit(train_features, train_labels)

Step 4: Evaluate the SVM on some testing data

test_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1
"""

with open('test.csv', 'w') as output:
    output.write(test_data_contents)

test_dataframe = pd.read_csv('test.csv')

test_labels = test_dataframe.class_label
# Reuse the `labels` list built from the training data so each class maps to the same index.
test_labels = np.array([labels.index(x) for x in test_labels])

test_features = test_dataframe.iloc[:,1:]
test_features = np.array(test_features)

results = classifier.predict(test_features)
num_correct = (results == test_labels).sum()
accuracy = num_correct / len(test_labels)
print("model accuracy: {:.1f}%".format(accuracy * 100))

Links & Tips

Example code for how to load LinearSVC: http://scikit-learn.org/stable/modules/svm.html#svm
Long list of scikit-learn examples: http://scikit-learn.org/stable/auto_examples/index.html. I've found these mildly helpful but often confusing myself.
If you find that the SVM is taking a long time to train, try LinearSVC instead: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
Here's another tutorial on getting familiar with machine learning models: http://scikit-learn.org/stable/tutorial/basic/tutorial.html

You should be able to take this code and replace train.csv with your training data, test.csv with your testing data, and get predictions for your test data, along with accuracy results.

Note that since you're evaluating on the same data you trained on, the accuracy will be unusually high.
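To get a more honest accuracy figure, you could hold out part of the tagged data and evaluate only on that part. The following is a rough standard-library sketch of such a split; the function name holdout_split and the 25% test fraction are assumptions made for illustration. scikit-learn offers train_test_split in sklearn.model_selection for the same purpose.

```python
import random


def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train_rows, test_rows)."""
    shuffled = list(rows)
    # A seeded generator makes the split repeatable across runs.
    random.Random(seed).shuffle(shuffled)
    test_size = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[test_size:], shuffled[:test_size]


# Stand-in rows; in practice these would be the tagged feature rows.
rows = list(range(20))
train_rows, test_rows = holdout_split(rows)
print(len(train_rows), len(test_rows))
```

Train on train_rows only, then compute accuracy on test_rows, which the classifier has not seen.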

Solution 2:

I echo the comment of @MarcoPashkov but will try to elaborate on the LibSVM file format. I find the documentation comprehensive yet hard to find; for the Python library, I recommend the README on GitHub.

One important point is that there are two formats: a sparse format, in which features equal to 0 are omitted, and a dense format, in which they are kept. The following two examples, taken from the README, are equivalent.

# Dense data
>>> y, x = [1,-1], [[1,0,1], [-1,0,-1]]
# Sparse data
>>> y, x = [1,-1], [{1:1, 3:1}, {1:-1,3:-1}]

The y variable stores a list of all the categories for the data.

The x variable stores the feature vector.

assert len(y) == len(x), "Both lists should be the same length"
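The relationship between the two representations can be sketched as a pair of small conversion helpers. These are illustrative functions, not part of LibSVM; they assume 1-based feature indices, as in the README example above.

```python
def dense_to_sparse(row):
    """Convert a dense feature list to a {index: value} dict with 1-based indices, dropping zeros."""
    return {index: value for index, value in enumerate(row, start=1) if value != 0}


def sparse_to_dense(row, num_features):
    """Expand a {index: value} dict back into a list, filling missing indices with 0."""
    return [row.get(index, 0) for index in range(1, num_features + 1)]


dense_x = [[1, 0, 1], [-1, 0, -1]]
sparse_x = [dense_to_sparse(row) for row in dense_x]
print(sparse_x)
```

Applied to the dense example, this gives the same dictionaries as the sparse example.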

The format found in the Heart Scale example is a sparse format: the first value is the category, and in each pair that follows, the key is the feature index and the value is the feature value.
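In a LibSVM data file, each example is one line of the form "label index:value index:value ...". Here is a rough sketch of writing and reading such lines by hand. The helper names are hypothetical; svmutil also provides svm_read_problem for loading a whole file, which is probably what you would use in practice.

```python
def format_libsvm_line(label, features):
    """Serialize one example as '<label> <index>:<value> ...' with ascending indices."""
    pairs = " ".join("{}:{}".format(i, features[i]) for i in sorted(features))
    return "{} {}".format(label, pairs).strip()


def parse_libsvm_line(line):
    """Parse one line back into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for pair in parts[1:]:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return label, features


line = format_libsvm_line(1, {3: 1, 1: 1})
print(line)
print(parse_libsvm_line(line))
```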

The Sparse format is incredibly useful while using a Bag of Words Representation for your feature vector.

As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).
For instance a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
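To make the bag-of-words case concrete, here is a small sketch that turns documents into sparse count dictionaries. The two documents are made-up toy inputs, and the helper names and the simple lowercase whitespace tokenization are assumptions for illustration.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign each distinct lowercase word a 1-based index, in sorted order."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words, start=1)}


def to_sparse_counts(document, vocabulary):
    """Return {word_index: count} for the words of one document; absent words are simply omitted."""
    counts = Counter(document.lower().split())
    return {vocabulary[word]: count for word, count in counts.items() if word in vocabulary}


documents = ["apple iphone white", "samsung galaxy white white"]
vocabulary = build_vocabulary(documents)
vectors = [to_sparse_counts(doc, vocabulary) for doc in documents]
print(vocabulary)
print(vectors)
```

Each dictionary only stores the words that actually occur in that document, which is why the sparse format stays small even when the vocabulary is large.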

For an example using the feature vector you started with, I trained a basic LibSVM 3.20 model. This code isn't meant to be used but may help in showing how to create and test a model.

from collections import namedtuple
# Using namedtuples for descriptive purposes, in actual code a normal tuple would work fine.
Category = namedtuple("Category", ["index", "name"])
Feature = namedtuple("Feature", ["category_index", "distance_from_beginning", "distance_from_end", "contains_digit", "capitalized"])

# Separate up the set of categories, libsvm requires a numerical index so we associate each with an index.
categories = dict()
for index, name in enumerate("B M C S NA".split(' ')):
    # LibSVM expects index to start at 1, not 0.
    categories[name] = Category(index + 1, name)
categories

Out[0]: {'B': Category(index=1, name='B'),
   'C': Category(index=3, name='C'),
   'M': Category(index=2, name='M'),
   'NA': Category(index=5, name='NA'),
   'S': Category(index=4, name='S')}

# Faked set of CSV input for example purposes.
csv_input_lines = """category_index,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
NA,12,0,0,1""".split("\n")
# We just ignore the header.
header = csv_input_lines[0]

# A list of Feature namedtuples, this will be trained as lists.
features = list()
for line in csv_input_lines[1:]:
    split_values = line.split(',')
    # Create a Feature with the values converted to integers.
    features.append(Feature(categories[split_values[0]].index, *map(int, split_values[1:])))

features

Out[1]: [Feature(category_index=1, distance_from_beginning=1, distance_from_end=10, contains_digit=1, capitalized=0),
 Feature(category_index=2, distance_from_beginning=10, distance_from_end=1, contains_digit=0, capitalized=1),
 Feature(category_index=3, distance_from_beginning=2, distance_from_end=3, contains_digit=0, capitalized=1),
 Feature(category_index=4, distance_from_beginning=23, distance_from_end=2, contains_digit=0, capitalized=0),
 Feature(category_index=5, distance_from_beginning=12, distance_from_end=0, contains_digit=0, capitalized=1)]

# y holds the category index of each Feature, in the same order as the features list.
y = [f.category_index for f in features]
# x holds the feature vectors: every value of each namedtuple except the category, which is at index 0.
x = [list(f)[1:] for f in features]

from svmutil import svm_parameter, svm_problem, svm_train, svm_predict
# Barebones defaults for SVM
param = svm_parameter()
# The (Y,X) parameters should be the train dataset.
prob = svm_problem(y, x)
model = svm_train(prob, param)

# For actual accuracy checking, the (Y,X) parameters should be the test dataset.
p_labels, p_acc, p_vals = svm_predict(y, x, model)

Out[3]: Accuracy = 100% (5/5) (classification)
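Beyond the single accuracy number, it can help to see which categories get confused with each other. The sketch below works on plain lists of true and predicted labels, such as y and p_labels from the call above. The helper names and the example predictions are made up for illustration.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / float(len(true_labels))


def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) pairs; keys whose two entries differ are the mistakes."""
    return Counter(zip(true_labels, predicted_labels))


# Hypothetical predictions in which category 5 is mistaken for category 4.
true_labels = [1, 2, 3, 4, 5]
predicted_labels = [1, 2, 3, 4, 4]
print(accuracy(true_labels, predicted_labels))
print(confusion_counts(true_labels, predicted_labels))
```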

I hope this example helps. It is meant only as an illustration and shouldn't be used for your actual training, because it is inefficient.

alezinhacris