Python - Remove Extended Ascii

May 30, 2024

Okay, so I am new to the whole Python world, so bear with me. Background: We are trying to offload logs into Mongo so that we can query and search them more quickly. The device alrea

Solution 1:

Encode the string to bytes (UTF-8 by default), then decode those bytes as ASCII while ignoring anything that is not valid ASCII:

data.encode().decode('ascii',errors='ignore')
# {"id":"xxx","timestamp":xxx,...}}

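You can also use a regular expression to remove every character that falls outside the top-level curly braces:

import re

re.sub(r'^[^{]*(?={)|(?<=})[^}{]*(?={)|(?<=})[^}]*$', '', data)

Unlike the first approach, this incidentally also removes the stray ASCII 'C' character that you do not want. Be aware that the pattern deletes any text sitting between a '}' and the next '{', so it assumes that a nested object is never followed by a sibling nested object inside the same record.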
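As a quick check of the two approaches in this solution, here is a minimal sketch run against a shortened sample. The sample string and the names ascii_only and braces_only are made up for illustration and are not the original log data:

```python
import re

# Hypothetical sample: two JSON objects surrounded by non-ASCII garbage
# plus one stray ASCII 'C' between them.
data = '¾ïúÀï{"id":"a","n":1}ÂCº¾ï{"id":"b","n":2}'

# Approach 1: drop every non-ASCII character. The ASCII 'C' survives.
ascii_only = data.encode().decode('ascii', errors='ignore')
print(ascii_only)   # {"id":"a","n":1}C{"id":"b","n":2}

# Approach 2: drop everything before the first '{', between '}' and '{',
# and after the last '}'. The 'C' is removed as well.
braces_only = re.sub(r'^[^{]*(?={)|(?<=})[^}{]*(?={)|(?<=})[^}]*$', '', data)
print(braces_only)  # {"id":"a","n":1}{"id":"b","n":2}
```

Neither result is a single valid JSON document yet; the objects still have to be split apart before they can be parsed one by one.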

Solution 2:

import re

# Named "data" rather than "str" so the built-in str type is not shadowed.
data='¾ïúÀï{"id":"xxx","timestamp":xxx,"payloadType":"xxx","payload":{"protocol":"xxx","zoneID":xxx,"zoneName":"xxx","eventType":"xxx"}}’ÂCº¾ïúÀï{"id":"xxx","timestamp":xxx,"payloadType":"xxx","payload":{"protocol":"xxx","zoneID":xxx,"zoneName":"xxx","eventType":"xx}}'

# Remove every character outside the 7-bit ASCII range.
data=re.sub('[^\x00-\x7F]','',data)
print(data)

The result should be the following string:

'{"id":"xxx","timestamp":xxx,"payloadType":"xxx","payload":{"protocol":"xxx","zoneID":xxx,"zoneName":"xxx","eventType":"xxx"}}C{"id":"xxx","timestamp":xxx,"payloadType":"xxx","payload":{"protocol":"xxx","zoneID":xxx,"zoneName":"xxx","eventType":"xx}}'
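If the same filter is needed in several places, it can be wrapped in a small function with a precompiled pattern. This is only a sketch: the names NON_ASCII and strip_non_ascii and the sample string are assumptions for illustration, and str.isascii() requires Python 3.7 or later:

```python
import re

# Matches any character outside the 7-bit ASCII range (code points 0-127).
NON_ASCII = re.compile(r'[^\x00-\x7F]')

def strip_non_ascii(text):
    """Return text with every non-ASCII character removed."""
    return NON_ASCII.sub('', text)

sample = '¾ïúÀï{"id":"xxx"}’ÂCº'  # hypothetical sample
cleaned = strip_non_ascii(sample)
print(cleaned)            # {"id":"xxx"}C
print(cleaned.isascii())  # True
```

As in the output above, the ASCII 'C' is kept because it lies inside the allowed range.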

Solution 3:

What about something like:

import string

cleaned_string = ''
for char in ugly_string:
    if char in string.printable:
        cleaned_string += char

This question also deals with a similar problem.
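The same filter can be written with str.join and a set lookup, which avoids rebuilding the string on every iteration. This is a sketch only; keep_printable, PRINTABLE and the sample value are hypothetical names introduced here:

```python
import string

# Build the lookup set once so each membership test is a hash lookup.
PRINTABLE = set(string.printable)

def keep_printable(text):
    """Return text with every character outside string.printable removed."""
    return ''.join(ch for ch in text if ch in PRINTABLE)

ugly_string = '¾ïúÀï{"id":"a"}\x00\x07Cº'  # hypothetical sample
cleaned_string = keep_printable(ugly_string)
print(cleaned_string)  # {"id":"a"}C
```

Note that string.printable contains only ASCII letters, digits, punctuation and whitespace, so this also drops ASCII control characters such as \x00, which a plain non-ASCII filter would keep.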

Solution 4:

If the garbage bytes do not contain opening curly braces, you can do something like this:

from json import JSONDecoder

def decode_all(data):
    decoder = JSONDecoder()
    end_index = 0

    while data:
        try:
            data = data[data.index('{', end_index):]
        except ValueError:
            break

        obj, end_index = decoder.raw_decode(data)
        yield obj

Otherwise, without knowing what those garbage bytes can contain and whether or not your JSON is pure ASCII, I think the best solution would be to try parsing a JSON-encoded object out of your data over and over and skipping the garbage bytes implicitly:

from json import JSONDecoder

data = '''¾ïúÀï{"id":"123","timestamp":123,"payloadType":"123","payload":{"protocol":"123","zoneID":123,"zoneName":"123","eventType":"123"}}’ÂCº¾ïúÀï{"id":"123","timestamp":123,"payloadType":"123","payload":{"protocol":"123","zoneID":123,"zoneName":"123","eventType":"xx"}}'''

def decode_all(data):
    decoder = JSONDecoder()

    while data:
        try:
            obj, end_index = decoder.raw_decode(data)
            data = data[end_index:]

            yield obj
        except ValueError:
            end_index = None

        start = data.find('{')

        if start == -1:
            break
        elif end_index is None and start == 0:
            start = 1

        data = data[start:]

for o in decode_all(data):
    print(o)
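To illustrate why retrying at the next '{' matters, the sketch below restates the same idea in compact form and runs it on a sample whose garbage contains a stray opening brace. The function name iter_json_objects and the sample string are assumptions made for this example; it tracks a position in the text instead of re-slicing the remaining data on every pass:

```python
from json import JSONDecoder

def iter_json_objects(text):
    """Yield each JSON object found in text, skipping anything that fails to parse."""
    decoder = JSONDecoder()
    pos = text.find('{')
    while pos != -1:
        try:
            # raw_decode parses one value at the start of the string and
            # reports how many characters it consumed.
            obj, length = decoder.raw_decode(text[pos:])
        except ValueError:
            # Not a valid object start (for example a stray brace): try the next '{'.
            pos = text.find('{', pos + 1)
            continue
        yield obj
        # Resume the search right after the object just decoded.
        pos = text.find('{', pos + length)

# Hypothetical sample: the leading garbage contains a stray '{'.
sample = 'ï{À{"id":"a","payload":{"zoneID":1}}ÂCº{"id":"b"}¾'
objects = list(iter_json_objects(sample))
print(objects)  # [{'id': 'a', 'payload': {'zoneID': 1}}, {'id': 'b'}]
```

Each yielded value is an ordinary dict, so it can be passed on to whatever loads the records into the database.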

Solution 5:

I used exifread. In short, use .printable to get a string of the value. Here is the code to get the value of some select tags and pack them in a dictionary called context:

import exifread

photo_file = '/absolute/path/photo.jpg'

# Open in binary mode so exifread can read the raw EXIF data.
with open(photo_file, 'rb') as f:
    tags = exifread.process_file(f)

select_tags = ['Image Make', 'Image Model', 'EXIF DateTimeOriginal', 'EXIF ExposureTime']
context = dict()
for tag in select_tags:
    # .printable holds the human-readable string form of the tag value.
    context[tag] = tags[tag].printable
print(context)

alezinhacris