API Reference

An overview of chirptext modules.

Chirp Text - Minimalist Text Processing Library

Enhanced IO module

Chirptext’s enhanced IO functions

chirptext.chio.is_file(path)[source]

Check if path is a path to an existing file

chirptext.chio.iter_csv_stream(input_stream, fieldnames=None, sniff=False, *args, **kwargs)[source]

Read CSV content as a table (list of lists) from an input stream

chirptext.chio.process_file(path, processor, encoding='utf-8', mode='rt', *args, **kwargs)[source]

Process a text file’s content. If the file name ends with .gz, read it as a gzip file.

chirptext.chio.read(path, encoding='utf-8', *args, **kwargs)

Read text file content. If the file name ends with .gz, read it as gzip file. If mode argument is provided as ‘rb’, content will be read as byte stream. By default, content is read as text (string).

# Read content as text
>>> txt = chio.read_file("sample.txt")
# Read content as binary (bytes)
>>> bin = chio.read_file("sample.dat.gz", mode="rb")

Parameters

encoding – defaulted to UTF-8. Will be ignored if reading mode is ‘rb’

chirptext.chio.read_csv(path, fieldnames=None, sniff=True, encoding='utf-8', *args, **kwargs)[source]

Read CSV rows as a table from a file. By default, csv.reader() will be used and the output will be a list of lists. If fieldnames is provided, DictReader will be used and the output will be a list of OrderedDict instead. CSV sniffing (dialect detection) is enabled by default; set sniff=False to switch it off.
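The behaviour described above (optional dialect sniffing, then csv.reader() or DictReader depending on fieldnames) can be sketched with the standard library alone. This is an illustration under assumptions, not chirptext's implementation: the helper name rows_from_stream and the 1024-character sample size are hypothetical.

```python
import csv
import io


def rows_from_stream(stream, fieldnames=None, sniff=True):
    """Hypothetical helper: read CSV rows, optionally detecting the dialect."""
    dialect = "excel"
    if sniff:
        # Inspect a sample, then rewind so the reader sees the whole stream
        sample = stream.read(1024)
        stream.seek(0)
        try:
            dialect = csv.Sniffer().sniff(sample)
        except csv.Error:
            pass  # detection failed, keep the default dialect
    if fieldnames:
        reader = csv.DictReader(stream, fieldnames=fieldnames, dialect=dialect)
    else:
        reader = csv.reader(stream, dialect=dialect)
    return list(reader)


data = "a;b;c\n1;2;3\n4;5;6\n"
as_lists = rows_from_stream(io.StringIO(data))
as_dicts = rows_from_stream(io.StringIO(data), fieldnames=["x", "y", "z"])
```

Note that when fieldnames is given, DictReader treats the first line as data rather than as a header.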

chirptext.chio.read_csv_iter(path, fieldnames=None, sniff=True, mode='rt', encoding='utf-8', *args, **kwargs)[source]

Iterate through CSV rows in a file. By default, csv.reader() will be used and each row will be a list. If fieldnames is provided, DictReader will be used and each row will be an OrderedDict instead. CSV sniffing (dialect detection) is enabled by default; set sniff=False to switch it off.

chirptext.chio.read_file(path, encoding='utf-8', *args, **kwargs)[source]

Read text file content. If the file name ends with .gz, read it as gzip file. If mode argument is provided as ‘rb’, content will be read as byte stream. By default, content is read as text (string).

# Read content as text
>>> txt = chio.read_file("sample.txt")
# Read content as binary (bytes)
>>> bin = chio.read_file("sample.dat.gz", mode="rb")

Parameters

encoding – defaulted to UTF-8. Will be ignored if reading mode is ‘rb’
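The extension-based gzip handling and the text/binary switch described above can be sketched with the standard library. This is a minimal illustration, not the library's own code; the helper name read_text_or_gzip is hypothetical.

```python
import gzip
import os
import tempfile


def read_text_or_gzip(path, mode="rt", encoding="utf-8"):
    """Hypothetical helper: read a file, using gzip when the name ends with .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    if "b" in mode:
        # Binary mode: the encoding is ignored and bytes are returned
        with opener(path, "rb") as infile:
            return infile.read()
    with opener(path, "rt", encoding=encoding) as infile:
        return infile.read()


# Round-trip a small gzip file in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as outfile:
        outfile.write("hello chirp")
    as_text = read_text_or_gzip(sample_path)
    as_bytes = read_text_or_gzip(sample_path, mode="rb")
```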

chirptext.chio.write(path, content, mode=None, encoding='utf-8')

Write content to a file. If the path ends with .gz, gzip will be used.

chirptext.chio.write_csv(path, rows, dialect='excel', fieldnames=None, quoting=1, extrasaction='ignore', encoding='utf-8', newline='', *args, **kwargs)[source]

Write rows data to a CSV file (with or without fieldnames)

By default content will be written in excel-csv dialect. This can be changed by using the optional argument dialect.

chirptext.chio.write_file(path, content, mode=None, encoding='utf-8')[source]

Write content to a file. If the path ends with .gz, gzip will be used.

chirptext.chio.write_tsv(path, rows, *args, **kwargs)[source]

Write rows data in tab-separated values (TSV) format

By default content will be written in excel-tab dialect. This can be changed by using the optional argument dialect.
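The dialect switch used by the CSV and TSV writers above can be sketched with the csv module. The helper name rows_to_text is hypothetical and writes to an in-memory buffer instead of a file; the default quoting=1 in the signature above corresponds to csv.QUOTE_ALL.

```python
import csv
import io


def rows_to_text(rows, dialect="excel", fieldnames=None, quoting=csv.QUOTE_ALL):
    """Hypothetical helper: serialise rows using a named csv dialect."""
    buffer = io.StringIO(newline="")
    if fieldnames:
        # Rows are dicts; keys not listed in fieldnames are silently dropped
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, dialect=dialect,
                                quoting=quoting, extrasaction="ignore")
        writer.writeheader()
    else:
        writer = csv.writer(buffer, dialect=dialect, quoting=quoting)
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_text([["a", 1], ["b", 2]])
tsv_text = rows_to_text([["a", 1], ["b", 2]], dialect="excel-tab")
dict_text = rows_to_text([{"k": "x", "extra": "dropped"}], fieldnames=["k"])
```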

Text annotation (TTL) module

Text Annotation (texttaglib - TTL) module

Japanese parser

Convenient Japanese text parser that produces results in TTL format

chirptext.deko.analyse(content, splitlines=True, format=None, **kwargs)[source]

Analyse Japanese text and output the result in tokenized, text, or HTML form

chirptext.deko.get_mecab_bin()

Get MeCab binary location

chirptext.deko.set_mecab_bin(location)

Set MeCab binary location

Chinese character radicals

Tools for processing Chinese

class chirptext.sino.Radical(idseq='', radical='', variants='', strokes='', meaning='', pinyin='', hanviet='', hiragana='', romaji='', hangeul='', romaja='', frequency='', simplified='', examples='')[source]

Chinese radical. Source: https://en.wikipedia.org/wiki/Kangxi_radical#Table_of_radicals

Swadesh list

Language profile: UK English

class chirptext.luke.Word(ID, word, score=0, description='', rank=0)[source]

Swadesh word

Vietnamese support functions

Dao Phay: A collection of tools for processing Vietnamese text using Python.

chirptext.daophay.sorted(list_of_strings)[source]

Sort a list of Vietnamese strings

Utilities

Miscellaneous tools for text processing

class chirptext.leutile.AppConfig(name, mode='ini', working_dir='.', extra_potentials=None)[source]

Application configuration helper. This class supports guessing the configuration file location and reads either INI (default) or JSON format.

add_potential(*patterns)[source]

Add a potential config file pattern

property config

Read config automatically if required

property config_path

Path to config file

load(file_path)[source]

Load configuration from a specific file

locate_config()[source]

Locate config file

read_config(key, strict=False, **kwargs)[source]

Read a config by key

Default value can be passed by using the kwarg default

>>> read_config(key, default='my value')

Parameters
  • key – configuration key

  • strict – Set to True to raise KeyError if config key was not set. Defaulted to False

  • default – Optional kwarg to set default value when key could not be found
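The interplay of key, strict, and the default kwarg can be sketched over a plain dictionary. The function name lookup_setting is hypothetical, and letting strict take precedence over default is an assumption of this sketch, not documented behaviour.

```python
def lookup_setting(config, key, strict=False, **kwargs):
    """Hypothetical helper: look up a key with an optional default or strict mode."""
    if key in config:
        return config[key]
    if strict:
        # Assumption: strict wins even when a default is supplied
        raise KeyError(key)
    return kwargs.get("default")


settings = {"name": "chirp"}
found = lookup_setting(settings, "name")
fallback = lookup_setting(settings, "missing", default="my value")
```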

read_file(file_path)[source]

Read a configuration file and return configuration data

class chirptext.leutile.Counter(priority=None, *args, **kwargs)[source]

Powerful counter class

get_report_order()[source]

Keys are sorted based on report order (i.e. some keys are shown first). Related: see sorted_by_count.
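The idea of a report order with priority keys can be sketched on top of collections.Counter. The helper name report_order is hypothetical, and ordering the remaining keys by descending count is an assumption of this sketch.

```python
from collections import Counter


def report_order(counts, priority=None):
    """Hypothetical helper: priority keys first, then the rest by descending count."""
    first = [key for key in (priority or []) if key in counts]
    rest = [key for key, _ in counts.most_common() if key not in first]
    return first + rest


counts = Counter("abracadabra")
ordered = report_order(counts, priority=["d", "c"])
```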

class chirptext.leutile.FileHub(*filenames, working_dir='.', default_mode='a', ext='txt')[source]

A helper class for working with multiple text reports at the same time

class chirptext.leutile.StringTool[source]

Common string functions

class chirptext.leutile.Table(header=True, padding=True, NoneValue=None)[source]

A text-based table which can be used with TextReport

format()[source]

Format table to print out

class chirptext.leutile.Timer(logger=None, report=None)[source]

Measure tasks’ runtime

exec_time()[source]

Calculate run time

class chirptext.leutile.Value(value=None)[source]

Value holder

chirptext.leutile.hamilton_allocate(numbers, total=100, precision=2)[source]

Use largest remainder (Hamilton) method to make sure rounded percentages add up to 100

>>> hamilton_allocate((33.33, 33.33, 33.33))
[33.34, 33.33, 33.33]
>>> hamilton_allocate((24.99, 24.99, 24.99, 24.99))
[25.0, 25.0, 25.0, 25.0]
>>> hamilton_allocate((76.69, 20.83, 2.49))
[76.69, 20.83, 2.48]
>>> hamilton_allocate([13.626332, 47.989636, 9.596008, 28.788024])
[13.63, 47.99, 9.59, 28.79]
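For readers unfamiliar with the largest remainder method, the following is a minimal textbook sketch. It is not the library's implementation: the function name largest_remainder is hypothetical, total is assumed to be an integer, and inputs are rescaled proportionally before rounding, so results may differ from hamilton_allocate when the inputs do not sum to total.

```python
from fractions import Fraction
from math import floor


def largest_remainder(numbers, total=100, precision=2):
    """Hypothetical sketch: round shares so that they add up to total exactly."""
    scale = 10 ** precision
    target = total * scale  # total expressed in the smallest unit (e.g. 0.01)
    # Exact arithmetic avoids float drift when comparing remainders
    values = [Fraction(str(n)) for n in numbers]
    whole = sum(values)
    quotas = [value * target / whole for value in values]
    units = [floor(quota) for quota in quotas]
    leftover = target - sum(units)
    # Give one extra unit to the entries with the largest fractional parts
    by_remainder = sorted(range(len(quotas)),
                          key=lambda i: quotas[i] - units[i], reverse=True)
    for i in by_remainder[:leftover]:
        units[i] += 1
    return [unit / scale for unit in units]


thirds = largest_remainder((33.33, 33.33, 33.33))
```

Ties between equal remainders go to the earlier entry, because Python's sort is stable.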

chirptext.leutile.header(*msg, level='h1', separator=' ', print_out=<built-in function print>)[source]

Print header block in text mode

chirptext.leutile.is_number(s)[source]

Check if something is a number

class chirptext.leutile.piter(iterable)[source]

Peep-able iterator

fetch(value_obj=None)[source]

Fetch the next two values
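A peep-able (peekable) iterator lets the caller look at the next item without consuming it. The class below is a generic sketch of that pattern, not the piter API; the name PeekableIterator and its peek method are hypothetical.

```python
class PeekableIterator:
    """Hypothetical sketch of an iterator that can look one item ahead."""

    _EMPTY = object()  # sentinel marking an empty look-ahead buffer

    def __init__(self, iterable):
        self._iterator = iter(iterable)
        self._buffer = self._EMPTY

    def peek(self, default=None):
        # Pull one item into the buffer if nothing is buffered yet
        if self._buffer is self._EMPTY:
            self._buffer = next(self._iterator, self._EMPTY)
        return default if self._buffer is self._EMPTY else self._buffer

    def __iter__(self):
        return self

    def __next__(self):
        # Serve the buffered item first, then fall back to the source
        if self._buffer is not self._EMPTY:
            value, self._buffer = self._buffer, self._EMPTY
            return value
        return next(self._iterator)


tokens = PeekableIterator(["a", "b"])
upcoming = tokens.peek()
first = next(tokens)
```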

Command-line applications

Command-line interface helper

class chirptext.cli.CLIApp(desc, add_vq=True, add_tasks=True, **kwargs)[source]

A simple template for command-line interface applications

add_task(task, func=None, **kwargs)[source]

Add a task parser

add_version_func(show_version)[source]

Enable --version and -V to show version information

add_vq(parser)[source]

Add verbose & quiet options

property logger

Lazy logger

run(func=None)[source]

Run the app
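The pattern behind this class (one sub-command per task plus shared verbose and quiet flags) can be sketched directly with argparse. This does not use CLIApp; the names build_parser and greet, and the mapping of flags to log levels, are hypothetical.

```python
import argparse
import logging


def build_parser(description):
    """Hypothetical helper: a parser with verbose/quiet flags and task sub-commands."""
    parser = argparse.ArgumentParser(description=description)
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-v", "--verbose", action="store_true")
    group.add_argument("-q", "--quiet", action="store_true")
    tasks = parser.add_subparsers(dest="task")
    return parser, tasks


def greet(args):
    return "Hello, {}".format(args.name)


parser, tasks = build_parser("Demo app")
greet_task = tasks.add_parser("greet", help="Say hello")
greet_task.add_argument("name")
# Attach the handler so it can be dispatched after parsing
greet_task.set_defaults(func=greet)

args = parser.parse_args(["-v", "greet", "chirp"])
level = logging.DEBUG if args.verbose else logging.ERROR if args.quiet else logging.WARNING
message = args.func(args)
```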

chirptext.cli.config_logging(args)[source]

Override root logger’s level

chirptext.cli.setup_logging(config_path, log_dir=None, force_setup=False, default_level=30, silent=True)[source]

Try to load logging configuration from a file. Set level to INFO if failed.

Parameters
  • config_path – Path to the logging config file (JSON)

  • log_dir – Path to log output directory. When log_dir is not None and the directory does not exist, it will be created automatically.
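Loading a JSON logging configuration with a fallback can be sketched with logging.config.dictConfig. The helper name load_logging_config is hypothetical, and the choice of which exceptions trigger the fallback is an assumption of this sketch.

```python
import json
import logging
import logging.config
import os
import tempfile


def load_logging_config(config_path, log_dir=None, default_level=logging.WARNING):
    """Hypothetical helper: configure logging from a JSON file, with a fallback."""
    if log_dir is not None:
        # Create the log output directory when it does not exist yet
        os.makedirs(log_dir, exist_ok=True)
    try:
        with open(config_path, "rt", encoding="utf-8") as infile:
            logging.config.dictConfig(json.load(infile))
        return True
    except (OSError, ValueError):
        # Missing file, invalid JSON, or invalid config: use a basic setup
        logging.basicConfig(level=default_level)
        return False


with tempfile.TemporaryDirectory() as tmp_dir:
    log_dir = os.path.join(tmp_dir, "logs")
    loaded = load_logging_config(os.path.join(tmp_dir, "missing.json"), log_dir=log_dir)
    log_dir_created = os.path.isdir(log_dir)
```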

Python data mapping functions

Data mapping functions

class chirptext.anhxa.TypedJSONDecoder(type_map=None, **kwargs)[source]

class chirptext.anhxa.TypedJSONEncoder(*args, type_map=None, **kwargs)[source]

default(obj)[source]

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)

class chirptext.anhxa.TypelessSONEncoder(*args, type_map=None, **kwargs)[source]

default(obj)[source]

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)

chirptext.anhxa.dumps(obj, *args, **kwargs)[source]

Dump an object to a JSON string without type information

chirptext.anhxa.flex_update_obj(source, target, __silent, *fields, **field_map)[source]

Pull data from source to target. Target’s __dict__ (object data) will be used by default. Otherwise, it’ll be treated as a dictionary

chirptext.anhxa.to_dict(obj, *fields, **field_map)[source]

Convert an object into a dictionary

chirptext.anhxa.to_obj(cls, obj_data=None, *fields, **field_map)[source]

Use obj_data (dict-like) to construct an object of type cls. Prioritize obj_dict when there are conflicts.
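The object/dictionary mapping idea behind to_dict and to_obj can be sketched as follows. This is not the anhxa implementation: the names Point, object_to_dict, and dict_to_object are hypothetical, and reading field_map as "attribute name to output key" is an assumption.

```python
class Point:
    """Hypothetical data class used only for this illustration."""

    def __init__(self, x=0, y=0):
        self.x = x
        self.y = y


def object_to_dict(obj, *fields, **field_map):
    """Hypothetical helper: copy selected attributes of obj into a dict."""
    # With no explicit selection, export every instance attribute
    if not fields and not field_map:
        fields = tuple(vars(obj))
    data = {name: getattr(obj, name) for name in fields}
    # Assumed meaning of field_map: attribute name -> output key
    data.update({key: getattr(obj, attr) for attr, key in field_map.items()})
    return data


def dict_to_object(cls, data):
    """Hypothetical helper: build a cls instance and copy dict items onto it."""
    obj = cls()
    for name, value in data.items():
        setattr(obj, name, value)
    return obj


exported = object_to_dict(Point(1, 2), "x", y="vertical")
restored = dict_to_object(Point, {"x": 3})
```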