Skip to main content

Data text realfake

The following is a dataset that takes a folder of text, and draws real or fake samples from that folder of text. This means that it will take a folder of text, and draw samples from that folder of text. It will then return a sample of text that is either real or fake. This is useful for training a model to detect real or fake text. This uses a random variable.

Register​

c data.text.realfake register tag=whadup

Serve​


c data.text.realfake serve tag=whadup

Test​


c data.text.realfake test

Sample​


c call data.text.realfake::whadup sample

DataTextRealfake Module Documentation

The DataTextRealfake module is a Python class that provides functionality to generate and manipulate text samples from Python source code files. It can be used to create training data for various machine learning tasks, especially those related to detecting real and fake (synthetic) code snippets. In this documentation, we will walk through the different components of the code and explain their functionality.

Class Definition: DataTextRealfake​

class DataTextRealfake(c.Module):

The DataTextRealfake class is defined, inheriting from the c.Module class (presumably from the commune module). This indicates that DataTextRealfake extends the functionality provided by the base c.Module class.

Initialization: __init__ Method​

def __init__(self, **kwargs):
config = self.set_config(kwargs=kwargs)
self.folder_path = self.resolve_path(config.folder_path)
self.filepaths = sorted([f for f in self.walk(self.folder_path) if f.endswith('.py')])

The __init__ method is the class constructor that initializes an instance of the DataTextRealfake class. It takes keyword arguments kwargs, which presumably allow the user to pass additional configuration parameters.

  • The method first uses the set_config function to process and set the configuration based on the provided keyword arguments.
  • It resolves the folder path specified in the configuration using the resolve_path method.
  • It retrieves a sorted list of file paths (Python source code files) within the specified folder using the walk method. It filters the list to include only files ending with the ".py" extension.

Generating Random Index: random_idx Method​

def random_idx(self):
return self.random_int(0, len(self.filepaths)-1)

The random_idx method generates a random index within the range of valid indices for the filepaths list. It utilizes the random_int method from the base class (likely commune) to achieve this.

Generating a Sample: sample Method​

def sample(self, idx=None, input_chars: int = 500, output_chars: int = 500, start_index: int = 0, real_prob: float = 0.5):

The sample method generates a text sample from a randomly selected Python source code file. It takes several optional parameters:

  • idx: Index of the file to use as the source for the sample. If not provided, a random index is chosen.
  • input_chars: The number of characters to use as the input text for the sample.
  • output_chars: The number of characters to use as the output text for the sample.
  • start_index: The starting index within the selected file for extracting input and output text.
  • real_prob: Probability of selecting a real (non-synthetic) sample.

The method works as follows:

  • If idx is not provided, a random index is generated using the random_idx method.
  • The file path corresponding to the selected index is retrieved.
  • The content of the file is read using the get_text function from the commune module.
  • A suitable starting index within the file's content is determined, ensuring that there's enough content for both input and output.
  • Input and output text bounds are defined based on the starting index and provided character counts.
  • A dictionary called sample is populated with information about the selected sample, including input text, output text, and file path.
  • The real key in the sample dictionary is set based on a random probability check, indicating whether the sample is considered real or synthetic.

If the selected sample is synthetic (not real), a different sample is recursively selected using a probability of 0 for real samples.

Generating Test Samples: test Method​

def test(self, n=100):

The test method is used to generate and test multiple samples. It takes an optional parameter n which specifies the number of samples to generate and test.

  • The method initializes a timer and then iterates over the range of n.
  • For each iteration, a sample is generated using the sample method.
  • The samples_per_second metric is calculated as the number of iterations divided by the time taken.
  • The metric is printed using the print function from the base class.

Parsing Output: parse_output Method​

def parse_output(self, output: dict) -> dict:

The parse_output method is responsible for converting the output of a prediction (presumably obtained from a machine learning model) into binary values (0 or 1) based on certain keywords. It takes a dictionary output as input, presumably containing the result of a model prediction.

  • The method checks the content of the output using case-insensitive comparisons to determine whether it contains keywords like '0', '1', 'yes', or 'no'.
  • If the keywords are found, the method returns 0 or 1 accordingly.
  • If no valid keyword is found, an exception is raised with an error message indicating the invalid output.

Scoring a Model: score Method​

def score(self, model, w: float = 0.0):

The score method evaluates the performance of a given machine learning model using the generated samples. It takes a machine learning model as input, along with an optional weight w.

  • The method attempts to generate a sample using the sample method.
  • It measures the time taken for generating the sample.
  • The model is used to generate an output prediction based on the sample.
  • The parse_output method is used to convert the prediction into binary form (0 or 1).
  • The weight w is then updated to 0.2 (though the purpose of this is not entirely clear from this code snippet).

If the prediction matches the ground truth (i.e., the real/fake label of the sample), the weight w is set to 1. A dictionary containing various metrics and information about the scoring process is returned.