Added support for more models in ReplicateAPI. Completed automation script for Project Gutenberg. Updated README.
parent
ab3228b234
commit
068b4df6ea
|
@ -8,6 +8,9 @@
|
|||
**/empty_alt_text_sample.txt
|
||||
**/book_outputs
|
||||
**/downloaded_books
|
||||
**/results
|
||||
**/alts.txt
|
||||
**/images.txt
|
||||
|
||||
**/keys.py
|
||||
**/vertex-key.json
|
144
README.md
|
@ -1,6 +1,10 @@
|
|||
# Alt-Text
|
||||
|
||||
A PyPi package used for finding, generating, and setting alt-text for images in HTML and EPUB files.
|
||||
A PyPi package used for finding, generating, and setting alt-text for images in HTML files.
|
||||
|
||||
Developed as a Computer Science Senior Design Project at [Stevens Institute of Technology](https://www.stevens.edu/) in collaboration with the [Free Ebook Foundation](https://ebookfoundation.org/).
|
||||
|
||||
[Learn more about the developers](#the-deveolpers).
|
||||
|
||||
## Getting Started
|
||||
|
||||
|
@ -26,14 +30,18 @@ As of the moment, the image analyzation tools that Alt-Text uses are not fully b
|
|||
|
||||
Description Engines are used to generate descriptions of an image. If you are to use one of these, you will need to fulfill that specific Engine's dependencies before use.
|
||||
|
||||
##### ReplicateMiniGPT4API
|
||||
##### ReplicateAPI
|
||||
|
||||
ReplicateMiniGPT4API Engine uses the [Replicate API](https://replicate.com/), hence you will need to get an API key via [Logging in with Github](https://replicate.com/signin) on the Replicate website.
|
||||
ReplicateAPI Engine uses the [Replicate API](https://replicate.com/), hence you will need to get an API key via [Logging in with Github](https://replicate.com/signin) on the Replicate website.
|
||||
|
||||
##### GoogleVertexAPI
|
||||
|
||||
GoogleVertexAPI Engine uses the [Vertex AI API](https://cloud.google.com/vertex-ai), hence you will need to get access from the [Google API Marketplace](https://console.cloud.google.com/marketplace/product/google/aiplatform.googleapis.com). Additionally, Alt-Text uses Service Account Keys to authenticate with Google Cloud, hence you will need to [Create a Service Account Key](https://cloud.google.com/iam/docs/keys-create-delete#creating) with permission for the Vertex AI API and have its corresponding JSON file.
|
||||
|
||||
##### BlipLocal
|
||||
|
||||
The BlipLocal Engine uses a modified version of the [cobanov/image-captioning repository](https://github.com/cobanov/image-captioning), which allows for the use of Blip locally via a CLI. To get started, you must download [this fork](https://github.com/xxmistacruzxx/image-captioning) of the repository and download/install the [BLIP-Large](https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth) checkpoint as described in the README.
|
||||
|
||||
#### OCR Engines
|
||||
|
||||
Optical Character Recognition Engines are used to find text within images. If you are to use one of these, you will need to fulfill that specific Engine's dependencies before use.
|
||||
|
@ -42,9 +50,126 @@ Optical Character Recognition Engines are used to find text within images. If yo
|
|||
|
||||
The Tesseract Engine uses [Tesseract](https://github.com/tesseract-ocr/tesseract), hence you will need to install the [Tesseract OCR](https://tesseract-ocr.github.io/tessdoc/Installation.html).
|
||||
|
||||
#### Language Engines
|
||||
|
||||
Language Engines are used to generate alt-text given an image description (from the [Description Engine](#Description-Engines)), characters found in an image (from the [OCR Engine](#OCR-Engines)), and context within the Ebook. If you are to use one of these, you will need to fulfill that specific Engine's dependencies before use.
|
||||
|
||||
##### OpenAI API
|
||||
|
||||
The OpenAI API Engine gives access to [Open AI's GPT Models via their API](https://platform.openai.com/docs/models). To use this, you will need an [API Key](https://openai.com/blog/openai-api) with access to the appropriate tier (more info on their [pricing page](https://openai.com/pricing)).
|
||||
|
||||
##### PrivateGPT
|
||||
|
||||
The PrivateGPT Engine allows for easy integration with an instance of [PrivateGPT](https://github.com/zylon-ai/private-gpt). To use this, you'll need a running instance of a [PrivateGPT API Server](https://docs.privategpt.dev/overview/welcome/introduction).
|
||||
|
||||
## Quickstart & Usage
|
||||
|
||||
To be added...
|
||||
### Setup
|
||||
|
||||
#### Standard Setup
|
||||
|
||||
The standard setup assumes that you have access to a [Description Engine](#Description-Engines) and [Language Engine](#Language-Engines) (the [OCR Engine](#OCR-Engines) being optional).
|
||||
|
||||
```python
|
||||
from alttext.alttext import AltTextHTML
|
||||
|
||||
alt = AltTextHTML(
|
||||
ReplicateAPI("REPLICATE_KEY"),
|
||||
# Tesseract(),
|
||||
OpenAIAPI("OPENAI_KEY", "gpt-3.5-turbo"),
|
||||
)
|
||||
```
|
||||
|
||||
#### Legacy Setup
|
||||
|
||||
This setup assumes that you have access to a [Description Engine](#Description-Engines) (the [OCR Engine](#OCR-Engines) and [Language Engine](#Language-Engines) being optional).
|
||||
|
||||
```python
|
||||
from alttext.alttext import AltTextHTML
|
||||
|
||||
alt = AltTextHTML(
|
||||
ReplicateAPI("REPLICATE_KEY"),
|
||||
# Tesseract(),
|
||||
# OpenAIAPI("OPENAI_KEY", "gpt-3.5-turbo"),
|
||||
options = {"version": 1}
|
||||
)
|
||||
```
|
||||
|
||||
#### Options
|
||||
|
||||
Below are the default options for the `AltTextHTML` class. You can change these by passing a `dict` into the `options` parameter during instantiation. When passing options, you only need the options you'd like to change from the default values in the `dict`.
|
||||
|
||||
```python
|
||||
DEFOPTIONS = {
|
||||
"withContext": True,
|
||||
"withHash": True,
|
||||
"multiThreaded": True,
|
||||
"version": 2,
|
||||
}
|
||||
```
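
The override behavior described above can be pictured as a plain dictionary merge. The following is only a minimal sketch, assuming the class copies the defaults and layers the passed `dict` on top; the helper name `resolve_options` is hypothetical and is not part of the package.

```python
from typing import Optional

# Default values as listed in the README.
DEFOPTIONS = {
    "withContext": True,
    "withHash": True,
    "multiThreaded": True,
    "version": 2,
}


def resolve_options(overrides: Optional[dict] = None) -> dict:
    """Hypothetical helper: return the defaults with any overrides applied."""
    # Copy first so the shared defaults are never mutated.
    merged = dict(DEFOPTIONS)
    if overrides:
        merged.update(overrides)
    return merged


# Only the changed key needs to be supplied; the rest keep their defaults.
legacy_options = resolve_options({"version": 1})
```

Under this assumption, passing `{"version": 1}` changes only `version`, while `withContext`, `withHash`, and `multiThreaded` stay at their default values.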
|
||||
|
||||
### Basic Usage
|
||||
|
||||
#### Loading an Ebook
|
||||
|
||||
```python
|
||||
# from a file
|
||||
alt.parseFile("/path/to/ebook.html")
|
||||
|
||||
# or from a string
|
||||
alt.parse("<HTML>...</HTML>")
|
||||
```
|
||||
|
||||
#### Getting Images
|
||||
|
||||
```python
|
||||
# getting all images
|
||||
imgs : list[bs4.element.Tag] = alt.getAllImgs()
|
||||
|
||||
# getting all images with no alt attribute or where alt = ""
|
||||
imgs_noalt : list[bs4.element.Tag] = alt.getNoAltImgs()
|
||||
|
||||
# get a specific image by src
|
||||
img : bs4.element.Tag = alt.getImg("path_as_in_html/image.png")
|
||||
```
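
To make the "no alt attribute or where alt = ''" rule concrete, here is a standalone sketch that applies the same rule with only the standard library. It is an illustration, not the package's implementation (which works on `bs4` tags); the names `ImgCollector` and `find_imgs` are hypothetical.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the src of every <img>, and separately those lacking alt-text."""

    def __init__(self) -> None:
        super().__init__()
        self.all_srcs = []
        self.noalt_srcs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        self.all_srcs.append(src)
        # A missing alt attribute yields None; alt="" yields an empty string.
        if not attr_map.get("alt"):
            self.noalt_srcs.append(src)


def find_imgs(html: str):
    """Return (all image srcs, srcs of images with no alt or an empty alt)."""
    collector = ImgCollector()
    collector.feed(html)
    return collector.all_srcs, collector.noalt_srcs


sample = '<p><img src="a.png" alt="A cat"><img src="b.png" alt=""><img src="c.png"/></p>'
all_srcs, noalt_srcs = find_imgs(sample)
```

In this sample, `b.png` (empty alt) and `c.png` (no alt attribute) are the two images that would need generated alt-text.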
|
||||
|
||||
#### Generating Alt-Text
|
||||
|
||||
```python
|
||||
# generate alt-text for a single image by src
|
||||
alt_text : str = alt.genAltText("path_as_in_html/image.png")
|
||||
|
||||
# generate an association from an image tag
|
||||
# example_association = {
|
||||
# "src" : "path_as_in_html/image.png",
|
||||
# "alt" : "generated alt text",
|
||||
# "hash" : 1234
|
||||
# }
|
||||
association : dict = alt.genAssociation(img)  # img : bs4.element.Tag
|
||||
|
||||
# generate a list of associations given a list of image tags
|
||||
associations : list[dict] = alt.genAltAssociations(imgs)  # imgs : list[bs4.element.Tag]
|
||||
```
|
||||
|
||||
#### Setting Alt-Text
|
||||
|
||||
```python
|
||||
# setting alt-text for a single image by src
|
||||
new_img_tag : bs4.element.Tag = alt.setAlt("path_as_in_html/image.png", "new alt")
|
||||
|
||||
# setting alt-text for multiple images given a list of associations
|
||||
new_img_tags : list[bs4.element.Tag] = alt.setAlts(associations)  # associations : list[dict]
|
||||
```
|
||||
|
||||
#### Exporting Current HTML Status
|
||||
|
||||
```python
|
||||
# getting current html as string
|
||||
html : str = alt.export()
|
||||
|
||||
# exporting to a file
|
||||
path : str = alt.exportToFile("path/to/new_html.html")
|
||||
```
|
||||
|
||||
## Our Mission
|
||||
|
||||
|
@ -52,9 +177,9 @@ The Alt-Text project is developed for the [Free Ebook Foundation](https://ebookf
|
|||
|
||||
As Ebooks become a more prominent way to consume written materials, it only becomes more important for them to be accessible to all people. Alternative text (aka alt-text) in Ebooks is used as a way for people to understand images in Ebooks if they are unable to use images as intended (e.g. a visually impaired person using a screen reader to read an Ebook).
|
||||
|
||||
While this feature exists, it is still not fully utilized and many Ebooks lack alt-text in some, or even all, of their images. To illustrate this, the [Gutenberg Project](https://gutenberg.org/), the creator of the Ebook and now a distributor of Public Domain Ebooks, has over 70,000 Ebooks in its collection and of those, there are about 470,000 images without alt-text.
|
||||
While this feature exists, it is still not fully utilized and many Ebooks lack alt-text in some, or even all, of their images. To illustrate this, the [Gutenberg Project](https://gutenberg.org/), the creator of the Ebook and now a distributor of Public Domain Ebooks, has over 70,000 Ebooks in its collection and of those, there are about 470,000 images without alt-text (not including images with insufficient alt-text).
|
||||
|
||||
The Alt-Text project's goal is to use the power of AI, Automation, and the Internet to craft a solution capable of automatically generating descriptions for images lacking alt-text in Ebooks, closing the accessibility gap and improving collections, such as the [Gutenberg Project](https://gutenberg.org/).
|
||||
The Alt-Text project's goal is to use the power of various AI technologies, such as machine vision and large language models, to craft a solution capable of assisting in the creation of alt-text for Ebooks, closing the accessibility gap and improving collections, such as the [Gutenberg Project](https://gutenberg.org/).
|
||||
|
||||
### Contact Information
|
||||
|
||||
|
@ -90,7 +215,7 @@ The emails and relevant information of those involved in the Alt-Text project ca
|
|||
|
||||
## APIs, Tools, & Libraries Used
|
||||
|
||||
Alt-Text is developed using an assortment of modern Python tools...
|
||||
Alt-Text is developed using an assortment of tools...
|
||||
|
||||
### Development Tools
|
||||
|
||||
|
@ -100,13 +225,18 @@ Alt-Text is developed using...
|
|||
- [EbookLib](https://pypi.org/project/EbookLib/)
|
||||
- [Replicate](https://pypi.org/project/replicate/)
|
||||
- [Google-Cloud-AIPlatform](https://pypi.org/project/google-cloud-aiplatform/)
|
||||
- [PyTorch](https://pypi.org/project/torch/)
|
||||
- [PyTesseract](https://pypi.org/project/pytesseract/)
|
||||
- [OpenAI Python API](https://pypi.org/project/openai/)
|
||||
|
||||
### APIs and Supplementary Tools
|
||||
|
||||
- [Replicate API](https://replicate.com/)
|
||||
- [Vertex AI API](https://cloud.google.com/vertex-ai)
|
||||
- [cobanov/image-captioning](https://github.com/cobanov/image-captioning)
|
||||
- [Tesseract](https://github.com/tesseract-ocr/tesseract)
|
||||
- [OpenAI API](https://openai.com/blog/openai-api)
|
||||
- [PrivateGPT](https://github.com/zylon-ai/private-gpt)
|
||||
|
||||
### Packaging/Distribution Tools
|
||||
|
||||
|
|
|
@ -5,18 +5,22 @@ import os
|
|||
from .descengine import DescEngine
|
||||
|
||||
REPLICATE_MODELS = {
|
||||
"blip-2": "andreasjansson/blip-2:f677695e5e89f8b236e52ecd1d3f01beb44c34606419bcc19345e046d8f786f9",
|
||||
"blip": "salesforce/blip:2e1dddc8621f72155f24cf2e0adbde548458d3cab9f00c0139eea840d0ac4746",
|
||||
"clip_prefix_caption": "rmokady/clip_prefix_caption:9a34a6339872a03f45236f114321fb51fc7aa8269d38ae0ce5334969981e4cd8",
|
||||
"clip-caption-reward": "j-min/clip-caption-reward:de37751f75135f7ebbe62548e27d6740d5155dfefdf6447db35c9865253d7e06",
|
||||
"llava-13b": "yorickvp/llava-13b:b5f6212d032508382d61ff00469ddda3e32fd8a0e75dc39d8a4191bb742157fb",
|
||||
"img2prompt": "methexis-inc/img2prompt:50adaf2d3ad20a6f911a8a9e3ccf777b263b8596fbd2c8fc26e8888f8a0edbb5",
|
||||
"clip_prefix_caption": "rmokady/clip_prefix_caption:9a34a6339872a03f45236f114321fb51fc7aa8269d38ae0ce5334969981e4cd8",
|
||||
"clip-interrogator": "pharmapsychotic/clip-interrogator:8151e1c9f47e696fa316146a2e35812ccf79cfc9eba05b11c7f450155102af70",
|
||||
"clip-caption-reward": "j-min/clip-caption-reward:de37751f75135f7ebbe62548e27d6740d5155dfefdf6447db35c9865253d7e06",
|
||||
"minigpt4": "daanelson/minigpt-4:b96a2f33cc8e4b0aa23eacfce731b9c41a7d9466d9ed4e167375587b54db9423",
|
||||
"image-captioning-with-visual-attention": "nohamoamary/image-captioning-with-visual-attention:9bb60a6baa58801aa7cd4c4fafc95fcf1531bf59b84962aff5a718f4d1f58986",
|
||||
}
|
||||
|
||||
|
||||
class ReplicateAPI(DescEngine):
|
||||
def __init__(self, key: str, model: str = "blip") -> None:
|
||||
def __init__(self, key: str, modelName: str = "blip") -> None:
|
||||
self.__setKey(key)
|
||||
self.__setModel(model)
|
||||
self.__setModel(modelName)
|
||||
return None
|
||||
|
||||
def __getModel(self) -> str:
|
||||
|
@ -42,10 +46,18 @@ class ReplicateAPI(DescEngine):
|
|||
base64_utf8_str = base64.b64encode(imgData).decode("utf-8")
|
||||
model = self.__getModel()
|
||||
ext = src.split(".")[-1]
|
||||
prompt = "Create alternative-text for this image."
|
||||
if context != None:
|
||||
prompt = f"Create alternative-text for this image given the following context...\n{context}"
|
||||
|
||||
dataurl = f"data:image/{ext};base64,{base64_utf8_str}"
|
||||
output = replicate.run(model, input={"image": dataurl, "prompt": prompt})
|
||||
return output
|
||||
|
||||
input = {"image": dataurl}
|
||||
if self.model == REPLICATE_MODELS["blip-2"]:
|
||||
input["caption"] = True
|
||||
input["question"] = ""
|
||||
if self.model == REPLICATE_MODELS["llava-13b"]:
|
||||
input["prompt"] = "What is this a picture of?"
|
||||
if self.model == REPLICATE_MODELS["minigpt4"]:
|
||||
input["prompt"] = "What is this a picture of?"
|
||||
|
||||
output = replicate.run(model, input=input)
|
||||
if self.model == REPLICATE_MODELS["llava-13b"]:
|
||||
return "".join(output)
|
||||
return output
|
||||
|
|
|
@ -10,9 +10,13 @@ import keys
|
|||
|
||||
sys.path.append("../")
|
||||
from src.alttext.alttext import AltTextHTML
|
||||
from src.alttext.descengine.descengine import DescEngine
|
||||
from src.alttext.descengine.replicateapi import ReplicateAPI
|
||||
from src.alttext.descengine.bliplocal import BlipLocal
|
||||
from src.alttext.descengine.googlevertexapi import GoogleVertexAPI
|
||||
from src.alttext.ocrengine.tesseract import Tesseract
|
||||
from src.alttext.langengine.openaiapi import OpenAIAPI
|
||||
from src.alttext.langengine.privategpt import PrivateGPT
|
||||
|
||||
|
||||
class AltTextGenerator(AltTextHTML):
|
||||
|
@ -29,10 +33,15 @@ class AltTextGenerator(AltTextHTML):
|
|||
|
||||
# Description generation timing
|
||||
# print("starting desc")
|
||||
genDesc_start_time = time.time()
|
||||
desc = self.genDesc(imgdata, src, context)
|
||||
genDesc_end_time = time.time()
|
||||
genDesc_total_time = genDesc_end_time - genDesc_start_time
|
||||
genDesc = None
|
||||
with open("./results/llava-13b.csv", mode="r") as csvfile:
|
||||
reader = csv.DictReader(csvfile)
|
||||
for row in reader:
|
||||
if row["book"] == book_id and row["image"] == src:
|
||||
genDesc = row["genDesc"]
|
||||
break
|
||||
if genDesc is None:
|
||||
raise Exception("Description not found in llava-13b.csv")
|
||||
|
||||
# OCR processing timing
|
||||
# print("starting ocr")
|
||||
|
@ -44,7 +53,11 @@ class AltTextGenerator(AltTextHTML):
|
|||
# Refinement processing timing
|
||||
# print("starting refinement")
|
||||
refine_start_time = time.time()
|
||||
refined_desc = self.langEngine.refineAlt(desc, chars, context, None)
|
||||
if context[0] is not None:
|
||||
context[0] = context[0][:1000]
|
||||
if context[1] is not None:
|
||||
context[1] = context[1][:1000]
|
||||
refined_desc = self.langEngine.refineAlt(genDesc, chars[:1000], context, None)
|
||||
refine_end_time = time.time()
|
||||
refine_total_time = refine_end_time - refine_start_time
|
||||
|
||||
|
@ -60,10 +73,7 @@ class AltTextGenerator(AltTextHTML):
|
|||
"status": status, # Set false if failed, set true if it worked
|
||||
"beforeContext": context[0],
|
||||
"afterContext": context[1],
|
||||
"genDesc": desc,
|
||||
"genDesc-Start": genDesc_start_time,
|
||||
"genDesc-End": genDesc_end_time,
|
||||
"genDesc-Time": genDesc_total_time,
|
||||
"genDesc": genDesc,
|
||||
"genOCR": chars,
|
||||
"genOCR-Start": ocr_start_time,
|
||||
"genOCR-End": ocr_end_time,
|
||||
|
@ -95,11 +105,14 @@ def benchmarkBooks(booksDir: str, srcsDir: str):
|
|||
generator = AltTextGenerator(
|
||||
ReplicateAPI(keys.ReplicateEricKey()),
|
||||
Tesseract(),
|
||||
OpenAIAPI(keys.OpenAIKey(), "gpt-3.5-turbo"),
|
||||
# OpenAIAPI(keys.OpenAIKey(), "gpt-4-0125-preview"),
|
||||
PrivateGPT("http://127.0.0.1:8001"),
|
||||
)
|
||||
|
||||
records = []
|
||||
for bookId in os.listdir(booksDir):
|
||||
for bookId in os.listdir(srcsDir):
|
||||
bookId = bookId.split("_")[1].split(".")[0]
|
||||
time.sleep(1)
|
||||
try:
|
||||
bookPath = os.path.join(booksDir, bookId)
|
||||
|
||||
|
@ -120,13 +133,77 @@ def benchmarkBooks(booksDir: str, srcsDir: str):
|
|||
record = generator.genAltTextV2(src, bookId, src, bookPath)
|
||||
records.append(record)
|
||||
except Exception as e:
|
||||
print(f"Error processing image {src} in book {bookId}: {e}")
|
||||
print(f"ERROR processing image {bookId} | {src}: {e}")
|
||||
except Exception as e:
|
||||
print(f"Error processing book {bookId}: {e}")
|
||||
print(f"ERROR processing book {bookId}: {e}")
|
||||
|
||||
generateCSV("test_benchmark.csv", records)
|
||||
generateCSV("private-gpt.csv", records)
|
||||
|
||||
|
||||
def benchmarkDescEngine(
|
||||
descEngine: DescEngine, booksDir: str, srcsDir: str, outputFilename: str
|
||||
):
|
||||
generator = AltTextHTML(descEngine)
|
||||
|
||||
records = []
|
||||
for bookId in os.listdir(srcsDir):
|
||||
bookId = bookId.split("_")[1].split(".")[0]
|
||||
try:
|
||||
print("STARTING BOOK ID: ", bookId)
|
||||
bookPath = os.path.join(booksDir, bookId)
|
||||
|
||||
htmlpath = None
|
||||
for object in os.listdir(bookPath):
|
||||
if object.endswith(".html"):
|
||||
htmlpath = os.path.join(bookPath, object)
|
||||
break
|
||||
generator.parseFile(htmlpath)
|
||||
|
||||
srcs = []
|
||||
with open(f"{srcsDir}/ebook_{bookId}.txt", "r") as file:
|
||||
for line in file:
|
||||
srcs.append(line.split(f"{bookId}/")[1].strip())
|
||||
|
||||
for src in srcs:
|
||||
time.sleep(8)
|
||||
try:
|
||||
print("STARTING IMAGE: ", src)
|
||||
context = generator.getContext(generator.getImg(src))
|
||||
genDesc_start_time = time.time()
|
||||
desc = generator.genDesc(generator.getImgData(src), src, context)
|
||||
print(f"TEST: {desc}")
|
||||
genDesc_end_time = time.time()
|
||||
genDesc_total_time = genDesc_end_time - genDesc_start_time
|
||||
record = {
|
||||
"book": bookId,
|
||||
"image": src,
|
||||
"path": bookPath,
|
||||
# "beforeContext": context[0],
|
||||
# "afterContext": context[1],
|
||||
"genDesc": desc.replace('"', "'"),
|
||||
"genDesc-Start": genDesc_start_time,
|
||||
"genDesc-End": genDesc_end_time,
|
||||
"genDesc-Time": genDesc_total_time,
|
||||
}
|
||||
records.append(record)
|
||||
except Exception as e:
|
||||
print(f"ERROR processing image {bookId} | {src}: {e}")
|
||||
except Exception as e:
|
||||
print(f"ERROR processing book {bookId}: {e}")
|
||||
|
||||
generateCSV(outputFilename, records)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
print("RUNNING AUTOMATE.PY")
|
||||
benchmarkBooks("./downloaded_books", "./book_outputs")
|
||||
# benchmarkDescEngine(
|
||||
# ReplicateAPI(
|
||||
# keys.ReplicateEricKey(), modelName="image-captioning-with-visual-attention"
|
||||
# ),
|
||||
# BlipLocal("C:/Users/dacru/Desktop/ALT/image-captioning"),
|
||||
# GoogleVertexAPI(keys.VertexProject(), keys.VertexRegion(), keys.VertexGAC()),
|
||||
# "./downloaded_books",
|
||||
# "./book_outputs2",
|
||||
# "vertexai.csv",
|
||||
# )
|
||||
|
|
|
@ -0,0 +1,83 @@
|
|||
import random
|
||||
import requests
|
||||
import bs4
|
||||
import time
|
||||
import os
|
||||
|
||||
|
||||
def extractImage(imgs: list[bs4.element.Tag]) -> "bs4.element.Tag | None":
|
||||
if len(imgs) == 0:
|
||||
return None
|
||||
index = random.randint(0, len(imgs) - 1)
|
||||
img = imgs[index]
|
||||
if img.has_attr("alt") and img.attrs["alt"].strip() != "":
|
||||
return img
|
||||
return extractImage(imgs[:index] + imgs[index + 1 :])
|
||||
|
||||
|
||||
def collect(
|
||||
num: int, image_output: str = "images.txt", alt_output: str = "alts.txt"
|
||||
) -> bool:
|
||||
"""
|
||||
Collect images with alt-text from random ebooks
|
||||
|
||||
Args:
|
||||
num (int): Number of images to collect.
|
||||
image_output (str, optional): Path to output image URLs. Defaults to "images.txt".
|
||||
alt_output (str, optional): Path to output alt-text. Defaults to "alts.txt".
|
||||
"""
|
||||
count = 0
|
||||
while count < num:
|
||||
time.sleep(0.5)
|
||||
bookid = random.randint(1, 70000)
|
||||
bookurl = f"https://gutenberg.org/cache/epub/{bookid}/pg{bookid}-images.html"
|
||||
|
||||
response = requests.get(bookurl)
|
||||
if response.status_code != 200:
|
||||
print(f"Failed to fetch book {bookid}.")
|
||||
continue
|
||||
|
||||
soup = bs4.BeautifulSoup(response.text, "html.parser")
|
||||
div = soup.find("div", id="pg-machine-header")
|
||||
if not div:
|
||||
print(f"No 'pg-machine-header' found in book {bookid}.")
|
||||
continue
|
||||
|
||||
languageP = div.find_all(recursive=False)[3]
|
||||
if languageP.text.strip() != "Language: English":
|
||||
print(f"Book {bookid} is not in English.")
|
||||
continue
|
||||
|
||||
imgs: list[bs4.element.Tag] = soup.find_all("img")
|
||||
img = extractImage(imgs)
|
||||
if img is None:
|
||||
print(
|
||||
f"Out of {len(imgs)} images, no images with alt-text found in book {bookid}."
|
||||
)
|
||||
continue
|
||||
|
||||
with open(image_output, "a") as imagefile:
|
||||
imagefile.write(f"{bookid} cache/epub/{bookid}/{img['src']}\n")
|
||||
with open(alt_output, "a") as altfile:
|
||||
altfile.write(f"{img['alt'].encode('ascii', 'ignore').decode()}\n")
|
||||
|
||||
count += 1
|
||||
|
||||
return True
|
||||
|
||||
|
||||
def split(input_file, book_output, image_output):
|
||||
with open(input_file, "r") as file:
|
||||
for line in file:
|
||||
book_number = line.split()[0] # Extracting book number
|
||||
image = line.split()[1] # Extracting image
|
||||
|
||||
with open(book_output, "a") as output_file:
|
||||
output_file.write(f"{book_number}\n")
|
||||
with open(image_output, "a") as output_file:
|
||||
output_file.write(f"{image}\n")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
# collect(150)
|
||||
split("images.txt", "books.txt", "images2.txt")
|
|
@ -10,7 +10,7 @@ download_folder = "downloaded_books/download_files"
|
|||
extraction_folder = "downloaded_books"
|
||||
|
||||
|
||||
def download_and_unzip_books(folder_path, download_folder, extraction_folder):
|
||||
def downloadAndUnzipBooks(folder_path, download_folder, extraction_folder):
|
||||
base_url = "https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}-h.zip"
|
||||
|
||||
# Ensure the download and extraction folders exist
|
||||
|
@ -68,4 +68,5 @@ def download_and_unzip_books(folder_path, download_folder, extraction_folder):
|
|||
print(f"No book ID found in {filename}")
|
||||
|
||||
|
||||
download_and_unzip_books(folder_path, download_folder, extraction_folder)
|
||||
if __name__ == "__main__":
|
||||
downloadAndUnzipBooks(folder_path, download_folder, extraction_folder)
|
||||
|
|
|
@ -4,31 +4,23 @@
|
|||
|
||||
import os
|
||||
|
||||
input_file = "./empty_alt_text_sample.txt" # The file path of whatever initial .txt you are working with
|
||||
input_file = "./images.txt"
|
||||
output_folder = "./book_outputs"
|
||||
|
||||
|
||||
def createIndividualBookFiles(input_file, output_folder):
|
||||
# Ensure the output folder exists
|
||||
def splitSampleByBook(input_file, output_folder):
|
||||
if not os.path.exists(output_folder):
|
||||
os.makedirs(output_folder)
|
||||
|
||||
# Keep track of the last book number processed
|
||||
last_book_number = None
|
||||
|
||||
with open(input_file, "r") as file:
|
||||
for line in file:
|
||||
book_number = line.split()[0] # Extracting book number
|
||||
# Check if this line is for a new book
|
||||
if book_number != last_book_number:
|
||||
output_file_name = f"ebook_{book_number}.txt"
|
||||
output_path = os.path.join(output_folder, output_file_name)
|
||||
# print(f"Creating/Updating file for book {book_number}")
|
||||
last_book_number = book_number
|
||||
output_file_name = f"ebook_{book_number}.txt"
|
||||
output_path = os.path.join(output_folder, output_file_name)
|
||||
|
||||
# Append to the file (creates a new file if it doesn't exist)
|
||||
with open(output_path, "a") as output_file:
|
||||
output_file.write(line)
|
||||
|
||||
|
||||
createIndividualBookFiles(input_file, output_folder)
|
||||
if __name__ == "__main__":
|
||||
splitSampleByBook(input_file, output_folder)
|
||||
|
|
|
@ -10,6 +10,7 @@ from src.alttext.langengine.openaiapi import OpenAIAPI
|
|||
import keys
|
||||
|
||||
# HTML BOOK FILEPATHS
|
||||
HTML_ADVENTURES = "../books/pg76-h/pg76-images.html"
|
||||
HTML_BIRD = "../books/pg30221-h/pg30221-images.html"
|
||||
HTML_HUNTING = "../books/pg37122-h/pg37122-images.html"
|
||||
HTML_MECHANIC = "../books/pg71856-h/pg71856-images.html"
|
||||
|
@ -33,11 +34,20 @@ def testHTML():
|
|||
OpenAIAPI(keys.OpenAIKey(), "gpt-3.5-turbo"),
|
||||
)
|
||||
|
||||
alt.parseFile(HTML_HUNTING)
|
||||
imgs = alt.getAllImgs()
|
||||
src = imgs[7].attrs["src"]
|
||||
print(src)
|
||||
print(alt.genAltText(src))
|
||||
# imgs = alt.getAllImgs()
|
||||
|
||||
alt.parseFile(HTML_ADVENTURES)
|
||||
img = alt.getImg("images/c01-21.jpg")
|
||||
src = img.attrs["src"]
|
||||
imgData = alt.getImgData(src)
|
||||
chars = alt.genChars(imgData, src)
|
||||
desc = alt.genDesc(imgData, src, alt.getContext(img))
|
||||
altText = alt.genAltText(src)
|
||||
print(chars)
|
||||
print("=====================================")
|
||||
print(desc)
|
||||
print("=====================================")
|
||||
print(altText)
|
||||
|
||||
# desc = alt.genDesc(alt.getImgData(src), src)
|
||||
# print(desc)
|
||||
|
|